Responsibilities:
• Design, develop, and maintain scalable data pipelines and systems for data processing.
• Utilize Hadoop and related technologies to manage large-scale data processing.
• Perform data ingestion using Kafka, Spark, and Sqoop across various file formats, and process data into Hive using Beeline/Spark (illustrated in the sketch after this list).
• Develop and maintain shell scripts for automation of data processing tasks.
• Implement full and incremental data loading strategies to ensure data consistency and availability.
• Orchestrate and monitor workflows using Apache Airflow.
• Ensure code quality and version control using Git.
• Troubleshoot and resolve data-related issues in a timely manner.
• Stay up to date with the latest industry trends and technologies to continuously improve our data infrastructure.
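
For illustration only, a minimal PySpark sketch of the ingestion flow described above: reading a Kafka topic with Structured Streaming and landing it under a path backing a Hive table. Broker address, topic, and checkpoint/warehouse paths are placeholders, and the spark-sql-kafka connector is assumed to be available:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder session; enableHiveSupport() lets Spark use the Hive metastore.
spark = (
    SparkSession.builder
    .appName("kafka_to_hive_sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw events from a Kafka topic (broker and topic names are assumptions).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()
    .select(
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("event_ts"),
    )
)

# Micro-batch the stream into a Parquet location backing a Hive external table.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/warehouse/orders_raw")
    .option("checkpointLocation", "/checkpoints/orders_raw")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```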
Requirements:
• Proven experience as a Data Engineer (ETL, data warehousing).
• Strong knowledge of Hadoop and its ecosystem (HDFS, YARN, MapReduce, Tez, and Spark).
• Proficiency in Kafka, Spark, Sqoop, and Hive.
• Experience with shell scripting for automation.
• Expertise in full and incremental data loading techniques (see the sketch after this list).
• Excellent problem-solving skills and attention to detail.
• Ability to work collaboratively in a team environment and communicate effectively with stakeholders.
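
As a rough illustration of the full vs. incremental loading mentioned above, the sketch below compares a high-water mark in an assumed Hive table (warehouse.orders) against newly landed data and appends only the delta; all table, column, and path names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("incremental_load_sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Newly landed data (e.g. files written by the Kafka/Sqoop ingestion jobs).
source_df = spark.read.parquet("/data/landing/orders")

# High-water mark of what is already present in the Hive target table.
max_loaded = (
    spark.table("warehouse.orders")
    .agg(F.max("updated_at").alias("max_ts"))
    .collect()[0]["max_ts"]
)

# Full load when the target is empty, otherwise only rows newer than the mark.
delta_df = (
    source_df
    if max_loaded is None
    else source_df.filter(F.col("updated_at") > F.lit(max_loaded))
)

# Append the delta into the existing Hive table (schemas must already match).
delta_df.write.mode("append").insertInto("warehouse.orders")
```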
Good to have:
• Understanding of PySpark and its application in real-time data processing.
• Hands-on experience with Apache Airflow for workflow orchestration (see the sketch at the end of this list).
• Proficiency with Git for version control.
• Experience with PostgreSQL, SQL Server, or MSBI.
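
A minimal Airflow sketch of the kind of orchestration referred to above (assuming Airflow 2.4 or newer): two chained tasks, ingest then load. The DAG id, schedule, and script paths are placeholders, not part of the role description:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Daily pipeline: ingest with a shell script, then load into Hive via spark-submit.
# Script paths and the schedule are assumptions for illustration only.
with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_orders",
        # Trailing space keeps Jinja from treating the .sh path as a template file.
        bash_command="bash /opt/pipelines/ingest_orders.sh ",
    )

    load_to_hive = BashOperator(
        task_id="load_orders_to_hive",
        bash_command="spark-submit /opt/pipelines/load_orders.py",
    )

    ingest >> load_to_hive  # run the load only after ingestion succeeds
```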