Responsibilities:
• Design, develop, and maintain scalable data pipelines and systems for data processing.
• Utilize Hadoop and related technologies to manage large-scale data processing.
• Perform data ingestion using Kafka, Spark, Sqoop, and various file formats, and process the data into Hive using Beeline/Spark.
• Develop and maintain shell scripts for automation of data processing tasks.
• Implement full and incremental data loading strategies to ensure data consistency and availability.
• Orchestrate and monitor workflows using Apache Airflow (a minimal DAG sketch follows this list).
• Ensure code quality and version control using Git.
• Troubleshoot and resolve data-related issues in a timely manner.
• Stay up to date with the latest industry trends and technologies to continuously improve our data infrastructure.
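
For illustration only, a minimal Airflow 2.4+ style DAG that could orchestrate the daily ingestion and incremental-load steps described above; the dag_id, the spark-submit script paths (/opt/jobs/ingest.py, /opt/jobs/incremental_load.py), and the schedule are hypothetical placeholders, not this team's actual pipeline.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_incremental_load",   # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # Ingest raw data for the run date ({{ ds }} is Airflow's templated
        # execution date passed to the hypothetical job script).
        ingest = BashOperator(
            task_id="ingest_raw_data",
            bash_command="spark-submit /opt/jobs/ingest.py --run-date {{ ds }}",
        )
        # Apply the incremental load into Hive once ingestion has finished.
        load = BashOperator(
            task_id="incremental_load_to_hive",
            bash_command="spark-submit /opt/jobs/incremental_load.py --run-date {{ ds }}",
        )
        ingest >> load

Airflow's scheduler and UI then provide the workflow monitoring and retry behaviour referred to above.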
Requirements:
• Proficiency in SQL and Hive (Essential).
• Proven experience as a Data Engineer (ETL, data warehousing) (Essential).
• Strong knowledge of Hadoop and its ecosystem (HDFS, YARN, MapReduce, Tez, and Spark).
• Expertise in full and incremental data loading techniques (illustrated in the sketch after this list).
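
As a rough illustration of the full vs. incremental loading distinction with Spark and Hive, the sketch below uses a high-water-mark timestamp; the table and column names (staging.orders_raw, dw.orders, updated_at) are assumed purely for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("incremental_orders_load")   # hypothetical job name
        .enableHiveSupport()
        .getOrCreate()
    )

    # High-water mark: the latest timestamp already present in the Hive target.
    max_ts = spark.sql("SELECT MAX(updated_at) AS ts FROM dw.orders").collect()[0]["ts"]

    source = spark.table("staging.orders_raw")
    if max_ts is None:
        # Target is empty: fall back to a full load.
        delta = source
    else:
        # Incremental load: only rows newer than the high-water mark.
        delta = source.where(F.col("updated_at") > F.lit(max_ts))

    # Append the delta into the Hive table (positional column matching).
    delta.write.mode("append").insertInto("dw.orders")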
Good to have:
• Understanding of PySpark and its application in real-time data processing (sketched after this list).
• Hands-on experience with Apache Airflow for workflow orchestration.
• Proficiency with Git for version control.
• Experience with PostgreSQL, SQL Server, or MSBI.
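
For context on the PySpark item above, a minimal Structured Streaming sketch that reads a Kafka topic; it assumes the spark-sql-kafka connector is on the classpath, and the broker address, topic name, and checkpoint path are placeholders (a real job would write to Hive/HDFS rather than the console).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka_events_stream").getOrCreate()

    # Read the Kafka topic as a streaming DataFrame and keep the message payload.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "events")                       # placeholder topic
        .load()
        .select(F.col("value").cast("string").alias("payload"))
    )

    # Console sink for illustration; checkpointing makes the stream restartable.
    query = (
        events.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/events")  # illustrative path
        .start()
    )
    query.awaitTermination()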