Data Engineer
Current
- Architected scalable data pipelines using Airflow, Azure Data Factory, and AWS EMR for centralized data storage in Postgres, S3, and Apache Druid.
- Deployed open-source data infrastructure tools such as Druid and Superset in containers on AWS ECS, reducing data infrastructure costs by $6,000/month.
- Improved query performance by 30% and reduced stored function run time by up to 70% through fact/dimension modeling and query optimization.
- Redesigned the data warehouse architecture for efficiency and ease of use, improving performance and scalability and streamlining analytics.
- Migrated SQL-based code to PySpark on Apache Spark, reducing data processing time by 92% and delivering considerable cost savings.
- Collaborated with other engineers to deploy a data lakehouse (Delta Lake) using S3, AWS EMR, and PySpark, improving data availability and organization.