Data Engineer
- Built a centralized data lake on AWS using S3, EMR, Redshift, and Athena to support data ingestion and processing
- Created data pipelines for different event types to load data from DynamoDB into an AWS S3 bucket and then into HDFS, using PySpark for efficient data processing
- Developed PySpark jobs to pull data from third-party and native data sources, apply the required transformations and efficient compression, and write the results to AWS S3 for analytics and reporting
- Loaded data into S3 buckets using AWS Glue and PySpark, filtered data stored in S3 buckets using Elasticsearch, and loaded data into Hive external tables
- Ran Hadoop/Spark jobs on AWS EMR to transform and move large volumes of data into and out of other AWS data stores and databases, such as Amazon S3 and DynamoDB
- Designed and developed a security framework that provides fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB