Caleb D.
Big Data and Machine Learning cloud engineer (Hadoop, Spark) and IT professional with client-facing experience and technical skills across Data Lakes, Data Warehouses, and Machine Learning Ops. Recently I have been developing my skills and interest in Decentralized Finance and Machine Learning Ops. Who knows what the future holds.
-
Sr. Manager, Analytics Engineer | Pfizer | Feb 2024 - Present | New York, New York, US
• Translating business requirements from business teams into data engineering specifications and pipelines for international Data Analytics and Data Science across a multitude of pharmaceutical solutions
• Building scalable data sets with Python and SQL on the Snowflake Data Warehouse from raw data and deriving business metrics to specification (a sketch follows this entry)
• Identifying the server APIs that need to be instrumented for data analytics and aligning their events with established data pipelines
• Engineering data pipelines on Snowflake for the Rare Disease and Cardiovascular products' live dashboard in Tableau, reducing load times from more than 30 minutes to seconds
• Developing deployment processes for Snowpark and SQL scripts with Airflow
• Exploring and understanding sophisticated data sets, identifying and formulating correlational rules between heterogeneous sources for effective analytics and reporting
• Architecting the data engineering environment for US Commercial Analytics, supporting insights for Oncology, Cardiovascular, and Vaccines
• Created an automated CI/CD code-promotion GitHub workflow, decreasing the release process from 7+ days to ~20 minutes
• Planned and executed the international rollout of the commercial analytics engineering package Platinum, allowing internal and worldwide teams to use our code products as a template
• Processing, cleaning, and validating the integrity of data used for analysis
• Developed one of the first sets of engineered business rules in QA with a producer-consumer Contract Validation Framework for the US commercial analytics team
• Developing Python and shell scripts for data ingestion and stitching from external data sources for business insights
• Configuring repositories, branches, and rules for the DevOps CI/CD process
• Working with analytics to build automatically refreshed datasets that recalculate insights on fresh data for different aggregations and cadences
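As a rough illustration of the Snowflake/Snowpark dataset building described above, the sketch below aggregates a raw table into a derived metrics table. The connection parameters, table names, and column names are placeholders, not Pfizer specifics.

```python
# Minimal Snowpark sketch: derive a business-metric table from raw data.
# All names (RAW_SALES, ANALYTICS.PRODUCT_METRICS, columns) are hypothetical.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count, sum as sum_

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "RAW",
}
session = Session.builder.configs(connection_parameters).create()

raw = session.table("RAW_SALES")
metrics = (
    raw.filter(col("STATUS") == "COMPLETE")
       .group_by("PRODUCT", "REGION")
       .agg(sum_(col("AMOUNT")).alias("TOTAL_AMOUNT"),
            count(col("ORDER_ID")).alias("ORDER_COUNT"))
)
# Persist the derived metrics for Tableau / downstream analytics.
metrics.write.save_as_table("ANALYTICS.PRODUCT_METRICS", mode="overwrite")
```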
Senior Data Engineer | Pfizer | Oct 2023 - Present | New York, New York, US
• Working with a new commercial analytics team to develop data warehouse and data lake processes supporting reporting, analytics, and machine learning
• Translating business requirements from business teams into data and engineering specifications
• Building scalable data sets from the available raw data based on specifications and deriving business metrics/insights
• Cooperating with worldwide Pfizer teams to produce processes and tools for commercial analytics teams, learning Snowpark and Snowflake deployment and CI/CD methods
• Structuring the DevOps process for Snowflake SQL and Snowpark applications with Git, GitHub, and GitHub workflows and actions
• Configuring new repositories for the CI/CD process and a code development process for version control
• Developing training on Git, GitHub, and Apache Airflow code best practices for integrating ETL code into CI/CD
• Constructing Airflow DAGs to orchestrate Snowflake SQL scripts and Snowpark applications (see the sketch after this entry)
• Exploring and understanding sophisticated data sets, identifying and formulating correlational rules between heterogeneous sources for effective analytics and reporting
• Processing, cleaning, and validating the integrity of data used for analysis
• Developing Python and shell scripts for data ingestion and stitching from external data sources for business insights
• Parameterizing sensitive data values per security protocols and Pfizer data governance
• Creating an end-to-end POC for the new DevOps, CI/CD, and ETL process supporting Dataiku, Tableau, PowerBI, and a machine learning consumption layer
• Partnering with data scientists to build a scalable modeling-features pipeline
• Collaborating closely with analytics and data science teams to develop robust data pipelines and analytics visualizations
• Assisting in planning future data warehouse and data lake cleanup and maintaining an accurate environment for repeatable use of various Pfizer data products
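A minimal sketch of the kind of Airflow DAG mentioned above, orchestrating a Snowflake SQL script followed by a Snowpark application. The connection id, schedule, SQL file, and the imported Snowpark package are hypothetical; it assumes the Airflow Snowflake provider is installed.

```python
# Hypothetical Airflow DAG orchestrating a Snowflake SQL script and a Snowpark task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def run_snowpark_job():
    """Placeholder entry point for a Snowpark application."""
    from my_snowpark_app import build_metrics  # hypothetical package
    build_metrics()


with DAG(
    dag_id="commercial_analytics_refresh",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_raw = SnowflakeOperator(
        task_id="load_raw_tables",
        snowflake_conn_id="snowflake_default",
        sql="sql/load_raw_tables.sql",  # SQL script kept in the repo
    )
    build_metrics = PythonOperator(
        task_id="run_snowpark_app",
        python_callable=run_snowpark_job,
    )
    load_raw >> build_metrics  # SQL load runs before the Snowpark application
```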
Sr. Data Engineer | Duke Energy Corporation | Jul 2022 - Jul 2023 | Charlotte, North Carolina, US
• Migrated an on-premises Hadoop Data Lake and energy-grid IoT device data sources to a downstream cloud data lake and data warehouse for machine learning and analytics teams' SageMaker development
• Used the AWS Glue Data Catalog as a metadata store for EMR and managed metadata tables
• Deployed AWS infrastructure from various Bitbucket repositories and Terraform workspaces, promoting code via pull requests
• Collaborated with the security team to create compliant quality, development, and production environments using KMS for encrypting data and IAM roles for AWS permissions
• Used on-premises Bash scripts and hdfs distcp copy commands to transfer data from Hadoop to AWS S3, and Kafka for change data capture
• Created a Boto3 utility tool to authenticate Python jobs interacting with the AWS API and check for valid HTTP success or failure response codes
• Produced Parquet tables with new partitions by converting ORC tables ranging from ~100 GB to ~2 TB to Parquet files using PySpark scripts on AWS EMR (a sketch follows this entry)
• Found vulnerabilities and improved cloud infrastructure on AWS with S3 versioning and EMR Hadoop configurations, redeploying with Terraform
• Used Git Bash, Visual Studio Code, and Bitbucket to develop the Python, shell-script, and HCL Terraform code bases and repositories
• Set up the AWS machine learning platform SageMaker Studio to work with EMR clusters, preparing the environment and data for machine learning teams
• Optimized Spark data pipelines that originally took 3+ hours to process 200 GB of data down to about 40 minutes through careful memory and partition engineering
• Automated PySpark EMR jobs with AWS Step Functions and Lambda functions that created or refreshed large datasets and dataframes
• Refactored monolithic Terraform code bases into easy-to-use, reusable, immutable Terraform modules in the Terraform registry
• Cooperated with the team on legacy Oracle databases, MS SQL databases, DynamoDB, RDS, and Kafka data sources and targets
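A rough PySpark sketch of the ORC-to-Parquet conversion described above. The S3 paths and the partition column are placeholders, not Duke Energy specifics.

```python
# Convert a legacy ORC table to partitioned Parquet on EMR.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("orc_to_parquet").getOrCreate()

# Read the legacy ORC table from the migrated Hadoop/S3 location.
df = spark.read.orc("s3://example-data-lake/raw/meter_readings_orc/")

# Derive a partition column and write partitioned Parquet for downstream EMR/Athena use.
(df.withColumn("reading_date", to_date("reading_ts"))
   .repartition("reading_date")          # group each partition value's rows together
   .write.mode("overwrite")
   .partitionBy("reading_date")
   .parquet("s3://example-data-lake/curated/meter_readings_parquet/"))
```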
Senior Data Engineer | Mark43 | Aug 2021 - May 2022 | New York, NY, US
• Provisioned and managed AWS and Azure cloud data lakes for first responders and global entities
• Developed Python and SQL scripts to manage data lake access for new and existing users (a sketch follows this entry)
• Managed Terraform scripts to whitelist and blacklist new users' IP addresses for the data lakes
• Repaired and optimized Qlik analytics visualization backend SQL queries by ~300% with indexing
• Architected complex, highly available, optimized batch and real-time data pipelines
• Worked with analytics product engineers to ensure the performance, stability, and availability of our MS SQL Server analytics databases
• Collaborated with DevOps engineers to improve our AWS/Azure data infrastructure
• Assisted in data warehouse design and implementation for reconstructing data lakes
• Worked on a startup-environment data engineering team using Agile and Scrum methodologies
• Built powerful, scalable data lakes serving software that sets a new standard for the tools first responders rely on
• Used Stimzy on Apache Kafka to maintain and implement new data pipelines feeding new MySQL and MS SQL Server databases
• Populated and created new data lakes, using AWS DMS for data migration and change-data-capture incremental loads
• Built a self-service internal tool connecting to AWS RMS to manage data lake users
• Established POCs with Kubernetes and Docker containers to support the CAD and RMS applications
• Used API commands to manage keys with Terraform and HashiCorp Vault to secure the work environment
• Maintained pub/sub streaming data pipelines with Kafka Connect, monitored with PagerDuty
• Used a Kafka UI to monitor Kafka pipelines and monitored AWS DMS tasks with AWS CloudWatch
• Set up IntelliJ IDEA and terminal shell scripts to develop SQL scripts managing database views and tables
• Used GitHub and Git for version control and CI/CD best practices to iterate on the company codebase
• Applied Python scripts to manage MySQL databases and enforce data governance across data lakes
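A hedged sketch (not Mark43's actual tooling) of a Python helper that grants or revokes read access on a MySQL-backed analytics database, illustrating the access-management scripts mentioned above. Host, credentials, schema name, and user names are placeholders.

```python
# Grant or revoke read access for a data lake user on a MySQL-backed database.
import pymysql

ANALYTICS_SCHEMA = "analytics"  # placeholder schema name

def set_read_access(username: str, grant: bool) -> None:
    conn = pymysql.connect(host="data-lake-db.example.internal",
                           user="admin", password="********")
    try:
        with conn.cursor() as cur:
            # username is assumed to come from a vetted internal user list
            if grant:
                cur.execute(f"GRANT SELECT ON {ANALYTICS_SCHEMA}.* TO '{username}'@'%'")
            else:
                cur.execute(f"REVOKE SELECT ON {ANALYTICS_SCHEMA}.* FROM '{username}'@'%'")
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    set_read_access("new_analyst", grant=True)
```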
Big Data/Machine Learning Ops Engineer | Anthem, Inc. | Apr 2021 - Aug 2021 | Indianapolis, Indiana, US
• Packaged, refactored, and managed machine learning models for business IT operations
• Developed Terraform IaC for AWS EMR PySpark and Scala jobs
• Took responsibility for Hadoop development and implementation, including loading from disparate data sets and preprocessing with Hive and Pig
• Optimized ML Spark batch jobs by ~5 hours using feature extraction and filtering (a sketch follows this entry)
• Operationalized code for a machine learning model automation tool and PySpark jobs
• Performed extract runs for data scientist experiments
• Used SQL and HiveQL to optimize machine learning pipelines and for feature engineering
• Migrated on-prem data sources to AWS for development of the AWS SageMaker Feature Store
• Architected and designed an AWS SageMaker machine learning pipeline for MLOps development
• Delivered various Big Data solutions, designing solutions independently from high-level architecture
• Managed technical communication between the survey vendor and internal systems; maintained the production systems (Kafka, Hadoop, Cassandra, Elasticsearch)
• Created packaged production-level code to run and tune ETL jobs with Hadoop, Hue, and the Spark UI
• Collaborated with other development and research teams building a cloud-based platform that allows easy development of new applications
• Built Scala JARs with IntelliJ for EMR jobs targeting S3 and AWS DocumentDB
• Used AWS Glue for building Data Lake and Data Warehouse ETL jobs
• Used Anaconda and conda for packaging and maintaining dependencies for machine learning models within the Cloudera platform
• Developed bash/shell scripts for running models within the Hadoop environment for ML deployment
• Created documentation for delivery of IT packages using Confluence and Jira with an Agile workflow
• Debugged and refactored Python and Scala code issues during the move to production
• Deployed SageMaker machine learning pipelines in AWS with Terraform
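A hedged sketch of the kind of feature extraction and early filtering that can shorten a Spark ML batch job, as referenced above. The table, column names, and thresholds are illustrative only.

```python
# Prune columns and filter rows early so less data is shuffled downstream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("ml_feature_extract")
         .enableHiveSupport()
         .getOrCreate())

claims = spark.table("warehouse.claims_raw")  # hypothetical Hive table

features = (
    claims
    .select("member_id", "claim_amount", "claim_date", "diagnosis_code")
    .filter(F.col("claim_date") >= "2021-01-01")
    .groupBy("member_id")
    .agg(F.sum("claim_amount").alias("total_claims"),
         F.countDistinct("diagnosis_code").alias("distinct_diagnoses"))
)

# Persist the feature set for the downstream model-training job.
features.write.mode("overwrite").saveAsTable("ml_features.member_claim_features")
```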
Big Data Cloud Engineer | AAA Life Insurance Company | May 2020 - Mar 2021 | Livonia, MI, US
• Architected and built a new AWS Organizations cloud environment with PCI, PHI, and PII compliance
• Created a new data lake ingesting data from on-prem and other clouds into S3, Redshift, and RDS
• Used Terraform Enterprise and GitLab to deploy IaC to various AWS accounts
• Integrated big data Spark jobs with EMR and Glue to create ETL jobs for around 450 GB of data daily
• Optimized EMR clusters with partitioning and Parquet format, increasing speed and efficiency by 200-500%
• Created a new Redshift cluster for data science, using QuickSight for reporting and mobile visualization
• Created training material and assisted others with AWS boto3 and Lambda
• Developed a new API Gateway for streaming to Kinesis and ingesting event-streaming data
• Implemented AWS Step Functions for orchestration and CloudWatch Events for pipeline automation (a sketch follows this entry)
• Used the Hive Glue Data Catalog to obtain and validate data schemas for data governance
• Maintained a GitLab repository using Git, Bash, and Ubuntu for project code
• Created metadata tables for Redshift Spectrum and Amazon Athena for serverless ad hoc querying
• Tuned EMR clusters for big data across gzip and Parquet data formats and compression types
• Created the S3 bucket structure and data lake layout for optimal use of Glue crawlers and S3 buckets
• Used Terraform to create various Lambda functions for orchestration and automation
• Extracted data from Redshift clusters with a SQL CLI tool to identify issues with long-running data science queries
• Used Spark and PySpark for streaming and batch applications across many ETL jobs to and from data sources
• Built machine learning pipelines for SageMaker for data science and machine learning engineers
• Developed PySpark code optimized for RDDs, DataFrames, and internal data structures
• Coded Step Functions for Kinesis and Kinesis Firehose, using DynamoDB as a metadata store
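A hypothetical Lambda handler kicking off a Step Functions pipeline on a CloudWatch Events schedule, illustrating the orchestration pattern above. The state machine ARN, execution name, and input payload are placeholders.

```python
import json
import os
from datetime import datetime, timezone

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ.get(
    "STATE_MACHINE_ARN",
    "arn:aws:states:us-east-1:123456789012:stateMachine:nightly-etl")

def handler(event, context):
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    response = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=f"nightly-etl-{run_date}",            # execution names must be unique
        input=json.dumps({"run_date": run_date}),  # passed to the first state
    )
    return {"executionArn": response["executionArn"]}
```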
Senior Big Data Engineer | Robinhood | Jul 2018 - May 2020 | Menlo Park, California, US
• Configured Spark Streaming to receive real-time data from Kafka and store it to HDFS (a sketch follows this entry)
• Used Spark Streaming with Kafka and MongoDB to build a continuous ETL pipeline for real-time analytics
• Managed ETL jobs with UDFs in Pig scripts alongside Spark for transformations, joins, and aggregations before landing data in HDFS
• Performed performance tuning for Spark Streaming: setting the right batch interval, the correct level of parallelism, the right serialization, and memory tuning
• Ingested data using Flume with Kafka as the source and HDFS as the sink
• Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters
• Optimized data storage in Hive using partitioning and bucketing on managed and external tables
• Collected, aggregated, and moved data from servers to HDFS using Apache Spark and Spark Streaming
• Used the Spark API over Hadoop YARN to perform analytics on data in Hive
• Worked in a multi-clustered environment, setting up the Cloudera and Hortonworks Hadoop ecosystems
• Developed Scala scripts on Spark to inspect, clean, load, and transform large sets of JSON data into Parquet format
• Developed Spark code using Scala and Spark SQL/Streaming for faster data processing
• Prepared Spark builds from MapReduce source code for better performance
• Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive
• Improved performance and optimized existing Hadoop MapReduce algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN
• Integrated Kafka with Spark Streaming for real-time data processing
• Moved transformed data to a Spark cluster where the data was set to go live on the application via Kafka
• Created a Kafka producer to connect to different external sources and bring the data into Kafka
• Analyzed and tuned Cassandra data models and tables during the DB2-to-Cassandra migration
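A hedged sketch of a Spark Structured Streaming job reading from Kafka and writing to HDFS, as in the first bullet above. Broker addresses, the topic, and output paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "trade-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka values arrive as bytes; cast to string before landing on HDFS.
decoded = events.select(col("key").cast("string"), col("value").cast("string"))

query = (decoded.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/trade_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/trade_events/")
         .outputMode("append")
         .start())
query.awaitTermination()
```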
AWS Big Data Engineer | Capital One | May 2017 - Jul 2018 | McLean, VA, US
• Created highly scalable, resilient, and performant architecture using AWS cloud technologies such as Simple Storage Service (S3), Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), Elastic Container Service (ECS), Lambda, and Elastic Load Balancing (ELB)
• Deployed containerized applications using Docker, allowing for standardized service infrastructure
• Monitored production software with logging, visualization, and incident-management software such as Splunk and Kibana
• Took advantage of new Spark Avro functionality through upgrades
• Provided live demonstrations of software systems to nontechnical, executive-level personnel, showing how the systems were meeting business goals and objectives
• Provisioned Spark clusters directly from the AWS Management Console
• Made and oversaw cloud VMs with the AWS EC2 command-line clients and the AWS management console
• Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets
• Added support for Amazon S3 and RDS to host static/media files and the database in the Amazon cloud
• Used Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon EC2 and S3, with AWS Redshift
• Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 (a sketch follows this entry)
• Populated database tables via AWS Kinesis Firehose and AWS Redshift
• Automated the installation of the ELK agent (Filebeat) with an Ansible playbook
• Used AWS CloudFormation templates alongside Terraform with existing plugins
• Used AWS IAM to create new users and groups
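A hypothetical Lambda handler responding to S3 object-created events, as in the entry above. The bucket, the target DynamoDB table, and the recorded attributes are made up for illustration.

```python
import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested_files")  # placeholder DynamoDB table

def handler(event, context):
    # An S3 event can batch several records; register each new object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        table.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": size,
        })
    return {"processed": len(event.get("Records", []))}
```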
Big Data Engineer (Remote) | Alibaba Group | Mar 2016 - Jul 2017 | Hangzhou, CN
• Built scalable distributed data solutions using Hadoop
• Installed and configured Pig for ETL jobs and wrote Pig scripts with regular expressions for data cleaning
• Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows
• Used the Oozie scheduler to automate pipeline workflows and orchestrate the jobs that extract data in a timely manner
• Moved data between Oracle and HDFS in both directions using Sqoop
• Imported data using Sqoop, loading data from MySQL and Oracle to HDFS on a regular basis
• Used Linux shell scripts to automate the build process and perform regular jobs like file transfers between different hosts
• Documented technical specs, dataflows, data models, and class models
• Loaded files to HDFS from Teradata and from HDFS into Hive
• Worked on installing the cluster, commissioning and decommissioning data nodes, name node recovery, capacity planning, and slots configuration
• Captured data and imported it to HDFS using Flume and Kafka for semi-structured data and Sqoop for existing relational databases
• Analyzed the Hadoop cluster and different big data analytic tools including Hive and Spark, and managed Hadoop log files
• Designed Hive queries to perform data analysis, data transfer, and table design
• Created Hive tables, loaded them with data, and wrote Hive queries to process the data (a sketch follows this entry)
• Used Oozie and its workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows
• Loaded data from sources such as HDFS or HBase into Spark RDDs and performed in-memory computation to generate the output response
• Handled data exchange between HDFS and different web applications and databases using Flume and Sqoop
• Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
• Created UNIX shell scripts to automate the build process and perform regular jobs like file transfers between different hosts
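A hedged sketch of Spark-over-Hive work like the entry above describes: creating a Hive table, loading it, and querying it with Spark SQL. Database, table, column names, and the landing path are illustrative only.

```python
from pyspark.sql import SparkSession

# enableHiveSupport lets Spark read and create tables in the Hive metastore.
spark = (SparkSession.builder
         .appName("hive_analytics")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.daily_orders (
        order_id STRING, amount DOUBLE, order_date STRING
    ) STORED AS PARQUET
""")

# Load a day of staged JSON data into the Hive table.
staged = spark.read.json("hdfs:///landing/orders/2017-01-01/")
(staged.selectExpr("order_id", "amount", "'2017-01-01' AS order_date")
       .write.mode("append").insertInto("sales.daily_orders"))

# Run an analytical query over the Hive table with Spark SQL.
spark.sql("""
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales.daily_orders
    GROUP BY order_date
""").show()
```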
Hadoop Developer (Intern) | Gulfstream Aerospace | Jan 2015 - Mar 2016 | Savannah, GA, US
• Deployed application JAR files into AWS instances
• Used instance image files to create instances with Hadoop installed and running
• Developed a task execution framework on EC2 instances using SQL and DynamoDB
• Designed a cost-effective archival platform for storing big data using Sqoop and various ETL tools
• Extracted data from RDBMSs (Oracle, MySQL) to HDFS using Sqoop
• Used Hive with Spark Streaming for real-time processing
• Imported data from different sources into Spark RDDs for processing
• Built a prototype for real-time analysis using Spark Streaming and Kafka
• Transferred data from AWS S3 using the Informatica tool
• Used AWS Redshift for storing data in the cloud
• Collected business requirements from subject matter experts such as data scientists and business partners
• Worked on streaming analyzed data to Hive tables using Sqoop, making it available for visualization and report generation by the BI team
• Configured the Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs
• Used NoSQL databases like MongoDB in implementation and integration
• Used the Ambari stack to manage big data clusters, and performed upgrades for the Ambari stack, Elasticsearch, etc.
• Installed and configured Tableau Desktop on one of the three nodes to connect to the Hortonworks Hive framework through the Hortonworks ODBC connector for further analysis of the cluster
• Assisted in installing and configuring Hive, Pig, Sqoop, Flume, Oozie, and HBase on the Hadoop cluster with the latest patches
• Created Oozie workflows for tasks like similarity matching and consolidation
• Enabled security on the cluster using Kerberos and integrated clusters with LDAP at the enterprise level
• Implemented user access with Kerberos and cluster security with Ranger
Caleb D. Skills
Caleb D. Education Details
-
College of Charleston School of Business, Finance
Frequently Asked Questions about Caleb D.
What company does Caleb D. work for?
Caleb D. works for Pfizer.
What is Caleb D.'s role at the current company?
Caleb D.'s current role is Cloud Computing Data Professional.
What is Caleb D.'s email address?
Caleb D.'s email address is ca****@****hem.com
What schools did Caleb D. attend?
Caleb D. attended College Of Charleston School Of Business.
What skills is Caleb D. known for?
Caleb D. has skills like Leadership, Consulting, AWS Glue, Data Warehousing, Amazon Elastic MapReduce, Amazon Web Services, Data Science, Data Engineering, Continuous Integration and Continuous Delivery, Financial Modeling, Data Analysis, and TensorFlow.