
AKASH REDDY - DATA ENGINEER with AI/ML
[email protected]
Location: Washington, DC, USA
Relocation: YES
Visa: H1B
Akash Reddy
Senior Data Engineer | AWS | Databricks | Snowflake
AWS Certified Data Engineer Associate
[email protected] | +1 240-273-1830

Senior Data Engineer with over 10 years of experience in designing, developing, and managing cloud-
based data warehousing, feature engineering, big data, ETL/ELT processes, and Business Intelligence
solutions.
Specializes in AWS, Azure, and GCP frameworks, Cloudera and the Hadoop ecosystem,
Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, Tableau, Airflow, dbt,
and Python programming.
Strong AWS Cloud-specific skills encompassing EC2, EBS, S3, VPC, RDS, SES, ELB, EMR, ECS, CloudFront,
CloudFormation, ElastiCache, SageMaker, Fargate, CloudWatch, Redshift, Lambda, SNS, DynamoDB, Flink
and Kinesis.
Skilled in Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure
Analytical Services, Azure Cosmos NoSQL DB, and Databricks.
Expertise in Python, NumPy, Pandas, AWS, Postgres, Kafka, Cassandra, and MongoDB for data
engineering and analysis tasks.
Proficient in Django for web application development, utilizing Python for backend logic and REST APIs.
Proficient in UNIX shell scripting for processing large data volumes and loading into AWS Redshift and
Snowflake databases.
In-depth knowledge of Hadoop Ecosystem components like Hive, HDFS, Sqoop, Spark, Kafka, and Pig.
Experienced in architecting, installing, configuring, and managing Apache Hadoop Clusters, MapReduce,
and Cloudera Hadoop Distribution.
Skilled in data migration, profiling, ingestion, cleansing, transformation, import, and export using ETL
tools such as Informatica PowerCenter.
Hands-on expertise with Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and
Spark Streaming.
Extensive experience developing stored procedures, functions, triggers, views, cursors, indexes, CTEs,
joins, and subqueries with T-SQL.
Knowledgeable in managing Azure Data Lakes (ADLS) and Data Lake Analytics and integrating with
other Azure Services.
Proficient in creating dashboards using SAS, Tableau, Power BI, BO, and QlikView for reporting and
visualization of large datasets.
Expertise in SQL databases including MSSQL Server, MySQL, Oracle DB, PostgreSQL, and NoSQL
databases such as DynamoDB, MongoDB, Cassandra, and HBase.
Strong understanding of business and data principles, data test cases, and the Software Development
Life Cycle (SDLC) including Agile and Waterfall methodologies.
Demonstrated proficiency in the Fact/Dimension data warehouse design model, employing star and
snowflake design methods effectively.
Technical Skills:
Languages: Python, PySpark, Scala, SQL, R
Cloud Technologies: AWS, Microsoft Azure, Google Cloud Platform (GCP)
Big Data Technologies: Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Kafka, Apache Spark
Databases: Oracle, MySQL, SQL Server, Cassandra, DynamoDB, PostgreSQL, Cosmos DB, Snowflake
Frameworks: Django REST framework, MVC
Tools: PyCharm, Eclipse, Visual Studio
Versioning tools: SVN, Git, GitHub, CVS
Visualization/Reporting: Tableau, SSRS, Power BI






Professional Experience:
Ameriprise Financial, Minneapolis, Minnesota October 2024 - Present
Cloud Data Engineer
Job Description:
Architected and led the full migration of enterprise SAS analytics workloads to an AWS Data Lake (S3 +
Glue 4.0 + Athena), enabling ML-ready datasets and reducing analytics query costs by 40%.
Migrated legacy SAS analytics workloads to an AWS-based Data Lake, using Amazon S3, AWS Glue,
and Amazon Athena to build scalable ingestion pipelines, enforce schema-on-read, and expose curated
Gold-layer datasets for downstream ML and BI consumption.
Designed and implemented feature engineering pipelines using PySpark on Amazon EMR, producing
training-ready datasets, behavioral aggregations, and time-window features consumed by downstream
classification and regression models (see the PySpark feature sketch below).
Built scalable distributed data processing frameworks using Amazon EMR (including Serverless EMR)
to process multi-terabyte datasets; tuned Spark executor configs, partition strategies, and shuffle
optimizations to cut job runtimes by 35%.
Converted legacy SAS data transformation logic into optimized SQL and PySpark workflows in
Amazon Athena and EMR, replacing brittle SAS macros with testable, version-controlled PySpark and
SQL transformations integrated into CI/CD pipelines.
Developed and orchestrated end-to-end ETL pipelines using AWS Glue 4.0, EMR 7.0, Lambda, S3, and
Athena, reducing data processing time by 50% and improving pipeline reliability and data accuracy by
25%.
Engineered feature stores and ML-ready data marts in the AWS Data Lake, supporting both scheduled
batch scoring and low-latency inference.
Designed data ingestion workflows for structured and semi-structured datasets, leveraging Great
Expectations checks, schema validation, and custom anomaly detection rules to enforce data SLAs before
model ingestion (see the data-quality sketch below).
Built a Proof of Concept integrating Snowflake with AWS Data Lake architecture, using AWS Glue
Data Catalog to enable unified metadata management and cross-platform data querying.
Managed AWS Glue Data Catalog for metadata management, improving dataset discoverability and
optimizing Athena query performance across large ML datasets.
Implemented data quality frameworks and anomaly detection checks to ensure clean and reliable
training data before model ingestion.
Monitored and optimized production pipelines using AWS CloudWatch and monitoring frameworks,
ensuring SLA compliance and stable ML pipeline operations.
Utilized PostgreSQL and advanced SQL optimization techniques for managing relational datasets
supporting analytics and ML feature pipelines.
Designed and developed interactive Power BI dashboards to visualize ML pipeline metrics, data quality
checks, and operational KPIs, enabling stakeholders to monitor model performance and data health in
near real time.
Implemented Row-Level Security (RLS) and data governance practices to ensure secure and role-based
data access across stakeholders.
Collaborated with data scientists, ML engineers, and product teams to design scalable data pipelines
and feature engineering frameworks for AI-driven applications.
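The time-window feature work above can be summarized with a minimal PySpark sketch; the table, bucket, and column names (gold.customer_transactions, customer_id, event_ts, amount) are illustrative assumptions, not the actual schema.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

    # Assumed curated Gold-layer transactions table registered in the Glue Data Catalog.
    txns = spark.table("gold.customer_transactions").withColumn(
        "event_date", F.to_date("event_ts")
    )

    # Rolling 30-day window per customer, ordered by event time in epoch seconds.
    w_30d = (
        Window.partitionBy("customer_id")
        .orderBy(F.col("event_ts").cast("long"))
        .rangeBetween(-30 * 86400, 0)
    )

    features = (
        txns.withColumn("txn_count_30d", F.count("*").over(w_30d))
            .withColumn("txn_amount_30d", F.sum("amount").over(w_30d))
            .withColumn("avg_txn_amount_30d", F.avg("amount").over(w_30d))
    )

    # Persist training-ready features to S3 as date-partitioned Parquet.
    features.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/feature-store/customer_behavior/"
    )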
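The SLA checks before model ingestion follow the pattern below, shown with the legacy Great Expectations Pandas API; the file path and column names are hypothetical.

    import great_expectations as ge
    import pandas as pd

    # Reading from S3 assumes s3fs is installed and AWS credentials are configured.
    df = pd.read_parquet("s3://example-bucket/curated/customer_features.parquet")
    ge_df = ge.from_pandas(df)

    # SLA-style checks: keys must be present and aggregates must be non-negative.
    ge_df.expect_column_values_to_not_be_null("customer_id")
    ge_df.expect_column_values_to_be_between("txn_amount_30d", min_value=0)

    results = ge_df.validate()
    if not results.success:
        raise ValueError(f"Data quality checks failed: {results}")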
Environment/Technologies: AWS EMR 7.0, Serverless EMR, S3, Glue 4.0, Lambda, Athena, SageMaker,
Redshift, CloudWatch, CloudFormation, IAM, PySpark, Python, Spark SQL, Glue Data Catalog, PostgreSQL,
Bitbucket, Jenkins



US Bank, Minneapolis, Minnesota November 2021 - September 2024
Cloud Data Engineer
Job Description:
Implemented and optimized metadata-driven GCP data pipelines. Automated deployment processes and
remediated application vulnerabilities, enhancing infrastructure efficiency and security. Leveraged various
Google Cloud services for real-time data processing and business intelligence. Responsibilities spanned
optimizing data workflows, improving data processing efficiency, and ensuring data governance and
security compliance.
Responsibilities:
Implemented a metadata-driven data pipeline using Google Cloud Dataflow, Cloud Storage, and
Cloud Functions, reducing processing time by 40%.
Streamlined batch data workflows for large-scale processing by implementing BigQuery partitioning
and clustering, improving performance and reducing processing times by 30% (see the BigQuery sketch below).
Implemented a Proof of Concept (POC) using Dataflow, Cloud Functions, and Cloud Storage within a
serverless architecture, demonstrating a 50% reduction in data processing time and a 30% decrease
in infrastructure costs.
Developed ETL solutions using Spark SQL in Dataproc for data extraction, transformation, and
aggregation from various sources, utilizing multiple file formats.
Automated data ingestion processes leveraging Cloud Composer (Airflow), Apache NiFi, and custom
Python automation scripts, reducing manual intervention by 70% and improving operational efficiency.
Designed and developed robust Python ETL scripts integrating diverse data sources using Pandas,
NumPy, and PySpark.
Automated deployment processes using Jenkins, Docker, Kubernetes (GKE), Deployment Manager,
and Ansible, ensuring efficient and reproducible infrastructure.
Developed modular Terraform configurations to streamline provisioning of GCP services such as Cloud
Storage, Cloud Functions, and Dataflow, enhancing maintainability and scalability.
Automated data governance tools with Collibra and custom scripts, reducing manual effort by 50%,
saving 20 hours weekly, and ensuring 90% accuracy in tracking data dependencies, leading to a 60%
efficiency boost.
Managed CI/CD pipelines for Cloud Functions and Dataflow using Cloud Build and Jenkins, ensuring
fast and reliable deployments.
Ensured compliance with regulatory standards such as GDPR and HIPAA using Cloud IAM, Cloud Audit
Logs, and Security Command Center, reducing security vulnerabilities by 30% and ensuring continuous
compliance.
Managed and optimized relational databases including Cloud SQL (MySQL, PostgreSQL), Oracle on
GCE, and NoSQL databases like Bigtable, Firestore, MongoDB, and Cassandra.
Collaborated on projects involving data warehouses like BigQuery and Snowflake, optimizing
structured data storage and retrieval.
Utilized Jupyter Notebooks for developing and testing ETL workflows.
Integrated Cloud Audit Logs and Cloud Logging into Splunk for enhanced monitoring and visibility,
improving observability by 40%.
Utilized BigQuery for ad-hoc SQL queries on processed data, enabling efficient exploration and analysis
without additional infrastructure.
Created and executed Splunk Dashboards to highlight key business metrics and monitor third-party
system performance.
Leveraged Python (Scikit-Learn, Matplotlib, Seaborn, Pandas, NumPy) for advanced data analysis,
visualization, and machine learning tasks, supporting data-driven decision-making.
Configured real-time streaming pipelines with Pub/Sub, Dataflow, and Spark Streaming on
Dataproc, storing high-volume data into Cloud Storage / Bigtable.
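The partitioning and clustering approach is sketched below with the google-cloud-bigquery client; the project, dataset, table, and column names are assumptions.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")

    # One-time rewrite of a staging table into a date-partitioned, clustered table
    # so downstream queries scan only the partitions and clusters they need.
    ddl = """
    CREATE TABLE IF NOT EXISTS analytics.transactions_partitioned
    PARTITION BY DATE(event_ts)
    CLUSTER BY customer_id, merchant_id
    AS SELECT * FROM analytics.transactions_staging
    """
    client.query(ddl).result()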
Environment/Technologies: Google Cloud Dataflow, Cloud Storage, Cloud Functions, Cloud Composer,
Cloud IAM, Cloud Audit Logs, Security Command Center, Dataproc, PySpark, Splunk, Apache Flink,
Pub/Sub, Kafka, Jenkins, Ansible, Apache NiFi, Talend, BigQuery, Looker Studio, Collibra, Bigtable,
Firestore, Snowflake, Pandas, NumPy, Scikit-Learn, Matplotlib, Seaborn
HEB Digital, Austin, Texas September 2019 - October 2021
Cloud Data Engineer
Job Description:
Deployed, automated, and managed GCP cloud-based production systems, enhancing scalability, data
processing efficiency, and reducing operational costs. Designed and optimized ETL workflows using GCP
services and Python scripts, automated deployment processes, and integrated real-time data streaming and
serverless solutions for improved data-driven decision-making.
Responsibilities:
Deployed, automated, and managed GCP cloud-based production systems, achieving a 30% increase
in scalability while maintaining security standards.
Executed a successful migration from SQL Server (On-Prem) to GCP BigQuery.
Designed and optimized ETL workflows using Scala alongside Python, Alteryx, Cloud Dataflow 2.0,
Cloud Functions, Cloud Storage, and BigQuery, integrating diverse data sources seamlessly. Achieved a
50% reduction in data processing time and enhanced data accuracy by 25%.
Developed and maintained Python applications for automating various data engineering tasks, using
libraries like Pandas, NumPy, and PySpark.
Developed and integrated Python automation solutions with Cloud Functions and Cloud Composer to
enhance functionality and streamline data workflows within cloud environments.
Worked with SQL, PL/SQL procedures and functions, stored procedures, and packages within mappings.
Managed and optimized relational databases, including Cloud SQL, PostgreSQL, Oracle, and NoSQL
databases like MongoDB.
Developed data processing applications in Scala using Apache Spark on Dataproc, enhancing
distributed data processing tasks and achieving higher throughput.
Designed and implemented data storage and retrieval solutions using Firestore/Bigtable for real-time
data access and optimized handling of large-scale data.
Developed robust ETL pipelines to automate data synchronization between Cloud Storage and
Elasticsearch, enhancing real-time search functionality and improving data retrieval efficiency by 60%.
Enhanced database performance by optimizing SQL queries, achieving a 40% reduction in query
execution time and significantly improving overall system efficiency.
Automated infrastructure provisioning and configuration management using Terraform in CI/CD
workflows (Cloud Build + Cloud Deploy), ensuring consistent environments across development and
production.
Implemented Terraform scripts to manage and scale GCP resources dynamically, optimizing
configurations for Cloud Storage and BigQuery, which led to a 30% improvement in resource
utilization and reduced operational costs.
Optimized Apache Flink job performance by tuning parallelism and resource allocation, achieving
higher throughput and lower latency.
Automated deployment pipelines for containerized applications using Google Kubernetes Engine
(GKE), integrated with Jenkins and Cloud Build, ensuring consistent and rapid delivery of updates and
reducing deployment times by 30%.
Integrated real-time log data monitoring using Pub/Sub and Stackdriver (Cloud Logging/Monitoring)
to provide immediate insights and enhanced decision-making based on live data.
Implemented and optimized data workflows using Hadoop ecosystem tools such as Spark, Hive, Pig, and
Sqoop on Dataproc, resulting in a 25% reduction in data processing costs.
Ingested, processed, and analyzed real-time streaming data using Pub/Sub + Dataflow, providing timely
insights for data-driven decision-making (see the Beam streaming sketch below).
Designed and implemented robust data workflows using Cloud Composer (Apache Airflow), enabling
automated scheduling and monitoring of ETL processes, resulting in a 35% increase in operational
efficiency.
Leveraged BigQuery to perform ad-hoc analysis and gain insights from large datasets stored in Cloud
Storage, improving data accessibility and supporting real-time decision-making processes.
Utilized Scala to process and analyze real-time streaming data, leveraging frameworks like Apache Flink
for event-time processing and accurate data handling.
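A minimal Apache Beam (Python SDK) sketch of the Pub/Sub + Dataflow streaming pattern is shown below; the subscription, target table, and message format are hypothetical, and the table is assumed to already exist.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            # Read raw event bytes from a hypothetical Pub/Sub subscription.
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/example-project/subscriptions/clickstream-sub")
            # Decode each message into a dict matching the BigQuery table schema.
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            # Append rows to an existing BigQuery table for downstream analysis.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.clickstream_events",
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )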
Environment/Technologies: Cloud Storage, BigQuery, Dataflow, Cloud Functions, Cloud Composer,
Cloud SQL, Firestore, Bigtable, Pub/Sub, Dataproc, Cloud Build, Cloud Deploy, Vertex AI, Databricks,
Apache Spark, Flink, Hive, Pig, Sqoop, NiFi, Elasticsearch, Stackdriver (Logging/Monitoring), GKE
(Kubernetes), Docker, Jenkins, Terraform, Python, PySpark, Scala, SQL, PL/SQL, Oracle, MongoDB.
GE Healthcare, Chicago, IL May 2017 - August 2019
Cloud Data Engineer
Responsibilities:
Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database,
Databricks, and Azure SQL Data Warehouse.
Managed database access and control during the migration process using Azure Data Factory.
Developed Spark applications using PySpark-SQL in Databricks for data extraction, transformation, and
aggregation from multiple file formats, analyzing and transforming data to uncover insights into customer
behavior (see the Databricks sketch below).
Experienced in dimensional modeling on large-scale datasets (star and snowflake schemas), forecasting,
transactional modeling, and Slowly Changing Dimensions (SCD).
Built a claims simulation app using R Shiny to estimate total loss amounts for claims settlement,
leveraging multiple frequency and severity distributions.
Launched R Shiny apps integrating machine learning algorithms via R Studio Connect with Azure and
Docker for scalable deployment.
Developed scripts for data transfer from FTP servers to the ingestion layer using Azure CLI commands.
Automated Azure HDInsights cluster creation using PowerShell scripts.
Utilized Azure Data Lake Storage Gen2 for storing Excel and Parquet files, retrieving user data via Blob
API.
Worked with Azure Databricks, PySpark, Spark SQL, Azure ADW, and Hive for data loading and
transformation.
Designed Azure Cloud Architecture and implementation plans for hosting complex application
workloads on Microsoft Azure.
Developed MapReduce programs for data extraction, transformation, and aggregation from various file
formats including XML, JSON, CSV, and compressed formats.
Automated processes for flattening JSON data from Cassandra and used Hive UDFs for data
transformation.
Utilized Graph SQL to model and query complex, interconnected datasets, improving the efficiency of
querying relationships and patterns within large healthcare data sets.
Created Impala tables, SFTP transfers, and Shell scripts for data import into Hadoop.
Worked with the SnowSQL client and Snowpipe to identify data quality issues, improve efficiency, and
perform analysis.
Developed a Snowpipe to ingest continuous data loads from stages.
Utilized JIRA for bug tracking and CVS for version control.
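The Databricks extraction and aggregation work follows the pattern sketched below; the ADLS Gen2 path, table, and column names are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F

    # Databricks notebooks already provide a SparkSession named `spark`;
    # getOrCreate() simply reuses it.
    spark = SparkSession.builder.getOrCreate()

    # Assumes the cluster is configured with credentials for the ADLS Gen2 account.
    claims = spark.read.parquet(
        "abfss://curated@examplestorage.dfs.core.windows.net/claims/"
    )

    # Aggregate claim frequency and severity per member to surface behavior insights.
    claims_summary = (
        claims.groupBy("member_id")
              .agg(F.count("claim_id").alias("claim_count"),
                   F.avg("claim_amount").alias("avg_claim_amount"))
    )

    claims_summary.write.mode("overwrite").saveAsTable("analytics.claims_summary")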
Environment/Technologies: Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database,
Databricks, Azure SQL Data Warehouse, Azure Data Factory, Spark, PySpark-SQL, R Shiny,
Azure, Docker, Azure HDInsight, Azure Data Lake Storage Gen2, Blob API, Cassandra, Impala, Hadoop.

US Cellular, Chicago, IL January 2016 - April 2017
Hadoop Developer
Responsibilities:
Orchestrated comprehensive deployment of web applications on AWS, leveraging S3 buckets to optimize
efficiency.
Designed and developed large-scale data ingestion pipelines using HDFS, Hive, and Spark, processing
structured and unstructured data from multiple on-premise and cloud sources.
Automated data movement workflows using Oozie and Shell/Python scripts, reducing manual effort
and improving data availability across environments.
Implemented Sqoop jobs to efficiently migrate data between RDBMS (MySQL, Oracle, PostgreSQL) and
HDFS/Hive.
Optimized Hive queries through partitioning, bucketing, and compression techniques, reducing query
latency and improving cluster performance (see the partitioning/bucketing sketch below).
Developed PySpark-based ETL frameworks to cleanse, transform, and aggregate data, improving overall
pipeline reliability and maintainability.
Integrated Spark Streaming for near real-time data processing and analytics to support operational
dashboards.
Monitored and fine-tuned Hadoop cluster performance using YARN, Ambari, and Cloudera Manager,
ensuring efficient resource utilization.
Created and managed Hive Metastore tables integrated with Tableau and Power BI, enabling seamless
analytics and visualization.
Developed custom MapReduce jobs for large-scale data processing and aggregation, improving
throughput and scalability.
Implemented data archival and retention strategies using S3 integration with Hadoop for cost-
optimized storage management.
Utilized Git and Jenkins for version control and automated deployment of ETL code across environments.
Created parameterized Python scripts for batch automation and metadata-driven workflow
orchestration.
Participated in data migration projects, transitioning on-prem Hadoop workloads to AWS EMR and S3,
maintaining schema consistency and performance.
Collaborated with BI and analytics teams to define and standardize data models, transformation logic,
and reusable Spark components.
Conducted unit and integration testing using PyTest and PyUnit, ensuring production-grade quality for
ETL jobs.
Wrote detailed technical documentation for data pipelines, schema design, and operational best
practices to support knowledge transfer.
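The Hive partitioning/bucketing optimization is sketched below in PySpark; the table and column names are assumptions.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-optimized-load")
        .enableHiveSupport()
        .getOrCreate()
    )

    usage = spark.table("staging.call_detail_records")

    # Partition by call date and bucket by subscriber so per-subscriber queries prune
    # partitions and avoid full shuffles; Snappy keeps the Parquet files compact.
    (
        usage.write
             .mode("overwrite")
             .format("parquet")
             .option("compression", "snappy")
             .partitionBy("call_date")
             .bucketBy(32, "subscriber_id")
             .sortBy("subscriber_id")
             .saveAsTable("analytics.call_detail_records")
    )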
Environment/Technologies: Hadoop, HDFS, Hive, Pig, Spark, PySpark, MapReduce, Sqoop, Oozie, HBase,
YARN, Ambari, Cloudera Manager, AWS EMR, S3, Jenkins, Git, Python, Shell Scripting, Tableau, PostgreSQL,
MySQL, Oracle.

SAP Inc, India January 2014 - July 2015
Hadoop Developer
Responsibilities:
Designed and optimized large-scale data pipelines using HDFS, Hive, Pig, and MapReduce, processing
over 5TB of data daily.
Automated ETL workflows integrating on-premises data sources with AWS EMR and S3 for real-time
reporting and analytics.
Built scalable data lakes and performed complex data wrangling, transformation, and aggregation to
support banking and risk analytics use cases.
Developed and tuned Hive and Pig scripts, improving query performance and reducing data latency for
BI teams.
Integrated Hadoop-based data flows with Tableau dashboards for end-to-end KPI reporting and
visualization.
Developed end-to-end data pipelines using Hadoop, Spark, and Sqoop, enabling seamless migration of
core financial data from RDBMS to HDFS.
Automated data ingestion from MySQL and Oracle into Hive using Oozie workflows, reducing manual
intervention and operational overhead.
Defined and implemented partitioning, bucketing, and compression strategies in HDFS/Hive to
optimize both storage and performance.
Participated in migration assessments and developed proofs of concept (PoCs) for transitioning legacy
Hadoop workloads to AWS Cloud.
Developed custom Python utilities and Shell scripts for workflow orchestration, data validation, and
metadata management.
Implemented data quality checks and schema validations to ensure accuracy and consistency across
ingestion and curation layers (see the validation sketch below).
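A minimal sketch of the kind of Python validation utility described above; the expected columns and file name are hypothetical.

    import pandas as pd

    EXPECTED_COLUMNS = {"account_id", "txn_date", "amount", "currency"}

    def validate_batch(df: pd.DataFrame) -> list:
        """Return a list of human-readable validation failures (empty if clean)."""
        errors = []
        missing = EXPECTED_COLUMNS - set(df.columns)
        if missing:
            errors.append(f"missing columns: {sorted(missing)}")
        if "account_id" in df.columns and df["account_id"].isnull().any():
            errors.append("null account_id values found")
        if "amount" in df.columns and (df["amount"] < 0).any():
            errors.append("negative transaction amounts found")
        return errors

    # Validate an incoming batch before promoting it to the curated layer.
    batch = pd.read_csv("incoming/transactions_batch.csv")
    problems = validate_batch(batch)
    if problems:
        raise ValueError("; ".join(problems))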
Environment/Technologies: Python, Django, Flask, NumPy, REST, HTML, CSS, XML, JavaScript, Spark SQL,
PySpark, Hadoop, Hive, Pig, MapReduce, Sqoop, Oozie, Jenkins, AWS EMR, S3, Linux, Shell scripting