| AKASH REDDY - DATA ENGINEER with AI/ML |
| [email protected] |
| Location: Washington, DC, USA |
| Relocation: YES |
| Visa: H1B |
| Resume file: Resume_AkashReddy_DataEngineer (3)_1774970143052.docx |
|
Akash Reddy
Senior Data Engineer | AWS | Databricks | Snowflake
AWS Certified Data Engineer - Associate
[email protected] | +1 240-273-1830

Senior Data Engineer with over 10 years of experience designing, developing, and managing cloud-based data warehousing, feature engineering, big data, ETL/ELT processes, and Business Intelligence solutions. Specializes in AWS, Azure, and GCP, Cloudera, the Hadoop ecosystem, Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, Tableau, Airflow, dbt, and Python programming.

- Strong AWS Cloud skills encompassing EC2, EBS, S3, VPC, RDS, SES, ELB, EMR, ECS, CloudFront, CloudFormation, ElastiCache, SageMaker, Fargate, CloudWatch, Redshift, Lambda, SNS, DynamoDB, Flink, and Kinesis.
- Skilled in Azure Cloud: Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), and Databricks.
- Expertise in Python, NumPy, Pandas, AWS, Postgres, Kafka, Cassandra, and MongoDB for data engineering and analysis tasks.
- Proficient in Django for web application development, using Python for backend logic and REST APIs.
- Proficient in UNIX shell scripting for processing large data volumes and loading into AWS Redshift and Snowflake databases.
- In-depth knowledge of Hadoop ecosystem components such as Hive, HDFS, Sqoop, Spark, Kafka, and Pig.
- Experienced in architecting, installing, configuring, and managing Apache Hadoop clusters, MapReduce, and the Cloudera Hadoop Distribution.
- Skilled in data migration, profiling, ingestion, cleansing, transformation, import, and export using ETL tools such as Informatica PowerCenter.
- Hands-on expertise with Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.
- Extensive experience developing stored procedures, functions, triggers, views, cursors, indexes, CTEs, joins, and subqueries with T-SQL.
- Knowledgeable in managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and integrating them with other Azure services.
- Proficient in creating dashboards with SAS, Tableau, Power BI, BusinessObjects, and QlikView for reporting and visualization of large datasets.
- Expertise in SQL databases including MSSQL Server, MySQL, Oracle DB, and PostgreSQL, and NoSQL databases such as DynamoDB, MongoDB, Cassandra, and HBase.
- Strong understanding of business and data principles, data test cases, and the Software Development Life Cycle (SDLC), including Agile and Waterfall methodologies.
- Demonstrated proficiency in fact/dimension data warehouse design, applying star and snowflake schema methods effectively.

Technical Skills:
Languages: Python, PySpark, Scala, SQL, R
Cloud Technologies: AWS, Microsoft Azure, GCP
Big Data Technologies: Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Kafka, Apache Spark
Databases: Oracle, MySQL, SQL Server, Cassandra, DynamoDB, PostgreSQL, Cosmos DB, Snowflake
Frameworks: Django REST framework, MVC
Tools: PyCharm, Eclipse, Visual Studio
Versioning Tools: SVN, Git, GitHub, CVS
Visualization/Reporting: Tableau, SSRS, Power BI

Professional Experience:

Ameriprise Financial, Minneapolis, Minnesota | October 2024 - Present
Cloud Data Engineer
Job Description: Architected and led the full migration of enterprise SAS analytics workloads to an AWS Data Lake (S3 + Glue 4.0 + Athena), enabling ML-ready datasets and reducing analytics query costs by 40%.
Responsibilities:
- Migrated legacy SAS analytics workloads to an AWS-based Data Lake, using Amazon S3, AWS Glue, and Amazon Athena to build scalable ingestion pipelines, enforce schema-on-read, and expose curated Gold-layer datasets for downstream ML and BI consumption.
- Designed and implemented feature engineering pipelines using PySpark on Amazon EMR, producing training-ready datasets, behavioral aggregations, and time-window features consumed by downstream classification and regression models (a minimal sketch follows this entry).
- Built scalable distributed data processing frameworks using Amazon EMR (including EMR Serverless) to process multi-terabyte datasets; tuned Spark executor configs, partition strategies, and shuffle optimizations to cut job runtimes by 35%.
- Converted legacy SAS data transformation logic into optimized SQL and PySpark workflows in Amazon Athena and EMR, replacing brittle SAS macros with testable, version-controlled PySpark and SQL transformations integrated into CI/CD pipelines.
- Developed and orchestrated end-to-end ETL pipelines using AWS Glue 4.0, EMR 7.0, Lambda, S3, and Athena, reducing data processing time by 50% and improving pipeline reliability and data accuracy by 25%.
- Engineered feature stores and ML-ready data marts in the AWS Data Lake, supporting both scheduled batch scoring and low-latency inference.
- Designed data ingestion workflows for structured and semi-structured datasets, leveraging Great Expectations checks, schema validation, and custom anomaly detection rules to enforce data SLAs before model ingestion.
- Built a proof of concept integrating Snowflake with the AWS Data Lake architecture, using the AWS Glue Data Catalog to enable unified metadata management and cross-platform data querying.
- Managed the AWS Glue Data Catalog for metadata management, improving dataset discoverability and optimizing Athena query performance across large ML datasets.
- Implemented data quality frameworks and anomaly detection checks to ensure clean and reliable training data before model ingestion.
- Monitored and optimized production pipelines using AWS CloudWatch and monitoring frameworks, ensuring SLA compliance and stable ML pipeline operations.
- Utilized PostgreSQL and advanced SQL optimization techniques for managing relational datasets supporting analytics and ML feature pipelines.
- Designed and developed interactive Power BI dashboards to visualize ML pipeline metrics, data quality checks, and operational KPIs, enabling stakeholders to monitor model performance and data health in near real time.
- Implemented Row-Level Security (RLS) and data governance practices to ensure secure, role-based data access across stakeholders.
- Collaborated with data scientists, ML engineers, and product teams to design scalable data pipelines and feature engineering frameworks for AI-driven applications.
Environment/Technologies: AWS EMR 7.0, EMR Serverless, S3, Glue 4.0, Lambda, Athena, SageMaker, Redshift, CloudWatch, CloudFormation, IAM, PySpark, Python, Spark SQL, Glue Data Catalog, PostgreSQL, Bitbucket, Jenkins
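A minimal PySpark sketch of the kind of time-window feature engineering described in this entry; the table, column, and S3 path names (transactions, customer_id, txn_ts, amount) are illustrative assumptions, not the actual schema:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("feature-engineering-sketch").getOrCreate()

# Hypothetical curated transaction data; schema assumed for illustration.
txns = spark.read.parquet("s3://example-bucket/silver/transactions/")

# 30-day rolling window per customer, ordered by event time (epoch seconds).
w30 = (
    Window.partitionBy("customer_id")
    .orderBy(F.col("txn_ts").cast("long"))
    .rangeBetween(-30 * 24 * 3600, 0)
)

features = (
    txns.withColumn("txn_amount_30d_sum", F.sum("amount").over(w30))
        .withColumn("txn_count_30d", F.count(F.lit(1)).over(w30))
        .withColumn("txn_amount_30d_avg", F.avg("amount").over(w30))
        .withColumn("txn_date", F.to_date("txn_ts"))
)

# Persist a training-ready, partitioned Gold-layer dataset for downstream models.
features.write.mode("overwrite").partitionBy("txn_date").parquet(
    "s3://example-bucket/gold/customer_features/"
)
```

Range-based windows keyed on event time keep the aggregations reproducible for both scheduled batch scoring and model training.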
US Bank, Minneapolis, Minnesota | November 2021 - September 2024
Cloud Data Engineer
Job Description: Implemented and optimized metadata-driven GCP data pipelines. Automated deployment processes and remediated application vulnerabilities, enhancing infrastructure efficiency and security. Leveraged various GCP services for real-time data processing and business intelligence. Responsibilities spanned optimizing data workflows, improving data processing efficiency, and ensuring data governance and security compliance.
Responsibilities:
- Implemented a metadata-driven data pipeline using Google Cloud Dataflow, Cloud Storage, and Cloud Functions, reducing processing time by 40%.
- Streamlined batch data workflows for large-scale processing by implementing BigQuery partitioning and clustering, improving performance and reducing processing times by 30%.
- Implemented a proof of concept (POC) using Dataflow, Cloud Functions, and Cloud Storage within a serverless architecture, demonstrating a 50% reduction in data processing time and a 30% decrease in infrastructure costs.
- Developed ETL solutions using Spark SQL on Dataproc for data extraction, transformation, and aggregation from various sources and file formats.
- Automated data ingestion processes leveraging Cloud Composer (Airflow), Apache NiFi, and custom Python automation scripts, reducing manual intervention by 70% and improving operational efficiency (an orchestration sketch follows this entry).
- Designed and developed robust Python ETL scripts integrating diverse data sources using Pandas, NumPy, and PySpark.
- Automated deployment processes using Jenkins, Docker, Kubernetes (GKE), Deployment Manager, and Ansible, ensuring efficient and reproducible infrastructure.
- Developed modular Terraform configurations to streamline provisioning of GCP services such as Cloud Storage, Cloud Functions, and Dataflow, enhancing maintainability and scalability.
- Automated data governance tooling with Collibra and custom scripts, reducing manual effort by 50%, saving 20 hours weekly, and achieving 90% accuracy in tracking data dependencies, leading to a 60% efficiency boost.
- Managed CI/CD pipelines for Cloud Functions and Dataflow using Cloud Build and Jenkins, ensuring fast and reliable deployments.
- Ensured compliance with regulatory standards such as GDPR and HIPAA using Cloud IAM, Cloud Audit Logs, and Security Command Center, reducing security vulnerabilities by 30% and ensuring continuous compliance.
- Managed and optimized relational databases including Cloud SQL (MySQL, PostgreSQL) and Oracle on GCE, as well as NoSQL databases such as Bigtable, Firestore, MongoDB, and Cassandra.
- Collaborated on projects involving data warehouses such as BigQuery and Snowflake, optimizing structured data storage and retrieval.
- Utilized Jupyter Notebooks for developing and testing ETL workflows.
- Integrated Cloud Audit Logs and Cloud Logging into Splunk for enhanced monitoring and visibility, improving observability by 40%.
- Utilized BigQuery for ad-hoc SQL queries on processed data, enabling efficient exploration and analysis without additional infrastructure.
- Created and maintained Splunk dashboards to highlight key business metrics and monitor third-party system performance.
- Leveraged Python (Scikit-Learn, Matplotlib, Seaborn, Pandas, NumPy) for advanced data analysis, visualization, and machine learning tasks, supporting data-driven decision-making.
- Configured real-time streaming pipelines with Pub/Sub, Dataflow, and Spark Streaming on Dataproc, storing high-volume data in Cloud Storage and Bigtable.
Environment/Technologies: Google Cloud Dataflow, Cloud Storage, Cloud Functions, Cloud Composer, Cloud IAM, Cloud Audit Logs, Security Command Center, Dataproc, PySpark, Splunk, Apache Flink, Pub/Sub, Kafka, Jenkins, Ansible, Apache NiFi, Talend, BigQuery, Looker Studio, Collibra, Bigtable, Firestore, Snowflake, Pandas, NumPy, Scikit-Learn, Matplotlib, Seaborn
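A minimal Cloud Composer (Airflow 2.x) sketch of the ingestion orchestration pattern referenced in this entry; the DAG id, task names, and placeholder callables are illustrative assumptions, not the production DAG:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_to_gcs(**context):
    """Placeholder: pull the day's source files and land them in Cloud Storage."""
    print("ingesting partition", context["ds"])


def validate_partition(**context):
    """Placeholder: run schema and row-count checks before downstream loads."""
    print("validating partition", context["ds"])


with DAG(
    dag_id="daily_ingestion_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = PythonOperator(task_id="ingest_to_gcs", python_callable=ingest_to_gcs)
    validate = PythonOperator(task_id="validate_partition", python_callable=validate_partition)

    # Ingestion must succeed before validation gates downstream consumers.
    ingest >> validate
```

The daily schedule and retries only illustrate how scheduling and monitoring hooks would be wired; real operators (for example, Dataflow or NiFi triggers) would replace the placeholders.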
HEB Digital, Austin, Texas | September 2019 - October 2021
Cloud Data Engineer
Job Description: Deployed, automated, and managed GCP cloud-based production systems, enhancing scalability and data processing efficiency while reducing operational costs. Designed and optimized ETL workflows using GCP services and Python scripts, automated deployment processes, and integrated real-time data streaming and serverless solutions for improved data-driven decision-making.
Responsibilities:
- Deployed, automated, and managed GCP cloud-based production systems, achieving a 30% increase in scalability while maintaining security standards.
- Executed a successful migration from SQL Server (on-prem) to GCP BigQuery.
- Designed and optimized ETL workflows using Scala alongside Python, Alteryx, Cloud Dataflow 2.0, Cloud Functions, Cloud Storage, and BigQuery, integrating diverse data sources seamlessly; achieved a 50% reduction in data processing time and improved data accuracy by 25%.
- Developed and maintained Python applications for automating various data engineering tasks, using libraries such as Pandas, NumPy, and PySpark.
- Developed and integrated Python automation solutions with Cloud Functions and Cloud Composer to enhance functionality and streamline data workflows within cloud environments.
- Worked with SQL and PL/SQL procedures, functions, stored procedures, and packages within mappings.
- Managed and optimized relational databases, including Cloud SQL, PostgreSQL, and Oracle, and NoSQL databases such as MongoDB.
- Developed data processing applications in Scala using Apache Spark on Dataproc, enhancing distributed data processing tasks and achieving higher throughput.
- Designed and implemented data storage and retrieval solutions using Firestore and Bigtable for real-time data access and optimized handling of large-scale data.
- Developed robust ETL pipelines to automate data synchronization between Cloud Storage and Elasticsearch, enhancing real-time search functionality and improving data retrieval efficiency by 60%.
- Enhanced database performance by optimizing SQL queries, achieving a 40% reduction in query execution time and significantly improving overall system efficiency.
- Automated infrastructure provisioning and configuration management using Terraform in CI/CD workflows (Cloud Build + Cloud Deploy), ensuring consistent environments across development and production.
- Implemented Terraform scripts to manage and scale GCP resources dynamically, optimizing configurations for Cloud Storage and BigQuery, which led to a 30% improvement in resource utilization and reduced operational costs.
- Optimized Apache Flink job performance by tuning parallelism and resource allocation, achieving higher throughput and lower latency.
- Automated deployment pipelines for containerized applications using Google Kubernetes Engine (GKE), integrated with Jenkins and Cloud Build, ensuring consistent and rapid delivery of updates and reducing deployment times by 30%.
- Integrated real-time log data monitoring using Pub/Sub and Stackdriver (Cloud Logging/Monitoring) to provide immediate insights and enhance decision-making based on live data.
- Implemented and optimized data workflows using Hadoop ecosystem tools such as Spark, Hive, Pig, and Sqoop on Dataproc, resulting in a 25% reduction in data processing costs.
- Ingested, processed, and analyzed real-time streaming data using Pub/Sub and Dataflow, providing timely insights for data-driven decision-making.
- Designed and implemented robust data workflows using Cloud Composer (Apache Airflow), enabling automated scheduling and monitoring of ETL processes and delivering a 35% increase in operational efficiency.
- Leveraged BigQuery to perform ad-hoc analysis and gain insights from large datasets stored in Cloud Storage, improving data accessibility and supporting real-time decision-making.
- Utilized Scala to process and analyze real-time streaming data, leveraging frameworks such as Apache Flink for event-time processing and accurate data handling.
Environment/Technologies: Cloud Storage, BigQuery, Dataflow, Cloud Functions, Cloud Composer, Cloud SQL, Firestore, Bigtable, Pub/Sub, Dataproc, Cloud Build, Cloud Deploy, Vertex AI, Databricks, Apache Spark, Flink, Hive, Pig, Sqoop, NiFi, Elasticsearch, Stackdriver (Logging/Monitoring), GKE (Kubernetes), Docker, Jenkins, Terraform, Python, PySpark, Scala, SQL, PL/SQL, Oracle, MongoDB

GE Healthcare, Chicago, IL | May 2017 - August 2019
Cloud Data Engineer
Responsibilities:
- Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, managing database access and control during the migration using Azure Data Factory.
- Developed Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming data to uncover insights into customer behavior.
- Applied dimensional modeling and forecasting with large-scale datasets (star schema, snowflake schema), transactional modeling, and Slowly Changing Dimensions (SCD).
- Built a claims simulation app using R Shiny to estimate total loss amounts for claims settlement, leveraging multiple frequency and severity distributions.
- Launched R Shiny apps integrating machine learning algorithms via RStudio Connect with Azure and Docker for scalable deployment.
- Developed scripts for data transfer from FTP servers to the ingestion layer using Azure CLI commands.
- Automated Azure HDInsight cluster creation using PowerShell scripts.
- Utilized Azure Data Lake Storage Gen2 for storing Excel and Parquet files, retrieving user data via the Blob API.
- Worked with Azure Databricks, PySpark, Spark SQL, Azure SQL Data Warehouse, and Hive for data loading and transformation.
- Designed Azure cloud architecture and implementation plans for hosting complex application workloads on Microsoft Azure.
- Developed MapReduce programs for data extraction, transformation, and aggregation from various file formats including XML, JSON, CSV, and compressed formats.
- Automated processes for flattening JSON data from Cassandra and used Hive UDFs for data transformation (a flattening sketch follows this entry).
- Utilized Graph SQL to model and query complex, interconnected datasets, improving the efficiency of querying relationships and patterns within large healthcare datasets.
- Created Impala tables, SFTP transfers, and shell scripts for data import into Hadoop.
- Worked with the SnowSQL client and Snowpipe to check data quality issues, improve efficiency, and perform analysis.
- Developed a Snowpipe to ingest continuous data loads from other stages.
- Utilized JIRA for bug tracking and CVS for version control.
Environment/Technologies: Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure SQL Data Warehouse, Azure Data Factory, Spark, PySpark, Spark SQL, R Shiny, Azure, Docker, Azure HDInsight, Azure Data Lake Storage Gen2, Blob API, Cassandra, Impala, Hadoop
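A minimal PySpark sketch of the JSON-flattening step mentioned in the GE Healthcare entry; the landing path, nested field names, and target table are assumptions for illustration only:

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("json-flatten-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical nested claim documents exported from the source store.
raw = spark.read.json("/landing/claims/*.json")

# Explode the nested line-item array and promote struct fields to flat columns.
flat = (
    raw.withColumn("line_item", F.explode("claim.line_items"))
       .select(
           F.col("claim.claim_id").alias("claim_id"),
           F.col("claim.member_id").alias("member_id"),
           F.col("line_item.code").alias("procedure_code"),
           F.col("line_item.amount").alias("billed_amount"),
       )
)

# Load the flattened records into a Hive table for downstream transformation.
flat.write.mode("overwrite").saveAsTable("curated.claims_line_items")
```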
US Cellular, Chicago, IL | January 2016 - April 2017
Hadoop Developer
Responsibilities:
- Orchestrated comprehensive deployment of web applications on AWS, optimizing efficiency by leveraging S3 buckets.
- Designed and developed large-scale data ingestion pipelines using HDFS, Hive, and Spark, processing structured and unstructured data from multiple on-premise and cloud sources.
- Automated data movement workflows using Oozie and Shell/Python scripts, reducing manual effort and improving data availability across environments.
- Implemented Sqoop jobs to efficiently migrate data between RDBMS (MySQL, Oracle, PostgreSQL) and HDFS/Hive.
- Optimized Hive queries through partitioning, bucketing, and compression techniques, reducing query latency and improving cluster performance (a layout sketch follows this entry).
- Developed PySpark-based ETL frameworks to cleanse, transform, and aggregate data, improving overall pipeline reliability and maintainability.
- Integrated Spark Streaming for near real-time data processing and analytics to support operational dashboards.
- Monitored and fine-tuned Hadoop cluster performance using YARN, Ambari, and Cloudera Manager, ensuring efficient resource utilization.
- Created and managed Hive Metastore tables integrated with Tableau and Power BI, enabling seamless analytics and visualization.
- Developed custom MapReduce jobs for large-scale data processing and aggregation, improving throughput and scalability.
- Implemented data archival and retention strategies using S3 integration with Hadoop for cost-optimized storage management.
- Utilized Git and Jenkins for version control and automated deployment of ETL code across environments.
- Created parameterized Python scripts for batch automation and metadata-driven workflow orchestration.
- Participated in data migration projects, transitioning on-prem Hadoop workloads to AWS EMR and S3 while maintaining schema consistency and performance.
- Collaborated with BI and analytics teams to define and standardize data models, transformation logic, and reusable Spark components.
- Conducted unit and integration testing using PyTest and PyUnit, ensuring production-grade quality for ETL jobs.
- Wrote detailed technical documentation for data pipelines, schema design, and operational best practices to support knowledge transfer.
Environment/Technologies: Hadoop, HDFS, Hive, Pig, Spark, PySpark, MapReduce, Sqoop, Oozie, HBase, YARN, Ambari, Cloudera Manager, AWS EMR, S3, Jenkins, Git, Python, Shell Scripting, Tableau, PostgreSQL, MySQL, Oracle
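A minimal PySpark sketch of the partitioning, bucketing, and compression layout tuning described in the US Cellular entry; the table, columns, and staging path are assumed names:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-layout-sketch")
    .config("spark.sql.parquet.compression.codec", "snappy")  # compressed storage
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical staged events awaiting curation.
events = spark.read.parquet("hdfs:///staging/events/")

(
    events.write.mode("overwrite")
    .format("parquet")
    .partitionBy("event_date")     # prune partitions on date filters
    .bucketBy(32, "customer_id")   # co-locate rows on the common join key
    .sortBy("customer_id")
    .saveAsTable("analytics.events_curated")
)
```

Partition pruning handles date filters while bucketing co-locates rows on the join key, which is the usual rationale behind this kind of layout.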
SAP Inc, India | January 2014 - July 2015
Hadoop Developer
Responsibilities:
- Designed and optimized large-scale data pipelines using HDFS, Hive, Pig, and MapReduce, processing over 5 TB of data daily.
- Automated ETL workflows integrating on-premises data sources with AWS EMR and S3 for real-time reporting and analytics.
- Built scalable data lakes and performed complex data wrangling, transformation, and aggregation to support banking and risk analytics use cases.
- Developed and tuned Hive and Pig scripts, improving query performance and reducing data latency for BI teams.
- Integrated Hadoop-based data flows with Tableau dashboards for end-to-end KPI reporting and visualization.
- Developed end-to-end data pipelines using Hadoop, Spark, and Sqoop, enabling seamless migration of core financial data from RDBMS to HDFS.
- Automated data ingestion from MySQL and Oracle into Hive using Oozie workflows, reducing manual intervention and operational overhead.
- Defined and implemented partitioning, bucketing, and compression strategies in HDFS/Hive to optimize both storage and performance.
- Participated in migration assessments and developed proofs of concept (PoCs) for transitioning legacy Hadoop workloads to AWS Cloud.
- Developed custom Python utilities and shell scripts for workflow orchestration, data validation, and metadata management.
- Implemented data quality checks and schema validations to ensure accuracy and consistency across ingestion and curation layers.
Environment/Technologies: Python, Django, Flask, NumPy, REST, HTML, CSS, XML, JavaScript, Spark SQL, PySpark, Hadoop, Hive, Pig, MapReduce, Sqoop, Oozie, Jenkins, AWS EMR, S3, Linux, Shell scripting