| Manasa - Senior AWS Data Engineer |
| [email protected] |
| Location: Remote, USA |
| Relocation: Yes |
| Visa: H1B |
| Resume file: NAGA MANASA PEMMASANI - Data Engineer- AWS-_1772028299007.docx |
|
PROFESSIONAL SUMMARY:
Experienced Data Engineer with over 12 years of proven success in fraud analytics, financial crime detection, and risk management.
Skilled in leveraging AWS and GCP platforms to build and manage production-grade data applications with an emphasis on transaction monitoring.
Domain expertise includes Fintech, digital transactions, and data governance.
Adept at transforming complex business problems into scalable, data-driven solutions using PySpark, Databricks, and cloud services.
Strategic in building end-to-end data pipelines using PySpark, Google Cloud Dataflow, and BigQuery, enabling seamless ingestion, transformation, and analytics.
Proficient at implementing CI/CD workflows with Cloud Build integration for faster and more reliable deployments.
Skilled in crafting optimized data models and schema designs within BigQuery, managing multi-region datasets for agile data access.
Detail-oriented in tuning Spark jobs and SQL queries, improving data retrieval and system performance on large-scale datasets.
Dynamic in leading cloud migration efforts to GCP, utilizing Cloud CDN and Cloud Load Balancing to boost delivery and availability.
Efficient in automating ETL workflows using Cloud Composer (Airflow) and Cloud Functions, significantly cutting down manual effort and processing time.
Thorough in validating, cleansing, and transforming raw data to maintain high integrity and trustworthiness throughout the pipeline.
Resourceful in managing batch and real-time processing using BigQuery, Pub/Sub, Dataflow, and Cloud Dataproc.
Well-versed in handling structured and unstructured data sources for analytics and reporting.
Innovative in creating data lake architectures using Cloud Storage and BigQuery aligned with business goals and future scaling.
Experienced in implementing Slowly Changing Dimension (SCD) handling, improving historical tracking and ensuring dimensional accuracy in reporting systems.
Collaborative in working with analysts and product teams to translate business questions into robust data models and analytics solutions.
Disciplined in applying Agile methodologies using JIRA for sprint tracking, issue resolution, and cross-team transparency.
Adept at leveraging Cloud SQL and Cloud Spanner for importing/exporting data between cloud-native and hybrid environments.
Methodical in applying advanced BigQuery techniques such as clustering and partitioning for improved query performance.
Grounded in delivering insights and aligning data architecture with enterprise strategies for automation and efficiency.
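For illustration of the BigQuery partitioning and clustering approach mentioned above, a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and field names are hypothetical placeholders, not details from the resume.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical fully qualified table id and schema.
table = bigquery.Table(
    "my-project.analytics.transactions",
    schema=[
        bigquery.SchemaField("txn_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
        bigquery.SchemaField("txn_date", "DATE"),
    ],
)

# Partition on the txn_date column and cluster on customer_id so queries
# that filter on those fields scan fewer bytes and return faster.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="txn_date",
)
table.clustering_fields = ["customer_id"]

client.create_table(table)
```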
TECHNICAL SKILLS:
Big Data & Distributed Computing: Apache Spark, PySpark, MapReduce, Hive, Dataflow, Cloud Dataproc, Pub/Sub, Cloud Composer (Airflow), Hadoop (Cloudera, Hortonworks), Databricks
Cloud Platforms & Services: GCP (BigQuery, Cloud Storage, Vertex AI, Cloud Run, Cloud Functions, Cloud Build, Pub/Sub, Dataflow, Cloud SQL), AWS (S3, EC2, SageMaker, Lambda, IAM, CloudWatch, Athena, Redshift, Kinesis, Glue), Azure (Data Factory, Databricks, Synapse, ML SDK)
Programming & Scripting Languages: Python, SQL, PL/SQL, Scala, JavaScript, HiveQL, HTML, CSS, XML, JSP
Machine Learning & Deep Learning: TensorFlow, PyTorch, Scikit-Learn, Vertex AI, Hugging Face Transformers, MLflow, spaCy, NLTK, OpenCV, Faster R-CNN, Fraud Detection, Financial Crime Analytics
APIs & Model Deployment: Flask, FastAPI, Django ORM, REST APIs, SOAP, Microservices, Cloud Run, Cloud Functions
Data Management & Processing: Pandas, NumPy, BigQuery, Cloud SQL, Cloud Spanner, MS SQL Server, Oracle, MySQL, MongoDB, MS Excel, MS Access
Data Warehousing & BI: BigQuery, Looker, Power BI, Tableau, Data Studio
DevOps & Automation: Git, GitHub, GitLab, Bitbucket, Cloud Build, Docker, Kubernetes (GKE), Jenkins, Ant, Maven
Testing & Quality Assurance: PyTest, UnitTest, Statistical Testing, LIME, SHAP
Tools & Environments: Eclipse, Visual Studio, JIRA, Confluence, Agile, Functional Programming, ServiceNow
Operating Systems: Windows, UNIX, Linux, Ubuntu, CentOS

PROFESSIONAL EXPERIENCE:

BCBS, Richardson, TX | Feb 2024 -- Present
Data Engineer
Built end-to-end data ingestion and transformation pipelines using AWS Glue, PySpark, Amazon EMR, and GCP Dataflow/Dataproc, accelerating insights across high-volume datasets.
Built reusable ETL frameworks integrating Amazon Kinesis, AWS Glue, Amazon Redshift, and GCP Pub/Sub/BigQuery to support both batch and real-time processing.
Designed and optimized schema models, materialized views, and partitioning strategies in Redshift and BigQuery, improving data accessibility and query performance.
Implemented Infrastructure as Code using Terraform for provisioning AWS services including S3, Lambda, Redshift, and IAM roles.
Managed CI/CD pipelines using GitLab CI for automated deployments across dev, QA, and prod environments.
Developed Python-based automation scripts for system provisioning, deployment, scaling, and upgrades.
Integrated AWS CloudFormation and Terraform modules to standardize infrastructure across environments.
Translated business requirements into scalable data pipelines using Python, Scala, SQL, and cloud-native services in AWS and GCP, ensuring robust and maintainable data solutions.
Applied Redshift ML, SageMaker, and GCP Vertex AI for feature engineering and transformation logic to support end-to-end machine learning workflows.
Developed serverless microservices for real-time model inference using AWS Lambda, API Gateway, FastAPI, and GCP Cloud Run, enabling scalable AI integration.
Optimized Redshift, SparkSQL, and BigQuery queries to reduce processing overhead and improve performance for latency-sensitive workloads.
Integrated model explainability tools such as SHAP and LIME within SageMaker and GCP AI Platform to generate interpretable insights from predictive models.
Migrated legacy batch workflows to Amazon EMR, AWS Glue, and GCP Dataproc, reducing runtime and enabling efficient, autoscaling compute environments.
Automated pipeline orchestration using Amazon MWAA (Managed Airflow), AWS CodePipeline, and GCP Cloud Composer, standardizing CI/CD for data and ML workflows.
Partnered with analysts and scientists to design training pipelines in SageMaker and Vertex AI, define model features, and monitor production model performance.
Implemented data quality layers using AWS Glue, Redshift, GCP Dataform, and custom Lambda/Cloud Functions, ensuring data accuracy and completeness.
Leveraged Amazon S3, Redshift Spectrum, GCS, and BigQuery for distributed analytics and large-scale data transformations.
Created S3/GCS staging layers and developed Redshift/BigQuery-compatible scripts to support incremental loads, including Slowly Changing Dimensions (SCD).
Trained and deployed models using SageMaker, TensorFlow, and Vertex AI, exposing scalable inference endpoints via SageMaker Endpoints, GCP AI Platform, and Lambda.
Participated in Agile delivery using JIRA and Confluence, contributing to backlog grooming, sprint planning, and collaborative problem solving.
Maintained version control using AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, and GCP Cloud Source Repositories for automated deployments of both data and ML assets.
Environment: AWS Glue, Amazon EMR, Amazon Redshift, Amazon S3, Amazon Kinesis, AWS Lambda, SageMaker, Redshift ML, API Gateway, TensorFlow, PySpark, Python, Scala, SQL, SHAP, LIME, FastAPI, CodeCommit, CodeBuild, CodePipeline, MWAA (Airflow), JIRA, Confluence.

Best Buy, Richfield, MN | June 2022 -- Jan 2024
Data Engineer
Mapped complex business challenges to ML solutions by engaging stakeholders, defining objectives, and translating them into actionable ML pipelines using Amazon SageMaker Studio.
Engineered advanced Amazon Redshift SQL queries with joins and aggregations to extract and consolidate data for analytics and model development.
Cleaned and transformed raw datasets using Python, Pandas, and AWS Glue to prepare high-quality features for model training.
Built ETL workflows using AWS Step Functions, Lambda, and Python to automate ingestion, validation, and transformation processes across AWS services.
Scaled distributed data processing pipelines using Amazon EMR and PySpark for efficient feature engineering and large-scale ML tasks.
Fine-tuned transformer-based models using SageMaker JumpStart and Hugging Face for Generative AI use cases.
Built prompt-driven experimentation pipelines for business use cases leveraging LLM APIs and structured product metadata.
Implemented experiment tracking and versioning for LLM workflows using MLflow and SageMaker Model Registry.
Optimized token usage and inference cost through batching and model configuration tuning.
Enabled real-time inference by integrating streaming pipelines with Amazon Kinesis Data Streams and Kinesis Data Analytics for continuous data ingestion and transformation.
Applied NLP techniques using spaCy and developed vision models with Faster R-CNN on SageMaker; also experimented with Hugging Face Transformers for LLM-based applications.
Designed and fine-tuned Generative AI models and LLMs using SageMaker JumpStart, ensuring alignment with business goals and ethical AI standards.
Trained and optimized deep learning models such as CNNs using SageMaker Training Jobs with PyTorch, incorporating advanced hyperparameter tuning techniques.
Validated models using cross-validation and integrated LIME for explainability, enabling transparent and auditable ML decisions.
Tracked experiments and managed models using SageMaker Model Registry and MLflow for version control, reproducibility, and smooth handoffs between dev and prod.
Developed and deployed RESTful APIs for real-time model serving using Amazon API Gateway, Lambda, and FastAPI.
Designed and executed automated API tests using Postman to validate RESTful endpoints exposed via API Gateway and FastAPI.
Followed Test-Driven Development (TDD) practices using PyTest and implemented mock services using WireMock for API simulation in CI pipelines.
Integrated SonarQube into GitHub Actions for static code analysis and continuous quality checks on Python-based microservices.
Containerized ML models using Amazon ECR and deployed with AWS Fargate and SageMaker Endpoints for scalable, serverless inference.
Deployed and monitored ML workflows using Amazon S3, SageMaker, CloudWatch, and IAM, ensuring secure and observable machine learning operations.
Built CI/CD pipelines using AWS CodePipeline, CodeBuild, and CodeCommit, ensuring automated deployments, quality checks, and rollback capabilities.
Participated in Agile sprints, managed user stories and tickets in JIRA, and maintained thorough documentation in Confluence to support cross-functional collaboration.
Environment: Python, NumPy, Pandas, AWS Lambda, FastAPI, PyTorch, SageMaker, spaCy, Faster R-CNN, Hugging Face Transformers, LIME, AWS Glue, Redshift, Amazon S3, EMR, Kinesis, ECR, Fargate, CloudWatch, CodePipeline, CodeBuild, CodeCommit, IAM, Git, Agile, JIRA, Confluence

Nationwide Mutual Insurance, Scottsdale, AZ | Mar 2021 -- Apr 2022
Data Engineer
Built automated workflows using AWS Lambda, SQS, and DynamoDB, as well as GCP Cloud Functions, Pub/Sub, and Firestore, to sync files nightly from on-prem storage to Amazon S3 and Google Cloud Storage (GCS), enhancing operational efficiency.
Migrated large-scale Oracle databases to Google BigQuery and Amazon Redshift, enabling scalable analytics and integrating results with Power BI dashboards.
Provisioned and configured AWS infrastructure components such as VPC, EC2, S3, IAM, EMR, RDS, and CloudFormation, and GCP components such as VPC, Compute Engine, IAM, Dataproc, BigQuery, and Cloud Storage, supporting data pipeline deployments and analytics workloads.
Leveraged Spark SQL and Scala to process Parquet and Avro data formats and load datasets into Hive on AWS EMR and GCP Dataproc for downstream consumption.
Deployed multi-environment data pipelines for ingesting and transforming data across Oracle, Postgres, and Informix systems using ETL/ELT best practices on AWS Glue and GCP Dataflow.
Designed and managed an Enterprise Data Lake on Amazon S3 to accommodate rapidly changing structured and unstructured data for analytics and archival use cases.
Developed scalable APIs using AWS Lambda and Python and GCP Cloud Run/Cloud Functions to automate server-side tasks and streamline cloud operations.
Orchestrated data streaming pipelines using Kafka, Cassandra, and GCP Pub/Sub, enabling real-time ingestion and processing across distributed systems.
Engineered data integration workflows with Hadoop, HDFS, Spark, and Hive on AWS EMR and GCP Dataproc for handling high-volume batch workloads.
Created robust Python scripts to extract and parse XML datasets and automate ingestion into structured databases on AWS and GCP.
Extracted and transformed financial data from Oracle into Redshift and BigQuery, preparing it for analytical use in EMR and Dataproc.
Applied dimensional modeling techniques to build Star and Snowflake schemas, defining fact and dimension tables for visualization in Tableau and Looker.
Built real-time data streaming services using Kafka and Spark Streaming, integrating data via REST APIs from external systems into AWS and GCP environments.
Delivered scalable data processing functions using Python APIs on AWS Lambda and GCP Cloud Functions, improving array-level debugging within processing layers.
Enhanced system reliability by establishing CI/CD pipelines with Git, Jenkins, and custom Python utilities on AWS CodePipeline and GCP Cloud Build.
Implemented unit testing frameworks using PyTest for PySpark applications on AWS EMR and GCP Dataproc, reinforcing code quality and pipeline stability.
Tuned performance of analytical systems by optimizing data structures and managing lifecycle policies across cloud and on-prem resources on AWS and GCP.
Environment: AWS (Lambda, S3, EC2, Kinesis, RDS, CloudFormation), BigQuery, Redshift, Hive, Spark, Spark Streaming, Scala, Python, PySpark, Kafka, Cassandra, EMR, HDFS, PostgreSQL, Oracle, Tableau, Jenkins, Git, JIRA, XML, REST APIs, Airflow, Impala, Ab Initio, DataStage, Linux.

IBM, Bangalore, India | Apr 2017 -- Dec 2020
Data Engineer
Imported and exported structured data between Oracle, PostgreSQL, and HDFS using Sqoop to support analytical workflows.
Re-engineered legacy MapReduce programs into Spark-based pipelines using Python, reducing execution time and improving maintainability.
Transferred datasets from an on-prem Hive-based Data Lake to Azure Data Lake Storage (ADLS), enabling cloud-native access.
Conducted rigorous data reconciliation between Hive and ADLS to ensure consistency and completeness post-migration.
Utilized the Spark DataFrame API on Cloudera to perform complex transformations and aggregations on Hive datasets.
Built high-performance batch jobs in Apache Spark, achieving a 10x improvement in data processing efficiency.
Designed real-time ingestion pipelines using Apache Kafka, creating and managing multiple Kafka topics for parallel processing.
Extracted and streamed messages from Kafka into Spark for downstream analytics and data enrichment.
Migrated processed data from ADLS into Snowflake, supporting downstream reporting and business intelligence needs.
Authored advanced HiveQL queries to deliver insights from partitioned and bucketed datasets, aligned with stakeholder requirements.
Transitioned critical workloads from on-prem infrastructure to Azure, streamlining scalability and storage optimization.
Developed Pig Latin scripts to parse web server log files and ingest structured data into HDFS.
Applied advanced Hive techniques such as partitioning, bucketing, and map-side joins to improve query performance.
Created custom UDFs and UDAFs in Spark SQL and Hive to accommodate complex business logic.
Refactored existing Hive/SQL logic into Spark RDDs and transformations using Scala, modernizing data flows.
Authored Databricks notebooks with Spark SQL and Scala to deliver ETL processes and produce analytics dashboards.
Tuned Spark jobs through effective management of memory, cores, and executor parameters, improving cluster resource utilization.
Implemented ZooKeeper-based locking mechanisms to coordinate concurrent access to shared Hive tables within multi-user environments.
Environment: Linux, Cloudera, Apache Hadoop, HDFS, YARN, Hive, Spark, Scala, Pig, Sqoop, Kafka, Snowflake, Azure Data Lake Storage, Databricks, Spark SQL, ZooKeeper, Spark Streaming

ADP, Hyderabad, India | Sep 2014 -- Mar 2017
Data Engineer
Installed and configured Apache Hadoop on Azure HDInsight, building multiple MapReduce jobs in Java for large-scale data cleaning and preprocessing.
Developed advanced Hive queries on HDInsight integrated with Azure Data Lake Storage (ADLS Gen2) for structured data extraction and transformation.
Employed Azure Data Factory (ADF) with custom activities and linked services to ingest data from SQL Server, Oracle, and MySQL into ADLS, followed by transformations using Hive and MapReduce.
Designed and maintained Pig UDFs on HDInsight to process unstructured and semi-structured datasets for exploratory analysis.
Wrote unit and integration test cases for Apache Spark and Apache Ignite jobs using Scala, ensuring reliability and logic correctness on Azure Synapse and HDInsight clusters.
Gained hands-on exposure to Snowflake on Azure, supporting schema design and optimizing queries during migration from traditional Hadoop pipelines.
Built real-time Spark Structured Streaming pipelines on Azure Databricks to ingest and process event data into Delta Lake and Hive Metastore.
Delivered high-throughput batch processing solutions using Azure Databricks (Apache Spark), enhancing ETL pipeline stability and efficiency.
Developed resilient Spark Streaming applications for ingesting data from Azure Event Hubs and Azure IoT Hub, storing output in Data Lake Storage.
Enabled real-time reporting by integrating Apache Cassandra on HDInsight with Spark, including automated failure recovery and checkpointing.
Designed custom restartable ETL flows using ADF pipelines and parameterized triggers to support dynamic data loads and on-demand reruns.
Created ad-hoc Python and Spark jobs for business-critical reporting, running them on Azure Synapse with data sourced from ADLS.
Used Sqoop on HDInsight to transfer large datasets from on-prem relational databases into Cassandra and ADLS Gen2.
Automated end-to-end ETL processes with Bash scripting, Azure CLI, and ADF pipeline triggers, enabling nightly and event-driven processing.
Modeled HiveQL tables and created views over semi-structured datasets stored in Azure Data Lake, enabling downstream analytics.
Built and exposed RESTful APIs using Azure Functions over HBase on HDInsight, allowing real-time, scalable access to key datasets.
Monitored and debugged cluster jobs using Azure Monitor, Log Analytics, and native HDInsight logs, resolving performance and job failures proactively.
Processed large-scale raw text data using Hadoop Streaming jobs and shell scripts, enabling downstream machine learning and analytics tasks.
Used Azure DevOps and Jenkins for build automation, addressing code review feedback, managing releases, and resolving production defects in MapReduce and Spark jobs.
Environment: Azure HDInsight, Azure Data Lake Storage Gen2, Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Apache Spark, Spark Streaming, Hive, Pig, Sqoop, HBase, Cassandra, Azure Functions, Azure Monitor, Azure DevOps, Jenkins, Python, Scala, Java, HiveQL, Bash
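For illustration only, a minimal sketch of the streaming-source-to-Delta-Lake pattern described in the Databricks work above, assuming a Kafka-compatible source and that the Kafka and Delta Lake connectors are available on the cluster; the broker address, topic, schema, and paths are hypothetical placeholders, not details from the resume.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("events-to-delta").getOrCreate()

# Hypothetical event payload schema.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Read raw events from a Kafka topic and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Append parsed events to a Delta table, with checkpointing for recovery.
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .outputMode("append")
    .start("/mnt/delta/events")
)
query.awaitTermination()
```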