
Naga Manasa Pemmasani - Data Engineer Position
[email protected]
Location: Charlotte, North Carolina, USA
Relocation: Yes, Anywhere
Visa: H1B
Resume file: NAGA MANASA PEMMASANI - Data Engineer- Azure_1772027652527.docx
Experienced Azure Data and Identity Engineer with over 12 years of proven success designing, implementing, and supporting enterprise-scale data platforms and identity solutions, with a track record in fraud analytics, financial crime detection, and risk management.
Strong hands-on expertise in Azure Data Factory (ADF) for building, orchestrating, and monitoring complex ETL/ELT pipelines.
Proven experience administering Azure Active Directory (Azure AD) for secure authentication, authorization, and access management.
Extensive experience configuring and managing Azure AD Connect (AADC) for directory synchronization between on-prem Active Directory and Azure AD.
Skilled in troubleshooting Azure AD Connect sync issues, including user, group, and attribute synchronization errors.
Hands-on experience planning and maintaining staging servers and disaster recovery strategies for Azure AD Connect.
Strong working knowledge of Active Directory Federation Services (ADFS) for enterprise authentication and federation scenarios.
Experienced in configuring, restoring, and maintaining ADFS services, including relying party trusts and federation metadata.
Proficient in managing certificate lifecycle, including updates, renewals, and rotations for ADFS and Azure SSO integrations.
Expertise in enterprise application integration and registration for Single Sign-On (SSO) in Azure.
Hands-on experience provisioning, de-provisioning, and rotating client secrets and certificates for Azure AD application registrations (a minimal authentication sketch follows this summary).
Strong background in Active Directory administration, including group policy (GPO) troubleshooting and user access management.
Experienced in Active Directory replication troubleshooting across multi-domain and multi-site environments.
Proficient in monitoring Azure AD Connect Health and responding proactively to alerts and performance issues.
Consultative experience supporting directory synchronization and SSO for Microsoft Office 365 (Microsoft 365) services.
Solid understanding of Azure security best practices, including RBAC, Key Vault, encryption, and identity governance.
Experienced in automating deployments and workflows using Azure DevOps and Azure Pipelines.
Strong ability to collaborate with cross-functional teams including infrastructure, security, application, and business stakeholders.
Adept at working in fast-paced enterprise environments, managing multiple priorities and critical production systems.
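As a concrete illustration of the client-secret lifecycle work noted above, the following is a minimal sketch of how a pipeline or service might authenticate as an Azure AD application registration using the azure-identity Python SDK. The environment variable names are illustrative, and in practice the secret would be injected from Azure Key Vault rather than stored in configuration.

# Minimal sketch: authenticating as an Azure AD app registration with a
# rotated client secret via the azure-identity SDK. All identifiers here
# are illustrative placeholders.
import os

from azure.identity import ClientSecretCredential

# Tenant/client IDs and the current secret are assumed to be injected at
# deploy time (e.g. from Azure Key Vault), not hard-coded.
credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)

# Acquire a bearer token for Microsoft Graph; an expired or rotated-out
# secret surfaces here as an authentication error.
token = credential.get_token("https://graph.microsoft.com/.default")
print(token.expires_on)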
TECHNICAL SKILLS:
Big Data & Distributed Computing:
Apache Spark, PySpark, Databricks, Hadoop (Cloudera, Hortonworks), Hive, HiveQL, MapReduce, Kafka, Batch & Streaming Data Processing, Delta Lake, Medallion Architecture (Bronze/Silver/Gold)
Cloud Platforms & Services:
Azure (Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Data Lake Storage Gen2, Azure Active Directory, Azure AD Connect, Azure AD Connect Health, Active Directory Federation Services, Azure Key Vault, Azure Functions, Azure Event Hub, Azure Monitor), AWS (S3, EC2, Lambda, IAM, Glue, Redshift, CloudWatch), GCP (BigQuery, Cloud Storage, Dataflow, Pub/Sub, Cloud Functions), IBM DataStage
Programming & Scripting Languages:
Python, SQL, PL/SQL, PySpark, Scala, JavaScript, HiveQL, PowerShell, Bash/Shell Scripting, XML
Machine Learning & Deep Learning:
TensorFlow, PyTorch, Scikit-Learn, MLflow, spaCy, NLTK, Hugging Face Transformers, OpenCV, Faster R-CNN, Fraud Detection, Financial Crime Analytics
APIs & Application Integration:
REST APIs, SOAP, OAuth 2.0, OpenID Connect, SAML, Flask, FastAPI, Microservices, Secure API Authentication & Authorization
Data Management & Processing:
Pandas, NumPy, Azure Data Lake Storage Gen2, BigQuery, Azure Synapse SQL, Cloud SQL, Cloud Spanner, MS SQL Server, Oracle, MySQL, MongoDB, Data Validation & Reconciliation
Data Warehousing & BI:
Azure Synapse Analytics, Power BI, Tableau, BigQuery, Looker, Dimensional Modeling (Star & Snowflake), SCD Type 1 & Type 2
DevOps & Automation:
Azure DevOps, Azure Pipelines (YAML), Git, GitHub, GitLab, Bitbucket, CI/CD for Data & Identity Workloads, Terraform (Basic), Docker, Kubernetes (AKS/GKE)
Testing & Quality Assurance:
PyTest, Unit Testing, Data Quality Frameworks, Statistical Testing, LIME, SHAP
Tools & Environments:
JIRA, Confluence, ServiceNow, Agile/Scrum, SDLC, Documentation & Runbooks, Eclipse, Visual Studio Code
Operating Systems:
Windows Server, Linux (Ubuntu, CentOS), UNIX

PROFESSIONAL EXPERIENCE:

BCBS, Richardson, TX | Feb 2024 -- Present
Azure Engineer
Designed and developed large-scale data ingestion, transformation, and feature pipelines using Azure Data Factory (ADF), Azure Databricks (PySpark), Azure Synapse Analytics, and Azure Data Lake Storage (ADLS) to support enterprise analytics, AI, and machine learning workloads across high-volume healthcare datasets.
Designed ingestion pipelines using Apache NiFi processors for data routing and transformation.
Configured NiFi flow files and custom processors for structured and semi-structured data.
Integrated NiFi with ADLS and Azure Databricks for downstream processing.
Implemented fault-tolerant NiFi workflows with retry and monitoring mechanisms.
Built reusable ingestion and transformation frameworks to prepare high-quality, AI-ready datasets, supporting both batch and near-real-time data required for model training and inference.
Developed feature engineering pipelines in Azure Databricks, leveraging PySpark to transform raw data into structured feature sets optimized for machine learning models.
Implemented Delta Lake architectures with medallion (Bronze/Silver/Gold) layers, enabling reliable feature versioning, schema evolution, and reproducibility for AI experiments (see the sketch following this section).
Integrated machine learning models into production pipelines using Azure Machine Learning, Databricks MLflow, and deployed inference endpoints through Azure Kubernetes Service (AKS) and Azure Container Apps.
Enabled near-real-time AI use cases by processing streaming data using Azure Event Hub, Databricks Structured Streaming, and Stream Analytics, supporting intelligent dashboards and event-driven predictions.
Designed and optimized analytical data structures in Synapse SQL to support model evaluation, feature validation, and downstream AI-driven insights.
Automated MLOps and CI/CD workflows using Azure DevOps, GitHub Actions, and Terraform, supporting version-controlled deployment of data pipelines, ML models, and infrastructure.
Implemented data governance, security, and responsible AI foundations using Azure Purview, RBAC, Private Endpoints, Key Vault, and encryption to ensure compliant and secure AI solutions.
Built serverless and event-driven components using Azure Functions to trigger model pipelines, capture metadata, and enforce automated data and feature quality checks.
Modernized legacy and on-prem data flows into cloud-native Azure architectures, enabling scalable AI experimentation and reducing operational friction.
Developed time-aware and historical feature tracking using slowly changing dimension patterns to support model accuracy and explainability over time.
Monitored AI data pipelines and model-related workflows using Azure Monitor, Log Analytics, and Application Insights to ensure reliability and performance of production AI systems.
Collaborated with data scientists, AI architects, and business stakeholders to translate AI requirements into scalable, production-grade Azure AI solutions.
Environment: Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, Azure Data Lake Storage Gen2 (ADLS Gen2), Delta Lake, Azure Functions, Azure DevOps, Terraform, Azure Event Hub, Azure Stream Analytics, PySpark, SQL, MLflow, Azure Key Vault, VNET, Git
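To make the medallion layering referenced above concrete, here is a minimal PySpark sketch of a Bronze-to-Silver-to-Gold refinement step on Delta Lake; the paths, table names, and columns are hypothetical placeholders, not the production healthcare schema.

# Minimal sketch of a medallion (Bronze/Silver/Gold) refinement step in
# PySpark on Databricks, assuming Delta tables; all names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw claims landed as-is from ingestion (hypothetical path).
bronze = spark.read.format("delta").load("/mnt/lake/bronze/claims")

# Silver: typed, de-duplicated records with basic quality filters.
silver = (
    bronze
    .withColumn("claim_amount", F.col("claim_amount").cast("decimal(18,2)"))
    .dropDuplicates(["claim_id"])
    .filter(F.col("claim_amount").isNotNull())
)
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/claims")

# Gold: aggregated feature set ready for model training.
gold = silver.groupBy("member_id").agg(
    F.count("claim_id").alias("claim_count"),
    F.sum("claim_amount").alias("total_claim_amount"),
)
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/member_features")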
Best Buy, Richfield, MN | June 2022 -- Jan 2024
Data Engineer
Mapped complex business challenges to ML solutions by engaging stakeholders, defining objectives, and translating them into actionable ML pipelines using Amazon SageMaker Studio.
Engineered advanced Amazon Redshift SQL queries with joins and aggregations to extract and consolidate data for analytics and model development.
Cleaned and transformed raw datasets using Python, Pandas, and AWS Glue to prepare high-quality features for model training.
Built ETL workflows using AWS Step Functions, Lambda, and Python to automate ingestion, validation, and transformation processes across AWS services.
Scaled distributed data processing pipelines using Amazon EMR and PySpark for efficient feature engineering and large-scale ML tasks.
Enabled real-time inference by integrating streaming pipelines with Amazon Kinesis Data Streams and Kinesis Data Analytics for continuous data ingestion and transformation.
Applied NLP techniques using spaCy and developed vision models with Faster R-CNN on SageMaker; also experimented with Hugging Face Transformers for LLM-based applications.
Designed and fine-tuned Generative AI models and LLMs using SageMaker JumpStart, ensuring alignment with business goals and ethical AI standards.
Trained and optimized deep learning models such as CNNs using SageMaker Training Jobs with PyTorch, incorporating advanced hyperparameter tuning techniques.
Validated models using cross-validation and integrated LIME for explainability, enabling transparent and auditable ML decisions.
Tracked experiments and managed models using SageMaker Model Registry and MLflow for version control, reproducibility, and smooth handoffs between dev and prod.
Developed and deployed RESTful APIs for real-time model serving using Amazon API Gateway, Lambda, and FastAPI (a minimal endpoint sketch follows this section).
Designed and executed automated API tests using Postman to validate RESTful endpoints exposed via API Gateway and FastAPI.
Followed Test-Driven Development (TDD) practices using PyTest and implemented mock services using WireMock for API simulation in CI pipelines.
Integrated SonarQube into GitHub Actions for static code analysis and continuous quality checks on Python-based microservices.
Containerized ML models, stored images in Amazon ECR, and deployed with AWS Fargate and SageMaker Endpoints for scalable, serverless inference.
Deployed and monitored ML workflows using Amazon S3, SageMaker, CloudWatch, and IAM, ensuring secure and observable machine learning operations.
Built CI/CD pipelines using AWS CodePipeline, CodeBuild, and CodeCommit, ensuring automated deployments, quality checks, and rollback capabilities.
Participated in Agile sprints, managed user stories and tickets in JIRA, and maintained thorough documentation in Confluence to support cross-functional collaboration.
Environment: Python, NumPy, Pandas, AWS Lambda, FastAPI, PyTorch, SageMaker, spaCy, Faster R-CNN, Hugging Face Transformers, LIME, AWS Glue, Redshift, Amazon S3, EMR, Kinesis, ECR, Fargate, CloudWatch, CodePipeline, CodeBuild, CodeCommit, IAM, Git, Agile, JIRA, Confluence
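The real-time model-serving bullet above can be illustrated with a minimal FastAPI sketch of the kind fronted by Amazon API Gateway; the request schema and the stub predict function are hypothetical stand-ins for the deployed SageMaker artifact.

# Minimal sketch of a real-time scoring endpoint with FastAPI; the model
# loader is stubbed so the sketch stays self-contained and runnable.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoringRequest(BaseModel):
    features: list[float]

# In the real service the model would be pulled from the SageMaker Model
# Registry / S3 at startup; a trivial stub stands in for it here.
def predict(features: list[float]) -> float:
    return sum(features) / max(len(features), 1)

@app.post("/score")
def score(req: ScoringRequest) -> dict:
    # Return the model score for one feature vector.
    return {"score": predict(req.features)}

Locally this runs under uvicorn (e.g. uvicorn app:app); behind API Gateway and Lambda the same ASGI app is typically wrapped with an adapter such as Mangum.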

Nationwide Mutual Insurance, Scottsdale, AZ | Mar 2021 -- Apr 2022
Data Engineer
Built automated workflows using AWS Lambda, SQS, and DynamoDB alongside GCP Cloud Functions, Pub/Sub, and Firestore to sync files nightly from on-prem storage to Amazon S3 and Google Cloud Storage (GCS), enhancing operational efficiency.
Migrated large-scale Oracle databases to Google BigQuery and Amazon Redshift, enabling scalable analytics and integrating results with Power BI dashboards.
Provisioned and configured AWS infrastructure components such as VPC, EC2, S3, IAM, EMR, RDS, and CloudFormation and GCP components such as VPC, Compute Engine, IAM, Dataproc, BigQuery, and Cloud Storage, supporting data pipeline deployments and analytics workloads.
Leveraged Spark SQL and Scala to process Parquet and Avro data formats and load datasets into Hive on AWS EMR and GCP Dataproc for downstream consumption.
Deployed multi-environment data pipelines for ingesting and transforming data across Oracle, Postgres, and Informix systems using ETL/ELT best practices on AWS Glue and GCP Dataflow.
Designed and managed an Enterprise Data Lake on Amazon S3 to accommodate rapidly changing structured and unstructured data for analytics and archival use cases.
Developed scalable APIs using Python on AWS Lambda and GCP Cloud Run/Cloud Functions to automate server-side tasks and streamline cloud operations.
Orchestrated data streaming pipelines using Kafka, Cassandra, and GCP Pub/Sub, enabling real-time ingestion and processing across distributed systems.
Engineered data integration workflows with Hadoop, HDFS, Spark, and Hive on AWS EMR and GCP Dataproc for handling high-volume batch workloads.
Created robust Python scripts to extract and parse XML datasets and automate ingestion into structured databases on AWS and GCP.
Extracted and transformed financial data from Oracle into Redshift and BigQuery, preparing it for analytical use in EMR and Dataproc.
Applied dimensional modeling techniques to build Star and Snowflake schemas, defining fact and dimension tables for visualization in Tableau and Looker.
Built real-time data streaming services using Kafka and Spark Streaming, integrating data via REST APIs from external systems into AWS and GCP environments (see the streaming sketch after this section).
Delivered scalable data processing functions using Python APIs on AWS Lambda and GCP Cloud Functions, simplifying debugging within the processing layers.
Enhanced system reliability by establishing CI/CD pipelines with Git, Jenkins, and custom Python utilities on AWS CodePipeline and GCP Cloud Build.
Implemented unit testing frameworks using PyTest for PySpark applications on AWS EMR and GCP Dataproc, reinforcing code quality and pipeline stability.
Tuned performance of analytical systems by optimizing data structures and managing lifecycle policies across cloud and on-prem resources on AWS and GCP.
Environment: AWS (Lambda, S3, EC2, Kinesis, RDS, CloudFormation), BigQuery, Redshift, Hive, Spark, Spark Streaming, Scala, Python, PySpark, Kafka, Cassandra, EMR, HDFS, PostgreSQL, Oracle, Tableau, Jenkins, Git, JIRA, XML, REST APIs, Airflow, Impala, Ab Initio, DataStage, Linux.
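As a sketch of the Kafka-based streaming ingestion described above (written against the Structured Streaming API rather than classic DStreams), the job below subscribes to a topic and lands micro-batches in cloud storage; broker address, topic, and paths are illustrative.

# Minimal sketch of a Kafka-to-Spark streaming ingestion job; all
# endpoints and paths are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Subscribe to the raw events topic (hypothetical name).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "raw-events")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string before
# downstream parsing and enrichment.
parsed = events.select(
    F.col("key").cast("string"),
    F.col("value").cast("string").alias("payload"),
    "timestamp",
)

# Write micro-batches to cloud storage with checkpointing for recovery.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3a://data-lake/raw-events/")
    .option("checkpointLocation", "s3a://data-lake/checkpoints/raw-events/")
    .start()
)
query.awaitTermination()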

IBM, Bangalore, India | Apr 2017 -- Dec 2020
Data Engineer
Imported and exported structured data between Oracle, PostgreSQL, and HDFS using Sqoop to support analytical workflows.
Re-engineered legacy MapReduce programs into Spark-based pipelines using Python, reducing execution time and improving maintainability.
Transferred datasets from on-prem Hive-based Data Lake to Azure Data Lake Storage (ADLS), enabling cloud-native access.
Conducted rigorous data reconciliation between Hive and ADLS to ensure consistency and completeness post-migration.
Utilized Spark DataFrame API on Cloudera to perform complex transformations and aggregations on Hive datasets.
Built high-performance batch jobs in Apache Spark, achieving a 10x improvement in data processing efficiency.
Designed real-time ingestion pipelines using Apache Kafka, creating and managing multiple Kafka topics for parallel processing.
Extracted and streamed messages from Kafka into Spark for downstream analytics and data enrichment.
Migrated processed data from ADLS into Snowflake, supporting downstream reporting and business intelligence needs.
Authored advanced HiveQL queries to deliver insights from partitioned and bucketed datasets, aligned with stakeholder requirements.
Transitioned critical workloads from on-prem infrastructure to Azure, streamlining scalability and storage optimization.
Developed Pig Latin scripts to parse web server log files and ingest structured data into HDFS.
Applied advanced Hive techniques such as partitioning, bucketing, and map-side joins to improve query performance (a minimal layout sketch follows this section).
Created custom UDFs and UDAFs in Spark SQL and Hive to accommodate complex business logic.
Refactored existing Hive/SQL logic into Spark RDDs and transformations using Scala, modernizing data flows.
Authored Databricks notebooks with Spark SQL and Scala to deliver ETL processes and produce analytics dashboards.
Tuned Spark jobs through effective management of memory, cores, and executor parameters, improving cluster resource utilization.
Implemented ZooKeeper-based locking mechanisms to coordinate concurrent access to shared Hive tables within multi-user environments.
Environment: Linux, Cloudera, Apache Hadoop, HDFS, YARN, Hive, Spark, Scala, Pig, Sqoop, Kafka, Snowflake, Azure Data Lake Storage, Databricks, Spark SQL, Zookeeper, Spark Streaming
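The partitioning and bucketing work above follows a table-layout pattern like the following minimal PySpark sketch; the database, table, and column names are illustrative.

# Minimal sketch of writing a partitioned, bucketed Hive table from Spark,
# the layout behind partition pruning and bucketed map-side joins.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("staging.transactions")

# Partition by date and bucket by customer so partition pruning and
# bucketed joins can kick in at query time.
(
    df.write
    .partitionBy("txn_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("analytics.transactions")
)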

ADP, Hyderabad, India | Sep 2014 -- Mar 2017
Data Engineer
Installed and configured Apache Hadoop on Azure HDInsight, building multiple MapReduce jobs in Java for large-scale data cleaning and preprocessing.
Developed advanced Hive queries on HDInsight integrated with Azure Data Lake Storage (ADLS Gen2) for structured data extraction and transformation.
Employed Azure Data Factory (ADF) with custom activities and linked services to ingest data from SQL Server, Oracle, and MySQL into ADLS, followed by transformations using Hive and MapReduce.
Designed and maintained Pig UDFs on HDInsight to process unstructured and semi-structured datasets for exploratory analysis.
Wrote unit and integration test cases for Apache Spark and Apache Ignite jobs using Scala, ensuring reliability and logic correctness on Azure Synapse and HDInsight clusters.
Gained hands-on exposure to Snowflake on Azure, supporting schema design and optimizing queries during migration from traditional Hadoop pipelines.
Built real-time Spark Structured Streaming pipelines on Azure Databricks to ingest and process event data into Delta Lake and Hive Metastore (see the checkpointing sketch after this section).
Delivered high-throughput batch processing solutions using Azure Databricks (Apache Spark), enhancing ETL pipeline stability and efficiency.
Developed resilient Spark Streaming applications for ingesting data from Azure Event Hubs and Azure IoT Hub, storing output in Data Lake Storage.
Enabled real-time reporting by integrating Apache Cassandra on HDInsight with Spark, including automated failure recovery and checkpointing.
Designed custom restartable ETL flows using ADF pipelines and parameterized triggers to support dynamic data loads and on-demand reruns.
Created ad-hoc Python and Spark jobs for business-critical reporting, running them on Azure Synapse with data sourced from ADLS.
Used Sqoop on HDInsight to transfer large datasets from on-prem relational databases into Cassandra and ADLS Gen2.
Automated end-to-end ETL processes with Bash scripting, Azure CLI, and ADF pipeline triggers, enabling nightly and event-driven processing.
Modeled HiveQL tables and created views over semi-structured datasets stored in Azure Data Lake, enabling downstream analytics.
Built and exposed RESTful APIs using Azure Functions over HBase on HDInsight, allowing real-time, scalable access to key datasets.
Monitored and debugged cluster jobs using Azure Monitor, Log Analytics, and native HDInsight logs, resolving performance and job failures proactively.
Processed large-scale raw text data using Hadoop Streaming jobs and shell scripts, enabling downstream machine learning and analytics tasks.
Used Azure DevOps and Jenkins for build automation, addressing code review feedback, managing releases, and resolving production defects in MapReduce and Spark jobs.
Environment: Azure HDInsight, Azure Data Lake Storage Gen2, Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Apache Spark, Spark Streaming, Hive, Pig, Sqoop, HBase, Cassandra, Azure Functions, Azure Monitor, Azure DevOps, Jenkins, Python, Scala, Java, HiveQL, Bash
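The checkpointed, restartable streaming pattern referenced above looks roughly like this minimal Structured Streaming sketch writing to Delta Lake on Databricks; the source path, schema, and output locations are illustrative.

# Minimal sketch of checkpointed Structured Streaming into Delta Lake;
# the checkpoint directory is what makes the job restartable.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Incrementally pick up new JSON event files as they land.
events = (
    spark.readStream
    .schema(schema)
    .json("/mnt/landing/events/")
)

# On failure or redeploy, the stream resumes from the last committed
# offset recorded in the checkpoint location.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events/")
    .outputMode("append")
    .start("/mnt/delta/events/")
)
query.awaitTermination()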