Mahesh Gopina
Data Engineer

[email protected] | +1 314-220-9035 | LinkedIn
Carver, Minnesota, USA | Open to relocation | Visa: H1B



Professional Summary:

Senior Multi-Cloud Data Engineer with 10+ years of experience designing and delivering enterprise-scale data platforms across Healthcare, Financial Services, Retail, and Telecom domains, with deep expertise in GCP and strong hands-on delivery in AWS and Azure.

Architect and implement scalable batch and real-time streaming pipelines using BigQuery, Dataflow (Apache Beam), Dataproc (Spark), Cloud Composer (Airflow), Pub/Sub, Dataform, and Google Cloud Storage to produce governed, analytics-ready datasets.

Proven expertise in building enterprise-grade distributed data pipelines and orchestrating ingestion frameworks for both real-time and batch workloads, ensuring high availability, fault tolerance, and SLA compliance.

Lead large-scale data migration and modernization initiatives from SQL Server, Oracle, SAP, EHR, and legacy systems to BigQuery, Snowflake, and Databricks (Delta Lake), owning discovery, source-to-target mapping, CDC strategy, reconciliation, cutover planning, rollback frameworks, and post-production stabilization.

Design and deploy cloud-native ETL/ELT solutions using BigQuery ELT modeling, AWS Glue Spark jobs, Step Functions, Lambda orchestration, Azure Data Factory metadata-driven pipelines, and Synapse-based transformations.

Strong expertise in performance engineering and cost optimization including partitioning, clustering, slot management, shuffle tuning, Redshift Spectrum optimization, workload right-sizing, and lifecycle management strategies.

Hands-on experience building Palantir Foundry applications with React, TypeScript, and Ontology-driven UI design for enterprise workflows.

Experience managing secure, governed enterprise data lakes using IAM/RBAC, KMS/CMEK encryption, AWS Lake Formation policies, private networking, audit-ready logging, lineage tracking, and compliance controls aligned with HIPAA and HL7/FHIR standards.

Advanced proficiency in Python, PySpark, Spark SQL, Pandas, NumPy, and scikit-learn for large-scale data transformation, statistical modeling, feature engineering, and advanced analytics workloads.

Hands-on experience conducting experimentation and LLM validation within Vertex AI Workbench and Jupyter environments, supporting GenAI pipelines, LLM fine-tuning evaluation, and MLOps integration into production workflows.

Strong background in orchestration frameworks including Apache Airflow, Cloud Composer, Apache Beam, Luigi, and enterprise scheduling systems to automate ETL processes and ensure system reliability.

Extensive experience in Kafka, Kafka Streams, Kafka Connect, Apache Sqoop, HDFS, Hive, and distributed storage architectures for scalable batch and streaming data processing.

Proficient in containerization and cloud-native deployment using Docker, Kubernetes, AWS Fargate, and infrastructure-as-code tools such as Terraform and CloudFormation.

Strong DevOps and platform engineering mindset with GitOps workflows and CI/CD automation using GitHub Actions, Azure DevOps, Jenkins, Bamboo, CodePipeline, and CodeBuild for environment-specific releases and automated deployments.

Solid expertise in relational and NoSQL databases including MySQL, MongoDB, Oracle, SQL Server, CosmosDB, and enterprise data warehouse platforms.

Extensive experience working in Agile/Scrum and SAFe environments, actively participating in sprint planning, backlog grooming, story point estimation, daily stand-ups, sprint reviews, and retrospectives to ensure incremental and iterative delivery of data solutions.

Collaborate closely with Product Owners, Architects, QA teams, and DevOps engineers to translate business requirements into technical user stories, ensuring traceability, acceptance criteria alignment, and on-time sprint deliverables.

Drive continuous improvement initiatives within Agile teams by promoting automation, reusable pipeline frameworks, code reviews, peer programming, and engineering best practices to enhance delivery velocity and quality.

Proven ability to deliver high-impact, reusable data architecture patterns across multi-cloud ecosystems, enabling scalable analytics, compliance reporting, and enterprise data product development under tight timelines.

Excellent communication and stakeholder management skills with the ability to translate complex technical architectures into business-aligned solutions while leading cross-functional engineering teams in agile environments.



Technical Skills:



Cloud Platforms: GCP (BigQuery, Dataform, Cloud Composer, GCS, Pub/Sub, Dataproc, Cloud Functions, IAM), Azure (ADLS, Azure SQL, Synapse Analytics, ADF, Databricks, Event Hub, Service Bus, Azure Purview, RBAC), AWS (S3, Amazon RDS, Redshift, Glue, EMR, Lambda, Kinesis, QuickSight, MSK, MWAA)

Languages & Frameworks: Scala, Python, SQL, Spark SQL, Shell Scripting, Scala-Spark

Data Engineering & Processing: Google Dataform, Google Pub/Sub, BigQuery ingestion pipelines, Dataflow, Apache Beam, PySpark, Apache Spark, HDFS, Airflow (Composer), Kafka

Data Storage & Modeling: Google Cloud Storage (GCS), BigQuery, Cassandra, MongoDB, Star & Snowflake Schemas, Data Lake

Frontend Technologies: React.js, TypeScript, HTML5, CSS3, Component-Based Architecture

DevOps & CI/CD: GitHub, Jenkins, Terraform, CI/CD Pipelines, SDLC Practices

Monitoring & Security: Stackdriver (Cloud Monitoring), IAM Policies, Data Governance

Project Management & Collaboration: Agile, Scrum, Kanban, JIRA, Confluence, Slack, Microsoft Teams



Work Experience:





Best Buy, Richfield, MN

Sr. Data Engineer Oct 2024 - Present



Responsibilities:

Designed and implemented scalable batch and real-time data pipelines using Apache Spark, Spark SQL, Apache Hive, AWS Glue, Apache Beam, and Dataproc, improving operational efficiency and large-scale processing performance by up to 40%.

Architected and enhanced enterprise data lakes and lakehouse architectures across AWS (S3, Lake Formation), GCP (GCS, BigQuery), and Snowflake to support business intelligence, regulatory reporting, and advanced analytics workloads.

Led healthcare and retail domain analytics initiatives, enabling monitoring of sales trends, customer behavior, inventory optimization, and supply chain intelligence using BigQuery, Dataproc, and distributed Spark-based systems.

Led migration of electronic health records (EHR/EMR) from SQL Server and legacy systems to Snowflake and BigQuery, ensuring HIPAA compliance and secure data exchange via HL7v2 and FHIR APIs.

Designed and deployed GCP-native ingestion frameworks using Pub/Sub, GCS, BigQuery, Dataform, and Cloud Composer (Apache Airflow), enabling near real-time analytics and reducing ingestion latency by 35%.

Orchestrated complex ETL/ELT workflows across multi-cloud environments using Cloud Composer, Apache Airflow, AWS Glue, and Step Functions, proactively managing SLA monitoring, task retries, and automated alerting through Stackdriver and CloudWatch.

Created reusable SQL models using BigQuery (CTEs, window functions, partitioning, clustering) and Dataform, improving modularity, data quality, observability, and query performance.

Optimized distributed query processing across AWS RDS, Apache Hive, Spark SQL, and Redshift Spectrum, reducing query response time by 25-30% and improving interactive analytics performance.

Designed and implemented real-time streaming architectures using Apache Kafka, Kafka Streams, Kafka Connect, AWS Kinesis, and Pub/Sub to enable high-throughput, low-latency data ingestion and event processing.

Engineered scalable data pipelines within Palantir Foundry, enabling cross-platform ingestion and transformation, reducing processing time by 40%, and improving event-streaming efficiency by 45%.

Implemented secure, governed data storage solutions using Amazon S3, AWS Lake Formation, Glue Data Catalog, IAM policies, and encryption controls to ensure compliance and audit-readiness.

Established AI compliance and governance frameworks for Vertex AI pipelines, defining approval workflows, audit logging strategies, role-based access controls, and secure deployment standards.

Conducted experimentation and validation workflows within Vertex AI Workbench and Jupyter environments, supporting LLM fine-tuning evaluation and machine learning lifecycle integration.

Developed predictive analytics models and statistical workflows in Python using Pandas, NumPy, and scikit-learn, improving data preprocessing efficiency by 40-50% and delivering actionable business insights.

Containerized and deployed data processing applications using Docker and orchestrated distributed workloads with Kubernetes and AWS Fargate for scalable, cloud-native deployments.

Applied MapReduce techniques and Hadoop-based processing strategies for high-volume batch workloads, improving performance across distributed systems.

Managed and optimized AWS RDS and relational database environments, achieving a 30% reduction in query latency and enhancing production workload stability.

Integrated Amazon S3 with Redshift Spectrum and Databricks to enable external querying and reduce data movement costs by 30%.

Developed interactive React + TypeScript applications within Palantir Foundry, enabling real-time data exploration and workflow execution.

Built entity-centric UI components using Foundry Ontology objects, allowing users to navigate relationships and business entities.

Designed state-driven UI architecture using React Hooks (useState, useEffect, useContext) for dynamic rendering and performance optimization.

Implemented TypeScript-based Foundry actions to enable UI-triggered data operations and workflow automation.

Developed reusable component libraries and UI patterns to standardize frontend development across Foundry apps.

Integrated frontend applications with Foundry datasets, transforms, and APIs to deliver end-to-end data products.

Built event-driven UI features leveraging streaming pipelines (Kafka, Pub/Sub) for real-time updates.

Automated CI/CD workflows using GitHub Actions, AWS CodePipeline, CodeBuild, and GitOps practices, enabling version-controlled deployment and continuous delivery of data pipelines.

Defined and enforced SDLC best practices including peer code reviews, documentation via Confluence, Git-based version control, automated testing, and release management across multi-cloud environments.

Monitored and troubleshot distributed pipelines using Stackdriver (Cloud Operations), AWS CloudWatch, and centralized logging frameworks to ensure high availability and proactive incident resolution.

Created automated usage, billing, and SLA reporting dashboards using BigQuery and Looker Studio to improve cost transparency and resource accountability.

Collaborated with cross-functional teams including telecom stakeholders (CDRs, device logs, network KPIs), healthcare data teams, DevOps engineers, and compliance officers to align ingestion architecture with enterprise requirements.

Established enterprise data governance frameworks including metadata management, lineage tracking, access control policies, and compliance alignment with HIPAA and regulatory standards.

Designed and executed complex SQL queries to extract business-critical insights while improving distributed query performance by up to 30%.

Built scalable machine learning and analytics workflows on Databricks, integrating collaborative data science and engineering processes.

Developed interactive dashboards and visualizations to enable stakeholders to analyze complex operational and compliance data trends.

Led continuous improvement initiatives through stakeholder feedback collection, root cause analysis, and implementation of version-controlled enhancements across Composer DAGs and SQL models.

Environment:

Apache Spark, Spark SQL, Apache Hive, Hadoop, MapReduce, AWS Glue, AWS Lambda, Amazon S3, AWS RDS, AWS Kinesis, AWS Lake Formation, Glue Data Catalog, Redshift Spectrum, AWS CloudWatch, AWS CodePipeline, AWS CodeBuild, BigQuery, GCS, Pub/Sub, Dataproc, Dataform, Cloud Composer, Stackdriver, Snowflake, Palantir Foundry, Databricks, Docker, Kubernetes, AWS Fargate, Python, Pandas, NumPy, DynamoDB, SQL.



Nationwide, Columbus, OH

Data Engineer Jul 2022 - Sep 2024



Responsibilities:

Developed scalable serverless data processing solutions using AWS Lambda, automating cloud-native workflows and improving operational efficiency across distributed systems.

Designed and deployed real-time and batch data pipelines using GCP Dataflow (Apache Beam) and Cloud Composer (Apache Airflow), automating ingestion from Pub/Sub into BigQuery and reducing latency by 35%.

Engineered distributed data processing pipelines using Apache Spark on Google Cloud Dataproc and Amazon EMR to support high-volume financial data processing and compliance workloads.

Designed and implemented AWS Step Functions and Google Cloud Workflows to orchestrate multi-step cloud-native data processes, reducing execution time and improving reliability of complex workflows.

Developed basic UI components in Foundry using TypeScript and React, enabling visualization of processed datasets.

Assisted in building ontology-driven UI views for business users to interact with curated datasets.

Collaborated with frontend teams to integrate data pipelines with user-facing applications.

Built secure enterprise data lakes using AWS Lake Formation, Amazon S3, Google Cloud Storage, and Google Cloud Dataplex, ensuring centralized governance, policy enforcement, lineage tracking, and regulatory compliance.

Implemented advanced partitioning strategies, clustering logic, and lifecycle policies across Amazon S3 and GCS to optimize storage utilization, reduce query latency, and lower infrastructure costs.

Developed real-time and batch ingestion pipelines using Pub/Sub and Dataflow, transforming streaming datasets into curated BigQuery data marts using Dataform with modular SQL-based transformations.

Orchestrated end-to-end ETL workflows via Cloud Composer (Airflow) across BigQuery, GCS, Cloud SQL, and Dataflow, enhancing reliability, observability, and SLA compliance.

Engineered scalable BigQuery warehouse architectures with optimized schemas, partitioning, clustering, and materialized views, improving analytical query performance by 40% and reducing costs by 35%.

Implemented declarative data modeling using Dataform with CI/CD integration via GitHub Actions, enabling version-controlled SQL transformations and modular pipeline deployment.

Managed secure multi-cloud environments using AWS IAM, Google Cloud IAM, service accounts, and role-based access controls to enforce enterprise governance and least-privilege security models.

Applied encryption mechanisms using AWS KMS and Google Cloud KMS to ensure secure data handling aligned with GDPR and enterprise compliance standards.

Collaborated with DevOps teams to provision cloud infrastructure using Terraform, automating deployment of IAM roles, storage buckets, datasets, and compute resources in GCP and AWS.

Integrated observability frameworks using AWS CloudWatch and Google Cloud Operations Suite (Stackdriver) to monitor pipeline health, enforce telemetry, and enable proactive alerting for production systems.

Developed distributed ingestion and transformation pipelines in Palantir Foundry, processing over 50 million records daily, reducing ingestion latency by 35%, and improving query performance by 60%.

Optimized relational and NoSQL data models within Palantir Foundry and MongoDB environments, reducing storage costs by 25% and enhancing analytical responsiveness.

Automated ETL processes using AWS Glue and Google Cloud Dataflow, eliminating manual transformations and improving pipeline reliability.

Managed and maintained Apache Airflow DAGs across AWS and GCP environments, streamlining hybrid data movement between cloud and on-prem systems.

Designed BigQuery data marts and performed advanced SQL transformations to support ad hoc analytics, BI reporting, and compliance dashboards.

Optimized serverless querying performance using Amazon Athena and Google BigQuery, significantly reducing data retrieval times for financial analytics use cases.

Conducted Python-based data wrangling and preprocessing using Pandas and NumPy to enhance data readiness for analytics and machine learning workloads.

Developed and deployed machine learning models using AWS SageMaker and Google Vertex AI to enhance predictive analytics and automation across financial datasets.

Designed and implemented foundational LLMOps pipelines using Vertex AI, integrating model monitoring, retraining triggers, versioned endpoints, and governance controls to support enterprise-grade GenAI adoption.

Built scalable machine learning workflows using Databricks across AWS and GCP, facilitating collaborative data science and engineering integration.

Created user-friendly dashboards and visualizations within Palantir Foundry and BI environments, reducing reporting cycle time by 50% and improving executive decision-making.

Deployed CI/CD pipelines using Jenkins and Google Cloud Build to automate deployment of data workflows, reducing release cycle time and improving deployment reliability.

Implemented cross-cloud data integration patterns, enabling interoperability between Amazon Redshift and Google BigQuery for multi-cloud analytical processing.

Containerized and deployed cloud-native data applications using Docker and Kubernetes (Amazon EKS and Google GKE), ensuring scalability, portability, and environment consistency.

Established enterprise data governance frameworks leveraging AWS Glue Data Catalog and Google Cloud Dataplex to ensure metadata management, lineage tracking, security enforcement, and policy compliance.

Implemented cost optimization strategies across AWS and GCP by right-sizing compute, tuning queries, optimizing storage tiers, and refining data architecture design.

Delivered cross-training and technical enablement sessions for internal engineering and business teams to promote adoption of GCP-native and AWS-native data platforms.

Collaborated within Agile project environments, working closely with Product Owners, DevOps engineers, architects, and compliance teams to deliver incremental cloud data solutions aligned with sprint goals and enterprise timelines.

Environment:

AWS Lambda, AWS Step Functions, Amazon S3, AWS Lake Formation, AWS IAM, AWS KMS, Amazon Athena, Amazon EMR, Amazon RDS, Apache Spark, Apache Hadoop, Kafka Streams, Kafka Connect, Apache Airflow, GCP Dataflow, BigQuery, Cloud Composer, Google Cloud Storage, Google Cloud Dataproc, Google Cloud Dataplex, Google Cloud IAM, Vertex AI, Snowflake, MongoDB, MySQL, Jenkins, Python, Matplotlib, Docker, Kubernetes (EKS, GKE), Databricks, Terraform.



AbbVie, Chicago, IL

Data Engineer Jul 2021 - Apr 2022



Responsibilities:

Designed and maintained scalable enterprise data lakes leveraging Azure Synapse Analytics and Azure Data Lake Storage (ADLS Gen2), optimizing large-scale data integration for analytics, regulatory reporting, and business intelligence workloads.

Engineered a Lakehouse Architecture on Azure Databricks using Delta Lake, providing ACID compliance, unified batch and streaming capabilities, and reliable financial data processing pipelines.

Implemented Medallion Architecture (Bronze, Silver, Gold) to transform raw on-prem SQL Server ingestions into curated, high-quality datasets for downstream analytics and reporting.

Developed scalable ETL/ELT frameworks using Azure Data Factory (ADF), Apache Airflow, and Azure Databricks (PySpark, Spark SQL), improving operational efficiency and pipeline reliability.

Designed and implemented ADF pipelines to migrate large-scale SQL Server datasets into Azure Synapse Analytics, ensuring seamless transformation, validation, and reconciliation through Azure Databricks.

Leveraged Databricks Auto Loader (CloudFiles) for incremental ingestion of multi-terabyte datasets from ADLS Gen2, enabling scalable and efficient ingestion frameworks.

Optimized Spark performance using Z-Ordering, Data Skipping, Broadcast Joins, shuffle partition tuning, and file compaction strategies, reducing query latency by 35% and lowering compute costs.

Resolved data skew and small-file challenges by fine-tuning Spark configurations and executing OPTIMIZE and VACUUM operations to maintain Delta Lake storage health.

Utilized Delta Lake Time Travel, Schema Enforcement, and schema evolution controls to ensure auditability and prevent pipeline failures due to source-system schema drift.

Managed fine-grained data governance using Unity Catalog, enforcing RBAC controls and ensuring HIPAA and financial regulatory compliance across enterprise datasets.

Established automated CI/CD workflows using Azure DevOps, BitBucket, Databricks Asset Bundles (DABs), and Terraform for infrastructure provisioning and environment promotion.

Applied Azure Functions to implement serverless data workflows, reducing infrastructure management overhead and increasing scalability.

Implemented real-time streaming pipelines using Apache Spark and Azure Stream Analytics, enabling low-latency operational insights and continuous data processing.

Architected distributed PySpark-based data processing solutions to enable parallel processing of large financial datasets across clustered environments.

Designed robust data architecture strategies to support complex data workflows, cross-platform integration, and enterprise-scale analytics requirements.

Built and maintained BigQuery data marts integrating on-prem and cloud datasets using Dataform and Cloud Composer, supporting compliance and financial regulatory reporting.

Designed ingestion architectures using Pub/Sub, Dataflow, and BigQuery, processing over 10 million daily transactions with SLA guarantees.

Migrated legacy Talend-based ETL pipelines to GCP-native tools (Dataform + BigQuery SQL), improving maintainability, modularity, and performance.

Developed reusable DAG templates in Cloud Composer and Airflow to standardize pipeline orchestration across finance and reporting teams.

Created Python-based utilities for metadata validation, schema evolution checks, lineage tracking, and automated quality validations integrated into Composer workflows.

Enforced multi-cloud security policies using fine-grained IAM roles, RBAC, and CMEK encryption across GCS, BigQuery, Azure Storage, and Synapse datasets.

Designed and deployed distributed processing pipelines using Apache Airflow and Dataproc to improve scalability and performance across hybrid environments.

Managed enterprise data lakes for optimal storage, retrieval, and compliance with retention policies and regulatory requirements.

Performed advanced data transformations and statistical analysis using Pandas and NumPy, supporting predictive modeling and analytical workloads.

Migrated on-prem SQL Server databases to Snowflake, optimizing warehouse performance and automating ETL workflows using ADF, dbt, and Snowflake Streams.

Integrated MLflow within Databricks to track experimentation, model lineage, and streamline deployment of predictive models into production.

Integrated machine learning models into enterprise pipelines, leveraging Azure SQL Database data to generate predictive insights for financial analytics.

Enabled real-time analytics by streamlining ingestion from multiple sources into ADLS and processing via Spark and Databricks.

Automated infrastructure provisioning and cloud resource management using PowerShell and Terraform, reducing manual intervention and improving system reliability.

Designed and implemented automated data quality frameworks with testing and validation checkpoints to ensure clean, accurate, and consistent data.

Established enterprise-wide metadata management and governance practices to ensure data traceability, integrity, and compliance across Azure and GCP environments.

Collaborated with business stakeholders, data science teams, and engineering groups to translate BI requirements into scalable data engineering solutions.

Led cross-functional collaboration initiatives to align data engineering strategies with business objectives and long-term modernization goals.

Environment:

Azure Synapse Analytics, Azure SQL Database, Azure Data Factory (ADF), Azure Databricks, Delta Lake, ADLS Gen2, Unity Catalog, Azure Stream Analytics, Azure Functions, Azure DevOps, BitBucket, Terraform, PowerShell, Apache Spark, PySpark, Spark SQL, Apache Airflow, Dataproc, BigQuery, Pub/Sub, Dataflow, Cloud Composer, Snowflake, dbt, MLflow, Pandas, NumPy, SQL, RBAC.



ADP, Hyderabad, India

Data Engineer Jan 2017 - Dec 2020



Responsibilities:

Designed and implemented enterprise data ingestion pipelines using Amazon S3 as centralized data lake storage for structured and semi-structured retail and manufacturing datasets.

Provisioned and managed AWS EC2 instances and EMR clusters to support distributed Spark, Hive, and batch processing workloads across multi-terabyte datasets.

Architected and maintained large-scale Hadoop clusters leveraging HDFS ecosystem components to store and process high-volume structured and unstructured data.

Configured AWS EMR clusters for Spark and Hive-based distributed processing, enabling scalable analytics for retail and supply chain operations.

Built data loading pipelines from Amazon S3 into Amazon Redshift to support enterprise reporting, BI dashboards, and analytics workloads.

Optimized Redshift performance through distribution styles, sort keys, vacuum operations, and workload tuning to improve query responsiveness.

Implemented IAM roles, bucket policies, and security configurations to enforce secure access to S3, EMR, and Redshift environments.

Automated AWS infrastructure provisioning using CloudFormation templates, enabling repeatable and standardized environment deployments.

Implemented monitoring and alerting using Amazon CloudWatch for ETL job health, EMR cluster utilization, and production system stability.

Developed secure data transfer mechanisms between on-prem systems and AWS using SFTP and AWS DataSync for hybrid data integration.

Optimized SQL and PySpark workloads, reducing batch processing time by 25-30% through partitioning strategies and performance tuning.

Engineered structured data ingestion using Apache Sqoop (Oracle/MySQL to HDFS) and real-time log ingestion using Apache Flume into the Hadoop data lake.

Developed scalable Big Data batch jobs using Talend and native Hadoop components (tHDFSOutput, tHiveLoad) to streamline legacy system integration.

Implemented Change Data Capture (CDC) logic using Informatica PowerCenter to ensure near real-time synchronization between production databases and Hadoop clusters.

Orchestrated multi-stage ETL workflows using Apache Oozie and Informatica Workflow Manager, enabling reliable scheduling, dependency management, and automated error handling.

Optimized Hive query performance using partitioning, bucketing, and ORC/Parquet file formats, reducing query execution time by 40% for business analysts.

Managed cluster resource allocation using YARN, optimizing CPU and memory utilization to support concurrent high-volume processing workloads.

Designed and maintained HBase schemas for low-latency random-access lookups supporting customer-facing retail applications.

Automated cluster monitoring and health checks using Apache Ambari and shell scripting, proactively identifying node failures and performance bottlenecks.

Developed end-to-end ETL pipelines using Informatica PowerCenter and Talend to ingest high-volume data into HDFS and Hive, ensuring 99.9% data availability.

Designed and implemented enterprise data integration solutions using Informatica PowerCenter, developing mappings, transformations, and workflows for accurate and timely data delivery.

Applied Informatica Data Quality (IDQ) for profiling, cleansing, and standardization, reducing reporting data errors by 25%.

Implemented Master Data Management (MDM) solutions using Informatica MDM Hub and Data Director to consolidate and manage enterprise master data domains.

Designed and implemented match-merge rules, validation rules, and data standardization processes within Informatica MDM to ensure accurate and consistent master data.

Integrated Informatica MDM with CRM, ERP, and data warehouse systems to ensure enterprise-wide data consistency.

Developed data synchronization processes between source systems and Informatica MDM Hub in both real-time and batch modes.

Built custom data validation, enrichment, and deduplication strategies using Informatica MDM to eliminate redundant records and improve data quality.

Designed data hierarchies, domains, and relationships within Informatica MDM to accurately represent enterprise master data structures.

Implemented data quality scorecards and monitoring dashboards using Informatica Data Quality to measure and track master data health.

Performed performance tuning and optimization within Informatica PowerCenter and MDM environments to improve ETL throughput and reduce processing overhead.

Developed and optimized Hadoop-based data pipelines for processing large volumes of structured and unstructured data, improving accessibility and analytical performance.

Collaborated with business stakeholders and data stewards to define data governance policies, ownership models, and stewardship processes.

Ensured adherence to data integration best practices, metadata management standards, and enterprise data governance frameworks.

Stayed current with evolving Hadoop, AWS, and Informatica MDM capabilities, actively researching new features and participating in knowledge-sharing forums.

Environment:

AWS EMR, Amazon Redshift, Amazon Athena, AWS Step Functions, AWS Lambda, Amazon S3, AWS IAM, AWS CloudWatch, AWS DataSync, EC2, Hadoop, HDFS, Hive, HBase, Sqoop, NiFi, Flume, MapReduce, YARN, Apache Oozie, Informatica PowerCenter, Informatica MDM, Informatica Data Quality (IDQ), Talend, SQL, CRM, ERP, Data Governance.



IBM, Hyderabad, India

Data Engineer Jun 2015 - Dec 2016



Responsibilities:

Contributed to development of automated data processing tasks using Google Cloud Functions, improving operational efficiency and enabling scalable serverless data workflows.

Designed and deployed distributed data processing pipelines using Google Cloud Dataflow (Apache Beam) and Dataproc (Spark), enabling batch and near real-time analytics integrated with GCS and BigQuery.

Engineered scalable data ingestion frameworks using Pub/Sub, Google Cloud Storage (GCS), and BigQuery, implementing failover recovery mechanisms and Dead Letter Queue (DLQ) handling for resilient streaming architectures.

Utilized GCP Dataproc for large-scale Spark-based data processing, integrating structured and semi-structured datasets into BigQuery for enterprise reporting and analytical workloads.

Optimized BigQuery SQL queries using partitioning, clustering, and query tuning techniques to improve performance for financial reporting and business intelligence use cases.

Developed and optimized ETL workflows using Talend, extracting, transforming, and loading large datasets across multiple enterprise systems.

Converted Spark/Scala-based ETL logic into modular SQL transformations using Dataform, enabling schema versioning, automated testing, and Git-based CI/CD integration.

Implemented CI/CD pipelines for data transformations using GitHub, Cloud Build, and Terraform, supporting version-controlled deployment of BigQuery datasets and Composer DAGs.

Designed scalable pipelines in Dataflow and provisioned infrastructure using Terraform to ensure repeatable, compliant, and production-grade real-time analytics deployments.

Employed Apache Airflow (Cloud Composer) to orchestrate and schedule multi-stage ETL workflows, ensuring dependency management, retry handling, and SLA adherence.

Provided production support for Composer-managed pipelines, monitoring job execution and optimizing BigQuery resource usage to control costs and enhance performance.

Configured secure storage solutions in GCS with lifecycle management policies and encryption at rest using Google-managed keys to ensure secure and compliant data storage.

Assisted in developing and maintaining distributed processing systems using Apache Spark and Hadoop, supporting scalable big data environments.

Developed Scala-based scripts using functional programming paradigms to enhance distributed data transformation efficiency.

Implemented data integrity checks and validation frameworks across ingestion pipelines to ensure data quality and consistency.

Contributed to data transformation initiatives, standardizing diverse data formats and integrating heterogeneous data sources into unified reporting structures.

Supported business intelligence initiatives by assisting in data modeling, enabling optimized reporting and analytical decision-making.

Monitored system performance using Stackdriver (Google Cloud Operations Suite), proactively identifying bottlenecks and improving throughput and latency.

Participated in log analysis and troubleshooting activities, identifying root causes of performance issues and recommending technical optimizations.

Assisted in implementing data governance practices, ensuring secure access control, compliance, and structured data management policies.

Collaborated closely with senior engineers to optimize distributed pipelines for higher throughput and reduced latency in large-scale data environments.

Engaged in cost-optimization strategies by balancing compute performance and budget constraints across cloud processing workloads.

Actively participated in Agile and Kanban teams, contributing to sprint planning, backlog grooming, daily stand-ups, and iterative delivery cycles.

Leveraged GitHub for version control, enabling collaborative development, structured code reviews, and controlled deployment processes.

Enabled end-to-end data movement from Cloud Storage to BigQuery using GCP-native services, accelerating executive reporting and dashboard generation.

Assisted in building cloud-based data architectures to ensure scalable storage, processing, and governance across enterprise financial datasets.

Environment:

Talend, Google Cloud Dataflow (Apache Beam), Dataproc (Spark), BigQuery, Google Cloud Storage (GCS), Cloud Functions, Pub/Sub, Apache Spark, Hadoop, Apache Airflow (Cloud Composer), Stackdriver (Cloud Operations), Scala, SQL, GitHub, Terraform, Agile, Kanban.



Education



Master of Science in Computer Science, National Louis University, Chicago, USA, 2022