Sivaji Gutala - Sr. Data Engineer with 12+ Years of Experience
[email protected]
Location: USA
Relocation: Anywhere
Visa: F1 (initial OPT; STEM OPT, 2 years)
Resume file: Sivaji Gutala_Data_Engineer_1771435037764.docx

Sivaji Gutala
mail:[email protected] +19085849456 www.linkedin.com/in/sivaji-gutala ____________________________________________________________________________________________________________________ Professional Summary: Senior Multi-Cloud Data Engineer with over 12 years of experience in designing and deploying enterprise data solutions. Expert-level proficiency in GCP (7+ years), specialized hands-on experience in AWS (~4 years), and specialized hands-on experience in Azure (~4 years). Expert in BigQuery, Databricks (Medallion Architecture), and Snowflake, with a strong focus on GenAI (Vertex AI), LLM Ops, and automated CI/CD pipelines. in Google Cloud Platform (GCP), specializing in scalable data engineering solutions using BigQuery, Dataform, Google Pub/Sub, and Google Cloud Storage. Proven expertise in building enterprise-grade, distributed data pipelines, automating workflows via Cloud Composer (Apache Airflow), and orchestrating ingestion to BigQuery using real-time and batch processing. Proven expertise in designing and deploying cloud-native ETL/ELT solutions, automating workflows with AWS Step Functions, Glue, and Lambda, and managing secure, governed data lakes using AWS Lake Formation. Skilled in optimizing cloud data pipelines for performance and cost, with hands-on experience in RDS, Redshift Spectrum, CloudWatch monitoring, and CI/CD automation via CodePipeline/CodeBuild. Strong experience in software development lifecycle (SDLC), SQL performance tuning, and GitHub-based CI/CD pipelines. Adept at working in agile environments and collaborating with cross-functional teams to deliver reliable and secure GCP-native data solutions. Proficient in designing and implementing scalable data pipelines to collect, cleanse, and transform data from various sources into a unified format for analysis within the Palantir platform. Strong experience in orchestrating data workflows with Apache Airflow, Apache Beam, and Luigi, ensuring automation of ETL processes and system reliability. Conducted development and experimentation within Jupyter Notebooks on Vertex AI Workbench, testing data pipelines, and validating LLM fine-tuning outputs before production deployment. Experience in building and optimizing data models using relational and NoSQL databases, leveraging advanced analytics techniques, including machine learning and statistical analysis. Advanced proficiency in Python, Pandas, NumPy, and scikit-learn for data manipulation, statistical analysis, and building data science models. Developed and maintained Python and Perl-based scripting utilities for file parsing, DAG validations, and orchestration monitoring within Airflow and Cloud Composer workflows. Expertise in HDFS for distributed storage and Apache Hive for building and managing data warehouses to support BI and reporting solutions. Skilled in data ingestion, extraction, and transformation using Apache Sqoop, Kafka, Kafka Streams, and Kafka Connect for real-time data streaming and batch processing. Proficient in Spark SQL, and SQL for querying large datasets and optimizing query performance in distributed environments. Extensive experience with containerization and orchestration using Docker, Kubernetes, and AWS Fargate to build and deploy cloud-native applications. Strong DevOps knowledge, proficient in Jenkins, Git, GitHub, BitBucket, Terraform, and CI/CD pipeline management for automated deployment processes. 
Configured continuous integration workflows using Bamboo for release automation and environment-specific deployment of data pipelines on GCP.
In-depth understanding of data warehousing concepts such as Star and Snowflake schemas, and of building efficient data models for analytics and reporting.
Experienced in visualizing data with Tableau and Power BI to build interactive dashboards and convey business insights to stakeholders.
Experienced in Azure services: Synapse Analytics, Data Factory (ADF), Databricks, ADLS, and Azure Stream Analytics.
Solid experience with database technologies including MySQL, MongoDB, Oracle, SQL Server, and Cosmos DB for reliable data storage solutions.
Proven track record of owning and delivering large-scale GCP data initiatives in fast-paced enterprise environments with tight deadlines.
Designed and delivered high-impact data products across GCP ecosystems using BigQuery and Dataflow, enabling reusable architecture patterns for analytics in compliance and retail use cases.
Excellent verbal, written, and interpersonal communication skills, with the ability to clearly articulate complex technical solutions to both technical and business audiences.

Technical Skills:
Cloud Platforms: GCP (BigQuery, Dataform, Cloud Composer, GCS, Pub/Sub, Dataproc, Cloud Functions, IAM); Azure (ADLS, Azure SQL, Synapse Analytics, ADF, Databricks, Event Hub, Service Bus, Azure Purview, RBAC); AWS (S3, Amazon RDS, Redshift, Glue, EMR, Lambda, Kinesis, QuickSight, MSK, MWAA)
Languages & Frameworks: Scala, Python, SQL, Spark SQL, Shell Scripting, Scala-Spark
Data Engineering & Processing: Google Dataform, Google Pub/Sub, BigQuery ingestion pipelines, Dataflow, Apache Beam, PySpark, Apache Spark, HDFS, Airflow (Composer), Kafka
Data Storage & Modeling: Google Cloud Storage (GCS), BigQuery, Cassandra, MongoDB, Star & Snowflake Schemas, Data Lake
DevOps & CI/CD: GitHub, Jenkins, Terraform, CI/CD Pipelines, SDLC Practices
Monitoring & Security: Stackdriver (Cloud Monitoring), IAM Policies, Data Governance
Project Management & Collaboration: Agile, Scrum, Kanban, JIRA, Confluence, Slack, Microsoft Teams

Work Experience:

Humana, Rahway, KY
Sr. Data Engineer, Jun 2025 - Present
Responsibilities:
Designed and implemented scalable data pipelines using Apache Spark, Hive, and AWS Glue, improving operational efficiency by 40%.
Led efforts to build and enhance data models and data lakes to support business intelligence and reporting solutions.
Automated ETL workflows using Apache Airflow to increase system reliability and reduce data processing time.
Worked with large datasets, optimizing queries and improving performance across distributed systems like AWS RDS and Apache Hive.
Supported retail-domain analytics pipelines to monitor sales trends and customer behavior, enabling optimization of inventory and supply chain operations using BigQuery and Dataproc.
Led migration of healthcare data to Snowflake, ensuring HIPAA compliance and seamless data integration with FHIR APIs.
Designed and deployed scalable GCP-native data pipelines using Google Cloud Dataform, BigQuery, Pub/Sub, and Cloud Composer, supporting both batch and real-time ingestion from various systems.
Implemented efficient data ingestion frameworks into BigQuery using GCS and Pub/Sub, enabling near real-time analytics and reducing latency by 35%.
Orchestrated and monitored complex workflows using Cloud Composer (Apache Airflow) for ETL orchestration across BigQuery and Dataform models.
Led end-to-end supervision and maintenance of Airflow-based pipelines, proactively managing task retries, SLA misses, and alerting logic via Stackdriver to ensure high system reliability (see the DAG sketch following this section).
Created reusable SQL modules in Google Dataform, improving data quality, modularity, and pipeline observability.
Authored technical documentation for Vertex AI deployment flows, DAG configurations, Dataform model definitions, and CI/CD deployment steps to support team onboarding and long-term maintainability.
Developed and maintained data models in BigQuery using SQL (window functions, CTEs, and partitioning) to support business intelligence and regulatory reporting.
Optimized GCS lifecycle policies and partitioning logic, reducing storage costs and improving access speed.
Integrated GitHub workflows for Dataform project versioning and CI/CD automation across GCP environments.
Defined and enforced SDLC best practices for data engineering within GCP, including GitOps, CI/CD with GitHub Actions, peer reviews, and documentation in Confluence.
Contributed to building an AI compliance and governance framework, defining audit-ready logging strategies, approval workflows, and access control for Vertex AI pipelines in GCP.
Performed root cause analysis and pipeline optimization, leveraging Stackdriver for alerting and debugging issues in GCS and Composer DAGs.
Created automated usage and billing reports in BigQuery and Looker Studio for SLA tracking and resource accountability, ensuring alignment with cost-optimization goals.
Collaborated with cross-functional teams to align data ingestion architecture with AT&T telecom domain requirements (CDRs, device logs, network KPIs).
Led continuous improvement initiatives by collecting stakeholder feedback, conducting root cause analysis, and implementing version-controlled updates to GCP Composer workflows and SQL models.
Implemented scalable and secure data storage solutions with Amazon S3, supporting diverse datasets in the cloud.
Experienced in healthcare data migration and analytics, including handling EHR/EMR data, ensuring compliance with regulatory standards (HIPAA, HL7, FHIR), and managing cloud-based data integration, monitoring, and troubleshooting.
Managed and optimized AWS RDS relational database environments, achieving a 30% reduction in query response time and enhancing system reliability for production workloads.
Improved large-scale data querying and analytics by leveraging Apache Hive, reducing processing times by 25% for high-volume datasets.
Designed and implemented data pipelines in Palantir Foundry, enabling efficient data ingestion and transformation across multiple sources, reducing data processing time by 40% and improving system scalability.
Automated data workflows using AWS Glue, simplifying ETL processes and improving operational efficiency.
Designed and developed scalable data pipelines using Apache Spark and Apache Airflow (GCP Composer), optimizing large-scale data processing workflows, including real-time streaming applications with Apache Kafka and GCP services.
Enabled interactive querying across multiple data sources, improving query performance and scalability.
Migrated electronic patient records (EHR) from SQL Server to Snowflake, ensuring HIPAA compliance and secure data exchange using FHIR APIs.
Designed real-time data streaming applications with AWS Kinesis, achieving high throughput and low-latency ingestion for time-sensitive data pipelines.
Conducted advanced data wrangling and preprocessing with Python, improving data quality and reducing preprocessing time by 40%.
Enhanced large-scale data query processing efficiency by 30% using Spark SQL across distributed data systems.
Containerized and deployed data processing applications using Docker, ensuring reproducibility and scalability across environments.
Orchestrated data workflows and resource allocation with Kubernetes, managing distributed data systems for optimal performance.
Implemented strict data privacy and security controls within Palantir Foundry, achieving 100% compliance with HIPAA and reducing unauthorized data access incidents by 30%.
Applied MapReduce techniques for large-scale data processing in distributed systems, improving batch processing performance.
Managed metadata and data lineage with Glue Data Catalog, ensuring consistent tracking and easy discovery of datasets.
Monitored data pipelines and cloud infrastructure using AWS CloudWatch, proactively ensuring high availability and troubleshooting issues.
Streamlined deployment automation using AWS CodePipeline and CodeBuild, integrating data processing pipelines for continuous integration and delivery.
Integrated Amazon S3 with Redshift Spectrum, enabling seamless querying of datasets and reducing data movement costs by 30%.
Engineered real-time data ingestion and transformation pipelines using Palantir Foundry and Apache Kafka, improving event streaming efficiency by 45% and ensuring near-instant data availability.
Designed and executed complex SQL queries, extracting actionable business insights and reducing query processing time by 30%.
Optimized Hadoop-based data querying using Apache Hive, improving data analysis efficiency by 20% in large-scale systems.
Created dynamic and interactive data visualizations, enabling stakeholders to analyze complex data trends.
Built predictive models and applied statistical analysis techniques in Python, delivering actionable business insights.
Utilized Pandas and NumPy for efficient data manipulation, enabling 50% faster data preparation and transformation workflows.
Implemented NoSQL solutions with DynamoDB, reducing query latency for high-velocity applications by 30%.
Optimized ETL workflows on Databricks, leveraging collaborative engineering processes and integrating with Amazon Redshift for 40% faster analytics.
Built scalable machine learning workflows using Databricks, facilitating collaborative data science and engineering processes.
Established data governance frameworks to ensure compliance with industry regulations and organizational data policies.
Environment: Apache Spark, AWS Glue, AWS Lambda, Amazon S3, AWS RDS, Apache Hive, AWS CloudWatch, AWS CodePipeline, AWS CodeBuild, Databricks, Amazon Redshift, HDFS, AWS Kinesis, Python, Spark SQL, Hadoop, Docker, Kubernetes, AWS Lake Formation, MapReduce, Glue Data Catalog, Redshift Spectrum, Apache Beam, Pandas, NumPy, DynamoDB, SQL.
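A minimal, illustrative sketch of the Cloud Composer retry/SLA/alerting pattern referenced above, assuming Airflow 2.x; the DAG id, task, and alert address are hypothetical placeholders, not values from the original pipelines:

```python
# Sketch of an Airflow DAG with retries, an SLA, and failure alerting.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_gcs_to_bigquery(**context):
    """Placeholder for a GCS -> BigQuery ingestion step."""
    # A real pipeline would invoke the BigQuery client or a transfer operator here.
    pass


default_args = {
    "owner": "data-engineering",
    "retries": 3,                          # automatic task retries on failure
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),             # SLA misses are recorded and alerted
    "email_on_failure": True,
    "email": ["[email protected]"],  # hypothetical alert recipient
}

with DAG(
    dag_id="gcs_to_bigquery_daily",
    start_date=datetime(2025, 6, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="load_gcs_to_bigquery",
        python_callable=load_gcs_to_bigquery,
    )
```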
Bank of America, Charlotte, NC
Data Engineer, Mar 2024 - May 2025
Responsibilities:
Developed scalable AWS Lambda functions to automate data processing tasks, enhancing efficiency in cloud-native workflows.
Developed and deployed GCP Dataflow and Cloud Composer pipelines for real-time and batch data processing, automating ingestion from Pub/Sub into BigQuery and reducing latency by 35%.
Led data governance frameworks using GCP Dataproc, BigQuery, and GCS for scalable data solutions.
Designed and implemented AWS Step Functions and Google Cloud Workflows to automate multi-step cloud data processes, reducing execution time and improving orchestration efficiency.
Implemented advanced partitioning and lifecycle policies for Amazon S3 and Google Cloud Storage, reducing query latency and storage costs by optimizing data access patterns.
Enterprise Data Lakes & Governance: Designed and implemented secure data lakes using AWS Lake Formation and Google Cloud Dataplex, ensuring data security, governance, and compliance with organizational policies.
Developed real-time and batch ingestion pipelines using Google Pub/Sub and Dataflow, transforming streaming data into curated BigQuery datasets with Dataform for downstream analytics.
Orchestrated ETL workflows using Cloud Composer (Apache Airflow) to automate complex data transformations and daily load jobs across GCS, BigQuery, and Cloud SQL.
Engineered scalable data lake architecture using Google Cloud Storage (GCS) and structured data warehouse models in BigQuery, improving query performance by 40%.
Implemented Dataform for declarative data modeling and SQL-based transformations in BigQuery, enabling modular, reusable SQL workflows with CI/CD integration via GitHub Actions.
Managed secure GCP environments using IAM roles and service accounts, adhering to enterprise data governance and access control policies.
Delivered technical assistance and cross-training sessions to business users and internal engineering teams, ensuring end-to-end understanding of GCP pipeline usage and governance.
Conducted performance tuning on BigQuery queries using partitioning, clustering, and materialized views, reducing data processing costs by 35% (see the sketch following this section).
Collaborated with cross-functional teams to integrate GCP-native telemetry and monitoring using Stackdriver, ensuring observability and proactive alerting for production pipelines.
Implemented data lakes using GCP Dataproc and BigQuery, ensuring data security and compliance with organizational policies.
Collaborated with DevOps and data teams to drive fast-paced GCP project delivery using Terraform-based infrastructure automation and agile project management methodologies.
Cloud Security & Identity Management: Applied strict IAM policies in AWS and Google Cloud IAM, managing role-based data access controls and ensuring secure multi-cloud environments.
Developed distributed data processing pipelines using Apache Spark on Google Cloud Dataproc, and used Terraform to provision GCP infrastructure including IAM, storage buckets, and BigQuery datasets.
Optimized relational and NoSQL data models within Palantir Foundry, improving query performance by 60% and reducing storage costs by 25%.
Developed high-throughput, real-time data ingestion pipelines in Palantir Foundry, processing over 50 million records daily and reducing data ingestion latency by 35%.
Automated ETL workflows using AWS Glue and Google Cloud Dataflow, reducing manual data transformations and improving data pipeline efficiency.
Managed Apache Airflow for workflow automation in AWS and GCP, streamlining data movement across cloud and on-prem systems.
Developed and managed DAGs in GCP Composer (Airflow) for orchestrating complex ETL workflows across BigQuery, Dataflow, and GCS, improving reliability and visibility of data pipelines.
Built efficient BigQuery data marts and performed advanced SQL transformations to support ad hoc analytics and BI reporting, reducing query costs by 25% through optimization techniques.
Utilized GCP Dataflow with Apache Beam for distributed data processing, transforming high-volume streaming and batch datasets for customer analytics and compliance reporting.
Worked closely with GCP IAM and Stackdriver to enforce security policies, monitor pipeline health, and ensure system reliability.
Serverless Querying & Analytics: Optimized query execution using Amazon Athena and Google BigQuery, reducing data retrieval times for analytical queries.
Data Transformation & Preprocessing: Utilized Python-based data wrangling with Pandas and NumPy, improving data preprocessing efficiency for analytics workloads.
Developed and deployed ML models using AWS SageMaker and Google Vertex AI, enhancing predictive analytics and automation in cloud environments.
Designed and deployed foundational LLM Ops pipelines using Vertex AI, integrating model monitoring, retraining triggers, and endpoint versioning to streamline enterprise-grade GenAI solutions.
Developed user-friendly dashboards and visualizations in Palantir Foundry, cutting report generation time by 50% and improving stakeholder decision-making.
Implemented encryption mechanisms with AWS KMS and Google Cloud KMS, ensuring secure data handling and compliance with GDPR and enterprise security policies.
Deployed continuous integration and delivery (CI/CD) pipelines using Jenkins on AWS and Google Cloud Build, reducing deployment cycle times for data workflows.
Monitored and optimized cloud data pipelines using AWS CloudWatch and Google Cloud Operations Suite (Stackdriver), ensuring real-time alerting and issue resolution.
Implemented cost-saving measures in AWS and GCP, reducing compute, storage, and query costs through optimized data architecture.
Designed cross-cloud data solutions, integrating AWS Redshift with Google BigQuery for multi-cloud data warehousing and analytical processing.
Containerized and deployed data processing applications using Docker and Kubernetes (EKS on AWS, GKE on GCP), ensuring scalability and reproducibility.
Built scalable ML workflows using Databricks on AWS and GCP, facilitating collaborative data science and engineering workflows.
Established data governance frameworks using AWS Glue Data Catalog and Google Cloud Dataplex, ensuring data integrity, security, and lineage tracking.
Environment: AWS Lambda, AWS Step Functions, Amazon S3, AWS Lake Formation, MongoDB, AWS IAM, Apache Spark, Snowflake, Kafka Streams, Kafka Connect, Apache Airflow, MySQL, Amazon Athena, Python, Jenkins, Matplotlib, Amazon EMR, Amazon RDS, Apache Hadoop.
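A minimal sketch of the BigQuery partitioning and clustering tuning described in this section, using the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical:

```python
# Create a daily-partitioned, clustered BigQuery table so that date-bounded
# queries scan only matching partitions, which is where the cost savings come from.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.transactions",  # hypothetical fully qualified table id
    schema=[
        bigquery.SchemaField("txn_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
        bigquery.SchemaField("txn_ts", "TIMESTAMP"),
    ],
)
# Daily time partitioning prunes scanned bytes for date-filtered queries...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="txn_ts",
)
# ...and clustering co-locates rows on a common filter/join column.
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```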
ADP, Hyderabad, India
Data Engineer, Mar 2020 - Dec 2023
Responsibilities:
Developed and maintained scalable data pipelines and data lakes leveraging Azure Synapse Analytics, optimizing data integration for analytics and reporting.
Led the design and development of ETL frameworks using Azure Data Factory (ADF) and Apache Airflow, improving operational efficiency.
Implemented ETL processes using Azure SQL Database to automate data transformation and streamline workflows across business systems.
Engineered a Lakehouse architecture on Azure Databricks using Delta Lake to provide ACID compliance and unified batch/streaming capabilities for financial data.
Implemented the Medallion Architecture (Bronze, Silver, Gold) to transform raw on-prem SQL ingestion into curated, high-quality datasets for downstream analytics.
Optimized Spark performance by implementing Z-Ordering, data skipping, and broadcast joins, reducing query latency by 35% and minimizing cloud compute costs.
Developed scalable ETL pipelines using PySpark and Spark SQL within Databricks notebooks, orchestrated via Azure Data Factory (ADF).
Leveraged Databricks Auto Loader (cloudFiles) for efficient, incremental ingestion of multi-terabyte datasets from Azure Data Lake Storage (ADLS Gen2); see the sketch following this section.
Utilized Delta Lake Time Travel and schema enforcement to ensure data auditability and prevent pipeline failures from source-system schema drift.
Managed data governance and fine-grained access control through Unity Catalog, ensuring HIPAA and financial regulatory compliance.
Automated workspace deployments and job scheduling using Databricks Asset Bundles (DABs) and Terraform within Azure DevOps CI/CD pipelines.
Resolved data skew and "small file" problems by fine-tuning shuffle partitions and running OPTIMIZE/VACUUM commands for storage health.
Integrated MLflow within Databricks to track experimentation and streamline the deployment of predictive models to production environments.
Coordinated version control using Bitbucket, facilitating smooth collaboration and code integration across cross-functional development teams.
Applied Azure Functions to automate serverless data workflows, reducing infrastructure management overhead and increasing operational efficiency.
Conducted real-time data processing using Apache Spark, improving processing speed and scalability of big data applications.
Built and maintained BigQuery data marts from multiple on-prem and cloud sources using Dataform and Cloud Composer, supporting regulatory and compliance reporting for financial data.
Designed data ingestion architecture using Pub/Sub, Dataflow, and BigQuery, processing over 10 million transactions daily with SLA guarantees.
Migrated legacy ETL pipelines to GCP-native tools, replacing Talend-based workflows with Dataform and BigQuery SQL for better performance and maintainability.
Deployed reusable DAG templates in Cloud Composer to standardize data pipeline development across finance and reporting teams.
Developed Python-based scripts for metadata validation, schema evolution checks, and lineage tracking, integrated into Composer pipelines.
Contributed to onboarding and mentorship of junior engineers on Composer DAG development, BigQuery optimization techniques, and CI/CD pipeline practices using GitHub Actions.
Applied GCP security best practices, managing data access with fine-grained IAM roles and implementing encryption via CMEK on GCS and BigQuery datasets.
Designed and deployed distributed data processing pipelines using Apache Airflow and Dataproc, improving scalability and performance.
Enforced data security policies across cloud platforms, implementing role-based access control (RBAC) to restrict unauthorized access to sensitive data.
Utilized Azure Stream Analytics to process streaming data in real time, providing up-to-date insights for operational decision-making.
Architected distributed data processing solutions with PySpark, enabling efficient parallel processing of large datasets across multiple nodes.
Implemented data architecture strategies, ensuring a robust, scalable infrastructure that handles complex data workflows and business requirements.
Collaborated with business teams to define business intelligence requirements and design solutions that deliver insights from large, diverse datasets.
Developed and maintained scalable data lakes using Azure Synapse and GCP BigQuery, optimizing data integration for analytics and reporting.
Managed data lakes for optimal storage and retrieval of diverse datasets, ensuring fast data processing and compliance with retention policies.
Performed advanced data processing tasks using Pandas and NumPy, ensuring efficient data transformations and statistical analyses.
Migrated on-premises SQL Server databases to Snowflake, optimizing data warehousing and query performance; automated ETL using ADF, dbt, and Snowflake Streams.
Used PowerShell to automate infrastructure provisioning and manage cloud resources, reducing manual intervention and improving system reliability.
Designed and implemented data quality frameworks, using automated testing to ensure clean, accurate, and consistent data throughout the pipeline.
Led cross-functional collaboration with development, data science, and business teams, ensuring alignment between data engineering and business objectives.
Established data governance practices for metadata management, ensuring data integrity, traceability, and compliance across cloud storage solutions.
Integrated machine learning models into the data pipeline, using data from Azure SQL Database to generate predictive analytics and insights.
Enabled real-time analytics by streamlining data flow from multiple sources using Azure Data Lake Storage and processing it with Spark.
Designed and implemented Azure Data Factory pipelines to migrate large datasets from SQL Server to Azure Synapse Analytics, ensuring seamless data transformation and validation using Azure Databricks (PySpark).
Environment: Azure Synapse Analytics, Azure SQL Database, ADLS, Azure Functions, Apache Spark, Power BI, Azure DevOps, Bitbucket, Azure Stream Analytics, PySpark, PowerShell, Pandas, NumPy, SQL, Role-Based Access Control (RBAC).
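A minimal sketch of the Auto Loader (cloudFiles) ingestion pattern described above, assuming a Databricks runtime where `spark` is predefined; the ADLS paths and target table are hypothetical placeholders:

```python
# Incrementally ingest raw files from ADLS into a Bronze Delta table,
# per the Medallion pattern; Auto Loader tracks already-processed files.
bronze_stream = (
    spark.readStream
    .format("cloudFiles")                      # Databricks Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation",
            "abfss://lake@storageacct.dfs.core.windows.net/_schemas/txns")
    .load("abfss://lake@storageacct.dfs.core.windows.net/raw/txns")
)

(
    bronze_stream.writeStream
    .format("delta")                           # Bronze layer lands as Delta
    .option("checkpointLocation",
            "abfss://lake@storageacct.dfs.core.windows.net/_chk/bronze_txns")
    .trigger(availableNow=True)                # incremental, batch-style run
    .toTable("lakehouse.bronze_txns")
)
```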
Godrej Group, India
Data Engineer, Jan 2016 - Dec 2019
Responsibilities:
Designed data ingestion pipelines using AWS S3 as centralized data lake storage for structured and semi-structured datasets.
Provisioned and managed AWS EC2 instances to support ETL and batch processing workloads.
Implemented IAM roles and security policies to enforce secure access to S3 buckets and EMR clusters.
Configured AWS EMR clusters for distributed Spark and Hive processing of large-scale retail datasets.
Built data loading pipelines from S3 into Amazon Redshift for enterprise reporting and analytics use cases.
Optimized Redshift performance using distribution styles, sort keys, and vacuum operations.
Automated AWS infrastructure provisioning using CloudFormation templates for repeatable deployments.
Implemented monitoring and alerting using Amazon CloudWatch for ETL job health and cluster utilization.
Developed secure data transfer mechanisms between on-prem systems and AWS using SFTP and AWS DataSync.
Optimized SQL and PySpark jobs, reducing processing times by 25-30%.
Architected and managed large-scale Hadoop clusters, leveraging the HDFS ecosystem to store and process multi-terabyte datasets from diverse retail and manufacturing sources.
Developed end-to-end ETL pipelines using Informatica PowerCenter and Talend to ingest high-volume data into HDFS and Hive, ensuring 99.9% data availability.
Optimized Hive performance by implementing partitioning, bucketing, and ORC/Parquet file formats, resulting in a 40% reduction in query execution time for business analysts (see the sketch following this section).
Engineered data ingestion pipelines using Sqoop for structured data transfer from RDBMS (Oracle/MySQL) and Flume for capturing real-time log data into the data lake.
Built scalable big data batch jobs in Talend, leveraging native Hadoop components (tHDFSOutput, tHiveLoad) to streamline data movement from legacy systems.
Implemented Change Data Capture (CDC) logic using Informatica to ensure real-time synchronization between production databases and the Hadoop cluster.
Orchestrated multi-stage workflows using Apache Oozie and Informatica Workflow Manager, ensuring reliable scheduling of complex data dependencies and automated error handling.
Enhanced data quality and profiling by utilizing Informatica Data Quality (IDQ) to cleanse source data, reducing data errors in executive reporting by 25%.
Managed resource allocation across the cluster using YARN, optimizing memory and CPU utilization to support concurrent high-volume workloads without performance degradation.
Designed and maintained HBase schemas for low-latency, random-access lookups, providing faster data retrieval for customer-facing retail applications.
Automated cluster monitoring and health checks using Apache Ambari and shell scripting, proactively identifying and resolving node failures before they impacted production.
Collaborated with cross-functional teams to translate complex business requirements into technical specifications, ensuring the data architecture supported long-term growth.
Extensive experience in designing and implementing data integration solutions using Informatica PowerCenter for extracting, transforming, and loading data from various sources to target systems.
Proficient in developing ETL mappings, transformations, and workflows using Informatica PowerCenter to ensure accurate and timely data delivery.
Strong understanding of data integration best practices and data quality principles to ensure the integrity and consistency of data throughout the ETL process.
Hands-on experience implementing Master Data Management (MDM) solutions using Informatica MDM to consolidate and manage enterprise-wide master data.
Proficient in configuring and customizing Informatica MDM Hub and Data Director for data governance, data stewardship, and data quality management.
Implemented data profiling and data cleansing techniques using Informatica Data Quality (IDQ) to identify and resolve data quality issues within source systems.
Designed and implemented data validation rules, match-and-merge strategies, and data standardization processes using Informatica MDM for accurate and reliable master data management.
Optimized ETL workflows by fine-tuning PySpark code and improving data processing efficiency, reducing execution times and resource consumption.
Proficient in integrating Informatica MDM with other systems such as CRM, ERP, and data warehouses to ensure consistent master data across the enterprise.
Implemented data synchronization processes between source systems and the Informatica MDM Hub to ensure data consistency and accuracy in real-time or batch mode.
Developed custom data validation and enrichment rules using Informatica MDM to improve data quality and completeness.
Collaborated with business stakeholders and data stewards to define data governance policies, data ownership, and data stewardship processes within Informatica MDM.
Implemented data profiling and data quality scorecards using Informatica Data Quality to measure and monitor the quality of master data.
Developed and optimized Hadoop-based data pipelines to process and analyze large volumes of structured and unstructured data, improving data accessibility and performance.
Proficient in designing and implementing data hierarchies, relationships, and data domains within Informatica MDM to ensure accurate and consistent master data representation.
Implemented data deduplication and consolidation strategies using Informatica MDM for effective management of duplicate or redundant data.
Proficient in performance tuning and optimization techniques within Informatica MDM and PowerCenter to improve data processing speed and efficiency.
Stayed current with the latest developments in Informatica and MDM, exploring new features and functionality and actively participating in Informatica community forums and user groups.
Environment: Azure (ADF, Azure SQL, Azure Functions), Hadoop, Hive, Sqoop, HDFS, MapReduce, SQL, Informatica, MDM, ETL, CRM, Data Quality, Data Governance, Informatica PowerCenter.
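A minimal sketch of the Hive table layout tuning credited with the 40% query-time reduction above (partitioning, bucketing, ORC), expressed as Spark SQL with Hive support enabled; the database, table, and columns are hypothetical:

```python
# Define a partitioned, bucketed, ORC-backed Hive table via Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-layout-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS retail")
spark.sql("""
    CREATE TABLE IF NOT EXISTS retail.sales_orc (
        order_id STRING,
        store_id STRING,
        amount   DECIMAL(12, 2)
    )
    PARTITIONED BY (order_date DATE)         -- partition pruning on date filters
    CLUSTERED BY (store_id) INTO 32 BUCKETS  -- bucketing tames joins and skew
    STORED AS ORC                            -- columnar format cuts scan I/O
""")
```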
BNY Mellon Technologies Pvt. Ltd., Chennai, India
Data Engineer, Jul 2014 - Dec 2015
Responsibilities:
Contributed to developing automated data processing tasks utilizing Cloud Functions, improving efficiency and scalability.
Utilized GCP Dataproc to process large-scale datasets for analytical purposes, integrating them seamlessly with GCS and BigQuery.
Designed and deployed distributed data processing pipelines using Apache Airflow and Dataproc for enhanced performance and scalability.
Optimized SQL queries to improve data retrieval performance in BigQuery for reporting and business analysis needs.
Developed and optimized ETL workflows using Talend to extract, transform, and load large datasets across various systems.
Designed and implemented distributed data processing pipelines in Google Cloud Dataflow and Apache Beam, delivering near real-time analytics to BigQuery.
Developed robust data ingestion workflows using Pub/Sub, GCS, and BigQuery, enabling streaming ingestion with failover recovery and dead-letter queue (DLQ) handling.
Converted ETL logic from Spark/Scala into modular SQL using Dataform, supporting schema versioning and automated testing with Git-based workflows.
Configured secure storage solutions on GCS, with bucket lifecycle management and encryption at rest using Google-managed keys.
Implemented CI/CD pipelines for data transformations using GitHub, Cloud Build, and Terraform, enabling version-controlled deployments of BigQuery datasets and Composer DAGs.
Provided production support for GCP Composer-managed workflows and optimized BigQuery usage patterns for cost control and performance efficiency.
Assisted in managing real-time data pipelines with Dataflow, ensuring efficient data processing and transformation in real time.
Collaborated with team members to build and deploy large-scale data processing jobs using Apache Spark to handle complex datasets.
Worked alongside senior engineers to integrate and maintain distributed data processing systems using Hadoop for data scalability.
Employed Apache Airflow for orchestrating and scheduling data workflows, ensuring streamlined data pipeline management.
Ensured data quality by implementing data integrity checks and monitoring across various platforms.
Contributed to data transformation initiatives, optimizing the handling of diverse formats and sources.
Assisted in querying and optimizing datasets to support faster insights and decision-making for business stakeholders.
Monitored and improved system performance using Stackdriver, proactively identifying and resolving performance issues.
Assisted in implementing data governance practices to ensure secure and compliant data management and access control.
Actively participated in Agile and Kanban teams, following best practices for project management and ensuring timely delivery of tasks.
Contributed to system log analysis and troubleshooting, identifying performance bottlenecks and suggesting solutions.
Designed scalable pipelines in Google Cloud Dataflow and orchestrated infrastructure provisioning using Terraform to ensure repeatable, compliant, and scalable deployments for real-time analytics workloads.
Implemented end-to-end data movement from Cloud Storage to BigQuery using GCP-native services, enabling faster analytics for executive dashboards.
Ensured scalable data architecture by assisting in developing cloud-based data processing solutions and managing data in GCP Cloud Storage.
Collaborated with senior engineers to optimize data pipelines for higher throughput and lower latency in large-scale data environments.
Created and maintained Scala scripts, utilizing functional programming to enhance data processing efficiency.
Leveraged GitHub for version control, ensuring collaborative coding practices and maintaining codebase integrity throughout development.
Supported business intelligence initiatives by helping with data modeling to meet reporting and analysis requirements.
Engaged in cost-effective data processing strategies by optimizing resources and balancing performance with budget constraints.
Environment: Talend, Dataflow, Apache Spark, Cloud Functions, Scala, GCP Cloud Storage, Dataproc, BigQuery, Hadoop, Apache Airflow, Stackdriver, GitHub, Agile, Kanban, SQL Optimization.

Education:
Master's in Computer Science, St. Francis College, Brooklyn, NY, USA (Jan 2024 - Aug 2025)
Bachelor of Technology (B.Tech) in Computer Science, Andhra University, Visakhapatnam, Andhra Pradesh, India (Jul 2010 - Mar 2014)