Sai Gowtham Kalluri - Senior Data Engineer
[email protected]
Location: Remote, USA
Relocation:
Visa: H1B
Sai Gowtham
Senior Data Engineer
[email protected] | 716-994-6974 | LinkedIn


Professional Summary
Senior Data Engineer with 10+ years of experience designing, building, and optimizing enterprise-scale ETL/ELT pipelines, data lakehouse architectures, and real-time streaming solutions across AWS and Azure.
Architected and delivered production-grade data pipelines and analytics platforms on Azure and AWS, leveraging Databricks, ADF, AWS Glue, Redshift, S3, Synapse, and dbt, enabling secure, scalable, and cost-optimized data automation at enterprise scale.
Designed and implemented end-to-end ETL/ELT pipelines integrating ADF, Glue, EMR, and dbt, ingesting on-prem and cloud data into Delta Lake, S3, and Snowflake/Redshift/Synapse warehouses.
Engineered Medallion (Bronze/Silver/Gold) data lakehouse architectures using Delta Lake, Apache Iceberg, Apache Hudi, ADLS Gen2, and S3, enabling distributed processing of structured and unstructured data with optimized partition pruning and query performance.
Built Python, PySpark, Spark, and dbt transformation pipelines in Databricks, AWS Glue, and ADF, leveraging Spark SQL, Delta Live Tables, pandas, and NumPy for large-scale transformations, aggregations, and feature engineering (a simplified sketch follows this summary).
Implemented data quality and observability frameworks with Great Expectations, dbt tests, SQL validations, and PySpark exception handling, ensuring schema enforcement, referential integrity, and SLA compliance across all data environments.
Developed and optimized complex SQL, T-SQL, and HiveQL queries, stored procedures, views, and window functions in SQL Server, Redshift, Synapse, Snowflake, and BigQuery, supporting high-performance analytical and operational workloads at scale.
Designed Star Schema, Snowflake Schema, and Data Vault 2.0 models for data marts and OLAP solutions, applying CDC, SCD Type 1/2/3, normalization, and de-normalization strategies for optimal performance.
Delivered real-time BI dashboards and self-serve analytics using Power BI, Tableau, Amazon QuickSight, and Looker, integrating curated gold-layer datasets to drive data-driven decisions across finance, risk, and operations.
Automated data pipeline orchestration and real-time streaming via Apache Airflow (MWAA), Apache Kafka, AWS Kinesis, ADF, Step Functions, and Bash scripting, ensuring SLA-compliant scheduling, monitoring, alerting, and fault-tolerant recovery across 50+ production pipelines.
Implemented data security and governance frameworks with AWS KMS, Azure Key Vault, IAM, Lake Formation, and Apache Atlas, enabling enterprise-grade encryption, PHI/PII masking, RBAC, and audit trails compliant with HIPAA, SOX, PCI DSS, and GDPR.
Integrated AWS IAM/SSO, Azure Active Directory (AAD), and ADFS for centralized identity governance, ensuring least-privilege access, seamless SSO, and secure cross-cloud resource management.
Built CI/CD pipelines using GitHub Actions, Azure DevOps, and AWS CodePipeline, automating build, data quality tests, and infrastructure deployments (Terraform/IaC) with zero-downtime rollback and monitoring.
Developed event-driven data microservices on AWS Lambda, Amazon EKS, and Azure AKS, integrating Apache Kafka, AWS SNS/SQS, and Azure Service Bus for asynchronous messaging.
Conducted performance tuning and cost optimization for Snowflake, Redshift, Synapse, BigQuery, Databricks, and Spark clusters, applying partition pruning, Z-ordering, indexing, caching, and auto-scaling strategies, reducing query costs by up to 40% and ensuring SLA compliance.
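
A minimal sketch of the kind of bronze-to-silver PySpark promotion summarized above; the paths, columns, and quality rules are hypothetical and simplified, not taken from any specific engagement:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("bronze_to_silver_claims")
         .getOrCreate())

# Read raw bronze data previously landed by the ingestion pipeline.
bronze = spark.read.format("delta").load("s3://datalake/bronze/claims")

# Apply basic cleansing, typing, and de-duplication on the way to silver.
silver = (bronze
          .dropDuplicates(["claim_id"])
          .filter(F.col("claim_amount").isNotNull())
          .withColumn("claim_date", F.to_date("claim_date"))
          .withColumn("ingest_ts", F.current_timestamp()))

# Write the curated silver table, partitioned for downstream pruning.
(silver.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("claim_date")
 .save("s3://datalake/silver/claims"))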

Technical Skills:
Programming & Frameworks: Python, SQL, PL/SQL, Scala, R, Bash/Shell Scripting, HiveQL, Spark SQL, YAML
Data Processing & Analytics: Apache Spark, PySpark, Spark SQL, Spark Streaming, Databricks, dbt (Data Build Tool), Delta Live Tables, Great Expectations, Pandas, NumPy, Apache Flink
Cloud Platforms - AWS: Glue, EMR, Redshift, S3, Lake Formation, Lambda, EKS, Kinesis, MSK, Step Functions, Aurora, DynamoDB, Athena, CloudWatch, X-Ray, IAM, KMS
Azure: Synapse Analytics, Data Factory (ADF), Databricks, ADLS Gen2, Key Vault, AKS, DevOps, Monitor, Event Hubs, Service Bus, Azure ML
GCP: BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Storage, Composer, Cloud Spanner
Data Engineering & ETL: AWS Glue, Azure Data Factory (ADF), dbt (Data Build Tool), Apache Airflow (MWAA), AWS Step Functions, Informatica PowerCenter, SSIS, Talend, Fivetran, NiFi, SnapLogic, DataStage, PolyBase, Oozie
Big Data & Streaming: Apache Kafka, AWS Kinesis, Spark Streaming, Apache Flink, AWS MSK, Azure Event Hubs, Azure Service Bus, Apache Hadoop (HDFS, Hive, Pig, Sqoop), Amazon EMR, Databricks, Dataproc
Databases & Data Warehousing: Snowflake, Amazon Redshift, Azure Synapse Analytics, Google BigQuery, Delta Lake, Apache Iceberg, Apache Hudi, PostgreSQL, MySQL, Oracle, SQL Server, DynamoDB, MongoDB, Cassandra, Aurora PostgreSQL, Azure Cosmos DB
Data Modeling: Star Schema, Snowflake Schema, Medallion Architecture (Bronze/Silver/Gold), Data Vault 2.0, Dimensional Modeling (OLAP/OLTP), Change Data Capture (CDC), Slowly Changing Dimensions (SCD Types 1/2/3), Data Mesh
Visualization & Analytics: Tableau, Power BI, Amazon QuickSight, Looker, Google Data Studio, Plotly Dash, Matplotlib, Seaborn
Monitoring & Compliance: Amazon CloudWatch, AWS X-Ray, Azure Monitor, Application Insights, Splunk, ELK Stack (Elasticsearch, Logstash, Kibana), Grafana, Datadog | HIPAA, SOX, PCI DSS, GDPR, HL7, FDA | AWS KMS, Azure Key Vault, IAM, Lake Formation, RBAC, PHI/PII Masking, Data Encryption
Version Control & Agile: Git, GitHub Actions, Azure DevOps, AWS CodePipeline, Jenkins, GitLab CI/CD, Docker, Kubernetes (EKS/AKS), Terraform, Bitbucket, JIRA, Confluence, Agile (Scrum/Kanban)


CERTIFICATIONS
AWS Data Engineer Associate
Microsoft Data Engineer Associate
Python Programming

Client: Nationwide Mutual Insurance - Columbus, OH (Remote) May 2024 - Present
Role: Sr. Data Engineer
Responsibilities:
Architected cloud-native data platforms using AWS cloud services, leading cloud migration of legacy on-prem systems to AWS, reducing infrastructure costs by 35% and enabling petabyte-scale processing for enterprise insurance workloads.
Designed and implemented end-to-end ETL and ELT pipelines using Spark, EMR, Glue, and Airflow to ingest, transform, and curate 500M+ daily insurance records with sub-hourly SLA compliance across underwriting, claims, and compliance data domains.
Built data lakehouse architecture leveraging Delta Lake, S3, and AWS Lake Formation for distributed processing of structured and unstructured insurance data.
Architected centralized enterprise data warehouse on Amazon Redshift and Snowflake supporting regulatory compliance, financial auditing, and business intelligence workloads.
Developed real-time streaming pipelines with Apache Kafka and AWS Kinesis, powering underwriting, fraud detection, and event-driven analytics processing 1M+ events/minute with sub-second latency at enterprise scale (see the streaming sketch after this section).
Built event-driven Lambda architectures for secure ingestion of external financial and third-party data, maintaining data integrity and regulatory compliance under SOX and PCI DSS standards.
Deployed containerized microservices using Docker and Amazon EKS, delivering real-time customer scoring, dynamic pricing, and policy recommendation APIs with horizontal scalability.
Strengthened data security and governance through AWS IAM, implementing role-based access, encryption, audit trails, and least-privilege policies aligned with risk and compliance frameworks.
Architected and optimized Amazon Aurora PostgreSQL clusters with multi-AZ failover, read replicas, and automated backups, ensuring 99.9% uptime and supporting 1M+ daily transactions with zero data loss.
Provided Level 3 support across AWS services (EC2, Lambda, Redshift, Aurora), performing root cause analysis, config tuning, and incident resolution, reducing MTTR by 42% through proactive monitoring.
Implemented monitoring and alerting using Amazon CloudWatch and AWS X-Ray, tracking pipeline SLAs, data anomalies, and transformation failures via custom metrics and automated recovery workflows.
Delivered interactive dashboards and KPIs via Tableau, Amazon QuickSight, and Power BI, enabling real-time visibility into claims processing, risk metrics, and financial performance.
Developed dbt transformation models automating silver- and gold-layer aggregations, reducing manual SQL workloads by 60%; also built secure REST APIs with FastAPI and Flask for personalization, risk assessment, and campaign analytics services.
Managed metadata, lineage, and schema governance through AWS Glue Data Catalog and Apache Atlas, enhancing data discoverability, auditability, and change management across 200+ datasets and 50+ data pipelines.
Automated CI/CD pipelines using GitHub Actions and AWS CodePipeline, integrating unit tests, data quality checks, and infrastructure deployments for seamless delivery of data products and services.
Environment: AWS (Glue, EMR, Redshift, Lambda, Step Functions, Aurora, RDS, DynamoDB, Kinesis, MSK, EKS, CloudWatch, IAM), Snowflake, Python, PySpark, Spark SQL, SQL, Airflow, dbt, Kafka, FastAPI, Flask, Tableau, Power BI, QuickSight, Docker, Terraform, GitHub Actions, CI/CD, Data Lakehouse, Governance, Security, Compliance.
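
A simplified Structured Streaming sketch of the Kafka-based ingestion referenced above; the broker, topic, schema, and storage paths are hypothetical placeholders rather than production values:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("claims_event_stream").getOrCreate()

# Expected JSON payload of each Kafka message (illustrative schema).
event_schema = StructType([
    StructField("policy_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Subscribe to the claims event topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "claims-events")
       .option("startingOffsets", "latest")
       .load())

# Parse the message value and flatten the JSON fields.
events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Continuously append parsed events to the bronze Delta table.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://datalake/_checkpoints/claims-events")
         .outputMode("append")
         .start("s3://datalake/bronze/claims_events"))

query.awaitTermination()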

Client: Ally Financial - Detroit, MI. Jan 2023 - Apr 2024
Role: Sr. Azure Data Engineer
Responsibilities:

Designed and implemented Azure Data Factory (ADF) pipelines and Databricks PySpark workflows to ingest, cleanse, and transform multi-source financial data into Azure Data Lake Storage (ADLS Gen2) and Azure Synapse Analytics.
Developed reusable ETL frameworks in PySpark and SQL, improving pipeline development velocity by 40% and reducing end-to-end data latency by 30%, enabling near real-time financial reporting.
Built Medallion architecture with curated bronze, silver, and gold zones, enabling standardized aggregation, data lineage tracking, and versioned schema evolution.
Integrated and optimized Azure Synapse Analytics for large-scale financial data modeling, implementing partitioning, clustering, and indexing, accelerating executive dashboard query performance by 5x and cutting compute costs through right-sized distribution and result-set caching strategies.
Designed and deployed dimensional data models (Star and Snowflake schemas) for equity and portfolio analytics, supporting Tableau and Power BI reporting layers.
Implemented logical ontologies and relationship mappings to represent customer-to-account and account-to-transaction hierarchies across multiple source systems.
Automated CI/CD pipeline deployments through Azure DevOps, integrating build validation, artifact publishing, and environment promotion for Data Factory and Databricks.
Built custom logging and monitoring frameworks leveraging Azure Monitor, Application Insights, and Log Analytics for operational visibility and SLA adherence.
Implemented data quality checks using PySpark validation rules and ADF metadata-driven control tables, ensuring referential integrity and auditability across ingestion layers (a simplified rule-driven check is sketched after this section).
Orchestrated containerized workloads and data microservices using Azure Kubernetes Service (AKS) for scalable job execution and dependency isolation.
Collaborated with cross-functional data science and analytics teams to operationalize feature pipelines for predictive modeling using Databricks MLflow and Delta Live Tables.
Worked closely with infrastructure teams to optimize FinOps costs, tuning cluster autoscaling, storage lifecycle policies, and right-sizing compute resources.
Collaborated with risk and analytics teams to standardize financial terminology and entity definitions, supporting regulatory and enterprise reporting.
Led knowledge-sharing sessions on best practices in ADF orchestration, Databricks optimization, and Synapse performance tuning, upskilling 8+ engineers and improving team-wide productivity and pipeline code reuse by 25%.
Environment: PySpark, Python, Spark SQL, T-SQL, SQL, Power BI, Tableau, GitHub, Azure DevOps, ETL/ELT, Data Lakehouse, Delta Tables, Dimensional Modeling (Star/Snowflake), CI/CD, Data Quality Frameworks, RBAC, FinOps Optimization, Performance Tuning.
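
A simplified sketch of the rule-driven PySpark quality checks referenced above; in the actual setup the rules came from ADF control tables, so the static rule list, table path, and column names shown here are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()

# Curated silver dataset to be validated (path is illustrative).
df = spark.read.format("delta").load("abfss://silver@datalake.dfs.core.windows.net/transactions")

# Rule set; in production these would be read from a metadata control table.
rules = [
    ("transaction_id_not_null", F.col("transaction_id").isNotNull()),
    ("amount_non_negative",     F.col("amount") >= 0),
    ("valid_currency",          F.col("currency").isin("USD", "CAD", "EUR")),
]

# Count violations per rule and fail the pipeline run if any are found.
failures = {}
for rule_name, condition in rules:
    violations = df.filter(~condition).count()
    if violations:
        failures[rule_name] = violations

if failures:
    raise ValueError(f"Data quality checks failed: {failures}")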

Client: Blue Cross Blue Shield - Chicago, IL. July 2021 - Dec 2022
Role: Data Engineer
Responsibilities:
Designed and developed scalable ETL pipelines using Python and PySpark to process healthcare claims, enrollment, and clinical datasets for analytics and regulatory reporting in HIPAA-regulated environments.
Built high-volume batch and near real-time ingestion pipelines using AWS Glue and PySpark to process healthcare data reliably while meeting audit, compliance, and operational SLA requirements.
Implemented end-to-end ETL workflows integrating PostgreSQL, MySQL, HL7 feeds, and flat files into Snowflake and cloud data warehouses for enterprise healthcare analytics use cases.
Developed reusable Python ingestion frameworks to standardize parsing, validation, and transformation of CSV, JSON, and Excel healthcare data containing protected health information.
Designed and optimized relational schemas and Snowflake tables to support claims adjudication, member analytics, and provider performance reporting across payer systems.
Executed large-scale analytical queries using AWS Athena and Snowflake to improve reporting performance and reduce costs through partitioning and query optimization techniques.
Built and maintained curated data warehouse layers to support population health analytics, utilization reporting, and operational insights for healthcare business stakeholders.
Delivered analytics-ready datasets consumed by Tableau, Power BI, and QuickSight dashboards for monitoring claims processing SLAs, quality metrics, and compliance indicators.
Implemented data quality, validation, and reconciliation frameworks to ensure accuracy, completeness, and audit readiness across healthcare ETL pipelines and reporting layers.
Optimized PySpark transformation logic using partitioning, join optimization, and caching strategies, reducing end-to-end pipeline runtimes by 22% and cutting AWS Glue DPU costs by 18%.
Automated ETL scheduling, monitoring, and failure handling workflows to improve pipeline reliability, observability, and SLA adherence in regulated healthcare environments.
Integrated CI/CD pipelines using Jenkins and GitLab to automate deployment of Spark jobs and data pipeline changes with controlled releases and rollback support.
Enforced HIPAA-compliant data security controls including encryption, RBAC, audit logging, and PHI masking across data storage and processing layers (a simplified masking step is sketched after this section).
Supported production healthcare data pipelines by performing root cause analysis, performance tuning, and incident resolution for critical regulatory reporting workflows.
Authored technical documentation and operational runbooks detailing ETL architecture, data flows, compliance controls, and healthcare data engineering best practices.
Environment: Python, PySpark, Spark SQL, Pandas, NumPy, SQL, Shell Scripting, AWS Glue, AWS Athena, AWS S3, SQLAlchemy, Snowflake, PostgreSQL, MySQL, MongoDB, Apache Spark, AWS ECS, AWS EKS, Docker, Kubernetes, Jenkins, GitLab CI/CD, Linux, HL7, HIPAA Compliance, PHI Security, Data Encryption, RBAC, Audit Logging, Tableau, Power BI, Amazon QuickSight.
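
A simplified sketch of the PHI masking step referenced above; the column names and bucket paths are hypothetical, and the real controls also included storage-layer encryption and RBAC:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("phi_masking").getOrCreate()

# Raw claims extract containing protected health information (illustrative path).
claims = spark.read.parquet("s3://phi-restricted/claims/")

# Hash direct identifiers, redact SSNs, and generalize dates before publishing.
masked = (claims
          .withColumn("member_key", F.sha2(F.col("member_id").cast("string"), 256))
          .withColumn("ssn", F.lit("***-**-****"))
          .withColumn("birth_year", F.year("date_of_birth"))
          .drop("member_id", "member_name", "date_of_birth"))

# Publish only the de-identified view to the analytics-facing zone.
masked.write.mode("overwrite").parquet("s3://analytics-curated/claims_masked/")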

Client: Ericsson - Bangalore, India. Apr 2017 - Dec 2019
Role: Data Engineer - Big Data Developer
Responsibilities:
Designed and implemented data pipelines using Apache Spark, Spark SQL, and PySpark, processing 10TB+ daily network telemetry and event records for real-time analytics and capacity planning on Hadoop and AWS S3-based data lake.
Built streaming data pipelines using Spark Streaming integrated with Kafka, enabling near real-time network performance monitoring, predictive equipment failure alerts, and operational dashboards deployed on on-prem Hadoop and AWS cloud infrastructure.
Developed ETL frameworks in Python and Scala to automate ingestion, transformation, and data validation workflows, improving system throughput by 30% and reducing manual intervention across 15+ data sources.
Optimized database performance by developing and tuning stored procedures, triggers, and views in T-SQL and HiveQL, reducing average query response time by 35% and improving reporting reliability.
Created data models (Star/Snowflake schemas) with schema governance, supporting trend analysis and capacity forecasting visualized in Tableau dashboards and ad-hoc Python notebooks.
Automated weekly data refreshes and validation tasks using Python/Scala scripts, integrating with Oracle databases for production deployment.
Conducted data cleaning, deduplication, and feature engineering using Python libraries including pandas, NumPy, SciPy, and Scikit-learn, ensuring consistent data quality across pipelines.
Performed data quality, validation, and standardization checks using Python (pandas, NumPy) and custom rule engines, ensuring consistency across 50M+ daily network records.
Collaborated with business stakeholders to translate operational KPIs into data-driven dashboards using Tableau, enabling visibility into network quality, service compliance, and operational cost metrics.
Improved ETL architecture through query optimization and Spark job tuning, reducing pipeline runtime by 25% and ensuring SLA compliance for 10TB+ daily telemetry processing.
Participated in the SSIS package lifecycle (design, deployment, and monitoring) for legacy system integrations and bulk data migrations.
Performed performance troubleshooting and monitoring using Dynamic Management Views (DMVs) and system views to maintain database health and availability.
Created Oozie workflows and Apache Airflow DAGs to automate ETL scheduling, batch jobs, and log extraction; supported early cloud adoption by migrating on-prem Hadoop pipelines to AWS S3, Glue, and EMR (a minimal DAG sketch follows this section).
Managed tasks, Agile workflows, and sprint deliverables using JIRA and Confluence.

Environment: Python (pandas, NumPy, SciPy), Scala, Apache Spark 2.x, PySpark, Spark SQL, Spark Streaming, Kafka, AWS (S3, Glue, EMR), Hadoop (HDFS, Hive, Oozie), Apache Airflow, SQL Server, T-SQL, HiveQL, Oracle, SSIS, Tableau, ETL, Data Modeling, Query Optimization.
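
A minimal Airflow DAG sketch of the daily scheduling pattern referenced above; the DAG id, schedule, and task commands are placeholders rather than the production definitions:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="telemetry_daily_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 2 * * *",   # run at 02:00 daily
    catchup=False,
) as dag:
    # Pull the previous day's raw logs from the landing zone.
    extract = BashOperator(
        task_id="extract_logs",
        bash_command="python /opt/etl/extract_logs.py --date {{ ds }}",
    )
    # Transform and aggregate telemetry with Spark.
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/etl/transform_telemetry.py --date {{ ds }}",
    )
    # Load the results into partitioned Hive tables.
    load = BashOperator(
        task_id="load_to_hive",
        bash_command="python /opt/etl/load_hive_partitions.py --date {{ ds }}",
    )

    extract >> transform >> load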

Client: Intel Corporation - Bangalore, India. Aug 2014 - Mar 2017
Role: Data Engineer - ETL
Responsibilities:
Designed and developed ETL workflows in Informatica PowerCenter and Talend to extract, transform, and load data from Oracle, SQL Server, and flat-file sources into a centralized data warehouse.
Created staging, fact, and dimension tables using SQL and PL/SQL, applying star and snowflake schema models for analytical reporting.
Developed parameterized mappings, reusable transformations, and lookup strategies to improve ETL reusability and performance.
Implemented incremental data loading and Change Data Capture (CDC) with SCD Type 1/2 strategies, supporting near real-time updates to 10+ operational dashboards with minimal source-system impact (a simplified SCD Type 2 sketch follows this section).
Automated daily batch jobs and data validation checks using Shell scripts and scheduled workflows through Informatica Scheduler and Unix cron jobs.
Conducted data profiling, cleansing, and validation to ensure accuracy, completeness, and consistency of data across reservation, ticketing, and finance systems.
Built and maintained Hive tables on Hadoop for large-scale log and transaction data, improving performance for ad-hoc analysis and historical trend reporting.
Optimized ETL and SQL performance by tuning queries, using partitioning, indexing, and cost-based optimizer hints in Oracle and SQL Server environments.
Designed Talend transformations for data migration projects, integrating external CSV/XML sources and ensuring data integrity through error handling and audit logs.
Supported data quality and reconciliation processes across multiple business domains including booking, ticketing, and inventory management.
Designed Tableau dashboards and reports to visualize KPIs such as booking rates, cancellations, route utilization, and revenue patterns.
Collaborated with business analysts and QA teams to define data validation rules, conduct UAT sessions, and document mappings and data flow diagrams using Power Designer.
Participated in Agile sprints and contributed to continuous improvement of data loading performance, reducing nightly batch windows by 20% through SQL tuning, parallelism, and incremental load strategies.
Collaborated with cross-functional teams including data architects, BI developers, and business analysts to translate requirements into scalable data solutions.
Environment: Informatica PowerCenter, Talend, Hadoop (HDFS, Hive, MapReduce), SQL Server, Oracle, PL/SQL, Python (pandas, NumPy, SciPy), Shell Scripting, Tableau, Power Designer, ETL, Data Modeling, Query Optimization, Data Validation.
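
A simplified sketch of the SCD Type 2 pattern referenced above, expressed as plain SQL issued from a Python wrapper; the production loads ran as Informatica/Talend mappings, and the driver, connection details, and table/column names here are purely illustrative:

import os
import oracledb

# Step 1: close out current dimension rows whose tracked attributes changed.
EXPIRE_CHANGED = """
    UPDATE dim_customer d
       SET d.effective_end_date = SYSDATE,
           d.is_current = 'N'
     WHERE d.is_current = 'Y'
       AND EXISTS (SELECT 1
                     FROM stg_customer s
                    WHERE s.customer_id = d.customer_id
                      AND (s.address <> d.address OR s.segment <> d.segment))
"""

# Step 2: insert a new current version for changed and brand-new customers.
INSERT_NEW_VERSIONS = """
    INSERT INTO dim_customer
        (customer_id, address, segment, effective_start_date, effective_end_date, is_current)
    SELECT s.customer_id, s.address, s.segment, SYSDATE, NULL, 'Y'
      FROM stg_customer s
     WHERE NOT EXISTS (SELECT 1
                         FROM dim_customer d
                        WHERE d.customer_id = s.customer_id
                          AND d.is_current = 'Y')
"""

with oracledb.connect(user="etl_user",
                      password=os.environ["DW_PASSWORD"],
                      dsn="dw-host/ORCLPDB1") as conn:
    with conn.cursor() as cur:
        cur.execute(EXPIRE_CHANGED)
        cur.execute(INSERT_NEW_VERSIONS)
    conn.commit()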

Educational Details:
Master's in Data Science, University at Buffalo, Buffalo, NY, 2021
Bachelor's in Computer Science, Saveetha Engineering College, Chennai, India, 2014
