
Vipula Reddy Veluru - Data Engineer
[email protected]
Location: Remote, USA
Relocation: Open to relocation
Visa: H1B
Vipula Reddy Veluru
Lead Data Engineer
[email protected] | (716) 220-5534 | LinkedIn

PROFESSIONAL SUMMARY

Lead Data Engineer with 12+ years of experience in building scalable, enterprise-grade data pipelines across GCP, Azure, and AWS in healthcare, PBM, marketing technology, IoT, retail, and e-commerce domains.
Strong hands-on expertise with GCP (BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer), designing end-to-end ingestion, transformation, and curation pipelines supporting mission-critical claims, financial, and operational reporting.
Demonstrated capability to design multi-cloud data ecosystems, integrating GCP with Azure (ADF, ADLS, Databricks) and AWS (EMR, Glue, S3) to support hybrid ingestion, cross-platform synchronization, and unified analytics.
Proven ability to engineer PySpark, Python, and Spark SQL based workflows for high-volume structured, semi-structured, and time-series datasets across cloud and on-prem environments.
Extensive experience implementing Medallion Architecture (Bronze/Silver/Gold) using BigQuery, Databricks, and Delta Lake to support standardized, analytics-ready datasets for business and regulatory use.
Strong background in ETL/ELT design, data modeling, large-scale batch & streaming processing, and building metadata-driven frameworks for transforming diverse data sources (claims, identity, sensor data, retail feeds, CRM, POS).
Extensive hands-on experience with dbt Cloud, including job scheduling, environment-based deployments (dev/QA/prod), CI integration with GitHub workflows, and automated testing to support governed, production-grade analytics pipelines.
Strong understanding of EPIC Cogito data models and healthcare interoperability patterns, including integration of provider, clinical, and claims datasets into Snowflake-based analytics platforms.
Extensive experience supporting enterprise data warehouse modernization initiatives, migrating legacy systems into cloud-native Snowflake architectures with scalable ingestion, dimensional modeling, and performance optimization strategies.
Skilled in building orchestration workflows using Airflow/Cloud Composer and implementing CI/CD automation using GitHub Actions, GitLab CI/CD, and Terraform for consistent, version-controlled data pipeline deployments (a minimal DAG sketch follows this summary).
Strong understanding of compliance, governance, and regulatory requirements, ensuring PHI protection, HIPAA/SOX adherence, audit readiness, access controls, and secure handling of sensitive healthcare and financial data.
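
For illustration, a minimal Cloud Composer (Airflow) DAG of the kind described above; the DAG id, schedule, and claims-load callable are hypothetical placeholders, not details from any specific engagement.

```python
# Minimal sketch of a daily orchestration DAG (Airflow 2.x style).
# All names here (dag_id, task_id, load function) are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_claims_feed(**context):
    # Placeholder for an ingestion step, e.g. a GCS-to-BigQuery load.
    print(f"Loading claims feed for {context['ds']}")

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_claims_refresh",     # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 5 * * *",     # daily at 05:00
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="ingest_claims",
        python_callable=load_claims_feed,
    )
```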

TECHNICAL SKILLS

Cloud Platforms: Google Cloud Platform (BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, GCS); Azure (ADF, ADLS, Azure Databricks); AWS (EMR, Glue, Lambda, S3, Redshift)
Data Processing & Frameworks: Python, PySpark, Spark SQL, SQL, Delta Lake, Databricks, Hadoop Ecosystem, HDFS, dbt (Core, Cloud), dbt Testing, dbt Snapshots, dbt Incremental Models
Data Engineering & Architecture: ETL/ELT Pipelines, Medallion Architecture (Bronze/Silver/Gold), Data Lake & Lakehouse Architecture, Batch & Streaming Pipelines, Metadata-Driven Pipelines
Databases, Warehousing & Modeling: BigQuery, Redshift, Hive, SQL Server, MySQL, Snowflake (UDFs, Streams, Snowpipe, Secure Views), Kimball, Inmon, Star/Snowflake Schema, Data Vault 2.0, Semantic Layers, OLAP Cubes
Orchestration & CI/CD: Cloud Composer (Airflow), GitHub Actions, GitLab CI/CD, Terraform, Crontab/Schedulers
Data Quality & Governance: Schema Validation, Lineage Tracking, SLA Monitoring, Drift Detection, Audit Logging, PHI/SOX/HIPAA Compliance Controls
Tools & Utilities: Linux, Shell Scripting, Git, Jira, QlikView, Tableau
Visualization & Analytics: Tableau, Power BI, Amazon QuickSight, Looker, Google Data Studio, Plotly Dash, Matplotlib, Seaborn
Monitoring & Compliance: Splunk, ELK Stack (Elasticsearch, Logstash, Kibana), Azure Monitor, Snowflake Alerts, HIPAA, FDA, GDPR, SOX, HL7
Version Control & Agile: Git, GitHub, Bitbucket, JIRA, Confluence, Agile (Scrum, Kanban), Waterfall

EXPERIENCE

Client: CVS Health, Pittsburgh, PA. Nov 2023 - Present
Role: Lead Data Engineer
This project focused on modernizing CVS's claims, PBM, pharmacy operations, and financial data pipelines by moving from fragmented legacy systems to a unified, scalable Google Cloud Platform environment. Daily and month-end reporting cycles were slowed by inconsistent claims feeds, eligibility mismatches, and siloed pharmacy datasets. I led the design of end-to-end pipelines on BigQuery, Dataflow, Dataproc, and Cloud Composer, while using Azure for specific ingestion, landing, and transformation tasks tied to internal applications. My work centered on establishing reliable, regulatory-ready datasets that supported actuarial modeling, reimbursement accuracy, PBM financial workflows, and clinical/operational insights.
Responsibilities:
Designed and implemented large-scale data pipelines on GCP (BigQuery, Dataflow, Dataproc, Pub/Sub) to consolidate claims adjudication data, eligibility feeds, formulary updates, pharmacy transactions, and PBM financial datasets.
Built the Medallion Architecture (Bronze/Silver/Gold) within BigQuery and Databricks, transforming raw claims and pharmacy feeds into standardized Silver tables and analytics-ready Gold-layer fact models.
Developed PySpark and SQL transformation jobs on Databricks and Dataproc to apply complex healthcare business rules, including NDC/DRG mappings, reversal logic, benefit-tier alignment, and CMS-driven field corrections (a simplified sketch follows this list).
Delivered enterprise-grade dimensional models (fact and dimension tables) to support provider analytics, financial reporting, and regulatory compliance use cases.
Implemented dbt models on top of Snowflake to transform raw claims, eligibility, and pharmacy data into analytics-ready marts aligned with CMS and PBM reporting standards.
Managed cross-cloud data validation between BigQuery and Snowflake to ensure consistency in claims and financial metrics.
Configured and managed dbt Cloud Jobs to schedule claims, eligibility, and pharmacy model builds, snapshots, and tests in alignment with upstream ingestion SLAs.
Implemented automated data loads from BigQuery and GCS into Snowflake using Snowpipe and event-driven triggers.
Automated health checks, lineage tracking, and schema validation using Python to detect late-arriving claims, duplicate adjudications, eligibility gaps, mismatched service dates, and malformed pharmacy records across Bronze and Silver layers (a minimal schema check appears after this role's environment line).
Developed complex SQL transformations and database-side processing logic using Snowflake stored procedures and scripting frameworks.
Optimized BigQuery and Databricks workloads using partition pruning, clustering, Delta Lake optimization, Z-ordering, and materialized views to meet strict SLAs for daily claims refresh and financial month-end close cycles.
Built scalable ingestion pipelines to process structured (claims, eligibility) and semi-structured datasets (JSON, flat files, API extracts) into Snowflake using dbt and Python.
Designed star and snowflake schemas for customer analytics, claims reporting, and financial dashboards.
Integrated provider and clinical datasets (eligibility, claims, pharmacy, and operational feeds) into Snowflake and dbt models to support care quality, reimbursement, and reporting analytics.
Implemented reconciliation and validation logic using procedural SQL to detect data mismatches and ingestion failures.
Built reusable SQL modules for aggregations, deduplication, and historical backfills and used window functions, analytical queries, and CTEs for large-scale reporting and behavioral analysis.
Provided production support across GCP and Azure workloads by resolving Databricks notebook failures, reprocessing corrupted Bronze-layer data, fixing broken clinical/claims feeds, and ensuring PHI-compliant data handling as per HIPAA/SOX standards.
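A simplified PySpark sketch of the reversal and deduplication pattern referenced in the list above; the column names (claim_id, adjudication_ts, status, service_date) and GCS paths are illustrative assumptions, not the production logic.

```python
# Keep only the latest adjudication per claim and drop reversals
# before promoting Bronze claims data to the Silver layer.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_silver").getOrCreate()

# Assumed Bronze-layer landing path.
bronze = spark.read.parquet("gs://example-bucket/bronze/claims/")

latest = Window.partitionBy("claim_id").orderBy(F.col("adjudication_ts").desc())

silver = (
    bronze
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)               # latest adjudication per claim
    .filter(F.col("status") != "REVERSED")  # drop reversed claims
    .drop("rn")
)

silver.write.mode("overwrite").partitionBy("service_date").parquet(
    "gs://example-bucket/silver/claims/"
)
```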
Environment: GCP (BigQuery, Dataflow, Dataproc, Cloud Composer (Airflow), Pub/Sub, GCS), Azure (Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Databricks), PySpark, Python, Spark SQL, Delta Lake, Medallion Architecture (Bronze/Silver/Gold), GitHub, Terraform, Linux, Shell Scripting, Jira.
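
A minimal, standard-library sketch of the Bronze-layer schema checks mentioned above; the expected column list and file name are illustrative assumptions.

```python
# Compare an incoming feed's header against an expected column contract
# and report drift before the file is promoted downstream.
import csv

EXPECTED_COLUMNS = ["claim_id", "member_id", "service_date", "paid_amount"]

def validate_header(path: str) -> list[str]:
    """Return a list of schema problems; an empty list means the feed passes."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    extra = [c for c in header if c not in EXPECTED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems

if __name__ == "__main__":
    for issue in validate_header("claims_feed.csv"):  # assumed file name
        print("SCHEMA DRIFT:", issue)
```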

Client: Acxiom, Austin, TX. June 2021 - Oct 2023
Role: Senior Data Engineer
This project focused on integrating and standardizing massive volumes of customer identity, CRM, web behavior, and advertising data for enterprise clients. With data spread across AWS and GCP systems, the goal was to build reliable, high-speed pipelines that produced accurate identity graphs and real-time audience datasets for marketing activation. My work centered on building Python and PySpark-driven ETL workflows across AWS EMR and GCP Dataproc, automating quality checks, streamlining multi-cloud ingestion, and establishing curated data layers that powered segmentation, attribution, and customer analytics.
Responsibilities:
Built high-volume ETL pipelines using Python, PySpark, and Spark SQL on AWS EMR and GCP Dataproc, processing multi-terabyte CRM, identity, clickstream, and marketing engagement datasets.
Designed multi-cloud ingestion frameworks using AWS Glue, AWS Lambda, Amazon S3, GCP Dataflow, and Pub/Sub to support both batch and streaming ingestion of customer behavior and campaign events.
Led migration of on-prem and legacy Hadoop pipelines to AWS and Snowflake-based architectures.
Developed Python-based identity stitching logic for PII validation, deduplication, fuzzy matching, address normalization, and cross-channel customer linking using rules defined by data governance teams (a toy matching example appears after this role's environment line).
Collaborated with Matillion-based ETL workflows for Snowflake ingestion and transformation, supporting enterprise data warehouse modernization initiatives.
Implemented slowly changing dimensions (SCD Type 1 and Type 2) for historical tracking and built fact and dimension tables optimized for BI and reporting workloads (a condensed Type 2 sketch follows this list).
Tuned Spark workloads by applying partition pruning, broadcast joins, predicate pushdown, and autoscaling policies across both EMR and Dataproc, improving pipeline performance and campaign readiness.
Used dbt documentation and lineage features to improve data transparency for analysts and business users.
Implemented a dbt-based ELT framework on Snowflake to transform raw CRM and clickstream data into dimensional models for attribution and segmentation.
Built curated analytical layers in BigQuery, Snowflake, and Redshift, supporting multi-channel attribution, audience segmentation, marketing reporting, and downstream machine-driven activation systems.
Used dbt Cloud Jobs and environment-based deployments to manage CRM and clickstream transformations across development and production environments.
Designed Snowflake ingestion pipelines using Snowpipe, Streams, and Tasks to load multi-terabyte CRM and clickstream data from S3 into curated analytics tables.
Integrated dbt Cloud CI checks with GitHub workflows to validate modified models and tests on pull requests before promoting changes to production.
Optimized Snowflake query performance by implementing clustering keys, materialized views, and warehouse sizing strategies.
Automated schema drift alerts, lineage tracking, ingestion SLAs, and reconciliation checks using Python utilities integrated with AWS CloudWatch, GCP Logging, and internal alerting frameworks.
Developed modular Python frameworks for data ingestion, transformation, and validation across AWS and Snowflake.
Implemented lineage tracking for Snowflake and BigQuery datasets to support audit and compliance requirements.
Implemented dbt tests, freshness checks, and alerting in dbt Cloud to monitor data quality for regulatory and client reporting datasets.
Developed automated CI/CD workflows for Databricks notebooks and Python packages using GitHub Actions, integrating secure deployment patterns, environment-based validation, and compliance-driven promotion gates.
Designed and optimized PySpark transformations with data quality checks, schema enforcement, and audit-friendly logging to support identity and PII-sensitive marketing datasets.
Provided production support by troubleshooting failures in EMR, Dataproc, Glue, and Dataflow, coordinating with partner teams for fix-ups, and ensuring timely delivery of datasets required for audience builds and marketing operations.
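A condensed sketch of the SCD Type 2 pattern referenced in the list above, using toy in-memory data; the dimension layout, the tracked attribute, and the 9999-12-31 open-ended date convention are assumptions for illustration.

```python
# Expire changed dimension rows and append new current versions (SCD Type 2).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scd2_demo").getOrCreate()

current = spark.createDataFrame(
    [(1, "NY", "2023-01-01", "9999-12-31", True)],
    ["customer_id", "state", "valid_from", "valid_to", "is_current"],
)
incoming = spark.createDataFrame([(1, "TX")], ["customer_id", "state"])

today = F.current_date().cast("string")

# Rows whose tracked attribute changed get closed out...
changed = (
    current.alias("c")
    .join(incoming.alias("i"), "customer_id")
    .where(F.col("c.state") != F.col("i.state"))
)
expired = changed.select(
    "customer_id",
    F.col("c.state").alias("state"),
    F.col("c.valid_from").alias("valid_from"),
    today.alias("valid_to"),
    F.lit(False).alias("is_current"),
)
# ...and the incoming values become the new current versions.
new_versions = changed.select(
    "customer_id",
    F.col("i.state").alias("state"),
    today.alias("valid_from"),
    F.lit("9999-12-31").alias("valid_to"),
    F.lit(True).alias("is_current"),
)
unchanged = current.join(changed.select("customer_id"), "customer_id", "left_anti")

dimension = unchanged.unionByName(expired).unionByName(new_versions)
dimension.show()
```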

Environment: AWS (EMR, Glue, Lambda, S3, Redshift, CloudWatch), GCP (Dataproc, Dataflow, Pub/Sub, BigQuery, Stackdriver), Python, PySpark, SQL, Snowflake, GitLab CI/CD, Terraform.
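
A toy identity-matching example in the spirit of the PII deduplication work above; it uses the standard library's difflib as a stand-in for the production fuzzy-matching rules, and the record fields and 0.85 threshold are illustrative assumptions.

```python
# Normalize name/address strings, then link records whose similarity
# clears a threshold on both fields.
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lowercase, strip simple punctuation, and collapse whitespace."""
    return " ".join(value.lower().replace(".", " ").replace(",", " ").split())

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_identity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Link two CRM records when name and address are both similar enough."""
    return (
        similarity(rec_a["name"], rec_b["name"]) >= threshold
        and similarity(rec_a["address"], rec_b["address"]) >= threshold
    )

a = {"name": "Jon A. Smith", "address": "12 Main St., Austin TX"}
b = {"name": "Jon Smith", "address": "12 Main Street Austin TX"}
print(same_identity(a, b))  # True for this toy pair
```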

Client: Cyient, Hyderabad, India. Apr 2018 - Jul 2020
Role: Senior Data Engineer
This project aimed to improve how industrial IoT data was collected and processed for monitoring equipment health across large manufacturing plants. The older on-prem Hadoop setup struggled with continuous sensor streams from hundreds of machines, so the team began migrating workflows to Azure. My work involved building Azure-based ingestion and transformation pipelines for high-volume telemetry like vibration, temperature, and pressure readings. I focused on cleaning and aligning time-series data, improving sensor data quality, and automating end-to-end workflows so engineering teams could detect anomalies faster and reduce equipment downtime.
Responsibilities:

Built Azure Data Factory (ADF) pipelines to ingest raw telemetry from on-prem PLC/SCADA systems and edge devices into Azure Data Lake Storage (ADLS) for centralized processing.
Developed PySpark transformation jobs in Azure Databricks to cleanse, standardize, and time-align vibration, temperature, load, and pressure readings from large fleets of industrial machines.
Implemented robust data quality logic to handle out-of-order timestamps, missing packets, sudden value spikes, and sensor drift, improving the reliability of incoming time-series feeds (a condensed sketch follows this list).
Designed a structured data architecture separating Raw/Validated/Curated zones, ensuring traceability and consistent contract-level datasets for reliability and operations teams.
Automated routine schema checks, data freshness validation, and feed monitoring using Python, reducing manual review tasks and increasing the dependability of daily sensor data loads.
Partnered with industrial engineers to translate threshold rules, vibration bands, machine cycles, and shutdown indicators into scalable data transformation logic that aligned with equipment behavior.
Coordinated with cloud/network teams to optimize both batch and near real-time ingestion, ensuring uninterrupted delivery of telemetry from factory floors to Azure.
Tuned Databricks jobs using partitioning, caching, and optimized joins to accelerate processing of large sensor datasets and meet strict pipeline SLAs.
Delivered curated datasets for equipment health dashboards, utilization metrics, and anomaly detection workflows, enabling quicker issue identification across manufacturing sites.
Provided production support by resolving ADF failures, correcting malformed device feeds, reprocessing failed batches, and working with sensor/device teams to address recurring data issues.
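A condensed PySpark sketch of the telemetry cleanup described in the list above; the sensor schema, sample readings, and plausibility bounds are illustrative assumptions.

```python
# Deduplicate repeated packets, restore event-time order, and flag
# readings outside a plausibility band.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("telemetry_clean").getOrCreate()

raw = spark.createDataFrame(
    [
        ("sensor-01", "2020-05-01 10:00:05", 72.1),
        ("sensor-01", "2020-05-01 10:00:01", 71.8),   # arrives out of order
        ("sensor-01", "2020-05-01 10:00:05", 72.1),   # duplicate packet
        ("sensor-01", "2020-05-01 10:00:09", 950.0),  # implausible spike
    ],
    ["device_id", "event_time", "temperature_c"],
)

cleaned = (
    raw.withColumn("event_time", F.to_timestamp("event_time"))
    .dropDuplicates(["device_id", "event_time"])  # remove repeated packets
    .withColumn(
        "is_spike",
        (F.col("temperature_c") < -40) | (F.col("temperature_c") > 200),
    )
    .orderBy("device_id", "event_time")           # restore time order
)
cleaned.show(truncate=False)
```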

Environment: Azure Data Factory (ADF), Azure Databricks, Azure Data Lake Storage (ADLS), PySpark, Python, SQL, SCADA/PLC Telemetry Systems, IoT Sensor Data Streams (vibration, temperature, pressure, load), Hive (legacy inputs), Linux, Shell Scripting, Git, Jira, Batch & Streaming Ingestion Pipelines, On-Prem Hadoop Ecosystem (source systems).

Client: Sonata Software, Bangalore, India. June 2015 - March 2018
Role: Data Engineer
This project focused on improving how retail and e-commerce data was managed for merchandising, pricing, and inventory teams across several international clients. The existing setup had inconsistent data coming from different systems, required frequent manual fixes, and often caused delays in daily reporting. My work involved creating reliable Python- and Spark-based ETL workflows, cleaning and standardizing product and inventory data, and automating multi-step processing in an on-premises environment. These improvements made the data more accurate and up-to-date, reduced manual effort, and helped downstream teams like pricing, forecasting, and store operations use the data more confidently and efficiently.
Responsibilities:
Engineered Python and PySpark-based ETL pipelines on on-prem Hadoop clusters to ingest high-volume retail transaction, product, and inventory datasets from SQL Server, MySQL, and flat-file sources.
Designed multi-stage cleansing and transformation workflows to standardize product attributes, item hierarchies, store information, promotion details, and pricing history across various retail systems.
Built reusable Python modules for data validation, schema checks, null/threshold monitoring, duplicate detection, and consistency checks across daily and weekly retail source feeds.
Developed robust ETL workflows that merged POS transactions, inventory snapshots, and product catalogs into unified, analytics-ready datasets for demand forecasting and merchandising teams.
Implemented PySpark transformation layers to perform SCD-type updates, deduplication, time-series alignment, and aggregation of sales and inventory metrics across store, region, and product levels.
Collaborated with business analysts and merchandising teams to translate pricing rules, replenishment logic, and product lifecycle patterns into reliable, scalable transformation logic.
Tuned PySpark jobs by optimizing joins, partitioning strategies, broadcast hints, and job concurrency to ensure timely completion of overnight processing windows in an on-prem environment.
Built automated reconciliation scripts in Python to verify inventory movements, price changes, store stock discrepancies, and mismatched product attributes across upstream retail systems (a toy version follows this list).
Assisted in creating and maintaining metadata-driven workflows, including configuration-based mappings, reusable transformation templates, and dynamic ingestion frameworks for new data sources.
Participated in daily production support by monitoring ingestion pipelines, resolving failed jobs, debugging Python/PySpark scripts, and ensuring stable data availability for downstream reporting teams.
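A toy version of the reconciliation checks described in the list above, in plain Python; the POS/warehouse field names and zero tolerance are illustrative assumptions.

```python
# Compare per-SKU inventory counts from two upstream systems and
# report SKUs that are missing or mismatched.
def reconcile(pos_counts: dict[str, int], wms_counts: dict[str, int],
              tolerance: int = 0) -> list[str]:
    """Return human-readable discrepancies between POS and warehouse counts."""
    issues = []
    for sku in sorted(set(pos_counts) | set(wms_counts)):
        pos, wms = pos_counts.get(sku), wms_counts.get(sku)
        if pos is None or wms is None:
            issues.append(f"{sku}: present in only one system (pos={pos}, wms={wms})")
        elif abs(pos - wms) > tolerance:
            issues.append(f"{sku}: count mismatch (pos={pos}, wms={wms})")
    return issues

print(reconcile({"SKU1": 10, "SKU2": 5}, {"SKU1": 10, "SKU2": 7, "SKU3": 1}))
```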
Environment: Python, PySpark, Hadoop (On-Prem), HDFS, Hive, SQL Server, MySQL, Shell Scripting, Linux, Bash, Crontab/Scheduler Jobs, Git, Jira, Agile Methodologies.

Client: Flipkart, Bangalore, India. August 2013 - May 2015
Role: Software Engineer - Data Platform
The initiative focused on improving the accuracy and reliability of core marketplace and fulfillment data that supported daily operational decisions. As part of the Data Platform team, I prepared validated datasets, resolved data inconsistencies across order, shipment, and seller systems, and supported daily ingestion checks using SQL/Hive and Python utilities. My work helped stabilize reporting pipelines, reduced manual data cleanup, and ensured that operations and seller teams had timely, trusted data for their dashboards.
Responsibilities:
Prepared cleaned and structured datasets for daily operational reporting covering order lifecycle, seller SLAs, cancellations, returns, shipment movement, and last-mile delivery KPIs used by business and operations teams.
Wrote and optimized SQL queries to extract, validate, and standardize high-volume marketplace and logistics datasets before consumption by dashboards and downstream analytics teams.
Performed root-cause analysis on data discrepancies such as missing shipment scans, duplicate orders, mismatched timestamps, incomplete seller feeds, and failed nightly batch loads.
Assisted category managers, warehouse operations, and supply-chain analysts by providing ad-hoc data pulls for investigating spikes in cancellations, routing delays, or seller underperformance.
Supported the Data Platform team in maintaining Hive partitioned tables, monitoring table refresh schedules, and validating data availability for daily and weekly business reviews.
Built small Python-based utilities for data sanity checks, including validation of ingestion completeness, detection of stale partitions, and simple reconciliation between marketplace and fulfillment systems (a minimal example follows this list).
Worked closely with the ETL operations team to monitor daily batch jobs, troubleshoot ingestion failures, and escalate upstream feed issues impacting marketplace dashboards.
Standardized core marketplace datasets by applying consistent business rules for order status mapping, SLA tagging, route event sequencing, and seller performance calculations.
Generated QA/test datasets for new dashboard rollouts, ensuring that operations BI teams received accurate, timely, and validated data for Tableau/Qlik reports.
Documented data dictionary entries, field mappings, reconciliation steps, and dataset dependencies as part of cross-team handoffs, audits, and BI onboarding processes.
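A minimal example in the spirit of the sanity-check utilities above; the dt=YYYY-MM-DD partition naming is an assumed layout.

```python
# Report expected daily partitions that never landed in a Hive table.
from datetime import date, timedelta

def missing_partitions(existing: set[str], start: date, end: date) -> list[str]:
    """Return expected dt=... partition names absent from `existing`."""
    missing, day = [], start
    while day <= end:
        name = f"dt={day.isoformat()}"
        if name not in existing:
            missing.append(name)
        day += timedelta(days=1)
    return missing

landed = {"dt=2015-03-01", "dt=2015-03-03"}
print(missing_partitions(landed, date(2015, 3, 1), date(2015, 3, 3)))
# -> ['dt=2015-03-02']
```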

Environment: SQL, Hive, Hadoop Ecosystem (HDFS storage, Hive metadata), Python, Linux (running Hive queries, log checks, job monitoring), Shell Scripting, QlikView/Tableau, Git.

EDUCATIONAL DETAILS

Master of Science (M.S.) in Data Science and Applications, University at Buffalo, Buffalo, NY (Dec 2021)
Bachelor of Technology (B.Tech.) in Electrical and Electronics Engineering, Srinivasa Ramanujan Institute of Technology, India (April 2013)