| venkata - lead data engineer |
| [email protected] |
| Location: Mooresville, North Carolina, USA |
Venkata Sai Phanindra Dhavaleswarapu
Email: [email protected] | Phone: (484) 441-3014

CERTIFICATIONS:
- Microsoft Certified: Azure Data Scientist Associate (DP-100)
- Microsoft Certified: Azure Data Engineer Associate (DP-203)
- Microsoft Certified: Azure Solutions Architect Expert (AZ-305)
- AWS Certified Solutions Architect Associate (AWS-SAA)
- AWS Certified Data Engineer Associate (DEA-C01)

PROFESSIONAL SUMMARY:
Data Engineer with strong hands-on experience building secure, scalable, production-grade data platforms on AWS for real-time and batch workloads. My work focuses on delivering business-critical data products such as eCommerce inventory accuracy, rewards and promotions processing, and IoT/telemetry operational analytics. I design end-to-end pipelines that ingest data from vendor feeds, transactional systems, and streaming sources, then standardize, validate, and publish curated datasets using medallion architectures (raw/validated/enriched) to support both customer-facing systems and enterprise reporting. On AWS, I have implemented event-driven and streaming pipelines using Kinesis and Kafka patterns, built transformation layers using Glue and EMR with PySpark, and delivered analytics serving through Redshift and operational stores such as PostgreSQL. I emphasize reliability through schema enforcement, data contracts, data quality checks, and clear observability using CloudWatch and centralized logging, and I support production operations with incident runbooks, alerting pipelines, and performance tuning for compute and warehouse layers. Security and governance are embedded into every solution through IAM least privilege, encryption, network controls, and catalog/metadata management. I deliver infrastructure and pipelines through infrastructure-as-code and CI/CD practices to ensure repeatable deployments, auditability, and consistent environment parity. Strong interpersonal skills; committed, results-oriented, and eager to learn new technologies.
Technical Skills:
- AWS Services: S3, Glue (4.0), EMR (7.0), Redshift (RA3/Serverless), Kinesis Data Streams, MSK (Kafka), Lambda, Step Functions, MWAA (Airflow), EventBridge, RDS/Aurora PostgreSQL, Lake Formation, Glue Data Catalog, DataZone, CloudWatch, OpenSearch, SNS
- Programming: Python (3.9/3.11), PySpark, SQL, Scala, Java, Bash
- Data Processing: Apache Spark, Iceberg, Flink (Kinesis Data Analytics), Pandas, NumPy
- Streaming & CDC: Kinesis, MSK (Kafka), Debezium, Spark Structured Streaming, Kafka Connect
- Orchestration: MWAA (Airflow), Step Functions, EventBridge
- Data Quality: Great Expectations, Glue Data Quality, Pydantic
- IaC & DevOps: AWS CDK (Python), Terraform, CloudFormation, Docker, EKS, GitOps (ArgoCD)
- Data Lakehouse: Iceberg, Delta Lake, Parquet, ORC, Avro, HDFS, Hive
- Generative AI: LangChain 0.1, LlamaIndex, Amazon Bedrock (Claude 3, Llama 2, Titan), OpenAI GPT-4/GPT-3.5, Pinecone, OpenSearch, pgvector, FAISS, RAGAS, LangSmith
- Monitoring: CloudWatch, X-Ray, OpenSearch, Grafana, Prometheus, SNS, PagerDuty
- BI & Reporting: Amazon QuickSight, Power BI, Tableau

PROFESSIONAL EXPERIENCE:

Lowe's - Mooresville, NC
Lead Data Engineer | May 2024 - Present
Project Description: Led the data engineering workstream powering near-real-time inventory accuracy across both digital and physical channels. Built the platform that keeps "available-to-promise" inventory correct at all times, so customers see the right quantity online, store associates see the same truth in-store, and downstream systems (order routing, fulfillment, replenishment, and vendor compliance) don't suffer from oversell, cancellations, or stale stock. Continuously ingested inventory signals from store systems and supply chain sources (receiving, adjustments, transfers, returns) and pulled third-party vendor feeds for drop-ship, replenishment, and SKU attributes.
Standardized and validated these feeds, reconciled them against store-level inventory, and published trusted inventory updates for real-time customer experiences and operational dashboards.
Roles & Responsibilities:
- Designed a streaming-first architecture using Kinesis Data Streams and MSK (Kafka 3.5) as the event backbone, processing inventory signals from 2,000+ stores with sub-two-minute latency from source to serving layer.
- Built real-time validation pipelines with Lambda (Python 3.11) for lightweight enrichment and Kinesis Data Analytics (Flink) for stateful stream processing, including de-duplication and inventory delta calculations.
- Implemented medallion layering (Bronze/Silver/Gold) on S3 with the Iceberg table format, enabling ACID transactions and time travel for downstream order routing and fulfillment systems.
- Engineered batch transformations using Glue 4.0 (Spark 3.4) and EMR 7.0 (Spark 3.5) to reconcile vendor feeds against store-level inventory for 100,000+ SKUs.
- Served analytics through Redshift Serverless and Redshift Spectrum for lake queries, and Aurora PostgreSQL 15 for operational metadata and vendor contract rules.
- Orchestrated workflows using MWAA (Airflow 2.8) for scheduled pipelines and EventBridge for event-triggered automation, including vendor feed arrivals.
- Enforced data contracts at ingestion using Pydantic models, preventing bad vendor payloads from contaminating curated layers and reducing quality incidents by 85%.
- Implemented CDC patterns using Debezium on EKS to capture changes from source systems and stream them via MSK to S3 and Redshift.
- Applied Lake Formation controls for row- and column-level security, cataloging assets in the Glue Data Catalog with discovery through AWS DataZone.
- Implemented observability with CloudWatch dashboards, X-Ray tracing, OpenSearch for log search, and Grafana/Prometheus for platform metrics with SLO tracking.
- Reduced storage costs by 35% through S3 Intelligent-Tiering, lifecycle policies, and Spot Instance usage for Spark workloads.
- Delivered infrastructure as code using AWS CDK (Python) and Terraform 1.6 with GitOps on EKS (ArgoCD) for repeatable deployments.
- Integrated Amazon Bedrock for GenAI-assisted anomaly detection in inventory patterns, reducing false positives by 60%.
Environment: AWS Lambda (Python 3.11), AWS Glue 4.0 (Spark 3.4), Amazon EMR 7.0 (Spark 3.5), AWS Batch, Amazon Kinesis Data Streams, Amazon MSK (Kafka 3.5), Amazon Kinesis Data Analytics (Flink), Amazon S3 (Intelligent-Tiering, Iceberg), Amazon Redshift Serverless, Amazon Redshift Spectrum, Amazon Aurora PostgreSQL 15, Amazon MWAA (Airflow 2.8), AWS Step Functions, Amazon EventBridge, AWS Control Tower, Amazon EFS, Amazon FSx for Lustre, Amazon Glacier Deep Archive, AWS Lake Formation, AWS Glue Data Catalog, AWS DataZone, AWS Glue Data Quality, AWS X-Ray, Amazon OpenSearch, Grafana, Prometheus, Amazon CloudWatch, Amazon SNS, PagerDuty, Slack, AWS CDK (Python), Terraform 1.6, Docker, Amazon EKS, Helm, ArgoCD, FluxCD, Amazon SageMaker, Amazon Bedrock.

Mastercard - Chicago, IL
Sr. Data Engineer | Jul 2022 - May 2024
Project Description: Architected and built the data platform powering the credit card rewards, promotions, and real-time redemption ecosystem, processing billions of dollars in points and cashback with full reconciliation to core banking ledgers.
Responsibilities:
- Designed hybrid batch-stream architectures ingesting 100M+ daily transactions from authorization switches and settlement processors with 99.99% accuracy for reward calculations.
- Built incremental ETL pipelines using AWS Glue 3.0 (Spark 3.1) with timestamp-based watermarks, processing only delta data and reducing processing time by 65% for 5+ TB daily volume.
- Authored complex PySpark transformations performing multi-way joins across billions of rows for cashback, tiered multipliers, and category bonuses covering 50M+ cardholders.
- Implemented idempotent patterns with checkpointing, guaranteeing exactly-once semantics for reward posting across $2B+ annual rewards liability.
- Architected real-time streaming using Kinesis Data Streams and Kinesis Data Firehose, landing events into S3 and Redshift at sub-two-minute latency.
- Deployed Lambda (Python 3.9) for stream enrichment, joining events with customer data cached in ElastiCache for Redis and processing 10,000+ events per second.
- Optimized the serving layer on Amazon Redshift RA3 with star-schema models, reducing query execution time by over 70% for customer-facing balance lookups.
- Deployed Redshift Spectrum for petabyte-scale ad-hoc analytics, reducing analytical query costs by 40%.
- Automated VACUUM, ANALYZE, and workload management to maintain sub-500ms response time for 95% of customer-facing queries.
- Used RDS PostgreSQL 13 for operational metadata, storing promotion rules, customer enrollments, and audit trails for 100+ internal applications.
- Integrated Great Expectations into Glue workflows, alerting on discrepancies exceeding one basis point ($10 per $100,000 processed).
- Enforced banking-grade security with IAM least privilege, KMS encryption, VPC endpoints, and security groups for PCI DSS compliance.
- Owned production monitoring through CloudWatch dashboards and SNS alerts, maintaining 99.95% platform availability over two years.
- Delivered infrastructure as code using Terraform 1.0 and CloudFormation with zero failed deployments in the last year.
Environment: AWS Lambda (Python 3.9), AWS Glue 3.0 (Spark 3.1), Amazon EMR 6.7, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, Amazon Redshift RA3, Amazon Redshift Spectrum, Amazon RDS PostgreSQL 13, Amazon ElastiCache for Redis, Amazon MWAA (Airflow 2.2), AWS Step Functions, Amazon CloudWatch, Amazon SNS, AWS IAM, AWS KMS, Terraform 1.0, AWS CloudFormation, Great Expectations, Grafana, Prometheus.
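The timestamp-based watermark pattern used for the incremental ETL pipelines above can be sketched in plain Python. This is an illustration only, with hypothetical field names; the actual jobs ran as PySpark in Glue against much larger volumes.

```python
from datetime import datetime, timezone

def incremental_extract(records, last_watermark):
    """Select only rows newer than the stored watermark (delta processing).

    records: iterable of dicts with an 'updated_at' datetime (hypothetical
    field name). Returns the delta batch plus the advanced watermark to
    persist for the next run, so each row is processed exactly once.
    """
    batch = [r for r in records if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in batch), default=last_watermark)
    return batch, new_watermark

# Example run: only the two rows newer than the stored watermark are picked up.
wm = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": datetime(2023, 12, 31, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
batch, wm = incremental_extract(rows, wm)
```

Persisting the advanced watermark after each successful run is what keeps the pipeline incremental: a rerun with the same watermark yields the same delta, which pairs naturally with the idempotent posting described above.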
Verizon - Chicago, IL
Data Engineer | Aug 2019 - Jul 2022
Project Description: Built a telemetry and IoT-driven data platform processing machine/device signals for network operations, enabling early outage detection and technician SLA reporting across 500,000+ connected devices.
Responsibilities:
- Built batch and near-real-time pipelines processing 50M+ daily telemetry events from IoT gateways and service platforms.
- Landed raw telemetry in S3 with partitioning by date, region, and device type, reducing query costs by 40% through partition pruning.
- Ran Spark clusters on EMR (Spark 2.4) for distributed processing of semi-structured logs, computing error-rate thresholds and performance aggregates for SLA reporting.
- Deployed Lambda functions for payload validation and routing, processing 10,000+ events per minute with sub-100ms latency.
- Implemented Kafka as the real-time backbone to synchronize device state across monitoring and alerting systems with sub-five-second latency.
- Operated Airflow on EC2 to schedule ETL pipelines with automated failure notifications and retry logic.
- Loaded curated telemetry into RDS MySQL for operational queries and Redshift (DC2) for analytics supporting 5,000+ field technicians.
- Applied distribution styles and sort keys in Redshift, reducing report generation time by 50% for SLA compliance dashboards.
- Supported production operations with a 99.5% pipeline success rate despite noisy data and frequent upstream changes.
- Implemented structured error handling in Python 3.7 with SNS notifications and documented recovery steps for reprocessing.
- Managed costs through instance sizing and reserved capacity planning, reducing platform costs by 25%.
Environment: Amazon EMR (Spark 2.4), Amazon EC2, AWS Lambda (Python 3.7), Amazon S3, Amazon RDS MySQL, Amazon Redshift DC2, Apache Kafka, Apache Airflow on EC2, Amazon CloudWatch, Amazon SNS, AWS IAM, AWS KMS, Grafana, Prometheus.
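The structured error handling described above (retries for transient failures, then an alert with documented recovery steps) can be sketched with the standard library. The notification is stubbed with a print, since a real pipeline would publish via boto3's SNS client with a configured topic; all names here are illustrative.

```python
import time

def notify(topic, message):
    """Stand-in for an SNS publish; a real pipeline would call
    boto3's sns.publish(TopicArn=..., Message=...) instead."""
    print(f"[{topic}] {message}")

def run_with_retries(task, retries=3, delay_s=0.0, topic="etl-alerts"):
    """Run a pipeline step, retrying transient failures and alerting
    (pointing at the recovery runbook) once retries are exhausted."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                notify(topic, f"step failed after {retries} attempts: {exc}; "
                              "see runbook for reprocessing steps")
                raise
            time.sleep(delay_s)  # back off before the next attempt

# Example: a step that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient upstream error")
    return "ok"

result = run_with_retries(flaky_step)
```

Re-raising after the final alert matters: it lets the orchestrator (Airflow here) mark the task failed and apply its own retry or escalation policy rather than silently swallowing the error.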
Zomato - Remote
Data Engineer | Jul 2016 - Jul 2018
Project Description: Built a data platform processing POS transactions and vendor feeds to power supply chain analytics and inventory optimization across 9,000+ stores with a six-hour SLA.
Roles & Responsibilities:
- Designed automated ingestion pipelines in AWS Glue processing 50M+ daily POS transactions, using S3 landing zones with schema drift handling.
- Built retail transformation workflows in AWS Glue (PySpark) performing SKU-level joins and inventory calculations for 100,000+ SKUs.
- Modeled supply chain fact layers in Redshift using star-schema designs for BI queries serving 500+ daily users.
- Engineered curated S3 layers in partitioned Parquet format for reproducible training datasets.
- Reduced reporting latency by 60% through optimized data models and incremental refresh in Amazon QuickSight.
- Implemented SCD Type 1 & 2 in Redshift for historical views of product hierarchies and vendor contracts.
- Developed real-time ingestion integrating Kafka streams with Spark Streaming on EMR for stock-out detection.
- Built AWS CodePipeline CI/CD pipelines automating deployment of Glue jobs and Redshift schema changes.
- Reduced compute costs by 30% through auto-scaling and storage tiering strategies.
- Developed Lambda functions exposing real-time SKU-level KPIs with sub-100ms response time.
Environment: AWS Glue, Amazon EMR, AWS Lambda, Amazon S3, Amazon Redshift, Amazon RDS, Apache Kafka, Amazon QuickSight, AWS CodePipeline, AWS CloudFormation, Python, PySpark.

Flipkart - Bangalore, India
Data Engineer | Jun 2014 - May 2016
Project Description: Built real-time and batch processing pipelines for telecom telemetry data, enabling behavioral analysis and predictive modeling while modernizing infrastructure from 100GB to 10TB+ datasets.
Roles & Responsibilities:
- Built real-time pipelines using Kafka and Spark Streaming processing 10M+ daily telemetry events.
- Designed scalable Hive and HDFS layouts with partitioning and ORC formats, improving query performance by 40%.
- Developed Spark batch jobs in Scala and Python, reducing runtimes by 30% through optimized join strategies.
- Migrated legacy RDBMS workflows to Hadoop using Sqoop and Hive, scaling from 100GB to 10TB+ datasets.
- Implemented incremental loads with timestamp-based watermarks, reducing data transfer by 70%.
- Designed Hive partitioning strategies, reducing query execution times by 50% through partition pruning.
- Built Spark applications in Scala for sessionization and funnel analysis.
- Optimized Spark jobs through broadcast joins, data caching, and shuffle partitioning.
- Implemented Avro serialization for Kafka, reducing message size by 30%.
- Developed shell scripts for job scheduling with automated alerts for failures.
Environment: Hadoop (Cloudera), Apache Kafka, Apache Hive, Apache Spark (Scala), Spark Streaming, HDFS, Apache Sqoop, Apache NiFi, Apache Flume, Grafana, Prometheus, Avro, Python, Scala, Java, Linux, Shell Scripting.
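The SCD Type 2 handling mentioned in the Zomato role (historical views of product hierarchies and vendor contracts) follows a close-and-insert pattern: expire the current dimension row and append a new version. A minimal stdlib sketch of that step, with a hypothetical schema rather than the actual Redshift SQL:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # sentinel "valid forever" end date

def scd2_apply(dim_rows, key, new_attrs, as_of):
    """Apply an SCD Type 2 change: close the current version of the
    row for `key` and insert a new current version effective `as_of`.

    dim_rows: list of dicts with 'key', 'attrs', 'valid_from',
    'valid_to', 'is_current' (hypothetical column names).
    """
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return dim_rows  # no attribute change: nothing to version
            row["valid_to"] = as_of      # close out the old version
            row["is_current"] = False
    dim_rows.append({"key": key, "attrs": new_attrs,
                     "valid_from": as_of, "valid_to": HIGH_DATE,
                     "is_current": True})
    return dim_rows

# Example: a vendor contract attribute changes on 2016-03-01, so the
# dimension keeps both the old and the new version.
dim = [{"key": "SKU-1", "attrs": {"vendor": "A"},
        "valid_from": date(2015, 1, 1), "valid_to": HIGH_DATE,
        "is_current": True}]
dim = scd2_apply(dim, "SKU-1", {"vendor": "B"}, date(2016, 3, 1))
```

Because old versions are only closed, never overwritten, point-in-time queries can filter on `valid_from`/`valid_to` to reconstruct the hierarchy as it stood on any date.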