
Sanjay CH - Sr. Data Engineer | Distributed Data Platforms | Spark - Snowflake - Databricks - Kafka | AWS - GCP - Azure | Real-time & Batch Pipelines | ML-Ready Data
[email protected]
Location: Dallas, Texas, USA
Relocation: Yes
Visa: H1B
Resume file: Sanjay_DE_Resume_1771438413983.docx
SANJAY CHANDAMALA
(Sr. Data Engineer)

Email: [email protected] | Contact: (469) 300-6001

SUMMARY:
Around 8 years of experience in Data Engineering, Data Pipeline Design, Development and Implementation, and Data Modeling.
Strong experience across the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
Developed data set processes for data modeling and data mining, and recommended ways to improve data reliability, efficiency, and quality.
Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to run the required validations on the data.
Experience building ETL scripts in PL/SQL, Informatica, Hive, Pig, and PySpark, with expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
Experience with the Control-M and AutoSys scheduling tools and running ETL jobs through them.
Experience using Kafka and Kafka brokers with Spark Streaming to process live streaming data.
Extensive knowledge of RDBMSs such as Oracle, Microsoft SQL Server, and MySQL, along with DevOps practices.
Extensive experience working on various databases and database script development using SQL and PL/SQL.
Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
Expertise in migrating data from Oracle to Vertica.
Experienced in working with Amazon Web Services (AWS), using EC2 for compute and S3 for storage.
Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
Created a first-of-its-kind data model for the Intelligence domain.
Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Worked on Spark SQL, creating DataFrames by loading data from Hive tables, preparing the data, and storing it in AWS S3.
Expertise in OLAP database (Vertica) architecture, performance tuning, and optimization techniques such as projections and partitions, as well as installation, upgrade, backup, and restore.
Hands-on experience with other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB, and Route 53.
Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML. Proficient with columnar file formats such as RC, ORC, and Parquet, and with compression techniques used in Hadoop processing such as gzip, Snappy, and LZO.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity in the current project.
Working experience and good understanding of NoSQL databases such as HBase, MongoDB, Cassandra, and Azure Cosmos DB, as well as PostgreSQL, including their functionality and implementation.
Experience extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
Excellent programming skills with experience in Python, SQL and C Programming.
Worked in various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY, and Git.
Experienced in working in SDLC, Agile and Waterfall Methodologies.
Wrote Python code using the Ansible Python API to automate the cloud deployment process.
Strong interpersonal skills and the ability to work independently or in a group; a quick learner who adapts easily to new working environments.
Good exposure to interacting with clients and resolving application environment issues; communicates effectively with people at different levels, including stakeholders, internal teams, and senior management.

TECHNICAL SKILLS:

Big Data Frameworks: Hadoop (HDFS, MapReduce), Spark, SparkSQL, Spark Streaming, Hive, Impala, Kafka, HBase, Flume, Pig, Sqoop, Oozie, Cassandra.

Hadoop Distribution: Cloudera CDH, Apache Hadoop, AWS, Hortonworks HDP

Programming languages: Core Java, Scala, Python, Shell scripting, PL/SQL, R, PySpark, HiveQL, Regular Expressions

Spark Components: RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming

Operating Systems: Windows, Linux, UNIX

Databases: Oracle, SQL Server, MySQL, Teradata, Aurora Postgres, NoSQL Database (HBase, MongoDB)

Web Technologies: XML, HTML, JavaScript, jQuery, JSON

Application / Web Servers: Apache Tomcat, WebSphere

Messaging Services: ActiveMQ, Kafka, JMS

Version Tools: Git, GitHub, CVS, SVN

Cloud Technologies: Azure, GCP, AWS

Build Tools: Maven, SBT

Containerization Tools: Kubernetes, Docker, Docker Swarm

Reporting Tools: Power BI, SAS, Tableau, Looker

Other Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, Google Cloud Shell, Matillion, Alteryx


PROFESSIONAL EXPERIENCE:

Client: General Motors, TX Feb 2025 - Till Date
Role: Sr. Data Engineer
Responsibilities:

Engineered ETL pipelines using AWS Glue and Informatica, integrating and transforming healthcare data into a centralized Snowflake data warehouse for real-time analytics.
Automated cloud infrastructure provisioning using Terraform, enabling consistent and scalable deployments of AWS and Snowflake environments.
Integrated real-time streaming data from patient monitoring systems into AWS S3 and AWS Lake Formation, processing with Kafka for advanced analytics and healthcare insights.
Designed and implemented AI/ML data pipelines to support machine learning models focused on improving patient outcomes, leveraging AWS Glue and Snowflake.
Developed interactive Power BI dashboards to visualize key healthcare metrics such as patient readmission rates and treatment outcomes, enabling data-driven decision-making.
Implemented automated data quality checks and cleansing routines to ensure integrity and accuracy across healthcare data pipelines.
Migrated PySpark models from on-premises Hadoop clusters to GCP Dataproc, enabling optimized big data processing workflows on GCP.
Utilized GCP Dataflow for stream and batch data processing, implementing windowing strategies and ensuring fault tolerance in data pipelines.
Integrated BigQuery with GCP Dataflow to enable high-performance data transformation and real-time analytics on large datasets.
Automated infrastructure provisioning in GCP using Terraform, ensuring consistency and scalability in GCP Dataflow and GCP Dataproc deployments.
Developed data pipelines in Databricks to process large-scale healthcare datasets, leveraging the optimized Spark runtime for efficient batch and stream processing.
Leveraged Databricks for collaborative development, enabling cross-functional teams to work on healthcare data transformation, feature engineering, and model deployment within a unified platform.
Migrated data and workloads from Teradata to Google Cloud Platform, optimizing SQL queries and data models for improved performance and cost efficiency.
Ensured HIPAA compliance with encryption, AWS IAM, and key management via AWS KMS and GCP Secret Manager for PHI protection.
Provisioned and maintained cloud infrastructure using Terraform, enabling repeatable, secure deployments across AWS and GCP environments.
Streamed real-time patient monitoring data into AWS S3 and Lake Formation using Kafka, applying CDC patterns for continuous ingestion and analytics.
Designed and deployed ML-ready data pipelines supporting feature engineering and model training using AWS Glue, Snowflake, and Power BI.
Engineered PySpark workflows to process and transform healthcare data on HDFS and AWS S3, optimizing memory and execution plans for distributed computation.
Migrated PySpark workloads from on-prem Hadoop to GCP Dataproc, refactoring code and orchestrating with Oozie and Cloud Composer.
Created federated data pipelines with BigQuery, integrating GCP Cloud Storage, third-party APIs, and Dataflow for cross-platform analytics.
Queried large datasets stored in Amazon S3 using AWS Athena, enabling ad-hoc SQL querying without ETL overhead or infrastructure provisioning.
Implemented robust data quality frameworks using Python and Airflow, ensuring anomaly detection, schema validation, and alerting across pipelines.
Designed Kafka consumers for ingesting streaming health telemetry, with output persisted in GCP Pub/Sub and BigQuery for near real-time dashboards.
Built and optimized Hive, Impala, and Spark SQL queries on Cloudera, improving runtime performance for legacy reporting workloads.
Worked closely with data scientists to prepare clean, structured datasets for machine learning models focused on patient care predictions, fraud detection, and operational efficiency.
Environment: HDFS, Yarn, MapReduce, Hive, Sqoop, Flume, Oozie, HBase, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting, Cloudera, Cloud Dataflow, AWS Lambda, Terraform, GCP BigQuery, GCP Dataproc, GCP Pub/Sub, GCP Cloud Storage, GCP Dataflow.
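Illustrative sketch (not part of the original project code): a minimal PySpark Structured Streaming job of the kind described above, reading patient telemetry from a Kafka topic and persisting it to S3 as partitioned Parquet. The broker address, topic name, bucket paths, and schema fields are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_date
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical telemetry schema; field names are placeholders.
telemetry_schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("event_ts", TimestampType()),
])

spark = SparkSession.builder.appName("telemetry-ingest").getOrCreate()

# Read the raw Kafka stream (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "patient-telemetry")
       .load())

# Parse the JSON payload and derive a date column for partitioning.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), telemetry_schema).alias("t"))
          .select("t.*")
          .withColumn("event_date", to_date(col("event_ts"))))

# Persist to S3 as partitioned Parquet; the checkpoint provides fault tolerance.
(parsed.writeStream
       .format("parquet")
       .option("path", "s3a://example-bucket/telemetry/")
       .option("checkpointLocation", "s3a://example-bucket/checkpoints/telemetry/")
       .partitionBy("event_date")
       .start()
       .awaitTermination())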


Client: UPS, GA Apr 2023 - Jan 2025
Role: Cloud Data Engineer

Responsibilities:

Performed all phases of software engineering including requirements analysis, application design, and code development & testing.
Developed and maintained end-to-end operations of ETL data pipelines and worked with large data sets in Azure Data Factory (ADF).
Increased the efficiency of data fetching by optimizing and indexing queries.
Wrote SQL using DDL and DML statements, along with indexes, triggers, views, stored procedures, functions, and packages.
Collaborated for developing statistical models and machine learning algorithms, leveraging Python, R, and Azure Databricks for scalable data processing.
Used Azure Data Factory to orchestrate complex data pipelines integrating multiple data sources (SQL, Snowflake, Azure Blob) and applied transformations for machine learning model training.
Built and optimized regression models and clustering algorithms to support predictive analytics for a healthcare client, improving forecasting accuracy by 15%.
Designed and Developed ETL Processes in AWS Glue to migrate Campaign data from external sources like S3, ORC/Parquet/Text Files into AWS Redshift.
Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline which can be written to Glue Catalog and can be queried from Athena.
Worked on Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load it back into Snowflake.
Deployed Data Factory for creating data pipelines to orchestrate the data into SQL database.
Worked on Snowflake modeling using data warehousing techniques, data cleansing, slowly changing dimensions, surrogate key assignment, and change data capture.
Applied an analytical approach to problem-solving, using Azure Data Factory, Azure Data Lake, and Azure Synapse to solve business problems.
Migrated data from on-premises systems to AWS storage buckets.
Developed Python scripts to transfer data from on-premises systems to AWS S3 and to call REST APIs and extract data into AWS S3.
Deployed application using AWS EC2 standard deployment techniques and worked on AWS infrastructure and automation. Worked on CI/CD environment on deploying application on Docker containers.
Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
Developed ELT/ETL pipelines to move data to and from the Snowflake data store using a combination of Python and Snowflake SnowSQL.
Developed ETL transformations and validations using Spark SQL/Spark DataFrames with Azure Databricks and Azure Data Factory.
Managed and optimized Databricks workspaces, ensuring efficient cluster configuration and resource allocation for large-scale data processing tasks.
Implemented security measures and data governance protocols using Unity Catalog and IAM, maintaining compliance with industry standards.
Administered centralized metadata for efficient data cataloging across multiple workspaces and environments within Databricks.
Organized data assets into catalogs, schemas, and tables, improving data discoverability and access control.
Worked with Azure Logic Apps administrators to monitor and troubleshoot issues related to process automation and data processing pipelines.
Developed and optimized code for Azure Functions to extract, transform, and load data from various sources, such as databases, APIs, and file systems.
Designed, built, and maintained data integration programs in Hadoop and RDBMS.
Developed a CI/CD framework for data pipelines using Jenkins.
Collaborated with DevOps engineers to develop an automated CI/CD and test-driven development pipeline on Azure per the client requirement.
Hands-on programming experience in scripting languages such as Python and Scala.
Involved in running all the Hive scripts through Hive on Spark and some through Spark SQL.
Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform and analyze data.
Developed Spark code and Spark SQL scripts using Scala for faster data processing.
Worked with JIRA to report on projects and created subtasks for development, QA, and partner validation.
Experienced in the full breadth of Agile ceremonies, from daily stand-ups to internationally coordinated PI Planning.

Environment: Azure Databricks, Data Factory, Logic Apps, Function App, Snowflake, MS SQL, Oracle, HDFS, AWS EMR/EC2/S3/Redshift, MapReduce, YARN, Spark, Hive, SQL, Python, Scala, PySpark, shell scripting, Git, JIRA, Jenkins, Kafka, ADF Pipeline, Power BI.
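Illustrative sketch (placeholder names throughout): a minimal Python extract-to-S3 job of the kind described above, calling a REST API and landing the response in S3 as JSON.

import json
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.example.com/v1/shipments"  # hypothetical endpoint
BUCKET = "example-landing-bucket"                 # hypothetical bucket
PREFIX = "raw/shipments"

def extract_to_s3() -> str:
    """Fetch records from the API and write them to S3; return the object key."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    now = datetime.now(timezone.utc)
    key = f"{PREFIX}/dt={now:%Y-%m-%d}/shipments_{now:%H%M%S}.json"
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
    )
    return key

if __name__ == "__main__":
    print(f"Wrote s3://{BUCKET}/{extract_to_s3()}")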


Client: Mphasis, India May 2018 - Feb 2022
Role: Data Engineer

Responsibilities:

Designed and implemented scalable ETL pipelines using AWS Glue, integrating and transforming banking transaction data into AWS Redshift for reporting and analytics.
Developed automated data quality checks and monitoring scripts leveraging Python and Informatica, ensuring the accuracy and integrity of data across financial systems.
Architected and managed AWS infrastructure utilizing Terraform to automate the provisioning of resources for AWS Redshift and AWS Glue, ensuring operational efficiency.
Created advanced data transformations and integrations with Informatica, facilitating the seamless loading of customer transaction data into AWS Redshift for business analysis.
Automated cloud infrastructure management with Terraform, enabling consistent scaling of resources to handle increasing data volumes efficiently.
Designed and optimized data models for reporting and analytics, ensuring efficient processing and analysis of banking transactions and customer behavior in AWS Redshift.
Leveraged AWS Glue for seamless transformation and migration of large datasets from legacy systems into the cloud for enhanced data accessibility.
Implemented AWS Kinesis for real-time data ingestion and utilized AWS Lambda for orchestrating data transformations based on new incoming data.
Optimized SQL queries and data transformation workflows in AWS Athena and AWS Redshift, reducing costs and significantly improving query performance.
Implemented robust data encryption and access control mechanisms to ensure data privacy and compliance with PCI DSS standards for sensitive financial data.
Developed and optimized distributed data processing jobs using PySpark within the Hadoop ecosystem, ensuring efficient scalability and performance in handling large-scale datasets.
Utilized Oozie for orchestrating both batch and real-time data workflows, integrating Hive data into Apache Kafka to enable real-time analytics and data streaming.
Collaborated with DevOps teams to implement CI/CD pipelines, automating the deployment of data workflows and improving operational efficiency for the banking data architecture.
Leveraged MapReduce within the Hadoop ecosystem to process large-scale datasets efficiently, implementing distributed data processing jobs to handle complex data transformation tasks.
Developed automated workflows for continuous monitoring and error handling within the data pipeline, ensuring consistent performance and smooth operations across the system.
Partnered with cross-functional teams to enforce best practices for cloud data solutions, ensuring the development of efficient, scalable, and secure systems.
Engineered and implemented scalable ETL pipelines using Azure Data Factory, integrating data from on-premises and cloud-based sources into Azure Data Lake and Azure SQL Data Warehouse.
Designed and implemented data transformation workflows with Azure Databricks leveraging Spark (Scala/PySpark) for distributed processing of large datasets, improving query performance.
Automated infrastructure provisioning using Azure Resource Manager (ARM) templates and Terraform, ensuring repeatable and consistent deployments across Azure Storage, Azure Synapse Analytics, and Azure Kubernetes Service (AKS).
Environment: PySpark, Hadoop, Spark, Hive, Oozie, Apache Kafka, AWS Kinesis, AWS Lambda, AWS Glue, AWS Redshift, CI/CD Pipelines, DevOps, Informatica, Azure Data Factory, Azure Databricks, Azure SQL Data Warehouse, ARM.
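Illustrative sketch, assuming hypothetical bucket names and columns: a minimal AWS Glue-style PySpark job of the kind described above, cleansing campaign data from S3 and writing curated, date-partitioned Parquet ready for loading into Redshift (for example via COPY, which is omitted here).

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_date

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw campaign data from the landing bucket (path is a placeholder).
raw = spark.read.parquet("s3://example-raw-bucket/campaigns/")

# Basic cleansing: drop incomplete rows and derive a partition column.
cleaned = (raw.dropna(subset=["campaign_id", "txn_amount"])
              .withColumn("txn_date", to_date(col("txn_ts"))))

# Write curated, date-partitioned Parquet that a downstream Redshift load can consume.
(cleaned.write.mode("overwrite")
        .partitionBy("txn_date")
        .parquet("s3://example-curated-bucket/campaigns/"))

job.commit()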

EDUCATION:

Master's Degree (MS) - Computer Science Mar 2022 - Jun 2024
DePaul University, Chicago, IL

Bachelor's Degree (B.Tech) - Computer Science and Engineering Jun 2015 - Jun 2019
Aurora's Technological & Research Institute, Hyderabad, India

