Tharunkumar - Data Engineer
[email protected]
Location: Remote (USA)
Relocation: Yes, anywhere in the USA
Visa: Initial OPT EAD
Tharunkumar C
Data Engineer
E: [email protected]

PROFESSIONAL SUMMARY
5 years of experience as a Data Engineer, delivering end-to-end, scalable data solutions across banking, healthcare, and e-commerce domains.
Designed and optimized ETL pipelines in Databricks (Python & PySpark) for large-scale social-service and financial data platforms.
Strong SQL expertise with performance-tuned transformations and data quality frameworks.
Proficient in data warehousing and modeling (Star & Snowflake schemas), implementing CDC, SCD Types, data quality validations, and performance-optimized queries across Snowflake, Teradata, Oracle, SQL Server, and Redshift.
Experienced in containerized data solutions using Docker, Kubernetes, and CI/CD automation.
Hands-on experience with big data ecosystems for processing large-scale structured, semi-structured, and unstructured datasets.
Developed interactive Power BI dashboards and data models using DAX and M to deliver KPI-driven insights for business units.
Worked closely with product, data science, and business teams to translate analytical requirements into efficient, scalable data solutions.
Created dataflows and semantic models in Power BI connected to Azure Synapse and Data Lake for self-service analytics.
Extensive knowledge of cloud platforms:
AWS: S3, EC2, EMR, Redshift, Glue, Lambda, RDS, Kinesis, CloudWatch, SNS, EventBridge
Azure: Data Factory, Synapse, Databricks, Data Lake, Key Vault, DevOps
GCP: BigQuery, Dataflow, Cloud Composer (Airflow), Pub/Sub, GCS
Delivered cloud migrations and hybrid architectures, moving legacy on-prem ETL pipelines to cloud platforms.
Built production-grade ETL and streaming workflows using Airflow, Oozie, and Step Functions, ensuring SLA adherence, monitoring, and operational reliability.
Experienced in building scalable data pipelines and data warehouses using AWS Glue, Redshift, Athena, S3, and PySpark, delivering optimized, secure, and high-performing data solutions.
Developed real-time streaming solutions using Spark Streaming, Kafka, Kinesis, and Flume, processing millions of transactions and events.
Proficient in NoSQL databases: MongoDB, Cassandra, HBase, DynamoDB, Cosmos DB; skilled with semi-structured/unstructured formats (JSON, Parquet, Avro, XML).
Experienced in DevOps & automation, implementing CI/CD pipelines with GitHub Actions, ArgoCD, Jenkins, Docker, and Kubernetes.
Basic knowledge of C# for API integrations and Lambda-based data workflows.
Recognized for zero-defect, production-ready code, ensuring high standards in data quality, security, and governance.
Experienced in designing Delta Lakehouse architectures using Databricks, Delta Lake, and Medallion (Bronze/Silver/Gold) models; a minimal sketch follows this summary.
Passionate about data-driven decision-making, optimizing data architectures, and ensuring scalability, accuracy, and performance.
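
A minimal PySpark sketch of the Medallion (Bronze/Silver/Gold) flow referenced in the summary above. The paths, table names, and columns (/mnt/lake/..., txn_id, customer_id, amount) are illustrative assumptions, not artifacts of any client project.

# Medallion (Bronze/Silver/Gold) sketch on Databricks with Delta Lake.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw JSON as-is, stamping an ingestion timestamp for lineage.
bronze = (spark.read.json("/mnt/lake/raw/transactions/")
          .withColumn("ingest_ts", F.current_timestamp()))
bronze.write.format("delta").mode("append").save("/mnt/lake/bronze/transactions")

# Silver: enforce types, deduplicate, and apply basic quality filters.
silver = (spark.read.format("delta").load("/mnt/lake/bronze/transactions")
          .dropDuplicates(["txn_id"])
          .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
          .filter(F.col("amount").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/mnt/lake/silver/transactions")

# Gold: business-level aggregate consumed by BI dashboards.
gold = (silver.groupBy("customer_id", F.to_date("txn_ts").alias("txn_date"))
        .agg(F.sum("amount").alias("daily_spend"), F.count("*").alias("txn_count")))
gold.write.format("delta").mode("overwrite").save("/mnt/lake/gold/daily_spend")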



TECHNICAL SKILLS

Programming & Scripting: Python, SQL (T-SQL, PL/SQL), PySpark, Spark SQL, Shell Scripting
Big Data & Distributed Processing: Apache Spark (Core, SQL, Streaming), Apache Kafka, Hadoop (HDFS, Hive)
Cloud Platforms:
AWS: Amazon S3, AWS Glue, Amazon Redshift, AWS Lambda, Amazon Kinesis, IAM, CloudWatch, Step Functions
Azure: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Data Lake
GCP (Working Knowledge): BigQuery
Data Warehousing & Databases: Snowflake, Oracle, SQL Server, MySQL, MongoDB, DynamoDB
ETL & Orchestration: AWS Glue, Azure Data Factory, Apache Airflow, Step Functions
DevOps & CI/CD: Docker, Kubernetes, Jenkins, GitHub Actions
Business Intelligence: Tableau, Power BI

EDUCATION
Master of Science in Computer Science, University of Central Missouri
Bachelor of Technology in Computer Science & Engineering, RGMCET

PROFESSIONAL EXPERIENCE

ROLE: Data Engineer JAN 2025 - Present
CLIENT: CITI BANK
RESPONSIBILITIES:
Designed and built ETL pipelines using AWS Glue, PySpark, and Python to ingest structured/unstructured data.
Implemented real-time streaming with Kafka, Kinesis, and Spark Streaming for low-latency fraud detection (a minimal sketch follows this section's tech stack).
Built and managed AWS data lake (S3) with raw, curated, and analytics-ready zones.
Modeled and optimized Redshift schemas for analytics and reporting.
Developed data quality frameworks ensuring >99% data accuracy.
Automated serverless workflows with Lambda, EventBridge, and SNS for fraud alerts.
Collaborated with data scientists to engineer ML features and integrate fraud models.
Exposed data via FastAPI-based REST APIs for fraud monitoring apps.
Migrated legacy on-prem ETL (SQL Server, Oracle) to AWS cloud-native pipelines.
Built dashboards in Tableau & Power BI for near real-time fraud insights.
Ensured data governance, security & compliance (IAM, KMS, VPC, audit logging).
Automated deployments with CI/CD (GitHub Actions, Jenkins, ArgoCD).
Orchestrated complex workflows with Airflow & Step Functions.
Reviewed pull requests, enforced coding standards, and mentored junior engineers in Spark and Databricks best practices.
Implemented end-to-end observability using CloudWatch and optimized Databricks cluster costs through autoscaling and job cluster configuration.
Implemented role-based access control (RBAC), IAM policies, encryption (KMS), and data governance controls in AWS and Databricks.
Leveraged NoSQL (DynamoDB, MongoDB, HBase) for high-throughput event storage.
Deployed microservices and ETL with Docker & Kubernetes.
Tuned Spark, Redshift, and SQL jobs to reduce latency and improve throughput.

TECH STACK: AWS (S3, Glue, Lambda, Kinesis, SNS, EventBridge, Redshift, CloudWatch, IAM, KMS, VPC, Step Functions), Apache Kafka, Apache Airflow, PySpark, Python, Spark Streaming, Spark SQL, SQL Server, Oracle, FastAPI, GitHub Actions, ArgoCD, Jenkins, Tableau, Power BI, MongoDB, DynamoDB, HBase, Linux, Docker, Kubernetes, Pandas, NumPy, PyArrow
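
Below is a hedged sketch of the streaming pattern referenced above: Spark Structured Streaming reading card transactions from Kafka, applying a simple rule-based flag (a stand-in for an ML fraud score), and landing results in S3. The broker address, topic name, schema, threshold, and bucket paths are assumptions for illustration, not Citi systems.

# Illustrative Spark Structured Streaming job: Kafka -> rule-based fraud flag -> S3 (Parquet).
# Topic, schema, threshold, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fraud-stream-sketch").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType()),
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("txn_ts", TimestampType()),
])

# Read raw events from Kafka and parse the JSON payload into typed columns.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "card-transactions")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Simple rule-based flag; in practice an ML fraud score would plug in here.
flagged = events.withColumn("suspect", F.col("amount") > 10000)

# Land flagged events in the curated S3 zone with checkpointing for exactly-once recovery.
query = (flagged.writeStream.format("parquet")
         .option("path", "s3a://example-bucket/curated/transactions/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/transactions/")
         .outputMode("append")
         .start())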


ROLE: Data Engineer JUL 2021 - DEC 2023
CLIENT: CARDINAL HEALTH
RESPONSIBILITIES:
Designed and developed ETL pipelines using Azure Data Factory to ingest healthcare data (EHR, claims, and clinical datasets) into Azure Data Lake and Azure SQL Database.
Performed data transformations and large-scale processing using Azure Databricks (PySpark, Spark SQL), improving data processing efficiency and reducing pipeline run times.
Built curated datasets in Azure Synapse Analytics to support reporting and business analytics.
Collaborated with clinical and business teams to gather requirements and ensure HIPAA-compliant data handling and governance.
Developed SQL queries, stored procedures, and views to support analytics and downstream reporting.
Automated data validation and quality checks using Python and shell scripting, reducing manual effort and improving reporting accuracy (a sketch of such checks follows this section's tech stack).
Supported batch and near real-time ingestion workflows using Spark and Kafka-based pipelines.
Created dashboards using Tableau to track KPIs and healthcare performance metrics.
Worked with BigQuery in GCP for ad-hoc analysis and cross-platform data integration (limited exposure).
Participated in CI/CD deployments using Git and Jenkins to manage code releases and pipeline automation.
TECH STACK: Azure, GCP, Spark, Spark Streaming, Spark SQL, HDFS, NiFi, Hortonworks, Cloudera, MapReduce, Zookeeper, Hive, Pig, Sqoop, Python, PySpark, Shell Scripting, Linux, Jenkins, Oracle, Git, Oozie, Databricks, Tableau, MySQL, HBase, Cassandra, and MongoDB.
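
A minimal sketch of the kind of automated data-quality checks mentioned above, written in PySpark for Databricks. The dataset path, key columns, and rules (claim_id, service_date) are hypothetical placeholders standing in for the project's actual validation framework.

# Illustrative PySpark data-quality checks for an ingested claims dataset.
# Table path, key columns, and rules are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()
claims = spark.read.format("delta").load("/mnt/datalake/curated/claims")

# Each check evaluates to True when the rule passes.
checks = {
    "row_count_nonzero": claims.count() > 0,
    "no_null_claim_ids": claims.filter(F.col("claim_id").isNull()).count() == 0,
    "no_duplicate_claim_ids":
        claims.count() == claims.select("claim_id").distinct().count(),
    "service_date_not_future":
        claims.filter(F.col("service_date") > F.current_date()).count() == 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    # In the real pipeline a failure would alert the team and stop downstream loads.
    raise ValueError(f"Data quality checks failed: {failed}")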

ROLE: ETL Developer JUL 2020 - JUN 2021
CLIENT: BEST BUY
RESPONSIBILITIES:
Conducted data analysis, cleansing, transformation, integration, migration, import, and export activities across various enterprise data sources and targets.
Developed PL/SQL stored procedures, functions, triggers, views, and packages, leveraging indexes, aggregations, and materialized views to enhance query performance and execution time.
Designed database schemas and authored DDL scripts for ETL metadata tables to store runtime metrics and job audit details for DataStage ETL processes.
Utilized Informatica PowerCenter to design and execute ETL workflows for loading data from flat files and XML into DB2 and Oracle databases.
Implemented Hive partitioning and bucketing techniques to optimize querying and improve the performance of large datasets stored in HDFS (a partitioned-load sketch follows this section's tech stack).
Worked extensively with Apache Spark, using SparkContext, Spark SQL, DataFrames, and Datasets, and deployed Spark jobs on YARN clusters for high-performance distributed data processing.
Wrote and executed Spark SQL queries to populate Hive tables and perform complex data transformations.
Designed and built ETL pipelines using Informatica, extracting and transforming data from multiple sources, including Oracle and SQL Server, and loading it into target Oracle databases.
Developed Apache Pig scripts to process and analyze large volumes of semi-structured data on Hadoop clusters.
Translated Hive/SQL queries into equivalent Spark transformations using RDDs for scalable data processing.
Authored and executed HiveQL queries for performing advanced data analytics to support business requirements.
Ran Hadoop streaming jobs to process unstructured and multi-format datasets across distributed environments.
Created and deployed Tableau dashboards and visual reports for actionable business insights and performance tracking.
Built end-to-end batch data pipelines using Apache Spark with Scala, integrating data from raw input through to analytics-ready layers.
Published data connections and dashboards to Tableau Server for operational and monitoring use across business units.
TECH STACK: IBM DataStage, Informatica, Python, Tableau, HDFS, Talend, SQL, Hive, Sqoop, Kafka, Data analytics, Data Flow, Scala, Impala, Spark, Autosys, Unix Shell Scripting.
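
A short PySpark / Spark SQL sketch of the partitioned Hive load described above. The database, table, columns, and staging path (retail.sales_fact, /data/staging/sales) are illustrative placeholders; bucketing is left out of this sketch to keep it portable.

# Illustrative Spark SQL sketch: create a partitioned Hive table and load it
# with dynamic partitioning. Database, table, columns, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-load-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS retail")
spark.sql("""
    CREATE TABLE IF NOT EXISTS retail.sales_fact (
        order_id STRING,
        store_id STRING,
        amount   DECIMAL(18,2)
    )
    PARTITIONED BY (sale_date DATE)
    STORED AS ORC
""")

# Allow partition values to come from the SELECT itself (dynamic partitioning).
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

spark.read.parquet("/data/staging/sales").createOrReplaceTempView("sales_stage")
spark.sql("""
    INSERT OVERWRITE TABLE retail.sales_fact PARTITION (sale_date)
    SELECT order_id, store_id, amount, sale_date
    FROM sales_stage
""")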