Soujanya G - Senior Data Engineer
[email protected]
Location: Cumming, Georgia, USA
Relocation: Within the state of Georgia
Visa: GC EAD
Soujanya Gouroju
Data Engineer
Email: [email protected] || Ph #: 470-253-1144 || GC-EAD


PROFESSIONAL SUMMARY:

8 years of data engineering experience designing, building, and optimizing scalable ETL/ELT pipelines and cloud-based data solutions for financial and enterprise clients.
Hands-on experience with Azure Data Platform, including Azure Data Factory (ADF), Azure Data Lake, Azure Synapse Analytics, and Databricks, aligned with Microsoft Fabric architecture for large-scale ingestion, transformation, and analytics.
Worked with AtScale to build scalable semantic data models for enterprise BI and analytics reporting.
Designed and maintained Snowflake and Delta Lake data warehouses, including schema design, analytical data modeling, query optimization, and performance tuning for large datasets.
Designed and built scalable real-time and batch data pipelines integrating Spark (PySpark/Scala), Kafka, and Databricks, enabling near real-time processing and advanced analytics.
Developed and deployed MongoDB data models for semi-structured and unstructured datasets, ensuring optimal schema design for performance and scalability.
Designed and implemented distributed data processing solutions using HDFS, YARN, Hive, and HBase for large-scale data workloads.
Applied data quality, governance, and metadata management practices using tools like Ab Initio Metadata Hub, Precisely, and Openflow to ensure lineage, compliance, and enterprise data integrity.
Experienced in multi-cloud environments with AWS (Glue, Lambda, Step Functions, S3, EMR) and GCP fundamentals, enabling seamless integration with Azure and Snowflake-based architectures.
Developed and optimized data transformations using SQL, Python, PySpark, and ELT tools such as dbt for efficient analytical processing.
Implemented robust data security and compliance frameworks, including PII/HIPAA standards, encryption, and IAM policies across cloud platforms.
Actively participated in application design discussions, translating business requirements into scalable data solutions and well-structured technical implementations.
Conducted code reviews, enforced best practices, and contributed to maintaining high-quality, maintainable, and reliable data pipelines.
Implemented testing and QA strategies, including data validation, unit testing, and pipeline monitoring, to ensure accuracy, reliability, and production stability.
Developed and optimized OLAP-style data models in AtScale to support self-service analytics across large datasets in Snowflake, BigQuery, and Azure Synapse.
Created technical documentation and delivered user training and support to ensure smooth adoption of data solutions and effective knowledge transfer across teams.
Integrated ETL pipelines with cloud platforms (Azure, AWS) and data warehouses.
Developed scalable data processing pipelines using Spark (PySpark & Scala).
Designed and developed scalable data pipelines using Google Cloud Platform (GCP) services including Dataflow and Composer for batch and real-time processing.
Delivered BI and analytics solutions utilizing Power BI, Excel, and Tableau, including dashboards and reports to meet regulatory, risk, and business reporting needs.
Designed and managed workflow orchestration using Apache Airflow DAGs.
Integrated Kafka with Spark Streaming for near real-time data processing.
Strong collaboration with Data Architects, Business Analysts, and DevOps teams, using Agile/Scrum approaches to create scalable, automated, and maintainable data pipelines.
Optimized NoSQL schemas for low-latency and high-throughput applications.
Integrated Python scripts with cloud services, APIs, and orchestration tools.
Implemented REST APIs and data services supporting ETL pipelines.
Built SQL and Python-based analytical datasets supporting transaction monitoring and financial crime detection models, enabling risk teams to identify suspicious transaction patterns across banking transactions.
Supported quantitative risk and monitoring models by preparing large-scale transactional datasets, improving analytics performance for AML compliance and regulatory reporting.
Technical Skills
Programming & Scripting: T-SQL, Python, Scala, Java, PySpark, SQL, API, Hadoop
Data Platforms: Teradata, Cassandra, MongoDB, Oracle, SQL Server, ADLS, Snowflake, AWS S3, Redshift, RDS
Cloud & Big Data Services: Google Cloud Platform (GCP), BigQuery, Cloud Storage, Dataflow, Datastream, Dataproc, Cloud Composer, Dataplex, Dataform, Pub/Sub, BigLake, Vertex AI, Cloud Functions, Cloud Run, Apache Spark, Delta Lake.
ETL & BI Tools: SSIS, SSRS, SSAS, ADF, DataStage, Informatica, AWS Data Pipeline, Power BI, Tableau, Cognos, Grafana, Jasper
Testing, QA & Governance: Data Validation, Testing & QA Strategies, Data Quality Frameworks, Metadata Management, Data Lineage, Governance Tools (Ab Initio, BigID)
Databases: Oracle 10g, SQL Server, MySQL, Azure SQL, Azure Blob Storage, AWS RDS
Geospatial Analysis: ArcGIS, PostGIS, QGIS, GeoPandas
Frameworks & Reporting: Struts, Spring, Hibernate, Jasper
Server & Operating Systems: WebLogic, WebSphere, Apache Tomcat, Windows, Unix, Linux
CI/CD & Version Control: Jenkins, GitHub, Terraform, Git, Bitbucket, SVN
Risk & Compliance Analytics: AML Transaction Monitoring, Financial Crime Detection, Risk Data Analytics, Model Monitoring, Regulatory Reporting
Other Tools: BigID, API Development
Professional Experience

Webster Bank, Waterbury, CT April 2024 to Present
Role: Senior Data Engineer
Responsibilities:

Worked as a Senior SDET (Software Development Engineer in Test) designing and developing scalable, reusable, and maintainable automation frameworks for enterprise web applications, APIs, middleware services, and UI components.
Designed, developed, and maintained enterprise ETL pipelines using IBM DataStage for large-scale data integration and warehouse loading projects.
Developed complex DataStage Parallel Extender (PX) jobs, server jobs, and job sequences to support high-volume data processing and transformation requirements.
Worked extensively with source-to-target mappings, data transformation rules, data cleansing logic, and business validation requirements across enterprise applications.
Collaborated with business analysts, technical leads, data architects, and stakeholders to gather requirements and implement scalable ETL solutions.
Built and optimized ETL workflows for loading data into enterprise data warehouses using Star Schema, Snowflake Schema, and Slowly Changing Dimensions (SCD Type 1 & Type 2).
Developed complex SQL queries, stored procedures, and transformation logic for efficient data extraction, validation, and reconciliation activities.
Performed unit testing, debugging, error handling, and performance tuning of DataStage jobs to improve ETL efficiency and reduce processing time.
Managed ETL job scheduling, monitoring, and batch execution using DataStage Director, Control Center, Control-M, and AutoSys scheduling tools.
Supported enterprise data migration, integration, and warehouse modernization initiatives involving large-scale structured and transactional datasets.
Utilized QualityStage for data standardization, matching, cleansing, deduplication, and data quality management processes.
Worked with Teradata databases for data extraction, transformation, and loading activities using BTEQ, FastLoad, MultiLoad, and TPT utilities.
Developed DataStage jobs using Teradata connectors for high-performance bulk data loading and extraction processes.
Performed Teradata SQL optimization and basic performance tuning to improve query execution and ETL throughput.
Created and maintained Unix/Linux shell scripts for ETL automation, batch processing, file handling, and operational support activities.
Assisted in production support activities including job failure analysis, issue resolution, root cause analysis, and incident management.
Worked with metadata management and data lineage processes to support data governance and compliance initiatives.
Participated in code reviews, ETL design discussions, and deployment activities to ensure adherence to enterprise development standards and best practices.
Supported integration of data from multiple source systems including relational databases, flat files, APIs, and mainframe systems.
Worked in Agile and Waterfall environments, participating in sprint planning, requirement analysis, development, testing, and deployment activities.
Collaborated with QA and business teams to validate ETL processes, reconcile data discrepancies, and ensure data accuracy across enterprise reporting platforms.
Utilized strong analytical, programming, and problem-solving skills to deliver scalable automation solutions aligned with enterprise quality engineering objectives.
Participated in Agile Scrum ceremonies including sprint planning, backlog grooming, daily stand-ups, release coordination, and production deployment activities.
Coordinated SIT, UAT, regression testing, and defect resolution activities for data pipelines, reporting applications, and enterprise analytics solutions.
Worked closely with database administrators, developers, QA teams, and business users to troubleshoot pipeline failures, report discrepancies, performance bottlenecks, and production support issues.
Developed reusable reporting frameworks, SQL components, and enterprise data integration standards to streamline reporting and analytics operations.
Created and maintained technical documentation including architecture diagrams, report specifications, source-to-target mappings, data dictionaries, query logic, and operational runbooks.
Supported report migration, enhancement, maintenance, and modernization activities across enterprise reporting environments.
Ensured enterprise data governance, security, audit compliance, and operational standards were maintained across AWS, Databricks, SQL Server, and reporting platforms.
Integrated Kafka with Spark Streaming to enable near-real-time data processing.
Developed data transformations with SQL, Python, PySpark, and ELT tools such as dbt.
Integrated GCP services with hybrid and multi-cloud environments including AWS and Azure for enterprise-wide analytics solutions.
Implemented CI/CD and DevOps automation with Azure DevOps, including managing release pipelines, integration runtimes, linked services, logging, error handling, and monitoring across data workflows.
Prepared feature-engineered datasets using Python, PySpark, and SQL to support quantitative analytics models, enabling scalable model execution and monitoring.
Optimized query performance and aggregate awareness within AtScale to improve dashboard response times for high-volume analytical workloads.
Ensured data governance, validation, and compliance by leading data profiling, quality checks, audits, and documentation, as well as providing training and best-practice guidance to analysts and engineering teams.

Great American Insurance, Cincinnati, OH June 2021 to March 2024
Role: Senior Data Engineer
Responsibilities:

Developed SQL and Python-based data pipelines supporting AML transaction monitoring programs, enabling analysis of high-volume banking transactions for suspicious activity detection.
Worked with hybrid architectures integrating on-premises systems with GCP using secure data transfer methods.
Built analytical datasets for financial crime detection and risk monitoring models, enabling pattern analysis across accounts, customers, and geographies, while aligning data solutions with Microsoft Fabric analytics workflows.
Collaborated closely with risk analytics, compliance, and fraud teams to deliver curated datasets for model validation, coverage assessment, and regulatory reporting, actively contributing to application design discussions and translating business needs into scalable data solutions.
Supported multi-cloud analytics environments by integrating AtScale with Snowflake, Databricks, Azure, and GCP data platforms.
Optimized SQL queries and data pipelines supporting transaction monitoring models, improving performance and efficiency for large-scale financial datasets, while ensuring data quality through structured testing and QA practices.
Designed, developed, and optimized end-to-end data pipelines using Azure Data Factory (ADF), integrating diverse data sources (Azure SQL, Blob Storage, Azure SQL DW, MongoDB, Cosmos DB, and on-prem systems) aligned with modern Microsoft Fabric and lakehouse architectures.
Worked with hybrid data environments integrating on-prem systems with GCP using APIs and secure data transfer services.
Built and maintained Snowflake and Azure SQL Data Warehouse solutions, enhancing schema design, query performance, and scalability for analytics and regulatory compliance, while implementing validation frameworks, documentation, and user training to support adoption and long-term maintainability.
Automated ingestion and transformation of structured and semi-structured data (JSON/XML) into Azure Data Lake and Snowflake, ensuring consistency and reliability.
Processed structured and semi-structured data formats (JSON, Parquet, and Avro).
Migrated legacy ETL workloads from on-premises Hadoop to cloud-based data platforms.
Developed Spark/Databricks workflows (PySpark/Scala) for data enrichment, feature engineering, and near real-time processing.
Performed troubleshooting, metadata management, and cube optimization within AtScale to improve reporting efficiency and data accuracy.
Optimized MongoDB queries using indexes, aggregation pipelines, and sharding.
Used Spring and related frameworks for enterprise integration.
Designed topic strategies, partitioning, and retention policies to achieve scalability.
Integrated Airflow with Spark, Snowflake, cloud services, and REST APIs.
Implemented data quality, governance, and metadata management, including lineage tracking, profiling, and compliance with financial and regulatory standards.
Improved ETL performance to reduce processing time and infrastructure expenses.
Automated data ingestion from a variety of sources, including APIs, files, and databases.
Delivered BI and analytics solutions using Power BI and Excel, enabling reporting, monitoring of pipeline health, and actionable insights for risk and regulatory frameworks.
Optimized pipeline and job performance with monitoring, error handling, and tuning in Spark, ADF, and Snowflake for high-volume financial datasets.
Integrated Hadoop ecosystems with Spark, Kafka, and Sqoop to create end-to-end data pipelines.
Carried out cluster tuning and resource optimization to improve performance and reduce costs.
Participated in conceptual and logical data modeling, creating business rules, data mappings, and supporting enterprise data warehouse architecture.
Used Agile tools (Jira, Confluence) for sprint planning, documentation, and collaboration across cross-functional teams in Banking, Financial Services, High Tech, and Utilities industries.

QVC, West Chester, PA September 2019 to May 2021
Role: Data Engineer
Responsibilities:

Designed, built, and optimized ETL/ELT pipelines using AWS Glue, Lambda, Step Functions, dbt, and Python to process large financial datasets in real-time and batch modes.
Deployed and monitored Spark jobs across Databricks, EMR, and Hadoop clusters.
Created and maintained Snowflake and Databricks data warehouses, including tables, views, and data models for downstream analytics, while optimizing query performance and lowering computation costs.
Implemented data quality, governance, and security controls, such as schema validation, profiling, encryption, IAM policies, and PII/HIPAA compliance.
Integrated data from on-prem and Azure environments into GCP using secure APIs, Cloud Storage, and orchestration workflows.
Supported Airflow workflow CI/CD deployment in production scenarios.
Ensured data quality through validation, reconciliation, and error handling.
Created real-time streaming pipelines using Kafka, Spark Streaming, and Snowflake, enabling near-real-time fraud detection and ML model integration with SageMaker.
Implemented data ingestion from RDBMS and flat-file sources using Sqoop and Flume.
Automated workflows and managed pipelines with AWS Lambda, Step Functions, and Terraform to reduce manual involvement and provide CI/CD for ML and data pipelines.
Integrated data pipelines with BI and analytics tools (Power BI and Tableau) to provide actionable insights, dashboards, and reports to business stakeholders.
Managed end-to-end deployment, monitoring, and support in Dev, QA, and Prod environments to ensure high availability, fault tolerance, and performance optimization for production pipelines.

NBT Bank, Norwich, NY April 2018 to August 2019
Role: Data Engineer
Responsibilities:

Converted legacy ETL workloads into Spark + Scala pipelines, significantly improving execution speed and scalability.
Implemented reusable Scala-based transformation utilities, ensuring consistent data quality and schema enforcement.
Tuned Spark jobs for memory optimization, shuffle reduction, and cluster-level performance enhancements.
Migrated on-prem data ingestion workflows to ADF and Snowflake, improving performance and maintainability.
Automated data extraction and ingestion from relational databases and flat files using ADF and Python scripts.
Developed and maintained data ingestion frameworks integrating AWS S3 and Snowflake.
Enhanced data ingestion monitoring and alerting with ADF triggers and log analytics for operational visibility.
Implemented real-time stream ingestion using Kafka and Spark Streaming for transactional data feeds.
Worked on Spark SQL and Scala to replace legacy Hive queries, improving performance.
Built and scheduled Control-M workflows for data orchestration and batch processing.
Utilized Jira and Confluence in Agile-based sprint planning and documentation.
Engineered Delta Lake structures on Databricks and enabled incremental loads using ADF.
Conducted performance tuning for Spark and Snowflake jobs, reducing processing times.
Built reusable PySpark modules and unit-tested data transformations.
Wrote scalable ETL jobs using Apache Spark in PySpark and Scala to handle structured and semi-structured data (JSON, Parquet, Avro).
Collaborated with business analysts to convert business logic into efficient ETL workflows.
Handled JSON/XML data parsing and transformation for ingestion into Redshift.

Educational Details:
M.Com., Osmania University, Hyderabad, India, 2009
Bachelor's degree, Pragathi College, Osmania University, Hyderabad, 2007