Sravya Reddy - Senior Data Engineer
[email protected]
Location: USA
Relocation: Yes
Visa: H1B
Resume file: Shravya Resume_1768331825098.docx
Shravya Reddy
Senior Data Engineer [email protected] || +1 (425) 436-1787|| Location: TN https://www.linkedin.com/in/shravya-reddy-2874a4338/ PROFESSIONAL SUMMARY: Sr. Data Engineer with around 8 years of experience in big data technologies including Apache Spark, Kafka, Cassandra, HBase, and Hadoop Distribution with expertise in Spark components such as RDD, Spark Hadoop, Hive, and Spark Streaming. Proficient in multiple programming languages including SQL, PL/SQL, Python, R, and Scala. Experienced in cloud infrastructure such as AWS, Azure, and Pyspark with knowledge of GCP native services such as Big Query, Cloud Data Proc, and Cloud SQL-Postgres. Skilled in various scripting and query languages such as Shell scripting and SQL. Proficient in version control systems such as CVS, SVN, and GIT, and containerization tools such as Kubernetes and Docker. Showcases expertise in transforming existing AWS infrastructure into server less architecture through the implementation of AWS Lambda, Kinesis, and deployment of AWS Cloud Formation. Experience designing and building scalable data pipelines using PySpark, Databricks, and cloud platforms (Azure/AWS). Proficient in ETL development, data modeling, and big data processing Experienced in developing Gen AI ready data pipelines and integrating data engineering workflows with AI/ML frameworks, including Azure OpenAI and AWS SageMaker. Skilled in optimizing data ingestion, governance, and orchestration for machine learning, NLP, and LLM-based applications. Recognized for bridging data engineering and AI enablement automating data preparation for model training, ensuring data quality, and operationalizing AI workloads using CI/CD and DevOps best practices. Possesses extensive experience in seamlessly integrating Informatica Data Quality (IDQ) with Informatica Power Center to ensure robust data quality management throughout the integration process. Experienced in development and support, demonstrating in-depth knowledge of Oracle, SQL, PL/SQL, and T-SQL queries. Possesses extensive expertise in Relational Data Modeling, Dimensional Data Modeling, Logical Data Model Designs, ER Diagrams, and the execution of forward and reverse engineering processes. Demonstrated expertise in leveraging AWS S3 for efficient data staging, transfer, and archival. Utilized AWS Redshift for large- scale data migrations, implementing CDC (change data capture) using AWS DMS (Database Migration Service). Developed robust pipelines in ADF (Azure Data Factory) using Linked Services, Datasets, and Pipelines to extract, transform, and load data from diverse sources such as SQL databases, Blob storage, Azure SQL Data Warehouse, and write-back tools. Implemented CI/CD best practices, security scanning/monitoring, and pipeline integration. Proficient in creating, managing, and maintaining CI/CD pipelines to drive efficient and reliable data engineering workflows. Possesses extensive knowledge of Amazon EC2, providing robust computing, query processing, and storage solutions across diverse applications. Extensive experience in scripting and debugging on Windows environments, with a foundational understanding of container orchestration technologies like Kubernetes, Docker, and AKS. Hands-on experience in setting up workflows using Apache Airflow and Oozie workflow engine, effectively managing and scheduling Hadoop jobs. Strong experience in Data Warehousing projects using Talend, Informatica, and Pentaho, ensuring efficient data integration and transformation. 
- Proficient in working with NoSQL databases and their seamless integration with the Hadoop cluster, including HBase, Cassandra, MongoDB, DynamoDB, Cosmos DB, and Neo4j for graph databases.
- Extensive knowledge of dimensional data modeling, implementing Star and Snowflake schemas for fact and dimension tables using Analysis Services.
- Extensive experience using Sqoop for seamless data ingestion from RDBMS systems such as Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
- Involved in successful data migration projects from DB2 and Oracle to Teradata; developed automated scripts using UNIX shell scripting, Oracle/Teradata SQL, Teradata macros, and procedures to streamline the migration process.
- Developed and implemented Spark Streaming modules tailored for efficient data retrieval from RabbitMQ and Kafka, enhancing real-time data processing capabilities and enabling timely insights.
- Extensive proficiency with Databricks, Data Factory, Data Lake, Function Apps, and SQL, leveraging these technologies within Agile delivery, SAFe, and DevOps frameworks, resulting in successful project outcomes.

Technical Skills
Methodologies: Agile/Scrum, DevOps, DataOps, AI/ML lifecycle management
Databases: MySQL, Teradata, Oracle, Sybase, SQL Server, DB2
Cloud: Snowflake, AWS Redshift, S3, Redshift Spectrum, RDS, Glue, Athena, Lambda, CloudWatch, Hive, Presto
Gen AI Integration: LLM data pipeline design, Azure OpenAI, AWS Bedrock, Hugging Face, LangChain
Data Orchestration Tools: AWS Managed Apache Airflow (MWAA), Splunk (logs), SnapLogic
DevOps: Docker images, Kubernetes containers, CI/CD pipelines
Languages/Frameworks: Java (Spring Batch, Spring Boot), Python, SQL, PL/SQL
ETL Tools: Informatica PowerCenter 8.x/9.x/10.x, Informatica Intelligent Cloud Services
Data Virtualization Tools: Denodo
Scheduling: ActiveBatch, CA7, Control-M
Scripts: Advanced SQL, Python, PL/SQL, Unix shell scripting, XML, JSON
Code Management Tools: GitLab, GitHub, Bitbucket, Microsoft TFS, SourceTree
Project Management: JIRA, Confluence
Defect Tracking Tools: HP Quality Center, Microsoft TFS
Design: Microsoft Visio
Operating Systems: Windows, Linux
Business Domains: Airlines, Healthcare, Hospitality, Product-based, Travel, Retail, Insurance, Telecom

PROFESSIONAL EXPERIENCE

Senior Data Engineer | American Airlines, Irving, TX | Oct 2024 - Present
- Implemented data integration processes to consolidate and analyze inventory data from various sources, improving inventory management.
- Ensured up-to-date data in Google dashboard tables by scheduling and managing daily auto-runs of SQL queries on PLX Workflows, maintaining data accuracy and timely updates.
- Created data sharing between two Snowflake accounts; created internal and external stages and transformed data during load.
- Redesigned views in Snowflake to increase performance; unit tested the data between Redshift and Snowflake.
- Designed and implemented JCL scripts and CICS transactional interfaces for both online and batch workflows, ensuring zero downtime during cutovers; collaborated with QA and business stakeholders to validate behavior against legacy functionality.
- Worked with CI/CD tools such as Jenkins and version control tools Git and Bitbucket.
- Worked on automation, setup, and administration of build and deployment CI/CD tools such as Jenkins, integrated with build automation tools like Ant, Maven, Gradle, Bamboo, JIRA, and Bitbucket for building deployable artifacts.
- Designed and implemented robust Java Spring Batch jobs to process high-volume inventory data, ensuring thread-safe execution and error handling.
- Orchestrated complex batch workflows using Control-M on AWS, utilizing AWS Batch for scalable compute resource management.
- Created detailed solution architecture diagrams depicting the flow of data from on-prem Oracle to cloud Snowflake via Java Spring Batch and AWS S3.
- Day-to-day responsibilities include developing ETL pipelines in and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.
- Staged API and Kafka data (in JSON format) into the Snowflake database, flattening it with FLATTEN for different functional services.
- Designed and implemented a robust Lakehouse architecture using Databricks Delta Lake and Unity Catalog, improving data governance and access control across business units.
- Used Git version control to manage source code, integrating Git with Jenkins to support build automation and with Jira to monitor commits.
- Developed and managed scalable Spark Structured Streaming pipelines for near real-time analytics, leveraging the Dataset and DataFrame APIs for efficient transformations.
- Architected a Medallion architecture (Bronze, Silver, Gold layers) to enable data reliability, cleansing, and enrichment processes, improving data quality for downstream consumers.
- Deployed and optimized Databricks serverless compute options to reduce infrastructure costs for ad-hoc and scheduled analytics workloads.
- Configured and managed Databricks clusters with appropriate autoscaling and sizing strategies to balance performance and cost across ETL and streaming jobs.
- Orchestrated complex Databricks jobs using multi-task workflows and task dependencies, ensuring reliable execution of data pipelines and reducing manual overhead.
- Created and managed Docker containers, both from existing Linux containers and AMIs and built from scratch on Linux and Windows servers.
- Developed Talend jobs to efficiently copy files between servers, leveraging Talend FTP components.
- Explored and visualized data using Power BI for comprehensive analysis and reporting.
- Developed REST API endpoints using Python with the Flask and Django frameworks, integrating data sources such as Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files for seamless data access and processing.
- Contributed to the development and implementation of the Cloudera Hadoop environment, utilizing Cosmos DB for storing catalog data and event sourcing in order processing pipelines.
- Collaborated on an Apache Spark data processing project, handling data from RDBMS and multiple streaming sources, and used Python to develop Spark applications on AWS EMR, ensuring efficient data processing.
- Connected AWS stack data to BI tools (Tableau, Power BI) for reporting and analytics, enabling insightful data visualization and analysis.
- Designed and implemented large-scale ETL pipelines using PySpark on Databricks, processing large volumes of data daily.
- Successfully migrated an entire Oracle database to Redshift and implemented ETL data pipelines using Airflow on AWS, leveraging different Airflow operators for job scheduling and management.
- Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, successfully implementing ML use cases with Spark ML and MLlib.
- Transformed Hive SQL queries into RDD transformations in Apache Spark using Scala, enhancing data processing speed and efficiency.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Designed and implemented complex SSAS cubes with multiple fact measure groups and dimension hierarchies to meet OLAP reporting needs.
- Installed and administered the Git source code tool, ensuring application reliability, and devised effective branching strategies for Git.
- Utilized Terraform to write templates for Azure Infrastructure as Code, facilitating the creation of staging and production environments.
- Actively participated in Agile, SAFe, and DevOps frameworks, employing DevOps tools for seamless planning, building, testing, releasing, and monitoring of data engineering solutions.
Tools Used: Git, DevOps, Agile, Azure, SSAS, OLAP, SQL, RDD, Spark, ETL, Power BI, Python, Linux, CI/CD

Senior Data Engineer | Caterpillar, Chicago, IL | Apr 2024 - Sep 2024
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats to analyze and transform the data and uncover insights into customer usage patterns.
- Developed custom Java ETL components to transform complex healthcare EDI/HL7 data before loading into Snowflake.
- Developed SQL scripts for automation.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Involved in continuous integration and deployment (CI/CD) using DevOps tools like Looper and Concord.
- Designed a workflow using Airflow to automate jobs.
- Implemented a CI/CD pipeline with Jenkins, GitHub, Nexus, Maven, and AWS AMIs.
- Built robust data quality and validation checks; set up automated tests and monitoring for data drift and schema changes, and implemented alerting in AWS CloudWatch and Databricks.
- Integrated streaming and batch data from legacy COBOL/mainframe exports into the Databricks environment; worked with formats like CSV, Parquet, and Avro; optimized ingestion pipelines for cost and performance in AWS (S3, EMR, Glue, etc.).
- Collaborated with cross-functional teams (data engineering, business analysts, operations) to gather requirements, design data models, and deliver analytics outputs; deployed pipelines using CI/CD tools (e.g., Jenkins, Git, Azure DevOps) for smoother deployment and rollback.
- Ensured security, governance, and regulatory compliance in data handling; applied best practices around IAM and encryption at rest and in transit in AWS; documented lineage and governance for auditability.
- Leveraged Spark Scala functions to derive real-time insights and generate reports by mining large datasets.
- Utilized Spark Context, Spark SQL, and Spark Streaming to efficiently process and analyze extensive data sets.
- Developed cutting-edge data processing pipelines, processing a substantial volume of 40-50 GB daily, by leveraging Python libraries in conjunction with Google's internal tools such as Pantheon ETL and Plx scripts integrated with SQL.
- Leveraged the Erwin data modeling tool to craft meticulous and efficient designs for relational databases and dimensional data warehouses.
- Migrated legacy ETL pipelines from SQL Server to Databricks, enabling faster and more scalable processing.
- Developed, implemented, and maintained CI/CD pipelines for big data solutions, collaborating closely with development teams to ensure seamless code deployment to the production environment.
Tools Used: ETL, CI/CD, SQL, Databricks, Spark, Azure

Azure Data Engineer | WellCare Health Plans, Tampa, FL | May 2023 - March 2024
- Built and managed data ingestion workflows from on-premises and cloud sources using ADF pipelines, Event Hub, Azure Functions, and Logic Apps.
- Developed scalable data transformation solutions using Azure Databricks (PySpark, Spark SQL) and Synapse pipelines.
- Designed and developed end-to-end data pipelines using Azure Data Factory, Synapse Analytics, and Databricks for healthcare data ingestion and transformation.
- Implemented secure data storage solutions in Azure Data Lake and SQL Database, ensuring compliance with HIPAA and PHI data protection standards.
- Integrated diverse healthcare data sources such as EHR, EMR, claims, and clinical systems using FHIR, HL7, and EDI standards.
- Worked with Bridges, an EPIC interface application that allows HL7 messages to be sent out of and accepted by EPIC.
- Extensively worked on the EPIC interface engine for orders from EPIC and results from the Lab, Radiology, and Pharmacy systems; errors loaded into the EPIC system were resolved and corrected in EPIC by the BREC team.
- Analyzed PeopleSoft and HCM application HR data pertaining to Rogers and its functionality; performed requirement gathering and data analysis of new HCM data; prepared plans and scripts to update existing data per the new HCM definition.
- Implemented GDPR-compliant data handling, storage, and processing practices; ensured employee data was anonymized or pseudonymized where required.
- Collaborated with HR and Legal teams to align data processes with privacy policies; monitored and audited data access to meet GDPR and organizational standards.
- Designed and built an HCM data mart for new HR data to fit into existing HR application data.
- Designed and deployed scalable data pipelines using OCI Data Integration and OCI Data Flow (Apache Spark).
- Managed data storage using OCI Object Storage, Autonomous Data Warehouse, and Oracle Database Cloud Service.
- Implemented data governance and encryption using OCI Vault and Identity and Access Management (IAM).
- Developed and monitored ETL jobs integrating Oracle HCM Cloud with other enterprise systems.
- Automated data ingestion and transformation using OCI Functions and Oracle Integration Cloud (OIC).
- Built and maintained real-time analytics pipelines leveraging Oracle Stream Analytics.
- Ensured GDPR and data privacy compliance through secure data handling and audit trails.
- Ensured data quality and governance through data validation, lineage tracking, and metadata management using tools like Azure Purview.
- Collaborated with data scientists and analysts to provide clean, curated datasets for patient analytics, population health, and predictive modeling.
- Optimized performance of data pipelines and queries through partitioning, indexing, and efficient data model design in Azure Synapse.
- Designed and maintained data pipelines integrating various HCM systems; developed and optimized data models to support HR analytics and reporting.
- Built and managed ETL processes for accurate data extraction and transformation, ensuring data quality, consistency, and accuracy across HR platforms.
- Implemented data governance and compliance with privacy regulations; collaborated with HR, IT, and analytics teams to meet business needs.
- Implemented data governance best practices, including data lineage, metadata management, and master data management (MDM).
- Applied HIPAA, GDPR, and PHI compliance standards to ensure data privacy and security; managed RBAC (role-based access control) and data masking/encryption for sensitive healthcare information.
- Ensured efficient data partitioning, optimization, and storage for high-performance analytics.
- Designed and implemented end-to-end data pipelines and ETL/ELT processes using Azure services (e.g., Azure Data Factory, Synapse Analytics, Databricks, Data Lake Storage).
- Developed data models and data warehousing solutions aligned with healthcare data standards (HL7, FHIR, HIPAA compliance).
- Architected data ingestion frameworks for structured and unstructured data from multiple healthcare systems (EHRs, claims, EMR, clinical systems).
Tools Used: EMR, EHR, ETL, Azure, HL7, FHIR, Databricks, EPIC, GDPR

Data Engineer | Edward Jones Investment (Deloitte), Raleigh, NC | Aug 2021 - Apr 2023
- Engineered a robust process to ingest data from RDBMS, execute data transformations, and subsequently export the refined data to Cassandra in accordance with business needs.
- Spearheaded the design and development of ETL processes using the Informatica ETL tool, contributing to the creation of dimension and fact files.
- Conducted wide and narrow transformations and executed diverse actions such as filter, lookup, join, and count on Spark DataFrames to enhance data quality and relevancy.
- Aggregated log data from various servers using Apache Kafka, making it accessible to downstream systems for improved data analytics.
- Elevated Kafka performance and bolstered security measures to ensure robust and safe data processing.
- Documented requirements and existing code for implementation using Spark, Hive, and HDFS, ensuring clarity and effective utilization of resources.
- Constructed ETL mappings with Talend Integration Suite to extract data from the source, apply transformations, and load data into the target database, enhancing data flow and quality.
- Utilized the Kibana interface for filtering and visualizing log messages gathered by an Elasticsearch (ELK) stack.
- Transformed data from various file formats (text, CSV, JSON) using Python scripts.
- Devised advanced Spark applications for meticulous data validation, cleansing, transformation, and custom aggregation.
- Employed the Spark engine, Spark SQL, and Spark Streaming for in-depth data analysis and efficient batch processing.
- Conceptualized and executed modern, scalable, and distributed data solutions employing Hadoop, Azure cloud services, and hybrid data modeling.
- Leveraged Erwin's data modeling and transformation capabilities to seamlessly integrate the CRM system with other sales and marketing systems.
- Unified customer data from multiple sources (CRM, web, support) into a single platform using PySpark and Databricks.
- Converted Oracle DDL to PostgreSQL DDL using ora2pg and AWS tooling, resolving compilation issues in the generated PostgreSQL source code.
- Amplified data-driven decision-making and strategy execution by leveraging visualization tools such as IBM Cognos and Tableau.
- Established a procedure to migrate source code and data from Oracle to PostgreSQL in AWS, ensuring a smooth transition and operational continuity.
- Incorporated machine learning and NLP to design a predictive maintenance system capable of anticipating equipment failures based on NLP analysis of equipment sensor data.
Tools Used: Python, Pandas, Shell, Hadoop, PySpark, Sqoop, MapReduce, SQL, Teradata, Snowflake, Hive, Pig, Azure, Databricks, Kafka, Azure Data Factory, Glue, HBase, Apache, Eclipse, Airflow, Informatica

Data Engineer | ADP Int, Hyderabad, India | July 2016 - Dec 2019
- Developed HIPAA-compliant data pipelines using Apache Spark, Hive, and HDFS, powering patient analytics in a centralized Snowflake data warehouse.
- Ingested wearable device telemetry using Kinesis and Apache Flink, feeding real-time vitals into Redshift for early rehospitalization risk scoring.
- Trained ML models using PySpark and MLflow to predict pharmaceutical efficacy, integrating results into clinician dashboards for preventive care.
- Leveraged Palantir Foundry for GDPR-compliant pipeline versioning and end-to-end lineage tracking for medical trial data.
- Developed a central Snowflake data warehouse for healthcare data, integrating patient visit registers, prescriptions, and clinical outcomes into an OLAP-compliant store under HIPAA constraints, with all access provisioned through CyberArk vaults.
- Implemented real-time processing of IoT streams through AWS Kinesis Data Analytics, utilizing Apache Flink to ingest streams from wearable devices into Redshift for further analytics, flagging patients at risk of rehospitalization earlier.
- Trained predictive models in MLflow using PySpark, applying clinical and pharmaceutical datasets to predict drug efficacy and identify high-risk patient scenarios, visualized in clinician-facing Tableau dashboards.
- Used the Palantir Foundry platform to create modular data pipelines and establish lineage tracking for trial and treatment records, with full traceability and automatic tagging for compliance with GDPR regulations.
- Built and maintained containerized ETL execution environments on OpenShift, enabling portable and secure deployment of Airflow and Spark workloads across internal medical research infrastructure.
- Designed a dynamic NoSQL layer in MongoDB to store unstructured clinician notes and embedded telemetry metadata, enabling downstream AI-based patient outcome models through retention of linkages across treatment episodes and wearable data streams.
Tools Used: PySpark, Azure, Palantir, ETL, GDPR, HIPAA, Spark

Education
Master's in Computer and Information Sciences, Southern Arkansas University, TX - 2021
Bachelor's in Computer Science Engineering, Jawaharlal Nehru Technological University, Hyderabad - 2016