SRIHARSHA V
Sr. GCP Data Engineer
Email: [email protected]
+1 972-924-5835 (Employer)
PROFESSIONAL SUMMARY:

Senior Data Engineer with 10+ years of experience leading Agile development teams and delivering enterprise-scale data acquisition, ETL automation, dimensional modeling, and complex SQL engineering solutions. Expert in Informatica PowerCenter/IICS, Oracle SQL, Snowflake, PL/SQL, and multi-source data integration for analytics, BI, and regulatory reporting environments. Adept at designing optimized data models, developing automated V&V frameworks, building materialized views, and validating high-volume data pipelines.
Strong background working with healthcare, financial, and enterprise data, including PHI/PII under HIPAA-compliant architectures. Experienced in requirements gathering, mockup preparation, technical documentation, ETL architecture, and cross-functional collaboration with architects, analysts, and customers. Proven ability to write efficient, testable code, manage concurrent priorities, and support broad enterprise data platform implementations.


TECHNICAL SKILLS:

Cloud Platforms & Core GCP Services: Google Cloud Platform (GCP), BigQuery, Cloud Storage (GCS), Cloud Composer (Apache Airflow on GCP), Cloud Dataflow (Apache Beam), Cloud Dataproc (Apache Spark, Hadoop, Hive), Pub/Sub, Cloud Functions, Cloud Run & App Engine, Vertex AI, Cloud SQL / Cloud Spanner / Bigtable, Data Catalog, Looker Studio (Data Studio), Cloud Logging, Monitoring, and Error Reporting. Additional experience with the Microsoft Azure ecosystem, including Azure Data Factory, Azure Databricks, Azure EventHub, Azure Data Lake Storage (ADLS), Azure Blob Storage, Azure DevOps, and Azure SQL for enterprise-scale data ingestion and transformation.
ETL / ELT & Data Integration Tools: Apache Beam, Apache Airflow, Apache NiFi, Informatica, Talend, Datameer; Dataflow templates, Composer DAGs, Python ETL frameworks; integration with REST/JSON/XML APIs, Cloud Storage Transfer Service, and Cloud Data Fusion
Big Data Ecosystem: Apache Spark, Hadoop, Hive, Pig, Oozie, Kafka, HBase, Flume, Sqoop; PySpark, Spark Streaming, Structured Streaming, Spark SQL; data lake and data warehouse architecture; Delta Lake, Iceberg, Parquet, Avro
Programming & Scripting Languages: Python (data processing, Airflow DAGs, PySpark jobs, automation scripts); SQL (advanced analytical queries, stored procedures, query optimization in BigQuery SQL and ANSI SQL); Java/Scala (Spark applications and Beam pipelines); Shell scripting/Bash (automation and deployment); Go/JavaScript (occasional use in GCP Cloud Functions and APIs)
Data Modeling & Warehousing: Dimensional modeling, Data Vault 2.0, OLAP/OLTP concepts, Visio/Erwin; data normalization/denormalization, surrogate key management, slowly changing dimensions (SCDs); BigQuery partitioned tables, materialized views, data marts, and semantic layer design
Databases & Storage Systems: Cloud SQL, Cloud Spanner, Bigtable, Firestore, MySQL, PostgreSQL, Oracle, Teradata, SQL Server, MongoDB, Cassandra; NoSQL/NewSQL systems, Redis, Elasticsearch, Memorystore
Data Governance, Security & Compliance: IAM (role-based access, least privilege, service accounts); VPC Service Controls, Private Service Connect, firewall rules, network policies; Cloud KMS (data encryption and key rotation); audit logs, Data Catalog, Cloud DLP (classification and masking of sensitive data); GDPR, HIPAA, and SOC 2 compliance best practices; Cloud Armor, Secret Manager, Cloud Identity
CI/CD, DevOps & Infrastructure as Code: Terraform, Cloud Deployment Manager, Ansible, Jenkins, GitLab CI/CD, GitHub Actions, Cloud Build, Cloud Source Repositories; Docker, Kubernetes (GKE); Helm charts, Argo Workflows, Anthos; version control (Git)
Machine Learning & Advanced Analytics: Vertex AI Pipelines, AutoML, BigQuery ML, TensorFlow, Scikit-learn, Dataprep by Trifacta; feature engineering pipelines, ML model deployment, and data preprocessing; integration with AI services
Version Control, Collaboration & Agile Tools: Git, Bitbucket, GitLab, JIRA, Confluence, ServiceNow, Slack; Agile/Scrum/Kanban methodologies; code reviews, branching strategies, CI/CD automation, and Agile ceremony participation




PROFESSIONAL EXPERIENCE:
Sr. GCP Data Engineer
Molina Healthcare October 2023 to Present
Responsibilities:
Architected enterprise-wide cloud data ingestion frameworks using Google Cloud Dataflow by leveraging Apache Beam's unified programming model, enabling parallel processing, automatic scaling, and fault tolerance to handle continuous healthcare eligibility and claims transactions with guaranteed delivery and minimal latency.
Automated enterprise-scale ETL data acquisition by designing and orchestrating end-to-end Informatica PowerCenter/IICS workflows as the primary integration engine managing multi-source ingestion, complex transformation logic, error handling, dependency chaining, and dimensional loads.
Supplemented Informatica pipelines with Python utilities for schema validation, file parsing, API payload handling, and reconciliation checks. Integrated these workflows with Cloud Composer for scheduling, monitoring, and coordinated execution across upstream and downstream systems, ensuring fully automated and reliable data acquisition processes.
Developed API-based ingestion frameworks using REST/JSON/XML to pull data from external systems, validate schemas, transform payloads, and load standardized datasets into dimensional models.
Built Oracle materialized views, staging layers, and dimensional structures to optimize downstream Tableau, Power BI, and analytics workloads.
Designed BigQuery analytical warehouse layers with optimized table partitioning, clustering strategies, and columnar compression, ensuring cost-efficient execution of complex clinical, operational, and actuarial queries on billions of healthcare records.
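A minimal sketch of the partitioning and clustering approach described above, expressed as BigQuery DDL submitted from Python; the dataset, table, and column names are illustrative placeholders, not the actual schema.

from google.cloud import bigquery

ddl = """
CREATE TABLE IF NOT EXISTS `analytics.fact_claims`
(
  claim_id STRING,
  member_id STRING,
  provider_id STRING,
  service_date DATE,
  allowed_amount NUMERIC
)
PARTITION BY service_date            -- prunes scans to the dates a query touches
CLUSTER BY member_id, provider_id    -- co-locates rows for common join/filter keys
"""
bigquery.Client().query(ddl).result()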
Developed real-time event ingestion pipelines using Pub/Sub as a durable message bus, enabling asynchronous distribution of patient events, provider updates, and claims changes into downstream transformation pipelines for near-instant availability across care systems.
Implemented PySpark-based data transformation frameworks on Dataproc clusters, using in-memory distributed computation, broadcast joins, shuffle optimization, and adaptive execution to process large-scale medical claims datasets efficiently.
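A minimal PySpark sketch of the broadcast-join pattern referenced above, assuming a large claims fact and a small provider dimension; bucket paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-enrichment").getOrCreate()

claims = spark.read.parquet("gs://example-curated/claims/")        # large fact
providers = spark.read.parquet("gs://example-curated/providers/")  # small dimension

# Broadcasting the small side avoids shuffling the large claims dataset.
enriched = claims.join(F.broadcast(providers), on="provider_id", how="left")

(enriched
 .repartition("service_month")
 .write.mode("overwrite")
 .partitionBy("service_month")
 .parquet("gs://example-trusted/claims_enriched/"))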
Designed and developed complex Informatica mappings, including Source Qualifier logic, reusable transformations, lookup caches, expression logic, filter/pushdown optimizations, update strategy logic, and dynamic partitioning for large healthcare and financial datasets.
Acquired data from multiple heterogeneous systems (Oracle, SQL Server, Excel, PDFs, internal APIs) and automated ingestion frameworks to support analytics, BI dashboards, and regulated reporting.
Applied Java and JavaScript for lightweight components such as data validation utilities, REST API handlers, and transformation helpers supporting ETL and pipeline orchestration.
Performed comprehensive integration, load, and stress testing across ETL pipelines, validating end-to-end data flows, system behavior under high-volume workloads, transformation accuracy, and performance thresholds for Informatica, Oracle SQL, and downstream analytics layers.
Developed complex SQL and PL/SQL scripts to transform diverse healthcare datasets into dimensional models, implementing optimized joins, partitioning strategies, indexing, materialized views, and incremental refresh logic.
Used Informatica and Python-based automation to orchestrate end-to-end ETL flows, error handling, dependency management, reconciliation checks, and automated data validation against business rules.
Designed reusable verification & validation (V&V) frameworks, including automated regression checks, schema enforcement, record-level and aggregation-level validations, enabling consistent quality across releases.
Collaborated closely with data architects, modelers, and product owners to design data acquisition strategies, enforce modeling standards, and refine mockups with business stakeholders.
Produced detailed technical documentation including STTM mapping sheets, workflow diagrams, sequence diagrams, architecture blueprints, and test plans to support DD&I efforts and cross-team transparency.
Participated in all Agile ceremonies and led technical discussions, identifying dependencies, risks, design inconsistencies, and improvements for performance and reliability.
Worked on Azure Data Factory (ADF) pipelines for orchestrating cross-cloud ingestion from on-prem and external sources into Azure Data Lake, integrating with Databricks notebooks for scalable transformations.
Built Azure Databricks PySpark notebooks for data cleansing, schema validation, delta processing, and building curated datasets supporting analytics and reporting.
Integrated Azure EventHub streams for real-time ingestion use cases, processing provider and claims events before routing them into downstream healthcare analytic layers.
Utilized Azure Blob Storage and ADLS Gen2 for distributed data storage, implementing secure folder structures, access policies, and governance.
Collaborated with DevOps teams to integrate CI/CD processes using Azure DevOps pipelines for automated deployment of notebooks, ADF pipelines, and infrastructure components.
Experienced working on large-scale healthcare and human services (HHS) data integration projects involving PHI/PII, eligibility data, provider files, claims data, and regulatory reporting requirements aligned with state HHS standards.
Established a healthcare-focused lakehouse environment using Delta Lake, Hudi, and Iceberg to enforce ACID transactions, schema evolution, and time-travel capabilities, supporting regulatory reporting, auditability, and historical reconciliation.
Developed reusable Cloud Composer (Airflow) DAGs incorporating dependency management, failure recovery, SLA monitoring, and parameterized task execution to orchestrate batch, streaming, and on-demand workflows across the GCP ecosystem.
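A condensed Cloud Composer (Airflow 2.x) sketch of the DAG conventions described above, showing retries, an SLA, and a date-parameterized task; the DAG id, schedule, and callable are illustrative placeholders.

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_partition(**context):
    # Placeholder for the parameterized load step; 'ds' is the logical run date.
    print(f"Loading claims partition for {context['ds']}")

with DAG(
    dag_id="claims_daily_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
        "sla": timedelta(hours=2),   # misses surface in Airflow's SLA reports
    },
) as dag:
    PythonOperator(task_id="load_claims_partition", python_callable=load_partition)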
Integrated Vertex AI pipelines to automate ML workflows such as feature computation, model training, hyperparameter tuning, and online model deployment, enabling clinical risk scoring and care management optimization.
Designed event-driven data workflows using Cloud Functions, enabling trigger-based execution for tasks such as file validation, event notification, metadata registration, and automated schema verification.
Basic exposure to C#-based services used in upstream data integration systems; collaborated with application teams to understand payload structures and ingestion requirements.
Developed healthcare-specific SCD (Slowly Changing Dimensions) pipelines using BigQuery MERGE logic, capturing historical versions of provider, member, and claim attributes to support audits and downstream actuarial modeling.
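A simplified sketch of the BigQuery MERGE pattern behind these SCD pipelines, showing only the close-current-row and insert-new-key steps (a full Type 2 load also re-inserts the new version of changed rows); dataset, table, and column names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE `analytics.dim_provider` T
USING `staging.provider_updates` S
ON T.provider_id = S.provider_id AND T.is_current = TRUE
WHEN MATCHED AND T.attributes_hash != S.attributes_hash THEN
  UPDATE SET is_current = FALSE, effective_end_date = CURRENT_DATE()
WHEN NOT MATCHED THEN
  INSERT (provider_id, attributes_hash, effective_start_date, effective_end_date, is_current)
  VALUES (S.provider_id, S.attributes_hash, CURRENT_DATE(), DATE '9999-12-31', TRUE)
"""
client.query(merge_sql).result()  # blocks until the MERGE completes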
Built monitoring dashboards using Cloud Monitoring and custom metrics, enabling proactive anomaly detection, throughput monitoring, worker utilization tracking, and alerting for all critical pipelines.
Implemented end-to-end reconciliation logic using Python and SQL, comparing source system records with processed warehouse records to detect missing, mismatched, or late-arriving clinical and operational data.
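A stripped-down sketch of the count-level reconciliation idea, assuming both the landing and warehouse copies are queryable from BigQuery; table names are illustrative.

from google.cloud import bigquery

client = bigquery.Client()

def daily_counts(table):
    sql = f"SELECT service_date, COUNT(*) AS cnt FROM `{table}` GROUP BY service_date"
    return {row.service_date: row.cnt for row in client.query(sql).result()}

source = daily_counts("landing.claims_raw")
target = daily_counts("analytics.fact_claims")

for day, src_cnt in sorted(source.items()):
    tgt_cnt = target.get(day, 0)
    if src_cnt != tgt_cnt:
        print(f"MISMATCH {day}: source={src_cnt} warehouse={tgt_cnt}")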
Designed multi-zone data lake folder structures (raw, curated, trusted, consumer-ready) using GCS, ensuring proper lineage, retention, and refined transformation workflows across healthcare entities.
Implemented schema validation frameworks using JSON schema enforcement and custom Python validators, ensuring consistent formatting, field presence, and type accuracy for diverse healthcare feeds.
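A minimal sketch of the JSON-schema enforcement described above using the jsonschema package; the member schema and field names are illustrative.

import json
from jsonschema import Draft7Validator

member_schema = {
    "type": "object",
    "required": ["member_id", "dob", "plan_code"],
    "properties": {
        "member_id": {"type": "string"},
        "dob": {"type": "string", "format": "date"},
        "plan_code": {"type": "string"},
    },
}

validator = Draft7Validator(member_schema)

def validate_record(line):
    record = json.loads(line)
    errors = [e.message for e in validator.iter_errors(record)]
    return record, errors   # records with errors are routed to a rejects zone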
Integrated multi-source datasets into Tableau-ready dimensional and semantic layers by creating optimized Oracle SQL views, materialized views, and Informatica-powered ETL pipelines to support high-performance interactive dashboards and analytics.
Developed advanced SQL-based transformation layers in BigQuery, performing complex aggregations, joins, deduplication, condition coding, and clinical quality rule execution.
Familiar with MuleSoft integration concepts, including API-led architecture and basic flow development, through collaboration with integration teams on enterprise data ingestion initiatives.
Implemented deterministic and probabilistic member-matching algorithms for healthcare entity deduplication.
Enabled enterprise BI and metrics reporting by designing Tableau-ready data marts, KPI layers, and curated dimensional models that supported executive dashboards, operational scorecards, and performance analytics.
Used Airflow's Kubernetes Executor for dynamic task scaling during peak processing while maintaining cluster efficiency.
Developed lineage-aware transformation logic ensuring traceability of every field across ingestion, transformation, and output layers.
Ensured that all new data products aligned with GCP best practices, scalability requirements, and healthcare regulatory expectations.
Environment: GCP (BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Functions, Vertex AI), Delta Lake/Iceberg/Hudi, Airflow/Cloud Composer, Python, SQL, PySpark, MLflow, Unity Catalog, Data Catalog, Prometheus/Grafana, Terraform, Git



Sr Data Engineer
Truist Bank June 2021 to September 2023
Responsibilities:
Engineered large-scale financial data ingestion frameworks using AWS Glue ETL jobs, leveraging PySpark-based distributed processing, dynamic frames, and job bookmarks to ingest transactional banking records into S3 with guaranteed consistency and idempotent data loads.
Designed Redshift-based analytical marts using optimized distribution styles, sort keys, and compression encodings, enabling fast execution of financial aggregations, risk evaluation queries, and regulatory dashboards across high-volume datasets.
Built real-time streaming ingestion pipelines using Kinesis Data Streams and Kinesis Firehose, capturing continuous card transactions, deposits, and balance updates, ensuring low-latency delivery to downstream analytics and fraud detection systems.
Developed Lambda-based micro-ETL functions triggered by S3 events, API calls, and Kinesis updates, enabling lightweight transformations, file validations, schema checks, and notifications without the need for complete cluster execution.
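An illustrative Lambda handler for the S3-triggered micro-ETL pattern, performing a lightweight file check before handing off; bucket names and checks are hypothetical.

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        head = s3.head_object(Bucket=bucket, Key=key)
        if head["ContentLength"] == 0:
            raise ValueError(f"Empty file received: s3://{bucket}/{key}")
        # Lightweight checks passed; notify the downstream workflow.
        print(json.dumps({"validated": f"s3://{bucket}/{key}"}))
    return {"status": "ok"}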
Created Glue Workflows orchestrating multi-step ETL processes, implementing conditional branches, parallel execution, dependency tracking, and automatic retries to maintain reliable processing of sensitive financial data sources.
Configured AWS KMS for encryption-at-rest and encryption-in-transit, ensuring all sensitive account and transaction data was cryptographically protected and compliant with banking security policies.
Built automated data quality frameworks using Great Expectations, embedding validation rules into ETL pipelines to enforce schema conformance, data completeness, business rule accuracy, and anomaly detection.
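A hedged sketch of embedding such validation rules in a pipeline step using Great Expectations' classic pandas-dataset API (exact calls vary by library version); the table path and expectations are illustrative.

import great_expectations as ge
import pandas as pd

df = pd.read_parquet("s3://example-refined/transactions/date=2023-06-01/")
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("transaction_id")
gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)
gdf.expect_column_values_to_be_in_set("currency", ["USD"])

results = gdf.validate()
if not results["success"]:
    raise RuntimeError("Data quality checks failed; halting downstream load")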
Designed lineage-aware pipelines tracing transformations from raw transaction ingestion through enriched and aggregated financial layers, enabling audit teams to verify accounting processes and regulatory compliance.
Created SQL-based transformation logic to produce business-critical outputs such as daily account balances, loan performance indicators, fraud alerts, and forecasting metrics for risk monitoring teams.
Implemented standardized financial modeling logic for account hierarchies, transaction categorization, product-level mappings, and regulatory classification (Basel, AML, KYC requirements).
Performed performance tuning across ETL workloads by optimizing Spark partition counts, minimizing shuffles, rewriting complex joins, and adjusting Glue job execution parameters.
Supported migration and integration efforts between AWS and Azure data ecosystems, including building PoC pipelines using Azure Data Factory and Databricks to validate multi-cloud ingestion strategies for financial datasets.
Processed streaming data via Azure EventHub during cross-platform modernization initiatives, enabling parallel ingestion pathways for transactional banking data.
Developed automated reconciliation pipelines comparing source system extractions against curated warehouse records, identifying mismatches in transaction counts, dollar amounts, and account-level aggregates.
Created S3 data zoning structures (landing, raw, refined, curated, and regulatory), each with specific access rules, retention strategies, and validation logic to maintain governance across the entire financial data lifecycle.
Built anomaly detection logic using PySpark and SQL, identifying suspicious transaction patterns, volume spikes, and inconsistent account behavior to support fraud prevention initiatives.
Developed Glue crawlers and metadata catalogs to automatically register schema updates and maintain accurate table definitions in Athena and Redshift Spectrum.
Integrated third-party financial APIs using Lambda, Python SDKs, and secure credential management, incorporating market data, credit scoring feeds, and risk indexes into the bank's analytics platform.
Configured CloudWatch dashboards displaying ETL health metrics, job runtimes, error rates, partition counts, and streaming lag, enabling proactive incident management and SLA tracking.
Built Elasticsearch + Kibana pipelines for indexing logs, transactional metadata, and ETL output logs, supporting audit queries and operational investigations.
Developed DevOps CI/CD processes using GitLab pipelines and Terraform, automating the provisioning of EMR clusters, Glue jobs, IAM policies, and API endpoints while maintaining version-controlled infrastructure.
Integrated AWS SNS and SQS messaging into ETL workflows to support asynchronous communication, event-driven execution, and workflow decoupling for high-reliability operations.
Enhanced Redshift performance by rewriting complex SQL, applying distribution key changes, minimizing cross-node joins, and introducing materialized views for repeated query patterns.
Built data transformation pipelines consolidating customer profiles from multiple systems (loans, credit cards, deposits, mortgages), creating unified customer 360-degree views.
Designed secure cross-account Redshift data sharing using IAM authentication, parameter groups, and encrypted network connections.
Built ETL regression test suites using PyTest to ensure that pipeline changes did not break downstream accounting or risk calculations.
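A minimal PyTest sketch of this regression style; the transform here is a stand-in for the real pipeline step under test, and the figures are illustrative.

import pandas as pd
import pytest

def compute_daily_balances(txns: pd.DataFrame) -> pd.DataFrame:
    # Stand-in transform: sum posted amounts per account per day.
    return (txns.groupby(["account_id", "posted_date"], as_index=False)["amount"]
                .sum()
                .rename(columns={"amount": "balance"}))

@pytest.fixture
def sample_transactions():
    return pd.DataFrame({
        "account_id": ["A1", "A1", "A2"],
        "posted_date": ["2023-01-01", "2023-01-01", "2023-01-01"],
        "amount": [100.0, -40.0, 25.0],
    })

def test_daily_balance_sums_per_account(sample_transactions):
    result = compute_daily_balances(sample_transactions)
    assert result.loc[result.account_id == "A1", "balance"].iloc[0] == 60.0
    assert result.loc[result.account_id == "A2", "balance"].iloc[0] == 25.0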
Performed end-to-end optimization of transaction processing workflows, reducing overall latency and increasing throughput for downstream consumption layers.
Developed source-to-target mapping documents, business transformation rule dictionaries, and data contracts for all major financial datasets.
Created alerting systems detecting stale data, missing partitions, schema mismatches, and unexpected spikes in financial record volume.
Implemented Glue job bookmarking and checkpointing to ensure safe recovery in the event of ETL interruptions or partial dataset failures.
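A sketch of the Glue job skeleton that bookmarking relies on (bookmarks themselves are enabled in the job configuration); database, table, and path names are illustrative.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # resumes from the last committed bookmark

frame = glue_context.create_dynamic_frame.from_catalog(
    database="finance_raw", table_name="transactions",
    transformation_ctx="read_transactions",   # bookmark tracking key
)
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://example-curated/transactions/"},
    format="parquet",
    transformation_ctx="write_transactions",
)
job.commit()   # persists the bookmark so interrupted runs can recover safely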
Delivered consistent, audit-ready financial datasets by applying strict governance, quality frameworks, and versioned data pipeline deployments using fully automated infrastructure pipelines.
Environment: AWS (S3, Redshift, Glue, Athena, EMR, Lambda, Kinesis, Step Functions), Python, SQL, PySpark, Airflow, Terraform, Git, Great Expectations, CloudWatch, Datadog, ELK Stack



AWS Data Engineer
Ascena Retail Group January 2019 to May 2021
Responsibilities:
Developed large-scale retail ETL pipelines using AWS Glue, leveraging PySpark dynamic frames to ingest sales, inventory, promotional, and customer activity data from multiple ERP and POS systems, enabling consistent, structured transformation into curated datasets used for merchandising analytics.
Designed Redshift analytical warehouse schemas using optimized sort keys, distribution styles, and encoding strategies, supporting highly responsive reporting for retail operations, demand planning, and inventory optimization teams across thousands of stores.
Built automated ingestion workflows using Lambda triggers on S3 file arrival events, orchestrating schema validation, metadata extraction, and preliminary data sanitization before large-scale Glue transformation jobs executed.
Created real-time streaming ingestion pipelines using Kinesis Firehose to capture POS transactions, promotional events, and digital channel activity, enabling near real-time insights into store performance and customer behavior.
Integrated S3 multi-zone folder structures across landing, raw, refined, and consumption layers, applying automated partitioning strategies that aligned with merchandising calendar periods, seasons, and product category-level hierarchies.
Developed PySpark-based transformation logic to merge multi-channel retail data sources, aligning product hierarchies, SKU-level mappings, store clusters, and omni-channel identifiers to create unified consumer activity datasets.
Implemented data quality frameworks within Glue jobs using Python validators and SQL assertions to ensure data completeness, record consistency, accurate product attributes, and correct time-series alignment across daily retail feeds.
Developed retail-specific fact tables such as daily sales, returns, markdown activities, footfall metrics, and digital conversion funnels, supporting analytics teams across merchandising, planning, and e-commerce divisions.
Built CloudWatch dashboards tracking ETL job runtimes, data freshness, file arrival times, failure counts, and partition growth, enabling proactive identification of pipeline bottlenecks or system delays.
Integrated data from ERP systems such as merchandising management, store replenishment, and vendor purchase orders into unified warehouse structures, connecting operational data to sales and inventory signals.
Built S3 lifecycle policies archiving low-access historical retail data to Glacier and Deep Archive tiers, reducing storage cost while maintaining compliance and supporting long-term trend analysis.
Developed real-time fraud detection and anomaly monitoring scripts using PySpark and Lambda, flagging unexpected spikes in returns, duplicate transactions, or abnormal discount activities.
Built automated Redshift load processes using COPY commands with manifest files, compression handling, distribution tuning, and validation steps to maintain data integrity during nightly refreshes.
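A simplified sketch of the COPY-with-manifest load pattern, assuming a psycopg2 connection to the cluster; the cluster endpoint, IAM role, and table names are hypothetical.

import psycopg2

COPY_SQL = """
COPY retail.fact_daily_sales
FROM 's3://example-load/manifests/daily_sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
CSV GZIP MANIFEST;
"""

conn = psycopg2.connect(host="example-cluster.abc.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl_user", password="***")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)   # load only the files listed in the manifest
    cur.execute("SELECT COUNT(*) FROM retail.fact_daily_sales")
    print("rows loaded:", cur.fetchone()[0])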
Collaborated with merchandising teams to translate complex business rules such as assortment groupings, pricing behavior, and seasonality patterns into SQL logic embedded within transformation layers.
Engineered multi-environment CI/CD pipelines using GitLab and Terraform to deploy Glue jobs, Redshift schema changes, Kinesis configurations, and IAM policies through automated infrastructure provisioning.
Developed anomaly detection logic for inventory mismatches by comparing system-of-record stock counts against store-level transactions, identifying shrinkage and operational discrepancies.
Integrated CloudWatch Alerts and SNS notifications into ETL pipelines to ensure immediate visibility of delayed files, schema errors, or failed Redshift loads.
Designed secure parameter storage using AWS Secrets Manager for API tokens, database credentials, and vendor feed authentication keys, ensuring safe access within ETL jobs.
Authored detailed technical design documents, transformation rules, data dictionaries, and pipeline runbooks, enabling non-technical and technical teams to understand and maintain complex retail data processes.
Environment: AWS (S3, Redshift, Glue, Lambda, Kinesis Firehose, Step Functions), Python, PySpark, SQL, Terraform, CloudFormation, Git, Datadog, CloudWatch




Big Data Engineer
Maisa Solutions Private Limited, Hyderabad, India July 2018 to October 2019
Responsibilities:
Engineered end-to-end big data ingestion pipelines using Hadoop Distributed File System (HDFS), creating scalable folder hierarchies, block-size configurations, replication strategies, and secure access controls to store structured and semi-structured client datasets efficiently within a distributed environment.
Developed Spark Core and Spark SQL applications leveraging resilient distributed datasets (RDDs), DataFrames, and Catalyst optimizations to process terabyte-scale batch workloads, applying partition tuning, caching strategies, and shuffle reduction to enhance throughput and job efficiency.
Implemented Kafka-based log and event ingestion streams, configuring partitions, retention policies, consumer groups, schema registries, and delivery guarantees, enabling near real-time processing of large volumes of event data from diverse operational systems.
Designed Hive data models with ORC and Parquet formats, partition pruning, bucketing, and vectorization, enabling efficient analytical queries across large historical datasets consumed by downstream analytics and reporting teams.
Developed Oozie workflows orchestrating multi-step big data jobs, integrating Hive queries, Spark jobs, Sqoop imports, and shell actions, ensuring reliable execution sequencing, retry logic, dependency management, and time-based scheduling for daily data processing cycles.
Developed Python- and SQL-based validation scripts ensuring the accuracy, completeness, and consistency of processed data, validating schema adherence, record counts, null patterns, and business logic rules during complex ETL workflows.
Built Kafka-to-Spark streaming integrations applying event-time processing, watermarking, and checkpointing mechanisms, ensuring reliable processing of out-of-order, late-arriving, or duplicate messages while maintaining data consistency.
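A PySpark Structured Streaming sketch of the Kafka integration pattern described above, with an event-time watermark, in-window de-duplication, and checkpointing; the topic, brokers, schema, and paths are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("events-stream").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("event_type", StringType())
          .add("event_time", TimestampType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "ops-events")
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .withWatermark("event_time", "30 minutes")       # tolerate late arrivals
          .dropDuplicates(["event_id", "event_time"]))     # de-dupe within the watermark

query = (events.writeStream.format("parquet")
         .option("path", "hdfs:///data/curated/events/")
         .option("checkpointLocation", "hdfs:///checkpoints/events/")   # recovery state
         .outputMode("append")
         .start())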
Optimized MapReduce jobs by tuning mapper/reducer counts, adjusting memory buffers, applying combiners, and rewriting inefficient logic, improving the processing speed of legacy codebases still active within customer environments.
Implemented index and storage tuning in Hive and HBase to reduce I/O overhead, enable faster scan operations, and streamline access patterns for frequently queried datasets.
Configured cluster monitoring and alerting covering disk usage alerts, node failure alerts, service restarts, and job-level performance metrics to ensure stable big data infrastructure operations.
Built detailed documentation covering Spark/Hadoop architectural workflows, ingestion logic, transformation rules, failure recovery processes, and operational guidelines, enabling seamless handoff and team-level adoption.
Collaborated with analytics and data science teams to translate their modeling needs into optimized Hive tables, pre-aggregated metrics, and efficient data structures designed to accelerate BI workloads and model training pipelines.
Developed quality-assurance scripts that profiled distributions, detected anomalies, validated type mismatches, and compared partition-level results, ensuring data correctness before consumption by downstream applications.
Designed parameterized Spark jobs capable of processing multiple data domains using pluggable schemas, dynamic filtering, and metadata-driven transformation logic, increasing code reusability and decreasing pipeline maintenance overhead.
Mentored junior engineers on Spark best practices, Hadoop ecosystem tooling, distributed computing concepts, debugging strategies, and performance optimization techniques, strengthening the team's technical capability and delivery quality.
Environment: Hadoop (HDFS, Hive, HBase, MapReduce), Spark (Core, SQL, Streaming), Kafka, Airflow, Oozie, Python, Cloudera, Hortonworks



Hadoop Developer
Hudda Infotech Private Limited, Hyderabad, India August 2015 to June 2018
Responsibilities:
Managed Hadoop clusters built on Cloudera and Hortonworks distributions, overseeing HDFS storage balancing, NameNode health, YARN resource allocation, and data node performance tuning to ensure stable and highly available environments for large-scale batch processing workloads.
Developed MapReduce programs for batch transformations involving cleansing, filtering, aggregating, and enriching structured and semi-structured datasets, optimizing mapper and reducer logic to enhance throughput while reducing network shuffle overhead during high-volume processing.
Created Hive-based ETL workflows using optimized table designs, partition structures, and file formats such as ORC and Parquet, enabling downstream analytics teams to execute faster SQL queries and derive insights from large-scale enterprise datasets.
Developed Pig scripts for preprocessing and transformation tasks that required schema flexibility, enabling efficient manipulation of nested and semi-structured data sources before they were ingested into Hive or consumed by MapReduce jobs.
Implemented Sqoop ingestion pipelines integrating on-prem relational databases with HDFS, configuring parallel imports, incremental fetch logic, query-based extraction, and field-level data conversions to maintain synchronized datasets across systems.
Designed and managed Flume agents for ingesting high-volume log files, configuring reliable channel buffering, interceptor-based enrichment, and failover routing to ensure consistent streaming of application logs into HDFS for further analysis.
Optimized MapReduce workloads by adjusting JVM heap sizes, enabling compression codecs, rewriting inefficient file handling logic, and tuning speculative execution settings to improve the performance of long-running batch processes.
Monitored Hadoop infrastructure using Ambari and Nagios, setting up alert thresholds, resource utilization dashboards, service health checks, and automated restart procedures, ensuring continuous operation and rapid resolution of cluster-level issues.
Created detailed documentation covering Hadoop cluster setup, operational workflows, configuration parameters, troubleshooting procedures, and job execution patterns, enabling efficient onboarding and smoother collaboration across engineering teams.
Provided technical guidance to junior engineers on Hadoop ecosystem tools (including Hive, Pig, HBase, Sqoop, and MapReduce), supporting knowledge transfer, code reviews, and best practices for designing efficient distributed data pipelines.
Environment: Hadoop (HDFS, YARN, Hive, HBase, MapReduce, Pig, Sqoop, Flume), Ambari, Nagios, Python, SQL