| SUMAN BAD - SRE Lead - 14+ years |
| [email protected] |
| Location: Dallas, Texas, USA |
| Relocation: Yes |
| Visa: H1B |
|
SUMAN
[email protected]

PROFESSIONAL SUMMARY:
14+ years of experience in Information Technology, currently working as a Site Reliability Engineer.
Diverse background as Web Administrator, PeopleSoft Administrator, Database Administrator, Application Production Support Analyst, SRE, and DevOps Engineer.
Skilled in developing automated self-healing systems to enhance system resilience and reduce manual intervention.
Strong expertise in automation, performance tuning, server installation, and configuration.
Proficient in PeopleSoft migration activities including tax updates, patches, maintenance packs, service packs, and tools upgrades across multiple application versions.
Hands-on experience in database administration (backups, restores, failover, and performance tuning).
Created comprehensive process documents and runbooks to standardize operations.
Extensive experience in 24/7 production support within an onsite-offshore support model.
Experience as a Release Engineer: creating change requests and change plans, managing the approval process, executing changes, and validating releases.
Coordinated and executed production weekend activities by collaborating with cross-functional teams.
Successfully led production releases, ensuring smooth execution through ownership and accountability.
Specialized in STAT migration tool installation and setup across various applications.
Strong knowledge of Incident, Change, and Capacity Management, along with monitoring tool expertise.
Proficient in ServiceNow, ITSM processes, Agile methodologies, Jira, and Confluence.
Experience in Disaster Recovery (DR) planning, execution, and automated deployments.
Hands-on with monitoring tools including AppDynamics, Splunk, Dynatrace, Apica, Geneos, Humio, LogScale, Grafana, BigPanda, Hawkeye, New Relic, and Kibana.
Automated Disaster Recovery processes using Ansible playbooks.
Exposure to application development and support for Java, .NET, and Kafka-based applications.
Solid understanding of Cloud technologies, particularly AWS.
AWS expertise in IAM, VPC, Route53, EC2, S3, CodeBuild, CodeDeploy, Lambda, Redshift, RDS, CloudWatch, and CloudFormation.
Experience with Kubernetes and Terraform on AWS for container orchestration and infrastructure automation.
Worked on application deployments using Jenkins.
Strong knowledge of SRE principles with a focus on automation, application stability, reliability, and monitoring of SLI, SLO, and error budgets.

TECHNICAL SKILLS:
Operating Systems : Linux, Unix, Windows
Database : Oracle, MS SQL, Cassandra
ERP : PeopleSoft
Tools : Application Designer, Change Assistant, PeopleSoft Performance Monitor, Data Mover, STAT, BO, AppDynamics, Apica, Geneos, Splunk, Dynatrace, PerfMon, Grafana, Kibana, New Relic, Hawkeye
Skills : SRE, DevOps, Cloud Technologies, AWS, Java, Python, Kafka

EDUCATION:
Bachelor of Technology in Electrical and Electronics Engineering, Jawaharlal Nehru Technological University, Hyderabad, Telangana, India

PROFESSIONAL EXPERIENCE:

Client : HCLTech/Verizon
Dates of Employment : Jan 2025 to date
Job Title : SRE Lead
Accomplishments:
Proven experience as a DevOps Site Reliability Engineer (SRE), driving application stability and operational excellence.
Optimized application performance, reliability, and scalability while minimizing toil through automation.
Skilled in managing and supporting cloud-native and digital applications.
Proficient in leveraging observability tools to improve visibility, troubleshoot issues, and enhance system reliability.
Hands-on experience with deployments using Jenkins and AWS EKS, ensuring smooth and automated releases.
Worked with AWS services including S3, EC2, RDS, EBS, and Redis for scalable infrastructure management.
Adept at analyzing logs, traces, and performance bottlenecks, and conducting root cause analysis (RCA) using advanced monitoring tools.
Proficient in New Relic, Kibana, Dynatrace, Grafana, Splunk, and Hawkeye for monitoring, alerting, and performance optimization.
Automated recurring operational tasks (e.g., disk cleanup, JVM restarts) using the Action Engine tool.
Expertise in Incident Management and Triage, reducing Mean Time to Resolution (MTTR) through proactive monitoring.
Improved alerting systems by eliminating noise and enhancing the signal-to-noise ratio.
Authored and maintained detailed Root Cause Analysis (RCA) documentation for critical incidents.
Streamlined monitoring processes by onboarding alerts into Hawkeye and creating robust metrics, queries, and data sources.
Collaborated with application development teams to permanently resolve recurring issues and improve system resilience.
Conducted SRE dashboard reviews, identifying gaps and implementing actionable improvements.
Experienced in ServiceNow ITSM for incident, change, and problem management workflows.
Partnered with the Hawkeye team to enhance dashboards, reducing triage time and accelerating RCA identification.
Prepared and delivered client-facing presentations highlighting SRE achievements, metrics, and updates.
Consolidated and reviewed daily team updates to provide clear and concise reporting to stakeholders.

Client : IBM/PNC BANK
Dates of Employment : Sep 2021 to Dec 2024
Job Title : Senior SRE
Accomplishments:
Proven expertise in supporting critical applications in a Site Reliability Engineering (SRE) production support role.
Participated in daily scrum meetings.
Strong experience in Disaster Recovery (DR) planning, testing, and execution to ensure business continuity.
Designed and implemented self-healing automation using the TrueSight tool, integrating workflows with Dynatrace, LogScale, and ServiceNow.
Worked with observability practices, including configuring alerts, dashboards, and APM tools for proactive monitoring and issue detection.
Extensive experience with the Kafka ecosystem: brokers, streaming applications, producers, consumers, source and sink connectors, topics, partitions, and consumer lag analysis.
Collaborated with the Kafka vendor and development teams to support Kafka releases and implement self-healing setups for Kafka connectors.
Expertise in Kubernetes and OpenShift Container Platform for container orchestration and application reliability.
Onboarded application metrics into Dynatrace and configured monitoring for Kafka systems, infrastructure, and application performance metrics.
Worked with Jira, ServiceNow, Humio, LogScale, Dynatrace, BigPanda, Jenkins, Git, and Bitbucket for monitoring, CI/CD, and ITSM processes.
Automated operational tasks by building TrueSight workflows for OpenShift pod restarts based on LogScale alerts, significantly reducing toil.
Developed TrueSight workflows to automatically update ServiceNow incidents based on incident descriptions.
Authored detailed Root Cause Analysis (RCA) documentation for incidents and production issues.
Expertise in Incident Management, Change Management, and Capacity Management, ensuring efficient and reliable operations.
Delivered consistent on-call production support and managed application release activities in collaboration with cross-functional teams.
Experience in Java application support, including performance monitoring, debugging, and issue resolution with development teams.
Focused on reducing toil, improving application stability, and enhancing overall system performance through automation and proactive monitoring.

Employer : JPMorgan Chase & Co.
Location : Houston, TX 77002
Dates of Employment : Nov 2015 to May 2021
Dates of Employment : Nov 2011 to March 2015
Job Title : Application Support Specialist/SRE
Accomplishments:
Expertise in supporting diverse applications, including Java, .NET, and cloud-based platforms.
Extensive experience in executing Disaster Recovery (DR) tests, SR activities, and production releases.
Took ownership of critical production activities, ensuring seamless coordination with cross-functional teams.
Experience creating ServiceNow change requests and implementation plans, obtaining change approvals, and participating in CAB discussions.
Automated Disaster Recovery processes using Automation as a Service, reducing manual effort and improving efficiency.
Proficient with monitoring tools such as AppDynamics and Splunk for performance analysis and incident troubleshooting.
Skilled in using job scheduling tools like Autosys and Control-M to manage application workflows.
Resolved incidents based on priority, consistently meeting SLA commitments.
Experienced in Agile methodologies, including Scrum ceremonies and daily standups.
Provided application support during off-hours, proactively ensuring service continuity.
Collaborated with development teams and vendors to troubleshoot and resolve complex application issues.
Participated in weekend on-call rotations, providing reliable production support.
Strong expertise in Incident Management, Change Management, and Capacity Management.
Served as Lead SRE, conducting knowledge-sharing sessions to upskill the team.
Automated recurring application service activities using Ansible, improving reliability and reducing toil.
Applied SRE principles with a strong focus on automation, reducing toil, application stability, and reliability.
Extensive experience in supporting multiple applications as part of Web Administration and PeopleSoft Administration.
Skilled in installation and configuration of application servers, web servers, process schedulers, and other components within the PeopleSoft environment.
Proficient in using Application Designer, Data Mover, Change Assistant, and related PeopleSoft tools.
Applied Oracle-released tax patches using Change Assistant across multiple environments.
Hands-on experience in PeopleSoft HCM implementation, upgrades, and maintenance packs.
Strong expertise in system design, development, and performance tuning by configuring settings at the application server and web server layers.
Adept at load balancing of domains and ensuring high availability of systems.
Successfully performed Disaster Recovery (DR) testing, including switching PeopleSoft systems between data centers, validating system integrity, and automating DR activities.
Experienced in handling production releases, coordinating with development teams, and ensuring smooth delivery.
Designed and developed scripts to automate server restarts, manage log rotation, and optimize disk space utilization.
Implemented and managed application migrations using the STAT migration tool, including setting up databases, servers, file objects, workflows, migration paths, post-migration steps, and test connectivity.
Automated code migrations within STAT and trained development teams on effective tool usage.
Monitored and resolved Peregrine tickets, ensuring timely closure and adherence to SLAs.
Provided 24/7 production support, troubleshooting issues in an onsite-offshore model, and collaborating with vendors such as Oracle, Vista (OpenText), and Quest (STAT).
Expertise in application performance tuning of PeopleSoft and Java-based web applications, leveraging monitoring tools like AppDynamics, Apica, Geneos, and Splunk.
Worked on database administration activities including backups, restores, failover, and query optimization.
Strong experience in application deployments using WebLogic, WebSphere, Tomcat, and Apache.
Supported and optimized Java applications, performing root cause analysis, troubleshooting, and fine-tuning performance.
Involved in SSL certificate installation and configuration for secure communication.
Successfully upgraded PeopleSoft to HCM 9.2 with PeopleTools 8.53.08, resolving issues, coordinating with development, business, and database teams, and providing post-upgrade production support.
Actively participated in performance testing, load testing, and disaster recovery readiness validation.
Experienced with Business Objects migrations, Vista reporting tools, and related administrative tasks.
Led and contributed to multiple production releases, improving system stability and reliability.
Upgraded STAT from 5.6.2 to 5.6.4 and configured new applications, resolving migration issues and training teams.

Strengths:
Leadership, motivation, and teamwork.
Time management and analytical skills.
Comprehensive problem-solving abilities.
Excellent communication and interpersonal skills.

Keywords: continuous integration continuous deployment sthree microsoft mississippi Colorado Texas |