
Mariana Salinas - Site Reliability Engineer/DevOps/Cloud Engineer
[email protected]
Location: New York City, New York, USA
Relocation: Yes
Visa: TN visa
Mariana Rosas
Senior Site Reliability Engineer | AWS Cloud Specialist | DevOps Engineering Expert
[email protected] | +1 (551) 307-9685 | New York City, NY


Professional Summary
Distinguished Senior Site Reliability Engineer with around 11 years of progressive experience delivering enterprise-scale cloud infrastructure solutions across AWS ecosystems, consistently implementing advanced monitoring, incident response, and system reliability practices that achieved 99.9%+ uptime and reduced MTTR by 75% across global production environments.
AWS Cloud specialist with comprehensive expertise in Infrastructure as Code, containerization, and microservices deployment using Terraform, Kubernetes, and Docker, successfully managing petabyte-scale workloads and orchestrating seamless migrations from on-premises to cloud-native architectures.
DevOps Engineering expert with advanced proficiency in CI/CD pipeline optimization, automated deployment strategies, and configuration management using Jenkins, GitLab CI/CD, and Ansible, reducing deployment time by 85% and eliminating production incidents through intelligent automation.
Experience designing and supporting microservices-based architectures using Kubernetes and Docker across AWS environments.
Hands-on expertise with CI/CD tools including Jenkins, FlexDeploy, and Artifactory for build, release, and artifact management.
Strong experience working with enterprise middleware platforms like Oracle, IBM, and Tomcat.
Proficient in JVM troubleshooting, including GC tuning, thread dump analysis, and memory optimization for Java applications.
Experience defining and tracking KPIs, SLAs, SLOs, and SLIs to ensure service reliability and business alignment.
Kubernetes and container orchestration specialist managing large-scale clusters with advanced networking, auto-scaling, and resource optimization across AWS EKS, ECS Fargate, and on-premises environments, supporting hundreds of microservices and thousands of pods with optimal performance.
Monitoring and observability expert implementing comprehensive telemetry solutions using DataDog, New Relic, Splunk, Prometheus, and Grafana, creating intelligent alerting, predictive analytics, and real-time dashboards that prevented 95% of potential system failures.
Incident Management leader with proven expertise in crisis response, post-mortem analysis, and continuous improvement processes, successfully managing high-severity incidents across global teams and implementing automated remediation strategies that improved system resilience.
Security and compliance specialist implementing AWS Security Hub, Cloud Custodian, vulnerability management, and infrastructure hardening practices ensuring enterprise-grade security posture and regulatory compliance across multi-account AWS environments.
Infrastructure as Code pioneer with extensive experience in Terraform, CloudFormation, Atlantis, and Pulumi, managing complex cloud architectures with version control, automated testing, and deployment orchestration across development, staging, and production environments.
Machine Learning operations specialist integrating MLOps practices with cloud infrastructure, implementing custom agents for on-premises ML workloads and hybrid architectures combining AWS services with traditional computing environments for advanced analytics.
Migration and modernization expert leading large-scale transformations from legacy systems to cloud-native architectures, orchestrating application migrations, data transformations, and infrastructure optimization with minimal downtime and enhanced performance.
Performance optimization specialist implementing advanced caching, load balancing, auto-scaling, and resource management strategies using AWS services and third-party tools, achieving 50%+ cost reduction while improving system performance and user experience.
Cross-functional collaboration leader working with development teams, product managers, and business stakeholders to align technical solutions with business objectives, ensuring scalable architecture and operational excellence across enterprise organizations.
Automation and scripting expert with advanced proficiency in Python, Bash, PowerShell, and Go, developing custom tools, monitoring solutions, and operational scripts that streamlined routine tasks and improved team productivity by 60%.
Disaster recovery and business continuity specialist designing comprehensive resilience strategies, automated failover mechanisms, and backup orchestration, ensuring rapid recovery and minimal business impact during critical incidents.
Innovation and emerging technology advocate with continuous learning in serverless computing, edge computing, AI/ML integration, and advanced cloud services, actively implementing next-generation solutions for competitive advantage and operational excellence.

Professional Experience

Thomson Reuters | New York City, NY
May 2025 - Present | Senior Site Reliability Engineer

Deployed enterprise-scale Kubernetes clusters across multi-region AWS environments using Terraform 1.5, Amazon EKS, and advanced networking configurations, supporting 500+ microservices and 10,000+ pods with 99.99% availability and sub-second response times for global financial data platforms.
Implemented comprehensive Infrastructure as Code practices using Terraform modules, GitOps workflows, and automated testing frameworks, managing complex cloud architectures with version control, peer review processes, and deployment orchestration that reduced infrastructure provisioning time by 80% and eliminated configuration drift.
Established advanced monitoring and observability solutions using DataDog, Prometheus, Grafana, and custom dashboards, implementing intelligent alerting, anomaly detection, and predictive analytics that provided real-time visibility into system performance and prevented 98% of potential service disruptions.
Designed incident response protocols and escalation procedures implementing automated runbooks, disaster recovery workflows, and post-mortem analysis processes, achieving Mean Time to Resolution (MTTR) of under 10 minutes and comprehensive documentation for continuous improvement initiatives.
Developed security automation frameworks using Cloud Custodian, AWS Config, and custom policies ensuring compliance with SOC 2, ISO 27001, and financial industry regulations while automating remediation of security findings and maintaining audit readiness.
Orchestrated complex deployments and rollback strategies using Blue-Green, Canary, and Rolling deployment patterns with automated health checks and intelligent traffic routing, ensuring zero-downtime releases and seamless user experiences across production environments.
Implemented cost optimization strategies using AWS Cost Explorer, resource tagging, rightsizing, and Reserved Instances, achieving a 45% reduction in infrastructure costs while maintaining performance standards and scaling capabilities for dynamic workloads.
Managed large-scale microservices architecture deployed on Kubernetes (EKS) and Docker.
Built and maintained CI/CD pipelines using Jenkins and Artifactory for artifact versioning and deployments.
Defined and monitored KPIs and SLAs for critical production services.
Performed JVM troubleshooting (memory leaks, GC tuning, thread dumps) for Java-based services.
Led end-to-end Incident Response workflows across production systems, ensuring rapid triage, service restoration, and adherence to SRE-defined escalation and communication processes.
Implemented SRE best practices by defining and managing SLAs, SLOs, and SLIs, enabling data-driven reliability improvements and maintaining service health across mission-critical systems.
Built custom automation tools using Python, Go, and AWS SDKs for infrastructure management, compliance checking, and operational tasks, creating reusable libraries and CLI tools that improved team productivity and standardized operations across multiple product teams.
Utilized Core Java Foundation skills to build diagnostic tools, API integrations, and automation utilities supporting SRE troubleshooting and operational workflows.
Established multi-cluster management strategies using GitOps, Flux, and Argo CD enabling centralized configuration management, automated synchronization, and policy enforcement across development, staging, and production Kubernetes environments.
Designed disaster recovery and backup solutions implementing cross-region replication, automated failover, and recovery testing ensuring Recovery Time Objective (RTO) of 15 minutes and Recovery Point Objective (RPO) of 5 minutes for mission-critical financial services.
Developed automated incident-detection mechanisms to identify anomalies early and trigger real-time alerts, significantly reducing MTTR and enhancing operational reliability.
Led performance optimization initiatives using advanced profiling, load testing, and resource analysis techniques, implementing horizontal pod autoscaling, cluster autoscaling, and intelligent resource allocation that improved application performance by 60%.
Collaborated with product development teams to implement SRE best practices including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, establishing reliability engineering culture and data-driven decision making across engineering organizations.
Applied DE-Web Foundation principles by optimizing backend service interactions, improving HTTP communication reliability, and enhancing API latency for distributed systems.
Designed high-availability and resiliency architectures leveraging automated failovers, redundancy patterns, and distributed systems principles to meet stringent SRE uptime objectives.
Implemented advanced networking solutions using Istio Service Mesh, Network Policies, and Traffic Management ensuring secure communication, observability, and intelligent routing between microservices in complex distributed systems.
Developed machine learning integration pipelines using AWS SageMaker, Step Functions, and EventBridge, enabling automated model deployment, A/B testing, and performance monitoring for AI-powered financial analytics applications.
Provided technical leadership and mentorship to junior engineers and cross-functional teams, establishing engineering standards, best practices, and knowledge sharing processes that improved team capabilities and operational excellence across Thomson Reuters infrastructure.

Key Achievements: Achieved 99.99% availability, reduced MTTR to 10 minutes, optimized costs by 45%, managed 500+ microservices

Tech Stack: AWS EKS, Terraform, DataDog, Kubernetes, Python, Prometheus, Cloud Custodian

Kunai | Mexico City
September 2024 - May 2025 | Site Reliability Engineer

Managed QA environment stability treating non-production systems as production-grade infrastructure implementing comprehensive monitoring, alerting, and incident response processes using New Relic, Splunk, and Observe platforms, achieving 99.5% uptime and consistent performance for continuous testing workflows.
Orchestrated incident management processes including on-call rotations, escalation procedures, and team mobilization strategies, implementing automated alerting, communication workflows, and resolution tracking that reduced Mean Time to Resolve incidents by 65% and improved service reliability.
Utilized New Relic for JVM-level monitoring and troubleshooting, including GC and memory analysis.
Defined and tracked KPIs and SLAs for QA environments to ensure production-grade reliability.
Implemented advanced monitoring solutions using New Relic APM, Infrastructure monitoring, and synthetic testing creating comprehensive dashboards, performance analytics, and proactive alerting that identified performance bottlenecks and prevented service degradation before impacting users.
Developed Splunk-based log analysis and security monitoring implementing custom queries, automated reports, and real-time alerting for security events, application errors, and infrastructure anomalies, providing centralized visibility and intelligent insights across distributed systems.
Established Observe platform integration for modern observability, implementing distributed tracing, metrics correlation, and log aggregation with machine-learning-based anomaly detection to provide comprehensive system visibility and predictive analytics for proactive maintenance.
Designed automated runbooks and incident response procedures using scripting, workflow automation, and communication tools ensuring consistent response to critical incidents and minimizing human error during high-pressure situations.
Implemented capacity planning and performance testing strategies using load testing tools and resource monitoring ensuring QA environments could handle production-like workloads and stress testing scenarios with optimal resource allocation.
Conducted comprehensive Autopsy/Postmortem reviews for high-severity outages, delivering actionable RCA reports and driving long-term remediation strategies.
Built custom dashboards and reporting solutions using observability platforms and business intelligence tools providing executive visibility, trend analysis, and performance metrics that supported data-driven decisions and continuous improvement initiatives.
Performed JVM performance tuning, thread analysis, and garbage collection optimization to improve Java-based service stability and reliability under SRE performance requirements.
Conducted deep-dive performance analysis to identify system bottlenecks and implemented optimizations that enhanced system throughput, latency, and overall SRE performance targets.
Collaborated with development teams to implement shift-left practices including early monitoring, performance profiling, and reliability testing in CI/CD pipelines ensuring quality assurance and reliability are built into applications from early development stages.
Established root cause analysis processes using systematic investigation, timeline reconstruction, and post-mortem documentation implementing corrective actions and preventive measures that reduced recurring incidents by 80%.
Developed automation scripts using Python and Shell scripting for routine maintenance, health checks, and system optimization tasks, creating scheduled jobs and monitoring tools that improved operational efficiency and reduced manual workload.
Implemented alerting optimization strategies reducing alert fatigue through intelligent filtering, correlation rules, and priority-based escalation ensuring critical alerts receive immediate attention while minimizing noise from non-critical events.
Built incident response runbooks and playbooks aligned with SRE operational standards, enabling faster diagnosis and consistent response to recurring failure scenarios.
Designed service level management frameworks including SLA tracking, performance benchmarks, and availability reporting providing business stakeholders with transparent metrics and reliability commitments for QA services.
Led knowledge sharing initiatives including documentation, training sessions, and best practices workshops improving team capabilities and ensuring consistent approaches to reliability engineering across development and operations teams.
Developed automated provisioning and deployment pipelines aligned with SRE operational standards, ensuring deterministic, repeatable, and zero-touch environments across production workloads.
Established continuous improvement processes using feedback loops, metric analysis, and process optimization implementing iterative enhancements to monitoring, incident response, and system reliability that drove operational excellence.

Key Achievements: Achieved 99.5% QA environment uptime, reduced MTTR by 65%, decreased recurring incidents by 80%

Tech Stack: New Relic, Splunk, Observe, Python, Shell Scripting, Incident Management, Performance Monitoring

Globant | Mexico City
Sept 2021 - Sept 2024 | DevOps Engineer

Deployed scalable microservices infrastructure on Amazon ECS using advanced container orchestration, service discovery, and load balancing implementing Fargate launch type for serverless container deployment that eliminated infrastructure management overhead and improved cost efficiency by 40%.
Migrated legacy container deployments from EC2 launch type to ECS Fargate implementing rightsizing, resource optimization, and auto-scaling strategies that reduced operational complexity, improved resource utilization, and enhanced application reliability across production environments.
Maintained and optimized Jenkins CI/CD pipelines implementing pipeline as code, parallel execution, and artifact management integration that reduced build times by 60% and improved deployment frequency from weekly to multiple daily releases.
Remediated critical security vulnerabilities in base container images implementing security scanning, vulnerability assessment, and automated patching workflows using AWS Inspector, Snyk, and custom security tools ensuring compliance with enterprise security standards and industry regulations.
Implemented comprehensive monitoring solutions for Helm-deployed applications using Prometheus, Grafana, and custom metrics collection, creating intelligent alerting, performance dashboards, and capacity planning tools that provided real-time visibility and proactive maintenance capabilities.
Authored detailed system autopsies/postmortems including root-cause identification, impact analysis, and long-term corrective actions, feeding continuous improvement into SRE governance practices.
Developed Infrastructure as Code solutions using Terraform and CloudFormation managing complex AWS architectures including networking, security groups, IAM policies, and resource tagging ensuring consistent deployments and configuration management across multiple environments.
Established container security best practices implementing image scanning, runtime security, and network policies using HashiCorp Vault and AWS Secrets Manager ensuring secure container operations and compliance with security frameworks.
Improved web service resilience by implementing retries, backoff strategies, timeouts, and circuit breaker patterns aligned with SRE best practices.
Collaborated with platform and development teams to eliminate systemic issues discovered during autopsies/RCA, integrating fixes into CI/CD and infrastructure pipelines.
Built automated deployment strategies using Blue-Green and Rolling deployment patterns with health checks, rollback capabilities, and traffic routing ensuring zero-downtime deployments and seamless user experiences during application updates.
Implemented auto-scaling and self-healing mechanisms to maintain system reliability under variable load, supporting SRE scalability and operational excellence goals.
Implemented log aggregation and centralized logging solutions using ELK Stack and AWS CloudWatch providing comprehensive log analysis, troubleshooting capabilities, and audit trails for compliance and operational requirements.
Designed disaster recovery procedures and backup strategies for containerized applications implementing cross-region replication, automated failover, and data protection ensuring business continuity and rapid recovery from service disruptions.
Optimized cost management strategies using AWS Cost Explorer, Reserved Instances, and Spot Instances achieving 35% reduction in infrastructure costs while maintaining performance and availability standards.
Integrated Java-based microservices with monitoring frameworks to expose enhanced SLIs such as request latency, error rate, and system throughput.
Collaborated with development teams to implement DevOps best practices including code quality gates, automated testing, and security scanning in CI/CD pipelines ensuring reliable software delivery.
Established incident metrics and reliability KPIs including alerting thresholds, event timelines, and contributing to SRE tracking of reliability and operational health.
Led major incident response activities, coordinating cross-functional teams, restoring services rapidly, and maintaining communication with stakeholders under strict SRE protocols.
Established configuration management processes using Ansible, Terraform modules, and GitOps workflows ensuring consistent infrastructure configurations, version control, and automated deployment across development lifecycles.
Implemented performance optimization techniques including container rightsizing, resource limits, and caching strategies improving application response times by 45% and reducing resource consumption.
Provided technical documentation and knowledge transfer to development and operations teams establishing best practices, troubleshooting guides, and operational procedures that improved team productivity and system reliability.

Key Achievements: Reduced costs by 40%, improved build times by 60%, achieved zero-downtime deployments

Tech Stack: Amazon ECS, Fargate, Jenkins, Terraform, Helm, Prometheus, AWS Security Tools

Accenture | Mexico City
Feb 2018 - Sept 2021 | DevOps Engineer

Provisioned enterprise-scale infrastructure on AWS cloud platform using Terraform and Infrastructure as Code best practices, implementing modular architecture, reusable components, and automated testing that standardized infrastructure deployment and reduced provisioning time by 70%.
Supported development teams with comprehensive CI/CD pipeline provisioning implementing GitLab CI/CD, automated testing, and code quality gates enabling continuous integration and delivery workflows that improved development velocity by 50%.
Designed cloud architecture solutions using AWS services including EC2, S3, and RDS implementing security best practices, cost optimization, and high availability configurations supporting scalable applications and fault-tolerant systems.
Implemented automated infrastructure testing using Terratest, InSpec, and custom validation scripts ensuring infrastructure quality, compliance verification, and deployment reliability across development, staging, and production environments.
Established version control and collaboration workflows for infrastructure code using Git, branching strategies, and pull request processes ensuring code quality, peer review, and change management for infrastructure modifications.
Built monitoring and alerting solutions using AWS CloudWatch, SNS, and custom metrics providing real-time visibility into infrastructure performance, cost optimization, and resource utilization across cloud environments.
Collaborated with development teams to improve Java application reliability by introducing error budgets, dependency analysis, and performance instrumentation aligned with SRE guidelines.
Developed backup and disaster recovery strategies implementing automated backups, cross-region replication, and recovery procedures ensuring data protection and business continuity for critical applications and databases.
Implemented security hardening measures using AWS Security Groups, NACLs, and IAM policies ensuring secure infrastructure deployment and compliance with enterprise security requirements.
Created documentation and runbooks for infrastructure operations including deployment procedures, troubleshooting guides, and maintenance tasks enabling knowledge sharing and operational consistency across cloud engineering teams.
Developed debugging tools in Java to analyze log flows, thread dumps, and memory profiles accelerating root-cause identification during production incidents.
Collaborated with cross-functional teams including developers, architects, and security teams to align technical solutions with business requirements ensuring scalable infrastructure and operational excellence.
Implemented cost optimization strategies using resource tagging, usage monitoring, and rightsizing recommendations achieving 30% reduction in cloud spending while maintaining performance standards.
Established change management processes for infrastructure updates including testing procedures, rollback strategies, and communication protocols ensuring controlled deployments and minimal business impact.
Built infrastructure templates and reusable modules using Terraform and CloudFormation creating standardized components that accelerated project delivery and ensured consistency across multiple applications.
Provided technical support and troubleshooting assistance to development teams for infrastructure-related issues, performance problems, and deployment challenges ensuring smooth operations and quick resolution.
Participated in architecture reviews and design decisions contributing cloud expertise and best practices to ensure scalable, reliable, and cost-effective infrastructure solutions.
Supported applications running on Tomcat and Oracle-based environments.
Worked in enterprise environments involving IBM systems and infrastructure.
Performed JVM troubleshooting and performance tuning for Java applications.

Key Achievements: Reduced provisioning time by 70%, improved development velocity by 50%, achieved 30% cost reduction

Tech Stack: Terraform, AWS (EC2, S3, RDS, Lambda, VPC), GitLab CI/CD, CloudWatch, Infrastructure as Code

Grupo Infodata | Mexico City
Jan 2017 - Jan 2018 | System Admin

Administered servers for IBM internal and commercial accounts running Windows, AIX, and Linux across test, dev, and prod environments.
Provided technical support to end users for system and application-related issues.
Automated tasks with Ansible.
Set up servers to work with Ansible Tower.
Strong working knowledge of Ansible, AWS, and DevOps.
Troubleshot database issues.
Performed security health checks for databases.
Implemented testing scripts using Bash and Python.
Performed regular system updates, patches, and backups to safeguard data and maintain reliability.
Managed user accounts, permissions, and access rights to ensure security and compliance.
Automated routine administrative tasks using scripts and configuration management tools.

Key Achievements: Automated repetitive administrative tasks using PowerShell/Python scripts, reducing manual effort by 50%.

Tech Stack: PowerShell, Bash, Python, Ansible, Puppet, Chef, AWS, Script, AIX

Pepsi (Client) | Mexico City / Orizaba
Jun 2015 - Dec 2016 | Junior Infrastructure Engineer

Supported Linux and Windows infrastructure environments for internal applications.
Assisted in automation of operational tasks using Bash and Python scripts.
Helped implement configuration management using Ansible and Puppet.
Monitored servers and application health using Nagios and ELK stack.
Participated in early AWS cloud migration initiatives and infrastructure automation.

Key Achievements: Reduced manual infrastructure operations by 40% through automation using Bash and Python scripts while supporting early AWS-based deployments and monitoring improvements.

Tech Stack: Linux, Bash, Python, Ansible, Puppet, AWS (EC2, S3), Git, Jenkins, Nagios, ELK Stack, Shell Scripting, System Administration

Technical Skills Matrix

Cloud Platforms: AWS (EC2, EKS, ECS, Lambda, S3, RDS, CloudFormation, CloudWatch, IAM, VPC, ALB, ASG, API Gateway, Elasticsearch, CodeBuild, CodePipeline); Oracle, IBM, Tomcat
Container & Orchestration: Kubernetes 1.28, Docker 24.0, Amazon EKS, ECS Fargate, Helm Charts, Istio Service Mesh, Linkerd, Kustomize
Infrastructure as Code: Terraform 1.5, AWS CloudFormation, Pulumi, Atlantis, CDK (Cloud Development Kit), Crossplane, Terragrunt
CI/CD & DevOps: Jenkins, GitLab CI/CD, GitHub Actions, AWS CodePipeline, Azure DevOps, Argo CD, Tekton, Spinnaker, FlexDeploy, Artifactory
Monitoring & Observability: DataDog, New Relic, Splunk, Prometheus, Grafana, AWS CloudWatch, Elasticsearch, Jaeger, OpenTelemetry
Programming & Scripting: Python 3.11, Go 1.21, Bash/Shell, PowerShell, JavaScript/Node.js, YAML/JSON, SQL, HCL, JVM Troubleshooting, KPIs, SLAs, SLOs/SLIs
Security & Compliance: AWS Security Hub, Cloud Custodian, HashiCorp Vault, AWS IAM, GuardDuty, Config, Inspector, SIEM Tools, Microservices Architecture


Education

Master's Degree - Industrial Engineering | Technological Institute of Orizaba | August 2014 - November 2016
Bachelor's Degree - Electronics Engineering | Technological Institute of Orizaba | August 2009 - March 2014

Professional Certifications

AWS Certified Solutions Architect - Professional
AWS Certified Cloud Practitioner

Key Achievements & Recognition

Reliability Excellence: Achieved 99.99% uptime across enterprise-scale Kubernetes clusters serving global financial services
Incident Response: Reduced MTTR from 45 minutes to 10 minutes through automated remediation and intelligent alerting
Cost Optimization: Delivered $2M+ in annual savings through infrastructure optimization and resource management
Automation Leadership: Eliminated 85% of manual deployment tasks through comprehensive CI/CD automation
Security Compliance: Maintained zero security incidents across multi-account AWS environments with automated compliance
Team Leadership: Mentored 15+ engineers and established SRE best practices across multiple organizations
Keywords: continuous integration continuous deployment quality analyst artificial intelligence machine learning javascript sthree golang green card Delaware New York
