Session Description
Imagine an on-call shift where you don’t start your day sifting through routine capacity alerts, nudging stuck rollouts, or closing noisy, low-impact tickets. Instead, you get to tackle things that matter. This is the goal we’re chasing at Google.
We are developing a system where software agents can autonomously handle a significant chunk of operational toil. The key is to do this generically and horizontally, so that the solutions are broadly applicable and cross the line between developers and operations.
In this session, I’ll share our journey and lessons learned. We’ll cover the significant challenges, including evaluation, ensuring safe and secure operations, and how to codify complex, sometimes opinionated, remediation steps. I’ll outline the infrastructure we’ve put in place in response to those challenges and requirements.
This talk aims to provide a practical perspective on leveraging automation and agents in a production environment. You’ll leave with critical questions to consider for your own agent that interacts with production.
Speaker
Google, Production & AI manager
Today: Managing a team of Site Reliability Engineers
Before:
10 years of being an IC SRE @ Google
PhD in Information Retrieval
Master in AI
Speaker
Senior Staff Site Reliability Engineer at Google
Ramón is a Senior Staff Site Reliability Engineer at Google, where he works on the Identity team. He started back in 2011 as an intern and has since become team Technical Lead (TL) and Engineering Manager, and recently moved into a üTL role for the Privacy, Safety and Security teams. Their role is to store, manage and safeguard user accounts, from account creation to credential management, including account security such as hijacking and phishing protection. The team employs hundreds of microservices across the stack that offer a variety of protocols and APIs to customers. They run on thousands of machines in tens of data centres across the globe and must be as reliable as possible, because not only other Google products depend on them, but also people and enterprises worldwide that use Google, Workspace and the Google Cloud Platform.
Prior to Google, Ramón worked at CERN as part of the Physics Department and the ATLAS Collaboration, where he developed the ROOT framework for data analysis and then the functional testing framework to validate and ensure the reliability of the distributed computing facilities that allowed for the Higgs boson discovery in 2012.
He holds a Computer Engineering MSc and Ph.D. For the last decade he has been researching, part time, autonomic computing and the management of computer fleets in data centres and enterprises to optimise and reduce their power usage.







