Ramón Medrano Llamas
03 - 04 May 2023 | Alte Kaserne Winterthur
Senior Staff Site Reliability Engineer at Google
Ramón is a Staff Site Reliability Engineer at Google, where he works on the Identity team. He started in 2011 as an intern and has since become team Technical Lead (TL) and Engineering Manager, and recently moved into a üTL role for the Privacy, Safety and Security teams. Their role is to store, manage and safeguard user accounts, from account creation through credential management to account security such as hijacking and phishing protection. The team employs hundreds of microservices across the stack that offer a variety of protocols and APIs to customers. These services run on thousands of machines in tens of data centres across the globe and must be as reliable as possible, because not only other Google products depend on them, but also people and enterprises worldwide that use Google, Workspace and the Google Cloud Platform.
Prior to Google, Ramón worked at CERN as part of the Physics Department and the ATLAS Collaboration, where he developed the ROOT framework for data analysis and then the functional testing framework used to validate and ensure the reliability of the distributed computing facilities that enabled the Higgs boson discovery in 2012.
He holds a Computer Engineering MSc and Ph.D. For the last decade he has been researching part-time on autonomic computing and the management of computer fleets in data centres and enterprises, with the aim of optimising and reducing their power usage.
Reporting on Reliability - Improving stakeholder conversations
Reporting on Reliability - Improving stakeholder conversations is a best-practices presentation on how to communicate about reliability: during incidents, immediately after, and in periodic planning sessions.
"Phew! That incident is resolved. Now let's never speak of it again." Tempting, right?
After a production outage, rehashing it is often the last thing we want to do, especially since the conversation is likely to be a tense "face the music" situation full of finger pointing and apologies. As an alternative, reliability engineers have long practiced blameless retrospectives, having discovered that this is the best way to learn from the past, lest we repeat it. Unfortunately, stakeholders outside the operations team may not be so forgiving. But as organizations increasingly realize the importance of reliability to their users and their bottom line, we find an opportunity to engage our colleagues in developing a more mature, sustainable approach to discussing reliability—and the occasional lack thereof. By normalizing incidents, communicating systematically, and aligning on goals and investments, we mature from an avoidant, reactive approach toward a collaborative, strategic one.