Effective Troubleshooting

We’ve been reading Google’s Site Reliability Engineering book. Chapter 12 is about Effective Troubleshooting.

Here’s a summary that Bing Chat produced while I had the PDF open.


Effective troubleshooting is a skill that can help SREs diagnose and resolve complex problems in large-scale systems.

The chapter outlines a general troubleshooting process that consists of six steps: define the problem, gather information, form a hypothesis, test the hypothesis, fix the problem, and reflect and learn.

The chapter also provides some practical tips and best practices for each step, such as using the SMART criteria to define the problem, using the four golden signals to gather information, using the scientific method to form and test hypotheses, and writing postmortems to reflect and learn.

The chapter emphasizes the importance of having a troubleshooting mindset that is curious, methodical, data-driven, and humble. It also discusses some common pitfalls and anti-patterns to avoid, such as jumping to conclusions, ignoring data, or blaming others.

The 4 golden signals

The four golden signals are a set of metrics for monitoring the health and performance of user-facing systems. They were first introduced by Google’s SRE teams and are widely used in the industry. The four golden signals are:

  • Latency: The time it takes to serve a request.
  • Traffic: The amount of use of your service per time unit.
  • Errors: The rate of requests that fail.
  • Saturation: The consumption of your system resources.

If you can only measure four metrics of your user-facing system, focus on these four.
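
To make the four signals more concrete, here is a minimal sketch that computes each one from a batch of request records collected over a single time window. Everything in it is an assumption made for illustration: the Request record, the golden_signals function, the choice of the 99th percentile for latency, treating status codes of 500 and above as errors, and using CPU usage as the saturated resource. A real service would normally get these numbers from its monitoring system rather than compute them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP-style status code


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Compute the four golden signals for one time window.

    requests:     list of Request records seen during the window
    window_s:     length of the window in seconds
    cpu_used:     CPU cores in use (stand-in for "system resources")
    cpu_capacity: CPU cores available
    """
    if not requests:
        raise ValueError("need at least one request in the window")

    # Latency: 99th percentile of request duration (nearest-rank method).
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(0.99 * len(durations)))
    latency_p99_ms = durations[rank - 1]

    # Traffic: requests served per second.
    traffic_rps = len(requests) / window_s

    # Errors: fraction of requests that failed (assumed: status >= 500).
    failed = sum(1 for r in requests if r.status >= 500)
    error_rate = failed / len(requests)

    # Saturation: fraction of the resource currently consumed.
    saturation = cpu_used / cpu_capacity

    return {
        "latency_p99_ms": latency_p99_ms,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }
```

Calling golden_signals with four requests over a two-second window, one of which returned status 500, gives a traffic of 2.0 requests per second and an error rate of 0.25. A tail percentile is used for latency here because an average can hide a small number of very slow requests; that is a design choice of this sketch, not something the summary above prescribes.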


