Thursday, October 04, 2018

Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer
My rating: 5 of 5 stars

A wonderful book for learning how to manage websites so that they stay reliable.

Some good extracts from the book, along with my notes.

Site Reliability Engineering
1. Operations personnel should spend 50% of their time writing automation scripts and programs.
2. the decision to stop releases for the remainder of the quarter once an error budget is depleted
3. an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
4. codified rules of engagement and principles for how SRE teams interact with their environment—not only the production environment, but also the product development teams, the testing teams, the users, and so on
5. operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
6. There are three kinds of valid monitoring output:
Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
7. Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service’s efficiency.

SLI - Service Level Indicator - A quantitative measure of some aspect of the service (e.g., request latency, error rate, availability). SLIs are used to set the SLO and SLA.
SLO - Service Level Objective - The target value or range that the service must meet for an SLI.
SLA - Service Level Agreement - The agreement with the client about the level of service rendered to them, typically including the consequences of missing the SLOs it contains.
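
To make these concrete, here is a rough sketch of mine (not from the book) that computes an availability SLI and the remaining error budget against a 99.9% SLO; the target and request counts are made-up numbers.

    # Sketch: computing an availability SLI and error budget.
    # The SLO target and request counts below are hypothetical examples.

    SLO_TARGET = 0.999          # e.g., "99.9% of requests succeed over 30 days"

    def availability_sli(successful: int, total: int) -> float:
        """SLI: fraction of requests that were served successfully."""
        return successful / total

    def remaining_error_budget(successful: int, total: int) -> float:
        """Error budget = allowed failures (1 - SLO) minus failures already spent."""
        allowed_failures = (1 - SLO_TARGET) * total
        actual_failures = total - successful
        return allowed_failures - actual_failures

    if __name__ == "__main__":
        total, successful = 1_000_000, 999_450
        print(f"SLI = {availability_sli(successful, total):.4%}, SLO = {SLO_TARGET:.1%}")
        print(f"Remaining error budget: {remaining_error_budget(successful, total):.0f} failed requests")
        # If the budget goes negative, the book's example policy is to
        # stop feature releases for the remainder of the quarter.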

Don’t overachieve

Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn’t faster under light loads.

"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."

Four Golden Signals of Monitoring
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Latency: The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.
In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."

If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
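
As a toy illustration (my own sketch, not from the book; the record format and the choice of CPU as the saturation proxy are assumptions), the four signals over one monitoring window could be computed like this:

    # Sketch: deriving the four golden signals from one window of request records.
    # The record format, field names, and CPU-as-saturation-proxy are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Request:
        latency_ms: float
        status: int   # HTTP status code

    def p99(values):
        """99th-percentile latency over a small window (nearest-rank)."""
        ordered = sorted(values)
        return ordered[int(0.99 * (len(ordered) - 1))] if ordered else None

    def golden_signals(requests, window_seconds, cpu_utilization):
        ok = [r.latency_ms for r in requests if r.status < 500]
        failed = [r.latency_ms for r in requests if r.status >= 500]
        return {
            # Latency: track successful and failed requests separately.
            "latency_p99_ms_ok": p99(ok),
            "latency_p99_ms_failed": p99(failed),
            # Traffic: demand placed on the system, here requests per second.
            "traffic_rps": len(requests) / window_seconds,
            # Errors: fraction of requests that failed explicitly.
            "error_ratio": len(failed) / max(len(requests), 1),
            # Saturation: an indirect signal with a known upper bound (CPU here).
            "saturation_cpu": cpu_utilization,
        }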

Why is it important to have control over the software that one is using? Why and when does it make sense to roll out one's own framework and/or platform?
Another argument in favor of automation, particularly in the case of Google, is our complicated yet surprisingly uniform production environment, described in The Production Environment at Google, from the Viewpoint of an SRE. While other organizations might have an important piece of equipment without a readily accessible API, software for which no source code is available, or another impediment to complete control over production operations, Google generally avoids such scenarios. We have built APIs for systems when no API was available from the vendor. Even though purchasing software for a particular task would have been much cheaper in the short term, we chose to write our own solutions, because doing so produced APIs with the potential for much greater long-term benefits. We spent a lot of time overcoming obstacles to automatic system management, and then resolutely developed that automatic system management itself. Given how Google manages its source code, the availability of that code for more or less any system that SRE touches also means that our mission to “own the product in production” is much easier because we control the entirety of the stack.
When the platform/framework is developed in-house, it can be designed to handle failures automatically, with no external observer required to manage them.
One downside of automation is that humans forget how to perform a task manually when it is eventually required. This is not always a good thing.

Google cherry-picks changes into release branches. Should we do the same?
"All code is checked into the main branch of the source code tree (mainline). However, most major projects don’t release directly from the mainline. Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline. Bug fixes are submitted to the mainline and then cherry picked into the branch for inclusion in the release. This practice avoids inadvertently picking up unrelated changes submitted to the mainline since the original build occurred. Using this branch and cherry pick method, we know the exact contents of each release."
Note that fixes are cherry-picked from the mainline into specific release branches; changes are never merged from a release branch back into the mainline.

Surprises vs. boring
"Unlike just about everything else in life, "boring" is actually a positive attribute when it comes to software! We don’t want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals. In the words of Google engineer Robert Muth, "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code." Surprises in production are the nemeses of SRE."

Commenting or flagging code
"Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, "What if we need that code later?" "Why don’t we just comment the code out so we can easily add it again later?" or "Why don’t we gate the code with a flag instead of deleting it?" These are all terrible suggestions. Source control systems make it easy to reverse changes, whereas hundreds of lines of commented code create distractions and confusion (especially as the source files continue to evolve), and code that is never executed, gated by a flag that is always disabled, is a metaphorical time bomb waiting to explode, as painfully experienced by Knight Capital, for example (see "Order In the Matter of Knight Capital Americas LLC" [Sec13])."

Writing blameless RCA
Pointing fingers: "We need to rewrite the entire complicated backend system! It’s been breaking weekly for the last three quarters and I’m sure we’re all tired of fixing things onesy-twosy. Seriously, if I get paged one more time I’ll rewrite it myself…"
Blameless: "An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I’m sure our future on-callers will thank us!"

Establishing a strong testing culture
One way to establish a strong testing culture is to start documenting all reported bugs as test cases. If every bug is converted into a test, each test is supposed to initially fail because the bug hasn’t yet been fixed. As engineers fix the bugs, the software passes testing and you’re on the road to developing a comprehensive regression test suite.
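
For instance (a made-up bug, module, and function, just to show the shape), each bug report becomes a test that fails until the fix lands:

    # Sketch (pytest-style): turning a reported bug into a regression test.
    # The bug number, module, and function are hypothetical.
    from checkout import compute_total   # hypothetical module under test

    def test_bug_1234_empty_cart_total_is_zero():
        """Bug #1234: checkout crashed with ZeroDivisionError on an empty cart.

        This test fails until the bug is fixed, then guards against regression.
        """
        assert compute_total(items=[]) == 0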

Project Vs. Support
Dedicated, noninterrupted, project work time is essential to any software development effort. Dedicated project time is necessary to enable progress on a project, because it’s nearly impossible to write code—much less to concentrate on larger, more impactful projects—when you’re thrashing between several tasks in the course of an hour. Therefore, the ability to work on a software project without interrupts is often an attractive reason for engineers to begin working on a development project. Such time must be aggressively defended.

Managing Loads
Round Robin vs. Weighted Round Robin (round robin, but taking into account the number of requests pending at each server).
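A minimal sketch of that idea (my own simplification; the book's Weighted Round Robin also uses utilization information reported by the backends): prefer the backend with the fewest pending requests, breaking ties in round-robin order.

    # Sketch of "round robin, weighted by pending work": prefer the backend with
    # the fewest in-flight requests, breaking ties in round-robin order.
    import itertools

    class WeightedRoundRobin:
        def __init__(self, backends):
            self.backends = list(backends)
            self.pending = {b: 0 for b in self.backends}
            self._rr = itertools.cycle(range(len(self.backends)))

        def pick(self):
            start = next(self._rr)
            # Rotate the backend list so ties are broken round-robin style.
            order = self.backends[start:] + self.backends[:start]
            choice = min(order, key=lambda b: self.pending[b])
            self.pending[choice] += 1        # caller calls done() on completion
            return choice

        def done(self, backend):
            self.pending[backend] -= 1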
Overload of the system should be avoided in the first place, using load testing to learn its limits. If the system still becomes overloaded, any retries have to be well controlled: a retry at a higher level can cascade into many retries at the lower levels. Use jittered retries (retry at randomized intervals) and exponential backoff (exponentially increase the time between retries), and fail quickly rather than piling more load onto an already overloaded system.
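A sketch of controlled retries with exponential backoff and jitter (parameter values are illustrative, not recommendations from the book):

    # Sketch: retries with exponential backoff, full jitter, and a small attempt
    # budget so callers fail quickly instead of hammering an overloaded server.
    import random
    import time

    def call_with_retries(do_request, max_attempts=3, base_delay=0.1, max_delay=2.0):
        for attempt in range(max_attempts):
            try:
                return do_request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise                      # fail quickly: retry budget exhausted
                # Exponential backoff with full jitter: sleep a random amount
                # between 0 and the capped exponential delay.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))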
If queuing is used to prevent overloading the server, FIFO may not always be a good option: the user waiting on the request at the head of the queue may have already given up and left the system, no longer expecting a response.
If a request is split into multiple pipelined tasks, it is good to check at each stage whether there is still enough time to perform the rest of the work, based on the expected time the remaining stages in the pipeline will take. Implement deadline propagation.
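
A sketch of deadline propagation (stage names and cost estimates are hypothetical): each stage checks the remaining time against the expected cost of the rest of the pipeline and gives up early instead of doing work whose result would arrive too late.

    # Sketch: deadline propagation across pipeline stages. Each stage receives the
    # absolute deadline, checks whether enough time remains for itself plus the
    # expected cost of the remaining stages, and aborts early otherwise.
    import time

    class DeadlineExceeded(Exception):
        pass

    def run_pipeline(stages, deadline):
        """stages: list of (stage_fn, expected_seconds_for_rest_of_pipeline)."""
        result = None
        for stage_fn, expected_rest in stages:
            remaining = deadline - time.monotonic()
            if remaining < expected_rest:
                raise DeadlineExceeded(
                    f"{remaining:.2f}s left, need ~{expected_rest:.2f}s more")
            result = stage_fn(result)
        return result

    # Usage (hypothetical stages): propagate a 1-second deadline.
    # deadline = time.monotonic() + 1.0
    # run_pipeline([(parse, 0.8), (lookup, 0.5), (render, 0.1)], deadline)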

Safeguarding the data
Levels of guard against data loss
1. Soft Delete (Visible to user in the recycle bin)
2. Back up (incremental and full) before actual deletion and test ability to restore. Replicate live and backed up data.
3. Purge data (Can be recovered only from backup now)
4. Out of Band data validation to prevent surprising data loss.
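
A minimal sketch of the first level, soft deletion (the schema and the 30-day retention window are my own illustrative choices):

    # Sketch of soft deletion: "deleted" items are only flagged and remain
    # restorable from the user's trash until the retention window expires,
    # after which a purge job removes them for real, leaving backups as the
    # last line of defense. The schema and 30-day window are illustrative.
    from datetime import datetime, timedelta, timezone

    RETENTION = timedelta(days=30)

    class Item:
        def __init__(self, data):
            self.data = data
            self.deleted_at = None                       # None means the item is live

        def soft_delete(self):
            self.deleted_at = datetime.now(timezone.utc)  # now visible in the trash

        def restore(self):
            self.deleted_at = None                       # user rescues it from the trash

        def is_purge_due(self, now=None):
            """True once the retention window has passed and a purge job may delete the data."""
            now = now or datetime.now(timezone.utc)
            return self.deleted_at is not None and now - self.deleted_at > RETENTION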

It is important to:
1. Continuously test the recovery process as part of your normal operations
2. Set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success
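
A sketch of point 2 (the restore, checksum, and heartbeat functions are placeholders for whatever backup tooling is in use): the scheduled restore test emits a heartbeat only on success, so an alert on a missing heartbeat catches both failed restores and a restore job that silently stopped running.

    # Sketch: continuously exercise the restore path and emit a heartbeat only on
    # success; alerting on a *missing* heartbeat then covers failed restores and
    # a restore job that stopped running. Helper functions are placeholders.
    import logging

    def verify_restore(backup_id, restore_from_backup, checksum_source,
                       checksum_restored, emit_heartbeat):
        restored = restore_from_backup(backup_id)        # e.g., into a scratch database
        if checksum_restored(restored) != checksum_source():
            logging.error("Restore of %s does not match source data", backup_id)
            return False                                 # no heartbeat -> alert fires
        emit_heartbeat("backup-restore-ok")              # monitored for freshness
        return True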

Launch Coordination Checklist
This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:
1. Architecture
    Architecture sketch, types of servers, types of requests from clients
    Programmatic client requests
2. Machines and datacenters
    Machines and bandwidth, datacenters, N+2 redundancy, network QoS
    New domain names, DNS load balancing
3. Volume estimates, capacity, and performance
    HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out
    Load test, end-to-end test, capacity per datacenter at max latency
    Impact on other services we care most about
    Storage capacity
4. System reliability and failover
    What happens when:
        Machine dies, rack fails, or cluster goes offline
        Network fails between two datacenters
    For each type of server that talks to other servers (its backends):
        How to detect when backends die, and what to do when they die
        How to terminate or restart without affecting clients or users
        Load balancing, rate-limiting, timeout, retry and error handling behavior
    Data backup/restore, disaster recovery
5. Monitoring and server management
    Monitoring internal state, monitoring end-to-end behavior, managing alerts
    Monitoring the monitoring
    Financially important alerts and logs
    Tips for running servers within cluster environment
    Don’t crash mail servers by sending yourself email alerts in your own server code
6. Security
    Security design review, security code audit, spam risk, authentication, SSL
    Prelaunch visibility/access control, various types of blacklists
7. Automation and manual tasks
    Methods and change control to update servers, data, and configs
    Release process, repeatable builds, canaries under live traffic, staged rollouts
8. Growth issues
    Spare capacity, 10x growth, growth alerts
    Scalability bottlenecks, linear scaling, scaling with hardware, changes needed
    Caching, data sharding/resharding
9. External dependencies
    Third-party systems, monitoring, networking, traffic volume, launch spikes
    Graceful degradation, how to avoid accidentally overrunning third-party services
    Playing nice with syndicated partners, mail systems, services within Google
10. Schedule and rollout planning
    Hard deadlines, external events, Mondays or Fridays
    Standard operating procedures for this service, for other services

As mentioned, you might encounter responses such as "Why me?" This response is especially likely when a team believes that the postmortem process is retaliatory. This attitude comes from subscribing to the Bad Apple Theory: the system is working fine, and if we get rid of all the bad apples and their mistakes, the system will continue to be fine. The Bad Apple Theory is demonstrably false, as shown by evidence [Dek14] from several disciplines, including airline safety. You should point out this falsity. The most effective phrasing for a postmortem is to say, "Mistakes are inevitable in any system with multiple subtle interactions. You were on-call, and I trust you to make the right decisions with the right information. I'd like you to write down what you were thinking at each point in time, so that we can find out where the system misled you, and where the cognitive demands were too high."

"The best designs and the best implementations result from the joint concerns of production and the product being met in an atmosphere of mutual respect."

Postmortem Culture

Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE's strong culture of blameless postmortems. When something goes wrong (and given the scale, complexity, and rapid rate of change at Google, something inevitably will go wrong), it's important to evaluate all of the following:

What happened
The effectiveness of the response
What we would do differently next time
What actions will be taken to make sure a particular incident doesn't happen again

This exercise is undertaken without pointing fingers at any individual. Instead of assigning blame, it is far more important to figure out what went wrong, and how, as an organization, we will rally to ensure it doesn't happen again. Dwelling on who might have caused the outage is counterproductive. Postmortems are conducted after incidents and published across SRE teams so that all can benefit from the lessons learned.

Decisions should be informed rather than prescriptive, and are made without deference to personal opinions—even that of the most senior person in the room, whom Eric Schmidt and Jonathan Rosenberg dub the "HiPPO," for "Highest-Paid Person's Opinion."


