Production readiness

Have you ever launched a new service to production? Have you ever maintained a production service? If you answered “yes” to either of these questions, were you guided during the process? What’s good or bad to do in production? And how do you transfer knowledge when new team members want to release production services or take ownership of existing ones?

Most companies end up with organically grown, wild-west approaches to production practices. Each team figures out its tools and best practices on its own through trial and error. This reality often takes a real toll not only on the success of projects but also on engineers.

A trial-and-error culture creates an environment where finger pointing and blaming are more common. Once these behaviors take hold, it becomes harder to learn from mistakes and avoid repeating them.

Successful organizations:

  • acknowledge the need for production guidelines
  • spend time researching practices that apply to them
  • start having production readiness discussions when designing new systems or components
  • enforce production readiness practices

Production readiness involves a “review” process. Reviews can be a checklist or a questionnaire, and they can be done manually, automatically, or both. Rather than a static list of requirements, organizations can produce checklist templates that can be customized based on their needs. By doing so, it is possible to give engineers a way to inherit knowledge while leaving enough flexibility when it is required.

When to review a service for production readiness?

Production readiness reviews are not only useful right before pushing to production; they can also be a protocol for handing off operational responsibilities to a different team or to a new hire. Use reviews when:

  • Launching a new production service.
  • Handing off the operations of an existing production service to another team such as SRE.
  • Handing off the operations of an existing production service to new individuals.
  • Preparing on-call support.

Production readiness checklists

A while ago, I published an example checklist for production readiness to show what such checklists can cover. Even though the list came into existence while working with Google Cloud customers, it is useful and applicable outside of Google Cloud.

Design and Development

  • Have reproducible builds; your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
  • Define and set SLOs for your service at design time.
  • Document the availability expectations of external services you depend on.
  • Avoid single points of failure by not depending on a single global resource. Have the resource replicated or have a proper fallback (e.g. a hardcoded value) when the resource is not available (see the sketch after this list).
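
As an illustration of the last item, here is a minimal Go sketch of falling back to a hardcoded value; the rate-limit resource, the default value, and the function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultRateLimit is a hardcoded fallback used when the shared
// configuration resource cannot be reached. The value is illustrative.
const defaultRateLimit = 100

// fetchRateLimit is a stand-in for a call to a single global resource
// (e.g. a shared config store). It can fail if that resource is down.
func fetchRateLimit() (int, error) {
	// In a real service this would be a network call.
	return 0, errors.New("config store unavailable")
}

// rateLimit returns the configured limit, falling back to the
// hardcoded default so the service keeps working during an outage.
func rateLimit() int {
	v, err := fetchRateLimit()
	if err != nil {
		return defaultRateLimit
	}
	return v
}

func main() {
	fmt.Println("rate limit:", rateLimit())
}
```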

Configuration Management

  • Static, small, and non-secret configuration can be command-line flags (see the sketch after this list). Use a configuration delivery service for everything else.
  • Dynamic configuration should have a reasonable fallback in the case of unavailability of the configuration system.
  • Development environment configuration shouldn’t inherit from production configuration. Doing so may lead to development accessing production services, which can cause privacy issues and data leaks.
  • Document what can be configured dynamically and explain the fallback behavior if the configuration delivery system is not available.
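
To make the first item concrete, here is a minimal Go sketch of keeping static, small, non-secret configuration in command-line flags; the flag names and defaults are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// Static, small, non-secret configuration is passed as command-line
// flags; anything dynamic or sensitive would come from a configuration
// delivery service instead. Flag names and defaults are illustrative.
var (
	listenAddr = flag.String("listen_addr", ":8080", "address the server listens on")
	logLevel   = flag.String("log_level", "info", "minimum level to log")
)

func main() {
	flag.Parse()
	fmt.Printf("listening on %s with log level %s\n", *listenAddr, *logLevel)
}
```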

Release Management

  • Document all details about your release process. Document how releases affect SLOs (e.g. temporarily higher latency due to cache misses).
  • Document your canary release process.
  • Have a canary analysis plan and set up mechanisms to automatically revert canaries if possible (see the sketch after this list).
  • Ensure rollbacks can use the same process that rollouts use.
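
As a minimal illustration of the canary analysis item above, here is a Go sketch; the 1.5x error-rate threshold and the function names are hypothetical, and a real plan would also account for latency and saturation.

```go
package main

import "fmt"

// canaryHealthy compares the canary's error rate against the stable
// baseline. The 1.5x threshold is illustrative; a real analysis plan
// would look at more signals before deciding to revert.
func canaryHealthy(canaryErrRate, baselineErrRate float64) bool {
	return canaryErrRate <= baselineErrRate*1.5
}

func main() {
	if !canaryHealthy(0.04, 0.01) {
		fmt.Println("canary unhealthy: rolling back")
		// Trigger the rollback using the same process used for rollouts.
	}
}
```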

Observability

  • Ensure the metrics required by your SLOs are collected and exported from your binaries.
  • Make sure client- and server-side observability data can be differentiated. This is important for debugging issues in production.
  • Tune alerts to reduce toil, for example by removing alerts triggered by routine events.
  • Include underlying platform metrics in your dashboards. Set up alerting for your external service dependencies.
  • Always propagate the incoming trace context (see the sketch after this list). Even if you are not participating in the trace, this allows lower-level services to debug production issues.
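
A minimal Go sketch of the last item: the handler forwards the W3C `traceparent` header to an outgoing call without recording any spans itself. The downstream URL is hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// handler calls a downstream service and forwards the incoming trace
// context (the W3C traceparent header) even though this service does
// not record spans itself. The downstream URL is illustrative.
func handler(w http.ResponseWriter, r *http.Request) {
	req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://downstream.internal/items", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// Propagate the incoming trace context so lower-level services can
	// still correlate their spans with the original request.
	if tp := r.Header.Get("traceparent"); tp != "" {
		req.Header.Set("traceparent", tp)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	fmt.Fprintf(w, "downstream responded with %s\n", resp.Status)
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```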

Security and Protection

  • Make sure all external requests are encrypted.
  • Make sure your production projects have proper IAM configuration.
  • Use networks within projects to isolate groups of VM instances.
  • Use VPN to securely connect remote networks.
  • Document and monitor user data access. Ensure that all user data access is logged and audited.
  • Ensure debugging endpoints are limited by ACL.
  • Sanitize user input. Have payload size restrictions for user input.
  • Ensure your service can block incoming traffic selectively per user (see the sketch after this list). This allows you to block abuse without impacting other users.
  • Avoid external endpoints that trigger a large number of internal fan-outs.
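
A minimal Go sketch of the payload-size and per-user blocking items, assuming a hypothetical in-memory blocklist keyed by an illustrative `X-User-ID` header and a 1 MB payload cap.

```go
package main

import (
	"log"
	"net/http"
)

// blockedUsers is a hypothetical in-memory blocklist. In production this
// would typically be backed by dynamically updated configuration.
var blockedUsers = map[string]bool{"abusive-user-42": true}

// protect limits payload size and selectively blocks abusive users so
// that abuse can be stopped without impacting everyone else.
func protect(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The user identifier header is illustrative; use whatever
		// authenticated identity your service already has.
		if blockedUsers[r.Header.Get("X-User-ID")] {
			http.Error(w, "blocked", http.StatusTooManyRequests)
			return
		}
		// Reject request bodies larger than 1 MB before reading them.
		r.Body = http.MaxBytesReader(w, r.Body, 1<<20)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", protect(mux)))
}
```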

Capacity planning

  • Document how your service scales. Examples: number of users, size of incoming payload, number of incoming messages.
  • Document resource requirements for your service. Examples: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPUs or TPUs.
  • Document resource constraints: resource type, region, etc.
  • Document quota restrictions on creating new resources. For example, document the rate limit of the GCE API if you are creating new instances via the API.
  • Consider having load tests to catch performance regressions where possible (see the sketch after this list).
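
As a minimal sketch of the last item, a tiny Go load generator that replays requests against a hypothetical local endpoint and reports median latency; running it before and after a change is one way to spot regressions. Real load tests would add concurrency, warm-up, and SLO-based pass/fail criteria.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"time"
)

func main() {
	// Target and request count are illustrative.
	const target = "http://localhost:8080/"
	const n = 100

	latencies := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Get(target)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
		latencies = append(latencies, time.Since(start))
	}

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	fmt.Println("median latency:", latencies[len(latencies)/2])
}
```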

Debugging in production

Debuggers and debugging tools can be irreplaceable for examining the execution flow and the current state of a program. Given that it is not always easy to reproduce bugs, being able to debug in production is sometimes a critical benefit. But intercepting a live program in production for debugging purposes is rarely feasible because it would impact the production system. Conventional diagnostic signals such as logs and traces help debug programs without impacting the execution flow, but they might be limited in reporting the critical variables and have no capability to evaluate new expressions.

Additionally, logs or stack traces might not be helpful for diagnosing problems at all. If a process is crashing consistently without enough diagnostics, it might be hard to pinpoint the problem without conventionally debugging the program. The difficulty of reproducing production environments also makes it hard to reproduce the execution flows that take place in production.

Debugging production bugs is often challenging because it requires reproducing a similarly orchestrated environment and workload. If the bug is causing any downtime, it is critical for developers and operators to be able to take action without risking cascading failures.

To understand a program at an exact moment, we use core dumps. A core dump is a file that contains the memory dump of a running process and its status. It is primarily used for post-mortem debugging of a program, and it can also be used to understand a program’s state while it is still running, with minimal overhead.

Core dumps are not only useful for debugging user-space programs; they can also capture the state of the kernel when the snapshot is taken. We can classify core dumps into three main categories:

  • Kernel dumps: Core dumps that contain the full or partial memory of the entire system. They are useful to understand the state of the kernel and other programs that might be triggering bugs.
  • Platform dumps: Core dumps that contain the memory of a platform or agent, such as the JVM.
  • User-space dumps: Core dumps that contain the memory dump of a single user-space process. Useful for investigating user-space problems.

Two strategies are adopted to capture dumps:

  • Automatically capture a dump when a process crashes (see the sketch after this list).
  • Capture a dump upon receiving SIGSEGV, SIGILL, SIGABRT, or a similar signal.
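
For a Go service, a minimal sketch of the first strategy, assuming core dumps are allowed with `ulimit -c unlimited` and the binary is run with `GOTRACEBACK=crash`; the program below is an illustrative crasher.

```go
package main

// Run with:
//   ulimit -c unlimited              # allow the OS to write core files
//   GOTRACEBACK=crash ./crasher      # make the Go runtime abort and dump core
// On an unrecovered panic the runtime raises SIGABRT and the kernel
// writes a core file that can later be inspected post-mortem.
func main() {
	var m map[string]int
	m["boom"] = 1 // nil map write: panics and, with the settings above, dumps core
}
```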

Once obtained, core files can be used with conventional debugging tools (e.g. gdb, lldb) interactively to get backtraces and evaluate expressions. Core dumps can also be sent to a service that generates reports and helps you analyze dumps at scale. Core dumps can also be used to build live debugging tools like Stackdriver Debugger to debug production services with minimal overhead.

Obtaining core dumps might introduce some challenges such as:

  • PII: Being able to strip out personally identifiable information (PII) can be critical when debugging live production systems.
  • Core dump retrieval time: Retrieving core dumps might take a long time depending on the extent of the collection and might introduce additional overhead.
  • Core dump size: Depending on the extent of the dump, the core dump file size can get quite large. This presents a challenge when automatically capturing dumps and storing them long term at scale.


Why is benchmarking hard?

Benchmarking generally means producing measurements from a specific program or workload, typically to understand and compare the performance characteristics of the benchmarked workload.

Benchmarks can be useful for:

  • Optimizing costly workloads.
  • Understanding how the underlying platform impacts cost.

Benchmarking can give you insights about your workload along various dimensions, such as CPU cycles spent or memory allocated for a given task. These measurements (even though they might come from idealized environments) might give you some hints about the cost. They may help you pick the right algorithm or optimize an overwhelmingly expensive workload.

Benchmarking can also give you insights about the underlying runtime, operating system, and hardware, and can help you compare how each of these elements impacts your performance even if you don’t change the code. You might want to run the same suite of benchmarks on new hardware to estimate its impact on certain calls.

Benchmarking requires a deep understanding of all layers you depend on. Consider CPU benchmarking. Any aspect below can significantly impact your results:

  • Whether the data is available in the CPU cache or not.
  • Whether you hit a garbage collection or not.
  • Whether the compiler has optimized some cases or not.
  • Whether there is concurrency or not.
  • Whether you are sharing the cores with anything else.
  • Whether your cores are virtual or physical.
  • Whether the branch predictor will do the same thing or not.
  • Whether the code you are benchmarking has any side effects or not. Compilers are really good at optimizing away code that has no impact on the global state (see the sketch after this list).
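
As a minimal illustration of the last item, here is a Go microbenchmark sketch; the measured function and the package-level sink are hypothetical. Storing the result in the sink gives the benchmarked call an observable effect, so the compiler cannot eliminate it as dead code.

```go
package bench

import (
	"strings"
	"testing"
)

// join is the hypothetical workload being measured.
func join(parts []string) string {
	return strings.Join(parts, ",")
}

// sink is a package-level variable; writing the result here keeps the
// benchmarked call observable so the compiler cannot optimize it away.
var sink string

func BenchmarkJoin(b *testing.B) {
	parts := []string{"a", "b", "c", "d"}
	b.ReportAllocs() // also report allocations per operation
	for i := 0; i < b.N; i++ {
		sink = join(parts)
	}
}
```

Running the suite with `go test -bench=. -benchmem` before and after a hardware or runtime change gives the kind of comparison described above.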

This is why, when we talk about benchmarking, it is not good practice to limit ourselves to the user-space code. It can sometimes turn into a detective’s job to design and evaluate benchmarks. This is also why complicated and long workloads are harder to benchmark reliably: it becomes harder to do the right accounting and figure out whether it is contention, context switches, caching, the garbage collector, compiler optimizations, or the user-space code.

Even though microbenchmarks can give some insights, they cannot replicate how the workload is going to perform in production. Replicating an average production environment for microbenchmarking purposes is almost an impossible task. This is why microbenchmarks don’t always serve as a good starting point if you want to evaluate your production performance.

To conclude, replicating production-like situations is not trivial. Understanding the impact of the underlying stack requires a lot of experience and expertise. Designing and evaluating benchmarks is not easy. Examining production services, understanding the call patterns, and pointing out the hot calls may provide better insights about your production. In future articles, I will cover how we think about performance evaluation in production.