Debugging in production

Debuggers and debugging tools can be irreplaceable for examining the execution flow and the current state of a program. Because bugs are not always easy to reproduce, being able to debug in production is sometimes a critical benefit. But intercepting a live program in production for debugging is rarely feasible, because it would impact the production system. Conventional diagnostic signals such as logs and traces help debug programs without affecting the execution flow, but they may not report the critical variables and they cannot evaluate new expressions.

Additionally, logs or stack traces might not help diagnose a problem at all. If a process is crashing consistently without enough diagnostics, it can be hard to pinpoint the problem without conventionally debugging the program. Because production environments are difficult to reproduce, the execution flows that take place in production are difficult to reproduce as well.

Debugging production bugs is often challenging because it requires reproducing a similarly orchestrated environment and workload. If the bug is causing downtime, it is critical for developers and operators to be able to take action without risking cascading failures.

To understand the state of a program at a specific moment, we use core dumps. A core dump is a file that contains the memory dump of a running process and its process status. It is primarily used for post-mortem debugging, and for understanding a program’s state, with minimal overhead, while it is still running.
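Before a crash can produce a core file, the process usually has to be allowed to write one. The following is a minimal sketch, assuming a Unix-like system where the core file size is governed by the `RLIMIT_CORE` resource limit. The helper name `enable_core_dumps` is hypothetical, and whether and where a file is actually written also depends on system configuration that this sketch does not touch.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit for this process.

    Returns the (soft, hard) pair in effect after the change.
    Hypothetical helper for illustration only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    print("core size limit (soft, hard):", soft, hard)
```

If the hard limit is itself zero, the soft limit stays at zero and no core file will be produced. In that case the limit has to be raised by whatever launches the process.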

Core dumps are useful not only for debugging user-space programs but also for inspecting the state of the kernel at the moment the snapshot is taken. We can classify core dumps mainly into two categories:

Two strategies are adopted to capture dumps:

Once obtained, core files can be used interactively with conventional debugging tools (e.g. gdb, lldb) to produce backtraces and evaluate expressions. Core dumps can also be sent to a service that generates reports and helps you analyze dumps at scale. They can also be used to build live debugging tools like Stackdriver Debugger, which debug production services with minimal overhead.
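As an illustration of using a core file non-interactively, the sketch below runs gdb in batch mode and asks it for a backtrace. It assumes gdb is installed and that the executable that produced the dump is available. The function names and the example paths (`./myserver`, `core.1234`) are hypothetical.

```python
import subprocess


def gdb_backtrace_command(executable, core_file):
    """Build a gdb command line that prints a backtrace from a core file and exits."""
    # --batch makes gdb exit after running the commands given with -ex.
    return ["gdb", "--batch", "-ex", "bt", executable, core_file]


def backtrace_from_core(executable, core_file):
    """Run gdb non-interactively and return whatever it printed to stdout."""
    result = subprocess.run(
        gdb_backtrace_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths; substitute the real binary and core file.
    print(backtrace_from_core("./myserver", "core.1234"))
```

A wrapper like this is one way a reporting service could extract backtraces from many dumps without an interactive session. Evaluating expressions would need further `-ex` commands.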

Obtaining core dumps might introduce some challenges such as:

For further reading on obtaining core dumps, see: