Debugging tools can be irreplaceable for examining the execution flow and the current state of a program. Because bugs are not always easy to reproduce, being able to debug in production is sometimes a critical benefit. But intercepting a live program in production for debugging purposes is often not feasible because it would impact the production system. Conventional diagnostic signals such as logs and traces help debug programs without impacting the execution flow, but they can be limited in which critical variables they report and they cannot evaluate new expressions.

Additionally, logs or stack traces might not help diagnose a problem at all. If a process is crashing consistently without enough diagnostics, it can be hard to pinpoint the problem without conventionally debugging the program. Because production environments are difficult to reproduce, it is also hard to reproduce the execution flows that take place in production.

Debugging production bugs is often challenging because it requires reproducing a similarly orchestrated environment and workload. If the bug is causing downtime, it is critical for developers and operators to be able to take action without risking cascading failures.

To understand a program at an exact moment, we use core dumps. A core dump is a file that contains the memory dump of a running process and its status. It is primarily used for post-mortem debugging of a program, and it can also be used to understand a program’s state while it is still running, with minimal overhead.

Core dumps are useful not only for debugging user-space programs but also for examining the state of the kernel at the time the snapshot is taken. We can classify core dumps into three main categories:

  • Kernel dumps: Core dumps that contain the full or partial memory of the entire system. Useful for understanding the state of the kernel and other programs that might be triggering bugs.
  • Platform dumps: Core dumps that contain the memory of a platform or agent, such as the JVM.
  • User-space dumps: Core dumps that contain the memory dumps of a single user-space process. Useful to investigate user-space problems.

Two strategies are adopted to capture dumps:

  • Automatically capture a dump when the process crashes.
  • Capture a dump upon receiving SIGSEGV, SIGILL, SIGABRT or a similar signal.
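
On Unix-like systems, whether a crash actually produces a core file usually depends on the process's core file size limit, which is often set to zero by default. The following is a minimal sketch, assuming a Unix platform and Python's standard resource module; the function name enable_core_dumps is hypothetical and only illustrates the idea of raising the limit before a crash can occur.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit.

    Returns the (soft, hard) limits in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_core_dumps()
    print("core limit (soft, hard):", soft, hard)
```

Where the resulting file is written, and under what name, is decided by the operating system (on Linux, for example, by the kernel.core_pattern setting), so this sketch covers only the size limit.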

Once obtained, core files can be used interactively with conventional debugging tools (e.g. gdb, lldb) to backtrace and to evaluate expressions. Core dumps can also be sent to a service that generates reports and helps you analyze dumps at scale. Core dumps can also be used to build live debugging tools like Stackdriver Debugger to debug production services with minimal overhead.
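
Besides interactive use, a debugger can be driven non-interactively, which is one way a reporting service might extract backtraces from many dumps. The following is a rough sketch, assuming gdb is installed and on the PATH; the helper names gdb_backtrace_command and backtrace, and the example paths, are hypothetical.

```python
import subprocess


def gdb_backtrace_command(executable, core_file):
    """Build a gdb command line that prints backtraces for all threads and exits."""
    return [
        "gdb",
        "--batch",                     # run the commands below, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
        executable,
        core_file,
    ]


def backtrace(executable, core_file):
    """Run gdb non-interactively and return whatever it printed to stdout."""
    result = subprocess.run(
        gdb_backtrace_command(executable, core_file),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical paths; substitute a real binary and its core file.
    print(backtrace("./server", "core.1234"))
```

The executable passed to gdb should be the same build that produced the dump, otherwise symbols in the backtrace may be missing or misleading.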

Obtaining core dumps might introduce some challenges such as:

  • PII: Being able to strip out personally identifiable information (PII) can be critical when debugging actual live production systems.
  • Core dump retrieval time: Retrieving core dumps might take a long time depending on the extent of the collection and might introduce additional overhead.
  • Core dump size: Depending on the extent of the dump, the core dump file can get quite large. This presents a challenge when automatically capturing dumps and storing them long term at scale.
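
One simple way to keep automatically captured dumps from filling a disk is to enforce a size budget on the directory that holds them. The following is an illustrative sketch only, assuming all dumps land in a single directory; the function name prune_core_dumps and the oldest-first policy are assumptions, not something prescribed above.

```python
import os


def prune_core_dumps(directory, max_total_bytes):
    """Delete the oldest files in `directory` until the total size fits the budget.

    Returns the list of paths that were removed.
    """
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            info = os.stat(path)
            entries.append((info.st_mtime, info.st_size, path))

    entries.sort()  # oldest modification time first
    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in entries:
        if total <= max_total_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

A real deployment would likely upload or compress dumps before deleting them; this sketch only shows the budget check.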

For further reading on obtaining core dumps, see: