Minidump: New Tool for Selecting the Regions of RAM for Post-Crash Analysis

Monday 1/29/24 02:25am | Posted By Mukesh Ojha

Snapdragon and Qualcomm branded products are products of
Qualcomm Technologies, Inc. and/or its subsidiaries.

Co-written with Elliot Berman

Instead of copying and parsing the complete RAM dump after a device crash, what if you could collect only the debug information you want? It would reduce the time and resources needed for transfer and storage, and you could study the information much more efficiently.

Qualcomm Technologies’ infrastructure for RAM dump collection, residing in proprietary boot firmware, supports minidump on our system-on-chips (SoCs). The minidump driver developed by Qualcomm Innovation Center, Inc. (QUIC) provides for collecting either a complete RAM dump or a minidump with only the regions of memory you specify. The driver, which we’re in the process of upstreaming, enables client kernel modules and the core kernel to specify in advance the minimal information to be captured for debugging.

In this post, we’ll describe what our minidump driver is, how it works and which types of debug information you can specify to streamline your debugging. This is a summary of our presentation “Minidump to Debug End-User Device Crashes” at the Linux Plumbers Conference.

TMI, and related problems

When you’re working with field devices or devices in a test farm, the engineering mode in Qualcomm Technologies’ firmware can generate full RAM dumps for both kernel and non-kernel crashes. You can then conduct postmortem debugging.

But on user devices, a complete RAM dump at the time of failure can be Too Much Information. It’s not practical to capture and store upwards of 12 GB of data on a user device, let alone transmit it over a wireless network for debugging. And, from the perspective of security, the RAM dump may contain sensitive data the user doesn’t want to share. There are plenty of reasons to collect and send less of the memory dump for analysis.

Minidump infrastructure is part of our boot firmware. We’ve been shipping the minidump kernel driver to customers for several years, and it is finding new applications in areas like automotive, extended reality, mobile broadband and IoT. But we’ve seen that it is not easy to get useful information for debugging crashes in the generic kernel image (GKI) of Android. That has led us to share this common problem of collecting the minimum useful debug data after a system crash. We’ve codified our approach to the problem in boot firmware and a minidump kernel driver, and we’re in the process of upstreaming the driver.

The minidump driver

Qualcomm Technologies’ boot firmware provides core infrastructure that collects the registered regions from each co-processor and subsystem including:

  • the audio digital signal processor (ADSP)
  • the compute DSP (CDSP)
  • the modem
  • the application processor subsystem (APSS) where Android and Linux run

Our minidump driver enables the client kernel modules to register the regions to collect. You enable minidump in firmware by setting the bit in a secure register, as described in our minidump documentation:

Writing to the sysfs node can also be used to set the mode to minidump:

    echo "mini" > /sys/module/qcom_scm/parameters/download_mode

Then, when the boot firmware gets triggered by a crash, it collects only the regions you've registered from the kernel, or from whichever other subsystems are running on your SoC. The result is less information than in a full system RAM dump, but it’s the information that’s most valuable to you.

The crash handler in boot firmware could be triggered by a kernel panic or a subsystem crash like a network-on-chip (NOC) or bus error, or by a watchdog bite or firmware panic.

What about kdump and pstore?

Of course, there are already solutions for collecting debug information after a crash.

With kdump, you can use any userspace to collect the RAM dump, then save or send it. But kdump reserves a large region (128 MB for arm64) of precious memory. It requires two kernel boots to come back to normal operation: one into the capture kernel that saves the dump after the crash, and another into the normal kernel. And only the kernel can trigger kdump, so it’s not useful in case of something like a NOC error.

Another tool is pstore, which has lower memory overhead and doesn’t require extra kernel boots. But we found room for improvement in the time it takes to collect and copy the RAM dump to the oops device. It requires a memory reservation for the ramoops region, which we wanted to avoid. Plus, pstore collection is limited to memory that the kernel can reliably copy before the crash. That means that in a kernel panic, it may not capture all of the messages, nor details like the kernel panic notifiers path.

Minidump workflow

Our minidump driver is designed to minimize the memory overhead of describing a small dump to firmware. It maintains a table of physical addresses and sizes, so there is no need to copy information like dmesg buffers or ftrace buffers ahead of time. The firmware then captures the minidump regions directly to storage, reducing collection time.

The implementation uses tables of contents, each of which has a list of the regions you want to collect in case of a crash. As shown in the image below, it's a two-stage lookup.

Each entry in the table usually corresponds to a particular subsystem. For example, to debug a crash in the Linux subsystem, you would list those regions. If you wanted to collect the log buffers or ftrace buffers in those regions, you would register their physical addresses.
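To make the two-stage lookup concrete, here is a minimal sketch in Python. It is a toy model, not the driver's API or the firmware's in-memory layout: the class names (`Region`, `SubsystemToc`, `GlobalToc`), the region names and the addresses are all hypothetical and chosen only for illustration. Stage one selects a subsystem's table; stage two finds a registered region inside it. Note that only a physical address and a size are stored, never a copy of the data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """One registered region: a name plus where it lives in physical memory."""
    name: str
    phys_addr: int
    size: int


@dataclass
class SubsystemToc:
    """Second-stage table: the regions registered by one subsystem."""
    regions: list = field(default_factory=list)

    def register(self, name: str, phys_addr: int, size: int) -> Region:
        if size <= 0:
            raise ValueError("region size must be positive")
        if any(r.name == name for r in self.regions):
            raise ValueError(f"region {name!r} is already registered")
        region = Region(name, phys_addr, size)
        self.regions.append(region)
        return region


@dataclass
class GlobalToc:
    """First-stage table: one entry per subsystem."""
    subsystems: dict = field(default_factory=dict)

    def subsystem(self, name: str) -> SubsystemToc:
        # Create the subsystem's table on first use.
        return self.subsystems.setdefault(name, SubsystemToc())

    def lookup(self, subsystem: str, region: str) -> Region:
        toc = self.subsystems[subsystem]      # stage 1: pick the subsystem
        for r in toc.regions:                 # stage 2: pick the region
            if r.name == region:
                return r
        raise KeyError(region)


if __name__ == "__main__":
    toc = GlobalToc()
    apss = toc.subsystem("apss")
    apss.register("dmesg", 0x8000_0000, 0x20000)   # made-up address and size
    apss.register("ftrace", 0x8010_0000, 0x40000)  # made-up address and size
    print(toc.lookup("apss", "ftrace"))
```

Keeping one table per subsystem means each subsystem can register its regions independently, while whoever reads the tables needs only the top-level entry point to find all of them.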

When a crash occurs with minidump enabled on a device with a Qualcomm Technologies processor:

  1. the device reboots;
  2. firmware traverses the table(s) of contents;
  3. firmware collects the registered regions in executable and linkable format (ELF);
  4. firmware copies the ELF to whichever storage device is available to the chip (eMMC, UFS, SD card, etc.).
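
The traversal in steps 2 and 3 can be sketched as follows. This is an illustrative Python model under stated assumptions, not firmware code: RAM is simulated with a byte buffer, the table is a plain dictionary of hypothetical `(name, phys_addr, size)` entries, and the function name `collect_minidump` is invented for this example. The real firmware produces an ELF and writes it to storage; the sketch only gathers the raw bytes of each registered region.

```python
def collect_minidump(ram, ram_base, toc):
    """Walk every subsystem's table and copy out each registered region.

    ram      -- bytes-like object standing in for physical memory
    ram_base -- physical address that corresponds to ram[0]
    toc      -- {subsystem: [(region_name, phys_addr, size), ...]}
    Returns {"subsystem/region_name": bytes}.
    """
    dump = {}
    for subsystem, regions in toc.items():
        for name, phys_addr, size in regions:
            start = phys_addr - ram_base
            end = start + size
            # Skip entries that do not fall entirely inside the simulated RAM.
            if size <= 0 or start < 0 or end > len(ram):
                continue
            dump[f"{subsystem}/{name}"] = bytes(ram[start:end])
    return dump


if __name__ == "__main__":
    ram = bytearray(64)            # 64 bytes of pretend memory
    ram[16:20] = b"LOGS"           # pretend log buffer contents
    toc = {"apss": [("dmesg", 0x1010, 4)]}
    dump = collect_minidump(ram, 0x1000, toc)
    total = sum(len(data) for data in dump.values())
    print(f"collected {total} of {len(ram)} bytes:", dump)
```

The point of the sketch is the size difference: only the bytes named in the tables are copied, so the collected dump scales with what was registered rather than with the size of RAM.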

Your turn: Take a look at our implementation

Minidump collection for Qualcomm Technologies’ remote processor regions like the modem and ADSP is already supported upstream. We’ve built this minidump implementation to collect kernel regions as well, on the premise that the SoC and its subsystems crash because of a variety of hardware and software bugs.

Our goal has been to share this implementation with other SoC vendors who want to integrate it with their crash dump solution. We’re putting a lot of work into upstreaming it, with most of our time taken up in community discussions about using an existing solution or building something generic. As work continued, we took the community’s advice and presented “Minidump to Debug End-User Device Crashes” at the Linux Plumbers Conference. Take a look for more details about the implementation, including the kinds of debug information we hope to capture in the future.
