Monday, January 26, 2026

print & analyse CUDA coredumps

Inconveniently, cuda-gdb can't process them automatically - you need to explicitly say something like
target cudacore /full/path/to/coredump

and then type lots of info cuda XXX commands

So last weekend I wrote a tool to parse/dump CUDA coredumps, and it even works on machines without the CUDA SDK (which might be useful if you collect all crash dumps in some centralized storage with the help of CUDA_COREDUMP_PIPE)

But first

A little bit of theory

The format of CUDA coredumps is documented in cudacoredump.h from cuda-gdb.deb.
It contains the list of devices in the .cudbg.devtbl section and 2 groups of data
 
The first is the list of contexts and the resources attached to them, like global memory, plus the list of loaded modules in .cudbg.relfimg.devX.ctxY sections. Those modules are just normal ELF files (some come from the kernel runtime) and, most importantly, they contain the load addresses for each section - this is how we can find the module/function of the faulty instruction
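
As a minimal sketch of how to walk this layout (using ELFIO directly; the output format and the assumption that the coredump path comes in argv[1] are mine), the following lists every .cudbg.* section and parses relfimg modules as nested ELF files to recover their load addresses:

#include <elfio/elfio.hpp>
#include <cstdio>
#include <sstream>
#include <string>

int main(int argc, char **argv)
{
  ELFIO::elfio core;
  if ( argc < 2 || !core.load(argv[1]) ) return 1;
  for ( ELFIO::Elf_Half i = 0; i < core.sections.size(); ++i ) {
    ELFIO::section *sec = core.sections[i];
    std::string name = sec->get_name();
    if ( name.compare(0, 6, ".cudbg") ) continue; // not a cudbg section
    printf("%s size %lx\n", name.c_str(), (unsigned long)sec->get_size());
    // loaded modules are stored as nested ELF files inside relfimg sections
    if ( !name.compare(0, 15, ".cudbg.relfimg.") ) {
      std::istringstream es(std::string(sec->get_data(), sec->get_size()));
      ELFIO::elfio mod; // needs an ELFIO version with the std::istream load overload
      if ( !mod.load(es) ) continue;
      for ( ELFIO::Elf_Half j = 0; j < mod.sections.size(); ++j )
        if ( mod.sections[j]->get_address() ) // load address of this module section
          printf("  %s at %lx\n", mod.sections[j]->get_name().c_str(),
                 (unsigned long)mod.sections[j]->get_address());
    }
  }
}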

The second group contains the whole thread hierarchy (see the parsing sketch after this list):

  • list of SMs in .cudbg.smtbl.devX section
  • list of CTAs in .cudbg.ctatbl.devX.smY sections
  • list of WARPs in .cudbg.wptbl.devX.smY.ctaZ sections
  • and finally the list of threads in each warp - in .cudbg.lntbl.devX.smY.ctaZ.wpI sections
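
The devX.smY.ctaZ.wpI.lnJ suffixes are simply the coordinates in that hierarchy, so they can be recovered straight from a section name. A small sketch (the helper is mine, not part of any API):

#include <cstdio>
#include <cstring>

// pull dev/sm/cta/wp/ln indices out of names like .cudbg.regs.dev0.sm1.cta2.wp3.ln4
// returns how many coordinates were present; table names like .smtbl/.ctatbl/.wptbl
// also match a key, but fail the numeric scan and so terminate the loop cleanly
static int parse_coords(const char *name, unsigned idx[5])
{
  static const char *keys[5] = { ".dev", ".sm", ".cta", ".wp", ".ln" };
  int n = 0;
  for ( int i = 0; i < 5; i++ ) {
    const char *p = strstr(name, keys[i]);
    if ( !p || 1 != sscanf(p + strlen(keys[i]), "%u", &idx[i]) ) break;
    n++;
  }
  return n;
}

int main()
{
  unsigned idx[5];
  int n = parse_coords(".cudbg.regs.dev0.sm1.cta2.wp3.ln4", idx);
  for ( int i = 0; i < n; i++ ) printf(" %u", idx[i]); // prints: 0 1 2 3 4
  printf("\n");
}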

Each thread has its own set of sections (a lookup sketch follows below):

  • call stack in .cudbg.bt.devX.smY.ctaZ.wpI.lnJ
  • registers in .cudbg.regs.devX.smY.ctaZ.wpI.lnJ
  • predicates in .cudbg.pred.devX.smY.ctaZ.wpI.lnJ
  • local memory in .cudbg.local.devX.smY.ctaZ.wpI.lnJ. Curiously, those sections have the same addresses
At the same time, sections for uniform registers (.cudbg.uregs.devX.smY.ctaZ.wpI) & predicates (.cudbg.upred.devX.smY.ctaZ.wpI) are attached to WARPs
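
Going the other way - from coordinates to data - is just string formatting plus a by-name section lookup; a sketch (get_regs is my helper, and it returns raw section bytes without interpreting the register layout):

#include <elfio/elfio.hpp>
#include <cstdio>

// fetch raw register bytes for one thread; 'core' is a coredump already
// loaded with ELFIO as in the first sketch above
static const char *get_regs(ELFIO::elfio &core, unsigned dev, unsigned sm,
  unsigned cta, unsigned wp, unsigned ln, ELFIO::Elf_Xword &size)
{
  char name[128];
  snprintf(name, sizeof(name), ".cudbg.regs.dev%u.sm%u.cta%u.wp%u.ln%u",
           dev, sm, cta, wp, ln);
  size = 0;
  ELFIO::section *sec = core.sections[name]; // by-name lookup, nullptr if absent
  if ( !sec ) return nullptr;
  size = sec->get_size();
  return sec->get_data();
}

For uniform registers/predicates you would format .cudbg.uregs.devX.smY.ctaZ.wpI the same way, just without the .lnJ suffix, since those belong to the warp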

Where to get the faulty instruction address

This is a really good question. Actually we have 3 sources of addresses:
  1. for drivers with version >= 555 each SM has an errorPC field
  2. each WARP has an errorPC field too
  3. finally, each lane has exception & virtualPC fields in CudbgThreadTableEntry
 
The worst part is that all these addresses are different. It can be explained for WARPs - they can have divergent threads. It seems that cuda-gdb uses virtualPC - see the selection sketch below
It reminds me of the old joke that there was actually only one breed of dinosaurs, and each paleontologist just put the bones together in their own way
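
My reading of that behavior, as a sketch - the struct layouts below are illustrative only (the field names come from the post above, but the *Valid flags and everything else are my assumptions; check the real versioned entries in cudacoredump.h):

#include <cstdint>

// illustrative subsets of the real table entries - NOT the actual layouts
struct SmEntry   { uint64_t errorPC; bool errorPCValid; };    // drivers >= 555 only
struct WarpEntry { uint64_t errorPC; bool errorPCValid; };
struct LaneEntry { uint64_t virtualPC; uint64_t exception; }; // CudbgThreadTableEntry

// prefer the lane's own virtualPC when that lane took an exception (this is
// what cuda-gdb seems to report), then fall back to warp and SM errorPC
static uint64_t pick_faulty_pc(const SmEntry *sm, const WarpEntry *wp,
                               const LaneEntry *ln)
{
  if ( ln && ln->exception ) return ln->virtualPC;
  if ( wp && wp->errorPCValid ) return wp->errorPC;
  if ( sm && sm->errorPCValid ) return sm->errorPC;
  return 0;
}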

 

Installation

For low-level parsing of coredumps I added lots of XS functions to my Elf::Reader module, so you need to build and install it.
Its only dependency is ELFIO.
Note that the structures in the original cudacoredump.h are not suitable for versioning, so I split them by version and glued them together through public inheritance - the sketch below shows the idea
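
Roughly like this (the field sets are made up for illustration; only errorPC and the >= 555 threshold are real):

#include <cstddef>
#include <cstdint>

// version-specific entries glued through public inheritance: code written
// against the 525 layout keeps working, newer fields are only appended
struct CudbgSmTableEntry525 {
  uint32_t smId;          // hypothetical field
};

struct CudbgSmTableEntry555 : public CudbgSmTableEntry525 {
  uint64_t errorPC;       // the field that appeared for drivers >= 555
};

// stride of the .cudbg.smtbl.devX entries then depends on the driver version
static size_t sm_entry_size(unsigned drv_ver)
{
  return drv_ver >= 555 ? sizeof(CudbgSmTableEntry555)
                        : sizeof(CudbgSmTableEntry525);
}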
 
The minimal supported driver version is 525, the max is 575 from CUDA SDK 13.1. I didn't test on older versions - perhaps it doesn't work for them. To find your driver version run

nvidia-smi -q | head

Timestamp                                 : Mon Jan 26 17:12:45 2026
Driver Version                            : 535.183.01
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA Very Expensive Card 

Command line options

Without any options the script just tries to find the faulty instruction address and the corresponding ELF module and section. But you can dump numerous things like
  • backtraces with -b option
  • grids with -g
  • registers/predicates with -r
  • CTAs/WARPs/threads with -t

Because the dump can be huge, you can restrict it to only the WARPs/threads with faulty instructions using the -e option

To set the right driver version use the -D option

Happy debugging!
