Monday, January 26, 2026

print & analyse CUDA coredumps

Inconveniently, cuda-gdb can't process them automatically - you need to explicitly say something like
target cudacore /full/path/to/coredump

and then type lots of info cuda XXX commands

So last weekend I wrote a tool to parse/dump CUDA coredumps, and it even works on machines without the CUDA SDK (which might be useful if you collect all crash dumps in some centralized storage with the help of CUDA_COREDUMP_PIPE)

But first

A little bit of theory

The format of CUDA coredumps is documented in cudacoredump.h from cuda-gdb.deb.
It contains the list of devices in the .cudbg.devtbl section and 2 groups of data
 
The first is the list of contexts and the resources attached to them, like global memory, plus the list of loaded modules in .cudbg.relfimg.devX.ctxY sections. Those modules are just normal ELF files (some come from the kernel runtime) and, most importantly, they contain the load addresses for each section - this is how we can find the module/function of the faulty instruction
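
As a minimal sketch of how to walk this layout (using ELFIO directly; the output format and the assumption that the coredump path comes in argv[1] are mine), the following lists every .cudbg.* section and parses relfimg modules as nested ELF files to recover their load addresses:

#include <elfio/elfio.hpp>
#include <cstdio>
#include <sstream>
#include <string>

int main(int argc, char **argv)
{
  ELFIO::elfio core;
  if ( argc < 2 || !core.load(argv[1]) ) return 1;
  for ( ELFIO::Elf_Half i = 0; i < core.sections.size(); ++i ) {
    ELFIO::section *sec = core.sections[i];
    std::string name = sec->get_name();
    if ( name.compare(0, 6, ".cudbg") ) continue; // not a cudbg section
    printf("%s size %lx\n", name.c_str(), (unsigned long)sec->get_size());
    // loaded modules are stored as nested ELF files inside relfimg sections
    if ( !name.compare(0, 15, ".cudbg.relfimg.") ) {
      std::istringstream es(std::string(sec->get_data(), sec->get_size()));
      ELFIO::elfio mod; // needs an ELFIO version with the std::istream load overload
      if ( !mod.load(es) ) continue;
      for ( ELFIO::Elf_Half j = 0; j < mod.sections.size(); ++j )
        if ( mod.sections[j]->get_address() ) // load address of this module section
          printf("  %s at %lx\n", mod.sections[j]->get_name().c_str(),
                 (unsigned long)mod.sections[j]->get_address());
    }
  }
}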

The second group contains the whole thread hierarchy (see the parsing sketch after this list):

  • list of SMs in .cudbg.smtbl.devX section
  • list of CTAs in .cudbg.ctatbl.devX.smY sections
  • list of WARPs in .cudbg.wptbl.devX.smY.ctaZ sections
  • and finally the list of threads in each warp - in .cudbg.lntbl.devX.smY.ctaZ.wpI sections
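
The devX.smY.ctaZ.wpI.lnJ suffixes are simply the coordinates in that hierarchy, so they can be recovered straight from a section name. A small sketch (the helper is mine, not part of any API):

#include <cstdio>
#include <cstring>

// pull dev/sm/cta/wp/ln indices out of names like .cudbg.regs.dev0.sm1.cta2.wp3.ln4
// returns how many coordinates were present; table names like .smtbl/.ctatbl/.wptbl
// also match a key, but fail the numeric scan and so terminate the loop cleanly
static int parse_coords(const char *name, unsigned idx[5])
{
  static const char *keys[5] = { ".dev", ".sm", ".cta", ".wp", ".ln" };
  int n = 0;
  for ( int i = 0; i < 5; i++ ) {
    const char *p = strstr(name, keys[i]);
    if ( !p || 1 != sscanf(p + strlen(keys[i]), "%u", &idx[i]) ) break;
    n++;
  }
  return n;
}

int main()
{
  unsigned idx[5];
  int n = parse_coords(".cudbg.regs.dev0.sm1.cta2.wp3.ln4", idx);
  for ( int i = 0; i < n; i++ ) printf(" %u", idx[i]); // prints: 0 1 2 3 4
  printf("\n");
}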

Each thread has its own set of sections (a lookup sketch follows below):

  • call stack in .cudbg.bt.devX.smY.ctaZ.wpI.lnJ
  • registers in .cudbg.regs.devX.smY.ctaZ.wpI.lnJ
  • predicates in .cudbg.pred.devX.smY.ctaZ.wpI.lnJ
  • local memory in .cudbg.local.devX.smY.ctaZ.wpI.lnJ. Curiously, those sections have the same addresses
At the same time, sections for uniform registers (.cudbg.uregs.devX.smY.ctaZ.wpI) & predicates (.cudbg.upred.devX.smY.ctaZ.wpI) are attached to WARPs
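
Going the other way - from coordinates to data - is just string formatting plus a by-name section lookup; a sketch (get_regs is my helper, and it returns raw section bytes without interpreting the register layout):

#include <elfio/elfio.hpp>
#include <cstdio>

// fetch raw register bytes for one thread; 'core' is a coredump already
// loaded with ELFIO as in the first sketch above
static const char *get_regs(ELFIO::elfio &core, unsigned dev, unsigned sm,
  unsigned cta, unsigned wp, unsigned ln, ELFIO::Elf_Xword &size)
{
  char name[128];
  snprintf(name, sizeof(name), ".cudbg.regs.dev%u.sm%u.cta%u.wp%u.ln%u",
           dev, sm, cta, wp, ln);
  size = 0;
  ELFIO::section *sec = core.sections[name]; // by-name lookup, nullptr if absent
  if ( !sec ) return nullptr;
  size = sec->get_size();
  return sec->get_data();
}

For uniform registers/predicates you would format .cudbg.uregs.devX.smY.ctaZ.wpI the same way, just without the .lnJ suffix, since those belong to the warp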

Where to get the faulty instruction address

This is a really good question. Actually we have 3 sources of addresses:
  1. for drivers with version >= 555 each SM has an errorPC field
  2. each WARP has an errorPC field too
  3. finally, each lane has exception & virtualPC fields in CudbgThreadTableEntry
 
The worst part is that all these addresses are different. It can be explained for WARPs - they can have divergent threads. It seems that cuda-gdb uses virtualPC - see the selection sketch below
It reminds me of the old joke that there was actually only one breed of dinosaurs, and each paleontologist just put the bones together in their own way
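
My reading of that behavior, as a sketch - the struct layouts below are illustrative only (the field names come from the post above, but the *Valid flags and everything else are my assumptions; check the real versioned entries in cudacoredump.h):

#include <cstdint>

// illustrative subsets of the real table entries - NOT the actual layouts
struct SmEntry   { uint64_t errorPC; bool errorPCValid; };    // drivers >= 555 only
struct WarpEntry { uint64_t errorPC; bool errorPCValid; };
struct LaneEntry { uint64_t virtualPC; uint64_t exception; }; // CudbgThreadTableEntry

// prefer the lane's own virtualPC when that lane took an exception (this is
// what cuda-gdb seems to report), then fall back to warp and SM errorPC
static uint64_t pick_faulty_pc(const SmEntry *sm, const WarpEntry *wp,
                               const LaneEntry *ln)
{
  if ( ln && ln->exception ) return ln->virtualPC;
  if ( wp && wp->errorPCValid ) return wp->errorPC;
  if ( sm && sm->errorPCValid ) return sm->errorPC;
  return 0;
}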

 

Installation

For low-level parsing of coredumps I added lots of XS functions to my Elf::Reader module, so you need to build and install it.
Its only dependency is ELFIO.
Note that the structures in the original cudacoredump.h are not suitable for versioning, so I split them by version and glued them together through public inheritance - the sketch below shows the idea
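
Roughly like this (the field sets are made up for illustration; only errorPC and the >= 555 threshold are real):

#include <cstddef>
#include <cstdint>

// version-specific entries glued through public inheritance: code written
// against the 525 layout keeps working, newer fields are only appended
struct CudbgSmTableEntry525 {
  uint32_t smId;          // hypothetical field
};

struct CudbgSmTableEntry555 : public CudbgSmTableEntry525 {
  uint64_t errorPC;       // the field that appeared for drivers >= 555
};

// stride of the .cudbg.smtbl.devX entries then depends on the driver version
static size_t sm_entry_size(unsigned drv_ver)
{
  return drv_ver >= 555 ? sizeof(CudbgSmTableEntry555)
                        : sizeof(CudbgSmTableEntry525);
}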
 
The minimal supported driver version is 525, the max is 575 from CUDA SDK 13.1. I didn't test on older versions - perhaps it doesn't work for them. To find your driver version run

nvidia-smi -q | head

Timestamp                                 : Mon Jan 26 17:12:45 2026
Driver Version                            : 535.183.01
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA Very Expensive Card 

Command line options

Without any options the script just tries to find the faulty instruction address and the corresponding ELF module and section. But you can dump numerous things like
  • backtraces with -b option
  • grids with -g
  • registers/predicates with -r
  • CTAs/WARPs/threads with -t

Because the dump can be huge, you can restrict it to only the WARPs/threads with faulty instructions using the -e option

To set the right driver version use the -D option

Happy debugging!
