Sunday, February 8, 2026

building cuda-gdb from sources

For some reason cuda-gdb from the CUDA SDK produces a list of errors like this on my machine:

Traceback (most recent call last):
  File "/usr/share/gdb/python/gdb/__init__.py", line 169, in _auto_load_packages
    __import__(modname)
  File "/usr/share/gdb/python/gdb/command/explore.py", line 746, in <module>
    Explorer.init_env()
  File "/usr/share/gdb/python/gdb/command/explore.py", line 135, in init_env
    gdb.TYPE_CODE_RVALUE_REF : ReferenceExplorer,
AttributeError: 'module' object has no attribute 'TYPE_CODE_RVALUE_REF'

so I decided to rebuild it against the Python version installed in the system - and this turned out to be a difficult task

The first question is: where is the source code? It seems that the official repository does not contain any CUDA-specific code - so the raison d'être of that repo is totally unclear. I extracted cuda-gdb-13.1.68.src.tar.gz from the CUDA SDK .deb archive and proceeded with it

Second - the configuration process is extremely fragile - if you pass a single wrong option you will find out about it only after 30-40 minutes. Also it seems that you just can't run configure in sub-dirs, because in that case the linker will complain about tons of missing symbols. So the working configuration was found by trial and error:
configure --with-python=/usr/bin/python3 --enable-cuda

And finally we get the file gdb/gdb, 190 Mb in size. After running it I got a stack trace beginning with
arch-utils.c:1374: internal-error: gdbarch: Attempt to register unknown architecture (2)

All of this raises some questions for nvidia:

  • do they test their CUDA SDK before releasing it?
  • do they have QA at all, or do they, like microsoft, just test their ai shit directly on users?
  • which sources was the original cuda-gdb actually built from?

Well, at least now that we have some suspicious source code, we can try to fix this build

Monday, January 26, 2026

print & analyse CUDA coredumps

Inconveniently, cuda-gdb can't process them automatically - you need to explicitly say something like
target cudacore /full/path/to/coredump

and then type lots of info cuda XXX commands

So last weekend I wrote a tool to parse/dump CUDA coredumps, and it even works on a machine without the CUDA SDK (which might be useful if you collect all crash dumps in some centralized storage with the help of CUDA_COREDUMP_PIPE)

But first

A little bit of theory

The format of CUDA coredumps is documented in cudacoredump.h from cuda-gdb.deb.
A coredump contains a list of devices in the .cudbg.devtbl section and 2 groups of data
 
The first group is the list of contexts and the resources attached to them, like global memory and the list of loaded modules in .cudbg.relfimg.devX.ctxY sections. Those modules are just normal ELF files (some belong to the kernel runtime) and, most importantly, they contain the load addresses for each section - this is how we can find the module/function of the faulty instruction

The second group contains the whole thread hierarchy:

  • list of SMs in the .cudbg.smtbl.devX section
  • list of CTAs in .cudbg.ctatbl.devX.smY sections
  • list of warps in .cudbg.wptbl.devX.smY.ctaZ sections
  • and finally the list of threads in each warp - in .cudbg.lntbl.devX.smY.ctaZ.wpI sections

Each thread has its own set of sections:

  • for call stack - .cudbg.bt.devX.smY.ctaZ.wpI.lnJ
  • registers in .cudbg.regs.devX.smY.ctaZ.wpI.lnJ
  • predicates in .cudbg.pred.devX.smY.ctaZ.wpI.lnJ
  • local memory in .cudbg.local.devX.smY.ctaZ.wpI.lnJ. Curiously, those sections all share the same addresses
At the same time, the sections for uniform registers (.cudbg.uregs.devX.smY.ctaZ.wpI) & uniform predicates (.cudbg.upred.devX.smY.ctaZ.wpI) are attached to warps
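
To make the layout above concrete, here is a minimal sketch (using ELFIO) that just enumerates the .cudbg.* sections of a coredump by name; decoding the Cudbg*TableEntry structures from cudacoredump.h is left out:

#include <elfio/elfio.hpp>
#include <cstdio>
#include <string>

// Minimal sketch: list the coredump-specific sections of a CUDA coredump by name.
// Real parsing would also decode the table entries described in cudacoredump.h.
int main(int argc, char** argv)
{
    ELFIO::elfio core;
    if (argc < 2 || !core.load(argv[1]))
        return 1;
    for (int i = 0; i < core.sections.size(); ++i) {
        ELFIO::section* sec = core.sections[i];
        std::string name = sec->get_name();
        if (name.rfind(".cudbg.", 0) != 0)
            continue;                      // skip ordinary ELF sections
        std::printf("%-50s size 0x%llx\n", name.c_str(),
                    (unsigned long long)sec->get_size());
    }
    return 0;
}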

Where to get the faulty instruction address

This is a really good question. Actually we have 3 sources of addresses (a possible selection order is sketched after the list):
  1. for drivers with version >= 555 the SM has a field errorPC
  2. the warp has a field errorPC too
  3. finally, each lane has the fields exception & virtualPC in CudbgThreadTableEntry
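
Which of the three to trust is a separate question; a plausible fallback order (my guess, nothing in the format mandates it) is lane, then warp, then SM:

#include <cstdint>
#include <cstdio>
#include <optional>

// Simplified stand-ins for the table entries from cudacoredump.h;
// only the field names come from the format, the types here are guesses.
struct Lane { uint64_t exception; uint64_t virtualPC; };
struct Warp { std::optional<uint64_t> errorPC; };
struct Sm   { std::optional<uint64_t> errorPC; };

// Prefer the per-lane virtualPC when that lane raised an exception,
// then the warp errorPC, then the SM errorPC (drivers >= 555 only).
std::optional<uint64_t> fault_pc(const Lane& ln, const Warp& wp, const Sm& sm)
{
    if (ln.exception) return ln.virtualPC;
    if (wp.errorPC)   return wp.errorPC;
    return sm.errorPC;
}

int main()
{
    Lane ln{1, 0x170};        // lane with an exception at virtualPC 0x170
    Warp wp{0x160};
    Sm   sm{std::nullopt};
    if (auto pc = fault_pc(ln, wp, sm))
        std::printf("fault pc: 0x%llx\n", (unsigned long long)*pc);
    return 0;
}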

Monday, January 19, 2026

libcuda.so logger

As an illustration of the ideas from my previous blogpost I made a PoC for logging all libcuda.so calls - as the cuda-gdb debugger sees them

It just installs its own debug handler and receives all messages. Note:

  1. only x86_64 linux is supported, but the logic can easily be extended to 32-bit x86 and most likely to arm64 too
  2. events are generated before each call, so you can't get the results of those calls
The current handler is very simple - it just writes to a file, but nothing prevents you from adding storage to a DB or ElasticSearch, or using gRPC/Apache Thrift to send events to some remote storage (or even to Wireshark in real time)

Format of messages

Currently almost unknown - for the public API, events have type 6 and the function name at offset 0x30 - and this is all for now. Surely a subject for further RE
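
So the only thing a handler can reliably do right now is pull the name out of type-6 packets. The sketch below hard-codes just that one observation; how the type itself is delivered is not covered here, and the string termination is an assumption:

#include <cstddef>
#include <cstdint>

// Based purely on the observation above: for events of type 6 the API name
// appears to sit at offset 0x30 of the packet. Everything else about the
// layout is unknown, so treat this as a guess.
const char* guess_api_name(uint32_t event_type, const uint8_t* pkt, size_t len)
{
    if (event_type != 6 || len <= 0x30)
        return nullptr;                               // not a public API event or too short
    return reinterpret_cast<const char*>(pkt + 0x30); // assumed NUL-terminated name
}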

Dependencies

How to build

Patch ELFIO_PATH & UDIS_PATH in the Makefile and just run make.
Both gcc (12+) and clang 21 are supported

How to connect the logger to your own application

You just call a single function, set_logger. Its arguments:

  • full path to libcuda.so. Note that most structures from it were gathered with static code analysis and so require some disasm
  • FILE *fp - where to write the log
  • mask - pointer to an array with masks for each event type. A non-zero value means intercept events of this type; 2 means also hexdump the packets
  • mask_size - size of the mask array. libcuda.so from CUDA 13.1 has 31 event types

+ add libdis.so to the linker flags
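
A minimal usage sketch; the prototype below (return type, mask element type, indexing of the mask by event type) is guessed from the argument list above, so check it against the actual header, and adjust the libcuda.so path for your system:

#include <cstdio>

// Guessed prototype reconstructed from the argument list above -
// see the real header of the logger for the exact declaration.
extern "C" int set_logger(const char* libcuda_path, FILE* fp,
                          const int* mask, int mask_size);

int main()
{
    // CUDA 13.1 reportedly has 31 event types: intercept them all,
    // and additionally hexdump packets of type 6 (public API calls)
    static int mask[31];
    for (int i = 0; i < 31; ++i)
        mask[i] = 1;
    mask[6] = 2;

    FILE* fp = std::fopen("cuda_calls.log", "w");
    if (!fp)
        return 1;
    // example path - adjust for your system
    return set_logger("/usr/lib/x86_64-linux-gnu/libcuda.so.1", fp, mask, 31);
}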

Also it's not difficult to do classical injection with the ancient LD_PRELOAD trick, or even to inject this logger into already running processes

Thursday, January 15, 2026

libcuda.so internals part 2

Previous part

I've noticed that almost all real API functions have the same prologue, like:

    mov     eax, cs:dword_5E14C00 ; unique for each API function
    mov     [rbp+var_D0], 3E7h
    mov     [rbp+var_C0], 0
    mov     [rbp+var_C8], 0
    test    eax, eax
    jz      short loc_39603B
    lea     rdi, [rbp+var_C0]
    call    sub_2EE190 ; get data from pthread_getspecific
    test    eax, eax
    jz      loc_396118

 loc_396118:

    lea     rbx, aCustreamupdate_5  ; "cuStreamUpdateCaptureDependencies_ptsz"
    mov     [rbp+var_88], rdx
    call    call_dbg
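
In rough C terms the prologue boils down to the following; all identifiers except the API name string are stand-ins for the addresses seen in the disassembly, not real exported symbols:

#include <cstdio>

// Stand-ins for what the disassembly shows; real libcuda.so keeps these internal.
static int  dword_5E14C00 = 1;                                     // per-API debug flag
static int  sub_2EE190(void* tls_buf) { (void)tls_buf; return 0; } // pthread_getspecific wrapper
static void call_dbg(const char* api)  { std::printf("dbg: %s\n", api); }

// Roughly what the prologue of each API entry does.
static void api_prologue()
{
    unsigned char tls_buf[0xC0] = {};             // the rbp+var_C0 area
    if (dword_5E14C00 != 0 &&                     // flag unique to this API function
        sub_2EE190(tls_buf) == 0)                 // TLS state fetched successfully
        call_dbg("cuStreamUpdateCaptureDependencies_ptsz");
}

int main() { api_prologue(); return 0; }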

So I extracted from cudbgApiDetach that dbg_callback and the array of debug tracepoints - see the method try_dbg_flag. I don't know why the debugger needs them - probably this is part of event tracing

When you run your program under cuda-gdb, this callback will be set:

api_gate at 0x155554e11940 (155552A2CB50) - /lib/x86_64-linux-gnu/libcudadebugger.so.1

Wednesday, December 24, 2025

libcuda.so internals

The first question that comes to mind when looking at them is "why are they so huge?". For example, libcuda.so from CUDA 10.1 has a size of 28Mb, while the one from 13.1 is already 96Mb. So I rejected the idea that they are just yet more victims of vibe-coding and did some preliminary RE. The answer is - because their .rodata section contains lots of CUBIN files for

kernel run-time

I extracted them (archive from 13.1) and checked the SASS. Now I am almost sure that nvidia has some internal SASS assembler - they use the LEPC instruction (to load the address of the current instruction), which you just can't get from the official ptxas:
   /*0160*/  LEPC R20 ; R20 now holds 170
   /*0170*/  IADD3 R20, P0, R20, 0x50, RZ 1 ; and if P0 R20 += 0x50
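
For reference, a naive way to locate those embedded CUBINs is simply to scan .rodata for the ELF magic - a sketch of the idea (using ELFIO), not my actual extraction tool:

#include <elfio/elfio.hpp>
#include <cstring>
#include <cstdio>

int main(int argc, char** argv)
{
    ELFIO::elfio so;
    if (argc < 2 || !so.load(argv[1]))           // e.g. path to libcuda.so.1
        return 1;
    for (int i = 0; i < so.sections.size(); ++i) {
        ELFIO::section* sec = so.sections[i];
        if (sec->get_name() != ".rodata")
            continue;
        const char* data = sec->get_data();
        ELFIO::Elf_Xword size = sec->get_size();
        // every embedded CUBIN is itself an ELF, so just look for the ELF magic
        for (ELFIO::Elf_Xword off = 0; off + 4 <= size; ++off)
            if (!std::memcmp(data + off, "\x7f" "ELF", 4))
                std::printf("possible cubin at .rodata+0x%llx\n",
                            (unsigned long long)off);
    }
    return 0;
}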

 
What do those CUBIN files contain?
  • syscalls like __cuda_syscall_cp_async_bulk_tensor_XX, __cuda_syscall_tex_grad_XX etc
  • implementation of functions like cudaGraphLaunch/vprintf
  • functions cnpXXX like cnpDeviceGetAttribute
  • logic for kernel enqueue
  • some support for profiling like scProfileBuffers
  • trap handlers 
and so on. In essence these are backstage workers - like the good old BIOS

 

API callbacks

Friday, November 28, 2025

bug in sass MD

Spent a couple of days debugging a rare bug in my SASS disasm. I tested it on thousands of .cubin files and got bad instruction decoding for one of them. Btw, I've never seen papers about testing of disassemblers - compilers like gcc/clang have huge test suites to detect regressions, so probably I should do the same. The problem is that I periodically add new features, and the smxx.so files are regenerated every time

My nvd has the option -N to dump unrecognized opcodes, so for sm55 I got

Not found at E8 0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000

nvdisasm v11 swears that this pile of 0s & 1s must somehow be an ISCADD instruction. Ok, let's run ead.pl and check if it can find it:
perl ead.pl -BFvamrzN 0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000 ../data/sm55_1.txt

found 4
........................0.0111000..11...................................................
0000-0-------111111-----0101110001011-------000--00000000000----------------------------
0000000-----------------0101110000111000-00000---00000000000----------------------------
00000--------111111-----01011100000110---000-----00000000000----------------------------
000000-------111111-----0001110---------------------------------------------------------
matched: 0

The first thought was that the MD are just too old because they were extracted from CUDA 10, so I made a decryptor for CUDA 11 (paranoid nvidia removed instruction properties since version 12, so 11 is the last source of MD), extracted the data, rebuilt sm55.cc and sm55.so, and ran the test again

The bug has not disappeared

Tuesday, November 25, 2025

SASS latency table & instructions reordering

In these difficult times, no one wants to report bad or simply weak results (and this will destroy this hypocritical civilization). Since this is my personal blog and I am not looking for grants, I don't care.

Let's dissect one truly inspiring paper - they employed reinforcement learning and claim that

transparently producing 2% to 26% speedup

Wow, 26% is a really excellent result. So I decided to implement the proposed technique, but first I need a source of latency values for each SASS instruction. I extracted the files with latency tables from nvdisasm - their names have _2.txt suffixes

Then I made a perl binding for my perl version of Ced (see the methods of the object Cubin::Ced::LatIndex), added a new pass (the -l option of dg.pl) and did some experiments to dump latency values for each instruction. Because the connection order is unknown, I implemented all 3 variants (sketched after the list):

  1. current column with current row
  2. column from previous instruction with current row
  3. current column with row from previous instruction
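
Schematically (all names here are made up; in reality the indices come from the extracted _2.txt tables) the three variants look like this:

#include <vector>

// Schematic only: 'row' and 'col' are the per-instruction indices into one
// latency table; none of these names exist in the real metadata.
struct LatIdx { unsigned row; unsigned col; };

int variant1(const std::vector<std::vector<int>>& tbl, LatIdx cur, LatIdx)
{ return tbl[cur.row][cur.col]; }        // 1. current row with current column

int variant2(const std::vector<std::vector<int>>& tbl, LatIdx cur, LatIdx prev)
{ return tbl[cur.row][prev.col]; }       // 2. column from the previous instruction

int variant3(const std::vector<std::vector<int>>& tbl, LatIdx cur, LatIdx prev)
{ return tbl[prev.row][cur.col]; }       // 3. row from the previous instruction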

The results are discouraging

  • some instructions (~1.5% for the best case, variant 1) do not have a latency at all (for example S2R or XXXBAR)
  • some instructions have more than 1 index into the same table - well, I fixed this by selecting the max value (see the function intersect_lat)
  • when comparing with the actual stall counts, the percentage of incorrect values is above 60 - that's even worse than just flipping a coin

Some possible reasons for failure: