windows deep internals
do you still believe what is written in Cyrillic?
Thursday, March 26, 2026
dwarf from nvcc
I've added some support for DWARF debug info from nvidia nvcc to my dwarfdump. As everyone knows, DWARF is over-complicated, bloated and just disgusting; however, nvidia was able to take this nausea to a new level
Wednesday, March 18, 2026
read a couple of books about compilers
LLVM Compiler for RISC-V Architecture
- there is no introduction to LLVM IR or the RISC-V-specific IR, so the long IR listings are very hard to follow
- the author doesn't give links to the source code implementing some of the algorithms. Fortunately, elixir indexes the whole LLVM source tree
Dive into Deep Learning Compiler
As far as I know, this is the only book describing AI/ML compilers so far. Also, TVM looks very promising: unlike monsters like XLA/IREE it is compact and comprehensible to mere mortals
Drawbacks:
- the book is not complete: the last two chapters, about NN & deployment, are just placeholders
- it's unclear why for matrix multiplication on CUDA they didn't take cuBLAS as the baseline
- and OpenBLAS for the CPU version
Despite this, and considering that the book is freely downloadable, my rating is 4 out of 5
Friday, March 6, 2026
SASS latency table: second try
In my first attempt I used latency tables extracted from the MD file (located inside nvdisasm), and nothing good came out of it
The obvious reason is that the real latency table should be located not in the disassembler - it must be inside ptxas. The problem with that file is that it is really huge: in SDK 13 it is 40Mb in size. Sure, no symbols included
This is not surprising, because it contains lots of things:
- ptxas parser
- lots of macros
- an optimizing compiler with 159 passes that doesn't use LLVM at all
- code generators for several different SMs
Besides, it doesn't have any tracepoints and a big part of the strings is encrypted. So it took lots of time and patience, but finally I found and extracted the right latency table
And then a lot of discoveries came my way
Thursday, February 12, 2026
libcudadebugger.so logger
I've done some research on libcudadebugger.so internals - it seems to follow exactly the same patterns:
- the function table returned by GetCUDADebuggerAPI is located in the .data section, so you can patch any callback address
- each API function has a logger
The second fact is strange: while the loggers from libcuda.so were consumed by the debugger, who then consumes the logs from the debugger itself? Check the code that loads those loggers:
lea rdi, aNvtxInjection6 ; "NVTX_INJECTION64_PATH"
call _getenv
mov rdi, rax ; file
test rax, rax
jz short loc_14B160
mov esi, 1 ; mode
call _dlopen
mov r13, rax
test rax, rax
jz short loc_14B190
lea rsi, aInitializeinje_1 ; "InitializeInjectionNvtx2"
mov rdi, rax ; handle
call _dlsym
test rax, rax
jz short loc_14B1A0
lea rdi, sub_14A270
call rax
lea rax, aFailedCreatede+7 ; "CreateDebuggerSession"
mov [rbp+var_18], rax
mov rax, cs:dbg_log
mov [rbp+var_20], 0
mov dword ptr [rbp+var_40], 300003h
mov dword ptr [rbp+var_20], 1
movaps [rbp+var_30], xmm0
test rax, rax
jz loc_1470AC
lea rdx, [rbp+var_40]
mov r12, rdx
mov rdi, rdx
call rax
Sunday, February 8, 2026
building cuda-gdb from sources
For some reason cuda-gdb from the CUDA SDK produces on my machine a list of errors like
Traceback (most recent call last):
  File "/usr/share/gdb/python/gdb/__init__.py", line 169, in _auto_load_packages
    __import__(modname)
  File "/usr/share/gdb/python/gdb/command/explore.py", line 746, in <module>
    Explorer.init_env()
  File "/usr/share/gdb/python/gdb/command/explore.py", line 135, in init_env
    gdb.TYPE_CODE_RVALUE_REF : ReferenceExplorer,
AttributeError: 'module' object has no attribute 'TYPE_CODE_RVALUE_REF'
so I decided to rebuild it with the python version installed in the system, and this turned out to be a difficult task
The first question is: where is the source code? It seems that the official repository does not contain any CUDA-specific code, so the raison d'être of that repo is totally unclear. I extracted cuda-gdb-13.1.68.src.tar.gz from the CUDA SDK .deb archive and proceeded with it
Second, the configuration process is extremely fragile: if you pass a single wrong option, you will find out only after 30-40 minutes. It also seems that you just can't run configure in sub-dirs, because in that case the linker will complain about tons of missing symbols. So, the configuration found by trial and error: configure --with-python=/usr/bin/python3 --enable-cuda
And finally we get a gdb/gdb file 190Mb in size. And after running it I got a stack trace beginning with arch-utils.c:1374: internal-error: gdbarch: Attempt to register unknown architecture (2)
This all raises some questions for nvidia:
Monday, January 26, 2026
print & analyse CUDA coredumps
target cudacore /full/path/to/coredump and then type lots of info cuda XXX commands
So last weekend I wrote a tool to parse/dump CUDA coredumps, and it even works on a machine without the CUDA SDK (which might be useful if you collect all crash dumps to some centralized storage with the help of CUDA_COREDUMP_PIPE)
But first
Little bit of theory
The second group contains the whole thread hierarchy:
- list of SMs in the .cudbg.smtbl.devX section
- list of CTAs in .cudbg.ctatbl.devX.smY sections
- list of warps in .cudbg.wptbl.devX.smY.ctaZ sections
- and finally, the list of threads in each warp, in sections .cudbg.lntbl.devX.smY.ctaZ.wpI
Each thread has its own set of sections:
- call stack in .cudbg.bt.devX.smY.ctaZ.wpI.lnJ
- registers in .cudbg.regs.devX.smY.ctaZ.wpI.lnJ
- predicates in .cudbg.pred.devX.smY.ctaZ.wpI.lnJ
- local memory in .cudbg.local.devX.smY.ctaZ.wpI.lnJ. Curiously, those sections have the same addresses
Where to get the faulty instruction address
- for drivers with version >= 555 the SM has the field errorPC
- the warp has a field errorPC too
- finally, each lane has the fields exception & virtualPC in CudbgThreadTableEntry
Monday, January 19, 2026
libcuda.so logger
As an illustration of the ideas from my previous blogpost, I made a PoC for logging all libcuda.so calls, exactly as the cuda-gdb debugger sees them
It just installs its own debug handler and receives all messages. Note:
- only x86_64 linux is supported, but the logic can easily be extended to 32-bit x86 and very likely to arm64 too
- events are generated before each call, so you can't get the results of those calls
Format of messages
Dependencies
How to build
How to connect the logger to your own application
You just call a single function, set_logger. Arguments:
- full path to libcuda.so. Note that most structures from it were gathered with static code analysis and so require some disasm
- FILE *fp - where to write the log
- mask - pointer to an array of masks, one per event type. A non-zero value means intercept events of this type; 2 also does a hexdump of the packets
- mask_size - size of the mask array. libcuda.so from CUDA 13.1 has 31 event types
+ add libdis.so to the linker
Also, it's not difficult to do a classical injection with the ancient LD_PRELOAD trick, or even to inject this logger into already running processes