windows deep internals: 2026

пятница, 6 марта 2026 г.

SASS latency table: second try

In my first attempt I used latency tables extracted from MD file (located inside nvdisasm) and nothing good came out of it

Obvious reason is that real latency table should be located not in disassembler - it must be inside ptxas. But the problem with that file is that it is really huge - in SDK 13 it has size 40Mb. Sure no symbols included

This is not surprisingly bcs it contains lots of things:

ptxas parser
lots of macros
optimizing compiler with 159 passes and don't use LLVM at all
code generators for several different SMs

Besides it does not have any tracepoints and big part of string are encrypted. So it took lots of time and patience but finally I found and extracted right latency table

And then a lot of discoveries came my way

Дальше »

четверг, 12 февраля 2026 г.

libcudadebugger.so logger

I've done some research of libcudadebugger.so internals - seems that it has exactly the same patterns:

functions table returned by GetCUDADebuggerAPI located in .data section so you can patch any callback address
and each API function has logger

This last fact is strange - while loggers from libcuda.so were used by debugger then who consume logs from debugger itself? Check code to load those loggers:

  lea     rdi, aNvtxInjection6          ; "NVTX_INJECTION64_PATH"
  call    _getenv
  mov     rdi, rax                      ; file
  test    rax, rax
  jz      short loc_14B160
  mov     esi, 1                        ; mode
  call    _dlopen
  mov     r13, rax
  test    rax, rax
  jz      short loc_14B190
  lea     rsi, aInitializeinje_1        ; "InitializeInjectionNvtx2"
  mov     rdi, rax                      ; handle
  call    _dlsym
  test    rax, rax
  jz      short loc_14B1A0
  lea     rdi, sub_14A270
  call    rax

Very straightforward - load shared library from env var NVTX_INJECTION64_PATH and call function InitializeInjectionNvtx2 - part of Cupti API. Btw excellent injection hook

Unfortunately these loggers don't collect parameters of API functions - only their names in packets with fixed size 0x30 bytes:

  lea     rax, aFailedCreatede+7        ; "CreateDebuggerSession"
  mov     [rbp+var_18], rax
  mov     rax, cs:dbg_log
  mov     [rbp+var_20], 0
  mov     dword ptr [rbp+var_40], 300003h
  mov     dword ptr [rbp+var_20], 1
  movaps  [rbp+var_30], xmm0
  test    rax, rax
  jz      loc_1470AC
  lea     rdx, [rbp+var_40]
  mov     r12, rdx
  mov     rdi, rdx
  call    rax

Name of called function located at offset 0x28 and in logs looks like

Дальше »

воскресенье, 8 февраля 2026 г.

building cuda-gdb from sources

For some reason cuda-gdb from cuda sdk gives on my machine list of errors like

Traceback (most recent call last): File "/usr/share/gdb/python/gdb/__init__.py", line 169, in _auto_load_packages __import__(modname) File "/usr/share/gdb/python/gdb/command/explore.py", line 746, in <module> Explorer.init_env() File "/usr/share/gdb/python/gdb/command/explore.py", line 135, in init_env gdb.TYPE_CODE_RVALUE_REF : ReferenceExplorer, AttributeError: 'module' object has no attribute 'TYPE_CODE_RVALUE_REF' so I decided rebuild it with python version installed in system - and this turned out to be a difficult task

The first question is where the source code? Seems that official repository does not contain cuda specific code - so raison d'être of these repo is totally unclear. I extracted from cuda sdk .deb archive cuda-gdb-13.1.68.src.tar.gz and proceed with it

Second - process of configuring is extremely fragile - if you point single wrong option you will know about it only after 30-40 min. Also it seems that you just can't run configure in sub-dirs, bcs in that case linker will claims about tons of missed symbols. So configuration found by trial and error
configure --with-python=/usr/bin/python3 --enable-cuda

And finally we got file gdb/gdb having size 190 Mb. And after running I got stack trace beginning with
arch-utils.c:1374: internal-error: gdbarch: Attempt to register unknown architecture (2)

This all raises some questions for nvidia:

do they testing their cuda sdk before releasing?
do they have QA at all or like microsoft just test their ai shit directly on users?
from which sources was built original cuda-gdb in fact?

Well, at least having some suspicious source code we can fix this build

Дальше »

понедельник, 26 января 2026 г.

print & analyse CUDA coredumps

inconvenient cuda-gdb can't automatically processing them - you need explicitly say something like

target cudacore /full/path/to/coredump

and then type lots of info cuda XXX

So last weekend I wrote tool to parse/dump CUDA coredumps and it even works on machine without CUDA SDK (what might be useful if you collect all crash dumps to some centralized storage with help of CUDA_COREDUMP_PIPE)

But first

Little bit of theory

Format of CUDA coredumps is documented in cudacoredump.h from cuda-gdb.deb

It contains list of devices in .cudbg.devtbl section and 2 groups of data

First is list of contexts and attached to them resources like global memory and list of loaded modules in .cudbg.relfimg.devX.ctxY sections. Those modules are just normal ELF files (some from kernel runtime) and most importantly, they contain the load addresses for each section - this is how we can find module/function of faulty instruction

Second group contains whole thread hierarchy:

list of SMs in .cudbg.smtbl.devX section
list of CTA in .cudbg.ctatbl.devX.smY sections
list of WARPs in .cudbg.wptbl.devX.smY.ctaZ sections
and finally list of threads in each warp - in sections .cudbg.lntbl.devX.smY.ctaZ.wpI

Each thread has own set of sections:

for call stack - .cudbg.bt.devX.smY.ctaZ.wpI.lnJ
registers in .cudbg.regs.devX.smY.ctaZ.wpI.lnJ
predicates in .cudbg.pred.devX.smY.ctaZ.wpI.lnJ
local memory in .cudbg.local.devX.smY.ctaZ.wpI.lnJ. Curious that those sections has the same addresses

At the same time sections for Uniform registers (.cudbg.uregs.devX.smY.ctaZ.wpI) & predicates (.cudbg.upred.devX.smY.ctaZ.wpI) are attached to WARPs

Where get faulty instruction address

This is really good question. Actually we have 3 source of addresses:

for driver with version >= 555 SM has field errorPC
WARP has field errorPC too
finally each lane has fields exception & virtualPC in CudbgThreadTableEntry

Дальше »

понедельник, 19 января 2026 г.

libcuda.so logger

As illustration of ideas from my previous blogpost I made PoC for logging all libcuda.so calls - as the cuda-gdb debugger sees them

It just installs own debug handler and receives all messages. Note:

only x86_64 linux supported, but logic can be easily extended for x86 32bit and highly likely for arm64 too
events generating before each call, so you can't get result of those calls

Current handler is very simple - it just writes to file, but nothing prevents to add storing to DB, ElasticSearch or gRPC/Apache thrift to send them to some remote storage (or even to WireShark in real time)

Format of messages

Currently almost unknown - for public API events have type 6 and function name at offset 0x30 - and this is all for now. Sure subject for further RE

Dependencies

udis86 for x86 disassembler
ELFIO for ELF files parsing

How to build

Patch ELFIO_PATH & UDIS_PATH in Makefile and just run make

Both gcc (12+) and clang 21 are supported

How connect logger to your own application

You just call single function set_logger. Arguments:

full path to libcuda.so. Note that most structures from it gathered with static code analysis and so require some disasm
FILE *fp - where to write log
mask - pointer to array with masks for each event type. Non-zero value means intercept events with this type, 2 - do hexdump of packets
mask_size - size of mask array. libcuda.so from CUDA 13.1 has 31 event types

+ add libdis.so to linker

Also it's not difficult to make classical injection with ancient LD_PRELOAD trick or even inject this logger into already running processes

четверг, 15 января 2026 г.

libcuda.so internals part 2

Previous part

I've noticed that almost all real API functions has the same prologues like:

mov eax, cs:dword_5E14C00 ; unique for each API function mov [rbp+var_D0], 3E7h mov [rbp+var_C0], 0 mov [rbp+var_C8], 0 test eax, eax jz short loc_39603B lea rdi, [rbp+var_C0] call sub_2EE190 ; get data from pthread_getspecific test eax, eax jz loc_396118

loc_396118:

lea rbx, aCustreamupdate_5 ; "cuStreamUpdateCaptureDependencies_ptsz" mov [rbp+var_88], rdx call call_dbg

So I extracted from cudbgApiDetach those dbg_callback and array of debug tracepoints - see method try_dbg_flag. I don't know why debugger needs them -probably this is part of events tracing

When you run your program under cuda-gdb this callback will be set:

api_gate at 0x155554e11940 (155552A2CB50) - /lib/x86_64-linux-gnu/libcudadebugger.so.1

Дальше »

пятница, 6 марта 2026 г.

SASS latency table: second try

четверг, 12 февраля 2026 г.

libcudadebugger.so logger

воскресенье, 8 февраля 2026 г.

building cuda-gdb from sources

понедельник, 26 января 2026 г.

print & analyse CUDA coredumps

Little bit of theory

Where get faulty instruction address

понедельник, 19 января 2026 г.

libcuda.so logger

Format of messages

Dependencies

How to build

How connect logger to your own application

четверг, 15 января 2026 г.

libcuda.so internals part 2

пятница, 6 марта 2026 г.

четверг, 12 февраля 2026 г.

воскресенье, 8 февраля 2026 г.

понедельник, 26 января 2026 г.

понедельник, 19 января 2026 г.

четверг, 15 января 2026 г.