Tuesday, June 2, 2026

RE of PTX grammar from ptxas, part 3

Parts 1 & 2

Pseudo instructions

Surprise, surprise - some PTX instructions are not mapped 1:1 to the underlying SASS. Instead they expand into lots of other PTX code. I have already extracted their decrypted bodies, so it's time to describe how they are connected to specific PTX pseudo instructions
 
Somewhere deep inside ptxas there is a function which registers lots of handlers that dump real PTX for pseudo instructions. The code registering a single item looks like
  mov     rdi, [rbx+250h] ; dictionary of pseudo-instructions
  lea     rdx, emit_multimem_ld_reduce ; handler
  lea     rsi, aMultimemLdRedu         ; "multimem.ld_reduce" - pseudo instruction name
  call    reg_sm_cb

There are 587 such handlers - although 473 of them have strange names like "1030557441". I don't know what they mean - highly likely this is the product of yet another encryption somewhere inside the parser - at least each such string has exactly 1 reference
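To make the registration pattern concrete, here is a minimal Python model of it. This is only a sketch: the dictionary, the `split_names` helper and the dummy emitters are hypothetical, and only the name `reg_sm_cb` and the two sample keys are taken from the listing above.

```python
# Hypothetical model of the name -> emitter dictionary filled by reg_sm_cb.
from typing import Callable, Dict, List, Tuple

Handler = Callable[[], str]


def reg_sm_cb(table: Dict[str, Handler], name: str, handler: Handler) -> None:
    """Bind a pseudo-instruction name to the function that emits its PTX."""
    table[name] = handler


def split_names(table: Dict[str, Handler]) -> Tuple[List[str], List[str]]:
    """Separate readable names from the purely numeric ones."""
    numeric = sorted(n for n in table if n.isdigit())
    readable = sorted(n for n in table if not n.isdigit())
    return readable, numeric


pseudo: Dict[str, Handler] = {}
reg_sm_cb(pseudo, "multimem.ld_reduce", lambda: "/* expanded PTX goes here */")
reg_sm_cb(pseudo, "1030557441", lambda: "/* expanded PTX goes here */")

readable, numeric = split_names(pseudo)
print(len(readable), "readable,", len(numeric), "numeric")
```

Run over all extracted registrations, the same split would reproduce the 587 / 473 counts quoted above.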
Let's look inside one of the handlers
  call    get_pool
  mov     rdi, [rax+18h]
  mov     esi, 0C350h ; 50000 bytes - they don't pinch pennies
  call    alloc_buf
  test    rax, rax
  mov     r12, rax ; r12 holds address of string buffer
  jz      loc_5626FC1E9D78 ; die in alloc_failed
loc_5626FC1E9733:            ; CODE XREF: emit_multimem_ld_reduce+67D↓j
  lea     rdx, [r13+1A5E95h] ; whut ?
  lea     rsi, aS_11         ; "%s"
  mov     rdi, r12           ; s
  xor     eax, eax
  call    _sprintf ; note: not even snprintf - security above all!
  lea     rdx, [r13+1A5E98h] ; whut again ?
  movsxd  rdi, eax ; store in rdi length of written string
  lea     rsi, aS_11         ; "%s"
  mov     rbx, rdi
  xor     eax, eax
  add     rdi, r12           ; s
  call    _sprintf 
 
The debugger showed that R13 holds the address of that decrypted string pool in memory.
Just assess the level of paranoia - there is a huge encrypted blob of strings, 1.8 MB in size. Then they wrote 587 functions where each string from that blob can be referenced only by its offset - 21042 unique offsets! Nvidia definitely didn't want us to see its dirty secrets.
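Since every string is passed to sprintf through "%s", each offset must point at a NUL-terminated string inside the pool. A minimal sketch of resolving one offset follows; the pool contents here are invented for illustration, and only the two offsets come from the listing above.

```python
def cstr_at(pool, offset: int) -> str:
    """Return the NUL-terminated string that starts at `offset` in `pool`."""
    end = pool.find(b"\0", offset)
    if end < 0:
        raise ValueError(f"no terminator after offset {offset:#x}")
    return pool[offset:end].decode("ascii", errors="replace")


# Fake pool fragment standing in for the bytes at 0x1A5E95 of the real blob.
BASE = 0x1A5E95
fragment = b"{\n\0.reg .b32 %tmp;\n\0"

first = cstr_at(fragment, 0x1A5E95 - BASE)
second = cstr_at(fragment, 0x1A5E98 - BASE)
print(repr(first), repr(second))
```

Note that the two offsets in the handler are only 3 bytes apart, so the first string can hold at most 2 characters plus the terminator.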
 
So I wrote some code to extract all emitters, then all string offsets - see result. Now it would be good to link the offsets from each emitter with the real strings, right?
 
Nothing could be simpler - yet another Perl XS module to load a memory-mapped file plus a small Perl script - and finally we can see this

Lexer brute-force

Thursday, May 28, 2026

RE of PTX grammar from ptxas, part 2

instructions that cicc cannot generate

The idea occurred to me that we could also subtract the set of PTX instructions found in cicc from the full set and so get the instructions which cicc is simply unable to produce. So I added a new option -U to iptx.pl and got the file ptx_not_in_cicc.txt with 114 unique names
Btw PTX in total has only 268 unique names - so 114 is 42.5%. The most remarkable of the missing instructions:
  • cctl for cache control
  • lop3 - yeah, I saw it many times in SASS, so it must be generated later by ptxas during its optimization passes
  • r2p
  • and all video instructions
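The subtraction itself is a plain set difference. A minimal sketch follows; the two short name lists are toy data, not the real contents of ptxas or cicc, and `not_in_cicc` is a hypothetical helper rather than the -U option of iptx.pl.

```python
def not_in_cicc(ptxas_names, cicc_names):
    """Names known to ptxas but never produced by cicc, plus their share in %."""
    all_names = set(ptxas_names)
    missing = sorted(all_names - set(cicc_names))
    share = 100.0 * len(missing) / len(all_names)
    return missing, share


# Toy data only.
missing, share = not_in_cicc(
    ["add", "cctl", "lop3", "r2p", "shfl"],
    ["add", "shfl"],
)
print(missing, share)

# The ratio quoted in the text: 114 names out of 268.
print(round(100.0 * 114 / 268, 1))
```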
 
And this led me to the conclusion that the official MLIR for cuda is totally incomplete

MLIR was a very dubious idea from the start IMHO - what if we have some unscrupulous HW vendor who prefers to hide many details of its hardware? And even worse - usually you use several MLIR dialects (like gpu, nvgpu, nvvm, linalg etc), so at least one of them must be aware of all the others. And this leads to an exponential explosion of complexity - you can expect items from each of the used dialects while doing optimization


some instructions are totally undocumented

Sunday, May 24, 2026

RE of PTX grammar from ptxas

Disclaimer

Highly likely the author is an illiterate, inattentive, and incompetent lazy person with a poor imagination - therefore his hypotheses may be questionable, his ideas delusional and his analysis simply incorrect. Also maybe I still haven't mastered ida pro in 28 years, so the extracted data can be incomplete or have missing parts. As always all code is in perl and therefore offends the aesthetic feelings of believers

 

Prior works

  • Official PTX ISA. We all know that nvidia is evil and paranoid, so this document is also incomplete and maliciously conceals information. Proofs are somewhere below in this text
  • ANTLR ptx grammar - very outdated, based on cuda-waste parser from 2010
  • infamous zluda. It's enough to look at their AST to understand that they support at best a third of the instructions
  • nvopen-tools by Grigory Evko. AI generated slop, but at least we can borrow the instruction format and the decoding scheme for arguments from chapter 7

So as you can see there is no machine readable grammar for modern PTX. Why is this important at all? Well, according to the "Official guide to inline PTX"

The compiler front end does not parse the asm() statement template string and does not know what it means or even whether it is valid PTX input

Therefore you can successfully compile your buggy code to PTX and then suddenly get mysterious errors during dynamic loading via JIT. Plus I always suspected that nvidia hides as much information from us as possible
 
So I started with some disassembly of ptxas version V10.1.243 from sdk 13.1 looking for PTX instruction names (encrypted btw)

 

Data extracting

Instruction attributes are filled dynamically in two places
  • in a huge function at 0xC2341C - extracted data
  • in an array of functions located at 0x2971260 - data merged with the previous chunk
Please don't ask me why there are 2 separate places. More importantly, the code in both looks uniform

Tuesday, April 14, 2026

SASS latency analysis

After extracting the latency table I became curious how good the code produced by ptxas is. Projects like CuAsmRL never estimated the upper bound of the gain from rescheduling - it's strange and looks even worse than the famous "proof left as an exercise to the reader" - what if ptxas generates perfect code and there is just no room for instruction reordering?

So I wrote a perl script to measure redundant stalls and want to present it together with the obtained results

The first thing was to convert the latency table from plain text to some code. As you can see the format is straightforward, but some instructions have special cases like

I2F
3
I2F (not F64)
13

so I made yet another perl script to generate the latency table for C++ plus a bunch of enums for the special cases - which were then implemented manually in the method NV_renderer::calc_latency. The code is horrible and incomplete - I am not the smartest person in the world, so I was just unable to find the appropriate conditions for some cases in the MD files. Also note that this code is the result of reverse engineering, so it is unknown how correct it is
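As an illustration of the conversion step, here is a small Python sketch that reads such a table into a dictionary. It assumes the layout exactly as the excerpt appears here (a name line followed by a value line, with an optional condition in parentheses); the real file may be laid out differently, and `parse_latency` is a hypothetical helper, not the author's generator.

```python
import re

# "I2F" or "I2F (not F64)": opcode plus optional condition in parentheses.
ENTRY = re.compile(r"^(?P<op>\S+)(?:\s+\((?P<cond>[^)]*)\))?$")


def parse_latency(text: str) -> dict:
    """Map opcode -> {condition: latency}; the default case uses the key ''."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2:
        raise ValueError("expected name/value pairs")
    table: dict = {}
    for name, value in zip(lines[0::2], lines[1::2]):
        m = ENTRY.match(name)
        if m is None:
            raise ValueError(f"cannot parse entry {name!r}")
        table.setdefault(m["op"], {})[m["cond"] or ""] = int(value)
    return table


sample = "I2F\n3\nI2F (not F64)\n13\n"
latency = parse_latency(sample)
print(latency)
```

Keeping the conditions as plain strings mirrors the "bunch of enums" idea: each distinct condition still needs hand-written logic to decide when it applies.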

Anyway, having a latency value for each instruction is better than nothing, so the next step was to add a new method ins_lat to the perl XS module for SASS disasm

Finally we can try to analyze the latency of SASS instructions

Algorithm

Having the stall count and the latency of a single instruction, it's easy to compare them - if the stall count is bigger, we have redundant latency. But some instructions must wait on a read/write barrier - then their latency is variable and should be ignored - see function traverse_lat in dg.pl
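The simple case boils down to one comparison. A minimal sketch, not the logic of traverse_lat itself: the function name and the sample numbers are made up, and barrier-waiting instructions are just skipped as described above.

```python
def redundant_stall(stall: int, latency: int, waits_on_barrier: bool) -> int:
    """Cycles by which the stall count exceeds the fixed latency, else 0."""
    if waits_on_barrier:
        # Variable latency: the stall count tells us nothing here.
        return 0
    return max(0, stall - latency)


# (stall count, latency, waits on barrier) - invented values.
samples = [(6, 4, False), (2, 4, False), (15, 4, True)]
excess = [redundant_stall(*s) for s in samples]
print(excess)
```

The second sample is the case discussed next: the stall count is smaller than the latency, so a single instruction is not enough to judge.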

But what if the stall count (stored in a 4-bit field) is less than the latency (which can be up to 48 cycles)? Clearly then we must sum the stall counts of several instructions - but how many of them?

I couldn't think of anything smarter than finding the first instruction that uses a register or predicate that is changed by the current instruction. Highly likely it already has some official name in graph theory, but being illiterate I named it Joint. In fact it is the exact opposite of an SSA dominator. So we need register/predicate tracking logic - see the logic for Joint detection in function track2lat
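A minimal sketch of the Joint idea on straight-line code follows. It is not a port of track2lat: the `Ins` record, the register sets and the numbers are invented, branches are ignored, and it assumes that the cycles between issuing an instruction and issuing its Joint equal the sum of the stall counts in between.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ins:
    name: str
    stall: int                      # stall count of this instruction
    reads: frozenset = frozenset()  # registers/predicates read
    writes: frozenset = frozenset() # registers/predicates written


def find_joint(code: List[Ins], i: int) -> Optional[int]:
    """Index of the first later instruction reading something code[i] writes."""
    written = code[i].writes
    for j in range(i + 1, len(code)):
        if code[j].reads & written:
            return j
    return None


def stalls_until(code: List[Ins], i: int, j: int) -> int:
    """Sum of stall counts from instruction i up to, not including, j."""
    return sum(ins.stall for ins in code[i:j])


code = [
    Ins("IMAD", 4, frozenset({"R2", "R3"}), frozenset({"R0"})),
    Ins("MOV", 1, frozenset({"R5"}), frozenset({"R4"})),
    Ins("IADD3", 2, frozenset({"R0", "R4"}), frozenset({"R6"})),
]
joint = find_joint(code, 0)
print(joint, stalls_until(code, 0, joint))
```

If the summed stalls exceed the latency of the producer, the difference is the slack that could in principle be removed somewhere along the path.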

So for such long-latency instructions we must use totally different logic - check whether their latency fits into the path from the original instruction to its Joint. But there is another problem - what if some instruction inside this path was already patched? For now I use the simplest logic - just check whether the patched stall count is still OK, otherwise revert the patch. Sure, there can be several patched instructions - for them we should employ some kind of dynamic programming and check whether the latency fits with each patch and without it. However this leads to exponential complexity, so I decided not to include this logic in the first version

So the algo is simple - we have 3 passes:

  1. try to detect simple redundant stall counts and put high-latency instructions into an array (@tails)
  2. process @tails in reverse order, trying to find redundant stall counts on the path to the Joint
  3. finally collect all found results and update stat data

Results

Tuesday, March 31, 2026

dumping llvm bitcode from cicc

requires building of hijacked .so with appropriate LLVM version. I am too lazy for this
 
cool, but it does not work - cicc complains about bad arguments. I've tried many combinations with no luck
 
But hey - we are under linux and can do many hacks, for example check what arguments the genuine nvcc passes to cicc. For this I ran nvcc -dc -keep under strace:
strace -o c.strace -s 512 -f --trace=/^exec nvcc ...
Arguments:
  • -s NUM - maximum string size; the arguments can be very long, so I set this parameter to 512
  • -f - trace child processes
  • and finally --trace - since I don't know exactly which syscall is used to launch processes, I used the regex syntax to match all calls starting with exec
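Pulling the command lines back out of c.strace can be sketched as below. The sample line is made up (hypothetical path and pid) to show the usual `PID execve("path", [argv...], envp) = 0` shape, and `parse_exec` is an illustrative helper; escape sequences inside the strings are left undecoded and only complete single-line records are handled.

```python
import re

HEAD = re.compile(r"^(?:\d+\s+)?execve\(")
# A C-style quoted string, optionally followed by "..." when strace cut it.
STR = re.compile(r'"((?:[^"\\]|\\.)*)"(\.\.\.)?')


def parse_exec(line: str):
    """Return (path, argv) for a successful execve record, else None."""
    head = HEAD.match(line)
    if head is None or not line.rstrip().endswith(" = 0"):
        return None
    s = STR.match(line, head.end())
    if s is None:
        return None
    path = s.group(1)
    pos = line.find("[", s.end())
    if pos < 0:
        return None
    pos += 1
    argv = []
    while True:
        s = STR.match(line, pos)
        if s is None:
            break
        argv.append(s.group(1))
        pos = s.end()
        if line.startswith(", ", pos):
            pos += 2
    return path, argv


sample = ('4242 execve("/usr/local/cuda/bin/cicc", '
          '["cicc", "--device-c", "1.cpp1.ii"], 0x7ffc0 /* 40 vars */) = 0')
print(parse_exec(sample))
```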

Let's check the output file c.strace and see the launches of

  • gcc/cc1plus
  • cicc
  • ptxas
  • fatbinary
  • bin2c
  • cudafe++
  • etc

After some trial and error the right combination of arguments for cicc is
NVVMCCWIZ=553282 cicc --nv_arch compute_XX --device-c -keep 1.cpp1.ii
ls -l *.bc
-rw-rw-r-- 1 redp redp 8072 mar 31 13:25 1.lgenfe.bc
-rw-rw-r-- 1 redp redp 9988 mar 31 13:25 1.lnk.bc
-rw-rw-r-- 1 redp redp 6500 mar 31 13:25 1.opt.bc

lgenfe.bc - bitcode from front-end

opt.bc - bitcode after all optimization passes

to disassemble them we can now just use llvm-dis-21:

  %1 = tail call i32 asm sideeffect "activemask.b32 $0;", "=r"() #3, !dbg !11
  %2 = tail call { i32, i1 } @llvm.nvvm.shfl.sync.i32(i32 %1, i32 3, i32 %val, i32 16, i32 31) #3, !dbg !17
  %3 = extractvalue { i32, i1 } %2, 0, !dbg !17

Thursday, March 26, 2026

dwarf from nvcc

I've added some support for the DWARF debug info produced by nvidia nvcc to my dwarfdump. As everyone knows dwarf is over-complicated, fat and just disgusting - however, nvidia was able to take this nausea to a new level

relocs

their cuda-gdb does not contain reloc_howto_type for CUDA relocs - it's a special kind of bare-minimum open source where they publish as little code as possible. So my implementation is highly likely incomplete and wrong

locations

stored in section .debug_loc - that's ok, although the last time gcc used it was somewhere around version 4. Also nvidia introduced a new attribute DW_AT_address_class for addresses in different segments. Cool, but for example for ADDR_const_space you can't find out in which constant bank the address was placed

register names

this is the main nightmare

Wednesday, March 18, 2026

read a couple of books about compilers

LLVM Compiler for RISC-V Architecture

Describes the details of risc-v vectorization support in llvm. It should be noted that vector operations in risc-v were designed later than those of Intel and sve in arm64 - the designers took into account many flaws (for example, they made the masks for vector operations explicit) and the result is much more convenient from the programmer's point of view
On the other hand any HW vendor can add its own ISA subset, so compiler support for such custom processors can become very fragmented and a pure nightmare
 
Also I want to note that the support of risc-v vectors in LLVM carefully avoids MLIR (IMHO the second most overrated thing after LLM) - to do this they even had to patch their sacred cow tablegen
 
Drawbacks:
  • there is no introduction to LLVM IR/risc-v specific IR, so the long IR listings are very hard to follow
  • the author doesn't give links to the source code implementing some of the algorithms. Fortunately elixir has indexed the whole LLVM source tree
4/5

Dive into Deep Learning Compiler

As far as I know, this is the only book describing AI/ML compilers so far. Also TVM looks very promising - unlike monsters like XLA/iree it is compact and comprehensible for mere mortals

Drawbacks:

  • the book is not finished - the last two chapters about NN & deployment are just placeholders
  • it's unclear why they didn't take cublas as the baseline for matrix multiplication on CUDA
  • and openblas for the cpu version

Despite this, considering that the book is freely downloadable, my rating is 4 out of 5

Friday, March 6, 2026

SASS latency table: second try

In my first attempt I used latency tables extracted from the MD file (located inside nvdisasm) and nothing good came out of it

The obvious reason is that the real latency table should be located not in the disassembler - it must be inside ptxas. But the problem with that file is that it is really huge - in SDK 13 it is 40 MB in size. No symbols included, of course

This is not surprising because it contains lots of things:

  • ptxas parser
  • lots of macros
  • an optimizing compiler with 159 passes which does not use LLVM at all
  • code generators for several different SMs

Besides, it does not have any tracepoints and a big part of its strings is encrypted. So it took lots of time and patience, but finally I found and extracted the right latency table

And then a lot of discoveries came my way

Thursday, February 12, 2026

libcudadebugger.so logger

I've done some research into libcudadebugger.so internals - it seems that it follows exactly the same patterns:

  • the function table returned by GetCUDADebuggerAPI is located in the .data section, so you can patch any callback address
  • and each API function has a logger

This last fact is strange - the loggers from libcuda.so were used by the debugger, but then who consumes the logs from the debugger itself? Check the code that loads those loggers:

  lea     rdi, aNvtxInjection6          ; "NVTX_INJECTION64_PATH"
  call    _getenv
  mov     rdi, rax                      ; file
  test    rax, rax
  jz      short loc_14B160
  mov     esi, 1                        ; mode
  call    _dlopen
  mov     r13, rax
  test    rax, rax
  jz      short loc_14B190
  lea     rsi, aInitializeinje_1        ; "InitializeInjectionNvtx2"
  mov     rdi, rax                      ; handle
  call    _dlsym
  test    rax, rax
  jz      short loc_14B1A0
  lea     rdi, sub_14A270
  call    rax 
Very straightforward - load a shared library from the env var NVTX_INJECTION64_PATH and call its function InitializeInjectionNvtx2 - part of the Cupti API. Btw an excellent injection hook
 
Unfortunately these loggers don't collect the parameters of API functions - only their names, in packets with a fixed size of 0x30 bytes:
  lea     rax, aFailedCreatede+7        ; "CreateDebuggerSession"
  mov     [rbp+var_18], rax
  mov     rax, cs:dbg_log
  mov     [rbp+var_20], 0
  mov     dword ptr [rbp+var_40], 300003h
  mov     dword ptr [rbp+var_20], 1
  movaps  [rbp+var_30], xmm0
  test    rax, rax
  jz      loc_1470AC
  lea     rdx, [rbp+var_40]
  mov     r12, rdx
  mov     rdi, rdx
  call    rax
The name of the called function is located at offset 0x28 and in the logs it looks like

Sunday, February 8, 2026

building cuda-gdb from sources

For some reason cuda-gdb from the cuda sdk gives a list of errors on my machine like

Traceback (most recent call last):
  File "/usr/share/gdb/python/gdb/__init__.py", line 169, in _auto_load_packages
    __import__(modname)
  File "/usr/share/gdb/python/gdb/command/explore.py", line 746, in <module>
    Explorer.init_env()
  File "/usr/share/gdb/python/gdb/command/explore.py", line 135, in init_env
    gdb.TYPE_CODE_RVALUE_REF : ReferenceExplorer,
AttributeError: 'module' object has no attribute 'TYPE_CODE_RVALUE_REF'

so I decided to rebuild it with the python version installed in the system - and this turned out to be a difficult task

The first question is: where is the source code? It seems that the official repository does not contain cuda specific code - so the raison d'être of this repo is totally unclear. I extracted cuda-gdb-13.1.68.src.tar.gz from the cuda sdk .deb archive and proceeded with it

Second - the configuration process is extremely fragile - if you pass a single wrong option you will learn about it only after 30-40 min. Also it seems that you just can't run configure in sub-dirs, because in that case the linker will complain about tons of missing symbols. So the configuration was found by trial and error
configure --with-python=/usr/bin/python3 --enable-cuda

And finally we got the file gdb/gdb with a size of 190 MB. And after running it I got a stack trace beginning with
arch-utils.c:1374: internal-error: gdbarch: Attempt to register unknown architecture (2)

This all raises some questions for nvidia:

  • do they test their cuda sdk before releasing it?
  • do they have QA at all or, like microsoft, do they just test their ai shit directly on users?
  • from which sources was the original cuda-gdb actually built?

Well, at least having some suspicious source code we can fix this build

Monday, January 26, 2026

print & analyse CUDA coredumps

the inconvenient cuda-gdb can't process them automatically - you need to explicitly say something like
target cudacore /full/path/to/coredump

and then type lots of info cuda XXX

So last weekend I wrote a tool to parse/dump CUDA coredumps, and it even works on a machine without the CUDA SDK (which might be useful if you collect all crash dumps in some centralized storage with the help of CUDA_COREDUMP_PIPE)

But first

Little bit of theory

The format of CUDA coredumps is documented in cudacoredump.h from cuda-gdb.deb
It contains the list of devices in the .cudbg.devtbl section and 2 groups of data
 
The first is the list of contexts and the resources attached to them, like global memory and the list of loaded modules in .cudbg.relfimg.devX.ctxY sections. Those modules are just normal ELF files (some from the kernel runtime) and, most importantly, they contain the load addresses for each section - this is how we can find the module/function of the faulty instruction

The second group contains the whole thread hierarchy:

  • list of SMs in .cudbg.smtbl.devX section
  • list of CTAs in .cudbg.ctatbl.devX.smY sections
  • list of WARPs in .cudbg.wptbl.devX.smY.ctaZ sections
  • and finally list of threads in each warp - in sections .cudbg.lntbl.devX.smY.ctaZ.wpI

Each thread has its own set of sections:

  • for call stack - .cudbg.bt.devX.smY.ctaZ.wpI.lnJ
  • registers in .cudbg.regs.devX.smY.ctaZ.wpI.lnJ
  • predicates in .cudbg.pred.devX.smY.ctaZ.wpI.lnJ
  • local memory in .cudbg.local.devX.smY.ctaZ.wpI.lnJ. Curiously, those sections all have the same addresses
At the same time the sections for Uniform registers (.cudbg.uregs.devX.smY.ctaZ.wpI) & predicates (.cudbg.upred.devX.smY.ctaZ.wpI) are attached to WARPs
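Because the whole hierarchy is encoded in the section names, a dumper can recover the coordinates of every section with one regex. A minimal sketch, assuming the X/Y/Z/I/J placeholders are decimal numbers; `parse_section` is a hypothetical helper and not taken from the tool itself.

```python
import re

SECTION = re.compile(
    r"^\.cudbg\.(?P<kind>[a-z]+)((?:\.(?:dev|ctx|sm|cta|wp|ln)\d+)*)$"
)
PART = re.compile(r"\.(dev|ctx|sm|cta|wp|ln)(\d+)")


def parse_section(name: str):
    """Split '.cudbg.<kind>.devX.smY...' into (kind, {level: index})."""
    m = SECTION.match(name)
    if m is None:
        return None
    coords = {level: int(idx) for level, idx in PART.findall(m.group(2))}
    return m.group("kind"), coords


for name in (".cudbg.devtbl",
             ".cudbg.regs.dev0.sm3.cta1.wp2.ln31",
             ".cudbg.uregs.dev0.sm3.cta1.wp2",
             ".text"):
    print(name, "->", parse_section(name))
```

Sections without an `ln` coordinate, like uregs and upred, then naturally attach to the warp level.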

Where to get the faulty instruction address

This is a really good question. Actually we have 3 sources of addresses:
  1. for drivers with version >= 555 the SM has a field errorPC
  2. the WARP has a field errorPC too
  3. finally each lane has fields exception & virtualPC in CudbgThreadTableEntry

Monday, January 19, 2026

libcuda.so logger

As an illustration of the ideas from my previous blogpost I made a PoC for logging all libcuda.so calls - as the cuda-gdb debugger sees them

It just installs its own debug handler and receives all messages. Note:

  1. only x86_64 linux is supported, but the logic can be easily extended to 32-bit x86 and highly likely to arm64 too
  2. events are generated before each call, so you can't get the results of those calls
The current handler is very simple - it just writes to a file, but nothing prevents you from storing events in a DB or ElasticSearch, or using gRPC/Apache thrift to send them to some remote storage (or even to WireShark in real time)

Format of messages

Currently almost unknown - events for the public API have type 6 and the function name at offset 0x30 - and this is all for now. Surely a subject for further RE

Dependencies

How to build

Patch ELFIO_PATH & UDIS_PATH in Makefile and just run make
Both gcc (12+) and clang 21 are supported

How to connect the logger to your own application

You just call a single function, set_logger. Arguments:

  • full path to libcuda.so. Note that most structures from it are gathered with static code analysis and so require some disasm
  • FILE *fp - where to write the log
  • mask - pointer to an array with a mask for each event type. A non-zero value means intercept events of this type, 2 - also do a hexdump of the packets
  • mask_size - size of the mask array. libcuda.so from CUDA 13.1 has 31 event types

+ add libdis.so to linker

Also it's not difficult to do a classical injection with the ancient LD_PRELOAD trick or even inject this logger into already running processes

Thursday, January 15, 2026

libcuda.so internals part 2

Previous part

I've noticed that almost all real API functions have the same prologue, like:

    mov     eax, cs:dword_5E14C00 ; unique for each API function
    mov     [rbp+var_D0], 3E7h
    mov     [rbp+var_C0], 0
    mov     [rbp+var_C8], 0
    test    eax, eax
    jz      short loc_39603B
    lea     rdi, [rbp+var_C0]
    call    sub_2EE190 ; get data from pthread_getspecific
    test    eax, eax
    jz      loc_396118

 loc_396118:

    lea     rbx, aCustreamupdate_5  ; "cuStreamUpdateCaptureDependencies_ptsz"
    mov     [rbp+var_88], rdx
    call    call_dbg

So I extracted that dbg_callback and the array of debug tracepoints from cudbgApiDetach - see method try_dbg_flag. I don't know why the debugger needs them - probably this is part of event tracing

When you run your program under cuda-gdb this callback will be set:

api_gate at 0x155554e11940 (155552A2CB50) - /lib/x86_64-linux-gnu/libcudadebugger.so.1