windows deep internals

Wednesday, December 24, 2025

libcuda.so internals

The first question that comes to mind when looking at them is "why are they so huge?". For example, libcuda.so from cuda 10.1 is 28 MB, and the one from 13.1 is already 96 MB. So I rejected the idea that they are just yet more victims of vibe-coding and did some preliminary RE. The answer is that they contain lots of CUBIN files in the .rodata section for

kernel run-time

I extracted them (archive from 13.1) and checked the SASS. Now I am almost sure that nvidia has some internal SASS assembler - they use the LEPC instruction (it loads the address of the current instruction), which you just can't get from the official ptxas

   /*0160*/  LEPC R20 ; R20 now holds 0x170
   /*0170*/  IADD3 R20, P0, R20, 0x50, RZ ; R20 += 0x50, with the carry-out written to P0

What do those CUBIN files contain?

syscalls like __cuda_syscall_cp_async_bulk_tensor_XX, __cuda_syscall_tex_grad_XX etc
implementation of functions like cudaGraphLaunch/vprintf
functions cnpXXX like cnpDeviceGetAttribute
logic for kernel enqueue
some support for profiling like scProfileBuffers
trap handlers

and so on. In essence these are backstage workers - like the good old BIOS
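The embedded CUBIN files mentioned above can be located with a simple scan. This is only a minimal sketch, assuming the images are stored as plain, uncompressed 64-bit little-endian ELF files with e_machine equal to EM_CUDA (190); the function name find_cubins and the size heuristic are my own illustration, not the tool used for the archive.

```python
import struct

EM_CUDA = 190  # e_machine value of CUDA ELF images


def find_cubins(blob: bytes):
    """Yield (offset, estimated_size) for every ELF64 CUDA image found in blob."""
    pos = 0
    while True:
        pos = blob.find(b"\x7fELF", pos)
        if pos < 0 or pos + 64 > len(blob):
            return
        hdr = blob[pos:pos + 64]
        # e_ident[4] == 2 means ELFCLASS64, e_ident[5] == 1 means little endian
        if hdr[4] == 2 and hdr[5] == 1:
            (e_machine,) = struct.unpack_from("<H", hdr, 18)
            e_phoff, e_shoff = struct.unpack_from("<QQ", hdr, 32)
            e_phentsize, e_phnum, e_shentsize, e_shnum = struct.unpack_from(
                "<HHHH", hdr, 54)
            if e_machine == EM_CUDA:
                # Assumption: the image ends right after whichever header
                # table (program or section headers) is placed last.
                size = max(e_phoff + e_phentsize * e_phnum,
                           e_shoff + e_shentsize * e_shnum)
                yield pos, size
        pos += 4  # continue scanning after this magic
```

Feeding it the raw bytes of the library (or just of its .rodata section) gives candidate offsets; each candidate should still be checked with a real ELF parser before trusting the estimated size.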

API callbacks


Friday, November 28, 2025

bug in sass MD

I spent a couple of days debugging a rare bug in my sass disasm. I tested it on thousands of .cubin files and got bad instruction decoding for one of them. Btw I have never seen papers about testing disassemblers - compilers like gcc/clang have huge test suites to detect regressions, so probably I should do the same. The problem is that I periodically add new features and the smxx.so files are regenerated every time

My nvd has option -N to dump unrecognized opcodes, so for sm55 I got

Not found at E8 0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000

nvdisasm v11 swears that this pile of 0s & 1s must somehow be an ISCADD instruction. Ok, let's run ead.pl and check if it can find it:
perl ead.pl -BFvamrzN 0000100000010111111100010101110000011000100000100000000000000011101001110000000000000000 ../data/sm55_1.txt

found 4
........................0.0111000..11...................................................
0000-0-------111111-----0101110001011-------000--00000000000----------------------------
0000000-----------------0101110000111000-00000---00000000000----------------------------
00000--------111111-----01011100000110---000-----00000000000----------------------------
000000-------111111-----0001110---------------------------------------------------------
matched: 0
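To see why a candidate mask fails, it helps to list the exact bit positions where it disagrees with the opcode. Below is a minimal sketch of such a check, assuming that '-' in a mask marks a don't-care bit; the helper names mismatches and closest are hypothetical and this is not the algorithm of ead.pl.

```python
def mismatches(bits: str, pattern: str):
    """Return positions where a fixed pattern bit differs from the opcode bit.

    Assumption: '-' in the pattern is a don't-care position.
    """
    if len(bits) != len(pattern):
        raise ValueError("bit string and pattern must have the same length")
    return [i for i, (b, p) in enumerate(zip(bits, pattern))
            if p != "-" and p != b]


def closest(bits: str, patterns):
    """Rank patterns by the number of mismatching fixed bits, fewest first."""
    return sorted((len(mismatches(bits, p)), p) for p in patterns)
```

Running closest on the unrecognized bit string and the candidate masks shows which encoding is nearest and which bits prevent the match.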

My first thought was that the MD are just too old because they were extracted from cuda 10, so I made a decryptor for cuda 11 (paranoid nvidia removed instruction properties since version 12, so 11 is the last source of MD), extracted the data, rebuilt sm55.cc and sm55.so and ran the test again

The bug has not disappeared


Tuesday, November 25, 2025

SASS latency table & instructions reordering

In these difficult times, no one wants to report bad or simply weak results (and this will destroy this hypocritical civilization). Since this is my personal blog and I am not looking for grants, I don't care.

Let's dissect one truly inspiring paper - they employed reinforcement learning and claim that

transparently producing 2% to 26% speedup

wow, 26% is a really excellent result. So I decided to implement the proposed technique, but first I needed a source of latency values for each SASS instruction. I extracted the files with latency tables from nvdisasm - their names have the _2.txt suffix

Then I made a perl binding for my perl version of Ced (see the methods of object Cubin::Ced::LatIndex), added a new pass (-l option for dg.pl) and ran some experiments to dump latency values for each instruction. Because the order of connections is unknown, I implemented all 3 variants:

current column with current row
column from previous instruction with current row
current column with row from previous instruction

The results are discouraging

some instructions (~1.5% in the best case, variant 1) do not have a latency at all (for example S2R or XXXBAR)
some instructions have more than 1 index into the same table - well, I fixed this by selecting the max value (see function intersect_lat)
when compared with the actual stall counts, the percentage of incorrect values is above 60 - that's even worse than coin flipping
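The three pairing variants and the "take the max" rule can be sketched as follows. This is only an illustration under assumptions: the real layout of the latency tables is unknown, the table is modelled as a plain list of rows, and Instr, lookup and latencies are hypothetical names (lookup just mirrors the described behaviour of intersect_lat, it is not that function).

```python
from typing import NamedTuple, Optional, Sequence


class Instr(NamedTuple):
    name: str
    cols: Sequence[int]  # column indexes of this instruction (hypothetical)
    rows: Sequence[int]  # row indexes of this instruction (hypothetical)


def lookup(table, cols, rows) -> Optional[int]:
    """Max latency over all (row, col) pairs; None when an index is missing."""
    vals = [table[r][c] for r in rows for c in cols]
    return max(vals) if vals else None


def latencies(table, prev: Instr, cur: Instr):
    """Evaluate all three ways to connect a column with a row."""
    return {
        "cur_col/cur_row": lookup(table, cur.cols, cur.rows),
        "prev_col/cur_row": lookup(table, prev.cols, cur.rows),
        "cur_col/prev_row": lookup(table, cur.cols, prev.rows),
    }
```

Comparing each of the three values with the stall count encoded in the instruction gives the per-variant error rate.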

Some possible reasons for failure:


Thursday, November 13, 2025

sass registers reusing

Let's continue to compose some useful things based on perl-driven Ced. This time I added a couple of new options for register reuse to the test script dg.pl

What is it at all? Nvidia, as usual, doesn't want you to know. It is implemented in SASS as a set of operand attributes "reuse_src_XX", usually located in scheduler tables like TABLES_opex_X (newer ones like reuse_src_e & reuse_src_h are enums of type REUSE)

We can consider register reuse as a hint for the GPU scheduler that some register in an instruction can reuse the physical register already allocated to one of its source operands, avoiding a full register allocation and reducing register pressure - or in other words, as a kind of register cache

So the first question is: how can we detect the size of that cache? I made a new pass (option -u) to collect all "reuse" attributes and find the maximum number of them acting simultaneously - see function add_ruc
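A much simplified way to get a lower bound on that size is to count the .reuse flags in plain disassembly text. The sketch below only counts flagged operands inside a single instruction, which is my own simplifying assumption; it does not track flags across neighbouring instructions and is not the logic of add_ruc.

```python
import re

# Matches operands like "R3.reuse" in textual SASS.
REUSE_RE = re.compile(r"\bR\d+\.reuse\b")


def max_reuse(sass_lines):
    """Largest number of .reuse-flagged operands seen in one instruction."""
    best = 0
    for line in sass_lines:
        best = max(best, len(REUSE_RE.findall(line)))
    return best
```

The input can be the instruction lines produced by any SASS disassembler, one instruction per line.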

The results are not very exciting - I was unable to find functions in cuBLAS with a cache size of more than 2. I remember that somewhere in the numerous papers about dissecting GPUs I came across a statement that it is equal to 4 - unfortunately I can't remember the name of that paper :-(

And the next thing is: can we automatically detect where registers can be reused and patch SASS?


Monday, November 10, 2025

barriers & registers tracking for sass disasm

Finally I added register tracking to my perl sass disasm

Now I can do some full-featured analysis of sass - like finding candidate pairs of instructions to swap or to run in the so-called "dual" mode - and all of this in barely 1200 LoC of perl code

Let's think about what it means for a pair of instructions to be fully independent:

they should belong to the same block - for example, in
   IADD R8, -R3, RZ
.L_x_14:
   FMUL R11, R3.reuse, R3
the two instructions should be treated as located in different blocks
they should not depend on the same barriers
they should not update registers used by each other

So I implemented building of the control-flow graph, plus barrier & register tracking
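The three conditions above can be written down as a single predicate. This is a minimal sketch, assuming that block ids, barrier sets and read/write register sets were already computed by earlier passes; the Instr record and the independent function are hypothetical names, not part of the perl disasm.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    text: str
    block: int                         # id of the basic block (from the CFG pass)
    barriers: frozenset = frozenset()  # barriers this instruction sets or waits on
    reads: frozenset = frozenset()     # registers read
    writes: frozenset = frozenset()    # registers written


def independent(a: Instr, b: Instr) -> bool:
    """True when a and b satisfy all three independence conditions."""
    if a.block != b.block:
        return False
    if a.barriers & b.barriers:
        return False
    # Neither instruction may write a register the other one reads or writes.
    if a.writes & (b.reads | b.writes) or b.writes & a.reads:
        return False
    return True
```

With the IADD/FMUL pair separated by the label .L_x_14 the block ids differ, so the predicate rejects the pair even though their registers do not conflict.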

Building of CFG


Tuesday, October 28, 2025

sass disasm on perl

As an illustration of the use of the modules presented in my previous post I made yet another sass disasm - written entirely in Perl. It is an almost exact copy of my nvd, implemented in just 460 LoC. The only unsupported feature is register tracking, because I still haven't made a perl binding for it. What it can do better than the original nvdisasm:

shows LUT operations
shows instructions properties/predicates
shows relocs for each code section
shows const bank params

and the most important thing - because it's based on Ced, you can patch any instruction from your script. Or customize the output, save it somewhere like a DB via Perl DBI, or add your own passes to reveal some dirty nvidia secrets

Barriers


Friday, October 17, 2025

perl modules for CUBINs patching

After playing a bit with my ced I came to the conclusion that the implemented DSL for editing is not enough - it would be good to have subroutines to patch repeated/similar instructions, to check that a patched instruction is what I want, to patch attributes/relocs etc

In other words, I need a full-fledged PL. Although I've read the book series "modern compiler implementation" by Andrew Appel and "crafting interpreters", I think making my own PL is overkill, so I made several XS modules for Perl to edit/patch CUBIN files. Why Perl?

I am able to write almost anything I want in it
when I can't - I can always develop my own module(s)
yet it doesn't make me feel sick, unlike pseudo languages like python
and it is damn good and fast when you try to sketch out prototypes for things you have no idea how to make

ELF::FatBinary

source

for extracting/replacing CUBIN files from FatBinaries

see details here

Cubin::Ced

source

In essence this is a wrapper around Ced - it allows you to disasm/patch SASS instructions

Currently it doesn't support register tracking

See doc in POD format

Cubin::Attrs

source

Module to extract/patch attributes of CUBIN files + also relocs

doc in POD format

Sample

