windows deep internals

четверг, 13 ноября 2025 г.

sass registers reusing

Lets continue to compose some useful things based on perl driven Ced. This time I add couple of new options to test script dg.pl for registers reusing

What is it at all? Nvidia as usually don't want you to know. It implemented in SASS as set of operand attributes "reuse_src_XX" and located usually in scheduler tables like TABLES_opex_X (more new like reuse_src_e & reuse_src_h are enums of type REUSE)

We can consider registers reusing as hint for GPU scheduler that some register in an instruction can reuse the physical register already allocated to one of its source operands, avoiding a full register allocation and reducing register pressure - or in other words as some registers cache

So the first question is how we can detect size of those cache? I made new pass (option -u) to collect all "reuse" attributes and find maximum of acting simultaneously - see function add_ruc

Results are not very exciting - I was unable to find in cublass functions with cache size more than 2. I remember somewhere in numerous papers about dissecting GPU came across the statement that it is equal to 4 - unfortunately I can't remember name of those paper :-(

And the next thing is: can we automatically detect where registers can be reused and patch SASS?

Дальше »

понедельник, 10 ноября 2025 г.

barriers & registers tracking for sass disasm

Finally I add registers tracking in my perl sass disasm

Now I can do some full-featured analysis of sass - like find candidates pairs of instruction to swap/run them in so called "dual" mode - and all of this in barely 1200 LoC of perl code

Let's think what must mean for couple of instructions to be fully independent:

they should belong to the same block - like in case of
IADD R8, -R3, RZ .L_x_14: FMUL R11, R3.reuse, R3 instructions should be treated as located in different blocks
they should not depend from the same barriers
they should not update registers used by each other

So I implemented building of code-flow graph, barriers & registers tracking

Building of CFG

Дальше »

вторник, 28 октября 2025 г.

sass disasm on perl

as an illustration of the use of the modules presented in my previous post I made yet another sass disasm - fully written on Perl. It is almost exact copy of my nvd - implemented just in 460 LoC, the only unsupported feature is registers tracking - bcs I still don't make perl binding for it. What it can do better than original nvdisasm:

shows LUT operations
shows instructions properties/predicates
shows relocs for each code section
shows const bank params

and the most important thing - bcs it's based on Ced - you can patch any instruction from your script. Or customize output/save it somewhere like DB via Perl DBI/add your own passes to reveal some dirty nvidia secrets

Barriers

Дальше »

пятница, 17 октября 2025 г.

perl modules for CUBINs patching

After playing a bit with my ced I came to the conclusion that implemented DSL for editing is not enough - like it would be good to have subroutines to patch repeated/similar instructions, check that patched instruction is what I want, patch attributes/relocs etc

In other words, I need full-fledged PL. Although I've read books series "modern compiler implementation" from Andrew Appel and "crafting interpreters" I think making my own PL is overkill, so I made several XS modules to edit/patch CUBIN files for Perl. Why Perl?

I am able to write on it almost all I want
when I can't - I can always to develop my own module(s)
yet I don't feel sick like from pseudo languages like python
and it damn good and fast when you try to sketch out prototypes for things you have no idea how to make

ELF::FatBinary

source

for extracting/replacing CUBIN files from FatBinaries

see details here

Cubin::Ced

source

In essence this is wrapper around Ced - it allows you to disasm/patch SASS instructions

Currently it don't support registers tracking

See doc in POD format

Cubin::Attrs

source

Module to extract/patch attributes of CUBIN files + also relocs

doc in POD format

Sample

Дальше »

среда, 1 октября 2025 г.

addresses of cuda kernel functions

Quote from official document:

It is not allowed to take the address of a __device__ function in host code

I haven't been surprised for a long time that entire CUDA is made up of ridiculous restrictions. What if I told you that paranoid nvidia lies as usually and actually you can get addresses of kernel functions in your host code?

But first lets check what workarounds we can employ to have functions pointers. I don't know for what pedagogical purpose this code intentionally was written so poorly and does not free the allocated memory - and now millions of brainless artificial idiots will copy-paste it forever, so I made patched version. You can realize that attempt to read from early gathered with cudaMemcpyFromSymbol addresses will results error 1 (invalid argument)

Ok. but we could just return address of function directly from another kernel function, right? So I made quick & dirty hack

I brute-forced all combinations of cf1(__device__/__constant__) & variants of cudaMemcpyFromSymbol/cudaMemcpy - and with no luck

So it's time to run

cuda-gdb

Дальше »

суббота, 13 сентября 2025 г.

practical ced usage: extracting sm machine ID

It's funny how paranoid nvidia trying to hide as much info from their customers as it can. One sample is so called "special registers" - even with PTX you can extract only limited set of them

So I played a bit with some undocumented SRs - namely with SR_MACHINE_ID_XX & SR_REGALLOC. I made legal loading of special registers and then patched those SASS instructions with my Ced

Lets see how those code looks in

PTX

      mov.u32         %r2, %tid.x;
      st.global.u32   [%rd2], %r2;
      mov.u32         %r3, %tid.y;
      st.global.u32   [%rd2+4], %r3;
      mov.u32         %r4, %tid.z;
      st.global.u32   [%rd2+8], %r4;
      // inline asm
      mov.u32 %r1, %smid;
      // inline asm
      st.global.u32   [%rd2+12], %r1;
      mov.u32         %r5, 21;
      st.global.u32   [%rd2+16], %r5;

As you can see - nothing special, just load in r2-r5 some values and store them in r12 holding address of function argument (obtained with cvta.to.global.u64)

Дальше »

четверг, 7 августа 2025 г.

tool to extract/replace files within CUDA FatBinaries

Patching of cubin files is good, but loading and running them requires lots of code and using of Driver API. It would be much more convenient to patch SASS directly in binaries produced by nvcc

Unfortunately evil nvidia as usually shows it's paranoia:

cuobjdump can list & extract content but not replace. Also it is extremely buggy on old libraries like libcublas.so v7
official fatbinary is too complex and rebuilds whole file from scratch
format of fatbinary is undocumented

Well, the last problem already was solved. So I made utility to work with fatbinaries. You can

list files with -v option
extract file at some index: -i idx -o output.filename
replace file at some index: -i idx -r replace.filename

Perl binding

Being lazy I prefer to use perl scripts to automate as much as possible, so I also made perl XS module ELF::FatBinary. Having also module ELF::Reader this allows more fine filtering of ELF files - like if file contains section/symbol with some specific name etc. See simple example how it might look like

Limitation

The tool can replace files inside fatbinary only in-place, so

compressed fatbinaries not supported
size of files must be the same

Some results

Дальше »

четверг, 13 ноября 2025 г.

sass registers reusing

понедельник, 10 ноября 2025 г.

barriers & registers tracking for sass disasm

Building of CFG

вторник, 28 октября 2025 г.

sass disasm on perl

Barriers

пятница, 17 октября 2025 г.

perl modules for CUBINs patching

ELF::FatBinary

Cubin::Ced

Cubin::Attrs

Sample

среда, 1 октября 2025 г.

addresses of cuda kernel functions

cuda-gdb

суббота, 13 сентября 2025 г.

practical ced usage: extracting sm machine ID

PTX

четверг, 7 августа 2025 г.

tool to extract/replace files within CUDA FatBinaries

Perl binding

Limitation

Some results

четверг, 13 ноября 2025 г.

понедельник, 10 ноября 2025 г.

вторник, 28 октября 2025 г.

пятница, 17 октября 2025 г.

среда, 1 октября 2025 г.

суббота, 13 сентября 2025 г.

четверг, 7 августа 2025 г.