четверг, 13 ноября 2025 г.

sass registers reusing

Lets continue to compose some useful things based on perl driven Ced. This time I add couple of new options to test script dg.pl for registers reusing

What is it at all? Nvidia as usually don't want you to know. It implemented in SASS as set of operand attributes "reuse_src_XX" and located usually in scheduler tables like TABLES_opex_X (more new like reuse_src_e & reuse_src_h are enums of type REUSE)

We can consider registers reusing as hint for GPU scheduler that some register in an instruction can reuse the physical register already allocated to one of its source operands, avoiding a full register allocation and reducing register pressure - or in other words as some registers cache

So the first question is how we can detect size of those cache? I made new pass (option -u) to collect all "reuse" attributes and find maximum of acting simultaneously - see function add_ruc

Results are not very exciting - I was unable to find in cublass functions with cache size more than 2. I remember somewhere in numerous papers about dissecting GPU came across the statement that it is equal to 4 - unfortunately I can't remember name of those paper :-(


 

And the next thing is: can we automatically detect where registers can be reused and patch SASS?

понедельник, 10 ноября 2025 г.

barriers & registers tracking for sass disasm

Finally I add registers tracking in my perl sass disasm

Now I can do some full-featured analysis of sass - like find candidates pairs of instruction to swap/run them in so called "dual" mode - and all of this in barely 1200 LoC of perl code

Let's think what must mean for couple of instructions to be fully independent:

  1. they should belong to the same block - like in case of
      IADD R8, -R3, RZ
    .L_x_14:
      FMUL R11, R3.reuse, R3
    instructions should be treated as located in different blocks
  2. they should not depend from the same barriers
  3. they should not update registers used by each other 

So I implemented building of code-flow graph, barriers & registers tracking

Building of CFG

вторник, 28 октября 2025 г.

sass disasm on perl

as an illustration of the use of the modules presented in my previous post I made yet another sass disasm - fully written on Perl. It is almost exact copy of my nvd - implemented just in 460 LoC, the only unsupported feature is registers tracking - bcs I still don't make perl binding for it. What it can do better than original nvdisasm:

and the most important thing - bcs it's based on Ced - you can patch any instruction from your script. Or customize output/save it somewhere like DB via Perl DBI/add your own passes to reveal some dirty nvidia secrets

like

Barriers

пятница, 17 октября 2025 г.

perl modules for CUBINs patching

After playing a bit with my ced I came to the conclusion that implemented DSL for editing is not enough - like it would be good to have subroutines to patch repeated/similar instructions, check that patched instruction is what I want, patch attributes/relocs etc
In other words, I need full-fledged PL. Although I've read books series "modern compiler implementation" from Andrew Appel and "crafting interpreters" I think making my own PL is overkill, so I made several XS modules to edit/patch CUBIN files for Perl. Why Perl?
  • I am able to write on it almost all I want
  • when I can't - I can always to develop my own module(s)
  • yet I don't feel sick like from pseudo languages like python
  • and it damn good and fast when you try to sketch out prototypes for things you have no idea how to make

 

ELF::FatBinary

for extracting/replacing CUBIN files from FatBinaries
see details here


Cubin::Ced 

In essence this is wrapper around Ced - it allows you to disasm/patch SASS instructions
Currently it don't support registers tracking
See doc in POD format 


Cubin::Attrs

Module to extract/patch attributes of CUBIN files + also relocs
doc in POD format

Sample

среда, 1 октября 2025 г.

addresses of cuda kernel functions

 Quote from official document:

It is not allowed to take the address of a __device__ function in host code

I haven't been surprised for a long time that entire CUDA is made up of ridiculous restrictions. What if I told you that paranoid nvidia lies as usually and actually you can get addresses of kernel functions in your host code?

But first lets check what workarounds we can employ to have functions pointers. I don't know for what pedagogical purpose this code intentionally was written so poorly and does not free the allocated memory - and now millions of brainless artificial idiots will copy-paste it forever, so I made patched version. You can realize that attempt to read from early gathered with cudaMemcpyFromSymbol addresses will results error 1 (invalid argument)

Ok. but we could just return address of function directly from another kernel function, right? So I made quick & dirty hack
I brute-forced all combinations of cf1(__device__/__constant__) & variants of cudaMemcpyFromSymbol/cudaMemcpy - and with no luck
So it's time to run

cuda-gdb

суббота, 13 сентября 2025 г.

practical ced usage: extracting sm machine ID

It's funny how paranoid nvidia trying to hide as much info from their customers as it can. One sample is so called "special registers" - even with PTX you can extract only limited set of them

So I played a bit with some undocumented SRs - namely with SR_MACHINE_ID_XX & SR_REGALLOC. I made legal loading of special registers and then patched those SASS instructions with my Ced

Lets see how those code looks in

PTX

      mov.u32         %r2, %tid.x;
      st.global.u32   [%rd2], %r2;
      mov.u32         %r3, %tid.y;
      st.global.u32   [%rd2+4], %r3;
      mov.u32         %r4, %tid.z;
      st.global.u32   [%rd2+8], %r4;
      // inline asm
      mov.u32 %r1, %smid;
      // inline asm
      st.global.u32   [%rd2+12], %r1;
      mov.u32         %r5, 21;
      st.global.u32   [%rd2+16], %r5;
As you can see - nothing special, just load in r2-r5 some values and store them in r12 holding address of function argument (obtained with cvta.to.global.u64)

четверг, 7 августа 2025 г.

tool to extract/replace files within CUDA FatBinaries

Patching of cubin files is good, but loading and running them requires lots of code and using of Driver API. It would be much more convenient to patch SASS directly in binaries produced by nvcc

Unfortunately evil nvidia as usually shows it's paranoia:

  • cuobjdump can list & extract content but not replace. Also it is extremely buggy on old libraries like libcublas.so v7
  • official fatbinary is too complex and rebuilds whole file from scratch
  • format of fatbinary is undocumented
Well, the last problem already was solved. So I made utility to work with fatbinaries. You can
  • list files with -v option
  • extract file at some index: -i idx -o output.filename
  • replace file at some index: -i idx -r replace.filename

Perl binding

Being lazy I prefer to use perl scripts to automate as much as possible, so I also made perl XS module ELF::FatBinary. Having also module ELF::Reader this allows more fine filtering of ELF files - like if file contains section/symbol with some specific name etc. See simple example how it might look like

Limitation

The tool can replace files inside fatbinary only in-place, so
  1. compressed fatbinaries not supported
  2. size of files must be the same

Some results