windows deep internals: 2025

вторник, 25 ноября 2025 г.

SASS latency table & instructions reordering

In these difficult times, no one wants to report bad or simply weak results (and this will destroy this hypocritical civilization). Since this is my personal blog and I am not looking for grants, I don't care.

Let's dissect one truly inspiring paper - they employed reinforcement learning and claim that

transparently producing 2% to 26% speedup

wow, 26% is really excellent result. So I decided to implement proposed technique, but first I need get source of latency values for each SASS instruction. I extracted files with latency tables from nvdisasm - their names have _2.txt suffixes

Then I made perl binding for my perl version of Ced (see methods for object Cubin::Ced::LatIndex), add new pass (-l option for dg.pl) and done some experiments to dump latency values for each instruction. Because connections order is unknown I implemented all 3:

current column with current row
column from previous instruction with current row
current column with row from previous instruction

The results are discouraging

some instructions (~1.5% for best case 1) does not have latency at all (for example S2R or XXXBAR)
some instructions have more than 1 index to the same table - well, I fixed this with selecting max value (see function intersect_lat)
while comparing with actual stall count the percentage of incorrect values above 60 - it's even worse than just coin flipping

Some possible reasons for failure:

Дальше »

четверг, 13 ноября 2025 г.

sass registers reusing

Lets continue to compose some useful things based on perl driven Ced. This time I add couple of new options to test script dg.pl for registers reusing

What is it at all? Nvidia as usually don't want you to know. It implemented in SASS as set of operand attributes "reuse_src_XX" and located usually in scheduler tables like TABLES_opex_X (more new like reuse_src_e & reuse_src_h are enums of type REUSE)

We can consider registers reusing as hint for GPU scheduler that some register in an instruction can reuse the physical register already allocated to one of its source operands, avoiding a full register allocation and reducing register pressure - or in other words as some registers cache

So the first question is how we can detect size of those cache? I made new pass (option -u) to collect all "reuse" attributes and find maximum of acting simultaneously - see function add_ruc

Results are not very exciting - I was unable to find in cublass functions with cache size more than 2. I remember somewhere in numerous papers about dissecting GPU came across the statement that it is equal to 4 - unfortunately I can't remember name of those paper :-(

And the next thing is: can we automatically detect where registers can be reused and patch SASS?

Дальше »

понедельник, 10 ноября 2025 г.

barriers & registers tracking for sass disasm

Finally I add registers tracking in my perl sass disasm

Now I can do some full-featured analysis of sass - like find candidates pairs of instruction to swap/run them in so called "dual" mode - and all of this in barely 1200 LoC of perl code

Let's think what must mean for couple of instructions to be fully independent:

they should belong to the same block - like in case of
IADD R8, -R3, RZ .L_x_14: FMUL R11, R3.reuse, R3 instructions should be treated as located in different blocks
they should not depend from the same barriers
they should not update registers used by each other

So I implemented building of code-flow graph, barriers & registers tracking

Building of CFG

Дальше »

вторник, 28 октября 2025 г.

sass disasm on perl

as an illustration of the use of the modules presented in my previous post I made yet another sass disasm - fully written on Perl. It is almost exact copy of my nvd - implemented just in 460 LoC, the only unsupported feature is registers tracking - bcs I still don't make perl binding for it. What it can do better than original nvdisasm:

shows LUT operations
shows instructions properties/predicates
shows relocs for each code section
shows const bank params

and the most important thing - bcs it's based on Ced - you can patch any instruction from your script. Or customize output/save it somewhere like DB via Perl DBI/add your own passes to reveal some dirty nvidia secrets

Barriers

Дальше »

пятница, 17 октября 2025 г.

perl modules for CUBINs patching

After playing a bit with my ced I came to the conclusion that implemented DSL for editing is not enough - like it would be good to have subroutines to patch repeated/similar instructions, check that patched instruction is what I want, patch attributes/relocs etc

In other words, I need full-fledged PL. Although I've read books series "modern compiler implementation" from Andrew Appel and "crafting interpreters" I think making my own PL is overkill, so I made several XS modules to edit/patch CUBIN files for Perl. Why Perl?

I am able to write on it almost all I want
when I can't - I can always to develop my own module(s)
yet I don't feel sick like from pseudo languages like python
and it damn good and fast when you try to sketch out prototypes for things you have no idea how to make

ELF::FatBinary

source

for extracting/replacing CUBIN files from FatBinaries

see details here

Cubin::Ced

source

In essence this is wrapper around Ced - it allows you to disasm/patch SASS instructions

Currently it don't support registers tracking

See doc in POD format

Cubin::Attrs

source

Module to extract/patch attributes of CUBIN files + also relocs

doc in POD format

Sample

Дальше »

среда, 1 октября 2025 г.

addresses of cuda kernel functions

Quote from official document:

It is not allowed to take the address of a __device__ function in host code

I haven't been surprised for a long time that entire CUDA is made up of ridiculous restrictions. What if I told you that paranoid nvidia lies as usually and actually you can get addresses of kernel functions in your host code?

But first lets check what workarounds we can employ to have functions pointers. I don't know for what pedagogical purpose this code intentionally was written so poorly and does not free the allocated memory - and now millions of brainless artificial idiots will copy-paste it forever, so I made patched version. You can realize that attempt to read from early gathered with cudaMemcpyFromSymbol addresses will results error 1 (invalid argument)

Ok. but we could just return address of function directly from another kernel function, right? So I made quick & dirty hack

I brute-forced all combinations of cf1(__device__/__constant__) & variants of cudaMemcpyFromSymbol/cudaMemcpy - and with no luck

So it's time to run

cuda-gdb

Дальше »

суббота, 13 сентября 2025 г.

practical ced usage: extracting sm machine ID

It's funny how paranoid nvidia trying to hide as much info from their customers as it can. One sample is so called "special registers" - even with PTX you can extract only limited set of them

So I played a bit with some undocumented SRs - namely with SR_MACHINE_ID_XX & SR_REGALLOC. I made legal loading of special registers and then patched those SASS instructions with my Ced

Lets see how those code looks in

PTX

      mov.u32         %r2, %tid.x;
      st.global.u32   [%rd2], %r2;
      mov.u32         %r3, %tid.y;
      st.global.u32   [%rd2+4], %r3;
      mov.u32         %r4, %tid.z;
      st.global.u32   [%rd2+8], %r4;
      // inline asm
      mov.u32 %r1, %smid;
      // inline asm
      st.global.u32   [%rd2+12], %r1;
      mov.u32         %r5, 21;
      st.global.u32   [%rd2+16], %r5;

As you can see - nothing special, just load in r2-r5 some values and store them in r12 holding address of function argument (obtained with cvta.to.global.u64)

Дальше »

четверг, 7 августа 2025 г.

tool to extract/replace files within CUDA FatBinaries

Patching of cubin files is good, but loading and running them requires lots of code and using of Driver API. It would be much more convenient to patch SASS directly in binaries produced by nvcc

Unfortunately evil nvidia as usually shows it's paranoia:

cuobjdump can list & extract content but not replace. Also it is extremely buggy on old libraries like libcublas.so v7
official fatbinary is too complex and rebuilds whole file from scratch
format of fatbinary is undocumented

Well, the last problem already was solved. So I made utility to work with fatbinaries. You can

list files with -v option
extract file at some index: -i idx -o output.filename
replace file at some index: -i idx -r replace.filename

Perl binding

Being lazy I prefer to use perl scripts to automate as much as possible, so I also made perl XS module ELF::FatBinary. Having also module ELF::Reader this allows more fine filtering of ELF files - like if file contains section/symbol with some specific name etc. See simple example how it might look like

Limitation

The tool can replace files inside fatbinary only in-place, so

compressed fatbinaries not supported
size of files must be the same

Some results

Дальше »

среда, 23 июля 2025 г.

ced: sed-like cubin editor

Unfortunately, the only sass assembler I know of has several drawbacks:

it's inactive last couple of years. I dropped email to his author and he didn't replied. Hope he is well
it don't support modern sm architectures sm1xx
it's matmul solver sometimes produces wrong instructions
and it don't support many EIATTRS

The last problem is not related with CuAssembler itself - it is more general: seems that nvdisasm produces output which cannot be used to assembly cubin files

Also we still don't know format of some sections like SHT_CUDA_RELOCINFO. All this makes task of rebuilding cubin files very hard

However do we really need to rebuild cubin files? In my experience 99.9% of desired patches are just set/remove some instructions attributes like register reusing/caching policy/wait groups for USCHED_INFO etc - just boring tuning to squeeze out the last couple of percent of productivity

So the flow of thought was something like

it would be good to make plugin for hex-editor to disasm sass instruction at some known offset and show GUI where I could patch some fields
I am talentless at creating GUI - so perhaps it would be better to dump instructions fields in text form and then just edit it
hey - if you can parse this text representation and patch it back to sass - you don't need hex-editor at all - you could just use sed-like tool to patch instructions via script

and so being lazy and impatient I wrote such tool - it's called ced. Name similarity to sed is not coincidence - it allows you run text script to patch or replace some sass instructions inside cubin files

Дальше »

суббота, 19 июля 2025 г.

sass instructions: LUT operations

I was asked yesterday why I didn't transformed sample from my previous record

iadd r8, r2, r8 ; r8 = r2 + r8 iadd r8, r8, r8 ; r8 = r8 + r8 iadd r8, r8, ur4 ; r8 = r8 + ur4

to more simple

imad r8, r8, 2, ur4 ; r8 = r8 * 2 + ur4

While this is technically correct the problem here - ISA is non-orthogonal. You can use my ina to check available forms of IMAD for universal registers - and suddenly we will discover that it has only 2 forms

@Pg IMAD E:wide E:fmt E:Rd E:Pu E:Ra E:reuse_src_a E:Rb E:reuse_src_b -E:URc
@Pg IMAD E:wide E:fmt E:Rd E:Pu E:Ra E:reuse_src_a E:URb -E:Rc E:reuse_src_c

And no forms with imm value for Ra/Rb. So you can generate only something like:

imad r8, r8, rXX, ur4

And for UIMAD with imm values we have forms with universal registers only:

@UPg UIMAD E:wide E:fmt E:X E:URd E:UPu E:URa ,Sb ~E:URc !E:UPp
@UPg UIMAD E:wide E:fmt E:URd E:UPu E:URa ,Sb -E:URc
etc

But all this is just kids games compared to LUT operations. In short - you can have 255 combinations of logical operations over 3 operands driven by index. nvdisasm shows them like:

LOP3.LUT R0, R3, R0, RZ, 0x30, !PT

Very informative, yeah. So I employed sympy to generate table of simplified expressions - however I am too old and lazy to write python scripts. So pretty obvious solution:

make perl script to enumerate all possible combinations and generate python script
which in turn generates string table
and then sed add quotes and commas

And now my disasm shows much clearer output:
LOP3.LUT PT,R0,R3,R0,RZ, 0x30,!PT &req={5}; LUT 30: a & ~b
So here a = R3, b = R0 and result R0 = R3 & ~R0

пятница, 18 июля 2025 г.

sass instructions: registers tracking

I've add tracking of registers to both nvd & pa - you can use -T option. And I have lots of bad news

nvdisasm lies

Yup, again. Let's check this innocent looking code:

CS2R R100, SRZ

You can assume that it stores value from special to single 32bit regular register. Actually it used wide loading and store value to R100 & R101. bcs MD of CS2R looks like

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz Register:Rd ','SpecialRegister:SRa

PREDICATES IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;

QInteger has default value "64" and so omitted in output. Ok, maybe this is bcs special registers are 64bit? Lets check MD for S2UR to store value to uniform register:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode UniformRegister:URd

PREDICATES IDEST_SIZE = 32;

It does not have width modifier at all and destination size is simple 32. So cs2r by default is 64bit and s2ur is 32bit. Srsly? Is it should be obvious? Highly likely this person now inventing mnemonics names for Intel

lack of documentation

In essence we have this short list of instructions and chapter 8.7 from ancient "The CUDA Handbook" (btw published in 2013). Properties & predicates make it a little easier to understand. Unfortunately they contains only info about regular and uniform registers. And we have yet several classes of instructions working with another kind of registers

Predicates

MD for ISETP:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex Predicate:Pu ','Predicate:Pv

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

and for HSETP2:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ft z /Bop:bop Predicate:Pu ','Predicate:Pv

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

I don't know if they set their first predicate Pu only or both Pu & Pv. Btw famous IMAD has very curious MD for some forms:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp

Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should it have semantic Rd = Ra * Rb * Pu + Rc * Pp?

Barriers

MD for BMOV:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz BD:barReg ','CBU_STATE_NONBAR:cbu_state

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

Enum BD described as

BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;

I am sure that there are more...

ptxas produces code that is far from perfect

Дальше »

понедельник, 7 июля 2025 г.

sass instructions: uniform registers & wide loading

Having predicates for operands size and properties for type/identification we could write register tracking (well, at least up to sm90). But before we should familiarize yourself with couple of CUDA specific things

Uniform registers

As concise introduction you can read this paper, especially paragraph 3.5.2:

Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath

Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively

So for example on SM75 you have 255 regular registers and 63 uniform registers UR0-UR62 (and URZ clearly mapped to RZ) + uniform predicates UP0-UP6. Given that they "typically updating array indices, loop indices or pointers" and size of VRAM can be up to 192Gb someone would expect that this is whole new set of registers with width 64bit to access arrays > 4Gb

Well, reality is much more boring - they are just virtual mapping of regular 32bit registers. Proof:

S2R R3, SR_TID.X S2UR UR4, SR_CTAID.Y

Here both SR_XX are so called "special registers" with width 32bit. Also EIATTR_MAXREG_COUNT (being itself 16bit) always contains value 0xff. I saw curious cases when "nvdisasm --print-life-ranges" shows GPR 223 and UGPR 35. If I can use calculator 223 + 35 = 258

I have zero ideas how those URs are mapped to real registers (and uniform predicates to ordinal predicates) - at least there is no EIATTRs for such mapping. Obviously they are not mapped 1:1:

S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z ULDC.64 UR10, c[0x0][0x118] IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here it's value is still alive

Also it's totally unclear how functions get initial values for this URs. I`ve wrote for my nvd parser for EIATTR_PARAM_CBANK & EIATTR_KPARAM_INFO and it seems that often they are loaded exactly from nowhere:

/*30*/ ULDC.64 UR4,c[0][0x118]; ; unknown cb off 118

/*40*/ IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0}; ; cb in section 254, offset 168 - 160 = 8

as you can see const bank starts from 0x160 and UR4 was loaded from offset 0x118

Wide loading

Дальше »

четверг, 3 июля 2025 г.

sass instructions properties

I've already described so called predicates. Unfortunately they have only size of operands. Unlike predicates properties also have types:

IDEST_OPERAND_MAP = (1<<INDEX(Rd)); IDEST_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC); IDEST2_OPERAND_MAP = (1<<IOPERAND_MAP_NON_EXISTENT_OPERAND); IDEST2_OPERAND_TYPE = (1<<IOPERAND_TYPE_NON_EXISTENT_OPERAND); ISRC_B_OPERAND_MAP = (1<<INDEX(Rb)); ISRC_B_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC); ISRC_C_OPERAND_MAP = (1<<INDEX(Rc)); ISRC_C_OPERAND_TYPE = (1<<IOPERAND_TYPE_TEX); ISRC_A_OPERAND_MAP = (1<<INDEX(Ra)); ISRC_A_OPERAND_TYPE = (1<<IOPERAND_TYPE_SURFACE_COORDINATES);

This sample for suatom instruction. Here destination has single operand so DEST2 marked with NON_EXISTENT_OPERAND. Unfortunately properties has couple of serious drawbacks:

1) they were cut out by paranoid NVidia somewhere in version 12.7-12.8, so I ripped MDs with properties up to sm90 - sm100, sm101 & sm120 don't have them. I also tried to re-apply properties from sm90 to 3 remained - but this is very unreliable

2) they are not complete. Lets see couple of samples

Дальше »

пятница, 27 июня 2025 г.

curse of IMAD

Found strange case while disassembly some forms of IMAD (btw raison d'être of GPU). Official nvdisasm shows:

IMAD.WIDE R2, R7, R6, c[0x0][0x168] ; /* 0x00005a0007027625 */

my nvd:

; IMAD line 63362 n 1196 15 render items 1 missed: wide /*40*/ IMAD R2,P7,R7,R6,c[0][0x168] &req={0};

Problem here not only missed P7 - at least it has default value:

CLASS "imad_wide__RRC_RRC" FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /WIDEONLY:wide /FMT("S32"):fmt Register:Rd ','Predicate("PT"):Pu ','Register:Ra {/REUSE("noreuse"):reuse_src_a} ','Register:Rb {/REUSE("noreuse"):reuse_src_b} ',' [-] C:Sc[UImm(5/0*):Sc_bank]* [SImm(17)*:Sc_addr]

Both P7 & PT has the same value 7 (and btw wide does not have corresponding encoding field). Mask for this instruction ends with "011000100101" - 0x5

Main problem is that IMAD with form Reg, Reg, Reg has another mask:

CLASS "imad__RRC_RRC" FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /LOOnly("LO"):wide /FMT("S32"):fmt Register:Rd ','Register:Ra {/REUSE("noreuse"):reuse_src_a} ','Register:Rb {/REUSE("noreuse"):reuse_src_b} ',' [-] C:Sc[UImm(5/0*):Sc_bank]* [SImm(17)*:Sc_addr]

mask ends with "011000100100" - 0x4

As you can see original instruction bytes is 0x00005a0007027625 - nvdisasm just produced incorrect output

Why this happens? I have hypothesis that Nvidia just don't have own official sass asm and so output of nvdisasm never used/verified

воскресенье, 15 июня 2025 г.

nvdisasm sass parser

Having sass assembler it seems like easy task to make parser for it. So I made parser of nvdisasm output

Lets check some samples:

SHF.R.S32.HI R209, RZ, 0x2, R209 ;

Looks like easy application of LL(1) parser - you first select instruction, then process it's optional enums (separated by dots) and then just try to match operands separated by commas, right? Hwell, no - grammar of sass is not regular and we can have lots of quirky cases

Instruction names with '.'

It's perfectly legal to meet instructions "UIADD3" & "UIADD3.64". And they have different encodings and even not marked as ALTERNATE

Pseudo opcodes

We can observe totally non-distinguishable enum

PSEUDO_OPCODE "nopseudo_opcode"=0 , "SHL"=0 , "ISCADD"=0 , "IADD"=0 , "MOV"=0;

and samples of using:

Opcode /LOOnly("LO"):wide /PSEUDO_OPCODE("nopseudo_opcode"):pseudo_opcode

Btw operand pseudo_opcode don't even have corresponding encoding field. In essence instructions like IMAD.IADD, IMAD.MOV & IMAD.SHL have exactly the same encoding form. I don't know how nvdisasm selects PSEUDO_OPCODE - probably they borrowed hallucination generator from chatgpt

Enums can contain '.' too

Yes - enum names can be something like SR_CTAID.X, SR_CTAID.Y & SR_CTAID.Z

Operands not always separated with ','

BRX R2 -0x110 (*"INDIRECT_CALL"*)

nvidasm can't show some fields

especially batch & pm_pred. Typical instructions tail looks like:

$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$

and nvdisasm output contains only &wr=0x1 for WR, &rd=0x2 for RD and ?something for USCHED_INFO

Results

SM	parsing rate	avg forms
5	1.0	1.0
55	1.0	1.0
57	1.0	1.0
70	1.0	1.002404
75	1.0	1.018318
86	1.0	1.0
90	1.0	1.001589
100	1.0	1.016845
120	1.0	1.000225

Source of ambiguity

Lets run pa with options -Ssv to dump original text and all matched forms. We can see something like:

BAR.SYNC.DEFER_BLOCKING 0x0 
2 forms:
 19342 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm E:Rc.D(255) req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)
 19286 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm ,Sc:UImm req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)

The first form has additional register operand with default value 255 and second has yet another UImm operand Sc with default value 0 (UImm(12/0)*:Sc) - so they cannot be distinguished

пятница, 30 мая 2025 г.

nvidia sass assembler

I am very skeptical about patching of existing .cubin files - it requires too much book-keeping. Let's say we want to insert several additional instructions into some function - then we need

extend section containing code for those function by patching sections table
patch symbols table/relocs
disasm whole function and build code-flow graph for all instructions in function
fix offsets for jumps
fix attributes like EIATTR_INDIRECT_BRANCH_TARGETS & EIATTR_JUMPTABLE_RELOCS
and so on

While points 1-2 can be implemented with ELF patching libraries like elftools it is anyway too much tedious labour

For example CuAssembler prefers to create new .cubin files from scratch. In any case we need some engine to generate sass instructions and this task is perfectly achieve-able when you have ready disassembler. So I add to my sass disasm engine some primary features for code generation:

dictionary of all instructions for given SM - method INV_disasm::get_instrs
for each instruction add encoders describing how to put values for fields, tables, constant banks & scheduling

As illustration I've implemented interactive sass assembler (with some help of readline for auto-completion)

Дальше »

воскресенье, 4 мая 2025 г.

nvidia sass latency tables

It seems that latency values are the best kept secret - I was able to find only article in internet and author didn't provided any code to decipher those tables. So

Disclaimer

All of the following are the shaky conclusions of my dark mind, almost certainly false and having no connection to reality

How they are look like

Descriptions of latency tables are located in files *_2.txt and look like

TABLE_OUTPUT(UGPR) : UDP_subset`{URd @URdRange,URd2 @URd2Range}
                      R2UR_S2UR`{URd @URdRange,URd2 @URd2Range}
                       OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range}
                        ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range}=
{
    UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
    R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
    OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
    ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
};

Дальше »

пятница, 18 апреля 2025 г.

nvidia sass disassembler, part 7: dual issued instructions

Previous parts: 1, 2, 3, 4, 5 & 6

As you could notice genuine nvdisasm put couple of instructions in curly braces for old sm (always 88bits). So I finally realized how those dual issued instructions are selected - the first one must have USCHED_INFO eq 0x10 (floxy2)

Interesting note that more new sm (since 70) missed 0x10:

 W15EG=15,
 WAIT15_END_GROUP=15,
 W1=17,
 trans1=17,

results

Дальше »

пятница, 11 апреля 2025 г.

nvidia sass disassembler, part 6: predicates

Previous parts: 1, 2, 3, 4 & 5

Lets check how pairs of instructions are chained together - this information stored in MD files with prefix _2.txt - for example from sm90_2.txt:

CONNECTOR CONDITIONS RaRange = (((((MD_PRED(ISRC_A_SIZE)) >= (1)) ? (MD_PRED(ISRC_A_SIZE)) : (1)) - 1) >> 5) + 1;

What is ISRC_A_SIZE? They are so called PREDICATES of instruction:

PREDICATES
 IDEST_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_B_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_C_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_A_SIZE = 32 + ((e==`E@E))*32;

So their values depend on instruction fields, like:

BITS_1_72_72_e=e
BITS_3_75_73_sz=sz

How we can convert this rules to C++? Well, they already almost have C++ syntax, we need to patch two things:

extract values of all used fields (in this case e & sz)
replace `ENUM@VALUE with numerical value of enum. Perl allows do this using cool regex modifier /e

So rule for ISRC_A_SIZE can be rewritten as:

int e = (int)e_iter->second; // extract value of e field return 32 + ((e==1))*32;

Bcs enum E described as E "noe"=0 , "E"=1;

results

I've add option -p to my disasm to dump predicates:

> LDCU.128 UR16,c:[0][URZ+0x3D0] &0 &0 ?trans1 ?NOP ?PMN
P> ILABEL_URa_SIZE: 32
P> ISRC_A_SIZE: 32
P> IDEST_SIZE: 128

пятница, 4 апреля 2025 г.

ptx instructions emitting by nvidia compiler. part 2

Part 1 described v10

And today let's check cicc v12. The first thing that catches your eye is its size - almost 76Mb! And it also contains at least 5 different decryptors - Nvidia really wants to hide something from its grateful clients

Why it is so fat?

Bcs it contains at least 4 code generators: for arm32, aarch64, x86 & nvptx

+ at least 27 llvm bytecode blobs (signature 0x42 0x43 0xc0 0xde) - they contains mostly bodies of intrinsic functions like nvvm_mulq/nvvm_divq but on some llvm-dis just crashes:

#0 0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339 1339 unsigned TableVal = IIT_Table[id-1]; >>> where #0 0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339 #1 0x000055abcc5fe41f in UpgradeIntrinsicFunction1 (F=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/include/llvm/IR/Function.h:204 #2 0x000055abcc60111a in llvm::UpgradeIntrinsicFunction (F=F@entry=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/lib/IR/AutoUpgrade.cpp:1226 #3 0x000055abcc584778 in (anonymous namespace)::BitcodeReader::globalCleanup (this=0x55abce608e30) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:3696 #4 0x000055abcc5856cc in (anonymous namespace)::BitcodeReader::parseModule (this=<optimized out>, ResumeBit=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:4385 #5 0x000055abcc5959ca in (anonymous namespace)::BitcodeReader::parseBitcodeInto (Callbacks=..., IsImporting=false, ShouldLazyLoadMetadata=false, M=0x55abce5f3d80, this=0x55abce608e30) at /usr/include/c++/9/bits/std_function.h:564 #6 llvm::BitcodeModule::getModuleImpl (this=<optimized out>, Context=..., MaterializeAll=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:7981 #7 0x000055abcc596070 in llvm::BitcodeModule::getLazyModule (this=0x7ffc60415d10, Context=..., ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /usr/include/c++/9/bits/std_function.h:263 #8 0x000055abcc550e49 in main (argc=<optimized out>, argv=<optimized out>) at /home/redp/disc/src/llvm-project/llvm/include/llvm/Support/CommandLine.h:1399

>>> p id $1 = 0

At least they should check that index can become negative, no? Who would doubt that llvm is very reliable and secure

So if they process llvm ByteCode then they also must link half of llvm run-time to do it, but they also use

Дальше »

воскресенье, 30 марта 2025 г.

ptx instructions emitting by nvidia compiler

I recently became curious what exactly ptx instructions can produce nvidia compiler - like if it uses something totally undocumented or vice versa - some official ptx instructions are never generated during compilation

The first thing is where those compiler located - no, it's not nvcc. Real compiler is cicc from packet cuda-nvvm. cicc from v10 has size 21Mb. The strings utility shows many interesting things, like

Portions Copyright (c) 1988-2016 Edison Design Group, Inc.
Portions Copyright (c) 2007-2016 University of Illinois at Urbana-Champaign.
Based on Edison Design Group C/C++ Front End

So they use front-end from Edison Design Group and llvm as back-end

Then I extracted several tables:

compiler errors
list of built-in functions with prototypes
list of llvm attributes - as you can see they are mostly correspond to nvvm LLVM dialect
and finally what I looked for - list with internal instruction names & their bodies to place into PTX file

As you can see mapping is very straightforward - for example for instruction BFE_S32rii (index 0x27) generating PTX bfe.s32

Results

I found only minor PTX instructions not presented in their official documentations: suq.xxx - perhaps should mean surface qword or something like this

среда, 26 марта 2025 г.

nvidia sass disassembler, part 5

Previous parts: 1, 2, 3 & 4

I've finally add native rendering for instructions - actually just rewrite from perl terrible function make_inst. Because in output typically rendering only small fraction of instructions data for formats are filling by demand via std::call_once. Results to compare with genuine nvdisasm:

mine	nvdisasm
LDC R1,c:[0][0x37C] LDCU.64 UR8,c:[0][URZ+0x440] LDC R16,c:[0][0x3B8] LDCU.64 UR12,c:[0][URZ+0x448] LDCU UR4,c:[0][URZ+0x3AC] LDC._64 R4,c:[0][0x450] LDCU.64 UR14,c:[0][URZ+0x380] LDCU.64 UR10,c:[0][URZ+0x358] HFMA2 R13,-RZ,RZ, 1.875000, 0.000000 ISETP.NE.S64.AND P2,PT,RZ,UR8,PT	LDC R1, c[0x0][0x37c] LDCU.64 UR8, c[0x0][0x440] LDC R16, c[0x0][0x3b8] LDCU.64 UR12, c[0x0][0x448] LDCU UR4, c[0x0][0x3ac] LDC.64 R4, c[0x0][0x450] LDCU.64 UR14, c[0x0][0x380] LDCU.64 UR10, c[0x0][0x358] HFMA2 R13, -RZ, RZ, 1.875, 0 ISETP.NE.S64.AND P2, PT, RZ, UR8, PT

IMHO very similar, has some minor problems with formatting of floating point values (I used FP16 to extract 16bit values but don't know what means E8M7Imm in format descriptor)

So the next thing to show is

labels for branches

As I mentioned you can identify instruction as branches via it's PROPERTIES, get value in BRANCH_TARGET_INDEX and render it as label address. There are two problems:

size of branch offset vary in size - it can be 58bit for sm_90, 50bit for sm_75, 24 for sm_3 and so on
branch offset is signed value, so we need some method to detect that some value of known bit size is negative

Дальше »

пятница, 21 марта 2025 г.

nvidia sass disassembler, part 4

I've made native sass disasm - just adding c++ codegen (can be produced by ead.pl with -C option). It works via dynamic loading of right disasm module - see list of supported architectures in map s_sms. For now it supports only operands dump with -O option - not rendered yet (bcs rewriting bunch of perl code with duck-types to C++ is boring and tedious work). Also you can dump attributes with -e option. You can make those modules with something like "make sm90.so". Btw dumb gcc allocates for local vars ~600kb on stack and with -Os option it compiles each module for 10 minutes with stack consumption shrink to normal values)

Tests show zero unrecognized instructions (and I am truly proud of this), however if you will find such - I also add option -N to dump it's content to bit-mask, which you then can pass to ead.pl with the same -N option to see what happened

On the other side it seems that nvidia trying to hide something important from us - let's check libcublas.so from v12 - we can notice lots of sections

.nv.merc.nv.info - genuine nvdiasm unable to show their content
.nv.capmerc.text - however, the instructions they contain are clearly in some other format and cannot be disassembled - I add -s option to disasm single section by it's index, so you can try it by yourself
and they obviously has corresponding relocs in sections .nv.merc.rela.text
and even .nv.merc.rela.debug_frame & .nv.merc.symtab

Known problems

Дальше »

пятница, 14 марта 2025 г.

nvidia sass disassembler, part 3

It looks like this rabbit hole goes much deeper

Some const banks does not have ConstBankAddressX:

CX:Sb[UniformRegister:URb][UImm(16)*:Sb_offset]
BITS_6_37_32_Ra_URb=URb
BITS_14_53_40_Sb_offset=Sb_offset SCALE 4

Btw there is no encoding for field Sb

Next they have desc memory:
DESC:memoryDescriptor[UniformRegister:Ra_URb][Register:Ra /ONLY64:input_reg_sz_64_dist + SImm(24/0)*:Ra_offset]

genuine nvdisasm shows them like LDG.E.U16.CONSTANT R10, desc[UR8][R2.64], my disasm as LDG.E.U16.CONSTANT ,R10,desc[UR8][R2.64 + 0x0]

And finally we also have:

A:srcAttr[ UniformRegister:URa + SImm(11/0)*:URa_offset ]

GMMA:gdesc[ UniformRegister:URb ]

TMA:desc[ UniformRegister:URe ]

TTU:ttuAddr[ UImm(16)*:ImmU16 ]

RF:indexURb[UniformRegister:URb] ','UImm(4/0xf)*:PixMaskU04

TMEMA:tmemA[ UniformRegister:URa ]

TMEM perhaps means "tensor memory" and I have no idea about the rest of the prefixes

понедельник, 10 марта 2025 г.

nvidia sass disassembler, part 2

Lets continue explore "machine descriptions" - in this time try to understand how to make format output more similar to genuine nvdisasm

For example format for one of variant I2F looks like:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /Float64:dstfmt /SRCFMT_U16_S16:srcfmt /Round1("RN"):rnd
Register:Rd
','C:Sb[UImm(5/0*):Sb_bank]*   [SImm(17)*:Sb_addr] /HSEL("H0"):hsel
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$ ;

...

ENCODING
!i2f_Rd64__Cb_16b_unused;
BITS_3_14_12_Pg = Pg;
BITS_1_15_15_Pg_not = Pg@not;
BITS_13_91_91_11_0_opcode=Opcode;
BITS_3_77_75_sz=*dstfmt;
BITS_3_85_84_74_74_srcfmt=*srcfmt;
BITS_2_79_78_stride=rnd;
BITS_8_23_16_Rd=Rd;

BITS_5_58_54_Sb_bank,BITS_14_53_40_Sb_offset =  ConstBankAddress2(Sb_bank,Sb_addr);
BITS_2_61_60_hsel=hsel;
BITS_6_121_116_req_bit_set=req_bit_set;
BITS_3_115_113_src_rel_sb=VarLatOperandEnc(src_rel_sb);
BITS_3_112_110_dst_wr_sb=VarLatOperandEnc(dst_wr_sb);
BITS_2_103_102_pm_pred=pm_pred;
BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);

Дальше »

четверг, 6 марта 2025 г.

nvidia sass disassembler

Couple weeks ago I made decryptor to extract from nvdisasm so called "machine descriptions" (MD) (btw nvdisasm v12 uses lz4 compression library, so I made yet another decryptor + results). And after that I became extremely curious whether it was possible to make full SASS disassembler - sure format of those MDs even more undocumented than syntax of PTX - but anyway it's much better than having no documentation about ISA at all

First and most important thing to check is width of instruction - it can be

64bit for sm37 (Kepler) and more old
88bit for sm5x (Maxwell) until sm70
128bit since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)

I have a very limited imagination so I couldn't imagine how hardware could support alignment for 11 byte instruction just bcs 11 is not power of 2. So after some magic with debugger I found following code snippet:
lea eax, [r9+r9*4] ; eax = r9 * 5 lea ecx, [r9+rax*4] ; ecx = r9 * 21 mov eax, 1FFFFFh ; 17bit mask
and then some search in google revealed this document:

On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions

So 88bit became 4 64bit qwords where first is Control qword and 3 remaining are instructions, then 21 + 64 = 85bit - very close to 88

Note: such martian architecture makes it impossible to create IDA Pro processor module for Maxwell and more old GPUs - bcs IDA expects that instruction at any properly aligned address should be valid, and you just don't know there Control qword is located for block of instructions

Lets check how looks description of each instruction (from here onwards I will refer to sm90 MD)

Дальше »

понедельник, 24 февраля 2025 г.

ptx internals

It seems that syntax of PTX is undocumented - at least I was unable to find actual BNF grammar or reference implementation. Grammar from antlr project is greatly out-dated and don't contain instructions like ldmatrix/istypep (and btw it has 3505 lines - much bigger than c++14 syntax)

Another project is much better however it's also incomplete. Official documentation is most vague & obscure reading for last years (previous champion) - it contains word "deprecated" 24 times. So it's time to do some disassembling

First remarkable thing is lots of high-entropy parts of data in .rodata section - this is glare sign of encryption, so I made decryptor. After looking at encrypted data we can discover several undocumented features

Дальше »

вторник, 25 ноября 2025 г.

четверг, 13 ноября 2025 г.

понедельник, 10 ноября 2025 г.

Building of CFG

вторник, 28 октября 2025 г.

Barriers

пятница, 17 октября 2025 г.

ELF::FatBinary

Cubin::Ced

Cubin::Attrs

Sample

среда, 1 октября 2025 г.

cuda-gdb

суббота, 13 сентября 2025 г.

PTX

четверг, 7 августа 2025 г.

Perl binding

Limitation

Some results

среда, 23 июля 2025 г.

суббота, 19 июля 2025 г.

пятница, 18 июля 2025 г.

nvdisasm lies

lack of documentation

ptxas produces code that is far from perfect

понедельник, 7 июля 2025 г.

Uniform registers

Wide loading

четверг, 3 июля 2025 г.

пятница, 27 июня 2025 г.

воскресенье, 15 июня 2025 г.

Instruction names with '.'

Pseudo opcodes

Enums can contain '.' too

Operands not always separated with ','

nvidasm can't show some fields

Results

Source of ambiguity

пятница, 30 мая 2025 г.

воскресенье, 4 мая 2025 г.

Disclaimer

How they are look like

пятница, 18 апреля 2025 г.

results

пятница, 11 апреля 2025 г.

results

пятница, 4 апреля 2025 г.

Why it is so fat?

воскресенье, 30 марта 2025 г.

Results

среда, 26 марта 2025 г.

labels for branches

пятница, 21 марта 2025 г.

Known problems

пятница, 14 марта 2025 г.

понедельник, 10 марта 2025 г.

четверг, 6 марта 2025 г.

понедельник, 24 февраля 2025 г.

вторник, 25 ноября 2025 г.

четверг, 13 ноября 2025 г.

понедельник, 10 ноября 2025 г.

вторник, 28 октября 2025 г.

пятница, 17 октября 2025 г.

среда, 1 октября 2025 г.

суббота, 13 сентября 2025 г.

четверг, 7 августа 2025 г.

среда, 23 июля 2025 г.

суббота, 19 июля 2025 г.

пятница, 18 июля 2025 г.

понедельник, 7 июля 2025 г.

четверг, 3 июля 2025 г.

пятница, 27 июня 2025 г.

воскресенье, 15 июня 2025 г.

пятница, 30 мая 2025 г.

воскресенье, 4 мая 2025 г.

пятница, 18 апреля 2025 г.

пятница, 11 апреля 2025 г.

пятница, 4 апреля 2025 г.

воскресенье, 30 марта 2025 г.

среда, 26 марта 2025 г.

пятница, 21 марта 2025 г.

пятница, 14 марта 2025 г.

понедельник, 10 марта 2025 г.

четверг, 6 марта 2025 г.

понедельник, 24 февраля 2025 г.