Friday, July 18, 2025

SASS instructions: register tracking

I've added register tracking to both nvd & pa; you can enable it with the -T option. And I have lots of bad news

nvdisasm lies

Yup, again. Let's check this innocent-looking code:
CS2R R100, SRZ
You might assume that it stores a value from a special register into a single 32-bit regular register. Actually it does a wide load and stores the value to R100 & R101, because the MD of CS2R looks like this:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz
Register:Rd
','SpecialRegister:SRa

PREDICATES
 IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;

QInteger has the default value "64", so it is omitted from the output. OK, maybe this is because special registers are 64-bit? Let's check the MD for S2UR, which stores a value to a uniform register:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode
UniformRegister:URd

PREDICATES
 IDEST_SIZE = 32;

It has no width modifier at all, and the destination size is simply 32. So CS2R is 64-bit by default and S2UR is 32-bit. Srsly? Is that supposed to be obvious? Most likely this person is now inventing mnemonic names for Intel
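To make the difference concrete, here is a minimal Python sketch of how the two IDEST_SIZE predicates above translate into the set of registers an instruction writes. The function names are mine, and the rule "one 32-bit register per 32 bits of destination, starting at Rd" is an assumption based on the CS2R behaviour described above; this is not code from nvd or pa.

```python
def cs2r_dest_size(sz="64"):
    # IDEST_SIZE = 32 + (sz == "64") * 32; "64" is the default QInteger value
    return 32 + (sz == "64") * 32

def s2ur_dest_size():
    # S2UR has no width modifier: IDEST_SIZE = 32
    return 32

def written_regs(prefix, first, size_bits):
    # assumption: one 32-bit register per 32 bits, starting at `first`
    return [f"{prefix}{first + i}" for i in range(size_bits // 32)]

print(written_regs("R", 100, cs2r_dest_size()))      # default width: ['R100', 'R101']
print(written_regs("R", 100, cs2r_dest_size("32")))  # explicit 32-bit width: ['R100']
print(written_regs("UR", 4, s2ur_dest_size()))       # UR4 is an arbitrary example
```

The point is that the printed text of the instruction alone is not enough; a tracker has to evaluate the predicate with the default modifier values filled in.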

 

lack of documentation

In essence we have this short list of instructions and chapter 8.7 of the ancient "The CUDA Handbook" (published in 2013, btw). Properties & predicates make things a little easier to understand. Unfortunately they only contain info about regular and uniform registers, and there are several more classes of instructions that work with other kinds of registers

Predicates

MD for ISETP:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex
Predicate:Pu
','Predicate:Pv

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

and for HSETP2:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ftz /Bop:bop
Predicate:Pu
','Predicate:Pv

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

I don't know whether they set only their first predicate Pu or both Pu & Pv. Btw, the famous IMAD has a very curious MD for some forms:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp

Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should its semantics be Rd = Ra * Rb * Pu + Rc * Pp?

Barriers

MD for BMOV: 

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz
BD:barReg
','CBU_STATE_NONBAR:cbu_state

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

Enum BD is described as
BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;
 
I am sure that there are more...


ptxas produces code that is far from perfect

Disclaimer: I ripped all examples from NVIDIA TensorRT for sm120. Older versions paint a much sadder picture

Never used registers

They are usually loaded in the function prologue:

LDC R1, c[0x0][0x17c]

Tracking shows that R1 is never used inside the function. These loads typically read from the undocumented gap between the function arguments and the CBank. Anyway, this decreases the number of available registers and looks suspicious
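The check itself is simple. Below is a minimal Python sketch, assuming each instruction has already been reduced to a pair of (written registers, read registers); it is a flow-insensitive illustration, not the actual -T implementation. Only the first entry corresponds to the LDC above; the other instructions are hypothetical filler.

```python
def never_read(instrs):
    """instrs: list of (dests, srcs) register-name tuples in program order."""
    written, read = set(), set()
    for dests, srcs in instrs:
        read.update(srcs)
        written.update(dests)
    # registers that are written somewhere but never appear as a source
    return sorted(written - read)

prog = [
    (("R1",), ()),        # LDC R1, c[0x0][0x17c]
    (("R2",), ()),        # hypothetical: some load into R2
    (("R3",), ("R2",)),   # hypothetical: R3 computed from R2
    ((), ("R3",)),        # hypothetical: store of R3
]
print(never_read(prog))  # ['R1']
```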
 
Bad register liveness
mov r154, r3
@p0 bra ; to some block with EXIT
mov r154, r3 ; srsly?
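A repeated copy like this can be caught with a very small pass. Here is a minimal Python sketch, assuming the instructions are reduced to hypothetical (opcode, destination, source) triples and that no other block jumps in between the two movs, so the fallthrough path after the branch still sees the first copy. It only illustrates the idea and is not how ptxas or my tracker works.

```python
def redundant_movs(instrs):
    """Return indexes of `mov dst, src` that repeat a copy still known to hold."""
    known = {}   # dst -> src for copies that are still valid
    hits = []
    for i, (op, dst, src) in enumerate(instrs):
        if op == "mov" and known.get(dst) == src:
            hits.append(i)
            continue
        if dst is not None:
            # any write to dst invalidates every copy that involves it
            known = {d: s for d, s in known.items() if d != dst and s != dst}
            if op == "mov":
                known[dst] = src
    return hits

code = [
    ("mov", "r154", "r3"),
    ("bra", None, None),    # @p0 bra: writes no registers on the fallthrough path
    ("mov", "r154", "r3"),
]
print(redundant_movs(code))  # [2]
```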
 
Poor expressions optimization

iadd r8, r2, r8 ; r8 = r2 + r8
iadd r8, r8, r8 ; r8 = r8 + r8
iadd r8, r8, ur4 ; r8 = r8 + ur4
The last two instructions can be replaced with a single one:

iadd3 r8, r8, ur4, r8
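A quick sanity check of this rewrite in Python, assuming both iadd and iadd3 are plain additions modulo 2^32 with no carry outputs involved (the function names are just for this sketch):

```python
import random

MASK = 0xFFFFFFFF  # registers are 32 bits wide

def two_iadds(r8, ur4):
    r8 = (r8 + r8) & MASK      # iadd r8, r8, r8
    return (r8 + ur4) & MASK   # iadd r8, r8, ur4

def one_iadd3(r8, ur4):
    return (r8 + ur4 + r8) & MASK  # iadd3 r8, r8, ur4, r8

for _ in range(1000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    assert two_iadds(a, b) == one_iadd3(a, b)
print("equivalent on 1000 random inputs")
```

Both forms compute 2 * r8 + ur4, so the wraparound behaviour is identical as well.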

And so on

3 comments:

  1. > LDC R1, c[0x0][0x17c]

    c[0x0][0x17c] should be the starting address for local memory. On Ampere/sm_80, it's c[0x0][0x28].

    I believe the convention they use is that R1 always stores this address, and local memory access is typically offset from R1. They likely do this to warm up constant cache at the start of kernel, since it's always the first instruction.

    I've been able to get ptx to generate a kernel without this instruction, but it was a toy example.

  2. The problem is that the offset (17c here) is different on each SM; for example, on sm120 it is 37c

  3. At least for sm3 the hypothesis is incorrect: functions from the ancient libcublas.so.7.5.18 have prologues like
    /*0060*/ LDC.64 R36, c[0x0][0x190];
    /*0068*/ LDC.64 R44, c[0x0][0x198];
    /*0070*/ LDC R37, c[0x0][0x1a0];
    /*0078*/ LDC R45, c[0x0][0x1a4];
    /*0088*/ LDC R1, c[0x0][0x164];
    /*0090*/ LDC R6, c[0x0][0x168];
    /*0098*/ LDC R5, c[0x0][0x158];
    and nvcc for sm3 generates something like
    /*0008*/ MOV R1, c[0x0][0x44];
