Friday, July 18, 2025

SASS instructions: register tracking

I've added register tracking to both nvd & pa; you can enable it with the -T option. And I have lots of bad news

nvdisasm lies

Yup, again. Let's check this innocent-looking code:
CS2R R100, SRZ
You might assume that it copies a value from a special register into a single 32-bit regular register. Actually it does a wide load and writes the value to both R100 & R101, because the MD of CS2R looks like this:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz
Register:Rd
','SpecialRegister:SRa

PREDICATES
 IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;

QInteger has a default value of "64", so it is omitted from the output. OK, maybe this is because special registers are 64-bit? Let's check the MD for S2UR, which stores a value to a uniform register:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode
UniformRegister:URd

PREDICATES
 IDEST_SIZE = 32;

It has no width modifier at all, and the destination size is simply 32. So CS2R is 64-bit by default and S2UR is 32-bit. Srsly? Is that supposed to be obvious? Most likely the person who came up with this is now inventing mnemonic names for Intel
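To make the width rule concrete, here is a minimal Python sketch of how a tracker could expand a destination operand into the registers that are actually written. The function names and the way the modifier is passed (a plain string defaulting to "64") are assumptions made for illustration, not code from nvd or pa; only the IDEST_SIZE formula comes from the CS2R MD above.

```python
def cs2r_dest_size(sz="64"):
    # IDEST_SIZE = 32 + (sz == "64") * 32, as in the CS2R PREDICATES section
    return 32 + (sz == "64") * 32

def written_regs(base, size_bits):
    # a destination wider than 32 bits occupies consecutive registers
    return ["R%d" % (base + i) for i in range(size_bits // 32)]

# default width: CS2R R100, SRZ touches two registers
print(written_regs(100, cs2r_dest_size()))
# explicit 32-bit width (assumed modifier value "32") touches only one
print(written_regs(100, cs2r_dest_size("32")))
```

With the default modifier the destination list has two entries, which is exactly what the disassembly text hides.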

 

lack of documentation

In essence, all we have is this short list of instructions and chapter 8.7 of the ancient "The CUDA Handbook" (published in 2013, btw). Properties & predicates make things a little easier to understand. Unfortunately, they only contain info about regular and uniform registers, and there are several more classes of instructions that work with other kinds of registers

Predicates

MD for ISETP:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex
Predicate:Pu
','Predicate:Pv

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

and for HSETP2:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ftz /Bop:bop
Predicate:Pu
','Predicate:Pv

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

I don't know whether they set only their first predicate Pu or both Pu & Pv. Btw, the famous IMAD has a very curious MD for some forms:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp

Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should the semantics be Rd = Ra * Rb * Pu + Rc * Pp?

Barriers

MD for BMOV: 

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz
BD:barReg
','CBU_STATE_NONBAR:cbu_state

PREDICATES
 IDEST_SIZE = 0;
 IDEST2_SIZE = 0;

Enum BD is described as
BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;
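The enum is listed in no particular order, so it is easier to see what it covers after parsing it. Below is a small Python sketch; parse_enum is a hypothetical helper written for this example, not something taken from nvd or pa. It checks that BD is simply the 16 barrier registers B0..B15, each encoded by its own index.

```python
import re

BD = ('"B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , '
      '"B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , '
      '"B9"=9 , "B13"=13')

def parse_enum(text):
    # every entry has the shape "NAME"=value
    pairs = re.findall(r'"(\w+)"\s*=\s*(\d+)', text)
    return {name: int(val) for name, val in pairs}

barriers = parse_enum(BD)
# every name Bn encodes n, and the values cover 0..15 without gaps
assert all(name == "B%d" % val for name, val in barriers.items())
assert sorted(barriers.values()) == list(range(16))
print(len(barriers), "barrier registers")
```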
 
I am sure that there are more...


ptxas produces code that is far from perfect

Disclaimer: I ripped all examples from NVIDIA TensorRT for sm120. Older versions paint a much sadder picture

Never used registers

Usually loaded in the function prologue:

LDC R1, c[0x0][0x17c]

Tracking shows that R1 is never used inside the function. Such registers are typically loaded from the undocumented gap between the function arguments and CBank. Anyway, this decreases the number of available registers and looks suspicious
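The check itself is simple once each instruction has been split into written and read registers. Here is a minimal Python sketch under loud assumptions: the (mnemonic, writes, reads) tuple layout is made up for this example, the IADD and STG entries are invented filler around the real LDC, and the scan is flat (it ignores control flow, predicates and wide registers). It is not the actual -T implementation.

```python
def never_read(instrs):
    # instrs: list of (mnemonic, written registers, read registers)
    written, read = set(), set()
    for _, dst, src in instrs:
        written.update(dst)
        read.update(src)
    # registers that receive a value but never feed any instruction
    return sorted(written - read)

func = [
    ("LDC",  ["R1"], []),            # LDC R1, c[0x0][0x17c]
    ("IADD", ["R8"], ["R2", "R8"]),  # illustrative filler
    ("STG",  [],     ["R4", "R8"]),  # illustrative filler
]
print(never_read(func))
```

For this toy function the only register reported is R1, the one loaded in the prologue.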
 
Bad register liveness
mov r154, r3
@p0 bra ; to some block with EXIT
mov r154, r3 ; srsly?
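A repeated copy like this can be found by remembering which copies are still valid. The Python sketch below is an illustration only: the (op, dst, src) tuple layout is an assumption, and it only follows the fall-through path, assuming no label (join point) sits between the instructions.

```python
def redundant_movs(instrs):
    # instrs: list of (op, dst, src); dst and src may be None
    copies = {}  # dst -> src for copies that still hold
    found = []
    for idx, (op, dst, src) in enumerate(instrs):
        if op == "mov" and copies.get(dst) == src:
            found.append(idx)  # same copy is already in effect
            continue
        if dst is not None:
            # a write to dst kills every remembered copy that involves it
            copies = {d: s for d, s in copies.items() if d != dst and s != dst}
            if op == "mov":
                copies[dst] = src
    return found

code = [
    ("mov", "r154", "r3"),
    ("bra", None, None),    # @p0 bra, falls through when p0 is false
    ("mov", "r154", "r3"),
]
print(redundant_movs(code))
```

The returned list holds the indexes of the movs that could be dropped, here the third instruction.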
 
Poor expressions optimization

iadd r8, r2, r8 ; r8 = r2 + r8
iadd r8, r8, r8 ; r8 = r8 + r8
iadd r8, r8, ur4 ; r8 = r8 + ur4
The last two instructions can be replaced with a single one:

iadd3 r8, r8, ur4, r8
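To double-check that the replacement is equivalent, here is a quick Python sketch that models both sequences with 32-bit wraparound and compares them on random inputs. The function names are made up for this example, and it assumes plain integer adds with no carry or predicate outputs involved.

```python
import random

MASK = 0xFFFFFFFF  # registers are 32 bit, so arithmetic wraps around

def original(r2, r8, ur4):
    r8 = (r2 + r8) & MASK   # iadd r8, r2, r8
    r8 = (r8 + r8) & MASK   # iadd r8, r8, r8
    r8 = (r8 + ur4) & MASK  # iadd r8, r8, ur4
    return r8

def with_iadd3(r2, r8, ur4):
    r8 = (r2 + r8) & MASK        # iadd r8, r2, r8
    r8 = (r8 + ur4 + r8) & MASK  # iadd3 r8, r8, ur4, r8
    return r8

random.seed(0)
for _ in range(10000):
    args = [random.getrandbits(32) for _ in range(3)]
    assert original(*args) == with_iadd3(*args)
print("sequences agree")
```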

And so on
