windows deep internals: sass instructions: registers tracking

Friday, July 18, 2025

sass instructions: registers tracking

I've added register tracking to both nvd & pa; you can enable it with the -T option. And I have lots of bad news.

nvdisasm lies

Yup, again. Let's check this innocent-looking code:

CS2R R100, SRZ

You might assume that it stores a value from a special register into a single 32-bit regular register. Actually it uses a wide load and stores the value to R100 & R101, because the MD of CS2R looks like

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz Register:Rd ','SpecialRegister:SRa

PREDICATES IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;

QInteger has the default value "64" and so it is omitted in the output. OK, maybe this is because special registers are 64-bit? Let's check the MD for S2UR, which stores a value to a uniform register:

FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode UniformRegister:URd

PREDICATES IDEST_SIZE = 32;

It does not have a width modifier at all, and the destination size is simply 32. So CS2R is 64-bit by default and S2UR is 32-bit. Srsly? Is that supposed to be obvious? Most likely this person is now inventing mnemonic names for Intel.
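To make the difference concrete, here is a minimal Python sketch that evaluates the two IDEST_SIZE predicates quoted above and lists the registers a tracker would have to mark as written. The helper names `idest_size_cs2r` and `written_registers` are made up for illustration, and the assumption that a 64-bit destination occupies Rd and Rd+1 is taken from the R100 & R101 observation above.

```python
def idest_size_cs2r(sz=64):
    """CS2R predicate: IDEST_SIZE = 32 + (sz == 64) * 32 (sz defaults to 64)."""
    return 32 + int(sz == 64) * 32


# S2UR predicate: IDEST_SIZE = 32, there is no width modifier at all.
IDEST_SIZE_S2UR = 32


def written_registers(prefix, index, size_bits):
    """Consecutive 32-bit registers covered by a destination of size_bits.

    Assumption: a wide destination occupies index, index + 1, ...
    """
    return [f"{prefix}{index + i}" for i in range(size_bits // 32)]


# CS2R R100, SRZ with the (omitted) default width
cs2r_default = written_registers("R", 100, idest_size_cs2r())
# the same instruction with an explicit 32-bit width
cs2r_narrow = written_registers("R", 100, idest_size_cs2r(32))
# S2UR writing a hypothetical UR4
s2ur = written_registers("UR", 4, IDEST_SIZE_S2UR)

print(cs2r_default, cs2r_narrow, s2ur)
```

The point is only that the disassembly text of the default form gives no hint that a second register is clobbered; the predicate does.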

lack of documentation

In essence we have this short list of instructions and chapter 8.7 of the ancient "The CUDA Handbook" (published in 2013, btw). Properties & predicates make things a little easier to understand. Unfortunately they only contain info about regular and uniform registers, and there are several more classes of instructions that work with other kinds of registers.

Predicates

MD for ISETP:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex Predicate:Pu ','Predicate:Pv

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

and for HSETP2:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ftz /Bop:bop Predicate:Pu ','Predicate:Pv

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

I don't know whether they set only their first predicate Pu or both Pu & Pv. Btw, the famous IMAD has a very curious MD for some forms:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp

Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should it have the semantics Rd = Ra * Rb * Pu + Rc * Pp?

Barriers

MD for BMOV:

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz BD:barReg ','CBU_STATE_NONBAR:cbu_state

PREDICATES IDEST_SIZE = 0; IDEST2_SIZE = 0;

Enum BD is described as

BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;

I am sure that there are more...
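As a small sanity check, the BD enum quoted above can be parsed mechanically. The sketch below is Python, and the `parse_enum` helper is hypothetical, not part of nvd or pa. It shows that despite the shuffled order the enum is just a 16-entry barrier register file B0..B15, which a tracker could model the same way as regular registers.

```python
import re

# The enum text exactly as it appears in the MD
BD_ENUM = (
    '"B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , '
    '"B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , '
    '"B9"=9 , "B13"=13;'
)


def parse_enum(text):
    """Turn '"NAME"=value , ...' into a {name: value} dictionary."""
    return {name: int(value)
            for name, value in re.findall(r'"(\w+)"\s*=\s*(\d+)', text)}


bd = parse_enum(BD_ENUM)
# Register names ordered by their encoding
ordered = sorted(bd, key=bd.get)

print(len(bd), ordered[0], ordered[-1])
```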

ptxas produces code that is far from perfect

Disclaimer: I ripped all examples from NVIDIA TensorRT for sm120. Older versions show a much sadder picture.

Never used registers

Usually loaded in the prologue of a function:

LDC R1, c[0x0][0x17c]

Tracking shows that R1 is never used inside the function. Such loads typically read from the undocumented gap between the function arguments and CBank. Anyway, this decreases the number of available registers and looks suspicious.
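The check itself is simple once every instruction is reduced to its sets of written and read registers. Below is a minimal Python sketch of the idea, not the actual -T implementation: the `(destinations, sources)` tuple format and the `never_read` helper are assumptions made for illustration, and all instructions in the toy body except the LDC are invented.

```python
def never_read(instructions):
    """Registers that are written somewhere but never read.

    instructions: iterable of (destinations, sources) pairs of register names.
    The result keeps first-write order. Control flow is ignored: a register
    counts as used if any instruction in the function reads it.
    """
    written = []
    read = set()
    for destinations, sources in instructions:
        read.update(sources)
        for reg in destinations:
            if reg not in written:
                written.append(reg)
    return [reg for reg in written if reg not in read]


body = [
    (["R1"], []),        # LDC R1, c[0x0][0x17c]
    (["R2"], []),        # hypothetical load of a function argument
    (["R3"], ["R2"]),    # hypothetical arithmetic on R2
    ([], ["R3"]),        # hypothetical store of R3
]

unused = never_read(body)
print(unused)
```

A real tracker also has to expand wide destinations into register pairs before this check, otherwise the upper half of a 64-bit write looks like a phantom register.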

Bad register liveness

mov r154, r3
@p0 bra ; to some block with EXIT
mov r154, r3 ; srsly?
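A repeated copy like this can be flagged with a tiny forward scan. The Python sketch below is an illustration under strong assumptions: it only follows the fall-through path, it treats a branch as an instruction that writes nothing, and the `(opcode, destination, sources)` tuple format and the `redundant_moves` helper are hypothetical.

```python
def redundant_moves(instructions):
    """Indices of `mov dst, src` repeating a copy that is still valid.

    instructions: list of (opcode, destination or None, [sources]).
    Only straight-line fall-through is modelled; a label that other
    paths can jump to would have to reset the `copies` map.
    """
    copies = {}          # dst -> src for copies known to hold right now
    redundant = []
    for index, (opcode, dst, sources) in enumerate(instructions):
        if opcode == "mov" and copies.get(dst) == sources[0]:
            redundant.append(index)
            continue
        if dst is not None:
            # A write to dst kills copies into dst and copies taken from dst
            copies = {d: s for d, s in copies.items() if d != dst and s != dst}
            if opcode == "mov":
                copies[dst] = sources[0]
    return redundant


code = [
    ("mov", "r154", ["r3"]),
    ("bra", None, ["p0"]),       # @p0 bra to some block with EXIT
    ("mov", "r154", ["r3"]),     # same copy again
]

print(redundant_moves(code))
```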

Poor expressions optimization

iadd r8, r2, r8 ; r8 = r2 + r8
iadd r8, r8, r8 ; r8 = r8 + r8
iadd r8, r8, ur4 ; r8 = r8 + ur4

The last two instructions can be replaced with a single one:

iadd3 r8, r8, ur4, r8
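The rewrite is easy to sanity-check by simulating both sequences. The Python sketch below assumes plain 32-bit wraparound addition with no carry or extended-precision modifiers involved; the function names are made up for illustration.

```python
import random

MASK = 0xFFFFFFFF  # assumption: 32-bit registers with wraparound arithmetic


def three_iadds(r2, r8, ur4):
    """The original sequence of three iadd instructions."""
    r8 = (r2 + r8) & MASK     # iadd r8, r2, r8
    r8 = (r8 + r8) & MASK     # iadd r8, r8, r8
    r8 = (r8 + ur4) & MASK    # iadd r8, r8, ur4
    return r8


def iadd_then_iadd3(r2, r8, ur4):
    """The same computation with the last two adds fused into iadd3."""
    r8 = (r2 + r8) & MASK         # iadd r8, r2, r8
    r8 = (r8 + ur4 + r8) & MASK   # iadd3 r8, r8, ur4, r8
    return r8


rng = random.Random(0)
samples = [tuple(rng.getrandbits(32) for _ in range(3)) for _ in range(1000)]
all_equal = all(three_iadds(*s) == iadd_then_iadd3(*s) for s in samples)

print(all_equal)
```

Random testing is not a proof, but here the equivalence also follows directly from associativity of addition modulo 2^32.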

And so on
