I've added tracking of registers to both nvd & pa - you can use the -T option. And I have lots of bad news
nvdisasm lies
CS2R R100, SRZ
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz
Register:Rd
','SpecialRegister:SRa
PREDICATES
IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;
FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode
UniformRegister:URd
PREDICATES
IDEST_SIZE = 32;
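To make the IDEST_SIZE expression of the first form concrete, here is a minimal Python sketch. It evaluates the same formula and lists the registers a tracker would have to mark as written. The helper names are hypothetical. It also assumes that a 64-bit destination occupies the consecutive pair Rd, Rd+1, which the MD itself does not state.

```python
def idest_size(sz=None):
    """Mirror of IDEST_SIZE = 32 + (sz == "64") * 32 from the format above."""
    return 32 + (sz == "64") * 32


def written_registers(rd, sz=None):
    """Registers written by the instruction (hypothetical helper).

    Assumption: every general register is 32 bits wide and a wider
    destination spills into consecutive registers starting at Rd.
    """
    count = idest_size(sz) // 32
    return [f"R{rd + i}" for i in range(count)]


if __name__ == "__main__":
    print(written_registers(100))        # no /64 qualifier
    print(written_registers(100, "64"))  # with /64 qualifier
```

Under that assumption a /64 form touches two registers even though the disassembly text names only one.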
lack of documentation
Predicates
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex
Predicate:Pu
','Predicate:Pv
PREDICATES
IDEST_SIZE = 0;
IDEST2_SIZE = 0;
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ftz /Bop:bop
Predicate:Pu
','Predicate:Pv
PREDICATES
IDEST_SIZE = 0;
IDEST2_SIZE = 0;
I don't know whether they set only their first predicate Pu or both Pu & Pv. Btw the famous IMAD has a very curious MD for some forms:
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp
Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should the semantics be Rd = Ra * Rb * Pu + Rc * Pp?
Barriers
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz
BD:barReg
','CBU_STATE_NONBAR:cbu_state
PREDICATES
IDEST_SIZE = 0;
IDEST2_SIZE = 0;
BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;
ptxas produces code that is far from perfect
Never used registers
LDC R1, c[0x0][0x17c]
mov r154, r3
@p0 bra ; to some block with EXIT
mov r154, r3 ; srsly?
iadd r8, r2, r8 ; r8 = r2 + r8
the last two instructions can be replaced with a single one
iadd r8, r8, r8 ; r8 = r8 + r8
iadd r8, r8, ur4 ; r8 = r8 + ur4
iadd3 r8, r8, ur4, r8
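A quick way to convince yourself that the pair `iadd r8, r8, r8` / `iadd r8, r8, ur4` and the single `iadd3 r8, r8, ur4, r8` listed above compute the same value is the following Python sketch. It assumes plain 32-bit wraparound addition with no carry or negate modifiers. The function names are made up for illustration and model only the arithmetic, not the real instructions.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32 bits wide and wrap on overflow


def iadd(a, b):
    """Model of a two-operand integer add."""
    return (a + b) & MASK32


def iadd3(a, b, c):
    """Model of a three-operand integer add."""
    return (a + b + c) & MASK32


def two_iadds(r8, ur4):
    r8 = iadd(r8, r8)    # iadd r8, r8, r8
    r8 = iadd(r8, ur4)   # iadd r8, r8, ur4
    return r8


def one_iadd3(r8, ur4):
    return iadd3(r8, ur4, r8)  # iadd3 r8, r8, ur4, r8


if __name__ == "__main__":
    for r8, ur4 in [(0, 0), (1, 2), (0x7FFFFFFF, 5), (0xFFFFFFFF, 0xFFFFFFFF)]:
        assert two_iadds(r8, ur4) == one_iadd3(r8, ur4)
    print("both sequences agree")
```

Since addition modulo 2^32 is associative, the intermediate truncation after the first add does not change the final result.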
> LDC R1, c[0x0][0x17c]
c[0x0][0x17c] should be the starting address for local memory. on ampere/sm_80, it's c[0x0][0x28].
I believe the convention they use is that R1 always stores this address, and local memory access is typically offset from R1. They likely do this to warm up constant cache at the start of kernel, since it's always the first instruction.
I've been able to get ptx to generate a kernel without this instruction, but it was a toy example.
the problem is that the offset (17c here) is different on each SM - for example on sm120 it is 37c
at least for sm3 the hypothesis is incorrect - functions from the ancient libcublas.so.7.5.18 have prologues like
/*0060*/ LDC.64 R36, c[0x0][0x190];
/*0068*/ LDC.64 R44, c[0x0][0x198];
/*0070*/ LDC R37, c[0x0][0x1a0];
/*0078*/ LDC R45, c[0x0][0x1a4];
/*0088*/ LDC R1, c[0x0][0x164];
/*0090*/ LDC R6, c[0x0][0x168];
/*0098*/ LDC R5, c[0x0][0x158];
and nvcc for sm3 generates something like
/*0008*/ MOV R1, c[0x0][0x44];