A couple of weeks ago I made a decryptor to extract the so-called "machine descriptions" (MD) from nvdisasm (btw nvdisasm v12 uses the lz4 compression library, so I made yet another decryptor + results). After that I became extremely curious whether it was possible to make a full SASS disassembler. Sure, the format of those MDs is even less documented than the syntax of PTX, but anyway it's much better than having no documentation about the ISA at all
The first and most important thing to check is the width of an instruction - it can be
- 64 bits for sm37 (Kepler) and older
- 88 bits for sm5x (Maxwell) up to sm70
- 128 bits since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)
I have a very limited imagination, so I couldn't imagine how hardware could support alignment for an 11-byte instruction, simply because 11 is not a power of 2. So after some magic with a debugger I found the following code snippet:
lea eax, [r9+r9*4] ; eax = r9 * 5
lea ecx, [r9+rax*4] ; ecx = r9 * 21
mov eax, 1FFFFFh ; 21-bit mask
A bit of searching on Google then turned up this document:
On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions
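So the 88-bit format is really a block of four 64-bit qwords, where the first is the Control qword and the remaining 3 are instructions. Each instruction owns 21 bits of the Control qword (3 * 21 = 63 bits), so 21 + 64 = 85 bits - very close to 88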
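The block layout described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the Control qword is taken to be the first qword of every 32-byte block (as stated above), and the i-th 21-bit field is assumed to start at bit 21 * i, mirroring the r9 * 21 computation and the 1FFFFFh mask from the snippet. The real field order may differ, and the names split_maxwell_block, CTRL_BITS and CTRL_MASK are made up for this example.

```python
import struct

CTRL_BITS = 21
CTRL_MASK = (1 << CTRL_BITS) - 1   # 0x1FFFFF, same mask as in the snippet


def split_maxwell_block(block: bytes):
    """Split one 32-byte block into [(control_bits, instruction_qword), ...].

    Assumption: qword 0 is the Control qword, qwords 1..3 are instructions,
    and instruction i uses control bits [21*i, 21*i + 20].
    """
    if len(block) != 32:
        raise ValueError("a block is 4 qwords = 32 bytes")
    ctrl, *instrs = struct.unpack("<4Q", block)   # little-endian qwords
    return [((ctrl >> (CTRL_BITS * i)) & CTRL_MASK, ins)
            for i, ins in enumerate(instrs)]


if __name__ == "__main__":
    # Synthetic block: control fields 1, 2, 3 and three dummy instructions.
    demo = struct.pack("<4Q", (3 << 42) | (2 << 21) | 1, 0xAAAA, 0xBBBB, 0xCCCC)
    for ctrl_bits, ins in split_maxwell_block(demo):
        print(f"control={ctrl_bits:#08x} instruction={ins:#018x}")
```

Note that the instruction offset alone is not enough here: you always have to start from the beginning of a block to find the matching control bits.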
Note: such a Martian architecture makes it impossible to create an IDA Pro processor module for Maxwell and older GPUs, because IDA expects the instruction at any properly aligned address to be valid, and you just don't know where the Control qword is located for a block of instructions
Let's check what the description of each instruction looks like (from here onwards I will refer to the sm90 MD)
ALTERNATE CLASS "warpsync_rel__RIR"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /COLLECTIVEONLY:div /ALLOnly:all
[!]Predicate("PT"):Pp
','SImm(58)*:sImm /RelOpt:rel
...
OPCODES
WARPSYNCcbu_pipe = 0b100101001000;
WARPSYNC = 0b100101001000;
ENCODING
!warpsync_rel__RIR_unused;
BITS_3_14_12_Pg = Pg;
BITS_1_15_15_Pg_not = Pg@not;
BITS_13_91_91_11_0_opcode=Opcode;
BITS_2_86_85_cop=*div;
BITS_3_89_87_Pnz = Pp;
BITS_1_90_90_input_reg_sz_32_dist = Pp@not;
BITS_56_81_34_23_16_sImm=sImm SCALE 4;
BITS_6_121_116_req_bit_set=req_bit_set;
BITS_3_115_113_src_rel_sb=7;
BITS_3_112_110_dst_wr_sb=7;
BITS_2_103_102_pm_pred=pm_pred;
BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);
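Judging by the names in the ENCODING section, each field seems to follow the pattern BITS_<width>_<hi>_<lo>[_<hi>_<lo>...]_<name>: for example BITS_56_81_34_23_16_sImm is 56 bits wide and made of bits 81..34 plus bits 23..16 (48 + 8 = 56). This is my reading of the names, not a documented rule. Under that assumption a field can be pulled out of a 128-bit word as sketched below; parse_field and extract are made-up helper names, and putting the first listed range into the most significant bits is also an assumption.

```python
import re


def parse_field(name: str):
    """'BITS_8_124_122_109_105_opex' -> (8, [(124, 122), (109, 105)], 'opex')."""
    m = re.fullmatch(r"BITS_(\d+)((?:_\d+_\d+)+)_(.+)", name)
    if m is None:
        raise ValueError(f"unexpected field name: {name}")
    width = int(m.group(1))
    nums = [int(n) for n in m.group(2).split("_")[1:]]
    ranges = list(zip(nums[0::2], nums[1::2]))        # (hi, lo) pairs
    if sum(hi - lo + 1 for hi, lo in ranges) != width:
        raise ValueError(f"ranges do not add up to {width} bits: {name}")
    return width, ranges, m.group(3)


def extract(word: int, name: str) -> int:
    """Concatenate the bit ranges of a field, first range most significant."""
    _, ranges, _ = parse_field(name)
    value = 0
    for hi, lo in ranges:
        n = hi - lo + 1
        value = (value << n) | ((word >> lo) & ((1 << n) - 1))
    return value


if __name__ == "__main__":
    # Low 32 bits of the 128-bit word listed under "Known problems".
    word = 0xFF067819
    for field in ("BITS_3_14_12_Pg", "BITS_1_15_15_Pg_not",
                  "BITS_13_91_91_11_0_opcode"):
        print(field, bin(extract(word, field)))
```

The width check also protects against a field whose own name starts with digits, where the regular expression could otherwise swallow part of the name as a bit range.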
Enums
COLLECTIVEONLY "COLLECTIVE"=2;
DSTFMT "F16"=0 , "E8M7"=1 , "E6M9"=2 , "TF32"=3 , "E5M2"=4 , "E4M3"=5 , "BF16"=1;
B1B0 H0, H1=2, B(0..3)=(0..3);
ALLOnly "ALL"=0;
There are even compound enums like
UInteger_old = U8 + U16 + U32;
- as you can guess, the UXX items are enums too
Tables
TABLES_opex_0
0 0 -> 0
0 1 -> 1
1 1 -> 33
...
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
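A table like TABLES_opex_0 maps the operand pair (batch_t, usched_info) to the value stored in the opex field, so a disassembler needs the reverse lookup. Here is a minimal sketch using only the three rows shown above (the real table is much longer). It assumes every row has the form "a b -> value" with decimal numbers; parse_table and invert are made-up names.

```python
def parse_table(text: str) -> dict:
    """Parse rows like '1 1 -> 33' into {(1, 1): 33}."""
    table = {}
    for line in text.splitlines():
        if "->" not in line:
            continue                      # skip names, '...' and blank lines
        args, value = line.split("->")
        table[tuple(int(a) for a in args.split())] = int(value)
    return table


def invert(table: dict) -> dict:
    """Map each encoded value back to the argument tuples producing it."""
    inverse = {}
    for args, value in table.items():
        inverse.setdefault(value, []).append(args)
    return inverse


if __name__ == "__main__":
    rows = "TABLES_opex_0\n0 0 -> 0\n0 1 -> 1\n1 1 -> 33\n"
    inverse = invert(parse_table(rows))
    print(inverse[33])                    # argument tuples encoded as 33
```

The inverse keeps a list per value because nothing in the format guarantees that two different argument tuples cannot share one encoded value.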
Decision tree for instructions decoding
- for each bit build a tuple of 3 counters - how many masks have 0, 1 and X at this bit
- if any counter in this tuple equals the total number of masks, this bit is useless for decoding and can be ignored
- next we choose the bit that divides the group of masks into two groups of the largest size
- we merge the group of masks having 0 at this bit with the X group and recursively repeat the process for the left sub-tree
- and then do the same with the groups of masks having 1 and X for the right sub-tree
- this process converges quickly enough and finally degenerates into a chain containing only 0,1 or 1,0 in all positions - then we can just add to this leaf node all masks still matching at that lucky moment
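The steps above can be sketched roughly as follows. This is not my actual implementation, just a compact illustration with assumptions: masks are strings of '0', '1' and '-' (for X), the "best" bit is taken to be the one maximizing min(count of 0, count of 1), which is one possible reading of "two groups of the largest size", and a leaf is created as soon as no bit separates the remaining masks. Node, build, matches and lookup are made-up names, and indexes are string positions (leftmost character is index 0).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Pattern = Tuple[str, str]          # (instruction name, mask of '0', '1', '-')


@dataclass
class Node:
    bit: Optional[int] = None      # string index tested here, None for a leaf
    zero: Optional["Node"] = None  # sub-tree for masks with 0 or X
    one: Optional["Node"] = None   # sub-tree for masks with 1 or X
    items: List[Pattern] = field(default_factory=list)


def build(patterns: List[Pattern]) -> Node:
    width = len(patterns[0][1]) if patterns else 0
    best, best_score = None, 0
    for i in range(width):
        n0 = sum(1 for _, m in patterns if m[i] == "0")
        n1 = sum(1 for _, m in patterns if m[i] == "1")
        # A bit with only one kind of value cannot split anything: score 0.
        if min(n0, n1) > best_score:
            best, best_score = i, min(n0, n1)
    if best is None:
        return Node(items=list(patterns))
    zero = [p for p in patterns if p[1][best] != "1"]   # 0 merged with X
    one = [p for p in patterns if p[1][best] != "0"]    # 1 merged with X
    return Node(bit=best, zero=build(zero), one=build(one))


def matches(mask: str, word: str) -> bool:
    return all(m in ("-", w) for m, w in zip(mask, word))


def lookup(root: Node, word: str) -> List[str]:
    node = root
    while node.bit is not None:
        node = node.one if word[node.bit] == "1" else node.zero
    # Masks in a leaf are only candidates, so verify each one fully.
    return [name for name, mask in node.items if matches(mask, word)]


if __name__ == "__main__":
    tree = build([("A", "00-1"), ("B", "01--"), ("C", "1---"), ("D", "1-11")])
    print(lookup(tree, "1011"))
```

Every split removes at least one mask from each side, so the recursion always terminates; the final full comparison in the leaf is what allows several instructions to match the same word.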
Results
readelf -h libtop_secret.incredible_cool_algo.666.cubin
...
Flags: 0x5a055a
Then check the version and realize that we should use the MD from sm90_1.txt
readelf -S libtop_secret.incredible_cool_algo.666.cubin | grep PROGBITS
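For the Flags value above the version check can be sketched like this. It is an assumption on my side that the low byte of e_flags holds the SM version (0x5a = 90 fits this cubin); newer toolchains may lay the flags out differently. The width table just repeats the list from the beginning of this post, and both function names are made up.

```python
def sm_version(e_flags: int) -> int:
    """Assumption: the low byte of the ELF e_flags is the SM version."""
    return e_flags & 0xFF


def instruction_width(sm: int) -> int:
    """Instruction width in bits, following the list at the top of the post."""
    if sm < 50:        # sm37 (Kepler) and older
        return 64
    if sm < 70:        # sm5x (Maxwell) up to sm70
        return 88
    return 128         # sm70 and newer


if __name__ == "__main__":
    sm = sm_version(0x5A055A)
    print(f"sm{sm}, {instruction_width(sm)}-bit instructions")
```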
...
[15] .text._ZN5cudnn31 PROGBITS 0000000000000000 00002a00
[16] .text._ZN5cudnn31 PROGBITS 0000000000000000 0000ad00
[17] .text._ZN5cudnn31 PROGBITS 0000000000000000 00013000
[18] .text._ZN5cudnn31 PROGBITS 0000000000000000 00019900
[23] .nv.constant0._ZN PROGBITS 0000000000000000 00020200
We have 4 code sections here and can dump them in text form with my script hd.pl
libtop_secret.incredible_cool_algo.666.cubin 64 15 > section15
LDC line 91137 68 bits 1 items
filters:
BITS_3_115_113_src_rel_sb t VarLatOperandEnc
BITS_3_112_110_dst_wr_sb t VarLatOperandEnc
BITS_8_124_122_109_105_opex t TABLES_opex_0
mask2enum: BITS_2_103_102_pm_pred->PM_PRED(PMN) BITS_2_79_78_stride->AdMode(IA) BITS_3_14_12_Pg->Predicate(PT) BITS_3_75_73_sz->SZ_U8_S8_U16_S16_32_64(32) BITS_8_23_16_Rd->Register BITS_8_31_24_Ra->ZeroRegister(RZ)
te:BITS_8_124_122_109_105_opex(BATCH_T,USCHED_INFO)
BITS_3_14_12_Pg(7) PT
BITS_1_15_15_Pg_not(0)
BITS_3_75_73_sz(4) 32
BITS_2_79_78_stride(0) IA
BITS_8_23_16_Rd(1) R1
BITS_6_121_116_req_bit_set(0)
BITS_3_115_113_src_rel_sb(7) src_rel_sb = 0xffff
BITS_3_112_110_dst_wr_sb(0) dst_wr_sb = 0
BITS_2_103_102_pm_pred(0) PMN
BITS_8_124_122_109_105_opex(18) batch_t,usched_info = BATCH_T(0)NOP,USCHED_INFO(24)W8
-- const bank 0(Sa_bank,Ra_offset)
BITS_5_58_54_Sb_bank(0)
BITS_16_53_38_Ra_offset(28)
Let's compare this with the output from genuine nvdisasm:
LDC R1, c[0x0][0x28]
Known problems
00000000000011111110001000000000000000000000000100010100000001100000000000000000000000000000001011111111000001100111100000011001:2
results
000---------111111-----0--00000000000000000000010001---0----------------------------------------11111111------------100000011001 - SHR line 118258 57 bits ALT 1 items
000---------111111-----0--000000000000000000000-000----0------------------------------------------------------------100000011001 - SHF line 117312 47 bits 1 items
It seems that nvdisasm always shows the instruction with the largest number of meaningful bits - in this case SHR (57 bits)
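This guess is easy to express in code: count the non-wildcard positions of every matching mask and keep the largest. The sketch below uses the two masks from the output above; it only illustrates my hypothesis about nvdisasm's behaviour, not its confirmed logic, and meaningful_bits and pick are made-up names.

```python
def meaningful_bits(mask: str) -> int:
    """Number of positions fixed to 0 or 1 (everything except '-')."""
    return sum(c in "01" for c in mask)


def pick(candidates):
    """Return the (name, mask) pair with the most meaningful bits."""
    return max(candidates, key=lambda c: meaningful_bits(c[1]))


CANDIDATES = [
    ("SHR", "000---------111111-----0--00000000000000000000010001---0----------------------------------------11111111------------100000011001"),
    ("SHF", "000---------111111-----0--000000000000000000000-000----0------------------------------------------------------------100000011001"),
]

if __name__ == "__main__":
    for name, mask in CANDIDATES:
        print(name, meaningful_bits(mask))
    print("picked:", pick(CANDIDATES)[0])
```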