Thursday, March 6, 2025

nvidia sass disassembler

A couple of weeks ago I made a decryptor to extract the so-called "machine descriptions" (MD) from nvdisasm (btw nvdisasm v12 uses the lz4 compression library, so I made yet another decryptor + results). After that I became extremely curious whether it was possible to make a full SASS disassembler. Sure, the format of those MDs is even more undocumented than the syntax of PTX, but it's still much better than having no documentation about the ISA at all

The first and most important thing to check is the width of an instruction - it can be

  • 64bit for sm37 (Kepler) and older
  • 88bit for sm5x (Maxwell) until sm70
  • 128bit since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)
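
The width list above can be written down as a tiny lookup. This is only a sketch: the function name insn_width_bits is mine, and the boundaries are taken straight from the list, nothing more.

```python
def insn_width_bits(sm: int) -> int:
    """Return the SASS instruction width in bits for an SM version like 37, 52 or 90."""
    if sm >= 70:   # Volta and newer
        return 128
    if sm >= 50:   # Maxwell up to (not including) sm70
        return 88
    return 64      # sm37 (Kepler) and older
```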

I have a very limited imagination, so I couldn't imagine how hardware could support alignment for an 11-byte instruction, simply because 11 is not a power of 2. So after some magic with a debugger I found the following code snippet:
 lea     eax, [r9+r9*4] ; eax = r9 * 5
 lea     ecx, [r9+rax*4] ; ecx = r9 * 21
 mov     eax, 1FFFFFh ; 21bit mask

and then some googling revealed this document:

On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions

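So an "88bit" instruction is really part of a group of 4 64bit qwords, where the first is the Control qword and the 3 remaining are instructions. The Control qword holds 3 * 21 = 63 bits of control data, so each instruction takes 21 + 64 = 85 bits - very close to 88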
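
Here is a minimal sketch of how such a group could be split. The names (split_bundle, CTRL_MASK) are mine; the little-endian byte order and the idea that the i-th 21bit slice of the Control qword belongs to the i-th instruction are assumptions based only on the r9 * 21 shift and the 1FFFFFh mask from the snippet above.

```python
import struct

CTRL_BITS = 21
CTRL_MASK = 0x1FFFFF  # 21 one bits, same constant as in the disassembly

def split_bundle(raw: bytes):
    """Split one 32-byte group into three (control, instruction) pairs."""
    if len(raw) != 32:
        raise ValueError("a group is 4 qwords = 32 bytes")
    # Assumption: qwords are little-endian and the Control qword comes first.
    ctrl, *insns = struct.unpack("<4Q", raw)
    # Assumption: slice i of the Control qword describes instruction i.
    return [((ctrl >> (CTRL_BITS * i)) & CTRL_MASK, insn)
            for i, insn in enumerate(insns)]

# Toy data: control slices 1, 2, 3 and fake instruction qwords 10, 20, 30.
ctrl = (3 << 42) | (2 << 21) | 1
pairs = split_bundle(struct.pack("<4Q", ctrl, 10, 20, 30))
```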

Note: such a martian architecture makes it impossible to create an IDA Pro processor module for Maxwell and older GPUs, because IDA expects that an instruction at any properly aligned address should be valid, and you just don't know where the Control qword is located for a block of instructions

Let's check what the description of each instruction looks like (from here onwards I will refer to the sm90 MD)

ALTERNATE CLASS "warpsync_rel__RIR"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /COLLECTIVEONLY:div /ALLOnly:all
 [!]Predicate("PT"):Pp
','SImm(58)*:sImm /RelOpt:rel

...

OPCODES
        WARPSYNCcbu_pipe =  0b100101001000;
        WARPSYNC =  0b100101001000;

ENCODING
!warpsync_rel__RIR_unused;
BITS_3_14_12_Pg = Pg;
BITS_1_15_15_Pg_not = Pg@not;
BITS_13_91_91_11_0_opcode=Opcode;
BITS_2_86_85_cop=*div;
BITS_3_89_87_Pnz = Pp;
BITS_1_90_90_input_reg_sz_32_dist = Pp@not;
BITS_56_81_34_23_16_sImm=sImm SCALE 4;
BITS_6_121_116_req_bit_set=req_bit_set;
BITS_3_115_113_src_rel_sb=7;
BITS_3_112_110_dst_wr_sb=7;
BITS_2_103_102_pm_pred=pm_pred;
BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);
 

It's pretty obvious that the ENCODING section contains masks: !mask means that this mask should be filled with zeros, mask=const is self-explanatory, and the mask BITS_13_91_91_11_0_opcode contains the opcode for this instruction
The only problem here is the direction of bits inside a mask. Say we have some mask "..XXX.." and want to put the value 1 in it: should the result be "..100.." or "..001.."? Initially I chose the first variant (direction from left to right), but then tests showed that it is actually backward, so always use option -r when running ead.pl
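
The mask names themselves look like they carry the layout: a total width followed by pairs of high/low bit positions, e.g. BITS_56_81_34_23_16_sImm gives 81..34 (48 bits) plus 23..16 (8 bits) = 56. That reading is my assumption, it just happens to fit every name in the example above. A small sketch under this assumption (parse_mask_name and extract_field are hypothetical helpers, not functions from ead.pl; treating bit 0 as the least significant bit and the first listed range as the most significant part is also an assumption):

```python
def parse_mask_name(name: str):
    """Split 'BITS_<width>_<hi>_<lo>[_<hi>_<lo>...]_<field>' into (width, ranges, field)."""
    parts = name.split("_")
    if parts[0] != "BITS":
        raise ValueError(f"not a mask name: {name}")
    width = int(parts[1])
    ranges, covered, i = [], 0, 2
    # Consume (hi, lo) pairs until they add up to the declared width, so digits
    # inside the field name (e.g. 'input_reg_sz_32_dist') are not eaten.
    while covered < width:
        hi, lo = int(parts[i]), int(parts[i + 1])
        ranges.append((hi, lo))
        covered += hi - lo + 1
        i += 2
    if covered != width:
        raise ValueError(f"ranges do not add up to {width} bits: {name}")
    return width, ranges, "_".join(parts[i:])

def extract_field(word: int, ranges) -> int:
    """Collect the bits of all ranges from an instruction held as a Python int."""
    value = 0
    for hi, lo in ranges:
        n = hi - lo + 1
        value = (value << n) | ((word >> lo) & ((1 << n) - 1))
    return value

width, ranges, field = parse_mask_name("BITS_13_91_91_11_0_opcode")
opcode = extract_field(0b100101001000, ranges)  # WARPSYNC opcode in bits 11..0
```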
 
This first naive implementation showed very poor results, with lots of identical masks referring to different instructions, so it was time to explore what those cryptic constructs in the FORMAT section are. For example, what is COLLECTIVEONLY:div referred to from mask BITS_2_86_85_cop? Answer - they are (mostly) just

Enums

COLLECTIVEONLY "COLLECTIVE"=2;
Don't ask me why they use an enum with a single value instead of BITS_2_86_85_cop=2
The syntax for enums is extremely diverse:
DSTFMT "F16"=0 , "E8M7"=1 , "E6M9"=2 , "TF32"=3 , "E5M2"=4 , "E4M3"=5 , "BF16"=1;
B1B0 H0, H1=2, B(0..3)=(0..3);
and there are even compound enums like UInteger_old = U8 + U16 + U32; - as you can guess, UXX are enums too
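
To illustrate only the simplest form ("NAME"=value pairs), here is a sketch of a parser that keeps a list of names per value. It deliberately ignores the other forms (implicit values, ranges like B(0..3), compound enums), and parse_simple_enum is a name I made up for this example.

```python
import re
from collections import defaultdict

ENUM_ITEM = re.compile(r'"([^"]+)"\s*=\s*(\d+)')

def parse_simple_enum(body: str):
    """Return {value: [names]} for items written as "NAME"=number."""
    by_value = defaultdict(list)
    for name, value in ENUM_ITEM.findall(body):
        by_value[int(value)].append(name)
    return dict(by_value)

DSTFMT = parse_simple_enum(
    '"F16"=0 , "E8M7"=1 , "E6M9"=2 , "TF32"=3 , "E5M2"=4 , "E4M3"=5 , "BF16"=1'
)
# Value 1 has two names, so decoding a value back to a name is ambiguous.
names_for_1 = DSTFMT[1]
```

Keeping a list per value makes it visible that one encoded value can have several names.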
 
So in the next version I just filled all masks with the values inside parentheses - the generated masks were almost completely filled and, sure enough, were again unusable for decoding (you can check what happens by running ead.pl with option -cmr)
 
After numerous experiments I found that only masks that have an assignment of the form "=*" must be filled in; otherwise the presence of the value in the enum should be checked after matching with some mask. This is how the final version works, you can run it with options -Fmr
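
The two-step check can be sketched like this: first a plain mask comparison, then a lookup of each enum-backed field in the set of values its enum allows. Everything here is a toy under my own assumptions: 8bit words instead of 128bit, a made-up field position, and the helper names are not taken from ead.pl.

```python
def mask_matches(mask: str, bits: str) -> bool:
    """'-' in the mask means "don't care"; both strings are MSB first."""
    return len(mask) == len(bits) and all(m in ("-", b) for m, b in zip(mask, bits))

def field_from_bits(bits: str, hi: int, lo: int) -> int:
    """Read bits hi..lo, where bit 0 is the last character of the string."""
    n = len(bits)
    return int(bits[n - 1 - hi : n - lo], 2)

def accept(bits: str, mask: str, enum_checks) -> bool:
    """enum_checks is a list of (hi, lo, allowed_values) tuples."""
    if not mask_matches(mask, bits):
        return False
    # The mask fits; now reject encodings whose field value is not in the enum.
    return all(field_from_bits(bits, hi, lo) in allowed
               for hi, lo, allowed in enum_checks)

MASK = "10---100"                          # toy mask, bits 5..3 are free
CHECKS = [(5, 3, {0, 1, 2, 3, 4, 5})]      # values used by DSTFMT above
ok = accept("10101100", MASK, CHECKS)      # field = 5, present in the enum
bad = accept("10110100", MASK, CHECKS)     # field = 6, not in the enum
```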
 
Btw there are lots of enums in the FORMAT string that have no corresponding encoding masks, like ALLOnly:all in the example above. ALLOnly is described as:
ALLOnly "ALL"=0;
I don't know if this is a bug, but there are more common patterns like /LOOnly("LO"):wide
 
However this is not enough - TABLES_opex_0 is not an enum but is located in the section

Tables

and looks like
TABLES_opex_0
0 0 -> 0
0 1 -> 1
1 1 -> 33
... 
The decoded value is on the right side and the pair of values for batch_t & usched_info is on the left. Btw they are also described as enums:
$( { '?' LLOnly "ALL"=0;("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
 
The problem here is that tables can contain only a limited set of values - missing ones are probably considered invalid. You cannot express missing values in the form of a mask, so this test should be performed after matching with some mask, like I did for enums; see details in function filter_ins
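
A sketch of such a table check, using only the three rows shown above (the real table is longer, so the "invalid" result below is only valid for this toy copy). I assume that the number right of "->" is what sits in the mask bits and that it is unique within a table; parse_table and decode_opex are hypothetical names.

```python
def parse_table(text: str):
    """Map the value right of '->' back to the tuple of values on the left."""
    decode = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip the table name, '...' and blank lines
        left, right = line.split("->")
        decode[int(right)] = tuple(int(x) for x in left.split())
    return decode

OPEX = parse_table("""TABLES_opex_0
0 0 -> 0
0 1 -> 1
1 1 -> 33
""")

def decode_opex(value: int):
    """Return (batch_t, usched_info), or None when the value is not in the table."""
    return OPEX.get(value)
```

A None result plays the same role as a failed enum check: the mask matched, but the instruction is rejected anyway.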

Decision tree for instructions decoding

Instruction decoding usually relies on a so-called "decoding decision tree", but I was unable to find an open-source implementation of algos like LISA or ISDL. I also checked decodetree from qemu - it seems that they use topological sorting. So I invented yet another poor and buggy algorithm for building my own decodetree:
  1. for each bit build a tuple of 3 counters: how many masks have 0, 1 and X at this bit
  2. if one of the values in this tuple equals the total count of masks, this bit is useless for decoding and can be ignored
  3. next we choose the bit dividing the group of masks into two groups of the largest size
  4. we merge the group of masks having 0 at this bit with the X group and recursively repeat the process for the left sub-tree
  5. and then do the same with the group of masks having 1 and the X group for the right sub-tree
  6. this process converges quickly enough and finally degenerates into a chain containing only 0,1 or 1,0 in all positions - then we can just add to this leaf node all masks still matching at that lucky moment
It sounds a bit complicated but the implementation is easy - see function build_node
As a cheap optimization we can start applying mask matching only at levels >= the minimal count of meaningful bits
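
Below is a compact sketch of the steps above, not a port of the real build_node. Masks are strings of '0', '1' and '-' (for X), and bits are addressed by string position. Two things are my own interpretation: "two groups of the largest size" is read as "maximize the smaller of the 0 and 1 groups", and a bit is used only if both 0 and 1 occur in it, which guarantees that both sub-trees shrink.

```python
def matches(mask: str, bits: str) -> bool:
    return all(m in ("-", b) for m, b in zip(mask, bits))

def build_node(masks):
    """Return {'leaf': [...]} or {'bit': pos, '0': node, '1': node}."""
    best_bit, best_score = None, 0
    for pos in range(len(masks[0]) if masks else 0):
        zeros = sum(m[pos] == "0" for m in masks)
        ones = sum(m[pos] == "1" for m in masks)
        score = min(zeros, ones)      # 0 when the bit cannot split the group
        if score > best_score:
            best_bit, best_score = pos, score
    if best_bit is None:
        return {"leaf": list(masks)}  # nothing left to split on
    # Masks with X at the chosen bit go into both sub-trees.
    left = [m for m in masks if m[best_bit] in "0-"]
    right = [m for m in masks if m[best_bit] in "1-"]
    return {"bit": best_bit, "0": build_node(left), "1": build_node(right)}

def lookup(node, bits: str):
    """Walk down by the instruction bits, then compare only the leaf masks."""
    while "leaf" not in node:
        node = node[bits[node["bit"]]]
    return [m for m in node["leaf"] if matches(m, bits)]

tree = build_node(["00-1", "01--", "1-0-", "1-1-"])
hit = lookup(tree, "0101")
```

The full mask comparison happens only in the leaf, so most masks are never touched for a given instruction.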
 
Result - for sm90_1.txt decodetree has 1227 nodes, 1226 leaves, depth 13 and contains 1529 masks - starting from 1321 non-unique masks
 
This tree reduces the number of mask comparisons from n * N to n * log(N), where n is the number of input instructions and N is the number of masks (in my test from 2767495 to 3051 calls of cmp_maska)

Results

The current implementation supports only 88 & 128bit instructions and works with text data. Let's assume we have the target libtop_secret.incredible_cool_algo.666.cubin
First run
readelf -h libtop_secret.incredible_cool_algo.666.cubin
...
Flags:                             0x5a055a 

Then check the version and realize that we should use the MD from sm90_1.txt
Run readelf again to list the sections:
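
One way to read the version from that Flags value is to take its low byte: 0x5a is 90, which fits sm90. Treat this as a guess that matches this particular file, the flags layout may differ for other toolchain versions, and sm_from_elf_flags is just an illustrative name.

```python
def sm_from_elf_flags(e_flags: int) -> int:
    """Assumption: the SM version is stored in the low 8 bits of e_flags."""
    return e_flags & 0xFF

sm = sm_from_elf_flags(0x5A055A)
```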
 
readelf -S libtop_secret.incredible_cool_algo.666.cubin | grep PROGBITS
 ...
  [15] .text._ZN5cudnn31 PROGBITS         0000000000000000  00002a00
  [16] .text._ZN5cudnn31 PROGBITS         0000000000000000  0000ad00
  [17] .text._ZN5cudnn31 PROGBITS         0000000000000000  00013000
  [18] .text._ZN5cudnn31 PROGBITS         0000000000000000  00019900
  [23] .nv.constant0._ZN PROGBITS         0000000000000000  00020200 

We have 4 code sections here and can dump them in text form via my script hd.pl libtop_secret.incredible_cool_algo.666.cubin 64 15 > section15

Now you can finally run ead.pl -BFmr -T section15 sm90.txt and observe something like

LDC line 91137 68 bits  1 items
filters:
  BITS_3_115_113_src_rel_sb t VarLatOperandEnc
  BITS_3_112_110_dst_wr_sb t VarLatOperandEnc
  BITS_8_124_122_109_105_opex t TABLES_opex_0
mask2enum: BITS_2_103_102_pm_pred->PM_PRED(PMN) BITS_2_79_78_stride->AdMode(IA) BITS_3_14_12_Pg->Predicate(PT) BITS_3_75_73_sz->SZ_U8_S8_U16_S16_32_64(32) BITS_8_23_16_Rd->Register BITS_8_31_24_Ra->ZeroRegister(RZ)
  te:BITS_8_124_122_109_105_opex(BATCH_T,USCHED_INFO)
   BITS_3_14_12_Pg(7) PT
   BITS_1_15_15_Pg_not(0)
   BITS_3_75_73_sz(4) 32
   BITS_2_79_78_stride(0) IA
   BITS_8_23_16_Rd(1) R1
   BITS_6_121_116_req_bit_set(0)
   BITS_3_115_113_src_rel_sb(7) src_rel_sb = 0xffff
   BITS_3_112_110_dst_wr_sb(0) dst_wr_sb = 0
   BITS_2_103_102_pm_pred(0) PMN
   BITS_8_124_122_109_105_opex(18) batch_t,usched_info = BATCH_T(0)NOP,USCHED_INFO(24)W8
 -- const bank 0(Sa_bank,Ra_offset)
   BITS_5_58_54_Sb_bank(0)
   BITS_16_53_38_Ra_offset(28) 


Let's compare this with the output from genuine nvdisasm:
LDC R1, c[0x0][0x28] 

Known problems

The format of ConstBankAddress is not described in the MD files, so I have no idea how to convert the values 0 & 28 to c[0x0][0x28] - it's not always that straightforward
 
Bcs enums are basically non-unique, the output can differ from nvdisasm - e.g. it shows enum CInteger with value 0 as "U8" and my code as "SD"
 
Some instructions are marked as ALTERNATE - they have identical masks but slightly different formats. I don't know yet how to distinguish them
 
Some masks can overlap (and so we have several matches), like:
00000000000011111110001000000000000000000000000100010100000001100000000000000000000000000000001011111111000001100111100000011001:2
results
000---------111111-----0--00000000000000000000010001---0----------------------------------------11111111------------100000011001 - SHR line 118258 57 bits ALT 1 items
000---------111111-----0--000000000000000000000-000----0------------------------------------------------------------100000011001 - SHF line 117312 47 bits  1 items

It seems that nvdisasm always shows the instruction with the largest number of meaningful bits - in this case SHR (57 bits)
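
If that guess is right, the tie-break is a one-liner: count the non-X positions of every matched mask and keep the maximum. A sketch with the two masks from above (meaningful_bits and pick_best are hypothetical names, and the rule itself is only my observation of nvdisasm's behaviour):

```python
def meaningful_bits(mask: str) -> int:
    """Number of positions that are fixed to 0 or 1."""
    return sum(ch != "-" for ch in mask)

def pick_best(candidates):
    """candidates is a list of (name, mask); prefer the most specific mask."""
    return max(candidates, key=lambda item: meaningful_bits(item[1]))

candidates = [
    ("SHR", "000---------111111-----0--00000000000000000000010001---0----------------------------------------11111111------------100000011001"),
    ("SHF", "000---------111111-----0--000000000000000000000-000----0------------------------------------------------------------100000011001"),
]
best_name, _ = pick_best(candidates)
```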
 
to be continued
