понедельник, 10 марта 2025 г.

nvidia sass disassembler, part 2

Lets continue explore "machine descriptions" - in this time try to understand how to make format output more similar to genuine nvdisasm

For example format for one of variant I2F looks like: 

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /Float64:dstfmt /SRCFMT_U16_S16:srcfmt /Round1("RN"):rnd
Register:Rd
','C:Sb[UImm(5/0*):Sb_bank]*   [SImm(17)*:Sb_addr] /HSEL("H0"):hsel
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$ ;

...

ENCODING
!i2f_Rd64__Cb_16b_unused;
BITS_3_14_12_Pg = Pg;
BITS_1_15_15_Pg_not = Pg@not;
BITS_13_91_91_11_0_opcode=Opcode;
BITS_3_77_75_sz=*dstfmt;
BITS_3_85_84_74_74_srcfmt=*srcfmt;
BITS_2_79_78_stride=rnd;
BITS_8_23_16_Rd=Rd;

BITS_5_58_54_Sb_bank,BITS_14_53_40_Sb_offset =  ConstBankAddress2(Sb_bank,Sb_addr);
BITS_2_61_60_hsel=hsel;
BITS_6_121_116_req_bit_set=req_bit_set;
BITS_3_115_113_src_rel_sb=VarLatOperandEnc(src_rel_sb);
BITS_3_112_110_dst_wr_sb=VarLatOperandEnc(dst_wr_sb);
BITS_2_103_102_pm_pred=pm_pred;
BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);

четверг, 6 марта 2025 г.

nvidia sass disassembler

Couple weeks ago I made decryptor to extract from nvdisasm so called "machine descriptions" (MD) (btw nvdisasm v12 uses lz4 compression library, so I made yet another decryptor + results). And after that I became extremely curious whether it was possible to make full SASS disassembler - sure format of those MDs even more undocumented than syntax of PTX - but anyway it's much better than having no documentation about ISA at all

First and most important thing to check is width of instruction - it can be

  • 64bit for sm37 (Kepler) and more old
  • 88bit for sm5x (Maxwell) until sm70
  • 128bit since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)

I have a very limited imagination so I couldn't imagine how hardware could support alignment for 11 byte instruction just bcs 11 is not power of 2. So after some magic with debugger I found following code snippet:
 lea     eax, [r9+r9*4] ; eax = r9 * 5
 lea     ecx, [r9+rax*4] ; ecx = r9 * 21
 mov     eax, 1FFFFFh ; 17bit mask

and then some search in google revealed this document:

On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions

So 88bit became 4 64bit qwords where first is Control qword and 3 remaining are instructions, then 21 + 64 = 85bit - very close to 88

Note: such martian architecture makes it impossible to create IDA Pro processor module for Maxwell and more old GPUs - bcs IDA expects that instruction at any properly aligned address should be valid, and you just don't know there Control qword is located for block of instructions

Lets check how looks description of each instruction (from here onwards I will refer to sm90 MD)

понедельник, 24 февраля 2025 г.

ptx internals

It seems that syntax of PTX is undocumented - at least I was unable to find actual BNF grammar or reference implementation. Grammar from antlr project is greatly out-dated and don't contain instructions like ldmatrix/istypep (and btw it has 3505 lines - much bigger than c++14 syntax)

Another project is much better however it's also incomplete. Official documentation is most vague & obscure reading for last years (previous champion) - it contains word "deprecated" 24 times. So it's time to do some disassembling

First remarkable thing is lots of high-entropy parts of data in .rodata section - this is glare sign of encryption, so I made decryptor. After looking at encrypted data we can discover several undocumented features