среда, 26 марта 2025 г.

nvidia sass disassembler, part 5

Previous parts: 1, 2, 3 & 4

I've finally add native rendering for instructions - actually just rewrite from perl terrible function make_inst. Because in output typically rendering only small fraction of instructions data for formats are filling by demand via std::call_once. Results to compare with genuine nvdisasm:

minenvdisasm
LDC R1,c:[0][0x37C]
LDCU.64 UR8,c:[0][URZ+0x440]
LDC R16,c:[0][0x3B8]
LDCU.64 UR12,c:[0][URZ+0x448]
LDCU UR4,c:[0][URZ+0x3AC]
LDC._64 R4,c:[0][0x450]
LDCU.64 UR14,c:[0][URZ+0x380]
LDCU.64 UR10,c:[0][URZ+0x358]
HFMA2 R13,-RZ,RZ, 1.875000, 0.000000
ISETP.NE.S64.AND P2,PT,RZ,UR8,PT
LDC R1, c[0x0][0x37c]
LDCU.64 UR8, c[0x0][0x440]
LDC R16, c[0x0][0x3b8]
LDCU.64 UR12, c[0x0][0x448]
LDCU UR4, c[0x0][0x3ac]
LDC.64 R4, c[0x0][0x450]
LDCU.64 UR14, c[0x0][0x380]
LDCU.64 UR10, c[0x0][0x358]
HFMA2 R13, -RZ, RZ, 1.875, 0
ISETP.NE.S64.AND P2, PT, RZ, UR8, PT

IMHO very similar, has some minor problems with formatting of floating point values (I used FP16 to extract 16bit values but don't know what means E8M7Imm in format descriptor)

So the next thing to show is 

labels for branches

As I mentioned you can identify instruction as branches via it's PROPERTIES, get value in BRANCH_TARGET_INDEX and render it as label address. There are two problems: 

  1. size of branch offset vary in size - it can be 58bit for sm_90, 50bit for sm_75, 24 for sm_3 and so on
  2. branch offset is signed value, so we need some method to detect that some value of known bit size is negative

пятница, 21 марта 2025 г.

nvidia sass disassembler, part 4

I've made native sass disasm - just adding c++ codegen (can be produced by ead.pl with -C option). It works via dynamic loading of right disasm module - see list of supported architectures in map s_sms. For now it supports only operands dump with -O option - not rendered yet (bcs rewriting bunch of perl code with duck-types to C++ is boring and tedious work). Also you can dump attributes with -e option. You can make those modules with something like "make sm90.so". Btw dumb gcc allocates for local vars ~600kb on stack and with -Os option it compiles each module for 10 minutes with stack consumption shrink to normal values)

Tests show zero unrecognized instructions (and I am truly proud of this), however if you will find such - I also add option -N to dump it's content to bit-mask, which you then can pass to ead.pl with the same -N option to see what happened

On the other side it seems that nvidia trying to hide something important from us - let's check libcublas.so from v12 - we can notice lots of sections

  • .nv.merc.nv.info - genuine nvdiasm unable to show their content
  • .nv.capmerc.text - however, the instructions they contain are clearly in some other format and cannot be disassembled - I add -s option to disasm single section by it's index, so you can try it by yourself
  • and they obviously has corresponding relocs in sections .nv.merc.rela.text
  • and even .nv.merc.rela.debug_frame & .nv.merc.symtab

Known problems

пятница, 14 марта 2025 г.

nvidia sass disassembler, part 3

It looks like this rabbit hole goes much deeper

Some const banks does not have ConstBankAddressX:
CX:Sb[UniformRegister:URb][UImm(16)*:Sb_offset]
BITS_6_37_32_Ra_URb=URb
BITS_14_53_40_Sb_offset=Sb_offset SCALE 4 
 
Btw there is no encoding for field Sb

Next they have desc memory:
DESC:memoryDescriptor[UniformRegister:Ra_URb][Register:Ra /ONLY64:input_reg_sz_64_dist + SImm(24/0)*:Ra_offset]

genuine nvdisasm shows them like LDG.E.U16.CONSTANT R10, desc[UR8][R2.64], my disasm as LDG.E.U16.CONSTANT ,R10,desc[UR8][R2.64 + 0x0]

And finally we also have:

A:srcAttr[ UniformRegister:URa + SImm(11/0)*:URa_offset ]

GMMA:gdesc[ UniformRegister:URb ]

TMA:desc[ UniformRegister:URe ]

TTU:ttuAddr[ UImm(16)*:ImmU16 ]

RF:indexURb[UniformRegister:URb] ','UImm(4/0xf)*:PixMaskU04 

TMEMA:tmemA[ UniformRegister:URa ] 

TMEM perhaps means "tensor memory" and I have no idea about the rest of the prefixes

понедельник, 10 марта 2025 г.

nvidia sass disassembler, part 2

Lets continue explore "machine descriptions" - in this time try to understand how to make format output more similar to genuine nvdisasm

For example format for one of variant I2F looks like: 

FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /Float64:dstfmt /SRCFMT_U16_S16:srcfmt /Round1("RN"):rnd
Register:Rd
','C:Sb[UImm(5/0*):Sb_bank]*   [SImm(17)*:Sb_addr] /HSEL("H0"):hsel
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$ ;

...

ENCODING
!i2f_Rd64__Cb_16b_unused;
BITS_3_14_12_Pg = Pg;
BITS_1_15_15_Pg_not = Pg@not;
BITS_13_91_91_11_0_opcode=Opcode;
BITS_3_77_75_sz=*dstfmt;
BITS_3_85_84_74_74_srcfmt=*srcfmt;
BITS_2_79_78_stride=rnd;
BITS_8_23_16_Rd=Rd;

BITS_5_58_54_Sb_bank,BITS_14_53_40_Sb_offset =  ConstBankAddress2(Sb_bank,Sb_addr);
BITS_2_61_60_hsel=hsel;
BITS_6_121_116_req_bit_set=req_bit_set;
BITS_3_115_113_src_rel_sb=VarLatOperandEnc(src_rel_sb);
BITS_3_112_110_dst_wr_sb=VarLatOperandEnc(dst_wr_sb);
BITS_2_103_102_pm_pred=pm_pred;
BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);

четверг, 6 марта 2025 г.

nvidia sass disassembler

Couple weeks ago I made decryptor to extract from nvdisasm so called "machine descriptions" (MD) (btw nvdisasm v12 uses lz4 compression library, so I made yet another decryptor + results). And after that I became extremely curious whether it was possible to make full SASS disassembler - sure format of those MDs even more undocumented than syntax of PTX - but anyway it's much better than having no documentation about ISA at all

First and most important thing to check is width of instruction - it can be

  • 64bit for sm37 (Kepler) and more old
  • 88bit for sm5x (Maxwell) until sm70
  • 128bit since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)

I have a very limited imagination so I couldn't imagine how hardware could support alignment for 11 byte instruction just bcs 11 is not power of 2. So after some magic with debugger I found following code snippet:
 lea     eax, [r9+r9*4] ; eax = r9 * 5
 lea     ecx, [r9+rax*4] ; ecx = r9 * 21
 mov     eax, 1FFFFFh ; 17bit mask

and then some search in google revealed this document:

On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions

So 88bit became 4 64bit qwords where first is Control qword and 3 remaining are instructions, then 21 + 64 = 85bit - very close to 88

Note: such martian architecture makes it impossible to create IDA Pro processor module for Maxwell and more old GPUs - bcs IDA expects that instruction at any properly aligned address should be valid, and you just don't know there Control qword is located for block of instructions

Lets check how looks description of each instruction (from here onwards I will refer to sm90 MD)

понедельник, 24 февраля 2025 г.

ptx internals

It seems that syntax of PTX is undocumented - at least I was unable to find actual BNF grammar or reference implementation. Grammar from antlr project is greatly out-dated and don't contain instructions like ldmatrix/istypep (and btw it has 3505 lines - much bigger than c++14 syntax)

Another project is much better however it's also incomplete. Official documentation is most vague & obscure reading for last years (previous champion) - it contains word "deprecated" 24 times. So it's time to do some disassembling

First remarkable thing is lots of high-entropy parts of data in .rodata section - this is glare sign of encryption, so I made decryptor. After looking at encrypted data we can discover several undocumented features

суббота, 16 ноября 2024 г.

perl module for powerpc disasm

It seems that there are no good open-source disasm for PPC except Capstone. And it even has perl binding - unfortunately it can extract only basic fields like opcode and text but no operands for specific processor. So I made yet another

I would like to make a few critical remarks about the capstone itself

1) it's fat as pig - with default settings we have
ls -l libcapstone.a
-rw-rw-r-- 1 redp redp 93833624 nov  8 16:10 libcapstone.a
Sure this can be fixed with selecting just needed processors but anyway makes a lasting impression

2) it's inconsistent. Just couple of examples
mr r31, r3 actually has instruction id PPC_INSN_OR. to get MR you must check alias ID, and even in this case they have PPC_INS_ALIAS_MR & PPC_INS_ALIAS_MR_

Another example - ld r3, something actually has PPC_REG_X3 instead of PPC_REG_R3 despite the fact that these are the same register. Why clone lots of registers if you produce the same output for them? And why not add size of register? I suspect this happens bcs they used MD from llvm and was too lazy to make some optimizations

3) it's incomplete. For example they don't implemented reg_access for powerpc (as well as for alpha, risc-v, sparc etc)

multiple TOCs

It's critical to track TOC for distinguishing loading some constant vs loading at some address. Lets check following code

; prologue
addis r2, r12, 0x1d8
addi r2, r2, 0x70 ; TOC
...
addis r3, r2, 0x19
ld r3, -0x1ca8(r3) ; kmem_cache

Here r3 adjusting from r2 holding address of TOC to get address of kmem_cache. And this is how loading of constant looks like:
lis r10, 0xa9
ori r10, r10, 0xc00 ; A90C00
And yes - LIS is again alias PPC_INS_ALIAS_LIS, instruction ID is PPC_ADDIS. 

Official doc is very unclear regarding the possibility of having multiple TOCs. Theoretically if linker would put TOC somewhere in middle of .bss/.data/.rdata sections it can cover full 32bit address space. But what if the program has size bigger 4Gb or it was linked with code from several compilers like llvm & gcc? Unfortunately I don't have such samples so cannot say if my disasm will work correctly in case of having several TOCs

results

Anyway, enjoy - powepc disassembler in 120LOC