Friday, April 18, 2025

nvidia sass disassembler, part 7: dual issued instructions

Previous parts: 1, 2, 3, 4, 5 & 6

As you may have noticed, genuine nvdisasm puts pairs of instructions in curly braces for old SMs (always the 88-bit ones). I finally realized how those dual-issued instructions are selected: the first one must have USCHED_INFO equal to 0x10 (floxy2)
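The selection rule above can be sketched in a few lines of C++. This is only an illustration: DecodedInst, render_with_pairs and the "{ a ; b }" layout are hypothetical names and formatting of mine, and I assume the second half of the pair is simply the next instruction in the stream.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: rendered text plus the already decoded USCHED_INFO field.
struct DecodedInst {
    std::string text;
    std::uint8_t usched_info;
};

// Value of USCHED_INFO that marks the first half of a dual-issued pair (floxy2).
constexpr std::uint8_t kDualIssue = 0x10;

// Groups an instruction with its successor when the first one carries 0x10.
// The brace layout is illustrative, not the exact nvdisasm output.
std::vector<std::string> render_with_pairs(const std::vector<DecodedInst>& insts) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < insts.size(); ++i) {
        const bool has_next = i + 1 < insts.size();
        if (insts[i].usched_info == kDualIssue && has_next) {
            out.push_back("{ " + insts[i].text + " ; " + insts[i + 1].text + " }");
            ++i;  // the second half was consumed together with the first
        } else {
            out.push_back(insts[i].text);
        }
    }
    return out;
}
```

A trailing instruction with 0x10 and no successor is printed alone here; whether that can happen in real code is something I have not checked.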

Interestingly, newer SMs (since sm_70) have no 0x10 value at all:

 W15EG=15,
 WAIT15_END_GROUP=15,
 W1=17,
 trans1=17,

results

Friday, April 11, 2025

nvidia sass disassembler, part 6: predicates

Previous parts: 1, 2, 3, 4 & 5

Let's check how pairs of instructions are chained together. This information is stored in MD files with the suffix _2.txt, for example in sm90_2.txt:

CONNECTOR CONDITIONS

    RaRange = (((((MD_PRED(ISRC_A_SIZE)) >= (1)) ? (MD_PRED(ISRC_A_SIZE)) : (1)) - 1) >> 5) + 1;
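To see what this expression does, here is the same arithmetic as a small C++ function (the name ra_range is mine, and static_assert without a message needs C++17). It clamps the size to at least 1 bit and then rounds up to whole 32-bit units, which reads like a count of consecutive 32-bit registers, although that interpretation is my assumption.

```cpp
// Same arithmetic as the RaRange connector condition:
// clamp the predicate to >= 1, then take ceil(bits / 32).
constexpr int ra_range(int isrc_a_size) {
    const int bits = isrc_a_size >= 1 ? isrc_a_size : 1;
    return ((bits - 1) >> 5) + 1;
}

static_assert(ra_range(0) == 1, "sizes below 1 are clamped");
static_assert(ra_range(32) == 1, "32 bits fit into one unit");
static_assert(ra_range(64) == 2, "64 bits need two units");
static_assert(ra_range(128) == 4, "128 bits need four units");
```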

What is ISRC_A_SIZE? It is one of the so-called PREDICATES of an instruction:

PREDICATES
 IDEST_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_B_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_C_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_A_SIZE = 32 + ((e==`E@E))*32;
So their values depend on instruction fields, like:
BITS_1_72_72_e=e
BITS_3_75_73_sz=sz

How can we convert these rules to C++? Well, they already have almost C++ syntax; we only need to patch two things:

  1. extract the values of all used fields (in this case e & sz)
  2. replace each `ENUM@VALUE with the numerical value of the enum. Perl lets you do this with the cool regex modifier /e

So rule for ISRC_A_SIZE can be rewritten as:

int e = (int)e_iter->second; // extract value of e field
return 32 + ((e==1))*32;

Because enum E is described as E "noe"=0 , "E"=1;
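The real conversion is done in Perl with /e. Just to illustrate the second step, here is a rough C++ equivalent; EnumTable and substitute_enums are hypothetical names, and the regex assumes that enum and value names consist only of word characters, optionally wrapped in double quotes.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical lookup: enum name -> (value name -> number).
using EnumTable = std::map<std::string, std::map<std::string, int>>;

// Replaces every `ENUM@VALUE or `ENUM@"VALUE" reference with its number.
// Throws std::out_of_range when the enum or the value is unknown.
std::string substitute_enums(const std::string& rule, const EnumTable& enums) {
    static const std::regex ref(R"re(`(\w+)@"?(\w+)"?)re");
    std::string out;
    auto pos = rule.cbegin();
    for (std::sregex_iterator it(rule.cbegin(), rule.cend(), ref), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(pos, m[0].first);  // copy the text before the reference
        out += std::to_string(enums.at(m[1].str()).at(m[2].str()));
        pos = m[0].second;            // continue right after the reference
    }
    out.append(pos, rule.cend());     // copy the tail
    return out;
}
```

With the table E = {noe: 0, E: 1}, the rule for ISRC_A_SIZE turns into the expression used in the return statement above.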

results

I've added the -p option to my disasm to dump predicates:
> LDCU.128 UR16,c:[0][URZ+0x3D0] &0 &0 ?trans1 ?NOP ?PMN
P> ILABEL_URa_SIZE: 32
P> ISRC_A_SIZE: 32
P> IDEST_SIZE: 128

Friday, April 4, 2025

ptx instructions emitted by nvidia compiler, part 2

Part 1 described v10.
Today let's check cicc v12. The first thing that catches your eye is its size: almost 76Mb! It also contains at least 5 different decryptors. Nvidia really wants to hide something from its grateful clients

Why is it so fat?

Because it contains at least 4 code generators: for arm32, aarch64, x86 & nvptx
+ at least 27 llvm bitcode blobs (signature 0x42 0x43 0xc0 0xde). They mostly contain bodies of intrinsic functions like nvvm_mulq/nvvm_divq, but on some of them llvm-dis just crashes:

#0  0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339
1339      unsigned TableVal = IIT_Table[id-1];
>>> where
#0  0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339
#1  0x000055abcc5fe41f in UpgradeIntrinsicFunction1 (F=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/include/llvm/IR/Function.h:204
#2  0x000055abcc60111a in llvm::UpgradeIntrinsicFunction (F=F@entry=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/lib/IR/AutoUpgrade.cpp:1226
#3  0x000055abcc584778 in (anonymous namespace)::BitcodeReader::globalCleanup (this=0x55abce608e30) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:3696
#4  0x000055abcc5856cc in (anonymous namespace)::BitcodeReader::parseModule (this=<optimized out>, ResumeBit=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:4385
#5  0x000055abcc5959ca in (anonymous namespace)::BitcodeReader::parseBitcodeInto (Callbacks=..., IsImporting=false, ShouldLazyLoadMetadata=false, M=0x55abce5f3d80, this=0x55abce608e30) at /usr/include/c++/9/bits/std_function.h:564
#6  llvm::BitcodeModule::getModuleImpl (this=<optimized out>, Context=..., MaterializeAll=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:7981
#7  0x000055abcc596070 in llvm::BitcodeModule::getLazyModule (this=0x7ffc60415d10, Context=..., ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /usr/include/c++/9/bits/std_function.h:263
#8  0x000055abcc550e49 in main (argc=<optimized out>, argv=<optimized out>) at /home/redp/disc/src/llvm-project/llvm/include/llvm/Support/CommandLine.h:1399

>>> p id
$1 = 0

They should at least check that the index can become negative (id is 0 here, so id-1 is out of range), no? Who would doubt that llvm is very reliable and secure
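The bitcode blobs mentioned above can be located with a plain signature scan. Below is a minimal sketch, assuming the whole binary has already been read into memory; find_bitcode_blobs is a hypothetical helper, and every hit is only a candidate, since the same 4 bytes can occur by chance.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offsets of every occurrence of the signature 0x42 0x43 0xc0 0xde.
std::vector<std::size_t> find_bitcode_blobs(const std::vector<std::uint8_t>& image) {
    static constexpr std::array<std::uint8_t, 4> magic{0x42, 0x43, 0xC0, 0xDE};
    std::vector<std::size_t> offsets;
    auto it = image.begin();
    while ((it = std::search(it, image.end(), magic.begin(), magic.end())) != image.end()) {
        offsets.push_back(static_cast<std::size_t>(it - image.begin()));
        ++it;  // resume the search one byte after this hit
    }
    return offsets;
}
```

Each offset can then be cut out into a separate file and fed to llvm-dis; where a blob ends has to be determined separately.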

So if they process llvm ByteCode then they also must link half of llvm run-time to do it, but they also use 

Sunday, March 30, 2025

ptx instructions emitted by nvidia compiler

I recently became curious about which ptx instructions the nvidia compiler can actually produce: whether it uses something totally undocumented or, vice versa, whether some official ptx instructions are never generated during compilation

The first question is where that compiler is located. No, it's not nvcc; the real compiler is cicc from the package cuda-nvvm. cicc from v10 is 21Mb in size. The strings utility shows many interesting things, like

Portions Copyright (c) 1988-2016 Edison Design Group, Inc.
Portions Copyright (c) 2007-2016 University of Illinois at Urbana-Champaign.
Based on Edison Design Group C/C++ Front End

So they use front-end from Edison Design Group and llvm as back-end

Then I extracted several tables:

  1. compiler errors
  2. list of built-in functions with prototypes
  3. list of llvm attributes - as you can see, they mostly correspond to the nvvm LLVM dialect
  4. and finally what I was looking for - a list of internal instruction names & their bodies to place into the PTX file

As you can see, the mapping is very straightforward: for example, the instruction BFE_S32rii (index 0x27) generates PTX bfe.s32

Results

I found only a few minor PTX instructions that are not present in the official documentation: suq.xxx, which perhaps means surface qword or something like that

Wednesday, March 26, 2025

nvidia sass disassembler, part 5

Previous parts: 1, 2, 3 & 4

I've finally added native rendering for instructions; actually it is just a rewrite of the terrible Perl function make_inst. Because the output typically renders only a small fraction of all instructions, the data for formats is filled on demand via std::call_once. Results compared with genuine nvdisasm:

mine                                    nvdisasm
LDC R1,c:[0][0x37C]                     LDC R1, c[0x0][0x37c]
LDCU.64 UR8,c:[0][URZ+0x440]            LDCU.64 UR8, c[0x0][0x440]
LDC R16,c:[0][0x3B8]                    LDC R16, c[0x0][0x3b8]
LDCU.64 UR12,c:[0][URZ+0x448]           LDCU.64 UR12, c[0x0][0x448]
LDCU UR4,c:[0][URZ+0x3AC]               LDCU UR4, c[0x0][0x3ac]
LDC._64 R4,c:[0][0x450]                 LDC.64 R4, c[0x0][0x450]
LDCU.64 UR14,c:[0][URZ+0x380]           LDCU.64 UR14, c[0x0][0x380]
LDCU.64 UR10,c:[0][URZ+0x358]           LDCU.64 UR10, c[0x0][0x358]
HFMA2 R13,-RZ,RZ, 1.875000, 0.000000    HFMA2 R13, -RZ, RZ, 1.875, 0
ISETP.NE.S64.AND P2,PT,RZ,UR8,PT        ISETP.NE.S64.AND P2, PT, RZ, UR8, PT

IMHO very similar; there are some minor problems with the formatting of floating point values (I used FP16 to extract 16-bit values, but I don't know what E8M7Imm means in the format descriptor)

So the next thing to show is 

labels for branches

As I mentioned, you can identify an instruction as a branch via its PROPERTIES, get the value in BRANCH_TARGET_INDEX and render it as a label address. There are two problems:

  1. the size of the branch offset varies: it can be 58 bits for sm_90, 50 bits for sm_75, 24 for sm_3 and so on
  2. the branch offset is a signed value, so we need some method to detect that a value of known bit size is negative
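Both problems can be handled by one generic sign-extension helper that takes the field width as a parameter. This is a common bit trick and not necessarily what my disasm does; the name sign_extend is illustrative, the width must be in the range 1..63, and the constexpr body needs C++14.

```cpp
#include <cstdint>

// Interprets the low `bits` bits of `raw` as a two's complement number.
// Valid for 1 <= bits <= 63.
constexpr std::int64_t sign_extend(std::uint64_t raw, unsigned bits) {
    const std::uint64_t sign = std::uint64_t{1} << (bits - 1);  // top bit of the field
    const std::uint64_t mask = (sign << 1) - 1;                 // all bits of the field
    const std::uint64_t value = raw & mask;
    // Flipping the sign bit and subtracting it back maps the upper half
    // of the range to negative numbers.
    return static_cast<std::int64_t>(value ^ sign) - static_cast<std::int64_t>(sign);
}
```

So a 24-bit field holding 0xFFFFFF decodes to -1, and the same function works unchanged for the 50-bit and 58-bit offsets.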

Friday, March 21, 2025

nvidia sass disassembler, part 4

I've made a native sass disasm, just by adding c++ codegen (it can be produced by ead.pl with the -C option). It works via dynamic loading of the right disasm module; see the list of supported architectures in the map s_sms. For now it supports only operand dumps with the -O option; they are not rendered yet (because rewriting a bunch of duck-typed perl code in C++ is boring and tedious work). You can also dump attributes with the -e option. You can build those modules with something like "make sm90.so". Btw dumb gcc allocates ~600kb on the stack for local vars, and with the -Os option it compiles each module for 10 minutes, with stack consumption shrinking to normal values

Tests show zero unrecognized instructions (and I am truly proud of this). However, if you do find one, I also added the -N option to dump its content as a bit mask, which you can then pass to ead.pl with the same -N option to see what happened

On the other hand, it seems that nvidia is trying to hide something important from us. Let's check libcublas.so from v12, where we can notice lots of sections:

  • .nv.merc.nv.info - genuine nvdisasm is unable to show their content
  • .nv.capmerc.text - the instructions they contain are clearly in some other format and cannot be disassembled. I added the -s option to disasm a single section by its index, so you can try it yourself
  • and they obviously have corresponding relocs in sections .nv.merc.rela.text
  • and even .nv.merc.rela.debug_frame & .nv.merc.symtab

Known problems

Friday, March 14, 2025

nvidia sass disassembler, part 3

It looks like this rabbit hole goes much deeper

Some const banks do not have ConstBankAddressX:
CX:Sb[UniformRegister:URb][UImm(16)*:Sb_offset]
BITS_6_37_32_Ra_URb=URb
BITS_14_53_40_Sb_offset=Sb_offset SCALE 4 
 
Btw there is no encoding for field Sb
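For illustration, here is how the two field descriptions above could be decoded by hand. I assume that a name like BITS_14_53_40 means a 14-bit field occupying bits 40..53, that bit 0 is the least significant bit of the low 64-bit word of the instruction, and that SCALE 4 means the raw value is multiplied by 4; extract_bits, CxOperand and decode_cx are hypothetical names.

```cpp
#include <cstdint>

// Extracts bits [low, high] (inclusive) from a 64-bit word.
constexpr std::uint64_t extract_bits(std::uint64_t word, unsigned high, unsigned low) {
    const unsigned width = high - low + 1;
    const std::uint64_t mask =
        width >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << width) - 1;
    return (word >> low) & mask;
}

// Hypothetical decoded form of CX:Sb[URb][Sb_offset].
struct CxOperand {
    unsigned urb;          // BITS_6_37_32_Ra_URb
    std::uint64_t offset;  // BITS_14_53_40_Sb_offset, already scaled
};

constexpr CxOperand decode_cx(std::uint64_t low_word) {
    return {static_cast<unsigned>(extract_bits(low_word, 37, 32)),
            extract_bits(low_word, 53, 40) * 4};  // SCALE 4
}
```

With URb = 8 and a raw offset of 0xF4 the decoded offset is 0x3D0.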

Next they have desc memory:
DESC:memoryDescriptor[UniformRegister:Ra_URb][Register:Ra /ONLY64:input_reg_sz_64_dist + SImm(24/0)*:Ra_offset]

genuine nvdisasm shows them like LDG.E.U16.CONSTANT R10, desc[UR8][R2.64], my disasm as LDG.E.U16.CONSTANT ,R10,desc[UR8][R2.64 + 0x0]

And finally we also have:

A:srcAttr[ UniformRegister:URa + SImm(11/0)*:URa_offset ]

GMMA:gdesc[ UniformRegister:URb ]

TMA:desc[ UniformRegister:URe ]

TTU:ttuAddr[ UImm(16)*:ImmU16 ]

RF:indexURb[UniformRegister:URb] ','UImm(4/0xf)*:PixMaskU04 

TMEMA:tmemA[ UniformRegister:URa ] 

TMEM perhaps means "tensor memory" and I have no idea about the rest of the prefixes