Sunday, June 15, 2025

nvdisasm sass parser

With a sass assembler in hand, making a parser for it seems like an easy task. So I made a parser for nvdisasm output
 
Let's check a sample:
SHF.R.S32.HI R209, RZ, 0x2, R209 ;
It looks like an easy application of an LL(1) parser - you first select the instruction, then process its optional enums (separated by dots) and then just try to match operands separated by commas, right? Well, no - the grammar of sass is not regular and we can run into lots of quirky cases
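To make the "easy" reading concrete, here is a minimal sketch of that naive approach. The helper name naive_parse is hypothetical and is not taken from the actual parser; it only splits the head on dots and the tail on commas.

```python
def naive_parse(line):
    """Split one nvdisasm line into (mnemonic, modifiers, operands).

    Naive on purpose: assumes the name has no dots and operands use commas.
    """
    text = line.strip().rstrip(";").strip()
    head, _, tail = text.partition(" ")
    mnemonic, *modifiers = head.split(".")
    operands = [op.strip() for op in tail.split(",") if op.strip()]
    return mnemonic, modifiers, operands


print(naive_parse("SHF.R.S32.HI R209, RZ, 0x2, R209 ;"))
```

It handles the sample above, but a line like BRX R2 -0x110 comes back with a single fused operand "R2 -0x110", which is exactly the kind of quirk described in the following sections.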

Instruction names with '.'

It's perfectly legal to meet both "UIADD3" and "UIADD3.64". They have different encodings and are not even marked as ALTERNATE
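One possible way to cope with dotted names is to try the longest known name first. This is only a sketch under that assumption: split_opcode and the known_opcodes set are hypothetical, and a real parser may need to try every candidate prefix instead of stopping at the longest one.

```python
def split_opcode(head, known_opcodes):
    """Return (opcode, remaining_modifiers), preferring the longest known name."""
    parts = head.split(".")
    for n in range(len(parts), 0, -1):
        candidate = ".".join(parts[:n])
        if candidate in known_opcodes:
            return candidate, parts[n:]
    raise ValueError(f"unknown opcode in {head!r}")


# Hypothetical dictionary; the real one would come from the SM description.
known = {"UIADD3", "UIADD3.64", "SHF"}
print(split_opcode("UIADD3.64", known))
print(split_opcode("SHF.R.S32.HI", known))
```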

Pseudo opcodes

We can observe an enum whose values are totally indistinguishable:
PSEUDO_OPCODE "nopseudo_opcode"=0 , "SHL"=0 , "ISCADD"=0 , "IADD"=0 , "MOV"=0;
 
and a sample of its use:
Opcode /LOOnly("LO"):wide /PSEUDO_OPCODE("nopseudo_opcode"):pseudo_opcode
 
Btw the pseudo_opcode operand doesn't even have a corresponding encoding field. In essence, instructions like IMAD.IADD, IMAD.MOV & IMAD.SHL have exactly the same encoding form. I don't know how nvdisasm selects PSEUDO_OPCODE - probably they borrowed the hallucination generator from chatgpt

Enums can contain '.' too

Yes - enum names can be something like SR_CTAID.X, SR_CTAID.Y & SR_CTAID.Z

Operands are not always separated with ','

BRX R2 -0x110 (*"INDIRECT_CALL"*) 

nvdisasm can't show some fields

especially batch & pm_pred. A typical instruction tail looks like:
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$
but nvdisasm output contains only &wr=0x1 for WR, &rd=0x2 for RD and ?something for USCHED_INFO
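A small sketch of picking up just those visible tail fields. The regular expression and the name parse_tail are my own assumptions about the textual shape described above; fields that nvdisasm does not print simply stay absent, so the caller has to fall back to defaults for them.

```python
import re

# Matches "&wr=0x1", "&rd=0x2" or "?name"; nothing else is expected in the tail.
TAIL_RE = re.compile(r"&(wr|rd)=(0x[0-9a-fA-F]+)|\?(\w+)")


def parse_tail(text):
    """Collect the scheduling fields that are actually printed."""
    fields = {}
    for m in TAIL_RE.finditer(text):
        if m.group(1):
            fields[m.group(1)] = int(m.group(2), 16)
        else:
            fields["usched_info"] = m.group(3)
    return fields


print(parse_tail("&wr=0x1 &rd=0x2 ?trans1"))
```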

Results

SM     parsing rate   avg forms
5      1.0            1.0
55     1.0            1.0
57     1.0            1.0
70     1.0            1.002404
75     1.0            1.018318
86     1.0            1.0
90     1.0            1.001589
100    1.0            1.016845
120    1.0            1.000225

Source of ambiguity

Let's run pa with options -Ssv to dump the original text and all matched forms. We can see something like:
BAR.SYNC.DEFER_BLOCKING 0x0
2 forms:
 19342 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm E:Rc.D(255) req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)
 19286 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm ,Sc:UImm req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)

The first form has an additional register operand with default value 255, and the second has yet another UImm operand Sc with default value 0 (UImm(12/0)*:Sc) - so from the text alone they cannot be distinguished
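The ambiguity can be modelled in a few lines, assuming that an operand equal to its default is simply not printed. The operand lists below are a simplified, hypothetical view of forms 19342 and 19286, not the real data structures.

```python
def visible_text(mnemonic, operands):
    """operands: list of (value, default); defaulted operands are not printed."""
    shown = [hex(value) for value, default in operands if value != default]
    return mnemonic + " " + ", ".join(shown)


# Sb:UImm has no default; the trailing operand differs between the two forms.
form_19342 = [(0x0, None), (255, 255)]  # extra register Rc, default 255
form_19286 = [(0x0, None), (0, 0)]      # extra UImm Sc, default 0

a = visible_text("BAR.SYNC.DEFER_BLOCKING", form_19342)
b = visible_text("BAR.SYNC.DEFER_BLOCKING", form_19286)
print(a == b, a)
```

Both forms render to the same text, so a parser working only from that text has to report both.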

Friday, May 30, 2025

nvidia sass assembler

I am very skeptical about patching existing .cubin files - it requires too much book-keeping. Let's say we want to insert several additional instructions into some function - then we need to

  1. extend the section containing the code of that function by patching the section table
  2. patch the symbol table/relocs
  3. disasm the whole function and build a code-flow graph for all instructions in the function
  4. fix offsets for jumps
  5. fix attributes like EIATTR_INDIRECT_BRANCH_TARGETS & EIATTR_JUMPTABLE_RELOCS
  6. and so on

While points 1-2 can be implemented with ELF patching libraries like elftools, it is still far too much tedious labour
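To give a feeling for the book-keeping behind points 3-4 alone, here is a rough sketch. It assumes 16-byte instructions, relative offsets counted from the address of the next instruction, and branches stored with the index of their target; all of these are my assumptions for illustration, and insert_and_fix is a hypothetical helper. Sections, symbols and EIATTR attributes are not touched at all.

```python
INSTR_SIZE = 16  # bytes per instruction; assumption (128-bit encodings)


def insert_and_fix(code, pos, new_instrs):
    """Insert new_instrs before index pos and recompute branch offsets.

    code: list of dicts; a branch carries 'target' = index of its destination.
    A branch aimed at the old instruction at pos keeps pointing at it,
    i.e. it jumps over the inserted code. new_instrs must not contain branches.
    """
    patched = [dict(ins) for ins in code]
    for ins in patched:
        if "target" in ins and ins["target"] >= pos:
            ins["target"] += len(new_instrs)
    patched[pos:pos] = [dict(ins) for ins in new_instrs]
    for i, ins in enumerate(patched):
        if "target" in ins:
            # offset is relative to the instruction following the branch
            ins["offset"] = (ins["target"] - (i + 1)) * INSTR_SIZE
    return patched


code = [{"op": "BRA", "target": 2}, {"op": "NOP"}, {"op": "EXIT"}]
print(insert_and_fix(code, 1, [{"op": "MOV"}]))
```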

For example, CuAssembler prefers to create new .cubin files from scratch. In any case we need some engine to generate sass instructions, and this task is perfectly achievable when you already have a disassembler. So I added some basic code generation features to my sass disasm engine:

  • dictionary of all instructions for given SM - method INV_disasm::get_instrs
  • for each instruction add encoders describing how to put values for fields, tables, constant banks & scheduling
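At the lowest level an encoder for a plain field boils down to writing a value into a bit range of the instruction word. A minimal sketch with hypothetical helpers put_field/get_field, using a Python int as the 128-bit word; the bit positions are borrowed from the sm90 field names BITS_1_72_72_e and BITS_3_75_73_sz quoted in the predicates post, and reading them as "width, high bit, low bit" is my assumption. Tables, constant banks and scheduling would need more than this.

```python
def put_field(word, low_bit, width, value):
    """Return word with `width` bits starting at `low_bit` set to value."""
    mask = (1 << width) - 1
    if value & ~mask:  # negative or does not fit into the field
        raise ValueError(f"value {value} does not fit into {width} bits")
    return (word & ~(mask << low_bit)) | (value << low_bit)


def get_field(word, low_bit, width):
    """Extract `width` bits starting at `low_bit`."""
    return (word >> low_bit) & ((1 << width) - 1)


word = put_field(0, 72, 1, 1)     # e  -> bit 72
word = put_field(word, 73, 3, 5)  # sz -> bits 73..75
print(hex(word), get_field(word, 73, 3))
```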

As an illustration I've implemented an interactive sass assembler (with some help from readline for auto-completion)

Sunday, May 4, 2025

nvidia sass latency tables

It seems that latency values are the best kept secret - I was able to find only one article on the internet, and its author didn't provide any code to decipher those tables. So

Disclaimer

All of the following are the shaky conclusions of my dark mind, almost certainly false and having no connection to reality

 

What they look like

Descriptions of latency tables are located in files *_2.txt and look like
TABLE_OUTPUT(UGPR) : UDP_subset`{URd @URdRange,URd2 @URd2Range}
                      R2UR_S2UR`{URd @URdRange,URd2 @URd2Range}
                       OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range}
                        ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range}=
{
    UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
    R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
    OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
    ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
};
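Just to show how such a body could be loaded, here is a sketch that reads the rows between the braces into a dictionary. The regular expression and parse_rows are hypothetical, and what the four numbers in each row mean (presumably one per entry of the header list, in the same order) is only my guess, in line with the disclaimer above.

```python
import re

# row name, then `{...} connector, then ':' and a run of integers
ROW_RE = re.compile(r"^\s*(\w+)`\{[^}]*\}\s*:\s*([\d\s]+)$")

TABLE_BODY = r"""
    UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
    R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
    OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
    ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
"""


def parse_rows(body):
    """Map row name -> list of integers from that row."""
    rows = {}
    for line in body.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows[m.group(1)] = [int(x) for x in m.group(2).split()]
    return rows


rows = parse_rows(TABLE_BODY)
print(rows["OP_R2UR_COUPLED"])
```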

Friday, April 18, 2025

nvidia sass disassembler, part 7: dual issued instructions

Previous parts: 1, 2, 3, 4, 5 & 6

As you may have noticed, genuine nvdisasm puts pairs of instructions in curly braces for old SMs (always 88 bits). I finally realized how those dual-issued instructions are selected - the first one must have USCHED_INFO equal to 0x10 (floxy2)
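A sketch of that rule, assuming each decoded instruction is available as a (text, usched_info) pair. group_dual_issue and the sample instructions are hypothetical and only illustrate the pairing, not how the real disassembler stores its data.

```python
DUAL_ISSUE = 0x10  # USCHED_INFO value of the first instruction in a pair


def group_dual_issue(instrs):
    """instrs: list of (text, usched_info). Returns a list of groups of texts."""
    groups, i = [], 0
    while i < len(instrs):
        text, usched = instrs[i]
        if usched == DUAL_ISSUE and i + 1 < len(instrs):
            groups.append([text, instrs[i + 1][0]])  # printed inside { }
            i += 2
        else:
            groups.append([text])
            i += 1
    return groups


sample = [("MOV R1, R2", 0x10), ("IADD R3, R4, R5", 0x1), ("EXIT", 0x1)]
print(group_dual_issue(sample))
```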

Interestingly, newer SMs (since 70) have no value 0x10 (16) at all:

 W15EG=15,
 WAIT15_END_GROUP=15,
 W1=17,
 trans1=17,

results

Friday, April 11, 2025

nvidia sass disassembler, part 6: predicates

Previous parts: 1, 2, 3, 4 & 5

Let's check how pairs of instructions are chained together - this information is stored in MD files with the suffix _2.txt - for example, from sm90_2.txt

CONNECTOR CONDITIONS

    RaRange = (((((MD_PRED(ISRC_A_SIZE)) >= (1)) ? (MD_PRED(ISRC_A_SIZE)) : (1)) - 1) >> 5) + 1;

What is ISRC_A_SIZE? It is one of the so-called PREDICATES of an instruction:

PREDICATES
 IDEST_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_B_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_C_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96);
 ISRC_A_SIZE = 32 + ((e==`E@E))*32;
So their values depend on instruction fields, like:
BITS_1_72_72_e=e
BITS_3_75_73_sz=sz
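As a worked example, the ISRC_A_SIZE predicate and the RaRange connector condition can be evaluated by hand. The sketch assumes "E"=1 in enum E (as in the enum description quoted further down); the function names are hypothetical. The result reads like a count of 32-bit registers covering the operand, though that is my interpretation.

```python
def isrc_a_size(e):
    """ISRC_A_SIZE = 32 + ((e==`E@E))*32, with E assumed to be 1."""
    return 32 + (e == 1) * 32


def ra_range(size):
    """RaRange = (((size >= 1 ? size : 1) - 1) >> 5) + 1"""
    return ((max(size, 1) - 1) >> 5) + 1


for e in (0, 1):
    size = isrc_a_size(e)
    print(e, size, ra_range(size))
```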

How can we convert these rules to C++? Well, they already have almost C++ syntax; we need to patch just two things:

  1. extract the values of all used fields (in this case e & sz)
  2. replace `ENUM@VALUE with the numerical value of the enum. Perl allows doing this with the cool regex modifier /e

So rule for ISRC_A_SIZE can be rewritten as:

int e = (int)e_iter->second; // extract value of e field
return 32 + ((e==1))*32;

Because enum E is described as E "noe"=0 , "E"=1;
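The substitution step itself is easy to sketch. This is not the Perl script, just the same idea in Python, where re.sub with a callback plays the role of /e; resolve_enums and the ENUMS dictionary are hypothetical, and only enum E is filled in because its values are the only ones quoted here.

```python
import re

# enum name -> {value name -> number}; only E is known from the text above
ENUMS = {"E": {"noe": 0, "E": 1}}


def resolve_enums(rule, enums):
    """Replace every `ENUM@VALUE (VALUE may be quoted) with its number."""
    def repl(m):
        value_name = m.group(2).strip('"')
        return str(enums[m.group(1)][value_name])
    return re.sub(r'`(\w+)@("[^"]*"|\w+)', repl, rule)


print(resolve_enums("32 + ((e==`E@E))*32", ENUMS))
```

An unknown enum or value raises KeyError, which is probably what you want while generating code.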

results

I've added option -p to my disasm to dump predicates:
> LDCU.128 UR16,c:[0][URZ+0x3D0] &0 &0 ?trans1 ?NOP ?PMN
P> ILABEL_URa_SIZE: 32
P> ISRC_A_SIZE: 32
P> IDEST_SIZE: 128

Friday, April 4, 2025

ptx instructions emitting by nvidia compiler. part 2

Part 1 described v10
Today let's check cicc v12. The first thing that catches your eye is its size - almost 76 MB! It also contains at least 5 different decryptors - Nvidia really wants to hide something from its grateful clients

Why is it so fat?

Because it contains at least 4 code generators: for arm32, aarch64, x86 & nvptx
+ at least 27 llvm bitcode blobs (signature 0x42 0x43 0xc0 0xde) - they mostly contain bodies of intrinsic functions like nvvm_mulq/nvvm_divq, but on some of them llvm-dis just crashes:

#0  0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339
1339      unsigned TableVal = IIT_Table[id-1];
>>> where
#0  0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339
#1  0x000055abcc5fe41f in UpgradeIntrinsicFunction1 (F=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/include/llvm/IR/Function.h:204
#2  0x000055abcc60111a in llvm::UpgradeIntrinsicFunction (F=F@entry=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/lib/IR/AutoUpgrade.cpp:1226
#3  0x000055abcc584778 in (anonymous namespace)::BitcodeReader::globalCleanup (this=0x55abce608e30) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:3696
#4  0x000055abcc5856cc in (anonymous namespace)::BitcodeReader::parseModule (this=<optimized out>, ResumeBit=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:4385
#5  0x000055abcc5959ca in (anonymous namespace)::BitcodeReader::parseBitcodeInto (Callbacks=..., IsImporting=false, ShouldLazyLoadMetadata=false, M=0x55abce5f3d80, this=0x55abce608e30) at /usr/include/c++/9/bits/std_function.h:564
#6  llvm::BitcodeModule::getModuleImpl (this=<optimized out>, Context=..., MaterializeAll=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:7981
#7  0x000055abcc596070 in llvm::BitcodeModule::getLazyModule (this=0x7ffc60415d10, Context=..., ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /usr/include/c++/9/bits/std_function.h:263
#8  0x000055abcc550e49 in main (argc=<optimized out>, argv=<optimized out>) at /home/redp/disc/src/llvm-project/llvm/include/llvm/Support/CommandLine.h:1399

>>> p id
$1 = 0

At the very least they should check whether the index can become negative, no? Who would doubt that llvm is very reliable and secure
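The embedded blobs mentioned above can be located by scanning the binary for that 4-byte signature. A minimal sketch: find_bitcode_offsets is a hypothetical helper, it only reports candidate offsets, and it does not try to determine blob sizes or weed out false positives.

```python
BITCODE_MAGIC = bytes([0x42, 0x43, 0xC0, 0xDE])


def find_bitcode_offsets(data):
    """Return every offset in data where the signature occurs."""
    offsets = []
    pos = data.find(BITCODE_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BITCODE_MAGIC, pos + 1)
    return offsets


demo = b"\x00" * 3 + BITCODE_MAGIC + b"abc" + BITCODE_MAGIC
print(find_bitcode_offsets(demo))
```

For a real file, read it in binary mode first, e.g. data = open(path, "rb").read(), where path points to your copy of cicc.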

So if they process llvm ByteCode then they also must link half of llvm run-time to do it, but they also use 

Sunday, March 30, 2025

ptx instructions emitting by nvidia compiler

I recently became curious about exactly which ptx instructions the nvidia compiler can produce - for example, whether it uses something totally undocumented or, vice versa, whether some official ptx instructions are never generated during compilation

The first question is where that compiler is located - no, it's not nvcc. The real compiler is cicc from the package cuda-nvvm. cicc from v10 has a size of 21 MB. The strings utility shows many interesting things, like

Portions Copyright (c) 1988-2016 Edison Design Group, Inc.
Portions Copyright (c) 2007-2016 University of Illinois at Urbana-Champaign.
Based on Edison Design Group C/C++ Front End

So they use front-end from Edison Design Group and llvm as back-end

Then I extracted several tables:

  1. compiler errors
  2. list of built-in functions with prototypes
  3. list of llvm attributes - as you can see, they mostly correspond to the nvvm LLVM dialect
  4. and finally what I looked for - list with internal instruction names & their bodies to place into PTX file

As you can see, the mapping is very straightforward - for example, instruction BFE_S32rii (index 0x27) generates PTX bfe.s32

Results

I found only a few minor PTX instructions that are not present in the official documentation: suq.xxx - perhaps it means surface qword or something like that