- I am able to write on it almost all I want
- when I can't - I can always to develop my own module(s)
- yet I don't feel sick like from pseudo languages like python
- and it damn good and fast when you try to sketch out prototypes for things you have no idea how to make
пятница, 17 октября 2025 г.
perl modules for CUBINs patching
среда, 1 октября 2025 г.
addresses of cuda kernel functions
Quote from official document:
It is not allowed to take the address of a
__device__function in host code
I haven't been surprised for a long time that entire CUDA is made up of ridiculous restrictions. What if I told you that paranoid nvidia lies as usually and actually you can get addresses of kernel functions in your host code?
But first lets check what workarounds we can employ to have functions pointers. I don't know for what pedagogical purpose this code intentionally was written so poorly and does not free the allocated memory - and now millions of brainless artificial idiots will copy-paste it forever, so I made patched version. You can realize that attempt to read from early gathered with cudaMemcpyFromSymbol addresses will results error 1 (invalid argument)
cuda-gdb
суббота, 13 сентября 2025 г.
practical ced usage: extracting sm machine ID
It's funny how paranoid nvidia trying to hide as much info from their customers as it can. One sample is so called "special registers" - even with PTX you can extract only limited set of them
So I played a bit with some undocumented SRs - namely with SR_MACHINE_ID_XX & SR_REGALLOC. I made legal loading of special registers and then patched those SASS instructions with my Ced
Lets see how those code looks in
PTX
mov.u32 %r2, %tid.x;
st.global.u32 [%rd2], %r2;
mov.u32 %r3, %tid.y;
st.global.u32 [%rd2+4], %r3;
mov.u32 %r4, %tid.z;
st.global.u32 [%rd2+8], %r4;
// inline asm
mov.u32 %r1, %smid;
// inline asm
st.global.u32 [%rd2+12], %r1;
mov.u32 %r5, 21;
st.global.u32 [%rd2+16], %r5;четверг, 7 августа 2025 г.
tool to extract/replace files within CUDA FatBinaries
Patching of cubin files is good, but loading and running them requires lots of code and using of Driver API. It would be much more convenient to patch SASS directly in binaries produced by nvcc
Unfortunately evil nvidia as usually shows it's paranoia:
- cuobjdump can list & extract content but not replace. Also it is extremely buggy on old libraries like libcublas.so v7
- official fatbinary is too complex and rebuilds whole file from scratch
- format of fatbinary is undocumented
- list files with -v option
- extract file at some index: -i idx -o output.filename
- replace file at some index: -i idx -r replace.filename
Perl binding
Limitation
- compressed fatbinaries not supported
- size of files must be the same
Some results
среда, 23 июля 2025 г.
ced: sed-like cubin editor
Unfortunately, the only sass assembler I know of has several drawbacks:
- it's inactive last couple of years. I dropped email to his author and he didn't replied. Hope he is well
- it don't support modern sm architectures sm1xx
- it's matmul solver sometimes produces wrong instructions
- and it don't support many EIATTRS
The last problem is not related with CuAssembler itself - it is more general: seems that nvdisasm produces output which cannot be used to assembly cubin files
Also we still don't know format of some sections like SHT_CUDA_RELOCINFO. All this makes task of rebuilding cubin files very hard
However do we really need to rebuild cubin files? In my experience 99.9% of desired patches are just set/remove some instructions attributes like register reusing/caching policy/wait groups for USCHED_INFO etc - just boring tuning to squeeze out the last couple of percent of productivity
So the flow of thought was something like
- it would be good to make plugin for hex-editor to disasm sass instruction at some known offset and show GUI where I could patch some fields
- I am talentless at creating GUI - so perhaps it would be better to dump instructions fields in text form and then just edit it
- hey - if you can parse this text representation and patch it back to sass - you don't need hex-editor at all - you could just use sed-like tool to patch instructions via script
and so being lazy and impatient I wrote such tool - it's called ced. Name similarity to sed is not coincidence - it allows you run text script to patch or replace some sass instructions inside cubin files
суббота, 19 июля 2025 г.
sass instructions: LUT operations
I was asked yesterday why I didn't transformed sample from my previous record
iadd r8, r2, r8 ; r8 = r2 + r8
iadd r8, r8, r8 ; r8 = r8 + r8
iadd r8, r8, ur4 ; r8 = r8 + ur4
to more simple
imad r8, r8, 2, ur4 ; r8 = r8 * 2 + ur4
While this is technically correct the problem here - ISA is non-orthogonal. You can use my ina to check available forms of IMAD for universal registers - and suddenly we will discover that it has only 2 forms
- @Pg IMAD E:wide E:fmt E:Rd E:Pu E:Ra E:reuse_src_a E:Rb E:reuse_src_b -E:URc
- @Pg IMAD E:wide E:fmt E:Rd E:Pu E:Ra E:reuse_src_a E:URb -E:Rc E:reuse_src_c
And no forms with imm value for Ra/Rb. So you can generate only something like:
imad r8, r8, rXX, ur4
And for UIMAD with imm values we have forms with universal registers only:
- @UPg UIMAD E:wide E:fmt E:X E:URd E:UPu E:URa ,Sb ~E:URc !E:UPp
- @UPg UIMAD E:wide E:fmt E:URd E:UPu E:URa ,Sb -E:URc
- etc
But all this is just kids games compared to LUT operations. In short - you can have 255 combinations of logical operations over 3 operands driven by index. nvdisasm shows them like:
LOP3.LUT R0, R3, R0, RZ, 0x30, !PT
Very informative, yeah. So I employed sympy to generate table of simplified expressions - however I am too old and lazy to write python scripts. So pretty obvious solution:
- make perl script to enumerate all possible combinations and generate python script
- which in turn generates string table
- and then sed add quotes and commas
LOP3.LUT PT,R0,R3,R0,RZ, 0x30,!PT &req={5}; LUT 30: a & ~bSo here a = R3, b = R0 and result R0 = R3 & ~R0
пятница, 18 июля 2025 г.
sass instructions: registers tracking
I've add tracking of registers to both nvd & pa - you can use -T option. And I have lots of bad news
nvdisasm lies
CS2R R100, SRZ
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /QInteger("64"):sz
Register:Rd
','SpecialRegister:SRa
PREDICATES
IDEST_SIZE = 32 + ((sz==`QInteger@"64"))*32;
FORMAT PREDICATE @[!]UniformPredicate(UPT):UPg Opcode
UniformRegister:URd
PREDICATES
IDEST_SIZE = 32;
lack of documentation
Predicates
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ICmpAll:icmp /REDUX_SZ("S32"):fmt /Bop:bop /EXONLY:ex
Predicate:Pu
','Predicate:Pv
PREDICATES
IDEST_SIZE = 0;
IDEST2_SIZE = 0;
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /OFMT_F16_V2_BF16_V2("F16_V2"):ofmt /FCMP:cmp /H_AND("noh_and"):h_and /FTZ("noftz"):ft
z /Bop:bop
Predicate:Pu
','Predicate:Pv
PREDICATES
IDEST_SIZE = 0;
IDEST2_SIZE = 0;
I don't know if they set their first predicate Pu only or both Pu & Pv. Btw famous IMAD has very curious MD for some forms:
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /HIONLY:wide /FMT("S32"):fmt /XONLY:X
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [~] Register:Rc {/REUSE("noreuse"):reuse_src_c}
',' [!]Predicate:Pp
Usually IMAD means multiply and add, so Rd = Ra * Rb + Rc. But here we have two predicates, so should it have semantic Rd = Ra * Rb * Pu + Rc * Pp?
Barriers
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /ONLY32:sz
BD:barReg
','CBU_STATE_NONBAR:cbu_state
PREDICATES
IDEST_SIZE = 0;
IDEST2_SIZE = 0;
BD "B10"=10 , "B11"=11 , "B14"=14 , "B4"=4 , "B5"=5 , "B6"=6 , "B7"=7 , "B0"=0 , "B1"=1 , "B2"=2 , "B3"=3 , "B15"=15 , "B12"=12 , "B8"=8 , "B9"=9 , "B13"=13;
ptxas produces code that is far from perfect
понедельник, 7 июля 2025 г.
sass instructions: uniform registers & wide loading
Having predicates for operands size and properties for type/identification we could write register tracking (well, at least up to sm90). But before we should familiarize yourself with couple of CUDA specific things
Uniform registers
Turing introduces a new feature intended to improve the maximum achievable arithmetic throughput of the main, floating-point capable datapaths, by adding a separate, integer-only, scalar datapath (named the uniform datapath) that operates in parallel with the main datapath
Regular instructions can access both uniform and regular registers. Uniform datapath instructions, instead, focus on uniform instructions almost exclusively
S2R R3, SR_TID.X
S2UR UR4, SR_CTAID.Y
S2R R10, SR_CTAID.Z ; R10 now contains value from special register SR_CTAID.Z
ULDC.64 UR10, c[0x0][0x118]
IMAD.WIDE R2, R10, R3, c[0x0][0x168] ; and here it's value is still alive
/*30*/ ULDC.64 UR4,c[0][0x118];
; unknown cb off 118
/*40*/ IMAD.WIDE R2,PT,R7,R6,c[0][0x168] &req={0};
; cb in section 254, offset 168 - 160 = 8
Wide loading
четверг, 3 июля 2025 г.
sass instructions properties
I've already described so called predicates. Unfortunately they have only size of operands. Unlike predicates properties also have types:
IDEST_OPERAND_MAP = (1<<INDEX(Rd));
IDEST_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC);
IDEST2_OPERAND_MAP = (1<<IOPERAND_MAP_NON_EXISTENT_OPERAND);
IDEST2_OPERAND_TYPE = (1<<IOPERAND_TYPE_NON_EXISTENT_OPERAND);
ISRC_B_OPERAND_MAP = (1<<INDEX(Rb));
ISRC_B_OPERAND_TYPE = (1<<IOPERAND_TYPE_GENERIC);
ISRC_C_OPERAND_MAP = (1<<INDEX(Rc));
ISRC_C_OPERAND_TYPE = (1<<IOPERAND_TYPE_TEX);
ISRC_A_OPERAND_MAP = (1<<INDEX(Ra));
ISRC_A_OPERAND_TYPE = (1<<IOPERAND_TYPE_SURFACE_COORDINATES);
This sample for suatom instruction. Here destination has single operand so DEST2 marked with NON_EXISTENT_OPERAND. Unfortunately properties has couple of serious drawbacks:
1) they were cut out by paranoid NVidia somewhere in version 12.7-12.8, so I ripped MDs with properties up to sm90 - sm100, sm101 & sm120 don't have them. I also tried to re-apply properties from sm90 to 3 remained - but this is very unreliable
2) they are not complete. Lets see couple of samples
пятница, 27 июня 2025 г.
curse of IMAD
Found strange case while disassembly some forms of IMAD (btw raison d'être of GPU). Official nvdisasm shows:
IMAD.WIDE R2, R7, R6, c[0x0][0x168] ; /* 0x00005a0007027625 */my nvd:
; IMAD line 63362 n 1196 15 render items 1 missed: wide
/*40*/ IMAD R2,P7,R7,R6,c[0][0x168] &req={0};
Problem here not only missed P7 - at least it has default value:
CLASS "imad_wide__RRC_RRC"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /WIDEONLY:wide /FMT("S32"):fmt
Register:Rd
','Predicate("PT"):Pu
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [-] C:Sc[UImm(5/0*):Sc_bank]* [SImm(17)*:Sc_addr]
Both P7 & PT has the same value 7 (and btw wide does not have corresponding encoding field). Mask for this instruction ends with "011000100101" - 0x5
Main problem is that IMAD with form Reg, Reg, Reg has another mask:
CLASS "imad__RRC_RRC"
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /LOOnly("LO"):wide /FMT("S32"):fmt
Register:Rd
','Register:Ra {/REUSE("noreuse"):reuse_src_a}
','Register:Rb {/REUSE("noreuse"):reuse_src_b}
',' [-] C:Sc[UImm(5/0*):Sc_bank]* [SImm(17)*:Sc_addr]
mask ends with "011000100100" - 0x4
As you can see original instruction bytes is 0x00005a0007027625 - nvdisasm just produced incorrect output
Why this happens? I have hypothesis that Nvidia just don't have own official sass asm and so output of nvdisasm never used/verified
воскресенье, 15 июня 2025 г.
nvdisasm sass parser
SHF.R.S32.HI R209, RZ, 0x2, R209 ; Instruction names with '.'
Pseudo opcodes
PSEUDO_OPCODE "nopseudo_opcode"=0 , "SHL"=0 , "ISCADD"=0 , "IADD"=0 , "MOV"=0; Opcode /LOOnly("LO"):wide /PSEUDO_OPCODE("nopseudo_opcode"):pseudo_opcode
Enums can contain '.' too
Operands not always separated with ','
BRX R2 -0x110 (*"INDIRECT_CALL"*) nvidasm can't show some fields
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$
Results
| SM | parsing rate | avg forms |
|---|---|---|
| 5 | 1.0 | 1.0 |
| 55 | 1.0 | 1.0 |
| 57 | 1.0 | 1.0 |
| 70 | 1.0 | 1.002404 |
| 75 | 1.0 | 1.018318 |
| 86 | 1.0 | 1.0 |
| 90 | 1.0 | 1.001589 |
| 100 | 1.0 | 1.016845 |
| 120 | 1.0 | 1.000225 |
Source of ambiguity
Lets run pa with options -Ssv to dump original text and all matched forms. We can see something like:BAR.SYNC.DEFER_BLOCKING 0x0
2 forms:
19342 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm E:Rc.D(255) req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)
19286 @Pg.D(7) BAR .E:barmode .E:defer_blocking Sb:UImm ,Sc:UImm req_bit_set:BITSET src_rel_sb:UImm(7) E:usched_info E:batch_t.D(0) E:pm_pred.D(0)
The first form has additional register operand with default value 255 and second has yet another UImm operand Sc with default value 0 (UImm(12/0)*:Sc) - so they cannot be distinguishedпятница, 30 мая 2025 г.
nvidia sass assembler
I am very skeptical about patching of existing .cubin files - it requires too much book-keeping. Let's say we want to insert several additional instructions into some function - then we need
- extend section containing code for those function by patching sections table
- patch symbols table/relocs
- disasm whole function and build code-flow graph for all instructions in function
- fix offsets for jumps
- fix attributes like EIATTR_INDIRECT_BRANCH_TARGETS & EIATTR_JUMPTABLE_RELOCS
- and so on
While points 1-2 can be implemented with ELF patching libraries like elftools it is anyway too much tedious labour
For example CuAssembler prefers to create new .cubin files from scratch. In any case we need some engine to generate sass instructions and this task is perfectly achieve-able when you have ready disassembler. So I add to my sass disasm engine some primary features for code generation:
- dictionary of all instructions for given SM - method INV_disasm::get_instrs
- for each instruction add encoders describing how to put values for fields, tables, constant banks & scheduling
As illustration I've implemented interactive sass assembler (with some help of readline for auto-completion)
воскресенье, 4 мая 2025 г.
nvidia sass latency tables
It seems that latency values are the best kept secret - I was able to find only article in internet and author didn't provided any code to decipher those tables. So
Disclaimer
How they are look like
TABLE_OUTPUT(UGPR) : UDP_subset`{URd @URdRange,URd2 @URd2Range}
R2UR_S2UR`{URd @URdRange,URd2 @URd2Range}
OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range}
ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range}=
{
UDP_subset`{URd @URdRange,URd2 @URd2Range} : 1 4 7 7
R2UR_S2UR`{URd @URdRange,URd2 @URd2Range} : 1 1 1 1
OP_R2UR_COUPLED`{URd @URdRange,URd2 @URd2Range} : 4 4 1 10
ULDC_VOTEU_UMOV_ULEPC`{URd @URdRange,URd2 @URd2Range} : 1 4 1 1
}; пятница, 18 апреля 2025 г.
nvidia sass disassembler, part 7: dual issued instructions
Previous parts: 1, 2, 3, 4, 5 & 6
As you could notice genuine nvdisasm put couple of instructions in curly braces for old sm (always 88bits). So I finally realized how those dual issued instructions are selected - the first one must have USCHED_INFO eq 0x10 (floxy2)
Interesting note that more new sm (since 70) missed 0x10:
W15EG=15,
WAIT15_END_GROUP=15,
W1=17,
trans1=17,results
пятница, 11 апреля 2025 г.
nvidia sass disassembler, part 6: predicates
Previous parts: 1, 2, 3, 4 & 5
Lets check how pairs of instructions are chained together - this information stored in MD files with prefix _2.txt - for example from sm90_2.txt:
CONNECTOR CONDITIONS
RaRange = (((((MD_PRED(ISRC_A_SIZE)) >= (1)) ? (MD_PRED(ISRC_A_SIZE)) : (1)) - 1) >> 5) + 1;
What is ISRC_A_SIZE? They are so called PREDICATES of instruction:
PREDICATES IDEST_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96); ISRC_B_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96); ISRC_C_SIZE = 32 + (((sz==`ATOMCASSZ@U64) || (sz==`ATOMCASSZ@"64"))*32 + ((sz==`ATOMCASSZ@"128"))*96); ISRC_A_SIZE = 32 + ((e==`E@E))*32;So their values depend on instruction fields, like:
BITS_1_72_72_e=e
BITS_3_75_73_sz=sz
How we can convert this rules to C++? Well, they already almost have C++ syntax, we need to patch two things:
- extract values of all used fields (in this case e & sz)
- replace `ENUM@VALUE with numerical value of enum. Perl allows do this using cool regex modifier /e
So rule for ISRC_A_SIZE can be rewritten as:
int e = (int)e_iter->second; // extract value of e field
return 32 + ((e==1))*32;
Bcs enum E described as E "noe"=0 , "E"=1;
results
> LDCU.128 UR16,c:[0][URZ+0x3D0] &0 &0 ?trans1 ?NOP ?PMN
P> ILABEL_URa_SIZE: 32
P> ISRC_A_SIZE: 32
P> IDEST_SIZE: 128
пятница, 4 апреля 2025 г.
ptx instructions emitting by nvidia compiler. part 2
Why it is so fat?
Bcs it contains at least 4 code generators: for arm32, aarch64, x86 & nvptx#0 0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339
1339 unsigned TableVal = IIT_Table[id-1];
>>> where
#0 0x000055abcc64743d in llvm::Intrinsic::getIntrinsicInfoTableEntries (id=0, T=...) at /home/redp/disc/src/llvm-project/llvm/lib/IR/Function.cpp:1339
#1 0x000055abcc5fe41f in UpgradeIntrinsicFunction1 (F=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/include/llvm/IR/Function.h:204
#2 0x000055abcc60111a in llvm::UpgradeIntrinsicFunction (F=F@entry=0x55abce63cf18, NewFn=@0x7ffc60414e70: 0x0) at /home/redp/disc/src/llvm-project/llvm/lib/IR/AutoUpgrade.cpp:1226
#3 0x000055abcc584778 in (anonymous namespace)::BitcodeReader::globalCleanup (this=0x55abce608e30) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:3696
#4 0x000055abcc5856cc in (anonymous namespace)::BitcodeReader::parseModule (this=<optimized out>, ResumeBit=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:4385
#5 0x000055abcc5959ca in (anonymous namespace)::BitcodeReader::parseBitcodeInto (Callbacks=..., IsImporting=false, ShouldLazyLoadMetadata=false, M=0x55abce5f3d80, this=0x55abce608e30) at /usr/include/c++/9/bits/std_function.h:564
#6 llvm::BitcodeModule::getModuleImpl (this=<optimized out>, Context=..., MaterializeAll=<optimized out>, ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /home/redp/disc/src/llvm-project/llvm/lib/Bitcode/Reader/BitcodeReader.cpp:7981
#7 0x000055abcc596070 in llvm::BitcodeModule::getLazyModule (this=0x7ffc60415d10, Context=..., ShouldLazyLoadMetadata=<optimized out>, IsImporting=<optimized out>, Callbacks=...) at /usr/include/c++/9/bits/std_function.h:263
#8 0x000055abcc550e49 in main (argc=<optimized out>, argv=<optimized out>) at /home/redp/disc/src/llvm-project/llvm/include/llvm/Support/CommandLine.h:1399
>>> p id
$1 = 0
At least they should check that index can become negative, no? Who would doubt that llvm is very reliable and secure
So if they process llvm ByteCode then they also must link half of llvm run-time to do it, but they also use
воскресенье, 30 марта 2025 г.
ptx instructions emitting by nvidia compiler
The first thing is where those compiler located - no, it's not nvcc. Real compiler is cicc from packet cuda-nvvm. cicc from v10 has size 21Mb. The strings utility shows many interesting things, like
Portions Copyright (c) 1988-2016 Edison Design Group, Inc.
Portions Copyright (c) 2007-2016 University of Illinois at Urbana-Champaign.
Based on Edison Design Group C/C++ Front End
So they use front-end from Edison Design Group and llvm as back-end
Then I extracted several tables:
- compiler errors
- list of built-in functions with prototypes
- list of llvm attributes - as you can see they are mostly correspond to nvvm LLVM dialect
- and finally what I looked for - list with internal instruction names & their bodies to place into PTX file
As you can see mapping is very straightforward - for example for instruction BFE_S32rii (index 0x27) generating PTX bfe.s32
Results
среда, 26 марта 2025 г.
nvidia sass disassembler, part 5
I've finally add native rendering for instructions - actually just rewrite from perl terrible function make_inst. Because in output typically rendering only small fraction of instructions data for formats are filling by demand via std::call_once. Results to compare with genuine nvdisasm:
| mine | nvdisasm |
|---|---|
| LDC R1,c:[0][0x37C] LDCU.64 UR8,c:[0][URZ+0x440] LDC R16,c:[0][0x3B8] LDCU.64 UR12,c:[0][URZ+0x448] LDCU UR4,c:[0][URZ+0x3AC] LDC._64 R4,c:[0][0x450] LDCU.64 UR14,c:[0][URZ+0x380] LDCU.64 UR10,c:[0][URZ+0x358] HFMA2 R13,-RZ,RZ, 1.875000, 0.000000 ISETP.NE.S64.AND P2,PT,RZ,UR8,PT |
LDC R1, c[0x0][0x37c] LDCU.64 UR8, c[0x0][0x440] LDC R16, c[0x0][0x3b8] LDCU.64 UR12, c[0x0][0x448] LDCU UR4, c[0x0][0x3ac] LDC.64 R4, c[0x0][0x450] LDCU.64 UR14, c[0x0][0x380] LDCU.64 UR10, c[0x0][0x358] HFMA2 R13, -RZ, RZ, 1.875, 0 ISETP.NE.S64.AND P2, PT, RZ, UR8, PT |
IMHO very similar, has some minor problems with formatting of floating point values (I used FP16 to extract 16bit values but don't know what means E8M7Imm in format descriptor)
So the next thing to show is
labels for branches
As I mentioned you can identify instruction as branches via it's PROPERTIES, get value in BRANCH_TARGET_INDEX and render it as label address. There are two problems:
- size of branch offset vary in size - it can be 58bit for sm_90, 50bit for sm_75, 24 for sm_3 and so on
- branch offset is signed value, so we need some method to detect that some value of known bit size is negative
пятница, 21 марта 2025 г.
nvidia sass disassembler, part 4
I've made native sass disasm - just adding c++ codegen (can be produced by ead.pl with -C option). It works via dynamic loading of right disasm module - see list of supported architectures in map s_sms. For now it supports only operands dump with -O option - not rendered yet (bcs rewriting bunch of perl code with duck-types to C++ is boring and tedious work). Also you can dump attributes with -e option. You can make those modules with something like "make sm90.so". Btw dumb gcc allocates for local vars ~600kb on stack and with -Os option it compiles each module for 10 minutes with stack consumption shrink to normal values)
Tests show zero unrecognized instructions (and I am truly proud of this), however if you will find such - I also add option -N to dump it's content to bit-mask, which you then can pass to ead.pl with the same -N option to see what happened
On the other side it seems that nvidia trying to hide something important from us - let's check libcublas.so from v12 - we can notice lots of sections
- .nv.merc.nv.info - genuine nvdiasm unable to show their content
- .nv.capmerc.text - however, the instructions they contain are clearly in some other format and cannot be disassembled - I add -s option to disasm single section by it's index, so you can try it by yourself
- and they obviously has corresponding relocs in sections .nv.merc.rela.text
- and even .nv.merc.rela.debug_frame & .nv.merc.symtab
Known problems
пятница, 14 марта 2025 г.
nvidia sass disassembler, part 3
It looks like this rabbit hole goes much deeper
CX:Sb[UniformRegister:URb][UImm(16)*:Sb_offset]BITS_6_37_32_Ra_URb=URbBITS_14_53_40_Sb_offset=Sb_offset SCALE 4 Next they have desc memory:
DESC:memoryDescriptor[UniformRegister:Ra_URb][Register:Ra /ONLY64:input_reg_sz_64_dist + SImm(24/0)*:Ra_offset]
genuine nvdisasm shows them like LDG.E.U16.CONSTANT R10, desc[UR8][R2.64], my disasm as LDG.E.U16.CONSTANT ,R10,desc[UR8][R2.64 + 0x0]
And finally we also have:
A:srcAttr[ UniformRegister:URa + SImm(11/0)*:URa_offset ]
GMMA:gdesc[ UniformRegister:URb ]
TMA:desc[ UniformRegister:URe ]
TTU:ttuAddr[ UImm(16)*:ImmU16 ]
RF:indexURb[UniformRegister:URb] ','UImm(4/0xf)*:PixMaskU04
TMEMA:tmemA[ UniformRegister:URa ]
TMEM perhaps means "tensor memory" and I have no idea about the rest of the prefixes
понедельник, 10 марта 2025 г.
nvidia sass disassembler, part 2
Lets continue explore "machine descriptions" - in this time try to understand how to make format output more similar to genuine nvdisasm
FORMAT PREDICATE @[!]Predicate(PT):Pg Opcode /Float64:dstfmt /SRCFMT_U16_S16:srcfmt /Round1("RN"):rnd
Register:Rd
','C:Sb[UImm(5/0*):Sb_bank]* [SImm(17)*:Sb_addr] /HSEL("H0"):hsel
$( { '&' REQ:req '=' BITSET(6/0x0000):req_bit_set } )$
$( { '&' RD:rd '=' UImm(3/0x7):src_rel_sb } )$
$( { '&' WR:wr '=' UImm(3/0x7):dst_wr_sb } )$
$( { '?' USCHED_INFO("DRAIN"):usched_info } )$
$( { '?' BATCH_T("NOP"):batch_t } )$
$( { '?' PM_PRED("PMN"):pm_pred } )$ ;...
ENCODING !i2f_Rd64__Cb_16b_unused; BITS_3_14_12_Pg = Pg; BITS_1_15_15_Pg_not = Pg@not; BITS_13_91_91_11_0_opcode=Opcode; BITS_3_77_75_sz=*dstfmt; BITS_3_85_84_74_74_srcfmt=*srcfmt; BITS_2_79_78_stride=rnd; BITS_8_23_16_Rd=Rd; BITS_5_58_54_Sb_bank,BITS_14_53_40_Sb_offset = ConstBankAddress2(Sb_bank,Sb_addr); BITS_2_61_60_hsel=hsel; BITS_6_121_116_req_bit_set=req_bit_set; BITS_3_115_113_src_rel_sb=VarLatOperandEnc(src_rel_sb); BITS_3_112_110_dst_wr_sb=VarLatOperandEnc(dst_wr_sb); BITS_2_103_102_pm_pred=pm_pred; BITS_8_124_122_109_105_opex=TABLES_opex_0(batch_t,usched_info);
четверг, 6 марта 2025 г.
nvidia sass disassembler
Couple weeks ago I made decryptor to extract from nvdisasm so called "machine descriptions" (MD) (btw nvdisasm v12 uses lz4 compression library, so I made yet another decryptor + results). And after that I became extremely curious whether it was possible to make full SASS disassembler - sure format of those MDs even more undocumented than syntax of PTX - but anyway it's much better than having no documentation about ISA at all
First and most important thing to check is width of instruction - it can be
- 64bit for sm37 (Kepler) and more old
- 88bit for sm5x (Maxwell) until sm70
- 128bit since sm70 (Volta, Turing, Ampere, Ada, Hopper & Blackwell)
I have a very limited imagination so I couldn't imagine how hardware could support alignment for 11 byte instruction just bcs 11 is not power of 2. So after some magic with debugger I found following code snippet:
lea eax, [r9+r9*4] ; eax = r9 * 5
lea ecx, [r9+rax*4] ; ecx = r9 * 21
mov eax, 1FFFFFh ; 17bit mask
and then some search in google revealed this document:
On Kepler there is 1 control instruction for every 7 operational instructions. Maxwell added additional control capabilities and so has 1 control for every 3 instructions
So 88bit became 4 64bit qwords where first is Control qword and 3 remaining are instructions, then 21 + 64 = 85bit - very close to 88
Note: such martian architecture makes it impossible to create IDA Pro processor module for Maxwell and more old GPUs - bcs IDA expects that instruction at any properly aligned address should be valid, and you just don't know there Control qword is located for block of instructions
Lets check how looks description of each instruction (from here onwards I will refer to sm90 MD)
понедельник, 24 февраля 2025 г.
ptx internals
It seems that syntax of PTX is undocumented - at least I was unable to find actual BNF grammar or reference implementation. Grammar from antlr project is greatly out-dated and don't contain instructions like ldmatrix/istypep (and btw it has 3505 lines - much bigger than c++14 syntax)
Another project is much better however it's also incomplete. Official documentation is most vague & obscure reading for last years (previous champion) - it contains word "deprecated" 24 times. So it's time to do some disassembling
First remarkable thing is lots of high-entropy parts of data in .rodata section - this is glare sign of encryption, so I made decryptor. After looking at encrypted data we can discover several undocumented features