Disclaimer
It is highly likely that the author is an illiterate, inattentive, incompetent and lazy person with a poor imagination, so his hypotheses may be questionable, his ideas delusional and his analysis simply incorrect. It is also possible that I still haven't mastered IDA Pro after 28 years, so the extracted data may be incomplete or have missing parts. As always, all code is in Perl and therefore offends the aesthetic feelings of believers
Prior works
- Official PTX ISA. We all know that nvidia is evil and paranoid, so this document is also incomplete and maliciously conceals information. Proof is somewhere below in this text
- ANTLR ptx grammar - very outdated, based on cuda-waste parser from 2010
- infamous zluda. It's enough to look at their AST to understand that they support at best a third of the instructions
- nvopen-tools by Grigory Evko. AI generated slop, but at least we can borrow the instruction format and the argument decoding scheme from chapter 7
So, as you can see, there is no machine-readable grammar for modern PTX. Why is this important at all? Well, according to the "Official guide to inline PTX":
The compiler front end does not parse the asm() statement template string and does not know what it means or even whether it is valid PTX input
Therefore you can successfully compile your buggy code to PTX and then suddenly get mysterious errors when it is loaded dynamically through the JIT. Plus, I always suspected that nvidia hides as much information from us as possible
So I started by disassembling ptxas version V10.1.243 from sdk 13.1, looking for PTX instruction names (encrypted, btw)
Data extraction
Instruction attributes are filled in dynamically in two places
Please don't ask me why there are 2 separate places. More importantly, the code in both looks uniform:
pxor xmm0, xmm0
sub rsp, 48h
lea rcx, a0000+2 ; "00" - ins operands
lea rdx, aEx2 ; "ex2" - ins name
lea rsi, aH32h32+3 ; "H32" - operands types
mov r8d, 4 ; ins index
mov [rsp+48h+var_18], 0
mov [rsp+48h+var_38], 0
movaps [rsp+48h+var_28], xmm0 ; zero 16 bytes mask
mov byte ptr [rsp+48h+var_28], 0A0h ; fill mask with some values
mov byte ptr [rsp+48h+var_28+3], 8
...
mov byte ptr [rsp+48h+var_28+1], 2
movdqa xmm0, [rsp+48h+var_28] ; load filled mask
...
call ptx_ins_register_func
Format of each row:
- instruction index
- 16-byte mask
- instruction name then tab
- instruction operands then tab
- and finally types of operands in single string
I got 270 unique instruction names and 1420 rows
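A minimal Python sketch of reading one dumped row back, not the author's Perl code. It assumes the fields are whitespace-separated, that the index is decimal, that the mask bytes are hexadecimal, and that operands and types may be absent (as in the exit row shown later). The names Row and parse_row are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    index: int      # instruction index (assumed to be decimal)
    mask: bytes     # 16-byte attribute mask
    name: str       # instruction name, e.g. "add"
    operands: str   # operand string, e.g. "000" (empty if absent)
    types: str      # operand types, e.g. "F16" (empty if absent)


def parse_row(line: str) -> Row:
    """Split one dumped row into index, mask, name, operands and types."""
    tokens = line.split()
    if len(tokens) < 18:
        raise ValueError(f"expected index, 16 mask bytes and a name: {line!r}")
    index = int(tokens[0])
    mask = bytes(int(t, 16) for t in tokens[1:17])
    rest = tokens[17:] + ["", ""]   # pad missing operands/types with empty strings
    return Row(index, mask, rest[0], rest[1], rest[2])


row = parse_row("48 20 12 00 08 00 00 00 00 00 00 00 00 00 00 00 00 add 000 F16")
print(row.index, row.name, row.operands, row.types, row.mask.hex())
```

Keeping the mask as bytes preserves the byte order of the dump; converting it to an integer is left to whatever analysis comes next.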
Verification
The next logical question: how can we be sure that we have extracted all the data?
Well, earlier I extracted a huge list of PTX instructions from their cicc. So I wrote a simple Perl script to check the intersection
Yes, all PTX instructions from cicc are present in the dumped data. Of course this doesn't mean that I dumped everything, but it gives me some confidence.
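The coverage check boils down to a set difference. Here is a rough Python sketch (the original script is in Perl and is not shown). It assumes the comparison is done on base opcode names, i.e. the text before the first dot, and the sample lists are made up for illustration.

```python
def base_name(instruction: str) -> str:
    """Strip modifiers: 'add.sat.u32' -> 'add' (assumed comparison key)."""
    return instruction.split(".")[0]


def missing_names(reference, dumped):
    """Return reference opcodes that never show up in the dumped data."""
    return sorted({base_name(n) for n in reference} - {base_name(n) for n in dumped})


# Hypothetical sample data; the real lists come from cicc and from the ptxas dump.
cicc_names = ["add.sat.u32", "ex2", "exit"]
dumped_names = ["add", "ex2", "exit", "genmetadata"]

missing = missing_names(cicc_names, dumped_names)
print(missing)  # an empty list means every cicc opcode is covered
```

An empty result only shows that the dump is a superset of the cicc list; it says nothing about instructions missing from both.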
Analysis
First questionable hypothesis: these 16-byte masks are bitfields of instruction attributes. For example, in add.sat.u32 the instruction name must be add, and sat and u32 are each some 1 bit in the mask
Let's just look at the dumped data to check how dense the bit masks are and whether they are the same for each instruction:
48 20 12 00 08 00 00 00 00 00 00 00 00 00 00 00 00 add 000 F16
48 20 10 00 02 00 00 00 00 00 00 00 00 00 00 00 00 add 000 I
Wait - WHAT? The masks for the same instruction differ depending on the types of the arguments. How is this possible?
Well, one possible answer: the parser delays parsing of attributes until it knows the types of the operands. I frankly don't remember where I saw this idea, perhaps in the book "Parsing Techniques: A Practical Guide", which I read in the last century
So I counted the number of distinct bit positions set to 1 across all masks - 113
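Counting the bits that are used anywhere amounts to OR-ing all masks together and taking the population count of the result. A small Python sketch of that idea, using only the two add rows quoted above (so it prints a small number, not 113). The helper names are hypothetical.

```python
from functools import reduce


def mask_to_int(mask: bytes) -> int:
    """Treat the 16-byte mask as one 128-bit integer."""
    return int.from_bytes(mask, "big")


def used_bits(masks) -> int:
    """Number of bit positions set in at least one of the masks."""
    union = reduce(lambda acc, m: acc | mask_to_int(m), masks, 0)
    return bin(union).count("1")


# The two masks of "add" from the rows above, padded with zeros to 16 bytes.
add_f16 = bytes.fromhex("20120008" + "00" * 12)
add_i = bytes.fromhex("20100002" + "00" * 12)

print(used_bits([add_f16, add_i]))
```

The byte order chosen in mask_to_int does not affect the count, only which integer bit each attribute lands on.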
Attributes tables
and continued looking in the disassembly. Actually this was the most tedious part of the work: finding and extracting over a hundred tables with encrypted strings. Result:
ls -1 tabs/ | wc -l
113
Hallelujah - the balance has been balanced
Analysis 2
OK, I have masks per instruction and tables - but how to link them? Let's choose some attribute table and place it at index 0; then for index 1 we still have 112 candidates, and so on. In essence there are 113! possible variants
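To get a feeling for how hopeless brute force is, here is a tiny Python sketch that computes the size of that search space. Nothing in it comes from the original scripts.

```python
import math

# Number of ways to assign 113 attribute tables to 113 bit positions.
orderings = math.factorial(113)

# How many decimal digits that number has.
digits = len(str(orderings))
print(digits)
```

The result has 185 decimal digits, so any practical approach has to prune the space with extra constraints.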
Here I made a couple more hypotheses
- masks must preserve the order of attributes
- because the masks are used as a 128-bit word in an XMM register, the 1 with the lowest index must be located to the left. For example, the attribute corresponding to index 1 must precede the attribute at index 2
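One way to read the second hypothesis is to number bits from left to right: byte 0 first, and the most significant bit first within each byte. That numbering is my assumption, not something confirmed by the disassembly. Under it, the set bits of a mask can be listed in order with a short Python sketch (set_bits is a hypothetical helper):

```python
def set_bits(mask: bytes) -> list[int]:
    """Indices of 1 bits, counting from the leftmost bit of byte 0 (assumed order)."""
    bits = []
    for byte_pos, byte in enumerate(mask):
        for bit_pos in range(8):
            if byte & (0x80 >> bit_pos):    # test bits from MSB to LSB
                bits.append(byte_pos * 8 + bit_pos)
    return bits


# Mask of the "add ... F16" row, padded with zeros to 16 bytes.
add_f16 = bytes.fromhex("20120008" + "00" * 12)
print(set_bits(add_f16))
```

If the order-preserving hypothesis holds, the attributes of an instruction should appear in the same relative order as the indices returned here.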
So I added several simple set-operation options to my script:
- -f - do frequency analysis for each non-zero bit in mask
- -a - intersection of masks for several instructions
- -i - intersection of masks for several instructions minus masks of remaining
- -o - union of masks for several instructions minus masks of remaining
All masks found so far are stored in the map gk_tabs and can be ignored with option -k
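The four options can be sketched in Python roughly as follows. This is not the author's Perl code. It assumes masks are held as integers, that all rows of one instruction are first OR-ed into a single mask, and that bits are numbered from the least significant bit of that integer. The -k filter and gk_tabs are not modelled, and the sample rows are toy data.

```python
from collections import Counter
from functools import reduce


def per_instruction(rows):
    """Collapse (name, mask) rows into {name: OR of all its masks}."""
    merged = {}
    for name, mask in rows:
        merged[name] = merged.get(name, 0) | mask
    return merged


def bit_frequency(rows):
    """Like -f: how many rows set each bit (bit 0 = least significant)."""
    freq = Counter()
    for _, mask in rows:
        bit = 0
        while mask >> bit:
            if (mask >> bit) & 1:
                freq[bit] += 1
            bit += 1
    return freq


def others(merged, names):
    """OR of the masks of every instruction not listed in names."""
    return reduce(lambda a, b: a | b,
                  (m for n, m in merged.items() if n not in names), 0)


def common(merged, names):
    """Like -a: bits set in every listed instruction."""
    return reduce(lambda a, b: a & b, (merged[n] for n in names))


def exclusive_common(merged, names):
    """Like -i: common bits that no other instruction uses."""
    return common(merged, names) & ~others(merged, names)


def exclusive_union(merged, names):
    """Like -o: bits used by any listed instruction and by no other."""
    union = reduce(lambda a, b: a | b, (merged[n] for n in names), 0)
    return union & ~others(merged, names)


# Toy 4-bit masks, purely illustrative.
rows = [("add", 0b0111), ("sub", 0b0110), ("exit", 0b1100)]
merged = per_instruction(rows)
print(bin(common(merged, ["add", "sub"])))
print(bin(exclusive_common(merged, ["add", "sub"])))
print(bin(exclusive_union(merged, ["add", "sub"])))
print(sorted(bit_frequency(rows).items()))
```

Bits that survive exclusive_common for a group of instructions sharing one documented modifier are natural candidates for that modifier's table.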
And a couple of words on why nvidia lies, as usual
- there are totally undocumented instructions like genmetadata/spmetadata
- there are totally undocumented attributes - for example ignoreC_pred/ignoreC/frel
- some instructions have attributes not present in the official documentation
202 18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 exit
Current status
In a couple of days I was able to identify 26 attribute tables - only 23%
So if you are passionate about PTX, love digging into unstructured data and performing operations on sets, your help is welcome