instructions that cicc cannot generate
- cctl for cache control
- lop3 - yeah, I have seen it many times in SASS, so it must be generated later by ptxas during its optimization passes
- r2p
- and all video instructions
Do you still believe what is written in Cyrillic?
So as you can see, there is no machine-readable grammar for modern PTX. Why is this important at all? Well, according to the "Official guide to inline PTX":
The compiler front end does not parse the asm() statement template string and does not know what it means or even whether it is valid PTX input
After extracting the latency table I became curious how good the code produced by ptxas really is. Projects like CuAsmRL never estimated the upper bound on the gain from rescheduling - it's strange and looks even worse than the famous "proof left as an exercise to the reader" - what if ptxas generates perfect code and there is simply no room for reordering instructions?
So I wrote a perl script to measure redundant stalls, and I want to present it together with the results I obtained
The first step was to convert the latency table from plain text to code. As you can see, the format is straightforward, but some instructions have special cases like
I2F
3
I2F (not F64)
13
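The conversion step can be sketched roughly like this in Python. This is not the author's perl code, only an illustration under one assumption: that the table is a sequence of alternating lines, an instruction name (optionally followed by a qualifier in parentheses) and then its latency. The function name parse_latency_table is made up for this example.

```python
import re

def parse_latency_table(text):
    """Parse alternating 'name' / 'latency' lines into a dict.

    Keys are (opcode, qualifier) tuples; qualifier is None when the
    name line has no parenthesized special case.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    table = {}
    # Even positions hold names, odd positions hold latency values.
    for name, value in zip(lines[0::2], lines[1::2]):
        m = re.fullmatch(r"(\S+)(?:\s+\((.*)\))?", name)
        key = (m.group(1), m.group(2)) if m else (name, None)
        table[key] = int(value)
    return table

# The four lines quoted above, used as sample input.
sample = """I2F
3
I2F (not F64)
13
"""
table = parse_latency_table(sample)
```

Keeping the qualifier as part of the key means the special cases stay distinguishable, so the lookup code can decide later which variant applies to a concrete instruction.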
Anyway, having a latency value for each instruction is better than nothing, so the next step was to add a new method ins_lat to the perl XS module for SASS disasm
Finally we can try to analyze the latency of SASS instructions
Having the stall count and the latency of a single instruction, it's easy to compare them - if the stall count is bigger, we have redundant latency. But some instructions must wait on a read/write barrier - then their latency is variable and they should be ignored - see function traverse_lat in dg.pl
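A minimal Python sketch of this comparison, not the actual traverse_lat from dg.pl. The Ins class, its field names and all numbers in the sample are invented for illustration; they are not real SASS latencies.

```python
from dataclasses import dataclass

@dataclass
class Ins:
    name: str
    stall: int                  # stall count from the control word
    latency: int                # fixed latency taken from the table
    uses_barrier: bool = False  # waits on a read/write barrier

def redundant_stalls(code):
    """Return (index, extra_cycles) for every instruction whose
    stall count exceeds its fixed latency."""
    found = []
    for idx, ins in enumerate(code):
        if ins.uses_barrier:
            # Variable latency: nothing meaningful to compare against.
            continue
        extra = ins.stall - ins.latency
        if extra > 0:
            found.append((idx, extra))
    return found

# Hypothetical values, chosen only to exercise all three branches.
code = [
    Ins("IADD3", stall=6, latency=4),
    Ins("LDG", stall=4, latency=2, uses_barrier=True),
    Ins("MOV", stall=2, latency=4),
]
report = redundant_stalls(code)
```

Here only the first instruction is reported: the second is skipped because of the barrier, and the third has a stall count below its latency, which is the case discussed next.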
But what if the stall count (stored in a 4-bit field) is less than the latency (which can be up to 48 cycles)? Clearly we must then sum the stall counts of several instructions - but how do we know how many of them?
I couldn't think of anything smarter than finding the first instruction that uses a register or predicate changed by the current instruction. Most likely it already has some official name in graph theory, but being illiterate I named it Joint. In fact it is the exact opposite of an SSA dominator. So we need register/predicate tracking logic - see the logic for Joint detection in function track2lat
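The Joint search can be sketched in Python as follows. This is not track2lat itself: it assumes straight-line code where each instruction already carries the sets of registers/predicates it writes (defs) and reads (uses). The extra rule that a register overwritten before any use no longer counts is my own assumption, and the instruction texts are simplified SASS-like strings made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Ins:
    text: str
    defs: frozenset = frozenset()  # registers/predicates written
    uses: frozenset = frozenset()  # registers/predicates read

def find_joint(code, idx):
    """Index of the first later instruction that reads something
    written by code[idx], or None if there is no such instruction."""
    pending = set(code[idx].defs)
    for j in range(idx + 1, len(code)):
        if pending & code[j].uses:
            return j
        # Assumption: an overwritten value can no longer be consumed.
        pending -= code[j].defs
        if not pending:
            return None
    return None

code = [
    Ins("IMAD R0, R1, R2, R3", defs={"R0"}, uses={"R1", "R2", "R3"}),
    Ins("MOV R4, R5", defs={"R4"}, uses={"R5"}),
    Ins("ISETP.GT P0, R0, R4", defs={"P0"}, uses={"R0", "R4"}),
    Ins("@P0 BRA L", uses={"P0"}),
]
joints = [find_joint(code, i) for i in range(len(code))]
```

For real code the search would also have to deal with branches and labels, which this sketch ignores completely.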
So for such long-latency instructions we must use totally different logic - try to check whether their latency fits between the original instruction and its joint. But there is another problem - what if some instruction inside this path was already patched? For now I used the simplest logic - we just check whether the patched stall count is OK, and otherwise revert the patch. Sure, there can be several patched instructions - for them we should employ some kind of dynamic programming and check whether we can fit the latency with and without each patch. However this leads to exponential complexity, so I decided not to include this logic in the first version
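One possible reading of this check, as a Python sketch. It assumes that the delay between issuing an instruction and issuing its joint is the sum of the stall counts on the path between them, and it reverts patches one by one in path order until the latency fits. The function name and the greedy revert order are assumptions of this example, not taken from the script.

```python
def fit_latency(orig, patched, start, joint, latency):
    """Return stall counts for the whole code where the path
    start..joint-1 covers `latency`, reverting patches if needed.

    orig    - stall counts as emitted by the compiler
    patched - stall counts after earlier reductions
    """
    current = list(patched)
    for i in range(start, joint):
        if sum(current[start:joint]) >= latency:
            break
        if current[i] != orig[i]:
            # The patched value breaks the budget: restore the original.
            current[i] = orig[i]
    return current

# Hypothetical stall counts; index 1 was patched from 6 down to 2.
orig = [4, 6, 5, 2]
patched = [4, 2, 5, 2]
kept = fit_latency(orig, patched, start=0, joint=3, latency=10)
reverted = fit_latency(orig, patched, start=0, joint=3, latency=13)
```

With a latency of 10 the patched path (4 + 2 + 5 = 11) is still long enough and the patch survives; with 13 it is not, so the patch at index 1 is rolled back. Trying every combination of kept and reverted patches instead of this greedy pass is exactly the part that grows exponentially.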
So the algorithm is simple - we have 3 passes:
strace -o c.strace -s 512 -f --trace=/^exec nvcc ...
Let's check the output file c.strace and see launches of
After some trials, the right combination of arguments for cicc is
NVVMCCWIZ=553282 cicc --nv_arch compute_XX --device-c -keep 1.cpp1.ii
ls -l *.bc
-rw-rw-r-- 1 redp redp 8072 mar 31 13:25 1.lgenfe.bc
-rw-rw-r-- 1 redp redp 9988 mar 31 13:25 1.lnk.bc
-rw-rw-r-- 1 redp redp 6500 mar 31 13:25 1.opt.bc
lgenfe.bc - bitcode from front-end
opt.bc - bitcode after all optimization passes
To disassemble it we can now just use llvm-dis-21:
%1 = tail call i32 asm sideeffect "activemask.b32 $0;", "=r"() #3, !dbg !11
%2 = tail call { i32, i1 } @llvm.nvvm.shfl.sync.i32(i32 %1, i32 3, i32 %val, i32 16, i32 31) #3, !dbg !17
%3 = extractvalue { i32, i1 } %2, 0, !dbg !17
I've added some support for the DWARF debug info produced by nvidia nvcc to my dwarfdump. As everyone knows, DWARF is over-complicated, fat and just disgusting - however, nvidia managed to take the nausea to a new level
As far as I know, this is the only book describing AI/ML compilers so far. Also TVM looks very promising - unlike monsters like XLA/IREE it is compact and comprehensible for mere mortals
Drawbacks:
Despite this, considering that the book is freely downloadable, my rating is 4 out of 5
In my first attempt I used latency tables extracted from the MD file (located inside nvdisasm) and nothing good came out of it
The obvious reason is that the real latency table should be located not in the disassembler - it must be inside ptxas. But the problem with that file is that it is really huge - in SDK 13 it is 40Mb in size. And of course no symbols are included
This is not surprising because it contains lots of things:
Besides, it does not have any tracepoints and a large part of its strings is encrypted. So it took lots of time and patience, but finally I found and extracted the right latency table
And then a lot of discoveries came my way