Tuesday, March 31, 2026

dumping llvm bitcode from cicc

requires building a hijacked .so against the appropriate LLVM version. I am too lazy for this
 
Cool, but it does not work - cicc complains about bad arguments. I've tried many combinations with no luck
 
But hey - we are on Linux and can pull many hacks, for example checking what arguments genuine nvcc passes to cicc. For this I ran nvcc -dc -keep under strace:
strace -o c.strace -s 512 -f --trace=/^exec nvcc ...
Arguments:
  • -s NUM - maximum string size; arguments can be very long, so I set this parameter to 512
  • -f - trace child processes
  • and finally --trace - since I don't know which exact syscall is used to launch processes, I used the regex syntax to match all calls starting with exec
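To fish the interesting command lines out of c.strace, a small script like this can help. This is just a sketch: it assumes the usual `strace -f` line shape (`PID execve("/path/bin", ["bin", "args"...], ...)`), and it will choke on argv lists that strace truncated with `...`:

```python
import ast
import re

# strace -f --trace=/^exec prints lines like:
#   12345 execve("/opt/cuda/bin/cicc", ["cicc", "--nv_arch", ...], 0x7ffd... /* 60 vars */) = 0
EXEC_RE = re.compile(r'execve\("([^"]+)",\s*(\[.*?\]),')

def exec_calls(lines, name):
    """Yield (path, argv) for every execve of a binary with the given basename."""
    for line in lines:
        m = EXEC_RE.search(line)
        if not m:
            continue
        path, argv_text = m.group(1), m.group(2)
        if path.rsplit('/', 1)[-1] != name:
            continue
        # the argv list strace prints is close enough to a Python literal
        # to parse directly (unless strace truncated it)
        argv = ast.literal_eval(argv_text)
        yield path, argv

# hypothetical sample line; real paths/arguments will differ
sample = '12345 execve("/opt/cuda/bin/cicc", ["cicc", "--nv_arch", "compute_89", "-keep", "1.cpp1.ii"], 0x7ffd /* 60 vars */) = 0'
for path, argv in exec_calls([sample], 'cicc'):
    print(' '.join(argv))
```

Run it over `open('c.strace')` instead of the sample list to dump every cicc (or ptxas, fatbinary, ...) invocation from a real trace.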

Let's check the output file c.strace and see launches of:

  • gcc/cc1plus
  • cicc
  • ptxas
  • fatbinary
  • bin2c
  • cudafe++
  • etc

After some trials, the right combination of arguments for cicc is:
NVVMCCWIZ=553282 cicc --nv_arch compute_XX --device-c -keep 1.cpp1.ii
ls -l *.bc
-rw-rw-r-- 1 redp redp 8072 mar 31 13:25 1.lgenfe.bc
-rw-rw-r-- 1 redp redp 9988 mar 31 13:25 1.lnk.bc
-rw-rw-r-- 1 redp redp 6500 mar 31 13:25 1.opt.bc

lgenfe.bc - bitcode from front-end

opt.bc - bitcode after all optimization passes

To disassemble them we can now just use llvm-dis-21:

  %1 = tail call i32 asm sideeffect "activemask.b32 $0;", "=r"() #3, !dbg !11
  %2 = tail call { i32, i1 } @llvm.nvvm.shfl.sync.i32(i32 %1, i32 3, i32 %val, i32 16, i32 31) #3, !dbg !17
  %3 = extractvalue { i32, i1 } %2, 0, !dbg !17
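For reference, IR like this typically comes from warp intrinsics in the source: activemask.b32 maps to __activemask(), and the llvm.nvvm.shfl.sync.* intrinsics come from the __shfl_*_sync() family. A plausible (guessed - the kernel name, variable names, and the choice of the xor variant are my assumptions, not recovered from the bitcode) source fragment:

```cuda
// Hypothetical kernel producing activemask.b32 + llvm.nvvm.shfl.sync.* IR
// like the listing above: a classic warp butterfly reduction step.
__global__ void warp_reduce(int *out, const int *in) {
    int val = in[threadIdx.x];
    unsigned mask = __activemask();          // -> activemask.b32 inline asm
    val += __shfl_xor_sync(mask, val, 16);   // -> llvm.nvvm.shfl.sync intrinsic
    if ((threadIdx.x & 31) == 0)
        out[threadIdx.x >> 5] = val;
}
```

The { i32, i1 } return type in the IR is the intrinsic's value plus a predicate; plain __shfl_xor_sync() just drops the i1 via extractvalue, as in %3 above.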

Thursday, March 26, 2026

dwarf from nvcc

I've added some support for DWARF debug info from nvidia nvcc to my dwarfdump. As everyone knows, DWARF is over-complicated, fat and just disgusting - however, nvidia was able to take this nausea to a new level

relocs

Their cuda-gdb does not contain reloc_howto_type for CUDA relocs - it's a special kind of bare-minimum open source where they publish as little code as possible. So my implementation is highly likely incomplete and wrong

locations

Stored in the .debug_loc section - that's OK, although the last time gcc used it was somewhere around version 4. Also nvidia introduced a new attribute, DW_AT_address_class, for addresses in different segments. Cool, but for ADDR_const_space, for example, you can't tell in which constant bank the address was placed

register names

This is the main nightmare

Wednesday, March 18, 2026

read a couple of books about compilers

LLVM Compiler for RISC-V Architecture

Describes the details of risc-v vectorization support in llvm. It should be noted that vector operations in risc-v were implemented later than Intel's extensions and SVE in arm64 - so they took into account many of their flaws (like making masks for vector operations explicit) and ended up with something much more convenient from the programmer's point of view.
On the other hand, any HW vendor can add its own ISA subset, so compiler support for these custom processors can become very fragmented and a pure nightmare
 
Also I want to note that risc-v vector support in LLVM carefully avoids MLIR (IMHO the second most overrated thing after LLMs) - to do this they even had to patch their holy cow tablegen
 
Drawbacks:
  • there is no introduction to LLVM IR or the risc-v specific IR, so the long IR listings are very hard to follow
  • the author doesn't give links to the source code implementing the algorithms. Fortunately, elixir has indexed the whole LLVM source tree
4/5

Dive into Deep Learning Compiler

As far as I know, this is the only book describing AI/ML compilers so far. Also, TVM looks very promising - unlike monsters like XLA/iree, it is compact and comprehensible to mere mortals

Drawbacks:

  • the book is not completed - the last two chapters, about NN & deployment, are just placeholders
  • it's unclear why they didn't take cublas as the baseline for matrix multiplication on CUDA
  • and openblas for the CPU version

Despite this, considering that the book is freely downloadable, my rating is 4 out of 5

Friday, March 6, 2026

SASS latency table: second try

In my first attempt I used the latency tables extracted from the MD file (located inside nvdisasm), and nothing good came out of it

The obvious reason is that the real latency table should not be located in the disassembler - it must be inside ptxas. But the problem with that file is that it is really huge - in SDK 13 it is 40Mb in size. Sure, no symbols included

This is not surprising, because it contains lots of things:

  • ptxas parser
  • lots of macros
  • an optimizing compiler with 159 passes that doesn't use LLVM at all
  • code generators for several different SMs

Besides, it does not have any tracepoints, and a big part of the strings are encrypted. So it took lots of time and patience, but finally I found and extracted the right latency table

And then a lot of discoveries came my way