windows deep internals: libcuda.so internals

The first question that comes to mind when looking at the libcuda.so binaries is "why are they so huge?". For example, libcuda.so from CUDA 10.1 is 28 MB, and the one from 13.1 is already 96 MB. So I rejected the idea that they are just more victims of vibe-coding and did some preliminary RE. The answer: their .rodata section contains lots of CUBIN files for the kernel run-time.

kernel run-time

I extracted them (archive from 13.1) and checked the SASS. Now I am almost sure that NVIDIA has some internal SASS assembler: the code uses the LEPC instruction (which loads the address of the current instruction), and you just can't get that from the official ptxas.

   /*0160*/  LEPC R20 ; R20 now holds 0x170
   /*0170*/  IADD3 R20, P0, R20, 0x50, RZ 1 ; R20 += 0x50, carry-out goes to P0
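The extraction step mentioned above can be approximated with a simple carving pass. This is only a minimal sketch, not the tool used for the archive: it assumes each embedded CUBIN is a 64-bit little-endian ELF with `e_machine` set to 190 (EM_CUDA), and that the file ends at whichever header table (section or program) comes last. The function name `find_cubins` is made up for illustration.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EM_CUDA = 190  # e_machine value used by CUBIN files


def find_cubins(blob):
    """Yield (offset, size) for each embedded 64-bit LE CUDA ELF in blob."""
    pos = blob.find(ELF_MAGIC)
    while pos != -1:
        hdr = blob[pos:pos + 64]
        # EI_CLASS == 2 means ELF64, EI_DATA == 1 means little-endian
        if len(hdr) == 64 and hdr[4] == 2 and hdr[5] == 1:
            e_machine = struct.unpack_from("<H", hdr, 18)[0]
            e_phoff, e_shoff = struct.unpack_from("<QQ", hdr, 32)
            e_phentsize, e_phnum, e_shentsize, e_shnum = struct.unpack_from(
                "<HHHH", hdr, 54)
            # Assumption: the file ends where the last header table ends
            size = max(e_shoff + e_shentsize * e_shnum,
                       e_phoff + e_phentsize * e_phnum)
            if e_machine == EM_CUDA and 64 <= size <= len(blob) - pos:
                yield pos, size
                pos = blob.find(ELF_MAGIC, pos + size)
                continue
        pos = blob.find(ELF_MAGIC, pos + 1)
```

Feeding it the raw bytes of libcuda.so (or just its .rodata section) gives the offsets and sizes to write out as separate files; a tool such as cuobjdump or nvdisasm can then be tried on each one.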

What do those CUBIN files contain?

syscalls like __cuda_syscall_cp_async_bulk_tensor_XX, __cuda_syscall_tex_grad_XX etc
implementation of functions like cudaGraphLaunch/vprintf
functions cnpXXX like cnpDeviceGetAttribute
logic for kernel enqueue
some support for profiling like scProfileBuffers
trap handlers

and so on. In essence these are backstage workers, like the good old BIOS.

API callbacks

Another unusual thing caught my eye: almost all public API functions look like

                public cuMemAdvise
cuMemAdvise     proc near
                cmp     cs:finited, 321CBA00h
                jz      short loc_2EC348
                jmp     cs:off_1B105E8
loc_2EC348:            ; CODE XREF: cuMemAdvise+A↑j
                mov     eax, 4 ; CUDA_ERROR_DEINITIALIZED
                retn

As you can see, they jump to an address stored in the .data section. I don't know why this was done, but we can reuse this indirection for our own dirty purposes, like patching these pointers to trace specific CUDA API calls (instead of the ancient LD_PRELOAD trick). So I made an FSM to extract them.

Source. The test program tries to disassemble libcuda.so and dump all the callbacks it finds.
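To give a feel for what such an extractor looks for, here is a much cruder sketch than the FSM in the linked source: a fixed byte-pattern matcher for the stub shape shown in the cuMemAdvise listing. It assumes the x86-64 encodings `81 3D disp32 imm32` (cmp dword ptr [rip+disp32], imm32), `74 06` (jz short over the jmp) and `FF 25 disp32` (jmp qword ptr [rip+disp32]); other builds may use a different layout. The name `find_thunks` and its parameters are hypothetical.

```python
import struct

GUARD = 0x321CBA00  # magic compared against `finited` in the listing


def find_thunks(code, base):
    """Return [(stub_va, flag_va, slot_va)] for every matching API stub.

    code: raw bytes of the code section; base: its virtual address.
    """
    out = []
    i = code.find(b"\x81\x3d")
    while i != -1:
        stub = code[i:i + 18]
        if (len(stub) == 18
                and stub[6:10] == struct.pack("<I", GUARD)
                and stub[10:12] == b"\x74\x06"
                and stub[12:14] == b"\xff\x25"):
            cmp_disp = struct.unpack_from("<i", stub, 2)[0]
            jmp_disp = struct.unpack_from("<i", stub, 14)[0]
            va = base + i
            # RIP-relative operands are relative to the next instruction:
            # the cmp ends at va + 10, the jmp ends at va + 18
            out.append((va, va + 10 + cmp_disp, va + 18 + jmp_disp))
        i = code.find(b"\x81\x3d", i + 1)
    return out
```

Each `slot_va` is the address of a pointer in .data (`off_1B105E8` for cuMemAdvise in the listing above); that pointer is what one would overwrite to redirect the call. Mapping a stub address back to an exported name still needs the dynamic symbol table, which this sketch leaves out.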

Happy hacking!

2 comments:

January 1, 2026 at 17:43
When looking at the extracted cubin files, I noticed in the .note.nv.tkinfo section that some were generated by ptxas and some by "nvasm_internal", confirming that NVIDIA has an internal SASS assembler. Another weird thing is that all of them use 65 as the ELF OS/ABI (normally it's 51 for cubin files).
redp, January 1, 2026 at 19:40
It seems that the ABI version does not matter: you can patch it to any number.
Yes, they have an internal assembler because ptxas can't produce some instructions like LEPC; also see the function __cuda_syscall_asmFuncs.

Wednesday, December 24, 2025
