windows deep internals: SASS latency analysis

вторник, 14 апреля 2026 г.

SASS latency analysis

After extracting latency table I became curious how good the code produced by ptxas. Projects like CuAsmRL never estimated limits of profit after rescheduling - it's strange and looks even worse than famous "proof left as an exercise to the reader" - what if ptxas generates perfect code and there is just no space for instructions reordering?

So I wrote perl script to measure redundant stalls and want to present it and obtained results

The first thing was to convert latency table from plain text to some code. As you can see format is straightforward but some instructions have special cases like

I2F 3 I2F (not F64) 13

so I made yet another perl script to generate latency table for C++ and bunch of enums for special cases - which then was implemented manually in method NV_renderer::calc_latency. Code is horrible and incompleted - I am not smartest person in the world so was just unable to find appropriate conditions for some cases in MD files. Also note that this code is result of reverse engineering so unknown how correct it is

Anyway having latency value for each instruction is better than nothing, so next step was to add new method ins_lat into perl XS module for SASS disasm

Finally we can try to analyze latency of SASS instructions

Algorithm

Having stall count and latency of single instructions it's easy to compare it - if stall count is bigger - we have redundant latency. But some instructions must wait on read/write barrier - then their latency is variable and should be ignored - see function traverse_lat in dg.pl

But what if stall count (stored in 4bit field) is lesser than latency (which can be up to 48 cycles)? Clearly then we must sum stall counts for several instruction - but how to get their count?

I couldn't think of anything smarter than finding first instruction that uses a register or predicate that is changed by the current instruction. Highly likely it already have some official name in graph theory but being illiterate I named it Joint. In fact it is strictly opposite to SSA dominator. So we need registers/predicates tracking logic - see logic for Joints detection in function track2lat

So for such long latency instructions we must use totally different logic - try to find if we can fit their latency from original instruction till its joint. But there is another problem - what if some instruction inside this path was already patched? For now I used simplest logic - we just check if patched stall count is OK, else revert patch. Sure there can be several patched instructions - for them we should employ some kind of dynamic programming and check if we can fit latency with patch and without it. However this lead to exponential complexity so I decided not include this logic for first version

So algo is simple - we have 3 pass:

try to detect simple redundant stall counts and put highly latency instructions in array (@tails)
process @tails in reverse order to try find redundant stall counts on path till Joint
finally collect all found results and update stat data

Results

As you can expect they greatly varied for different version of ptxas and your source code. In average I got 16-25% of redundant stall counts in average 33% of instructions. Sure you can try it on your own cubin files and probably get very different picture. Options for dg.pl to analyze SASS latency:

-g - build code flow graph
-t - track registers to find Joints
-l - for latency analysis

windows deep internals

вторник, 14 апреля 2026 г.

SASS latency analysis

Algorithm

Results

Комментариев нет:

Отправить комментарий

вторник, 14 апреля 2026 г.

SASS latency analysis

Algorithm

Results

Комментариев нет:

Отправить комментарий

вторник, 14 апреля 2026 г.