After extracting latency table I became curious how good the code produced by ptxas. Projects like CuAsmRL never estimated limits of profit after rescheduling - it's strange and looks even worse than famous "proof left as an exercise to the reader" - what if ptxas generates perfect code and there is just no space for instructions reordering?
So I wrote perl script to measure redundant stalls and want to present it and obtained results
The first thing was to convert latency table from plain text to some code. As you can see format is straightforward but some instructions have special cases like
I2F
3
I2F (not F64)
13
Anyway having latency value for each instruction is better than nothing, so next step was to add new method ins_lat into perl XS module for SASS disasm
Finally we can try to analyze latency of SASS instructions
Algorithm
Having stall count and latency of single instructions it's easy to compare it - if stall count is bigger - we have redundant latency. But some instructions must wait on read/write barrier - then their latency is variable and should be ignored - see function traverse_lat in dg.pl
But what if stall count (stored in 4bit field) is lesser than latency (which can be up to 48 cycles)? Clearly then we must sum stall counts for several instruction - but how to get their count?
I couldn't think of anything smarter than finding first instruction that uses a register or predicate that is changed by the current instruction. Highly likely it already have some official name in graph theory but being illiterate I named it Joint. In fact it is strictly opposite to SSA dominator. So we need registers/predicates tracking logic - see logic for Joints detection in function track2lat
So for such long latency instructions we must use totally different logic - try to find if we can fit their latency from original instruction till its joint. But there is another problem - what if some instruction inside this path was already patched? For now I used simplest logic - we just check if patched stall count is OK, else revert patch. Sure there can be several patched instructions - for them we should employ some kind of dynamic programming and check if we can fit latency with patch and without it. However this lead to exponential complexity so I decided not include this logic for first version
So algo is simple - we have 3 pass:
- try to detect simple redundant stall counts and put highly latency instructions in array (@tails)
- process @tails in reverse order to try find redundant stall counts on path till Joint
- finally collect all found results and update stat data
Results
- -g - build code flow graph
- -t - track registers to find Joints
- -l - for latency analysis
Комментариев нет:
Отправить комментарий