Tuesday, November 25, 2025

SASS latency table & instructions reordering

In these difficult times nobody wants to report bad or merely weak results (and this will eventually destroy our hypocritical civilization). Since this is my personal blog and I am not chasing grants, I don't care.

Let's dissect one truly inspiring paper - they employed reinforcement learning and claim that

transparently producing 2% to 26% speedup

wow, 26% is a really excellent result. So I decided to implement the proposed technique, but first I needed a source of latency values for each SASS instruction. I extracted the files with latency tables from nvdisasm - their names have _2.txt suffixes

Then I made a Perl binding for my Perl version of Ced (see the methods of the Cubin::Ced::LatIndex object), added a new pass (the -l option for dg.pl) and ran some experiments to dump latency values for each instruction. Because the order of connections is unknown, I implemented all 3 variants (sketched right after the list):

  1. current column with current row
  2. column from previous instruction with current row
  3. current column with row from previous instruction
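Below is a minimal Perl sketch of those three variants, assuming each decoded instruction yields a (row, column) pair into a 2D latency table; the row/col fields and the $lat hash layout here are my own placeholders for illustration, not the real Cubin::Ced::LatIndex API.

use strict;
use warnings;

sub lat_lookup {
    my ($variant, $prev, $curr, $lat) = @_;   # $lat->{$row}{$col} holds a latency value
    my ($row, $col);
    if    ($variant == 1) { ($row, $col) = ($curr->{row}, $curr->{col}); }  # 1: current column with current row
    elsif ($variant == 2) { ($row, $col) = ($curr->{row}, $prev->{col}); }  # 2: column from previous instruction with current row
    else                  { ($row, $col) = ($prev->{row}, $curr->{col}); }  # 3: current column with row from previous instruction
    return (defined $row && defined $col) ? $lat->{$row}{$col} : undef;     # undef when no entry exists
}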

The results are discouraging

  • some instructions (~1.5% for the best variant, 1) do not have a latency at all (for example S2R or XXXBAR)
  • some instructions have more than one index into the same table - well, I fixed this by selecting the max value (see the function intersect_lat; a rough sketch follows this list)
  • when comparing with the actual stall counts, the percentage of incorrect values is above 60% - that's even worse than just flipping a coin
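The max-selection logic can be illustrated roughly like this (a sketch only; the actual intersect_lat in dg.pl may look different):

sub intersect_lat_sketch {
    my ($lat_row, @indices) = @_;   # one table row plus all candidate column indices
    my $max;
    for my $i (@indices) {
        my $v = $lat_row->{$i};
        next unless defined $v;
        $max = $v if !defined($max) || $v > $max;   # keep the largest latency seen
    }
    return $max;                    # undef when none of the indices resolves to a value
}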

Some possible reasons for failure:

  1. I am just too stupid
    Highly likely
  2. Bad/incomplete reversing of the latency tables format
    "Talk is cheap. Show me the code." (c) Linus Torvalds
    The format of those files is partially described, so just write your own parser
  3. Latency tables are outdated
    While processing the _2.txt files my parser complains that it can't find lots of the mentioned instructions - they really are not present in the corresponding .MD file, so this is one possible reason
  4. ptxas doesn't use the latency tables from nvdisasm
    I personally believe that this is the main reason - the latency tables inside ptxas are simply different

Anyway, we still can implement

Instructions reordering 

The main idea is that for a pair of truly independent instructions we can hide the latency of the second one. Let's check a simple example:
; stall 4 total 125 cword 7E4 B--:R-:W-:Y:S4
/*F10*/ ISETP.NE.AND P2,PT,R10,RZ,PT ?WAIT4_END_GROUP ;
; stall 13 total 129 cword 7ED B--:R-:W-:Y:Sd
/*F20*/ ISETP.GE.U32.AND P1,PT,R24, 0x3,PT ?WAIT13_END_GROUP ; 

Here we have a pair with stall counts 4 and 13, 17 in total. What if we swap these instructions and decrease the larger stall count to 13 - 4 = 9? The result of the (now first) instruction will still be ready after 9 + 4 = the same 13 cycles, because the 4-cycle stall of the second instruction now hides part of its latency, but the overall stall count of the pair becomes only 9 + 4 = 13. So potentially we can get a speedup of 1 - 13/17 ≈ 23%. Yeah, sounds cool
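A quick back-of-the-envelope check of that arithmetic in plain Perl (nothing cubin-specific, just the numbers from the example above):

use strict;
use warnings;

my ($s1, $s2) = (4, 13);                # original stall counts of the pair
my $before = $s1 + $s2;                 # 17 cycles spent in stalls
my $after  = ($s2 - $s1) + $s1;         # after the swap: 9 + 4 = 13
printf "speedup: %.1f%%\n", 100 * (1 - $after / $before);   # prints ~23.5%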

So I added yet another pass (the -s option) to my dg.pl. I also added more restrictions on possible candidates (see the function can_swap; a rough sketch follows the list):

  • they must not be so-called "dual issued" pairs
  • they must not change the execution path, like CALL/JUMP
  • I excluded instructions having a RELA fixup - just because I don't know how to extract/patch the corresponding instruction field
  • and finally both instructions must have the same condition (or none at all), like
    @P1 instr1
    @P1 instr2
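A hedged sketch of such a predicate is below; the field names (dual, opcode, has_rela, pred) and the opcode list are purely illustrative and do not match the internals of can_swap in dg.pl.

sub can_swap_sketch {
    my ($i1, $i2) = @_;
    return 0 if $i1->{dual} || $i2->{dual};                       # skip "dual issued" pairs
    return 0 if grep { $_->{opcode} =~ /^(CALL|JMP|BRA|RET|EXIT)/ } ($i1, $i2);  # keep the execution path intact
    return 0 if $i1->{has_rela} || $i2->{has_rela};               # can't patch fields under a RELA fixup
    my ($p1, $p2) = ($i1->{pred} // '', $i2->{pred} // '');
    return 0 if $p1 ne $p2;                                       # same condition on both, or none at all
    return 1;
}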

Results

I gathered some statistics in my script (see the function dump_swap_stat; the numbers are multiplied out in the short snippet after this list):
  • each swap can give a speedup in the range of 20-25%
  • the percentage of instructions that can be rearranged varies greatly - from 10% to 25%, on my code 15-20% on average
  • so the overall speedup is about 0.2 * 0.15 = 3% to 0.2 * 0.2 = 4%
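As a toy estimate (just the same numbers multiplied out):

my $gain = 0.20;                              # lower bound of the per-swap gain above
for my $frac (0.15, 0.20) {                   # observed share of reorderable instructions
    printf "fraction %.2f -> overall speedup %.1f%%\n", $frac, 100 * $gain * $frac;  # 3.0% and 4.0%
}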
 
Theoretically, if we could reorder every pair of adjacent instructions (so at most half of all instructions get the reduced stall), we could get 0.2 * 0.5 = 10%. I don't know how the authors of the paper mentioned above were able to get 26%
