In these difficult times nobody wants to report bad or simply weak results (and this will destroy our hypocritical civilization). Since this is my personal blog and I am not looking for grants, I don't care.
Let's dissect one truly inspiring paper - they employed reinforcement learning and claim that
transparently producing 2% to 26% speedup
wow, 26% is a really excellent result. So I decided to implement the proposed technique, but first I needed a source of latency values for each SASS instruction. I extracted the files with latency tables from nvdisasm - their names have the _2.txt suffix
Then I made a perl binding for my perl version of Ced (see the methods of the Cubin::Ced::LatIndex object), added a new pass (-l option for dg.pl) and did some experiments to dump latency values for each instruction. Because the order of connections is unknown, I implemented all 3 variants:
- current column with current row
- column from previous instruction with current row
- current column with row from previous instruction
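The three pairing variants above can be sketched like this (a minimal illustration in Python; the `Idx` tuple and `lat_tab` dict are my own stand-ins, not the real Cubin::Ced::LatIndex structures):

```python
from collections import namedtuple

# latency-table indices carried by one instruction (hypothetical layout)
Idx = namedtuple("Idx", "col row")

def lookup(lat_tab, prev, cur, variant):
    if variant == 1:    # current column with current row
        key = (cur.col, cur.row)
    elif variant == 2:  # column from previous instruction with current row
        key = (prev.col, cur.row)
    else:               # current column with row from previous instruction
        key = (cur.col, prev.row)
    return lat_tab.get(key)  # None when the table has no such cell

lat_tab = {(0, 1): 6, (2, 1): 4, (0, 3): 15}
prev, cur = Idx(col=0, row=3), Idx(col=2, row=1)
print(lookup(lat_tab, prev, cur, 1))  # (2, 1) -> 4
print(lookup(lat_tab, prev, cur, 2))  # (0, 1) -> 6
print(lookup(lat_tab, prev, cur, 3))  # (2, 3) -> None
```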
The results are discouraging:
- some instructions (~1.5% even in the best case 1) do not have a latency at all (for example S2R or XXXBAR)
- some instructions have more than one index into the same table - well, I fixed this by selecting the max value (see function intersect_lat)
- when comparing against the actual stall counts, the percentage of incorrect values is above 60 - that's even worse than just flipping a coin
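The max-value fix for conflicting indices can be sketched like this (illustrative Python; the name intersect_lat matches the function mentioned above, but the signature here is my guess, not the real perl one):

```python
# Several indices may point into the same latency table; resolve the
# conflict by taking the maximum (i.e. most pessimistic) latency.
def intersect_lat(table, indices):
    vals = [table[i] for i in indices if i in table]
    return max(vals) if vals else None

print(intersect_lat({1: 6, 2: 13}, [1, 2]))  # 13 - max of the conflicting values
print(intersect_lat({1: 6, 2: 13}, [7]))     # None - no latency found
```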
Some possible reasons for failure:
- I am just too stupid
- Highly likely: bad/incomplete reversing of the latency tables format. "Talk is cheap. Show me the code." (c) Linus Torvalds. The format of those files is partially described, so just write your own parser and check
- Latency tables are outdated. While processing the _2.txt files my parser complains that it can't find lots of the mentioned instructions - they really are not present in the corresponding .MD file
- ptxas doesn't use the latency tables from nvdisasm at all. I personally believe this is the main reason - the latency tables inside ptxas are just different
Anyway, we still can implement

Instruction reordering
; stall 4 total 125 cword 7E4 B--:R-:W-:Y:S4
/*F10*/ ISETP.NE.AND P2,PT,R10,RZ,PT ?WAIT4_END_GROUP
; stall 13 total 129 cword 7ED B--:R-:W-:Y:Sd
/*F20*/ ISETP.GE.U32.AND P1,PT,R24, 0x3,PT ?WAIT13_END_GROUP

We have here a pair with stall counts 4 & 13, total 17. What if we swap these instructions and decrease the second stall count to 13 - 4 = 9? By the end of execution the second instruction will still have waited 9 + 4 = the same 13 cycles, but the overall stall count will be only 9 + 4 = 13. So potentially we can get a speedup of 1 - 13 / 17 ≈ 23%. Yeah, sounds cool
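The arithmetic above, worked through in a tiny Python snippet (plain illustration of the estimate, not dg.pl code):

```python
# stall_a, stall_b are the stall counts of two adjacent independent
# instructions (4 and 13 in the example above)
def swap_totals(stall_a, stall_b):
    before = stall_a + stall_b             # 4 + 13 = 17
    # after swapping, the former second instruction issues first with
    # stall_b - stall_a; the former first keeps stall_a, so anything
    # depending on the second instruction still waits stall_b cycles total
    after = (stall_b - stall_a) + stall_a  # 9 + 4 = 13
    return before, after, 1 - after / before

print(swap_totals(4, 13))  # (17, 13, 0.235...) => ~23% potential speedup
```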
So I added yet another pass (-s option) to my dg.pl. I also added more restrictions for possible candidates (see function can_swap):
- they must not be so-called "dual issued" pairs
- they must not change the execution path, like CALL/JUMP
- I excluded instructions having a RELA fixup - just because I don't know how to extract/patch the corresponding instruction field
- and finally both instructions must have the same conditions (or none at all), like
@P1 instr1
@P1 instr2
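An illustrative Python rendering of those restrictions (the real can_swap lives in dg.pl and is perl; the field names below are my assumptions):

```python
from dataclasses import dataclass
from typing import Optional

# opcodes that change the execution path (incomplete, for illustration)
CONTROL_FLOW = {"CALL", "JMP", "BRA", "RET", "EXIT"}

@dataclass
class Ins:
    opcode: str
    pred: Optional[str] = None  # e.g. "@P1", None for unpredicated
    dual_issued: bool = False
    has_rela: bool = False      # carries a RELA fixup

def can_swap(a: Ins, b: Ins) -> bool:
    if a.dual_issued or b.dual_issued:            # no "dual issued" pairs
        return False
    if a.opcode in CONTROL_FLOW or b.opcode in CONTROL_FLOW:
        return False                              # don't change execution path
    if a.has_rela or b.has_rela:                  # can't patch RELA'd fields
        return False
    return a.pred == b.pred                       # same condition, or both None

print(can_swap(Ins("ISETP", "@P1"), Ins("IADD3", "@P1")))  # True
print(can_swap(Ins("ISETP"), Ins("BRA")))                  # False
```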