It's funny how paranoid NVIDIA is, trying to hide as much info from its customers as it can. One example is the so-called "special registers": even in PTX you can read only a limited set of them
So I played a bit with some undocumented SRs, namely SR_MACHINE_ID_XX & SR_REGALLOC. I wrote code that loads legal special registers and then patched the resulting SASS instructions with my Ced
Let's see how that code looks in
PTX
mov.u32 %r2, %tid.x;
st.global.u32 [%rd2], %r2;
mov.u32 %r3, %tid.y;
st.global.u32 [%rd2+4], %r3;
mov.u32 %r4, %tid.z;
st.global.u32 [%rd2+8], %r4;
// inline asm
mov.u32 %r1, %smid;
// inline asm
st.global.u32 [%rd2+12], %r1;
mov.u32 %r5, 21;
st.global.u32 [%rd2+16], %r5;
As you can see, nothing special: the code just loads some values into %r1-%r5 and stores them at the address held in %rd2, which is the function argument (obtained with cvta.to.global.u64)
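For reference, CUDA source along the following lines would plausibly compile to PTX like the listing above. This is only a sketch, not the original source: the kernel name dump_sregs, the host helper dump_once and the out[] layout are assumptions made for illustration. Only the stored values (tid.x/y/z, %smid via inline asm, and the constant 21) are taken from the PTX.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: name and parameter are assumptions, the stored
// values mirror the PTX listing.
__global__ void dump_sregs(unsigned int *out)
{
    unsigned int smid;
    // %smid has no CUDA built-in variable, so read it through inline PTX.
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));

    out[0] = threadIdx.x;  // st.global.u32 [%rd2],    %tid.x
    out[1] = threadIdx.y;  // st.global.u32 [%rd2+4],  %tid.y
    out[2] = threadIdx.z;  // st.global.u32 [%rd2+8],  %tid.z
    out[3] = smid;         // st.global.u32 [%rd2+12], %smid
    out[4] = 21;           // st.global.u32 [%rd2+16], constant 21
}

// Hypothetical host helper: launches a single thread and copies the five
// 32-bit words back. Returns false on any CUDA runtime error.
bool dump_once(unsigned int host_out[5])
{
    const size_t bytes = 5 * sizeof(unsigned int);
    unsigned int *dev_out = nullptr;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess)
        return false;

    dump_sregs<<<1, 1>>>(dev_out);
    cudaError_t err = cudaGetLastError();  // catches launch failures
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    return err == cudaSuccess;
}
```

Calling dump_once from main with a 5-element array and printing the words is enough to check the loads. With a single-thread launch the three tid values are 0, and the %smid value depends on where the block gets scheduled.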