windows deep internals: June 2022

Thursday, June 30, 2022

size of ebpf jit code on different processors

It doesn't make much sense, but since I now have several JIT compilers, why not compare the size of the jitted code on different processors?

I chose three eBPF programs:

a simple BPF_PROG_TYPE_CGROUP_SKB with only a comparison, 8 opcodes
a BPF_PROG_TYPE_RAW_TRACEPOINT with 3 maps, 68 opcodes
a fairly complex BPF_PROG_TYPE_RAW_TRACEPOINT with 6 maps, 1824 opcodes

results

processor	1st	2nd	3rd
x64	54	312	8195
arm64	99	567	12959
powerpc	78	546	11462
risc-v	102	470	9494
s390	78	534	12622
sparc	79	482	10446
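
To make the table easier to compare, the numbers can be normalized against x64 and against the eBPF opcode count. A minimal Python sketch: the dictionary layout and function names are mine, and the values are copied from the table in whatever unit it reports.

```python
# Opcode counts of the three test programs and the jitted sizes from the table.
OPCODES = (8, 68, 1824)
JIT_SIZE = {
    "x64":     (54, 312, 8195),
    "arm64":   (99, 567, 12959),
    "powerpc": (78, 546, 11462),
    "risc-v":  (102, 470, 9494),
    "s390":    (78, 534, 12622),
    "sparc":   (79, 482, 10446),
}

def relative_to(base, sizes=JIT_SIZE):
    """Return, for every processor, its sizes divided by those of `base`."""
    ref = sizes[base]
    return {cpu: tuple(round(s / r, 2) for s, r in zip(vals, ref))
            for cpu, vals in sizes.items()}

def per_opcode(cpu, sizes=JIT_SIZE):
    """Return the jitted size per eBPF opcode for each of the three programs."""
    return tuple(round(s / n, 2) for s, n in zip(sizes[cpu], OPCODES))

if __name__ == "__main__":
    for cpu, ratios in relative_to("x64").items():
        print(f"{cpu:8} vs x64: {ratios}  per opcode: {per_opcode(cpu)}")
```

Normalizing per opcode shows how much of the small program's size is fixed prologue and epilogue cost rather than translated opcodes.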

Wednesday, June 29, 2022

verification of jitted ebpf code

There are some projects that run eBPF in usermode, but for verification purposes you need the same code that is used in the kernel. So I ripped out the JIT code for several architectures to run it in usermode:

x64
powerpc
risc-v
s390
sparc
sunway sw64

Now we can verify jitted code: we take the actual code the kernel generated for some eBPF program, run the JIT for the same eBPF opcodes in usermode, and finally compare the two.
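
The final comparison step could look roughly like the sketch below. It is not the code from my tools. The function name and the `ignore` parameter are hypothetical: `ignore` is my assumption for byte ranges that are expected to differ between the two runs, such as embedded absolute addresses.

```python
def first_mismatch(kernel_code, user_code, ignore=()):
    """Return the offset of the first differing byte, or None if the buffers match.

    ignore: iterable of half-open (start, end) offset ranges to skip.
    """
    skipped = set()
    for start, end in ignore:
        skipped.update(range(start, end))
    common = min(len(kernel_code), len(user_code))
    for off in range(common):
        if off not in skipped and kernel_code[off] != user_code[off]:
            return off
    # Same prefix: the buffers match only if they also have the same length.
    return None if len(kernel_code) == len(user_code) else common

if __name__ == "__main__":
    dumped = bytes.fromhex("55 48 89 e5")      # made-up bytes, for illustration only
    rejitted = bytes.fromhex("55 48 89 e5")
    print(first_mismatch(dumped, rejitted))    # None means the buffers are identical
```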

Saturday, June 25, 2022

pmu events

Some details

PMUs are stored in the pmu_idr tree, which is synchronized with the pmus_lock mutex. As usual, they can be used to blind eBPF. How? Let's see:

Generally speaking, there are usually four steps involved in attaching an eBPF program to a perf event:
1. Open the perf event
2. Load the eBPF program
3. Set the eBPF program on the perf event
4. Enable the perf event

We are interested in step 4: enabling the perf event involves calling the pmu->event_init and pmu->add methods. Worse, all pmu structures are located in the .data section and are therefore writable. So today I added some code to dump them.


Monday, June 20, 2022

ebpf opcodes patching

Today I wrote a disassembler for eBPF opcodes. Let's see what they look like:
85 00 00 00 C0 10 02 00 call 0x210C0

In the jitted code this becomes call 0xffffffffb4c14110. 0xffffffffb4c14110 - 0x210C0 = 0xFFFFFFFFB4BF3050, which is the address of __bpf_call_base. Suppose we have some paranoid code in kernel mode that doesn't want to be traced by all this eBPF black magic. What can we do on a machine without JIT?
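
The decoding and the address arithmetic above can be checked with a few lines of Python. This is a sketch assuming the standard little-endian eBPF instruction layout (8-bit opcode, 4-bit dst and src registers, 16-bit offset, 32-bit immediate). The function names are mine.

```python
import struct

BPF_CALL_OPCODE = 0x85  # BPF_JMP | BPF_CALL

def decode_insn(raw):
    """Decode one 8-byte eBPF instruction into (opcode, dst, src, off, imm)."""
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    return opcode, regs & 0x0F, regs >> 4, off, imm

def call_base(jitted_target, imm):
    """Recover the base address that a call immediate is relative to."""
    return (jitted_target - imm) & 0xFFFFFFFFFFFFFFFF

if __name__ == "__main__":
    insn = bytes.fromhex("85 00 00 00 C0 10 02 00")
    opcode, dst, src, off, imm = decode_insn(insn)
    print(hex(opcode), hex(imm))                       # opcode and call immediate
    print(hex(call_base(0xFFFFFFFFB4C14110, imm)))     # address of __bpf_call_base
```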

First, we could just patch the first opcode to

95 00 00 00 00 00 00 00 ret

Second, we could find some empty native function in the kernel (or even reuse __bpf_call_base) and patch the address of, say, htab_map_update_elem to point to it. Can any Linux ~~ebpf-based~~ EDR detect this?

Wednesday, June 15, 2022

ebpf maps

As you can see from the function bpf_map_alloc_id, all bpf maps are stored in the map_idr IDR and protected by the map_idr_lock spinlock. No surprise that you can't view them from user mode: there is the bpf command BPF_MAP_GET_NEXT_ID, but it can only enumerate the IDs of maps. So today I added some code to view bpf maps: lkmem -c -d -B gives output like

bpf_maps at 0xffffffff929c1880: 15

[0] id 3 UDPrecvAge at 0xffff99e344f48000

type: 1 BPF_MAP_TYPE_HASH

key_size 8 value_size 8

[1] id 4 UDPsendAge at 0xffff99e344cb4c00

type: 1 BPF_MAP_TYPE_HASH

key_size 38 value_size 8

The disassembly of jitted eBPF code also started to look better:
mov rdi, 0xffff99e344f48000 ; UDPrecvAge

call 0xffffffff90c191f0 ; __htab_map_lookup_elem

This letter explains that the JIT replaces the sequence of opcodes

bpf_mov r1, const_internal_map_id
bpf_call bpf_map_lookup

with a direct load of the 64-bit map address (the BPF_LD_IMM64 pseudo op). But this code is not optimal: every such instruction occupies 10 bytes. Consider instead employing a constants pool and putting all map addresses somewhere after the function. This certainly requires at least 8 bytes per address, plus perhaps some space for alignment. But now we can produce code like:
mov rdi, qword [map1_addr wrt rip] ; 7 bytes

call __htab_map_lookup_elem

...

; somewhere after function

map1_addr: resq 1 ; jit should put real address of map here

If a function has 3 or more references to the same map, the jitted code becomes smaller: three 10-byte loads take 30 bytes, while three 7-byte loads plus one 8-byte pool entry take 29.
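
The break-even point can be worked out with a small Python sketch. The instruction sizes come from the text above. The `padding` parameter is my assumption for any alignment bytes in front of the pool entry.

```python
MOVABS_SIZE = 10     # mov rdi, imm64
RIP_LOAD_SIZE = 7    # mov rdi, qword [rip + disp32]
POOL_ENTRY_SIZE = 8  # one 64-bit map address in the constants pool

def inline_size(refs):
    """Bytes spent on map address loads when every load embeds the address."""
    return refs * MOVABS_SIZE

def pooled_size(refs, padding=0):
    """Bytes spent when all loads of one map share a single pool entry."""
    return refs * RIP_LOAD_SIZE + POOL_ENTRY_SIZE + padding

def break_even(padding=0):
    """Smallest number of references at which the pool variant is smaller."""
    refs = 1
    while pooled_size(refs, padding) >= inline_size(refs):
        refs += 1
    return refs

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, inline_size(n), pooled_size(n))
    print("break-even:", break_even())
```

With worst-case alignment padding of 7 bytes the break-even point moves to 6 references, so the gain depends on how the pool is laid out.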

Tuesday, June 7, 2022

position independent sw64 code

Let's see what PIC looks like on sw64, using a function from libLLVM-7.so.1 (a huge shared library, 45 MB) as an example:

1000ED0 ldih GP, PV, 0x1D3

PV almost always contains the address of the called function, so the value of GP is now 2D30ED0
1000ED4 ldi GP, GP, -0x1290

GP is now 2D30ED0 - 1290 = 2D2FC40. I expected this base address to always be located inside .got, but that is not true: it can lie anywhere, sometimes even outside the ELF module! All remaining references use this base address in the GP register:

1000ED8 ldih PV, GP, 0

1000EDC ldl PV, PV, -0x4EC0

...

1000F14 call RA, PV, 0

1000F18 ldih GP, RA, 0x1D3 ; 2D30F18

1000F20 ldi GP, GP, -0x12D8 ; 2D2FC40

Wait, WHAT? They use the return address in RA to fill GP with the same value, 2D2FC40. Even worse, they restore GP even in the epilogue, where it is not used.
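
Both GP computations can be replayed in Python. The semantics of ldih (add the immediate shifted left by 16) and ldi (add the signed immediate) are inferred here from the values in the listing and from the Alpha ldah/lda pair, so treat them as an assumption rather than documented sw64 behavior.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def ldih(base, imm):
    """Assumed semantics: base + (imm << 16)."""
    return (base + (imm << 16)) & MASK64

def ldi(base, imm):
    """Assumed semantics: base + signed imm."""
    return (base + imm) & MASK64

def gp_from(addr, hi, lo):
    """GP as computed by an ldih/ldi pair from a code address (PV or RA)."""
    return ldi(ldih(addr, hi), lo)

if __name__ == "__main__":
    # Function entry: GP derived from PV = 1000ED0
    print(hex(gp_from(0x1000ED0, 0x1D3, -0x1290)))
    # After the call: GP derived from RA = 1000F18
    print(hex(gp_from(0x1000F18, 0x1D3, -0x12D8)))
```

Both pairs produce the same base address. Only the low immediate changes, to compensate for the different code address.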

Let's estimate the size overhead. libLLVM-7.so.1 has 41337 functions and 8432116 instructions, of which 781997 set the value of GP. The ratio is 781997 / 8432116 = 0.092740

Let's assume that each function needs to set up GP anyway, so the required number of instructions is 41337 * 2 = 82674. The remainder is 781997 - 82674 = 699323

Remove the unneeded GP setups from epilogues: 699323 - 82674 = 616649

This amount can easily be halved: store the calculated GP value on the stack once with stl gp, sp, offset (+41337 instructions) and reload it when needed with ldl gp, sp, offset, a single instruction instead of the ldih/ldi pair

So the actual number of instructions could be 616649 / 2 + 41337 + 82674 = 432336

New ratio: 432336 / 8432116 = 0.05127

The overhead is 0.092740 - 0.05127 = 0.041, i.e. 4.1%

Cool: almost 2 MB of code is simply unnecessary
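
The same estimate as a Python sketch, so the individual steps can be rechecked. The counts are the ones quoted above and the variable names are mine. The halving is rounded up to match the figure in the text.

```python
FUNCTIONS = 41337
INSTRUCTIONS = 8432116
GP_SETUP = 781997        # instructions that set the value of GP

def estimate():
    prologue = FUNCTIONS * 2              # one ldih/ldi pair per function
    remaining = GP_SETUP - prologue       # setups outside the prologue
    no_epilogue = remaining - prologue    # drop the pair from each epilogue
    halved = (no_epilogue + 1) // 2       # one ldl instead of two instructions
    spills = FUNCTIONS                    # one stl per function to save GP
    new_total = halved + spills + prologue
    old_rate = GP_SETUP / INSTRUCTIONS
    new_rate = new_total / INSTRUCTIONS
    return new_total, old_rate, new_rate

if __name__ == "__main__":
    new_total, old_rate, new_rate = estimate()
    print(new_total, round(old_rate, 5), round(new_rate, 5))
    print(f"overhead: {100 * (old_rate - new_rate):.1f}%")
```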

Saturday, June 4, 2022

reversing of sunway sw64 ISA

It seems that the Chinese are hiding information about yet another of their homemade processors, sw64: try to find any technical details with google, baidu or gitee. At the same time they ported linux to this processor, and you can even find some details in the openEuler project. I think this conspiracy is very funny and at the very least violates the licenses of binutils/clang/gcc etc.

Anyway, let's see whether we can reverse the sw64 ISA with only a linux image and some source code from the linux kernel (spoiler: I also wrote a processor module for ida pro)

registers

Try comparing the registers of sw64 with those of Alpha AXP: can you find any difference? At least we now know that the processor has 32 general-purpose registers and 32 floating-point registers, so the register fields in the instruction encoding must be 5 bits wide

ELF relocs

Relocs can be extracted from arch/sw_64/include/asm/elf.h. So the next thing I wrote was a small ida pro plugin to apply these relocs. Nothing special: it is almost an exact copy of the same plugin for LoongArch

mnemonics

