суббота, 16 ноября 2024 г.

perl module for powerpc disasm

It seems that there are no good open-source disasm for PPC except Capstone. And it even has perl binding - unfortunately it can extract only basic fields like opcode and text but no operands for specific processor. So I made yet another

I would like to make a few critical remarks about the capstone itself

1) it's fat as pig - with default settings we have
ls -l libcapstone.a
-rw-rw-r-- 1 redp redp 93833624 nov  8 16:10 libcapstone.a
Sure this can be fixed with selecting just needed processors but anyway makes a lasting impression

2) it's inconsistent. Just couple of examples
mr r31, r3 actually has instruction id PPC_INSN_OR. to get MR you must check alias ID, and even in this case they have PPC_INS_ALIAS_MR & PPC_INS_ALIAS_MR_

Another example - ld r3, something actually has PPC_REG_X3 instead of PPC_REG_R3 despite the fact that these are the same register. Why clone lots of registers if you produce the same output for them? And why not add size of register? I suspect this happens bcs they used MD from llvm and was too lazy to make some optimizations

3) it's incomplete. For example they don't implemented reg_access for powerpc (as well as for alpha, risc-v, sparc etc)

multiple TOCs

It's critical to track TOC for distinguishing loading some constant vs loading at some address. Lets check following code

; prologue
addis r2, r12, 0x1d8
addi r2, r2, 0x70 ; TOC
...
addis r3, r2, 0x19
ld r3, -0x1ca8(r3) ; kmem_cache

Here r3 adjusting from r2 holding address of TOC to get address of kmem_cache. And this is how loading of constant looks like:
lis r10, 0xa9
ori r10, r10, 0xc00 ; A90C00
And yes - LIS is again alias PPC_INS_ALIAS_LIS, instruction ID is PPC_ADDIS. 

Official doc is very unclear regarding the possibility of having multiple TOCs. Theoretically if linker would put TOC somewhere in middle of .bss/.data/.rdata sections it can cover full 32bit address space. But what if the program has size bigger 4Gb or it was linked with code from several compilers like llvm & gcc? Unfortunately I don't have such samples so cannot say if my disasm will work correctly in case of having several TOCs

results

Anyway, enjoy - powepc disassembler in 120LOC

пятница, 1 ноября 2024 г.

perl module for DWARF debug info parsing

I've made perl binding of my c++ dwarf dumper. Supports 64/32 bit, no DWO, don't sure if it can load arbitrary object files but at least can parse LKM - I even added relocations processing
Maybe one day I'll have the courage to write documentation for it, so for now just several samples of using it:
  • script to dump enums
  • script to find all functions having arguments with pointer or reference to some named structure
  • script to find and dump structures fields
  • script to find all classes with VTBL and dump them

So now I have in perl full set of modules to disasm/analyse elf files:

воскресенье, 20 октября 2024 г.

binding c++ objects to perl

Sure there are lots of ways to do this (as usually in Perl: "there's more than one way to do it"), just to name few:

  • swig - result looks not native and clumsy, also you need to make facade-like interfaces for really complex classes
  • ffi - also has problems with complex classes
  • XS - official and wide-spread way. Btw smoothest and most seamless Win32::OLE module uses it exactly

and besides you can simple call perl functions directly from c++ code (with all bells and whistles "dSP; ENTER; SAVETMPS" etc) like polymake do

So this article is just remaining for myself how to bind small-sized c++ classes to perl using XS. Lets begin with simplest one - just perl wrapper around Interval-Tree

  • to create new skeleton: h2xs -cfn Module::Name
  • patch Makefile.PL to add XSOPT="-C++" and change C compiler CC="g++", also highly likely we should add -lstdc++ to LIBS
  • don't forget to add all source/header files to MANIFEST

Code is simple, I prefer to make my own MAGIC instead of implementation DESTROY method - you will see further why

четверг, 3 октября 2024 г.

TLS in gcc RTL

Lets check how TLS looks like in RTL. I wrote simple test:
(insn 29 8 10 3 (set (reg:SI 0 ax [orig:82 _1 ] [82])
        (mem/c:SI (const:DI (unspec:DI [
                        (symbol_ref:DI ("tls_state") [flags 0x2a]  <var_decl 0x7f61bce052d0 tls_state>)
                    ] UNSPEC_NTPOFF)) [2 tls_state.idx+0 S4 A32 AS1])) "stest.cc":15:37 81 {*movsi_internal}
     (nil))

We can see attribute UNSPEC here - it can be extracted as XINT (in_rtx, 1) for GET_CODE == UNSPEC. The first problem here is that values UNSPEC_XXX are machine-specific. For example the same code while cross-compiling for mips:
insn 60 13 61 (set (reg:SI 3 $3)
        (unspec:SI [
                (const_int 0 [0])
            ] UNSPEC_TLS_GET_TP)) "stest.cc":15:37 706 {*tls_get_tp_si_split}
     (nil))
(insn 61 60 15 (set (reg:SI 2 $2 [201])
        (reg:SI 3 $3)) "stest.cc":15:37 313 {*movsi_internal}
     (nil))
(insn 15 61 16 (set (reg:SI 3 $3 [202])
        (high:SI (const:SI (unspec:SI [
                        (symbol_ref:SI ("tls_state") [flags 0x2a]  <var_decl 0x7ff96a9dfcf0 tls_state>)
                    ] 306)))) "stest.cc":15:37 313 {*movsi_internal}
     (nil))

Second - NTPOFF in comments marked as belonging to "Relocation specifiers" while 

;; TLS support
  UNSPEC_TP
  UNSPEC_TLS_GD
  UNSPEC_TLS_LD_BASE
  UNSPEC_TLSDESC
  UNSPEC_TLS_IE_SUN

Thirdly - lets recompile the same file with -fPIC option:
(call_insn/u 9 8 11 3 (parallel [
            (set (reg:DI 0 ax)
                (call:DI (mem:QI (symbol_ref:DI ("__tls_get_addr")) [0  S1 A8])
                    (const_int 0 [0])))
            (unspec:DI [
                    (symbol_ref:DI ("tls_state") [flags 0x10]  <var_decl 0x7fcea13c92d0 tls_state>)
                    (reg/f:DI 7 sp)
                ] UNSPEC_TLS_GD)
        ]) "stest.cc":15:37 1048 {*tls_global_dynamic_64_di}
     (expr_list:REG_EH_REGION (const_int -2147483648 [0xffffffff80000000])
        (nil))
    (nil))
(insn 11 9 46 3 (set (reg:SI 0 ax [orig:82 _1 ] [82])
        (mem/c:SI (reg/f:DI 0 ax [88]) [2 tls_state.idx+0 S4 A32])) "stest.cc":15:37 81 {*movsi_internal}
     (nil))

You can notice that address of TLS var gathered with __tls_get_addr but next access to it not marked in any way - just regular chain set mem component_ref. Disgusting

Just note for myself what else cannot be extracted from gcc RTL:

  • pointer to members -
     despite the fact that in file cp/cp-tree.def there are OFFSET_REF & PTRMEM_CST in real RTL they are just integer constants
  • pointer to member methods - similalry
  • TLS are indistinguishable from regular global vars
  • to be continued

понедельник, 30 сентября 2024 г.

gcc plugin to collect cross-references, part 9

Lets extract some useful results from my gcc plugin for collecting cross-references: 1, 2, 3, 4, 5, 6, 7 & 8

I've noticed that plugin worked unbearably slowly on big source files (like compiling itself) - something above 30 minutes, so I add profiling to it (see details here). Profiler showed that consumed user-time was only 18 seconds - this is glare sign that plugin spending most of time on some kind of lock. After some meditation (and strace/ltrace) I finally found that root of problem was in sqlite - it sleeping in fdatasync on each data writing to its DB! The cure is to execute
PRAGMA synchronous=OFF
this gives a non-illusory x100 speed up. Can I now consider myself as one of mythical "x100 programmer", he-he?

The next problem is how to store data from multiple gcc processes into single sqlite DB (like parallel make -j X). I've decided to use Apache Thrift - old and unfashionable but it really works and you can make multithreaded RPC server in 500 lines of c++ code. Server just listen on some port and stores all data from RPC clients into sqlite DB. Command line arguments path2db port_number .To shutdown RPC server use rpcq port_number. Of course you can make your own implementation of RPC server, for example with PHP & DB/2 (unfortunately Thrift does not supports Cobol)

As a consequence now plugin has 3 implementations for data storing:

  1. in plain text files, self-test takes 14s
  2. into sqlite, self-test takes 18s
  3. in sqlite via RPC server listened on some arbitrary TCP port. Self-test takes 23s. Connection string looks like -fplugin-arg-gptest-db=localhost:port_number 

Btw sqlite version has size 2.67Mb, Thrift rpc client - 4.65Mb. Something is fundamentally wrong with "modern frameworks"

DB schema 

is very simple - in essence it consists of two tables:

symtab for storing symbols

  • id - primary key
  • mangled name of symbol/function or value of literal
  • fname - for functions this is name of file where they were declared
  • bcount - amount of basic blocks of function, can be used as some complexity metric

xrefs to link symbols and functions from which they are referred

  • id of function
  • bb - index of basic block
  • arg - if non-zero - index of function's argument
  • what - id of referred symbol
  • kind - one letter type of xref:
    • 'c' - function call
    • 'v' - call of virtual method
    • 'l' - literal
    • 'r' - just reference to some global symbol
    • 'f' - field of some struct/class/union
    • 'F' - constant (with -fplugin-arg-gptest-ic option)

Fetching results

I wrote perl script to search and dump xrefs. If I want to find functions calling virtual methods of class FPersistence:

пятница, 20 сентября 2024 г.

Tracking arguments of functions in gcc

I continuing to improve my gcc plugin for collecting cross-references: 1, 2, 3, 4, 5, 6 & 7. On this week I decided to see if I can extract source of complex types like records and most prominent kind of them is arguments of function - they are easy to identify in asm (but not so easy to bind them in gcc RTL expressions)

Having function declaration fdecl we can extract arguments with something like:

for (tree arg = DECL_ARGUMENTS (fdecl); arg; arg = DECL_CHAIN (arg))
  {
    auto a = DECL_RTL_IF_SET (arg);
    if ( a && REG_EXPR(a) ) { // do something with argument in
REG_EXPR(a)

 
I need only arguments that are a pointer or reference to record/union so I filtered them in method check_arg

Those was easiest part, and now try to find how arguments can be tracked in RTL. 

воскресенье, 15 сентября 2024 г.

bug in gcc?

It seems that gcc not always put COMPONENT_REF when access fields of structures passed by reference. For example I add today simple static function append_name(aux_type_clutch &clutch, const char *name)

It has reference to field txt of structure aux_type_clutch but RTL looks like:

(insn 20 19 21 4 (set (reg/f:DI 0 ax [89])
        (mem/f/c:DI (plus:DI (reg/f:DI 6 bp)
                (const_int -8 [0xfffffffffffffff8])) [258 clutch+0 S8 A64])) "gptest.cpp":1364:24 80 {*movdi_internal}
     (nil))
(insn 21 20 22 4 (parallel [
            (set (reg/f:DI 0 ax [orig:83 _2 ] [83])
                (plus:DI (reg/f:DI 0 ax [89])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) "gptest.cpp":1364:24 230 {*adddi_1}
     (expr_list:REG_EQUAL (plus:DI (mem/f/c:DI (plus:DI (reg/f:DI 19 frame)
                    (const_int -8 [0xfffffffffffffff8])) [258 clutch+0 S8 A64])
            (const_int 8 [0x8]))
        (nil)))

First instruction just loads in register RAX parm_decl (of type aux_type_clutch) like

mov     rax, [rbp+clutch]

and second add to RAX just some const 0x8 (offset to field txt):

add     rax, 8

it's impossible from RTL to track back this constant to offset in COMPONENT_REF

What is more even strange - for methods you can track fields access for parameters passed by reference (like this) - for example in constructor of the same aux_type_clutch:

(insn 12 11 13 2 (set (mem:SI (plus:DI (reg/f:DI 0 ax [94])
                (const_int 40 [0x28])) [4 this_12(D)->level+0 S4 A64])
        (const_int 0 [0])) "gptest.cpp":465:4 81 {*movsi_internal}
     (nil))

четверг, 5 сентября 2024 г.

hidden executable pages in linux kernel, part 2

In part 1 I've described how memory managed by hardware. Now lets dig into how kernel sees memory. Not surprisingly that we should check the same structures that malicious drivers update while hiding

Modules

List of module structures with head in modules and lock modules_mutex. It has projection on file /proc/modules but sizes in those file are sloppy - function module_total_size calculates total size of driver (including discarded sections!). So we should use only some selected fields:
  • on kernel >= 6.4 mem[MOD_TEXT].base & mem[MOD_TEXT].size
  • on kernel < 4.5 module_core & core_text_size
  • otherwise core_layout.base & core_layout.text_size

vmap_area_list

It has projection on file /proc/vmallocinfo and requires root access. Sure sophisticated rootkits can intercept it but that's ok since we use it for cross-scan only

 

False positives

As you can guess not every executable page belongs to some driver - there are couple exceptions

среда, 28 августа 2024 г.

hidden executable pages in linux kernel

Standard method to find rootkits like this (or like this) is cross-scanning PTEs of memory without NX bit, then extract pages belonging to LKMs - thus in set difference we will gather hidden executable memory. Lets check how we can scan PTEs under linux

disclaimer

this article is not digest of Intel or linux documentation - I'll just describe how you can traverse page tables from LKM. Also code

Lets start with some simple things:

cat /boot/config-$(uname -r) | grep -E 'X86_5LEVEL|PGTABLE_LEVELS'
CONFIG_PGTABLE_LEVELS=5
CONFIG_X86_5LEVEL=y

So my kernel has 5 levels and this exactly correspond to hardware:

  1. pte_t - PTE, 9 bits of page address (size of page is 4096 bytes - low 12 bits)
  2. pmd_t - PDE. another 9 bits of address
  3. pud_t - PDPTE, 9 bits
  4. p4d_t - PML4, 9 bits
  5. pgd_t - PML5, 9 bits

Total 12 + 5 * 9 = 57bits

It's really amazing how memory management is implemented differently in different operating systems running on the same hardware. For example in Windows all PTE are located in huge contiguous sparse array and you can get address of PTE for some address of memory with very simple function MiGetPteAddress. Let MiGetPteAddress(addr1) is addr2. We then can continue this process for all paging levels - get PteAddress(addr2) and so on - to find if all 5 parts of address is valid. And this can be used in reverse direction - skip scan of huge PTEs areas if they are not presented in memory

Unfortunately in linux PTE not stored in one huge contiguous memory. So we need to start with top-level (from PGD) and scan all tables on lower levels. Root of pgd_t stored in init_mm->pgd. As usually var init_mm is not exported

<sarcasm>Linux widely known for the consistency, completeness and backward-compatibility of its API and being developers-friendly in general</sarcasm>

Next we need way to find valid pXX_t. Seems that there are functions pXX_present, pXX_bad and so on. The right sequence of calls is

  • pXX_none
  • pXX_leaf - this is damn good name for functions to check for large pages
  • pXX_bad
  • and finally pXX_offset to get item for next level

Unfortunately there are also so called hugeTLB pages (enabled with CONFIG_HUGETLB_PAGE):

grep 'HUGETLB_PAGE' /boot/config-$(uname -r)
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_FREE_VMEMMAP=y

As you may expect functions pmd_huge & pud_huge non-exported too (and p4d_huge & pgd_huge are just dumb macros)

Finally we need to check if some page is executable. This is very hardware specific - for example

  • Arc has flag _PAGE_EXECUTE
  • aarch64 has flag _PAGE_KERNEL_EXEC
  • powerpc has _PAGE_EXEC
  • s390 has _PAGE_NOEXEC

so for some arch there is function pte_exec, while for another pte_no_exec. Also it's curious that there are no analogs for pud/pmd etc - so actually I have zero ideas how check executability for large & huge pages

However, this is not the end of suffering. Quick check:

grep address /proc/cpuinfo
address sizes    : 39 bits physical, 48 bits virtual
shows that they lie - my hardware actually supports only 48bit addresses, so kernel should have only 4 levels of paging. Try to guess how they swept the trash under the carpet?

вторник, 20 августа 2024 г.

bpf_verifier_ops

Lets dissect some typical ebpf spyware. It sets up uprobes on

  • SSL_read_ex
  • SSL_read
  • SSL_write_ex
  • SSL_write
  • gotls_write_register
  • gotls_read_register
  • gotls_exit_read_register

and uses bpf functions probe_read_user & probe_read_user_str to steal data and map_update_elem & ringbuf_submit to store data in bpf maps

How we can mitigate this?

Official way is to use LSM - function __sys_bpf calls security_bpf so we could register with security_add_hooks LSM hook with index bpf. This effectively prevents loading of ebpf program and sometimes is not what we want - for example in case of honeypots there is high chance that usermode program just will exits after failed ebpf program loading and you can't monitor which connections it will try to establish

Another way - is to patch bpf_func_proto for selected functions, like I did. However this is brutal method and affects all ebpf programs (I still believe that some is not spyware, he-he)

Luckily there is way to blind only some types of ebpf programs - method get_func_proto in bpf_verifier_ops. I made PoC to blind aforementioned 4 functions for BPF_PROG_TYPE_TRACING & BPF_PROG_TYPE_KPROBE only

Now we have another problem - how to check integrity of bpf_verifier_ops? I've also add this check in my lkcd. Example of output when PoC ublind is loaded looks like:

[24] type BPF_PROG_TYPE_TRACING at 0xffffffffc1357720 - ublind!s_trace_patched
  get_func_proto: 0xffffffffc13551e0 - ublind!my_func_proto
  is_valid_access: 0xffffffffaee24e20 - kernel!tracing_prog_is_valid_access

четверг, 25 июля 2024 г.

crowdstrike

since everyone (including "professional C++ programmers") has already written about so I will also write it down (for memory)

This was not first time: proof

Official report is very vague but we can conclude that root cause was bug in config parser leading to dereference of bad (it's unclear if it contained zero or just was uninitialized) pointer, as I assumed

Well, this is not big deal - I made thousands bsod/kernel panics. There usually long path from developers machine to production, right?

Then I read second report - and this is pure madness. They DIDN'T TESTED IT AT ALL!!! Proof is very simple - lets assume we have bug with probability of bsod P. Then for N test machines probability to not catch this bug is (1-P) ^ N. for P = 0.5 (bsod happens or not, he-he) and 8 test machines this probability 0.4%. Given that in this incident probability of bsod was close to 1 - size of their test cluster is exactly 0

Instead they used some kind of spell-checkar Content Validator. I'm almost sure that they were bitten by adherents of totalitarian eBPF sect with their “20,000 lines of code” verifier (which exists solely to drink blood and ruin lives also unable to prevents cases like this)

Well, this is already serious problems but still not disaster - like you can deploy only for very small percent of users, employ fuse to prevent spreading of malicious update (for example by tracking heartbeats from updated machines) and so on. Right? ...

So right recipe for catastrophe from group of rebels:

  • never test anything - static validator is enough
  • deploy in Friday
  • deploy to all users at the same time
     

вторник, 23 июля 2024 г.

moving string literals to discardable sections

As kotest shows we can achieve reduction in the size of non-discardable sections of LKMs (and thus reduce total size of memory occupied by kernel) with moving constants used in functions from .init.text to discardable sections like .init.rodata. As usually we can do it manually or invent some dirty hacks to automate this boring process. And the first thought that comes to mind is to employ

gcc

Lets assume that we have some string literal referred from set of functions F1 ... FN. We can safely move it to discardable section only if all this functions itself located in discardable sections. This leads to two consequences
  1. in case of gcc plugin we must operate on RTL bcs some functions can be inlined (or even fully eliminated) - you just cannot do it in GIMPLE
  2. we need some tracking of cross-references between functions and literals

Unfortunately gcc does not have such tracking - all string literals stored in string pool. To make live even worse sometimes string pool belongs to function. From this comment

Only a few targets need per-function constant pools. Most can use one per-file pool

we can conclude that actually nobody ever understand when per-function constant pools are used. Probably I could patch my gcc plugin to add tracking of string literals, finding candidates for eviction to discardable section and marking them with __section__ attribute. However debugging of RTL gcc-plugins is real nightmare

 

Patching obj files

The next logical step is try to move string literals directly in object files (for example with LIEF). kotest already able to find candidates so the rest is simply to move them, right? Personally I believe that this is possible but there are several problems with this approach:
  1. currently LIEF (or binutils) cannot do it
  2. we need to patch not only relocs but also DWARF debug info and symbol table - bcs after removing of some string from .rodata addresses of all entries below must be changed

In general patching of binary is not perfect idea, especially if there is good alternative

четверг, 11 июля 2024 г.

ebpf map as communication channel

Recently I've done small research to repurpose overvalued ebpf into something useful and even achieved some modest results. It seems that at least you can use ebpf maps in your old-school native drivers without writing single line of code for ebpf progs. You can ask me - c'mon, there are tons of ways to communicate with driver under linux, just to name few:
  • ioctls
  • read from driver/write to driver + possible employ polling
  • file in kernfs
  • futex in shared memory
  • io_uring
  • netlink like auditd or XFRM do

etc

Let's look at typical situation - your IT crew was bitten by violent adherent of the totalitarian sect of rust ebpf Witnesses and as a consequence you now have hundreds of ebpfs (and no single person who know how this pile of spaghetti code works)

Probably now is much better to integrate new drivers with ebpf, right? Considering that ebpf progs have very serious limitations (like you can't read content of linked lists like raw_notifier_head with right locking) it would be very nice to produce output of your EDR drivers directly to ebpf maps

So I made simple POC to show how you can do it. Source of driver & userland test program

Lets dive into gory details and see what other non-obvious advantages can be obtained from such heretical crossbreeding

понедельник, 1 июля 2024 г.

dumping ebpf kind_ops

Very funny article about (im)possible future of ebpf. Given that right now 8 small BPF scripts with only 7 opcodes occupy 1Mb whole kernel on ebpf will require exabytes of RAM, he-he
Anyway there is another case of info hiding in linux - mentioned in article struct_ops has type btf_kind_operations and cannot be found in include (as usually). So I add today dumping of it in my lkcd. Sample of output:

четверг, 20 июня 2024 г.

frame sizes in dwarfdump

Add today dumping of stack frame sizes to my dwarfdump (well, where they are exists). Format of .debug_frame section obviously was invented by martian misantrophes so patch is huge and ugly

Sample of output for some random function from mips kernel:

// Addr 0x183D27C .text
// Frame Size 18
// FileName: drivers/char/random.c
// LocalVars:
//  LVar0, tag 6965A71
//    int ret
//  LVar1, tag 6965A8F
//    bool branch
ssize_t random_read_iter(struct kiocb* kiocb,struct iov_iter* iter);

and the same in JSON format: 

"110516780":{"type":"function","file":"drivers/char/random.c","type_id":"110427157","name":"random_read_iter","addr":"25416316","section":".text","frame_size":"24","params":[{"name":"kiocb","id":"110516807","type_id":"110466611"},{"name":"iter","id":"110516828","type_id":"110470510"}],"lvars":["110516849":{"type":"var","file":"drivers/char/random.c","owner":"110516780","type_id":"110426600","name":"ret"},
"110516879":{"type":"var","file":"drivers/char/random.c","owner":"110516780","type_id":"110427073","name":"branch"}]}

Unfortunately dwarfdump can't work with kernel modules bcs they are actually just object files and for this sad reason they have relocations even for debug sections. So to properly deal with this files I need to apply relocations first and this is arch-specific action (which I prefer to avoid)

Binutils has 2 solution of this problem:

  1. objcopy calls bfd_simple_get_relocated_section_contents from libbfd.so and this means that tool should have dependency from it
  2. readelf has it's own relocation code in apply_relocations and this is huge pile of code

And I really don’t like both of the above approaches

вторник, 18 июня 2024 г.

function stack size in GCC

Let's continue our wandering in endless dead end
GCC has struct stack_usage and even field su inside struct function. And this last is accessible as global var cfun. So it's should be easy to patch dwarf2out.cc yet one more time for example to extract stack size (like function output_stack_usage_1 do) and put it inside DW_AT_frame_base block, right?

NO

As you can see function allocate_stack_usage_info called only when
In other cases field function->su is zero
There should probably be heartbreaking conclusion about quality of opensource in general and gcc and in particular...

суббота, 15 июня 2024 г.

stack frames size in DWARF

As you might suspect, the stack size in the kernel is quite meager so it's very important to know how much of stack occupy your driver. So I conducted inhumane experiments on my own driver lkcd to check if stack frame size can be extracted from DWARF debug info. Biggest function in my driver is lkcd_ioctl so lets explore it

mips

Prolog of lkcd_ioctl looks like:
addiu   sp,sp,-688
 
Lets try to find this number in output of objdump -g

воскресенье, 9 июня 2024 г.

kotest fix for mips

kotest refused to count string literals for MIPS kernel modules. Reason was in that gcc does not put sizes/object types of unnamed string literals - it looks in asm files like
$LC0:
        .ascii  "const string %f\012\000"

At the same time it does for named literals:
        .type   fmt_msg, @object
        .size   fmt_msg, 15
fmt_msg:
        .ascii  "enter with %d\012\000"

I am too lazy to investigate which ancient specification from past century it follows. Fortunately this is easy repairable problem - just calculate size of symbol as distance to next one (or till end of section). Since I suspect that this is not the only architecture with similar gcc behavior, I add -f option to do such kind of sizes recalculation

Some results for mips32 kernel 6.0:
find ~/linux60/ -type f -name "*.ko" | xargs ./kotest | awk -f total.awk
1890293
1497092

potential memory savings is almost 1.9Mb from moving string literals used only in .init.text + yet almost 1.5Mb from unloading some unnecessary sections

понедельник, 27 мая 2024 г.

kotest

Linux kernel allows you to have discardable sections in LKM and this creates problem of links between two kind of memory. As you can guess keeping pointer to already unloaded area can be very dangerous so I made simple tool kotest to check such kind of links. It divides sections of ELF file into two category and check all relocations - relocs between areas of the same type considered as ok. To keep track if some symbol from persistent area is used only from discardable sections I also use couple of reference counts

 

command line options

  • -b take into account variables in .bss
  • -h make hexdump of found vars
  • -v verbose mode 
To run on lots of LKMs use something like
find path_to_kernel_root -type f -name "*.ko" | xargs kotest
 
To get summary you can run awk -f total.awk on output of previous command

 

it is reliable to use for analysis only fixups?

No - there are false positives. Consider excerpt from ip_vs.ko, function ip_vs_register_nl_ioctl:
.init.text:0000000000016155   mov     rdi, offset ip_vs_genl_family
.init.text:000000000001615C   mov     cs:ip_vs_genl_family.module, offset __this_module
.init.text:0000000000016167   mov     cs:ip_vs_genl_family.ops, offset ip_vs_genl_ops
.init.text:0000000000016172   mov     cs:ip_vs_genl_family.mcgrps, 0
.init.text:000000000001617D   mov     qword ptr cs:ip_vs_genl_family.n_ops, 10h
.init.text:0000000000016188   call    __genl_register_family

it turns out that ip_vs_genl_ops (located inside .rodata section) referred only from function ip_vs_register_nl_ioctl in .init.text, but actually it cannot be moved to discardable area bcs it was registered with genl_register_family. Kotest cannot analyze usage of addresses and so gives FP:
.rodata + 5A0 (ip_vs_genl_ops) rref 1 xref 0 add size 768
 
Another issue is string merging by ld. Lets assume that we have couple of strings: "foobar" referred from some function(s) in .text section and "bar" referred from code in .init.text. Linker can (and usually do) put only string "foobar" into .rodata and fixup to string "bar" will point to middle of this single string "foobar"
 
So consider output of kotest as estimated upper bound of memory which can be potentially saved by moving into discardable area

 

why not use famous objtool?

Because of NIH syndrome objtool employs disassembler and as consequence it is slow and supports only few architectures. Kotest is based on elfio and can process both 32 & 64 bit ELF files from any arch (and it is very fast)


LKM loading

четверг, 16 мая 2024 г.

linux input handles

Try convince me that input_register_handle is not best place for installing keylogger, it's even strange that they were embarrassed to connect there their holy cow eBPF. Long story short - there are 3 structures in linux kernel for servicing of input devices:

  1. input_dev chained in list (sure non-exported) input_dev_list
  2. input_handler chained in list input_handler_list
  3. input_handle with pointer to input_handler and attached to input_dev (in list h_list)

So keylogger could

  • just call input_register_handle
  • to be more stealthy - patch functions pointers in already registered input_handler (very convenient that sysrq_handler missed out method event)
  • attach own input_handle to desired input_dev but without registering corresponding input_handler - yes, this is perfectly legal
  • patch functions pointers directly in input_dev

Guess in three tries what exactly you can extract from sysfs?
So I add to my lkcd dumping of all above-mentioned structures. Sample of output

среда, 8 мая 2024 г.

asm injection stub

Lets check what this stub should do being injected in some linux process via __malloc_hook/__free_hook (btw this implicitly means than you cannot use this dirty hack for processes linked with musl or uClibc - they just don't have those hooks)
  • bcs our stub can be called from two different hooks we should store somewhere via which entry point we was called
  • restore old hooks values
  • call dlopen/dlsym and then target function (and pass it address of injection stub for delayed munmap. No, you can't free those memory directly in your target function - try to guess why)
  • get right old hook and jump to it if it was installed or just return to code called __malloc_hook somewhere in libc

So I collected all parameters to do job in table dtab consisting from 6 pointers

  1. __malloc_hook address
  2. old value of __malloc_hook
  3. __free_hook address
  4. old value of __free_hook
  5. pointer to dlopen
  6. pointer to dlsym
after those table we also has couple of string constants for injected.so full path and function name. Also bcs we must setup 2 entry point I decided to put 1 byte with distance between first and second (to make injection logic more universal) right after dtab. Sounds easy, so lets check how this logic can be implemented on some still living processors (given that RIP alpha, sparc, hp-pa etc)

четверг, 2 мая 2024 г.

yet another linux process injection

As you my know there are two methods

1) using LD_PRELOAD, don`t work if you want to inject into already running (and perhaps even many days) process

2) ptrace. Has the following inherent disadvantages

  • target process can be ptraced by somebody else
  • victim program can detect ptrace
  • you just want to avoid in logs something like ptrace attach of "./a.out"[PID] was attempted by "XXX"

So I developed very rough analogs of famous VirtualAllocEx/VirtualProtectEx + simple hook to hijack execution onto assembly written shellcode to call dlopen/dlsym. Currently only x86_64 supported bcs I am too lazy to rewrite this asm stub

Prerequisites
You must have root privileges and be able to build and load kernel modules. I tested code on kernel 6.8, 5.15 and probably it also can work on 4.x, not sure about more old versions

Lets start lighting the dirty details in reverse order

суббота, 27 апреля 2024 г.

gcc: placing strings in arbitrary section

As you may know gcc always placing string literals in section .rodata. Let's assume what we want to change this outrageous behavior - for example for shellcode literals used in function init_module (contained in section .init.text)

We can start with dumping of gcc RTL - for something like printf("some string") RTL will be symbol_ref <var_decl addr *.LC1> and in .S file this looks like

.section        .rodata
.LC1:
        .string "some string"

 
That unnamed VAR_DECL has attribute DECL_IN_CONSTANT_POOL. Probably it is possible to make gcc plugin to collect such literals referring from functions inside specific section and instead of DECL_IN_CONSTANT_POOL patch them section attribute. However this requires too many labour so lets try something more lazy

Possible solutions is to explicitly set section via gcc __attribute__:

#define RSection __attribute__ ((__section__ (".init.text")))
#define _RN(name) static const char rn_##name[] RSection =
#define _GN(name) rn_##name
...
_RN(dummy_str) "some string";
printf("%s\n", _GN(dummy_str));

Looks very ugly. And even worse - this raises compilation error:

error: ‘rn_dummy_str__’ causes a section type conflict with ‘init_module’
   11 | #define _RN(name) static const char rn_##name##__[] __attribute__ ((section (".init.text"))) =

How we can fix this problem? My first thought was to write quick and dirty Perl script to scan sources for _RN markers and produce .S file where all strings were placed in right section. But then I decided to overcome my laziness and made patch for gcc - it just checks if passed declaration is initialized with STRING_CST value. Surprisingly, it works!

воскресенье, 31 марта 2024 г.

netfilter hooks

They can be used to run shell when received some magic packet: 1 2 3. As usually there is not tool to show installed netfilter hooks so I added dumping them (and at the same time netfilter loggers) to my lkcd
 
Lets check where this hooks live inside kernel. As starting point we can review source of main function for hooks installing nf_register_net_hooks which leads to nf_hook_entry_head. We can notice that there are lots of locations for hooks:
  1. field nf_hooks_ingress in net_dev (when CONFIG_NETFILTER_INGRESS enabled)
  2. on more new kernels also field nf_hooks_egress in net_dev (when CONFIG_NETFILTER_EGRESS enabled)
  3. lots of fields in struct netns_nf:
    • hooks_ipv4
    • hooks_ipv6
    • hooks_arp (CONFIG_NETFILTER_FAMILY_ARP)
    • hooks_bridge (CONFIG_NETFILTER_FAMILY_BRIDGE)
    • hooks_decnet (CONFIG_NETFILTER_FAMILY_DECNET)
    Also on old kernels (before 4.16) there was one array hooks in netns_nf
 
results
lkmem -c -n ../unpacked/101 /boot/System.map-5.15.0-101-generic
...
2 nf hooks:
   [0] type 02 IPV4 idx 0 0xffffffffa7b84dd0 - kernel!apparmor_ipv4_postroute
   [1] type 10 IPV6 idx 0 0xffffffffa7b84e10 - kernel!apparmor_ipv6_postroute

пятница, 8 марта 2024 г.

Profiling shared libraries on linux

Disclaimer: proposed approach uses dirty hacks & patches and tested on x86_64 only so use it at your own risk. Also no chatGPT or some another Artificial Idiots were used for this research

Lets assume that we have shared library (for example R extension or python module) and we want to know where and why it spending many hours and consuming megawatts of electricity. There is even semi-official way to do this:

  1. compile shared library with -pg option
  2. set envvar LD_PROFILE_OUTPUT to directory where you want to store profiling data
  3. set envvar LD_PROFILE to filename of library to profile
  4. run your program. Well, sounds that you need lots of things to do before this step and you can`t set up profiling dynamically
  5. run sprof on profiling log

Unfortunately this method just don`t work - sprof fails with cryptic message
Inconsistency detected by ld.so: dl-open.c: 890: _dl_open: Assertion `_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed!

Seems that this long lived bug known since 2017 and still not fixed
Lets try to discover some more reliable way and start with inspection of code generated for profiling

воскресенье, 14 января 2024 г.

failed attempts to draw graphs

CSES has several really hard graph-related tasks, for example

It would be a good idea to visualize those graphs. One of well-known tool to do this is Graphviz, so I wrote simple perl script to render graph from CSES plain text into their DSL. On small graphs all goes well and we can enjoy with something like

But seems that on big graphs with 200k nodes dot just can`t finish rendering and after ~2 hours of hard work met with OOM killer. Lets think how we can reduce size of graph

четверг, 4 января 2024 г.

Distinct Colors

I`ve solved yet another very funny CSES task - it looks very similar to another task called "Reachable Nodes" (my solution for it). The only difference is that we asked to count not unique nodes but colors of nodes. What can go wrong?

And this is where funny part begins - my patched solution got crashes. gdb didn`t showed nothing interesting. However I remember scary cryptic command to show stack usage:

print (char *)_environ - (char *)$sp
$1 = 8384904

Very close to default 8Mb (check ulimit -s). Wait, WHAT? Do we really have stack exhausting? Lets check - 8 * 1024 * 1024 = 8388608 bytes. Tree can have 200000 nodes. 8388608 / 200000 = ~42 bytes for each recursive DFS call. Seems to be true - in each call we store return address + stack frame RBP + 3 registers holding args (this, indexes of node and parent) - so at least 5 * 8 = 40 bytes. It`s so happened that some tests contain tree with very long stem from root till end, so yes - recursive DFS cannot visit all nodes in such tree. Solution is simple - we can emulate recursion with std::stack. As bonus for all nodes in stack we can use single bit mask to save space

Another unpleasant observation is that trees in tests ain't BINARY trees. When one picture is worth a thousand words:

Degree of node 2 is 4. This is main reason why function dfs has separate branch for processing joint nodes with only 2 descendants - bcs initially method is_fork returned only left and right

Source