quotemstr 18 hours ago [-]
Linux is unusual in OS kernels in that direct system calls from arbitrary userspace code are supported and ABI-stable. This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition.
If, instead, as on OpenBSD, the kernel enforced the rule that all system calls had to go through libc (or perhaps a big ntdll.dll-like VDSO), then the whole problem the linked article tries in vain to solve would disappear. If you wanted to hook a system call, you'd just change the libc/VDSO dispatch. No need to rewrite any instructions.
If I were Linus, I'd make a new rule: starting today, all new system calls must go through VDSO. No exceptions. SYSCALL from anywhere else? SIGKILL.
This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.
razighter777 15 hours ago [-]
Direct system calls are an amazing idea. The ntdll and BSD models are worse. The whole libc becomes a security boundary without the protection of kernel space. So much Windows malware and process tampering happens because you have a library (ntdll) living fully in userspace that is given special privileges, which makes it a huge attack surface. Then you have to deal with breakage between the built-in libc versions and the kernel.
The syscall overhead isn't as large as you suppose; for workloads where it actually makes a difference, there are robust low-syscall paths for I/O- and latency-sensitive operations, with DPDK, io_uring, and futex being a few examples.
And there are robust, performant methods on Linux for syscall interception/tracing: see seccomp unotify, bpf tracepoints, ftrace.
eqvinox 1 hours ago [-]
Your argument about libc/ntdll having "special privileges" is a bit weird in that the alternate option is everything having those privileges. The ntdll tampering doesn't exist on Linux because it's not necessary. It's not better due to this.
yjftsjthsd-h 17 hours ago [-]
> This model has always been a terrible idea. It robs the system of an ability to intercept system calls in userspace before doing an expensive privilege-mode transition.
This model has always been a trade-off. It has downsides, but it also has upsides, including an immense boost in flexibility; decoupling from any particular userspace is useful.
> This way, you can just LD_PRELOAD in front of the VDSO and system call interception in userspace Just Works.
Can you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.
matheusmoreira 11 hours ago [-]
The vDSO is just a normal ELF shared object that Linux maps somewhere in the address space of the process. The kernel passes a pointer to the ELF header to the process via the auxiliary vector.
That's the end of Linux's involvement. It's up to the program itself to do something useful with that pointer, namely by parsing the ELF header, and then resolving its symbols to function pointer addresses.
There's no doubt that all the various libc implementations out there do this, but I don't know if they do it in a way that lets LD_PRELOAD override the vDSO. They could be hard linking the vDSO system calls into their system call stubs or something.
Usually programs intercept system calls by overriding the libc stubs, which also indirectly intercepts the vDSO. However, it's not actually a requirement that the system be structured like this. Theoretically, the program could do anything. System calls can be done directly, without any stubs. Compilers could just generate the code directly without any functions at all.
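To make the auxiliary vector step above concrete, here is a minimal sketch in Python. It assumes a Linux system with /proc mounted; the helper names `read_auxv` and `vdso_header` are made up for illustration. It only locates the vDSO and checks that an ELF header sits at that address; it does not parse the symbol table or resolve any functions.

```python
import ctypes
import struct

AT_NULL = 0           # terminator entry of the auxiliary vector
AT_SYSINFO_EHDR = 33  # from <elf.h>: address of the vDSO's ELF header


def read_auxv(path="/proc/self/auxv"):
    """Return this process's auxiliary vector as a {type: value} dict.

    Each entry is a pair of native unsigned longs. Returns an empty dict
    when the file is unavailable (for example, on a non-Linux system).
    """
    entry = struct.Struct("@LL")
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError:
        return {}
    auxv = {}
    for offset in range(0, len(data) - entry.size + 1, entry.size):
        key, value = entry.unpack_from(data, offset)
        if key == AT_NULL:
            break
        auxv[key] = value
    return auxv


def vdso_header():
    """Return the first 4 bytes at the vDSO base, or None if there is no vDSO."""
    base = read_auxv().get(AT_SYSINFO_EHDR)
    if not base:
        return None
    # The vDSO is mapped readable into our own address space,
    # so we can read it directly through the pointer.
    return ctypes.string_at(base, 4)


if __name__ == "__main__":
    print(vdso_header())
```

Everything past this point (walking the program headers, the dynamic section, and the symbol table) is what a libc or a freestanding program would have to do on its own.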
mananaysiempre 15 hours ago [-]
> Can you LD_PRELOAD in front of the vDSO? I was under the (possibly mistaken) impression that the kernel injects it directly.
The kernel puts the vDSO in memory and tells ld.so where it is, but where (if anywhere) ld.so puts it in its search order is ld.so's own concern. (TBH I don’t actually know whether ld.so will allow LD_PRELOAD to override the vDSO, but there’s no reason for it not to, except I guess for the syscalls that are needed to perform the dynamic linking itself.)
matheusmoreira 11 hours ago [-]
> This model has always been a terrible idea.
I disagree. It's an amazing idea. It allows me to write freestanding programs without any C libraries. It allows compilers to have Linux system call builtins that directly generate the calling convention. I created an entire lisp interpreter with nothing but Linux system calls, completely freestanding.
As a kernel, Linux is completely independent from its user space. The instruction set is the correct abstraction for the system call entry point. There should be no "required C libraries". User space should be free to reinvent everything in Rust if it wants.
There are various kernel mechanisms for system call interception if that's what you want. Tools like strace work just fine on my lisp interpreter, so libc is clearly not needed.
LD_PRELOAD is a dynamic linker (ld.so) feature. The dynamic linker is the exact sort of user space component that's supposed to be completely replaceable. None of this is any of Linux's business.
Use of the vDSO is not even mandatory. All system calls in the vDSO are also available via the kernel entry point. The vDSO is just an optimization for frequently called system calls like gettimeofday. Forcing all programs to use the vDSO would force them all to not only implement the ELF spec but also to implement a small ELF linker. This is a significant blow if you want to create minimal freestanding Linux programs.
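A rough way to see that the vDSO is optional is to ask for the same clock through the real kernel entry point. The sketch below is Python and makes several assumptions: Linux on x86_64 or aarch64 (the syscall numbers are hardcoded for those two only), and a libc that exports the generic `syscall()` wrapper. Note that `syscall()` is itself a libc function, so this is not freestanding; it merely issues the syscall instruction rather than calling into the vDSO. The name `raw_clock_gettime` is made up for this example.

```python
import ctypes
import platform
import time

# clock_gettime syscall numbers; only these two architectures are covered.
SYS_CLOCK_GETTIME = {"x86_64": 228, "aarch64": 113}
CLOCK_MONOTONIC = 1


class Timespec(ctypes.Structure):
    # struct timespec layout on 64-bit Linux
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


def raw_clock_gettime(clock_id=CLOCK_MONOTONIC):
    """Read a clock via the kernel entry point, bypassing the vDSO.

    Returns seconds as a float, or None on an unsupported platform.
    """
    nr = SYS_CLOCK_GETTIME.get(platform.machine())
    if nr is None or platform.system() != "Linux":
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ts = Timespec()
    rc = libc.syscall(ctypes.c_long(nr), ctypes.c_long(clock_id), ctypes.byref(ts))
    if rc != 0:
        raise OSError(ctypes.get_errno(), "clock_gettime syscall failed")
    return ts.tv_sec + ts.tv_nsec / 1e9


if __name__ == "__main__":
    # time.clock_gettime goes through libc, which typically uses the vDSO.
    print(raw_clock_gettime(), time.clock_gettime(time.CLOCK_MONOTONIC))
```

Both paths should report nearly the same monotonic time; the raw path just pays for a full kernel transition on every call, and it shows up in strace where the vDSO path usually does not.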
Joker_vD 1 hours ago [-]
> It allows me to write freestanding programs without any C libraries.
KERNEL32.dll is not a C library (for one, its exported functions don't even use any of the default C calling conventions on x86).
> I created an entire lisp interpreter with nothing but Linux system calls, completely freestanding.
"Freestanding", as in "standing on top of an OS but nothing else"? Then using the OS-provided shared object that is the documented interface between the userspace and the kernel doesn't violate your free stand.
I mean, I too have written small interpreters that had only LoadLibraryW/GetProcAddress from kernel32.dll as their imports and nothing else.
> The instruction set is the correct abstraction for the system call entry point.
Why? A function call seems a much more appropriate abstraction for the system call entry point.
> There should be no "required C libraries".
There is no required C library on Windows, yet it doesn't use direct system calls.
> Forcing all programs to use the vDSO would force them all to not only implement the ELF spec but also to implement a small ELF linker.
Not really. Neither Windows nor UEFI require you to reimplement any linking functionality. The OS can simply give your program a pointer to a table of function pointers at your entry point... which it already can do, see the aux vector on Linux.
throwaway7356 16 hours ago [-]
> all system calls had to go through libc (or perhaps a big ntdll.dll-like
Which makes containers crap on Windows and *BSD, as they have to run the correct libc or equivalent. Thus you need to build a different container per OS version, which sucks compared to Linux.
Joker_vD 16 hours ago [-]
Windows doesn't even have its own libc.
orangesilk 13 hours ago [-]
Windows does have three libcs, likely as a compatibility layer.
Their names are:
* <forgotten something Windows 3.1>
* msvcrt.dll, 2014
* ucrt.dll (Universal C Runtime, since Windows 10)
Joker_vD 10 hours ago [-]
Those are not a compatibility layer with the OS. Heck, they all barely even provide proper access to the file system, ffs! The "msvcrt.dll" in the System32 folder is an ancient leftover from a Microsoft-internal version of MSVC 6.0 or so, not intended for 3rd-party consumption.
At some point Microsoft got tired of maintaining binary-incompatible versions of its C runtime for different Visual Studios, so they started shipping UCRT with Windows itself... but you still don't need to touch that garbage for anything whatsoever.
yjftsjthsd-h 15 hours ago [-]
They said "or equivalent", so ntdll
quotemstr 10 hours ago [-]
In Windows, the last-userspace-before-kernel-mode layer is called ntdll.dll. Unlike msvcrt or any other libc, ntdll is universal and loaded into every process.
quotemstr 10 hours ago [-]
You understand that your container is using the VDSO today, right? A UAPI requirement to issue system calls through it wouldn't hurt your deployment story at all.
But sure, keep using SYSCALL, THE DEPENDENCY MUTILATOR. It's got what containers crave!
freestanding 16 hours ago [-]
That's why OpenBSD is inconvenient for development: it binds you to libc bloatware.
razighter777 15 hours ago [-]
Yep, and it forces every application to deal with the C FFI. It's beautiful in Linux that I can access the full kernel API from an int 0x80/syscall instruction + a few register loads without having to link against crap. I can write a simple cat utility in a dozen or so lines of assembly.
freestanding 12 hours ago [-]
FFI is a different term. I called libc bloatware because it comes with a lot of stuff that is not needed and things that are not appropriate for the system API layer, like a memory allocator, string primitives, etc. It also has old-style naming, like_this_one_supposed_to_be_nice or whtabthis1?
Windows's NTDLL naming (at least in early versions) is much better and the layer is much thinner; the problem is that it is "undocumented". There's also its rigid portability, whereas binding to libc makes *nix software non-portable. NT also has syscalls through the interrupt, btw.
matheusmoreira 11 hours ago [-]
You might enjoy my work on the lone lisp language. I got rid of the libc and implemented an entire interpreter with nothing but Linux system calls. Been working on it and blogging about it for about 3 years now.
The number of times we ran LD_PRELOAD in prod was vanishingly small and limited to debugging, so the OpenBSD solution seems to be just a waste of CPU cycles.
Gualdrapo 17 hours ago [-]
> If I were Linus, I'd make a new rule
Or, you know, just propose your idea to him
yjftsjthsd-h 16 hours ago [-]
Based on https://www.phoronix.com/news/Linus-Torvalds-No-Random-vDSO , I had been under the impression that he wasn't fond of adding more use of vDSO. On rereading, I can't tell if that's a vDSO thing or a preference against fast randomness being provided by the kernel.
I've written a sort of manifesto around this:
https://www.matheusmoreira.com/articles/linux-system-calls
> If I were Linus
Good thing you aren't.
http://github.com/lone-lang/lone/