Linux system calls
For a long time I felt there was something unique about Linux. Something was different, but I couldn't quite figure out what it was. Only when I learned about Linux's system call interface did I finally understand.
This article contains everything I've learned.
Why Linux system calls?
Programs usually interface with the kernel through libraries provided by the operating system, most commonly libc. Through these libraries, they have access to system functions.
POSIX-compliant operating systems provide read, write and numerous others. Windows has the Win32 API and its many DLLs and functions.
These operating systems consist of strongly connected kernel and user space components, developed and distributed as one unit. The user space libraries are the only supported means of using the system. User space programs are not meant to talk to the kernel directly; they're meant to use the provided system libraries. This forces them to depend on and link against such libraries.
While it is often possible to interface with the kernel directly, the kernel interface is unstable and subject to change. Software that insists on using that interface might simply stop working if or rather when it changes.
Sometimes it's not even possible to make system calls directly at all. OpenBSD has implemented system call origin verification, a security mechanism that only allows system calls originating from the system's libc. So not only is the kernel ABI unstable, normal programs are not even allowed to interface with the kernel directly.
Linux is different
One of the things that make the Linux kernel interesting is the fact it has a stable kernel-userspace interface. Unlike virtually every other kernel and operating system, Linux guarantees stability at the binary interface level.
This is rooted in the fact Linux is a kernel, not a complete operating system as traditionally defined. As an independent component, it must have a stable interface to user space software if anything is to be built upon it.
While many people argue that Linux is not an operating system, there's no question that Linux is a platform and that it is possible to safely build directly upon it. There is no actual need to depend on anything else. Not even libc.
How Linux system calls work
Processor instruction set architectures contain special instructions for calling the kernel. These instructions cause the processor to switch to kernel mode and execute code at a predefined location in the kernel.
At least one parameter must be provided: the system call number, often referred to as the NR. Linux uses this number as an index into a table of function pointers to find the function being called. Any other arguments are passed to this function.
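The table lookup can be pictured with a small user space model. This is only a sketch of the idea, not kernel code: the names handler_fn, demo_table and demo_dispatch, and the two toy handlers, are all made up for illustration. The one detail borrowed from Linux is that an unknown number produces -ENOSYS.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: every toy "system call" takes
   register-sized arguments and returns a register-sized result. */
typedef long (*handler_fn)(long, long, long);

static long demo_add(long a, long b, long c)    { (void) c; return a + b; }
static long demo_negate(long a, long b, long c) { (void) b; (void) c; return -a; }

/* The table: the call number is simply the index. */
static const handler_fn demo_table[] = {
    demo_add,     /* number 0 */
    demo_negate,  /* number 1 */
};

/* Look up the handler by number and forward the remaining arguments.
   Numbers outside the table yield -ENOSYS. */
long
demo_dispatch(long number, long a, long b, long c)
{
    size_t count = sizeof demo_table / sizeof demo_table[0];

    if (number < 0 || (size_t) number >= count)
        return -ENOSYS;

    return demo_table[number](a, b, c);
}
```

In this model, demo_dispatch(0, 2, 3, 0) selects demo_add and forwards the arguments 2 and 3 to it.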
These parameters are passed to the kernel in registers. The kernel also returns a result value in a register. Which registers are used for which parameters and which register contains the return value defines the Linux system call calling convention.
This calling convention is stable, allowing user space programs to use it without fear of breakage. It is defined at the instruction set level and so it is also programming language agnostic. All user space programs written in any language may make use of it. Typically, programs call libc functions which implement this calling convention. However, that is not actually a requirement. It's perfectly possible for a compiler to directly emit code following that calling convention: it could have support for a system_call keyword. A JIT compiler could generate code for this at runtime.
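To make the "typical" libc path concrete, here is a small sketch assuming glibc or musl, both of which ship a generic syscall() function next to dedicated wrappers such as getpid(). The helper name wrappers_agree is invented for this example. Both calls below end up issuing the same getpid system call; libc merely places the number and arguments in the right registers for you.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* getpid, syscall */

/* Returns 1 when the dedicated wrapper and the generic
   syscall() function report the same process ID. */
int
wrappers_agree(void)
{
    pid_t from_wrapper = getpid();
    long  from_generic = syscall(SYS_getpid);

    return from_generic == (long) from_wrapper;
}
```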
The journalists at LWN have written detailed articles about the implementation of Linux system calls. They are definitely worth reading.
- Anatomy of a system call, part 1
- Anatomy of a system call, part 2
- Anatomy of a system call, additional content
The calling convention is documented here:
Implementing a system call function
In order to make a system call, the parameters must be placed in the appropriate registers, the system call instruction must be executed and the return value must be collected from the return register. System calls support a maximum of six arguments.
Since the registers and system call instruction vary by architecture, separate functions are needed for each architecture. Despite this, it is simple to write a C function that can make any system call.
For example, a system call function for the x86_64 architecture:
long
linux_system_call_x86_64(long number,
                         long _1, long _2, long _3,
                         long _4, long _5, long _6)
{
    /* Pin each value to the register the kernel expects it in. */
    register long rax __asm__("rax") = number;
    register long rdi __asm__("rdi") = _1;
    register long rsi __asm__("rsi") = _2;
    register long rdx __asm__("rdx") = _3;
    register long r10 __asm__("r10") = _4;
    register long r8  __asm__("r8")  = _5;
    register long r9  __asm__("r9")  = _6;

    /* Enter the kernel; the result comes back in rax. */
    __asm__ volatile
    ("syscall"
     : "+r" (rax),
       "+r" (r8), "+r" (r9), "+r" (r10)
     : "r" (rdi), "r" (rsi), "r" (rdx)
     : "rcx", "r11", "cc", "memory");

    return rax;
}
All parameters and the return value are of type long. This really just means "register": all values passed to the kernel must fit in registers and typically long is register sized. This means all arguments must either be simple values or pointers to more complex structures.
The function ensures all arguments are placed in the appropriate registers by assigning them to local variables annotated with an inline assembly directive which tells the compiler which register to choose. The register keyword is not decoration: GCC's explicit register variable extension requires it on local variables that name a specific register.
The x86_64 architecture contains the aptly named syscall instruction which switches to kernel mode and enters the kernel entry point. Other architectures have different instructions. For example, aarch64 uses svc #0 instead.
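As a sketch of how the per-architecture differences can sit behind one function, the version below selects the registers and instruction with the preprocessor. The x86_64 branch repeats the function above. The aarch64 branch follows the Linux convention for that architecture: number in x8, arguments in x0 to x5, result in x0. The names linux_system_call and linux_write are my own, and sys/syscall.h is pulled in only for the SYS_write constant.

```c
#include <sys/syscall.h>  /* SYS_write; used only for the number */

long
linux_system_call(long number,
                  long _1, long _2, long _3,
                  long _4, long _5, long _6)
{
#if defined(__x86_64__)
    register long rax __asm__("rax") = number;
    register long rdi __asm__("rdi") = _1;
    register long rsi __asm__("rsi") = _2;
    register long rdx __asm__("rdx") = _3;
    register long r10 __asm__("r10") = _4;
    register long r8  __asm__("r8")  = _5;
    register long r9  __asm__("r9")  = _6;

    __asm__ volatile
    ("syscall"
     : "+r" (rax),
       "+r" (r8), "+r" (r9), "+r" (r10)
     : "r" (rdi), "r" (rsi), "r" (rdx)
     : "rcx", "r11", "cc", "memory");

    return rax;
#elif defined(__aarch64__)
    register long x8 __asm__("x8") = number;
    register long x0 __asm__("x0") = _1;
    register long x1 __asm__("x1") = _2;
    register long x2 __asm__("x2") = _3;
    register long x3 __asm__("x3") = _4;
    register long x4 __asm__("x4") = _5;
    register long x5 __asm__("x5") = _6;

    /* x0 carries the first argument in and the result out. */
    __asm__ volatile
    ("svc #0"
     : "+r" (x0)
     : "r" (x8), "r" (x1), "r" (x2), "r" (x3), "r" (x4), "r" (x5)
     : "cc", "memory");

    return x0;
#else
#error "this sketch only covers x86_64 and aarch64"
#endif
}

/* Hypothetical convenience wrapper: the pointer and the length
   are cast to long because every argument travels in a register. */
long
linux_write(int fd, const void *buffer, unsigned long length)
{
    return linux_system_call(SYS_write, fd,
                             (long) buffer, (long) length,
                             0, 0, 0);
}
```

Calling linux_write(1, "hello\n", 6) asks the kernel to write six bytes to standard output and returns the number of bytes written, or a negative value on failure.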
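The compiler is informed via the extended inline assembly construct which registers this instruction reads and writes, and that it clobbers rcx, r11, the condition flags and memory. The syscall instruction itself overwrites rcx and r11 with the return address and the saved flags, which is why they are listed. All seven of the previously assigned system call number and parameter registers are inputs. rax is also an output: the kernel places the return value there, overwriting the system call number. The code additionally marks r8, r9 and r10 as read-write operands, so the compiler does not assume their values survive the call.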
After the system call has been made, all that's left to do is to return the result. It may be a valid value or a negated errno constant. The various libc implementations normalize those error values and place them in a global or thread local errno variable. That's not necessary when using Linux system calls directly!
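Telling the two cases apart takes one range check. Linux reserves the results from -4095 to -1 for negated errno values, so anything outside that range is a valid result, even if it looks negative when read as a signed long (a high address returned by mmap, for instance). The helper names below are my own and only sketch the check.

```c
/* Largest errno value the kernel will return negated.
   The macro name is illustrative. */
#define LINUX_MAX_ERRNO 4095L

/* Nonzero when a raw system call result encodes an error. */
int
linux_is_error(long result)
{
    return result < 0 && result >= -LINUX_MAX_ERRNO;
}

/* The positive errno code, or 0 when the result is not an error. */
int
linux_error_code(long result)
{
    return linux_is_error(result) ? (int) -result : 0;
}
```

A caller would pass the value returned by linux_system_call_x86_64 straight into linux_is_error and, when it reports an error, compare linux_error_code against the usual constants such as EBADF or ENOENT.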