Seccomp机制与seccomp notify介绍

seccomp代表secure computing，是早在2.6.12版本就引入到内核的特性，用来限制进程可以使用的系统调用。它作用于进程里的线程(task)。

最初，seccomp只允许使用read, write, _exit, sigreturn4个系统调用，调用其他系统调用时，内核会发送SIGKILL信号终止进程。当时seccomp的提出主要是想用于出租空闲的CPU算力。这种模式叫做STRICT模式。它限制过于严格，在实际应用上并没有太多发展。Linus Torvald甚至建议把它从内核中砍掉。

下面通过实例展示一下STRICT模式的seccomp机制:

strict.c代码如下:

#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    open("/dev/null", O_RDONLY);
#define MESSAGE1    "open called\n"
    write(STDOUT_FILENO, MESSAGE1, sizeof(MESSAGE1));

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

    open("/dev/null", O_RDONLY);
#define MESSAGE2    "You can't see this message\n"
    write(STDOUT_FILENO, MESSAGE2, sizeof(MESSAGE2));

    return 0;
}

程序的执行结果是:

1
2
3

[root@t8 seccomp]# ./strict
open called
Killed

可以看到第二个write调用没有执行到，因为在它之上的open调用执行时，内核终止了这个进程。我们代码里不使用printf而是使用write输出，是因为printf实现本身可能还会调用write之外其他的系统调用。

沉寂了一些年之后，在3.5版本的内核中引入一种新的seccomp模式。它基于BPF来过滤系统调用，这种模式叫做SECCOMP_MODE_FILTER。这种模式下，可以自定义被允许使用的系统调用，而自定义过滤规则是借由BPF语言来实现。因而这种模式也叫做Seccomp-BPF。

过滤规则仍旧使用BPF的struct socket_filter结构来表示，但匹配的内容却是系统调用号和参数内容，但是过滤程序不能解引用指针(dereference pointer)，去匹配指针指向的内容。BPF程序可以不同的返回值，指示内核进行不同的处理逻辑，如:

SECCOMP_RET_KILL: 立即终止进程
SECCOMP_RET_TRAP: 发送一个可捕获的SIGSYS
SECCOMP_RET_ERROR: 指定errno的值并返回
SECCOMP_RET_TRACE: 由被附加的ptrace tracer裁决
SECCOMP_RET_ALLOW: 允许这个系统调用继续

随着内核发展，返回值也在变化，5.17版本上已经有更多的返回值，可以参考内核文档。

对于同一个系统调用可以加载多个过滤器。这种场景下，系统调用的裁决结果以最高优先级的返回值为准，返回值优先级也可以参考不同版本内核的上述文档。

BPF语言本身提供了一套指令集来实现过滤功能。可以直接基于BPF指令和内核定义的宏来编写过滤程序。BPF的指令规范可以参考这里。

seccomp-BPF模式的使用流程是这样的:
1、以struct socket_filter的数组承载过滤规则
2、以struct sock_fprog结构来封装上述过滤规则
3、使用prctl系统调用加载上述struct sock_fprog

而BPF程序的输入是struct seccomp_data结构:

struct seccomp_data {
    int nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
};

下面用内核定义的BPF宏来展示Seccomp-BPF模式:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <stddef.h>


int main(int argc, char **argv)
{
    int ret;

    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, (offsetof(struct seccomp_data, nr))),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),

//        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_dup2, 0, 1),
//        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
        .filter = filter,
    };

    ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        printf("prctl failed\n");
        exit(1);
    }

#define MESSAGE1    "Filter loaded\n"
    write(STDOUT_FILENO, MESSAGE1, sizeof(MESSAGE1));

    ret = dup2(1, 2);

#define MESSAGE2    "dup2 called\n"
    write(STDOUT_FILENO, MESSAGE2, sizeof(MESSAGE2));

    _exit(0);

    return 0;
}

我们的过滤程序从seccomp_data结构中读取nr字段的值，装载到寄存器中，然后进行一系列的系统调用号的匹配和跳转操作。如果任何一个系统调用号都没有匹配到，则返回SECCOMP_RET_KILL, 内核将终止进程。

加载BPF过滤器，需要调用线程有CAP_SYS_ADMIN权限，或者将no_new_priv置位，这可以通过prctl实现:

1	prctl(PR_SET_NO_NEW_PRIVS, 1);

运行结果如下:

1
2
3

[root@t8 seccomp]# ./raw_filter
Filter loaded
Bad system call (core dumped)

我们代码中注释掉了允许dup2调用的两行程序，当执行到dup2时，进程将被终止, 因而最后的”dup called”不会输出。

将注释掉的两行代码放开后，执行结果如下:

1
2
3

[root@t8 seccomp]# ./raw_filter2
Filter loaded
dup2 called

这种方式开发效率非常低下，可以使用libseccomp这种更高阶的API库，具体参考官网。

我们使用libseccomp来展示示例:

#include <unistd.h>
#include <seccomp.h>

int main(int agrc, char **argv)
{
    scmp_filter_ctx ctx;

    ctx = seccomp_init(SCMP_ACT_KILL);

    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);

    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(dup2), 2,
            SCMP_A0(SCMP_CMP_EQ, 1),
            SCMP_A1(SCMP_CMP_EQ, 2));

    seccomp_load(ctx);

#define MESSAGE1 "Filter loaded\n"
    write(STDOUT_FILENO, MESSAGE1, sizeof(MESSAGE1));

    dup2(1, 2);

#define MESSAGE2 "Dup2 succeeded\n"
    write(STDOUT_FILENO, MESSAGE2, sizeof(MESSAGE2));

    dup2(2, 42);

#define MESSAGE3 "You can't see this message\n"
    write(STDOUT_FILENO, MESSAGE3, sizeof(MESSAGE3));

    return 0;
}

scmp_filter_ctx是过滤逻辑所使用的上下文结构，seccomp_init函数对上下文结构体进行初始化，若参数为SCMP_ACT_ALLOW, 则过滤为黑名单模式；若为SCMP_ACT_KILL，则为白名单模式，即进程调用没有匹配到规则的系统调用都会杀死，默认不允许所有的系统调用。seccomp_rule_add函数用来添加规则，seccomp_load函数加载过滤器。

我们允许read,write, sig_return, exit4个系统调用。对于dup2调用，我们还会根据传入的参数来进行判断。只有传入的两个参数为1和2被允许，即:

1	dup2(1, 2)

编译程序需要链接libseccomp:

1	gcc seccomp_filter.c -lseccomp -o seccomp_filter

执行结果:

[root@t8 seccomp]# ./seccomp_filter
Filter loaded
Dup2 succeeded
Bad system call (core dumped)

dup2(1, 2)成功执行，而后边的dup2(2, 42)没有成功执行，进程被终止。

之后，在5.0版本内核又加入了seccomp-unotify机制，5.9版本又做了特性增强。seccomp-BPF模式对系统调用的裁决是由过滤程序自己完成的，而seccomp-unotify机制能够将裁决权转移给另一个用户态进程。这个文章对这个特性介绍的非常详细。

我们将加载过滤程序的进程叫做target, 接收通知的进程叫做supervisor。在这个模式中，supervisor不仅对是否允许系统调用能够做出裁决，它还可以代替target进程完成这个系统调用的行为。这大大扩大了seccomp机制的应用范围。上边我们介绍过，Seccomp-BPF模式只能检测系统调用的参数，不能解引用指针。而这个unotify模式则还可以去查看指针所指向的内存。

具体的使用方式可以参考这里。

我们再来展示一个示例。target进程代码如下:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/syscall.h>
#include <stdlib.h>
#include <stddef.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>


int main(int argc, char **argv)
{
    int ret;
    int notifyfd, fd;

    if (argc != 2) {
        printf("usage: %s <file path>\n", argv[0]);
        exit(-1);
    }

    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, (offsetof(struct seccomp_data, nr))),

        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_open, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),

        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_openat, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),

        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
        .filter = filter,
    };

    ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

    notifyfd = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
            SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    if (notifyfd < 0) {
        printf("seccomp failed: %s\n", strerror(errno));
        exit(-1);
    }

    printf("tid: %d, notify fd: %d\n", syscall(SYS_gettid), notifyfd);

    fd = open(argv[1], O_CREAT|O_RDWR);
    if (fd < 0) {
        printf("open failed: %s\n", strerror(errno));
        exit(-1);
    } else {
        printf("open succeeded\n");
    }

    close(fd);
    close(notifyfd);

    return 0;
}

target进程使用seccomp系统调用来加载过滤程序，获得一个seccomp-unotify的fd, 接着将线程ID和通知fd输出。接下来调用open系统调用。我们的BPF程序里对于open和openat返回SECCOMP_RET_USER_NOTIF，因而执行到这里时，程序将会阻塞。

再来看supervisor进程:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <linux/limits.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <assert.h>
#include <fcntl.h>

static void process_notifications(int notifyfd)
{
    __u64  id;
    struct seccomp_notif_sizes sizes;
    struct seccomp_notif *req;
    struct seccomp_notif_resp *resp;

    char path[PATH_MAX];
    int memfd;
    ssize_t s;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1) {
        printf("seccomp failed: %s", strerror(errno));
        exit(-1);
    }

    assert((req = malloc(sizes.seccomp_notif)));
    assert((resp = malloc(sizes.seccomp_notif_resp)));

    memset(req, 0, sizes.seccomp_notif);
    memset(resp, 0, sizes.seccomp_notif_resp);

    if (ioctl(notifyfd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
        printf("ioctl failed: %s\n", strerror(errno));
        exit(-1);
    }

    printf("Got notification for PID: %d, id is %llx\n",
            req->pid, req->id);

    id = req->id;
    if (ioctl(notifyfd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        printf("Notification ID check: target has died: %s\n",
                strerror(errno));
        exit(-1);
    }

    snprintf(path, sizeof(path), "/proc/%d/mem", req->pid);
    memfd = open(path, O_RDONLY);
    if (memfd < 0) {
        printf("open mem file failed: %s\n", path);
        exit(-1);
    }

    printf("SYSCALL: %d\n", req->data.nr);

    if (req->data.nr == SYS_open) {
        assert(lseek(memfd, req->data.args[0], SEEK_SET) >= 0);
    } else if (req->data.nr == SYS_openat) {

        printf("memory address: 0x%016X\n", req->data.args[1]);
        assert(lseek(memfd, req->data.args[1], SEEK_SET) >= 0);
    }

    assert((s = read(memfd, path, sizeof(path))) > 0);

    printf("open path: %s\n", path);

    close(memfd);

    if (strlen(path) == strlen("/tmp/noopen.txt") &&
        strncmp(path, "/tmp/noopen.txt", strlen("/tmp/noopen.txt")) == 0)
    {
        printf("Denied\n");
        resp->error = -EPERM;
        resp->flags = 0;
    } else {
        printf("Allowed\n");
        resp->error = 0;
        resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
    }

    resp->id = req->id;

    if (ioctl(notifyfd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
        if (errno == ENOENT) {
            printf("Response failed with ENOENT; perhaps target "
                   "process's syscall was interrupted by signal?\n");
        } else {
            printf("ioctl failed: %s\n", strerror(errno));
            exit(-1);
        }
    }

    free(req);
    free(resp);
}

int main(int argc, char **argv)
{
    int pidfd;
    int notifyfd, targetfd;
    pid_t pid;

    if (argc != 3) {
        printf("usage: %s <pid> <target fd>\n", argv[0]);
        exit(-1);
    }

    pid = atoi(argv[1]);
    targetfd = atoi(argv[2]);
    printf("PID: %d, TARGET FD: %d\n", pid, targetfd);

    pidfd = syscall(SYS_pidfd_open, pid, 0);
    assert(pidfd >= 0);
    printf("PIDFD: %d\n", pidfd);

    notifyfd = syscall(SYS_pidfd_getfd, pidfd, targetfd, 0);
    assert(notifyfd >= 0);
    printf("NOTIFY FD: %d\n", notifyfd);

    process_notifications(notifyfd);

    return 0;
}

supervisor进程能过pidfd_getfd从target进程获取到seccomp-unotify的fd, 从fd中获取到系统调用的调用通知。通过/proc/[pid]/mem文件获取到open调用传入的指针所指向的文件名。当文件名为/tmp/noopen.txt时，open被禁止，其他文件名则允许执行。

我们先执行target进程，可以看到输出的PID和FD, 同时进程被阻塞:

1
2
3

[root@t8 seccomp]# ./target /tmp/noopen.txt
tid: 9449, notify fd: 3

然后我们在另一个终端执行supervisor:

[root@t8 seccomp]# ./supervisor 9449 3
PID: 9449, TARGET FD: 3
PIDFD: 3
NOTIFY FD: 4
Got notification for PID: 9449, id is 5928d2192e65a0b5
SYSCALL: 257
memory address: 0x000000008FB87785
open path: /tmp/noopen.txt
Denied

可以看到open被禁止，而另一个终端上target进程也返回:

1
2
3

[root@t8 seccomp]# ./target /tmp/noopen.txt
tid: 9449, notify fd: 3
open failed: Operation not permitted

我们再以非/tmp/noopen.txt的文件来运行一次target:

1
2
3

[root@t8 seccomp]# ./target /tmp/dummy.txt
tid: 9451, notify fd: 3

程序正常阻塞，然后运行supervisor:

[root@t8 seccomp]# ./supervisor 9451 3
PID: 9451, TARGET FD: 3
PIDFD: 3
NOTIFY FD: 4
Got notification for PID: 9451, id is 211503f5ffd2ddc3
SYSCALL: 257
memory address: 0x000000007E3EF786
open path: /tmp/dummy.txt
Allowed

open调用被允许执行，这时target进程也返回:

1
2
3

[root@t8 seccomp]# ./target /tmp/dummy.txt
tid: 9451, notify fd: 3
open succeeded

查看创建的/tmp/dummy.txt可以看到文件被成功创建了:

1 2	[root@t8 seccomp]# ls -l /tmp/dummy.txt ---sr-----. 1 root root 0 Mar 29 08:01 /tmp/dummy.txt

seccomp-unotify当前在容器环境里有比较大的使用空间，而且还在不断的发展之中，后续有新的进展，再来单独介绍。