jouni.rosenlof@jrcomplex.fi

Securing containers in Kubernetes with Seccomp

The basic container threat model is that a container behaves badly: for some reason (an application malfunction, an attacker, malicious code, a malicious admin and so on) the container tries to do something it is not supposed to do, for example access something it should not. All such threats come down to container privileges: what can the container do, what can it access, and how? I’m not going into detail on how to deploy basic Kubernetes security controls such as namespaces, RBAC, a Kubernetes secrets vault or Pod Security Policies; that’s another story. In this article I will focus on seccomp and how to use it to reduce container privileges and thereby mitigate possible threats.

First, some basic concepts. Let’s pull the BusyBox test image to the local store and run it:

$ podman pull busybox

$ podman images

REPOSITORY                 TAG     IMAGE ID      CREATED     SIZE

docker.io/library/busybox  latest  219ee5171f80  6 days ago  1.45 MB

$ podman run docker.io/library/busybox sleep 300

BusyBox is now running for five minutes. Let’s see what the container looks like from the host’s perspective:

Find the container’s PID (16885 in this example):

# ps -e -o pid,comm,cgroup

Check the command line of the container process:

# cat /proc/16885/cmdline 

sleep300

Containers are just processes running on the host, sharing the kernel with all other processes.
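The output above reads "sleep300" because the kernel stores the arguments in /proc/<pid>/cmdline separated by NUL bytes, which the terminal does not display. The following is a minimal Python sketch of how the file can be split back into arguments; the function names split_cmdline and read_cmdline are made up for this illustration.

```python
from pathlib import Path


def split_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    if not raw:
        return []  # e.g. kernel threads have an empty cmdline
    # Every argument is terminated by a NUL byte, so drop the final one
    # before splitting to avoid an empty trailing element.
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]


def read_cmdline(pid: int) -> list[str]:
    """Read and split the command line of a process visible from the host."""
    return split_cmdline(Path(f"/proc/{pid}/cmdline").read_bytes())


if __name__ == "__main__":
    # The same bytes that `cat /proc/16885/cmdline` displayed as "sleep300".
    print(split_cmdline(b"sleep\x00300\x00"))  # ['sleep', '300']
```

On the host, read_cmdline(16885) would return the same list for the container process from the example, assuming the PID is still alive.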

Everything a container wants to do goes through the syscall interface to the kernel. There are about 435 syscalls in Linux. Container runtimes such as CRI-O apply filters that reduce the number of available syscalls to about 300, but that is still a lot. According to Aqua Security, only 40 to 70 are needed. Here comes the tricky part: how can we filter syscalls and still have a working container?

Seccomp basics

Seccomp is a feature of the Linux kernel. You can check that it is enabled in the running kernel’s configuration:

grep CONFIG_SECCOMP= /boot/config-$(uname -r)

CONFIG_SECCOMP=y

Seccomp limits the syscalls a container is able to execute. This is done by applying profiles that define which syscalls are allowed and which are blocked.
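Whether a given process currently runs under a seccomp filter can be read from the "Seccomp:" field of /proc/<pid>/status, where 0 means disabled, 1 strict mode and 2 filter mode. The sketch below parses that field in Python; the helper name seccomp_mode is invented for this example, and the script inspects its own process rather than a container.

```python
from pathlib import Path

# Values of the "Seccomp:" field in /proc/<pid>/status, as documented in proc(5).
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode named in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # Match the exact field name so "Seccomp_filters:" is not picked up.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"  # field is absent, e.g. kernel built without seccomp


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # /proc is only available on Linux
        print(seccomp_mode(status.read_text()))
```

A process in a container started with a seccomp profile would be expected to report "filter" when its status file is read from the host.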

In my previous post I had a BusyBox Pod running on Kubernetes with the CRI-O runtime, where I can use the crictl tool to inspect the running container:

# crictl inspect f1f27e7eb0ed2

"io.kubernetes.cri-o.SeccompProfilePath": "runtime/default",   <= default profile

"seccomp": {   <= seccomp profile starts here
  "defaultAction": "SCMP_ACT_ERRNO",   <= (1)
  "architectures": [   <= (2)
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        ... (there are 314 syscalls in the default profile) ...
      ],
      "action": "SCMP_ACT_ALLOW"   <= (3)

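  1. "defaultAction": "SCMP_ACT_ERRNO" blocks every syscall that is not listed: the call fails with an error instead of being executed
  2. "architectures" lists the architectures the filter applies to; syscall numbers differ between architectures, so this is needed to map syscall names to IDs
  3. "action": "SCMP_ACT_ALLOW" is the action applied to the listed syscalls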
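How these three fields interact can be sketched in a few lines of Python. This is a simplified model, not how the kernel or the runtime implements filtering: it ignores per-argument conditions and architectures, and the function name action_for as well as the cut-down profile are made up for the example.

```python
def action_for(profile: dict, syscall: str) -> str:
    """Return the action a profile assigns to a syscall name.

    Simplified model: only syscall names are compared; "args" conditions
    and "architectures" are ignored.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule names this syscall, so the default action applies.
    return profile["defaultAction"]


if __name__ == "__main__":
    # Cut-down stand-in for the default profile; the real one lists 314 names.
    profile = {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": ["read", "write", "mkdir"], "action": "SCMP_ACT_ALLOW"}
        ],
    }
    print(action_for(profile, "mkdir"))   # listed, so SCMP_ACT_ALLOW
    print(action_for(profile, "reboot"))  # not listed, so SCMP_ACT_ERRNO
```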

The default profile can be found here. To use a custom profile in Kubernetes, you have to place the profile as a JSON file on every node of the cluster (by default under the kubelet’s seccomp directory, /var/lib/kubelet/seccomp). I used an Ansible script in my demo to do that.

There are two ways to go: a ‘Whitelist’ profile that lists all the syscalls we want to allow, or a ‘Blacklist’ profile that lists all the syscalls we want to deny.

SCMP_ACT_LOG allows the syscall to execute just as SCMP_ACT_ALLOW does, but also logs it, which makes it useful for tracing syscalls.
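Both kinds of profile share the same JSON structure and differ only in the default action and the action of the listed syscalls. The Python sketch below generates either kind; the function names whitelist_profile and blacklist_profile are invented for this illustration, and the architecture list is simply the one used elsewhere in this article.

```python
import json

ARCHITECTURES = ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"]


def whitelist_profile(allowed: list[str]) -> dict:
    """Deny everything by default and allow only the listed syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ARCHITECTURES,
        "syscalls": [{"names": sorted(allowed), "action": "SCMP_ACT_ALLOW"}],
    }


def blacklist_profile(denied: list[str], default_action: str = "SCMP_ACT_LOG") -> dict:
    """Let everything through (logged by default) except the listed syscalls."""
    return {
        "defaultAction": default_action,
        "architectures": ARCHITECTURES,
        "syscalls": [{"names": sorted(denied), "action": "SCMP_ACT_ERRNO"}],
    }


if __name__ == "__main__":
    # Prints a profile that logs every syscall and blocks only mkdir.
    print(json.dumps(blacklist_profile(["mkdir"]), indent=2))
```

Using SCMP_ACT_LOG as the default in the deny-list variant is a choice made here so that the allowed syscalls still show up in the logs; SCMP_ACT_ALLOW could be passed instead.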

Let’s start with a profile that does not allow creating a directory (Blacklist):

{
  "defaultAction": "SCMP_ACT_LOG",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "mkdir"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

I used Podman to test the above profile:

podman run -it --rm --security-opt seccomp=./mkdir.json docker.io/library/busybox sh

Now let’s test our profile inside the container:

/ # mkdir foo

mkdir: can't create directory 'foo': Operation not permitted

/ #

Not allowed, as expected. Let’s try the same in a Kubernetes cluster.

The busyboxsecc Pod has seccomp defined in its spec:

securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: mkdir.json
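For context, a complete Pod manifest around this fragment could look like the output of the Python sketch below. Only the securityContext part and the Pod name busyboxsecc come from this article; placing the securityContext at Pod level, the container command and the function name pod_with_seccomp are assumptions made for the example. The localhostProfile path is resolved relative to the kubelet’s seccomp directory on the node.

```python
import json


def pod_with_seccomp(name: str, image: str, profile: str) -> dict:
    """Build a Pod manifest that applies a node-local seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumption: Pod-level securityContext, which applies the
            # profile to every container in the Pod.
            "securityContext": {
                "seccompProfile": {
                    "type": "Localhost",
                    "localhostProfile": profile,
                }
            },
            "containers": [
                # Assumption: keep the container alive long enough to exec into it.
                {"name": name, "image": image, "command": ["sleep", "3600"]}
            ],
        },
    }


if __name__ == "__main__":
    manifest = pod_with_seccomp("busyboxsecc", "docker.io/library/busybox", "mkdir.json")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML manifests, the printed output could be piped to kubectl apply -f - on a cluster where mkdir.json is present on the nodes.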

Open a shell in BusyBox and try to create a directory:

kubectl exec -it busyboxsecc -- sh

Run mkdir in BusyBox:

/ # mkdir foo

mkdir: can't create directory 'foo': Operation not permitted

/ # 

The same result as above, as expected.

Tracing container syscalls

Tracing the syscalls a container makes at runtime can be tricky. Podman can generate seccomp profiles and makes tracing easier.

First we need to compile and install bcc (the BPF Compiler Collection).

After bcc we need to compile and install oci-seccomp-bpf-hook. (I’m using Ubuntu, so I have to do these steps myself; on Fedora, for example, oci-seccomp-bpf-hook is available as a package.)

After the above steps we need to set up the Podman hook: /usr/share/containers/oci/hooks.d/oci-seccomp-bpf-hook.json must contain the path to the compiled oci-seccomp-bpf-hook binary.

Now we can use Podman to generate a profile:

sudo podman run --annotation io.containers.trace-syscall=of:/tmp/ls.json docker.io/library/busybox ls

cat /tmp/ls.json | jq

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64"
  ],
  "syscalls": [
    {
      "names": [
        "arch_prctl",
        "brk",
        "capget",
        "capset",
        "chdir",
        "close",
        "epoll_ctl",
        "epoll_pwait",
        "execve",
        "exit_group",
        "fchown",
        "fcntl",
        "fstat",
        "fstatfs",
        "futex",
        "getdents64",
        "getpid",
        "getppid",
        "getuid",
        "ioctl",
        "nanosleep",
        "newfstatat",
        "open",
        "openat",
        "prctl",
        "read",
        "seccomp",
        "setgid",
        "setgroups",
        "setuid",
        "stat",
        "time",
        "write"
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": [],
      "comment": "",
      "includes": {},
      "excludes": {}
    }
  ]
}
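A traced profile only contains the syscalls that were made during that one run, so a real application would probably need several traces covering different code paths. The Python sketch below shows one possible way to combine such traces into a single allow list; merge_traced_profiles is a hypothetical helper, not part of Podman or oci-seccomp-bpf-hook, and the second trace in the example is invented.

```python
def merge_traced_profiles(profiles: list[dict]) -> dict:
    """Union the allowed syscalls of several traced profiles into one profile."""
    names: set[str] = set()
    architectures: list[str] = []
    for profile in profiles:
        for arch in profile.get("architectures", []):
            if arch not in architectures:  # keep first-seen order, no duplicates
                architectures.append(arch)
        for rule in profile.get("syscalls", []):
            if rule.get("action") == "SCMP_ACT_ALLOW":
                names.update(rule.get("names", []))
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": architectures,
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Two tiny stand-ins for traced profiles; the second one is made up.
    ls_trace = {
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": ["getdents64", "openat"], "action": "SCMP_ACT_ALLOW"}],
    }
    mkdir_trace = {
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": ["mkdir", "openat"], "action": "SCMP_ACT_ALLOW"}],
    }
    merged = merge_traced_profiles([ls_trace, mkdir_trace])
    print(merged["syscalls"][0]["names"])  # ['getdents64', 'mkdir', 'openat']
```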

Let’s try the above profile with Podman:

podman run --security-opt seccomp=/tmp/ls.json docker.io/library/busybox ls

bin

dev

etc

home

proc

root

run

sys

tmp

usr

var

Works as expected. Now let’s try to create a directory:

podman run --security-opt seccomp=/tmp/ls.json docker.io/library/busybox mkdir foo

mkdir: can't create directory 'foo': Operation not permitted

Not allowed, because mkdir is not among the syscalls in the generated profile.

In this article we have learned: 

  • How to create two types of seccomp profiles: Whitelist and Blacklist
  • How to test seccomp profiles with Podman
  • How to apply seccomp profiles to a Kubernetes cluster
  • How to generate seccomp profiles with Podman

References:

https://itnext.io/seccomp-in-kubernetes-part-i-7-things-you-should-know-before-you-even-start-97502ad6b6d6

https://itnext.io/seccomp-in-kubernetes-part-2-crafting-custom-seccomp-profiles-for-your-applications-c28c658f676e

https://itnext.io/seccomp-in-kubernetes-part-3-the-new-syntax-plus-some-advanced-topics-95dd3835263a

https://info.aquasec.com/container-security-book

https://podman.io/blogs/2019/10/15/generate-seccomp-profiles.html

Linux syscall ID numbers: https://filippo.io/linux-syscall-table/
