Securing containers in Kubernetes with Seccomp
The basic container threat model is that a container behaves badly: for some reason (application malfunction, an attacker, malicious code, a malicious administrator, etc.) the container tries to do something it is not supposed to do, for example access something it should not be able to access. All threats come down to container privileges: what can the container do, what can it access, and how? I’m not going into detail on how to deploy basic Kubernetes security controls such as namespaces, RBAC, Kubernetes secrets vault, Pod Security Policies, etc.; that’s another story. In this article I will focus on seccomp and how to use it to reduce container privileges and in that way mitigate possible threats.
First, some basic concepts. Let’s pull a Busybox test container image to the local store and run it:
$ podman pull busybox
$ podman images
REPOSITORY                 TAG     IMAGE ID      CREATED     SIZE
docker.io/library/busybox  latest  219ee5171f80  6 days ago  1.45 MB
$ podman run docker.io/library/busybox sleep 300
Now we have Busybox running for 5 minutes. Let’s check what the container looks like from the host’s perspective:
Find out the container’s PID (16885 in this example):
# ps -e -o pid,comm,cgroup
Find out our container’s command (arguments in /proc/<pid>/cmdline are separated by NUL bytes, so they print without spaces):
# cat /proc/16885/cmdline
sleep300
Containers are just processes running on a host, sharing the kernel with all other processes.
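The run-together `sleep300` output above can be turned back into separate arguments by splitting on NUL bytes. The following is a minimal Python sketch of that idea; the function names `parse_cmdline` and `read_cmdline` are made up for this illustration, and reading from /proc works on Linux only.

```python
from pathlib import Path


def parse_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    # Every argument is terminated by a NUL byte, so the split leaves an empty
    # trailing piece. Dropping empty pieces also drops genuinely empty
    # arguments, which is acceptable for this illustration.
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def read_cmdline(pid: int) -> list[str]:
    """Read and parse the command line of a running process (Linux only)."""
    return parse_cmdline(Path(f"/proc/{pid}/cmdline").read_bytes())


# The bytes below mirror what the kernel exposes for "sleep 300".
print(parse_cmdline(b"sleep\x00300\x00"))  # ['sleep', '300']
```

On the host, `read_cmdline(16885)` would return the same list for the container process shown above, assuming the container is still running under that PID.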
Everything a container wants to do goes through the syscall interface to the kernel. There are about 435 syscalls in Linux. Container runtimes such as CRI-O apply a default filter that reduces the number of allowed syscalls to about 300, but that is still a lot. According to Aqua Security, only 40 to 70 are needed. Here comes the tricky part: how can we filter syscalls and still have our container working?
Seccomp basics
Seccomp is a feature of the Linux kernel:
cat /boot/config-`uname -r` | grep CONFIG_SECCOMP=
CONFIG_SECCOMP=y
Seccomp limits the syscalls a container is able to execute. This is done by applying profiles that define which syscalls should be allowed or blocked.
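Whether a given process actually runs under a seccomp filter can be read from the `Seccomp:` field in /proc/<pid>/status, where 0 means disabled, 1 means strict mode and 2 means filter mode. A small Python sketch, with the helper name `seccomp_mode` chosen only for this example:

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode named in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly; newer kernels also print "Seccomp_filters:".
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1])
            return SECCOMP_MODES.get(value, "unknown")
    return "unknown"


# Sample text in the format of /proc/<pid>/status, used for illustration.
sample = "Name:\tsleep\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(sample))  # filter
```

For a real process the input would come from something like `open("/proc/16885/status").read()`; a container started with a profile is expected to report filter mode.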
In my previous post I had a Busybox Pod running on Kubernetes with the CRI-O runtime, where I can use the crictl tool to inspect the running container:
# crictl inspect f1f27e7eb0ed2
…
"io.kubernetes.cri-o.SeccompProfilePath": "runtime/default", <= default profile
…
"seccomp": { <= seccomp profile starts here
  "defaultAction": "SCMP_ACT_ERRNO", <= 1.
  "architectures": [ <= 2.
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "There are 314 syscalls in default profile"
      "action": "SCMP_ACT_ALLOW" <= 3.
1. "defaultAction": "SCMP_ACT_ERRNO" means that any syscall not matched by a rule is blocked and returns an error.
2. "architectures" lists the architectures the filter applies to, since syscall numbers differ between architectures.
3. "action": "SCMP_ACT_ALLOW" is the action applied to the listed syscalls.
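How the three fields interact can be sketched in a few lines of Python. This is a simplified model for illustration only: the real filtering is done in the kernel by a BPF program built from the profile, and this sketch ignores architectures and per-argument rules. The function name `resolve_action` and the shortened example profile are assumptions, not part of any tool.

```python
def resolve_action(profile: dict, syscall: str) -> str:
    """Return the action a profile assigns to a syscall name (simplified)."""
    for rule in profile.get("syscalls", []):
        # A rule applies when the syscall is in its "names" list.
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the profile-wide default applies.
    return profile["defaultAction"]


# Shortened stand-in for the default profile, which lists 314 syscalls.
example_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
}

print(resolve_action(example_profile, "read"))   # SCMP_ACT_ALLOW
print(resolve_action(example_profile, "mkdir"))  # SCMP_ACT_ERRNO
```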
The default profile can be found here. To use a custom profile in Kubernetes, you have to create the profile as a JSON file on every node in the cluster, in the kubelet’s seccomp directory (/var/lib/kubelet/seccomp by default). I used an Ansible script in my demo to do that.
There are two ways to go: a ‘Whitelist’ profile that lists all the syscalls we want to allow, or a ‘Blacklist’ profile that lists all the syscalls we want to deny.
SCMP_ACT_LOG allows the syscall to execute, like SCMP_ACT_ALLOW, but also logs it. That can be used for tracing syscalls.
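The two approaches differ only in which action is the default and which action is attached to the listed syscalls. A hedged Python sketch that builds both kinds of profile as plain dictionaries; the helper names `build_profile`, `allow_list` and `deny_list` are invented for this example, and `ptrace` is just an arbitrary syscall picked for the demonstration.

```python
import json

ARCHITECTURES = ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"]


def build_profile(default_action: str, names: list[str], action: str) -> dict:
    """Build a seccomp profile with one rule covering the given syscalls."""
    return {
        "defaultAction": default_action,
        "architectures": ARCHITECTURES,
        "syscalls": [{"names": sorted(names), "action": action}],
    }


def allow_list(names: list[str]) -> dict:
    """'Whitelist': deny everything except the listed syscalls."""
    return build_profile("SCMP_ACT_ERRNO", names, "SCMP_ACT_ALLOW")


def deny_list(names: list[str], default_action: str = "SCMP_ACT_LOG") -> dict:
    """'Blacklist': let everything through (logged here) except the listed syscalls."""
    return build_profile(default_action, names, "SCMP_ACT_ERRNO")


print(json.dumps(deny_list(["ptrace"]), indent=2))
```

Using SCMP_ACT_LOG as the default in the deny list keeps a trace of everything the container does while still blocking the listed calls; SCMP_ACT_ALLOW would be the quieter choice.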
Let’s start with a profile that does not allow creating a directory (Blacklist):
{
  "defaultAction": "SCMP_ACT_LOG",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": [
        "mkdir"
      ],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
I used Podman to test the above profile:
podman run -it --rm --security-opt seccomp=./mkdir.json docker.io/library/busybox sh
Now, let’s test our profile inside the container:
/ # mkdir foo
mkdir: can't create directory 'foo': Operation not permitted
/ #
Not allowed, as expected. Let’s try the above in a Kubernetes cluster.
The busyboxsecc Pod has the seccomp profile defined in its spec:
securityContext:
  seccompProfile:
    type: Localhost
    localhostProfile: mkdir.json
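For context, the fragment above sits inside a complete Pod manifest. The sketch below generates one as JSON, which kubectl accepts as well as YAML. Several details are assumptions rather than the author's actual manifest: the profile is set at Pod level (it can also be set per container), the container is named busybox, and it runs `sleep 3600` to stay alive.

```python
import json


def busybox_pod(name: str, profile: str) -> dict:
    """Build a Pod manifest that applies a node-local seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level security context: applies to all containers in the Pod.
            "securityContext": {
                "seccompProfile": {
                    "type": "Localhost",
                    # Path relative to the kubelet's seccomp directory.
                    "localhostProfile": profile,
                }
            },
            "containers": [
                {
                    "name": "busybox",  # assumed container name
                    "image": "docker.io/library/busybox",
                    "command": ["sleep", "3600"],  # assumed, keeps the Pod running
                }
            ],
        },
    }


manifest = busybox_pod("busyboxsecc", "mkdir.json")
print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed JSON could be applied with `kubectl apply -f <file>`, provided mkdir.json already exists on the node where the Pod is scheduled.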
Open a shell in busybox:
kubectl exec -it busyboxsecc -- sh
Try to create a directory in the busybox shell:
/ # mkdir foo
mkdir: can't create directory 'foo': Operation not permitted
/ #
Same result as above, as expected.
Tracing container syscalls
Tracing the syscalls a container makes at runtime can be a tricky task. Podman can generate seccomp profiles and make tracing easier.
First we need to compile and install bcc.
After bcc we need to compile and install oci-seccomp-bpf-hook (I’m using Ubuntu, so I have to do these steps myself; on Fedora, for example, oci-seccomp-bpf-hook is available from the default repositories).
After the above steps we need to set up the Podman hook configuration, /usr/share/containers/oci/hooks.d/oci-seccomp-bpf-hook.json, with the path to the compiled oci-seccomp-bpf-hook binary.
Now we can use Podman to generate a profile:
sudo podman run --annotation io.containers.trace-syscall=of:/tmp/ls.json docker.io/library/busybox ls
cat /tmp/ls.json | jq
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64"
  ],
  "syscalls": [
    {
      "names": [
        "arch_prctl",
        "brk",
        "capget",
        "capset",
        "chdir",
        "close",
        "epoll_ctl",
        "epoll_pwait",
        "execve",
        "exit_group",
        "fchown",
        "fcntl",
        "fstat",
        "fstatfs",
        "futex",
        "getdents64",
        "getpid",
        "getppid",
        "getuid",
        "ioctl",
        "nanosleep",
        "newfstatat",
        "open",
        "openat",
        "prctl",
        "read",
        "seccomp",
        "setgid",
        "setgroups",
        "setuid",
        "stat",
        "time",
        "write"
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": [],
      "comment": "",
      "includes": {},
      "excludes": {}
    }
  ]
}
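A generated profile is easier to review when summarised. The following Python sketch collects the explicitly allowed syscall names from a profile; `allowed_syscalls` and `load_profile` are names chosen for this example, and the inline `example` dictionary is a shortened stand-in rather than the full generated file.

```python
import json
from pathlib import Path


def allowed_syscalls(profile: dict) -> set[str]:
    """Collect every syscall name that the profile explicitly allows."""
    return {
        name
        for rule in profile.get("syscalls", [])
        if rule.get("action") == "SCMP_ACT_ALLOW"
        for name in rule.get("names", [])
    }


def load_profile(path: str) -> dict:
    """Load a seccomp profile from a JSON file such as /tmp/ls.json."""
    return json.loads(Path(path).read_text())


# Shortened stand-in for the generated profile, used only for illustration.
example = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["execve", "getdents64", "write"], "action": "SCMP_ACT_ALLOW"}
    ],
}

allowed = allowed_syscalls(example)
print(len(allowed), sorted(allowed))
print("mkdir" in allowed)  # False
```

Run against the real file with `allowed_syscalls(load_profile("/tmp/ls.json"))`, the listing above yields 33 names, and mkdir is not among them.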
Let’s try the above profile with Podman:
podman run --security-opt seccomp=/tmp/ls.json docker.io/library/busybox ls
bin
dev
etc
home
proc
root
run
sys
tmp
usr
var
Works as expected. Let’s try to create a directory:
podman run --security-opt seccomp=/tmp/ls.json docker.io/library/busybox mkdir foo
mkdir: can't create directory 'foo': Operation not permitted
Not allowed: mkdir is not in the generated profile, so the default action SCMP_ACT_ERRNO applies.
In this article we have learned:
- How to create two types of seccomp profiles: Whitelist and Blacklist
- How to test seccomp profiles with Podman
- How to apply seccomp profiles to a Kubernetes cluster
- How to generate seccomp profiles with Podman
References:
https://itnext.io/seccomp-in-kubernetes-part-3-the-new-syntax-plus-some-advanced-topics-95dd3835263a
https://info.aquasec.com/container-security-book
https://podman.io/blogs/2019/10/15/generate-seccomp-profiles.html
Linux syscall ID numbers: https://filippo.io/linux-syscall-table/