Sandboxing ImageMagick with nsjail

ImageMagick is the go-to image conversion library in many environments. It’s written in C and doesn’t have the best track record on security. Last year, a major vulnerability called ImageTragick (yes, there’s a logo) made the news. Even Facebook turned out to be vulnerable.

While secure alternatives exist, many existing projects have a hard dependency on ImageMagick and abstracting the image conversion can be quite involved. If you find yourself in a situation where you can’t avoid using ImageMagick, sandboxing can help you mitigate the damage in the event of a compromise.

Enter nsjail

nsjail, written by Google, calls itself “a light-weight process isolation tool.” It uses a number of Linux kernel features that allow users to isolate processes in dedicated namespaces, limit file system access, put constraints on their resource usage and filter syscalls.

My goals for this sandbox can be broken down like this:

No network access — all images are local, so there’s no reason for ImageMagick to talk to anyone over the network
Read-only access to the binaries, libraries and configuration files, and write access to the directory in which images live temporarily during conversion
Sane maximum execution times, somewhat limiting the impact of DoS attacks
Permit only a small subset of syscalls in order to reduce the overall attack surface (making it harder for attackers to escape the sandbox)

Most distributions don’t have nsjail packages yet, so we’ll need to build from source. We’ll start with the dependencies (assuming you’re on Debian or Ubuntu):

sudo apt install autoconf bison flex gcc g++ git libprotobuf-dev libtool make pkg-config protobuf-compiler

Next, we’ll clone the code and check out the latest release (that’s version 2.2 at the time of writing).

git clone https://github.com/google/nsjail.git
cd nsjail && git checkout 2.2

Building the project should be as simple as running make. This will produce an nsjail binary in the same directory which you can then move to /usr/local/bin.

Some of the kernel features used by nsjail weren’t added until kernel 4.6, so you might have to update your kernel or distribution. nsjail also relies on unprivileged user namespaces, which some distributions disable by default. Append the following line to /etc/sysctl.conf to enable them:

kernel.unprivileged_userns_clone=1

Load the configuration change by rebooting your machine or by running sudo sysctl -p.
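If you want to check both prerequisites before building, something like the following should do. This is a minimal sketch and not part of nsjail: the function names kernel_at_least and userns_status are made up for illustration, and on kernels that do not expose the unprivileged_userns_clone sysctl the second helper simply reports "unknown".

```shell
#!/bin/sh
# Sketch only: kernel_at_least and userns_status are hypothetical helper names.

# Succeeds if version $2 (e.g. "4.15.0-20-generic") is at least $1 (e.g. "4.6").
# Both arguments must contain at least major.minor; only those two are compared.
kernel_at_least() (
  want_major=${1%%.*}
  want_rest=${1#*.}
  have_major=${2%%.*}
  have_rest=${2#*.}
  # Keep only the leading digits of the minor component ("15.0-20" -> "15").
  want_minor=${want_rest%%[!0-9]*}
  have_minor=${have_rest%%[!0-9]*}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "${have_minor:-0}" -ge "${want_minor:-0}" ]; }
)

# Prints "enabled", "disabled" or "unknown" for the sysctl set above.
# $1 is the file to read; it defaults to the procfs location of the sysctl.
userns_status() (
  file=${1:-/proc/sys/kernel/unprivileged_userns_clone}
  if [ ! -r "$file" ]; then
    echo unknown   # this kernel does not expose the knob
  elif [ "$(cat "$file")" = 1 ]; then
    echo enabled
  else
    echo disabled
  fi
)

# Example:
#   kernel_at_least 4.6 "$(uname -r)" || echo "kernel too old for nsjail"
#   userns_status
```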

Policy Configuration

nsjail helpfully includes a sample configuration for ImageMagick’s convert binary. This offers a good starting point for what we need. Much of the configuration depends on how your application uses ImageMagick. In my case, the application is Mastodon, via the popular paperclip gem for managing file attachments. Paperclip uses ImageMagick by shelling out to the convert and identify binaries. That’s not a particularly clean way to use it, but it happens to make this task a bit easier.

You can skip to the end of this section if you just want a working nsjail configuration for Mastodon’s ImageMagick usage. If you’re running into issues with that configuration, reading this section will probably give you the tools you need for a fix.

Let’s start by looking at the sample configuration. The default values for things like time_limit seem good enough, so we can leave them mostly as-is. rlimit_nofile, the maximum number of open files, should be increased to slightly above the file limit set in your ImageMagick policy, since converting GIFs with many frames seems to generate a lot of temporary files. This is the policy.xml file I use with ImageMagick 7; the policy is another place where you can greatly reduce your attack surface.
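To keep rlimit_nofile in step with the ImageMagick policy, you could read the file limit out of policy.xml and add a small margin. The sketch below assumes the limit is written on one line as a policy element with domain="resource" and name="file", in that attribute order. The helper names and the default margin of 32 are my own assumptions, not anything nsjail or ImageMagick prescribes.

```shell
#!/bin/sh
# Sketch only: helper names and the default margin of 32 are assumptions.

# Prints the "file" resource limit from an ImageMagick policy.xml ($1).
# Lines containing "<!--" are skipped so commented-out examples on a single
# line are ignored; multi-line comments are not handled.
policy_file_limit() {
  sed -n -e '/<!--/d' \
    -e 's/.*<policy[^>]*domain="resource"[^>]*name="file"[^>]*value="\([0-9][0-9]*\)".*/\1/p' \
    "$1" | head -n 1
}

# Prints an rlimit_nofile line for the nsjail config: policy limit plus margin.
# $1 is the policy.xml path, $2 an optional margin.
suggest_rlimit_nofile() (
  limit=$(policy_file_limit "$1")
  margin=${2:-32}
  [ -n "$limit" ] || { echo "no file limit found in $1" >&2; exit 1; }
  echo "rlimit_nofile: $((limit + margin))"
)

# Example:
#   suggest_rlimit_nofile /etc/ImageMagick-7/policy.xml
```

If the policy sets no file limit at all, the second helper fails rather than guessing, which is a reasonable prompt to add one.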

Next are a couple of mount directives which provide access to the file system. Most of them are read-only (no rw: true) and permit access to the ImageMagick binary and shared libraries. I happen to use a compiled version of ImageMagick that’s located in /usr/local/bin rather than /usr/bin, so I’ll need a mount for that. We’ll also want to permit access to the ImageMagick configuration files which are located in either /etc/ImageMagick-6 or /etc/ImageMagick-7.

All of that leaves me with the following additional mount directives:

mount {
  src: "/usr/local/lib"
  dst: "/usr/local/lib"
  is_bind: true
  mandatory: false
}

mount {
  src: "/usr/local/bin/convert"
  dst: "/usr/local/bin/convert"
  is_bind: true
  mandatory: false
}

mount {
  src: "/etc/ImageMagick-6"
  dst: "/etc/ImageMagick-6"
  is_bind: true
  mandatory: false
}

mount {
  src: "/etc/ImageMagick-7"
  dst: "/etc/ImageMagick-7"
  is_bind: true
  mandatory: false
}

I also add mandatory: false to the existing /usr/bin/convert mount. That way, nsjail doesn’t throw an error if /usr/bin/convert doesn’t exist and I can go back and forth between compiled and packaged versions of ImageMagick without having to change the nsjail configuration.

Next, we’ll have to figure out where the files are stored while paperclip processes them. Paperclip helpfully logs every command it runs, so we can just grep for “Command” in our Rails logs and we’ll get something like this:

Command :: file -b --mime '/tmp/8d777f385d3dfec8815d20f7496026dc20171203-9975-dbjvvy.jpeg'
Command :: identify -format '%wx%h,%[exif:orientation]' '/tmp/8d777f385d3dfec8815d20f7496026dc20171203-9975-9mj1dj[0]' 2>/dev/null
Command :: identify -format %m '/tmp/8d777f385d3dfec8815d20f7496026dc20171203-9975-9mj1dj[0]'
Command :: convert '/tmp/8d777f385d3dfec8815d20f7496026dc20171203-9975-9mj1dj[0]' -auto-orient -resize "1280x1280>" -quality 90 -strip '/tmp/72dc008206075ad7e69b00a1e4f2544020171203-9975-1iywevw'

We can see that all the files are located in /tmp. We could continue by mounting /tmp within our sandbox, but this opens up a potential sandbox escape vector: Programs like Xorg create sockets within /tmp which could be used to run processes outside of the sandbox. (Thanks to Robert Święcki for pointing this out!)

What we’ll do instead is to instruct our application to use a different TMP directory that’s used exclusively by the application, and mount that. I’ll explain how in the next section, but for now add the following mount and change the TMP environment variable:

envar: "TMP=/home/mastodon/live/tmp/rails_tmp"

...

mount {
  src: "/home/mastodon/live/tmp/rails_tmp"
  dst: "/home/mastodon/live/tmp/rails_tmp"
  rw: true
  is_bind: true
}

While we’re at it, let’s remove the entire /Documents mount — we won’t be needing that.

The final section of the sample configuration is where we define our syscall filters. The sample configuration uses a blacklist approach which causes the process to be killed if it uses the ptrace, process_vm_readv or process_vm_writev syscalls. That’s better than nothing, but we can do better by using a whitelist of syscalls that we know ImageMagick needs, and killing the process if any other syscall is used.

Update: The previous paragraph is no longer accurate. nsjail’s sample configuration now includes a syscall policy based on the one we’re building in this post, with some improvements from the nsjail author. The improvements have been added to the policy linked at the end of this section.

Getting a list of the required syscalls is a bit involved. We can start by using strace -qcf followed by some of the commands we observed in our Rails log, using a couple of sample images in various formats. Our goal is to exercise all of the code paths ImageMagick will run in production, so make sure you use all the image formats and command variations you can find in your log. You might run something like:

strace -qcf convert '/tmp/input.png' -auto-orient -resize "1280x1280>" -quality 90 -strip '/tmp/output.png'

This will produce output similar to this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        43           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0        65        22 open
  0.00    0.000000           0        43           close
  0.00    0.000000           0        12         5 stat
  0.00    0.000000           0        51           fstat
  0.00    0.000000           0         9           lseek
  0.00    0.000000           0        76           mmap
  0.00    0.000000           0        58           mprotect
  0.00    0.000000           0         7           munmap
  0.00    0.000000           0         8           brk
  0.00    0.000000           0        11           rt_sigaction
  0.00    0.000000           0        19           rt_sigprocmask
  0.00    0.000000           0        31        29 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           getdents
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0        14           times
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   461        56 total

We’re interested in the syscall column, which gives us a first set of syscalls for our seccomp-bpf policy. Let’s change the policy to use DEFAULT KILL and insert the extracted (comma-separated) syscalls:

seccomp_string: "POLICY imagemagick_convert {"
seccomp_string: "  ALLOW {"
seccomp_string: "    read, write, open, close, newstat, newfstat,"
seccomp_string: "    ... more syscalls ..."
seccomp_string: "  }"
seccomp_string: "}"
seccomp_string: "USE imagemagick_convert DEFAULT KILL"

strace uses a slightly different naming convention for some syscalls, so we’ll need to convert those manually. nsjail uses the Kafel language for its syscall filtering specification, so we’ll use the source file containing all syscalls as a reference. stat is called newstat in Kafel, fstat is newfstat, etc.
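Copying names out of the strace summary by hand gets tedious once you have a handful of runs, so here is a rough sketch that turns a summary into a comma-separated list and applies the renames. strace_to_kafel is a hypothetical helper name, and the rename table only covers the three names that come up in this post (stat, fstat, lstat). Treat it as incomplete and check anything else against the Kafel syscall list.

```shell
#!/bin/sh
# Sketch only: strace_to_kafel is a hypothetical helper name. The rename table
# covers just the three names mentioned in this post and is not exhaustive.

# Reads an "strace -c" summary on stdin and prints a comma-separated list of
# syscall names, renamed where Kafel uses a different name.
strace_to_kafel() {
  awk '
    BEGIN {
      rename["stat"]  = "newstat"
      rename["fstat"] = "newfstat"
      rename["lstat"] = "newlstat"
    }
    # Data rows start with a percentage; this skips the header and separators.
    $1 ~ /^[0-9.]+$/ && $NF != "total" {
      name = ($NF in rename) ? rename[$NF] : $NF
      out = out (out == "" ? "" : ", ") name
    }
    END { print out }
  '
}

# Example (strace prints its summary to stderr, so -o writes it to a file):
#   strace -qcf -o summary.txt convert input.png -resize "1280x1280>" output.png
#   strace_to_kafel < summary.txt
```

The output can be pasted between the braces of the ALLOW block. Running it over the summaries of several different commands and merging the results gives a better starting set than any single run.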

Let’s store what we have so far in a file in /etc/nsjail/imagemagick-convert.cfg and see if we can successfully run convert within nsjail:

nsjail --config /etc/nsjail/imagemagick-convert.cfg -- /usr/bin/convert '/tmp/input.png' -auto-orient -resize "1280x1280>" -quality 90 -strip '/tmp/output.png'

If you missed a syscall, or if strace did (don’t ask me why - it happens), you’ll see something like this in the output:

[W][1047] subprocSeccompViolation():258 PID: 1048 commited a syscall/seccomp violation and exited with SIGSYS

To find the syscall that caused the violation, grep for SECCOMP in your syslog or audit log. That should produce a log line like this:

type=SECCOMP msg=audit(1512341279.874:80142): auid=1000 uid=1000 gid=1000 ses=3 pid=1048 comm="convert" exe="/usr/bin/convert" sig=31 arch=c000003e syscall=158 compat=0 ip=0x7fa87097dbb8 code=0x0

Now we know the missing syscall has the number 158, which we can translate back to arch_prctl using the Kafel source file from earlier.
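If you would rather not look the number up by hand every time, the two steps can be scripted roughly like this. It is a sketch under a few assumptions: the log lines look like the one shown above, you are on x86_64, and the kernel headers are installed at the path used as the default below (where Debian and Ubuntu put them, to my knowledge). Both helper names are made up.

```shell
#!/bin/sh
# Sketch only: helper names and the default header path are assumptions.

# Reads audit log lines on stdin and prints the syscall number of each
# SECCOMP event, one per line.
seccomp_syscall_numbers() {
  sed -n '/type=SECCOMP/s/.* syscall=\([0-9][0-9]*\).*/\1/p'
}

# Prints the kernel's name for syscall number $1, looked up in a unistd
# header ($2, optional) that contains lines like "#define __NR_read 0".
syscall_name() {
  awk -v nr="$1" '
    $1 == "#define" && $2 ~ /^__NR_/ && $3 == nr {
      sub(/^__NR_/, "", $2); print $2; exit
    }
  ' "${2:-/usr/include/x86_64-linux-gnu/asm/unistd_64.h}"
}

# Example:
#   grep SECCOMP /var/log/audit/audit.log | seccomp_syscall_numbers | sort -un
#   syscall_name 158
```

Keep in mind that the header uses the kernel’s names, so a result like stat still has to be translated to its Kafel name (newstat) before it goes into the policy.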

You’ll probably end up doing this a couple of times before the execution succeeds. This is the final syscall policy I ended up with:

seccomp_string: "POLICY imagemagick_convert {"
seccomp_string: "  ALLOW {"
seccomp_string: "    read, write, open, openat, close, newstat, newfstat,"
seccomp_string: "    newlstat, lseek, mmap, mprotect, munmap, brk,"
seccomp_string: "    rt_sigaction, rt_sigprocmask, pwrite64, access,"
seccomp_string: "    getpid, execve, getdents, unlink, fchmod,"
seccomp_string: "    getrlimit, getrusage, sysinfo, times, futex,"
seccomp_string: "    arch_prctl, sched_getaffinity, set_tid_address,"
seccomp_string: "    clock_gettime, set_robust_list, exit_group,"
seccomp_string: "    clone, getcwd, pread64, readlink, prlimit64, mremap"
seccomp_string: "  }"
seccomp_string: "}"
seccomp_string: "USE imagemagick_convert DEFAULT KILL"

The full configuration for the convert binary can be found here. The same gist also includes a configuration for the identify binary, for FFmpeg and file(1).

Caging the Elephant

We now have a working nsjail configuration, but there’s one thing left to do: Getting Mastodon to use it. This is where paperclip shelling out to ImageMagick works in our favor — we’ll just create our own convert command that runs ImageMagick within a sandbox. Let’s create /usr/local/bin/nsjail-wrapper/convert with the following content:

#!/usr/bin/env bash
# Run ImageMagick's convert inside the nsjail sandbox, passing all arguments through.
exec nsjail --quiet --config /etc/nsjail/imagemagick-convert.cfg -- /usr/bin/convert "$@"

Use chmod +x on the newly-created file and make sure to adjust the path from /usr/bin/convert if you use a compiled version of ImageMagick.

Next, we’ll need to get Mastodon to use this file rather than the one located in /usr/bin or /usr/local/bin. We do that by adding the following environment variable to the systemd services that run Mastodon:

Environment="PATH=/usr/local/bin/nsjail-wrapper:/usr/local/bin:/usr/bin:/bin"

Since our mount directives don’t include /tmp, we’ll also need to tell Rails (and thus paperclip) to use a different TMP path.

Environment="TMP=/home/mastodon/live/tmp/rails_tmp"

Make sure to actually create that directory and chown it to the mastodon user. Otherwise, Rails will quietly fall back to using /tmp, effectively breaking your attachment processing.
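Because the fallback to /tmp is silent, it is worth verifying the directory rather than trusting that it is there. A small sketch with two hypothetical helpers: one creates the directory and hands it to a user, the other checks that the directory exists and is writable by whoever runs the check.

```shell
#!/bin/sh
# Sketch only: prepare_tmp_dir and tmp_dir_ready are hypothetical helper names.

# Creates directory $1 (and any missing parents) and hands it to user $2.
# Changing ownership to another user requires root.
prepare_tmp_dir() {
  mkdir -p "$1" && chown "$2" "$1"
}

# Succeeds only if $1 is an existing directory the current user can write to.
# Run it as the mastodon user to catch the silent fallback to /tmp early.
tmp_dir_ready() {
  [ -d "$1" ] && [ -w "$1" ]
}

# Example:
#   prepare_tmp_dir /home/mastodon/live/tmp/rails_tmp mastodon      # as root
#   tmp_dir_ready /home/mastodon/live/tmp/rails_tmp || echo "TMP directory is not usable"
```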

If you’re following the default setup instructions for Mastodon, you’ll want to add the two Environment lines to both /etc/systemd/system/mastodon-sidekiq.service and /etc/systemd/system/mastodon-web.service.

Reload systemd, restart the two services and you’re done:

sudo systemctl daemon-reload
sudo systemctl restart mastodon-sidekiq
sudo systemctl restart mastodon-web

It’s a good idea to periodically check your syslog (or audit log) for the string “SECCOMP” after you deploy this, or to have monitoring alert you to a match. Certain versions or configurations of ImageMagick might use syscalls that aren’t included in my policy, or you might deal with images that trigger a code path I haven’t run into yet. Remember that a policy violation might also be due to a malicious file, so be careful when adjusting the policy.
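For the periodic check, a summary per binary and syscall number is usually more useful than raw log lines. The following is one way to get that, assuming the log lines have the same shape as the SECCOMP line shown earlier in this post. seccomp_summary is a made-up name, and the log path in the example depends on your system.

```shell
#!/bin/sh
# Sketch only: seccomp_summary is a hypothetical helper name.

# Reads log lines on stdin and prints "count comm syscall-number" for every
# distinct pair found in SECCOMP events, most frequent first.
seccomp_summary() {
  awk '
    /type=SECCOMP/ {
      comm = "?"; nr = "?"
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^comm=/)    { comm = $i; sub(/^comm=/, "", comm); gsub(/"/, "", comm) }
        if ($i ~ /^syscall=/) { nr = $i; sub(/^syscall=/, "", nr) }
      }
      count[comm " " nr]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k1,1nr -k2
}

# Example:
#   seccomp_summary < /var/log/audit/audit.log
```

An empty result means no violations were logged. A new pair showing up after an ImageMagick upgrade is more likely a policy gap, while a sudden burst tied to a single upload deserves a closer look at that file.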

PoC > GTFO

It’s probably a good idea to test if our sandbox is working as intended. To do that, we’ll use the Proof of Concept available for the ImageTragick vulnerability, with some small adjustments to make it work in our TMP path.

We’ll need to build a vulnerable version of ImageMagick first. I went with 6.8.5-10:

convert -version
Version: ImageMagick 6.8.5-10 2017-12-04 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2013 ImageMagick Studio LLC
Features: DPC OpenMP Modules
Delegates: mpeg fontconfig freetype jbig jng jpeg lzma png ps x xml zlib

The PoC code uses the identify binary, so we’ll need the nsjail configuration from the earlier gist and the corresponding wrapper script. Running the PoC without having /usr/local/bin/nsjail-wrapper in my path, I now get the following result:

./test.sh
testing read
UNSAFE

testing delete
UNSAFE

testing http with local port: 27279
SAFE

testing http with nonce: 46648d3b
SAFE

testing rce1
UNSAFE

testing rce2
UNSAFE

testing MSL
UNSAFE

Evidently we’re vulnerable to some parts of ImageTragick. Next, let’s add /usr/local/bin/nsjail-wrapper back to our path and try again:

./test.sh
testing read
SAFE

testing delete
SAFE

testing http with local port: 45326
SAFE

testing http with nonce: 0fce39e0
SAFE

testing rce1
SAFE

testing rce2
SAFE

testing MSL
SAFE

Looks like we successfully mitigated ImageTragick! Our logs show a bunch of lines like the following, where the exploit code tries to use the msync syscall and is subsequently killed:

type=SECCOMP msg=audit(1512348449.807:88531): auid=1000 uid=1000 gid=1000 ses=3 pid=5675 comm="identify" exe="/usr/local/bin/identify" sig=31 arch=c000003e syscall=26 compat=0 ip=0x7f6538695760 code=0x0

Performance Impact

Measuring the time it takes for a simple JPEG attachment to be processed and stored by paperclip, the average went from about 650 ms to 2100 ms. I suspect that most of the increase is due to paperclip shelling out to ImageMagick, which forces nsjail to build a new sandbox for every invocation. A daemon handling the conversion of many images would likely perform significantly better, perhaps even with no noticeable impact.
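To get a rough feel for the per-invocation overhead on your own machine, you can time the plain binary against the wrapper. Note that this is not how the numbers above were obtained: it only measures the command itself, not paperclip’s full processing and storage path. avg_ms is a hypothetical helper and relies on GNU date for nanosecond timestamps.

```shell
#!/bin/sh
# Sketch only: avg_ms is a hypothetical helper name. Requires GNU date (%N).

# Runs the command given after the first argument $1 times and prints the
# average wall-clock time per run in milliseconds (integer, rounded down).
avg_ms() (
  runs=$1
  shift
  start=$(date +%s%N)
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1
    i=$((i + 1))
  done
  end=$(date +%s%N)
  echo $(( (end - start) / runs / 1000000 ))
)

# Example (input.jpg and output.jpg are placeholders):
#   avg_ms 20 /usr/bin/convert input.jpg -resize "1280x1280>" output.jpg
#   avg_ms 20 /usr/local/bin/nsjail-wrapper/convert input.jpg -resize "1280x1280>" output.jpg
```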

There is definitely room for improvement here. Still, image conversion as a whole barely makes a dent in the overall CPU budget of this service, and media uploads aren’t an area where users will get too frustrated by having to wait an additional second, so it’s an acceptable trade-off.

Posted by Patrick Figel on Dec 04, 2017