Merge branch 'aosp/upstream-master' into master
Add the necessary Android third-party files.
Remove the git submodule, as submodules are unsupported in Android trees.
Bug: 122270019
Test: None
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..622b76d
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+*.o
+config.pb.h
+config.pb.cc
+nsjail
diff --git a/CONTRIBUTING b/CONTRIBUTING
new file mode 100644
index 0000000..1ba8539
--- /dev/null
+++ b/CONTRIBUTING
@@ -0,0 +1,24 @@
+Want to contribute? Great! First, read this page (including the small print at the end).
+
+### Before you contribute
+Before we can use your code, you must sign the
+[Google Individual Contributor License Agreement](https://developers.google.com/open-source/cla/individual?csw=1)
+(CLA), which you can do online. The CLA is necessary mainly because you own the
+copyright to your changes, even after your contribution becomes part of our
+codebase, so we need your permission to use and distribute your code. We also
+need to be sure of various other things—for instance that you'll tell us if you
+know that your code infringes on other people's patents. You don't have to sign
+the CLA until after you've submitted your code for review and a member has
+approved it, but you must do it before we can put your code into our codebase.
+Before you start working on a larger contribution, you should get in touch with
+us first through the issue tracker with your idea so that we can help out and
+possibly guide you. Coordinating up front makes it much easier to avoid
+frustration later on.
+
+### Code reviews
+All submissions, including submissions by project members, require review. We
+use GitHub pull requests for this purpose.
+
+### The small print
+Contributions made by corporations are covered by a different agreement than
+the one above, the Software Grant and Corporate Contributor License Agreement.
diff --git a/Dockerfile b/Dockerfile
new file mode 100644
index 0000000..5bd472a
--- /dev/null
+++ b/Dockerfile
@@ -0,0 +1,19 @@
+FROM ubuntu:16.04
+
+RUN apt-get -y update && apt-get install -y \
+ autoconf \
+ bison \
+ flex \
+ gcc \
+ g++ \
+ git \
+ libprotobuf-dev \
+ libtool \
+ make \
+ pkg-config \
+ protobuf-compiler \
+ && rm -rf /var/lib/apt/lists/*
+
+COPY . /nsjail
+
+RUN cd /nsjail && make && mv /nsjail/nsjail /bin && rm -rf -- /nsjail
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..d645695
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,202 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
diff --git a/METADATA b/METADATA
new file mode 100644
index 0000000..865cf83
--- /dev/null
+++ b/METADATA
@@ -0,0 +1,18 @@
+name: "nsjail"
+description:
+ "A light-weight process isolation tool, making use of Linux namespaces and "
+ "seccomp-bpf syscall filters (with help of the kafel bpf language)"
+
+third_party {
+ url {
+ type: HOMEPAGE
+ value: "http://nsjail.com"
+ }
+ url {
+ type: GIT
+ value: "https://github.com/google/nsjail"
+ }
+ version: "c7a313123b3dcb845ed3822b99ad9db69a6a82c8"
+ last_upgrade_date { year: 2019 month: 1 day: 10 }
+ license_type: NOTICE
+}
diff --git a/MODULE_LICENSE_APACHE2 b/MODULE_LICENSE_APACHE2
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/MODULE_LICENSE_APACHE2
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..e318820
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,125 @@
+#
+# nsjail - Makefile
+# -----------------------------------------
+#
+# Copyright 2014 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+PKG_CONFIG=$(shell which pkg-config)
+ifeq ($(PKG_CONFIG),)
+$(error "Install pkg-config to make it work")
+endif
+
+CC ?= gcc
+CXX ?= g++
+
+COMMON_FLAGS += -O2 -c \
+ -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 \
+ -fPIE \
+ -Wformat -Wformat-security -Wno-format-nonliteral \
+ -Wall -Wextra -Werror \
+ -Ikafel/include
+
+CXXFLAGS += $(COMMON_FLAGS) $(shell pkg-config --cflags protobuf) \
+ -std=c++11 -fno-exceptions -Wno-unused -Wno-unused-parameter
+LDFLAGS += -pie -Wl,-z,noexecstack -lpthread $(shell pkg-config --libs protobuf)
+
+BIN = nsjail
+LIBS = kafel/libkafel.a
+SRCS_CXX = caps.cc cgroup.cc cmdline.cc config.cc contain.cc cpu.cc logs.cc mnt.cc net.cc nsjail.cc pid.cc sandbox.cc subproc.cc uts.cc user.cc util.cc
+SRCS_PROTO = config.proto
+SRCS_PB_CXX = $(SRCS_PROTO:.proto=.pb.cc)
+SRCS_PB_H = $(SRCS_PROTO:.proto=.pb.h)
+SRCS_PB_O = $(SRCS_PROTO:.proto=.pb.o)
+OBJS = $(SRCS_CXX:.cc=.o) $(SRCS_PB_CXX:.cc=.o)
+
+ifdef DEBUG
+ CXXFLAGS += -g -ggdb -gdwarf-4
+endif
+
+USE_NL3 ?= yes
+ifeq ($(USE_NL3), yes)
+NL3_EXISTS := $(shell pkg-config --exists libnl-route-3.0 && echo yes)
+ifeq ($(NL3_EXISTS), yes)
+ CXXFLAGS += -DNSJAIL_NL3_WITH_MACVLAN $(shell pkg-config --cflags libnl-route-3.0)
+ LDFLAGS += $(shell pkg-config --libs libnl-route-3.0)
+endif
+endif
+
+.PHONY: all clean depend indent
+
+.cc.o: %.cc
+ $(CXX) $(CXXFLAGS) $< -o $@
+
+all: $(BIN)
+
+$(BIN): $(LIBS) $(OBJS)
+ifneq ($(NL3_EXISTS), yes)
+ $(warning "==========================================================")
+ $(warning "No support for libnl3/libnl-route-3; /sbin/ip will be used")
+ $(warning "==========================================================")
+endif
+ $(CXX) -o $(BIN) $(OBJS) $(LIBS) $(LDFLAGS)
+
+kafel/libkafel.a:
+ifeq ("$(wildcard kafel/Makefile)","")
+ git submodule update --init
+endif
+ $(MAKE) -C kafel
+
+# Sequence of proto deps, which doesn't fit automatic make rules
+config.o: $(SRCS_PB_O) $(SRCS_PB_H)
+$(SRCS_PB_O): $(SRCS_PB_CXX) $(SRCS_PB_H)
+$(SRCS_PB_CXX) $(SRCS_PB_H): $(SRCS_PROTO)
+ protoc --cpp_out=. $(SRCS_PROTO)
+
+.PHONY: clean
+clean:
+ $(RM) core Makefile.bak $(OBJS) $(SRCS_PB_CXX) $(SRCS_PB_H) $(BIN)
+ifneq ("$(wildcard kafel/Makefile)","")
+ $(MAKE) -C kafel clean
+endif
+
+.PHONY: depend
+depend: all
+ makedepend -Y -Ykafel/include -- -- $(SRCS_CXX) $(SRCS_PB_CXX)
+
+.PHONY: indent
+indent:
+ clang-format -style="{BasedOnStyle: google, IndentWidth: 8, UseTab: Always, IndentCaseLabels: false, ColumnLimit: 100, AlignAfterOpenBracket: false, AllowShortFunctionsOnASingleLine: false}" -i -sort-includes *.h $(SRCS_CXX)
+ clang-format -style="{BasedOnStyle: google, IndentWidth: 4, UseTab: Always, ColumnLimit: 100}" -i $(SRCS_PROTO)
+
+# DO NOT DELETE THIS LINE -- make depend depends on it.
+
+caps.o: caps.h nsjail.h logs.h macros.h util.h
+cgroup.o: cgroup.h nsjail.h logs.h util.h
+cmdline.o: cmdline.h nsjail.h caps.h config.h logs.h macros.h mnt.h user.h
+cmdline.o: util.h
+config.o: caps.h nsjail.h cmdline.h config.h config.pb.h logs.h macros.h
+config.o: mnt.h user.h util.h
+contain.o: contain.h nsjail.h caps.h cgroup.h cpu.h logs.h macros.h mnt.h
+contain.o: net.h pid.h user.h util.h uts.h
+cpu.o: cpu.h nsjail.h logs.h util.h
+logs.o: logs.h macros.h util.h nsjail.h
+mnt.o: mnt.h nsjail.h logs.h macros.h subproc.h util.h
+net.o: net.h nsjail.h logs.h subproc.h
+nsjail.o: nsjail.h cmdline.h logs.h macros.h net.h sandbox.h subproc.h util.h
+pid.o: pid.h nsjail.h logs.h subproc.h
+sandbox.o: sandbox.h nsjail.h kafel/include/kafel.h logs.h
+subproc.o: subproc.h nsjail.h cgroup.h contain.h logs.h macros.h net.h
+subproc.o: sandbox.h user.h util.h
+uts.o: uts.h nsjail.h logs.h
+user.o: user.h nsjail.h logs.h macros.h subproc.h util.h
+util.o: util.h nsjail.h logs.h macros.h
+config.pb.o: config.pb.h
diff --git a/NOTICE b/NOTICE
new file mode 120000
index 0000000..7a694c9
--- /dev/null
+++ b/NOTICE
@@ -0,0 +1 @@
+LICENSE
\ No newline at end of file
diff --git a/OWNERS b/OWNERS
new file mode 100644
index 0000000..e9105c2
--- /dev/null
+++ b/OWNERS
@@ -0,0 +1 @@
+dwillemsen@google.com
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1b83daf
--- /dev/null
+++ b/README.md
@@ -0,0 +1,539 @@
+- [Overview](#overview)
+- [What forms of isolation does it provide](#what-forms-of-isolation-does-it-provide)
+- Which use-cases are supported
+ * [Isolation of network services (inetd style)](#isolation-of-network-services-inetd-style)
+ * [Isolation with access to a private, cloned interface (requires root/setuid)](#isolation-with-access-to-a-private-cloned-interface-requires-rootsetuid)
+ * [Isolation of local processes](#isolation-of-local-processes)
+ * [Isolation of local processes (and re-running them, if necessary)](#isolation-of-local-processes-and-re-running-them-if-necessary)
+- Examples of use
+ * [Bash in a minimal file-system with uid==0 and access to /dev/urandom only](#bash-in-a-minimal-file-system-with-uid0-and-access-to-devurandom-only)
+ * [/usr/bin/find in a minimal file-system (only /usr/bin/find accessible from /usr/bin)](#usrbinfind-in-a-minimal-file-system-only-usrbinfind-accessible-from-usrbin)
+ * [Using /etc/subuid](#using-etcsubuid)
+ * [Even more constrained shell (with seccomp-bpf policies)](#even-more-contrained-shell-with-seccomp-bpf-policies)
+- [Configuration file](#configuration-file)
+- [More info](#more-info)
+- [Launching in Docker](#launching-in-docker)
+- [Contact](#contact)
+
+***
+This is NOT an official Google product.
+
+***
+
+### Overview
+NsJail is a process isolation tool for Linux. It uses the Linux namespace subsystem, resource limits, and the seccomp-bpf syscall filters of the Linux kernel.
+
+It can help you with (among other things):
+ * Isolating __networking services__ (e.g. web, time, DNS) from the rest of the OS
+ * Hosting computer security challenges (so-called __CTFs__)
+ * Containing invasive syscall-level OS __fuzzers__
+
+Features:
+ - [x] Offers three __distinct operational modes__. See [this section](#which-use-cases-are-supported) for more info.
+ - [x] Utilizes [kafel seccomp-bpf configuration language](https://github.com/google/kafel/) for __flexible syscall policy definitions__.
+ - [x] Uses an expressive, ProtoBuf-based [configuration file](#configuration-file)
+ - [x] It's __rock-solid__.
+
+***
+### What forms of isolation does it provide
+1. Linux __namespaces__: UTS (hostname), MOUNT (chroot), PID (separate PID tree), IPC, NET (separate networking context), USER, CGROUPS
+2. __FS constraints__: chroot(), pivot_root(), RO-remounting, custom ```/proc``` and ```tmpfs``` mount points
+3. __Resource limits__ (wall-time/CPU time limits, VM/mem address space limits, etc.)
+4. Programmable seccomp-bpf __syscall filters__ (through the [kafel language](https://github.com/google/kafel/))
+5. Cloned and isolated __Ethernet interfaces__
+6. __Cgroups__ for memory and PID utilization control
+
+***
+### Which use-cases are supported
+#### Isolation of network services (inetd style)
+
+_PS: You'll need to have a valid file-system tree in ```/chroot```. If you don't have it, change ```/chroot``` to ```/```_
+
++ Server:
+<pre>
+ $ ./nsjail -Ml --port 9000 --chroot /chroot/ --user 99999 --group 99999 -- /bin/sh -i
+</pre>
+
++ Client:
+<pre>
+ $ nc 127.0.0.1 9000
+ / $ ifconfig
+ / $ ifconfig -a
+ lo Link encap:Local Loopback
+ LOOPBACK MTU:65536 Metric:1
+ RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0
+ RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
+ / $ ps wuax
+ PID USER COMMAND
+ 1 99999 /bin/sh -i
+ 3 99999 {busybox} ps wuax
+ / $
+
+</pre>
+
+#### Isolation with access to a private, cloned interface (requires root/setuid)
+
+_PS: You'll need to have a valid file-system tree in ```/chroot```. If you don't have it, change ```/chroot``` to ```/```_
+
+<pre>
+$ sudo ./nsjail --user 9999 --group 9999 --macvlan_iface eth0 --chroot /chroot/ -Mo --macvlan_vs_ip 192.168.0.44 --macvlan_vs_nm 255.255.255.0 --macvlan_vs_gw 192.168.0.1 -- /bin/sh -i
+/ $ id
+uid=9999 gid=9999
+/ $ ip addr sh
+1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
+ link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
+ inet 127.0.0.1/8 scope host lo
+ valid_lft forever preferred_lft forever
+ inet6 ::1/128 scope host
+ valid_lft forever preferred_lft forever
+2: vs: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
+ link/ether ca:a2:69:21:33:66 brd ff:ff:ff:ff:ff:ff
+ inet 192.168.0.44/24 brd 192.168.0.255 scope global vs
+ valid_lft forever preferred_lft forever
+ inet6 fe80::c8a2:69ff:fe21:cd66/64 scope link
+ valid_lft forever preferred_lft forever
+/ $ nc 217.146.165.209 80
+GET / HTTP/1.0
+
+HTTP/1.0 302 Found
+Cache-Control: private
+Content-Type: text/html; charset=UTF-8
+Location: https://www.google.ch/?gfe_rd=cr&ei=cEzWVrG2CeTI8ge88ofwDA
+Content-Length: 258
+Date: Wed, 02 Mar 2016 02:14:08 GMT
+
+...
+...
+/ $
+</pre>
+
+#### Isolation of local processes
+
+_PS: You'll need to have a valid file-system tree in ```/chroot```. If you don't have it, change ```/chroot``` to ```/```_
+
+<pre>
+ $ ./nsjail -Mo --chroot /chroot/ --user 99999 --group 99999 -- /bin/sh -i
+ / $ ifconfig -a
+ lo Link encap:Local Loopback
+ LOOPBACK MTU:65536 Metric:1
+ RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0
+ RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
+ / $ id
+ uid=99999 gid=99999
+ / $ ps wuax
+ PID USER COMMAND
+ 1 99999 /bin/sh -i
+ 4 99999 {busybox} ps wuax
+ / $ exit
+ $
+</pre>
+
+#### Isolation of local processes (and re-running them, if necessary)
+
+_PS: You'll need to have a valid file-system tree in ```/chroot```. If you don't have it, change ```/chroot``` to ```/```_
+
+<pre>
+ $ ./nsjail -Mr --chroot /chroot/ --user 99999 --group 99999 -- /bin/sh -i
+ BusyBox v1.21.1 (Ubuntu 1:1.21.0-1ubuntu1) built-in shell (ash)
+ Enter 'help' for a list of built-in commands.
+ / $ ps wuax
+ PID USER COMMAND
+ 1 99999 /bin/sh -i
+ 2 99999 {busybox} ps wuax
+ / $ exit
+ BusyBox v1.21.1 (Ubuntu 1:1.21.0-1ubuntu1) built-in shell (ash)
+ Enter 'help' for a list of built-in commands.
+ / $ ps wuax
+ PID USER COMMAND
+ 1 99999 /bin/sh -i
+ 2 99999 {busybox} ps wuax
+ / $
+</pre>
+
+### Bash in a minimal file-system with uid==0 and access to /dev/urandom only
+
+<pre>
+$ ./nsjail -Mo --user 0 --group 99999 -R /bin/ -R /lib -R /lib64/ -R /usr/ -R /sbin/ -T /dev -R /dev/urandom --keep_caps -- /bin/bash -i
+[2017-05-24T17:08:02+0200] Mode: STANDALONE_ONCE
+[2017-05-24T17:08:02+0200] Jail parameters: hostname:'NSJAIL', chroot:'(null)', process:'/bin/bash', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:false, keep_caps:true, tmpfs_size:4194304, disable_no_new_privs:false, pivot_root_only:false
+[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/' type:'tmpfs' flags:MS_RDONLY|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/proc' type:'proc' flags:MS_RDONLY|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'/bin/' dst:'/bin/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'/lib' dst:'/lib' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'/lib64/' dst:'/lib64/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'/usr/' dst:'/usr/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'/sbin/' dst:'/sbin/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'none' dst:'/dev' type:'tmpfs' flags:0 options:'size=4194304' isDir:True
+[2017-05-24T17:08:02+0200] Mount point: src:'/dev/urandom' dst:'/dev/urandom' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:False
+[2017-05-24T17:08:02+0200] Uid map: inside_uid:0 outside_uid:69664
+[2017-05-24T17:08:02+0200] Gid map: inside_gid:99999 outside_gid:5000
+[2017-05-24T17:08:02+0200] Executing '/bin/bash' for '[STANDALONE_MODE]'
+bash: cannot set terminal process group (-1): Inappropriate ioctl for device
+bash: no job control in this shell
+bash-4.3# ls -l
+total 28
+drwxr-xr-x 2 65534 65534 4096 May 15 14:04 bin
+drwxrwxrwt 2 0 99999 60 May 24 15:08 dev
+drwxr-xr-x 28 65534 65534 4096 May 15 14:10 lib
+drwxr-xr-x 2 65534 65534 4096 May 15 13:56 lib64
+dr-xr-xr-x 391 65534 65534 0 May 24 15:08 proc
+drwxr-xr-x 2 65534 65534 12288 May 15 14:16 sbin
+drwxr-xr-x 17 65534 65534 4096 May 15 13:58 usr
+bash-4.3# id
+uid=0 gid=99999 groups=65534,99999
+bash-4.3# exit
+exit
+[2017-05-24T17:08:05+0200] PID: 129839 exited with status: 0, (PIDs left: 0)
+</pre>
+
+### /usr/bin/find in a minimal file-system (only /usr/bin/find accessible from /usr/bin)
+
+<pre>
+$ ./nsjail -Mo --user 99999 --group 99999 -R /lib/x86_64-linux-gnu/ -R /lib/x86_64-linux-gnu -R /lib64 -R /usr/bin/find -R /dev/urandom --keep_caps -- /usr/bin/find / | wc -l
+[2017-05-24T17:04:37+0200] Mode: STANDALONE_ONCE
+[2017-05-24T17:04:37+0200] Jail parameters: hostname:'NSJAIL', chroot:'(null)', process:'/usr/bin/find', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:false, keep_caps:true, tmpfs_size:4194304, disable_no_new_privs:false, pivot_root_only:false
+[2017-05-24T17:04:37+0200] Mount point: src:'none' dst:'/' type:'tmpfs' flags:MS_RDONLY|0 options:'' isDir:True
+[2017-05-24T17:04:37+0200] Mount point: src:'none' dst:'/proc' type:'proc' flags:MS_RDONLY|0 options:'' isDir:True
+[2017-05-24T17:04:37+0200] Mount point: src:'/lib/x86_64-linux-gnu/' dst:'/lib/x86_64-linux-gnu/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:04:37+0200] Mount point: src:'/lib/x86_64-linux-gnu' dst:'/lib/x86_64-linux-gnu' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:04:37+0200] Mount point: src:'/lib64' dst:'/lib64' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:04:37+0200] Mount point: src:'/usr/bin/find' dst:'/usr/bin/find' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:False
+[2017-05-24T17:04:37+0200] Mount point: src:'/dev/urandom' dst:'/dev/urandom' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:False
+[2017-05-24T17:04:37+0200] Uid map: inside_uid:99999 outside_uid:69664
+[2017-05-24T17:04:37+0200] Gid map: inside_gid:99999 outside_gid:5000
+[2017-05-24T17:04:37+0200] Executing '/usr/bin/find' for '[STANDALONE_MODE]'
+/usr/bin/find: `/proc/tty/driver': Permission denied
+2289
+[2017-05-24T17:04:37+0200] PID: 129525 exited with status: 1, (PIDs left: 0)
+</pre>
+
+### Using /etc/subuid
+
+<pre>
+$ tail -n1 /etc/subuid
+user:10000000:1
+$ ./nsjail -R /lib -R /lib64/ -R /usr/lib -R /usr/bin/ -R /usr/sbin/ -R /bin/ -R /sbin/ -R /dev/null -U 0:10000000:1 -u 0 -R /tmp/ -T /tmp/ -- /bin/ls -l /usr/
+[2017-05-24T17:12:31+0200] Mode: STANDALONE_ONCE
+[2017-05-24T17:12:31+0200] Jail parameters: hostname:'NSJAIL', chroot:'(null)', process:'/bin/ls', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:false, keep_caps:false, tmpfs_size:4194304, disable_no_new_privs:false, pivot_root_only:false
+[2017-05-24T17:12:31+0200] Mount point: src:'none' dst:'/' type:'tmpfs' flags:MS_RDONLY|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'none' dst:'/proc' type:'proc' flags:MS_RDONLY|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/lib' dst:'/lib' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/lib64/' dst:'/lib64/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/usr/lib' dst:'/usr/lib' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/usr/bin/' dst:'/usr/bin/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/usr/sbin/' dst:'/usr/sbin/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/bin/' dst:'/bin/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/sbin/' dst:'/sbin/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'/dev/null' dst:'/dev/null' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:False
+[2017-05-24T17:12:31+0200] Mount point: src:'/tmp/' dst:'/tmp/' type:'' flags:MS_RDONLY|MS_BIND|MS_REC|0 options:'' isDir:True
+[2017-05-24T17:12:31+0200] Mount point: src:'none' dst:'/tmp/' type:'tmpfs' flags:0 options:'size=4194304' isDir:True
+[2017-05-24T17:12:31+0200] Uid map: inside_uid:0 outside_uid:69664
+[2017-05-24T17:12:31+0200] Gid map: inside_gid:5000 outside_gid:5000
+[2017-05-24T17:12:31+0200] Newuid mapping: inside_uid:'0' outside_uid:'10000000' count:'1'
+[2017-05-24T17:12:31+0200] Executing '/bin/ls' for '[STANDALONE_MODE]'
+total 120
+drwxr-xr-x 5 65534 65534 77824 May 24 12:25 bin
+drwxr-xr-x 210 65534 65534 20480 May 22 16:11 lib
+drwxr-xr-x 4 65534 65534 20480 May 24 00:24 sbin
+[2017-05-24T17:12:31+0200] PID: 130841 exited with status: 0, (PIDs left: 0)
+</pre>
+
+### Even more constrained shell (with seccomp-bpf policies)
+
+<pre>
+$ ./nsjail --chroot / --seccomp_string 'ALLOW { write, execve, brk, access, mmap, open, openat, newfstat, close, read, mprotect, arch_prctl, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, newstat, clone, wait4, rt_sigreturn, exit_group } DEFAULT KILL' -- /bin/sh -i
+[2017-01-15T21:53:08+0100] Mode: STANDALONE_ONCE
+[2017-01-15T21:53:08+0100] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/sh', bind:[::]:0, max_conns_per_ip:0, uid:(ns:1000, global:1000), gid:(ns:1000, global:1000), time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clonew_newuts:true, clone_newcgroup:false, keep_caps:false, tmpfs_size:4194304, disable_no_new_privs:false, pivot_root_only:false
+[2017-01-15T21:53:08+0100] Mount point: src:'/' dst:'/' type:'' flags:0x5001 options:''
+[2017-01-15T21:53:08+0100] Mount point: src:'(null)' dst:'/proc' type:'proc' flags:0x0 options:''
+[2017-01-15T21:53:08+0100] PID: 18873 about to execute '/bin/sh' for [STANDALONE_MODE]
+/bin/sh: 0: can't access tty; job control turned off
+$ set
+IFS='
+'
+OPTIND='1'
+PATH='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
+PPID='0'
+PS1='$ '
+PS2='> '
+PS4='+ '
+PWD='/'
+$ id
+Bad system call
+$ exit
+[2017-01-15T21:53:17+0100] PID: 18873 exited with status: 159, (PIDs left: 0)
+</pre>
+
+***
+### Configuration file
+
+You will also find all examples in the [configs](https://github.com/google/nsjail/blob/master/configs) directory.
+
+***
+
+[config.proto](https://github.com/google/nsjail/blob/master/config.proto) contains the ProtoBuf schema for nsjail's configuration format.
+
+***
+
+You can examine an example config file in [configs/bash-with-fake-geteuid.cfg](https://github.com/google/nsjail/blob/master/configs/bash-with-fake-geteuid.cfg).
+
+Usage:
+<pre>
+$ ./nsjail --config configs/bash-with-fake-geteuid.cfg
+</pre>
+
+You can also override certain options with command-line options. Here, the executed binary (_/bin/bash_) is overridden with _/usr/bin/id_, yet options from _configs/bash-with-fake-geteuid.cfg_ still apply:
+<pre>
+$ ./nsjail --config configs/bash-with-fake-geteuid.cfg -- /usr/bin/id
+...
+[INSIDE-JAIL]: id
+uid=999999 gid=999998 euid=4294965959 groups=999998,65534
+[INSIDE-JAIL]: exit
+[2017-05-27T18:45:40+0200] PID: 16579 exited with status: 0, (PIDs left: 0)
+</pre>
+
+***
+
+You might also want to try using [configs/home-documents-with-xorg-no-net.cfg](https://github.com/google/nsjail/blob/master/configs/home-documents-with-xorg-no-net.cfg).
+
+<pre>
+$ ./nsjail --config configs/home-documents-with-xorg-no-net.cfg -- /usr/bin/evince /user/Documents/doc.pdf
+$ ./nsjail --config configs/home-documents-with-xorg-no-net.cfg -- /usr/bin/geeqie /user/Documents/
+$ ./nsjail --config configs/home-documents-with-xorg-no-net.cfg -- /usr/bin/gv /user/Documents/doc.pdf
+$ ./nsjail --config configs/home-documents-with-xorg-no-net.cfg -- /usr/bin/mupdf /user/Documents/doc.pdf
+</pre>
+
+***
+
+The [configs/firefox-with-net.cfg](https://github.com/google/nsjail/blob/master/configs/firefox-with-net.cfg)
+config file will allow you to run firefox inside a sandboxed environment:
+
+<pre>
+$ ./nsjail --config configs/firefox-with-net.cfg
+</pre>
+
+A more complex setup, which utilizes virtualized (cloned) Ethernet
+interfaces (to separate it from the main network namespace), can be
+found in [configs/firefox-with-cloned-net.cfg](https://github.com/google/nsjail/blob/master/configs/firefox-with-cloned-net.cfg).
+Remember to change relevant UIDs and Ethernet interface names before use.
+
+As using cloned Ethernet interfaces (MACVTAP) requires root privileges, you'll
+have to run it under sudo:
+
+<pre>
+$ sudo ./nsjail --config configs/firefox-with-cloned-net.cfg
+</pre>
+
+***
+### More info
+
+The command-line options should be self-explanatory, while the ProtoBuf config options are described in [config.proto](https://github.com/google/nsjail/blob/master/config.proto).
+
+<pre>
+./nsjail --help
+</pre>
+
+<pre>
+Usage: ./nsjail [options] -- path_to_command [args]
+Options:
+ --help|-h
+ Help plz..
+ --mode|-M VALUE
+ Execution mode (default: 'o' [MODE_STANDALONE_ONCE]):
+ l: Wait for connections on a TCP port (specified with --port) [MODE_LISTEN_TCP]
+ o: Launch a single process on the console using clone/execve [MODE_STANDALONE_ONCE]
+ e: Launch a single process on the console using execve [MODE_STANDALONE_EXECVE]
+ r: Launch a single process on the console with clone/execve, keep doing it forever [MODE_STANDALONE_RERUN]
+ --config|-C VALUE
+ Configuration file in the config.proto ProtoBuf format (see configs/ directory for examples)
+ --exec_file|-x VALUE
+ File to exec (default: argv[0])
+ --execute_fd
+ Use execveat() to execute a file-descriptor instead of executing the binary path. In such case argv[0]/exec_file denotes a file path before mount namespacing
+ --chroot|-c VALUE
+ Directory containing / of the jail (default: none)
+ --rw
+ Mount chroot dir (/) R/W (default: R/O)
+ --user|-u VALUE
+ Username/uid of processes inside the jail (default: your current uid). You can also use inside_ns_uid:outside_ns_uid:count convention here. Can be specified multiple times
+ --group|-g VALUE
+ Groupname/gid of processes inside the jail (default: your current gid). You can also use inside_ns_gid:global_ns_gid:count convention here. Can be specified multiple times
+ --hostname|-H VALUE
+ UTS name (hostname) of the jail (default: 'NSJAIL')
+ --cwd|-D VALUE
+ Directory in the namespace the process will run (default: '/')
+ --port|-p VALUE
+ TCP port to bind to (enables MODE_LISTEN_TCP) (default: 0)
+ --bindhost VALUE
+ IP address to bind the port to (only in [MODE_LISTEN_TCP]), (default: '::')
+ --max_conns_per_ip|-i VALUE
+ Maximum number of connections per one IP (only in [MODE_LISTEN_TCP]), (default: 0 (unlimited))
+ --log|-l VALUE
+ Log file (default: use log_fd)
+ --log_fd|-L VALUE
+ Log FD (default: 2)
+ --time_limit|-t VALUE
+ Maximum time that a jail can exist, in seconds (default: 600)
+ --max_cpus VALUE
+ Maximum number of CPUs a single jailed process can use (default: 0 'no limit')
+ --daemon|-d
+ Daemonize after start
+ --verbose|-v
+ Verbose output
+ --quiet|-q
+ Log warning and more important messages only
+ --really_quiet|-Q
+ Log fatal messages only
+ --keep_env|-e
+ Pass all environment variables to the child process (default: all envvars are cleared)
+ --env|-E VALUE
+ Additional environment variable (can be used multiple times)
+ --keep_caps
+ Don't drop any capabilities
+ --cap VALUE
+ Retain this capability, e.g. CAP_SYS_PTRACE (can be specified multiple times)
+ --silent
+ Redirect child process' fd:0/1/2 to /dev/null
+ --stderr_to_null
+ Redirect FD=2 (STDERR_FILENO) to /dev/null
+ --skip_setsid
+ Don't call setsid(), allows for terminal signal handling in the sandboxed process. Dangerous
+ --pass_fd VALUE
+ Don't close this FD before executing the child process (can be specified multiple times), by default: 0/1/2 are kept open
+ --disable_no_new_privs
+ Don't set the prctl(NO_NEW_PRIVS, 1) (DANGEROUS)
+ --rlimit_as VALUE
+ RLIMIT_AS in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 512)
+ --rlimit_core VALUE
+ RLIMIT_CORE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 0)
+ --rlimit_cpu VALUE
+ RLIMIT_CPU, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 600)
+ --rlimit_fsize VALUE
+ RLIMIT_FSIZE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 1)
+ --rlimit_nofile VALUE
+ RLIMIT_NOFILE, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 32)
+ --rlimit_nproc VALUE
+ RLIMIT_NPROC, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
+ --rlimit_stack VALUE
+ RLIMIT_STACK in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')
+ --persona_addr_compat_layout
+ personality(ADDR_COMPAT_LAYOUT)
+ --persona_mmap_page_zero
+ personality(MMAP_PAGE_ZERO)
+ --persona_read_implies_exec
+ personality(READ_IMPLIES_EXEC)
+ --persona_addr_limit_3gb
+ personality(ADDR_LIMIT_3GB)
+ --persona_addr_no_randomize
+ personality(ADDR_NO_RANDOMIZE)
+ --disable_clone_newnet|-N
+ Don't use CLONE_NEWNET. Enable global networking inside the jail
+ --disable_clone_newuser
+ Don't use CLONE_NEWUSER. Requires euid==0
+ --disable_clone_newns
+ Don't use CLONE_NEWNS
+ --disable_clone_newpid
+ Don't use CLONE_NEWPID
+ --disable_clone_newipc
+ Don't use CLONE_NEWIPC
+ --disable_clone_newuts
+ Don't use CLONE_NEWUTS
+ --disable_clone_newcgroup
+ Don't use CLONE_NEWCGROUP. Might be required for kernel versions < 4.6
+ --uid_mapping|-U VALUE
+ Add a custom uid mapping of the form inside_uid:outside_uid:count. Setting this requires newuidmap (set-uid) to be present
+ --gid_mapping|-G VALUE
+ Add a custom gid mapping of the form inside_gid:outside_gid:count. Setting this requires newgidmap (set-uid) to be present
+ --bindmount_ro|-R VALUE
+ List of mountpoints to be mounted --bind (ro) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'
+ --bindmount|-B VALUE
+ List of mountpoints to be mounted --bind (rw) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'
+ --tmpfsmount|-T VALUE
+ List of mountpoints to be mounted as tmpfs (R/W) inside the container. Can be specified multiple times. Supports 'dest' syntax. Alternatively, use '-m none:dest:tmpfs:size=8388608'
+ --mount|-m VALUE
+ Arbitrary mount, format src:dst:fs_type:options
+ --symlink|-s VALUE
+ Symlink, format src:dst
+ --disable_proc
+ Disable mounting procfs in the jail
+ --proc_path VALUE
+ Path used to mount procfs (default: '/proc')
+ --proc_rw
+ Is procfs mounted as R/W (default: R/O)
+ --seccomp_policy|-P VALUE
+ Path to file containing seccomp-bpf policy (see kafel/)
+ --seccomp_string VALUE
+ String with kafel seccomp-bpf policy (see kafel/)
+ --seccomp_log
+ Use SECCOMP_FILTER_FLAG_LOG. Log all actions except SECCOMP_RET_ALLOW. Supported since kernel version 4.14
+ --cgroup_mem_max VALUE
+ Maximum number of bytes to use in the group (default: '0' - disabled)
+ --cgroup_mem_mount VALUE
+ Location of memory cgroup FS (default: '/sys/fs/cgroup/memory')
+ --cgroup_mem_parent VALUE
+ Which pre-existing memory cgroup to use as a parent (default: 'NSJAIL')
+ --cgroup_pids_max VALUE
+ Maximum number of pids in a cgroup (default: '0' - disabled)
+ --cgroup_pids_mount VALUE
+ Location of pids cgroup FS (default: '/sys/fs/cgroup/pids')
+ --cgroup_pids_parent VALUE
+ Which pre-existing pids cgroup to use as a parent (default: 'NSJAIL')
+ --cgroup_net_cls_classid VALUE
+ Class identifier of network packets in the group (default: '0' - disabled)
+ --cgroup_net_cls_mount VALUE
+ Location of net_cls cgroup FS (default: '/sys/fs/cgroup/net_cls')
+ --cgroup_net_cls_parent VALUE
+ Which pre-existing net_cls cgroup to use as a parent (default: 'NSJAIL')
+ --cgroup_cpu_ms_per_sec VALUE
+ Number of milliseconds of CPU time per second that the process group can use (default: '0' - no limit)
+ --cgroup_cpu_mount VALUE
+ Location of cpu cgroup FS (default: '/sys/fs/cgroup/cpu')
+ --cgroup_cpu_parent VALUE
+ Which pre-existing cpu cgroup to use as a parent (default: 'NSJAIL')
+ --iface_no_lo
+ Don't bring the 'lo' interface up
+ --iface_own VALUE
+ Move this existing network interface into the new NET namespace. Can be specified multiple times
+ --macvlan_iface|-I VALUE
+ Interface which will be cloned (MACVLAN) and put inside the subprocess' namespace as 'vs'
+ --macvlan_vs_ip VALUE
+ IP of the 'vs' interface (e.g. "192.168.0.1")
+ --macvlan_vs_nm VALUE
+ Netmask of the 'vs' interface (e.g. "255.255.255.0")
+ --macvlan_vs_gw VALUE
+ Default GW for the 'vs' interface (e.g. "192.168.0.1")
+ --macvlan_vs_ma VALUE
+ MAC-address of the 'vs' interface (e.g. "ba:ad:ba:be:45:00")
+
+ Examples:
+ Wait on a port 31337 for connections, and run /bin/sh
+ nsjail -Ml --port 31337 --chroot / -- /bin/sh -i
+ Re-run echo command as a sub-process
+ nsjail -Mr --chroot / -- /bin/echo "ABC"
+ Run echo command once only, as a sub-process
+ nsjail -Mo --chroot / -- /bin/echo "ABC"
+ Execute echo command directly, without a supervising process
+ nsjail -Me --chroot / --disable_proc -- /bin/echo "ABC"
+</pre>
+
+***
+### Launching in Docker
+
+To launch nsjail in a Docker container, clone the repository and build the Docker image:
+<pre>
+docker build -t nsjailcontainer .
+</pre>
+
+This will build an image containing nsjail and kafel.
+
+From now on, you can either use it in another Dockerfile (`FROM nsjailcontainer`) or run it directly:
+<pre>
+docker run --privileged --rm -it nsjailcontainer nsjail --user 99999 --group 99999 --disable_proc --chroot / --time_limit 30 /bin/bash
+</pre>
+
+***
+### Contact
+
+ * User mailing list: [nsjail@googlegroups.com](mailto:nsjail@googlegroups.com), sign up with this [link](https://groups.google.com/forum/#!forum/nsjail)
diff --git a/caps.cc b/caps.cc
new file mode 100644
index 0000000..07785da
--- /dev/null
+++ b/caps.cc
@@ -0,0 +1,276 @@
+/*
+
+ nsjail - capability-related operations
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "caps.h"
+
+#include <linux/capability.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <string>
+
+#include "logs.h"
+#include "macros.h"
+#include "util.h"
+
+namespace caps {
+
+struct {
+ const int val;
+ const char* const name;
+} static const capNames[] = {
+ NS_VALSTR_STRUCT(CAP_CHOWN),
+ NS_VALSTR_STRUCT(CAP_DAC_OVERRIDE),
+ NS_VALSTR_STRUCT(CAP_DAC_READ_SEARCH),
+ NS_VALSTR_STRUCT(CAP_FOWNER),
+ NS_VALSTR_STRUCT(CAP_FSETID),
+ NS_VALSTR_STRUCT(CAP_KILL),
+ NS_VALSTR_STRUCT(CAP_SETGID),
+ NS_VALSTR_STRUCT(CAP_SETUID),
+ NS_VALSTR_STRUCT(CAP_SETPCAP),
+ NS_VALSTR_STRUCT(CAP_LINUX_IMMUTABLE),
+ NS_VALSTR_STRUCT(CAP_NET_BIND_SERVICE),
+ NS_VALSTR_STRUCT(CAP_NET_BROADCAST),
+ NS_VALSTR_STRUCT(CAP_NET_ADMIN),
+ NS_VALSTR_STRUCT(CAP_NET_RAW),
+ NS_VALSTR_STRUCT(CAP_IPC_LOCK),
+ NS_VALSTR_STRUCT(CAP_IPC_OWNER),
+ NS_VALSTR_STRUCT(CAP_SYS_MODULE),
+ NS_VALSTR_STRUCT(CAP_SYS_RAWIO),
+ NS_VALSTR_STRUCT(CAP_SYS_CHROOT),
+ NS_VALSTR_STRUCT(CAP_SYS_PTRACE),
+ NS_VALSTR_STRUCT(CAP_SYS_PACCT),
+ NS_VALSTR_STRUCT(CAP_SYS_ADMIN),
+ NS_VALSTR_STRUCT(CAP_SYS_BOOT),
+ NS_VALSTR_STRUCT(CAP_SYS_NICE),
+ NS_VALSTR_STRUCT(CAP_SYS_RESOURCE),
+ NS_VALSTR_STRUCT(CAP_SYS_TIME),
+ NS_VALSTR_STRUCT(CAP_SYS_TTY_CONFIG),
+ NS_VALSTR_STRUCT(CAP_MKNOD),
+ NS_VALSTR_STRUCT(CAP_LEASE),
+ NS_VALSTR_STRUCT(CAP_AUDIT_WRITE),
+ NS_VALSTR_STRUCT(CAP_AUDIT_CONTROL),
+ NS_VALSTR_STRUCT(CAP_SETFCAP),
+ NS_VALSTR_STRUCT(CAP_MAC_OVERRIDE),
+ NS_VALSTR_STRUCT(CAP_MAC_ADMIN),
+ NS_VALSTR_STRUCT(CAP_SYSLOG),
+ NS_VALSTR_STRUCT(CAP_WAKE_ALARM),
+ NS_VALSTR_STRUCT(CAP_BLOCK_SUSPEND),
+#if defined(CAP_AUDIT_READ)
+ NS_VALSTR_STRUCT(CAP_AUDIT_READ),
+#endif /* defined(CAP_AUDIT_READ) */
+};
+
+int nameToVal(const char* name) {
+ for (const auto& cap : capNames) {
+ if (strcmp(name, cap.name) == 0) {
+ return cap.val;
+ }
+ }
+ LOG_W("Unknown capability: '%s'", name);
+ return -1;
+}
+
+static const std::string capToStr(int val) {
+ for (const auto& cap : capNames) {
+ if (val == cap.val) {
+ return cap.name;
+ }
+ }
+
+ std::string res;
+ res.append("CAP_UNKNOWN(");
+ res.append(std::to_string(val));
+ res.append(")");
+ return res;
+}
+
+static cap_user_data_t getCaps() {
+ static __thread struct __user_cap_data_struct cap_data[_LINUX_CAPABILITY_U32S_3];
+ const struct __user_cap_header_struct cap_hdr = {
+ .version = _LINUX_CAPABILITY_VERSION_3,
+ .pid = 0,
+ };
+ if (syscall(__NR_capget, &cap_hdr, &cap_data) == -1) {
+ PLOG_W("capget() failed");
+ return NULL;
+ }
+ return cap_data;
+}
+
+static bool setCaps(const cap_user_data_t cap_data) {
+ const struct __user_cap_header_struct cap_hdr = {
+ .version = _LINUX_CAPABILITY_VERSION_3,
+ .pid = 0,
+ };
+ if (syscall(__NR_capset, &cap_hdr, cap_data) == -1) {
+ PLOG_W("capset() failed");
+ return false;
+ }
+ return true;
+}
+
+static void clearInheritable(cap_user_data_t cap_data) {
+ for (size_t i = 0; i < _LINUX_CAPABILITY_U32S_3; i++) {
+ cap_data[i].inheritable = 0U;
+ }
+}
+
+static bool getPermitted(cap_user_data_t cap_data, unsigned int cap) {
+ size_t off_byte = CAP_TO_INDEX(cap);
+ unsigned mask = CAP_TO_MASK(cap);
+ return cap_data[off_byte].permitted & mask;
+}
+
+static bool getEffective(cap_user_data_t cap_data, unsigned int cap) {
+ size_t off_byte = CAP_TO_INDEX(cap);
+ unsigned mask = CAP_TO_MASK(cap);
+ return cap_data[off_byte].effective & mask;
+}
+
+static bool getInheritable(cap_user_data_t cap_data, unsigned int cap) {
+ size_t off_byte = CAP_TO_INDEX(cap);
+ unsigned mask = CAP_TO_MASK(cap);
+ return cap_data[off_byte].inheritable & mask;
+}
+
+static void setInheritable(cap_user_data_t cap_data, unsigned int cap) {
+ size_t off_byte = CAP_TO_INDEX(cap);
+ unsigned mask = CAP_TO_MASK(cap);
+ cap_data[off_byte].inheritable |= mask;
+}
+
+#if !defined(PR_CAP_AMBIENT)
+#define PR_CAP_AMBIENT 47
+#define PR_CAP_AMBIENT_RAISE 2
+#define PR_CAP_AMBIENT_CLEAR_ALL 4
+#endif /* !defined(PR_CAP_AMBIENT) */
+static bool initNsKeepCaps(cap_user_data_t cap_data) {
+ /* Copy all permitted caps to the inheritable set */
+ std::string dbgmsg1;
+ for (const auto& i : capNames) {
+ if (getPermitted(cap_data, i.val)) {
+ util::StrAppend(&dbgmsg1, " %s", i.name);
+ setInheritable(cap_data, i.val);
+ }
+ }
+ LOG_D("Adding the following capabilities to the inheritable set:%s", dbgmsg1.c_str());
+
+ if (!setCaps(cap_data)) {
+ return false;
+ }
+
+ /* Make sure the inheritable set is preserved across execve via the ambient set */
+ std::string dbgmsg2;
+ for (const auto& i : capNames) {
+ if (!getPermitted(cap_data, i.val)) {
+ continue;
+ }
+ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, (unsigned long)i.val, 0UL, 0UL) ==
+ -1) {
+ PLOG_W("prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, %s)", i.name);
+ } else {
+ util::StrAppend(&dbgmsg2, " %s", i.name);
+ }
+ }
+ LOG_D("Added the following capabilities to the ambient set:%s", dbgmsg2.c_str());
+
+ return true;
+}
+
+bool initNs(nsjconf_t* nsjconf) {
+ cap_user_data_t cap_data = getCaps();
+ if (cap_data == NULL) {
+ return false;
+ }
+
+ /* Let's start with an empty inheritable set to avoid any mistakes */
+ clearInheritable(cap_data);
+ /*
+ * Remove all capabilities from the ambient set first. It works with newer kernel versions
+ * only, so don't panic() if it fails
+ */
+ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL, 0UL, 0UL, 0UL) == -1) {
+ PLOG_W("prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL)");
+ }
+
+ if (nsjconf->keep_caps) {
+ return initNsKeepCaps(cap_data);
+ }
+
+ /* Set all requested caps in the inheritable set if these are present in the permitted set
+ */
+ std::string dbgmsg;
+ for (const auto& cap : nsjconf->caps) {
+ if (!getPermitted(cap_data, cap)) {
+ LOG_W("Capability %s is not permitted in the namespace",
+ capToStr(cap).c_str());
+ return false;
+ }
+ dbgmsg.append(" ").append(capToStr(cap));
+ setInheritable(cap_data, cap);
+ }
+ LOG_D("Adding the following capabilities to the inheritable set:%s", dbgmsg.c_str());
+
+ if (!setCaps(cap_data)) {
+ return false;
+ }
+
+ /*
+ * Make sure all other caps (those which were not explicitly requested) are removed from the
+ * bounding set. We need to have CAP_SETPCAP to do that now
+ */
+ dbgmsg.clear();
+ if (getEffective(cap_data, CAP_SETPCAP)) {
+ for (const auto& i : capNames) {
+ if (getInheritable(cap_data, i.val)) {
+ continue;
+ }
+ dbgmsg.append(" ").append(i.name);
+ if (prctl(PR_CAPBSET_DROP, (unsigned long)i.val, 0UL, 0UL, 0UL) == -1) {
+ PLOG_W("prctl(PR_CAPBSET_DROP, %s)", i.name);
+ return false;
+ }
+ }
+ LOG_D(
+ "Dropped the following capabilities from the bounding set:%s", dbgmsg.c_str());
+ }
+
+ /* Make sure inheritable set is preserved across execve via the modified ambient set */
+ dbgmsg.clear();
+ for (const auto& cap : nsjconf->caps) {
+ if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, (unsigned long)cap, 0UL, 0UL) ==
+ -1) {
+ PLOG_W("prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, %s)",
+ capToStr(cap).c_str());
+ } else {
+ dbgmsg.append(" ").append(capToStr(cap));
+ }
+ }
+ LOG_D("Added the following capabilities to the ambient set:%s", dbgmsg.c_str());
+
+ return true;
+}
+
+} // namespace caps
diff --git a/caps.h b/caps.h
new file mode 100644
index 0000000..f189a6d
--- /dev/null
+++ b/caps.h
@@ -0,0 +1,37 @@
+/*
+
+ nsjail - capability-related operations
+ -----------------------------------------
+
+ Copyright 2017 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_CAPS_H
+#define NS_CAPS_H
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "nsjail.h"
+
+namespace caps {
+
+int nameToVal(const char* name);
+bool initNs(nsjconf_t* nsjconf);
+
+} // namespace caps
+
+#endif /* NS_CAPS_H */
diff --git a/cgroup.cc b/cgroup.cc
new file mode 100644
index 0000000..91a09ce
--- /dev/null
+++ b/cgroup.cc
@@ -0,0 +1,194 @@
+/*
+
+ nsjail - cgroup namespacing
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "cgroup.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <sstream>
+
+#include "logs.h"
+#include "util.h"
+
+namespace cgroup {
+
+static bool createCgroup(const std::string& cgroup_path, pid_t pid) {
+ LOG_D("Create '%s' for PID=%d", cgroup_path.c_str(), (int)pid);
+ if (mkdir(cgroup_path.c_str(), 0700) == -1 && errno != EEXIST) {
+ PLOG_W("mkdir('%s', 0700) failed", cgroup_path.c_str());
+ return false;
+ }
+
+ return true;
+}
+
+static bool writeToCgroup(
+ const std::string& cgroup_path, const std::string& value, const std::string& what) {
+ LOG_D("Setting '%s' to '%s'", cgroup_path.c_str(), value.c_str());
+ if (!util::writeBufToFile(
+ cgroup_path.c_str(), value.c_str(), value.length(), O_WRONLY | O_CLOEXEC)) {
+ LOG_W("Could not update %s", what.c_str());
+ return false;
+ }
+
+ return true;
+}
+
+static bool addPidToTaskList(const std::string& cgroup_path, pid_t pid) {
+ std::string pid_str = std::to_string(pid);
+ std::string tasks_path = cgroup_path + "/tasks";
+ LOG_D("Adding PID='%s' to '%s'", pid_str.c_str(), tasks_path.c_str());
+ return writeToCgroup(tasks_path, pid_str, "'" + tasks_path + "' task list");
+}
+
+static bool initNsFromParentMem(nsjconf_t* nsjconf, pid_t pid) {
+ if (nsjconf->cgroup_mem_max == (size_t)0) {
+ return true;
+ }
+
+ std::string mem_cgroup_path = nsjconf->cgroup_mem_mount + '/' + nsjconf->cgroup_mem_parent +
+ "/NSJAIL." + std::to_string(pid);
+ RETURN_ON_FAILURE(createCgroup(mem_cgroup_path, pid));
+
+ std::string mem_max_str = std::to_string(nsjconf->cgroup_mem_max);
+ RETURN_ON_FAILURE(writeToCgroup(
+ mem_cgroup_path + "/memory.limit_in_bytes", mem_max_str, "memory cgroup max limit"));
+
+ /*
+ * Use OOM-killer instead of making processes hang/sleep
+ */
+ RETURN_ON_FAILURE(writeToCgroup(
+ mem_cgroup_path + "/memory.oom_control", "0", "memory cgroup oom control"));
+
+ return addPidToTaskList(mem_cgroup_path, pid);
+}
+
+static bool initNsFromParentPids(nsjconf_t* nsjconf, pid_t pid) {
+ if (nsjconf->cgroup_pids_max == 0U) {
+ return true;
+ }
+
+ std::string pids_cgroup_path = nsjconf->cgroup_pids_mount + '/' +
+ nsjconf->cgroup_pids_parent + "/NSJAIL." +
+ std::to_string(pid);
+ RETURN_ON_FAILURE(createCgroup(pids_cgroup_path, pid));
+
+ std::string pids_max_str = std::to_string(nsjconf->cgroup_pids_max);
+ RETURN_ON_FAILURE(
+ writeToCgroup(pids_cgroup_path + "/pids.max", pids_max_str, "pids cgroup max limit"));
+
+ return addPidToTaskList(pids_cgroup_path, pid);
+}
+
+static bool initNsFromParentNetCls(nsjconf_t* nsjconf, pid_t pid) {
+ if (nsjconf->cgroup_net_cls_classid == 0U) {
+ return true;
+ }
+
+ std::string net_cls_cgroup_path = nsjconf->cgroup_net_cls_mount + '/' +
+ nsjconf->cgroup_net_cls_parent + "/NSJAIL." +
+ std::to_string(pid);
+ RETURN_ON_FAILURE(createCgroup(net_cls_cgroup_path, pid));
+
+ std::string net_cls_classid_str;
+ {
+ std::stringstream ss;
+ ss << "0x" << std::hex << nsjconf->cgroup_net_cls_classid;
+ net_cls_classid_str = ss.str();
+ }
+ RETURN_ON_FAILURE(writeToCgroup(net_cls_cgroup_path + "/net_cls.classid",
+ net_cls_classid_str, "net_cls cgroup classid"));
+
+ return addPidToTaskList(net_cls_cgroup_path, pid);
+}
+
+static bool initNsFromParentCpu(nsjconf_t* nsjconf, pid_t pid) {
+ if (nsjconf->cgroup_cpu_ms_per_sec == 0U) {
+ return true;
+ }
+
+ std::string cpu_cgroup_path = nsjconf->cgroup_cpu_mount + '/' + nsjconf->cgroup_cpu_parent +
+ "/NSJAIL." + std::to_string(pid);
+ RETURN_ON_FAILURE(createCgroup(cpu_cgroup_path, pid));
+
+ std::string cpu_ms_per_sec_str = std::to_string(nsjconf->cgroup_cpu_ms_per_sec * 1000U);
+ RETURN_ON_FAILURE(
+ writeToCgroup(cpu_cgroup_path + "/cpu.cfs_quota_us", cpu_ms_per_sec_str, "cpu quota"));
+
+ RETURN_ON_FAILURE(
+ writeToCgroup(cpu_cgroup_path + "/cpu.cfs_period_us", "1000000", "cpu period"));
+
+ return addPidToTaskList(cpu_cgroup_path, pid);
+}
+
+bool initNsFromParent(nsjconf_t* nsjconf, pid_t pid) {
+ RETURN_ON_FAILURE(initNsFromParentMem(nsjconf, pid));
+ RETURN_ON_FAILURE(initNsFromParentPids(nsjconf, pid));
+ RETURN_ON_FAILURE(initNsFromParentNetCls(nsjconf, pid));
+ return initNsFromParentCpu(nsjconf, pid);
+}
+
+static void removeCgroup(const std::string& cgroup_path) {
+ LOG_D("Remove '%s'", cgroup_path.c_str());
+ if (rmdir(cgroup_path.c_str()) == -1) {
+ PLOG_W("rmdir('%s') failed", cgroup_path.c_str());
+ }
+}
+
+void finishFromParent(nsjconf_t* nsjconf, pid_t pid) {
+ if (nsjconf->cgroup_mem_max != (size_t)0) {
+ std::string mem_cgroup_path = nsjconf->cgroup_mem_mount + '/' +
+ nsjconf->cgroup_mem_parent + "/NSJAIL." +
+ std::to_string(pid);
+ removeCgroup(mem_cgroup_path);
+ }
+ if (nsjconf->cgroup_pids_max != 0U) {
+ std::string pids_cgroup_path = nsjconf->cgroup_pids_mount + '/' +
+ nsjconf->cgroup_pids_parent + "/NSJAIL." +
+ std::to_string(pid);
+ removeCgroup(pids_cgroup_path);
+ }
+ if (nsjconf->cgroup_net_cls_classid != 0U) {
+ std::string net_cls_cgroup_path = nsjconf->cgroup_net_cls_mount + '/' +
+ nsjconf->cgroup_net_cls_parent + "/NSJAIL." +
+ std::to_string(pid);
+ removeCgroup(net_cls_cgroup_path);
+ }
+ if (nsjconf->cgroup_cpu_ms_per_sec != 0U) {
+ std::string cpu_cgroup_path = nsjconf->cgroup_cpu_mount + '/' +
+ nsjconf->cgroup_cpu_parent + "/NSJAIL." +
+ std::to_string(pid);
+ removeCgroup(cpu_cgroup_path);
+ }
+}
+
+bool initNs(void) {
+ return true;
+}
+
+} // namespace cgroup
diff --git a/cgroup.h b/cgroup.h
new file mode 100644
index 0000000..e241d23
--- /dev/null
+++ b/cgroup.h
@@ -0,0 +1,38 @@
+/*
+
+ nsjail - cgroup namespacing
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_CGROUP_H
+#define NS_CGROUP_H
+
+#include <stdbool.h>
+#include <stddef.h>
+
+#include "nsjail.h"
+
+namespace cgroup {
+
+bool initNsFromParent(nsjconf_t* nsjconf, pid_t pid);
+bool initNs(void);
+void finishFromParent(nsjconf_t* nsjconf, pid_t pid);
+
+} // namespace cgroup
+
+#endif /* NS_CGROUP_H */
diff --git a/cmdline.cc b/cmdline.cc
new file mode 100644
index 0000000..4347e9a
--- /dev/null
+++ b/cmdline.cc
@@ -0,0 +1,846 @@
+/*
+
+ nsjail - cmdline parsing
+
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "cmdline.h"
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <grp.h>
+#include <limits.h>
+#include <pwd.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <strings.h>
+#include <sys/mount.h>
+#include <sys/personality.h>
+#include <sys/resource.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "caps.h"
+#include "config.h"
+#include "logs.h"
+#include "macros.h"
+#include "mnt.h"
+#include "user.h"
+#include "util.h"
+
+namespace cmdline {
+
+#define _LOG_DEFAULT_FILE "/var/log/nsjail.log"
+
+struct custom_option {
+ struct option opt;
+ const char* descr;
+};
+
+// clang-format off
+struct custom_option custom_opts[] = {
+ { { "help", no_argument, NULL, 'h' }, "Help plz.." },
+ { { "mode", required_argument, NULL, 'M' },
+ "Execution mode (default: 'o' [MODE_STANDALONE_ONCE]):\n"
+ "\tl: Wait for connections on a TCP port (specified with --port) [MODE_LISTEN_TCP]\n"
+ "\to: Launch a single process on the console using clone/execve [MODE_STANDALONE_ONCE]\n"
+ "\te: Launch a single process on the console using execve [MODE_STANDALONE_EXECVE]\n"
+ "\tr: Launch a single process on the console with clone/execve, keep doing it forever [MODE_STANDALONE_RERUN]" },
+ { { "config", required_argument, NULL, 'C' }, "Configuration file in the config.proto ProtoBuf format (see configs/ directory for examples)" },
+ { { "exec_file", required_argument, NULL, 'x' }, "File to exec (default: argv[0])" },
+ { { "execute_fd", no_argument, NULL, 0x0607 }, "Use execveat() to execute a file-descriptor instead of executing the binary path. In such a case argv[0]/exec_file denotes a file path before mount namespacing" },
+ { { "chroot", required_argument, NULL, 'c' }, "Directory containing / of the jail (default: none)" },
+ { { "rw", no_argument, NULL, 0x601 }, "Mount chroot dir (/) R/W (default: R/O)" },
+ { { "user", required_argument, NULL, 'u' }, "Username/uid of processes inside the jail (default: your current uid). You can also use inside_ns_uid:outside_ns_uid:count convention here. Can be specified multiple times" },
+ { { "group", required_argument, NULL, 'g' }, "Groupname/gid of processes inside the jail (default: your current gid). You can also use inside_ns_gid:outside_ns_gid:count convention here. Can be specified multiple times" },
+ { { "hostname", required_argument, NULL, 'H' }, "UTS name (hostname) of the jail (default: 'NSJAIL')" },
+ { { "cwd", required_argument, NULL, 'D' }, "Directory in the namespace the process will run (default: '/')" },
+ { { "port", required_argument, NULL, 'p' }, "TCP port to bind to (enables MODE_LISTEN_TCP) (default: 0)" },
+ { { "bindhost", required_argument, NULL, 0x604 }, "IP address to bind the port to (only in [MODE_LISTEN_TCP]), (default: '::')" },
+ { { "max_conns_per_ip", required_argument, NULL, 'i' }, "Maximum number of connections per one IP (only in [MODE_LISTEN_TCP]), (default: 0 (unlimited))" },
+ { { "log", required_argument, NULL, 'l' }, "Log file (default: use log_fd)" },
+ { { "log_fd", required_argument, NULL, 'L' }, "Log FD (default: 2)" },
+ { { "time_limit", required_argument, NULL, 't' }, "Maximum time that a jail can exist, in seconds (default: 600)" },
+ { { "max_cpus", required_argument, NULL, 0x508 }, "Maximum number of CPUs a single jailed process can use (default: 0 'no limit')" },
+ { { "daemon", no_argument, NULL, 'd' }, "Daemonize after start" },
+ { { "verbose", no_argument, NULL, 'v' }, "Verbose output" },
+ { { "quiet", no_argument, NULL, 'q' }, "Log warning and more important messages only" },
+ { { "really_quiet", no_argument, NULL, 'Q' }, "Log fatal messages only" },
+ { { "keep_env", no_argument, NULL, 'e' }, "Pass all environment variables to the child process (default: all envvars are cleared)" },
+ { { "env", required_argument, NULL, 'E' }, "Additional environment variable (can be used multiple times). If the envvar doesn't contain '=' (e.g. just the 'DISPLAY' string), the current envvar value will be used" },
+ { { "keep_caps", no_argument, NULL, 0x0501 }, "Don't drop any capabilities" },
+ { { "cap", required_argument, NULL, 0x0509 }, "Retain this capability, e.g. CAP_PTRACE (can be specified multiple times)" },
+ { { "silent", no_argument, NULL, 0x0502 }, "Redirect child process' fd:0/1/2 to /dev/null" },
+ { { "stderr_to_null", no_argument, NULL, 0x0503 }, "Redirect child process' fd:2 (STDERR_FILENO) to /dev/null" },
+ { { "skip_setsid", no_argument, NULL, 0x0504 }, "Don't call setsid(), allows for terminal signal handling in the sandboxed process. Dangerous" },
+ { { "pass_fd", required_argument, NULL, 0x0505 }, "Don't close this FD before executing the child process (can be specified multiple times), by default: 0/1/2 are kept open" },
+ { { "disable_no_new_privs", no_argument, NULL, 0x0507 }, "Don't set the prctl(NO_NEW_PRIVS, 1) (DANGEROUS)" },
+ { { "rlimit_as", required_argument, NULL, 0x0201 }, "RLIMIT_AS in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 512)" },
+ { { "rlimit_core", required_argument, NULL, 0x0202 }, "RLIMIT_CORE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 0)" },
+ { { "rlimit_cpu", required_argument, NULL, 0x0203 }, "RLIMIT_CPU, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 600)" },
+ { { "rlimit_fsize", required_argument, NULL, 0x0204 }, "RLIMIT_FSIZE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 1)" },
+ { { "rlimit_nofile", required_argument, NULL, 0x0205 }, "RLIMIT_NOFILE, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 32)" },
+ { { "rlimit_nproc", required_argument, NULL, 0x0206 }, "RLIMIT_NPROC, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')" },
+ { { "rlimit_stack", required_argument, NULL, 0x0207 }, "RLIMIT_STACK in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM64_INFINITY (default: 'soft')" },
+ { { "persona_addr_compat_layout", no_argument, NULL, 0x0301 }, "personality(ADDR_COMPAT_LAYOUT)" },
+ { { "persona_mmap_page_zero", no_argument, NULL, 0x0302 }, "personality(MMAP_PAGE_ZERO)" },
+ { { "persona_read_implies_exec", no_argument, NULL, 0x0303 }, "personality(READ_IMPLIES_EXEC)" },
+ { { "persona_addr_limit_3gb", no_argument, NULL, 0x0304 }, "personality(ADDR_LIMIT_3GB)" },
+ { { "persona_addr_no_randomize", no_argument, NULL, 0x0305 }, "personality(ADDR_NO_RANDOMIZE)" },
+ { { "disable_clone_newnet", no_argument, NULL, 'N' }, "Don't use CLONE_NEWNET. Enable global networking inside the jail" },
+ { { "disable_clone_newuser", no_argument, NULL, 0x0402 }, "Don't use CLONE_NEWUSER. Requires euid==0" },
+ { { "disable_clone_newns", no_argument, NULL, 0x0403 }, "Don't use CLONE_NEWNS" },
+ { { "disable_clone_newpid", no_argument, NULL, 0x0404 }, "Don't use CLONE_NEWPID" },
+ { { "disable_clone_newipc", no_argument, NULL, 0x0405 }, "Don't use CLONE_NEWIPC" },
+ { { "disable_clone_newuts", no_argument, NULL, 0x0406 }, "Don't use CLONE_NEWUTS" },
+ { { "disable_clone_newcgroup", no_argument, NULL, 0x0407 }, "Don't use CLONE_NEWCGROUP. Might be required for kernel versions < 4.6" },
+ { { "uid_mapping", required_argument, NULL, 'U' }, "Add a custom uid mapping of the form inside_uid:outside_uid:count. Setting this requires newuidmap (set-uid) to be present" },
+ { { "gid_mapping", required_argument, NULL, 'G' }, "Add a custom gid mapping of the form inside_gid:outside_gid:count. Setting this requires newgidmap (set-uid) to be present" },
+ { { "bindmount_ro", required_argument, NULL, 'R' }, "List of mountpoints to be mounted --bind (ro) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'" },
+ { { "bindmount", required_argument, NULL, 'B' }, "List of mountpoints to be mounted --bind (rw) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'" },
+ { { "tmpfsmount", required_argument, NULL, 'T' }, "List of mountpoints to be mounted as tmpfs (R/W) inside the container. Can be specified multiple times. Supports 'dest' syntax. Alternatively, use '-m none:dest:tmpfs:size=8388608'" },
+ { { "mount", required_argument, NULL, 'm' }, "Arbitrary mount, format src:dst:fs_type:options" },
+ { { "symlink", required_argument, NULL, 's' }, "Symlink, format src:dst" },
+ { { "disable_proc", no_argument, NULL, 0x0603 }, "Disable mounting procfs in the jail" },
+ { { "proc_path", required_argument, NULL, 0x0605 }, "Path used to mount procfs (default: '/proc')" },
+ { { "proc_rw", no_argument, NULL, 0x0606 }, "Mount procfs as R/W (default: R/O)" },
+ { { "seccomp_policy", required_argument, NULL, 'P' }, "Path to file containing seccomp-bpf policy (see kafel/)" },
+ { { "seccomp_string", required_argument, NULL, 0x0901 }, "String with kafel seccomp-bpf policy (see kafel/)" },
+ { { "seccomp_log", no_argument, NULL, 0x0902 }, "Use SECCOMP_FILTER_FLAG_LOG (log all actions except SECCOMP_RET_ALLOW). Supported since kernel version 4.14" },
+ { { "cgroup_mem_max", required_argument, NULL, 0x0801 }, "Maximum number of bytes to use in the group (default: '0' - disabled)" },
+ { { "cgroup_mem_mount", required_argument, NULL, 0x0802 }, "Location of memory cgroup FS (default: '/sys/fs/cgroup/memory')" },
+ { { "cgroup_mem_parent", required_argument, NULL, 0x0803 }, "Which pre-existing memory cgroup to use as a parent (default: 'NSJAIL')" },
+ { { "cgroup_pids_max", required_argument, NULL, 0x0811 }, "Maximum number of pids in a cgroup (default: '0' - disabled)" },
+ { { "cgroup_pids_mount", required_argument, NULL, 0x0812 }, "Location of pids cgroup FS (default: '/sys/fs/cgroup/pids')" },
+ { { "cgroup_pids_parent", required_argument, NULL, 0x0813 }, "Which pre-existing pids cgroup to use as a parent (default: 'NSJAIL')" },
+ { { "cgroup_net_cls_classid", required_argument, NULL, 0x0821 }, "Class identifier of network packets in the group (default: '0' - disabled)" },
+ { { "cgroup_net_cls_mount", required_argument, NULL, 0x0822 }, "Location of net_cls cgroup FS (default: '/sys/fs/cgroup/net_cls')" },
+ { { "cgroup_net_cls_parent", required_argument, NULL, 0x0823 }, "Which pre-existing net_cls cgroup to use as a parent (default: 'NSJAIL')" },
+ { { "cgroup_cpu_ms_per_sec", required_argument, NULL, 0x0831 }, "Number of milliseconds of CPU time per second that the process group can use (default: '0' - no limit)" },
+ { { "cgroup_cpu_mount", required_argument, NULL, 0x0832 }, "Location of cpu cgroup FS (default: '/sys/fs/cgroup/cpu')" },
+ { { "cgroup_cpu_parent", required_argument, NULL, 0x0833 }, "Which pre-existing cpu cgroup to use as a parent (default: 'NSJAIL')" },
+ { { "iface_no_lo", no_argument, NULL, 0x700 }, "Don't bring the 'lo' interface up" },
+ { { "iface_own", required_argument, NULL, 0x704 }, "Move this existing network interface into the new NET namespace. Can be specified multiple times" },
+ { { "macvlan_iface", required_argument, NULL, 'I' }, "Interface which will be cloned (MACVLAN) and put inside the subprocess' namespace as 'vs'" },
+ { { "macvlan_vs_ip", required_argument, NULL, 0x701 }, "IP of the 'vs' interface (e.g. \"192.168.0.1\")" },
+ { { "macvlan_vs_nm", required_argument, NULL, 0x702 }, "Netmask of the 'vs' interface (e.g. \"255.255.255.0\")" },
+ { { "macvlan_vs_gw", required_argument, NULL, 0x703 }, "Default GW for the 'vs' interface (e.g. \"192.168.0.1\")" },
+ { { "macvlan_vs_ma", required_argument, NULL, 0x705 }, "MAC-address of the 'vs' interface (e.g. \"ba:ad:ba:be:45:00\")" },
+};
+// clang-format on
+
+static const char* logYesNo(bool yes) {
+ return (yes ? "true" : "false");
+}
+
+static void cmdlineOptUsage(struct custom_option* option) {
+ if (option->opt.val < 0x80) {
+ LOG_HELP_BOLD(" --%s%s%c %s", option->opt.name, "|-", option->opt.val,
+ option->opt.has_arg == required_argument ? "VALUE" : "");
+ } else {
+ LOG_HELP_BOLD(" --%s %s", option->opt.name,
+ option->opt.has_arg == required_argument ? "VALUE" : "");
+ }
+ LOG_HELP("\t%s", option->descr);
+}
+
+static void cmdlineUsage(const char* pname) {
+ LOG_HELP_BOLD("Usage: %s [options] -- path_to_command [args]", pname);
+ LOG_HELP_BOLD("Options:");
+ for (size_t i = 0; i < ARR_SZ(custom_opts); i++) {
+ cmdlineOptUsage(&custom_opts[i]);
+ }
+ LOG_HELP_BOLD("\n Examples: ");
+ LOG_HELP(" Wait on port 31337 for connections, and run /bin/sh");
+ LOG_HELP_BOLD(" nsjail -Ml --port 31337 --chroot / -- /bin/sh -i");
+ LOG_HELP(" Re-run echo command as a sub-process");
+ LOG_HELP_BOLD(" nsjail -Mr --chroot / -- /bin/echo \"ABC\"");
+ LOG_HELP(" Run echo command once only, as a sub-process");
+ LOG_HELP_BOLD(" nsjail -Mo --chroot / -- /bin/echo \"ABC\"");
+ LOG_HELP(" Execute echo command directly, without a supervising process");
+ LOG_HELP_BOLD(" nsjail -Me --chroot / --disable_proc -- /bin/echo \"ABC\"");
+}
+
+void addEnv(nsjconf_t* nsjconf, const std::string& env) {
+ if (env.find('=') != std::string::npos) {
+ nsjconf->envs.push_back(env);
+ return;
+ }
+ char* e = getenv(env.c_str());
+ if (!e) {
+ LOG_W("Requested to use the '%s' envvar, but it's not set. It'll be ignored",
+ env.c_str());
+ return;
+ }
+ nsjconf->envs.push_back(std::string(env).append("=").append(e));
+}
+
+void logParams(nsjconf_t* nsjconf) {
+ switch (nsjconf->mode) {
+ case MODE_LISTEN_TCP:
+ LOG_I("Mode: LISTEN_TCP");
+ break;
+ case MODE_STANDALONE_ONCE:
+ LOG_I("Mode: STANDALONE_ONCE");
+ break;
+ case MODE_STANDALONE_EXECVE:
+ LOG_I("Mode: STANDALONE_EXECVE");
+ break;
+ case MODE_STANDALONE_RERUN:
+ LOG_I("Mode: STANDALONE_RERUN");
+ break;
+ default:
+ LOG_F("Mode: UNKNOWN");
+ break;
+ }
+
+ LOG_I(
+ "Jail parameters: hostname:'%s', chroot:'%s', process:'%s', bind:[%s]:%d, "
+ "max_conns_per_ip:%u, time_limit:%" PRId64
+ ", personality:%#lx, daemonize:%s, clone_newnet:%s, "
+ "clone_newuser:%s, clone_newns:%s, clone_newpid:%s, clone_newipc:%s, clone_newuts:%s, "
+ "clone_newcgroup:%s, keep_caps:%s, disable_no_new_privs:%s, max_cpus:%zu",
+ nsjconf->hostname.c_str(), nsjconf->chroot.c_str(),
+ nsjconf->exec_file.empty() ? nsjconf->argv[0].c_str() : nsjconf->exec_file.c_str(),
+ nsjconf->bindhost.c_str(), nsjconf->port, nsjconf->max_conns_per_ip, nsjconf->tlimit,
+ nsjconf->personality, logYesNo(nsjconf->daemonize), logYesNo(nsjconf->clone_newnet),
+ logYesNo(nsjconf->clone_newuser), logYesNo(nsjconf->clone_newns),
+ logYesNo(nsjconf->clone_newpid), logYesNo(nsjconf->clone_newipc),
+ logYesNo(nsjconf->clone_newuts), logYesNo(nsjconf->clone_newcgroup),
+ logYesNo(nsjconf->keep_caps), logYesNo(nsjconf->disable_no_new_privs),
+ nsjconf->max_cpus);
+
+ for (const auto& p : nsjconf->mountpts) {
+ LOG_I("%s: %s", p.is_symlink ? "Symlink" : "Mount point",
+ mnt::describeMountPt(p).c_str());
+ }
+ for (const auto& uid : nsjconf->uids) {
+ LOG_I("Uid map: inside_uid:%lu outside_uid:%lu count:%zu newuidmap:%s",
+ (unsigned long)uid.inside_id, (unsigned long)uid.outside_id, uid.count,
+ uid.is_newidmap ? "true" : "false");
+ if (uid.outside_id == 0 && nsjconf->clone_newuser) {
+ LOG_W(
+ "Process will be UID/EUID=0 in the global user namespace, and will "
+ "have user root-level access to files");
+ }
+ }
+ for (const auto& gid : nsjconf->gids) {
+ LOG_I("Gid map: inside_gid:%lu outside_gid:%lu count:%zu newgidmap:%s",
+ (unsigned long)gid.inside_id, (unsigned long)gid.outside_id, gid.count,
+ gid.is_newidmap ? "true" : "false");
+ if (gid.outside_id == 0 && nsjconf->clone_newuser) {
+ LOG_W(
+ "Process will be GID/EGID=0 in the global user namespace, and will "
+ "have group root-level access to files");
+ }
+ }
+}
+
+uint64_t parseRLimit(int res, const char* optarg, unsigned long mul) {
+ if (strcasecmp(optarg, "inf") == 0) {
+ return RLIM64_INFINITY;
+ }
+ struct rlimit64 cur;
+ if (getrlimit64(res, &cur) == -1) {
+ PLOG_F("getrlimit(%d)", res);
+ }
+ if (strcasecmp(optarg, "def") == 0 || strcasecmp(optarg, "soft") == 0) {
+ return cur.rlim_cur;
+ }
+ if (strcasecmp(optarg, "max") == 0 || strcasecmp(optarg, "hard") == 0) {
+ return cur.rlim_max;
+ }
+ if (!util::isANumber(optarg)) {
+ LOG_F(
+ "RLIMIT %d needs a numeric or 'max'/'hard'/'def'/'soft'/'inf' value ('%s' "
+ "provided)",
+ res, optarg);
+ }
+ errno = 0;
+ uint64_t val = strtoull(optarg, NULL, 0);
+ if (val == ULLONG_MAX && errno != 0) {
+ PLOG_F("strtoull('%s', 0)", optarg);
+ }
+ return val * mul;
+}
+
+static std::string argFromVec(const std::vector<std::string>& vec, size_t pos) {
+ if (pos >= vec.size()) {
+ return "";
+ }
+ return vec[pos];
+}
+
+static bool setupArgv(nsjconf_t* nsjconf, int argc, char** argv, int optind) {
+ for (int i = optind; i < argc; i++) {
+ nsjconf->argv.push_back(argv[i]);
+ }
+ if (nsjconf->argv.empty()) {
+ cmdlineUsage(argv[0]);
+ LOG_E("No command provided");
+ return false;
+ }
+ if (nsjconf->exec_file.empty()) {
+ nsjconf->exec_file = nsjconf->argv[0];
+ }
+
+ if (nsjconf->use_execveat) {
+#if !defined(__NR_execveat)
+ LOG_E(
+ "Your nsjail is compiled without support for the execveat() syscall, yet you "
+ "specified the --execute_fd flag");
+ return false;
+#endif /* !defined(__NR_execveat) */
+ if ((nsjconf->exec_fd = TEMP_FAILURE_RETRY(
+ open(nsjconf->exec_file.c_str(), O_RDONLY | O_PATH | O_CLOEXEC))) == -1) {
+ PLOG_W("Couldn't open '%s' file", nsjconf->exec_file.c_str());
+ return false;
+ }
+ }
+ return true;
+}
+
+static bool setupMounts(nsjconf_t* nsjconf) {
+ if (!(nsjconf->chroot.empty())) {
+ if (!mnt::addMountPtHead(nsjconf, nsjconf->chroot, "/", /* fstype= */ "",
+ /* options= */ "",
+ nsjconf->is_root_rw ? (MS_BIND | MS_REC | MS_PRIVATE)
+ : (MS_BIND | MS_REC | MS_PRIVATE | MS_RDONLY),
+ /* is_dir= */ mnt::NS_DIR_YES, /* is_mandatory= */ true, /* src_env= */ "",
+ /* dst_env= */ "", /* src_content= */ "", /* is_symlink= */ false)) {
+ return false;
+ }
+ } else {
+ if (!mnt::addMountPtHead(nsjconf, /* src= */ "", "/", "tmpfs",
+ /* options= */ "", nsjconf->is_root_rw ? 0 : MS_RDONLY,
+ /* is_dir= */ mnt::NS_DIR_YES,
+ /* is_mandatory= */ true, /* src_env= */ "", /* dst_env= */ "",
+ /* src_content= */ "", /* is_symlink= */ false)) {
+ return false;
+ }
+ }
+ if (!nsjconf->proc_path.empty()) {
+ if (!mnt::addMountPtTail(nsjconf, /* src= */ "", nsjconf->proc_path, "proc",
+ /* options= */ "", nsjconf->is_proc_rw ? 0 : MS_RDONLY,
+ /* is_dir= */ mnt::NS_DIR_YES, /* is_mandatory= */ true, /* src_env= */ "",
+ /* dst_env= */ "", /* src_content= */ "", /* is_symlink= */ false)) {
+ return false;
+ }
+ }
+
+ return true;
+}
+
+void setupUsers(nsjconf_t* nsjconf) {
+ if (nsjconf->uids.empty()) {
+ idmap_t uid;
+ uid.inside_id = getuid();
+ uid.outside_id = getuid();
+ uid.count = 1U;
+ uid.is_newidmap = false;
+ nsjconf->uids.push_back(uid);
+ }
+ if (nsjconf->gids.empty()) {
+ idmap_t gid;
+ gid.inside_id = getgid();
+ gid.outside_id = getgid();
+ gid.count = 1U;
+ gid.is_newidmap = false;
+ nsjconf->gids.push_back(gid);
+ }
+}
+
+std::unique_ptr<nsjconf_t> parseArgs(int argc, char* argv[]) {
+ std::unique_ptr<nsjconf_t> nsjconf(new nsjconf_t);
+
+ nsjconf->use_execveat = false;
+ nsjconf->exec_fd = -1;
+ nsjconf->hostname = "NSJAIL";
+ nsjconf->cwd = "/";
+ nsjconf->port = 0;
+ nsjconf->bindhost = "::";
+ nsjconf->daemonize = false;
+ nsjconf->tlimit = 0;
+ nsjconf->max_cpus = 0;
+ nsjconf->keep_env = false;
+ nsjconf->keep_caps = false;
+ nsjconf->disable_no_new_privs = false;
+ nsjconf->rl_as = 512 * (1024 * 1024);
+ nsjconf->rl_core = 0;
+ nsjconf->rl_cpu = 600;
+ nsjconf->rl_fsize = 1 * (1024 * 1024);
+ nsjconf->rl_nofile = 32;
+ nsjconf->rl_nproc = parseRLimit(RLIMIT_NPROC, "soft", 1);
+ nsjconf->rl_stack = parseRLimit(RLIMIT_STACK, "soft", 1);
+ nsjconf->personality = 0;
+ nsjconf->clone_newnet = true;
+ nsjconf->clone_newuser = true;
+ nsjconf->clone_newns = true;
+ nsjconf->clone_newpid = true;
+ nsjconf->clone_newipc = true;
+ nsjconf->clone_newuts = true;
+ nsjconf->clone_newcgroup = true;
+ nsjconf->mode = MODE_STANDALONE_ONCE;
+ nsjconf->is_root_rw = false;
+ nsjconf->is_silent = false;
+ nsjconf->stderr_to_null = false;
+ nsjconf->skip_setsid = false;
+ nsjconf->max_conns_per_ip = 0;
+ nsjconf->proc_path = "/proc";
+ nsjconf->is_proc_rw = false;
+ nsjconf->cgroup_mem_mount = "/sys/fs/cgroup/memory";
+ nsjconf->cgroup_mem_parent = "NSJAIL";
+ nsjconf->cgroup_mem_max = (size_t)0;
+ nsjconf->cgroup_pids_mount = "/sys/fs/cgroup/pids";
+ nsjconf->cgroup_pids_parent = "NSJAIL";
+ nsjconf->cgroup_pids_max = 0U;
+ nsjconf->cgroup_net_cls_mount = "/sys/fs/cgroup/net_cls";
+ nsjconf->cgroup_net_cls_parent = "NSJAIL";
+ nsjconf->cgroup_net_cls_classid = 0U;
+ nsjconf->cgroup_cpu_mount = "/sys/fs/cgroup/cpu";
+ nsjconf->cgroup_cpu_parent = "NSJAIL";
+ nsjconf->cgroup_cpu_ms_per_sec = 0U;
+ nsjconf->iface_lo = true;
+ nsjconf->iface_vs_ip = "0.0.0.0";
+ nsjconf->iface_vs_nm = "255.255.255.0";
+ nsjconf->iface_vs_gw = "0.0.0.0";
+ nsjconf->iface_vs_ma = "";
+ nsjconf->orig_uid = getuid();
+ nsjconf->num_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+ nsjconf->seccomp_fprog.filter = NULL;
+ nsjconf->seccomp_fprog.len = 0;
+ nsjconf->seccomp_log = false;
+
+ nsjconf->openfds.push_back(STDIN_FILENO);
+ nsjconf->openfds.push_back(STDOUT_FILENO);
+ nsjconf->openfds.push_back(STDERR_FILENO);
+
+ // Generate options array for getopt_long.
+ size_t options_length = ARR_SZ(custom_opts) + 1;
+ struct option opts[options_length];
+ for (unsigned i = 0; i < ARR_SZ(custom_opts); i++) {
+ opts[i] = custom_opts[i].opt;
+ }
+ // Last, NULL option as a terminator.
+ struct option terminator = {NULL, 0, NULL, 0};
+ memcpy(&opts[options_length - 1], &terminator, sizeof(terminator));
+
+ int opt_index = 0;
+ for (;;) {
+ int c = getopt_long(argc, argv,
+ "x:H:D:C:c:p:i:u:g:l:L:t:M:NdvqQeh?E:R:B:T:m:s:P:I:U:G:", opts, &opt_index);
+ if (c == -1) {
+ break;
+ }
+ switch (c) {
+ case 'x':
+ nsjconf->exec_file = optarg;
+ break;
+ case 'H':
+ nsjconf->hostname = optarg;
+ break;
+ case 'D':
+ nsjconf->cwd = optarg;
+ break;
+ case 'C':
+ if (!config::parseFile(nsjconf.get(), optarg)) {
+ LOG_F("Couldn't parse configuration from '%s' file", optarg);
+ }
+ break;
+ case 'c':
+ nsjconf->chroot = optarg;
+ break;
+ case 'p':
+ nsjconf->port = strtoumax(optarg, NULL, 0);
+ nsjconf->mode = MODE_LISTEN_TCP;
+ break;
+ case 0x604:
+ nsjconf->bindhost = optarg;
+ break;
+ case 'i':
+ nsjconf->max_conns_per_ip = strtoul(optarg, NULL, 0);
+ break;
+ case 'l':
+ logs::logFile(optarg);
+ break;
+ case 'L':
+ logs::logFile(std::string("/dev/fd/") + optarg);
+ break;
+ case 'd':
+ nsjconf->daemonize = true;
+ break;
+ case 'v':
+ logs::logLevel(logs::DEBUG);
+ break;
+ case 'q':
+ logs::logLevel(logs::WARNING);
+ break;
+ case 'Q':
+ logs::logLevel(logs::FATAL);
+ break;
+ case 'e':
+ nsjconf->keep_env = true;
+ break;
+ case 't':
+ nsjconf->tlimit = (uint64_t)strtoull(optarg, NULL, 0);
+ break;
+ case 'h': /* help */
+ cmdlineUsage(argv[0]);
+ exit(0);
+ break;
+ case 0x0201:
+ nsjconf->rl_as = parseRLimit(RLIMIT_AS, optarg, (1024 * 1024));
+ break;
+ case 0x0202:
+ nsjconf->rl_core = parseRLimit(RLIMIT_CORE, optarg, (1024 * 1024));
+ break;
+ case 0x0203:
+ nsjconf->rl_cpu = parseRLimit(RLIMIT_CPU, optarg, 1);
+ break;
+ case 0x0204:
+ nsjconf->rl_fsize = parseRLimit(RLIMIT_FSIZE, optarg, (1024 * 1024));
+ break;
+ case 0x0205:
+ nsjconf->rl_nofile = parseRLimit(RLIMIT_NOFILE, optarg, 1);
+ break;
+ case 0x0206:
+ nsjconf->rl_nproc = parseRLimit(RLIMIT_NPROC, optarg, 1);
+ break;
+ case 0x0207:
+ nsjconf->rl_stack = parseRLimit(RLIMIT_STACK, optarg, (1024 * 1024));
+ break;
+ case 0x0301:
+ nsjconf->personality |= ADDR_COMPAT_LAYOUT;
+ break;
+ case 0x0302:
+ nsjconf->personality |= MMAP_PAGE_ZERO;
+ break;
+ case 0x0303:
+ nsjconf->personality |= READ_IMPLIES_EXEC;
+ break;
+ case 0x0304:
+ nsjconf->personality |= ADDR_LIMIT_3GB;
+ break;
+ case 0x0305:
+ nsjconf->personality |= ADDR_NO_RANDOMIZE;
+ break;
+ case 'N':
+ nsjconf->clone_newnet = false;
+ break;
+ case 0x0402:
+ nsjconf->clone_newuser = false;
+ break;
+ case 0x0403:
+ nsjconf->clone_newns = false;
+ break;
+ case 0x0404:
+ nsjconf->clone_newpid = false;
+ break;
+ case 0x0405:
+ nsjconf->clone_newipc = false;
+ break;
+ case 0x0406:
+ nsjconf->clone_newuts = false;
+ break;
+ case 0x0407:
+ nsjconf->clone_newcgroup = false;
+ break;
+ case 0x0408:
+ nsjconf->clone_newcgroup = true;
+ break;
+ case 0x0501:
+ nsjconf->keep_caps = true;
+ break;
+ case 0x0502:
+ nsjconf->is_silent = true;
+ break;
+ case 0x0503:
+ nsjconf->stderr_to_null = true;
+ break;
+ case 0x0504:
+ nsjconf->skip_setsid = true;
+ break;
+ case 0x0505:
+ nsjconf->openfds.push_back((int)strtol(optarg, NULL, 0));
+ break;
+ case 0x0507:
+ nsjconf->disable_no_new_privs = true;
+ break;
+ case 0x0508:
+ nsjconf->max_cpus = strtoul(optarg, NULL, 0);
+ break;
+ case 0x0509: {
+ int cap = caps::nameToVal(optarg);
+ if (cap == -1) {
+ return nullptr;
+ }
+ nsjconf->caps.push_back(cap);
+ } break;
+ case 0x0601:
+ nsjconf->is_root_rw = true;
+ break;
+ case 0x0603:
+ nsjconf->proc_path.clear();
+ break;
+ case 0x0605:
+ nsjconf->proc_path = optarg;
+ break;
+ case 0x0606:
+ nsjconf->is_proc_rw = true;
+ break;
+ case 0x0607:
+ nsjconf->use_execveat = true;
+ break;
+ case 'E':
+ addEnv(nsjconf.get(), optarg);
+ break;
+ case 'u': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string i_id = argFromVec(subopts, 0);
+ std::string o_id = argFromVec(subopts, 1);
+ std::string cnt = argFromVec(subopts, 2);
+ size_t count = strtoul(cnt.c_str(), nullptr, 0);
+ if (!user::parseId(nsjconf.get(), i_id, o_id, count, /* is_gid= */ false,
+ /* is_newidmap= */ false)) {
+ return nullptr;
+ }
+ } break;
+ case 'g': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string i_id = argFromVec(subopts, 0);
+ std::string o_id = argFromVec(subopts, 1);
+ std::string cnt = argFromVec(subopts, 2);
+ size_t count = strtoul(cnt.c_str(), nullptr, 0);
+ if (!user::parseId(nsjconf.get(), i_id, o_id, count, /* is_gid= */ true,
+ /* is_newidmap= */ false)) {
+ return nullptr;
+ }
+ } break;
+ case 'U': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string i_id = argFromVec(subopts, 0);
+ std::string o_id = argFromVec(subopts, 1);
+ std::string cnt = argFromVec(subopts, 2);
+ size_t count = strtoul(cnt.c_str(), nullptr, 0);
+ if (!user::parseId(nsjconf.get(), i_id, o_id, count, /* is_gid= */ false,
+ /* is_newidmap= */ true)) {
+ return nullptr;
+ }
+ } break;
+ case 'G': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string i_id = argFromVec(subopts, 0);
+ std::string o_id = argFromVec(subopts, 1);
+ std::string cnt = argFromVec(subopts, 2);
+ size_t count = strtoul(cnt.c_str(), nullptr, 0);
+ if (!user::parseId(nsjconf.get(), i_id, o_id, count, /* is_gid= */ true,
+ /* is_newidmap= */ true)) {
+ return nullptr;
+ }
+ } break;
+ case 'R': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string src = argFromVec(subopts, 0);
+ std::string dst = argFromVec(subopts, 1);
+ if (dst.empty()) {
+ dst = src;
+ }
+ if (!mnt::addMountPtTail(nsjconf.get(), src, dst, /* fstype= */ "",
+ /* options= */ "", MS_BIND | MS_REC | MS_PRIVATE | MS_RDONLY,
+ /* is_dir= */ mnt::NS_DIR_MAYBE, /* is_mandatory= */ true,
+ /* src_env= */ "", /* dst_env= */ "", /* src_content= */ "",
+ /* is_symlink= */ false)) {
+ return nullptr;
+ }
+ } break;
+ case 'B': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string src = argFromVec(subopts, 0);
+ std::string dst = argFromVec(subopts, 1);
+ if (dst.empty()) {
+ dst = src;
+ }
+ if (!mnt::addMountPtTail(nsjconf.get(), src, dst, /* fstype= */ "",
+ /* options= */ "", MS_BIND | MS_REC | MS_PRIVATE,
+ /* is_dir= */ mnt::NS_DIR_MAYBE, /* is_mandatory= */ true,
+ /* src_env= */ "", /* dst_env= */ "", /* src_content= */ "",
+ /* is_symlink= */ false)) {
+ return nullptr;
+ }
+ } break;
+ case 'T': {
+ if (!mnt::addMountPtTail(nsjconf.get(), "", optarg, /* fstype= */ "tmpfs",
+ /* options= */ "size=4194304", 0,
+ /* is_dir= */ mnt::NS_DIR_YES, /* is_mandatory= */ true,
+ /* src_env= */ "", /* dst_env= */ "", /* src_content= */ "",
+ /* is_symlink= */ false)) {
+ return nullptr;
+ }
+ } break;
+ case 'm': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string src = argFromVec(subopts, 0);
+ std::string dst = argFromVec(subopts, 1);
+ if (dst.empty()) {
+ dst = src;
+ }
+ std::string fs_type = argFromVec(subopts, 2);
+ std::string options = argFromVec(subopts, 3);
+ if (!mnt::addMountPtTail(nsjconf.get(), src, dst, /* fstype= */ fs_type,
+ /* options= */ options, /* flags= */ 0,
+ /* is_dir= */ mnt::NS_DIR_MAYBE, /* is_mandatory= */ true,
+ /* src_env= */ "", /* dst_env= */ "", /* src_content= */ "",
+ /* is_symlink= */ false)) {
+ return nullptr;
+ }
+ } break;
+ case 's': {
+ std::vector<std::string> subopts = util::strSplit(optarg, ':');
+ std::string src = argFromVec(subopts, 0);
+ std::string dst = argFromVec(subopts, 1);
+ if (!mnt::addMountPtTail(nsjconf.get(), src, dst, /* fstype= */ "",
+ /* options= */ "", /* flags= */ 0,
+ /* is_dir= */ mnt::NS_DIR_NO, /* is_mandatory= */ true,
+ /* src_env= */ "", /* dst_env= */ "", /* src_content= */ "",
+ /* is_symlink= */ true)) {
+ return nullptr;
+ }
+ } break;
+ case 'M':
+ switch (optarg[0]) {
+ case 'l':
+ nsjconf->mode = MODE_LISTEN_TCP;
+ break;
+ case 'o':
+ nsjconf->mode = MODE_STANDALONE_ONCE;
+ break;
+ case 'e':
+ nsjconf->mode = MODE_STANDALONE_EXECVE;
+ break;
+ case 'r':
+ nsjconf->mode = MODE_STANDALONE_RERUN;
+ break;
+ default:
+ LOG_E("Modes supported: -M l - MODE_LISTEN_TCP");
+ LOG_E(" -M o - MODE_STANDALONE_ONCE (default)");
+ LOG_E(" -M r - MODE_STANDALONE_RERUN");
+ LOG_E(" -M e - MODE_STANDALONE_EXECVE");
+ cmdlineUsage(argv[0]);
+ return nullptr;
+ break;
+ }
+ break;
+ case 0x700:
+ nsjconf->iface_lo = false;
+ break;
+ case 'I':
+ nsjconf->iface_vs = optarg;
+ break;
+ case 0x701:
+ nsjconf->iface_vs_ip = optarg;
+ break;
+ case 0x702:
+ nsjconf->iface_vs_nm = optarg;
+ break;
+ case 0x703:
+ nsjconf->iface_vs_gw = optarg;
+ break;
+ case 0x704:
+ nsjconf->ifaces.push_back(optarg);
+ break;
+ case 0x705:
+ nsjconf->iface_vs_ma = optarg;
+ break;
+ case 0x801:
+ nsjconf->cgroup_mem_max = (size_t)strtoull(optarg, NULL, 0);
+ break;
+ case 0x802:
+ nsjconf->cgroup_mem_mount = optarg;
+ break;
+ case 0x803:
+ nsjconf->cgroup_mem_parent = optarg;
+ break;
+ case 0x811:
+ nsjconf->cgroup_pids_max = (unsigned int)strtoul(optarg, NULL, 0);
+ break;
+ case 0x812:
+ nsjconf->cgroup_pids_mount = optarg;
+ break;
+ case 0x813:
+ nsjconf->cgroup_pids_parent = optarg;
+ break;
+ case 0x821:
+ nsjconf->cgroup_net_cls_classid = (unsigned int)strtoul(optarg, NULL, 0);
+ break;
+ case 0x822:
+ nsjconf->cgroup_net_cls_mount = optarg;
+ break;
+ case 0x823:
+ nsjconf->cgroup_net_cls_parent = optarg;
+ break;
+ case 0x831:
+ nsjconf->cgroup_cpu_ms_per_sec = (unsigned int)strtoul(optarg, NULL, 0);
+ break;
+ case 0x832:
+ nsjconf->cgroup_cpu_mount = optarg;
+ break;
+ case 0x833:
+ nsjconf->cgroup_cpu_parent = optarg;
+ break;
+ case 'P':
+ nsjconf->kafel_file_path = optarg;
+ break;
+ case 0x901:
+ nsjconf->kafel_string = optarg;
+ break;
+ case 0x902:
+ nsjconf->seccomp_log = true;
+ break;
+ default:
+ cmdlineUsage(argv[0]);
+ return nullptr;
+ break;
+ }
+ }
+
+ if (nsjconf->daemonize && !logs::logSet()) {
+ logs::logFile(_LOG_DEFAULT_FILE);
+ }
+ if (!setupMounts(nsjconf.get())) {
+ return nullptr;
+ }
+ if (!setupArgv(nsjconf.get(), argc, argv, optind)) {
+ return nullptr;
+ }
+ setupUsers(nsjconf.get());
+
+ return nsjconf;
+}
+
+} // namespace cmdline
diff --git a/cmdline.h b/cmdline.h
new file mode 100644
index 0000000..f452fe0
--- /dev/null
+++ b/cmdline.h
@@ -0,0 +1,41 @@
+/*
+
+ nsjail - cmdline parsing
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_CMDLINE_H
+#define NS_CMDLINE_H
+
+#include <stdint.h>
+
+#include <memory>
+#include <string>
+
+#include "nsjail.h"
+
+namespace cmdline {
+
+uint64_t parseRLimit(int res, const char* optarg, unsigned long mul);
+void logParams(nsjconf_t* nsjconf);
+void addEnv(nsjconf_t* nsjconf, const std::string& env);
+std::unique_ptr<nsjconf_t> parseArgs(int argc, char* argv[]);
+
+} // namespace cmdline
+
+#endif /* NS_CMDLINE_H */
diff --git a/config.cc b/config.cc
new file mode 100644
index 0000000..adabf0e
--- /dev/null
+++ b/config.cc
@@ -0,0 +1,317 @@
+/*
+
+ nsjail - config parsing
+ -----------------------------------------
+
+ Copyright 2017 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/personality.h>
+#include <sys/resource.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <google/protobuf/io/zero_copy_stream_impl.h>
+#include <google/protobuf/text_format.h>
+#include <fstream>
+#include <string>
+#include <vector>
+
+#include "caps.h"
+#include "cmdline.h"
+#include "config.h"
+#include "config.pb.h"
+#include "logs.h"
+#include "macros.h"
+#include "mnt.h"
+#include "user.h"
+#include "util.h"
+
+namespace config {
+
+static uint64_t configRLimit(
+ int res, const nsjail::RLimit& rl, const uint64_t val, unsigned long mul = 1UL) {
+ if (rl == nsjail::RLimit::VALUE) {
+ return (val * mul);
+ }
+ if (rl == nsjail::RLimit::SOFT) {
+ return cmdline::parseRLimit(res, "soft", mul);
+ }
+ if (rl == nsjail::RLimit::HARD) {
+ return cmdline::parseRLimit(res, "hard", mul);
+ }
+ if (rl == nsjail::RLimit::INF) {
+ return RLIM64_INFINITY;
+ }
+ LOG_F("Unknown rlimit value type for rlimit:%d", res);
+ abort();
+}
+
+static bool configParseInternal(nsjconf_t* nsjconf, const nsjail::NsJailConfig& njc) {
+ switch (njc.mode()) {
+ case nsjail::Mode::LISTEN:
+ nsjconf->mode = MODE_LISTEN_TCP;
+ break;
+ case nsjail::Mode::ONCE:
+ nsjconf->mode = MODE_STANDALONE_ONCE;
+ break;
+ case nsjail::Mode::RERUN:
+ nsjconf->mode = MODE_STANDALONE_RERUN;
+ break;
+ case nsjail::Mode::EXECVE:
+ nsjconf->mode = MODE_STANDALONE_EXECVE;
+ break;
+ default:
+ LOG_E("Unknown running mode: %d", njc.mode());
+ return false;
+ }
+ if (njc.has_chroot_dir()) {
+ nsjconf->chroot = njc.chroot_dir();
+ }
+ nsjconf->is_root_rw = njc.is_root_rw();
+ nsjconf->hostname = njc.hostname();
+ nsjconf->cwd = njc.cwd();
+ nsjconf->port = njc.port();
+ nsjconf->bindhost = njc.bindhost();
+ nsjconf->max_conns_per_ip = njc.max_conns_per_ip();
+ nsjconf->tlimit = njc.time_limit();
+ nsjconf->max_cpus = njc.max_cpus();
+ nsjconf->daemonize = njc.daemon();
+
+ if (njc.has_log_fd()) {
+ logs::logFile(std::string("/dev/fd/") + std::to_string(njc.log_fd()));
+ }
+ if (njc.has_log_file()) {
+ logs::logFile(njc.log_file());
+ }
+ if (njc.has_log_level()) {
+ switch (njc.log_level()) {
+ case nsjail::LogLevel::DEBUG:
+ logs::logLevel(logs::DEBUG);
+ break;
+ case nsjail::LogLevel::INFO:
+ logs::logLevel(logs::INFO);
+ break;
+ case nsjail::LogLevel::WARNING:
+ logs::logLevel(logs::WARNING);
+ break;
+ case nsjail::LogLevel::ERROR:
+ logs::logLevel(logs::ERROR);
+ break;
+ case nsjail::LogLevel::FATAL:
+ logs::logLevel(logs::FATAL);
+ break;
+ default:
+ LOG_E("Unknown log_level: %d", njc.log_level());
+ return false;
+ }
+ }
+
+ nsjconf->keep_env = njc.keep_env();
+ for (ssize_t i = 0; i < njc.envar_size(); i++) {
+ cmdline::addEnv(nsjconf, njc.envar(i));
+ }
+
+ nsjconf->keep_caps = njc.keep_caps();
+ for (ssize_t i = 0; i < njc.cap_size(); i++) {
+ int cap = caps::nameToVal(njc.cap(i).c_str());
+ if (cap == -1) {
+ return false;
+ }
+ nsjconf->caps.push_back(cap);
+ }
+
+ nsjconf->is_silent = njc.silent();
+ nsjconf->skip_setsid = njc.skip_setsid();
+
+ for (ssize_t i = 0; i < njc.pass_fd_size(); i++) {
+ nsjconf->openfds.push_back(njc.pass_fd(i));
+ }
+
+ nsjconf->stderr_to_null = njc.stderr_to_null();
+ nsjconf->disable_no_new_privs = njc.disable_no_new_privs();
+
+ nsjconf->rl_as =
+ configRLimit(RLIMIT_AS, njc.rlimit_as_type(), njc.rlimit_as(), 1024UL * 1024UL);
+ nsjconf->rl_core =
+ configRLimit(RLIMIT_CORE, njc.rlimit_core_type(), njc.rlimit_core(), 1024UL * 1024UL);
+ nsjconf->rl_cpu = configRLimit(RLIMIT_CPU, njc.rlimit_cpu_type(), njc.rlimit_cpu());
+ nsjconf->rl_fsize = configRLimit(
+ RLIMIT_FSIZE, njc.rlimit_fsize_type(), njc.rlimit_fsize(), 1024UL * 1024UL);
+ nsjconf->rl_nofile =
+ configRLimit(RLIMIT_NOFILE, njc.rlimit_nofile_type(), njc.rlimit_nofile());
+ nsjconf->rl_nproc = configRLimit(RLIMIT_NPROC, njc.rlimit_nproc_type(), njc.rlimit_nproc());
+ nsjconf->rl_stack = configRLimit(
+ RLIMIT_STACK, njc.rlimit_stack_type(), njc.rlimit_stack(), 1024UL * 1024UL);
+
+ if (njc.persona_addr_compat_layout()) {
+ nsjconf->personality |= ADDR_COMPAT_LAYOUT;
+ }
+ if (njc.persona_mmap_page_zero()) {
+ nsjconf->personality |= MMAP_PAGE_ZERO;
+ }
+ if (njc.persona_read_implies_exec()) {
+ nsjconf->personality |= READ_IMPLIES_EXEC;
+ }
+ if (njc.persona_addr_limit_3gb()) {
+ nsjconf->personality |= ADDR_LIMIT_3GB;
+ }
+ if (njc.persona_addr_no_randomize()) {
+ nsjconf->personality |= ADDR_NO_RANDOMIZE;
+ }
+
+ nsjconf->clone_newnet = njc.clone_newnet();
+ nsjconf->clone_newuser = njc.clone_newuser();
+ nsjconf->clone_newns = njc.clone_newns();
+ nsjconf->clone_newpid = njc.clone_newpid();
+ nsjconf->clone_newipc = njc.clone_newipc();
+ nsjconf->clone_newuts = njc.clone_newuts();
+ nsjconf->clone_newcgroup = njc.clone_newcgroup();
+
+ for (ssize_t i = 0; i < njc.uidmap_size(); i++) {
+ if (!user::parseId(nsjconf, njc.uidmap(i).inside_id(), njc.uidmap(i).outside_id(),
+ njc.uidmap(i).count(), false /* is_gid */, njc.uidmap(i).use_newidmap())) {
+ return false;
+ }
+ }
+ for (ssize_t i = 0; i < njc.gidmap_size(); i++) {
+ if (!user::parseId(nsjconf, njc.gidmap(i).inside_id(), njc.gidmap(i).outside_id(),
+ njc.gidmap(i).count(), true /* is_gid */, njc.gidmap(i).use_newidmap())) {
+ return false;
+ }
+ }
+
+ if (!njc.mount_proc()) {
+ nsjconf->proc_path.clear();
+ }
+ for (ssize_t i = 0; i < njc.mount_size(); i++) {
+ std::string src = njc.mount(i).src();
+ std::string src_env = njc.mount(i).prefix_src_env();
+ std::string dst = njc.mount(i).dst();
+ std::string dst_env = njc.mount(i).prefix_dst_env();
+ std::string fstype = njc.mount(i).fstype();
+ std::string options = njc.mount(i).options();
+
+ uintptr_t flags = (!njc.mount(i).rw()) ? MS_RDONLY : 0;
+ flags |= njc.mount(i).is_bind() ? (MS_BIND | MS_REC | MS_PRIVATE) : 0;
+ flags |= njc.mount(i).nosuid() ? MS_NOSUID : 0;
+ flags |= njc.mount(i).nodev() ? MS_NODEV : 0;
+ flags |= njc.mount(i).noexec() ? MS_NOEXEC : 0;
+ bool is_mandatory = njc.mount(i).mandatory();
+ bool is_symlink = njc.mount(i).is_symlink();
+ std::string src_content = njc.mount(i).src_content();
+
+ mnt::isDir_t is_dir = mnt::NS_DIR_MAYBE;
+ if (njc.mount(i).has_is_dir()) {
+ is_dir = njc.mount(i).is_dir() ? mnt::NS_DIR_YES : mnt::NS_DIR_NO;
+ }
+
+ if (!mnt::addMountPtTail(nsjconf, src, dst, fstype, options, flags, is_dir,
+ is_mandatory, src_env, dst_env, src_content, is_symlink)) {
+ LOG_E("Couldn't add mountpoint for src:'%s' dst:'%s'", src.c_str(),
+ dst.c_str());
+ return false;
+ }
+ }
+
+ if (njc.has_seccomp_policy_file()) {
+ nsjconf->kafel_file_path = njc.seccomp_policy_file();
+ }
+ for (ssize_t i = 0; i < njc.seccomp_string().size(); i++) {
+ nsjconf->kafel_string += njc.seccomp_string(i);
+ nsjconf->kafel_string += '\n';
+ }
+ nsjconf->seccomp_log = njc.seccomp_log();
+
+ nsjconf->cgroup_mem_max = njc.cgroup_mem_max();
+ nsjconf->cgroup_mem_mount = njc.cgroup_mem_mount();
+ nsjconf->cgroup_mem_parent = njc.cgroup_mem_parent();
+ nsjconf->cgroup_pids_max = njc.cgroup_pids_max();
+ nsjconf->cgroup_pids_mount = njc.cgroup_pids_mount();
+ nsjconf->cgroup_pids_parent = njc.cgroup_pids_parent();
+ nsjconf->cgroup_net_cls_classid = njc.cgroup_net_cls_classid();
+ nsjconf->cgroup_net_cls_mount = njc.cgroup_net_cls_mount();
+ nsjconf->cgroup_net_cls_parent = njc.cgroup_net_cls_parent();
+ nsjconf->cgroup_cpu_ms_per_sec = njc.cgroup_cpu_ms_per_sec();
+ nsjconf->cgroup_cpu_mount = njc.cgroup_cpu_mount();
+ nsjconf->cgroup_cpu_parent = njc.cgroup_cpu_parent();
+
+ nsjconf->iface_lo = !(njc.iface_no_lo());
+ for (ssize_t i = 0; i < njc.iface_own().size(); i++) {
+ nsjconf->ifaces.push_back(njc.iface_own(i));
+ }
+ if (njc.has_macvlan_iface()) {
+ nsjconf->iface_vs = njc.macvlan_iface();
+ }
+ nsjconf->iface_vs_ip = njc.macvlan_vs_ip();
+ nsjconf->iface_vs_nm = njc.macvlan_vs_nm();
+ nsjconf->iface_vs_gw = njc.macvlan_vs_gw();
+ nsjconf->iface_vs_ma = njc.macvlan_vs_ma();
+
+ if (njc.has_exec_bin()) {
+ nsjconf->exec_file = njc.exec_bin().path();
+ nsjconf->argv.push_back(njc.exec_bin().path());
+ for (ssize_t i = 0; i < njc.exec_bin().arg().size(); i++) {
+ nsjconf->argv.push_back(njc.exec_bin().arg(i));
+ }
+ if (njc.exec_bin().has_arg0()) {
+ nsjconf->argv[0] = njc.exec_bin().arg0();
+ }
+ nsjconf->use_execveat = njc.exec_bin().exec_fd();
+ }
+
+ return true;
+}
+
+static void LogHandler(
+ google::protobuf::LogLevel level, const char* filename, int line, const std::string& message) {
+ LOG_W("config.cc: '%s'", message.c_str());
+}
+
+bool parseFile(nsjconf_t* nsjconf, const char* file) {
+ LOG_D("Parsing configuration from '%s'", file);
+
+ int fd = open(file, O_RDONLY | O_CLOEXEC);
+ if (fd == -1) {
+ PLOG_W("Couldn't open config file '%s'", file);
+ return false;
+ }
+
+ SetLogHandler(LogHandler);
+ google::protobuf::io::FileInputStream input(fd);
+ input.SetCloseOnDelete(true);
+
+ /* Use static so we can get c_str() pointers, and copy them into the nsjconf struct */
+ static nsjail::NsJailConfig nsc;
+
+ auto parser = google::protobuf::TextFormat::Parser();
+ if (!parser.Parse(&input, &nsc)) {
+ LOG_W("Couldn't parse file '%s' from Text into ProtoBuf", file);
+ return false;
+ }
+ if (!configParseInternal(nsjconf, nsc)) {
+ LOG_W("Couldn't parse the ProtoBuf");
+ return false;
+ }
+
+ LOG_D("Parsed config:\n'%s'", nsc.DebugString().c_str());
+ return true;
+}
+
+} // namespace config
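The mount-flag composition in configParseInternal() above can be sketched in isolation. This is a minimal illustration, not the nsjail code: `MountOpts` and `mountFlags` are hypothetical names introduced here, and only the MS_* constants come from <sys/mount.h>.

```cpp
#include <stdint.h>
#include <sys/mount.h>

// Hypothetical stand-in for the boolean fields of a MountPt message.
struct MountOpts {
    bool rw = false;
    bool is_bind = false;
    bool nosuid = false;
    bool nodev = false;
    bool noexec = false;
};

// Follows the logic shown in config.cc: a mount is read-only unless rw is
// set, and a bind mount is also made recursive and private.
uintptr_t mountFlags(const MountOpts& m) {
    uintptr_t flags = m.rw ? 0 : MS_RDONLY;
    if (m.is_bind) flags |= (MS_BIND | MS_REC | MS_PRIVATE);
    if (m.nosuid) flags |= MS_NOSUID;
    if (m.nodev) flags |= MS_NODEV;
    if (m.noexec) flags |= MS_NOEXEC;
    return flags;
}
```

Because `rw` defaults to false in the schema, a mount entry that only sets `is_bind: true` ends up as a read-only, recursive, private bind mount.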
diff --git a/config.h b/config.h
new file mode 100644
index 0000000..108d3fd
--- /dev/null
+++ b/config.h
@@ -0,0 +1,35 @@
+/*
+
+ nsjail - config parsing
+ -----------------------------------------
+
+ Copyright 2017 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_CONFIG_H
+#define NS_CONFIG_H
+
+#include <stdbool.h>
+
+#include "nsjail.h"
+
+namespace config {
+
+bool parseFile(nsjconf_t* nsjconf, const char* file);
+
+} // namespace config
+
+#endif /* NS_CONFIG_H */
diff --git a/config.proto b/config.proto
new file mode 100644
index 0000000..3988543
--- /dev/null
+++ b/config.proto
@@ -0,0 +1,243 @@
+syntax = "proto2";
+
+package nsjail;
+
+enum Mode {
+ LISTEN = 0; /* Listening on a TCP port */
+ ONCE = 1; /* Running the command once only */
+ RERUN = 2; /* Re-executing the command (forever) */
+ EXECVE = 3; /* Executing command w/o the supervisor */
+}
+/* Should be self explanatory */
+enum LogLevel {
+ DEBUG = 0; /* Equivalent to the '-v' cmd-line option */
+ INFO = 1; /* Default level */
+ WARNING = 2; /* Equivalent to the '-q' cmd-line option */
+ ERROR = 3;
+ FATAL = 4;
+}
+message IdMap {
+ /* Empty string means "current uid/gid" */
+ optional string inside_id = 1 [default = ""];
+ optional string outside_id = 2 [default = ""];
+ /* See 'man user_namespaces' for the meaning of count */
+ optional uint32 count = 3 [default = 1];
+ /* Does this map use /usr/bin/new[u|g]idmap binary? */
+ optional bool use_newidmap = 4 [default = false];
+}
+message MountPt {
+ /* Can be skipped for filesystems like 'proc' */
+ optional string src = 1 [default = ""];
+ /* Should 'src' path be prefixed with this envvar? */
+ optional string prefix_src_env = 2 [default = ""];
+ /* If specified, contains buffer that will be written to the dst file */
+ optional bytes src_content = 3 [default = ""];
+ /* Mount point inside jail */
+ required string dst = 4 [default = ""];
+ /* Should 'dst' path be prefixed with this envvar? */
+ optional string prefix_dst_env = 5 [default = ""];
+ /* Can be empty for mount --bind mounts */
+ optional string fstype = 6 [default = ""];
+ /* E.g. size=5000000 for 'tmpfs' */
+ optional string options = 7 [default = ""];
+ /* Is it a 'mount --bind src dst' type of mount? */
+ optional bool is_bind = 8 [default = false];
+ /* Is it a R/W mount? */
+ optional bool rw = 9 [default = false];
+ /* Is it a directory? If not specified, an internal
+ heuristic will be used to determine that */
+ optional bool is_dir = 10;
+ /* Should the sandboxing fail if we cannot mount this resource? */
+ optional bool mandatory = 11 [default = true];
+ /* Is it a symlink (instead of real mount point)? */
+ optional bool is_symlink = 12 [default = false];
+ /* Is it a nosuid mount */
+ optional bool nosuid = 13 [default = false];
+ /* Is it a nodev mount */
+ optional bool nodev = 14 [default = false];
+ /* Is it a noexec mount */
+ optional bool noexec = 15 [default = false];
+}
+enum RLimit {
+ VALUE = 0; /* Use the provided value */
+ SOFT = 1; /* Use the current soft rlimit */
+ HARD = 2; /* Use the current hard rlimit */
+ INF = 3; /* Use RLIM64_INFINITY */
+}
+message Exe {
+ /* Will be used both as execv's path and as argv[0] */
+ required string path = 1;
+ /* This will be argv[1] and so on.. */
+ repeated string arg = 2;
+ /* Override argv[0] */
+ optional string arg0 = 3;
+ /* Should execveat() be used to execute a file-descriptor instead? */
+ optional bool exec_fd = 4 [default = false];
+}
+message NsJailConfig {
+ /* Optional name and description for this config */
+ optional string name = 1 [default = ""];
+ repeated string description = 2;
+
+ /* Execution mode: see 'enum Mode' description for more */
+ optional Mode mode = 3 [default = ONCE];
+ /* Equivalent to a bind mount with dst='/'. DEPRECATED: Use bind mounts. */
+ optional string chroot_dir = 4 [deprecated = true];
+ /* Applies both to the chroot_dir and to /proc mounts. DEPRECATED: Use bind mounts */
+ optional bool is_root_rw = 5 [default = false, deprecated = true];
+ /* Hostname inside jail */
+ optional string hostname = 8 [default = "NSJAIL"];
+ /* Initial current working directory for the binary */
+ optional string cwd = 9 [default = "/"];
+
+ /* TCP port to listen to. Valid with mode=LISTEN only */
+ optional uint32 port = 10 [default = 0];
+ /* Host to bind to for mode=LISTEN. Must be in IPv6 format */
+ optional string bindhost = 11 [default = "::"];
+ /* For mode=LISTEN, maximum number of connections from a single IP */
+ optional uint32 max_conns_per_ip = 12 [default = 0];
+
+ /* Wall-time time limit for commands */
+ optional uint32 time_limit = 13 [default = 600];
+ /* Should nsjail go into background? */
+ optional bool daemon = 14 [default = false];
+ /* Maximum number of CPUs to use: 0 - no limit */
+ optional uint32 max_cpus = 15 [default = 0];
+
+ /* FD to log to. */
+ optional int32 log_fd = 16;
+ /* File to save logs to */
+ optional string log_file = 17;
+ /* Minimum log level displayed.
+ See 'enum LogLevel' description for more */
+ optional LogLevel log_level = 18;
+
+ /* Should the current environment variables be kept
+ when executing the binary */
+ optional bool keep_env = 19 [default = false];
+ /* EnvVars to be set before executing binaries. If the envvar doesn't contain '='
+ (e.g. just the 'DISPLAY' string), the current envvar value will be used */
+ repeated string envar = 20;
+
+ /* Should capabilities be preserved or dropped */
+ optional bool keep_caps = 21 [default = false];
+ /* Which capabilities should be preserved if keep_caps == false.
+ Format: "CAP_SYS_PTRACE" */
+ repeated string cap = 22;
+ /* Should nsjail close FD=0,1,2 before executing the process */
+ optional bool silent = 23 [default = false];
+ /* Should the child process have control over terminal?
+ Can be useful to allow /bin/sh to provide
+ job control / signals. Dangerous: can be used to push
+ characters back into the controlling terminal */
+ optional bool skip_setsid = 24 [default = false];
+ /* Redirect stderr of the process to /dev/null instead of the socket or original TTY */
+ optional bool stderr_to_null = 25 [default = false];
+ /* Which FDs should be passed to the newly executed process
+ By default only FD=0,1,2 are passed */
+ repeated int32 pass_fd = 26;
+ /* Setting it to true allows set-uid binaries
+ inside the jail */
+ optional bool disable_no_new_privs = 27 [default = false];
+
+ /* Various rlimits, the rlimit_as/rlimit_core/... are used only if
+ rlimit_as_type/rlimit_core_type/... are set to RLimit::VALUE */
+ optional uint64 rlimit_as = 28 [default = 512]; /* In MiB */
+ optional RLimit rlimit_as_type = 29 [default = VALUE];
+ optional uint64 rlimit_core = 30 [default = 0]; /* In MiB */
+ optional RLimit rlimit_core_type = 31 [default = VALUE];
+ optional uint64 rlimit_cpu = 32 [default = 600]; /* In seconds */
+ optional RLimit rlimit_cpu_type = 33 [default = VALUE];
+ optional uint64 rlimit_fsize = 34 [default = 1]; /* In MiB */
+ optional RLimit rlimit_fsize_type = 35 [default = VALUE];
+ optional uint64 rlimit_nofile = 36 [default = 32];
+ optional RLimit rlimit_nofile_type = 37 [default = VALUE];
+ /* RLIMIT_NPROC is system-wide - tricky to use; use the soft limit value by
+ * default here */
+ optional uint64 rlimit_nproc = 38 [default = 1024];
+ optional RLimit rlimit_nproc_type = 39 [default = SOFT];
+ /* In MiB, use the soft limit value by default */
+ optional uint64 rlimit_stack = 40 [default = 1048576];
+ optional RLimit rlimit_stack_type = 41 [default = SOFT];
+
+ /* See 'man personality' for more */
+ optional bool persona_addr_compat_layout = 42 [default = false];
+ optional bool persona_mmap_page_zero = 43 [default = false];
+ optional bool persona_read_implies_exec = 44 [default = false];
+ optional bool persona_addr_limit_3gb = 45 [default = false];
+ optional bool persona_addr_no_randomize = 46 [default = false];
+
+ /* Which name-spaces should be used? */
+ optional bool clone_newnet = 47 [default = true];
+ optional bool clone_newuser = 48 [default = true];
+ optional bool clone_newns = 49 [default = true];
+ optional bool clone_newpid = 50 [default = true];
+ optional bool clone_newipc = 51 [default = true];
+ optional bool clone_newuts = 52 [default = true];
+ /* Disable for kernel versions < 4.6 as it's not supported there */
+ optional bool clone_newcgroup = 53 [default = true];
+
+ /* Mappings for UIDs and GIDs. See the description for 'msg IdMap'
+ for more */
+ repeated IdMap uidmap = 54;
+ repeated IdMap gidmap = 55;
+
+ /* Should /proc be mounted (R/O)? This can also be added in the 'mount'
+ section below */
+ optional bool mount_proc = 56 [default = false];
+ /* Mount points inside the jail. See the description for 'msg MountPt'
+ for more */
+ repeated MountPt mount = 57;
+
+ /* Kafel seccomp-bpf policy file or a string:
+ Homepage of the project: https://github.com/google/kafel */
+ optional string seccomp_policy_file = 58;
+ repeated string seccomp_string = 59;
+ /* Setting it to true makes audit write seccomp logs to dmesg */
+ optional bool seccomp_log = 60 [default = false];
+
+ /* If > 0, maximum cumulative size of RAM used inside any jail */
+ optional uint64 cgroup_mem_max = 61 [default = 0]; /* In MiB */
+ /* Mount point for cgroups-memory in your system */
+ optional string cgroup_mem_mount = 62 [default = "/sys/fs/cgroup/memory"];
+ /* Writeable directory (for the nsjail user) under cgroup_mem_mount */
+ optional string cgroup_mem_parent = 63 [default = "NSJAIL"];
+
+ /* If > 0, maximum number of PIDs (threads/processes) inside jail */
+ optional uint64 cgroup_pids_max = 64 [default = 0];
+ /* Mount point for cgroups-pids in your system */
+ optional string cgroup_pids_mount = 65 [default = "/sys/fs/cgroup/pids"];
+ /* Writeable directory (for the nsjail user) under cgroup_pids_mount */
+ optional string cgroup_pids_parent = 66 [default = "NSJAIL"];
+
+ /* If > 0, Class identifier of network packets inside jail */
+ optional uint32 cgroup_net_cls_classid = 67 [default = 0];
+ /* Mount point for cgroups-net-cls in your system */
+ optional string cgroup_net_cls_mount = 68 [default = "/sys/fs/cgroup/net_cls"];
+ /* Writeable directory (for the nsjail user) under cgroup_net_cls_mount */
+ optional string cgroup_net_cls_parent = 69 [default = "NSJAIL"];
+
+ /* If > 0, number of milliseconds of CPU time per second that jailed processes can use */
+ optional uint32 cgroup_cpu_ms_per_sec = 70 [default = 0];
+ /* Mount point for cgroups-cpu in your system */
+ optional string cgroup_cpu_mount = 71 [default = "/sys/fs/cgroup/cpu"];
+ /* Writeable directory (for the nsjail user) under cgroup_cpu_mount */
+ optional string cgroup_cpu_parent = 72 [default = "NSJAIL"];
+
+ /* Should the 'lo' interface be left down (not brought up) inside this jail? */
+ optional bool iface_no_lo = 73 [default = false];
+
+ /* Put this interface inside the jail */
+ repeated string iface_own = 74;
+
+ /* Parameters for the cloned MACVLAN interface inside jail */
+ optional string macvlan_iface = 75; /* Interface to be cloned, e.g. 'eth0' */
+ optional string macvlan_vs_ip = 76 [default = "192.168.0.2"];
+ optional string macvlan_vs_nm = 77 [default = "255.255.255.0"];
+ optional string macvlan_vs_gw = 78 [default = "192.168.0.1"];
+ optional string macvlan_vs_ma = 79 [default = ""];
+
+ /* Binary path (with arguments) to be executed. If not specified here, it
+ can be specified with cmd-line as "-- /path/to/command arg1 arg2" */
+ optional Exe exec_bin = 80;
+}
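The RLimit enum above selects how each rlimit_* value is resolved. The sketch below shows that selection in a standalone form. It is an approximation under stated assumptions: `RLimitType` and `resolveRLimit` are hypothetical names, and it calls getrlimit() with RLIM_INFINITY directly, whereas config.cc delegates the SOFT/HARD cases to cmdline::parseRLimit() and uses RLIM64_INFINITY.

```cpp
#include <stdint.h>
#include <sys/resource.h>

#include <stdexcept>

// Hypothetical mirror of the RLimit enum in config.proto.
enum class RLimitType { VALUE, SOFT, HARD, INF };

// VALUE scales the configured number by a unit multiplier (e.g. MiB),
// SOFT/HARD copy the limits of the calling process, INF means unlimited.
uint64_t resolveRLimit(int res, RLimitType type, uint64_t val, uint64_t mul = 1) {
    if (type == RLimitType::VALUE) {
        return val * mul;
    }
    if (type == RLimitType::INF) {
        return RLIM_INFINITY;
    }
    struct rlimit cur;
    if (getrlimit(res, &cur) == -1) {
        throw std::runtime_error("getrlimit() failed");
    }
    return (type == RLimitType::SOFT) ? cur.rlim_cur : cur.rlim_max;
}
```

With the schema defaults, `rlimit_as` (512, type VALUE, multiplier 1024 * 1024) resolves to 536870912 bytes, while `rlimit_nproc` (type SOFT) ignores its numeric value and inherits the caller's soft limit.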
diff --git a/configs/apache.cfg b/configs/apache.cfg
new file mode 100644
index 0000000..f3ae838
--- /dev/null
+++ b/configs/apache.cfg
@@ -0,0 +1,135 @@
+name: "apache-with-cloned-net"
+description: "Tested under Ubuntu 17.04. Other Linux distros might "
+description: "use different locations for the Apache HTTPD configuration "
+description: "files and system libraries"
+description: "Run as: sudo ./nsjail --config configs/apache.cfg"
+
+mode: ONCE
+hostname: "APACHE-NSJ"
+
+rlimit_as: 1024
+rlimit_fsize: 1024
+rlimit_cpu_type: INF
+rlimit_nofile: 64
+
+time_limit: 0
+
+cap: "CAP_NET_BIND_SERVICE"
+
+envar: "APACHE_RUN_DIR=/run/apache2"
+envar: "APACHE_PID_FILE=/run/apache2/apache2.pid"
+envar: "APACHE_RUN_USER=www-data"
+envar: "APACHE_RUN_GROUP=www-data"
+envar: "APACHE_LOG_DIR=/run/apache2"
+envar: "APACHE_LOCK_DIR=/run/apache2"
+
+uidmap {
+ inside_id: "1"
+ outside_id: "www-data"
+}
+
+gidmap {
+ inside_id: "1"
+ outside_id: "www-data"
+}
+
+mount {
+ src: "/etc/apache2"
+ dst: "/etc/apache2"
+ is_bind: true
+}
+mount {
+ src: "/etc/mime.types"
+ dst: "/etc/mime.types"
+ is_bind: true
+}
+mount {
+ src: "/etc/localtime"
+ dst: "/etc/localtime"
+ is_bind: true
+}
+mount {
+ src_content: "www-data:x:1:1:www-data:/var/www:/bin/false"
+ dst: "/etc/passwd"
+}
+mount {
+ src_content: "www-data:x:1:"
+ dst: "/etc/group"
+}
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+}
+mount {
+ dst: "/run/apache2"
+ fstype: "tmpfs"
+ rw: true
+}
+mount {
+ src: "/dev/urandom"
+ dst: "/dev/urandom"
+ is_bind: true
+ rw: true
+}
+mount {
+ dst: "/dev/shm"
+ fstype: "tmpfs"
+ rw: true
+}
+mount {
+ dst: "/proc"
+ fstype: "proc"
+}
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+}
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+mount {
+ src: "/var/www/html"
+ dst: "/var/www/html"
+ is_bind: true
+}
+mount {
+ src: "/usr/share/apache2"
+ dst: "/usr/share/apache2"
+ is_bind: true
+}
+mount {
+ src: "/var/lib/apache2"
+ dst: "/var/lib/apache2"
+ is_bind: true
+}
+mount {
+ src: "/usr/sbin/apache2"
+ dst: "/usr/sbin/apache2"
+ is_bind: true
+}
+
+seccomp_string: " KILL {"
+seccomp_string: " ptrace,"
+seccomp_string: " process_vm_readv,"
+seccomp_string: " process_vm_writev"
+seccomp_string: " }"
+seccomp_string: " DEFAULT ALLOW"
+
+macvlan_iface: "enp0s31f6"
+macvlan_vs_ip: "192.168.10.223"
+macvlan_vs_nm: "255.255.255.0"
+macvlan_vs_gw: "192.168.10.1"
+
+exec_bin {
+ path: "/usr/sbin/apache2"
+ arg: "-DFOREGROUND"
+}
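The repeated `seccomp_string` entries in this config are joined into a single Kafel policy by the loop in config.cc, which appends each entry followed by a newline. A minimal sketch of that joining step, with `joinPolicy` as a hypothetical helper name:

```cpp
#include <string>
#include <vector>

// Concatenate policy fragments, terminating each one with '\n', in the
// same way config.cc accumulates kafel_string.
std::string joinPolicy(const std::vector<std::string>& lines) {
    std::string policy;
    for (const auto& line : lines) {
        policy += line;
        policy += '\n';
    }
    return policy;
}
```

Splitting a policy across several `seccomp_string` lines is therefore only a readability choice; the policy parser receives one newline-separated string.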
diff --git a/configs/bash-with-fake-geteuid.cfg b/configs/bash-with-fake-geteuid.cfg
new file mode 100644
index 0000000..c0046ba
--- /dev/null
+++ b/configs/bash-with-fake-geteuid.cfg
@@ -0,0 +1,184 @@
+name: "bash-with-fake-geteuid"
+description: "An example/demo policy which allows executing /bin/bash and other commands in "
+description: "a fairly restricted jail containing only some directories from the main "
+description: "system, and with blocked __NR_syslog syscall. Also, __NR_geteuid returns -1337 "
+description: "value, which /usr/bin/id will show as euid=4294965959, and ptrace is blocked "
+description: "but returns success, hence strange behavior of the strace command. "
+description: "This is an example/demo policy, hence it repeats many default values from the "
+description: "https://github.com/google/nsjail/blob/master/config.proto PB schema "
+
+mode: ONCE
+hostname: "JAILED-BASH"
+cwd: "/tmp"
+
+bindhost: "127.0.0.1"
+max_conns_per_ip: 10
+port: 31337
+
+time_limit: 100
+daemon: false
+max_cpus: 1
+
+keep_env: false
+envar: "ENVAR1=VALUE1"
+envar: "ENVAR2=VALUE2"
+envar: "TERM=linux"
+envar: "HOME=/"
+envar: "PS1=[\\H:\\t:\\s-\\V:\\w]\\$ "
+
+keep_caps: true
+cap: "CAP_NET_ADMIN"
+cap: "CAP_NET_RAW"
+silent: false
+stderr_to_null: false
+skip_setsid: true
+pass_fd: 100
+pass_fd: 3
+disable_no_new_privs: false
+
+rlimit_as: 128
+rlimit_core: 0
+rlimit_cpu: 10
+rlimit_fsize: 0
+rlimit_nofile: 32
+rlimit_stack_type: SOFT
+rlimit_nproc_type: SOFT
+
+persona_addr_compat_layout: false
+persona_mmap_page_zero: false
+persona_read_implies_exec: false
+persona_addr_limit_3gb: false
+persona_addr_no_randomize: false
+
+clone_newnet: true
+clone_newuser: true
+clone_newns: true
+clone_newpid: true
+clone_newipc: true
+clone_newuts: true
+clone_newcgroup: true
+
+uidmap {
+ inside_id: "0"
+ outside_id: ""
+ count: 1
+}
+
+gidmap {
+ inside_id: "0"
+ outside_id: ""
+ count: 1
+}
+
+mount_proc: false
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+ rw: false
+}
+
+mount {
+ src: "/bin"
+ dst: "/bin"
+ is_bind: true
+ rw: false
+}
+
+mount {
+ src: "/sbin"
+ dst: "/sbin"
+ is_bind: true
+ rw: false
+}
+
+mount {
+ src: "/usr"
+ dst: "/usr"
+ is_bind: true
+ rw: false
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ rw: false
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ rw: false
+ mandatory: false
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+ noexec: true
+ nodev: true
+ nosuid: true
+}
+
+mount {
+ dst: "/dev"
+ fstype: "tmpfs"
+ options: "size=8388608"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ src: "/dev/null"
+ dst: "/dev/null"
+ rw: true
+ is_bind: true
+}
+
+mount {
+ dst: "/proc"
+ fstype: "proc"
+ rw: false
+}
+
+mount {
+ src_content: "This file was created dynamically"
+ dst: "/DYNAMIC_FILE"
+}
+
+mount {
+ src: "/nonexistent_777"
+ dst: "/nonexistent_777"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/proc/self/fd"
+ dst: "/dev/fd"
+ is_symlink: true
+}
+
+mount {
+ src: "/some/unimportant/target"
+ dst: "/proc/no/symlinks/can/be/created/in/proc"
+ is_symlink: true
+ mandatory: false
+}
+
+seccomp_string: "ERRNO(1337) { geteuid } "
+seccomp_string: "ERRNO(0) { ptrace } "
+seccomp_string: "KILL { syslog } "
+seccomp_string: "DEFAULT ALLOW "
+
+exec_bin {
+ path: "/bin/bash"
+ arg0: "sh"
+ arg: "-i"
+}
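The description of this config says that a geteuid result of -1337 is displayed by /usr/bin/id as euid=4294965959. That figure is the same 32-bit pattern read as an unsigned value, i.e. 2^32 - 1337. A small sketch of the conversion, assuming a 32-bit uid type; `asUnsignedId` is a hypothetical helper name:

```cpp
#include <stdint.h>

// Reinterpret a negative return value as the unsigned 32-bit id that a
// tool printing uids would show.
constexpr uint32_t asUnsignedId(int32_t raw) {
    return static_cast<uint32_t>(raw);
}

// 4294967296 - 1337 == 4294965959
static_assert(asUnsignedId(-1337) == 4294965959u, "wraps modulo 2^32");
```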
diff --git a/configs/demo-dont-use-chrome-with-net.cfg b/configs/demo-dont-use-chrome-with-net.cfg
new file mode 100644
index 0000000..690657e
--- /dev/null
+++ b/configs/demo-dont-use-chrome-with-net.cfg
@@ -0,0 +1,178 @@
+name: "chrome-with-net"
+
+description: "Don't use for anything serious - this is just a demo policy. See notes"
+description: "at the end of this description for more."
+description: ""
+description: "This policy allows running Chrome inside a jail. Access to networking is"
+description: "permitted with this setup (clone_newnet: false)."
+description: ""
+description: "The only permitted home directories are $HOME/.config/google-chrome and $HOME/Documents."
+description: "The rest of the files/dirs available on the FS are libs and X-related files/dirs."
+description: ""
+description: "Run as:"
+description: ""
+description: "./nsjail --config configs/demo-dont-use-chrome-with-net.cfg"
+description: ""
+description: "You can then go to https://uploadfiles.io/ and try to upload a file in order"
+description: "to see what your local directory (and all system directories) look like."
+description: ""
+description: "Note: Using this profile for anything serious is *A VERY BAD* idea. Chrome"
+description: "provides an excellent FS&syscall sandbox for Linux, and this profile disables"
+description: "this sandboxing with --no-sandbox and substitutes Chrome's syscall/ns policy"
+description: "with more relaxed namespacing."
+
+mode: ONCE
+hostname: "CHROME"
+cwd: "/user"
+
+time_limit: 0
+
+envar: "HOME=/user"
+envar: "DISPLAY"
+envar: "TMP=/tmp"
+
+rlimit_as: 4096
+rlimit_cpu: 1000
+rlimit_fsize: 1024
+rlimit_nofile: 1024
+
+clone_newnet: false
+
+mount {
+ dst: "/proc"
+ fstype: "proc"
+}
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/bin"
+ dst: "/bin"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/bin"
+ dst: "/usr/bin"
+ is_bind: true
+}
+
+mount {
+ src: "/opt/google/chrome"
+ dst: "/opt/google/chrome"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/share"
+ dst: "/usr/share"
+ is_bind: true
+}
+
+mount {
+ src: "/dev/urandom"
+ dst: "/dev/urandom"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/dev/null"
+ dst: "/dev/null"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/dev/fd/"
+ dst: "/dev/fd/"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/etc/resolv.conf"
+ dst: "/etc/resolv.conf"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/dev/shm"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/user"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/Documents"
+ dst: "/user/Documents"
+ rw: true
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/.config/google-chrome"
+ dst: "/user/.config/google-chrome"
+ is_bind: true
+ rw: true
+ mandatory: false
+}
+
+mount {
+ src: "/tmp/.X11-unix/X0"
+ dst: "/tmp/.X11-unix/X0"
+ is_bind: true
+}
+
+seccomp_string: " KILL {"
+seccomp_string: " ptrace,"
+seccomp_string: " process_vm_readv,"
+seccomp_string: " process_vm_writev"
+seccomp_string: " }"
+seccomp_string: " DEFAULT ALLOW"
+
+exec_bin {
+ path: "/opt/google/chrome/google-chrome"
+ arg: "--no-sandbox"
+}
diff --git a/configs/firefox-with-cloned-net.cfg b/configs/firefox-with-cloned-net.cfg
new file mode 100644
index 0000000..eb541e3
--- /dev/null
+++ b/configs/firefox-with-cloned-net.cfg
@@ -0,0 +1,181 @@
+name: "firefox-with-cloned-net"
+
+description: "This policy allows running firefox inside a jail on a separate eth interface."
+description: "A separate networking context separates the process from the global \"lo\", and"
+description: "from the global abstract socket namespace."
+description: ""
+description: "The only permitted home directories are $HOME/.mozilla and $HOME/Documents."
+description: "The rest of the files/dirs available on the FS are libs and X-related files/dirs."
+description: ""
+description: "As this needs to be run as root, you will have to set up the correct uid&gid"
+description: "mappings (here: jagger), the name of your local interface (here: 'enp0s31f6'),"
+description: "and the correct IPv4 addresses."
+description: ""
+description: "IPv6 should work out-of-the-box, given that your local IPv6 discovery is set"
+description: "up correctly."
+description: ""
+description: "Run as:"
+description: ""
+description: "sudo ./nsjail --config configs/firefox-with-cloned-net.cfg"
+description: ""
+description: "You can then go to https://uploadfiles.io/ and try to upload a file in order"
+description: "to see what your local directory (also, all system directories) looks like."
+
+mode: ONCE
+hostname: "FF-MACVTAP"
+cwd: "/user"
+
+time_limit: 0
+
+envar: "HOME=/user"
+envar: "DISPLAY"
+envar: "TMP=/tmp"
+
+rlimit_as: 4096
+rlimit_cpu: 1000
+rlimit_fsize: 1024
+rlimit_nofile: 512
+
+uidmap {
+ inside_id: "9999999"
+ outside_id: "jagger"
+}
+
+gidmap {
+ inside_id: "9999999"
+ outside_id: "jagger"
+}
+
+mount {
+ dst: "/proc"
+ fstype: "proc"
+ rw: true
+}
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/usr/lib/firefox"
+ dst: "/usr/lib/firefox"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/bin/firefox"
+ dst: "/usr/bin/firefox"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/share"
+ dst: "/usr/share"
+ is_bind: true
+}
+
+mount {
+ src_content: "<?xml version=\"1.0\"?>\n<!DOCTYPE fontconfig SYSTEM \"fonts.dtd\">\n<fontconfig><dir>/usr/share/fonts</dir><cachedir>/tmp/fontconfig</cachedir></fontconfig>"
+ dst: "/etc/fonts/fonts.conf"
+}
+
+mount {
+ src: "/dev/urandom"
+ dst: "/dev/urandom"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/dev/null"
+ dst: "/dev/null"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src_content: "nameserver 8.8.8.8"
+ dst: "/etc/resolv.conf"
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/dev/shm"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/user"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/Documents"
+ dst: "/user/Documents"
+ rw: true
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/.mozilla"
+ dst: "/user/.mozilla"
+ is_bind: true
+ rw: true
+ mandatory: false
+}
+
+mount {
+ src: "/tmp/.X11-unix/X0"
+ dst: "/tmp/.X11-unix/X0"
+ is_bind: true
+}
+
+seccomp_string: "KILL {"
+seccomp_string: " ptrace,"
+seccomp_string: " process_vm_readv,"
+seccomp_string: " process_vm_writev"
+seccomp_string: "}"
+seccomp_string: "DEFAULT ALLOW"
+
+macvlan_iface: "enp0s31f6"
+macvlan_vs_ip: "192.168.10.223"
+macvlan_vs_nm: "255.255.255.0"
+macvlan_vs_gw: "192.168.10.1"
+
+exec_bin {
+ path: "/usr/lib/firefox/firefox"
+}
diff --git a/configs/firefox-with-net.cfg b/configs/firefox-with-net.cfg
new file mode 100644
index 0000000..190f7c2
--- /dev/null
+++ b/configs/firefox-with-net.cfg
@@ -0,0 +1,168 @@
+name: "firefox-with-net"
+
+description: "This policy allows running firefox inside a jail. Access to networking is"
+description: "permitted with this setup (clone_newnet: false)."
+description: ""
+description: "The only permitted home directories are $HOME/.mozilla and $HOME/Documents."
+description: "The rest of the files/dirs available on the FS are libs and X-related files/dirs."
+description: ""
+description: "Run as:"
+description: ""
+description: "./nsjail --config configs/firefox-with-net.cfg"
+description: ""
+description: "You can then go to https://uploadfiles.io/ and try to upload a file in order"
+description: "to see what your local directory (also, all system directories) looks like."
+
+mode: ONCE
+hostname: "FIREFOX"
+cwd: "/user"
+
+time_limit: 0
+
+clone_newnet: false
+
+envar: "HOME=/user"
+envar: "DISPLAY"
+envar: "TMP=/tmp"
+
+rlimit_as: 4096
+rlimit_cpu: 1000
+rlimit_fsize: 1024
+rlimit_nofile: 512
+
+uidmap {
+ inside_id: "9999999"
+}
+
+gidmap {
+ inside_id: "9999999"
+}
+
+mount {
+ dst: "/proc"
+ fstype: "proc"
+ rw: true
+}
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/usr/lib/firefox"
+ dst: "/usr/lib/firefox"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/bin/firefox"
+ dst: "/usr/bin/firefox"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/share"
+ dst: "/usr/share"
+ is_bind: true
+}
+
+mount {
+ src_content: "<?xml version=\"1.0\"?>\n<!DOCTYPE fontconfig SYSTEM \"fonts.dtd\">\n<fontconfig><dir>/usr/share/fonts</dir><cachedir>/tmp/fontconfig</cachedir></fontconfig>"
+ dst: "/etc/fonts/fonts.conf"
+}
+
+mount {
+ src: "/dev/urandom"
+ dst: "/dev/urandom"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/dev/null"
+ dst: "/dev/null"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src_content: "nameserver 8.8.8.8"
+ dst: "/etc/resolv.conf"
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/dev/shm"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/user"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/Documents"
+ dst: "/user/Documents"
+ rw: true
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/.mozilla"
+ dst: "/user/.mozilla"
+ is_bind: true
+ rw: true
+ mandatory: false
+}
+
+mount {
+ src: "/tmp/.X11-unix/X0"
+ dst: "/tmp/.X11-unix/X0"
+ is_bind: true
+}
+
+seccomp_string: "KILL {"
+seccomp_string: " ptrace,"
+seccomp_string: " process_vm_readv,"
+seccomp_string: " process_vm_writev"
+seccomp_string: "}"
+seccomp_string: "DEFAULT ALLOW"
+
+exec_bin {
+ path: "/usr/lib/firefox/firefox"
+}
diff --git a/configs/home-documents-with-xorg-no-net.cfg b/configs/home-documents-with-xorg-no-net.cfg
new file mode 100644
index 0000000..cc2514f
--- /dev/null
+++ b/configs/home-documents-with-xorg-no-net.cfg
@@ -0,0 +1,134 @@
+name: "documents-with-xorg"
+
+description: "This policy allows running many X.Org-based tools, which are allowed"
+description: "to access the $HOME/Documents directory only. An example of use is:"
+description: ""
+description: "./nsjail --config configs/home-documents-with-xorg-no-net.cfg -- \\"
+description: " /usr/bin/geeqie /user/Documents/"
+description: ""
+description: "What is more, this policy doesn't allow access to networking."
+
+mode: ONCE
+hostname: "NSJAIL"
+cwd: "/user"
+
+time_limit: 1000
+
+envar: "DISPLAY"
+envar: "HOME=/user"
+envar: "TMP=/tmp"
+
+rlimit_as: 2048
+rlimit_cpu: 1000
+rlimit_fsize: 1024
+rlimit_nofile: 16
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/bin"
+ dst: "/bin"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/bin"
+ dst: "/usr/bin"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/share"
+ dst: "/usr/share"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib64"
+ dst: "/usr/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/usr/lib32"
+ dst: "/usr/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ dst: "/dev/shm"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ dst: "/user"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/Documents"
+ dst: "/user/Documents"
+ is_bind: true
+}
+
+mount {
+ src: "/tmp/.X11-unix"
+ dst: "/tmp/.X11-unix"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/dev/null"
+ dst: "/dev/null"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/etc/passwd"
+ dst: "/etc/passwd"
+ is_bind: true
+}
+
+seccomp_string: "KILL {"
+seccomp_string: " ptrace,"
+seccomp_string: " process_vm_readv,"
+seccomp_string: " process_vm_writev"
+seccomp_string: "}"
+seccomp_string: "DEFAULT ALLOW"
diff --git a/configs/imagemagick-convert.cfg b/configs/imagemagick-convert.cfg
new file mode 100644
index 0000000..dfe702d
--- /dev/null
+++ b/configs/imagemagick-convert.cfg
@@ -0,0 +1,88 @@
+name: "imagemagick-convert"
+
+description: "This policy allows running ImageMagick's convert inside a jail."
+description: "Your $HOME's Documents will be mapped as /user/Documents"
+description: ""
+description: "Run as:"
+description: ""
+description: "./nsjail --config imagemagick-convert.cfg -- /usr/bin/convert \\"
+description: " jpg:/user/Documents/input.jpg png:/user/Documents/output.png"
+
+mode: ONCE
+hostname: "IM-CONVERT"
+cwd: "/user"
+
+time_limit: 120
+
+envar: "HOME=/user"
+envar: "TMP=/tmp"
+
+rlimit_as: 2048
+rlimit_cpu: 1000
+rlimit_fsize: 1024
+rlimit_nofile: 64
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/user"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/Documents"
+ dst: "/user/Documents"
+ rw: true
+ is_bind: true
+ mandatory: false
+}
+
+seccomp_string: "ALLOW {"
+seccomp_string: " read, write, open, openat, close, newstat, newfstat,"
+seccomp_string: " newlstat, lseek, mmap, mprotect, munmap, brk,"
+seccomp_string: " rt_sigaction, rt_sigprocmask, pwrite64, access,"
+seccomp_string: " getpid, execveat, getdents, unlink, fchmod,"
+seccomp_string: " getrlimit, getrusage, sysinfo, times, futex,"
+seccomp_string: " arch_prctl, sched_getaffinity, set_tid_address,"
+seccomp_string: " clock_gettime, set_robust_list, exit_group,"
+seccomp_string: " clone, getcwd, pread64, readlink, prlimit64"
+seccomp_string: "}"
+seccomp_string: "DEFAULT KILL"
+
+exec_bin {
+ path: "/usr/bin/convert"
+ exec_fd: true
+}
diff --git a/configs/static-busybox-with-execveat.cfg b/configs/static-busybox-with-execveat.cfg
new file mode 100644
index 0000000..0d0a49e
--- /dev/null
+++ b/configs/static-busybox-with-execveat.cfg
@@ -0,0 +1,46 @@
+name: "static-busybox-with-execveat"
+description: "An example/demo policy which allows executing /bin/busybox-static in an "
+description: "empty (only /proc) mount namespace which doesn't even include busybox itself"
+
+mode: ONCE
+hostname: "BUSYBOX"
+cwd: "/"
+
+time_limit: 100
+
+keep_env: false
+envar: "TERM=linux"
+envar: "PS1=$ "
+
+skip_setsid: true
+
+clone_newcgroup: true
+
+uidmap {
+ inside_id: "999999"
+ outside_id: ""
+ count: 1
+}
+
+gidmap {
+ inside_id: "999999"
+ outside_id: ""
+ count: 1
+}
+
+mount_proc: false
+
+mount {
+ dst: "/proc"
+ fstype: "proc"
+ rw: false
+}
+
+seccomp_string: "ERRNO(0) { ptrace }"
+seccomp_string: "DEFAULT ALLOW"
+
+exec_bin {
+ path: "/bin/busybox"
+ arg: "sh"
+ exec_fd: true
+}
diff --git a/configs/xchat-with-net.cfg b/configs/xchat-with-net.cfg
new file mode 100644
index 0000000..e8d2759
--- /dev/null
+++ b/configs/xchat-with-net.cfg
@@ -0,0 +1,142 @@
+name: "xchat-with-net"
+
+description: "This policy allows running xchat inside a jail. Access to networking is"
+description: "permitted with this setup (clone_newnet: false)."
+description: ""
+description: "The only permitted home directories are $HOME/.xchat2 and $HOME/Documents."
+description: "The rest of the files/dirs available on the FS are libs and X-related files/dirs."
+description: ""
+description: "Run as:"
+description: "./nsjail --config configs/xchat-with-net.cfg --daemon -l /tmp/xchat.log"
+
+mode: ONCE
+hostname: "XCHAT"
+cwd: "/user"
+
+time_limit: 0
+
+envar: "HOME=/user"
+envar: "DISPLAY"
+envar: "TMP=/tmp"
+envar: "FONTCONFIG_FILE=/etc/fonts/fonts.conf"
+envar: "FC_CONFIG_FILE=/etc/fonts/fonts.conf"
+envar: "LANG"
+
+rlimit_as: 4096
+rlimit_cpu_type: INF
+rlimit_fsize: 4096
+rlimit_nofile: 128
+
+clone_newnet: false
+
+mount {
+ dst: "/proc"
+ fstype: "proc"
+}
+
+mount {
+ src: "/lib"
+ dst: "/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/usr/lib"
+ dst: "/usr/lib"
+ is_bind: true
+}
+
+mount {
+ src: "/lib64"
+ dst: "/lib64"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src: "/lib32"
+ dst: "/lib32"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ src_content: "<?xml version=\"1.0\"?>\n<!DOCTYPE fontconfig SYSTEM \"fonts.dtd\">\n<fontconfig><dir>/usr/share/fonts</dir><cachedir>/tmp/fontconfig</cachedir></fontconfig>"
+ dst: "/etc/fonts/fonts.conf"
+}
+
+mount {
+ src: "/usr/share"
+ dst: "/usr/share"
+ is_bind: true
+}
+
+mount {
+ src: "/dev/urandom"
+ dst: "/dev/urandom"
+ is_bind: true
+ rw: true
+}
+
+mount {
+ src: "/etc/resolv.conf"
+ dst: "/etc/resolv.conf"
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ dst: "/tmp"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/dev/shm"
+ fstype: "tmpfs"
+ rw: true
+ is_bind: false
+}
+
+mount {
+ dst: "/user"
+ fstype: "tmpfs"
+ rw: true
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/Documents"
+ dst: "/user/Documents"
+ rw: true
+ is_bind: true
+ mandatory: false
+}
+
+mount {
+ prefix_src_env: "HOME"
+ src: "/.xchat2"
+ dst: "/user/.xchat2"
+ is_bind: true
+ rw: true
+ mandatory: false
+}
+
+mount {
+ src: "/tmp/.X11-unix/X0"
+ dst: "/tmp/.X11-unix/X0"
+ is_bind: true
+}
+
+seccomp_string: "KILL {"
+seccomp_string: " ptrace,"
+seccomp_string: " process_vm_readv,"
+seccomp_string: " process_vm_writev"
+seccomp_string: "}"
+seccomp_string: "DEFAULT ALLOW"
+
+exec_bin {
+ path: "/usr/bin/xchat"
+ exec_fd: true
+}
diff --git a/contain.cc b/contain.cc
new file mode 100644
index 0000000..176f216
--- /dev/null
+++ b/contain.cc
@@ -0,0 +1,316 @@
+/*
+
+ nsjail - isolating the binary
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "contain.h"
+
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/personality.h>
+#include <sys/prctl.h>
+#include <sys/resource.h>
+#include <unistd.h>
+
+#include <algorithm>
+
+#include "caps.h"
+#include "cgroup.h"
+#include "cpu.h"
+#include "logs.h"
+#include "macros.h"
+#include "mnt.h"
+#include "net.h"
+#include "pid.h"
+#include "user.h"
+#include "util.h"
+#include "uts.h"
+
+namespace contain {
+
+static bool containUserNs(nsjconf_t* nsjconf) {
+ return user::initNsFromChild(nsjconf);
+}
+
+static bool containInitPidNs(nsjconf_t* nsjconf) {
+ return pid::initNs(nsjconf);
+}
+
+static bool containInitNetNs(nsjconf_t* nsjconf) {
+ return net::initNsFromChild(nsjconf);
+}
+
+static bool containInitUtsNs(nsjconf_t* nsjconf) {
+ return uts::initNs(nsjconf);
+}
+
+static bool containInitCgroupNs(void) {
+ return cgroup::initNs();
+}
+
+static bool containDropPrivs(nsjconf_t* nsjconf) {
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 38
+#endif
+ if (!nsjconf->disable_no_new_privs) {
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) == -1) {
+ /* Only new kernels support it */
+ PLOG_W("prctl(PR_SET_NO_NEW_PRIVS, 1)");
+ }
+ }
+
+ if (!caps::initNs(nsjconf)) {
+ return false;
+ }
+
+ return true;
+}
+
+static bool containPrepareEnv(nsjconf_t* nsjconf) {
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) == -1) {
+ PLOG_E("prctl(PR_SET_PDEATHSIG, SIGKILL)");
+ return false;
+ }
+ if (nsjconf->personality && personality(nsjconf->personality) == -1) {
+ PLOG_E("personality(%lx)", nsjconf->personality);
+ return false;
+ }
+ errno = 0;
+ if (setpriority(PRIO_PROCESS, 0, 19) == -1 && errno != 0) {
+ PLOG_W("setpriority(19)");
+ }
+ if (!nsjconf->skip_setsid) {
+ setsid();
+ }
+ return true;
+}
+
+static bool containInitMountNs(nsjconf_t* nsjconf) {
+ return mnt::initNs(nsjconf);
+}
+
+static bool containCPU(nsjconf_t* nsjconf) {
+ return cpu::initCpu(nsjconf);
+}
+
+static bool containSetLimits(nsjconf_t* nsjconf) {
+ struct rlimit64 rl;
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_as;
+ if (setrlimit64(RLIMIT_AS, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_AS, %" PRIu64 ")", nsjconf->rl_as);
+ return false;
+ }
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_core;
+ if (setrlimit64(RLIMIT_CORE, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_CORE, %" PRIu64 ")", nsjconf->rl_core);
+ return false;
+ }
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_cpu;
+ if (setrlimit64(RLIMIT_CPU, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_CPU, %" PRIu64 ")", nsjconf->rl_cpu);
+ return false;
+ }
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_fsize;
+ if (setrlimit64(RLIMIT_FSIZE, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_FSIZE, %" PRIu64 ")", nsjconf->rl_fsize);
+ return false;
+ }
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_nofile;
+ if (setrlimit64(RLIMIT_NOFILE, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_NOFILE, %" PRIu64 ")", nsjconf->rl_nofile);
+ return false;
+ }
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_nproc;
+ if (setrlimit64(RLIMIT_NPROC, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_NPROC, %" PRIu64 ")", nsjconf->rl_nproc);
+ return false;
+ }
+ rl.rlim_cur = rl.rlim_max = nsjconf->rl_stack;
+ if (setrlimit64(RLIMIT_STACK, &rl) == -1) {
+ PLOG_E("setrlimit64(0, RLIMIT_STACK, %" PRIu64 ")", nsjconf->rl_stack);
+ return false;
+ }
+ return true;
+}
+
+static bool containPassFd(nsjconf_t* nsjconf, int fd) {
+ return (std::find(nsjconf->openfds.begin(), nsjconf->openfds.end(), fd) !=
+ nsjconf->openfds.end());
+}
+
+static bool containMakeFdsCOENaive(nsjconf_t* nsjconf) {
+ /*
+ * Don't use getrlimit(RLIMIT_NOFILE) here, as it can return an artificially small value
+ * (e.g. 32), which could be smaller than the highest file-descriptor number in use
+ * in this process. Just use some reasonably sane value (e.g. 1024)
+ */
+ for (unsigned fd = 0; fd < 1024; fd++) {
+ int flags = TEMP_FAILURE_RETRY(fcntl(fd, F_GETFD, 0));
+ if (flags == -1) {
+ continue;
+ }
+ if (containPassFd(nsjconf, fd)) {
+ LOG_D("FD=%d will be passed to the child process", fd);
+ if (TEMP_FAILURE_RETRY(fcntl(fd, F_SETFD, flags & ~(FD_CLOEXEC))) == -1) {
+ PLOG_E("Could not clear FD_CLOEXEC for FD=%d", fd);
+ return false;
+ }
+ } else {
+ if (TEMP_FAILURE_RETRY(fcntl(fd, F_SETFD, flags | FD_CLOEXEC)) == -1) {
+ PLOG_E("Could not set FD_CLOEXEC for FD=%d", fd);
+ return false;
+ }
+ }
+ }
+ return true;
+}
+
+static bool containMakeFdsCOEProc(nsjconf_t* nsjconf) {
+ int dirfd = open("/proc/self/fd", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+ if (dirfd == -1) {
+ PLOG_D("open('/proc/self/fd', O_DIRECTORY|O_RDONLY|O_CLOEXEC)");
+ return false;
+ }
+ DIR* dir = fdopendir(dirfd);
+ if (dir == NULL) {
+ PLOG_W("fdopendir(fd=%d)", dirfd);
+ close(dirfd);
+ return false;
+ }
+ /* Make all fds above stderr close-on-exec */
+ for (;;) {
+ errno = 0;
+ struct dirent* entry = readdir(dir);
+ if (entry == NULL && errno != 0) {
+ PLOG_D("readdir('/proc/self/fd')");
+ closedir(dir);
+ return false;
+ }
+ if (entry == NULL) {
+ break;
+ }
+ if (strcmp(".", entry->d_name) == 0) {
+ continue;
+ }
+ if (strcmp("..", entry->d_name) == 0) {
+ continue;
+ }
+ errno = 0;
+ int fd = strtoimax(entry->d_name, NULL, 10);
+ if (errno != 0) {
+ PLOG_W("Cannot convert /proc/self/fd/%s to a number", entry->d_name);
+ continue;
+ }
+ int flags = TEMP_FAILURE_RETRY(fcntl(fd, F_GETFD, 0));
+ if (flags == -1) {
+ PLOG_D("fcntl(fd=%d, F_GETFD, 0)", fd);
+ closedir(dir);
+ return false;
+ }
+ if (containPassFd(nsjconf, fd)) {
+ LOG_D("FD=%d will be passed to the child process", fd);
+ if (TEMP_FAILURE_RETRY(fcntl(fd, F_SETFD, flags & ~(FD_CLOEXEC))) == -1) {
+ PLOG_E("Could not clear FD_CLOEXEC for FD=%d", fd);
+ closedir(dir);
+ return false;
+ }
+ } else {
+ LOG_D("FD=%d will be closed before execve()", fd);
+ if (TEMP_FAILURE_RETRY(fcntl(fd, F_SETFD, flags | FD_CLOEXEC)) == -1) {
+ PLOG_E("Could not set FD_CLOEXEC for FD=%d", fd);
+ closedir(dir);
+ return false;
+ }
+ }
+ }
+ closedir(dir);
+ return true;
+}
+
+static bool containMakeFdsCOE(nsjconf_t* nsjconf) {
+ if (containMakeFdsCOEProc(nsjconf)) {
+ return true;
+ }
+ if (containMakeFdsCOENaive(nsjconf)) {
+ return true;
+ }
+ LOG_E("Couldn't mark relevant file-descriptors as close-on-exec with any known method");
+ return false;
+}
+
+bool setupFD(nsjconf_t* nsjconf, int fd_in, int fd_out, int fd_err) {
+ if (nsjconf->stderr_to_null) {
+ LOG_D("Redirecting FD=2 (STDERR_FILENO) to /dev/null");
+ if ((fd_err = TEMP_FAILURE_RETRY(open("/dev/null", O_RDWR))) == -1) {
+ PLOG_E("open('/dev/null', O_RDWR)");
+ return false;
+ }
+ }
+ if (nsjconf->is_silent) {
+ LOG_D("Redirecting FD=0/1/2 (STDIN/OUT/ERR_FILENO) to /dev/null");
+ if (TEMP_FAILURE_RETRY(fd_in = fd_out = fd_err = open("/dev/null", O_RDWR)) == -1) {
+ PLOG_E("open('/dev/null', O_RDWR)");
+ return false;
+ }
+ }
+ /* Set stdin/stdout/stderr to the net */
+ if (fd_in != STDIN_FILENO && TEMP_FAILURE_RETRY(dup2(fd_in, STDIN_FILENO)) == -1) {
+ PLOG_E("dup2(%d, STDIN_FILENO)", fd_in);
+ return false;
+ }
+ if (fd_out != STDOUT_FILENO && TEMP_FAILURE_RETRY(dup2(fd_out, STDOUT_FILENO)) == -1) {
+ PLOG_E("dup2(%d, STDOUT_FILENO)", fd_out);
+ return false;
+ }
+ if (fd_err != STDERR_FILENO && TEMP_FAILURE_RETRY(dup2(fd_err, STDERR_FILENO)) == -1) {
+ PLOG_E("dup2(%d, STDERR_FILENO)", fd_err);
+ return false;
+ }
+ return true;
+}
+
+bool containProc(nsjconf_t* nsjconf) {
+ RETURN_ON_FAILURE(containUserNs(nsjconf));
+ RETURN_ON_FAILURE(containInitPidNs(nsjconf));
+ RETURN_ON_FAILURE(containInitMountNs(nsjconf));
+ RETURN_ON_FAILURE(containInitNetNs(nsjconf));
+ RETURN_ON_FAILURE(containInitUtsNs(nsjconf));
+ RETURN_ON_FAILURE(containInitCgroupNs());
+ RETURN_ON_FAILURE(containDropPrivs(nsjconf));
+ ;
+ /* */
+ /* As non-root */
+ RETURN_ON_FAILURE(containCPU(nsjconf));
+ RETURN_ON_FAILURE(containSetLimits(nsjconf));
+ RETURN_ON_FAILURE(containPrepareEnv(nsjconf));
+ RETURN_ON_FAILURE(containMakeFdsCOE(nsjconf));
+
+ return true;
+}
+
+} // namespace contain
diff --git a/contain.h b/contain.h
new file mode 100644
index 0000000..5ed750d
--- /dev/null
+++ b/contain.h
@@ -0,0 +1,36 @@
+/*
+
+ nsjail - isolating the binary
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_CONTAIN_H
+#define NS_CONTAIN_H
+
+#include <stdbool.h>
+
+#include "nsjail.h"
+
+namespace contain {
+
+bool setupFD(nsjconf_t* nsjconf, int fd_in, int fd_out, int fd_err);
+bool containProc(nsjconf_t* nsjconf);
+
+} // namespace contain
+
+#endif /* NS_CONTAIN_H */
diff --git a/cpu.cc b/cpu.cc
new file mode 100644
index 0000000..c67fa0d
--- /dev/null
+++ b/cpu.cc
@@ -0,0 +1,94 @@
+/*
+
+ nsjail - CPU affinity
+ -----------------------------------------
+
+ Copyright 2017 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "cpu.h"
+
+#include <inttypes.h>
+#include <sched.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "logs.h"
+#include "util.h"
+
+namespace cpu {
+
+static void setRandomCpu(cpu_set_t* mask, size_t mask_size, size_t cpu_num) {
+ if ((size_t)CPU_COUNT_S(mask_size, mask) >= cpu_num) {
+ LOG_F(
+ "Number of CPUs in the mask '%d' is bigger than number of available CPUs '%zu'",
+ CPU_COUNT_S(mask_size, mask), cpu_num);
+ }
+
+ for (;;) {
+ uint64_t n = util::rnd64() % cpu_num;
+ if (!CPU_ISSET_S(n, mask_size, mask)) {
+ LOG_D("Setting allowed CPU#:%" PRIu64 " of [0-%zu]", n, cpu_num - 1);
+ CPU_SET_S(n, mask_size, mask);
+ break;
+ }
+ }
+}
+
+bool initCpu(nsjconf_t* nsjconf) {
+ if (nsjconf->num_cpus < 0) {
+ PLOG_W("sysconf(_SC_NPROCESSORS_ONLN) returned %ld", nsjconf->num_cpus);
+ return false;
+ }
+ if (nsjconf->max_cpus > (size_t)nsjconf->num_cpus) {
+ LOG_W("Requested number of CPUs:%zu is bigger than CPUs online:%ld",
+ nsjconf->max_cpus, nsjconf->num_cpus);
+ return true;
+ }
+ if (nsjconf->max_cpus == (size_t)nsjconf->num_cpus) {
+ LOG_D("All CPUs requested (%zu of %ld)", nsjconf->max_cpus, nsjconf->num_cpus);
+ return true;
+ }
+ if (nsjconf->max_cpus == 0) {
+ LOG_D("No max_cpus limit set");
+ return true;
+ }
+
+ cpu_set_t* mask = CPU_ALLOC(nsjconf->num_cpus);
+ if (mask == NULL) {
+ PLOG_W("Failure allocating cpu_set_t for %ld CPUs", nsjconf->num_cpus);
+ return false;
+ }
+
+ size_t mask_size = CPU_ALLOC_SIZE(nsjconf->num_cpus);
+ CPU_ZERO_S(mask_size, mask);
+
+ for (size_t i = 0; i < nsjconf->max_cpus; i++) {
+ setRandomCpu(mask, mask_size, nsjconf->num_cpus);
+ }
+
+ if (sched_setaffinity(0, mask_size, mask) == -1) {
+ PLOG_W("sched_setaffinity(max_cpus=%zu) failed", nsjconf->max_cpus);
+ CPU_FREE(mask);
+ return false;
+ }
+ CPU_FREE(mask);
+
+ return true;
+}
+
+} // namespace cpu
diff --git a/cpu.h b/cpu.h
new file mode 100644
index 0000000..d6346dc
--- /dev/null
+++ b/cpu.h
@@ -0,0 +1,35 @@
+/*
+
+ nsjail - CPU affinity
+ -----------------------------------------
+
+ Copyright 2017 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_CPU_H
+#define NS_CPU_H
+
+#include <stdbool.h>
+
+#include "nsjail.h"
+
+namespace cpu {
+
+bool initCpu(nsjconf_t* nsjconf);
+
+} // namespace cpu
+
+#endif /* NS_CPU_H */
diff --git a/logs.cc b/logs.cc
new file mode 100644
index 0000000..70eca39
--- /dev/null
+++ b/logs.cc
@@ -0,0 +1,161 @@
+/*
+
+ nsjail - logging
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "logs.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <limits.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "macros.h"
+#include "util.h"
+
+#include <string>
+
+namespace logs {
+
+static int _log_fd = STDERR_FILENO;
+static bool _log_fd_isatty = true;
+static enum llevel_t _log_level = INFO;
+static bool _log_set = false;
+
+__attribute__((constructor)) static void log_init(void) {
+ _log_fd = fcntl(_log_fd, F_DUPFD_CLOEXEC, 0);
+ if (_log_fd == -1) {
+ _log_fd = STDERR_FILENO;
+ }
+ _log_fd_isatty = isatty(_log_fd);
+}
+
+bool logSet() {
+ return _log_set;
+}
+
+/*
+ * Log to stderr by default. Use a dup()d fd, because in the future we'll associate the
+ * connection socket with fd (0, 1, 2).
+ */
+
+void logLevel(enum llevel_t ll) {
+ _log_level = ll;
+}
+
+void logFile(const std::string& logfile) {
+ _log_set = true;
+ /* Close previous log_fd */
+ if (_log_fd > STDERR_FILENO) {
+ close(_log_fd);
+ _log_fd = STDERR_FILENO;
+ }
+ if (TEMP_FAILURE_RETRY(_log_fd = open(logfile.c_str(),
+ O_CREAT | O_RDWR | O_APPEND | O_CLOEXEC, 0640)) == -1) {
+ _log_fd = STDERR_FILENO;
+ PLOG_W("Couldn't open logfile open('%s')", logfile.c_str());
+ }
+ _log_fd_isatty = (isatty(_log_fd) == 1);
+}
+
+void logMsg(enum llevel_t ll, const char* fn, int ln, bool perr, const char* fmt, ...) {
+ if (ll < _log_level) {
+ return;
+ }
+
+ char strerr[512];
+ if (perr) {
+ snprintf(strerr, sizeof(strerr), "%s", strerror(errno));
+ }
+ struct {
+ const char* const descr;
+ const char* const prefix;
+ const bool print_funcline;
+ const bool print_time;
+ } static const logLevels[] = {
+ {"D", "\033[0;4m", true, true},
+ {"I", "\033[1m", false, true},
+ {"W", "\033[0;33m", true, true},
+ {"E", "\033[1;31m", true, true},
+ {"F", "\033[7;35m", true, true},
+ {"HR", "\033[0m", false, false},
+ {"HB", "\033[1m", false, false},
+ };
+
+ /* Start printing logs */
+ std::string msg;
+ if (_log_fd_isatty) {
+ msg.append(logLevels[ll].prefix);
+ }
+ msg.append("[").append(logLevels[ll].descr).append("]");
+ if (logLevels[ll].print_time) {
+ msg.append("[").append(util::timeToStr(time(NULL))).append("]");
+ }
+ if (logLevels[ll].print_funcline) {
+ msg.append("[")
+ .append(std::to_string(getpid()))
+ .append("] ")
+ .append(fn)
+ .append("():")
+ .append(std::to_string(ln));
+ }
+
+ char* strp;
+ va_list args;
+ va_start(args, fmt);
+ int ret = vasprintf(&strp, fmt, args);
+ va_end(args);
+ if (ret == -1) {
+ msg.append(" [logs internal]: MEMORY ALLOCATION ERROR");
+ } else {
+ msg.append(" ").append(strp);
+ free(strp);
+ }
+ if (perr) {
+ msg.append(": ").append(strerr);
+ }
+ if (_log_fd_isatty) {
+ msg.append("\033[0m");
+ }
+ msg.append("\n");
+ /* End printing logs */
+
+ if (write(_log_fd, msg.c_str(), msg.size()) == -1) {
+ dprintf(_log_fd, "%s", msg.c_str());
+ }
+
+ if (ll == FATAL) {
+ exit(0xff);
+ }
+}
+
+void logStop(int sig) {
+ LOG_I("Server stops due to fatal signal (%d) caught. Exiting", sig);
+}
+
+} // namespace logs
diff --git a/logs.h b/logs.h
new file mode 100644
index 0000000..27727ef
--- /dev/null
+++ b/logs.h
@@ -0,0 +1,67 @@
+/*
+
+ nsjail - logging
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_LOGS_H
+#define NS_LOGS_H
+
+#include <getopt.h>
+#include <stdbool.h>
+
+#include <string>
+
+namespace logs {
+
+#define LOG_HELP(...) logs::logMsg(logs::HELP, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+#define LOG_HELP_BOLD(...) \
+ logs::logMsg(logs::HELP_BOLD, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+
+#define LOG_D(...) logs::logMsg(logs::DEBUG, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+#define LOG_I(...) logs::logMsg(logs::INFO, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+#define LOG_W(...) logs::logMsg(logs::WARNING, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+#define LOG_E(...) logs::logMsg(logs::ERROR, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+#define LOG_F(...) logs::logMsg(logs::FATAL, __PRETTY_FUNCTION__, __LINE__, false, __VA_ARGS__);
+
+#define PLOG_D(...) logs::logMsg(logs::DEBUG, __PRETTY_FUNCTION__, __LINE__, true, __VA_ARGS__);
+#define PLOG_I(...) logs::logMsg(logs::INFO, __PRETTY_FUNCTION__, __LINE__, true, __VA_ARGS__);
+#define PLOG_W(...) logs::logMsg(logs::WARNING, __PRETTY_FUNCTION__, __LINE__, true, __VA_ARGS__);
+#define PLOG_E(...) logs::logMsg(logs::ERROR, __PRETTY_FUNCTION__, __LINE__, true, __VA_ARGS__);
+#define PLOG_F(...) logs::logMsg(logs::FATAL, __PRETTY_FUNCTION__, __LINE__, true, __VA_ARGS__);
+
+enum llevel_t {
+ DEBUG = 0,
+ INFO,
+ WARNING,
+ ERROR,
+ FATAL,
+ HELP,
+ HELP_BOLD,
+};
+
+void logMsg(enum llevel_t ll, const char* fn, int ln, bool perr, const char* fmt, ...)
+ __attribute__((format(printf, 5, 6)));
+void logStop(int sig);
+void logLevel(enum llevel_t ll);
+void logFile(const std::string& logfile);
+bool logSet();
+
+} // namespace logs
+
+#endif /* NS_LOGS_H */
diff --git a/macros.h b/macros.h
new file mode 100644
index 0000000..d29b03b
--- /dev/null
+++ b/macros.h
@@ -0,0 +1,71 @@
+/*
+
+ nsjail - common macros
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_COMMON_H
+#define NS_COMMON_H
+
+#include <unistd.h>
+
+#if !defined(TEMP_FAILURE_RETRY)
+#define TEMP_FAILURE_RETRY(expression) \
+ (__extension__({ \
+ long int __result; \
+ do \
+ __result = (long int)(expression); \
+ while (__result == -1L && errno == EINTR); \
+ __result; \
+ }))
+#endif /* !defined(TEMP_FAILURE_RETRY) */
+
+#if !defined(ARR_SZ)
+#define ARR_SZ(array) (sizeof(array) / sizeof(*array))
+#endif /* !defined(ARR_SZ) */
+#define UNUSED __attribute__((unused))
+
+#if 0 /* Works, but needs -fblocks and libBlocksRuntime with clang */
+/* Go-style defer implementation */
+#define __STRMERGE(a, b) a##b
+#define _STRMERGE(a, b) __STRMERGE(a, b)
+
+#ifdef __clang__
+static void __attribute__ ((unused)) __clang_cleanup_func(void (^*dfunc) (void))
+{
+ (*dfunc) ();
+}
+
+#define defer \
+ void (^_STRMERGE(__defer_f_, __COUNTER__))(void) \
+ __attribute__((cleanup(__clang_cleanup_func))) __attribute__((unused)) = ^
+#else
+#define __block
+#define _DEFER(a, count) \
+ auto void _STRMERGE(__defer_f_, count)(void* _defer_arg __attribute__((unused))); \
+ int _STRMERGE(__defer_var_, count) __attribute__((cleanup(_STRMERGE(__defer_f_, count)))) \
+ __attribute__((unused)); \
+ void _STRMERGE(__defer_f_, count)(void* _defer_arg __attribute__((unused)))
+#define defer _DEFER(a, __COUNTER__)
+#endif
+#endif
+
+#define NS_VALSTR_STRUCT(x) \
+ { x, #x }
+
+#endif /* NS_COMMON_H */
diff --git a/mnt.cc b/mnt.cc
new file mode 100644
index 0000000..4da78d6
--- /dev/null
+++ b/mnt.cc
@@ -0,0 +1,573 @@
+/*
+
+ nsjail - CLONE_NEWNS routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "mnt.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/sched.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/statvfs.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include <memory>
+#include <string>
+
+#include "logs.h"
+#include "macros.h"
+#include "subproc.h"
+#include "util.h"
+
+namespace mnt {
+
+#if !defined(MS_LAZYTIME)
+#define MS_LAZYTIME (1 << 25)
+#endif /* if !defined(MS_LAZYTIME) */
+
+static const std::string flagsToStr(uintptr_t flags) {
+ std::string res;
+
+ struct {
+ const uintptr_t flag;
+ const char* const name;
+ } static const mountFlags[] = {
+ NS_VALSTR_STRUCT(MS_RDONLY),
+ NS_VALSTR_STRUCT(MS_NOSUID),
+ NS_VALSTR_STRUCT(MS_NODEV),
+ NS_VALSTR_STRUCT(MS_NOEXEC),
+ NS_VALSTR_STRUCT(MS_SYNCHRONOUS),
+ NS_VALSTR_STRUCT(MS_REMOUNT),
+ NS_VALSTR_STRUCT(MS_MANDLOCK),
+ NS_VALSTR_STRUCT(MS_DIRSYNC),
+ NS_VALSTR_STRUCT(MS_NOATIME),
+ NS_VALSTR_STRUCT(MS_NODIRATIME),
+ NS_VALSTR_STRUCT(MS_BIND),
+ NS_VALSTR_STRUCT(MS_MOVE),
+ NS_VALSTR_STRUCT(MS_REC),
+ NS_VALSTR_STRUCT(MS_SILENT),
+ NS_VALSTR_STRUCT(MS_POSIXACL),
+ NS_VALSTR_STRUCT(MS_UNBINDABLE),
+ NS_VALSTR_STRUCT(MS_PRIVATE),
+ NS_VALSTR_STRUCT(MS_SLAVE),
+ NS_VALSTR_STRUCT(MS_SHARED),
+ NS_VALSTR_STRUCT(MS_RELATIME),
+ NS_VALSTR_STRUCT(MS_KERNMOUNT),
+ NS_VALSTR_STRUCT(MS_I_VERSION),
+ NS_VALSTR_STRUCT(MS_STRICTATIME),
+ NS_VALSTR_STRUCT(MS_LAZYTIME),
+ };
+
+ uintptr_t knownFlagMask = 0U;
+ for (const auto& i : mountFlags) {
+ if (flags & i.flag) {
+ if (!res.empty()) {
+ res.append("|");
+ }
+ res.append(i.name);
+ }
+ knownFlagMask |= i.flag;
+ }
+
+ if (flags & ~(knownFlagMask)) {
+ util::StrAppend(&res, "|%#tx", flags & ~(knownFlagMask));
+ }
+
+ return res;
+}
+
+static bool isDir(const char* path) {
+ /*
+ * If the source dir is NULL, we assume it's a dir (for /proc and tmpfs)
+ */
+ if (path == NULL) {
+ return true;
+ }
+ struct stat st;
+ if (stat(path, &st) == -1) {
+ PLOG_D("stat('%s')", path);
+ return false;
+ }
+ if (S_ISDIR(st.st_mode)) {
+ return true;
+ }
+ return false;
+}
+
+static bool mountPt(mount_t* mpt, const char* newroot, const char* tmpdir) {
+ LOG_D("Mounting '%s'", describeMountPt(*mpt).c_str());
+
+ char dstpath[PATH_MAX];
+ snprintf(dstpath, sizeof(dstpath), "%s/%s", newroot, mpt->dst.c_str());
+
+ char srcpath[PATH_MAX];
+ if (!mpt->src.empty()) {
+ snprintf(srcpath, sizeof(srcpath), "%s", mpt->src.c_str());
+ } else {
+ snprintf(srcpath, sizeof(srcpath), "none");
+ }
+
+ if (!util::createDirRecursively(dstpath)) {
+ LOG_W("Couldn't create upper directories for '%s'", dstpath);
+ return false;
+ }
+
+ if (mpt->is_symlink) {
+ LOG_D("symlink('%s', '%s')", srcpath, dstpath);
+ if (symlink(srcpath, dstpath) == -1) {
+ if (mpt->is_mandatory) {
+ PLOG_W("symlink('%s', '%s')", srcpath, dstpath);
+ return false;
+ } else {
+ PLOG_W("symlink('%s', '%s'), but it's not mandatory, continuing",
+ srcpath, dstpath);
+ }
+ }
+ return true;
+ }
+
+ if (mpt->is_dir) {
+ if (mkdir(dstpath, 0711) == -1 && errno != EEXIST) {
+ PLOG_W("mkdir('%s')", dstpath);
+ }
+ } else {
+ int fd = TEMP_FAILURE_RETRY(open(dstpath, O_CREAT | O_RDONLY | O_CLOEXEC, 0644));
+ if (fd >= 0) {
+ close(fd);
+ } else {
+ PLOG_W("open('%s', O_CREAT|O_RDONLY|O_CLOEXEC, 0644)", dstpath);
+ }
+ }
+
+ if (!mpt->src_content.empty()) {
+ static uint64_t df_counter = 0;
+ snprintf(
+ srcpath, sizeof(srcpath), "%s/dynamic_file.%" PRIu64, tmpdir, ++df_counter);
+ int fd = TEMP_FAILURE_RETRY(
+ open(srcpath, O_CREAT | O_EXCL | O_CLOEXEC | O_WRONLY, 0644));
+ if (fd < 0) {
+ PLOG_W("open('%s', O_CREAT|O_EXCL|O_CLOEXEC|O_WRONLY, 0644) failed", srcpath);
+ return false;
+ }
+ if (!util::writeToFd(fd, mpt->src_content.data(), mpt->src_content.length())) {
+ LOG_W("Writing %zu bytes to '%s' failed", mpt->src_content.length(),
+ srcpath);
+ close(fd);
+ return false;
+ }
+ close(fd);
+ mpt->flags |= (MS_BIND | MS_REC | MS_PRIVATE);
+ }
+
+ /*
+ * Initially mount it as RW, it will be remounted later on if needed
+ */
+ unsigned long flags = mpt->flags & ~(MS_RDONLY);
+ if (mount(srcpath, dstpath, mpt->fs_type.c_str(), flags, mpt->options.c_str()) == -1) {
+ if (errno == EACCES) {
+ PLOG_W(
+ "mount('%s') src:'%s' dstpath:'%s' failed. "
+ "Try fixing this problem by applying 'chmod o+x' to the '%s' "
+ "directory and its ancestors",
+ describeMountPt(*mpt).c_str(), srcpath, dstpath, srcpath);
+ } else {
+ PLOG_W("mount('%s') src:'%s' dstpath:'%s' failed",
+ describeMountPt(*mpt).c_str(), srcpath, dstpath);
+ if (mpt->fs_type.compare("proc") == 0) {
+ LOG_W(
+ "procfs can only be mounted if the original /proc doesn't have "
+ "any other file-systems mounted on top of it (e.g. /dev/null "
+ "on top of /proc/kcore)");
+ }
+ }
+ return false;
+ } else {
+ mpt->mounted = true;
+ }
+
+ if (!mpt->src_content.empty() && unlink(srcpath) == -1) {
+ PLOG_W("unlink('%s')", srcpath);
+ }
+ return true;
+}
+
+static bool remountPt(const mount_t& mpt) {
+ if (!mpt.mounted) {
+ return true;
+ }
+ if (mpt.is_symlink) {
+ return true;
+ }
+
+ struct statvfs vfs;
+ if (TEMP_FAILURE_RETRY(statvfs(mpt.dst.c_str(), &vfs)) == -1) {
+ PLOG_W("statvfs('%s')", mpt.dst.c_str());
+ return false;
+ }
+
+ struct {
+ const unsigned long mount_flag;
+ const unsigned long vfs_flag;
+ } static const mountPairs[] = {
+ {MS_NOSUID, ST_NOSUID},
+ {MS_NODEV, ST_NODEV},
+ {MS_NOEXEC, ST_NOEXEC},
+ {MS_SYNCHRONOUS, ST_SYNCHRONOUS},
+ {MS_MANDLOCK, ST_MANDLOCK},
+ {MS_NOATIME, ST_NOATIME},
+ {MS_NODIRATIME, ST_NODIRATIME},
+ {MS_RELATIME, ST_RELATIME},
+ };
+
+ const unsigned long per_mountpoint_flags =
+ MS_LAZYTIME | MS_MANDLOCK | MS_NOATIME | MS_NODEV | MS_NODIRATIME | MS_NOEXEC |
+ MS_NOSUID | MS_RELATIME | MS_RDONLY | MS_SYNCHRONOUS;
+ unsigned long new_flags = MS_REMOUNT | MS_BIND | (mpt.flags & per_mountpoint_flags);
+ for (const auto& i : mountPairs) {
+ if (vfs.f_flag & i.vfs_flag) {
+ new_flags |= i.mount_flag;
+ }
+ }
+
+ LOG_D("Re-mounting '%s' (flags:%s)", mpt.dst.c_str(), flagsToStr(new_flags).c_str());
+ if (mount(mpt.dst.c_str(), mpt.dst.c_str(), NULL, new_flags, 0) == -1) {
+ PLOG_W("mount('%s', flags:%s)", mpt.dst.c_str(), flagsToStr(new_flags).c_str());
+ return false;
+ }
+
+ return true;
+}
+
+static bool mkdirAndTest(const std::string& dir) {
+ if (mkdir(dir.c_str(), 0755) == -1 && errno != EEXIST) {
+ PLOG_D("Couldn't create '%s' directory", dir.c_str());
+ return false;
+ }
+ if (access(dir.c_str(), R_OK) == -1) {
+ PLOG_W("access('%s', R_OK)", dir.c_str());
+ return false;
+ }
+ LOG_D("Created accessible directory in '%s'", dir.c_str());
+ return true;
+}
+
+static std::unique_ptr<std::string> getDir(nsjconf_t* nsjconf, const char* name) {
+ std::unique_ptr<std::string> dir(new std::string);
+
+ dir->assign("/run/user/")
+ .append("/nsjail.")
+ .append(std::to_string(nsjconf->orig_uid))
+ .append(".")
+ .append(name);
+ if (mkdirAndTest(*dir)) {
+ return dir;
+ }
+ dir->assign("/tmp/nsjail.")
+ .append(std::to_string(nsjconf->orig_uid))
+ .append(".")
+ .append(name);
+ if (mkdirAndTest(*dir)) {
+ return dir;
+ }
+ const char* tmp = getenv("TMPDIR");
+ if (tmp) {
+ dir->assign(tmp)
+ .append("/")
+ .append("nsjail.")
+ .append(std::to_string(nsjconf->orig_uid))
+ .append(".")
+ .append(name);
+ if (mkdirAndTest(*dir)) {
+ return dir;
+ }
+ }
+ dir->assign("/dev/shm/nsjail.")
+ .append(std::to_string(nsjconf->orig_uid))
+ .append(".")
+ .append(name);
+ if (mkdirAndTest(*dir)) {
+ return dir;
+ }
+ dir->assign("/tmp/nsjail.")
+ .append(std::to_string(nsjconf->orig_uid))
+ .append(".")
+ .append(name)
+ .append(".")
+ .append(std::to_string(util::rnd64()));
+ if (mkdirAndTest(*dir)) {
+ return dir;
+ }
+
+ LOG_E("Couldn't create tmp directory of type '%s'", name);
+ return nullptr;
+}
+
+static bool initNsInternal(nsjconf_t* nsjconf) {
+ /*
+ * If CLONE_NEWNS is not used, we would be changing the global mount namespace, so simply
+ * use --chroot in this case
+ */
+ if (!nsjconf->clone_newns) {
+ if (nsjconf->chroot.empty()) {
+ LOG_E(
+ "--chroot was not specified, and it's required when not using "
+ "CLONE_NEWNS");
+ return false;
+ }
+ if (chroot(nsjconf->chroot.c_str()) == -1) {
+ PLOG_E("chroot('%s')", nsjconf->chroot.c_str());
+ return false;
+ }
+ if (chdir("/") == -1) {
+ PLOG_E("chdir('/')");
+ return false;
+ }
+ return true;
+ }
+
+ if (chdir("/") == -1) {
+ PLOG_E("chdir('/')");
+ return false;
+ }
+
+ std::unique_ptr<std::string> destdir = getDir(nsjconf, "root");
+ if (!destdir) {
+ LOG_E("Couldn't obtain root mount directories");
+ return false;
+ }
+
+ /* Make changes to / (recursively) private, to avoid changing the global mount ns */
+ if (mount("/", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
+ PLOG_E("mount('/', '/', NULL, MS_REC|MS_PRIVATE, NULL)");
+ return false;
+ }
+ if (mount(NULL, destdir->c_str(), "tmpfs", 0, "size=16777216") == -1) {
+ PLOG_E("mount('%s', 'tmpfs')", destdir->c_str());
+ return false;
+ }
+
+ std::unique_ptr<std::string> tmpdir = getDir(nsjconf, "tmp");
+ if (!tmpdir) {
+ LOG_E("Couldn't obtain temporary mount directories");
+ return false;
+ }
+ if (mount(NULL, tmpdir->c_str(), "tmpfs", 0, "size=16777216") == -1) {
+ PLOG_E("mount('%s', 'tmpfs')", tmpdir->c_str());
+ return false;
+ }
+
+ for (auto& p : nsjconf->mountpts) {
+ if (!mountPt(&p, destdir->c_str(), tmpdir->c_str()) && p.is_mandatory) {
+ return false;
+ }
+ }
+
+ if (umount2(tmpdir->c_str(), MNT_DETACH) == -1) {
+ PLOG_E("umount2('%s', MNT_DETACH)", tmpdir->c_str());
+ return false;
+ }
+ /*
+ * This requires some explanation: It's actually possible to pivot_root('/', '/'). After
+ * this operation has been completed, the old root is mounted over the new root, and it's OK
+ * to simply umount('/') now, and to have new_root as '/'. This allows us not to care about
+ * providing any special directory for old_root, which is sometimes not easy, given that
+ * e.g. /tmp might not always be present inside new_root
+ */
+ if (syscall(__NR_pivot_root, destdir->c_str(), destdir->c_str()) == -1) {
+ PLOG_E("pivot_root('%s', '%s')", destdir->c_str(), destdir->c_str());
+ return false;
+ }
+
+ if (umount2("/", MNT_DETACH) == -1) {
+ PLOG_E("umount2('/', MNT_DETACH)");
+ return false;
+ }
+ if (chdir(nsjconf->cwd.c_str()) == -1) {
+ PLOG_E("chdir('%s')", nsjconf->cwd.c_str());
+ return false;
+ }
+
+ for (const auto& p : nsjconf->mountpts) {
+ if (!remountPt(p) && p.is_mandatory) {
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/*
+ * With mode MODE_STANDALONE_EXECVE it's required to mount /proc inside a new process,
+ * as the current process is still in the original PID namespace (man pid_namespaces)
+ */
+bool initNs(nsjconf_t* nsjconf) {
+ if (nsjconf->mode != MODE_STANDALONE_EXECVE) {
+ return initNsInternal(nsjconf);
+ }
+
+ pid_t pid = subproc::cloneProc(CLONE_FS | SIGCHLD);
+ if (pid == -1) {
+ return false;
+ }
+
+ if (pid == 0) {
+ exit(initNsInternal(nsjconf) ? 0 : 0xff);
+ }
+
+ int status;
+ while (wait4(pid, &status, 0, NULL) != pid)
+ ;
+ if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
+ return true;
+ }
+ return false;
+}
+
+static bool addMountPt(mount_t* mnt, const std::string& src, const std::string& dst,
+ const std::string& fstype, const std::string& options, uintptr_t flags, isDir_t is_dir,
+ bool is_mandatory, const std::string& src_env, const std::string& dst_env,
+ const std::string& src_content, bool is_symlink) {
+ if (!src_env.empty()) {
+ const char* e = getenv(src_env.c_str());
+ if (e == NULL) {
+ LOG_W("No such envvar:'%s'", src_env.c_str());
+ return false;
+ }
+ mnt->src = e;
+ }
+ mnt->src.append(src);
+
+ if (!dst_env.empty()) {
+ const char* e = getenv(dst_env.c_str());
+ if (e == NULL) {
+ LOG_W("No such envvar:'%s'", dst_env.c_str());
+ return false;
+ }
+ mnt->dst = e;
+ }
+ mnt->dst.append(dst);
+
+ mnt->fs_type = fstype;
+ mnt->options = options;
+ mnt->flags = flags;
+ mnt->is_symlink = is_symlink;
+ mnt->is_mandatory = is_mandatory;
+ mnt->mounted = false;
+ mnt->src_content = src_content;
+
+ switch (is_dir) {
+ case NS_DIR_YES:
+ mnt->is_dir = true;
+ break;
+ case NS_DIR_NO:
+ mnt->is_dir = false;
+ break;
+ case NS_DIR_MAYBE: {
+ if (!src_content.empty()) {
+ mnt->is_dir = false;
+ } else if (mnt->src.empty()) {
+ mnt->is_dir = true;
+ } else if (mnt->flags & MS_BIND) {
+ mnt->is_dir = mnt::isDir(mnt->src.c_str());
+ } else {
+ mnt->is_dir = true;
+ }
+ } break;
+ default:
+ LOG_E("Unknown is_dir value: %d", is_dir);
+ return false;
+ }
+
+ return true;
+}
+
+bool addMountPtHead(nsjconf_t* nsjconf, const std::string& src, const std::string& dst,
+ const std::string& fstype, const std::string& options, uintptr_t flags, isDir_t is_dir,
+ bool is_mandatory, const std::string& src_env, const std::string& dst_env,
+ const std::string& src_content, bool is_symlink) {
+ mount_t mnt;
+ if (!addMountPt(&mnt, src, dst, fstype, options, flags, is_dir, is_mandatory, src_env,
+ dst_env, src_content, is_symlink)) {
+ return false;
+ }
+ nsjconf->mountpts.insert(nsjconf->mountpts.begin(), mnt);
+ return true;
+}
+
+bool addMountPtTail(nsjconf_t* nsjconf, const std::string& src, const std::string& dst,
+ const std::string& fstype, const std::string& options, uintptr_t flags, isDir_t is_dir,
+ bool is_mandatory, const std::string& src_env, const std::string& dst_env,
+ const std::string& src_content, bool is_symlink) {
+ mount_t mnt;
+ if (!addMountPt(&mnt, src, dst, fstype, options, flags, is_dir, is_mandatory, src_env,
+ dst_env, src_content, is_symlink)) {
+ return false;
+ }
+ nsjconf->mountpts.push_back(mnt);
+ return true;
+}
+
+const std::string describeMountPt(const mount_t& mpt) {
+ std::string descr;
+
+ descr.append(mpt.src.empty() ? "" : "'")
+ .append(mpt.src.empty() ? "" : mpt.src)
+ .append(mpt.src.empty() ? "" : "' -> ")
+ .append("'")
+ .append(mpt.dst)
+ .append("' flags:")
+ .append(flagsToStr(mpt.flags))
+ .append(" type:'")
+ .append(mpt.fs_type)
+ .append("' options:'")
+ .append(mpt.options)
+ .append("'");
+
+ if (mpt.is_dir) {
+ descr.append(" is_dir:true");
+ } else {
+ descr.append(" is_dir:false");
+ }
+ if (!mpt.is_mandatory) {
+ descr.append(" mandatory:false");
+ }
+ if (!mpt.src_content.empty()) {
+ descr.append(" src_content_len:").append(std::to_string(mpt.src_content.length()));
+ }
+ if (mpt.is_symlink) {
+ descr.append(" symlink:true");
+ }
+
+ return descr;
+}
+
+} // namespace mnt
diff --git a/mnt.h b/mnt.h
new file mode 100644
index 0000000..9b177f9
--- /dev/null
+++ b/mnt.h
@@ -0,0 +1,53 @@
+/*
+
+ nsjail - CLONE_NEWNS routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_MNT_H
+#define NS_MNT_H
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include <string>
+
+#include "nsjail.h"
+
+namespace mnt {
+
+typedef enum {
+ NS_DIR_NO = 0x100,
+ NS_DIR_YES,
+ NS_DIR_MAYBE,
+} isDir_t;
+
+bool initNs(nsjconf_t* nsjconf);
+bool addMountPtHead(nsjconf_t* nsjconf, const std::string& src, const std::string& dst,
+ const std::string& fstype, const std::string& options, uintptr_t flags, isDir_t is_dir,
+ bool is_mandatory, const std::string& src_env, const std::string& dst_env,
+ const std::string& src_content, bool is_symlink);
+bool addMountPtTail(nsjconf_t* nsjconf, const std::string& src, const std::string& dst,
+ const std::string& fstype, const std::string& options, uintptr_t flags, isDir_t is_dir,
+ bool is_mandatory, const std::string& src_env, const std::string& dst_env,
+ const std::string& src_content, bool is_symlink);
+const std::string describeMountPt(const mount_t& mpt);
+
+} // namespace mnt
+
+#endif /* NS_MNT_H */
diff --git a/net.cc b/net.cc
new file mode 100644
index 0000000..fb78e9b
--- /dev/null
+++ b/net.cc
@@ -0,0 +1,498 @@
+/*
+
+ nsjail - networking routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "net.h"
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <net/if.h>
+#include <net/route.h>
+#include <netinet/in.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <strings.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <string>
+
+#include "logs.h"
+#include "subproc.h"
+
+extern char** environ;
+
+namespace net {
+
+#define IFACE_NAME "vs"
+
+#if defined(NSJAIL_NL3_WITH_MACVLAN)
+#include <netlink/route/link.h>
+#include <netlink/route/link/macvlan.h>
+
+static bool cloneIface(
+ nsjconf_t* nsjconf, struct nl_sock* sk, struct nl_cache* link_cache, int pid) {
+ struct rtnl_link* rmv = rtnl_link_macvlan_alloc();
+ if (rmv == NULL) {
+ LOG_E("rtnl_link_macvlan_alloc()");
+ return false;
+ }
+
+ int err;
+ int master_index = rtnl_link_name2i(link_cache, nsjconf->iface_vs.c_str());
+ if (!master_index) {
+ LOG_E("rtnl_link_name2i(): Did not find '%s' interface", nsjconf->iface_vs.c_str());
+ rtnl_link_put(rmv);
+ return false;
+ }
+
+ rtnl_link_set_name(rmv, IFACE_NAME);
+ rtnl_link_set_link(rmv, master_index);
+ rtnl_link_set_ns_pid(rmv, pid);
+
+ if (nsjconf->iface_vs_ma != "") {
+ struct nl_addr* nladdr = nullptr;
+ if ((err = nl_addr_parse(nsjconf->iface_vs_ma.c_str(), AF_LLC, &nladdr)) < 0) {
+ LOG_E("nl_addr_parse('%s', AF_LLC) failed: %s",
+ nsjconf->iface_vs_ma.c_str(), nl_geterror(err));
+ return false;
+ }
+ rtnl_link_set_addr(rmv, nladdr);
+ nl_addr_put(nladdr);
+ }
+
+ if ((err = rtnl_link_add(sk, rmv, NLM_F_CREATE)) < 0) {
+ LOG_E("rtnl_link_add(name:'%s' link:'%s'): %s", IFACE_NAME,
+ nsjconf->iface_vs.c_str(), nl_geterror(err));
+ rtnl_link_put(rmv);
+ return false;
+ }
+
+ rtnl_link_put(rmv);
+ return true;
+}
+
+static bool moveToNs(
+ const std::string& iface, struct nl_sock* sk, struct nl_cache* link_cache, pid_t pid) {
+ LOG_D("Moving interface '%s' into netns=%d", iface.c_str(), (int)pid);
+
+ struct rtnl_link* orig_link = rtnl_link_get_by_name(link_cache, iface.c_str());
+ if (!orig_link) {
+ LOG_E("Couldn't find interface '%s'", iface.c_str());
+ return false;
+ }
+ struct rtnl_link* new_link = rtnl_link_alloc();
+ if (!new_link) {
+ LOG_E("Couldn't allocate new link");
+ rtnl_link_put(orig_link);
+ return false;
+ }
+
+ rtnl_link_set_ns_pid(new_link, pid);
+
+ int err = rtnl_link_change(sk, orig_link, new_link, RTM_SETLINK);
+ if (err < 0) {
+ LOG_E("rtnl_link_change(): set NS of interface '%s' to PID=%d: %s", iface.c_str(),
+ (int)pid, nl_geterror(err));
+ rtnl_link_put(new_link);
+ rtnl_link_put(orig_link);
+ return false;
+ }
+
+ rtnl_link_put(new_link);
+ rtnl_link_put(orig_link);
+ return true;
+}
+
+bool initNsFromParent(nsjconf_t* nsjconf, int pid) {
+ if (!nsjconf->clone_newnet) {
+ return true;
+ }
+ struct nl_sock* sk = nl_socket_alloc();
+ if (!sk) {
+ LOG_E("Could not allocate socket with nl_socket_alloc()");
+ return false;
+ }
+
+ int err;
+ if ((err = nl_connect(sk, NETLINK_ROUTE)) < 0) {
+ LOG_E("Unable to connect socket: %s", nl_geterror(err));
+ nl_socket_free(sk);
+ return false;
+ }
+
+ struct nl_cache* link_cache;
+ if ((err = rtnl_link_alloc_cache(sk, AF_UNSPEC, &link_cache)) < 0) {
+ LOG_E("rtnl_link_alloc_cache(): %s", nl_geterror(err));
+ nl_socket_free(sk);
+ return false;
+ }
+
+ for (const auto& iface : nsjconf->ifaces) {
+ if (!moveToNs(iface, sk, link_cache, pid)) {
+ nl_cache_free(link_cache);
+ nl_socket_free(sk);
+ return false;
+ }
+ }
+ if (!nsjconf->iface_vs.empty() && !cloneIface(nsjconf, sk, link_cache, pid)) {
+ nl_cache_free(link_cache);
+ nl_socket_free(sk);
+ return false;
+ }
+
+ nl_cache_free(link_cache);
+ nl_socket_free(sk);
+ return true;
+}
+#else // defined(NSJAIL_NL3_WITH_MACVLAN)
+
+static bool moveToNs(const std::string& iface, pid_t pid) {
+ const std::vector<std::string> argv{
+ "/sbin/ip", "link", "set", iface, "netns", std::to_string(pid)};
+ if (subproc::systemExe(argv, environ) != 0) {
+ LOG_E("Couldn't put interface '%s' into NET ns of the PID=%d", iface.c_str(),
+ (int)pid);
+ return false;
+ }
+ return true;
+}
+
+bool initNsFromParent(nsjconf_t* nsjconf, int pid) {
+ if (!nsjconf->clone_newnet) {
+ return true;
+ }
+ for (const auto& iface : nsjconf->ifaces) {
+ if (!moveToNs(iface, pid)) {
+ return false;
+ }
+ }
+ if (nsjconf->iface_vs.empty()) {
+ return true;
+ }
+
+ LOG_D("Putting iface:'%s' into namespace of PID:%d (with /sbin/ip)",
+ nsjconf->iface_vs.c_str(), pid);
+
+ std::vector<std::string> argv;
+
+ if (nsjconf->iface_vs_ma != "") {
+ argv = {"/sbin/ip", "link", "add", "link", nsjconf->iface_vs, "name", IFACE_NAME,
+ "netns", std::to_string(pid), "address", nsjconf->iface_vs_ma, "type",
+ "macvlan", "mode", "bridge"};
+ } else {
+ argv = {"/sbin/ip", "link", "add", "link", nsjconf->iface_vs, "name", IFACE_NAME,
+ "netns", std::to_string(pid), "type", "macvlan", "mode", "bridge"};
+ }
+ if (subproc::systemExe(argv, environ) != 0) {
+ LOG_E("Couldn't create MACVLAN interface for '%s'", nsjconf->iface_vs.c_str());
+ return false;
+ }
+ return true;
+}
+#endif // defined(NSJAIL_NL3_WITH_MACVLAN)
+
+static bool isSocket(int fd) {
+ int optval;
+ socklen_t optlen = sizeof(optval);
+ int ret = getsockopt(fd, SOL_SOCKET, SO_TYPE, &optval, &optlen);
+ if (ret == -1) {
+ return false;
+ }
+ return true;
+}
+
+bool limitConns(nsjconf_t* nsjconf, int connsock) {
+ /* 0 means 'unlimited' */
+ if (nsjconf->max_conns_per_ip == 0) {
+ return true;
+ }
+
+ struct sockaddr_in6 addr;
+ auto connstr = connToText(connsock, true /* remote */, &addr);
+
+ unsigned cnt = 0;
+ for (const auto& pid : nsjconf->pids) {
+ if (memcmp(addr.sin6_addr.s6_addr, pid.remote_addr.sin6_addr.s6_addr,
+ sizeof(pid.remote_addr.sin6_addr.s6_addr)) == 0) {
+ cnt++;
+ }
+ }
+ if (cnt >= nsjconf->max_conns_per_ip) {
+ LOG_W("Rejecting connection from '%s', max_conns_per_ip limit reached: %u",
+ connstr.c_str(), nsjconf->max_conns_per_ip);
+ return false;
+ }
+
+ return true;
+}
+
+int getRecvSocket(const char* bindhost, int port) {
+ if (port < 1 || port > 65535) {
+ LOG_F(
+ "TCP port %d out of bounds (1 <= port <= 65535), specify one with --port "
+ "<port>",
+ port);
+ }
+
+ char bindaddr[128];
+ snprintf(bindaddr, sizeof(bindaddr), "%s", bindhost);
+ struct in_addr in4a;
+ if (inet_pton(AF_INET, bindaddr, &in4a) == 1) {
+ snprintf(bindaddr, sizeof(bindaddr), "::ffff:%s", bindhost);
+ LOG_D("Converting bind IPv4:'%s' to IPv6:'%s'", bindhost, bindaddr);
+ }
+
+ struct in6_addr in6a;
+ if (inet_pton(AF_INET6, bindaddr, &in6a) != 1) {
+ PLOG_E(
+ "Couldn't convert '%s' (orig:'%s') into AF_INET6 address", bindaddr, bindhost);
+ return -1;
+ }
+
+ int sockfd = socket(AF_INET6, SOCK_STREAM, 0);
+ if (sockfd == -1) {
+ PLOG_E("socket(AF_INET6)");
+ return -1;
+ }
+ int so = 1;
+ if (setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &so, sizeof(so)) == -1) {
+ PLOG_E("setsockopt(%d, SO_REUSEADDR)", sockfd);
+ return -1;
+ }
+ struct sockaddr_in6 addr = {
+ .sin6_family = AF_INET6,
+ .sin6_port = htons(port),
+ .sin6_flowinfo = 0,
+ .sin6_addr = in6a,
+ .sin6_scope_id = 0,
+ };
+ if (bind(sockfd, (struct sockaddr*)&addr, sizeof(addr)) == -1) {
+ close(sockfd);
+ PLOG_E("bind(host:[%s] (orig:'%s'), port:%d)", bindaddr, bindhost, port);
+ return -1;
+ }
+ if (listen(sockfd, SOMAXCONN) == -1) {
+ close(sockfd);
+ PLOG_E("listen(%d)", SOMAXCONN);
+ return -1;
+ }
+
+ auto connstr = connToText(sockfd, false /* remote */, NULL);
+ LOG_I("Listening on %s", connstr.c_str());
+
+ return sockfd;
+}
+
+int acceptConn(int listenfd) {
+ struct sockaddr_in6 cli_addr;
+ socklen_t socklen = sizeof(cli_addr);
+ int connfd = accept(listenfd, (struct sockaddr*)&cli_addr, &socklen);
+ if (connfd == -1) {
+ if (errno != EINTR) {
+ PLOG_E("accept(%d)", listenfd);
+ }
+ return -1;
+ }
+
+ auto connremotestr = connToText(connfd, true /* remote */, NULL);
+ auto connlocalstr = connToText(connfd, false /* remote */, NULL);
+ LOG_I("New connection from: %s on: %s", connremotestr.c_str(), connlocalstr.c_str());
+
+ return connfd;
+}
+
+const std::string connToText(int fd, bool remote, struct sockaddr_in6* addr_or_null) {
+ std::string res;
+
+ if (!isSocket(fd)) {
+ return "[STANDALONE MODE]";
+ }
+
+ struct sockaddr_in6 addr;
+ socklen_t addrlen = sizeof(addr);
+ if (remote) {
+ if (getpeername(fd, (struct sockaddr*)&addr, &addrlen) == -1) {
+ PLOG_W("getpeername(%d)", fd);
+ return "[unknown]";
+ }
+ } else {
+ if (getsockname(fd, (struct sockaddr*)&addr, &addrlen) == -1) {
+ PLOG_W("getsockname(%d)", fd);
+ return "[unknown]";
+ }
+ }
+
+ if (addr_or_null) {
+ memcpy(addr_or_null, &addr, sizeof(*addr_or_null));
+ }
+
+ char addrstr[128];
+ if (!inet_ntop(AF_INET6, addr.sin6_addr.s6_addr, addrstr, sizeof(addrstr))) {
+ PLOG_W("inet_ntop()");
+ snprintf(addrstr, sizeof(addrstr), "[unknown](%s)", strerror(errno));
+ }
+
+ res.append("[");
+ res.append(addrstr);
+ res.append("]:");
+ res.append(std::to_string(ntohs(addr.sin6_port)));
+ return res;
+}
+
+static bool ifaceUp(const char* ifacename) {
+ int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
+ if (sock == -1) {
+ PLOG_E("socket(AF_INET, SOCK_STREAM, IPPROTO_IP)");
+ return false;
+ }
+
+ struct ifreq ifr;
+ memset(&ifr, '\0', sizeof(ifr));
+ snprintf(ifr.ifr_name, IF_NAMESIZE, "%s", ifacename);
+
+ if (ioctl(sock, SIOCGIFFLAGS, &ifr) == -1) {
+ PLOG_E("ioctl(iface='%s', SIOCGIFFLAGS, IFF_UP)", ifacename);
+ close(sock);
+ return false;
+ }
+
+ ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
+
+ if (ioctl(sock, SIOCSIFFLAGS, &ifr) == -1) {
+ PLOG_E("ioctl(iface='%s', SIOCSIFFLAGS, IFF_UP|IFF_RUNNING)", ifacename);
+ close(sock);
+ return false;
+ }
+
+ close(sock);
+ return true;
+}
+
+static bool ifaceConfig(const std::string& iface, const std::string& ip, const std::string& mask,
+ const std::string& gw) {
+ int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
+ if (sock == -1) {
+ PLOG_E("socket(AF_INET, SOCK_STREAM, IPPROTO_IP)");
+ return false;
+ }
+
+ struct in_addr addr;
+ if (inet_pton(AF_INET, ip.c_str(), &addr) != 1) {
+ PLOG_E("Cannot convert '%s' into an IPv4 address", ip.c_str());
+ close(sock);
+ return false;
+ }
+ if (addr.s_addr == INADDR_ANY) {
+ LOG_D("IPv4 address for interface '%s' not set", iface.c_str());
+ close(sock);
+ return true;
+ }
+
+ struct ifreq ifr;
+ memset(&ifr, '\0', sizeof(ifr));
+ snprintf(ifr.ifr_name, IF_NAMESIZE, "%s", iface.c_str());
+ struct sockaddr_in* sa = (struct sockaddr_in*)(&ifr.ifr_addr);
+ sa->sin_family = AF_INET;
+ sa->sin_addr = addr;
+ if (ioctl(sock, SIOCSIFADDR, &ifr) == -1) {
+ PLOG_E("ioctl(iface='%s', SIOCSIFADDR, '%s')", iface.c_str(), ip.c_str());
+ close(sock);
+ return false;
+ }
+
+ if (inet_pton(AF_INET, mask.c_str(), &addr) != 1) {
+ PLOG_E("Cannot convert '%s' into an IPv4 netmask", mask.c_str());
+ close(sock);
+ return false;
+ }
+ sa->sin_family = AF_INET;
+ sa->sin_addr = addr;
+ if (ioctl(sock, SIOCSIFNETMASK, &ifr) == -1) {
+ PLOG_E("ioctl(iface='%s', SIOCSIFNETMASK, '%s')", iface.c_str(), mask.c_str());
+ close(sock);
+ return false;
+ }
+
+ if (!ifaceUp(iface.c_str())) {
+ close(sock);
+ return false;
+ }
+
+ if (inet_pton(AF_INET, gw.c_str(), &addr) != 1) {
+ PLOG_E("Cannot convert '%s' into an IPv4 GW address", gw.c_str());
+ close(sock);
+ return false;
+ }
+ if (addr.s_addr == INADDR_ANY) {
+ LOG_D("Gateway address for '%s' is not set", iface.c_str());
+ close(sock);
+ return true;
+ }
+
+ struct rtentry rt;
+ memset(&rt, '\0', sizeof(rt));
+ struct sockaddr_in* sdest = (struct sockaddr_in*)(&rt.rt_dst);
+ struct sockaddr_in* smask = (struct sockaddr_in*)(&rt.rt_genmask);
+ struct sockaddr_in* sgate = (struct sockaddr_in*)(&rt.rt_gateway);
+ sdest->sin_family = AF_INET;
+ sdest->sin_addr.s_addr = INADDR_ANY;
+ smask->sin_family = AF_INET;
+ smask->sin_addr.s_addr = INADDR_ANY;
+ sgate->sin_family = AF_INET;
+ sgate->sin_addr = addr;
+
+ rt.rt_flags = RTF_UP | RTF_GATEWAY;
+ char rt_dev[IF_NAMESIZE];
+ snprintf(rt_dev, sizeof(rt_dev), "%s", iface.c_str());
+ rt.rt_dev = rt_dev;
+
+ if (ioctl(sock, SIOCADDRT, &rt) == -1) {
+ PLOG_E("ioctl(SIOCADDRT, '%s')", gw.c_str());
+ close(sock);
+ return false;
+ }
+
+ close(sock);
+ return true;
+}
+
+bool initNsFromChild(nsjconf_t* nsjconf) {
+ if (!nsjconf->clone_newnet) {
+ return true;
+ }
+ if (nsjconf->iface_lo && !ifaceUp("lo")) {
+ return false;
+ }
+ if (!nsjconf->iface_vs.empty() && !ifaceConfig(IFACE_NAME, nsjconf->iface_vs_ip,
+ nsjconf->iface_vs_nm, nsjconf->iface_vs_gw)) {
+ return false;
+ }
+ return true;
+}
+
+} // namespace net
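Note on connToText(): the function renders a socket address as "[address]:port". The following is a minimal standalone sketch of just that formatting step, assuming an already filled-in sockaddr_in6. The names formatAddr6 and loopbackAddr6 are hypothetical and are not part of nsjail.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>

#include <cstdint>
#include <string>

// Hypothetical helper (not part of nsjail): formats an IPv6 socket address
// in the same "[address]:port" shape that connToText() produces.
std::string formatAddr6(const struct sockaddr_in6& addr) {
    char addrstr[INET6_ADDRSTRLEN];
    if (!inet_ntop(AF_INET6, &addr.sin6_addr, addrstr, sizeof(addrstr))) {
        return "[unknown]";
    }
    std::string res("[");
    res.append(addrstr);
    res.append("]:");
    // sin6_port is stored in network byte order, so convert before printing.
    res.append(std::to_string(ntohs(addr.sin6_port)));
    return res;
}

// Hypothetical helper: builds the loopback address ::1 with the given port.
struct sockaddr_in6 loopbackAddr6(uint16_t port) {
    struct sockaddr_in6 addr = {};
    addr.sin6_family = AF_INET6;
    addr.sin6_addr = in6addr_loopback;
    addr.sin6_port = htons(port);
    return addr;
}
```

For example, formatAddr6(loopbackAddr6(31337)) yields "[::1]:31337". The sketch sizes its buffer with INET6_ADDRSTRLEN, whereas the code above uses a fixed 128-byte buffer; both are large enough for any textual IPv6 address.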
diff --git a/net.h b/net.h
new file mode 100644
index 0000000..3056af1
--- /dev/null
+++ b/net.h
@@ -0,0 +1,43 @@
+/*
+
+ nsjail - networking routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_NET_H
+#define NS_NET_H
+
+#include <stdbool.h>
+#include <stddef.h>
+
+#include <string>
+
+#include "nsjail.h"
+
+namespace net {
+
+bool limitConns(nsjconf_t* nsjconf, int connsock);
+int getRecvSocket(const char* bindhost, int port);
+int acceptConn(int listenfd);
+const std::string connToText(int fd, bool remote, struct sockaddr_in6* addr_or_null);
+bool initNsFromParent(nsjconf_t* nsjconf, int pid);
+bool initNsFromChild(nsjconf_t* nsjconf);
+
+} // namespace net
+
+#endif /* NS_NET_H */
diff --git a/nsjail.1 b/nsjail.1
new file mode 100644
index 0000000..f8ffca5
--- /dev/null
+++ b/nsjail.1
@@ -0,0 +1,288 @@
+.TH NSJAIL "1" "August 2017" "nsjail" "User Commands"
+\"
+.SH NAME
+nsjail \- process isolation tool for Linux
+\"
+.SH SYNOPSIS
+\fInsjail\fP [options] \fB\-\-\fR path_to_command [args]
+\"
+.SH DESCRIPTION
+NsJail is a process isolation tool for Linux. It uses the Linux namespace subsystem, resource limits, and the seccomp-bpf syscall filters of the Linux kernel.
+\"
+.SH Options
+.TP
+\fB\-\-help\fR|\fB\-h\fR Print the help message
+.TP
+\fB\-\-mode\fR|\fB\-M\fR VALUE
+Execution mode (default: o [MODE_STANDALONE_ONCE]):
+.IP
+\fBl\fR: Wait for connections on a TCP port (specified with \fB\-\-port\fR) [MODE_LISTEN_TCP]
+.PP
+.IP
+\fBo\fR: Launch a single process on the console using clone/execve [MODE_STANDALONE_ONCE]
+.PP
+.IP
+\fBe\fR: Launch a single process on the console using execve [MODE_STANDALONE_EXECVE]
+.PP
+.IP
+\fBr\fR: Launch a single process on the console with clone/execve, keep doing it forever [MODE_STANDALONE_RERUN]
+.PP
+.TP
+\fB\-\-config\fR|\fB\-C\fR VALUE
+Configuration file in the config.proto ProtoBuf format (see configs/ directory for examples)
+.TP
+\fB\-\-exec_file\fR|\fB\-x\fR VALUE
+File to exec (default: argv[0])
+.TP
+\fB\-\-execute_fd\fR
+Use execveat() to execute a file descriptor instead of executing the binary path. In that case argv[0]/exec_file denotes a file path before mount namespacing
+.TP
+\fB\-\-chroot\fR|\fB\-c\fR VALUE
+Directory containing / of the jail (default: none)
+.TP
+\fB\-\-rw\fR
+Mount chroot dir (/) R/W (default: R/O)
+.TP
+\fB\-\-user\fR|\fB\-u\fR VALUE
+Username/uid of processes inside the jail (default: your current uid). You can also use inside_ns_uid:outside_ns_uid:count convention here. Can be specified multiple times
+.TP
+\fB\-\-group\fR|\fB\-g\fR VALUE
+Groupname/gid of processes inside the jail (default: your current gid). You can also use inside_ns_gid:outside_ns_gid:count convention here. Can be specified multiple times
+.TP
+\fB\-\-hostname\fR|\fB\-H\fR VALUE
+UTS name (hostname) of the jail (default: 'NSJAIL')
+.TP
+\fB\-\-cwd\fR|\fB\-D\fR VALUE
+Directory in the namespace in which the process will run (default: '/')
+.TP
+\fB\-\-port\fR|\fB\-p\fR VALUE
+TCP port to bind to (enables MODE_LISTEN_TCP) (default: 0)
+.TP
+\fB\-\-bindhost\fR VALUE
+IP address to bind the port to (only in [MODE_LISTEN_TCP]), (default: '::')
+.TP
+\fB\-\-max_conns_per_ip\fR|\fB\-i\fR VALUE
+Maximum number of connections per one IP (only in [MODE_LISTEN_TCP]), (default: 0 (unlimited))
+.TP
+\fB\-\-log\fR|\fB\-l\fR VALUE
+Log file (default: use log_fd)
+.TP
+\fB\-\-log_fd\fR|\fB\-L\fR VALUE
+Log FD (default: 2)
+.TP
+\fB\-\-time_limit\fR|\fB\-t\fR VALUE
+Maximum time that a jail can exist, in seconds (default: 600)
+.TP
+\fB\-\-max_cpus\fR VALUE
+Maximum number of CPUs a single jailed process can use (default: 0 'no limit')
+.TP
+\fB\-\-daemon\fR|\fB\-d\fR
+Daemonize after start
+.TP
+\fB\-\-verbose\fR|\fB\-v\fR
+Verbose output
+.TP
+\fB\-\-quiet\fR|\fB\-q\fR
+Log warning and more important messages only
+.TP
+\fB\-\-really_quiet\fR|\fB\-Q\fR
+Log fatal messages only
+.TP
+\fB\-\-keep_env\fR|\fB\-e\fR
+Pass all environment variables to the child process (default: all envvars are cleared)
+.TP
+\fB\-\-env\fR|\fB\-E\fR VALUE
+Additional environment variable (can be used multiple times). If the envvar doesn't contain '=' (e.g. just the 'DISPLAY' string), the current envvar value will be used
+.TP
+\fB\-\-keep_caps\fR
+Don't drop any capabilities
+.TP
+\fB\-\-cap\fR VALUE
+Retain this capability, e.g. CAP_SYS_PTRACE (can be specified multiple times)
+.TP
+\fB\-\-silent\fR
+Redirect child process' fd:0/1/2 to /dev/null
+.TP
+\fB\-\-stderr_to_null\fR
+Redirect FD=2 (STDERR_FILENO) to /dev/null
+.TP
+\fB\-\-skip_setsid\fR
+Don't call setsid(), allows for terminal signal handling in the sandboxed process. Dangerous
+.TP
+\fB\-\-pass_fd\fR VALUE
+Don't close this FD before executing the child process (can be specified multiple times), by default: 0/1/2 are kept open
+.TP
+\fB\-\-disable_no_new_privs\fR
+Don't set the prctl(NO_NEW_PRIVS, 1) (DANGEROUS)
+.TP
+\fB\-\-rlimit_as\fR VALUE
+RLIMIT_AS in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 512)
+.TP
+\fB\-\-rlimit_core\fR VALUE
+RLIMIT_CORE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 0)
+.TP
+\fB\-\-rlimit_cpu\fR VALUE
+RLIMIT_CPU, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 600)
+.TP
+\fB\-\-rlimit_fsize\fR VALUE
+RLIMIT_FSIZE in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 1)
+.TP
+\fB\-\-rlimit_nofile\fR VALUE
+RLIMIT_NOFILE, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 32)
+.TP
+\fB\-\-rlimit_nproc\fR VALUE
+RLIMIT_NPROC, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 'soft')
+.TP
+\fB\-\-rlimit_stack\fR VALUE
+RLIMIT_STACK in MB, 'max' or 'hard' for the current hard limit, 'def' or 'soft' for the current soft limit, 'inf' for RLIM_INFINITY (default: 'soft')
+.TP
+\fB\-\-persona_addr_compat_layout\fR
+personality(ADDR_COMPAT_LAYOUT)
+.TP
+\fB\-\-persona_mmap_page_zero\fR
+personality(MMAP_PAGE_ZERO)
+.TP
+\fB\-\-persona_read_implies_exec\fR
+personality(READ_IMPLIES_EXEC)
+.TP
+\fB\-\-persona_addr_limit_3gb\fR
+personality(ADDR_LIMIT_3GB)
+.TP
+\fB\-\-persona_addr_no_randomize\fR
+personality(ADDR_NO_RANDOMIZE)
+.TP
+\fB\-\-disable_clone_newnet\fR|\fB\-N\fR
+Don't use CLONE_NEWNET. Enable global networking inside the jail
+.TP
+\fB\-\-disable_clone_newuser\fR
+Don't use CLONE_NEWUSER. Requires euid==0
+.TP
+\fB\-\-disable_clone_newns\fR
+Don't use CLONE_NEWNS
+.TP
+\fB\-\-disable_clone_newpid\fR
+Don't use CLONE_NEWPID
+.TP
+\fB\-\-disable_clone_newipc\fR
+Don't use CLONE_NEWIPC
+.TP
+\fB\-\-disable_clone_newuts\fR
+Don't use CLONE_NEWUTS
+.TP
+\fB\-\-disable_clone_newcgroup\fR
+Don't use CLONE_NEWCGROUP. Might be required for kernel versions < 4.6
+.TP
+\fB\-\-uid_mapping\fR|\fB\-U\fR VALUE
+Add a custom uid mapping of the form inside_uid:outside_uid:count. Setting this requires newuidmap (set-uid) to be present
+.TP
+\fB\-\-gid_mapping\fR|\fB\-G\fR VALUE
+Add a custom gid mapping of the form inside_gid:outside_gid:count. Setting this requires newgidmap (set-uid) to be present
+.TP
+\fB\-\-bindmount_ro\fR|\fB\-R\fR VALUE
+List of mountpoints to be mounted \fB\-\-bind\fR (ro) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'
+.TP
+\fB\-\-bindmount\fR|\fB\-B\fR VALUE
+List of mountpoints to be mounted \fB\-\-bind\fR (rw) inside the container. Can be specified multiple times. Supports 'source' syntax, or 'source:dest'
+.TP
+\fB\-\-tmpfsmount\fR|\fB\-T\fR VALUE
+List of mountpoints to be mounted as tmpfs (R/W) inside the container. Can be specified multiple times. Supports 'dest' syntax. Alternatively, use '-m none:dest:tmpfs:size=8388608'
+.TP
+\fB\-\-mount\fR|\fB\-m\fR VALUE
+Arbitrary mount, format src:dst:fs_type:options
+.TP
+\fB\-\-symlink\fR|\fB\-s\fR VALUE
+Symlink, format src:dst
+.TP
+\fB\-\-disable_proc\fR
+Disable mounting procfs in the jail
+.TP
+\fB\-\-proc_path\fR VALUE
+Path used to mount procfs (default: '/proc')
+.TP
+\fB\-\-proc_rw\fR
+Mount procfs as R/W (default: R/O)
+.TP
+\fB\-\-seccomp_policy\fR|\fB\-P\fR VALUE
+Path to file containing seccomp\-bpf policy (see kafel/)
+.TP
+\fB\-\-seccomp_string\fR VALUE
+String with kafel seccomp\-bpf policy (see kafel/)
+.TP
+\fB\-\-seccomp_log\fR
+Use SECCOMP_FILTER_FLAG_LOG. Log all actions except SECCOMP_RET_ALLOW. Supported since kernel version 4.14
+.TP
+\fB\-\-cgroup_mem_max\fR VALUE
+Maximum number of bytes to use in the group (default: '0' \- disabled)
+.TP
+\fB\-\-cgroup_mem_mount\fR VALUE
+Location of memory cgroup FS (default: '/sys/fs/cgroup/memory')
+.TP
+\fB\-\-cgroup_mem_parent\fR VALUE
+Which pre\-existing memory cgroup to use as a parent (default: 'NSJAIL')
+.TP
+\fB\-\-cgroup_pids_max\fR VALUE
+Maximum number of pids in a cgroup (default: '0' \- disabled)
+.TP
+\fB\-\-cgroup_pids_mount\fR VALUE
+Location of pids cgroup FS (default: '/sys/fs/cgroup/pids')
+.TP
+\fB\-\-cgroup_pids_parent\fR VALUE
+Which pre\-existing pids cgroup to use as a parent (default: 'NSJAIL')
+.TP
+\fB\-\-cgroup_net_cls_classid\fR VALUE
+Class identifier of network packets in the group (default: '0' \- disabled)
+.TP
+\fB\-\-cgroup_net_cls_mount\fR VALUE
+Location of net_cls cgroup FS (default: '/sys/fs/cgroup/net_cls')
+.TP
+\fB\-\-cgroup_net_cls_parent\fR VALUE
+Which pre\-existing net_cls cgroup to use as a parent (default: 'NSJAIL')
+.TP
+\fB\-\-cgroup_cpu_ms_per_sec\fR VALUE
+Number of milliseconds of CPU time per second that the process group can use (default: '0' \- no limit)
+.TP
+\fB\-\-cgroup_cpu_mount\fR VALUE
+Location of cpu cgroup FS (default: '/sys/fs/cgroup/cpu')
+.TP
+\fB\-\-cgroup_cpu_parent\fR VALUE
+Which pre\-existing cpu cgroup to use as a parent (default: 'NSJAIL')
+.TP
+\fB\-\-iface_no_lo\fR
+Don't bring the 'lo' interface up
+.TP
+\fB\-\-iface_own\fR VALUE
+Move this existing network interface into the new NET namespace. Can be specified multiple times
+.TP
+\fB\-\-macvlan_iface\fR|\fB\-I\fR VALUE
+Interface which will be cloned (MACVLAN) and put inside the subprocess' namespace as 'vs'
+.TP
+\fB\-\-macvlan_vs_ip\fR VALUE
+IP of the 'vs' interface (e.g. "192.168.0.1")
+.TP
+\fB\-\-macvlan_vs_nm\fR VALUE
+Netmask of the 'vs' interface (e.g. "255.255.255.0")
+.TP
+\fB\-\-macvlan_vs_gw\fR VALUE
+Default GW for the 'vs' interface (e.g. "192.168.0.1")
+.TP
+\fB\-\-macvlan_vs_ma\fR VALUE
+MAC-address of the 'vs' interface (e.g. "ba:ad:ba:be:45:00")
+\"
+.SH Examples
+.PP
+Wait on port 31337 for connections, and run /bin/sh:
+.IP
+nsjail \-Ml \-\-port 31337 \-\-chroot / \-\- /bin/sh \-i
+.PP
+Re\-run echo command as a sub\-process:
+.IP
+nsjail \-Mr \-\-chroot / \-\- /bin/echo "ABC"
+.PP
+Run echo command once only, as a sub\-process:
+.IP
+nsjail \-Mo \-\-chroot / \-\- /bin/echo "ABC"
+.PP
+Execute echo command directly, without a supervising process:
+.IP
+nsjail \-Me \-\-chroot / \-\-disable_proc \-\- /bin/echo "ABC"
+\"
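Note on the \-\-rlimit_* options: the man page describes each value as a number (in MB for some limits), 'max'/'hard', 'def'/'soft', or 'inf'. A minimal sketch of how such a value could be resolved is given here. The function parseRlimit and its mul parameter are hypothetical and do not reproduce nsjail's actual parser.

```cpp
#include <sys/resource.h>

#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper, not nsjail's parser. `mul` is the unit multiplier:
// 1024 * 1024 for limits given in MB, 1 for plain counts or seconds.
bool parseRlimit(int resource, const std::string& value, unsigned long long mul, rlim_t* out) {
    if (value == "inf") {
        *out = RLIM_INFINITY;
        return true;
    }
    const bool wantHard = (value == "max" || value == "hard");
    const bool wantSoft = (value == "def" || value == "soft");
    if (wantHard || wantSoft) {
        // Keywords resolve against the limits of the calling process.
        struct rlimit cur;
        if (getrlimit(resource, &cur) == -1) {
            return false;
        }
        *out = wantHard ? cur.rlim_max : cur.rlim_cur;
        return true;
    }
    // Anything else must be a complete number.
    errno = 0;
    char* end = nullptr;
    unsigned long long n = std::strtoull(value.c_str(), &end, 0);
    if (errno != 0 || end == value.c_str() || *end != '\0') {
        return false;
    }
    *out = static_cast<rlim_t>(n * mul);
    return true;
}
```

With this sketch, the documented default of \-\-rlimit_fsize ("1", in MB) resolves to 1048576 bytes, and "inf" resolves to RLIM_INFINITY regardless of the resource.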
diff --git a/nsjail.cc b/nsjail.cc
new file mode 100644
index 0000000..0b57033
--- /dev/null
+++ b/nsjail.cc
@@ -0,0 +1,233 @@
+/*
+
+ nsjail
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "nsjail.h"
+
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <termios.h>
+#include <unistd.h>
+
+#include <memory>
+
+#include "cmdline.h"
+#include "logs.h"
+#include "macros.h"
+#include "net.h"
+#include "sandbox.h"
+#include "subproc.h"
+#include "util.h"
+
+namespace nsjail {
+
+static __thread int sigFatal = 0;
+static __thread bool showProc = false;
+
+static void sigHandler(int sig) {
+ if (sig == SIGALRM) {
+ return;
+ }
+ if (sig == SIGCHLD) {
+ return;
+ }
+ if (sig == SIGUSR1 || sig == SIGQUIT) {
+ showProc = true;
+ return;
+ }
+ sigFatal = sig;
+}
+
+static bool setSigHandler(int sig) {
+ LOG_D("Setting sighandler for signal %s (%d)", util::sigName(sig).c_str(), sig);
+
+ sigset_t smask;
+ sigemptyset(&smask);
+
+ struct sigaction sa;
+ sa.sa_handler = sigHandler;
+ sa.sa_mask = smask;
+ sa.sa_flags = 0;
+ sa.sa_restorer = NULL;
+
+ if (sig == SIGTTIN || sig == SIGTTOU) {
+ sa.sa_handler = SIG_IGN;
+ }
+ if (sigaction(sig, &sa, NULL) == -1) {
+ PLOG_E("sigaction(%d)", sig);
+ return false;
+ }
+ return true;
+}
+
+static bool setSigHandlers(void) {
+ for (const auto& i : nssigs) {
+ if (!setSigHandler(i)) {
+ return false;
+ }
+ }
+ return true;
+}
+
+static bool setTimer(nsjconf_t* nsjconf) {
+ if (nsjconf->mode == MODE_STANDALONE_EXECVE) {
+ return true;
+ }
+
+ struct itimerval it = {
+ .it_interval =
+ {
+ .tv_sec = 1,
+ .tv_usec = 0,
+ },
+ .it_value =
+ {
+ .tv_sec = 1,
+ .tv_usec = 0,
+ },
+ };
+ if (setitimer(ITIMER_REAL, &it, NULL) == -1) {
+ PLOG_E("setitimer(ITIMER_REAL)");
+ return false;
+ }
+ return true;
+}
+
+static int listenMode(nsjconf_t* nsjconf) {
+ int listenfd = net::getRecvSocket(nsjconf->bindhost.c_str(), nsjconf->port);
+ if (listenfd == -1) {
+ return EXIT_FAILURE;
+ }
+ for (;;) {
+ if (sigFatal > 0) {
+ subproc::killAndReapAll(nsjconf);
+ logs::logStop(sigFatal);
+ close(listenfd);
+ return EXIT_SUCCESS;
+ }
+ if (showProc) {
+ showProc = false;
+ subproc::displayProc(nsjconf);
+ }
+ int connfd = net::acceptConn(listenfd);
+ if (connfd >= 0) {
+ subproc::runChild(nsjconf, connfd, connfd, connfd);
+ close(connfd);
+ }
+ subproc::reapProc(nsjconf);
+ }
+}
+
+static int standaloneMode(nsjconf_t* nsjconf) {
+ for (;;) {
+ if (!subproc::runChild(nsjconf, STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO)) {
+ LOG_E("Couldn't launch the child process");
+ return 0xff;
+ }
+ for (;;) {
+ int child_status = subproc::reapProc(nsjconf);
+ if (subproc::countProc(nsjconf) == 0) {
+ if (nsjconf->mode == MODE_STANDALONE_ONCE) {
+ return child_status;
+ }
+ break;
+ }
+ if (showProc) {
+ showProc = false;
+ subproc::displayProc(nsjconf);
+ }
+ if (sigFatal > 0) {
+ subproc::killAndReapAll(nsjconf);
+ logs::logStop(sigFatal);
+ return (128 + sigFatal);
+ }
+ pause();
+ }
+ }
+ // not reached
+}
+
+std::unique_ptr<struct termios> getTC(int fd) {
+ std::unique_ptr<struct termios> trm(new struct termios);
+
+ if (ioctl(fd, TCGETS, trm.get()) == -1) {
+ PLOG_D("ioctl(fd=%d, TCGETS) failed", fd);
+ return nullptr;
+ }
+ LOG_D("Saved the current state of the TTY");
+ return trm;
+}
+
+void setTC(int fd, const struct termios* trm) {
+ if (!trm) {
+ return;
+ }
+ if (ioctl(fd, TCSETS, trm) == -1) {
+ PLOG_W("ioctl(fd=%d, TCSETS) failed", fd);
+ return;
+ }
+ LOG_D("Restored the previous state of the TTY");
+}
+
+} // namespace nsjail
+
+int main(int argc, char* argv[]) {
+ std::unique_ptr<nsjconf_t> nsjconf = cmdline::parseArgs(argc, argv);
+ std::unique_ptr<struct termios> trm = nsjail::getTC(STDIN_FILENO);
+
+ if (!nsjconf) {
+ LOG_F("Couldn't parse cmdline options");
+ }
+ if (!nsjconf->clone_newuser && geteuid() != 0) {
+ LOG_W("--disable_clone_newuser might require root privs");
+ }
+ if (nsjconf->daemonize && (daemon(0, 0) == -1)) {
+ PLOG_F("daemon");
+ }
+ cmdline::logParams(nsjconf.get());
+ if (!nsjail::setSigHandlers()) {
+ LOG_F("nsjail::setSigHandlers() failed");
+ }
+ if (!nsjail::setTimer(nsjconf.get())) {
+ LOG_F("nsjail::setTimer() failed");
+ }
+ if (!sandbox::preparePolicy(nsjconf.get())) {
+ LOG_F("Couldn't prepare sandboxing policy");
+ }
+
+ int ret = 0;
+ if (nsjconf->mode == MODE_LISTEN_TCP) {
+ ret = nsjail::listenMode(nsjconf.get());
+ } else {
+ ret = nsjail::standaloneMode(nsjconf.get());
+ }
+
+ sandbox::closePolicy(nsjconf.get());
+ /* Try to restore the underlying console's params in case some program has changed it */
+ nsjail::setTC(STDIN_FILENO, trm.get());
+
+ LOG_D("Returning with %d", ret);
+ return ret;
+}
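Note on the signal handling in nsjail.cc: the handler only records which signal arrived, and the supervisor loops (listenMode, standaloneMode) poll those flags. A minimal standalone sketch of that pattern is given here. All names are hypothetical, and it uses volatile sig_atomic_t flags, which is an assumption of this sketch; the code above uses plain __thread variables.

```cpp
#include <signal.h>

// Flags written by the handler and polled by the main loop.
volatile sig_atomic_t fatalSig = 0;
volatile sig_atomic_t showProcRequested = 0;

// The handler does no real work; it only records what happened.
void onSignal(int sig) {
    if (sig == SIGUSR1 || sig == SIGQUIT) {
        showProcRequested = 1;
        return;
    }
    fatalSig = sig;
}

// Installs onSignal() for one signal with an empty mask and no flags.
bool installHandler(int sig) {
    struct sigaction sa = {};
    sa.sa_handler = onSignal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(sig, &sa, nullptr) == 0;
}

// Exit status reported when the supervisor stops because of a signal,
// following the 128 + signal number convention used by standaloneMode().
int exitStatusForSignal(int sig) {
    return 128 + sig;
}
```

A main loop would check fatalSig and showProcRequested after each pause() or accept() call, as the loops above do with sigFatal and showProc.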
diff --git a/nsjail.h b/nsjail.h
new file mode 100644
index 0000000..f91b8fd
--- /dev/null
+++ b/nsjail.h
@@ -0,0 +1,157 @@
+/*
+
+ nsjail
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+ Copyright 2016 Sergiusz Bazanski. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_NSJAIL_H
+#define NS_NSJAIL_H
+
+#include <linux/filter.h>
+#include <netinet/ip6.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <string>
+#include <vector>
+
+static const int nssigs[] = {
+ SIGINT,
+ SIGQUIT,
+ SIGUSR1,
+ SIGALRM,
+ SIGCHLD,
+ SIGTERM,
+ SIGTTIN,
+ SIGTTOU,
+};
+
+struct pids_t {
+ pid_t pid;
+ time_t start;
+ std::string remote_txt;
+ struct sockaddr_in6 remote_addr;
+ int pid_syscall_fd;
+};
+
+struct mount_t {
+ std::string src;
+ std::string src_content;
+ std::string dst;
+ std::string fs_type;
+ std::string options;
+ uintptr_t flags;
+ bool is_dir;
+ bool is_symlink;
+ bool is_mandatory;
+ bool mounted;
+};
+
+struct idmap_t {
+ uid_t inside_id;
+ uid_t outside_id;
+ size_t count;
+ bool is_newidmap;
+};
+
+enum ns_mode_t {
+ MODE_LISTEN_TCP = 0,
+ MODE_STANDALONE_ONCE,
+ MODE_STANDALONE_EXECVE,
+ MODE_STANDALONE_RERUN
+};
+
+struct nsjconf_t {
+ std::string exec_file;
+ bool use_execveat;
+ int exec_fd;
+ std::vector<std::string> argv;
+ std::string hostname;
+ std::string cwd;
+ std::string chroot;
+ int port;
+ std::string bindhost;
+ bool daemonize;
+ uint64_t tlimit;
+ size_t max_cpus;
+ bool keep_env;
+ bool keep_caps;
+ bool disable_no_new_privs;
+ uint64_t rl_as;
+ uint64_t rl_core;
+ uint64_t rl_cpu;
+ uint64_t rl_fsize;
+ uint64_t rl_nofile;
+ uint64_t rl_nproc;
+ uint64_t rl_stack;
+ unsigned long personality;
+ bool clone_newnet;
+ bool clone_newuser;
+ bool clone_newns;
+ bool clone_newpid;
+ bool clone_newipc;
+ bool clone_newuts;
+ bool clone_newcgroup;
+ enum ns_mode_t mode;
+ bool is_root_rw;
+ bool is_silent;
+ bool stderr_to_null;
+ bool skip_setsid;
+ unsigned int max_conns_per_ip;
+ std::string proc_path;
+ bool is_proc_rw;
+ bool iface_lo;
+ std::string iface_vs;
+ std::string iface_vs_ip;
+ std::string iface_vs_nm;
+ std::string iface_vs_gw;
+ std::string iface_vs_ma;
+ std::string cgroup_mem_mount;
+ std::string cgroup_mem_parent;
+ size_t cgroup_mem_max;
+ std::string cgroup_pids_mount;
+ std::string cgroup_pids_parent;
+ unsigned int cgroup_pids_max;
+ std::string cgroup_net_cls_mount;
+ std::string cgroup_net_cls_parent;
+ unsigned int cgroup_net_cls_classid;
+ std::string cgroup_cpu_mount;
+ std::string cgroup_cpu_parent;
+ unsigned int cgroup_cpu_ms_per_sec;
+ std::string kafel_file_path;
+ std::string kafel_string;
+ struct sock_fprog seccomp_fprog;
+ bool seccomp_log;
+ long num_cpus;
+ uid_t orig_uid;
+ std::vector<mount_t> mountpts;
+ std::vector<pids_t> pids;
+ std::vector<idmap_t> uids;
+ std::vector<idmap_t> gids;
+ std::vector<std::string> envs;
+ std::vector<int> openfds;
+ std::vector<int> caps;
+ std::vector<std::string> ifaces;
+};
+
+#endif /* NS_NSJAIL_H */
diff --git a/pid.cc b/pid.cc
new file mode 100644
index 0000000..593018b
--- /dev/null
+++ b/pid.cc
@@ -0,0 +1,86 @@
+/*
+
+ nsjail - CLONE_PID routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "pid.h"
+
+#include <linux/sched.h>
+#include <sched.h>
+#include <signal.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "logs.h"
+#include "subproc.h"
+
+namespace pid {
+
+bool initNs(nsjconf_t* nsjconf) {
+ if (nsjconf->mode != MODE_STANDALONE_EXECVE) {
+ return true;
+ }
+ if (!nsjconf->clone_newpid) {
+ return true;
+ }
+
+ LOG_D("Creating a dummy 'init' process");
+
+ /*
+ * If -Me is used then we need to create permanent init inside PID ns, otherwise only the
+ * first clone/fork will work, and the rest will fail with ENOMEM (see 'man pid_namespaces'
+ * for details on this behavior)
+ */
+ pid_t pid = subproc::cloneProc(CLONE_FS);
+ if (pid == -1) {
+ PLOG_E("Couldn't create a dummy init process");
+ return false;
+ }
+ if (pid > 0) {
+ return true;
+ }
+
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL, 0UL, 0UL, 0UL) == -1) {
+ PLOG_W("prctl(PR_SET_PDEATHSIG, SIGKILL) failed");
+ }
+ if (prctl(PR_SET_NAME, "ns-init", 0UL, 0UL, 0UL) == -1) {
+ PLOG_W("prctl(PR_SET_NAME, 'ns-init') failed");
+ }
+ if (prctl(PR_SET_DUMPABLE, 0UL, 0UL, 0UL, 0UL) == -1) {
+ PLOG_W("prctl(PR_SET_DUMPABLE, 0) failed");
+ }
+
+ /* Act somewhat like an init by reaping zombie processes */
+ struct sigaction sa;
+ sa.sa_handler = SIG_DFL;
+ sa.sa_flags = SA_NOCLDWAIT | SA_NOCLDSTOP;
+ sa.sa_restorer = NULL;
+ sigemptyset(&sa.sa_mask);
+
+ if (sigaction(SIGCHLD, &sa, NULL) == -1) {
+ PLOG_W("Couldn't set sighandler for SIGCHLD");
+ }
+
+ for (;;) {
+ pause();
+ }
+}
+
+} // namespace pid
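Note on the dummy init in pid::initNs(): it relies on SA_NOCLDWAIT so that terminated children are discarded by the kernel instead of lingering as zombies. A minimal sketch of that mechanism outside of any PID namespace is given here. The function names are hypothetical, and the sketch assumes Linux semantics for waitpid() under SA_NOCLDWAIT.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>

// Ask the kernel to discard the exit status of children so they never become
// zombies. Uses the same sigaction flags as the dummy init in pid::initNs().
bool enableAutoReap() {
    struct sigaction sa = {};
    sa.sa_handler = SIG_DFL;
    sa.sa_flags = SA_NOCLDWAIT | SA_NOCLDSTOP;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, nullptr) == 0;
}

// Forks a child that exits at once. With auto-reaping enabled, waitpid()
// blocks until the child is gone and then fails with ECHILD, because no
// exit status was kept for collection.
bool childIsAutoReaped() {
    pid_t pid = fork();
    if (pid == -1) {
        return false;
    }
    if (pid == 0) {
        _exit(0);
    }
    int status = 0;
    return waitpid(pid, &status, 0) == -1 && errno == ECHILD;
}
```

Calling enableAutoReap() followed by childIsAutoReaped() shows why the dummy init needs no wait loop of its own: it can simply sit in pause().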
diff --git a/pid.h b/pid.h
new file mode 100644
index 0000000..d74cce4
--- /dev/null
+++ b/pid.h
@@ -0,0 +1,35 @@
+/*
+
+ nsjail - CLONE_PID routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_PID_H
+#define NS_PID_H
+
+#include <stdbool.h>
+
+#include "nsjail.h"
+
+namespace pid {
+
+bool initNs(nsjconf_t* nsjconf);
+
+} // namespace pid
+
+#endif /* NS_PID_H */
diff --git a/sandbox.cc b/sandbox.cc
new file mode 100644
index 0000000..d987bfb
--- /dev/null
+++ b/sandbox.cc
@@ -0,0 +1,138 @@
+/*
+
+ nsjail - seccomp-bpf sandboxing
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "sandbox.h"
+
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+extern "C" {
+#include "kafel.h"
+}
+#include "logs.h"
+
+namespace sandbox {
+
+#ifndef PR_SET_NO_NEW_PRIVS /* in prctl.h since Linux 3.5 */
+#define PR_SET_NO_NEW_PRIVS 38
+#endif /* PR_SET_NO_NEW_PRIVS */
+
+#ifndef SECCOMP_FILTER_FLAG_TSYNC
+#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
+#endif /* SECCOMP_FILTER_FLAG_TSYNC */
+
+#ifndef SECCOMP_FILTER_FLAG_LOG
+#define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
+#endif /* SECCOMP_FILTER_FLAG_LOG */
+
+static bool prepareAndCommit(nsjconf_t* nsjconf) {
+ if (nsjconf->kafel_file_path.empty() && nsjconf->kafel_string.empty()) {
+ return true;
+ }
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ PLOG_W("prctl(PR_SET_NO_NEW_PRIVS, 1) failed");
+ return false;
+ }
+ if (nsjconf->seccomp_log) {
+#ifndef __NR_seccomp
+ LOG_E(
+ "The __NR_seccomp is not defined with this kernel's header files (kernel "
+ "headers "
+ "too old?)");
+ return false;
+#else
+ if (syscall(__NR_seccomp, (uintptr_t)SECCOMP_SET_MODE_FILTER,
+ (uintptr_t)(SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_LOG),
+ &nsjconf->seccomp_fprog) == -1) {
+ PLOG_E(
+ "seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC | "
+ "SECCOMP_FILTER_FLAG_LOG) failed");
+ return false;
+ }
+ return true;
+#endif /* __NR_seccomp */
+ }
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &nsjconf->seccomp_fprog, 0UL, 0UL)) {
+ PLOG_W("prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER) failed");
+ return false;
+ }
+ return true;
+}
+
+bool applyPolicy(nsjconf_t* nsjconf) {
+ return prepareAndCommit(nsjconf);
+}
+
+bool preparePolicy(nsjconf_t* nsjconf) {
+ if (nsjconf->kafel_file_path.empty() && nsjconf->kafel_string.empty()) {
+ return true;
+ }
+ if (!nsjconf->kafel_file_path.empty() && !nsjconf->kafel_string.empty()) {
+ LOG_W(
+ "You specified both kafel seccomp policy, and kafel seccomp file. Specify one "
+ "only");
+ return false;
+ }
+
+ kafel_ctxt_t ctxt = kafel_ctxt_create();
+
+ if (!nsjconf->kafel_file_path.empty()) {
+ FILE* f = fopen(nsjconf->kafel_file_path.c_str(), "r");
+ if (!f) {
+ PLOG_W("Couldn't open the kafel seccomp policy file '%s'",
+ nsjconf->kafel_file_path.c_str());
+ kafel_ctxt_destroy(&ctxt);
+ return false;
+ }
+ LOG_D("Compiling seccomp policy from file: '%s'", nsjconf->kafel_file_path.c_str());
+ kafel_set_input_file(ctxt, f);
+ }
+ if (!nsjconf->kafel_string.empty()) {
+ LOG_D("Compiling seccomp policy from string: '%s'", nsjconf->kafel_string.c_str());
+ kafel_set_input_string(ctxt, nsjconf->kafel_string.c_str());
+ }
+
+ if (kafel_compile(ctxt, &nsjconf->seccomp_fprog) != 0) {
+ LOG_W("Could not compile policy: %s", kafel_error_msg(ctxt));
+ kafel_ctxt_destroy(&ctxt);
+ return false;
+ }
+ kafel_ctxt_destroy(&ctxt);
+ return true;
+}
+
+void closePolicy(nsjconf_t* nsjconf) {
+ if (!nsjconf->seccomp_fprog.filter) {
+ return;
+ }
+ free(nsjconf->seccomp_fprog.filter);
+ nsjconf->seccomp_fprog.filter = nullptr;
+ nsjconf->seccomp_fprog.len = 0;
+}
+
+} // namespace sandbox
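Note on prepareAndCommit(): the seccomp_fprog it installs is a classic BPF program, compiled by kafel in nsjail. To illustrate the shape of such a program and of the two prctl() calls, here is a minimal sketch with a hand-written one-instruction filter that allows every syscall. The function names are hypothetical and the filter is only a placeholder for a real policy.

```cpp
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

#include <vector>

// A one-instruction BPF program: return SECCOMP_RET_ALLOW for every syscall.
// Real policies inspect struct seccomp_data before deciding.
std::vector<struct sock_filter> makeAllowAllFilter() {
    struct sock_filter allow = {};
    allow.code = BPF_RET | BPF_K;
    allow.k = SECCOMP_RET_ALLOW;
    return {allow};
}

// Installs the filter for the calling thread, mirroring the non-logging path
// above: set no_new_privs first, then PR_SET_SECCOMP. Once this succeeds the
// filter cannot be removed again.
bool installFilter(std::vector<struct sock_filter>& filter) {
    struct sock_fprog prog = {};
    prog.len = static_cast<unsigned short>(filter.size());
    prog.filter = filter.data();
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) == -1) {
        return false;
    }
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0UL, 0UL) == 0;
}
```

Setting no_new_privs before installing the filter is what lets an unprivileged process load it; without it the kernel requires CAP_SYS_ADMIN.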
diff --git a/sandbox.h b/sandbox.h
new file mode 100644
index 0000000..5ce6264
--- /dev/null
+++ b/sandbox.h
@@ -0,0 +1,37 @@
+/*
+
+ nsjail - seccomp-bpf sandboxing
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_SANDBOX_H
+#define NS_SANDBOX_H
+
+#include <stdbool.h>
+
+#include "nsjail.h"
+
+namespace sandbox {
+
+bool applyPolicy(nsjconf_t* nsjconf);
+bool preparePolicy(nsjconf_t* nsjconf);
+void closePolicy(nsjconf_t* nsjconf);
+
+} // namespace sandbox
+
+#endif /* NS_SANDBOX_H */
diff --git a/subproc.cc b/subproc.cc
new file mode 100644
index 0000000..dc05383
--- /dev/null
+++ b/subproc.cc
@@ -0,0 +1,579 @@
+/*
+
+ nsjail - subprocess management
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "subproc.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/sched.h>
+#include <sched.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <string>
+#include <vector>
+
+#include "cgroup.h"
+#include "contain.h"
+#include "logs.h"
+#include "macros.h"
+#include "net.h"
+#include "sandbox.h"
+#include "user.h"
+#include "util.h"
+
+namespace subproc {
+
+#if !defined(CLONE_NEWCGROUP)
+#define CLONE_NEWCGROUP 0x02000000
+#endif /* !defined(CLONE_NEWCGROUP) */
+
+static const std::string cloneFlagsToStr(uintptr_t flags) {
+ std::string res;
+
+ struct {
+ const uintptr_t flag;
+ const char* const name;
+ } static const cloneFlags[] = {
+ NS_VALSTR_STRUCT(CLONE_VM),
+ NS_VALSTR_STRUCT(CLONE_FS),
+ NS_VALSTR_STRUCT(CLONE_FILES),
+ NS_VALSTR_STRUCT(CLONE_SIGHAND),
+ NS_VALSTR_STRUCT(CLONE_PTRACE),
+ NS_VALSTR_STRUCT(CLONE_VFORK),
+ NS_VALSTR_STRUCT(CLONE_PARENT),
+ NS_VALSTR_STRUCT(CLONE_THREAD),
+ NS_VALSTR_STRUCT(CLONE_NEWNS),
+ NS_VALSTR_STRUCT(CLONE_SYSVSEM),
+ NS_VALSTR_STRUCT(CLONE_SETTLS),
+ NS_VALSTR_STRUCT(CLONE_PARENT_SETTID),
+ NS_VALSTR_STRUCT(CLONE_CHILD_CLEARTID),
+ NS_VALSTR_STRUCT(CLONE_DETACHED),
+ NS_VALSTR_STRUCT(CLONE_UNTRACED),
+ NS_VALSTR_STRUCT(CLONE_CHILD_SETTID),
+ NS_VALSTR_STRUCT(CLONE_NEWCGROUP),
+ NS_VALSTR_STRUCT(CLONE_NEWUTS),
+ NS_VALSTR_STRUCT(CLONE_NEWIPC),
+ NS_VALSTR_STRUCT(CLONE_NEWUSER),
+ NS_VALSTR_STRUCT(CLONE_NEWPID),
+ NS_VALSTR_STRUCT(CLONE_NEWNET),
+ NS_VALSTR_STRUCT(CLONE_IO),
+ };
+
+ uintptr_t knownFlagMask = CSIGNAL;
+ for (const auto& i : cloneFlags) {
+ if (flags & i.flag) {
+ res.append(i.name).append("|");
+ }
+ knownFlagMask |= i.flag;
+ }
+
+ if (flags & ~(knownFlagMask)) {
+ util::StrAppend(&res, "%#tx|", flags & ~(knownFlagMask));
+ }
+ res.append(util::sigName(flags & CSIGNAL).c_str());
+ return res;
+}
+
+/* Reset the execution environment for the new process */
+static bool resetEnv(void) {
+ /* Set all previously changed signals to their default behavior */
+ for (const auto& sig : nssigs) {
+ if (signal(sig, SIG_DFL) == SIG_ERR) {
+ PLOG_W("signal(%s, SIG_DFL)", util::sigName(sig).c_str());
+ return false;
+ }
+ }
+ /* Unblock all signals */
+ sigset_t sset;
+ sigemptyset(&sset);
+ if (sigprocmask(SIG_SETMASK, &sset, NULL) == -1) {
+ PLOG_W("sigprocmask(SIG_SETMASK, empty)");
+ return false;
+ }
+ return true;
+}
+
+static const char kSubprocDoneChar = 'D';
+static const char kSubprocErrorChar = 'E';
+
+static void subprocNewProc(nsjconf_t* nsjconf, int fd_in, int fd_out, int fd_err, int pipefd) {
+ if (!contain::setupFD(nsjconf, fd_in, fd_out, fd_err)) {
+ return;
+ }
+ if (!resetEnv()) {
+ return;
+ }
+
+ if (pipefd == -1) {
+ if (!user::initNsFromParent(nsjconf, getpid())) {
+ LOG_E("Couldn't initialize user namespace");
+ return;
+ }
+ if (!cgroup::initNsFromParent(nsjconf, getpid())) {
+ LOG_E("Couldn't initialize cgroup user namespace");
+ return;
+ }
+ } else {
+ char doneChar;
+ if (util::readFromFd(pipefd, &doneChar, sizeof(doneChar)) != sizeof(doneChar)) {
+ return;
+ }
+ if (doneChar != kSubprocDoneChar) {
+ return;
+ }
+ }
+ if (!contain::containProc(nsjconf)) {
+ return;
+ }
+ if (!nsjconf->keep_env) {
+ clearenv();
+ }
+ for (const auto& env : nsjconf->envs) {
+ putenv(const_cast<char*>(env.c_str()));
+ }
+
+ auto connstr = net::connToText(fd_in, /* remote= */ true, NULL);
+ LOG_I("Executing '%s' for '%s'", nsjconf->exec_file.c_str(), connstr.c_str());
+
+ std::vector<const char*> argv;
+ for (const auto& s : nsjconf->argv) {
+ argv.push_back(s.c_str());
+ LOG_D(" Arg: '%s'", s.c_str());
+ }
+ argv.push_back(nullptr);
+
+ /* Should be the last one in the sequence */
+ if (!sandbox::applyPolicy(nsjconf)) {
+ return;
+ }
+
+ if (nsjconf->use_execveat) {
+#if defined(__NR_execveat)
+ syscall(__NR_execveat, (uintptr_t)nsjconf->exec_fd, "", (char* const*)argv.data(),
+ environ, (uintptr_t)AT_EMPTY_PATH);
+#else /* defined(__NR_execveat) */
+ LOG_E("Your system doesn't support execveat() syscall");
+ return;
+#endif /* defined(__NR_execveat) */
+ } else {
+ execv(nsjconf->exec_file.c_str(), (char* const*)argv.data());
+ }
+
+ PLOG_E("execve('%s') failed", nsjconf->exec_file.c_str());
+}
+
+static void addProc(nsjconf_t* nsjconf, pid_t pid, int sock) {
+ pids_t p;
+
+ p.pid = pid;
+ p.start = time(NULL);
+ p.remote_txt = net::connToText(sock, /* remote= */ true, &p.remote_addr);
+
+ char fname[PATH_MAX];
+ snprintf(fname, sizeof(fname), "/proc/%d/syscall", (int)pid);
+ p.pid_syscall_fd = TEMP_FAILURE_RETRY(open(fname, O_RDONLY | O_CLOEXEC));
+
+ nsjconf->pids.push_back(p);
+
+ LOG_D("Added pid '%d' with start time '%u' to the queue for IP: '%s'", p.pid,
+ (unsigned int)p.start, p.remote_txt.c_str());
+}
+
+static void removeProc(nsjconf_t* nsjconf, pid_t pid) {
+ for (auto p = nsjconf->pids.begin(); p != nsjconf->pids.end(); ++p) {
+ if (p->pid == pid) {
+ LOG_D("Removing pid '%d' from the queue (IP:'%s', start time:'%s')", p->pid,
+ p->remote_txt.c_str(), util::timeToStr(p->start).c_str());
+ close(p->pid_syscall_fd);
+ nsjconf->pids.erase(p);
+
+ return;
+ }
+ }
+ LOG_W("PID: %d not found (?)", pid);
+}
+
+int countProc(nsjconf_t* nsjconf) {
+ return nsjconf->pids.size();
+}
+
+void displayProc(nsjconf_t* nsjconf) {
+ LOG_I("Total number of spawned namespaces: %d", countProc(nsjconf));
+ time_t now = time(NULL);
+ for (const auto& pid : nsjconf->pids) {
+ time_t diff = now - pid.start;
+ uint64_t left = ((uint64_t)diff < nsjconf->tlimit) ? nsjconf->tlimit - (uint64_t)diff : 0;
+ LOG_I("PID: %d, Remote host: %s, Run time: %ld sec. (time left: %" PRIu64 " sec.)",
+ pid.pid, pid.remote_txt.c_str(), (long)diff, left);
+ }
+}
+
+static const pids_t* getPidElem(nsjconf_t* nsjconf, pid_t pid) {
+ for (const auto& p : nsjconf->pids) {
+ if (p.pid == pid) {
+ return &p;
+ }
+ }
+ return NULL;
+}
+
+static void seccompViolation(nsjconf_t* nsjconf, siginfo_t* si) {
+ LOG_W("PID: %d committed a syscall/seccomp violation and exited with SIGSYS", si->si_pid);
+
+ const pids_t* p = getPidElem(nsjconf, si->si_pid);
+ if (p == NULL) {
+ LOG_W("PID: %d, SiSyscall: %d, SiCode: %d, SiErrno: %d, SiSigno: %d", (int)si->si_pid,
+ si->si_syscall, si->si_code, si->si_errno, si->si_signo);
+ LOG_E("Couldn't find pid element in the subproc list for PID: %d", (int)si->si_pid);
+ return;
+ }
+
+ char buf[4096];
+ ssize_t rdsize = util::readFromFd(p->pid_syscall_fd, buf, sizeof(buf) - 1);
+ if (rdsize < 1) {
+ LOG_W("PID: %d, SiSyscall: %d, SiCode: %d, SiErrno: %d, SiSigno: %d",
+ (int)si->si_pid, si->si_syscall, si->si_code, si->si_errno, si->si_signo);
+ return;
+ }
+ buf[rdsize - 1] = '\0';
+
+ uintptr_t arg1, arg2, arg3, arg4, arg5, arg6, sp, pc;
+ ptrdiff_t sc;
+ int ret = sscanf(buf, "%td %tx %tx %tx %tx %tx %tx %tx %tx", &sc, &arg1, &arg2, &arg3,
+ &arg4, &arg5, &arg6, &sp, &pc);
+ if (ret == 9) {
+ LOG_W(
+ "PID: %d, Syscall number: %td, Arguments: %#tx, %#tx, %#tx, %#tx, %#tx, %#tx, "
+ "SP: %#tx, PC: %#tx, si_syscall: %d, si_errno: %#x",
+ (int)si->si_pid, sc, arg1, arg2, arg3, arg4, arg5, arg6, sp, pc, si->si_syscall,
+ si->si_errno);
+ } else if (ret == 3) {
+ LOG_W(
+ "PID: %d, SiSyscall: %d, SiCode: %d, SiErrno: %d, SiSigno: %d, SP: %#tx, PC: "
+ "%#tx",
+ (int)si->si_pid, si->si_syscall, si->si_code, si->si_errno, si->si_signo, arg1,
+ arg2);
+ } else {
+ LOG_W("PID: %d, SiSyscall: %d, SiCode: %d, SiErrno: %d, Syscall string '%s'",
+ (int)si->si_pid, si->si_syscall, si->si_code, si->si_errno, buf);
+ }
+}
+
+static int reapProc(nsjconf_t* nsjconf, pid_t pid, bool should_wait = false) {
+ int status;
+
+ if (wait4(pid, &status, should_wait ? 0 : WNOHANG, NULL) == pid) {
+ cgroup::finishFromParent(nsjconf, pid);
+
+ std::string remote_txt = "[UNKNOWN]";
+ const pids_t* elem = getPidElem(nsjconf, pid);
+ if (elem) {
+ remote_txt = elem->remote_txt;
+ }
+
+ if (WIFEXITED(status)) {
+ LOG_I("PID: %d (%s) exited with status: %d, (PIDs left: %d)", pid,
+ remote_txt.c_str(), WEXITSTATUS(status), countProc(nsjconf) - 1);
+ removeProc(nsjconf, pid);
+ return WEXITSTATUS(status);
+ }
+ if (WIFSIGNALED(status)) {
+ LOG_I("PID: %d (%s) terminated with signal: %s (%d), (PIDs left: %d)", pid,
+ remote_txt.c_str(), util::sigName(WTERMSIG(status)).c_str(),
+ WTERMSIG(status), countProc(nsjconf) - 1);
+ removeProc(nsjconf, pid);
+ return 128 + WTERMSIG(status);
+ }
+ }
+ return 0;
+}
+
+int reapProc(nsjconf_t* nsjconf) {
+ int rv = 0;
+ siginfo_t si;
+
+ for (;;) {
+ si.si_pid = 0;
+ if (waitid(P_ALL, 0, &si, WNOHANG | WNOWAIT | WEXITED) == -1) {
+ break;
+ }
+ if (si.si_pid == 0) {
+ break;
+ }
+ if (si.si_code == CLD_KILLED && si.si_status == SIGSYS) {
+ seccompViolation(nsjconf, &si);
+ }
+ rv = reapProc(nsjconf, si.si_pid);
+ }
+
+ time_t now = time(NULL);
+ for (const auto& p : nsjconf->pids) {
+ if (nsjconf->tlimit == 0) {
+ continue;
+ }
+ pid_t pid = p.pid;
+ time_t diff = now - p.start;
+ if ((uint64_t)diff >= nsjconf->tlimit) {
+ LOG_I("PID: %d run time >= time limit (%ld >= %" PRIu64
+ ") (%s). Killing it",
+ pid, (long)diff, nsjconf->tlimit, p.remote_txt.c_str());
+ /*
+ * Probably a kernel bug - some processes cannot be killed with KILL if
+ * they're namespaced, and in a stopped state
+ */
+ kill(pid, SIGCONT);
+ LOG_D("Sent SIGCONT to PID: %d", pid);
+ kill(pid, SIGKILL);
+ LOG_D("Sent SIGKILL to PID: %d", pid);
+ }
+ }
+ return rv;
+}
+
+void killAndReapAll(nsjconf_t* nsjconf) {
+ while (!nsjconf->pids.empty()) {
+ pid_t pid = nsjconf->pids.front().pid;
+ if (kill(pid, SIGKILL) == 0) {
+ reapProc(nsjconf, pid, true);
+ } else {
+ removeProc(nsjconf, pid);
+ }
+ }
+}
+
+static bool initParent(nsjconf_t* nsjconf, pid_t pid, int pipefd) {
+ if (!net::initNsFromParent(nsjconf, pid)) {
+ LOG_E("Couldn't initialize net namespace for pid '%d'", pid);
+ return false;
+ }
+ if (!cgroup::initNsFromParent(nsjconf, pid)) {
+ LOG_E("Couldn't initialize cgroup user namespace for pid '%d'", pid);
+ exit(0xff);
+ }
+ if (!user::initNsFromParent(nsjconf, pid)) {
+ LOG_E("Couldn't initialize user namespace for pid %d", pid);
+ return false;
+ }
+ if (!util::writeToFd(pipefd, &kSubprocDoneChar, sizeof(kSubprocDoneChar))) {
+ LOG_E("Couldn't signal the new process via a socketpair");
+ return false;
+ }
+ return true;
+}
+
+bool runChild(nsjconf_t* nsjconf, int fd_in, int fd_out, int fd_err) {
+ if (!net::limitConns(nsjconf, fd_in)) {
+ return true;
+ }
+ unsigned long flags = 0UL;
+ flags |= (nsjconf->clone_newnet ? CLONE_NEWNET : 0);
+ flags |= (nsjconf->clone_newuser ? CLONE_NEWUSER : 0);
+ flags |= (nsjconf->clone_newns ? CLONE_NEWNS : 0);
+ flags |= (nsjconf->clone_newpid ? CLONE_NEWPID : 0);
+ flags |= (nsjconf->clone_newipc ? CLONE_NEWIPC : 0);
+ flags |= (nsjconf->clone_newuts ? CLONE_NEWUTS : 0);
+ flags |= (nsjconf->clone_newcgroup ? CLONE_NEWCGROUP : 0);
+
+ if (nsjconf->mode == MODE_STANDALONE_EXECVE) {
+ if (unshare(flags) == -1) {
+ PLOG_F("unshare(%s)", cloneFlagsToStr(flags).c_str());
+ }
+ subprocNewProc(nsjconf, fd_in, fd_out, fd_err, -1);
+ LOG_F("Launching new process failed");
+ }
+
+ flags |= SIGCHLD;
+ LOG_D("Creating new process with clone flags:%s", cloneFlagsToStr(flags).c_str());
+
+ int sv[2];
+ if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, sv) == -1) {
+ PLOG_E("socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC) failed");
+ return false;
+ }
+ int child_fd = sv[0];
+ int parent_fd = sv[1];
+
+ pid_t pid = cloneProc(flags);
+ if (pid == 0) {
+ close(parent_fd);
+ subprocNewProc(nsjconf, fd_in, fd_out, fd_err, child_fd);
+ util::writeToFd(child_fd, &kSubprocErrorChar, sizeof(kSubprocErrorChar));
+ LOG_F("Launching child process failed");
+ }
+ close(child_fd);
+ if (pid == -1) {
+ if (flags & CLONE_NEWCGROUP) {
+ PLOG_E(
+ "nsjail tried to use the CLONE_NEWCGROUP clone flag, which is "
+ "supported under kernel versions >= 4.6 only. Try disabling this flag");
+ }
+ PLOG_E(
+ "clone(flags=%s) failed. You probably need root privileges if your system "
+ "doesn't support CLONE_NEWUSER. Alternatively, you might want to recompile "
+ "your kernel with support for namespaces or check the current value of the "
+ "kernel.unprivileged_userns_clone sysctl",
+ cloneFlagsToStr(flags).c_str());
+ close(parent_fd);
+ return false;
+ }
+ addProc(nsjconf, pid, fd_in);
+
+ if (!initParent(nsjconf, pid, parent_fd)) {
+ close(parent_fd);
+ return false;
+ }
+
+ char rcvChar;
+ if (util::readFromFd(parent_fd, &rcvChar, sizeof(rcvChar)) == sizeof(rcvChar) &&
+ rcvChar == kSubprocErrorChar) {
+ LOG_W("Received error message from the child process before it has been executed");
+ close(parent_fd);
+ return false;
+ }
+
+ close(parent_fd);
+ return true;
+}
+
+/*
+ * Will be used inside the child process only, so it's safe to have it in BSS.
+ * Some CPU archs (e.g. aarch64) must have it aligned. Size: 128 KiB (/2)
+ */
+static uint8_t cloneStack[128 * 1024] __attribute__((aligned(__BIGGEST_ALIGNMENT__)));
+/* Cannot be on the stack, as the child's stack pointer will change after clone() */
+static __thread jmp_buf env;
+
+static int cloneFunc(void* arg __attribute__((unused))) {
+ longjmp(env, 1);
+ return 0;
+}
+
+/*
+ * Avoid problems with caching of PID/TID in glibc - when using syscall(__NR_clone) glibc doesn't
+ * update the internal PID/TID caches, which can lead to invalid values being returned by getpid()
+ * or incorrect PID/TIDs used in raise()/abort() functions
+ */
+pid_t cloneProc(uintptr_t flags) {
+ if (flags & CLONE_VM) {
+ LOG_E("Cannot use clone(flags & CLONE_VM)");
+ return -1;
+ }
+
+ if (setjmp(env) == 0) {
+ LOG_D("Cloning process with flags:%s", cloneFlagsToStr(flags).c_str());
+ /*
+ * Avoid the problem of the stack growing up/down under different CPU architectures,
+ * by using middle of the static stack buffer (which is temporary, and used only
+ * inside of the cloneFunc())
+ */
+ void* stack = &cloneStack[sizeof(cloneStack) / 2];
+ /* Parent */
+ return clone(cloneFunc, stack, flags, NULL, NULL, NULL);
+ }
+ /* Child */
+ return 0;
+}
+
+int systemExe(const std::vector<std::string>& args, char** env) {
+ bool exec_failed = false;
+
+ std::vector<const char*> argv;
+ for (const auto& a : args) {
+ argv.push_back(a.c_str());
+ }
+ argv.push_back(nullptr);
+
+ int sv[2];
+ if (pipe2(sv, O_CLOEXEC) == -1) {
+ PLOG_W("pipe2(sv, O_CLOEXEC)");
+ return -1;
+ }
+
+ pid_t pid = fork();
+ if (pid == -1) {
+ PLOG_W("fork()");
+ close(sv[0]);
+ close(sv[1]);
+ return -1;
+ }
+
+ if (pid == 0) {
+ close(sv[0]);
+ execve(argv[0], (char* const*)argv.data(), (char* const*)env);
+ PLOG_W("execve('%s')", argv[0]);
+ util::writeToFd(sv[1], "A", 1);
+ exit(0);
+ }
+
+ close(sv[1]);
+ char buf[1];
+ if (util::readFromFd(sv[0], buf, sizeof(buf)) > 0) {
+ exec_failed = true;
+ LOG_W("Couldn't execute '%s'", argv[0]);
+ }
+ close(sv[0]);
+
+ for (;;) {
+ int status;
+ int ret = wait4(pid, &status, __WALL, NULL);
+ if (ret == -1 && errno == EINTR) {
+ continue;
+ }
+ if (ret == -1) {
+ PLOG_W("wait4(pid=%d)", pid);
+ return -1;
+ }
+ if (WIFEXITED(status)) {
+ int exit_code = WEXITSTATUS(status);
+ LOG_D("PID %d exited with exit code: %d", pid, exit_code);
+ if (exec_failed) {
+ return -1;
+ } else if (exit_code == 0) {
+ return 0;
+ } else {
+ return 1;
+ }
+ }
+ if (WIFSIGNALED(status)) {
+ int exit_signal = WTERMSIG(status);
+ LOG_W("PID %d killed by signal: %d (%s)", pid, exit_signal,
+ util::sigName(exit_signal).c_str());
+ return 2;
+ }
+ LOG_W("Unknown exit status: %d", status);
+ }
+}
+
+} // namespace subproc
diff --git a/subproc.h b/subproc.h
new file mode 100644
index 0000000..33e2b5c
--- /dev/null
+++ b/subproc.h
@@ -0,0 +1,47 @@
+/*
+
+ nsjail - subprocess management
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_PROC_H
+#define NS_PROC_H
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <unistd.h>
+
+#include <string>
+#include <vector>
+
+#include "nsjail.h"
+
+namespace subproc {
+
+bool runChild(nsjconf_t* nsjconf, int fd_in, int fd_out, int fd_err);
+int countProc(nsjconf_t* nsjconf);
+void displayProc(nsjconf_t* nsjconf);
+void killAndReapAll(nsjconf_t* nsjconf);
+/* Returns the exit status of the last subprocess reaped by this call, or 0 if none was reaped */
+int reapProc(nsjconf_t* nsjconf);
+int systemExe(const std::vector<std::string>& args, char** env);
+pid_t cloneProc(uintptr_t flags);
+
+} // namespace subproc
+
+#endif /* NS_PROC_H */
diff --git a/user.cc b/user.cc
new file mode 100644
index 0000000..4053884
--- /dev/null
+++ b/user.cc
@@ -0,0 +1,337 @@
+/*
+
+ nsjail - CLONE_NEWUSER routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "user.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <grp.h>
+#include <limits.h>
+#include <linux/securebits.h>
+#include <pwd.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "logs.h"
+#include "macros.h"
+#include "subproc.h"
+#include "util.h"
+
+namespace user {
+
+static bool setResGid(gid_t gid) {
+ LOG_D("setresgid(%d)", gid);
+#if defined(__NR_setresgid32)
+ if (syscall(__NR_setresgid32, (uintptr_t)gid, (uintptr_t)gid, (uintptr_t)gid) == -1) {
+ PLOG_W("setresgid32(%d)", (int)gid);
+ return false;
+ }
+#else /* defined(__NR_setresgid32) */
+ if (syscall(__NR_setresgid, (uintptr_t)gid, (uintptr_t)gid, (uintptr_t)gid) == -1) {
+ PLOG_W("setresgid(%d)", gid);
+ return false;
+ }
+#endif /* defined(__NR_setresgid32) */
+ return true;
+}
+
+static bool setResUid(uid_t uid) {
+ LOG_D("setresuid(%d)", uid);
+#if defined(__NR_setresuid32)
+ if (syscall(__NR_setresuid32, (uintptr_t)uid, (uintptr_t)uid, (uintptr_t)uid) == -1) {
+ PLOG_W("setresuid32(%d)", (int)uid);
+ return false;
+ }
+#else /* defined(__NR_setresuid32) */
+ if (syscall(__NR_setresuid, (uintptr_t)uid, (uintptr_t)uid, (uintptr_t)uid) == -1) {
+ PLOG_W("setresuid(%d)", uid);
+ return false;
+ }
+#endif /* defined(__NR_setresuid32) */
+ return true;
+}
+
+static bool setGroups(pid_t pid) {
+ /*
+ * No need to write 'deny' to /proc/pid/setgroups if our euid==0, as writing to
+ * uid_map/gid_map will succeed anyway
+ */
+ if (geteuid() == 0) {
+ return true;
+ }
+
+ char fname[PATH_MAX];
+ snprintf(fname, sizeof(fname), "/proc/%d/setgroups", pid);
+ const char* denystr = "deny";
+ if (!util::writeBufToFile(fname, denystr, strlen(denystr), O_WRONLY | O_CLOEXEC)) {
+ LOG_E("util::writeBufToFile('%s', '%s') failed", fname, denystr);
+ return false;
+ }
+ return true;
+}
+
+static bool uidMapSelf(nsjconf_t* nsjconf, pid_t pid) {
+ std::string map;
+ for (const auto& uid : nsjconf->uids) {
+ if (uid.is_newidmap) {
+ continue;
+ }
+ map.append(std::to_string(uid.inside_id));
+ map.append(" ");
+ map.append(std::to_string(uid.outside_id));
+ map.append(" ");
+ map.append(std::to_string(uid.count));
+ map.append("\n");
+ }
+ if (map.empty()) {
+ return true;
+ }
+
+ char fname[PATH_MAX];
+ snprintf(fname, sizeof(fname), "/proc/%d/uid_map", pid);
+ LOG_D("Writing '%s' to '%s'", map.c_str(), fname);
+ if (!util::writeBufToFile(fname, map.data(), map.length(), O_WRONLY | O_CLOEXEC)) {
+ LOG_E("util::writeBufToFile('%s', '%s') failed", fname, map.c_str());
+ return false;
+ }
+
+ return true;
+}
+
+static bool gidMapSelf(nsjconf_t* nsjconf, pid_t pid) {
+ std::string map;
+ for (const auto& gid : nsjconf->gids) {
+ if (gid.is_newidmap) {
+ continue;
+ }
+ map.append(std::to_string(gid.inside_id));
+ map.append(" ");
+ map.append(std::to_string(gid.outside_id));
+ map.append(" ");
+ map.append(std::to_string(gid.count));
+ map.append("\n");
+ }
+ if (map.empty()) {
+ return true;
+ }
+
+ char fname[PATH_MAX];
+ snprintf(fname, sizeof(fname), "/proc/%d/gid_map", pid);
+ LOG_D("Writing '%s' to '%s'", map.c_str(), fname);
+ if (!util::writeBufToFile(fname, map.data(), map.length(), O_WRONLY | O_CLOEXEC)) {
+ LOG_E("util::writeBufToFile('%s', '%s') failed", fname, map.c_str());
+ return false;
+ }
+
+ return true;
+}
+
+/* Use /usr/bin/newgidmap for writing the gid map */
+static bool gidMapExternal(nsjconf_t* nsjconf, pid_t pid) {
+ bool use = false;
+
+ std::vector<std::string> argv = {"/usr/bin/newgidmap", std::to_string(pid)};
+ for (const auto& gid : nsjconf->gids) {
+ if (!gid.is_newidmap) {
+ continue;
+ }
+ use = true;
+
+ argv.push_back(std::to_string(gid.inside_id));
+ argv.push_back(std::to_string(gid.outside_id));
+ argv.push_back(std::to_string(gid.count));
+ }
+ if (!use) {
+ return true;
+ }
+ if (subproc::systemExe(argv, environ) != 0) {
+ LOG_E("'/usr/bin/newgidmap' failed");
+ return false;
+ }
+
+ return true;
+}
+
+/* Use /usr/bin/newuidmap for writing the uid map */
+static bool uidMapExternal(nsjconf_t* nsjconf, pid_t pid) {
+ bool use = false;
+
+ std::vector<std::string> argv = {"/usr/bin/newuidmap", std::to_string(pid)};
+ for (const auto& uid : nsjconf->uids) {
+ if (!uid.is_newidmap) {
+ continue;
+ }
+ use = true;
+
+ argv.push_back(std::to_string(uid.inside_id));
+ argv.push_back(std::to_string(uid.outside_id));
+ argv.push_back(std::to_string(uid.count));
+ }
+ if (!use) {
+ return true;
+ }
+ if (subproc::systemExe(argv, environ) != 0) {
+ LOG_E("'/usr/bin/newuidmap' failed");
+ return false;
+ }
+
+ return true;
+}
+
+static bool uidGidMap(nsjconf_t* nsjconf, pid_t pid) {
+ RETURN_ON_FAILURE(gidMapSelf(nsjconf, pid));
+ RETURN_ON_FAILURE(gidMapExternal(nsjconf, pid));
+ RETURN_ON_FAILURE(uidMapSelf(nsjconf, pid));
+ RETURN_ON_FAILURE(uidMapExternal(nsjconf, pid));
+
+ return true;
+}
+
+bool initNsFromParent(nsjconf_t* nsjconf, pid_t pid) {
+ if (!setGroups(pid)) {
+ return false;
+ }
+ if (!nsjconf->clone_newuser) {
+ return true;
+ }
+ if (!uidGidMap(nsjconf, pid)) {
+ return false;
+ }
+ return true;
+}
+
+bool initNsFromChild(nsjconf_t* nsjconf) {
+ /*
+ * Best effort because of /proc/self/setgroups
+ */
+ LOG_D("setgroups(0, NULL)");
+ const gid_t* group_list = NULL;
+ if (setgroups(0, group_list) == -1) {
+ PLOG_D("setgroups(NULL) failed");
+ }
+
+ /*
+ * Make sure all capabilities are retained after the subsequent setuid/setgid, as they will
+ * be needed for privileged operations: mounts, uts change etc.
+ */
+ if (prctl(PR_SET_SECUREBITS, SECBIT_KEEP_CAPS | SECBIT_NO_SETUID_FIXUP, 0UL, 0UL, 0UL) ==
+ -1) {
+ PLOG_E("prctl(PR_SET_SECUREBITS, SECBIT_KEEP_CAPS | SECBIT_NO_SETUID_FIXUP)");
+ return false;
+ }
+
+ if (!setResGid(nsjconf->gids[0].inside_id)) {
+ PLOG_E("setresgid(%u)", nsjconf->gids[0].inside_id);
+ return false;
+ }
+ if (!setResUid(nsjconf->uids[0].inside_id)) {
+ PLOG_E("setresuid(%u)", nsjconf->uids[0].inside_id);
+ return false;
+ }
+
+ return true;
+}
+
+static uid_t parseUid(const std::string& id) {
+ if (id.empty()) {
+ return getuid();
+ }
+ struct passwd* pw = getpwnam(id.c_str());
+ if (pw != NULL) {
+ return pw->pw_uid;
+ }
+ if (util::isANumber(id.c_str())) {
+ return (uid_t)strtoimax(id.c_str(), NULL, 0);
+ }
+ return (uid_t)-1;
+}
+
+static gid_t parseGid(const std::string& id) {
+ if (id.empty()) {
+ return getgid();
+ }
+ struct group* gr = getgrnam(id.c_str());
+ if (gr != NULL) {
+ return gr->gr_gid;
+ }
+ if (util::isANumber(id.c_str())) {
+ return (gid_t)strtoimax(id.c_str(), NULL, 0);
+ }
+ return (gid_t)-1;
+}
+
+bool parseId(nsjconf_t* nsjconf, const std::string& i_id, const std::string& o_id, size_t cnt,
+ bool is_gid, bool is_newidmap) {
+ if (cnt < 1) {
+ cnt = 1;
+ }
+
+ uid_t inside_id;
+ uid_t outside_id;
+
+ if (is_gid) {
+ inside_id = parseGid(i_id);
+ if (inside_id == (uid_t)-1) {
+ LOG_W("Cannot parse '%s' as GID", i_id.c_str());
+ return false;
+ }
+ outside_id = parseGid(o_id);
+ if (outside_id == (uid_t)-1) {
+ LOG_W("Cannot parse '%s' as GID", o_id.c_str());
+ return false;
+ }
+ } else {
+ inside_id = parseUid(i_id);
+ if (inside_id == (uid_t)-1) {
+ LOG_W("Cannot parse '%s' as UID", i_id.c_str());
+ return false;
+ }
+ outside_id = parseUid(o_id);
+ if (outside_id == (uid_t)-1) {
+ LOG_W("Cannot parse '%s' as UID", o_id.c_str());
+ return false;
+ }
+ }
+
+ idmap_t id;
+ id.inside_id = inside_id;
+ id.outside_id = outside_id;
+ id.count = cnt;
+ id.is_newidmap = is_newidmap;
+
+ if (is_gid) {
+ nsjconf->gids.push_back(id);
+ } else {
+ nsjconf->uids.push_back(id);
+ }
+
+ return true;
+}
+
+} // namespace user
diff --git a/user.h b/user.h
new file mode 100644
index 0000000..598ea81
--- /dev/null
+++ b/user.h
@@ -0,0 +1,40 @@
+/*
+
+ nsjail - CLONE_NEWUSER routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_USER_H
+#define NS_USER_H
+
+#include <stdbool.h>
+
+#include <string>
+
+#include "nsjail.h"
+
+namespace user {
+
+bool initNsFromParent(nsjconf_t* nsjconf, pid_t pid);
+bool initNsFromChild(nsjconf_t* nsjconf);
+bool parseId(nsjconf_t* nsjconf, const std::string& i_id, const std::string& o_id, size_t cnt,
+ bool is_gid, bool is_newidmap);
+
+} // namespace user
+
+#endif /* NS_USER_H */
diff --git a/util.cc b/util.cc
new file mode 100644
index 0000000..35e1749
--- /dev/null
+++ b/util.cc
@@ -0,0 +1,320 @@
+/*
+
+ nsjail - useful procedures
+ -----------------------------------------
+
+ Copyright 2016 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "util.h"
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <sstream>
+#include <string>
+#include <vector>
+
+#include "logs.h"
+#include "macros.h"
+
+namespace util {
+
+ssize_t readFromFd(int fd, void* buf, size_t len) {
+ uint8_t* charbuf = (uint8_t*)buf;
+
+ size_t readSz = 0;
+ while (readSz < len) {
+ ssize_t sz = TEMP_FAILURE_RETRY(read(fd, &charbuf[readSz], len - readSz));
+ if (sz <= 0) {
+ break;
+ }
+ readSz += sz;
+ }
+ return readSz;
+}
+
+ssize_t readFromFile(const char* fname, void* buf, size_t len) {
+ int fd;
+ TEMP_FAILURE_RETRY(fd = open(fname, O_RDONLY | O_CLOEXEC));
+ if (fd == -1) {
+ PLOG_E("open('%s', O_RDONLY|O_CLOEXEC)", fname);
+ return -1;
+ }
+ ssize_t ret = readFromFd(fd, buf, len);
+ close(fd);
+ return ret;
+}
+
+bool writeToFd(int fd, const void* buf, size_t len) {
+ const uint8_t* charbuf = (const uint8_t*)buf;
+
+ size_t writtenSz = 0;
+ while (writtenSz < len) {
+ ssize_t sz = TEMP_FAILURE_RETRY(write(fd, &charbuf[writtenSz], len - writtenSz));
+ if (sz < 0) {
+ return false;
+ }
+ writtenSz += sz;
+ }
+ return true;
+}
+
+bool writeBufToFile(const char* filename, const void* buf, size_t len, int open_flags) {
+ int fd;
+ TEMP_FAILURE_RETRY(fd = open(filename, open_flags, 0644));
+ if (fd == -1) {
+ PLOG_E("Couldn't open '%s' for writing", filename);
+ return false;
+ }
+
+ if (!writeToFd(fd, buf, len)) {
+ PLOG_E("Couldn't write '%zu' bytes to file '%s' (fd='%d')", len, filename, fd);
+ close(fd);
+ if (open_flags & O_CREAT) {
+ unlink(filename);
+ }
+ return false;
+ }
+
+ LOG_D("Written '%zu' bytes to '%s'", len, filename);
+
+ close(fd);
+ return true;
+}
+
+bool createDirRecursively(const char* dir) {
+ if (dir[0] != '/') {
+ LOG_W("The directory path must start with '/': '%s' provided", dir);
+ return false;
+ }
+
+ int prev_dir_fd = TEMP_FAILURE_RETRY(open("/", O_RDONLY | O_CLOEXEC | O_DIRECTORY));
+ if (prev_dir_fd == -1) {
+ PLOG_W("open('/', O_RDONLY | O_CLOEXEC | O_DIRECTORY)");
+ return false;
+ }
+
+ char path[PATH_MAX];
+ snprintf(path, sizeof(path), "%s", dir);
+ char* curr = path;
+ for (;;) {
+ while (*curr == '/') {
+ curr++;
+ }
+
+ char* next = strchr(curr, '/');
+ if (next == NULL) {
+ close(prev_dir_fd);
+ return true;
+ }
+ *next = '\0';
+
+ if (mkdirat(prev_dir_fd, curr, 0755) == -1 && errno != EEXIST) {
+ PLOG_W("mkdirat('%d', '%s', 0755)", prev_dir_fd, curr);
+ close(prev_dir_fd);
+ return false;
+ }
+
+ int dir_fd = TEMP_FAILURE_RETRY(openat(prev_dir_fd, curr, O_DIRECTORY | O_CLOEXEC));
+ if (dir_fd == -1) {
+ PLOG_W("openat('%d', '%s', O_DIRECTORY | O_CLOEXEC)", prev_dir_fd, curr);
+ close(prev_dir_fd);
+ return false;
+ }
+ close(prev_dir_fd);
+ prev_dir_fd = dir_fd;
+ curr = next + 1;
+ }
+}
+
+std::string* StrAppend(std::string* str, const char* format, ...) {
+ char* strp;
+
+ va_list args;
+ va_start(args, format);
+ int ret = vasprintf(&strp, format, args);
+ va_end(args);
+
+ if (ret == -1) {
+ PLOG_E("Memory allocation failed during vasprintf()");
+ str->append(" [ERROR: mem_allocation_failed] ");
+ return str;
+ }
+
+ str->append(strp, ret);
+ free(strp);
+ return str;
+}
+
+std::string StrPrintf(const char* format, ...) {
+ char* strp;
+
+ va_list args;
+ va_start(args, format);
+ int ret = vasprintf(&strp, format, args);
+ va_end(args);
+
+ if (ret == -1) {
+ PLOG_E("Memory allocation failed during vasprintf()");
+ return "[ERROR: mem_allocation_failed]";
+ }
+
+ std::string str(strp, ret);
+ free(strp);
+ return str;
+}
+
+bool isANumber(const char* s) {
+ for (size_t i = 0; s[i]; i++) {
+ if (!isdigit((unsigned char)s[i]) && s[i] != 'x') {
+ return false;
+ }
+ }
+ return true;
+}
+
+static __thread pthread_once_t rndThreadOnce = PTHREAD_ONCE_INIT;
+static __thread uint64_t rndX;
+
+/* MMIX LCG PRNG */
+static const uint64_t a = 6364136223846793005ULL;
+static const uint64_t c = 1442695040888963407ULL;
+
+static void rndInitThread(void) {
+#if defined(__NR_getrandom)
+ if (syscall(__NR_getrandom, &rndX, sizeof(rndX), 0) == sizeof(rndX)) {
+ return;
+ }
+#endif /* defined(__NR_getrandom) */
+ int fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
+ if (fd == -1) {
+ PLOG_D(
+ "Couldn't open /dev/urandom for reading. Using gettimeofday "
+ "fall-back");
+ struct timeval tv;
+ gettimeofday(&tv, NULL);
+ rndX = tv.tv_usec + ((uint64_t)tv.tv_sec << 32);
+ return;
+ }
+ if (readFromFd(fd, (uint8_t*)&rndX, sizeof(rndX)) != sizeof(rndX)) {
+ PLOG_F("Couldn't read '%zu' bytes from /dev/urandom", sizeof(rndX));
+ close(fd);
+ }
+ close(fd);
+}
+
+uint64_t rnd64(void) {
+ pthread_once(&rndThreadOnce, rndInitThread);
+ rndX = a * rndX + c;
+ return rndX;
+}
+
+const std::string sigName(int signo) {
+ std::string res;
+
+ struct {
+ const int signo;
+ const char* const name;
+ } static const sigNames[] = {
+ NS_VALSTR_STRUCT(SIGINT),
+ NS_VALSTR_STRUCT(SIGILL),
+ NS_VALSTR_STRUCT(SIGABRT),
+ NS_VALSTR_STRUCT(SIGFPE),
+ NS_VALSTR_STRUCT(SIGSEGV),
+ NS_VALSTR_STRUCT(SIGTERM),
+ NS_VALSTR_STRUCT(SIGHUP),
+ NS_VALSTR_STRUCT(SIGQUIT),
+ NS_VALSTR_STRUCT(SIGTRAP),
+ NS_VALSTR_STRUCT(SIGKILL),
+ NS_VALSTR_STRUCT(SIGBUS),
+ NS_VALSTR_STRUCT(SIGSYS),
+ NS_VALSTR_STRUCT(SIGPIPE),
+ NS_VALSTR_STRUCT(SIGALRM),
+ NS_VALSTR_STRUCT(SIGURG),
+ NS_VALSTR_STRUCT(SIGSTOP),
+ NS_VALSTR_STRUCT(SIGTSTP),
+ NS_VALSTR_STRUCT(SIGCONT),
+ NS_VALSTR_STRUCT(SIGCHLD),
+ NS_VALSTR_STRUCT(SIGTTIN),
+ NS_VALSTR_STRUCT(SIGTTOU),
+ NS_VALSTR_STRUCT(SIGPOLL),
+ NS_VALSTR_STRUCT(SIGXCPU),
+ NS_VALSTR_STRUCT(SIGXFSZ),
+ NS_VALSTR_STRUCT(SIGVTALRM),
+ NS_VALSTR_STRUCT(SIGPROF),
+ NS_VALSTR_STRUCT(SIGUSR1),
+ NS_VALSTR_STRUCT(SIGUSR2),
+ NS_VALSTR_STRUCT(SIGWINCH),
+ };
+
+ for (const auto& i : sigNames) {
+ if (signo == i.signo) {
+ res.append(i.name);
+ return res;
+ }
+ }
+
+ if (signo >= SIGRTMIN) {
+ res.append("SIG");
+ res.append(std::to_string(signo));
+ res.append("-RTMIN+");
+ res.append(std::to_string(signo - SIGRTMIN));
+ return res;
+ }
+
+ res.append("SIGUNKNOWN(");
+ res.append(std::to_string(signo));
+ res.append(")");
+ return res;
+}
+
+const std::string timeToStr(time_t t) {
+ char timestr[128];
+ struct tm loctime;
+ localtime_r(&t, &loctime);
+ if (strftime(timestr, sizeof(timestr) - 1, "%FT%T%z", &loctime) == 0) {
+ return "[Time conv error]";
+ }
+ return timestr;
+}
+
+std::vector<std::string> strSplit(const std::string str, char delim) {
+ std::vector<std::string> vec;
+ std::istringstream stream(str);
+ for (std::string word; std::getline(stream, word, delim);) {
+ vec.push_back(word);
+ }
+ return vec;
+}
+
+} // namespace util
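Note on strSplit() above: because it is built on std::getline(), consecutive delimiters yield empty fields, while a single trailing delimiter does not produce a trailing empty field. The standalone sketch below reproduces that loop so the behavior can be checked in isolation. The name splitOn is hypothetical and is not part of this patch, and the sketch takes its argument by const reference, which is an assumption rather than what util.cc does.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Illustrative copy of the getline-based splitting loop.
// splitOn is a hypothetical name; it is not declared in util.h.
std::vector<std::string> splitOn(const std::string& str, char delim) {
    std::vector<std::string> vec;
    std::istringstream stream(str);
    // std::getline() returns the stream; the loop ends once extraction fails,
    // which happens when no characters are left to read.
    for (std::string word; std::getline(stream, word, delim);) {
        vec.push_back(word);
    }
    return vec;
}
```

Callers that need to tell "a,b" apart from "a,b," would have to inspect the input themselves, since both produce the same two fields here.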
diff --git a/util.h b/util.h
new file mode 100644
index 0000000..357b606
--- /dev/null
+++ b/util.h
@@ -0,0 +1,59 @@
+/*
+
+ nsjail - useful procedures
+ -----------------------------------------
+
+ Copyright 2016 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_UTIL_H
+#define NS_UTIL_H
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+#include <string>
+#include <vector>
+
+#include "nsjail.h"
+
+#define RETURN_ON_FAILURE(expr) \
+ do { \
+ if (!(expr)) { \
+ return false; \
+ } \
+ } while (0)
+
+namespace util {
+
+ssize_t readFromFd(int fd, void* buf, size_t len);
+ssize_t readFromFile(const char* fname, void* buf, size_t len);
+bool writeToFd(int fd, const void* buf, size_t len);
+bool writeBufToFile(const char* filename, const void* buf, size_t len, int open_flags);
+bool createDirRecursively(const char* dir);
+std::string* StrAppend(std::string* str, const char* format, ...)
+ __attribute__((format(printf, 2, 3)));
+std::string StrPrintf(const char* format, ...) __attribute__((format(printf, 1, 2)));
+bool isANumber(const char* s);
+uint64_t rnd64(void);
+const std::string sigName(int signo);
+const std::string timeToStr(time_t t);
+std::vector<std::string> strSplit(const std::string str, char delim);
+
+} // namespace util
+
+#endif /* NS_UTIL_H */
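Note on RETURN_ON_FAILURE in util.h above: the macro returns false from the enclosing function as soon as an expression evaluates to false, so it is only usable inside functions that return bool (or something convertible from false). The sketch below shows the early-return behavior. The macro is copied from the header; stepOk, stepFail and runAll are hypothetical helpers invented for this illustration and do not exist in nsjail.

```cpp
// Same definition as in util.h, repeated so the sketch is self-contained.
#define RETURN_ON_FAILURE(expr) \
    do {                        \
        if (!(expr)) {          \
            return false;       \
        }                       \
    } while (0)

// Hypothetical steps, for illustration only.
static bool stepOk() { return true; }
static bool stepFail() { return false; }

// Runs the steps in order and records how far execution got in *reached.
bool runAll(int* reached) {
    *reached = 0;
    RETURN_ON_FAILURE(stepOk());
    *reached = 1;
    RETURN_ON_FAILURE(stepFail());  // returns false here
    *reached = 2;                   // never executed
    return true;
}
```

The do { ... } while (0) wrapper makes the macro behave like a single statement, so it can be followed by a semicolon and used safely inside an unbraced if/else.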
diff --git a/uts.cc b/uts.cc
new file mode 100644
index 0000000..9148d2b
--- /dev/null
+++ b/uts.cc
@@ -0,0 +1,44 @@
+/*
+
+ nsjail - CLONE_NEWUTS routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#include "uts.h"
+
+#include <string.h>
+#include <unistd.h>
+
+#include "logs.h"
+
+namespace uts {
+
+bool initNs(nsjconf_t* nsjconf) {
+ if (!nsjconf->clone_newuts) {
+ return true;
+ }
+
+ LOG_D("Setting hostname to '%s'", nsjconf->hostname.c_str());
+ if (sethostname(nsjconf->hostname.data(), nsjconf->hostname.length()) == -1) {
+ PLOG_E("sethostname('%s')", nsjconf->hostname.c_str());
+ return false;
+ }
+ return true;
+}
+
+} // namespace uts
diff --git a/uts.h b/uts.h
new file mode 100644
index 0000000..04b294f
--- /dev/null
+++ b/uts.h
@@ -0,0 +1,35 @@
+/*
+
+ nsjail - CLONE_NEWUTS routines
+ -----------------------------------------
+
+ Copyright 2014 Google Inc. All Rights Reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+*/
+
+#ifndef NS_UTS_H
+#define NS_UTS_H
+
+#include <stdbool.h>
+
+#include "nsjail.h"
+
+namespace uts {
+
+bool initNs(nsjconf_t* nsjconf);
+
+} // namespace uts
+
+#endif /* NS_UTS_H */