\title{Device trees everywhere}
\author{David Gibson \texttt{<{dwg}{@}{}>}\\
Benjamin Herrenschmidt \texttt{<{benh}{@}{}>}\\
\emph{OzLabs, IBM Linux Technology Center}}
\newcommand{\fixme}[1]{$\bigstar$\emph{\textbf{\large #1}}$\bigstar$\xspace}
\newcommand{\of}{Open Firmware\xspace}
\newcommand{\benh}{Ben Herrenschmidt\xspace}
We present a method for booting a \ppc{}\R Linux\R kernel on an
embedded machine. To do this, we supply the kernel with a compact
flattened-tree representation of the system's hardware based on the
device tree supplied by Open Firmware on IBM\R servers and Apple\R
Power Macintosh\R machines.
The ``blob'' representing the device tree can be created using \dtc
--- the Device Tree Compiler --- that turns a simple text
representation of the tree into the compact representation used by
the kernel. The compiler can produce either a binary ``blob'' or an
assembler file ready to be built into a firmware or bootwrapper image.
This flattened-tree approach is now the only supported method of
booting a \texttt{ppc64} kernel without Open Firmware, and we plan
to make it the only supported method for all \texttt{powerpc}
kernels in the future.
\subsection{OF and the device tree}
Historically, ``everyday'' \ppc machines have booted with the help of
\of (OF), a firmware environment defined by IEEE1275 \cite{IEEE1275}.
Among other boot-time services, OF maintains a device tree that
describes all of the system's hardware devices and how they're
connected. During boot, before taking control of memory management,
the Linux kernel uses OF calls to scan the device tree and transfer it
to an internal representation that is used at run time to look up
various device information.
The device tree consists of nodes representing devices or
buses\footnote{Well, mostly. There are a few special exceptions.}.
Each node contains \emph{properties}, name--value pairs that give
information about the device. The values are arbitrary byte strings,
and for some properties, they contain tables or other structured
\subsection{The bad old days}
Embedded systems, by contrast, usually have a minimal firmware that
might supply a few vital system parameters (size of RAM and the like),
but nothing as detailed or complete as the OF device tree. This has
meant that the various 32-bit \ppc embedded ports have required a
variety of hacks spread across the kernel to deal with the lack of
device tree. These vary from specialised boot wrappers to parse
parameters (which are at least reasonably localised) to
CONFIG-dependent hacks in drivers to override normal probe logic with
hardcoded addresses for a particular board. As well as being ugly in
themselves, such CONFIG-dependent hacks make it hard to build a single
kernel image that supports multiple embedded machines.
Until relatively recently, the only 64-bit \ppc machines without OF
were legacy (pre-POWER5\R) iSeries\R machines. iSeries machines often
only have virtual IO devices, which makes it quite simple to work
around the lack of a device tree. Even so, the lack means the iSeries
boot sequence must be quite different from the pSeries or Macintosh,
which is not ideal.
The device tree also presents a problem for implementing \kexec. When
the kernel boots, it takes over full control of the system from OF,
even re-using OF's memory. So, when \kexec comes to boot another
kernel, OF is no longer around for the second kernel to query.
\section{The Flattened Tree}
In May 2005 \benh implemented a new approach to handling the device
tree that addresses all these problems. When booting on OF systems,
the first thing the kernel runs is a small piece of code in
\texttt{prom\_init.c}, which executes in the context of OF. This code
walks the device tree using OF calls, and transcribes it into a
compact, flattened format. The resulting device tree ``blob'' is then
passed to the kernel proper, which eventually unflattens the tree into
its runtime form. This blob is the only data communicated between the
\texttt{prom\_init.c} bootstrap and the rest of the kernel.
When OF isn't available, either because the machine doesn't have it at
all or because \kexec has been used, the kernel instead starts
directly from the entry point taking a flattened device tree. The
device tree blob must be passed in from outside, rather than generated
by part of the kernel from OF. For \kexec, the userland
\texttt{kexec} tools build the blob from the runtime device tree
before invoking the new kernel. For embedded systems the blob can
come either from the embedded bootloader, or from a specialised
version of the \texttt{zImage} wrapper for the system in question.
\subsection{Properties of the flattened tree}
The flattened tree format should be easy to handle, both for the
kernel that parses it and the bootloader that generates it. In
particular, the following properties are desirable:
\item \emph{relocatable}: the bootloader or kernel should be able to
move the blob around as a whole, without needing to parse or adjust
its internals. In practice that means we must not use pointers
within the blob.
\item \emph{insert and delete}: sometimes the bootloader might want to
make tweaks to the flattened tree, such as deleting or inserting a
node (or whole subtree). It should be possible to do this without
having to effectively regenerate the whole flattened tree. In
practice this means limiting the use of internal offsets in the blob
that need recalculation if a section is inserted or removed with
\item \emph{compact}: embedded systems are frequently short of
resources, particularly RAM and flash memory space. Thus, the tree
representation should be kept as small as conveniently possible.
\subsection{Format of the device tree blob}
\multicolumn{1}{r}{\textbf{Offset}}& \multicolumn{1}{c}{\textbf{Contents}} \\\cline{2-2}
\texttt{0x00} & \texttt{0xd00dfeed} & magic number \\\cline{2-2}
\texttt{0x04} & \emph{totalsize} \\\cline{2-2}
\texttt{0x08} & \emph{off\_struct} & \\\cline{2-2}
\texttt{0x0C} & \emph{off\_strs} & \\\cline{2-2}
\texttt{0x10} & \emph{off\_rsvmap} & \\\cline{2-2}
\texttt{0x14} & \emph{version} \\\cline{2-2}
\texttt{0x18} & \emph{last\_comp\_ver} & \\\cline{2-2}
\texttt{0x1C} & \emph{boot\_cpu\_id} & \tge v2 only\\\cline{2-2}
\texttt{0x20} & \emph{size\_strs} & \tge v3 only\\\cline{2-2}
\multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
\emph{off\_rsvmap} & \emph{address0} & memory reserve \\
+ \texttt{0x04} & ...& table \\\cline{2-2}
+ \texttt{0x08} & \emph{len0} & \\
+ \texttt{0x0C} & ...& \\\cline{2-2}
\vdots & \multicolumn{1}{c|}{\vdots} & \\\cline{2-2}
& \texttt{0x00000000}- & end marker\\
& \texttt{00000000} & \\\cline{2-2}
& \texttt{0x00000000}- & \\
& \texttt{00000000} & \\\cline{2-2}
\multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
\emph{off\_strs} & \texttt{'n' 'a' 'm' 'e'} & strings block \\
+ \texttt{0x04} & \texttt{~0~ 'm' 'o' 'd'} & \\
+ \texttt{0x08} & \texttt{'e' 'l' ~0~ \makebox[\widthof{~~~}]{\textrm{...}}} & \\
\vdots & \multicolumn{1}{c|}{\vdots} & \\\cline{2-2}
\multicolumn{1}{r}{+ \emph{size\_strs}} \\
\multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
\emph{off\_struct} & \dtbeginnode & structure block \\\cline{2-2}
+ \texttt{0x04} & \texttt{'/' ~0~ ~0~ ~0~} & root node\\\cline{2-2}
+ \texttt{0x08} & \dtprop & \\\cline{2-2}
+ \texttt{0x0C} & \texttt{0x00000005} & ``\texttt{model}''\\\cline{2-2}
+ \texttt{0x10} & \texttt{0x00000008} & \\\cline{2-2}
+ \texttt{0x14} & \texttt{'M' 'y' 'B' 'o'} & \\
+ \texttt{0x18} & \texttt{'a' 'r' 'd' ~0~} & \\\cline{2-2}
\vdots & \multicolumn{1}{c|}{\vdots} & \\\cline{2-2}
& \texttt{\dtendnode} \\\cline{2-2}
& \texttt{\dtend} \\\cline{2-2}
\multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
\multicolumn{1}{r}{\emph{totalsize}} \\
\caption{Device tree blob layout}
The blob format we devised was first described on the
\texttt{linuxppc64-dev} mailing list in \cite{noof1}. The format has
since evolved through various revisions, and the current version is
included as part of the \dtc (see \S\ref{sec:dtc}) git tree,
Figure \ref{fig:blob-layout} shows the layout of the blob of data
containing the device tree. It has three sections of variable size:
the \emph{memory reserve table}, the \emph{structure block} and the
\emph{strings block}. A small header gives the blob's size and
version and the locations of the three sections, plus a handful of
vital parameters used during early boot.
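The header fields listed in Figure \ref{fig:blob-layout} can be decoded with a few lines of code. The following is a minimal sketch, not the kernel's parser: it assumes big-endian 32-bit words (as on \ppc), and the function name \texttt{parse\_header} and the example values are purely illustrative.

```python
import struct

# Header layout taken from the figure: up to nine 32-bit words.
# Big-endian encoding is an assumption of this sketch.
HEADER_FIELDS = (
    "magic", "totalsize", "off_struct", "off_strs", "off_rsvmap",
    "version", "last_comp_ver", "boot_cpu_id", "size_strs",
)
MAGIC = 0xD00DFEED


def parse_header(blob: bytes) -> dict:
    """Decode the fixed header; later fields exist only in newer versions."""
    if len(blob) < 28:
        raise ValueError("blob too short for a header")
    # The first seven words are present in every version.
    values = struct.unpack_from(">7I", blob, 0)
    header = dict(zip(HEADER_FIELDS, values))
    if header["magic"] != MAGIC:
        raise ValueError("bad magic number")
    if header["version"] >= 2:
        header["boot_cpu_id"] = struct.unpack_from(">I", blob, 0x1C)[0]
    if header["version"] >= 3:
        header["size_strs"] = struct.unpack_from(">I", blob, 0x20)[0]
    return header


# Hypothetical header values, chosen only to exercise the parser.
example = struct.pack(">9I", MAGIC, 0x100, 0x48, 0xC0, 0x28, 3, 1, 0, 0x20)
print(parse_header(example))
```

Because every location in the header is an offset from the start of the blob, the same code works wherever the blob has been moved to in memory.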
The memory reserve map section gives a list of regions of memory that
the kernel must not use\footnote{Usually such ranges contain some data
structure initialised by the firmware that must be preserved by the
kernel.}. The list is represented as a simple array of (address,
size) pairs of 64-bit values, terminated by a zero-size entry. The
strings block is similarly simple, consisting of a number of
null-terminated strings appended together, which are referenced from
the structure block as described below.
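As an illustration of the memory reserve table, the sketch below reads (address, size) pairs until it meets the zero-size terminator. It assumes big-endian 64-bit values; \texttt{parse\_reserve\_map} is a hypothetical helper name, not part of the kernel or \dtc.

```python
import struct


def parse_reserve_map(blob: bytes, off_rsvmap: int) -> list:
    """Return (address, size) pairs up to, not including, the terminator."""
    entries = []
    offset = off_rsvmap
    while True:
        # Each entry is two 64-bit values (big-endian assumed).
        address, size = struct.unpack_from(">QQ", blob, offset)
        if size == 0:  # a zero-size entry ends the table
            break
        entries.append((address, size))
        offset += 16
    return entries


# One reserved region followed by the end marker.
table = struct.pack(">QQQQ", 0x20000000, 0x02000000, 0, 0)
print(parse_reserve_map(table, 0))
```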
The structure block contains the device tree proper. Each node is
introduced with a 32-bit \dtbeginnode tag, followed by the node's name
as a null-terminated string, padded to a 32-bit boundary. Then
follows all of the properties of the node, each introduced with a
\dtprop tag, then all of the node's subnodes, each introduced with
their own \dtbeginnode tag. The node ends with an \dtendnode tag, and
after the \dtendnode for the root node is an \dtend tag, indicating
the end of the whole tree\footnote{This is redundant, but included for
ease of parsing.}. The structure block starts with the \dtbeginnode
introducing the description of the root node (named \texttt{/}).
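To make the padding rule concrete, here is a small sketch that encodes the start of a node: the tag, then the null-terminated name padded to a 32-bit boundary. The numeric tag value \texttt{0x1} and the big-endian encoding are assumptions of this example (the text names the tag but does not give its value), and \texttt{encode\_node\_start} is a hypothetical helper.

```python
import struct

# Assumed tag value, for illustration only.
BEGIN_NODE = 0x1


def encode_node_start(name: str) -> bytes:
    """Emit the begin-node tag followed by the padded, null-terminated name."""
    raw = name.encode("ascii") + b"\0"
    padding = -len(raw) % 4  # bytes needed to reach a 32-bit boundary
    return struct.pack(">I", BEGIN_NODE) + raw + b"\0" * padding


# The root node: one character, one terminator, two bytes of padding.
print(encode_node_start("/").hex())
```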
Each property, after the \dtprop, has a 32-bit value giving an offset
from the beginning of the strings block at which the property name is
stored. Because it's common for many nodes to have properties with
the same name, this approach can substantially reduce the total size
of the blob. The name offset is followed by the length of the
property value (as a 32-bit value) and then the data itself padded to
a 32-bit boundary.
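The space saving from sharing property names can be sketched as follows. The helper names are hypothetical; the point is only that each distinct name is stored once and every property refers to it by offset.

```python
def build_strings_block(property_names):
    """Pack each distinct name once; return (block, name -> offset)."""
    block = bytearray()
    offsets = {}
    for name in property_names:
        if name not in offsets:
            offsets[name] = len(block)
            block += name.encode("ascii") + b"\0"
    return bytes(block), offsets


def name_at(block: bytes, offset: int) -> str:
    """Read the null-terminated name stored at the given offset."""
    end = block.index(b"\0", offset)
    return block[offset:end].decode("ascii")


# Five properties across several nodes, but only two distinct names.
names = ["name", "model", "name", "model", "name"]
block, offsets = build_strings_block(names)
print(offsets, len(block))
```

With these inputs the block holds \texttt{name} at offset 0 and \texttt{model} at offset 5, matching the strings block shown in Figure \ref{fig:blob-layout}.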
\subsection{Contents of the tree}
Having seen how to represent the device tree structure as a flattened
blob, what actually goes into the tree? The short answer is ``the
same as an OF tree''. On OF systems, the flattened tree is
transcribed directly from the OF device tree, so for simplicity we
also use OF conventions for the tree on other systems.
In many cases a flat tree can be simpler than a typical OF-provided
device tree. The flattened tree need only provide those nodes and
properties that the kernel actually requires; the flattened tree
generally need not include devices that the kernel can probe itself.
For example, an OF device tree would normally include nodes for each
PCI device on the system. A flattened tree need only include nodes
for the PCI host bridges; the kernel will scan the buses thus
described to find the subsidiary devices. The device tree can include
nodes for devices where the kernel needs extra information, though:
for example, for ISA devices on a subsidiary PCI/ISA bridge, or for
devices with unusual interrupt routing.
Where they exist, we follow the IEEE1275 bindings that specify how to
describe various buses in the device tree (for example,
\cite{IEEE1275-pci} describes how to represent PCI devices). The
standard has not been updated for a long time, however, and lacks
bindings for many modern buses and devices. In particular,
embedded-specific devices such as the various System-on-Chip buses are not
covered. We intend to create new bindings for such buses, in keeping
with the general conventions of IEEE1275 (a simple such binding for a
System-on-Chip bus was included in \cite{noof5} a revision of
One complication arises for representing ``phandles'' in the flattened
tree. In OF, each node in the tree has an associated phandle, a
32-bit integer that uniquely identifies the node\footnote{In practice
usually implemented as a pointer or offset within OF memory.}. This
handle is used by the various OF calls to query and traverse the tree.
Sometimes phandles are also used within the tree to refer to other
nodes in the tree. For example, devices that produce interrupts
generally have an \texttt{interrupt-parent} property giving the
phandle of the interrupt controller that handles interrupts from this
device. Parsing these and other interrupt-related properties allows
the kernel to build a complete representation of the system's
interrupt tree, which can be quite different from the tree of bus
In the flattened tree, a node's phandle is represented by a special
\phandle property. When the kernel generates a flattened tree from
OF, it adds a \phandle property to each node, containing the phandle
retrieved from OF. When the tree is generated without OF, however,
only nodes that are actually referred to by phandle need to have this property.
Another complication arises because nodes in an OF tree have two
names. First they have the ``unit name'', which is how the node is
referred to in an OF path. The unit name generally consists of a
device type followed by an \texttt{@} followed by a \emph{unit
address}. For example \texttt{/memory@0} is the full path of a memory
node at address 0, \texttt{/ht@0,f2000000/pci@1} is the path of a PCI
bus node, which is under a HyperTransport\tm bus node. The form of
the unit address is bus dependent, but is generally derived from the
node's \texttt{reg} property. In addition, nodes have a property,
\texttt{name}, whose value is usually equal to the first part of the
unit name. For example, the nodes in the previous example would have
\texttt{name} properties equal to \texttt{memory} and \texttt{pci},
respectively. To save space in the blob, the current version of the
flattened tree format only requires the unit names to be present.
When the kernel unflattens the tree, it automatically generates a
\texttt{name} property from the node's path name.
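Deriving the \texttt{name} property from a path is a simple string operation. The sketch below shows one way to do it; the function names are hypothetical and this is not the kernel's actual unflattening code.

```python
def name_from_unit_name(unit_name: str) -> str:
    """Drop the '@unit-address' suffix, if any."""
    return unit_name.split("@", 1)[0]


def name_from_path(path: str) -> str:
    """Take the last path component and strip its unit address."""
    last = path.rstrip("/").rsplit("/", 1)[-1]
    return name_from_unit_name(last)


# The two example paths from the text.
print(name_from_path("/memory@0"))
print(name_from_path("/ht@0,f2000000/pci@1"))
```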
\section{The Device Tree Compiler}
/memreserve/ 0x20000000-0x21FFFFFF;

/ {
    model = "MyBoard";
    compatible = "MyBoardFamily";
    #address-cells = <2>;
    #size-cells = <2>;

    cpus {
        #address-cells = <1>;
        #size-cells = <0>;
        PowerPC,970@0 {
            device_type = "cpu";
            reg = <0>;
            clock-frequency = <5f5e1000>;
            timebase-frequency = <1FCA055>;
            i-cache-size = <10000>;
            d-cache-size = <8000>;
        };
    };

    memory@0 {
        device_type = "memory";
        memreg: reg = <00000000 00000000
                       00000000 20000000>;
    };

    mpic@0x3fffdd08400 {
        /* Interrupt controller */
        /* ... */
    };

    pci@40000000000000 {
        /* PCI host bridge */
        /* ... */
    };

    chosen {
        bootargs = "root=/dev/sda2";
        linux,platform = <00000600>;
        interrupt-controller =
            < &/mpic@0x3fffdd08400 >;
    };
};
\caption{Example \dtc source}
As we've seen, the flattened device tree format provides a convenient
way of communicating device tree information to the kernel. It's
simple for the kernel to parse, and simple for bootloaders to
manipulate. On OF systems, it's easy to generate the flattened tree
by walking the OF-maintained tree. However, for embedded systems, the
flattened tree must be generated from scratch.
Embedded bootloaders are generally built for a particular board. So,
it's usually possible to build the device tree blob at compile time
and include it in the bootloader image. For minor revisions of the
board, the bootloader can contain code to make the necessary tweaks to
the tree before passing it to the booted kernel.
The device trees for embedded boards are usually quite simple, and
it's possible to construct the necessary blob by hand, but doing
so is tedious. The ``device tree compiler'', \dtc{}\footnote{\dtc can
be obtained from \cite{dtcgit}.}, is designed to make creating device
tree blobs easier by converting a text representation of the tree
into the necessary blob.
\subsection{Input and output formats}
As well as the normal mode of compiling a device tree blob from text
source, \dtc can convert a device tree between a number of
representations. It can take its input in one of three different formats:
\item source, the normal case. The device tree is described in a text
form, described in \S\ref{sec:dts}.
\item blob (\texttt{dtb}), the flattened tree format described in
\S\ref{sec:format}. This mode is useful for checking a pre-existing
device tree blob.
\item filesystem (\texttt{fs}), input is a directory tree in the
layout of \texttt{/proc/device-tree} (roughly, a directory for each
node in the device tree, a file for each property). This is useful
for building a blob for the device tree in use by the currently
running kernel.
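The filesystem layout maps naturally onto a recursive walk: directories become nodes and files become properties. The following sketch shows the idea; \texttt{read\_fs\_tree} and the dictionary layout are assumptions for illustration, not \dtc's internal representation.

```python
import os


def read_fs_tree(path: str) -> dict:
    """Read a /proc/device-tree style directory into nested dictionaries."""
    node = {"properties": {}, "children": {}}
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            # A directory is a subnode.
            node["children"][entry] = read_fs_tree(full)
        else:
            # A file is a property; its contents are the raw value bytes.
            with open(full, "rb") as f:
                node["properties"][entry] = f.read()
    return node
```

On a running system that exposes its device tree, calling \texttt{read\_fs\_tree("/proc/device-tree")} would return the root node.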
In addition, \dtc can output the tree in one of three different formats:
\item blob (\texttt{dtb}), as in \S\ref{sec:format}. The most
straightforward use of \dtc is to compile from ``source'' to
``blob'' format.
\item source (\texttt{dts}), as in \S\ref{sec:dts}. If used with blob
input, this allows \dtc to act as a ``decompiler''.
\item assembler source (\texttt{asm}). \dtc can produce an assembler
file, which will assemble into a \texttt{.o} file containing the
device tree blob, with symbols giving the beginning of the blob and
its various subsections. This can then be linked directly into a
bootloader or firmware image.
For maximum applicability, \dtc can both read and write any of the
existing revisions of the blob format. When reading, \dtc takes the
version from the blob header, and when writing it takes a command line
option specifying the desired version. It automatically makes any
adjustments to the tree that are necessary for the specified
version. For example, formats before 0x10 require each node to have
an explicit \texttt{name} property. When \dtc creates such a blob, it
will automatically generate \texttt{name} properties from the unit names.
\subsection{Source format}
The ``source'' format for \dtc is a text description of the device
tree in a vaguely C-like form. Figure \ref{fig:dts} shows an
example. The file starts with \texttt{/memreserve/} directives, which
give address ranges to add to the output blob's memory reserve table;
then the device tree proper is described.
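A \texttt{/memreserve/} directive such as the one in Figure \ref{fig:dts} has to be turned into an (address, size) pair for the reserve table. The sketch below assumes the range is inclusive at both ends, so the size is \texttt{end - start + 1}; both that interpretation and the helper name \texttt{parse\_memreserve} are assumptions of this example rather than a description of \dtc's parser.

```python
import re

# Matches the range form used in the example source: start-end in hex.
_MEMRESERVE = re.compile(
    r"/memreserve/\s+(0x[0-9a-fA-F]+)\s*-\s*(0x[0-9a-fA-F]+)\s*;"
)


def parse_memreserve(line: str) -> tuple:
    """Return (address, size) for one directive, assuming an inclusive range."""
    match = _MEMRESERVE.fullmatch(line.strip())
    if match is None:
        raise ValueError("not a /memreserve/ directive: %r" % line)
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)
    if end < start:
        raise ValueError("range ends before it starts")
    return start, end - start + 1


address, size = parse_memreserve("/memreserve/ 0x20000000-0x21FFFFFF;")
print(hex(address), hex(size))
```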
Nodes of the tree are introduced with the node name, followed by a
\texttt{\{} ... \texttt{\};} block containing the node's properties
and subnodes. Properties are given as just {\emph{name} \texttt{=}
\emph{value}\texttt{;}}. The property values can be given in any
of three forms:
\item \emph{string} (for example, \texttt{"MyBoard"}). The property
value is the given string, including the terminating NUL. C-style
escapes (\verb+\t+, \verb+\n+, \verb+\0+ and so forth) are allowed.
\item \emph{cells} (for example, \texttt{<0 8000 f0000000>}). The
property value is made up of a list of 32-bit ``cells'', each given
as a hex value.
\item \emph{bytestring} (for example, \texttt{[1234abcdef]}). The
property value is given as a hex bytestring.
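The three value forms correspond to three simple encodings. The following sketch shows what bytes each form would produce, assuming big-endian cells; the function names are hypothetical and C-style escapes in strings are deliberately not handled.

```python
import struct


def encode_string(value: str) -> bytes:
    """String form: the characters plus a terminating NUL."""
    return value.encode("ascii") + b"\0"


def encode_cells(text: str) -> bytes:
    """Cells form: whitespace-separated hex numbers, 32 bits each."""
    cells = [int(token, 16) for token in text.split()]
    return struct.pack(">%dI" % len(cells), *cells)


def encode_bytestring(text: str) -> bytes:
    """Bytestring form: pairs of hex digits, one byte per pair."""
    return bytes.fromhex(text)


print(encode_string("MyBoard"))
print(encode_cells("0 8000 f0000000").hex())
print(encode_bytestring("1234abcdef").hex())
```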
Cell properties can also contain \emph{references}. Instead of a hex
number, the source can give an ampersand (\texttt{\&}) followed by the
full path to some node in the tree. For example, in Figure
\ref{fig:dts}, the \texttt{/chosen} node has an
\texttt{interrupt-controller} property referring to the interrupt
controller described by the node \texttt{/mpic@0x3fffdd08400}. In the
output tree, the value of the referenced node's phandle is included in
the property. If that node doesn't have an explicit phandle property,
\dtc will automatically create a unique phandle for it. This approach
makes it easy to create interrupt trees without having to explicitly
assign and remember phandles for the various interrupt controller nodes.
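Automatic phandle creation can be sketched as follows: nodes with an explicit phandle keep it, and every other referenced node receives the next unused value. How \dtc actually chooses the values is not specified here, so the numbering scheme and the name \texttt{assign\_phandles} are assumptions of this example.

```python
def assign_phandles(explicit: dict, referenced_paths: list) -> dict:
    """Map node path -> phandle, creating values for referenced nodes."""
    phandles = dict(explicit)
    used = set(phandles.values())
    next_free = 1
    for path in referenced_paths:
        if path in phandles:
            continue  # already has a phandle, explicit or just assigned
        while next_free in used:
            next_free += 1
        phandles[path] = next_free
        used.add(next_free)
    return phandles


# One node has an explicit phandle; the referenced controller gets a new one.
print(assign_phandles({"/cpus": 1}, ["/mpic@0x3fffdd08400"]))
```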
The \dtc source can also include ``labels'', which are placed on a
particular node or property. For example, Figure \ref{fig:dts} has a
label ``\texttt{memreg}'' on the \texttt{reg} property of the node
\texttt{/memory@0}. When using assembler output, corresponding labels
in the output are generated, which will assemble into symbols
addressing the part of the blob with the node or property in question.
This is useful for the common case where an embedded board has an
essentially fixed device tree with a few variable properties, such as
the size of memory. The bootloader for such a board can have a device
tree linked in, including a symbol referring to the right place in the
blob to update the parameter with the correct value determined at
\subsection{Tree checking}
Between reading in the device tree and writing it out in the new
format, \dtc performs a number of checks on the tree:
\item \emph{syntactic structure}: \dtc checks that node and property
names contain only allowed characters and meet length restrictions.
It checks that a node does not have multiple properties or subnodes
with the same name.
\item \emph{semantic structure}: In some cases, \dtc checks that
properties whose contents are defined by convention have appropriate
values. For example, it checks that \texttt{reg} properties have a
length that makes sense given the address forms specified by the
\texttt{\#address-cells} and \texttt{\#size-cells} properties. It
checks that properties such as \texttt{interrupt-parent} contain a
valid phandle.
\item \emph{Linux requirements}: \dtc checks that the device tree
contains those nodes and properties that are required by the Linux
kernel to boot correctly.
These checks are useful to catch simple problems with the device tree,
rather than having to debug the results on an embedded kernel. With
the blob input mode, \dtc can also be used for diagnosing problems with
an existing blob.
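As an example of a semantic check, the length of a \texttt{reg} property should be a whole number of (address, size) entries, where the entry width comes from the \texttt{\#address-cells} and \texttt{\#size-cells} values of the parent node. The sketch below is one plausible form of such a check, not \dtc's implementation; \texttt{reg\_length\_ok} is a hypothetical name.

```python
def reg_length_ok(reg: bytes, address_cells: int, size_cells: int) -> bool:
    """Check that reg holds a whole number of (address, size) entries."""
    entry_bytes = 4 * (address_cells + size_cells)  # each cell is 32 bits
    if entry_bytes == 0:
        return len(reg) == 0
    return len(reg) % entry_bytes == 0


# /memory@0 in the example source: four cells under a parent with 2 + 2 cells.
print(reg_length_ok(bytes(16), address_cells=2, size_cells=2))
# A three-cell reg under the same parent would be flagged.
print(reg_length_ok(bytes(12), address_cells=2, size_cells=2))
```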
\section{Future Work}
\subsection{Board ports}
The flattened device tree has always been the only supported way to
boot a \texttt{ppc64} kernel on an embedded system. With the merge of
\texttt{ppc32} and \texttt{ppc64} code it has also become the only
supported way to boot any merged \texttt{powerpc} kernel, 32-bit or
64-bit. In fact, the old \texttt{ppc} architecture exists mainly
to support the old ppc32 embedded ports that have not been migrated
to the flattened device tree approach. We plan to remove the
\texttt{ppc} architecture eventually, which will mean porting all the
various embedded boards to use the flattened device tree.
\subsection{\dtc features}
While it is already quite usable, there are a number of extra features
that \dtc could include to make creating device trees more convenient:
\item \emph{better tree checking}: Although \dtc already performs a
number of checks on the device tree, they are rather haphazard. In
many cases \dtc will give up after detecting a minor error early and
won't pick up more interesting errors later on. There is a
\texttt{-f} parameter that forces \dtc to generate an output tree
even if there are errors. At present, this needs to be used more
often than one might hope, because \dtc is bad at deciding which
errors should really be fatal, and which rate mere warnings.
\item \emph{binary include}: Occasionally, it is useful for the device
tree to incorporate as a property a block of binary data for some
board-specific purpose. For example, many of Apple's device trees
incorporate bytecode drivers for certain platform devices. \dtc's
source format ought to allow this by letting a property's value be
read directly from a binary file.
\item \emph{macros}: it might be useful for \dtc to implement some
sort of macros so that a tree containing a number of similar devices
(for example, multiple identical ethernet controllers or PCI buses)
can be written more quickly. At present, this can be accomplished
in part by running the source file through CPP before compiling with
\dtc. It's not clear whether ``native'' support for macros would be
more useful.
\section*{About the authors}
David Gibson has been a member of the IBM Linux Technology Center,
working from Canberra, Australia, since 2001. Recently he has worked
on Linux hugepage support and performance counter support for ppc64,
as well as the device tree compiler. In the past, he has worked on
bringup for various ppc and ppc64 embedded systems, the orinoco
wireless driver, ramfs, and a userspace checkpointing system
Benjamin Herrenschmidt was a MacOS developer for about 10 years, but
ultimately saw the light and installed Linux on his Apple PowerPC
machine. After writing a bootloader, BootX, for it in 1998, he
started contributing to the PowerPC Linux port in various areas,
mostly around the support for Apple machines. He became official
PowerMac maintainer in 2001. In 2003, he joined the IBM Linux
Technology Center in Canberra, Australia, where he ported the 64-bit
PowerPC kernel to Apple G5 machines and the Maple embedded board,
among other things. He's a member of the ppc64 development ``team''
and one of his current goals is to make the integration of embedded
platforms smoother and more maintainable than in the 32-bit PowerPC
\section*{Legal Statement}
This work represents the views of the authors and does not necessarily
represent the view of IBM.
IBM, \ppc, \ppc Architecture, POWER5, pSeries and iSeries are
trademarks or registered trademarks of International Business Machines
Corporation in the United States and/or other countries.
Apple and Power Macintosh are registered trademarks of Apple
Computer Inc. in the United States, other countries, or both.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service
marks of others.