blob: b05b077adf7c72278b2b52244b092246d85a9fbf [file] [log] [blame]
Android QEMU FAST PIPES
=======================
Introduction:
-------------
The Android emulator implements a special virtual device used to provide
_very_ fast communication channels between the guest system and the
emulator itself.
From the guest, usage is simply as follows:
1/ Open the /dev/qemu_pipe device for read+write
NOTE: Starting with Linux 3.10, the device was renamed as
/dev/goldfish_pipe but behaves exactly in the same way.
2/ Write a zero-terminated string describing which service you want to
connect.
3/ Simply use read() and write() to communicate with the service.
In other words:
fd = open("/dev/qemu_pipe", O_RDWR);
const char* pipeName = "<pipename>";
ret = write(fd, pipeName, strlen(pipeName)+1);
if (ret < 0) {
// error
}
... ready to go
Where <pipename> is the name of a specific emulator service you want to use.
Supported service names are listed later in this document.
Implementation details:
-----------------------
In the emulator source tree:
./hw/android/goldfish/pipe.c implements the virtual driver.
./hw/android/goldfish/pipe.h provides the interface that must be
implemented by any emulator pipe service.
./android/hw-pipe-net.c contains the implementation of the network pipe
services (i.e. 'tcp' and 'unix'). See below for details.
In the kernel source tree:
drivers/misc/qemupipe/qemu_pipe.c contains the driver source code
that will be accessible as /dev/qemu_pipe within the guest.
Device / Driver Protocol details:
---------------------------------
The device and driver use an I/O memory page and an IRQ to communicate.
- The driver writes to various I/O registers to send commands to the
device.
- The device raises an IRQ to instruct the driver that certain events
occured.
- The driver reads I/O registers to get the status of its latest command,
or the list of events that occured in case of interrupt.
Each opened file descriptor to /dev/qemu_pipe in the guest corresponds to a
32-bit 'channel' value allocated by the driver.
The following is a description of the various commands sent by the driver
to the device. Variable names beginning with REG_ correspond to 32-bit I/O
registers:
0/ Channel and address values:
Each communication channel is identified by a unique non-zero value
which is either 32-bit or 64-bit, depending on the guest CPU
architecture.
The channel value sent from the kernel to the emulator with:
void write_channel(channel) {
#if 64BIT_GUEST_CPU
REG_CHANNEL_HIGH = (channel >> 32);
#endif
REG_CHANNEL = (channel & 0xffffffffU);
}
Similarly, when passing a kernel address to the emulator:
void write_address(buffer_address) {
#if 64BIT_GUEST_CPU
REG_ADDRESS_HIGH = (buffer_address >> 32);
#endif
REG_ADDRESS = (buffer_address & 0xffffffffU);
}
1/ Creating a new channel:
Used by the driver to indicate that the guest just opened /dev/qemu_pipe
that will be identified by a named '<channel>':
write_channel(<channel>)
REG_CMD = CMD_OPEN
IMPORTANT: <channel> should never be 0
2/ Closing a channel:
Used by the driver to indicate that the guest called 'close' on the
channel file descriptor.
write_channel(<channel>)
REG_CMD = CMD_CLOSE
3/ Writing data to the channel:
Corresponds to when the guest does a write() or writev() on the
channel's file descriptor. This command is used to send a single
memory buffer:
write_channel(<channel>)
write_address(<buffer-address>)
REG_SIZE = <buffer-size>
REG_CMD = CMD_WRITE_BUFFER
status = REG_STATUS
NOTE: The <buffer-address> is the *GUEST* buffer address, not the
physical/kernel one.
IMPORTANT: The buffer sent through this command SHALL ALWAYS be entirely
contained inside a single page of guest memory. This is
enforced to simplify both the driver and the device.
When a write() spans several pages of guest memory, the
driver will issue several CMD_WRITE_BUFFER commands in
succession, transparently to the client.
The value returned by REG_STATUS should be:
> 0 The number of bytes that were written to the pipe
0 To indicate end-of-stream status
< 0 A negative error code (see below).
On important error code is PIPE_ERROR_AGAIN, used to indicate that
writes can't be performed yet. See CMD_WAKE_ON_WRITE for more.
4/ Reading data from the channel:
Corresponds to when the guest does a read() or readv() on the
channel's file descriptor.
write_channel(<channel>)
write_address(<buffer-address>)
REG_SIZE = <buffer-size>
REG_CMD = CMD_READ_BUFFER
status = REG_STATUS
Same restrictions on buffer addresses/lengths and same set of error
codes.
5/ Waiting for write ability:
If CMD_WRITE_BUFFER returns PIPE_ERROR_AGAIN, and the file descriptor
is not in non-blocking mode, the driver must put the client task on a
wait queue until the pipe service can accept data again.
Before this, the driver will do:
write_channel(<channel>)
REG_CMD = CMD_WAKE_ON_WRITE
To indicate to the virtual device that it is waiting and should be woken
up when the pipe becomes writable again. How this is done is explained
later.
6/ Waiting for read ability:
This is the same than CMD_WAKE_ON_WRITE, but for readability instead.
write_channel(<channel>)
REG_CMD = CMD_WAKE_ON_READ
7/ Polling for write-able/read-able state:
The following command is used by the driver to implement the select(),
poll() and epoll() system calls where a pipe channel is involved.
write_channel(<channel>)
REG_CMD = CMD_POLL
mask = REG_STATUS
The mask value returned by REG_STATUS is a mix of bit-flags for
which events are available / have occured since the last call.
See PIPE_POLL_READ / PIPE_POLL_WRITE / PIPE_POLL_CLOSED.
8/ Signaling events to the driver:
The device can signal events to the driver by raising its IRQ.
The driver's interrupt handler will then have to read a list of
(channel,mask) pairs, terminated by a single 0 value for the channel.
In other words, the driver's interrupt handler will do:
for (;;) {
channel = REG_CHANNEL
if (channel == 0) // END OF LIST
break;
mask = REG_WAKES // BIT FLAGS OF EVENTS
... process events
}
The events reported through this list are simply:
PIPE_WAKE_READ :: the channel is now readable.
PIPE_WAKE_WRITE :: the channel is now writable.
PIPE_WAKE_CLOSED :: the pipe service closed the connection.
The PIPE_WAKE_READ and PIPE_WAKE_WRITE are only reported for a given
channel if CMD_WAKE_ON_READ or CMD_WAKE_ON_WRITE (respectively) were
issued for it.
PIPE_WAKE_CLOSED can be signaled at any time.
9/ Faster read/writes through parameter blocks:
Recent Goldfish kernels implement a faster way to perform reads and writes
that perform a single I/O write per operation (which is useful when
emulating x86 system through KVM or HAX).
This uses the following structure known to both the virtual device and
the kernel, defined in $QEMU/hw/android/goldfish/pipe.h:
For 32-bit guest CPUs:
struct access_params {
uint32_t channel;
uint32_t size;
uint32_t address;
uint32_t cmd;
uint32_t result;
/* reserved for future extension */
uint32_t flags;
};
And the 64-bit variant:
struct access_params_64 {
uint64_t channel;
uint32_t size;
uint64_t address;
uint32_t cmd;
uint32_t result;
/* reserved for future extension */
uint32_t flags;
};
This is simply a way to pack several parameters into a single structure.
Preliminary, e.g. at boot time, the kernel will allocate one such structure
and pass its physical address with:
PARAMS_ADDR_LOW = (params & 0xffffffff);
PARAMS_ADDR_HIGH = (params >> 32) & 0xffffffff;
Then for each operation, it will do something like:
params.channel = channel;
params.address = buffer;
params.size = buffer_size;
params.cmd = CMD_WRITE_BUFFER (or CMD_READ_BUFFER)
REG_ACCESS_PARAMS = <any>
status = params.status
The write to REG_ACCESS_PARAMS will trigger the operation, i.e. QEMU will
read the content of the params block, use its fields to perform the
operation then write back the return value into params.status.
10/ v2 pipe: faster read/writes through command buffers and buffer lists
Batch access_params still only perform one buffer transfer at a time. This is OK
for applications that only use one page of memory in each transfer, but for
many latency/throughput-sensitive applications like OpenGL and ADB push/pull,
the one buffer/page per operation is not sufficient.
Ideally, if the guest wants to transfer a buffer, it should be done in
something as resembling one step as possible.
But, what can make this difficult is that guest pages tend to fragment all
over the place and not be physically contiguous.
The v2 Goldfish pipe driver and device change the register set and
device/pipe structs to allow for more buffers to be transferred per loop of
goldfish_pipe_read_write, increasing performance for all pipe users
in throughput-limited situations.
There are two key new structures to consider.
1. The v2 pipe adds a struct goldfish_pipe_command that represents the delivery of
multiple buffers in one I/O transaction:
/* A per-pipe command structure, shared with the host */
struct goldfish_pipe_command {
s32 cmd; /* PipeCmdCode, guest -> host */
s32 id; /* pipe id, guest -> host */
s32 status; /* command execution status, host -> guest */
s32 reserved; /* to pad to 64-bit boundary */
union {
/* Parameters for PIPE_CMD_{READ,WRITE} */
struct {
u64 ptrs[MAX_BUFFERS_PER_COMMAND]; /* buffer pointers, guest -> host */
u32 sizes[MAX_BUFFERS_PER_COMMAND]; /* buffer sizes, guest -> host */
u32 buffers_count; /* number of buffers, guest -> host */
s32 consumed_size; /* number of consumed bytes, host -> guest */
} rw_params;
};
};
For each pipe fd, there is a corresponding goldfish_pipe_command structure
that can hold up to MAX_BUFFERS_PER_COMMAND (concretely, order of 10^2 so
far) buffers. The buffers comprising the data on the guest are tracked in
rw_params. Some effort is taken to ensure that the list is short as
possible (i.e., merge physically contiguous pages).
2. Previously, for host-triggered transfers, we walked through a linked
list of pipes that had pending actions (PIPE_WAKE_***), called the
"signaled pipes" list, and the interrupt handler would process roughly 1-3
such pipes each time a transfer was occurring. This is not ideal because a)
we are doing work in the interrupt handler and b) there can be a lot of
transitions between running the kernel driver in the guest versus running
the virtual device in the host (transitions are slow).
Hence, pipe v2 keeps a set of pipes to process from host IRQ raises:
/* Device-level set of buffers shared with the host */
struct goldfish_pipe_dev_buffers {
struct open_command_param open_command_params;
struct signalled_pipe_buffer signalled_pipe_buffers[MAX_SIGNALLED_PIPES];
};
and each interrupt handler can then process up to MAX_SIGNALLED_PIPES at a
time. In addition, the work of waking up the signaled pipes is left to a
bottom-half handler and not done in interrupt context.
11/ goldfish_dma: directly accessing memory on guest from host
Sometimes, for applications that need both very high throughput and very
low latency, such as video playback, it can be useful to skip host/guest
transitions and gathering discontiguous buffers and write directly memory
that is also visible on the host machine.
To do this, goldfish_dma allows any pipe fd to have ioctls to allocate
a contiguous physical region in kernel space, and to mmap() from the guest
userspace into that physical region. These regions can also be shared
across processes, such as with gralloc.
The goldfish_dma kernel driver adds struct goldfish_dma_context to
all struct goldfish_pipe instances:
struct goldfish_dma_context {
struct kref kref; /* kref to safely free dma region */
struct device* pdev_dev; /* pointer to feed to dma_***_coherent */
void* dma_vaddr; /* kernel vaddr of dma region */
dma_addr_t dma_bus_addr; /* kernel dma_addr_t of dma region */
u64 dma_size; /* size of dma region */
u64 phys_begin; /* paddr of dma region */
unsigned long pfn; /* pfn of dma region */
u64 phys_end; /* paddr of dma region + dma_size */
bool locked; /* marks whether the region is currently in use */
unsigned long refcount; /* For debugging purposes only. */
struct goldfish_pipe_command* command_buffer; /* For safe freeing of DMA regions,
we need another pipe command buffer that lives longer than the pipe instance */
};
Interface (guest side):
The guest user calls goldfish_dma_alloc (ioctls) and then mmap() on a
goldfish pipe fd, which means that it wants high-speed access to
host-visible memory.
The guest can then write into the pointer returned by mmap(), and these
writes become immediately visible on the host without BQL or context
switching.
The main data structure tracking state is struct goldfish_dma_context,
which is included as an extra pointer field in struct goldfish_pipe. Each
such context is associated with possibly one physical address and size
describing the allocated DMA region, and only one allocation is allowed for
each pipe fd. Further allocations require more open()'s of pipe fd's.
dma_alloc_coherent() is used to obtain contiguous physical memory regions,
and we allocate and interact with this region on both guest and host
through the following ioctls:
- LOCK: Lock the DMA region for data access.
- UNLOCK: Unlock the DMA region. This may also be done from the host
through WAKE_ON_UNLOCK_DMA.
- CREATE_REGION: initialize size info for a DMA region.
- GETOFF: send physical address to guest driver.
- (UN)MAPHOST: uses goldfish_pipe_cmd to tell the host to (un)map to the
guest physical address associated with the current dma context. This
makes the physically contiguous memory (in)visible to the host.
Guest userspace obtains a pointer to the DMA memory using mmap(),
which also lazily allocates the memory with dma_alloc_coherent().
(On last pipe close(), the region is freed).
The mmaped() region can handle very high bandwidth transfers, and pipe
operations can be used at the same time to handle synchronization and command
communication.
The virtual device has these hooks into the kernel driver:
1. UNLOCK_DMA: This unlocks the DMA region from the host.
2. MAP/UNMAP_HOST: This is a pipe command used to tell the host machine
to obtain a host void* pointing at a particular guest RAM physical address.
It is triggered when a new DMA buffer is allocated in the guest kernel
with the ioctl(), and makes it so the host can also see into the same region.
The host will then call cpu_physical_memory_(un)map and track all currently
allocated regions in a global data structure (DmaMap). If we take
snapshots, we will need to remap those regions, something also handled by
DmaMap.
2. UNLOCK_DMA: This pipe command signals the guest kernel that the
host is done processing the DMA region. In most simple use cases, the guest
will first lock its DMA region and write to it, assuming that
the unlock will be triggered by the host after the processing is done.
UNLOCK_DMA is how the host triggers this unlocking.
Available services:
-------------------
tcp:<port>
Open a TCP socket to a given localhost port. This provides a very fast
pass-through that doesn't depend on the very slow internal emulator
NAT router. Note that you can only use the file descriptor with read()
and write() though, send() and recv() will return an ENOTSOCK error,
as well as any socket ioctl().
For security reasons, it is not possible to connect to non-localhost
ports.
unix:<path>
Open a Unix-domain socket on the host.
opengles
Connects to the OpenGL ES emulation process. For now, the implementation
is equivalent to tcp:22468, but this may change in the future.
qemud
Connects to the QEMUD service inside the emulator. This replaces the
connection that was performed through /dev/ttyS1 in older Android platform
releases. See $QEMU/docs/ANDROID-QEMUD.TXT for details.