| Android QEMU FAST PIPES |
| ======================= |
| |
| Introduction: |
| ------------- |
| |
| The Android emulator implements a special virtual device used to provide |
| _very_ fast communication channels between the guest system and the |
| emulator itself. |
| |
| From the guest, usage is simply as follows: |
| |
| 1/ Open the /dev/qemu_pipe device for read+write |
| |
| NOTE: Starting with Linux 3.10, the device was renamed to |
| /dev/goldfish_pipe but behaves in exactly the same way. |
| |
| 2/ Write a zero-terminated string naming the service you want to |
| connect to. |
| |
| 3/ Simply use read() and write() to communicate with the service. |
| |
| In other words: |
| |
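| fd = open("/dev/qemu_pipe", O_RDWR); |
| if (fd < 0) { |
| // error: the pipe device could not be opened |
| } |
| const char* pipeName = "<pipename>"; |
| // the terminating zero is part of the message, hence the +1 |
| ret = write(fd, pipeName, strlen(pipeName)+1); |
| if (ret < 0) { |
| // error |
| } |
| // ... ready to go |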
| |
| Where <pipename> is the name of a specific emulator service you want to use. |
| Supported service names are listed later in this document. |
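
The connection steps above can be wrapped in a small helper. This is only a sketch: the function name `pipe_connect` and the choice to pass the device path as a parameter are assumptions made for illustration, not part of the emulator sources. It retries on partial writes and on EINTR, which the short snippet above does not do.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open `device` and connect it to `service`.
 * Returns a connected file descriptor, or -1 on error (errno is set). */
int pipe_connect(const char* device, const char* service)
{
    int fd = open(device, O_RDWR);
    if (fd < 0)
        return -1;

    /* The service name is sent together with its terminating zero byte. */
    size_t len = strlen(service) + 1;
    size_t done = 0;
    while (done < len) {
        ssize_t ret = write(fd, service + done, len - done);
        if (ret < 0 && errno == EINTR)
            continue;                 /* interrupted by a signal, retry */
        if (ret <= 0) {
            int saved = (ret == 0) ? EIO : errno;
            close(fd);
            errno = saved;
            return -1;
        }
        done += (size_t)ret;
    }
    return fd;
}
```

A caller that must work on both old and new kernels could try "/dev/goldfish_pipe" first and fall back to "/dev/qemu_pipe" if the first open fails.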
| |
| |
| Implementation details: |
| ----------------------- |
| |
| In the emulator source tree: |
| |
| ./hw/android/goldfish/pipe.c implements the virtual driver. |
| |
| ./hw/android/goldfish/pipe.h provides the interface that must be |
| implemented by any emulator pipe service. |
| |
| ./android/hw-pipe-net.c contains the implementation of the network pipe |
| services (i.e. 'tcp' and 'unix'). See below for details. |
| |
| In the kernel source tree: |
| |
| drivers/misc/qemupipe/qemu_pipe.c contains the driver source code |
| that will be accessible as /dev/qemu_pipe within the guest. |
| |
| |
| Device / Driver Protocol details: |
| --------------------------------- |
| |
| The device and driver use an I/O memory page and an IRQ to communicate. |
| |
| - The driver writes to various I/O registers to send commands to the |
| device. |
| |
| - The device raises an IRQ to notify the driver that certain events |
| occurred. |
| |
| - The driver reads I/O registers to get the status of its latest command, |
| or the list of events that occurred when an interrupt was raised. |
| |
| Each open file descriptor to /dev/qemu_pipe in the guest corresponds to a |
| 'channel' value allocated by the driver (32-bit or 64-bit, see below). |
| |
| The following is a description of the various commands sent by the driver |
| to the device. Variable names beginning with REG_ correspond to 32-bit I/O |
| registers: |
| |
| 0/ Channel and address values: |
| |
| Each communication channel is identified by a unique non-zero value |
| which is either 32-bit or 64-bit, depending on the guest CPU |
| architecture. |
| |
| The channel value is sent from the kernel to the emulator with: |
| |
| void write_channel(channel) { |
| #if 64BIT_GUEST_CPU |
| REG_CHANNEL_HIGH = (channel >> 32); |
| #endif |
| REG_CHANNEL = (channel & 0xffffffffU); |
| } |
| |
| Similarly, when passing a kernel address to the emulator: |
| |
| void write_address(buffer_address) { |
| #if 64BIT_GUEST_CPU |
| REG_ADDRESS_HIGH = (buffer_address >> 32); |
| #endif |
| REG_ADDRESS = (buffer_address & 0xffffffffU); |
| } |
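
The high/low split used by write_channel() and write_address() can be checked in isolation. The sketch below replaces the real I/O registers with a plain struct (`struct pipe_regs`, a hypothetical stand-in) and shows how the device side could reassemble the 64-bit value.

```c
#include <stdint.h>

/* Hypothetical register bank standing in for the device's I/O page. */
struct pipe_regs {
    uint32_t channel;       /* plays the role of REG_CHANNEL */
    uint32_t channel_high;  /* plays the role of REG_CHANNEL_HIGH */
};

/* Split a 64-bit channel value across the two 32-bit registers,
 * writing the high half first as write_channel() does. */
void write_channel64(struct pipe_regs* regs, uint64_t channel)
{
    regs->channel_high = (uint32_t)(channel >> 32);
    regs->channel      = (uint32_t)(channel & 0xffffffffU);
}

/* What the device side would do to recover the full value. */
uint64_t read_channel64(const struct pipe_regs* regs)
{
    return ((uint64_t)regs->channel_high << 32) | regs->channel;
}
```

The same split applies to buffer addresses with REG_ADDRESS_HIGH and REG_ADDRESS.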
| |
| 1/ Creating a new channel: |
| |
| Used by the driver to indicate that the guest just opened /dev/qemu_pipe |
| and that the new pipe will be identified by the value '<channel>': |
| |
| write_channel(<channel>) |
| REG_CMD = CMD_OPEN |
| |
| IMPORTANT: <channel> should never be 0 |
| |
| 2/ Closing a channel: |
| |
| Used by the driver to indicate that the guest called 'close' on the |
| channel file descriptor. |
| |
| write_channel(<channel>) |
| REG_CMD = CMD_CLOSE |
| |
| 3/ Writing data to the channel: |
| |
| Corresponds to when the guest does a write() or writev() on the |
| channel's file descriptor. This command is used to send a single |
| memory buffer: |
| |
| write_channel(<channel>) |
| write_address(<buffer-address>) |
| REG_SIZE = <buffer-size> |
| REG_CMD = CMD_WRITE_BUFFER |
| |
| status = REG_STATUS |
| |
| NOTE: The <buffer-address> is the *GUEST* buffer address, not the |
| physical/kernel one. |
| |
| IMPORTANT: The buffer sent through this command SHALL ALWAYS be entirely |
| contained inside a single page of guest memory. This is |
| enforced to simplify both the driver and the device. |
| |
| When a write() spans several pages of guest memory, the |
| driver will issue several CMD_WRITE_BUFFER commands in |
| succession, transparently to the client. |
| |
| The value returned by REG_STATUS should be: |
| |
| > 0 The number of bytes that were written to the pipe |
| 0 To indicate end-of-stream status |
| < 0 A negative error code (see below). |
| |
| One important error code is PIPE_ERROR_AGAIN, used to indicate that |
| writes can't be performed yet. See CMD_WAKE_ON_WRITE for more. |
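
The single-page rule can be illustrated with a little arithmetic. The sketch below is not taken from the driver: the function names are hypothetical, and it assumes a power-of-two page size and that every chunk is fully accepted by the device.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest prefix of [addr, addr+len) that stays inside the page
 * containing addr. page_size must be a power of two. */
size_t chunk_in_page(uint64_t addr, size_t len, size_t page_size)
{
    size_t room = page_size - (size_t)(addr & (page_size - 1));
    return len < room ? len : room;
}

/* Number of CMD_WRITE_BUFFER commands needed for a write(), assuming
 * the device accepts every chunk in full. */
unsigned count_write_commands(uint64_t addr, size_t len, size_t page_size)
{
    unsigned n = 0;
    while (len > 0) {
        size_t chunk = chunk_in_page(addr, len, page_size);
        addr += chunk;
        len  -= chunk;
        n++;
    }
    return n;
}
```

For example, with 4096-byte pages a 100-byte buffer starting at address 4090 is sent as one 6-byte command followed by one 94-byte command.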
| |
| 4/ Reading data from the channel: |
| |
| Corresponds to when the guest does a read() or readv() on the |
| channel's file descriptor. |
| |
| write_channel(<channel>) |
| write_address(<buffer-address>) |
| REG_SIZE = <buffer-size> |
| REG_CMD = CMD_READ_BUFFER |
| |
| status = REG_STATUS |
| |
| Same restrictions on buffer addresses/lengths and same set of error |
| codes. |
| |
| 5/ Waiting for write ability: |
| |
| If CMD_WRITE_BUFFER returns PIPE_ERROR_AGAIN, and the file descriptor |
| is not in non-blocking mode, the driver must put the client task on a |
| wait queue until the pipe service can accept data again. |
| |
| Before this, the driver will do: |
| |
| write_channel(<channel>) |
| REG_CMD = CMD_WAKE_ON_WRITE |
| |
| To indicate to the virtual device that it is waiting and should be woken |
| up when the pipe becomes writable again. How this is done is explained |
| later. |
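
The retry logic described above can be sketched as follows. Everything here is illustrative: the value of PIPE_ERROR_AGAIN is a placeholder, and `struct pipe_ops`, `struct fake_dev` and the `fake_*` functions are hypothetical stand-ins for register access and the kernel wait queue, included only so the loop can be exercised outside a kernel.

```c
#include <stdbool.h>
#include <stddef.h>

#define PIPE_ERROR_AGAIN (-2)  /* placeholder value, assumed for this sketch */

/* Hypothetical callbacks standing in for register access and sleeping. */
struct pipe_ops {
    int  (*write_buffer)(void* dev, const char* buf, size_t len); /* CMD_WRITE_BUFFER */
    void (*wake_on_write)(void* dev);                             /* CMD_WAKE_ON_WRITE */
    void (*wait_for_wake)(void* dev);          /* sleep until PIPE_WAKE_WRITE arrives */
};

/* Issue one write; block and retry on PIPE_ERROR_AGAIN unless the
 * file descriptor is in non-blocking mode. */
int pipe_write_once(const struct pipe_ops* ops, void* dev,
                    const char* buf, size_t len, bool nonblocking)
{
    for (;;) {
        int status = ops->write_buffer(dev, buf, len);
        if (status != PIPE_ERROR_AGAIN || nonblocking)
            return status;
        ops->wake_on_write(dev);   /* ask the device to signal us */
        ops->wait_for_wake(dev);   /* then sleep until it does */
    }
}

/* Fake device: refuses writes for `busy_rounds` wake-ups, then accepts. */
struct fake_dev { int busy_rounds; int wake_requests; };

static int fake_write(void* dev, const char* buf, size_t len)
{
    (void)buf;
    return ((struct fake_dev*)dev)->busy_rounds > 0 ? PIPE_ERROR_AGAIN : (int)len;
}
static void fake_wake_on_write(void* dev) { ((struct fake_dev*)dev)->wake_requests++; }
static void fake_wait(void* dev)          { ((struct fake_dev*)dev)->busy_rounds--; }

static const struct pipe_ops fake_ops = { fake_write, fake_wake_on_write, fake_wait };
```

In non-blocking mode the function returns PIPE_ERROR_AGAIN immediately without issuing CMD_WAKE_ON_WRITE, which a driver would typically translate into an EAGAIN error for the client.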
| |
| 6/ Waiting for read ability: |
| |
| This is the same as CMD_WAKE_ON_WRITE, but for read ability instead. |
| |
| write_channel(<channel>) |
| REG_CMD = CMD_WAKE_ON_READ |
| |
| 7/ Polling for write-able/read-able state: |
| |
| The following command is used by the driver to implement the select(), |
| poll() and epoll() system calls where a pipe channel is involved. |
| |
| write_channel(<channel>) |
| REG_CMD = CMD_POLL |
| mask = REG_STATUS |
| |
| The mask value returned by REG_STATUS is a combination of bit flags for |
| the events that are available or have occurred since the last call. |
| See PIPE_POLL_READ / PIPE_POLL_WRITE / PIPE_POLL_CLOSED. |
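
As an illustration of how such a mask could be mapped to poll() events, here is a minimal sketch. The bit values chosen for the PIPE_POLL_* flags are assumptions for this example, and the userspace constants from <poll.h> are used in place of their kernel equivalents.

```c
#include <poll.h>

/* Assumed bit values; the real ones are defined by the device headers. */
#define PIPE_POLL_READ   (1 << 0)
#define PIPE_POLL_WRITE  (1 << 1)
#define PIPE_POLL_CLOSED (1 << 2)

/* Translate a CMD_POLL status mask into poll() event bits. */
short pipe_mask_to_poll_events(unsigned mask)
{
    short events = 0;
    if (mask & PIPE_POLL_READ)
        events |= POLLIN;    /* data can be read */
    if (mask & PIPE_POLL_WRITE)
        events |= POLLOUT;   /* data can be written */
    if (mask & PIPE_POLL_CLOSED)
        events |= POLLHUP;   /* the service closed the connection */
    return events;
}
```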
| |
| 8/ Signaling events to the driver: |
| |
| The device can signal events to the driver by raising its IRQ. |
| The driver's interrupt handler will then have to read a list of |
| (channel,mask) pairs, terminated by a single 0 value for the channel. |
| |
| In other words, the driver's interrupt handler will do: |
| |
| for (;;) { |
| channel = REG_CHANNEL |
| if (channel == 0) // END OF LIST |
| break; |
| |
| mask = REG_WAKES // BIT FLAGS OF EVENTS |
| ... process events |
| } |
| |
| The events reported through this list are simply: |
| |
| PIPE_WAKE_READ :: the channel is now readable. |
| PIPE_WAKE_WRITE :: the channel is now writable. |
| PIPE_WAKE_CLOSED :: the pipe service closed the connection. |
| |
| The PIPE_WAKE_READ and PIPE_WAKE_WRITE are only reported for a given |
| channel if CMD_WAKE_ON_READ or CMD_WAKE_ON_WRITE (respectively) were |
| issued for it. |
| |
| PIPE_WAKE_CLOSED can be signaled at any time. |
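
The interrupt handler loop above can be tried against a simulated device. In this sketch the PIPE_WAKE_* bit values, `struct fake_device` and the two register-read helpers are assumptions made for illustration; the fake device simply serves a zero-terminated array of (channel, mask) pairs.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values; the real ones are defined by the device headers. */
#define PIPE_WAKE_CLOSED (1 << 0)
#define PIPE_WAKE_READ   (1 << 1)
#define PIPE_WAKE_WRITE  (1 << 2)

struct wake_event { uint32_t channel; uint32_t mask; };

/* Simulated device: pending events, terminated by channel == 0. */
struct fake_device {
    const struct wake_event* pending;
    size_t pos;
};

/* Reading REG_CHANNEL yields the next signaled channel (0 at end of list). */
static uint32_t read_reg_channel(struct fake_device* d)
{
    return d->pending[d->pos].channel;
}

/* Reading REG_WAKES yields that channel's mask and moves to the next entry. */
static uint32_t read_reg_wakes(struct fake_device* d)
{
    return d->pending[d->pos++].mask;
}

/* Drain the whole list; stores at most `max` events into out[] and
 * returns how many were stored. Extra events are read but dropped. */
size_t drain_wake_events(struct fake_device* d, struct wake_event* out, size_t max)
{
    size_t n = 0;
    for (;;) {
        uint32_t channel = read_reg_channel(d);
        if (channel == 0)           /* end of list */
            break;
        uint32_t mask = read_reg_wakes(d);
        if (n < max) {
            out[n].channel = channel;
            out[n].mask = mask;
            n++;
        }
    }
    return n;
}
```

A real handler would wake the tasks waiting on each channel instead of copying the events into an array.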
| |
| |
| 9/ Faster read/writes through parameter blocks: |
| |
| Recent Goldfish kernels implement a faster way to perform reads and writes |
| that performs a single I/O write per operation (which is useful when |
| emulating an x86 system through KVM or HAX). |
| |
| This uses the following structure known to both the virtual device and |
| the kernel, defined in $QEMU/hw/android/goldfish/pipe.h: |
| |
| For 32-bit guest CPUs: |
| |
| struct access_params { |
| uint32_t channel; |
| uint32_t size; |
| uint32_t address; |
| uint32_t cmd; |
| uint32_t result; |
| /* reserved for future extension */ |
| uint32_t flags; |
| }; |
| |
| And the 64-bit variant: |
| |
| struct access_params_64 { |
| uint64_t channel; |
| uint32_t size; |
| uint64_t address; |
| uint32_t cmd; |
| uint32_t result; |
| /* reserved for future extension */ |
| uint32_t flags; |
| }; |
| |
| This is simply a way to pack several parameters into a single structure. |
| Beforehand, e.g. at boot time, the kernel will allocate one such structure |
| and pass its physical address with: |
| |
| PARAMS_ADDR_LOW = (params & 0xffffffff); |
| PARAMS_ADDR_HIGH = (params >> 32) & 0xffffffff; |
| |
| Then for each operation, it will do something like: |
| |
| params.channel = channel; |
| params.address = buffer; |
| params.size = buffer_size; |
| params.cmd = CMD_WRITE_BUFFER (or CMD_READ_BUFFER) |
| |
| REG_ACCESS_PARAMS = <any> |
| |
| status = params.result |
| |
| The write to REG_ACCESS_PARAMS will trigger the operation, i.e. QEMU will |
| read the content of the params block, use its fields to perform the |
| operation, then write the return value back into params.result. |
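
The round trip through the parameter block can be sketched with the 32-bit structure shown earlier, reading the return value from its `result` field. The command code values and the `fake_device_access` function are assumptions for this example; the fake device stands in for what QEMU does when REG_ACCESS_PARAMS is written, and simply accepts up to a fixed number of bytes.

```c
#include <stdint.h>

/* Assumed command codes, for illustration only. */
#define CMD_WRITE_BUFFER 4
#define CMD_READ_BUFFER  6

struct access_params {
    uint32_t channel;
    uint32_t size;
    uint32_t address;
    uint32_t cmd;
    uint32_t result;
    uint32_t flags;   /* reserved for future extension */
};

/* Stand-in for the device side: accept at most `capacity` bytes and
 * report the transferred count in the result field. */
static void fake_device_access(struct access_params* p, uint32_t capacity)
{
    p->result = p->size < capacity ? p->size : capacity;
}

/* Driver side: fill the shared block, trigger the device with a single
 * "register write", then read the status back from the block. */
int32_t pipe_access(struct access_params* p, uint32_t channel,
                    uint32_t address, uint32_t size, uint32_t cmd)
{
    p->channel = channel;
    p->address = address;
    p->size    = size;
    p->cmd     = cmd;

    fake_device_access(p, 4096);   /* corresponds to REG_ACCESS_PARAMS = <any> */

    return (int32_t)p->result;     /* negative values are error codes */
}
```

Compared with the register-by-register sequence, only the trigger is an I/O access; the parameters travel through ordinary memory.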
| |
| 10/ v2 pipe: faster read/writes through command buffers and buffer lists |
| |
| The access_params mechanism above still performs only one buffer transfer at |
| a time. This is OK for applications that only use one page of memory in each |
| transfer, but for many latency/throughput-sensitive applications like OpenGL |
| and ADB push/pull, one buffer/page per operation is not sufficient. |
| |
| Ideally, when the guest wants to transfer a buffer, the transfer should take |
| as close to a single step as possible. |
| |
| What makes this difficult is that guest pages tend to be fragmented and are |
| often not physically contiguous. |
| |
| The v2 Goldfish pipe driver and device change the register set and |
| device/pipe structs to allow for more buffers to be transferred per loop of |
| goldfish_pipe_read_write, increasing performance for all pipe users |
| in throughput-limited situations. |
| |
| There are two key new structures to consider. |
| |
| 1. The v2 pipe adds a struct goldfish_pipe_command that represents the delivery of |
| multiple buffers in one I/O transaction: |
| |
| /* A per-pipe command structure, shared with the host */ |
| struct goldfish_pipe_command { |
| s32 cmd; /* PipeCmdCode, guest -> host */ |
| s32 id; /* pipe id, guest -> host */ |
| s32 status; /* command execution status, host -> guest */ |
| s32 reserved; /* to pad to 64-bit boundary */ |
| union { |
| /* Parameters for PIPE_CMD_{READ,WRITE} */ |
| struct { |
| u64 ptrs[MAX_BUFFERS_PER_COMMAND]; /* buffer pointers, guest -> host */ |
| u32 sizes[MAX_BUFFERS_PER_COMMAND]; /* buffer sizes, guest -> host */ |
| u32 buffers_count; /* number of buffers, guest -> host */ |
| s32 consumed_size; /* number of consumed bytes, host -> guest */ |
| } rw_params; |
| }; |
| }; |
| |
| For each pipe fd, there is a corresponding goldfish_pipe_command structure |
| that can hold up to MAX_BUFFERS_PER_COMMAND buffers (on the order of 10^2 |
| so far). The buffers comprising the data on the guest are tracked in |
| rw_params. Some effort is taken to keep the list as short as possible |
| (e.g., by merging physically contiguous pages). |
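
Merging physically contiguous pages into one list entry might look like the sketch below. It is not the driver's code: `build_buffer_list` is a hypothetical name, the struct is a reduced version of rw_params using <stdint.h> types, and MAX_BUFFERS_PER_COMMAND is set to a tiny value purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BUFFERS_PER_COMMAND 4   /* tiny value, for illustration only */

struct rw_params {
    uint64_t ptrs[MAX_BUFFERS_PER_COMMAND];   /* buffer physical addresses */
    uint32_t sizes[MAX_BUFFERS_PER_COMMAND];  /* buffer sizes in bytes */
    uint32_t buffers_count;                   /* number of entries in use */
};

/* pages[i] is the physical address of the i-th page backing the buffer,
 * first_offset is the offset of the data inside pages[0], and len is the
 * total transfer size. Returns the number of bytes described by the list,
 * which is less than len if the list fills up first. */
size_t build_buffer_list(struct rw_params* rw, const uint64_t* pages,
                         size_t page_count, size_t first_offset,
                         size_t len, size_t page_size)
{
    size_t total = 0;
    rw->buffers_count = 0;

    for (size_t i = 0; i < page_count && total < len; i++) {
        size_t offset = (i == 0) ? first_offset : 0;
        size_t chunk = page_size - offset;
        if (chunk > len - total)
            chunk = len - total;
        uint64_t addr = pages[i] + offset;

        uint32_t n = rw->buffers_count;
        if (n > 0 && rw->ptrs[n - 1] + rw->sizes[n - 1] == addr) {
            /* Physically adjacent to the previous entry: extend it. */
            rw->sizes[n - 1] += (uint32_t)chunk;
        } else {
            if (n == MAX_BUFFERS_PER_COMMAND)
                break;              /* list full, caller issues another command */
            rw->ptrs[n] = addr;
            rw->sizes[n] = (uint32_t)chunk;
            rw->buffers_count = n + 1;
        }
        total += chunk;
    }
    return total;
}
```

With pages at 0x10000, 0x11000 and 0x30000, the first two are adjacent and collapse into a single entry, so a three-page transfer needs only two list entries.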
| |
| 2. Previously, for host-triggered transfers, we walked through a linked |
| list of pipes that had pending actions (PIPE_WAKE_***), called the |
| "signaled pipes" list, and the interrupt handler would process roughly 1-3 |
| such pipes each time a transfer was occurring. This is not ideal because a) |
| we are doing work in the interrupt handler and b) there can be a lot of |
| transitions between running the kernel driver in the guest versus running |
| the virtual device in the host (transitions are slow). |
| |
| Hence, pipe v2 keeps a set of pipes to process from host IRQ raises: |
| |
| /* Device-level set of buffers shared with the host */ |
| struct goldfish_pipe_dev_buffers { |
| struct open_command_param open_command_params; |
| struct signalled_pipe_buffer signalled_pipe_buffers[MAX_SIGNALLED_PIPES]; |
| }; |
| |
| and each run of the interrupt handler can then process up to |
| MAX_SIGNALLED_PIPES pipes at a time. In addition, the work of waking up the |
| signaled pipes is left to a bottom-half handler and not done in interrupt |
| context. |
| |
| 11/ goldfish_dma: directly accessing memory on guest from host |
| |
| Sometimes, for applications that need both very high throughput and very |
| low latency, such as video playback, it can be useful to skip host/guest |
| transitions and the gathering of discontiguous buffers, and write directly |
| to memory that is also visible on the host machine. |
| |
| To do this, goldfish_dma provides ioctls on any pipe fd to allocate |
| a contiguous physical region in kernel space, and lets guest userspace |
| mmap() that physical region. These regions can also be shared |
| across processes, such as with gralloc. |
| |
| The goldfish_dma kernel driver adds struct goldfish_dma_context to |
| all struct goldfish_pipe instances: |
| |
| struct goldfish_dma_context { |
| struct kref kref; /* kref to safely free dma region */ |
| struct device* pdev_dev; /* pointer to feed to dma_***_coherent */ |
| void* dma_vaddr; /* kernel vaddr of dma region */ |
| dma_addr_t dma_bus_addr; /* kernel dma_addr_t of dma region */ |
| u64 dma_size; /* size of dma region */ |
| u64 phys_begin; /* paddr of dma region */ |
| unsigned long pfn; /* pfn of dma region */ |
| u64 phys_end; /* paddr of dma region + dma_size */ |
| bool locked; /* marks whether the region is currently in use */ |
| unsigned long refcount; /* For debugging purposes only. */ |
| struct goldfish_pipe_command* command_buffer; /* For safe freeing of DMA regions, |
| we need another pipe command buffer that lives longer than the pipe instance */ |
| }; |
| |
| Interface (guest side): |
| |
| The guest user calls goldfish_dma_alloc (ioctls) and then mmap() on a |
| goldfish pipe fd, which means that it wants high-speed access to |
| host-visible memory. |
| |
| The guest can then write into the pointer returned by mmap(), and these |
| writes become immediately visible on the host without BQL or context |
| switching. |
| |
| The main data structure tracking state is struct goldfish_dma_context, |
| which is included as an extra pointer field in struct goldfish_pipe. Each |
| such context is associated with at most one physical address and size |
| describing the allocated DMA region, and only one allocation is allowed for |
| each pipe fd. Further allocations require more open()'s of pipe fd's. |
| |
| dma_alloc_coherent() is used to obtain contiguous physical memory regions, |
| and we allocate and interact with this region on both guest and host |
| through the following ioctls: |
| |
| - LOCK: Lock the DMA region for data access. |
| - UNLOCK: Unlock the DMA region. This may also be done from the host |
| through WAKE_ON_UNLOCK_DMA. |
| - CREATE_REGION: initialize size info for a DMA region. |
| - GETOFF: send physical address to guest driver. |
| - (UN)MAPHOST: uses goldfish_pipe_cmd to tell the host to (un)map to the |
| guest physical address associated with the current dma context. This |
| makes the physically contiguous memory (in)visible to the host. |
| |
| Guest userspace obtains a pointer to the DMA memory using mmap(), |
| which also lazily allocates the memory with dma_alloc_coherent(). |
| (On last pipe close(), the region is freed). |
| |
| The mmap()ed region can handle very high bandwidth transfers, and pipe |
| operations can be used at the same time to handle synchronization and command |
| communication. |
| |
| The virtual device has these hooks into the kernel driver: |
| |
| 1. MAP/UNMAP_HOST: This is a pipe command used to tell the host machine |
| to obtain a host void* pointing at a particular guest RAM physical address. |
| It is triggered when a new DMA buffer is allocated in the guest kernel |
| through the ioctl(), and lets the host see into the same region. |
| |
| The host will then call cpu_physical_memory_(un)map and track all currently |
| allocated regions in a global data structure (DmaMap). If we take |
| snapshots, we will need to remap those regions, something also handled by |
| DmaMap. |
| |
| 2. UNLOCK_DMA: This pipe command signals the guest kernel that the |
| host is done processing the DMA region. In most simple use cases, the guest |
| will first lock its DMA region and write to it, assuming that |
| the unlock will be triggered by the host after the processing is done. |
| UNLOCK_DMA is how the host triggers this unlocking. |
| |
| Available services: |
| ------------------- |
| |
| tcp:<port> |
| |
| Open a TCP socket to a given localhost port. This provides a very fast |
| pass-through that doesn't depend on the very slow internal emulator |
| NAT router. Note that you can only use the file descriptor with read() |
| and write(), though: send() and recv() will fail with ENOTSOCK, as will |
| any socket ioctl(). |
| |
| For security reasons, it is not possible to connect to non-localhost |
| ports. |
| |
| unix:<path> |
| |
| Open a Unix-domain socket on the host. |
| |
| opengles |
| |
| Connects to the OpenGL ES emulation process. For now, the implementation |
| is equivalent to tcp:22468, but this may change in the future. |
| |
| qemud |
| |
| Connects to the QEMUD service inside the emulator. This replaces the |
| connection that was performed through /dev/ttyS1 in older Android platform |
| releases. See $QEMU/docs/ANDROID-QEMUD.TXT for details. |
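
Service names such as tcp:<port> are what the guest writes as the zero-terminated string after opening the device. A minimal sketch of building such a name follows; `format_tcp_service` is a hypothetical helper, and the port range check is an assumption about what counts as a valid port rather than something the emulator is known to enforce.

```c
#include <stddef.h>
#include <stdio.h>

/* Format "tcp:<port>" into buf.
 * Returns 0 on success, -1 if the port is out of range or buf is too small. */
int format_tcp_service(char* buf, size_t buf_size, int port)
{
    if (port <= 0 || port > 65535)
        return -1;
    int n = snprintf(buf, buf_size, "tcp:%d", port);
    if (n < 0 || (size_t)n >= buf_size)
        return -1;   /* output was truncated */
    return 0;
}
```

The resulting string, including its terminating zero, is then sent with write() on the freshly opened pipe file descriptor.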
| |