<HTML> | |
<HEAD> | |
<TITLE>Garbage collector scalability</TITLE> | |
</HEAD> | |
<BODY> | |
<H1>Garbage collector scalability</h1> | |
In its default configuration, the Boehm-Demers-Weiser garbage collector | |
is not thread-safe. It can be made thread-safe for a number of environments | |
by building the collector with the appropriate | |
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
flag. This has two primary effects:
<OL> | |
<LI> It causes the garbage collector to stop all other threads when | |
it needs to see a consistent memory state. | |
<LI> It causes the collector to acquire a lock around essentially all | |
allocation and garbage collection activity. | |
</ol> | |
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any one time. This inherently
limits the performance of multi-threaded applications on multiprocessors.
<P> | |
On most platforms, the allocator/collector lock is implemented as a | |
spin lock with exponential back-off. Longer wait times are implemented | |
by yielding and/or sleeping. If a collection is in progress, the pure | |
spinning stage is skipped. This has the advantage that uncontested lock
acquisitions, and thus most lock acquisitions on a uniprocessor, are very
cheap. It has the disadvantage that the application may sleep for small
periods of time even when there is work to be done, and threads may be
unnecessarily woken up for short periods. Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most | |
cases, sometimes drastically so. | |
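<P>
The following is a rough sketch of this spin-then-yield-then-sleep pattern.
It is illustrative only, not the collector's actual lock code; the lock word,
the back-off bound, and the use of the GCC atomic builtins are assumptions
made for the example.
<PRE>
#include &lt;sched.h&gt;

static volatile int alloc_lock = 0;        /* hypothetical lock word     */

static void acquire_lock(void)
{
    int pause = 1;
    while (__sync_lock_test_and_set(&alloc_lock, 1)) {
        if (pause <= 1024) {
            volatile int i;
            for (i = 0; i < pause; i++) ;  /* brief busy-wait            */
            pause *= 2;                    /* exponential back-off       */
        } else {
            sched_yield();                 /* longer waits: yield, and   */
                                           /* eventually sleep briefly   */
        }
    }
}

static void release_lock(void)
{
    __sync_lock_release(&alloc_lock);
}
</pre>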
<H2>Options for enhanced scalability</h2> | |
Version 6.0 of the collector adds two facilities to enhance collector | |
scalability on multiprocessors. As of 6.0alpha1, these are supported | |
only under Linux on X86 and IA64 processors, though ports to other | |
otherwise supported Pthreads platforms should be straightforward. | |
They are intended to be used together. | |
<UL> | |
<LI> | |
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to | |
run the mark phase in parallel in multiple threads, and thus on multiple | |
processors. The mark phase typically consumes the large majority of the | |
collection time. Thus this largely parallelizes the garbage collector | |
itself, though not the allocation process. Currently the marking is | |
performed by the thread that triggered the collection, together with | |
<I>N</i>-1 dedicated | |
threads, where <I>N</i> is the number of processors detected by the collector. | |
The dedicated threads are created once at initialization time. | |
<P> | |
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently
(see the sketch following this list).
<LI> | |
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
thread-local allocation. It does not, by itself, cause thread-local allocation
to be used. It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P> | |
Memory returned from thread-local allocators is completely interchangeable | |
with that returned by the standard allocators. It may be used by other | |
threads. The only difference is that, if the thread allocates enough | |
memory of a certain kind, it will build a thread-local free list for | |
objects of that kind, and allocate from that. This greatly reduces | |
locking. The thread-local free lists are refilled using | |
<TT>GC_malloc_many</tt>. | |
<P> | |
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, since the number of short-duration lock acquisitions
is then greatly reduced.
</ul> | |
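<P>
The following is a minimal sketch of how a per-thread free list might be
refilled with <TT>GC_malloc_many</tt>, as referenced in the first item above.
It assumes a GC 6.x <TT>gc.h</tt>, which chains the returned objects through
their first word and provides the <TT>GC_NEXT</tt> macro for following that
chain; the object size and helper names here are purely illustrative.
<PRE>
#include "gc.h"

#define OBJ_BYTES 32                 /* hypothetical object size */

/* In real use this would be a per-thread (thread-specific) pointer. */
static void *free_list = 0;

static void refill_free_list(void)
{
    /* One call (and thus one lock acquisition) returns a whole list
       of OBJ_BYTES-sized objects, linked through their first words.  */
    free_list = GC_malloc_many(OBJ_BYTES);
}

static void *alloc_from_free_list(void)
{
    void *result;

    if (free_list == 0) refill_free_list();
    if (free_list == 0) return 0;    /* out of memory           */
    result = free_list;
    free_list = GC_NEXT(result);     /* unlink the first object */
    GC_NEXT(result) = 0;             /* clear the link word     */
    return result;
}
</pre>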
<P> | |
The easiest way to switch an application to thread-local allocation is to | |
<OL> | |
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>, | |
and then include the <TT>gc.h</tt> | |
header in each client source file. | |
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation. | |
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>, | |
and/or <TT>GC_GCJ_MALLOC</tt>. | |
</ol> | |
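<P>
Putting these steps together, a minimal sketch of a client source file
(assuming a collector built with <TT>-DTHREAD_LOCAL_ALLOC</tt>; the sizes
below are arbitrary) might look like:
<PRE>
#define GC_REDIRECT_TO_LOCAL           /* step 1: before including gc.h     */
#include "gc.h"

int main(void)
{
    GC_thr_init();                     /* step 2: before any allocation     */

    /* Step 3: these now go through the thread-local allocators.            */
    void *obj  = GC_MALLOC(48);             /* may contain pointers         */
    void *data = GC_MALLOC_ATOMIC(1024);    /* pointer-free object          */

    /* ... application code ... */
    return 0;
}
</pre>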
<H2>The Parallel Marking Algorithm</h2> | |
We use an algorithm similar to | |
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by | |
Endo, Taura, and Yonezawa</a> at the University of Tokyo. | |
However, the data structures and implementation are different, | |
and represent a smaller change to the original collector source, | |
probably at the expense of extreme scalability. Some of | |
the refinements they suggest, <I>e.g.</i> splitting large | |
objects, were also incorporated into our approach.
<P> | |
The global mark stack is transformed into a global work queue. | |
Unlike the usual case, it never shrinks during a mark phase. | |
The mark threads remove objects from the queue by copying them to a | |
local mark stack and changing the global descriptor to zero, indicating | |
that there is no more work to be done for this entry. | |
This removal | |
is done with no synchronization. Thus it is possible for more than | |
one worker to remove the same entry, resulting in some work duplication. | |
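<P>
The unsynchronized removal just described might be sketched as follows.
The entry layout and names are invented for illustration and do not
reflect the collector's actual data structures.
<PRE>
struct mark_entry {
    char *base;                      /* start of the region to mark    */
    unsigned long descriptor;        /* zero means "no work here"      */
};

/* Copy still-live entries from the global work queue to this marker's
   local mark stack.  No lock is taken, so two markers can occasionally
   copy the same entry; the only cost is some duplicated marking work. */
static void grab_work(struct mark_entry *global_queue, int queue_size,
                      struct mark_entry *local_stack, int *local_top)
{
    int i;
    for (i = 0; i < queue_size; i++) {
        if (global_queue[i].descriptor != 0) {
            local_stack[(*local_top)++] = global_queue[i];
            global_queue[i].descriptor = 0;   /* mark entry as claimed */
        }
    }
}
</pre>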
<P> | |
The global work queue grows only if a marker thread decides to | |
return some of its local mark stack to the global one. This | |
is done if the global queue appears to be running low, or if | |
the local stack is in danger of overflowing. It does require | |
synchronization, but should be relatively rare. | |
<P> | |
The sequential marking code is reused to process local mark stacks. | |
Hence the amount of additional code required for parallel marking | |
is minimal. | |
<P> | |
It should be possible to use generational collection in the presence of the | |
parallel collector, by calling <TT>GC_enable_incremental()</tt>. | |
This does not result in fully incremental collection, since parallel mark | |
phases cannot currently be interrupted, and doing so may be too | |
expensive. | |
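<P>
For example, a client that wants generational collection together with the
parallel marker would simply make this call once, early in execution (a
minimal sketch):
<PRE>
#include "gc.h"

int main(void)
{
    GC_enable_incremental();   /* request generational collection */
    /* ... allocate with GC_MALLOC etc. as usual ... */
    return 0;
}
</pre>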
<P> | |
Gcj-style mark descriptors do not currently mix with the combination | |
of local allocation and incremental collection. They should work correctly | |
with one or the other, but not both. | |
<P> | |
The number of marker threads is set on startup to the number of | |
available processors (or to the value of the <TT>GC_NPROCS</tt> | |
environment variable). If only a single processor is detected, | |
parallel marking is disabled. | |
<P> | |
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
the collector to immediately yield the processor instead of busy waiting | |
first. In the case of a multiprocessor and a client with multiple | |
simultaneously runnable threads, this may have disastrous performance | |
consequences (e.g. a factor of 10 slowdown). | |
<H2>Performance</h2> | |
We conducted some simple experiments with a version of | |
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to | |
run multiple concurrent client threads in the same address space. | |
Each client thread does the same work as the original benchmark, but they share | |
a heap. | |
This benchmark involves very little work outside of memory allocation. | |
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine | |
under Linux 2.2.12. | |
<P> | |
Running with a thread-unsafe collector, the benchmark ran in 9 | |
seconds. With the simple thread-safe collector, | |
built with <TT>-DLINUX_THREADS</tt>, the execution time | |
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients. | |
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt> | |
are 10.51 (standard library, pthreads not linked), | |
20.90 (one thread, pthreads linked), | |
and 24.55 seconds respectively. The benchmark favors a | |
garbage collector, since most objects are small.) | |
<P> | |
The following table gives execution times for the collector built | |
with parallel marking and thread-local allocation support | |
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested | |
the client using either one or two marker threads, and running | |
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P> | |
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
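<P>
The strategy just described corresponds roughly to the following sketch;
the retry bound and the delay between attempts are illustrative choices,
not the collector's actual values.
<PRE>
#include &lt;pthread.h&gt;

#define TRY_COUNT 10                        /* hypothetical retry bound */

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

static void acquire_alloc_lock(void)
{
    int i;
    volatile int spin;

    for (i = 0; i < TRY_COUNT; i++) {
        if (pthread_mutex_trylock(&alloc_lock) == 0)
            return;                         /* acquired while spinning  */
        for (spin = 0; spin < 1000; spin++) ;  /* brief busy-wait       */
    }
    pthread_mutex_lock(&alloc_lock);        /* give up spinning; queue  */
}
</pre>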
<P> | |
These measurements do not use incremental collection, nor was prefetching | |
enabled in the marker. We used the C version of the benchmark. | |
All measurements are in elapsed seconds on an unloaded machine. | |
<P> | |
<TABLE BORDER ALIGN="CENTER"> | |
<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th> | |
<TH>2 marker threads (secs.)</th></tr> | |
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td> | |
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td> | |
</table> | |
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, the single-threaded benchmark runs faster than
even the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a | |
factor of 2 improvement over the 2 client case with the old collector. | |
The old collector clearly | |
still suffered from some contention overhead, in spite of the fact that the | |
locking scheme had been fairly well tuned. | |
<P> | |
Full linear speedup (i.e. the same execution time for 1 client on one | |
processor as 2 clients on 2 processors) | |
is probably not achievable on this kind of | |
hardware even with such a small number of processors, | |
since the memory system is | |
a major constraint for the garbage collector, | |
the processors usually share a single memory bus, and thus | |
the aggregate memory bandwidth does not increase in | |
proportion to the number of processors. | |
<P> | |
These results are likely to be very sensitive to both hardware and OS | |
issues. Preliminary experiments with an older Pentium Pro machine running | |
an older kernel were far less encouraging. | |
</body> | |
</html> |