Add spin loop to mutex, overhaul monitor

Since Linux context switch overhead is typically larger than a
microsecond, this may greatly reduce the overhead of waiting for a
mutex that is only briefly held by another thread. Rather than going
to sleep and having to be woken back up again, at a cost of several
microseconds, we just spin, hopefully for much less than a microsecond,
until the mutex becomes available. This does waste some CPU cycles
when spinning fails, either because the lock is held too long, or
because we are being scheduled against the thread holding the lock.
But we expect those cases to be unlikely.

We test the lock only a few times, with pauses in between. It's
unclear whether that's beneficial; we should perhaps just loop reading
the variable. In general, this needs further tuning.
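
For illustration only, a minimal sketch of the "test a few times, with
pauses in between" idea. This is not the code in this change: the class
name, kSpinAttempts, the pause length, and the yield-based slow path
are all assumptions made to keep the example self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch; names and constants are illustrative assumptions.
class SpinThenBlockLockSketch {
 public:
  // Tests the lock a bounded number of times, pausing between tests.
  // Returns true if the lock was acquired while spinning.
  bool TrySpin() {
    for (int attempt = 0; attempt < kSpinAttempts; ++attempt) {
      // Cheap read first; only attempt the atomic exchange if it looks free.
      if (!locked_.load(std::memory_order_relaxed) &&
          !locked_.exchange(true, std::memory_order_acquire)) {
        return true;
      }
      Pause();
    }
    return false;
  }

  void Lock() {
    while (!TrySpin()) {
      // Stand-in for the slow path. A real mutex would block here
      // (for example in a futex wait) instead of yielding.
      std::this_thread::yield();
    }
  }

  void Unlock() { locked_.store(false, std::memory_order_release); }

 private:
  static constexpr int kSpinAttempts = 4;  // Assumed value, untuned.

  // Short busy delay between tests of the lock word.
  static void Pause() {
    for (volatile int i = 0; i < 32; i = i + 1) {
    }
  }

  std::atomic<bool> locked_{false};
};
```

The alternative mentioned above, looping on a plain read of the
variable, would correspond to dropping Pause() and raising the attempt
count.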

Add a test that mutual exclusion works, which can also be run as a
lock microbenchmark.
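
A rough sketch of what such a dual-purpose test can look like, not the
test added here. Several threads increment a deliberately non-atomic
counter under the lock. A wrong final count means mutual exclusion
failed, and the elapsed time per acquisition serves as the benchmark
number. RunContendedIncrements and ContentionResult are hypothetical
names, and std::mutex stands in for the lock under test.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical result type for the sketch.
struct ContentionResult {
  int64_t final_count;        // Must equal num_threads * iters.
  double ns_per_acquisition;  // Wall time divided by total acquisitions.
};

// Works with any lock type exposing lock() and unlock().
template <typename Lock>
ContentionResult RunContendedIncrements(Lock& lock, int num_threads, int iters) {
  assert(num_threads > 0 && iters > 0);
  int64_t counter = 0;  // Non-atomic on purpose; protected only by the lock.
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&lock, &counter, iters] {
      for (int i = 0; i < iters; ++i) {
        std::lock_guard<Lock> guard(lock);
        ++counter;
      }
    });
  }
  for (std::thread& thread : threads) {
    thread.join();
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  double ns = std::chrono::duration<double, std::nano>(elapsed).count();
  double total = static_cast<double>(num_threads) * iters;
  return ContentionResult{counter, ns / total};
}
```

With std::mutex m, RunContendedIncrements(m, 4, 100000) should report a
final_count of 400000. The timing depends on the machine and is only
meaningful when compared across lock implementations.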

The old monitor implementation did not benefit much from this, because
it used the mutex only as a low-level lock to protect the monitor data
structure. Instead, use monitor_lock_ as the actual lock providing
mutual exclusion for the monitor, i.e. hold onto it while the monitor
is fatlocked. Among other things, this requires that the monitor_lock_
always be acquired by, or explicitly on behalf of, the thread holding
the monitor.

This in turn makes it really hard to deflate a monitor held by another
thread. Just stop doing that, since it was unclear whether that's
actually beneficial.

The main advantages of the monitor change are:
- Half the number of mutex acquisitions.
- Easier to effectively spin.
- No possibility of blocking while trying to release a monitor.

No longer compute owner method and dex pc values on monitor entry
unless we're tracing. This was expensive and increased lock hold times
sufficiently that it often made spinning ineffective.

Have mutex acquisition call futex wait in a loop between the updates
to the waiter count. The old way resulted in extra futex wakeups in
highly contended situations.
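
One reading of this, sketched for Linux only: register as a waiter
once, repeat the futex wait until the lock is acquired, then deregister
once. This is not ART's Mutex. The class name, the separate state_ and
waiters_ words, and the use of default sequentially consistent atomics
are simplifying assumptions. The sketch also assumes that a
std::atomic<int32_t> can be passed to the futex syscall as a plain
32-bit word.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical sketch; not the implementation in this change.
class FutexLockSketch {
 public:
  bool TryLock() {
    int32_t expected = 0;
    return state_.compare_exchange_strong(expected, 1);
  }

  void Lock() {
    if (TryLock()) return;
    waiters_.fetch_add(1);  // Register as a waiter once, before the loop.
    do {
      // Sleeps only if state_ is still 1; otherwise returns immediately.
      FutexCall(FUTEX_WAIT_PRIVATE, 1);
    } while (!TryLock());
    waiters_.fetch_sub(1);  // Deregister once, after acquiring.
  }

  void Unlock() {
    state_.store(0);
    // Only pay for a wakeup syscall if some thread is registered.
    if (waiters_.load() > 0) FutexCall(FUTEX_WAKE_PRIVATE, 1);
  }

  int32_t Waiters() const { return waiters_.load(); }

 private:
  long FutexCall(int op, int32_t val) {
    static_assert(sizeof(std::atomic<int32_t>) == sizeof(int32_t),
                  "sketch assumes the atomic is a plain 32-bit word");
    return syscall(SYS_futex, reinterpret_cast<int32_t*>(&state_), op, val,
                   nullptr, nullptr, 0);
  }

  std::atomic<int32_t> state_{0};    // 0 = free, 1 = held.
  std::atomic<int32_t> waiters_{0};  // Threads inside the wait loop.
};

// Usage example: contended increments of a non-atomic counter.
int64_t CountUnderLock(int num_threads, int iters) {
  FutexLockSketch lock;
  int64_t counter = 0;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&lock, &counter, iters] {
      for (int i = 0; i < iters; ++i) {
        lock.Lock();
        ++counter;
        lock.Unlock();
      }
    });
  }
  for (std::thread& thread : threads) thread.join();
  return counter;
}
```

A thread that wakes up but loses the race for the lock goes straight
back to waiting without touching waiters_, so the count the unlocking
thread reads stays stable while that thread is between waits.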

Conditionally disable frame size warning for Heap::PreZygoteFork().
Otherwise the platform doesn't build with ART_USE_FUTEXES = 0, which
we needed for testing.

Based on the new test, this appears to get us about a decimal order
of magnitude improvement in inflated contended locking performance.
Single-threaded or scalable applications (i.e. most) should be
unaffected. But it should
prevent applications that do encounter contention from "falling off a
cliff", or at least greatly reduce the height of the cliff. And it
should make performance more repeatable by making it less dependent on
whether a monitor happens to get inflated.

Bug: 111835365
Bug: 140590186
Test: Successfully built and ran monitor tests. Boots AOSP.
Test: Build platform with ART_USE_FUTEXES = 0.
Test: Check contention messages in log after booting AOSP.
Test: Check systrace output while partially running new test.

Change-Id: Iff7457fff59efcb24e25d35a4ef71b67b8a9082a
10 files changed