Fix race condition in accept() on socket that is concurrently closed.
This CL is the result of long digging into the rabbit hole after
observing a flaky test. I've added a unit test for behavior at
the ServerSocket level, although the root cause of the bug sits
ServerSocket.accept() checks for a closed connection at the Java
level before descending into deeper (including native) levels.
The problem was that if the socket was closed() concurrently
*after* this initial check at the Java level, later native code
didn't deal with the closed state correctly and could block
Specifically, a closed socket internally sets the value of its
fd field to -1. Native code reads that field via
(*env)->GetIntField(env, fdObj, IO_fd_fdID);
and passes it to NET_Timeout, which then blocks. Other places
in the code read the same field and pass it to NET_Accept and
NET_Connect, which do not block when they get an fd value of -1.
There are 7 places that I found which call NET_Timeout, and they
all get their fd value from the Java field using the code snippet
above. NET_Timeout is documented to be a "Wrapper for
poll(s, timeout)". However, it was never a particularly thin
wrapper: unlike NET_Poll, NET_Timeout already has a bunch of
extra logic on top of the poll syscall, namely ignoring other
signals and adjusting the timeout. All of this suggests that
NET_Timeout is the right place to handle the case of fd AKA s
being < 0.
As discussed on the bug, observability in the sense of
ServerSocketConcurrentCloseTest failing was caused by:
It's therefore probably a good idea to investigate whether our
current mechanics for closing sockets are the right ones.
Establishing correctness of such an alternative change would
require very careful analysis, though. Therefore I went for
this fix instead, which is much easier to verify as safe and
corrects all observable issues that we currently know about.
- Ran the test many times from Android Studio (adjusted
test's packageName). This is the only test that I also did
on devices without the fix.
- on build with the fix, the test passes 25/25 times.
- on build without the fix, the test fails 12/15 times.
- Also ran a couple of times from CTS (25181x pass, 19x skipped):
make cts && cts-tradefed run cts -m CtsLibcoreTestCases \
- Ran all of the libcore tests once through CTS
(25181 tests pass, 19 tests not run):
make cts && cts-tradefed run cts -m CtsLibcoreTestCases
- make cts && cts-tradefed run cts -m CtsNetTestCases
Tests were run on a Nexus 6P. I previously verified that the
bug also occurs on Nexus 5X and Nexus 6 running N.
(cherry picked from commit bcb10462e854ba7172bab840329b6d3126a2fdd4)
2 files changed