2012-10-20

What is a "data race" and when is a race not a data race?

I've been meaning to write about this since my post series about Rust (in particular here, where I wrote "while data races are no longer possible, race conditions in general still are" about the RWARC library). In general, Rust statically guarantees freedom from data races, though not freedom from all races. But what does that mean?

A data race is when multiple threads concurrently access the same memory location where at least one access is a write. "Concurrently" here could mean either literally at the same time (threads run on different CPUs) or abstractly at the same time (threads interleave with each other on the same CPU); i.e., no synchronisation primitive enforces that one thread's access completes before the other begins.

Thread 1Thread 2
if (p != NULL)

p = NULL;
output(p->data);


Data race detectors, such as Eraser and Helgrind, analyse threads' mutual-exclusion and happens-before relationships to identify unsafe concurrent accesses like these. But it's possible to stop accesses from being concurrent without enforcing correct behaviour:

Thread 1Thread 2
mutex_lock(m);
bool ok = p != NULL;
mutex_unlock(m);

mutex_lock(m);

p = NULL;

mutex_unlock(m);
mutex_lock(m);
if (ok) output(p->data);
mutex_unlock(m);


Now the data race is gone, but the bug has simply become a higher-level race condition. Most literature calls this an "atomicity violation" (and some literature even uses "race" to mean exclusively data races).

You might think this code looks silly, but if you're working in a project with many layers of abstraction and function/module boundaries, this kind of mistake can be all too easy to make, and data race detectors are powerless to find them.

Consider this real-world example. When I started at Mozilla this summer, Rust 0.2 had recently shipped, and its release notes mentioned that it was "Helgrind-clean" (meaning no data races existed). Yet the Rust runtime contained this code:

bool rust_task::blocked() {
    mutex_lock(this->lifecycle_lock);
    bool is_blocked = this->state == task_state_blocked;
    mutex_unlock(this->lifecycle_lock);
    return is_blocked;
}

Sure, accessing the field was safely protected by the mutex, but once it dropped the lock and returned, all bets were off as to whether the returned value was still accurate or not. (I fixed several bugs related to this, and removed this function entirely.)

In a similar vein, Rust's type system guarantees that concurrent tasks cannot share state but instead must use message-passing to communicate, which precludes the possibility of data races completely by enforcing happens-before relationships on all data accesses (or in the case of this post, by enforcing mutual-exclusion relationships). Yet it's still possible to write nondeterministic programs in Rust (using select2, failure propagation, etc), and so race conditions are still possible.

The moral of this story is that data races are only one of many types of races, and though many tools exist for finding them, just because one guarantees absence of data races does not mean your code is completely concurrency-safe. Not to say these tools aren't useful, but they often fail where more sophisticated race-finding techniques could succeed, and even still, no automated race-finding tool can substitute for a careful human brain when reasoning about concurrency.

No comments:

Post a Comment