Message-ID: <20241004162542.GI10433@brightrain.aerifal.cx>
Date: Fri, 4 Oct 2024 12:25:42 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Spin logic flaw in mutexes, etc.

I realized while reviewing how we opportunistically spin trying to
obtain locks that there's a problem with the logic: it was intended
not to spin if there are already waiters (this is both a fairness
consideration and a wasted-cycles one), but it can only see waiters
that have already entered the futex-wait phase. Waiters which are
spinning are invisible. This leads to a situation where a large
number of threads arriving at the same time, hammering a lock, will
all spin.

Just incrementing the waiter count before the spin is an obvious
remedy, but by itself that would lead to the unlocking thread
performing a useless and costly futex wake. I think we could redefine
the waiter count to be adjusted by 1, so that a value of 1 would mean
there's a spinning waiter, and N>1 would mean N-1 waiters potentially
in futex wait that require a wake.

With this approach, I think it would work to have the caller remember
that it performed an additional waiters increment that needs to be
reverted, or just do logic for transitioning 2->0 rather than 2->1. I
thought this could be put in __wait, but I believe all the important
callers actually use __timedwait, which does not integrate the
waiters increment/decrement logic.

There might be other code paths that exit early where the waiter
accounting needs to be reverted too, though. For example, I see one
on line 80 of pthread_mutex_timedlock.c, but I think having this
check is erroneous anyway, since the earlier call to
pthread_mutex_trylock would necessarily have caught this condition in
the absence of state-corrupting UB.

I'm not sure if we should try to avoid ever returning the waiters
count to 0 between futex waits, which can happen now if there's only
one waiter.
Returning the count momentarily to zero allows a newly arriving
waiter to spin (stealing the opportunity to take the lock). However,
stealing the lock is always possible before spinning anyway due to
the trylock, and I don't think any complete mitigation for this is
possible.

The affected functions seem to be:

- pthread_barrier_wait
- pthread_mutex_timedlock
- pthread_rwlock_timedrdlock
- pthread_rwlock_timedwrlock
- sem_timedwait

and of course their callers.

A similar issue applies to the internal __lock, but as it spins a lot
fewer times and is called only in more controlled contexts, I'm not
sure if it matters. The same mitigations can probably be applied to
it if needed.

Rich