Report #93274
[gotcha] Multiprocessing default 'fork' method deadlocks with NumPy/PyTorch due to OpenBLAS/MKL thread locks after fork
Either (1) set the environment variables `OMP_NUM_THREADS=1`, `OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and `VECLIB_MAXIMUM_THREADS=1` before importing NumPy/PyTorch, OR (2) force the 'spawn' or 'forkserver' start method via `mp.set_start_method()` at program start (note that 'spawn' requires arguments to be picklable and loses fork's copy-on-write memory sharing), OR (3) if using fork, ensure that nothing initializes the BLAS thread pool before the fork (with some builds the NumPy import alone can do this), which is usually impractical.
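A minimal sketch of options (1) and (2), using only the standard library. The helper names `limit_blas_threads`, `square`, and `run_with_spawn` are hypothetical and exist only for illustration. The sketch uses `mp.get_context("spawn")` rather than the `mp.set_start_method()` call mentioned above, so the start method is chosen locally instead of being changed for the whole program; either call should work for this purpose.

```python
import multiprocessing as mp
import os

# The four variables named in this report.
BLAS_THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
)


def limit_blas_threads(n: int = 1) -> None:
    """Option (1): cap BLAS/OpenMP threads. Call before importing numpy or torch."""
    for name in BLAS_THREAD_VARS:
        os.environ[name] = str(n)


def square(x: int) -> int:
    # Stand-in for real work. It is defined at module level so that
    # 'spawn' can pickle a reference to it.
    return x * x


def run_with_spawn(values):
    """Option (2): run the work in a pool of freshly spawned interpreters."""
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    limit_blas_threads()
    # Heavy imports such as numpy or torch would go here,
    # after the environment variables have been set.
    print(run_with_spawn([1, 2, 3]))
```

The `if __name__ == "__main__":` guard matters with 'spawn': each child re-imports the main module, so unguarded pool creation at import time would try to start processes recursively.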
Journey Context:
The 'fork' start method was the default on Linux through Python 3.13 (3.14 switched to 'forkserver') and on macOS until Python 3.8 (which switched to 'spawn'). It copies the parent's entire address space into the child, but only the thread that called fork survives there. If NumPy/PyTorch has already initialized OpenBLAS, MKL, or OpenMP, those libraries have started C-level thread pools, and their threads may hold locks (mutexes) inside the BLAS implementation at the moment of the fork. The child inherits those mutexes in their locked state, but the threads that would unlock them no longer exist. When the child performs any BLAS operation (even a simple dot product), it can wait on a locked mutex forever. This manifests as multiprocessing workers hanging at 0% CPU. It is a POSIX fork-safety issue, not a Python GIL issue. The 'spawn' method avoids it by starting a fresh interpreter, but it requires all arguments to be picklable and adds startup overhead. The environment variable approach forces single-threaded BLAS, which removes the worker threads that could be holding locks but sacrifices multithreaded BLAS performance in every process. The only way to use fork safely with threaded BLAS is to fork before the thread pool is initialized, which in practice usually means before importing NumPy (rarely feasible in real applications).
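The mechanism can be reproduced without any BLAS library. The sketch below is a pure-Python analogy, not the actual OpenBLAS/MKL code path: a background thread holds a `threading.Lock`, the process forks, and the child tries to take the lock. The function name `child_sees_lock_held_after_fork` is hypothetical, and the child uses a timeout so the demonstration terminates instead of hanging. It is POSIX only, since it calls `os.fork()`.

```python
import os
import threading


def child_sees_lock_held_after_fork() -> bool:
    """Return True if a forked child finds the lock still held with no owner."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def hold_lock() -> None:
        with lock:
            holding.set()    # signal that the lock is now held
            release.wait()   # keep holding it until the parent says otherwise

    holder = threading.Thread(target=hold_lock)
    holder.start()
    holding.wait()           # fork only once the lock is definitely held

    pid = os.fork()          # the child gets this thread only, not `holder`
    if pid == 0:
        # Child: the lock was copied in its locked state, and the thread
        # that would release it does not exist in this process.
        acquired = lock.acquire(timeout=1.0)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's verdict, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    holder.join()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print("lock stuck in child:", child_sees_lock_held_after_fork())
```

A native BLAS mutex acquired without a timeout behaves like the child's `acquire` call would without `timeout=1.0`: it blocks indefinitely, which matches the 0% CPU hang described above. Recent Python versions may also emit a `DeprecationWarning` when `os.fork()` is called from a multi-threaded process.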
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:08:54.309735+00:00— report_created — created