Report #10295
[bug\_fix] disk I/O error \(SQLITE\_IOERR\) on network filesystem
Root cause: SQLite's default locking relies on POSIX advisory byte-range locks \(fcntl\), which are unreliable or inconsistently implemented on network filesystems \(NFS, SMB, EFS\). A lock granted on one client may not be visible to other clients, because of client-side caching or weak lease coherence, so SQLite ends up working on a file whose state no longer matches what it expects and returns SQLITE\_IOERR. The reliable fix is to host SQLite databases on local filesystems only \(ext4, APFS, NTFS\). If network storage is mandatory, either migrate the database to a client-server system \(for example PostgreSQL\) or implement a single-writer coordinator so that only one node ever opens the database file, which removes the need for cross-network locking.
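One in-process form of the single-writer idea is sketched below, assuming the coordinator is a single Python process that owns the only SQLite connection and runs every statement on one worker thread. The class name `SingleWriter` and its methods are hypothetical and not part of the original app. Handling of a failed `sqlite3.connect` is omitted, and a real deployment would still need to expose this process to the other nodes over some RPC or HTTP layer.

```python
import queue
import sqlite3
import threading
from concurrent.futures import Future


class SingleWriter:
    """Hypothetical coordinator: one thread owns the only SQLite connection."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        self._jobs: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # The connection is created and used on this thread only, so no
        # cross-process or cross-node file locking is ever needed.
        conn = sqlite3.connect(self._db_path)
        try:
            while True:
                job = self._jobs.get()
                if job is None:  # shutdown sentinel
                    break
                sql, params, fut = job
                try:
                    with conn:  # commits on success, rolls back on error
                        rows = conn.execute(sql, params).fetchall()
                    fut.set_result(rows)
                except Exception as exc:
                    fut.set_exception(exc)
        finally:
            conn.close()

    def execute(self, sql: str, params: tuple = ()) -> list:
        """Queue one statement and block until the worker has run it."""
        fut: Future = Future()
        self._jobs.put((sql, params, fut))
        return fut.result()

    def close(self) -> None:
        self._jobs.put(None)
        self._thread.join()
```

Callers would use it as `writer.execute("INSERT INTO kv VALUES (?, ?)", ("a", "1"))`. Statements are serialized by the queue, which trades write throughput for the guarantee that only one process touches the database file.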
Journey Context:
Deployed a Kubernetes StatefulSet with 3 replicas of a Python FastAPI service. Each pod ran its own instance of the app, but all of them mounted the same PersistentVolumeClaim backed by AWS EFS \(NFSv4\) to share a SQLite database for state. Pods crashed intermittently with 'sqlite3.OperationalError: disk I/O error'. Initial investigation checked disk space \(df -h\) and file permissions; both were fine. Enabled SQLite tracing and saw that the error occurred during COMMIT, while acquiring the database lock. Researched the SQLite FAQ and documentation on network filesystems. Realized that NFSv4 advisory locking on EFS, combined with client-side caching, does not provide the strict cache coherence and byte-range locking atomicity SQLite requires; locks held by Pod A were not reliably visible to Pod B. Attempted a workaround using the 'unix-dotfile' VFS for locking, but this serializes all access and still risks corruption if a node crashes. Final resolution: migrated from SQLite to a small PostgreSQL instance \(RDS\) and changed the connection string in the app. If SQLite were mandatory, the only safe fix would be to run a single replica on its own EBS volume rather than a shared EFS volume.
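The disk-space and permission checks above could not reveal the problem, whereas a filesystem-type check at startup might have. The sketch below is Linux-only and reads the mount table from /proc/self/mounts. The function names and the list of network filesystem types are assumptions for illustration, not part of the original app. It does not resolve symlinks or handle mount points containing escaped spaces.

```python
import os

# Assumed, non-exhaustive set of filesystem types to treat as network storage.
# EFS mounts normally show up as "nfs4" in the mount table.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "smbfs", "9p", "fuse.sshfs"}


def fs_type_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/self/mounts, where each line is
    "<device> <mount point> <fstype> <options> ...".
    """
    target = os.path.abspath(path) + "/"
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; on a tie the later line wins,
        # because later mounts shadow earlier ones.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


def assert_local_sqlite_path(path: str, mounts_file: str = "/proc/self/mounts") -> str:
    """Raise if `path` sits on a filesystem type listed in NETWORK_FS_TYPES."""
    with open(mounts_file, encoding="utf-8") as fh:
        fstype = fs_type_for(path, fh.read())
    if fstype in NETWORK_FS_TYPES:
        raise RuntimeError(
            f"{path} is on a {fstype} mount; SQLite locking is unreliable there"
        )
    return fstype
```

Calling `assert_local_sqlite_path` on the database path before `sqlite3.connect` would make a pod fail fast with a clear message instead of crashing intermittently during COMMIT.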
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:17:22.473830+00:00— report_created — created