SkyPilot uses the venerable SQLite for state management. SQLite can handle millions of QPS and terabytes of data. However, our efforts to scale our Managed Jobs feature ran up against the one downfall of SQLite: many concurrent writers. Since SkyPilot typically runs as a CLI on your laptop, we wanted to stick with SQLite, so we decided to figure out how to make it work. We were very surprised by some of our findings.
Our concurrent writers problem
In our managed jobs architecture, there is a single VM instance or Kubernetes pod (the “controller”) that manages the state of many other instances, which actually run the jobs. On the controller, there is a separate process to manage each job. This process:
- Watches the state of the job on the remote instance.
- Notices if the job is cancelled by the user.
- Updates our internal shared job database - this is the SQLite database in question.
In other words, if there are 1000s of jobs finishing at the same time, there can be 1000s of concurrent writes to the SQLite database.
For a while, we had been getting reports of SQLite-related crashes that looked like this:
Traceback (most recent call last):
...
File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/state.py", line 645, in set_cancelling
rows = cursor.execute(
sqlite3.OperationalError: database is locked
But we were only able to reproduce the error consistently when scaling to 1000+ jobs at once. Short of drastically rearchitecting our system or switching to another DB, what can we do to make 1000x write concurrency actually work in SQLite? This is what we wanted to find out.
TL;DR
- Try to avoid using SQLite if you have many processes writing to your database concurrently.
- SQLite uses database-level locks that can counter-intuitively starve unlucky processes.
- If you must, use WAL mode and a high lock timeout value.
You should not use SQLite for many concurrent writers
The SQLite docs are very clear about when you should and shouldn’t use SQLite, versus a standard client/server SQL database. Under “Situations Where A Client/Server RDBMS May Work Better”:
High Concurrency
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
Indeed, for most applications, if you find yourself here, your go-to answer should be to switch to something like PostgreSQL or TiDB. But if you need to stick with SQLite, what happens?
When we increased the number of SkyPilot processes to around 1000, we started to see some processes crash with this error: sqlite3.OperationalError: database is locked. Resolving these locking crashes took us deep into SQLite internals.
What we learned about SQLite internals
1. sqlite3.OperationalError could have multiple causes
SQLite has various locking mechanisms to ensure consistency. If a transaction tries to execute but cannot obtain the necessary lock, it will fail with an SQLITE_BUSY error. In Python, this is translated to sqlite3.OperationalError: database is locked.
If the necessary lock is currently held, SQLite will keep trying to obtain the lock until it times out. The timeout is set by sqlite3_busy_timeout() in the C API, which corresponds to the timeout= kwarg in the Python API’s sqlite3.connect() call.
static int
pysqlite_connection_init(PyObject *self, PyObject *args, PyObject *kwargs)
{
// ...
double timeout = 5.0;
// ...
}
In Python, the default timeout is 5 seconds.
If you’re getting the database is locked error, there are a few other things that could be causing it. You should rule these out before proceeding:
- If your application has only a single process, make sure it’s not opening multiple connections that conflict with each other.
- If you have multiple processes, one process might be inadvertently holding a write transaction open for a long time, blocking all the other processes.
- sqlite3_busy_timeout() might be unset and defaulting to 0, meaning SQLite returns SQLITE_BUSY immediately instead of waiting for the lock. (This is unlikely in Python, where the default is 5 seconds.)
- If you run write transactions without using BEGIN IMMEDIATE, you might hit SQLITE_BUSY regardless of your timeout setting: a transaction that starts as a read and later upgrades to a write cannot always wait for the lock, since its read snapshot may already be stale.
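For reference, here’s a minimal sketch of what these settings look like from Python. The database path and the jobs table are made up for illustration; the important parts are the timeout= argument and starting write transactions with BEGIN IMMEDIATE so the busy timeout actually applies to them.

import sqlite3

# Hypothetical example: "state.db" and the "jobs" table are illustrative only.
conn = sqlite3.connect("state.db", timeout=60, isolation_level=None)  # wait up to 60s for locks

# Take the write lock up front, so the busy timeout applies to the whole
# transaction instead of failing immediately on a read-to-write upgrade.
conn.execute("BEGIN IMMEDIATE")
conn.execute("UPDATE jobs SET status = ? WHERE job_id = ?", ("CANCELLED", 42))
conn.execute("COMMIT")
conn.close()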
We were pretty sure that none of these affected SkyPilot, so we charged on to figure out what these locks actually are.
2. WAL journaling mode is not a silver bullet
By default, SQLite uses a rollback journal to enforce atomicity. You can read about exactly how the locks work, but the TL;DR is: many readers can access the database at once, but if you want to write to the database, you must obtain an exclusive lock over the entire database.
Default rollback journal: at any time, either
- n readers, OR
- 1 writer
SQLite can also run in “WAL” (Write-Ahead Logging) mode. Instead of using a rollback journal, new writes are not immediately applied to the main database file; they are appended to a write-ahead log instead. This allows readers to keep accessing the database while writes occur. Still, only a single writer can write to the database at once, blocking all other writers until it is finished.
WAL mode: at any time, both
- n readers, PLUS
- 1 writer
For both modes, there are some other cases that can cause SQLITE_BUSY. For instance, in WAL mode, a reader typically shouldn’t ever see lock contention. However, some edge cases, such as flushing the WAL back into the database, can require an exclusive lock that blocks readers.
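Enabling WAL mode is essentially a one-time operation per database file, since the journal mode is stored in the file itself. Here’s a small sketch of how you might enable and verify it from Python (the database path is illustrative):

import sqlite3

conn = sqlite3.connect("state.db", timeout=60)
# The journal mode is persistent, so this only needs to succeed once per
# database file, but it's harmless to set it on every connection.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
assert mode == "wal", f"journal mode is {mode!r}, expected 'wal'"
conn.close()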
SkyPilot was already using WAL mode, but we were still falling over when many processes needed to write to the database at the same time. We needed to dig deeper, to find out how the locks worked exactly, and what was causing them to time out.
Early on, I found this old post from the beets blog, describing a failure mode where SQLite might only try to acquire the database lock once per second! But this post is from 2012 and is now very outdated - the SQLite source no longer uses HAVE_USLEEP, and the problem described in this post doesn’t apply to any modern system.
However, this did point me toward the function sqliteDefaultBusyCallback in SQLite’s C source code, which turned out to be a huge insight…
3. Lock acquisition is complete carnage
When SQLite tries to get a database lock, this is what it essentially does:
try to get lock
sleep 1 ms
try to get lock
sleep 2 ms
try to get lock
sleep 5 ms
... (sleep interval increases)
try to get lock
sleep 100 ms
try to get lock
sleep 100 ms
try to get lock
sleep 100 ms
... continue until timeout
The list of sleep intervals is actually hard-coded in the SQLite C code!
static const u8 delays[] =
{ 1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100 };
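To get a feel for what this schedule means in practice, here’s a small Python sketch that replays the same hard-coded delays. It only roughly mirrors the logic of sqliteDefaultBusyCallback (it is not SQLite’s actual code), counting how many lock attempts fit inside a given busy timeout:

# Replay SQLite's hard-coded backoff schedule (values in milliseconds).
DELAYS = [1, 2, 5, 10, 15, 20, 25, 25, 25, 50, 50, 100]

def attempts_within(timeout_ms):
    """Count lock attempts made before the busy timeout expires."""
    waited = 0
    attempts = 0
    while waited < timeout_ms:
        attempts += 1
        # After the table is exhausted, SQLite keeps retrying every 100 ms.
        waited += DELAYS[attempts - 1] if attempts <= len(DELAYS) else 100
    return attempts

print(attempts_within(5_000))   # Python's default 5 s timeout: ~59 attempts
print(attempts_within(60_000))  # a 60 s timeout: ~609 attempts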
Note that SQLite’s retry loop doesn’t use a condition variable or any kind of notification-based synchronization.
That means that if multiple processes are trying to get the lock at the same time, there is no FIFO guarantee. Process 1 might have been waiting for the lock for two seconds already, but process 2 can still come along and snag it on the first attempt. There’s no mechanism for process 1 to be prioritized because it has been waiting longer.
So in a machine with hundreds of processes trying to get this lock, it’s basically a constant bloodbath of processes all vying to be the lucky one that gets to write to the database.

How long will it take for a process to get the lock? Thinking about this at the system level and at the process level leads us further toward SQLITE_BUSY enlightenment.
4. Increasing the timeout can exponentially reduce the chance of hitting it
From the perspective of the system as a whole, this is how it looks:
- process 1 holds the lock, processes 2-20 are trying to get the lock every 100ms, but they aren’t synced up
- process 1 releases the lock
- process 17 happens to be the process that tries to get the lock next, 4ms later
- now, process 17 holds the lock and the other processes 2-20 keep trying
From the perspective of a single process waiting for the lock:
- Wait 100ms
- Try to get the lock
- If we happen to be the lucky client that tries first after the lock is released, we get the lock
- Otherwise, we don’t get the lock, and wait another 100ms.
The chance of getting the lock can be modeled as an independent random event.
Since each attempt to get the lock is independent, the number of attempts/total time to get the lock should follow a geometric distribution. And indeed, in tests, this is exactly what we see (n=100,000):
There’s a spike in the first 0.5s because the retries are still backing off, so there are around 3x as many attempts in the first bin as in all the others.
Unsurprisingly, if we halve the contention, the latency also halves on average. But the distribution looks the same.
Even at 1000x concurrency - far beyond what SQLite is meant to handle - the p50 write time is 2.3 seconds. Only 0.13% of writes take longer than 60 seconds.
Simulation details
Create a simple table
CREATE TABLE IF NOT EXISTS test (id INTEGER PRIMARY KEY AUTOINCREMENT)
Do a bunch of writes:
import random
import sqlite3
import time

for _ in range(100):
    start = time.time()
    conn = sqlite3.connect("test.db", timeout=120)
    cursor = conn.cursor()
    cursor.execute('INSERT INTO test DEFAULT VALUES')
    conn.commit()
    conn.close()
    print(f'{time.time() - start}', flush=True)
    # Random sleep between transactions (0-100ms)
    time.sleep(random.random() / 10)
The random sleep is needed; otherwise, a process will typically grab the lock right back after releasing it. It also introduces jitter between successive writes.
I ran this in 1000 parallel processes and aggregated all the printed latencies, then again with 500 parallel processes.
The pictured results are from an AWS m6i.4xlarge instance (16 vCPUs, 64GB memory).
There are two important insights here:
- The time it takes to get the lock is not consistent. It’s basically down to luck. In your application, this may look like random spikes in database latency, but in reality, it’s just some processes getting very unlucky when trying to get the lock.
- You can directly see how increasing the lock timeout exponentially decreases the chance of hitting it. No timeout value can 100% guarantee that you won’t hit SQLITE_BUSY, so choose whatever timeout reduces the chance to a reasonable level for your application. At 1000x concurrency, every 5.6s of additional timeout halves the chance of hitting the timeout (see the sketch below).
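To make that relationship concrete, here’s a rough back-of-the-envelope helper that uses the numbers from the test above. It’s only an illustration of the exponential relationship, not a precise model:

import math

def required_timeout_s(baseline_timeout_s, baseline_failure_rate,
                       target_failure_rate, halving_interval_s):
    """Estimate the timeout needed to push the SQLITE_BUSY rate below a target,
    assuming each extra halving_interval_s of timeout halves the failure rate."""
    halvings = math.log2(baseline_failure_rate / target_failure_rate)
    return baseline_timeout_s + halvings * halving_interval_s

# From the 1000-process test: 0.13% of writes exceeded a 60 s timeout, and
# every ~5.6 s of extra timeout halved the failure rate.
print(required_timeout_s(60, 0.0013, 0.0001, 5.6))  # ~81 s for a 0.01% failure rate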
So, how can you make high concurrency work?
If you’ve come this far, you can see how we arrived at our main conclusions:
Avoid high write concurrency, if possible. If you can, use a single process and serialize writes in application code - see the sketch at the end of this section. (We’re going to see if we can do this in the future.)
Use WAL mode, which will allow reading processes to run concurrently with writes. But know that WAL won’t save you from concurrent writes.
If you are not latency-sensitive, increase the timeout more than you think you should. According to the test above:
- At 1000x concurrency, increase the timeout by ~20s to reduce timeout probability by 10x.
- At 500x concurrency, increase the timeout by ~10s to reduce timeout probability by 10x.
- The relationship also depends on how long your write transactions take (once they have acquired the lock).
In SkyPilot, the timeout is currently set to 60s. We may increase it even further if needed.
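For reference, here’s a minimal sketch of what serializing writes through a single process could look like. This is a hypothetical pattern (the queue-based writer and the jobs table are made up for illustration), not how SkyPilot works today:

import multiprocessing as mp
import sqlite3

def writer(queue):
    """The single writer process: the only one that ever writes to the database."""
    conn = sqlite3.connect("state.db", timeout=60, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    while True:
        item = queue.get()
        if item is None:  # shutdown sentinel
            break
        sql, params = item
        conn.execute("BEGIN IMMEDIATE")
        conn.execute(sql, params)
        conn.execute("COMMIT")
    conn.close()

if __name__ == "__main__":
    queue = mp.Queue()
    writer_proc = mp.Process(target=writer, args=(queue,))
    writer_proc.start()
    # Job-managing processes enqueue writes instead of opening the database themselves.
    queue.put(("UPDATE jobs SET status = ? WHERE job_id = ?", ("DONE", 42)))
    queue.put(None)
    writer_proc.join()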
Future exploration
SQLite has an experimental transaction type called BEGIN CONCURRENT that allows non-conflicting writes to partially overlap. Unfortunately, this feature is only available in a special branch for now, and isn’t part of the main SQLite trunk. So you need to build your own SQLite to use this feature. Since SkyPilot depends on the SQLite library compiled into Python, this is a dealbreaker for us. But if your application has control over its SQLite dependency, you may be able to use this to reduce conflicts!
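For completeness, here’s a rough sketch of what the usage pattern could look like, assuming a Python build linked against a SQLite compiled from the begin-concurrent branch (stock builds will reject BEGIN CONCURRENT; the jobs table is illustrative). The retry loop is there because conflicts are only detected at commit time:

import sqlite3

conn = sqlite3.connect("state.db", timeout=60, isolation_level=None)

for attempt in range(5):
    conn.execute("BEGIN CONCURRENT")
    conn.execute("UPDATE jobs SET status = ? WHERE job_id = ?", ("DONE", 42))
    try:
        conn.execute("COMMIT")  # page-level conflicts only surface at commit
        break
    except sqlite3.OperationalError:
        conn.execute("ROLLBACK")  # conflicting write: roll back and retry
conn.close()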
P.S. If you need to run thousands of GPU jobs on the cloud, try SkyPilot
We struggled through these issues so that you can run thousands of parallel cloud jobs in SkyPilot without worrying about the scaling. SkyPilot is really good at finding cloud compute, including GPUs. It’s a great fit if you have a batch inference run or another parallelizable workload. We can handle your scale, even though we’re using SQLite under the hood! If you give it a shot, let us know how it goes in Slack.