I fixed a bug in the application I've been working on this week. It was a bug that I knew was there, because I had written a function with multiple SQL queries that needed to be in a transaction that were not in a transaction. If this function ever got called by two clients near-simultaneously, one or both calls would produce weird results.

I had put off fixing it this issue because I wasn't sure how to test it and was afraid it might be hard to test. The bug would be annoying if it ever happened in production but it's probably not that likely, and I wasn't sure it was worth the time it would take to "do it right." Turns out, though, that since I'm using Elixir, it's quite easy to replicate the issue in a test. I haven't been able to find a writeup of this specific technique, so I thought I'd share it with you here.

How did I know that these queries should be in a transaction? Basically, because I had a pair of queries, the first query got a piece of data, and then the second query modified that data based on that data.

The code in question looked something like this:

def create_item(attrs) do
  position_query =
      from i in Items, select: count("*")

  position = Repo.one(position_query)

  %Item{}
    |> Item.changeset(Map.put(attrs, :position, position))
    |> Repo.insert()
end

The problem with this code is that if this function runs twice at effectively-the-same-time, then both queries will get the same value for position, and then create new Items with the same value for position. What should happen instead is the second function should wait to read until the first function has successfully finished the INSERT (or aborted.)

Often timing-based bugs like this are a little bit tricky to replicate, but I can replicate this bug with a Storytime.DataCase that looks something like this:

test "create_item/1 should always create items with a unique position, even when creating multiple items simultaneously" do
  tasks =
    Enum.map(1..2, fn _ ->
      Task.async(fn -> Organization.create_item(attrs) end)
    end)

  Enum.map(tasks, &Task.await(&1))

  positions = get_items!() |> Enum.map(& &1.position)
  assert Enum.uniq(positions) == positions
end

c

This test uses the Task module to create a List with two tasks. Each of those tasks runs create_item() – not quite simultaneously, but the tasks start within less than a millisecond of each other.

Then we take that List of Tasks and use map to await each one in sequence and convert tasks from a list of unresolved tasks into a list of the values returned by that task. If there were a lot more of these tasks, or if there was potentially a lot of variation in the amount of time they took, I might have done something other than "wait for each one in sequence" but in this case the simplest-to-read version is best.

Finally, we check that the list of positions returned is entirely unique – each possible position is used once and only once. As soon as I wrote this test it replicated the bug exactly – I was actually a little bit surprised. I was able to fix it simply by wrapping the whole operation inside of a transaction.

Repo.transaction(fn -> *** end)

My favorite thing about this test is that it's not a test for whether these operations are happening inside of a transition. It might actually be better to do this with the first request as a subquery inside the insert, and this test will happily accept that solution.

I am somewhat concerned that this test is technically nondeterministic. If for some reason the second task takes longer than usual to start, and didn't run its first query until the first task had completely finished, the whole test would succeed, even with the "buggy" version of the code. In practice even if this is the case I won't notice – the test would still succeed – but normally I avoid nondeterministic test code as much as possible.

I could make the "nondeterministic" failure less likely by using large number of tasks. I actually started out with Enum.map(1..100) but it didn't turn out to be necessary to do that many – all 100 came back with the same position value.

As an aside – if you work on systems that have databases and many clients – a description that fits most web applications – developing good intuition about transactions and transaction isolation is one of the best things you can do to present as a senior engineer. "This operation should have been in a transaction, but it wasn't" is probably the most common reason I have encountered for business-impacting bugs. It doesn't come up that often but when it does it's often in the form of questions like, "Why did we charge this customer $80,000 more than we should have?"

Testing Transactions in Elixir

My favorite thing about this test is that it's not a test for whether these operations are happening inside of a transition. It might actually be better to do this with the first request as a subquery inside the insert, and this test will happily accept that solution.