Channel Cancel Culture
I wrote a background job that backs up a git repo after every commit, with a poll interval, so that many commits submitted in rapid succession trigger only one backup process.
The tests have been intermittently failing on CI, for seemingly odd reasons:
- Removing the test git directory during setup:
    Error: Received unexpected error:
    unlinkat data/testGitOrg: directory not empty
- Trying to back up after the original repo has already been removed (maybe):
    target OID for the reference doesn't exist on the repository
I added some more logging, and it looks like a race condition is in play. I have the following logic to trigger a backup to run:
    for {
        select {
        case <-ctx.Done():
            log.Println("cancelled")
            return
        default:
            time.Sleep(pollInterval)
            if ready(batchInterval) {
                Run(src) // this is the actual backup routine
            }
        }
    }
The context I pass into this loop is instantiated thusly:
    gbctx, gbcancel := context.WithCancel(context.Background())
gbcancel gets called at the bottom of each test.
My current theory is that, if we are in the middle of a Run, then the cancel won’t be responded to until that Run is complete. Now that I’m typing this out, this logic is obvious…
The problem is, I don’t know how long the commits+backup are going to take during the test run. It seems lame to just bump up the wait time such that it’s extremely unlikely that a race will occur. There is a better solution, though. The Run func uses a lock, so that more than one goroutine doesn’t try to interact with the backup repository at the same time. So, if I add an additional Run call after the gbcancel, the test will block until a previous Run completes!
Even better, if I create a new method WaitForRunToComplete, and lock/unlock the same mutex, I can wait for a previous Run to complete without needing to execute another one.
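Roughly, the idea looks like the sketch below. The Backup type, its work field, and the Runs counter are hypothetical stand-ins so the example compiles on its own; only Run and WaitForRunToComplete correspond to names from the real code, and the real Run does actual backup work under its lock.

```go
package main

import "sync"

// Backup is a hypothetical stand-in for whatever owns the Run lock.
type Backup struct {
	mu   sync.Mutex
	work func(src string) // placeholder for the real backup work
	runs int              // completed runs, guarded by mu
}

// Run holds the mutex for the whole backup, so only one goroutine
// touches the backup repository at a time.
func (b *Backup) Run(src string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.work != nil {
		b.work(src)
	}
	b.runs++
}

// WaitForRunToComplete blocks until any in-flight Run releases the mutex.
// The empty critical section is intentional: acquiring the lock is the wait.
// If no Run is in progress, it returns immediately.
func (b *Backup) WaitForRunToComplete() {
	b.mu.Lock()
	b.mu.Unlock()
}

// Runs reports how many runs have completed.
func (b *Backup) Runs() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.runs
}
```

In a test, the teardown would call gbcancel() and then WaitForRunToComplete() before removing the test directories. One caveat with this sketch: it only waits for a Run that already holds the lock, so a Run that starts after the wait returns would not be covered.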
This seems to be working, so far.