Celery in Production: Three more years of fixing bugs

This post covers the various issues we faced with celery in production over the last year or so. An ex-team member had iterated on a bunch of celery issues around 3 years before this iteration and had blogged about his experience here: https://medium.com/squad-engineering/two-years-with-celery-in-production-bug-fix-edition-22238669601d

While that iteration fixed a lot of problems at that time, some problems had only gone hidden & then reappeared a few months later. So it was given that a more definitive iteration would be required to fix the problems this time around.

Our instances would always be on high RAM usage, and manually scaling them to 2x or 4x the size would not fix the problem

This is so counter-intuitive, right? If I have an instance with 90% memory usage, doubling its resources should result in approx 45% memory usage. But the expected didn’t happen. We use AWS instances for celery workloads, so the team would scale up the instance in hopes of containing the memory requirement. The instance would still show 90-100% RAM usage so someone tried to change the instance type from c5 to t3 (c5 is for CPU bound tasks, and t3 is burstable CPU instance type for non-CPU bound workloads). This didn’t work either.

The Solution: Whenever celery processes (or any other, for that matter) takes a lot of RAM, the first response should be to gather a list of processes with resident & shared memory usage. Without this data, you’re very likely to be shooting an arrow in the dark. So anyway I went ahead on the culprit server, which had 16 gigs of memory. A quick view of htop & then sorted by memory usage showed that the process with max memory usage was using ~320MBs of memory. I already knew that our application takes ~280MBs when I run python manage.py shell, so it’s not a case of a few rogue celery processes taking too much RAM. Next, I ran

$ ps aux | grep 'celery worker' | wc -l
  52

If you do the math, 280MB * 52 ~= 14.5GB. So once all the celery workers are launched, the instance is already around 90% full. But why does this not get fixed when I double the instance size? That’s because if you don’t specify --concurrency for each worker explicitly, it defaults to using the number of cores available as the default value for concurrency. This option controls how many children will be forked for each worker defined. So when you double the instance, you usually also double the CPUs in the instance, thus increasing the number of celery processes in the process. A similar thing happens when you switch from c5 to t3 instance, you get more (but much weaker) CPU cores.

The final solution was to explicitly specify the concurrency in the worker launch command:

$ celery worker -A project (...) --concurrency=4

Some workers were using 4-5x times the RAM taken by a freshly launched child worker process

Just by reading the description of the problem, you can tell that there’s a strong hint of memory leak here. I had seen in htop that the culprit workers were all at least a few hours old. This problem was new, not something we faced in last iteration. So why did this happen now? Well in the previous iteration (See “Worker servers always had an unexplainably high CPU usage” section), we had removed the --max-memory-per-child flag from options, which now meant that the processes were not dying after regular intervals. So we made a mistake; while that setting was not working for us at that time, we should have changed it to a higher value than outright removing it. So while this was easily fixable by re-introducing the --max-memory-per-child flag, at that time I had a script lying around that’d kill a process matching certain properties, like if it goes beyond a threshold in memory usage. The kill mechanism itself is first it tries to soft-kill (letting the running task complete), and then initiates hard kill if the process is still running after a timeout. This method is still inferior to using --max-memory-per-child, but probably better for us right now.

Instances would frequently use more RAM than they should be using

This is the same problem as section “Worker servers always had an unexplainably high RAM usage” in https://medium.com/squad-engineering/two-years-with-celery-in-production-bug-fix-edition-22238669601d We thought it had been fixed until it re-appeared weeks later 😣.

So this time we already knew what the problem was: child worker orphans that don’t die out once their master worker has been killed in the deployment process. Now we know that all orphans get “adopted” by PID 1 once their parent die. So all I had to do was to write a small script that’d kill any celery worker process that had PID 1 as its immediate parent. We run our celery workers as multiple workers in each instance, all under supervisord which in turn is controlled by systemd. Now, this is not the best way to solve this. Ideally, the service manager (supervisord) should be aware of all the forked children & kill them when the parent is dead. A quick search tells that Systemd already does that, and is also available on the instance to be used. But we now generate the celery worker definitions from a YAML file into supervisord configs, and replacing supervisord with systemd would mean that I’ll also have to change that YAML to config generator accordingly. That would’ve taken more time than what I had then, so I decided to leave it this way. And it has been working fine till now with no issues so I won’t complain much either.

A few celery workers would hang after a few hours of usage, and would just stop consuming any more tasks

Let’s get you a little bit of background on our crons setup: Around the same time as the old blogpost, we switched from basic crond on a single large instance to celery beat with workers across multiple machines. We had migrated to healthchecks.io for alerting in case a scheduled cron misses its run.Now, after a few hours of usage, a few celery workers would stop consuming any more tasks. So I went in the instance & straced the child worker:

This looks familiar 🧐 (see section “Workers stayed idle, not consuming any tasks” in https://medium.com/squad-engineering/two-years-with-celery-in-production-bug-fix-edition-22238669601d)

So this has reappeared as well 😿 strace gives me the system call that it’s stuck on, but that’s not good enough. I want to see what application code, what library code, what underlying C function is responsible here. Basically, I need the backtrace of the process. Let’s begin!

install python debug build of python: sudo apt install python-dbg
install gdb: sudo apt install gdb
take a core dump of the process: sudo gcore $PID
launch the gdb attached to core dump: gdb python core.$PID
inside the gdb, use py-bt to get python level backtrace & bt for C level backtrace

After I collected backtraces from all the idle workers, I found that all of them had a similar backtrace, of top few lines are

and the corresponding C backtrace is

It looks like urllib3 is waiting to acquire GIL before it could decompress the HTTP body chunk, possibly stuck in a deadlock. Not much I can do here, this looks like a bug - which won’t be fixed now that Python 2 has reached EOL. We eventually created a workaround for this that works this way: once every few seconds, when a celery worker picks a new task, it sends a heartbeat to a local redis instance. A cron checks redis & kills whichever worker hasn’t sent a heartbeat in last few minutes, and sends a SIGTERM followed by a SIGKILL.

There’s a bit more detail to this workaround, but that’s a story for another time. Also: it’s easier to just use py-spy for this.

IO vs CPU bound tasks We had some instances which were having high CPU usage, while there were others with very low CPU usage. This is not a problem in itself, but I wanted to have a better visibility of CPU bound vs IO bound load in each queue to be able to allocate infra accordingly. Celery provides some signals on which you can attach your listeners. https://docs.celeryproject.org/en/stable/userguide/signals.html. We are particularly interested in 2 of these signals:

task_prerun
task_postrun

If I log the current clock time and CPU time at each of these signals, I can get the time taken by each task. I can then use the CPU time difference divided by clock time difference to get an approximate idea of CPU vs IO bound workload for each task. Let’s create the listeners:

After the logs are collected, I’d run a script that would parse the above lines into something like:

A plot of cpu_time_difference divided by clock_time_difference makes it clear which tasks are IO bound & which are CPU bound. The IO bound tasks would peak around lower X-axis values. Here’s the plot from our system (I have defaced the actual task names with hex codes), where you can see that the IO bound tasks peak towards the left. (The Y axis is indicative of number of tasks).

Use Gevent as worker pool

This is not a problem, but rather an optimization that I wanted to try out. We started using gevent as the pool option instead of the default prefork in some of our IO bound workers. This didn’t go well for because some libraries as well as application code were not thread-safe, causing some deadlocks & other annoyances. So this was reverted within minutes in all but 1 worker whose only tasks were shooting webhooks, and that too was reverted days later because it was too unsafe to let it be on production.

Key Takeaways

Celery has been one of the more unstable pieces in our infra over the years.
Supervisor should’ve killed celery’s forked workers once the parent dies. Either it is not reliable or we haven’t been able to make this happen in at least a few attempts now. I’m inclined to believe the former is true, but neither outcomes will look good for supervisord.
Verify if the bug is truly solved, or just dormant. Sometimes it may mean not closing the bug tracker issue before your changes have been running for few weeks in production.
Don’t debug just by google/github searches. Brendan Gregg has written some great tips around performance debugging, some of which are relevant when debugging errors as well: http://www.brendangregg.com/methodology.html
There are too many thread-unsafe libraries in python ecosystem. It’s also very easy to create your own thread unsafe code. As a result, gevent (or eventlet) will remain a pipedream except in very limited scenarios.

task_id	task	clock_time_difference	cpu_time_difference
c83fb48d-14e7-424c-b279-038a17c59c95	core.send_email	1.23123	0.07532