First look at performance for workflow processing

I’ve set up a simple benchmarking app to get a look at how my new Redis-based workflow processing system was going, and used it to create 100k workflows, each containing 9 tasks. The workflow structure is identical for each: just a simple chain. I ran it and a Redis instance on my dev laptop, and gathered data about task submit and completion time, and workflow completion time – both actual, and when I received the Redis pub-sub message to confirm completion.

Baseline metrics.
Fig 1. Baseline workflow performance metrics.

It took ~350 seconds to run through 900k tasks (each as close to a no-op as possible) giving an overall churn around 1k tasks per second. It took ~60 seconds to submit 100k workflows with initial tasks, or about 1.5k workflows per second.

Completion of tasks was noticeably suppressed during workflow submission, and as a result the number of completed tasks lags behind the number of submitted tasks. However, it looks like the rate of task submission and completion – the key parts of workflow turnover – are independent of the number of tasks to run, suggesting that this behaviour scales.

There are some features that I don’t immediately understand, however. The rate of workflow completion is not linear; completion is delayed, as you might expect, but the rate increases with time.

Additionally, there’s a delay between actual workflow completion and the delivery of the message that signals that a workflow has completed. The rate of delivery looks consistent with the rate of completion, but the delay is consistently about 50 seconds. Two things immediately occur to me; as the system’s using message passing to flag submission of new tasks that they might also be affected by a delay; and that arguably we might not need to wait for a message to tell us that a workflow has completed. As I’m marking tasks complete synchronously from the client, it could potentially tell me that there are either more tasks submitted as a result, or none, in which case I could reasonably take some sort of action (although that action would need to be shared across all task handling processes, if I had such).

In fact, there’s a problem with the way I’ve labelled the curves – it’s not obvious that workflow completion time is nonlinearly related to the number of remaining tasks. I’m making an assumption.

Next things to check: I’m going to add a periodic sampling of the queue size to add to this chart, and see if its possible to analyse the number of messages being posted.