Picking up where you left off

One thing is certain in any live system: something will break and leave the system in an inconsistent state. If your system isn’t designed with this as a core assumption – not that things will work, but that things will break – then your design is wrong.

Less catastrophically, there are occasions where you want to turn part or all of a system off. Without thought this will have the same effect: the system will be left in an inconsistent state.

In a distributed workflow this generally means that tasks are unofficially abandoned, as the agent responsible for running them can’t complete them and update their state. If a system is deliberately stopped then you have a chance to manage that state properly. You could:

wait for all running tasks to complete before stopping the system, while preventing any new tasks from running, or
transition any running tasks to an appropriate pre-running state.

Typically any final solution comprises some element of the second plan; under the first plan you might still want to be careful about giving tasks a maximum timeout duration, and if tasks reach that timeout then you have no option but to reset them to a pre-running state.

If your tasks define work that’s to be undertaken on a separate system – say a distinct compute cloud – then resetting tasks you might still have some underlying problems to solve if that compute infrastructure doesn’t allow you to cancel running work.

In each case, however, you have a chance to transition tasks to some desired state before agents stop. If an agent has failed you typically don’t get that chance, and you need to find some other way of managing the state. There are a few options for doing that, among them:

just forget; have some backstop process gather up tasks that haven’t had a state update within a certain duration and automatically reset them to a pre-running state, or
make it possible for agents to reconnect to their previous state and decide what to do with abandoned tasks.

The first option there is quite simple, but has implicit problems. How do you choose the duration after which task state gets reset? You run this risk of either resetting in-flight tasks, or leaving tasks abandoned for so long that their results are no longer required. At worst, this backstop is a manual process.

The second presents more explicit problems: how does the system know which state is or was associated with which agent? Two things are needed: first each agent needs to have an identity that’s maintained in time (or at any time there needs to be an agent that’s acting under a given identity), and second that identity needs to be associated with task state when it takes responsibility for it, and dissociated when it gives up responsibility – when it starts and stops the task.

Maintaining a system of agents with distinct identities is relatively straightforward. The trick is to have any new agent process assume one from a pool of identities when it starts up. The actual technique chosen is heavily influenced by the mechanism you use to manage agent processes, but could be as simple as an argument passed in on a command line – “you are agent 1, the first thing you need to do is work out what the last agent 1 was doing”.

Associating an identity with task state is a little more complicated, as there are some options for where that association state could sit. It could be stored locally by the agent; every time it takes a task, it adds the tasks identifier to its local store. If another agent takes its place, it can use that stored task state to reconnect with what was running. Alternatively, it could be stored alongside the task state in whatever state store’s being used.

In each case information from one domain is leaking into another as responsibility is taken and given up for task state. The important thing is to notice that an exchange of information is taking place – when a task is taken, task data is being exchanged for the an agent identity – and that that exchange must happen in a transactional manner so that information isn’t lost.

For example – an agent could pop a task from the task state store, and then store that task information in a local sqllite database. If the agent dies between popping the task and storing the data in a local database then information is completely lost.

That’s bad, and something to be avoided.

If the agent presents its identity information to the state store as part of the task pop process however, the state store can associate that identity with the now-running task as part of the same transaction –- safely.

The latter approach is what I’ve taken with the workflows-on-Redis project I’ve been working on recently in order to enable the recovery of tasks, and ensure that none of that information is lost when taking tasks in the first place.