Approaches to data structures in Redis

There are 3 broad types of data structure supporting redis.workflow. I’m considering how it might compare to a Hazelcast solution, so it made sense to revisit them.

I have a couple of global counters, used to tick over unique internal ids; one for workflows and one for tasks. These are just stored as simple key-value pairs.

Each workflow and each task gets its own hash. The hash for a given workflow or task can be identified by convention, so I don’t need to search the key or value space for a specific item. For workflows the convention is to key the hash as “workflow:<id>”, and similarly for tasks as “task:<id>”. These hashes then map various properties to values specific to the workflow or task, covering both entity-specific data, like a task payload, and internal information, like the time of transition into a specific internal state. In principle this information should be immutable; in practice I do overwrite, for example, time of transition into the running state, as a task might be caused to re-enter that state multiple times. It could of course retain a log of each value set.

Some internal operations are then supported by independent indexes or lookups. For example, it’s repeatedly necessary to work out which tasks are of a specific type. Rather than cycle through every “task:<id>” key looking for its “Type” property, it makes more sense to have an index or lookup keyed as “type:<typename>” that contains the set of tasks with that Type.

An interesting possibility arises when you notice that “Type” is just another property, and that people generally might be interested in querying on any given property. That implies that every property should be backed by its own index or lookup. Generalising further, its possible to imagine a system in which entities might have some or all properties set freely, and indexes dynamically generated for them in Redis as needed, to support ad-hoc queries for specific data. I don’t do that here, but its something that might be worth demonstrating in a sandbox.

Transient workflow state is again some reflection of a task property. A given task might be in the submitted state, and therefore it makes sense to have a “submitted” lookup. In this case however I don’t want to support just simple queries. I want to suport a query for the next highest priority task. I also don’t want to have agents constantly polling Redis in an attempt to find work.

To manage those requirements I have the submitted set implemented as a sorted set, with priority being the metric on which the set is sorted. By maintaining a list of known priorities dynamically, its then easy to zrange out the highest priority task from the set.

To avoid polling by agents, the “submitted” queue (sorted set) is complemented with a “submitted” channel. Contentless messages are published on this channel to signal that one or more tasks have been added to the submitting queue; agents listen to the channel and respond by querying for the next priority task until there are no more tasks left, at which point the go back to listening. This mechanism is actually that proposed by Marc Gravell, as used in the StackOverflow Redis implementation.

The “submitted” query mechanism has a more sophisticated mechanism as well, with sorted sets of tasks of specific Type, with associated Type-related channels, meaning that agents can specify that they’re interested in tasks of a specific Type or Types.

The alerting/non-polling mechanism could be applied to all Task property lookups, of course.

So, to wrap up, there are 3 broad data structures in my redis.workflow solution. The first are simple globals that record unique internal ids. The second are multi-property entities (like tasks) with associated property lookups. The third are more sophisticated property lookups less associated with tasks than with the workflow system itself, which are themselves complemented by a dedicated pub/sub channel to limit polling by agents.