Calliope Sounds: Adding persistence to the Incident Response Slack application

Adding persistence to the Incident Response Slack application is the next feature to implement. For this application, change happens at a human pace: even the busiest incident response is unlikely to see more than a few dozen changes per hour in any one channel. The application as a whole, however, might need to coordinate changes across many thousands of channels per hour. Given this situation, persistence at the channel level can be coarse, while persistence at the application level needs to be fine.

For coarse persistence with infrequent access, storing the whole model as a single chunk of data is usually sufficient. Within a channel, our model is a collection of tasks, each with a description, assignments, and a status. There might be one or two dozen tasks at any time. Assuming, on average, a short description, one user assignment, and one status, we expect 100 to 200 bytes per task and so some 1200 to 4800 bytes in total, i.e., 12 tasks * 100 bytes to 24 tasks * 200 bytes. That is too little data for read and write performance to be a concern; the storage mechanism's overhead will dominate each operation.
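As a rough illustration of the whole-model approach, here is a minimal sketch in Python that serializes a channel's tasks as one JSON chunk and reads them back. The `Task` fields mirror the model described above; the class name, the function names, and the choice of JSON are assumptions made for the example, not the application's actual code.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Task:
    # Hypothetical shape of one task: a description, assignments, and a status.
    description: str
    assignments: list[str] = field(default_factory=list)
    status: str = "open"


def dump_channel(tasks: list[Task]) -> bytes:
    """Serialize every task in a channel as a single JSON chunk."""
    return json.dumps([asdict(t) for t in tasks]).encode("utf-8")


def load_channel(blob: bytes) -> list[Task]:
    """Rebuild the channel's task list from the stored chunk."""
    return [Task(**item) for item in json.loads(blob.decode("utf-8"))]
```

Calling `len(dump_channel(tasks))` on a realistic task list is a quick way to check the size estimate against real data.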

For fine persistence with frequent access, persisting must be done at the item level. The storage mechanism must allow random access to individually addressable items. We don't need the storage mechanism to provide structure within the item. A key-value store will do.

A simple system design would have one application instance running on a host that has RAID or SAN storage. If the application crashes, the host will automatically restart it and so incur only a second of downtime. And the likelihood of losing the RAID or SAN is too low to worry about. If your level of service allows for this system design, then a useful key-value store is the humble file system. Unfortunately, this design is also the most expensive choice from cloud providers.
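To make the file-system-as-key-value-store idea concrete, here is a minimal sketch that keeps one file per key under a root directory. The class name `FileStore`, the restriction to alphanumeric keys (such as a channel identifier), and the write-then-rename step are all assumptions of the example.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


class FileStore:
    """Hypothetical key-value store: one file per key under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Assumption: keys are plain alphanumeric identifiers, so a key can
        # never escape the root directory or clash with a temporary file.
        if not key.isalnum():
            raise ValueError(f"unsupported key: {key!r}")
        return self.root / key

    def put(self, key: str, value: bytes) -> None:
        target = self._path(key)
        # Write to a temporary file first, then rename it over the target,
        # so a crash mid-write never leaves a partially written value.
        fd, tmp = tempfile.mkstemp(dir=self.root, prefix=".tmp-")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(value)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, target)
        except BaseException:
            os.unlink(tmp)
            raise

    def get(self, key: str) -> Optional[bytes]:
        # A missing key is reported as None rather than as an error.
        try:
            return self._path(key).read_bytes()
        except FileNotFoundError:
            return None
```

The value stored under each key can be the channel's whole serialized model, which matches the coarse, per-channel persistence described earlier.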

Cloud providers want to manage your compute and storage separately. This enables them to offer the highest level of service for your application and, in turn, for your customers. A consequence is that your application needs to be designed to run as multiple, interchangeable instances, to use remote storage, and to tolerate network partitioning. Unlike the one-host, one-disk platform, the cloud platforms are not going to help your application that much. The problems associated with distributed application design (CAP, CQRS, consensus, etc.) are still largely the application's to solve.

Incident Response is, fortunately, too simple a tool to warrant sophisticated tooling [1]. If two users update the same task at the same time, then one of them will win. We will attempt to tell the user about the collision, but the limits of eventual consistency may preclude that. Every cloud platform has a managed key-value store (with eventual consistency) and managed web applications. Since I know AWS, I plan on using SimpleDB and Elastic Beanstalk for the next implementation.
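The "one of them will win, but we try to tell the user" behavior could be sketched as follows. This is an in-memory stand-in written for illustration, not SimpleDB or any other cloud API; the class name, the method names, and the per-item version counter are all assumptions. A write always succeeds, and it reports whether some other write landed after the caller last read the item.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stored:
    value: str
    version: int  # Hypothetical counter, bumped on every write.


class LastWriterWinsStore:
    """In-memory stand-in for a key-value store; not a real cloud API."""

    def __init__(self) -> None:
        self._items: dict[str, Stored] = {}

    def read(self, key: str) -> Optional[Stored]:
        return self._items.get(key)

    def write(self, key: str, value: str, seen_version: int) -> bool:
        """Store value unconditionally (the last writer wins).

        Returns True when the item changed since the caller read it at
        seen_version, so the caller can warn the user about the collision.
        """
        current = self._items.get(key)
        current_version = current.version if current else 0
        collided = current_version != seen_version
        self._items[key] = Stored(value, current_version + 1)
        return collided
```

Against an eventually consistent store, the version that `write` compares with may itself be stale, so this kind of detection is best effort, which is the limit noted above.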

[1] I really want to explore Apache Geode!