shesmu

Is Shesmu Right for Me? Ask your Doctor

This document is meant to help you decide if Shesmu would be a good fit for your organisation and what it would take to get a Shesmu instance running.

Background: Our Before Times

To explain the problem we were trying to solve: OICR GSI runs a genomics data processing pipeline. We collect data off of DNA sequencing machines, along with metadata describing what was sequenced (and how it was prepared), and run analysis batch jobs. These jobs write their output in two places: a data store for the data itself and a metadata store describing the provenance of that data. The analysis of one job feeds into the analysis of other jobs. The metadata tracks the format of the data, the program that generated it, and the input data used by that program.

Initially, we had deciders which would ingest the entire metadata store, try to figure out what analysis should be done but had not yet been done, and then launch the batch analysis jobs. We would launch these deciders via cron.

This had a few problems:

Shesmu attempts to address these problems in several ways:

What does Shesmu do?

Shesmu operates in three steps:

A key design decision in Shesmu is that actions are stateless. Shesmu has no history of what it’s done. Every time Shesmu restarts, it reprocesses all its input data and generates a set of actions. Actions must determine for themselves if they have been previously run.

Although an action was "run a workflow" in our original conception, it has expanded beyond that. We have actions that include "open a JIRA ticket". If this action is rerun, it doesn’t always open a new ticket; it first checks whether JIRA already has an open ticket that matches certain criteria. Similarly, "run a workflow" doesn’t necessarily run a workflow; it first checks the metadata store to see whether a workflow with matching parameters has already been run.
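The "check before acting" behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Shesmu's plugin API or the JIRA API: `FakeTicketTracker`, `open_ticket_action`, the project name, and the ticket summary are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTicketTracker:
    """In-memory stand-in for an issue tracker (hypothetical; not the JIRA API)."""

    tickets: list = field(default_factory=list)

    def find_open(self, project, summary):
        # Look for an open ticket matching the action's criteria.
        for ticket in self.tickets:
            if ticket["open"] and ticket["project"] == project and ticket["summary"] == summary:
                return ticket
        return None

    def create(self, project, summary):
        ticket = {"id": len(self.tickets) + 1, "project": project, "summary": summary, "open": True}
        self.tickets.append(ticket)
        return ticket


def open_ticket_action(tracker, project, summary):
    """Ensure a matching open ticket exists; create one only if none is found."""
    existing = tracker.find_open(project, summary)
    if existing is not None:
        return existing["id"], "already open"
    return tracker.create(project, summary)["id"], "created"


tracker = FakeTicketTracker()
first = open_ticket_action(tracker, "GSI", "Run 1234 failed QC")
# Simulate a restart: the same action is regenerated and performed again.
second = open_ticket_action(tracker, "GSI", "Run 1234 failed QC")
```

Because the action asks the external system about its own state, performing it a second time finds the existing ticket rather than creating a duplicate, which is what lets the server itself keep no history.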

The mental model I use for an olive is that it takes a table of input data and reshapes the data until it fits the parameters for an action.
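To make that mental model concrete, here is a rough Python analogy of the reshaping: filter the rows, group them, and emit one set of action parameters per group. This is not olive syntax, and the column names, file paths, and the `align` workflow name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical input rows; real input formats are supplied by plugins.
rows = [
    {"project": "A", "sample": "s1", "file_type": "fastq", "path": "/data/s1_R1.fastq.gz"},
    {"project": "A", "sample": "s1", "file_type": "fastq", "path": "/data/s1_R2.fastq.gz"},
    {"project": "A", "sample": "s2", "file_type": "bam", "path": "/data/s2.bam"},
    {"project": "B", "sample": "s3", "file_type": "fastq", "path": "/data/s3_R1.fastq.gz"},
]


def reshape(rows):
    """Turn a table of input rows into a list of action parameter sets."""
    # Step 1: keep only the rows this analysis cares about.
    fastq_rows = [row for row in rows if row["file_type"] == "fastq"]

    # Step 2: group the remaining rows by the unit of work (one sample).
    groups = defaultdict(list)
    for row in fastq_rows:
        groups[(row["project"], row["sample"])].append(row["path"])

    # Step 3: emit one parameter set per group, shaped the way the action expects.
    return [
        {"workflow": "align", "project": project, "sample": sample, "inputs": sorted(paths)}
        for (project, sample), paths in sorted(groups.items())
    ]


actions = reshape(rows)
```

Here four input rows collapse into two parameter sets, one for sample `s1` (with both of its files) and one for sample `s3`; the `bam` row is filtered out.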

How do I deploy it?

Setting up a Shesmu instance can take a few minutes or a few months depending on what is involved. Shesmu reads all of its configuration from a directory containing configuration files that activate plugins.

The configuration of any plugin varies depending on its complexity. There is a plugin that makes a list of strings from lines in a file; that’s an easy one to configure.
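To give a sense of how little such a plugin has to do, the following Python sketch shows the underlying idea of turning a file into a list of strings. It is not the plugin's actual implementation (plugins are written in Java); the function name, the file name `projects.txt`, and the choice to skip blank lines are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def load_string_list(path):
    """Return the non-blank lines of a file, stripped, in file order."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Assumption: blank lines are ignored and surrounding whitespace is dropped.
    return [line.strip() for line in lines if line.strip()]


# Demonstrate with a throwaway file standing in for a configuration file.
with tempfile.TemporaryDirectory() as directory:
    config_file = Path(directory) / "projects.txt"
    config_file.write_text("alpha\n\nbeta\n", encoding="utf-8")
    projects = load_string_list(config_file)
```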

Realistically, there may not be plugins that interface with your systems, so writing them will be necessary. The plugin implementation guide explains how to write plugins in Java. Once a plugin JAR is built, deploying it involves installing the JAR in the class path and creating appropriate configuration files in the Shesmu configuration directory.

Some of the simpler plugins have been designed, built, tested, and deployed in an hour.

Shesmu’s security model is that the REST API is largely read-only and configuration on disk determines most of its behaviour. It has one REST endpoint that allows erasing actions, but since actions are continually regenerated, this is a fairly minor risk. Securing the disk is the responsibility of the administrator deploying it.

How do I talk to it?

Shesmu provides three main interfaces:

The web interface is a wrapper around the REST interface, so all functionality provided by the user interface is available via the REST interface.

Because Shesmu is very plugin-driven, some of the data that comes back via the REST interface is different depending on the plugins that are active.

Planning Your Deploy

To perform a deploy, we recommend the following steps:

  1. Have a look through the plugins list and see if any of the plugins seem useful.
  2. Determine what input data you will need.
  3. Develop an input format for this data. See the implementation guide.
  4. Deploy a test instance and get comfortable with writing an olive using the simulation dashboard on your test instance (Tool > Olive Simulator).
  5. Determine what information you need for an action. In particular:
    • what information is required (is it uniform? each JIRA action takes identical parameters but each Vidarr workflow is a snowflake)
    • what are the criteria that make actions unique
    • how to determine if an action is already completed
    • how to launch an action and check its progress
    • what additional information you want to report through the REST/web UI
  6. Write and test your action plugin.
  7. Start writing and testing olives.
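
The questions in step 5 can be explored on paper before writing any Java. The Python sketch below models one hypothetical action: its fields are the uniqueness criteria, and `perform` checks for completion before launching. The class name, state names, and the `completed` and `running` sets (standing in for an external system such as a workflow engine) are assumptions for illustration, not Shesmu's plugin interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowAction:
    """Hypothetical action; its fields are the criteria that make it unique."""

    workflow: str
    inputs: tuple

    def perform(self, completed, running):
        # Has equivalent work already finished in the external system?
        if self in completed:
            return "SUCCEEDED"
        # Is it already in progress? If so, only report on it.
        if self in running:
            return "IN_PROGRESS"
        # Otherwise, launch it.
        running.add(self)
        return "LAUNCHED"


completed, running = set(), set()
generated = [
    WorkflowAction("align", ("s1_R1.fastq.gz", "s1_R2.fastq.gz")),
    WorkflowAction("align", ("s1_R1.fastq.gz", "s1_R2.fastq.gz")),  # duplicate request
    WorkflowAction("align", ("s3_R1.fastq.gz",)),
]

# Equal actions collapse into one because equality is defined by the fields.
unique_actions = set(generated)
first_pass = sorted(action.perform(completed, running) for action in unique_actions)
second_pass = sorted(action.perform(completed, running) for action in unique_actions)
```

Deciding which fields take part in equality is the important design choice here: too few and distinct work is merged, too many and the same work is launched twice.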

As general tips: