vidarr

Analysis provenance tracking server

Víðarr Architecture

Víðarr runs workflows that are tied to external (probably LIMS) information. Each workflow goes through three phases:

  1. Input files are made available to the workflow (provision in).
  2. The workflow is run.
  3. The output from the workflow is copied out to a permanent store (provision out).

Every workflow definition in Víðarr has a description of the parameters of the workflow, but files are handled specially. Every workflow definition also has outputs. When it comes time to run a workflow, plugins will do the work of provisioning in files, running the workflow, and provisioning out the results. A combination of plugins is called a target.

Getting a file into a workflow is a bit of a challenge. If the workflow is running on shared disk, then the only input needed is a file path, but the input could instead be an iRODS identifier, an S3 URL, or something else that requires a copy-in process. Víðarr provides identifiers for files it has produced, so any file generated by one workflow can be used as input for another. Plugins are responsible for figuring out how to interpret these identifiers. Plugins can also provision in external files, and each plugin defines what information it needs to do so. Therefore, to run a workflow, the client needs to know both the parameters of the workflow and the provision-in information that the target requires.

Similarly, the provision-out plugins need additional information about how to store the output files, and each plugin will have different requirements.

Even running a workflow can be slightly different since the workflow engine (or its back-end) may have adjustable settings, such as priority.

A client would operate in the following way:

  1. Download all the targets available on a Víðarr server.
  2. Download all the workflows available on a Víðarr server.
  3. Combine workflow and targets to determine exactly what information is needed.
  4. Submit a workflow run request with all the information needed.
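
Step 3 above can be sketched as follows. This is only an illustration of the idea: the dictionary shapes, the field names (`parameters`, `file_inputs`, `outputs`, `provision_in`, `provision_out`, `engine`), and the function `required_information` are all hypothetical and are not Víðarr's actual data format or API.

```python
# Hypothetical shapes: these dictionaries are illustrative only,
# not the format a Víðarr server actually returns.
def required_information(workflow: dict, target: dict) -> dict:
    """Combine one workflow definition with one target's plugin requirements."""
    return {
        # Ordinary parameters come straight from the workflow definition.
        "arguments": dict(workflow["parameters"]),
        # Each file input needs whatever the provision-in plugin asks for.
        "provision_in": {
            name: dict(target["provision_in"]) for name in workflow["file_inputs"]
        },
        # Each output needs whatever the provision-out plugin asks for.
        "provision_out": {
            name: dict(target["provision_out"]) for name in workflow["outputs"]
        },
        # Engine settings (such as priority) depend on the target alone.
        "engine": dict(target["engine"]),
    }


workflow = {
    "parameters": {"threads": "integer"},
    "file_inputs": ["fastq"],
    "outputs": ["bam"],
}
target = {
    "provision_in": {"path": "string"},
    "provision_out": {"output_directory": "string"},
    "engine": {"priority": "integer"},
}
needed = required_information(workflow, target)
```

The point of the sketch is that neither the workflow nor the target alone describes a complete submission: the workflow says which inputs and outputs exist, and the target says what must be supplied for each of them.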

The big goal of Víðarr is to associate the files it produces with external metadata (from LIMS). Each external ID comes with a two-part identifier (a provider and an identifier) and version information. Versions are discussed below.

When an external file is fed to a provision-in plugin, it must be manually associated with these external identifiers. After the workflow runs, the output of the workflow run will be associated with these identifiers too. Any workflow run that uses this output as input will automatically pick up the same external identifiers.

The submission request can explain how the output identifiers should be associated with the inputs. There are three ways: ALL, MANUAL, and REMAINING. An output file marked as ALL will be associated with any external identifier found in the input. Most files of most workflows will be ALL. For MANUAL files, the submission request explicitly associates output files with the appropriate external identifiers. This might be necessary for workflows that do complicated join/split operations such as co-cleaning. REMAINING associates a file with the external identifiers of the input that are not assigned manually.

This association of what output is attached to what external identifiers is called the metadata.
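
The three association modes can be sketched as below. This is a minimal illustration under assumptions: the function name `resolve_associations`, the tuple encoding of each output, and the example file names are hypothetical, and the sketch is not Víðarr's implementation.

```python
def resolve_associations(input_ids: set, outputs: dict) -> dict:
    """Map each output file to a set of external identifiers.

    input_ids: set of (provider, identifier) pairs found in the inputs.
    outputs: output name -> (mode, ids), where mode is "ALL", "MANUAL",
             or "REMAINING" and ids is only used for "MANUAL".
    """
    # Collect everything that was assigned by hand, across all outputs.
    manually_assigned = set()
    for mode, ids in outputs.values():
        if mode == "MANUAL":
            manually_assigned |= set(ids)

    resolved = {}
    for name, (mode, ids) in outputs.items():
        if mode == "ALL":
            resolved[name] = set(input_ids)
        elif mode == "MANUAL":
            resolved[name] = set(ids)
        elif mode == "REMAINING":
            # Whatever input identifiers no MANUAL output claimed.
            resolved[name] = set(input_ids) - manually_assigned
        else:
            raise ValueError(f"unknown association mode: {mode}")
    return resolved


inputs = {("labdata", "tumour1"), ("labdata", "normal1")}
result = resolve_associations(
    inputs,
    {
        "report.json": ("ALL", None),
        "tumour.bam": ("MANUAL", [("labdata", "tumour1")]),
        "other.bam": ("REMAINING", None),
    },
)
```

Here `report.json` is tied to both identifiers, `tumour.bam` only to the one named in the request, and `other.bam` to the identifier left over.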

External Identifier Versions

Tying analysis to LIMS presents a problem: how to cope with changes in LIMS. To get around this, Víðarr uses versions. Suppose LIMS labdata has an element sample1.

Every external key can be associated with multiple version names and multiple version values. This is designed to work with Shesmu’s signatures. The design in Víðarr is meant to work something like this:

  1. Shesmu ingests data from LIMS and labdata/sample1 has version L1.
  2. Shesmu computes a signature S1 and passes this data to Víðarr as labdata/sample1 with lims-version = L1 and shesmu-signature = S1.
  3. Something changes in LIMS and now labdata/sample1 has version L2.
  4. Shesmu ingests the new data. Suppose the data the olive is using is unchanged, so Shesmu computes S1 again.
  5. Shesmu will send labdata/sample1 with lims-version = L2 and shesmu-signature = S1 to Víðarr.
  6. Víðarr will recognise that while it doesn’t have lims-version = L2, it does have shesmu-signature = S1. It takes this as proof that L1 and L2 are equivalent and adds L2 to its collection.

This could also work the other way around: if the olive changes and the signature changes, the LIMS-provided versions will allow Víðarr to recognise the Shesmu keys as equivalent. It also allows the LIMS provider to change its version computation algorithm: if a lims-version-2nd-edition version key comes along, it can be used to prove equivalences between different versions.

If the versions in the submission request are disjoint from the ones in the database, then Víðarr will raise an error and a human must intervene to decide what to do.
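
The overlap rule can be sketched as follows. This is a simplified model, assuming the known versions for one external key are held as a mapping from version name to a set of values; `merge_versions` and `DisjointVersionsError` are hypothetical names, not part of Víðarr.

```python
class DisjointVersionsError(Exception):
    """Raised when a request shares no version with what is already known."""


def merge_versions(known: dict, incoming: dict) -> dict:
    """Return the known versions extended with the incoming ones.

    known: version name -> set of accepted values (assumed non-empty).
    incoming: version name -> the single value sent in the request.
    """
    # One shared (name, value) pair is enough to prove equivalence.
    overlap = any(
        value in known.get(name, set()) for name, value in incoming.items()
    )
    if not overlap:
        raise DisjointVersionsError("no version in common; a human must decide")

    merged = {name: set(values) for name, values in known.items()}
    for name, value in incoming.items():
        merged.setdefault(name, set()).add(value)
    return merged


known = {"lims-version": {"L1"}, "shesmu-signature": {"S1"}}
# LIMS changed (L2) but the signature is unchanged (S1), so L2 is accepted.
updated = merge_versions(known, {"lims-version": "L2", "shesmu-signature": "S1"})
```

A request carrying only unseen values, such as `lims-version = L3` with `shesmu-signature = S2`, would raise `DisjointVersionsError` in this sketch.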

Think of this as a walk in a scary forest: Víðarr is willing to put one foot into the unknown as long as it has a foot in the known. It is never willing to jump into an unknown space.

Each workflow run stores its own separate copy of an external identifier. Consider the following scenario:

  1. Shesmu runs bcl2fastq on pinery-miso/123_1_LDI1234 with Pinery version 123abcdef and Shesmu signature 0987 on Víðarr. These keys get baked into Víðarr.
  2. Shesmu runs BWAmem on bcl2fastq’s output for pinery-miso/123_1_LDI1234 with version 123abcdef and Shesmu signature 7654 on Víðarr. These keys get baked into Víðarr.
  3. The lab goes into LIMS and updates the species of the sample and the version for pinery-miso/123_1_LDI1234 is now 4567decba.
  4. Shesmu runs bcl2fastq on pinery-miso/123_1_LDI1234 with Pinery version 4567decba and Shesmu signature 0987 on Víðarr. The Shesmu signature is the same because the olive does not look at the scientific name. This request is sent to Víðarr. Since one version matches, Víðarr updates the existing bcl2fastq workflow run to consider 4567decba a valid version of the Pinery key.
  5. Shesmu looks at pinery-miso/123_1_LDI1234 for the updated bcl2fastq output. Because the BWAmem olive does use the scientific name to pick the right reference, its request will have 4567decba for the Pinery version and 8888 for the Shesmu signature. Víðarr will see no overlap between the existing workflow run and Shesmu’s request, so it will signal a failure to Shesmu.

That means the BWAmem and bcl2fastq workflow runs need to store separate copies of any information they have about pinery-miso/123_1_LDI1234. Equivalence of external versions only makes sense in the context of a single workflow run.
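
The scenario can be replayed in a small model where each workflow run keeps its own copy of the versions. The `WorkflowRun` class, its `resubmit` method, and the version key names `pinery` and `shesmu` are hypothetical and chosen only for illustration.

```python
class WorkflowRun:
    """Hypothetical record: one workflow run with a private copy of versions."""

    def __init__(self, name: str, external_id: tuple, versions: dict):
        self.name = name
        self.external_id = external_id
        # A private copy; it is never shared with other workflow runs.
        self.versions = {key: {value} for key, value in versions.items()}

    def resubmit(self, versions: dict) -> bool:
        """Absorb the new versions if any of them is already known."""
        if not any(
            value in self.versions.get(key, set())
            for key, value in versions.items()
        ):
            return False  # No overlap: signal a failure to the caller.
        for key, value in versions.items():
            self.versions.setdefault(key, set()).add(value)
        return True


sample = ("pinery-miso", "123_1_LDI1234")
bcl2fastq = WorkflowRun("bcl2fastq", sample, {"pinery": "123abcdef", "shesmu": "0987"})
bwamem = WorkflowRun("BWAmem", sample, {"pinery": "123abcdef", "shesmu": "7654"})

# After the LIMS change, the bcl2fastq request still shares its signature.
bcl2fastq_ok = bcl2fastq.resubmit({"pinery": "4567decba", "shesmu": "0987"})
# The BWAmem request shares nothing with the BWAmem run's own copy.
bwamem_ok = bwamem.resubmit({"pinery": "4567decba", "shesmu": "8888"})
```

In this model the bcl2fastq run accepts 4567decba, while the BWAmem run still knows only 123abcdef and rejects the request, because what one run has learned is not visible to the other.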