Víðarr Architecture
Víðarr runs workflows that are tied to external (probably LIMS) information. Each workflow goes through three phases:
- Input files are made available to the workflow (provision in).
- The workflow is run.
- The output from the workflow is copied out to a permanent store (provision out).
Every workflow definition in Víðarr has a description of the parameters of the workflow, but files are handled specially. Every workflow definition also has outputs. When it comes time to run a workflow, plugins will do the work of provisioning in files, running the workflow, and provisioning out the results. A combination of plugins is called a target.
Getting a file into a workflow is a bit of a challenge. If the workflow is running on shared disk, then the input needed is just a file path, but it could be an iRODS identifier or S3 URL or something else that requires a copy-in process. Víðarr provides identifiers for files it has produced so any file generated by one workflow can be used as input for another. Plugins are responsible for figuring out how to interpret these identifiers. Plugins can also provision in external files and the plugin defines what information it needs. Therefore, to run a workflow, the client needs to know what the parameters of the workflow are and what the provision in information for the target is.
Similarly, the provision out plugins need additional information about how to store the output file and each plugin will have different requirements.
Even running a workflow can be slightly different since the workflow engine (or its back-end) may have adjustable settings, such as priority.
A client would operate in the following way:
- Download all the targets available on a Víðarr server.
- Download all the workflows available on a Víðarr server.
- Combine workflow and targets to determine exactly what information is needed.
- Submit a workflow run request with all the information needed.
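The client steps above can be sketched as follows. This is a minimal illustration, not Víðarr's actual API: the `Workflow` and `Target` classes, their field names, and the example parameter names are all hypothetical, and the real server defines its own schema for each of these.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    # Hypothetical shape: parameter and output names mapped to type names.
    name: str
    parameters: dict[str, str]
    outputs: dict[str, str]


@dataclass
class Target:
    # Hypothetical shape: the extra information each plugin in the target needs.
    name: str
    provision_in: dict[str, str]
    provision_out: dict[str, str]
    engine: dict[str, str]


def required_fields(workflow: Workflow, target: Target) -> dict[str, dict[str, str]]:
    """Combine a workflow and a target into everything a submission must supply."""
    return {
        "arguments": dict(workflow.parameters),
        "provision_in": dict(target.provision_in),
        "provision_out": dict(target.provision_out),
        "engine": dict(target.engine),
    }


def missing_fields(request: dict, required: dict) -> list[str]:
    """List the 'section.field' names that the request has not filled in."""
    missing = []
    for section, fields in required.items():
        supplied = request.get(section, {})
        for name in fields:
            if name not in supplied:
                missing.append(f"{section}.{name}")
    return sorted(missing)


# Illustrative values only; none of these names come from a real server.
workflow = Workflow("bcl2fastq", {"run_directory": "string"}, {"fastqs": "files"})
target = Target(
    "cluster",
    provision_in={"external_file_store": "string"},
    provision_out={"output_directory": "string"},
    engine={"priority": "integer"},
)
required = required_fields(workflow, target)
request = {"arguments": {"run_directory": "/data/run1"}, "engine": {"priority": 5}}
still_needed = missing_fields(request, required)
```

Here `still_needed` names the provision-in and provision-out information the request lacks, which is the point of combining the two downloads before submitting.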
The big goal of Víðarr is to be able to associate these files with external metadata (from LIMS). Each external ID comes with a two-part identifier (a provider and an identifier) and version information. The versions will be discussed later.
When an external file is fed to a provision-in plugin, it must be manually associated with these external identifiers. After the workflow runs, the output of the workflow run will be associated with these identifiers too. Any workflow run that uses this output as input will automatically pick up the same external identifiers.
The submission request can explain how the output identifiers should be associated with the inputs. There are three ways: ALL, MANUAL, and REMAINING. An output file marked as ALL will be associated with any external identifier found in the input. Most files of most workflows will be ALL. For MANUAL files, the submission request explicitly associates output files with the appropriate external identifiers. This might be necessary for workflows that do complicated join/split operations such as co-cleaning. REMAINING associates a file with the external identifiers of the input that are not assigned manually.
This association of what output is attached to what external identifiers is called the metadata.
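As a rough sketch of how the three modes could be resolved, assuming external identifiers are represented as `(provider, identifier)` tuples and each output is declared as a `(mode, keys)` pair. The function name and the data shapes are hypothetical and chosen only for illustration.

```python
def resolve_output_keys(input_keys: set, outputs: dict) -> dict:
    """Work out which external identifiers each output file is associated with.

    outputs maps an output name to (mode, keys); keys is only used for MANUAL.
    """
    # Collect every identifier that some MANUAL output has claimed.
    manually_assigned = set()
    for mode, keys in outputs.values():
        if mode == "MANUAL":
            manually_assigned |= set(keys)

    resolved = {}
    for name, (mode, keys) in outputs.items():
        if mode == "ALL":
            resolved[name] = set(input_keys)
        elif mode == "MANUAL":
            resolved[name] = set(keys)
        elif mode == "REMAINING":
            # Whatever the input carried that was not assigned manually.
            resolved[name] = set(input_keys) - manually_assigned
        else:
            raise ValueError(f"Unknown association mode: {mode}")
    return resolved


# Illustrative identifiers only.
tumour = ("labdata", "tumour1")
normal_a = ("labdata", "normal1")
normal_b = ("labdata", "normal2")

metadata = resolve_output_keys(
    {tumour, normal_a, normal_b},
    {
        "tumour.bam": ("MANUAL", {tumour}),
        "normals.bam": ("REMAINING", None),
        "report.json": ("ALL", None),
    },
)
```

In this example `normals.bam` ends up with the two normal identifiers, because the tumour identifier was claimed manually, while `report.json` carries all three.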
External Identifier Versions
Tying analysis to LIMS presents a problem: how to cope with changes in LIMS. To get around this, Víðarr uses versions. Suppose LIMS labdata has an element sample1.
Every external key can be associated with multiple version names and multiple version values. This is designed to work with Shesmu’s signatures. The design in Víðarr is meant to work something like this:
- Shesmu ingests data from LIMS and labdata/sample1 has version L1.
- Shesmu computes a signature S1 and passes this data to Víðarr as labdata/sample1 with lims-version = L1 and shesmu-signature = S1.
- Something changes in LIMS and now labdata/sample1 has version L2.
- Shesmu ingests the new data. Suppose the data the olive is using is unchanged, so Shesmu computes S1 again.
- Shesmu will send labdata/sample1 with lims-version = L2 and shesmu-signature = S1 to Víðarr.
- Víðarr will recognise that while it doesn’t have lims-version = L2, it does have shesmu-signature = S1. It takes this as proof that L1 and L2 are equivalent and adds L2 to its collection.
This could also work the other way around: if the olive changes and the signature changes, the LIMS-provided versions will allow Víðarr to recognise the Shesmu keys as equivalent. It also allows the LIMS provider to change its version computation algorithm: if a lims-version-2nd-edition version key comes along, it can be used to prove equivalences between different versions.
If the versions in the submission request are disjoint from the ones in the database, that is, no version name has a value in common, then Víðarr will raise an error and a human must intervene to decide what to do.
Think of this as a walk in a scary forest: Víðarr is willing to put one foot into the unknown as long as it has a foot in the known. It is never willing to jump into an unknown space.
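The matching rule can be sketched as below. This is an illustration of the behaviour described here, not Víðarr's implementation: `merge_versions` and `VersionMismatch` are hypothetical names, and the stored versions are assumed to be a mapping from version name to the set of values already accepted for one external key.

```python
class VersionMismatch(Exception):
    """Raised when a request shares no version with what is already stored."""


def merge_versions(stored: dict[str, set[str]], submitted: dict[str, str]) -> dict[str, set[str]]:
    """Accept the submitted versions only if at least one is already known."""
    overlap = any(
        value in stored.get(name, set()) for name, value in submitted.items()
    )
    if not overlap:
        # Both feet would be in the unknown: a human has to decide.
        raise VersionMismatch(f"No known version among {submitted}")

    # One foot in the known: everything submitted is treated as equivalent.
    merged = {name: set(values) for name, values in stored.items()}
    for name, value in submitted.items():
        merged.setdefault(name, set()).add(value)
    return merged


# labdata/sample1 as first recorded.
stored = {"lims-version": {"L1"}, "shesmu-signature": {"S1"}}

# LIMS changed, but the signature did not: L2 is accepted as equivalent to L1.
after_lims_change = merge_versions(
    stored, {"lims-version": "L2", "shesmu-signature": "S1"}
)

# A new version name arrives alongside a known value and is added too.
after_new_scheme = merge_versions(
    after_lims_change, {"lims-version": "L2", "lims-version-2nd-edition": "E1"}
)
```

A request such as `{"lims-version": "L3", "shesmu-signature": "S2"}` against the same stored data would raise `VersionMismatch`, since neither value is known. The value `E1` is a made-up placeholder.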
Each workflow run stores its own separate copy of an external identifier. Consider the following scenario:
- Shesmu runs bcl2fastq on pinery-miso/123_1_LDI1234 with Pinery version 123abcdef and Shesmu signature 0987 on Víðarr. These keys get baked into Víðarr.
- Shesmu runs BWAmem on bcl2fastq’s output for pinery-miso/123_1_LDI1234 with version 123abcdef and Shesmu signature 7654 on Víðarr. These keys get baked into Víðarr.
- The lab goes into LIMS and updates the species of the sample and the version for pinery-miso/123_1_LDI1234 is now 4567decba.
- Shesmu runs bcl2fastq on pinery-miso/123_1_LDI1234 with Pinery version 4567decba and Shesmu signature 0987 on Víðarr. The Shesmu signature is the same because the olive does not look at the scientific name. This request is sent to Víðarr. Since one version matches, it updates the workflow run to consider 4567decba to be a valid version of the Pinery key.
- Shesmu looks at pinery-miso/123_1_LDI1234 for the updated bcl2fastq output. Because the BWAmem olive does use the scientific name to pick the right reference, it will have 4567decba for the Pinery version and 8888 for the Shesmu signature. Víðarr will see no overlap between the existing workflow run and Shesmu’s request, so it will signal a failure to Shesmu.
That means the BWAmem and bcl2fastq workflow runs need to store separate copies of any information they have about pinery-miso/123_1_LDI1234. Equivalence of external versions only makes sense in a per workflow run context.
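The scenario can be replayed with a small model in which every workflow run keeps a private copy of its versions. The `WorkflowRun` class, its `absorb` method, and the version names `pinery` and `shesmu` are hypothetical, chosen only to mirror the example.

```python
class WorkflowRun:
    def __init__(self, workflow: str, external_key: str, versions: dict[str, str]):
        self.workflow = workflow
        self.external_key = external_key
        # Each run owns its copy; nothing is shared between runs.
        self.versions = {name: {value} for name, value in versions.items()}

    def absorb(self, versions: dict[str, str]) -> bool:
        """Accept new versions if any is already known to this run; otherwise refuse."""
        known = any(
            value in self.versions.get(name, set())
            for name, value in versions.items()
        )
        if not known:
            return False
        for name, value in versions.items():
            self.versions.setdefault(name, set()).add(value)
        return True


key = "pinery-miso/123_1_LDI1234"
bcl2fastq = WorkflowRun("bcl2fastq", key, {"pinery": "123abcdef", "shesmu": "0987"})
bwamem = WorkflowRun("BWAmem", key, {"pinery": "123abcdef", "shesmu": "7654"})

# After the species is updated in LIMS, the Pinery version becomes 4567decba.
bcl2fastq_ok = bcl2fastq.absorb({"pinery": "4567decba", "shesmu": "0987"})
bwamem_ok = bwamem.absorb({"pinery": "4567decba", "shesmu": "8888"})
```

Here `bcl2fastq_ok` is true and `bwamem_ok` is false. Had both runs shared one record for the key, the BWAmem request would have matched on 4567decba (learned from the bcl2fastq request) and the change in reference would have gone unnoticed.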