
vidarr

Analysis provenance tracking server

Víðarr Identifiers

Rather than using incrementing identifiers, Víðarr identifies workflow versions, workflow runs, and output files using SHA-256 hashes of key metadata. The assumption is that if two objects have the same hash, they must be equivalent. All workflow run matching is done by hash matching.

Every hash is the SHA-256 digest of the data described below. All strings are converted to UTF-8 encoded bytes before hashing, and strings are not permitted to contain the NUL (zero) byte. Some hashes include the IDs of other hashed objects. The resulting hashes are encoded as ASCII strings in lowercase hexadecimal. All JSON objects are serialized with their keys in alphabetical order.
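The conventions above can be sketched in Python. This is a minimal illustration, not Víðarr's implementation: the helper names `sha256_hex` and `canonical_json` are hypothetical, and the compact whitespace and default escaping used for JSON are assumptions, since the text only specifies the key order.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Hash one string following the stated conventions.

    The string is rejected if it contains NUL, encoded as UTF-8, and the
    digest is returned as lowercase hexadecimal ASCII.
    """
    if "\x00" in text:
        raise ValueError("strings must not contain the NUL byte")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical_json(value) -> str:
    """Serialize JSON with object keys in sorted order.

    Assumption: compact separators; the source only requires that keys
    appear in alphabetical order.
    """
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


# Two dictionaries with the same content but different insertion order
# serialize identically, so they hash identically.
a = canonical_json({"b": 1, "a": 2})
b = canonical_json({"a": 2, "b": 1})
print(a == b, sha256_hex(a) == sha256_hex(b))
```

Sorting keys matters because JSON objects are unordered: without a fixed order, two equivalent parameter sets could produce different bytes and therefore different hashes.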

Workflow Versions

A workflow version hash is present for each version of a workflow installed. Even if the same WDL file is installed under two different names, there will be two different workflow version hashes. It is computed as follows:

Workflow Runs

Each workflow run has a hash computed from data that is considered to uniquely identify it, but this data does not include all the information in a workflow run. That is, different workflow runs can intentionally produce the same hash.

Output Analysis: Files

The files provisioned out are given the ID:

Note that if the output provisioning workflow renames files, the hash reflects the renamed file.

Output Analysis: URLs

The URLs provisioned out are given the ID:

Nulls in Hashes

The NUL bytes in the hashed data are a kind of insurance against malicious names. If the hash were computed over just the name followed by the version, then foobar + 1.0.0 would be indistinguishable from foo + bar1.0.0. It is unlikely that anyone would construct such a name, but the NUL bytes are an easy way to prevent anyone from trying.
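The ambiguity can be reproduced with a short Python sketch. The functions `naive_id` and `separated_id` are hypothetical and exist only for illustration; in particular, placing one NUL byte after each field is an assumption about the layout, not a statement of the exact byte sequence Víðarr hashes.

```python
import hashlib


def naive_id(name: str, version: str) -> str:
    """Hash the name immediately followed by the version (ambiguous)."""
    return hashlib.sha256((name + version).encode("utf-8")).hexdigest()


def separated_id(name: str, version: str) -> str:
    """Hash the name and version with a NUL byte after each field.

    Because the fields themselves may not contain NUL, the boundary
    between them cannot be moved by choosing a clever name.
    """
    digest = hashlib.sha256()
    for field in (name, version):
        if "\x00" in field:
            raise ValueError("strings must not contain the NUL byte")
        digest.update(field.encode("utf-8"))
        digest.update(b"\x00")  # assumed field terminator
    return digest.hexdigest()


# Without a separator, the two (name, version) pairs hash the same bytes.
print(naive_id("foobar", "1.0.0") == naive_id("foo", "bar1.0.0"))

# With NUL terminators, the byte sequences differ, so the hashes differ.
print(separated_id("foobar", "1.0.0") == separated_id("foo", "bar1.0.0"))
```

The guarantee depends on both rules together: NUL is used as the delimiter, and NUL is forbidden inside the strings being delimited.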