Shesmu has a lot of features that can be stacked to do complicated things compactly. These are examples not necessarily meant to templates, but general ideas with a dissection of why.
The goal here is to select the “best” output from a workflow. Older versions of the workflow produced one file in an unhelpful format. The new one produces two files: a useful format and the historic unhelpful one. The goal is to pick the best format for newest run of this workflow.
Input cerberus_fp;
Olive
Where workflow == "The Workflow Of Interest"
&& (metatype == "application/json"
|| metatype == "application/zip-report-bundle")
Group
By ius, workflow_accession
Into
outputs = List {accession, path, metatype},
timestamp = Max timestamp
Pick Max timestamp By ius
Let
ius = ius,
{accession, path, metatype} = OnlyIf
For {accession, path, metatype} In outputs:
Sort (If metatype == "the good format" Then 0 Else 1)
First {accession, path, metatype}
Run whatever With
{run, lane, barcode} = ius,
accession = accession,
path = path,
metatype = metatype;
Here is the clause-by-clause breakdown. First, we select only the workflows and file types of interest:
Where workflow == "The Workflow Of Interest"
&& (metatype == "application/json"
|| metatype == "application/zip-report-bundle")
Then, we group based on each IUS (library that was sequenced) and workflow run:
Group
By ius, workflow_accession
And we collect two things from each of these per-workflow-run groups: the
newest file time-stamp and a collection of all the file SWID, path, and file
metatype tuples. This outputs
list will contain all the files produced by the
workflow; presumably either one (the unhelpful format) or two (the unhelpful
format + the useful one):
Into
outputs = List {accession, path, metatype},
timestamp = Max timestamp
Now, select the latest edition of this workflow run:
Pick Max timestamp By ius
This is the most complex step which select the best file and throws away the other, if it exists.
Let
ius = ius,
{accession, path, metatype} = OnlyIf
For {accession, path, metatype} In outputs:
Sort (If metatype == "the good format" Then 0 Else 1)
First {accession, path, metatype}
Let’s break this down further. The first part simply reshapes the data preserving the IUS:
Let
ius = ius,
We will continue on slightly out of order:
For {accession, path, metatype} In outputs:
This iterates over all the output we collected. This uses destructuring to get the contents of the tuple at the time of iteration. We need to find the “best” file. The plan is to sort the tuples from best to worst and pick the first one:
For {accession, path, metatype} In outputs:
Sort (If metatype == "the good format" Then 0 Else 1)
First {accession, path, metatype}
This uses the numeric sort operation to place the better format before the
worse one (if both exist) and First
to select whatever item has the best sort
criterion. First
will emit another tuple. It’s in exactly the same format as
the original tuple, so this could be written as:
For output In outputs:
Sort (If output[2] == "the good format" Then 0 Else 1)
First output
with no change in meaning. First
cannot guarantee that there are any items in
the list at all, so it will return an optional of the tuple. Since we only want
rows where there is one file, we use OnlyIf
:
{accession, path, metatype} = OnlyIf
For {accession, path, metatype} In outputs:
Sort (If metatype == "the good format" Then 0 Else 1)
First {accession, path, metatype}
This will discard any rows with no files matching our criteria. At the same time, it does a destructuring assignment, unpacking the contents of the tuple in the variables.
Finally, we run the intended action:
Run whatever With
{run, lane, barcode} = ius,
accession = accession,
path = path,
metatype = metatype;
This also makes use of destructuring assignment for the IUS. Again, with no practical different, the ending could have been written:
Let
ius = ius,
output = OnlyIf
For output In outputs:
Sort (If output[2] == "the good format" Then 0 Else 1)
First output
Run whatever With
run = ius[0],
lane = ius[1],
barcode = ius[2],
accession = output[0],
path = output[1],
metatype = output[2];
or:
Let
ius = ius,
output = OnlyIf
For output In outputs:
Sort (If output[2] == "the good format" Then 0 Else 1)
First output
Run whatever With
{run, lane, barcode} = ius,
{accession, path, metatype} = output;
Shesmu has three major ways to control program flow: If
, Switch
and For
.
For
is useful for collections and optionals. If
and Switch
are both
useful for decision making.
Since Switch
works on tuples, it is possible to create decision-making flow that looks more like “matching” the desired data. This is a clause from the MAVIS olive:
Where
project In miso_active_projects
&& library_design In ["WG", "WT", "MR"]
&& tissue_type != "R"
&& (Switch {workflow, metatype}
When {"BamMergePreprocessing", "application/bam"} Then True
When {"BamMergePreprocessing", "application/bam-index"} Then True
When {"Manta", "text/vcf"} Then True
When {"StructuralVariation", "text/vcf"} Then True
When {"StarFusion", "text/plain"} Then "{path}" ~ /\.abridged\.tsv$/
Else False)
This initial selection on project
, library_design
, and tissue_type
is
pretty standard. The Switch
is not. Let’s start with an equivalent logical
expression and refactor it into the above:
Where
project In miso_active_projects
&& library_design In ["WG", "WT", "MR"]
&& tissue_type != "R"
&& (
workflow == "BamMergePreprocessing" && (metatype == "application/bam"
|| "application/bam-index") ||
metatype == "text/vcf" && (workflow == "Manta" ||
workflow == "StructuralVariation") ||
worfklow == "StarFusion" && metatype == "text/plain"
&& "{path}" ~ /\.abridged\.tsv$/)
This check is very ragged: in the case of BamMergePreprocessing
, we care
about different metatypes while for the metatype text/vcf
, we care about
different workflows. Let’s distribute those conditions. This will mean
redundant checks, but it will be clearer what the combinations are:
Where
project In miso_active_projects
&& library_design In ["WG", "WT", "MR"]
&& tissue_type != "R"
&& (
workflow == "BamMergePreprocessing" && metatype == "application/bam" ||
workflow == "BamMergePreprocessing" && metatype == "application/bam-index" ||
workflow == "Manta" && metatype == "text/vcf" ||
workflow == "StructuralVariation" && metatype == "text/vcf" ||
worfklow == "StarFusion" && metatype == "text/plain"
&& "{path}" ~ /\.abridged\.tsv$/)
Now, all of our conditions have a check for workflow
and metatype
where the
last one has an extra check. Rather than compare two things, lets put them into
tuples and compare the tuples:
Where
project In miso_active_projects
&& library_design In ["WG", "WT", "MR"]
&& tissue_type != "R"
&& (
{workflow, metatype} == {"BamMergePreprocessing", "application/bam"} ||
{workflow, metatype} == {"BamMergePreprocessing", "application/bam-index"} ||
{workflow, metatype} == {"Manta", "text/vcf"} ||
{workflow, metatype} == {"StructuralVariation", "text/vcf"} ||
{worfklow, metatype} == {"StarFusion", "text/plain"}
&& "{path}" ~ /\.abridged\.tsv$/)
Since {workflow, metatype}
is now in every condition, we can factor it out into the Switch
:
Where
project In miso_active_projects
&& library_design In ["WG", "WT", "MR"]
&& tissue_type != "R"
&& (Switch {workflow, metatype}
When {"BamMergePreprocessing", "application/bam"} Then True
When {"BamMergePreprocessing", "application/bam-index"} Then True
When {"Manta", "text/vcf"} Then True
When {"StructuralVariation", "text/vcf"} Then True
When {"StarFusion", "text/plain"} Then "{path}" ~ /\.abridged\.tsv$/
Else False)
For most of the conditions, everything is taken care of by the Switch
to the
result is simply True
. For the StarFusion
case, the extra conditions
becomes the Then
.
Here is another case of this from bcl2barcode
. It manipulates the bases mask
to replace the first Y
n with Y1N*
, and any subsequent Y
with N
while
leaving the first two I
untouched.
Function modify_bases_mask(masks_type bases_mask_parts)
pack_bases_mask(
For mask In bases_mask_parts:
Flatten (new_mask In
Switch {mask.type, mask.ordinal}
When {"Y", 0} Then
[ { type = "Y", position = 0, group = mask.group, length = 1 },
{ type = "N", position = 1, group = mask.group, length = -1 } ]
When {"I", 0} Then
[ convert_mask(mask) ]
When {"I", 1} Then
[ convert_mask(mask) ]
When {"Y", 1} Then
[ { type = "N", position = 0, group = mask.group, length = -1 } ]
Else
# unsupported bases_mask!
[ convert_mask(mask) ]
)
List new_mask);
The bases_mask_parts
is a list of objects already parsed by another function.
pack_bases_mask
creates a new bases mask from the objects provided. Since the
parser provides extra information the packer doesn’t need convert_mask
, drops
the superfluous data.
This code takes each mask and does a Flatten
on it. The mapping between old
input masks is usually 1:1, but in the case where Y
n maps to Y1N*
, it is
1:2, so the Flatten
will allow collecting the two. Again, this uses a
Switch
on a tuple. The tuple is the mask type (Y
, I
, N
) and the ordinal
(the number of this type) appears in the sequence. So, {"Y", 0}
is the first
read, {"I", 0}
is the first index, {"I", 1}
is the second index, and {"Y",
1}
is the second read. This assumes a relatively normal read structure for the
sequencer. If other reads or indices were common, it might make sense to have
the Else
convert them to an N*
sequence as the {"Y", 1}
case does.