Shesmu can become a repository of data, but it can still be necessary to access
that data in an exploratory way. Shesmu provides a companion command line tool
for extracting and filtering data using an AWK-like syntax: shawk. The tool
uses an HTTP endpoint, /extract, so it is also possible to use curl or wget to
access the same data.
The easiest way to get the tool is to download a pre-built version from GitHub Releases.
The ShAWK tool doesn’t change much, so there’s no need to upgrade it with every Shesmu release.
If downloading a release sounds too easy, you can build the command line tool yourself by first installing Rust and then invoking:
wget https://github.com/oicr-gsi/shesmu/archive/refs/heads/master.zip
unzip master.zip
cd shesmu-master/shawk
cargo install --path .
This will install shawk in Cargo's binary directory under your home directory (by default, ~/.cargo/bin).
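If the shell cannot find shawk after the build, the Cargo binary directory may be missing from PATH. The following is a minimal sketch, assuming Cargo's default install location of ~/.cargo/bin; if CARGO_HOME or an install root has been customized, the directory will differ.

```shell
# Prepend Cargo's default binary directory to PATH unless it is already there.
# Assumption: cargo installed shawk into $HOME/.cargo/bin (the default).
case ":$PATH:" in
  *":$HOME/.cargo/bin:"*) ;;
  *) PATH="$HOME/.cargo/bin:$PATH"; export PATH ;;
esac
```

Placing these lines in a shell start-up file makes the change persistent.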
The command line tool has a configuration file for storing reusable settings.
The location of the file depends on the OS, so use --help to see the path. The
configuration file is a YAML file. All configuration parameters are optional:
---
aliases:
  prod: "https://jrhacker@shesmu-prod.example.com/"
  dev: "https://jrhacker@shesmu-dev.example.com/"
default_host: prod
default_input_format: cerberus_fp
default_output_format: TSV
prepared_columns:
  tissue: "tissue_type, tissue_origin, tissue_name, tissue_region, tissue_prep"
The tool needs to know the URL for each Shesmu instance, but this can be
cumbersome, so aliases allows specifying short names for URLs. The full URL
can be provided to the -H or --host switch, but an alias can also be used.
If no host is provided on the command line, the default_host is used, which
can be a URL or alias.
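As an illustration of host selection, here is a small sketch of a shell wrapper that runs one query against whichever instance is named. The function name list_libraries is hypothetical; the dev alias and the column names come from the sample configuration and queries in this document.

```shell
# Hypothetical wrapper: list project and library names from a chosen instance.
# $1 is an alias from the configuration file (for example dev) or a full URL.
list_libraries() {
  shawk -H "$1" -i cerberus_fp -f TSV '{project, library_name}'
}
```

Calling list_libraries dev and list_libraries "https://jrhacker@shesmu-dev.example.com/" should then reach the same instance, provided the alias is defined as in the sample configuration.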
Data will be extracted from a particular input format and, again, it can be
specified on the command line using the -i or --input switch. If omitted,
the default_input_format is used for the input format.
Similarly, the data can be written in several formats, described later, and the
-f or --format switch determines the output format. If omitted, the
default_output_format is used.
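To show how the defaults shorten a command, here is a sketch that assumes the sample configuration above is in place (default_host, default_input_format and default_output_format all set). The function names are hypothetical, and the behaviour is inferred from the description of the defaults rather than taken from the tool's own documentation.

```shell
# With all three defaults configured, only the query needs to be given.
libraries_with_defaults() {
  shawk '{project, library_name}'
}

# A switch on the command line takes the place of the matching default;
# here the host and the output format are overridden for a single run.
libraries_dev_json() {
  shawk -H dev -f JSON '{project, library_name}'
}
```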
ShAWK is about extracting and manipulating columns. The prepared_columns
setting allows defining reusable groups of columns that can be inserted into a
query. More details are in the following section.
The data to extract is specified using an AWK-like structure with olive syntax. A query is made of one or more rules. Each rule specifies the columns it generates. If multiple rules are specified, they must produce the same columns in the same order. A simple rule can specify columns to copy:
shawk -i cerberus_fp -f TSV '{project, library_name, tissue_type}'
Columns can also be gangs using the @ column construction:
shawk -i cerberus_fp -f TSV '{@merged_library, cell_visibility}'
A new column can be defined using an expression:
shawk -i cerberus_fp -f TSV '{library_name, provider = lims.provider}'
Rather than copy-and-paste from shell history, columns can be placed in the
configuration file as prepared columns and then accessed using $:
shawk -i cerberus_fp -f TSV '{library_name, $tissue}'
This allows building a small library of reusable columns.
Usually, only some records are needed, so a filter can be included in a rule:
shawk -i cerberus_fp -f TSV 'file_size == 0 {library_name, path}'
A query can also include multiple rules:
shawk -i cerberus_fp -f TSV 'project == "foo" {library_name, ok = True} project != "foo" {library_name, ok = file_size > 100}'
Output can be generated in several formats. Normally, the output is written to
standard output, but the -o or --output switch can force it to be written
to a file. The supported output formats are:
CSV_EXCEL: Comma-delimited text with escaping compatible with Microsoft Excel
CSV_MONGO: Comma-delimited text with escaping compatible with MongoDB
CSV_MYSQL: Comma-delimited text with escaping compatible with MySQL import
CSV_POSTGRESQL: Comma-delimited text with escaping compatible with PostgreSQL import
CSV_RFC4180: Comma-delimited text with escaping compatible with RFC 4180
JSON: An array of JSON objects for each row with times written as ISO-8601-compatible strings
JSON_SECS: An array of JSON objects for each row with times written as an integer in seconds from the UNIX epoch
JSON_MILLIS: An array of JSON objects for each row with times written as an integer in milliseconds from the UNIX epoch
TSV: Tab-delimited text
TSV_MONGO: Tab-delimited text with escaping compatible with MongoDB
XML: An XML document with an element for each row with times written as ISO-8601-compatible strings
XML_SECS: An XML document with an element for each row with times written as an integer in seconds from the UNIX epoch
XML_MILLIS: An XML document with an element for each row with times written as an integer in milliseconds from the UNIX epoch
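Once a result has been saved as tab-delimited text, ordinary shell tools can summarize it. The sketch below counts rows per distinct value of the first column. The helper name count_first_column and the file name libraries.tsv are hypothetical, and whether the TSV output starts with a header row is not stated here; if it does, the header simply shows up as one more value with a count of one.

```shell
# Hypothetical helper: count rows per distinct value of the first column
# of a tab-delimited file, most frequent value first.
count_first_column() {
  cut -f 1 "$1" | sort | uniq -c | sort -rn
}

# Example use, assuming -o takes the path of the file to write:
#   shawk -i cerberus_fp -f TSV -o libraries.tsv '{project, library_name}'
#   count_first_column libraries.tsv
```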