vidarr

Analysis provenance tracking server

Vidarr Plugins for Cromwell

These plugins allow interfacing Vidarr with Cromwell.

Cromwell Output Provisioner

The output provisioner uses a Cromwell job to copy an output file from shared disk to another directory for permanent archival. The workflow must compute the size and checksum of the file’s contents and report them, along with the location where the file is now stored.

{
  "cromwellUrl": "http://cromwell-output.example.com:8000",
  "chunks": [ 2, 4 ],
  "debugCalls": false,
  "fileField": "provisionFileOut.inputFilePath",
  "fileSizeField": "provisionFileOut.fileSizeBytes",
  "checksumField": "provisionFileOut.fileChecksum",
  "checksumTypeField": "provisionFileOut.fileChecksumType",
  "outputPrefixField": "provisionFileOut.outputDirectory",
  "storagePathField": "provisionFileOut.fileOutputPath",
  "type": "cromwell",
  "wdlVersion": "1.0",
  "workflowOptions": {
    "read_from_cache": false,
    "write_to_cache": false
  },
  "workflowUrl": "http://example.com/provisionFileOut.wdl"
}

The "cromwellUrl" is the Cromwell server that will handle these requests. The workflow can be given by URL through "workflowUrl" or inline as a string using "workflowSource". "workflowOptions" holds the workflow options Cromwell requires. Although Cromwell is given the entire workflow, it still needs to know the WDL version, provided as "wdlVersion". The workflow will be called with two arguments: the file to copy ("fileField") and the submitter-provided archive directory ("outputPrefixField"). Once the workflow has completed, the plugin will collect the permanent location of the file ("storagePathField"), the checksum ("checksumField"), the checksum algorithm ("checksumTypeField"), and the file size ("fileSizeField").
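To make the role of the field names concrete, here is a minimal Python sketch of how the configured names could map onto workflow inputs and outputs. It is an illustration only, not the plugin's actual implementation: the helper names `build_inputs` and `read_outputs`, the result keys, and the sample paths are hypothetical. Only the configuration keys come from the example above.

```python
# Field names copied from the example configuration above.
CONFIG = {
    "fileField": "provisionFileOut.inputFilePath",
    "fileSizeField": "provisionFileOut.fileSizeBytes",
    "checksumField": "provisionFileOut.fileChecksum",
    "checksumTypeField": "provisionFileOut.fileChecksumType",
    "outputPrefixField": "provisionFileOut.outputDirectory",
    "storagePathField": "provisionFileOut.fileOutputPath",
}


def build_inputs(config, source_file, archive_dir):
    """Hypothetical helper: the two arguments the workflow is called with."""
    return {
        config["fileField"]: source_file,
        config["outputPrefixField"]: archive_dir,
    }


def read_outputs(config, workflow_outputs):
    """Hypothetical helper: pick the four reported values out of the outputs."""
    return {
        "storagePath": workflow_outputs[config["storagePathField"]],
        "checksum": workflow_outputs[config["checksumField"]],
        "checksumType": workflow_outputs[config["checksumTypeField"]],
        # The example WDL reports the size as a String, so convert it here.
        "fileSize": int(workflow_outputs[config["fileSizeField"]]),
    }
```

The point of the indirection is that the WDL can name its inputs and outputs however it likes, as long as the configuration tells the plugin which name plays which role.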

After running the specified WDL, the Cromwell OutputProvisioner needs to fetch metadata from the Cromwell server. Including the calls block of the metadata in the response can have negative performance implications for sufficiently large workflow runs, so by default, calls is only fetched if provisioning out has failed. Set "debugCalls" to true in order to retrieve calls information for running provision out tasks as well.
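The decision described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the plugin's code: the function names are hypothetical, and it assumes the calls block is skipped by using the `excludeKey` query parameter of Cromwell's workflow metadata endpoint. The plugin may achieve the same effect differently.

```python
from urllib.parse import urlencode


def should_fetch_calls(debug_calls, provisioning_failed):
    """Fetch calls for failures always, and for everything when debugging."""
    return debug_calls or provisioning_failed


def metadata_url(cromwell_url, workflow_id, include_calls):
    """Build a metadata URL, excluding the calls block unless requested."""
    base = f"{cromwell_url.rstrip('/')}/api/workflows/v1/{workflow_id}/metadata"
    if include_calls:
        return base
    # Assumption: excludeKey=calls is how the large block is left out.
    return base + "?" + urlencode({"excludeKey": "calls"})
```

With "debugCalls" left at false, a provision out task that has not failed would be polled without the calls block, which keeps responses small for large runs.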

Your file system probably will not appreciate having thousands of files dumped in a single output directory, so the "chunks" parameter will create a hierarchy of directories based on the workflow run identifier. The numbers determine the number of characters to use in each directory. For example, [2, 4] will take an ID of the form AABBBBCCCCCCCCCCCCCC and produce an output path that is AA/BBBB/AABBBBCCCCCCCCCCCCCC. Once files have been provisioned out, it is possible to change the chunking scheme, but the existing files are already recorded in the Vidarr database and should not be moved without updating the database.
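The chunking rule can be expressed as a short Python function. This is a sketch of the behaviour described above, not the plugin's source; the function name `chunked_path` is hypothetical, and the sketch assumes the identifier is at least as long as the sum of the chunk sizes.

```python
def chunked_path(run_id, chunks):
    """Split the leading characters of run_id into nested directory names.

    Each number in chunks consumes that many characters of the identifier,
    and the full identifier is used as the final path component.
    """
    parts = []
    offset = 0
    for length in chunks:
        parts.append(run_id[offset:offset + length])
        offset += length
    parts.append(run_id)
    return "/".join(parts)


print(chunked_path("AABBBBCCCCCCCCCCCCCC", [2, 4]))
# prints AA/BBBB/AABBBBCCCCCCCCCCCCCC
```

Larger chunk sizes create more directories per level and fewer files in each, so the best choice depends on how many workflow runs are expected.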

Here is an example WDL script to do provisioning out that uses rsync to do the file copying:

version 1.0
workflow provisionFileOut {
  input {
    String inputFilePath
    String outputDirectory
  }
   
  call rsync_file {
    input:
      inputFilePath=inputFilePath,
      outputDirectory=outputDirectory
  }
  output {
    String fileSizeBytes = rsync_file.fileSizeBytes
    String fileChecksum = rsync_file.fileChecksum
    String fileChecksumType = "md5sum"
    String fileOutputPath = rsync_file.fileOutputPath
  }
}

task rsync_file {
  input {
    String inputFilePath
    String outputDirectory
  }

  command <<<
    set -euo pipefail

    INPUT_FILE="~{inputFilePath}"
    OUTPUT_DIRECTORY="~{outputDirectory}"
    OUTPUT_FILE_PATH="${OUTPUT_DIRECTORY%%/}/$(basename "${INPUT_FILE}")"

    test -d "${OUTPUT_DIRECTORY}" || mkdir -p "${OUTPUT_DIRECTORY}"

    if [ ! -f "${INPUT_FILE}" ]; then
      echo "${INPUT_FILE} is not a file or not accessible"
      exit 1
    fi

    if [ -f "${OUTPUT_FILE_PATH}" ]; then
      echo "${OUTPUT_FILE_PATH} already exists"
      exit 1
    fi

    echo "Starting rsync"
    rsync -aL --checksum --out-format="%C" "${INPUT_FILE}" "${OUTPUT_FILE_PATH}" > md5sum.out
    echo "Completed rsync"

    stat --printf="%s" "${OUTPUT_FILE_PATH}" > size.out

    echo "${OUTPUT_FILE_PATH}" > filePath.out
  >>>

  output {
    String fileSizeBytes = read_string("size.out")
    String fileChecksum = read_string("md5sum.out")
    String fileChecksumType = "md5sum"
    String fileOutputPath = read_string("filePath.out")
  }

  runtime {
    memory: "1 GB"
    timeout: "1"
  }
}

Cromwell Workflow Engine

This can be used to run WDL workflows using a remote Cromwell instance. The configuration is as follows:

{
  "debugInflightRuns": false,
  "engineParameters": {
     "parameter1": type...
  },
  "type": "cromwell",
  "url": "http://cromwell.example.com:8000"
}

"url" specifies the Cromwell server that should be contacted. "engineParameters" is optional and allows extra parameters to be required; these are passed to Cromwell as its workflowOptions. As the Cromwell WorkflowEngine accesses Cromwell to assess progress on workflow runs, it fetches /metadata from Cromwell. For sufficiently large workflow runs, fetching this endpoint with calls information included has negative performance implications, so by default calls is fetched only for failed workflow runs, where it serves as debugging information. To fetch calls information for running workflow runs as well, set "debugInflightRuns" to true.