This is a guide for operators who need to check whether Shesmu is healthy and investigate failures.
The main pages operators are going to use are:
On both the Olives and Actions page, you can add filters to narrow the list. There are a few kinds of filters:
Text and regular expression searches are much slower than the other filters. Try to use tags where possible and add tags to olives for better searching. When possible, try to combine text searches with other, faster searches to improve performance. Using the Olives page is also faster because it narrows the search to a single olive’s output.
On the Overview tab, there is a breakdown of actions (this applies even on the Olives page where the filters are under the Actions tab). By clicking on any cell header or cell in the tables, the search can be restricted using the Drill Down menu item.
For the histograms, clicking and dragging will filter on that range. Histograms can provide a lot of useful information about what is going on in an olive. For instance:
In this histogram there are a number of actions that have not been generated by an olive recently. This likely means that the input data changed and these actions are now orphans and may be candidates to be purged. The larger the gap between now and the last time they were generated, the more likely it is that they are not being generated. The Olives page would show when this olive last ran for comparison.
Exercise: Go to the Actions page, find an action, and open it in a new tab. Using the Actions and Olives pages, find 3 different filter combinations that find this action using the information displayed about it.
When you build up a search query using filters on the Actions page, it can be saved using the Add to My Searches button. This search will now be available from your browser for this server.
If you want to share a search, click on the Export Search button. If the search is exported to a file, it can be added to the server’s configuration to make it available to all users. The To Clipboard, To Clipboard for Ticket, and To File options are all useful for importing the search later. There are also buttons to create purge or fetch shell commands, for use in scripts. The JIRA plugin also allows exporting searches to tickets; more details on that in the next section.
Shesmu’s Actions dashboard provides a way to sift through the actions that olives have generated. It can be useful to save these searches. Clicking the Save Search button saves the search in the browser. Saved searches can be shared by clicking the clipboard icon beside a saved search to copy it, then using the Add Search button on the dashboard and pasting in the copied text. The Import Searches and Export Searches buttons can also be used to copy all searches and upload them to a different instance.
To go beyond person-to-person sharing, the search filter JSON, obtained by clicking the Show Search button, can be saved to a file ending in `.search` in the Shesmu configuration directory. The name of the file will be used as the name of the search.
It is not recommended to save searches that reference a particular olive source location. Every time the file is updated, the olive’s hash will be updated and the filter will no longer match. The `hash` property in the filter can be changed to `null` to avoid this issue. Even if this were not the case, it is possible that the olive will move around in the script and the line and column that mark the start of each olive will change.
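As a rough sketch of the hash workaround, the snippet below walks a search filter loaded from JSON and sets every `hash` property to null before the file is saved. It assumes the property is literally named `hash` and may appear at any nesting depth. The `type` value and the other field names in the example filter are invented for illustration and are not taken from Shesmu's filter schema.

```python
import json


def null_hashes(node):
    """Return a copy of a filter with every "hash" property set to None."""
    if isinstance(node, dict):
        return {
            key: None if key == "hash" else null_hashes(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [null_hashes(item) for item in node]
    return node


# Hypothetical filter; only the "hash" property name comes from the text above.
example = {
    "type": "example-location",
    "locations": [
        {"file": "my.shesmu", "line": 10, "column": 1, "hash": "abc123"}
    ],
}

cleaned = null_hashes(example)
# json.dumps writes Python's None as JSON null.
print(json.dumps(cleaned, indent=2))
```

The function returns a new structure instead of editing in place, so the original search can still be compared against the cleaned copy before the `.search` file is written.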
Shesmu can use JIRA and custom searches to create delegation. A JIRA ticket can contain a search in text form (i.e., `shesmusearch:`) or references to particular actions (i.e., `shesmu:`) in the description.
In the JIRA configuration, we have JIRA queries as follows:
"searches": [
  {
    "filter": {
      "states": [
        "FAILED",
        "UNKNOWN",
        "HALP"
      ],
      "type": "status"
    },
    "jql": "project IN (\"GC\", \"GDI\", \"GP\", \"GBS\", \"GRD\") AND resolution = Unresolved",
    "name": "Problems from {key} - {summary} ({assignee})",
    "type": "EACH_AND"
  },
  {
    "filter": {
      "states": [
        "FAILED",
        "UNKNOWN",
        "HALP"
      ],
      "type": "status"
    },
    "jql": "project IN (\"GC\", \"GDI\", \"GP\", \"GBS\", \"GRD\") AND resolution = Unresolved",
    "name": "Problems for {assignee}",
    "type": "BY_ASSIGNEE"
  },
  {
    "filter": {
      "states": [
        "FAILED",
        "UNKNOWN",
        "HALP"
      ],
      "type": "status"
    },
    "jql": "project IN (\"GC\", \"GDI\", \"GP\", \"GBS\", \"GRD\") AND resolution = Unresolved",
    "name": "Pipeline Lead Dashboard",
    "type": "ALL_EXCEPT"
  },
  {
    "filter": {
      "states": [
        "FAILED",
        "UNKNOWN",
        "HALP"
      ],
      "type": "status"
    },
    "jql": "project IN (\"GC\", \"GDI\", \"GP\", \"GBS\", \"GRD\") AND resolution = Unresolved",
    "name": "Problems Currently Handed-Off",
    "type": "ALL_AND"
  }
],
Each of these searches performs a JIRA search using `"jql"` to find searches embedded in issues, then takes the Shesmu search in `"filter"` and combines them using `"type"`. So, the `EACH_AND` type takes every ticket and creates a search for it by combining the query in this file with the query from the ticket. `BY_ASSIGNEE` does the same thing, but first groups the tickets by assignee. The `ALL_EXCEPT` search is the most important for operations. It creates a dashboard that has all the problems except for the ones mentioned in tickets. Therefore, operations can carve off problems and delegate them to other people by creating a ticket.
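The combination rules can be modelled in a few lines of Python. This is only an illustrative sketch, not Shesmu's implementation: searches are represented as plain predicates over actions, the ticket keys and assignees are made up, and joining the tickets of one assignee with "or" is an assumption about what grouping means. `ALL_AND` is left out because the text does not describe it.

```python
from collections import defaultdict


def combine(mode, base, tickets):
    """Build named searches from a base predicate and per-ticket predicates.

    base: function(action) -> bool, standing in for "filter" in the config
    tickets: dicts with "key", "assignee", and "matches" (a predicate)
    """
    if mode == "EACH_AND":
        # One search per ticket: base filter AND that ticket's search.
        return {t["key"]: (lambda a, t=t: base(a) and t["matches"](a)) for t in tickets}
    if mode == "BY_ASSIGNEE":
        # Group ticket searches by assignee first, then AND with the base filter.
        groups = defaultdict(list)
        for t in tickets:
            groups[t["assignee"]].append(t["matches"])
        return {
            who: (lambda a, ms=ms: base(a) and any(m(a) for m in ms))
            for who, ms in groups.items()
        }
    if mode == "ALL_EXCEPT":
        # Everything the base filter finds that no ticket has claimed.
        return {"all": lambda a: base(a) and not any(t["matches"](a) for t in tickets)}
    raise ValueError(f"unsupported mode: {mode}")


def select(search, actions):
    """Return the ids of the actions a search matches."""
    return [a["id"] for a in actions if search(a)]


problem_states = {"FAILED", "UNKNOWN", "HALP"}
base = lambda action: action["state"] in problem_states
tickets = [
    {"key": "GP-1", "assignee": "alice", "matches": lambda action: action["id"] == 1},
    {"key": "GP-2", "assignee": "bob", "matches": lambda action: action["id"] == 2},
]
actions = [
    {"id": 1, "state": "FAILED"},
    {"id": 2, "state": "HALP"},
    {"id": 3, "state": "FAILED"},
    {"id": 4, "state": "SUCCEEDED"},
]

inbox = select(combine("ALL_EXCEPT", base, tickets)["all"], actions)
print(inbox)
```

In this toy data set, actions 1 and 2 are claimed by tickets, so only action 3 remains in the `ALL_EXCEPT` result, which is the behaviour that lets the dashboard act as an inbox.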
Tickets can be made by using the filters and then exporting the problem to a new ticket. It does not need to be assigned to be removed from the Pipeline Lead Dashboard. This allows the Pipeline Lead Dashboard to act as the operations inbox.
The page has to be refreshed to get the updated query from JIRA and the results from JIRA are cached for 15 minutes.
Exercise: Find a problem using the Pipeline Lead Dashboard and create a ticket for it.
To create a new ticket:
If there is already a ticket that you would like to attach the search to:
Once a ticket is filed, refresh the Actions page and there will be a search for each ticket with an embedded search and an aggregated search of every user. Shesmu caches this information for 15 minutes, so the searches may not be updated immediately.
It is also possible to extract the search from a ticket manually.
`shesmusearch:` string into the box.
This search will be visible in the drop-down list only for you, from this browser, for this Shesmu server. If you wish to share it with everyone, use Export Search and then To File and install it in the Shesmu configuration directory.
Shesmu will stop running olives or checking on an action if there is a potential overload. On the main page, STOP STOP STOP will cause all active actions to slip into a `THROTTLED` state.
There are 3 kinds of throttles supported:
Per plugin/service throttles can stop a plugin and all the actions and olives that use it. For instance, throttling `jira` will stop any olives that are using the JIRA searches or any actions that file JIRA issues.
Every data format (e.g., `cerberus_fp`) can also be throttled and olives that use this data will not run.
To engage a throttle, there are several ways:
A file ending in `.maintenance` contains a list of times to engage a throttle. These are useful for planned events, such as IT maintenance. There’s a graphical maintenance schedule editor.
Prometheus is the most flexible of the systems. Prometheus rules monitor systems and can stop Shesmu from accessing certain systems by firing `AutoInhibit` alerts. Since it is useful to create these manually, Somnus can be used to manually create limited-time inhibitions. Think of them as the reverse of a silence: stop the problem for a limited window instead of ignoring it for a limited window.
Typically, actions in a `THROTTLED` state don’t require any action. If an action has been throttled for a very long time, it may indicate that another service is broken or stuck or a maintenance schedule is overwhelmed. It’s usually best to check Prometheus for inhibition alerts.
While fun, STOP STOP STOP is a blunt tool for stopping actions. Actions can also be paused using the olives that generate them. On the Olives page, it is possible to pause the actions generated by an olive or a file. Pausing an olive does not stop the olive from running. It simply puts all actions generated by the olive into a `THROTTLED` state.
Pauses can be created or removed from the Olives page and removed on the Pauses page. The reason for having them in two places is this:
To avoid this problem, all pauses, even for olives that have been replaced, are available on the Pauses page and they can be cleared from there.
Every Shesmu action has:

- a last generation time (`added`)
- a last checked time (`checked`)
- a last state transition time (`statusChanged`)
- an external modification time (`external`; optional)

Every few minutes, Shesmu runs all the olives and they generate all the actions. Since most actions are the same every time, the duplicates are thrown away. The last generation time is the last time an olive produced this action. If the action is a duplicate, it will still have an updated generation time.
Once an action has been generated by an olive, it will enter an `UNKNOWN` state and the Shesmu scheduler will try to run the action. Every time it does, it will update the last checked time. When the action is checked, it can change its state; if this occurs, the last state transition time is also updated.
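The bookkeeping described above can be pictured with a small in-memory model. This is a simplified sketch rather than Shesmu's scheduler: the class names, the string keys used to identify duplicate actions, and the numeric timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ActionRecord:
    state: str
    last_generated: float
    last_checked: Optional[float] = None
    last_transition: Optional[float] = None


class ActionStore:
    """Toy model of how generation and checking update an action's times."""

    def __init__(self) -> None:
        self.records: Dict[str, ActionRecord] = {}

    def generate(self, key: str, now: float) -> None:
        record = self.records.get(key)
        if record is None:
            # New action: it starts out UNKNOWN until it has been checked.
            self.records[key] = ActionRecord(state="UNKNOWN", last_generated=now)
        else:
            # Duplicate: discarded, but the generation time is still refreshed.
            record.last_generated = now

    def check(self, key: str, observed_state: str, now: float) -> None:
        record = self.records[key]
        record.last_checked = now
        if observed_state != record.state:
            # Only a real change of state moves the transition time.
            record.state = observed_state
            record.last_transition = now


store = ActionStore()
store.generate("align-sample-1", now=100)
store.generate("align-sample-1", now=200)  # same action produced again
store.check("align-sample-1", "INFLIGHT", now=210)
store.check("align-sample-1", "INFLIGHT", now=220)
record = store.records["align-sample-1"]
print(record)
```

After these calls there is still a single record: its generation time is 200, its checked time is 220, and its transition time stays at 210 because the second check did not change the state.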
Therefore, an old last generation time means the olive has stopped producing this action, the olive has been deleted, or the olive is stuck. An old last checked time indicates the Shesmu scheduler is overloaded or the action is not requesting frequent updates. An old last transition time indicates that the problem is internal to the action.
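These rules of thumb can be written down as a small triage helper. The sketch below is only an illustration: the function name, the parameter names, and the use of a single staleness threshold for all three times are assumptions, and in practice each time would likely need its own threshold.

```python
def diagnose(last_generated, last_checked, last_transition, now, stale_after):
    """Return the likely causes suggested by whichever times are old."""
    hints = []
    if now - last_generated > stale_after:
        hints.append("olive stopped producing this action, was deleted, or is stuck")
    if now - last_checked > stale_after:
        hints.append("scheduler is overloaded or the action is not requesting frequent updates")
    if now - last_transition > stale_after:
        hints.append("problem is internal to the action")
    return hints


# All times are in seconds; only the generation time is older than the threshold.
hints = diagnose(
    last_generated=0,
    last_checked=990,
    last_transition=995,
    now=1000,
    stale_after=600,
)
print(hints)
```

An empty list means none of the three times is older than the threshold, so the timestamps alone do not point at a cause.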
The external modification time is some time that the action self-reports that it thinks is useful. For Vidarr workflows, this is the last modification time of the workflow run. JIRA actions show the last modification time of the ticket they are associated with.
Actions also have commands that allow you to tell the action to do something. Commands will cause an action to flip back to the `UNKNOWN` state. Some commands can be applied in bulk. A command may require a confirmation before executing and some dangerous commands require a puzzle to be solved before working in bulk.
| State | Description |
|---|---|
| `FAILED` | The action has been attempted and encountered an error (possibly recoverable). |
| `HALP` | The action is in a state where it needs human attention or intervention to correct itself. |
| `INFLIGHT` | The action is currently being executed. |
| `QUEUED` | The action is waiting for a remote system to start it. |
| `SAFETY_LIMIT_REACHED` | The action has encountered some user-defined limit stopping it from proceeding. |
| `SUCCEEDED` | The action is complete. |
| `THROTTLED` | The action is being rate limited by a Shesmu throttler or by an over-capacity signal. |
| `UNKNOWN` | The action’s state is not currently known, either due to an exception or because it has not been attempted. |
| `WAITING` | The action cannot be started due to a resource being unavailable. |
| `ZOMBIE` | The action is never going to complete. This is not necessarily a failed state; testing or debugging actions should be in this state. |
Files can be deleted from disk by the SFTP delete action. To have a human review before deleting, the olive can set `automatic = False` and then a command will be available for a human to approve the action. These actions appear in the `HALP` state until they are approved.
Vidarr actions have several important commands meant to replace access to the command line:
The Vidarr actions also generate some useful tags:

- `vidarr-target:`name: The target on the Vidarr instance.
- `vidarr-workflow:`name[`/`version]: The workflow that this action will run, both with and without the version.
- `vidarr-state:`[`active`|`attempt`|`conflict`|`dead`|`finished`|`missing`]: The action uses a state machine while it’s communicating with Vidarr. This is the current state of that machine.
- `vidarr-attempt:`count: The number of times this workflow run has been attempted.

Vidarr actions have a few states they can be in:

- `FAILED` – This can happen for a few reasons: the workflow itself failed, Vidarr rejected the submission request, or an internal error occurred trying to launch the workflow.
- `HALP` – The workflow run has been previously run, but with incompatible LIMS key versions. Correct LIMS or reprocess the workflow.
- `QUEUED` – The workflow is waiting to start the next phase.
- `INFLIGHT` – The workflow is running.
- `WAITING` – The workflow run is in between Vidarr phases.
- `ZOMBIE` – The workflow has input which is stale; normal procedures for fixing stale records will eventually generate a non-stale version of this action.

Due to the imperfect nature of reality, it might be useful to launch bespoke actions not defined by olives. To do this, create a JSON file that ends in `.actnow`.
Shesmu will add these actions to its queue and attempt to run them as if they were produced by an olive.