The Prometheus Alert Manager can be used to throttle services using AutoInhibit alerts and can be the target for Alert olives.
To configure the server, create a file ending in .alertman
as follows:
{
  "alertmanager": "http://alertmanager:9093",
  "environment": "production",
  "labels": ["job", "scope"]
}
The plugin will check Alert Manager and throttle while any alerts are firing of the form
AutoInhibit{environment="production"}
or
AutoInhibit{environment="production",y="x"}
where x is the name of the service used by an olive or action and y is one of labels (in this case, job or scope). If labels is not supplied, job is assumed. This allows dynamic throttling of Shesmu workload based on the services required.
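The matching described above can be sketched in a few lines of Python. This is only an illustration of the prose, not Shesmu's implementation: the helper names load_config and is_inhibited are hypothetical, firing alerts are represented as plain label dictionaries (Alert Manager stores the alert name in the alertname label), and the sketch assumes that an AutoInhibit alert carrying none of the configured labels applies to every service in the environment.

```python
import json


def load_config(text):
    """Parse the body of a .alertman file (hypothetical helper).

    Applies the documented default: if labels is not supplied, job is assumed.
    """
    config = json.loads(text)
    config.setdefault("labels", ["job"])
    return config


def is_inhibited(firing_alerts, config, services):
    """Return True if a firing AutoInhibit alert should throttle these services.

    firing_alerts: list of label dicts, one per firing alert.
    services: set of service names used by the olive or action.
    """
    for labels in firing_alerts:
        if labels.get("alertname") != "AutoInhibit":
            continue
        if labels.get("environment") != config["environment"]:
            continue
        # Values of the configured labels that this alert actually carries.
        scoped = [labels[key] for key in config["labels"] if key in labels]
        if not scoped:
            # Assumption: no service label means the whole environment is inhibited.
            return True
        if any(value in services for value in scoped):
            return True
    return False


config = load_config(
    '{"alertmanager": "http://alertmanager:9093", "environment": "production"}'
)
firing = [
    {"alertname": "AutoInhibit", "environment": "production", "job": "hypothetical-service"}
]
print(is_inhibited(firing, config, {"hypothetical-service"}))  # True
print(is_inhibited(firing, config, {"other-service"}))  # False
```

In this sketch an alert scoped to a different service, or to a different environment, leaves the olive or action alone.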
Additionally, the output of Alert olives is pushed to Alert Manager with the additional label environment="production".
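As a rough illustration of that extra label, the sketch below builds one alert entry in the shape accepted by Alert Manager's v2 API (a JSON object with labels and annotations). The function name build_alert and the example label values are hypothetical, and letting the configured environment overwrite an existing environment label is an assumption, not documented Shesmu behaviour.

```python
def build_alert(labels, annotations, environment):
    """Return one alert entry with the configured environment label added."""
    merged = dict(labels)  # copy so the caller's labels are not modified
    # Assumption: the configured environment wins if the olive set its own.
    merged["environment"] = environment
    return {"labels": merged, "annotations": dict(annotations)}


alert = build_alert(
    {"alertname": "HypotheticalAlert", "job": "hypothetical-service"},
    {"summary": "Example only"},
    "production",
)
print(alert["labels"]["environment"])  # production
```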
Here are recommended rules for monitoring Shesmu’s state:
groups:
  - name: shesmu.rules
    rules:
      - record: shesmu_incomplete_action_count
        expr: sum(shesmu_action_state_count{state!~"SUCCEEDED|ZOMBIE"}) by (state)
      - record: shesmu_action_perform_time:rate30m
        expr: rate(shesmu_action_perform_time_bucket[30m])
      - alert: BadSource
        expr: max_over_time(shesmu_source_valid[5m]) == 0 and on(instance) up > 600
        annotations:
          description: Shesmu has failed to compile . The source file is probably wrong.
          summary: Unable to compile
To check for actions being in a state for too long, use these rules, adjusting the timeouts as desired:
- alert: StuckActions
  expr: time() - shesmu_action_oldest_time{state=~"QUEUED|THROTTLED|WAITING"} > 2 * 86400
  labels:
    severity: pipeline
  annotations:
    description: "A action has been on for a while now."
    summary: " actions too long on "
- alert: StuckActions
  expr: time() - shesmu_action_oldest_time{state="INFLIGHT"} > 5 * 86400
  labels:
    severity: pipeline
  annotations:
    description: "A action has been on for a while now."
    summary: " actions too long on "
To check for olives not running frequently enough or hitting their timeouts, try:
- alert: StuckOlive
  expr: time() - shesmu_run_last_run > 7200 and up > 600
  annotations:
    description: All the olives are taking much too long to run on .
    summary: All olives stuck on
- alert: StuckOlive
  expr: shesmu_run_overtime > 0
  annotations:
    description: The olives from are taking much too long to run on .
    summary: Olives in stuck on
If using the SSH refiller, it can be useful to watch for failures:
- alert: RefillFailure
  expr: min_over_time(shesmu_sftp_refill_exit_status[1h]) > 0
  annotations:
    description: SSH refill processor on is exiting non-zero.
    summary: Failed to refill on .