Lee Calcote
clouds, containers, infrastructure, applications and their management
Alertmanager is an alert...
ingester
grouper
de-duplicator
silencer
throttler
notifier
a brief Prometheus AlertManager construct review
Routes: match alerts to their receiver and how often to notify
Receivers: where and how to send alerts
a brief Prometheus AlertManager construct review
Grouping (correlating) is configured per route:
  group_wait: 30s
  group_by: ['alertname', 'cluster']
  group_interval: 5m
Inhibition (suppressing) is configured globally.
Silencing (muting) is configured via UI / API.
ALERT <alert name>
IF <PromQL vector expression>
FOR <duration>
LABELS { ... }
ANNOTATIONS { ... }
Supports clients other than Prometheus
AlertManager is notified when alerts transition state
a shared construct
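Because alerts are a simple, shared construct, clients other than Prometheus can push them straight to AlertManager's API. A minimal, illustrative sketch in Go (it assumes an AlertManager listening on localhost:9093 and invents the labels and annotations):

package main

import (
    "bytes"
    "encoding/json"
    "log"
    "net/http"
    "time"
)

func main() {
    // One or more alerts, in the same JSON shape Prometheus itself sends.
    alerts := []map[string]interface{}{{
        "labels": map[string]string{
            "alertname": "disk_full",
            "severity":  "page",
            "instance":  "db01.example.io",
        },
        "annotations": map[string]string{
            "summary": "Disk almost full on db01",
        },
        "startsAt": time.Now().Format(time.RFC3339),
    }}

    body, err := json.Marshal(alerts)
    if err != nil {
        log.Fatal(err)
    }

    // POST the batch to AlertManager's alerts endpoint.
    resp, err := http.Post("http://localhost:9093/api/v1/alerts",
        "application/json", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("alertmanager responded:", resp.Status)
}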
Alert state transitions: inactive → pending → firing (and back to inactive), with notifications pushed to AlertManager for firing alerts.
Use a receiver that goes to both destinations.
route:
  receiver: email_webhook

receivers:
- name: email_webhook
  email_configs:
  - to: 'lee@example.io'
  webhook_configs:
  - url: <webhook url here>
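The webhook destination above can point at any HTTP endpoint you run yourself. A minimal, illustrative receiver sketch in Go (the :5001 port and /alert path are arbitrary choices, and only a few fields of the webhook payload are decoded):

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// webhookMessage captures just the fields of AlertManager's webhook
// payload that this example cares about.
type webhookMessage struct {
    Status string `json:"status"`
    Alerts []struct {
        Status      string            `json:"status"`
        Labels      map[string]string `json:"labels"`
        Annotations map[string]string `json:"annotations"`
    } `json:"alerts"`
}

func main() {
    http.HandleFunc("/alert", func(w http.ResponseWriter, r *http.Request) {
        var msg webhookMessage
        if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        // Log each alert in the notification.
        for _, a := range msg.Alerts {
            log.Printf("[%s] %v: %v", a.Status, a.Labels, a.Annotations)
        }
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":5001", nil))
}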
Use continue to keep matching against subsequent routes.
route:
  receiver: ops-team-all   # default
  routes:
  - match:
      severity: page
    receiver: ops-team-b
    continue: true
  - match:
      severity: critical
    receiver: ops-team-a

receivers:
- name: ops-team-all
  email_configs:
  - to: ops-team-all@example.io
- name: ops-team-a
  email_configs:
  - to: ops-team-a@example.io
- name: ops-team-b
  email_configs:
  - to: ops-team-b@example.io
Architecture (diagram): alerts arrive via the API and are stored by the Alert Provider; the Dispatcher subscribes to incoming alerts, sorts them into aggregation groups, and assigns the correct notifiers to each; batched alerts then pass through the notification pipeline, made up of the Inhibitor, the Silencer (backed by the Silence Provider), the Router, de-duplication against the Notify Provider (which checks for previously sent notifications), and retries; a maintenance script and the UI round out the picture.
High availability is being introduced in 0.5, built atop Weave Mesh and its gossip protocols.
With HA, you no longer have to monitor the monitor.
Designed for an alert to be sent to all instances in the cluster.
All Prometheus instances send alerts to all Alertmanager instances.
Guarantees that notifications are sent at least once.
Story:
As an Operator, I would like to not only see a list of firing alerts, but also a list of all transpired alerts, so that I may have additional context on the thresholding behavior of a given defined alert.
Prologue:
Alert troubleshooting is improved when operators can see not only what is firing, what has recently fired, and what is normal, but can also go back in time and see what fired an hour ago. Understanding firing order assists in root cause analysis and in identifying problem areas.
Limitations:
Acceptance Criteria:
Stretch:
Include pagination
Add a date range picker
Add a host filter
test setup
$ git clone https://github.com/prometheus/client_golang.git
$ cd client_golang/examples/random
$ go get -d
$ go build
Fetch and compile the client library code example.
Start example targets in separate terminals.
$ ./random -listen-address=:8080
$ ./random -listen-address=:8081
$ ./random -listen-address=:8082
Be sure to create and run the random sample targets, and point Prometheus at your soon-to-be-running AlertManager:
Follow the getting started instructions to download, configure and run Prometheus.
$ ./prometheus -config.file=prometheus.yml -alertmanager.url=http://localhost:9093
ALERT instance_down
  IF up == 0
  FOR 5s
  LABELS {severity="page"}
  ANNOTATIONS {
    DESCRIPTION="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 seconds.",
    SUMMARY="Instance {{$labels.instance}} down"
  }
/alert.rules
A simple alert rule that will fire when any given target is unreachable for longer than 5 seconds.
...
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
- "alert.rules"
...
/prometheus.yml
development setup
$ git clone https://github.com/prometheus/alertmanager.git
Given that our user story includes making front-end changes to AlertManager, ensure that you install a small utility to generate Go code from any file.
Clone AlertManager repo
Get, build and copy go-bindata into any directory on your PATH
$ go get -u github.com/jteeuwen/go-bindata/...
$ cd $GOPATH/src/github.com/jteeuwen/go-bindata/go-bindata
$ go build
create an alert notification receiver.
route:
  group_by: [cluster]
  # If an alert isn't caught by a route, send it to slack.
  receiver: slack_general
  routes:
  # Send severity=page alerts to slack.
  - match:
      severity: page
    receiver: slack_general

receivers:
- name: slack_general
  slack_configs:
  - api_url: '<your-web-url-here>'
    channel: '#<your-channel-name-here>'
    send_resolved: true
Of the supported AlertManager receivers, let’s opt for integrating Slack.
The visual editor can assist in building routing trees.
Verify you have a functional development environment by building and running the project:
$ make assets # invokes go-bindata to inject static web files
$ go build # compiles go code
$ ./alertmanager -config.file=slack.yml # runs alertmanager with the specified configuration
$ curl -X POST http://localhost:9090/-/reload
$ kill -HUP `pgrep alertmanager`
$ ./promtool check-config <config file>
$ ./promtool check-rules <rules file>
Reload Prometheus or AlertManager configs
Validate Prometheus config and alert rules
If you choose to set up a Slack channel, you should now see new alerts firing as and when your random targets go up and down.
/ui/app/js/app.js (Angular)
/ui/app/partials/history.html (HTML)
/api.go (Go)
/provider/provider.go, /provider/sqlite/sqlite.go, /provider/boltmem/boltmem.go (Go & SQL)
/api.go
All UI functionality should be addressable via API.
Let’s register a new /history API endpoint:
r.Get("/history", ihf("history", api.listAllAlerts))
func (api *API) listAllAlerts(w http.ResponseWriter, r *http.Request) {
    alerts := api.alerts.GetAll()
    defer alerts.Close()

    var (
        err error
        res []*types.Alert
    )
    for a := range alerts.Next() {
        if err = alerts.Err(); err != nil {
            break
        }
        res = append(res, a)
    }
    if err != nil {
        respondError(w, apiError{
            typ: errorInternal,
            err: err,
        }, nil)
        return
    }
    respond(w, types.Alerts(res...))
}
With /api/v1/history now an addressable API endpoint, we’ll need to build a function to handle requests made to it.
The api.listAllAlerts function will handle inbound HTTP requests made to the new endpoint.
Add a new AlertIterator and AlertProvider to /provider/boltmem/boltmem.go
With the API endpoint in place, let’s turn our attention to the backend: collecting the right recordset from our data provider.
/provider
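The listAllAlerts handler above relies on a GetAll() method on the alert provider. A rough, illustrative sketch of that provider-side change, using an in-memory map as a stand-in (the package name, the memAlerts type, and the provider.NewAlertIterator helper are assumptions for illustration; the real boltmem provider would iterate its BoltDB buckets instead):

package historysketch

import (
    "sync"

    "github.com/prometheus/alertmanager/provider"
    "github.com/prometheus/alertmanager/types"
    "github.com/prometheus/common/model"
)

// memAlerts is a stand-in provider that keeps every alert it has seen,
// resolved or not.
type memAlerts struct {
    mtx    sync.RWMutex
    alerts map[model.Fingerprint]*types.Alert
}

// GetAll returns an iterator over every stored alert so the /history
// endpoint can list them.
func (a *memAlerts) GetAll() provider.AlertIterator {
    ch := make(chan *types.Alert, 200)
    done := make(chan struct{})

    go func() {
        defer close(ch)
        a.mtx.RLock()
        defer a.mtx.RUnlock()
        for _, alert := range a.alerts {
            select {
            case ch <- alert:
            case <-done:
                return
            }
        }
    }()
    // Assumes an iterator constructor like the one the in-memory provider
    // uses to wrap the channel pair into an AlertIterator.
    return provider.NewAlertIterator(ch, done, nil)
}

Returning an iterator here keeps the API handler the same no matter which provider implementation backs it.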
/ui/app/js/app.js
angular.module('am.controllers').controller('NavCtrl',
  function($scope, $location) {
    $scope.items = [{
      name: 'History',
      url: 'history'
    },
angular.module('am.services').factory('History',
  function($resource) {
    return $resource('', {}, {
      'query': {
        method: 'GET',
        url: 'api/v1/history'
      }
    });
  }
);
NavCtrl for the History menu item:
as well as a new History service:
angular.module('am.controllers').controller('HistoryCtrl',
  function($scope, History) {
    $scope.refresh = function() {
      History.query({},
        function(data) {
          $scope.groups = data.data;
          console.log($scope.groups);
        },
        function(data) {
          console.log(data.data);
        });
    };

    $scope.refresh();
  }
);
and a new History controller:
angular.module('am.directives').directive('history',
  function() {
    return {
      restrict: 'E',
      scope: {
        alert: '=',
        group: '='
      },
      templateUrl: 'app/partials/history.html'
    };
  }
);
Insert a new History directive:
Finally, we’ll need a page in which to view the transpired alerts. So, create a new file, history.html, under /ui/app/partials.
History.html will simply format and display a tabular recordset retrieved from our data provider.
/ui/app/partials/history.html
This example enhancement provides a view of transient history, that is, the period of history that the SQLite database holds.
AlertManager is not currently intended to provide long-term storage.
Contributing is easier than you may think.
IRC: #prometheus on irc.freenode.net
Mailing lists:
Prometheus repositories to file bugs and feature requests
yes, we're hiring