Thursday, April 18 • 11:50am - 12:30pm
Alarm Threshold Evaluation

This session will include the following subject(s):

Distributed & scalable alarm threshold evaluation:

A simple method of detecting threshold breaches for alarms is to do so directly "in-stream" as the metric datapoints are ingested. However, this approach is overly restrictive for wide-dimension metrics, where a datapoint from a single source is insufficient to perform the threshold evaluation. In-stream evaluation is also less suited to detecting missing or delayed data.

An alternative approach is to use a horizontally scaled array of threshold evaluators, partitioning the set of alarm rules across these workers. Each worker would poll for the aggregated metric corresponding to each rule it has been assigned.

The allocation of rules to evaluation workers could take into account both locality (ensuring rules applying to the same metric are handled by the same workers if possible) and fairness (ensuring the workload is evenly balanced across the current population of workers).
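One possible way to allocate rules is rendezvous (highest-random-weight) hashing keyed on the metric name. This is only a sketch under stated assumptions, not a proposal from the session itself: the function names, rule ids and worker names below are hypothetical. Hashing on the metric name gives locality, and the hash spreads metrics across workers evenly only in expectation, so a real allocator might need an explicit balancing step for fairness.

```python
import hashlib
from collections import defaultdict


def _score(worker: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (worker, key) pair."""
    digest = hashlib.sha256(f"{worker}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign_rules(rules, workers):
    """Map each worker to the list of rule ids it should evaluate.

    `rules` is an iterable of (rule_id, metric_name) pairs. The owner is
    chosen from the metric name, not the rule id, so every rule on the
    same metric lands on the same worker (locality).
    """
    if not workers:
        raise ValueError("at least one worker is required")
    allocation = defaultdict(list)
    for rule_id, metric in rules:
        # The worker with the highest score for this metric owns the rule.
        owner = max(workers, key=lambda w: _score(w, metric))
        allocation[owner].append(rule_id)
    return dict(allocation)


# Hypothetical rules and workers, for illustration only.
rules = [
    ("r1", "cpu_util"),
    ("r2", "cpu_util"),
    ("r3", "disk.read.bytes"),
    ("r4", "network.incoming.bytes"),
]
before = assign_rules(rules, ["worker-a", "worker-b", "worker-c"])
after = assign_rules(rules, ["worker-a", "worker-b"])  # worker-c removed
```

Because each rule has exactly one owner for a given worker list, no rule is evaluated twice as long as all workers agree on the membership. When a worker leaves, only the rules it owned move; rules owned by the surviving workers stay where they were.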

Logical combination of alarm states:

A mechanism to combine the states of multiple basic alarms into overarching meta-alarms could be useful in reducing noise from detailed monitoring. 

We would need to determine: 

* whether the meta-alarm threshold evaluation should be based on notification from basic alarms, or on re-evaluation of the underlying conditions 

* what complexity of logical combination we should support (number of basic alarms; &&, ||, !, subset-of, etc.) 

* whether an extended concept of simultaneity is required to handle lags in state changes
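To make the questions above concrete, here is a minimal sketch of the notification-based option, where a meta-alarm is computed from the last reported states of its basic alarms. The state names, the combinator names and the reading of "subset-of" as "at least k of n" are all assumptions made for illustration, not part of the session proposal.

```python
from typing import Callable, Dict

# Hypothetical state names for basic alarms.
OK, ALARM = "ok", "alarm"

# An expression maps {alarm_name: state} to True when it is "firing".
Expr = Callable[[Dict[str, str]], bool]


def alarm(name: str) -> Expr:
    """Leaf expression: true when the named basic alarm is in ALARM."""
    return lambda states: states[name] == ALARM


def all_of(*exprs: Expr) -> Expr:
    """Logical AND (&&) of the given expressions."""
    return lambda states: all(e(states) for e in exprs)


def any_of(*exprs: Expr) -> Expr:
    """Logical OR (||) of the given expressions."""
    return lambda states: any(e(states) for e in exprs)


def negate(expr: Expr) -> Expr:
    """Logical NOT (!) of an expression."""
    return lambda states: not expr(states)


def at_least(k: int, *exprs: Expr) -> Expr:
    """True when at least k of the expressions fire (one reading of subset-of)."""
    return lambda states: sum(1 for e in exprs if e(states)) >= k


def evaluate(expr: Expr, states: Dict[str, str]) -> str:
    """Return the meta-alarm state for the last reported basic states."""
    return ALARM if expr(states) else OK


# Hypothetical meta-alarm: (cpu_high || load_high) && !maintenance
meta = all_of(any_of(alarm("cpu_high"), alarm("load_high")),
              negate(alarm("maintenance")))
states = {"cpu_high": ALARM, "load_high": OK, "maintenance": OK}
result = evaluate(meta, states)
```

This sketch evaluates a single snapshot of states, so it does not address the simultaneity question: if basic alarms change state at slightly different times, the snapshot may briefly show a combination that never held in reality.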

The polling cycle would also provide a logical point to implement policies such as:

* correcting for metric lag
* gracefully handling sparse metrics versus detecting missing expected datapoints
* selectively excluding chaotic data.
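The policies listed above could be sketched as a single per-rule evaluation function that runs on each polling cycle. Everything here is an assumption for illustration: the parameter names, the "insufficient data" state, the use of a mean over the window, and the choice to return `None` (keep the previous state) for a sparse metric with no usable datapoints.

```python
from statistics import mean

# Hypothetical state names.
OK, ALARM, INSUFFICIENT = "ok", "alarm", "insufficient data"


def evaluate_rule(datapoints, now, period, threshold,
                  lag=0.0, expected_samples=None, exclude=None):
    """Evaluate one rule for one polling cycle.

    datapoints: iterable of (timestamp, value), timestamps in seconds.
    lag: shifts the window back in time to allow for late-arriving data.
    expected_samples: if set, fewer datapoints in the window is treated as
        missing data; if None, the metric is treated as sparse.
    exclude: optional predicate (timestamp, value) -> bool that drops
        datapoints judged too chaotic to use.
    """
    end = now - lag
    start = end - period
    window = [(t, v) for t, v in datapoints if start < t <= end]

    # Missing-data detection counts arrivals before any exclusion.
    if expected_samples is not None and len(window) < expected_samples:
        return INSUFFICIENT

    values = [v for t, v in window if exclude is None or not exclude(t, v)]
    if not values:
        return None  # sparse metric: nothing to judge, keep previous state

    return ALARM if mean(values) > threshold else OK


# Hypothetical datapoints reported once per minute.
points = [(60, 95.0), (120, 92.0), (180, 97.0)]
no_lag = evaluate_rule(points, now=250, period=180, threshold=90,
                       expected_samples=3)
with_lag = evaluate_rule(points, now=250, period=180, threshold=90,
                         lag=70, expected_samples=3)
```

In this example the unshifted window misses the oldest datapoint and reports missing data, while the lag-corrected window sees all three datapoints and can evaluate the threshold.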

This design session will discuss and agree on the best approaches to managing this distributed threshold evaluation while seamlessly handling up- and down-scaling of the worker pool (i.e. rebalancing fairly and avoiding duplicate evaluation).

Eoghan Glynn

Principal Engineer, Red Hat
Eoghan is a Principal Software Engineer in the Red Hat OpenStack Infrastructure group and is serving as Technical Lead for the OpenStack Telemetry Program over the Juno and Kilo cycles. Prior to OpenStack, Eoghan was at Amazon working on AWS monitoring services.

Thursday April 18, 2013 11:50am - 12:30pm