This session will include the following subject(s):
Distributed & scalable alarm threshold evaluation:
A simple method of detecting threshold breaches for alarms is to do so directly "in-stream" as the metric datapoints are ingested. However, this approach is overly restrictive for wide-dimension metrics, where a datapoint from a single source is insufficient to perform the threshold evaluation. In-stream evaluation is also less well suited to detecting missing or delayed data.
An alternative approach is to use a horizontally scaled array of threshold evaluators, partitioning the set of alarm rules across these workers. Each worker would poll for the aggregated metric corresponding to each rule it has been assigned.
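As an illustration of the polling approach, the following is a minimal sketch of a single evaluation worker; the names (`AlarmRule`, `evaluate_assigned`, the comparison values) are hypothetical, not an agreed API:

```python
# Hypothetical sketch: one worker polls the aggregated statistic for each
# alarm rule it has been assigned and derives an alarm state from it.

class AlarmRule:
    def __init__(self, name, metric, threshold, comparison="gt"):
        self.name = name            # alarm rule identifier
        self.metric = metric        # metric the rule applies to
        self.threshold = threshold  # breach threshold
        self.comparison = comparison  # "gt" or "lt"

def evaluate_assigned(rules, fetch_statistic):
    """Poll the aggregated statistic for each assigned rule and return
    the resulting alarm states keyed by rule name."""
    states = {}
    for rule in rules:
        value = fetch_statistic(rule.metric)  # e.g. avg over the period
        if value is None:
            # no datapoints available for this evaluation period
            states[rule.name] = "insufficient data"
        else:
            breached = (value > rule.threshold if rule.comparison == "gt"
                        else value < rule.threshold)
            states[rule.name] = "alarm" if breached else "ok"
    return states
```

Here `fetch_statistic` stands in for whatever query the worker issues against the aggregated metric store on each polling cycle.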
The allocation of rules to evaluation workers could take into account both locality (ensuring rules applying to the same metric are handled by the same workers if possible) and fairness (ensuring the workload is evenly balanced across the current population of workers).
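One way to combine the two goals is to group rules by metric (locality) and then greedily place each group on the least-loaded worker (fairness). The sketch below assumes rules are simple (name, metric) pairs; it is one possible allocation strategy, not the agreed design:

```python
from collections import defaultdict

def allocate(rules, workers):
    """Assign alarm rules to workers so that rules on the same metric
    stay together (locality) and each group lands on the currently
    least-loaded worker (fairness).

    `rules` is a list of (rule_name, metric_name) pairs.
    """
    # Group rules by the metric they apply to.
    by_metric = defaultdict(list)
    for rule_name, metric in rules:
        by_metric[metric].append(rule_name)

    load = {w: 0 for w in workers}
    assignment = defaultdict(list)
    # Place the largest groups first so the loads balance more evenly.
    for metric, group in sorted(by_metric.items(),
                                key=lambda kv: -len(kv[1])):
        target = min(workers, key=load.get)
        assignment[target].extend(group)
        load[target] += len(group)
    return dict(assignment)
```

A drawback of this purely greedy scheme is that a rescale reshuffles many rules; a consistent-hashing approach trades some locality for cheaper rebalancing.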
Logical combination of alarm states:
A mechanism to combine the states of multiple basic alarms into overarching meta-alarms could be useful in reducing noise from detailed monitoring.
We would need to determine:
* whether the meta-alarm threshold evaluation should be based on notification from basic alarms, or on re-evaluation of the underlying conditions
* what complexity of logical combination we should support (number of basic alarms; &&, ||, !, subset-of, etc.)
* whether an extended concept of simultaneity is required to handle lags in state changes
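Taking the notification-driven option from the first bullet as a working assumption, a meta-alarm could cache the last-known state of each basic alarm and re-combine on every state-change notification. The sketch below supports only && and || combination; the class and method names are illustrative:

```python
# Hypothetical notification-driven meta-alarm: the last-known state of
# each underlying basic alarm is cached, and the combined state is
# recomputed whenever a basic alarm notifies a state change.

def combine(states, operator="and"):
    """Combine basic alarm states ('ok'/'alarm') into a meta-alarm state
    using && ("and") or || ("or")."""
    flags = [s == "alarm" for s in states.values()]
    fired = all(flags) if operator == "and" else any(flags)
    return "alarm" if fired else "ok"

class MetaAlarm:
    def __init__(self, members, operator="and"):
        # assume all basic alarms start in the 'ok' state
        self.states = {m: "ok" for m in members}
        self.operator = operator

    def on_notification(self, member, new_state):
        """Record a basic alarm's state change and return the
        recombined meta-alarm state."""
        self.states[member] = new_state
        return combine(self.states, self.operator)
```

Richer combinators (negation, subset-of, simultaneity windows) would extend `combine`, at the cost of the complexity questions raised above.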
The evaluation workers' polling cycle would also provide a logical point at which to implement policies such as:
* correcting for metric lag
* gracefully handling sparse metrics versus detecting missing expected datapoints
* selectively excluding chaotic data
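The first two policies above can be sketched as follows; the function names and the grace-period heuristic are assumptions for illustration:

```python
# Two illustrative polling-cycle policies:
#  * correct for metric lag by evaluating a window offset into the past,
#    so late-arriving datapoints are still counted;
#  * tolerate legitimately sparse metrics by only reporting missing data
#    after several consecutive empty periods.

def query_window(now, period, lag):
    """Return the (start, end) of the evaluation window, ending `lag`
    seconds in the past so datapoints delayed by up to `lag` seconds
    are included."""
    end = now - lag
    return (end - period, end)

def missing_data_state(last_seen, now, period, grace_periods=3):
    """Report 'insufficient data' only once `grace_periods` evaluation
    periods have elapsed with no datapoint."""
    if last_seen is None or now - last_seen > grace_periods * period:
        return "insufficient data"
    return "ok"
```

The right lag offset and grace-period count would presumably be tunable per alarm, since sparseness is a property of the metric rather than of the evaluator.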
This design session will discuss and agree on the best approaches to managing this distributed threshold evaluation, while seamlessly handling up- and down-scaling of the worker pool (i.e. rebalancing fairly and avoiding duplicate evaluation).
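One candidate mechanism for that rebalancing, offered here only as a starting point for discussion, is a consistent-hash ring: when a worker joins or leaves, only the rules hashed near it change owner, and every rule has exactly one owner at any time, avoiding duplicate evaluation.

```python
import hashlib
from bisect import bisect

# Minimal consistent-hash ring (illustrative, not the agreed design).
# Each worker contributes several virtual nodes; a rule is owned by the
# first virtual node clockwise of its hash.

class HashRing:
    def __init__(self, workers, replicas=64):
        self.ring = sorted(
            (self._hash(f"{w}:{i}"), w)
            for w in workers for i in range(replicas))
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, rule_id):
        """Return the single worker responsible for evaluating rule_id."""
        idx = bisect(self._keys, self._hash(rule_id)) % len(self.ring)
        return self.ring[idx][1]
```

On up-scaling, only the rules whose hashes fall in the new worker's arcs migrate to it; all other assignments are undisturbed, so no re-balancing coordination beyond agreeing on the worker membership is required.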