Design Summit
Wednesday, April 17
 

4:30pm

Intro to Ceilometer Architecture
This session will provide a walk-through of the existing pieces of ceilometer as an introduction for new contributors and a refresher for the existing team, to serve as a basis for the rest of the discussions during the summit.

(Session proposed by Doug Hellmann)


Wednesday April 17, 2013 4:30pm - 5:10pm
B116

5:20pm

Feedback from Ceilometer users
The goal of this session is to gather feedback from Ceilometer users. We should invite anyone who has deployed Ceilometer to join this session and briefly explain to us (5 min per user):
* their architecture
* their pains
* their successes
So that we can learn and improve.

Users can be admins, devops, developers, or anyone who has had to deploy or interface with Ceilometer.

(Session proposed by Nick Barcet)


Wednesday April 17, 2013 5:20pm - 6:00pm
B116
 
Thursday, April 18
 

9:00am

Incremental improvement grab-bag
This session will include the following subject(s):

Incremental improvement grab-bag:

There are several incremental improvements that we should talk about, but that won't require a full hour session to discuss (I hope).



(Session proposed by Doug Hellmann)

Enable/Disable/Configure a pollster in runtime:

When using ceilometer for monitoring, users sometimes want to enable or disable, at runtime, pollsters that exist only for testing or debugging purposes, without modifying the configuration file and restarting the agent.

In addition, some users might want a pollster to monitor only part of the resources available to it, e.g. only one specific nova instance. These users need to pass the instance UUID to the pollster as a configuration parameter at runtime.

We might need to design a framework that allows the user to use the "management-API" to do the following things at runtime:
- enable/disable a pollster
- get/set a configuration parameter for a pollster
- ask a pollster to start polling immediately, instead of waiting for the other pollsters in the same polling task to finish before it can start polling.

That framework could also be extended to manage publishers.
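
As a rough illustration of the three operations listed above, here is a minimal in-memory sketch. The `PollsterRegistry` class, its method names, and the `cpu_pollster` stand-in are all hypothetical and are not part of Ceilometer; a real design would sit behind the management API and talk to running agents.

```python
class PollsterRegistry:
    """Hypothetical runtime registry of pollsters (illustration only)."""

    def __init__(self):
        # name -> {"enabled": bool, "params": dict, "poll": callable}
        self._pollsters = {}

    def register(self, name, poll, enabled=True, **params):
        self._pollsters[name] = {
            "enabled": enabled,
            "params": dict(params),
            "poll": poll,
        }

    def set_enabled(self, name, enabled):
        """Enable or disable a pollster without restarting anything."""
        self._pollsters[name]["enabled"] = enabled

    def set_param(self, name, key, value):
        self._pollsters[name]["params"][key] = value

    def get_param(self, name, key, default=None):
        return self._pollsters[name]["params"].get(key, default)

    def poll_now(self, name):
        """Run one pollster immediately, outside the regular polling task."""
        entry = self._pollsters[name]
        return entry["poll"](**entry["params"])

    def poll_all(self):
        """Run every enabled pollster, as a periodic polling task would."""
        return {
            name: entry["poll"](**entry["params"])
            for name, entry in self._pollsters.items()
            if entry["enabled"]
        }


def cpu_pollster(instance_uuid=None):
    # Stand-in for a real pollster: returns made-up samples,
    # optionally restricted to a single instance UUID.
    samples = {"uuid-a": 10, "uuid-b": 20}
    if instance_uuid is not None:
        return {instance_uuid: samples[instance_uuid]}
    return samples


registry = PollsterRegistry()
registry.register("cpu", cpu_pollster)
registry.set_param("cpu", "instance_uuid", "uuid-a")  # monitor one instance only
print(registry.poll_now("cpu"))
```

Keeping the parameters next to the enabled flag means the same registry shape could hold publishers as well, which is the extension the proposal mentions.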

(Session proposed by Lianhao Lu)


Thursday April 18, 2013 9:00am - 9:40am
B116

9:50am

Double Entry Auditing of collected metrics in CM
In order to offer SEC-compliant billing we need to validate collected metrics from two sources. There needs to be an audit trail for important metrics such as instance lifecycle, bandwidth and storage usage. How might this be accomplished with CM?
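
One way to picture validation from two sources is a reconciliation pass that compares the same metric as reported by each source and records every disagreement. The sketch below is only an assumption about what such a check could look like; the `reconcile` function, the key layout, and the tolerance parameter are hypothetical and not an existing Ceilometer feature.

```python
def reconcile(primary, secondary, tolerance=0.0):
    """Compare two {(resource_id, meter): value} mappings.

    Returns a list of (key, primary_value, secondary_value, reason) tuples,
    one per disagreement, which could be written to an audit trail.
    """
    discrepancies = []
    for key in sorted(set(primary) | set(secondary)):
        a = primary.get(key)
        b = secondary.get(key)
        if a is None or b is None:
            # One source never reported this metric at all.
            discrepancies.append((key, a, b, "missing"))
        elif abs(a - b) > tolerance:
            discrepancies.append((key, a, b, "mismatch"))
    return discrepancies


# Made-up values from two independent collection paths.
from_notifications = {("inst-1", "bandwidth"): 100.0, ("inst-2", "bandwidth"): 50.0}
from_polling = {("inst-1", "bandwidth"): 100.0, ("inst-2", "bandwidth"): 55.0}

for entry in reconcile(from_notifications, from_polling):
    print(entry)
```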

(Session proposed by Sandy Walsh)


Thursday April 18, 2013 9:50am - 10:30am
B116

11:00am

API improvements for Ceilometer
This session will include the following subject(s):

API improvements for Ceilometer:

The API needs to evolve in order to answer more advanced questions from billing engines, such as:

- Give me the maximum usage of a resource that lasted more than 1h
- Give me the use of a resource over a period of time, listing changes by increment of X volume over a period of Y time
- Provide a GROUP BY function
- Provide additional statistical functions (Deviation, Median, Variation, Distribution, Slope, etc.) which could be given as multiple results for a given data set collection
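
To make the GROUP BY and multi-statistic requests above concrete, here is a small client-side sketch of the kind of result such a query might return. It is not the Ceilometer API: the `group_stats` function and the sample dictionary layout are assumptions, and the statistics are computed with Python's standard `statistics` module.

```python
from collections import defaultdict
import statistics


def group_stats(samples, group_by):
    """Group samples by one field and return several statistics per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_by]].append(sample["volume"])
    return {
        key: {
            "max": max(volumes),
            "median": statistics.median(volumes),
            # Population standard deviation of the group's volumes.
            "stddev": statistics.pstdev(volumes),
        }
        for key, volumes in groups.items()
    }


# Made-up samples; field names are illustrative.
samples = [
    {"project": "a", "volume": 1.0},
    {"project": "a", "volume": 3.0},
    {"project": "b", "volume": 5.0},
]
print(group_stats(samples, group_by="project"))
```

Returning all statistics for a group in one result matches the idea of giving "multiple results for a given data set collection" rather than one call per function.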


(Session proposed by Nick Barcet)

Ceilometer API extensions:

Some enhancements to the API would allow support for a broader set of use cases.

(Session proposed by Phil Neal)


Thursday April 18, 2013 11:00am - 11:40am
B116

11:50am

Alarm Threshold Evaluation

This session will include the following subject(s):

Distributed & scalable alarm threshold evaluation:

A simple method of detecting threshold breaches for alarms is to do so directly "in-stream" as the metric datapoints are ingested. However, this approach is overly restrictive when it comes to wide-dimension metrics, where a datapoint from a single source is insufficient to perform the threshold evaluation. The in-stream evaluation approach is also less suited to detecting missing or delayed data.

An alternative approach is to use a horizontally scaled array of threshold evaluators, partitioning the set of alarm rules across these workers. Each worker would poll for the aggregated metric corresponding to each rule it has been assigned.

The allocation of rules to evaluation workers could take into account both locality (ensuring rules applying to the same metric are handled by the same workers if possible) and fairness (ensuring the workload is evenly balanced across the current population of workers).
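
One possible allocation scheme, shown here purely as a sketch and not as the approach agreed in the session, is rendezvous (highest-random-weight) hashing keyed on the metric name. Rules on the same metric always land on the same worker (locality), the spread across workers is only statistically even (approximate fairness), and removing a worker moves only the rules that worker owned. The function names and rule tuple layout are assumptions.

```python
import hashlib


def _score(worker, metric):
    # Deterministic pseudo-random weight for a (worker, metric) pair.
    digest = hashlib.sha256(f"{worker}:{metric}".encode()).hexdigest()
    return int(digest, 16)


def assign_rules(rules, workers):
    """Assign each (rule_id, metric) pair to exactly one worker.

    The owner of a metric is the worker with the highest score for it,
    so every rule on that metric goes to the same worker.
    """
    assignment = {worker: [] for worker in workers}
    for rule_id, metric in rules:
        owner = max(workers, key=lambda worker: _score(worker, metric))
        assignment[owner].append(rule_id)
    return assignment


rules = [("r1", "cpu_util"), ("r2", "cpu_util"), ("r3", "disk.read.bytes")]
print(assign_rules(rules, ["worker-1", "worker-2", "worker-3"]))
```

Because each rule has exactly one owner for a given worker list, duplicate evaluation is avoided as long as all workers agree on the current membership; how that agreement is reached is outside this sketch.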

Logical combination of alarm states:

A mechanism to combine the states of multiple basic alarms into overarching meta-alarms could be useful in reducing noise from detailed monitoring. 

We would need to determine: 

* whether the meta-alarm threshold evaluation should be based on notification from basic alarms, or on re-evaluation of the underlying conditions 

* what complexity of logical combination we should support (number of basic alarms; &&, ||, !, subset-of, etc.) 

* whether an extended concept of simultaneity is required to handle lags in state changes
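
The logical combinations in question can be sketched as a small expression evaluator over the current states of basic alarms. This is an illustration under assumptions: the tuple-based expression format, the `at_least` operator (one reading of "subset-of"), and the alarm names are hypothetical. It takes the notification-based route, using already-known basic alarm states rather than re-evaluating the underlying conditions.

```python
def evaluate(expr, states):
    """Evaluate a meta-alarm expression.

    expr is either an alarm name (str) or a tuple:
      ("and", e1, e2, ...), ("or", e1, e2, ...), ("not", e),
      ("at_least", n, e1, e2, ...)
    states maps alarm name -> True if that basic alarm is firing.
    """
    if isinstance(expr, str):
        return states[expr]
    op, *args = expr
    if op == "and":
        return all(evaluate(arg, states) for arg in args)
    if op == "or":
        return any(evaluate(arg, states) for arg in args)
    if op == "not":
        (arg,) = args
        return not evaluate(arg, states)
    if op == "at_least":
        n, *rest = args
        return sum(evaluate(arg, states) for arg in rest) >= n
    raise ValueError(f"unknown operator: {op!r}")


states = {"cpu_high": True, "disk_full": False, "net_down": True}
print(evaluate(("and", "cpu_high", ("not", "disk_full")), states))
print(evaluate(("at_least", 2, "cpu_high", "disk_full", "net_down"), states))
```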

The polling cycle would also provide a logical point to implement policies such as:

* correcting for metric lag
* gracefully handling sparse metrics versus detecting missing expected datapoints
* selectively excluding chaotic data.

This design session will discuss and agree on the best approaches to managing this distributed threshold evaluation, while seamlessly handling up- and down-scaling of the worker pool (i.e. fairly re-balancing and avoiding duplicate evaluation).




Speakers
Eoghan Glynn

Principal Engineer, Red Hat
Eoghan is a Principal Software Engineer at the Red Hat OpenStack Infrastructure group, and is serving as Technical Lead for the OpenStack Telemetry Program over the Juno & Kilo cycles. Prior to OpenStack, Eoghan was at Amazon working on AWS monitoring services.


Thursday April 18, 2013 11:50am - 12:30pm
B116

1:30pm

Time series data manipulation in nosql stores

This session will include the following subject(s):

Time series data manipulation in nosql stores:

Ceilometer currently supports multiple storage drivers (mongodb, sqlalchemy, hbase) behind a well-defined abstraction.

The purpose of this design session is to discuss how well suited the existing nosql stores are to the efficient manipulation of time-series data.

The questions to be decided would include:

* whether we could optimize/improve our existing schemas in this regard

* whether we should consider a storage driver based on Cassandra in order to take advantage of its well-known suitability for time-series data (TSD)
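
As one example of a schema-level optimization that could be discussed, a common pattern in NoSQL stores is to group samples into fixed time buckets, so that one document or row holds all datapoints for a resource and meter within, say, an hour. The sketch below only shows the bucketing logic in plain Python; it does not reflect Ceilometer's actual schemas, and the function name and sample tuple layout are assumptions.

```python
from collections import defaultdict


def bucket_samples(samples, bucket_seconds=3600):
    """Group (resource_id, meter, epoch_ts, volume) samples into time buckets.

    Returns {(resource_id, meter, bucket_start): {offset_seconds: volume}},
    i.e. one "document" per resource, meter and bucket.
    """
    buckets = defaultdict(dict)
    for resource_id, meter, ts, volume in samples:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[(resource_id, meter, bucket_start)][ts - bucket_start] = volume
    return dict(buckets)


samples = [
    ("inst-1", "cpu_util", 3600, 1.0),
    ("inst-1", "cpu_util", 3660, 2.0),
    ("inst-1", "cpu_util", 7200, 3.0),
]
for key, datapoints in bucket_samples(samples).items():
    print(key, datapoints)
```

A range query then reads a handful of bucket documents instead of one record per sample, at the cost of more complex writes.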

(Session proposed by Eoghan Glynn)

The dotted line between metering and metric/alarms:

There is clear commonality in the data acquisition & transformation layers for gathering metering and metric data.

However, the further we venture through the pipeline, the more operational concerns arise around over-sharing of common infrastructure in the transport and storage layers.

We need to tie down exactly where we see the dotted line between the handling of metering and metric data, deciding whether:

* a common conduit in the form of AMQP should be used for publication (for example given that during a brownout in the RPC layer, we would need a timely metric flow more than ever)

* a common storage layer should be used for persistence (for example given that data retention requirements may differ significantly)

* a common API layer should provide aggregation (for example given that certain aggregations such as percentile may make far more sense for metric rather than metering data)




Speakers
Eoghan Glynn


Thursday April 18, 2013 1:30pm - 2:10pm
B116

2:20pm

Simple messaging action for Alerting
As we develop alerting in Ceilometer, it might be a good idea to provide simple destination endpoints so that alerts can be forwarded as:
- events on the oslo RPC bus
- emails (SMTP)
- SMS
- Nagios alerts
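
The list above suggests a dispatcher with one pluggable handler per destination type. The following is a minimal sketch of that idea, not an existing Ceilometer interface: `AlertDispatcher`, the alert dictionary fields, and the addresses are hypothetical. The email handler only builds a message with the standard library's `EmailMessage`; actually sending it (for example via `smtplib`) is left out.

```python
from email.message import EmailMessage


def build_email(alert, sender, recipient):
    """Build (but do not send) an email describing an alert."""
    msg = EmailMessage()
    msg["Subject"] = f"[{alert['state']}] {alert['name']}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(alert["reason"])
    return msg


class AlertDispatcher:
    """Hypothetical dispatcher mapping destination kinds to handlers."""

    def __init__(self):
        self._handlers = {}

    def register(self, kind, handler):
        self._handlers[kind] = handler

    def dispatch(self, alert, destinations):
        """Forward an alert to each (kind, target) destination."""
        results = []
        for kind, target in destinations:
            handler = self._handlers.get(kind)
            if handler is None:
                results.append((kind, target, "unsupported"))
                continue
            handler(alert, target)
            results.append((kind, target, "sent"))
        return results


outbox = []  # stands in for a real SMTP connection
dispatcher = AlertDispatcher()
dispatcher.register(
    "email",
    lambda alert, to: outbox.append(build_email(alert, "ceilometer@example.com", to)),
)

alert = {"name": "cpu_high", "state": "alarm", "reason": "cpu_util > 90 for 3 periods"}
print(dispatcher.dispatch(alert, [("email", "ops@example.com"), ("sms", "+10000000000")]))
```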

(Session proposed by Nick Barcet)


Thursday April 18, 2013 2:20pm - 3:00pm
B116

3:20pm

Alarm state and history management

We need to tie down the requirements for managing the state and history of alarms, for example providing:

* an API to allow users to define and modify alarm rules

* an API to query current alarm state and modify this state for testing purposes

* a period for which alarm history is retained and is accessible to the alarm owner (likely to have less stringent data retention requirements than regular metering data)

* an administrative API to support across-the-board querying of state transitions for a particular period (useful when assessing the impact of operational issues in the metric pipeline)
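
A minimal sketch of the state and history requirements above, assuming an in-memory store: the `AlarmHistory` class, its method names, and the state labels ("ok", "alarm", "insufficient data") are illustrative assumptions rather than a defined API.

```python
class AlarmHistory:
    """Hypothetical store for current alarm state and recent transitions."""

    UNKNOWN = "insufficient data"

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self._state = {}        # alarm_id -> current state
        self._transitions = []  # (timestamp, alarm_id, old_state, new_state)

    def set_state(self, alarm_id, new_state, now):
        """Set state (also usable to force a state for testing)."""
        old_state = self._state.get(alarm_id, self.UNKNOWN)
        if old_state != new_state:
            self._transitions.append((now, alarm_id, old_state, new_state))
            self._state[alarm_id] = new_state

    def current(self, alarm_id):
        return self._state.get(alarm_id, self.UNKNOWN)

    def transitions(self, start, end, alarm_id=None):
        """Query transitions in [start, end]; alarm_id=None means all alarms."""
        return [
            t for t in self._transitions
            if start <= t[0] <= end and (alarm_id is None or t[1] == alarm_id)
        ]

    def expire(self, now):
        """Drop transitions older than the retention period."""
        cutoff = now - self.retention
        self._transitions = [t for t in self._transitions if t[0] >= cutoff]


history = AlarmHistory(retention_seconds=3600)
history.set_state("cpu_high", "alarm", now=100)
history.set_state("cpu_high", "ok", now=200)
print(history.current("cpu_high"), history.transitions(0, 300))
```

The `alarm_id=None` form of `transitions` corresponds to the across-the-board administrative query, while passing an ID gives the per-owner view.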




Speakers
Eoghan Glynn


Thursday April 18, 2013 3:20pm - 4:00pm
B116

4:10pm

Ceilometer support for advanced billing models
We need to ensure the metering architecture can support advanced billing models like direct Nova billing, Windows instance and PaaS service controller-owned Nova instances, application billing, and variations on quantity and overage billing
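
As a small worked example of what overage billing means in practice, the sketch below charges a flat base fee that covers an included quantity, plus a per-unit rate for anything above it. The function, its parameters, and the figures are made up for illustration; they are not drawn from any billing engine or from Ceilometer.

```python
def overage_charge(usage, included, base_fee, overage_rate):
    """Base fee covers `included` units; extra units are billed at overage_rate."""
    extra_units = max(0, usage - included)
    return base_fee + extra_units * overage_rate


# 120 instance-hours used, 100 included, 10.0 base fee, 0.5 per extra hour.
print(overage_charge(usage=120, included=100, base_fee=10.0, overage_rate=0.5))
```

The metering side only has to supply the `usage` figure reliably; the other three inputs belong to the rating and billing systems.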

(Session proposed by Phil Neal)


Thursday April 18, 2013 4:10pm - 4:50pm
B116

5:00pm

Supporting rich data types and aggregation ...
This session will review some of the real-world metrics collected at Rackspace and discuss how this data might be stored in Ceilometer.

The potential ramifications on the query interface and multi-publisher will also be discussed.

(Session proposed by Sandy Walsh)


Thursday April 18, 2013 5:00pm - 5:40pm
B116