Thursday, April 18 • 11:50am - 12:30pm
Failure management in the gate

In the run-up to the Grizzly release we had a few really bad days when gate resets came fast and furious, and a change would take, on average, 6 to 8 hours to merge. This led to a conversation in which the nova dev team seriously considered turning off gate checking entirely.

This session would brainstorm ways to find and get to the bottom of gate failures faster, and hopefully to reduce them over time.

It would include:
* ways to optimize gate resets. When we know a test has failed, can we reset early, instead of waiting for the train wreck to complete?
* ways to get to the bottom of failures fast. The recheck page was a good start, but its information turned out to be fairly static, and people corrupted the data by choosing bugs poorly.
* ways to analyze failures (some sort of failure dashboard), and to figure out what restrictions infra places on tooling for this.
* ways to alert users, so that the answer to "is there a problem?" isn't "keep an -infra tab open and scroll back."

(Session proposed by Sean Dague)

