Mozilla is collecting over 1,000 Telemetry probes which give rise to histograms, like the one in the figure below, that change slightly every day.
Until lately the only way to monitor those histogram was to sit down and literally stare the screen while something interesting was spotted. Clearly there was the need for an automated system which is able to discern between noise and real regressions.
Noise is a major challenge, even more so than with Talos data, as Telemetry data is collected from a wide variety of computers, configurations and workloads. A reliable mean of detecting regressions, improvements and changes in a measurement’s distribution is fundamental as erroneous alerts (false positives) tend to annoy people to the point that they just ignore any warning generated by the system.
I have looked at various methods to detect changes in histogram, like
- Correlation Coefficient
- Chi-Square Test
- Mann-Whitney Test
- Kolmogorov-Smirnov test of the estimated densities
- One Class Support Vector Machine
- Bhattacharyya Distance
Only the Bhattacharyya distance proved satisfactory for our data. There are several reasons why each of the previous methods fails with our dataset.
For instance a one class SVM wouldn’t be a bad idea if some distributions wouldn’t change dramatically over the course of time due to regressions and/or improvements in our code; so in other words, how do you define how a distribution should look like? You could just take the daily distributions of the past week as training set but that wouldn’t be enough data to get anything meaningful from a SVM. A Chi-Square test instead is not always applicable as it doesn’t allow cells with an expected count of 0. We could go on for quite a while and there are ways to get around those issues but the reader is probably more interested in the final solution. I evaluated how well those methods are actually at pinpointing some past known regressions and the Bhattacharyya distance proved to be able to detect the kind of pattern changes we are looking for, like distributions shifts or bin swaps, while minimizing the number of false positives.
Having a relevant distance metric is only part of the deal since we still have to decide what to compare. Should we compare the distribution of today’s build-id against the one from yesterday? Or the one from a week ago? It turns out that trying to mimic what an human would do yields a very accurate algorithm. If the variance of the distance between the histogram of the current build-id and the histograms of the past N build-ids is small enough and the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K, a regression is reported. Furthermore, Histograms that don’t have enough data are filtered out and the cut-off values are determined empirically from past known regressions.
I am pretty satisfied with the detected regressions so far, for instance the system was able to correctly detect a regression caused by the OMTC patch that landed the 20st of May which caused a significant change in the the average frame interval during tab open animation:
We will soon roll-out a feature to allow histogram authors to be notified through e-mail when an histogram change occurs. In the meantime you can have a look at the detected regressions in the dashboard.