Telemetry meets Clojure.

tldr: Data-related Telemetry alerts (e.g. for histograms or main-thread IO) are now aggregated by medusa, which allows devs to post, view and filter alerts. The dashboard also lets users subscribe to search criteria or individual metrics.

As mentioned in my previous post, we recently switched to a dashboard generator, “iacomus”, to visualize the data produced by some of our periodic map-reduce jobs. Given that the dashboards gained some metadata that describes their datasets, writing a regression detection algorithm based on the iacomus data format followed naturally.

The algorithm generates a time series for each possible combination of the filtering and sorting criteria of a dashboard, compares the latest data point to the distribution of the previous N, and generates an alert if it detects an outlier. Stats 101.
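
A minimal sketch of that check in Python (the window size, the use of the median absolute deviation and the threshold are all illustrative choices, not the production values):

```python
import numpy as np

def is_outlier(series, window=10, threshold=2.5):
    """Flag the latest point of `series` if it falls too far away from
    the distribution of the previous `window` points."""
    if len(series) < window + 1:
        return False  # not enough history to judge
    history = np.asarray(series[-window - 1:-1], dtype=float)
    latest = series[-1]
    center = np.median(history)
    spread = np.median(np.abs(history - center))  # MAD, robust to noisy points
    if spread == 0:
        return False
    return abs(latest - center) / spread > threshold

# one series per combination of a dashboard's filtering/sorting options
print(is_outlier([10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 25]))  # True
```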

Alerts are aggregated by medusa, which provides a RESTful API to submit them and exposes a dashboard where users can view and filter alerts using regular expressions, and subscribe to them.

Writing the aggregator and regression detector in Clojure[script] has been a lot of fun. I found it particularly attractive that Clojure doesn’t have any big web framework à la Ruby or Python that forces you into one specific mindset. Instead you can roll your own using a wide set of libraries, like:

  • HTTP-Kit, an event-driven HTTP client/server
  • Compojure, a routing library
  • Korma, a SQL DSL
  • Liberator, RESTful resource handlers
  • om, React.js interface for Clojurescript
  • secretary, a client-side routing library

The ability to easily compose functionality from different libraries is exceptionally well captured by a quote from Alan Perlis: “It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures”. Instead of each library having its own set of independent abstractions and data structures, Clojure libraries tend to use mostly just lists, vectors, sets and maps, which greatly simplifies interoperability.

Lisp gets criticized for its syntax, or lack thereof, but I don’t feel that’s fair. Using any editor that inserts and balances parentheses for you does the trick. I also feel like I didn’t have to run a background thread in my mind wondering whether what I was writing would please the compiler or not, unlike in Scala for instance. Not to mention macros, which allow one to easily extend the compiler with user-defined code. The expressiveness of Clojure also means that more thought is required per LOC, but that might just be a side effect of not being a full-time functional programmer.

What I do miss in the Clojure ecosystem is a strong set of tools for statistics and machine learning. Incanter is a wonderful library but, coming from an R and Python/SciPy background, I had the impression that there is still a lot of catching up to do.

Dashboard generator for custom Telemetry jobs

tldr: Next time you are in need of a dashboard similar to the one used to monitor main-thread IO, please consider using my dashboard generator which takes care of displaying periodically generated data.

So you wrote your custom analysis for Telemetry, your map-reduce job is finally giving you the desired data and you want to set it up so that it runs periodically. You will need some sort of dashboard to monitor the weekly runs, but since you don’t really care how it’s done, what do you do? You copy-paste the code of one of our current dashboards, tweak a little here and there, and off you go.

That basically describes all of the recent dashboards, like the one for main-thread IO (mea culpa). Writing dashboards is painful when the only thing you care about is data. Once you finally have what you were looking for, the way you present it is often an afterthought at best. But maintaining N dashboards quickly becomes unpleasant.

But what exactly makes writing and maintaining dashboards so painful? Simply that the more controls you have, the more kinds of events you have to handle, and the quicker things get out of hand. You start with something small and beautiful that just displays some CSV and presto, you end up with what should have been properly described as a state machine but instead is a mess of intertwined event handlers.

What I was looking for was something along the lines of Shiny for R, but in JavaScript and with the option of a client-only interface. It turns out that React does more or less what I want. It’s not necessarily meant for data analysis, so there aren’t any plotting facilities, but everything is there to roll your own. What makes Shiny and React so useful is that they embrace reactive programming. Once you define a state and a set of dependencies, i.e. a data flow graph in practical terms, changes that affect the state are automatically propagated to the right components. Even though this can be seen as overkill for small dashboards, it makes it extremely easy to extend them when the set of possible states expands, which is almost always what happens.

To make things easier for developers I wrote a dashboard generator, iacomus, for use cases similar to the ones we currently have. It can be used in simple scenarios when:

  • the data is collected in CSV files on a weekly basis, usually keyed by build-ids;
  • the dashboard should compare the current week against the previous one and mark differences in rankings;
  • it should be possible to go back and forward in time;
  • the dashboard should provide some filtering and sorting criteria.

Iacomus is customizable through a configuration file that is specified through a GET parameter. Since it’s hosted on GitHub, you just have to provide the data and don’t even have to spend time deploying the dashboard somewhere, assuming the machine serving the configuration file supports CORS. Here is what the end result looks like using the data for the add-on start-up scan time dashboard. Note that Chrome currently doesn’t handle our gzipped datasets properly and is unable to display anything, in case you wonder…

My next immediate goal is to simplify writing map-reduce jobs for the above mentioned use cases, or at the very least to write down some guidelines. For instance, some of our dashboards are based on Firefox’s version numbers and not on build-ids, which are really what you want when making weekly comparisons of Nightly.

Another interesting thought would be to automatically detect differences in the dashboards and send alerts. That might not be as easy with the current data, since a quick look at the dashboards makes it clear that the rankings fluctuate quite a bit. We would have to collect daily reports and account for the variance of the rankings in those, as just using a few weekly data points is not reliable enough.

Regression detection for Telemetry histograms.

tldr: An automatic regression detector system for Telemetry data has been deployed; the detected regressions can be seen in the dashboard.

Mozilla is collecting over 1,000 Telemetry probes which give rise to histograms, like the one in the figure below, that change slightly every day.

Average frame interval during any tab open/close animation (excluding tabstrip scroll).

Until recently the only way to monitor those histograms was to sit down and literally stare at the screen until something interesting was spotted. Clearly there was a need for an automated system able to discern between noise and real regressions.

Noise is a major challenge, even more so than with Talos data, as Telemetry data is collected from a wide variety of computers, configurations and workloads. A reliable means of detecting regressions, improvements and changes in a measurement’s distribution is fundamental, as erroneous alerts (false positives) tend to annoy people to the point that they just ignore any warning generated by the system.

I have looked at various methods to detect changes in histograms, like

  • Correlation Coefficient
  • Chi-Square statistic
  • U statistic (Mann-Whitney)
  • Kolmogorov-Smirnov statistic of the estimated densities
  • One Class Support Vector Machine
  • Bhattacharyya Distance

Only the Bhattacharyya distance proved satisfactory for our data. There are several reasons why each of the previous methods fails with our dataset.

For instance, a one-class SVM wouldn’t be a bad idea if some distributions didn’t change dramatically over time due to regressions and/or improvements in our code; in other words, how do you define what a distribution should look like? You could take the daily distributions of the past week as a training set, but that wouldn’t be enough data to get anything meaningful out of an SVM. A Chi-Square statistic, on the other hand, is not always applicable as it doesn’t allow cells with an expected count of 0. We could go on for quite a while, and there are ways to get around those issues, but the reader is probably more interested in the final solution. I evaluated how good those methods actually are at pinpointing some past known regressions, and the Bhattacharyya distance proved able to detect the kind of pattern changes we are looking for, like distribution shifts or bin swaps, while minimizing the number of false positives.
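
For reference, here is what the Bhattacharyya distance between two histograms boils down to; a small Python sketch of my own, not the production code:

```python
import numpy as np

def bhattacharyya_distance(hist_p, hist_q):
    """Bhattacharyya distance between two histograms sharing the same bins."""
    p = np.asarray(hist_p, dtype=float)
    q = np.asarray(hist_q, dtype=float)
    p /= p.sum()  # normalize raw counts to probabilities
    q /= q.sum()
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient: 1 means identical
    return -np.log(bc)

print(bhattacharyya_distance([5, 10, 5], [5, 10, 5]))  # 0.0, identical shapes
print(bhattacharyya_distance([5, 10, 5], [1, 2, 40]))  # ~0.34, a clear shift
```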

Having a relevant distance metric is only part of the deal, since we still have to decide what to compare. Should we compare the distribution of today’s build-id against the one from yesterday? Or the one from a week ago? It turns out that trying to mimic what a human would do yields a good algorithm: if

  • the variance of the distance between the histogram of the current build-id and the histograms of the past N build-ids is small enough and
  • the distance between the histograms of the current build-id and the previous build-id is above a cutoff value, i.e. there is a significant difference, and
  • a significant difference is also present in the next K build-ids, then a distribution change is reported.

Furthermore, histograms that don’t have enough data are filtered out, and the cut-off values and parameters are determined empirically from past known regressions.
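
Putting the pieces together, a rough Python sketch of the rules above could look like the following. It reuses the bhattacharyya_distance helper from the earlier sketch; the thresholds, window sizes and the exact baselines of the comparisons are my own guesses, not the values used in production:

```python
import numpy as np

def detect_change(histograms, i, n_back=7, k_forward=2,
                  cutoff=0.12, max_variance=0.01, min_count=1000):
    """Decide whether the histogram of build-id `i` marks a distribution change.
    `histograms` is a chronologically ordered list of raw-count arrays."""
    if i < n_back or i + k_forward >= len(histograms):
        return False
    if sum(histograms[i]) < min_count:
        return False  # not enough data, filter the histogram out

    # 1. the distances to the past N build-ids must be stable (low variance)
    past = [bhattacharyya_distance(histograms[i], histograms[i - j])
            for j in range(1, n_back + 1)]
    if np.var(past) > max_variance:
        return False
    # 2. the previous build-id must differ by more than the cutoff
    if bhattacharyya_distance(histograms[i], histograms[i - 1]) <= cutoff:
        return False
    # 3. the difference must persist in the next K build-ids
    future = [bhattacharyya_distance(histograms[i - 1], histograms[i + j])
              for j in range(1, k_forward + 1)]
    return all(d > cutoff for d in future)
```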

I am pretty satisfied with the detected regressions so far; for instance, the system was able to correctly detect a regression caused by the OMTC patch that landed on the 20th of May, which caused a significant change in the average frame interval during tab open animation:

Average frame interval during tab open animation of about:newtab.

We will soon roll out a feature to allow histogram authors to be notified through e-mail when a histogram change occurs. In the meantime you can have a look at the detected regressions in the dashboard.

Most popular GPUs on Nightly and Release

We have a long-standing bug about Firefox not scrolling smoothly with Intel GPUs. It turns out that recently, thanks to a driver update, performance has drastically improved for some of our users. While trying to confirm this hypothesis I found some interesting data about Firefox users on Nightly 33 and Release 30 that might come in handy when deciding what hardware to run our benchmarks on in the future.

First things first, let’s have a look at the popularity of the various GPU vendors:

That doesn’t really come as a surprise considering how ubiquitous Intel chips are nowadays. It also means we should concentrate our optimization efforts on Intel GPUs, since that’s where we can have the biggest impact.

But how well does Firefox perform with GPUs from the above mentioned vendors? It turns out that one of our Telemetry metrics, which measures the average time to render a frame during a tab animation, comes in pretty handy to answer this question:

Ideally we would like to reach 60 frames per second for every vendor. Considering that practically every second user has an Intel GPU, the fact that our performance on those chips is not splendid weighs even more on our shoulders. Though, as one might suspect, Intel chips are usually not as performant as Nvidia’s or AMD’s. There is also a suspicious difference in performance between Nightly and Release; could this be related to Bug 1013262?

The next question is what specific models our users possess. In order to answer it, let’s have a look at the most popular GPUs that account for 25% of our user base on Nightly:


and on Release:

As you can see, there are only Intel GPUs in the top 25% of the population. The HD 4000 and 3000 dominate on Nightly, while the distribution is much more spread out on Release, with older models, like the GMA 4500, being quite popular. Unsurprisingly though, the most popular chips are mobile ones.

Let’s now take the most popular chips on Release as a reference and see how their performance compares to Nightly:

We can immediately spot that newer chips generally perform better than older ones, unsurprisingly. But more importantly, there is a marked performance difference between the same models on Nightly and Release, which will require further investigation. Also noteworthy is that older desktop models outperform newer mobile ones.

And finally we can try to answer the original question by correlating the tab animation performance with the various driver versions. In order to do that, let’s take the Intel HD 4000 and plot the performance for its most popular drivers on the release channel and compare it to the nightly one:


We can see that there is a clear difference in performance between the older 9.X drivers and the newer 10.X ones, which answers our original question. Unfortunately, only about 25% of our users on the release channel have updated their driver to a recent 10.X version. Also, the difference between the older and newer drivers is more marked on the nightly channel than on the release one.

We are currently working on an alerting system for Telemetry that will notify us when a relevant change appears in our metrics. This should allow us to catch regressions pre-emptively, like Bug 1032185, or improvements, like the one mentioned in this blog post, without accidentally stumbling upon them.

Using Telemetry to recommend Add-ons for Firefox

This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. Along the way we will develop a simple recommender engine for Firefox add-ons based on data from Telemetry and reconstruct a signal from a noisy channel.

tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.

Often the problem we are trying to solve can be reduced to determining the probability of a hypothesis H, given that we have observed data D and have some knowledge of the way the data was generated. In mathematical form:

P(H|D) = \frac{P(D|H)P(H)}{P(D)}

Where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely our hypothesis H is true in the first place.
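
To make the formula concrete, here is a tiny worked example with made-up numbers: a rare effect with a 1% prior and a test that detects it 90% of the time but also fires in 5% of the cases where the effect is absent:

```python
p_h = 0.01            # P(H), the prior
p_d_given_h = 0.90    # P(D|H), the likelihood
p_d_given_not_h = 0.05

p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)  # law of total probability
p_h_given_d = p_d_given_h * p_h / p_d
print(round(p_h_given_d, 3))  # ~0.154: despite the "positive" data, H is probably false
```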

That’s all there is to it; check out my last post to see a basic application of the theorem and to find out why many scientific research findings based purely on statistical inference later turn out to be false.

Logic

According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:

P(A=F|B=F) = 1 - P(A=T|B=F)

= 1 - \frac{P(B=F|A=T)P(A=T)}{P(B=F|A=T)P(A=T) + P(B=F|A=F)P(A=F)} = 1

where

P(B=F|A=T) = 1 - P(B=T|A=T) = 1 - 1 = 0

So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.

Recommender Engine

Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds, so we don’t collect this data unless you explicitly grant us permission to do so.

We might be interested in answering the following question: given that a user has the add-ons A_{1}, ..., A_{k}, what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that don’t have B but do have A_{1}, ..., A_{k}. So let’s use Bayes’ theorem to derive our probabilities. What we are after is the following probability:

P(B|A_{1}, ..., A_{k}) = \frac{P(A_{1}, ..., A_{k}|B)P(B)}{P(A_{1}, ..., A_{k})}   eq. 1

which corresponds to the ratio between the number of Telemetry submissions that contain A_{1}, ..., A_{k}, B and the number of submissions that contain all those add-ons regardless of B. Note that if

B \in \{A_{1}, ..., A_{k}\}

then obviously eq. 1 has a value of 1 and isn’t of interest, so we assume that

B \not\in \{A_{1}, ..., A_{k}\}

Since the denominator is constant for any B, we don’t really care about it:

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}, ..., A_{k}|B)P(B)   eq. 2

Say we want to be smart and precompute all possible probabilities in order to avoid doing the work for each request. That’s going to be prohibitively expensive, considering that the number of possible subsets of N add-ons is exponential in N.

If we make the naive assumption that the add-ons are conditionally independent, i.e. that

P(A_{1}, ..., A_{k}|B) = P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)

we can rewrite eq. 2 like so

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)P(B)

Awesome, this means that once we calculate the pairwise conditional probabilities

P(A_{i}|B)

for every pair of add-ons, together with the priors P(B) (roughly N^2 values in total, where N is the total number of add-ons in the universe), we have everything we need to calculate eq. 2 with just k multiplications per candidate B.

If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.
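
Here is a toy Python sketch of the counting behind eq. 2, with made-up add-on names and a pinch of additive smoothing for unseen pairs; the real engine obviously runs on the actual Telemetry submissions:

```python
from collections import Counter
from itertools import combinations

def train(submissions):
    """`submissions` is a list of add-on ID sets, one per Telemetry submission.
    Count single add-ons and pairs to estimate P(B) and P(A_i|B)."""
    single, pair = Counter(), Counter()
    for addons in submissions:
        single.update(addons)
        pair.update(frozenset(p) for p in combinations(addons, 2))
    return len(submissions), single, pair

def recommend(installed, model, smoothing=1.0):
    """Rank candidate add-ons B by P(B) * prod_i P(A_i|B), as in eq. 2."""
    total, single, pair = model
    scores = {}
    for b, count_b in single.items():
        if b in installed:
            continue
        score = count_b / total  # P(B)
        for a in installed:
            # P(A_i|B), smoothed so unseen pairs don't zero out the product
            score *= (pair[frozenset((a, b))] + smoothing) / (count_b + smoothing)
        scores[b] = score
    return sorted(scores, key=scores.get, reverse=True)

model = train([{"adblock", "noscript"}, {"adblock", "ghostery"},
               {"adblock", "noscript", "ghostery"}, {"firebug"}])
print(recommend({"adblock", "noscript"}, model)[:1])  # ['ghostery']
```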

Keep in mind that I didn’t spend time cleaning up the dataset of uninteresting add-ons like the Norton Toolbar, etc. The reason those add-ons show up is simply that a very large fraction of the population has them.

Signal Processing

Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form

y(t) = \alpha \sin(2\pi \omega t + \varphi)

where \alpha is the amplitude, \omega is the frequency and \varphi is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise W_i :

Y_i = A \sin(2\pi \Omega t_i + \Phi) + W_i

where Y_i , A , \Omega , \Phi and W_i are random variables and

W_i \sim \mathcal{N}(0, \sigma^2)

i.e. we assume the variance of the noise is the same for all samples.

Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuum of possible hypotheses, one for each combination of amplitude, frequency, phase and noise variance:

f_{A ,\Omega,\Phi,\Sigma|Y}(\alpha , \omega, \varphi, \sigma|y) = \frac  {f_{Y|A , \Omega, \Phi, \Sigma}(y| \alpha , \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma)}  {\iiiint f_{Y|A, \Omega, \Phi, \Sigma}(y| \alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) d\alpha d\omega d\varphi d\sigma}

where the functions f are probability density functions.

We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that A , \Omega , \Phi and \Sigma are independent then

f_{A, \Omega, \Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) = f_{A}(\alpha)f_{\Omega}(\omega)f_{\Phi}(\varphi)f_{\Sigma}(\sigma)

where f_{A} , f_{\Omega} , f_{\Phi} and f_{\Sigma} are our priors, and since we don’t know much about them, let’s assume that each one of them can be modeled as a Uniform distribution.

Assuming the measurements are independent from each other, the likelihood function for the vector of measurements y is

f_{Y|A, \Omega, \Phi, \Sigma}(y|\alpha, \omega, \varphi, \sigma) = \prod_{i=1}^N f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma)

Now we just have to specify the likelihood function for a single measurement, and since we have an idea of how the data was generated, we can do that easily:

f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma) = \mathcal{N}(\alpha \sin(2\pi \omega t_i + \varphi), \sigma^2)

To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the Python library pymc. As you can see from the IPython notebook, it’s extremely easy to write and computationally evaluate a probabilistic model.
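
The model boils down to something along these lines; this is a sketch written against the pymc 2.x API of the time, with simulated data and arbitrary prior bounds, not the notebook’s actual code:

```python
import numpy as np
import pymc as pm

# Simulated noisy channel: a sinusoid plus Gaussian noise.
t = np.linspace(0, 1, 100)
y_obs = 2.0 * np.sin(2 * np.pi * 3.0 * t + 0.5) + np.random.normal(0, 0.6, t.size)

# Uniform priors over plausible ranges, as in the text.
alpha = pm.Uniform('alpha', 0.1, 10.0)    # amplitude
omega = pm.Uniform('omega', 0.1, 10.0)    # frequency
phi = pm.Uniform('phi', 0.0, 2 * np.pi)   # phase
sigma = pm.Uniform('sigma', 0.01, 5.0)    # noise standard deviation

@pm.deterministic
def mu(alpha=alpha, omega=omega, phi=phi):
    return alpha * np.sin(2 * np.pi * omega * t + phi)

tau = pm.Lambda('tau', lambda s=sigma: 1.0 / s ** 2)

# Likelihood: each sample is Normal around the clean sinusoid.
y = pm.Normal('y', mu=mu, tau=tau, value=y_obs, observed=True)

mcmc = pm.MCMC([alpha, omega, phi, sigma, mu, tau, y])
mcmc.sample(iter=20000, burn=5000)
print(mcmc.trace('alpha')[:].mean(), mcmc.trace('omega')[:].mean())
```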

Final Thoughts

We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.

Forecasting time series for fun

Predicting time series can be very interesting, and not only for quants. Any server that logs metrics like the number of submissions or requests over time generates a so-called time series, or signal. An interesting time series I had the chance to play with some time ago is the one generated by Telemetry’s submissions. This is what it looks like for the Nightly channel over the past 60 days:

Figure 1

It’s immediately evident to the human eye when and where there was a drop in submissions in the past couple of months (bugs!). An interesting problem is to automatically identify a drop in submissions as soon as it happens while keeping the number of “false alarms” to a minimum. It might seem rather trivial at first, but given that the distribution is quite dispersed, mostly because of daily oscillations, an outlier detection method based on the standard deviation is doomed to fail. Using the median absolute deviation is more robust, but still not good enough to avoid false positives.

The periodic patterns might not be immediately visible from the raw data plot, but once we connect the dots the daily and weekly patterns appear in all their beauty:

Figure 2

The method I came up with to catch drops does the following:

  1. It retrieves the distributions of the 10 days leading up to the current data point
  2. It performs a series of Mann-Whitney tests to compare the last 24h to the distributions of the previous 9 days
  3. If the distributions are statistically different for at least 5 days with the current daily one having a lower mean, then we have a drop

Figure 3

The algorithm requires a certain amount of history to make good predictions, which is why it detected the first drop on the left only after several days. As expected though, it was able to detect the second drop without any false positives. Sudden drops are easy to detect with a robust outlier detection method, but slow drops, as we have experienced in the past, can go unnoticed if you just look for outliers.
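
A condensed Python sketch of the procedure, assuming the submission counts come in hourly buckets; the significance level is illustrative:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def is_drop(hourly_counts, days_back=9, min_days=5, alpha=0.001):
    """`hourly_counts` is a flat list of per-hour submission counts, oldest first.
    Compare the last 24 hours against each of the previous `days_back` days."""
    if len(hourly_counts) < 24 * (days_back + 1):
        return False  # not enough history yet
    today = np.asarray(hourly_counts[-24:], dtype=float)
    lower_days = 0
    for d in range(1, days_back + 1):
        day = np.asarray(hourly_counts[-24 * (d + 1):-24 * d], dtype=float)
        _, p_value = mannwhitneyu(today, day)
        if p_value < alpha and today.mean() < day.mean():
            lower_days += 1
    return lower_days >= min_days
```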

Another interesting approach is to use time series analysis to decompose the series into its seasonal (periodic), trend and noise components. A simple classical decomposition by moving average yields the following series:

Figure 4
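
For the curious, a simplified version of such a classical additive decomposition can be written in a few lines of pandas, assuming hourly data and a single daily period (the even-period subtleties of the textbook version are glossed over):

```python
import pandas as pd

def classical_decompose(counts, period=24):
    """Additive decomposition of an hourly series by moving average."""
    s = pd.Series(counts, dtype=float)
    # a centered moving average over one full period estimates the trend
    trend = s.rolling(window=period, center=True).mean()
    detrended = s - trend
    # the seasonal component is the mean deviation at each position in the period
    seasonal = detrended.groupby(s.index % period).transform('mean')
    residual = s - trend - seasonal
    return trend, seasonal, residual
```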

This simple algorithm was able to remove most of the periodic pattern; the trend is still affected by the weekly signal and the drops. It turns out that newer methods are able to decompose time series with multiple periodic patterns, or seasonalities. One algorithm I particularly like is the so-called TBATS method, which is an advanced exponential smoothing model:

Figure 5

That’s pretty impressive! The TBATS algorithm was able to identify and remove the daily and weekly frequencies from our signal; what remains is basically the trend and some random noise. Now that we have such a clean signal, we could try to apply statistical quality control to our time series, i.e. use a set of rules to identify drops. The rules look at the historical mean of a series of data points and, based on the standard deviation, help judge whether a new set of points is experiencing a mean shift (drop) or not.
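
A couple of such rules, loosely in the spirit of the Western Electric rules and with purely illustrative thresholds, could be sketched like this:

```python
import numpy as np

def mean_shift_down(history, recent, n_sigma=3, run_length=4):
    """Flag a downward mean shift in `recent` relative to `history`,
    e.g. on the de-seasonalized trend left over by the decomposition."""
    mean, std = np.mean(history), np.std(history)
    recent = np.asarray(recent, dtype=float)
    # rule 1: a single point far below the historical mean
    if np.any(recent < mean - n_sigma * std):
        return True
    # rule 2: a run of consecutive points more than one sigma below the mean
    run = 0
    for value in recent:
        run = run + 1 if value < mean - std else 0
        if run >= run_length:
            return True
    return False
```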

Given a decomposition of a time series, we can also use it to predict future data points. This can be useful for a variety of reasons beyond detecting drops. To get an idea of how well we can predict future submissions, let’s take a clean subset of our data, from day 20 to day 40, and try to predict Telemetry’s submissions for the next 5 days while comparing them to the actual data:

Figure 6

That’s pretty neat: we can immediately see that we have an outlier, and that the prediction is very close to the actual data.

I wonder if there are other methods used to detect alterations to time series, so feel free to drop me a line with a pointer if you happen to have a suggestion.

Main-thread IO Analysis

<edit>

  • The goal of this post is to show that main-thread IO matters and not to “shame” this or that add-on, or platform subsystem. There is only so much I can do alone by filing bugs and fixing some of them. By spreading knowledge about the off-main-thread IO API and why it matters, my hope is that Firefox and add-on developers alike will try to exploit it the next time an opportunity arises. I changed the title of this post accordingly to reflect my intentions.
  • Please note that just because some code performs IO on the main-thread, it doesn’t necessarily mean that it can be avoided. This has to be evaluated case by case.

</edit>

Recently a patch landed in Firefox that allows Telemetry to collect aggregated main-thread disk IO timing per filename. Each filename comes with its accumulated time spent and the number of operations performed for open(), read(), write(), fsync() and stat(). Yoric wrote an interesting article some time ago on why main-thread IO can be a serious performance issue.

I collected a day’s worth of Nightly data using Telemetry’s map-reduce framework and filtered out all those filenames that were not common enough (< 10%) among the submissions. What follows are some interesting questions we could answer thanks to the newly collected data.

  • How common are the various operations?
    The most common operation is stat(), followed by open(); all other operations are very uncommon, with a third quartile of 0.
  • Is there a correlation between the number of operations and the time spent doing IO on a file?
    There is an extremely weak correlation, which is what one would expect: generally, what really makes a difference is the amount of bytes read and written, not the number of operations performed.
  • Does having a SSD disk lower the time spent doing IO on a file?
    A Mann-Whitney test confirms that the distributions of time conditioned on “having an SSD” and “not having an SSD” are indeed statistically different. On average, having an SSD lowers the time by a factor of 2.7.

The above facts apply to all submissions, but what we are really interested in is discerning those files for which the accumulated time is particularly high. An interesting exercise is to compare some distributions of the top 100 files (in terms of time) against the distributions of the general population to determine if bad timings are correlated with older machines:

  • Does the distribution of Windows versions differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not differ (with high probability), i.e. the OS version doesn’t seem to influence the timings.
  • Does the distribution of CPU architectures (32 vs 64 bit) differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not seem to differ (a sketch of this 2×2 comparison follows the list).
  • Does the distribution of the disk type differ between the population and the top 100?
    The distributions do not seem to differ according to Fisher’s test.
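
For a 2×2 comparison like the CPU-architecture one, the test is a one-liner with scipy; the counts below are made up for illustration:

```python
from scipy.stats import fisher_exact

# Made-up counts of 32-bit vs 64-bit machines in the top 100 entries
# and in a random sample of 100 submissions from the general population.
table = [[38, 62],   # top 100:    32-bit, 64-bit
         [41, 59]]   # population: 32-bit, 64-bit
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # a large p-value: no evidence that the distributions differ
```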

Can we deduce something from all this? It seems that files that behave badly do so regardless of the machine’s configuration. That is, moving the offending IO operations off the main thread will likely benefit all users.

Which brings us to the more interesting question of how we should prioritize the IO operations that have to be moved off the main thread. I am not a big fan of looking only at aggregates of data, since outliers turn out to be precious more often than not. Let’s have a look at the distribution of the top 100 outliers.

1. Distribution of the top outliers.

Whoops, it looks like httpDataUsage.dat, places.sqlite-wal, omni.ja and prefs.js behave particularly badly quite often. How bad is bad? For instance, one entry of httpDataUsage.dat performs about 100 seconds of IO operations (38 stat & 38 open), i.e. ~1.3 seconds per operation!

Now that we have a feeling of the outliers, let’s have a look at the aggregated picture. Say we are interested in those files most users hit (> 90%) and rank them by their third quartile:

2. Top 20 files most users hit.

In general the total time spent doing IO is about 2 to 3 orders of magnitude lower than for the outliers.
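
In pandas terms, the ranking above could be computed roughly like this (the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one row per (submission, filename) pair.
df = pd.read_csv("mainthread_io.csv")  # columns: submission_id, filename, total_time_ms

n_submissions = df["submission_id"].nunique()
coverage = df.groupby("filename")["submission_id"].nunique() / n_submissions
common_files = coverage[coverage > 0.9].index  # files that most users hit

ranking = (df[df["filename"].isin(common_files)]
           .groupby("filename")["total_time_ms"]
           .quantile(0.75)                 # third quartile of accumulated IO time
           .sort_values(ascending=False)
           .head(20))
print(ranking)
```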

But what happened to the add-ons? It turns out that the initial filtering we applied to our dataset removed most of them. Let’s rectify the situation by taking a dataset with the single hottest entry for each file, in terms of time, and aggregating the filenames by extension:

3. Top extensions with slow file operations.

To keep track of some of the statistics presented here, mreid and I prepared a dashboard. Time to file some bugs!