# Most popular GPUs on Nightly and Release

We have a long-standing bug about Firefox not scrolling smoothly with Intel GPUs. It turns out that recently, thanks to a driver update, the performance has drastically improved for some of our users. While trying to confirm this hypothesis I found some interesting data about Firefox users on Nightly 33 and Release 30 that might come in handy in deciding on what hardware to run our benchmarks in the future.

First things first, let’s have a look at the popularity of the various GPU vendors:

That doesn’t really come as a surprise considering how ubiquitous Intel chips are nowadays. This also means we should concentrate our optimization efforts towards Intel GPUs since it’s there where we can have the biggest impact.

But how well does Firefox perform with GPUs from the above mentioned vendors? It turns out that one of our telemetry metrics, that measures the average time to render a frame during a tab animation, comes in pretty handy to answer this question:

Ideally we would like to reach 60 frames per second for any vendor. Considering that practically every second user has an Intel GPU, the fact that our performance on those chips is not splendid weights even more on our shoulders. Though, as one might suspect, Intel chips are usually not as performant as Nvidia’s or AMD’s ones. There is also a suspicious difference in performance between Nightly and Release; could this be related to Bug 1013262?

The next question is what specific models do our users possess. In order to answer it, let’s have a look at the most popular GPUs that account for 25% of our user base on Nightly:

and on Release:

As you can notice there are only Intel GPUs in the top 25% of the population. The HD 4000 and 3000 are dominating on Nightly while the distribution is much more spread out on Release with older models, like the GMA 4500, being quite popular. Unsurprisingly though the most popular chips are mobile ones.

Let’s take now as reference the most popular chips on Release and see how their performance compares to Nightly:

We can immediately spot that newer chips perform generally better than older ones, unsurprisingly. But more importantly, there is a marked performance difference between the same models on Nightly and Release which will require further investigation. Also noteworthy is that older desktop models outperform newer mobile ones.

And finally we can try to answer the original question by correlating the tab animation performance with the various driver versions. In order to do that, let’s take the Intel HD 4000 and plot the performance for its most popular drivers on the release channel and compare it to the nightly one:

We can notice that there is a clear difference in performance between the older 9.X drivers and the newer 10.X which answers our original question. Unfortunately though only about 25% of our users on the release channel have updated their driver to a recent 10.X version. Also, the difference between the older and newer drivers is more marked on the nightly channel than on the release one.

We are currently working on an alerting system for Telemetry that will notify us when a relevant change appears in our metrics. This should allow us to catch pre-emptively regressions, like Bug 1032185, or improvements, like the one mentioned in this blog post, without accidentally stumbling on it.

# Using Telemetry to recommend Add-ons for Firefox

This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. During the way we will develop a simple recommender engine for Firefox add-ons based on data from Telemetry and reconstruct a signal from a noisy channel.

tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.

Often the problem we are trying to solve can be reduced to determining the probability of an hypothesis H, given that we have observed data D and have some knowledge of the way the data was generated. In mathematical form:

$P(H|D) = \frac{P(D|H)*P(H)}{P(D)}$

Where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely our hypothesis H is true in the first place.

That’s all there is, check out my last post to see a basic application of the theorem to find out why many science research findings based purely on statistical inference turn out later to be false.

Logic

According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:

$P(A=F|B=F) = 1 - P(A=T|B=F)$

$= 1 - \frac{P(B=F|A=T)P(A=T)}{P(B=F|A=T)P(A=T) + P(B=F|A=F)P(A=F)} = 1$

where

$P(B=F|A=T) = 1 - P(B=T|A=T) = 1 - 1 = 0$

So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.

Recommender Engine

Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds so we don’t collect data if you don’t grant us explicitly permission to do so.

We might be interested in answering the following question: given that a user has the add-ons $A_{1}, ..., A_{k}$, what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that don’t possess B but have $A_{1}, ..., A_{k}$. So let’s use Bayes theorem to derive our probabilities. What we are after is the following probability:

$P(B|A_{1}, ..., A_{k}) = \frac{P(A_{1}, ..., A_{k}|B)P(B)}{P(A_{1}, ..., A_{k})}$  eq. 1

which corresponds to the ratio between Telemetry submissions that contain $A_{1}, ..., A_{k}, B$ and the submissions that contain all those add-ons but B. Note that if

$B \in \{A_{1}, ..., A_{k}\}$

then obviously eq. 1 is going have a value of 1 and it isn’t of interest, so we assume that

$B \not\in \{A_{1}, ..., A_{k}\}$

Since the denominator is constant for any B, we don’t really care about it.

$P(B|A_{1}, ..., A_{k}) \propto P(A_{1}, ..., A_{k}|B)P(B)$  eq. 2

Say we want to be smart and precompute all possible probabilities in order to avoid the work for each request. It’s going to be pretty slow considering that the number of possible subsets of N add-ons is exponential.

If we make the naive assumptions that the add-ons are conditionally independent, i.e. that

$P(A_{1}, ..., A_{k}|B) = P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)$

we can rewrite eq. 2 like so

$P(B|A_{1}, ..., A_{k}) \propto P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)P(B)$

Awesome, this means that once we calculate the $N$ probabilities

$P(B|A_{i})$

where $N$ is the total number of add-ons in the universe, we have everything we need to calculate eq. 2 with just $k$ multiplications.

If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.

Keep in mind that I didn’t spend time cleaning up the dataset from non interesting add-ons like Norton Toolbar, etc. The reason those add-ons show up is simply because a very large fraction of the population has them.

Signal Processing

Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form

$y(t) = \alpha sin(2\pi \omega t + \varphi)$

where $\alpha$ is the amplitude, $\omega$ is the frequency and $\varphi$ is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise $\Sigma_i$:

$Y_i = A sin(2\pi \Omega t_i + \Phi) + W_i$

where $Y_i$, $A$, $\Omega$, $\Phi$ and $W_i$ are random variables and

$W_i = \mathcal{N}(0, \sigma^2)$

i.e. we assume the variance of the noise is the same for all samples.

Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuous number of possible hypothesis, one for each combination of the amplitude, frequency, phase and variance.

$f_{A ,\Omega,\Phi,\Sigma|Y}(\alpha , \omega, \varphi, \sigma|y) = \frac {f_{Y|A , \Omega, \Phi, \Sigma}(y| \alpha , \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma)} {\iiiint f_{Y|A, \Omega, \Phi, \Sigma}(y| \alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) d\alpha d\omega d\varphi d\sigma}$

where the functions $f$ are probability density functions.

We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that $A$, $\Omega$, $\Phi$ and $\Sigma$ are independent then

$f_{A, \Omega, \Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) = f_{A}(\alpha)f_{\Omega}(\omega)f_{\Phi}(\varphi)f_{\Sigma}(\sigma)$

where $f_{A}$, $f_{\Omega}$, $f_{\Phi}$ and $f_{\Sigma}$ are our priors and since we don’t know much about them let’s assume that each one of them can be modeled as an Uniform distribution.

Assuming the measurements are independent from each other, the likelihood function for the vector of measurements $y$ is

$f_{Y|A, \Omega, \Phi, \Sigma}(y|\alpha, \omega, \varphi, \sigma) = \prod_{i=1}^N f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma)$

Now we just have to specify our likelihood function for a single measurement, but since we have an idea of how the data was generated, we can do that easily

$f_{Y_i|\alpha, \omega, \varphi, \sigma}(y_i|\alpha, \omega, \varphi, \sigma) = \mathcal{N}(\alpha sin(2\pi \omega t_i + \varphi), \sigma^2)$

To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the python library pymc. As you can see from IPython notebook, it’s extremely easy to write and evaluate computationally a probabilistic model.

Final Thoughts

We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.

# Forecasting time series for fun

Predicting time series can be very interesting not only for quants. Any server that logs metrics like the number of submissions or requests over time generates a so called time series or signal. An interesting time series I had the chance to play with some time ago is the one generated by Telemetry’s submissions. This is how it looks like for the Nightly channel for the past 60 days:

Figure 1

It’s immediately evident to an human eye when and where there was a drop in submissions in the past couple of months (bugs!). An interesting problem is to be able to automatically identify a drop in submissions as soon as it happens and at the same time reducing to a minimum the number of “false alarms”. It might seem rather trivial at first, but given that the distribution is quite sparse, caused mostly by daily oscillations, an outlier detection method based on the standard deviation is doomed to fail. Using the median absolute deviation is more robust but still not good enough to avoid false positives.

The periodic patterns might not be immediately visible from the raw data plot but once we connect the dots the daily and weekly pattern appear in all their beauty:

Figure 2

The method I came up with to catch drops does the following:

1. It retrieves the distributions of the last 10 days from the current data point
2. It performs a series of Mann-Whitney tests to compare the last 24h to the distributions of the previous 9 days
3. If the distributions are statistically different for at least 5 days with the current daily one having a lower mean, then we have a drop

Figure 3

The algorithm requires a certain amount of history to make good predictions, reason why it detected the first drop on the left only after several days. As expected though it was able to detect the second drop without any false positives. Sudden drops are easy to detect with a robust outlier detection method but slow drops, as we experienced in the past, can go unnoticed if you just look for outliers.

Another interesting approach is to use time series analysis to decompse the series into its seasonals (periodic), trend and and noise signals. A simple classical decomposition by moving average yields the following series:

Figure 4

This simple algorithm was able to remove most of the periodic pattern; the trend is affected now by the weekly signal and the drops. It turns out that newer methods are able to decompose time series with multiple periodic patterns, or seasonalities. One algorithm I particularly like is the so called TBATS method, which is an advanced exponential smoothing model:

Figure 5

That’s pretty impressive! The TBATS algorithm was able to identify and remove the daily and weekly frequency from our signal, what remains is basically the trend and some random noise. Now that we have such a clean signal we could try to apply statistical quality control to our time series, i.e. use a set of rules to identify drops. The rules look at the historical mean of a series of datapoints and based on the standard deviation, the rules help judge whether a new set of points is experiencing a mean shift (drop) or not.

Given a decomposition of a time series, we can also use it to predict future datapoints. This can be useful for a variety of reasons beyond detecting drops. To have an idea of how well we can predict future submissions let’s take a clean subset of our data, from day 20 to day 40, and let’s try to predict Telemetry’s submissions for the next 5 days while comparing it to the actual data:

Figure 6

That’s pretty neat, we can immediately see that we have an outlier and the prediction is very close to the actual real data.

I wonder if there are other methods used to detect alterations to time series so feel free to drop me a line with a pointer if you happen to have a suggestion.

<edit>

• The goal of this post is to show that main-thread IO matters and not to “shame” this or that add-on, or platform subsystem. There is only so much I can do alone by filing bugs and fixing some of them. By divulging knowledge about the off-main thread IO API and why it matters, my hope is that Firefox and add-ons developers alike will try to exploit it the next time an opportunity arises. I changed the title of this post accordingly to reflect my intentions.
• Please note that just because some code performs IO on the main-thread, it doesn’t necessarily mean that it can be avoided. This has to be evaluated case by case.

</edit>

Recently a patch landed in Firefox that allows telemetry to collect aggregated main-thread disk IO timing per filename. Each filename comes with its accumulated time spent and the number of operations performed for open(), read(), write(), fsync() and stat(). Yoric wrote an interesting article some time ago on why main-thread IO can be a serious performance issue.

I collected a day worth of Nightly data using Telemetry’s map-reduce framework and filtered out all those filenames that were not common enough (< 10%) among the submissions. What follows are some interesting questions we could answer thanks to the newly collected data.

• How common are the various operations?
The most common operation is stat() followed by open(), all other operations are very uncommon with a third quartile of 0.
• Is there a correlation between the number of operations and the time spent doing IO on a file?
There is an extremely weak correlation which is what one would expect. Generally, what really makes a difference is the amount of bytes read and written for instance, not the number of actual operations performed.
• Does having a SSD disk lower the time spent doing IO on a file?
A Mann-Whitney test confirms that the distributions of time conditioned on “having a SSD” and “not having a SSD” seem to be indeed statistically different. On average having an SSD disk lowers the time by a factor of 2.7.

The above facts apply to all submissions, but what we are really interested in is discerning those files for which the accumulated time is particularly high. An interesting exercise is to compare some distributions of the top 100 files (in terms of time) against the distributions of the general population to determine if bad timings are correlated with older machines:

• Does the distribution of Windows versions differ between the population and the top 100?
Fisher’s test confirms that the distributions do not differ (with high probability), i.e. the OS version doesn’t seem to influence the timings.
• Does the distribution of CPU architectures (32 vs 64 bit) differ between the population and the top 100?
Fisher’s test confirms that the distributions do not seem to differ.
• Does the distribution of the disk type differ between the population and the top 100?
The distributions do not seem to differ according to Fisher’s test.

Can we deduce something from all this? It seems that files that behave badly do so regardless of the machine’s configuration. That is, removing the offending operations performing IO in the main-thread will likely benefit all users.

Which brings us to the more interesting question of how we should prioritize the IO operations that have to be moved off the main-thread. I am not a big fan of looking only at aggregates of data since outliers turn out to be precious more often than not. Let’s have a look at the distrubtion of the top 100 outliers.

1. Distribution of the top outliers.

Whoops, looks like httpDataUsage.dat, place-sqlite-wal, omni.ja and prefs.js behave particularly bad quite often. How bad is bad? For instance, one entry of httpDataUsage.dat performs about 100 seconds of IO operations (38 stat & 38 open), i.e. ~ 1.3 seconds per operation!

Now that we have a feeling of the outliers, let’s have a look at the aggregated picture. Say we are interested in those files most users hit (> 90%) and rank them by their third quartile:

2. Top 20 files most users hit.

In general the total time spent doing IO is about 2 to 3 orders of magnitude lower than for the outliers.

But what happened to the add-ons? It turns out that the initial filtering we applied on our dataset removed most of them. Let’s rectify the situation by taking a dataset with the single hottest entry for each file, in terms of time, and aggregate the filenames by extension:

3. Top extensions with slow file operations.

To keep track of some of the statistics presented here, mreid and myself prepared a  dashboard. Time to file some bugs!

# Idle wakeups are evil

I have been spending some quality time with gdb tracking down timers firing during idle in Firefox, which turns out to be the main culprit for FF’s energy consumption. Wakeups are triggered either because a timer is fired or a process sleeping on a synchronisation variable is woken up. In fact the former can be reduced to the latter, since also the timer thread which executes the timer callbacks, blocks on a synchronisation variable.

Tracking down wakeups is nothing new and has been done in the past. Automating it on the other hand isn’t trivial without modifying heavily the sourcecode. I tried automating the process through a gdb script but it kept crashing randomly and when it didn’t FF got significantly slower so I gave up on that. There are some alternatives though, e.g. on Linux the perf subsystem can be exploited to track the following events:

• the context switches (shced:sched_switch),
• wakeups (sched:sched_wakeup),
• and the C-states of the CPU (power:cpu_idle).

By keeping track of the CPU C-states we know when the CPU is sleeping. The wakeup probe tells us when a specific thread is woken up, and by correlating it with the previous probe, it gives us the culprit that caused the CPU C-state to change. Finally, the context switch probe allows to retrieve a user-space callgraph of the thread woken up.

That’s an interesting approach and powertop does something similar but it turns out that to investigate wakeups we don’t really need that degree of accuracy. Instead of considering only the context switches caused by wakeups correlated to C-state changes, we can just consider all context switches. Some context switches will not be associated with C-state changes but this heuristic is still good enough on first approximation during an idle workload. On Linux the following command can be used to collect callgraphs of context switches:

perf record -e sched:sched_switch -R -c 1 -g -p \$(pgrep firefox)

As shown in the example above, most of the times a context switch happens we are either in the BackgroundHangManager, in the Timer thread or nsSocketTransportService. Not all timers are created equal though and we would really like to know which timers are firing. Unfortunately since the specific timer callback is executed only after the context switch, there isn’t a simple way to get that sort of information through perf.

The linker comes to our rescue and by using dladdr we can simply retrieve the mangled name of a timer callback handler, which we can demangle and print with __cxa_demangle. This clearly works only if the callbacks are exported in the dynamic symbol table, which on OSX are by default whereas on Linux we have to recompile the codebase with the -export-dynamic linker flag.

# A matter of energy

In the past month I have looked into measuring the power consumption of Firefox on desktop. It all started with a complaint of excessive background activity during idle so I wrote a little tool to measure the energy consumption of FF while idling on a blank page. I wasn’t able though to notice any statistically valid difference between between the idle machine and the idle machine + the idle browser (on Win8, OSX & Ubuntu).

A more traditional profiling using DTrace yield some wakeups though that I could backtrack to a particular set of timers firing. Most of those timers have been addressed and fixed in the current Nightly. I suspect that the reason I couldn’t identify them on my power benchmark is that the profiling interval I used was just too short (5-10min). A more sensible test would be to use longer intervals and/or measure the time it takes for a battery to drain, for instance.

The next step was to have look also at the drainage of energy when idling on some popular websites. In order to reduce the variance of the measurements I had to disable as many background processes as possible, i.e. spotlight on OSX. For the same reason I couldn’t use Selenium and I had to rely on simple terminal commands to steer the browsers. The problem with Selenium is that each browser has its own implementation with a complete different power signature. It probably could be possible to remove the noise using some ML algorithm but that would require a large enough training set, for each distinct website and OS… A more realistic alternative would be to write some “simple” and lightweight windowing event generator to steer the different browsers.
That being said, even without Selenium it’s possible to get some interesting data: before we can optimize the power drain of FF while scrolling on cnn.com for instance, we must address the power drainage of simply idling on it, since that drainage is very likely to contribute to any power profile performed while scrolling anyway.

Keep in mind the following bits when interpreting the data:

1. the latest release versions of FF, Chrome, IE & Safari as of this writing have been used;
2. the sampling interval amounts to only to 30 seconds; even though its extremely short and might not catch background activities like sporadic timers, it’s obvious that if there is something very wrong with a page it will show up in the profile anyway;
3. the error bars mark the 95% confidence intervals;
4. the energy profiler starts measuring only about a minute after a page has been loaded; that’s needed to ensure the browser is in a steady state;
5. the profiles were performed by idling on the homepages of the different sites, i.e. youtube.com wasn’t playing any video;
6. only for Facebook the browsers were logged into my profile;
7. the data was collected on different machines at different times, i.e. the 3 plots are not comparable between each other;
8. I am assuming that a particular homepage doesn’t change dramatically when collecting the profiles for the different browsers (more on this later);
9. only the CPU specific power drainage is measured.

Firefox is performing quite well, only Facebook seems to have a particular adverse effect on the energy profile.

Let’s have a look at the the OSX plot:

It seems here that FF is performing particularly bad on Facebook and Yahoo, but is in good company in the latter case.

And finally we have the profile for Ubuntu:

And again Facebook is a clear outlier here. As you can notice the variance here is much lower since on Linux it was easier to remove all unneeded background tasks.

The main pattern across those profiles seems to be that the OS vendor-specific browsers are slightly better at not draining energy during idle. Also, Facebook seems to behave negatively on FF across all OSs so it doesn’t seem to be a random fluctuation. It looks like I am going to have some fun this week tracking down the root of the problem :-)

For the future, a distributed benchmark that runs on O*B equally configured machines where O is the number of OSs and B the number of browsers, would be ideal. This would allow the profiles to be comparable across OSs and also make them independent of homepage updates.

# My first two weeks at Mozilla

In the first couple of weeks at Mozilla I spent some time with the client and server-side telemetry implementation. Telemetry allows Engineering to receive aggregate data of browser health in the field, e.g. cache hit rates or page load times across all browser instances.

In summary, the telemetry workflow looks more or less like this:

1. Firefox generates telemetry data while its being used, if the user explicitly enabled the collection;
2. the collected data is sent to once a day to a server via HTTPS;
3. the received data is collected into a queue which is post-processed through a converter, which validates, compacts and compresses the data, before its being sent to persistent storage;
4. analysis jobs are run on the data available through the persistent storage and the results are presented on the telemetry dashboard;
5. finally, developers can access the persistent data through custom map-reduce jobs to compute user-define metrics.

I have been working on small bits of the project, i.e.:

1. adding a feature to firefox that allows certain telemetry data to “expire”, i.e. not being sent to the server;
2. integrating a C++ record compressor into the back end;
3. running some map-reduce jobs to determine the number of SSDs vs HDDs.

In particular, for the last item, it seems that about 8% of nightly users are running from SSD disks, which is not as high as I initially suspected. Running is this sort of queries against telemetry datasets is very easy. A json file specifies the filter for the datasets you want to analyze and the actual analysis is implemented in a python file as a map-reduce job.

# Hello World

I’m Roberto, the newest member of Mozilla’s Desktop Performance team. Before that I worked on optimizing the software framework used in ATLAS, one of LHC’s experiments.

I will be working on Telemetry during my first weeks, like adding expiration dates to telemetry probes (bug 742500).

I am very excited to start this new journey! I am the only team-member working from the London office, which is located in one of the most gorgeous areas of the city, the Theater District.