Dashboard generator for custom Telemetry jobs

tldr: Next time you are in need of a dashboard similar to the one used to monitor main-thread IO, please consider using my dashboard generator which takes care of displaying periodically generated data.

So you wrote your custom analysis for Telemetry, your map-reduce job is finally giving you the desired data and you want to set it up so that it runs periodically. You will need some sort of dashboard to monitor the weekly runs, but since you don’t really care how it’s done, what do you do? You copy-paste the code of one of our current dashboards, apply a little tweak here and there and off you go.

That basically describes all of the recent dashboards, like the one for main-thread IO (mea culpa). Writing dashboards is painful when the only thing you care about is the data. Once you finally have what you were looking for, the way you present it is often an afterthought at best. And maintaining N dashboards quickly becomes unpleasant.

But what makes writing and maintaining dashboards so painful exactly? Simply that the more controls you have, the more kinds of events you have to handle and the more easily things get out of hand. You start with something small and beautiful that just displays some CSV and presto, you end up with what should properly have been described as a state machine but is instead a mess of intertwined event handlers.

What I was looking for was something along the lines of Shiny for R, but in JavaScript and with the option of a client-only interface. It turns out that React does more or less what I want. It’s not necessarily meant for data analysis, so there aren’t any plotting facilities, but everything is there to roll your own. What makes Shiny and React so useful is that they embrace reactive programming. Once you define a state and a set of dependencies, i.e. a data flow graph in practical terms, changes that affect the state are automatically propagated to the right components. Even though this can be seen as overkill for small dashboards, it makes it extremely easy to extend them when the set of possible states expands, which is almost always what happens.

To make things easier for developers I wrote a dashboard generator, iacumus, for use cases similar to the ones we currently have. It can be used in simple scenarios where:

  • the data is collected in CSV files on a weekly basis, usually keyed by build-id;
  • the dashboard should compare the current week against the previous one and mark differences in rankings;
  • it should be possible to go back and forward in time;
  • the dashboard should provide some filtering and sorting criteria.

Iacumus is customizable through a configuration file that is specified via a GET parameter. Since it’s hosted on GitHub, you just have to provide the data and don’t even have to spend time deploying the dashboard somewhere, assuming the machine serving the configuration file supports CORS. Here is how the end result looks using the data for the Add-on startup correlation dashboard. Note that Chrome currently doesn’t handle our gzipped datasets properly and is unable to display anything, in case you were wondering…

My next immediate goal is to simplify writing map-reduce jobs for the above-mentioned use cases, or at the very least to write down some guidelines. For instance, some of our dashboards are based on Firefox’s version numbers and not on build-ids, which are really what you want when making weekly comparisons of Nightly.

Another interesting thought would be to automatically detect differences in the dashboards and send alerts. That might not be as easy with the current data, since a quick look at the dashboards makes it clear that the rankings fluctuate quite a bit. We would have to collect daily reports and account for the variance of the rankings, as just using a few weekly datapoints is not reliable enough.

Regression detection for Telemetry histograms

tldr: An automatic regression detector system for Telemetry data has been deployed; the detected regressions can be seen in the dashboard.

Mozilla is collecting over 1,000 Telemetry probes which give rise to histograms, like the one in the figure below, that change slightly every day.

Average frame interval during any tab open/close animation (excluding tabstrip scroll).

Until recently, the only way to monitor those histograms was to sit down and literally stare at the screen until something interesting was spotted. Clearly there was a need for an automated system able to discern between noise and real regressions.

Noise is a major challenge, even more so than with Talos data, as Telemetry data is collected from a wide variety of computers, configurations and workloads. A reliable means of detecting regressions, improvements and changes in a measurement’s distribution is fundamental, as erroneous alerts (false positives) tend to annoy people to the point that they just ignore any warning generated by the system.

I have looked at various methods to detect changes in histograms, like:

  • Correlation Coefficient
  • Chi-Square statistic
  • U statistic (Mann-Whitney)
  • Kolmogorov-Smirnov statistic of the estimated densities
  • One Class Support Vector Machine
  • Bhattacharyya Distance

Only the Bhattacharyya distance proved satisfactory for our data. There are several reasons why each of the previous methods fails with our dataset.

For instance, a one-class SVM wouldn’t be a bad idea if some distributions didn’t change dramatically over the course of time due to regressions and/or improvements in our code; in other words, how do you define what a distribution should look like? You could just take the daily distributions of the past week as a training set, but that wouldn’t be enough data to get anything meaningful out of an SVM. A Chi-Square statistic, instead, is not always applicable as it doesn’t allow cells with an expected count of 0. We could go on for quite a while, and there are ways to get around those issues, but the reader is probably more interested in the final solution. I evaluated how good those methods actually are at pinpointing some past known regressions, and the Bhattacharyya distance proved able to detect the kind of pattern changes we are looking for, like distribution shifts or bin swaps, while minimizing the number of false positives.
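For reference, computing the Bhattacharyya distance between two histograms that share the same bins is straightforward; here is a minimal sketch, assuming the histograms are passed in as raw bin counts:

import numpy as np

def bhattacharyya_distance(hist_a, hist_b):
    """Bhattacharyya distance between two histograms sharing the same bins."""
    p = np.asarray(hist_a, dtype=float)
    q = np.asarray(hist_b, dtype=float)
    p /= p.sum()  # normalize counts to empirical probabilities
    q /= q.sum()
    bc = np.sqrt(p * q).sum()  # Bhattacharyya coefficient, 1 for identical distributions
    return -np.log(bc)         # 0 for identical distributions, grows as they diverge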

Having a relevant distance metric is only part of the deal, since we still have to decide what to compare. Should we compare the distribution of today’s build-id against the one from yesterday? Or against the one from a week ago? It turns out that trying to mimic what a human would do yields a good algorithm: if

  • the variance of the distance between the histogram of the current build-id and the histograms of the past N build-ids is small enough and
  • the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K yielding a significant difference and
  • a significant difference is also present in the next K build-ids, then a distribution change is reported.

Furthermore, histograms that don’t have enough data are filtered out, and the cut-off values and parameters are determined empirically from past known regressions.
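Here is a minimal sketch of that rule, reusing the bhattacharyya_distance helper above and assuming the per-build-id histograms come in as a list ordered by build-id; the thresholds are placeholders rather than the empirically determined values:

import numpy as np

def detect_changes(hists, n_past=5, k_next=3, cutoff=0.12, max_var=0.01):
    """Flag build-ids whose histogram differs significantly from the previous one."""
    changes = []
    for i in range(n_past, len(hists) - k_next):
        past = [bhattacharyya_distance(hists[i], hists[i - j]) for j in range(1, n_past + 1)]
        ahead = [bhattacharyya_distance(hists[i + j], hists[i - 1]) for j in range(1, k_next + 1)]
        if (np.var(past) < max_var                    # distances to past builds are consistent
                and past[0] > cutoff                  # current build differs from the previous one
                and all(d > cutoff for d in ahead)):  # and the difference persists in the next builds
            changes.append(i)
    return changes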

I am pretty satisfied with the detected regressions so far; for instance, the system was able to correctly detect a regression caused by the OMTC patch that landed on the 20th of May and caused a significant change in the average frame interval during tab open animation:

Average frame interval during tab open animation of about:newtab.

We will soon roll out a feature to allow histogram authors to be notified via e-mail when a histogram change occurs. In the meantime you can have a look at the detected regressions in the dashboard.

Most popular GPUs on Nightly and Release

We have a long-standing bug about Firefox not scrolling smoothly with Intel GPUs. It turns out that recently, thanks to a driver update, performance has drastically improved for some of our users. While trying to confirm this hypothesis I found some interesting data about Firefox users on Nightly 33 and Release 30 that might come in handy when deciding what hardware to run our benchmarks on in the future.

First things first, let’s have a look at the popularity of the various GPU vendors:

That doesn’t really come as a surprise considering how ubiquitous Intel chips are nowadays. It also means we should concentrate our optimization efforts on Intel GPUs, since that’s where we can have the biggest impact.

But how well does Firefox perform with GPUs from the above-mentioned vendors? It turns out that one of our Telemetry metrics, which measures the average time to render a frame during a tab animation, comes in pretty handy to answer this question:

Ideally we would like to reach 60 frames per second for any vendor. Though, as one might suspect, Intel chips are usually not as performant as Nvidia’s or AMD’s. Considering that practically every second user has an Intel GPU, the fact that our performance on those chips is not splendid weighs even more heavily on our shoulders. There is also a suspicious difference in performance between Nightly and Release; could this be related to Bug 1013262?

The next question is which specific models our users possess. In order to answer it, let’s have a look at the most popular GPUs that account for 25% of our user base on Nightly:

nightly_pop

and on Release:

As you can see, only Intel GPUs appear in the top 25% of the population. The HD 4000 and 3000 dominate on Nightly, while the distribution is much more spread out on Release, with older models like the GMA 4500 being quite popular. Unsurprisingly though, the most popular chips are mobile ones.

Let’s now take the most popular chips on Release as a reference and see how their performance compares to Nightly:

We can immediately spot that newer chips generally perform better than older ones, unsurprisingly. More importantly, there is a marked performance difference between the same models on Nightly and Release which will require further investigation. Also noteworthy is that older desktop models outperform newer mobile ones.

And finally we can try to answer the original question by correlating the tab animation performance with the various driver versions. In order to do that, let’s take the Intel HD 4000 and plot the performance for its most popular drivers on the release channel and compare it to the nightly one:


We can see that there is a clear difference in performance between the older 9.X drivers and the newer 10.X ones, which answers our original question. Unfortunately, only about 25% of our users on the release channel have updated their driver to a recent 10.X version. Also, the difference between the older and newer drivers is more marked on the nightly channel than on the release one.

We are currently working on an alerting system for Telemetry that will notify us when a relevant change appears in our metrics. This should allow us to catch regressions, like Bug 1032185, or improvements, like the one mentioned in this blog post, pre-emptively, instead of accidentally stumbling on them.

Using Telemetry to recommend Add-ons for Firefox

This post is about the beauty of a simple mathematical theorem and its applications in logic, machine learning and signal processing: Bayes’ theorem. Along the way we will develop a simple recommender engine for Firefox add-ons based on data from Telemetry and reconstruct a signal from a noisy channel.

tl;dr: If you couldn’t care less about mathematical beauty and probabilities, just have a look at the recommender system.

Often the problem we are trying to solve can be reduced to determining the probability of a hypothesis H, given that we have observed data D and have some knowledge of how the data was generated. In mathematical form:

P(H|D) = \frac{P(D|H)*P(H)}{P(D)}

where P(D|H) is the likelihood of the data D given our hypothesis H, i.e. how likely it is to see data D given that our hypothesis H is true, and P(H) is the prior belief about H, i.e. how likely our hypothesis H is to be true in the first place.

That’s all there is to it. Check out my last post to see a basic application of the theorem and to find out why many scientific research findings based purely on statistical inference later turn out to be false.

Logic

According to logic, from the statement “if A is true then B is true” we can deduce that “if B is false then A is false”. This can be proven simply by using a truth table or probabilistic reasoning:

P(A=F|B=F) = 1 - P(A=T|B=F)

= 1 - \frac{P(B=F|A=T)P(A=T)}{P(B=F|A=T)P(A=T) + P(B=F|A=F)P(A=F)} = 1

where

P(B=F|A=T) = 1 - P(B=T|A=T) = 1 - 1 = 0

So we can see that probabilistic reasoning can be applied to make logical deductions, i.e. deductive logic can be seen as nothing more than a limiting case of probabilistic reasoning, where probabilities take only values of 0 or 1.
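As a quick sanity check, a few lines of Python can enumerate all truth assignments and confirm that “if A then B” entails “if B is false then A is false”:

from itertools import product

# "A implies B" is false only when A is true and B is false.
implies = lambda a, b: (not a) or b

for a, b in product([True, False], repeat=2):
    if implies(a, b) and not b:  # assume the implication holds and B is false...
        assert not a             # ...then A must be false (the assertion never fails)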

Recommender Engine

Telemetry submissions contain the IDs of the add-ons of our users. Note that Telemetry is opt-in for our release builds, so we don’t collect data unless you explicitly grant us permission to do so.

We might be interested in answering the following question: given that a user has the add-ons A_{1}, ..., A_{k}, what’s the probability that the user also has add-on B? We could then use the answer to suggest new add-ons to users that don’t have B but do have A_{1}, ..., A_{k}. So let’s use Bayes’ theorem to derive our probabilities. What we are after is the following probability:

P(B|A_{1}, ..., A_{k}) = \frac{P(A_{1}, ..., A_{k}|B)P(B)}{P(A_{1}, ..., A_{k})}   eq. 1

which corresponds to the ratio between the number of Telemetry submissions that contain A_{1}, ..., A_{k}, B and the number of submissions that contain A_{1}, ..., A_{k}. Note that if

B \in \{A_{1}, ..., A_{k}\}

then obviously eq. 1 is going to have a value of 1 and isn’t of interest, so we assume that

B \not\in \{A_{1}, ..., A_{k}\}

Since the denominator is constant for any B, we don’t really care about it:

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}, ..., A_{k}|B)P(B)   eq. 2

Say we want to be smart and precompute all possible probabilities in order to avoid doing the work for each request. That’s not going to fly, considering that the number of possible subsets of N add-ons grows exponentially.

If we make the naive assumption that the add-ons are conditionally independent, i.e. that

P(A_{1}, ..., A_{k}|B) = P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)

we can rewrite eq. 2 like so

P(B|A_{1}, ..., A_{k}) \propto P(A_{1}|B)P(A_{2}|B)...P(A_{k}|B)P(B)

Awesome, this means that once we calculate the probabilities

P(A_{i}|B)

for every pair of add-ons, along with the N priors P(B), where N is the total number of add-ons in the universe, we have everything we need to calculate eq. 2 with just k multiplications.

If we perform this calculation for every possible add-on B and return the add-on for which the equation is maximized, then we can make a reasonable suggestion. Follow the link to play with a simple recommendation engine that does exactly this.
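To make the ranking concrete, here is a minimal sketch assuming the submissions are available as an iterable of sets of add-on IDs; it is only an illustration of the naive Bayes scoring in eq. 2, not the code behind the linked demo:

from collections import Counter
from itertools import combinations
from math import log, inf

def train(submissions):
    """Count add-on occurrences and pairwise co-occurrences over the submissions."""
    singles, pairs, total = Counter(), Counter(), 0
    for addons in submissions:
        total += 1
        singles.update(addons)
        pairs.update(frozenset(p) for p in combinations(addons, 2))
    return singles, pairs, total

def recommend(user_addons, singles, pairs, total):
    """Return the add-on B maximizing P(A_1|B)...P(A_k|B)P(B)."""
    best, best_score = None, -inf
    for b, count_b in singles.items():
        if b in user_addons:
            continue
        score = log(count_b / total)  # log P(B)
        for a in user_addons:
            joint = pairs[frozenset((a, b))]  # submissions containing both A and B
            if joint == 0:
                score = -inf
                break
            score += log(joint / count_b)  # log P(A|B)
        if score > best_score:
            best, best_score = b, score
    return best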

Keep in mind that I didn’t spend time cleaning up the dataset from uninteresting add-ons like the Norton Toolbar, etc. The reason those add-ons show up is simply that a very large fraction of the population has them.

Signal Processing

Say we send a signal over a noisy channel and at the other end we would like to reconstruct the original signal by removing the noise. Let’s assume our original signal has the following form

y(t) = \alpha sin(2\pi \omega t + \varphi)

where \alpha is the amplitude, \omega is the frequency and \varphi is the phase. The i-th sample that we get on the other side of the channel is corrupted by some random noise W_i :

Y_i = A sin(2\pi \Omega t_i + \Phi) + W_i

where Y_i , A , \Omega , \Phi and W_i are random variables and

W_i \sim \mathcal{N}(0, \sigma^2)

i.e. we assume the variance of the noise is the same for all samples.

Since we have some knowledge of how the data was generated, let’s see if we can apply Bayes’ theorem. We have a continuum of possible hypotheses, one for each combination of the amplitude, frequency, phase and noise variance:

f_{A ,\Omega,\Phi,\Sigma|Y}(\alpha , \omega, \varphi, \sigma|y) = \frac  {f_{Y|A , \Omega, \Phi, \Sigma}(y| \alpha , \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma)}  {\iiiint f_{Y|A, \Omega, \Phi, \Sigma}(y| \alpha, \omega, \varphi, \sigma)f_{A,\Omega,\Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) d\alpha d\omega d\varphi d\sigma}

where the functions f are probability density functions.

We are trying to find the quadruplet of values that maximizes the expression above so we don’t really care about the denominator since it’s constant. If we make the reasonable assumption that A , \Omega , \Phi and \Sigma are independent then

f_{A, \Omega, \Phi, \Sigma}(\alpha, \omega, \varphi, \sigma) = f_{A}(\alpha)f_{\Omega}(\omega)f_{\Phi}(\varphi)f_{\Sigma}(\sigma)

where f_{A} , f_{\Omega} , f_{\Phi} and f_{\Sigma} are our priors, and since we don’t know much about them, let’s assume that each one of them can be modeled as a uniform distribution.

Assuming the measurements are independent from each other, the likelihood function for the vector of measurements y is

f_{Y|A, \Omega, \Phi, \Sigma}(y|\alpha, \omega, \varphi, \sigma) = \prod_{i=1}^N f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma)

Now we just have to specify our likelihood function for a single measurement, and since we have an idea of how the data was generated, we can do that easily:

f_{Y_i|A, \Omega, \Phi, \Sigma}(y_i|\alpha, \omega, \varphi, \sigma) = \mathcal{N}(\alpha \sin(2\pi \omega t_i + \varphi), \sigma^2)

To compute the posterior distributions of our random variables given a series of measurements performed at the other end of the line, let’s use a Monte Carlo method with the Python library pymc. As you can see from the IPython notebook, it’s extremely easy to write and computationally evaluate a probabilistic model.
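For the curious, a model along these lines can be expressed with the pymc 2.x API roughly as follows; this is a self-contained sketch with synthetic data and guessed prior ranges, not the notebook’s exact code:

import numpy as np
import pymc

# Synthetic noisy samples of the sinusoid, standing in for the measured signal.
t = np.linspace(0, 1, 200)
y_obs = 2.0 * np.sin(2 * np.pi * 4 * t + 0.5) + np.random.normal(0, 0.5, t.size)

# Uniform priors on amplitude, frequency, phase and noise standard deviation.
alpha = pymc.Uniform('alpha', 0.0, 10.0)
omega = pymc.Uniform('omega', 0.1, 10.0)
phi = pymc.Uniform('phi', 0.0, 2 * np.pi)
sigma = pymc.Uniform('sigma', 0.01, 5.0)

@pymc.deterministic
def mu(alpha=alpha, omega=omega, phi=phi):
    return alpha * np.sin(2 * np.pi * omega * t + phi)

@pymc.deterministic
def tau(sigma=sigma):
    return 1.0 / sigma ** 2  # pymc parametrizes the Normal by its precision

# Likelihood: each sample is Normal(mu(t_i), sigma^2).
y = pymc.Normal('y', mu=mu, tau=tau, value=y_obs, observed=True)

model = pymc.MCMC([alpha, omega, phi, sigma, mu, tau, y])
model.sample(iter=20000, burn=5000)
print(alpha.stats()['mean'], omega.stats()['mean'], phi.stats()['mean'])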

Final Thoughts

We have seen that Bayesian inference can be applied successfully in seemingly unrelated fields. Even though there are better suited techniques for logic, recommender systems and signal processing, it’s still interesting to see how such a simple theorem can have so many applications.

Forecasting time series for fun

Predicting time series can be very interesting, and not only for quants. Any server that logs metrics like the number of submissions or requests over time generates a so-called time series, or signal. An interesting time series I had the chance to play with some time ago is the one generated by Telemetry’s submissions. This is how it looks for the Nightly channel over the past 60 days:

Figure 1

It’s immediately evident to a human eye when and where there was a drop in submissions in the past couple of months (bugs!). An interesting problem is to automatically identify a drop in submissions as soon as it happens, while keeping the number of “false alarms” to a minimum. It might seem rather trivial at first, but given that the distribution is quite spread out, mostly because of daily oscillations, an outlier detection method based on the standard deviation is doomed to fail. Using the median absolute deviation is more robust, but still not good enough to avoid false positives.

The periodic patterns might not be immediately visible from the raw data plot, but once we connect the dots the daily and weekly patterns appear in all their beauty:

Figure 2

The method I came up with to catch drops does the following:

  1. It retrieves the distributions of the last 10 days from the current data point
  2. It performs a series of Mann-Whitney tests to compare the last 24h to the distributions of the previous 9 days
  3. If the distributions are statistically different for at least 5 days, with the current daily one having a lower mean, then we have a drop (see the sketch after this list)
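Here is a minimal sketch of that rule, assuming the daily submission counts are available as a list of numpy arrays (one array of per-interval counts per day) and using placeholder thresholds:

import numpy as np
from scipy.stats import mannwhitneyu

def is_drop(daily_counts, today_index, min_days=5, alpha=0.05):
    """Return True if the current day looks like a drop compared to the previous 9 days."""
    today = daily_counts[today_index]
    lower_days = 0
    for back in range(1, 10):  # compare against each of the previous 9 days
        past = daily_counts[today_index - back]
        _, p_value = mannwhitneyu(today, past, alternative='two-sided')
        if p_value < alpha and today.mean() < past.mean():
            lower_days += 1
    return lower_days >= min_days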

Figure 3

The algorithm requires a certain amount of history to make good predictions, which is why it detected the first drop on the left only after several days. As expected though, it was able to detect the second drop without any false positives. Sudden drops are easy to detect with a robust outlier detection method, but slow drops, like the ones we experienced in the past, can go unnoticed if you just look for outliers.

Another interesting approach is to use time series analysis to decompose the series into its seasonal (periodic), trend and noise components. A simple classical decomposition by moving average yields the following series:

Figure 4
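For reference, such a classical decomposition can be produced in a few lines with statsmodels; the file name and the hourly sampling assumed here are placeholders:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Submission counts indexed by timestamp, assumed to be sampled hourly.
submissions = pd.read_csv('submissions.csv', index_col=0, parse_dates=True).squeeze()

# Classical decomposition by moving average into trend, seasonal and residual parts.
decomposition = seasonal_decompose(submissions, model='additive', period=24)
decomposition.plot()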

This simple algorithm was able to remove most of the periodic pattern; the trend, though, is still affected by the weekly signal and the drops. It turns out that newer methods are able to decompose time series with multiple periodic patterns, or seasonalities. One algorithm I particularly like is the so-called TBATS method, which is an advanced exponential smoothing model:

Figure 5

That’s pretty impressive! The TBATS algorithm was able to identify and remove the daily and weekly frequencies from our signal; what remains is basically the trend and some random noise. Now that we have such a clean signal, we could try to apply statistical quality control to our time series, i.e. use a set of rules to identify drops. The rules look at the historical mean and standard deviation of a series of datapoints and help judge whether a new set of points is experiencing a mean shift (drop) or not.
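The original decomposition may well have been done with R’s forecast package; a rough Python equivalent using the third-party tbats package, reusing the submissions series loaded in the earlier sketch and assuming hourly data, might look like this:

from tbats import TBATS

# Fit a TBATS model with daily and weekly seasonalities on hourly submission counts.
estimator = TBATS(seasonal_periods=(24, 24 * 7))
model = estimator.fit(submissions.values)

residual = submissions.values - model.y_hat  # what's left after removing trend and seasonality
forecast = model.forecast(steps=24 * 5)      # predict the next 5 days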

Given a decomposition of a time series, we can also use it to predict future datapoints. This can be useful for a variety of reasons beyond detecting drops. To get an idea of how well we can predict future submissions, let’s take a clean subset of our data, from day 20 to day 40, and try to predict Telemetry’s submissions for the next 5 days while comparing them to the actual data:

Figure 6

That’s pretty neat: we can immediately see that we have an outlier, and the prediction is very close to the actual data.

I wonder if there are other methods used to detect alterations in time series, so feel free to drop me a line with a pointer if you happen to have a suggestion.

Main-thread IO Analysis

<edit>

  • The goal of this post is to show that main-thread IO matters and not to “shame” this or that add-on, or platform subsystem. There is only so much I can do alone by filing bugs and fixing some of them. By spreading knowledge about the off-main-thread IO API and why it matters, my hope is that Firefox and add-on developers alike will try to exploit it the next time an opportunity arises. I changed the title of this post accordingly to reflect my intentions.
  • Please note that just because some code performs IO on the main thread, it doesn’t necessarily mean that it can be avoided. This has to be evaluated case by case.

</edit>

Recently a patch landed in Firefox that allows telemetry to collect aggregated main-thread disk IO timing per filename. Each filename comes with its accumulated time spent and the number of operations performed for open(), read(), write(), fsync() and stat(). Yoric wrote an interesting article some time ago on why main-thread IO can be a serious performance issue.

I collected a day’s worth of Nightly data using Telemetry’s map-reduce framework and filtered out all those filenames that were not common enough (< 10%) among the submissions. What follows are some interesting questions we could answer thanks to the newly collected data.

  • How common are the various operations?
    The most common operation is stat(), followed by open(); all other operations are very uncommon, with a third quartile of 0.
  • Is there a correlation between the number of operations and the time spent doing IO on a file?
    There is an extremely weak correlation, which is what one would expect. Generally, what really makes a difference is, for instance, the number of bytes read and written, not the number of operations performed.
  • Does having an SSD lower the time spent doing IO on a file?
    A Mann-Whitney test confirms that the distributions of time conditioned on “having an SSD” and “not having an SSD” are indeed statistically different. On average, having an SSD lowers the time by a factor of 2.7.

The above facts apply to all submissions, but what we are really interested in is discerning those files for which the accumulated time is particularly high. An interesting exercise is to compare some distributions of the top 100 files (in terms of time) against the distributions of the general population to determine if bad timings are correlated with older machines:

  • Does the distribution of Windows versions differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not differ (with high probability), i.e. the OS version doesn’t seem to influence the timings.
  • Does the distribution of CPU architectures (32 vs 64 bit) differ between the population and the top 100?
    Fisher’s test confirms that the distributions do not seem to differ.
  • Does the distribution of the disk type differ between the population and the top 100?
    The distributions do not seem to differ according to Fisher’s test.

Can we deduce something from all this? It seems that files that behave badly do so regardless of the machine’s configuration. That is, removing the offending operations performing IO on the main thread will likely benefit all users.

Which brings us to the more interesting question of how we should prioritize the IO operations that have to be moved off the main thread. I am not a big fan of looking only at aggregates of the data, since outliers turn out to be precious more often than not. Let’s have a look at the distribution of the top 100 outliers.

1. Distribution of the top outliers.

Whoops, looks like httpDataUsage.dat, places.sqlite-wal, omni.ja and prefs.js behave particularly badly quite often. How bad is bad? For instance, one entry of httpDataUsage.dat performs about 100 seconds of IO operations (38 stat & 38 open), i.e. ~1.3 seconds per operation!

Now that we have a feeling for the outliers, let’s have a look at the aggregated picture. Say we are interested in those files most users hit (> 90%) and rank them by their third quartile:

2. Top 20 files most users hit.

In general the total time spent doing IO is about 2 to 3 orders of magnitude lower than for the outliers.

But what happened to the add-ons? It turns out that the initial filtering we applied to our dataset removed most of them. Let’s rectify the situation by taking a dataset with the single hottest entry for each file, in terms of time, and aggregating the filenames by extension:

3. Top extensions with slow file operations.

To keep track of some of the statistics presented here, mreid and I prepared a dashboard. Time to file some bugs!

Idle wakeups are evil

I have been spending some quality time with gdb tracking down timers firing during idle in Firefox, which turns out to be the main culprit for FF’s energy consumption. Wakeups are triggered either because a timer fires or because a process sleeping on a synchronisation variable is woken up. In fact the former can be reduced to the latter, since the timer thread, which executes the timer callbacks, also blocks on a synchronisation variable.

Tracking down wakeups is nothing new and has been done in the past. Automating it, on the other hand, isn’t trivial without heavily modifying the source code. I tried automating the process through a gdb script, but it kept crashing randomly, and when it didn’t, FF got significantly slower, so I gave up on that. There are some alternatives though, e.g. on Linux the perf subsystem can be exploited to track the following events:

  • the context switches (sched:sched_switch),
  • wakeups (sched:sched_wakeup),
  • and the C-states of the CPU (power:cpu_idle).

By keeping track of the CPU C-states we know when the CPU is sleeping. The wakeup probe tells us when a specific thread is woken up and, by correlating it with the previous probe, gives us the culprit that caused the CPU C-state to change. Finally, the context switch probe allows us to retrieve a user-space callgraph of the woken-up thread.

That’s an interesting approach, and powertop does something similar, but it turns out that to investigate wakeups we don’t really need that degree of accuracy. Instead of considering only the context switches caused by wakeups correlated with C-state changes, we can just consider all context switches. Some context switches will not be associated with C-state changes, but this heuristic is still good enough as a first approximation during an idle workload. On Linux the following command can be used to collect callgraphs of context switches:

perf record -e sched:sched_switch -R -c 1 -g -p $(pgrep firefox)

As shown in the example above, most of the time a context switch happens we are either in the BackgroundHangManager, in the Timer thread or in nsSocketTransportService. Not all timers are created equal though, and we would really like to know which timers are firing. Unfortunately, since the specific timer callback is executed only after the context switch, there isn’t a simple way to get that sort of information through perf.

The linker comes to our rescue: by using dladdr we can retrieve the mangled name of a timer callback handler, which we can then demangle and print with __cxa_demangle. This clearly works only if the callbacks are exported in the dynamic symbol table, which they are by default on OSX, whereas on Linux we have to recompile the codebase with the -export-dynamic linker flag.

A matter of energy

In the past month I have looked into measuring the power consumption of Firefox on desktop. It all started with a complaint of excessive background activity during idle, so I wrote a little tool to measure the energy consumption of FF while idling on a blank page. I wasn’t able, though, to notice any statistically valid difference between the idle machine and the idle machine plus the idle browser (on Win8, OSX & Ubuntu).

A more traditional profiling session using DTrace did yield some wakeups though, which I could track back to a particular set of timers firing. Most of those timers have been addressed and fixed in the current Nightly. I suspect that the reason I couldn’t identify them in my power benchmark is that the profiling interval I used was just too short (5-10 min). A more sensible test would be to use longer intervals and/or to measure the time it takes for a battery to drain, for instance.

The next step was to also have a look at the energy drain when idling on some popular websites. In order to reduce the variance of the measurements I had to disable as many background processes as possible, e.g. Spotlight on OSX. For the same reason I couldn’t use Selenium and had to rely on simple terminal commands to steer the browsers. The problem with Selenium is that each browser has its own implementation, each with a completely different power signature. It would probably be possible to remove the noise using some ML algorithm, but that would require a large enough training set for each distinct website and OS… A more realistic alternative would be to write some “simple” and lightweight windowing event generator to steer the different browsers.
That being said, even without Selenium it’s possible to get some interesting data: before we can optimize the power drain of FF while scrolling on cnn.com, for instance, we must address the power drain of simply idling on it, since that drain is very likely to contribute to any power profile taken while scrolling anyway.

Keep in mind the following bits when interpreting the data:

  1. the latest release versions of FF, Chrome, IE & Safari as of this writing have been used;
  2. the sampling interval amounts to only 30 seconds; even though it’s extremely short and might not catch background activities like sporadic timers, it’s obvious that if there is something very wrong with a page it will show up in the profile anyway;
  3. the error bars mark the 95% confidence intervals;
  4. the energy profiler starts measuring only about a minute after a page has been loaded; that’s needed to ensure the browser is in a steady state;
  5. the profiles were performed by idling on the homepages of the different sites, i.e. youtube.com wasn’t playing any video;
  6. only for Facebook the browsers were logged into my profile;
  7. the data was collected on different machines at different times, i.e. the 3 plots are not comparable between each other;
  8. I am assuming that a particular homepage doesn’t change dramatically when collecting the profiles for the different browsers (more on this later);
  9. only the CPU specific power drainage is measured.

Let’s start with a plot performed on Windows 8:

windows_power

Firefox is performing quite well; only Facebook seems to have a particularly adverse effect on the energy profile.

Let’s have a look at the OSX plot:

osx_power

It seems that here FF performs particularly badly on Facebook and Yahoo, but is in good company in the latter case.

And finally we have the profile for Ubuntu:

ubuntu_power

And again Facebook is a clear outlier here. As you can notice, the variance is much lower since on Linux it was easier to remove all unneeded background tasks.

The main pattern across those profiles seems to be that the OS vendors’ own browsers are slightly better at not draining energy during idle. Also, Facebook seems to affect FF negatively across all OSs, so it doesn’t seem to be a random fluctuation. It looks like I am going to have some fun this week tracking down the root of the problem :-)

For the future, a distributed benchmark that runs on O*B identically configured machines, where O is the number of OSs and B the number of browsers, would be ideal. This would allow the profiles to be comparable across OSs and also make them independent of homepage updates.

My first two weeks at Mozilla

In the first couple of weeks at Mozilla I spent some time with the client- and server-side Telemetry implementation. Telemetry allows Engineering to receive aggregate data about browser health in the field, e.g. cache hit rates or page load times across all browser instances.

In summary, the telemetry workflow looks more or less like this:

  1. Firefox generates telemetry data while it’s being used, if the user has explicitly enabled the collection;
  2. the collected data is sent once a day to a server via HTTPS;
  3. the received data is collected into a queue and post-processed by a converter, which validates, compacts and compresses it before it’s sent to persistent storage;
  4. analysis jobs are run on the data available in persistent storage and the results are presented on the telemetry dashboard;
  5. finally, developers can access the persistent data through custom map-reduce jobs to compute user-defined metrics.

I have been working on small bits of the project, i.e.:

  1. adding a feature to Firefox that allows certain telemetry data to “expire”, i.e. not be sent to the server;
  2. integrating a C++ record compressor into the back end;
  3. running some map-reduce jobs to determine the number of SSDs vs HDDs.

In particular, for the last item, it seems that about 8% of Nightly users are running from SSDs, which is not as high as I initially suspected. Running this sort of query against Telemetry datasets is very easy: a JSON file specifies the filter for the datasets you want to analyze, and the actual analysis is implemented in a Python file as a map-reduce job.
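For a flavour of what such a job looks like, here is a rough sketch; the map/reduce hook signatures and the payload field used to detect an SSD are assumptions, not the exact framework API:

import json

def map(key, dims, value, context):
    # Each value is a raw Telemetry submission; emit one count per disk type.
    payload = json.loads(value)
    is_ssd = payload.get("info", {}).get("hasSSD", False)  # placeholder field name
    context.write("ssd" if is_ssd else "hdd", 1)

def reduce(key, values, context):
    # Sum the per-submission counts for each disk type.
    context.write(key, sum(values))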

Hello World

I’m Roberto, the newest member of Mozilla’s Desktop Performance team. Before that I worked on optimizing the software framework used in ATLAS, one of LHC’s experiments.

I will be working on Telemetry during my first weeks, like adding expiration dates to telemetry probes (bug 742500).

I am very excited to start this new journey! I am the only team member working from the London office, which is located in one of the most gorgeous areas of the city, the Theater District.