(FTRL) Ad Click Prediction A View From The Trenches (Google 2013)
probability with which event t is sampled (either 1 or r), and so by definition $\omega_t = 1/s_t$. Thus, we have

$$E[\ell_t(w_t)] = s_t \omega_t \ell_t(w_t) + (1 - s_t)\,0 = s_t \frac{1}{s_t} \ell_t(w_t) = \ell_t(w_t).$$

Linearity of expectation then implies that the expected weighted objective on the subsampled training data equals the objective function on the original data set. Experiments have verified that even fairly aggressive sub-sampling of unclicked queries has a very mild impact on accuracy, and that predictive performance is not especially impacted by the specific value of r.
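To see the identity concretely, the following minimal sketch (ours, not the paper's) simulates a click stream, subsamples unclicked events at rate r, and compares the importance-weighted loss on the subsample to the loss on the full data; the two agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def logloss(p, y):
    # Per-example logistic loss, as in Eq. (1).
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic stream of predictions and click labels (clicked = 1).
p = rng.uniform(0.01, 0.2, size=200_000)
y = (rng.uniform(size=p.size) < p).astype(float)

r = 0.1                              # sampling rate for unclicked queries
s = np.where(y == 1, 1.0, r)         # s_t: probability event t is sampled
keep = rng.uniform(size=p.size) < s  # subsample the stream
omega = 1.0 / s                      # importance weight omega_t = 1 / s_t

full = logloss(p, y).sum()
weighted = (omega[keep] * logloss(p[keep], y[keep])).sum()
print(full, weighted)  # close in practice; equal in expectation
```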
5. EVALUATING MODEL PERFORMANCE

Evaluating the quality of our models is done most cheaply through the use of logged historical data. (Evaluating models on portions of live traffic is an important, but more expensive, piece of evaluation; see, for example, [30].)

Because the different metrics respond in different ways to model changes, we find that it is generally useful to evaluate model changes across a plurality of possible performance metrics. We compute metrics such as AucLoss (that is, $1 - \mathrm{AUC}$, where AUC is the standard area under the ROC curve metric [13]), LogLoss (see Eq. (1)), and SquaredError. For consistency, we also design our metrics so that smaller values are always better.
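In code, these conventions amount to a few lines; a sketch of one plausible reading (using scikit-learn for the AUC computation; the paper does not specify an implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_loss(y, p):
    """AucLoss = 1 - AUC, so that smaller values are better."""
    return 1.0 - roc_auc_score(y, p)

def log_loss(y, p, eps=1e-15):
    """Average logistic loss (cf. Eq. (1)); smaller is better."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def squared_error(y, p):
    return float(np.mean((np.asarray(y) - np.asarray(p)) ** 2))
```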
5.1 Progressive Validation

We generally use progressive validation (sometimes called online loss) [5] rather than cross-validation or evaluation on a held-out dataset. Because computing a gradient for learning requires computing a prediction anyway, we can cheaply stream those predictions out for subsequent analysis, aggregated hourly. We also compute these metrics on a variety of sub-slices of the data, such as breakdowns by country, query topic, and layout.

The online loss is a good proxy for our accuracy in serving queries, because it measures the performance only on the most recent data before we train on it, exactly analogous to what happens when the model serves queries. The online loss also has considerably better statistics than a held-out validation set, because we can use 100% of our data for both training and testing. This is important because small improvements can have meaningful impact at scale and need large amounts of data to be observed with high confidence.

Absolute metric values are often misleading. Even if predictions are perfect, the LogLoss and other metrics vary depending on the difficulty of the problem (that is, the Bayes risk). If the click rate is closer to 50%, the best achievable LogLoss is much higher than if the click rate is closer to 2%. This is important because click rates vary from country to country and from query to query, and therefore the averages change over the course of a single day.

We therefore always look at relative changes, usually expressed as a percent change in the metric relative to a baseline model. In our experience, relative changes are much more stable over time. We also take care only to compare metrics computed from exactly the same data; for example, loss metrics computed on a model over one time range are not comparable to the same loss metrics computed on another model over a different time range.
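The predict-then-update pattern behind progressive validation is a few lines around any online learner; a sketch, where the `predict`/`update` interface on `model` is hypothetical:

```python
def progressive_validation(model, stream, metric):
    """Yield the online loss over an example stream: score each example
    before training on it, exactly as the serving system would see it."""
    for x, y in stream:
        p = model.predict(x)   # needed to compute the gradient anyway
        yield metric(y, p)     # record loss *before* the update
        model.update(x, y)     # then take the online learning step

# Downstream, the yielded losses can be aggregated hourly and sliced
# by country, query topic, layout, etc.
```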
5.2 Deep Understanding through Visualization

One potential pitfall in massive-scale learning is that aggregate performance metrics may hide effects that are specific to certain sub-populations of the data. For example, a small aggregate accuracy win on one metric may in fact be caused by a mix of positive and negative changes in distinct countries, or for particular query topics. This makes it critical to provide performance metrics not only on the aggregate data, but also on various slicings of the data, such as a per-country basis or a per-topic basis.

Because there are hundreds of ways to slice the data meaningfully, it is essential that we be able to examine a visual summary of the data effectively. To this end, we have developed a high-dimensional interactive visualization called GridViz to allow comprehensive understanding of model performance.

A screen-shot of one view from GridViz is shown in Figure 2, showing a set of slicings by query topic for two models in comparison to a control model. Metric values are represented by colored cells, with rows corresponding to the model name and the columns corresponding to each unique slicing of the data. The column width connotes the importance of the slicing, and may be set to reflect quantities such as number of impressions or number of clicks. The color of the cell reflects the value of the metric compared to a chosen baseline, which enables fast scanning for outliers and areas of interest, as well as visual understanding of the overall performance. When the columns are wide enough, the numeric values of the selected metrics are shown. Multiple metrics may be selected; these are shown together in each row. A detailed report for a given cell pops up when the user mouses over the cell.

Because there are hundreds of possible slicings, we have designed an interactive interface that allows the user to select different slicing groups via a dropdown menu, or via a regular expression on the slicing name. Columns may be sorted and the dynamic range of the color scale modified to suit the data at hand. Overall, this tool has enabled us to dramatically increase the depth of our understanding of model performance on a wide variety of subsets of the data, and to identify high-impact areas for improvement.
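The computation underneath such a view is ordinary per-slice aggregation of the streamed losses; a small sketch (column names are hypothetical, pandas used for convenience):

```python
import pandas as pd

def relative_change_by_slice(df: pd.DataFrame, slice_col: str) -> pd.Series:
    """Percent change in mean loss versus a baseline model, per slice.
    Expects columns [slice_col, 'loss_model', 'loss_baseline'] computed
    on exactly the same events for both models; negative values are wins
    since smaller losses are better."""
    g = df.groupby(slice_col)[["loss_model", "loss_baseline"]].mean()
    return (100.0 * (g["loss_model"] / g["loss_baseline"] - 1.0)).sort_values()
```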
6. CONFIDENCE ESTIMATES

For many applications, it is important to not only estimate the CTR of the ad, but also to quantify the expected accuracy of the prediction. In particular, such estimates can be used to measure and control explore/exploit tradeoffs: in order to make accurate predictions, the system must sometimes show ads for which it has little data, but this should be balanced against the benefit of showing ads which are known to be good [21, 22].

Confidence intervals capture the notion of uncertainty, but for both practical and statistical reasons, they are inappropriate for our application. Standard methods would assess the confidence of predictions of a fully-converged batch model without regularization; our models are online, do not assume IID data (so convergence is not even well defined), and are heavily regularized. Standard statistical methods (e.g., [18], Sec. 2.5) also require inverting an $n \times n$ matrix; when n is in the billions, this is a non-starter.

Further, it is essential that any confidence estimate can be computed extremely cheaply at prediction time, say in about as much time as making the prediction itself.

We propose a heuristic we call the uncertainty score, which is computationally tractable and empirically does a good job of quantifying prediction accuracy. The essential observation is that the learning algorithm itself maintains a notion of uncertainty in the per-feature counters $n_{t,i}$ used for learning rate control. Features for which $n_i$ is large get a smaller learning rate, precisely because we believe the current coefficient values are more likely to be accurate. The gradient of logistic loss with respect to the log-odds score is $(p_t - y_t)$, and hence has absolute value bounded by 1. Thus, if we assume feature vectors are normalized so $|x_{t,i}| \le 1$, we can bound the change in the log-odds prediction due to observing a single training example $(x, y)$. For simplicity, consider $\lambda_1 = \lambda_2 = 0$, so FTRL-Proximal is equivalent to online gradient descent. Letting $n_{t,i} = \sum_{s=1}^{t} g_{s,i}^2$ and following Eq. (2), we have

$$|x \cdot w_t - x \cdot w_{t+1}| = \sum_{i:|x_i|>0} \eta_{t,i}\, |g_{t,i}| \;\le\; \sum_{i:|x_i|>0} \frac{\alpha\, x_{t,i}}{\sqrt{n_{t,i}}} \;=\; \alpha\, \eta \cdot x \;\equiv\; u(x)$$

where $\eta$ is the vector of learning rates. We define the uncertainty score to be the upper bound $u(x) \equiv \alpha\, \eta \cdot x$; it can be computed with a single sparse dot product, just like the prediction $p = \sigma(w \cdot x)$.

Experimental Results. We validated this methodology as follows. First, we trained a "ground truth" model on real data, but using slightly different features than usual. Then, we discarded the real click labels and sampled new labels, taking the predictions of the ground-truth model as the true CTRs. This is necessary, as assessing the validity of a confidence procedure requires knowing the true labels. We then ran FTRL-Proximal on the re-labeled data, recording predictions $p_t$, which allows us to compare the accuracy of the predictions in log-odds space, $e_t = |\sigma^{-1}(p_t) - \sigma^{-1}(p_t^*)|$, where $p_t^*$ is the true CTR (given by the ground-truth model). Figure 3 plots the errors $e_t$ as a function of the uncertainty score $u_t = u(x_t)$; there is a high degree of correlation.

Figure 3: Visualizing Uncertainty Scores. Log-odds errors $|\sigma^{-1}(p_t) - \sigma^{-1}(p_t^*)|$ plotted versus the uncertainty score, a measure of confidence. The x-axis is normalized so the density of the individual estimates (gray points) is uniform across the domain. Lines give the estimated 25%, 50%, and 75% error percentiles. High uncertainties are well correlated with larger prediction errors.

Additional experiments showed the uncertainty scores performed comparably (under the above evaluation regime) to the much more expensive estimates obtained via a bootstrap of 32 models trained on random subsamples of data.
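Since u(x) shares its sparsity pattern with the prediction itself, both can be computed in one pass over the nonzero features; a sketch under the simplified setting above (the dict-based sparse representation and the epsilon guard for unseen features are our additions):

```python
import numpy as np

def predict_with_uncertainty(x, w, n, alpha, eps=1e-8):
    """One sparse pass computes the prediction p = sigma(w . x) and the
    uncertainty score u(x) = alpha * sum_i |x_i| / sqrt(n_i).
    x: sparse example {feature_id: value} with |value| <= 1
    w: weights; n: per-feature sums of squared gradients (learning-rate state)
    eps guards features never seen in training, whose uncertainty is
    otherwise unbounded."""
    z = sum(w.get(i, 0.0) * v for i, v in x.items())
    u = sum(alpha * abs(v) / np.sqrt(n.get(i, 0.0) + eps) for i, v in x.items())
    return 1.0 / (1.0 + np.exp(-z)), u
```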
7. CALIBRATING PREDICTIONS

Accurate and well-calibrated predictions are not only essential to run the auction, they also allow for a loosely coupled overall system design, separating concerns of optimizations in the auction from the machine learning machinery.

Systematic bias (the difference between the average predicted and observed CTR on some slice of data) can be caused by a variety of factors, e.g., inaccurate modeling assumptions, deficiencies in the learning algorithm, or hidden features not available at training and/or serving time. To address this, we can use a calibration layer to match predicted CTRs to observed click-through rates.

Our predictions are calibrated on a slice of data d if, on average, when we predict p the actual observed CTR is near p. We can improve calibration by applying correction functions $\tau_d(p)$, where p is the predicted CTR and d is an element of a partition of the training data. We define success as giving well-calibrated predictions across a wide range of possible partitions of the data.

A simple way of modeling $\tau$ is to fit a function $\tau(p) = \gamma p^{\kappa}$ to the data. We can learn $\gamma$ and $\kappa$ using Poisson regression on aggregated data. A slightly more general approach that is able to cope with more complicated shapes in bias curves is to use a piecewise linear or piecewise constant correction function. The only restriction is that the mapping function $\tau$ should be isotonic (monotonically increasing). We can find such a mapping using isotonic regression, which computes a weighted least-squares fit to the input data subject to that constraint (see, e.g., [27, 23]). This piecewise-linear approach significantly reduced bias for predictions at both the high and low ends of the range, compared to the reasonable baseline method above.

It is worth noting that, without strong additional assumptions, the inherent feedback loop in the system makes it impossible to provide theoretical guarantees for the impact of calibration [25].
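For the isotonic variant, an off-the-shelf solver suffices to illustrate the idea; a sketch using scikit-learn's IsotonicRegression (the paper does not say what implementation was used):

```python
from sklearn.isotonic import IsotonicRegression

def fit_calibration(predicted_ctr, clicked, weights=None):
    """Fit a monotone correction tau on one slice of data: a weighted
    least-squares fit of observed outcomes to predicted CTRs, constrained
    to be isotonic."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(predicted_ctr, clicked, sample_weight=weights)
    return iso.predict  # tau: maps predicted CTR to corrected CTR

# Usage: tau = fit_calibration(p_train, y_train); p_served = tau(p_new)
```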
8. AUTOMATED FEATURE MANAGEMENT

An important aspect of scalable machine learning is managing the scale of the installation, encompassing all of the configuration, developers, code, and computing resources that make up a machine learning system. An installation comprising several teams modeling dozens of domain-specific problems requires some overhead. A particularly interesting case is the management of the feature space for machine learning.

We can characterize the feature space as a set of contextual and semantic signals, where each signal (e.g., 'words in the advertisement', 'country of origin', etc.) can be translated to a set of real-valued features for learning. In a large installation, many developers may work asynchronously on signal development. A signal may have many versions corresponding to configuration changes, improvements, and alternative implementations. An engineering team may consume signals which they do not directly develop. Signals may be consumed on multiple distinct learning platforms and applied to differing learning problems (e.g., predicting search vs. display ad CTR). To handle the combinatorial growth of use cases, we have deployed a metadata index for managing consumption of thousands of input signals by hundreds of active models.

Indexed signals are annotated both manually and automatically for a variety of concerns; examples include deprecation, platform-specific availability, and domain-specific applicability. Signals consumed by new and active models are vetted by an automatic system of alerts. Different learning platforms share a common interface for reporting signal consumption to a central index. When a signal is deprecated (such as when a newer version is made available), we can quickly identify all consumers of the signal and track replacement efforts. When an improved version of a signal is made available, consumers can be alerted to experiment with the new version.

New signals can be vetted by automatic testing and white-listed for inclusion. White-lists can be used both for ensuring correctness of production systems, and for learning systems using automated feature selection. Old signals which are no longer consumed are automatically earmarked for code cleanup, and for deletion of any associated data.

Effective automated signal consumption management ensures that more learning is done correctly the first time. This cuts down on wasted and duplicate engineering effort, saving many engineering hours. Validating configurations for correctness before running learning algorithms eliminates many cases where an unusable model might result, saving significant potential resource waste.
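The paper does not describe the index's schema or implementation; purely as an illustration of the kind of record and query involved, a hypothetical sketch (all names invented):

```python
from dataclasses import dataclass, field

@dataclass
class SignalRecord:
    """Hypothetical entry in a signal-consumption metadata index."""
    name: str                 # e.g. 'words_in_advertisement'
    version: str
    deprecated: bool = False
    platforms: set = field(default_factory=set)  # where the signal is available
    consumers: set = field(default_factory=set)  # models consuming this version

def consumers_of(index, signal_name):
    """All models consuming any version of a signal, e.g. the set to alert
    when the signal is deprecated or an improved version ships."""
    return {m for rec in index if rec.name == signal_name for m in rec.consumers}
```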
9. UNSUCCESSFUL EXPERIMENTS

In this final section, we report briefly on a few directions that (perhaps surprisingly) did not yield significant benefit.

9.1 Aggressive Feature Hashing

In recent years, there has been a flurry of activity around the use of feature hashing to reduce the RAM cost of large-scale learning. Notably, [31] report excellent results using the hashing trick to project a feature space capable of learning a personalized spam filtering model down to a space of only $2^{24}$ features, resulting in a model small enough to fit easily in RAM on one machine. Similarly, Chapelle reported using the hashing trick with $2^{24}$ resultant features for modeling display-advertisement data [6].

We tested this approach but found that we were unable to project down lower than several billion features without observable loss. This did not provide significant savings for us, and we have preferred to maintain interpretable (non-hashed) feature vectors instead.
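For concreteness, a minimal rendering of the hashing trick in the spirit of [31]; the CRC-based hash and the exact bucket count are our own choices, not details from the experiments above:

```python
import zlib

NUM_BUCKETS = 1 << 24  # 2**24 hashed feature slots, the size used in [31, 6]

def hash_features(raw_features):
    """Project string-valued features into a fixed 2**24-dimensional space.
    Distinct features may collide in a bucket, which is a plausible source
    of the accuracy loss observed below several billion features."""
    x = {}
    for f in raw_features:  # e.g. 'query=flowers', 'country=US'
        i = zlib.crc32(f.encode()) % NUM_BUCKETS
        x[i] = x.get(i, 0.0) + 1.0
    return x
```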
9.2 Dropout

Recent work has placed interest around the novel technique of randomized "dropout" in training, especially in the deep belief network community [17]. The main idea is to randomly remove features from input example vectors independently with probability p, and compensate for this by scaling the resulting weight vector by a factor of $(1 - p)$ at test time. This is seen as a form of regularization that emulates bagging over possible feature subsets.

We have experimented with a range of dropout rates from 0.1 to 0.5, each with an accompanying grid search for learning rate settings, including varying the number of passes over the data. In all cases, we have found that dropout training does not give a benefit in predictive accuracy metrics or generalization ability, and most often produces a detriment.

We believe the source of the difference between these negative results and the promising results from the vision community lies in the differences in feature distribution. In vision tasks, input features are commonly dense, while in our task input features are sparse and labels are noisy. In the dense setting, dropout serves to separate effects from strongly correlated features, resulting in a more robust classifier. But in our sparse, noisy setting, adding in dropout appears to simply reduce the amount of data available for learning.
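The variant tested is easy to state for sparse examples; a small sketch of our reading of it (dict-based sparse representation assumed):

```python
import random

rng = random.Random(0)

def dropout_features(x, p):
    """Training time: drop each input feature independently with prob. p."""
    return {i: v for i, v in x.items() if rng.random() >= p}

def test_time_score(w, x, p):
    """Test time: compensate by scaling the learned weights by (1 - p)."""
    return (1.0 - p) * sum(w.get(i, 0.0) * v for i, v in x.items())
```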
9.3 Feature Bagging

Another training variant along the lines of dropout that we investigated was that of feature bagging, in which k models are trained independently on k overlapping subsets of the feature space. The outputs of the models are averaged for a final prediction. This approach has been used extensively in the data mining community, most notably with ensembles of decision trees [9], offering a potentially useful way of managing the bias-variance tradeoff. We were also interested in this as a potentially useful way to further parallelize training. However, we found that feature bagging actually slightly reduced predictive quality, by between 0.1% and 0.6% AucLoss, depending on the bagging scheme.
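Averaging the bagged models is the only mechanism involved; a toy sketch (the per-model `predict` interface and the choice of feature subsets are hypothetical):

```python
def bagged_predict(models, feature_subsets, x):
    """Score x with k models, each restricted to its own (overlapping)
    subset of the feature space, and average the k predictions."""
    preds = [m.predict({i: v for i, v in x.items() if i in s})
             for m, s in zip(models, feature_subsets)]
    return sum(preds) / len(preds)
```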
9.4 Feature Vector Normalization

In our models the number of non-zero features per event can vary significantly, causing different examples x to have different magnitudes $\|x\|$. We worried that this variability may slow convergence or impact prediction accuracy. We explored several flavors of normalizing by training with $x/\|x\|$ for a variety of norms, with the goal of reducing the variance in magnitude across example vectors. Despite some early results showing small accuracy gains, we were unable to translate these into overall positive metrics. In fact, our experiments looked somewhat detrimental, possibly due to interaction with per-coordinate learning rates and regularization.
10. ACKNOWLEDGMENTS

We gratefully acknowledge the contributions of the following: Vinay Chaudhary, Jean-Francois Crespo, Jonathan Feinberg, Mike Hochberg, Philip Henderson, Sridhar Ramaswamy, Ricky Shan, Sajid Siddiqi, and Matthew Streeter.
11. REFERENCES

[1] D. Agarwal, B.-C. Chen, and P. Elango. Spatio-temporal models for estimating click-through rate. In Proceedings of the 18th International Conference on World Wide Web, pages 21–30. ACM, 2009.
[2] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD Conference, 2013. To appear.
[3] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. 2011.
[4] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), July 1970.
[5] A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In COLT, 1999.
[6] O. Chapelle. Click modeling for display advertising. In AdML: 2012 ICML Workshop on Online Advertising, 2012.
[7] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In ALT, 2008.
[8] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
[9] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
[10] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.
[11] J. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems 22, pages 495–503, 2009.
[12] L. Fan, P. Cao, J. Almeida, and A. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3), June 2000.
[13] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[14] D. Golovin, D. Sculley, H. B. McMahan, and M. Young. Large-scale learning with a small-scale footprint. In ICML, 2013. To appear.
[15] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proc. 27th Internat. Conf. on Machine Learning, 2010.
[16] D. Hillard, S. Schroedl, E. Manavoglu, H. Raghavan, and C. Leggetter. Improving ad relevance in sponsored search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 361–370, 2010.
[17] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[18] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley-Interscience Publication, 2000.
[19] H. A. Koepke and M. Bilenko. Fast prediction of new feature utility. In ICML, 2012.
[20] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. JMLR, 10, 2009.
[21] S.-M. Li, M. Mahdian, and R. P. McAfee. Value of learning in sponsored search auctions. In WINE, 2010.
[22] W. Li, X. Wang, R. Zhang, Y. Cui, J. Mao, and R. Jin. Exploitation and exploration in a performance based contextual advertising system. In KDD, 2010.
[23] R. Luss, S. Rosset, and M. Shahar. Efficient regularized isotonic regression with application to gene–gene interaction search. Ann. Appl. Stat., 6(1), 2012.
[24] H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In AISTATS, 2011.
[25] H. B. McMahan and O. Muralidharan. On calibrated predictions for auction selection mechanisms. CoRR, abs/1211.3955, 2012.
[26] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.
[27] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In ICML '05, 2005.
[28] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521–530. ACM, 2007.
[29] M. J. Streeter and H. B. McMahan. Less regret via online conditioning. CoRR, abs/1002.4862, 2010.
[30] D. Tang, A. Agarwal, D. O'Brien, and M. Meyer. Overlapping experiment infrastructure: more, better, faster experimentation. In KDD, pages 17–26, 2010.
[31] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113–1120. ACM, 2009.
[32] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In NIPS, 2009.
[33] Z. A. Zhu, W. Chen, T. Minka, C. Zhu, and Z. Chen. A novel click model and its applications to online advertising. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 321–330. ACM, 2010.
[34] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.