Advances in Bias and Fairness in Information Retrieval 2021
Ludovico Boratto
Stefano Faralli
Mirko Marras
Giovanni Stilo (Eds.)
Editors
Ludovico Boratto
Eurecat - Centre Tecnològic de Catalunya
Barcelona, Spain

Stefano Faralli
Unitelma Sapienza University of Rome
Rome, Italy

Mirko Marras
École Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland

Giovanni Stilo
University of L’Aquila
L’Aquila, Italy
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Advances in Bias and Fairness
in Information Retrieval: Preface
Overall, this edition of the workshop proved to be a success, in line with the very
successful 2020 event, as witnessed by the number of submissions and the level of
engagement during the talks and the open discussion. We believe that this workshop
has strengthened the community working on algorithmic bias and fairness in infor-
mation retrieval, fostering ideas and solutions for the current challenges and developing
networks of researchers for future projects and initiatives. Plans to organize the third
edition of the workshop next year were formed. The organizers would like to thank the
authors and the reviewers for allowing us to shape an interesting program, and the
attendees for their participation.
Workshop Chairs
Ludovico Boratto Eurecat - Centre Tecnològic de Catalunya, Spain
Stefano Faralli University of Rome Unitelma Sapienza, Italy
Mirko Marras École Polytechnique Fédérale de Lausanne,
Switzerland
Giovanni Stilo University of L’Aquila, Italy
Program Committee
Himan Abdollahpouri Northwestern University, USA
Luca Aiello Nokia Bell Labs, UK
Mehwish Alam FIZ Karlsruhe and Karlsruhe Institute of Technology,
Germany
Marcelo Armentano National University of Central Buenos Aires, Argentina
Alejandro Bellogin Universidad Autónoma de Madrid, Spain
Bettina Berendt Katholieke Universiteit Leuven, Belgium
Glencora Borradaile Oregon State University, USA
Federica Cena University of Turin, Italy
Jeffrey Chen RMIT University, Australia
Pasquale De Meo University of Messina, Italy
Sarah Dean University of California, Berkeley, USA
Danilo Dessì FIZ Karlsruhe and Karlsruhe Institute of Technology,
Germany
Michael Ekstrand Boise State University, USA
Francesco Fabbri Universitat Pompeu Fabra, Spain
Jean Garcia-Gathright Spotify, USA
Aniko Hannak Northeastern University, USA
Nina Grgic-Hlaca Max Planck Institute for Software Systems, Germany
Genet Asefa Gesese FIZ Karlsruhe and Karlsruhe Institute of Technology,
Germany
Toshihiro Kamishima AIST, Japan
Martha Larson Radboud University and TU Delft, the Netherlands
Aonghus Lawlor University College Dublin, Ireland
Sandy Mayson University of Georgia, USA
Rishabh Mehrotra Spotify, UK
Brent Mittelstadt University of Oxford, UK
Cataldo Musto University of Bari Aldo Moro, Italy
Panagiotis Papadakos FORTH-ICS, Greece
Mykola Pechenizkiy Eindhoven University of Technology, the Netherlands
Simone Paolo Ponzetto Universität Mannheim, Germany
Towards Fairness-Aware Ranking by Defining Latent Groups

Yunhe Feng1(B), Daniel Saelid1, Ke Li1, Ruoyuan Gao2, and Chirag Shah1

1 University of Washington, Seattle, USA
{yunhe,saeliddp,kel28,chirags}@uw.edu
2 Rutgers University, New Brunswick, USA
[email protected]
1 Introduction
As one of the emerging topics in fairness-aware information systems, presenting
relevant results to users while ensuring fair exposure for content suppliers
has attracted more and more attention. Fairer information retrieval and search
systems not only provide relevant search results with higher diversity and trans-
parency, but also offer reasonable discoverability for underrepresented groups.
For example, a high-quality academic paper from a small institution, which has
very limited media outlets and resources, should be treated equally and receive its
deserved exposure in search systems, especially at the early stage of publication
when such papers are more likely to suffer from cold-start problems.
This paper investigates fair ranking within an academic search task, where
the goal is to provide fair exposure to different groups of authors while
maintaining good relevance of the ranked papers for given queries.
However, it is difficult to achieve such a goal due to the following challenges.
2 Data Description
The Semantic Scholar (S2) Open Corpus released by TREC 2020 Fairness Rank-
ing Track [3,4] consists of extracted fields of academic papers. For most papers,
the available fields include the S2 paper ID, title, abstract, authors, inbound and
outbound citations. In addition, another three auxiliary datasets are provided.
The first dataset maps paper ids to a list of corresponding author positions with
their corpus id. The second one contains paper information such as paper id,
title, year of publication, venue, number of citations, and number of key cita-
tions. The last one contains author features including author’s name, number
of citations, h-index (and a dependent feature, h-class), i10-Index, and number
of papers published. A detailed data description can be found in our previous
TREC 2020 Fairness Ranking Track report [7].
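As a rough illustration of how these pieces fit together, the following sketch joins the corpus with the three auxiliary files using pandas; the file names and column labels are our assumptions, not the ones distributed with the track data.

```python
import pandas as pd

# Hypothetical file names and column labels; the actual TREC release may differ.
papers = pd.read_json("corpus.jsonl", lines=True)        # S2 paper ID, title, abstract, authors, citations
paper_authors = pd.read_csv("paper_authors.csv")         # paper_id -> author position, corpus_id
paper_meta = pd.read_csv("paper_meta.csv")               # paper_id, title, year, venue, citation counts
author_features = pd.read_csv("author_features.csv")     # name, citations, h-index, h-class, i10-index, #papers

# Attach paper metadata and author-level features to every (paper, author) pair.
merged = (paper_authors
          .merge(paper_meta, on="paper_id", how="left")
          .merge(author_features, on="corpus_id", how="left"))
```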
3 Methodology
We first defined author groups based on general demographic characteristics
including genders and countries. Then, we utilized Okapi BM25 [8] to estimate
the relevance of papers for given search queries. Based on the group definition
and BM25 relevance score, we proposed our fairness-aware re-ranking algorithm.
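As a minimal sketch of the relevance-estimation step, the snippet below scores papers for a query with Okapi BM25; the rank_bm25 package is used here as a stand-in for the implementation actually used in the paper, and the toy documents are purely illustrative.

```python
from rank_bm25 import BM25Okapi

# Toy corpus: each "document" is the concatenated title and abstract of a paper.
documents = [
    "fairness aware ranking for academic search engines",
    "deep neural networks for image classification",
]
tokenized_docs = [doc.lower().split() for doc in documents]

bm25 = BM25Okapi(tokenized_docs)
query = "fair ranking for academic search".lower().split()
scores = bm25.get_scores(query)                       # one relevance score per document
ranking = sorted(range(len(documents)), key=lambda i: -scores[i])
```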
4. Search the author's name and affiliation on Google, scrape the first URL, then
parse for a country code.
5. Call the Google Places API with the affiliation, then return the associated country.
6. Search the author's name + ‘homepage’ on Google, scrape the first URL, then
parse for a country code.
Once all authors had been processed, we mapped each author’s affiliated
country to ‘advanced economy’ or ‘developing economy’ based on the IMF’s
October 2019 World Economic Outlook report [6]. The results are shown in
Table 2. Here, ‘unidentified’ means that no country was predicted for that author.
their similarities. We also assigned weights w for the relevance cost and the fairness cost
of each defined group. The cost function is expressed as:

\[
C(d, w, R, D, q) = w_r \, F(d, D, q) + \sum_{v \in \{g, c\}} w_v \, \mathrm{KL}\big(p(v, R + \{d\}) \,\|\, p(v, D)\big) \tag{1}
\]
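A greedy re-ranker built on this cost function could look roughly like the sketch below: at each step it appends the candidate document that minimizes a weighted sum of a relevance term and the KL divergences between the group distribution of the list built so far and that of the candidate pool, for the gender (g) and country (c) groups. The sign convention for the relevance term and the helper names are our assumptions, not the authors' exact implementation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) * np.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps)) for k in keys)

def group_dist(docs, attr):
    """Empirical distribution of a group attribute ("gender" or "country") over docs."""
    counts = {}
    for d in docs:
        counts[d[attr]] = counts.get(d[attr], 0) + 1
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

def cost(d, ranked, pool, weights, relevance):
    """Weighted cost of appending document d to the ranked list, cf. Eq. (1)."""
    c = -weights["r"] * relevance[d["id"]]            # assumption: higher relevance lowers the cost
    for attr in ("gender", "country"):
        c += weights[attr] * kl(group_dist(ranked + [d], attr), group_dist(pool, attr))
    return c

def rerank(candidates, weights, relevance, k=10):
    """Greedily build a top-k list that trades off relevance against group fairness."""
    ranked, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        best = min(remaining, key=lambda d: cost(d, ranked, candidates, weights, relevance))
        ranked.append(best)
        remaining.remove(best)
    return ranked
```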
We used random ranking and BM25 as baselines in our study to reveal the
ranking performance without considering relevance and fairness, respectively.
As its name implies, the random ranking algorithm randomly ranks all items,
ignoring relevance scores. In contrast, BM25 only cares about relevance but
fails to take fairness into account. We will compare baselines with the proposed
fairness-aware re-ranking algorithm in Subsect. 4.2.
We evaluated the utility and unfairness, which were used as official evaluation
metrics by the TREC 2019 Fairness Ranking Track [3], with different combi-
nations of wr , wg , wc in Eq. 1 from the perspective of the gender and country
groups. As shown in Fig. 1, in both gender and country groups, BM25 demon-
strates a relatively high utility score but a low fairness score, implying that
BM25 fails to take fairness into account during the ranking. Another interest-
ing finding is that the random ranking achieves lower fairness than most of our
proposed methods on the country group but the highest fairness on the gender
group. So, the fairness performance of random ranking methods is sensitive to
the definition of groups. In other words, the definition of groups is not a trivial
task as we claimed in Sect. 1. As we expected, our methods’ utility drops greatly
when BM25 scores are excluded (wr = 0). When wr is assigned a positive value,
the performance of our methods with different combinations of wr, wg, and wc is
comparable on both country and gender groups (see the cluster at the top left in
Fig. 1(a) and the cluster at the top middle in Fig. 1(b)).
Fig. 1. Utility versus unfairness with different group definitions. The utility and unfair-
ness scores were calculated based on Equation (7) and Equation (6) in the TREC 2019
Fairness Ranking Track [3] respectively.
6 Conclusion
This paper presents how to define latent groups using inferred features for fair
ranking. Specifically, we construct gender and location groups, which are gener-
alized but not contained in the raw dataset, to promote search result fairness. We
also propose a fairness-aware retrieval and re-ranking algorithm incorporating
both relevance and fairness for Semantic Scholar data. Evaluation results with
different weights of relevance, gender, and location information demonstrated
that our algorithm was flexible and explainable.
References
1. Ammar, W., et al.: Construction of the literature graph in semantic scholar. In:
NAACL (2018)
2. Demografix ApS: genderize.io (2020). https://genderize.io/
3. Biega, A.J., Diaz, F., Ekstrand, M.D., Kohlmeier, S.: Overview of the TREC 2019
fair ranking track. In: The Twenty-Eighth Text REtrieval Conference (TREC 2019)
Proceedings (2019)
4. Biega, A.J., Diaz, F., Ekstrand, M.D., Kohlmeier, S.: The TREC 2020 Fairness
Track (2020). https://fair-trec.github.io
5. Cholewiak, S.A., Ipeirotis, P., Revision, V.S.: Scholarly: Simple access to Google
Scholar Authors and Citations (2020). https://pypi.org/project/scholarly/
6. Research Department, International Monetary Fund: World economic outlook.
World Economic Outlook, International Monetary Fund (2019). https://doi.org/
10.5089/9781513508214.081
7. Feng, Y., Saelid, D., Li, K., Gao, R., Shah, C.: University of Washington at TREC
2020 fairness ranking track. arXiv preprint arXiv:2011.02066 (2020)
8. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M., Gatford, M.: Okapi
at TREC-3. In: Proceedings of the Third Text REtrieval Conference (TREC-3) (1994)
Media Bias Everywhere? A Vision
for Dealing with the Manipulation
of Public Opinion
Abstract. This paper deals with the question of how artificial intel-
ligence can be used to detect media bias in the overarching topic of
manipulation and mood-making. We show three fields of actions that
result from using machine learning to analyze media bias: the evaluation
principles of media bias, the information presentation of media bias, and
the transparency of media bias evaluation. Practical applications of our
research results arise in the professional environment for journalists and
publishers, as well as in the everyday life of citizens. First, automated
analysis could be used to analyze text in real-time and promote balanced
coverage in reporting. Second, an intuitive web browser application could
reveal existing bias in news texts in a way that citizens can understand.
Finally, in education, pupils can experience media bias and the use of
artificial intelligence in practice, fostering their media literacy.
1 Introduction
In contrast to fake news analysis and detection [15], which are mostly limited
to evaluating the content of facts, this paper is devoted to the topics of manipu-
lation and mood-making. In cooperation between computer science, the humanities,
and the social sciences,1 we develop criteria that can be used to assess media bias
in news texts. Furthermore, we investigate whether methods of artificial intelli-
gence (AI) are suitable to analyze media bias in news texts in an understandable
manner for citizens and to promote balanced coverage in reporting and media
empowerment of citizens.
Most computational approaches to assessing media bias use text mining meth-
ods, such as the lexical analysis of phrases [12]. AI methods, such as deep neural
networks, can recognize complex relationships and extract knowledge from texts.
Hence, we assume that, in the future, media bias will be recognized automatically
in news texts and a quantitative analysis of media bias in its diverse dimensions
(e.g., hidden assumptions, subjectivity, representation tendencies, and overall
bias [6]) can be carried out.
In the following, we present three fields of action with respect to the appli-
cation of AI for detecting media bias.
Current research shows that annotated data sets for the fine-grained detection
of media bias in news texts are still missing [6,9,14]. As the annotation by experts
is time-consuming and expensive, we presented a scalable annotation approach
to media bias in news texts based on crowd-sourcing [6]. The approach is applied
to news texts about the Ukraine crisis in 2014 and 2015. In this way, we created
a new media bias data set based on an annotation scheme at sentence level and
the bias dimensions hidden assumptions, subjectivity, and framing.
1 We particularly thank Thomas Fetzer from the University of Mannheim, Jessica
Heesen from the University of Tübingen, and Michael Decker from the Karlsruhe
Institute of Technology (KIT).
3 Conclusion
In this paper, we presented how media bias can be analyzed and determined
automatically using artificial intelligence methods, and the extent to which
the evaluation of methods for automatic media bias annotation depends on
various aspects of computer science, the humanities, and the social sciences.
Accordingly, three main fields of action were outlined, in which machine learning
processes contribute to media bias evaluation: the evaluation principles of media
bias, the information presentation of media bias, and the transparency of media
bias evaluation.
Possible use cases of our research findings are the assistance of journalists
and publishers for balanced coverage in reporting, as well as fostering citizens’
media literacy. For example, an intuitive web browser application could highlight
media bias in news texts to better understand distortions, and, applied in edu-
cation, could allow pupils to experience both media bias and the use of artificial
intelligence in practice to remedy weaknesses in their media literacy [11].
Acknowledgements. The project was funded as part of the digilog@bw joint research
project by the Baden-Württemberg Ministry for Science, Research and Art with funds
from the state digitization strategy digital@bw.
References
1. Bellows, M.: Exploration of Classifying Sentence Bias in News Articles with
Machine Learning Models (2018). https://digitalcommons.uri.edu/theses/1309
2. Botnevik, B., Sakariassen, E., Setty, V.: BRENDA: browser extension for fake news
detection. In: Proceedings of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR 2020, pp. 2117–2120.
ACM (2020). https://doi.org/10.1145/3397271.3401396
3. Cremisini, A., Aguilar, D., Finlayson, M.A.: A challenging dataset for bias detec-
tion: the case of the crisis in the Ukraine. In: Proceedings of the 12th International
Conference on Social Computing, Behavioral-Cultural Modeling and Prediction
and Behavior Representation in Modeling and Simulation, SBP-BRiMS 2019, pp.
173–183 (2019). https://doi.org/10.1007/978-3-030-21741-9_18
4. D’Alessio, D., Allen, M.: Media bias in presidential elections: a meta-analysis.
J. Commun. 50(4), 133–156 (2000). https://doi.org/10.1111/j.1460-2466.2000.
tb02866.x
5. European Commission: Ethics guidelines for trustworthy AI, November
2020. https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai
6. Färber, M., Burkard, V., Jatowt, A., Lim, S.: A multidimensional dataset based
on crowdsourcing for analyzing and detecting news bias. In: Proceedings of the
29th ACM International Conference on Information and Knowledge Manage-
ment, CIKM 2020, pp. 3007–3014. ACM (2020). https://doi.org/10.1145/3340531.
3412876
7. Hamborg, F., Donnay, K., Gipp, B.: Automated identification of media bias in news
articles: an interdisciplinary literature review. Int. J. Digit. Libr. 20(4), 391–415
(2019). https://doi.org/10.1007/s00799-018-0261-y
8. Holton, A., Chyi, H.I.: News and the overloaded consumer: factors influencing
information overload among news consumers. Cyberpsychol. Behav. Soc. Netw.
15(11), 619–624 (2012). https://doi.org/10.1089/cyber.2011.0610
9. Lim, S., Jatowt, A., Färber, M., Yoshikawa, M.: Annotating and analyzing biased
sentences in news articles using crowdsourcing. In: Proceedings of The 12th Lan-
guage Resources and Evaluation Conference, LREC 2020, pp. 1478–1484 (2020).
https://www.aclweb.org/anthology/2020.lrec-1.184/
10. van der Linden, S., Roozenbeek, J., Compton, J.: Inoculating against fake news
about COVID-19. Frontiers Psychol. 11, 2928 (2020). https://doi.org/10.3389/
fpsyg.2020.566790
11. Maksl, A., Ashley, S., Craft, S.: Measuring news media literacy. J. Med. Literacy
Educ. 6(3), 29–45 (2015), https://digitalcommons.uri.edu/jmle/vol6/iss3/3
12. Recasens, M., Danescu-Niculescu-Mizil, C., Jurafsky, D.: Linguistic models for ana-
lyzing and detecting biased language. In: Proceedings of the 51st Annual Meeting
of the Association for Computational Linguistics, ACL 2013, pp. 1650–1659 (2013)
13. Song, H., Jung, J., Kim, Y.: Perceived news overload and its cognitive and atti-
tudinal consequences for news usage in South Korea. J. Mass Commun. Q. 94(4),
1172–1190 (2017). https://doi.org/10.1177/1077699016679975
14. Spinde, T., et al.: Automated identification of bias inducing words in news articles
using linguistic and context-oriented features. Inf. Process. Manage. 58(3) (2021).
https://doi.org/10.1016/j.ipm.2021.102505
15. Zhou, X., Zafarani, R., Shu, K., Liu, H.: Fake news: fundamental theories, detec-
tion strategies and challenges. In: Proceedings of the Twelfth ACM International
Conference on Web Search and Data Mining, WSDM 2019, pp. 836–837. ACM
(2019). https://doi.org/10.1145/3289600.3291382
Users’ Perception of Search-Engine Biases
and Satisfaction
1 Introduction
Search engines often present results that are biased toward one subtopic, view, or
perspective due to the way they compute relevance and measure user satisfaction.
Among various types of search engine biases, one describes the case where the
search engines embed features that favor certain values over others [7,13].
Many studies have attempted to detect, measure, and mitigate the impacts of
search engine biases, with the goal of improving user satisfaction. All those
works aimed to address the issues at the source of the biases: the search engines
themselves.
In the previous study we conducted [under review], we took a different path
to inspect the problem from the aspect of end-users. We paired a real search page
and a synthesized page (more varieties in the search results, thus less biased)
and asked participants which one they prefer. The results showed no significant
differences between the ratios of selecting the two pages. However, what remained
unknown to us is why participants selected the pages they preferred. What is
the reasoning underneath their preferences? Therefore, we revisited this study
and improved our survey design to suit our goals (more details in Sect. 3). We
would like to evaluate users' perception of the biases, hoping to reveal the
reasoning behind their selection of pages. Additionally, we are interested
in studying the effects on user satisfaction when the biases are abated.
2 Background
Several prior studies have attempted to disclose and regulate biases, not just
in search engines, but also in the wider context of automated systems such
as recommender systems. For example, Collins et al. [4] confirmed the position
bias in recommender systems, which is the tendency of users to interact more with
the top-ranked items than with the lower-ranked ones, regardless of their relevance. Ovaisi
et al. [11] focused on the selection bias in learning-to-rank (LTR) systems,
which occurs because "clicked documents are reflective of what documents have
been shown to the user in the first place". They proposed a new approach to
account for the selection bias, as well as the position bias, in LTR systems. Another
bias, popularity bias, refers to the negative influence of historical user feedback
on the quality of the items returned by current recommender systems. Boratto
et al. [2] designed two metrics to quantify such popularity bias and proposed
a method to reduce the biased correlation between item relevance and item
popularity.
To reduce the biases of search engines is, in other words, to provide fairer
search results. Therefore, our problem is also closely related to fair-ranking
studies, in which the goal is to generate ranking lists with nondiscriminatory
and fair exposure of various defined groups, such as race, gender, and region.
In our case, the groups are the subtopics of the items, within which the items
share similar values and topics. Chen et al. [3] investigated resume search
engines and found gender-based unfairness arising from the use of demographic
information in the ranking algorithm. Zehlike et al. [14] defined the principles
of ranked group fairness and the fair top-K ranking problem. They proposed
the FA*IR algorithm, which maximizes utility while satisfying ranked group
fairness. In addition to the mitigation of unfairness at the group level, Biega et al.
[1] proposed new measures to capture, quantify, and mitigate unfairness at the
level of individual subjects. They proposed a new mechanism, amortized fairness,
to address the position bias in ranking problems.
Additionally, there are some studies in the machine learning domain that
investigated humans' perceptions of fairness and biases in algorithms. Srivastava
et al. [12] deployed experiments to detect the notion of fairness that best
captures human perception of fairness in different societal domains. They found
that the simplest definition, demographic parity, is aligned with most people's
understanding of fairness. Grgić-Hlača et al. [8] deployed a survey study in the
criminal risk prediction domain to analyze how people perceive and reason about
fairness in the decisions generated by algorithms.
They found out that people’s concerns about fairness are multi-dimensional and
unfairness should not be just limited to discrimination.
However, few studies in the fair-ranking domain have been devoted to probing
users' awareness of the biases and the behaviors associated with that awareness.
Few studies have analyzed how user satisfaction relates to the biases in general.
Consequently, inspired by the bias/fairness perception studies in the machine
learning community, our work aims to dive deeper in this direction.
3 Method
To generate synthesized search pages that are less biased (more diverse in
subtopics), we implemented the epsilon-0.3 algorithm with statistical parity as
the fairness controller. We first group the documents into K groups.
Documents within each group share similar topics, values, views, etc. Therefore,
each group can be treated as a subtopic group. The fairness controller aims to
provide a list of documents with equal or close presence of different subtopic
groups: given a search query, we replace three items from the top-10 list with
three lower ranked items, proportionally to the frequencies of different subtopic
groups in the top-10 list. For instance, suppose that there are two subtopic
groups (A and B). If the top-10 list has eight items from group A and two items
from group B, we would replace three out of eight items from group A at top-10
with three lower ranked documents from group B. The replacement of the docu-
ments could happen in different locations in the top-10 list. Therefore, there are
two versions of the algorithm. Version one, presented in Table 1, replaces three
documents from the top-5 of the top-10 list. Version two is exactly the same as
version one, except that the replacement happens in the bottom-5 of the top-10
list. Please refer to Fig. 1 for details.
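A sketch of the replacement step, under our own simplifying assumptions (the proportional sampling is approximated deterministically by always removing documents from the most frequent group), is shown below; epsilon_replace and its arguments are illustrative names rather than the authors' code.

```python
from collections import Counter

def epsilon_replace(ranked, groups, n_replace=3, k=10, region="top"):
    """Illustrative sketch: replace n_replace documents in the top-k with lower
    ranked ones, removing documents from the most frequent (over-represented)
    subtopic groups and pulling in documents from under-represented groups.
    `ranked` is a relevance-ordered list of doc ids; `groups` maps id -> subtopic."""
    top, rest = list(ranked[:k]), list(ranked[k:])
    freq = Counter(groups[d] for d in top)
    # Candidate removal slots are restricted to the top-5 or bottom-5 of the top-10.
    slots = list(range(0, k // 2)) if region == "top" else list(range(k // 2, k))
    # Remove slots whose documents belong to the most frequent groups first.
    slots.sort(key=lambda i: -freq[groups[top[i]]])
    removal = slots[:n_replace]
    # Replacements: the highest ranked lower documents from the least represented groups.
    replacements = sorted(rest, key=lambda d: freq[groups[d]])[:n_replace]
    removed_docs = [top[slot] for slot in removal]
    for slot, new_doc in zip(removal, replacements):
        top[slot] = new_doc
    return top + removed_docs + [d for d in rest if d not in top]

# Example: 8 documents of subtopic A and 2 of subtopic B in the top-10.
ranked = list(range(14))
groups = {d: ("A" if d not in (3, 7) and d < 10 else "B") for d in ranked}
print(epsilon_replace(ranked, groups, region="top"))
```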
We chose the epsilon-0.3 algorithm not only to be consistent with our pre-
vious study, but also based on the fair-ranking work by Gao and Shah [6]. They
tested multiple fairness ranking strategies to probe the relationships among fair-
ness, diversity, novelty, and relevance, and found that epsilon-greedy algorithms
could bring fairer representations of the search results without a cost in
relevance. In the previous study, we experimented with variants of
the algorithm—epsilon-0.1 and epsilon-0.2, and found out that the replacement
Users’ Perception of Search-Engine Biases and Satisfaction 17
ratios (0.1 and 0.2) were too low. Therefore, we decided to work with the epsilon-
0.3 algorithm. Additionally, we worked with top-10 list because popular search
engines, such as Google and Bing, usually return 10 items per page as the default
setting (though adjustable). Therefore, we decided to stick with the default num-
ber of 10 documents.
Fig. 1. In the left figure, we replace three documents (marked in orange) from top-5 in
the top-10 list with lower ranked results. In the right figure, we replace three documents
from bottom-5 in the top-10 list. (Color figure online)
– H2: The location where the differences are present matters. When
differences are at the bottom of the search list, people do not care:
intuitively, users might treat the top-ranked results more seriously than the
lower-ranked ones. Even within the top-10 list, the top-5 might attract different
attention than the bottom-5. Therefore, in our survey design, the replacement
happens in both locations (top-5 or bottom-5 of the top-10 list).
– H3: People prefer results with high relevance as opposed to high
diversity: this hypothesis addresses the second question. Introducing
lower ranked search items means adding more diversity to the results, thus
weakening the potential biases of search engines that consistently favor some
values over others. Unavoidably, however, adding lower ranked results
would hurt the relevance of the search results, potentially lowering user
satisfaction. Therefore, we want to see whether users prefer higher relevance
(more biased) or higher diversity (less biased).
Experimental Design
The experiment starts with a consent form to be signed, which is followed by sev-
eral demographic questions (e.g. age group, gender and education background).
Then the participants are provided with instructions on how to complete the
survey through a quick demo (as shown in Fig. 2). Once they are familiar with
the details, they may proceed to answer the questions. The survey has 20 rounds
in total. Each round consists of a pair of pages for a specific search query: a
real Bing search page and a page synthesized with the aforementioned algorithm.
Participants have 30 s to read the query, compare the items on the two pages, and
make a selection. Out of the 20 rounds, we randomly select 10 rounds to perform
the top-5 in top-10 replacement, while the remaining rounds receive the bottom-5
in top-10 replacement.
Fig. 2. The interface of the survey. The query shows in the search box at the top. The
paired pages consist of a real Bing search page and a synthesized one. Participants can
select whichever one they prefer and submit.
Based on our experience from the previous study, we thought 20 rounds would
provide sufficient data for statistical analysis while not overly fatiguing the
participants. Additionally, we ran some trial runs of the survey and found that
30 s are enough for participants to compare the two pages and make a selection.
After each round, there is a reflection question (as shown in Fig. 3) on the
reasons for the participant's choice:
– “I did not notice any differences” addresses the H1. The differences might
not be palpable enough for the participants.
– “I noticed some differences but did not care. So I randomly picked up one.”
also addresses H1. Participants might detect the discrepancies, but they do
not make a difference in their satisfaction.
– “I noticed some differences and picked the one that had more results on the
same topic.” & “I noticed some differences and picked the one that had more
variety in the results”. Together, they address H3. More results on the same
topic means that the documents are more consistent with each other. More
variety in the results reflects the introduced lower ranked results.
4 Results
We launched the survey on MTurk (Amazon Mechanical Turk) by creating a
total of 108 assignments. All the assignments were completed within 3 h of
the launch time, with an average completion time of 16.4 min. With 137 par-
ticipants completing the survey, we recorded 2,408 responses. After removing
invalid entries, such as users who did not complete the survey or responses with
empty selections, 111 participants with 2,134 responses were used in the analysis.
The basic demographic information is presented in Table 2.
Selection Ratios
Starting with the overall selection ratios of the two pages, 53.3% (N = 1137)
of the responses preferred the real search pages, while 46.7% (N = 997) selected
the synthesized versions. We ran a Chi-square goodness-of-fit test, where the null
hypothesis states that the expected frequencies of selecting the two choices are the
same. The result turned out to be significant at the 0.01 significance level (p =
0.002). In the bottom-5 replacement group (N = 1066), half of the responses chose
the real pages and half chose the synthesized ones; there is no difference in the
selection ratios. However, we observed significantly different results in the top-5
replacement group (N = 1068), where 56.6% (N = 604) of the responses preferred the
real pages while 43.4% (N = 464) liked the synthesized pages better. The
goodness-of-fit test yielded a significant result (p < 1e-4).
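The reported goodness-of-fit tests can be reproduced, for instance, with SciPy; the snippet below plugs in the counts quoted above and assumes equal expected frequencies, which is SciPy's default.

```python
from scipy.stats import chisquare

stat, p = chisquare([1137, 997])          # overall: real vs. synthesized selections
print(f"chi2 = {stat:.2f}, p = {p:.4f}")  # p is approximately 0.002, as reported

stat_top, p_top = chisquare([604, 464])   # top-5 replacement group; p < 1e-4, as reported
```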
Based on the separate tests in each replacement group, it seems that the
location where the diversity is introduced has an impact on users' prefer-
ences. To further confirm this conjecture, we ran a Chi-square test of independence
on two categorical variables: users' preference (real page or synthesized page)
and replacement group (top-5 or bottom-5). The result is significant (p =
0.003), demonstrating that the location is associated with participants' prefer-
ences.
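A minimal sketch of this independence test with scipy.stats.chi2_contingency is given below; the bottom-5 row uses the half/half split (N = 1066) reported above.

```python
from scipy.stats import chi2_contingency

#                  real  synthesized
table = [[604, 464],   # top-5 replacement group (N = 1068)
         [533, 533]]   # bottom-5 replacement group (N = 1066)

stat, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.4f}")  # p is approximately 0.003, as reported
```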
Reasoning Analysis
The default four reasons, corresponding to the four choices in order, are “No
Diff”, “Diff Random”, “Diff Same Topic”, and “Diff Variety”. We probed the
reasons for three groups separately – the group that selected the real pages
(called “original” group), the group that selected the synthesized pages with the
top-5 replacement (“Top-5” group), and the group that selected the synthesized
pages with the bottom-5 replacement (“Bottom-5” group). We only present the
analysis of the four default answers here because users' own explanations are
diverse and sparse; they will be analyzed in the discussion section.
The distributions of default answers for each group are exhibited in Table
3. We noticed that within each group, “Diff Same Topic” dominated all other
answers. Within each group, we ran a Chi-square goodness-of-fit test, in which the
null hypothesis states that the expected frequencies of the default choices are
the same. All three p-values are extremely small, indicating that the observed
frequencies are significantly different from the expected ones.
Table 3. Distributions of four default choices in each selection group. p-values are
from Chi-square goodness of fit test within each group.
Groups No diff Diff random Diff same topic Diff variety p-value
Original 117 222 461 310 2.2e−49
Top-5 49 87 186 132 7.5e−20
Bottom-5 55 104 218 147 1.5e−23
5 Discussion
As defined in Sect. 3, the first question we would like to answer is "Does intro-
ducing more variety into the search engine results (i.e., making them less biased) hinder
users' satisfaction?" From the analysis above, we showed that the proportion of
participants preferring the real pages is significantly higher than that of the par-
ticipants who selected the synthesized pages. Bringing lower ranked results
into the top ranked list introduces more variety and more values into the search results,
thus weakening the biases of search engines in favoring certain values. However, it
potentially damages the relevance and consistency of the results with respect to the queries.
6 Conclusion
In our study, we designed a survey to assess users’ perceptions of search engine
biases, with the goal of diagnosing the reasoning underneath their preferences of
the real search pages or the synthesized pages. We also investigated the effects
of bias mitigation on user satisfaction. We noticed that, overall, participants
prefer the real search pages over the synthesized ones by a significantly higher
ratio. This indicates that adding more variety makes the results less biased but
less relevant and consistent with the queries, which hurts user satisfaction. In
addition, when the diversity in the synthesized pages is present in the top-5,
participants tend to prefer the real pages. However, when it is in the bottom-5,
there is no significant difference between the ratios of selecting the two pages. This
confirms our hypothesis that the location where the bias mitigation happens is
critical.
In terms of the future work, two directions could be considered. First, the sur-
vey design could be improved. The reflection question in the first round might
give additional information of what will be asked in later rounds and could
potentially impact users’ selections. In addition, the response options in the
reflection questions are shown in a fixed order, which might generate order bias
[9]. Redesigning the format of reflection questions could potentially improve the
study results. Second, if more variables of interest could be collected (where
applicable) in addition to the demographic features, mixed-effects regression models
could be fitted to account for repeated measures from the same individuals, and
the relationships among the various features and preferences could be probed
simultaneously.
References
1. Biega, A.J., Gummadi, K., Weikum, G.: Equity of attention: amortizing individ-
ual fairness in rankings. In: The 41st International ACM SIGIR Conference on
Research & Development in Information Retrieval (2018)
2. Boratto, L., Fenu, G., Marras, M.: Connecting user and item perspectives in pop-
ularity debiasing for collaborative recommendation. Inf. Process. Manage. 58(1),
102387 (2021)
3. Chen, L., Ma, R., Hannák, A., Wilson, C.: Investigating the impact of gender
on rank in resume search engines. In: Proceedings of the 2018 CHI Conference
on Human Factors in Computing Systems, CHI 2018, pp. 1–14. Association for
Computing Machinery, New York (2018)
4. Collins, A., Tkaczyk, D., Aizawa, A., Beel, J.: Position bias in recommender sys-
tems for digital libraries. In: Chowdhury, G., McLeod, J., Gillet, V., Willett, P.
(eds.) iConference 2018. LNCS, vol. 10766, pp. 335–344. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-78105-1_37
5. Couvering, E.J.V.: Search engine bias - the structuration of traffic on the world-
wide web. Ph.D. dissertation, London School of Economics and Political Science
(2009)
6. Gao, R., Shah, C.: Toward creating a fairer ranking in search engine results. Inf.
Process. Manage. 57(1), 102138 (2020)
24 B. Han et al.
7. Goldman, E.: Search engine bias and the demise of search engine utopianism. Yale
J. Law Technol. 8, 188 (2005)
8. Grgic-Hlaca, N., Redmiles, E.M., Gummadi, K.P., Weller, A.: Human perceptions
of fairness in algorithmic decision making: a case study of criminal risk prediction.
In: Proceedings of the 2018 World Wide Web Conference, WWW 2018, pp. 903–912.
International World Wide Web Conferences Steering Committee, Republic and
Canton of Geneva (2018)
9. Krosnick, J.A., Alwin, D.F.: An evaluation of a cognitive theory of response-order
effects in survey measurement. Public Opin. Q. 51(2), 201–219 (1987)
10. Kulshrestha, J., et al.: Search bias quantification: investigating political bias in
social media and web search. Inf. Retrieval J. 22, 188–227 (2019). https://doi.org/
10.1007/s10791-018-9341-2
11. Ovaisi, Z., Ahsan, R., Zhang, Y., Vasilaky, K., Zheleva, E.: Correcting for selection
bias in learning-to-rank systems. In: Proceedings of The Web Conference 2020,
WWW 2020, pp. 1863–1873. Association for Computing Machinery, New York
(2020)
12. Srivastava, M., Heidari, H., Krause, A.: Mathematical notions vs. human perception
of fairness: a descriptive approach to fairness for machine learning. In: Proceedings
of the 25th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, KDD 2019, pp. 2459–2468. Association for Computing Machinery,
New York (2019)
13. Tavani, H.: Search engines and ethics. In: Zalta, E.N. (ed.) The Stanford Encyclo-
pedia of Philosophy. Fall 2020 edn. Metaphysics Research Lab, Stanford University
(2020)
14. Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., Baeza-Yates, R.:
FA*IR: a fair top-k ranking algorithm. In: Proceedings of the 2017 ACM on Con-
ference on Information and Knowledge Management, CIKM 2017, pp. 1569–1578.
Association for Computing Machinery, New York (2017)
Preliminary Experiments to Examine
the Stability of Bias-Aware Techniques
1 Introduction
One of the goals of the current research concerning fairness-aware machine learn-
ing is to develop techniques for learning a predictor that is as accurate as possible
while satisfying a given fairness constraint. The development of fairness-aware
techniques has focused on improving the trade-off between accuracy
and fairness; however, accuracy is measured over a dataset whose annotation
is potentially biased, because fair labels are not accessible. It is thus unclear whether
the accuracy over an unbiased dataset is appropriately measured. We therefore pro-
pose that another property is desirable for a fairness-aware predictor: stability.
2 Datasets
After describing the generative model of preference data below, we provide the
details of the procedure used to collect the datasets. We then confirm that the
collected data were influenced by the subjects’ cognitive biases.
Second, after dividing the subjects into two groups, we showed pairs of sushi
types to each subject by using the baseline interface, and we then merged the
datasets obtained from the two groups. In one group, the sushi that was ranked
higher in the popularity order was always placed in the left pane, and in the
other group, it was always placed in the right pane. This can still be considered
as a randomized controlled trial, because the assignment of items is random and
the assignment of the subjects to groups is also random. However, it might be
incomplete because some implicit factors are not fully randomized. For example,
the subjects could be influenced by a previous choice, the so-called memory
effect, because (a) they chose their preferred types of sushi sequentially, and
(b) the influence of this procedure could be stronger than that of the random
procedure. We call this procedure ‘fixed’ because the placements of the types of
sushi were fixed for each subject.
Third, we again divided the subjects into two groups, but used the band-
wagon interface. While the sushi-type that was ranked higher in the popularity
order was emphasized in one group, the type that was ranked lower was empha-
sized in the other group. In the case of a bandwagon effect, S = 1 indicates that
the item represented by X1 is emphasized. As in the case of the fixed procedure,
this procedure can be considered a randomized controlled trial, but it could be
incomplete. We call this the ‘bandwagon’ procedure.
We collected the data by using a crowdsourcing service in Japan. The data
were collected during the approximately 3-week period from 31 Jan. 2020
to 22 Feb. 2020, and we paid each subject JPY 50 for their participation. The
number of queries per subject was 50. Two of the queries were concentration
tests that requested the subject to “Select right”, and we used the data from the
subjects who passed the tests. The sizes of the datasets are shown in Table 1.
In this section, we evaluate the total effect of the cognitive biases, which can be
measured by Eq. (1). Note that in the context of fairness-aware machine learning,
the quantity of the total effect corresponds to the risk difference [1,9]. The total
effect under each procedure is described in Table 2. A positive value indicates
that the item in the left pane was selected (in the random and fixed procedures)
or that the emphasized item was selected (in the bandwagon procedure). We can
confirm that the choice of items was influenced by a cognitive bias. In addition,
the bandwagon effect was stronger than the positional bias.
These results are consistent with the reported observation that the bandwagon
effect was stronger than the other effects [4].
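Since the excerpt does not reproduce Eq. (1), the sketch below only illustrates the standard risk-difference reading of the total effect, Pr[Y = 1 | S = 1] - Pr[Y = 1 | S = 0], on hypothetical data.

```python
import numpy as np

def risk_difference(y, s):
    """Total effect read as a risk difference: Pr[Y=1 | S=1] - Pr[Y=1 | S=0]."""
    y, s = np.asarray(y), np.asarray(s)
    return y[s == 1].mean() - y[s == 0].mean()

# Hypothetical toy data: y = 1 if the item was chosen, s = 1 if it was shown in the
# conditioned position (left pane / emphasized).  A positive value means the
# conditioned item was favored.
print(risk_difference(y=[1, 1, 0, 1, 0, 0], s=[1, 1, 1, 0, 0, 0]))  # 0.333...
```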
We further evaluated the subjects' cognitive biases by examining the effect
of each item. For each sushi i, the target variable Yi becomes 1 if item i is
preferred to the other items. The degree of cognitive bias is then computed per item.
These effects are illustrated in Fig. 3. No clear patterns were observed in terms
of a positional effect. In the case of a bandwagon effect, highly popular and
highly unpopular items were more strongly influenced by a cognitive bias. This
observation also indicates that the influence of a bandwagon effect is stronger
than that of a positional effect.
We here discuss the relationship between causal inference and a bias-aware tech-
nique, and we then provide the preliminary results of removing the cognitive
biases.
means the sum of effects through all of the causal paths from an intervention
variable to a target variable.
We interpret a model for bias-aware classification from the viewpoint of
causal inference. As depicted in Fig. 1(b), we treat a sensitive feature, S, as
a confounder, and non-sensitive features, X, become an intervention. We try to
remove the effect of S on Y while keeping the effect of X on Y . For this purpose,
in a case of causal inference, a dataset is first stratified according to the value
of S. The statistics are then computed for each stratum, and these statistics are
aggregated by the summation with weights that are proportional to the sizes of
the strata. This operation is the summation weighted by Pr[S]:

\[
\Pr[Y \mid X] = \sum_{s} \Pr[S = s] \, \Pr[Y \mid S = s, X] \tag{3}
\]

By simply computing this for all the values of X, we can remove a cognitive bias
contained in S.
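A minimal sketch of this stratified (back-door adjusted) estimate with pandas is shown below; the column names are our assumptions.

```python
import pandas as pd

def stratified_prob(df, y="Y", s="S", x_cols=("X1",)):
    """Eq. (3): for each value of X, average the per-stratum estimates
    Pr[Y=1 | S=s, X] weighted by the marginal Pr[S=s]."""
    p_s = df[s].value_counts(normalize=True)                # Pr[S = s]
    per_stratum = df.groupby(list(x_cols) + [s])[y].mean()  # Pr[Y = 1 | S = s, X]
    return (per_stratum
            .unstack(s)                                     # one column per value of S
            .mul(p_s, axis=1)                               # weight by Pr[S = s]
            .sum(axis=1, skipna=False))                     # sum over s, indexed by X

# Toy example: X1 is the item shown, S the bias condition, Y whether it was chosen.
df = pd.DataFrame({"X1": ["toro", "toro", "kappa", "kappa"],
                   "S":  [1, 0, 1, 0],
                   "Y":  [1, 1, 0, 1]})
print(stratified_prob(df))
```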
We note that this process is related to the post-processing type of bias-aware
approach [5,6]. In this approach, each sub-classifier is learned from
a dataset consisting of data whose sensitive values are equivalent. The decision
thresholds or weights of sub-classifiers are then modified so that the difference
between Pr[Y |X, S = 1] and Pr[Y |X, S = 0] is minimized. According to Eq. (3),
the stratification forces Pr[Y |X, S = 1] and Pr[Y |X, S = 0] to be Pr[Y |X], and
the difference becomes 0. The stratification can thus be regarded as a type of
fairness-aware technique.
In addition, the stratification operation is related to the generative model for
bias-aware classification. A joint distribution satisfying the condition of statisti-
cal parity, S ⊥⊥ Y, can be formalized as
As described above, the obtained models rather diverge, contrary to our
hypothesis. As next steps, we discuss two approaches to modifying our hypothesis.
One approach is based on the possibility that the stratification fails to remove
the effect of cognitive biases. The other approach is based on the possibility that
there may be confounders other than the cognitive bias that we control.
First, cognitive biases may not be completely removed by the stratification.
We generated two data points from a single comparison by the subjects. For example,
when comparing items A and B, we generate a datum whose X1 = A and a datum
whose X1 = B. We adopted this procedure to obtain data both for S = 0 and
S = 1, but the scheme might be fatal for satisfying the condition for a randomized
controlled trial. We designed the fixed and bandwagon procedures to depend
on the preference order within each subject group, and this scheme might create
a causal path from X to S. Therefore, S may behave as a mediator as well as a
confounder. We need to develop another causal structure model by using these
factors.
Second, confounders other than the controlled cognitive bias might exist.
When actually unpopular items are displayed as popular, the information would
not match the subjects’ intuition, and the mismatch may cause another type
of bias. To check this hypothesis, we plan to collect datasets designed to be
influenced by other types of cognitive biases, e.g., a memory effect.
Fig. 4. The stratified probabilities that each type of sushi was preferred to the other
types.
NOTE: The x-axes indicate the types of sushi, and the y-axes show the average of the
stratified effect of X on Y , Eq. (3), over the data whose X1 is equivalent.
4 Conclusion
We have discussed the stability of bias-aware techniques. To investigate this
property, we collected preference data that were expected to be influenced by
cognitive biases. After discussing the relationship between causal inference and
bias-aware techniques, we described a technique of stratification to remove the
biases. Our experimental results indicate the instability of the stratification,
contrary to our expectation. As next steps, we plan to modify our bias-removal
technique and to collect datasets that may be influenced by other types of cog-
nitive bias.
References
1. Calders, T., Verwer, S.: Three naive Bayes approaches for discrimination-free clas-
sification. Data Min. Knowl. Disc. 21, 277–292 (2010). https://doi.org/10.1007/
s10618-010-0190-x
Visual Representation of AI in Web Search

1 Introduction
Web search engines are important information intermediaries that help users
navigate through web content. By filtering and ranking information in response
to user queries, search engines determine what users learn about specific topics
or entities [32], in turn influencing individual and collective perceptions of social
reality [22]. However, search engine outputs can be biased, that is, systematically
skewed towards particular individuals or groups [21], which may lead to a
distorted perception of the search subject and potentially result in negative
societal effects such as racial or gender discrimination.
A growing number of studies discusses how search engines perpetuate biases
related to gender and race, in particular in image search results [30,38,40].
Because of their affective and interpretative potential [12], images can be effec-
tive means of educating the public about complex social phenomena, such as
gender or race, but also of reiterating stereotypes [30]. With image search being
used for a broad range of purposes, varying from educators preparing teach-
ing materials [36] to media professionals producing new content [41], its biased
outputs can reinforce skewed representation, in particular of already vulnerable
groups, and amplify discrimination [38].
Currently, research on race and gender bias in image search focuses on visual
representation of a few subjects, such as professional occupations [30] or emo-
tions [41]. However, there is a growing recognition that the representation of other
aspects of contemporary societies can also be skewed in terms of gender or race. One
such aspect is technological innovation, the representation of which in the West
historically tended to be decontextualized and often associated with masculinity
[11] and whiteness [29]. Such biases can further aggravate existing inequalities by
influencing hiring decisions (e.g., by stereotyping a certain field as racially homo-
geneous) and positioning the technologies, predominantly portrayed as White,
above marginalised non-white people [14]. Biases found to be present in web
search outputs (e.g., [30,38]) have the potential to influence public opinion and
perceptions of the social reality [19,32]. This is further aggravated by the fact
that users tend to trust the output of search engines [44].
Besides expanding the current focus of search bias research to new areas, it
is also important to consider the consequences of recent studies on search engine
auditing for evaluating the robustness of bias measurements. First, while the
effect of personalization on the variability of search outputs is shown to be minor
[49,50], the influence of randomization (e.g., result reshuffling for maximizing
user engagement) can be more significant [35] and is yet to be accounted for
in the context of bias research. Second, despite substantial differences in con-
tent selection across search engines [28,35,51], the majority of existing research
focuses on individual search engines (e.g., Google [30] or Bing [41]), whereas
possible bias variation between different engines (including the ones prevalent in
non-Western context, such as Yandex or Baidu) remains understudied.
In this paper, we aim to make two contributions. First, we introduce a mixed-
method approach for detecting race and gender bias in image search outputs that
takes into consideration potential effects of randomization and personalization.
Second, we apply this method for conducting a cross-engine comparison of bias
in the visual representation of artificial intelligence (AI). Our choice of a case
study is motivated by the common criticism that the representation of AI is skewed
in terms of race and gender both in popular culture and industry [4,46], as well as
by the recent claims about Google amplifying these biases via its image search results [14,46].
4 Methodology
To collect data, we utilized a set of virtual agents, that is, software simulat-
ing user browsing behavior (e.g., scrolling web pages and entering queries) and
recording its outputs. The benefit of this approach, which extends the algorithmic
auditing methodology introduced by Haim et al. [24], is that it allows controlling
for personalization [25] and randomization [35], two factors influencing the outputs of web
search. In contrast to human actors, virtual agents can be easily synchronized
(i.e., to isolate the effect of time at which the search actions are conducted) and
deployed in a controlled environment (e.g., a network of virtual machines using
the same IP range, the same type of operating system (OS) and the same brows-
ing software) to limit the effects of personalization that might lead to skewed
outputs.
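The excerpt does not name the browser-automation tooling, so the following is only a minimal sketch of what a single agent could look like with Selenium; the query, engine URL, and waiting times are placeholders.

```python
import time
from selenium import webdriver

QUERY = "artificial intelligence"
ENGINE_URL = "https://www.bing.com/images/search?q="   # one engine; repeat per engine

driver = webdriver.Firefox()                            # identical browser/OS across agents
try:
    driver.get(ENGINE_URL + QUERY.replace(" ", "+"))
    time.sleep(5)                                       # let the results load
    driver.execute_script("window.scrollBy(0, 1000);")  # simulate user scrolling
    with open("results_bing.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)                     # record outputs for later manual coding
finally:
    driver.quit()
```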
In addition to controlling for personalization, agent-based auditing allows
addressing randomization of web search that is caused by search engines testing
different ways of ranking results to identify their optimal ordering for a query
(e.g., the so-called “Google Dance” [10]). Such randomization leads to a sit-
uation where identical queries entered under the same conditions can result in
different sets of outputs (or different rankings of them), thus making the observations
non-robust. One way of addressing this issue is to deploy multiple virtual agents
Table 1. The number of unique images per each result subgroup per engine
Sex: For anthropomorphized images, we determined the sex of the entity por-
trayed to determine whether there is a gendered skew. We used sex as a proxy
for gendered representation because of the complexity of the notion of gender.
Unlike sex, which is a binary concept, gender encompasses a broad variety of
social and cultural identities, which makes it hard to detect based on visual cues.
Hence, we opted for a more robust option that is still sufficient for evaluat-
ing gender-related aspects of AI representation. The possible options included
1) male, 2) female, 3) mixed (when both male and female entities were present),
4) abstract (when an entity was shown as sexless), and 5) unknown (when it was
not possible to reliably detect sex).
5 Findings
Unlike Kay et al. [30], who had data on gender distribution for occupations to
compare their representation in image search outputs, we do not have a clear
baseline for AI representation. Hence, we follow Otterbacher et al. [40] and treat
the unequal retrievability - that is the accessibility of outputs with specific char-
acteristics [48] - as an indicator of bias in search outputs. By systematically
prioritizing images with specific features (e.g., the ones showing males and not
females; [40]), the system creates a skewed perception of the phenomenon rep-
resented via its outputs.
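A small sketch of this retrievability-style indicator is given below: it computes, per engine, the share of top-k results whose manually coded feature takes a given value. The column names and toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical manually coded results: one row per retrieved image.
coded = pd.DataFrame({
    "engine": ["google", "google", "bing", "bing"],
    "rank":   [1, 2, 1, 2],
    "sex":    ["male", "abstract", "male", "female"],
})

def share_in_top_k(df, feature, value, k=10):
    """Share of the top-k results per engine whose coded feature equals `value`."""
    top = df[df["rank"] <= k]
    return top.groupby("engine")[feature].apply(lambda col: (col == value).mean())

print(share_in_top_k(coded, feature="sex", value="male"))
```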
Fig. 1. Ratio of anthropomorphic representations of AI per result set for each engine
(1–10 refers to results 1 to 10; 11–20 to results 11 to 20; 21–30 to results 21 to 30).
were almost absent, both Baidu and Yandex prioritized a more masculine repre-
sentation of AI. This effect was achieved by highlighting images of both male
developers and users as well as human-like robots with masculine facial features.
One possible explanation of non-Western engines promoting a masculine por-
trayal of AI can be its different representation in popular culture. At least in
the case of Russia, where Yandex originates from, a number of prominent cul-
tural products present AI as a masculine entity (e.g., The Adventures of Elec-
tronic, Far Rainbow, Guest from the Future), whereas feminine representations
are rather few.
While the aforementioned observation can be treated as evidence that search
engine outputs depend on popular culture representations of AI, we did not
observe any overly sexualized images of female AI despite its intense sexualiza-
tion in Western popular culture. This finding indicates that cultural embedded-
ness of bias does not necessarily translate into its visibility in search outputs and
can be attributed to more active countering of gender bias in the recent years
[38].
6 Discussion
Our observations indicate that visual representation of AI on the world’s most
popular search engines is skewed in some racial and, to a lesser degree, gen-
der aspects. While it is not sufficient to claim that search mechanisms used to
retrieve information about AI are biased in terms of race or gender, our findings sup-
port earlier research [40] that found search engines reiterating social biases. In
the case of AI, this results in a predominantly white portrayal of the technology
and the omission of non-white AI designs as well as non-white developers and
users. By offering a rather skewed selection of visual information, search engines
misrepresent important developments in the field of AI and erase the presence
of non-white groups, which can be viewed as a form of discrimination.
Similar to other forms of web bias [7], the white-centric representation of AI
on search engines can be explained by multiple factors. Because of its prevalence
in Western Anglophone popular culture and industry, representation of AI as
White commonly appears on “authoritative” websites, such as the ones related
to government and research institutions and mainstream media. Outputs from
these websites are prioritized both because they are treated as more reliable
sources of information [23] and because they often have a large number of
backlinks, a feature which is important for website ranking on the majority
of search engines [2] (Yandex, however, is a notable exception with its larger
emphasis not on backlinking, but on user engagement [1]).
An additional factor that contributes to racial bias in AI representation is the
fact that image search outputs are often based on the text accompanying the image,
but not on the image features [3,15]. Under these circumstances, the search
algorithm is not necessarily able to differentiate between white and non-white
representations of AI. Instead, it just retrieves images which are accompanied by
certain text from the websites, the ranking of which is determined using the same
criteria as text search results. Considering that racial bias in AI representation remains mainstream [14], in contrast to gender bias (e.g., it is harder to imagine academic or government websites hosting images of sexualized AI), this results in the reiteration of white-centric representations, in particular by Western search engines.
The reliance on textual cues for generating image search outputs and engine-
specific ranking signals (e.g., number of backlinks and source type) can also
explain differences in AI representation between Western and non-Western
search engines. Unlike Western engines, where the selection of ranking signals is
similar and results in reiteration of the same set of images stressing the white-
ness of AI, the focus on the specific regions (i.e., China for Baidu and Russia
for Yandex) together with substantial differences in ranking mechanisms (e.g.,
prioritization of backlinks coming from China for Baidu [2] and the reliance on
different ranking signals for Yandex [1]) leads to the inclusion of more non-white
representations of technology. However, if this explanation is valid, then in order
to be able to deal with racial bias in a consistent manner, search engines would
need either to more actively engage with actual image features (and not just text
accompanying images) or expand the selection of websites prioritized for retriev-
ing image outputs beyond currently prioritized mainstream Western websites,
where white-centered AI representations are prevalent.
Overall, racial bias in the way web search mechanisms treat visual repre-
sentation of AI can hardly be viewed as something that search engines invent
on their own. However, they do reinforce the bias by creating a vicious cycle
in which images of AI as “technology of whiteness” [29] appear on the top of
search results and are more likely to be utilized by users, including educators or
media practitioners. Yet this reinforcement loop can be broken, as shown by the substantially less biased representation of AI in terms of gender: despite the strong tendency for its feminization and subsequent sexualization in popular culture, we found relatively few gendered images of AI in the top results, and none of them was sexualized.
Together with the earlier cases of addressing skewed web search outputs
that were identified by the researchers (e.g., racialized gender bias [38]), our
observations support the argument of Otterbacher [39] about the importance
of designing new approaches for detecting bias in IR systems. In order to be
addressed, bias first has to be reliably identified, but so far there are only a few IR studies that investigate the problem in a systematic way. By applying a new approach to examine bias in the context of AI, our paper highlights the importance of conducting further research to achieve a better understanding of how significant racial and gender biases in search outputs are in relation to different aspects of contemporary societies, including (but not limited to) other
forms of innovation.
It is also important to note several limitations of the research we conducted.
First, we used very simple binary classification schemata for both the race and gender features of AI portrayal. Second, our observations rely on a snapshot experiment conducted at a certain point in time, so they do not account for the possible fluidity of image search results. Third, the experimental setup (i.e.,
the choice of the query and the infrastructure location) can also influence the
observations produced.
References
1. SEO Tips on How to Optimize for Yandex. https://www.link-assistant.com/news/
yandex-seo.html
2. Search Engine Differences: Google, Bing, Yandex & More. https://www.deepcrawl.
com/knowledge/technical-seo-library/search-engine-differences/
3. Yandex - Technologies - Computer vision. How it works. https://yandex.com/
company/technologies/vision/
4. Adams, R.: Helen A’Loy and other tales of female automata: a gendered reading of
the narratives of hopes and fears of intelligent machines and artificial intelligence.
AI Soc. 35(3), 569–579 (2020)
5. Adams, R., Loideáin, N.N.: Addressing indirect discrimination and gender stereo-
types in AI virtual personal assistants: the role of international human rights law.
Cambridge Int. Law J. 8(2), 241–257 (2019)
6. Araújo, C.S., Meira Jr., W., Almeida, V.: Identifying stereotypes in the online
perception of physical attractiveness. In: Spiro, E., Ahn, Y.-Y. (eds.) SocInfo 2016.
LNCS, vol. 10046, pp. 419–437. Springer, Cham (2016). https://doi.org/10.1007/
978-3-319-47880-7_26
7. Baeza-Yates, R.: Bias on the web. Commun. ACM 61(6), 54–61 (2018)
8. Baker, P., Potts, A.: ‘Why do white people have thin lips?’ Google and the perpet-
uation of stereotypes via auto-complete search forms. Crit. Discourse Stud. 10(2),
187–204 (2013)
9. Bartneck, C., et al.: Robots and racism. In: Proceedings of the 2018 ACM/IEEE
International Conference on Human-Robot Interaction, pp. 196–204. ACM (2018)
10. Battelle, J.: The Search: How Google and Its Rivals Rewrote the Rules of Business
and Transformed Our Culture. Hachette UK (2011)
11. Blake, M.K., Hanson, S.: Rethinking innovation: context and gender. Environ.
Planning A: Econ. Space 37(4), 681–701 (2005)
12. Bleiker, R.: Visual Global Politics. Routledge (2018)
13. Bonart, M., Samokhina, A., Heisenberg, G., Schaer, P.: An investigation of biases
in web search engine query suggestions. Online Inf. Rev. 44(2), 365–381 (2019)
14. Cave, S., Dihal, K.: The whiteness of AI. Philos. Technol. 33(4), 685–703 (2020)
15. Cui, J., Wen, F., Tang, X.: Real time google and live image search re-ranking.
In: Proceedings of the 16th ACM International Conference on Multimedia, pp.
729–732 (2008)
16. DiSalvo, C., Gemperle, F.: From seduction to fulfillment: the use of anthropo-
morphic form in design. In: Proceedings of the 2003 International Conference on
Designing Pleasurable Products and Interfaces, pp. 67–72. ACM (2003)
17. Doise, W., Mugny, G., James, A.S., Emler, N., Mackie, D.: The social development
of the intellect, vol. 10. Elsevier (2013)
18. Eddo-Lodge, R.: Why I’m No Longer Talking to White People About Race.
Bloomsbury Publishing (2020)
19. Epstein, R., Robertson, R.E.: The search engine manipulation effect (SEME) and
its possible impact on the outcomes of elections. Proc. Natl. Acad. Sci. 112(33),
E4512–E4521 (2015)
20. Eubanks, V.: Automating Inequality: How High-Tech Tools Profile, Police, and
Punish the Poor. St. Martin’s Press (2018)
21. Friedman, B., Nissenbaum, H.: Bias in computer systems. ACM Trans. Inf. Syst.
14(3), 330–347 (1996)
22. Gillespie, T.: The relevance of algorithms. In: Gillespie, T., Boczkowski, P.J., Foot,
K.A. (eds.) Media Technologies, pp. 167–194. The MIT Press (2014)
23. Grind, K., Schechner, S., McMillan, R., West, J.: How google interferes with its
search algorithms and changes your results. Wall Street J. 15 (2019)
24. Haim, M., Arendt, F., Scherr, S.: Abyss or shelter? On the relevance of web search
engines’ search results when people Google for suicide. Health Commun. 32(2),
253–258 (2017)
25. Hannak, A., et al.: Measuring personalization of web search. In: Proceedings of the
22nd International Conference on World Wide Web, pp. 527–538. ACM (2013)
26. Heider, D.: White News: Why Local News Programs Don’t Cover People of Color.
Routledge (2014)
27. Hübinette, T., Tigervall, C.: To be non-white in a colour-blind society: conversa-
tions with adoptees and adoptive parents in Sweden on everyday racism. J. Inter-
cult. Stud. 30(4), 335–353 (2009)
28. Jiang, M.: The business and politics of search engines: a comparative study of
Baidu and Google’s search results of Internet events in China. New Media Soc.
16(2), 212–233 (2014)
29. Katz, Y.: Artificial Whiteness: Politics and Ideology in Artificial Intelligence.
Columbia University Press (2020)
30. Kay, M., Matuszek, C., Munson, S.A.: Unequal representation and gender stereo-
types in image search results for occupations. In: Proceedings of the 33rd Annual
Conference on Human Factors in Computing Systems, pp. 3819–3828. ACM (2015)
31. Kivel, P.: Uprooting Racism - 4th Edition: How White People Can Work for Racial
Justice. New Society Publishers (2017)
32. Kulshrestha, J., et al.: Search bias quantification: investigating political bias in
social media and web search. Inf. Retrieval J. 22(1), 188–227 (2019)
33. Leufer, D.: Why we need to bust some myths about AI. Patterns 1(7) (2020).
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660373/
34. Makhortykh, M., González Aguilar, J.M.: Memory, politics and emotions: internet
memes and protests in Venezuela and Ukraine. Continuum 34(3), 342–362 (2020)
35. Makhortykh, M., Urman, A., Ulloa, R.: How search engines disseminate infor-
mation about COVID-19 and why they should do better. Harvard Kennedy Sch.
Misinformation Rev. 1 (2020)
36. Müller, H., Despont-Gros, C., Hersh, W., Jensen, J., Lovis, C., Geissbuhler, A.:
Health care professionals’ image use and search behaviour. In: Proceedings of Med-
ical Informatics Europe (MIE 2006), pp. 24–32. IOS Press (2006)
37. Nilsson, N.J.: Artificial Intelligence: A New Synthesis. Morgan Kaufmann Publish-
ers Inc. (1998)
38. Noble, S.U.: Algorithms of Oppression: How Search Engines Reinforce Racism.
New York University Press (2018)
39. Otterbacher, J.: Addressing social bias in information retrieval. In: Bellot, P., et al.
(eds.) CLEF 2018. LNCS, vol. 11018, pp. 121–127. Springer, Cham (2018). https://
doi.org/10.1007/978-3-319-98932-7_11
40. Otterbacher, J., Bates, J., Clough, P.: Competent men and warm women: gender
stereotypes and backlash in image search results. In: Proceedings of the 2017 CHI
Conference on Human Factors in Computing Systems, pp. 6620–6631. ACM (2017)
41. Otterbacher, J., Checco, A., Demartini, G., Clough, P.: Investigating user percep-
tion of gender bias in image search: the role of sexism. In: The 41st International
Conference on Research & Development in Information Retrieval, pp. 933–936.
ACM (2018)
42. Pan, B., Hembrooke, H., Joachims, T., Lorigo, L., Gay, G., Granka, L.: In Google
we trust: users’ decisions on rank, position, and relevance. J. Comput.-Mediated
Commun. 12(3), 801–823 (2007)
43. Pradel, F.: Biased Representation of Politicians in Google and Wikipedia Search?
The Joint Effect of Party Identity. Gender Identity and Elections. Political Com-
munication, pp. 1–32 (2020)
44. Schultheiß, S., Sünkler, S., Lewandowski, D.: We still trust in Google, but less than
10 years ago: an eye-tracking study. Inf. Res. 23(3), 1–13 (2018)
45. Schwemmer, C., Knight, C., Bello-Pardo, E.D., Oklobdzija, S., Schoonvelde, M.,
Lockhart, J.W.: Diagnosing gender bias in image recognition systems. Socius 6,
1–17 (2020)
46. Sparrow, R.: Do robots have race?: Race, social construction, and HRI. IEEE
Robot. Autom. Mag. 27(3), 144–150 (2020)
47. Statcounter: Search Engine Market Share Worldwide (2020). https://gs.
statcounter.com/search-engine-market-share
48. Traub, M.C., Samar, T., van Ossenbruggen, J., He, J., de Vries, A., Hardman, L.:
Querylog-based assessment of retrievability bias in a large newspaper corpus. In:
Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries,
pp. 7–16. ACM (2016)
49. Trielli, D., Diakopoulos, N.: Partisan search behavior and Google results in the
2018 U.S. midterm elections. Inf. Commun. Soc. 1–17 (2020)
50. Unkel, J., Haim, M.: Googling Politics: Parties, Sources, and Issue Ownerships on
Google in the 2017 German Federal Election Campaign. Social Science Computer
Review, pp. 1–20 (2019)
51. Urman, A., Makhortykh, M., Ulloa, R.: Auditing source diversity bias in video
search results using virtual agents. In: Companion Proceedings of the Web Con-
ference 2021, pp. 232–236. ACM (2021)
Equality of Opportunity in Ranking:
A Fair-Distributive Model
1 Introduction
Ranking systems have rapidly spread in today's economies: although such tools have been widely employed for decades in the Information Retrieval field [21], they have recently come back to the cutting edge thanks to the explosive growth of computational power and data availability [15]. Ranking is one of the predomi-
nant forms by which both online and offline software systems present results in
a wide variety of domains ranging from web search engines [19] to recommenda-
tion systems [21]. The main task of ranking systems is to find an allocation of
elements to each of the n positions so that the total value obtained is maximized.
The key technical principle that for decades has driven this optimization is the
Probability Ranking Principle [23], according to which elements are ranked in
descending order depending on their probability of relevance for a certain query
q. Consequently, each element will have a probability of exposure given by its
relevance for the query [27]. It is widely recognized that the position of an ele-
ment in the ranking has a crucial influence on its exposure and its success;
hence, systems with ranking algorithms, whose only task is to maximize util-
ity, do not necessarily lead to fair or desirable scenarios [13]. In fact, a number of studies [12,32,33] have demonstrated that rankings produced in this way
can lead to the over-representation of one element to the detriment of another,
causing forms of algorithmic biases that in some cases can lead to serious social
implications. Web search engine results that inadvertently promote stereotypes
through over-representation of sensitive attributes such as gender, ethnicity and
age are valid examples [1,16,30,37]. In order to mitigate and overcome biased
results, researchers have proposed a number of fairness metrics [3]. However,
the majority of these studies formalize the notion of equity only for supervised machine learning systems, leaving equity in ranking systems a poorly explored area despite the increasing influence of rankings on our society and economy. The lower attention devoted to this field is probably due to the complexity of ranking and recommendation systems, which are characterized by dynamics that are difficult to predict, multiple models and antithetic goals, and which are difficult to evaluate due to great data sparsity (e.g., see [6,7,20]). In addition, a leading path
to exploring the trade-off between the expected utility of a ranking system and
its fairness has not yet been mapped out. We address these challenges by developing a multi-objective ranking system that optimizes the utility of the system and simultaneously satisfies some ethical constraints. Our model is inspired by fair division models [34], which deal with how to divide a set of resources among a set of individuals. The main contributions of this article are the following:
first, we introduce a Fair-Distributive Ranking System combining methods of
supervised machine learning and criteria derived from economics and social sci-
ences; secondly, we define a class of fairness metrics for ranking systems based
on the Equality of Opportunity theory [25]. Finally, we conduct an empirical
analysis to study the trade-off between fairness and utility in ranking systems.
2 Related Work
Several recent works have addressed the issue of fairness in ranking systems.
Some studies minimize the difference in representation between groups in a
ranking through the concept of demographic parity, which requires that members of protected groups be treated similarly to the advantaged ones or to the entire population [2,26,35,36]. In particular, Yang and Stoyanovich [35] have dealt with
this issue as a multi-objective programming problem, while Celis et al. [8] have
approached it from the perspective of the ranking results’ diversification, as in
[36]. More recently, Asudeh et al. [2] have proposed a class of fair scoring func-
tions that associates non-negative weights to item attributes in order to compute
an item score. Singh and Joachims [28] have proposed a Learning-to-Rank algo-
rithm that optimizes the system’s utility according to the merit of achieving a
certain level of exposure. Lastly, some recent studies have investigated the notion
of fairness through equalizing exposure; in this specific strand studies differ in
the way exposure is allocated: while Biega et al. [4] have investigated individ-
ual fairness alongside utility, Singh and Joachims [27] have proposed an optimal
probabilistic ranking to equalize exposure among groups. It is worth noting that
the majority of the previous works have established equity constraints reflecting demographic parity, either by constraining the fraction of elements for each attribute in the ranking or by balancing utility with merit-based exposure. The methodology we propose goes beyond these parity and merit constraints for several reasons: a) the protected attributes are not established a priori but are updated
on the basis of the sample features, and b) the exposure is defined on the basis
of the effort variable; this variable represents the real effort that the elements
have made to be included in the Top-N-rank according to Roemer’s Equality of
Opportunity theory [25].
3 Problem Statement
In Information Retrieval, the notion of utility is commonly stated as finding the ranking that maximizes the system's utility for a query $q$, i.e., $\arg\max_r U(\mathrm{ranking}_r \mid q)$, where $r = 1 \ldots R$ ($R$ being the set of rankings); this is generally achieved through a series of utility measures in the ranking-system domain that leverage a mapping function $\beta$ to detect the relevance of an item to each user given a certain query $q$, $\beta(\mathrm{Rel}(\mathrm{item}_i \mid \mathrm{user}_u, q))$, where $i = 1 \ldots I$ and $u = 1 \ldots U$ ($I$ and $U$ being the item set and the user set). Several recent works establish a certain degree of exposure for each individual or group of individuals as a fairness constraint. The exposure indicates the probability of attention that an item receives based on the query and its ranking position, and is generically calculated as $\frac{1}{\log(1+j)}$, where $j$ is the position of $\mathrm{item}_i$ in $\mathrm{ranking}_r$. We
adapt the example proposed by Singh and Joachims [27] to our scenario: suppose
a group of students has to enroll at a university; the decision-maker then sorts the students according to their relevance for the expressed query and draws up a
certain number of rankings to evaluate the system response accuracy. Relevance
is thus derived from the probability that the candidate is relevant for the query.
In this example, 8 individuals are divided into 3 groups based on ethnicity
attribute. Individuals belonging to the white group have relevance 1, 0.98, 0.95,
the Asians have 0.93, 0.91, 0.88, the African-Americans 0.86, 0.84. Students are
sorted in ranking according to relevance. Since exposure is a measure exploiting
relevance and ranking position, it is computed after sorting. As shown in Fig.
1a, Asian and African-American students, despite being placed a few positions
below white ones, get a very low exposure; this means their average exposure
is significantly lower compared to the white group, despite a minimal deviation
in relevance. Efforts to enforce a fairness constraint on exposure, even if impor-
tant, are missing the real point that is instead tied to relevance. As a matter
of fact, exposure is calculated on the basis of the candidate’s position, regard-
less of the student’s traits. Consider the new ranking in Fig. 1b. In this case,
a fairness constraint is applied to proportionally allocate exposure among eth-
nic groups; despite the constraint, the African-American minority remains in
Fig. 1. Both pictures a and b revise the example of Singh and Joachims [27]. Blue:
white group; green: Asian group; red: African-American group. (Color figure online)
lower positions compared to the other two groups. This problem is even more serious in the case of binary relevance: assuming the decision-maker would admit to the university only the first 3 students, no African-American individuals would be included in the top-3 ranking. To address the problem of fairness in rankings, we suggest considering exposure only marginally and focusing instead on analyzing how relevance is computed and which features correlate with the query q. This means that a
ranking is considered unfair if the students’ relevance, hence their position, is
systematically established on the basis of irrelevant features such as protected
attributes.
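To make the role of exposure in this example concrete, the following Python sketch reproduces the computation under stated assumptions: positions are 1-based, the logarithm is the natural one, and the group labels and relevance values are the ones listed above.

import math

# Students from the example above: (group, relevance), already relevance-sorted.
students = [
    ("white", 1.00), ("white", 0.98), ("white", 0.95),
    ("asian", 0.93), ("asian", 0.91), ("asian", 0.88),
    ("african_american", 0.86), ("african_american", 0.84),
]

def exposure(position):
    # Position-based exposure 1 / log(1 + j), with 1-based position j.
    return 1.0 / math.log(1 + position)

# Average exposure per group after sorting students by relevance (descending).
ranked = sorted(students, key=lambda s: s[1], reverse=True)
per_group = {}
for j, (group, _rel) in enumerate(ranked, start=1):
    per_group.setdefault(group, []).append(exposure(j))

for group, values in per_group.items():
    print(group, round(sum(values) / len(values), 3))

Even with relevance gaps of only a few hundredths, the average exposure of the last-ranked group comes out at less than half that of the first-ranked group, which is exactly the disproportion the example highlights.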
Preliminary. Egalitarian theories [24] such as EOp arise from the notion of dis-
tributive justice, which recognizes that all goods should be equally distributed
across society. The key principle of Roemer's Equality of Opportunity (EOp)
theory is based on the assumption that the resources obtained by individuals
depend on two factors: individual choices, which lie within the sphere of per-
sonal responsibility, and circumstances, which are exogenous to individual con-
trol. He claims that if inequalities in a set of individuals are caused by birth
circumstances, which include variables such as gender, race, or family socio-economic status, and so forth, then these are morally unacceptable and must be
compensated by society. The theory is therefore based on four key principles:
circumstances, effort, responsibility and reward. Responsibility is a theoretical
notion reflecting the effort degree that individuals invest in achieving the acts
they perform. The reward is the fraction of resources that individuals belonging
to a disadvantaged group get in case an inequality of opportunity occurs, and it
is established by a certain policy [9,22]. According to Roemer, policies should be
Since effort is not directly observable, we need a proxy in order to measure it.
Roemer argues that there exists an effort distribution function that characterizes the entire subgroup within which the location of the individual is set, and that what is needed is a measure of effort that is comparable between different types. The
basic assumption is that two individuals belonging to a different type t who
occupy the same position in their respective distribution functions have exerted
the same level of effort - and therefore of responsibility. Since, under the same
circumstances, individuals who make different choices exercise different degrees
of effort - and thus achieve a different outcome -, the differences in outcome
within the same type are by definition determined by different degrees of effort,
and therefore are not considered in the computation of the EOp. In general,
Roemer states that to estimate effort it is necessary to:
iii) measure the effort that an individual has exerted through the quantile occu-
pied in his or her type distribution.
Consequently, all the individuals positioned at the same quantile in the distri-
bution of the respective type are by assumption characterized by the same level
of effort. Hence, the counterfactual score distribution ỹ is computed following
these steps:
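A rough sketch of the quantile-matching idea is given below; it is not necessarily the authors' exact procedure, and the counterfactual construction (mapping each effort quantile onto the pooled score distribution) as well as the function names are illustrative assumptions.

import numpy as np

def effort_quantiles(scores, types):
    # Effort proxy: quantile occupied by each individual within the score
    # distribution of their own type (same quantile -> same assumed effort).
    scores, types = np.asarray(scores, dtype=float), np.asarray(types)
    q = np.empty_like(scores)
    for t in np.unique(types):
        mask = types == t
        ranks = np.argsort(np.argsort(scores[mask]))   # 0 .. n_t - 1
        q[mask] = (ranks + 0.5) / mask.sum()            # mid-quantiles
    return q

def counterfactual_scores(scores, types):
    # Illustrative counterfactual: individuals at the same effort quantile
    # receive the same score, read off the pooled score distribution.
    q = effort_quantiles(scores, types)
    return np.quantile(np.asarray(scores, dtype=float), q)

# Toy usage: two types with systematically different score levels.
scores = [0.92, 0.85, 0.80, 0.74, 0.70, 0.65]
types = ["A", "A", "A", "B", "B", "B"]
print(counterfactual_scores(scores, types))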
5 Experiment
1 For a detailed explanation of the Bernstein Polynomials Log-Likelihood, see [5, 17, 38].
5.2 Metrics
We apply three types of metrics in order to fulfill both ranking and fairness
constraints: i) ranking domain metrics, ii) inequality domain metrics (i.e., Gini index and Theil index), and iii) a set of metrics we propose to study our fair-
ness constraints (i.e., Opportunity Loss/Gain Profile, Opportunity Loss/Gain
Set, Unexplained Inequality Rate, Reward Profile, Reward Rate). Regarding the
inequality metrics, the Gini index is a statistical concentration index ranging from 0 to 1 that measures the degree of inequality of a distribution [11]; a Gini index equal or close to zero indicates a tendency towards equidistribution and expresses perfect equality, while a value equal or close to 1 indicates the highest concentration and expresses maximum inequality. The Theil index [31] is an entropy-based measure used to study segregation; a Theil value of zero means perfect equality. Finally, we have
proposed a new set of fairness metrics: the Opportunity-Loss/Gain Profile and the Opportunity-Loss/Gain Set are computed to study inequality in the original distribution. They indicate which score levels could be reached by each type with different degrees of effort. The Unexplained Inequality Rate calculates the amount of removed inequality that is considered fair, i.e., due to individuals' responsibility. The Reward Profile identifies the type that obtained the highest gain/loss from the re-allocation of scores, i.e., after applying fairness constraints, while the Reward Rate calculates the average re-allocation score rate for each type. All formulas are summarized
in Table 1.
Table 1. Summary of inequality domain metrics and of a set of novel metrics proposed
to study fairness constraints. Notation: F(y)= cumulative distribution function of the
score, µ = mean score; R = number of types, pi = frequency of types; yπt = score dis-
tribution aggregated by type and quantile; ỹi = standardized score; adj(ỹπt )= adjusted
mean-type score at each effort degree (after policy).
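For reference, a minimal sketch of the two inequality-domain metrics mentioned above is given below; it assumes strictly positive scores and follows the standard definitions of the Gini [11] and Theil [31] indices, which may differ in detail from the authors' implementation.

import numpy as np

def gini(x):
    # Gini concentration index in [0, 1]; 0 expresses perfect equality.
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def theil(x):
    # Theil entropy index; 0 means perfect equality (requires positive values).
    x = np.asarray(x, dtype=float)
    ratio = x / x.mean()
    return float(np.mean(ratio * np.log(ratio)))

scores = [0.95, 0.90, 0.88, 0.40, 0.35, 0.10]
print(round(gini(scores), 3), round(theil(scores), 3))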
(a) Barplot of aggregate exposure for the top-50, top-150 and top-300, initial score
ranking, standardized ranking and Γ ranking.
trend. Both rankings exhibit the same value of the Theil index, revealing that entropy is similarly balanced. By observing the Outcome Set, we notice that types D and F get the lowest average outcomes for all degrees of effort; this does not necessarily
mean they are in a disadvantaged position. There are indeed multiple reasons,
which do not necessarily indicate inequality, why some types systematically show
a lower average outcome. We compute Gini Index on the standardized distribu-
tion to observe if there are types that systematically receive a lower outcome due
to their type membership. In this way, only the inequalities caused by circumstances and effort degrees are obtained. This explains why types showing a lower average outcome are not directly compensated with higher rewards. The Reward Rate is expressed as a dispersion around the quantile mean for each type, thus showing that it does not produce a significant change in expected ranking utility. The aggregated exposure is computed recursively on all rankings with n+1 individuals. The analysis shows extremely disproportionate exposure values for the initial score ranking for all top-N rankings. The Γ ranking keeps a proportionate aggregate exposure level among types for large subsets, while for smaller subsets it tends to favor individuals in groups which display high levels of inequality (Fig. 2a). Overall, these results indicate that our approach produces a fair ranking Γ with minimal cost to utility, which we compute in terms of relevance for the query "best candidates selection" (Fig. 2b).
6 Conclusions
The method we have proposed generates a ranking with a guaranteed fair division
score, with minimal cost for utility. Our ranking is based on a counterfactual
score indicating which score students would have obtained if they had not belonged to a different type. In this sense, the ranking is drawn up on the basis of the effort
(aka, individual responsibility) that individuals have exerted to reach the initial
score. As a result, our ranking presents equal opportunities for each group of
individuals exhibiting the same circumstances (types) to achieve high ranking
positions (high relevance) and good exposure rates. Moreover, the paper provides
a set of new metrics to measure fairness in Fair-Distributive ranking. Finally,
we study the trade-off between the aggregated type exposure and the system’s
utility: our analyses show that the counterfactual score does not significantly affect the expected ranking utility and preserves exposure levels proportionally among
groups. The method presented has some limitations, including for example: the
need to have a dataset containing several demographic attributes, and the need
to have one or more target variables to calculate conditional inference trees
(alternatively, the method is subordinate to the construction of score indices).
As far as next steps are concerned, it is important to i) verify the robustness of the model (internal validity) with larger datasets (including synthetic ones) and ii)
verify the external validity of the approach by applying the model on different
fields of application. In the long term, our intention is to implement a ranking
simulator that tests the results of different distributive justice theories.
References
1. ALRossais, N.A., Kudenko, D.: Evaluating stereotype and non-stereotype recom-
mender systems. In: Proceedings of the 12th ACM Conference on Recommender
Systems. RecSys 2018. ACM, Vancouver (2018)
2. Asudeh, A., Jagadish, H.V., Stoyanovich, J., Das, G.: Designing fair ranking
schemes. In: Proceedings of the 2019 International Conference on Management
of Data, SIGMOD 2019, pp. 1259–1276. Association for Computing Machinery,
New York (2019). https://doi.org/10.1145/3299869.3300079
3. Barocas, S., Hardt, M., Narayanan, A.: Fairness and machine learning (2018).
http://www.fairmlbook.org
4. Biega, A.J., Gummadi, K.P., Weikum, G.: Equity of attention: amortizing indi-
vidual fairness in rankings. In: The 41st International ACM SIGIR Conference
on Research & Development in Information Retrieval, SIGIR 2018, pp. 405–414.
Association for Computing Machinery, New York (2018). https://doi.org/10.1145/
3209978.3210063
5. Brunori, P., Neidhöfer, G.: The evolution of inequality of opportunity in Germany:
A machine learning approach. ZEW - Centre Eur. Econ. Res. Discussion 20 (2020).
https://doi.org/10.2139/ssrn.3570385
6. Burke, R.: Multisided fairness for recommendation, July 2017. http://arxiv.org/
abs/1707.00093
7. Burke, R., Sonboli, N., Ordonez-Gauger, A.: Balanced neighborhoods for multi-
sided fairness in recommendation. In: Conference on Fairness, Accountability and
Transparency in Proceedings of Machine Learning Research, vol. 81, pp. 202–214.
Proceedings of Machine Learning Research (PMLR), New York, 23–24 February
2018. http://proceedings.mlr.press/v81/burke18a.html
8. Celis, L.E., Straszak, D., Vishnoi, N.K.: Ranking with fairness constraints (2018)
9. Checchi, D., Peragine, V.: Inequality of opportunity in Italy. J. Econ. Inequality
8(4), 429–450 (2010). https://doi.org/10.1007/s10888-009-9118-3
10. Cortez, P., Silva, A.: Using data mining to predict secondary school student per-
formance. In: Brito, A., Teixeira, J. (eds.) Proceedings of 5th FUture BUsiness
TEChnology Conference (FUBUTEC 2008). pp. 5–12 (2008). https://archive.ics.
uci.edu/ml/datasets/Student+Performance
11. Gastwirth, J.L.: The estimation of the Lorenz curve and Gini index. Rev. Econ.
Stat. 54, 306–316 (1972)
12. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In:
Proceedings of the 30th International Conference on Neural Information Processing
Systems, NIPS 2016, pp. 3323–3331. Curran Associates Inc., Red Hook (2016)
13. Helberger, N., Karppinen, K., D’Acunto, L.: Exposure diversity as a design prin-
ciple for recommender systems. Inf. Commun. Soc. 21(2), 191–207 (2016)
14. Hothorn, T., Hornik, K., Zeileis, A.: Unbiased recursive partitioning: a conditional
inference framework. J. Comput. Graph. Stat. 15(3), 651–674 (2006)
15. Irfan, S., Babu, B.V.: Information retrieval in big data using evolutionary compu-
tation: a survey. In: 2016 International Conference on Computing. Communication
and Automation (ICCCA), pp. 208–213. IEEE, New York (2016)
16. Karako, C., Manggala, P.: Using image fairness representations in diversity-based
re-ranking for recommendations. In: Adjunct Publication of the 26th Conference
on User Modeling, Adaptation and Personalization, UMAP 2018. pp. 23–28. ACM,
Singapore (2018). https://doi.org/10.1145/3213586.3226206
33. Weinsberg, U., Bhagat, S., Ioannidis, S., Taft, N.: Blurme: inferring and obfuscat-
ing user gender based on ratings. In: Proceedings of the Sixth ACM Conference
on Recommender Systems, RecSys 2012, pp. 195–202. Association for Computing
Machinery, New York (2012). https://doi.org/10.1145/2365952.2365989
34. Yaari, M.E., Bar-Hillel, M.: On dividing justly. Soc. Choice Welfare 1(1), 1–24
(1984)
35. Yang, K., Stoyanovich, J.: Measuring fairness in ranked outputs. In: Proceedings
of the 29th International Conference on Scientific and Statistical Database Man-
agement. SSDBM 2017. Association for Computing Machinery, New York (2017).
https://doi.org/10.1145/3085504.3085526
36. Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., Baeza-Yates, R.:
Fa*ir: a fair top-k ranking algorithm. In: Proceedings of the 2017 ACM on Con-
ference on Information and Knowledge Management, CIKM 2017, pp. 1569–1578.
Association for Computing Machinery, New York (2017). https://doi.org/10.1145/
3132847.3132938
37. Zehlike, M., Sühr, T., Castillo, C., Kitanovski, I.: Fairsearch: a tool for fairness
in ranked search results. In: Companion Proceedings of the Web Conference 2020,
WWW 2020, pp. 172–175. Association for Computing Machinery, New York (2020).
https://doi.org/10.1145/3366424.3383534
38. Zhong, G.: Efficient and robust density estimation using Bernstein type poly-
nomials. J. Nonparametric Stat. 28(2), 250–271 (2016). https://doi.org/10.1080/
10485252.2016.1163349
Incentives for Item Duplication Under
Fair Ranking Policies
1 Introduction
1 https://community.withairbnb.com/t5/Hosting/Unfair-duplication-of-same-listing-to-gain-more-exposure/td-p/850319, all links accessed on 02-03-21.
2 https://community.withairbnb.com/t5/Help/Duplicate-photos-in-listings-and-terms-of-service/td-p/1081009.
3 https://sellercentral.amazon.com/forums/t/duplicate-search-results/445552.
Fig. 1. Amazon result page for query controller issued on February 19, 2021 by
a Boston-based unregistered user in incognito browser mode. Top 4 results comprise
near-duplicates in positions 1 and 3 (0-based indexing).
2 Related Work
Fairness in ranking requires that the items ranked by a system receive a suitable
share of exposure, so that the overall allocation of user attention is considered
fair according to a criterion of choice [6,21]. Fair ranking criteria depend on the
specific context and normative reasoning, often inheriting and adapting notions
from the machine learning fairness literature, such as independence and separa-
tion [3].
Position bias and task repetition are peculiar aspects of many fair ranking
problems. Position bias refers to the propensity of users of ranking systems to
concentrate on the first positions in a list of ranked items, while devoting less
attention to search results presented in lower positions [15]. Common measures
of accuracy in ranking, such as Expected Reciprocal Rank (ERR) [8], hinge on
this property: they reward rankings where items are presented in decreasing
order of relevance, so that the positions which attract most user attention are
occupied by the most relevant items. These are static ranking measures, which
summarize the performance of a system with respect to an information need by
modeling a single user-system interaction. However, users can issue the same
query multiple times, requiring a search engine to repeatedly attend to the same
task (task repetition). Repeated queries, stemming from the same information
need, are sometimes called query impressions.4
Recently, several measures of fairness in rankings have been proposed, which
take into account the peculiarities of ranking problems [6,10,21]. These mea-
sures incorporate position bias, by suitably modeling user browsing behaviour
when estimating item exposure, and consider task repetition by evaluating sys-
tems over multiple impressions of the same query, thus encouraging rotation of
relevant items in top ranks. For example, equity of amortized fairness [6] consid-
ers cumulative attention and relevance of items over multiple query repetitions,
and is defined as follows. “A sequence of rankings ρ1 , . . . , ρJ offers equity of
amortized attention if each subject receives cumulative attention proportional
to her cumulative relevance”, where the accumulation and amortization process
are intended over multiple queries and impressions.
Depending on their amortization policy, measures of fairness in rankings can be (i) cross-query or (ii) within-query. (i) Cross-query measures are aimed at matching cumulative attention and relevance across different information needs [6]; this approach has the advantage of naturally weighing information needs based on their frequency and of enforcing fairness over a realistic query load. On
the downside, these fairness measures may end up rewarding systems that dis-
play irrelevant items in high-ranking positions. (ii) Within-query measures, on
the other hand, enforce fairness over impressions of the same query [10]; this
amortization policy results in one measurement for each information need and
does not run the risk of rewarding systems that compensate an item’s exposure
across different information needs, which may result in balancing false negatives
(missed exposure when relevant) with false positives (undue exposure when irrel-
evant).
Different approaches have been proposed to optimize ranking systems against
a given fairness measure. Most of them make use of the task repetition property
by employing stochastic ranking policies. These systems are non-deterministic
since, given a set of estimated relevance scores, the resulting rankings are not
necessarily fixed. A key advantage of stochastic ranking policies over determinis-
tic ones lies in the finer granularity with which they can distribute exposure over
multiple impressions of a query. Depending on whether they keep track of the
unfairness accumulated by items, policies can be stateless or stateful. Stateless
systems are based on drawing rankings from an ideal distribution independently
[10,21,22]. This family of approaches can yield high-variance exposure for items,
especially over few impressions, due to rankings being independent from one
another. Moreover, they are not suitable to target cross-query measures as they
would require estimating the future query load. Stateful solutions, on the other
hand, keep track of the unfairness accumulated by items [6,18,23], and exploit it
to build controllers or heuristics that can actively drive rankings toward solutions
that increase the average cumulative fairness.
4 https://fair-trec.github.io/2020/doc/guidelines-2020.pdf.
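As an illustration of a stateless policy, the sketch below draws independent rankings with Plackett-Luce sampling [17,20]: items are placed one at a time with probability proportional to a weight among the items not yet placed. Using the raw relevance scores as the sampling weights is an assumption made here for illustration only.

import random

def sample_plackett_luce(weights, rng):
    # Draw one ranking: repeatedly pick an item with probability proportional
    # to its weight among the items that have not been placed yet.
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        total = sum(weights[i] for i in remaining)
        r, acc = rng.random() * total, 0.0
        for i in remaining:
            acc += weights[i]
            if r <= acc:
                ranking.append(i)
                remaining.remove(i)
                break
    return ranking

# Stateless behaviour: every impression of the query is sampled independently.
rng = random.Random(0)
relevance = [1.0, 0.98, 0.95, 0.93, 0.91]
impressions = [sample_plackett_luce(relevance, rng) for _ in range(5)]
print(impressions)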
Symbol   Meaning
$u_i$   items to be ranked by the system, $i \in \{1, \ldots, I\}$
$u_{\tilde{i}}$   duplicate of item $u_i$
$q_j$   query impressions issued to the system, $j \in \{1, \ldots, J\}$
$\rho_j$   ranking of items in response to query $q_j$
$\rho_j^i$   rank of item $u_i$ in ranking $\rho_j$
$\pi$   a ranking policy
$\rho^\pi$   $\rho^1, \ldots, \rho^J$, the sequence of rankings obtained via policy $\pi$
$u(\rho^\pi)$   utility function rewarding a ranking sequence based on user satisfaction
$a_i^j$   attention received by item $u_i$ in ranking $\rho_j$
$r_i^j$   relevance of item $u_i$ for the information need expressed by $q_j$
$k$   cost of duplication, such that $r_{\tilde{i}}^j = k r_i^j$, $k \in (0, 1)$
$\delta_{i+1,i}^{j}$   difference in relevance for adjacently ranked items (simplified to $\delta$)
$A_i$   $\sum_{j=1}^{J} a_i^j$, i.e. cumulative attention received by item $u_i$
$R_i$   $\sum_{j=1}^{J} r_i^j$, i.e. cumulative relevance of item $u_i$ over queries $\{q_1, \ldots, q_J\}$
ranking. Firstly, some of these features, such as user ratings, stem from the
interaction of users with items; if an item is duplicated, its interactions with
users will be distributed across its copies, presumably reducing their relevance
score and lowering their rank in result pages. Moreover, a ranking system may
explicitly measure the diversity of retrieved items [9] and favour rankings with
low redundancy accordingly [19]. Finally, some platforms forbid duplication, in
the interest of user experience, and enforce this ban with algorithms for duplicate
detection and suppression procedures [1].
Therefore, we treat duplication as an expensive procedure, with a negative
impact on items’ relevance scores. We assume that the cost of duplicating an
item only affects the new copy, while leaving the relevance of the original copy
intact. We model duplication cost as a multiplicative factor k ∈ (0, 1), reducing
the relevance score of new copies of an item. In other words, if a copy uĩ of
item ui is created, then ri remains constant, while rĩ = kri . Richer models of
duplication cost are surely possible (e.g. also reducing the relevance ri of the
original copy), and should be specialized depending on the application at hand.
Here u(·) is a function computing the utility of a sequence of rankings ρπ for item
consumers, produced by a policy π. For example, in an IR setting, u(·) is a proxy
for the information gained by users from ρπ . We measure the utility of a ranking
(for a single impression) via normalized ERR [8], where the normalization ensures
that a ranking where items are perfectly sorted by relevance has utility equal
to 1, regardless of the items available for ranking and their relevance. ERR is
based on a cascade browsing model, where users view search results from top to
bottom, with a probability of abandoning at each position which increases with
rank and relevance of examined items.5 The overall utility u(ρπ ) is computed as
the average utility over all impressions. More in general, utility can be broadly
characterized as user satisfaction, potentially including notions of diversity in
rankings [9].
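A minimal sketch of this utility computation, assuming the cascade stopping probabilities reported in the footnote of this section (c = 0.7, γ = 0.5) and the usual reciprocal-rank payoff of ERR [8]; the exact formulation used by the authors follows [5,8] and may differ in detail.

def stop_probability(rels, p, c=0.7, gamma=0.5):
    # P(stop at position p) under the cascade model: gamma^(p-1) * c*r_p *
    # prod_{i<p} (1 - c*r_i), with 1-based positions.
    prob = gamma ** (p - 1) * c * rels[p - 1]
    for r in rels[: p - 1]:
        prob *= 1 - c * r
    return prob

def err(rels, c=0.7, gamma=0.5):
    # Expected reciprocal rank: 1/p payoff weighted by the stop probability.
    return sum(stop_probability(rels, p, c, gamma) / p
               for p in range(1, len(rels) + 1))

def normalized_err(rels, c=0.7, gamma=0.5):
    # Normalization: divide by the ERR of the relevance-sorted (ideal) ranking,
    # so a perfectly sorted ranking has utility 1.
    ideal = sorted(rels, reverse=True)
    return err(rels, c, gamma) / err(ideal, c, gamma)

print(normalized_err([0.3, 0.9, 0.5]))   # below 1: the best item is not ranked first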
The objective function Q also comprises a function f (·) which combines the
cumulative relevance Ri and exposure Ai of items to compute the fairness of
ranking policy π toward item providers. To this end, we follow [5] by requiring
that items receive a share of cumulative exposure that matches their share of
cumulative relevance. More precisely, let us define the attention and relevance accumulated by item $u_i$ over $J$ queries as $A_i = \sum_{j=1}^{J} a_i^j$ and $R_i = \sum_{j=1}^{J} r_i^j$, and let us denote as $\bar{A}_i$ and $\bar{R}_i$ their normalized versions

$\bar{A}_i = \frac{A_i}{\sum_{i=1}^{I} A_i}; \qquad \bar{R}_i = \frac{R_i}{\sum_{i=1}^{I} R_i}.$   (2)
5 Following [5], the probability of the user stopping at position $p$, after viewing items $\{u_1, \ldots, u_p\}$, is set to $P(\mathrm{stop} \mid u_1, \ldots, u_p) = \gamma^{p-1}\, c\, r_p \prod_{i=1}^{p-1} (1 - c\, r_i)$, with $c = 0.7$, $\gamma = 0.5$.
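A minimal sketch of the exposure/relevance comparison of Eq. (2) follows; the exact combination function f(·) used in the objective is not spelled out above, so the squared-gap disparity below is only an illustrative placeholder (zero still corresponds to perfectly matched shares).

def normalized_shares(values):
    # Turn cumulative quantities into shares that sum to 1, as in Eq. (2).
    total = sum(values)
    return [v / total for v in values]

def share_disparity(attention, relevance):
    # Illustrative unfairness: squared gap between attention and relevance shares.
    a_bar = normalized_shares(attention)
    r_bar = normalized_shares(relevance)
    return sum((a - r) ** 2 for a, r in zip(a_bar, r_bar))

# Toy cumulative attention A_i and relevance R_i over J impressions.
A = [40.0, 25.0, 10.0]
R = [90.0, 80.0, 70.0]
print(share_disparity(A, R))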
3.3 Results
As a first step, we ensure that the stateful policy can effectively trade off rele-
vance and fairness in the basic setting where no duplicates are present. In Fig. 2,
we evaluate the impact of parameter λ on the utility and unfairness of rankings
produced by the stateful policy, where unfairness is defined as the negation of
function f (·) in Eq. 4. As a baseline, we test the PL-based policy πPL , reporting
median values for utility and unfairness over 1,000 repetitions, along with the
5th and 95th percentile. Each panel in Fig. 2 corresponds to a different combi-
nation of relevance difference, parametrized by δ, and number of impressions J.
The top row corresponds to a less frequent query (J = 20) and the bottom row
to a more frequent one (J = 100). Panels on the left depict results for a large
relevance difference (δ = 0.25), middle panels correspond to an intermediate
relevance difference (δ = 0.125) and right panels to a small one (δ = 0.05).
We find that, over large relevance differences (left panels), a value λ ≥ 0.3 is
required to approach zero unfairness, while, for small relevance differences (right
panels), λ = 0.1 is sufficient. This is expected: as relevance becomes uniform
across items, even a policy marginally focused on fairness (λ = 0.1) can bring
about major improvements in the distribution of attention. Moreover, for a small
relevance difference, the trade-off between fairness and utility is less severe, which
is also expected. When items have a similar relevance, a policy can more easily
grant them a share of exposure proportional to their share of relevance, while only
suffering a minor loss in terms of utility. Furthermore, the unfairness of exposure
brought about by a solution purely based on relevance (λ = 0) increases as the difference in relevance for the available items becomes smaller. This is a desirable
Fig. 2. Unfairness (x axis) vs average utility (y axis) for stateful greedy solutions over
different values of λ (downsampled and color-coded in legend). In black, summary of
1,000 Plackett-Luce repetitions, reporting median, 5th and 95th percentile for utility
and unfairness. Each column corresponds to a different relevance profile for the available items, namely large relevance difference (left, δ = 0.25), intermediate difference (middle, δ = 0.125) and small difference (right, δ = 0.05). Solutions with λ > 0.5 are
omitted for better color-coding as they are all in a close neighbourhood of λ = 0.5.
Fig. 3. Solid lines represent the attention Ai accumulated by each item (y axis) under
a fairness-aware policy πλ=0.5 (blue) or a policy solely focused on (ERR-based) utility
πλ=0 (red) in the absence of duplicates, over J = 100 impressions of the same query.
Item indices i ∈ {0, . . . , 4} vary along the x axis. Round markers summarize the extra-
attention one item would obtain if duplicated. Each column corresponds to a different
relevance profile for the available items, namely large relevance difference (left, δ = 0.25), intermediate difference (middle, δ = 0.125) and small difference (right, δ = 0.05). Each row corresponds to a different relevance multiplier for duplicates, namely
k = 1 (top) and k = 0.5 (bottom). (Color figure Online)
For every combination of parameters k and δ considered and for each item ui ,
duplicates are always rewarded more under a fairness-aware policy πλ=0.5 than
under a policy solely focused on relevance πλ=0 . This finding suggests that fair-
ness in rankings may be gamed by providers who duplicate their items. Moreover,
in the presence of duplicates or near-duplicates, fairness of rankings may be at
odds with diversity. Duplicated items, especially top-scoring ones, end up obtain-
ing a significant amount of extra-attention. In turn, this may incentivize item
providers to duplicate their listings. If redundancy in candidate items increases,
it becomes harder for a ranking system to achieve diverse rankings, with potential
repercussions on user satisfaction [11] and perception [16]. As expected, however,
the benefits of duplication become smaller as its cost increases (bottom panels).
Figure 4a summarizes the same analysis for λ = 0.2, which corresponds
to a more balanced ranking policy. In general, policy πλ=0.2 is more similar
to πλ=0 , i.e. it is more focused on relevance and rewards duplicates less than
πλ=0.5 . The most relevant items still obtain a sizeable benefit from duplication,
especially when the copying process does not affect item relevance (top panels).
Finally, we evaluate the extent to which a policy based on PL sampling rewards
duplicates. Figure 4b reports the extra-attention obtained by duplicated items
under πPL . These results are similar to those obtained under policy πλ=0.5 in
Fig. 3, showing that duplicates are likely to receive a sizeable extra-exposure also
under the stateless PL-based policy. This finding is not surprising given that, in
order for πλ=0.5 and πPL to achieve similarly low unfairness for frequent queries
(Fig. 2b), they must distribute item exposure in a similar fashion.
4 Conclusions
In this work we have shown that duplicates are a potential blind spot in the nor-
mative reasoning underlying common fair ranking criteria. On one hand, fairness-
aware ranking policies, both stateful and stateless, may be at odds with diversity
due to their potential to incentivize duplicates more than policies solely focused
on relevance. This can be an issue for system owners, as diversity of search
results is often associated with user satisfaction [11]. On the other hand, allow-
ing providers who duplicate their items to benefit from extra-exposure seems
unfair for the remaining providers. Finally, system users (item consumers) may
end up being exposed to redundant items in low-diversity search results; this
would be especially critical in situations where items convey opinion.
While technical solutions for near-duplicate detection and removal are cer-
tainly available [1,7], they may not always be viable, as nearly identical listings
can be posted in accordance with system regulation, e.g. to stress slight dif-
ferences in products. Control over near-duplicates is even weaker in web page
collections indexed by search engines. Therefore, it is important to consider the
entities and subjects who benefit from exposure of an item and factor them into
the normative reasoning underlying a fair ranking objective. While in market-
places beneficiaries of exposure are more easily identifiable, for document collec-
tions the situation is surely nuanced, including for instance the writer, publisher
and subject of a document.
Future work should comprise a more detailed study, including cross-query
measures, considering different user browsing models and richer models for dupli-
cation and its cost. Moreover, it will be interesting to systematically assess the
relationship between provider-side fairness and diversity of search results in the
presence of duplicates, and the extent to which these desirable objectives are in
conflict with one another.
References
1. Amazon: Potential duplicates. https://sellercentral.amazon.com/gp/help/
external/G202105450. Accessed 12 June 2020
2. Amazon: Split different products into different pages. https://sellercentral.amazon.
com/gp/help/external/help.html?itemID=201950610. Accessed 12 June 2020
3. Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning. fairml-
book.org (2019). http://www.fairmlbook.org
4. Bernstein, Y., Zobel, J.: Redundant documents and search effectiveness. In: Pro-
ceedings of the 14th ACM International Conference on Information and Knowledge
Management, pp. 736–743. CIKM 2005, ACM, New York, NY, USA (2005)
5. Biega, A.J., Diaz, F., Ekstrand, M.D., Kohlmeier, S.: Overview of the trec 2019
fair ranking track. In: The Twenty-Eighth Text REtrieval Conference (TREC 2019)
Proceedings (2019)
6. Biega, A.J., Gummadi, K.P., Weikum, G.: Equity of attention: amortizing indi-
vidual fairness in rankings. In: The 41st International ACM SIGIR Conference
on Research & Development in Information Retrieval, pp. 405–414. SIGIR 2018,
ACM, New York, NY, USA (2018)
7. Broder, A.Z.: Identifying and filtering near-duplicate documents. In: Giancarlo,
R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 1–10. Springer, Heidelberg
(2000). https://doi.org/10.1007/3-540-45123-4_1
8. Chapelle, O., Metlzer, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for
graded relevance. In: Proceedings of the 18th ACM Conference on Information and
Knowledge Management, pp. 621–630. CIKM 2009, ACM, New York, NY, USA
(2009)
9. Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher,
S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In:
Proceedings of the 31st Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 659–666. SIGIR 2008, ACM, New
York, NY, USA (2008)
10. Diaz, F., Mitra, B., Ekstrand, M.D., Biega, A.J., Carterette, B.: Evaluating
stochastic rankings with expected exposure. In: Proceedings of the 29th ACM
International Conference on Information & Knowledge Management, pp. 275–284.
CIKM 2020, ACM, New York, NY, USA (2020)
11. Ekstrand, M.D., Harper, F.M., Willemsen, M.C., Konstan, J.A.: User perception of
differences in recommender algorithms. In: Proceedings of the 8th ACM Conference
on Recommender Systems, pp. 161–168. RecSys 2014, ACM, New York, NY, USA
(2014)
12. Fröbe, M., Bevendorff, J., Reimer, J.H., Potthast, M., Hagen, M.: Sampling bias
due to near-duplicates in learning to rank. In: Proceedings of the 43rd International
ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 1997–2000. SIGIR 2020, ACM, New York, NY, USA (2020)
13. Fröbe, M., Bittner, J.P., Potthast, M., Hagen, M.: The effect of content-equivalent
near-duplicates on the evaluation of search engines. In: Jose, J.M., et al. (eds.)
Advances in Information Retrieval, pp. 12–19. Springer International Publishing,
Cham (2020)
14. Geyik, S.C., Ambler, S., Kenthapadi, K.: Fairness-aware ranking in search & rec-
ommendation systems with application to linkedin talent search. In: Proceedings
of the 25th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 2221–2231. KDD 2019, ACM, New York, NY, USA (2019)
15. Joachims, T., Radlinski, F.: Search engines that learn from implicit feedback. Com-
puter 40(8), 34–40 (2007)
16. Kay, M., Matuszek, C., Munson, S.A.: Unequal representation and gender stereo-
types in image search results for occupations. In: Proceedings of the 33rd Annual
ACM Conference on Human Factors in Computing Systems, pp. 3819–3828. CHI
2015, Association for Computing Machinery, New York, NY, USA (2015)
17. Luce, R.D.: Individual Choice Behavior. Wiley, Hoboken (1959)
18. Morik, M., Singh, A., Hong, J., Joachims, T.: Controlling fairness and bias in
dynamic learning-to-rank. In: Proceedings of the 43rd International ACM SIGIR
Conference on Research and Development in Information Retrieval, pp. 429–438.
SIGIR 2020, ACM, New York, NY, USA (2020)
19. Nishimura, N., Tanahashi, K., Suganuma, K., Miyama, M.J., Ohzeki, M.: Item
listing optimization for e-commerce websites based on diversity. Front. Comput.
Sci. 1, 2 (2019)
20. Plackett, R.L.: The analysis of permutations. J. Roy. Stat. Soc. Ser. C (Appl. Stat.)
24(2), 193–202 (1975)
21. Singh, A., Joachims, T.: Fairness of exposure in rankings. In: Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, pp. 2219–2228. KDD 2018, ACM, New York, NY, USA (2018)
22. Singh, A., Joachims, T.: Policy learning for fairness in ranking. In: Wallach,
H., Larochelle, H., Beygelzimer, A., dAlché-Buc, F., Fox, E., Garnett, R. (eds.)
Advances in Neural Information Processing Systems, vol. 32, pp. 5426–5436. Cur-
ran Associates, Inc., Red Hook (2019)
23. Thonet, T., Renders, J.M.: Multi-grouping robust fair ranking. In: Proceedings of
the 43rd International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 2077–2080. SIGIR 2020, ACM, New York, NY, USA
(2020)
Quantification of the Impact
of Popularity Bias in Multi-stakeholder
and Time-Aware Environments
1 Introduction
Popularity bias is one of the main biases present in recommendation algorithms
[1]. It consists in the fact that the most popular items are over-recommended by the algorithms, while items with fewer interactions remain invisible [19]. This generates a rich-get-richer effect [24], in which recommendations increase the popularity of already popular items and do not give the less popular ones a chance to emerge. In general, efforts to decrease this bias have focused on those who
consume these recommendations, i.e., the users. However, in platforms where
there are multiple stakeholder groups, it is important to consider what impact
each one has, otherwise some of these groups may have incentives to stop using
the platform. In order to make good and useful recommendations to users, it
is important that the recommender system takes into consideration the novelty
and diversity of these recommendations [13]. For example, if Netflix made only popular recommendations due to the bias of its algorithm, only the directors of the most popular movies would benefit. In turn, those who would be most interested
in the platform would be those users who have more popular tastes. This would
hurt those directors and users who create/consume less popular movies, which
would eventually give them incentives to leave the platform.
Within this context, it is also important to consider time as a variable: items
that are popular today may not be popular tomorrow. Thus, users who at one
time had popular tastes may eventually become users with less common tastes;
similarly, suppliers may go from being unpopular to becoming popular. While
metrics have been constructed to measure the impact of popularity bias, they
have not considered the time dimension of interactions.
Thus, the main contributions of this paper are:
1. Propose a way to measure the popularity of items that accounts for temporal
dynamics. This also allows us to assign stakeholders to popularity groups that
change over time.
2. Perform a popularity bias analysis over time, using state-of-the-art metrics on
the subject, which measure the impact of popularity bias on users and providers.
2 Related Work
2.1 Multistakeholders
Over the years, this topic has been approached from two main perspectives,
which are independent and complementary [18]. On the one hand, we have the
notion that the popularity of an item may change over time; this change occurs
intrinsically in every context but is also influenced by external events. Examples of this
perspective are the prediction of values on a timeline from before-and-after
interactions [36] and the incorporation of time-dependent weights that decrease
the importance of old items [6,17].
On the other hand, there is the temporal evolution of the users; the way they
interact with their environment is not constant and therefore their tastes or
opinions vary depending on the temporal context [12,33]. In particular, this has
motivated research for models with a time dimension. Among them, for example:
models aware of the day of the week [5], models aware of the time of the day
[35] and models where the importance of the interactions is penalized according
to how many newer interactions with the same item the user has [25].
3 Datasets
In order to carry out the desired study, we used two datasets: ‘LFM-1B’ [28],
which contains the plays of different songs by users of the Last.FM site, and
‘KASANDR’ [30], an e-commerce dataset with implicit feedback.
Last.FM dataset
Month                     7        8        9        10       11       12
# of interactions         124,869  151,657  179,953  209,423  242,800  316,666
# of users                9,573    11,101   12,631   14,093   15,716   19,129
# of suppliers (artists)  18,202   20,799   23,360   25,752   28,355   33,638
# of items (albums)       26,583   31,003   35,385   39,551   44,050   53,636

KASANDR dataset
Day                       5        7        9        11       13       15
# of interactions         184,620  251,395  326,053  390,456  462,196  522,196
# of users                9,526    12,371   15,092   17,345   19,806   24,936
# of suppliers (sellers)  620      634      643      651      659      665
# of items (products)     103,832  132,540  162,656  186,415  213,906  228,537
4 Proposed Method
Intuitively, when we talk about the popularity of a song, for example, we consider
that it becomes popular when many users start listening to it. Thus, most works
in the area consider the number of interactions of an item as a key factor in
defining its popularity.
On the other hand, from a mathematical perspective, when we are facing a
continuous change in time, it is natural to model the problem as a derivative.
To illustrate this, we take as an example the album 21 by Adele, which was
released on January 24, 2011. Figure 1a shows the discrete accumulated inter-
actions N_i^t and their approximation N_i(t), using d = 5 for illustration. On the other
hand, Fig. 1b shows the popularity of this item over time. Although the number of
interactions increases monotonically in time (Fig. 1a), from which one might conclude
that the popularity of this item is constant, Fig. 1b shows that this is not necessarily
true, since the popularity changes over time. Every album has variations in the
speed of growth of its interactions, no matter how small they are. Our popularity
function detects these variations, which can be seen in Fig. 1b with a zoomed
Y axis. The better the fit of N_i(t), the better the popularity
function will be modeled.
Fig. 1. N_i^t, N_i(t) and Pop_i(t) for the album 21 by Adele.
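A minimal sketch of this construction is given below, assuming (as the example suggests) that N_i(t) is a degree-d polynomial fitted to the accumulated interaction counts N_i^t and that Pop_i(t) is taken as its time derivative; the exact definition used by the authors may differ, and the data values are illustrative.

```python
import numpy as np

def popularity_function(timestamps, cumulative_interactions, d=5):
    """Fit a degree-d polynomial N_i(t) to the accumulated interaction counts
    of an item and return Pop_i(t) as its time derivative (hypothetical
    construction based on the description above)."""
    coeffs = np.polyfit(timestamps, cumulative_interactions, deg=d)
    n_i = np.poly1d(coeffs)   # smooth approximation N_i(t)
    pop_i = n_i.deriv()       # Pop_i(t) ~ dN_i(t)/dt
    return n_i, pop_i

# Example: monthly accumulated play counts for one album (toy numbers)
t = np.arange(1, 13)
n_it = np.array([5, 30, 90, 160, 210, 240, 260, 275, 285, 292, 297, 300])
n_i, pop_i = popularity_function(t, n_it)
print(pop_i(6))   # popularity of the item at month 6
```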
Once the popularity functions Pop_i(t) have been obtained for all items i, the
dynamic grouping of items and stakeholders (users and suppliers) is carried out.
To group the items, as in [1], we used the Pareto principle [27].
Given a period of time t, the items are ordered from most to least popular
according to Pop_i(t). The first 20% of these items form group H^t, the
last 20% form group T^t, and the remaining items belong to M^t.
With regard to the grouping of users and suppliers, we compute the simple
average of the popularity of the items they have listened to or created,
respectively. A similar grouping procedure is then carried out, using the same
percentage cuts as for the item grouping. Let W_u(t) be the average popularity
of the items the user u has interacted with up to time t, and P_a(t) be the
popularity of the supplier a at time t; then:

\[
W_u(t) = \frac{\sum_{i \in E_u^t} \mathrm{Pop}_i(t)}{|E_u^t|} \quad (2)
\qquad
P_a(t) = \frac{\sum_{i \in C_a^t} \mathrm{Pop}_i(t)}{|C_a^t|} \quad (3)
\]

where E_u^t is the set of items that the user u interacted with until time t, and
C_a^t is the set of items offered by the supplier a until time t. We will refer to the
groups derived from this procedure as U_1^t, U_2^t, U_3^t for users and A_1^t, A_2^t, A_3^t
for suppliers, where a larger index denotes a more popular group.
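The sketch below illustrates, under the definitions above, how items could be split into H^t, M^t and T^t with the 20/60/20 Pareto cuts and how the averages W_u(t) (Eq. 2) and P_a(t) (Eq. 3) could be computed; all variable names and data are illustrative, not the authors' implementation.

```python
def pareto_groups(pop_at_t):
    """Split items into head (H), mid (M) and tail (T) groups by Pop_i(t)."""
    ranked = sorted(pop_at_t, key=pop_at_t.get, reverse=True)
    cut = max(1, int(0.2 * len(ranked)))
    return set(ranked[:cut]), set(ranked[cut:-cut]), set(ranked[-cut:])

def average_popularity(memberships, pop_at_t):
    """W_u(t) / P_a(t): mean popularity of the items a user interacted with
    (or a supplier offers) up to time t."""
    return {k: sum(pop_at_t[i] for i in items) / len(items)
            for k, items in memberships.items() if items}

# pop_at_t: item -> Pop_i(t); E_t: user -> items; C_t: supplier -> items
pop_at_t = {'i1': 0.9, 'i2': 0.5, 'i3': 0.4, 'i4': 0.1, 'i5': 0.05}
E_t = {'u1': {'i1', 'i4'}, 'u2': {'i2', 'i3'}}
C_t = {'a1': {'i1', 'i2'}, 'a2': {'i4', 'i5'}}

H, M, T = pareto_groups(pop_at_t)
W = average_popularity(E_t, pop_at_t)   # per-user W_u(t)
P = average_popularity(C_t, pop_at_t)   # per-supplier P_a(t)
```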
¹ With respect to UPD, a small modification in the way it is calculated is considered, but it follows the same idea proposed by [1].
Here, we calculate SPD(s) and IPD(c) as proposed in [1], with the differ-
ence that the popularity groups, instead of being defined statically by the number
of interactions, are defined according to the proposed popularity metric,
which yields a time-varying subdivision. In addition, the recommendations
given to a user and the interactions of a user are also treated as time-varying.
Finally, we use a slight variation of the formula for UPD(g) with respect to [1],
keeping the same idea used to calculate SPD and IPD: we average, over the
user popularity groups, the difference between the proportion of recommendations
achieved by a group and the proportion of interactions achieved by that same
group, that is:
\[
q_u(g) = \frac{\sum_{u \in U^t} \sum_{j \in \ell_u^t} \mathbb{1}\left(V(j) \in s\right)}{n \times |U^t|} \quad (9)
\qquad
p_u(g) = \frac{\sum_{u \in U^t} \sum_{j \in E_u^t} \mathbb{1}\left(V(j) \in s\right)}{|E_u^t|} \quad (10)
\]
For these last three metrics, lower values mean that there is a smaller average
difference between the proportion of recommendations and the proportion of
actual interactions achieved per popularity group, so the algorithm is fairer
to the different popularity groups.
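As a rough illustration of this idea (a sketch only; the exact aggregation in Eqs. 9-10 and in [1] may differ), one can compare, for each popularity group, the share of recommendation slots it receives with the share of past interactions it accounts for, and average the absolute differences:

```python
def popularity_deviation(recommended, interacted, group_of):
    """Average absolute gap between the share of recommendations and the
    share of interactions obtained by each popularity group.
    recommended / interacted: lists of items; group_of: item -> group label."""
    groups = set(group_of.values())
    def share(items, g):
        return sum(group_of[i] == g for i in items) / len(items)
    gaps = [abs(share(recommended, g) - share(interacted, g)) for g in groups]
    return sum(gaps) / len(gaps)

group_of = {'i1': 'H', 'i2': 'M', 'i3': 'M', 'i4': 'T'}
print(popularity_deviation(['i1', 'i1', 'i2'], ['i2', 'i3', 'i4'], group_of))
```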
The hyperparameters studied, both for ALS and BPR, were the number of
latent factors (50, 100 and 200) and the regularization parameter (0.01 and 0.1).
In addition, the learning rate was varied for BPR (0.001, 0.01 and 0.1).
First, for the Last.FM dataset, we decided to make 5 recommendations per user,
since a smaller number does not allow us to adequately analyze the capacity of the
algorithms to recommend less popular items, because the most popular options
monopolize all the recommendations. On the other hand, a higher number of
recommendations would not be representative of the context under study, since
very few users have interacted with more than 5 different items, which would
add noise to the metrics.
Second, for the KASANDR dataset, we also decided to make 5 recommendations
per user, following [30].
Then, for each hyperparameter configuration, MAP@5 and nDCG@5 were
calculated for each period. Finally, the average of the metrics over the periods was
obtained in order to select the configuration delivering the highest value.
For Last.FM dataset, the chosen parameters for ALS were 50 latent factors
and 0.1 as a regularization parameter. With this configuration higher values
were obtained in both MAP@5 and nDCG@5. Meanwhile for BPR, the chosen
parameters were 100 latent factors, 0.01 as a regularization parameter and 0.01
as a learning rate parameter.
For KASANDR dataset, the chosen parameters for ALS were 200 latent
factors and 0.1 as a regularization parameter, with higher values obtained in
MAP@5 and nDCG@5. Meanwhile for BPR, the chosen parameters were 200
latent factors, 0.01 as a regularization parameter and 0.001 as a learning rate
parameter.
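A schematic version of this selection procedure is sketched below; train_model and ranking_metrics are hypothetical helpers standing in for the actual ALS/BPR training and MAP@5/nDCG@5 evaluation used in the study.

```python
from itertools import product

def grid_search(algorithm, periods, train_model, ranking_metrics):
    """Pick the hyperparameter configuration with the best average MAP@5 and
    nDCG@5 over all time periods. train_model and ranking_metrics are
    placeholders for the study's training and evaluation routines."""
    factors = [50, 100, 200]
    regs = [0.01, 0.1]
    lrs = [0.001, 0.01, 0.1] if algorithm == 'BPR' else [None]
    best, best_score = None, -1.0
    for f, reg, lr in product(factors, regs, lrs):
        scores = []
        for period in periods:
            model = train_model(algorithm, period, factors=f,
                                regularization=reg, learning_rate=lr)
            m = ranking_metrics(model, period, k=5)   # {'map': ..., 'ndcg': ...}
            scores.append((m['map'] + m['ndcg']) / 2)
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best, best_score = (f, reg, lr), avg
    return best, best_score
```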
6 Results
(Figure: Agg-Div, APL, LC, IPD, SPD and UPD values per month for the Last.FM dataset, comparing the ALS, BPR, Most Popular and Random algorithms.)
(Figure: Agg-Div, APL, LC, IPD, SPD and UPD values per day for the KASANDR dataset, comparing the same algorithms.)
For the APL metric, the best algorithm depends on the epoch
and the dataset. This performance of ALS indicates that it is less unfair than BPR
when recommending unpopular items.
Another aspect worth noting is that, when it comes to IPD, SPD and UPD,
which are metrics that measure unfairness, both algorithms perform much better
than the Most Popular algorithm, which naturally turns out to be very unfair
in prioritizing popular items.
It is interesting to note that ALS is the algorithm that manages to maintain
better metrics over time with respect to Agg-Div, LC, IPD, APL and SPD. This
means that, in general, ALS manages to give higher priority in recommendations
to less popular items compared to BPR from a time-aware perspective.
With respect to the unfairness of recommendations among popularity groups,
the UPD, SPD and IPD metrics show fairly constant values over time in the case
of ALS, which means that its unfairness remains constant. On the other
hand, these metrics show non-linear variations from one month to another for
BPR, which shows that the unfairness of this algorithm may vary over time.
7 Conclusions
The results presented in Sect. 6 demonstrated that popularity bias is not
static in time. This highlights the need to build time-aware recommendations,
since time-agnostic analyses do not give a complete picture of the problem to be
addressed.
Datasets of musical interactions and e-commerce interactions were used as
objects of study, on which recommendations were made considering a sequence
of instants in time. With the results obtained, we conclude that ALS is less
unfair than BPR when recommending unpopular items, since ALS is able to
maintain lower and more stable unfairness metrics over time.
The main difficulty arose from the high computational cost of estimating
the popularity functions for each item, which was overcome by subsampling the
information. This decision did not greatly affect the results of this research, and
a similar analysis can be carried out in the future with better hardware.
References
1. Abdollahpouri, H.: Popularity bias in recommendation: a multi-stakeholder per-
spective. Ph.D. thesis (2020). https://doi.org/10.13140/RG.2.2.18789.63202
2. Abdollahpouri, H., et al.: Multistakeholder recommendation: survey and research
directions. User Model. User-Adap. Inter. 30(1), 127–158 (2020)
3. Abdollahpouri, H., Mansoury, M., Burke, R., Mobasher, B.: The impact of
popularity bias on fairness and calibration in recommendation. arXiv e-prints
arXiv:1910.05755 (2019)
4. Abdollahpouri, H., Mansoury, M., Burke, R., Mobasher, B.: The unfairness of
popularity bias in recommendation, August 2019
5. Adomavicius, G., Tuzhilin, A.: Context-aware recommender systems. In: Ricci, F.,
Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp.
217–253. Springer, Boston, MA (2011). https://doi.org/10.1007/978-0-387-85820-3_7
6. Anelli, V.W., Di Noia, T., Di Sciascio, E., Ragone, A., Trotta, J.: Local popularity
and time in top-N recommendation. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr,
P., Hauff, C., Hiemstra, D. (eds.) ECIR 2019. LNCS, vol. 11437, pp. 861–868.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15712-8_63
7. Baeza-Yates, R.: Bias in search and recommender systems. In: Fourteenth ACM
Conference on Recommender Systems, p. 2 (2020)
8. Bellogı́n, A., Castells, P., Cantador, I.: Statistical biases in information retrieval
metrics for recommender systems. Inf. Retrieval J. 20(6), 604–634 (2017)
9. Boratto, L., Fenu, G., Marras, M.: The effect of algorithmic bias on recommender
systems for massive open online courses. In: Azzopardi, L., Stein, B., Fuhr, N.,
Mayr, P., Hauff, C., Hiemstra, D. (eds.) ECIR 2019. LNCS, vol. 11437, pp. 457–
472. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15712-8_30
10. Boratto, L., Fenu, G., Marras, M.: Connecting user and item perspectives in pop-
ularity debiasing for collaborative recommendation. Inf. Process. Manage. 58(1)
(2021). https://doi.org/10.1016/j.ipm.2020.102387
11. Borges, R., Stefanidis, K.: On mitigating popularity bias in recommendations via
variational autoencoders (2021)
12. Campos, P.G., Dı́ez, F., Cantador, I.: Time-aware recommender systems: a com-
prehensive survey and analysis of existing evaluation protocols. User Model. User-
Adap. Inter. 24(1–2), 67–119 (2014)
13. Castells, P., Hurley, N.J., Vargas, S.: Novelty and diversity in recommender sys-
tems. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Hand-
book, pp. 881–918. Springer, Boston (2015). https://doi.org/10.1007/978-1-4899-7637-6_26
14. Chelliah, M., Zheng, Y., Sarkar, S.: Recommendation for multi-stakeholders and
through neural review mining. In: Proceedings of the 28th ACM International
Conference on Information and Knowledge Management, pp. 2979–2981 (2019)
15. Ekstrand, M.D., et al.: All the cool kids, how do they fit in?: popularity and
demographic biases in recommender evaluation and effectiveness. In: Friedler, S.A.,
Wilson, C. (eds.) Proceedings of the 1st Conference on Fairness, Accountability
and Transparency. Proceedings of Machine Learning Research, vol. 81, pp. 172–
186. PMLR, New York, 23–24 February 2018. http://proceedings.mlr.press/v81/
ekstrand18b.html
16. Fu, Z., et al.: Fairness-aware explainable recommendation over knowledge graphs.
arXiv preprint arXiv:2006.02046 (2020)
17. Garg, D., Gupta, P., Malhotra, P., Vig, L., Shroff, G.: Sequence and time aware
neighborhood for session-based recommendations: STAN. In: Proceedings of the
42nd International ACM SIGIR Conference on Research and Development in Infor-
mation Retrieval, pp. 1069–1072 (2019)
18. Koren, Y.: Collaborative filtering with temporal dynamics. In: Proceedings of the
15th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 447–456 (2009)
19. Kowald, D., Schedl, M., Lex, E.: The unfairness of popularity bias in music recom-
mendation: a reproducibility study. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS,
vol. 12036, pp. 35–42. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_5
20. Liu, W., Guo, J., Sonboli, N., Burke, R., Zhang, S.: Personalized fairness-aware
re-ranking for microlending. In: Proceedings of the 13th ACM Conference on Rec-
ommender Systems, pp. 467–471 (2019)
21. Mena-Maldonado, E., Cañamares, R., Castells, P., Ren, Y., Sanderson, M.: Agree-
ment and disagreement between true and false-positive metrics in recommender
systems evaluation. In: Proceedings of the 43rd International ACM SIGIR Confer-
ence on Research and Development in Information Retrieval, pp. 841–850 (2020)
22. Morik, M., Singh, A., Hong, J., Joachims, T.: Controlling fairness and bias in
dynamic learning-to-rank. arXiv preprint arXiv:2005.14713 (2020)
23. Nguyen, P., Dines, J., Krasnodebski, J.: A multi-objective learning to re-rank app-
roach to optimize online marketplaces for multiple stakeholders. arXiv preprint
arXiv:1708.00651 (2017)
24. Nikolov, D., Lalmas, M., Flammini, A., Menczer, F.: Quantifying biases in online
information exposure. J. Am. Soc. Inf. Sci. 70(3), 218–229 (2019). https://doi.org/
10.1002/asi.24121
25. Pavlovski, M., et al.: Time-aware user embeddings as a service. In: Proceedings
of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp. 3194–3202 (2020)
26. Rodriguez, M., Posse, C., Zhang, E.: Multiple objective optimization in recom-
mender systems. In: Proceedings of the Sixth ACM Conference on Recommender
Systems, pp. 11–18 (2012)
27. Sanders, R.: The pareto principle: its use and abuse. J. Serv. Mark. 1, 37–40 (1987).
https://doi.org/10.1108/eb024706
28. Schedl, M.: The LFM-1B dataset for music retrieval and recommendation. In:
ICMR (2016). https://doi.org/10.1145/2911996.2912004
29. Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with
python. In: 9th Python in Science Conference (2010)
30. Sidana, S., Laclau, C., Amini, M.R., Vandelle, G., Bois-Crettez, A.: KASANDR:
a large-scale dataset with implicit feedback for recommendation. In: Proceedings
of the 40th International ACM SIGIR Conference on Research and Development
in Information Retrieval, pp. 1245–1248 (2017)
31. Stewart, J.: Calculus: Early Transcendentals. Cengage Learning (2010)
32. Wang, S., Gong, M., Li, H., Yang, J.: Multi-objective optimization for long tail
recommendation. Knowl.-Based Syst. 104, 145–155 (2016)
33. Xiang, L., Yang, Q.: Time-dependent models in collaborative filtering based recom-
mender system. In: 2009 IEEE/WIC/ACM International Joint Conference on Web
Intelligence and Intelligent Agent Technology. vol. 1, pp. 450–457. IEEE (2009)
34. Xiao, L., Min, Z., Yongfeng, Z., Zhaoquan, G., Yiqun, L., Shaoping, M.:
Fairness-aware group recommendation with pareto-efficiency. In: Proceedings of
the Eleventh ACM Conference on Recommender Systems, pp. 107–115 (2017)
35. Yuan, Q., Cong, G., Ma, Z., Sun, A., Thalmann, N.: Time-aware point-of-interest
recommendation. In: Proceedings of the 36th International ACM SIGIR Confer-
ence on Research and Development in Information Retrieval, pp. 363–372 (2013)
36. Zhang, Y., Zheng, Z., Lyu, M.R.: WSPred: a time-aware personalized QoS predic-
tion framework for web services. In: 2011 IEEE 22nd International Symposium on
Software Reliability Engineering, pp. 210–219 (2011)
37. Zheng, Y., Pu, A.: Utility-based multi-stakeholder recommendations by multi-
objective optimization. In: 2018 IEEE/WIC/ACM International Conference on
Web Intelligence (WI), pp. 128–135. IEEE (2018)
When Is a Recommendation Model
Wrong? A Model-Agnostic Tree-Based
Approach to Detecting Biases
in Recommendations
1 Introduction
Recommendation systems are typically optimized to improve a global objective
function, such as the error between the real and predicted user action. How-
ever, these approaches result in optimization for the mainstream trends while
minority preference groups, as well as those interested in niche products, are not
represented well. Given a lack of understanding of the dataset characteristics
and insufficient diversity of represented individuals, such approaches inevitably
lead to amplifying hidden data biases and existing disparities. While the prob-
lem of system fairness has recently attracted much research interest, most of
the works are based on analyzing a single dimension selected a-priori, such as a
sensitive user attribute or a protected category of products. However, it is not
clear how to identify which groups should be protected, and different types of
recommendation algorithms are prone to different vulnerabilities.
Moreover, bias is often caused by certain combinations of circumstances
rather than a single feature. For instance, a recommender may propagate a gen-
der bias from the training data by under-estimating the preferences of female
students for technical courses [30]. Another challenge for recommendation sys-
tems is to consider temporal shifts of interests, which may be due to season-
ality. For example, a news recommendation system may consider that certain
football-related articles were popular last week, and may assign similar articles
appearing this week high probability scores. However, the popularity could be
due to a match that was taking place, and so the probabilities for this week will
be overestimated. There are several other situations when the suggested recom-
mendations do not work out well, for example when some people have unusual
tastes or there is a scarcity of feedback [1,8]. In all such situations, it may be
difficult to identify the protected attributes a priori. We address this problem in
our research by identifying the particular circumstances in which a recommender
system's predictions are wrong.
2 Related Work
In this section, the main challenges of current recommendation systems are
discussed, and an overview of related work concerning recommendation fair-
ness and generating explanations for recommendation algorithms is presented.
Current recommendation systems face many tough problems when dealing with
real-world data, thereby degrading their performance in various situations. For
instance, data sparsity is a critical problem for many online recommender sys-
tems, as often there is not enough information to make predictions for a given
user or a new item. The cold-start problem is especially troublesome for collab-
orative filtering (CF) approaches, which recommend items to a user based on
the ratings of other users who have similar tastes.
Moreover, data collected from the web is prone to different types of biases
related to user demography, nationality, and language [3,6,15,19,29]. More gen-
erally, a small number of influential users may end up having a massive impact
on the recommendations of other users [12]. Moreover, grey- and black-sheep
users have unusual tastes and thus have few or no similar users; therefore, the
CF methods fail to find adequate recommendations for them [23,27]. Two other
types of biases [3] are the presentation and the position biases, which are related
to how particular items are displayed to a user on a website: better-positioned
products have a higher probability of being selected than those that are not so
prominently visible. Moreover, as noticed by [22], offline recommender systems
face many challenges in dynamic domains such as online services due to user
interest shift and dynamic popularity trends [2,7]. It has been found that user
preferences are not constant but are influenced by temporal factors such as the
time of the day, the day of the week, or the season.
All these situations can lead to generating irrelevant recommendations due
to systematic biases and disparities. In our research, we aim to automatically
detect such disparities to enable taking corrective actions.
Though our work also applies a tree-based post-hoc model, we aim at
approximating the model errors, whereas [26] focuses on evaluating how well
the post-hoc model approximates the original ranking. Thus, we use a similar
model to solve a different type of problem, and therefore these approaches are not
directly comparable.
3 Problem Definition
The recommendation algorithms are usually trained to predict the score for a
user-item pair, where the target may be given explicitly (user-item rating val-
ues, for example) or specified implicitly (a purchase or a click, for example). As
pointed out in the previous section, these predictions may be inaccurate for vari-
ous reasons, depending on the domain, the data characteristics and the algorithm
at hand. Moreover, the disparities may be caused by a complex combinations
of factors rather than a single dimension. Hence, our goal in this research is to
identify those situations where many systematic recommendation errors occur
regularly.
In most standard recommendation algorithms, the model parameters are
adjusted based on a global optimization function. For instance, for the popular
model-based matrix factorization (MF) collaborative filtering algorithms, learn-
ing is performed by minimizing a global objective function, which is defined as
the sum of the differences between the real r and the predicted r̂ ratings in
the training set for all the data samples. Hence, the model weights are adjusted
to optimize the predictions for the majority of the cases or the most impactful
instances. As shown by [5], this leads to underfitting for some unusual situations
such as specific interest groups and niche products, and results in a long-tailed
distribution of the prediction errors.
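For concreteness, such a global objective is typically of the squared-error form shown below (regularization terms omitted); this is the standard formulation and may differ in detail from the specific models compared later in the paper.

\[
\min_{\theta} \; \sum_{(u,i) \in \mathcal{D}_{\text{train}}} \big(r_{u,i} - \hat{r}_{u,i}(\theta)\big)^{2}
\]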
Another example of unequal error distribution is the standard k-Nearest
Neighbors (KNN) collaborative filtering algorithm. In this case, the predicted
rating is estimated based on the similarity between the user or the item vectors.
Hence these predictions may be inaccurate for users who have a low similarity
to others (grey sheep), or those with very few ratings (the cold-start problem).
Moreover, when tuning the model hyper-parameters (such as the number of
neighbors for KNN or the vector size for MF), the performance of the model on
the validation set is estimated with an average error measure, such as absolute or
squared error of all the examples. Though the optimization is performed globally,
the model may underperform in certain situations, such as for particular types
of users or items. As the average error does not carry any information about the
metric’s distribution, these hidden biases may remain unnoticed.
In this research, our goal is to detect those situations where the recom-
mender introduces such systematic biases: that is, when the average error for
some combinations of metadata is significantly higher than for the other cases.
To define this problem more formally, let r_{u,i} ∈ R be the rating of a user u
for an item i, let x^{ui} = (x_1^{ui}, ..., x_K^{ui}) ∈ X be the attributes associated with
this rating (a vector of user and item features), and let r̂_{u,i} be the model's pre-
diction for this rating. Then, T_k = {k_1, ..., k_N} is the set of possible values
for an attribute x_k (for instance, for the feature genre the possible values are
T_genre = {crime, comedy, thriller}, and so on). Next, we define the average error
over a subset of instances:

\[
e_{X_{k_n, l_m, \ldots}} = \frac{1}{N} \sum_{n=1}^{N} e(r_{u,i}, \hat{r}_{u,i}), \qquad x^{ui} \in X_{k_n, l_m, \ldots}
\]
We want to detect the combinations of user and item attributes such that
the average error for each of these combinations is significantly higher than for
the other cases.
For instance, if the prediction error is higher for female users and thriller movies,
this set may contain a combination (genderf emale , genrethriller ). We are inter-
ested in detecting such situations for both the training error (when the model
underfits for some combinations of features such as the cold start problem or
the grey sheep) and the test set (if the model was fitted on the train set and the
characteristics of the new data differ from the training set, for instance, due to
the interest shift or seasonality).
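As a simple illustration of the quantity e_{X_{k_n,l_m,...}} defined above, the per-combination average error can be computed directly with pandas; the column names and values below are illustrative only.

```python
import pandas as pd

# Each row: one rating, its attributes, and the model's absolute error
df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'genre':  ['thriller', 'comedy', 'thriller', 'comedy', 'thriller'],
    'abs_error': [1.2, 0.4, 0.5, 0.3, 1.1],
})

# Average error e_X for every (gender, genre) combination, sorted descending
per_combo = (df.groupby(['gender', 'genre'])['abs_error']
               .agg(['mean', 'count'])
               .sort_values('mean', ascending=False))
print(per_combo)
```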
4 Proposed Approach
To identify the situations where the recommender results are systematically
biased, we split off the instances for which the error is significantly higher than
for the remaining group. We fit a Chi-square Automatic Interaction Detection
(CHAID) tree for regression, a well-established statistical method pro-
posed by Gordon V. Kass in 1980. We fit the tree model with the inputs x^{ui}
as independent variables and the error e(r_{u,i}, r̂_{u,i}) as the predicted dependent
variable. In each iteration, the CHAID algorithm cycles through all attributes
k, and the pairs of values k_j, k_l ∈ T_k with no significant difference (p > α_merge)
in error distribution are merged into one category k_{j,l}. Then, the next tree split
is performed on the attribute k with the most significant difference (the small-
est adjusted p-value) with respect to the predicted error value. This procedure is
repeated recursively until no further splits can be made (p > α_split for all nodes
or the maximum tree depth is reached).
Since the errors are continuous values that do not follow a normal distribution,
the median Levene's test [9] (Brown–Forsythe test) is used for assessing the
statistical significance of the differences in error distribution between groups.
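A minimal sketch of one split-selection step under these assumptions is shown below, using scipy's Levene test with median centering (the Brown–Forsythe test); this illustrates the idea and is not the authors' implementation, and the toy data are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

def best_split(df, attributes, target='abs_error', alpha_split=0.01, min_leaf=100):
    """Return the attribute whose categories differ most significantly in the
    error distribution (median-centered Levene, i.e. Brown-Forsythe test)."""
    best_attr, best_p = None, alpha_split
    for attr in attributes:
        samples = [g[target].values for _, g in df.groupby(attr) if len(g) >= min_leaf]
        if len(samples) < 2:
            continue
        _, p = levene(*samples, center='median')
        if p < best_p:
            best_attr, best_p = attr, p
    return best_attr, best_p

# Toy usage: errors have a larger spread for one gender
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'gender': rng.choice(['F', 'M'], 1000),
    'genre': rng.choice(['thriller', 'comedy'], 1000),
})
df['abs_error'] = rng.exponential(0.5 + 0.5 * (df['gender'] == 'F'), 1000)
print(best_split(df, ['gender', 'genre']))
```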
5 Experimental Validation
1. Value unfairness (Fig. 1A)—the predictions for certain situations are system-
atically underestimated: μb < 0. In this example, the predictions for female
users and thriller movies are systematically lower than the real ratings, which
leads to a higher error on the test set (σ = 0.1 for all cases).
\[
\mu = \begin{cases} -0.5 & \text{if gender} = F \text{ and genre} = \text{thriller} \\ 0 & \text{otherwise} \end{cases}
\]
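In other words, the synthetic errors for this scenario could be generated roughly as follows (a sketch of the setup described above, not the authors' exact generator):

```python
import numpy as np

rng = np.random.default_rng(42)

def biased_prediction_offset(gender, genre, sigma=0.1):
    """Simulated prediction offset: systematic under-estimation (mu = -0.5)
    for female users and thriller movies, unbiased noise otherwise."""
    mu = -0.5 if (gender == 'F' and genre == 'thriller') else 0.0
    return rng.normal(mu, sigma)

print(biased_prediction_offset('F', 'thriller'))
print(biased_prediction_offset('M', 'comedy'))
```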
The user and item features used to search for potential biases include the user's gender and activity level and the item's genre, production year, and popularity.
The continuous variables are discretized into three equal-sized buckets based
on the sample quantiles. We set α = 0.01 (for both the merge and the split condi-
tions) to detect significant differences in error distribution. To avoid over-fitting
or selecting too specific rules that would be hard to interpret, the minimum
number of samples in a leaf node is set to 100.
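The discretization step can be done, for instance, with pandas' quantile-based binning; the column name and values below are illustrative.

```python
import pandas as pd

ratings = pd.DataFrame({'user_activity': [3, 5, 8, 12, 20, 35, 40, 55, 80]})

# Three equal-sized buckets based on the sample quantiles
ratings['user_activity_bucket'] = pd.qcut(ratings['user_activity'], q=3,
                                          labels=['low', 'medium', 'high'])
print(ratings)
```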
Results. First, the maximum and minimum detected values of error in leaves
for different node depths are compared. As presented in Fig. 2, the detected
minimum and maximum values are more extreme at higher depths of the tree.
This indicates that our proposed method can detect more severe disparities than
the single-attribute approaches that are most often used for detecting algorithmic
biases.
Fig. 2. Minimum (A) and maximum (B) test-error values at different tree levels for
the compared CF algorithms.
While the minimum leaf error is smaller at higher tree depths, the maximum
detected leaf error increases for the first 2 or 4 levels and then remains constant.
This may mean that either the further splits did not result in any significant
differences, or the detected samples were too small. Hence, in further analysis
we limit the depth of trees to 3 levels.
The results for the compared recommendation algorithms in terms of global
MAE and detected biased leaves are presented in Table 2 and Fig. 3. For all
the models, the highest test error was identified for a combination of user, item
and activity-based features. This observation supports the conclusion that our
proposed approach identifies the roots of discrimination more accurately than an
analysis based on a single type of attribute. We observe that while Slope One and
SVD yield the smallest error on the test set, SVD and Co-Clustering show the
least tendency to discriminate particular feature combinations. Co-Clustering
has the least difference between the largest (0.851) and the smallest (0.611)
test error in leaves with a standard deviation of 0.078. On the contrary, KNN
has the least error on the training set and the largest error on the test set.
An analysis of the error distribution in leaves shows that it overfits only for some
feature combinations, as indicated by the gap between the largest (0.797) and the
smallest (0.283) leaf error on the training set. The largest differences are detected
for the combination of all features. Similarly, Fig. 3 shows that NMF clearly overfits
for some attribute combinations, as the error for these leaves is close to zero.
Fig. 3. (A) Global MAE for the training and the test sets for different recommendation
algorithms. (B) MAE aggregated for each leaf in the bias-detection tree.
Fig. 4. Distribution of the test error for detected tree leaves with the maximum depth
limited to 2, and tree structures for the two CF algorithms—(A) NMF, (B) KNN. The
probability density (Y axis) of the absolute error of the model (X axis) is estimated
with KDE.
Figure 4 shows the constructed bias trees (with the maximum depth limited
to 2 for better readability) and the test error distributions for each detected leaf
for the NMF and KNN models.
Table 2. Global MAE and results for the maximum and minimum detected tree nodes
for the compared CF algorithms.

Model         | Global error  | Tree       | Max leaf error | Min leaf error | Std of leaf errors
              | Train | Test  | attributes | Train | Test   | Train | Test   | Train | Test
Slope One     | 0.653 | 0.749 | Activity   | 0.710 | 0.851  | 0.335 | 0.725  | 0.138 | 0.054
              |       |       | All        | 0.710 | 0.924  | 0.291 | 0.640  | 0.133 | 0.078
              |       |       | Item       | 0.752 | 0.902  | 0.487 | 0.640  | 0.082 | 0.074
              |       |       | User       | 0.712 | 0.799  | 0.630 | 0.730  | 0.035 | 0.036
KNN           | 0.525 | 0.778 | Activity   | 0.671 | 1.044  | 0.318 | 0.733  | 0.142 | 0.138
              |       |       | All        | 0.797 | 1.044  | 0.283 | 0.622  | 0.148 | 0.121
              |       |       | Item       | 0.666 | 0.910  | 0.330 | 0.571  | 0.087 | 0.099
              |       |       | User       | 0.649 | 0.968  | 0.498 | 0.743  | 0.055 | 0.090
Co-Clustering | 0.694 | 0.764 | Activity   | 0.700 | 0.825  | 0.691 | 0.745  | 0.006 | 0.037
              |       |       | All        | 0.770 | 0.851  | 0.594 | 0.611  | 0.049 | 0.078
              |       |       | Item       | 0.897 | 0.824  | 0.594 | 0.678  | 0.092 | 0.057
              |       |       | User       | 0.776 | 0.804  | 0.668 | 0.751  | 0.043 | 0.037
NMF           | 0.604 | 0.761 | Activity   | 0.648 | 0.907  | 0.163 | 0.733  | 0.171 | 0.077
              |       |       | All        | 0.673 | 0.918  | 0.134 | 0.642  | 0.162 | 0.100
              |       |       | Item       | 0.679 | 0.804  | 0.519 | 0.654  | 0.056 | 0.063
              |       |       | User       | 0.650 | 0.825  | 0.554 | 0.747  | 0.035 | 0.040
SVD           | 0.674 | 0.750 | Activity   | 0.738 | 0.796  | 0.663 | 0.730  | 0.032 | 0.035
              |       |       | All        | 0.792 | 0.977  | 0.512 | 0.603  | 0.080 | 0.104
              |       |       | Item       | 0.782 | 0.870  | 0.512 | 0.603  | 0.073 | 0.083
              |       |       | User       | 0.746 | 0.797  | 0.650 | 0.731  | 0.037 | 0.030
For the NMF recommender, the main difference is
detected with respect to the movies’ production year so that the error is larger
for the latest movies and less active users. For older movies, the error is larger
for female users; hence, this group may be less satisfied with the recommen-
dations. The least error is detected for older movies and male users—possibly,
the model overfits this group. For KNN, the first split is performed based on
the item popularity. More precisely, the largest error is for the least frequent
items (indicating the item cold-start problem) and female users. The next level
of split considers the movie year—the error for popular movies is smaller for the
older movies compared to the newer ones. While some attributes that determine
significant differences in error distribution are common for both the algorithms
(for instance, the item popularity and the user gender), it can also be seen that
the reasons for discrimination are different. In particular, the user-based KNN
approach seems to be particularly sensitive to the item popularity bias and the
cold-start problem, while some types of items and users are not modeled well by
NMF.
6 Conclusions
The distribution of error may differ dramatically for different combinations of
attributes. Hence, globally-optimized approaches may result in severe dispar-
ities and unfair functioning of recommender systems. To address this problem,
we presented a model-agnostic approach to detecting possible biases and dispar-
ities in the recommendation systems that may result from a combination of user
and item attributes. The results on a real-world movie recommendation dataset
show that our proposed method can identify severe disparities for certain feature
combinations that are missed by the single-attribute approaches most often used
for analyzing the recommender fairness.
In the future, we plan to incorporate debiasing techniques in our approach,
such as focused learning [5] or fairness objective [30]. Moreover, we plan to
apply our method to enhance the design process of hybrid approaches to better
address the diversity of users and items. We are also experimenting with other
algorithms and datasets from different domains to analyze potential algorithmic
biases in various other recommendation settings. Finally, we plan to incorporate
additional attributes such as the recommendation context to identify other types
of biases such as presentation and position bias.
References
1. Abdollahpouri, H., Mansoury, M., Burke, R., Mobasher, B.: The connection
between popularity bias, calibration, and fairness in recommendation. In: Four-
teenth ACM Conference on Recommender Systems, pp. 726–731. RecSys 2020,
Association for Computing Machinery, New York (2020). https://doi.org/10.1145/
3383313.3418487
2. Anelli, V.W., Di Noia, T., Di Sciascio, E., Ragone, A., Trotta, J.: Local popu-
larity and time in top-n recommendation. In: Azzopardi, L., Stein, B., Fuhr, N.,
Mayr, P., Hauff, C., Hiemstra, D. (eds.) Adv. Inf. Retrieval, pp. 861–868. Springer
International Publishing, Cham (2019)
3. Baeza-Yates, R.: Bias on the web. Communications of the ACM 61, 54–61 (2018).
https://doi.org/10.1145/3209581
4. Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning (2019).
http://www.fairmlbook.org
5. Beutel, A., Chi, E.H., Cheng, Z., Pham, H., Anderson, J.: Beyond globally opti-
mal: focused learning for improved recommendations. In: Proceedings of the 26th
International Conference on World Wide Web, pp. 203–212. WWW 2017, Interna-
tional World Wide Web Conferences Steering Committee, Republic and Canton of
Geneva, CHE (2017). https://doi.org/10.1145/3038912.3052713
6. Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to
computer programmer as woman is to homemaker? debiasing word embeddings. In:
Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances
in Neural Information Processing Systems 29, pp. 4349–4357. Curran Associates,
Inc. (2016)
7. Boratto, L., Fenu, G., Marras, M.: The effect of algorithmic bias on recommender
systems for massive open online courses. In: Azzopardi, L., Stein, B., Fuhr, N.,
Mayr, P., Hauff, C., Hiemstra, D. (eds.) Adv. Inf. Retrieval, pp. 457–472. Springer
International Publishing, Cham (2019)
8. Boratto, L., Fenu, G., Marras, M.: Connecting user and item perspectives
in popularity debiasing for collaborative recommendation. Inf. Proc. Manage.
58(1), 102387 (2021). https://www.sciencedirect.com/science/article/pii/S030645
7320308827
9. Brown, M.B., Forsythe, A.B.: Robust tests for the equality of variances. J.
Am. Stat. Assoc. 69(346), 364–367 (1974). https://doi.org/10.1080/01621459.
1974.10482955. https://www.tandfonline.com/doi/abs/10.1080/01621459.1974.10
482955
10. Burke, R., Sonboli, N., Ordonez-Gauger, A.: Balanced neighborhoods for multi-
sided fairness in recommendation. In: Friedler, S.A., Wilson, C. (eds.) Proceedings
of the 1st Conference on Fairness, Accountability and Transparency. Proceedings
of Machine Learning Research, vol. 81, pp. 202–214. PMLR, New York, NY, USA,
23–24 Feb 2018. http://proceedings.mlr.press/v81/burke18a.html
11. Deldjoo, Y., Anelli, V.W., Zamani, H., Bellogin, A., Di Noia, T.: Recommender
systems fairness evaluation via generalized cross entropy. In: Proceedings of the
2019 ACM RecSys Workshop on Recommendation in Multistakeholder Environ-
ments (RMSE) (2019)
12. Eskandanian, F., Sonboli, N., Mobasher, B.: Power of the few: analyzing the impact
of influential users in collaborative recommender systems. In: Proceedings of the
27th ACM Conference on User Modeling, Adaptation and Personalization, pp.
225–233. UMAP 2019, Association for Computing Machinery, New York, NY, USA
(2019). https://doi.org/10.1145/3320435.3320464
13. Gajane, P., Pechenizkiy, M.: On formalizing fairness in prediction with machine
learning (2017)
14. George, T., Merugu, S.: A scalable collaborative filtering framework based on co-
clustering. In: Proceedings of the Fifth IEEE International Conference on Data
Mining, pp. 625–628. ICDM 2005, IEEE Computer Society, USA (2005). https://
doi.org/10.1109/ICDM.2005.14
15. Graells-Garrido, E., Lalmas, M., Menczer, F.: First women, second sex: Gender
bias in wikipedia. CoRR abs/1502.02341 (2015). http://arxiv.org/abs/1502.02341
16. Guidotti, R., Monreale, A., Turini, F., Pedreschi, D., Giannotti, F.: A survey of
methods for explaining black box models. CoRR abs/1802.01933 (2018). http://
arxiv.org/abs/1802.01933
17. Harper, F.M., Konstan, J.A.: The movielens datasets: history and context.
ACM Trans. Interact. Intell. Syst. 5(4), 1–19 (2015). http://doi.acm.org/10.1145/
2827872
18. Hug, N.: Surprise: a python library for recommender systems. J. Open Source
Softw. 5(52), 2174 (2020). https://doi.org/10.21105/joss.02174
19. Islam, A.C., Bryson, J.J., Narayanan, A.: Semantics derived automatically from
language corpora necessarily contain human biases. CoRR abs/1608.07187 (2016).
http://arxiv.org/abs/1608.07187
20. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix fac-
torization. Nature 401, 788–791 (1999)
21. Li, J., Sun, L., Wang, J.: A slope one collaborative filtering recommendation algo-
rithm using uncertain neighbors optimizing. In: Wang, L., Jiang, J., Lu, J., Hong,
L., Liu, B. (eds.) Web-Age Inf. Manage., pp. 160–166. Springer, Berlin,
Heidelberg (2012). https://doi.org/10.1007/978-3-642-28635-3_15
22. Li, L., Wang, D., Li, T., Knox, D., Padmanabhan, B.: Scene: A scalable two-stage
personalized news recommendation system. In: Proceedings of the 34th Inter-
national ACM SIGIR Conference on Research and Development in Information
Retrieval. pp. 125–134. SIGIR 2011, ACM, New York, NY, USA (2011). https://
doi.org/10.1145/2009916.2009937
23. McCrae, J., Piatek, A., Langley, A.: Collaborative filtering (2004). http://www.
imperialviolet.org
24. Salakhutdinov, R., Mnih, A.: Probabilistic matrix factorization. In: Advances in
Neural Information Processing Systems, vol. 20 (2008)
25. Sánchez, P., Bellogı́n, A.: Attribute-based evaluation for recommender systems:
Incorporating user and item attributes in evaluation metrics. In: Proceedings of
the 13th ACM Conference on Recommender Systems, pp. 378–382. RecSys 2019,
Association for Computing Machinery, New York, NY, USA (2019). https://doi.
org/10.1145/3298689.3347049
26. Singh, J., Anand, A.: Posthoc interpretability of learning to rank models using
secondary training data. CoRR abs/1806.11330 (2018). http://arxiv.org/abs/1806.
11330
27. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Adv. in
Artif. Intell. 2009 (2009). https://doi.org/10.1155/2009/421425
28. Tintarev, N., Masthoff, J.: Designing and evaluating explanations for rec-
ommender systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.)
Recommender Systems Handbook, pp. 479–510. Springer, Boston,
MA (2011). https://doi.org/10.1007/978-0-387-85820-3_15
29. Tsintzou, V., Pitoura, E., Tsaparas, P.: Bias disparity in recommendation systems.
CoRR abs/1811.01461 (2018). http://arxiv.org/abs/1811.01461
30. Yao, S., Huang, B.: Beyond parity: Fairness objectives for collaborative filtering.
In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp.
2921–2930. Curran Associates, Inc. (2017)
31. Zhang, Y., Chen, X.: Explainable recommendation: A survey and new perspectives.
CoRR abs/1804.11192 (2018). http://arxiv.org/abs/1804.11192
Examining Video Recommendation Bias
on YouTube
1 Introduction
YouTube is among the most popular social media platforms. It is a common source of
news consumption. The platform is one of the most popular hubs of content production,
monetization, and cross-platform dissemination. Nevertheless, YouTube is not exempt
from the dynamics that lead to the generation, dissemination, and wide-scale consump-
tion of misinformation, disinformation, hate speech, and many other content types with
ramifications for individuals and society. As a result, the platform has been subject to
public debate in recent years, receiving journalistic and scholarly attention.
Among the subjects of the ongoing debate are the potential roles played by YouTube’s
recommendation, personalization, and ranking algorithms in the spread and influence
of harmful content, as well as the formation of echo-chambers and polarization [9, 12,
14, 17–20]. However, most of the evidence of the alleged effects is anecdotal and is
often acquired through journalistic investigations. Although the number and scope of
the systematic scholarly studies gradually increase, currently, the literature lacks an
established understanding of whether the YouTube algorithm exacerbates the given set
of problems.
This study aims to explore whether the YouTube recommendation algorithm pro-
duces biased results regardless of personalization, user entry point, or the topical and
categorical differences between the videos watched. Such an approach would enable
generalized assessments and measurement of the emergent bias of recommendations.
Despite recent studies on different aspects of the problem, the current literature lacks
comprehensive and systematic evaluations of whether the video recommendation system
on the platform tends to be biased in favor of a small number of videos in generalizable
ways. Besides, if bias exists, how should it be characterized?
This study aims to approach the problem from a graphical probabilistic perspective,
focusing on the recommendation graph structures and node-centric probabilistic distri-
butions. We operationalize our approach using PageRank computations and studying the
structural properties of recommendation graphs vis-à-vis the PageRank values’ distribu-
tion. We contend that exploration and characterization of emergent algorithmic biases in
YouTube’s video recommendation system using the mentioned graph analysis approach
would significantly contribute to the relevant literature. In particular, node-centric proba-
bilistic analysis of potential emergent biases in recommender systems will be applicable
in the fairness of recommender systems, patterns of content consumption, information
diffusion, echo-chamber formation, and other significant problems. By evaluating the
structural tendencies of bias in YouTube recommendation graphs, we aim to contribute to
the given interdisciplinary area regarding implicit and structural biases, feedback loops,
and complex networks across social media platforms.
This work continues as follows. Section 2 includes a brief review of the recent liter-
ature on research problems associated with the YouTube algorithm and graph analytical
approaches to evaluate and characterize recommendation systems. Section 3 describes
our data collection approach and the underlying rationale for the diversification of
datasets. A short introduction to the operationalization and PageRank computations
are described further along with the main findings in Sect. 4. As demonstrated in mul-
tiple steps, the results show an emergent structural bias exacerbating the probability of
being viewed for a small fraction of the videos in the empirical recommendation graphs.
Further, we briefly discuss the potential implications, qualitative interpretations, and
limitations of the study.
This paper examines two primary research questions: whether the video recommendation system tends to be biased in favor of a small number of videos in generalizable ways and, if such bias exists, how it should be characterized.
2 Related Work
This section will present a discussion of recent and relevant literature on the YouTube
recommendation algorithm and associated set of problems, the use of graph analytical
approaches to detect and evaluate bias in recommender systems, and the relevant foun-
dational concepts about bias in search and recommendation. It will also introduce the
rationale, overall approach, and the methods we employed in this study.
Several studies examined YouTube’s video recommendation, search, and personal-
ization systems in recent years. One of the most frequent themes in recent literature
has been the radicalization of content and users, examining whether recommendations
lead people to more radicalized or extreme content over time. Others examined potential
biases in recommended videos regarding demographic factors such as gender, socioeco-
nomic status, and political affiliation. Besides, a few recent studies suggest that bias in
YouTube’s recommendations can be understood and characterized through topological
analysis of recommendation networks. Broadly, the analysis, evaluation, characteriza-
tion, and mitigation of bias in search and recommendation systems constitute a growing
interdisciplinary literature. Popularity bias and feedback loops have been associated with
unfair recommendations in group and individual levels, impacting the user experience
as well as “diversity, novelty, and relevance” of recommended items [21].
Some researchers have argued that the “problem of radicalization on YouTube” has become (or will become) obsolete
[9]. Roth et al. [19] focus on the concept of “filter bubbles” or the “confinement” effect that
allegedly exists on the YouTube platform and measure view diversity. Using a random
walk strategy, they focus on confinement dynamics and analyze the video recommenda-
tion system’s workings. The researchers found that recommendations tended to lead to
topical confinement, which seemed “to be organized around sets of videos that garner
the highest audience” [19].
While most studies in the domain focused on YouTube’s algorithm and its impact
on viewers, Buntain et al. [6] looked at the cross-platform aspect and analyzed videos
of conspiracy theories hosted on YouTube and shared across social media platforms
(Twitter and Reddit) in the eight months following YouTube’s announcement of their
attempts to proactively curtail such content.
Search and recommendation systems are increasingly intertwined. Current search and
recommendation algorithms rely on historical data, user data, personalization, and a few
other information sources [5]. Overall, detection and mitigation of bias in “automated
decisions” are growing interdisciplinary research areas [21]. Biases emanating from
the historical data or emergent properties of the recommendation systems may some-
times exacerbate societal problems. Thus, evaluation of video recommendation biases on
YouTube relates to the overarching interdisciplinary effort to understand, characterize,
and prevent bias and fairness issues in search and recommendation systems. As one of
the more complex case studies, an enhanced understanding of the mechanics and struc-
tural properties of bias on YouTube would provide significant insights to information
retrieval research and other relevant disciplines.
One of the specific research themes in search and recommendation systems is the
“popularity bias.” In general, fairness, accuracy, or coverage problems arise when popu-
lar items are ranked higher and recommended more, especially when this tendency cre-
ates a feedback loop in which popular items become more popular and are therefore
recommended even more frequently over time. Abdollahpouri et al. [1] focus on this
problem and suggest a “learning-to-rank” algorithm that includes less popular items in
recommendations, with a specific demonstration of protecting accuracy while enhancing the coverage
of recommendations.
Bias also relates to potential inefficiencies in the evaluation of recommendation
performance, accuracy, and effectiveness. Bellogín et al. [3] examine “statistical biases”
in “metrics for recommender systems,” with a particular focus on the biases that distort
the evaluation of the performance of recommender systems. In a novel and experimental
research design, they contribute to understanding popularity bias, especially by focusing
on potential ways to improve recommender systems’ evaluation. Beutel et al. [4] focus
on the recommender systems as pairwise applications and propose a series of metrics
to assess fairness risks in real-world applications. Beutel et al. also offer a method for
“unbiased measurements of recommender system ranking fairness.”
Verma et al. [21] survey relevant papers from several conferences about fairness in
search and recommendation systems and evaluate 22 recent papers that propose new
fairness metrics and models for various application cases. They categorize the proposed
methods and recent literature in five different categories, including “non-personalized
recommendation setting, crowd-sourced recommendation setting, personalized recom-
mendation setting, online advertisements, and marketplace,” while also emphasizing the
distinction between three aspects of fairness and its evaluation: “diversity, novelty, and
relevance” [21].
As mentioned above, there has been a growing focus on radicalization, extremism,
misinformation, and other societal issues concerning how the YouTube algorithm leads
its users vis-à-vis such problematic content. Nevertheless, the evaluation of video rec-
ommendation bias on YouTube should extend beyond such specific problem areas and
start with the platform’s search and recommendation systems’ structural and inherent
characteristics. Referring to the broad survey and classification of Verma et al. [21],
understanding recommendation bias on YouTube may relate to dimensions of diversity,
novelty, and relevance of recommendations. On the other hand, YouTube presents a
non-trivial case study to understand and evaluate the recommendation system, primar-
ily due to the constant evolution of historical data, repetitive updates of the algorithm,
and challenges in acquiring sufficient, diversified, and meaningful datasets for research.
This study proposes a graph analysis approach and focuses on distributions of PageRank
scores of videos in recursively crawled recommendation networks. This study’s primary
objective is to examine whether the platform tends to impose bias in favor of a small
set of videos in real-world scenarios and whether those algorithmic tendencies can be
generalized. We contend that for systematic prediction, detection, and mitigation of such
problems on YouTube and similar platforms, understanding the recommender system’s
structural properties and emergent behavioral patterns is a requirement, which would
potentially precede future models and mitigation efforts.
Our approach is based on the assumption that we should first capture the recommendation
algorithm’s general behavioral patterns as they tend to apply to a diverse set of conditions
across the entire platform. In our opinion, personalization plays a role in the evolution of
recommendations to a specific logged-in user. The platform also utilizes the user’s web
traces through cookies. Nevertheless, a system-level, structural analysis that pays special
attention to the node-specific probabilistic computations may help to understand the
examined system’s inherent properties. Therefore, we posit that we can fairly examine
recommendation algorithm bias by excluding the user personalization aspect of the
YouTube platform.
Data collection was conducted using the YouTube Tracker tool1 [15] and the YouTube
Data API [11]. Videos were retrieved based on a keyword search or by using a list
of seed videos as starting points. We then used lists of “related videos” and collected
the details about other videos recommended by the YouTube algorithm. These videos
would typically appear on YouTube under the “Up Next” heading or in the list of the
recommended videos. We determined that two consecutive hops of crawling would
enable a sizable amount of data for each experiment while also allowing us to diversify
our collection (Table 1).
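A schematic version of this two-hop crawl is sketched below; fetch_recommended is a hypothetical stand-in for the YouTube Tracker / Data API calls used to retrieve the videos recommended for a given video, and the parameters are illustrative.

```python
import networkx as nx

def crawl_recommendations(seed_videos, fetch_recommended, hops=2, per_video=20):
    """Build a directed recommendation graph: an edge (parent -> child) means
    that `child` was recommended while watching `parent`."""
    graph = nx.DiGraph()
    frontier = list(seed_videos)
    for _ in range(hops):
        next_frontier = []
        for parent in frontier:
            for child in fetch_recommended(parent, limit=per_video):
                graph.add_edge(parent, child)
                next_frontier.append(child)
        frontier = next_frontier
    return graph
```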
The analysis phase starts with building the recommendation graphs for each dataset we
acquired by drawing edges between “parent videos” and “recommended videos.” To
explore and evaluate potential biases in video recommendations, we utilized a proba-
bilistic approach focusing on the nodes potentially gaining influence over the rest of the
graph due to the recommendations. We extracted the PageRank scores acquired by each
node in the recommendation graph and explored PageRank [16] distributions of each
1 YouTube Tracker (2020), https://vtracker.host.ualr.edu, COSMOS, UALR.
Table 1. Short description of seed data, entry points, and total number of collected
recommendations in each experiment.
4 Experiment Results
Examination of the in-degree and PageRank values for each graph confirms the skewed
distribution of recommendations, with tiny fractions of nodes receiving a vast number
of connections. Figure 1 shows an example of the PageRank distributions. We observed
a similar shape of distribution in all of the corresponding experiments. Nevertheless, an
exploratory comparison of slopes shows differences between the different recommendation
graphs we experimented with. They also indicate transitions in how the distributions of
cumulative PageRank and in-degree probabilities change within a single graph.
Fig. 1. Distribution of PageRank values in the recommendation graphs 1 (left) and 2 (right). We
observe similar results in all recommendation graphs. The count of videos is represented in log
scale on the y-axis.
To explore the variance between different recommendation graphs and their structural
features based on PageRank or in-degree distributions, we compared how the given
CCDF distributions fit the power-law based on the maximum likelihood estimation
developed by Clauset, Shalizi, and Newman [7]. We used a Python implementation [2]
of the log-likelihood functions that enable comparative computations suitable for our
purpose. The power-law fits and CCDF distributions showed some variance,
especially in terms of the exponent and how well the distributions fit the power law.
For example, the recommendation graph for dataset 7 seems to fit a likely power-law
almost perfectly, while the curve of the recommendation graph for dataset 8 has only
a partial fit that stops approximately after the mid-point on the plot. Furthermore, the
CCDF slopes of recommendation graphs 5 and 6 diverge from the power-law lines at
smaller PageRank values than other graphs (Fig. 3).
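For illustration, the fitting and comparison step can be sketched with the powerlaw package [2] roughly as follows, with pagerank_scores standing in for the per-node scores of one recommendation graph.

```python
import numpy as np
import powerlaw

values = np.array(list(pagerank_scores.values()))  # PageRank scores of one graph

# Maximum-likelihood power-law fit following Clauset, Shalizi, and Newman [7]
fit = powerlaw.Fit(values)
print("alpha =", fit.power_law.alpha, "xmin =", fit.power_law.xmin)

# Log-likelihood ratio test against an alternative heavy-tailed distribution;
# R > 0 favors the power law, p gives the significance of the comparison.
R, p = fit.distribution_compare("power_law", "lognormal")
print("R =", R, "p =", p)

# CCDF of the empirical data together with the fitted power law
ax = fit.plot_ccdf(label="empirical CCDF")
fit.power_law.plot_ccdf(ax=ax, linestyle="--", label="power-law fit")
```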
Fig. 3. The variance of PageRank probability distributions and log-likelihood of their power-law
fits on CCDF plots. Examples show variance between different recommendation graphs 2 (left)
and 5 (right).
5 Conclusion
This study aimed to discover, explore, and understand the potential, emergent, and inher-
ent characteristics of bias in YouTube's video recommendations. Despite the black-box
nature and closed inner workings of the algorithm itself, our approach enabled us to
identify recommendation bias in the studied system. Using probabilistic distributions
and the PageRank algorithm to operationalize our stochastic approach, we were able to
demonstrate the resulting influence of a small number of videos over the entire network.
We collected distinct datasets and explored the structural properties as well as node-
centric features of the recommendation graphs. In all experiments, the bias of the recom-
mendation algorithm in favor of a small fraction of videos seems to emerge as a basic,
scale-free characteristic that is evident on the topological level. In particular, cumulative
probability distributions of PageRank values demonstrate that a few videos turn out to
be far more likely to be visited by a user following the recommended items with some
randomness included. The experiments also show that the shape, skewness, and pro-
portion of the bias vary between different use-case scenarios. The variance of bias in
different recommendation graphs should be subject to further investigations.
We also prioritized the robustness of our evaluation and characterization of bias.
Primarily, we relied on diversified data collection and increased the number of
experiments conducted under realistic scenarios and expected sources of behavioral variance
in the studied system. The resulting indicators of such variance between different
recommendation networks point to further investigation in the future. In the subsequent
phases of this effort, we aim to produce models that can help predict and understand the
behavioral patterns that lead to the documented bias and its variations.
Acknowledgements. This research is funded in part by the U.S. National Science Foun-
dation (OIA-1946391, OIA-1920920, IIS-1636933, ACI-1429160, and IIS-1110868), U.S.
Office of Naval Research (N00014-10-1-0091, N00014-14-1-0489, N00014-15-P-1187, N00014-
16-1-2016, N00014-16-1-2412, N00014-17-1-2675, N00014-17-1-2605, N68335-19-C-0359,
N00014-19-1-2336, N68335-20-C-0540, N00014-21-1-2121), U.S. Air Force Research Lab,
U.S. Army Research Office (W911NF-17-S-0002, W911NF-16-1-0189), U.S. Defense Advanced
Research Projects Agency (W31P4Q-17-C-0059), Arkansas Research Alliance, the Jerry L.
Maulden/Entergy Endowment at the University of Arkansas at Little Rock, and the Australian
Department of Defense Strategic Policy Grants Program (SPGP) (award number: 2020-106-094).
Any opinions, findings, and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the funding organizations. The researchers
gratefully acknowledge the support. The researchers also thank MaryEtta Morris for helping with
proofreading and improving the paper.
References
1. Abdollahpouri, H., Burke, R., Mobasher, B.: Controlling popularity bias in learning-to-
rank recommendation. In: Proceedings of the Eleventh ACM Conference on Recommender
Systems (RecSys 2017), pp. 42–46 (2017)
2. Alstott, J., Bullmore, E., Plenz, D.: Power-law: a Python package for analysis of heavy-tailed
distributions. PloS One 9(1), e85777 (2014)
3. Bellogín, A., Castells, P., Cantador, I.: Statistical biases in information retrieval metrics for
recommender systems. Inf. Retriev. J. 20(6), 606–634 (2017). https://doi.org/10.1007/s10
791-017-9312-z
4. Beutel, A., et al.: Fairness in recommendation ranking through pairwise comparisons. In:
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery
& Data Mining (KDD 2019), pp. 2212–2220 (2019)
5. Boratto, L., Marras, M., Faralli, S., and Stilo, G.: International workshop on algorithmic bias
in search and recommendation (Bias 2020). In: Jose, J. et al. (eds.) Advances in Information
Retrieval. ECIR 2020. Lecture Notes in Computer Science. Springer (2020)
6. Buntain, C., Bonneau, R., Nagler, J., Tucker, J.A.: YouTube Recommendations and Effects
on Sharing Across Online Social Platforms (2020). arXiv preprint arXiv:2003.00970
7. Clauset, A., Shalizi, C.R., Newman, M.E.: Power law distributions in empirical data. SIAM
Rev. 51(4), 661–703 (2009)
8. Davidson, J., et al.: The YouTube video recommendation system. In: Proceedings of the fourth
ACM Conference on Recommender systems, pp. 293–296, September 2010
9. Faddoul, M., Chaslot, G., Farid, H.: A Longitudinal Analysis of YouTube’s Promotion of
Conspiracy Videos (2020). arXiv preprint arXiv:2003.03318
10. Galeano, K., Galeano, L., Mead, E., Spann, B., Kready, J., Agarwal, N.: The role of YouTube
during the 2019 Canadian federal election: a multi-method analysis of online discourse and
information actors, Fall 2020, no. 2, pp. 1–22. Queen’s University, Canada (2020). Journal
of Future Conflict
11. Google Developers: YouTube Data API, Google (2020). https://developers.google.com/you
tube/v3
12. Hussein, E., Juneja, P., Mitra, T.: Measuring misinformation in video search platforms: an
audit study on YouTube. Proc. ACM Hum.-Comput. Interact. 4(CSCW1), 1–27 (2020)
13. Le Merrer, E., Trédan, G.: The topological face of recommendation. In: International
Conference on Complex Networks and their Applications, pp. 897–908. Springer, Cham
(2017). https://doi.org/10.1007/978-3-319-72150-7_72
14. Ledwich, M., Zaitsev, A.: Algorithmic extremism: Examining YouTube’s rabbit hole of
radicalization. First Monday (2020)
15. Marcoux, T., Agarwal, N., Erol, R., Obadimu, A., Hussain, M.: Analyzing Cyber Influence
Campaigns on YouTube using YouTubeTracker. Lecture Notes in Social Networks, Springer.
Forthcoming (2018)
16. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing
Order to the Web. Stanford InfoLab (1999)
17. Ribeiro, M.H., Ottoni, R., West, R., Almeida, V.A., Meira Jr., W.: Auditing radicalization
pathways on YouTube. In: Proceedings of the 2020 Conference on Fairness, Accountability,
and Transparency, pp. 131–141 (2020)
18. Roose, K.: The making of a YouTube Radical. The New York Times (2019)
19. Roth, C., Mazières, A., Menezes, T.: Tubes and bubbles topological confinement of YouTube
recommendations. PloS One 15(4), e0231703 (2020)
20. Tufekci, Z.: YouTube, the Great Radicalizer. The New York Times, vol. 10, p. 2018 (2018)
21. Verma, S., Gao, R., Shah, C.: Facets of fairness in search and recommendation. In: Borratto,
L., Faralli, S., Marras, M., Stilo, G. (eds.) Bias and Social Aspects in Search and Recommen-
dation, First International Workshop, BIAS 2020, Lisbon, Portugal, April 14, Proceedings.
Communications in Computer and Information Science, vol. 1245, pp. 1–11 (2020). https://
doi.org/10.1007/978-3-030-52485-2_1
22. Wakabayashi, D.: YouTube Moves to Make Conspiracy Videos Harder to Find. The New
York Times, 25 Jan 2019. https://www.nytimes.com/2019/01/25/technology/youtube-conspi
racy-theory-videos.html
23. Zhou, R., Khemmarat, S., Gao, L.: The impact of YouTube recommendation system on video
views. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement,
pp. 404–410, November 2010
24. Zhou, R., Khemmarat, S., Gao, L., Wan, J., Zhang, J.: How YouTube videos are discovered
and its impact on video views. Multimed. Tools Appl. 75(10), 6035–6058 (2016). https://doi.
org/10.1007/s11042-015-3206-0
An Information-Theoretic Measure
for Enabling Category Exemptions
with an Application to Filter Bubbles
1 Introduction
Personalized recommendation systems (e.g. news feeds, social media feeds, prod-
uct or video recommendations) often obsessively maximize a utility metric
related to revenue generated, e.g., click-through rate (CTR) [30], hovering time,
etc. However, one of the major potential perils of personalization is that it signif-
icantly influences users' opinions by entrapping them in filter bubbles1 [6,7,29].
Filter bubbles refer to a state of intellectual isolation caused by an Internet
user's exposure to content that reinforces their beliefs rather than a diverse set
of opinions and viewpoints [28,29].
We are interested in eliminating filter bubbles pertaining to a user’s biases
about belief-based attributes (denoted by Z), such as political inclination, social
opinions, etc. However, an attempt to make the recommendations entirely inde-
pendent of Z may be too restrictive for maintaining high utility metrics (e.g.
CTR) that these platforms aim to maximize. In this work, we take the viewpoint
that the users, if aware, will carefully choose which filter bubbles to participate
in and which ones to eliminate. For example, a user invested in a certain cate-
gory, such as climate change, may want to be in a filter bubble that preserves
their belief-based attributes, motivates them, and enables them to forge rela-
tionships to make social change. On the other hand, the same user attempting
to learn about another category, such as healthcare, may want to learn from
all perspectives of the debate, and hence may want to reduce the bias of the
recommender system for other categories. Thus, the platform should enable the
users to choose “content-categories” in which they prefer to be in intellectual
bubbles and exempt bias.
In this work, we propose a novel method of quantifying nonexempt-category-
specific bias in personalized recommendation systems, and then leverage this
measure to selectively reduce bias only in the categories that are not exempted by
the user. We assume that an article of online content (subsequently, “an article”)
can belong simultaneously to several categories. If any one of these categories
is exempted, then it is desirable that the bias in that article is exempted. The
exempt categories could be, e.g., news from a preferred source (e.g. New York
Times or Fox News), or news corresponding to a category of interest (e.g. climate
change, religion, or healthcare). In the process, by being selective about cate-
gories in which bias is exempted, we succeed in preserving (in toy and real-data
examples; see Sect. 4), to an extent (that depends on the fraction of categories
exempted by the user), some bias in the personalization of recommendation.
Our work strikes a balance between the competing goals of maximizing utility
metrics for personalization and reducing bias to reduce effects of filter bubbles
(related to work in algorithmic fairness; see “Related Work” below). This offers
an alternative to entirely anonymizing the personalized search history [11]
(e.g., using anonymous windows), which significantly hurts the utility of
personalization.
1 Acknowledging our limited visibility into recommender systems used in practice, we
note that this work is not pointing fingers at any political party, news channel, or
industrial recommender systems. We believe there is room for improvement in the
existing design of recommender systems, and this needs to be pursued in collabora-
tion with industry to redesign today’s recommender systems.
2 Related Work
The concept of context-specific independence (i.e., Z ⊥ Ŷ |Cex = 0) in Bayesian
networks [5] dates back to 1996. Also, closely related to this idea is recent work
on fairness under exemptions [2,14,16,23,24,31] where the goal is to perform
fair classification with respect to protected attributes, e.g., gender, race etc.,
while exempting bias due to some critical or explanatory features (also called
“resolving variables”), e.g., qualifications in hiring, choice of department in col-
lege admissions [24], annual income in loan applications. For example, in [2,16],
the authors use the related measure I(Z; Y |Xc ), where Xc is the set of exempt
features. While our work shares conceptual similarities with these aforemen-
tioned works, our novelty lies in using these ideas to propose a measure for a
different problem, that is particularly relevant in today’s societal context. Com-
pared to the works on fairness, our exemption is value-based, i.e., the bias needs
to be exempted only when the article falls in certain categories (further clari-
fied in Sect. 3.1) while preserving the bias in other categories. This means that
the realization of categories is important here. The measure should not average
over all realizations of categories, as is done in I(Z; Y |Xc ), e.g., one may want a
loan-decision to be conditionally independent of Z given a feature income that
is critical to the decision.
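To make the distinction concrete, the following plug-in sketch (our illustration with hypothetical helper names, not a measure taken from the cited works) contrasts the averaged conditional measure with a value-conditioned variant computed only on the Cex = 0 slice.

```python
import numpy as np

def mi(z, y):
    """Plug-in mutual information between two discrete arrays (in bits)."""
    z, y = np.asarray(z), np.asarray(y)
    total = 0.0
    for zv in np.unique(z):
        for yv in np.unique(y):
            p_zy = np.mean((z == zv) & (y == yv))
            p_z, p_y = np.mean(z == zv), np.mean(y == yv)
            if p_zy > 0:
                total += p_zy * np.log2(p_zy / (p_z * p_y))
    return total

def cmi(z, y, c):
    """I(Z; Y | C): averages the conditional MI over all realizations of C."""
    z, y, c = map(np.asarray, (z, y, c))
    return sum(np.mean(c == cv) * mi(z[c == cv], y[c == cv]) for cv in np.unique(c))

def value_conditioned_mi(z, y, c_ex):
    """I(Z; Y | C_ex = 0): bias measured only on the nonexempt items."""
    mask = np.asarray(c_ex) == 0
    return mi(np.asarray(z)[mask], np.asarray(y)[mask])
```

The value-conditioned variant leaves the Cex = 1 slice untouched, which is exactly the behavior the category exemption requires.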
The problem of filter bubbles also shares connections with other works on
fair personalization [13] and fair ranking [12], which also strike a trade-off
between a utility metric and fairness. However, these works do not allow for
category exemptions. Future work will examine the use of a fairness criterion
derived from value-based conditioning in these scenarios to allow for category
exemptions.
Several measures have been proposed to correct bias in recommender sys-
tems. Collaborative filtering and ranking, sometimes used together, are the main
settings for this line of work. [34] generalizes statistical parity and equal
opportunity to personalized ranking systems. Similarly, [4] proposes
two measures, item statistical parity and item equal opportunity. The former
aims at achieving uniform distribution of recommended items regardless of item
popularity, whereas the latter focuses on retaining item popularity distribution
faithfully. [17] defines bias as the predictability of the protected attribute given
recommendation results in a collaborative filtering context. Similar to our work,
[27] zooms in to the recommendation bias of an individual user. It evaluates
how diverse the recommendation results are by computing the content diver-
sity defined by pairwise item distances. Our work defines and solves a different
problem in a different setting; however, we believe the same problem can also
be redefined in a meaningful way for collaborative filtering- or ranking-based
systems.
Another line of work focuses on “diversity” [9], i.e., diversifying recommen-
dation results to burst filter bubbles. Our work aims to achieve this by providing
users with more fine-grained control over the bubbles that they would like to
keep and the ones that they do not want to be in. Another approach uses "cali-
bration” [33], which reflects faithfully and proportionally a user’s likes/dislikes.
For example, if a user has watched 70% romance movies and 30% action movies,
then the recommendation list of movies should comply with the same ratio of the two
genres. These works focus on faithful preservation of the bias in the user's online
history. As we will show later, our work can be interpreted as a combination
of diversity and calibration, each of which is achieved in a different category.
Diversity is for breaking the filter bubble, whereas calibration is for preserving
wanted bias. Our technique achieves both of them simultaneously via a single
regularizer.
Several existing tools, e.g., Ghostery [32] and DuckDuckGo [11], use anonymiza-
tion to protect the user's privacy, search history, preferences, etc. from being
used in personalization. In contrast to these approaches, our proposed redesign of
recommender systems does not prevent personalization, but debiases the recom-
mendation (measured by VCMI) while trading off between utility and unwanted
bias. Alternative software tools [7], e.g., Balancer, Scoopinion, and Wobble, track and
inform users about their online activity and make them aware of their filter
bubbles. There are also tools [7] such as ConsiderIt, Opinion Space, Reflect
etc. that adopt the viewpoint of “deliberative democracy,” and gently nudge the
user to read the views or articles offering a different opinion. Instead of proposing
a new app or platform, our work proposes a solution which can be adopted by
existing platforms. Nevertheless, adoption of this solution might, as a side-effect,
increase user awareness of filter bubbles by having them make choices of exempt
categories.
prevent most of those articles to minimize I(Z; Y |Cex = 1). Here, VCMI is more
appropriate as it only minimizes I(Z; Y |Cex = 0), debiasing articles that are not
about climate change.
When is CMI More Appropriate? Suppose Alice likes a particular biased
news-source, say, Foobar news, and wants unbiased articles (from Foobar news or
outside), but also wants to continue to receive articles from this source (i.e., does
not want debiasing to lower the likelihood of recommendation of articles from
Foobar news). Using MI here could, as in Scenario I, significantly reduce articles
from Foobar news since articles in Foobar news are strongly correlated with Z.
What is needed here is debiasing while exempting the bias arising from other
features for both cases: whether the article is from Foobar news or not. This is
subtly different from Scenario I, where only when the article is in the exempt
category is the bias exempted. CMI is more appropriate for this scenario because
it minimizes the conditional bias for articles, conditioned on them belonging, and
not belonging, to Foobar news. As for VCMI, it would only minimize the bias in
articles not from Foobar news, exempting the bias in the articles from Foobar
news.
When is MI More Appropriate? Suppose Alice wants all her recommenda-
tions to be neutral with respect to the political stance irrespective of the source
or the category. Here, MI is the most appropriate. In our proposed redesign of
recommendation systems, MI could be the default regularizer, and users can add
exemptions as they see fit.
4 Experimental Evaluation
Integrate VCMI into Training: We propose two ways of integrating the
VCMI measure into machine learning training.
– sVCMI: Single model with VCMI regularizer for all articles irrespective of
whether they are exempt or not.
– dVCMI: Two separate models for exempt and nonexempt articles; VCMI
regularizer only in the latter model.
One might think that there is little difference between the above two. After all, with
sufficient complexity, sVCMI can emulate dVCMI. The implicit assumption here
is that the number of parameters in each of them is limited, so that they can be
trained without a very large dataset (as is the case in our Twitter dataset). This
limited dataset case is where we see the largest distinction between the two in
our experimental evaluation.
We will train classifiers whose outputs indicate the likelihood (between 0
and 1) of recommendation of the article. We use binary cross-entropy as our
loss function, and train five models: (i) Vanilla: without regularizer; (ii) MI:
MI regularizer; (iii) CMI: CMI regularizer; (iv) sVCMI: VCMI regularizer; and
(v) dVCMI: two MLPs for exempt and nonexempt data, respectively, with the
VCMI regularizer applied to the latter.
Here, VCMI denotes I(Z; Y | Cex = 0), i.e., the bias measured only on nonexempt articles.
Fig. (toy example). Left: bias in the nonexempt and exempt categories. MI, CMI,
sVCMI, and dVCMI all eliminate the bias in the nonexempt category; for the exempt
category, sVCMI and dVCMI preserve the bias, but MI and CMI do not. Middle: JS
distance between the distributions Pr(Y |Z = 0) and Pr(Y |Z = 1) in the nonexempt and
exempt categories; using this different distance measure, the conclusion is similar.
Right: AUC for the different models in the toy example. Because sVCMI and dVCMI
preserve the dependency between Z and Y in exempt categories, they retain a higher AUC.
We label each news article as having a left or right slant using the method in [10], and
specify the news category of each article using an LSTM-based model [26]. Assuming
that retweeting means a user is interested in the article, we use the collaborative
filtering implementation in Implicit [19] to generate the ground-truth labels (to recommend
or not). In the later experiment, we study the user who has the largest number of real
ground-truth labels prior to the collaborative filtering.
Created Dataset Description: Each article is represented by a feature vector
with the following fields: (a) Categories: multi-hot encoding of news categories;
(b) Article length; (c) Source: one-hot encoding of the publisher of the news; (d)
Publication time. Each piece of news is also associated with a political view, left,
right or neutral. This attribute is only used when calculating our regularizers
and is not included in the input features. The true labels are binary.
Experimental Setup: We predict whether the user will be interested in an article.
We use an MLP with two hidden layers (32 neurons per layer) as the classifier.
Evaluation metrics and other parameters are the same as in the toy example.
The dataset is randomly split into training and test sets with a ratio of 7:3. The
models were trained for 200 epochs using λ = 0.25. We collected statistics for
32 runs of the model to mitigate effects of random initialization.
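A minimal training sketch for the sVCMI variant under these settings is given below. Since the paper's exact VCMI estimator is not reproduced here, the regularizer uses a simple differentiable proxy (the squared gap in mean scores between the two Z groups among nonexempt articles), and all tensors are synthetic placeholders.

```python
import torch
import torch.nn as nn

# Placeholder data: X (features), y (labels), z (belief attribute), c_ex (exempt flag)
n, d = 1000, 16
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()
z = torch.randint(0, 2, (n,))
c_ex = torch.randint(0, 2, (n,))

model = nn.Sequential(
    nn.Linear(d, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 0.25  # regularization weight, as in the experimental setup

for epoch in range(200):
    scores = model(X).squeeze(1)
    loss = bce(scores, y)
    nonexempt = c_ex == 0
    # Proxy for I(Z; Y | Cex = 0): penalize the score gap between the Z groups
    # among nonexempt articles only, leaving exempt articles untouched.
    gap = scores[nonexempt & (z == 1)].mean() - scores[nonexempt & (z == 0)].mean()
    loss = loss + lam * gap.pow(2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The dVCMI variant would train two such networks, one per value of the exempt flag, and add the regularizer only to the nonexempt one.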
Fig. (case study). Left: bias in the nonexempt and exempt categories. Error bars
indicate 95% confidence intervals in all following figures. sVCMI and dVCMI better
preserve I(Z; Y |Cex = 1) and reduce I(Z; Y |Cex = 0); dVCMI is better at preserving
the exempt bias than sVCMI. Middle: JS distance between Pr(Y |Z = 0) and
Pr(Y |Z = 1) in the nonexempt and exempt categories; sVCMI again cannot preserve
the JS distance in the exempt group well. Right: AUC for the different models in the
case study. The AUC is maintained for all models, but dVCMI takes a small hit,
potentially because of the data insufficiency induced by the two-network approach.
5 Conclusion
In this work, we identify the problem of allowing users to choose which filter
bubbles to stay in and which ones to eliminate in recommendation systems. We
propose to selectively eliminate bias towards belief-based attributes (e.g. politi-
cal inclination) in certain user-chosen content-categories, while preserving such
bias in others. We arrive at a simple information-theoretic measure, VCMI, for
quantifying such bias. It aims to reduce dependence of Z on Y given Cex = 0,
while maintaining, as much as possible, the accuracy as well as dependence of
Z on Y given Cex = 1. While our experiment and case studies on the synthetic
and created datasets suggest that VCMI is able to attain this goal, some notable
issues remain: (i) CMI can sometimes lower I(Z; Y |Cex = 0) below that done by
VCMI; (ii) VCMI may sometimes lower I(Z; Y |Cex = 1) as well, possibly due
to limitations of MLP-based models and the ability to affect conditional joint
distributions without affecting the overall joint distribution. We leave a compre-
hensive evaluation to future work. Future works may also explore the following:
(i) More reliable dataset: given the novelty of this problem, we could not
find any off-the-shelf dataset to test our measure. To conduct further study, we
need labeled data from real world applications, e.g., Facebook news feed sys-
tem. Nevertheless, we hope that this connection between fairness measures and
filter bubbles receives further attention from the community. (ii) Alternative
estimation techniques for VCMI building on [20] and the references therein. (iii)
Practical applicability, e.g., by improving the method of selecting exempt
categories or belief-based attributes so that it is more applicable to unsupervised or
semi-supervised settings. These shortcomings need to be addressed before the
method can be deployed in a real-world setup.
References
1. Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., Wallach, H.: A reductions
approach to fair classification. arXiv preprint arXiv:1803.02453 (2018)
2. Anonymous: Conditional debiasing for neural networks (2019)
3. Bakshy, E., Messing, S., Adamic, L.A.: Exposure to ideologically diverse news and
opinion on Facebook. Science 348(6239), 1130–1132 (2015)
4. Boratto, L., Fenu, G., Marras, M.: Connecting user and item perspec-
tives in popularity debiasing for collaborative recommendation. Inf. Process.
Manag. 58(1) (2021). https://doi.org/10.1016/j.ipm.2020.102387. https://www.
sciencedirect.com/science/article/pii/S0306457320308827
5. Boutilier, C., Friedman, N., Goldszmidt, M., Koller, D.: Context-specific indepen-
dence in Bayesian networks. In: Proceedings of the Twelfth International Con-
ference on Uncertainty in Artificial Intelligence, UAI 1996, pp. 115–123. Morgan
Kaufmann Publishers Inc., San Francisco (1996)
6. Bozdag, E.: Bias in algorithmic filtering and personalization. Ethics Inf. Technol.
15(3), 209–227 (2013)
7. Bozdag, E., van den Hoven, J.: Breaking the filter bubble: democracy and design.
Ethics Inf. Technol. 17(4), 249–265 (2015)
8. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine
learning algorithms. Pattern Recogn. 30(7), 1145–1159 (1997). https://doi.org/10.
1016/S0031-3203(96)00142-2
9. Bradley, K., Smyth, B.: Improving recommendation diversity. In: Proceedings
of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science,
Maynooth, Ireland, pp. 85–94. Citeseer (2001)
10. Brena, G., Brambilla, M., Ceri, S., Di Giovanni, M., Pierri, F., Ramponi, G.: News
sharing user behaviour on twitter: a comprehensive data collection of news articles
and social interactions. In: Proceedings of the International AAAI Conference on
Web and Social Media, vol. 13, no. 01, pp. 592–597, July 2019. https://www.aaai.
org/ojs/index.php/ICWSM/article/view/3256
11. Buys, J.: DuckDuckGo: a new search engine built from open source. GigaOM
OStatic blog (2010)
12. Celis, L.E., Straszak, D., Vishnoi, N.K.: Ranking with fairness constraints. arXiv
preprint arXiv:1704.06840 (2017)
13. Celis, L.E., Vishnoi, N.K.: Fair personalization. arXiv preprint arXiv:1707.02260
(2017)
14. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A.: Algorithmic deci-
sion making and the cost of fairness. In: Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD 2017,
pp. 797–806. ACM, New York (2017). https://doi.org/10.1145/3097983.3098095.
http://doi.acm.org/10.1145/3097983.3098095
15. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken
(2012)
16. Dutta, S., Venkatesh, P., Mardziel, P., Datta, A., Grover, P.: An information-
theoretic quantification of discrimination with exempt features. In: Association for
the Advancement of Artificial Intelligence (2020)
17. Edizel, B., Bonchi, F., Hajian, S., Panisson, A., Tassa, T.: FaiRecSys: mitigating
algorithmic bias in recommender systems. Int. J. Data Sci. Anal. 9(2), 197–213
(2020)
18. Edwards, A.L.: The Correlation Coefficient, chap. 4, pp. 33–46. W. H. Freeman
(1976)
19. Frederickson, B.: Implicit (2019). https://github.com/benfred/implicit
20. Gao, W., Kannan, S., Oh, S., Viswanath, P.: Estimating mutual information for
discrete-continuous mixtures. In: Advances in Neural Information Processing Sys-
tems, pp. 5986–5997 (2017)
21. Garrett, R.K.: The “echo chamber” distraction: disinformation campaigns are the
problem, not audience fragmentation (2017)
22. Ghassami, A., Khodadadian, S., Kiyavash, N.: Fairness in supervised learning:
an information theoretic approach. In: 2018 IEEE International Symposium on
Information Theory (ISIT), pp. 176–180. IEEE (2018)
23. Kamiran, F., Žliobaitė, I., Calders, T.: Quantifying explainable discrimination and
removing illegal discrimination in automated decision making. Knowl. Inf. Syst.
35(3), 613–644 (2013)
24. Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D.,
Schölkopf, B.: Avoiding discrimination through causal reasoning. In: Proceedings
of the 31st International Conference on Neural Information Processing Systems,
NIPS 2017, pp. 656–666. Curran Associates Inc., USA (2017). http://dl.acm.org/
citation.cfm?id=3294771.3294834
25. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf.
Theory 37(1), 145–151 (1991)
26. Misra, R.: News category dataset (2018). https://www.kaggle.com/rmisra/news-
category-dataset/
27. Nguyen, T.T., Hui, P.M., Harper, F.M., Terveen, L., Konstan, J.A.: Exploring the
filter bubble: the effect of using recommender systems on content diversity. In: Pro-
ceedings of the 23rd International Conference on World Wide Web, WWW 2014,
pp. 677–686. Association for Computing Machinery, New York (2014). https://doi.
org/10.1145/2566486.2568012
28. O’Callaghan, D., Greene, D., Conway, M., Carthy, J., Cunningham, P.: Down the
(white) rabbit hole: the extreme right and online recommender systems. Soc. Sci.
Comput. Rev. 33(4), 459–478 (2015)
29. Pariser, E.: The Filter Bubble: What the Internet is Hiding from You. Penguin UK
(2011)
30. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-
through rate for new ads. In: Proceedings of the 16th International Conference on
World Wide Web, pp. 521–530. ACM (2007)
31. Salimi, B., Rodriguez, L., Howe, B., Suciu, D.: Interventional fairness: causal
database repair for algorithmic fairness. In: Proceedings of the 2019 International
Conference on Management of Data, SIGMOD 2019, pp. 793–810. ACM, New York
(2019). https://doi.org/10.1145/3299869.3319901
32. Signanini, J., McDermott, B.: Ghostery (2014). https://www.ghostery.com/
33. Steck, H.: Calibrated recommendations. In: Proceedings of the 12th ACM Confer-
ence on Recommender Systems, RecSys 2018, pp. 154–162. Association for Com-
puting Machinery, New York (2018). https://doi.org/10.1145/3240323.3240372
34. Zhu, Z., Wang, J., Caverlee, J.: Measuring and mitigating item under-
recommendation bias in personalized ranking systems. In: Proceedings of the 43rd
International ACM SIGIR Conference on Research and Development in Informa-
tion Retrieval, SIGIR 2020, pp. 449–458. Association for Computing Machinery,
New York (2020). https://doi.org/10.1145/3397271.3401177
Perception-Aware Bias Detection
for Query Suggestions
Abstract. Bias in web search has been in the spotlight of bias detec-
tion research for quite a while. At the same time, little attention has
been paid to query suggestions in this regard. Awareness of the prob-
lem of biased query suggestions has been raised. Likewise, there is a
rising need for automatic bias detection approaches. This paper adds
on the bias detection pipeline for bias detection in query suggestions of
person-related search developed by Bonart et al. [2]. The sparseness and
lack of contextual metadata of query suggestions make them a difficult
subject for bias detection. Furthermore, query suggestions are perceived
very briefly and subliminally. To overcome these issues, perception-aware
metrics are introduced. Consequently, the enhanced pipeline is able to
better detect systematic topical bias in search engine query suggestions
for person-related searches. The results of an analysis performed with
the developed pipeline confirm this assumption. Due to the perception-
aware bias detection metrics, findings produced by the pipeline can be
assumed to reflect bias that users would discern.
1 Introduction
Fairness in online search and bias detection are important topics in information
retrieval research. Consequently, there are many approaches for detecting bias
in search results. Little research and few methodological approaches exist for
bias detection in query suggestions. Query suggestions are an important aspect
of online information retrieval via search engines and significantly impact what
people search for [17]. Due to the sparseness of query suggestions (no author, no
sources, no publishing platform, less text) and context-dependency, bias detec-
tion of query suggestions is less straightforward than bias detection of search
results [23]. Unless a person performing an online search lacks a clear
information need, very little attention is paid to the query suggestions. Search
engine users perceive them only briefly and subliminally, even though certain
effects, such as the diminishing attention paid to elements further down the list,
still apply [5,8]. Summarizing these findings, we are left with
two research questions to focus on as we develop a bias detection pipeline:
– RQ1: To what extent can bias towards metadata on the persons searched
(e.g., gender, age, party membership) be uncovered in query suggestions to
person-related searches using perception-aware metrics?
– RQ2: How do perception-aware metrics perform compared to simpler metrics
in detecting bias in query suggestions to person-related searches?
2 Related Work
Findings of various studies show that search engines such as Google are seen as
a trustworthy source of information on many topics, including political infor-
mation [4,22]. According to Houle, search engines significantly impact political
opinion formation [10]. The trust in search engines is problematic because their
results are prone to be biased. This might be due to bias induced by algorithms
[11], or by representing bias inherent in mental models: Although not at all true
by definition of these words, a doctor is usually assumed to be male, while a
nurse is typically expected to be female [1]. Similar biased patterns are plentiful,
and because search engines reflect the information inherent in data, bias is as
omnipresent in search results as it is in people's minds and natural-language texts
[6,18]. Even Google acknowledges this in a white paper, stating their awareness
of disinforming and biased contents presented by their search engine [7]. Biased
search results have been widely discussed and researched. Kulshrestha et al.
investigated and compared bias in Google search results and search on Twitter
[13]. Their bias detection methodology relies on peripheral information, such as
the author, publishing and distribution platforms, as well as information gath-
ered from these sources [13].
Aside from search results, query suggestions play a key role in what people
search for [17]. There are many types of query suggestions such as query expan-
sions, auto completions, query predictions, or query refinements [3,20]. We use
the term query suggestion as an umbrella term for all facets of the previously
mentioned terms, to describe the list of suggested search queries returned by
search engines for an input query or search term. Although it is unclear how
exactly they are formed, there is no doubt that query suggestions are derived
from what other users search for in a location and language [24]. Also, they
can be manipulated [24]. Query suggestions are assumed to be as bias-laden
as search results, with a study by Olteanu et al. illustrating how diverse and
hard to detect the forms of problematic and biased search suggestions are [19]. The
difficulty in detecting bias in query suggestions lies in their sparseness. With-
out context, they neither offer author or source information nor is input bias
available to judge their ranking. Furthermore, bias in query suggestions is often
context-dependent and not derivable from the terms themselves [19]. For these
reasons, approaches like the one utilized by Kulshrestha et al. do not work for
query suggestions.
To overcome the hurdles of identifying bias in sparse, contextless search query
suggestions, Bonart et al. developed a bias identification pipeline for person-
related search [2]. It represents a natural language processing pipeline with three
modules: Data acquisition, data preprocessing, and bias analysis (cf. Fig. 1). The
collection they used to develop and test the pipeline consists of search queries
and their corresponding lists of query suggestions, gathered since 2017 in Ger-
man from Google, DuckDuckGo, and Bing. The search terms consist primarily
of names of German politicians. With the developed pipeline, Bonart et al. tried
to find a way to automatically identify systematic topical biases towards certain
groups of people with shared meta-attributes (e.g., gender, age, party member-
ship) [2]. Topical bias describes content bias as misrepresented information in
the documents themselves [21]. Concerning the meta-attributes, systematic top-
ical bias refers to differences in the distribution of topics in query suggestions
of groups of politicians of different meta-attribute characteristics (e.g., male/
female). Due to their aforementioned sparseness, this type of bias is the easiest
to detect in query suggestions.
The metric Bonart et al. used for detecting bias is the number of unique topi-
cal cluster terms in the query suggestions for each search term (cf. section 3). Bias
is then identified by comparing the differences between certain groups of search
terms. The results produced by the bias identification pipeline only revealed
minor indicators of bias within the groups of politicians. In addition to the
insignificant findings, the metric employed by Bonart et al. does not consider
two critical aspects of how query suggestions are perceived: (A) the frequency with
which suggestions of a topic appear, and (B) the rank at which they appear.
3 Methodology
The following section briefly outlines the bias detection pipeline and highlights
changes and additions made to it.
Data Acquisition. The bias analysis is based on a set of N search terms ti with
i = 1, ..., N which share a set of P meta-attributes xi,1 , ..., xi,P [2]. The dataset
consists of a collection of query suggestions returned for these terms, which con-
sist of names of German politicians or politically relevant people from Germany.
Twice per day, for all search terms, a specialized web crawler collects query sug-
gestions in German by HTTP request from three search engines’ auto-complete
APIs: Google, DuckDuckGo and Bing. Requests only contain the input query as
well as language and location information. Therefore, no user profiles or search
histories can influence the results.
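For illustration, a single request against Google's unofficial suggest endpoint might look as follows; the endpoint URL and parameters are assumptions on our part, as the paper does not name the exact APIs, and language is fixed to German.

```python
import requests

def google_suggestions(term, lang="de"):
    """Query the (unofficial) Google suggest endpoint for autocomplete suggestions."""
    response = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": term, "hl": lang},
        timeout=10,
    )
    response.raise_for_status()
    # The endpoint returns [query, [suggestion_1, ..., suggestion_n], ...]
    return response.json()[1]

suggestions = google_suggestions("Angela Merkel")
```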
Preprocessing. Preprocessing is the first module of the pipeline that was changed
significantly. The lemmatizer was changed from pattern.de to the German news
standard model made available within the spaCy library by Honnibal et al. [9].
An entity recognition step was introduced and is likewise performed using spaCy.
It is performed after lemmatization on all query suggestions that do not con-
sist of a single word. After cleaning and lemmatization, about 20% of the query
suggestions are discarded in the original pipeline because clustering can only
be performed on single-word-suggestions. By employing an entity recognition
system, many suggestions such as “summer festival” are condensable to a sin-
gle term, which can be used in the cluster analysis. Since query suggestions are
most probably formed considering entities to not deliver multiple very similar
suggestions to the user (cf. [3]), the suggestions shortened by entity recognition
are not expected to change significantly in meaning. The last step in the pre-
processing module consists of the unmodified vectorization of the single term
query suggestions using the word2vec module [15]. Although there are newer
and more elaborate vectorizers, word2vec performed best on the given collection
of German terms.
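A condensed sketch of this module is given below; the model name de_core_news_sm and the choice to train a small word2vec model directly on the suggestion corpus are our assumptions, since the text only names spaCy [9] and word2vec [15].

```python
import spacy
from gensim.models import Word2Vec

nlp = spacy.load("de_core_news_sm")  # assumed German pipeline; the paper uses spaCy's German news model

def to_single_term(suggestion):
    """Lemmatize a suggestion and condense multi-word suggestions via entity recognition."""
    doc = nlp(suggestion)
    tokens = [t for t in doc if t.is_alpha]
    if len(tokens) == 1:
        return tokens[0].lemma_.lower()
    if doc.ents:  # keep the first recognized entity as a single term
        return doc.ents[0].lemma_.lower().replace(" ", "_")
    return None  # discarded, as in the original pipeline

suggestions = ["sommerfest", "angela merkel news", "bundestagswahl"]
terms = [t for t in (to_single_term(s) for s in suggestions) if t]

# Stand-in vectorization: train word2vec on the collected terms themselves.
model = Word2Vec([terms], vector_size=100, window=5, min_count=1)
vectors = {t: model.wv[t] for t in terms}
```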
Bias Analysis. The first step in the bias analysis module, topic clustering, has
not been changed methodically. The vectorized query suggestions are assigned
to topical clusters utilizing a k-means approach.
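A minimal clustering sketch with scikit-learn, reusing the vectors from the preprocessing sketch and assuming k = 3 as reported later, could look like this.

```python
import numpy as np
from sklearn.cluster import KMeans

terms_list = list(vectors.keys())
matrix = np.vstack([vectors[t] for t in terms_list])

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(matrix)

# term -> topical cluster id, to be labeled manually afterwards
term_to_cluster = dict(zip(terms_list, labels))
```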
Most importantly, new metrics have been introduced. These metrics are
perception-aware, meaning that they aim to determine bias analogously to how a user
would perceive it. Due to the sparseness of metadata around the suggestions and
the often context-dependent nature of bias, topical bias is the most promising kind
of bias to detect automatically. A main characteristic of the perception of query
suggestions is the low attention given to them, especially suggestions on lower
ranks [8,16]. Therefore, the main factors influencing the exposure of a topic over
a time span are the percentage of relevant topical suggestions and their ranks.
As a first step to derive the new metrics, a matrix is created, from which the
metrics are calculated. This matrix contains rows for all search terms tir with
i = 1, ..., N being the identifier of the term and r = 1, ..., 10 signifying the rank
in the list of query suggestions for the term. Search terms and suggestion rank
form the multi-index structure for the rows of the matrix. These rows are paired
with all M preprocessed single-term query suggestions sj with j = 1, ..., M , that
have been assigned to a cluster. This results in a structure where the frequency
for every search term-suggestion combination at every rank is stored. Based on
this, the number and percentage of suggestions of each of the topical clusters at
each rank can be calculated for each search term.
The problem of judging systematic topical bias in query suggestions resembles
that of relevance judgment. Relevance judgments usually describe the probability
that a document fulfills a given information need. Likewise, the percentage of
suggestions for a certain cluster at a given rank describes the probability of
a randomly selected single term belonging to that topic. Thus, using a discounted
relevance measure to judge the rank- and perception-aware topical affiliation is
not far-fetched. Hence, discounted cumulative gain and its normalized variant are
introduced as metrics for detecting bias. Both are usually employed to judge the
relevance of a ranked list of documents, for example, as returned by a search query [14].
The metrics put an emphasis on the rank of the judged elements, which is what
we want our metric to do. Discounted Cumulative Gain (DCG) is adopted as a bias
metric for query suggestions as follows [12]:
DCG(C_x, q) = \sum_{i=1}^{10} \frac{2^{P(C_x(i), q)} - 1}{\log_2(i + 1)},    (1)
where DCG(Cx , q) describes the DCG of a term q for cluster x and P (Cx (i), q)
is the percentage of total appearances of clustered query suggestions at rank i
of the list of query suggestions for the term. Instead of relevance, we employ the
percentage of suggestions for a topical cluster, which can be interpreted as the
topical affiliation for that cluster. Counting appearances of cluster words and
using the percentages as key measurements is similar to using graded instead
of dichotomous relevance judgments. In essence, instead of a measure of gain,
DCG as a metric for topical affiliation describes the average degree of perceived
exposure to a topic within the query suggestions of a search term. By revealing
differences in the topical affiliation between groups, topical bias towards these
groups can be identified and quantified.
nDCG(C_x, q) = \frac{DCG(C_x, q)}{IDCG(C_x, q)}    (2)
By normalizing, the nDCG expresses how every cluster is distributed over the
ranks of the query suggestions of a term, neglecting the overall number of cluster
words and other clusters. Thereby, it expresses the average rank at which the sugges-
tions of the topical cluster appear. A high nDCG score means that a topical
cluster appears, on average, on the first ranks of the suggestions. However, it
does not indicate how often suggestions of the topic appear over the span of
data acquisition. The nDCG could be a useful metric if the lengths of the query
suggestion lists vary, when particular clusters or terms are uncommon, or when very
little data is available, for example, when judging how prominently a term
appeared in searches although it surfaced in search suggestions only for a very brief time
(e.g., the suggestion "flu" combined with names of different states). These terms do not appear
often enough over a year to impact the DCG score, but the nDCG allows for
differentiated insight anyway by only emphasizing the rank.
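A direct implementation of Eqs. (1) and (2) could look as follows; computing IDCG by re-sorting the per-rank shares in descending order is our reading, as the surviving text does not define it explicitly.

```python
import numpy as np

def dcg(shares):
    """Eq. (1): shares[i] is P(C_x(i+1), q), the share of cluster suggestions at rank i+1."""
    shares = np.asarray(shares, dtype=float)
    ranks = np.arange(1, len(shares) + 1)
    return float(np.sum((2.0 ** shares - 1.0) / np.log2(ranks + 1)))

def ndcg(shares):
    """Eq. (2): DCG normalized by the ideal ordering of the same shares (assumed IDCG)."""
    ideal = np.sort(np.asarray(shares, dtype=float))[::-1]
    idcg = dcg(ideal)
    return dcg(shares) / idcg if idcg > 0 else 0.0

# Example: shares of one topical cluster over ranks 1..10 for a single search term
shares = [0.6, 0.4, 0.3, 0.2, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0]
print(dcg(shares), ndcg(shares))
```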
Regression Analysis. The metrics describe how the identified topical clusters
would manifest in the query suggestions for the search terms. The goal is to
identify significant differences between the groups of meta-attributes xi,p (e.g.,
female, SPD-member) in the perception-aware metrics for each cluster yi,c (e.g.,
DCG, nDCG). By doing so, topical bias (e.g., towards terms that describe pri-
vate or family topics) is detectable. To reveal significant differences, multiple
linear regression is performed using dichotomous dummy variables for the meta-
attributes as independent variables and the perception-aware metrics nDCG and
DCG as dependent variables. The model of this regression for topical clusters
c ∈ 1, ..., k can be expressed as

y_{i,c} = \beta_{0,c} + \beta_{1,c} x_{i,1} + \ldots + \beta_{P,c} x_{i,P} + \epsilon_i,

where \epsilon_i is the independent error term and i = 1, ..., N are the observation
indices. To avoid multicollinearity, one variable per attribute is used as the base
category and omitted.
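A sketch of this regression with statsmodels, on a hypothetical frame holding one row per search term with its per-cluster DCG score and meta-attributes, might be:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per search term
df = pd.DataFrame({
    "dcg_cluster3": [0.7, 0.5, 0.8, 0.4, 0.6, 0.3, 0.9, 0.2],
    "gender": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "party": ["CDU", "SPD", "CDU", "LINKE", "SPD", "CDU", "LINKE", "SPD"],
    "age_over_40": [1, 0, 1, 1, 0, 0, 1, 0],
})

# C(...) expands the categorical meta-attributes into dummy variables; the
# reference level acts as the omitted base category (here male / CDU).
model = smf.ols(
    "dcg_cluster3 ~ C(gender, Treatment(reference='male'))"
    " + C(party, Treatment(reference='CDU')) + age_over_40",
    data=df,
).fit()
print(model.summary())
```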
4 Analysis
After describing the main changes to the bias detection pipeline in the previous
section, this section explores its effectiveness by performing an analysis using the
pipeline on the most recent version of the same dataset of German politicians
used to test the first version of the pipeline.
Preprocessing. The updated preprocessing module with the added entity detec-
tion step still has to omit some of the crawled query suggestions. However, with a
loss of around 18%, less information is lost through the removal of longer query
suggestions than in the original pipeline. After cleaning, lemmatization, and entity detec-
tion, 5405 unique single word suggestions remained. The vector-transformation
algorithm was able to vectorize 3979 of these words.
Bias Analysis. The word embedding vectors for each of the suggestions were
used to perform a cluster analysis. A k-means approach was performed with
three clusters, as suggested by the employed heuristics. By manually evaluating
the clusters, we assigned a label that best describes the topic of each cluster (cf.
Table 1). The first cluster includes terms with personal meaning. The second
cluster consists mostly of names of cities and places that are of no political
significance. The third group contains words with political meaning, ranging from
topics (e.g., "drug report") through other political figures (e.g., "Guelen") to cities
and countries that are of political significance (e.g., "Afghanistan").
1 https://www.abgeordnetenwatch.de/.
2 https://www.wikidata.org/wiki/Wikidata:MainPage.
Table 1. Examples for terms of the clusters found by performing a k-means clustering
approach on the preprocessed single-word query suggestions. Translated from German
to English.
As described in Sect. 3, bias metrics are calculated based on the position and
frequency of suggestions corresponding to the clusters assigned in the previous
step. Before calculating the metrics, search terms with fewer than 10 cluster words
are dropped. This reduces the number of search terms to 2510, 1321 of which
have a federated state, 1146 a Party, 1238 a gender, and 1253 an age assigned.
1227 of the politicians have all meta-attributes assigned.
Table 2 shows the results of the multiple linear regression analysis performed
on the DCG and nDCG. The CDU and Baden-Württemberg were chosen as
base-categories for the attributes party and federated state. For all metrics, the
F-test rejected the joint null hypothesis that all coefficients are zero. Therefore,
relevant biased patterns can be considered for each of the metrics. Although
there are biased patterns for each cluster, cluster 2 shows notably fewer: very few
attributes show biased patterns towards suggestions of that topical cluster. This
is reflected in the amount of variance explained by the models for the clusters. The
regression model using the DCG scores could explain 7%, 1% and 5% of the
variance for clusters 1, 2 and 3. The nDCG performed very similarly with 5%,
2% and 6%, respectively.
For cluster 2 (names of cities and countries), only politicians of the CSU
(Christian Social Union in Bavaria) and the LINKE (democratic socialist party)
exhibit significantly higher DCG values than the base category. The members of
the LINKE also have significantly higher nDCG values. Cluster 2 suggestions,
names of places without political significance, appear on average 1.5 ranks higher
for LINKE politicians than for other parties. The perception-aware metrics show
a significant topical gender bias towards the cluster of political and economic-
related suggestions. The results show significantly (P < 0.01) lower average DCG
scores (cluster 3: male 0.7, female 0.49, cf. Fig. 2) for suggestions of cluster 3 if
the search term is a female person. This also shows in the corresponding nDCG
values. With a coefficient of roughly -0.1 (nDCG scores for cluster 3: male 0.46,
female 0.36, cf. Fig. 2), query suggestions with political topics appear on average
one rank lower if the searched person is female. Age was identified as a biasing
factor for both cluster 1 and cluster 3: the older the politician, the more politics-related
and the less personal-related the query suggestions are. Figure 2 shows the mean
scores for politicians over and under 40 years of age. The DCG score for cluster
1 is significantly higher for younger politicians, while for cluster 3 the opposite
is the case. This also reflects in the regression results. We found some significant
bias for both metrics within the political parties and the federated states towards
suggestions of the cluster of political terms as well as the cluster of private terms
(cf. Table 2).
Fig. 2. DCG and nDCG scores as well as total appearance percentages for gender and
age meta-attributes. The dataset includes 818 male and 420 female politicians, 1096
older than or exactly 40 years old and 1414 younger than 40 years.
5 Discussion
The developed pipeline was able to detect a significant systematic topical gender
bias: searches for female German politicians receive fewer, and lower-ranked,
suggestions that can be associated with politics and economics. Similarly, the
findings show a topical age bias. Query suggestions for older politicians contain
fewer and lower-ranked suggestions associated with the cluster of personal topics
and more and higher-ranked suggestions that fit the politics and
economics cluster. The overall percentage of explained variance in the metrics
seems low, but without comparison and assuming that many unknown factors
influence the topics of query suggestions, the results are satisfactory. It seems
that the quality of the identified clusters is essential for the effectiveness of the
bias identification abilities and the insights the pipeline can produce. By intro-
ducing more carefully selected groups of query suggestions as topical clusters,
possibly by full or partial manual selection of topic words or utilizing a language
model based methodology, the bias identification capabilities could be enhanced
further. Another subject for a follow-up study is to test how the pipeline per-
forms on non-person-related searches.
DCG has proven to be a useful metric for describing the perceived topical affil-
iation but can only be interpreted relative to other DCG scores. It can therefore
be used to describe systematic topical bias. The nDCG score can describe the
average rank of a cluster or single suggestion. This leads to results similar to
the DCG scores if the percentages of terms of the clusters are comparable. For
rare or single terms, or if the cluster sizes differ greatly, the metric might be a
very useful measure. This could not be tested with the used dataset, however.
Table 2. Results of the regression analysis for nDCG and DCG scores for each of
the clusters. Shown are the coefficients B along with the significance value of the test
for coefficients P, for all metric-attribute combinations. The F-test score for overall
significance and the adjusted R² measure can be found in the row labeled "Model".
All values are rounded; significant results (P < 0.05) are highlighted.
Compared to the simple percentages of cluster words, the ranking-aware metrics
DCG and nDCG did reveal more bias. Since the rank- and frequency-aware met-
rics offer more insight without compromising effectiveness, this speaks in favor of
the introduced metrics. Directly comparing the new metrics to the old metric is
difficult because the primary defining attribute of the perception-aware metrics
is that a different kind of bias is measured. The ability of the pipeline to reveal
bias was enhanced by introducing the perception-aware metrics. The results by
Bonart et al. explained little more of the variance inherent in the used metric
in cluster 3. However, more significant topical bias was discovered towards more
of the groups of meta-attributes and in more of the topical clusters. The new
pipeline showed significant biases for clusters 1 and 2 and identified systematic
topical biases towards age and gender as well as some of the parties and feder-
ated states. Overall, the findings and the number of groups in which bias was
discovered suggest an improvement to the bias detection capabilities. Due to the
perception-aware bias detection metrics, the findings produced by the pipeline
can be assumed to reflect bias that users would actually perceive.
6 Conclusion
The main goal was to introduce perception-aware metrics for bias detection in
query suggestions of person-related searches. Integrating rank and frequency of
cluster words into the bias detection pipeline enables detecting bias that consid-
ers how query suggestions are perceived. This is achieved by adopting the DCG
and nDCG metrics for bias detection.
By combining perception-aware metrics with topical clustering of query sug-
gestions, the bias detection pipeline is able to overcome the challenges posed
by the sparse character of query suggestions. The results presented in Sect. 4
are more meaningful and easier to interpret than the results produced by the
pipeline by Bonart et al. Perception-aware bias metrics represent a novel
approach to bias detection in query suggestions that could prove useful for other
bias detection scenarios as well.
References
1. Bolukbasi, T., Chang, K.W., Zou, J., Saligrama, V., Kalai, A.: Man is to com-
puter programmer as woman is to homemaker? Debiasing word embeddings (2016).
http://arxiv.org/abs/1607.06520
2. Bonart, M., Samokhina, A., Heisenberg, G., Schaer, P.: An investigation of biases
in web search engine query suggestions. Online Inf. Rev. 44(2), 365–381 (2019).
https://doi.org/10.1108/oir-11-2018-0341
3. Cai, F., de Rijke, M.: A survey of query auto completion in information
retrieval. Found. Trends Inf. Retr. 10(4), 273–363 (2016). https://doi.org/10.1561/
1500000055
4. Daniel J. Edelman Holdings, Inc.: 2020 Edelman Trust Barometer (2020). https://
www.edelman.com/trustbarometer
5. Dean, B.: We analyzed 5 million google search results. Here’s what we learned
about organic CTR (2019). https://backlinko.com/google-ctr-stats
6. Dev, S., Phillips, J.M.: Attenuating bias in word vectors. CoRR (2019). http://
arxiv.org/abs/1901.07656
7. Google: How Google Fights disinformation (2019). https://kstatic.googleuserc
ontent.com/files/388aa7d18189665e5f5579aef18e181c2d4283fb7b0d4691689dfd1bf9
2f7ac2ea6816e09c02eb98d5501b8e5705ead65af653cdf94071c47361821e362da55b
8. Hofmann, K., Mitra, B., Radlinski, F., Shokouhi, M.: An eye-tracking study of user
interactions with query auto completion. In: Li, J., Wang, X.S., Garofalakis, M.N.,
Soboroff, I., Suel, T., Wang, M. (eds.) Proceedings of the 23rd ACM International
Conference on Conference on Information and Knowledge Management, CIKM
2014, Shanghai, China, 3–7 November 2014, pp. 549–558. ACM (2014). https://
doi.org/10.1145/2661829.2661922
9. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-
strength natural language processing in Python (2020). https://doi.org/10.5281/
zenodo.1212303
10. Epstein, R., Robertson, R.E.: The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proc. Natl. Acad. Sci. 112(33), E4512–E4521 (2015). https://doi.org/10.1073/pnas.1419828112
11. Introna, L., Nissenbaum, H.: Defining the web: the politics of search engines. Computer 33, 54–62 (2000). https://doi.org/10.1109/2.816269
12. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques.
ACM Trans. Inf. Syst. 20(4), 422–446 (2002). https://doi.org/10.1145/582415.
582418
13. Kulshrestha, J., et al.: Search bias quantification: investigating political bias in
social media and web search. Inf. Retriev. J. 188–227 (2018). https://doi.org/10.
1007/s10791-018-9341-2
14. Lin, J., Nogueira, R., Yates, A.: Pretrained transformers for text ranking: BERT
and beyond (2020). https://arxiv.org/abs/2010.06467
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space (2013). https://arxiv.org/abs/1301.3781
16. Mitra, B., Shokouhi, M., Radlinski, F., Hofmann, K.: On user interactions with
query auto-completion. In: Proceedings of the 37th International ACM SIGIR
Conference on Research & Development in Information Retrieval, SIGIR 2014,
pp. 1055–1058. Association for Computing Machinery, New York (2014). https://
doi.org/10.1145/2600428.2609508
17. Niu, X., Kelly, D.: The use of query suggestions during information search. Inf.
Process. Manag. 50(1), 218–234 (2014). https://doi.org/10.1016/j.ipm.2013.09.002
18. Noble, S.U.: Algorithms of Oppression: How Search Engines Reinforce Racism.
NYU Press (2018). http://www.jstor.org/stable/j.ctt1pwt9w5
19. Olteanu, A., Diaz, F., Kazai, G.: When are search completion suggestions problem-
atic? In: Computer Supported Collaborative Work and Social Computing (CSCW).
ACM (2020)
20. Ooi, J., Ma, X., Qin, H., Liew, S.C.: A survey of query expansion, query sug-
gestion and query refinement techniques. In: 4th International Conference on
Software Engineering and Computer Systems (2015). https://doi.org/10.1109/
ICSECS.2015.7333094
21. Pitoura, E., et al.: On measuring bias in online information. CoRR. vol.
abs/1704.05730 (2017). http://arxiv.org/abs/1704.05730
22. Ray, L.: 2020 google search survey: How much do users trust their search results?
(2020). https://moz.com/blog/2020-google-search-survey
23. Robertson, R.E., Jiang, S., Lazer, D., Wilson, C.: Auditing autocomplete: sugges-
tion networks and recursive algorithm interrogation. In: Boldi, P., Welles, B.F.,
Kinder-Kurlanda, K., Wilson, C., Peters, I., Jr., W.M. (eds.) Proceedings of the
11th ACM Conference on Web Science, WebSci 2019, Boston, MA, USA, 30 June–
03 July 2019, pp. 235–244. ACM (2019). https://doi.org/10.1145/3292522.3326047
24. Wang, P., et al.: Game of missuggestions: semantic analysis of search-autocomplete
manipulations. In: NDSS (2018)
Crucial Challenges in Large-Scale Black
Box Analyses
1 Introduction
to deceive voters [6], search engine results can reinforce racism [25], or ads with
deceiving medical advice can be distributed to users with a severe illness [28].
Some of these actions are illegal; others are only ethically questionable. Some fall clearly within the responsibility of the ad designer, e.g., factual correctness; others lie more on the side of the technical system, like the targeting of tainted political ads or the targeted distribution of medical ads with dubious content, for which responsibility is difficult to assign.
Missing regulation for assigning responsibility is one problem; another obstacle is that these cases are often discussed on the basis of anecdotal evidence instead of clear-cut data. For example, in the course of Brexit, the journalist Carole Cadwalladr noticed that many people in her hometown voted Leave because they had seen targeted political ads on Facebook [6]. However, the ads a user was shown on Facebook cannot be retrieved after the fact, so there is no quantifiable evidence.
To enable an analysis of who gets to see what, there are in principle two solutions: getting insight into the algorithmic system and all processes around it or, if that is not attainable, a so-called black box analysis, which observes and analyzes patterns in the input and output of such a system without insight into its inner workings.
Black box analyses can be used to audit the decisions of an algorithmic system and to detect problematic patterns in them. This is a first and necessary, but not sufficient, step to hold the providers of an algorithmic system accountable. Accountability in general can be defined as "a relationship between an actor and a forum, in which the actor has an obligation to explain and to justify his or her conduct, the forum can pose questions and pass judgement, and the actor may face consequences" [4, p. 442]. Following Bovens' definition, Wieringa defined algorithmic accountability as follows: instead of an actor explaining and justifying his or her own conduct, algorithmic accountability focuses on the behavior of the algorithm or the algorithmic system in question, which has to be justified and explained by the person or company who puts it to use. Accordingly, this framework requires (1) an actor (individual, collective, or organizational) who explains the behavior of the algorithm to (2) a forum which then challenges this account. The (3) relationship between the two is shaped by disclosure and discussion of (4) the account and its criteria, and ultimately by (5) the consequences imposed by the forum [31]. If the actor is held accountable for the results of proprietary algorithms, the latter usually remain undisclosed or obfuscated by design, as they constitute trade secrets whose disclosure would allow gaming the system [18]. Thus, without any real insight into the algorithmic system and without any hard facts, any demand for algorithmic accountability is a toothless tiger and must fail: if the forum has no means to challenge the account of the actor, the actor can in essence not be held accountable.
So far, there have been only a handful of successful attempts to scrutinise the services these platforms provide with such black box analyses, e.g. [1,7,23]. Most of these were sparked by concrete evidence or a tangible suspicion which determined the subsequent process of analysis. Why are there not more black box analyses on this important topic, if they are the necessary basis for a public discourse?
In this paper, we discuss the design process and the challenges that arise when conducting a large-scale black box analysis, mainly based on a study we conducted in 2019/20.
The study arose from the work of Anna Couturier at the University of Edinburgh and EuroStemCell in the area of public information, patient decision-making, and stem cell research. Her focus on the co-development of resources on stem cell treatments by patients and researchers pointed to a larger question of how information about medical treatments moves through digital spaces. In particular, she investigated the impact of search engines as a primary means for patient-led information gathering on their conditions and diseases and the subsequent decision making. Feedback from patient advocates from the Parkinson's Disease and Multiple Sclerosis communities indicated that patients anecdotally reported that search queries about their conditions often returned advertisements from private clinics offering unproven treatments1. This led to an initial study of advertisements of unproven stem cell treatments within the United Kingdom [13]. These initial investigations, however, were unable to address the largest actor within this network of knowledge dissemination: Google Search itself. This blind spot led Anna Couturier to reach out to us to conduct a black box analysis of how often these ads appear and whether they seem to be targeted at patients rather than at a healthy control group. In our "2019 Eurostemcell Data Donation Project" we were able to collect evidence that patients do actually see more of these ads [28], despite a new policy by Google to ban stem cell therapy ads [3]. In Sect. 2 the concept of black box analysis and its limitations are presented. In the following section, the above-mentioned Eurostemcell Data Donation is showcased with its design and results. Section 4 derives general challenges in conducting a black box analysis, based on the different experiences that were made. In Sect. 5 the basis for the demand for long-term watchdog analyses to ensure algorithmic accountability is laid out, and finally Sect. 6 gives a short summary.
The concept of black box analysis can be seen as a descendant of reverse engineering. Diakopoulos defines reverse engineering as "the process of articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works" [10]. It allows the analysis of an opaque system (the black box) by observation of in- and outputs and deduction of the inner mechanics that transform the former into the latter. It can best be achieved if, next to the observation of the
1 These initial impressions were collected during the Wellcome Trust Seed project-funded workshop "Patienthood and Participation in the Digital Era: findings and future directions" hosted by the Usher Institute at the University of Edinburgh in August 2018 (Erikainen et al. [14]).
Of course, not all kinds of questions can be answered by such an analysis [30]: one problem is that search engines, like most other algorithmic systems embedded in a complex socio-technical system, are not a stable research subject.
Despite the above limits of a black box analysis, it is still a useful tool: to assess the social consequences of an algorithm's deployment, absolute knowledge about its workings may not always be necessary [9]. A "critical understanding of the mechanisms and operational logic" [5, p. 86] is sufficient, as long as it considers those conditions that are required to understand a phenomenon [19]. If that can be achieved, the results of a black box analysis can constitute a meaningful algorithmic accountability relationship in the sense of Wieringa [31] between those who can access its results (as the forum) and the algorithm provider (as the actor who is held accountable).
However, designing and conducting a reliable black box analysis of search results and ad distributions proves to be challenging, as we report in the next section using the example of our 2019 Eurostemcell Data Donation Project (EDD) and other black box analyses conducted in recent years.
Design Decisions in Phase 1: The design of the study first requires the choice of an analysis strategy, namely whether the data is collected with bot accounts, which is called a scraping audit (1a), with bot accounts simulating humans (1b), or with real people's user accounts, which is called a crowdsourced audit or data donation (1c), following [29]. We chose to use both the first and the third approach.
By choosing the crowdsourced approach, patients can contribute to scientific progress and are invited to take an active stand in enacting their autonomy, expressing solidarity, and benefiting from indirect reciprocity [27]. For the analysis, we recruited voluntary participants through patient advocacy groups to donate their data. A second group consisted of users without any of the diseases we were looking at, recruited through newsletters and social media. We further added bot accounts with no search history to establish a baseline against which we could compare our findings, in order to understand whether patients would get more ads than user accounts without any known health information.
Design Decisions for Phase 2: The scraping audit was enabled by a browser plugin. It is important to note that we would rather have integrated the data collection into the mobile Google app; however, this would have been technically challenging and possibly illegal at the time. In any case, the plugin automates the search queries and the data collection, i.e., once it is installed and running, the users do not have to do anything. It thus provided a scalable, platform-independent, and accessible solution that required minimal interaction from the user during the donation. For more than four months, the plugins of our participants searched six times per day for keywords related to stem cells or specific diseases, as long as the browser was running. The plugin scraped the search engine result pages delivered by Google to extract search results and ads.
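The paper does not include the plugin's source; as a rough illustration of the scraping step, the following Python sketch parses a saved result page and extracts organic result links and ad links. The CSS selectors are assumptions for illustration only and, as discussed later in this section, break whenever Google changes its result-page markup.

```python
from bs4 import BeautifulSoup

def extract_results_and_ads(serp_html: str) -> dict:
    """Parse a saved Google result page and pull out organic and ad links.
    The selectors below are illustrative guesses, not the plugin's code."""
    soup = BeautifulSoup(serp_html, "html.parser")
    organic = [a.get("href") for a in soup.select("div#search a[href]")]
    ads = [a.get("href") for a in soup.select("div#tads a[href]")]
    return {"organic": organic, "ads": ads}
```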
Our investigation of the crawled data showed that, despite an official ban of stem cell therapy related ads by Google at the beginning of the study [17], the captured search results still included ads offering unproven stem cell therapy treatments [28]. On top of that, participants who self-identified as affected received more advertisements than the control group.
As our study question was whether patients got search results and ads for
(unproven) stem cell therapies, we needed to involve real patients in the study.
This also entailed that we needed a control group of real people not suffering
from the diseases under study.
Problems with Participant Engagement and Enrollment. In general, the more technical proficiency participation requires, the more cumbersome enrollment becomes. This is particularly salient in our study, as the conditions faced by our targeted study groups may in fact contribute to difficulties in on-boarding. For example, patients with Parkinson's Disease are on average over 65 years old at first diagnosis [26]. This may lead to challenges with enrollment due to this age demographic's unfamiliarity with the technology necessary to take part. In our final study iteration, we were pleased to enroll around 100 patient participants. This number is comparatively large for a socio-anthropological medical study. However, for a large-scale statistical analysis of the results, it is comparatively small.
It would be easiest if participants could grant scientific study teams restricted access to their account on the given platform [24]: for example, if they were able to go to their Google account, enroll in the study, and have search results and ads automatically collected and sent to the conductors of the study. However, at the time being, there is no way to access specific information of social media accounts, even with the users' full consent, neither for platforms such as the various Google services nor for Facebook or Twitter. Facebook did offer the Facebook Graph API, which granted targeted access to users' accounts if they gave their permission; however, following the Cambridge Analytica scandal, this access was restricted so much that black box analyses targeting specific aspects like ad distribution or specific messages in the timeline are no longer possible from outside of Facebook.
Diversity of Hardware and Software Environments. Enrolling real persons also entails being confronted with a multitude of devices, operating systems, browsers (in different versions), and other software running on the device and interfering with the data collection. In our black box analysis regarding the election in 2017, multiple participants were not able to install the plugin, or it would not send any data, or it would hinder the normal usage of their browsers, e.g., by excessive consumption of computing power. During the running study, we were not able to figure out whether any of this was caused, e.g., by their ad-blocking software. Another small problem arose from the different settings of the participants' Google user accounts, e.g., the preferred language or the preferred number of search results displayed on one page.
Problems Scraping Websites. The only technology left to collect the data that real people see in their search results was the browser plugin. It is basically a scraper, which is very susceptible to any change in how the result page in the browser is structured. For example, in our black box analysis study concerning the German election in 2017, Google's layout for the result page changed midway. This resulted in empty data columns in our data collection for some days until we noticed the problem. In our study on dynamic pricing, we learned that web shops actively fight price scraping by regularly changing the structural design of their pages, which makes any attempt to investigate personalised pricing based on scraping very difficult.
We learned, on the one hand, that it is absolutely necessary to check the collected data regularly and, on the other hand, to make any update procedure for the data-collecting software as smooth as possible. Participants are very likely to drop out of the study if they have to re-install or manually update the data-collecting application, as we learned in our 2017 black box analysis study, where one of our plugins had a severe bug: we could not update it remotely, and thus a re-installation was necessary. Here we encountered the double-edged challenge of ensuring privacy: in order to maintain the privacy of the data donors, we did not collect contact information but rather relied on the donors themselves to install and run the donation plugin. We did not even have an email list or another communication channel to make our participants aware of the problem.
Problems Caused by Dynamic Internet Content. Another general problem in the data collection is the dynamic nature of the content advertised in ads or search results: very often, we collected links from ads that were already invalid at the time of the analysis. We learned that it might have been better to crawl these links at collection time and to save the respective pages for future analysis. However, with A/B testing being abundant, where part of the users following a link get version A of some website and others get version B (or C, D, ...) of it [11], it would be necessary to follow the link from within the plugin. That is, the participant's browser would open not only the Google webpage but also any other webpage advertised or displayed on the results page. This entails problems of safety and data privacy that are difficult to solve; moreover, it might be illegal with respect to the general terms and conditions of Google's search engine service.
Almost No Manipulation of Input Possible. While the crowdsourced approach has the huge advantage of collecting data that users actually see, it is almost impossible to change the "input to the search engine" in any meaningful way in order to better understand the real behavior of the system. The "input" to the search engine in a personalised account is given not only by the keywords, the time of day the search is conducted, the IP address of the machine used to conduct the search, and so on, but also by the personal history of searches, by web usage in general, and by properties of the human user imputed by the software (like age, income, gender, etc.). None of this can easily be changed such that a desired user profile is consistently achieved. It was this restriction that prompted us to adopt the dual approach of virtual bot-based data gathering. However, the bot-based approach came with its own challenges.
Our study was conducted in four countries, where we rented a set of so-called
virtual private servers to run searches from IP addresses located in the same
country.
Problems with Bot Detection. An unfortunate drawback of bot-based approaches is that bots are routinely identified and then blocked by most popular online platforms. While these measures are necessary to detect malicious bot attacks, they also hinder mainly benign, public-interest-driven scientific investigations. This includes any regular black box analyses by NGOs or government bodies established to hold software or platform providers accountable.
Problems with Regionalisation. A small problem that we encountered is the localisation of services by IP addresses and other indicators of the place from which a service is approached: when using virtual private servers, the IP addresses are not as distributed over the country as if persons would use the
Next to catching profane bugs in the plugin software, a pre-study of reduced length and with a reduced number of participants can, e.g., help to estimate the size of the effect that is to be studied, thereby indicating the number of participants needed to run reliable statistical analyses. It helps to discover problems with the technical setup that occur very often, giving room for a technical improvement of the user experience. It might also detect problems with quickly changing website layouts, e.g., when website owners use that tactic to hinder scraping, as discussed above.
It will also help to reveal at least some of the weaknesses of the study design and to mitigate unanticipated problems. For example, in the study concerning the election of 2017 [23], we were not aware of the fact that searches on Google could result in Google+ pages being displayed. Google+ was Google's attempt to create a social network platform; it allowed users to build up contacts and to post and comment on URLs. When a person who was in the contact list was searched for on Google, all their contact data would be shown on the result page, in a special area reserved for that information. Similarly, if a keyword was searched for that was associated with any content on the user's Google+ account, that content could also become part of the search results. We did not scrape this reserved area of the result page, which could possibly contain personal data of contacts of our participants. However, we did scrape the search engine results and thus needed to make sure to delete all results from the Google+ accounts, because otherwise these could have been used to deanonymise our participants. Since we did not have time for a pre-study, we were confronted with this problem in the full study, which created some problems in the data collection.
We also discovered only in the analysis of the fully collected data that, most probably, the preferred language setting of our participants in their Google accounts produced some of the anomalies that we encountered [23]. However, because we were not aware of this additional "input" to the search engine, we did not collect this information and thus cannot be sure about its effect.
6 Summary
In this paper we showed that there are a number of technical challenges that hinder large-scale black box analyses of digital platforms. Our group found it an important reminder that the final output of these algorithms is not simply search results, but the potential of an individual impacted by a life-altering disease to be exposed to at best economically exploitative practices and at worst potentially risky, unproven medical treatments. Some of the challenges discussed in this paper can be mitigated by a careful study design including a pre-study. However, the resources needed for this and for a large-scale analysis that includes high numbers of patients should not be underestimated. Next to the technical challenges that can be mitigated, there are major obstacles that can only be resolved together with the platform providers. Where accountability is necessary, a watchdog approach cannot be realized without solving these problems. The study we conducted shows that this is a societal problem that cannot be ignored any longer. We see that political bodies like the Deutsche Bundestag [12], the Data Ethics Commission [8] and the European Parliament [15] are currently searching for solutions.
Acknowledgment. The presented project EDD has been partially funded by the EU stem cell public engagement project EuroStemCell2 and by a generous grant from the University of Edinburgh School of Social and Political Science. The research was supported by the project GOAL "Governance of and by Algorithms" (funding code 01IS19020), which is funded by the German Federal Ministry of Education and Research.
References
1. Andreou, A., Venkatadri, G., Goga, O., Gummadi, K., Loiseau, P., Mislove, A.:
Investigating Ad transparency mechanisms in social media: a case study of Face-
book’s explanations. In: NDSS 2018 - Network and Distributed System Security
Symposium, San Diego, CA, United States, pp. 1–15 (2018)
2. Ashby, W.R.: An Introduction to Cybernetics. Chapman & Hall Ltd., London
(1957)
3. Biddings, A.: A new policy on advertising for speculative and experimental medi-
cal treatments. Google Ads Help (2019). https://support.google.com/google-ads/
answer/9475042. Accessed 11 Mar 2021
4. Bovens, M.: Analysing and assessing accountability: a conceptual framework1. Eur.
Law J. 13(4), 447–468 (2007). https://doi.org/10.1111/j.1468-0386.2007.00378.x
5. Bucher, T.: Neither black nor box: ways of knowing algorithms. In: Kubitschko,
S., Kaun, A. (eds.) Innovative Methods in Media and Communication Research,
pp. 81–98. Palgrave Macmillan, Cham (2016). https://doi.org/10.1007/978-3-319-
40700-5 5
6. Cadwalladr, C.: Facebook’s role in Brexit - and the threat to democracy
(2019). TED Talk. https://www.ted.com/talks/carole cadwalladr facebook s role
in brexit and the threat to democracy. Accessed Mar 11 2021
7. Datta, A., Tschantz, M.C., Datta, A.: Automated experiments on Ad privacy set-
tings. Proc. Priv. Enhancing Technol. 2015(1), 92–112 (2015). https://doi.org/10.
1515/popets-2015-0007
2 www.eurostemcell.org.
26. Pagano, G., Ferrara, N., Brooks, D.J., Pavese, N.: Age at onset and Parkinson
disease phenotype. Neurology 86(15), 1400–1407 (2016). https://doi.org/10.1212/
WNL.0000000000002461
27. Prainsack, B.: Data donation: how to resist the iLeviathan. In: Krutzinna, J.,
Floridi, L. (eds.) The Ethics of Medical Data Donation. PSS, vol. 137, pp. 9–22.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-04363-6 2
28. Reber, M., Krafft, T.D., Krafft, R., Zweig, K.A., Couturier, A.: Data donations
for mapping risk in google search of health queries: a case study of unproven stem
cell treatments in SEM. In: IEEE Symposium Series on Computational Intelligence
(SSCI), pp. 2985–2992 (2020)
29. Sandvig, C., Hamilton, K., Karahalios, K., Langbort, C.: Auditing algorithms:
research methods for detecting discrimination on internet platforms. Data Discrim.
Conv. Crit. Concerns Prod. 22, 4349–4357 (2014)
30. Seaver, N.: Knowing algorithms. In: Media in Transition, Cambridge, MA, vol. 8
(2014)
31. Wieringa, M.: What to account for when accounting for algorithms: a system-
atic literature review on algorithmic accountability. In: Proceedings of the 2020
Conference on Fairness, Accountability, and Transparency, pp. 1–18 (2020)
32. Zweig, K.A., Krafft, T.D., Klingel, A., Park, E.: Sozioinformatik: Ein neuer Blick auf Informatik und Gesellschaft. Carl Hanser Verlag (2021, in press)
New Performance Metrics for Offline
Content-Based TV Recommender System
Luisa Simões, Vaibhav Shah(B) , João Silva, Nelson Rodrigues, Nuno Leite,
and Nuno Lopes
Abstract. The past decade has seen a fast rise in popularity of recom-
mendation systems provided by many entertainment and social media
services. However, despite the recognised advances in different recom-
mendation approaches and technologies, there remain many challenges,
particularly in TV content recommendation systems. More precisely,
machine learning based TV content recommendation systems suffer from
a class imbalance problem; hence, it is difficult to evaluate the system
using traditional metrics. Moreover, specific challenges arise during the
development phase, when the system operates in ‘offline’ mode. This
means the recommendations are not actually presented to users - mak-
ing it even more difficult to measure the quality of those recommenda-
tions. This paper presents a proof-of-concept demonstrator of a television
recommendation system, based on Content-based Filtering, as a contri-
bution towards building a full-scale intelligent recommendation system.
New evaluation metrics are proposed for ‘offline’ testing mode, while also
tackling the class imbalance problem. The experimental results, based on
real usage data, are promising and help in defining the future path as
presented along with the conclusion.
1 Introduction
a recent manifesto discusses the need for a science of forecasting system performance rather than only focusing on intrinsic evaluation [6]. In an attempt to overcome these and other issues, important events and competitions in the area (e.g., the Netflix Prize, the RecSys challenge) have over the years contributed a wide range of strategies to optimise recommendation systems [1], without forgetting the challenge of 'offline evaluation': determining whether all metrics evaluated offline are necessary and provide additional valuable information for an RS in a real environment [14]. For more than a decade, industry and academia have been exploring multiple strategies to predict user feedback from historical models of user behaviour [9]. This line of research aims to predict items from implicit feedback [10], rather than relying on a set of explicitly generated ratings, which are possible in online systems. To overcome this challenge, some authors use models of user interaction (i.e., click models) to construct estimators that learn statistically efficiently in an offline evaluation environment [12]. Although an offline environment is not the ideal setup to evaluate a recommender system, it is always necessary to perform some kind of evaluation before implementing an algorithm in a production environment. Other relevant work that addressed offline evaluation metrics was performed in [11,13].
This leads the authors to conclude that new metrics should be explored to analyse and compare different algorithms. Three new metrics are proposed, namely Genre Hit-Ratio, User Hit-Ratio and True Positive Quality, to give system developers insight into how well a user's preferences were predicted: firstly, in terms of the genres recommended/predicted (since genre was the principal 'feature' in this phase of the system development); secondly, in terms of how many of the users actually watched at least one of the recommended contents; and finally, in terms of the quality of visualisation, to verify how well the selected recommendations served the users' interests.
2.1 Architecture
The current phase of the development cycle involves a simple content-based recommendation engine that outputs filtered lists of recommendations based on each user's genre-based content preferences. The idea is to eventually place this engine as a functional component inside the broader framework of a cable TV service provider, i.e. the system operates in a real-life setup with real data. Nevertheless, the presented engine is a fully functional system that can also operate in standalone mode, with all the required modules. The complete
The following sub-sections describe the input datasets and each of the steps
in the recommendation pipeline.
The input data is supplied in two datasets, Usage and Content, and can be summarised as follows: 3 months of visualisation history (usage) data from 100k devices in total, comprising 12M content visualisations of 332k unique contents. The content dataset is an exhaustive list of all the contents offered by the service provider, out of which the recommendations are prepared. The usage dataset is a list of all the visualisations, by all the users, during a specific period, based on which the recommendations are generated. Both datasets contain several columns describing the visualisation history and the contents. Table 1 describes the most relevant features of each dataset for the presented use case.
Table 1. Most relevant features of the Usage and Content datasets.

Usage dataset
  Id: Device id
  Start: Visualisation start datetime
  End: Visualisation end datetime
  Content id: Content id, to match with the content dataset

Content dataset
  Content id: Unique id to identify the content
  Content dur: The content duration in seconds
  Genre: The genre, or list of genres, of the content
  Title: The content's title
  Sub title: The content's original title
  Season: The season of the content (in the case of a TV series)
  Episode: The episode of the content (in the case of a TV series)
2.3 Pre-processing
Data Cleaning - Entries for visualisations under 5 min were discarded, since they reveal little to no information about the user's interests and removing them makes the dataset lighter. In addition, invalid entries, i.e. entries with crucial parameters missing or with undefined values, were removed.
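As an illustration of this cleaning step, the following Python/pandas sketch drops incomplete rows and short visualisations; the column names follow Table 1, and the exact rules are assumptions based on the description above rather than the authors' code.

```python
import pandas as pd

def clean_usage(usage: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing crucial fields and visualisations shorter than
    five minutes (column names follow Table 1)."""
    usage = usage.dropna(subset=["Id", "Start", "End", "Content id"])
    duration = pd.to_datetime(usage["End"]) - pd.to_datetime(usage["Start"])
    return usage[duration >= pd.Timedelta(minutes=5)]
```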
Implicit Ratings - The input datasets do not contain a rating for each visualisation (or each content) based on the users' explicit feedback, so a rating is inferred for each content. The rating R is inferred from the visualisation time in comparison to the content's actual duration. In this case, the visualisation time is the time between the selection and play (tstart) of a content and the action of closing (tend) the visualisation panel of that content, which causes a percentage bigger than 100% in some cases. The rating R is calculated as denoted in Eq. (1):
\[
R = \mathrm{Norm}\!\left(\frac{t_{\mathrm{end}} - t_{\mathrm{start}}}{d}\right) \times 10 \qquad (1)
\]
where
\[
R \in \mathbb{Z},\; 1 \le R \le 10, \qquad \mathrm{Norm}(x) = \begin{cases} x, & \text{if } x \le 1\\ 1, & \text{if } x > 1 \end{cases}
\]
\(t_{\mathrm{start}}\) = visualisation start time, \(t_{\mathrm{end}}\) = visualisation end time, \(d\) = content duration.
The normalisation function in Eq. (1) maps every value bigger than 1 (a percentage above 100%) to the maximum value of 1; the normalised values are then scaled to integer ratings in the [1, 10] interval.
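A minimal Python sketch of this rating inference is given below; the rounding to the nearest integer is an assumption, since the text only states that R is an integer between 1 and 10.

```python
from datetime import datetime

def implicit_rating(t_start: datetime, t_end: datetime, duration_s: float) -> int:
    """Implicit rating R from Eq. (1): watched fraction of the content,
    capped at 1, scaled to an integer in [1, 10]. Rounding to the nearest
    integer is an assumption; the paper only states that R is an integer."""
    watched_fraction = (t_end - t_start).total_seconds() / duration_s
    norm = min(watched_fraction, 1.0)
    return max(1, min(10, round(norm * 10)))
```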
Data Splitting - The data was split into train and test datasets; the former uses approximately the first 11 weeks of data and the latter the last week. The following block only applies to the training data, as seen in Fig. 1; the testing data is used in the final step to evaluate the recommendation system.
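A chronological split of this kind could look as follows in Python/pandas; the seven-day test window follows the description above, while the column names and the use of the visualisation start time as the split key are assumptions.

```python
import pandas as pd

def time_split(usage: pd.DataFrame, test_days: int = 7):
    """Chronological train/test split: the last `test_days` of visualisations
    form the test set, everything before (about 11 weeks here) the train set."""
    start_times = pd.to_datetime(usage["Start"])
    cutoff = start_times.max() - pd.Timedelta(days=test_days)
    return usage[start_times <= cutoff], usage[start_times > cutoff]
```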
Based on the values obtained from the confusion matrix, Table 3 shows the calculations as well as the meanings of the traditional metrics in the present context. These measures are of little significance, for several reasons:
– Only 10 classes, out of a large number, are recommended for selection;
– The number of items recommended (predicted) is always vastly smaller than the number of items not recommended (class imbalance);
– Multiple interesting items are left out of the final list of recommendations;
– A True Positive requires the selection of only one of the 10 recommendations.
a high number of contents that are not watched than the ones that were in a
small list of recommendations/predictions and were not watched.
Considering these issues with the traditional metrics for a TV recommendation system, and given the challenge of testing the system in offline mode, new metrics were proposed.
\[
UHR = \frac{\sum_{i=1}^{U} u_i}{U}, \qquad \text{where } u_i = \begin{cases} 0, & \text{if } TP_i = 0\\ 1, & \text{if } TP_i > 0 \end{cases} \qquad (2)
\]
In Eq. (2), \(u_i\) indicates whether the \(i\)-th user (among the \(U\) users with any visualisation) has at least one True Positive, and \(TP_i\) is the True Positive count of the \(i\)-th user.
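A small Python sketch of the UHR computation, under the assumption that the per-user True Positive counts have already been derived from the confusion matrix:

```python
def user_hit_ratio(tp_per_user: dict) -> float:
    """UHR from Eq. (2): fraction of users with at least one true positive,
    i.e. users who watched at least one of their recommended contents."""
    if not tp_per_user:
        return 0.0
    return sum(1 for tp in tp_per_user.values() if tp > 0) / len(tp_per_user)
```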
of the chosen features; therefore, GHR was developed to measure the accuracy of the predictions of users' genre-wise preferences. With this new metric it is possible to observe how many of the watched contents belong to the predicted genres. This metric is crucial to evaluate whether or not the recommendations match the users' taste. If a user watches only comedy and drama contents, the recommendations will be other contents that contain these two genres. Even if the user does not select any of the ten recommendations, it is important to evaluate whether the genre was accurately 'predicted'. To illustrate, consider that the user watched five new contents after he/she received the recommendations. Four out of these five were from the comedy or drama genres, which gives a GHR of 80%. In this example, although the user never watched any of the recommendations, the contents on this list were compatible with the user's genre taste.
To formulate the calculation, let \(C_i\) be the total number of contents watched by the \(i\)-th user. Further, consider a genre function \(G(\cdot)\), such that \(G(C)\) gives the list of all the genres of content \(C\); additionally, consider a user-wise genre function \(G_U(\cdot)\), such that \(G_U(u)\) returns the preferred genres of user \(u\). Then, the Genre Hit-Ratio is denoted in Eq. (3):
\[
GHR = \frac{\sum_{i=1}^{U} \sum_{j=1}^{C_i} w_{ij}}{\sum_{i=1}^{U} C_i}, \qquad \text{where } w_{ij} = \begin{cases} 1, & \text{if } G(c_{ij}) \cap G_U(u_i) \neq \emptyset\\ 0, & \text{if } G(c_{ij}) \cap G_U(u_i) = \emptyset \end{cases} \qquad (3)
\]
In Eq. (3), \(w_{ij}\) indicates whether the \(j\)-th content watched by the \(i\)-th user has any of that user's recommended/predicted genres.
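The following Python sketch computes GHR from per-user genre sets; the data structures are assumptions, and the toy example reproduces the 80% comedy/drama case from the text.

```python
def genre_hit_ratio(watched_genres: dict, preferred_genres: dict) -> float:
    """GHR from Eq. (3): share of watched contents whose genres overlap the
    genres predicted/recommended for that user. `watched_genres` maps a user
    to a list of genre sets, one set per watched content."""
    hits = total = 0
    for user, contents in watched_genres.items():
        preferred = preferred_genres.get(user, set())
        for genres in contents:
            total += 1
            hits += bool(genres & preferred)
    return hits / total if total else 0.0

# toy usage reproducing the comedy/drama example from the text (GHR = 0.8)
watched = {"u1": [{"comedy"}, {"drama"}, {"action"}, {"comedy", "crime"}, {"drama"}]}
prefs = {"u1": {"comedy", "drama"}}
print(genre_hit_ratio(watched, prefs))
```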
True Positive Quality (TPQ). The traditional metrics discussed before measure the system's performance merely based on quantity. However, even when a user watches all the recommended contents (100% precision), it does not mean that they were relevant to him/her. To assess relevance, it is necessary to look at the quality of the watched recommendations, which is commonly evaluated based on user feedback. For offline contexts, the authors propose an implicit feedback approach, True Positive Quality, which is the mean value of the implicit ratings of the True Positive (Table 2) contents. With this, it is possible to evaluate how much the users enjoyed the recommendations that were presented, as in Eq. (4). These values range from 1 to 10: if the system obtains a TPQ of 9, it means that the 'accepted recommendations' were extremely relevant to the users; on the other hand, if this value is 5 or below, it means that the users found the recommendations to be mediocre.
\[
TPQ = \frac{\sum_{i=1}^{U} \sum_{j=1}^{N_i} R_{ij}}{\sum_{i=1}^{U} N_i} \qquad (4)
\]
where \(R_{ij}\) = rating of the \(j\)-th true positive content for the \(i\)-th user (from Eq. (1)), and \(N_i\) = number of true positive contents for the \(i\)-th user.
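A minimal Python sketch of the TPQ computation, assuming the implicit ratings (Eq. (1)) of each user's true positive contents are already available:

```python
def true_positive_quality(tp_ratings: dict) -> float:
    """TPQ from Eq. (4): mean implicit rating (Eq. (1)) over all true positive
    contents of all users. `tp_ratings` maps a user to the ratings of the
    recommended contents that user actually watched."""
    all_ratings = [r for ratings in tp_ratings.values() for r in ratings]
    return sum(all_ratings) / len(all_ratings) if all_ratings else 0.0
```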
Analysing the traditional metrics, it is possible to see that the values for each metric are either extremely low or extremely high. All tests have precision and false positive rate values under 1%. Recall values show a bit more variance (2.52%) in comparison and have slightly higher values, though still under 3%. However, recall values increase when the data contains sparse values, which is the case in the presented work. Accuracy is on the other side of the spectrum, with all tests scoring higher than 99.8% and a variance of 0.06%. In general, none of these metrics gives good insight into the recommendation system, as none of them reflects the extent of the changes between pre-processing blocks and the different algorithms.
For the proposed new metrics, it is possible to observe a bigger variance across the test results, i.e. 5.45%, 16.07% and 1.27% respectively, but most importantly these metrics give more relevant information about the system:
1. User Hit-Ratio measures the share of users that watched the recommended contents;
2. Genre Hit-Ratio compares the genres of the watched contents to those that were recommended (considering that genre is a factor of interest in this work);
3. True Positive Quality expresses the quality of the accepted items by calculating the mean of their implicit ratings.
4.1 Conclusion
This is an ongoing work, and there are already several lessons learned from the presented experiments. Some of the shortcomings are planned to be corrected, on multiple levels, in the next development cycle, which is ongoing at the time of this publication. For example, improved input datasets will be used, with more information regarding the visualisations and contents, to better predict a user's visualisation rating as well as to prepare a better (more relevant) content list for recommendation. Also, the current methodology of preparing training and test sets introduced a time bias. To address this, the new phase includes the implementation of a 'prequential methodology' [7] that splits the data into shorter periods of time, in a sliding-window manner. The feature engineering step is also being enhanced to lessen the bias in the implicit ratings and to prepare new features that include currently discarded content information, such as cast and crew. Additionally, a collaborative filtering approach based algorithm is
References
1. Abel, F., Deldjoo, Y., Elahi, M., Kohlsdorf, D.: RecSys challenge 2017: offline and
online evaluation. In: RecSys 2017 - Proceedings of the 11th ACM Conference on
Recommender Systems (2017). https://doi.org/10.1145/3109859.3109954
2. Beel, J., Langer, S.: A comparison of offline evaluations, online evaluations, and
user studies in the context of research-paper recommender systems. In: Kapidakis,
S., Mazurek, C., Werla, M. (eds.) TPDL 2015. LNCS, vol. 9316, pp. 153–168.
Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24592-8 12
3. Cañamares, R., Castells, P., Moffat, A.: Offline evaluation options for recommender
systems. Inf. Retriev. J. 23(4), 387–410 (2020). https://doi.org/10.1007/s10791-
020-09371-3
4. Çano, E., Morisio, M.: Hybrid recommender systems: a systematic literature
review. Intell. Data Anal. 21(6), 1487–1524 (2017)
5. Ferraro, A., Bogdanov, D., Choi, K., Serra, X.: Using offline metrics and user
behavior analysis to combine multiple systems for music recommendation (2019)
6. Ferro, N., et al.: From evaluating to forecasting performance: how to turn infor-
mation retrieval, natural language processing and recommender systems into pre-
dictive sciences: manifesto from dagstuhl perspectives workshop 17442, Dagstuhl
Manifestos, vol. 7, no. 1, pp. 96–139 (2018)
7. Gama, J., Sebastião, R., Rodrigues, P.P.: On evaluating stream learning algo-
rithms. Mach. Learn. 90(3), 317–346 (2012). https://doi.org/10.1007/s10994-012-
5320-9
8. Ge, M., Delgado-Battenfeld, C., Jannach, D.: Beyond accuracy: evaluating recom-
mender systems by coverage and serendipity. In: 4th ACM Conference on Recom-
mender Systems, pp. 257–260 (2010)
9. Hu, Y., Volinsky, C., Koren, Y.: Collaborative filtering for implicit feedback
datasets. In: IEEE International Conference on Data Mining, ICDM (2008)
10. Jeunen, O.: Revisiting offline evaluation for implicit-feedback recommender sys-
tems. In: Proceedings of the 13th ACM Conference on Recommender Systems, pp.
596–600. ACM, NY, USA (2019)
11. Krauth, K., et al.: Do offline metrics predict online performance in recommender
systems? (2020)
12. Li, S., Abbasi-Yadkori, Y., Kveton, B., Muthukrishnan, S., Vinay, V., Wen, Z.:
Offline evaluation of ranking policies with click models. In: 24th International
Conference on Knowledge Discovery & Data Mining, NY, USA, pp. 1685–1694
(2018)
13. Myttenaere, A.D., Grand, B.L., Golden, B., Rossi, F.: Reducing offline evaluation
bias in recommendation systems (2014)
14. Peska, L., Vojtas, P.: Off-line vs. on-line evaluation of recommender systems in
small e-commerce. In: Proceedings of the 31st ACM Conference on Hypertext and
Social Media, pp. 291–300. ACM, NY, USA (2020)
15. Rao, S., et al.: Learning to be Relevant. In: Proceedings of the 28th ACM Inter-
national Conference on Information and Knowledge Management, pp. 2625–2633.
ACM, New York (2019)
16. Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Ricci, F.,
Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp.
257–297. Springer, Boston, MA (2011). https://doi.org/10.1007/978-0-387-85820-3_8
17. Silveira, T., Zhang, M., Lin, X., Liu, Y., Ma, S.: How good your recommender sys-
tem is? A survey on evaluations in recommendation. Int. J. Mach. Learn. Cybern.
10(5), 813–831 (2017). https://doi.org/10.1007/s13042-017-0762-9
Author Index
Indurkhya, Bipin 92
Jiang, Chenyu 117
Ulloa, Roberto 36
Urman, Aleksandra 36