A Reality Check (-List) For Digital Methods: Big Data, Big Infrastructures

Tommaso Venturini
University of Lyon, Lyon, France
Liliana Bounegru
Ghent University, Belgium; University of Groningen, The Netherlands
Jonathan Gray
King’s College London, UK
Richard Rogers
University of Amsterdam, The Netherlands
How to cite:
Venturini, T., Bounegru, L., Gray, J., & Rogers, R. (2018). A reality check(list) for digital methods. New
Media & Society, (forthcoming), 146144481876923. http://doi.org/10.1177/1461444818769236

Big data, big infrastructures

Great expectations and great concerns have been raised about ‘big data’ (Boyd & Crawford, 2012), ‘data science’
(O’Neil & Schutt, 2013) and the ‘computational social sciences’ (Lazer et al., 2009). A rumbling storm of digital
traces is said to loom over the humanities and the social sciences bringing great power but also responsibilities. This
prophecy of the ‘data deluge’ has some truth to it. In a handful of years, digital traceability has indeed bestowed our
disciplines with larger and more diverse datasets than we have ever dreamt of. Yet, the deluge metaphor is also
misleading as it mistakenly implies that the advent of social traceability is (1) unprecedented or (2) unproblematic.
First, while it is not wrong to emphasise the transformations brought to scientific knowledge by the growing
availability of structured information, it is important not to forget that datafication was not born with digital
technologies. In their book on Big Data, Mayer-Schönberger and Cukier (2013) present dozens of examples of large
and systematic campaigns of data collection ranging from censuses in ancient Egypt and China, to Renaissance
bookkeeping, to nineteenth-century navigation.
Far from representing a break with the past, the traceability of digital media constitutes the latest development of the
older phenomenon of ‘media traceability’. Media are socio-technical systems that produce and enable inscriptions of
individual and collective actions: “media are our infrastructures of being, the habitats and materials through which
we act and are” (Durham Peters, 2016:15). The specific way in which they enable (but also constrain) our actions is
by translating them onto physical materials (stone, paper, copper, silicon…). As old as cave painting, this process of
inscription has drastically accelerated because of digital technologies:
Once you can get information as bores, bytes, modem, sockets, cables and so on, you have actually a
more material way of looking at what happens in Society. Virtual Society thus, is not a thing of the
future, it’s the materialisation, the traceability of Society. It renders visible because of the obsessive
necessity of materialising information into cables, into data (Latour, 1998).
Not surprisingly, this growing traceability of collective actions has affected social sciences. Researchers have always
relied on media inscriptions to investigate collective phenomena and the advent of digital media has increased the
quantity and variety of the traces at their disposal. Hence the understandable excitement of social scientists finally
having access to datasets that are as large and rich as those of their colleagues in the natural sciences (Venturini,
Jensen & Latour, 2015).
Second and as a constraint to this excitement, the increase in the information available on social phenomena does
not come for free. Digital traceability may provide the social sciences with quantities of information that are
comparable to those collected in natural science laboratories, but the quality of such traces is radically different.

Unlike the traces produced by a telescope or a microscope, media inscriptions are (in general) not created by or for
the academic community (Marres, 2017; Savage and Burrows 2007, 2014).
Although both the Internet and the Web were initially incubated in universities and research institutions, those times
have passed (with the partial exception of The Internet Archive and Wikipedia). Nowadays, digital media belong to
the companies and institutions that have paid the huge costs necessary to set up their infrastructures (Frischmann,
2001). This is true at every level of digital networks: from the submarine cables bringing Internet to every corner of
the planet (Boullier, 2013), to the social platforms allowing anyone to push and pull information from the Web.
Digital media have not developed by themselves. They have been developed by the investments of a (limited) number of
public and private organisations which are now the gatekeepers of their traceability. The metaphor of ‘media
ecologies’ (Fuller, 2005), often used to describe the interactions within and among media, is in this sense improper.
Media are not natural ecosystems evolving spontaneously out of ‘chance and necessity’ (Monod, 1970). Instead, they
may be understood as artificial ‘landscapes’ (Rogers, 1999) carefully cultivated by the actors that participate in their
However, the fact that digital inscriptions are created outside academia does not make them unfit to be used for
research purposes. On the contrary, the growing dominance of these platforms makes the academic exploration of
their traces even more important (Savage & Burrows, 2007, 2014). And, while researchers might not harvest all
media inscriptions without the collaboration of the infrastructures that created them, they can still obtain partial
access to their traces. It is a feature of the political economy of contemporary media that information is not only
accumulated but also partly redistributed. Media companies collect information from us and redistribute it in various
configurations and products as part of their business strategy (see e.g. Bodle, 2011). Google, Facebook, Twitter and
their likes may strive to collect and monetise our messages, clicks and hyperlinks, but in doing so, they also provide
us with insights from the data that they collect. This is not simply a compensatory move: it is part of the strategies
for platforms to present themselves as providers of valuable analytics and partners to established and emerging data
While digital platforms still rely on classic business models based on the exchange of information against money
(through subscriptions) or against attention (through advertising), they are also increasingly trading information
against other information. Every time you use its search engine you provide Google information about your
interests, but in exchange you are given access to the largest database ever created, through the most advanced
information engine ever conceived. And, though (according to the terms of service) you are not supposed to “access [the
information] using a method other than the interface and the instructions that we [Google] provide”, nothing
prevents you from being methodologically inventive and re-using this information for purposes other than those
which were originally intended (Lury & Wakeford, 2012). The same applies to most Web and mobile services,
whose business model implies some redistribution of data.

The ‘digital methods’ approach

Creatively repurposing the traces and the methods inscribed in digital objects is the aim of an emerging approach
that has come to be known as digital methods (Rogers, 2013). This approach aims at exploring “how may one learn
from how online devices (e.g., engines and recommendation systems) make use of the objects, and how may such
uses be repurposed for social and cultural research?” (Rogers, 2009, p. 1).
Precisely because they propose to do "research with the Internet" – that is to say by exploiting the information made
available by Internet platforms – digital methods are not suitable for all research scenarios. For this approach to
make sense, the investigated phenomenon must be to some extent performed or, at least, reflected in such
platforms, and this is clearly not the case for all collective events. As far as the digital tendrils may have extended,
there still are plenty of crucial social dynamics that play out prevalently in face-to-face interactions (or through non-
digital media). These include, of course, interpersonal exchanges taking place in homes, classrooms and offices, but
also (and in a way that is often forgotten) the very situations in which digital messages are received. The non-digital
situations in which digital media are consumed (at work, on the couch, in an Internet café, on the subway through a
smartphone…) and the ways in which their contents are processed through direct conversations may crucially influence
their reception (as repeatedly shown in the case of traditional broadcast – cf. Gans, 1993; Katz, 2001 and Staiger,
2005). Yet these influences remain outside the grasp of digital methods and should be assessed through other
Additionally, while digital methods may in theory apply to the inscriptions produced by any digital infrastructure,
most studies following this approach focus on the largest online platforms. Engaging vast and non-specialised
populations, services such as Google, Facebook, Twitter and Wikipedia have understandably captured the interest of
computational social scientists. As a consequence, while many tools exist to investigate the giants of the web, smaller
and specialised platforms remain relatively unexplored (ANONYMISED).
Besides these general limitations and even when scholars investigate phenomena clearly visible in mainstream online
platforms, caution is required. Repurposing the interfaces, databases and methods of digital media may be extremely
useful but it demands some vigilance. To produce useful and interesting findings, digital methods require the extra
care needed for the secondary analysis of inscriptions that have not been created by or for the social sciences and
thus bear the imprint of the particular purposes (whether political, commercial or otherwise) and technical
infrastructures through which they were created. Using digital methods, we are always at risk of mistaking the
characteristics of the medium for the signature of the phenomena we wish to observe.
To a great extent, this cannot be helped. Since McLuhan (1964; McLuhan & Fiore, 1967), we have known that “the
medium is the message” and that electronic infrastructures do not simply transport social phenomena, but also
participate in their production. This basic constructivist acknowledgement, however, should not be taken as an
excuse for careless work. It is precisely because there is a priori no clear separation between noise and information,
that efforts should be invested to distinguish them a posteriori (Marres & Moats, 2015). This operation of accounting
for the entanglement of the digital devices and the actions which they constrain and enable (contrasting the specific
signal of the observed phenomenon to the general background of the repurposed medium) is vital, but often
In this paper, we provide a basic list of precautions which may be taken when using digital methods. Others have
already discussed various 'perils' (Bollier & Firestone, 2010), 'provocations' (Boyd & Crawford, 2011), 'challenges'
(Rieder & Röhle, 2012), 'problems' (Marcus & Davis, 2014), 'traps' (Lazer et al. 2014) and 'misunderstandings'
(Venturini et al. 2014) of digital research. While these and other inquiries offer interesting reflections about the way
in which digital technologies affect the epistemology of social sciences, this paper has a more practical objective: it
aims at collating a series of caveats that scholars may bear in mind while designing their digital research.
Since we are particularly interested in the interferences between the production of inscriptions by digital media and
their repurposing by scholars, we will focus on the first phases of digital research, where the entanglements between
technological infrastructures and social phenomena are stronger. With reference to the four ‘analytical moments’
identified by Mikkel Flyverbom and Anders Koed Madsen (2016) – production, structuring, distribution and
visualization – we concentrate on the first two stages of research (but some discussion of the latter can be found in
ANONYMIZED 2015 and in Gray et al., 2016).
We will illustrate our precautions with a series of studies developed at the Digital Methods Initiative at the
University of Amsterdam (digitalmethods.net). Though equally interesting projects have been carried out in many
other research institutions, we decided to focus on the DMI to facilitate the comparison. In addition, the research
carried out at DMI has the advantage of being accompanied by web pages presenting the research methods,
protocols and datasets (cf. wiki.digitalmethods.net/Dmi/DmiSummerSchool and
wiki.digitalmethods.net/Dmi/WinterSchool, but specific URLs will also be provided).

A few preliminary definitions

For the sake of clarity, we will formulate our precautions as four sets of questions (from the most theoretical to the
most practical). This arrangement is to a large extent artificial: in the practice of digital research, these precautions
overlap and problems have the unruly tendency to arise in knots rather than in orderly sequences.
Before moving to the checklist, we will introduce some of the key notions used in this paper. The descriptions we
provide below should not be understood as strict definitions, but as a working heuristic for digital researchers.

A medium is any technical infrastructure that allows the organisation and extension of collective actions in space
or time. The printing press, television, telephone and Web are media in that they allow social actors to interact
without being in the physical presence of each other. One should not be fooled by this simple definition into
taking media infrastructures for granted. Classic works in media studies emphasise how media are not neutral
agents and instead play an active role in the articulation of meaning and communications. For example, Howard
Innis’s pioneering studies (1986, 2008) highlighted what he called the particular characteristics or “biases” of
media and how they enabled different social institutions of law, religion, culture and commerce. Drawing on
Innis’s work, Marshall McLuhan contributed to the recognition of media systems as objects and sites of study.
James Carey recognises Innis and McLuhan’s role in establishing the study of media as “not merely [...]
appurtenances to society but as crucial determinants to the social fabric” (1967, pp. 270-271). He underlines the
limits of models focussing on “transportation and transmission” and instead proposes to consider media as
processes through which “reality is produced, maintained, repaired, and transformed” (2009, p. 19).
A platform is a specific way of organising a media infrastructure, constraining the way in which the medium can
be employed but also facilitating its exploitation. Recent research has focused both on the rhetorical aspects of
platforms (Gillespie, 2010), as well as their “material-technical” characteristics (Helmond, 2015). Facebook is an
interesting example of how a limited repertoire of “sentiments” – including “likes” and a number of ‘reaction’
emoticons – is facilitated through a centralised social media company, and then extended in relation to almost any
digital media.
A scientific inscription is any piece of information that is materialized through the use of a technical device for
the purposes of research. Inscriptions are the foundation of any scientific enterprise, for they allow knowledge to be
imprinted on materials which can be stored, transformed and transmitted (Latour & Woolgar, 1979 and Latour
A digital trace is any inscription produced by a digital medium in its mediation of collective actions – for
instance, a post published on a blog, a hyperlink connecting two websites or the log of an e-commerce
transaction. We call this particular type of inscriptions ‘traces’ as a reminder that they are (most often) generated
by purposes other than academic research. Some of these inscriptions are ‘native’ to digital media while others
are originally analogue and digitized at a later stage.
A corpus is an ensemble of inscriptions or traces that have undergone the process of selection, cleaning and
refining necessary to prepare them for scientific analysis. For instance, hyperlinks are a classic example of digital
traces, but they only become a research corpus when they are translated into constricted lists or into arcs
connecting a network of websites.
The notion of digital methods was introduced in 2007 as a counterpoint to virtual methods, which sought to
introduce the social scientific instrumentarium to digital research (Rogers, 2009). Virtual methods, it was claimed,
consisted in the digitisation of traditional research methods (e.g. in online surveys or online ethnography).
Rooted in media studies and the so-called computational turn in the humanities and social sciences, digital
methods sought instead to learn from the methods of the medium and repurpose them for social and cultural
research. Reflecting on ‘natively digital’ methods sensitised the researcher to the specificities of the then ‘new
media’, to their effects, platform vernaculars and user cultures. ‘Following the medium’ also would offer the
researcher a strategy to cope with the ephemerality and instability of the Web, where a new feature, a changed
setting or the shutting down of an API could stymie longitudinal studies. Whilst remaining critical of the
implications of such changes, digital methods would ask which kind of research the platform affords. Digital
methods thus may be defined as techniques for the ongoing research on the affordances of online media.
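
To make the notion of a corpus given above more concrete, the following minimal Python sketch shows one way in which raw hyperlink traces might be refined into a list of arcs connecting websites. The example URLs, the helper names `host_of` and `links_to_arcs`, and the cleaning choices (collapsing pages into hosts, dropping links internal to a site) are illustrative assumptions, not a procedure taken from the studies discussed in this paper.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def links_to_arcs(raw_links):
    """Turn (source_url, target_url) traces into weighted site-to-site arcs.

    Selection and cleaning choices (all assumptions of this sketch):
    pages are collapsed into hosts, links within one site are dropped,
    and repeated links are counted rather than listed.
    """
    arcs = Counter()
    for source_url, target_url in raw_links:
        source, target = host_of(source_url), host_of(target_url)
        if source and target and source != target:
            arcs[(source, target)] += 1
    return arcs


# Hypothetical traces, e.g. as they might be returned by a crawler.
raw_links = [
    ("https://www.example.org/about", "https://example.net/report"),
    ("https://example.org/news", "https://www.example.net/"),
    ("https://example.org/news", "https://example.org/contact"),
    ("https://example.net/report", "https://example.com/data"),
]

arcs = links_to_arcs(raw_links)
for (source, target), weight in sorted(arcs.items()):
    print(source, "->", target, weight)
```

Each cleaning choice in the sketch is itself a research decision: collapsing pages into hosts, for instance, hides differences between sections of the same website, and a different choice would yield a different corpus from the same traces.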

1. Digital media and objects of study

The first checkpoint encountered in all digital methods projects concerns the adequacy of the source exploited in
relation to the object of the study. The adequacy can be defined as the extent to which the observed phenomenon
takes place within the medium that is repurposed to examine it. If one is interested in the public of a particular issue,
one might ask whether and how this public is online, whether its members use a given platform, space or device
from which digital data is collected (Twitter, Facebook, a website, a mobile app), and what kinds of technical skills,
capacities and networks they have available to them (cf. Hargittai & Hsieh, 2013).
Say, for example, that you are investigating data collected through the Steam gaming platform (steampowered.com).
Different cautions will be needed depending on the ambition of your research: do you plan to describe the gaming
habits of Steam’s users? Or are you interested in online gaming trends? Or do you want to inspect the cognitive
effects of computer games? Or question the social role of game playing in general? If you are studying the practices
of a specific platform (e.g. the habits of Steam gamers), then the inscriptions produced by that platform constitute the
primary traces of the phenomenon you are after. But if you use those particular activities as an example of a more
general phenomenon (e.g. collective game playing), then your traces will only offer you a partial observation.
As should be clear from the example, the distinction is neither binary nor written in stone. It depends on how you
define the scope of your investigation and can change as your research evolves. It should also be noted that working
with partial traces is not necessarily an insurmountable problem. If it is true that the larger the coverage of your
study object, the easier it will be to generalise your findings, it is also true that the more phenomena and media
coincide, the harder it is to separate them analytically. In the paragraphs above, we have used the expression “to
take place in”. While this expression conveniently describes the way in which actions happen within or beyond a
specific medium, it has the disadvantage of artificially separating collective actions from the medium that supports
them. Media are not only ‘places’, ‘spaces’ or ‘contexts’ but actors themselves whose actions interfere and transform
the behaviour of their users (Castells, 2009). These ‘media effects’ should be taken into consideration to understand
that the phenomena we observe are not just hosted and traced by the media in which they occur, but also deeply
shaped by them.

1.1. How much of your study object occurs in the medium you are studying?
In its simplest definition a collective phenomenon can be defined as a network of interfering actions (Latour, 2005).
These actions can be of very different kinds, varying from an occasional intervention of an individual actor (e.g.
when a customer makes a bid in an online auction) to a longstanding configuration of socio-technical forces (e.g. the
legal constraints implemented in the mechanism of the bidding interface). What counts here is the extent to which
the actions that comprise the phenomenon you wish to observe are mediated by – and therefore leave traces in – the
medium that you are repurposing.
If you are studying learning practices in Massive Open Online Courses (or MOOCs), you may for example safely
assume that most of the interactions that you investigate take place through the MOOC platform and are therefore
recorded by it. But if you are studying the life of a university through the records of its administrative systems, you
should be aware that most of the informal face-to-face exchanges that constitute a crucial part of college experience
will not show up in your dataset.
A good example of a close alignment between the research object and the medium is a 2016 study of conflict
resolution practices in Wikipedia (Weltevrede & Borra, 2016). This study draws on a project which was initially
meant to use Wikipedia’s traces to identify emerging societal controversies (contropedia.net). Soon, however, it
became clear that, while tensions coming from external debate are often mirrored in the online encyclopaedia, such
conflicts are hard to distinguish from the internal quarrels around the platform’s architecture, policies and guidelines.
Acknowledging this difficulty, the study shifted the focus of its inquiry to examine practices of coordination specific
to the platform and the distinctive ways in which they facilitate collaboration and defuse conflict.

Figure 1. Two screenshots of the Contropedia.net interface. Controversial wiki-links are highlighted in red on the original page
and the full evolution of the discussion surrounding them is displayed below (original figure in Weltevrede & Borra, 2016).

Other times, the partiality of medium coverage with respect to the phenomenon may be used strategically. Drawing
on James Gibson’s theory of visual perception (1986), Anders Koed Madsen (2012 and 2015) introduced the term
“web-vision analysis” precisely to point at the way in which researchers can use different media and filtering
parameters to compare different angles on the same phenomenon:
Web-visions are cases that result from deliberate combinations of devices and tools, and the mode of
seeing that results from these combinations is the basis of their potential relevance… the researcher
is left with an arsenal of variables that can be used to manipulate the construction of the web-visions
in a quasi-experimental fashion. The mode of seeing can, for example, be tweaked by altering the
logic of filtering in the delineation device, the country of origin of the device, the language used to
query the device or the settings of the web-crawler used to construct the visualization (Madsen, 2012,
p. 62)
Partiality, in other words, is not always a liability. Purposefully moving away from the main site where the
phenomenon occurs and where it is typically studied may offer fresh angles and perspectives.

1.2. Are you studying media traces for themselves or as proxies?

A subtler dimension of the question of coverage has to do with the nature of the actions traced in the medium that
you are investigating. Do they constitute the very phenomenon that you are examining or are they the occasion to
study other actions not directly traced in the data at your disposal?
Social media, for example, have become a mine of information for the study of social movements as civic
organisations increasingly rely on them to coordinate the actions of their members both online and offline
(Gerbaudo, 2012). Yet, it is one thing to use traces from social platforms to investigate online mobilisation and
another to use them to study street protests (Rogers & Marres, 2002). In the former case, the messages exchanged
online constitute the very object of the study; in the latter, they are the proxies of other actions (walking, standing,
shouting…) taking place outside the medium. Indeed, digital methods takes the explicit stance of using digital traces
to study not only online phenomena but culture and society in general (Rogers, 2013, 2017). Repurposing the media
means using digital traces as proxies for phenomena that extend beyond them.
In an exploratory project, for example, a group of researchers compared the Google Web Search results for the
query “rights” in a number of languages, to highlight the specific ways in which cultures conceive the question of
human rights (Bekema et al., 2009 - https://www.digitalmethods.net/Dmi/NationalityofIssues).

Figure 2. A visual representation of the different human rights as appearing in the results of Google Search for different
countries and languages (original figure in Bekema et al., 2009).

2. Definition of the object of study

As we have just seen, the key to securing the adequacy between observed phenomenon and repurposed medium is
to handle with care the relation between the scope of your research questions and the traces that you will use to
investigate them. In the previous point, we considered such questions ‘passively’ as if the only thing researchers
could do is to choose a source that fits their ambitions. Yet, researchers can also (and in fact should also) actively and
creatively operate to align the two. This process is called ‘operationalization’ and it refers to the way in which the
entities that you wish to observe are defined through the traces at your disposal (see e.g. Moretti 2013). In digital
methods research this takes the shape of “an on-going process of assembling, re-configuring, and aligning research
questions with digital media and device cultures” (Weltevrede, 2016).
Suppose you want to observe the connections between private companies, public institutions and civic groups
through the way in which they refer to each other in their online discourse. There are a number of different ways in
which you can operationalize this research question: you can look at the hyperlink network among the websites of
your actors, but you can also consider the overlap of their Facebook friends. You can follow the retweeting of
their representatives, or explore the connections among their pages on Wikipedia. All these operational definitions
are legitimate, but each of them will give you a different view on your object of study with different possible biases
that should be carefully considered.

Even within a single platform, different operational definitions of the same research object are often possible. Take
the case of the investigation of controversies in Wikipedia. Because of the way in which MediaWiki (the software
that supports the famous collaborative encyclopaedia) stores information, 'controversiality' can be operationalized at
the article level (to highlight which topics are disputed), but also at the level of smaller elements such as the links
within the articles (to reveal, for instance, which references are most contested). In addition to this, multiple
measures of controversiality may be defined, from the volume of edit histories, to the depth of discussions in
associated talk pages (Borra et al., 2014; Weltevrede and Borra, 2016). Each of these operationalizations leads to a
different appraisal of what constitutes a matter of concern or an expression of disagreement.
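
The point that different operationalizations yield different appraisals can be illustrated with a small Python sketch. It ranks the same articles under two of the measures mentioned above, the volume of the edit history and the depth of talk-page discussions. The `ArticleTraces` record, its fields and all figures are invented for illustration; they are assumptions of this sketch and do not reproduce the metrics used in the studies cited.

```python
from dataclasses import dataclass


@dataclass
class ArticleTraces:
    """Hypothetical summary of the traces left by one article."""
    title: str
    edit_count: int           # number of revisions in the edit history
    talk_thread_depths: list  # reply depth of each talk-page thread


def edit_volume(article):
    """Operationalization A: controversiality as volume of edits."""
    return article.edit_count


def discussion_depth(article):
    """Operationalization B: controversiality as the deepest talk thread."""
    return max(article.talk_thread_depths, default=0)


def rank(articles, measure):
    """Return article titles ordered from most to least 'controversial'."""
    return [a.title for a in sorted(articles, key=measure, reverse=True)]


# Invented figures, for illustration only.
articles = [
    ArticleTraces("Article A", edit_count=900, talk_thread_depths=[1, 2, 2]),
    ArticleTraces("Article B", edit_count=150, talk_thread_depths=[3, 7, 4]),
    ArticleTraces("Article C", edit_count=400, talk_thread_depths=[]),
]

print(rank(articles, edit_volume))       # ordering under measure A
print(rank(articles, discussion_depth))  # ordering under measure B
```

With these invented figures the two measures disagree about which article is the most disputed, which is precisely why the choice of measure should be reported and justified rather than treated as a technical detail.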

2.1. Is your operationalization attuned to the medium formats?

To validate your operationalisation, start by considering its agreement with the source in which it will be deployed.
Working with secondary data, you do not have the leisure to define your objects of study as you wish, but you are
obliged to consider (at least in part) the way in which they are formatted by the technical and organizational
standards of the medium.
For example, when investigating public debate through Twitter, one cannot avoid acknowledging that topics in this
platform tend to be organised through a very specific technical object: the hashtag. This object has distinctive
features. It is always preceded by the “#” symbol, it acts as a topical marker, it assembles publics around a shared
matter of concern, and it can be used as a keyword for monitoring or searching content. These features influence the
way in which actors discuss but also the manner in which research can investigate such discussions (Marres and
Gerlitz, 2015).
Let's say you want to use Twitter to explore the groups engaged with the issue of public finances. Following a
traditional sociological approach, you may profile publics according to geography, demographic features or societal
sectors (see, e.g. McCormick et al, 2015; Sloan et al, 2015). But if you want to ensure that your line of inquiry is
attuned to Twitter's practices, these might not be the best starting points. Twitter follows a different approach to
knowledge production than classical “research devices” – through follower-followee relations, liking, linking, tagging
and curated “moments” (see Ruppert, Law and Savage, 2013). In a project on the dynamics of European public
finances we thus focused on the specific forms of engagement facilitated by Twitter. For example, we collected data
about a series of hashtags (e.g. #EUBudget, “EU budget” or #ESIF) and explored the associated actors and issues
(e.g. #refugeecrisis, #youth, #OurFundsOurRights, #Regeneration and #Brexit). Through such analysis we
observed the formation of new publics, as well as dynamics of “hashtag hijacking” – the convergence of different
social worlds through the accidental or purposive use of similar keywords.
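
A hashtag-centred query of the kind described above can be sketched in a few lines of Python. The sketch filters messages that match a list of queries (hashtags or plain phrases) and then tallies the accounts and the other hashtags found in the matching messages. The tweets, the account names and the helper functions are hypothetical and serve only as an illustration; the sketch makes no claim about how the data of the project mentioned above were actually collected or processed.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtags(text):
    """Return the set of hashtags in a message, lower-cased, without '#'."""
    return {tag.lower() for tag in HASHTAG.findall(text)}


def matches(text, queries):
    """True if the message contains any query (hashtag or plain phrase)."""
    lowered = text.lower()
    tags = hashtags(text)
    for query in queries:
        q = query.lower()
        if q.startswith("#"):
            if q[1:] in tags:
                return True
        elif q in lowered:
            return True
    return False


def associated_actors_and_issues(tweets, queries):
    """Tally the accounts and the other hashtags found in matching tweets."""
    seed_tags = {q[1:].lower() for q in queries if q.startswith("#")}
    actors, issues = Counter(), Counter()
    for author, text in tweets:
        if matches(text, queries):
            actors[author] += 1
            issues.update(hashtags(text) - seed_tags)
    return actors, issues


# Invented messages and accounts, for illustration only.
tweets = [
    ("@account_one", "What the #EUBudget means for #youth programmes"),
    ("@account_two", "New EU budget figures published today #Brexit"),
    ("@account_one", "#ESIF funding and #Regeneration #youth"),
    ("@account_three", "Unrelated post about #football"),
]
queries = ["#EUBudget", "EU budget", "#ESIF"]

actors, issues = associated_actors_and_issues(tweets, queries)
print(actors.most_common())
print(issues.most_common())
```

Matching hashtags as whole tokens while matching phrases as substrings is a design choice of this sketch; it mirrors the difference between the platform's own topical marker and a free-text keyword, and each option captures a somewhat different public.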

2.2. Is your operationalization attuned to the medium practices?

The technical implementation of actions and actors, however, is only one of the ways in which digital mediation
structures your research object. Another one has to do with the practices employed by the users of the medium.
While the technical infrastructures clearly influence the interactions that take place through them, they are also open
to interpretation (see Gillespie, Boczkowski & Foot, 2014, Paßmann & Gerlitz, 2014). Uses and technical formats
are not independent – actors both ‘do with’ media affordances and influence the way in which such affordances
evolve (Bucher and Helmond, 2017). For example, while Twitter offers an official way to signal association between
different accounts, through the ‘follow’ function, such associations have been proved to be weaker than the action
of ‘mentioning’ or ‘re-tweeting’, both of which have been initiated by users and only later officially adopted by the
platform (Kooti et al. 2012).
Your operationalization, therefore, should not only be adjusted to the technical infrastructure of your medium but
also to the practices of its users. In anthropology, this question is addressed through the distinction between ‘etic
categories’ (the concepts employed by the researchers and their peers) and ‘emic categories’ (the notions employed
by the community under research) (see Munk, 2013 for a discussion of how some classic anthropological notions
can be applied to digital phenomena). This distinction reminds us that the intellectual tools that we use to describe
a collective phenomenon should be respectful of the way in which the actors conceive their own social existence.

This care is particularly crucial where 'query design' is concerned (Rogers, 2017). The ways in which actors label the
phenomena in which they are engaged can be subtle and complicated. For example, one may note that climate
'scepticism' is the self-description preferred by those who doubt the human causes of climate change, while climate
change 'denial' is the notion used by their opponents (Niederer 2013). Understanding the nuances of emic language
can help you capture different sides of your object and the competitive ways in which different groups frame the
same phenomena (see the concept of 'equivalence framing' in Cacciatore et al., 2016). It also allows you to generate
better and more precise queries. A recent study of 'mental illness' on Tumblr, for example, started from the generic
query #mentalillness (Sanchez-Querubin et al. 2016). Soon, however, the co-hashtag network around this query (the
hashtags most often used alongside #mentalillness) revealed that the concept of #recovery characterizes the most
significant practices associated with mental illness on Tumblr, and thus became the focus of the study.
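The co-hashtag step described above can be illustrated with a minimal sketch. It assumes that posts are available as plain strings; the function name `co_hashtags` and the toy posts are invented for illustration and are not the tooling or the data of the cited study.

```python
import re
from collections import Counter

# Simple hashtag pattern; real platforms apply more elaborate rules.
HASHTAG = re.compile(r"#(\w+)")

def co_hashtags(posts, seed):
    """Count the hashtags that appear in the same post as `seed`."""
    seed = seed.lower().lstrip("#")
    counts = Counter()
    for text in posts:
        # Use a set so that a tag repeated within one post counts once.
        tags = {t.lower() for t in HASHTAG.findall(text)}
        if seed in tags:
            counts.update(tags - {seed})
    return counts

# Invented toy posts (assumption, not data from the study).
posts = [
    "one day at a time #mentalillness #recovery",
    "#mentalillness #recovery #selfcare",
    "awareness week #mentalillness #stigma",
    "garden update #plants",
]
neighbours = co_hashtags(posts, "#mentalillness")
print(neighbours.most_common(1))
```

Ranking the neighbours of a generic query in this way is one possible means of spotting emic terms that deserve a query of their own.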

3. From single-platform to cross-platform analysis

So far, we have discussed the cautions necessary to handle digital traces, under the assumption that those traces
derived from one medium. Of course, this is not always the case. Most collective phenomena tend to extend beyond
the frontiers of any single medium and often force researchers to follow them across different media. Cross-
platform projects tend to be richer than single-platform ones, as they allow researchers to compare the findings observed in one
medium with those obtained in others. Depending on your research question, this ‘triangulation’ can help to
separate the characteristics of collective phenomena from the features of the media.
For example, when observing the rapidity with which issues rise and fall on Facebook, it is difficult to decide
whether such rhythm is an indication of a superficial debate or an effect of the platform which encourages shorter
attention spans. Most probably, both are true. However, by comparing how the same topics evolve on Facebook, in
the blogosphere, in newspapers, or in the scientific literature, one can distinguish the underlying features of a
collective phenomenon from the specific way in which it is enacted in a particular medium.

3.1. Does your study object spill across several media?

Going ‘cross-platform’ is indispensable when the media themselves encourage the circulation of the same contents
across their borders. This is often the case in Web-based media, for hypertext protocols facilitate the creation of
connections. Despite all the discussion about the ‘walled-gardens’ of social media, every platform is connected to
other platforms and sometimes to different media.
Twitter is particularly illustrative in this sense. Given the short span of its messages, this platform has from its
inception been used as a device to point to contents published elsewhere. Building on this characteristic, most
other social platforms offer functionalities of ‘automatic tweeting’. As a consequence, Twitter's dialogues are often
influenced by the echoes of the discussion happening in other contexts (Gerlitz & Rieder, forthcoming).
Studying the circulation of fake news, for example, we soon realised the impossibility of limiting our study to a single
medium (ANONYMIZED - fakenews.publicdatalab.org). The danger of successful fake news stories comes less
from their falseness (which is in most cases easy to detect) than from the virality with which they bounce from one
medium to another and thereby steadily occupy the public agenda (see also Leskovec 2009 and Shifman 2013).

Figure 3. Spread and debunking of the fake story according to which the Pope had endorsed Donald Trump. The nodes
represent the web pages in which the story has circulated and the lines the different ways in which they mention each other
(original figure in ANONYMIZED).

3.2. Do you use different but comparable operationalizations for different media?

While ‘cross-platform’ research can be useful and sometimes indispensable, it also entails additional difficulties due
to the necessity of developing multiple operational definitions of the entities under consideration. Each of these
definitions should be attuned to the specific medium in which it is used, but also sufficiently consistent to allow
comparison across media.
Earlier, we considered the investigation of how actors associate online. The most straightforward choice would be to
operationalize their connections as the hyperlinks connecting their different online personae (their website, their
Facebook page, their Twitter account…), as the notion of hyperlink is defined at a low layer of Web protocols and
does not depend on the specific platform implementation. Such a choice, however, would fail to recognise that
different platforms come with different formats and practices. Hyperlinks among websites may better be translated by ‘friendships’ or ‘likes’ in
Facebook and by ‘retweets’ and ‘mentions’ in Twitter. In cross-platform approaches the trade-off between
attunement and comparability is always problematic and one should find specific solutions coherent with the aims
and the constraints of the research.
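One way to keep this trade-off explicit is to write down, for each platform, which traces are accepted as an 'association' and to convert them into a common edge format. The following is only a sketch under stated assumptions: the mapping `ASSOCIATION_TYPES`, the `Edge` record and the toy traces are hypothetical, and the right mapping depends on the aims of each study.

```python
from dataclasses import dataclass

# Hypothetical mapping (assumption): which platform-specific traces
# are treated as an association between two actors.
ASSOCIATION_TYPES = {
    "web": {"hyperlink"},
    "facebook": {"friendship", "like"},
    "twitter": {"retweet", "mention"},
}

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    platform: str
    kind: str

def to_edges(traces):
    """Keep only traces whose kind is an accepted association on their platform."""
    return [
        Edge(t["source"], t["target"], t["platform"], t["kind"])
        for t in traces
        if t["kind"] in ASSOCIATION_TYPES.get(t["platform"], set())
    ]

# Invented toy traces (assumption, for illustration only).
traces = [
    {"source": "ngo_a", "target": "ngo_b", "platform": "web", "kind": "hyperlink"},
    {"source": "ngo_a", "target": "ngo_b", "platform": "twitter", "kind": "retweet"},
    {"source": "ngo_a", "target": "ngo_c", "platform": "twitter", "kind": "follow"},
    {"source": "ngo_b", "target": "ngo_a", "platform": "facebook", "kind": "like"},
]
edges = to_edges(traces)
print(len(edges))
```

Keeping the platform and the kind on every edge preserves attunement, while the shared source and target fields make the resulting networks comparable.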

In an ongoing study, we compared different media to reveal competing framings of open data politics
(ANONYMIZED). On Twitter, many actors seemed to cluster around topics related to business opportunities
(such as #startup, #smartcity or #innovation) as well as transparency and open government (#ogd,
#opengovernment, #transparencia). By contrast, by analysing the Wikipedia pages connected to the theme of open
data, we observed topics such as “open source software”, “free software movement”, “open access”, “free culture
movement” and “Creative Commons” – indicating how open data is articulated less as a policy or economic issue,
and more as part of the “digital commons” movement. Finally, newspaper analysis suggests that open data is
frequently discussed in relation to international development.

4. Corpus demarcation and data access

Once you have chosen your object of study, the media through which you will examine it and how to operationalise the
passage from one to the others, you are still confronted with practical difficulties. We have gathered them in this
final checkpoint because they concern the way in which media inscriptions are turned into a research corpus.
For the sake of simplicity, we have so far discussed the adequacy between digital traces and research ambitions
considering vast social phenomena (e.g. the sharing economy of housing) and entire media platforms (e.g. Airbnb).
Such a breadth of scope is, however, inadvisable in many cases as it may only yield superficial
results. Instead, researchers should concentrate on specialised questions (e.g. whether peer-to-peer renting has
professionalised in a given city and in a given period of time) and on restricted subsets of the traces generated
by the medium they investigate. The necessity of selecting a specific object of study from the fabric of collective life
and a correspondingly delimited corpus out of the web of digital traces is a crucial operation and one that raises a
few delicate questions.

4.1. What does your corpus represent?

In question 1.1, we discussed how the traces offered by one medium are rarely coextensive with the research
ambitions. The problem of partial coverage is intensified by the fact that the data of any given research project are always a
subset of the traces offered by their source. Every time you use a query (or a set of queries) to extract information
from a platform, you have to infer which words are used by the actors you are interested in to define their matters of
concern. Sometimes this inference is straightforward – commercial brands, for example, invest huge efforts to
standardize their name – but think of how many different expressions can be used to refer to ‘environmental
degradation’. How to be sure that the query you use results in sufficient coverage of your study object?
The problem here is not to be exhaustive. Exhaustiveness is a false ideal in digital research. Not only because there
are just too many digital traces out there for researchers to hope to seize them all, but also – and more importantly –
because extending one’s coverage may produce more noise than signal. When we say that media traces are not co-
extensive with social phenomena, we mean that the former are narrower but also broader than the latter. The
blogosphere, for instance, does not contain the climate debate because such debate also occurs in many other media
(scientific literature, news, social platforms, etc.), but the reverse is also true: climate change is only one of the
innumerable topics discussed online.
Digital corpora can never be exhaustive. They can however be representative. They should not necessarily cover each
and every thread that constitutes the fabric of social phenomena, but they should not tear such fabric or artificially
extend it. The notion of representativeness may be inappropriate here, for it is associated with clear statistical
definitions that are inapplicable to digital methods. For most digital research, there is no straightforward statistical
test to assess the validity of a corpus. The best you can do is to describe explicitly the various operations of selection
and transformation that connect the original traces to the final corpus and reflect on their analytical consequences.
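A simple way of making such operations explicit is to record how many traces survive each selection step. The sketch below is only an illustration: the helper `build_corpus`, the step names and the toy records are assumptions, not a prescribed procedure.

```python
def build_corpus(records, steps):
    """Apply named filter steps in order and log how many records survive each."""
    log = [("original traces", len(records))]
    for name, keep in steps:
        records = [r for r in records if keep(r)]
        log.append((name, len(records)))
    return records, log

# Invented toy records (assumption, for illustration only).
records = [
    {"text": "Climate change and floods", "lang": "en"},
    {"text": "climate denial debate", "lang": "en"},
    {"text": "changement climatique", "lang": "fr"},
    {"text": "football results", "lang": "en"},
]
# Each step is a (description, predicate) pair, applied in order.
steps = [
    ("language is English", lambda r: r["lang"] == "en"),
    ("mentions 'climate'", lambda r: "climate" in r["text"].lower()),
]
corpus, log = build_corpus(records, steps)
for name, n in log:
    print(f"{name}: {n}")
```

Publishing such a log alongside the findings lets readers see what each selection step left out (here, for instance, the French-language record) and judge the analytical consequences for themselves.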

4.2. Are you accounting for the ways in which data are ‘given’ by the media?

Much has been written about the unfortunate etymology of the word ‘data’, which conveys the false impression that
information is objectively given rather than constructed, not only by the researchers but also by the technical infrastructures
that have generated those data, their users and the companies and institutions that own those infrastructures
(Bowker, 2013; Drucker, 2011). Acknowledging that we receive our information from someone else (data are ‘given’
at least in this sense) draws our attention to the conditions of such delivery.
The sources from which we derive our inscriptions and the instruments through which we acquire them have
consequences on the quality of our observations. When observed through the traces that it leaves on Twitter, public
debate often appears as a chaotic flux of conversations ephemerally agglutinating around emerging ideas, while
struggles between overarching world visions and systems of forces become almost invisible. As the saying goes in the
digital methods community, “when all you have is a Twitter feed, everything looks like a hashtag”. Electronic media
do not merely record the interactions that they mediate – not unlike social researchers, they also measure and analyse them
(Marres, 2012; Gray et al., forthcoming). They count interactions besides making them countable (Agre, 1994; Gerlitz and
Rieder, forthcoming).
Investigating climate debate on Twitter, Marres and Gerlitz (2015) noted that the platform relies on “frequency of
mentions” to identify and promote trending topics. Such focus encourages specific practices among the users (e.g.
re-tweeting as a way of having messages picked up by the system) and is transmitted to most Twitter analytic tools.
This ends up privileging hashtags referring to events or campaigns (e.g. #cop16, #auspol, #savethearctic) that are
subject to hype-like dynamics. In order to detect more substantial issues, the researchers then moved from
frequency measures to “associationist measures” (not how many times a hashtag is mentioned, but how many other
hashtags co-occur with it), which allowed them to identify tags such as #economics, #flood, #co2, #health,
#environment, and #drought.
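The difference between the two kinds of measure can be made concrete with a small sketch. It follows the definition given above (frequency as the number of tweets mentioning a hashtag, association as the number of distinct other hashtags co-occurring with it); the function name `hashtag_measures` and the toy tweets are assumptions and do not reproduce the authors' tools or data.

```python
from collections import Counter, defaultdict

def hashtag_measures(tweets):
    """Return (frequency, connectivity) for tweets given as lists of hashtags."""
    frequency = Counter()
    neighbours = defaultdict(set)
    for tags in tweets:
        unique = set(tags)
        # Frequency: in how many tweets each hashtag is mentioned.
        frequency.update(unique)
        # Association: which other hashtags each one co-occurs with.
        for tag in unique:
            neighbours[tag] |= unique - {tag}
    connectivity = {tag: len(others) for tag, others in neighbours.items()}
    return frequency, connectivity

# Invented toy tweets (assumption, for illustration only).
tweets = [
    ["cop16"], ["cop16"], ["cop16"], ["cop16", "savethearctic"],
    ["flood", "health", "economics"],
    ["flood", "drought", "co2"],
]
frequency, connectivity = hashtag_measures(tweets)
print(frequency["cop16"], connectivity["cop16"])
print(frequency["flood"], connectivity["flood"])
```

In this toy example the campaign hashtag ranks first by frequency but last by connectivity, whereas the issue hashtag shows the opposite profile.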

Figure 4. Comparison of the most mentioned and most connected hashtags related to the climate change debate on Twitter
(original figure in Marres & Gerlitz, 2015).

Not only are digital traces infused with the technical, commercial and ideological premises of the platforms
that generate them (cf. Srnicek, 2017; Havens & Lotz, 2012; Mandiberg, 2012), but our datasets also depend on our
entry point to digital inscriptions. For example, most digital platforms provide an API (Application Programming
Interface) that structures what and how much information may be accessed, as well as by whom and with which
restrictions. The information accessed through such ‘pipelines’ is often significantly different in detail and
completeness from that displayed on the interface of the same platforms (as a result of operations of aggregation,
anonymization or normalisation). Sometimes important portions of digital traces are excluded from APIs – the
Facebook API, for example, recently withdrew all information on personal profiles due to privacy requirements,
even though such profiles constitute the bulk of Facebook’s inscriptions (Rieder, 2013). The possibility remains, of
course, to ‘scrape’ information directly from the publicly accessible interfaces, services and applications, but even in
this way traces bring with them the mark of their origin (Marres & Weltevrede, 2013).

As a conclusion
Instead of concluding with a theoretical discussion, we prefer to remain faithful to the practical approach of this
paper and provide a summary of the eight questions discussed above. This summary is offered in the form of an
aide-mémoire that researchers embarking upon digital methods projects can keep with them as a reality-check list of
their findings and interpretations.

1. Role of digital media in relation to object of study

1.1. How much of your object of study occurs in the medium that you are studying?
1.2. Are you studying media traces for themselves or as proxies?

2. Definition of the study object

2.1. Is your operationalization attuned to the formats of the medium?
2.2. Is your operationalization attuned to the practices of the medium users?

3. From single-platform to cross-platform analysis

3.1. Does the phenomenon that you are studying spill across several media?
3.2. Do you have different but comparable operationalizations for the different media?

4. Corpus demarcation and data access

4.1. What does your corpus represent?
4.2. Are you accounting for the ways in which data are ‘given’ by the media?

