
Behavior Research Methods
2010, 42 (1), 254-265
doi:10.3758/BRM.42.1.254

Eyetracking for two-person tasks with manipulation of a virtual world

Jean Carletta, Robin L. Hill, Craig Nicol, and Tim Taylor
University of Edinburgh, Edinburgh, Scotland

Jan Peter de Ruiter
Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

and

Ellen Gurman Bard
University of Edinburgh, Edinburgh, Scotland

Eyetracking facilities are typically restricted to monitoring a single person viewing static images or prerecorded video. In the present article, we describe a system that makes it possible to study visual attention in coordination with other activity during joint action. The software links two eyetracking systems in parallel and provides an on-screen task. By locating eye movements against dynamic screen regions, it permits automatic tracking of moving on-screen objects. Using existing SR technology, the system can also cross-project each participant's eyetrack and mouse location onto the other's on-screen work space. Keeping a complete record of eyetrack and on-screen events in the same format as subsequent human coding, the system permits the analysis of multiple modalities. The software offers new approaches to spontaneous multimodal communication: joint action and joint attention. These capacities are demonstrated using an experimental paradigm for cooperative on-screen assembly of a two-dimensional model. The software is available under an open source license.

Monitoring eye movements has become an invaluable method for psychologists who are studying many aspects of cognitive processing, including reading, language processing, language production, memory, and visual attention (Cherubini, Nüssli, & Dillenbourg, 2008; Duchowski, 2003; Griffin, 2004; Griffin & Oppenheimer, 2006; Meyer & Dobel, 2003; Meyer, van der Meulen, & Brooks, 2004; Rayner, 1998; Spivey & Geng, 2001; Trueswell & Tanenhaus, 2005; G. Underwood, 2005; Van Gompel, Fischer, Murray, & Hill, 2007). Although recent technological advances have made eyetracking hardware increasingly robust and suitable for more active scenarios (Land, 2006, 2007), current software can register gaze only in terms of predefined, static regions of the screen. To take eyetracking to its full potential, we need to know what people are attending to as they work in a continuously changing visual context and how their gaze relates to their other actions and to the actions of others. Although present limitations simplify data collection and analysis and call forth considerable ingenuity on the part of experimenters, they invite us to underestimate the real complexity of fluid situations in which people actually observe, decide, and act. At present, we are only beginning to understand how people handle multiple sources of external information, or multiple communication modalities. Despite growing interest in the interactive processes involved in human dialogue (Pickering & Garrod, 2004), the interaction between language and visual perception (Henderson & Ferreira, 2004), how visual attention is directed by participants in collaborative tasks (Bangerter, 2004; Clark, 2003), and the use of eye movements to investigate problem solving (Charness, Reingold, Pomplun, & Stampe, 2001; Grant & Spivey, 2003; J. Underwood, 2005), the generation of suitably rich, multimodal data sets has up until now been difficult.


There is certainly a need for research of such breadth. Complex multimodal signals are available in face-to-face dialogue (Clark & Krych, 2004), but we do not know how often interlocutors actually take up such signals and exploit them online. Examining each modality separately will give us an indication of its potential utility but not necessarily of its actual utility in context. Single-modality studies may leave us with the impression that, for joint action in dialogue or shared physical tasks, all instances of all sources of information influence all players. In some cases, we tend to underestimate the cost of processing a signal. For example, we know that some indication of the direction of an interlocutor's gaze is important to the creation of virtual copresence, and that it has many potential uses (Cherubini et al., 2008; Kraut, Gergle, & Fussell, 2002; Monk & Gale, 2002; Velichkovsky, 1995; Vertegaal & Ding, 2002). Controlled studies with very simple stimuli, however, have shown that processing the gaze of another is not a straightforward bottom-up process. Instead, it interacts with what the viewer supposes the gazer might be looking at (Lobmaier, Fischer, & Schwaninger, 2006). In genuine situated interaction, there are many sources of such expectations, and all require some processing on the part of an interlocutor. General studies of reasoning and decision making (Gigerenzer, Todd, & ABC Research Group, 1999) have suggested that people have astute strategies for circumventing processing bottlenecks in the presence of superabundant information. It would be surprising if they did not streamline their interactions in the same way. To know how, we need to record individuals' dividing their attention between the fluid actions of others, their own attempts at communication, and the equally fluid results of a joint activity.

Given two eyetrackers and two screens, four breakthroughs are required before the technology can be usefully adapted to study cooperation and attention in joint tasks. First, one central virtual world or game must drive both participants' screens, so that both can see and manipulate objects in the same world. Second, it must be possible to record whether participants are looking at objects that are moving across the screen. Third, it must be possible for each participant to see indications of the other's intentions, as they might in real face-to-face behavior. For on-screen activities, those intentions would be indicated by icons representing their partner's gaze and mouse location. Finally, to give a full account of the interactions between players, the eyetracking, speech, and physical actions of the 2 participants must be recorded synchronously. Finding solutions to these problems would open up eyetracking methodology not only to studies of joint action but also to studies of related parts of the human repertoire—for example, joint viewing without action, competitive rather than collaborative action, or learning what to attend to in the acquisition of a joint skill. In the present article, we will describe solutions to these problems using the SR Research Eyelink II platform. The resulting software is available under an open source license from http://wcms.inf.ed.ac.uk/jast. We will demonstrate the benefits of the software within an experimental paradigm in which 2 participants jointly construct a figure to match a given model.

Review

Commercial and proprietary eyetracking software reports where a participant is looking, but only in absolute terms or by using predefined static screen regions that do not change over the course of a trial. The generic software for the Eyelink II tracker supplied with the system (SR Research, n.d.) provides a good basis for going forward. It can be used to generate raw eye positions and will calculate blinks, fixations, and saccades, associating fixations with static, user-defined regions of the screen. It does not support the definition of dynamic screen regions that would be required to determine whether the participant is looking at an object in motion. Nor does it support the analysis of eye data in conjunction with alternative data streams such as language, video, and audio data, or eye stream data from two separate machines. It will, however, take messages generated by the experimental software and add them to the rest of the synchronized data output, and it can pass messages to other Eyelink trackers on the same network. We use this facility to implement communication between our eyetrackers. SR Research, the makers of the Eyelink II, provided us with sample code that displays the eye cursor from one machine on the display of another by encoding it as a string on the first machine, sending the string as a message to the second machine, parsing the string back into an eye position on the second machine, and using that information to change the display (Brennan, Chen, Dickinson, Neider, & Zelinsky, 2008).

Despite the lack of commercial support for dual eyetracking and analysis against other ways of recording experimental data, some laboratories are beginning to implement their own software. In one advance, a series of eye movements (scanpaths) produced by experts are later used to guide the attention of novices (Stein & Brennan, 2004). In another, two eyetrackers are used in parallel with static displays but without cross-communication between one person's tracker and the other's screen (Hadelich & Crocker, 2006; Richardson, Dale, & Kirkham, 2007; Richardson, Dale, & Tomlinson, 2009). In a third, one participant's scanpath is simulated by using an automatic process to display a moving icon while genuinely eyetracking the other participant as he or she views the display (Bard, Anderson, et al., 2007). In a fourth, Brennan et al. (2008) projected genuine eye position from one machine onto the screen of another while participants shared a visual search task over a static display. Steptoe et al. (2008) and Murray and Roberts (2006) keyed the gaze of each avatar in an immersive virtual environment to actual tracked gaze of the participant represented by the avatar, but without a shared dynamic visual task.

Data Capture

Dual eyetracking could be implemented using any pair of head-free or head-mounted eyetrackers, as long as they can pass messages to each other. Our own implementation uses two head-mounted Eyelink II eyetrackers. Eyetracking studies start with a procedure to calibrate the equipment that determines the correspondence between tracker readings and screen positions, and they usually incorporate periodic briefer recalibrations to correct for any small drift in the readings. The Eyelink II outputs data in its proprietary binary format, "EDF," which can either be analyzed with software supplied by the company or be converted to a time-stamped ASCII format that contains one line per event. The output normally contains a 500-Hz data stream of eye locations with additional information about the calibration and drift correction results used in determining those locations, plus an online parsed representation of the eye movements in terms of blinks, fixations, and saccades. In addition to this data originating from the eyetracker itself, the eyetracker output will contain any messages that have been passed to the eyetracker from the experimental software, stamped with the time they were received. An Eyelink II natively uses two computers that are connected via a local network. The host machine drives the eyetracking hardware by running the data capture and calibration routines, and the display machine runs the experimental software.
Our installation uses two Pentium 4 3.0-GHz display machines running Windows XP with 1-GB DDR RAM, a 128-MB graphics card, a 21-in. CRT monitor, a Gigabit Ethernet card, and a Soundblaster Audigy 2 sound card, and two Pentium 4 2.8-GHz host machines running Windows XP and ROM-DOS 7.1 for the Eyelink II control software, with 512-MB DDR RAM and a Gigabit Ethernet card.

Figure 1. Hardware in the joint construction task (JCT) experimental setup.

Our arrangement for dual eyetracking is shown in Figure 1. Here, because there are two systems, four computers are networked together. In addition to running the experimental software, the display machines perform audio and screen capture using Camtasia (TechSmith, n.d.) and close-talking headset microphones. Audio capture is needed if we are to analyze participant speech; screen capture provides verification and some insurance against failure in the rest of the data recording by at least yielding a data version that can easily be inspected, although the resulting videos have no role in our present data analysis.

As usual, the display machines are networked to their respective hosts so that they can control data recording and insert messages into the output data stream but, in addition, the display machines pass messages to each other. These are used to keep the displays synchronized. For instance, if Participant A moves his or her eyes, his or her mouse, or some on-screen object, the experimental software running on A's display machine will send a message to that effect to the experimental software on Participant B's display machine, which will then update the graphics to show A's gaze cursor, A's mouse cursor, and the shared object in their new locations. Coordinating the game state is simply a matter of passing sufficient messages. There is, of course, a lag between the time when a message is generated on one participant's machine and the time when it is registered on the other's. In our testing, 99% of the lags recorded were less than 20 msec. In pilot experiments, debriefed participants did not notice any lag. Even with the small additional time needed to act on a message to change the display, this degree of lag is tolerable for studies of joint activity. During any trial, the experimental software should loop through checking for local changes in the game and passing messages until both of the participants have signaled that the trial has finished. In our experimental software, each loop takes around 42 msec. Since eye positions move faster than do other screen objects, we pass them twice during each loop. Whenever a display machine originates or receives a message, it sends a copy of that message to its host machine for insertion into its output data stream. As a result, the output of both eyetracking systems contains a complete record of the experiment, although the eye positions are sampled more coarsely on the opposing than on the native machine.
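
The string-message mechanism described above can be sketched in a few lines of C++. The fragment below is illustrative only: the field layout ("EYE <time> <x> <y>") and the function names are invented for exposition and are not the JCT's actual wire format, but they show how a gaze sample can be flattened to an ASCII message on one display machine and parsed back on the other.

```cpp
// Minimal sketch of the string-message idea (field layout and names are
// invented for exposition, not the JCT's actual wire format).
#include <cstdio>
#include <string>

struct EyeSample { long t; float x, y; };   // tracker time (msec), screen pixels

// Flatten one gaze sample into an ASCII message on the sending display machine.
std::string encodeEye(const EyeSample& s) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "EYE %ld %.1f %.1f", s.t, s.x, s.y);
    return buf;
}

// Parse it back on the receiving machine; returns false for a malformed message.
bool parseEye(const std::string& msg, EyeSample& s) {
    return std::sscanf(msg.c_str(), "EYE %ld %f %f", &s.t, &s.x, &s.y) == 3;
}
```

Because every message a display machine sends or receives is also copied to its host, the same strings end up time stamped in both eyetracker outputs, which is what makes the later reconstruction and analysis possible.
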
Example

The experimental paradigm. In our experimental paradigm, 2 participants play a series of construction games. Their task is to reproduce a static two-dimensional model by selecting the correct parts from an adequate set and joining them correctly. Either participant can select and move (left mouse button) or rotate (right mouse button) any part not being grasped by the other player. Two parts join together permanently if brought into contact while each is "grasped" by a different player. Parts break if both participants select them at the same time, if they are moved out of the model construction area, or if they come into contact with an unselected part. Any of these "errors" may be committed deliberately to break an inadequate construction. New parts can be drawn from templates as required.

These rules are contrived to elicit cooperative behavior: No individual can complete the task without the other's help. The rules can easily be reconfigured, however, or alternative sets of rules can be implemented. Figure 2A shows an annotated version of the initial participant display from our implementation of the paradigm with labels for the cursors and static screen regions. Figure 2B shows a later stage in a trial in which a different model (top right) is built. Initial parts are always just below the model. A broken part counter appears at the top left, a timer in the middle at the top, and new part templates across the bottom of the screen. After each trial, the participants can be shown a score reflecting accuracy of their figure against the given model.

This experimental paradigm is designed to provide a joint collaborative task. The general framework admits a number of independent variables, such as the complexity of the model, the number of primary parts supplied, the difficulties that are presented by the packing of the initial parts, and whether or not the participants have access to each other's speech, gaze cursor, or mouse position. These variables can all be altered independently from trial to trial. The paradigm is suitable for a range of research interests in the area of joint activity. Performance can be automatically measured in terms of time taken, breakages, and accuracy of the constructed figure against the target model (maximum percentage of pixels that are colored correctly when the constructed figure is laid over the model and rotated). In addition, the paradigm allows for an analogous task for individuals that serves as a useful control. In the individual "one-player" version, two parts join together when they touch, even if only one is selected.
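
To make the accuracy measure concrete, the sketch below overlays two equally sized rasters and reports the best percentage of correctly colored pixels over a set of candidate rotations. It is a minimal sketch under stated assumptions (square rasters, only the four cardinal rotations, and 32-bit color values with 0 as background), not the JCT's actual scoring routine.

```cpp
// Minimal sketch of the accuracy score; the square raster, four cardinal
// rotations, and colour encoding are illustrative assumptions.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Raster {
    int n;                                  // width == height
    std::vector<uint32_t> px;               // row-major colours, 0 = background
    uint32_t at(int x, int y) const { return px[y * n + x]; }
};

// Rotate a square raster by k * 90 degrees.
static Raster rotate90(const Raster& r, int k) {
    Raster out = r;
    for (int y = 0; y < r.n; ++y)
        for (int x = 0; x < r.n; ++x) {
            int nx = x, ny = y;
            switch (k & 3) {
                case 1: nx = r.n - 1 - y; ny = x; break;
                case 2: nx = r.n - 1 - x; ny = r.n - 1 - y; break;
                case 3: nx = y; ny = r.n - 1 - x; break;
            }
            out.px[ny * out.n + nx] = r.at(x, y);
        }
    return out;
}

// Best percentage of model pixels reproduced with the correct colour.
static double accuracyScore(const Raster& model, const Raster& built) {
    int modelPixels = 0;
    for (uint32_t p : model.px) if (p) ++modelPixels;
    double best = 0.0;
    for (int k = 0; k < 4; ++k) {
        Raster r = rotate90(built, k);
        int match = 0;
        for (int i = 0; i < model.n * model.n; ++i)
            if (model.px[i] && model.px[i] == r.px[i]) ++match;
        best = std::max(best, 100.0 * match / std::max(modelPixels, 1));
    }
    return best;
}
```
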
Figure 2. Annotated screen shots of two variants of the JAST joint construction task. (A) Initial layout of a construction task. (B) Screen shot 20 sec into a trial in a tangram construction task.

The experimental software. The JAST joint construction task, or JCT, is software for running experiments that fit this experimental paradigm. It is implemented under Windows XP in Microsoft Visual C++.Net and draws on a number of open source software libraries. These include Simple DirectMedia Layer (SDL) support for computer graphics (Simple DirectMedia Layer Project, n.d.); add-ons available for download along with SDL and from the SGE project (Lindström, 1999) that provide further support for things like audio, rotation, collision detection, and text; Apache's XML parser, Xerces (Apache XML Project, 1999); and the Simple Sockets Library for network communication (Campbell & McRoberts, 2005), which we use instead of the Eyelink II networking support specifically so that the software can also be run without eyetracking.

When the experimental software is run without an eyetracker, rather than passing messages to a host machine, the display machine stores the time-stamped messages in an ASCII file that follows the same structure that the eyetracker itself produces, so that the file can be used in the same onward processing. Even without the richness of eye-movement data, collaborative behavior can therefore still be investigated with this software using only two standard networked PCs.

The configuration of an experiment is defined by a set of extensible markup language (XML) files. Each experiment consists of a number of trials using one model per trial. Models are built of polygon parts. Curved surfaces are simulated by a polygon with a high number of vertices. Our software includes a utility to generate these curved parts. The initial and target configuration for each model, and descriptions of the polygon parts used, are stored in a "stimulus set" XML file, and the experiment configuration links to a number of these files, which are presented in order. For each trial, the experiment configuration file specifies the stimulus set, whether to show the clock, whether this clock counts up or down, the time limit, whether the position of the partner's eye and mouse should be visible, and any text or graphics that should be shown between trials. It also specifies the machines on which the experiment will run, whether the experiment is to be run for individuals or pairs of participants, the size and location of the static screen regions, and a number of experimenter-composed text messages to display at certain points in the experiment. When run, the software performs the eyetracker calibration and then presents the trials in the order specified, performing drift correction between trials.
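
For concreteness, the fragment below shows the general shape such a per-trial configuration might take. It is a sketch only: the element and attribute names are invented for exposition and are not the JCT's actual schema.

```xml
<!-- Illustrative sketch only: element and attribute names are invented
     for exposition and are not the JCT's actual schema. -->
<experiment pairs="true" machineA="display-a" machineB="display-b">
  <trial stimulusSet="tangram01.xml" showClock="true" clockCountsDown="true"
         timeLimit="240" showPartnerEye="true" showPartnerMouse="true">
    <betweenTrialText>Please wait for the next model.</betweenTrialText>
  </trial>
  <trial stimulusSet="tangram02.xml" showClock="true" clockCountsDown="false"
         timeLimit="240" showPartnerEye="false" showPartnerMouse="true"/>
</experiment>
```
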
In this implementation of the experimental paradigm, the messages passed between the eyetrackers, and therefore stored in the output, are the following.

1. Markers showing the time when each trial starts and ends, along with the performance scores for the number of breakages and the accuracy of the final construction.
2. Markers near the beginnings of trials showing when a special audio tone and graphical symbol were displayed, to make it easier to synchronize the data against audio and video recordings.
3. Participant eye positions.
4. Sufficient information to reconstruct the model-building state, including joins, breakages, part creation, and any changes in the locations or rotations of individual parts.

Our video reconstruction utility takes the ASCII data from one of the eyetrackers and uses it to create a video that shows the task with the eye and mouse positions of the 2 participants superimposed. The videos produced from the two ASCII data files for the same experiment are the same, apart from subtle differences in timing: Each video shows the data at the time that it arrived at, or was produced by, the individual eyetracker used as its source. The utility, which is again written in Microsoft Visual C++.Net, uses FFmpeg (Anonymous, n.d.) and libraries from the JCT experimental software to produce MPEG2 format videos with a choice of screen resolutions and color depths.

Interpreting participant behavior. Capturing the data with the methods described creates a complete and faithful record of what happened during the experiment, but using primitives below the level of abstraction required for analysis. For instance, the messages describe the absolute positions of eyes, mice, and all parts throughout the task, but not in terms of when a participant is looking at a particular part, even though knowing this is essential to understanding how gaze, speech, and action relate. Therefore, the next step is to add this higher level of analysis to the data set. To do this, we transfer the data out of the ASCII format exported by the Eyelink II and into an XML format. XML format allows us to use existing parsers and makes it easy to check that the data conform to our expectations by validating data files against a document-type definition specifying what sorts of events and objects they should contain.

The resulting XML file contains most of the information gathered so far in the experiment. Drawing data from the output of both eyetrackers and from the experiment configuration file, it includes all of the messages that were passed, parsed eye data for both participants, plus a list of the parts, their movements, and their join events, with part properties—such as shape and initial locations—taken from the experiment configuration file. Although it would not be difficult to include the full 500-Hz eye position data in the XML representation, it is now omitted for several reasons. First, this level of detail is better suited for the kind of parsing provided by the eyetracker software itself. Second, including it would increase file size substantially. Third, it is unnecessary for our intended analysis.

In addition to reencoding these kinds of data from the experimental software, the utility that produces the XML file adds a number of data interpretations:

Look events. During these, a participant looked at a part, composite, or static region (typically referred to as regions of interest, or ROIs). Where gaze is on a movable object (a dynamic ROI, or DROI), these events cover the entire time that the eye position is within a configurable distance of the moving screen region associated with whatever is being looked at, whether the eye is currently engaged in a fixation, a smooth pursuit, or a saccade. Since the XML file contains parsed eye movements, the class of eye activity can be established by later analysis, if required.

Hover events. During these, a participant located the mouse over a part or composite without "grasping" or moving the part by buttonpress. Hover events also cover the entire time that the mouse cursor is within a configurable distance of the moving screen region associated with the part.
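
The containment test behind look and hover events can be sketched as follows. The sketch simplifies the moving screen region to a circle around the part's centre plus a configurable slack distance; the JCT works with the part's actual region, so treat the geometry here as an assumption made for illustration.

```cpp
// Sketch of the containment test behind look and hover events. The moving
// region is simplified to a circle around the part's centre plus a
// configurable slack; the JCT uses the part's actual screen region.
#include <cmath>

struct Point { float x, y; };

bool cursorOnPart(Point cursor, Point partCentre, float partRadius, float slack) {
    float dx = cursor.x - partCentre.x;
    float dy = cursor.y - partCentre.y;
    return std::sqrt(dx * dx + dy * dy) <= partRadius + slack;
}
```

Applied to the eye position stream, a test of this kind delimits look events on a DROI; applied to the mouse position, it delimits hover events.
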
Construction history for each trial. This is a description of how the pair constructed the figure, which is given as a set of binary trees where the leaves are the parts, each with a unique identifier. Each node uniquely identifies a composite that is created by joining any other composites or parts that are the children of the node.
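
A minimal sketch of this representation is shown below. It assumes, for illustration only, that composite identifiers are built by concatenating the children's identifiers; the identifier scheme in the JCT output may differ.

```cpp
// Sketch of the construction history: a binary tree whose leaves are primary
// parts and whose internal nodes are composites created by a join. The
// identifier scheme (concatenating the children's identifiers) is an
// illustrative assumption.
#include <memory>
#include <string>

struct BuildNode {
    std::string id;                        // e.g. "p3" or "(p3+p5)"
    long joinTime = -1;                    // -1 for a primary part
    std::unique_ptr<BuildNode> left, right;
};

std::unique_ptr<BuildNode> join(std::unique_ptr<BuildNode> a,
                                std::unique_ptr<BuildNode> b, long t) {
    auto node = std::make_unique<BuildNode>();
    node->id = "(" + a->id + "+" + b->id + ")";
    node->joinTime = t;
    node->left = std::move(a);
    node->right = std::move(b);
    return node;
}
```
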
Composition phases. These divide the time taken to create each composite into phases. The final phase covers the final action of docking the last piece of the composite, and two earlier phases simply divide the rest of the composition time in half. Like the construction history, this representation can be used to study construction strategy or to find subtasks of particular difficulty.

Per-trial breakage scores. These are calculated over the XML format for use as a parity check against those coming directly from the experimental software.

The JastAnalyzer software that performs this processing requires the same environment as does the JCT experimental software. It works by passing the ASCII data back through libraries from the JCT software in order to interpret the task state. Because we exploit the resulting XML data for a number of different purposes, we call this our "general data format," or GDF. It is straightforward, given the data format, to produce scripts that show useful analyses, such as the lag between when one participant looks at an object and when the other participant follows. It is also easy to produce per-trial summary statistics, such as how many times a participant looked at each of the parts and what percentage of the time he or she spent looking at the clock.
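
Given GDF-style look events, a lag measure of this kind can be sketched as below. The Look record layout and the rule used here (first arrival of the partner's gaze on the same part at or after the onset of a look) are illustrative assumptions, not the JastAnalyzer's actual definition.

```cpp
// Sketch of a gaze-lag measure over look events: for each interval in which
// player A looks at a part, find how long before player B's gaze first
// arrives on the same part. Event layout and matching rule are assumptions.
#include <optional>
#include <string>
#include <vector>

struct Look { std::string part; long start, end; };   // msec

std::vector<long> gazeLags(const std::vector<Look>& a, const std::vector<Look>& b) {
    std::vector<long> lags;
    for (const Look& la : a) {
        std::optional<long> best;
        for (const Look& lb : b)
            if (lb.part == la.part && lb.start >= la.start)
                if (!best || lb.start - la.start < *best)
                    best = lb.start - la.start;
        if (best) lags.push_back(*best);
    }
    return lags;
}
```
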
Export to data display and analysis packages. Thus far, we have seen how the automatically collected JCT data are represented. For more complicated data analysis, we export from our GDF into the format for existing data display and analysis packages. These allow the data to be inspected graphically, "playing" them synchronized to the video and audio records. Such packages are useful both for checking that the data are as expected and for browsing with an eye to understanding participant behavior. But the real strength of such packages lies in the ability to create or import other sources of information for the same data, such as orthographic transcription, linguistic annotations relating to discourse phenomena, and video coding. Some packages also provide data search facilities that can go beyond the simple analyses generated with our GDF-based scripts—for example, investigations of the relationship between task behavior and speech. Our software includes export to two such packages: ELAN (MPI, n.d.) and the NITE XML Toolkit, or NXT (Language Technology Group, n.d.). Both can display behavioral data in synchrony with one or more corresponding audio and video signals, and both support additional data annotation. ELAN has strengths in displaying the time course of annotations that can be represented as tiers of mutually exclusive, timestamped codes, whereas NXT has strengths in supporting the creation and search of annotations that relate to each other both temporally and structurally, as is usual for linguistic annotations built on top of orthographic transcription.

Both ELAN and NXT use XML in their data formats. Because our utilities for export to these programs are based on XSLT stylesheets (World Wide Web Consortium, 1999)—the standard technique for transducing XML data from one format to another—they will run on any platform, as long as a stylesheet processor is installed.

Example Experiment and Analyses Afforded

Method

Task. We can illustrate the utility of the JCT system via a designed corpus of time-aligned multimodal data (eye movements, actions, and speech) that were produced while pairs of individuals assembled a series of tangrams collaboratively. Produced as part of the Joint-Action Science and Technology project (www.jast-project.eu), this corpus was used to explore factors that might benefit human–robot interactions by studying human–human interactions in collaborative practical tasks. In this example, we devised 16 target tangrams, none of which resembled any obvious nameable entity. To engineer referring expressions, we designed each part to represent a unique color–shape combination, with each color represented only once, and each shape at most twice. The same initial set of seven pieces was available at the beginning of each construction trial. All had to be used.

Because trials in our paradigm can take over 4 min to complete, drift correction is important. Midtrial interruptions for calibration are undesirable for collaborative problem solving. Instead, the software package offers a manual correction utility for use on the reconstructed videos, which enables optional offline adjustments to be made.

Design. In order to investigate the relative worth of speech and gaze feedback in joint action, communication modalities were varied factorially: Participants could speak to each other and see the other person's eye position; participants could speak but could not see where the other person was looking; participants could not speak but could see where their collaborator was looking; or participants could neither speak nor see the gaze location of their partner. The condition order was rotated between dyads, but each pair of participants built four models under each condition. Additionally, because leadership and dominance factors can modulate how people interact, half of the dyads were assigned specific roles: One person was the project manager, whereas the other was the project assistant. The other half were merely instructed to collaborate. All were asked to reproduce the model tangram as quickly and as accurately as possible while minimizing breakages. To determine the usefulness of a verbal channel of communication during early strategy development, half of the dyads encountered the speaking conditions in the first half of the experiment, and the other half in the second.
Results

In the following sections, we will indicate how such a corpus may be exploited. In the first two cases, we will illustrate types of automatically produced data that could be used to explore some current questions. In the third section, we will show how these automatically recorded events can be combined with human coding to provide multimodal analyses. We will cite other work that further exploits the rich data of similar corpora.

Fine-grained action structures. Often resources are provided for joint actors on the assumption that whatever is available to an actor is used. Sometimes (see Bard, Anderson, et al., 2007) this proves not to be the case. The data that were recorded for the present experiment permitted very fine-grained analyses of action/gaze sequences that would allow us to discover, without further coding, who consults what information at critical phases of their actions. Figure 2B shows Participant B's screen 20 sec into an experimental trial. In this figure, Participant A has grasped the triangle on the right, and B has grasped the triangle on the left. Two seconds after the point captured in Figure 2B, the game software records that the players successfully joined the two parts together. Because our software captures continuous eyetracks, and because it records the precise time at which parts are joined, we can use the construction and eyetrack records to discover that in the 10 sec preceding the join, Participant A's gaze, which is shown here as a small circular cursor, moved rapidly between the two moving triangles and briefly rested in the workspace position at which the pieces were ultimately joined. Participant B's gaze, which is not displayed on this snapshot of B's screen but would appear on the reconstructed video, also traveled between the two triangles, but with two separate excursions to fixate the target model at the top right corner of the screen. Thus, the players are consulting different aspects of the available information while achieving a common goal. This particular goal—a combination of two triangles to construct a larger shape—was successfully achieved. Because all partial constructions have unique identifiers that are composed of their ancestry in terms of smaller constructions or primary parts, it is possible to distinguish successful acts of construction, which are not followed by breakage and a restart, from those that are quickly broken. It would then be possible to discover whether, as in this example, successful constructions were characterized by the differentiation of visual labor in their final delicate phases, whereas unsuccessful constructions were not. It would also be possible to discover whether the distribution of critical-phase visual labor in each dyad could predict overall performance (automatically recorded per-trial durations, breakage counts, and accuracy scores). Since the composition history automatically divides the final half of each interval between construction points from two earlier periods, we might also look for differentiation or alignment of gaze during those earlier phases in which players must develop a strategy for achieving the next subgoal.

Trialwise measures: Actions entraining gaze. Our software can also summarize events during a given trial. We illustrate how such figures might be used to determine whether gaze is largely entrained by the ongoing physical actions. If so, there may be little opportunity for role or strategy to determine attention here. We have several measures to use.

In the following examples, figures for Participant A are followed by those for Participant B in parentheses. In the trial seen in Figure 2B, 71% (72%) of the time is spent looking at the model, and 27% (20%) at the construction area, with the rest spent looking in other areas or off-screen. The player's gaze position overlaps with a tangram part (moving or stationary) in the construction area 24% (11%) of the time, divided over 213 (173) different occasions, of which 56 (41) involve stable fixations. There are 29 occasions (of the 213 [173] gaze–part overlaps) during which both participants' eyetracks overlap with the same object, totaling 21.5 sec across the entire trial, but only 14 occasions during which 1 player's gaze overlaps with an object that the other player is currently manipulating. Thus, we have prima facie evidence that, for this dialogue, the common task on yoked screens does not draw the majority of players' visual activity to the same objects at the same time. Nor does an object that 1 player moves automatically entrain the other player's gaze. Making individual tests on pilot trials for a new type of joint task would allow the experimenter to determine whether the task had the necessary gaze-drawing properties for the experimental purpose.

In capturing this information about dyadic interaction, our experimental setup reveals a level of detail that is unprecedented in previous studies. The data from our eyetracker software suffice for studies of how the experimental variables interact and influence measures of task success, such as speed and accuracy. They also enable the analyst, for instance, to calculate the lag between when one person looks at a part and when the partner does, or to determine who initiates actions, as would be required for measuring dominance.

Hand coding. Although the system automates some tasks, it is clear that it cannot meet every experimental purpose. Export to NITE or ELAN allows further coding to suit the main experimental goals. As a simple demonstration, we follow the data of this study through transcription and reference coding. We use ChannelTrans (International Computer Science Institute, n.d.) to create orthographic transcription in which the conversational turns are time stamped against the rest of the data. By importing the transcription into the NITE XML Toolkit, we can identify and code the referring expressions used during the task. Figure A1 in the Appendix illustrates the use of an NXT coding tool to code each referring expression with the system's identifier for its referent. The result is a time-aligned multimodal account of gaze, reference, action, and trial conditions.

This allows us, for instance, to count the number of linguistic expressions referring to each part and to relate these references to the gaze and mouse data. As an example of the sorts of measures that this system makes easy to calculate, over the eight speech-condition trials for one of the participant dyads, there were 267 instances of speech and 1,262 instances of "looking" at DROIs, which lasted over a 45-msec duration. On 78 occasions, a player looked at a part while uttering a speech segment containing a reference to it. On 95 occasions, one player looked at a part while the other was uttering a speech segment referring to it. Figure 3 shows the complete data flow used to obtain this analysis.
Figure 3. Data flow used to obtain the analysis for the example experiment.

Using data like these, we have shown how mouse movements coincide with particular distributions of referring expressions (Foster et al., 2008) and how the roles assigned to the dyad influence audience design in mouse gestures (Bard, Hill, & Foster, 2008).

Empirical investigation: Determining whether dialogue or gaze projection during joint action leads to greater visual alignment. If dialogue encourages alignment all the way from linguistic form to intellectual content, as Pickering and Garrod (2004) proposed, and if there is a shared visual environment in which to establish common ground (Clark, 1996; Clark & Brennan, 1991; Clark, Schreuder, & Buttrick, 1983; Lockridge & Brennan, 2002), then the opportunity for 2 people to speak should assure that they look at their common visual environment in a more coordinated way. Similarly, having a directly projected visual cue that indicates where another person is looking (perhaps analogous to using a laser pointer) offers an obvious focus for visual alignment. Thus, we would expect conditions allowing players to speak as they construct tangrams to have more aligned attention than those in which they play silently. And we would expect conditions in which each one's gaze is cross-projected on the other's screen to have more aligned attention than those with no gaze cursors. Using the aforementioned study, we can test these hypotheses via analyses of viewing behavior, visual alignment, and joint action, which were not previously possible.

First, we will examine a new dependent variable that can be automatically extracted from the data: the time spent looking at the tangram parts and the partially built tangrams. These are particularly complex DROIs, because either player is free to move, rotate, or even destroy any of the parts at any time. In this paradigm, however, the parts are the tools, and the tangrams are the goals of the game. They must be handled and guided with care. In this example, we began by examining the time that participants spent looking at any of these ROIs, dynamic or otherwise, as a percentage of the total time spent performing the task, using a 2 (speech, no speech) × 2 (gaze visible, invisible) × 2 (dyad member: A, B) ANOVA with dyads as cases. Since we know that, in monologue conditions, what people hear someone say will influence how they view a static display (Altmann & Kamide, 1999, 2004, 2007), we would expect the speaking conditions to direct players' gazes to the parts under discussion quite efficiently. In fact, the proportion of time spent inspecting any of the available ROIs was significantly lower when participants could speak to each other (17.41%) than when they were unable to speak (21.78%) [F(1,31) = 53.49, p < .001]. We would also expect the direction of one player's gaze to help focus the other's attention on objects in play. We know that gaze projection provides such help for static ROIs (Stein & Brennan, 2004). In fact, there was no discernible difference in players' DROI viewing times as a consequence of being able to see their partners' eye positions [F(1,31) < 1]: With and without cross-projected gaze, players tracked DROIs about 20% of the time overall. So players were not looking at the working parts as much when they were speaking, and seeing which piece their partners were looking at did not alter the proportion of time spent looking at task-critical objects.

Second, we exploited a measure that is contingent on being able to track both sets of eye movements against potentially moving or shifted targets: how often both players are looking at the same thing at exactly the same time. Here, a 2 (speech) × 2 (gaze) ANOVA indicates that the ability to speak in fact reduced the number of instances of aligned gaze by nearly 24% [29.7 with speech vs. 39.1 without speech; F(1,31) = 15.76, p < .001]. Reduction in the frequency of visual alignment is not an artifact of trial duration.
In fact, trials involving speech were an average of 7 sec longer than those without speech [95.99 sec vs. 89.02 sec, respectively; F(1,31) = 5.11, p < .05] and might be expected to increase the number of eye–eye overlaps, even if just by chance. Again, however, the ability to see precisely where the other person was looking had no influence [F(1,31) < 1], and there was no interaction.

The analyses that reveal overlapping gaze can also generate the latency between the point at which one player begins to look at an ROI and the point at which the other's gaze arrives: the eye–eye lag (Hadelich & Crocker, 2006). Eye–eye lag was shorter when participants could speak to each other [197 msec with speech as compared with 230 msec without speech; F(1,31) = 6.44, p < .05]. Perhaps surprisingly, the projection of a collaborator's gaze position onto the screen failed to make any difference to these lag times [212 msec when gaze position was visible vs. 214 msec when it was not; F(1,31) < 1], and again, there was no interaction between the variables.

Taken together, these results indicate that in a joint collaborative construction task, the facility for members of a working dyad to speak to each other reduces the proportion of time spent looking at the construction pieces and reduces the number of times both partners look at the same thing at the same time. Visual alignment appears to be inhibited when a dyad could engage in speech, but when it did occur, there was a shorter delay in coordination since the lag between one person looking at a potential target and the partner then moving his or her eyes onto the same target was smaller. In contrast, a direct indicator of a collaborator's gaze position did not appear to have any effect (facilitatory or inhibitory) on the measures examined. Contrary to expectations, therefore, being able to discuss a yoked visual workspace does not automatically yoke gaze. And there is no evidence that one person will track what another person is looking at more often when his or her eye position is explicitly highlighted. The combined influence and shape of any interaction between available modalities is obviously of critical importance to the understanding and modeling of multimodal communication and, as our present example demonstrates, the paradigm expounded here offers an ideal method of enhancing research into this topic.

Finally, gaze coordination can be further examined using cross-recurrence analysis. This technique has been used in the investigation of language and visual attention by Richardson and colleagues (Richardson & Dale, 2005; Richardson et al., 2007; Richardson et al., 2009) to demonstrate that visual coordination can manifest as being in phase or nonrandom without necessarily being simultaneous. However, these studies used only shared static images or replayed a monologue to participants. Bard and colleagues (Bard, Hill, & Arai, 2009; Bard et al., 2008; Bard, Hill, Nicol, & Carletta, 2007) have taken this technique further to examine the time course of visual alignment during the JCT.
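
A categorical version of such an analysis can be sketched as follows. It is a minimal illustration in the style of Richardson and Dale (2005); the input representation (one ROI label per resampled time step for each player) is an assumption made for exposition rather than the format produced by our software.

```cpp
// Sketch of categorical gaze cross-recurrence: reduce both gaze streams to
// an ROI label per time step, then compute, for each lag, the proportion of
// steps on which the two players' labels match. Representation is assumed.
#include <algorithm>
#include <string>
#include <vector>

std::vector<double> crossRecurrence(const std::vector<std::string>& roiA,
                                    const std::vector<std::string>& roiB,
                                    int maxLag) {
    std::vector<double> profile;           // index 0 corresponds to lag -maxLag
    int n = static_cast<int>(std::min(roiA.size(), roiB.size()));
    for (int lag = -maxLag; lag <= maxLag; ++lag) {
        int hits = 0, total = 0;
        for (int t = 0; t < n; ++t) {
            int u = t + lag;
            if (u < 0 || u >= n) continue;
            ++total;
            if (!roiA[t].empty() && roiA[t] == roiB[u]) ++hits;
        }
        profile.push_back(total ? static_cast<double>(hits) / total : 0.0);
    }
    return profile;
}
```
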
Discussion

Our software implements a new method for collecting eyetracking data from a pair of interacting participants, including showing the same game on the screens of the eyetrackers with indicators of the partner's gaze and mouse positions. The method allows us to construct a record of participant interaction that gives us the fine-scale timing of gaze, mouse, and task behaviors relative to each other and to moving objects on the screen.

We have demonstrated our techniques using a pair of Eyelink II eyetrackers and experimental software that implements a paradigm in which 2 participants jointly construct a model. Although our experimental paradigm is designed for studying a particular kind of joint action, the basic methods demonstrated in the software can be used to advance work in other areas. The software shows how to use one task to drive two eyetracker screens, record gaze against moving screen objects, show a partner's gaze and mouse icons, and synchronize the eyetracking, speech, and game records of 2 participants. Inventive message passing is the key, since messages can be both passed between the eyetrackers and stored in the output record, providing all of the data communication required. Tracking the relationship between gaze and moving objects, for instance, is simply a case of having the experimental software note the movement as messages in the eyetracker output, so that it can be checked against the gaze track analytically later on. Audio and screen capture can be synchronized to the eyetracker output by having the experimental software provide audible and visual synchronization marks, noting the time of these marks relative to the eyetracker output, again, as messages. Coordinating two screens requires each copy of the experimental software to record its state in outgoing messages and to respond to any changes reported to it in incoming ones. In theory, our techniques will work with any pair of eyetrackers that can handle message passing, including pairs of eyetrackers from different manufacturers. The message passing itself could become a bottleneck in some experimental setups, especially if it was used to study groups, but we found it adequate for our purpose. Manufacturers could best support the methodological advances we describe by ensuring that they include message-passing functionality that is both efficient and well documented.

These advances are useful not just for our work, but also for work in other areas in which more limited eyetracking methods are currently in use. Work on visual attention could benefit from the ability to register dynamic screen regions. Multiple object tracking (Bard et al., 2009; Bard, Hill, et al., 2007; Pylyshyn, 2006; Pylyshyn & Annan, 2006; Wolfe, Place, & Horowitz, 2007), subitization (Alston & Humphreys, 2004), and feature binding (Brockmole & Franconeri, 2009; Luck & Beach, 1998), for instance, would be less constrained if the objects were treated in the way we suggest.

It is joint action that has the most to gain, however, because it requires all of the advances made. The ability to import the data that comes out of the experimental paradigm into tools used for multimodal research, such as ELAN and NXT, offers particular promise for the study of joint action, especially where language is involved.
These tools will enable the basic data to be browsed, combined with orthographic transcription, enriched with hand annotation that interprets the participant behaviors, and searched effectively. Using these techniques will allow a fuller analysis of what the participants are doing, saying, and looking at than can any of the current techniques. The effect of copresence on joint action (Fussell & Kraut, 2004; Fussell et al., 2004; Horton & Keysar, 1996; Kraut, Fussell, & Siegel, 2003; Kraut et al., 2002) can also easily be manipulated by altering the proximity of the two eyetrackers—the limit only depending on the network connections. Similarly, the effect of perspective on language use (Hanna & Tanenhaus, 2004), including the use of deictic expressions such as "this" or "that," can be investigated, as well as the differences in eye movements during comprehension (Spivey, Tanenhaus, Eberhard, & Sedivy, 2002) versus production (Horton & Keysar, 1996), and whether speakers engage in "audience design" or instead operate along more egocentric lines (Bard, Anderson, et al., 2007; Bell, 1984). Functional roles, such as instruction giver or follower, can be assigned to members of a dyad to determine how this might influence gaze behavior and the formation of referring expressions (Engelhardt, Bailey, & Ferreira, 2006).

The experimental environment opens up many possibilities for eyetracking dynamic images and active scene viewing, topics that are still surprisingly underresearched. In particular, the smooth pursuit of objects (Barnes, 2008; Burke & Barnes, 2006) that is controlled either by the person him- or herself or by his or her partner in a purposeful, nonrandom fashion can be investigated. Data can be output in their raw, sample-by-sample form, permitting customized smooth pursuit detection algorithms to be implemented, as total gaze durations (accumulated time that the eye position coincided with an on-screen object, irrespective of eye stability), or as a series of fixations automatically identified by the SR Research (n.d.) software. Since most paradigms involve only static images, eye movements are almost exclusively reported in terms of saccades or fixations, ignoring the classification of pursuit movements. The EyeLink II also offers either monocular or binocular output, both of which can be utilized by our software.
http://ffmpeg.mplayerhq.hu /; last accessed February 2, 2010.
also offers either monocular or binocular output, both of Apache XML Project (1999). Xerces C++ parser. Retrieved Febru-
which can be utilized by our software. ary 2, 2010, from http://xerces.apache.org/xerces-c/.
The eyes gather information that is required for motor Bangerter, A. (2004). Using pointing and describing to achieve joint
actions and are therefore proactively engaged in the pro- focus of attention in dialogue. Psychological Science, 15, 415-419.
doi:10.1111/j.0956-7976.2004.00694.x
cess of anticipating actions and predicting behavior (Land Bard, E. G., Anderson, A. H., Chen, Y., Nicholson, H. B. M., Ha-
& Furneaux, 1997). These skills are developed surpris- vard, C., & Dalzel-Job, S. (2007). Let’s you do that: Sharing the
ingly early, with goal-directed eye movements being in- cognitive burdens of dialogue. Journal of Memory & Language, 57,
terpretable by the age of 1 (Falck-Ytter, Gredeback, & 616-641. doi:10.1016/j.jml.2006.12.003
von Hofsten, 2006). More recently, Gesierich, Bruzzo, Bard, E. G., Hill, R., & Arai, M. (2009, July). Referring and gaze
alignment: Accessibility is alive and well in situated dialogue. Paper
Ottoboni, and Finos (2008) studied gaze behavior and the presented at the Annual Meeting of the Cognitive Science Society,
use of anticipatory eye movements during a computerized Amsterdam.
block-stacking task. The phenomenon of the eyes pre- Bard, E. G., Hill, R., & Foster, M. E. (2008, July). What tunes ac-
empting upcoming words in the spoken language stream cessibility of referring expressions in task-related dialogue? Paper
presented at the Annual Meeting of the Cognitive Science Society,
is, as we have noted, the basis of the well-established “vi- Washington, DC.
sual world” paradigm in psycholinguistics (Altmann & Bard, E. G., Hill, R., Nicol, C., & Carletta, J. (2007, August). Look
Kamide, 1999, 2004, 2007). here: Does dialogue align gaze in dynamic joint action? Paper pre-
In summary, we have successfully developed a com- sented at AMLaP2007, Turku, Finland.
bined, versatile hardware and software architecture that Barnes, G. (2008). Cognitive processes involved in smooth pursuit eye
movements. Brain & Cognition, 68, 309-326. doi:10.1016/j.bandc
enables groundbreaking and in-depth analysis of sponta- .2008.08.020
neous, multimodal communication during joint action. Its Bell, A. (1984). Language style as audience design. Language in Soci-
flexibility and naturalistic game-based format help nar- ety, 13, 145-204.

Brennan, S. E., Chen, X., Dickinson, C., Neider, M., & Zelinsky, G. (2008). Coordinating cognition: The costs and benefits of shared gaze during collaborative search. Cognition, 106, 1465-1477. doi:10.1016/j.cognition.2007.05.012
Brockmole, J. R., & Franconeri, S. L. (Eds.) (2009). Binding [Special issue]. Visual Cognition, 17(1 & 2).
Burke, M., & Barnes, G. (2006). Quantitative differences in smooth pursuit and saccadic eye movements. Experimental Brain Research, 175, 596-608. doi:10.1007/s00221-006-0576-6
Campbell, C., Jr., & McRoberts, T. (2005). The simple sockets library. Retrieved February 2, 2010, from http://mysite.verizon.net/astronaut/ssl/.
Charness, N., Reingold, E. M., Pomplun, M., & Stampe, D. M. (2001). The perceptual aspect of skilled performance in chess: Evidence from eye movements. Memory & Cognition, 29, 1146-1152.
Cherubini, M., Nüssli, M.-A., & Dillenbourg, P. (2008). Deixis and gaze in collaborative work at a distance (over a shared map): A computational model to detect misunderstandings. In K.-J. Räihä & A. T. Duchowski (Eds.), ETRA 2008—Proceedings of the Eye Tracking Research and Application Symposium (pp. 173-180). New York: ACM Press. doi:10.1145/1344471.1344515
Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
Clark, H. H. (2003). Pointing and placing. In S. Kita (Ed.), Pointing: Where language, culture, and cognition meet (pp. 243-268). Mahwah, NJ: Erlbaum.
Clark, H. H., & Brennan, S. E. (1991). Grounding in communication. In L. B. Resnick, J. Levine, & S. D. Teasley (Eds.), Perspectives on socially shared cognition (pp. 127-149). Washington, DC: American Psychological Association.
Clark, H. H., & Krych, M. A. (2004). Speaking while monitoring addressees for understanding. Journal of Memory & Language, 50, 62-81. doi:10.1016/j.jml.2003.08.004
Clark, H. H., Schreuder, R., & Buttrick, S. (1983). Common ground and the understanding of demonstrative reference. Journal of Verbal Learning & Verbal Behavior, 22, 245-258.
Duchowski, A. T. (2003). Eye tracking methodology: Theory and practice. London: Springer.
Engelhardt, P. E., Bailey, K. G. D., & Ferreira, F. (2006). Do speakers and listeners observe the Gricean Maxim of Quantity? Journal of Memory & Language, 54, 554-573. doi:10.1016/j.jml.2005.12.009
Falck-Ytter, T., Gredeback, G., & von Hofsten, C. (2006). Infants predict other people’s action goals. Nature Neuroscience, 9, 878-879. doi:10.1038/nn1729
Foster, M. E., Bard, E. G., Guhe, M., Hill, R. L., Oberlander, J., & Knoll, A. (2008). The roles of haptic-ostensive referring expressions in cooperative task-based human–robot dialogue. In T. Fong, K. Dautenhahn, M. Scheutz, & Y. Demiris (Eds.), Proceedings of the 3rd ACM/IEEE International Conference on Human–Robot Interaction (pp. 295-302). New York: ACM Press. doi:10.1145/1349822.1349861
Fussell, S., & Kraut, R. (2004). Visual copresence and conversational coordination. Behavioral & Brain Sciences, 27, 196-197. doi:10.1017/S0140525X04290057
Fussell, S., Setlock, L., Yang, J., Ou, J. Z., Mauer, E., & Kramer, A. (2004). Gestures over video streams to support remote collaboration on physical tasks. Human–Computer Interaction, 19, 273-309. doi:10.1207/s15327051hci1903_3
Gesierich, B., Bruzzo, A., Ottoboni, G., & Finos, L. (2008). Human gaze behaviour during action execution and observation. Acta Psychologica, 128, 324-330. doi:10.1016/j.actpsy.2008.03.006
Gigerenzer, G., Todd, P. M., & ABC Research Group (1999). Simple heuristics that make us smart. Oxford: Oxford University Press.
Grant, E. R., & Spivey, M. J. (2003). Eye movements and problem solving: Guiding attention guides thought. Psychological Science, 14, 462-466. doi:10.1111/1467-9280.02454
Griffin, Z. M. (2004). Why look? Reasons for eye movements related to language production. In J. M. Henderson & F. Ferreira (Eds.), The integration of language, vision, and action: Eye movements and the visual world. New York: Psychology Press.
Griffin, Z. M., & Oppenheimer, D. M. (2006). Speakers gaze at objects while preparing intentionally inaccurate labels for them. Journal of Experimental Psychology: Learning, Memory, & Cognition, 32, 943-948. doi:10.1037/0278-7393.32.4.943
Hadelich, K., & Crocker, M. W. (2006, March). Gaze alignment of interlocutors in conversational dialogues. Paper presented at the 19th Annual CUNY Conference on Human Sentence Processing, New York.
Hanna, J., & Tanenhaus, M. K. (2004). Pragmatic effects on reference resolution in a collaborative task: Evidence from eye movements. Cognitive Science, 28, 105-115. doi:10.1207/s15516709cog2801_5
Henderson, J. M., & Ferreira, F. (Eds.) (2004). The interface of language, vision, and action: Eye movements and the visual worlds. New York: Psychology Press.
Horton, W. S., & Keysar, B. (1996). When do speakers take into account common ground? Cognition, 59, 91-117. doi:10.1016/0010-0277(96)81418-1
International Computer Science Institute (n.d.). Extensions to Transcriber for Meeting Recorder Transcription. Retrieved from www.icsi.berkeley.edu/Speech/mr/channeltrans.html; last accessed February 2, 2010.
Kraut, R., Fussell, S., & Siegel, J. (2003). Visual information as a conversational resource in collaborative physical tasks. Human–Computer Interaction, 18, 13-49. doi:10.1207/S15327051HCI1812_2
Kraut, R., Gergle, D., & Fussell, S. (2002, November). The use of visual information in shared visual spaces: Informing the development of virtual co-presence. Paper presented at the ACM Conference on Computer Supported Cooperative Work, New Orleans, LA.
Land, M. F. (2006). Eye movements and the control of actions in everyday life. Progress in Retinal & Eye Research, 25, 296-324. doi:10.1016/j.preteyeres.2006.01.002
Land, M. F. (2007). Fixation strategies during active behavior: A brief history. In R. P. G. van Gompel, M. H. Fischer, W. S. Murray, & R. L. Hill (Eds.), Eye movements: A window on mind and brain (pp. 75-95). Oxford: Elsevier.
Land, M. F., & Furneaux, S. (1997). The knowledge base of the oculomotor system. Philosophical Transactions of the Royal Society B, 352, 1231-1239.
Language Technology Group (n.d.). NITE XML Toolkit Homepages. Retrieved from http://groups.inf.ed.ac.uk/nxt/; last accessed February 2, 2010.
Lindström, A. (1999). SGE: SDL Graphics Extension. Retrieved February 2, 2010, from www.digitalfanatics.org/cal/sge/index.html.
Lobmaier, J. S., Fischer, M. H., & Schwaninger, A. (2006). Objects capture perceived gaze direction. Experimental Psychology, 53, 117-122. doi:10.1027/1618-3169.53.2.117
Lockridge, C. B., & Brennan, S. E. (2002). Addressees’ needs influence speakers’ early syntactic choices. Psychonomic Bulletin & Review, 9, 550-557.
Luck, S. J., & Beach, N. J. (1998). Visual attention and the binding problem: A neurophysiological perspective. In R. D. Wright (Ed.), Visual attention (pp. 455-478). Oxford: Oxford University Press.
Max Planck Institute for Psycholinguistics (n.d.). Language Archiving Technology: ELAN. Retrieved from www.lat-mpi.eu/tools/elan/; last accessed February 2, 2010.
Meyer, A. S., & Dobel, C. (2003). Application of eye tracking in speech production research. In J. Hyönä, R. Radach, & H. Deubel (Eds.), The mind’s eye: Cognitive and applied aspects of eye movement research (pp. 253-272). Amsterdam: Elsevier.
Meyer, A. S., van der Meulen, F., & Brooks, A. (2004). Eye movements during speech planning: Speaking about present and remembered objects. Visual Cognition, 11, 553-576. doi:10.1080/13506280344000248
Monk, A., & Gale, C. (2002). A look is worth a thousand words: Full gaze awareness in video-mediated conversation. Discourse Processes, 33, 257-278. doi:10.1207/S15326950DP3303_4
Murray, N., & Roberts, D. (2006, October). Comparison of head gaze and head and eye gaze within an immersive environment. Paper presented at the 10th IEEE International Symposium on Distributed Simulation and Real-Time Applications, Los Alamitos, CA. doi:10.1109/DS-RT.2006.13
Pickering, M., & Garrod, S. (2004). Towards a mechanistic psychology of dialogue. Behavioral & Brain Sciences, 27, 169-190. doi:10.1017/S0140525X04000056
Pylyshyn, Z. W. (2006). Some puzzling findings in multiple object tracking (MOT): II. Inhibition of moving nontargets. Visual Cognition, 14, 175-198. doi:10.1080/13506280544000200
Pylyshyn, Z. W., & Annan, V. (2006). Dynamics of target selection in multiple object tracking (MOT). Spatial Vision, 19, 485-504. doi:10.1163/156856806779194017
Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.
Richardson, D., & Dale, R. (2005). Looking to understand: The coupling between speakers’ and listeners’ eye movements and its relationship to discourse comprehension. Cognitive Science, 29, 1045-1060. doi:10.1207/s15516709cog0000_29
Richardson, D., Dale, R., & Kirkham, N. (2007). The art of conversation is coordination: Common ground and the coupling of eye movements during dialogue. Psychological Science, 18, 407-413. doi:10.1111/j.1467-9280.2007.01914.x
Richardson, D., Dale, R., & Tomlinson, J. (2009). Conversation, gaze coordination, and beliefs about visual context. Cognitive Science, 33, 1468-1482. doi:10.1111/j.1551-6709.2009.01057.x
Simple DirectMedia Layer Project (n.d.). SDL: Simple DirectMedia Layer. Retrieved February 2, 2010, from www.libsdl.org/.
Spivey, M., & Geng, J. (2001). Oculomotor mechanisms activated by imagery and memory: Eye movements to absent objects. Psychological Research, 65, 235-241. doi:10.1007/s004260100059
Spivey, M., Tanenhaus, M. K., Eberhard, K. M., & Sedivy, J. C. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45, 447-481. doi:10.1016/S0010-0285(02)00503-0
SR Research (n.d.). Complete eyetracking solutions. Retrieved February 2, 2010, from www.sr-research.com/EL_II.html.
Stein, R., & Brennan, S. E. (2004, October). Another person’s eye gaze as a cue in solving programming problems. Paper presented at ICMI ’04: 6th International Conference on Multimodal Interfaces, State College, PA.
Steptoe, W., Wolff, R., Murgia, A., Guimaraes, E., Rae, J., Sharkey, P., et al. (2008, November). Eye-tracking for avatar eye-gaze and interactional analysis in immersive collaborative virtual environments. Paper presented at the ACM Conference on Computer Supported Cooperative Work, San Diego, CA.
TechSmith (n.d.). Camtasia Studio. Retrieved February 2, 2010, from www.techsmith.com/camtasia.asp.
Trueswell, J. C., & Tanenhaus, M. K. (Eds.) (2005). Approaches to studying world-situated language use: Bridging the language-as-product and language-as-action traditions. Cambridge, MA: MIT Press.
Underwood, G. (Ed.) (2005). Cognitive processes in eye guidance. Oxford: Oxford University Press.
Underwood, J. (2005). Novice and expert performance with a dynamic control task: Scanpaths during a computer game. In G. Underwood (Ed.), Cognitive processes in eye guidance (pp. 303-323). Oxford: Oxford University Press.
Van Gompel, R. P. G., Fischer, M. H., Murray, W. S., & Hill, R. L. (Eds.) (2007). Eye movements: A window on mind and brain. Oxford: Elsevier.
Velichkovsky, B. M. (1995). Communicating attention: Gaze position transfer in cooperative problem solving. Pragmatics & Cognition, 3, 199-222.
Vertegaal, R., & Ding, Y. (2002, November). Explaining effects of eye gaze on mediated group conversations: Amount or synchronization? Paper presented at the ACM Conference on Computer Supported Cooperative Work, New Orleans, LA.
Wolfe, J. M., Place, S. S., & Horowitz, T. S. (2007). Multiple object juggling: Changing what is tracked during extended multiple object tracking. Psychonomic Bulletin & Review, 14, 344-349.
World Wide Web Consortium (1999). XSL Transformations (XSLT) Version 1.0: W3C Recommendation, 16 November. Retrieved from www.w3.org/TR/xslt; last accessed February 2, 2010.

Appendix

Figure A1. Screen shot of NXT’s discourse entity coding tool in use to code referring expressions for the example experiment.

(Manuscript received June 12, 2009; revision accepted for publication September 30, 2009.)
