Social Signal Processing: Understanding Social Interactions Through Nonverbal Behavior Analysis
Figure 2. Machine analysis of social signals and behaviors: a general scheme. The process includes two main stages. Preprocessing takes as input the recordings of social interaction and gives as output multimodal behavioral streams associated with each person. Social interaction analysis maps the multimodal behavioral streams into social signals and social behaviors.
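The two-stage scheme described in the caption can be sketched as a pair of interfaces. This is a purely illustrative sketch: the container type, the cue names, and the trivial "dominant speaker" rule below are assumptions made for the example, not part of any surveyed system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class BehavioralStream:
    """Multimodal behavioral cues extracted for one detected person."""
    person: str
    cues: Dict[str, List[float]] = field(default_factory=dict)

def preprocess(recording: List[Tuple[str, str, float]]) -> List[BehavioralStream]:
    """Stage 1 (preprocessing): person detection plus cue extraction.
    Here the 'recording' is a toy list of (person, cue_name, value) samples."""
    streams: Dict[str, BehavioralStream] = {}
    for person, cue, value in recording:
        stream = streams.setdefault(person, BehavioralStream(person))
        stream.cues.setdefault(cue, []).append(value)
    return list(streams.values())

def analyze(streams: List[BehavioralStream]) -> Dict[str, str]:
    """Stage 2 (social interaction analysis): map behavioral streams to
    social signals. Toy rule: the person with the most speaking-energy
    samples is labeled 'dominant'."""
    most_active = max(streams, key=lambda s: len(s.cues.get("energy", [])))
    return {most_active.person: "dominant"}
```

Real systems replace both stubs with the detection, cue-extraction, and inference techniques surveyed below; only the two-stage decomposition is taken from the figure.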
analysing social behavior.

The last code relates to space and environment, i.e. the way people share and organize the space at their disposal. Human sciences have investigated this code, showing in particular that people tend to organize the space around them in concentric zones accounting for the different relationships they have with others [29]. For example, Figure 1 shows an example of individuals sharing the intimate zone, the concentric area closest to each individual. Technology has started to study the use of space only recently, and so far only for tracking and surveillance purposes.

3. State-of-the-art

Figure 2 shows the main technological components (and their interrelationships) of a general SSP system. The scheme does not correspond to any approach in particular, but most SSP works presented in the literature follow, at least partially, the processing chain in the picture (see Section 5).

The first, and crucial, step is data capture. The most commonly used capture devices are microphones and cameras (with arrangements that range from a simple laptop webcam to a fully equipped smart meeting room [36][70]), but the literature reports the use of wearable devices [20] and pressure sensors [41] (for recognizing the posture of sitting people) as well.

In most cases, the raw data involve recordings of different persons (e.g., the recording of a conversation where different voices can be heard at different moments in time). Thus, a person detection step is necessary to know which part of the data corresponds to which person (e.g., who talks when in the recording of a conversation). This is typically performed with speaker diarization [61], face detection [73], or any other technique that allows one to identify intervals of time or scene regions corresponding to specific individuals.

Person detection is the step preliminary to behavioral cue extraction, i.e. the detection of the nonverbal signals displayed by each individual. Some approaches for this stage have been mentioned in Section 2; extensive overviews are available in [68][69].

The two main challenges in social behavior understanding are the modeling of temporal dynamics and the fusion of data extracted from different modalities at different time scales.

Temporal dynamics of social behavioral cues (i.e., their timing, co-occurrence, speed, etc.) are crucial for the interpretation of observed social behavior [3][21]. However, relatively few approaches explicitly take into account the temporal evolution of behavioral cues to understand social behavior. Some of them aim at the analysis of facial expressions involving sequences of Action Units (i.e., atomic facial gestures) [60], as well as coordinated movements of head and shoulders [63]. Others model the evolution of collective actions in meetings using Dynamic Bayesian Networks [17] or Hidden Markov Models [37].

To address the second challenge outlined above (temporal, multimodal data fusion), a number of model-level fusion methods have been proposed that aim at exploiting the correlation between audio and visual data streams and relax the requirement of synchronization between these streams (see [76] for a survey). However, how to model multimodal fusion on multiple time scales, and how to model temporal correlations within and between different modalities, is
largely unexplored.

Context understanding is desirable because no correct interpretation of human behavioral cues in social interactions is possible without taking the context into account, namely where the interactions take place, what the activity of the involved individuals is, when the interactions take place, and who is involved in the interaction. Note, however, that while W4 (where, what, when, who) deals only with the apparent, perceptual aspect of the context in which the observed human behavior is shown, human behavior understanding is about W5+ (where, what, when, who, why, how), where the why and how are directly related to recognizing communicative intention, including the social behaviors and the affective and cognitive states of the observed person [47]. Hence, SSP is about W5+.

However, since the problem of context sensing is extremely difficult to solve, especially in the general case (i.e., general-purpose W4 technology does not exist yet [47]), answering the why and how questions in a W4-context-sensitive manner when analysing human behavior is a virtually unexplored area of research.

4. An Example: the Analysis of Conflicts

This section aims at providing a concrete example of how the principles and ideas outlined in the previous sections are applied to a concrete case, i.e. the analysis of conflicts in competitive discussions. Conflicts have been extensively investigated in the human sciences. The reason is that they significantly influence the outcome of groups expected to reach predefined targets (e.g., deadlines) or to satisfy members' needs (e.g., in families) [35].

This section focuses on political debates because these are typically built around the conflict between two fronts (including one or more persons each) that defend opposite views or compete for a reward (e.g., the attribution of an important political position) that cannot be shared by the two parties. The corpus used for the experiments includes 45 debates (roughly 30 hours of material) revolving around yes/no questions like "are you favorable to new laws on environment protection?". Each debate involves one moderator, two guests supporting the yes answer, and two guests supporting the no answer. The guests state their answer explicitly at the beginning of the debate, and this allows one to label them unambiguously in terms of their position.

The goal of the experiments is 1) to identify the moderator, and 2) to correctly reconstruct the two groups (yes and no) resulting from the structure outlined above. The next sections show how the different steps depicted in Figure 2 are addressed.

4.1. Nonverbal Behavior in Conflicts

Human sciences have studied conversations in depth, as these represent one of the most common forms of social interaction [53]. Following [74], conversations can be thought of as markets where people compete for the floor (the right to speak):

    [...] the most widely used analytic approach is based on an analogy with the workings of the market economy. In this market there is a scarce commodity called the floor which can be defined as the right to speak. Having control of this scarce commodity is called a turn. In any situation where control is not fixed in advance, anyone can attempt to get control. This is called turn-taking.

This suggests that turn-taking is a key to understanding conversational dynamics.

In the specific case of conflicts, social psychologists have observed that people tend to react to someone they disagree with rather than to someone they agree with [53][74]. Thus, the social signal conveyed as a direct reaction is likely to be disagreement, and the corresponding nonverbal behavioral cue is adjacency in speakers' turns. This social psychology finding determines the design of the conflict analysis approach described in the rest of this section.

4.2. Data Capture and Person Detection

The previous section suggests that turn-taking is the key to understanding conversational dynamics in conflicts. The data at disposal are television political debates, and the turn-taking can be extracted from the audio channel using a speaker diarization approach (see [61] for an extensive survey on diarization). The diarization approach used in this work is the one proposed in [1]. The audio channel of the political debates is converted into a sequence S:

    S = {(s_1, t_1, Δt_1), ..., (s_N, t_N, Δt_N)},   (1)

where each triple accounts for a turn and includes a speaker label s_i ∈ A = {a_1, ..., a_G} identifying the person speaking during the turn, the starting time t_i of the turn, and the duration Δt_i of the turn (see Figure 3). Thus, the sequence S contains the entire information about the turn-taking, namely who talks when and how much. The purity (see [67] for a definition) of the resulting speaker segmentation is 0.92, meaning that the groundtruth speaker segmentation is mostly preserved.

The diarization can be considered a form of person detection because it identifies the parts of the data that correspond to each person. In the case of this work, this allows for the identification of the speaker adjacencies representing the target cue based on which agreement and disagreement between debate participants will be detected.
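The sequence S of Equation (1) is straightforward to manipulate once materialized as a list of (speaker, start, duration) triples. The sketch below is illustrative (the paper does not prescribe a data structure); it extracts the two quantities discussed above: who talks how much, and the speaker adjacencies that serve as the disagreement cue.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Turn:
    speaker: str     # label s_i in A = {a_1, ..., a_G}
    start: float     # starting time t_i
    duration: float  # duration Δt_i

def speaking_time(sequence: List[Turn]) -> Dict[str, float]:
    """Who talks how much: total speaking time per speaker."""
    totals: Dict[str, float] = {}
    for turn in sequence:
        totals[turn.speaker] = totals.get(turn.speaker, 0.0) + turn.duration
    return totals

def adjacencies(sequence: List[Turn]) -> List[Tuple[str, str]]:
    """Adjacent speaker pairs (s_{i-1}, s_i): the target cue for detecting
    agreement and disagreement between debate participants."""
    return [(a.speaker, b.speaker) for a, b in zip(sequence, sequence[1:])]
```

In a real pipeline the list of Turn objects would come from the diarization output rather than being constructed by hand.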
[Figure 3 diagram: turns s_1 = a_1, s_2 = a_3, s_3 = a_1, s_4 = a_3, s_5 = a_2, s_6 = a_1, s_7 = a_2, starting at times t_1, ..., t_7 along the time axis t.]

Figure 3. Turn-taking pattern. The figure shows an example of turn-taking where three persons are assigned to different states.
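Given a turn-taking pattern like the one in Figure 3, the conflict analysis of Section 4.3 searches for the speaker-to-state mapping φ that maximizes Equation (2). With only three states and a handful of speakers, the search space Q^A is small enough to enumerate exhaustively; the sketch below does exactly that. The probability values used in practice are learned from the debate corpus; any numbers plugged into this sketch are purely illustrative.

```python
from itertools import product
from typing import Dict, List, Tuple

STATES = ["T1", "T2", "M"]  # the two groups and the moderator

def likelihood(turns: List[str], mapping: Dict[str, str],
               p_init: Dict[str, float],
               p_trans: Dict[Tuple[str, str], float]) -> float:
    """p(phi(s_1)) * prod_{n=2}^{N} p(phi(s_n) | phi(s_{n-1})), as in Eq. (2)."""
    states = [mapping[s] for s in turns]
    p = p_init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= p_trans[(prev, cur)]
    return p

def best_mapping(turns: List[str], speakers: List[str],
                 p_init: Dict[str, float],
                 p_trans: Dict[Tuple[str, str], float]) -> Tuple[Dict[str, str], float]:
    """Brute-force search over all |Q|^|A| mappings phi: A -> Q.
    Note the symmetry discussed in Section 4.3: swapping the T1 and T2
    labels leaves the likelihood unchanged, so the maximizer returned
    here is one of (at least) two equivalent solutions."""
    best: Dict[str, str] = {}
    best_p = -1.0
    for assignment in product(STATES, repeat=len(speakers)):
        mapping = dict(zip(speakers, assignment))
        p = likelihood(turns, mapping, p_init, p_trans)
        if p > best_p:
            best, best_p = mapping, p
    return best, best_p
```

For instance, with illustrative transition probabilities that favor alternation between T1 and T2 (people react to those they disagree with) over self-transitions, the pattern of Figure 3 (a1, a3, a1, a3, a2, a1, a2) is resolved by placing a1 and a3 in opposite groups and a2 in state M, up to the T1/T2 symmetry.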
4.3. Social Signal Understanding

The suggestion that people tend to react to someone they disagree with rather than to someone they agree with can be expressed, in mathematical terms, by saying that speaker s_i is statistically dependent on speaker s_{i-1} (see Figure 3). Statistical dependence between sequence elements that follow one another can be modeled using a Markov chain where the set Q of states contains three elements, namely T_1 (the first group), T_2 (the second group), and M (the moderator).

If φ : A → Q is a mapping that associates a speaker s_i ∈ A with a state q_j ∈ Q, then the conflict analysis problem can be thought of as finding the mapping φ* satisfying the following expression:

    φ* = arg max_{φ ∈ Q^A} p(φ(s_1)) ∏_{n=2}^{N} p(φ(s_n) | φ(s_{n-1})),   (2)

where N is the number of turns in the turn-taking, p(φ(s_1)) is the probability of starting with state q_1 = φ(s_1), and p(φ(s_n) | φ(s_{n-1})) is the probability of a transition from state q_{n-1} = φ(s_{n-1}) to state q_n = φ(s_n).

The expression maximized in Equation (2) has the same value if all the speakers assigned state T_1 are switched to state T_2 and vice versa. In other words, the model is symmetric with respect to an exchange between T_1 and T_2. The reason is that T_1 and T_2 are simply meant to distinguish between members of different groups.

The Markov model is trained using a leave-one-out approach: all available debates but one are used as the training set, while the left-out one is used as the test set. The experiment is iterated so that each debate is used once as the test set. The results show that 64.5% of the debates are correctly reconstructed, i.e., the moderator is correctly identified and the two supporters of the same answer are assigned the same state. This figure goes up to 75% when using the groundtruth speaker segmentation (rather than the speaker segmentation automatically extracted from the audio data). The average performance of an algorithm assigning the states randomly is 6.5%, which means that the above model, even if rather simple, still performs ten times better than chance.

5. Main SSP Applications

The first extensive surveys of SSP applications have been proposed in [68][69], after the expression Social Signal Processing was introduced in [51] to denote several pioneering works published by Alex Pentland and his group at MIT.

The earliest SSP works focused on vocal behavior, with the goal of predicting (with an accuracy higher than 70%) the outcome of dyadic interactions such as salary negotiations, hiring interviews, and speed-dating conversations [14]. One of the most important contributions of these works is the definition of a coherent framework for the analysis of vocal behavior [48][49], where a set of cues accounts for activity (the total amount of energy in the speech signals), influence (the statistical influence of one person on the speaking patterns of the others), consistency (the stability of the speaking patterns of each person), and mimicry (the imitation between people involved in the interactions). Recent approaches for the analysis of dyadic interactions include the visual analysis of movements for the detection of interactional synchrony [38][39].

Other approaches, developed in the same period as the above works, have aimed at the analysis of small-group interactions [35], with particular emphasis on meetings and broadcast data (talk shows, news, etc.). Most of these works have focused on the recognition of collective actions [17][37], dominance detection [31][55], and role recognition [7][19][23][34][75]. The approaches proposed in these works are often multimodal [17][19][31][37][55][75], and the behavioral cues most commonly extracted correspond to speaking energy and amount of movement. In many cases, the approaches are based only on audio, with features that account for turn-taking patterns (when and how much each person talks) [7][34], or for combinations of social networks and lexical features [23].

Social network analysis has been applied as well
in [65][66][71] to recognize the roles played by people in broadcast data (movies, radio and television programs, etc.), and in an application domain known as reality mining, where large groups of individuals equipped with smart badges or special cellular phones are recorded in terms of proximity and vocal interactions and then represented as a social network [20][50].

The reaction of users to social signals exhibited by computers has been investigated in several works showing that people tend to behave with machines as they behave with other humans. The effectiveness of computers as social actors, i.e., entities involved in the same kind of interactions as humans, has been explored in [42][43][44], where computers have been shown to be attributed a personality and to elicit the same reactions as those elicited by persons. Similar effects have been shown in [13][45], where children interacting with computers modified their voice to match the speaking characteristics of the animated personas of the computer interface, showing adaptation patterns typical of human-human interaction [9]. Further evidence of the same phenomenon is available in [5][6], where the interaction between humans and computers is shown to include the chameleon effect [11], i.e. the mutual imitation of individuals due to reciprocal appreciation or to the influence of one individual on the other.

6. Conclusion

The long-term goal of SSP is to give computers social intelligence [2]. This is one of the multiple facets of human intelligence, maybe the most important one because it helps us deal successfully with the complex web of interactions we are constantly immersed in, whether this means being recognized as a leader in the workplace, being a good parent, or being a person friends like to spend time with. The first successes obtained by SSP are impressive and have attracted the praise of both the technology [26] and business [8] communities. However, there is still a long way to go before artificial social intelligence and socially-aware computing become a reality.

Several major issues need to be addressed in this direction. The first is to establish an effective collaboration between the human sciences and technology. SSP is inherently multidisciplinary: no effective analysis of social behavior is possible without taking into account the basic laws of human-human interaction that psychologists have been studying for decades. Thus, technology should take into account the findings of the human sciences, and these should formulate their knowledge in terms suitable for automatic approaches. The second issue is the development of approaches dealing with multiple behavioral cues (typically extracted from several modalities), often evolving at different time scales while still forming a coherent social signal. This is necessary because single cues are intrinsically ambiguous: sometimes they actually convey social meaning, but sometimes they simply respond to contingent factors (e.g., postures can communicate a relational attitude, but can also be determined by the search for comfort). Finally, an important issue is the use of real-world data in the experiments. This will lead to more realistic assessments of technology effectiveness and will link research to potential application scenarios.

The strategic importance of the domain is confirmed by several large projects funded at both national and international level around the world. In particular, the European Network of Excellence SSPNet (2009-2014) aims not only at addressing the issues outlined above, but also at fostering research in SSP through the diffusion of knowledge, data, and automatic tools via its web portal (www.sspnet.eu). In this sense, the portal is expected to be not only a site delivering information, but also an instrument allowing any interested researcher to enter the domain with as limited an initial effort as possible.

References

[1] J. Ajmera, I. McCowan, and H. Bourlard. Speech/music segmentation using entropy and dynamism features in a HMM classification framework. Speech Communication, 40(3):351–363, 2003.
[2] K. Albrecht. Social Intelligence: The New Science of Success. John Wiley & Sons Ltd, 2005.
[3] N. Ambady, F. Bernieri, and J. Richeson. Towards a histology of social behavior: judgmental accuracy from thin slices of behavior. In M. Zanna, editor, Advances in Experimental Social Psychology, pages 201–272. 2000.
[4] M. Argyle. The Psychology of Interpersonal Behaviour. Penguin, 1967.
[5] J. Bailenson and N. Yee. Virtual interpersonal touch and digital chameleons. Journal of Nonverbal Behavior, 31(4):225–242, 2007.
[6] J. Bailenson, N. Yee, K. Patel, and A. Beall. Detecting digital chameleons. Computers in Human Behavior, 24(1):66–87, 2008.
[7] S. Banerjee and A. Rudnicky. Using simple speech-based features to detect the state of a meeting and the roles of the meeting participants. In Proceedings of the International Conference on Spoken Language Processing, pages 2189–2192, 2004.
[8] M. Buchanan. The science of subtle signals. Strategy+Business, 48:68–77, 2007.
[9] J. Burgoon, L. Stern, and L. Dillman. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge University Press, 1995.
[10] J. Cassell. Embodied conversational interface agents. Communications of the ACM, 43(4):70–78, 2000.
[11] T. Chartrand and J. Bargh. The chameleon effect: the perception-behavior link and social interaction. Journal of Personality and Social Psychology, 76(6):893–910, 1999.
[12] J. Cortes and F. Gatti. Physique and self-description of temperament. Journal of Consulting Psychology, 29(5):432–439, 1965.
[13] R. Coulston, S. Oviatt, and C. Darves. Amplitude convergence in children's conversational speech with animated personas. In Proceedings of the International Conference on Spoken Language Processing, pages 2689–2692, 2002.
[14] J. Curhan and A. Pentland. Thin slices of negotiation: predicting outcomes from conversational dynamics within the first 5 minutes. Journal of Applied Psychology, 92(3):802–811, 2007.
[15] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. International Journal of Computer Vision, 37(2):175–185, 2000.
[16] C. Darwin. The Expression of the Emotions in Man and Animals. J. Murray, 1872.
[17] A. Dielmann and S. Renals. Automatic meeting segmentation using dynamic Bayesian networks. IEEE Transactions on Multimedia, 9(1):25, 2007.
[18] K. Dion, E. Berscheid, and E. Walster. What is beautiful is good. Journal of Personality and Social Psychology, 24(3):285–290, 1972.
[19] W. Dong, B. Lepri, A. Cappelletti, A. Pentland, F. Pianesi, and M. Zancanaro. Using the influence model to recognize functional roles in meetings. In Proceedings of the International Conference on Multimodal Interfaces, pages 271–278, 2007.
[20] N. Eagle and A. Pentland. Reality mining: sensing complex social signals. Journal of Personal and Ubiquitous Computing, 10(4):255–268, 2006.
[21] P. Ekman and E. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, 2005.
[22] A. Elgammal. Human-centered multimedia: representations and challenges. In Proceedings of the ACM International Workshop on Human-Centered Multimedia, pages 11–18, 2006.
[23] N. Garg, S. Favre, H. Salamin, D. Hakkani-Tür, and A. Vinciarelli. Role recognition for meeting participants: an approach based on lexical information and social network analysis. In Proceedings of the ACM International Conference on Multimedia, pages 693–696, 2008.
[24] J. Gemmell, K. Toyama, C. Zitnick, T. Kang, and S. Seitz. Gaze awareness for video-conferencing: a software approach. IEEE Multimedia, 7(4):26–35, 2000.
[25] E. Goffman. The Presentation of Self in Everyday Life. Anchor Books, 1959.
[26] K. Greene. 10 emerging technologies 2008. MIT Technology Review, February 2008.
[27] H. Gunes and M. Piccardi. Assessing facial beauty through proportion analysis by image processing and supervised learning. International Journal of Human-Computer Studies, 64(12):1184–1199, 2006.
[28] H. Gunes, M. Piccardi, and M. Pantic. From the lab to the real world: affect recognition using multiple cues and modalities. In J. Or, editor, Affective Computing: Focus on Emotion Expression, Synthesis, and Recognition, pages 185–218. 2008.
[29] E. Hall. The Silent Language. Doubleday, 1959.
[30] M. Hecht, J. De Vito, and L. Guerrero. Perspectives on nonverbal communication: codes, functions and contexts. In L. Guerrero, J. De Vito, and M. Hecht, editors, The Nonverbal Communication Reader, pages 201–272. 2000.
[31] D. Jayagopi, H. Hung, C. Yeo, and D. Gatica-Perez. Modeling dominance in group conversations using non-verbal activity cues. IEEE Transactions on Audio, Speech and Language: Special Issue on Multimedia, to appear, 2009.
[32] D. Keltner and P. Ekman. Facial expression of emotion. In M. Lewis and J. Haviland-Jones, editors, Handbook of Emotions, pages 236–249. 2000.
[33] M. Knapp and J. Hall. Nonverbal Communication in Human Interaction. Harcourt Brace College Publishers, 1972.
[34] K. Laskowski, M. Ostendorf, and T. Schultz. Modeling vocal interaction for text-independent participant characterization in multi-party conversation. In Proceedings of the SIGdial Workshop on Discourse and Dialogue, pages 148–155, 2008.
[35] J. Levine and R. Moreland. Small groups. In D. Gilbert and G. Lindzey, editors, The Handbook of Social Psychology, volume 2, pages 415–469. Oxford University Press, 1998.
[36] I. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D. Moore, P. Wellner, and H. Bourlard. Modeling human interaction in meetings. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 748–751, 2003.
[37] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):305–317, 2005.
[38] L. Morency, I. de Kok, and J. Gratch. Context-based recognition during human interactions: automatic feature selection and encoding dictionary. In Proceedings of the International Conference on Multimodal Interfaces, pages 181–188, 2008.
[39] L. Morency, I. de Kok, and J. Gratch. Predicting listener backchannels: a probabilistic multimodal approach. In Lecture Notes in Computer Science, volume 5208, pages 176–190. Springer, 2008.
[40] N. Morgan, E. Fosler, and N. Mirghafori. Speech recognition using on-line estimation of speaking rate. In Proceedings of Eurospeech, pages 2079–2082, 1997.
[41] S. Mota and R. Picard. Automated posture analysis for detecting learners interest level. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 49–56, 2003.
[42] C. Nass and S. Brave. Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. The MIT Press, 2005.
[43] C. Nass and K. Lee. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction. Journal of Experimental Psychology: Applied, 7(3):171–181, 2001.
[44] C. Nass and J. Steuer. Computers and social actors. Human Communication Research, 19(4):504–527, 1993.
[45] S. Oviatt, C. Darves, and R. Coulston. Toward adaptive conversational interfaces: modeling speech convergence with animated personas. ACM Transactions on Computer-Human Interaction, 11(3):300–328, 2004.
[46] M. Pantic, A. Pentland, A. Nijholt, and T. Huang. Human computing and machine understanding of human behavior: a survey. In Lecture Notes in Artificial Intelligence, volume 4451, pages 47–71. Springer Verlag, 2007.
[47] M. Pantic, A. Pentland, A. Nijholt, and T. Huang. Human-centred intelligent human-computer interaction (HCI²): how far are we from attaining it? International Journal of Autonomous and Adaptive Communications Systems, 1(2):168–187, 2008.
[48] A. Pentland. Social dynamics: signals and behavior. In International Conference on Developmental Learning, 2004.
[49] A. Pentland. Socially aware computation and communication. IEEE Computer, 38(3):33–40, 2005.
[50] A. Pentland. Automatic mapping and modeling of human networks. Physica A, 378:59–67, 2007.
[51] A. Pentland. Social Signal Processing. IEEE Signal Processing Magazine, 24(4):108–111, 2007.
[52] S. Petridis and M. Pantic. Audiovisual laughter detection based on temporal features. In Proceedings of the International Conference on Multimodal Interfaces, pages 37–44, 2008.
[53] G. Psathas. Conversation Analysis: The Study of Talk-in-Interaction. Sage Publications, 1995.
[54] V. Richmond and J. McCroskey. Nonverbal Behaviors in Interpersonal Relations. Allyn and Bacon, 1995.
[55] R. Rienks, D. Zhang, and D. Gatica-Perez. Detection and application of influence rankings in small group meetings. In Proceedings of the International Conference on Multimodal Interfaces, pages 257–264, 2006.
[56] K. Scherer. Personality markers in speech. Cambridge University Press, 1979.
[57] K. Scherer. Vocal communication of emotion: a review of research paradigms. Speech Communication, 40(1-2):227–256, 2003.
[58] E. Shriberg. Phonetic consequences of speech disfluency. Proceedings of the International Congress of Phonetic Sciences, 1:619–622, 1999.
[59] L. Smith-Lovin and C. Brody. Interruptions in group discussions: the effects of gender and group composition. American Sociological Review, 54(3):424–435, 1989.
[60] Y. Tong, W. Liao, and Q. Ji. Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1683–1699, 2007.
[61] S. Tranter and D. Reynolds. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1557–1565, 2006.
[62] K. Truong and D. Leeuwen. Automatic detection of laughter. In Proceedings of the European Conference on Speech Communication and Technology, pages 485–488, 2005.
[63] M. Valstar, H. Gunes, and M. Pantic. How to distinguish posed from spontaneous smiles using geometric features. In Proceedings of the International Conference on Multimodal Interfaces, pages 38–45, 2007.
[64] R. Vertegaal, R. Slagter, G. van der Veer, and A. Nijholt. Eye gaze patterns in conversations: there is more to conversational agents than meets the eyes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 301–308, 2001.
[65] A. Vinciarelli. Sociometry based multiparty audio recording segmentation. In Proceedings of the IEEE International Conference on Multimedia and Expo, pages 1801–1804, 2006.
[66] A. Vinciarelli. Speakers role recognition in multiparty audio recordings using social network analysis and duration distribution modeling. IEEE Transactions on Multimedia, 9(9):1215–1226, 2007.
[67] A. Vinciarelli and S. Favre. Broadcast news story segmentation using social network analysis and hidden Markov models. In Proceedings of the ACM International Conference on Multimedia, pages 261–264, 2007.
[68] A. Vinciarelli, M. Pantic, and H. Bourlard. Social Signal Processing: survey of an emerging domain. Image and Vision Computing, to appear, 2009.
[69] A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland. Social Signal Processing: state-of-the-art and future perspectives of an emerging domain. In Proceedings of the ACM International Conference on Multimedia, pages 1061–1070, 2008.
[70] A. Waibel, T. Schultz, M. Bett, M. Denecke, R. Malkin, I. Rogina, and R. Stiefelhagen. SMaRT: the Smart Meeting Room task at ISL. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 752–755, 2003.
[71] C. Weng, W. Chu, and J. Wu. RoleNet: movie analysis from the perspective of social networks. IEEE Transactions on Multimedia, 11(2):256–271, 2009.
[72] Y. Wu and T. Huang. Vision-based gesture recognition: a review. In A. Braffort, R. Gherbi, S. Gibet, J. Richardson, and D. Teil, editors, Gesture-Based Communication in Human-Computer Interaction, pages 103–114. 1999.
[73] M. Yang, D. Kriegman, and N. Ahuja. Detecting faces in images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, 2002.
[74] G. Yule. Pragmatics. Oxford University Press, 1996.
[75] M. Zancanaro, B. Lepri, and F. Pianesi. Automatic detection of group functional roles in face to face interactions. In Proceedings of the International Conference on Multimodal Interfaces, pages 28–34, 2006.
[76] Z. Zeng, M. Pantic, G. Roisman, and T. Huang. A survey of affect recognition methods: audio, visual and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39–58, 2009.