Automatically predicting mood from expressed emotions

DISSERTATION

to obtain the degree of doctor at Delft University of Technology, by authority of the Rector Magnificus prof.ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on 23 March 2016 at 12:30

by

Christina KATSIMEROU

Master of Science in Electrical and Computer Engineering, Aristotle University of Thessaloniki, born in Thessaloniki, Greece.

This dissertation has been approved by the
promotor: Prof. dr. I.E.J. Heynderickx
copromotor: Dr. J.A. Redi

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. I. Heynderickx, Eindhoven University of Technology, promotor
Dr. J.A. Redi, Delft University of Technology, copromotor

Independent members:
Prof. dr. D. Heylen, University of Twente
Prof. dr. A. Hanjalic, Delft University of Technology
Prof. dr. H. de Ridder, Delft University of Technology
Prof. dr. T. Hossfeld, University of Duisburg-Essen
Dr. M. Soleymani, University of Geneva

This research project was supported by Philips Research and Delft University of Technology.

Copyright © 2016, Christina Katsimerou. All rights reserved.
ISBN: 978-94-6186-615-8

An electronic version of this dissertation is available at http://repository.tudelft.nl/


Contents

CHAPTER 1 INTRODUCTION ... 1

1 MOTIVATION ... 2

2 PROBLEM DEFINITION ... 5

3 AFFECTIVE TERMS: SIMILARITIES AND DIFFERENCES ... 6

4 REPRESENTATION OF EMOTIONS AND MOOD ... 7

5 RESEARCH OBJECTIVES OF THE THESIS ... 9

6 THESIS OUTLINE ... 12

REFERENCES ... 16

CHAPTER 2 PREDICTING MOOD FROM EMOTION ANNOTATIONS ON VIDEOS ... 19

1 INTRODUCTION ... 20

2 RELATED WORK ... 22

2.1 MOOD AND EMOTIONS IN PSYCHOLOGY ... 23

2.2 PREDICTION OF MOOD FROM AUDIO-VISUAL SIGNALS ... 24

3 PROBLEM SETUP AND METHODOLOGY ... 25

3.1 DOMAINS OF EMOTION AND MOOD ... 25

3.2 PROBLEM FORMULATION ... 26

4 DATA ... 27

4.1 DATABASES ... 27

4.2 PRE-PROCESSING OF THE DATA ... 29

4.3 AVERAGE-CODER ANNOTATION ... 30

5 EXPERIMENT 1: PRELIMINARY ANALYSIS ... 32

5.1 DETECTION OF CO-EXISTING AND SHIFTED MOODS ... 32

5.2 MOOD PREDICTION ... 34

5.3 RESULTS AND DISCUSSION ... 34

6 EXPERIMENT 2: MOOD PREDICTION MODEL ... 37

6.1 MODELLING MOOD FROM A SEQUENCE OF PUNCTUAL EMOTIONS ... 38

6.2 RESULTS AND DISCUSSION ... 40

7 CONCLUSIONS AND OUTLOOK ... 46

CHAPTER 3 CROWDSOURCING EMPATHETIC INTELLIGENCE: THE CASE OF THE ANNOTATION OF EMMA DATABASE FOR EMOTION AND MOOD RECOGNITION ... 51

1 INTRODUCTION ... 52

2 RELATED WORK ... 55

2.1 CHARACTERISTICS OF AUDIO VISUAL DATABASES FOR AFFECT RECOGNITION ... 55

2.2 AFFECTIVE ANNOTATIONS BY EXPERTS AND IN THE CROWD ... 58

3 CREATION OF THE DATABASE MATERIAL ... 61

3.1 DEFINITION OF THE SITUATIONAL CONTEXT ... 61

3.2 MOOD INDUCTION ... 62

3.3 RECORDING SETUP ... 62

3.4 RECORDED DATA ... 63

4 ANNOTATION ... 64

4.1 ANNOTATION TOOLS ... 65

4.2 ANNOTATION PROTOCOL ... 66

4.3 PROCESS ... 66

5 DATA ANALYSIS AND FILTERING OF UNRELIABLE ANNOTATIONS ... 69

5.1 DATA FILTERING ... 69

5.2 WORKERS AGREEMENT ANALYSIS ... 73

6 VALIDATION OF THE CROWDSOURCED ANNOTATIONS VIA A CONTROLLED EXPERIMENT ... 75

6.1 EXPERIMENTAL SETUP ... 75

6.2 COMPARISON BETWEEN LAB AND CROWDSOURCING ANNOTATIONS ... 76

7 DISCUSSION ... 78

8 CONCLUSIONS AND OUTLOOK ... 81

REFERENCES ... 82

APPENDIX ... 87

CHAPTER 4 EXPLORING A COMPUTATIONAL MODEL OF MOOD FROM VISUALLY EXPRESSED EMOTIONS ... 91

1 INTRODUCTION ... 92

2 RELATED WORK ... 95

2.1 MOOD SENSING FROM VISUAL INPUT ... 95

2.2 SEQUENCE CLASSIFICATION ... 96

3 MODELLING MOOD FROM A TEMPORAL SEQUENCE OF EMOTIONS ... 99

3.1 FEATURE BASED APPROACH ... 101

3.2 HIDDEN MARKOV MODELS (HMM) ... 104

3.3 HIDDEN CONDITIONAL RANDOM FIELDS (HCRF) ... 105

4 EXPERIMENTAL VALIDATION ... 107

4.1 DATA ... 107

4.2 EXPERIMENTAL SETUP ... 109

5 RESULTS ... 110

5.1 MODEL PARAMETER SELECTION ... 111

5.2 EVALUATION ... 112

6 DISCUSSION ... 113

REFERENCES ... 120

CHAPTER 5 AUTOMATIC PREDICTION OF THE VALENCE OF MOOD FROM FACIAL EXPRESSIONS ... 125

1 INTRODUCTION ... 126

2 RELATED WORK ... 128

3 MOOD AS A FUNCTION OF A SEQUENCE OF EMOTIONS ... 130

4 DATA ... 131

4.1 AGREEMENT ON MOOD LABELS ... 131

5 EXPERIMENTAL SETUP ... 132

5.1 AFFECTIVE FACIAL EXPRESSION RECOGNITION ... 132

5.2 FEATURE EXTRACTION ... 134

5.3 EXTREME LEARNING MACHINES ... 135

5.4 TARGETS AND NETWORK ARCHITECTURE ... 136

6 RESULTS ... 137

7 DISCUSSION ... 140

8 CONCLUSIONS ... 141

REFERENCES ... 143

CHAPTER 6 CONCLUSIONS ... 147

1 CONTRIBUTIONS ... 148

2 ANSWERING THE RESEARCH QUESTIONS ... 150

3 OUTLOOK AND CHALLENGES ... 152

3.1 MOOD SHIFTS ... 152

3.2 EMOTION RECOGNITION FROM FACE ... 152

3.3 BODILY EXPRESSION OF AFFECT ... 153

3.4 CONTEXT ANALYSIS AND PERSONALIZATION ... 154

4 APPLICATIONS ... 156

REFERENCES ... 158

SUMMARY ... 161

SAMENVATTING ... 163

ABSTRACT ... 165

ACKNOWLEDGMENTS ... 167

LIST OF PUBLICATIONS ... 171

JOURNAL PUBLICATIONS ... 171

CONFERENCE PUBLICATIONS ... 171

Chapter 1

Introduction

1 MOTIVATION

Affect characterizes human nature. Affective states manifest themselves in a range of situations and ways: while interacting with others or when alone, overtly or pervasively in the background of consciousness and cognition (Russell, 2003). For this reason, it would be desirable for any machine designed to interact intelligently and seamlessly with humans to be able to sense human affective states in the same way other humans do, and to adapt its behavior accordingly.

Since the inception of Affective Computing, researchers have tried to endow machines with artificial affective intelligence, with the ultimate goal of improving human-computer interaction. This simulated empathy usually entails automatic sensing of human affective behavior, for example from visual and vocal expressions (Picard, 2000). By bringing together knowledge from psychology, neuroscience and computer science, exciting applications have been envisioned or devised: software agents that sense the user's frustration or boredom and react empathically, games that adapt their level of difficulty for a better user experience, and more intuitive communication with conversational agents (embodied or not) and robots (humanoid or not).

Besides the obvious benefits of affect-aware systems in facilitating human-machine interaction, affective computing technologies have great potential for creating systems that affectively assist the user in the absence of explicit interaction with another human. Imagine a senior feeling gloomy because s/he feels time passing by and hard times getting closer. Of course, the presence of a relative would help in making those sad feelings go away, but human support is often not available right away. What if a machine could sense the negative affective state of the user and act, empathically, towards improving it? The realm of a smart ambience, with ubiquitously located sensors and actuators, provides an excellent platform to implement this vision. Such an ambience should be able to read the user's affective needs and take action to maintain or improve his/her well-being, for example to alleviate a patient's pain or depression in a hospital, or to keep a driver calm or alert in a car. We will refer to this specific type of smart ambiences as affect-assistive systems.

A special case of an affect-assistive system, and in fact the main application context of this thesis, is an intelligent lighting system able to create a comforting ambience in the rooms of care homes for the elderly, to improve, when needed, the mood of the room's occupants. The need for such an affect-assistive system is evident when considering how often and systematically elderly people who have just relocated to a care home experience negative moods, such as distress or anxiety, because they are lonely, homesick or disoriented in the new, unfamiliar environment (Huldtgren, et al., 2015). Comforting elderly people in these negative affective states requires additional attention from the caretakers, who are usually already overloaded with work in most care homes. Providing care homes (but also private houses where elderly people live independently) with affectively smart technology able to comfort and assist users when needed would therefore be of great added value.

In order to meet this challenge, we draw inspiration from the advances in intelligent ambiences, nowadays equipped with unobtrusive, often invisible, sensor arrays and smart actuator devices able to predict and fulfil the personal needs of their occupants. Among the objects that can compose an intelligent ambience, lighting equipment is frequently preferred for its potential to improve the mood of the room's occupants (Golden, et al., 2005) (Kobayashi, et al., 2001) (Kueller, et al., 2006) (Maier & Kempter, 2010). Applying this technology to reinforce the affective independence of seniors in care centers, we envision an intelligent, fully automated system able to assess the affective state of elderly people and act adaptively to improve it, for example through colorful lighting (Huldtgren, et al., 2015). The proposed solution consists of two platforms: a) the sensing platform and b) the ambience creation (actuator) platform (see Fig. 1). This work revolves around the first one, which aims at automatically measuring the user's affective state, and in particular mood.

The key word and focus in this work is mood. We emphasize this because, as we will also underline in the following, mood and emotion tend to be conflated (Scherer, 2005) or even used interchangeably, especially since the same adjective can sometimes be attributed to either an emotion or a mood, such as “sadness” (Frijda, 1994). Although psychologists do not seem to agree on the exact definition of both terms, they agree more on their conceptual differences. Emotions are short-term, impulsive and activated by certain cognitive or appraisal factors; mood is instead longer-term, less mutable and not attributed to a specific factor (Beedie, et al., 2005) (Scherer, 2005) (Jenkins, et al., 1998). Having clarified this difference, we can point out two things: (a) when designing an affect-assistive system such as the one described above, one may want to target mood improvement, and thereby mood detection, because the system should adapt smoothly to the slowly changing mood rather than to the more erratic sequence of emotions, in order to avoid abrupt system changes that might annoy the user; and (b) automatic emotion recognition is hardly the same problem as automatic mood recognition.

So far, research in Affective Computing has mostly tackled automatic emotion recognition, based on either visual or audio signal analysis. Especially when a video is used as input, most existing algorithms predict emotions at the level of each frame, or at the level of short segments of no more than a few seconds, typically by analyzing facial expressions (see (Valstar, 2014) for a comprehensive overview) and more rarely by analyzing body language (Kleinsmith & Bianchi-Berthouze, 2013). In addition, the vast majority of applications of automatic emotion recognition encompass some sort of interaction, either in a human-human or in a human-computer context, within which people intend to display their emotions deliberately, and therefore often more intensely (Busso, et al., 2008) (McKeown, et al., 2010) (Grimm, et al., 2008).

Conversely, automatic mood recognition is intended here as the automated recognition of affective states that last for a longer time-span, are expressed more subtly and can hardly be attributed to a specific cause, as we will describe in more detail later. Because of these peculiarities of mood, we define automatic mood recognition as a new problem, different from automatic emotion recognition and still largely unexplored in the literature. To tackle this problem, we investigate the possibility of using the (automatically) recognized emotions as an information channel for predicting mood. The advantage of such a modular approach is that it enables building upon the current progress in automatic affective sensing, while focusing on solving the mapping from emotions to mood. An additional key contribution of this work is the design of a new database tailored for automatic mood analysis, with the goal of validating our proposed computational models experimentally, as well as ultimately serving as a benchmark for the progress of automatic mood recognition.

Fig. 1 Framework of adaptive ambience creation through colorful lighting for mood enhancement.

2 PROBLEM DEFINITION

The work in this thesis is framed within the context of developing a soothing ambience to counteract the negative mood of a room's occupant in the absence of companionship. The mood sensing platform is therefore developed targeting a single user. The input sensors are chosen to be unobtrusive, in the sense that they do not interfere with the user's activities. This choice is motivated by the observation that obtrusive sensors interfere with people's actions, and that seniors in particular may suffer from anxiety or discomfort when wired up, as they may associate it with medical monitoring.

Among the unobtrusive sensors, we deem visual cues to be more informative than audio ones, mainly because in a single-user context we expect speech to be absent most of the time. Of course, one may argue that there may be other sounds hinting at the mood of a person, such as sighs. However, in the particular context of a person alone in a room, ambient noises, like TV sounds, or personal noises, like coughing, are more likely to be present than the rare events of affective sounds. Regarding the visual sensors, we choose to rely on low-cost off-the-shelf cameras (for example web-cameras instead of active cameras), which would enable the easy adoption of such a technological solution by a care center with many rooms.

We envision, therefore, to use as input for the mood sensing a video stream portraying a single person in a room, and we set as target the prediction of the mood of this person, as portrayed in the video. More precisely, we target the mood of the person as displayed, and thus perceived by an empathetically equipped human observer, such as a nurse or caretaker with experience in recognizing a person's mood. As such, we distinguish the affect consciously experienced by the person him/herself (i.e., felt affect) from the affect perceived by an external observer based on the expressed 'symptoms' of the affect. While felt and perceived affect do not necessarily coincide, we opt for the latter, since especially seniors are often not aware of their own mood or cannot report it properly (Huldtgren, et al., 2015).

From the video stream we primarily extract facial expressions, since humans convey relevant affective information through their face (Ekman & Friesen, 2003). As mentioned above, psychologists and computer engineers have delved into assessing emotions inferred from facial cues or changes in facial appearance. Building on this research, in this thesis we aim at pushing the state of the art in automatic mood prediction from facial expressions one step further.


3 AFFECTIVE TERMS: SIMILARITIES AND DIFFERENCES

Emotions, mood and affect are very complex phenomena in their underlying biological mechanisms (Dalgleish, 2004) and even in their subjective experience (Scherer, 2005). Psychologists and neuroscientists argue about the origin and functional utility of moods and emotions. Reviewing the exact theories explaining the biological origin and purposes of emotions and moods is, however, beyond the scope of this work. We therefore restrict ourselves here to a short summary of the theories that are helpful for the development of a computational model for the visual manifestation of mood.

Affect is a “neurophysiological state consciously accessible”, which encompasses all human affective phenomena and lies at the core of emotions and mood (Russell, 2003). Affect runs constantly, dynamically and pervasively in the background or the foreground of consciousness (Russell, 2003) (Watson & Tellegen, 1985). It manifests itself as a certain affective state, be it emotion, mood or even temperament (Watson, 2000).

Emotion is a relatively brief and often intense affective state, triggered by a specific object (an internal or external event) (Scherer, 2005). Its duration is only fuzzily defined, spanning from a few milliseconds to a few minutes (Oatley & Jenkins, 1996), but it is usually approximated to fall between 0.5 and 4 seconds (Levenson, 1988). The terms emotional episode and emotion are often used interchangeably. An emotional episode is, according to (Russell, 2003), an instantiation of an emotion that fits a certain pattern of physiological responses, visual and vocal expressions, and actions. When used in this thesis, emotional episode refers to an often full-blown, single emotion and its subsequent reactions, captured in all its temporal phases within a video excerpt. Scenes depicting a person pulling back out of fear when viewing a fearsome stimulus, or an angry fight between two people, are typical examples of emotional episodes. Such an episode typically lasts a few seconds. In the particular case of a camera capturing the affective state of a person, we may assume that the marginal case of a snapshot or a single video frame depicts an instantaneous expression, which represents an instantaneous (punctual) emotion. The emotions expressed by a person over the course of a few minutes may then be translated into a sequence of punctual emotions. The instantaneous emotions may fluctuate and evolve continuously over time, without the need to define a clear beginning or end (temporally unsegmented).

Mood is a long-lasting affective state of low intensity, often diffuse and without apparent cause (Scherer, 2005). Its duration is again fuzzily determined by psychologists, ranging from minutes to days or weeks (Ekman, 1984). Given this range of duration, we may assume that the mood of a person remains stable for a few minutes.
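Throughout the thesis it is convenient to think of the emotions expressed over such a time window as a time-stamped sequence of punctual samples in the affective space. The following is a minimal sketch of one possible representation of this idea; the field names, units and normalization are illustrative assumptions, not the format produced by the annotation tools used in later chapters.

```python
from dataclasses import dataclass
from typing import List

# Illustrative data structure for a sequence of punctual emotions: one sample
# per analyzed video frame, described by a time stamp and a position in the
# valence-arousal space. All names and ranges here are assumptions made for
# illustration only.

@dataclass
class PunctualEmotion:
    time_s: float    # time of the video frame, in seconds
    valence: float   # pleasantness, assumed normalized to [-1, 1]
    arousal: float   # activation, assumed normalized to [-1, 1]

# A toy emotion trace sampled at 1 Hz; a real trace would span several minutes.
trace: List[PunctualEmotion] = [
    PunctualEmotion(0.0, 0.2, 0.1),
    PunctualEmotion(1.0, 0.1, 0.3),
    PunctualEmotion(2.0, -0.2, 0.4),
]
```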


Mood and emotion differ conceptually mainly in their duration, intensity, rapidity of change and cause (Lane & Terry, 2000) (Batson, et al., 1992) (Beedie, et al., 2005) (Scherer, 2001). Although distinctive between the two, the difference in duration of the two types of affective states is hardly quantified in the literature. In addition, Lane and Terry (Lane & Terry, 2000) argue that intensity is also a blurred distinctive factor, as both emotions and mood can be expressed through intense responses. Finally, the cause and/or attribution of mood is not clear, due to the mutual confluence of emotions and mood: mood can help the interpretation of emotions, and the subsequent emotions contribute to the mood (Parkinson, 1996). This lack of demarcation boundaries shows that there is a grey area when it comes to characterizing whether an affective state corresponds to an emotion or a mood. To overcome this uncertainty, in this thesis we mainly adopt the assumption that emotions are instantaneously expressed, whereas mood refers to the affective state felt by the user (or deemed to be felt by the user according to an external observer) over a prolonged time period of at least a few minutes.

Lastly, neither emotions nor mood should be confounded with "feeling", which is the subjective experience of an emotion (Scherer, 2001), or the personal assessment of one's current condition (Russell, 2003).

4 REPRESENTATION OF EMOTIONS AND MOOD

There are two main ways to represent emotions in psychology and in the related automatic affect recognition research: the categorical and the dimensional representation. A third type, the appraisal representation (Scherer, 2001), holds that emotions are generated through the continuous subjective evaluation of environmental stimuli and of the internal state of the person. Although psychologically plausible, this representation has not yet been regarded as computationally tractable in automatic affect recognition, due to difficulties in obtaining such complex measurements (Gunes, 2010).

The categorical representation has its roots in the pioneering work of Darwin (Darwin, 1998), according to which there is a small number of basic emotions, evolutionarily developed and therefore hard-wired in the human brain. Over the years, a number of theories advocating categorical emotions emerged, such as the ones proposed by (Tomkins, 1962), (Izard, 1971) and (Ortony & Turner, 1990). Ekman (Ekman, 1992) provided evidence that six basic (prototypical) emotions, namely joy, sadness, fear, disgust, surprise and anger, are universally expressed in a similar way and recognized as such, thereby supporting Darwin's theory. Ekman and Friesen later introduced a coding of the related facial muscle configurations, the Facial Action Coding System (FACS), and argued that each of these basic emotions generates a facial expression as a unique combination of facial action units (Ekman & Friesen, 2003). The six basic emotions system was notably the most influential in automatic emotion recognition, since most of the past research focused on recognizing these basic emotions (Pantic & Rothkrantz, 2000) (Zeng, et al., 2009) (Valstar, 2014).

Similarly to emotions, mood is also commonly discretized into categories, based on factor analysis of mood assessment vocabularies. Examples of mood checklists established through this approach are the Nowlis Mood Adjective Checklist (Nowlis, 1965), the Profile of Mood States (McNair, et al., 1971) and the Eight State Questionnaire (Curran & Cattell, 1976), whose basic mood adjectives overlap to a greater or lesser extent.

The main criticism of discrete representations of both emotions and mood is that they are insufficient to describe the full spectrum of emotions and moods expressed and experienced by humans, or even animals (Scherer, 2005). In addition, the discrete categories are typically tied to the particular measurement system of the experimenter who defines the categories (Thayer, 1989). The words representing each category are empirically defined and prone to linguistic inter-cultural differences. Moreover, no clear demarcation boundaries exist between the affective states clustered under each category. Therefore, this arbitrary discretization might lead to fuzzily overlapping and/or redundant categories.

The dimensional representations consider affect to be characterized along a small number of dimensions, ideally independent from each other, i.e., orthogonal in the affective space. Probably the most widely accepted dimensional model in affective computing research is Russell's circumplex model of affect (Russell, 1980), (Russell, 2003). The circumplex model defines two dimensions that explain most of the variance of affect: the first is valence, the level of pleasantness of the affective state, ranging from negative to positive; the second is arousal, the level of activation of the affective state, ranging from drowsy to frantic. Categorical emotions can be represented as regions in this continuous space. A third dimension of dominance (i.e., power) was introduced to better separate emotions that occur during interactions (Russell, 1977); for example, anger and fear are closely positioned in the valence/arousal space, but are clearly distinguished along the third dimension of dominance. Apart from the advantage of representing the whole spectrum of affect through a continuum of reduced dimensionality, dimensional models are particularly useful for computational modelling of affective states that evolve continuously in time. Moreover, the circumplex model offers a unified framework for the representation of emotions and mood. These reasons explain why we adopt the dimensional representation for both emotions and mood in the rest of this thesis.
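To make the dimensional representation concrete, the minimal sketch below discretizes a point in the valence-arousal plane into one of the four quadrants, which is also how the discrete moods in the analysis of Chapter 2 are defined. The quadrant labels and the [-1, 1] normalization are illustrative assumptions, not the exact conventions of the annotation tools used later.

```python
# Minimal sketch: discretizing a (valence, arousal) point into one of the four
# quadrant "moods" of the circumplex. Labels and the [-1, 1] range are
# assumptions made for illustration only.

def quadrant_mood(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair, each assumed in [-1, 1], to a quadrant label."""
    if valence >= 0 and arousal >= 0:
        return "positive-excited"   # e.g. joyful, elated
    if valence < 0 and arousal >= 0:
        return "negative-excited"   # e.g. anxious, angry
    if valence < 0 and arousal < 0:
        return "negative-calm"      # e.g. sad, gloomy
    return "positive-calm"          # e.g. relaxed, serene


print(quadrant_mood(0.6, 0.4))    # positive-excited
print(quadrant_mood(-0.3, -0.7))  # negative-calm
```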

5 RESEARCH OBJECTIVES OF THE THESIS

The main research objective of this thesis can be formulated as follows: to develop and validate a system that automatically assesses a person's mood from video input.

As pointed out earlier, automatic mood prediction is rather unexplored in Affective Computing. On the other hand, automatic emotion recognition, usually from short video excerpts, is a mature but still highly active field, especially when the emotions are inferred from the visually observed configuration of the facial muscles. In recent years, recognition of dimensional emotions continuously in time has also been attempted, as it is considered more adequate for video data that do not focus on a single emotional episode, but rather on emotional expressions that fluctuate over time. The problem of continuous affect recognition is far from being perfectly solved, but it is tractable and keeps improving over the years, following the advancements in face detection, face registration, facial feature extraction and machine interpretation of the facial configuration (Valstar, 2014). At the moment, the first commercial products (by companies such as Noldus and Affectiva) that decode the emotional signs behind facial expressions, not just in terms of basic emotions but also along the valence and arousal dimensions, are being deployed. What is interesting for this thesis is that, since the existing algorithms can perform automatic emotion recognition quite successfully and keep improving, we can build upon them by using the recognized emotions as information for predicting the mood.

Psychology can provide us with insights into how to map a continuous sequence of recognized emotions into mood. Even though psychologists clearly distinguish emotions from mood, they also underline the close relationship between the two terms. Ekman, for example, argues that we tend to associate emotions directly with the corresponding mood; joy would thus translate into a happy mood and fear into stress or anxiety. Similarly, (Frijda, 1994) argues that mood might develop out of an accumulation of emotions, while (Lane & Terry, 2000) describe mood as involving multiple single emotions. From a broader perspective, evidence exists that emotions and mood influence each other and co-evolve. For example, (Morris, 2000) identifies the onset or offset of an emotional episode as a potential source of mood. In turn, emotion elicitation is influenced by the underlying mood (Oatley & Johnson-Laird, 1987), which fortifies the emotions that agree with the baseline mood of the person. For example, it is easier and more effective to induce sadness in a person who is already in a negative mood than in a person in a positive mood. Mood can help the interpretation of emotions, and the subsequent emotions contribute to the mood (Parkinson, 1996).


The work reviewed above gives sufficient empirical evidence that emotions could potentially be used to infer the mood; these findings, though, are far from being quantified. An actual functional relationship between mood and emotions cannot be directly retrieved from the theoretical psychological framework. The goal of this thesis is to retrieve this functional relationship, in order to accomplish accurate mood prediction (from video analysis) by taking advantage of the existing cutting-edge technologies in emotion recognition. Specifically, we are interested in establishing a relationship between a sequence of emotions expressed over time by a person in a video, and the underlying mood of the person in the timeframe within which his/her emotions are observed (and recognized). Breaking down automatic mood prediction into two separate problems adds a sustainable dimension to the solution for two reasons: (a) it reuses the algorithms for automatic emotion recognition, and (b) it can further be applied as a plug-in module for automatic mood prediction from other input modalities, such as audio or physiological signals, as long as another system automatically recognizes the emotions generating these signals. As a first step, we set out to explore whether it is possible to predict mood from a sequence of emotions expressed over time. This proof-of-concept analysis fits under the main research question, addressed in Chapter 2:

I. Can a sequence of expressed emotions of a person in a video be mapped into an estimate of this person's mood?

As a test-bed for validating the relationship between emotions and mood, we propose to use audio-visual affective databases, containing videos that represent affectively colored moments of a user in more or less naturalistic contexts. The videos of these databases are typically annotated (by humans other than the person in the video) in terms of emotions at frame-level; in some cases, a global label corresponding to the affective state of the user along the whole video is also included. Unfortunately, the existing audio-visual affective databases that could qualify for validating the mapping from emotions to mood (VAM (Grimm, et al., 2008) and the exemplar HUMAINE (Douglas-Cowie, et al., 2007)) had limitations in terms of the generalizability of their content (and thereby of the models trained on it). Specifically, they either consisted of short videos built around a single emotional episode, or they were not explicitly annotated in terms of mood, and therefore did not necessarily provide the appropriate ground truth for mood. These limitations called for establishing a reliable ground truth to faithfully explore the mood model. We envisioned this ground truth as a database including long, naturalistic videos, explicitly annotated in emotion (over time) and mood (globally). The design of such a database involves a number of important decisions, such as the level of control (i.e., whether to collect spontaneous material or record data in the lab), the spectrum and diversity of affect needed, and the situational context represented. In the first part of Chapter 3, we discuss these questions in order to answer:


II-1 How do we produce videos suitable for training automatic mood recognition algorithms?

Of equal importance to the design of the database material is its reliable annotation, since the annotations of mood and emotion constitute the ground truth for the training and evaluation of the mood prediction algorithms. Since the application is tailored to predicting perceived mood, the ground truth is to be obtained from external observers rather than from self-reports. In this case, more than one person should be recruited to annotate the same video, due to the subjectivity of affect (emotion and mood) perception (Cowie & McKeown, 2006). Recruiting a small number of experts to annotate the videos in a controlled environment is a common practice (McKeown, et al., 2010) (Busso, et al., 2008) (Metallinou, et al., 2010). However, given the vast amount of videos that a database can contain, their long duration and the requirement for multiple annotations per video, expert annotation can be a very tedious and time-consuming task. An interesting alternative to recruiting experts is to crowdsource the annotation task through platforms such as Mechanical Turk and Microworkers. These platforms allow distributing small tasks across thousands of people around the world who are willing to execute them for a very small compensation. The tedious annotation task can then be split into short (e.g., single-video) annotation sub-tasks, to be performed by the crowd. Whereas this yields clear advantages from a time-efficiency point of view, the anonymity and the uncontrolled crowdsourcing environment may pose challenges to the reliability of the obtained annotations. Being the first to employ laymen for empathetic annotation, we needed to address the following key question in the second part of Chapter 3:

II-2 Does crowdsourcing present a valuable alternative to expert annotation for annotating affective databases?

Having acquired appropriate ground truth, we could then use these data to train and validate more accurate models mapping emotions into mood. Instead of using pre-defined functional mappings, we considered the expressed emotions as a temporal sequence, translating the problem of mood prediction into a sequence classification problem. The literature offers an abundance of tractable data-driven solutions for modelling the relationship between a sequence and the category it maps to. These can be roughly clustered into two categories: (a) feature-based solutions, which extract statistical features from the sequence and map them to the sequence label through supervised machine learning, and (b) solutions based on graphical models, which account for the temporal coherence between the samples in the sequence. Transferring these approaches to modelling mood from emotions not only allows us to predict the mood more accurately, but also sheds more light on the extent to which the different assumptions of each model fit the mood estimation process. For instance, the feature-based approach may ignore the order in which the emotions are expressed during a video when estimating the mood, whereas a graphical approach would take the temporal relationships and dynamics within the sequence into account. Therefore, the research question answered in Chapter 4 is:

III. Which sequence classification method predicts mood from the sequence of recognized emotions more accurately?

Finally, having devised a computational model that best emulates the human process of mood estimation from emotions, we need to validate this model on machine-recognized emotions instead of the ideal case of human-recognized (annotated) ones, as the overarching question of this thesis dictates. As mentioned earlier, emotion recognition from facial expressions is a well-studied problem, and state-of-the-art algorithms are already incorporated in commercially available products, such as FaceReader and AffDex. Since such tools are available off-the-shelf, it would be interesting to see whether the mood model could be added as a sort of plug-in to the existing technology, to enable mood prediction: the facial expression recognition software predicts emotions at frame-level from facial expressions, and by using these predictions as input for our model, we then validate a fully automated system that predicts a person's mood from an input video. What requires investigation are the factors that account for the imprecisions in the mood prediction pipeline. More specifically, we are interested in quantifying the extent to which the model performance decreases due to (1) failed detection of the face and facial features, (2) imprecise prediction of emotions by the emotion recognition tool, or (3) imprecise prediction of mood from the sequence of emotions. Hence, the question we answer in Chapter 5 is:

IV. How accurately does a fully automatic system predict a person's mood from his/her facial expressions?

6 THESIS OUTLINE

The first research question, concerning the proof of concept that the mood of a person portrayed in a video can be inferred from the emotions that the person expresses over time, is addressed in Chapter 2. We perform two types of analysis. First, we cluster the emotions, punctually recognized over a given timespan, in the emotion (valence/arousal) space to test whether the centroid of the clustered emotions is located in the portion of the affective circumplex where the corresponding mood would also be located. Specifically, we consider four discrete moods, corresponding to the four quadrants of the valence-arousal mood space. In addition, since psychology agrees that more than one mood may occur within a particular time-window, we examine whether this “clustering” hypothesis is valid for cases of single, shifting or co-existing moods. We validate this simple model experimentally on human annotations of two existing audio-visual affective databases (i.e., HUMAINE and VAM).


This analysis indicates that, in the case of multiple moods or of emotions recognized at irregular time intervals within the video, the emotions do not necessarily cluster around the corresponding moods. On the other hand, the clustering yields promising results for videos portraying people in a single mood and for regular emotion annotations within the time-frame of the video. Focusing on the latter cases, we attempt to retrieve a more accurate model to capture the relationship between emotions and mood. We formulate a set of a-priori functional mappings of emotions to mood in the valence-arousal space, grounded on psychological definitions and studies. These simple functions estimate the mood as, e.g., the mean or moving average, the maximum (most intense), the most frequent, the first or the last emotion. In addition, we investigate the temporal properties of the best functional mapping, addressing whether the accuracy of the model depends on the duration of the time-span within which the mood is expressed, or on the temporal regularity of the emotion samples.
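Such a-priori mappings are simple enough to write down directly. The following is a minimal sketch, assuming the emotion sequence is given as time-ordered valence (or arousal) samples in [-1, 1]; the discount factor of the exponentially weighted moving average is a free parameter here, not the value tuned in Chapter 2.

```python
import numpy as np

# Illustrative a-priori mappings from a sequence of punctual emotions to a
# single mood estimate. Each function takes a 1-D array of valence (or arousal)
# samples ordered in time and returns one scalar.

def mood_mean(e):
    return float(np.mean(e))

def mood_last(e):
    return float(np.asarray(e)[-1])

def mood_most_intense(e):
    e = np.asarray(e, dtype=float)
    return float(e[np.argmax(np.abs(e))])      # sample with the largest magnitude

def mood_discounted_average(e, gamma=0.9):
    """Moving average with exponential discount of past emotions: recent
    samples weigh more than older ones. gamma in (0, 1) is an assumption."""
    e = np.asarray(e, dtype=float)
    weights = gamma ** np.arange(len(e) - 1, -1, -1)   # oldest gets gamma^(n-1)
    return float(np.sum(weights * e) / np.sum(weights))

valence_trace = np.array([0.1, 0.3, -0.2, -0.6, -0.5])  # toy trace
print(mood_mean(valence_trace), mood_discounted_average(valence_trace))
```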

As the results of Chapter 2 indicate that the proposed approach is promising but cannot be fully explored due to limitations of the databases on which the analysis was performed (see also section 1.5), we set out to create, as described in Chapter 3, an affective video database with the desired properties for training (and validating) an automatic mood recognizer. We record the video material in a carefully controlled environment, where we invite actors to represent different moods, while these moods are induced with music, physical theatre exercises and exemplary situational scenarios, to increase the naturalness of the expressed affect. The recordings resulted in long footage representing a variety of moods by a variety of actors. We annotate the videos in terms of both mood (for the overall video) and emotion (continuously throughout the video). We resort to crowdsourcing, instead of recruiting experts in the lab, given the prohibitive scale of the task at hand. Chapter 3 describes the filtering stages we undertake to ensure high quality in the obtained annotations. We propose economical guidelines to separate the noisy from the reliable annotations, as a trade-off between inter-annotator agreement and a sufficiently large number of annotations per video. As a final reliability test, we compare the rating behavior of the crowd-workers with that of experts. To do so, we invite selected annotators to the lab and ask them to annotate a subset of the videos of the database, following the exact same process as the workers.
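As a rough illustration of the kind of filtering step mentioned above (the actual criteria are developed in Chapter 3), one simple option is to keep a crowd-worker only if their continuous annotation agrees with the leave-one-out average of the other workers. The sketch below assumes a fixed correlation threshold and valence traces resampled to a common length; neither assumption reflects the exact procedure of Chapter 3.

```python
import numpy as np

# Sketch of one possible reliability filter for crowdsourced continuous
# annotations: a worker is retained only if their valence trace correlates
# sufficiently with the mean trace of the remaining workers for the same video.
# The threshold and the leave-one-out criterion are illustrative assumptions.

def filter_unreliable(annotations: np.ndarray, min_corr: float = 0.3) -> list:
    """annotations: array of shape (n_workers, n_frames), one trace per worker.
    Returns the indices of the workers whose traces pass the agreement check."""
    kept = []
    for w in range(annotations.shape[0]):
        others_mean = np.delete(annotations, w, axis=0).mean(axis=0)  # leave-one-out mean
        corr = np.corrcoef(annotations[w], others_mean)[0, 1]
        if corr >= min_corr:
            kept.append(w)
    return kept
```

With only a handful of workers per video, such a criterion would have to be combined with other checks (e.g., completion time or control questions), since a single erratic trace can also distort the leave-one-out mean.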

Having acquired appropriate video examples and labels, in Chapter 4 we can train and validate more complex models of mood from emotions. Complementing the analysis of Chapter 2, we let a classifier learn the boundaries between the mood classes in the mood space, instead of defining them a-priori. We devised two basic models inspired by sequence classification: (a) a feature-based one, which extracts statistical features from the sequence of emotions and maps them to mood with an Extreme Learning Machine, and (b) a graphical model that accounts for the temporal relationships between the expressed emotions in the sequence. We implemented two typologies of such a graphical model: a directed generative one (i.e., a Hidden Markov Model) and an undirected discriminative one (i.e., a Hidden Conditional Random Field).
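As a rough illustration of the feature-based alternative (the exact feature set and classifier are specified in Chapter 4), the sketch below summarizes a valence-arousal sequence into a fixed-length vector of statistical descriptors; the particular descriptors are assumptions made for illustration, not the feature set of the thesis.

```python
import numpy as np

# Sketch of the feature-based approach: summarize a temporal sequence of
# punctual emotions (one valence and one arousal value per frame) into a
# fixed-length feature vector that a standard classifier can map to a mood
# label. The descriptors below are illustrative only.

def sequence_features(valence, arousal) -> np.ndarray:
    feats = []
    for signal in (valence, arousal):
        signal = np.asarray(signal, dtype=float)
        slope = np.polyfit(np.arange(len(signal)), signal, deg=1)[0]  # linear trend
        feats.extend([
            signal.mean(),              # average level
            signal.std(),               # variability
            signal.min(),
            signal.max(),
            signal[-1] - signal[0],     # overall drift
            slope,                      # trend over time
        ])
    return np.array(feats)

v = np.array([0.2, 0.1, -0.1, -0.4, -0.5])
a = np.array([0.3, 0.4, 0.5, 0.6, 0.4])
print(sequence_features(v, a).shape)   # (12,)
```

By construction such a representation discards the order of the emotions beyond the drift and trend terms, which is precisely the assumption that the graphical models (HMM, HCRF) relax.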

Based on the model developed in Chapter 4, in Chapter 5 we implement the fully automatic system that assesses the mood of a person from the input video, using information from the facial expression. We take into account insights from semi-structured interviews with caretakers and senior inhabitants in a number of care homes, according to which depression (i.e., a low-arousal negative mood) is more likely to occur than anxiety (i.e., a high-arousal negative mood) (Huldtgren, et al., 2015). With this information in mind, we focus on predicting only the valence dimension of the mood. In this final implementation, instead of using as input the emotional values recognized by humans (annotators), we employ the FaceReader software to predict the emotional valence from the facial expression at frame-level. We compare the results with a system that relies on human-recognized emotions, to quantify the inaccuracy of mood estimation due to imperfect (in comparison to human) emotion recognition. Finally, we propose a fuzzy implementation of the Extreme Learning Machine classifier to predict, instead of one crisp mood label, a mood label vector representing the proportion of agreement of the annotators on each of the predefined mood classes. Fig. 2 illustrates schematically the outline of the thesis.
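Because the Extreme Learning Machine recurs in Chapters 4 and 5, a minimal sketch of the basic algorithm is given below for reference: the input-to-hidden weights are drawn at random and never trained, and only the output weights are fitted in closed form. The dimensions, the sigmoid non-linearity and the toy data are generic assumptions, not the architecture used in Chapter 5.

```python
import numpy as np

# Minimal Extreme Learning Machine sketch: random, untrained hidden layer;
# output weights obtained in closed form with a pseudo-inverse (least squares).
# Generic assumptions throughout; not the exact configuration of Chapter 5.

class SimpleELM:
    def __init__(self, n_hidden: int = 50, seed: int = 0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid activations

    def fit(self, X, T):
        """X: (n_samples, n_features); T: (n_samples, n_outputs) targets,
        e.g. one-hot mood labels or soft vectors of annotator agreement."""
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T      # closed-form output weights
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Toy usage: map 12-dimensional sequence features to scores for 2 mood classes.
X = np.random.rand(40, 12)
T = np.eye(2)[np.random.randint(0, 2, size=40)]    # one-hot targets
scores = SimpleELM(n_hidden=30).fit(X, T).predict(X)
predicted_class = scores.argmax(axis=1)
```

A fuzzy variant of the kind described above can be obtained by training on the soft agreement vectors directly instead of one-hot labels, so that the output approximates the proportion of annotators choosing each mood class.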


Fig. 2 Schematic outline of the thesis for automatic mood prediction from videos. The videos of the database are annotated by human annotators, in terms of emotions and mood. The computational model of mood from emotions is initially built as a function mapping the emotion annotations to the mood annotations. The right part of the figure illustrates the fully automatic mood prediction pipeline, starting with the frame-by-frame automatic recognition of the expressed emotions which are then fed into the mood prediction module. The fully automatic model is evaluated by comparing the predicted mood with the ground truth, namely the mood annotations.


REFERENCES

Batson, C. D., Shaw, L. L. & Oleson, K. C., 1992. Differentiating affect, mood, and emotion: Toward functionally based conceptual distinctions. Review of Personality and Social Psychology, Volume 13, pp. 294-326.

Beedie, C., Terry, P. & Lane, A., 2005. Distinctions between emotion and mood. Cognition & Emotion, 19(6), pp. 847-878.

Busso, C. et al., 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, Volume 42, pp. 335-359.

Cowie, R. & McKeown, G., 2006. Statistical analysis of data from initial labelled database and recommendations for an economical coding scheme.

Curran, J. P. & Cattell, R. B., 1976. Manual for the Eight State Questionnaire. IPAT, Champaign (Ill.).

Dalgleish, T., 2004. The emotional brain. Nature Reviews Neuroscience, 5(7), pp. 583-589.

Darwin, C., 1998. The expression of the emotions in man and animals. Oxford University Press.

Douglas-Cowie, E. et al., 2007. The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data, pp. 488-500.

Ekman, P., 1984. Expression and the nature of emotion. In: Approaches to emotion, pp. 319-344.

Ekman, P., 1992. An argument for basic emotions. Cognition & Emotion, Volume 6(3-4), pp. 169-200.

Ekman, P. & Friesen, W., 2003. Unmasking the face: A guide to recognizing emotions from facial clues. Ishk.

Frijda, N. H., 1994. Varieties of affect: Emotions and episodes, moods, and sentiments. In: The nature of emotions: Fundamental questions, pp. 197-202.

Golden, R. N. et al., 2005. The efficacy of light therapy in the treatment of mood disorders: a review and meta-analysis of the evidence. American Journal of Psychiatry, Volume 162, p. 656–662.

Grimm, M., Kroschel, K. & Narayanan, S., 2008. The Vera am Mittag German audio-visual emotional speech database.

Gunes, H., 2010. Automatic, dimensional and continuous emotion recognition.

Huldtgren, A. et al., 2015. Design considerations for adaptive lighting to improve seniors' mood. Geneva, In press.

Izard, C. E., 1971. The face of emotion.

Jenkins, J. M., Oatley, K. & Stein, N. L., 1998. Human emotions: A reader.

Kleinsmith, A. & Bianchi-Berthouze, N., 2013. Affective Body Expression Perception and Recognition: A Survey. IEEE Transactions on Affective Computing, pp. 15-33.

Kobayashi, R. et al., 2001. Effects of bright light at lunchtime on sleep of patients in a geriatric hospital. Psychiatry and Clinical Neurosciences, 55(3), p. 287–289.

Kueller, R. et al., 2006. The impact of light and colour on psychological mood: a cross-cultural study of indoor work environments. Ergonomics, Volume 49, p. 1496–1507.

Lane, A. M. & Terry, P. C., 2000. The nature of mood: Development of a conceptual model with a focus on depression. Journal of Applied Sport Psychology, 12(1), pp. 16-33.

Levenson, R., 1988. Emotion and the autonomic nervous system: A prospectus for research on autonomic specificity. In: Social Psychophysiology and Emotion: Theory and Clinical Applications. New York: John Wiley & Sons, pp. 17-42.

Maier, E. & Kempter, G., 2010. ALADIN-a Magic Lamp for the Elderly?. Handbook of Ambient Intelligence and Smart Environments, pp. 1201-1227.

McKeown, G., Valstar, M. F., Cowie, R. & Pantic, M., 2010. The SEMAINE corpus of emotionally coloured character interactions. IEEE International Conference on Multimedia and Expo (ICME).

McNair, D. M., Lorr, M. & Droppleman, L. F., 1971. Manual for the Profile of Mood States. San Diego: Educational and Industrial Testing Service.

Metallinou, A. et al., 2010. The USC CreativeIT database: A multimodal database of theatrical improvisation. Workshop on Multimodal Corpora, LREC.

Morris, W. N., 2000. Some Thoughts about Mood and Its Regulation. Psychological Inquiry, pp. 200-202.

Nowlis, V., 1965. Research with the Mood Adjective Checklist. Affect, Cognition and Personality.

Oatley, K. & Jenkins, J., 1996. Understanding Emotions. Blackwell Publishers Ltd.

Oatley, K. & Johnson-Laird, P. N., 1987. Towards a cognitive theory of emotions. Cognition and emotion, 1(1), pp. 29-50.

Ortony, A. & Turner, T. J., 1990. What's basic about basic emotions? Psychological Review, 97(3), p. 315.

Pantic, M. & Rothkrantz, L. J. M., 2000. Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), pp. 1424-1445.

Parkinson, B., 1996. Changing moods: The psychology of mood and mood regulation. Addison-Wesley Longman Limited.

Picard, R. W., 2000. Affective computing. MIT press.

Russell, J., 1980. A circumplex model of affect. Journal of Personality and Social Psychology, pp. 1161-1178.

Russell, J. A., 2003. Core affect and the psychological construction of emotion. Psychological review , 110(1), p. 145.

Russell, J. A. & Mehrabian, A., 1977. Evidence for a three-factor theory of emotions. Journal of Research in Personality, Volume 11(3), pp. 273-294.

Scherer, K. R., 2001. Appraisal considered as a process of multilevel sequential checking. Appraisal processes in emotion: Theory, methods, research, Volume 92, p. 120.

Scherer, K. R., 2005. What are emotions? And how can they be measured?. Social science information, 44(4), pp. 695-729.

Thayer, R. E., 1989. The biopsychology of mood and arousal. Oxford University Press.

Tomkins, S. S., 1962. Affect, imagery, consciousness: Vol. I. The positive affects.

Valstar, M., 2014. Automatic Facial Expression Recognition. In: Understanding facial expressions in communication. Manas Mandal and Avinash Awasthi.

Watson, D., 2000. Mood and temperament. Guilford Press.

Watson, D. & Tellegen, A., 1985. Toward a consensual structure of mood. Psychological bulletin , pp. 219-235.

Zeng, Z., Pantic, M., Roisman, G. I. & Huang, T. S., 2009. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 39-58.

Chapter 2

Predicting mood from emotion annotations on videos

A smart environment designed to adapt to a user's affective state should be able to decipher unobtrusively that user's underlying mood. Great effort has been devoted to automatic punctual emotion recognition from visual input. Conversely, little has been done to recognize longer-lasting affective states, such as mood. Taking for granted the effectiveness of emotion recognition algorithms, we propose a model for estimating mood from a known sequence of punctual emotions. To validate our model experimentally, we rely on the human annotations of two well-established databases: VAM and HUMAINE. We perform two analyses: the first serves as a proof of concept and tests whether punctual emotions cluster around the mood in the emotion space. The results indicate that emotion annotations that are continuous in time and value facilitate mood estimation, as opposed to discrete emotion annotations scattered randomly within the video timespan. The second analysis explores factors that account for mood recognition from emotions, by examining how individual human coders perceive the underlying mood of a person. A moving average function with exponential discount of the past emotions achieves a mood prediction accuracy above 60%, which is higher than the chance level and higher than mutual human agreement.

This chapter is equivalent to the publication:

Predicting mood from punctual emotion annotations on videos

Katsimerou Christina, Heynderickx Ingrid & Redi Judith

IEEE Transactions on Affective Computing, vol. 6, no. 2, pp. 179 - 192,

2015.


1 INTRODUCTION

Affect-adaptive systems are commonly embedded into so-called smart ambiences, ready to read our needs and address them, and as such provide a higher quality of life. Technology endowed with emotional intelligence can, among other things, drive the user to, or maintain the user in, a positive affective state, for instance by adapting the lighting system in a care center room to comfort the inhabitants (Kuijsters, et al., 2011). In this example, the adaptive system needs to be equipped with empathic technology that is able to identify (unobtrusively) the user's affective state, at least as well as a caretaker would do.

Automatic affect recognition is often based on visual input (images and videos), due to the unobtrusiveness of visual sensors and the fact that people convey important affective information via their facial and bodily expressions (Darwin, 1998). A large body of work has been devoted to mapping visual representations of facial expressions (De la Torre & Cohn, 2011), and body postures (Kleinsmith & Bianchi-Berthouze, 2013) into short-term affective states, commonly referred to as emotions. Emotions are short-term, intense and highly volatile affective states. Their automated recognition from video input is, therefore, often performed frame-by-frame. In certain affect-adaptive systems, however, this choice may be suboptimal: adapting the system behavior to the dynamics of instantaneous emotions may be redundant, if not counterproductive. Take the case of a lighting system that adapts its configuration (in terms of intensity and color temperature, for example) to improve the occupant’s affective state: it is neither necessary nor desirable that the light changes at the speed of instantaneous emotional fluctuations. The recognition of longer-term, underlying affective states would be more appropriate for such a system.

In the psychological literature, long-term diffuse affective states are often referred to as 'mood'. Mood is typically distinguished from emotions based on its duration (Jenkins, et al., 1998) and the intensity of its expression, although these differences are hardly quantified in an operational way. Little is known, for example, about the time spanned by either emotional or mood episodes. Thus, to cope with this vagueness and make as few assumptions as possible, in the rest of the chapter we will use the term emotion to refer to a punctual (i.e., instantaneous) affective state and the term mood to refer to an affective state attributed to a certain time window.


From an engineering perspective, mood is typically assumed to be a synonym of emotion, and the two terms are often used interchangeably. Very little research, indeed, has tried to explicitly perform automatic mood recognition from visual input, except for some remarkable yet preliminary attempts (Thrasher, et al., 2011) (Sigal et al, 2010), discussed in more detail in section 2. On the other hand, the psychological literature recognizes a relationship between underlying mood and expressed emotions (Morris, 1992) (Oatley & Johnson-Laird, 1987). Thus, it may be possible to infer mood from a sequence of emotions, recognized from e.g. video. This would entail the existence of a model that maps, for a given lapse of time, punctual emotion expressions into an overall mood prediction (Fig. 1). In this chapter, we propose a methodology for inferring the mood of a person portrayed in a video, similar to what an external human observer would do. In the applicative context of a care home, the mood perceived by experienced caretakers can be considered a reliable ground truth, especially when it concerns fragile groups (e.g., elderly, patients), who are often not aware of their mood or cannot report it properly. We therefore refer to the target of our model as 'perceived mood'. Such a model should then

infer perceived mood as a function of the instantaneous emotions expressed by the person in the video. As a proof of concept, we validate this functional relationship by examining how human annotators, while watching an emotionally colored video of a person, relate the emotions they perceive over time to the overall mood they perceive that person to be in. Perceived affect serves as ground truth for training any automatic affect recognition system, and thus we consider it a required test-bed for proving the validity of our idea.

Fig. 1. Framework of the automatic mood prediction module from a sequence of recognized emotions.

In summary, this chapter aims at answering the following research questions:

1) To what extent can we identify (perceived) mood from punctually perceived emotions?

2) How do humans account for the (perceived) displayed emotions when judging the overall mood of another person?

Answering these questions brings us closer to retrieving a model of human affective intelligence that can then be replicated for machine-based mood perception, possibly re-using existing emotion recognition systems. In answering our research questions, we contribute to the field by a) bringing out automatic mood recognition as a different (yet related) problem from emotion recognition, b) formulating a model where mood is the outcome of a function with punctual emotions as arguments, c) indicating an experimental setup for validating this model, d) defining the cases where this model is applicable, e) specifying the best-fitting mood function out of a number of heuristics, and f) optimizing the parameters (e.g., minimum frequency of emotion recognition) of such a function in terms of accuracy and computational complexity.

The remainder of the chapter is organized as follows: section 2 presents the related background in the psychological literature, for distinguishing and connecting emotions to mood, and in the affective computing literature for automatic mood recognition. Section 3 defines the affective domains in which emotions and mood are expressed, and formulates the mood model within these domains. Section 4 presents the databases we used to validate our model. In section 5 we introduce a simple experimental setup as a proof of concept for the mood model. In section 6 we zoom into the model and quantify its salient parameters. Finally, in the last section we discuss the insights gained from the experiments, along with limitations of our analysis and future improvements.

2 RELATED WORK

The necessary background for our work stems from two directions: the psychological literature and the affective computing literature. Psychology provides the theoretical insights on how emotions and mood are different concepts, and leads to our claim that automatic mood recognition diverges from the problem of automatic emotion recognition (section 2.1). A second notion we retrieve from psychology is how the expressed emotions and the underlying mood are related, so that we can use the emotions to infer the mood. Subsection 2.2 lists the state-of-the-art efforts from the affective computing perspective to automatically recognize mood, which in essence boil down to emotion recognition.

2.1 Mood and emotions in psychology

Psychologists who study the nature of affect tend to consider emotion and mood as closely associated affective states, yet distinguish them in terms of duration, intensity, stability, clarity and direction (awareness of cause). Jenkins et al. (Jenkins, et al., 1998) consider emotions, mood and temperament as affective states of increasing duration, respectively, although without quantifying the difference in duration. Lane and Terry (Lane & Terry, 2000) define mood as “a set of feelings, ephemeral in nature, varying in intensity and duration, and usually involving more than one emotion”. Russell (Russell, 2003) describes mood as prolonged core affect without an object, or with no specific object. Timing was also proposed as a distinctive factor between emotions and mood in (Parkinson, et al., 1996), where emotions are said to evolve through clear dynamic phases (onset-apex-offset), whereas mood “lingers somewhere in the background of consciousness”. Finally, in (Beedie, et al., 2005), the authors conducted a so-called ‘folk psychological study’, in which they asked ordinary people to describe how they experience the difference between emotion and mood. A qualitative and quantitative analysis of the responses indicated cause and duration as the two most significant features distinguishing the two concepts.

Even though emotions and mood are well established as different constructs, the literature also agrees on their confluence, to the point that the terms are used interchangeably and a one-to-one mapping between the two is often assumed (Batson, et al., 1992). Ekman (Ekman, 1999), for example, claims that we infer mood, at least in part, from the signals of the emotions we associate with that mood. We might deduce that someone is in a cheerful mood because we observe behavior that matches joy. Likewise, stress as an emotion would imply an anxious mood, or even an anxious personality. Mood can be instantiated by an emotion (Morris & Schnurr, 1989), even though temporally remote from it (Morris, 1992) and with dissipated intensity (Oatley & Johnson-Laird, 1987). Conversely, when emotions are elicited, the underlying mood of a person potentially acts as a bias: it fortifies the emotions related to it and attenuates the unrelated ones (Oatley & Johnson-Laird, 1987). To summarize, emotions and mood are different concepts, yet tightly linked. As such, in this work we aim at capturing a functional relationship between these constructs, without attempting to determine the causes and effects regulating this relationship.


2.2 Prediction of mood from audio-visual signals

From the affective computing point of view, automating the process of mood recognition (prediction of the perceived mood) entails linking data collected from sensors monitoring the user (e.g., cameras, microphones, physiological sensors) to a quantifiable representation of the (perceived) mood. In the case of visual input, very few results are retrievable in the literature. In fact, the latest studies in the field have been geared towards continuously recognizing emotions along videos rich in emotions and emotional fluctuations, e.g., as requested by the AVEC challenges of 2011 (Schuller, et al., 2011) and 2012 (Schuller, et al., 2012). However, a decision on the affective state is typically made at frame level, i.e., for punctual emotions, whereas no summarization into a mood prediction is attempted. In (Thrasher, et al., 2011) we find an explicit reference to mood recognition from upper body posture. The authors induced mood with music in subjects in a lab, and recorded eight-minute videos focusing on their upper body after the induction. They analyzed the contribution of postural features to the mood expression and found that only the head position predicted (induced) mood, with an accuracy of 81%. However, they considered only happy versus sad mood, and the whole experiment was highly controlled, in the sense that it took place in a lab and the subjects knew what they were intended to feel, making their expressions perhaps less genuine. Another reference to mood comes from (Sigal, et al., 2010), where the authors again inferred the bipolar happiness-sadness mood axis, this time from 3D pose tracking and motion capture data. Finally, the authors of (Metallinou & Narayanan, 2013) were the first to briefly tap into the concept of summarizing punctual annotations of affective signals into an overall judgment. However, they only considered the mean or the percentiles of the valence and arousal values as global predictors of the affective state, without taking their temporal distribution into account. In this study, we significantly extend the latter work by systematically constructing a complex mood model from simple functions, analyzing its temporal properties, and proposing it as an intermediate module in automatic mood recognition from video, i.e., placed after the punctual (frame-level) emotion recognition module (see Fig. 1).


3 PROBLEM SETUP AND METHODOLOGY

3.1 Domains of emotion and mood

To define a model that maps punctual emotion estimates into a single mood, it is necessary to first define the domains in which emotion and mood are represented. In affective computing there are two main trends for affect representation: the discrete one (Ekman, 1992) and the dimensional one (Russell, 1980) (Thayer, 1996). The latter most commonly identifies two dimensions, i.e., valence and arousal, accounting for most of the variance of affect. It allows a continuous representation of emotion values, capturing in this way a wider set of emotions. This is why we resort to it in our work.

In this study we assume the valence and arousal dimensions to span a Euclidean space (hereafter referred to as the VA space), where emotions are represented as point-vectors. Analogously, mood can be represented in a Euclidean (mood) space as a pair of valence and arousal values. In this work, we opt to discretize the mood space into four partitions, corresponding to the four quadrants defined by the valence and arousal axes. As a result, we define mood as a discrete variable belonging to one of the following mood classes (shown in Fig. 2):

1. positive valence - high arousal (PH), including moods such as happiness and excitement,

2. negative valence - high arousal (NH), including anxiety and anger,

3. negative valence - low arousal (NL), including sadness and sombreness,

4. positive valence - low arousal (PL), including calmness and serenity.

Fig. 2 Model of the emotion and mood spaces. Each emotion is a point in the emotion space, and a trajectory of points in the emotion space is mapped to a point in the mood space.


This 4-class mood discretization offers a trade-off between ambiguity and simplicity. It is assumed to capture all possible moods in a sufficiently distinguishable way, yet to eliminate redundancies. It should be noted, however, that this space partition is based on the theoretical psychological model and may thus be suboptimal; we defer the identification of more appropriate space partitioning strategies to future research.
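To make the quadrant partition concrete, the sketch below shows how a continuous (valence, arousal) pair could be assigned to one of the four mood classes. This is a minimal illustration rather than code from our experiments; the label strings and the treatment of boundary values (exact zeros) are our own assumptions, anticipating the quantization function formalized in Sec. 3.2.

```python
# Minimal sketch (not from the thesis): assigning a continuous (valence, arousal)
# point to one of the four mood quadrants described above.
# The treatment of exact zeros is an illustrative convention.

def mood_quadrant(valence: float, arousal: float) -> str:
    """Map a point of the valence-arousal plane to a mood class label."""
    if valence >= 0 and arousal >= 0:
        return "PH"  # positive valence - high arousal (e.g. happiness, excitement)
    if valence < 0 and arousal > 0:
        return "NH"  # negative valence - high arousal (e.g. anxiety, anger)
    if valence < 0 and arousal <= 0:
        return "NL"  # negative valence - low arousal (e.g. sadness, sombreness)
    return "PL"      # positive valence - low arousal (e.g. calmness, serenity)


if __name__ == "__main__":
    print(mood_quadrant(0.7, 0.4))   # PH
    print(mood_quadrant(0.3, -0.6))  # PL
```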

3.2 Problem formulation

Suppose we have a video i representing a person in a mood, which would be perceived by an external human observer as mood M_i. The person in the video portrays over time different emotions, which are perceived by a human (or by a system) as emotions e_i. Emotion recognition is performed at certain time-steps, for example for every video frame k during the entire video i.

For every independent video i consisting of n_i frames, the punctual emotion vector e_i(k), corresponding to the perceived emotion at frame k, is expressed in the VA space as

\mathbf{e}_i(k) = \big( v_i(k), a_i(k) \big), \quad k = 1, 2, \ldots, n_i,   (1)

where v_i(k) and a_i(k) are the recognized valence and arousal values of the emotion expressed at frame k \le n_i of the video i. Assuming that the sequence of punctual emotion vectors for video i,

\mathbf{E}_i = \big( \mathbf{e}_i(1), \mathbf{e}_i(2), \ldots, \mathbf{e}_i(n_i) \big),   (2)

is known, we want to express the mood vector \mathbf{m}_i = (m_i^v, m_i^a) as

\mathbf{m}_i = F(\mathbf{E}_i),   (3)

where F is the function mapping the emotion sequence into the mood vector, whose components m_i^v and m_i^a represent the mood valence and arousal respectively. We finally retrieve from the mood vector the quantized mood M_i with

M_i = Q(m_i^v, m_i^a),   (4)

where the function Q : \mathbb{R}^2 \to \{PH, NH, NL, PL\} maps the continuous mood vector into the quantized mood space defined in Sec. 3.1, and is defined as

Q(x, y) =
\begin{cases}
PH & \text{if } \operatorname{sgn}(x) \ge 0 \text{ and } \operatorname{sgn}(y) \ge 0 \\
NH & \text{if } \operatorname{sgn}(x) < 0 \text{ and } \operatorname{sgn}(y) > 0 \\
NL & \text{if } \operatorname{sgn}(x) < 0 \text{ and } \operatorname{sgn}(y) < 0 \\
PL & \text{if } \operatorname{sgn}(x) > 0 \text{ and } \operatorname{sgn}(y) < 0
\end{cases}   (5)
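To illustrate how Eqs. (1)-(5) fit together, the sketch below aggregates a hypothetical sequence of per-frame valence-arousal estimates into a mood vector and quantizes it. The frame-wise mean is used only as one simple candidate for F (the chapter later compares a number of such heuristics), and all names and values are illustrative assumptions rather than the actual implementation.

```python
# Illustrative sketch (not the thesis implementation): mapping a sequence of
# punctual emotion estimates E_i = (e_i(1), ..., e_i(n_i)) to a quantized mood M_i.
# The frame-wise mean stands in for F in Eq. (3).

from typing import List, Tuple

Emotion = Tuple[float, float]  # (valence, arousal) of one frame, as in Eq. (1)


def mood_vector_mean(emotions: List[Emotion]) -> Tuple[float, float]:
    """A simple F (Eq. 3): average valence and arousal over all frames."""
    n = len(emotions)
    m_v = sum(v for v, _ in emotions) / n
    m_a = sum(a for _, a in emotions) / n
    return m_v, m_a


def quantize_mood(m_v: float, m_a: float) -> str:
    """Q (Eq. 5): same quadrant rule as the Sec. 3.1 sketch,
    with an illustrative convention for exact zeros."""
    if m_v < 0:
        return "NH" if m_a > 0 else "NL"
    return "PH" if m_a >= 0 else "PL"


if __name__ == "__main__":
    # Hypothetical per-frame (valence, arousal) estimates for one video
    E_i = [(0.2, 0.5), (0.4, 0.1), (-0.1, -0.2), (0.3, 0.4)]
    print(quantize_mood(*mood_vector_mean(E_i)))  # -> PH for this example
```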
