Assessment of Shopping Behavior: Automatic System for Behavioral Cues Analysis


Assessment of Shopping Behavior

Automatic System for Behavioral Cues Analysis

DISSERTATION

for the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus, Prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Tuesday 8 January 2013 at 12:30

by

Mirela Carmia POPA

Master of Science in Computer Science, Transylvania University, Brașov; Professional Doctorate in Engineering in Software Technology, Technische Universiteit Eindhoven; born in Brașov, Romania.

(4)

Prof. dr. C. M. Jonker
Prof. dr. drs. L. J. M. Rothkrantz

Composition of the doctoral committee:

Rector Magnificus, chairman
Prof. dr. C. M. Jonker, Technische Universiteit Delft, promotor
Prof. dr. drs. L. J. M. Rothkrantz, Netherlands Defence Academy, promotor
Prof. dr. ir. F. W. Jansen, Technische Universiteit Delft
Prof. dr. M. Neerincx, Technische Universiteit Delft
Prof. dr. Ing. E. Nöth, University of Erlangen-Nürnberg
Prof. Ing. P. Slavik, Czech Technical University
Dr. C. Shan, Philips Research
Prof. dr. I. Heynderickx, Technische Universiteit Delft, reserve member

This research was supported by the Netherlands Organization for Scientific Research (NWO) under Grant 018.003.017.

Delft University of Technology, Delft, the Netherlands
Philips Research, Eindhoven, the Netherlands

Copyright © 2013 by Mirela Carmia Popa. All rights reserved.

This thesis was typeset using LaTeX.

Graphs were prepared using MATLAB. Figures were prepared using Inkscape and GIMP. Printed by Wöhrmann Print Service, the Netherlands.

(5)
(6)
(7)

Contents

1 Introduction
1.1 Visual surveillance introduction
1.2 Problem Definition
1.3 Approach
1.4 Methods used
1.5 Outline of the Dissertation

2 Shopping Behavior Assessment
2.1 Introduction
2.2 Related Work
2.3 Behavioral Models
2.4 Computational Framework
2.4.1 People Tracking
2.4.2 Trajectory Analysis
2.4.3 Analysis of Regions of Interest
2.4.4 Action Recognition
2.4.5 Classification Techniques
2.4.6 Reasoning Mechanism
2.5 Experimental Results
2.5.1 Datasets
2.5.2 Experiments
2.6 Conclusions and Future Work

3 Shopping Behavior Recognition using a Language Model Analogy
3.1 Introduction
3.2 Related Work
3.3 Language Behavioral Model
3.4 Shopping Behavior Recognition Model
3.5 Experimental Results
3.5.1 Datasets
3.5.2 Experiments

4 Comparison of HMM vs. DBN for AUs recognition
4.1 Introduction
4.2 Related Work
4.3 Facial Action Units Modeling
4.3.1 Dataset
4.3.2 Region of Interest Selection
4.3.3 Feature Extraction
4.3.4 Classification
4.3.5 Hidden Markov Models
4.3.6 Dynamic Bayesian Networks
4.4 Results and Discussions
4.5 Summary and Conclusions

5 Assessment of Product Facial Expressions
5.1 Introduction
5.2 Related Work
5.3 Database Description
5.4 Methods
5.4.1 Feature Extraction
5.4.2 Learning Methods
5.5 Experimental Results
5.5.1 Clustering approach
5.5.2 Classification approach
5.6 Conclusions and Future Work

6 Summary, Conclusions, and Future Work
6.1 Summary and Conclusions
6.2 Future Work

Summary

Samenvatting

Acknowledgements


Chapter 1

Introduction

This chapter introduces the problem of video surveillance for shopping behavior analysis, the main research questions, our proposed approach towards developing an automatic assessment system for behavior recognition, and the outline of the thesis. The answers to the formulated research questions can be found in Chapter 6, while the specific aspects of each investigated problem are presented in Chapters 2-5 of this thesis.


1.1 Visual surveillance introduction

Surveillance of public places by means of Closed-Circuit Television (CCTV) systems is currently widely used to monitor locations and the behavior of people in those areas. Since the terrorist attacks in Madrid and London, the demand for video sensor network systems that help guarantee the safety of people in public areas has increased further. Events such as football games and music concerts, and large venues such as shopping malls, where many people gather, also call for video surveillance systems. Video cameras can nowadays be found in many public areas, such as streets, metro stations, and public buildings (see an example in Fig. 1.1).

Currently, the existing video surveillance systems in public places are used by human operators to monitor the situation, analyse the data, and detect abnormal or unwanted human behavior such as theft or aggression, or for later retrieval in case an unwanted event is reported. Human monitoring has benefits, such as intelligent reasoning about the situation, but also limitations: fatigue and loss of concentration, especially when nothing happens for a long period of time, and the difficulty of watching an overwhelming number of cameras continuously, given that in London alone there are almost two million cameras installed. The cost of hiring a sufficient number of human operators would be huge, and it is almost impossible to guarantee alertness 24/7. A supporting alternative is therefore the development of automatic systems designed to monitor the video streams and to alert the human operators only in the case of unusual or unwanted events.

Another application of an automatic surveillance system is the detection and analysis of human behavior, which is useful in several domains, such as patient monitoring, supporting elderly people [Virone, 2009], detecting anti-social behavior of people in public places [Kuklyte et al., 2009], or aggression detection in trains [Lefter et al., 2012].

We chose to study the applicability of an automatic surveillance system in a shopping environment, which is regarded as a general case study for the remainder of this thesis. The existing video surveillance infrastructure in shopping malls, usually intended for security purposes such as aggression or theft detection, could be extended to other purposes, such as investigating the shopping behavior of customers. In the marketing domain, shopping behavior represents the reaction of customers to products; it is influenced directly by their needs, goals, and emotions, but also indirectly by the shopping area, the atmosphere (light, music, ambience), and the quality of the offered services.

In a shop, products are displayed in such a way as to optimize the buying behavior of shoppers. Using the surveillance system, ideas about product placement and store layout can be validated and improved, leading to an increase in sales. Furthermore, the optimal lighting can be tested under different conditions, helping to create the best shopping experience. Human shop assistants are usually available to assist customers, but during peak hours they may find it difficult to supervise the whole shopping area efficiently.

A supporting alternative can be provided by an intelligent automatic surveillance system, which could detect when there is a need for support or a selling opportunity and would alert a shop assistant to take appropriate action.

Figure 1.1: Surveillance video cameras in public places (pictures by Danielle Waugh/NCC News).

Another use of such a system would be the detection of long queues at the pay desk, a situation in which an additional counter should be opened. Accidents, such as products falling on the ground or medical emergencies, and unwanted events, such as pickpocketing, theft, or incidents between customers, could also be detected automatically.

This type of automatic assistance could contribute to better safety, a better understanding of the customers' needs, and improved service, which are important goals in the marketing domain.

We have highlighted several benefits of an automatic surveillance system in a shopping environment. Its applicability can be extended to other domains in which human behavior recognition is useful, by taking into account the specific characteristics of each domain while applying similar methods.

One area that could be simplified and improved by automatic analysis is the investigation of people's opinions on different matters (e.g. the quality of offered services, choices of goods, companies, medical care, or even political views) by means of questionnaires [Schaeffer, 2000]. In the marketing domain, customers' reactions to products are usually investigated with questionnaires [Reich, 2001], a method with some limitations: it requires a lot of time and energy and lacks the means to distinguish between a rational and a spontaneous answer. By automatic analysis of the non-verbal behavior of customers in relation to products, many cues related to their perception of a product can be extracted. Body language, gestures, speech tonality, and language semantics, but most of all facial expressions, indicate in which product or products a customer is interested.

Another application of automatic monitoring of shopping behavior would be a reduction in the number of human operators needed to supervise the video streams (see an example in Fig. 1.2). Possible abnormal or unwanted human behavior would be signalled automatically to an operator, who could further inspect the situation and decide to take appropriate action.

Figure 1.2: (a) Customer interacting with products, captured by the surveillance camera system in a shop. (b) Customer-shop assistant interaction at the pay desk. (c) Monitoring by a human operator.

In order to develop such systems, we need to investigate algorithms and methods from the computer vision and pattern recognition fields. In the last decade, an increasing amount of research has been devoted to video analytics, while semantic interpretation has been used to automate surveillance in public places to guarantee safety [Collins et al., 1999], [Matheny et al., 2011]. Automatic systems for detecting abandoned luggage, detecting entry into private areas, or for aggression detection have been proposed and tested in controlled conditions. Still, a fully automated surveillance system for shopping behavior analysis is currently not available. Some software packages do exist [SBXL, 2009], but they mostly record video streams and provide little further automatic analysis.

In the next section, we continue by introducing the main problems addressed in this thesis.

1.2 Problem Definition

The main focus of this dissertation is the basic research needed to develop a fully automated intelligent system for video surveillance and behavior analysis, applied and tested in the test case of shopping behavior. The main research question addressed in this thesis is therefore:

How to automate video surveillance and behavior analysis?

In order to answer it, we divide it into more specific questions, while highlighting the associated scientific challenges. We note that the scope of this dissertation is the shopping environment, which is used as a proof of concept. Still, a number of research questions are formulated in a general manner, as algorithms and methods applied to shopping behavior could also be appropriate for other types of behavior observed in different environments.

A video surveillance system in a shopping environment depends at the most basic level on the sensors: video cameras, which need to be efficiently distributed to completely monitor a site. A first issue to be solved consists of finding the required camera resolution, the optimum number of video cameras, their positions, and the required level of redundancy obtained by overlapping camera fields of view (FOV). Furthermore, when deciding on the system parameters, we need to take into account the object of interest (e.g. a person or a crowd of people).

Once the video data has been recorded, the next step consists of analysing it and extracting meaningful information. To understand a dynamically changing environment, a computer vision system can be designed to interpret behaviors from human actions and interactions captured visually in that environment. The representation of behavior consists of characteristic features which can be estimated from the video data. This leads to the following research question, which is handled in Chapter 2 of this thesis:

1. How to automatically extract relevant behavioral cues?

1.a How to automatically analyse trajectories?

1.b How to automatically analyse human actions?

1.c Which type of context information can be associated with the different regions of interest?

Facial expression analysis, as part of non-verbal behavior analysis, represents one of the important types of information contributing to the behavior recognition task. One method for facial expression recognition uses Facial Action Units (AUs) as basic units, given that each AU is associated with the activation of one or more specific facial muscles and each facial expression can be described as a combination of AUs [Ekman and Friesen, 1978]. Given the many available classification methods, deciding which is best suited for the problem at hand can be quite difficult. At a general level, behavior representation needs to accommodate both cumulative and temporal information; therefore spatial-temporal methods seem to be a good option. From this group, Hidden Markov Models (HMMs) and Dynamic Bayesian Networks (DBNs) have proved to be the most successful. In order to find which method is most appropriate in our case, we pose the following research question, which is investigated in more detail in Chapter 4 of this dissertation:

2. Which of the two spatial-temporal methods, HMMs and DBNs, is better suited for modeling and recognizing facial action units?
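As an illustration of how an HMM scores a temporal sequence, the following sketch implements the forward algorithm for a discrete HMM in plain Python and labels an observation sequence with the AU whose model assigns it the highest likelihood. The two toy models, their parameters, and the quantized binary feature are hypothetical; the experiments in Chapter 4 use richer features and learned parameters.

```python
# Illustrative sketch (not the thesis implementation): scoring a quantized
# observation sequence with the HMM forward algorithm, then labelling it
# with the Action Unit whose model yields the highest likelihood.

def forward_likelihood(pi, A, B, obs):
    """P(obs | HMM) via the forward algorithm.
    pi[i]   : initial state probabilities
    A[i][j] : state transition probabilities
    B[i][k] : probability of emitting symbol k in state i
    obs     : sequence of discrete observation symbols
    """
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return sum(alpha)

def classify_au(models, obs):
    """Pick the AU whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda au: forward_likelihood(*models[au], obs))

# Two toy 2-state models over a binary feature symbol (hypothetical numbers):
# 'AU12' (lip corner puller) tends to emit symbol 1, 'AU4' (brow lowerer) symbol 0.
models = {
    "AU12": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.1, 0.9]]),
    "AU4":  ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.8, 0.2], [0.9, 0.1]]),
}
print(classify_au(models, [1, 1, 0, 1, 1]))  # prints AU12
```

In practice one HMM is trained per AU and an unseen sequence is assigned to the model with the highest likelihood; the DBN alternative generalizes this by allowing richer dependency structures between variables.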

Next, we also plan to investigate which facial expressions are encountered in product appreciation. There is an impressive amount of research on the recognition of the six basic emotions (happiness, sadness, anger, disgust, surprise, and fear), as they are assumed to be universal. In a shopping environment, however, we might encounter blended emotions, expressed in many ways depending on the personal characteristics of the customers, and also differing from one culture to another. All these issues highlight the difficulty of defining models for product-related emotions and sustain the need to investigate this problem. Our focus in this dissertation is not only on recognizing discrete facial expression classes, but also on mapping them onto the two-dimensional space defined by valence and arousal [Russell, 1980]. Therefore, the next research question, dealt with in Chapter 5, is:


3. How can we automatically evaluate customers' responses to products using facial expression analysis?
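As a minimal illustration of the valence-arousal representation, the sketch below places the basic emotions at illustrative coordinates on Russell's circumplex and maps a point in the two-dimensional space to the nearest discrete label. The coordinates are assumptions for demonstration, not values from the thesis.

```python
import math

# Hedged sketch: approximate circumplex coordinates (valence, arousal) in
# [-1, 1] for the six basic emotions plus neutral. The exact placements
# are illustrative assumptions.
CIRCUMPLEX = {
    "happiness": ( 0.8,  0.5),
    "surprise":  ( 0.3,  0.8),
    "anger":     (-0.6,  0.7),
    "fear":      (-0.7,  0.6),
    "disgust":   (-0.7,  0.2),
    "sadness":   (-0.7, -0.4),
    "neutral":   ( 0.0,  0.0),
}

def nearest_emotion(valence, arousal):
    """Map a point in valence-arousal space to the closest discrete label."""
    return min(CIRCUMPLEX,
               key=lambda e: math.dist(CIRCUMPLEX[e], (valence, arousal)))
```

Such a mapping lets a classifier output either a discrete label or a continuous (valence, arousal) estimate, which is convenient for blended, product-related emotions that fall between the basic categories.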

Furthermore, data streams coming from different types of video cameras need to be combined. The sensory information then has to be mapped to a high semantic level, in order to associate it with a meaningful interpretation. One of the main challenges addressed in this dissertation is bridging the gap between sensory and semantic information by means of a multi-level framework for shopping behavior analysis. The different types of information, such as trajectories, sequences of actions, or context information regarding the visited regions of interest (ROIs), need to be fused at different moments in time and mapped to a higher semantic level consisting of different behavioral types. Recognizing the different behavioral types is difficult, as it depends on the reliability of the intermediary components, needs to cope with the uncertainty of possible human behaviors, and also has to deal with the complexity of discerning between two similar behaviors. These requirements lead us to the following research issue, which is treated in more detail in Chapters 2 and 3 of this thesis:

4. How to map sensory information to the semantic level?

Another challenge considered in this thesis is the adaptation of ideas from language processing models to the task of behavior recognition, given the analogy between the two worlds (speech recognition and behavior recognition). In speech recognition, semantic knowledge is useful for filtering out impossible or less likely combinations of sounds or words, thereby reducing the computational complexity and the error rates. In the same manner, the task of behavior recognition can benefit from additional knowledge about valid sequences of shopping-related actions. Consequently, the next research question, which is presented in detail in Chapter 3, is:

5. How to develop a behavioral model to support automated surveillance?

Assuming that the above research questions are answered, we also need to investigate how to design such an intelligent system, which modules need to be developed, and how to integrate them.
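The language-model analogy can be sketched with a bigram model over action sequences: transition probabilities estimated from example sequences let the system rank plausible sequences above implausible ones, just as a language model filters out unlikely word combinations. The action vocabulary and training sequences below are hypothetical.

```python
from collections import Counter

# Sketch of the language-model analogy: a bigram model over shopping
# actions, trained on hypothetical example sequences.

def train_bigrams(sequences, smoothing=0.1):
    pairs = Counter()
    unigrams = Counter()
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        unigrams.update(seq[:-1])
        pairs.update(zip(seq, seq[1:]))
    def prob(prev, nxt):
        # Additive smoothing so unseen transitions keep a small probability.
        return (pairs[(prev, nxt)] + smoothing) / (unigrams[prev] + smoothing * len(vocab))
    return prob

def sequence_score(prob, seq):
    """Product of bigram probabilities along the action sequence."""
    score = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        score *= prob(prev, nxt)
    return score

training = [
    ["enter", "browse", "take", "inspect", "pay", "exit"],
    ["enter", "browse", "take", "put_back", "browse", "exit"],
    ["enter", "take", "inspect", "pay", "exit"],
]
prob = train_bigrams(training)
likely   = sequence_score(prob, ["enter", "browse", "take", "inspect"])
unlikely = sequence_score(prob, ["pay", "enter", "exit", "take"])
```

A recognizer can use such scores to reject action hypotheses that form implausible sequences, in the same way a speech recognizer prunes unlikely word strings.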

The next step after system development consists of testing its components and assessing whether it can be applied in a real shopping environment. In order to learn the behavioral model parameters from data, we need many training examples. Furthermore, at the time of writing, no public-domain database of shopping behavior was available. Therefore, in order to develop a proof-of-concept prototype, we needed to record a database of shopping behavior. One alternative was to use our laboratory, which has the appearance of a real shop (see Fig. 1.3) while permitting easy installation of the required video equipment. Initial control of the recording protocol is important in order to deal with potential problems and challenges, such as the optimum position of the cameras, illumination conditions, or occlusions. The recorded shopping sequences need to involve a reasonable level of spontaneity. This can be achieved by permitting the participants to behave as they like, without giving them instructions about what to do or how to perform an action, but providing them with a high-level task such as finding a piece of clothing in the shop.


Figure 1.3: Different views of the ShopLab, indicating the positions of the installed video cameras.

The research question about system evaluation is handled in all the remaining chapters of this dissertation.

6. How can we evaluate automated video surveillance and behavior analysis systems in laboratory settings, and what performance can be achieved?

Additionally, we also aim to use a database collected in a supermarket, consisting of recorded shopping trips of customers, which poses several challenges. In such a case we may not be able to control the recording conditions (e.g. camera types, positions, and resolution) and have to cope with the existing set-up. The innovative research consists in this case of finding a suitable approach for recognizing shopping-related actions in a complex environment while dealing with low camera resolution and distorted images. In other words, we aim to investigate the following research issue:

7. How can we evaluate automated video surveillance and behavior analysis systems in real-life conditions, and what performance can be achieved?

We never aimed beyond developing a proof-of-concept prototype.

The next sections provide an introduction to the approach and methods used to obtain the solutions and answers to the research issues.


Figure 1.4: Flowchart of the shopping behavior assessment system.

1.3 Approach

In order to answer the main research question addressed in this thesis, we followed a modular approach inspired by the behavior of human operators [Yang, 2009]. They extract different types of information from the video data and associate or fuse them for a better understanding of the situation.

Furthermore, we use the multi-layer architecture for analysing surveillance videos proposed in [Choudhary et al., 2008], [Wijnhoven et al., 2006], and [Nam et al., 2010]. The flowchart of the complete proposed system is shown in Fig. 1.4.

An important aspect of the proposed framework is that it needs to deal with different types of information originating from sensors. We organized the framework into three levels of abstraction, and we explain it in terms of the shopping behavior case study.

1. Sensor Level

At the sensor level, we propose using different video cameras, which should be synchronized and used in a collaborative manner. A fish-eye camera mounted on the ceiling is useful for detecting people and tracking them through the shop. Besides the advantage of a wide field of view, which enables capturing the whole scene, it has the disadvantage of distorting the image, especially at the borders, a region which is very relevant in our case, as it corresponds to the product areas. Therefore, additional high-definition cameras are needed in the relevant Regions of Interest (ROIs), the places where people perform specific actions (e.g. entering, paying, resting). In the specific case of shopping behavior, product ROIs are important, as they facilitate the recognition of the customers' actions in relation to products. In each relevant ROI, a customer can perform different actions. If we are interested in a more refined analysis of the customers' reactions to products, we can use yet another type of video camera, such as high-quality web cameras, devoted to recording the frontal facial expressions of the shoppers.

The different levels of abstraction are reflected not only in the structure of the framework architecture but also in the designated functionality of the different types of video cameras employed. The fish-eye camera serves for human detection and tracking, conveying global information about a shopper's behavior. For a more refined analysis, the high-definition camera is used for recognizing a customer's actions. The most detailed information about a shopper's behavior is obtained from the web camera, which enables the analysis of the displayed facial expressions.

The video cameras need to be managed by a centralized component, which is also responsible for saving the acquired video streams to a database for later retrieval.

2. Intermediate Level

The intermediate level is responsible for extracting the relevant features from the video feed. The analysis of the video streams originating from the different types of cameras has to be performed in parallel.

• Trajectory Analysis

The analysis of the video stream obtained from the fish-eye camera serves for human tracking. After a person is detected using the person detection module, an identifier is assigned and the tracking module is started. In order to collect a shopper's track through the shop, the information from multiple cameras needs to be combined. Re-identification of a person from one camera to another can be difficult, especially when several persons are grouped together. Next, provided that the track information is available, trajectory features have to be extracted, characterizing different walking patterns, which carry relevant information about the displayed shopping behavior. For example, walking straight towards a product display, spending little time at that location, and then continuing the trip might indicate a shopper who knows what he wants and where to find it, in other words a 'goal-oriented' shopper. On the other hand, a customer wandering around and returning several times to the same location might need help finding a product; this is called a 'disoriented' type of shopper. This type of information is encapsulated by trajectory features and represents one of the main types of information used at the high semantic level to draw a conclusion about the displayed shopping behavior. Moreover, the customers' positions can be used to detect the visited shopping areas, and especially the places where they stop.
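The trajectory features described above can be sketched as follows; the straightness measure and the threshold separating 'goal-oriented' from 'disoriented' walking patterns are illustrative choices, not the feature set used in Chapter 2.

```python
import math

# Hedged sketch: simple trajectory features from a timestamped 2-D track,
# illustrating how walking patterns separate a 'goal-oriented' from a
# 'disoriented' shopper. Thresholds and feature names are illustrative.

def trajectory_features(track):
    """track: list of (t, x, y) samples, ordered by time t (seconds)."""
    dist = sum(math.dist(a[1:], b[1:]) for a, b in zip(track, track[1:]))
    duration = track[-1][0] - track[0][0]
    displacement = math.dist(track[0][1:], track[-1][1:])
    return {
        "mean_speed": dist / duration if duration else 0.0,
        # Straightness near 1.0 suggests a direct, goal-oriented path;
        # near 0.0 suggests wandering back and forth.
        "straightness": displacement / dist if dist else 1.0,
    }

def shopper_type(features, straightness_threshold=0.6):
    return ("goal-oriented" if features["straightness"] >= straightness_threshold
            else "disoriented")

direct = [(0, 0, 0), (1, 1, 0), (2, 2, 0), (3, 3, 0)]   # straight walk
wander = [(0, 0, 0), (1, 1, 0), (2, 0, 1), (3, 0, 0)]   # loops back to start
```

Richer feature sets (stop counts, revisit counts, speed variance) follow the same pattern of summarizing the raw track into a fixed-length vector for the semantic level.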

• Region of Interest Detection

To obtain a better overview of what is happening inside a shop, we can use context information based on the segmentation of the shopping area into Regions of Interest (ROIs), such as product, passing, pay desk, or resting areas. Features related to the ROIs, such as the time spent in each ROI together with the transitions between different ROIs, can contribute to a better recognition of the shopping behavior, as an action can have different meanings in different ROIs. One issue in defining ROIs is manual vs. automatic segmentation of the shopping area. A manual definition of the ROIs in a shop can lead to a refined segmentation, enhancing the benefit of contextual information at the cost of the work needed to obtain it. On the other hand, an automatic segmentation is easy to obtain and to re-use in case the layout of the shop changes, while the resulting map might not be as satisfactory as the manually defined one.

Another issue encountered when defining the relevant ROIs is the type of shop. The products or services offered in a shop determine which additional ROIs can be defined. In a clothes shop, new ROIs can be found, such as mirror or fitting room ROIs; in a bakery, an eating ROI; and in an electronics shop, a special ROI devoted to testing the electronic devices.
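A minimal sketch of the ROI features described above, assuming manually defined rectangular ROIs with hypothetical names and coordinates; it accumulates the dwell time per ROI and the ROI-to-ROI transitions from a sampled position trace.

```python
# Hedged sketch: dwell time per ROI and ROI-to-ROI transitions from a
# position trace, with ROIs modelled as axis-aligned rectangles. The ROI
# names and coordinates are hypothetical.

ROIS = {
    "entrance": (0, 0, 2, 2),     # (xmin, ymin, xmax, ymax)
    "products": (2, 0, 6, 4),
    "pay_desk": (6, 0, 8, 2),
}

def roi_of(x, y):
    for name, (x0, y0, x1, y1) in ROIS.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None  # passing area / outside any defined ROI

def roi_statistics(trace, dt=1.0):
    """trace: positions sampled every dt seconds -> dwell times, transitions."""
    dwell, transitions = {}, []
    prev = None
    for x, y in trace:
        roi = roi_of(x, y)
        if roi is not None:
            dwell[roi] = dwell.get(roi, 0.0) + dt
        if roi != prev and roi is not None and prev is not None:
            transitions.append((prev, roi))
        prev = roi
    return dwell, transitions

trace = [(1, 1), (3, 1), (3, 1), (5, 2), (7, 1)]
dwell, transitions = roi_statistics(trace)
```

The resulting dwell times and transition sequence are exactly the kind of context features that, combined with the recognized actions, feed the semantic level.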

• Action Recognition

The second video stream, coming from the high-definition cameras, does not need to be analysed continuously, but only when the shopper stops in a specific ROI. When a customer is detected to be stationary inside a relevant ROI, this is signalled to the central reasoning board, which is responsible for supervising all video streams and starting the action recognition module. Human actions inside a shop are interesting to recognize, as they mainly concern the customers' interaction with products (e.g. pointing, taking, touching, inspecting, putting back, trying on, taking off) or with the shopping cart or shopping basket. In order to assess this type of information, we need to extract relevant features regarding both the appearance of a person and his/her movements. Next, a classification module is used to discriminate between the different types of actions. The analysis of shopping-related actions is important, as different sequences of actions can have different semantic meanings, contributing to the recognition of the shopping behavioral models.
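The trigger described above, starting the action recognition module only when a shopper is stationary inside a relevant ROI, can be sketched as a sliding-window test on the recent track; the window length and radius below are illustrative parameters.

```python
import math

# Hedged sketch of the stationarity trigger: action recognition starts
# only when the last `window` position samples all lie within `radius`
# of their centroid. Parameter values are illustrative.

def is_stationary(positions, window=5, radius=0.5):
    """positions: chronological (x, y) samples; checks the most recent ones."""
    if len(positions) < window:
        return False
    recent = positions[-window:]
    cx = sum(p[0] for p in recent) / window
    cy = sum(p[1] for p in recent) / window
    return all(math.dist(p, (cx, cy)) <= radius for p in recent)

moving  = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
waiting = [(3.0, 1.0), (3.1, 1.0), (3.0, 1.1), (2.9, 1.0), (3.0, 0.9)]
```

Combined with the ROI lookup, such a test lets the central reasoning board switch the expensive high-definition analysis on only where and when it is needed.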

• Product Appreciation Assessment

Last but not least, the customer's emotional state plays an important role in buying decisions and in the way products are perceived. For instance, an angry person might be less open than a happy person to finding or appreciating a new product. An impressive amount of research has been devoted to the recognition of the six basic emotions [Ekman, 1999] from facial expressions, but little is known about product-related emotions [Desmet and Hekkert, 2002]. Therefore, a separate analysis stream investigates a customer's reaction to a product based on the displayed facial expressions. Recognizing facial expressions involves several steps. First, the face detection module is employed, based on the Viola and Jones approach, which uses the principle of a boosted cascade of simple classifiers and is capable of processing images extremely rapidly while achieving high detection rates [Viola and Jones, 2002]. The next step consists of identifying the facial landmarks (e.g. mouth, nose, eyes, and eyebrows), as they are involved in generating different facial expressions. One of the most used methods for this task is Active Appearance Models (AAMs) [Cootes et al., 1998], due to their ability to locate deformable objects. Around the detected facial landmarks, discriminative features are extracted and then fed to a classification method in order to distinguish between different facial expressions.

3. Semantic Level

Finally, the highest level of abstraction, the semantic level, is responsible for deciding on the displayed shopping behavioral type. Each intermediary analysis stream (trajectory analysis, action recognition, ROI detection, and facial expression analysis) provides an input to the reasoning model, which, based on the observables, formulates a hypothesis regarding the most likely shopping behavioral model. The fusion algorithm for the different behavioral cues is modeled according to the importance or availability of each modality, by associating different weights with them. Furthermore, the data fusion is done both in a deterministic manner, by incorporating expert knowledge, and in a probabilistic way. In the latter case we had to gather a sufficient number of behavioral samples in order to learn the model parameters. Both types of data fusion have important benefits, which is why we propose a combined solution that exploits their advantages and incorporates semantic expert knowledge into the probabilistic models.

An example of the semantic reasoning about the displayed shopping behavior starts with the assessment of the direction and walking speed of a customer observed by the top camera. When the customer stops for a period of time in a product ROI, the system changes its focus to the side camera and the customer's actions are assessed. If the performed set of actions is recognized correctly (the customer is looking at products, takes one, touches it, puts it in his/her basket, then takes the product out of the shopping basket and puts it back on the shelf), an interpretation is formulated. Furthermore, extracting relevant information from the customer's facial expression indicates the positive or negative appreciation of the chosen product. This example highlights the importance of each modality, as each contributes to the final decision about the shopping behavioral type and carries relevant information about the possible solutions.
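The weighted fusion of the behavioral cues can be sketched as a normalized weighted average of the per-stream probability distributions; the behavior labels, stream names, and weights below are illustrative, and the combined deterministic-probabilistic scheme of the thesis is richer than this sketch.

```python
# Hedged sketch of the weighted fusion step: each analysis stream outputs
# a probability distribution over behavioral types, and the streams are
# combined with weights reflecting their importance or availability.

def fuse(stream_probs, weights):
    """stream_probs: {stream: {behavior: prob}}; weights: {stream: weight}.
    Returns a normalized weighted average over the available streams."""
    fused = {}
    total = sum(weights.get(s, 0.0) for s in stream_probs)
    for stream, probs in stream_probs.items():
        w = weights.get(stream, 0.0) / total
        for behavior, p in probs.items():
            fused[behavior] = fused.get(behavior, 0.0) + w * p
    return fused

streams = {
    "trajectory": {"goal-oriented": 0.7, "disoriented": 0.3},
    "actions":    {"goal-oriented": 0.6, "disoriented": 0.4},
    "face":       {"goal-oriented": 0.2, "disoriented": 0.8},
}
weights = {"trajectory": 0.4, "actions": 0.4, "face": 0.2}
fused = fuse(streams, weights)
```

Because the weights are normalized over the streams that actually produced an output, a temporarily unavailable modality (e.g. no frontal face visible) simply drops out of the decision instead of biasing it.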


The introduced framework represents a first step towards semi-automation in the shopping environment, and it could also be useful for supporting human operators. This goal can be achieved by developing a Graphical User Interface that displays the analysis of the shopping behavior next to the video stream. Visualization of what is happening in the shop, in the form of tracks, recognized actions, or behavior probabilities, can be of great use while supervising the situation, but also for training shop assistants to recognize different types of shoppers.

It should be mentioned that in this thesis we do not focus particularly on the image processing part, as we use third-party, state-of-the-art software to extract basic features, such as the widely used open-source computer vision library [OpenCV]. Furthermore, this thesis is intended as a proof of concept which investigates how to automatically assess shopping behavior by means of a video surveillance system. There are also other issues relevant to the analysed problem which are beyond the scope of this dissertation. In this category we mention gesture analysis, such as pointing towards products; social interactions inside the shopping environment, which influence the shopping behavior, as customers may be accompanied by family or friends; unwanted incidents; speech analysis, which could reveal customers' opinions about different brands or the shop itself; and usability studies, which would reflect the end-users' opinion about the intended system.

One of the main challenges of this research is at the semantic level, where we intend to investigate different models of shopping behavior and algorithms to assess them automatically. In the next section we present the methodology adopted for conducting the research in the field of shopping behavior analysis.

1.4 Methods used

For investigating the main research aspects posed in this thesis, we followed an approach consisting of several important steps.

• Literature survey

After identifying the main problem which had to be solved, our next step consisted of investigating the state of the art in behavior analysis. Ongoing research projects on monitoring of underground transportation environments [VANAHEIM, 2010] or on visual context modelling [ViCoMo, 2009] supported the applicability and relevance of our chosen research topic. There are currently a number of disciplines, such as image processing, computer vision, pattern recognition, and behavior understanding, which contribute to our field of study. Therefore, we carried out a detailed analysis of the successful methods and algorithms proposed in the previously mentioned directions of research.

• Designing shopping behavioral models

Our aim was to build an automatic intelligent system for shopping behavior analysis. In order to achieve this goal, we needed to design empirically based models of shopping behavior. We accomplished this task by using observations of people while shopping and knowledge from marketing experts. We observed the features characteristic for each type of behavior and grouped them depending on the addressed behavioral cue (trajectory analysis, action recognition, or facial expression analysis).

• Developing an automatic intelligent system for shopping behavior analysis

Next, we aimed at investigating the underlying technologies needed for automatically processing and analysing the observed features. The details regarding the developed modules and the adopted system architecture were presented in Section 1.3.

• Shopping behavioral models validation

In order to validate and test our models, we needed to make recordings of shopping behavior in both laboratory and real-life settings. The analysis of the obtained experimental results provided us with new insights regarding the most appropriate methods for data processing, but also regarding the limitations of our proposed approach.

We used an iterative process, which incorporated the lessons learned into each new iteration for improving the analysis methods and obtaining better results.

Finally, after achieving a satisfactory outcome, we summarized our work by presenting our conclusions and proposing directions for future work.

We continue in the next section by describing the thesis outline.

1.5 Outline of the Dissertation

The dissertation is structured in six chapters, including this introduction chapter, in which we set out the scope of the thesis. Chapters 2-5 are based on published papers. The structure of the dissertation is depicted in Fig. 1.5.

Chapter 2 (Semantic Assessment of Shopping Behavior using Trajectories, Shopping Related Actions, and Context Information) provides an overview of the state of the art in shopping behavior analysis, followed by the description of the proposed modeling approach towards shopping behavior representation. Next, the design of the multi-level framework for shopping behavior assessment is presented, together with the methodology adopted for the automatic extraction of the proposed behavioral cues, addressing the fourth research question introduced in Section 1.2. The software modules responsible for trajectory analysis and action recognition in the special ROIs are described in detail, providing information about the adopted feature extraction methods and classification techniques and offering an answer to the first research question investigated in this thesis. The chapter also includes a description of the employed datasets, which were recorded in both laboratory and real-life conditions. Furthermore, a discussion of the experimental results obtained on the considered datasets is provided, which partially addresses the sixth and seventh research questions. Finally, the chapter ends with conclusions and ideas for future developments. The results and analysis presented in this chapter were published in the international journals and conference proceedings listed below:


Figure 1.5: Outline of the thesis structure.

[Popa12a] M. C. Popa, L. J. M. Rothkrantz, C. Shan, T. Gritti, and P. Wiggers, 2012. Semantic Assessment of Shopping Behavior Using Trajectories, Shopping Related Actions, and Context Information, Pattern Recognition Letters, DOI:10.1016/j.patrec.2012.04.015.

[Popa12b] M. C. Popa, L. J. M. Rothkrantz, C. Shan, and P. Wiggers, (in press) 2012. Assessment of Customers' Level of Interest, IEEE Int. Conference on Image Processing (ICIP), Florida, U.S.A.

[Popa11a] M. C. Popa, T. Gritti, L. J. M. Rothkrantz, C. Shan, and P. Wiggers, 2011. Detecting Customers' Buying Events on a Real-life Database, Computer Analysis of Images and Patterns, 14th International Conference (CAIP 2011), Springer Heidelberg, ISBN 978-3-642-23672-3, pp. 17-25, Seville, Spain.

[Popa11b] M. C. Popa, L. J. M. Rothkrantz, P. Wiggers, C. Shan, and T. Gritti, 2011. Automatic Assessment of Customers' Buying Behavior, International Workshop on Computer Vision Applications (CVA), Eindhoven.

[Popa10a] M. C. Popa, L. J. M. Rothkrantz, P. Wiggers, R. Braspenning, and C. Shan, 2010. Analysis of Shopping Behavior based on Surveillance System, Systems, Man and Cybernetics (SMC’10), pp. 2512-2519, Istanbul, Turkey.


The main contribution of Chapter 2 consists of introducing a framework for shopping behavior analysis, which combines different behavioral cues at the high semantic level. Still, the proposed fusion method is deterministic, lacking the ability to cope with uncertainty. Therefore, a new probabilistic behavioral model is proposed in the next chapter.

Chapter 3 (Shopping Behavior Recognition using a Language Modeling Analogy) introduces a theoretical model of behavior inspired by speech recognition. By means of the proposed behavioral model, low-level information extracted from video data is associated with semantic information. The main contributions of this chapter are on two levels of abstraction. Firstly, on the action recognition level, new features are proposed, consisting of fusing Histograms of Optical Flow (HOF) with directional features. Secondly, on the behavior level, smoothed bi-grams are combined with the maximum dependency in a chain of conditional probabilities, leading to an improvement over the baseline models by capturing correlations between the basic actions. This chapter provides an answer to the fifth research question. The results obtained on both laboratory and real-life datasets are presented and discussed in parallel, offering further insight into the aspects addressed in the sixth and seventh research questions, while the chapter ends with concluding remarks. The shopping behavioral model proposed in this chapter was introduced in the following publication:

[Popa12c] M. C. Popa, L. J. M. Rothkrantz, C. Shan, and P. Wiggers, 2012. Shopping Behavior Recognition using a Language Modeling Analogy, Pattern Recognition Letters, DOI:10.1016/j.patrec.2012.11.015.
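To make the language modeling analogy concrete, the toy fragment below scores sequences of shopping actions with an add-one smoothed bigram model, in the spirit of the smoothed bi-grams mentioned above. The action labels and training sequences are invented for illustration and are not the thesis data.

```python
from collections import Counter

# Toy training "sentences" of shopping actions (illustrative labels).
train = [["look", "take", "put_in_basket"],
         ["look", "take", "put_back"],
         ["look", "take", "put_in_basket"]]

vocab = {a for seq in train for a in seq}
hist = Counter(a for seq in train for a in seq[:-1])              # history counts
bi = Counter((a, b) for seq in train for a, b in zip(seq, seq[1:]))  # bigram counts

def p(b, a, k=1.0):
    """Add-k smoothed P(b | a)."""
    return (bi[(a, b)] + k) / (hist[a] + k * len(vocab))

def seq_prob(seq):
    """Probability of an action sequence under the bigram model."""
    prob = 1.0
    for a, b in zip(seq, seq[1:]):
        prob *= p(b, a)
    return prob

# A frequently observed transition scores higher than an unseen one.
assert p("take", "look") > p("look", "take")
```

Behavior recognition along these lines assigns the behavioral class whose action-sequence model gives the highest probability to the observed sequence.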

The first two chapters elaborated on methods and techniques suitable for shopping behavior assessment based on trajectory analysis and basic actions recognition. Another type of information relevant for understanding customers' behavior is represented by facial expression analysis. The next chapter focuses on finding the best methodology towards facial action units recognition.

Chapter 4 (A Comparative Study of HMMs and DBNs applied to Facial Action Units Recognition) offers an insight into the facial expression recognition problem based on action units (AUs). The Facial Action Coding System (FACS) developed by Ekman and Friesen decomposes the face into 46 AUs, each AU being related to the contraction of one or more specific facial muscles. FACS proved its applicability to facial behavior modeling, enabling the recognition of an extensive palette of facial expressions. Even though a lot has been published on this theme, it is still difficult to draw a conclusion regarding the best methodology to follow, as there is no common basis for comparison and sometimes no argument is given why a certain classification method was chosen. This chapter provides the basis for comparison of the different methods involved at different steps in the analysis problem, such as different selections of facial regions of interest, different algorithms for feature extraction, and, last but not least, different spatio-temporal classification techniques (HMMs vs. DBNs). Even though, from a theoretical point of view, Hidden Markov Models (HMMs) and Dynamic Bayesian Networks (DBNs) are similar, in practice they pose different challenges. The benefits and also the drawbacks of the two considered methods


are presented in the experimental section, providing the reader with an insight into the sixth research question. The chapter ends with concluding remarks regarding the most appropriate method for the studied problem and offers an answer to the second research question addressed in this thesis. The methods, results, and discussion presented in this chapter were published in:

[Popa10b] M. C. Popa, L. J. M. Rothkrantz, D. Datcu, P. Wiggers, R. Braspenning, and C. Shan, 2010. A comparative study of HMMs and DBNs applied to Facial Action Units Recognition, Neural Network World, vol. 20, no. 6, pp. 737-760.

[Popa10c] M. C. Popa, L. J. M. Rothkrantz, and P. Wiggers, 2010. Prod-ucts appreciation by facial expression analysis, 11th International Conference on Computer Systems and Technologies (CompSysTech’10), pp. 293-298.

Chapter 4 familiarizes us with the problem of facial expression analysis and proposes an approach towards solving it. However, even though the database used in the experimental session was good from a statistical point of view, containing a reasonable number of annotated samples and representing a good basis for testing different methods, from an application point of view it is less suitable, as it contains posed facial expressions. Given that our aim was to investigate shopping behavior, we needed to record a database of product related facial expressions, a problem which is investigated in the next chapter.

Chapter 5 (Assessment of Facial Expressions in Product Appreciation) presents the methodology used to collect a database of emotional facial expressions, obtained by presenting a set of product related pictures to a number of test subjects. Next, the analysis of the displayed facial expressions consists of extracting both geometric and appearance features, as they contain complementary information. Furthermore, two types of approaches are employed. Unsupervised methods are used to discover emotional classes and proved efficient at differentiating between positive and negative facial expressions in 78% of the cases. For a more refined analysis of the different types of product related emotions, we employed different classification methods and achieved 84% accuracy for seven emotional classes and 95% for positive vs. negative. Finally, the obtained results are discussed and the applicability of the performed research is highlighted, while providing an answer to the third research question. The analyses presented in this chapter were submitted for publication in:

[Popa12d] M. C. Popa, L. J. M. Rothkrantz, C. Shan, and P. Wiggers, Assessment of Facial Expressions in Product Appreciation, under review in the International Journal of Human Computer Studies.
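As a rough sketch of how unsupervised methods can separate positive from negative expressions, the fragment below runs a minimal 2-means clustering on invented 2-D geometric features (e.g. mouth-corner lift vs. brow lowering). The feature values and their interpretation are purely illustrative, not the thesis features.

```python
def kmeans2(points, iters=10):
    """Minimal 2-means on 2-D feature tuples; deterministic initialization."""
    centers = [points[0], points[-1]]   # init with first and last point
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:                # assign each point to nearest center
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers, clusters

# Toy geometric features: (mouth-corner lift, brow lowering); values invented.
pos = [(0.8, 0.1), (0.9, 0.2), (0.7, 0.15)]   # smiling-like samples
neg = [(0.1, 0.8), (0.2, 0.9), (0.15, 0.7)]   # frowning-like samples
centers, clusters = kmeans2(pos + neg)
```

On such well-separated toy data the two recovered clusters coincide with the positive and negative groups; on real facial features the separation is of course noisier, as the 78% figure above suggests.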

Chapter 6 (Summary, Conclusions, and Future Work) formulates the most important findings presented across the dissertation and comments upon them. It also provides directions for future developments in the behavior recognition field, while referring to the specific case of shopping behavior.


[Choudhary et al., 2008] Choudhary, A., Chaudhury, S., and Banerjee, S., 2008. Framework for Analysis of Surveillance Videos, In: Proceedings of the Sixth Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2008), IEEE Computer Society, Washington, DC, pp. 344-351.

[Collins et al., 1999] Collins, R., Lipton, A., and Kanade, T., 1999. A system for video surveillance and monitoring, In: American Nuclear Society 8th Internal Topical Meeting on Robotics and Remote Systems.

[Cootes et al., 1998] Cootes, T. F., Edwards, G. J., and Taylor, C. J., 1998. Active Appearance Models, In: H. Burkhardt and B. Neumann, editors, 5th European Conference on Computer Vision, Springer, Berlin, vol. 2, pp. 484-498.

[Desmet and Hekkert, 2002] Desmet, P. M. A., and Hekkert, P., 2002. The basis of product emotions, In: W. Green and P. Jordan (Eds.), Pleasure with Products, Beyond usability, pp. 60-68.

[Ekman and Friesen, 1978] Ekman, P., Friesen, W., 1978. Facial Action Coding System, Consulting Psychologists Press, Inc., Palo Alto, California, USA.

[Ekman, 1999] Ekman, P., 1999. Basic Emotions, In: Handbook of Cognition and Emotion, edited by T. Dalgleish and M. Power, Sussex, UK, John Wiley & Sons Ltd.

[Kuklyte et al., 2009] Kuklyte, J., Kelly, P., Conaire, C., O'Connor, N. E., and Xu, L.-Q., 2009. Anti-social Behavior Detection in Audio-Visual Surveillance Systems, Workshop on Pattern Recognition and Artificial Intelligence for Human Behaviour Analysis.

[Lefter et al., 2012] Lefter, I., Burghouts, G. J., Rothkrantz, L. J. M., 2012. Automatic Audio-Visual Fusion for Aggression Detection using Meta-Information, 9th IEEE International Conference on Advanced Video and Signal-Based Surveillance.

[Matheny et al., 2011] Matheny, M. E., Normand, S. L., Gross, T. P., Marinac-Dabic, M., Loyo-Berrios, N., Vidi, V. D., Donnelly, S., Resnic, F. S., 2011. Evaluation of an automated safety surveillance system using risk adjusted sequential probability ratio testing, BMC Medical Informatics and Decision Making, vol. 11, no. 75.

[Nam et al., 2010] Nam, Y., Rho, S., Park, J. H., 2010. Intelligent Video Surveillance System: 3-tier context-aware surveillance system with metadata, Multimedia Tools and Applications (MTAP), DOI:10.1007/s11042-010-0677-x.

[OpenCV] OpenCV: Open Source Computer Vision Library, available at: http://www.intel.com/technology/computing/opencv.

[Reich, 2001] Reich, A. Z., 2001. Determining a firm's linear market position: In search of an effective methodology. Journal of Hospitality and Tourism Research, vol. 25(2), pp. 159-172.

[Russel, 1980] Russel, J. A., 1980. A circumplex model of affect, Journal of Personality and Social Psychology, vol. 39, pp. 1167-1178.

[SBXL, 2009] Shopping Behavior Xplained (SBXL), 2009. Axis cameras watch shopper's behavior, available at: http://www.axis.com/files/success_stories/ss_ret_sbxl_36113_en_0907_lo.pdf.

[Schaeffer, 2000] Schaeffer, N. C., 2000. Asking questions about threatening topics: a selective overview, In: Stone, A. A., Turkkan, J. S., Bachrach, C. A., Jobe, J. B., Kurtzman, H. S., Cain, V. S. (Eds.), The science of self-report: Implications for research and practice, pp. 105-121.

[Senior et al., 2007] Senior, A. W., Brown, L., Hampapur, A., Shu, C. F., Zhai, Y., Feris, R. S., Tian, Y. L., Borger, S., and Carlson, C., 2007. Video analytics for retail, In: Proc. IEEE Conference on Advanced Video and Signal-based Surveillance, London, UK, pp. 423-428.

[VANAHEIM, 2010] VANAHEIM: Video/Audio Networked surveillance system enhAncement through Human-cEntered adaptIve Monitoring, 2010, available at: http://www.vanaheim-project.eu/.

[ViCoMo, 2009] ViCoMo: Visual Context Modeling, 2009, available at: http://www.vicomo.org/.

[Viola and Jones, 2002] Viola, P. and Jones, M., 2002. Robust real-time object detection, International Journal of Computer Vision.

[Virone, 2009] Virone, G., 2009. Assessing everyday life behavioral rhythms for the older generation, Journal of Pervasive and Mobile Computing, vol. 5, no. 5, pp. 606-622.

[Wijnhoven et al., 2006] Wijnhoven, R. G. J., Jaspers, E. G. T., de With, P. H. N., 2006. Flexible Surveillance System Architecture for Prototyping Video Content Analysis Algorithms, Multimedia Content Analysis, Management, and Retrieval, Proceedings of the SPIE Electronic Imaging, vol. 6073, San Jose, CA.

[Yang, 2009] Yang, Z., 2009. Multi-Modal Aggression Detection in Trains. PhD thesis, Delft University of Technology.


Chapter 2

Semantic Assessment of Shopping Behavior Using Trajectories, Shopping Related Actions, and Context Information

The goal of this chapter consists of defining shopping behavior models and proposing a multi-level framework towards recognizing them automatically. After identifying the main behavioral cues, such as walking patterns, customer-product interaction patterns, and regions of interest in the shop, we present the adopted approach towards automatic assessment in Section 2.4. The fusion of the behavioral cues and the semantic interpretation is achieved by employing a reasoning model. The experiments are performed on both laboratory and real-life recordings in a supermarket, achieving a satisfactory outcome. While this chapter introduces a deterministic model for fusing the behavioral cues, in Chapter 3 we propose a probabilistic model for behavior recognition. 1

1This chapter is equivalent to the publication: M. C. Popa, L. J. M. Rothkrantz, C. Shan, T. Gritti, and P. Wiggers, 2012. Semantic Assessment of Shopping Behavior Using Trajectories, Shopping Related Actions, and Context Information, Pattern Recognition Letters, DOI:10.1016/j.patrec.2012.04.015.


Abstract Automatic understanding of customers' shopping behavior and acting according to their needs is relevant in the marketing domain and is attracting a lot of attention lately. In this work, we propose a multi-level framework for the automatic assessment of customers' shopping behavior. The low-level input to the framework is obtained from different types of cameras, which are synchronized, facilitating efficient processing of information. A fish-eye camera is used for tracking people, while a high-definition one serves for the action recognition task. The experiments are performed on both laboratory and real-life recordings in a supermarket. From the video recordings, we extract features related to the spatio-temporal behavior of trajectories, to the dynamics and the time spent in each region of interest (ROI) in the shop, and to the customer-product interaction patterns. Next we analyze the shopping sequences using a Hidden Markov Model (HMM). We conclude that it is possible to accurately classify trajectories (93%), discriminate between different shopping related actions (91.6%), and recognize shopping behavioral types by means of our proposed reasoning model in 95% of the cases.

Key words: Shopping Behavior, Semantic Analysis, Trajectory Analysis, Action Recognition, Hidden Markov Models.

2.1 Introduction

In recent years there has been an increasing interest in developing intelligent software solutions to enhance the user experience. Fields such as affective computing, the gaming industry, surveillance, or marketing could benefit greatly from systems which act according to the user's preferences or intentions.

In the marketing domain it is of great interest to build a satisfactory relation with the customer, by assessing his/her emotional state and intentions. The shopping experience could be enhanced by facilitating easy access to the products in which the customer shows interest or by offering timely assistance whenever a customer needs help in finding or selecting merchandise. For assisting customers, human shop assistants are usually available, but at peak hours they are too expensive to meet the whole demand, or they are not always well trained or willing to adapt to the different types of customers. Following the conclusions presented in [Tsai and Huang, 2002], the employee's affective delivery plays an important role in the customer's level of satisfaction. Therefore, a supporting alternative can be provided by developing an automatic behavior assessment system. Using the available surveillance systems of video cameras in shops [Popa et al., 2010], we aim at a semantic interpretation of the customers' shopping behavior and at detecting when there is a need for support or a selling opportunity. Our system can also be used for other purposes, such as detecting long queues in front of the pay desks or generating statistics about customers' interaction with products.

The modeling of the shopping behavior is based on different types of information. The customer's walking pattern provides a first indication of the type of behavior, while the different customer-product interaction patterns can reveal the level of interest of the customer in a specific product. Our assessment model is context sensitive and is based on the segmentation of the shopping area into Regions of Interest (ROIs), such as products, passing areas, pay desk, or resting areas. Features such as the time spent in each ROI together with the transitions between different ROIs can contribute to a better modeling of the shopping behavior, as an action can have different meanings in different ROIs. For example, standing in the products area can mean visual inspection of the available products; the same action in the pay desk ROI denotes waiting, while in the passing ROI it can be regarded as orientation or waiting for another person. It is very important to detect if a customer spends too much time in a specific ROI, in order to take appropriate actions on the short or long term, such as sending a shop assistant to offer help or optimizing the arrangement of products.

Customers display different shopping behaviors depending on several factors, such as whether they are experienced with the shopping area or not, their purpose, their mood, and whether they are accompanied or alone. Different shopping behaviors can be noticed not only between different shopping trips, but also during the same one. This fact adds even more complexity to the behavior modeling task.

Figure 2.1: Flowchart of the proposed system.

Our contributions in this paper consist of designing a framework for automatic assessment of shopping behavior, built in a hierarchical manner by employing different levels of abstraction, from the low sensory level up to the semantic level. At the sensor level, different video cameras are synchronized and used in a collaborative manner. A fish-eye camera is used to detect people and to track them through the shop, while a high-definition camera is employed for action recognition. Analyzing a customer's actions is not relevant all the time, but only when he/she is in a specific ROI. Given the position of a customer, entering or exiting a ROI can be detected, and the action recognition module can be started. The high-level semantic interpretation of the different shopping behavioral types is realized using a reasoning model, which combines the intermediary outputs of the trajectory analysis, action recognition, and ROI detection modules. The summary flowchart of the proposed system is presented in Fig. 2.1. The proposed framework is tested both in a laboratory set-up and in a real-life scenario.
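The ROI-triggered activation described above can be illustrated with a small sketch: a tracked position is mapped to a named ROI, and enter/exit events are emitted whenever the ROI changes, at which point the action recognition module could be started. The ROI names, rectangle coordinates, and track below are invented for the example.

```python
# Illustrative ROI entry/exit detection from tracked floor positions.
# ROI names and (x_min, y_min, x_max, y_max) rectangles are invented.
ROIS = {"products": (0, 0, 50, 30),
        "pay_desk": (60, 0, 90, 20)}

def current_roi(x, y):
    """Map a position to a named ROI, defaulting to the passing area."""
    for name, (x0, y0, x1, y1) in ROIS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "passing"

def roi_events(track):
    """Yield (frame, event, roi) whenever the tracked person changes ROI."""
    prev = None
    for frame, (x, y) in enumerate(track):
        roi = current_roi(x, y)
        if roi != prev:
            if prev is not None:
                yield frame, "exit", prev
            yield frame, "enter", roi   # e.g. start action recognition here
            prev = roi

track = [(70, 10), (55, 10), (40, 10)]   # pay desk -> passing -> products
events = list(roi_events(track))
```

In a deployed system the positions would come from the fish-eye tracker and the "enter" event for a products ROI would switch processing to the high-definition camera.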

The outline of the paper is as follows. In Section 2.2 we provide an overview of related work. Next, we describe the proposed shopping behavior models in Section 2.3. We continue in Section 2.4 by presenting the computational framework; the integrated modules, namely trajectory analysis and action recognition, are discussed in terms of the underlying feature extraction and classification approaches. The proposed reasoning model is introduced in Section 2.4.6. Next, we provide a description of the used datasets and the experimental results in Section 2.5. Finally, we formulate our conclusions and give directions for future work.
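For readers unfamiliar with HMM-based classification of observation sequences, class assignment typically scores a sequence under one trained HMM per class and picks the most likely model. The score is computed with the standard forward algorithm, sketched below with toy parameters (not the trained models used in this chapter).

```python
def forward_ll(obs, pi, A, B):
    """Likelihood of a discrete observation sequence under an HMM
    (standard forward algorithm, no scaling; fine for short sequences)."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]        # initialization
    for o in obs[1:]:                                       # induction
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)                                       # termination

# Toy 2-state HMM over binary observations (e.g. discretized speed: slow/fast).
pi = [0.6, 0.4]                    # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]       # state transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]       # per-state emission probabilities
p = forward_ll([0, 0, 1], pi, A, B)
```

With one such model per trajectory class, classification reduces to `argmax` over the per-class likelihoods; long sequences would use log-space or scaling to avoid underflow.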

2.2 Related Work

Organizations, companies, and retailers design products, experiences, and ambiance in order to influence people's behavior, as a means to increase their profitability and popularity. In order to achieve this goal, human behavior needs to be understood and modeled accordingly. There were many attempts to explain and categorize human behavior. Social psychology explains human behavior as a result of the interaction of mental states and immediate social situations. Examples of proposed models are the Social Cognitive Theory [Pajares et al., 2009], the Cost-benefit model [Bias and Mayhew, 1994], and the Theory of Reasoned Action [Hale et al., 2003]. The model developed by Fogg, named FBM [Fogg, 2009], represents a valid approach towards describing human behavior. This model has three basic components (motivation, ability, and triggers) and asserts that in order for a behavior to happen, "a person must have sufficient motivation, sufficient ability, and an effective trigger". In the context of shopping behavior, it is reasonable to assume that the same components play an important role.

The technical perspective upon human shopping behavior is focused on developing efficient algorithms which can enable automatic behavior recognition. An attempt towards human behavior analysis while shopping was investigated in [Sicre and Nicolas, 2010]. They propose a finite-state-machine model for the detection of simple actions, while the interaction between customers and products is based on Motion History Image (MHI) [Bobick and Davis, 2001] and Accumulated Motion Image (AMI) [Kim et al., 2010] descriptions and Support Vector Machine (SVM) classification. An interesting idea towards enhancing the shopping experience in a retail store was proposed in [Meschtscherjakov et al., 2010]. Customers are made aware of the activity of other customers in the shop through a dynamic map, similar to concepts of online shops, such as sales rank. The results of this study proved that customers are interested in areas which are frequently visited by other customers. This pilot study can be implemented using a surveillance system inside a shop and detecting the areas most visited by the customers. Further on, this information can be useful to re-arrange products or promotions in the shop. Wiliem et al. present their work towards uncommon behavior detection in [Wiliem et al., 2008], which is obtained by measuring deviations from defined normal behavior. By clustering trajectories from a shopping mall corridor and the CAVIAR dataset, the most common paths are detected.

Computer vision supports shopping behavior analysis by providing multiple techniques which enable surveillance, trajectory analysis, or action recognition. People tracking, behavior analysis, and prediction were investigated in [Kanda et al., 2008]. Accumulated people's trajectories over a long period of time provided a temporal use-of-space analysis, facilitating the behavior prediction task performed by a robot. Still, this approach lacks a path-planning process, which is important for notifying the target person of the robot's presence.

Among the several main research directions which contribute to human behavior assessment, we mention human action recognition. Laptev et al. propose space-time interest points (STIP) in combination with a multi-channel SVM classifier to recognize realistic human actions in unconstrained movies in [Laptev et al., 2008]. This method has several advantages, such as robustness to occlusions and under different illumination conditions. Another successful approach towards action recognition is presented in [Ke et al., 2005]. The authors define the integral video to efficiently calculate 3D spatio-temporal volumetric features and train cascaded classifiers to select features and recognize human actions. In the shopping environment, [Hu et al., 2009] use MHI along with a foreground image obtained by background subtraction and the histogram of oriented gradients (HOG) [Dalal and Triggs, 2005] to obtain discriminative features for action recognition. The proposed approach is improved by building a multiple-instance learning framework, SMILE-SVM. This method proved its effectiveness in a real-world scenario from a surveillance system in a shopping mall, aimed at recognizing customers' interest in products, defined by the intent of getting the merchandise from the shelf.
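As background for the MHI-based methods cited above, the Motion History Image has a simple per-frame update rule: pixels flagged as moving are set to a maximum timestamp value tau, while all other pixels decay towards zero, so recent motion appears brighter than old motion. A minimal sketch of that update rule, assuming NumPy is available:

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=15):
    """One MHI update step: moving pixels jump to tau, the rest decay by 1."""
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

# Toy 4x4 example: one pixel moves in the first frame, then nothing moves.
mhi = np.zeros((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True
mhi = update_mhi(mhi, mask)                           # pixel (1,1) set to tau
mhi = update_mhi(mhi, np.zeros((4, 4), dtype=bool))   # everything decays by 1
```

In practice the motion mask comes from frame differencing or background subtraction, and descriptors such as HOG are then computed on the resulting MHI.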

We presented two different views on human behavior, which can be combined to achieve a deeper understanding of it. The social psychological view offers insights into the motivations and reasons why people behave in a certain manner, enabling behavior modeling, while computer vision provides the means to analyze and recognize shopping behavior in an automatic manner using both sensory and semantic information. To the best of our knowledge, no study has proposed an automatic system for customers' shopping behavior assessment based on trajectory analysis, action recognition, and context related features.

In order to provide an answer to the main research question addressed in this paper, we present in the following sections the considered shopping behavior models together with the proposed reasoning model for shopping behavior interpretation.

2.3 Behavioral Models

There are many ways in which human behavior can be investigated, starting from the traditional methods which involve questionnaires, interviews, focus groups, online research, or scanner data, and ending with the more advanced automatic techniques which are non-intrusive, such as audio-video recordings.

Our methodology towards behavior modeling consists of two steps. We started by participant observation of customers' shopping trips in an unobtrusive manner. Based on 20 hours of observations collected by the researchers in shops, we defined the following types of shopping behavior: goal oriented, looking around, disoriented, looking for support, fun-shopper, and duo-shopper. We introduced these types in [Popa et al., 2010], while the focus of this work consists of the customer-product interaction analysis and of the design and implementation of a computational framework for detecting and assessing shopping behavior. Our study goes beyond the individual shopping behavior, towards social interactions during which people could display certain behaviors not in relation with products but as a cause of other circumstances. For example, a mother with children could wander around not because she is disoriented in her shopping behavior but to take care of her children, and a person going directly to the coffee corner is not an example of a goal-oriented shopper but someone who intends to get free coffee.

In a second step we validated our assumptions by watching video recordings of shopping trips in a real shop. This brought us new insights and helped us refine the proposed models, by revealing different ways in which an action can be performed and also the most common combinations of behaviors. We consider human behavior from a general point of view, in order to understand what triggers it, which motivations lie behind it, and finally which causes prevent it from happening. We started from the general human behavior model proposed by Fogg in [Fogg, 2009] and adapted it to the shopping context. It is reasonable to assume that the same components introduced by Fogg (motivation, ability, and triggers) play an important role also in the case of shopping behavior. There is always a motivation for which people go shopping, such as need, relaxation, curiosity, the desire to be up to date with the latest trend, and so on. Complementary to motivation there is also the ability of performing the actual action of buying a product. From this perspective, ability is highly correlated with the amount of money a person has, but also with other factors such as the time and effort required to obtain a certain item. Considering this view, we can better understand why a behavior happens or not, even though certain criteria were met. On the one hand, a person might have the motivation to buy a product and be willing to spend time and effort to find it, but if he/she does not have enough money, he/she will finally not buy it. On the other hand, a person might have both the motivation and the money to buy a product, but if it is too difficult to find, the person might decide to give up. It is especially this type of situation that we aim to avoid, by providing an automatic system able to recognize different types of shopping behavior and to trigger an alarm when a customer displays a disoriented or looking for support behavior.

Still, recognizing the other types of shopping behavior is important for gathering statistics about customer preferences in terms of products, areas visited in the shop, or preferred shopping paths, and could contribute to enhancing the shopping experience.

Each type of shopping behavior has a number of characteristic features, which are presented in Fig. 2.2. The features are displayed next to each behavior, along with the possible transitions from one behavior to another. We assume the goal-oriented type of shopper knows what he/she wants and where to find the product(s) of his/her interest. If he/she goes directly to a product display, takes the product, and then heads towards another place, we assume he/she will not need assistance.

The disoriented type of shopping behavior is representative of a customer who does not seem to know what he/she wants, where to find it, or how to choose. He/she goes from one place to another without an apparent plan. A shop assistant might help him/her find a particular product, select something appropriate, or make a choice. A customer who does not know where to find a product of his/her interest and asks for help is called the looking-for-support type of shopper.

On the other hand, a looking-around shopper is assumed to be inspecting the offer without needing anything in particular. He/she might at some point become interested in a product and decide to buy it. In that case


Figure 2.2: Diagram of different shopping behavior types and the dynamics between them.

his/her behavior will be similar to that of the goal-oriented shopper. Another type of shopping behavior is the fun-shopper, who visits a shop to get acquainted with the latest promotions or novelties and is attracted by a crowd, an exposition, or an event in the shop rather than by the products. Shopping is not only an individual activity but also a social one, in which people shop together with their partner, children, or friends. We call this the duo-shopper type of behavior, which is characterized by trajectories that are usually close together and by interactions between the shoppers regarding the selection or appreciation of products.
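For illustration, the behavior types and the transitions between them can be encoded explicitly; the edge set below is only our illustrative reading of the dynamics discussed above, not a normative transcription of Fig. 2.2:

```python
# Hypothetical encoding of the shopper behavior types and the allowed
# transitions between them; the exact edge set is an assumption.
TRANSITIONS = {
    "goal_oriented":       {"goal_oriented", "looking_around", "disoriented"},
    "looking_around":      {"looking_around", "goal_oriented", "fun_shopping"},
    "disoriented":         {"disoriented", "looking_for_support", "goal_oriented"},
    "looking_for_support": {"looking_for_support", "goal_oriented"},
    "fun_shopping":        {"fun_shopping", "looking_around"},
}

def is_valid_sequence(labels):
    """True if every consecutive pair of per-segment labels is an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(labels, labels[1:]))
```

Such a model lets an observed label sequence be checked for consistency, e.g. `is_valid_sequence(["looking_around", "goal_oriented"])`.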

During a shopping trip, the behavior can remain constant or change according to the different situations or products encountered. Any behavioral type can change over time; therefore, we model the behavior segment by segment. At the end of the shopping trip we draw a conclusion using a reasoning model. By segment, we mean the transition from one ROI to another and the behavior displayed by the customer while present in that ROI. While modeling the shopping area we considered the following representative ROIs: entrance/exit, passing areas, products, pay desk, and resting areas; for a clothes shop, mirror and fitting-room ROIs are also present (see Fig. 2.3a). The proposed representation of the relevant ROIs might seem too fine-grained, but it should be noted that the analysis of the customer's actions is performed only when he/she is stationary in a specific ROI. A diagram of the possible transitions from one ROI to the others is depicted in Fig. 2.3b.
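A minimal sketch of the segment extraction step could look as follows; the rectangular ROI coordinates are hypothetical placeholders for a calibrated shop layout:

```python
# Hypothetical axis-aligned ROIs (x_min, y_min, x_max, y_max) in
# ground-plane units; a real deployment would derive these from the floor plan.
ROIS = {
    "entrance": (0.0, 0.0, 2.0, 2.0),
    "products": (2.0, 0.0, 6.0, 4.0),
    "pay_desk": (6.0, 0.0, 8.0, 2.0),
}

def locate(point):
    """Return the ROI containing the point, or 'passing' if none does."""
    x, y = point
    for name, (x0, y0, x1, y1) in ROIS.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return "passing"

def segment(trajectory):
    """Collapse a sequence of trajectory points into (roi, n_frames) segments."""
    segments = []
    for p in trajectory:
        roi = locate(p)
        if segments and segments[-1][0] == roi:
            segments[-1] = (roi, segments[-1][1] + 1)
        else:
            segments.append((roi, 1))
    return segments
```

Each resulting segment can then be labeled with the behavior displayed while the customer was in that ROI.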

We present in the next section our approach towards building a computational framework for assessing shopping behavior, based on different behavioral cues.


Figure 2.3: (a) Segmentation of the shopping area into Regions of Interest (ROIs). (b) Transitions between different ROIs.

2.4 Computational Framework

In order to answer the main research question addressed in this paper (How to design an automatic system for shopping behavior assessment?), we propose a modular approach and describe next the functionality of each module. A flowchart of the proposed framework was presented in Section 2.1 (Fig. 2.1).

The framework is organized on several levels of abstraction. At the low level, different types of video cameras are employed, contributing to the efficient gathering of information. A fish-eye camera, mounted on the ceiling, captures the whole scene and is used for people tracking. It has the disadvantage of distorting the image, especially at the borders, a region which is very relevant in our case, as it corresponds to the product areas. Therefore, additional cameras are installed in the product ROIs, facilitating recognition of the customers' actions.

The intermediary level includes the trajectory analysis and the action recognition modules. The position of a customer is analyzed continuously to detect his/her presence in the defined ROIs. Every time a customer is stationary in a products ROI, an event is triggered to start the action recognition module, enabling efficient data processing; the two types of cameras are synchronized, recording the video feed at twenty frames per second (fps).
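The stationarity trigger can be sketched as a simple dwell test over the last few seconds of the trajectory; the window length and displacement threshold below are illustrative assumptions, not values prescribed by our system:

```python
FPS = 20  # both cameras record at twenty frames per second

def is_stationary(points, window_s=2.0, max_disp=0.5):
    """Illustrative event trigger: True when the customer has moved less
    than `max_disp` (ground-plane units) over the last `window_s` seconds.
    Both thresholds are hypothetical placeholders."""
    n = int(window_s * FPS)
    if len(points) < n:
        return False
    (x0, y0), (x1, y1) = points[-n], points[-1]
    return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < max_disp
```

When the test fires inside a products ROI, the action recognition module is started on the corresponding product-camera feed.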


Finally, the high-semantic level is responsible for combining the intermediary outputs with the context-related features and drawing a conclusion regarding the customer's shopping behavioral type, by means of the reasoning module. Next, we present each module in more detail.
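As a toy stand-in for the reasoning module, the combination step can be illustrated by a majority vote over the per-segment behavior labels, with an assistance alarm for the two support-needing types; the actual reasoning model used in this work is more elaborate:

```python
from collections import Counter

# Behavior types that should raise an assistance alarm (Section 2.3).
ALARM_TYPES = {"disoriented", "looking_for_support"}

def assess(segment_labels):
    """Return (overall_behavior, alarm): the most frequent per-segment label,
    and True if any segment displayed a support-needing behavior.
    A simplified illustration, not the thesis's reasoning model."""
    overall = Counter(segment_labels).most_common(1)[0][0]
    alarm = any(label in ALARM_TYPES for label in segment_labels)
    return overall, alarm
```

For example, `assess(["goal_oriented", "goal_oriented", "disoriented"])` classifies the trip as goal-oriented overall while still raising the alarm.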

2.4.1 People Tracking

Our purpose was to find a reliable people tracker capable of coping with the constraints imposed by our data. Mean shift [Comaniciu et al., 2000] and the more recent Predator [Kalal et al., 2010] algorithms were both considered. The first algorithm, described in [Popa et al., 2010], uses color histograms and the Bhattacharyya distance, while the second is built on the Lucas-Kanade tracker and provides long-term tracking by employing a P-N learning algorithm. Both methods require an initialization phase in which the properties of the object to be tracked are computed. We reduce the manual intervention of the user by incorporating context properties, namely that every customer enters the shop through a specific ROI. When a person is detected in the entrance ROI, the tracking algorithm is started. People detection is realized using the algorithm presented in [Laptev, 2006]. An example of the tracking results is depicted in Fig. 2.4.
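The histogram similarity at the core of the mean-shift tracker can be illustrated in a few lines; the sketch assumes the color histograms are already normalized to sum to one:

```python
import math

def bhattacharyya_coefficient(p, q):
    """Similarity between two normalized histograms p and q (each summing to 1);
    1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def bhattacharyya_distance(p, q):
    """Distance form used for matching candidate regions: d = sqrt(1 - coefficient)."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))
```

The tracker scores candidate windows around the previous position by this distance to the target's color-histogram model and shifts towards the best match.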

2.4.2 Trajectory Analysis

The output of the tracking module consists of trajectories, which are further analyzed in order to distinguish between the different types of shoppers. Each trajectory point (x, y) obtained in image coordinates needs to be mapped to ground-plane coordinates, in order to compensate for the distortion introduced by the wide-angle fish-eye camera. While more advanced methods are available, we found this first-order approximation sufficient, and we compute the normalized image coordinates $(\bar{x}_i, \bar{y}_i)$ as follows:

$$\bar{x}_i = \frac{x_i - x_0}{Width_I/2} \quad \text{and} \quad \bar{y}_i = \frac{y_i - y_0}{Height_I/2}, \qquad (2.1)$$

in which $x_i$ and $y_i$ are the input image coordinates, $Height_I$ and $Width_I$ the image height and width, and $x_0$ and $y_0$ the coordinates of the centre of the image, adjusted for a possible principal point shift $p_x$ and $p_y$:

$$x_0 = \frac{Width_I}{2} + p_x \quad \text{and} \quad y_0 = \frac{Height_I}{2} + p_y. \qquad (2.2)$$
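Equations (2.1) and (2.2) translate directly into code; the function below is a straightforward sketch, with a zero principal point shift assumed by default:

```python
def normalize(xi, yi, width, height, px=0.0, py=0.0):
    """Eqs. (2.1)-(2.2): map image coordinates (xi, yi) to normalized
    coordinates centred on the (possibly shifted) principal point."""
    x0 = width / 2 + px   # eq. (2.2)
    y0 = height / 2 + py
    return (xi - x0) / (width / 2), (yi - y0) / (height / 2)  # eq. (2.1)
```

For a 1280 x 720 image with no principal point shift, the image centre maps to (0, 0) and the corners to (+/-1, +/-1).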

Next, trajectory features are computed using the normalized coordinates. Surveillance applications [Moris and Trivedi, 2008] are usually based on feature sets $f_t = [x_n, y_n, x'_n, y'_n, x''_n, y''_n]$, described by position $(x_n, y_n)$, velocity $(x'_n, y'_n)$, and acceleration $(x''_n, y''_n)$, which were considered as a starting point.

Another characteristic of trajectories which could reveal customers' shopping intentions is the trajectory orientation, which can be described by including curvature-related features. The curvature k of a trajectory was considered due to its properties such as invariance under planar rotation and translation of the
