User Intent in Online Video Search


CHRISTOPH KOFLER


Over recent years, user expectations of the ability of video search engines have risen significantly. Users expect video search engines to be useful as an instrument that facilitates communication, education, entertainment and problem solving and, in relation to this, to satisfy diverse information needs. A user's information need is the lack that a user is attempting to overcome by engaging in information seeking behavior and can be seen as having two important dimensions: it comprises both a ‘what’ dimension reflecting the topic of the search and a ‘why’ dimension corresponding to the user intent, the immediate reason, purpose or goal behind the information need. Video search engines are relatively successful at returning search results that users find to be on topic. These results do not, however, completely satisfy the users' information needs unless they also fulfill the users' intents.

The purpose of this thesis is to enable the intent-related focus shift in the design and realization of video search engines and to advance them in terms of user intent in order to satisfy users' information needs to their full extent. This advancement is challenging because it affects the entire pipeline of the video search engine: video indexing, query processing, and search results ranking. However, it also has the potential to substantially improve the overall utility of video search engines and increase the impact, significance and economic value of the online video content.

INVITATION

You are cordially invited to attend the public defense of the Ph.D. thesis entitled:

User Intent in Online Video Search

On Thursday, November 12, 2015 at 12:30 in the afternoon in the Senaatszaal of the Auditorium of Delft University of Technology, Mekelweg 5, Delft, The Netherlands.

Prior to the defense, at 12:00 noon, there will be a brief introduction to the research topic.

A reception will be held after the defense.

Christoph Kofler
392 Central Park West, Apt. 4C, New York, NY 10025, USA
christoph.kofler@gmail.com


USER INTENT IN ONLINE VIDEO SEARCH

Dissertation

for the purpose of obtaining the degree of doctor at Delft University of Technology,
by the authority of the Rector Magnificus prof. ir. K. C. A. M. Luyben, chair of the Board for Doctorates,
to be defended publicly on Thursday, 12 November 2015 at 12:30

by

Christoph KOFLER

Diplom-Ingenieur der Informatik, Alpen-Adria-Universität Klagenfurt, Austria


Promotor: Prof. dr. A. Hanjalic
Copromotor: Dr. M. A. Larson

Composition of the doctoral committee:

Rector Magnificus, chairman

Prof. dr. A. Hanjalic, Delft University of Technology

Dr. M. A. Larson, Delft University of Technology

Independent members:

Prof. dr. C. Griwodz, University of Oslo, Simula Research Laboratory, Oslo, Norway

Prof. dr. F. G. B. De Natale, University of Trento, Italy

Prof. dr. F. M. G. de Jong, University of Twente, Erasmus University Rotterdam

Prof. dr. ir. H. J. Sips, Delft University of Technology

Dr. M. S. Lew, Leiden University

Prof. dr. ir. M. J. T. Reinders, Delft University of Technology, reserve member

Keywords: Multimedia information retrieval, User intent, Video search, Video indexing, Retrieval algorithms, Crowdsourcing, Social-Web mining

Printed by: Gildeprint

Front & Back: Photo “Serra de Sintra, Portugal” (https://www.flickr.com/photos/joaoferraosantos/10003026715/) by João Ferrão dos Santos (https://www.flickr.com/people/joaoferraosantos/) is licensed under CC BY-NC 2.0 (https://creativecommons.org/licenses/by-nc/2.0/legalcode)

An electronic version of this dissertation is available at


CONTENTS

Summary
Samenvatting

1 Introduction
  1.1 Video Search
  1.2 User Intent in Video Search
    1.2.1 Importance of User Intent
    1.2.2 Fundamental Challenges of User Intent in Video Search
    1.2.3 Delimiting the Definition of ‘User Intent’
  1.3 Thesis Contribution
    1.3.1 Outline
    1.3.2 Full List of Publications

2 Failing Queries in Video Search
  2.1 Introduction
  2.2 Rationale and Contribution
  2.3 Related Work
    2.3.1 Search Failure and Query Failure
    2.3.2 Query Performance Prediction
    2.3.3 Transaction Log Analysis
  2.4 Transaction Log Analysis of Failed Queries
    2.4.1 Transaction Log
    2.4.2 Query Failure Observations
  2.5 Approach
    2.5.1 User Indicators
    2.5.2 Engine Indicators
    2.5.3 Model Training and Prediction
  2.6 Experiments
    2.6.1 Experimental Setup
    2.6.2 Performance Evaluation of Baseline Systems
    2.6.3 Performance Evaluation of User Indicators
    2.6.4 Performance Evaluation of Engine Indicators
    2.6.5 Performance Evaluation of User and Engine Indicators
    2.6.6 Performance Evaluation for Different Query Properties
  2.7 Analysis and Discussion
    2.7.1 Performance of Individual Queries
    2.7.2 Influence of Semantic Visual Concepts
    2.7.3 Influence of Query Characteristics
  2.8 Conclusion and Outlook

3 Search Intent in Online Video Search Engines
  3.1 Introduction
  3.4 Intent Classes in Video Search
    3.4.1 Social-Web Mining for Intent Discovery
    3.4.2 Categories of User Intent in Video Search
  3.5 Intent Indexing
    3.5.1 Learning Approach and Features
    3.5.3 Experimental Results
  3.6 Intent in Online Video Search
    3.6.1 Crowdsourcing User Study

4 Uploader Intent for Online Video
  4.1 Introduction
    4.3.1 User Intent
    4.3.2 Video Understanding
    4.3.3 Online Video
  4.4 Video Uploader Intent Typology
    4.4.1 Social-Web Mining for Video Uploader Intent Discovery
    4.4.2 Crowdsourcing for Video Uploader Intent Discovery
    4.4.3 Dataset for Video Uploader Intent
    4.4.4 Finalizing the Typology
  4.5 Multimodal Video Uploader Intent Features
    4.5.1 Feature Extraction
    4.5.2 Feature Analysis
  4.6 Video Uploader Intent Prediction
    4.6.3 Experimental Analysis
  4.7 Applications of Video Uploader Intent
    4.7.1 Reach: Video Uploader Intent and Target Audience
    4.7.2 Impact: Video Uploader Intent and Popularity
    4.7.3 Search: Video Uploader Intent and Query Search Intent

5 Intent-Aware Video Search Result Optimization
  5.1 Introduction
    5.3.1 Search Results List Optimization
    5.3.2 Intent-Aware Search
    5.3.3 Crowdsourcing for Relevance Evaluation
  5.4 User Intent in Video Search Results Lists
  5.5 Approach
    5.5.1 Intent Response Classification
    5.5.2 Intent-Aware Results Lists Optimization
    5.5.3 Intent-Aware Visual Reranking
  5.6 Experiments
    5.6.2 Intent Response Classification
    5.6.3 General Results Optimization Performance
    5.6.4 Intent Satisfaction Performance
  5.7 Analysis and Discussion
    5.7.1 Intent-Aware Results Lists Optimization
    5.7.2 Individual Query Performance Evaluation

6 Key Challenges Moving Forward
  6.1 Expanding and Consolidating Conceptual Models of User Intent
  6.2 Intent-Aware Indexing
  6.3 Expression of User Intent & Intent-Aware Query Processing
  6.4 Intent-Aware Search Results Ranking
  6.5 Intent-Aware Search Results Explanation
  6.6 Evaluating Intent-Aware Multimedia Information Retrieval

Bibliography

Acknowledgments


SUMMARY

Over recent years, user expectations of the ability of video search engines have risen significantly. Users expect video search engines to be useful as an instrument that facilitates communication, education, entertainment and problem solving and, in relation to this, to satisfy diverse information needs. A user’s information need is the lack that a user is attempting to overcome by engaging in information seeking behavior and can be seen as having two important dimensions: it comprises both a ‘what’ dimension reflecting the topic of the search and a ‘why’ dimension corresponding to the user intent, the immediate reason, purpose or goal behind the information need. Video search engines are relatively successful at returning search results that users find to be on topic. These results do not, however, completely satisfy the users’ information needs unless they also fulfill the users’ intents.

The purpose of this thesis is to enable the intent-related focus shift in the design and realization of video search engines and to advance them in terms of user intent in order to satisfy users’ information needs to their full extent. This advancement is challenging because it affects the entire pipeline of the video search engine: video indexing, query processing, and search results ranking. However, it also has the potential to substantially improve the overall utility of video search engines and increase the impact, significance and economic value of the online video content.

We start to tackle this challenge by analyzing a real-world transaction log produced by a state-of-the-art video search engine with the objective of obtaining a deeper understanding of why queries submitted by users in their search sessions fail. Based on the results of this analysis, we build classifiers to automatically predict these reasons for query failure given a set of multimodal features derived from both the user interactions with the search engine and the search results produced by the engine.

Our analysis of the transaction log reveals several distinct reasons why user queries in video search fail. One of these reasons is the way user goals are expressed in the query, i.e., a single query can correspond to different underlying goals. In other words, intent is often not explicitly reflected in the query. This fact motivates us to tackle this challenge and to investigate the usefulness of incorporating user intent in video search engines.

As a first step, we investigate the nature of the immediate reason, purpose or goal behind a user information need that constitutes intent. We carry out a social-Web mining approach combined with crowdsourcing and a manual coding process in order to derive a conceptual model (i.e., a typology constituting search intent categories) covering different reasons why users consult video search engines. This typology forms the basis for integrating user intent in video search engines. We then provide evidence that users differentiate videos in search engine results lists on the basis of these user intent categories.


In addition to understanding which search intents exist in video search, it is equally important to understand which intents are associated with videos. This understanding is crucial, as it forms the foundation for matching the search intents expressed by users in search scenarios to the intents that are associated with videos stored in the video search engine’s index. While in search scenarios intent can be characterized by the different reasons why users consult a video search engine, comparable user actions can be investigated for why videos were added to the search engine’s index in the first place. For this reason, we investigate the user action of uploading videos to the Internet and apply a combination of social-Web mining and crowdsourcing to arrive at a conceptual model (i.e., a typology constituting uploader intent categories) that characterizes the various reasons why users upload videos to the Internet. We then build algorithms that automatically classify videos into these categories. Finally, we demonstrate that uploader intent categories correlate with search intent categories, which provides the opportunity for incorporating intent into the retrieval functions of video search engines.

With search intent categories and uploader intent categories and their automatic prediction at hand, we face the challenging task of introducing user intent into search results ranking so that the resulting video search results lists optimally reflect user intent. We propose an intent-aware video search result optimization approach that exploits the structure of topically relevant initial results lists produced by the search engine in response to user-submitted queries in order to predict which search intent(s) the user would most likely wish to satisfy. Based on this information, the approach optimizes the initial lists in a way that search results with the highest potential to satisfy the users’ search intent(s) are positioned at the very top of the list without decreasing its topical focus.

Finally, although this thesis contributes a substantial amount of research towards user intent-aware video search engines, we believe that additional challenges will emerge in the future that will go above and beyond the challenges addressed in this thesis. We identify and discuss these challenges and expect them to attract significant research efforts that will lead to productive outcomes in the field of user intent-aware video search engines in the following years.


SAMENVATTING (SUMMARY)

Users’ expectations of what video search engines can do have risen considerably in recent years. Users expect video search engines to serve as an instrument that helps them with communication, education, entertainment and problem solving and, related to this, to satisfy diverse information needs. A user’s information need is a lack that he or she tries to overcome by engaging in information-seeking behavior, and it can be seen as consisting of two dimensions: it comprises both a ‘what’ dimension that reflects the topic of the search and a ‘why’ dimension that corresponds to the user intent, the immediate reason or goal behind the information need. Video search engines are fairly successful at returning search results that users consider relevant in terms of topic. However, these results do not fully satisfy a user’s information need unless they also fulfill the user intent.

The goal of this thesis is to enable the shift of focus toward intent in the design and realization of video search engines and to develop video search engines further with respect to user intent so that users’ information needs can be satisfied to the fullest extent. This development is a challenge because it touches the entire pipeline of the video search engine: the indexing of videos, the processing of queries and the ranking of search results. Nevertheless, it has the potential to considerably improve the usefulness of video search engines and to increase the impact, significance and economic value of online video content.

To tackle this challenge, we begin with an analysis of a real-world transaction log from a modern video search engine. The aim of the analysis is to gain a better understanding of why queries that users submit during their search sessions fail. Based on the insight gained from this analysis, we build classifiers that automatically predict the reason for failed queries using multimodal features derived from both the user interactions with the search engine and the search results produced by the search engine.

Our analysis of the transaction log shows that there are several distinct reasons why user queries fail. One of the reasons is the way in which a user’s goal is expressed in the query. For example, a single query can correspond to several underlying goals. In other words, the query often does not reflect the intent explicitly. This fact motivates us to tackle this challenge and to investigate the usefulness of incorporating user intent in video search engines.

As a first step, we investigate the immediate reason or goal behind a user’s information need that constitutes intent. Our approach consists of gathering information on the social Web, combined with crowdsourcing and a manual coding process, in order to derive a conceptual model (i.e., a typology consisting of search intent categories) that covers different reasons why users consult video search engines. This typology forms the basis for integrating user intent into video search engines. We then provide evidence that users distinguish between videos in search results lists on the basis of the user intent categories from this typology.

In addition to understanding which search intents exist for video search, it is just as important to understand which intents are associated with videos. This understanding is crucial, as it forms the basis for matching the search intents expressed by users in search scenarios to the intents associated with the videos in the search engine’s index. Just as intents in search scenarios can be characterized by the different reasons why users consult a search engine, user actions can be examined in a comparable way to determine why videos were added to the search engine’s index in the first place. For this reason, we investigate the user action of uploading videos to the Internet, gathering information from the social Web and combining it with crowdsourcing to arrive at a conceptual model (i.e., a typology consisting of uploader intent categories) that characterizes the various reasons why users upload videos to the Internet. We then develop algorithms that can automatically classify videos as belonging to one of these categories. Finally, we show that uploader intent categories are correlated with search intent categories, which provides the opportunity to integrate intent into the retrieval functions of video search engines.

With search intent categories and uploader intent categories at our disposal, together with the ability to predict them automatically, we arrive at the challenge of introducing user intent into the ranking of search results such that the search results lists ultimately produced optimally reflect the user’s intent. As our approach, we propose an intent-aware optimization of video search results. This approach exploits the structure present in the initial, topically relevant results list produced by the search engine in response to a user-submitted query in order to predict which search intent(s) the user most likely wants to satisfy. Based on this information, our approach optimizes the original results list in such a way that the search results with the greatest potential to satisfy the users’ search intent(s) are placed at the very top of the list without any loss of topical focus.

Finally, although this thesis makes a considerable contribution in terms of scientific research in the area of intent-aware video search engines, we believe that additional challenges will arise in the future that reach even further and deeper than the challenges addressed in this thesis. We identify these challenges and discuss them. We expect that they will attract considerable research effort, which will lead to productive results in the area of user intent-aware video search engines in the coming years.


1 INTRODUCTION

How people interact with and use video on the Internet has undergone a dramatic change over the past decade. The challenge of Internet video was in the beginning largely technical and most of the related effort was devoted to developing solutions for capturing, editing, uploading and serving videos online. For example, Burgess and Green [17] cite the main innovation of YouTube¹, at its moment of founding in 2005, to be technological in nature: removing the barriers to widespread video sharing. At the time, users expected little of online video; it was enough to be entertaining. They were satisfied with simple mechanisms (e.g., browsing) enabling them to flip through the available online content, in terms of recency or simple similarity.

The picture today is radically different. The technical challenges have been addressed, if not completely solved: video recording equipment is widely available to users, storage capacity is plentiful, and technology has made content upload and online playback almost effortless. However, as online video has come of age, user expectations have also risen. Those who originally expected online video to be only entertaining have increasingly turned into users expecting online video to be useful as an instrument capable of facilitating communication, education and problem solving.

This evolution can be characterized as a rapidly increasing expectation that video search engines are able to satisfy a wide spectrum of different information needs. An information need is the lack that a user is attempting to overcome by engaging in information seeking behavior [131, 16]. It is the abstract, difficult-to-express gap that the user is attempting to fill by using a video search engine and formulating search engine queries. The wide variety in today’s video information needs is illustrated by Figure 1.1, which contains a list of example questions expressing video information needs that were posted to the popular question-answering platform Yahoo! Answers² [57]. As information needs continue to evolve, this variety continues to increase. Looking forward, we can expect user expectations to go on rising, and online video search engines need to advance in order to satisfy these needs to their full extent.

‘I need to find a good video that explains how to do a cross knot bracelet.’
‘Where can I find a video of how to make an almond cake? I need the recipe and like a video to literally teach me.’
‘Where can I find a video of Flavor Flav laughing? My friend said I laugh like him and I just wanted to hear it.’
‘Do you know how I can find a video review of BMW X5? I am changing my car into a SUV and I want to buy a BMW X5 and I need a decent video review about this car.’

Figure 1.1: Examples of real-world user information needs in video search [57].

This chapter consists of material adopted from: Christoph Kofler, Martha Larson, and Alan Hanjalic: User Intent in Multimedia Search: A Survey of the State of the Art and Future Challenges. ACM Computing Surveys (Full paper), under review [98].

¹ http://www.youtube.com/
² http://answers.yahoo.com/

Figure 1.2 depicts the user search process and illustrates the relationship of information needs to queries. Information needs arise in the context of a real-world task that users desire to solve. This task involves user activity that is usually completely independent of video search. To understand the real-world nature of the user task, notice that the user could potentially also carry out the task without video search. In the example in Figure 1.2, the user’s task is to write an article on the meaning of koi fish in Japan, and this could also be accomplished by reading books, talking to experts, or traveling to Japan to observe.

Task: Write an article that summarizes the meaning of koi fish in Japan's culture.
Information Need: Find videos that provide information about koi fish
Query: koi fish information

Figure 1.2: A real-world task is the trigger for users to consult a video search engine with a set of information needs, which are verbalized and submitted to the video search engine in the form of (textual) queries.

Once users decide to consult a video search engine given their initial real-world task, they approach the search engine with one or more concrete information needs. Information needs express a desire for information that is both about a topic (e.g., koi fish) and that also fits the reason why the user performs the search (e.g., to obtain information). Users’ information needs exist in abstract form in the minds of users. Given the standard query-based interface of today’s search engines, the needs are then verbalized in the form of queries (e.g., koi fish information). The video search engine responds to the queries with a list of results. The results are considered to be relevant if the user’s information need is fulfilled and the user can move forward towards successful completion of the task.

[Figure 1.3 components. Index Generation (Offline): WWW, Crawling, Indexing, Indexed Videos. Search (Online): Information Need, Query, Query Processing, Search Results Ranking, Ranked Search Results List.]

Figure 1.3: High-level overview of a video search engine: In the offline index generation step, videos that are either crawled from the Internet or manually added by users are processed and effectively described for scalable and efficient access. In the online search step, query processing prepares the user-submitted query in a way that it is compatible with the data that has been indexed in the offline processing step. The query serves as input to the search results ranking component, which compares the query to the videos in the index and produces a ranked list of search results that is returned to the user.

1.1 VIDEO SEARCH

As depicted in Figure 1.3, a video search engine operates through two main steps, the offline index generation step and the online search step. The main functional components characterizing these steps are indexing, query processing and search results ranking. Since these components provide the context of the contribution made in this thesis, we briefly explain them in turn.

Indexing: The main purpose of this component is to effectively store and describe videos that have either been automatically collected (i.e., discovered) by the search engine from the Internet or that have been manually added to the system by users (e.g., user-uploaded videos on YouTube) in a way that they are organized for scalable and efficient access. For this purpose, video files are paired, in general, with different types of metadata (e.g., titles, tags, description). Next to the textual metadata, which is typically described using tf-idf [131] or comparable representations, the actual content of videos is indexed using a variety of local (e.g., SIFT [126] combined with bag-of-visual-words [165]), global (e.g., color histograms, edge histograms etc. [127]), and semantic (e.g., [169], [13]) representations.
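To make the textual side of indexing concrete, the following minimal sketch builds a tf-idf index over video metadata and ranks videos against a query by cosine similarity. It is an illustration only, not the indexing scheme of any particular search engine: the function names (build_tfidf_index, search), the whitespace tokenizer, the smoothed idf formula and the three sample videos are all assumptions made for this example, and the visual representations mentioned above are left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_tfidf_index(docs):
    """Map each video id to a sparse tf-idf vector over its metadata text."""
    n_docs = len(docs)
    tokens = {vid: tokenize(text) for vid, text in docs.items()}
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))
    # Assumed idf variant: log(N / df) + 1, so no term gets zero weight.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = {}
    for vid, toks in tokens.items():
        counts = Counter(toks)
        vectors[vid] = {t: (c / len(toks)) * idf[t] for t, c in counts.items()}
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, idf):
    """Return ids of videos with non-zero similarity, best match first."""
    counts = Counter(tokenize(query))
    q_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), vid) for vid, vec in vectors.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [vid for score, vid in scored if score > 0.0]


# Invented metadata for three videos.
videos = {
    "v1": "koi pond tutorial build your own koi pond",
    "v2": "relaxing koi fish swimming",
    "v3": "almond cake recipe tutorial",
}
vectors, idf = build_tfidf_index(videos)
ranking = search("koi pond", vectors, idf)
print(ranking)
```

The sketch matches on topic only: a query such as "koi pond" retrieves every video whose metadata shares terms with it, regardless of why the user is searching.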

Query Processing: Once users have formulated a query given their information need, the main purpose of this component is to prepare this query in a way that it is compatible with the data that has been indexed in the offline processing step. Additionally, this component is responsible for optimizing the query by (i) suggesting similar queries to the user that might better reflect the user’s information need (e.g., [211]); (ii) classifying the query into predefined classes for which different ranking strategies are optimized (e.g., [89]); and (iii) expanding the initial query such that the ranking component is able to return videos that are maximally relevant to the user (e.g., [42]).

Search Results Ranking: Once the query is prepared, it serves as input to the search results ranking component, whose main purpose is to compare and match the query representation with the representation of the (billions of) videos stored in the search engine’s index. How well a particular video matches the given query constitutes the basis for estimating the relevance between the query and the video. Relevance is used to produce a ranked search results list, ordered by decreasing relevance, as response to the query. In addition to producing a relevance score for a video, this component is also responsible for optimizing an initially or intermediately produced search results list. Here, optimization is typically carried out by either performing reranking (i.e., increasing the homogeneity of the top ranks of results lists in terms of a particular property as, for example, proposed in [205], [206], [65], [89]) or diversification (i.e., increasing the heterogeneity of the top ranks of results lists in terms of a particular property as, for example, proposed in [62], [184], [186]) of initial results lists produced by a video search engine.
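The contrast between reranking and diversification can be illustrated with a toy sketch. It is deliberately simplistic and does not reproduce any of the cited methods: the two functions, the "category" property and the five sample results are hypothetical. The rerank function promotes results that share the dominant property value among the top ranks (more homogeneity), while diversify interleaves property values round-robin (more heterogeneity); both preserve the original relative order within each property value.

```python
from collections import Counter, deque


def rerank(results, prop, top_k=3):
    """Move results sharing the dominant property value of the top_k to the front."""
    if not results:
        return []
    dominant = Counter(prop[r] for r in results[:top_k]).most_common(1)[0][0]
    same = [r for r in results if prop[r] == dominant]
    other = [r for r in results if prop[r] != dominant]
    return same + other


def diversify(results, prop):
    """Interleave property values round-robin, keeping order within each value."""
    buckets = {}
    for r in results:
        buckets.setdefault(prop[r], []).append(r)
    queues = [deque(b) for b in buckets.values()]
    out = []
    while queues:
        for q in queues:
            out.append(q.popleft())
        queues = [q for q in queues if q]
    return out


# Invented initial list, ordered by decreasing relevance, with one property per result.
initial = ["v1", "v2", "v3", "v4", "v5"]
category = {"v1": "music", "v2": "music", "v3": "sports", "v4": "music", "v5": "sports"}

print(rerank(initial, category))
print(diversify(initial, category))
```

Which of the two is appropriate depends on the property: homogeneity helps when the property reflects what the user is after, heterogeneity helps when it is unknown which value the user wants.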

1.2 USER INTENT IN VIDEO SEARCH

As discussed above, the ultimate goal of a video search engine is to satisfy users’ information needs to their full extent. The conceptualization of the search process in Figure 1.2 illustrates why it is important to break down user information needs into two dimensions. The first dimension is the ‘what’ dimension, which reflects the topic of the search. The second dimension is the ‘why’ dimension, and ties the need of the user directly to the task that the user is attempting to carry out. We define the ‘why’ dimension as the user intent, which corresponds to the immediate reason, purpose or goal behind a user’s information need. This follows the dictionary definition of intent, that is, ‘the thing that [the user] plans to do or achieve; an aim or purpose’³. Since the ‘aim or purpose’ here is linked to the specific search activity of the user, we will generally refer to user intent in this case as the user search intent (or search intent), particularly to be able to distinguish it from other types of intent that will be addressed in this thesis and that also need to be taken into account when optimizing video search engines in light of user intent.

1.2.1 IMPORTANCE OF USER INTENT

Until now, the field of multimedia information retrieval has invested a great deal of effort in developing algorithms for analyzing video in terms of its topic. Video search engines are for this reason relatively successful at returning search results that users find to be on topic (cf. [169], [134]). Although these techniques are clearly critical and contribute to providing users with relevant search results, these results do not completely satisfy users’ information needs to their full extent and might not be truly useful to users unless they also satisfy the users’ intents.

User intent was first introduced by Broder [15] in the field of conventional text-based Web search. Broder’s work demonstrated that users’ information needs are not all related to acquiring information, but rather span a much wider spectrum of intent types. Rose and Levinson [153] and Jones and Klinkner [87] make a similar observation and characterize intent as the underlying goal behind a Web search. Similar observations have been made in image search [129] and video search [57], [110].

Baeza-Yates et al. [7] study the intention of queries decomposed into user goals (corresponding to ‘why’ users search) and topic categories (corresponding to ‘what’ users are searching for). Here, we note that understanding of the treatment of intent in this thesis and in the literature overall requires close attention to terminology. Although ‘intent’ and ‘intention’ are understood to mean ‘aim or purpose’, there is a subtle, but important, difference of meaning between the two⁴. An ‘intention’ is something that one sets about to do, but perhaps does not completely succeed in. For this reason, it is the preferred word to describe the information need as a whole: the user has the intention to express the information need as well as possible in the query, but perhaps is not completely successful. ‘Intent’ reflects an underlying mindset, and suggests great deliberation. For this reason, it is the preferred word to describe the underlying goal of the user, i.e., specifically the ‘why’ dimension of the user need. We point out explicitly that under the definition of intent adopted in this thesis, ‘user intent’ corresponds to the ‘user goal’ part of the ‘user intention’ in [7]. In summary, the terms intent and goal can be used synonymously, are part of a user’s information need (sometimes also referred to as intention), and refer to the immediate reason why a person is using a search engine.

The example expressions presented in Figure 1.1 (cf. page 2) provide evidence that user intent has significant, real-world importance to users in video search. Additionally, they reveal the critical difference between the topical and intent components of user information needs. For example, both information needs ‘I need to find a good video that explains how to do a cross knot bracelet.’ and ‘Where can I find a video of how to make an almond cake?’ clearly have similar intent statements, i.e., both users want to use video in order to learn how to do something, but focus on different topics (‘cross knot bracelet’ vs. ‘almond cake’). We do not claim that there is complete orthogonality between the ‘what’ and the ‘why’ dimensions of video information needs. However, treating the two dimensions as independent allows a video search engine more latitude to mix and match topics with intent types. The advantage is a potentially large improvement of the ability of the search engine to cover the full spectrum of user information needs.

Figure 1.4 further illustrates the difference between the ‘what’ and ‘why’ dimensions in an actual video search scenario and emphasizes that several intents might be relevant for one specific topic: A user issues the query koi pond to a video search engine and gets a ranked list of results. All three results in Figure 1.4 are topically relevant to the query, i.e., match the query with respect to the ‘what’ dimension of the information need. However, the results differ in the particular goal they support the user in achieving, and thus in the reason ‘why’ the user would be interested in each one of them. Result 1 is an informational video that provides the user with the opportunity to obtain knowledge, result 2 is a relaxing video that users might search for to change their mood and result 3 is a tutorial video that shows how to build a koi pond.

1.2.2 Fundamental Challenges of User Intent in Video Search

Since in the example in Figure 1.4 the top-three search results already contain videos representing three different user intents, one might assume that users could simply scan the results list and pick the video for the intent they had in mind. However, the scan-and-pick method fails if the search engine does not produce videos fully covering possible user intents in the top of the results list. Further, observing only the example provided in Figure 1.4, one might imagine that users might refine the query to better reflect the intent. However, such query refinement can only take place when the user can consciously differentiate between different intents and also explicitly express the difference in the query, an endeavor that is not easy to achieve for users. While they are relatively skilled in expressing the topical component of their information need in a query, they often fail to express their intent or difference in intents clearly, if they attempt to do so at all [57]. Moreover, for query refinement to be effective, the video collection should be indexed accordingly. Due to a wide variety of possible search intents that are not known at the indexing stage, the latter is unlikely. For these reasons, it is challenging to infer intent from the query, match it to videos stored in the index of a video search engine, and perform search results ranking that is aware of user intent. Predominantly because of the nontrivial challenges mentioned above, the research effort carried out in the multimedia information retrieval community towards incorporating user intent in video search engines has not yet been significant. In other words, the field has yet to turn its attention to developing algorithms sensitive to why users are searching for video. This is what motivated the work reported in this thesis.

[Figure 1.4 image: excerpt of the results list returned for the query koi pond, showing three video results, among them a step-by-step tutorial on building a garden koi pond and a relaxing koi pond video.]

Figure 1.4: Example search query with an excerpt of the produced results list illustrating the difference between the topical and the intent component of a user’s information need: The query koi pond returns search results that all match the topical component of the user’s information need, however, each satisfying different user intents.

The examples provided in Figure 1.1 further illustrate that many information needs are better satisfied with video (or other multimedia content types) than with text. While the fields of conventional text-based Web search and multimedia information retrieval have much in common, it is important not to assume that information needs in the area of video search are identical to those in the case of conventional text-based Web search and that intent-aware algorithms proposed there can simply be applied in video search. Several investigations (e.g., [76], [77], [75], [178]) point out that there exist significant differences between searching for text and video. First, and most obviously, video involves multiple modalities (e.g., video plus audio, video plus text), while text consists of a single modality. Second, and more importantly, the range of information needs in video search appears to be growing broader [29, 54], signifying that algorithms proposed for conventional text-based Web search are not applicable to video search without adaptation and that video search requires a separate, dedicated study, an endeavor that we aim to tackle in this thesis.

1.2.3 Delimiting the Definition of ‘User Intent’

It is important to mention that over the past years, a large number of approaches have been proposed that aim at improving the quality of video search results. Many of these approaches cite the ‘intent’ or the ‘intention’ behind the user query as the target of the improvement, but they do not deal with ‘the immediate reason, purpose or goal behind a user’s information need’. In other words, the way ‘intent’ has been introduced has not been consistent, and also not in the spirit of the intent definition we follow in this thesis, which was justified in Section 1.2.1. Recall from the discussion above that our reason for adopting our definition of intent is its origin in the text information retrieval literature as well as the fact that it allows us to make a clear differentiation between the ‘what’ dimension of a user’s information need and the ‘why’ dimension. Here, we briefly summarize the research efforts that use the term ‘intent’ or ‘intention’ in unexpected or alternate ways, and discuss how they are related to our definition of user intent. We can distinguish three different interpretations:

• Query ambiguity: This literature uses the term ‘intent’ in the sense of ‘information need’ or ‘intention’ that we defined in Section 1.2.1. The goal of this work is to clarify the topical component of the information need expressed in the query. In other words, ‘user intent’ is considered satisfied if the queries submitted by users to video search engines are topically disambiguated, allowing the user to specify the search topic more precisely. We can distinguish here three major classes of approaches aiming at topically disambiguating the initially submitted query and relating such disambiguation to satisfying user intent. These classes include query suggestion (e.g., [211, 9, 120, 42]), search results diversification (e.g., [62, 185, 184, 186]), and query and ranking refinement (e.g., [48, 33, 212, 67]). While we emphasize that these approaches are valuable and critically necessary to support users in their search endeavors, they do not consider why users turn to search engines in order to satisfy their information needs. Instead, they help users to discover query aspects [158, 144] and to obtain a broader picture of the topics related to their initial query.

• Clarity of user goals: This literature examines the degree to which the user information need is well focused. It differentiates users who are ‘just browsing’ from those who have highly specific goals. In other words, ‘user intent’ is interpreted as the clarity, degree or specificity of user goals (e.g., [36, 30, 106]). Although a categorization of the specificities of user goals is important as a criterion for steering the refinement of a search results list, it is not aligned with the reasons why users would search for videos.

• Personalization: This literature uses the term ‘intent’ in the sense of ‘long-term […] are used to support interpretation of the current information need. While approaches that study the long-term interest of users (i.e., general topics particular users are interested in over a longer period of time) are crucial for personalizing the search results, the discussed approaches do not consider the immediate reason why users consult video search engines.

Investigating these unexpected or alternative interpretations provides us with an understanding that incorporating ‘user intent’ in video search engines is generally not approached from the perspective of why users actually carry out search. The very limited amount of work that has been carried out in the past and that follows the definition of user intent adopted in this thesis will be discussed in the related work sections of chapters throughout this thesis.

1.3 Thesis Contribution

The purpose of this thesis is to enable the intent-related focus shift in the design and realization of video search engines in order to satisfy users’ information needs to their full extent, or in other words, to answer the question ‘Can video search engines be advanced by making them aware of user intent?’

This advancement is challenging due to the issues discussed above as well as because it affects the entire pipeline of the video search engine: video indexing, query processing, and search results ranking. However, it also has the potential to substantially improve the overall utility of video search engines and increase the impact, significance and economic value of the online video content.

1.3.1 Outline

We start by taking a step back and analyzing why user queries submitted to video search engines fail. In Chapter 2, we focus on answering the following research question:

What are the major reasons for failing queries in video search and can these reasons be predicted automatically?

We carry out a qualitative analysis of a transaction log produced by a state-of-the-art video search engine. In particular, we investigate the interactions of users in search sessions in order to derive reasons why user queries fail, and build classifiers to automatically predict these reasons given a set of multimodal features derived from both the user interactions with the search engine and the search results produced by the engine. The analysis of our transaction log reveals several distinct reasons why user queries in video search fail. One of these reasons is the way user goals are expressed in the query, i.e., a single query can correspond to different underlying goals, corresponding to different user decisions to click on a result. For this reason, the main challenge we face in this chapter is the enormous variability in user interaction patterns and search engine responses when automatically predicting whether queries will fail or not. Our automatic prediction approach builds on the well-known concept of query performance prediction (e.g., [213, 60, 69, 32, 59, 208]) introduced in conventional text-based Web search to estimate the query’s retrieval performance, but extends this concept with two novel characteristics: User indicators are derived from our transaction log, capture the patterns of user interactions with the video search engine, and exploit the context in which a particular query was submitted; engine indicators are derived from the search results list and measure the consistency of search results at the level of textual and visual features associated with videos.
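To give an intuition for what an engine indicator of this kind might compute, the following minimal sketch scores the consistency of a results list as the mean pairwise cosine similarity of per-video feature vectors. This is an illustrative assumption only, not the indicator definition used in Chapter 2; the function names and the toy feature values are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def results_consistency(feature_vectors):
    """Mean pairwise cosine similarity over the feature vectors of a results list."""
    pairs = list(combinations(feature_vectors, 2))
    if not pairs:
        return 0.0  # fewer than two results: report 0.0 by convention (assumption)
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy term-count vectors for the top results of two queries (made-up values).
coherent_list = [[3, 1, 0], [2, 1, 0], [4, 2, 0]]
scattered_list = [[3, 0, 0], [0, 2, 0], [0, 0, 5]]

coherent_score = results_consistency(coherent_list)
scattered_score = results_consistency(scattered_list)
print(coherent_score, scattered_score)
```

The same function could be applied to textual or visual feature vectors alike; a low score would then hint that the engine returned a heterogeneous results list for the query.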

The fact that a single query can correspond to different underlying user goals, or in other words, that intent is often not explicitly reflected in the query, and the issues mentioned earlier in this chapter, motivate us to explore algorithms that tackle this challenge and to investigate the usefulness of incorporating user intent in video search engines in order to satisfy user information needs to their full extent.

Prior to incorporating user intent in video search engines, however, we have to understand the nature of the immediate reason, purpose or goal behind a user information need that constitutes intent. In other words, we first have to obtain an understanding of the different reasons why users consult video search engines, and create conceptual models, i.e., typologies, covering these reasons. In the literature, several conceptual models have been proposed that cover different reasons why users search. For example, Broder [15], Rose and Levinson [153], Baeza-Yates et al. [7], and Morrison et al. [140] investigate typologies for traditional text-based Web search; Lux et al. [129] and Fidel [43] study models for image search. Because, as discussed in a more general sense earlier in this chapter, several investigations (e.g., [76], [77], [75], [178]) point out that there exists a significant difference between searching for different modalities and because the range of information needs appears to be different for them, the user intent schemes mentioned above are not applicable to video search without adaptation. For this reason, in Chapter 3, we focus on answering the following research question:

What are the major reasons for users to consult video search engines, i.e., which search intent categories exist that are applicable to video search, and do these categories have the potential to significantly improve video search results from the user’s perspective?

In order to answer this research question, we propose a social-Web mining approach for the discovery of high-level user intent categories in video search, i.e., search intent categories. In particular, we mine real-world user descriptions of video information needs from a popular question-answering platform, apply crowdsourcing to extract intent-related statements from these descriptions, and perform a manual coding process of these intent-related statements in order to arrive at a set of consistent user search intent categories in video search. Then, we demonstrate with a crowdsourcing-based user study that users differentiate videos in search engine results lists on the basis of intent, which provides evidence for the opportunity for video search engines to achieve a closer fit of search results with user information needs by taking user intent into account.

Next to understanding which search intents exist in video search, it is equally important to know which intents are associated with videos. This understanding is crucial, as it builds the foundation for matching the search intents expressed by users in search scenarios to the intents that are associated with videos stored in the search engine’s index. In order to achieve a close fit between the search intent and the representation of videos in the index, the video search engine has to specifically make use of intent-sensitive features to describe videos. One approach to intent-sensitive indexing is to associate videos directly with possible search intents, in the form of inferred category labels. However, this requires inferring in advance the different types of search intent with which a user might search for a video. Search intents are varied and unexpected, so predicting this information at the time that a video is indexed initially appears impossible. While the work presented in Chapter 3 contains a preliminary discussion of this challenge, in Chapter 4 we instead take the position that search intent is not the only sort of intent associated with a video. We can also investigate user intent that is specifically related to the indexing of videos and think of the intent of users who created a video, or uploaded and otherwise shared it. In Chapter 4, we investigate the user action of uploading videos to the Internet by focusing on answering the following research questions:

What are the major reasons for users to upload videos to the Internet, i.e., which uploader intent categories exist that are applicable to video search? Can uploader intents of videos be predicted automatically and does a correlation between search intent and uploader intent exist that can be exploited for intent-aware video search result list optimization?

Similar to the derivation of search intent categories, we apply a combination of social-Web mining and crowdsourcing to arrive at a conceptual model (i.e., a typology constituting uploader intent categories) that characterizes the different reasons why users upload videos to the Internet and that is applicable to a broad range of videos. We then use a set of multimodal features derived from the textual metadata of videos, as well as their visual content, in order to automatically classify videos into these categories. Finally, in a crowdsourcing-based user study we evaluate the correlation between search intents and uploader intents that provides evidence for the usefulness of user intent-aware search results lists.

With search intent categories and uploader intent categories and their automatic prediction at hand, we face the challenging task of introducing user intent in search results rankings that produce video search results lists that optimally reflect user intent. Two main approaches can be considered in order to incorporate user intent in search results rankings. First, the ranking produced by the video search engine can directly take intent into account when performing the query matching and retrieval step. Second, an initial search results list produced by the video search engine can be optimized with respect to user intent. That is, initially produced rankings (that are, for example, on-topic) can be refined using intent-aware features that are extracted both from the query as well as from the videos stored in the search engine’s index. It is critical, however, that for either of these approaches the topical focus of the query is not neglected. In other words, it does not help to optimize the engine’s response regarding intent if the topical focus of the results list is lost. In Chapter 5, we investigate the latter possibility of optimizing topically-relevant initial results lists produced by video search engines in response to user-submitted queries by proposing relevance ranking functions that are user intent-aware. We focus on answering the following research question:

Can video search results lists be optimized from an intent-aware perspective?

Since, as discussed earlier in this thesis, automatically predicting the user’s search intent directly from the submitted query is a challenging and failure-prone task that could potentially decrease information need satisfaction for users in case wrong predictions were carried out, in this chapter we follow a different approach and exploit the structure of the results list itself. We carry out a manual qualitative analysis of topically-relevant results lists and observe that they do contain videos that satisfy the user’s search intent, but that videos with the highest potential for satisfaction are often buried within or scattered over the results list. Our approach exploits this fact and optimizes initial results lists, through search results reranking and diversification, in a way that search results with the highest potential to satisfy the user’s search intent are positioned at the very top of the list without decreasing its topical focus.
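The reranking idea described above, moving results that better match the user’s intent toward the top of a topically relevant list without sacrificing topical focus, can be sketched as follows. This is a minimal illustration and not the ranking function developed in Chapter 5: the linear score combination, the names intent_aware_rerank, intent_weight and top_k, and all scores and video identifiers are assumptions made for the example.

```python
def intent_aware_rerank(results, intent_weight=0.3, top_k=10):
    """Rerank the top_k results of an initial, topically ranked list.

    `results` is a list of dicts with keys 'id', 'topical' and 'intent';
    both scores are assumed to lie in [0, 1]. Only the first top_k entries
    are reordered, so results the engine judged topically weak cannot be
    pulled to the top by their intent score alone.
    """
    head, tail = results[:top_k], results[top_k:]

    def combined(r):
        # Hypothetical linear interpolation of topical and intent scores.
        return (1 - intent_weight) * r["topical"] + intent_weight * r["intent"]

    # sorted() is stable, so ties keep their original (topical) order.
    return sorted(head, key=combined, reverse=True) + tail


# Initial list for the query 'koi pond'; intent scores assume a user who
# wants to learn how to build a pond (all values are made up).
initial = [
    {"id": "koi_documentary", "topical": 0.92, "intent": 0.10},
    {"id": "relaxing_koi_pond", "topical": 0.90, "intent": 0.05},
    {"id": "build_koi_pond_tutorial", "topical": 0.88, "intent": 0.95},
    {"id": "garden_tour", "topical": 0.40, "intent": 0.90},
]

reranked = intent_aware_rerank(initial, intent_weight=0.3, top_k=3)
reranked_ids = [r["id"] for r in reranked]
print(reranked_ids)
```

Restricting the reordering to the head of the list is one simple way to respect the requirement that the topical focus of the results list is not lost; the weakly on-topic garden_tour video stays in last place despite its high intent score.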

Finally, in Chapter 6, we identify and discuss what we see as the key future challenges lying ahead for intent-aware video search engines and multimedia search engines in general. Although this thesis contributes a substantial amount of work to user intent-aware video search engines, we believe that additional challenges will emerge in the future that go above and beyond the challenges addressed in this thesis. These challenges are crucial to facilitate a direction of multimedia information retrieval research in future work aimed at developing algorithms that allow video search engines to maximally respond to user intent.

1.3.2 Full List of Publications

The following papers have been published throughout the years that led up to this thesis. Please note that the content presented in Chapters 2 through 5 is based on original publications. The references to these publications are given below as well as at the beginning of each chapter. As a consequence of working with original publications, the introductory parts of chapters in this thesis may be similar in terms of argumentation and the material they cover.

Journal publications

4. Christoph Kofler, Martha Larson, and Alan Hanjalic: User Intent in Multimedia Search: A Survey of the State of the Art and Future Challenges. ACM Computing Surveys (Full paper), under review [98]. [Chapters 1 & 6]

3. Christoph Kofler, Subhabrata Bhattacharya, Martha Larson, Tao Chen, Alan Hanjalic, and Shih-Fu Chang: Uploader Intent for Online Video: Typology, Inference and Applications. IEEE Transactions on Multimedia (Full paper) (to appear, 2015) [96]. [Chapter 4]

2. Christoph Kofler, Martha Larson, and Alan Hanjalic: Intent-Aware Video Search Result Optimization. IEEE Transactions on Multimedia (Full paper), 16(5): 1421-1433 (2014) [101]. [Chapter 5]

1. Christoph Kofler, Linjun Yang, Martha Larson, Tao Mei, Alan Hanjalic, and Shipeng Li: Predicting Failing Queries in Video Search. IEEE Transactions on Multimedia (Full paper), 16(7): 1973-1985 (2014) [105]. [Chapter 2]

Conference publications

7. Michael Riegler, Martha Larson, Mathias Lux, and Christoph Kofler: How ’How’ Reflects What’s What: Content-based Exploitation of How Users Frame Social Images. In Proceedings of the ACM International Conference on Multimedia (MM ’14) (Brave New Ideas paper). ACM, New York, NY, USA, 397-406 (2014) [150].

6. Sebastian Schmiedeke, Peng Xu, Isabelle Ferrané, Maria Eskevich, Christoph Kofler, Martha A. Larson, Yannick Estève, Lori Lamel, Gareth J. F. Jones, Thomas Sikora: Blip10000: a social video dataset containing SPUG content for tagging and retrieval. In Proceedings of the ACM International Conference on Multimedia Systems (MMSys ’13) (Dataset paper). ACM, New York, NY, USA, 96-101 (2013) [161].

5. Christoph Kofler, Linjun Yang, Martha Larson, Tao Mei, Alan Hanjalic, and Shipeng Li: When Video Search Goes Wrong: Predicting Query Failure Using Search Engine Logs and Visual Search Results. In Proceedings of the ACM International Conference on Multimedia (MM ’12) (Full paper). ACM, New York, NY, USA, 319-328 (2012) [104]. [Chapter 2]

4. Alan Hanjalic, Christoph Kofler, and Martha Larson (alphabetical order): Intent and its Discontents: The User at the Wheel of the Online Video Search Engine. In Proceedings of the ACM International Conference on Multimedia (MM ’12) (Brave New Ideas paper). ACM, New York, NY, USA, 1239-1248 (2012) [57]. [Chapter 3]

3. Martha Larson, Christoph Kofler, and Alan Hanjalic: Reading between the Tags to Predict Real-World Size-Class for Visually Depicted Objects in Images. In Proceedings of the ACM International Conference on Multimedia (MM ’11) (Full paper). ACM, New York, NY, USA, 273-282 (2011) [113].

2. Christoph Kofler, Martha Larson, and Alan Hanjalic: To Seek, Perchance to Fail: Expressions of User Needs in Internet Video Search. In Proceedings of the European Conference on Information Retrieval (ECIR ’11) (Short paper, best paper nomination). Springer-Verlag, 611-616 (2011) [100].

1. Raynor Vliegendhart, Martha Larson, Christoph Kofler, and Johan Pouwelse: A Peer’s-eye View: Network Term Clouds in a Peer-to-peer System. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM ’11) (Short paper).

Workshop & Demo publications and Editorships

8. Svetlana Kordumova, Christoph Kofler, Dennis C. Koelma, Bouke Huurnink, Bauke Freiburg, Joris Kleinveld, Manuel van Rijn, Marco van Deursen, Martha Larson, and Cees G. M. Snoek: SocialZap: Catch-up on Interesting Television Fragments Discovered from Social Media. In Proceedings of International Conference on Multimedia Retrieval (ICMR ’14) (Demo paper).

7. Michael Riegler, Mathias Lux, Christoph Kofler: Frame the Crowd: Global Visual Features Labeling boosted with Crowdsourcing Information. In Proceedings of the MediaEval 2013 Workshop (2013) [151].

6. Sebastian Schmiedeke, Christoph Kofler, Isabelle Ferrané: Overview of the MediaEval 2012

5. Christoph Kofler, Martha Larson, and Alan Hanjalic: Alice’s Worlds of Wonder: Exploiting Tags to Understand Images in terms of Size and Scale. In Proceedings of the ACM International Conference on Multimedia (MM ’11) (Grand Challenge paper). ACM, New York, NY, USA, 643-646 (2011) [99].

4. Christoph Kofler, Luz Caballero, Maria Menendez, Valentina Occhialini, and Martha Larson: Near2Me: An Authentic and Personalized Social Media-based Recommender for Travel Destinations. In Proceedings of the International Workshop on Social Media (WSM ’11) in conjunction with the International Conference on Multimedia (MM ’11). ACM, New York, NY, USA, 47-52 (2011) [97].

3. Raynor Vliegendhart, Martha Larson, Christoph Kofler, Carsten Eickhoff and Johan Pouwelse: Investigating Factors Influencing Crowdsourcing Tasks with High Imaginative Load. In Workshop on Crowdsourcing for Search and Data Mining (CSDM ’11) in conjunction with the ACM International Conference on Web Search and Data Mining (WSDM ’11). ACM, New York, NY, USA (2011) [187].

2. Martha Larson, Maria Eskevich, Roeland Ordelman, Christoph Kofler, Sebastian Schmiedeke, Gareth J. F. Jones: Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In Proceedings of the MediaEval 2011 Workshop. CEUR Workshop Proceedings 807, CEUR-WS.org (2011) [112].

1. Martha Larson, Adam Rae, Claire-Hélène Demarty, Christoph Kofler, Florian Metze, Raphaël Troncy, Vasileios Mezaris, Gareth J. F. Jones: Working Notes Proceedings of the MediaEval

2 Failing Queries in Video Search

The recent increase in the volume and variety of video content available online presents growing challenges for video search. Users face increased difficulty in formulating effective queries and search engines must deploy highly effective algorithms to provide relevant results. Although lately much effort has been invested in optimizing video search engine results, relatively little attention has been given to predicting for which queries results optimization is most useful, i.e., predicting which queries will fail. The ability to predict when a video search query is not likely to deliver satisfying search results is expected to enable more effective search results optimizations and improved search experience for users. In this chapter, we propose a novel context-aware query failure prediction approach that predicts whether a particular query submitted in a user’s search session is likely to fail. The approach builds on the well-known concept of query performance prediction introduced in conventional text-based Web search to estimate the query’s retrieval performance, but extends this concept with two novel characteristics, user indicators and engine indicators. User indicators are derived from transaction logs, capture the patterns of user interactions with the video search engine, and exploit the context in which a particular query was submitted. Engine indicators are derived from the search results list and measure the consistency of search results at the level of textual and visual features associated with videos. Extensive evaluation of the approach on a test set containing 1+ million video search queries shows its effectiveness and demonstrates a significant improvement over traditional and state-of-the-art baseline approaches.

Initial investigations carried out for the work presented in this chapter are published as: Christoph Kofler, Linjun Yang, Martha Larson, Tao Mei, Alan Hanjalic, and Shipeng Li: When Video Search Goes Wrong: Predicting Query Failure Using Search Engine Logs and Visual Search Results. In Proceedings of the ACM International Conference on Multimedia (MM ’12). ACM, New York, NY, USA, 319-328 (2012) [104]. An extension of this work is published as: Christoph Kofler, Linjun Yang, Martha Larson, Tao Mei, Alan Hanjalic, and Shipeng Li: Predicting Failing Queries in Video Search. IEEE Transactions on Multimedia, 16(7): 1973-1985 (2014) [105]. This chapter merges these two publications.

2.1 Introduction

The ultimate goal of a video search engine is to match the most relevant search results to a user’s query and satisfy the user’s information need as effectively and efficiently as possible. However, due to the rapidly growing amount of video data on the Internet and the well-known semantic [168] and intent [57] gaps between the video collection and users’ information needs, this task becomes increasingly challenging. In recent years, this challenge has been addressed throughout the entire video search engine pipeline: Visual content of videos is indexed based on semantic visual concepts [142, 83, 169], relevance functions are optimized through search results reranking [134], and the query formulation process is supported by automatic query suggestion [211] and search intent derivation [57]. These and other optimization techniques could, however, also be supported by mechanisms that predict at which points in the sequence of user interactions with the engine (i.e., a video search session) optimizations are critically needed. Specifically, these are the points where the user is likely to be confronted with non-relevant search results, generated by failing queries.

[Figure 2.1 image: two search sessions, each showing the submitted queries (meerkat habitat, meerkat) and thumbnails of the returned video results.]

Figure 2.1: A visualization of excerpts from two independent, ongoing search sessions corresponding to two different underlying user needs. User A first submits query meerkat habitat which produces a query failure before achieving query success with query meerkat. The (independent) search session of User B depicts the exact opposite scenario.

In this chapter, we address the challenge of automatically predicting query failures to ultimately help deploy optimization algorithms in a more informed fashion. We propose a context-aware query failure prediction approach that derives such a prediction from the context of the video search session in which the query is issued. To define query failure, we adopt the rationale from transaction log analysis from the field of text information retrieval [86, 45], including assumptions about the relationship between clicks on search results and actual relevance. We consider a query as failed if it produces a results list containing no search results the user considers relevant enough to click. We illustrate this with two excerpts from ongoing search sessions in Figure 2.1. In one search session, User A first submits query meerkat habitat, returning a non-satisfying results list and resulting in no user click, i.e., in a failed query. The user then generalizes the query to meerkat, which leads to results containing a video the user does click. In another search session, User B first submits query meerkat to the video search engine. This does not lead to a search results click, but narrowing the query to meerkat habitat does. The examples in Figure 2.1 illustrate that the challenge we pursue in this chapter is due to the enormous variability in user interaction patterns and engine responses. The fact that either query expansion or reduction (next to many other actions performed by a user in a search session) can lead to query success illustrates the difficulty of query failure prediction and motivates us to design a query failure prediction approach that goes beyond simplistic characteristics of query sequences within a search session. Developing a successful query failure prediction method is therefore not trivial. Furthermore, the examples clearly show that the problem of query failure prediction cannot be considered independently of the context provided by the search session.

[Figure 2.2 image: histogram of query success rates from the Bing video search engine; x-axis: success rate bins 0%, 1-9%, 10-19%, ..., 90-100%; y-axis: frequency, with ticks at 0, 0.1 and 0.2.]

Figure 2.2: Distribution of success rates for sample queries in our transaction log.

This insight is underlined by statistics for queries drawn from users' interactions with the Bing video search engine¹: Figure 2.2 depicts the distribution of success rates of ca. 56K unique queries (cf. Section 2.4 for details about the video search engine transaction log these queries were sampled from). The success rate of a query is defined as the proportion of submissions of this query in which the user clicks on a result in the returned results list. For example, if the query meerkat was submitted 100 times and a result was clicked in 60 of these cases, the query has a success rate of 0.6. The fact that the majority of the queries fall midway between the two extremes of 0% and 100% provides evidence that a single query can correspond to different underlying information needs, which lead to different user decisions to click on a result. This evidence supports our position not to treat all instances of a query equally, but rather to take the context of the query within the search session into account when predicting query failure.
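The success rate definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the log format (one `(query, clicked)` pair per query submission), the function names `success_rates` and `bin_label`, and the treatment of rates between 0% and 1% are hypothetical and are not taken from the transaction log or the tooling used in this thesis. The bin labels follow the horizontal axis of Figure 2.2.

```python
from collections import defaultdict


def success_rates(log):
    """Return {query: (clicks, submissions, rate)} for a click log.

    `log` is an iterable of (query, clicked) pairs, one per submission.
    This record format is an assumption made for illustration only.
    """
    submissions = defaultdict(int)
    clicks = defaultdict(int)
    for query, clicked in log:
        submissions[query] += 1
        if clicked:
            clicks[query] += 1
    return {
        q: (clicks[q], submissions[q], clicks[q] / submissions[q])
        for q in submissions
    }


def bin_label(clicks, submissions):
    """Map a query to one of the success-rate bins shown in Figure 2.2.

    Integer arithmetic avoids floating-point edge cases at bin borders.
    Assumption: any non-zero rate below 10% is placed in the "1-9%" bin.
    """
    if clicks == 0:
        return "0%"
    decile = min(10 * clicks // submissions, 9)
    if decile == 0:
        return "1-9%"
    if decile == 9:
        return "90-100%"
    return f"{decile * 10}-{decile * 10 + 9}%"


# The example from the text: "meerkat" submitted 100 times, clicked 60 times.
log = [("meerkat", True)] * 60 + [("meerkat", False)] * 40
clicks, submissions, rate = success_rates(log)["meerkat"]
print(rate, bin_label(clicks, submissions))
```

For the meerkat example, this yields a success rate of 0.6, which falls into the 60-69% bin. A per-query rate of this kind deliberately ignores the session in which each submission occurred, which is exactly the information the context-aware approach adds.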

In addition to the information derived from the search session context, our approach relies on a second information source to make query failure prediction better informed, namely the results lists produced by the video search engine. Predicting how well a query performs by looking at the results list it produces has previously been addressed in conventional text-based information retrieval by query performance prediction (QPP) techniques (e.g., [213,60,69,32,59,208]). The key principle behind QPP is that highly