
Scalable dimensionality reduction methods for recommender systems

Michał Ciesielczyk, MSc

Michal.Ciesielczyk@put.poznan.pl

Doctoral Dissertation

prepared at

Faculty of Computing Science, Poznań University of Technology

Primary supervisor: Mikołaj Morzy, PhD, DSc
Auxiliary supervisor: Andrzej Szwabe, PhD

Poznań, March 2015


Skalowalne metody redukcji wymiarowości dla systemów rekomendacyjnych

mgr inż. Michał Ciesielczyk

Michal.Ciesielczyk@put.poznan.pl

Rozprawa Doktorska

Wydział Informatyki, Politechnika Poznańska

Promotor: dr hab. inż. Mikołaj Morzy
Promotor pomocniczy: dr inż. Andrzej Szwabe

Poznań, marzec 2015


Dedicated to my beloved wife Ania.


Abstract

Recommendation algorithms are aimed at assisting people in dealing with the excess of information available in systems of various kinds; nowadays, information overload takes place in many systems, especially on the Internet. Although such algorithms have been widely adopted by both research and e-commerce bodies, the problem that their authors aim to address is still regarded as not fully solved.

The research reported in the thesis is oriented towards developing a technique that effectively copes with the high data sparsity problem, but at the same time is no more computationally complex than the state-of-the-art collaborative and content-based filtering methods. The vector-based model has been used to represent data, since it guarantees a flexible way of storing and processing information and knowledge. In order to address the scalability issue, the algorithms proposed by the author are based on two-stage processing of the input data. First, the points representing the modelled data in the vector space are projected onto a randomly selected subspace of reduced dimensionality. According to the Johnson-Lindenstrauss lemma, the size of the input matrix may be significantly reduced while still approximately preserving the distances between points in the vector space. Subsequently, the resulting vector space is factorized to preserve only the most salient features. Moreover, due to the fact that real-world data sets are often very sparse, the thesis also focuses on the most challenging case of extremely high collaborative data sparsity, for which the use of many widely-referenced methods is disqualified. For this reason, the dimensionality reduction methods and reflective data processing are investigated, by carrying out a series of experiments, from the perspective of the ability to produce high-precision recommendations and to cope with high unpredictability of the data sparsity.

The results of the theoretical research have been evaluated according to a research methodology that is well established in the relevant literature, using publicly available data sets and following scenarios corresponding to real-world demands. Therefore, as in practice it is sufficient to identify only a small and non-trivial set of items for each user from a vast set of choices, the evaluation is novelty-oriented and based on the so-called find-good-items task, rather than on low-error-of-ratings prediction. In order to make the results comparable with those presented in the relevant literature, all the methods are tested using the MovieLens data set, which is one of the most widely referenced datasets used in research on collaborative filtering.

This dissertation presents and evaluates a range of methods, applicable to recommender systems, that have been developed over the past several years, together with methods proposed by the author. As shown in the analysis of the experimental results, the proposed solutions make recommendation techniques more reliable, more accurate, and applicable to a wider range of real-world applications. The design and implementation of the proposed algorithm, major insights, and examples of the system's applicability are also discussed.

Based on the presented analytical research and experimental results, the author states that vector-space recommendation techniques and dimensionality reduction methods may be combined in a way that preserves the high quality of recommendations, regardless of the amount of processed heterogeneous data.

Keywords: Collaborative Filtering, Dimensionality Reduction, Machine Learning, Reflective Random Indexing, Statistical Relational Learning

Thesis Statement

Vector-space recommendation techniques and dimensionality reduction methods may be combined in a way preserving the high quality of recommendations regardless of the amount of processed heterogeneous data.


Acknowledgements

Numerous people helped me through the completion of this thesis and I would like to express my appreciation to some of them here.

First of all, I would like to express my sincere gratitude to my doctoral supervisors. I am very grateful to Prof. Mikołaj Morzy, my supervisor, for his continuous support of my research and for his assistance with refining this document. His guidance, enthusiasm, vast knowledge and skill in many areas have certainly contributed to my growth as a scientist. I would like to deeply thank Dr. Andrzej Szwabe, my auxiliary supervisor, for inspiring me with the research on recommender systems, and most of all, for his time and patience. Our numerous scientific discussions provided me with relevant feedback and allowed me to follow important research directions. His assistance throughout the years has helped me in many aspects of my work.

I also thank Dr. Paweł Misiorek, my direct collaborator, for sincerely supporting my initiatives and always having time for a meeting. His expertise and insightful understanding contributed considerably to formalizing my scientific ideas and helped to clarify this work. In addition, I appreciate his assistance in writing grant proposals, scholarship applications and technical reports related to this work. I also gratefully acknowledge the support of Prof. Czesław Jędrzejek, the Head of the Division of Information Technology and Systems, who initially created the environment in which I work and gave me the opportunity to carry out my research. The possibilities that I have as an employee of the institute have contributed greatly to my development as a scientist.

I would also like to acknowledge the funding that directly supported my scientific research. In particular, while working on the doctoral thesis I was supported by the Polish Ministry of Science and Higher Education (the "Information-theoretic analysis of semantic structures for collaborative recommendations" project), the National Science Centre (the "Information-theoretic abductive reasoning for context-based recommendation" project) and the Faculty of Computing at Poznan University of Technology (a scholarship from a pro-quality grant for the best PhD students).

I thank all my co-workers for daily conversations on various scientific and technical subjects, which provided me with valuable feedback. Among them, I would like to mention Adam Styperek, with whom I share the room; Dr. Jarosław Bąk, for maintaining a positive atmosphere at our workplace; Przemysław Walkowiak, for joint work in a number of initiatives in recent years; and Maciej Urbański, for providing us with technical support. I am also indebted to the many communities contributing to open source initiatives by providing technologies that I have used to conduct my research.

Most importantly of all, I would like to especially thank my family for their love and support in everyday situations. In particular, I appreciate my beloved wife and best friend, Anna, without whose understanding, patience, continuous encouragement and enormous support in difficult times I would not have finished this thesis. I thank my parents for their faith, devotion of time and the help they provided me throughout my entire life, allowing me to be as ambitious as I wanted. Finally, I thank my friends and all other individuals who, more or less, contributed to the completion of this work.


Contents

Acronyms
Glossary
1 Introduction
1.1 Motivation
1.2 Area of research
1.2.1 Recommendation vs. classification
1.2.2 Recommendation vs. prediction
1.2.3 Dataset-based off-line evaluation methodology
1.2.4 Multidimensional vector-space model in relational learning
1.3 Aims of the dissertation
1.3.1 Sparse data processing
1.3.2 Scalability
1.3.3 Graph-based perspective on reflective vector-space processing
1.3.4 Multi-relational data processing
1.4 Research of the author
1.4.1 RI-based approximation of SVD
1.4.2 Hybrid recommendation methods
1.4.3 Long-tail recommendation
1.4.4 CF based on bi-relational data
1.5 Contribution of the thesis
1.6 Structure of this thesis
2 State of the art
2.1 Sparse data processing
2.2 Scalability
2.3 Graph-based perspective on reflective vector-space processing
2.4 Multi-relational data processing
2.5 Deficiencies in the state of the art
3 Methodology
3.1 Basic definitions
3.2 Recommendation quality evaluation
3.3 Running example
4 Sparse data processing
4.1 Introduction
4.2 Proposed method
4.2.1 Preliminary processing using content feature data
4.2.2 Training based on collaborative filtering data
4.2.3 Input matrix reconstruction
4.2.4 Running example
4.3 Evaluation
4.3.1 Evaluated recommendation methods
4.3.2 Accuracy of recommendations based on SECF-RSVD
4.4 Conclusions
5 Scalability
5.1 Introduction
5.2 Proposed method
5.2.1 Index vectors generation and context vectors training
5.2.2 SVD-based item matrix decomposition
5.2.3 Reconstruction of user vectors matrix
5.2.4 Input matrix reconstruction
5.2.5 Running example
5.3 Evaluation
5.3.1 RSVD evaluation methodology
5.3.2 Recommendation accuracy evaluation results
5.3.3 Relation between recommendation accuracy and selected statistical measures
5.3.4 Algebraic properties of RI-based dimensionality reduction methods
5.3.5 Execution times measurement
5.3.6 SECF-RSVD computational effectiveness
5.4 Conclusions
6 Graph-based perspective on reflective vector-space processing
6.1 Introduction
6.2 Proposed method
6.2.1 Modelling user-item dependencies as a probability space
6.2.2 Reflective data processing
6.2.3 The PRI algorithm
6.2.4 Running example
6.3 Evaluation
6.3.1 Methodology
6.3.2 Recommendation quality evaluation
6.4 Conclusions
7 Multi-relational data processing
7.1 Introduction
7.1.1 Methodology
7.1.2 Proposed data representation model
7.1.3 Application of the proposed model in the bi-relational CF scenario
7.2 Proposed method
7.2.1 Data representation based on an element-fact matrix
7.2.2 Generation of prediction lists
7.2.3 Running example
7.3 Evaluation
7.3.1 Evaluated scenarios
7.3.2 Dataset preparation
7.3.3 Evaluated recommendation algorithms
7.3.4 Experiments
7.4 Conclusions
8 Conclusions
8.1 Main results
8.1.1 Sparse data processing
8.1.2 Scalability
8.1.3 Graph-based perspective on reflective vector-space processing
8.1.4 Multi-relational data processing
8.2 Contributions overview
8.2.1 Sparse data processing
8.2.2 Scalability
8.2.3 Graph-based perspective on reflective vector-space processing
8.2.4 Multi-relational data processing
8.3 Potential economic and societal impact
8.4 Potential directions of further studies
Bibliography


Acronyms

AUROC Area Underneath the ROC curve.

CF Collaborative Filtering.

F1 F1-score.

IR Information Retrieval.

LSA Latent Semantic Analysis.

MAE Mean Absolute Error.

NLP Natural Language Processing.

PCA Principal Component Analysis.

RI Random Indexing.

RMSE Root Mean Square Error.

RRI Reflective Random Indexing.

RSVD Randomised Singular Value Decomposition.

SRL Statistical Relational Learning.

SVD Singular Value Decomposition.

TR Text Retrieval.


Glossary

Area Underneath the ROC curve (AUROC, ROC AUC) - area under the Receiver Operating Characteristic curve, often used as a summary statistic for the predictor robustness.

Collaborative Filtering - a method that generates predictions about the interests of a specific user using techniques involving the collection of information from many users.

F1-score - accuracy measure in statistical analysis of binary classification, usually defined as the harmonic mean of precision (the number of correct results divided by the number of all returned results) and recall (the number of correct results divided by the number of results that should have been returned).

Information Retrieval - the activity of finding specific items (e.g. documents) of an unstructured nature (e.g. text) that are relevant to an information need from within large collections.

Latent Semantic Analysis - SVD-based technique of analysing semantic similarity relations between documents and terms, in which every document and term is represented by a corresponding vector in a multidimensional space.

Long Tail - the name for a long-known feature of some statistical distributions. A population distribution is said to have a long tail if a larger share of occurrences (i.e., more than a half) is located in the tail of the distribution. More formally, the distribution of a random variable X with complementary cumulative distribution function P_c is said to have a long right tail if for all t > 0, lim_{x→∞} P_c(x + t) / P_c(x) = 1.

Mean Absolute Error - quantity used to measure the differences between predicted values and the true observed values, equivalent to the average of the absolute errors.

Natural Language Processing - field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from human or natural language input.


Principal Component Analysis - statistical procedure for converting a set of observations of possibly correlated variables into principal components, a set of values of linearly uncorrelated variables.

Random Indexing - dimension reduction method based on the insight that a high-dimensional model can be approximately projected into a random space of sufficiently lower dimensionality without compromising the distances between the points in the input vector space.

Randomised Singular Value Decomposition - vector space dimensionality reduction method based on a combination of RI-based pre-processing and SVD-based vector space optimization.

Reflective Random Indexing - an iterative variant of the Random Indexing approach involving cyclical training that allows meaningful associations to be drawn between modelled objects, as the system generates new inferences by considering what it has learned from a data set in a previous iteration.

Root Mean Square Error - accuracy measure representing the sample standard deviation of the differences between predicted values and true observed values.

Singular Value Decomposition - factorization of a rectangular matrix, used in signal processing and statistics.

Statistical Relational Learning - sub-discipline of artificial intelligence and machine learning that focuses on learning when samples are not independent and identically distributed, using ideas from probability theory and statistics to address uncertainty.

Text Retrieval - matching of user queries against unstructured text in order to find information matching given criteria.


1 Introduction

1.1 Motivation

Nowadays, more and more people are overwhelmed with choice, driven by the rapidly growing Internet and ever-increasing globalization, which makes decision-making increasingly difficult. This phenomenon has been described as the paradox of choice, according to which the greater the choice a person has, the smaller the chance that she will be satisfied with her ultimate decision [1]. This paradox is strictly related to information overload, which refers to the difficulty a person can have making decisions or understanding an issue when presented with too much information.

Although the term information overload was popularized by Alvin Toffler in 1970 [2], it can be traced, to some extent, to antiquity [3]. Thus, it is not surprising that researchers in the area of Information Retrieval (IR) have focused intensively on recommender systems, which aim to assist people in decision-making processes.

In general, a recommender system is a software tool providing users with suggestions of items that may be of interest to them, using a model built from a set of previously collected transactions [4]. An item is a general term for an object, usually having specific characteristics, that may be recommended. Every user of a recommender system has her/his own goals and individual preferences. A transaction is defined as a recorded human-computer interaction. It is assumed that acquiring an item always incurs a certain cost for the user, and the purpose of suggestions is to reduce this cost by supporting various decision-making processes.

The transactions may contain implicit information that was not provided intentionally, or explicit feedback which the user provided. Usually, user ratings (collected explicitly or implicitly), indicating their item preferences, are the most popular transaction data used by recommender systems [4, 5, 6]. In the classical approach, this type of information is often modelled as a bipartite graph, in which each user and each item is represented as a node, and the edges encode a relation between them. Such a model is usually stored in the form of an adjacency matrix, also referred to as a user-item matrix, in which rows represent users and columns represent items [4, 5, 6]. In more advanced graph-based approaches, the data are modelled in the form of a graph in which each user and each item is represented as a node, and directed edges encode the multiple relations between them (i.e. ratings or any other interactions) [7, 8].
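As a minimal illustration of the classical user-item matrix described above, the sketch below builds such an adjacency matrix from a list of recorded transactions. The user names, item names, and ratings are hypothetical toy data, not drawn from the thesis:

```python
import numpy as np

# Hypothetical toy transactions: (user, item, rating) triples.
ratings = [
    ("u1", "i1", 5), ("u1", "i3", 3),
    ("u2", "i2", 4), ("u2", "i3", 2),
    ("u3", "i1", 1), ("u3", "i5", 5),
    ("u4", "i4", 4),
]
users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})

# Rows represent users, columns represent items; 0 marks a missing rating.
R = np.zeros((len(users), len(items)))
for u, i, r in ratings:
    R[users.index(u), items.index(i)] = r

print(R.shape)  # (4, 5)
```

The same matrix can equivalently be read as the biadjacency matrix of the bipartite user-item graph mentioned above.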

In many real-world cases, a recommender system is expected to predict user actions in order to provide assistance in selecting items from an overwhelming set of choices, while simultaneously reducing the amount of necessary human-system interaction [6, 9]. Widely-referenced approaches are based on the use of an input matrix that represents each user profile as a vector in a space of items and each item as a vector in a space of users [10, 11]. One of the main challenges in this area of research is the feasibility of large-scale matrix processing [5].

At the same time, the most typical scenario in e-commerce is the case of high data sparsity, in which one has to manage a large number of users, each choosing only a few items from a relatively big inventory [5, 12]. Under these conditions, all the available data need to be used in order to provide useful recommendations.

Evaluation of state-of-the-art recommendation algorithms performed using one of the most widely referenced Collaborative Filtering (CF) datasets, the MovieLens 1 [12], indicates that trivial popularity-based techniques achieve results comparable to those achieved by much more sophisticated methods [10, 11, 13]. Therefore, in order to build an advanced recommendation system, one has to implement explicit item popularity modelling. Algorithms that use eigenvector computations, such as PageRank [14], are able to globally rank the nodes of a graph (representing web documents in the case of PageRank) by taking into account only the primary eigenvalue [15]. Using such a ranking, one is able to estimate how close each node is to the centre of the graph. In other words, it is possible to determine the global importance of every node. As a consequence, algorithms like PageRank may be considered adequate recommendation algorithms for new users (i.e. users of whom we know nothing). However, due to their insensitivity to local dependencies, algorithms of this type fail to produce personalized recommendations.

Advanced IR methods frequently use matrix factorization techniques, such as Singular Value Decomposition (SVD), which enable them to capture latent associations [5, 16]. The key idea of applying SVD in the CF domain is to decompose the user-item rating matrix into a product of two significantly lower-rank matrices, representing independent user/item features, and a diagonal matrix containing singular values. The purpose of such an operation is to reduce the noise (causing errors in predictions) in the input data [16]. However, while removing the dimensions corresponding to the least significant singular values (i.e. during the so-called k-cut), one simultaneously removes not only the noise but also the local dependencies. Thus, SVD-based algorithms perform well in the task of finding global dependencies in the dataset.
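The k-cut described above can be sketched in a few lines of NumPy. The rating matrix R below is hypothetical toy data, and k = 2 is an arbitrary illustrative choice of retained dimensions:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: items).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

# Full SVD: R = U * diag(s) * Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# The k-cut: keep only the k largest singular values, discarding the
# dimensions that carry mostly noise (and, as noted above, local dependencies).
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_k is the best rank-k approximation of R; formerly-zero entries
# now carry scores that can be read as predictions.
print(np.round(R_k, 2))
```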

However, in real-world application scenarios, especially in e-commerce, local dependencies can be just as important as global trends due to the well-known long-tail phenomenon [9, 10, 17, 18]. The behavioural data collected by on-line retailers indicate that the majority of successful recommendations are based on predictions of items from the long tail [11]. Trivial (popularity-based) recommendations are of little benefit to users who expect some degree of personalization and novelty. Therefore, in order to provide high-quality and useful recommendations, one has to take into account the local (i.e. non-global) relationships between objects in the dataset. Generally, the more local relationships there are in the data, the more dimensions are needed to represent them. In such a case, a method employing dimensionality reduction techniques like SVD is of little use. However, algorithms based on so-called reflective data processing, in contrast to SVD-based algorithms, provide the means to model multiple individual features using appropriately many orthogonal (i.e. independent) dimensions, owing to high-dimensional data modelling and recursive retraining. Reflective processing is based on the concept of exploring knowledge about coincidences between modelled objects. As a result, it enables the discovery of indirect correspondences between objects [9], and hence allows the generation of non-obvious and useful recommendations. That is why algorithms based on reflective processing have been gaining attention in the Natural Language Processing (NLP) and IR communities in recent years [19, 20].

1 http://www.grouplens.org/node/73

Alas, the more dimensions an algorithm uses to model objects and correspondences in the dataset, the less scalable it is, which is crucial in large-scale data processing [19, 21, 22]. Most recommender systems simply do not handle large volumes of data at all, and even if they do, they require substantial investments to operate. For instance, large-scale solutions like Mahout and Hadoop [23] require a lot of training, development, deployment, and the constant maintenance and support of complex and expensive cloud-based server clusters. One of the solutions addressing the scalability issue is to use an iterative algorithm, such as Random Indexing (RI) [22], that reduces the dimensionality.
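A minimal sketch of RI-style dimensionality reduction follows. The interaction data, the 100-dimensional target space, and the four non-zero seeds per index vector are illustrative assumptions, not parameters used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def index_vector(dim=100, seeds=4):
    """Sparse ternary index vector: a few random +1/-1 entries, the rest 0."""
    v = np.zeros(dim)
    pos = rng.choice(dim, size=seeds, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=seeds)
    return v

# Hypothetical toy data: the items each user has interacted with.
interactions = {"u1": ["i1", "i2"], "u2": ["i2", "i3"], "u3": ["i1", "i3"]}

# Assign each item a nearly-orthogonal random index vector...
items = {i for basket in interactions.values() for i in basket}
idx = {i: index_vector() for i in items}

# ...and train each user's context vector as the sum of the index vectors
# of its items, projecting users into the reduced 100-dimensional space.
context = {u: sum(idx[i] for i in basket) for u, basket in interactions.items()}
print(context["u1"].shape)  # (100,)
```

Reflective (RRI) processing would iterate this step, re-deriving item vectors from the user context vectors and retraining, which is how the indirect correspondences discussed above emerge.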

It should also be noted that the visualization method used to present the computed recommendations is reported to be a critical factor in the acceptance of a recommender system [4]. Although recommendations are commonly presented as a ranked list of items, it has been observed that much information is lost in such an approach [24]. This is because two items, both of which match the user's preferences, can differ significantly from each other. The use of dimensionality reduction methods, along with building a two-dimensional visualization of the recommendations, enables one to retain an essential part of this information by positioning similar recommendations closer to each other [25].

As can be seen, there are two main reasons for using dimensionality reduction methods. Firstly, they allow the computational complexity to be scaled down, with a certain approximation; secondly, they reduce the noise in the input data by preserving only the most salient features.

1.2 Area of research

In this section the main research problem is defined and the evaluation methodology is proposed. Specifically, the recommendation task is distinguished from classification and regression analysis. Consequently, the research is aimed at comparing, both quantitatively and qualitatively, the introduced recommendation methods on the basis of such assumptions. Additionally, the author proposes an unambiguous notation to model the user's interaction with the system, extending the classical user-item matrix paradigm.

1.2.1 Recommendation vs. classification

Despite the fact that some articles in the literature try to define the best-item recommendation problem as a classification problem [26], in the author's opinion these two research areas differ in principle. In general, the purpose of a classification algorithm is to choose one option from a limited set of possibilities [27]. On the other hand, the goal of recommendation algorithms is to select and rank a list of items from a vast set of choices, usually in a personalized way (i.e. individually for each user) [4, 6].

In the terminology of machine learning, classification (herein understood as supervised learning) is usually conducted to establish a function using correctly classified training data [27]. However, the characteristics of the data are usually not taken into account. This fact can be of great importance when the observed behavioural data is highly sparse and there are almost no training examples at the instance level (i.e. per user or item). Such a characteristic, also referred to as behavioural/collaborative data sparsity, makes the non-observed entries especially hard to interpret [4, 6]. Moreover, it is often reported that the users of recommender systems do not need obvious suggestions [6, 10, 28, 29]. As every recommender system is interested in recommending mostly these non-observed entries, optimizing a recommendation model to predict all of them as zero or negative would obviously result in poor predictions. For instance, a very trivial 1-NN classifier [30] would perfectly fit most of the observations, but it would certainly fail as a useful recommender. Therefore, the fact that behavioural data, which are frequently used by recommender systems, are neither accurate nor consistent [31, 32], partially due to context-sensitivity, additionally differentiates the recommendation task from the classification task.

What is more, learning for classification is done by optimizing a model with respect to the hinge loss [33]. As a consequence, the optimization criterion does not directly reflect the ranking task. In other words, the optimization in classification methods works against the desired recommendation goal. Thus, despite the fact that at first glance the classification task may look like recommendation, the two techniques differ substantially.

1.2.2 Recommendation vs. prediction

The research presented in this dissertation is conducted from the perspective of the practical applicability of recommender systems in real-world (specifically e-commerce) scenarios. Therefore, despite the fact that the goal of a recommender system is to predict user preferences, it is not necessary to estimate the absolute rating values of all items not yet rated by the user. As a consequence, the evaluation methodology proposed by the author is selection-oriented [6], reflecting the real-world demands corresponding to context-aware personalized recommendation. Overall, it is sufficient to identify only a small set of items that are the most likely to be attractive to a given user [10]. In other words, the evaluation is oriented towards the so-called find-good-items task [6], rather than towards low-error-of-ratings prediction (i.e., regression analysis).

Classification metrics, such as Area Underneath the ROC curve (AUROC), measure the probability of the recommender system making correct or incorrect decisions about whether an item is relevant. Moreover, classification metrics tolerate differences between actual and predicted values, as long as they do not lead to wrong decisions. Thus, these metrics are appropriate for examining binary relevance relationships [6]. In particular, when using AUROC it is assumed that the ordering among relevant items does not matter. According to [34], AUROC is equivalent to the probability of the system being able to choose properly between two items, one randomly selected from the set of relevant items, and one randomly selected from the set of non-relevant items. For this reason, the results of the theoretical research are evaluated by means of experiments based on quality measures that are probabilistically interpretable, such as AUROC.
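The pairwise interpretation of AUROC quoted from [34] can be computed directly: average, over all (relevant, non-relevant) pairs, the indicator that the relevant item receives a higher score. The scores below are hypothetical:

```python
import numpy as np
from itertools import product

# Hypothetical predicted scores for relevant and non-relevant items.
relevant = np.array([0.9, 0.8, 0.4])
non_relevant = np.array([0.7, 0.3, 0.2, 0.1])

# AUROC = P(score of a random relevant item > score of a random
# non-relevant item), counting ties as 1/2.
pairs = list(product(relevant, non_relevant))
auroc = np.mean([1.0 if r > n else 0.5 if r == n else 0.0 for r, n in pairs])
print(auroc)  # 11/12, i.e. about 0.917
```

Here only the 0.4-vs-0.7 pair is mis-ordered, so 11 of the 12 pairs are ranked correctly; note the result depends only on the ordering of the scores, not on their magnitudes.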

1.2.3 Dataset-based off-line evaluation methodology

A series of experiments, following known experiment design practices [35], has to be conducted in order to verify the thesis statement. Usually, at the design phase of a recommender system, off-line experiments (following the evaluation methodology) are conducted, as they require no interaction with real users [28]. The off-line evaluation is carried out by running several algorithms on a number of selected datasets containing user decisions (e.g., ratings or information about product purchases) and then measuring the performance of the algorithms. Such an approach enables one to compare, both quantitatively and qualitatively, the recommendation algorithms against real user actions.

The off-line evaluation is mostly conducted on existing public benchmark data or, if such data are not available, on collected real-world data [4, 10, 12]. Since there is a lack of freely accessible e-commerce data sets, it is difficult to make the results comparable with those presented in relevant papers [10, 11, 19, 36]. Therefore, all the recommendation methods presented in this dissertation are tested using the MovieLens dataset, which is one of the most widely referenced CF datasets and contains movie ratings collected from real-world users [4, 12].

1.2.4 Multidimensional vector-space model in relational learning

The tensor representation of ontological data is one of the modelling techniques used in relational learning. An ontology is a set of concepts and the relations between them, formally representing domain knowledge [37]. In RDF terminology, each relation may contain many triples (subject-predicate-object expressions). The subject denotes a resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. A collection of triples intrinsically represents a labelled, directed multi-graph, in which nodes correspond to specific resources and edges represent relations. Such a graph may be expressed in the form of a third-order tensor, in which every relation is represented as an adjacency matrix, one slice of the tensor. Such a model makes it possible to use linear algebra methods to extract information about latent relationships between nodes. In particular, tensor decompositions are a method of finding hidden relations [38].
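The triples-to-tensor construction described above can be sketched as follows; the triples, entity names, and relation names are hypothetical examples:

```python
import numpy as np

# Hypothetical (subject, predicate, object) triples.
triples = [
    ("alice", "likes", "matrix"),
    ("bob", "likes", "inception"),
    ("alice", "knows", "bob"),
]
entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})
e = {x: k for k, x in enumerate(entities)}
r = {x: k for k, x in enumerate(relations)}

# Third-order tensor: one entity-by-entity adjacency-matrix slice per relation.
X = np.zeros((len(relations), len(entities), len(entities)))
for s, p, o in triples:
    X[r[p], e[s], e[o]] = 1.0

print(X.shape)  # (2, 4, 4)
```

Each frontal slice X[k] is the adjacency matrix of one relation, which is exactly the layout that tensor decomposition methods operate on.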

The advantage of that approach is the possibility of non-trivial reasoning about facts (i.e., without the use of explicitly given logical rules). Additionally, such a model is suitable for extracting new rules between objects and relations. A few tensor decomposition algorithms are currently known; however, not every one of them may be used in such reasoning and extraction tasks, which is a motivation to conduct research in this area.
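To make the tensor representation concrete, the construction of such a third-order tensor from a set of RDF-like triples may be sketched as follows; the triples and all names below are invented for illustration only:

```python
import numpy as np

# Toy RDF-like triples (subject, predicate, object); the entity and
# relation names are illustrative, not taken from any specific dataset.
triples = [
    ("user1", "likes", "item1"),
    ("user1", "likes", "item2"),
    ("user2", "dislikes", "item1"),
]

entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

# Third-order tensor: one adjacency matrix (slice) per relation.
n, m = len(entities), len(relations)
T = np.zeros((m, n, n))
for s, p, o in triples:
    T[r_idx[p], e_idx[s], e_idx[o]] = 1.0

# Each slice T[k] is the adjacency matrix of relation k and can be
# processed with standard matrix methods (e.g., the SVD of one slice).
likes = T[r_idx["likes"]]
```

Each slice being an ordinary adjacency matrix is what makes the linear-algebra toolbox mentioned above directly applicable.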


1.3 Aims of the dissertation

In this section the main research problems, such as scalability and high data sparsity, are discussed. The author also tries to justify the aim of integrating a heterogeneous tensor-based model for data representation with the leading vector-space processing methods, in particular those based on the Hilbert space model. Major insights, along with the proposed innovative solutions, are briefly introduced.

1.3.1 Sparse data processing

Usually, a single user is likely to rate or buy only a small percentage of possible items [5,12].

As a result, real-world datasets are often very sparse. For this reason, the reported research is oriented towards developing a technique which effectively copes with the high data sparsity problem, but at the same time is not more computationally complex than the state-of-the-art collaborative and content-based filtering methods presented in the literature [19, 36]. This thesis, in contrast to many publications in the literature [12,39,40,41], focuses also on a case of extremely high collaborative data sparsity. Such a scenario corresponds better to many real-world applications [42], but at the same time, due to the missing data problem, is considered the most challenging one [36].

For the purpose of designing a scalable system, this thesis investigates (by carrying out a series of experiments) dimensionality reduction methods and reflective data processing from the perspective of their ability to produce high precision low-dimensional recommendations.

Such a reflective way of data processing may be especially advantageous when an application has to cope with high unpredictability of data sparsity, which is typical of all dimensionality reduction methods based on RI [19]. The author demonstrates that such an approach is more appropriate in high data sparsity scenarios, which disqualify the use of many widely-referenced CF methods [36,43].

What is more, the natural sparsity of CF datasets limits the set of recommendations that can be evaluated off-line [6]. Particularly, one cannot examine the usefulness of an item recommended for a user if one does not have such information in the dataset (more specifically, in the test set). As a consequence, the results of F1-score measurements [10, 12], which additionally require the use of the @n parameter, may be unclear (especially when the tested datasets have various sparsity). Therefore, for the purposes of evaluating a recommender system, one should choose the probabilistically interpretable AUROC measure.
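The AUROC measure can be computed directly from its probabilistic interpretation; the following sketch (with invented scores and relevance labels) illustrates one straightforward way to do so:

```python
import numpy as np

def auroc(scores, relevant):
    """AUROC: the probability that a randomly chosen relevant item is
    ranked above a randomly chosen non-relevant one (ties count as 0.5)."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    pos = scores[relevant]
    neg = scores[~relevant]
    # Compare every relevant/non-relevant pair of scores.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# A perfect ranking yields AUROC = 1.0; a random one scores about 0.5.
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

Unlike F1@n, no cut-off parameter is needed, which is why the measure remains comparable across datasets of different sparsity.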

1.3.2 Scalability

In order to address the scalability issue, the algorithms introduced in this dissertation are based on two-stage processing of the input data. First, by means of RI, points (representing the modelled data) in the vector space are projected onto a lower-dimensional subspace. The RI method is supported by the Johnson-Lindenstrauss lemma [44, 45], which states that the distances between points will be approximately preserved with high probability if they are projected onto a reduced-dimensional, randomly selected subspace of sufficient dimensionality.

Particularly, the size of a given matrix can be significantly reduced, with sufficient accuracy, especially in the case of high data sparsity, by multiplying it with a set of nearly orthogonal, random vectors. Subsequently, the resulting vector space is optimized using Principal Component Analysis (PCA) to preserve only the most salient features. A similar approach was introduced in the Randomised Singular Value Decomposition (RSVD) method, which may be used to provide higher quality results more effectively than those obtained when using only SVD [19, 21].

In addition, the RSVD method may be used as a more advanced alternative to RI. What is more, the lower the number of ultimately required dimensions, the more RSVD resembles SVD [19]. As a consequence, RSVD may be easily and successfully used in a recommender system originally based on SVD, at least as far as CF accuracy is concerned, enabling a significant reduction of the matrix factorization complexity and, at the same time, an improvement of the accuracy [19].
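The two-stage idea described above may be sketched as follows; this is only an illustrative combination of a dense random projection and a truncated SVD, not the exact RSVD algorithm of [19, 21], and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item matrix (rows: users, columns: items).
A = rng.random((200, 1000))

r, k = 100, 10   # reduced dimensionality and number of singular vectors kept

# Stage 1: project item profiles onto an r-dimensional random subspace
# (a dense Gaussian stand-in for RI's sparse ternary index vectors).
R = rng.normal(size=(1000, r)) / np.sqrt(r)
B = A @ R                     # 200 x r, much smaller than A

# Stage 2: SVD of the reduced matrix keeps the k most salient features.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
user_factors = U[:, :k] * s[:k]    # low-dimensional user representations
```

The expensive factorization is performed on the small matrix B rather than on A, which is the source of the complexity reduction discussed above.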

1.3.3 Graph-based perspective on reflective vector-space processing

The vector space model is commonly proposed as the main model of data representation in many IR and knowledge retrieval tasks involving large graphs [44, 46]. In the context of information retrieval, the key property of a vector space is that it may be accompanied by a similarity measure based on the inner product. As a result, such a vector space (in which vector lengths and angles between vectors may be measured and/or expressed) may be treated as a Hilbert space. The main reasons for the prevalence of the Hilbert space model are its simplicity and probabilistic interpretation [8,47].
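The inner-product-based similarity underlying such a Hilbert space model can be illustrated with the standard cosine measure; the example vectors below are arbitrary:

```python
import numpy as np

def cosine(u, v):
    """Inner-product-based similarity: the cosine of the angle between
    two vectors (1.0 for parallel vectors, 0.0 for orthogonal ones)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 0.0, 1.0])
v = np.array([2.0, 0.0, 2.0])   # same direction as u
w = np.array([0.0, 3.0, 0.0])   # orthogonal to u
```

Because the measure depends only on lengths and angles, it is well defined in any space equipped with an inner product, which is what makes the Hilbert space view applicable.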

In such a model, usually stored in the form of an adjacency matrix, the value of a particular entry may be seen as a measure of the relevance of the objects corresponding to the respective indices. The research reported in this thesis is aimed at constructing a model enabling the processing of data in order to derive the most plausible hypotheses about the relations between objects. Besides this functionality, the presented model should also be useful for coping with errors or noise in the data set.

Additionally, an analysis of the dependencies between the degree of dimensionality reduction, the spectral properties of the modelled structures, and the effectiveness measured during the experiments has been carried out.

1.3.4 Multi-relational data processing

The author questions the classical approach of recommender systems, in which the whole history of each user is taken equally into account to generate personalized recommendations, as, in his opinion, there is no such thing as an `average taste', even for an individual. In the author's opinion, without the incorporation of context-awareness it is impossible to provide fully-functional recommendations that take into account all the important factors shaping human decisions in a real-world environment. In order to enable truly context-aware personalized services, one has to, first of all, extend the classical user-item matrix paradigm, and secondly, specify an unambiguous notation to model the user interaction with the system [48].

The vector-based model is used to represent data since it guarantees a flexible way of storing and processing information and knowledge. This includes the tensor model, which assumes using more than two dimensions, and, in some special cases, the two-dimensional (more popular and easier to deal with) matrix model. Using this model enables representing information about objects from the physical world and the relations between them. Moreover, such a model is suitable for extracting the rules between the relations. Additionally, the vector-based model is convenient for representing multi-dimensional structures and graphs, enabling extraction of the information encoded in the structure of the data as well. Therefore, the tensor representation is suitable for modelling a process observed in everyday situations, e.g., for the needs of context-aware or localization-driven recommendation: the additional contextual information may be efficiently handled using the multidimensional approach.

Using the tensor-based data model increases the complexity of the data representation (compared to classical matrix representations), resulting in sparser and bigger data structures. Nevertheless, the tensor processing algorithms may still be expressed by means of standard matrix operations [8,38,49,50]. In other words, a tensor may be seen as a finite set of matrices, which represent different perspectives or views on heterogeneous data [51,52].

Consequently, the data can still be processed in an efficient two-mode way, regardless of the usage of the tensor-based data representation model. Moreover, due to the scalability issues mentioned above, the research herein is oriented towards incremental tensor pre-processing techniques, as presented in [49], so the use of the Kronecker product is avoided.
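The observation that a tensor may be handled by means of standard matrix operations can be illustrated as follows; the mode-1 unfolding and the slicing shown here are generic textbook operations, with arbitrary toy sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A third-order tensor, e.g., user x item x context (sizes arbitrary).
T = rng.random((4, 5, 3))

# Mode-1 unfolding: the tensor is flattened into an ordinary matrix whose
# rows are user profiles over all (item, context) pairs, so standard
# two-mode methods (SVD, RI, ...) apply directly.
T1 = T.reshape(4, 5 * 3)

# Equivalently, the tensor can be seen as a finite set of matrix slices,
# one per context value, each processed independently.
slices = [T[:, :, c] for c in range(3)]
```

No Kronecker product of factor matrices is formed at any point; the reshaping is a constant-time view of the same data.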

1.4 Research of the author

This dissertation follows and concludes the research of the author in the area of recommender systems. A range of recommendation methods that have been developed over the past several years, as well as methods proposed by the author [8,9,19,36,43], are presented and evaluated.

1.4.1 RI-based approximation of SVD

As shown in the research of the author, in scenarios where the input data can be meaningfully represented using vector spaces of significantly lower dimensionality, algorithms like Reflective Random Indexing (RRI) may be successfully adapted for preprocessing in SVD-based methods [19, 36]. Specifically, [21] investigates a method applicable to collaborative filtering, referred to as RSVD, which is a combination of RI-based pre-processing and SVD-based vector space optimization. In contrast to the RI algorithms presented in the literature [20, 22], RSVD may be used to provide results arbitrarily similar to those obtained when using SVD. It has been shown that RSVD, compared with SVD and the two most widely-referenced RI methods (the basic RI and RRI [20,22]), improves the recommendation accuracy and reduces the computational complexity.

In [19] the author analysed dimensionality reduction methods from the perspective of their ability to produce a low-rank customer-product matrix representation. The research showed that the Frobenius-norm optimality of SVD does not correspond to the optimal recommendation accuracy when measured in terms of F1. On the other hand, a high collaborative filtering quality is achievable when a matrix decomposition, based on a combined use of RRI and SVD, leads to increased diversity of low-dimensional singular vectors (as seen from the perspective of cosine similarities).


1.4.2 Hybrid recommendation methods

According to the authors of several widely cited surveys on recommendation systems, hybrid recommendation methods represent the state of the art in recommender systems technology by effectively dealing with the well-known behavioural data sparsity problem [11, 53]. In particular, in [43] the author introduced a class of hybrid recommender systems that are based on separated processing of collaborative and content data and a combination of dual predictions, or are based on incorporating some content-based characteristics into the collaborative approach or vice versa. It has been shown that a low-dimensional space is suitable for recommendation generation, despite the collaborative data sparsity disqualifying the usage of methods widely referenced in the literature [39, 54, 55].

Additionally, the proposed hybrid recommendation method based on two-stage data processing (first dealing with content features describing items, and then handling user behavioural data), which was presented in [36], allowed to improve the recommendation accuracy without increasing the computational complexity. Specifically, the introduced method was a combination of content features' preprocessing performed by means of RI, a reflective retraining of preliminary item vectors (which have a reduced number of dimensions) according to CF data, and a vector space optimization based on SVD. The experiments presented in both [43] and [36] were focused on the most challenging case of extreme collaborative data sparsity. The author demonstrated that, in such an application scenario, the proposed recommender system design is a more effective means for hybrid recommendation than weighted feature combination.

1.4.3 Long-tail recommendation

In certain domains, especially in e-commerce, users generally find recommendations of items they were not familiar with more useful [6, 29]. In other words, obvious recommendations, despite being almost always correct, often have no effect on customers' behaviour. Therefore, in order to reflect `real-world' demands, in particular those regarding the non-triviality of recommendations, the evaluation presented in [9] is novelty-oriented. Particularly, when preparing datasets for the experiments, similarly to the approach presented in [10], a specified number of the most popular items were removed from the dataset.

The method proposed by the author, CF data processing based on reflective vector-space retraining, was evaluated against its ability to provide recommendations of items from the Long Tail, using the probabilistically interpretable AUROC measure. The reported research was also oriented towards investigating the advantage of a generalization of RRI (which itself has recently been gaining a lot of attention from the NLP and IR communities) over well-known methods based on matrix factorisation heuristics, such as LSA. What is more, the analysed data sets (i.e., prepared using the MovieLens data set) were modelled as bipartite graphs to show the relation between the structural properties of the user-item matrix and the optimal number of reflections.

1.4.4 CF based on bi-relational data

When the behavioural input data have the form of (userX, likes, itemY) and (userX, dislikes, itemY) triples, one has to propose a representation that is more flexible for the use of propositional data than the ordinary user-item ratings matrix (i.e., one in which each user profile is a vector in a `space of items' and, analogically, each item is a vector in a `space of users'). In order to address this issue, in [48] and [8], the author proposed the use of an element-fact matrix, in which RDF-like triples are represented by columns, and users, items, and relations are represented by rows. The research showed that by following such a triple-based approach to the bi-relational data representation (instead of using a `standard' user-item matrix), combined with reflective data processing, it is possible to improve the quality of CF (measured using AUROC).

1.5 Contribution of the thesis

Based on the presented analytical research and experimental results, the author aims to confirm that, at least in the investigated application cases and while applying the proposed solutions, the vector-space recommendation techniques and dimensionality reduction methods may be combined in a way preserving the high quality of recommendations regardless of the amount of processed heterogeneous data.

As the goal of a recommender system is to recommend to a given user the items which the system `guesses' to be liked by the user, this class of systems may be regarded as one of the clearest examples of an approximate reasoning application case. In this dissertation, all of the solutions proposed by the author have been investigated in CF scenarios using publicly available data sets. The vector-based model has been used to represent data since it guarantees a flexible way of information storing and processing. Because a high number of dimensions is needed to represent the objects in the vector space (leading to the so-called `curse of dimensionality' [45]), a technique reducing the vectors' dimensionality has to be applied. Specifically, dimensionality reduction applied to a set of vectors represented in an n-dimensional space is realized by embedding such a set of vectors in a space of reduced dimensionality, i.e., an r-dimensional space, where r < n.

On the other hand, a sufficient representation of the locally-visible relationships between objects in the dataset should also be provided. Although such relationships do not contribute to the principal components of the data, they are known to be especially important in the context of recommending specialized (i.e., not popular) items from the so-called Long Tail. In such a case, a higher number of dimensions is needed, and typically used dimensionality reduction methods such as SVD are frequently of little use [9]. For these reasons, this thesis is focused on the issue of analysing the vector-space dimensionality reduction methods, based on the Johnson-Lindenstrauss lemma [44], that allow preserving local dependencies.
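The distance-preservation property guaranteed by the Johnson-Lindenstrauss lemma can be checked empirically; the sketch below uses a dense Gaussian projection as a stand-in for RI's sparse index vectors, with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

n, r = 10000, 400          # original and reduced dimensionality (arbitrary)
x = rng.normal(size=n)
y = rng.normal(size=n)

# Random projection matrix scaled so that squared distances are preserved
# in expectation (RI uses sparse ternary vectors instead, but the
# Johnson-Lindenstrauss guarantee is analogous).
R = rng.normal(size=(r, n)) / np.sqrt(r)

d_original = np.linalg.norm(x - y)
d_reduced = np.linalg.norm(R @ x - R @ y)
distortion = d_reduced / d_original   # close to 1 with high probability
```

The observed distortion concentrates around 1 as r grows, which is exactly the behaviour the lemma quantifies.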

The quality of recommendations has been measured according to a research methodology that is well established in the relevant literature [6,12,28] and follows scenarios corresponding to the real-world demands. Therefore, as in practice it is sufficient to identify only a small and non-trivial set of items for each user, the evaluation is novelty-oriented and based on the so-called find-good-items task [6]. Moreover, due to scalability and sparsity issues, the research herein is focused both on the cases of large-scale data processing and on the cold-start cases in which the amount of behavioural data is severely limited. In order to deal with the data sparsity problem it is necessary to exploit information from various data sources as much as possible. Under such conditions, one has to provide methods that enable scalable incremental data processing and, at the same time, effectively cope with extremely sparse and heterogeneous data.

The main contributions of the dissertation are:

• Analysis of dimensionality reduction methods and reflective data processing from the perspective of their ability to produce high precision low-dimensional recommendations.

• Design of a recommender system architecture based on incremental processing of the input data, addressing the scalability issue.

• Specification of a model allowing to define user interactions with the system, extending the classical user-item matrix paradigm.

• Analysis of the matrix-based formulation of the tensor-based heterogeneous data representation.

• Analysis of spectral and information-theoretic views on reflective data processing, allowing to compare algorithms like RRI and PageRank in order to show how the reduction of spectrally defined entropy affects the quality of the noise reduction.

Design and implementation of the proposed algorithm, experiments' results, major insights and examples of the system applicability are also discussed.

1.6 Structure of this thesis

Chapter 1 introduces the aims and scope of the dissertation. The major problems behind these goals, the accompanying research challenges, and the author's main contributions are briefly described. Chapter 2 presents the theoretical background and reviews the most important approaches that are relevant to the scope of the thesis. In addition, the major deficiencies in the state of the art are also discussed. Chapter 3 introduces the necessary terminology and the evaluation methodology. An illustrative running example, used to demonstrate each of the methods introduced by the author, is also provided. Chapters 4-7 present the main research of the author, concerning scalability, high data sparsity, relevance modelling, and multi-relational data processing, and are organised as follows:

1. Introduction illustrates a specific research problem accompanied by the associated challenges.

2. Proposed method section introduces the major innovations of the author's solution.

3. Evaluation provides an experimental examination of the introduced algorithms along with some insights.

4. Conclusions section discusses the main results of the work presented in the chapter.

Finally, chapter 8 summarizes the main results, including a discussion of the work's limitations, and reviews the major contributions. The potential directions of further research are presented as well.


2 State of the art

One of the most popular applications of machine learning and data mining systems in the real world (e.g., in e-commerce) is extracting unknown information from large and usually sparse data sets in order to provide useful predictions. Particularly, a recommender system is expected to predict user choices rather than to provide exact ratings [56]. Therefore, it should not be surprising that many researchers working on personalized recommendation systems study the so-called `find good items' task [6].

In this chapter the most important approaches that are relevant to the scope of the thesis, as well as the corresponding theoretical research, are reviewed. The main problems, such as scalability, high data sparsity, relevance modelling, and multi-relational data processing, are briefly discussed.

2.1 Sparse data processing

One of the main problems in the research on recommender systems is how to interpret the observed sparse data. Sparsity means that a relatively low percentage of possible feedback is present. In machine learning systems, these non-observed or unavailable values are commonly referred to as missing values.

More formally, let Concept be a finite set of c defined concept variables (i.e., |Concept| = c), and it is assumed that there is no direct relationship between elements in Concept. Furthermore, let Ψ be a space over the variables:

Ψ = Concept_1 × … × Concept_c ,

where |Concept_1| = c_1, …, |Concept_c| = c_c. Thus, |Ψ| = c_1 · … · c_c. Let a vector ψ = (ψ_1, …, ψ_c) ∈ Ψ be defined as an instance of Ψ, where ψ_1 ∈ Concept_1, …, ψ_c ∈ Concept_c.

For example, in the case of the MovieLens dataset, we could have:

Concept_1 = {user_1, user_2, …}
Concept_2 = {item_1, item_2, …}


Thus, the instance (user_1, item_2) of the space Ψ would mean that user_1 rated item_2. Every recommender system, in order to generate recommendations, uses the observations that have been made in the past. Let the function θ describe the past observations over Ψ and be defined as:

θ : Ψ → {0, 1} .

In other words, θ is a binary function, indicating whether an instance has been observed, and can also be interpreted as a set of observations Θ, such that Θ ⊆ Ψ, where:

ψ ∈ Θ ⇔ θ(ψ) = 1 .

Finally, it may be stated that a dataset is sparse when:

∑_{ψ ∈ Ψ} θ(ψ) = |Θ| ≪ |Ψ| .    (2.1)
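For illustration, the observation function θ and the sparsity condition (2.1) can be expressed with a binary matrix; the toy user-item values below are invented for the example:

```python
import numpy as np

# Binary observation matrix theta over a toy space Psi = Users x Items;
# a 1 means the corresponding (user, item) instance has been observed.
theta = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
])

observed = int(theta.sum())           # |Theta|, the sum in (2.1)
possible = theta.size                 # |Psi|
sparsity = 1.0 - observed / possible  # fraction of missing values
```

With only 2 of 15 possible instances observed, the dataset is sparse in the sense of (2.1): |Θ| ≪ |Ψ|.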

In practice, many commercial recommender systems are based on large and extremely sparse datasets [5, 12], which is one of the most challenging problems in this domain. What is more, usually only positive values are present in a dataset (i.e., no negative feedback is observed directly). As a consequence, the interpretation of the missing values is not trivial, as they may represent either negative or positive feedback.

Furthermore, if an item has been observed (e.g., rated by a user), it has rarely been observed multiple times. In other words, for every item in the dataset there are usually very few ratings. For example, the ML10M 1 dataset has as many as 98.69% missing values. The Last.fm 360K 2 dataset has even more: 99.98%!

It should also be noted that if an item has not been rated yet, it does not mean that it will not be rated in the future. Generally, it is quite the opposite: items not yet rated are the ones from which the system has to generate recommendations. For instance, in online shopping, even though a new user has not bought any item yet, it does not mean that she/he would never buy anything. The recommender system, by analysing her/his behaviour, should decide which of the items she/he is most interested in, and provide a useful recommendation.

One of the most typical problems caused by the data sparsity is the cold start problem [57].

In CF the items are recommended based on users' past preferences; thus, new users have to rate a substantial number of items for the system to provide reliable recommendations for them. Recommender systems, in such a new user case, tend to suggest popular items which are expected to be somehow familiar to the user [29]. However, as these popular items are not necessarily relevant to each user, it has been acknowledged that providing such obvious, non-personalised recommendations may decrease user satisfaction [6, 58]. Similarly, an analogous situation applies to new items: when they are added to the system, they need to be rated a sufficient number of times before they can be recommended to the users.

Hybrid recommender systems were introduced in order to effectively deal with the behavioural data sparsity [59, 60]. This kind of application is designed to exploit the complementary advantages of content-based filtering and collaborative filtering. In the literature [26], one may find that one of the most challenging problems, which has to be considered when designing a state-of-the-art hybrid recommender system, is that typically there are many

1 http://www.grouplens.org/node/73

2 http://mtg.upf.edu/node/1671


Figure 2.1: Long tail popularity distribution example. To the right (lighter shade) is the long tail that dominates; to the left (darker shade) is the short head.

more content features describing items than the corresponding behavioural data, preferably collected during the system operation. Nevertheless, according to the authors of the widely-cited publications mentioned above, the sparser behavioural data are more valuable (from the perspective of their usefulness for generating high quality recommendations) than the related content features (usually much denser). Therefore, in order to effectively use these two data sources, an architectural separation of data processing functions may be considered a good design pattern.

So far, many solutions have been proposed in the area of hybrid recommendation techniques [39, 54, 57, 59]. The main practical distinction in these techniques corresponds to how the data processing stages are defined. The most widely referenced (considering the experiments on popular data sets [12, 54]) and also the simplest approach is to balance the results of separated data processing stages. Typically, in these so-called weighted hybrid recommendation systems, the collaborative filtering data and content features are initially processed independently, and then the dual predictions are combined in the final stage of the algorithm.

More advanced hybrid systems are usually based on joint processing of content and collaborative features or are realized as cascade systems [60, 61]. An example of a cascade system may be found in [26], where, in the first stage, a content-based model is built for each user, and then the ratings data are used to combine these features in order to generate the final user profiles, used for recommendation purposes.

The data sparsity problem applies mostly to items from the Long Tail, where many items have only a few ratings, thus making them hard to use in a recommender system [62]. By definition [63], the distribution of a random variable X with probability distribution Pr is said to have a long right tail if:

for all t > 0:   lim_{x→∞} Pr[X > x + t | X > x] = 1    (2.2)

Specifically, in e-commerce, for such population distributions it is considered that accounting for the majority of occurrences (i.e., more than half) requires at least the first 20% of items in the distribution [17]. In other words, in a long-tailed distribution the most frequently occurring 20% of items represent less than 50% of occurrences; thus, the least frequently occurring items cumulatively dominate in the population. A long tail popularity distribution example is shown in Figure 2.1. According to Chris Anderson [17], the Long Tail is divided into two parts: the head containing popular items and the tail containing niche or obscure items. In particular, the boundary point between the head and the tail in Figure 2.1 indicates the median of the distribution.
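The 20%/50% characterization of a long-tailed popularity distribution can be checked directly on item popularity counts; the counts below are invented for illustration:

```python
import numpy as np

# Item popularity counts sorted in decreasing order (toy example).
counts = np.array([60, 50, 30, 28, 25, 22, 20, 18, 15, 12])

total = counts.sum()
head = counts[: max(1, int(0.2 * len(counts)))]  # most popular 20% of items

# In a long-tailed distribution the head covers less than half of all
# occurrences, so the remaining niche items cumulatively dominate.
head_share = head.sum() / total
long_tailed = head_share < 0.5
```

Here the two head items account for roughly 39% of all occurrences, so this toy distribution satisfies the long-tail criterion described above.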

Coping with long-tailed behavioural data is regarded as one of the main challenges of the research on CF, and of the research on recommender systems in general [10,18,29]. Heavily-tailed data distributions are known to be typical of popular Internet applications [17,18]. One of the main problems is to provide recommendations that allow users to discover niche items from the Long Tail [29]. Providing such personalised discovery tools, instead of promoting only popular goods, could lead to an increased item diversification, and consequently leverage the benefits of the Long Tail [29]. Despite that, the recommender systems that are proposed as a means for Long Tail recommendation are based on SVD, a method that has been used for CF for more than 10 years by many authors not mentioning the Long Tail phenomenon [6,12,59]. Moreover, even researchers focusing on Long Tail recommendation systems very rarely use specialized methods for analysing heavily-tailed data distributions [6,62].

At least in some application scenarios, reflective matrix data processing techniques, like RRI, are more useful than `classical' methods based on SVD and dimensionality reduction [20]. Unfortunately, there is no publicly available, widely referenced e-commerce data set [4]. At the same time, the most widely known CF data set that has been used in these research activities, the MovieLens data set, cannot be regarded as having a heavily-tailed distribution. Therefore, the applicability of reflective data processing [20] to Long Tail CF systems remains an open issue.

The need for a specialized analysis of sparse and heavily-tailed CF data was one of the key motivations for the research presented in this dissertation. Specifically, reflective vector retraining methods are investigated as ones especially effective in finding `latent connections' in sparse and long-tailed data.

2.2 Scalability

In the context of recommendation systems, one of the most challenging problems is to achieve the feasibility of large-scale matrix data dimensionality reduction. This problem appears especially challenging in real-world e-commerce application scenarios, as users usually rate or buy only a small percentage of the available products [5, 12].

There are two basic reasons for using dimensionality reduction of the algebraic space in which the represented data are modelled in recommender systems. One of them is the ability to improve the precision of recommendations. The other is the ability to reduce the complexity of online computations and the storage requirements [5, 12]. Recommendation algorithms usually divide their internal functions into two parts: an offline and an online component.

Producing a low-dimensional representation of a customer-product matrix helps to decrease the complexity of online computations and reduces the volume of a recommender system database [12, 42].

For more than a decade, SVD, the most popular matrix dimensionality reduction method, has been used as the key element of many CF systems [5,12,42,59]. Recommendation systems based on CF do not represent the only area of successful SVD applications. Probably the best example from another domain is LSA, a method that is widely regarded as significantly enhancing the quality of Text Retrieval (TR) [16]. Consequently, SVD is also used in hybrid recommendation systems.

However, the computational complexity and storage requirements of SVD are still serious bottlenecks in large-scale applications [20]. It is often stated that SVD is a technique applicable only to data sets of small or medium sizes [5, 20]. In other words, a large-scale SVD application requires computational resources beyond the reach of most researchers.

On the other hand, in the case of nearly every SVD-based application, only a very limited number of the principal components is used in further processing. This fact is the key motivation for the work on scalable alternatives to SVD [20,42,64].

In particular, RI is a method that has attracted the attention of the TR research community [22] and Google, Inc. [20]. To our knowledge, RI and RRI are the only methods that are widely reported in the literature as scalable and efficient alternatives to SVD [20, 22]. Although some researchers have proposed (for applications other than recommender systems) solutions involving a cascaded use of RI or random projection and SVD [65, 66], none of these methods is capable of producing a full set of accurately approximated SVD matrices. As a result, these methods cannot be used for low-dimensional input matrix reconstruction in a high-accuracy recommender system based on rating estimation [5, 12].

Apart from methods based on random projection or random sampling, some researchers have proposed additional optimizations based on lossless rating compression [67]. However, these compression techniques are applicable only to memory-based recommender systems, which essentially calculate predictions based on ad hoc heuristic rules (i.e., not by using statistical and machine learning techniques) [59]. A typical example of a memory-based technique is the user-based nearest neighbour algorithm (kNN). The user-based kNN method predicts the rating r_ui of user u for item i using the ratings given to i only by the k users most similar to u, called the nearest neighbours. There are several substantial problems with such an approach. First, due to the lack of a probabilistic interpretation, it is not clear which type of similarity measure to use. Second, one needs to determine the value of the parameter k (the number of nearest neighbours). Moreover, generating predictions only on the basis of the selected k neighbours makes the method very sensitive to noise in the data. Finally, pure memory-based approaches do not scale and are therefore not suitable for large-scale, real-world applications [68].
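A minimal sketch of the user-based kNN prediction described above is given below. Cosine similarity is used here, but as noted, this is one of several possible choices; the ratings are hypothetical toy values and zeros denote missing ratings.

```python
import numpy as np

def predict_user_knn(R, u, i, k=2):
    """Predict the rating of user u for item i from the k most similar users
    (cosine similarity). Assumes at least k other users have rated item i."""
    norms = np.linalg.norm(R, axis=1)
    sims = (R @ R[u]) / (norms * norms[u] + 1e-12)
    sims[u] = -np.inf              # exclude the target user
    sims[R[:, i] == 0] = -np.inf   # exclude users who did not rate item i
    neighbours = np.argsort(sims)[-k:]
    weights = sims[neighbours]
    return float(weights @ R[neighbours, i] / weights.sum())

R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [5.0, 5.0, 4.0, 1.0],
    [4.0, 4.0, 5.0, 0.0],
    [1.0, 1.0, 0.0, 5.0],
])
r_hat = predict_user_knn(R, u=0, i=2, k=2)
```

Both weaknesses mentioned above are visible even in this sketch: the similarity measure and k are arbitrary parameters, and the prediction rests entirely on the few neighbours that happen to have rated item i.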

An example of an approximate method for finding similar users is Locality-Sensitive Hashing (LSH), which implements the nearest neighbour algorithm in linear time [69]. However, even though this algorithm is considered scalable, it inherently suffers from deficiencies similar to those of the kNN algorithms, which are reported to provide low accuracy in multidimensional data sets [5]. This is particularly evident in high data sparsity scenarios, when the available data are insufficient for finding the nearest neighbours [11, 70].
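One common LSH family, random-hyperplane hashing for cosine similarity, can be sketched as follows; this is a generic illustration of the LSH principle, not the specific variant of [69], and the vectors are toy values.

```python
import numpy as np

def simhash_signature(X, n_bits=16, seed=0):
    """Random-hyperplane LSH: map each row of X to an n_bits binary signature.

    Each bit records on which side of a random hyperplane the vector falls,
    so vectors with a small angle between them share most signature bits."""
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(X.shape[1], n_bits))
    return (X @ H) > 0

X = np.array([
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.1],    # nearly parallel to the first vector
    [-1.0, 1.0, -1.0],  # pointing roughly the opposite way
])
sig = simhash_signature(X, n_bits=32)
same_01 = np.sum(sig[0] == sig[1])  # many matching bits expected
same_02 = np.sum(sig[0] == sig[2])  # few matching bits expected
```

Candidate neighbours are retrieved by comparing short binary signatures instead of full rating vectors, which is what yields the linear-time behaviour; the cost is that the retrieved neighbours are only approximate, which compounds the accuracy problems of kNN under high sparsity.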

It is worth noticing that researchers working on scalable alternatives to SVD have so far focused their efforts on the development of methods enabling application-specific performance improvement, rather than on analysing how the algebraic properties of the new methods influence the retrieval quality. In particular, it is well known that SVD provides dimensionality reduction that is optimal in the Frobenius norm sense [71] and that SVD effectively enhances TR and CF methods [16, 59]. Nevertheless, the relation between the properties of SVD, especially the distribution of singular values, and the recommendation accuracy has not yet been thoroughly investigated [72, 73]. As a consequence, although SVD-based CF systems operate optimally
