Multiple Instance Learning using Bag Distribution Parameters


D.M.J. Tax

Delft University of Technology, Delft

Abstract

In pattern recognition and data analysis, objects or events are often represented by a feature vector with a fixed length. For some applications this is a severe limitation, and extensions have been proposed. One approach is Multiple-Instance Learning (MIL). Here, objects are represented by a collection of feature vectors (called a bag), and a bag is labeled positive when at least one feature vector is a member of a concept. In some situations it is not suitable to assume the presence of a concept, and the distribution of all the feature vectors in a bag is required to classify the bag. In this paper we propose a simple bag classification scheme using the parameters of the fitted distributions. Experiments show sometimes surprisingly good performance with respect to other state-of-the-art approaches.

1 Introduction

A standard assumption in pattern recognition or machine learning is that objects are represented by a fixed number of feature values. This reduces a real-world object or event to a single point in a feature space, and allows for a relatively straightforward analysis and classification of the object. Unfortunately, this assumption is often very restrictive. Consider, for instance, the representation of an image containing a scene with objects. In order to capture all information about all objects, a very rich representation is required. Only when the task is specified precisely beforehand (for instance, that we want to label images as ‘horses’ or ‘no-horses’) can more specific segmentation and feature extraction be defined. If the representation should be more flexible, to allow for more flexible class definitions, a single feature vector per object becomes hard to define. A well-known approach to classify a set of instances is to use a ‘bag of words’ approach [9]. Here a dictionary of words is defined, which should represent all possible concepts in the feature space. This is typically found by clustering all instances from a large collection of objects, and using the cluster centers as words. The instances in an object are assigned to the best matching word, and the object is represented by a word-presence vector. This approach has the advantage that the classification typically does not depend on a single word (as it does on a single concept in MIL). The drawback is that a suitable collection of words has to be defined (which can become hard when the total number of instances is not very large), and that the word-presence vector typically becomes very sparse.

An alternative approach is to break up the object into smaller, and more homogeneous, parts, and to represent the object by a collection of parts. This is called Multiple Instance Learning (MIL) [5]. Instead of a single feature vector, we have a collection (called a bag) of feature vectors (called instances). For the representation of images it would mean that the image is split into smaller patches or homogeneous segments, and each of these segments is characterized by standard image features like color, texture and shape features. An example is shown in the first three subfigures of Figure 1. An original image containing a horse (subfigure 1(a)) is segmented into non-overlapping segments (subfigure 1(b)), and from each of the segments a feature vector is extracted. The classifier then classifies the individual segments, and applies a combining rule to derive a bag label.

In MIL typically an additional assumption is made: the task is to detect the presence of a so-called concept. Bags therefore have to be classified into a positive or a negative class (like ‘horses’ and ‘no-horses’). A bag is positive when at least one instance in this bag is a member of the concept, and otherwise the bag is labeled negative. For the example in Figure 1 the segments that belong to the concept ‘horse’ are shown in


[Figure 1 panels: (a) Original horse image; (b) Segmented image; (c) Segments from the concept; (d) Original African image; (e) Segmented African image; (f) African ‘concept’?]

Figure 1: Images can be segmented in homogeneous segments, and are thus represented by a collection of image segments. For image classification problems that involve clear objects, some of these segments should belong to the concept. For more abstract classification problems, like the class ‘African’ for the image in subfigure 1(d), it may be hard to find such a concept. Instead, the distribution of all image segments has to be considered.

white in subfigure 1(c). To find a good classifier, we face two problems: which instances are informative for the class label (or, are members of the concept), and how can we train a good classifier to model this concept? A drawback of this concept-modeling approach is that it essentially assumes that the concepts are present as tight clusters in the feature space. ‘Vague’ and higher-level concepts, for instance when images have to be labeled ‘outdoor scene’, ‘evening’ or ‘African’, are therefore not suitable for this approach. Here it can be expected that most instances in the bag are informative, and not a single one as is assumed in the strict MIL definition. This is shown in the second row of Figure 1, where an image labeled ‘African’ is segmented, but no clear concept can be distinguished. For these types of problems, it may be more natural to characterize the complete distribution of instances in a bag, instead of just selecting a single, most informative, one.

In this paper we propose an alternative representation that avoids the pre-definition of concepts or words. Instead, the bag of instances is modeled by a full probability distribution, which means that bag distributions have to be classified. In [7] kernels between bags have been defined, and on these kernels a support vector classifier is trained. Here we propose a simpler alternative to classify the bag distribution: fit a probability density model per bag, and use the optimized parameter values as the feature representation of the bag. This is typically very simple to perform, and allows us to use any classifier for the classification. A drawback is that the number of instances per bag is typically limited, and therefore only simple density models can be used. In the experimental section we show that for many problems the performances are still surprisingly good (i.e. easily beating more advanced MIL approaches).

In section 2 we discuss a few possible density models and their feature representations. In section 3 experiments with comparisons show the feasibility of the approach, and in section 4 we conclude.


2 Modeling bags with distributions

Assume that we have a training set with labeled bags X = {(b_i, y_i); i = 1, ..., N}, where each bag b_i = {x_{ik}; k = 1, ..., n_i} contains n_i instances, and the instances are standard feature vectors x ∈ R^d. In standard MIL it is assumed that there is a positive and a negative class, y = +1 or y = −1, and that the bag label is derived from the instance labels. The instances are labeled according to their membership of a so-called concept: when instance x belongs to the concept, c(x) = +1, otherwise c(x) = 0. The bag label is then inferred using:

ŷ(b_i) = +1 if ∃k : c(x_{ik}) = +1, and −1 otherwise.   (1)

The typical approach to train a MIL classifier is to optimize an instance classifier h that classifies each instance x as belonging to the concept or not. Then the combining rule (1) is applied to obtain the bag label.
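As an illustration, a minimal sketch of this combining rule (not code from the paper; the instance classifier is assumed to follow the common scikit-learn predict interface and to return +1/−1 labels):

```python
import numpy as np

def bag_label(instance_clf, bag):
    """Combining rule (1): a bag is positive when at least one of its
    instances is classified as belonging to the concept.

    instance_clf -- a fitted classifier with a predict() method returning +1 / -1
    bag          -- array of shape (n_i, d) holding the instances of one bag
    """
    instance_labels = instance_clf.predict(bag)
    return +1 if np.any(instance_labels == +1) else -1
```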

For the general, optimal bag classification, the class posterior probability given all the instances in a bag (p(y|b)) has to be found:

ŷ = argmax_y p(y|b_i) = argmax_y p(y|{x_{i1}, ..., x_{in_i}})   (2)

This model is very flexible, because all instances or all combinations of instances in a bag can potentially contribute to the discrimination between classes. This model family is too broad to handle easily, and therefore here we make the simplifying assumption that the distribution of the instances is informative for the class label.

When we model, or approximate, the collection of instances by a density d̂, the class posterior probability is estimated by:

p(y|b_i) = p(y|d̂(x|θ_i)) = p(y|θ_i)   (3)

where θ_i are the parameters of the density d̂ that was fitted on bag b_i. Note that we avoid the explicit representation of d̂ in the feature space, but instead use only the density parameters. It is therefore assumed that all information in the collection of instances is captured by the distribution parameters θ. This is indeed true when the instances are truly drawn from the distribution d̂ and the parameters θ are the sufficient statistics of d̂. Typically, the parameters θ_i are estimated using maximum likelihood on the given bag instances {x_{i1}, ..., x_{in_i}}.

In general, when the bag distribution model is flexible enough to characterize the distribution of the instances well, the distribution parameters should be informative enough for the subsequent classifier. To estimate the class posteriors p(y|θ), or to perform the final classification (2), we can apply any ‘regular’ classifier on the parameters θ. For instance a logistic classifier, an LDA, a support vector classifier or any other density-based classifier can be fitted. Therefore, instead of explicitly estimating and modeling a concept, the classification problem is represented in terms of the distribution parameters of each of the individual bags.
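A minimal sketch of this scheme, assuming for concreteness that the bag mean is used as parameter vector θ_i (richer parameterizations follow in section 2.1) and that a scikit-learn logistic classifier is applied on top; the function names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bags_to_parameters(bags):
    """Fit the density model per bag and return its parameter vector;
    here simply the maximum likelihood estimate of the bag mean."""
    return np.vstack([bag.mean(axis=0) for bag in bags])

def train_bag_classifier(bags, bag_labels):
    """Train a 'regular' classifier on the per-bag parameter vectors."""
    theta = bags_to_parameters(bags)
    return LogisticRegression(max_iter=1000).fit(theta, bag_labels)

def classify_bag(clf, bag):
    """Map a new bag to its parameters and classify those."""
    return clf.predict(bag.mean(axis=0).reshape(1, -1))[0]
```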

2.1 Bag density models

Typically, the number of instances per bag is limited, so it is hard to estimate very complicated density models on each of the bags. Furthermore, the number of parameters has to be equal for all bags b_i; therefore nonparametric distributions cannot be used. The remaining possibilities are still numerous though. Below we list a few simple approaches:

Gaussian distribution The bags are characterized by both a mean and a covariance matrix:

p(y|b_i) = p(y|[μ̂_i, Σ̂_i])   (4)

where the brackets [a] mean that all values of a are concatenated into one vector. The μ̂_i and Σ̂_i are the maximum likelihood estimates of the mean and covariance matrix. For a d-dimensional Gaussian, this results in a d + d(d + 1)/2 = (d² + 3d)/2-dimensional bag representation.

Gaussian with identical covariance matrices When equal covariance matrices are assumed for all bags, the parameters of these covariance matrices cannot be informative for discriminating classes. The only parameters left are the mean vectors per bag:

p(y|b_i) = p(y|μ̂_i)   (5)

This results in a representation with d feature values for each bag.

Uniform distribution The maximum likelihood solution for an (axis-parallel) uniform distribution is the rectangle tightly fitted around the instances. Its boundaries are the minimum and maximum feature values that appear somewhere in the instances of a bag. So when we define the minimum

b^l_i = [min_j (x_{ij})_1, ..., min_j (x_{ij})_d]   (6)

(where (a)_k indicates the k-th feature of vector a) and the maximum

b^u_i = [max_j (x_{ij})_1, ..., max_j (x_{ij})_d]   (7)

we obtain:

p(y|b_i) = p(y|[b^l_i, b^u_i])   (8)

The resulting bag feature vector now has 2d components.

An alternative encoding for the same parameters can be made by using the center and the width of the uniform distribution:

p(y|b_i) = p(y|[(b^u_i + b^l_i)/2, (b^u_i − b^l_i)/2])   (9)

Poisson distribution When the features are counts (like word counts in a document characterization), a Poisson distribution per feature can be estimated. The maximum likelihood estimates for the rates λ_i are the averages of the counts; therefore this estimator reduces to (5).

Note that the final practical implementation of this idea is very similar to the approach in [7]. In that paper it was noted that it is possible to extract features from bags of instances and use these as input to a standard classifier. That is basically what is being done here, except that here the features follow directly from the assumption a user makes on the distribution of the instances in the bags. So when some knowledge of the bag distributions is available, it may give a hint as to which features can be used.
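A minimal sketch of the bag representations listed above (the function names are illustrative and not taken from the paper; each maps one bag, an (n_i, d) array, to a fixed-length feature vector):

```python
import numpy as np

def mean_inst(bag):
    """Gaussian with identical covariance matrices, equation (5): the bag mean."""
    return bag.mean(axis=0)

def extremes(bag):
    """Uniform distribution, equations (6)-(8): per-feature minimum and maximum."""
    return np.concatenate([bag.min(axis=0), bag.max(axis=0)])

def extremes2(bag):
    """Alternative uniform encoding, equation (9): centre and half-width."""
    lo, hi = bag.min(axis=0), bag.max(axis=0)
    return np.concatenate([(hi + lo) / 2.0, (hi - lo) / 2.0])

def gaussian(bag):
    """Full Gaussian, equation (4): mean plus the upper triangle of the
    maximum likelihood covariance matrix (needs at least two instances)."""
    mu = bag.mean(axis=0)
    cov = np.atleast_2d(np.cov(bag, rowvar=False, bias=True))
    rows, cols = np.triu_indices(bag.shape[1])
    return np.concatenate([mu, cov[rows, cols]])
```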

2.2 Alternative approaches

In the original MIL formulation, the basic assumption is that there is a concept. Many standard MIL approaches therefore search for the most informative instances in a bag, and try to optimize a concept model on these instances. In the seminal work [5] an axis-parallel rectangle (APR) is fitted, while later in [10] a probabilistic formulation was given, called Diverse Density (here often an axis-parallel Gaussian model for the concept is used). Unfortunately, these methods tend to be very computationally intensive, because they do not just involve optimizing a concept model, but also the selection of the most suitable instances from the positive bags.

Therefore many approximate formulations have been proposed. For instance, in [2] MI-SVM is explained. Here a support vector classifier is trained by selecting just a single instance from each positive bag as a ‘witness’. All instances from the negative bags are treated as negative examples. The selection of the positive instances can be iterated once, or a few times. In [12] MiBoost is defined, which performs boosting in which all instances get a weight according to their apparent importance for the classification task.

When a concept can be identified in the classification problem, these approaches tend to work well. On the other hand, in some situations it appears that the identification of a concept is very hard. A naive, but still successful, approach is to copy all the bag labels to the instance labels, and train a standard classifier on the instances [13]. To classify a new and unseen bag of instances, all instances are classified first and a standard combining rule is used to fuse the labels. Although this approach ignores the bag structure in the data, it can obtain good results.
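A sketch of this naive approach (assuming a logistic instance classifier and, as a simple combining rule, averaging of the instance posteriors; [13] also discusses other combiners):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_naive_mil(bags, bag_labels):
    """Copy every bag label to its instances and train a standard instance classifier."""
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(bag), lab) for bag, lab in zip(bags, bag_labels)])
    return LogisticRegression(max_iter=1000).fit(X, y)

def naive_mil_score(clf, bag):
    """Fuse the instance outputs into one bag score by averaging the
    posterior probability of the positive class over the instances."""
    pos_column = list(clf.classes_).index(1)
    return clf.predict_proba(bag)[:, pos_column].mean()
```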

Finally, the standard approach in computer vision classification tasks is to use a ‘bag of words’ approach [6] (which originated in natural language processing). Here, all the instances are clustered (or, as in [4], all instances are used) and each of the clusters is considered a potential concept. The bags are now represented by their similarity to all the words. This results in a feature representation of fixed length, which can be classified by a standard classifier.
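A sketch of such a bag-of-words representation, assuming the words are found with k-means on all training instances and each bag is represented by a word-occurrence count vector (the embedding of [4] uses similarities instead; the details below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_dictionary(train_bags, n_words=10, seed=0):
    """Cluster all training instances; the cluster centres play the role of 'words'."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(np.vstack(train_bags))

def bag_of_words(dictionary, bag):
    """Represent a bag by how many of its instances are assigned to each word."""
    assignments = dictionary.predict(bag)
    return np.bincount(assignments, minlength=dictionary.n_clusters)
```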


3 Experiments

To compare the basic bag representation with alternative MIL approaches, experiments are performed on a large set of datasets. Many MIL problems are image classification problems, because images can naturally be represented by a bag of instances: first images are segmented, then features per segment are extracted and different classes are defined [4, 2, 3]. Three of these problems are considered in this paper. Some other types of problems are considered as well. First the classical drug discovery problems, Musk1 and Musk2, in which molecules are described by 166 shape features [5] and have to be classified as active or inactive. Another dataset considers newsgroup classification, in which newsgroup articles are described by 200 TF-IDF features [15] and, given a collection of newsgroup articles, the newsgroup name has to be predicted. And finally web page preference prediction, where web pages are described by the web pages that link to the page of interest [14].

Table 1: Dataset characteristics of some standard MIL datasets.

dataset                  nr.inst.   dim.   pos. bags   neg. bags   min. inst/bag   median inst/bag   max. inst/bag
MUSK 1 [5]                    476    166          47          45               2                 4              40
MUSK 2 [5]                   6598    166          39          63               1                12            1044
Corel African [4]            7947      9         100        1900               2                 3              13
Corel Historical [4]         7947      9         100        1900               2                 3              13
SIVAL AjaxOrange [8]        47414     30          60        1440              31                32              32
News atheism [15]            5443    200          50          50              22                58              76
Corel Fox [1]                1320    230         100         100               2                 6              13
Corel Tiger [1]              1220    230         100         100               1                 6              13
News motorcycles [15]        4730    200          50          50              22                49              73
News mideast [15]            3373    200          50          50              15                34              55
Harddrive [11]              68411     61         178         191               2               290             299
Web recomm. [14]             2212   5863          17          58               4                24             141

In Table 1 some characteristics are shown of the datasets considered in this paper. The datasets are chosen to show some variability in the number of features, the number of bags, and the (average) number of instances per bag.
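The evaluation protocol behind the tables below can be outlined as follows (a hedged sketch assuming a generic bag feature extractor, such as one of those in section 2.1, and a logistic classifier; the exact experimental code of the paper is not given here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def repeated_cv_auc(bags, bag_labels, bag_features, n_repeats=5, n_folds=10):
    """Five times 10-fold stratified crossvalidation on the bag level,
    reporting the mean and standard deviation of the AUC."""
    X = np.vstack([bag_features(b) for b in bags])
    y = np.asarray(bag_labels)
    aucs = []
    for repeat in range(n_repeats):
        folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=repeat)
        for train_idx, test_idx in folds.split(X, y):
            clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
            aucs.append(roc_auc_score(y[test_idx], clf.decision_function(X[test_idx])))
    return np.mean(aucs), np.std(aucs)
```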

Table 2: AUC performances (100×) of the classifiers on datasets Musk1, Musk2, Corel African and Corel Historical. Results are obtained using five times 10-fold stratified crossvalidation.

classifier                     Musk 1       Musk 2       Corel African   Corel Historical
APR                            78.9 (1.7)   80.8 (2.3)   57.4 (0.8)      61.4 (0.4)
Diverse Density                90.3 (1.8)   93.2 (0.0)   85.6 (0.1)      54.4 (0.2)
Boosting MIL                   80.3 (3.1)   49.3 (3.7)   64.1 (0.1)      38.0 (0.4)
Citation kNN                   88.6 (2.1)   82.9 (1.2)   80.4 (1.6)      76.5 (0.9)
MI-SVM                         70.3 (3.0)   81.5 (2.1)   63.5 (1.7)      78.8 (1.5)
MILES                          89.3 (1.9)   88.8 (1.8)   88.5 (0.5)      90.8 (0.8)
Naive MIL with Logistic        77.8 (1.7)   80.5 (1.8)   75.7 (0.2)      82.8 (0.1)
BagOfWords (k=10) Logistic     66.2 (5.6)   65.5 (5.9)   76.0 (2.4)      81.5 (2.0)
BagOfWords (k=100) Logistic    80.9 (4.7)   74.8 (4.1)   83.4 (2.3)      86.1 (2.5)
Mean-inst Logistic             84.3 (1.2)   89.6 (2.4)   83.2 (0.3)      88.6 (0.2)
Extremes Logistic              89.9 (2.5)   89.2 (2.3)   90.0 (0.2)      86.5 (0.2)
Extremes2 Logistic             91.8 (2.0)   90.0 (1.8)   90.0 (0.2)      86.6 (0.2)
Covariance Logistic            78.0 (3.5)   (a)          86.5 (0.9)      88.5 (0.3)

On these datasets the MIL classifiers mentioned in this paper are trained. The Axis Parallel Rectangle (APR) and the Diverse Density explicitly model a concept, while Boosting MIL, MI-SVM and MILES try to select the most informative instances, without modeling them. Next to that, the naive MIL implementation [13] and a bag-of-words representation are used, and these are compared to the representations derived in section 2.1. Note that for the uniform bag distribution, two alternative parameter formulations are defined. They are called ‘Extremes’ (equation (8)) and ‘Extremes2’ (equation (9)). The ‘Mean-inst’ entry shows the results of the Gaussian with identical covariance matrices, equation (5). For the approaches that define representations, but not classifiers, the (linear) logistic classifier is used.

In Tables 2, 3 and 4 the Area Under the ROC curve (AUC) performances of the MIL classifiers are shown for the datasets listed in Table 1. There are some situations in which experiments cannot be performed. In situations (a), where a bag contains just a single instance, it is not possible to get a good estimate of the covariance matrix, and no results are given. Furthermore, in (b), for very large datasets (like AjaxOrange and Harddrive), the MILES optimizer runs out of memory. Similarly, in (c) the dimensionality of the dataset is so large that it becomes infeasible to estimate a covariance matrix.

In the text-categorization problems ‘News atheism’, ‘News motorcycles’ and ‘News mideast’ the feature vectors are very sparse, resulting in very small distances between the vectors. Because most instances are now very close together, it becomes very hard to find 100 distinct clusters. Results for 100 words could therefore not be reliably obtained; this is indicated by (d). Finally, the optimization of the Diverse Density is slow, and for the large Harddrive dataset one fold in the ten-fold crossvalidation experiments required more than 20 hours, so no result is obtained (e).

The first results, listed in Table 2, already show that in particular the ‘Extremes’ representation is very suitable for the Musk datasets; it approaches the much more advanced Diverse Density approach. Also for Corel African (from which the example images in Figure 1 are taken), this representation performs very well. The approaches that try to select informative instances perform worse here. For Corel Historical the mean-instance representation is suitable, although it does not outperform the MILES classifier.

Table 3: AUC performances (100×) of the classifiers on datasets AjaxOrange, News atheism, Fox and Tiger. Results are obtained using five times 10-fold stratified crossvalidation.

classifier                     AjaxOrange   News atheism   Fox          Tiger
APR                            48.4 (0.8)   50.0 (0.0)     55.2 (1.2)   57.9 (1.6)
Diverse Density                55.5 (0.2)   42.0 (0.0)     66.8 (1.2)   80.8 (1.5)
Boosting MIL                   NaN (0.0)    50.0 (0.0)     53.5 (1.4)   74.2 (1.3)
MI-SVM                         93.6 (2.6)   69.8 (2.8)     54.4 (1.5)   80.1 (1.1)
MILES                          (b)          80.4 (1.2)     61.6 (0.9)   82.5 (1.1)
Naive MIL with Logistic        95.2 (0.1)   82.7 (2.5)     54.9 (3.2)   80.0 (1.4)
BagOfWords (k=10) Logistic     66.4 (2.7)   63.4 (8.5)     56.8 (3.5)   71.2 (4.6)
BagOfWords (k=100) Logistic    77.9 (1.7)   (d)            53.7 (4.9)   63.1 (5.2)
Mean-inst Logistic             87.5 (0.6)   85.2 (2.2)     55.4 (2.2)   79.5 (2.0)
Extremes Logistic              91.6 (0.4)   81.8 (3.4)     60.7 (2.8)   77.5 (2.0)
Extremes2 Logistic             91.6 (0.4)   81.0 (3.0)     60.1 (2.1)   76.6 (2.8)
Covariance Logistic            97.7 (0.8)   75.3 (2.0)     62.7 (1.9)   (a)

Often the characterization of a bag of instances using the extreme feature values performs very well. It can even approach the performance of concept-based MIL classifiers on datasets that appear to have a concept present. This could be explained when we assume that the concept is ‘sticking out’ in one of the directions in the feature space. If this direction is parallel to a few feature dimensions, it will automatically result in an extreme value for the minimum or maximum of one of the features that characterize the positive bags. These extreme values can therefore be considered a poor man’s concept description.

In Tables 3 and 4 the mean-instance representation (using (5)) and the representation using both the mean and the covariance matrix (equation (4)) also perform surprisingly well. In particular for the text-based problems, like the web recommendation and the News problems, the mean-instance representation is the most suitable one.

For the datasets Fox and Tiger, a clear concept appears to be present in the objects, and selecting the most informative instances becomes essential. Taking all instances in a bag into account adds too much noise to the representation and deteriorates the performance. For these problems it is necessary to apply Diverse Density or MILES.


Table 4: AUC performances (100×) of the classifiers on datasets News motorcycles, News mideast, Harddrive and Web recommendation. Results are obtained using five times 10-fold stratified crossvalidation.

classifier                     News motorcycles   News mideast   Harddrive    Web recomm.
APR                            50.0 (0.0)         49.8 (0.4)     90.4 (2.3)   58.9 (3.6)
Diverse Density                46.4 (2.9)         40.2 (2.5)     (e)          NaN (0.0)
Boosting MIL                   50.0 (0.0)         62.1 (4.2)     44.4 (0.1)   74.1 (4.6)
MI-SVM                         76.4 (4.0)         79.8 (2.3)     NaN (0.0)    60.7 (4.3)
MILES                          77.4 (1.9)         79.9 (3.4)     (b)          70.5 (5.0)
Naive MIL with Logistic        76.8 (3.4)         80.4 (2.2)     91.9 (0.7)   76.7 (1.8)
BagOfWords (k=10) Logistic     64.6 (6.0)         68.2 (5.2)     96.6 (1.7)   50.0 (0.0)
BagOfWords (k=100) Logistic    (d)                (d)            96.8 (1.0)   50.0 (0.0)
Mean-inst Logistic             85.3 (0.7)         79.4 (3.2)     96.0 (0.6)   90.0 (0.4)
Extremes Logistic              77.8 (2.1)         76.8 (2.5)     94.4 (0.5)   81.5 (0.8)
Extremes2 Logistic             78.5 (1.7)         76.8 (2.5)     94.2 (1.0)   81.3 (0.7)
Covariance Logistic            74.5 (2.7)         73.1 (2.9)     94.5 (0.9)   (c)

4 Conclusions

To solve a classification problem, objects are typically represented by a fixed-length feature vector. An alternative is to use a Multiple Instance Learning approach, where objects are represented by a collection (a bag) of feature vectors. This paper shows that for many problems not just a single feature vector but the full distribution of feature vectors in a bag is informative. In other words, to classify a complicated object like an image, it is often insufficient to look at just a single segment in the image; all segments are needed to get an idea about the image label.

To classify this distribution of vectors, the fitted distribution parameters are used as input features to a standard classifier. Experiments show that a uniform bag distribution already performs well in characterizing these bags. The bag of feature vectors is then represented by the minimum and maximum feature values that appear in the collection of feature vectors. This approach not only performs well, it is also very cheap and simple to implement and apply. In particular when the number of training bags and the number of instances per bag is limited, this simple representation is sufficiently robust and flexible for a subsequent classification.

More complicated bag distributions can be used as well. Assuming a Gaussian distribution puts heavier demands on the sample size, because a covariance matrix has to be estimated. But using just the mean of the instances (thereby assuming that all Gaussian distributions have the same covariance matrix) gives surprisingly good results, and for some problems this approach outperforms all other classifiers. This is most prominent in the (very) high-dimensional Web recommendation problem, where most classifiers fail to work well in the 5863-dimensional feature space.

An open issue is still what distribution parameters to use for classification. Not only can different distributions be chosen, the distributions can also be characterized in several equivalent ways (as equations (8) and (9) for the uniform distribution show). This can result in (slightly) different performances of the final classifier, as is shown in some of the experiments in this paper. Given the relatively low cost of computing this feature representation, it is feasible to use standard crossvalidation to find the optimal representation.

References

[1] S. Andrews, T. Hofmann, and I. Tsochantaridis. Multiple instance learning with generalized support vector machines. In Proceedings of the AAAI National Conference on Artificial Intelligence, 2002.

[2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 561–568. MIT Press, 2003.

[3] Chad Carson, Megan Thomas, Serge Belongie, Joseph M. Hellerstein, and Jitendra Malik. Blobworld: A system for region-based image indexing and retrieval. In VISUAL, pages 509–516, Amsterdam, 1999.


[4] Y. Chen, J. Bi, and J.Z. Wang. MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1931–1947, 2006.

[5] T.G. Dietterich, R.H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.

[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. Proc. of IEEE Computer Vision and Pattern Recognition, pages 524–531, 2005.

[7] T. Gärtner, P.A. Flach, A. Kowalczyk, and A.J. Smola. Multi-instance kernels. In C. Sammut and A. Hoffmann, editors, Proceedings of the 19th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 2002.

[8] S.J. Goldman. SIVAL (spatially independent, variable area, and lighting) benchmark, 1998.

[9] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. International journal of computer vision, 43(1):29–44, 2001.

[10] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, volume 10, pages 570–576. MIT Press, 1998.

[11] J.F. Murray, G.F. Hughes, and K. Kreutz-Delgado. Machine learning methods for predicting failures in hard drives: a multiple-instance application. Journal of Machine Learning Research, 6:783–816, 2005.

[12] P. Viola, J. Platt, and C. Zhang. Multiple instance boosting for object detection. In Advances in Neural Inf. Proc. Systems (NIPS 05), pages 1419–1426, 2005.

[13] X. Xu and E. Frank. Logistic regression and boosting for labeled bags of instances. In Proc. of the Pacific-Asia conference on knowledge discovery and data mining. Springer Verlag, 2004.

[14] Z.-H. Zhou, K. Jiang, and M. Li. Multi-instance learning based web mining. Applied Intelligence, 22(2):135–147, 2005.

[15] Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. Multi-instance learning by treating instances as non-i.i.d. samples. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1249–1256, New York, NY, USA, 2009. ACM.
