

Applying General Nonconformity Function to Transfer AdaBoost Algorithm

S. Zhou, E.N. Smirnov, H. Bou Ammar, R. Peeters

Department of Knowledge Engineering, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands

Abstract

This paper shows that the region classification task can benefit from instance-transfer learning. It proposes to implement a standard region-classification algorithm using a general nonconformity function based on the Transfer AdaBoost algorithm. The experiments show that the new approach produces valid class regions when instances are transferred from a close domain. The conditions for successful instance transfer are empirically derived.

1 Introduction

Most of the research in machine learning focuses on the task of point classification: predicting the correct class of an instance given a data sample drawn from some unknown target probability distribution. However, in applications with high misclassification costs, region classification is needed [6, 8]. The task of region classification is to find a region (set) of classes that contains the correct class of the instance to be classified with a given probability of error ε ∈ [0, 1]. Thus, by employing region classification we can control the error in the long run, which is of practical use only if the class regions are efficient, i.e., small.

This paper shows that the region classification task can benefit from instance-transfer learning [5]. In this type of learning, in addition to the target sample we have a second data sample generated by some unknown source probability distribution. The main assumption is that the target and source distributions are different but somehow similar. Thus, the region-classification task in this case is to find class regions according to the target probability distribution, given the target sample, by transferring relevant instances from the source sample.

To solve the region-classification task in the instance-transfer learning setting we note that (1) the conformal framework [6, 8] is a general framework for the region classification task, and (2) the Transfer AdaBoost algorithm is a base algorithm for instance-transfer learning [2]. Thus, our solution for the task is a combination of these two techniques.

To position our research with respect to relevant work, we note that instance-transfer learning has so far been applied only to the task of point classification [6, 8]. Thus, the region classification task considered in the instance-transfer learning setting and the approach we propose for it are novel; they form the main contributions of this paper.

The remainder of the paper is organized as follows. Section 2 formalizes the tasks of point classification and region classification in the traditional-learning and instance-transfer learning settings. Sections 3 and 4 introduce the conformal algorithm and the Transfer AdaBoost algorithm, respectively. The general nonconformity function derived from the Transfer AdaBoost algorithm for implementing the conformal algorithm is described in Section 5. The experimental setting and results are shown in Section 6. Finally, Section 7 concludes the paper.


2 Point and Region Classification

This section formalizes the tasks of point classification and region classification. The formalizations are given separately for the traditional-learning and instance-transfer learning settings in the next two subsections.

2.1 Traditional Learning Setting

Let X be an instance space and Y a class set. We assume an unknown probability distribution over the labelled space X × Y, namely the target distribution p^t(x, y). We consider a training sample D^t_n ⊆ X × Y defined as a bag ⦃(x_1, y_1)^t, (x_2, y_2)^t, ..., (x_n, y_n)^t⦄ of n instances (x_i, y_i)^t ∈ X × Y drawn from the probability distribution p^t(x, y).

Given the training sample D^t_n and an instance x_{n+1} ∈ X drawn according to p^t(x):

• the point classification task is to find an estimate ŷ ∈ Y of the class of the instance x_{n+1} according to p^t(x, y);

• the region classification task is to find a class region Γ^ε(D^t_n, x_{n+1}) ⊆ Y that contains the class of x_{n+1} according to p^t(x, y) with probability at least 1 − ε, where ε is a significance level.

In point classification, estimating the class of any instance x ∈ X assumes that we learn a point classifier h(D^t_n, x) in a hypothesis space H of point classifiers h (of type h : (X × Y)^(∗) × X → ℝ^{|Y|}) using the target sample D^t_n. The classifier h(D^t_n, x) outputs for any instance x a posterior distribution of scores {s_y}_{y∈Y} over all the classes in Y. The class y with the highest posterior score s_y is the estimated class ŷ for the instance x. In this context we note that the point classifier h(D^t_n, x) has to be learned such that it performs best on new unseen instances (x, y)^t ∈ X × Y drawn according to the target probability distribution p^t(x, y).

In region classification (according to the conformal algorithm [6, 8]), computing the class region for any instance x_{n+1} ∈ X requires two steps. First we derive a nonconformity function A that, given a class y ∈ Y, maps the sample D^t_n and the instance (x_{n+1}, y) to a nonconformity score α ∈ ℝ ∪ {∞}. Then we compute the p-value p_y of the class y for the instance x_{n+1} as the proportion of the instances in D^t_n ∪ ⦃(x_{n+1}, y)⦄ whose nonconformity scores are greater than or equal to that of the instance (x_{n+1}, y). The class y is added to the final class region for the instance x_{n+1} if p_y > ε. In this context we note that the nonconformity function A has to be learned such that it performs best on new unseen instances (x, y)^t ∈ X × Y drawn according to the target probability distribution p^t(x, y).

2.2 Instance-Transfer Learning Setting

In instance-transfer learning, in addition to the instance space X, the class set Y, the target distribution p^t(x, y), and the target sample D^t_n, we have a second unknown probability distribution over X × Y, namely the source distribution p^s(x, y), and a source sample D^s_m defined as a bag of m instances (x_i, y_i)^s ∈ X × Y drawn from p^s(x, y). Assuming that the target distribution p^t(x, y) and the source distribution p^s(x, y) are different but somehow similar, we observe that the source sample D^s_m cannot be used directly, but certain instances of the source sample can still be used together with the target sample D^t_n. In this context we define:

• instance-transfer point classification as a point classification task for which we learn the point classifier h(D^t_n, x) by transferring relevant instances from the source sample D^s_m in addition to the target sample D^t_n;

• instance-transfer region classification as a region classification task for which we learn the nonconformity function A by transferring relevant instances from the source sample D^s_m in addition to the target sample D^t_n.


3 The Conformal Algorithm

This section introduces the conformal algorithm [6, 8]. It formalizes the algorithm and introduces metrics for evaluating region classifiers.

3.1 Formal Description

The conformal algorithm was proposed in [6, 8] for region classification. The algorithm is proven to be valid [6, 8] when the target sample D^t_n and each instance x_{n+1} ∈ X to be classified are drawn from the same unknown target distribution p^t(x, y) under the exchangeability assumption. The exchangeability assumption holds when different orderings of the instances in a bag are equally likely. The validity of the conformal algorithm means that it constructs for any object x_{n+1} a class region Γ^ε(D^t_n, x_{n+1}) ⊆ Y containing the correct class y ∈ Y of x_{n+1} with probability at least 1 − ε, where ε is a significance level. The conformal algorithm outputs valid class regions for any real-valued function used as a nonconformity function [6].

The conformal algorithm can be used with any type of point classifier. Applying the conformal algorithm is a two-stage process. First, given a point classifier h(D^t_n, x), a nonconformity function is constructed for h(D^t_n, x) that measures how unusual an instance looks relative to the other instances in the data bag. Then the conformal algorithm employing this nonconformity function is applied to compute the class regions. The efficiency of a conformal classifier depends on the nonconformity function it adopts: it is efficient if the produced class regions are usually relatively small and therefore informative.

Formally, a nonconformity function is of type A : (X × Y)^(∗) × (X × Y) → ℝ ∪ {∞}. Given a bag D^t_n ∈ (X × Y)^(∗) and an instance (x, y) ∈ X × Y, it returns a value α indicating how unusual the instance (x, y) is with respect to the instances in D^t_n. In general, the function A returns different scores for the instance (x, y) depending on whether (x, y) is in the bag D^t_n (added prediction) or not (deleted prediction): if (x, y) ∈ D^t_n, the score is lower; otherwise it is higher. Since there is no consensus on this issue [6], deciding which option (added prediction or deleted prediction) to use has to be done experimentally.

The conformal algorithm is presented in Algorithm 1. Given a significance level ε ∈ [0, 1], a target sample D^t_n, an instance x_{n+1} ∈ X to be classified, and the nonconformity function A for a point classifier h(D^t_n, x), the algorithm constructs a class region Γ^ε(D^t_n, x_{n+1}) ⊆ Y for the instance x_{n+1}. The class-region construction is realized separately for each class y ∈ Y. To decide whether to include the class y in the class region Γ^ε(D^t_n, x_{n+1}), the instance x_{n+1} and the class y are first combined into the labelled instance (x_{n+1}, y). Then the algorithm computes the nonconformity score α_i for each instance (x_i, y_i) in the bag D^t_{n+1} = D^t_n ∪ ⦃(x_{n+1}, y)⦄, using the nonconformity function A. The nonconformity scores are used for computing the p-value p_y of the class y for the instance x_{n+1}. More precisely, p_y is computed as the proportion of the instances in the bag D^t_{n+1} whose nonconformity scores α_i are greater than or equal to that of the instance (x_{n+1}, y). Once p_y is set, the algorithm includes the class y in the class region Γ^ε(D^t_n, x_{n+1}) if p_y > ε.

The conformal algorithm was originally designed for the online learning setting. This setting starts with an empty data bag D^t_0. Then, for each integer n from 0 to +∞, we first construct the class region Γ^ε(D^t_n, x_{n+1}) for the new instance x_{n+1} being classified, and then add the instance (x_{n+1}, y_r) to the bag, where y_r is the correct class of x_{n+1}. The conformal algorithm is proven to be valid in the online setting [6]; i.e., it constructs for any instance a class region containing the correct class with probability at least 1 − ε. Moreover, there are reported experiments in the offline (batch) setting [7]; they show that the conformal algorithm produces valid class regions in this setting as well. In contrast to the online setting, the offline setting assumes a data bag that is initially non-empty and remains the same throughout the classification process.

3.2 Evaluation Metrics

Any class region Γ^ε(D^t_n, x_{n+1}) is valid if it contains the correct class y ∈ Y of the instance x_{n+1} ∈ X being classified with probability at least 1 − ε. To evaluate experimentally the validity of the class regions produced by the conformal algorithm, we introduce the error metric E, defined as the proportion of the class regions that do not contain the correct class. Therefore, the conformal algorithm is valid if the error E is less than or equal to ε for all significance levels ε ∈ [0, 1].


Algorithm 1 Conformal algorithm

Input:  significance level ε, target sample D^t_n, instance x_{n+1} to be classified,
        nonconformity function A for a point classifier h(D^t_n, x).
Output: class region Γ^ε(D^t_n, x_{n+1})

1:  Γ^ε(D^t_n, x_{n+1}) := ∅.
2:  for each class y ∈ Y do
3:      D^t_{n+1} := D^t_n ∪ ⦃(x_{n+1}, y)⦄.
4:      for i := 1 to n + 1 do
5:          if using deleted prediction then
6:              set nonconformity score α_i := A(D^t_{n+1} \ ⦃(x_i, y_i)⦄, (x_i, y_i)).
7:          else if using added prediction then
8:              set nonconformity score α_i := A(D^t_{n+1}, (x_i, y_i)).
9:          end if
10:     end for
11:     calculate p_y := #{i = 1, ..., n + 1 | α_i ≥ α_{n+1}} / (n + 1).
12:     include y in Γ^ε(D^t_n, x_{n+1}) if and only if p_y > ε.
13: end for
14: output Γ^ε(D^t_n, x_{n+1}).


Any class region Γ^ε(D^t_n, x_{n+1}) is efficient if it is non-empty and small. We therefore propose to evaluate the efficiency of class regions by the following three metrics: the percentage P_e of empty class regions, the percentage P_s of single-class regions, and the percentage P_m of multiple-class regions. The empty, single-class, and multiple-class regions can be characterized by their own errors. The percentage P_e of empty class regions is essentially an error, since the correct classes are not in those regions. The error E_s on single-class regions is defined as the proportion of invalid single-class regions among all the class regions. The error E_m on multiple-class regions is defined as the proportion of invalid multiple-class regions among all the class regions.

It is straightforward to prove that the error E is composed of the errors P_e, E_s, and E_m; more precisely, E = P_e + E_s + E_m. The error E has an upper bound E_u representing the worst case, in which the correct classes can never be picked from valid multiple-class regions. In this case all the multiple-class regions would lead to errors; therefore E_u is defined as P_e + E_s + P_m. We note that for any significance level ε ∈ [0, 1] there is no guarantee that E_u is less than or equal to ε unless P_m = 0.
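As a sketch of how these metrics relate, assume class regions represented as Python sets paired with the corresponding correct classes; the representation and the function name are ours, not the paper's.

    def region_metrics(regions, correct):
        """Compute E, P_e, P_s, P_m, and the upper bound E_u = P_e + E_s + P_m
        from predicted class regions and the corresponding correct classes."""
        n = len(regions)
        P_e = sum(len(r) == 0 for r in regions) / n    # empty regions (always errors)
        P_s = sum(len(r) == 1 for r in regions) / n    # single-class regions
        P_m = sum(len(r) >= 2 for r in regions) / n    # multiple-class regions
        E_s = sum(len(r) == 1 and y not in r for r, y in zip(regions, correct)) / n
        E_m = sum(len(r) >= 2 and y not in r for r, y in zip(regions, correct)) / n
        E = P_e + E_s + E_m                            # total error, as in subsection 3.2
        E_u = P_e + E_s + P_m                          # worst case: multi-class regions never resolved
        return dict(E=E, P_e=P_e, P_s=P_s, P_m=P_m, E_u=E_u)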

4 Transfer AdaBoost Algorithm

The Transfer AdaBoost algorithm [2] (denoted TrAdaBoost) is a learning method for instance-transfer point classification (see Algorithm 2). The TrAdaBoost algorithm uses instances from both the target sample D^t_n and the source sample D^s_m as training instances. Each training instance is associated with a weight w_i; initially, uniform weights are assigned over all instances. Given a base point classifier h(B, x), the algorithm runs iteratively. In each iteration a base classifier h_k(B, x) is built on the weighted training instances and evaluated only on the weighted target instances. After the evaluation, the weights of all training instances are updated. The TrAdaBoost algorithm employs different reweighing schemes for the target sample D^t_n and the source sample D^s_m. For instances from the target sample D^t_n, it increases the weights of incorrectly classified instances and decreases the weights of correctly classified instances through normalization. Thus, the subsequent classifiers are tweaked in favour of the hard instances misclassified by previous classifiers. On the contrary, for the source sample D^s_m, it decreases the weights of incorrectly classified instances and, through normalization, increases the weights of correctly classified instances. This means that source instances less likely to be generated by the target distribution receive lower weights, while source instances more likely to be generated by the target distribution receive higher weights. Therefore, the TrAdaBoost algorithm amplifies the effect of hard target instances as well as the effect of similar source instances in the subsequent iterations.

Algorithm 2 Transfer AdaBoost algorithm, adapted from Dai et al. [2]

Input: two labelled data samples D^t_n and D^s_m, base point classifier h(B, x), number of iterations T.

1:  For every instance (x_i, y_i) ∈ D^s_m ∪ D^t_n initialize the weight w_1(x_i) := 1. Let p_k denote the vector of normalized weights of the instances in D^s_m ∪ D^t_n and p^t_k the vector of normalized weights of the instances in D^t_n.
2:  for k = 1 to T do
3:      Train the base classifier h_k : X → Y on D^s_m ∪ D^t_n using the normalized weights from p_k.
4:      Calculate the weighted error ε_k of h_k on D^t_n using the normalized weights from p^t_k.
5:      if ε_k = 0 or ε_k ≥ 1/2 then
6:          Set T := k − 1.
7:          Abort loop.
8:      end if
9:      Set β := 1 / (1 + √(2 ln m / T)) and β_k := ε_k / (1 − ε_k).
10:     Update the weight of every instance (x_i, y_i) ∈ D^s_m ∪ D^t_n:
            w^s_{k+1}(x_i) := w^s_k(x_i) · β^{[h_k(x_i) ≠ y_i]}       if (x_i, y_i) ∈ D^s_m;
            w^t_{k+1}(x_i) := w^t_k(x_i) · β_k^{−[h_k(x_i) ≠ y_i]}    if (x_i, y_i) ∈ D^t_n.
11: end for
12: Output the strong classifier: h_f(x) = sign( Σ_{k=⌈T/2⌉}^{T} ln(1/β_k) · h_k(x) ).
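As an illustration of the update rules in Algorithm 2, here is a minimal Python sketch, assuming a binary task with labels in {−1, +1}; the helper fit_base (which trains a weighted base classifier and returns a predict callable) is hypothetical and not part of the paper.

    import numpy as np

    def tradaboost(Xs, ys, Xt, yt, fit_base, T=20):
        """Sketch of TrAdaBoost (Dai et al. [2]); labels in {-1, +1} assumed."""
        m, n = len(Xs), len(Xt)
        X = np.vstack([Xs, Xt])
        y = np.concatenate([ys, yt])
        w = np.ones(m + n)                                # step 1: uniform weights
        beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(m) / T)) # step 9: source discount factor
        hs, betas = [], []
        for _ in range(T):
            p = w / w.sum()                               # normalized weights p_k
            h = fit_base(X, y, p)                         # step 3: weighted base learner
            miss = (h(X) != y).astype(float)
            pt = w[m:] / w[m:].sum()
            eps_k = float(pt @ miss[m:])                  # step 4: error on target only
            if eps_k == 0 or eps_k >= 0.5:                # step 5: stop condition
                break
            beta_k = eps_k / (1.0 - eps_k)
            w[:m] *= beta ** miss[:m]                     # step 10: down-weight misclassified source
            w[m:] *= beta_k ** (-miss[m:])                #          up-weight misclassified target
            hs.append(h)
            betas.append(beta_k)
        def strong(x):                                    # step 12: vote over the second half
            half = len(hs) // 2
            return np.sign(sum(np.log(1.0 / bk) * h(x)
                               for h, bk in zip(hs[half:], betas[half:])))
        return strong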

5 Transfer AdaBoost Nonconformity Function

To solve the instance-transfer region classification task we propose to apply the conformal algorithm with a nonconformity function based on the Transfer AdaBoost algorithm. Given a training sample D^t_n and an instance (x, y_r) ∈ X × Y, the nonconformity function outputs the sum Σ_{y∈Y, y≠y_r} s_y, where s_y is the score for class y ∈ Y produced by the Transfer AdaBoost point classifier h(D^t_n, x). In this context one important property of the Transfer AdaBoost algorithm has to be pointed out: if the instance (x, y_r) is in the training sample D^t_n, the score s_{y_r} increases with the number of Transfer AdaBoost iterations, while the scores s_y for the remaining classes y ∈ Y \ {y_r} decrease. This implies that as the number of iterations grows, the value of the nonconformity function approaches zero, which results in a poor estimate of the nonconformity of the instance. Therefore, we have to make sure that the instance (x, y_r) for which the nonconformity score is calculated is taken out of the training sample D^t_n. This means that when the conformal algorithm is used with the general nonconformity function based on the Transfer AdaBoost algorithm, deleted prediction should be used (see Algorithm 1). With deleted prediction, computing the p-value p_y for one class y ∈ Y requires |D^t_n| runs of the Transfer AdaBoost algorithm. Therefore, the time complexity of computing the class region for one instance becomes O(|Y| |D^t_n| T_h), where T_h is the time complexity of training the Transfer AdaBoost classifier h.
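Stated as code, the proposed nonconformity function might look as follows; this is a sketch under the assumption of a helper tradaboost_scores that trains the Transfer AdaBoost classifier on the given bags and returns the per-class scores s_y as a dictionary (the helper name and signature are ours).

    def tradaboost_nonconformity(target_bag, source_bag, instance, tradaboost_scores):
        """Nonconformity of (x, y_r): sum of the scores of all classes other than y_r.

        Deleted prediction: (x, y_r) must already be removed from target_bag;
        otherwise the score sum shrinks toward zero as the boosting iterations grow.
        """
        x, y_r = instance
        scores = tradaboost_scores(target_bag, source_bag, x)  # hypothetical: {class: s_y}
        return sum(s for y, s in scores.items() if y != y_r)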


6 Experiments and Discussion

This section presents our experiments with the conformal algorithm from subsection 3.1. Given a target sample D^t_n and a source sample D^s_m, the algorithm was instantiated using three types of nonconformity functions (see Table 1). The generalization performance of these three algorithms is given in terms of the validity and efficiency of the produced class regions. We note that CAdaBoostT is used as a baseline algorithm. The algorithms CAdaBoostTS and CTrAdaBoostTS are used in order to decide whether we need to directly add or to transfer relevant source instances.

Table 1: Descriptions of the conformal algorithms

  CAdaBoostT      the weight-based nonconformity function based on the AdaBoost algorithm [3] trained on the sample D^t_n.
  CAdaBoostTS     the weight-based nonconformity function based on the AdaBoost algorithm [3] trained on the sample D^t_n ∪ D^s_m.
  CTrAdaBoostTS   the weight-based nonconformity function based on the Transfer AdaBoost algorithm trained on the sample D^t_n as a target sample and the sample D^s_m as a source sample.

6.1 Data sets

The datasets for our experiments were taken from the UCI Machine Learning Repository [1]. In order to fit the transfer-learning scenario, each data set was split into a target sample and a source sample with different probability distributions. For example, the data set heart-c was split using the binary attribute sex: the target sample was composed of all the instances with the attribute value sex=female, while the source sample was composed of all the instances with the attribute value sex=male. Table 2 below describes each data-set split. We note that three datasets, namely colic, breast-cancer, and hepatitis, were split twice using different binary attributes. Moreover, the KL-divergence [4] of the target class distribution from the source class distribution, a non-symmetric measure of the difference between two probability distributions, is given for each data set; a sketch of this computation is given after Table 2.

Table 2: Descriptions of the data sets

  Data Set         Classes   KL-divergence   |D^t|   |D^s|
  credit              2          0.005         88     120
  colic1              2          0.020         28     340
  colic2              2          0.052         96     272
  breast-cancer1      2          0.054         68     109
  lymph               4          0.061         73      75
  anneal              4          0.118         49     103
  heart-c             2          0.187         60     207
  breast-cancer2      2          0.211         56     115
  hepatitis1          2          0.257         61      94
  hepatitis2          2          0.311         54     101
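For illustration, the attribute-based split and the reported KL-divergence could be computed as in the following sketch; the pandas-based layout, the column names (sex, class), and the function names are our assumptions, following the heart-c example above.

    import numpy as np
    import pandas as pd

    def split_by_attribute(df: pd.DataFrame, attr: str, target_value):
        """Split a dataset into target/source samples on a binary attribute,
        e.g. heart-c with attr='sex': females -> target, males -> source."""
        target = df[df[attr] == target_value].drop(columns=[attr])
        source = df[df[attr] != target_value].drop(columns=[attr])
        return target, source

    def class_kl(target: pd.DataFrame, source: pd.DataFrame, label: str = "class") -> float:
        """KL-divergence of the target class distribution from the source one [4]."""
        classes = sorted(set(target[label]) | set(source[label]))
        p = target[label].value_counts(normalize=True).reindex(classes).fillna(0).to_numpy()
        q = source[label].value_counts(normalize=True).reindex(classes).fillna(0).to_numpy()
        mask = p > 0  # terms with p = 0 contribute nothing; q = 0 with p > 0 gives infinity
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))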

6.2 Validation Setup

Experiments were performed on the ten data sets from Table 2 using all three conformal algorithms: CAdaBoostT, CAdaBoostTS, and CTrAdaBoostTS. The base point classifier for these three algorithms was the Naive Bayes classifier for all two-class datasets, while for the multi-class datasets it was the decision-stump classifier (the time complexity of computing a class region grows proportionally with the number of classes, so the simpler decision stump was used to keep the computation time manageable). The class regions of the algorithms were evaluated in terms of validity and efficiency using the five metrics defined in subsection 3.2: the error E, the percentage P_e of empty class regions, the percentage P_s of single-class regions, the percentage P_m of multiple-class regions, and the upper-bound error E_u. The evaluation method was repeated stratified 10-fold cross-validation. Results for each data set were generated for 10 to 50 boosting iterations (with step 10). The best results for each algorithm over the five iteration numbers are reported in Table 3 for two significance levels, ε = 0.05 and ε = 0.1.

Table 3: Performance of the CAdaBoostT, CAdaBoostTS, and CTrAdaBoostTS algorithms

                                  ----------- ε = 0.05 -----------    ----------- ε = 0.1 ------------
  Data set         Algorithm        E     P_e    P_s    P_m    E_u      E     P_e    P_s    P_m    E_u
  credit           CAdaBoostT     0.047  0.000  0.182  0.818  0.865   0.102  0.000  0.384  0.616  0.718
                   CAdaBoostTS    0.049  0.000  0.252  0.748  0.797   0.086  0.000  0.428  0.572  0.658
                   CTrAdaBoostTS  0.047  0.000  0.268  0.732  0.779   0.089  0.000  0.419  0.581  0.670
  colic1           CAdaBoostT     0.071  0.000  0.238  0.762  0.833   0.124  0.000  0.388  0.612  0.736
                   CAdaBoostTS    0.062  0.000  0.164  0.836  0.893   0.117  0.000  0.288  0.712  0.829
                   CTrAdaBoostTS  0.031  0.000  0.498  0.502  0.533   0.064  0.000  0.664  0.336  0.400
  colic2           CAdaBoostT     0.047  0.000  0.180  0.820  0.867   0.097  0.000  0.308  0.692  0.789
                   CAdaBoostTS    0.046  0.000  0.144  0.856  0.902   0.113  0.000  0.269  0.731  0.844
                   CTrAdaBoostTS  0.051  0.000  0.294  0.706  0.757   0.085  0.000  0.405  0.595  0.680
  breast-cancer1   CAdaBoostT     0.066  0.000  0.165  0.835  0.901   0.117  0.000  0.256  0.744  0.861
                   CAdaBoostTS    0.040  0.000  0.095  0.905  0.945   0.089  0.000  0.181  0.819  0.908
                   CTrAdaBoostTS  0.065  0.000  0.208  0.792  0.857   0.119  0.000  0.382  0.618  0.737
  lymph            CAdaBoostT     0.056  0.000  0.278  0.722  0.769   0.080  0.000  0.359  0.641  0.715
                   CAdaBoostTS    0.040  0.000  0.090  0.910  0.935   0.068  0.000  0.278  0.722  0.777
                   CTrAdaBoostTS  0.052  0.000  0.291  0.709  0.753   0.063  0.000  0.364  0.636  0.694
  anneal           CAdaBoostT     0.063  0.000  0.227  0.773  0.808   0.083  0.000  0.313  0.687  0.739
                   CAdaBoostTS    0.016  0.000  0.046  0.954  0.967   0.022  0.000  0.065  0.935  0.956
                   CTrAdaBoostTS  0.046  0.000  0.291  0.709  0.754   0.054  0.000  0.454  0.546  0.588
  heart-c          CAdaBoostT     0.050  0.000  0.318  0.682  0.732   0.100  0.000  0.510  0.490  0.590
                   CAdaBoostTS    0.041  0.000  0.234  0.766  0.807   0.121  0.000  0.412  0.588  0.709
                   CTrAdaBoostTS  0.040  0.000  0.350  0.650  0.690   0.098  0.000  0.546  0.454  0.552
  breast-cancer2   CAdaBoostT     0.065  0.000  0.187  0.813  0.878   0.106  0.000  0.301  0.699  0.805
                   CAdaBoostTS    0.067  0.000  0.117  0.883  0.950   0.106  0.000  0.198  0.802  0.908
                   CTrAdaBoostTS  0.073  0.000  0.227  0.773  0.845   0.112  0.000  0.361  0.639  0.751
  hepatitis1       CAdaBoostT     0.057  0.000  0.233  0.767  0.824   0.099  0.000  0.368  0.632  0.731
                   CAdaBoostTS    0.068  0.000  0.139  0.861  0.929   0.129  0.000  0.261  0.739  0.868
                   CTrAdaBoostTS  0.048  0.000  0.191  0.809  0.857   0.093  0.000  0.343  0.657  0.750
  hepatitis2       CAdaBoostT     0.056  0.000  0.412  0.588  0.644   0.088  0.016  0.560  0.424  0.512
                   CAdaBoostTS    0.057  0.000  0.264  0.736  0.793   0.105  0.000  0.388  0.612  0.717
                   CTrAdaBoostTS  0.051  0.000  0.238  0.762  0.813   0.081  0.009  0.474  0.517  0.598

6.3 Results and Discussion

The performance statistics are given in Table 3. The table shows that the class regions computed by the algorithms are valid: the error E is close to the significance level ε up to some negligible statistical fluctuation. Thus we can state one of the main results of this paper, namely that the conformal algorithm is capable of producing valid class regions for the instance-transfer region-classification task.

In addition, Table 2 in combination with Table 3 shows how the instance-transfer approach can help learning. When the target and source distributions are almost the same (very small KL-divergence) and the size of the target sample is relatively small, both conformal algorithms CAdaBoostTS and CTrAdaBoostTS give a smaller error E and upper-bound error E_u than CAdaBoostT, which results in more efficient class regions. This observation is illustrated by the performance of the algorithms on the dataset credit. The performance statistics of CAdaBoostTS and CTrAdaBoostTS are close to each other, so in this case either of the two algorithms can be applied; i.e., the auxiliary instances improve the results whether or not transfer is used. When the distance between the target and source distributions increases (KL-divergence in [0.02, 0.25]), the percentage P_m of multiple-class regions and the upper-bound error E_u of CTrAdaBoostTS are smaller than those of CAdaBoostT on all the datasets, in particular when the size of the target sample is relatively small. This observation is illustrated best by the performance of the algorithms on the datasets colic1 and anneal. On the other hand, CAdaBoostTS always results in a bigger percentage P_m of multiple-class regions and a bigger upper-bound error E_u than CAdaBoostT. Thus in this case CTrAdaBoostTS should be applied; i.e., instance transfer does improve the final results. When the distance between the target and source distributions becomes larger (KL-divergence above 0.25), the performance statistics of CAdaBoostTS are again worse than those of CAdaBoostT. Unfortunately, in this case CTrAdaBoostTS also gives negative transfer results (see the datasets hepatitis1 and hepatitis2). Therefore, in this case introducing the source sample is not a good option.

7 Conclusion

This paper showed that the instance-transfer learning approach can be applied to improve learning results for the region classification task. The proposed solution is the conformal algorithm employing a nonconformity function based on the Transfer AdaBoost algorithm. The experiments showed that the approach yields valid class regions. In addition, instance transfer was experimentally shown to significantly improve the efficiency of the class regions under certain conditions: the source distribution has to be sufficiently close to the target distribution and the target sample relatively small.

References

[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.

[2] W. Dai, Q. Yang, G. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, pages 193–200. ACM, 2007.

[3] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148–156, 1996.

[4] S. Kullback and R.A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

[5] S.J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[6] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371–421, March 2008.

[7] S. Vanderlooy, L. van der Maaten, and I.G. Sprinkhuizen-Kuyper. Off-line learning with transductive confidence machines: An empirical evaluation. Volume 4571 of Lecture Notes in Computer Science, pages 310–323. Springer, 2007.

[8] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
