
Delft University of Technology

Robust domain-adaptive discriminant analysis

Kouw, Wouter; Loog, Marco

DOI

10.1016/j.patrec.2021.05.005

Publication date

2021

Document Version

Final published version

Published in

Pattern Recognition Letters

Citation (APA)

Kouw, W., & Loog, M. (2021). Robust domain-adaptive discriminant analysis. Pattern Recognition Letters,

148, 107-113. https://doi.org/10.1016/j.patrec.2021.05.005

Important note

To cite this publication, please use the final published version (if applicable).

Please check the document version above.

Copyright

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy

Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology.


Robust domain-adaptive discriminant analysis

Wouter M. Kouw a,b,∗, Marco Loog b,c

a Department of Electrical Engineering, Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, the Netherlands
b Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, the Netherlands
c Datalogisk Institut, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark

ARTICLE INFO

Article history:
Received 7 September 2019
Revised 26 March 2021
Accepted 3 May 2021
Available online 20 May 2021

MSC: 41A05, 41A10, 65D05, 65D17

Keywords: Domain adaptation, Robust estimator, Discriminant analysis, Transduction

ABSTRACT

Consider a domain-adaptive supervised learning setting, where a classifier learns from labeled data in a source domain and unlabeled data in a target domain to predict the corresponding target labels. If the classifier's assumption on the relationship between domains (e.g. covariate shift, common subspace, etc.) is valid, then it will usually outperform a non-adaptive source classifier. If its assumption is invalid, it can perform substantially worse. Validating assumptions on domain relationships is not possible without target labels. We argue that, in order to make domain-adaptive classifiers more practical, it is necessary to focus on robustness; robust in the sense that an adaptive classifier will still perform at least as well as a non-adaptive classifier without having to rely on the validity of strong assumptions. With this objective in mind, we derive a conservative parameter estimation technique, which is transductive in the sense of Vapnik and Chervonenkis, and show for discriminant analysis that the new estimator is guaranteed to achieve a lower risk on the given target samples compared to the source classifier. Experiments on problems with geographical sampling bias indicate that our parameter estimator performs well.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

1. Introduction

Generalization in supervised learning relies on the fact that future samples originate from the same underlying data-generating distribution as the ones used for training. However, this is not the case in settings where data is collected from different locations, different measurement instruments are used, or there is only access to biased data [25]. In these situations the labeled data does not represent the distribution of interest. This problem setting is referred to as a domain adaptation setting, where the distribution of the labeled data is called the source domain and the distribution of interest is called the target domain [3,15]. Most often, data in the target domain is not labeled and adapting a source domain classifier, i.e., changing predictions to suit the target domain, is the only means by which one can make accurate predictions. Unfortunately, depending on the domain dissimilarity, adaptive classifiers can easily perform worse than non-adaptive ones. We formulate a conservative adaptive classifier that always performs at least as well as the non-adaptive one.¹

✩ Handled by Associate Editor Francesco Tortorella.
∗ Corresponding author.
E-mail address: w.m.kouw@tue.nl (W.M. Kouw).
¹ A shortened, preliminary version was accepted for S+SSPR [16]. The current version offers a significant extension with a clearer exposition, additional technical details and references, more experiments, and a comprehensive analysis and discussion.

In the general setting, domains can be arbitrarily different, which means generalization will be extremely difficult. However, there are cases where the problem setting is more structured: in the covariate shift setting, the marginal data distributions differ but the posterior distributions are equal [5,9,28]. In such cases, a correctly specified adaptive classifier will converge to the same solution as the target classifier [9]. One way to carry out adaptation is by weighing each source sample by how important it is under the target distribution and training on the importance-weighted labeled source data. However, such a classifier can perform poorly when applied to settings where the covariate shift assumption is false, i.e., where the posterior distributions from both domains are not equal [8,19]. In that case, one often observes that a few samples are given large weights and all other samples are given near-zero weights, which greatly reduces the effective sample size [23, Chapter 8]. Sensitivity to domain relationship assumptions is not restricted to covariate shift. Another adaptive algorithm, Transfer Component Analysis (TCA), assumes the existence of a latent representation common to both domains. When that does not hold, mapping both source and target data onto transfer components will result in mixing of the class-conditional distributions and performance will deteriorate [24].

Since the validity of the aforementioned assumptions is difficult, if not impossible, to check, it is of interest to design robust classifiers. Robustness to uncertainty is often achieved through minimax optimization [17]. An example of a robust adaptive classifier is Robust Covariate Shift Adjustment (RCSA), which first maximizes risk with respect to the importance-weights and subsequently minimizes risk with respect to the classifier parameters [32]. It attempts to account for estimation errors in importance-weights. Another example is the Robust Bias-Aware (RBA) classifier, which plays a game between a risk-minimizing target classifier and a risk-maximizing target posterior distribution [19]. The adversary is constrained to pick posteriors that match the moments of the source distribution statistics, to avoid posterior probabilities that result in degenerate classifiers (e.g. assign all posterior probabilities to 1). Matching moments means that RBA classifiers lose predictive power in areas of feature space where the source distribution has limited support. Note that both robust methods still rely on assuming covariate shift.

Our main contribution is a parameter estimator that produces estimates with a risk that is always lower than or equal to the risk of the source classifier, with respect to the given target samples. It does so without making domain relationship assumptions such as covariate shift, but by constructing a specific type of risk that can be considered transductive in the sense originally defined by Vapnik and Chervonenkis [see 30]. Furthermore, we show that in the case of discriminant analysis, the estimator will produce strictly smaller risks on the target data. To the best of our knowledge, such performance guarantees compared to the source classifier have not been shown before.

The paper is outlined as follows: Section 3 presents the formulation of our method, with discriminant analysis in Section 4. Section 5.1 shows experiments on two data sets involving geographical sampling bias, indicating that our estimator consistently performs among the best. We conclude with limitations and a discussion in Section 6. To start with, the next section briefly introduces the specific domain adaptation setting that we consider and comments on the transductive nature of our particular approach.

2. Domain adaptation and transduction

A domain is defined here as a particular joint probability distribution over a $D$-dimensional input space $\mathcal{X} \subseteq \mathbb{R}^D$ and a $K$-dimensional output space of one-hot vectors $\mathcal{Y} = \{b \in \{0,1\}^K : \sum_k b_k = 1\}$ [15]. Let $\mathcal{S}$ mark a source domain, with $n$ samples $x = (x_1, \ldots, x_n)$ and corresponding labels $y = (y_1, \ldots, y_n) \in \mathcal{Y}^n$ drawn from the source domain's joint distribution. Similarly, let $\mathcal{T}$ mark a target domain, with $m$ samples $z = (z_1, \ldots, z_m)$ and corresponding labels $u = (u_1, \ldots, u_m)$ drawn from the target domain's joint distribution. The target labels $u$ are unknown at training time and the goal is to predict them, using only the unlabeled target samples $z$ and the labeled source samples $(x, y)$.

2.1. The meaning of transduction

Given that the primary performance measure in this work is specifically the risk on the unlabeled data of the target domain that is available to us, our objective is essentially transductive [see 15]. This is in line with the original definition of transduction as proposed by Vapnik and Chervonenkis [see 30].

It should be pointed out that, confusingly, what is referred to as transductive for most transfer learning and domain adaptation methods just means that there is labeled data available for the source but not for the target domain [see also 15]. The classifiers considered in papers such as [1,10,13], like most papers in our review work [15], do not focus on the unlabeled samples in the target domain in particular and are actually not transductive in the sense of Vapnik and Chervonenkis [see also 15]. Works like [27,29] exploit graph methods that do not have a ready out-of-sample extension and are therefore transductive in the sense of Vapnik and Chervonenkis. As Section 3 shows, our method focuses particularly on the risk obtained on the given target data and is, as such, transductive. As it turns out, it is specifically this approach that can provide us with performance guarantees, where other techniques cannot.

Fig. 1. Example domain adaptation setting. (Left) Labeled source domain data, (right) labeled target domain data. Black lines show a classifier trained on source data, applied to source data (left) and target data (right).

We should note that, typically, our target classifiers can still be used for classifying new and unseen target domain samples. That is, they can also be used for inductive inference. This is especially the case if the samples from the target domain can be considered representative of that domain. In that case, the performance on those particular target domain instances can equally well be interpreted as a regular empirical risk, used in standard empirical risk minimization [26,31]. Just as in the supervised learning setting, it is then assumed that having a small empirical risk carries over to a small generalization error and that the classifier can be successfully employed inductively.

As a final remark, we like to state that the benefits of transduction over induction, or vice versa, are not always easily identified. Especially because in many settings, inductive classifiers can be used for transduction and the other way around. Refer to Chapter 25 in [6] for further views and considerations.

2.2. Example

Fig. 1 visualizes some concepts used throughout the paper. On the left are shown samples from the source domain, labeled as points (red) versus crosses (blue). These were drawn from isotropic Gaussians centered at [−2, 0] and [+2, 0], respectively. The black lines are a contour plot of the posterior probabilities of a classifier trained on the source data. On the right is shown data from the target domain, as well as the source classifier applied to the target data. These target samples were drawn from two Gaussian distributions, both with covariance matrix [3, 2; 2, 4] but one with a mean of [−1, 2] and one with a mean of [+2, 1]. The source and target domains are therefore related to each other through an affine transformation. Note that the source classifier does not fit the target data well.
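To make the setting concrete, the following sketch (an illustration, not the authors' code; NumPy and the per-class sample sizes are assumptions) draws data from the distributions described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 50  # assumed number of samples per class in each domain

# Source domain: isotropic Gaussians centered at [-2, 0] and [+2, 0].
x = np.vstack([rng.multivariate_normal([-2, 0], np.eye(2), n),
               rng.multivariate_normal([+2, 0], np.eye(2), n)])
y = np.repeat([0, 1], n)

# Target domain: shared covariance [[3, 2], [2, 4]], means [-1, 2] and [+2, 1].
cov_t = np.array([[3.0, 2.0], [2.0, 4.0]])
z = np.vstack([rng.multivariate_normal([-1, 2], cov_t, m),
               rng.multivariate_normal([+2, 1], cov_t, m)])
u = np.repeat([0, 1], m)  # target labels; hidden from the learner at training time
```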

3. Robust estimator for target domain

In the following, we present the construction of our estimator. First, we discuss the risk of the classifier in the target domain. Secondly, we compare the target risk of a proposal classifier with the target risk incurred by the source classifier, and thirdly, we assume a worst-case labeling for the given target samples.

3.1. Target risk

The empirical risk of a classifier in the source domain is computed as the average loss with respect to source samples $(x, y)$:
$$\hat{R}(h \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(h \mid x_i, y_i), \qquad (1)$$
where $h$ is the classification function mapping input to labels and $\ell$ is a loss function comparing the classifier's prediction $h(x_i)$ with the true label $y_i$ at training time. Since the classification error, or 0-1 loss, cannot be directly optimized over, it is customary to choose surrogate loss functions, such as the quadratic loss $(h(x_i) - y_i)^2$ [11]. The source classifier is the classifier found by minimizing the empirical risk with respect to source samples:
$$\hat{h}_{\mathcal{S}} = \underset{h \in \mathcal{H}}{\arg\min}\ \hat{R}(h \mid x, y), \qquad (2)$$
where $\mathcal{H}$ refers to the hypothesis space.

Since the source classifier does not incorporate any part of the target domain, it is essentially entirely naive of it. But source domains are chosen for a reason, often because they are the most similar data available, and source classifiers are subsequently regarded as the best alternative for classifying the target domain. To evaluate $\hat{h}_{\mathcal{S}}$ in the target domain, the risk of the classifier with respect to target samples $(z, u)$ is computed:
$$\hat{R}(\hat{h}_{\mathcal{S}} \mid z, u) = \frac{1}{m} \sum_{j=1}^{m} \ell(\hat{h}_{\mathcal{S}} \mid z_j, u_j). \qquad (3)$$
We argue that adaptive classifiers should never perform worse than source classifiers. In other words, they should never achieve a larger target risk.
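As a minimal illustration of Eqs. (1) and (3) (not the authors' implementation; the quadratic surrogate and the function names are assumptions), the empirical risk is simply a per-sample loss averaged over a labeled sample:

```python
import numpy as np

def quadratic_loss(h, x_i, y_i):
    # Surrogate for the 0-1 loss: squared difference between prediction and (one-hot) label.
    return np.sum((h(x_i) - y_i) ** 2)

def empirical_risk(h, X, Y, loss=quadratic_loss):
    # Average loss over labeled samples, as in Eq. (1) for (x, y) and Eq. (3) for (z, u).
    return np.mean([loss(h, x_i, y_i) for x_i, y_i in zip(X, Y)])
```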

3.2. Contrast

We formalize the desire to never achieve a larger target risk by directly comparing the target risk of a potential alternative classifier with the target risk of the source classifier. If we subtract the target risk of the source classifier, then we can argue that the resulting function should never be positive:
$$\hat{R}(h \mid z, u) - \hat{R}(\hat{h}_{\mathcal{S}} \mid z, u). \qquad (4)$$
If this contrast between risk functions is used as a minimization objective, i.e., $\hat{h} = \arg\min_{h}\ \hat{R}(h \mid z, u) - \hat{R}(\hat{h}_{\mathcal{S}} \mid z, u)$, then the target risk of the resulting classifier is bounded above by the risk of the source classifier: $\hat{R}(\hat{h} \mid z, u) \leq \hat{R}(\hat{h}_{\mathcal{S}} \mid z, u)$. Equality occurs when the source classifier is recovered: $\hat{h} = \hat{h}_{\mathcal{S}}$. Classifiers that lead to larger target risks are not valid outcomes of this minimization procedure.

3.3. Robustness

Eq. (4) still relies on target labels $u$, which are unknown during training. Instead of $u$ we use a worst-case labeling, achieved by maximizing risk with respect to a hypothetical labeling $q$. For any classifier $h$, the risk with respect to this worst-case labeling will always be larger than the risk with respect to the true target labeling:
$$\hat{R}(h \mid z, u) \leq \max_{q}\ \hat{R}(h \mid z, q). \qquad (5)$$
Maximizing over a set of discrete labels is a combinatorial problem and, unfortunately, this one is computationally expensive. To avoid this, we apply a relaxation by considering a soft labeling, $q_{jk} = p(u_j = k \mid z_j)$. This means that $q_j$ is a vector of $K$ elements that sum to 1; in other words, a point on a $K{-}1$ simplex, $\Delta_{K-1}$. For $m$ samples, the Cartesian product of $m$ simplices is taken: $\Delta_{K-1} \times \Delta_{K-1} \times \cdots = \Delta_{K-1}^{m}$. By optimizing with respect to a worst-case labeling, the estimator will be more robust to uncertainty over target labels [17].

3.4. Target Contrastive Pessimistic risk

Combining the contrast between risk functions from (4) with the worst-case labeling $q$ from (5) produces the following risk function:
$$\hat{R}_{\mathrm{TCP}}(h \mid \hat{h}_{\mathcal{S}}, z, q) = \frac{1}{m} \sum_{j=1}^{m} \ell(h \mid z_j, q_j) - \ell(\hat{h}_{\mathcal{S}} \mid z_j, q_j). \qquad (6)$$
We refer to the risk in Eq. (6) as the Target Contrastive Pessimistic risk (TCP). Minimizing with respect to a classifier $h$ and maximizing with respect to a hypothetical labeling $q$ produces the new TCP target classifier:
$$\hat{h}_{\mathcal{T}} = \underset{h \in \mathcal{H}}{\arg\min}\ \max_{q \in \Delta_{K-1}^{m}}\ \hat{R}_{\mathrm{TCP}}(h \mid \hat{h}_{\mathcal{S}}, z, q). \qquad (7)$$
Note that the TCP risk only considers the performance on the target domain. More precisely, it considers the performance on the given samples from the target domain and is, in this sense, a transductive approach [12,30]. It is different from the risk formulations in [19,32], and those mentioned in Section 2, because those incorporate performance on the source domain as well. Our formulation focuses purely on the performance gain we can achieve over the source classifier, in terms of target risk.
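A sketch of Eq. (6) in code (hypothetical helper names; the per-sample loss is passed in, e.g. the soft-label negative log-likelihood used in Section 4):

```python
import numpy as np

def tcp_risk(loss, theta, theta_S, Z, Q):
    """Target Contrastive Pessimistic risk of Eq. (6): mean difference between the
    candidate's and the source classifier's losses on the given target samples,
    evaluated under the soft labeling Q (one row per sample, each row on the simplex)."""
    return np.mean([loss(theta, z_j, q_j) - loss(theta_S, z_j, q_j)
                    for z_j, q_j in zip(Z, Q)])
```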

3.5. Optimization

If the loss function is restricted to be globally convex and the hypothesis space $\mathcal{H}$ to be a convex set, then the TCP risk will be globally convex with respect to $h$ and there will be a unique optimum for $h$. The TCP risk is linear with respect to $q$ and the optimum need not be unique for $q$. But the combined minimax objective will be globally convex-linear, which guarantees the existence of a saddle point, i.e., a unique optimum with respect to both $h$ and $q$ [7].

Finding this saddle point can be done through first performing a gradient descent step according to the partial derivative with respect to $h$, followed by a gradient ascent step according to the partial derivative with respect to $q$. However, this last step causes the updated $q$ to leave the simplex. In order to enforce the constraint, the updated $q$ is projected back onto the simplex. The projection, $P$, maps a point outside the simplex, $a$, to the point, $b$, that is the closest point on the simplex in terms of Euclidean distance: $P(a) = \arg\min_{b}\ \|a - b\|^2$ [22]. Unfortunately, the projection step complicates the computation of the step size, which we replace by a learning rate $\alpha_t$, decreasing over iterations $t$. This results in the overall update:
$$q^{t+1} \leftarrow P\big(q^{t} + \alpha_t\, \nabla_{q} \hat{R}_{\mathrm{TCP}}(h \mid \hat{h}_{\mathcal{S}}, z, q^{t})\big). \qquad (8)$$
A gradient descent-ascent procedure for globally convex-linear objectives is guaranteed to converge to a saddle point (cf. Proposition 4.4 and Corollary 4.5 of [7]).
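The saddle-point search described above can be sketched as follows (an illustrative sketch, not the released code; the helper names, uniform initialization of q and the 1/t learning rate schedule are assumptions). The projection uses the standard sort-based Euclidean projection onto the simplex [22].

```python
import numpy as np

def project_rows_onto_simplex(Q):
    """Euclidean projection of each row of Q back onto the probability simplex."""
    m, K = Q.shape
    P = np.empty_like(Q)
    for j in range(m):
        v = np.sort(Q[j])[::-1]                    # sort descending
        css = np.cumsum(v) - 1.0
        rho = np.nonzero(v - css / np.arange(1, K + 1) > 0)[0][-1]
        tau = css[rho] / (rho + 1.0)
        P[j] = np.maximum(Q[j] - tau, 0.0)
    return P

def tcp_saddle_point(minimize_theta, grad_q, Z, K, num_iter=500):
    """Alternate minimization in the classifier parameters with a projected
    gradient ascent step in the soft labels q, as in Eq. (8).

    minimize_theta: callable (Z, Q) -> parameters minimizing the TCP risk for fixed Q
                    (for discriminant analysis this is available in closed form, Section 4)
    grad_q:         callable (theta, Z, Q) -> gradient of the TCP risk w.r.t. Q
    """
    Q = np.full((Z.shape[0], K), 1.0 / K)          # start from uniform soft labels
    for t in range(1, num_iter + 1):
        theta = minimize_theta(Z, Q)               # minimization step in theta
        alpha = 1.0 / t                            # decreasing learning rate
        Q = project_rows_onto_simplex(Q + alpha * grad_q(theta, Z, Q))  # ascent + projection
    return theta, Q
```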

4. Discriminant analysis

Interestingly, for classical discriminant analysis (DA), it can be shown that the TCP risk produces parameter estimates with strictly smaller risks than that of the source classifier. Discriminant analysis models the data from each class with a Gaussian distribution, weighted proportional to a class prior: $\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$ [11]. The full parameter set is $\theta = (\pi_k, \mu_k, \Sigma_k)$. The model is expressed as an empirical risk minimization formulation by taking the negative log-likelihood as a loss function, $\ell(\theta \mid x, y) = \sum_{k}^{K} -y_k \log[\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)]$.

4.1. Quadratic discriminant analysis

If each class is modeled with a separate covariance matrix, the resulting classifier is a quadratic function of the difference in means and covariances, and is hence called quadratic discriminant analysis (QDA). For target data $z$ and probabilistic labels $q$, the loss is formulated as:
$$\ell_{\mathrm{QDA}}(\theta \mid z_j, q_j) = \sum_{k=1}^{K} -q_{jk} \log[\pi_k \mathcal{N}(z_j \mid \mu_k, \Sigma_k)]. \qquad (9)$$
Note that the loss is now expressed in terms of classifier parameters $\theta$, as opposed to the classifier $h$. Plugging the loss from (9) into (6), the TCP-QDA risk becomes:
$$\hat{R}_{\mathrm{TCP}}^{\mathrm{QDA}}(\theta \mid \hat{\theta}_{\mathcal{S}}, z, q) = \frac{1}{m} \sum_{j=1}^{m} \ell_{\mathrm{QDA}}(\theta \mid z_j, q_j) - \ell_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z_j, q_j) = \frac{1}{m} \sum_{j=1}^{m} \sum_{k=1}^{K} -q_{jk} \log \frac{\pi_k\, \mathcal{N}(z_j \mid \mu_k, \Sigma_k)}{\hat{\pi}^{\mathcal{S}}_k\, \mathcal{N}(z_j \mid \hat{\mu}^{\mathcal{S}}_k, \hat{\Sigma}^{\mathcal{S}}_k)}, \qquad (10)$$
where the estimate itself is:
$$\hat{\theta}_{\mathcal{T}} = \underset{\theta}{\arg\min}\ \max_{q \in \Delta_{K-1}^{m}}\ \hat{R}_{\mathrm{TCP}}^{\mathrm{QDA}}(\theta \mid \hat{\theta}_{\mathcal{S}}, z, q). \qquad (11)$$
Minimization with respect to $\theta$ has a closed-form solution for discriminant analysis models. For each class, the parameter estimates are:
$$\pi_k = \frac{1}{m} \sum_{j=1}^{m} q_{jk}, \qquad (12)$$
$$\mu_k = \Big(\sum_{j=1}^{m} q_{jk}\Big)^{-1} \sum_{j=1}^{m} q_{jk}\, z_j, \qquad (13)$$
$$\Sigma_k = \Big(\sum_{j=1}^{m} q_{jk}\Big)^{-1} \sum_{j=1}^{m} q_{jk}\, (z_j - \mu_k)(z_j - \mu_k)^{\top}. \qquad (14)$$
Keeping $\theta$ fixed, the gradient with respect to $q_{jk}$ is:
$$\frac{\partial}{\partial q_{jk}}\, \hat{R}_{\mathrm{TCP}}^{\mathrm{QDA}}(\theta \mid \hat{\theta}_{\mathcal{S}}, z, q) = -\frac{1}{m} \log \frac{\pi_k\, \mathcal{N}(z_j \mid \mu_k, \Sigma_k)}{\hat{\pi}^{\mathcal{S}}_k\, \mathcal{N}(z_j \mid \hat{\mu}^{\mathcal{S}}_k, \hat{\Sigma}^{\mathcal{S}}_k)}. \qquad (15)$$

4.2. Example

Fig. 2 visualizes the difference between the source classifier and our TCP-QDA classifier. On the left is shown the source classifier applied to the target data from Section 2.2. On the right is shown the TCP-QDA classifier applied to the same data. Note that it has shifted upwards to better fit the target samples, achieving a smaller risk than the source classifier.

4.3. Regularization

One of the properties of a discriminant analysis model is that it requires the estimated covariance matrix $\Sigma_k$ to be non-singular. It is possible for the maximizer over $q$ in TCP-QDA to assign fewer samples than dimensions to one of the classes, causing the covariance matrix for that class to be singular. To prevent this, we regularize its estimation by enforcing a lower bound on the eigenvalues of the estimated covariance matrix.
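A sketch of the closed-form updates of Eqs. (12)-(14) together with the eigenvalue lower bound described above (illustrative only; the function names and the value of the bound are assumptions, not taken from the paper):

```python
import numpy as np

def floor_eigenvalues(S, min_eig=1e-3):
    # Clip eigenvalues from below so the covariance estimate stays non-singular.
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.maximum(vals, min_eig)) @ vecs.T

def qda_parameters(Z, Q, min_eig=1e-3):
    """Closed-form minimizers for (TCP-)QDA given soft labels Q (m x K):
    class priors (Eq. 12), means (Eq. 13) and covariances (Eq. 14)."""
    m, D = Z.shape
    K = Q.shape[1]
    pi = Q.sum(axis=0) / m                                       # Eq. (12)
    mu = (Q.T @ Z) / Q.sum(axis=0)[:, None]                      # Eq. (13)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        Zc = Z - mu[k]
        Sigma[k] = (Q[:, k, None] * Zc).T @ Zc / Q[:, k].sum()   # Eq. (14)
        Sigma[k] = floor_eigenvalues(Sigma[k], min_eig)          # regularization (Section 4.3)
    return pi, mu, Sigma
```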

Fig. 2. Example of the difference between source Quadratic Discriminant Analysis (left, $\hat{\theta}_{\mathcal{S}}$) and Target Contrastive Pessimistic Quadratic Discriminant Analysis (right, $\hat{\theta}_{\mathcal{T}}$) on the target domain data from Section 2.2.

4.4. Linear discriminant analysis

If the model is constrained to share a covariance matrix between classes, the resulting classifier is a linear function of the difference in means and is hence termed linear discriminant analysis (LDA). This constraint is imposed through the weighted sum over class covariance matrices, $\Sigma = \sum_{k}^{K} \pi_k \Sigma_k$.

4.5. Performance guarantee

For the discriminant analysis model, the TCP parameter estimator obtains a strictly smaller risk. In other words, this parameter estimator is guaranteed to improve its performance, on the given target samples and in terms of risk, over the source classifier. This is the first domain adaptation parameter estimator for which such a guarantee can be provided.

Theorem 1. For a continuous target distribution, with more samples than features for every class, the empirical target risk, with respect to discriminant analysis, of TCP estimated parameters $\hat{\theta}_{\mathcal{T}}$ is, almost surely, strictly smaller than that of the source parameters $\hat{\theta}_{\mathcal{S}}$:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) < \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u). \qquad (16)$$
The reader is referred to Appendix A for the proof. It follows similar steps as a guarantee for discriminant analysis in semi-supervised learning [20]. Note that as long as the same amount of regularization is added to both the source and the TCP estimator, the strictly smaller risk also holds for a regularized model.

5. Experiments

We see the TCP risk formulation from Section 3, together with Theorem 1, as our main contributions. Of course, it is still of interest to see how other approaches compare to ours. We compare² the performance of our classifiers with that of some well-known and robust domain-adaptive classifiers. We implemented Transfer Component Analysis (TCA) [24], Kernel Mean Matching (KMM) [14], Robust Covariate Shift Adjustment (RCSA) [32] and the Robust Bias-Aware (RBA) classifier [19]. TCA and KMM make explicit assumptions: TCA assumes that there are latent factors onto which the data can be projected such that the distributions are more similar, while the original properties such as class separability are preserved. We trained a logistic regressor on the source data mapped onto the transfer components. KMM assumes that the posterior distributions in each domain are equal and that the support of the target distribution is contained within the support of the source distribution. We trained both a weighted logistic regressor and a weighted least-squares classifier using the importance-weights estimated by KMM. We report the best performing of the two, namely least-squares. RCSA also assumes equal posterior distributions, but employs worst-case importance-weight estimation to be robust to weight estimation errors. We used the authors' implementation, which trains a weighted support vector machine using the estimated worst-case weights. RBA assumes that the moments of the source classifier's predictions match those of the target classifier. In our implementation, only the first moments are constrained to match. As baselines, we included a non-adaptive linear (S-LDA) and quadratic (S-QDA) discriminant analysis model trained on the source domain.

² Code is available at https://github.com/wmkouw/tcpr

Table 1
WeatherAUS data set. AUC for all pairwise combinations of domains (D = 'Darwin', P = 'Perth', B = 'Brisbane' and M = 'Melbourne').

S→T      D→P    D→B    D→M    P→B    P→M    B→M    P→D    B→D    M→D    B→P    M→P    M→B    avg
S-LDA    0.650  0.700  0.672  0.783  0.732  0.565  0.862  0.819  0.919  0.789  0.879  0.903  0.773
S-QDA    0.681  0.857  0.642  0.914  0.940  0.881  0.950  0.937  0.955  0.898  0.929  0.959  0.879
TCA      0.825  0.856  0.718  0.838  0.720  0.628  0.842  0.856  0.845  0.834  0.808  0.662  0.786
KMM      0.778  0.704  0.556  0.766  0.705  0.691  0.827  0.717  0.768  0.612  0.517  0.505  0.679
RCSA     0.837  0.895  0.769  0.841  0.759  0.726  0.858  0.872  0.878  0.813  0.851  0.851  0.829
RBA      0.844  0.884  0.764  0.843  0.756  0.741  0.860  0.874  0.878  0.818  0.844  0.839  0.829
TCP-LDA  0.833  0.886  0.749  0.853  0.738  0.733  0.858  0.869  0.875  0.828  0.838  0.859  0.827
TCP-QDA  0.710  0.886  0.760  0.932  0.946  0.903  0.965  0.950  0.969  0.905  0.908  0.964  0.900

All target samples are given, unlabeled, to the adaptive classifiers. The classifiers make predictions for those given target samples and their performance is evaluated with respect to those target samples' true labels. Performance is measured in terms of Area Under the ROC-curve (AUC). All methods are trained using L2-regularization. Since there is no labeled target data available for validation, we set the regularization parameter to a small value, namely 0.01.
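The evaluation protocol can be sketched as follows (an illustration only; scikit-learn's roc_auc_score is assumed for the AUC, and `score` stands in for any classifier's decision values on the given target samples):

```python
from sklearn.metrics import roc_auc_score

def evaluate_on_target(score, Z, u_true):
    # Predict on the *given* target samples and compare against their true labels,
    # which are only used for evaluation, never for training.
    return roc_auc_score(u_true, score(Z))
```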

5.1. Data sets

We performed a set of experiments on two data sets that are geographically split into domains. In the first problem, the goal is to predict whether it will rain the following day, based on 22 features including wind speed, humidity, and sunshine (the data set is part of the R package Rattle [33]). The measurements are taken over a period of 200 days from Australian weather stations located in Darwin, Perth, Brisbane, and Melbourne. Each station can be considered a domain because the feature spaces are equal but the underlying data-generating distributions are different. For instance, the average temperature is several degrees higher in Darwin than in Melbourne.

The second data set is from the UCI machine learning repository [18]. The goal is to predict heart disease in patients from 4 different hospitals. These are located in Hungary (294 patients), Switzerland (123 patients), California (200 patients) and Ohio (303 patients). Each hospital can be considered a domain because patients are measured on the same clinical features but the local patient populations differ. For example, patients in Hungary are on average younger than patients from Switzerland (48 versus 55 years). Heart disease is predicted from 13 clinical features such as age, sex, cholesterol level and chest pain type. Both data sets are preprocessed by first imputing missing values with zeros and then z-scoring each feature.
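A sketch of that preprocessing step (whether the z-scoring is applied per domain or pooled over domains is not specified in the text, so that choice is an assumption here):

```python
import numpy as np

def preprocess(X):
    # Impute missing values with zeros, then z-score each feature.
    X = np.where(np.isnan(X), 0.0, X)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features (assumption)
    return (X - mu) / sd
```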

5.2. Results

Table 1 compares the AUCs of various classifiers on the WeatherAUS data set. All combinations of using one station as the source domain and another station as the target domain are taken. Firstly, as a collective, the robust methods (TCP-QDA, TCP-LDA, RBA, RCSA) rather consistently outperform the non-robust methods (TCA, KMM, S-LDA, S-QDA), though it certainly is not the case that every robust method outperforms every non-robust one. Also, there is one exception where S-QDA actually performs best of all. Secondly, RCSA outperforms KMM in all cases, indicating that it is either difficult to estimate appropriate importance weights or that it is difficult to train the importance-weighted classifier given KMM's weights. Thirdly, in eight out of twelve cases TCP-LDA outperforms S-LDA. TCP-QDA is better than S-QDA in eleven of the twelve. Lastly, S-LDA occasionally outperforms the non-TCP, adaptive classifiers, where this most notably happens in the three cases when S = M. For S-QDA this happens in all cases except for S = D. When S = M and T = P, we find that S-QDA performs best overall. Particularly where S-LDA is concerned, these results indicate that adaptation strategies can also be detrimental to performance.

Table 2 lists the AUCs of each classifier on the heart disease data set. Overall, the AUCs are lower here, indicating that these settings are more difficult than those of the weather stations. Firstly, TCP-LDA generally outperforms TCP-QDA here, indicating that most problem settings are linearly separable and the additional flexibility of QDA is not helpful. Secondly, the differences in performance between S-LDA and S-QDA and their TCP versions are clearly less appreciable. In most cases the differences seem insignificant. Exceptions occur when S = S and T = O, in which case the original methods actually perform clearly better, and when S = S and T = H, in which case the TCP adaptations do so. Thirdly, RCSA does not always outperform KMM, but since both KMM and RCSA perform worse than chance on a few occasions, it does seem that the assumption of equivalent posterior distributions is invalid in many cases. Fourthly, TCA's performance also varies around chance level, which means that it is difficult to recover a common latent representation here. Lastly, S-LDA and S-QDA outperform the adaptive classifiers on a number of occasions again.

Table 2
Heart disease data set. AUC for all pairwise combinations of domains (O = 'Ohio', H = 'Hungary', S = 'Switzerland' and C = 'California').

S→T      O→H    O→S    O→C    H→S    H→C    S→C    H→O    S→O    C→O    S→H    C→H    C→S    avg
S-LDA    0.866  0.674  0.658  0.671  0.726  0.527  0.866  0.500  0.831  0.559  0.883  0.440  0.683
S-QDA    0.829  0.674  0.503  0.660  0.668  0.484  0.840  0.500  0.811  0.502  0.834  0.452  0.647
TCA      0.674  0.597  0.500  0.453  0.466  0.530  0.544  0.439  0.693  0.408  0.661  0.572  0.545
KMM      0.709  0.591  0.460  0.503  0.568  0.552  0.742  0.302  0.294  0.345  0.290  0.508  0.489
RCSA     0.646  0.667  0.572  0.641  0.483  0.459  0.749  0.626  0.651  0.685  0.647  0.343  0.597
RBA      0.502  0.670  0.430  0.636  0.423  0.582  0.556  0.366  0.523  0.396  0.597  0.412  0.508
TCP-LDA  0.864  0.675  0.653  0.673  0.725  0.555  0.867  0.424  0.831  0.717  0.882  0.447  0.693
TCP-QDA  0.822  0.675  0.500  0.661  0.660  0.432  0.841  0.422  0.813  0.565  0.847  0.414  0.638

6. Discussion

Although, by construction, the TCP classifiers are never worse than the source classifier in terms of empirical risk, they will not automatically lead to improvements in the error rate. This is due to the fact that a surrogate loss function is used during training: the classifier that minimizes the surrogate loss need not be the classifier that minimizes the 0/1-loss [2,4,21]. Similar performance guarantees as we have given with respect to empirical risk cannot be given with respect to classification error, because the 0-1 loss cannot be directly optimized.

Although our TCP estimator is guaranteed to never perform worse than the source classifier, it may not perform well if the source classifier is a poor choice to begin with. Of course, if no decent source classifiers can be formed, then one can wonder whether any kind of adaptation will be able to construct a satisfactory target classifier, unless particularly reliable assumptions can be made.

Given that reliable assumptions can be made, our TCP estimator could still be useful. Rather than the original supervised source classifier, one can, in principle, use any adaptive classifier in combination with TCP parameter estimation. In that case, the TCP parameter estimator would still retain its guarantee to not perform worse than the classifier it is compared against, which in this case is the adaptive classifier. Potentially, this may of course lead to even better parameter estimates. A wide range of standard classifiers that rely on the optimization of a convex loss can be incorporated, such as least-squares or support vector machines, meaning that TCP could be combined with many adaptive classifiers. Non-convex losses, as widely employed in this era of deep learning, are a challenge and, as yet, it is an open and interesting research question to what extent our theoretical results can be salvaged in that setting.

Another possible extension to the current estimator is to use multiple source domains. Perhaps our TCP estimator could produce better estimates than the best source estimates. One could envision contrasting the proposal classifier with the classifier producing the lowest risk from among a set of source classifiers, each trained on its own source domain. Finding the best one from among the set of source classifiers would require an additional minimization step over source domains, which would increase the computational cost. Selecting a subset of source domains in advance could limit this increase in cost and make such an approach feasible.

7. Conclusion

We have designed a risk minimization formulation for a domain-adaptive classifier whose performance, in terms of empirical target risk, is always at least as good as that of the non-adaptive source classifier, without making assumptions on the relationship between domains. This is something that no other method can guarantee. Furthermore, for the discriminant analysis case, its performance is always strictly better. As demonstrated, our Target Contrastive Pessimistic discriminant analysis model performs consistently well among other robust classifiers.

Declaration of Competing Interest

The authors state that they hold no conflict of interest.

Acknowledgment

A word of thanks goes out to the two anonymous reviewers whose feedback helped us improve the presentation of our work. We gladly acknowledge their constructive remarks and comments.

Appendix A

Proof of Theorem 1. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a data set of size $n$ drawn i.i.d. from a continuous distribution defined over input space $\mathcal{X} \subseteq \mathbb{R}^D$ and output space $\mathcal{Y} = \{y \in \{0,1\}^K : \sum_k y_k = 1\}$. Similarly, let $\{(z_j, u_j)\}_{j=1}^{m}$ be a data set of size $m$, drawn i.i.d. from another continuous distribution defined over $\mathcal{X} \times \mathcal{Y}$. Consider a discriminant analysis model parameterized with $\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$ with empirical risk defined by:
$$\hat{R}_{\mathrm{QDA}}(\theta \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} -y_{ik} \log[\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)]. \qquad (A.1)$$
The sample covariance matrix, $\Sigma_k$, is required to be non-singular, which is guaranteed when there are more unique samples than features for every class. Let $\hat{\theta}_{\mathcal{S}}$ be the parameters estimated on labeled source data:
$$\hat{\theta}_{\mathcal{S}} = \underset{\theta}{\arg\min}\ \hat{R}_{\mathrm{QDA}}(\theta \mid x, y), \qquad (A.2)$$
and let $(\hat{\theta}_{\mathcal{T}}, q^{*})$ be the parameters and worst-case labeling estimated by mini-maximizing the Target Contrastive Pessimistic risk:
$$\hat{\theta}_{\mathcal{T}}, q^{*} = \underset{\theta}{\arg\min}\ \underset{q \in \Delta_{K-1}^{m}}{\arg\max}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q). \qquad (A.3)$$
Firstly, keeping $q$ fixed, the minimization over the contrast between the target risk of the proposal parameters $\theta$ and the source parameters $\hat{\theta}_{\mathcal{S}}$ is upper bounded by 0, because both sets of parameters are elements of the same parameter space, $\theta, \hat{\theta}_{\mathcal{S}} \in \Theta$:
$$\min_{\theta}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q) \leq 0, \qquad (A.4)$$
for all choices of $q$. Since $\theta$ can always be set to $\hat{\theta}_{\mathcal{S}}$, values for $\theta$ that would result in a larger target risk than that of $\hat{\theta}_{\mathcal{S}}$ are not valid minimizers of the contrast. Considering that the contrast is upper bounded for any labeling $q$, it is also upper bounded by 0 for the worst-case labeling $q^{*}$:
$$\min_{\theta}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q^{*}) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q^{*}) \leq 0, \qquad (A.5)$$
and since $\hat{\theta}_{\mathcal{T}}$ is the minimizer of the left-hand side of (A.5):
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, q^{*}) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q^{*}) \leq 0. \qquad (A.6)$$
Secondly, keeping $\theta$ fixed, the empirical risk with respect to the true labeling $u$ is always less than or equal to the empirical risk with respect to the worst-case labeling:
$$\hat{R}_{\mathrm{QDA}}(\theta \mid z, u) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u) \leq \max_{q \in \Delta_{K-1}^{m}}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q). \qquad (A.7)$$
Since $q^{*}$ is the maximizer for $\hat{\theta}_{\mathcal{T}}$ as parameters, we can write:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u) \leq \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, q^{*}) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q^{*}). \qquad (A.8)$$
Combining Inequalities (A.6) and (A.8) gives:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u) \leq 0. \qquad (A.9)$$
Bringing the second term on the left-hand side to the right-hand side shows that the target risk of the TCP estimate is always less than or equal to the target risk of the source classifier:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) \leq \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u). \qquad (A.10)$$
Equality in (A.10) occurs with probability 0, which can be shown through the parameter estimators. The total mean for the source classifier consists of the weighted combination of the class means, resulting in the overall source sample average:
$$\hat{\mu}_{\mathcal{S}} = \sum_{k=1}^{K} \hat{\pi}^{\mathcal{S}}_k \hat{\mu}^{\mathcal{S}}_k = \sum_{k=1}^{K} \frac{\sum_{i}^{n} y_{ik}}{n} \Big( \frac{1}{\sum_{i}^{n} y_{ik}} \sum_{i=1}^{n} y_{ik}\, x_i \Big) = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (A.11)$$
The total mean for the TCP-QDA estimator is similarly defined, resulting in the overall target sample average:
$$\hat{\mu}_{\mathcal{T}} = \sum_{k=1}^{K} \hat{\pi}^{\mathcal{T}}_k \hat{\mu}^{\mathcal{T}}_k = \sum_{k=1}^{K} \frac{\sum_{j}^{m} q^{*}_{jk}}{m} \Big( \frac{1}{\sum_{j}^{m} q^{*}_{jk}} \sum_{j=1}^{m} q^{*}_{jk}\, z_j \Big) = \sum_{k=1}^{K} \frac{1}{m} \sum_{j=1}^{m} q^{*}_{jk}\, z_j = \frac{1}{m} \sum_{j=1}^{m} z_j. \qquad (A.12)$$
Note that since $q^{*}$ consists of probabilities, the sum over classes $\sum_{k}^{K} q^{*}_{jk}$ is 1, for every sample $j$. Equal risks for these parameter sets, $\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) = \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u)$, imply equality of the total means, $\hat{\mu}_{\mathcal{T}} = \hat{\mu}_{\mathcal{S}}$. By Eqs. (A.11) and (A.12), equal total means imply equal sample averages: $\frac{1}{m} \sum_{j}^{m} z_j = \frac{1}{n} \sum_{i}^{n} x_i$. Given a set of source samples, drawing a set of target samples such that their averages are exactly equal constitutes a single event under a probability density function. By definition, single events under continuous distributions have probability 0. Therefore, a strictly smaller risk occurs almost surely:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) < \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u). \qquad (A.13) \quad \square$$

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.patrec.2021.05.005.

References

[1] A. Arnold, R. Nallapati, W.W. Cohen, A comparative study of methods for transductive transfer learning, in: IEEE International Conference on Data Mining Workshops, 2007, pp. 77–82.
[2] P.L. Bartlett, M.I. Jordan, J.D. McAuliffe, Convexity, classification, and risk bounds, J. Am. Stat. Assoc. 101 (2006) 138–156.
[3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J.W. Vaughan, A theory of learning from different domains, Mach. Learn. 79 (2010) 151–175.
[4] S. Ben-David, D. Loker, N. Srebro, K. Sridharan, Minimizing the misclassification error rate using a surrogate convex loss, in: International Conference on Machine Learning, 2012, pp. 83–90.
[5] S. Bickel, M. Brückner, T. Scheffer, Discriminative learning under covariate shift, J. Mach. Learn. Res. 10 (2009) 2137–2155.
[6] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, MIT Press, 2006.
[7] A. Cherukuri, B. Gharesifard, J. Cortes, Saddle-point dynamics: conditions for asymptotic stability of saddle points, SIAM J. Control Optim. 55 (2017) 486–511.
[8] C. Cortes, M. Mohri, Domain adaptation and sample bias correction theory and algorithm for regression, Theor. Comput. Sci. 519 (2014) 103–126.
[9] C. Cortes, M. Mohri, M. Riley, A. Rostamizadeh, Sample selection bias correction theory, in: Algorithmic Learning Theory, 2008, pp. 38–53.
[10] N. Farajidavar, T.E. de Campos, J. Kittler, Adaptive transductive transfer machine, in: British Machine Vision Conference, 2014, pp. 1–12.
[11] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, vol. 1, Springer, 2001.
[12] A. Gammerman, V. Vovk, V. Vapnik, Learning by transduction, in: Conference on Uncertainty in Artificial Intelligence, 1998, pp. 148–155.
[13] Q. Gu, J. Zhou, Learning the shared subspace for multi-task clustering and transductive transfer classification, in: IEEE International Conference on Data Mining, 2009, pp. 159–168.
[14] J. Huang, A.J. Smola, A. Gretton, K.M. Borgwardt, B. Schölkopf, et al., Correcting sample selection bias by unlabeled data, in: Advances in Neural Information Processing Systems, 2007, pp. 601–608.
[15] W.M. Kouw, M. Loog, A review of domain adaptation without target labels, IEEE Trans. Pattern Anal. Mach. Intell. 43 (2021) 766–785.
[16] W.M. Kouw, M. Loog, Target robust discriminant analysis, in: IAPR Joint International Workshops on Statistical Techniques in Pattern Recognition and Structural and Syntactic Pattern Recognition, 2021, accepted.
[17] E.L. Lehmann, G. Casella, Theory of Point Estimation, Springer, 2006.
[18] M. Lichman, UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml
[19] A. Liu, B. Ziebart, Robust classification under sample selection bias, in: Advances in Neural Information Processing Systems, 2014, pp. 37–45.
[20] M. Loog, Contrastive pessimistic likelihood estimation for semi-supervised classification, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 462–475.
[21] M. Loog, J.H. Krijthe, A.C. Jensen, On measuring and quantifying performance: error rates, surrogate loss, and an example in semi-supervised learning, in: Handbook of Pattern Recognition and Computer Vision, World Scientific, 2016, pp. 53–68.
[22] N. Maculan, G.G. De Paula Jr, A linear-time median-finding algorithm for projecting a vector on the simplex of R^n, Oper. Res. Lett. 8 (1989) 219–222.
[23] A.B. Owen, Monte Carlo Theory, Methods and Examples, 2013. https://statweb.stanford.edu/~owen/mc/
[24] S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. 22 (2011) 199–210.
[25] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, MIT Press, 2009.
[26] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[27] O. Sener, H.O. Song, A. Saxena, S. Savarese, Unsupervised transductive domain adaptation, arXiv preprint arXiv:1602.03534, 2016.
[28] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, J. Stat. Plan. Inference 90 (2000) 227–244.
[29] L. Shu, L.J. Latecki, Transductive domain adaptation with affinity learning, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 1903–1906.
[30] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer, 1982.
[31] V. Vapnik, Principles of risk minimization for learning theory, in: Advances in Neural Information Processing Systems, 1992, pp. 831–838.
[32] J. Wen, C.N. Yu, R. Greiner, Robust learning under uncertain test distributions: relating covariate shift to model misspecification, in: International Conference on Machine Learning, 2014, pp. 631–639.
[33] G.J. Williams, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Springer, 2011.
