
Delft University of Technology

Robust domain-adaptive discriminant analysis

Kouw, Wouter; Loog, Marco

DOI

10.1016/j.patrec.2021.05.005

Publication date

2021

Document Version

Final published version

Published in

Pattern Recognition Letters

Citation (APA)

Kouw, W., & Loog, M. (2021). Robust domain-adaptive discriminant analysis. Pattern Recognition Letters,

148, 107-113. https://doi.org/10.1016/j.patrec.2021.05.005

Important note

To cite this publication, please use the final published version (if applicable).

Please check the document version above.

Copyright

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy

Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology.


Robust domain-adaptive discriminant analysis

Wouter M. Kouw a,b,∗, Marco Loog b,c

a Department of Electrical Engineering, Eindhoven University of Technology, Groene Loper 3, 5612 AE Eindhoven, the Netherlands
b Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, the Netherlands
c Datalogisk Institut, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark

ARTICLE INFO

Article history:
Received 7 September 2019
Revised 26 March 2021
Accepted 3 May 2021
Available online 20 May 2021

MSC: 41A05, 41A10, 65D05, 65D17

Keywords: Domain adaptation, Robust estimator, Discriminant analysis, Transduction

ABSTRACT

Consider a domain-adaptive supervised learning setting, where a classifier learns from labeled data in a source domain and unlabeled data in a target domain to predict the corresponding target labels. If the classifier's assumption on the relationship between domains (e.g. covariate shift, common subspace, etc.) is valid, then it will usually outperform a non-adaptive source classifier. If its assumption is invalid, it can perform substantially worse. Validating assumptions on domain relationships is not possible without target labels. We argue that, in order to make domain-adaptive classifiers more practical, it is necessary to focus on robustness; robust in the sense that an adaptive classifier will still perform at least as well as a non-adaptive classifier without having to rely on the validity of strong assumptions. With this objective in mind, we derive a conservative parameter estimation technique, which is transductive in the sense of Vapnik and Chervonenkis, and show for discriminant analysis that the new estimator is guaranteed to achieve a lower risk on the given target samples compared to the source classifier. Experiments on problems with geographical sampling bias indicate that our parameter estimator performs well.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

1. Introduction

Generalization in supervised learning relies on the fact that future samples originate from the same underlying data-generating distribution as the ones used for training. However, this is not the case in settings where data is collected from different locations, different measurement instruments are used, or there is only access to biased data [25]. In these situations the labeled data does not represent the distribution of interest. This problem setting is referred to as a domain adaptation setting, where the distribution of the labeled data is called the source domain and the distribution of interest is called the target domain [3,15]. Most often, data in the target domain is not labeled and adapting a source domain classifier, i.e., changing predictions to suit the target domain, is the only means by which one can make accurate predictions. Unfortunately, depending on the domain dissimilarity, adaptive classifiers can easily perform worse than non-adaptive ones. We formulate a conservative adaptive classifier that always performs at least as well as the non-adaptive one.¹

✩ Handled by Associate Editor Francesco Tortorella.
∗ Corresponding author.
E-mail address: w.m.kouw@tue.nl (W.M. Kouw).
¹ A shortened, preliminary version was accepted for S+SSPR [16]. The current version offers a significant extension with a clearer exposition, additional technical details and references, more experiments, and a comprehensive analysis and discussion.

In the general setting, domains can be arbitrarily different, which means generalization will be extremely difficult. However, there are cases where the problem setting is more structured: in the covariate shift setting, the marginal data distributions differ but the posterior distributions are equal [5,9,28]. In such cases, a correctly specified adaptive classifier will converge to the same solution as the target classifier [9]. One way to carry out adaptation is by weighing each source sample by how important it is under the target distribution and training on the importance-weighted labeled source data. However, such a classifier can perform poorly when applied to settings where the covariate shift assumption is false, i.e., where the posterior distributions from both domains are not equal [8,19]. In that case, one often observes that a few samples are given large weights and all other samples are given near-zero weights, which greatly reduces the effective sample size [23, Chapter 8]. Sensitivity to domain relationship assumptions is not restricted to covariate shift. Another adaptive algorithm, Transfer Component Analysis (TCA), assumes the existence of a latent representation common to both domains. When that does not hold, mapping both source and target data onto transfer components will result in mixing of the class-conditional distributions and performance will deteriorate [24].

Since the validity of the aforementioned assumptions is difficult, if not impossible, to check, it is of interest to design robust classifiers. Robustness to uncertainty is often achieved through minimax optimization [17]. An example of a robust adaptive classifier is Robust Covariate Shift Adjustment (RCSA), which first maximizes risk with respect to the importance-weights and subsequently minimizes risk with respect to the classifier parameters [32]. It attempts to account for estimation errors in importance-weights. Another example is the Robust Bias-Aware (RBA) classifier, which plays a game between a risk-minimizing target classifier and a risk-maximizing target posterior distribution [19]. The adversary is constrained to pick posteriors that match the moments of the source distribution statistics, to avoid posterior probabilities that result in degenerate classifiers (e.g. assign all posterior probabilities to 1). Matching moments means that RBA classifiers lose predictive power in areas of feature space where the source distribution has limited support. Note that both robust methods still rely on assuming covariate shift.

Our main contribution is a parameter estimator that produces estimates with a risk that is always lower than or equal to the risk of the source classifier, with respect to the given target samples. It does so without making domain relationship assumptions such as covariate shift, but by constructing a specific type of risk that can be considered transductive in the sense originally defined by Vapnik and Chervonenkis [see 30]. Furthermore, we show that in the case of discriminant analysis, the estimator will produce strictly smaller risks on the target data. To the best of our knowledge, such performance guarantees compared to the source classifier have not been shown before.

The paper is outlined as follows: Section 3 presents the formulation of our method, with discriminant analysis in Section 4. Section 5.1 shows experiments on two data sets involving geographical sampling bias, indicating that our estimator consistently performs among the best. We conclude with limitations and a discussion in Section 6. To start with, the next section briefly introduces the specific domain adaptation setting that we consider and comments on the transductive nature of our particular approach.

2. Domain adaptation and transduction

A domain is defined here as a particular joint probability distribution over a $D$-dimensional input space $\mathcal{X} \subseteq \mathbb{R}^D$ and a $K$-dimensional output space of one-hot vectors $\mathcal{Y} = \{b \in \{0,1\}^K : \sum_k b_k = 1\}$ [15]. Let $\mathcal{S}$ mark a source domain, with $n$ samples $x = (x_1, \ldots, x_n)$ and corresponding labels $y = (y_1, \ldots, y_n) \in \mathcal{Y}^n$ drawn from the source domain's joint distribution. Similarly, let $\mathcal{T}$ mark a target domain, with $m$ samples $z = (z_1, \ldots, z_m)$ and corresponding labels $u = (u_1, \ldots, u_m)$ drawn from the target domain's joint distribution. The target labels $u$ are unknown at training time and the goal is to predict them, using only the unlabeled target samples $z$ and the labeled source samples $(x, y)$.

2.1. The meaning of transduction

Given that the primary performance measure in this work is specifically the risk on the unlabeled data of the target domain that is available to us, our objective is essentially transductive [see 15]. This is in line with the original definition of transduction as proposed by Vapnik and Chervonenkis [see 30].

It should be pointed out that, confusingly, what is referred to as transductive for most transfer learning and domain adaptation methods just means that there is labeled data available for the source but not for the target domain [see also 15]. The classifiers considered in papers such as [1,10,13], like most papers in our review work [15], do not focus on the unlabeled samples in the target domain in particular and are actually not transductive in the sense of Vapnik and Chervonenkis [see also 15]. Works like [27,29] exploit graph methods that do not have a ready out-of-sample extension and are therefore transductive in the sense of Vapnik and Chervonenkis. As Section 3 shows, our method focuses particularly on the risk obtained on the given target data and is, as such, transductive. As it turns out, it is specifically this approach that can provide us with performance guarantees, where other techniques cannot.

Fig. 1. Example domain adaptation setting. (Left) Labeled source domain data, (right) labeled target domain data. Black lines show a classifier trained on source data, applied to source data (left) and target data (right).

We should note that, typically, our target classifiers can still be used for classifying new and unseen target domain samples. That is, they can also be used for inductive inference. This is especially the case if the samples from the target domain can be considered representative of that domain. In that case, the performance on those particular target domain instances can equally well be interpreted as a regular empirical risk, used in standard empirical risk minimization [26,31]. Just as in the supervised learning setting, it is then assumed that having a small empirical risk carries over to a small generalization error and that the classifier can be successfully employed inductively.

As a final remark, we like to state that the benefits of transduction over induction, or vice versa, are not always easily identified. Especially because in many settings, inductive classifiers can be used for transduction and the other way around. Refer to Chapter 25 in [6] for further views and considerations.

2.2. Example

Fig. 1 visualizes some concepts used throughout the paper. On the left are shown samples from the source domain, labeled as points (red) versus crosses (blue). These were drawn from isotropic Gaussians centered at [−2, 0] and [+2, 0], respectively. The black lines are a contour plot of the posterior probabilities of a classifier trained on the source data. On the right is shown data from the target domain, as well as the source classifier applied to the target data. These target samples were drawn from two Gaussian distributions, both with covariance matrix [3, 2; 2, 4] but one with a mean of [−1, 2] and one with a mean of [+2, 1]. The source and target domains are therefore related to each other through an affine transformation. Note that the source classifier does not fit the target data well.
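To make the setting concrete, the following sketch (an illustration, not the authors' code; NumPy and the per-class sample sizes are assumptions) draws data from the distributions described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 50  # assumed number of samples per class in each domain

# Source domain: isotropic Gaussians centered at [-2, 0] and [+2, 0].
x = np.vstack([rng.multivariate_normal([-2, 0], np.eye(2), n),
               rng.multivariate_normal([+2, 0], np.eye(2), n)])
y = np.repeat([0, 1], n)

# Target domain: shared covariance [[3, 2], [2, 4]], means [-1, 2] and [+2, 1].
cov_t = np.array([[3.0, 2.0], [2.0, 4.0]])
z = np.vstack([rng.multivariate_normal([-1, 2], cov_t, m),
               rng.multivariate_normal([+2, 1], cov_t, m)])
u = np.repeat([0, 1], m)  # target labels; hidden from the learner at training time
```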

3. Robust estimator for target domain

In the following, we present the construction of our estimator. First, we discuss the risk of the classifier in the target domain. Secondly, we compare the target risk of a proposal classifier with the target risk incurred by the source classifier, and thirdly, we assume a worst-case labeling for the given target samples.

3.1. Target risk

The empirical risk of a classifier in the source domain is computed as the average loss with respect to source samples $(x, y)$:
$$\hat{R}(h \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(h \mid x_i, y_i), \qquad (1)$$
where $h$ is the classification function mapping input to labels and $\ell$ is a loss function comparing the classifier's prediction $h(x_i)$ with the true label $y_i$ at training time. Since the classification error, or 0-1 loss, cannot be directly optimized over, it is customary to choose surrogate loss functions, such as the quadratic loss $(h(x_i) - y_i)^2$ [11]. The source classifier is the classifier found by minimizing the empirical risk with respect to source samples:
$$\hat{h}_{\mathcal{S}} = \underset{h \in \mathcal{H}}{\arg\min}\ \hat{R}(h \mid x, y), \qquad (2)$$
where $\mathcal{H}$ refers to the hypothesis space.

Since the source classifier does not incorporate any part of the target domain, it is essentially entirely naive of it. But source domains are chosen for a reason, often because they are the most similar data available, and source classifiers are subsequently regarded as the best alternative for classifying the target domain. To evaluate $\hat{h}_{\mathcal{S}}$ in the target domain, the risk of the classifier with respect to target samples $(z, u)$ is computed:
$$\hat{R}(\hat{h}_{\mathcal{S}} \mid z, u) = \frac{1}{m} \sum_{j=1}^{m} \ell(\hat{h}_{\mathcal{S}} \mid z_j, u_j). \qquad (3)$$
We argue that adaptive classifiers should never perform worse than source classifiers. In other words, they should never achieve a larger target risk.
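As a minimal illustration of Eqs. (1) and (3) (not the authors' implementation; the quadratic surrogate and the function names are assumptions), the empirical risk is simply a per-sample loss averaged over a labeled sample:

```python
import numpy as np

def quadratic_loss(h, x_i, y_i):
    # Surrogate for the 0-1 loss: squared difference between prediction and (one-hot) label.
    return np.sum((h(x_i) - y_i) ** 2)

def empirical_risk(h, X, Y, loss=quadratic_loss):
    # Average loss over labeled samples, as in Eq. (1) for (x, y) and Eq. (3) for (z, u).
    return np.mean([loss(h, x_i, y_i) for x_i, y_i in zip(X, Y)])
```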

3.2. Contrast

We formalize the desire to never achieve a larger target risk by directly comparing the target risk of a potential alternative classifier with the target risk of the source classifier. If we subtract the target risk of the source classifier, then we can argue that the resulting function should never be positive:
$$\hat{R}(h \mid z, u) - \hat{R}(\hat{h}_{\mathcal{S}} \mid z, u). \qquad (4)$$
If this contrast between risk functions is used as a minimization objective, i.e., $\hat{h} = \arg\min_{h}\ \hat{R}(h \mid z, u) - \hat{R}(\hat{h}_{\mathcal{S}} \mid z, u)$, then the target risk of the resulting classifier is bounded above by the risk of the source classifier: $\hat{R}(\hat{h} \mid z, u) \leq \hat{R}(\hat{h}_{\mathcal{S}} \mid z, u)$. Equality occurs when the source classifier is recovered: $\hat{h} = \hat{h}_{\mathcal{S}}$. Classifiers that lead to larger target risks are not valid outcomes of this minimization procedure.

3.3. Robustness

Eq. (4) still relies on target labels $u$, which are unknown during training. Instead of $u$ we use a worst-case labeling, achieved by maximizing risk with respect to a hypothetical labeling $q$. For any classifier $h$, the risk with respect to this worst-case labeling will always be larger than the risk with respect to the true target labeling:
$$\hat{R}(h \mid z, u) \leq \max_{q}\ \hat{R}(h \mid z, q). \qquad (5)$$
Maximizing over a set of discrete labels is a combinatorial problem and, unfortunately, this one is computationally expensive. To avoid this, we apply a relaxation by considering a soft labeling, $q_{jk} = p(u_j = k \mid z_j)$. This means that $q_j$ is a vector of $K$ elements that sum to 1; in other words, a point on a $K{-}1$ simplex, $\Delta_{K-1}$. For $m$ samples, the Cartesian product of $m$ simplices is taken: $\Delta_{K-1} \times \Delta_{K-1} \times \cdots = \Delta_{K-1}^{m}$. By optimizing with respect to a worst-case labeling, the estimator will be more robust to uncertainty over target labels [17].

3.4. Target Contrastive Pessimistic risk

Combining the contrast between risk functions from (4) with the worst-case labeling $q$ from (5) produces the following risk function:
$$\hat{R}_{\mathrm{TCP}}(h \mid \hat{h}_{\mathcal{S}}, z, q) = \frac{1}{m} \sum_{j=1}^{m} \ell(h \mid z_j, q_j) - \ell(\hat{h}_{\mathcal{S}} \mid z_j, q_j). \qquad (6)$$
We refer to the risk in Eq. (6) as the Target Contrastive Pessimistic risk (TCP). Minimizing with respect to a classifier $h$ and maximizing with respect to a hypothetical labeling $q$ produces the new TCP target classifier:
$$\hat{h}_{\mathcal{T}} = \underset{h \in \mathcal{H}}{\arg\min}\ \max_{q \in \Delta_{K-1}^{m}}\ \hat{R}_{\mathrm{TCP}}(h \mid \hat{h}_{\mathcal{S}}, z, q). \qquad (7)$$
Note that the TCP risk only considers the performance on the target domain. More precisely, it considers the performance on the given samples from the target domain and is, in this sense, a transductive approach [12,30]. It is different from the risk formulations in [19,32], and those mentioned in Section 2, because those incorporate performance on the source domain as well. Our formulation focuses purely on the performance gain we can achieve over the source classifier, in terms of target risk.
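A sketch of Eq. (6) in code (hypothetical helper names; the per-sample loss is passed in, e.g. the soft-label negative log-likelihood used in Section 4):

```python
import numpy as np

def tcp_risk(loss, theta, theta_S, Z, Q):
    """Target Contrastive Pessimistic risk of Eq. (6): mean difference between the
    candidate's and the source classifier's losses on the given target samples,
    evaluated under the soft labeling Q (one row per sample, each row on the simplex)."""
    return np.mean([loss(theta, z_j, q_j) - loss(theta_S, z_j, q_j)
                    for z_j, q_j in zip(Z, Q)])
```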

3.5. Optimization

If the loss function is restricted to be globally convex and the hypothesis space $\mathcal{H}$ to be a convex set, then the TCP risk will be globally convex with respect to $h$ and there will be a unique optimum for $h$. The TCP risk is linear with respect to $q$ and the optimum need not be unique for $q$. But the combined minimax objective will be globally convex-linear, which guarantees the existence of a saddle point, i.e., a unique optimum with respect to both $h$ and $q$ [7].

Finding this saddle point can be done through first performing a gradient descent step according to the partial derivative with respect to $h$, followed by a gradient ascent step according to the partial derivative with respect to $q$. However, this last step causes the updated $q$ to leave the simplex. In order to enforce the constraint, the updated $q$ is projected back onto the simplex. The projection, $P$, maps a point outside the simplex, $a$, to the point, $b$, that is the closest point on the simplex in terms of Euclidean distance: $P(a) = \arg\min_{b}\ \|a - b\|^2$ [22]. Unfortunately, the projection step complicates the computation of the step size, which we replace by a learning rate $\alpha_t$, decreasing over iterations $t$. This results in the overall update:
$$q^{t+1} \leftarrow P\big(q^{t} + \alpha_t\, \nabla_{q} \hat{R}_{\mathrm{TCP}}(h \mid \hat{h}_{\mathcal{S}}, z, q^{t})\big). \qquad (8)$$
A gradient descent-ascent procedure for globally convex-linear objectives is guaranteed to converge to a saddle point (cf. Proposition 4.4 and Corollary 4.5 of [7]).
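The saddle-point search described above can be sketched as follows (an illustrative sketch, not the released code; the helper names, uniform initialization of q and the 1/t learning rate schedule are assumptions). The projection uses the standard sort-based Euclidean projection onto the simplex [22].

```python
import numpy as np

def project_rows_onto_simplex(Q):
    """Euclidean projection of each row of Q back onto the probability simplex."""
    m, K = Q.shape
    P = np.empty_like(Q)
    for j in range(m):
        v = np.sort(Q[j])[::-1]                    # sort descending
        css = np.cumsum(v) - 1.0
        rho = np.nonzero(v - css / np.arange(1, K + 1) > 0)[0][-1]
        tau = css[rho] / (rho + 1.0)
        P[j] = np.maximum(Q[j] - tau, 0.0)
    return P

def tcp_saddle_point(minimize_theta, grad_q, Z, K, num_iter=500):
    """Alternate minimization in the classifier parameters with a projected
    gradient ascent step in the soft labels q, as in Eq. (8).

    minimize_theta: callable (Z, Q) -> parameters minimizing the TCP risk for fixed Q
                    (for discriminant analysis this is available in closed form, Section 4)
    grad_q:         callable (theta, Z, Q) -> gradient of the TCP risk w.r.t. Q
    """
    Q = np.full((Z.shape[0], K), 1.0 / K)          # start from uniform soft labels
    for t in range(1, num_iter + 1):
        theta = minimize_theta(Z, Q)               # minimization step in theta
        alpha = 1.0 / t                            # decreasing learning rate
        Q = project_rows_onto_simplex(Q + alpha * grad_q(theta, Z, Q))  # ascent + projection
    return theta, Q
```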

4. Discriminant analysis

Interestingly, for classical discriminant analysis (DA), it can be shown that the TCP risk produces parameter estimates with strictly smaller risks than that of the source classifier. Discriminant analysis models the data from each class with a Gaussian distribution, weighted proportional to a class prior: $\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$ [11]. The full parameter set is $\theta = (\pi_k, \mu_k, \Sigma_k)$. The model is expressed as an empirical risk minimization formulation by taking the negative log-likelihood as a loss function, $\ell(\theta \mid x, y) = \sum_{k}^{K} -y_k \log[\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)]$.

4.1. Quadratic discriminant analysis

If each class is modeled with a separate covariance matrix, the resulting classifier is a quadratic function of the difference in means and covariances, and is hence called quadratic discriminant analysis (QDA). For target data $z$ and probabilistic labels $q$, the loss is formulated as:
$$\ell_{\mathrm{QDA}}(\theta \mid z_j, q_j) = \sum_{k=1}^{K} -q_{jk} \log[\pi_k \mathcal{N}(z_j \mid \mu_k, \Sigma_k)]. \qquad (9)$$
Note that the loss is now expressed in terms of classifier parameters $\theta$, as opposed to the classifier $h$. Plugging the loss from (9) into (6), the TCP-QDA risk becomes:
$$\hat{R}_{\mathrm{TCP}}^{\mathrm{QDA}}(\theta \mid \hat{\theta}_{\mathcal{S}}, z, q) = \frac{1}{m} \sum_{j=1}^{m} \ell_{\mathrm{QDA}}(\theta \mid z_j, q_j) - \ell_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z_j, q_j) = \frac{1}{m} \sum_{j=1}^{m} \sum_{k=1}^{K} -q_{jk} \log \frac{\pi_k\, \mathcal{N}(z_j \mid \mu_k, \Sigma_k)}{\hat{\pi}^{\mathcal{S}}_k\, \mathcal{N}(z_j \mid \hat{\mu}^{\mathcal{S}}_k, \hat{\Sigma}^{\mathcal{S}}_k)}, \qquad (10)$$
where the estimate itself is:
$$\hat{\theta}_{\mathcal{T}} = \underset{\theta}{\arg\min}\ \max_{q \in \Delta_{K-1}^{m}}\ \hat{R}_{\mathrm{TCP}}^{\mathrm{QDA}}(\theta \mid \hat{\theta}_{\mathcal{S}}, z, q). \qquad (11)$$
Minimization with respect to $\theta$ has a closed-form solution for discriminant analysis models. For each class, the parameter estimates are:
$$\pi_k = \frac{1}{m} \sum_{j=1}^{m} q_{jk}, \qquad (12)$$
$$\mu_k = \Big(\sum_{j=1}^{m} q_{jk}\Big)^{-1} \sum_{j=1}^{m} q_{jk}\, z_j, \qquad (13)$$
$$\Sigma_k = \Big(\sum_{j=1}^{m} q_{jk}\Big)^{-1} \sum_{j=1}^{m} q_{jk}\, (z_j - \mu_k)(z_j - \mu_k)^{\top}. \qquad (14)$$
Keeping $\theta$ fixed, the gradient with respect to $q_{jk}$ is:
$$\frac{\partial}{\partial q_{jk}}\, \hat{R}_{\mathrm{TCP}}^{\mathrm{QDA}}(\theta \mid \hat{\theta}_{\mathcal{S}}, z, q) = -\frac{1}{m} \log \frac{\pi_k\, \mathcal{N}(z_j \mid \mu_k, \Sigma_k)}{\hat{\pi}^{\mathcal{S}}_k\, \mathcal{N}(z_j \mid \hat{\mu}^{\mathcal{S}}_k, \hat{\Sigma}^{\mathcal{S}}_k)}. \qquad (15)$$

4.2. Example

Fig. 2 visualizes the difference between the source classifier and our TCP-QDA classifier. On the left is shown the source classifier applied to the target data from Section 2.2. On the right is shown the TCP-QDA classifier applied to the same data. Note that it has shifted upwards to better fit the target samples, achieving a smaller risk than the source classifier.

4.3. Regularization

One of the properties of a discriminant analysis model is that it requires the estimated covariance matrix $\Sigma_k$ to be non-singular. It is possible for the maximizer over $q$ in TCP-QDA to assign fewer samples than dimensions to one of the classes, causing the covariance matrix for that class to be singular. To prevent this, we regularize its estimation by enforcing a lower bound on the eigenvalues of the estimated covariance matrix.
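A sketch of the closed-form updates of Eqs. (12)-(14) together with the eigenvalue lower bound described above (illustrative only; the function names and the value of the bound are assumptions, not taken from the paper):

```python
import numpy as np

def floor_eigenvalues(S, min_eig=1e-3):
    # Clip eigenvalues from below so the covariance estimate stays non-singular.
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.maximum(vals, min_eig)) @ vecs.T

def qda_parameters(Z, Q, min_eig=1e-3):
    """Closed-form minimizers for (TCP-)QDA given soft labels Q (m x K):
    class priors (Eq. 12), means (Eq. 13) and covariances (Eq. 14)."""
    m, D = Z.shape
    K = Q.shape[1]
    pi = Q.sum(axis=0) / m                                       # Eq. (12)
    mu = (Q.T @ Z) / Q.sum(axis=0)[:, None]                      # Eq. (13)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        Zc = Z - mu[k]
        Sigma[k] = (Q[:, k, None] * Zc).T @ Zc / Q[:, k].sum()   # Eq. (14)
        Sigma[k] = floor_eigenvalues(Sigma[k], min_eig)          # regularization (Section 4.3)
    return pi, mu, Sigma
```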

Fig. 2. Example of the difference between source Quadratic Discriminant Analysis (left, $\hat{\theta}_{\mathcal{S}}$) and Target Contrastive Pessimistic Quadratic Discriminant Analysis (right, $\hat{\theta}_{\mathcal{T}}$) on the target domain data from Section 2.2.

4.4. Linear discriminant analysis

If the model is constrained to share a covariance matrix between classes, the resulting classifier is a linear function of the difference in means and is hence termed linear discriminant analysis (LDA). This constraint is imposed through the weighted sum over class covariance matrices, $\Sigma = \sum_{k}^{K} \pi_k \Sigma_k$.

4.5. Performance guarantee

For the discriminant analysis model, the TCP parameter estimator obtains a strictly smaller risk. In other words, this parameter estimator is guaranteed to improve its performance, on the given target samples and in terms of risk, over the source classifier. This is the first domain adaptation parameter estimator for which such a guarantee can be provided.

Theorem 1. For a continuous target distribution, with more samples than features for every class, the empirical target risk, with respect to discriminant analysis, of TCP estimated parameters $\hat{\theta}_{\mathcal{T}}$ is, almost surely, strictly smaller than that of the source parameters $\hat{\theta}_{\mathcal{S}}$:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) < \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u). \qquad (16)$$
The reader is referred to Appendix A for the proof. It follows similar steps as a guarantee for discriminant analysis in semi-supervised learning [20]. Note that as long as the same amount of regularization is added to both the source and the TCP estimator, the strictly smaller risk also holds for a regularized model.

5. Experiments

We see the TCP risk formulation from Section 3, together with Theorem 1, as our main contributions. Of course, it is still of interest to see how other approaches compare to ours. We compare² the performance of our classifiers with that of some well-known and robust domain-adaptive classifiers. We implemented Transfer Component Analysis (TCA) [24], Kernel Mean Matching (KMM) [14], Robust Covariate Shift Adjustment (RCSA) [32] and the Robust Bias-Aware (RBA) classifier [19]. TCA and KMM make explicit assumptions: TCA assumes that there are latent factors onto which the data can be projected such that the distributions are more similar, while the original properties such as class separability are preserved. We trained a logistic regressor on the source data mapped onto the transfer components. KMM assumes that the posterior distributions in each domain are equal and that the support of the target distribution is contained within the support of the source distribution. We trained both a weighted logistic regressor and a weighted least-squares classifier using the importance-weights estimated by KMM. We report the best performing of the two, namely least-squares. RCSA also assumes equal posterior distributions, but employs worst-case importance-weight estimation to be robust to weight estimation errors. We used the authors' implementation, which trains a weighted support vector machine using the estimated worst-case weights. RBA assumes that the moments of the source classifier's predictions match those of the target classifier. In our implementation, only the first moments are constrained to match. As baselines, we included a non-adaptive linear (S-LDA) and quadratic (S-QDA) discriminant analysis model trained on the source domain.

² Code is available at https://github.com/wmkouw/tcpr

Table 1
WeatherAUS data set. AUC for all pairwise combinations of domains (D = 'Darwin', P = 'Perth', B = 'Brisbane' and M = 'Melbourne').

S→T      D→P    D→B    D→M    P→B    P→M    B→M    P→D    B→D    M→D    B→P    M→P    M→B    avg
S-LDA    0.650  0.700  0.672  0.783  0.732  0.565  0.862  0.819  0.919  0.789  0.879  0.903  0.773
S-QDA    0.681  0.857  0.642  0.914  0.940  0.881  0.950  0.937  0.955  0.898  0.929  0.959  0.879
TCA      0.825  0.856  0.718  0.838  0.720  0.628  0.842  0.856  0.845  0.834  0.808  0.662  0.786
KMM      0.778  0.704  0.556  0.766  0.705  0.691  0.827  0.717  0.768  0.612  0.517  0.505  0.679
RCSA     0.837  0.895  0.769  0.841  0.759  0.726  0.858  0.872  0.878  0.813  0.851  0.851  0.829
RBA      0.844  0.884  0.764  0.843  0.756  0.741  0.860  0.874  0.878  0.818  0.844  0.839  0.829
TCP-LDA  0.833  0.886  0.749  0.853  0.738  0.733  0.858  0.869  0.875  0.828  0.838  0.859  0.827
TCP-QDA  0.710  0.886  0.760  0.932  0.946  0.903  0.965  0.950  0.969  0.905  0.908  0.964  0.900

All target samples are given, unlabeled, to the adaptive classifiers. The classifiers make predictions for those given target samples and their performance is evaluated with respect to those target samples' true labels. Performance is measured in terms of Area Under the ROC-curve (AUC). All methods are trained using L2-regularization. Since there is no labeled target data available for validation, we set the regularization parameter to a small value, namely 0.01.
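The evaluation protocol can be sketched as follows (an illustration only; scikit-learn's roc_auc_score is assumed for the AUC, and `score` stands in for any classifier's decision values on the given target samples):

```python
from sklearn.metrics import roc_auc_score

def evaluate_on_target(score, Z, u_true):
    # Predict on the *given* target samples and compare against their true labels,
    # which are only used for evaluation, never for training.
    return roc_auc_score(u_true, score(Z))
```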

5.1. Data sets

We performed a set of experiments on two data sets that are geographically split into domains. In the first problem, the goal is to predict whether it will rain the following day, based on 22 features including wind speed, humidity, and sunshine (the data set is part of the R package Rattle [33]). The measurements are taken over a period of 200 days from Australian weather stations located in Darwin, Perth, Brisbane, and Melbourne. Each station can be considered a domain because the feature spaces are equal but the underlying data-generating distributions are different. For instance, the average temperature is several degrees higher in Darwin than in Melbourne.

The second data set is from the UCI machine learning repository [18]. The goal is to predict heart disease in patients from 4 different hospitals. These are located in Hungary (294 patients), Switzerland (123 patients), California (200 patients) and Ohio (303 patients). Each hospital can be considered a domain because patients are measured on the same clinical features but the local patient populations differ. For example, patients in Hungary are on average younger than patients from Switzerland (48 versus 55 years). Heart disease is predicted from 13 clinical features such as age, sex, cholesterol level and chest pain type. Both data sets are preprocessed by first imputing missing values with zeros and then z-scoring each feature.
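A sketch of that preprocessing step (whether the z-scoring is applied per domain or pooled over domains is not specified in the text, so that choice is an assumption here):

```python
import numpy as np

def preprocess(X):
    # Impute missing values with zeros, then z-score each feature.
    X = np.where(np.isnan(X), 0.0, X)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features (assumption)
    return (X - mu) / sd
```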

5.2. Results

Table 1 compares the AUCs of various classifiers on the WeatherAUS data set. All combinations of using one station as the source domain and another station as the target domain are taken. Firstly, as a collective, the robust methods (TCP-QDA, TCP-LDA, RBA, RCSA) rather consistently outperform the non-robust methods (TCA, KMM, S-LDA, S-QDA), though it certainly is not the case that every robust method outperforms every non-robust one. Also, there is one exception where S-QDA actually performs best of all. Secondly, RCSA outperforms KMM in all cases, indicating that it is either difficult to estimate appropriate importance weights or that it is difficult to train the importance-weighted classifier given KMM's weights. Thirdly, in eight out of twelve cases TCP-LDA outperforms S-LDA. TCP-QDA is better than S-QDA in eleven of the twelve. Lastly, S-LDA occasionally outperforms the non-TCP, adaptive classifiers, where this most notably happens in the three cases when S = M. For S-QDA this happens in all cases except for S = D. When S = M and T = P, we find that S-QDA performs best overall. Particularly where S-LDA is concerned, these results indicate that adaptation strategies can also be detrimental to performance.

Table 2 lists the AUCs of each classifier on the heart disease data set. Overall, the AUCs are lower here, indicating that these settings are more difficult than those of the weather stations. Firstly, TCP-LDA generally outperforms TCP-QDA here, indicating that most problem settings are linearly separable and the additional flexibility of QDA is not helpful. Secondly, the differences in performance between S-LDA and S-QDA and their TCP versions are clearly less appreciable. In most cases the differences seem insignificant. Exceptions occur when S = S and T = O, in which case the original methods actually perform clearly better, and when S = S and T = H, in which case the TCP adaptations do so. Thirdly, RCSA does not always outperform KMM, but since both KMM and RCSA perform worse than chance on a few occasions, it does seem that the assumption of equivalent posterior distributions is invalid in many cases. Fourthly, TCA's performance also varies around chance level, which means that it is difficult to recover a common latent representation here. Lastly, S-LDA and S-QDA outperform the adaptive classifiers on a number of occasions again.

Table 2
Heart disease data set. AUC for all pairwise combinations of domains (O = 'Ohio', H = 'Hungary', S = 'Switzerland' and C = 'California').

S→T      O→H    O→S    O→C    H→S    H→C    S→C    H→O    S→O    C→O    S→H    C→H    C→S    avg
S-LDA    0.866  0.674  0.658  0.671  0.726  0.527  0.866  0.500  0.831  0.559  0.883  0.440  0.683
S-QDA    0.829  0.674  0.503  0.660  0.668  0.484  0.840  0.500  0.811  0.502  0.834  0.452  0.647
TCA      0.674  0.597  0.500  0.453  0.466  0.530  0.544  0.439  0.693  0.408  0.661  0.572  0.545
KMM      0.709  0.591  0.460  0.503  0.568  0.552  0.742  0.302  0.294  0.345  0.290  0.508  0.489
RCSA     0.646  0.667  0.572  0.641  0.483  0.459  0.749  0.626  0.651  0.685  0.647  0.343  0.597
RBA      0.502  0.670  0.430  0.636  0.423  0.582  0.556  0.366  0.523  0.396  0.597  0.412  0.508
TCP-LDA  0.864  0.675  0.653  0.673  0.725  0.555  0.867  0.424  0.831  0.717  0.882  0.447  0.693
TCP-QDA  0.822  0.675  0.500  0.661  0.660  0.432  0.841  0.422  0.813  0.565  0.847  0.414  0.638

6. Discussion

Although, by construction, the TCP classifiers are never worse than the source classifier in terms of empirical risk, they will not automatically lead to improvements in the error rate. This is due to the fact that a surrogate loss function is used during training: the classifier that minimizes the surrogate loss need not be the classifier that minimizes the 0/1-loss [2,4,21]. Similar performance guarantees as we have given with respect to empirical risk cannot be given with respect to classification error, because the 0-1 loss cannot be directly optimized.

Although our TCP estimator is guaranteed to never perform worse than the source classifier, it may not perform well if the source classifier is a poor choice to begin with. Of course, if no decent source classifiers can be formed, then one can wonder whether any kind of adaptation will be able to construct a satisfactory target classifier, unless particularly reliable assumptions can be made.

Given that reliable assumptions can be made, our TCP estimator could still be useful. Rather than the original supervised source classifier, one can, in principle, use any adaptive classifier in combination with TCP parameter estimation. In that case, the TCP parameter estimator would still retain its guarantee to not perform worse than the classifier it is compared against, which in this case is the adaptive classifier. Potentially, this may of course lead to even better parameter estimates. A wide range of standard classifiers that rely on the optimization of a convex loss can be incorporated, such as least-squares or support vector machines, meaning that TCP could be combined with many adaptive classifiers. Non-convex losses, as widely employed in this era of deep learning, are a challenge and, as yet, it is an open and interesting research question to what extent our theoretical results can be salvaged in that setting.

Another possible extension to the current estimator is to use multiple source domains. Perhaps our TCP estimator could produce better estimates than the best source estimates. One could envision contrasting the proposal classifier with the classifier producing the lowest risk from among a set of source classifiers, each trained on its own source domain. Finding the best one from among the set of source classifiers would require an additional minimization step over source domains, which would increase the computational cost. Selecting a subset of source domains in advance could limit this increase in cost and make such an approach feasible.

7. Conclusion

We have designed a risk minimization formulation for a domain-adaptive classifier whose performance, in terms of empirical target risk, is always at least as good as that of the non-adaptive source classifier, without making assumptions on the relationship between domains. This is something that no other method can guarantee. Furthermore, for the discriminant analysis case, its performance is always strictly better. As demonstrated, our Target Contrastive Pessimistic discriminant analysis model performs consistently well among other robust classifiers.

Declaration of Competing Interest

The authors state that they hold no conflict of interest.

Acknowledgment

A word of thanks goes out to the two anonymous reviewers whose feedback helped us improve the presentation of our work. We gladly acknowledge their constructive remarks and comments.

Appendix A

Proof of Theorem 1. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a data set of size $n$ drawn i.i.d. from a continuous distribution defined over input space $\mathcal{X} \subseteq \mathbb{R}^D$ and output space $\mathcal{Y} = \{y \in \{0,1\}^K : \sum_k y_k = 1\}$. Similarly, let $\{(z_j, u_j)\}_{j=1}^{m}$ be a data set of size $m$, drawn i.i.d. from another continuous distribution defined over $\mathcal{X} \times \mathcal{Y}$. Consider a discriminant analysis model parameterized with $\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$ with empirical risk defined by:
$$\hat{R}_{\mathrm{QDA}}(\theta \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} -y_{ik} \log[\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)]. \qquad (A.1)$$
The sample covariance matrix, $\Sigma_k$, is required to be non-singular, which is guaranteed when there are more unique samples than features for every class. Let $\hat{\theta}_{\mathcal{S}}$ be the parameters estimated on labeled source data:
$$\hat{\theta}_{\mathcal{S}} = \underset{\theta}{\arg\min}\ \hat{R}_{\mathrm{QDA}}(\theta \mid x, y), \qquad (A.2)$$
and let $(\hat{\theta}_{\mathcal{T}}, q^{*})$ be the parameters and worst-case labeling estimated by mini-maximizing the Target Contrastive Pessimistic risk:
$$\hat{\theta}_{\mathcal{T}}, q^{*} = \underset{\theta}{\arg\min}\ \underset{q \in \Delta_{K-1}^{m}}{\arg\max}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q). \qquad (A.3)$$
Firstly, keeping $q$ fixed, the minimization over the contrast between the target risk of the proposal parameters $\theta$ and the source parameters $\hat{\theta}_{\mathcal{S}}$ is upper bounded by 0, because both sets of parameters are elements of the same parameter space, $\theta, \hat{\theta}_{\mathcal{S}} \in \Theta$:
$$\min_{\theta}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q) \leq 0, \qquad (A.4)$$
for all choices of $q$. Since $\theta$ can always be set to $\hat{\theta}_{\mathcal{S}}$, values for $\theta$ that would result in a larger target risk than that of $\hat{\theta}_{\mathcal{S}}$ are not valid minimizers of the contrast. Considering that the contrast is upper bounded for any labeling $q$, it is also upper bounded by 0 for the worst-case labeling $q^{*}$:
$$\min_{\theta}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q^{*}) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q^{*}) \leq 0, \qquad (A.5)$$
and since $\hat{\theta}_{\mathcal{T}}$ is the minimizer of the left-hand side of (A.5):
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, q^{*}) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q^{*}) \leq 0. \qquad (A.6)$$
Secondly, keeping $\theta$ fixed, the empirical risk with respect to the true labeling $u$ is always less than or equal to the empirical risk with respect to the worst-case labeling:
$$\hat{R}_{\mathrm{QDA}}(\theta \mid z, u) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u) \leq \max_{q \in \Delta_{K-1}^{m}}\ \hat{R}_{\mathrm{QDA}}(\theta \mid z, q) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q). \qquad (A.7)$$
Since $q^{*}$ is the maximizer for $\hat{\theta}_{\mathcal{T}}$ as parameters, we can write:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u) \leq \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, q^{*}) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, q^{*}). \qquad (A.8)$$
Combining Inequalities (A.6) and (A.8) gives:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) - \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u) \leq 0. \qquad (A.9)$$
Bringing the second term on the left-hand side to the right-hand side shows that the target risk of the TCP estimate is always less than or equal to the target risk of the source classifier:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) \leq \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u). \qquad (A.10)$$
Equality in (A.10) occurs with probability 0, which can be shown through the parameter estimators. The total mean for the source classifier consists of the weighted combination of the class means, resulting in the overall source sample average:
$$\hat{\mu}_{\mathcal{S}} = \sum_{k=1}^{K} \hat{\pi}^{\mathcal{S}}_k \hat{\mu}^{\mathcal{S}}_k = \sum_{k=1}^{K} \frac{\sum_{i}^{n} y_{ik}}{n} \Big( \frac{1}{\sum_{i}^{n} y_{ik}} \sum_{i=1}^{n} y_{ik}\, x_i \Big) = \frac{1}{n} \sum_{i=1}^{n} x_i. \qquad (A.11)$$
The total mean for the TCP-QDA estimator is similarly defined, resulting in the overall target sample average:
$$\hat{\mu}_{\mathcal{T}} = \sum_{k=1}^{K} \hat{\pi}^{\mathcal{T}}_k \hat{\mu}^{\mathcal{T}}_k = \sum_{k=1}^{K} \frac{\sum_{j}^{m} q^{*}_{jk}}{m} \Big( \frac{1}{\sum_{j}^{m} q^{*}_{jk}} \sum_{j=1}^{m} q^{*}_{jk}\, z_j \Big) = \sum_{k=1}^{K} \frac{1}{m} \sum_{j=1}^{m} q^{*}_{jk}\, z_j = \frac{1}{m} \sum_{j=1}^{m} z_j. \qquad (A.12)$$
Note that since $q^{*}$ consists of probabilities, the sum over classes $\sum_{k}^{K} q^{*}_{jk}$ is 1, for every sample $j$. Equal risks for these parameter sets, $\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) = \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u)$, imply equality of the total means, $\hat{\mu}_{\mathcal{T}} = \hat{\mu}_{\mathcal{S}}$. By Eqs. (A.11) and (A.12), equal total means imply equal sample averages: $\frac{1}{m} \sum_{j}^{m} z_j = \frac{1}{n} \sum_{i}^{n} x_i$. Given a set of source samples, drawing a set of target samples such that their averages are exactly equal constitutes a single event under a probability density function. By definition, single events under continuous distributions have probability 0. Therefore, a strictly smaller risk occurs almost surely:
$$\hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{T}} \mid z, u) < \hat{R}_{\mathrm{QDA}}(\hat{\theta}_{\mathcal{S}} \mid z, u). \qquad (A.13) \quad \square$$

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.patrec.2021.05.005.

References

[1] A. Arnold, R. Nallapati, W.W. Cohen, A comparative study of methods for transductive transfer learning, in: IEEE International Conference on Data Mining Workshops, 2007, pp. 77–82.
[2] P.L. Bartlett, M.I. Jordan, J.D. McAuliffe, Convexity, classification, and risk bounds, J. Am. Stat. Assoc. 101 (2006) 138–156.
[3] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J.W. Vaughan, A theory of learning from different domains, Mach. Learn. 79 (2010) 151–175.
[4] S. Ben-David, D. Loker, N. Srebro, K. Sridharan, Minimizing the misclassification error rate using a surrogate convex loss, in: International Conference on Machine Learning, 2012, pp. 83–90.
[5] S. Bickel, M. Brückner, T. Scheffer, Discriminative learning under covariate shift, J. Mach. Learn. Res. 10 (2009) 2137–2155.
[6] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, MIT Press, 2006.
[7] A. Cherukuri, B. Gharesifard, J. Cortes, Saddle-point dynamics: conditions for asymptotic stability of saddle points, SIAM J. Control Optim. 55 (2017) 486–511.
[8] C. Cortes, M. Mohri, Domain adaptation and sample bias correction theory and algorithm for regression, Theor. Comput. Sci. 519 (2014) 103–126.
[9] C. Cortes, M. Mohri, M. Riley, A. Rostamizadeh, Sample selection bias correction theory, in: Algorithmic Learning Theory, 2008, pp. 38–53.
[10] N. Farajidavar, T.E. de Campos, J. Kittler, Adaptive transductive transfer machine, in: British Machine Vision Conference, 2014, pp. 1–12.
[11] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, vol. 1, Springer, 2001.
[12] A. Gammerman, V. Vovk, V. Vapnik, Learning by transduction, in: Conference on Uncertainty in Artificial Intelligence, 1998, pp. 148–155.
[13] Q. Gu, J. Zhou, Learning the shared subspace for multi-task clustering and transductive transfer classification, in: IEEE International Conference on Data Mining, 2009, pp. 159–168.
[14] J. Huang, A.J. Smola, A. Gretton, K.M. Borgwardt, B. Schölkopf, et al., Correcting sample selection bias by unlabeled data, in: Advances in Neural Information Processing Systems, 2007, pp. 601–608.
[15] W.M. Kouw, M. Loog, A review of domain adaptation without target labels, IEEE Trans. Pattern Anal. Mach. Intell. 43 (2021) 766–785.
[16] W.M. Kouw, M. Loog, Target robust discriminant analysis, in: IAPR Joint International Workshops on Statistical Techniques in Pattern Recognition and Structural and Syntactic Pattern Recognition, 2021, accepted.
[17] E.L. Lehmann, G. Casella, Theory of Point Estimation, Springer, 2006.
[18] M. Lichman, UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml
[19] A. Liu, B. Ziebart, Robust classification under sample selection bias, in: Advances in Neural Information Processing Systems, 2014, pp. 37–45.
[20] M. Loog, Contrastive pessimistic likelihood estimation for semi-supervised classification, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 462–475.
[21] M. Loog, J.H. Krijthe, A.C. Jensen, On measuring and quantifying performance: error rates, surrogate loss, and an example in semi-supervised learning, in: Handbook of Pattern Recognition and Computer Vision, World Scientific, 2016, pp. 53–68.
[22] N. Maculan, G.G. De Paula Jr, A linear-time median-finding algorithm for projecting a vector on the simplex of R^n, Oper. Res. Lett. 8 (1989) 219–222.
[23] A.B. Owen, Monte Carlo Theory, Methods and Examples, 2013. https://statweb.stanford.edu/~owen/mc/
[24] S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. 22 (2011) 199–210.
[25] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, MIT Press, 2009.
[26] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[27] O. Sener, H.O. Song, A. Saxena, S. Savarese, Unsupervised transductive domain adaptation, arXiv preprint arXiv:1602.03534, 2016.
[28] H. Shimodaira, Improving predictive inference under covariate shift by weighting the log-likelihood function, J. Stat. Plan. Inference 90 (2000) 227–244.
[29] L. Shu, L.J. Latecki, Transductive domain adaptation with affinity learning, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, 2015, pp. 1903–1906.
[30] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer, 1982.
[31] V. Vapnik, Principles of risk minimization for learning theory, in: Advances in Neural Information Processing Systems, 1992, pp. 831–838.
[32] J. Wen, C.N. Yu, R. Greiner, Robust learning under uncertain test distributions: relating covariate shift to model misspecification, in: International Conference on Machine Learning, 2014, pp. 631–639.
[33] G.J. Williams, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Springer, 2011.
