Evaluating XAI

A comparison of rule-based and example-based explanations

van der Waa, Jasper; Nieuwburg, Elisabeth; Cremers, Anita; Neerincx, Mark

DOI

10.1016/j.artint.2020.103404

Publication date

2021

Document Version

Final published version

Published in

Artificial Intelligence

Citation (APA)

van der Waa, J., Nieuwburg, E., Cremers, A., & Neerincx, M. (2021). Evaluating XAI: A comparison of rule-based and example-based explanations. Artificial Intelligence, 291, [103404].

https://doi.org/10.1016/j.artint.2020.103404

Important note

To cite this publication, please use the final published version (if applicable).

Please check the document version above.

Copyright

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy

Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology.



Evaluating XAI: A comparison of rule-based and example-based explanations

Jasper van der Waa a,b,*, Elisabeth Nieuwburg a,c, Anita Cremers a, Mark Neerincx a,b

a TNO, Perceptual & Cognitive Systems, Soesterberg, Netherlands
b Technical University of Delft, Interactive Intelligence, Delft, Netherlands
c University of Amsterdam, Institute of Interdisciplinary Studies, Amsterdam, Netherlands

Article history:
Received 21 February 2020
Received in revised form 20 August 2020
Accepted 26 October 2020
Available online 28 October 2020

Keywords:
Explainable Artificial Intelligence (XAI)
User evaluations
Contrastive explanations
Artificial Intelligence (AI)
Machine learning
Decision support systems

Abstract

Current developments in Artificial Intelligence (AI) led to a resurgence of Explainable AI (XAI). New methods are being researched to obtain information from AI systems in order to generate explanations for their output. However, there is an overall lack of valid and reliable evaluations of the effects on users' experience of, and behavior in response to, explanations. New XAI methods are often based on an intuitive notion of what an effective explanation should be. Rule- and example-based contrastive explanations are two exemplary explanation styles. In this study we evaluate the effects of these two explanation styles on system understanding, persuasive power and task performance in the context of decision support in diabetes self-management. Furthermore, we provide three sets of recommendations based on our experience designing this evaluation to help improve future evaluations. Our results show that rule-based explanations have a small positive effect on system understanding, whereas both rule- and example-based explanations seem to persuade users into following the advice even when incorrect. Neither explanation improves task performance compared to no explanation. This can be explained by the fact that both explanation styles only provide details relevant for a single decision, not the underlying rationale or causality. These results show the importance of user evaluations in assessing the current assumptions and intuitions on effective explanations.

© 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Humans expect others to comprehensibly explain decisions that have an impact on them [1]. The same holds for humans interacting with decision support systems (DSS). To help them understand and trust a system's reasoning, such systems need to explain their advice to human users [1,2]. Currently, several approaches are proposed in the field of Explainable Artificial Intelligence (XAI) that allow DSS to generate explanations [3]. Aside from the numerous computational evaluations of implemented methods, literature reviews show that there is an overall lack of high-quality user evaluations that add a user-centered focus to the field of XAI [4,5]. As explanations fulfill a user need, explanations generated by a DSS need to be evaluated among these users. This can provide valuable insights into user requirements and effects. In addition, evaluations can be used to benchmark XAI methods to measure the research field's progress.

* Corresponding author at: TNO, Perceptual & Cognitive Systems, Soesterberg, Netherlands.
E-mail address: jasper.vanderwaa@tno.nl (J.S. van der Waa).



The contribution of this article is twofold. First, we propose a set of recommendations on designing user evaluations in the field of XAI. Second, we performed an extensive user evaluation on the effects of rule-based and example-based contrastive explanations. The recommendations regard 1) how to construct a theory of the effects that explanations are expected to have, 2) how to select a use case and participants to evaluate that theory, and 3) which types of measurements to use for the theorized effects. These recommendations are intended as a reference for XAI researchers unfamiliar with user evaluations. They are based on our experience designing a user evaluation and retread knowledge that is more common in fields such as cognitive psychology and Human-Computer Interaction.

The present user study focused on two styles of contrastive explanations and their evaluation. Contrastive explanations in the context of a DSS are those that answer questions such as "Why this advice instead of that advice?" [6]. These explanations help users to understand and pinpoint information that caused the system to give one advice over the other. In two separate experiments, we evaluated two contrastive explanation styles. An explanation style defines the way information is structured and is often defined by the algorithmic approach to generate explanations. Note that this is different from explanation form, which defines how it is presented (e.g. textually or visually). The two evaluated styles were rule-based and example-based explanations, with no explanation as a control. These two styles of explanations are often referred to as means to convey a system's internal workings to a user. However, these statements are not yet formalized into a theory nor compared in detail. Hence, our second contribution is the evaluation of the effects that rule-based and example-based explanations have on system understanding (Experiment I), persuasive power and task performance (Experiment II). We define system understanding as the user's ability to know how the system behaves in a novel situation and why. The persuasive power of an explanation is defined as its capacity to convince the user to follow the given advice, independent of whether it is correct or not. Task performance is defined as the decision accuracy of the combination of the system, explanation and user. Together, these concepts relate to the broader concept of trust, an important topic in XAI research. System understanding is believed to help users achieve an appropriate level of trust in a DSS, and both system understanding and appropriate trust are assumed to improve task performance [7]. Explanations might also persuade the user to various extents, resulting in either appropriate, over- or under-trust, which could affect task performance [8]. Instead of measuring trust directly, we opted for measuring the intermediate variables of understanding and persuasion to better understand how these concepts affect the task.

The way of structuring explanatory information differs between the two explanation styles examined in this study. Rule-based explanations are "if...then..." statements, whereas example-based explanations provide historical situations similar to the current situation. In our experiments, both explanation styles were contrastive, comparing a given advice to an alternative advice that was not given. The rule-based contrastive explanations explicitly conveyed the DSS's decision boundary between the given advice and the alternative advice. The example-based contrastive explanations provided two examples, one on either side of this decision boundary, both as similar as possible to the current situation. The first example illustrated a situation where the given advice proved to be correct, and the second example showed a different situation where an alternative advice was correct.

Rule-based explanations explicitly state the DSS's decision boundary between the given and the contrasting advice. Given this fact, we hypothesized that these explanations improve a participant's understanding of system behavior, causing an improved task performance compared to example-based explanations. Specifically, we expected participants to be able to identify the most important feature used by the DSS in a given situation, replicate this feature's relevant decision thresholds and use this knowledge to predict the DSS's behavior in novel situations. When the user is confronted with how correct its decisions were, this knowledge would result in a better estimate of when a DSS's advice is correct or not. However, rule-based explanations are very factual and provide little information to convince the participant of the correctness of a given advice. As such, we expected rule-based explanations to have little persuasive power. For the example-based explanations we hypothesized opposite effects. As examples of correct past behavior would incite confidence in a given advice, we hypothesized them to hold more persuasive power. However, the amount of understanding a participant would gain would be limited, as it would rely on participants inferring the separating decision boundary between the examples rather than having it presented to them. Whether persuasive power is desirable in an explanation depends on the use case as well as the performance of the DSS. A low performance DSS combined with a highly persuasive explanation, for example, would likely result in a low task performance.

The use case of the user evaluation was based on a diabetes mellitus type 1 (DMT1) self-management context, where patients are assisted by a personalized DSS to decide on the correct dosage of insulin. Insulin is a hormone that DMT1 patients have to administer to prevent the negative effects of the disturbed blood glucose regulation associated with this condition. The dose is highly personal and context dependent, and an incorrect dose can cause the patient short- or long-term harm. The purpose of the DSS's advice is to minimize these adverse effects. This use case was selected for two reasons. Firstly, AI is increasingly often used in DMT1 self-management [9–11]. Therefore, the results are relevant for research on DSS-aided DMT1 self-management. Secondly, this use case was both understandable and motivating for healthy participants without any experience with DMT1. Because DMT1 patients would have potentially confounding experience with insulin administration or certain biases, we recruited healthy participants that imagined themselves in the situation of a DMT1 patient. Empathizing with a patient motivated them to make correct decisions, even if this meant ignoring the DSS's advice in favor of their own choice, or vice versa. This required an understanding of when the DSS's advice would be correct and incorrect and how it would behave in novel situations.

The paper is structured as follows. First we discuss the background and shortcomings of current XAI user evaluations. Furthermore, we provide examples of how rule-based and example-based explanations are currently used in XAI. The subsequent section describes three sets of recommendations for user evaluations in XAI, based on our experience designing the evaluation as well as on relevant literature. Next, we illustrate our own recommendations by explaining the use case in more detail and offering the theory behind our hypotheses. This is followed by a detailed description of our methods, analysis and results. We conclude with a discussion on the validity and reliability of the results and a brief discussion of future work.

2. Background

The following two sections discuss the current state of user evaluations in XAI and rule-based and example-based contrastive explanations. The former section illustrates the shortcomings of current user evaluations, formed by either a lack of validity and reliability or the entire omission of an evaluation. The latter discusses the two explanation styles used in our evaluation in more detail, and illustrates their prevalence in the field of XAI.

2.1. User evaluations in XAI

A major goal of Explainable Artificial Intelligence (XAI) is to have AI systems construct explanations for their own output. Common purposes of these explanations are to increase system understanding [12], improve behavior predictability [13] and calibrate system trust [14,15,8]. Other purposes include support in system debugging [16,12], verification [13] and justification [17]. Currently, the exact purpose of explanation methods is often not defined or formalized, even though these different purposes may result in profoundly different requirements for explanations [18]. This makes it difficult for the field of XAI to progress and to evaluate developed methods.

The difficulties in XAI user evaluations are reflected in recent surveys from Anjomshoae et al. [5], Adadi et al. [19], and Doshi-Velez and Kim [4] that summarize current efforts of user evaluations in the field. The systematic literature review by [5] shows that 97% of the 62 reviewed articles underline that explanations serve a user need, but 41% did not evaluate their explanations with such users. In addition, of those papers that performed a user evaluation, relatively few provided a good discussion of the context (27%), results (19%) and limitations (14%) of their experiment. The second survey from [19] evaluated 381 papers and found that only 5% had an explicit focus on the evaluation of the XAI methods. These two surveys show that, although user evaluations are being conducted, many of them provide limited conclusions for other XAI researchers to build on.

A third survey by [4] discusses an explicit issue with user evaluations in XAI. The authors argue to systematically start evaluating different explanation styles and forms in various domains, a rigor that is currently lacking in XAI user evaluations. To do so in a valid way, several recommendations are given. First, the application level of the study context should be made clear: either a real, simplified or generic application. Second, any (expected) task-specific explanation requirements should be mentioned. Examples include the average human level of expertise targeted, and whether the explanation should address the entire system or a single output. Finally, the explanations and their effects should be clearly stated together with a discussion of the study's limitations. Together, these three surveys illustrate the shortcomings of current XAI user evaluations.

From several studies that do focus on evaluating user effects, we note that the majority focuses on subjective measurement. Surveys and interviews are used to measure user satisfaction [20,21], the goodness of an explanation [22], acceptance of the system's advice [23,24] and trust in the system [25–28]. Such subjective measurements can provide a valuable insight in the user's perspective on the explanation. However, these results do not necessarily relate to the behavioral effects an explanation could cause. Therefore, these subjective measurements require further investigation to see if they correlate with a behavioral effect [7]. Without such an investigation, these subjective results only provide information on the user's beliefs and opinions, but not on actual gained understanding, trust or task performance. Some studies, however, do perform objective measurements. The work from [29], for example, measured both subjective ease-of-use of an explanation and a participant's capacity to correctly make inferences based on the explanations. This allowed the authors to differentiate between behavioral and self-perceived effects of an explanation, underlining the value of performing objective measurements.

The above described critical view on XAI user evaluations is related to the concepts of construct validity and reliability. These two concepts provide clear standards for scientifically sound user evaluations [30–32]. The construct validity of an evaluation is its accuracy in measuring the intended constructs (e.g. understanding or trust). Examples of how validity may be harmed are a poor design, ill-defined constructs or arbitrarily selected measurements. Reliability, on the other hand, refers to the evaluation's internal consistency and reproducibility, and may be harmed by a lack of documentation, an unsuitable use case or noisy measurements. In the social sciences, a common condition for results to be generalized to other cases and to infer causal relations is that a user evaluation is both valid and reliable [30]. This can be (partially) obtained by developing different types of measurements for common constructs. For example, self-reported subjective measurements such as ratings and surveys can be supplemented by behavioral measurements to gather data on the performance in a specific task.

2.2. Rule-based and example-based explanations

Human explanations tend to be contrastive: they compare a certain phenomenon (fact) with a hypothetical one (foil) [33,34]. In the case of a decision support system (DSS), a natural question to ask is "Why this advice?". This question implies a contrast, as the person asking this question often has an explicit contrasting foil in mind. In other words, the implicit question is "Why this advice and not that advice?". The specific contrast allows the explanation to be limited to the differences between fact and foil. Humans use contrastive explanations to explain events in a concise and specific manner [2]. This advantage also applies to systems: contrastive explanations narrow down the available information to a concrete difference between two outputs.

Fig. 1. An overview of three sets of practical recommendations to improve user evaluations for XAI.

Contrastive explanations can vary depending on the way the advice is contrasted with a different advice, for example using rules or examples. Within the context of a DSS advising an insulin dose for DMT1 self-management, a contrastive rule-based explanation could be: "Currently the temperature is below 10 degrees and a lower insulin dose is advised. If the temperature was above 30 degrees, a normal insulin dose would have been advised." This explanation contains two rules that explicitly state the differentiating decision boundaries between the fact and foil. Several XAI methods aim to generate this type of "if...then..." rules, such as [35–38].
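As an illustration (a sketch of our own, not one of the cited methods), such a contrastive rule-based explanation can be rendered from two threshold rules; the Rule structure, factor names and wording below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    factor: str       # e.g. "temperature"
    op: str           # "<" or ">"
    threshold: float  # the decision boundary value
    advice: str       # advice given when the rule fires

def contrastive_rule_explanation(fact: Rule, foil: Rule, value: float) -> str:
    """Render a rule-based contrastive explanation for a given situation."""
    return (
        f"Currently the {fact.factor} is {value} ({fact.op} {fact.threshold}) "
        f"and {fact.advice} is advised. "
        f"If the {foil.factor} was {foil.op} {foil.threshold}, "
        f"{foil.advice} would have been advised."
    )

fact = Rule("temperature", "<", 10, "a lower insulin dose")
foil = Rule("temperature", ">", 30, "a normal insulin dose")
print(contrastive_rule_explanation(fact, foil, value=8))
```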

An example-based explanation refers to historical situations in which the advice was found to be true or false: "The temperature is currently 8 degrees, and a lower insulin dose is advised. Yesterday was similar: it was 7 degrees and the same advice proved to be correct. Two months ago, when it was 31 degrees, a normal dose was advised instead, which proved to be correct for that situation". Such example- or instance-based explanations are often used between humans, as they illustrate past behavior and allow for generalization to new situations [39–42]. Several XAI methods try to identify examples to generate such explanations, for example those from [43–47].
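Selecting such examples can be framed as a nearest-neighbor search on either side of the decision boundary. A minimal sketch with hypothetical history data (again our own illustration, not one of the cited methods) picks, for a new situation, the most similar correct historical case with the same advice and the most similar correct case with the alternative advice:

```python
import numpy as np

# Hypothetical history: feature vectors (here only temperature), the advice
# given at the time, and whether that advice proved correct.
history = np.array([[7.0], [31.0], [12.0]])
advices = np.array(["lower", "normal", "normal"])
correct = np.array([True, True, False])

def contrastive_examples(situation, given_advice):
    """Return the most similar correct case for the fact and for the foil."""
    dists = np.linalg.norm(history - situation, axis=1)
    fact_mask = (advices == given_advice) & correct
    foil_mask = (advices != given_advice) & correct
    fact_idx = np.flatnonzero(fact_mask)[np.argmin(dists[fact_mask])]
    foil_idx = np.flatnonzero(foil_mask)[np.argmin(dists[foil_mask])]
    return history[fact_idx], history[foil_idx]

fact_case, foil_case = contrastive_examples(np.array([8.0]), "lower")
print(fact_case, foil_case)  # -> [7.] [31.]
```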

Research on system explanations using rules and examples is not new. Most of the existing research focused on exploring how users preferred a system would reason, by rules or through examples. For example, users prefer an example-based spam filter over a rule-based one [48], while they prefer spam-filter explanations to be rule-based [49]. Another evaluation showed that the number of rule factors in an explanation had an effect on task performance by either promoting system over-reliance (too many factors) or self-reliance (too few factors) [50]. Work by Lim et al. [51] shows that rule-based explanations cause users to understand system behavior, especially if those rules explain why the system behaves in a certain way as opposed to why it does not behave in a different (expected) way. Studies such as these tend to evaluate either rules or examples, depending on the research field (e.g. recommender system explanations tend to be example-based), but few compare rules with examples.

3. Recommendations for XAI user evaluations

As discussed in Section 2.1, user evaluations play an invaluable role in XAI but are often omitted or of insufficient quality. Our main contribution is a thorough evaluation of rule-based and example-based contrastive explanations. In addition, we believe that the experience and lessons learned in designing this evaluation can be valuable for other researchers. Especially researchers in the field of XAI who are less familiar with user evaluations can benefit from guidance in the design of user studies incorporating knowledge from different disciplines. To that end, we propose three sets of recommendations with practical methods to help improve user evaluations. An overview is provided in Fig. 1.


3.1. R1: Constructs and relations

As stated in Section 2.1, the field of XAI often deals with ambiguously defined concepts such as 'understanding'. We believe that this hinders the creation and replication of XAI user evaluations and their results. Through clear definitions and motivation, the contribution of the evaluation becomes more apparent. This also aids other researchers to extend on the results. We provide three practical recommendations to clarify the evaluated constructs and their relations.

Our first recommendation is to clearly define the intended purposes of an explanation in the form of a construct. A construct is either the intended purpose, an intermediate requirement for the purpose or a potential confound to your purpose. Constructs form the basis of the scientific theory underlying XAI methods and user evaluations. By defining a construct, it becomes easier to develop measurements. Second, we recommend to clearly define the relations expected between the constructs. A concrete and visual way to do so is through a Causal Diagram, which presents the expected causal relations between constructs [52]. These relations form your hypotheses and make sure they are formulated in terms of your constructs. Clearly stating hypotheses allows other researchers to critically reflect on the underlying theory assumed, proved or falsified with the evaluation. It offers insight in how constructs are assumed to be related and how the results support or contradict these relations.
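To make this recommendation concrete, a minimal sketch (ours, not the authors'; networkx is just one convenient way to encode a directed graph) of how the causal relations later shown in Fig. 2 could be written down before any measurement is designed:

```python
import networkx as nx

# A minimal causal diagram of this study's theory (cf. Fig. 2):
# nodes are constructs, signed edges are hypothesized effects.
theory = nx.DiGraph()
theory.add_edge("rule-based explanation", "system understanding", effect="+")
theory.add_edge("example-based explanation", "persuasive power", effect="+")
theory.add_edge("system understanding", "task performance", effect="+")
theory.add_edge("persuasive power", "task performance", effect="-")

for cause, outcome, data in theory.edges(data=True):
    print(f"{cause} --({data['effect']})--> {outcome}")
```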

Our final recommendation regarding constructs is to adopt existing theories, such as from philosophy, (cognitive) psychology and from human-computer interaction (see [2,6] for an overview). The former provides construct definitions whereas the latter two provide theories of human-human and human-computer explanations. These three recommendations to define constructs and their relations and to ground them in other research disciplines can contribute to more valid and reliable user evaluations. In addition, this practice allows results to be meaningful even if hypotheses are rejected, as they falsify a scientific theory that may have been accepted as true.

3.2. R2: Use case and experimental context

The second set of recommendations regards the experimental context, including the use case. The use case determines the task, the participants that can and should be used, the mode of the interaction, the communication that takes place and the information available to the user [53]. As [4] already stated, the selected use case has a large effect on the conclusions that can be drawn and the extent to which they can be generalized. Also, the use case does not necessarily need to be of high fidelity, as a low fidelity allows for more experimental control and a potentially more valid and reliable evaluation [54]. We recommend to take these aspects into account when determining the use case and to reflect on the choices made when interpreting the results of the user evaluation. This improves both the validity and reliability of the evaluation. A concrete way to structure the choice for a use case is to follow the taxonomy provided by [4] (see Section 2.1) or a similar one.

The second recommendation concerns the sample of participants selected, as this choice determines the initial knowledge, experience, beliefs, opinions and biases the users have. Whether participants are university students, domain experts or recruited online through platforms such as Mechanical Turk, the characteristics of the group will have an effect on the results. The choice of population should be governed by the purpose of the evaluation. For example, our evaluation was performed with healthy participants rather than diabetes patients, as the latter tend to vary in their diabetes knowledge and suffer from misconceptions [55]. These factors can interfere in an exploratory study such as ours, in which the findings are not domain specific. Hence, we recommend to invest in both understanding the use case domain and reflecting on the intended purpose of the evaluation. These considerations should be consolidated in inclusion criteria to ensure that the results are meaningful with respect to the study's aim.

Our final recommendation related to the context considers the experimental setting and surroundings, as these may affect the quality and generalizability of the results. An online setting may provide a large quantity of readily available participants, but the results are often of ambiguous quality (see [56] for a review). If circumstances allow, we recommend to use a controlled setting (e.g. a room with no distractions, or a use case specific environment). This allows for valuable interaction with participants while reducing potential confounds that threaten the evaluation's reliability and validity.

3.3. R3: Measurements

Numerous measurements exist for computational experiments on suggested XAI methods (for example: fidelity [57], sensitivity [58] and consistency [59]). However, there is a lack of validated measurements for user evaluations [7]. Hence, our third group of recommendations regards the type of measurement to use for the operationalization of the constructs. We identify two main measurement types useful for XAI user evaluations: self-reported measures and behavioral measures. Self-reported measures are subjective and are often used in XAI user evaluations. They provide insights in users' conscious thoughts, opinions and perceptions. We recommend the use of self-reported measures for subjective constructs (e.g. perceived understanding), but also recommend a critical perspective on whether the measures indeed address the intended constructs. Behavioral measures have a more observational nature and are used to measure actual behavioral effects. We recommend their usage for objectively measuring constructs such as understanding and task performance. Importantly however, such measures often only measure one aspect of behavior. Ideally, a combination of both measurement types should be used to assess effects on both the user's perception and behavior. In this way, a complete perspective on a construct can be obtained. In practice, some constructs lend themselves more to self-reported measurements, for example a user's perception of trust or understanding. Other constructs are more suitable for behavioral measurements, such as task performance, simulatability, predictability, and persuasive power.

Furthermore, we recommend to measure explanation effects implicitly, rather than explicitly. When participants are not aware of the evaluation's purpose, their responses may be more genuine. Also, when measuring understanding or similar constructs, the participant's explicit focus on the explanations may cause skewed results not present in a real-world application. This leads to our third recommendation: to measure potential biases. Biases can regard the participant's overall perspective on AI, the use case, decision-making or similar. However, biases can also be introduced by the researchers themselves. For example, one XAI method can be presented more attractively or reliably than another. It can be difficult to prevent such biases. One way to mitigate them is to design how the explanations are presented, the explanation form, in an iterative manner with expert reviews and pilots. In addition, one can measure these biases nonetheless if possible and reasonable. For example, a usability questionnaire can be used to measure potential differences between the way explanations are presented in the different conditions. For our study we designed the explanations iteratively and verified that the chosen form for each explanation type did not differ significantly in the perception of the participants.

4. The use case: diabetes self-management

In this study, we focused on personalized healthcare, an area in which machine learning is promising and explanations are essential for realistic applications [60]. Our use case is that of assisting patients with diabetes mellitus type 1 (DMT1) with personalized insulin advice. DMT1 is a chronic autoimmune disorder in which glucose homeostasis is disturbed and intake of the hormone insulin is required to balance glucose levels. Since blood glucose levels are influenced by both environmental and personal factors, it is often difficult to find the adequate dose of insulin that stabilizes blood glucose levels [61]. Therefore, personalized advice systems can be a promising tool in DMT1 management to improve quality of life and mitigate long-term health risks.

In our context, a DMT1 patient finds it difficult to find the optimal insulin dose for a meal in a given situation. On the patient's request, a fictitious intelligent DSS provides assistance with the insulin intake before a meal. Based on different internal and external factors (e.g. hours of sleep, temperature, past activity, etc.), the system may advise to take a normal insulin dose, or a higher or lower dose than usual. For example, the system could advise a lower insulin dose based on the current temperature. The factors that were used in the evaluation are realistic, and were based on Bosch [62] and an interview with a DMT1 patient.
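As an illustration of how such a factor-threshold advice policy can be encoded, a minimal sketch using the Experiment I thresholds from Table 1 below; the function name, factor keys and the precedence between factors are our own assumptions, not the authors' implementation:

```python
def insulin_advice(situation: dict) -> str:
    """Ground-truth advice rules of Experiment I (see Table 1).

    Precedence between factors is assumed here; in the experiments each
    situation had a single decisive factor, and the advice shown to
    participants was binary (higher or lower dose than usual).
    """
    if situation.get("alcohol_units", 0) > 1:
        return "lower dose"
    if situation.get("exercise_minutes", 0) > 17:
        return "lower dose"
    # Table 1 lists "Diarrhoea & Nausea"; treating either symptom as
    # sufficient is our reading.
    if situation.get("health") in ("diarrhoea", "nausea"):
        return "lower dose"
    if situation.get("hours_slept", 8) < 6:
        return "higher dose"
    if situation.get("temperature_celsius", 20) > 26:
        return "higher dose"
    if situation.get("tension_level", 0) > 3:
        return "higher dose"
    return "normal dose"  # filler factors (water, caffeine, mood) never fire

print(insulin_advice({"temperature_celsius": 31, "hours_slept": 7}))  # higher dose
```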

In this use case, both the advice and the explanations are simplified. This study therefore falls under the human-grounded evaluation category of Doshi-Velez and Kim [4]: a simplified task of a real-world application. The advice is binary (higher or lower), whereas in reality one would expect either a specific dose or a range of suggested doses. This simplification allowed us to evaluate with novice users (see Section 6.3), as we could limit our explanation to the effects of a too low or too high dosage without going into detail about effects of specific doses. Furthermore, this prevented the unnecessary complication of having multiple potential foils for our contrastive explanations. Although the selection of the foil, either by system or user, is an interesting topic regarding contrastive explanations, it was deemed out of scope for this evaluation. The second simplification was that the explanations were not generated using a specific XAI method, but designed by the researchers instead. Several design iterations were conducted based on feedback from XAI researchers and interaction designers to remove potential design choices in the explanation form that could cause one explanation to be favored over another. Since the explanations were not generated by a specific XAI method, we were able to explore the effects of more prototypical rule- and example-based explanations inspired by multiple XAI methods that generate similar explanations (see Section 2.2).

There are several limitations caused by these two simplifications. First, we imply that the system can automatically select the appropriate foil for contrastive explanations. Second, we assume that the XAI method is able to identify only the most relevant factors to explain a decision. Although this assumes a potentially complex requirement for the XAI method, it is a reasonable assumption as humans prefer a selective explanation over a complete one [2].

5. Constructs, expected relations and measurements

The user evaluation focused on three constructs: system understanding, persuasive power, and task performance. Although an important goal of offering explanations is to allow users to arrive at the appropriate level of trust in the system [63,7], the construct of trust is difficult to define and measure [18]. As such, our focus was on constructs influencing trust that were more suitable to translate into measurable constructs: the intermediate construct of system understanding and the final construct of task performance of the entire user-system combination. The persuasive power of an explanation was also measured, as an explanation might cause over-trust in a user; believing that the system is correct while it is not, without having a proper system understanding. As such, the persuasive power of an explanation confounds the effect of understanding on task performance.

Both contrastive rule- and example-based explanations were compared to each other, with no explanation as a control. Our hypotheses are visualized in a Causal Diagram depicted in Fig. 2 [52]. From rule-based explanations we expected participants to gain a better understanding of when and how the system arrives at a specific advice. Contrastive rule-based explanations explicate the system's decision boundary between fact and foil and we expected the participants to recall and apply this information. Second, we expected that contrastive example-based explanations persuade participants to follow the advice more often. We believe that examples raise confidence in the correctness of an advice as they illustrate past good performance of the system. Third, we hypothesized that both system understanding and persuasive power have an effect on task performance. Whereas this effect was expected to be positive for system understanding, persuasive power was expected to affect task performance negatively in case a system's advice is not always correct. This follows the argumentation that persuasive explanations can cause harm as they may convince users to over-trust a system [64]. Note that we conducted two separate experiments to measure the effects of an explanation type on understanding and persuasion. This allowed us to measure the effect of each construct separately on task performance, but not their combined effect (e.g. whether sufficient understanding can counteract the persuasiveness of an explanation).

Fig. 2. Our theory, depicted as a Causal Diagram. It describes the expected effects of contrastive rule- and example-based explanations on the constructs of system understanding, persuasive power and task performance. The solid and green arrows depict expected positive effects and the dashed and red arrow depicts a negative effect. The arrow thickness depicts the size of the expected effect. The opaque grey boxes are the measurements that were performed for that construct, divided into behavioral and self-reported measurements.

The construct of understanding was measured with two behavioral measurements and one self-reported measurement. The first behavioral measurement assessed the participant's capacity to correctly identify the decisive factor of the situations in the system's advice. This measured to what extent the participant recalled what factor the system believed to be important for a specific advice and situation. Second, we measured the participant's ability to accurately predict the advice in novel situations. This tested whether the participant obtained a mental model of the system that was sufficiently accurate to predict its behavior in novel situations. The self-reported measurement tested the participant's perceived system understanding. This provided insight in whether participants over- or underestimated their understanding of the system compared to what their behavior told us.

Persuasive power of the system's advice was measured with one behavioral measurement, namely the number of times participants copied the advice, independent of its correctness. If participants that received an explanation followed the advice more often than participants without an explanation, we attributed this to the persuasiveness of the explanation.

Task performance was measured as the number of correct decisions, a behavioral measurement, and perception of predicting advice correctness, a self-reported measurement. We assumed a system that did not have a 100% accurate performance, meaning that it also made incorrect decisions. Therefore, the number of correct decisions made by the participant while aided by the system could be used to measure task performance. The self-reported measure allowed us to measure how well participants believed they could predict the correctness of the system advice.

Finally, two self-reported measurements were added to check for potential confounds. The first was a brief usability questionnaire addressing issues such as readability and the organization of information. This could reveal whether one explanation style was designed and visualized better than the other, which would be a confounding variable. The second, perceived system accuracy, measured how accurate the participant thought the system was. This could help identify a potential over- or underestimation of the usefulness of the system, which could have affected to what extent participants attended to the system's advice and explanation.

The combination of self-reported and behavioral measurements enabled us to draw relations between our observations and a participant's own perception. Finally, by measuring a single construct with different measurements (known as triangulation [65]) we could identify and potentially overcome biases and other weaknesses in our measurements.


Fig. 3. The contrastive rule-based (above) and example-based (below) explanation styles. Participants could view the situation, advice and explanation indefinitely.

6. Methods

In this section we describe the operationalization of our user evaluation in two separate experiments in the context of DSS advice in DMT1 self-management (see Section 4). Experiment I focused on the construct of system understanding. Experiment II focused on the constructs of persuasive power and task performance. The explanation style (contrastive rule-based, contrastive example-based or no explanation) was the independent variable in both experiments and was tested between-subjects. See Fig. 3 for an example of each explanation style.

The experimental procedure was similar in both experiments:

1. Introduction. Participants were informed about the study, use case and task, as well as presented with a brief narrative about a DMT1 patient for immersive purposes.

2. Demographics questionnaire. Age and education level were inquired to identify whether the population sample was sufficiently broad.

3. Pre-questionnaire. Participants were questioned on DMT1 knowledge to assess if DMT1 was sufficiently introduced and to check our assumption that participants had no additional domain knowledge.


4. Learning block. Multiple stimuli were presented, accompanied with either the example- or rule-based explanations, or no explanations (control group).

5. Testing block. Several trials followed to conduct the behavioral measurements (advice prediction and decisive factor identification in Experiment I, the number of times advice copied and number of correct decisions in Experiment II).

6. Post-questionnaire. A questionnaire was completed to obtain self-reported measurements (perceived system understanding in Experiment I and perceived prediction of advice correctness in Experiment II).

7. Usability questionnaire. Participants filled out a usability questionnaire to identify potential interface-related confounds.

8. Control questionnaire. The experimental procedure concluded with several questions to assess whether the purpose of the study was suspected and to measure perceived system accuracy to identify over- or under-trust in the system.

Fig. 4. A schematic overview of the learning (left) and testing (right) block in Experiment I.

Table 1
An overview of the nine factors that played a role in the experiment. For each factor, its influence on the correct insulin dose is shown, as well as the system threshold for that influence. The thresholds differed between the two experiments and the set of rules of the first experiment were defined as the ground truth. Three factors served as fillers and had no influence.

Factor                      Insulin dose   Exp. I Rules          Exp. II Rules
Planned alcohol intake      Lower dose     >1 unit               >1 unit
Planned physical exercise   Lower dose     >17 minutes           >20 minutes
Physical health             Lower dose     Diarrhoea & Nausea    Diarrhoea & Nausea
Hours slept                 Higher dose    <6 hours              <6 hours
Environmental temperature   Higher dose    >26 °C                >31 °C
Anticipated tension level   Higher dose    >3 (a little tense)   >4 (quite tense)
Water intake so far         -              -                     -
Planned caffeine intake     -              -                     -
Mood                        -              -                     -

6.1. Experiment I: System understanding

The purpose of Experiment I was to measure the effects of rule-based and example-based explanations on system understanding compared to each other and to the control group with no explanations. See Fig. 4 for an overview of both the learning and testing blocks. The learning block consisted of 18 randomly ordered trials, each trial describing a single situation with three factors and values from Table 1. The situation description was followed by the system's advice, in turn followed by an explanation (in the experimental groups). Finally, the participant was asked to make a decision on administering a higher or lower insulin dose than usual. This block served only to familiarize the participant with the system's advice and its explanation and to learn when and why a certain advice was given. Participants were not instructed to focus on the explanations in the learning block, nor were they informed of the purpose of the two blocks.

In the testing block, two behavioral measures were used to test the construct of understanding: advice prediction and decisive factor identification. The testing block consisted of 30 randomized trials, each with a novel situation description. Each description was followed by the question what advice the participant thought the system would give. This formed the measurement of advice prediction. The measurement decisive factor identification was formed by the subsequent question to select a single factor from a situation description that they believed was decisive for the predicted system advice.

Fig. 5. A schematic overview of the learning (left) and testing (right) block in Experiment II.

A third, self-reported measurement was conducted in the post-questionnaire, which contained an eight-item questionnaire based on a 7-point Likert scale. These items formed the measurement of perceived system understanding. The questions were asked without mentioning the term explanation and simply addressed 'system output'. Eight items were deemed necessary to obtain a measurement less dependent on the formulation of any one item.

6.2. Experiment II: Persuasive power and task performance

The purpose of Experiment II was to measure the effects of rule-based and example-based explanations on persuasive power and task performance, and to compare these to each other and to the control group with no explanation. Fig. 5 provides an overview of the learning and testing blocks of this experiment. The learning block was similar to that of the first experiment: a situation was shown, containing three factors from Table 1. In the experimental groups, the situation was followed by an advice and explanation. Next, the participant was asked to make a decision on the insulin dose. After this point, the learning block differed from the learning block in the first experiment: the participant's decision was followed with feedback on its correctness. In 12 of the 18 randomly ordered trials of this learning block (66%), the system's advice was correct. In the six other trials, the advice was incorrect. Through this feedback, participants learned that the system's advice could be incorrect and in which situations. Instead of following the ground truth rule set (from Experiment I), this system followed a second, partially correct set of rules, as shown in Table 1.

The testing block contained 30 trials, also presented in random order, in which a presented situation was followed by the system's advice and explanation. Next, participants had to choose which insulin dose was correct based on the system's advice, explanation and gained knowledge of when the system is incorrect. Persuasive power was operationalized as the number of times a participant followed the advice, independent of whether it was correct or not. Task performance was represented by the number of times a correct decision was made. The former reflected how persuasive the advice and explanation were, even when participants experienced system errors. The latter reflected how well participants were able to understand when the system makes errors and compensate accordingly in their decision.
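In terms of the logged trials, both behavioral measures reduce to simple proportions. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical per-trial log for one participant in the testing block.
trials = pd.DataFrame({
    "advice":         ["higher", "lower", "higher"],   # the system's advice
    "correct_answer": ["higher", "higher", "higher"],  # the ground truth
    "participant":    ["higher", "higher", "higher"],  # the decision made
})

# Persuasive power: fraction of trials in which the advice was followed,
# regardless of whether it was correct.
persuasion = (trials["participant"] == trials["advice"]).mean()

# Task performance: fraction of trials in which the decision was correct.
performance = (trials["participant"] == trials["correct_answer"]).mean()

print(f"advice followed: {persuasion:.0%}, correct decisions: {performance:.0%}")
```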

Also in this experiment, a self-reported measurement with eight 7-point Likert scale questions was performed. It measured the participant's subjective sense of their ability to estimate when the system was correct.

6.3. Participants

In Experiment I, 45 participants took part, of which 21 female and 24 male, aged between 18 and 64 years old (M = 44.2 ± 16.8). Their education levels varied from lower vocational to university education. In Experiment II 45 different participants took part, of which 31 female and 14 male, aged between 18 and 61 years old (M = 36.5 ± 14.5). Their education levels varied from secondary vocational to university education. Participants were recruited from a participant database at TNO Soesterberg (NL) as well as via advertisements in Utrecht University (NL) buildings and on social media. Participants received a compensation of 20 euro and their travel costs were reimbursed. Both samples aimed to represent the entire Dutch population and as such the entire range of potential DMT1 patients, hence the wide age and educational ranges.

The inclusion criteria were as follows: not diabetic, no close relatives or friends with diabetes, and no extensive knowledge of diabetes through work or education. General criteria were Dutch native speaking, good or corrected eyesight, and basic experience using computers. These inclusion criteria were verified in the pre-questionnaire. A total of 16 participants reported a close relative or friend with diabetes and one participant had experience with diabetes through work, despite clear inclusion instructions beforehand. After careful inspection of their answers, none were excluded because their answers on diabetes questions in the pre-questionnaire were not more accurate or elaborate than others. From this we concluded that their knowledge of diabetes was unlikely to influence the results.

7. Data analysis

Statistical tests were conducted using SPSS Statistics 22. An alpha level of 0.05 was used for all statistical tests.

The data from the behavioral measures in Experiment I were analyzed using a one-way Multivariate Analysis of Variance (MANOVA) with explanation style (rule-based, example-based or no explanation) as the independent between-subjects variable and advice prediction and decisive factor identification as dependent variables. The reason for a one-way MANOVA was the multivariate operationalization of a single construct, understanding [66]. Cronbach's Alpha was used to assess the internal consistency of the self-reported measurement for perceived system understanding from the post-questionnaire. Subsequently, a one-way Analysis of Variance (ANOVA) was conducted with the mean rating on this questionnaire as dependent variable and the explanation style as independent variable. Finally, the relation between the two behavioral and the self-reported measurements was examined with Pearson's product-moment correlations.

For Experiment II two one-way ANOVAs were performed. The first ANOVA had the explanation style (rule-based, example-based or no explanation) as independent variable and the number of times the advice was copied as dependent variable. The second ANOVA also had explanation style as independent variable, but the number of correct decisions as dependent variable. The internal consistency of the self-reported measurement of perceived prediction of advice correctness from the post-questionnaire was assessed with Cronbach's Alpha and analyzed with a one-way ANOVA. Explanation style was the independent and the mean rating on the questionnaire the dependent variable. The presence of correlations between the behavioral and the self-reported measurements was assessed with Pearson's product-moment correlations. Detected outliers were excluded from the analysis.
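The analyses were run in SPSS; for readers who prefer code, a minimal sketch of the same building blocks in Python (placeholder data; scipy for the one-way ANOVA and Pearson correlation, Cronbach's alpha computed from its standard formula):

```python
import numpy as np
from scipy import stats

# Placeholder scores per explanation style (e.g. number of correct decisions).
rule_based    = np.array([24, 26, 22, 25])
example_based = np.array([23, 27, 24, 26])
no_expl       = np.array([20, 22, 21, 23])

# One-way ANOVA with explanation style as the between-subjects variable.
f_stat, p_value = stats.f_oneway(rule_based, example_based, no_expl)

# Pearson product-moment correlation between a behavioral and a
# self-reported measure.
self_reported = np.array([4.5, 5.0, 3.5, 4.0])
r, p_corr = stats.pearsonr(rule_based, self_reported)

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

likert = np.array([[5, 6, 5], [3, 4, 4], [6, 6, 7], [4, 5, 4]])
print(f_stat, p_value, r, p_corr, cronbach_alpha(likert))
```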

8. Results

8.1. Experiment I: System understanding

The purpose of Experiment I was to measure gained system understanding when a system provides a rule- or example-based explanation, compared to no explanation. This was measured with two behavioral measures and one self-reported measure.

Fig. 6 shows the results on the two behavioral measures: correct advice prediction in novel situations and correct identification of the system's decisive factor. A one-way MANOVA with Wilks' lambda indicated a significant main effect of explanation style on both measurements (F(4,82) = 6.675, p < 0.001, Λ = .450, η²p = .246). Further analysis revealed a significant effect for explanation style on factor identification (F(2,42) = 14.816, p < 0.001, η²p = .414), but not for advice prediction (F(2,42) = 14.816, p = .264, η²p = .414). One assumption of a one-way MANOVA was violated, as the linear relationships between the two dependent variables and each explanation style were weak. This was indicated by Pearson's product-moment correlations for the rule-based (r = .487, p = .066), example-based (r = −.179, p = .522) and no explanation (r = .134, p = .636) groups. Some caution is needed in interpreting these results, as this lack of significant correlations shows a potential lack of statistical power. Further post-hoc analysis showed a significant difference in factor identification in favor of rule-based explanations compared to example-based explanations and no explanations (p < 0.001). No significant difference between example-based explanations and no explanation was found (p = .796).

Fig. 7 shows the results on the self-reported measure of system understanding. The consistency between the different items in the measure was very high, as reflected by Cronbach's alpha (α = .904). The mean rating over all eight items was used as the participant's subjective rating of system understanding. A one-way ANOVA showed a significant main effect of explanation style on this rating (F(2,41) = 7.222, p = .002, η²p = .261). Two assumptions of a one-way ANOVA were violated. First, the rule-based explanations group had one outlier, of which the exclusion did not affect the analysis in any way. The results after removal of this outlier are reported. Second, Levene's test was significant (p = .017), signaling inequality between group variances. However, ANOVA is robust against the variance homogeneity violation with equal group sizes [67,68]. Further post-hoc tests revealed that only rule-based explanations caused a significantly higher self-reported understanding compared to no explanations (p = .001). No significant difference was found for example-based explanations with no explanations (p = .283) and with rule-based explanations (p = .072).


Fig. 6. Bar plot of the mean percentages of correct prediction of the system's advice and correct identification of the decisive factor for that advice. Values are relative to the total of 30 randomized trials in Experiment I. The error bars represent a 95% confidence interval. Note: ***p < 0.001.

Fig. 7. Bar plot of the mean self-reported system understanding. All values are on a 7-point Likert scale and error bars represent a 95% confidence interval. Note: **p < 0.01.

Finally, Fig. 8 shows a scatter plot between both behavioral measures and the self-reported measure. Pearson's product-moment analysis revealed no significant correlations between self-reported understanding and advice prediction (r = −.007, p = .965), not within the rule-based explanation group (r = −.462, p = .129), the example-based explanation group (r = −.098, p = .729), nor the no explanation group (r = .001, p = .996). Similar results were found for the correlation between self-reported understanding and factor identification (r = .192, p = .211) and for the separate groups of rule-based explanations (r = −.124, p = .673), example-based explanations (r = .057, p = .840) and no explanations (r = −.394, p = .146).

8.2. Experiment II: Persuasive power and task performance

The purpose of Experiment II was to measure a participant's ability to use a decision support system appropriately when it provides a rule- or example-based explanation, compared with no explanation. This was measured with one behavioral and one self-reported measurement. In addition, we measured the persuasiveness of the system for each explanation style, compared to no explanations. This was assessed with one behavioral measure.


Fig. 8. Scatter plots displaying the relation between advice prediction (left) and decisive factor identification (right) with self-reported understanding. Outliers are circled.

Fig. 9. Bar plot displaying task performance (the mean percentage of correct decisions) and persuasive power (the mean percentage of decisions following the system's advice independent of correctness). Error bars represent a 95% confidence interval. Note: *p < 0.05, ***p < 0.001.

Fig. 9 shows the results of the behavioral measure for task performance, as reflected by the user's decision accuracy. A one-way ANOVA showed no significant differences (F(2,41) = 1.716, p = .192, η²p = .077). Two violations of ANOVA were discovered. There was one outlier in the example-based explanations, with 93.3% accuracy (1 error). Removal of the outlier did not affect the analysis. Levene's test showed there was no homogeneity of variances (p = .007), however ANOVA is believed to be robust against this under equal group sizes [67,68].

Fig. 9 also shows the results of the behavioral measure for persuasiveness, i.e. the number of times system advice was followed. Note that in Experiment II the system's accuracy was 66.7%. Thus, following the advice in a higher percentage of cases denotes an adverse amount of persuasion. A one-way ANOVA showed that explanation style had a significant effect on following the system's advice (F(2,41) = 11.593, p < .001, η²p = .361). Further analysis revealed that participants with no explanation followed the system's advice significantly less than those with rule-based (p = .049) and example-based explanations (p < .001). However, there was no significant difference between the two explanation styles (p = .068). One outlier violated the assumptions of an ANOVA. One participant in the rule-based explanation group followed the system's advice only 33.3% of the time. Its exclusion affected the outcomes of the ANOVA and the results after exclusion are reported.

Fig. 10 displays the self-reported capacity to predict correctness, operationalized by a rating of how well participants thought they were able to predict when system advice was correct or not. The consistency of the eight 7-point Likert scale questions was high according to Cronbach's Alpha (α = .820). Therefore, we took the mean rating of all questions as an estimate of participants' performance estimation. A one-way ANOVA was performed, revealing no significant differences (F(2,41) = 2.848, p = .069, η²p = .122). One outlier from the rule-based explanation group was found; its removal did not affect the analysis.

A correlation analysis was performed between the self-reported prediction of advice correctness and the behavioral measurement of making the correct decision, two measurements of task performance. The accompanying scatter plot is shown in Fig. 11. A Pearson's product-moment correlation revealed no significant correlation between the self-reported and behavioral measure (r = .146, p = .350). Also, there were no significant correlations in the rule-based (r = .411, p = .144) and example-based explanation (r = −.347, p = .225) groups, nor in the no explanation group (r = .102, p = .718). Both outliers from each measurement were removed in this analysis and did not affect the significance.

Fig. 10. Bar plot of the mean self-reported system performance estimation. All values are on a 7-point Likert scale and error bars represent a 95% confidence interval.

Fig. 11. Scatter plot displaying the relation between the number of correct decisions made and the self-reported capacity to predict advice correctness. Outliers are circled.

8.3. Usability and biases

A usability questionnaire was used to evaluate whether there were differences in usability between the two explanation styles, as this could influence the results. The questionnaire contained five questions on a 100-point scale about readability, organization of information, language, images and color. The consistency between the five questions was relatively high, as revealed by a Cronbach's Alpha test (α = .722). Fig. 12 shows the mean ratings for each question, broken down by explanation style (rule-based, example-based, no explanation). No statistical analysis was performed, as this questionnaire only functioned as a check for potential usability confounds in the experiment.

In addition to the ratings, participants were asked about the positive and negative usability aspects of the system in two open questions. Common positive descriptions included "clear", "well-arranged", "clear and simple icons" and "understandable language". Although not many participants had negative remarks, most addressed insufficient visual contrast due to the colors used. Unique to the example-based explanations participant group were remarks about a lack of concise and well-arranged information.

In the control questionnaire we asked participants to give an estimate of the overall system's accuracy. This was to validate any potential overly positive or negative trust bias towards the system. In Experiment I the system was 100% accurate, but this was unknown to the participants since there was no feedback on correctness included. Nonetheless, estimates ranged from 30% to 90% (μ = 75.2%, σ = 12.8%). This meant that all participants believed the system to make errors based on no information. In Experiment II the system's accuracy was 66.7%. Participants experienced this due to the feedback on made decisions in the learning block. Estimates ranged between 50% and 95% (μ = 74.8%, σ = 8.8%), indicating that, on average, system accuracy was overestimated.

Fig. 12. The mean ratings on the usability questions, displayed by explanation style. The error bars represent a 95% confidence interval.

After the experiment, brief discussions with participants revealed additional perspectives. Several participants from the no explanation group wished the system could give an explanation for its advice. One participant expressed a need for knowing the rules governing the system's advice. In the two explanation groups, participants experienced the explanations as useful. Rules were valued for their explicitness, whereas examples were viewed as inciting trust. However, in the two explanation groups several participants found it unclear what the highlight of a factor (see Fig. 3) meant. Several participants also mentioned that, although useful, the explanations lacked a causal rationale.

9. Discussion

Below we discuss the results from both experiments in detail and relate them to our theory presented in Section 5.

9.1. Experiment I: System understanding

Experiment I measured the participant's capacity to understand how and when the system provided a specific advice. This construct was operationalized in three measurements: decisive factor identification, advice prediction and perceived system understanding. We hypothesized that participants receiving contrastive rule-based explanations would score best on all three measurements. Contrastive example-based explanations were only expected to improve understanding slightly more than no explanations (see Fig. 2).

The results from our evaluation support these hypotheses in part. First, rule-based explanations indeed seem to allow participants to more accurately identify the factor from a situation that was decisive in the system's advice. However, neither rule-based nor example-based explanations allowed participants to learn to predict system behavior. The rule-based explanations, however, did cause participants to think that they better understood the system compared to example-based and no explanations. The example-based explanations only showed a small and insignificant increase in perceived system understanding. It is important to note that there was no correlation between the self-reported measurement of understanding and the behavioral measurements of understanding. This shows that participants had a perception of understanding that differed from the understanding as measured with factor identification and advice prediction.

Close inspection of the results showed two potential causes for the lack of support for our hypotheses. The first reason might be that the described DMT1 situations and accompanying system advice were too intuitive. This is supported by the fact that participants with no explanation were already quite adept at identifying decisive factors (nearly 70% compared to 33% chance). The second reason we inferred from open discussions with participants after the experiment. Most participants who received either explanation style mentioned difficulty in applying and generalizing the knowledge from the explanations to novel situations. Several participants even expressed the desire to know the rationale of why a certain rule or behavior occurred. This is in line with the theory that explanations should convey specific causal relations obtained from an overall causal model describing the behavior of the system, instead of just factual correlations between system input and output.


If we generalize these results to the field of XAI, we have shown that contrastive rule-based explanations as "if...then..." statements are not sufficient to predict system behavior. However, such explanations are capable of educating a user to identify which factors would play a decisive role in system advice given a specific situation. Also, such explanations seem to provide the user with the perception that (s)he is better capable of understanding the system. The contrastive example-based explanations however showed no improvement on observed or self-reported understanding. This experiment illustrated the need for explanations that provide more causal information, instead of solely information depicting system input and output correlations. Furthermore, we illustrated that self-reported and behavioral measurements of understanding may not correlate, underlining the need for (a combination of) measures that accurately and reliably measure the intended construct.
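To make the two styles concrete: a contrastive rule-based explanation states the decision boundary separating the given advice from its contrast, whereas a contrastive example-based explanation shows a stored situation in which the contrasting advice applied. A minimal sketch of both, with hypothetical factors, thresholds and stored example (these are not the rules of our fictitious system):

```python
# Minimal sketch of the two contrastive explanation styles evaluated here.
# The factor, threshold and stored example are hypothetical.

def rule_based(advice: str, contrast: str) -> str:
    # "If <condition> then <advice>, otherwise <contrast>" over the decisive factor.
    return (f"If blood glucose is above 10 mmol/L, the advice is '{advice}'; "
            f"otherwise it would be '{contrast}'.")

def example_based(advice: str, contrast: str) -> str:
    # Nearest stored situation in which the contrasting advice applied.
    example = {"blood glucose": "8 mmol/L", "planned activity": "none"}
    return (f"The advice is '{advice}'. In a similar situation with "
            f"{example}, the advice was '{contrast}'.")

print(rule_based("take extra insulin", "take no extra insulin"))
print(example_based("take extra insulin", "take no extra insulin"))
```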

9.2. Experiment II: Persuasive power and task performance

In Experiment II we investigated the extent to which an explanation increases the persuasiveness of an advice, as well as the explanation's effect on task performance. The persuasive power of an explanation was operationalized with the number of times the advice was copied. Task performance was represented by the number of correct decisions and the self-reported prediction of advice correctness. We hypothesized that especially contrastive example-based explanations would increase persuasive power, while these in turn would lower actual task performance. In contrast, the understanding participants gained from rule-based explanations was expected to cause an increase in task performance (see Fig. 2).
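Both operationalizations reduce to simple counts over the trial log. A minimal sketch with a hypothetical log format (the field names and records are ours, for illustration only):

```python
# Minimal sketch: computing persuasive power and task performance from a
# trial log. The log format and records below are hypothetical.
trials = [
    {"advice": "A", "decision": "A", "correct_answer": "A"},
    {"advice": "B", "decision": "B", "correct_answer": "C"},  # followed incorrect advice
    {"advice": "C", "decision": "A", "correct_answer": "A"},
]

followed = sum(t["decision"] == t["advice"] for t in trials)         # persuasive power
correct = sum(t["decision"] == t["correct_answer"] for t in trials)  # task performance

print(f"advice followed: {followed}/{len(trials)}, correct: {correct}/{len(trials)}")
```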

Both contrastive rule-based and example-based explanations showed more persuasive power than when no explanation was given. The example-based explanations also showed slightly more persuasive power than the rule-based explanations, but this difference was not significant. These results partly support our theory about persuasive power, as they illustrate that explanations persuade users to follow a system's advice more often. These results, however, do not support that example-based explanations are much more persuasive than rule-based explanations.

With respect to task performance, we saw that explanations caused small but insignificant improvements on both behavioral and self-reported data. In fact, the example-based explanations showed the highest (but still insignificant) improvement. Due to a lack of statistical evidence not much can be inferred from this, and further evaluation is required.

Similar to Experiment I we found a lack of correlation between reports of participants' perception of predicting advice correctness, and the number of correct decisions. In other words, these measures do not seem to measure the same construct. An explanation could be that participants were unable to estimate their own capacity of predicting the correctness of advice.

We have shown that providing an explanation with an advice results in users following that advice more often, even when incorrect. In addition, there was a suggestion that explanations also improve task performance, especially contrastive example-based explanations. However, these effects were marginal and not significant. These results underline the need in the field of XAI to take a different stance on which explanations should be generated. Two common styles of explanations answering a contrastive question did not appear to increase task performance, an effect often attributed to such explanations within the field.

10. Limitations

ThisstudyhasseverallimitationsthatwarrantcautioningeneralizingtheresultstootherusecasesortothefieldofXAI ingeneral.ThefirstsetoflimitationsisrelatedtotheselectedusecaseofaidedDMT1self-management.Thisusecasefalls intothecategory‘simplified’fromDoshi-VelezandKim [4] asitapproximatesarealistic usecase.However,twomajor as-pectsdifferfromthereal-lifesituation.First,werecruitedhealthyparticipantswhohadtoempathizewithaDMT1patient, insteadofactual DMT1patients.Nevertheless,participantswere sampledfromtheentireDutchpopulation,resultingina widevarietyofagesandeducationlevels.Thesechoicesallowedustomeasuretheeffectsoftheexplanationtypeswithout focusingonaspecific demographicorhavingtocompensate forvaryingdomainknowledgeinDMT1participants.Second, the systemitself was fictitiousandfolloweda pre-determined setof rules ratherthan comprisingthe fullcomplexity of arealistic system.Thesetwosimplificationspreventustogeneralizetheresultsandtoapply ourconclusions toconstruct an actualsystemforaiding DMT1patientsinself-management.However, thiswas notthepurposeofthisstudy.Instead, weaimedtoevaluatewhetherthesupposed effectsoftwooftencitedexplanationsstyleswere warranted.Webelievethe selectedusecaseallowedustodoso,asitgavebothcontextaswellasmotivationfortheuserstounderstandexplanations. Also,laymenwerechosenopposedtoDMT1patientstomitigateanydifferenceindiabetesknowledgeandmisconceptions, whichcanvarygreatlybetweenpatients(e.g.see[55]).Ofcourse,futureresearchspecificallytargetedatthedevelopment ofaDSSforDMT1self-managementshouldincludeDMT1patientsasparticipants.

The second set of limitations is related to suspected confounds in the experiment. A brief usability questionnaire showed that participants held an overall positive bias towards the system, whether an explanation was provided or not. In addition, this questionnaire showed that participants' perception of the organization of the information was not always positive. Hence, a potential limitation lies in the way the explanations were presented. Also, surprisingly, in Experiment I participants attributed a low performance to the system, while they had no information to do so. In Experiment II however, participants tended to slightly overestimate the system's actual performance. This occurred independent of the explanation style. This shows that the participants could have had a natural tendency to distrust the system's advice. This may have affected the self-reported results.


Finally, a few limitations arose from the design of both experiments. The results for the example-based explanations could have been different with a longer learning block, as it takes time to infer decision boundaries from examples. Also, both testing blocks were relatively long, which could have caused participants to continue learning about the system while we were measuring their understanding. We did not perform any analyses on this, as it would add another level of complexity to the design. Hence, we cannot say for certain that the learning block was of sufficient length to allow participants to learn enough from the explanations. However, if this was the case, we believe that prolonging the learning block would have resulted in even stronger effects. Lastly, due to the choice of different participant groups for both experiments, we could only draw limited conclusions on the relation between the understanding on the one hand and task performance and persuasiveness on the other hand. However, we selected this approach instead of combining the constructs in a single experiment with a within-subject design, to avoid learning effects not sufficiently compensated through randomizing the understanding and task performance/persuasion blocks.

11. Conclusion

A lack of user evaluations characterizes the field of Explainable Artificial Intelligence (XAI). A contribution of this paper was to provide a set of recommendations for future user evaluations. Practical recommendations were given for XAI researchers unfamiliar with user evaluations. These addressed the evaluation's constructs and their relations, the selection of a use case and the experimental context, and suitable measurements to operationalize the constructs in the evaluation. These recommendations originated from our experience designing an extensive user evaluation. Our second contribution was to evaluate the effects of contrastive rule-based and contrastive example-based explanations on the participant's understanding of system behavior, persuasive power of the system's advice when combined with an explanation, and task performance. The evaluation took place in a decision-support context where users were aided in choosing the appropriate dose of insulin to mitigate the effects of diabetes mellitus type 1.

Results showed that contrastive rule-based explanations allowed participants to correctly identify the situational factor that played a decisive role in a system's advice. Neither example-based nor rule-based explanations enabled participants to correctly predict the system's advice in novel situations, nor did they improve task performance. However, both explanation styles did cause participants to follow the system's advice more often, even when this advice was incorrect. This shows that both rules and examples that answer a contrastive question are not sufficient on their own to improve users' understanding or task performance. We believe that the main reason for this is that these explanations lack a clarification of the underlying rationale of system behavior.

Future work will focus on the evaluation of a combined explanation style provided in interactive form, to assess whether this interactive form helps users to learn a system's underlying rationale. As an extension, potential methods will be researched that can generate causal reasoning traces, rather than decision boundaries, to expose the behavior rationale directly. In addition, future research may focus on similar studies with actual diabetes patients to study explanation effects in potentially homogeneous groups (e.g. effects of age, domain knowledge, etc.). Finally, during the design and analysis of this user evaluation we discovered a need for validated and reliable measurements. We will continue to use different types of measurements to measure constructs in a valid and reliable way in future user evaluations.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We acknowledge the project ERP Explainable Artificial Intelligence (060.38608) and ERP FATE (060.43385) from TNO for funding this research. In addition, we thank the Technical University of Delft and the University of Amsterdam for support and feedback on this research.

References

[1] M.M. De Graaf, B.F. Malle, How people explain action (and autonomous intelligent systems should too), in: 2017 AAAI Fall Symposium Series, 2017.

[2] T. Miller, Explanation in artificial intelligence: insights from the social sciences, Artif. Intell. 267C (2019) 1–38.

[3] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Comput. Surv. 51 (5) (2019) 93.

[4] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608.

[5] S. Anjomshoae, A. Najjar, D. Calvaresi, K. Främling, Explainable agents and robots: results from a systematic literature review, in: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 2019, pp. 1078–1088.

[6] T. Miller, Contrastive explanation: a structural-model approach, arXiv preprint arXiv:1811.03163.

[7] R.R. Hoffman, S.T. Mueller, G. Klein, J. Litman, Metrics for explainable AI: challenges and prospects, arXiv preprint arXiv:1812.04608.

[8] E.J. de Visser, M.M. Peeters, M.F. Jung, S. Kohn, T.H. Shaw, R. Pak, M.A. Neerincx, Towards a theory of longitudinal trust calibration in human–robot teams, Int. J. Soc. Robot. 12 (2) (2020) 459–478.
