Delft University of Technology
Urban pluvial flooding prediction by machine learning approaches – a case study of
Shenzhen city, China
Ke, Qian; Tian, Xin; Bricker, Jeremy; Tian, Zhan; Guan, Guanghua; Cai, Huayang; Huang, Xinxing; Yang,
Honglong ; Liu, Junguo
DOI
10.1016/j.advwatres.2020.103719
Publication date
2020
Document Version
Final published version
Published in
Advances in Water Resources
Citation (APA)
Ke, Q., Tian, X., Bricker, J., Tian, Z., Guan, G., Cai, H., Huang, X., Yang, H., & Liu, J. (2020). Urban pluvial
flooding prediction by machine learning approaches – a case study of Shenzhen city, China. Advances in
Water Resources, 145, [103719]. https://doi.org/10.1016/j.advwatres.2020.103719
Qian Ke a, Xin Tian b,c, Jeremy Bricker a, Zhan Tian d,∗, Guanghua Guan e, Huayang Cai f,g, Xinxing Huang h, Honglong Yang i, Junguo Liu d
a Department of Hydraulic Engineering, Faculty of Civil Engineering and Geosciences, Delft University of Technology, 2628CN Delft, the Netherlands
b Department of Water Management, Faculty of Civil Engineering and Geosciences, Delft University of Technology, 2628CN Delft, the Netherlands
c KWR Water Research Institute, Groningenhaven 7, 3433PE Nieuwegein, the Netherlands
d School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
e State Key Laboratory of Water Resources and Hydropower Engineering Science, Wuhan University, 430072 Wuhan, China
f Institute of Estuarine and Coastal Research/ Guangdong Provincial Engineering Research Center of Coasts, Islands and Reefs, School of Marine Engineering and Technology, Sun Yat-sen University, Guangzhou, China
g Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai)/ State and Local Joint Engineering Laboratory of Estuarine Hydraulic Technology, Guangzhou, China
h Shanghai Institute of Technology, Shanghai, China
i Shenzhen National Climate Observatory of Meteorological Bureau of Shenzhen Municipality, Shenzhen, China
Article info
Keywords: Urban pluvial flooding; Rainfall threshold; Machine learning; Shenzhen city
Abstract
Urban pluvial flooding is a threatening natural hazard in urban areas all over the world, especially in recent years given its increasing frequency of occurrence. In order to prevent flood occurrence and mitigate the subsequent aftermath, urban water managers aim to predict precipitation characteristics, including peak intensity, arrival time and duration, so that they can further warn inhabitants in risky areas and take emergency actions when forecasting a pluvial flood. Previous studies that dealt with the prediction of urban pluvial flooding are mainly based on hydrological or hydraulic models, requiring a large volume of data for simulation accuracy. These methods are computationally expensive. Using a rainfall threshold to predict flooding based on a data-driven approach can decrease the computational complexity to a great extent. In order to prepare cities for frequent pluvial flood events – especially in the future climate – this paper uses a rainfall threshold for classifying flood vs. non-flood events, based on machine learning (ML) approaches, applied to a case study of Shenzhen city in China. In doing so, ML models can determine several rainfall threshold lines projected in a plane spanned by two principal components, which provides a binary result (flood or no flood). Compared to the conventional critical rainfall curve, the proposed models, especially the subspace discriminant analysis, can classify flooding and non-flooding by different combinations of multiple-resolution rainfall intensities, greatly raising the accuracy to 96.5% and lowering the false alert rate to 25%. Compared to the conventional model, the critical indices of accuracy and true positive rate (TPR) were 5%-15% higher in ML models. Such models are applicable to other urban catchments as well. The results are expected to be used to assist early warning systems and provide rational information for contingency and emergency planning.
1. Introduction
1.1. Background
Urban pluvial flooding is a threat to a great number of cities worldwide, especially given its increasing frequency of occurrence in recent years (Martina et al., 2006; Atta-ur-Rahman et al., 2016; Ziegler, 2012). Its impact, including loss of life and damages to both public and private properties, can be further deepened by climate change and accelerated urbanization (Falconer et al., 2009). This type of flooding usually occurs when intense rainfall exceeds the capacity of an urban drainage system. Recent extreme precipitation events have raised awareness from both authorities and citizens to the challenges of predicting and managing urban pluvial floods. In July 2019, heavy rain caused at least 18 deaths and triggered massive transport disruptions in Mumbai, India (BBC, 2019). In the UK, about 40% of damages and associated economic losses in cities are estimated to result from pluvial flooding (Douglas et al., 2010). In China, 98% of cities are exposed or vulnerable to frequent floods (Jiang et al., 2017). A survey, conducted between 2008 and 2010, showed that 218 Chinese cities suffered severe urban pluvial flooding at least once, and more than 100 cities experienced it more than three times (Jiang et al., 2018). Therefore, urban pluvial flood prediction and management is a critical topic in the context of urban water management.
∗ Corresponding author.
E-mail addresses: q.ke@tudelft.nl (Q. Ke), x.tian@tudelft.nl (X. Tian), j.d.bricker@tudelft.nl (J. Bricker), tianz@sustech.edu.cn (Z. Tian), ggh@whu.edu.cn (G. Guan), caihy7@mail.sysu.edu.cn (H. Cai), xxhuang1031@outlook.com (X. Huang), yangl01@163.com (H. Yang), liujg@sustech.edu.cn, j.d.bricker@tudelft.nl (J. Liu).
Received 30 November 2019; Received in revised form 18 July 2020; Accepted 6 August 2020; Available online 7 August 2020
In order to prevent pluvial flooding and its consequences, city authorities (e.g. meteorological offices, emergency agency offices or water authorities) usually need to make predictions of pluvial floods. This is based on good prediction of precipitation characteristics, such as peak intensity, arrival time and duration. Many studies investigate the prediction of pluvial flooding by hydraulic models (Li, 2020; Li and Willems, 2019), by simulating the inundated area and depth given certain historical or predicted rainfall scenarios (Babaei et al., 2018; Thorndahl et al., 2016; Xing et al., 2019). However, hydraulic models need a large volume of data and computational resources. As a result, the output of a hydraulic model is usually case-specific. In other words, we have to run the model to make predictions for flooding during each separate rainfall scenario. On the other hand, using a rainfall threshold based on data-driven models can provide an easy and intuitive solution. By comparing the current/predicted rainfall with the threshold, one can straightforwardly estimate the likelihood of the city being flooded (Martina et al., 2006; Montesarchio et al., 2011; Tian et al., 2019; Yang et al., 2016). Specifically, a rainfall threshold specifies one or several rainfall depth(s) over certain time windows, above which a pluvial flood is likely to occur. Moreover, rainfall-threshold-based hazard prediction is widely applied to landslides (Garcia-urquia and Axelsson, 2015; Giannecchini et al., 2012; Golian et al., 2015; Hong et al., 2018; Martelloni et al., 2012), debris flow (Nikolopoulos et al., 2014; Pan et al., 2018; van Asch et al., 2014) and flash floods (Montesarchio et al., 2011; Norbiato et al., 2008; Zhai et al., 2018). To determine a cumulative rainfall threshold, a physically-based model is usually needed to compute critical rainfall thresholds over time (Norbiato et al., 2008; Yang et al., 2016), or a statistical, data-driven analysis can be applied (Carpenter et al., 1999; Golian et al., 2010; Martina et al., 2006; Montesarchio et al., 2011). However, on the one hand, there is a gap in short-term prediction capability of physical models (Costabile and Macchione, 2015). Short lead time flood prediction is of crucial importance for highly urbanized areas in order to provide timely warnings to residents (Zhang et al., 2018); on the other hand, statistical models have a limitation on the accuracy of prediction (Fawcett and Stone, 2010). Furthermore, urban catchments often lack sufficient data on both the drainage network and topography, complicating the estimate of rainfall threshold (Yang et al., 2016). Machine learning (ML) models can deal with data scarcity based on an ensemble method (Breiman, 2001). Therefore, in this paper, we use ML approaches to derive the flooding thresholds for different rainfall duration periods.
ML is a family of algorithms derived from statistics and computer science, which aims to train mathematical models to make predictions or decisions based on observed samples. ML is suggested as an effective tool to explore the connectedness between human and water systems (Shen et al., 2018). The latter is anticipated to be a key interdisciplinary issue to deal with in future hydrological studies (Vogel et al., 2015). Moreover, ML models can numerically reproduce flood nonlinearity, solely based on historical data, without requiring knowledge about the underlying physical processes (Mosavi et al., 2018). Therefore, this study utilizes ML algorithms to attempt to classify the presence or absence of flooding based on rainfall characteristics. Although ML algorithms have shown powerful applicability to flood prediction and forecasting (Liu et al., 2017; Mosavi et al., 2018; Noymanee et al., 2017; Tayfur et al., 2018), there are still very few studies that utilize ML to classify or predict urban pluvial flooding, which is a challenge due to lack of flood inundation data, drainage system data, and fine resolution topography data (Yang et al., 2016). Therefore, we aim to test ML algorithms for classifying urban flooding in the city of Shenzhen, which is frequently flooded. A sudden rainstorm event claimed 11 lives in April 2019 in Shenzhen (Hua, 2019), attracting great attention for the local authorities to reconsider the early warning system for pluvial flooding in the city. Moreover, Shenzhen is a pioneer city in terms of high-technology development, socio-cultural development and disaster emergency management. This experience can be shared with other cities in China and abroad.
The paper is organized as follows. Section 2 describes the study area and data used for this study, and introduces the conventional and ML methods for flood prediction. Section 3 shows the results of the models and proposes the rainfall threshold for Shenzhen. Section 4 compares the ML results for rainfall thresholds to the current rainfall threshold and cumulative rainfall threshold in Shenzhen. Section 5 presents the conclusions and recommendations.
2. Materials and methodology
2.1. Study area
In the past decades, Shenzhen has grown rapidly from a rural area to a prosperous economic zone and an important industrial city in Southern China. It is located on the central coast of Guangdong Province, which is the passageway from mainland China to Hong Kong (see Fig. 1). It is also an important city in the Pearl River Delta (PRD). It has a total land area of 1,948 km². The average elevation is 3-4 m above mean sea level. Rainstorm-induced catastrophes in Shenzhen city are mostly caused by persistent short-duration heavy rainfall in the summer (Zhou et al., 2017). Pluvial flooding is one of the primary natural hazards in Shenzhen. In recent years, urbanization has increased the surface runoff and intensified the flood frequency (Shi et al., 2007; Yan et al., 2019). Shenzhen is identified as an area under a high flood risk, since many properties are built in flood-prone areas, such as the harbour-front area (Chan et al., 2014). The total average annual precipitation is ~1,900 mm/y, of which rainstorms caused by typhoons (July to September) make up 36% (i.e., 689 mm/y), and approximately 85% of precipitation occurs from April to September (see Fig. 2) (Data source: Meteorological Bureau of Shenzhen (SMB)). Convective (March to June) and typhoon rainstorms (July to October) are the two main rainfall sources in this region.
As of 2019, Shenzhen has a population of 13 million, with a population density of 6,234 people/km². Most of the city is drained by a separated storm sewer system (4,883.92 km) whereas the remaining area (1,693 km) is drained by a combined sewer system (i.e. wastewater combined with rainwater sewer system), with a drainage pipe density of 12.5 km/km² (SSB, 2019). In total, 126 municipal pumps with a capacity of 671 m³/s are used to drain stormwater out of the city (SSB, 2019).
Short duration, high intensity rainfall is the main driver of pluvial flooding in Shenzhen. Due to the rapid pace of urbanization, the impervious area has significantly increased while the water storage area, such as rivers, lakes and wetlands, has decreased. With climate change (increasing frequency of typhoon occurrence and intensity of torrential rainfall) (Tracy et al., 2007), pluvial flooding has a high likelihood of occurrence in the paved area. On May 11, 2014, for instance, the daily rainfall volume reached 233 mm, and some districts experienced a peak rainfall intensity of 310 mm in 6 hours (Cai, 2014). Currently, SMB uses a rainfall threshold for predicting urban pluvial flooding that is based only on the 30-min rainfall depth (i.e., 20 mm) or the 3-h rainfall depth (i.e., 80 mm) (SMB, 2019). In the subsequent sections, we test this threshold and compare it with that from the proposed ML models.
2.2. Records of flood events
Records of historical flood events from 1 June 2014 to 14 June 2017, consisting of 1,110 days and 663 records in total, were retrieved from the water sector of Shenzhen municipality (http://swj.sz.gov.cn/), which has developed and implemented a disaster reporting system (i.e. a flood report app named 'shenzhensanfang') since 2014. Citizens of Shenzhen can report flood events via this system at any time. These records register the date, the location (geotagging), and a description. As most of the records indicating pluvial flood events fall in the period between June and September (640 records, i.e., 96.5%), we only consider data points in the summer of each year, namely, 413 days in total over the 3-year study period. In doing so, we can exclude hundreds of non-flooding events to lower the imbalance of the dataset (too many non-flooding events and too few flooding events). Note that the high frequency of the flooding record corresponds to the precipitation characteristics in Shenzhen. The 640 records were registered over 24 days (ca. 27 records/d), which are regarded as days with floods. The remaining 389 days of the study period are regarded as days without floods. These records are spatially distributed throughout the whole city (see Fig. 3). It should be noted that as the inundation records were submitted by citizens, socio-economic background (such as age, education level and experience with previous pluvial flooding) may affect the recording. This may cause false alerts or missed alerts.
Fig. 1. Location and land use maps of the city of Shenzhen.
Fig. 2. Monthly average rainfall volumes for the city of Shenzhen, based on historical data from 1970 to 2015 (Data source: SMB).
Fig. 3. Spatial distribution of 640 inundation records between 2014 and 2017 and 25 rainfall gauges in Shenzhen, China.
2.3. Rainfall observations
The rainfall intensity each minute at 25 rainfall gauges (see Fig. 3) from 1 June 2014 to 14 June 2017 was retrieved from SMB (http://weather.sz.gov.cn/). We used areal average rainfall intensity to represent the study area, which stands for the mean value of rainfall intensities of all study sub-areas (districts). The original database consisted of 1-min rainfall intensity. These 1-min rainfall intensities were aggregated to rainfall volumes of longer temporal scale, namely, 5, 10, 15, 30, 60, 120, 360, 720, and 1440 min. Each day, the maximum rainfall volume at each temporal scale, denoted as $R_{dx}$ (in mm), is calculated by Eq. (1) (Tian et al., 2019).
$$R_{dx} = \max_{j} \left\{ \sum_{k=1}^{x} R_1(k),\ \ldots,\ \sum_{k=1+jx}^{(j+1)x} R_1(k),\ \ldots,\ \sum_{k=1441-x}^{1440} R_1(k) \right\} \tag{1}$$
where $j = 0, 1, \ldots, 1440/x - 1$ and $x$ = 1, 5, 10, 15, 30, 60, 120, 360, 720, 1440 min. Note that each item in the braces of Eq. (1) stands for the x-min rainfall volume accumulated from the 1-min rainfall intensity $R_1(k)$ in the interval $[1 + jx,\ (j+1)x]$.
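As an illustration of Eq. (1), the following minimal Python sketch computes the daily maximum x-min rainfall volume over consecutive, non-overlapping blocks of a 1440-value series. The function name `daily_max_accumulation` and the example day are hypothetical and are not taken from the study.

```python
def daily_max_accumulation(r1, x):
    """Maximum x-minute rainfall volume (mm) over non-overlapping blocks of one day.

    r1: list of 1440 one-minute rainfall volumes (mm) for a single day.
    x:  block length in minutes; assumed to divide 1440.
    """
    if len(r1) != 1440 or 1440 % x != 0:
        raise ValueError("expected 1440 one-minute values and x dividing 1440")
    # Block j covers minutes 1 + j*x .. (j + 1)*x (1-based), as in Eq. (1).
    return max(sum(r1[j * x:(j + 1) * x]) for j in range(1440 // x))


SCALES = (1, 5, 10, 15, 30, 60, 120, 360, 720, 1440)

# Hypothetical day: dry except for a 20-minute burst of 1.5 mm per minute.
day = [0.0] * 1440
for k in range(600, 620):
    day[k] = 1.5

# One value per temporal scale, i.e. the ten features used later in the paper.
features = {x: daily_max_accumulation(day, x) for x in SCALES}
```

Because the blocks do not overlap, a burst that straddles a block boundary is split between two blocks; a sliding window would be a different design and is not what Eq. (1) describes.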
2.4. Flood classification models
In this study, we first apply a conventional rainfall curve method as a benchmark. Then we further develop multiple parametric and non-parametric ML models to classify flooding and non-flooding events based on rainfall intensities. With respect to a binary classification problem, four possible predicted outcomes are expected (see Table 1), namely, true positives (TP, correctly classified flooding events), false positives (FP, non-flooding events falsely classified as flooding), true negatives (TN, correctly classified non-flooding events), and false negatives (FN, missed flooding events). Ideally, an urban flood classification model should achieve a high true positive rate (TPR), a high true negative rate (TNR) and high overall accuracy (ACC). On the other hand, a low positive predictive rate (PPR) implies that many predicted floods are false alerts, and a low TPR implies that many actual flood events are missed. ACC is also called the proportion of correct forecasts (Wilks, 2005).
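The five rates defined in Table 1 follow directly from the four confusion-matrix counts. The sketch below is a minimal Python illustration; the function name and the example counts are hypothetical (only the totals of 24 flooding and 389 non-flooding days come from the text).

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the five rates of Table 1 from confusion-matrix counts."""
    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),   # share of actual floods that are detected
        "TNR": ratio(tn, fp + tn),   # share of non-flood days correctly cleared
        "PPR": ratio(tp, tp + fp),   # share of flood alerts that are real floods
        "NPR": ratio(tn, tn + fn),   # share of all-clear predictions that are right
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
    }


# Invented split of 24 flooding days and 389 non-flooding days.
m = classification_metrics(tp=18, fp=6, fn=6, tn=383)
```

With a dataset this imbalanced, ACC is dominated by the non-flooding days, which is why TPR is reported alongside it.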
2.5. Conventional model with cumulative rainfall volume thresholds
The cumulative rainfall volume threshold is a reference curve, representing a cumulative amount of rainfall over a certain time window (see Fig. 4). When the observed cumulative rainfall exceeds the threshold at a given moment, flooding is expected to occur.
Table 1
Confusion matrix for quantifying the performance of a classification model.
| | True class: Flooding | True class: Non-flooding | |
| --- | --- | --- | --- |
| Predicted class: Flooding | True positives (TP) | False positives (FP) | Positive predictive rate (PPR) = TP/(TP + FP) |
| Predicted class: Non-flooding | False negatives (FN) | True negatives (TN) | Negative predictive rate (NPR) = TN/(TN + FN) |
| | True positive rate (TPR) = TP/(TP + FN) | True negative rate (TNR) = TN/(FP + TN) | Accuracy (ACC) = (TN + TP)/(TP + TN + FN + FP) |
Fig. 4. An example of a cumulative rainfall threshold curve (when actual rainfall exceeds the rainfall threshold, pluvial flooding is expected to occur), modified from Martina et al. (2006).
We propose a way to determine a threshold curve via the following steps:
1) Calculating the cumulative rainfall (max. in 24 hours), based on the 1-min rainfall intensity, for all flooding and non-flooding events.
2) Computing the lower α percentile of the 1-min rainfall for all flooding events, denoted as Tα. Note α is to be determined in step (5). In doing so, Tα depicts a curve that a certain number of cumulative rainfall curves for flooding events stay above. For instance, all curves of flooding events are above the curve Tα|α=0.
3) Computing the upper β percentile of the 1-min rainfall for all non-flooding events, denoted as Tβ. Note β is also to be determined in step (5). The definition of Tβ is analogous to that of Tα: Tβ depicts a curve that a certain number of cumulative rainfall curves for non-flooding events stay below.
4) Constituting a linear combination of Tα and Tβ, based on a weight μ, namely, T(α, β, μ) = μ∗Tα + (1 − μ)∗Tβ. As a result, we obtain a rainfall threshold based on three variables: α, β, and μ. Any assigned values result in a given cumulative rainfall threshold curve and its corresponding model performance.
5) Solving an optimization problem that maximizes the model performance by tuning α, β, and μ. Three optimal combinations for α, β, and μ were pursued, aiming for the maximum TPR, the highest TNR, and the highest ACC:
$$\max_{\alpha,\beta,\mu}\ \text{Performance}\big(T(\alpha,\beta,\mu)\big), \qquad \text{Performance} \in \{\text{TPR},\ \text{TNR},\ \text{ACC}\} \tag{2}$$
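The five steps can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: percentiles are taken per time step with linear interpolation, an event counts as a predicted flood if its curve exceeds the threshold at any step, the search grid is deliberately coarse, and all curves are invented. The helper names (`quantile`, `threshold_curve`, `exceeds`, `score`, `best_parameters`) are hypothetical.

```python
from itertools import product


def quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a non-empty list."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def threshold_curve(flood, dry, alpha, beta, mu):
    """T(alpha, beta, mu) = mu * T_alpha + (1 - mu) * T_beta, per time step."""
    steps = range(len(flood[0]))
    t_alpha = [quantile([c[t] for c in flood], alpha) for t in steps]
    t_beta = [quantile([c[t] for c in dry], beta) for t in steps]
    return [mu * a + (1 - mu) * b for a, b in zip(t_alpha, t_beta)]


def exceeds(curve, threshold):
    """Assumed rule: flooding is predicted if the curve exceeds the threshold at any step."""
    return any(c > t for c, t in zip(curve, threshold))


def score(flood, dry, threshold):
    tp = sum(exceeds(c, threshold) for c in flood)
    tn = sum(not exceeds(c, threshold) for c in dry)
    return {"TPR": tp / len(flood), "TNR": tn / len(dry),
            "ACC": (tp + tn) / (len(flood) + len(dry))}


def best_parameters(flood, dry, metric, grid):
    """Exhaustive search over (alpha, beta, mu) for the highest value of `metric`."""
    return max(product(grid, repeat=3),
               key=lambda p: score(flood, dry, threshold_curve(flood, dry, *p))[metric])


# Invented cumulative rainfall curves (mm) at four time steps, for illustration only.
flood = [[10, 30, 60, 90], [5, 25, 70, 120], [8, 40, 80, 100]]
dry = [[1, 3, 5, 8], [2, 6, 10, 15], [4, 12, 30, 45], [0, 1, 2, 2]]
grid = [0.0, 0.5, 1.0]
alpha, beta, mu = best_parameters(flood, dry, "ACC", grid)
result = score(flood, dry, threshold_curve(flood, dry, alpha, beta, mu))
```

Calling `best_parameters` once per metric ("TPR", "TNR", "ACC") mirrors the three optimization targets of Eq. (2). When several parameter sets tie, `max` returns the first one in grid order, so ties are broken arbitrarily in this sketch.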
2.5.1. Machine learning (ML) algorithms
Machine learning (ML) algorithms are a collection of computational data-driven methods. Rather than starting from a pre-defined equation as the basic model, an ML algorithm trains a model fully on known data, and the trained model can then be applied to new data. As the amount of training data increases, the performance of ML algorithms can improve. ML consists of two families, namely, supervised learning and unsupervised learning.
Specifically, supervised learning algorithms aim to find functions that are able to map inputs to labeled outputs, and comprise two categories, classification and regression. Flooding prediction is commonly an application of classification (Jhong et al., 2018; Tayfur et al., 2018; Zhou et al., 2018), which aims to distinguish flood events vs. no-flood events based on hydrological variables, i.e., a binary classification problem.
Given the size of the database available, we adopt a collection of models in this study that usually show good performance for small- to medium-sized datasets. Fourteen classification algorithms from 5 major ML families are considered to classify urban pluvial flooding based on rainfall intensities of multiple temporal scales (Table 2). Brief introductions of these algorithms follow:
• Decision trees: Decision trees build a tree-shaped top-down structure from the root (at the top) to leaf nodes (at the bottom) (Breiman et al., 2017). Each leaf node represents a predicted response. Given the fact that we focus on a binary classification problem, the bifurcation starts from one parent node of a given layer to two child nodes of a subsequent lower layer, relying on different values of variables. Specifically, to find the optimal bifurcation, we optimize each split with respect to Gini's diversity index and stop splitting when (i) a node only contains a single class of data, (ii) a child node to be generated contains fewer than five data points, or (iii) the number of layers exceeds a pre-defined criterion (five for a coarse decision tree and twenty for a medium decision tree). In general, decision tree learning is one of the fastest algorithms. Its results are also easy to interpret. We built the decision tree model in Matlab by using the function fitctree.
• Discriminant analysis: Discriminant analysis (DA) classifiers assume a Gaussian distribution for the data of each class. The Gaussian distribution is determined by the sample mean of each class and either an identical covariance matrix for linear DA or different class-based covariance matrices for quadratic DA. Under this assumption, linear or quadratic DAs make predictions by minimizing prediction costs based on Bayes' theorem. Note that the prediction cost is the sum, over classes k, of the posterior probability of class k for a data sample multiplied by the cost of classifying the sample as class y when its actual class is k (0 for correct classification and 1 for misclassification). Readers can refer to (Ledoit and Wolf, 2004; T. Hastie, R. Tibshirani, 2008) for more details. Note that this study considers both linear and quadratic discriminant analyses. As their names suggest, linear discriminant analysis can only learn linear boundaries, while quadratic discriminant analysis can learn quadratic boundaries; both are fast to run and easy to interpret. We build the discriminant analysis model in Matlab by using the function fitcdiscr.
• Support vector machine: A linear support vector machine (SVM) applied to binary classification aims to find an optimal hyperplane that separates two classes with a margin of the maximal width. In other words, we look for the maximum margin width while keeping the data of the two classes on each side of the margin. Samples that are misclassified are penalized. Using kernel functions, such as quadratic and cubic kernels, can turn a linear SVM into a non-linear SVM. The latter is more flexible but also requires more computational resources and becomes less straightforward to explain. Readers can refer to (Ng, 2000) for more details. We build the support vector machine model in Matlab using the function fitcsvm.
• K-nearest neighbor: K-nearest neighbor (KNN) is a distance-based learning technique that determines the predicted response of a given point by checking the majority class of the k closest points (Cover and Hart, 1967). Note that we use the Euclidean and cosine distances as the metrics to measure the closeness between points. The KNN algorithm is one of the easiest and most intuitive learning techniques, widely used in many applications (Cheng et al., 2014; Zhang, 2016). However, it is also very sensitive to outliers (Ramaswamy et al., 2000), which we may encounter frequently when predicting urban flooding based on rainfall intensities. We build the KNN model in Matlab using the function fitcknn.
• Bagged trees (Breiman et al., 1984; Breiman, 2001): Bagging stands for a type of ensemble learning, which is used to reduce the variance of a single decision tree. To build a bagged tree model, we create multiple subsets of new data from the original samples, which are chosen randomly with replacement. As a result, we obtain an ensemble of decision trees, also referred to as weak learners, which is proven to be more robust than a single decision tree. We build the bagged trees model in Matlab using the function fitcensemble.
• Subspace ensembles: The random subspace method is also an ensemble technique, used here to increase the accuracy of the discriminant classifier and the KNN classifier. A subspace ensemble trains each classifier on a random subset of the features, rather than on the entire feature set. It is proven to be an effective method to deal with the issue of high-dimensional feature sets and small training sets. As the name suggests, classifiers are constructed in random subspaces of the data feature space and then combined by simple majority voting. Readers can refer to (Tin Kam Ho, 1998) for details. (García-Pedrajas and Ortiz-Boyer, 2009; Skurichina and Duin, 2002) also prove that the random subspace method can be further used for DA and KNN, which are applied in our study. We build the subspace-DA and subspace-KNN models in Matlab using the function fitcensemble.
Table 2
14 machine learning models applied in this study.
| No. | Algorithm categories | Models | Settings |
| --- | --- | --- | --- |
| 1 | Decision trees | Coarse tree | Maximum number of layers = 5 |
| 2 | | Medium tree | Maximum number of layers = 20 |
| 3 | Discriminant analysis | Linear discriminant | Based on a linear classifier |
| 4 | | Quadratic discriminant | Based on a quadratic classifier |
| 5 | Support vector machine (SVM) | Linear SVM | Based on a linear kernel |
| 6 | | Quadratic SVM | Based on a quadratic kernel |
| 7 | | Cubic SVM | Based on a cubic kernel |
| 8 | K nearest neighbor (KNN) | Fine KNN | Based on Euclidean distance, number of NN = 1 |
| 9 | | Medium KNN | Based on Euclidean distance, number of NN = 10 |
| 10 | | Cos KNN | Based on cosine distance, number of NN = 100 |
| 11 | | Cubic KNN | Based on Euclidean distance, number of NN = 10 |
| 12 | Ensemble algorithms | Bagged trees | Number of learners = 50; learning rate = 0.1 |
| 13 | | Subspace discriminant | Number of learners = 30; subspace dimension = (number of features)/2 |
| 14 | | Subspace KNN | Number of learners = 30; subspace dimension = (number of features)/2 |
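To make the KNN and random subspace ideas concrete, here is a minimal pure-Python sketch of a subspace KNN ensemble. It is an illustration only and does not reproduce Matlab's fitcknn or fitcensemble: the function names, the choice k = 3, the fixed random seed and the training samples are assumptions, while the 30 learners and the subspace dimension of half the features follow Table 2.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, point, k, dims):
    """Majority class among the k nearest training points, using only features in `dims`."""
    def project(v):
        return [v[d] for d in dims]

    target = project(point)
    ranked = sorted(range(len(train_x)),
                    key=lambda i: euclidean(project(train_x[i]), target))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def subspace_knn_predict(train_x, train_y, point, k=3, n_learners=30, seed=0):
    """Random subspace ensemble: each learner sees half of the features; majority vote."""
    rng = random.Random(seed)
    n_features = len(train_x[0])
    subspace_dim = max(1, n_features // 2)
    votes = Counter()
    for _ in range(n_learners):
        dims = rng.sample(range(n_features), subspace_dim)
        votes[knn_predict(train_x, train_y, point, k, dims)] += 1
    return votes.most_common(1)[0][0]


# Invented 4-feature samples (rainfall volumes in mm); label 1 = flooding, 0 = non-flooding.
train_x = [[2, 5, 8, 10], [1, 3, 4, 6], [3, 6, 9, 12], [0, 1, 2, 3],
           [15, 40, 70, 110], [12, 35, 80, 130], [18, 50, 90, 120]]
train_y = [0, 0, 0, 0, 1, 1, 1]
prediction_wet = subspace_knn_predict(train_x, train_y, [14, 38, 75, 115])
prediction_dry = subspace_knn_predict(train_x, train_y, [1, 4, 6, 9])
```

Because each learner only sees part of the feature set, a single extreme value in one feature influences fewer votes, which is one intuition for why subspace ensembles can help with small, high-dimensional training sets.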
2.5.2. Feature selection and model validation
All the models listed in Table 2 are first tested on ten features, which are the ten multi-temporal rainfall accumulations $R_{dx}$ (see Eq. (1)). Later, we also run a principal component analysis (PCA), based on the singular value decomposition method (Madsen et al., 2004), to reduce the number of dimensions and find the most meaningful components for predicting flooding events.
As we only have a small dataset with 413 data points, it is difficult to divide the whole dataset into several subsets for building, calibrating and validating the model. Instead, we use the 10-fold cross-validation technique (Bengio and Grandvalet, 2004) to deal with this issue. We randomly partition the dataset into 10 subsets of approximately equal size. Each subset is held out once for validation while the model is trained on the other nine, and we compute the mean value of the model performance over the ten held-out subsets. If the 10-fold cross-validation error is close to the error using the entire dataset, it means the model built from the entire dataset is unlikely to be over-fitted. In doing so, we are able to examine the performances of all models. In the subsequent section, the accuracy of the model indicates the mean value of the accuracies of the 10 models based on all data subsets.
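The cross-validation loop can be sketched as follows. This is a generic Python illustration under stated assumptions: the midpoint classifier is a placeholder and not one of the fourteen models, the rainfall values are invented, and the names `k_fold_indices`, `cross_validate`, `fit_midpoint` and `predict_midpoint` are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(x, y, fit, predict, k=10, seed=0):
    """Mean held-out accuracy over k folds."""
    accuracies = []
    for fold in k_fold_indices(len(x), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, x[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


# Placeholder learner (an assumption, not one of the paper's models):
# a single cut-off halfway between the class means of a one-feature input.
def fit_midpoint(xs, ys):
    wet = [v for v, label in zip(xs, ys) if label == 1]
    dry = [v for v, label in zip(xs, ys) if label == 0]
    return (sum(wet) / len(wet) + sum(dry) / len(dry)) / 2


def predict_midpoint(cutoff, v):
    return 1 if v > cutoff else 0


# Invented daily rainfall (mm): 40 non-flooding days and 10 flooding days.
x = [float(v) for v in range(1, 41)] + [float(v) for v in range(100, 110)]
y = [0] * 40 + [1] * 10
mean_accuracy = cross_validate(x, y, fit_midpoint, predict_midpoint)

# With 413 data points, ten folds cannot be exactly equal in size.
folds = k_fold_indices(413, 10)
```

For the 413-day dataset the ten folds contain 41 or 42 days each. With only 24 flooding days, a stratified split that keeps flooding days spread over the folds may be worth considering, although the text does not state whether stratification was used.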
3. Results
3.1. Conventional model
The conventional method is based on a linear combination of the lower percentile of the cumulative rainfall volumes of flooding events and the upper percentile of the cumulative rainfall volumes of non-flooding events. Fig. 5-(a) shows all the cumulative rainfall curves of wet days (daily rainfall depth >= 0.1 mm) from 1 June 2014 to 14 June 2017. More than 60% of flooding events occur with intensive rainfall of short temporal scale, e.g., 60 min to 360 min, but also with larger accumulation (blue dashed lines). More than 90% of non-flooding events have small rainfall volumes, for instance, daily accumulation being less than 20 mm. However, there are also exceptions, where events with large rainfall volumes were reported as non-flooding and vice versa.
We conducted an exhaustive search for all possible values of α, β, and μ between 0 and 1 and derived 112 Pareto optimal threshold curves, shown in Fig. 5-(b). Four representative rainfall threshold curves are selected, each of which has the highest rate for at least one of the five model quality metrics. Threshold 1 has the highest values in terms of TNR (0.98), PPR (0.73) and ACC (0.91), but also the lowest value of TPR (0.46); Threshold 2 has the highest ACC (0.91) but medium TPR (0.5); Threshold 3 has the highest NPR (0.98) and Threshold 4 has the highest TPR (0.96). These are presented in Fig. 5-(c) and Table 3. We can see that thresholds 1 and 2 ensure more non-flooding events are correctly classified, but also miss many flooding events. Thresholds 3 and 4 are more inclined to correctly classify flooding events, which implies that many non-flooding events can be labeled as flooding events based on these two thresholds. Threshold curves #1 and #2 are based on the lower 0-percentile (α = 0) of the rainfall depth for all the flooding events and the upper 100-percentile (β = 1) of the rainfall depth for all the non-flooding events. These curves use a coefficient of 0 and 0.2, respectively, to make the linear combination. Both curves have a low TPR, meaning many actual flooding events are missed, and a high TNR, meaning non-flooding is well captured. ACC is thus relatively high, at 0.91. Curves #3 and #4 have the highest NPR and TPR, respectively, but very low ACC. This is because the threshold is low in Fig. 5-(c), ensuring flooding events are correctly classified, but misclassifying many non-flooding events. In general, it is difficult to find a threshold curve that can robustly indicate both flooding and non-flooding events, based on only the cumulative rainfall depth. Therefore, we need other variables, rather than only the cumulative rainfall depth, to make a better classification.
Fig. 5. (a) Cumulative rainfall volumes over time for all events on wet days; (b) performance indicators for all the candidate thresholds with the four selected ones highlighted; (c) four selected rainfall thresholds.
Table 3
Four cumulative rainfall threshold curves for the conventional method, each targeting a different performance metric.
| Threshold | α | β | μ | TPR | NPR | TNR | PPR | ACC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 0 | 0.46 | 0.93 | 0.98 | 0.73 | 0.91 |
| 2 | 0 | 1 | 0.2 | 0.5 | 0.93 | 0.97 | 0.71 | 0.91 |
| 3 | 0.05 | 0.85 | 0.6 | 0.92 | 0.98 | 0.49 | 0.2 | 0.54 |
| 4 | 0.08 | 0.90 | 0.8 | 0.96 | 0.97 | 0.18 | 0.15 | 0.27 |
| Range | [0, 0.2] | [0.8, 1] | [0, 1] | [0.46, 0.96] | [0.86, 0.98] | [0.04, 0.98] | [0.12, 0.73] | [0.15, 0.91] |
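Selecting Pareto optimal candidates from an exhaustive search amounts to discarding every candidate that another candidate beats or equals on all metrics. The sketch below is a minimal Python illustration that uses only two of the five metrics for brevity; the function name `pareto_front` and the fifth data point are hypothetical, while the first four (TPR, TNR) pairs are taken from Table 3.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate dominates (higher is better for every metric)."""
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# (TPR, TNR) pairs of the four thresholds in Table 3, plus one invented dominated point.
points = [(0.46, 0.98), (0.50, 0.97), (0.92, 0.49), (0.96, 0.18), (0.45, 0.90)]
front = pareto_front(points)
```

The four thresholds of Table 3 all survive the filter because each trades TPR against TNR, whereas the invented point is worse than Threshold 1 on both metrics and is dropped.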
3.2. Machine learning (ML)
3.2.1. Prediction results with 10 features
The first collection of ML models was trained based on ten rainfall volumes at 1, 5, 10, 15, 30, 60, 120, 360, 720, and 1440-min temporal resolutions, with definitions given in Eq. (1). All of the ML models have an ACC between 0.94 and 0.96 (Fig. 6), except for one model with an ACC of 0.92. This implies that only 16 to 25 events, out of 413 events, were misclassified in thirteen of the ML models used. This shows a slightly better performance than that of the conventional model. On the other hand, the TPR has a larger variation, ranging from 0.29 to 0.75. In other words, the miss rate ranges from 0.25 to 0.71. Among all the fourteen models, the DA family shows the most satisfactory performance. Specifically, the Quadratic DA (Model 4) has the highest TPR (0.75), implying that 18 out of 24 actual flooding events can be well predicted, while the ensemble DA (Model 13) has the highest ACC of 0.96 (see Fig. 6). All the performance metrics are listed in Appendix X1.
Although each ML model is easy to run with the complete set of all ten features, the result cannot be visualized in a ten-dimensional space, resulting in difficulty interpreting results. Therefore, we need to further reduce the number of dimensions to three or even fewer, as shown in the subsequent section.
Fig. 6. True positive rates (TPR) and model accuracies (ACC) of 14 trained ML models based on rainfall accumulations at 10 temporal resolutions. Models 4 and 13 are marked in red as they have the best performance in terms of TPR and ACC, respectively. Model numbers correspond to Table 2.
3.2.2. Prediction results with 2 features
The second collection of ML models was trained by using two principal components, which were derived from ten rainfall accumulations by running a principal component analysis. The new features are linear combinations of the ten daily peak rainfall intensities at different temporal resolutions, with a set of coefficients given in Table 4. The first feature is a weighted sum with larger temporal scales receiving more weight, explaining 97.5% of the total variance, while the second feature has more weight at time scales between 30 and 120 min, explaining 2% of the total variance. Therefore, using these features can explain 99.5% of the variance of the original dataset when classifying the labeled events. In principle, one can easily compute the values of the two features for present or future events based on the combination of rainfall accumulation volumes. If a study area has a coarser temporal resolution of rainfall measurement than that used here, principal component analysis can be run based on historical data of coarser resolution to generate two new sets of weights.
Table 4
Coefficients of ten temporal resolutions of rainfall (Eq. (1)) for two principal components.
| Coefficients | Rd1 | Rd5 | Rd10 | Rd15 | Rd30 | Rd60 | Rd120 | Rd360 | Rd720 | Rd1440 | Variance explained |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Feature 1 | 0.02 | 0.05 | 0.08 | 0.11 | 0.15 | 0.21 | 0.28 | 0.43 | 0.53 | 0.61 | 97.5% |
| Feature 2 | 0.05 | 0.16 | 0.25 | 0.31 | 0.40 | 0.44 | 0.42 | 0.15 | -0.23 | -0.46 | 2% |
Overall, the models using two features perform slightly worse than the ten-feature models presented in Section 3.2.1. The ACC only drops by 0.01 to 0.02 for some models such as the medium decision trees (Model 2) and the fine/subspace KNN (Models 8 and 14), while other models (Models 3, 4, 5, 7 and 13) do not see reduced ACC. In terms of the TPR, fewer models reach 0.5 or higher, compared to the models in Section 3.2.1. However, as seen from Fig. 7, the subspace DA (Model 13) is still one of the best performing models. Two linear models, namely, the linear DA (Model 3) and the linear SVM (Model 5), also show a Pareto optimal performance in terms of ACC and TPR. It should be noted that Pareto optimality is a situation in which no individual or preference criterion can be made better off without making at least one other individual or preference criterion worse off. Models 3, 5, and 13 are adopted for further discussion because they have the best performance regarding either TPR or ACC. As shown in Fig. 7, these three models (in red) perform better than other models (in blue) for both performance indicators. The performance metrics of other models are listed in Appendix X2.
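Computing the two features for a new event is a pair of weighted sums using the coefficients in Table 4. The Python sketch below shows this under one explicit assumption that the text does not confirm: the weights are applied to the raw accumulations without mean-centering, whereas a standard PCA projection would first subtract the training means. The function name and the example event are hypothetical.

```python
SCALES = (1, 5, 10, 15, 30, 60, 120, 360, 720, 1440)
# Coefficients copied from Table 4, in the order of SCALES.
FEATURE_1 = (0.02, 0.05, 0.08, 0.11, 0.15, 0.21, 0.28, 0.43, 0.53, 0.61)
FEATURE_2 = (0.05, 0.16, 0.25, 0.31, 0.40, 0.44, 0.42, 0.15, -0.23, -0.46)


def principal_features(rd):
    """Weighted sums of the ten accumulations (mm), given in the order of SCALES.

    Assumption: weights are applied to raw accumulations without mean-centering.
    """
    if len(rd) != len(SCALES):
        raise ValueError("expected one accumulation per temporal scale")
    f1 = sum(w * r for w, r in zip(FEATURE_1, rd))
    f2 = sum(w * r for w, r in zip(FEATURE_2, rd))
    return f1, f2


# Invented event: 2 mm in the peak minute, growing to 90 mm over the day.
rd = (2, 8, 14, 19, 30, 45, 60, 80, 88, 90)
f1, f2 = principal_features(rd)
```

The negative weights of Feature 2 at 720 and 1440 min mean that, for a fixed short-duration peak, a long-lasting event scores lower on Feature 2 than a short, sharp one.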
With two decision variables (i.e., features), we are able to visualize the outcome of the models in a two-dimensional plane. As shown in Fig. 8, Models 3, 5, and 13 determine rainfall threshold lines based on combinations of principal component feature 1 and feature 2. Flooding and non-flooding events occur to the right-hand and left-hand sides of each line, respectively. Among these three models, the threshold line from the linear DA model is furthest left, so classifies more events as flooding, while the linear SVM is the furthest right, so classifies fewer
Fig. 7. True positive rates (TPR) and model accuracies (ACC) of 14 trained ML models based on two principal components. Models 3, 5, and 13 are marked in red as they have the best performance for either TPR or ACC. Model numbers correspond to Table 2.
events as flooding. The subspace DA provides a threshold line in between the other two. Note that Fig. 8 offers an intuitive look-up graph from which one can easily tell whether an event is flooding or not based on the values of the two features. For instance, a combination of feature 1 of 60 mm and feature 2 of 10 mm is predicted to not be a flooding event, but a combination of feature 1 of 100 mm and feature 2 of 10 mm is predicted to be a flooding event according to all models. However, further effort is still required to classify an event falling in the area between the lines of Model 3 and Model 5, as the three models may give different answers. We further elaborate on the fact that the nature of the data can lead to different thresholds from each of the three ML models in the Discussion section below.
4. Discussion
4.1. ML model compared to current rainfall threshold and cumulative rainfall threshold
We first elaborate on how the proposed ML model estimates the rainfall threshold better than the current empirical threshold provided by the local authority (SMB, 2019). The threshold suggests any event is regarded as a pluvial flood if either the 30-min rainfall depth is over 20 mm or the 3-h rainfall depth is over 80 mm. This threshold, and the historical data points, are shown in Fig. 9. As the 3-h rainfall threshold is placed too high, many flooding events are missed, resulting in a bad result for the TPR (only 0.25) although the overall ACC is good (0.95) as a large number of non-flooding events are correctly predicted. In other words, the miss rate for flooding events is very high, i.e., 0.75.
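The empirical rule above can be written as a short check. This is only a sketch: the function and argument names are hypothetical, and "over" is read here as a strict inequality. The second helper restates the miss rate as the complement of the TPR.

```python
def empirical_flood_alert(rain_30min_mm, rain_3h_mm):
    """Empirical rule: flood if 30-min depth > 20 mm or 3-h depth > 80 mm."""
    return rain_30min_mm > 20.0 or rain_3h_mm > 80.0


def miss_rate(tpr):
    """Miss rate (false negative rate) is the complement of the TPR."""
    return 1.0 - tpr


print(empirical_flood_alert(25.0, 40.0))  # exceeds the 30-min threshold
print(empirical_flood_alert(15.0, 60.0))  # exceeds neither threshold
print(miss_rate(0.25))                    # complement of the reported TPR
```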
Even if the ML model is built based on a single feature, namely 30-min or 3-h rainfall depth, the ML model is still able to explore the dataset and find thresholds. We used one of the proposed ML models, specifically the subspace DA model, as one of the models with the best performance, to test the performance when using the same feature(s) of 30-min rainfall depth, 3-h rainfall depth, or their combination. The DA model suggests that the threshold should be either 30-min rainfall depth
Fig. 8. Rainfall thresholds from three selected ML models which have the best performance with respect to the TPR or ACC. Note that a PCA was run in advance to derive two features which represent 99.5% of the original dataset.
Fig. 9. The current empirical rainfall threshold (dashed lines) for urban pluvial flooding in Shenzhen, based on 30-min and 3-h rainfall depths, compared with historical data points.
of 12.5 mm (Fig. 10(a)), 3-h rainfall depth of 29.1 mm (Fig. 10(b)), or a combination of these (Fig. 10(c)). The resulting TPRs are all 0.54 or higher, which is more than twice the TPR using the empirical rainfall threshold. Detailed metrics are shown in Appendix X3. This means that the machine learning models can improve the current empirical rainfall threshold to a great extent.
Next, we compare the performance of conventional cumulative critical rainfall curves to those derived from the ML models. The results show that ML models, especially linear discriminant analysis, can classify flooding and non-flooding by two principal components, raising the ACC and TPR to 96% and 58%, respectively, and lowering the false alert rate to 25%. Compared to the conventional model, the critical indices of ACC and TPR were 5%-15% higher in ML models. Therefore, in general, ML models can better classify flooding and non-flooding events than the conventional empirical method, based on different temporal resolutions of rainfall measurements.
The minimum temporal resolution for the input of our ML models is 1 minute. However, the method is generic: the model can also be re-trained with a minimum temporal resolution of 5 min or 10 min. To train the ML model, the user needs reports or observations of flooding and non-flooding events. These inputs (rainfall and flood reports) are identical to the inputs needed by the conventional method.
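As a rough sketch of what re-training at a coarser resolution implies for the inputs, the snippet below derives the accumulation features that remain available at a given sampling step. It assumes that each feature is the largest rainfall accumulation found in a moving window of the given duration, which this section does not confirm. The function name and the sample record are hypothetical.

```python
from typing import Dict, Sequence

DURATIONS_MIN = (1, 5, 10, 15, 30, 60, 120, 360, 720, 1440)


def max_accumulations(rain_mm: Sequence[float], step_min: int) -> Dict[int, float]:
    """Largest moving-window accumulation (mm) for each resolvable duration.

    rain_mm holds rainfall depths per time step; step_min is the step length.
    Assumption: features are moving-window maxima over the event record.
    """
    # Prefix sums turn every window sum into a single subtraction.
    prefix = [0.0]
    for depth in rain_mm:
        prefix.append(prefix[-1] + depth)

    n = len(rain_mm)
    result = {}
    for duration in DURATIONS_MIN:
        if duration % step_min != 0:
            continue  # cannot be built from whole steps at this resolution
        width = min(duration // step_min, n)  # clip windows longer than the record
        result[duration] = max(prefix[i + width] - prefix[i] for i in range(n - width + 1))
    return result


# Hypothetical 5-min record (mm per step): the 1-min feature drops out.
features = max_accumulations([1.0, 2.0, 3.0, 4.0], step_min=5)
print(features)
```

With 10-min data, the 1-, 5- and 15-min features would all be unavailable under this scheme, so the PCA and the classifiers would be fitted on the remaining durations.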
Fig. 10. Performance of single and dual feature rainfall thresholds calculated by the subspace DA model based on observed records. (a) performance of single threshold – 30-min rainfall depth of 12.5 mm; (b) performance of single threshold – 3-h rainfall depth of 29.1 mm; (c) performance of dual threshold (30-min together with 3-h rainfall depth).
Fig. 11. Performance of outstanding ML models (Models 3, 5 and 13) in terms of ACC and TPR using two principal components. Each model categorizes events to the left of its threshold line as non-flooding, and to the right of the line as flooding.
4.2. Pros and cons of the machine learning (ML) model
ML models can successfully produce rainfall thresholds for urban pluvial flooding. The model only needs to be run once and the water system manager/operator can simply use a look-up graph to determine whether a pluvial flood is likely to occur. The features can be flexibly selected, using either all 10 features, or fewer representative features by running a PCA.
However, ML is a data-derived method, which largely relies on the quantity and the quality of data available. For example, five points (which are circled in Fig. 11), regarded as ‘tricky events’, can influence the output when using different models. These events have similar rainfall conditions but they are categorized by the ML models differently; in reality, three are flooding events and two are non-flooding events. The models only make decisions based on data, resulting in different threshold lines for Models 3, 5 and 13. Model 3 includes these five points in the set of flooding events, thereby making two predictions incorrectly. Model 5 excludes these five points from the set of flooding events, thereby making three predictions incorrectly. Model 13 draws a threshold line in between, thereby making only one prediction incorrectly. Potentially, more data points lying in between the threshold line of Model 3 and the threshold line of Model 5 can improve the model to make predictions more precisely.
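The error counts for the five circled events follow directly from the observed labels. The short sketch below reproduces the counts for Models 3 and 5; the ordering of the events is arbitrary and the names are hypothetical. Model 13 is left out because the text does not say which of the five events it misclassifies.

```python
# 1 = flooding, 0 = non-flooding; three of the five circled events flooded.
observed = [1, 1, 1, 0, 0]


def count_errors(predicted, observed):
    """Number of events whose predicted class differs from the observation."""
    return sum(p != o for p, o in zip(predicted, observed))


model_3 = [1] * 5  # threshold furthest left: all five classed as flooding
model_5 = [0] * 5  # threshold furthest right: all five classed as non-flooding
print(count_errors(model_3, observed), count_errors(model_5, observed))
```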
In this work, historical inundation records were collected through a flood report system (a smartphone application). However, not all the municipality’s citizens are aware of this reporting system. This limits the number of the records, thus affecting the TPR (i.e. increased missed alerts). In addition, each citizen’s socio-economic background, education level and experience with pluvial flooding influence the records as well. For instance, inundation caused by blockage of sewers/pipes at home can be wrongly reported as inundation caused by rainfall; this undoubtedly increased the number of false positives. Since the current flood report system does not provide information on the reasons for inundation, false inundation records cannot be filtered out.
It should also be noted that our ML models were applied over the entire city of Shenzhen in this study, due to the limited number of data points. If more data become available, the model can be further refined to a district, a community, or a street. Similarly, it can also be applied to other urban/rural catchments given an available rainfall-flooding database. As more available data can be collected in the future, even with images and text descriptions, we also aim to test deep learning algorithms to increase the accuracy of the flood prediction model.
5. Conclusion
Despite uncertainty about the inundation records and ML models, this data-driven method provides a basis for generating rainfall thresholds for flood early warning and emergency response in Shenzhen. The objective of this paper is to predict the occurrence of urban pluvial flooding by ML approaches. It concludes that ML models can determine the rainfall flooding threshold as a line projected in a plane spanned by two principal components, thereby providing a binary result (flood or no flood). Compared to the conventional empirical critical rainfall curve, the proposed models, especially the subspace discriminant analysis algorithm, can better classify flooding and non-flooding events by different combinations of multi-resolution rainfall intensities, greatly raising the ACC to 96.5% and lowering the false alert rate to 25%. Such models are applicable to other urban catchments as well.
Extreme weather events in the future due to global climate change will bring high-intensity rainfall of short duration (Westra et al., 2014). Advanced techniques, such as radar observations, can efficiently improve very short-range rainfall forecasts, which are essential for accurate flood prediction (Yang et al., 2016). Precipitation is the dominant input influencing the flood prediction result. Other factors like soil characteristics, drainage capacity and topography (e.g. land subsidence) would affect the result as well, emphasizing the need for updated, data-driven flooding thresholds. Since rainfall-threshold-based flood prediction can be executed rapidly and simply, this method allows decision makers (e.g. emergency managers) time for a high-level assessment of flood risk, providing valuable lead time for citizens in the flood-prone areas to be warned. Probability thresholds, which can help understand the uncertainties involved, need to be investigated further. Although the inundation records contain information about occurrence locations and (estimated) inundation depths, these data were not utilized/analysed in this study. Further study on the correlation of spatial distribution of inundation and inundation depth with the spatially varying rainfall records will be valuable as well.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Qian Ke: Conceptualization, Methodology, Data curation, Writing - original draft. Xin Tian: Conceptualization, Methodology, Data curation, Writing - original draft. Jeremy Bricker: Supervision, Writing - review & editing. Zhan Tian: Conceptualization, Methodology, Supervision, Writing - review & editing. Guanghua Guan: Visualization, Investigation. Huayang Cai: Visualization, Investigation. Xinxing Huang: Visualization, Investigation. Honglong Yang: Data curation. Junguo Liu: Writing - review & editing.
Acknowledgements
The authors would like to thank the sponsors of this work: the National Key R&D Program of China (Grant no. 2018YFE0206200), the National Natural Science Foundation of China (Grant no. 41671113 and 51761135024), the Netherlands Organisation for Scientific Research (NWO) (Grant no. ALWSD.2016.007), JPI Urban Europe, ERA-NET Co-fund Smart Urban Futures (Project no. 646453), NWO/VerDuS Smart Urban Regions of the Future (SURF) and the Engineering and Physical Sciences Research Council of UK (Grant no. R034214/1), the High-level Special Funding of the Southern University of Science and Technology (Grant no. G02296302, G02296402). We sincerely appreciate the three reviewers’ comments and suggestions, which improved the quality of the article. We also would like to show our gratitude to the Meteorological Bureau of Shenzhen Municipality and the Water Sector of Shenzhen Municipality for providing the rainfall data and inundation records in this paper.
Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.advwatres.2020.103719.
Appendices
X1. Performance metrics of 14 trained machine learning models based on 10 features.

| No | Models | TP | FN | FP | TN | TPR | TNR | PPR | NPR | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Coarse tree | 13 | 11 | 7 | 382 | 0.54 | 0.98 | 0.65 | 0.97 | 0.956 |
| 2 | Medium tree | 14 | 10 | 11 | 378 | 0.58 | 0.97 | 0.56 | 0.97 | 0.949 |
| 3 | Linear discriminant | 14 | 10 | 7 | 382 | 0.58 | 0.98 | 0.67 | 0.97 | 0.959 |
| 4 | Quadratic discriminant | 18 | 6 | 29 | 360 | 0.75 | 0.93 | 0.38 | 0.98 | 0.919 |
| 5 | Linear SVM | 11 | 13 | 2 | 387 | 0.46 | 0.99 | 0.85 | 0.97 | 0.964 |
| 6 | Quadratic SVM | 11 | 13 | 2 | 387 | 0.46 | 0.99 | 0.85 | 0.97 | 0.964 |
| 7 | Cubic SVM | 10 | 14 | 6 | 383 | 0.42 | 0.98 | 0.63 | 0.97 | 0.952 |
| 8 | Fine KNN | 11 | 13 | 9 | 380 | 0.46 | 0.98 | 0.55 | 0.97 | 0.946 |
| 9 | Medium KNN | 10 | 14 | 3 | 386 | 0.42 | 0.99 | 0.77 | 0.97 | 0.959 |
| 10 | Cos KNN | 7 | 17 | 4 | 385 | 0.29 | 0.99 | 0.64 | 0.96 | 0.950 |
| 11 | Cubic KNN | 10 | 14 | 3 | 386 | 0.42 | 0.99 | 0.77 | 0.97 | 0.959 |
| 12 | Bagged trees | 12 | 12 | 6 | 383 | 0.50 | 0.98 | 0.67 | 0.97 | 0.956 |
| 13 | Subspace discriminant | 14 | 10 | 5 | 384 | 0.58 | 0.99 | 0.74 | 0.97 | 0.964 |
| 14 | Subspace KNN | 14 | 10 | 10 | 379 | 0.58 | 0.97 | 0.58 | 0.97 | 0.951 |
X2. Performance metrics of 14 trained machine learning models based on 2 features derived from a PCA.

| No | Models | TP | FN | FP | TN | TPR | TNR | PPR | NPR | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Coarse tree | 10 | 14 | 8 | 381 | 0.42 | 0.98 | 0.56 | 0.96 | 0.947 |
| 2 | Medium tree | 10 | 14 | 16 | 373 | 0.42 | 0.96 | 0.38 | 0.96 | 0.927 |
| 3 | Linear discriminant | 14 | 10 | 6 | 383 | 0.58 | 0.98 | 0.70 | 0.97 | 0.961 |
| 4 | Quadratic discriminant | 14 | 10 | 10 | 379 | 0.58 | 0.97 | 0.58 | 0.97 | 0.952 |
| 5 | Linear SVM | 12 | 12 | 3 | 386 | 0.50 | 0.99 | 0.80 | 0.97 | 0.964 |
| 6 | Quadratic SVM | 9 | 15 | 4 | 385 | 0.38 | 0.99 | 0.69 | 0.96 | 0.954 |
| 7 | Cubic SVM | 7 | 17 | 3 | 386 | 0.29 | 0.99 | 0.70 | 0.96 | 0.952 |
| 8 | Fine KNN | 8 | 16 | 12 | 377 | 0.33 | 0.97 | 0.40 | 0.96 | 0.932 |
| 9 | Medium KNN | 7 | 17 | 1 | 388 | 0.29 | 1.00 | 0.88 | 0.96 | 0.956 |
| 10 | Cos KNN | 0 | 24 | 1 | 388 | 0.00 | 1.00 | 0.00 | 0.94 | 0.939 |
| 11 | Cubic KNN | 5 | 19 | 1 | 388 | 0.21 | 1.00 | 0.83 | 0.95 | 0.952 |
| 12 | Bagged trees | 10 | 14 | 10 | 379 | 0.42 | 0.97 | 0.50 | 0.96 | 0.942 |
| 13 | Subspace discriminant | 13 | 11 | 4 | 385 | 0.54 | 0.99 | 0.76 | 0.97 | 0.964 |
| 14 | Subspace KNN | 7 | 17 | 11 | 378 | 0.29 | 0.97 | 0.39 | 0.96 | 0.932 |
X3. Performance metrics of the subspace DA model based on (a) 30-min rainfall, (b) 3-h rainfall, and (c) both 30-min and 3-h rainfall.

| No | Models | TP | FN | FP | TN | TPR | TNR | PPR | NPR | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| a | Based on 30-min rainfall | 13 | 11 | 9 | 380 | 0.54 | 0.98 | 0.59 | 0.97 | 0.952 |
| b | Based on 3-h rainfall | 15 | 9 | 7 | 382 | 0.63 | 0.98 | 0.68 | 0.98 | 0.961 |
| c | Based on 30-min and 3-h rainfall | 14 | 10 | 5 | 384 | 0.58 | 0.99 | 0.74 | 0.97 | 0.964 |
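The rates in Appendices X1 to X3 follow from the four confusion-matrix counts. The sketch below assumes that PPR and NPR denote TP/(TP+FP) and TN/(TN+FN), respectively; this reading is not defined in this section, but it reproduces the tabulated values. The function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Rates derived from confusion-matrix counts.

    No guard against empty denominators: the counts must include at least one
    observed and one predicted event of each class.
    """
    return {
        "TPR": tp / (tp + fn),   # share of observed floods that were predicted
        "TNR": tn / (tn + fp),   # share of observed non-floods that were predicted
        "PPR": tp / (tp + fp),   # share of flood predictions that were correct
        "NPR": tn / (tn + fn),   # share of non-flood predictions that were correct
        "Accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Subspace discriminant with 10 features (Model 13 in Appendix X1).
metrics = classification_metrics(tp=14, fn=10, fp=5, tn=384)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```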
References
Atta-ur-Rahman, Parvin, G.A., Shaw, R., Surjan, A., 2016. Cities, Vulnerability, and Climate Change, in: Urban Disasters and Resilience in Asia. Elsevier, pp. 35–47. https://doi.org/10.1016/B978-0-12-802169-9.00003-3.
Babaei, S., Ghazavi, R., Erfanian, M., 2018. Urban flood simulation and prioritization of critical urban sub-catchments using SWMM model and PROMETHEE II approach. Phys. Chem. Earth 105, 3–11. https://doi.org/10.1016/j.pce.2018.02.002.
BBC, 2019. Mumbai: Heaviest rain in decade triggers chaos [WWW Document]. URL
Bengio, Y., Grandvalet, Y., 2004. No unbiased estimator of the variance of K-Fold cross-validation. J. Mach. Learn. Res. 5, 1089–1105. https://doi.org/10.1016/S0006-291X(03)00224-9.
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees, 1st edn.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 2017. Classification And Regression Trees. Routledge. https://doi.org/10.1201/9781315139470.
Cai, Y., 2014. Causes analysis of urban flooding in Shenzhen. Technical report, Shenzhen Water Conservancy Planning Design Institute (in Chinese).
Carpenter, T.M., Sperfslage, J.A., Georgakakos, K.P., Sweeney, T., Fread, D.L., 1999. National threshold runoff estimation utilizing GIS in support of operational flash flood warning systems. J. Hydrol. 224, 21–44. https://doi.org/10.1016/S0022-1694(99)00115-8.
Chan, F.K.S., Wright, N., Cheng, X., Griffiths, J., 2014. After Sandy: Rethinking Flood Risk Management in Asian Coastal Megacities. Nat. Hazards Rev. 15, 101–103. https://doi.org/10.1061/(asce)nh.1527-6996.0000117.
Costabile, P., Macchione, F., 2015. Enhancing river model set-up for 2-D dynamic flood modelling. Environ. Model. Softw. 67, 89–107. https://doi.org/10.1016/j.envsoft.2015.01.009.
Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27. https://doi.org/10.1109/TIT.1967.1053964.
Cheng, D., Zhang, S., Deng, Z., Zhu, Y., Zong, M., 2014. kNN Algorithm with Data-Driven k Value, pp. 499–512. https://doi.org/10.1007/978-3-319-14717-8_39.
Douglas, I., Garvin, S., Lawson, N., Richards, J., Tippett, J., White, I., 2010. Urban pluvial flooding: A qualitative case study of cause, effect and nonstructural mitigation. J. Flood Risk Manag. 3, 112–125. https://doi.org/10.1111/j.1753-318X.2010.01061.x.
Falconer, R.H., Cobby, D., Smyth, P., Astle, G., Dent, J., Golding, B., 2009. Pluvial flooding: New approaches in flood warning, mapping and risk management. J. Flood Risk Manag. 2, 198–208. https://doi.org/10.1111/j.1753-318X.2009.01034.x.
Fawcett, R., Stone, R., 2010. A comparison of two seasonal rainfall forecasting systems for Australia. Aust. Meteorol. Oceanogr. J. 61, 15–23. https://doi.org/10.22499/2.6001.002.
García-Pedrajas, N., Ortiz-Boyer, D., 2009. Boosting k-nearest neighbor classifier by means of input space projection. Expert Syst. Appl. 36, 10570–10582. https://doi.org/10.1016/j.eswa.2009.02.065.
Garcia-Urquia, E., Axelsson, K., 2015. Rainfall thresholds for the occurrence of urban landslides in Tegucigalpa, Honduras: an application of the critical rainfall intensity. Geogr. Ann. Ser. A, Phys. Geogr. 97, 61–83. https://doi.org/10.1111/geoa.12092.
Giannecchini, R., Galanti, Y., D’Amato Avanzi, G., 2012. Critical rainfall thresholds for triggering shallow landslides in the Serchio River Valley (Tuscany, Italy). Nat. Hazards Earth Syst. Sci. 12, 829–842. https://doi.org/10.5194/nhess-12-829-2012.
Golian, S., Fallahi, M.R., Behbahani, S.M., Sharifi, S., Sharma, A., 2015. Real-time updating of rainfall threshold curves for flood forecasting. J. Hydrol. Eng. 20, 1–9. https://doi.org/10.1061/(ASCE)HE.1943-5584.0001049.
Golian, S., Saghafian, B., Maknoon, R., 2010. Derivation of Probabilistic Thresholds of Spatially Distributed Rainfall for Flood Forecasting. Water Resour. Manag. 24, 3547–3559. https://doi.org/10.1007/s11269-010-9619-7.
Hong, M., Kim, J., Jeong, S., 2018. Rainfall intensity-duration thresholds for landslide prediction in South Korea by considering the effects of antecedent rainfall. Landslides 15, 523–534. https://doi.org/10.1007/s10346-017-0892-x.
Hua, C., 2019. Sudden rainstorm kills 11 people in Shenzhen. https://www.chinadailyhk.com/articles/154/181/191/1555144049585.html (Accessed 09 May 2019).
Jhong, Y.-D., Chen, C.-S., Lin, H.-P., Chen, S.-T., 2018. Physical Hybrid Neural Network Model to Forecast Typhoon Floods. Water 10, 632. https://doi.org/10.3390/w10050632.
Jiang, Y., Zevenbergen, C., Fu, D., 2017. Can “Sponge Cities” Mitigate China’s Increased Occurrences of Urban Flooding? Aquademia Water. Environ. Technol. 1, 1–5. https://doi.org/10.20897/awet.201703.
Jiang, Y., Zevenbergen, C., Ma, Y., 2018. Urban pluvial flooding and stormwater management: A contemporary review of China’s challenges and “sponge cities” strategy. Environ. Sci. Policy 80, 132–143. https://doi.org/10.1016/j.envsci.2017.11.016.
Breiman, L., 2001. Random Forests. Mach. Learn. 45, 5–32. https://doi.org/10.3390/rs10060911.
Ledoit, O., Wolf, M., 2004. Honey, I Shrunk the Sample Covariance Matrix: Problems in Mean-Variance Optimization. J. Portf. Manag. 30, 110–119.
Li, X., 2020. A Hybrid Model for Fast and Probabilistic Urban Pluvial Flood Prediction. Water Resour. Res. 56. https://doi.org/10.1029/2019WR025128.
Li, X., Willems, P., 2019. Probabilistic flood prediction for urban sub-catchments using sewer models combined with logistic regression models. Urban Water J. 16, 687–697. https://doi.org/10.1080/1573062X.2020.1726409.
Liu, F., Xu, F., Yang, S., 2017. A Flood Forecasting Model Based on Deep Learning Algorithm via Integrating Stacked Autoencoders with BP Neural Network. In: 2017 IEEE Third International Conference on Multimedia Big Data (BigMM). IEEE, pp. 58–61. https://doi.org/10.1109/BigMM.2017.29.
Madsen, R.E., Kai, H., Winther, O., 2004. Singular Value Decomposition and Principal Component Analysis.
Martelloni, G., Segoni, S., Fanti, R., Catani, F., 2012. Rainfall thresholds for the forecasting of landslide occurrence at regional scale. Landslides 485–495. https://doi.org/10.1007/s10346-011-0308-2.
Martina, M.L.V., Todini, E., Libralon, A., 2006. A Bayesian decision approach to rainfall thresholds based flood warning. Hydrol. Earth Syst. Sci. 10, 413–426. https://doi.org/10.5194/hess-10-413-2006.
Montesarchio, V., Ridolfi, E., Russo, F., Napolitano, F., 2011. Rainfall threshold definition using an entropy decision approach and radar data. Nat. Hazards Earth Syst. Sci. 11, 2061–2074. https://doi.org/10.5194/nhess-11-2061-2011.
Ng, A., 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. In: Christianini, N., J.C.S.-T. (Eds.), Intelligent Systems and Their Applications IEEE. Cambridge University Press, pp. 1–25. https://doi.org/10.1016/j.aca.2011.07.027.
Mosavi, A., Ozturk, P., Chau, K.W., 2018. Flood prediction using machine learning models: Literature review. Water (Switzerland) 10, 1–40. https://doi.org/10.3390/w10111536.
Nikolopoulos, E.I., Crema, S., Marchi, L., Marra, F., Guzzetti, F., Borga, M., 2014. Impact of uncertainty in rainfall estimation on the identification of rainfall thresholds for debris flow occurrence. Geomorphology 221, 286–297. https://doi.org/10.1016/j.geomorph.2014.06.015.
Norbiato, D., Borga, M., DegliEsposti, S., Gaume, E., Anquetin, S., 2008. Flash flood warning based on rainfall thresholds and soil moisture conditions: An assessment for gauged and ungauged basins. J. Hydrol. 362, 274–290. https://doi.org/10.1016/j.jhydrol.2008.08.023.
Noymanee, J., Nikitin, N.O., Kalyuzhnaya, A.V., 2017. Urban Pluvial Flood Forecasting using Open Data with Machine Learning Techniques in Pattani Basin. Procedia Comput. Sci. 119, 288–297. https://doi.org/10.1016/j.procs.2017.11.187.
Pan, H.L., Jiang, Y.J., Wang, J., Ou, G.Q., 2018. Rainfall threshold calculation for debris flow early warning in areas with scarcity of data. Nat. Hazards Earth Syst. Sci. 18, 1395–1409. https://doi.org/10.5194/nhess-18-1395-2018.
Ramaswamy, S., Rastogi, R., Shim, K., 2000. Efficient algorithms for mining outliers from large data sets. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data - SIGMOD ’00. ACM Press, New York, New York, USA, pp. 427–438. https://doi.org/10.1145/342009.335437.
Shen, C., Laloy, E., Elshorbagy, A., Albert, A., Bales, J., Chang, F.J., Ganguly, S., Hsu, K.L., Kifer, D., Fang, Z., Fang, K., Li, D., Li, X., Tsai, W.P., 2018. HESS Opinions: Incubating deep-learning-powered hydrologic science advances as a community. Hydrol. Earth Syst. Sci. 22, 5639–5656. https://doi.org/10.5194/hess-22-5639-2018.
Shi, P.J., Yuan, Y., Zheng, J., Wang, J.A., Ge, Y., Qiu, G.Y., 2007. The effect of land use/cover change on surface runoff in Shenzhen region, China. Catena 69, 31–35. https://doi.org/10.1016/j.catena.2006.04.015.
Skurichina, M., Duin, R.P.W., 2002. Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal. Appl. 5, 121–135. https://doi.org/10.1007/s100440200011.
SMB, 2019. Rough determination of meteorological disaster in Shenzhen [WWW Document]. URL http://weather.sz.gov.cn/qixiangfuwu/fangzaijianzai/fangzaijianzai/content/post_3576505.html (in Chinese).
SSB (Statistical Bureau of Shenzhen Municipality), 2019. Shenzhen Statistical Yearbook. China Statistics Press, online PDF. http://tjj.sz.gov.cn/attachment/0/418/418268/6765070.pdf.
Hastie, T., Tibshirani, R., Friedman, J., 2008. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, pp. 106–119.
Tayfur, G., Singh, V., Moramarco, T., Barbetta, S., 2018. Flood Hydrograph Prediction Using Machine Learning Methods. Water 10, 968. https://doi.org/10.3390/w10080968.
Thorndahl, S., Nielsen, J.E., Jensen, D.G., 2016. Urban pluvial flood prediction: A case study evaluating radar rainfall nowcasts and numerical weather prediction models as model inputs. Water Sci. Technol. 74, 2599–2610. https://doi.org/10.2166/wst.2016.474.
Tian, X., Veldhuis, M., Schleiss, M., 2019. Critical rainfall thresholds for urban pluvial flooding inferred from citizen observatories. Sci. Total Environ. 21, 4890.
Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844. https://doi.org/10.1109/34.709601.
Tracy, A., Trumbull, K., Loh, C., 2007. Impact of climate change in Hong Kong and the Pearl River Delta. China Perspect. 2007, 18–29.
van Asch, T.W.J., Tang, C., Alkema, D., Zhu, J., Zhou, W., 2014. An integrated model to assess critical rainfall thresholds for run-out distances of debris flows. Nat. Hazards 70, 299–311. https://doi.org/10.1007/s11069-013-0810-z.
Vogel, R.M., Lall, U., Cai, X., Rajagopalan, B., Weiskel, P.K., Hooper, R.P., Matalas, N.C., 2015. Hydrology: The interdisciplinary science of water. Water Resour. Res. 51, 4409–4430. https://doi.org/10.1002/2015WR017049.
Westra, S., Fowler, H.J., Evans, J.P., Alexander, L.V., Berg, P., Johnson, F., Kendon, E.J., Lenderink, G., Roberts, N.M., 2014. Future changes to the intensity and frequency of short-duration extreme rainfall. Rev. Geophys. 52 (3), 522–555. https://doi.org/10.1002/2014RG000464.
Wilks, D., 2005. Statistical Methods in the Atmospheric Sciences, 2nd edn. Elsevier, Burlington.
Xing, Y., Liang, Q., Wang, G., Ming, X., Xia, X., 2019. City-scale hydrodynamic modelling of urban flash floods: the issues of scale and resolution. Nat. Hazards 96, 473–496. https://doi.org/10.1007/s11069-018-3553-z.
Yan, H., He, X., Lei, Y., Wang, Y., Su, H., Jiang, S., 2019. Land use-induced change in trophic state of Shenzhen Bay (South China) over the past half-century. Mar. Pollut. Bull. 145, 208–213. https://doi.org/10.1016/j.marpolbul.2019.05.046.
Yang, T.H., Hwang, G.Do, Tsai, C.C., Ho, J.Y., 2016. Using rainfall thresholds and ensemble precipitation forecasts to issue and improve urban inundation alerts. Hydrol. Earth Syst. Sci. 20, 4731–4745. https://doi.org/10.5194/hess-20-4731-2016.
Zhai, X., Guo, L., Liu, R., Zhang, Y., 2018. Rainfall threshold determination for flash flood warning in mountainous catchments with consideration of antecedent soil moisture and rainfall pattern. Nat. Hazards 94, 605–625. https://doi.org/10.1007/s11069-018-3404-y.
Zhang, Z., 2016. Introduction to machine learning: k-nearest neighbors. Ann. Transl. Med. 4, 218. https://doi.org/10.21037/atm.2016.03.37.
Zhou, J., Peng, T., Zhang, C., Sun, N., 2018. Data Pre-Analysis and Ensemble of Various Artificial Neural Networks for Monthly Streamflow Forecasting. Water 10, 628. https://doi.org/10.3390/w10050628.
Zhang, J., Hou, G., Ma, B., Hua, W., 2018. Operating characteristic information extraction of flood discharge structure based on complete ensemble empirical mode decomposition with adaptive noise and permutation entropy. J. Vib. Control 24, 5291–5301. https://doi.org/10.1177/1077546317750979.
Zhou, L., Wu, X., Ji, Z., Gao, G., 2017. Characteristic analysis of rainstorm-induced catastrophe and the countermeasures of flood hazard mitigation about Shenzhen city. Geomat. Nat. Hazards Risk 8, 1886–1897. https://doi.org/10.1080/19475705.2017.1392368.
Ziegler, A.D., 2012. Reduce urban flood vulnerability. Nature 481, 145. https://doi.org/10.1038/481145b.