
Delft University of Technology

A novel one-layer recurrent neural network for the l1-regularized least square problem

Mohammadi, Majid; Tan, Yao Hua; Hofman, Wout; Mousavi, S. Hamid

DOI

10.1016/j.neucom.2018.07.007

Publication date

2018

Document Version

Final published version

Published in

Neurocomputing

Citation (APA)

Mohammadi, M., Tan, Y. H., Hofman, W., & Mousavi, S. H. (2018). A novel one-layer recurrent neural network for the l1-regularized least square problem. Neurocomputing. https://doi.org/10.1016/j.neucom.2018.07.007

Important note

To cite this publication, please use the final published version (if applicable).

Please check the document version above.

Copyright

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy

Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.

This work is downloaded from Delft University of Technology.


Green Open Access added to TU Delft Institutional Repository

'You share, we take care!' - Taverne project

https://www.openaccess.nl/en/you-share-we-take-care

Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.


Neurocomputing 315 (2018) 135–144


A novel one-layer recurrent neural network for the l1-regularized least square problem

Majid Mohammadi a,∗, Yao-Hua Tan a, Wout Hofman b, S. Hamid Mousavi c

a Faculty of Technology, Policy and Management, Delft University of Technology, The Netherlands
b The Netherlands Institute of Applied Technology (TNO)

c Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, Carl von Ossietzky University of Oldenburg, Germany

Article info

Article history: Received 18 July 2017; Revised 12 May 2018; Accepted 4 July 2018; Available online 10 July 2018. Communicated by Dr. Ding Wang.

Keywords: Least squares, l1-regularization, Recurrent neural network, Convex, Lyapunov, Total variation

Abstract

The l1-regularized least square problem has been considered in diverse fields. However, finding its solution is exacting as its objective function is not differentiable. In this paper, we propose a new one-layer neural network to find the optimal solution of the l1-regularized least squares problem. To solve the problem, we first convert it into a smooth quadratic minimization by splitting the desired variable into its positive and negative parts. Accordingly, a novel neural network is proposed to solve the resulting problem, which is guaranteed to converge to the solution of the problem. Furthermore, the rate of convergence depends on a scaling parameter, not on the size of the datasets. The proposed neural network is further adjusted to encompass the total variation regularization. Extensive experiments on the l1 and total variation regularized problems illustrate the reasonable performance of the proposed neural network.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

The l1-regularized least squares, or the lasso [1], has received a considerable amount of attention over the last decade, and much research in recent years has focused on solving its non-smooth convex optimization problem

\min_{x} \; \tfrac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1 \qquad (1)

where x ∈ R^l, y ∈ R^n, A is an n × l matrix consisting of l data points, λ is a non-negative parameter, ‖v‖_2 indicates the Euclidean norm, and ‖v‖_1 = Σ_i |v_i| is the l1-norm of v, which encourages the small components of x to be zero.

The lasso has a broad range of applications, such as signal reconstruction [2], curve fitting and classification [3], subspace clustering [4,5], sparse coding [6,7], and robot control [8], to name just a few. In these applications, it is critical to solve the minimization (1) efficiently. Therefore, myriad methods have been developed for solving (1) more quickly and effectively [9–12].

One promising way to find the optimum of the minimization (1) is to utilize a recurrent neural network. One of the main advantages of such an approach is that the structure of RNNs can be implemented using very-large-scale integration (VLSI) and optical technologies. Furthermore, it is well known that neural networks have the ability to process real-time applications. Hence, when there are demands on real-time processing, it is necessary and desirable to employ parallel and distributed approaches, like neural networks. Despite such unique merits, solving the minimization (1) via RNNs has been thoroughly neglected (with the exception of the RNNs for general non-smooth problems), and this is the principal incentive to develop a novel recurrent neural network especially tailored for the lasso.

∗ Corresponding author. E-mail address: m.mohammadi@tudelft.nl (M. Mohammadi).

The tremendous challenge of solving the minimization (1) is its non-differentiability due to its l1-regularization. There are two options to put forward the neural network by circumventing the non-differentiability of the lasso. The first approach is to take advantage of the dual problem of the minimization (1). This is the modus operandi of various methods in the recent literature [10,11,13]. The interior-point method is arguably the most famous technique used to solve the dual problem. Contrary to conventional interior-point methods, it is claimed that this technique is suitable for large-scale problems; a problem with millions of variables is soluble in several minutes on an ordinary PC. However, the main difficulty in solving the dual problem is obtaining the optimal solution of the primal problem, e.g. x in the minimization (1), from the dual solution. The calculation of the primal solution x from the dual variable usually involves the computation of (A^T A)^{-1}. Mathematically speaking, such an inverse does not exist for all matrices A. On top of that, the inverse calculation is both time- and memory-consuming for large-scale problems. Therefore, this approach is not taken into account.

Another approach to solving the minimization (1) is to convert it into a smooth problem by splitting the variable x into its positive and negative parts. The resulting smooth problem can be readily solved using gradient-based methods. The gradient projection for sparse reconstruction (GPSR) [9] solves the smooth problem and is of immense popularity among other methods. Further studies on the gradient projection concentrated on accelerating the convergence [14,15].

In this article, we use the second approach to come up with a neural network in order to avoid the calculation of the inverse matrix. However, splitting the variable into its positive and negative parts results in a dimension escalation of the consequent smooth problem. We further investigate whether this dimension increase can be dealt with more economically than it appears at first sight.

The proposed neural network is guaranteed to find the optimal solution of the smooth problem equivalent to the minimization (1). Then, the solution of the original problem can be readily obtained by subtracting the corresponding outputs of the neural network. Further, the proposed neural network has a simple one-layer structure that can be smoothly implemented. From the speed point of view, the convergence of the neural network relies on a positive parameter determined by the user, not on the size of the dataset. Such a salient feature is desirable when large datasets are available. We further adjust the proposed neural network to solve total variation-regularized problems. Similar to the lasso, the total variation-regularized problems are not differentiable. The efficiency of the proposed neural network is demonstrated by conducting experiments over several real and simulated datasets from the signal and image processing and bioinformatics domains.

In a nutshell, the contributions of this article can be summarized as follows:

• A novel recurrent neural network is proposed for solving the lasso.

• The neural network is guaranteed to converge to the solution of the problem.

• The escalation in dimensions stemming from the variable split is discussed, and the computation cost is reduced.

• The neural network is then extended to solve the total variation-regularized problem.

• Extensive experiments are presented to illustrate the performance of the proposed neural network.

The paper is organized as follows. In Section 2, we first derive the smooth problem equivalent to the minimization (1), and then a neural network is proposed accordingly; the effect of the dimension increase and the complexity of the neural network are also analyzed in this section. The convergence of the neural network and its convergence rate are investigated in Section 3. Extensive experimental results with application to compressed sensing and image and signal recovery are discussed in Section 4, and we conclude this paper in Section 5.

2. Neural network for smooth equivalent problem

In this section, a smooth problem for the minimization (1) is derived by splitting the desired variable x into its positive and negative parts. The subsequent escalation of dimension and a one-layer neural network are investigated afterward. The proposed neural network is then adjusted to solve the total variation-regularized problem.

2.1. Smooth equivalent problem

To solve the minimization (1) using the neural network, we first restate it as a smooth quadratic problem. This is done by splitting the variable x into its positive and negative parts. Let u, v ∈ R^l be auxiliary variables such that

x = u - v, \quad u \ge 0, \quad v \ge 0,

where u_i = (x_i)_+, v_i = (-x_i)_+, and (·)_+ denotes the positive part defined as (x)_+ = max{0, x}. Now, let 1_{2l} = (1, 1, ..., 1)^T ∈ R^{2l}; then the problem (1) can be rewritten as the following quadratic problem:

\min_{z} \; F(z) = \tfrac{1}{2} z^T B z + c^T z \qquad (2)
\text{s.t.} \; z \ge 0

where

z = \begin{pmatrix} u \\ v \end{pmatrix}, \quad c = \lambda 1_{2l} + \begin{pmatrix} -A^T y \\ A^T y \end{pmatrix}, \quad B = \begin{pmatrix} A^T A & -A^T A \\ -A^T A & A^T A \end{pmatrix}.

2.2. One-layer neural network

The smooth problem (2) is a convex minimization with non-negativity constraints. Therefore, the Karush–Kuhn–Tucker (KKT) conditions [16] are necessary and sufficient for the optimality of the solution. As stated by the KKT conditions, z* is the optimal solution of the minimization (2) if and only if there exists w* ∈ R^{2l} such that (z*, w*) satisfies the following conditions:

\nabla F(z) - w = 0, \quad w \ge 0,
w^T z = 0, \quad z \ge 0. \qquad (3)

From the first equality in Eq. (3), it follows that \nabla F(z) = w. The foregoing conditions can thus be restated as

\nabla F(z) \ge 0, \quad z \ge 0, \quad \nabla F(z)^T z = 0. \qquad (4)

The inequalities (4) are known as the nonlinear complementarity problem (NCP) [17]. With the aid of the next theorem, a neural network for the minimization (2) is proposed according to the above NCP.

Theorem 2.1. For the problem (2), z is the optimal solution if and only if \Phi(z) = 0, where

\Phi(z) = \min\{z, \nabla F(z)\}, \qquad (5)

\Phi(z) is a vector-valued function, and "min" denotes the element-wise minimum of z and \nabla F(z).

Proof. It can be easily drawn from the inequalities (4) (see [18] for more information). □

Based on the above theorem, the following dynamic system is proposed to solve the problem (2):

\frac{dz}{dt} = -\alpha \Phi(z) \qquad (6)

where α > 0 is a scaling parameter. The dynamic system (6) can be recognized as a recurrent neural network with a single-layer structure. Before examining its structure, however, we first probe into the effect of the dimension escalation caused by the variable split.
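To make the dynamics concrete, the following sketch (our illustration in plain NumPy, not the authors' MATLAB implementation) builds B and c of problem (2) from A, y and λ and applies a simple forward-Euler discretization of (6); the step size dt and the value of alpha are placeholder choices.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): build the data of the
# quadratic problem (2) and apply a forward-Euler discretization of the
# dynamic system (6), dz/dt = -alpha * Phi(z), with Phi(z) = min{z, grad F(z)}.

def build_problem(A, y, lam):
    """Return B and c of problem (2) for given A, y and regularization lam."""
    AtA, Aty = A.T @ A, A.T @ y
    B = np.block([[AtA, -AtA], [-AtA, AtA]])
    c = lam * np.ones(2 * A.shape[1]) + np.concatenate([-Aty, Aty])
    return B, c

def phi(z, B, c):
    """Phi(z) = min{z, grad F(z)} with grad F(z) = B z + c (Eq. (5))."""
    return np.minimum(z, B @ z + c)

def euler_step(z, B, c, alpha=10.0, dt=1e-3):
    """One forward-Euler step of dz/dt = -alpha * Phi(z) (Eq. (6))."""
    return z - dt * alpha * phi(z, B, c)
```

Iterating euler_step until Phi(z) is numerically zero yields the optimum z of (2), from which the lasso solution is recovered as x = u - v.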


Fig. 1. Block diagram of the proposed recurrent neural network (6), taking the computational reduction into account. Here aa_ij is the element at the ith row and jth column of A^T A; the triangle nodes represent multiplication and the summation nodes represent addition.

2.3. Dimension effect and complexity of neural network

It is observed that the size of the problem (2) is twice as large as that of the original problem (1), since x ∈ R^l but z ∈ R^{2l}. However, this increase in dimension does not have a significant impact, since the matrix operations involving B can be performed more efficiently than it might seem. To illustrate how minor this effect is, let us consider the complexity of the system (6) by computing the number of multiplications and additions/subtractions in each iteration. The most costly computation is Bz, where B is a 2l × 2l matrix and z ∈ R^{2l}. Such a calculation requires 4l^2 multiplications and 4l^2 - 2l additions.

However, the computation can be significantly reduced. For a given z = (u^T, v^T)^T, one can rewrite Bz as

Bz = B \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} A^T A (u - v) \\ -A^T A (u - v) \end{pmatrix}.

The computation of Bz then only requires l^2 multiplications and l^2 additions/subtractions, considering that A^T A is computed beforehand. Hence, the number of operations drops from 4l^2 multiplications to l^2, and from 4l^2 - 2l additions/subtractions to l^2. In aggregate, since c is also precomputed, l^2 multiplications and l^2 + 2l additions/subtractions are performed in each iteration of the dynamic system (6).
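As a small illustration of this reduction (our sketch, assuming A^T A has been precomputed), Bz can be assembled from a single l × l matrix-vector product instead of forming the 2l × 2l matrix B explicitly:

```python
import numpy as np

# Sketch of the reduction above: Bz is obtained from the precomputed l-by-l
# matrix A^T A with l^2 multiplications, since Bz = [A^T A (u - v); -A^T A (u - v)].

def Bz_fast(z, AtA):
    l = AtA.shape[0]
    u, v = z[:l], z[l:]
    w = AtA @ (u - v)              # the only l-by-l matrix-vector product
    return np.concatenate([w, -w])
```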

In element form, the dynamic system (6) can be written as

\frac{dz_i}{dt} = -\alpha \Phi(z)_i = -\alpha \min\big( (Bz)_i + c_i, \; z_i \big) = -\alpha \min\Big( \operatorname{sign}(l - i) \sum_j aa_{ij} (u - v)_j + c_i, \; z_i \Big) \qquad (7)

where aa_{ij} is the element in the ith row and jth column of the matrix A^T A. Regarding the element-wise equation of the proposed neural network, its structure is displayed in Fig. 1. In this figure, the modification for the dimension escalation is also considered to reduce the complexity of the network. The outputs of the neural network are the u_i's and v_i's, which are recursively fed back into the first layer. They are then multiplied by aa_{ij}, shown as triangles in the figure and explained in Eq. (7). In view of Fig. 1, the circuit consists of 2l integrators, 2l minimum activation functions, 4l summers, and some connection weights.

2.4. Total variation-regularized problem

The total variation-regularized problem is another non-smooth minimization. The corresponding minimization for the total variation-regularized problem is

\min_{q} \; \|p - q\|_2^2 + \lambda \|q\|_{TV}

where p ∈ R^l is the observation, q ∈ R^l is the desired variable, λ is the regularization parameter, and ‖x‖_{TV} = \sum_{i=1}^{l-1} |x_i - x_{i+1}| is the total variation norm. This problem can be equivalently rewritten as

\min_{q} \; \|p - q\|_2^2 + \lambda \|Dq\|_1 \qquad (8)

where D ∈ R^{(l-1) \times l} is defined as

D = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & \cdots & & 1 & -1 \end{pmatrix}.


To all appearances, the problem (8) is similar to the minimization (1); however, the total variation-regularized problem poses greater challenges, as the variable inside the l1-regularization has been multiplied by a matrix.

Harchaoui and Levy-Leduc [19] solved the total variation-regularized minimization (8) through the problem (1). The following theorem summarizes their main result.

Theorem 2.2 [19]. By the following change of variables, the minimizations (1) and (8) are equivalent:

x = Dq, \quad A = D^T (DD^T)^{-1}, \quad y = D^T (DD^T)^{-1} D p \qquad (9)

where D, p, and q are the variables in the total variation problem. Further, the variable q in the minimization (8) is obtained as

q = p + D^T (DD^T)^{-1} (x - Dp). \qquad (10)

In other words, the total variation-regularized problem (8) can be solved via the minimization (1) with the initialization (9). Then, the optimal solution q is calculated by Eq. (10).

Based on this theorem, the proposed recurrent neural network can be adjusted to solve the total variation-based regularization as well. The major elements for the neural network computation are

A^T A = (DD^T)^{-1}, \quad A^T y = (DD^T)^{-1} D p.

In the experiment section, two applications of the total variation regularization are investigated.
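The following sketch (our own, using dense NumPy linear algebra and therefore only suited to moderate l) carries out the change of variables of Theorem 2.2: it forms the quantities A^T A and A^T y needed by the network and recovers q from a lasso solution x via Eq. (10).

```python
import numpy as np

# Minimal sketch of Theorem 2.2: map the TV-regularized problem (8) to the
# lasso (1) and recover q from the lasso solution x.

def tv_to_lasso(p):
    """Return A^T A, A^T y needed by the network, plus D and (D D^T)^{-1}."""
    l = p.size
    D = np.eye(l - 1, l) - np.eye(l - 1, l, k=1)   # first-difference matrix
    DDt_inv = np.linalg.inv(D @ D.T)
    AtA = DDt_inv                                  # A^T A = (D D^T)^{-1}
    Aty = DDt_inv @ (D @ p)                        # A^T y = (D D^T)^{-1} D p
    return AtA, Aty, D, DDt_inv

def recover_q(p, x, D, DDt_inv):
    """Eq. (10): q = p + D^T (D D^T)^{-1} (x - D p)."""
    return p + D.T @ (DDt_inv @ (x - D @ p))
```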

3. Convergence analysis

To assess the reliability of the proposed dynamic system, we first discuss its stability and convergence, and then further investigate the properties of the presented RNN. The system is proved to be globally convergent and stable in the Lyapunov sense.

Definition 3.1. A continuous-time neural network is said to be globally convergent if the trajectory of the corresponding dynamic system converges to an equilibrium point for any initial point z(t_0). In other words, the equilibrium z_e is convergent if

\exists \delta > 0 \ \text{s.t.} \ \|z(t_0) - z_e\| < \delta \;\Rightarrow\; \lim_{t \to \infty} z(t) = z_e.

Lemma 3.2. The function \Phi(\cdot), defined in (5), is a Lipschitz continuous function. Therefore, there exists a positive constant L such that

\|\Phi(x) - \Phi(y)\| \le L \|x - y\|, \quad \forall x, y \in R^{2l}. \qquad (11)

Proof. For any arbitrary x, y ∈ R^{2l}, using the identity \min\{a, b\} = \tfrac{1}{2}(a + b - |a - b|), we have

\|\Phi(x) - \Phi(y)\| = \|\min\{x, \nabla F(x)\} - \min\{y, \nabla F(y)\}\|
= \Big\| \frac{x + \nabla F(x) - |x - \nabla F(x)|}{2} - \frac{y + \nabla F(y) - |y - \nabla F(y)|}{2} \Big\|
= \tfrac{1}{2} \big\| (x - y) + (\nabla F(x) - \nabla F(y)) - (|x - \nabla F(x)| - |y - \nabla F(y)|) \big\|
\le \tfrac{1}{2} \big\{ \|x - y\| + \|\nabla F(x) - \nabla F(y)\| + \big\| |x - \nabla F(x)| - |y - \nabla F(y)| \big\| \big\}
\le \tfrac{1}{2} \big\{ \|x - y\| + \|\nabla F(x) - \nabla F(y)\| + \|x - \nabla F(x) - y + \nabla F(y)\| \big\}
\le \|x - y\| + \|\nabla F(x) - \nabla F(y)\|
= \|x - y\| + \|(Bx + c) - (By + c)\|
\le (1 + \|B\|) \|x - y\|.

Now, let L = 1 + \|B\| and the proof is complete. □

The upcoming discussion elaborates on the convergence and stability of the system (6).

Theorem 3.3. For any initial point z_0, there exists a unique continuous solution z(t) of (6) over a finite time interval. Moreover, the equilibrium point of (6) is the solution of the minimization (2).

Proof. According to Lemma 3.2, the function \Phi(z) is Lipschitz continuous, and so is the right-hand side of the system (6). Thus, by the existence and uniqueness theory for ODEs [20], there exists a unique continuous solution z(t) of (6) defined on t_0 \le t \le T_f. The interval [t_0, T_f) is the so-called maximal interval of existence.

Furthermore, we show that T_f = \infty if the feasible set \Omega = \{z \in R^{2l} \mid z \ge 0\} is bounded. To do so, let \Omega be bounded and z_0 \in \Omega, and let |z - \nabla F(z)| represent the vector (|z_1 - \nabla F(z)_1|, \ldots, |z_{2l} - \nabla F(z)_{2l}|). We have

\|\Phi(z)\| = \|\min\{z, \nabla F(z)\}\| = \Big\| \frac{z + \nabla F(z) - |z - \nabla F(z)|}{2} \Big\|
\le \tfrac{1}{2} \big( \|z + \nabla F(z)\| + \|z - \nabla F(z)\| \big)
\le \tfrac{1}{2} \big( \|z\| + \|\nabla F(z)\| + \|z\| + \|\nabla F(z)\| \big)
= \|z\| + \|\nabla F(z)\|.

On the other hand, since \Omega is bounded, there exists a vector K such that \|\nabla F(z)\| \le \|K\| for any z \in \Omega [21]. It follows that

\|z(t)\| \le \|z_0\| + \alpha \int_{t_0}^{t} \|\Phi(z(s))\| \, ds
\le \|z_0\| + \alpha \int_{t_0}^{t} \big( \|z(s)\| + \|\nabla F(z(s))\| \big) \, ds
\le \|z_0\| + \alpha \|K\| (t - t_0) + \alpha \int_{t_0}^{t} \|z(s)\| \, ds.

Furthermore, by the Gronwall inequality [22],

\|z(t)\| \le \big( \|z_0\| + \alpha \|K\| (t - t_0) \big) \exp\big( \alpha (t - t_0) \big).

Thus, the solution z(t) is bounded on [t_0, T_f), which implies T_f = \infty and completes the proof of the first part.

Now, if z^* is the equilibrium point of the system (6), then \Phi(z^*) = 0, and according to Theorem 2.1 this equilibrium point is the optimal solution of problem (2). □

Theorem 3.4. The proposed neural network (6) with the initial point z_0 ∈ R^{2l} is stable in the sense of Lyapunov and globally converges to the solution of (2). Moreover, the convergence rate of the neural network (6) increases as α increases.

Proof. According to Theorem 3.3, there exists a unique solution z(t) of the system (6) within the interval [t_0, T_f). Let z^* ∈ \Omega be the optimal solution and consider the following Lyapunov function:

E(z) = F(z) - F(z^*).

It is readily seen that E(z) ≥ 0 because z^* is the optimal solution of the minimization (2). Further, z^* is the optimal solution of problem (2) if and only if \Phi(z^*) = 0 (according to Theorem 2.1), and since the solution of \Phi(z) = 0 is unique (by Theorem 3.3), so is the solution of the problem (2). Thus, E(z) = 0 if and only if z = z^*. Moreover, we have

\frac{dE(z)}{dt} = \Big( \frac{dE(z)}{dz} \Big)^T \frac{dz}{dt} = -\alpha \nabla F(z)^T \Phi(z)
= -\alpha \nabla F(z)^T \Big( \frac{z + \nabla F(z) - |z - \nabla F(z)|}{2} \Big)
= -\frac{\alpha}{2} \Big( \nabla F(z)^T z + \|\nabla F(z)\|^2 - \nabla F(z)^T |z - \nabla F(z)| \Big)
\le 0, \qquad (12)

where |z| = z since z ≥ 0; indeed, each component \nabla F(z)_i \min\{z_i, \nabla F(z)_i\} is non-negative, because for \nabla F(z)_i ≥ 0 both factors are non-negative, while for \nabla F(z)_i < 0 the minimum equals \nabla F(z)_i and the product is \nabla F(z)_i^2. Hence, the system (6) is stable in the sense of Lyapunov. We further investigate the global convergence of the proposed system and show that dz/dt = 0 if and only if dE/dt = 0. To do so, let dz/dt = 0, which implies \Phi(z) = 0; then clearly

\frac{dE}{dt} = -\alpha \nabla F(z)^T \Phi(z) = 0.

Conversely, if dE/dt = 0, then

\nabla F(z)^T \Phi(z) = 0.

In this equation, \Phi(z) = 0 results in dz/dt = 0 and the proof is complete. But if \Phi(z) \ne 0 and \nabla F(z) = 0, we get (since z ≥ 0)

\frac{dz}{dt} = -\alpha \Phi(z) = -\alpha \min\{z, \nabla F(z)\} = -\alpha \min\{z, 0\} = 0.

Therefore, the presented system (6) is stable in the sense of Lyapunov and globally converges to the optimal solution of (2). Moreover, the inequality in (12) implies that as α increases, the convergence rate also increases. □

Fig. 2. Convergence of the proposed neural network (6) with α = 10 and different initializations: (a) with the initialization z = 1; (b) with the initialization z = 0; (c) with a random initialization. The x-axis is the iteration and the y-axis is the value of the elements of the desired variable x in the lasso problem.

4. Experiment results

This section presents the experimental results regarding the proposed neural network. First, the convergence of the neural network was empirically investigated, and its dependency on the parameter α was verified. Then, the proposed neural network was applied to three different applications. The first was to recover a sparse signal from noisy observations. The other two were an image restoration and an aCGH data recovery, in which the total variation-regularized minimization is utilized. The proposed neural network is implemented in MATLAB using the ordinary differential equation (ODE) solvers.
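The paper's implementation relies on MATLAB ODE solvers; as a rough analogue (our sketch, not the authors' code), the dynamics (6) can be handed to a general-purpose integrator such as SciPy's solve_ivp, with the final time and solver left as illustrative choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative pipeline: integrate dz/dt = -alpha * Phi(z) for the lasso data
# (A, y, lam) and return x = u - v from the final state.

def simulate_rnn(A, y, lam, alpha=10.0, t_final=5.0):
    l = A.shape[1]
    AtA, Aty = A.T @ A, A.T @ y
    c = lam * np.ones(2 * l) + np.concatenate([-Aty, Aty])

    def rhs(t, z):
        u, v = z[:l], z[l:]
        w = AtA @ (u - v)                       # efficient Bz (Section 2.3)
        grad = np.concatenate([w, -w]) + c      # grad F(z) = Bz + c
        return -alpha * np.minimum(z, grad)     # dz/dt = -alpha * Phi(z)

    sol = solve_ivp(rhs, (0.0, t_final), np.zeros(2 * l), method="RK45")
    u, v = sol.y[:l, -1], sol.y[l:, -1]
    return u - v                                # lasso solution x = u - v
```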

4.1. Empirical convergence analysis

The convergence of the proposed neural network has been theoretically investigated. We now present an empirical exploration of the convergence of the proposed neural network (6) as a complement to the theoretical studies. To do so, the WINE benchmark problem, which consists of 178 data points with four attributes, was selected. To check the convergence, y was set to one of the data points randomly selected from the dataset, and A was formed from the remaining data. Thus, the minimization of the problem (1) obtained a coefficient vector that enabled us to write the randomly selected sample as a linear combination of the other data points. This is known as the self-expressiveness property, which is utilized in recent works [5,23]. The convergence is scrutinized with various initializations in order to check the sensitivity of the neural network to the initialization. With α = 10, Fig. 2 plots the convergence of the neural network trajectory with the initial point z = [1, ..., 1] ∈ R^356, z = [0, ..., 0] ∈ R^356, and a random initialization, respectively. The x-axis in this figure is the iteration and the y-axis is the value of each element of the vector x. In this figure, it is clear that most of the coefficients converge to zero, which is the effect of the l1-regularization. Further, the non-zero coefficients converge to the same values (one around 0.22 and another around 0.61). This indicates that the neural network is globally convergent to the optimal solution, and its convergence does not rely on the initialization.

Fig. 3. The transient behavior of the energy error of the neural network (6) for three different values of α on the WINE benchmark. The solid, dashed and dotted lines correspond to α = 10, 15 and 20, respectively.

Furthermore, we explored the convergence rate behavior of the neural network (6). To do so, we repeated the previous experiment over the WINE benchmark with α set to 10, 15 and 20 in the dynamic system (6). The energy error of the proposed neural network can be defined as

ER(z) = \|\Phi(z)\|_2. \qquad (13)

According to the dynamic system (6), ER(z*) = 0 if and only if z* is an optimal solution. Fig. 3 shows the transient behavior of the error. It is readily observable that a bigger value of α accelerates the convergence of the proposed neural network on the same problem. Thus, one can accelerate the convergence simply by increasing the parameter α.
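A minimal sketch of the monitoring quantity (13) (our illustration, reusing the efficient Bz computation of Section 2.3):

```python
import numpy as np

# Energy error (13): ER(z) = ||Phi(z)||_2 vanishes exactly at an optimum of (2).

def energy_error(z, AtA, c):
    l = AtA.shape[0]
    u, v = z[:l], z[l:]
    w = AtA @ (u - v)
    grad = np.concatenate([w, -w]) + c
    return np.linalg.norm(np.minimum(z, grad))   # ||min{z, grad F(z)}||_2
```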

4.2. Signal reconstruction

In this section, we consider a sparse signal recovery problem with a signal x ∈ R^4096. In this example (shown at the top of Fig. 4), there are 160 spikes with ±1 amplitude. The matrix A ∈ R^{1024×4096} is filled with independent samples of the standard normal distribution and its rows are orthonormalized. The observation y is generated according to

y = Ax + n \qquad (14)

where n is noise drawn from the normal distribution N(0, 0.01) on R^1024. The parameter λ is chosen below ‖A^T y‖_∞, since for λ > ‖A^T y‖_∞ the unique minimum of (1) is the zero vector [24].

Fig. 4. Sparse signal reconstruction. Top: the original signal. Middle: the minimum energy reconstruction. Bottom: the reconstructed signal using the neural network (6).

Fig. 4 shows the reconstruction results. The original signal is presented at the top of the plot. The middle plot shows the signal x = A^T y, which is known as the minimum energy reconstruction. The bottom plot delineates the reconstructed signal by the proposed neural network (6). As can be readily grasped from this figure, the proposed neural network can faithfully recover the corrupted signal even though the number of measurements is small compared to the number of elements of the signal.
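For reference, a sketch of this setup (our reading of the text; the exact λ used in the paper is not stated after the page break, so the fraction 0.1 below is purely illustrative, as is the interpretation of 0.01 as the noise variance):

```python
import numpy as np

# Sketch of the Section 4.2 experiment: 160 +/-1 spikes in R^4096, a 1024x4096
# Gaussian matrix with orthonormalized rows, and a noisy observation (14).

rng = np.random.default_rng(0)
n_obs, l = 1024, 4096

x_true = np.zeros(l)
support = rng.choice(l, size=160, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=160)

A = rng.standard_normal((n_obs, l))
A = np.linalg.qr(A.T)[0].T                     # orthonormalize the rows of A

# n ~ N(0, 0.01); 0.01 is taken as the variance here (an assumption).
y = A @ x_true + np.sqrt(0.01) * rng.standard_normal(n_obs)

# lambda must stay below ||A^T y||_inf, otherwise the lasso solution is zero [24].
lam = 0.1 * np.linalg.norm(A.T @ y, np.inf)    # illustrative value, not the paper's
```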

4.3. aCGH data recovery

Array comparative genomic hybridization (CGH array or aCGH) is a technique to discover aberrations in the DNA copy number [25,26]. The greatest challenge in finding the aberrations is that aCGH data are highly corrupted by various noises, so that the boundaries of the normal and aberrant genomes cannot be readily detected. As a result, it is of the utmost importance to remove the noise from the raw aCGH data prior to the aberration detection.

The most popular way of denoising aCGH data is to solve a problem regularized by the total variation norm. These methodologies process either all the aCGH samples in a dataset simultaneously [27–30] or each sample separately [31,32].

We applied the proposed neural network for noise removal from the aCGH data and compared it with state-of-the-art algorithms such as total variation and spectral regularization (TVSp) [33], piece-wise and low rank approximation (PLA) [34], low rank recovery based on the half-quadratic minimization (LRHQ) [30], and group fused lasso segmentation (GFLseg) [28]. TVSp takes advantage of the nuclear norm regularization along with the total variation norm. By the same token, PLA and LRHQ have similar formulations, with more sparsity constraints in the former method and a more robust information-theoretic loss function in the latter. GFLseg is yet another technique that utilizes the weighted l1-l2 norm with the integral total-variation regularization. All of these methods have more parameters to be tuned (at least two), and are of higher complexity due to the various regularizations they employ. In the following, we show that the proposed neural network is competitive with the state of the art despite its simplicity and lower number of parameters.

The performance comparison was twofold. First, the comparison was conducted based on receiver operating characteristic (ROC) curves across simulated datasets contaminated by different types of noise. Second, two real-world aCGH datasets were used to carry out the recovery.

4.3.1. Experiment on simulated data

In this subsection, the methods mentioned above are compared across synthesized datasets. In the experiment, 50 samples with a length of 500 were generated according to the methodology presented in [33]. The simulated data were corrupted by Gaussian noise with different signal-to-noise ratios (SNRs). For the first comparison, we plot the ROC diagram for the methods. The ROC is a curve plotting the true positive rate (TPR) against the false positive rate (FPR) for different thresholds. Given a threshold T, the true and false positive rates are defined as

TPR(T) = \frac{|P_T|}{|A|}, \qquad FPR(T) = \frac{|FP_T|}{|N|}

where A and N are respectively the real aberrations and the normal genomes, P_T and FP_T are respectively the truly and falsely discovered aberrations, and |·| is the cardinality operator. These elements can be easily obtained as the study was on simulated data. In the ROC curve, more deviation from the diagonal indicates the superiority of a method. Fig. 5 plots the ROC diagram for different SNRs. The proposed neural network consistently outperforms PLA and GFLseg in all scenarios, as it deviates more from the diagonal. However, TVSp and LRHQ are slightly better than the proposed neural network. For SNR = 0.5, the superiority of TVSp and LRHQ is more evident, while the proposed neural network is competitive for the other SNRs. The reason for such a difference is the complexity of TVSp and LRHQ: both utilize the nuclear norm (besides the total variation) in their problem to induce low rank in the recovered profiles. Such a regularization increases the complexity and requires a costly singular value decomposition in each iteration. Despite its simplicity, the recurrent neural network has a reasonable performance in removing the noise from aCGH data.
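A small sketch of these quantities (our illustration; truth and estimate are hypothetical arrays marking the real aberrations and a recovered profile, respectively):

```python
import numpy as np

# TPR(T) = |P_T| / |A| and FPR(T) = |FP_T| / |N| for a given threshold T.

def tpr_fpr(truth, estimate, T):
    detected = np.abs(estimate) >= T          # probes called aberrant at threshold T
    aberrant = truth != 0                     # set A: real aberrations
    normal = ~aberrant                        # set N: normal genome
    tpr = np.count_nonzero(detected & aberrant) / np.count_nonzero(aberrant)
    fpr = np.count_nonzero(detected & normal) / np.count_nonzero(normal)
    return tpr, fpr
```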

4.3.2. Experiment on real datasets

The performance of the proposed neural network was then investigated across real datasets. To do so, two datasets were employed: the Pollack et al. dataset [35], which includes 44 breast tumors of 6691 human mapped genes, and the Chin et al. dataset [36], which consists of 2149 clones from 141 primary breast tumors.

These datasets were subjected to the proposed neural network to obtain the recovered profiles. Fig. 6 plots the heat and bar diagrams for the retrieved profiles of the datasets mentioned above. The heat maps are plotted at the top, and the bar diagram, which is the sum of the number of aberrations across all samples given a threshold, is at the bottom. As the color bar suggests, the yellowish segments in the heat map indicate duplication and the bluish segments indicate loss in the aCGH data. The greenish parts, which are indeed prevalent in the heat map, are where there is no aberration. The results from the bar diagrams indicate that probes 178–184 from the Pollack et al. dataset and probes 38–39 from the Chin et al. dataset are amplification regions. Regarding their locations on the chromosome, the discovered areas from both datasets are in accordance with each other and are also in line with other studies on breast cancer [35,36].

Fig. 5. The performance comparison of the proposed recurrent neural network (RNN), TVSp [33], PLA [34], LRHQ [30] and GFLseg [28] via the ROC curve: ROC curves of the different methods on the simulated data corrupted by Gaussian noise with different SNRs: (a) SNR = 0.5; (b) SNR = 1.0; (c) SNR = 1.5; (d) SNR = 2.0. The x-axis and y-axis of each figure are the false positive rate and the true positive rate, respectively.

Fig. 6. The profiles retrieved by the proposed neural network: (a) the recovered profiles of the Pollack et al. dataset [35]; (b) the recovered profiles of the Chin et al. dataset [36]. The yellowish color in the heat map (the top figure) indicates duplication and the bluish shows loss in the chromosome. The greenish areas are the normal regions. The bottom is the bar diagram, which plots the sum of the number of aberrations with the threshold 1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

To show the efficient data recovery by the neural network, several recovered profiles from the proposed neural network, TVSp [33] and PLA [34] are presented in Fig. 7. Each row in this figure is dedicated to one sample, and each column corresponds to a recovery method. Further, the red dots are the real data, and the blue lines indicate the data recovered by each method. From the smoothness perspective, the proposed neural network consistently outperforms PLA and TVSp, since the data it recovers are much smoother than those recovered by PLA and TVSp.

4.3.3. Time complexity

The proposed neural network was empirically evaluated in terms of the execution time. To this end, 50 aCGH samples with different numbers of probes were generated and corrupted with random Gaussian noise. The resulting corrupted data were then subjected to the different methods for recovery, and the time needed to do so is the measure on which the various algorithms are contrasted. The numbers of probes for this experiment were 50, 500, 1000, and 10,000. The experiments were performed on a PC with a 3.2 GHz Core-i5 CPU and 4 GB of RAM.

Fig. 8 plots the time in seconds that each method needed to complete the recovery task with different numbers of probes. The proposed neural network significantly outperforms LRHQ, and is quite competitive with TVSp. PLA and GFLseg are much faster than the others, mainly due to the fact that they have implemented a part of their algorithms in C/C++, which is inherently swift.


Fig. 7. Five selected samples from the Pollack et al. dataset recovered by various methods. Each row in this figure corresponds to a sample and each column tallies with a recovery method. The three methods are the proposed neural network, PLA [34] and TVSp [33] . The red dots are the real data from the datasets, and the blue lines are the data retrieved by each method. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. The time required for each method to complete the recovery task over a dataset with 50 samples and different numbers of probes. The x-axis is the number of probes, and the y-axis is the time in seconds for each method to complete the task.

4.4. Image restoration

The final experiment was to recover the original image from noisy observations. To do so, three images were selected and contaminated by Gaussian noise with σ = 0.05. The first and second columns of Fig. 9 correspond to the original and noisy images under study, respectively. The total variation-regularized problem (8) was utilized to recover the original images from the contaminated observations. The recovery was carried out by the proposed neural network and the primal-dual splitting method (PDSM) [37]. The images recovered by PDSM and the proposed neural network are presented in the third and fourth columns, respectively. This figure clearly shows that the proposed neural network has faithfully recovered the images. We further tabulate the mean square error of the two methods for each image in Table 1. The table also confirms that the proposed neural network retrieves the original images with high confidence and is competitive with PDSM.

Table 1. The mean square errors of the proposed neural network and the primal-dual splitting method (PDSM) [37] across three images.

Image        RNN           PDSM
MRI          3.08 × 10^-3  5.75 × 10^-5
Lena         6.49 × 10^-5  9.13 × 10^-5
Cameraman    7.47 × 10^-5  9.54 × 10^-5


Fig. 9. Image recovery by the proposed neural network and PDSM [37] . The columns from left to right correspond to the original image, noisy image, the image recovered by PDSM, and the image retrieved by the neural network, respectively.

5. Conclusion

This paper presented a one-layer recurrent neural network to find the optimal solution of the l1-regularized least square problem. The proposed neural network is guaranteed to globally converge to the solution of this problem, while its convergence is reliant not upon the size of the datasets but upon a constant parameter. The experiments further investigated the convergence of the neural network and its dependence on the constant parameter. The proposed recurrent neural network was applied to several problems including sparse signal recovery, image restoration, and aCGH data recovery. These applications showed the reasonable performance of the proposed neural network in comparison with other state-of-the-art methods.

References

[1] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Royal Stat. Soc. Ser. B (Methodol.) 58 (1) (1996) 267–288.
[2] S.J. Wright, R.D. Nowak, M.A. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process. 57 (7) (2009) 2479–2493.
[3] C.M. Bishop, et al., Pattern Recognition and Machine Learning, 1, Springer, New York, 2006.
[4] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, IEEE, 2009, pp. 2790–2797.
[5] E. Elhamifar, R. Vidal, Sparse subspace clustering: algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2765–2781.
[6] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Proceedings of Advances in Neural Information Processing Systems, 2006, pp. 801–808.
[7] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 689–696.
[8] L. Jin, S. Li, X. Luo, Y. Li, B. Qin, Neural dynamics for cooperative control of redundant robot manipulators, IEEE Trans. Neural Netw. Learn. Syst. (2018).
[9] M.A. Figueiredo, R.D. Nowak, S.J. Wright, Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 586–597.
[10] J. Kim, H. Park, Fast active-set-type algorithms for l1-regularized linear regression, in: Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, pp. 397–404.
[11] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An interior-point method for large-scale l1-regularized least squares, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 606–617.
[12] Y. Xiao, Q. Wang, Q. Hu, Non-smooth equations based method for l1-norm problems with applications to compressed sensing, Nonlinear Anal.: Theory Methods Appl. 74 (11) (2011) 3570–3577.
[13] P.G.C. Zhang, A Fast Dual Projected Newton Method for L1-Regularized Least Squares, Tsinghua University, Beijing, 2011.
[14] P. Tseng, S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Math. Program. 117 (1–2) (2009) 387–423.
[15] I. Loris, M. Bertero, C. De Mol, R. Zanella, L. Zanni, Accelerating gradient projection methods for l1-constrained signal recovery by steplength selection rules, Appl. Comput. Harmon. Anal. 27 (2) (2009) 247–254.
[16] M.S. Bazaraa, H.D. Sherali, C.M. Shetty, Nonlinear Programming: Theory and Algorithms, John Wiley & Sons, 2013.
[17] O.L. Mangasarian, Equivalence of the complementarity problem to a system of nonlinear equations, SIAM J. Appl. Math. 31 (1) (1976) 89–92.
[18] D.P. Bertsekas, J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 23, Prentice Hall, Englewood Cliffs, NJ, 1989.
[19] C. Levy-Leduc, Z. Harchaoui, Catching change-points with lasso, in: Proceedings of Advances in Neural Information Processing Systems, 2008, pp. 617–624.
[20] J.K. Hale, Functional Differential Equations, Springer, 1971.
[21] S. Boyd, A. Mutapcic, Subgradient Methods, in: Notes for EE364b, Stanford University, Winter 2006–07.
[22] R. Bellman, et al., The stability of solutions of linear differential equations, Duke Math. J. 10 (4) (1943) 643–647.
[23] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 663–670.
[24] J.-J. Fuchs, On sparse representations in arbitrary redundant bases, IEEE Trans. Inf. Theory 50 (6) (2004) 1341–1344.
[25] D. Pinkel, D.G. Albertson, Array comparative genomic hybridization and its applications in cancer, Nat. Genet. 37 (2005) S11–S17.
[26] L. Feuk, A.R. Carson, S.W. Scherer, Structural variation in the human genome, Nat. Rev. Genet. 7 (2) (2006) 85–97.
[27] C.M. Alaíz, Á. Barbero, J.R. Dorronsoro, Group fused lasso, in: International Conference on Artificial Neural Networks, Springer, 2013, pp. 66–73.
[28] K. Bleakley, J.-P. Vert, The group fused lasso for multiple change-point detection, arXiv preprint arXiv:1106.4199 (2011).
[29] H.S. Noghabi, M. Mohammadi, Y.-H. Tan, Robust group fused lasso for multisample copy number variation detection under uncertainty, IET Syst. Biol. 10 (6) (2016) 229–236.
[30] M. Mohammadi, G.A. Hodtani, M. Yassi, A robust correntropy-based method for analyzing multisample aCGH data, Genomics 106 (5) (2015) 257–264.
[31] A. Mitra, G. Liu, J. Song, A genome-wide analysis of array-based comparative genomic hybridization (CGH) data to detect intra-species variations and evolutionary relationships, PLoS ONE 4 (11) (2009) e7978.
[32] J. Hu, J.-B. Gao, Y. Cao, E. Bottinger, W. Zhang, Exploiting noise in array CGH data to improve detection of DNA copy number change, Nucl. Acids Res. 35 (5) (2007) e35.
[33] X. Zhou, C. Yang, X. Wan, H. Zhao, W. Yu, Multisample aCGH data analysis via total variation and spectral regularization, IEEE/ACM Trans. Comput. Biol. Bioinform. 10 (1) (2013) 230–235.
[34] X. Zhou, J. Liu, X. Wan, W. Yu, Piecewise-constant and low-rank approximation for identification of recurrent copy number variations, Bioinformatics 30 (14) (2014) btu131.
[35] J.R. Pollack, T. Sørlie, C.M. Perou, C.A. Rees, S.S. Jeffrey, P.E. Lonning, R. Tibshirani, D. Botstein, A.-L. Børresen-Dale, P.O. Brown, Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors, Proc. Natl. Acad. Sci. 99 (20) (2002) 12963–12968.
[36] K. Chin, S. DeVries, J. Fridlyand, P.T. Spellman, R. Roydasgupta, W.-L. Kuo, A. Lapuk, R.M. Neve, Z. Qian, T. Ryder, et al., Genomic and transcriptional aberrations linked to breast cancer pathophysiologies, Cancer Cell 10 (6) (2006) 529–541.
[37] L. Condat, A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms, J. Optim. Theory Appl. 158 (2) (2013) 460–479.

Majid Mohammadi is a Ph.D. candidate at the Information and Communication Technology group of the Department of Technology, Policy and Management of the Delft University of Technology. He obtained his B.Sc. and M.Sc. in Software Engineering and Artificial Intelligence, respectively. His main research interests are semantic interoperability, machine learning and pattern recognition.

Yao-Hua Tan is professor of Information and Communication Technology at the ICT Group of the Department of Technology, Policy and Management of the Delft University of Technology and part-time professor of Electronic Business at the Department of Economics and Business Administration of the Vrije Universiteit Amsterdam. His research interests are service engineering and governance; ICT-enabled electronic negotiation and contracting; and multi-agent modelling to develop automation of business procedures in international trade.

Wout Hofman is a senior research scientist at TNO, the Dutch organization for applied science, on the subject of interoperability, with a specialization in government (e.g. customs) and business interoperability in logistics. He is responsible for coordinating semantic developments within the iCargo project. Wout is also, as a member of the Scientific Board of the EU FP7 SEC Cassandra project, responsible for IT developments in that latter project.

S. Hamid Mousavi was born in Mashhad, Iran on February 3, 1988. He received the B.Sc. degree in pure mathematics from Ferdowsi University of Mashhad (FUM) in 2011. He started his M.Sc. in applied mathematics at FUM and worked on control and optimization problems. After graduating in 2015, he joined the machine learning group at the University of Oldenburg, Germany, where he is currently working toward a doctorate degree. His major fields of interest are currently optimization and probabilistic algorithms.
