Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory

Sherson, Thomas William; Heusdens, Richard; Kleijn, W. Bastiaan

DOI: 10.1109/TSIPN.2018.2876754
Publication date: 2019
Document Version: Accepted author manuscript
Published in: IEEE Transactions on Signal and Information Processing over Networks

Citation (APA): Sherson, T. W., Heusdens, R., & Kleijn, W. B. (2019). Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory. IEEE Transactions on Signal and Information Processing over Networks, 5(2), 334-347. [8496887]. https://doi.org/10.1109/TSIPN.2018.2876754

Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory

Thomas Sherson, Richard Heusdens, and W. Bastiaan Kleijn

Abstract—In this paper we present a novel derivation of an existing algorithm for distributed optimization termed the primal-dual method of multipliers (PDMM). In contrast to its initial derivation, monotone operator theory is used to connect PDMM with other first-order methods such as Douglas-Rachford splitting and the alternating direction method of multipliers, thus providing insight into its operation. In particular, we show how PDMM combines a lifted dual form in conjunction with Peaceman-Rachford splitting to facilitate distributed optimization in undirected networks. We additionally demonstrate sufficient conditions for primal convergence for strongly convex differentiable functions and strengthen this result for strongly convex functions with Lipschitz continuous gradients by introducing a primal geometric convergence bound.

Index Terms—Primal-dual method of multipliers (PDMM), distributed optimization, monotone operator.

I. INTRODUCTION

The world around us is evolving through the use of large scale networking. From the way we communicate via social media [1], to the revolution of utilities and services via the paradigm of the "Internet of Things" [2], networking is reshaping the way we operate as a society. Echoing this trend, the last three decades have seen a significant rise in the deployment of large scale sensor networks for a wide range of applications [3]–[5]. Such applications include environmental monitoring [6], [7], power grid management [8]–[10], as well as being used as part of home health care systems [11], [12].

Where centralized network topologies were once the port of call for handling the data processing of sensor networks, increasingly the on-node computational capabilities of such systems are being exploited to parallelize or even fully distribute data processing and computation. In contrast to their centralized counterparts, such distributed networks have a number of distinct advantages including robustness to node failure, scalability with network size and localized transmission requirements. Unfortunately, these distributed networks are also often characterized by limited connectivity. This limited accessibility between nodes implicitly restricts data availability, making classical signal processing operations impractical or infeasible to perform. Therefore, the desire to decentralize computation requires the design of novel signal processing approaches specifically tailored to the task of in-network computation.

Thomas Sherson is with the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands. Email: t.sherson@tudelft.nl

Richard Heusdens is with the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands. Email: r.heusdens@tudelft.nl

W. Bastiaan Kleijn is with the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands, and with the School of Engineering and Computer Science, Victoria University of Wellington, New Zealand. Email: w.b.kleijn@tudelft.nl

Within the literature, a number of methods for performing distributed signal processing have been proposed including distributed consensus [13]–[15], belief propagation/message passing approaches [16]–[18], graph signal processing over networks [19]–[21] and more. An additional method of particular interest to this work is to approach the task of signal processing via its inherent connection with convex optimization. In particular, over the last two decades, it has been shown that many classical signal processing problems can be recast in an equivalent convex form [22]. By defining methods to perform distributed optimization we can therefore facilitate distributed signal processing in turn.

Recently, a new algorithm for distributed optimization called the primal-dual method of multipliers (PDMM) was proposed [23]. In [23], it was shown that PDMM exhibited guaranteed average convergence, which in some examples was faster than competing methods such as the alternating direction method of multipliers (ADMM) [24]. However, there are a number of open questions surrounding the approach. In particular, prior to this work, it was unclear how PDMM was connected with similar methods within the literature.

To clarify the link between PDMM and existing works, we present a novel viewpoint of the algorithm through the lens of monotone operator theory. By demonstrating how PDMM can be derived from this perspective, we link its operation with classic operator splitting algorithms. The major strength of this observation is the fact that we can leverage results from monotone operator theory to better understand the operation of PDMM. In particular we use this insight to demonstrate new and stronger convergence results for different classes of problems than those that currently exist within the literature.

A. Related Work

The work in this paper builds upon the extensive history within the field of convex optimization in the areas of parallel and decentralized processing. In the 1970s, Rockafellar's work in network optimization [25] and the relation between convex optimization and monotone operator theory [26]–[28] helped establish a foundation for the field. Importantly, Rockafellar showed how linearly constrained separable convex programs can be solved in parallel via Lagrangian duality.

In the field of parallel and distributed computation, further development was undertaken by Bertsekas and Tsitsiklis [29]–[31] throughout the 1980s, where again separability was used as a mechanism to design a range of new algorithms. Similarly,


Eckstein [32], [33] adopted an approach more reflective of Rockafellar, utilizing monotone operator theory and operator splitting to develop new distributed algorithms.

In recent years, there has been a renewed surge of interest in networked signal processing [34]–[36] due to the continued expansion of networked systems. This period has also seen the development of novel distributed optimization approaches for both convex and potentially non-convex problems. In the convex case, the works of [37], [38], echoing advances in three term operator splitting such as Vu-Condat splitting [39], [40], provide general frameworks for distributed convex optimization. Including classical approaches, such as ADMM, as special cases, these algorithms leverage primal-dual schemes and functional separability to create distributed implementations.

The work in [41], [42] focuses on the more general problem of potentially non-convex optimization. In particular, by approximating both the objective and constraints at each iteration with specific strongly convex and smooth surrogates, the proposed methods have provable guarantees on convergence to local minima. Furthermore, in contrast to other methods, the proposed approach need not explicitly require functional separability, only the separability of the surrogates used. This allows for the optimization of problems typically outside of the scope of distributed algorithms.

B. Main Contribution

The main contributions of this paper are two-fold. Firstly we provide a novel derivation for PDMM from the perspective of monotone operator theory. In particular, we show how PDMM can be derived by combining a particular dual lifted problem with Peaceman-Rachford (PR) splitting. In contrast to its original derivation, this approach links PDMM with other classical first order methods from the literature including forward-backward splitting, Douglas-Rachford (DR) splitting and ADMM (see [43] for a recent overview).

The monotone operator perspective is also used to demonstrate a range of new convergence results for PDMM. We show how PDMM is guaranteed to converge to a primal optimal solution for strongly convex, differentiable objective functions. This result is strengthened for strongly convex functions with Lipschitz continuous gradients where a geometric convergence bound is demonstrated by linking the worst-case convergence of PDMM with that of a generalized alternating method of projections algorithm. Notably, while such results exist for PR splitting applied to dual domain optimization problems [44], they require an additional full row rank¹ assumption to ensure strong monotonicity which cannot be guaranteed in the case of PDMM. Furthermore, while a geometric convergence proof exists for distributed ADMM [45], currently there is no such result for PDMM. In this way the proposed work also strengthens the performance guarantees for PDMM, an important point for practical distributed optimization.

¹Row rank refers to the dimension of the span of the row space of a matrix. Row rank deficient matrices have more rows than their row rank. The notions of column rank and column rank deficiency are defined equivalently.

C. Organization of the Paper

The remainder of this paper is organized as follows. Sec. II introduces appropriate nomenclature to support the manuscript. Sec. III introduces a monotone operator derivation of PDMM based on a specific dual lifting approach. Sec. IV demonstrates the guaranteed primal convergence of PDMM for strongly convex and differentiable functions. This is strengthened in Sec. V where we demonstrate primal geometric convergence for strongly convex functions with Lipschitz continuous gradients. Finally, Sec. VI includes simulation results to reinforce and verify the underlying claims of the document and the final conclusions are drawn in Sec. VII.

II. NOMENCLATURE

In this work we denote by $\mathbb{R}$ the set of real numbers, by $\mathbb{R}^N$ the set of real column vectors of length $N$ and by $\mathbb{R}^{M \times N}$ the set of $M$ by $N$ real matrices. Let $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}^N$. A set valued operator $T : \mathcal{X} \to \mathcal{Y}$ is defined by its graph, $\mathrm{gra}(T) = \{(x, y) \in \mathcal{X} \times \mathcal{Y} \mid y \in T(x)\}$. Similarly, the notion of an inverse of an operator $T^{-1}$ is defined via its graph so that $\mathrm{gra}(T^{-1}) = \{(y, x) \in \mathcal{Y} \times \mathcal{X} \mid y \in T(x)\}$. $J_{T,\rho} = (I + \rho T)^{-1}$ denotes the resolvent of an operator while $R_{T,\rho} = 2J_{T,\rho} - I$ denotes the reflected resolvent (Cayley operator). The fixed-point set of $T$ is denoted by $\mathrm{fix}(T) = \{x \in \mathcal{X} \mid T(x) = x\}$. If $T$ is a linear operator then $\mathrm{ran}(T)$ and $\ker(T)$ denote its range and kernel respectively.
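When $T$ is an affine operator, the resolvent and reflected resolvent can be formed explicitly. As a minimal sketch (our own illustration, assuming $T(x) = Ax + b$ with $A$ positive semidefinite so that $T$ is maximal monotone), the following Python snippet evaluates $J_{T,\rho}$ and $R_{T,\rho}$ and numerically checks the nonexpansiveness of the reflected resolvent:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 4, 0.7
M = rng.standard_normal((n, n))
A = M @ M.T                        # positive semidefinite -> T is maximal monotone
b = rng.standard_normal(n)

def resolvent(z):                  # J_{T,rho}(z) = (I + rho*T)^{-1}(z)
    return np.linalg.solve(np.eye(n) + rho * A, z - rho * b)

def reflected_resolvent(z):        # R_{T,rho} = 2 J_{T,rho} - I  (Cayley operator)
    return 2 * resolvent(z) - z

x, y = rng.standard_normal(n), rng.standard_normal(n)
# nonexpansiveness: ||R(x) - R(y)|| <= ||x - y||
assert (np.linalg.norm(reflected_resolvent(x) - reflected_resolvent(y))
        <= np.linalg.norm(x - y) + 1e-12)
```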

III. A DERIVATION OF THE PRIMAL-DUAL METHOD OF MULTIPLIERS BASED ON MONOTONE OPERATOR THEORY

In this section we reintroduce a recently proposed algorithm for distributed optimization termed the Primal-Dual method of multipliers (PDMM) [23]. Unlike earlier efforts within the literature [23], [24], here we demonstrate how PDMM can be derived from the perspective of monotone operator theory. In particular we show how PDMM can be derived by applying PR splitting to a certain lifted dual problem. Additionally, we highlight a previously unknown connection between PDMM and a distributed ADMM variant.

A. Problem Statement: Node Based Distributed Optimization

Consider an undirected network consisting of $N$ nodes with which we want to perform convex optimization in a distributed manner. The associated graphical model of such a network is given by $G(V, E)$ where $V = \{1, \ldots, N\}$ denotes the set of nodes and $E$ denotes the set of undirected edges so that $(i, j) \in E$ if nodes $i$ and $j$ share a physical connection. Note that these are simple graphs as they do not contain self loops or repeated edges. We will assume that $G$ forms a single connected component and will denote by $N(i) = \{j \in V \mid (i, j) \in E\}$ the set of neighbors of node $i$, i.e. those nodes $j$ so that $i$ and $j$ can communicate directly. An example of such a network is given in Figure 1.

Fig. 1. The communication graph $G$ of a seven node network. Numbered circles denote nodes while the arrows denote the undirected edges. The neighborhood of node five is given by the set $N(5) = \{3, 6, 7\}$.

As previously mentioned, we are interested in using this network to perform distributed convex optimization. In this way, assume that each node $i$ is equipped with a function $f_i \in \Gamma_0\!\left(\mathbb{R}^{M_i}\right)$ parameterized by a local variable $x_i \in \mathbb{R}^{M_i}$, where $\Gamma_0$ denotes the family of closed, convex and proper (CCP) functions. Under this model, consider solving the following optimization problem in a distributed manner:

$$\begin{aligned}
\min_{x_i \,\forall i \in V} \;\; & \sum_{i \in V} f_i(x_i) \\
\text{s.t.} \;\; & A_{i|j} x_i + A_{j|i} x_j = b_{i,j} \quad \forall (i, j) \in E.
\end{aligned}
\tag{1}$$

The matrices $A_{i|j} \in \mathbb{R}^{M_{i,j} \times M_i}$ while the vectors $b_{i,j} \in \mathbb{R}^{M_{i,j}}$. The identifier $i|j$ denotes a directed edge while $i,j$ denotes an undirected edge. Furthermore, let $M_V = \sum_{i \in V} M_i$ and $M_E = \sum_{(i,j) \in E} M_{i,j}$. We will also assume that (1) is feasible. In such distributed convex optimization problems the terms $A_{i|j}$ and $b_{i,j}$ impose affine constraints between neighboring nodes.

The prototype problem in (1) includes, as a subset, the family of distributed consensus problems that minimize the sum of the local cost functions under network wide consensus constraints. The algorithm presented in this paper can therefore be used for this purpose.
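For example (our own illustration of this special case; consensus constraints of exactly this form are used in the numerical experiments of Sec. VI), the distributed consensus problem is obtained from (1) by choosing

$$A_{i|j} = I, \qquad A_{j|i} = -I, \qquad b_{i,j} = 0 \qquad \forall (i, j) \in E,$$

so that each edge constraint reads $x_i - x_j = 0$ and, since $G$ forms a single connected component, all local variables are forced to agree at optimality.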

B. Exploiting Separability Via Lagrangian Duality

Given the prototype problem in (1), the design of our distributed solver aims to address the coupling between the set of primal variables xi due to the linear constraints. Echoing

classic approaches in the literature, we can overcome this point via Lagrangian duality. In particular, the Lagrange dual problem of (1) is given by

$$\min_{\nu} \; \sum_{i \in V} \left( f_i^*\!\left( \sum_{j \in N(i)} A_{i|j}^T \nu_{i,j} \right) - \sum_{j \in N(i)} \left( \frac{b_{i,j}}{2} \right)^{\!T} \nu_{i,j} \right),
\tag{2}$$

where each $\nu_{i,j} \in \mathbb{R}^{M_{i,j}}$ denotes the dual vector variable

associated with the constraint at edge (i, j) and fi∗ is the Fenchel conjugate of fi. By inspection, the resulting problem

is still separable over the set of nodes but unfortunately each νi,j in (2) is utilized in two conjugate functions, fi∗ and fj∗,

resulting in a coupling between neighboring nodes.

To decouple the objective terms, we can lift the dimension of the dual problem by introducing copies of each νi,j at

nodes i and j. The pairs of additional directed edge variables are denoted by λi|j, λj|i ∀(i, j) ∈ E and are associated

with nodes i and j respectively. To ensure equivalence of the problems, these variables are constrained so that at optimality

λi|j = λj|i. The resulting problem is referred to as the

extended dual of Eq. (1) and is given by

$$\begin{aligned}
\min_{\lambda} \;\; & \sum_{i \in V} \left( f_i^*\!\left( \sum_{j \in N(i)} A_{i|j}^T \lambda_{i|j} \right) - \sum_{j \in N(i)} \left( \frac{b_{i,j}}{2} \right)^{\!T} \lambda_{i|j} \right) \\
\text{s.t.} \;\; & \lambda_{i|j} = \lambda_{j|i} \quad \forall i \in V,\; j \in N(i).
\end{aligned}
\tag{3}$$

The proposed lifting is appealing from the perspective of alternating minimization techniques as it partitions the resulting problem into two sections: a fully node separable objective function and a set of edge based constraints.

C. Simplification of Notation

To assist in the derivation of our algorithm, we firstly introduce a compact vector notation for Eq. (3). Specifically we will show that (3) can be rewritten as

$$\begin{aligned}
\min_{\lambda} \;\; & f^*(C^T\lambda) - d^T\lambda \\
\text{s.t.} \;\; & (I - P)\lambda = 0.
\end{aligned}
\tag{4}$$

1) Dual Vector Notation: Firstly we introduce the dual variable $\lambda$ as the stacked vector of the set of $\lambda_{i|j}$ where the

ordering of this stacking is given by $1|2 < 1|3 < \cdots < 1|N < 2|1 < 2|3 < \cdots < N|N-1$. In particular, $\lambda$ is given by

$$\lambda = \left[ \lambda_{1|2}^T, \cdots, \lambda_{1|N}^T, \lambda_{2|1}^T, \cdots, \lambda_{N|N-1}^T \right]^T \in \mathbb{R}^{M_E}.$$

2) Compact Objective Notation: Given the definition of the dual vector λ, we now move to simplifying the objective function. Firstly, we define the sum of local functions

$$f : \mathbb{R}^{M_V} \mapsto \mathbb{R}, \quad x \mapsto \sum_{i \in V} f_i(x_i),$$

where $\mathbb{R}^{M_V} = \mathbb{R}^{M_1} \times \mathbb{R}^{M_2} \times \cdots \times \mathbb{R}^{M_N}$.

We can then define a matrix $C \in \mathbb{R}^{M_E \times M_V}$ and vector $d \in \mathbb{R}^{M_E}$ to rewrite our objective using $\lambda$ and $f$. In particular,

$$C = \begin{bmatrix} C_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & C_N \end{bmatrix}, \qquad d = \left[ d_1^T, \cdots, d_N^T \right]^T,$$

where the components $C_i$ and $d_i$ are given by

$$C_i = \left[ A_{i|1}^T, \cdots, A_{i|i-1}^T, A_{i|i+1}^T, \cdots, A_{i|N}^T \right]^T \;\forall i \in V, \qquad d_i = \frac{1}{2}\left[ b_{i,1}^T, \cdots, b_{i,i-1}^T, b_{i,i+1}^T, \cdots, b_{i,N}^T \right]^T \;\forall i \in V.$$

The terms $A_{i|j}$ and $b_{i,j}$ are included in $C_i$ and $d_i$ respectively if and only if $(i, j) \in E$.

The objective of Eq. (3) can therefore be rewritten as $f^*(C^T\lambda) - d^T\lambda$.

3) Compact Constraints Notation: Similar to the objective, we can define an additional matrix to rewrite the constraint functions using our vector notation. For this task we introduce the symmetric permutation matrix $P \in \mathbb{R}^{M_E \times M_E}$ that permutes each pair of variables $\lambda_{i|j}$ and $\lambda_{j|i}$. This allows the constraints in (3) to be rewritten as $(I - P)\lambda = 0$. The vector $\lambda$ is therefore only feasible if it is contained in $\ker(I - P)$.
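As an illustration of this notation (a minimal sketch, not taken from the paper), the following Python snippet assembles $C$, $P$ and $d$ for consensus constraints $x_i - x_j = 0$ on a small graph with one scalar variable per node ($M_i = M_{i,j} = 1$):

```python
import numpy as np

# Undirected edge list of a small graph (0-indexed nodes), one scalar per node.
N = 4
E = [(0, 1), (1, 2), (1, 3), (2, 3)]

# Directed edges i|j ordered lexicographically, matching the stacking of lambda.
directed = sorted([(i, j) for (i, j) in E] + [(j, i) for (i, j) in E])
row = {e: k for k, e in enumerate(directed)}

M_E, M_V = len(directed), N
C, P, d = np.zeros((M_E, M_V)), np.zeros((M_E, M_E)), np.zeros(M_E)

for (i, j) in E:
    A_ij, A_ji, b_ij = 1.0, -1.0, 0.0             # consensus constraint x_i - x_j = 0
    C[row[(i, j)], i] = A_ij                       # C is block diagonal over the nodes
    C[row[(j, i)], j] = A_ji
    d[row[(i, j)]] = d[row[(j, i)]] = b_ij / 2     # d stacks b_{i,j}/2 per directed edge
    P[row[(i, j)], row[(j, i)]] = 1.0              # P swaps lambda_{i|j} and lambda_{j|i}
    P[row[(j, i)], row[(i, j)]] = 1.0

assert np.allclose(P @ P, np.eye(M_E))             # symmetric permutation, P = P^{-1}
```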


D. From the Extended Dual Problem to a Nonexpansive PDMM Operator

Given the node and edge separable nature of the extended dual, we now move to forming a distributed optimization solver which takes advantage of this structure. In particular we aim to construct an operator of the form

$$S = S_E \circ S_N,$$

where $S_N$ and $S_E$ are parallelizable over the nodes and edges respectively and $\circ$ is used to denote their composition so that $\forall (x, z) \in \mathrm{gra}(S_1 \circ S_2), \exists y \mid (x, y) \in \mathrm{gra}(S_1), (y, z) \in \mathrm{gra}(S_2)$. Furthermore, we would like such operators to be nonexpansive so that classic iterative solvers can be employed. The nonexpansiveness of an operator is defined as follows.

Definition III.1. Nonexpansive Operators: An operator $T : \mathcal{X} \to \mathcal{Y}$ is nonexpansive if

$$\|u - v\| \leq \|x - y\| \quad \forall (x, u), (y, v) \in \mathrm{gra}(T).$$

We can construct such an $S$ by making use of the relationship between monotone operators and the subdifferentials of convex functions. In particular, an operator is monotone if it satisfies the following definition.

Definition III.2. Monotone Operators: An operator T : X → Y is monotone iff

$$\langle u - v, x - y \rangle \geq 0 \quad \forall (x, u), (y, v) \in \mathrm{gra}(T).$$

Furthermore, $T$ is maximal monotone iff

$$\nexists \text{ a monotone } \tilde{T} : \mathcal{X} \to \mathcal{Y} \mid \mathrm{gra}(T) \subset \mathrm{gra}(\tilde{T}).$$

With these definitions in mind, consider the equivalent unconstrained form of (4) given by

$$\min_{\lambda} \; f^*(C^T\lambda) - d^T\lambda + \iota_{\ker(I-P)}(\lambda),
\tag{5}$$

where $\iota_{\ker(I-P)}$ is an indicator function defined as

$$\iota_{\ker(I-P)}(y) = \begin{cases} 0 & (I - P)y = 0 \\ +\infty & \text{otherwise.} \end{cases}$$

As $\ker(I - P)$ is a closed subspace, it follows from [46, Example 1.25] that $\iota_{\ker(I-P)} \in \Gamma_0$. Furthermore, as $f \in \Gamma_0$, using [46, Theorem 13.32, Prop. 13.11], it follows that $f^*(C^T\cdot) \in \Gamma_0$ as well. Due to our feasibility assumption of (1), the relative interiors of the domains of $f^*(C^T\cdot)$ and $\iota_{\ker(I-P)}$ share a common point. From [46, Theorem 16.3], it follows that $\lambda^*$ is a minimizer of (5) if and only if

$$0 \in C\partial f^*\!\left(C^T\lambda^*\right) - d + \partial\iota_{\ker(I-P)}(\lambda^*).
\tag{6}$$

Note that the operators $T_1 = C\partial f^*\!\left(C^T\cdot\right) - d$ and $T_2 = \partial\iota_{\ker(I-P)}$ are by design separable over the set of nodes and edges respectively. Furthermore, $C\partial f^*\!\left(C^T\cdot\right)$ and $\partial\iota_{\ker(I-P)}$ are the subdifferentials of CCP functions and thus are maximal monotone. A zero-point of (6) can therefore be found via a range of operator splitting methods (see [32] for an overview). In this particular instance, we will use PR splitting to construct a nonexpansive PDMM operator by rephrasing the zero-point condition in (6) as a more familiar fixed-point condition. This equivalent condition, as demonstrated in [47] (Section 7.3), is given by

$$R_{T_2,\rho} \circ R_{T_1,\rho}(z) = z, \qquad \lambda = J_{T_1,\rho}(z),$$

where $R_{T_i,\rho}$ and $J_{T_i,\rho}$ are the reflected resolvent and resolvent operators of $T_i$ respectively. Here, the introduced $z$ variables will be referred to as auxiliary variables. We define the PDMM operator as

$$T_{P,\rho} = R_{T_2,\rho} \circ R_{T_1,\rho},$$

which will be used repeatedly throughout this work. Importantly, given the nature of the operators considered, $T_{P,\rho}$ is nonexpansive. Specifically, as both $T_1$ and $T_2$ are maximal monotone operators, $J_{T_1,\rho}$ and $J_{T_2,\rho}$ are both firmly nonexpansive. By [46, Proposition 4.2], it follows that $R_{T_1,\rho}$ and $R_{T_2,\rho}$ are nonexpansive. The nonexpansiveness of $T_{P,\rho}$ allows us to utilize fixed-point iterative methods to solve (3) and ultimately (1) in a distributed manner.

E. On the Link with the Primal Dual Method of Multipliers

We now demonstrate how PDMM, as defined in [23], can be linked with classical monotone operator splitting theory. For this purpose we will consider the fixed-point iteration of $T_{P,\rho}$ given by

$$z^{(k+1)} = T_{P,\rho}\!\left(z^{(k)}\right) = R_{T_2,\rho} \circ R_{T_1,\rho}\!\left(z^{(k)}\right).
\tag{7}$$

To aid in the aforementioned relationship, the evaluation of the reflected resolvent operators $R_{T_1,\rho}$ and $R_{T_2,\rho}$ are outlined in the following Lemmas.

Lemma III.1. $y^{(k+1)} = R_{T_1,\rho}\!\left(z^{(k)}\right)$ can be computed as

$$\begin{aligned}
x^{(k+1)} &= \arg\min_x \left( f(x) - \left\langle C^T z^{(k)}, x \right\rangle + \frac{\rho}{2}\|Cx - d\|^2 \right) \\
\lambda^{(k+1)} &= z^{(k)} - \rho\!\left(Cx^{(k+1)} - d\right) \\
y^{(k+1)} &= 2\lambda^{(k+1)} - z^{(k)}
\end{aligned}$$

A proof of this result can be found in Appendix A. Note that the block diagonal structure of C and the separability of f allow this reflected resolvent to be computed in parallel across the nodes.
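As an illustrative special case (our own addition, not part of the original text), when $f(x) = \frac{1}{2}x^TQx - q^Tx$ with $Q \succ 0$, the $x$-update of Lemma III.1 reduces to the linear system

$$\left(Q + \rho C^TC\right)x^{(k+1)} = q + C^Tz^{(k)} + \rho C^Td,$$

obtained by setting the gradient of the strongly convex objective of the minimization to zero.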

Lemma III.2. $z^{(k+1)} = R_{T_2,\rho}\!\left(y^{(k+1)}\right)$ can be computed as

$$z^{(k+1)} = Py^{(k+1)}.$$

The proof for this result is included in Appendix B. The resulting permutation operation is equivalent to an exchange of auxiliary variables between neighboring nodes and is therefore distributable over the underlying network.

Utilizing Lemmas III.1 and III.2 it follows that

$$T_{P,\rho} = P \circ R_{T_1,\rho},
\tag{8}$$

and thus that (7) is equivalent to

$$z^{(k+1)} = P\!\left(z^{(k)} - 2\rho\!\left(Cx^{(k+1)} - d\right)\right).
\tag{9}$$

By noting that $z^{(k+1)} = P\!\left(\lambda^{(k+1)} - \rho\!\left(Cx^{(k+1)} - d\right)\right)$, the dependence on $y^{(k+1)}$ and $z^{(k+1)}$ can be removed, reducing the scheme to that given in Algorithm 1.


Algorithm 1 Simplified PDMM
1: Initialise: $\lambda^{(0)} \in \mathbb{R}^{M_E}$, $x^{(0)} \in \mathbb{R}^{M_V}$
2: for $k = 0, \ldots,$ do
3:   $x^{(k+1)} = \arg\min_x \left( f(x) - \left\langle C^T P\lambda^{(k)}, x \right\rangle + \frac{\rho}{2}\left\|Cx + PCx^{(k)} - 2d\right\|^2 \right)$
4:   $\lambda^{(k+1)} = P\lambda^{(k)} - \rho\left(Cx^{(k+1)} + PCx^{(k)} - 2d\right)$
5: end for

This algorithm is identical to a particular instance of PDMM proposed in [23]. Thus, PDMM is equivalent to the fixed-point iteration of the PR splitting of the extended dual problem, linking the approach with a plethora of existing algorithms within the literature [34], [38], [48], [49].

The connection with PR splitting motivates why PDMM may converge faster than ADMM for some problems, as demonstrated in [23]. In particular, [44, Remark 4] notes that PR splitting provides the fastest bound on convergence even though it may not converge for general problems. Specifically, the strong convexity and Lipschitz continuity of the averaging problem considered in [23] supports this link.
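To make the iteration concrete, the following sketch (our own illustration with hypothetical problem data, not code from the paper) runs the simplified PDMM iteration of Algorithm 1 on a small quadratic consensus problem, using the closed-form $x$-update noted after Lemma III.1:

```python
import numpy as np

# Local objectives f_i(x_i) = 0.5 * a_i * (x_i - c_i)^2 on a path graph 0-1-2,
# with consensus constraints x_i - x_j = 0 on each edge (b_{i,j} = 0).
a, c = np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 2.0])
Q, q = np.diag(a), a * c
N, E = 3, [(0, 1), (1, 2)]

directed = sorted([(i, j) for (i, j) in E] + [(j, i) for (i, j) in E])
row = {e: k for k, e in enumerate(directed)}
C = np.zeros((len(directed), N)); P = np.zeros((len(directed),) * 2); d = np.zeros(len(directed))
for (i, j) in E:
    C[row[(i, j)], i], C[row[(j, i)], j] = 1.0, -1.0
    P[row[(i, j)], row[(j, i)]] = P[row[(j, i)], row[(i, j)]] = 1.0

# Simplified PDMM (Algorithm 1) with the closed-form quadratic x-update.
rho = 1.0
x, lam = np.zeros(N), np.zeros(len(directed))
H = Q + rho * C.T @ C
for _ in range(200):
    x_new = np.linalg.solve(H, q + C.T @ (P @ lam) + rho * C.T @ (2 * d - P @ (C @ x)))
    lam = P @ lam - rho * (C @ x_new + P @ (C @ x) - 2 * d)
    x = x_new

print(x)   # each entry approaches the weighted average sum(a*c)/sum(a) = 4/3
```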

The distributed nature of PDMM can be more easily visualized in Algorithm 2 where we have utilized the definitions of $C$ and $d$. Here the notation Node$_j$ ← Node$_i(\bullet)$ indicates the transmission of data from node $i$ to node $j$.

Algorithm 2 Distributed PDMM
1: Initialise: $z^{(0)} \in \mathbb{R}^{M_E}$
2: for $k = 0, \ldots,$ do
3:   for all $i \in V$ do  ▷ Primal Update
4:     $x_i^{(k+1)} = \arg\min_{x_i} \left( f_i(x_i) + \sum_{j \in N(i)} \left( -\left\langle A_{i|j}^T z_{i|j}^{(k)}, x_i \right\rangle + \frac{\rho}{2}\left\| A_{i|j}x_i - \frac{b_{i,j}}{2} \right\|^2 \right) \right)$
5:     for all $j \in N(i)$ do  ▷ Dual Update
6:       $y_{i|j}^{(k+1)} = z_{i|j}^{(k)} - 2\rho\left( A_{i|j}x_i^{(k+1)} - \frac{b_{i,j}}{2} \right)$
7:     end for
8:   end for
9:   for all $i \in V, j \in N(i)$ do  ▷ Transmit Variables
10:    Node$_j$ ← Node$_i\!\left(y_{i|j}^{(k+1)}\right)$
11:  end for
12:  for all $i \in V, j \in N(i)$ do  ▷ Auxiliary Update
13:    $z_{i|j}^{(k+1)} = y_{j|i}^{(k+1)}$
14:  end for
15: end for

Each iteration of the algorithm only requires one-way transmission of the auxiliary $z$ variables between neighboring nodes. Thus, no direct collaboration is required between nodes during the computation of each iteration, leading to an appealing mode of operation for use in practical networks.

F. On the Link with the Distributed Alternating Direction Method of Multipliers

Using the proposed monotone interpretation of PDMM we can also link its behavior with ADMM. While in [23] it

was suggested that these two methods were fundamentally different due to their contrasting derivations, in the following we demonstrate how they are more closely related than first thought. Interestingly, this link is masked via the change of variables typically used in the updating scheme for ADMM and PDMM (see [34, Sec. 3] and [23, Sec. 4] respectively for such representations). For this purpose we re-derive an ADMM variant from the perspective of monotone operator theory.

To begin, consider the prototype ADMM problem given by

$$\begin{aligned}
\min_{x,y} \;\; & f(x) + g(y) \\
\text{s.t.} \;\; & Ax + By = c.
\end{aligned}
\tag{10}$$

We can recast (1) in the form of (10) by introducing the additional variables $y_{i|j}, y_{j|i} \in \mathbb{R}^{M_{i,j}} \;\forall (i, j) \in E$ so that

$$\begin{aligned}
\min_{x} \;\; & \sum_{i \in V} f_i(x_i) \\
\text{s.t.} \;\; & \left.\begin{aligned}
A_{i|j}x_i - \tfrac{b_{i,j}}{2} &= y_{i|j} \\
A_{j|i}x_j - \tfrac{b_{i,j}}{2} &= y_{j|i} \\
y_{i|j} + y_{j|i} &= 0
\end{aligned}\right\} \;\forall (i, j) \in E.
\end{aligned}
\tag{11}$$

Defining the stacked vector $y \in \mathbb{R}^{M_E}$ and adopting the matrices $C$, $P$ and $d$ as per Sec. III-C, (11) can be more simply written as

$$\begin{aligned}
\min_{x,y} \;\; & f(x) + \iota_{\ker(I+P)}(y) \\
\text{s.t.} \;\; & Cx - d = y.
\end{aligned}
\tag{12}$$

Here, the indicator function is used to capture the final set of equality constraints in (11). It follows that (12) is exactly in the form of (10) so that ADMM can be applied.

The ADMM algorithm is equivalent to applying Douglas-Rachford (DR) splitting [50] to the dual of (12), given by

$$\min_{\lambda} \; f^*(C^T\lambda) - d^T\lambda + \iota^*_{\ker(I+P)}(\lambda),
\tag{13}$$

where λ, as in the case of PDMM, denotes the stacked vector of dual variables associated with the directed edges.

Comparing (13) and (6), we can note that the apparent difference in the dual problems is due to the use of $\iota_{\ker(I-P)}$, in the case of PDMM, or $\iota^*_{\ker(I+P)}$ in the case of ADMM. In actual fact these two functions are equal, which stems from the definition of the Fenchel conjugate of an indicator function,

$$\iota^*_{\ker(I+P)}(\lambda) = \sup_{y} \left( \langle y, \lambda \rangle - \iota_{\ker(I+P)}(y) \right) = \begin{cases} 0 & \lambda \in \mathrm{ran}(I + P) \\ \infty & \text{otherwise.} \end{cases}$$

As $\mathrm{ran}(I + P) = \ker(I - P)$, it follows that $\iota^*_{\ker(I+P)} = \iota_{\ker(I-P)}$. The problems in (5) and (13) are therefore identical.

As DR splitting is equivalent to a half-averaged form of PR splitting [46], the operator form of ADMM is therefore given by $T_{A,\rho} = \frac{1}{2}(I + T_{P,\rho})$. In this manner, despite their differences in earlier derivations, ADMM and PDMM are fundamentally linked. Within the literature, PDMM could therefore also be referred to as a particular instance of generalised [51] or relaxed ADMM [44].


IV. GENERAL CONVERGENCE RESULTS FOR PDMM

Having linked PDMM with PR splitting, we now move to demonstrate convergence results for the algorithm. In particular we demonstrate a proof of convergence for PDMM for strongly convex and differentiable functions. This proof is required due to the fact that the strong monotonicity of either $T_1$ or $T_2$, usually required to guarantee convergence of PR

splitting, cannot be guaranteed for PDMM due to the row rank deficiency of the matrix C. We also highlight the use of operator averaging to guarantee convergence for all f ∈ Γ0

and demonstrate its necessity with an analytic example where PDMM fails to converge.

A. Convergence of the Primal Error $\|x^{(k)} - x^*\|^2$ of PDMM

The first result we demonstrate is that of the primal convergence of PDMM. In particular, we show that the sequence of primal iterates $\left(x^{(k)}\right)_{k \in \mathbb{N}}$ converges to an optimal state, i.e.,

$$\exists x^* \in \mathcal{X}^* \mid \|x^{(k)} - x^*\|^2 \to 0,
\tag{14}$$

where $\mathcal{X}^*$ denotes the set of primal optimizers of (1) and $\bullet \to \bullet$ denotes convergence. The term $\|x^{(k)} - x^*\|^2$ will be referred to as the primal error from here on.

Many of the arguments used in this section make use of the notions of the kernel and range space of non-square matrices. These properties are defined below.

Definition IV.1. Range Space and Kernel Space: Given a matrix $A$, the range space of $A$ is denoted by $\mathrm{ran}(A)$ where

$$\forall y \in \mathrm{ran}(A), \exists u \mid Au = y.$$

Similarly, the kernel space of $A$ is denoted by $\ker(A)$ where $\forall y \in \ker(A), Ay = 0$.

For any matrix, the subspaces $\mathrm{ran}(A)$ and $\ker(A^T)$ are orthogonal and, furthermore, their direct sum $\mathrm{ran}(A) + \ker(A^T)$ spans the entire space.

To demonstrate that (14) holds, we can make use of the relationship between the primal x and auxiliary z variables of PDMM. In particular, we will demonstrate that both the primal and auxiliary variables converge by ultimately showing that

$$\exists z^* \in \mathrm{fix}(T_{P,\rho}) \mid \|z^{(k)} - z^*\|^2 \to 0,$$

which we will refer to as auxiliary convergence.

B. Primal Independence of a Non-Decreasing Subspace

To prove auxiliary convergence, other approaches in the literature often leverage additional operational properties such as strict nonexpansiveness. Unfortunately, in the case of PDMM, $T_{P,\rho}$ is at best nonexpansive due to the presence of a non-decreasing component. Fortunately, this particular component does not influence the computation of the primal variables and ultimately can be ignored.

To demonstrate that PDMM is at best nonexpansive, consider the equation for two successive updates given by

$$\begin{aligned}
z^{(k+2)} &= T_{P,\rho} \circ T_{P,\rho}\!\left(z^{(k)}\right) \\
&= T_{P,\rho}\!\left( P\!\left(z^{(k)} - 2\rho\!\left(Cx^{(k+1)} - d\right)\right) \right) \\
&= z^{(k)} - 2\rho\!\left(PCx^{(k+2)} + Cx^{(k+1)} - 2d\right),
\end{aligned}
\tag{15}$$

where the second and third lines use the PDMM update in (9). From our feasibility assumption of (1), $\exists x^* \mid PCx^* + Cx^* = 2d$ so that $d \in \mathrm{ran}(PC) + \mathrm{ran}(C)$. Therefore, every two PDMM updates only affect the auxiliary variables in the subspace $\mathrm{ran}(PC) + \mathrm{ran}(C)$. By considering the projection of each iterate onto the orthogonal subspace of $\mathrm{ran}(PC) + \mathrm{ran}(C)$, which is given by $\ker(C^T) \cap \ker(C^TP)$, it follows that, for all even $k$,

$$\Pi_{\ker(C^T)\cap\ker(C^TP)}\!\left(z^{(k+2)}\right) = \Pi_{\ker(C^T)\cap\ker(C^TP)}\!\left(z^{(k)}\right) = \Pi_{\ker(C^T)\cap\ker(C^TP)}\!\left(z^{(0)}\right),$$

where $\Pi_{\mathcal{A}}$ denotes the orthogonal projection onto $\mathcal{A}$.

Every even-numbered auxiliary iterate $z^{(k)}$ contains a non-decreasing component determined by our initial choice of $z^{(0)}$. Fortunately, from Lemma III.1 it is clear that each $x^{(k)}$ is independent of $\Pi_{\ker(C^T)}\!\left(z^{(k)} + \rho d\right)$. As $\ker(C^T) \cap \ker(C^TP) \subseteq \ker(C^T)$, any signal in the non-decreasing subspace of $T_{P,\rho} \circ T_{P,\rho}$ will not play a role in the primal updates. For proving primal convergence, we will therefore consider the projected auxiliary error

$$\left\| \Pi_{\mathrm{ran}(C)+\mathrm{ran}(PC)}\!\left(z^{(k)} - z^*\right) \right\|^2.
\tag{16}$$

Such a projection can be easily computed for even iterates due to the structure noted in (15) by defining the vector

$$\bar{z} = z^* + \Pi_{\ker(C^T)\cap\ker(C^TP)}\!\left(z^{(0)}\right).
\tag{17}$$

From the nonexpansiveness of PDMM, the projected auxiliary error satisfies

$$\|z^{(k+2)} - \bar{z}\| \leq \|z^{(k)} - \bar{z}\|.$$

The sequence $\left(z^{(2k)}\right)_{k \in \mathbb{N}}$ is therefore Fejér monotone with respect to $\bar{z}$ and thus the sequence $\left(\|z^{(2k)} - \bar{z}\|\right)_{k \in \mathbb{N}}$ converges [46, Proposition 5.4]. To prove projected auxiliary convergence, all that remains is to show that

$$\lim_{k \to \infty} \left( z^{(2k)} - \bar{z} \right) = 0.
\tag{18}$$

C. Optimality of Auxiliary Limit Points

We will now demonstrate that (18) holds in the specific case of strongly convex and differentiable functions, in turn allowing us to prove primal convergence. While the differentiability of a function is straightforward, the notion of strong convexity is defined below.


Definition IV.2. Strong Convexity: A function $f$ is $\mu$-strongly convex with $\mu > 0$ iff $\forall \theta \in [0, 1], x, y \in \mathrm{dom}(f)$,

$$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) - \mu\theta(1-\theta)\|x - y\|^2.$$

Additionally, if $f$ is $\mu$-strongly convex, $\partial f$ is $\mu$-strongly monotone.

Definition IV.3. Strongly Monotone: An operator $T : \mathcal{X} \to \mathcal{Y}$ is $\mu$-strongly monotone with $\mu > 0$, if

$$\langle u - v, x - y \rangle \geq \mu\|x - y\|^2 \quad \forall (x, u), (y, v) \in \mathrm{gra}(T).$$

To verify that (18) holds under the aforementioned assumptions, we make use of the following Lemma relating to the limit points of the primal and dual variables.

Lemma IV.1. If $f$ is differentiable and $\mu$-strongly convex then

$$\lim_{k \to \infty} x^{(k)} = x^*, \qquad \lim_{k \to \infty} \Pi_{\mathrm{ran}(C)}\!\left(\lambda^{(k)}\right) = \Pi_{\mathrm{ran}(C)}(\lambda^*).$$

The proof for this Lemma can be found in Appendix C. Using Lemma IV.1, and rearranging the dual update equation in Lemma III.1, it follows that

$$\lim_{k \to \infty} \Pi_{\mathrm{ran}(C)}\!\left(z^{(k)}\right) = \lim_{k \to \infty} \Pi_{\mathrm{ran}(C)}\!\left(\lambda^{(k+1)} + \rho\!\left(Cx^{(k+1)} - d\right)\right) = \Pi_{\mathrm{ran}(C)}\!\left(\lambda^* + \rho\left(Cx^* - d\right)\right) = \Pi_{\mathrm{ran}(C)}(z^*).
\tag{19}$$

From (19), it also follows that

$$\begin{aligned}
0 &= \lim_{k \to \infty} \Pi_{\mathrm{ran}(C)}\!\left(z^{(k+1)} - z^*\right) \\
&= \lim_{k \to \infty} \Pi_{\mathrm{ran}(C)}\, P\!\left(z^{(k)} - z^* - 2\rho C\!\left(x^{(k+1)} - x^*\right)\right) \\
&= \lim_{k \to \infty} P\, \Pi_{\mathrm{ran}(C)}\, P\!\left(z^{(k)} - z^*\right) \\
&= \lim_{k \to \infty} \Pi_{\mathrm{ran}(PC)}\!\left(z^{(k)} - z^*\right),
\end{aligned}
\tag{20}$$

where the second line uses Eq. (9), the third line uses that $\lim_{k \to \infty} x^{(k+1)} = x^*$ and that $P$ is full rank, while the last line exploits that $P = P^{-1}$ such that $P\, \Pi_{\mathrm{ran}(C)}\, P = \Pi_{\mathrm{ran}(PC)}$.

Combining (19) and (20) finally demonstrates, under the restrictions of strong convexity and differentiability of $f$, that

$$\lim_{k \to \infty} \Pi_{\mathrm{ran}(C)+\mathrm{ran}(PC)}\!\left(z^{(2k)} - z^*\right) = \lim_{k \to \infty} \left(z^{(2k)} - \bar{z}\right) = 0.$$

Primal convergence follows from Lemma III.1, by noting

$$\begin{aligned}
x^{(k+1)} &= \left(\nabla f + \rho C^TC\right)^{-1}\!\left(C^T\!\left(z^{(k)} + \rho d\right)\right) \\
x^* &= \left(\nabla f + \rho C^TC\right)^{-1}\!\left(C^T\!\left(z^* + \rho d\right)\right).
\end{aligned}
\tag{21}$$

The equality in this case follows from the fact that $\nabla f$ is $\mu$-strongly monotone such that $\left(\nabla f + \rho C^TC\right)^{-1}$ is Lipschitz continuous and thus single-valued. Substituting (21) into the primal error, it follows that

$$\begin{aligned}
\|x^{(k+1)} - x^*\|^2 &= \left\| \left(\nabla f + \rho C^TC\right)^{-1}\!\left(C^T\!\left(z^{(k)} + \rho d\right)\right) - \left(\nabla f + \rho C^TC\right)^{-1}\!\left(C^T\!\left(z^* + \rho d\right)\right) \right\|^2 \\
&\leq \frac{1}{\mu^2}\left\| C^T\!\left(z^{(k)} - z^*\right) \right\|^2 \\
&\leq \frac{\sigma^2_{\max}(C)}{\mu^2}\left\| z^{(k)} - \bar{z} \right\|^2,
\end{aligned}
\tag{22}$$

where $\sigma_{\max}$ denotes the largest singular value of a matrix.

The primal error $\|x^{(k+1)} - x^*\|^2$ is therefore upper bounded by the projected auxiliary error and thus converges.

D. Averaged PDMM Convergence

As with other operator splitting methods, PDMM can be combined with an averaging stage to guarantee convergence ∀f ∈ Γ0, even those which do not satisfy the strong convexity

or differentiability assumptions introduced in Sec. IV-C. The general form of the averaged PDMM operator is given by

$$T_{P,\rho,\alpha} = (1 - \alpha)I + \alpha T_{P,\rho},$$

where the scalar $\alpha \in (0, 1)$. In the particular case that $\alpha = \frac{1}{2}$, averaged PDMM is equivalent to ADMM, as was previously noted in Sec. III-F. In this case, by [46, Proposition 4.4], the operator $T_{P,\rho,\alpha}$ is firmly nonexpansive.

The fixed-point iteration of $T_{P,\rho,\alpha}$ is therefore given by

$$z^{(k+1)} = (1 - \alpha)z^{(k)} + \alpha T_{P,\rho}\!\left(z^{(k)}\right).$$

This is referred to as the $\alpha$-Krasnosel'skiĭ-Mann iteration [46] of the operator $T_{P,\rho}$, which is a well documented method of guaranteeing convergence for nonexpansive operators. Notably, recursively applying [46, Eq. 5.16], it follows that the fixed-point residual $\left\|(T_{P,\rho} - I)\!\left(z^{(k)}\right)\right\|$ converges at an asymptotic rate of $O\!\left(\frac{1}{k}\right)$ and thus $z^{(k)}$ converges to a point in $\mathrm{fix}(T_{P,\rho})$ for finite dimensional problems.

E. Lack of Convergence of PDMM for $f \in \Gamma_0$

Without the use of averaging, the convergence results demonstrated so far require f to be both strongly convex and differentiable. While such a result is well known in the case of PR splitting, it is not noted in the existing analysis of PDMM within the literature [23].

In the following, we reinforce the importance of this result by demonstrating a problem instance where PDMM does not converge despite $f \in \Gamma_0$. For this purpose we consider solving the following problem over two nodes:

$$\begin{aligned}
\min_{x_1, x_2} \;\; & |x_1 - 1| + |x_2 + 1| \\
\text{s.t.} \;\; & x_1 - x_2 = 0.
\end{aligned}
\tag{23}$$

The objective in (23) is neither differentiable nor strongly convex. From Lemmas III.1 and III.2, the primal and auxiliary updates for PDMM are given respectively by

$$\begin{aligned}
x_1^{(t+1)} &= \arg\min_x \left( |x - 1| - z_{1|2}^{(t)}x + \frac{\rho}{2}\|x\|^2 \right), \\
x_2^{(t+1)} &= \arg\min_x \left( |x + 1| + z_{2|1}^{(t)}x + \frac{\rho}{2}\|x\|^2 \right), \\
z_{1|2}^{(t+1)} &= z_{2|1}^{(t)} + 2\rho x_2^{(t+1)}, \qquad z_{2|1}^{(t+1)} = z_{1|2}^{(t)} - 2\rho x_1^{(t+1)}.
\end{aligned}
\tag{24}$$


By setting $z_{1|2}^{(0)} = z_{2|1}^{(0)} = 0$ and $\rho = 1$ it follows from (24) that after the first iteration $x_1^{(1)} = -x_2^{(1)} = 1$ and $z_{1|2}^{(1)} = z_{2|1}^{(1)} = -2$. Note that $x_1 \neq x_2$ such that $x$ is not primal feasible. For the second iteration $x_1^{(2)} = -x_2^{(2)} = -1$ and $z_{1|2}^{(2)} = z_{2|1}^{(2)} = 0$. Again, $x_1 \neq x_2$ and furthermore the auxiliary variables are back to their original configuration. The auxiliary variables of PDMM are therefore stuck in a limit cycle and can never converge for this problem. The primal variables also exhibit a limit cycle in this case. As such, $f \in \Gamma_0$ is not a sufficient condition for the convergence of PDMM without the use of operator averaging.
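A minimal numerical sketch reproducing this limit cycle (our own illustration; the closed-form proximal steps for the absolute-value terms are a standard soft-thresholding identity, not stated in the paper):

```python
import numpy as np

def soft(v, tau):                        # soft-thresholding: prox of tau*|.|
    return np.sign(v) * max(abs(v) - tau, 0.0)

rho = 1.0
z12, z21 = 0.0, 0.0                      # auxiliary variables z_{1|2}, z_{2|1}
for t in range(6):
    # primal updates of (24): proximal steps of |x - 1| and |x + 1|
    x1 = 1.0 + soft(z12 / rho - 1.0, 1.0 / rho)
    x2 = -1.0 + soft(-z21 / rho + 1.0, 1.0 / rho)
    # auxiliary updates of (24)
    z12, z21 = z21 + 2 * rho * x2, z12 - 2 * rho * x1
    print(t, (x1, x2), (z12, z21))
# The primal pair alternates between (1, -1) and (-1, 1) while the auxiliary
# pair returns to (0, 0) every second iteration: the period-2 limit cycle above.
```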

V. GEOMETRIC CONVERGENCE AND DISTRIBUTED PARAMETER SELECTION

While PR splitting is well known to converge geometrically under the assumption of strong monotonicity and Lipschitz continuity, such conditions cannot be guaranteed in the case of PDMM due to the row rank deficiency of $C$. However, by assuming that $f$ is strongly convex and has a Lipschitz continuous gradient, we can demonstrate a geometrically contracting upper bound for the primal error of PDMM despite this fact.

A. A Primal Geometric Convergence Bound for Strongly Convex and Smooth Functions

In the following we demonstrate that for strongly convex functions with Lipschitz continuous gradients, the primal variables of PDMM converge at a geometric rate. More formally we show that $\exists\, \epsilon \geq 0, \gamma \in [0, 1)$ so that

$$\forall k \in \mathbb{N}, \quad \|x^{(k)} - x^*\|^2 \leq \epsilon\gamma^k.$$

As in the case of Section IV-A, this is achieved by firstly forming a geometric bound for the projected auxiliary error

$$\left\| \Pi_{\mathrm{ran}(C)+\mathrm{ran}(PC)}\!\left(z^{(k)} - z^*\right) \right\|^2 = \|z^{(k)} - \bar{z}\|^2,$$

before linking back to the primal variables.

The process of bounding the projected auxiliary error is broken down into two stages. Firstly, in Sections V-B and V-C we demonstrate how, for strongly convex functions with Lipschitz continuous gradients, PDMM is contractive over a subspace. In Sections V-D and V-E we then show how a geometric convergence bound can be found by linking PDMM with a generalized form of the alternating method of projections, allowing us to derive the aforementioned $\gamma$ and $\epsilon$.

B. Contractive Nature of PDMM Over a Subspace

Proving that the projected auxiliary error of PDMM converges geometrically relies on strong monotonicity and the additional notion of Lipschitz continuity. This is defined as follows.

Definition V.1. Lipschitz Continuous: An operator $T : \mathcal{X} \to \mathcal{Y}$ is $L$-Lipschitz if

$$\|u - v\| \leq L\|x - y\| \quad \forall (x, u), (y, v) \in \mathrm{gra}(T).$$

If $L = 1$, $T$ is nonexpansive while if $L < 1$ it is contractive.

Given this notion, we demonstrate the contractive nature of the PDMM operator over $\mathrm{ran}(C)$ by showing that $C\nabla f^*(C^T\bullet)$ is strongly monotone and Lipschitz continuous over this subspace. This is summarized in Lemma V.1.

Lemma V.1. If $f$ is $\mu$-strongly convex and $\nabla f$ is $\beta$-Lipschitz continuous then $C\nabla f^*(C^T\bullet)$ is

(i) $\frac{\sigma^2_{\max}(C)}{\mu}$-Lipschitz continuous,

(ii) $\frac{\sigma^2_{\min\neq 0}(C)}{\beta}$-strongly monotone $\forall z \in \mathrm{ran}(C)$,

where $\sigma_{\min\neq 0}$ denotes the smallest non-zero singular value.

The proof of this lemma can be found in Appendix D. Lemma V.1 reflects a similar approach in [44] for general PR splitting problems. Note that the result demonstrated therein does not hold in this context due to the row-rank deficiency of C. Specifically, [44, Assumption 2] is violated.

As $C\nabla f^*(C^T\bullet)$ is both strongly monotone and Lipschitz continuous over $\mathrm{ran}(C)$, from [44], $R_{T_1,\rho}$ is contractive $\forall z \in \mathrm{ran}(C)$ with an upper bound on this contraction given by

$$\delta = \max\left( \frac{\frac{\rho\sigma^2_{\max}(C)}{\mu} - 1}{\frac{\rho\sigma^2_{\max}(C)}{\mu} + 1},\; \frac{1 - \frac{\rho\sigma^2_{\min\neq 0}(C)}{\beta}}{1 + \frac{\rho\sigma^2_{\min\neq 0}(C)}{\beta}} \right) \in [0, 1).$$

By the same arguments, the operator $P \circ R_{T_1,\rho} \circ P$ is $\delta$-contractive over $\mathrm{ran}(PC)$. Using the definition of the PDMM operator (8), the two-step PDMM updates given in (15) can equivalently be written as

$$z^{(k+2)} = \left(P \circ R_{T_1,\rho} \circ P\right) \circ R_{T_1,\rho}\!\left(z^{(k)}\right).$$

Every two PDMM iterations is therefore the composition of the operators $R_{T_1,\rho}$ and $P \circ R_{T_1,\rho} \circ P$, with each being $\delta$-contractive over $\mathrm{ran}(C)$ and $\mathrm{ran}(PC)$ respectively.

C. Inequalities due to the Contraction of PDMM

The contractive nature of $R_{T_1,\rho}$ and $P \circ R_{T_1,\rho} \circ P$ leads to two important inequalities. In this case we will assume that $k$ is even and that $\bar{z}$ is defined as per (17).

Beginning with the operator $R_{T_1,\rho}$, consider the updates $\bar{y} = R_{T_1,\rho}(\bar{z})$ and $y^{(k+1)} = R_{T_1,\rho}\!\left(z^{(k)}\right)$. Using Lemma III.1, it follows that

$$\begin{aligned}
y^{(k+1)} - \bar{y} &= 2\lambda^{(k+1)} - z^{(k)} - \left(2\bar{\lambda} - \bar{z}\right) \\
&= z^{(k)} - \bar{z} - 2\rho C\!\left(x^{(k+1)} - x^*\right),
\end{aligned}$$

so that the projection onto $\ker(C^T)$ satisfies

$$\Pi_{\ker(C^T)}\!\left(y^{(k+1)} - \bar{y}\right) = \Pi_{\ker(C^T)}\!\left(z^{(k)} - \bar{z}\right).$$

Combining with the $\delta$-contractive nature of $R_{T_1,\rho}$ over $\mathrm{ran}(C)$, it follows that

$$\|y^{(k+1)} - \bar{y}\|^2 \leq \delta^2\left\| \Pi_{\mathrm{ran}(C)}\!\left(z^{(k)} - \bar{z}\right) \right\|^2 + \left\| \Pi_{\ker(C^T)}\!\left(z^{(k)} - \bar{z}\right) \right\|^2.$$

For the operator $P \circ R_{T_1,\rho} \circ P$, as $\bar{z} = P \circ R_{T_1,\rho} \circ P(\bar{y})$ by the results of Section IV-B and $z^{(k+2)} = P \circ R_{T_1,\rho} \circ P\!\left(y^{(k+1)}\right)$, it can be similarly shown that

$$\Pi_{\ker(C^TP)}\!\left(z^{(k+2)} - \bar{z}\right) = \Pi_{\ker(C^TP)}\!\left(y^{(k+1)} - \bar{y}\right),$$

and furthermore that

$$\|z^{(k+2)} - \bar{z}\|^2 \leq \delta^2\left\| \Pi_{\mathrm{ran}(PC)}\!\left(y^{(k+1)} - \bar{y}\right) \right\|^2 + \left\| \Pi_{\ker(C^TP)}\!\left(y^{(k+1)} - \bar{y}\right) \right\|^2.$$

While the contractive nature of $R_{T_1,\rho}$ and $P \circ R_{T_1,\rho} \circ P$

suggests the geometric convergence of PDMM, it is unclear what this convergence rate may be. In the following, this will be addressed by deriving a geometric error bound for two-step PDMM by connecting it with the method of alternating projections.

D. A Geometric Rate Bound for PDMM Interpreted as an Optimization Problem

Using the results of Section V-C we now demonstrate that $\exists \gamma$ so that the projected auxiliary error satisfies

$$\|z^{(k+2)} - \bar{z}\|^2 \leq \gamma^2\|z^{(k)} - \bar{z}\|^2,
\tag{25}$$

where $\gamma^2$ can be computed via a non-convex optimization problem. Specifically, it is the maximum objective value of

$$\begin{aligned}
\max_{y, z, \hat{z}} \;\; & \|\hat{z} - \bar{z}\|^2 & \text{(26a)} \\
\text{s.t.} \;\; & y = R_{T_1,\rho}(z) & \text{(26b)} \\
& \hat{z} = P \circ R_{T_1,\rho} \circ P(y) & \text{(26c)} \\
& \|z - \bar{z}\|^2 \leq 1. & \text{(26d)}
\end{aligned}$$

Here, (26a) captures the worst case improvement in the distance between the two-step iterates ($\hat{z}$) and the projected fixed point ($\bar{z}$). Due to (26d), the maximum of this objective exactly determines the worst case convergence rate. The vector $z$ corresponds to the initial auxiliary variable, $y$ and $\hat{z}$ are generated via the one and two step PDMM updates imposed by (26b) and (26c), and (26d) defines the feasible set of $z$. In a similar manner to (17), $\bar{z} = z^* + \Pi_{\ker(C^T)\cap\ker(C^TP)}(z)$ so that $z - \bar{z} \in \mathrm{ran}(PC) + \mathrm{ran}(C)$.

Using the properties of $R_{T_1,\rho}$ and $P \circ R_{T_1,\rho} \circ P$ from Sec. V-C, the optimum of (26) can be equivalently computed via

$$\begin{aligned}
\max_{y, z} \;\; & \left\| \left( \delta\Pi_{\mathrm{ran}(PC)} + \Pi_{\ker(C^TP)} \right)(y - \bar{y}) \right\|^2 & \\
\text{s.t.} \;\; & \left\| \Pi_{\mathrm{ran}(C)}(y - \bar{y}) \right\|^2 \leq \delta^2\left\| \Pi_{\mathrm{ran}(C)}(z - \bar{z}) \right\|^2 & \text{(27a)} \\
& \Pi_{\ker(C^T)}(y - \bar{y}) = \Pi_{\ker(C^T)}(z - \bar{z}) & \text{(27b)} \\
& \|z - \bar{z}\|^2 \leq 1, & \text{(27c)}
\end{aligned}$$

where $\bar{y} = R_{T_1,\rho}(\bar{z})$ and in the objective we have exploited the orthogonality of $\mathrm{ran}(PC)$ and $\ker(C^TP)$. The constraints of (27) increase the feasible sets of $y$ and $\hat{z}$ while including the true updates due to $R_{T_1,\rho}$ as special cases.

The constraints (27a), (27b) and (27c) collectively define the feasible set of the vectors $y - \bar{y}$. We can further simplify (27) by considering the form of this feasible set. In particular, as (27c) denotes a sphere, the constraints (27a) and (27b) restrict the vectors $y - \bar{y}$ to lie in an ellipsoid given by

$$y - \bar{y} \in \left\{ \left( \delta\Pi_{\mathrm{ran}(C)} + \Pi_{\ker(C^T)} \right)u \;\middle|\; \|u\| \leq 1 \right\}.$$

By defining the additional variable $u = z - \bar{z}$, the optimization problem in (26) is therefore equivalent to

$$\begin{aligned}
\max_{u} \;\; & \left\| \left( \delta\Pi_{\mathrm{ran}(PC)} + \Pi_{\ker(C^TP)} \right)\left( \delta\Pi_{\mathrm{ran}(C)} + \Pi_{\ker(C^T)} \right)u \right\|^2 \\
\text{s.t.} \;\; & \|u\|^2 \leq 1, \quad u \in \mathrm{ran}(PC) + \mathrm{ran}(C),
\end{aligned}
\tag{28}$$

where the additional domain constraint stems from the definition of $\bar{z}$. In the following we demonstrate how (28) exhibits an analytic expression for $\gamma$, ultimately allowing us to form our primal convergence rate bound.

E. Relationship with the Method of Alternating Projections

To compute the contraction factor $\gamma$ in (25), we can exploit the relationship between (28) and the method of alternating projections. Optimal rate bounds for generalizations of the classic alternating projections algorithm have been an area of recent attention in the literature, with two notable papers on the subject being [52] and [53]. Our analysis below follows in the spirit of these methods.

Consider the particular operator from Eq. (28),

$$G = \left( \delta\Pi_{\mathrm{ran}(PC)} + \Pi_{\ker(C^TP)} \right)\left( \delta\Pi_{\mathrm{ran}(C)} + \Pi_{\ker(C^T)} \right).$$

Given the domain constraint also from (28), it follows that $\gamma^2$ corresponds to the largest singular value of the matrix $\Pi_{\mathrm{ran}(C)+\mathrm{ran}(PC)}\, G^TG\, \Pi_{\mathrm{ran}(C)+\mathrm{ran}(PC)}$. We can therefore compute $\gamma$ by taking advantage of the structure of $G$. In particular, from [53], there exists an orthonormal matrix $D$ such that

$$\Pi_{\mathrm{ran}(PC)} = D\begin{bmatrix} C^2 & CS & 0 & 0 \\ CS & S^2 & 0 & 0 \\ 0 & 0 & I & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}D^H, \qquad \Pi_{\mathrm{ran}(C)} = D\begin{bmatrix} I & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}D^H,$$

where $C$ and $S$ denote diagonal matrices of the cosines and sines of the principal angles between $\mathrm{ran}(C)$ and $\mathrm{ran}(PC)$, respectively. It follows that for the considered operator

$$G = D\begin{bmatrix} \delta^2 + \delta(1-\delta)S^2 & -(1-\delta)CS & 0 & 0 \\ -\delta(1-\delta)CS & (1-\delta)C^2 + \delta & 0 & 0 \\ 0 & 0 & \delta I & 0 \\ 0 & 0 & 0 & I \end{bmatrix}D^H.$$

Note that the bottom right identity matrix corresponds to those vectors that lie outside our feasible set.

Given the structure of $G$ and the diagonal nature of $C$ and $S$, it follows that $\gamma$ is either given by $\delta$ or by $\sigma_{\max}$ of any of the two by two submatrices

$$G_i = \begin{bmatrix} \delta^2 + \delta(1-\delta)S_i^2 & -(1-\delta)C_iS_i \\ -\delta(1-\delta)C_iS_i & \delta + (1-\delta)C_i^2 \end{bmatrix},$$

where $S_i = \sin(\theta_i)$, $C_i = \cos(\theta_i)$ and $\theta_i \in (0, \frac{\pi}{2}]$ is the $i$th principal angle. The singular values of such a submatrix can be computed via the following lemma.

Lemma V.2. The singular values of $G_i$ are given by

$$\sigma(G_i) = \sqrt{ \delta^2 + (1-\delta^2)C_i\left( \frac{(1-\delta^2)C_i}{2} \pm \sqrt{ \frac{(1-\delta^2)^2C_i^2}{4} + \delta^2 } \right) }.$$


The proof for this lemma can be found in Appendix E. As the singular values are a nondecreasing function of $C_i$ and thus a nonincreasing function of $\theta_i$, it follows that

$$\gamma = \max\{\delta, \{\sigma_{\max}(G_i)\ \forall i\}\} = \sigma_{\max}(G_F).
\tag{29}$$

Here $G_F$ refers to the submatrix associated with the smallest non-zero principal angle $\theta_F$, which is referred to as the Friedrichs angle. Therefore, given $\delta$ and $C_F = \cos(\theta_F)$,

$$\gamma = \sqrt{ \delta^2 + (1-\delta^2)C_F\left( \frac{(1-\delta^2)C_F}{2} + \sqrt{ \frac{(1-\delta^2)^2C_F^2}{4} + \delta^2 } \right) }.$$

F. From an Auxiliary Error Bound to a Geometric Primal Convergence Bound

Using (29), our primal convergence bound can finally be constructed. For two-step PDMM we already know that

$$\|z^{(k+2)} - \bar{z}\|^2 \leq \gamma^2\|z^{(k)} - \bar{z}\|^2.$$

By recursively applying this result, it follows that, for even $k$,

$$\|z^{(k+1)} - T_{P,\rho}(\bar{z})\|^2 \leq \gamma^k\|z^{(1)} - T_{P,\rho}(\bar{z})\|^2 \leq \gamma^k\|z^{(0)} - \bar{z}\|^2,$$

so that the projected auxiliary error of PDMM satisfies

$$\|z^{(k+2)} - \bar{z}\|^2 \leq \frac{\gamma^{k+2}\|z^{(0)} - \bar{z}\|^2}{\gamma}.
\tag{30}$$

By applying (22) to (30), the final primal bound is given by

$$\|x^{(k+1)} - x^*\|^2 \leq \frac{\sigma^2_{\max}(C)}{\mu^2}\|z^{(k)} - \bar{z}\|^2 \leq \frac{\gamma^{k+2}\sigma^2_{\max}(C)}{\mu^2\gamma}\|z^{(0)} - \bar{z}\|^2.
\tag{31}$$

The primal error $\|x^{(k+2)} - x^*\|^2$ is therefore upper bounded by a geometrically contracting sequence and thus converges at a geometric rate. To the best of the authors' knowledge, this is the fastest rate for PDMM proven within the literature.
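The constants of this bound are straightforward to evaluate numerically. As a sketch (our own, with hypothetical placeholder values for $\mu$, $\beta$, $\rho$, the singular values of $C$ and the Friedrichs angle $\theta_F$), the contraction factors $\delta$ of Sec. V-B and $\gamma$ of Eq. (29) can be computed as:

```python
import numpy as np

def pdmm_rate(rho, mu, beta, sigma_max, sigma_min_nz, theta_F):
    """delta: contraction of R_{T1,rho} over ran(C) (Sec. V-B);
    gamma: two-step PDMM contraction factor of Eq. (29)."""
    a = rho * sigma_max**2 / mu
    b = rho * sigma_min_nz**2 / beta
    delta = max((a - 1) / (a + 1), (1 - b) / (1 + b))
    t = (1 - delta**2) * np.cos(theta_F)
    gamma = np.sqrt(delta**2 + t * (t / 2 + np.sqrt(t**2 / 4 + delta**2)))
    return delta, gamma

# hypothetical constants, for illustration only
delta, gamma = pdmm_rate(rho=1.0, mu=1.0, beta=3.0, sigma_max=np.sqrt(2.0),
                         sigma_min_nz=1.0, theta_F=np.pi / 3)
print(delta, gamma)   # gamma < 1 implies the geometric bound (30) applies
```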

VI. NUMERICAL EXPERIMENTS

In this section, we verify the analytical results of Sec. IV and V with numerical experiments. These results are broken down into two subsections: the convergence of PDMM for strongly convex and differentiable functions and the geometric convergence of PDMM for strongly convex functions with Lipschitz continuous gradients.

A. PDMM for Strongly Convex and Differentiable Functions

The first set of simulations validate the sufficiency of strong convexity and differentiability to guarantee primal convergence, as introduced in Sec. IV. For these simulations, as testing all such functions would be computationally infeasible, we instead considered the family of $m$-th powers of $m$-norms for $m \in \{3, 4, 5, \cdots\}$ combined with an additive squared Euclidean norm term to enforce strong convexity. The prototype problem for these simulations is given by

$$\begin{aligned}
\min_{x} \;\; & \sum_{i \in V} \left( \|x_i - a_i\|_m^m + \mu\|x_i - a_i\|^2 \right) \\
\text{s.t.} \;\; & x_i - x_j = 0 \quad \forall (i, j) \in E,
\end{aligned}$$

where ai are local observation vectors, µ controls the strong

convexity parameter and, for simplicity, edge based consensus constraints were chosen.

An $N = 10$ node undirected Erdős-Rényi network [54] was considered for these simulations. Such networks are randomly generated graphs where $\forall i \in V, j \in V \setminus \{i\}$, there is equal probability that $(i, j) \in E$. This probability determines the density of the connectivity in the network and in this case was set to $\frac{\log(N)}{N}$. The resulting network had 12 undirected edges and was verified as forming a single connected component as per the assumptions in Sec. III. Additionally, a randomly generated initial $z^{(0)}$ was also used for all problem instances.
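A graph of this type can be drawn as follows (a minimal sketch, not the authors' simulation code, which simply redraws until a single connected component is obtained):

```python
import numpy as np

def erdos_renyi(n, p, rng):
    A = rng.random((n, n)) < p
    A = np.triu(A, 1)                    # upper triangle only: no self loops
    return (A | A.T).astype(int)         # symmetrize: undirected simple graph

def is_connected(A):
    n, seen, stack = len(A), {0}, [0]
    while stack:
        for j in np.nonzero(A[stack.pop()])[0]:
            if int(j) not in seen:
                seen.add(int(j)); stack.append(int(j))
    return len(seen) == n

rng = np.random.default_rng(0)
N = 10
p = np.log(N) / N                        # connection probability log(N)/N
A = erdos_renyi(N, p, rng)
while not is_connected(A):               # redraw until a single connected component
    A = erdos_renyi(N, p, rng)
```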

Finally, the strong convexity parameter was set to $\mu = 10^{-3}$. For $m = 3, \cdots, 10$, 150 iterations of PDMM were performed and the resulting primal error computed. The squared Euclidean distance between the primal iterates and the primal optimal set was used as an error measure. Figure 2 demonstrates the convergence of this error with respect to iteration count. For each $m$ the step sizes $\rho$ were empirically selected to optimize convergence rate. Note that the finite precision stems from the use of MATLAB's fminunc function.

Fig. 2. The primal convergence of different $m$-norm$^m$ consensus problems for a 10 node Erdős-Rényi network.

Figure 3 further demonstrates that the choice of $\rho$ does not affect the guarantee of convergence, which in this instance was measured via the number of iterations required to reach an auxiliary precision of $10^{-5}$. This measure was chosen as the auxiliary error is monotonically decreasing with iteration count. In contrast, the primal error need not satisfy this point, as can be observed in Figure 2. Note that while there is a clear variation in the rate of convergence for different choices of $\rho$, the guarantee of convergence of the algorithms is unaffected.

B. Geometric Convergence of PDMM for Strongly Convex and Smooth Functions

The final simulations verify the geometric bound from Sec. V by comparing the convergence of multiple problem instances to (31). Specifically, $10^4$ random quadratic optimisation problems were generated, each of the form

$$\begin{aligned}
\min_{x} \;\; & \sum_{i \in V} \left( \frac{1}{2}x_i^TQ_ix_i - q_i^Tx_i \right) \\
\text{s.t.} \;\; & x_i - x_j = 0 \quad \forall (i, j) \in E.
\end{aligned}$$


Fig. 3. A comparison of the required iterations for $\|z^{(k)} - \bar{z}\|^2 \leq 10^{-5}$ for various step sizes ($\rho$). The step size is plotted on a log scale to better demonstrate the convergence characteristics of the different problems.

For each problem, the local variables were configured so that $x_i \in \mathbb{R}^3 \;\forall i \in V$ and the resulting objective was paired with a random 10 node Erdős-Rényi network. The connection probability of each network was set to $\frac{\log(N)}{N}$ and the networks were verified as forming single connected components.

For each problem instance, the matrices $Q_i \succ 0$ were generated in such a way that a constant convergence rate bound was achieved. In this case the contraction factor of this rate bound was specified as $\gamma = 0.9$. Furthermore, the initial vector $z^{(0)}$ was generated randomly and for each the associated $\bar{z}$ was computed as per Eq. (17). This randomization procedure was implemented so that $\frac{\sigma^2_{\max}(C)}{\mu^2\gamma}\|z^{(0)} - \bar{z}\|^2 = 1$ for all instances.

For each problem instance, a total of 120 iterations of PDMM were performed and the auxiliary errors, $\|z^{(k)} - \bar{z}\|^2$ for $k$ even and $\|z^{(k)} - T_{P,\rho}(\bar{z})\|^2$ for $k$ odd, were computed.

The distribution of the resulting data is demonstrated in Figure 4 which highlights the spread of the convergence curves across all problem instances.

Fig. 4. Convergence of simulated PDMM problem instances. From top to bottom, the solid green line denotes the convergence rate bound while the remaining 5 lines denote the 100%, 75%, 50%, 25% and 0% quantiles respectively.

As expected, (30) provides a strict upper bound for all problem instances, with the smoothness of the curves stemming from the linear nature of the PDMM update equations. Furthermore, the rate of the worst case sequence (100% quantile) does not exceed that of the bound. Interestingly, while (30) holds for the worst case functions, most problem instances exhibit far faster convergence. This suggests that, for more restrictive problem classes, stronger bounds may exist.

VII. CONCLUSIONS

In this paper we have presented a novel derivation of the node-based distributed algorithm termed the primal-dual method of multipliers (PDMM). Unlike existing efforts within the literature, monotone operator theory was used for this purpose, providing a succinct derivation of PDMM while highlighting the relationship between it and other existing first order methods such as PR splitting and ADMM. Using this derivation, primal convergence was demonstrated for strongly convex, differentiable functions and, in the case of strongly convex functions with Lipschitz continuous gradients, a geometric primal convergence bound was presented. This is despite the loss of a full row-rank assumption required by existing approaches and is a first for PDMM. In conclusion, the demonstrated results unify PDMM with existing solvers in the literature while providing new insight into its operation and convergence characteristics.

APPENDIX

A. Proof of Lemma III.1

As $R_{T_1,\rho} = 2J_{T_1,\rho} - I$, we begin by defining a method for computing the update $\lambda^{(k+1)} = J_{T_1,\rho}\!\left(z^{(k)}\right)$. Firstly, by the definition of the resolvent,

$$\lambda^{(k+1)} = (I + \rho T_1)^{-1}\!\left(z^{(k)}\right) \;\Longrightarrow\; \lambda^{(k+1)} \in z^{(k)} - \rho T_1\!\left(\lambda^{(k+1)}\right).$$

From the definition of the operator $T_1$, it follows that

$$\lambda^{(k+1)} \in z^{(k)} - \rho\!\left(C\partial f^*\!\left(C^T\lambda^{(k+1)}\right) - d\right).$$

Let $x \in \partial f^*\!\left(C^T\lambda\right)$. For $f \in \Gamma_0$, it follows from Proposition 16.10 [46], that $x \in \partial f^*\!\left(C^T\lambda\right) \iff \partial f(x) \ni C^T\lambda$ so that

$$\lambda^{(k+1)} = z^{(k)} - \rho\!\left(Cx^{(k+1)} - d\right), \qquad C^T\lambda^{(k+1)} \in \partial f\!\left(x^{(k+1)}\right).
\tag{32}$$

Thus, $x^{(k+1)}$ can be computed as

$$x^{(k+1)} = \arg\min_x \left( f(x) - \left\langle C^Tz^{(k)}, x \right\rangle + \frac{\rho}{2}\|Cx - d\|^2 \right).$$

Combining (32) with the fact that $y^{(k+1)} = \left(2J_{T_1,\rho} - I\right)\!\left(z^{(k)}\right) = 2\lambda^{(k+1)} - z^{(k)}$ completes the proof.


B. Proof of Lemma III.2

As $R_{T_2,\rho} = 2J_{T_2,\rho} - I$, we again begin by defining a method for computing the update $J_{T_2,\rho}\!\left(y^{(k+1)}\right)$. From [48, Eq. 1.3], the resolvent of $\iota_{\ker(I-P)}$ is given by

$$J_{T_2,\rho}\!\left(y^{(k+1)}\right) = \Pi_{\ker(I-P)}\!\left(y^{(k+1)}\right).$$

It follows that the reflected resolvent can be computed as

$$z^{(k+1)} = \left(2\Pi_{\ker(I-P)} - I\right)\!\left(y^{(k+1)}\right) = Py^{(k+1)},$$

completing the proof.

C. Proof of Lemma IV.1

Reconsider the auxiliary PDMM updates given in Eq. (9). Substituting (9) into (16), it follows that

$$\begin{aligned}
\|z^{(k+1)} - z^*\|^2 &= \left\| P\!\left(z^{(k)} - z^* - 2\rho C\!\left(x^{(k+1)} - x^*\right)\right) \right\|^2 \\
&= \left\| z^{(k)} - z^* - 2\rho C\!\left(x^{(k+1)} - x^*\right) \right\|^2 \\
&= \|z^{(k)} - z^*\|^2 - 4\rho\left\langle \lambda^{(k+1)} - \lambda^*, C\!\left(x^{(k+1)} - x^*\right) \right\rangle \\
&\leq \|z^{(k)} - z^*\|^2,
\end{aligned}
\tag{33}$$

where the penultimate line uses the dual update in Lemma III.1 and the final line uses the nonexpansiveness of TP,ρ.

As $C^T\lambda = \nabla f(x)$ (32), by Definition IV.3 it follows that

$$\left\langle \lambda_1 - \lambda_2, C(x_1 - x_2) \right\rangle \geq \mu\|x_1 - x_2\|^2 \quad \forall x_1 \neq x_2,\; \Pi_{\mathrm{ran}(C)}(\lambda_1) \neq \Pi_{\mathrm{ran}(C)}(\lambda_2).
\tag{34}$$

Recursively applying (33) and by using (34), it follows that

$$\lim_{k \to \infty} 4\rho\sum_{i=1}^{k}\mu\|x^{(i)} - x^*\|^2 \leq \|z^{(0)} - z^*\|^2 - \lim_{k \to \infty}\|z^{(k)} - z^*\|^2,$$

so that $\left(\|x^{(k)} - x^*\|^2\right)_{k \in \mathbb{N}}$ is finitely summable. If $\left(\|x^{(k)} - x^*\|^2\right)_{k \in \mathbb{N}}$ is non-zero infinitely often then $\lim_{k \to \infty}\|x^{(k)} - x^*\|^2 = 0$ and thus $\lim_{k \to \infty} x^{(k)} = x^*$.

To demonstrate this point, note that if $\exists k \mid x^{(k+2)} = x^{(k+1)} = x^*$ then by the two-step PDMM update given in (15), $z^{(k+2)} = z^{(k)}$. Thus, $\forall M \geq 1$ the same primal updates will be computed so that $x^{(k+M)} = x^{(k+M-1)} = x^*$. Any $z^{(k)}$ which produces two successive primal optimal updates therefore guarantees primal convergence. Thus, given our assumptions on $f$, any sequence which does not guarantee primal convergence in finite iterations has to be non-zero infinitely often so that $\lim_{k \to \infty} x^{(k)} = x^*$. As $\nabla f$ is single-valued, it also follows that $\lim_{k \to \infty} \Pi_{\mathrm{ran}(C)}\!\left(\lambda^{(k)}\right) = \Pi_{\mathrm{ran}(C)}(\lambda^*)$.

D. Proof of Lemma V.1

Under the assumption that $f \in \Gamma_0$ is $\mu$-strongly convex and $\nabla f$ is $\beta$-Lipschitz, from Theorem 18.15 [46], $f^*$ is both $\frac{1}{\beta}$-strongly convex and $\frac{1}{\mu}$-smooth. It follows that $\nabla f^*$ is both $\frac{1}{\beta}$-strongly monotone and $\frac{1}{\mu}$-Lipschitz continuous.

In the case of (i), due to the Lipschitz continuity of $\nabla f^*$,

$$\left\| C\!\left( \nabla f^*(C^Tz_1) - \nabla f^*(C^Tz_2) \right) \right\| \leq \sigma_{\max}(C)\left\| \nabla f^*(C^Tz_1) - \nabla f^*(C^Tz_2) \right\| \leq \frac{\sigma_{\max}(C)}{\mu}\left\| C^T(z_1 - z_2) \right\| \leq \frac{\sigma^2_{\max}(C)}{\mu}\|z_1 - z_2\|.$$

Therefore, $C\nabla f^*(C^T\bullet)$ is $\frac{\sigma^2_{\max}(C)}{\mu}$-Lipschitz continuous. In the case of (ii), due to the strong monotonicity of $\nabla f^*$,

$$\left\langle C\!\left( \nabla f^*(C^Tz_1) - \nabla f^*(C^Tz_2) \right), z_1 - z_2 \right\rangle \geq \frac{\left\| C^T(z_1 - z_2) \right\|^2}{\beta}.$$

For all $z_1, z_2 \in \mathrm{ran}(C)$ it follows that

$$\frac{\left\| C^T(z_1 - z_2) \right\|^2}{\beta} \geq \frac{\sigma^2_{\min\neq 0}(C)\|z_1 - z_2\|^2}{\beta},$$

completing the proof.

E. Proof of Lemma V.2

Consider the two by two matrix

$$G_i = \begin{bmatrix} \delta^2 + \delta(1-\delta)S_i^2 & -(1-\delta)C_iS_i \\ -\delta(1-\delta)C_iS_i & \delta + (1-\delta)C_i^2 \end{bmatrix}.$$

The squared singular values of this matrix are given by the eigenvalues of the matrix

$$G_i^TG_i = \begin{bmatrix} \delta^4 + \delta^2(1-\delta^2)S_i^2 & -\delta(1-\delta^2)C_iS_i \\ -\delta(1-\delta^2)C_iS_i & \delta^2 + (1-\delta^2)C_i^2 \end{bmatrix}.
\tag{35}$$

The eigenvalues of (35) can be computed via its trace and determinant. With some manipulation, these are given by

$$\mathrm{tr}\!\left(G_i^TG_i\right) = 2\delta^2 + (1-\delta^2)^2C_i^2, \qquad \det\!\left(G_i^TG_i\right) = \delta^4.$$

It follows that the squared singular values of $G_i$ are given by

$$\sigma^2(G_i) = \frac{\mathrm{tr}\!\left(G_i^TG_i\right)}{2} \pm \sqrt{ \frac{\mathrm{tr}\!\left(G_i^TG_i\right)^2}{4} - \det\!\left(G_i^TG_i\right) } = \delta^2 + (1-\delta^2)C_i\left( \frac{(1-\delta^2)C_i}{2} \pm \sqrt{ \frac{(1-\delta^2)^2C_i^2}{4} + \delta^2 } \right),$$

completing the proof.
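A quick numerical check of Lemma V.2 (our own sketch, comparing the closed form against a direct SVD for a randomly drawn $\delta$ and principal angle):

```python
import numpy as np

rng = np.random.default_rng(1)
delta, theta = rng.uniform(0, 1), rng.uniform(1e-3, np.pi / 2)
C, S = np.cos(theta), np.sin(theta)

Gi = np.array([[delta**2 + delta * (1 - delta) * S**2, -(1 - delta) * C * S],
               [-delta * (1 - delta) * C * S, delta + (1 - delta) * C**2]])

# closed-form squared singular values from Lemma V.2
t = (1 - delta**2) * C
closed = delta**2 + t * (t / 2 + np.array([1.0, -1.0]) * np.sqrt(t**2 / 4 + delta**2))

assert np.allclose(np.sort(closed),
                   np.sort(np.linalg.svd(Gi, compute_uv=False)**2))
```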

REFERENCES

[1] R. Hanna, A. Rohm, and V. L. Crittenden, “We’re all connected: The power of the social media ecosystem,” Business horizons, vol. 54, no. 3, pp. 265–273, 2011.

[2] L. Atzori, A. Iera, and G. Morabito, “The Internet of Things: A survey,” Computer networks, vol. 54, no. 15, pp. 2787–2805, 2010.

[3] D. Estrin, L. Girod, G. Pottie, and M. Srivastava, “Instrumenting the world with wireless sensor networks,” in Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE Int. Conf. on, vol. 4. IEEE, 2001, pp. 2033–2036.

[4] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: a survey,” Computer networks, vol. 38, no. 4, pp. 393– 422, 2002.

[5] A. Swami, Q. Zhao, Y.-W. Hong, and L. Tong, Wireless sensor networks: Signal processing and communications. John Wiley & Sons, 2007.

[6] A. Mainwaring, D. Culler, J. Polastre, R. Szewczyk, and J. Anderson, “Wireless sensor networks for habitat monitoring,” in Proceedings of the 1st ACM Int. workshop on Wireless sensor networks and applications. ACM, 2002, pp. 88–97.

[7] A. Cerpa, J. Elson, D. Estrin, L. Girod, M. Hamilton, and J. Zhao, “Habitat monitoring: Application driver for wireless communications technology,” ACM SIGCOMM Computer Communication Review, vol. 31, no. 2 supplement, pp. 20–41, 2001.

[8] V. C. Gungor, B. Lu, and G. P. Hancke, “Opportunities and challenges of wireless sensor networks in smart grid,” IEEE Trans. on industrial electronics, vol. 57, no. 10, pp. 3557–3564, 2010.

[9] F. Blaabjerg, R. Teodorescu, M. Liserre, and A. Timbus, “Overview of control and grid synchronization for distributed power generation systems,” IEEE Trans. Industrial Electronics, vol. 53, no. 5, pp. 1398– 1409, 2006.

[10] M. Erol-Kantarci and H. T. Mouftah, “Wireless sensor networks for cost-efficient residential energy management in the smart grid,” IEEE Trans. on Smart Grid, vol. 2, no. 2, pp. 314–325, 2011.

[11] C. R. Baker, K. Armijo, S. Belka, M. Benhabib, V. Bhargava, N. Burkhart, A. Der Minassians, G. Dervisoglu, L. Gutnik, M. B. Haick et al., “Wireless sensor networks for home health care,” in Advanced Information Networking and Applications Workshops, 2007, AINAW’07. 21st Int. Conf. on, vol. 2. IEEE, 2007, pp. 832–837.

[12] H. Alemdar and C. Ersoy, “Wireless sensor networks for healthcare: A survey,” Computer Networks, vol. 54, no. 15, pp. 2688–2710, 2010.

[13] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. Tsitsiklis, “On distributed averaging algorithms and quantization effects,” IEEE Trans. on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.

[14] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized gossip algorithms,” IEEE/ACM Trans. on Networking (TON), vol. 14, no. SI, pp. 2508–2530, 2006.

[15] F. Bénézit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli, “Weighted gossip: Distributed averaging using non-doubly stochastic matrices,” in Information theory proceedings (ISIT), 2010 IEEE Int. symposium on. IEEE, 2010, pp. 1753–1757.

[16] K. Murphy, Y. Weiss, and M. Jordan, “Loopy belief propagation for approximate inference: An empirical study,” in Proceedings of the Fifteenth Conf. on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 467–475.

[17] A. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun, “Distributed message passing for large scale graphical models,” in Computer vision and pattern recognition (CVPR), 2011 IEEE Conf. on. IEEE, 2011, pp. 1833–1840.

[18] Y. Weiss and W. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 736–744, 2001.

[19] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Mag., vol. 30, no. 3, pp. 83–98, 2013. [20] A. Loukas, A. Simonetto, and G. Leus, “Distributed autoregressive

moving average graph filters,” IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1931–1935, 2015.

[21] E. Isufi, A. Simonetto, A. Loukas, and G. Leus, “Stochastic graph filtering on time-varying graphs,” in Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015 IEEE 6th Int. Workshop on. IEEE, 2015, pp. 89–92.

[22] Z. Luo and W. Yu, “An introduction to convex optimization for communications and signal processing,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 8, pp. 1426–1438, 2006.

[23] G. Zhang and R. Heusdens, “Distributed optimization using the primal-dual method of multipliers,” IEEE Trans. on Signal and Information Processing over Networks, 2016, accepted for publication.

[24] ——, “On simplifying the primal-dual method of multipliers,” in 2016 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4826–4830.

[25] R. Rockafellar, “Network flows and monotropic optimization,” 1984. [26] ——, Convex analysis. Princeton, NJ: Princeton University Press, 1970. [27] ——, “Monotone operators and the proximal point algorithm,” SIAM

journal on control and Opt., vol. 14, no. 5, pp. 877–898, 1976. [28] ——, “On the maximal monotonicity of subdifferential mappings,”

Pacific Journal of Mathematics, vol. 33, no. 1, pp. 209–216, 1970. [29] D. Bertsekas and J. Tsitsiklis, Parallel and distributed computation:

numerical methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23. [30] J. Tsitsiklis, “Problems in decentralized decision making and computation.” Massachusetts Inst of Tech Cambridge lab for information and decision systems, Tech. Rep., 1984.

[31] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Trans. on automatic control, vol. 31, no. 9, pp. 803–812, 1986. [32] J. Eckstein, “Splitting methods for monotone operators with applications

to parallel optimization,” Ph.D. dissertation, Massachusetts Institute of Technology, 1989.

[33] ——, “Parallel alternating direction multiplier decomposition of convex programs,” Journal of Opt. Theory and Applications, vol. 80, no. 1, pp. 39–62, 1994.

[34] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[35] M. Zhu and S. Martínez, “On distributed convex optimization under inequality and equality constraints,” IEEE Trans. on Automatic Control, vol. 57, no. 1, pp. 151–164, 2012.

[36] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.

[37] P. Bianchi, W. Hachem, and F. Iutzeler, “A stochastic coordinate descent primal-dual algorithm and applications,” in Machine Learning for Signal Processing (MLSP), 2014 IEEE Int. Workshop on. IEEE, 2014, pp. 1–6. [38] P. Latafat and P. Patrinos, “Asymmetric forward-backward-adjoint splitting for solving monotone inclusions involving three operators,” Computational Opt. and Applications, pp. 1–37, 2016.

[39] L. Condat, “A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms,” Journal of Opt. Theory and Applications, vol. 158, no. 2, pp. 460–479, 2013. [40] B. Vũ, “A splitting algorithm for dual monotone inclusions involving

cocoercive operators,” Advances in Computational Mathematics, vol. 38, no. 3, pp. 667–681, 2013.

[41] G. Scutari, F. Facchinei, and L. Lampariello, “Parallel and distributed methods for constrained nonconvex optimization-Part I: Theory,” IEEE Trans. Signal Processing, vol. 65, no. 8, pp. 1929–1944, 2017.

[42] G. Scutari, F. Facchinei, L. Lampariello, S. Sardellitti, and P. Song, “Parallel and distributed methods for constrained nonconvex optimization-Part II: Applications in communications and machine learning,” IEEE Trans. on Signal Processing, vol. 65, no. 8, pp. 1945–1960, 2017.

[43] J. Eckstein and W. Yao, “Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results,” RUTCOR Research Reports, vol. 32, p. 3, 2012.

[44] P. Giselsson and S. Boyd, “Linear convergence and metric selection for Douglas-Rachford splitting and ADMM,” IEEE Trans. on Automatic Control, vol. 62, no. 2, pp. 532–544, 2017.

[45] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the ADMM in decentralized consensus optimization.” IEEE Trans. Signal Processing, vol. 62, no. 7, pp. 1750–1761, 2014. [46] H. Bauschke and P. Combettes, Convex analysis and monotone operator

theory in Hilbert spaces. New York, NY.: Springer NY, 2017, vol. 408. [47] E. Ryu and S. Boyd, “Primer on monotone operator methods,” Appl.

Comput. Math, vol. 15, no. 1, pp. 3–43, 2016.

[48] N. Parikh, S. P. Boyd et al., “Proximal algorithms.” Foundations and Trends in Opt., vol. 1, no. 3, pp. 127–239, 2014.

[49] P. Bianchi, W. Hachem, and F. Iutzeler, “A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization,” arXiv:1407.0898, 2014.

[50] J. Eckstein and D. Bertsekas, “On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators,” Math. Programming, vol. 55, no. 1-3, pp. 293–318, 1992. [51] W. Deng and W. Yin, “On the global and linear convergence of the

generalized alternating direction method of multipliers,” Journal of Scientific Computing, vol. 66, no. 3, pp. 889–916, 2016.

[52] M. Fält and P. Giselsson, “Optimal convergence rates for generalized alternating projections,” arXiv preprint arXiv:1703.10547, 2017.

[53] H. Bauschke, J. Cruz, T. Nghia, H. Phan, and X. Wang, “Optimal rates of linear convergence of relaxed alternating projections and generalized Douglas-Rachford methods for two subspaces,” Numerical Algorithms, vol. 73, no. 1, pp. 33–76, 2016.

[54] P. Erdős and A. Rényi, “On the evolution of random graphs,” Publ. Math. Inst. Hung. Acad. Sci., vol. 5, no. 1, pp. 17–60, 1960.
