Decompounding discrete distributions: A nonparametric Bayesian approach
Gugushvili, Shota; Mariucci, Ester; van der Meulen, Frank
DOI: 10.1111/sjos.12413
Publication date: 2019
Document version: Final published version
Published in: Scandinavian Journal of Statistics

Citation (APA): Gugushvili, S., Mariucci, E., & van der Meulen, F. (2019). Decompounding discrete distributions: A nonparametric Bayesian approach. Scandinavian Journal of Statistics, 47(2), 464-492. https://doi.org/10.1111/sjos.12413
ORIGINAL ARTICLE
Decompounding discrete distributions:
A nonparametric Bayesian approach
Shota Gugushvili¹, Ester Mariucci², Frank van der Meulen³

¹ Biometris, Wageningen University & Research
² Institut für Mathematik, Universität Potsdam
³ Delft Institute of Applied Mathematics, Delft University of Technology

Correspondence: Ester Mariucci, Institut für Mathematik, Universität Potsdam, Karl-Liebknecht-Str. 24-25, D-14476 Potsdam OT Golm, Germany. Email: mariucci@ovgu.de

Funding information: Deutsche Forschungsgemeinschaft, CRC 1294 Data Assimilation, DFG 314838170, GRK 2297 MathCoRe; European Research Council, ERC Grant Agreement 320637
Abstract
Suppose that a compound Poisson process is observed discretely in time, and assume that its jump distribution is supported on the set of natural numbers. In this paper we propose a nonparametric Bayesian approach to estimate the intensity of the underlying Poisson process and the distribution of the jumps. We provide a Markov chain Monte Carlo scheme for obtaining samples from the posterior. We apply our method to both simulated and real data examples, and compare its performance with the frequentist plug-in estimator proposed by Buchmann and Grübel. On the theoretical side, we study the posterior from the frequentist point of view and prove that, as the sample size n → ∞, it contracts around the "true," data-generating parameters at the rate 1/√n, up to a log n factor.
KEYWORDS
compound Poisson process, data augmentation, Diophantine equation, Gibbs sampler, Metropolis-Hastings algorithm, nonparametric Bayesian estimation
1 INTRODUCTION

1.1 Problem formulation
Let N = (N_t : t ≥ 0) be a Poisson process with a constant intensity λ > 0, and let (Y_i) be a sequence of independent random variables, each with distribution P, that are also independent of N. By definition, a compound Poisson process (abbreviated CPP) X = (X_t : t ≥ 0) is

\[ X_t = \sum_{j=1}^{N_t} Y_j, \tag{1} \]

where here and below the sum over an empty index set is understood to be equal to zero. In particular, X_0 = 0. The CPP constitutes a classical model in, for example, risk theory; see Embrechts, Mikosch, and Klüppelberg (1997).

(This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2019 The Authors. Scandinavian Journal of Statistics published by John Wiley & Sons Ltd on behalf of The Board of the Foundation of the Scandinavian Journal of Statistics.)
Assume that the process X is observed at the discrete times 0 < t_1 < t_2 < … < t_n = T, where the instants t_i are not necessarily equidistant on [0, T]. Based on the observations X_{t_1}, X_{t_2}, …, X_{t_n}, our goal is to estimate the jump size distribution P and the intensity λ. We specifically restrict our attention to the case where P is a discrete distribution, P(ℕ) = 1, and we will write p = (p_k)_{k∈ℕ} for the probability mass function corresponding to P, where p_k = P({k}). A similar notation will be used for any other discrete law. The distribution P is called the base distribution. Abusing terminology, we will at times identify it with the corresponding probability mass function p. The assumption that P has no atom at zero is made for identifiability: otherwise this atom gets confounded with e^{−λ}, which does not allow consistent estimation of the intensity λ. For a discussion of applications of this CPP model in risk theory, see Zhang, Liu, and Li (2014).
Define the increments Z_i = X_{t_i} − X_{t_{i−1}}, i = 1, …, n. Then 𝒵_n = (Z_i : i = 1, …, n) is a sequence of independent random variables. When the {t_i} are equidistant on [0, T], the random variables Z_i in fact have a common distribution Q satisfying Q(ℕ_0) = 1. As 𝒵_n carries as much information as (X_{t_i} : i = 1, …, n) does, we can base our estimation procedure directly on the increments 𝒵_n. Since summing up the jumps Y_j amounts to compounding their distributions, the inverse problem of recovering P and λ from the Z_i can be referred to as decompounding (Buchmann & Grübel, 2003).
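To make the observation scheme concrete, the data-generating mechanism just described is easy to simulate. The following Python sketch (the paper's own implementation is in Julia; the function name and interface are ours) draws increments Z_i for a given intensity λ, base distribution p, and spacings Δ_i.

```python
import numpy as np

def simulate_cpp_increments(rng, lam, support, probs, deltas):
    """Simulate increments Z_i = X_{t_i} - X_{t_{i-1}} of a compound Poisson
    process with intensity lam and discrete jump law P(Y = support[j]) = probs[j].
    Each Z_i is a sum of a Poisson(lam * Delta_i) number of i.i.d. jumps."""
    z = np.empty(len(deltas), dtype=int)
    for i, d in enumerate(deltas):
        n_jumps = rng.poisson(lam * d)   # number of arrivals in (t_{i-1}, t_i]
        z[i] = rng.choice(support, size=n_jumps, p=probs).sum()
    return z
```

For instance, with λ = 2 and the uniform base distribution on {1, 4, 6} used in Section 3.1, the increments have mean λΔ·E[Y] = 2·(11/3) per unit of time.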
There are two natural ways to parameterize the CPP model: either in terms of the pair (λ, p), or in terms of the Lévy measure ν = (ν_k)_{k∈ℕ} of the process X (Sato, 2013). The relationship between the two is λ = Σ_{k=1}^∞ ν_k and p = ν/λ. Inferential conclusions in one parameterization can easily be translated into the other. However, for our specific statistical approach the Lévy measure parameterization turns out to be more advantageous from the computational point of view.
1.2 Approach and results
In this paper, we take a nonparametric Bayesian approach to estimation of the Lévy measure ν of X; see Ghosal and van der Vaart (2017) and Müller, Quintana, Jara, and Hanson (2015) for modern expositions of Bayesian nonparametrics. A case for nonparametric Bayesian methods has already been made elsewhere in the literature and will not be repeated here. On the practical side, we implement our procedure via the Gibbs sampler and data augmentation, and show that it performs well under various simulation setups. On the theoretical side, we establish its consistency and derive the corresponding posterior contraction rate, which can be thought of as an analogue of the convergence rate of a frequentist estimator (Ghosal & van der Vaart, 2017). The posterior contraction rate, up to a practically insignificant log n factor, turns out to be 1/√n, which is the optimal rate for nonparametric estimation of cumulative distribution functions. Our contribution thus bridges practical and theoretical aspects of Bayesian nonparametrics.
1.3 Related literature
To better motivate our model and approach, in this subsection we briefly survey the existing literature. A Bayesian approach to nonparametric inference for Lévy processes is a very recent and emerging topic, with references at the moment limited to Belomestny, Gugushvili, Schauer, and Spreij (2019), Gugushvili, van der Meulen, and Spreij (2015, 2018), and Nickl and Söhl (2017). These deal exclusively with the case where the Lévy measure is absolutely continuous with respect to the Lebesgue measure. At least from the computational point of view, these works are of no help in our present context.
Related frequentist papers for CPP models with discrete base distributions are Buchmann and Grübel (2003, 2004), which, after earlier contributions dating from the previous century, in fact revived interest in nonparametric techniques for Lévy processes. To estimate the base distribution p, Buchmann and Grübel (2003) employ a frequentist plug-in approach relying on the Panjer recursion (i.e., an empirical cumulative distribution estimate of q is plugged into the Panjer recursion equations to yield an estimate of p; see below on the Panjer recursion). The drawback is that the parameter estimates are not guaranteed to be nonnegative. Buchmann and Grübel (2004) fix this problem by truncation and renormalization. This works, but looks artificial. As noted in Buchmann and Grübel (2004), in practice the latter approach breaks down if no zero values are observed among the Z_i. Buchmann and Grübel (2004) establish weak convergence of their modified estimator, but on the downside its asymptotic distribution is too unwieldy to give confidence statements on p. Most importantly, the plug-in approaches in Buchmann and Grübel (2003) and Buchmann and Grübel (2004) do not allow obvious generalizations to nonequidistant observation times {t_i}. In Lindo, Zuyev, and Sagitov (2018), another frequentist estimator of the jump measure is introduced, obtained via a steepest descent technique as a solution to an optimization problem over the cone of positive measures. The emphasis in Lindo et al. (2018) is on numerical aspects; again, no obvious generalization to the case of nonequidistant {t_i} is available.
Finally, some important, predominantly theoretical references on inference for Lévy processes are Comte and Genon-Catalot (2011), Duval and Hoffmann (2011), Kappus (2014), Neumann and Reiß (2009), Nickl and Reiß (2012), van Es, Gugushvili, and Spreij (2007) and Trabs (2015). We refer to Belomestny, Comte, Genon-Catalot, Masuda, and Reiß (2015), Coca (2018a, 2018b), and Duval and Mariucci (2017) for more extensive literature surveys.
1.4 Outline
The rest of the paper is organized as follows: in Section 2 we introduce our approach and describe an algorithm for drawing from the posterior distribution. In Sections 3 and 4 we study its performance on synthetic and real data examples. Section 5 is devoted to the examination of asymptotic frequentist properties of our procedure. An outlook on our results is given in Section 6. Finally, Appendix A collects technical lemmas used in the proofs of Section 5, whereas Appendix B contains some additional simulation results.
1.5 Notation
For two sequences {a_n} and {b_n} of positive real numbers, the notation a_n ≲ b_n (or b_n ≳ a_n) means that there exists a constant C > 0, independent of n, such that a_n ≤ C b_n. We denote the prior (which may depend on the sample size n) by Π_n; the corresponding posterior measure is denoted by Π_n(· | 𝒵_n). The Gamma distribution with shape parameter a and rate parameter b (a, b > 0) is denoted by Gamma(a, b); its density is

\[ x \mapsto \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}, \quad x > 0, \]

where Γ is the Gamma function. The inverse Gamma distribution with shape parameter a and scale parameter b is denoted by IG(a, b); its density is

\[ x \mapsto \frac{b^a}{\Gamma(a)} x^{-a-1} e^{-b/x}, \quad x > 0. \]

We use the notation Exp(a) for the exponential distribution with mean 1/a. Finally, given a metric d on a set 𝒳 and ε > 0, the covering number N(ε, 𝒳, d) is defined as the minimal number of balls of radius ε needed to cover 𝒳.
2 ALGORITHM FOR DRAWING FROM THE POSTERIOR
A Bayesian statistical approach relies on combining the likelihood and the prior on the parameter of interest through Bayes' formula. We start by specifying the prior. As far as the likelihood is concerned, although explicit, it is computationally intractable for nonparametric inference in CPP models. We circumvent the latter problem by means of data augmentation, as detailed below.
2.1 Prior
We define a prior Π on ν through the hierarchical specification

\[ \nu_k \mid a, m, \beta_k \overset{\text{ind}}{\sim} \mathrm{Gamma}(a, 1/\beta_k) \cdot \mathbf{1}\{1 \le k \le m\}, \quad k \in \mathbb{N}, \]
\[ \beta_1, \dots, \beta_m \mid \gamma \overset{\text{i.i.d.}}{\sim} \mathrm{IG}(c, \gamma), \qquad \gamma \sim \mathrm{Exp}(1). \]
Note that the (fixed) hyperparameters m ∈ ℕ and a, c > 0 are denoted by Latin letters.
The hyperparameter m incorporates our a priori opinion on the support of the Lévy measure ν, or equivalently, of the base measure p. In applications, the support of p may be unknown, which necessitates the use of a large m, for example, m = max_{i=1,…,n} Z_i; this is the maximal value suggested by the data 𝒵_n at hand. Nevertheless, we may simultaneously expect that the "true," data-generating ν charges full mass only to a proper, perhaps even small, subset of the set {1, …, m}. In other words, ν may form a sparse sequence, with many components equal to zero. In fact, there are at least two plausible explanations for the occurrence of a large increment Z_i in the data: either a few large jumps Y_j occurred, which points toward a large right endpoint of the support of ν_0, or Z_i is predominantly formed of many small jumps, which in turn indicates that the intensity λ of the Poisson arrival process N may be large. To achieve accurate estimation results, a prior should take possible sparsity of ν into account. This is precisely the reason for our hierarchical definition of the prior Π: a small β_k encourages a priori shrinkage of the component ν_k of ν toward zero.
2.2 Data augmentation
Assume temporarily that t_i = i, i = 1, …, n, and write q = (q_k)_{k∈ℕ_0} for q_k = Q({k}). Then the Z_i have the distribution

\[ q = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} \, p^{*j}, \tag{2} \]
with ∗ denoting convolution. The compounding mapping (λ, p) → q can be expressed explicitly via the Panjer recursion (see Panjer, 1981):

\[ q_0 = e^{-\lambda}, \qquad q_k = \frac{\lambda}{k} \sum_{j=1}^{k} j \, p_j \, q_{k-j}, \quad k \in \mathbb{N}. \]
This recursion can be inverted to give the inverse mapping q → (λ, p) via

\[ \lambda = -\log q_0, \qquad p_k = -\frac{q_k}{q_0 \log q_0} - \frac{1}{k q_0} \sum_{j=1}^{k-1} j \, p_j \, q_{k-j}, \quad k \in \mathbb{N}. \]
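Both directions of the mapping are straightforward to code. The following Python sketch (function names are ours, not taken from the paper's Julia implementation) computes q from (λ, p) by the Panjer recursion and recovers (λ, p) from q by the inverted recursion; the round trip is exact up to floating-point error.

```python
import math

def panjer_forward(lam, p, K):
    """Compute q_0,...,q_K of the compound Poisson distribution from (lam, p),
    where p[k-1] = p_k = P(Y = k) for k = 1,...,len(p). Panjer recursion:
    q_0 = exp(-lam), q_k = (lam/k) * sum_{j=1}^{k} j * p_j * q_{k-j}."""
    q = [math.exp(-lam)]
    for k in range(1, K + 1):
        s = sum(j * p[j - 1] * q[k - j] for j in range(1, min(k, len(p)) + 1))
        q.append(lam / k * s)
    return q

def panjer_inverse(q):
    """Recover (lam, p_1,...,p_{K}) from q_0,...,q_K by inverting the recursion:
    lam = -log(q_0), p_k = q_k/(lam*q_0) - (1/(k*q_0)) sum_{j<k} j*p_j*q_{k-j}."""
    lam = -math.log(q[0])
    p = []
    for k in range(1, len(q)):
        s = sum(j * p[j - 1] * q[k - j] for j in range(1, k))
        p.append(q[k] / (lam * q[0]) - s / (k * q[0]))
    return lam, p
```

For example, starting from λ = 2 and the uniform base distribution on {1, 4, 6}, `panjer_inverse(panjer_forward(2.0, [1/3, 0, 0, 1/3, 0, 1/3], 12))` returns the original parameters.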
In view of (2), the likelihood in the CPP model is explicit. Nevertheless, an attempt to use (2) or the Panjer recursion directly in posterior computations results in a numerically intractable procedure. Equally important is the fact that a Panjer recursion-based approach would not apply to nonequidistant observation times {t_i}. Therefore, instead of (2) and the Panjer recursion, we will employ data augmentation (Tanner & Wong, 1987). We now switch back to the case where the values of {t_i} are not necessarily uniformly spaced. The details of our procedure are as follows: when the process X is observed continuously over the time interval [0, T], so that our observations are a full sample path X^(T) = (X_t : t ∈ [0, T]) of the CPP, the likelihood is tractable and is proportional to
\[ e^{-T \sum_{k=1}^{m} \nu_k} \, \prod_{k=1}^{m} \nu_k^{\mu_k}, \]
see Shreve (2004). Here
\[ \mu_k = \#\{j : Y_j = k\}, \tag{3} \]
that is, the total number of jumps of size k. Then the prior Π from Subsection 2.1 leads to conjugate posterior computations. In fact, the full conditionals are
\[
\begin{aligned}
\nu_k \mid \{\mu_k\}, \{\beta_k\} &\overset{\text{ind}}{\sim} \mathrm{Gamma}(a + \mu_k,\ 1/\beta_k + T), & k &= 1, \dots, m, \\
\beta_k \mid \{\nu_k\}, \gamma &\overset{\text{ind}}{\sim} \mathrm{IG}(a + c,\ \gamma + \nu_k), & k &= 1, \dots, m, \\
\gamma \mid \{\beta_k\} &\sim \mathrm{Gamma}\Big(cm + 1,\ 1 + \sum_{k=1}^{m} \beta_k^{-1}\Big).
\end{aligned}
\]
Therefore, the Gibbs sampler for posterior inference on ν can be implemented. The Gibbs sampler cycles through the above conditionals a large number of times, generating approximate (dependent) samples from the posterior. See, e.g., Gelfand and Smith (1990) and Section 24.5 in Wasserman (2004) on the Gibbs sampler and its use in Bayesian statistics.
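Under our notational conventions (Gamma parameterized by shape and rate, IG by shape and scale), one sweep through these full conditionals can be sketched in Python as follows. NumPy's gamma sampler takes a scale parameter, so rates are inverted, and the inverse Gamma draw is obtained as the reciprocal of a Gamma draw; variable names are ours.

```python
import numpy as np

def gibbs_sweep(rng, mu, T, beta, a=0.01, c=2.0):
    """One sweep through the full conditionals (continuous-record case):
    nu_k  | mu_k, beta_k  ~ Gamma(a + mu_k, rate = 1/beta_k + T),
    gamma | {beta_k}      ~ Gamma(c*m + 1,  rate = 1 + sum_k 1/beta_k),
    beta_k| nu_k, gamma   ~ IG(a + c, gamma + nu_k)."""
    m = len(mu)
    nu = rng.gamma(a + mu, 1.0 / (1.0 / beta + T))               # scale = 1/rate
    gamma = rng.gamma(c * m + 1.0, 1.0 / (1.0 + np.sum(1.0 / beta)))
    beta = 1.0 / rng.gamma(a + c, 1.0 / (gamma + nu))            # IG via reciprocal
    return nu, beta, gamma
```

With jump counts μ_k accumulated over a record of length T, the posterior mean of ν_k is approximately (a + μ_k)/(1/β_k + T) ≈ μ_k/T for large counts, which gives a quick sanity check on the sampler.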
As we do not observe the process X continuously, we will combine the above with the data augmentation device. First note that

\[ Z_i = \sum_{j=1}^{m} j \, \mu_{ij}, \]

where (μ_ij : i = 1, …, n, j = 1, …, m) are independent and μ_ij ~ Poisson(Δ_i ν_j) for ν_j = λ p_j and Δ_i = t_i − t_{i−1}; see Corollary 11.3.4 in Shreve (2004, p. 498). Furthermore, for μ_k as in (3) we trivially have μ_k = Σ_{i=1}^n μ_ik. Data augmentation iterates the following two steps:

(i) Draw (μ_ij) conditional on the data 𝒵_n and the parameter ν.
(ii) Draw ν conditional on (μ_ij).

Once the algorithm has been run long enough, this gives approximate (dependent) samples from the posterior of ν. We already know how to deal with step (ii); it remains to handle step (i).
Thus, keeping ν fixed, for each i we want to compute the conditional distribution of (μ_i1, …, μ_im) given Z_i, and, furthermore, we want to be able to simulate from this distribution. In turn, this will immediately allow us to simulate μ_k conditional on the data 𝒵_n. Now, with Pr(·) referring to probability under the parameter ν, it holds that

\[
\Pr(\mu_{i1} = k_1, \dots, \mu_{im} = k_m \mid Z_i = z_i)
= \frac{1}{\Pr(Z_i = z_i)} \, e^{-\Delta_i \sum_{j=1}^{m} \nu_j}
\prod_{j=1}^{m} \frac{(\Delta_i \nu_j)^{k_j}}{k_j!} \,
\mathbf{1}\Big\{\sum_{j=1}^{m} j k_j = z_i\Big\}.
\]

Knowledge of the normalizing constant Pr(Z_i = z_i) will not be needed in our approach.
In general, simulation from a discrete multivariate distribution is nontrivial; some general options are discussed in Devroye (1986, Chapter XI, Section 1.5), but they are unlikely to work easily for a large m. We will take an alternative route and use the Metropolis-Hastings algorithm; see, for example, Section 24.4 in Wasserman (2004). We start by observing that, for a fixed i, the support of Pr(· | Z_i = z_i) is precisely the set 𝒮_i of nonnegative solutions (k_1, …, k_m) of the Diophantine equation Σ_{j=1}^m j k_j = z_i. The R package nilde (see Pya Arnqvist, Voinov, & Voinov, 2018) implements an algorithm from Voinov and Nikulin (1997) that finds all such solutions for given integers m and z_i. By Markovianity of the process X, we can simulate the vectors (μ_i1, …, μ_im) independently for each i = 1, …, n. If z_i = 0 or 1, there is only one solution to the Diophantine equation: the trivial solution (0, …, 0) in the first case, and the solution (1, 0, …, 0) in the second; for such z_i no simulation is required, as (μ_i1, …, μ_im) is known explicitly. We thus only need to consider each i ∈ ℐ = {i : z_i ≠ 0, 1} in turn, and design a Metropolis-Hastings move on the set of the corresponding solutions 𝒮_i. Fix once and for all an ordering of the elements of 𝒮_i (this could be, e.g., lexicographic, or reverse lexicographic); we write |𝒮_i| for the cardinality of 𝒮_i. Let μ = (μ_i1, …, μ_im) be the current state of the chain, corresponding to the ℓth element s_ℓ of 𝒮_i. A proposal μ° = (μ°_i1, …, μ°_im) is obtained as follows:

(i) If ℓ = 1, draw μ° uniformly at random from the elements {s_2, s_{|𝒮_i|}} of 𝒮_i.
(ii) If ℓ = |𝒮_i|, draw μ° uniformly at random from the elements {s_1, s_{|𝒮_i|−1}} of 𝒮_i.
(iii) Otherwise, draw μ° uniformly at random from the elements {s_{ℓ−1}, s_{ℓ+1}} of 𝒮_i.

Occasionally, one may want to propose another type of move too:

(iv) Draw μ° = (μ°_i1, …, μ°_im) uniformly at random from 𝒮_i.
The two proposals lead to reversible moves, and one may alternate them with probabilities π and 1 − π, for example, π = 0.8. The logarithm of the acceptance probability of a move from (μ_i1, …, μ_im) to (μ°_i1, …, μ°_im) is computed as

\[
\log A = \sum_{k=1}^{m} (\mu^\circ_{ik} - \mu_{ik}) \log(\Delta_i \nu_k)
+ \sum_{k=1}^{m} \{\log(\mu_{ik}!) - \log(\mu^\circ_{ik}!)\}.
\]

The move is accepted if log U ≤ log A for U an independently generated uniform random variate on [0, 1], in which case the current state of the chain is reset to (μ°_i1, …, μ°_im). Otherwise the chain stays in (μ_i1, …, μ_im).
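To illustrate the imputation step, the following self-contained Python sketch enumerates the solution set 𝒮_i by brute force (the paper uses the R package nilde for this) and performs the Metropolis-Hastings move with the mixed neighbour/uniform proposal and the acceptance probability above. The function names and the brute-force enumeration are ours.

```python
import math
import random

def dioph_solutions(z, m):
    """All nonnegative integer vectors (k_1,...,k_m) with sum_j j*k_j = z.
    Brute-force recursion over the last coordinate; fine for small z."""
    if m == 1:
        return [(z,)]
    sols = []
    for k in range(z // m + 1):
        for rest in dioph_solutions(z - m * k, m - 1):
            sols.append(rest + (k,))
    return sols

def mh_step(sols, ell, nu, delta, pi=0.8):
    """One Metropolis-Hastings move on the (cyclically ordered) solution set
    `sols`; `ell` indexes the current state s_ell. Returns the new index."""
    n = len(sols)
    if random.random() < pi:
        prop = random.choice([(ell - 1) % n, (ell + 1) % n])   # neighbour move
    else:
        prop = random.randrange(n)                             # uniform move
    cur, new = sols[ell], sols[prop]
    # log A = sum_k (mu_new - mu_cur) log(delta*nu_k) + log(mu_cur!) - log(mu_new!)
    log_A = sum((new[k] - cur[k]) * math.log(delta * nu[k])
                + math.lgamma(cur[k] + 1) - math.lgamma(new[k] + 1)
                for k in range(len(nu)))
    if log_A >= 0 or random.random() < math.exp(log_A):
        return prop
    return ell
```

For example, with z_i = 3 and m = 3 the solution set is {(3,0,0), (1,1,0), (0,0,1)}, and the chain's empirical frequencies converge to the conditional weights proportional to Π_k (Δ_i ν_k)^{k_k}/k_k!.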
3 SIMULATION EXAMPLES
In this section, we test the performance of our approach in a range of representative simulation examples. For benchmarking, we use the frequentist plug-in estimator from Buchmann and Grübel (2004). Two real data examples are given in Section 4. Unless otherwise stated, we took c = 2 and a = 0.01 as hyperparameters in our prior specification. As can be seen from the update formulae for the Gibbs sampler, as long as a is not taken too large, its precise value has little influence on the posterior, given a reasonable sample size. The value c = 2 ensures that the update step for β_k has finite variance. At each step of updating the imputed data for increment size z, we chose with probability 0.2 to propose uniformly from all solutions to the Diophantine equation (for that particular value of z).
We implemented our procedure in Julia (Bezanson, Edelman, Karpinski, & Shah, 2017). The code and datasets for replication of our examples are available on GitHub¹ and Zenodo (Gugushvili, Mariucci, & van der Meulen, 2019).
3.1 Uniform base distribution
This simulation example follows, with some extensions, that in Buchmann and Grübel (2004). Let λ_0 = 2, and let P_0 be the discrete uniform distribution on {1, 4, 6}. We simulated data according to the following settings:
(a) n = 100, Δ_i = 1 for 1 ≤ i ≤ n;
(b) n = 500, Δ_i = 1 for 1 ≤ i ≤ n (the data under (a) are augmented with 400 extra observations);
(c) n = 500, Δ_i ~ Unif(0, 2) for 1 ≤ i ≤ n.
We set m = min(15, Z_(n)), where Z_(n) = max_{1≤i≤n} Z_i. In all cases this led to m = 15, as the value of Z_(n) was equal to 30, 35, and 40 for the simulated data under settings (a), (b), and (c), respectively. The Gibbs sampler was run for 500,000 iterations, of which the first 250,000 were discarded as burn-in. From the remaining samples, the posterior mean and the 2.5% and 97.5% percentiles were computed for each coefficient ν_k. The results for the first 10 coefficients are shown in Figure 1.

¹See https://github.com/fmeulen/Bdd

FIGURE 1  Simulation example from Section 3.1. In each panel, the horizontal axis gives k ∈ {1, …, 10} and the vertical axis the magnitudes of ν_k. The orange balls denote the true values, the black triangles the Buchmann-Grübel estimator. The blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals. The settings corresponding to (a), (b), and (c) are explained in the main text. Note the differences in vertical scale across the panels.
For comparison, the estimator from Buchmann and Grübel (2004) is also included in the figure. For setting (b), traceplots of every 50th iteration for a few of the coefficients ν_k are shown in Figure 2. We measure the error of an estimate ν̂ = (ν̂_k) by

\[ \mathrm{Err}(\nu, \hat{\nu}) = \sum_{k=1}^{\infty} |\hat{\nu}_k - \nu_k|. \]

The errors are reported in Table 1. In all settings, for these particular realizations of the simulated data, the Bayesian procedure outperformed the truncated estimator from Buchmann and Grübel (2004). For setting (c), the latter produces a poor result, as was to be expected, given that it is derived under the assumption Δ_i = 1 for all i. An advantage of the Bayesian procedure is the included measure of uncertainty, namely the credible intervals for ν_k. On the other hand, for the Buchmann-Grübel estimator it is hardly possible to derive confidence intervals via an asymptotic method, since the limiting distribution of the estimator is fairly complicated. Although not considered in the original publications Buchmann and Grübel (2003) and Buchmann and Grübel (2004), a natural alternative is the bootstrap. A detailed examination of the performance of the latter and its comparison to that of the Bayesian method lies beyond the scope of the present paper. Indeed, any thorough study would require, on the one hand, the asymptotic justification of bootstrap confidence intervals and, on the other hand, establishing frequentist coverage properties of our Bayesian procedure. In that respect, good performance of neither method is automatically warranted (e.g., van der Pas, Szabó, and van der Vaart (2017) and van der Vaart (1998), Chapter 23). Here instead we opt for a numerical illustration, which is reported in Appendix B.

FIGURE 2  Traceplots for the simulation example from Section 3.1 under setting (b). The posterior samples were subsampled, with every 50th iteration kept. The displayed results are for the parameters ν_1, ν_2, and ν_6.

TABLE 1  Estimation errors for scenarios (a)-(c) from Section 3.1

  Simulation setting          (a)     (b)     (c)
  Buchmann-Grübel estimator   1.40    0.32    1.44
  Posterior mean              0.15    0.07    0.12
3.2 Geometric base distribution
The setup of this synthetic data example likewise follows that in Buchmann and Grübel (2004). Assume Q is a geometric distribution with parameter α, that is, q_k = (1 − α)^k α for 0 < α < 1, k ∈ ℕ_0. Then λ = −log α, and

\[ p_k = -\frac{(1 - \alpha)^k}{k \log \alpha}, \quad k \in \mathbb{N}. \]

Hence, ν_k = (1 − α)^k / k.
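These closed-form expressions can be checked numerically: starting from the claimed (λ, p), the compound distribution (2) should return the Geom(α) law. A short Python check (the truncation levels and variable names are ours):

```python
import math

alpha = 1 / 3
K = 12                                    # work with pmfs truncated to {0,...,K}
lam = -math.log(alpha)                    # claimed intensity
# claimed base pmf: p_k = -(1 - alpha)^k / (k * log(alpha)), k >= 1
p = [0.0] + [-(1 - alpha) ** k / (k * math.log(alpha)) for k in range(1, K + 1)]

def convolve(u, v):
    """Convolution of two pmfs on {0,...,K}, truncated at K."""
    w = [0.0] * (K + 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            if i + j <= K:
                w[i + j] += ui * vj
    return w

# compound distribution (2): q = e^{-lam} * sum_j lam^j / j! * p^{*j};
# jumps are >= 1, so only terms with j <= K contribute on {0,...,K}
q = [0.0] * (K + 1)
p_star_j = [1.0] + [0.0] * K              # p^{*0} = point mass at 0
for j in range(K + 1):
    coef = math.exp(-lam) * lam ** j / math.factorial(j)
    q = [qi + coef * pi for qi, pi in zip(q, p_star_j)]
    p_star_j = convolve(p_star_j, p)
```

Up to floating-point error, the resulting q_k coincide with (1 − α)^k α, and summing ν_k = (1 − α)^k / k recovers λ = −log α (the Mercator series).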
We consider two simulation setups:

(a) n = 500, Δ_i = 1 for 1 ≤ i ≤ n, and α = 1/3;
(b) n = 500, Δ_i = 1 for 1 ≤ i ≤ n, and α = 1/6.

FIGURE 3  Simulation example from Section 3.2. Settings (a) and (b) correspond to the true jump distributions Geom(1/3) and Geom(1/6), respectively. In each panel, the horizontal axis gives k ∈ {1, …, 15} and the vertical axis the magnitudes of ν_k. The orange balls denote the true values, the black triangles the Buchmann-Grübel estimator. The blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals.
TABLE 2  Estimation errors for scenarios (a)-(b) from Section 3.2

  Simulation setting          (a)     (b)
  Buchmann-Grübel estimator   0.28    0.60
  Posterior mean              0.52    1.05
We set m = min(15, Z_(n)) and ran the sampler according to the settings of Section 3.1. The results for both scenarios (a) and (b) are reported in Figure 3. In Table 2 we also report the estimation errors in one simulation run. For this example and these generated data, the Bayesian procedure gives less precise point estimates than the Buchmann-Grübel method. Note that the estimation error for α = 1/3 is smaller than that for α = 1/6. This appears intuitive, as a smaller value of α corresponds to a larger value of λ. The latter implies that on average each Z_i is a superposition of a larger number of jumps, which renders the decompounding problem more difficult. However, this argument is hard to formalise.
3.3 Monte Carlo study
For a more thorough comparison of the Buchmann-Grübel estimator and our Bayesian method, we performed a small Monte Carlo experiment. We considered two settings:

(i) the setting from Section 3.1 with n = 250, where we took m = min(15, Z_(n));
(ii) the setting from Section 3.2 with α = 1/3, where we took m = min(20, Z_(n)).
FIGURE 4  Monte Carlo study from Subsection 3.3 comparing the Buchmann-Grübel estimator and the Bayesian method proposed in this paper: boxplots of the absolute (L1) errors for the settings "Uniform base", "Geometric base (c = 2)", and "Geometric base (c = 0.01)". Here "bg" refers to the Buchmann-Grübel estimator, while "bayesmedian" and "bayesmean" refer to the Bayesian method, where either the median or the mean was used as a point estimator for each ν_i. The leftmost panel corresponds to setting (i), whereas the other two panels correspond to setting (ii); in the latter we used both c = 2 and c = 0.01 in the prior specification.
In both cases we assumed Δ_i = 1 for all 1 ≤ i ≤ n. The number of Monte Carlo repetitions was taken equal to 50. We took 400,000 Markov chain Monte Carlo (MCMC) iterations and discarded the first half as burn-in. In Figure 4 we give a graphical display of the results by means of boxplots of the errors. Here, as before, if the true values are denoted by ν_k and the estimate within a particular simulation run by ν̂_k, the error is defined by Err(ν, ν̂) = Σ_{k=1}^∞ |ν̂_k − ν_k| (we truncated the infinite summation at 50). The results agree with our earlier findings, in that there is no clear "winner" in the comparison. Note that for setting (ii) we considered both c = 2 and c = 0.01 in the prior specification. Both values give similar performance of the Bayesian method, which provides insight into the sensitivity of our results with respect to the prior specification. A minor difference between the middle and rightmost panels of Figure 4 may be attributed to Monte Carlo error: the 50 simulated datasets on which these panels are based are not the same. Note that the prior promotes sparsity, and in that respect it is not surprising that it does better when the true data-generating Lévy measure is sparse.
3.4 Computing time
In terms of computational effort, the time it takes to evaluate the Buchmann-Grübel estimator is negligible compared to our algorithm for sampling from the posterior. This is not surprising, as that frequentist estimator relies on a plug-in approach, whereas in our case an approximation to the posterior is obtained by MCMC simulation. However, if proper uncertainty quantification is desired, then the Bayesian method is advantageous in the sense that it does not solely produce a point estimate.
Note that the proposed MCMC scheme requires determination of the solutions to the Diophantine equation Σ_{j=1}^m j k_j = z for all unique values z in the observation set. For moderate values of z, say z ≤ 30, this is rather quick, but for large values of z the computing time increases exponentially, as does the amount of allocated memory. The computing time of each Metropolis-Hastings step is then very small, but we potentially need a very large number of iterations to reach stationarity. The latter is caused, first, by the fact that at a particular iteration our proposals for μ_ij do not take into account the current values of ν_1, …, ν_m; second, the size of the state space that needs to be explored increases exponentially with m.
4 REAL DATA EXAMPLES

4.1 Horse kick data
To further illustrate our procedure, we will use the von Bortkewitsch data on the number of soldiers in the Prussian cavalry killed by horse kicks (available by year and by cavalry corps); this example was also employed in Buchmann and Grübel (2003). Each observation is an integer from 0 to 4, giving the number of deaths for a given year and a given cavalry corps, with overall counts reported in Table 3. The data are extracted from the table on p. 25 in von Bortkewitsch (1898). Note that von Bortkewitsch corrects for the fact that the Guards and the I, VI, and XI cavalry corps of the Prussian army had a different organization from the other units, and justifiably omits the corresponding counts from consideration.
It was demonstrated by von Bortkewitsch that the Poisson distribution fits the horse kick data remarkably well. Assuming instead that the observations follow a compound Poisson distribution is a stretch of the imagination, as that would correspond to a horse running amok and killing possibly more than one soldier in one go. Nevertheless, this example provides a good sanity check for our statistical procedure.
The estimation results are graphically depicted in Figure 5. Clearly, point estimates of both methods are in agreement and lend support to the Poisson model for this dataset.
4.2 Plant data
Our second real example is the one used in Buchmann and Grübel (2004). Consider the data in Table 4, taken from Evans (1953). The data were collected as follows: the area was divided into plots of equal size and in each plot the number of plants was counted; the number of plants in each
TABLE 3  Data on the number of soldiers in the Prussian cavalry killed by horse kicks. See the table on p. 25 in von Bortkewitsch (1898)

  Deaths   0     1    2    3   4
  Counts   109   65   22   3   1

FIGURE 5  Estimation results for the horse kick data from Subsection 4.1. The horizontal axis gives k ∈ {1, …, 4} and the vertical axis the magnitudes of ν_k. The black triangles denote the Buchmann-Grübel estimator, the blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals.
TABLE 4  Plant population data from Evans (1953)

  Plants   0     1    2    3    4    5    6    7   8   9   10   11   12
  Counts   274   71   58   36   20   12   10   7   6   3   0    2    1

FIGURE 6  Estimation results for the plant data from Subsection 4.2. The horizontal axis gives k ∈ {1, …, 10} and the vertical axis the magnitudes of ν_k. The black triangles denote the Buchmann-Grübel estimates, the blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals.
plot ranges from 0 to 12. The second row of Table 4 gives the counts of plots containing a given number of plants; thus, there were 274 plots that contained no plants, 71 that contained 1 plant, etc. It is customary in the ecological literature to model such count data as i.i.d. realizations from a compound Poisson distribution. Thus, for example, Neyman (1939) advocated the use of a Poisson base distribution in this context; another option here is a geometric base distribution. Given the existence of several distinct modeling possibilities, performing an exploratory nonparametric analysis appears to be a sensible strategy.
The estimation results are graphically depicted in Figure 6. There are some small differences between the posterior mean and the Buchmann-Grübel estimate, but overall they are very similar.
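To make the comparison concrete, a plug-in inversion in the spirit of the Buchmann-Grübel estimator (here without the truncation step that gives their method its name; this is an illustrative sketch, not the authors' implementation) can be applied directly to the empirical frequencies of Table 4: set $\hat{\lambda} = -\log \hat{q}_0$ and invert the Panjer recursion for $\nu_k = \lambda p_k$.

```python
import numpy as np

# Empirical frequencies q_k from Table 4 (500 plots in total).
counts = np.array([274, 71, 58, 36, 20, 12, 10, 7, 6, 3, 0, 2, 1])
q = counts / counts.sum()

# Invert the Panjer recursion q_k = (1/k) sum_{j=1}^k j nu_j q_{k-j}:
# nu_k = (k q_k - sum_{j=1}^{k-1} j nu_j q_{k-j}) / (k q_0), lambda = -log q_0.
lam_hat = -np.log(q[0])          # about 0.60
nu_hat = np.zeros(len(q))
for k in range(1, len(q)):
    s = sum(j * nu_hat[j] * q[k - j] for j in range(1, k))
    nu_hat[k] = (k * q[k] - s) / (k * q[0])

# Self-consistency check: running the Panjer recursion forward with nu_hat
# reproduces the empirical frequencies exactly.
q_back = np.zeros(len(q))
q_back[0] = np.exp(-lam_hat)
for k in range(1, len(q)):
    q_back[k] = sum(j * nu_hat[j] * q_back[k - j] for j in range(1, k + 1)) / k
print(np.allclose(q_back, q))    # True
```

Without truncation, some of the higher-order $\hat{\nu}_k$ may come out negative; handling this is precisely the point of the recursive truncation in Buchmann and Grübel (2004).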
5 FREQUENTIST ASYMPTOTICS
In this section we assume that the observation times $\{t_i\}$ are equidistant: $t_i = i$, $i = 1, \ldots, n$. To evaluate our Bayesian method from a theoretical point of view, we will verify that it is consistent, and we will establish the rate at which the posterior contracts around the "true," data-generating Lévy measure $\nu_0$; see Ghosal and van der Vaart (2017) for a thorough treatment of Bayesian asymptotics from the frequentist point of view. From now on, the subscript 0 in various quantities will refer to the data-generating distribution.
Our strategy consists in proving that the posterior contraction rate for $\nu_0$, given the sample $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, can be derived from the posterior contraction rate for $q_0$ given $\mathcal{Z}_n$, which is mathematically easier, since $Z_1, \ldots, Z_n$ is a sequence of independent and identically distributed random variables with distribution $q_0$. We thereby effectively avoid dealing directly with the inverse nature of the problem of estimating $p_0$.
The prior we consider in this section is defined as follows:
• Endow the rate $\lambda$ of the Poisson process with a prior distribution.
• Independently, endow the vector $(p_1, \ldots, p_m)$ with a Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_m)$.
This is a somewhat simplified version of the prior we used in Section 2, which allows us to concentrate on the essential features of the problem without cluttering the analysis with extra and unenlightening technicalities. Also recall the well-known relationship between the Gamma and Dirichlet distributions: if $\xi_1, \ldots, \xi_m$ are independent Gamma distributed random variables, $\xi_i \sim \mathrm{Gamma}(\alpha_i, 1)$, then for $\eta_i = \xi_i / \sum_{j=1}^m \xi_j$ the vector $(\eta_1, \ldots, \eta_m)$ follows the Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_m)$; furthermore, $\sum_{j=1}^m \xi_j \sim \mathrm{Gamma}(\sum_{j=1}^m \alpha_j, 1)$, and $(\eta_1, \ldots, \eta_m)$ is independent of $\sum_{j=1}^m \xi_j$.
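This Gamma-Dirichlet relationship is easy to check by simulation; the parameter values in the Python sketch below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 1.0, 2.0])   # illustrative Dirichlet parameters

# xi_i ~ Gamma(alpha_i, 1) independently; eta = xi / sum(xi).
xi = rng.gamma(alpha, 1.0, size=(200_000, 3))
eta = xi / xi.sum(axis=1, keepdims=True)

# Dirichlet(alpha) marginal means are alpha_i / sum(alpha) = (1/7, 2/7, 4/7).
print(eta.mean(axis=0))

# The normalized vector is independent of the total, which is Gamma(sum(alpha), 1);
# the sample correlation between eta_1 and the total is close to 0.
total = xi.sum(axis=1)
print(np.corrcoef(eta[:, 0], total)[0, 1])
```

This construction is also how Dirichlet draws are typically generated in practice, which is why the simplified prior meshes naturally with the Gamma prior on $\lambda$.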
In our asymptotic setting, we will make $m = m_n$ dependent on $n$ and let $m_n \to \infty$ at a suitable rate as $n \to \infty$.
Recall that we write $Q = (q_k)_{k \in \mathbb{N}_0}$, where $q_k = Q(\{k\})$. Let $\mathcal{Q}$ denote the collection of all probability measures supported on $\mathbb{N}_0$.
Theorem 1. Suppose there exists $\alpha$, such that $0 < \alpha \le \alpha_i \le 1$ for all $1 \le i \le m_n$. Suppose $\lambda \sim \mathrm{Gamma}(a, b)$ with $a \in (0, 1]$ and that $\nu_0$ has a compact support. Then, for any $\gamma > 1$,
$$
\Pi_n\Big( \|\nu - \nu_0\|_1 \ge \frac{\log^\gamma n}{\sqrt{n}} \,\Big|\, \mathcal{Z}_n \Big) \to 0
$$
in $Q_0^n$-probability, as $n \to \infty$.
Remark 1. Note that since the support of $\nu_0$ is not assumed to be known, our CPP model is still naturally nonparametric. The assumption of compact support of $\nu_0$ does not cover the simulation example of Section 3.2. However, its relaxation appears to pose very difficult technical challenges and is not attempted in this work.
The remainder of this section is devoted to the proof of Theorem 1.
5.1 Basic posterior inequality via the stability estimate
A key step of the proof of Theorem 1 is the stability estimate in Equation (5) below, which bounds the total variation distance between the Lévy measures $\nu, \nu'$ in terms of the total variation distance between the corresponding compound distributions $q, q'$.
In principle, it is conceivable that the Panjer recursion should allow one to bound probability distances between P-probabilities via distances between Q-probabilities; we call such a bound a stability estimate. Nevertheless, explicit as the equations of the Panjer recursion are, they are still somewhat unwieldy for that purpose. Hence we will use another inversion formula, from Buchmann and Grübel (2003), which leads to the stability estimate we are after.
First we introduce some notation, and also recall a few useful facts summarised in Buchmann and Grübel (2003). The space of absolutely summable sequences is defined as
$$
\ell_1 := \Big\{ a \in \mathbb{R}^{\mathbb{N}_0} : \sum_{j=0}^{\infty} |a_j| < \infty \Big\},
$$
with a norm given by $\|a\|_1 = \sum_{j=0}^{\infty} |a_j|$. For probability vectors $a, b$, the norm $\|a - b\|_1$ is (twice) the total variation distance between $a$ and $b$. For any $a, b \in \ell_1$, we have the inequality
$$
\|a * b\|_1 \le \|a\|_1 \|b\|_1. \tag{4}
$$
The exponential of $a \in \ell_1$ is defined as
$$
\exp(a) = \sum_{j=0}^{\infty} \frac{a^{*j}}{j!}.
$$
The exponential has the following two useful properties:
$$
\exp(a + b) = \exp(a) * \exp(b), \quad a, b \in \ell_1,
$$
and
$$
\exp(a) = \exp(b) \;\Rightarrow\; a = b, \quad a, b \in \ell_1.
$$
We define a sequence $\delta_0 = (\delta_{0,k})_{k \in \mathbb{N}_0}$, such that $\delta_{0,0} = 1$ and all its other entries are equal to zero. Then, using the above properties of the exponential, we can write the compounding mapping in (2) concisely in terms of convolutions of infinite sequences: $q = \exp(\lambda(p - \delta_0))$. Its convolution inverse, that is, $q^{*(-1)}$ such that $q^{*(-1)} * q = \delta_0$, is given by $r = q^{*(-1)} = \exp(-\lambda(p - \delta_0))$. Note that $r \in \ell_1$. We have the following recursive expressions:
$$
r_0 = \frac{1}{q_0}, \qquad r_k = -\frac{1}{q_0} \sum_{j=1}^{k} q_j r_{k-j}, \quad k \in \mathbb{N}.
$$
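These recursions are straightforward to implement. The Python sketch below (an illustration with an arbitrary pair $(\lambda, p)$, not the authors' code) computes $q$ via the standard Panjer recursion for compound Poisson laws and then $r$ via the recursion above, checking that $q * r = \delta_0$ on a truncated grid.

```python
import numpy as np

def panjer_q(lam, p, K):
    """q_0..q_K for the compound Poisson law with rate lam and jump
    probabilities p = (p_1, ..., p_m) on {1, ..., m}, via the Panjer
    recursion q_k = (lam/k) * sum_j j * p_j * q_{k-j}."""
    q = np.zeros(K + 1)
    q[0] = np.exp(-lam)
    for k in range(1, K + 1):
        q[k] = lam / k * sum(j * p[j - 1] * q[k - j]
                             for j in range(1, min(k, len(p)) + 1))
    return q

def conv_inverse(q):
    """Convolution inverse: r_0 = 1/q_0, r_k = -(1/q_0) sum_{j=1}^k q_j r_{k-j}."""
    r = np.zeros_like(q)
    r[0] = 1.0 / q[0]
    for k in range(1, len(q)):
        r[k] = -sum(q[j] * r[k - j] for j in range(1, k + 1)) / q[0]
    return r

lam, p = 1.2, [0.6, 0.4]      # illustrative (lambda, p) supported on {1, 2}
q = panjer_q(lam, p, K=20)
r = conv_inverse(q)

delta = np.convolve(q, r)[:21]  # truncated q * r
print(np.allclose(delta, np.eye(21)[0]))  # True: q * r = delta_0
```

The identity holds exactly on the truncated grid (up to floating point), since $(q * r)_k$ for $k \le K$ only involves $q_0, \ldots, q_k$ and $r_0, \ldots, r_k$.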
Lemma 1. Let $q, q'$ correspond to two pairs $(\lambda, p)$ and $(\lambda', p')$, respectively (and let $r$ correspond to $q$, i.e., to the pair $(\lambda, p)$). Then, in accordance with the notation introduced above and provided that $\|q' - q\|_1 < \|r\|_1^{-1}$, it holds that
$$
\|\nu' - \nu\|_1 = \|\lambda' p' - \lambda p\|_1 \le \frac{\|r\|_1 \|q' - q\|_1}{1 - \|r\|_1 \|q' - q\|_1}. \tag{5}
$$
Proof. The result is a direct consequence of Lemma 3 in Buchmann and Grübel (2003), which states that
$$
(\lambda' - \lambda)\,\delta_0 + \lambda p - \lambda' p' = \sum_{j=1}^{\infty} \frac{1}{j}\,\big(r * (q - q')\big)^{*j},
$$
whenever $\|q' - q\|_1 < \|r\|_1^{-1}$. Taking the $\|\cdot\|_1$-norm on both sides and some elementary bounding via (4) imply that
$$
|\lambda' - \lambda| + \|\lambda' p' - \lambda p\|_1 \le \frac{\|r\|_1 \|q' - q\|_1}{1 - \|r\|_1 \|q' - q\|_1},
$$
and thus Equation (5) follows. ▪
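The stability estimate (5) can be sanity checked numerically for a small perturbation of $(\lambda, p)$; the two pairs below are arbitrary illustrative choices, and the $\ell_1$-norms are approximated on a truncated grid (the truncation error is negligible for these parameter values).

```python
import numpy as np

def panjer_q(lam, p, K):
    # Compound Poisson probabilities q_0..q_K via the Panjer recursion.
    q = np.zeros(K + 1)
    q[0] = np.exp(-lam)
    for k in range(1, K + 1):
        q[k] = lam / k * sum(j * p[j - 1] * q[k - j]
                             for j in range(1, min(k, len(p)) + 1))
    return q

def conv_inverse(q):
    # r_0 = 1/q_0, r_k = -(1/q_0) sum_{j=1}^k q_j r_{k-j}.
    r = np.zeros_like(q)
    r[0] = 1.0 / q[0]
    for k in range(1, len(q)):
        r[k] = -sum(q[j] * r[k - j] for j in range(1, k + 1)) / q[0]
    return r

K = 60
lam, p = 1.0, np.array([0.7, 0.3])        # (lambda, p)
lam2, p2 = 1.05, np.array([0.68, 0.32])   # perturbed (lambda', p')

q, q2 = panjer_q(lam, p, K), panjer_q(lam2, p2, K)
r = conv_inverse(q)

lhs = np.abs(lam2 * p2 - lam * p).sum()   # ||nu' - nu||_1
d = np.abs(q2 - q).sum()                  # ||q' - q||_1 (truncated)
rn = np.abs(r).sum()                      # ||r||_1 (truncated)
assert rn * d < 1                         # condition of Lemma 1 holds here
rhs = rn * d / (1 - rn * d)
print(lhs <= rhs)                         # True: the bound (5) is satisfied
```

Note how the bound degrades as $\|r\|_1 \|q' - q\|_1$ approaches 1, which is why Proposition 1 below splits the event according to the size of $\|r_0\|_1 \|q - q_0\|_1$.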
We will use Equation (5) to establish the key inequality for the posterior measure $\Pi(\cdot \mid \mathcal{Z}_n)$. We recall once again that the subscript 0 refers to the "true," data-generating quantities.
Proposition 1. For any prior $\Pi$ on $\nu$, for any $\varepsilon \in (0, 1]$ and for any $n \ge 1$, the following posterior inequality holds:
$$
\Pi\big(\|\nu - \nu_0\|_1 \ge \varepsilon \mid \mathcal{Z}_n\big) \le 2\,\Pi\Big(\|q - q_0\|_1 \ge \frac{\varepsilon}{2\|r_0\|_1} \,\Big|\, \mathcal{Z}_n\Big).
$$
Proof. Write $\{\nu : \|\nu - \nu_0\|_1 \ge \varepsilon\}$ as the union of the sets
$$
\{\nu : \|\nu - \nu_0\|_1 \ge \varepsilon\} \cap \{\nu : \|r_0\|_1 \|q - q_0\|_1 < 1/2\}
$$
and
$$
\{\nu : \|\nu - \nu_0\|_1 \ge \varepsilon\} \cap \{\nu : \|r_0\|_1 \|q - q_0\|_1 \ge 1/2\}.
$$
Thanks to Lemma 1, the first of these sets is a subset of $\{\nu : \|q - q_0\|_1 \ge \varepsilon/(2\|r_0\|_1)\}$. The proof is concluded by observing that the second set is a subset of $\{\nu : \|q - q_0\|_1 \ge \varepsilon/(2\|r_0\|_1)\}$, too, since $\varepsilon \le 1$. ▪
In general, stability estimates like the one in Equation (5) are unknown in the literature on Lévy processes. Consequently, studying Bayesian asymptotics for Lévy models, even in the CPP case, necessitates the use of very intricate arguments under restrictive assumptions (e.g., Nickl and Söhl (2017)).
5.2 Proof of Theorem 1
The usefulness of Proposition 1 above lies in the fact that the posterior contraction rate in the inverse problem of estimating the Lévy measure $\nu_0$ from indirect observations $\mathcal{Z}_n$ can now be deduced from the posterior contraction rate in the direct problem of estimating the compound distribution $q_0$, which is easier (observe that $r_0$ is determined by $\nu_0$ and is therefore fixed in the proofs). The general machinery developed in Ghosal, Ghosh, and van der Vaart (2000) can be applied to handle the latter, and also several inequalities obtained in Gugushvili et al. (2015) are useful in that respect. In particular, we make use of the following inequality for the Hellinger distance,
$$
h(q_{\lambda,p}, q_{\lambda',p'}) \le \sqrt{\lambda}\, h(p, p') + \big|\sqrt{\lambda} - \sqrt{\lambda'}\big|, \tag{6}
$$
compare Lemma 1 in Gugushvili et al. (2015). To ease our notation, in the sequel we will often write $q$ and $q'$ instead of $q_{\lambda,p}$ and $q_{\lambda',p'}$, respectively. Denote
$$
\mathrm{KL}(q_0, q) = Q_0\Big(\log \frac{q_0}{q}\Big), \qquad V(q_0, q) = Q_0\Big(\log \frac{q_0}{q}\Big)^2.
$$
Another two inequalities we will use are the following: let $\lambda, \lambda_0 \in [\underline{\lambda}, \overline{\lambda}]$. Then there exists a positive constant $C$, such that
$$
\mathrm{KL}(q_0, q) \le C\big(\mathrm{KL}(p_0, p) + |\lambda_0 - \lambda|^2\big),
$$
$$
V(q_0, q) \le C\big(V(p_0, p) + \mathrm{KL}(p_0, p) + |\lambda_0 - \lambda|^2\big); \tag{7}
$$
compare Equations (14) and (15) in Lemma 1 in Gugushvili et al. (2015).
These three inequalities can be obtained by adjusting the arguments used in Gugushvili et al. (2015). However, we opted to give their direct proofs in Lemma 5 in Appendix A, under slightly weaker conditions than those required in Gugushvili et al. (2015).
Our proof of Theorem 1 proceeds via verification of the conditions for posterior contraction in theorem 2.1 in Ghosal et al. (2000). In our setting, the latter result reads as follows.
Theorem 2. Assume $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, where $Z_1, \ldots, Z_n$ are independent and identically distributed with distribution $q_0$. Let $h$ denote the Hellinger metric on $\mathcal{Q}$, the collection of all probability measures with support in $\mathbb{N}_0$. Suppose that for a sequence $\{\epsilon_n\}$ with $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$, a constant $C > 0$, and sets $\mathcal{Q}_n \subset \mathcal{Q}$, we have
$$
\log N(\epsilon_n, \mathcal{Q}_n, h) \le n\epsilon_n^2,
$$
$$
\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) \le \exp\big(-n\epsilon_n^2 (C + 4)\big),
$$
$$
\Pi_n\big(q : \mathrm{KL}(q_0, q) \le \epsilon_n^2,\; V(q_0, q) \le \epsilon_n^2\big) \ge \exp\big(-C n \epsilon_n^2\big).
$$
Then, for sufficiently large $M > 0$, we have that $\Pi_n(Q : h(q, q_0) \ge M\epsilon_n \mid \mathcal{Z}_n) \to 0$ in $Q_0^n$-probability.
We will now verify the three conditions of this theorem, which we refer to as the entropy condition, the remaining mass condition, and the prior mass condition, respectively. To that end, fix strictly positive sequences $\{\underline{\Lambda}_n\}$, $\{\overline{\Lambda}_n\}$, and define the sieves
$$
\mathcal{Q}_n = \big\{q_{\lambda,p} : \lambda \in [\underline{\Lambda}_n, \overline{\Lambda}_n],\; \operatorname{supp} p \subseteq \{1, \ldots, m_n\}\big\}.
$$
5.2.1 Entropy
We start by bounding the entropy of the sieve $\mathcal{Q}_n$ for $h$-balls of radius $\epsilon_n$.
Lemma 2. Assume that, as $n \to \infty$,
$$
m_n \to \infty, \qquad \epsilon_n \to 0, \qquad \underline{\Lambda}_n \to 0, \qquad \overline{\Lambda}_n \to \infty. \tag{8}
$$
Then
$$
\log N(\epsilon_n, \mathcal{Q}_n, h) \lesssim m_n \Big\{\log(m_n) + \log(\overline{\Lambda}_n) + \log\Big(\frac{1}{\epsilon_n}\Big)\Big\} + \log\Big(\frac{1}{\underline{\Lambda}_n}\Big). \tag{9}
$$
Proof. For $\lambda, \lambda' \ge \underline{\Lambda}_n$,
$$
|\sqrt{\lambda} - \sqrt{\lambda'}| = \frac{|\lambda - \lambda'|}{\sqrt{\lambda} + \sqrt{\lambda'}} \le \frac{1}{2\sqrt{\underline{\Lambda}_n}}\,|\lambda - \lambda'|.
$$
Furthermore, from Section 3.3 in Pollard (2002),
$$
h(p, p') \le \sqrt{\|p - p'\|_1} \le \sqrt{m_n \|p - p'\|_\infty}.
$$
Combining the preceding two displays and Equation (6), we get
$$
h(q_{\lambda,p}, q_{\lambda',p'}) \le \sqrt{\overline{\Lambda}_n m_n \|p - p'\|_\infty} + \frac{1}{2\sqrt{\underline{\Lambda}_n}}\,|\lambda - \lambda'|.
$$
Hence, if
$$
\|p - p'\|_\infty \le \frac{\epsilon_n^2}{4\overline{\Lambda}_n m_n}, \qquad |\lambda - \lambda'| \le \sqrt{\underline{\Lambda}_n}\,\epsilon_n,
$$
then the Hellinger distance between $q_{\lambda,p}$ and $q_{\lambda',p'}$ is bounded by $\epsilon_n$. To cover $[\underline{\Lambda}_n, \overline{\Lambda}_n]$, we need at most
$$
\Big\lfloor \frac{\overline{\Lambda}_n}{2\epsilon_n\sqrt{\underline{\Lambda}_n}} \Big\rfloor + 1
$$
intervals of length $2\sqrt{\underline{\Lambda}_n}\,\epsilon_n$. To cover the discrete distributions with support in $\{1, \ldots, m_n\}$, we need at most
$$
\Big(\Big\lfloor \frac{2\overline{\Lambda}_n m_n}{\epsilon_n^2} \Big\rfloor + 1\Big)^{m_n}
$$
$L_\infty$-balls of radius $\epsilon_n^2/(4\overline{\Lambda}_n m_n)$. Under assumption (8), the summand 1 in the above displays is asymptotically negligible and can be omitted. In that case, the number of $h$-balls that we need to cover $\mathcal{Q}_n$ is of order
$$
\Big(\frac{\overline{\Lambda}_n m_n}{\epsilon_n^2}\Big)^{m_n} \times \frac{\overline{\Lambda}_n}{\epsilon_n \sqrt{\underline{\Lambda}_n}}.
$$
Taking the logarithm and next a straightforward rearrangement of the terms gives the statement of the lemma. ▪
5.2.2 Remaining prior mass
Now we will derive an inequality for the remaining prior mass.

Lemma 3. For $\lambda \sim \mathrm{Gamma}(a, b)$ with $0 < a \le 1$,
$$
\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) \lesssim \overline{\Lambda}_n^{\,a-1} e^{-b\overline{\Lambda}_n} + \underline{\Lambda}_n^{\,a}.
$$
Proof. We have (with a slight abuse of notation)
$$
\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) = \Pi_n([\overline{\Lambda}_n, \infty)) + \Pi_n([0, \underline{\Lambda}_n)).
$$
Now,
$$
\Pi_n(\lambda \ge \overline{\Lambda}_n) = \frac{b^a}{\Gamma(a)} \int_{\overline{\Lambda}_n}^{\infty} \lambda^{a-1} e^{-b\lambda}\,\mathrm{d}\lambda \lesssim \overline{\Lambda}_n^{\,a-1} e^{-b\overline{\Lambda}_n}.
$$
Furthermore,
$$
\Pi_n([0, \underline{\Lambda}_n)) = \frac{b^a}{\Gamma(a)} \int_{0}^{\underline{\Lambda}_n} \lambda^{a-1} e^{-b\lambda}\,\mathrm{d}\lambda \lesssim \underline{\Lambda}_n^{\,a}.
$$
The proof is concluded. ▪
5.2.3 Prior mass
Finally, we lower bound the prior mass in a small Kullback-Leibler neighbourhood of the data-generating compound distribution $q_0$. Define the function $g : (0, \infty) \times (0, 1) \to (0, \infty)$ by
$$
g(\epsilon, c) = \frac{C\epsilon^2}{2[\log(e/c)]^2}.
$$
Lemma 4. Assume that
(i) there exists $\alpha$, such that $0 < \alpha \le \alpha_i \le 1$ for all $1 \le i \le m_n$;
(ii) strictly positive sequences $\underline{p}_n \to 0$, $\epsilon_n \to 0$ and $m_n \to \infty$ satisfy the inequalities $m_n g(\epsilon_n, \underline{p}_n) < 1$ and $\underline{p}_n < g(\epsilon_n, \underline{p}_n)^2$.
Define
$$
B_n(\epsilon) = \{q \in \mathcal{Q}_n : \mathrm{KL}(q_0, q) \le \epsilon^2,\; V(q_0, q) \le \epsilon^2\}.
$$
Then
$$
\Pi_n(B_n(\epsilon_n)) \gtrsim \Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \times \Gamma\Big(\sum_{i=1}^{m_n} \alpha_i\Big) \exp\Big(-m_n \log\big(1/\big(g(\tilde{\epsilon}_n, \underline{p}_n)^2 - \underline{p}_n\big)\big) - m_n \log(1/\alpha)\Big).
$$
Here $\tilde{\epsilon}_n = \epsilon_n/\sqrt{3C}$, with a constant $C > 0$ not depending on $n$.

Proof. Define
$$
\tilde{B}_n(\epsilon) = \Big\{(\lambda, p) : \lambda \in [\underline{\Lambda}_n, \overline{\Lambda}_n],\; \min_{1 \le i \le m_n} p_i \ge \underline{p}_n,\; \operatorname{supp} p \subseteq \{1, \ldots, m_n\},\; \mathrm{KL}(p_0, p) \le \epsilon^2,\; V(p_0, p) \le \epsilon^2,\; |\lambda_0 - \lambda| \le \epsilon\Big\}.
$$
For all $n$ large enough and $\epsilon$ small, we have $\{\lambda : |\lambda_0 - \lambda| \le \epsilon\} \subseteq [\underline{\Lambda}_n, \overline{\Lambda}_n]$. Then, by the inequalities in Lemma 5, $\tilde{B}_n(\epsilon) \subset B_n(\sqrt{3C}\,\epsilon)$, with a constant $C$ that can be taken the same for all large enough $n$; see the arguments in Section 4.2 in Gugushvili et al. (2015). Hence, using the a priori independence of $p$ and $\lambda$,
$$
\Pi_n(B_n(\epsilon_n)) \ge \Pi_n(\tilde{B}_n(\tilde{\epsilon}_n)) = \Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \times U_n,
$$
where
$$
U_n = \Pi_n\Big(\Big\{p : \mathrm{KL}(p_0, p) \le \tilde{\epsilon}_n^2,\; V(p_0, p) \le \tilde{\epsilon}_n^2,\; \min_{1 \le i \le m_n} p_i \ge \underline{p}_n\Big\}\Big).
$$
Furthermore, by Lemma 6 from Appendix A, we have
$$
U_n \ge \Pi_n\Big(\Big\{p : \sum_{i=1}^{m_n} |p_{0i} - p_i| \le 2g(\tilde{\epsilon}_n, \underline{p}_n),\; \min_{1 \le i \le m_n} p_i \ge \underline{p}_n\Big\}\Big).
$$
The statement of the lemma now follows upon applying Lemma 7 from Appendix A with $\eta = \underline{p}_n$ and $\epsilon = g(\tilde{\epsilon}_n, \underline{p}_n)$. ▪
5.2.4 Using bounds in Theorem 2
We take
$$
m_n \asymp \log n, \qquad \epsilon_n \asymp \frac{\log^\gamma n}{\sqrt{n}}, \qquad \underline{p}_n \asymp \frac{1}{n^2},
$$
with appropriately selected proportionality constants, and verify the conditions in Theorem 2. Firstly, condition (8) is trivially satisfied. Therefore, we can invoke Lemma 2 and conclude that the entropy is upper bounded by a multiple of $\log^{2\gamma} n$, since $\gamma > 1$. Now $\log^{2\gamma} n \lesssim n\epsilon_n^2$, and this verifies the entropy condition in Theorem 2.

By Lemma 3, for a suitable choice of the constant $C$ the remaining prior mass condition is likewise satisfied.

Finally, for the prior mass condition in a small Kullback-Leibler neighbourhood to hold, by Lemma 4 we need the term
$$
\Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \exp\Big(-m_n \log\big(1/\big(g(\tilde{\epsilon}_n, \underline{p}_n)^2 - \underline{p}_n\big)\big) - m_n \log(1/\alpha)\Big)
$$
to be lower bounded by $\exp(-Cn\epsilon_n^2)$ for some large enough $C > 0$. Now, $\Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \asymp \tilde{\epsilon}_n$. Taking the logarithm in the above display, note that by our conditions
$$
\log\big(\Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n)\big) \gtrsim \log(\tilde{\epsilon}_n) \gtrsim -n\epsilon_n^2.
$$
Likewise,
$$
m_n \log\big(1/\big(g(\tilde{\epsilon}_n, \underline{p}_n)^2 - \underline{p}_n\big)\big) + m_n \log(1/\alpha) \lesssim n\epsilon_n^2,
$$
so that the prior mass condition holds.

Thus we have verified all the conditions of Theorem 2, and the posterior for $q$ contracts around $q_0$ at the rate $\epsilon_n \asymp \log^\gamma n/\sqrt{n}$ in the Hellinger metric. Since $\|q - q_0\|_1 \le 2h(q, q_0)$, Proposition 1 transfers this rate to $\nu$, which yields Theorem 1.
6 OUTLOOK
In this paper we introduced a nonparametric Bayesian approach to estimation of the Lévy measure $\nu$ of a discretely observed CPP, when the support of $\nu$ is a subset of $\mathbb{N}$. We constructed an algorithm for sampling from the posterior distribution of $\nu$, and showed that in practice our procedure performs well and measures up to a benchmark frequentist plug-in approach from Buchmann and Grübel (2004). Although computationally more demanding and slower than the latter, our method has the added benefit of providing uncertainty quantification in parameter estimates through the spread of the posterior distribution. On the theoretical side, we showed that our procedure is consistent, in that asymptotically, as the sample size $n \to \infty$, the posterior concentrates around the "true," data-generating distribution. The corresponding posterior contraction rate is the (nearly) optimal rate $\log^\gamma n/\sqrt{n}$ for an arbitrary $\gamma > 1$: it matches the parametric rate $1/\sqrt{n}$ up to a practically insignificant logarithmic factor.
Among several generalizations of our results, the one that looks most promising is the extension of our methodology to CPPs with jump size distributions supported on the set of integers $\mathbb{Z}$. The corresponding model has garnered substantial interest in financial applications (see Barndorff-Nielsen, Pollard, & Shephard, 2012). We leave this extension as a topic of possible future research.
ACKNOWLEDGEMENTS
The research leading to the results in this paper has received funding from the European Research Council under ERC Grant Agreement 320637, from the Deutsche Forschungsgemeinschaft (DFG,
German Research Foundation) – 314838170, GRK 2297 MathCoRe, and from the Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1294 "Data Assimilation." The authors would like to thank the Associate Editor and the referee for their detailed and constructive comments on the paper.
ORCID
Shota Gugushvili https://orcid.org/0000-0002-6963-295X
Ester Mariucci https://orcid.org/0000-0003-0409-4131
Frank van der Meulen https://orcid.org/0000-0001-7246-8612
REFERENCES
Barndorff-Nielsen, O. E., Pollard, D. G., & Shephard, N. (2012). Integer-valued Lévy processes and low latency financial econometrics. Quantitative Finance, 12(4), 587–605.
Belomestny, D., Comte, F., Genon-Catalot, V., Masuda, H., & Reiß, M. (2015). Lévy matters IV: Estimation for discretely observed Lévy processes (Vol. 2128). Cham, Switzerland: Springer.
Belomestny, D., Gugushvili, S., Schauer, M., & Spreij, P. (2019). Nonparametric Bayesian inference for Gamma-type Lévy subordinators. Communications in Mathematical Sciences, 17(3), 781–816.
Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. (2017). Julia: A fresh approach to numerical computing. SIAM
Review, 59(1), 65–98.
Buchmann, B., & Grübel, R. (2003). Decompounding: An estimation problem for Poisson random sums. The
Annals of Statistics, 31(4), 1054–1074.
Buchmann, B., & Grübel, R. (2004). Decompounding Poisson random sums: Recursively truncated estimates in the discrete case. Annals of the Institute of Statistical Mathematics, 56(4), 743–756.
Coca, A. J. (2018a). Adaptive nonparametric estimation for compound Poisson processes robust to the discrete-observation scheme. arXiv e-prints. Retrieved from https://arxiv.org/abs/1803.09849
Coca, A. J. (2018b). Efficient nonparametric inference for discretely observed compound Poisson processes.
Probability Theory and Related Fields, 170(1), 475–523.
Comte, F., & Genon-Catalot, V. (2011). Estimation for Lévy processes from high frequency data within a long time interval. The Annals of Statistics, 39(2), 803–837.
Devroye, L. (1986). Non-uniform random variate generation. New York, NY: Springer-Verlag.
Duval, C., & Hoffmann, M. (2011). Statistical inference across time scales. Electronic Journal of Statistics, 5, 2004–2030.
Duval, C., & Mariucci, E. (2017). Compound Poisson approximation to estimate the Lévy density. arXiv e-prints. Retrieved from https://arxiv.org/abs/1702.08787
Embrechts, P., Mikosch, T., & Klüppelberg, C. (1997). Modelling extremal events: For insurance and finance. Berlin, Germany: Springer-Verlag.
Evans, D. A. (1953). Experimental evidence concerning contagious distributions in ecology. Biometrika, 40, 186–211.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of
the American Statistical Association, 85(410), 398–409.
Ghosal, S., Ghosh, J. K., & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of
Statistics, 28(2), 500–531.
Ghosal, S., & van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. The
Annals of Statistics, 35(2), 697–723.
Ghosal, S., & van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge Series in Statistical and Probabilistic Mathematics (Vol. 44). Cambridge, UK: Cambridge University Press.
Gugushvili, S., Mariucci, E., & van der Meulen, F. (2019). Bdd: Julia code for Bayesian decompounding of discrete distributions.
Gugushvili, S., van der Meulen, F., & Spreij, P. (2015). Nonparametric Bayesian inference for multidimensional compound Poisson processes. Modern Stochastics: Theory & Applications, 2(1), 1–15.
Gugushvili, S., van der Meulen, F., & Spreij, P. (2018). A non-parametric Bayesian approach to decompounding from high frequency data. Statistical Inference for Stochastic Processes, 21(1), 53–79.
Kappus, J. (2014). Adaptive nonparametric estimation for Lévy processes observed at low frequency. Stochastic
Processes and their Applications, 124(1), 730–758.
Lindo, A., Zuyev, S., & Sagitov, S. (2018). Nonparametric estimation for compound Poisson process via variational analysis on measures. Statistics and Computing, 28(3), 563–577.
Müller, P., Quintana, F. A., Jara, A., & Hanson, T. (2015). Bayesian nonparametric data analysis. Springer Series in Statistics. Cham, Switzerland: Springer.
Neumann, M. H., & Reiß, M. (2009). Nonparametric estimation for Lévy processes from low-frequency observations. Bernoulli, 15(1), 223–248.
Neyman, J. (1939). On a new class of “contagious” distributions, applicable in entomology and bacteriology. The
Annals of Mathematical Statistics, 10, 35–57.
Nickl, R., & Reiß, M. (2012). A Donsker theorem for Lévy measures. Journal of Functional Analysis, 263(10), 3306–3332.
Nickl, R., & Söhl, J. (2017). Bernstein-von Mises theorems for statistical inverse problems II: Compound Poisson processes. arXiv e-prints. Retrieved from https://arxiv.org/abs/1709.07752
Panjer, H. H. (1981). Recursive evaluation of a family of compound distributions. ASTIN Bulletin, 12(1), 22–26.
Pollard, D. (2002). A user's guide to measure theoretic probability (Vol. 8). Cambridge, UK: Cambridge University Press.
Pya Arnqvist, N., Voinov, V., & Voinov, Y. (2018). nilde: Nonnegative integer solutions of linear Diophantine equations with applications. R package version 1.1-2. Retrieved from https://CRAN.R-project.org/package=nilde
Sato, K.-I. (2013). Lévy processes and infinitely divisible distributions (Vol. 68, 2nd revised ed.). Cambridge, UK: Cambridge University Press.
Shreve, S. E. (2004). Stochastic calculus for finance II: Continuous-time models. New York, NY: Springer.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–550. With discussion and with a reply by the authors.
Trabs, M. (2015). Information bounds for inverse problems with application to deconvolution and Lévy models. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 51(4), 1620–1650.
van der Pas, S., Szabó, B., & van der Vaart, A. (2017). Uncertainty quantification for the horseshoe (with discussion).
Bayesian Analysis, 12(4), 1221–1274.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics (Vol. 3). Cambridge, UK: Cambridge University Press.
van Es, B., Gugushvili, S., & Spreij, P. (2007). A kernel type nonparametric density estimator for decompounding.
Bernoulli, 13(3), 672–694.
Voinov, V. G., & Nikulin, M. S. (1997). On a subset sum algorithm and its probabilistic and other applications. In N. Balakrishnan (Ed.), Advances in combinatorial methods and applications to probability and statistics (pp. 153–163). Boston, MA: Birkhäuser Boston.
von Bortkewitsch, L. (1898). Das Gesetz der kleinen Zahlen. Leipzig, Germany: B.G. Teubner.
Wasserman, L. (2004). All of statistics. A concise course in statistical inference. New York, NY: Springer.
Zhang, H., Liu, Y., & Li, B. (2014). Notes on discrete compound Poisson model with applications to risk theory.
Insurance: Mathematics & Economics, 59, 325–336.
How to cite this article: Gugushvili S, Mariucci E, van der Meulen F. Decompounding discrete distributions: A nonparametric Bayesian approach. Scand J Statist. 2020;47(2):464–492. https://doi.org/10.1111/sjos.12413