Decompounding discrete distributions: A nonparametric Bayesian approach
Gugushvili, Shota; Mariucci, Ester; van der Meulen, Frank
DOI: 10.1111/sjos.12413
Publication date: 2019
Document version: Final published version
Published in: Scandinavian Journal of Statistics

Citation (APA): Gugushvili, S., Mariucci, E., & van der Meulen, F. (2019). Decompounding discrete distributions: A nonparametric Bayesian approach. Scandinavian Journal of Statistics, 47(2), 464-492. https://doi.org/10.1111/sjos.12413
ORIGINAL ARTICLE
Decompounding discrete distributions:
A nonparametric Bayesian approach
Shota Gugushvili¹, Ester Mariucci², Frank van der Meulen³

¹ Biometris, Wageningen University & Research
² Institut für Mathematik, Universität Potsdam
³ Delft Institute of Applied Mathematics, Delft University of Technology

Correspondence: Ester Mariucci, Institut für Mathematik, Universität Potsdam, Karl-Liebknecht-Str. 24-25, D-14476 Potsdam OT Golm, Germany. Email: mariucci@ovgu.de

Funding information: Deutsche Forschungsgemeinschaft, CRC 1294 Data Assimilation, DFG 314838170, GRK 2297 MathCoRe; European Research Council, ERC Grant Agreement 320637
Abstract
Suppose that a compound Poisson process is observed discretely in time, and assume that its jump distribution is supported on the set of natural numbers. In this paper we propose a nonparametric Bayesian approach to estimate the intensity of the underlying Poisson process and the distribution of the jumps. We provide a Markov chain Monte Carlo scheme for obtaining samples from the posterior. We apply our method to both simulated and real data examples, and compare its performance with the frequentist plug-in estimator proposed by Buchmann and Grübel. On the theoretical side, we study the posterior from the frequentist point of view and prove that, as the sample size n → ∞, it contracts around the "true," data-generating parameters at the rate 1/√n, up to a log n factor.
KEYWORDS
compound Poisson process, data augmentation, Diophantine equation, Gibbs sampler, Metropolis-Hastings algorithm, nonparametric Bayesian estimation
1 INTRODUCTION

1.1 Problem formulation
Let N = (N_t : t ≥ 0) be a Poisson process with a constant intensity λ > 0, and let (Y_i) be a sequence of independent random variables, each with distribution P, that are also independent of N. By definition, a compound Poisson process (abbreviated CPP) X = (X_t : t ≥ 0) is

\[ X_t = \sum_{j=1}^{N_t} Y_j, \tag{1} \]

where here and below the sum over an empty index set is understood to be equal to zero. In particular, X_0 = 0. The CPP constitutes a classical model in, for example, risk theory; see Embrechts, Mikosch, and Klüppelberg (1997).

(This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2019 The Authors. Scandinavian Journal of Statistics published by John Wiley & Sons Ltd on behalf of The Board of the Foundation of the Scandinavian Journal of Statistics.)
Assume that the process X is observed at the discrete times 0 < t_1 < t_2 < … < t_n = T, where the instants t_i are not necessarily equidistant on [0, T]. Based on the observations X_{t_1}, X_{t_2}, …, X_{t_n}, our goal is to estimate the jump size distribution P and the intensity λ. We specifically restrict our attention to the case where P is a discrete distribution, P(ℕ) = 1, and we will write p = (p_k)_{k∈ℕ} for the probability mass function corresponding to P, where p_k = P({k}). A similar notation will be used for any other discrete law. The distribution P is called the base distribution. Abusing terminology, we will at times identify it with the corresponding probability mass function p. The assumption that P has no atom at zero is made for identifiability: otherwise this atom gets confounded with e^{−λ}, which does not allow consistent estimation of the intensity λ. For a discussion of applications of this CPP model in risk theory, see Zhang, Liu, and Li (2014).
Define the increments Z_i = X_{t_i} − X_{t_{i−1}}, i = 1, …, n. Then 𝒵_n = (Z_i : i = 1, …, n) is a sequence of independent random variables. When the {t_i} are equidistant on [0, T], the random variables Z_i in fact have a common distribution Q satisfying Q(ℕ_0) = 1. As 𝒵_n carries as much information as (X_{t_i} : i = 1, …, n) does, we can base our estimation procedure directly on the increments 𝒵_n. Since summing up the jumps Y_j amounts to compounding their distributions, the inverse problem of recovering P and λ from the Z_i can be referred to as decompounding (Buchmann & Grübel, 2003).
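To make the observation scheme concrete, the data-generating mechanism just described is easy to simulate. The following Python sketch (the paper's own implementation is in Julia; the function name and interface are ours) draws increments Z_i for a given intensity λ, base distribution p, and spacings Δ_i.

```python
import numpy as np

def simulate_cpp_increments(rng, lam, support, probs, deltas):
    """Simulate increments Z_i = X_{t_i} - X_{t_{i-1}} of a compound Poisson
    process with intensity lam and discrete jump law P(Y = support[j]) = probs[j].
    Each Z_i is a sum of a Poisson(lam * Delta_i) number of i.i.d. jumps."""
    z = np.empty(len(deltas), dtype=int)
    for i, d in enumerate(deltas):
        n_jumps = rng.poisson(lam * d)   # number of arrivals in (t_{i-1}, t_i]
        z[i] = rng.choice(support, size=n_jumps, p=probs).sum()
    return z
```

For instance, with λ = 2 and the uniform base distribution on {1, 4, 6} used in Section 3.1, the increments have mean λΔ·E[Y] = 2·(11/3) per unit of time.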
There are two natural ways to parameterize the CPP model: either in terms of the pair (λ, p), or in terms of the Lévy measure ν = (ν_k)_{k∈ℕ} of the process X (Sato, 2013). The relationship between the two is λ = Σ_{k=1}^∞ ν_k and p = ν/λ. Inferential conclusions in one parameterization can easily be translated into the other. However, for our specific statistical approach the Lévy measure parameterization turns out to be more advantageous from the computational point of view.
1.2 Approach and results
In this paper, we take a nonparametric Bayesian approach to estimation of the Lévy measure ν of X; see Ghosal and van der Vaart (2017) and Müller, Quintana, Jara, and Hanson (2015) for modern expositions of Bayesian nonparametrics. A case for nonparametric Bayesian methods has already been made elsewhere in the literature and will not be repeated here. On the practical side, we implement our procedure via the Gibbs sampler and data augmentation, and show that it performs well under various simulation setups. On the theoretical side, we establish its consistency and derive the corresponding posterior contraction rate, which can be thought of as an analogue of the convergence rate of a frequentist estimator (Ghosal & van der Vaart, 2017). The posterior contraction rate, up to a practically insignificant log n factor, turns out to be 1/√n, which is the optimal rate for nonparametric estimation of cumulative distribution functions. Our contribution thus bridges practical and theoretical aspects of Bayesian nonparametrics.
1.3 Related literature
To better motivate our model and approach, in this subsection we briefly survey the existing literature. A Bayesian approach to nonparametric inference for Lévy processes is a very recent and emerging topic, with references at the moment limited to Belomestny, Gugushvili, Schauer, and Spreij (2019), Gugushvili, van der Meulen, and Spreij (2015, 2018), and Nickl and Söhl (2017). These deal exclusively with the case where the Lévy measure is absolutely continuous with respect to the Lebesgue measure. At least from the computational point of view, these works are of no help in our present context.
Related frequentist papers for CPP models with discrete base distributions are Buchmann and Grübel (2003, 2004), which, after earlier contributions dating from the previous century, in fact revived interest in nonparametric techniques for Lévy processes. To estimate the base distribution p, Buchmann and Grübel (2003) employ a frequentist plug-in approach relying on the Panjer recursion (i.e., an empirical cumulative distribution estimate of q is plugged into the Panjer recursion equations to yield an estimate of p; see below on the Panjer recursion). The drawback is that the parameter estimates are not guaranteed to be nonnegative. Buchmann and Grübel (2004) fix this problem by truncation and renormalization. This works, but looks artificial. As noted in Buchmann and Grübel (2004), in practice the latter approach breaks down if no zero values are observed among the Z_i. Buchmann and Grübel (2004) establish weak convergence of their modified estimator, but on the downside its asymptotic distribution is too unwieldy to give confidence statements on p. Most importantly, the plug-in approaches in Buchmann and Grübel (2003) and Buchmann and Grübel (2004) do not allow obvious generalizations to nonequidistant observation times {t_i}. In Lindo, Zuyev, and Sagitov (2018), another frequentist estimator of the jump measure is introduced, obtained via a steepest descent technique as a solution to an optimization problem over the cone of positive measures. The emphasis in Lindo et al. (2018) is on numerical aspects; again, no obvious generalization to the case of nonequidistant {t_i} is available.
Finally, some important, predominantly theoretical references on inference for Lévy processes are Comte and Genon-Catalot (2011), Duval and Hoffmann (2011), Kappus (2014), Neumann and Reiß (2009), Nickl and Reiß (2012), van Es, Gugushvili, and Spreij (2007) and Trabs (2015). We refer to Belomestny, Comte, Genon-Catalot, Masuda, and Reiß (2015), Coca (2018a, 2018b), and Duval and Mariucci (2017) for more extensive literature surveys.
1.4 Outline
The rest of the paper is organized as follows: in Section 2 we introduce our approach and describe an algorithm for drawing from the posterior distribution. In Sections 3 and 4 we study its performance on synthetic and real data examples. Section 5 is devoted to the examination of asymptotic frequentist properties of our procedure. An outlook on our results is given in Section 6. Finally, Appendix A collects technical lemmas used in the proofs of Section 5, whereas Appendix B contains some additional simulation results.
1.5 Notation
For two sequences {a_n} and {b_n} of positive real numbers, the notation a_n ≲ b_n (or b_n ≳ a_n) means that there exists a constant C > 0, independent of n, such that a_n ≤ C b_n. We denote the prior (which may depend on the sample size n) by Π_n; the corresponding posterior measure is denoted by Π_n(· | 𝒵_n). The Gamma distribution with shape parameter a and rate parameter b (a, b > 0) is denoted by Gamma(a, b); its density is

\[ x \mapsto \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}, \quad x > 0, \]

where Γ is the Gamma function. The inverse Gamma distribution with shape parameter a and scale parameter b is denoted by IG(a, b); its density is

\[ x \mapsto \frac{b^a}{\Gamma(a)} x^{-a-1} e^{-b/x}, \quad x > 0. \]

We use the notation Exp(a) for the exponential distribution with mean 1/a. Finally, given a metric d on a set 𝒳 and ε > 0, the covering number N(ε, 𝒳, d) is defined as the minimal number of balls of radius ε needed to cover 𝒳.
2 ALGORITHM FOR DRAWING FROM THE POSTERIOR
A Bayesian statistical approach relies on combining the likelihood and the prior on the parameter of interest through Bayes' formula. We start by specifying the prior. As far as the likelihood is concerned, although explicit, it is computationally intractable for nonparametric inference in CPP models. We circumvent the latter problem by means of data augmentation, as detailed below.
2.1 Prior
We define a prior Π on ν through the hierarchical specification

\[ \nu_k \mid a, m, \beta_k \overset{\text{ind}}{\sim} \mathrm{Gamma}(a, 1/\beta_k) \cdot \mathbf{1}\{1 \le k \le m\}, \quad k \in \mathbb{N}, \]
\[ \beta_1, \dots, \beta_m \mid \gamma \overset{\text{i.i.d.}}{\sim} \mathrm{IG}(c, \gamma), \qquad \gamma \sim \mathrm{Exp}(1). \]
Note that the (fixed) hyperparameters m ∈ ℕ and a, c > 0 are denoted by Latin letters.
The hyperparameter m incorporates our a priori opinion on the support of the Lévy measure ν, or equivalently, of the base measure p. In applications, the support of p may be unknown, which necessitates the use of a large m, for example, m = max_{i=1,…,n} Z_i; this is the maximal value suggested by the data 𝒵_n at hand. Nevertheless, we may simultaneously expect that the "true," data-generating ν charges full mass only to a proper, perhaps even small, subset of the set {1, …, m}. In other words, ν may form a sparse sequence, with many components equal to zero. In fact, there are at least two plausible explanations for the occurrence of a large increment Z_i in the data: either a few large jumps Y_j occurred, which points toward a large right endpoint of the support of ν_0, or Z_i is predominantly formed of many small jumps, which in turn indicates that the intensity λ of the Poisson arrival process N may be large. To achieve accurate estimation results, a prior should take possible sparsity of ν into account. This is precisely the reason for our hierarchical definition of the prior Π: a small β_k encourages a priori shrinkage of the component ν_k of ν toward zero.
2.2 Data augmentation
Assume temporarily that t_i = i, i = 1, …, n, and write q = (q_k)_{k∈ℕ_0} for q_k = Q({k}). Then the Z_i have the distribution

\[ q = e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} \, p^{*j}, \tag{2} \]
with ∗ denoting convolution. The compounding mapping (λ, p) → q can be expressed explicitly via the Panjer recursion (see Panjer, 1981):

\[ q_0 = e^{-\lambda}, \qquad q_k = \frac{\lambda}{k} \sum_{j=1}^{k} j \, p_j \, q_{k-j}, \quad k \in \mathbb{N}. \]
This recursion can be inverted to give the inverse mapping q → (λ, p) via

\[ \lambda = -\log q_0, \qquad p_k = -\frac{q_k}{q_0 \log q_0} - \frac{1}{k q_0} \sum_{j=1}^{k-1} j \, p_j \, q_{k-j}, \quad k \in \mathbb{N}. \]
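Both directions of the mapping are straightforward to code. The following Python sketch (function names are ours, not taken from the paper's Julia implementation) computes q from (λ, p) by the Panjer recursion and recovers (λ, p) from q by the inverted recursion; the round trip is exact up to floating-point error.

```python
import math

def panjer_forward(lam, p, K):
    """Compute q_0,...,q_K of the compound Poisson distribution from (lam, p),
    where p[k-1] = p_k = P(Y = k) for k = 1,...,len(p). Panjer recursion:
    q_0 = exp(-lam), q_k = (lam/k) * sum_{j=1}^{k} j * p_j * q_{k-j}."""
    q = [math.exp(-lam)]
    for k in range(1, K + 1):
        s = sum(j * p[j - 1] * q[k - j] for j in range(1, min(k, len(p)) + 1))
        q.append(lam / k * s)
    return q

def panjer_inverse(q):
    """Recover (lam, p_1,...,p_{K}) from q_0,...,q_K by inverting the recursion:
    lam = -log(q_0), p_k = q_k/(lam*q_0) - (1/(k*q_0)) sum_{j<k} j*p_j*q_{k-j}."""
    lam = -math.log(q[0])
    p = []
    for k in range(1, len(q)):
        s = sum(j * p[j - 1] * q[k - j] for j in range(1, k))
        p.append(q[k] / (lam * q[0]) - s / (k * q[0]))
    return lam, p
```

For example, starting from λ = 2 and the uniform base distribution on {1, 4, 6}, `panjer_inverse(panjer_forward(2.0, [1/3, 0, 0, 1/3, 0, 1/3], 12))` returns the original parameters.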
In view of (2), the likelihood in the CPP model is explicit. Nevertheless, an attempt to use (2) or the Panjer recursion directly in posterior computations results in a numerically intractable procedure. Equally important is the fact that a Panjer recursion-based approach would not apply to nonequidistant observation times {t_i}. Therefore, instead of (2) and the Panjer recursion, we will employ data augmentation (Tanner & Wong, 1987). We now switch back to the case where the values of {t_i} are not necessarily uniformly spaced. The details of our procedure are as follows: when the process X is observed continuously over the time interval [0, T], so that our observations are a full sample path X^(T) = (X_t : t ∈ [0, T]) of the CPP, the likelihood is tractable and is proportional to
\[ e^{-T \sum_{k=1}^{m} \nu_k} \, \prod_{k=1}^{m} \nu_k^{\mu_k}, \]
see Shreve (2004). Here
\[ \mu_k = \#\{j : Y_j = k\}, \tag{3} \]
that is, the total number of jumps of size k. Then the prior Π from Subsection 2.1 leads to conjugate posterior computations. In fact, the full conditionals are
\[
\begin{aligned}
\nu_k \mid \{\mu_k\}, \{\beta_k\} &\overset{\text{ind}}{\sim} \mathrm{Gamma}(a + \mu_k,\ 1/\beta_k + T), & k &= 1, \dots, m, \\
\beta_k \mid \{\nu_k\}, \gamma &\overset{\text{ind}}{\sim} \mathrm{IG}(a + c,\ \gamma + \nu_k), & k &= 1, \dots, m, \\
\gamma \mid \{\beta_k\} &\sim \mathrm{Gamma}\Big(cm + 1,\ 1 + \sum_{k=1}^{m} \beta_k^{-1}\Big).
\end{aligned}
\]
Therefore, the Gibbs sampler for posterior inference on ν can be implemented. The Gibbs sampler cycles through the above conditionals a large number of times, generating approximate (dependent) samples from the posterior. See, e.g., Gelfand and Smith (1990) and Section 24.5 in Wasserman (2004) on the Gibbs sampler and its use in Bayesian statistics.
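Under our notational conventions (Gamma parameterized by shape and rate, IG by shape and scale), one sweep through these full conditionals can be sketched in Python as follows. NumPy's gamma sampler takes a scale parameter, so rates are inverted, and the inverse Gamma draw is obtained as the reciprocal of a Gamma draw; variable names are ours.

```python
import numpy as np

def gibbs_sweep(rng, mu, T, beta, a=0.01, c=2.0):
    """One sweep through the full conditionals (continuous-record case):
    nu_k  | mu_k, beta_k  ~ Gamma(a + mu_k, rate = 1/beta_k + T),
    gamma | {beta_k}      ~ Gamma(c*m + 1,  rate = 1 + sum_k 1/beta_k),
    beta_k| nu_k, gamma   ~ IG(a + c, gamma + nu_k)."""
    m = len(mu)
    nu = rng.gamma(a + mu, 1.0 / (1.0 / beta + T))               # scale = 1/rate
    gamma = rng.gamma(c * m + 1.0, 1.0 / (1.0 + np.sum(1.0 / beta)))
    beta = 1.0 / rng.gamma(a + c, 1.0 / (gamma + nu))            # IG via reciprocal
    return nu, beta, gamma
```

With jump counts μ_k accumulated over a record of length T, the posterior mean of ν_k is approximately (a + μ_k)/(1/β_k + T) ≈ μ_k/T for large counts, which gives a quick sanity check on the sampler.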
As we do not observe the process X continuously, we will combine the above with the data augmentation device. First note that

\[ Z_i = \sum_{j=1}^{m} j \, \mu_{ij}, \]

where (μ_ij : i = 1, …, n, j = 1, …, m) are independent and μ_ij ~ Poisson(Δ_i ν_j) for ν_j = λ p_j and Δ_i = t_i − t_{i−1}; see Corollary 11.3.4 in Shreve (2004, p. 498). Furthermore, for μ_k as in (3) we trivially have μ_k = Σ_{i=1}^n μ_ik. Data augmentation iterates the following two steps:

(i) Draw (μ_ij) conditional on the data 𝒵_n and the parameter ν.
(ii) Draw ν conditional on (μ_ij).

Once the algorithm has been run long enough, this gives approximate (dependent) samples from the posterior of ν. We already know how to deal with step (ii); it remains to handle step (i).
Thus, keeping ν fixed, for each i we want to compute the conditional distribution of (μ_i1, …, μ_im) given Z_i, and, furthermore, we want to be able to simulate from this distribution. In turn, this will immediately allow us to simulate μ_k conditional on the data 𝒵_n. Now, with Pr(·) referring to probability under the parameter ν, it holds that

\[
\Pr(\mu_{i1} = k_1, \dots, \mu_{im} = k_m \mid Z_i = z_i)
= \frac{1}{\Pr(Z_i = z_i)} \, e^{-\Delta_i \sum_{j=1}^{m} \nu_j}
\prod_{j=1}^{m} \frac{(\Delta_i \nu_j)^{k_j}}{k_j!} \,
\mathbf{1}\Big\{\sum_{j=1}^{m} j k_j = z_i\Big\}.
\]

Knowledge of the normalizing constant Pr(Z_i = z_i) will not be needed in our approach.
In general, simulation from a discrete multivariate distribution is nontrivial; some general options are discussed in Devroye (1986, Chapter XI, Section 1.5), but they are unlikely to work easily for a large m. We will take an alternative route and use the Metropolis-Hastings algorithm; see, for example, Section 24.4 in Wasserman (2004). We start by observing that, for a fixed i, the support of Pr(· | Z_i = z_i) is precisely the set 𝒮_i of nonnegative solutions (k_1, …, k_m) of the Diophantine equation Σ_{j=1}^m j k_j = z_i. The R package nilde (see Pya Arnqvist, Voinov, & Voinov, 2018) implements an algorithm from Voinov and Nikulin (1997) that finds all such solutions for given integers m and z_i. By Markovianity of the process X, we can simulate the vectors (μ_i1, …, μ_im) independently for each i = 1, …, n. If z_i = 0 or 1, there is only one solution to the Diophantine equation: the trivial solution (0, …, 0) in the first case, and the solution (1, 0, …, 0) in the second; for such z_i no simulation is required, as (μ_i1, …, μ_im) is known explicitly. We thus only need to consider each i ∈ ℐ = {i : z_i ≠ 0, 1} in turn, and design a Metropolis-Hastings move on the set of the corresponding solutions 𝒮_i. Fix once and for all an ordering of the elements of 𝒮_i (this could be, e.g., lexicographic, or reverse lexicographic); we write |𝒮_i| for the cardinality of 𝒮_i. Let μ = (μ_i1, …, μ_im) be the current state of the chain, corresponding to the ℓth element s_ℓ of 𝒮_i. A proposal μ° = (μ°_i1, …, μ°_im) is obtained as follows:

(i) If ℓ = 1, draw μ° uniformly at random from the elements {s_2, s_{|𝒮_i|}} of 𝒮_i.
(ii) If ℓ = |𝒮_i|, draw μ° uniformly at random from the elements {s_1, s_{|𝒮_i|−1}} of 𝒮_i.
(iii) Otherwise, draw μ° uniformly at random from the elements {s_{ℓ−1}, s_{ℓ+1}} of 𝒮_i.

Occasionally, one may want to propose another type of move too:

(iv) Draw μ° = (μ°_i1, …, μ°_im) uniformly at random from 𝒮_i.
The two proposals lead to reversible moves, and one may alternate them with probabilities π and 1 − π, for example, π = 0.8. The logarithm of the acceptance probability of a move from (μ_i1, …, μ_im) to (μ°_i1, …, μ°_im) is computed as

\[
\log A = \sum_{k=1}^{m} (\mu^\circ_{ik} - \mu_{ik}) \log(\Delta_i \nu_k)
+ \sum_{k=1}^{m} \{\log(\mu_{ik}!) - \log(\mu^\circ_{ik}!)\}.
\]

The move is accepted if log U ≤ log A for U an independently generated uniform random variate on [0, 1], in which case the current state of the chain is reset to (μ°_i1, …, μ°_im). Otherwise the chain stays in (μ_i1, …, μ_im).
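To illustrate the imputation step, the following self-contained Python sketch enumerates the solution set 𝒮_i by brute force (the paper uses the R package nilde for this) and performs the Metropolis-Hastings move with the mixed neighbour/uniform proposal and the acceptance probability above. The function names and the brute-force enumeration are ours.

```python
import math
import random

def dioph_solutions(z, m):
    """All nonnegative integer vectors (k_1,...,k_m) with sum_j j*k_j = z.
    Brute-force recursion over the last coordinate; fine for small z."""
    if m == 1:
        return [(z,)]
    sols = []
    for k in range(z // m + 1):
        for rest in dioph_solutions(z - m * k, m - 1):
            sols.append(rest + (k,))
    return sols

def mh_step(sols, ell, nu, delta, pi=0.8):
    """One Metropolis-Hastings move on the (cyclically ordered) solution set
    `sols`; `ell` indexes the current state s_ell. Returns the new index."""
    n = len(sols)
    if random.random() < pi:
        prop = random.choice([(ell - 1) % n, (ell + 1) % n])   # neighbour move
    else:
        prop = random.randrange(n)                             # uniform move
    cur, new = sols[ell], sols[prop]
    # log A = sum_k (mu_new - mu_cur) log(delta*nu_k) + log(mu_cur!) - log(mu_new!)
    log_A = sum((new[k] - cur[k]) * math.log(delta * nu[k])
                + math.lgamma(cur[k] + 1) - math.lgamma(new[k] + 1)
                for k in range(len(nu)))
    if log_A >= 0 or random.random() < math.exp(log_A):
        return prop
    return ell
```

For example, with z_i = 3 and m = 3 the solution set is {(3,0,0), (1,1,0), (0,0,1)}, and the chain's empirical frequencies converge to the conditional weights proportional to Π_k (Δ_i ν_k)^{k_k}/k_k!.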
3 SIMULATION EXAMPLES
In this section, we test the performance of our approach in a range of representative simulation examples. For benchmarking, we use the frequentist plug-in estimator from Buchmann and Grübel (2004). Two real data examples are given in Section 4. Unless otherwise stated, we took c = 2 and a = 0.01 as hyperparameters in our prior specification. As can be seen from the update formulae for the Gibbs sampler, as long as a is not taken too large, its precise value has little influence on the posterior, given a reasonable sample size. The value c = 2 ensures that the update step for β_k has finite variance. At each step of updating the imputed data for increment size z, we chose with probability 0.2 to propose uniformly from all solutions to the Diophantine equation (for that particular value of z).
We implemented our procedure in Julia (Bezanson, Edelman, Karpinski, & Shah, 2017). The code and datasets for replication of our examples are available on GitHub¹ and Zenodo (Gugushvili, Mariucci, & van der Meulen, 2019).
3.1 Uniform base distribution
This simulation example follows, with some extensions, that in Buchmann and Grübel (2004). Let λ_0 = 2, and let P_0 be the discrete uniform distribution on {1, 4, 6}. We simulated data according to the following settings:
(a) n = 100, Δ_i = 1 for 1 ≤ i ≤ n;
(b) n = 500, Δ_i = 1 for 1 ≤ i ≤ n (the data under (a) are augmented with 400 extra observations);
(c) n = 500, Δ_i ~ Unif(0, 2) for 1 ≤ i ≤ n.
We set m = min(15, Z_(n)), where Z_(n) = max_{1≤i≤n} Z_i. In all cases this led to m = 15, as the value of Z_(n) was equal to 30, 35, and 40 for the simulated data under settings (a), (b), and (c), respectively. The Gibbs sampler was run for 500,000 iterations, of which the first 250,000 were discarded as burn-in. From the remaining samples, the posterior mean and the 2.5% and 97.5% percentiles were computed for each coefficient ν_k. The results for the first 10 coefficients are shown in Figure 1.

¹See https://github.com/fmeulen/Bdd

FIGURE 1  Simulation example from Section 3.1. In each panel, the horizontal axis gives k ∈ {1, …, 10} and the vertical axis the magnitudes of ν_k. The orange balls denote the true values, the black triangles the Buchmann-Grübel estimator. The blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals. The settings corresponding to (a), (b), and (c) are explained in the main text. Note the differences in vertical scale across the panels.
For comparison, the estimator from Buchmann and Grübel (2004) is also included in the figure. For setting (b), traceplots of every 50th iteration for a few of the coefficients ν_k are shown in Figure 2. We measure the error of an estimate ν̂ = (ν̂_k) by

\[ \mathrm{Err}(\nu, \hat{\nu}) = \sum_{k=1}^{\infty} |\hat{\nu}_k - \nu_k|. \]

The errors are reported in Table 1. In all settings, for these particular realizations of the simulated data, the Bayesian procedure outperformed the truncated estimator from Buchmann and Grübel (2004). For setting (c), the latter produces a poor result, as was to be expected, given that it is derived under the assumption Δ_i = 1 for all i. An advantage of the Bayesian procedure is the included measure of uncertainty, namely the credible intervals for ν_k. On the other hand, for the Buchmann-Grübel estimator it is hardly possible to derive confidence intervals via an asymptotic method, since the limiting distribution of the estimator is fairly complicated. Although not considered in the original publications Buchmann and Grübel (2003) and Buchmann and Grübel (2004), a natural alternative is the bootstrap. A detailed examination of the performance of the latter and its comparison to that of the Bayesian method lies beyond the scope of the present paper. Indeed, any thorough study would require, on the one hand, the asymptotic justification of bootstrap confidence intervals and, on the other hand, establishing frequentist coverage properties of our Bayesian procedure. In that respect, good performance of neither method is automatically warranted (e.g., van der Pas, Szabó, and van der Vaart (2017) and van der Vaart (1998), Chapter 23). Here instead we opt for a numerical illustration, which is reported in Appendix B.

FIGURE 2  Traceplots for the simulation example from Section 3.1 under setting (b). The posterior samples were subsampled, with every 50th iteration kept. The displayed results are for the parameters ν_1, ν_2, and ν_6.

TABLE 1  Estimation errors for scenarios (a)-(c) from Section 3.1

  Simulation setting          (a)     (b)     (c)
  Buchmann-Grübel estimator   1.40    0.32    1.44
  Posterior mean              0.15    0.07    0.12
3.2 Geometric base distribution
The setup of this synthetic data example likewise follows that in Buchmann and Grübel (2004). Assume Q is a geometric distribution with parameter α, that is, q_k = (1 − α)^k α for 0 < α < 1, k ∈ ℕ_0. Then λ = −log α, and

\[ p_k = -\frac{(1 - \alpha)^k}{k \log \alpha}, \quad k \in \mathbb{N}. \]

Hence, ν_k = (1 − α)^k / k.
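These closed-form expressions can be checked numerically: starting from the claimed (λ, p), the compound distribution (2) should return the Geom(α) law. A short Python check (the truncation levels and variable names are ours):

```python
import math

alpha = 1 / 3
K = 12                                    # work with pmfs truncated to {0,...,K}
lam = -math.log(alpha)                    # claimed intensity
# claimed base pmf: p_k = -(1 - alpha)^k / (k * log(alpha)), k >= 1
p = [0.0] + [-(1 - alpha) ** k / (k * math.log(alpha)) for k in range(1, K + 1)]

def convolve(u, v):
    """Convolution of two pmfs on {0,...,K}, truncated at K."""
    w = [0.0] * (K + 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            if i + j <= K:
                w[i + j] += ui * vj
    return w

# compound distribution (2): q = e^{-lam} * sum_j lam^j / j! * p^{*j};
# jumps are >= 1, so only terms with j <= K contribute on {0,...,K}
q = [0.0] * (K + 1)
p_star_j = [1.0] + [0.0] * K              # p^{*0} = point mass at 0
for j in range(K + 1):
    coef = math.exp(-lam) * lam ** j / math.factorial(j)
    q = [qi + coef * pi for qi, pi in zip(q, p_star_j)]
    p_star_j = convolve(p_star_j, p)
```

Up to floating-point error, the resulting q_k coincide with (1 − α)^k α, and summing ν_k = (1 − α)^k / k recovers λ = −log α (the Mercator series).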
We consider two simulation setups:

(a) n = 500, Δ_i = 1 for 1 ≤ i ≤ n, and α = 1/3;
(b) n = 500, Δ_i = 1 for 1 ≤ i ≤ n, and α = 1/6.

FIGURE 3  Simulation example from Section 3.2. Settings (a) and (b) correspond to the true jump distributions Geom(1/3) and Geom(1/6), respectively. In each panel, the horizontal axis gives k ∈ {1, …, 15} and the vertical axis the magnitudes of ν_k. The orange balls denote the true values, the black triangles the Buchmann-Grübel estimator. The blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals.
TABLE 2  Estimation errors for scenarios (a)-(b) from Section 3.2

  Simulation setting          (a)     (b)
  Buchmann-Grübel estimator   0.28    0.60
  Posterior mean              0.52    1.05
We set m = min(15, Z_(n)) and ran the sampler according to the settings of Section 3.1. The results for both scenarios (a) and (b) are reported in Figure 3. In Table 2 we also report the estimation errors in one simulation run. For this example and these generated data, the Bayesian procedure gives less precise point estimates than the Buchmann-Grübel method. Note that the estimation error for α = 1/3 is smaller than that for α = 1/6. This appears intuitive, as a smaller value of α corresponds to a larger value of λ. The latter implies that on average each Z_i is a superposition of a larger number of jumps, which renders the decompounding problem more difficult. However, this argument is hard to formalise.
3.3 Monte Carlo study
For a more thorough comparison of the Buchmann-Grübel estimator and our Bayesian method, we performed a small Monte Carlo experiment. We considered two settings:

(i) the setting from Section 3.1 with n = 250, where we took m = min(15, Z_(n));
(ii) the setting from Section 3.2 with α = 1/3, where we took m = min(20, Z_(n)).
FIGURE 4  Monte Carlo study from Subsection 3.3 comparing the Buchmann-Grübel estimator and the Bayesian method proposed in this paper: boxplots of the absolute (L1) errors for the settings "Uniform base", "Geometric base (c = 2)", and "Geometric base (c = 0.01)". Here "bg" refers to the Buchmann-Grübel estimator, while "bayesmedian" and "bayesmean" refer to the Bayesian method, where either the median or the mean was used as a point estimator for each ν_i. The leftmost panel corresponds to setting (i), whereas the other two panels correspond to setting (ii); in the latter we used both c = 2 and c = 0.01 in the prior specification.
In both cases we assumed Δ_i = 1 for all 1 ≤ i ≤ n. The number of Monte Carlo repetitions was taken equal to 50. We took 400,000 Markov chain Monte Carlo (MCMC) iterations and discarded the first half as burn-in. In Figure 4 we give a graphical display of the results by means of boxplots of the errors. Here, as before, if the true values are denoted by ν_k and the estimate within a particular simulation run by ν̂_k, the error is defined by Err(ν, ν̂) = Σ_{k=1}^∞ |ν̂_k − ν_k| (we truncated the infinite summation at 50). The results agree with our earlier findings, in that there is no clear "winner" in the comparison. Note that for setting (ii) we considered both c = 2 and c = 0.01 in the prior specification. Both values give similar performance of the Bayesian method, which provides insight into the sensitivity of our results with respect to the prior specification. A minor difference between the middle and rightmost panels of Figure 4 may be attributed to Monte Carlo error: the 50 simulated datasets on which these panels are based are not the same. Note that the prior promotes sparsity, and in that respect it is not surprising that it does better when the true data-generating Lévy measure is sparse.
3.4 Computing time
In terms of computational effort, the time it takes to evaluate the Buchmann-Grübel estimator is negligible compared to our algorithm for sampling from the posterior. This is not surprising, as that frequentist estimator relies on a plug-in approach, whereas in our case an approximation to the posterior is obtained by MCMC simulation. However, if proper uncertainty quantification is desired, then the Bayesian method is advantageous in the sense that it does not solely produce a point estimate.
Note that the proposed MCMC scheme requires determination of the solutions to the Diophantine equation Σ_{j=1}^m j k_j = z for all unique values z in the observation set. For moderate values of z, say z ≤ 30, this is rather quick, but for large values of z the computing time increases exponentially, as does the amount of allocated memory. The computing time of each Metropolis-Hastings step is then very small, but we potentially need a very large number of iterations to reach stationarity. The latter is caused, first, by the fact that at a particular iteration our proposals for μ_ij do not take into account the current values of ν_1, …, ν_m; second, the size of the state space that needs to be explored increases exponentially with m.
4 REAL DATA EXAMPLES

4.1 Horse kick data
To further illustrate our procedure, we will use the von Bortkewitsch data on the number of soldiers in the Prussian cavalry killed by horse kicks (available by year and by cavalry corps); this example was also employed in Buchmann and Grübel (2003). Each observation is an integer from 0 to 4, giving the number of deaths for a given year and a given cavalry corps, with overall counts reported in Table 3. The data are extracted from the table on p. 25 in von Bortkewitsch (1898). Note that von Bortkewitsch corrects for the fact that the Guards and the I, VI, and XI cavalry corps of the Prussian army had a different organization from the other units, and justifiably omits the corresponding counts from consideration.
It was demonstrated by von Bortkewitsch that the Poisson distribution fits the horse kick data remarkably well. Assuming instead that the observations follow a compound Poisson distribution is a stretch of the imagination, as that would correspond to a horse running amok and killing possibly more than one soldier in one go. Nevertheless, this example provides a good sanity check for our statistical procedure.
The estimation results are graphically depicted in Figure 5. Clearly, point estimates of both methods are in agreement and lend support to the Poisson model for this dataset.
4.2 Plant data
Our second real example is the one used in Buchmann and Grübel (2004). Consider the data in Table 4, taken from Evans (1953). The data were collected as follows: the area was divided into plots of equal size and in each plot the number of plants was counted; the number of plants in each
TABLE 3  Data on the number of soldiers in the Prussian cavalry killed by horse kicks. See the table on p. 25 in von Bortkewitsch (1898)

  Deaths   0     1    2    3   4
  Counts   109   65   22   3   1

FIGURE 5  Estimation results for the horse kick data from Subsection 4.1. The horizontal axis gives k ∈ {1, …, 4} and the vertical axis the magnitudes of ν_k. The black triangles denote the Buchmann-Grübel estimator, the blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals.
TABLE 4  Plant population data from Evans (1953)

  Plants   0     1    2    3    4    5    6    7   8   9   10   11   12
  Counts   274   71   58   36   20   12   10   7   6   3   0    2    1

FIGURE 6  Estimation results for the plant data from Subsection 4.2. The horizontal axis gives k ∈ {1, …, 10} and the vertical axis the magnitudes of ν_k. The black triangles denote the Buchmann-Grübel estimates, the blue crosses give the posterior means, whereas the vertical blue line segments represent (pointwise) 95% credible intervals.
plot ranges from 0 to 12. The second row of Table 4 gives the counts of plots containing a given number of plants; thus, there were 274 plots that contained no plants, 71 that contained 1 plant, etc. It is customary in the ecological literature to model such count data as i.i.d. realizations from a compound Poisson distribution. Thus, for example, Neyman (1939) advocated the use of a Poisson base distribution in this context; another option here is a geometric base distribution. Given the existence of several distinct modeling possibilities, performing an exploratory nonparametric analysis appears to be a sensible strategy.
The estimation results are graphically depicted in Figure 6. There are some small differences between the posterior mean and the Buchmann-Grübel estimate, but overall they are very similar.
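To make the comparison concrete, a plug-in inversion in the spirit of the Buchmann-Grübel estimator (here without the truncation step that gives their method its name; this is an illustrative sketch, not the authors' implementation) can be applied directly to the empirical frequencies of Table 4: set $\hat{\lambda} = -\log \hat{q}_0$ and invert the Panjer recursion for $\nu_k = \lambda p_k$.

```python
import numpy as np

# Empirical frequencies q_k from Table 4 (500 plots in total).
counts = np.array([274, 71, 58, 36, 20, 12, 10, 7, 6, 3, 0, 2, 1])
q = counts / counts.sum()

# Invert the Panjer recursion q_k = (1/k) sum_{j=1}^k j nu_j q_{k-j}:
# nu_k = (k q_k - sum_{j=1}^{k-1} j nu_j q_{k-j}) / (k q_0), lambda = -log q_0.
lam_hat = -np.log(q[0])          # about 0.60
nu_hat = np.zeros(len(q))
for k in range(1, len(q)):
    s = sum(j * nu_hat[j] * q[k - j] for j in range(1, k))
    nu_hat[k] = (k * q[k] - s) / (k * q[0])

# Self-consistency check: running the Panjer recursion forward with nu_hat
# reproduces the empirical frequencies exactly.
q_back = np.zeros(len(q))
q_back[0] = np.exp(-lam_hat)
for k in range(1, len(q)):
    q_back[k] = sum(j * nu_hat[j] * q_back[k - j] for j in range(1, k + 1)) / k
print(np.allclose(q_back, q))    # True
```

Without truncation, some of the higher-order $\hat{\nu}_k$ may come out negative; handling this is precisely the point of the recursive truncation in Buchmann and Grübel (2004).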
5 FREQUENTIST ASYMPTOTICS
In this section we assume that the observation times $\{t_i\}$ are equidistant: $t_i = i$, $i = 1, \ldots, n$. To evaluate our Bayesian method from a theoretical point of view, we will verify that it is consistent, and we will establish the rate at which the posterior contracts around the "true," data-generating Lévy measure $\nu_0$; see Ghosal and van der Vaart (2017) for a thorough treatment of Bayesian asymptotics from the frequentist point of view. From now on, the subscript 0 in various quantities will refer to the data-generating distribution.
Our strategy consists in proving that the posterior contraction rate for $\nu_0$, given the sample $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, can be derived from the posterior contraction rate for $q_0$ given $\mathcal{Z}_n$, which is mathematically easier, since $Z_1, \ldots, Z_n$ is a sequence of independent and identically distributed random variables with distribution $q_0$. We thereby effectively avoid dealing directly with the inverse nature of the problem of estimating $p_0$.
The prior we consider in this section is defined as follows:
• Endow the rate $\lambda$ of the Poisson process with a prior distribution.
• Independently, endow the vector $(p_1, \ldots, p_m)$ with a Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_m)$.
This is a somewhat simplified version of the prior we used in Section 2, which allows us to concentrate on the essential features of the problem without cluttering the analysis with extra and unenlightening technicalities. Also recall the well-known relationship between the Gamma and Dirichlet distributions: if $\xi_1, \ldots, \xi_m$ are independent Gamma distributed random variables, $\xi_i \sim \mathrm{Gamma}(\alpha_i, 1)$, then for $\eta_i = \xi_i / \sum_{j=1}^m \xi_j$ the vector $(\eta_1, \ldots, \eta_m)$ follows the Dirichlet distribution with parameter $(\alpha_1, \ldots, \alpha_m)$; furthermore, $\sum_{j=1}^m \xi_j \sim \mathrm{Gamma}(\sum_{j=1}^m \alpha_j, 1)$, and $(\eta_1, \ldots, \eta_m)$ is independent of $\sum_{j=1}^m \xi_j$.
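This Gamma-Dirichlet relationship is easy to check by simulation; the parameter values in the Python sketch below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 1.0, 2.0])   # illustrative Dirichlet parameters

# xi_i ~ Gamma(alpha_i, 1) independently; eta = xi / sum(xi).
xi = rng.gamma(alpha, 1.0, size=(200_000, 3))
eta = xi / xi.sum(axis=1, keepdims=True)

# Dirichlet(alpha) marginal means are alpha_i / sum(alpha) = (1/7, 2/7, 4/7).
print(eta.mean(axis=0))

# The normalized vector is independent of the total, which is Gamma(sum(alpha), 1);
# the sample correlation between eta_1 and the total is close to 0.
total = xi.sum(axis=1)
print(np.corrcoef(eta[:, 0], total)[0, 1])
```

This construction is also how Dirichlet draws are typically generated in practice, which is why the simplified prior meshes naturally with the Gamma prior on $\lambda$.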
In our asymptotic setting, we will make $m = m_n$ dependent on $n$ and let $m_n \to \infty$ at a suitable rate as $n \to \infty$.
Recall that we write $Q = (q_k)_{k \in \mathbb{N}_0}$, where $q_k = Q(\{k\})$. Let $\mathcal{Q}$ denote the collection of all probability measures supported on $\mathbb{N}_0$.
Theorem 1. Suppose there exists $\alpha$, such that $0 < \alpha \le \alpha_i \le 1$ for all $1 \le i \le m_n$. Suppose $\lambda \sim \mathrm{Gamma}(a, b)$ with $a \in (0, 1]$ and that $\nu_0$ has a compact support. Then, for any $\gamma > 1$,
$$
\Pi_n\Big( \|\nu - \nu_0\|_1 \ge \frac{\log^\gamma n}{\sqrt{n}} \,\Big|\, \mathcal{Z}_n \Big) \to 0
$$
in $Q_0^n$-probability, as $n \to \infty$.
Remark 1. Note that since the support of $\nu_0$ is not assumed to be known, our CPP model is still naturally nonparametric. The assumption of compact support of $\nu_0$ does not cover the simulation example of Section 3.2. However, its relaxation appears to pose very difficult technical challenges and is not attempted in this work.
The remainder of this section is devoted to the proof of Theorem 1.
5.1 Basic posterior inequality via the stability estimate
A key step of the proof of Theorem 1 is the stability estimate in Equation (5) below, which bounds the total variation distance between the Lévy measures $\nu, \nu'$ in terms of the total variation distance between the corresponding compound distributions $q, q'$.
In principle, it is conceivable that the Panjer recursion should allow one to bound probability distances between P-probabilities via distances between Q-probabilities; we call such a bound a stability estimate. Nevertheless, explicit as the equations of the Panjer recursion are, they are still somewhat unwieldy for that purpose. Hence we will use another inversion formula, from Buchmann and Grübel (2003), which leads to the stability estimate we are after.
First we introduce some notation, and also recall a few useful facts summarised in Buchmann and Grübel (2003). The space of absolutely summable sequences is defined as
$$
\ell_1 := \Big\{ a \in \mathbb{R}^{\mathbb{N}_0} : \sum_{j=0}^{\infty} |a_j| < \infty \Big\},
$$
with a norm given by $\|a\|_1 = \sum_{j=0}^{\infty} |a_j|$. For probability vectors $a, b$, the norm $\|a - b\|_1$ is (twice) the total variation distance between $a$ and $b$. For any $a, b \in \ell_1$, we have the inequality
$$
\|a * b\|_1 \le \|a\|_1 \|b\|_1. \tag{4}
$$
The exponential of $a \in \ell_1$ is defined as
$$
\exp(a) = \sum_{j=0}^{\infty} \frac{a^{*j}}{j!}.
$$
The exponential has the following two useful properties:
$$
\exp(a + b) = \exp(a) * \exp(b), \quad a, b \in \ell_1,
$$
and
$$
\exp(a) = \exp(b) \;\Rightarrow\; a = b, \quad a, b \in \ell_1.
$$
We define a sequence $\delta_0 = (\delta_{0,k})_{k \in \mathbb{N}_0}$, such that $\delta_{0,0} = 1$ and all its other entries are equal to zero. Then, using the above properties of the exponential, we can write the compounding mapping in (2) concisely in terms of convolutions of infinite sequences: $q = \exp(\lambda(p - \delta_0))$. Its convolution inverse, that is, $q^{*(-1)}$ such that $q^{*(-1)} * q = \delta_0$, is given by $r = q^{*(-1)} = \exp(-\lambda(p - \delta_0))$. Note that $r \in \ell_1$. We have the following recursive expressions:
$$
r_0 = \frac{1}{q_0}, \qquad r_k = -\frac{1}{q_0} \sum_{j=1}^{k} q_j r_{k-j}, \quad k \in \mathbb{N}.
$$
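These recursions are straightforward to implement. The Python sketch below (an illustration with an arbitrary pair $(\lambda, p)$, not the authors' code) computes $q$ via the standard Panjer recursion for compound Poisson laws and then $r$ via the recursion above, checking that $q * r = \delta_0$ on a truncated grid.

```python
import numpy as np

def panjer_q(lam, p, K):
    """q_0..q_K for the compound Poisson law with rate lam and jump
    probabilities p = (p_1, ..., p_m) on {1, ..., m}, via the Panjer
    recursion q_k = (lam/k) * sum_j j * p_j * q_{k-j}."""
    q = np.zeros(K + 1)
    q[0] = np.exp(-lam)
    for k in range(1, K + 1):
        q[k] = lam / k * sum(j * p[j - 1] * q[k - j]
                             for j in range(1, min(k, len(p)) + 1))
    return q

def conv_inverse(q):
    """Convolution inverse: r_0 = 1/q_0, r_k = -(1/q_0) sum_{j=1}^k q_j r_{k-j}."""
    r = np.zeros_like(q)
    r[0] = 1.0 / q[0]
    for k in range(1, len(q)):
        r[k] = -sum(q[j] * r[k - j] for j in range(1, k + 1)) / q[0]
    return r

lam, p = 1.2, [0.6, 0.4]      # illustrative (lambda, p) supported on {1, 2}
q = panjer_q(lam, p, K=20)
r = conv_inverse(q)

delta = np.convolve(q, r)[:21]  # truncated q * r
print(np.allclose(delta, np.eye(21)[0]))  # True: q * r = delta_0
```

The identity holds exactly on the truncated grid (up to floating point), since $(q * r)_k$ for $k \le K$ only involves $q_0, \ldots, q_k$ and $r_0, \ldots, r_k$.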
Lemma 1. Let $q, q'$ correspond to two pairs $(\lambda, p)$ and $(\lambda', p')$, respectively (and let $r$ correspond to $q$, i.e., to the pair $(\lambda, p)$). Then, in accordance with the notation introduced above and provided that $\|q' - q\|_1 < \|r\|_1^{-1}$, it holds that
$$
\|\nu' - \nu\|_1 = \|\lambda' p' - \lambda p\|_1 \le \frac{\|r\|_1 \|q' - q\|_1}{1 - \|r\|_1 \|q' - q\|_1}. \tag{5}
$$
Proof. The result is a direct consequence of Lemma 3 in Buchmann and Grübel (2003), which states that
$$
(\lambda' - \lambda)\,\delta_0 + \lambda p - \lambda' p' = \sum_{j=1}^{\infty} \frac{1}{j}\,\big(r * (q - q')\big)^{*j},
$$
whenever $\|q' - q\|_1 < \|r\|_1^{-1}$. Taking the $\|\cdot\|_1$-norm on both sides and some elementary bounding via (4) imply that
$$
|\lambda' - \lambda| + \|\lambda' p' - \lambda p\|_1 \le \frac{\|r\|_1 \|q' - q\|_1}{1 - \|r\|_1 \|q' - q\|_1},
$$
and thus Equation (5) follows. ▪
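The stability estimate (5) can be sanity checked numerically for a small perturbation of $(\lambda, p)$; the two pairs below are arbitrary illustrative choices, and the $\ell_1$-norms are approximated on a truncated grid (the truncation error is negligible for these parameter values).

```python
import numpy as np

def panjer_q(lam, p, K):
    # Compound Poisson probabilities q_0..q_K via the Panjer recursion.
    q = np.zeros(K + 1)
    q[0] = np.exp(-lam)
    for k in range(1, K + 1):
        q[k] = lam / k * sum(j * p[j - 1] * q[k - j]
                             for j in range(1, min(k, len(p)) + 1))
    return q

def conv_inverse(q):
    # r_0 = 1/q_0, r_k = -(1/q_0) sum_{j=1}^k q_j r_{k-j}.
    r = np.zeros_like(q)
    r[0] = 1.0 / q[0]
    for k in range(1, len(q)):
        r[k] = -sum(q[j] * r[k - j] for j in range(1, k + 1)) / q[0]
    return r

K = 60
lam, p = 1.0, np.array([0.7, 0.3])        # (lambda, p)
lam2, p2 = 1.05, np.array([0.68, 0.32])   # perturbed (lambda', p')

q, q2 = panjer_q(lam, p, K), panjer_q(lam2, p2, K)
r = conv_inverse(q)

lhs = np.abs(lam2 * p2 - lam * p).sum()   # ||nu' - nu||_1
d = np.abs(q2 - q).sum()                  # ||q' - q||_1 (truncated)
rn = np.abs(r).sum()                      # ||r||_1 (truncated)
assert rn * d < 1                         # condition of Lemma 1 holds here
rhs = rn * d / (1 - rn * d)
print(lhs <= rhs)                         # True: the bound (5) is satisfied
```

Note how the bound degrades as $\|r\|_1 \|q' - q\|_1$ approaches 1, which is why Proposition 1 below splits the event according to the size of $\|r_0\|_1 \|q - q_0\|_1$.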
We will use Equation (5) to establish the key inequality for the posterior measure $\Pi(\cdot \mid \mathcal{Z}_n)$. We recall once again that the subscript 0 refers to the "true," data-generating quantities.
Proposition 1. For any prior $\Pi$ on $\nu$, for any $\varepsilon \in (0, 1]$ and for any $n \ge 1$, the following posterior inequality holds:
$$
\Pi\big(\|\nu - \nu_0\|_1 \ge \varepsilon \mid \mathcal{Z}_n\big) \le 2\,\Pi\Big(\|q - q_0\|_1 \ge \frac{\varepsilon}{2\|r_0\|_1} \,\Big|\, \mathcal{Z}_n\Big).
$$
Proof. Write $\{\nu : \|\nu - \nu_0\|_1 \ge \varepsilon\}$ as the union of the sets
$$
\{\nu : \|\nu - \nu_0\|_1 \ge \varepsilon\} \cap \{\nu : \|r_0\|_1 \|q - q_0\|_1 < 1/2\}
$$
and
$$
\{\nu : \|\nu - \nu_0\|_1 \ge \varepsilon\} \cap \{\nu : \|r_0\|_1 \|q - q_0\|_1 \ge 1/2\}.
$$
Thanks to Lemma 1, the first of these sets is a subset of $\{\nu : \|q - q_0\|_1 \ge \varepsilon/(2\|r_0\|_1)\}$. The proof is concluded by observing that the second set is a subset of $\{\nu : \|q - q_0\|_1 \ge \varepsilon/(2\|r_0\|_1)\}$, too, since $\varepsilon \le 1$. ▪
In general, stability estimates like the one in Equation (5) are unknown in the literature on Lévy processes. Consequently, studying Bayesian asymptotics for Lévy models, even in the CPP case, necessitates the use of very intricate arguments under restrictive assumptions (e.g., Nickl and Söhl (2017)).
5.2 Proof of Theorem 1
The usefulness of Proposition 1 above lies in the fact that the posterior contraction rate in the inverse problem of estimating the Lévy measure $\nu_0$ from indirect observations $\mathcal{Z}_n$ can now be deduced from the posterior contraction rate in the direct problem of estimating the compound distribution $q_0$, which is easier (observe that $r_0$ is determined by $\nu_0$ and is therefore fixed in the proofs). The general machinery developed in Ghosal, Ghosh, and van der Vaart (2000) can be applied to handle the latter, and also several inequalities obtained in Gugushvili et al. (2015) are useful in that respect. In particular, we make use of the following inequality for the Hellinger distance,
$$
h(q_{\lambda,p}, q_{\lambda',p'}) \le \sqrt{\lambda}\, h(p, p') + \big|\sqrt{\lambda} - \sqrt{\lambda'}\big|, \tag{6}
$$
compare Lemma 1 in Gugushvili et al. (2015). To ease our notation, in the sequel we will often write $q$ and $q'$ instead of $q_{\lambda,p}$ and $q_{\lambda',p'}$, respectively. Denote
$$
\mathrm{KL}(q_0, q) = Q_0\Big(\log \frac{q_0}{q}\Big), \qquad V(q_0, q) = Q_0\Big(\log \frac{q_0}{q}\Big)^2.
$$
Another two inequalities we will use are the following: let $\lambda, \lambda_0 \in [\underline{\lambda}, \overline{\lambda}]$. Then there exists a positive constant $C$, such that
$$
\mathrm{KL}(q_0, q) \le C\big(\mathrm{KL}(p_0, p) + |\lambda_0 - \lambda|^2\big),
$$
$$
V(q_0, q) \le C\big(V(p_0, p) + \mathrm{KL}(p_0, p) + |\lambda_0 - \lambda|^2\big); \tag{7}
$$
compare Equations (14) and (15) in Lemma 1 in Gugushvili et al. (2015).
These three inequalities can be obtained by adjusting the arguments used in Gugushvili et al. (2015). However, we opted to give their direct proofs in Lemma 5 in Appendix A, under slightly weaker conditions than those required in Gugushvili et al. (2015).
Our proof of Theorem 1 proceeds via verification of the conditions for posterior contraction in theorem 2.1 in Ghosal et al. (2000). In our setting, the latter result reads as follows.
Theorem 2. Assume $\mathcal{Z}_n = (Z_1, \ldots, Z_n)$, where $Z_1, \ldots, Z_n$ are independent and identically distributed with distribution $q_0$. Let $h$ denote the Hellinger metric on $\mathcal{Q}$, the collection of all probability measures with support in $\mathbb{N}_0$. Suppose that for a sequence $\{\epsilon_n\}$ with $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$, a constant $C > 0$, and sets $\mathcal{Q}_n \subset \mathcal{Q}$, we have
$$
\log N(\epsilon_n, \mathcal{Q}_n, h) \le n\epsilon_n^2,
$$
$$
\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) \le \exp\big(-n\epsilon_n^2 (C + 4)\big),
$$
$$
\Pi_n\big(q : \mathrm{KL}(q_0, q) \le \epsilon_n^2,\; V(q_0, q) \le \epsilon_n^2\big) \ge \exp\big(-C n \epsilon_n^2\big).
$$
Then, for sufficiently large $M > 0$, we have that $\Pi_n(Q : h(q, q_0) \ge M\epsilon_n \mid \mathcal{Z}_n) \to 0$ in $Q_0^n$-probability.
We will now verify the three conditions of this theorem, which we refer to as the entropy condition, the remaining mass condition, and the prior mass condition, respectively. To that end, fix strictly positive sequences $\{\underline{\Lambda}_n\}$, $\{\overline{\Lambda}_n\}$, and define the sieves
$$
\mathcal{Q}_n = \big\{q_{\lambda,p} : \lambda \in [\underline{\Lambda}_n, \overline{\Lambda}_n],\; \operatorname{supp} p \subseteq \{1, \ldots, m_n\}\big\}.
$$
5.2.1 Entropy
We start by bounding the entropy of the sieve $\mathcal{Q}_n$ for $h$-balls of radius $\epsilon_n$.
Lemma 2. Assume that, as $n \to \infty$,
$$
m_n \to \infty, \qquad \epsilon_n \to 0, \qquad \underline{\Lambda}_n \to 0, \qquad \overline{\Lambda}_n \to \infty. \tag{8}
$$
Then
$$
\log N(\epsilon_n, \mathcal{Q}_n, h) \lesssim m_n \Big\{\log(m_n) + \log(\overline{\Lambda}_n) + \log\Big(\frac{1}{\epsilon_n}\Big)\Big\} + \log\Big(\frac{1}{\underline{\Lambda}_n}\Big). \tag{9}
$$
Proof. For $\lambda, \lambda' \ge \underline{\Lambda}_n$,
$$
|\sqrt{\lambda} - \sqrt{\lambda'}| = \frac{|\lambda - \lambda'|}{\sqrt{\lambda} + \sqrt{\lambda'}} \le \frac{1}{2\sqrt{\underline{\Lambda}_n}}\,|\lambda - \lambda'|.
$$
Furthermore, from Section 3.3 in Pollard (2002),
$$
h(p, p') \le \sqrt{\|p - p'\|_1} \le \sqrt{m_n \|p - p'\|_\infty}.
$$
Combining the preceding two displays and Equation (6), we get
$$
h(q_{\lambda,p}, q_{\lambda',p'}) \le \sqrt{\overline{\Lambda}_n m_n \|p - p'\|_\infty} + \frac{1}{2\sqrt{\underline{\Lambda}_n}}\,|\lambda - \lambda'|.
$$
Hence, if
$$
\|p - p'\|_\infty \le \frac{\epsilon_n^2}{4\overline{\Lambda}_n m_n}, \qquad |\lambda - \lambda'| \le \sqrt{\underline{\Lambda}_n}\,\epsilon_n,
$$
then the Hellinger distance between $q_{\lambda,p}$ and $q_{\lambda',p'}$ is bounded by $\epsilon_n$. To cover $[\underline{\Lambda}_n, \overline{\Lambda}_n]$, we need at most
$$
\Big\lfloor \frac{\overline{\Lambda}_n}{2\epsilon_n\sqrt{\underline{\Lambda}_n}} \Big\rfloor + 1
$$
intervals of length $2\sqrt{\underline{\Lambda}_n}\,\epsilon_n$. To cover the discrete distributions with support in $\{1, \ldots, m_n\}$, we need at most
$$
\Big(\Big\lfloor \frac{2\overline{\Lambda}_n m_n}{\epsilon_n^2} \Big\rfloor + 1\Big)^{m_n}
$$
$L_\infty$-balls of radius $\epsilon_n^2/(4\overline{\Lambda}_n m_n)$. Under assumption (8), the summand 1 in the above displays is asymptotically negligible and can be omitted. In that case, the number of $h$-balls that we need to cover $\mathcal{Q}_n$ is of order
$$
\Big(\frac{\overline{\Lambda}_n m_n}{\epsilon_n^2}\Big)^{m_n} \times \frac{\overline{\Lambda}_n}{\epsilon_n \sqrt{\underline{\Lambda}_n}}.
$$
Taking the logarithm and next a straightforward rearrangement of the terms gives the statement of the lemma. ▪
5.2.2 Remaining prior mass
Now we will derive an inequality for the remaining prior mass.

Lemma 3. For $\lambda \sim \mathrm{Gamma}(a, b)$ with $0 < a \le 1$,
$$
\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) \lesssim \overline{\Lambda}_n^{\,a-1} e^{-b\overline{\Lambda}_n} + \underline{\Lambda}_n^{\,a}.
$$
Proof. We have (with a slight abuse of notation)
$$
\Pi_n(\mathcal{Q} \setminus \mathcal{Q}_n) = \Pi_n([\overline{\Lambda}_n, \infty)) + \Pi_n([0, \underline{\Lambda}_n)).
$$
Now,
$$
\Pi_n(\lambda \ge \overline{\Lambda}_n) = \frac{b^a}{\Gamma(a)} \int_{\overline{\Lambda}_n}^{\infty} \lambda^{a-1} e^{-b\lambda}\,\mathrm{d}\lambda \lesssim \overline{\Lambda}_n^{\,a-1} e^{-b\overline{\Lambda}_n}.
$$
Furthermore,
$$
\Pi_n([0, \underline{\Lambda}_n)) = \frac{b^a}{\Gamma(a)} \int_{0}^{\underline{\Lambda}_n} \lambda^{a-1} e^{-b\lambda}\,\mathrm{d}\lambda \lesssim \underline{\Lambda}_n^{\,a}.
$$
The proof is concluded. ▪
5.2.3 Prior mass
Finally, we lower bound the prior mass in a small Kullback-Leibler neighbourhood of the data-generating compound distribution $q_0$. Define the function $g : (0, \infty) \times (0, 1) \to (0, \infty)$ by
$$
g(\epsilon, c) = \frac{C\epsilon^2}{2[\log(e/c)]^2}.
$$
Lemma 4. Assume that
(i) there exists $\alpha$, such that $0 < \alpha \le \alpha_i \le 1$ for all $1 \le i \le m_n$;
(ii) strictly positive sequences $\underline{p}_n \to 0$, $\epsilon_n \to 0$ and $m_n \to \infty$ satisfy the inequalities $m_n g(\epsilon_n, \underline{p}_n) < 1$ and $\underline{p}_n < g(\epsilon_n, \underline{p}_n)^2$.
Define
$$
B_n(\epsilon) = \{q \in \mathcal{Q}_n : \mathrm{KL}(q_0, q) \le \epsilon^2,\; V(q_0, q) \le \epsilon^2\}.
$$
Then
$$
\Pi_n(B_n(\epsilon_n)) \gtrsim \Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \times \Gamma\Big(\sum_{i=1}^{m_n} \alpha_i\Big) \exp\Big(-m_n \log\big(1/\big(g(\tilde{\epsilon}_n, \underline{p}_n)^2 - \underline{p}_n\big)\big) - m_n \log(1/\alpha)\Big).
$$
Here $\tilde{\epsilon}_n = \epsilon_n/\sqrt{3C}$, with a constant $C > 0$ not depending on $n$.

Proof. Define
$$
\tilde{B}_n(\epsilon) = \Big\{(\lambda, p) : \lambda \in [\underline{\Lambda}_n, \overline{\Lambda}_n],\; \min_{1 \le i \le m_n} p_i \ge \underline{p}_n,\; \operatorname{supp} p \subseteq \{1, \ldots, m_n\},\; \mathrm{KL}(p_0, p) \le \epsilon^2,\; V(p_0, p) \le \epsilon^2,\; |\lambda_0 - \lambda| \le \epsilon\Big\}.
$$
For all $n$ large enough and $\epsilon$ small, we have $\{\lambda : |\lambda_0 - \lambda| \le \epsilon\} \subseteq [\underline{\Lambda}_n, \overline{\Lambda}_n]$. Then, by the inequalities in Lemma 5, $\tilde{B}_n(\epsilon) \subset B_n(\sqrt{3C}\,\epsilon)$, with a constant $C$ that can be taken the same for all large enough $n$; see the arguments in Section 4.2 in Gugushvili et al. (2015). Hence, using the a priori independence of $p$ and $\lambda$,
$$
\Pi_n(B_n(\epsilon_n)) \ge \Pi_n(\tilde{B}_n(\tilde{\epsilon}_n)) = \Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \times U_n,
$$
where
$$
U_n = \Pi_n\Big(\Big\{p : \mathrm{KL}(p_0, p) \le \tilde{\epsilon}_n^2,\; V(p_0, p) \le \tilde{\epsilon}_n^2,\; \min_{1 \le i \le m_n} p_i \ge \underline{p}_n\Big\}\Big).
$$
Furthermore, by Lemma 6 from Appendix A, we have
$$
U_n \ge \Pi_n\Big(\Big\{p : \sum_{i=1}^{m_n} |p_{0i} - p_i| \le 2g(\tilde{\epsilon}_n, \underline{p}_n),\; \min_{1 \le i \le m_n} p_i \ge \underline{p}_n\Big\}\Big).
$$
The statement of the lemma now follows upon applying Lemma 7 from Appendix A with $\eta = \underline{p}_n$ and $\epsilon = g(\tilde{\epsilon}_n, \underline{p}_n)$. ▪
5.2.4 Using bounds in Theorem 2
We take
$$
m_n \asymp \log n, \qquad \epsilon_n \asymp \frac{\log^\gamma n}{\sqrt{n}}, \qquad \underline{p}_n \asymp \frac{1}{n^2},
$$
with appropriately selected proportionality constants, and verify the conditions in Theorem 2. Firstly, condition (8) is trivially satisfied. Therefore, we can invoke Lemma 2 and conclude that the entropy is upper bounded by a multiple of $\log^{2\gamma} n$, since $\gamma > 1$. Now $\log^{2\gamma} n \lesssim n\epsilon_n^2$, and this verifies the entropy condition in Theorem 2.

By Lemma 3, for a suitable choice of the constant $C$ the remaining prior mass condition is likewise satisfied.

Finally, for the prior mass condition in a small Kullback-Leibler neighbourhood to hold, by Lemma 4 we need the term
$$
\Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \exp\Big(-m_n \log\big(1/\big(g(\tilde{\epsilon}_n, \underline{p}_n)^2 - \underline{p}_n\big)\big) - m_n \log(1/\alpha)\Big)
$$
to be lower bounded by $\exp(-Cn\epsilon_n^2)$ for some large enough $C > 0$. Now, $\Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n) \asymp \tilde{\epsilon}_n$. Taking the logarithm in the above display, note that by our conditions
$$
\log\big(\Pi_n(|\lambda_0 - \lambda| \le \tilde{\epsilon}_n)\big) \gtrsim \log(\tilde{\epsilon}_n) \gtrsim -n\epsilon_n^2.
$$
Likewise,
$$
m_n \log\big(1/\big(g(\tilde{\epsilon}_n, \underline{p}_n)^2 - \underline{p}_n\big)\big) + m_n \log(1/\alpha) \lesssim n\epsilon_n^2,
$$
so that the prior mass condition holds.

Thus we have verified all the conditions of Theorem 2, and the posterior for $q$ contracts around $q_0$ at the rate $\epsilon_n \asymp \log^\gamma n/\sqrt{n}$ in the Hellinger metric. Since $\|q - q_0\|_1 \le 2h(q, q_0)$, Proposition 1 transfers this rate to $\nu$, which yields Theorem 1.
6 OUTLOOK
In this paper we introduced a nonparametric Bayesian approach to estimation of the Lévy measure $\nu$ of a discretely observed CPP, when the support of $\nu$ is a subset of $\mathbb{N}$. We constructed an algorithm for sampling from the posterior distribution of $\nu$, and showed that in practice our procedure performs well and measures up to a benchmark frequentist plug-in approach from Buchmann and Grübel (2004). Although computationally more demanding and slower than the latter, our method has the added benefit of providing uncertainty quantification in parameter estimates through the spread of the posterior distribution. On the theoretical side, we showed that our procedure is consistent, in that asymptotically, as the sample size $n \to \infty$, the posterior concentrates around the "true," data-generating distribution. The corresponding posterior contraction rate is the (nearly) optimal rate $\log^\gamma n/\sqrt{n}$ for an arbitrary $\gamma > 1$: it matches the parametric rate $1/\sqrt{n}$ up to a practically insignificant logarithmic factor.
Among several generalizations of our results, the one that looks most promising is the extension of our methodology to CPPs with jump size distributions supported on the set of integers $\mathbb{Z}$. The corresponding model has garnered substantial interest in financial applications (see Barndorff-Nielsen, Pollard, & Shephard, 2012). We leave this extension as a topic of possible future research.
ACKNOWLEDGEMENTS
The research leading to the results in this paper has received funding from the European Research Council under ERC Grant Agreement 320637, from the Deutsche Forschungsgemeinschaft (DFG,
German Research Foundation) – 314838170, GRK 2297 MathCoRe, and from the Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1294 "Data Assimilation." The authors would like to thank the Associate Editor and the referee for their detailed and constructive comments on the paper.
ORCID
Shota Gugushvili https://orcid.org/0000-0002-6963-295X
Ester Mariucci https://orcid.org/0000-0003-0409-4131
Frank van der Meulen https://orcid.org/0000-0001-7246-8612
REFERENCES
Barndorff-Nielsen, O. E., Pollard, D. G., & Shephard, N. (2012). Integer-valued Lévy processes and low latency financial econometrics. Quantitative Finance, 12(4), 587–605.
Belomestny, D., Comte, F., Genon-Catalot, V., Masuda, H., & Reiß, M. (2015). Lévy matters IV: Estimation for discretely observed Lévy processes (Vol. 2128). Cham, Switzerland: Springer.
Belomestny, D., Gugushvili, S., Schauer, M., & Spreij, P. (2019). Nonparametric Bayesian inference for Gamma-type Lévy subordinators. Communications in Mathematical Sciences, 17(3), 781–816.
Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. (2017). Julia: A fresh approach to numerical computing. SIAM
Review, 59(1), 65–98.
Buchmann, B., & Grübel, R. (2003). Decompounding: An estimation problem for Poisson random sums. The
Annals of Statistics, 31(4), 1054–1074.
Buchmann, B., & Grübel, R. (2004). Decompounding Poisson random sums: Recursively truncated estimates in the discrete case. Annals of the Institute of Statistical Mathematics, 56(4), 743–756.
Coca, A. J. (2018a). Adaptive nonparametric estimation for compound Poisson processes robust to the discrete-observation scheme. arXiv e-prints. Retrieved from https://arxiv.org/abs/1803.09849
Coca, A. J. (2018b). Efficient nonparametric inference for discretely observed compound Poisson processes.
Probability Theory and Related Fields, 170(1), 475–523.
Comte, F., & Genon-Catalot, V. (2011). Estimation for Lévy processes from high frequency data within a long time interval. The Annals of Statistics, 39(2), 803–837.
Devroye, L. (1986). Non-uniform random variate generation. New York, NY: Springer-Verlag.
Duval, C., & Hoffmann, M. (2011). Statistical inference across time scales. Electronic Journal of Statistics, 5, 2004–2030.
Duval, C., & Mariucci, E. (2017). Compound Poisson approximation to estimate the Lévy density. arXiv e-prints. Retrieved from https://arxiv.org/abs/1702.08787
Embrechts, P., Mikosch, T., & Klüppelberg, C. (1997). Modelling extremal events: For insurance and finance. Berlin, Germany: Springer-Verlag.
Evans, D. A. (1953). Experimental evidence concerning contagious distributions in ecology. Biometrika, 40, 186–211.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of
the American Statistical Association, 85(410), 398–409.
Ghosal, S., Ghosh, J. K., & van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of
Statistics, 28(2), 500–531.
Ghosal, S., & van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. The
Annals of Statistics, 35(2), 697–723.
Ghosal, S., & van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge Series in Statistical and Probabilistic Mathematics (Vol. 44). Cambridge, UK: Cambridge University Press.
Gugushvili, S., Mariucci, E., & van der Meulen, F. (2019). Bdd: Julia code for Bayesian decompounding of discrete distributions.
Gugushvili, S., van der Meulen, F., & Spreij, P. (2015). Nonparametric Bayesian inference for multidimensional compound Poisson processes. Modern Stochastics: Theory & Applications, 2(1), 1–15.
Gugushvili, S., van der Meulen, F., & Spreij, P. (2018). A non-parametric Bayesian approach to decompounding from high frequency data. Statistical Inference for Stochastic Processes, 21(1), 53–79.
Kappus, J. (2014). Adaptive nonparametric estimation for Lévy processes observed at low frequency. Stochastic
Processes and their Applications, 124(1), 730–758.
Lindo, A., Zuyev, S., & Sagitov, S. (2018). Nonparametric estimation for compound Poisson process via variational analysis on measures. Statistics and Computing, 28(3), 563–577.
Müller, P., Quintana, F. A., Jara, A., & Hanson, T. (2015). Bayesian nonparametric data analysis. Springer Series in Statistics. Cham, Switzerland: Springer.
Neumann, M. H., & Reiß, M. (2009). Nonparametric estimation for Lévy processes from low-frequency observations. Bernoulli, 15(1), 223–248.
Neyman, J. (1939). On a new class of “contagious” distributions, applicable in entomology and bacteriology. The
Annals of Mathematical Statistics, 10, 35–57.
Nickl, R., & Reiß, M. (2012). A Donsker theorem for Lévy measures. Journal of Functional Analysis, 263(10), 3306–3332.
Nickl, R., & Söhl, J. (2017). Bernstein-von Mises theorems for statistical inverse problems II: Compound Poisson processes. arXiv e-prints. Retrieved from https://arxiv.org/abs/1709.07752
Panjer, H. H. (1981). Recursive evaluation of a family of compound distributions. ASTIN Bulletin, 12(1), 22–26.
Pollard, D. (2002). A user's guide to measure theoretic probability (Vol. 8). Cambridge, UK: Cambridge University Press.
Pya Arnqvist, N., Voinov, V., & Voinov, Y. (2018). nilde: Nonnegative integer solutions of linear Diophantine equations with applications. R package version 1.1-2. Retrieved from https://CRAN.R-project.org/package=nilde
Sato, K.-I. (2013). Lévy processes and infinitely divisible distributions (Vol. 68, 2nd revised ed.). Cambridge, UK: Cambridge University Press.
Shreve, S. E. (2004). Stochastic calculus for finance II: Continuous-time models. New York, NY: Springer.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–550. With discussion and with a reply by the authors.
Trabs, M. (2015). Information bounds for inverse problems with application to deconvolution and Lévy models. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 51(4), 1620–1650.
van der Pas, S., Szabó, B., & van der Vaart, A. (2017). Uncertainty quantification for the horseshoe (with discussion).
Bayesian Analysis, 12(4), 1221–1274.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics (Vol. 3). Cambridge, UK: Cambridge University Press.
van Es, B., Gugushvili, S., & Spreij, P. (2007). A kernel type nonparametric density estimator for decompounding.
Bernoulli, 13(3), 672–694.
Voinov, V. G., & Nikulin, M. S. (1997). On a subset sum algorithm and its probabilistic and other applications. In N. Balakrishnan (Ed.), Advances in combinatorial methods and applications to probability and statistics (pp. 153–163). Boston, MA: Birkhäuser Boston.
von Bortkewitsch, L. (1898). Das Gesetz der kleinen Zahlen. Leipzig, Germany: B.G. Teubner.
Wasserman, L. (2004). All of statistics. A concise course in statistical inference. New York, NY: Springer.
Zhang, H., Liu, Y., & Li, B. (2014). Notes on discrete compound Poisson model with applications to risk theory.
Insurance: Mathematics & Economics, 59, 325–336.
How to cite this article: Gugushvili S, Mariucci E, van der Meulen F. Decompounding discrete distributions: A nonparametric Bayesian approach. Scand J Statist. 2020;47(2):464–492. https://doi.org/10.1111/sjos.12413