A. L. RUKHIN (Baltimore)

INFORMATION-TYPE DIVERGENCE

WHEN THE LIKELIHOOD RATIOS ARE BOUNDED

Abstract. The so-called φ-divergence is an important characteristic describing “dissimilarity” of two probability distributions. Many traditional measures of separation used in mathematical statistics and information theory, some of which are mentioned in the note, correspond to particular choices of this divergence. An upper bound on a φ-divergence between two probability distributions is derived when the likelihood ratio is bounded.

The usefulness of this sharp bound is illustrated by several examples of familiar φ-divergences. An extension of this inequality to φ-divergences between a finite number of probability distributions with pairwise bounded likelihood ratios is also given.

1991 Mathematics Subject Classification: 60E15, 94A17.

Key words and phrases: convexity, information measures, likelihood ratio, multiple decisions.

1. Information-type divergences. Let φ be a convex function defined on the positive half-line, and let F and G be two different probability distributions such that F is absolutely continuous with respect to G. The φ-divergence between F and G is defined as
$$φ(F|G) = \int φ\!\left(\frac{dF}{dG}\right)dG = E_G\,φ\!\left(\frac{dF}{dG}\right)$$
(see, for example, Vajda, 1989). Clearly
$$φ(1) = φ(F|F) ≤ φ(F|G).$$

This inequality and the fact that many familiar separation characteristics used in mathematical statistics and information theory correspond to particular choices of φ justify the interest in φ-divergences.
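To get a concrete feel for this definition, here is a minimal numerical sketch for two discrete distributions given as probability vectors on a common finite support; the distributions and the convex function φ used here are illustrative choices only, not taken from the note.

```python
import numpy as np

def phi_divergence(phi, f, g):
    """phi(F|G) = sum_x phi(f(x)/g(x)) * g(x) for discrete distributions
    f and g given as probability vectors on the same support."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.sum(g * phi(f / g)))

# Illustrative convex phi with a minimum at u = 1, here phi(u) = (u - 1)**2.
phi2 = lambda u: (u - 1.0) ** 2

F = np.array([0.2, 0.5, 0.3])
G = np.array([0.3, 0.4, 0.3])
print(phi_divergence(phi2, F, G))   # positive: phi(F|G) >= phi(1) = 0
print(phi_divergence(phi2, F, F))   # exactly phi(1) = 0
```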

Out of these choices perhaps the most important is $φ_I(u) = -\log u + u - 1$, in which case
$$φ_I(F|G) = -E_G\log\!\left(\frac{dF}{dG}\right) = K(G,F)$$

is the classical information number. Another information number K(F,G) corresponds to the function $φ(u) = u\log u - u + 1$, and the sum of these information numbers (the so-called J-divergence, see Cover and Thomas, 1991) is determined by $φ_J(u) = (u-1)\log u$.

The probability of correct discrimination between F and G in the Bayesian setting is another example of φ-divergence. Indeed, let λ be the prior probability of distribution F, so that 1 − λ is the prior probability of G. Then the probability of the correct decision is
$$λ\int_{λ\,dF ≥ (1-λ)\,dG} dF + (1-λ)\int_{λ\,dF < (1-λ)\,dG} dG = \int \max[λ\,dF,\,(1-λ)\,dG] = φ_C(F|G),$$
which is another version of φ-divergence with $φ_C(u) = \max[λu,\,1-λ]$.

A further classical example of φ-divergence is provided by χ²-separation with $φ(u) = (u-1)^2$, or by more general functions of the form
$$φ_r(u) = \begin{cases}|1-u^r|^{1/r}, & 0 < r < 1,\\ |1-u|^r, & r ≥ 1.\end{cases}$$

For a fixed number w, 0 < w < 1, the φ-divergence with $φ(u) = -u/(wu+1-w)$ or, somewhat more conveniently, with
$$φ_M(u) = u\left(\frac{1}{1-w} - \frac{1}{wu+1-w}\right), \quad u > 0,$$
appears in the statistical estimation problems of the mixture parameter and of the change-point parameter (Rukhin, 1996).

In this note the interest is in obtaining an upper bound on a φ-divergence when the likelihood ratio, dF/dG, is bounded. Intuitively it is clear that the closer the probability distributions F and G are to each other, the smaller any φ-divergence must be. This intuition is confirmed by the inequality (2) in the next section.

One of the motivations for the study of families with bounded likelihood ratios is statistical inference with finite memory (see Cover, Freedman and Hellman, 1976) or recurrent multiple decision-making (Rukhin, 1994). In the latter problem a recursive procedure can be consistent only if the distribution of the likelihood ratio is supported by the whole positive half-line. It is demonstrated by Rukhin (1993) that in the bounded likelihood ratio situation the probability of the correct decision is bounded from above by an explicitly given constant, which is strictly smaller than one. Theorem 2.1 generalizes this result.


Another reason for interest in distributions with bounded likelihood ratio is importance sampling in Monte Carlo methods (see Fishman (1996), Sec. 4.1). This technique, designed to reduce the variance of an estimate of an integral, replaces sampling from the distribution F by sampling from a suitably chosen G under condition (1). A similar situation appears in the rejection method for generating non-uniform random variables (cf. Devroye (1986), II.3). The inequality (2) gives a bound on the possible gain (or loss) obtained from such a replacement.
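To illustrate how the bounded-ratio condition (1) enters importance sampling, here is a minimal sketch (the densities below are an assumed toy example, not from the note): an expectation under F is estimated from samples drawn from G and reweighted by dF/dG. With this choice the ratio is bounded, with b_min = 0 and b_max = 2, and the empirical second moment of the weights stays below 1 + (b_max − 1)(1 − b_min) = 2, in agreement with inequality (3) below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: F has density 2x on (0, 1), G is uniform on (0, 1),
# so the likelihood ratio dF/dG(x) = 2x lies between b_min = 0 and b_max = 2.
weight = lambda x: 2.0 * x        # dF/dG
h = lambda x: x                   # integrand; the true value E_F[h(X)] = 2/3

n = 100_000
x = rng.uniform(size=n)                       # sample from G instead of F
estimate = np.mean(h(x) * weight(x))          # importance-sampling estimate of E_F[h]
second_moment = np.mean(weight(x) ** 2)       # estimates E_G[(dF/dG)^2] = 4/3

print(estimate)        # close to 2/3
print(second_moment)   # close to 4/3, below 1 + (b_max - 1)(1 - b_min) = 2
```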

2. A bound for φ-divergence. Suppose that with G-probability one
(1) $b_{\min} ≤ \dfrac{dF}{dG} ≤ b_{\max}$.
Then $b_{\min} < 1 < b_{\max}$.

Notice that all functions φ considered above have a minimum at u = 1 and that they are bowl-shaped, i.e. non-increasing in the interval (0, 1) and non-decreasing for u > 1. Only this condition is needed in the following theorem.

Theorem 2.1. Assume that the function φ is bowl-shaped with the minimum at u = 1. Under the condition (1),
(2) $φ(F|G) ≤ \dfrac{b_{\max}-1}{b_{\max}-b_{\min}}\,φ(b_{\min}) + \dfrac{1-b_{\min}}{b_{\max}-b_{\min}}\,φ(b_{\max})$.

Proof. Let
$$A_1 = \left\{u: \frac{dF}{dG}(u) = b_{\max}\right\} \quad\text{and}\quad A_2 = \left\{u: \frac{dF}{dG}(u) = b_{\min}\right\}.$$
If the set $(A_1 ∪ A_2)^c$ is not empty, the value of $\int φ(dF/dG)\,dG$, for fixed distribution G, can get only larger by the inclusion of the points of this set either in $A_1$ or in $A_2$. Thus for any F, under condition (1),
$$φ(F|G) ≤ \int_{A_1} φ\!\left(\frac{dF}{dG}\right)dG + \int_{A_2} φ\!\left(\frac{dF}{dG}\right)dG = φ(b_{\max})\,G(A_1) + φ(b_{\min})\,G(A_2).$$
Since
$$F(A_1) = b_{\max}G(A_1) \quad\text{and}\quad F(A_2) = b_{\min}G(A_2),$$
one obtains
$$G(A_1) = \frac{1-b_{\min}}{b_{\max}-b_{\min}} \quad\text{and}\quad G(A_2) = \frac{b_{\max}-1}{b_{\max}-b_{\min}},$$
which proves (2).
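A quick numerical check of (2) (an illustration with assumed random discrete distributions, not part of the proof): for pairs of distributions drawn at random, the φ-divergence never exceeds the right-hand side of (2) computed from the extreme values of the likelihood ratio. The functions φ₂ and φ_I below are two of the examples discussed in what follows.

```python
import numpy as np

def divergence(phi, f, g):
    """phi(F|G) for discrete probability vectors f, g on the same support."""
    return float(np.sum(g * phi(f / g)))

def upper_bound(phi, f, g):
    """Right-hand side of (2), with b_min and b_max taken as the extreme
    values of the likelihood ratio f/g."""
    r = f / g
    b_min, b_max = r.min(), r.max()
    return ((b_max - 1) * phi(b_min) + (1 - b_min) * phi(b_max)) / (b_max - b_min)

phi2 = lambda u: (u - 1.0) ** 2                  # chi-square type separation
phiI = lambda u: -np.log(u) + u - 1.0            # gives K(G, F)

rng = np.random.default_rng(1)
for _ in range(1000):
    f = rng.dirichlet(np.ones(5))
    g = rng.dirichlet(np.ones(5))
    for phi in (phi2, phiI):
        assert divergence(phi, f, g) <= upper_bound(phi, f, g) + 1e-9
```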


Let us illustrate this theorem by the particular versions of φ from Section 1.

1. For $φ_I(u) = -\log u + u - 1$, Theorem 2.1 shows that
$$K(G,F) ≤ -\frac{(1-b_{\min})\log b_{\max} + (b_{\max}-1)\log b_{\min}}{b_{\max}-b_{\min}}.$$
Similarly,
$$K(G,F) + K(F,G) ≤ \frac{(b_{\max}-1)(1-b_{\min})}{b_{\max}-b_{\min}}\,\log\frac{b_{\max}}{b_{\min}}.$$

2. The function $φ_C(u) = \max[λu,\,1-λ]$ has a (non-unique) minimum at u = 1 if λ ≤ 1/2. The inequality (2) shows that in this case
$$φ_C(F|G) ≤ \frac{(1-b_{\min})\max[λb_{\max},\,1-λ] + (b_{\max}-1)(1-λ)}{b_{\max}-b_{\min}},$$
which is equivalent to the inequality (3.3) in Rukhin (1993).

3. For $φ_2(u) = (u-1)^2$, one concludes from Theorem 2.1 that
(3) $E_G\left(\dfrac{dF}{dG}\right)^{2} ≤ 1 + (b_{\max}-1)(1-b_{\min})$.

For two discrete distributions with probabilities $p_1,\ldots,p_n$ and $q_1,\ldots,q_n$ such that $b_{\min} ≤ p_i/q_i ≤ b_{\max}$, this inequality means that
$$\sum \frac{p_i^2}{q_i} ≤ 1 + (b_{\max}-1)(1-b_{\min}).$$
For arbitrary non-negative numbers $α_1,\ldots,α_n$ and $β_1,\ldots,β_n$ put $q_i = β_i^2/\sum β_k^2$ and $p_i = α_iβ_i/\sum_k α_kβ_k$. Then
$$\frac{\sum α_i^2\,\sum β_i^2}{\left(\sum α_iβ_i\right)^2} ≤ 1 + (b_{\max}-1)(1-b_{\min}),$$
where
$$b_{\max} = \max_i \frac{α_i}{β_i}\cdot\frac{\sum β_i^2}{\sum α_iβ_i}, \qquad b_{\min} = \min_i \frac{α_i}{β_i}\cdot\frac{\sum β_i^2}{\sum α_iβ_i}.$$

By maximizing the right-hand side of (3) when $b_{\max}/b_{\min} = B$, one obtains
(4) $E_G\left(\dfrac{dF}{dG}\right)^{2} ≤ \dfrac{(B+1)^2}{4B}$.
For discrete distributions, as above, this inequality reduces to a well known inequality
$$\frac{\sum α_i^2\,\sum β_i^2}{\left(\sum α_iβ_i\right)^2} ≤ \frac{(B+1)^2}{4B}$$
with $B = \max_i(α_i/β_i)/\min_i(α_i/β_i)$ (see Pólya and Szegő, 1972).


The latter inequality has been used by Tukey (1948) and Bloch and Moses (1988) in the problem of statistical estimation of the common mean by weighted means statistics with measurements of different precision. Both of these papers comment on the numerical accuracy of the bound (4) (which is weaker than (2)).
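The discrete form of (4) is easy to test numerically; the sketch below (an illustration with randomly generated positive sequences, not from the note) checks the Pólya–Szegő bound with B = max_i(α_i/β_i)/min_i(α_i/β_i).

```python
import numpy as np

rng = np.random.default_rng(2)

# Check sum(a^2) * sum(b^2) / (sum(a*b))^2 <= (B + 1)^2 / (4B)
# with B = max_i(a_i/b_i) / min_i(a_i/b_i), for random positive sequences.
for _ in range(1000):
    a = rng.uniform(0.1, 5.0, size=10)
    b = rng.uniform(0.1, 5.0, size=10)
    lhs = a.dot(a) * b.dot(b) / a.dot(b) ** 2
    ratio = a / b
    B = ratio.max() / ratio.min()
    assert lhs <= (B + 1) ** 2 / (4 * B) + 1e-9
```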

4. For $φ_M$, Theorem 2.1 implies
$$\int \frac{dF\,dG}{w\,dF + (1-w)\,dG} ≥ \frac{1 - w + wb_{\min}b_{\max}}{(wb_{\min}+1-w)(wb_{\max}+1-w)}.$$
The example of two Bernoulli distributions with probabilities of success $(1-b_{\min})/(b_{\max}-b_{\min})$ and $b_{\max}(1-b_{\min})/(b_{\max}-b_{\min})$, respectively, shows that the inequality (2) is sharp. Its sharpness can also be seen from the limiting cases when w = 0 or w = 1.

As another example, let F be the exponential distribution with mean 1/ω and G the exponential distribution with mean 1. Then
$$\frac{dF}{dG}(x) = ω\exp\{(1-ω)x\}, \quad x > 0,$$
so that for ω > 1, $b_{\max} = ω$ and $b_{\min} = 0$. Therefore for any bowl-shaped function φ with minimum at u = 1, for ω > 1 we have
$$φ(F|G) = \int_0^1 φ(ωu^{ω-1})\,du ≤ \frac{(ω-1)φ(0) + φ(ω)}{ω}.$$
When ω ↓ 1, this inequality reduces to equality.
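A small numerical check of this example with the illustrative choice φ(u) = (u − 1)²: the integral $\int_0^1 φ(ωu^{ω-1})\,du$ is approximated by a midpoint rule and compared with the bound ((ω − 1)φ(0) + φ(ω))/ω.

```python
import numpy as np

phi = lambda u: (u - 1.0) ** 2     # an illustrative bowl-shaped choice

n = 200_000
u = (np.arange(n) + 0.5) / n       # midpoints of a uniform grid on (0, 1)

for omega in (1.5, 2.0, 5.0):
    divergence = np.mean(phi(omega * u ** (omega - 1.0)))  # int_0^1 phi(omega*u^(omega-1)) du
    bound = ((omega - 1.0) * phi(0.0) + phi(omega)) / omega
    print(omega, divergence, bound)   # e.g. for omega = 2: 1/3 <= 1
```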

3. Information divergence for several probability distributions. In this section we derive an inequality similar to the one in Theorem 2.1 for the information divergence between several probability distributions. This divergence is defined in the following way (see Györfi and Nemetz, 1978).

Let $φ(u_1,\ldots,u_m)$ be a non-negative convex function defined over the positive orthant of m-dimensional Euclidean space. Assume that φ is a homogeneous function, i.e. for all positive u,
$$φ(uu_1,\ldots,uu_m) = uφ(u_1,\ldots,u_m).$$
Let $(\mathcal X, \mathcal A, µ)$ be a measure space, and let different probability distributions $P_1,\ldots,P_m$ defined on $\mathcal A$ be absolutely continuous with respect to µ. The φ-divergence between $P_1,\ldots,P_m$ is defined as
$$φ(P_1,\ldots,P_m) = \int_{\mathcal X} φ\!\left(\frac{dP_1}{dµ},\ldots,\frac{dP_m}{dµ}\right)dµ.$$
The homogeneity property of φ guarantees independence of $φ(P_1,\ldots,P_m)$ from the dominating measure µ. When m = 2, this information divergence reduces to the one in Section 1 with the function φ(u) there equal to φ(u, 1).


The examples of φ-divergence include the error probability in a multiple decision problem for $φ_C(u_1,\ldots,u_m) = \max_i[λ_iu_i]$ with probabilities $λ_1,\ldots,λ_m$; the analogues of Kullback–Leibler divergences,
$$φ_I(u_1,\ldots,u_m) = \sum_{i,k}\left(u_i - u_k - u_k\log\frac{u_i}{u_k}\right), \qquad φ_J(u_1,\ldots,u_m) = \sum_{i,k}(u_i - u_k)\log\frac{u_i}{u_k};$$
and Hellinger-type transforms with $φ(u_1,\ldots,u_m) = u_1^{α_1}u_2^{α_2}\cdots u_m^{α_m}$ for $α_1 + \ldots + α_m = 1$.
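To make the definition concrete, the following sketch computes the Kullback–Leibler-type divergence φ_I(P_1,…,P_m) for discrete distributions on a common finite support, taking µ to be the counting measure; the three distributions are arbitrary illustrative choices.

```python
import numpy as np

def phi_I(u):
    """phi_I(u_1,...,u_m) = sum_{i,k} (u_i - u_k - u_k * log(u_i / u_k))."""
    u = np.asarray(u, dtype=float)
    ui, uk = np.meshgrid(u, u, indexing="ij")
    return float(np.sum(ui - uk - uk * np.log(ui / uk)))

def divergence_I(P):
    """phi_I(P_1,...,P_m) with mu the counting measure: sum over the support
    points x of phi_I(p_1(x),...,p_m(x)), where the rows of P are the p_i."""
    P = np.asarray(P, dtype=float)
    return sum(phi_I(P[:, x]) for x in range(P.shape[1]))

P = np.array([[0.20, 0.50, 0.30],
              [0.30, 0.40, 0.30],
              [0.25, 0.45, 0.30]])
print(divergence_I(P))   # a sum of pairwise Kullback-Leibler divergences, >= 0
```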

Assume now that the ratios of the densities $p_i = dP_i/dµ$, $i = 1,\ldots,m$, are bounded, i.e.
(5) $b_{ki} ≤ \dfrac{p_k(x)}{p_i(x)} ≤ \dfrac{1}{b_{ik}}$  µ-a.s.
Moreover, assume that $b_{ik}$ are the largest (positive) quantities satisfying (5). Then $b_{ki}b_{il} < b_{kl}$ for $i ≠ k, l$. In particular, $b_{ki}b_{ik} < 1$ for $i ≠ k$.

The set $\mathcal P$ of all m-tuples $(P_1,\ldots,P_m)$ of probability distributions satisfying this condition is convex and closed under weak convergence. Since the functional $φ(P_1,\ldots,P_m)$ is convex, its maximum is attained on the set $\mathrm{ext}(\mathcal P)$ of the extreme points of $\mathcal P$.

The next result gives a necessary condition for $(P_1^0,\ldots,P_m^0)$ to belong to $\mathrm{ext}(\mathcal P)$.

Proposition 3.1. If $(P_1^0,\ldots,P_m^0)$ is an extreme point of $\mathcal P$, then for any k,
(6) $µ\left\{\max_{i:i≠k} b_{ki}p_i^0(x) < p_k^0(x) < \min_{i:i≠k}\dfrac{p_i^0(x)}{b_{ik}}\right\} = 0.$

Proof. We show first of all that the conditions
$$\frac{p_i(x)}{b_{ik}} = \min_{l:l≠k}\frac{p_l(x)}{b_{lk}} \quad\text{and}\quad b_{ki}p_i(x) = \max_{l:l≠k} b_{kl}p_l(x)$$
are equivalent. Indeed, according to the first condition, for any $l ≠ k$,
$$b_{kl}p_l(x) ≥ b_{kl}b_{lk}\,\frac{p_i(x)}{b_{ik}},$$
so that
$$\max_{l:l≠k} b_{kl}p_l(x) ≥ \max_{l:l≠k} b_{kl}b_{lk}\,\frac{p_i(x)}{b_{ik}} ≥ \frac{\max_{l:l≠k} b_{kl}b_{lk}}{b_{ki}b_{ik}}\,\max_{l:l≠k} b_{kl}p_l(x).$$


It follows that
(7) $\max_{l:l≠k} b_{kl}b_{lk} = b_{ki}b_{ik}$
and that
$$\max_{l:l≠k} b_{kl}p_l(x) = b_{ki}p_i(x).$$

Suppose now that for some $i ≠ k$, (6) does not hold for $(P_1,\ldots,P_m) ∈ \mathcal P$, i.e. on a set of µ-positive measure,
(8) $\max_{l:l≠k} b_{kl}p_l(x) = b_{ki}p_i(x) < p_k(x) < \dfrac{p_i(x)}{b_{ik}} = \min_{l:l≠k}\dfrac{p_l(x)}{b_{lk}}$.
Then for sufficiently small positive w, the µ-measure of the set

$$\left\{b_{ki} + w\left(\frac{1}{b_{ik}} - b_{ki}\right) ≤ \frac{p_k(x)}{p_i(x)} ≤ \frac{1}{b_{ik}} - w\left(\frac{1}{b_{ik}} - b_{ki}\right)\right\}$$
is positive. For any number a such that $b_{ki} < a < 1/b_{ik}$, this set is contained in the region
$$C = \left\{b_{ki} + w(a - b_{ki}) ≤ \frac{p_k(x)}{p_i(x)} ≤ \frac{1}{b_{ik}} - w\left(\frac{1}{b_{ik}} - a\right)\right\}.$$
With $a = P_k(C)/P_i(C)$, the set C must have µ-positive measure.

For $x ∈ C$ put
$$r(x) = \frac{p_k(x) - wa\,p_i(x)}{1-w}, \qquad q(x) = a\,p_i(x),$$
and for $x ∉ C$,
$$r(x) = q(x) = p_k(x).$$
Then for all x,
$$p_k(x) = wq(x) + (1-w)r(x),$$

and q and r are probability densities. We now show that $(P_1,\ldots,Q,\ldots,P_m) ∈ \mathcal P$ and $(P_1,\ldots,R,\ldots,P_m) ∈ \mathcal P$. Indeed, for $x ∈ C$,
$$b_{ki} ≤ \frac{q(x)}{p_i(x)} = a ≤ \frac{1}{b_{ik}},$$
and these inequalities trivially hold for $x ∉ C$. Also,
$$b_{ki} ≤ \frac{r(x)}{p_i(x)} = \frac{p_k(x)/p_i(x) - wa}{1-w} ≤ \frac{1}{b_{ik}}$$
for $x ∈ C$, by the definition of C. Because of (8), for any $l ≠ k$,
$$b_{kl} ≤ \frac{r(x) ∧ q(x)}{p_l(x)} ≤ \frac{r(x) ∨ q(x)}{p_l(x)} ≤ \frac{1}{b_{lk}}.$$
Therefore $(P_1,\ldots,P_m) ∉ \mathrm{ext}(\mathcal P)$, which concludes the proof.


According to this proposition, if $(P_1^0,\ldots,P_m^0) ∈ \mathrm{ext}(\mathcal P)$, then for any k there exists i, $i ≠ k$, which can be found from (7), such that the sets
$$A_k^+ = \left\{p_k^0(x) = \frac{p_i^0(x)}{b_{ik}} = \min_{l:l≠k}\frac{p_l^0(x)}{b_{lk}}\right\} \quad\text{and}\quad A_k^- = \left\{p_k^0(x) = p_i^0(x)\,b_{ki} = \max_{l:l≠k} p_l^0(x)\,b_{kl}\right\}$$
form a partition of $\mathcal X$. Clearly
$$P_k^0(A_k^+) = \frac{P_i^0(A_k^+)}{b_{ik}} ≤ \min_{l:l≠k}\frac{P_l^0(A_k^+)}{b_{lk}}, \qquad P_k^0(A_k^-) = P_i^0(A_k^-)\,b_{ki} ≥ \max_{l:l≠k} P_l^0(A_k^-)\,b_{kl}.$$
As in Section 2,
$$P_k^0(A_k^+) = \frac{1-b_{ki}}{1-b_{ik}b_{ki}}, \qquad P_k^0(A_k^-) = \frac{b_{ki}(1-b_{ik})}{1-b_{ik}b_{ki}}.$$

If $φ(u_1,\ldots,u_m)$ attains its minimum at $(1,\ldots,1)$ then
(9)
$$\begin{aligned}
φ(P_1,\ldots,P_m) &≤ \max_{(P_1^0,\ldots,P_m^0)∈\mathrm{ext}(\mathcal P)} φ(P_1^0,\ldots,P_m^0)\\
&= \max_{(P_1^0,\ldots,P_m^0)∈\mathrm{ext}(\mathcal P)} \int_{\mathcal X} φ(p_1^0,\ldots,p_m^0)\,dµ\\
&≤ \max_k \max_{(P_1^0,\ldots,P_m^0)∈\mathrm{ext}(\mathcal P)} \Bigl[\int_{A_k^+} φ(p_1^0,\ldots,p_m^0)\,dµ + \int_{A_k^-} φ(p_1^0,\ldots,p_m^0)\,dµ\Bigr]\\
&≤ \max_k \max_{(P_1^0,\ldots,P_m^0)∈\mathrm{ext}(\mathcal P)} \Bigl[φ(b_{1k},\ldots,1,\ldots,b_{mk})\,P_k^0(A_k^+) + φ\Bigl(\frac{1}{b_{k1}},\ldots,1,\ldots,\frac{1}{b_{km}}\Bigr)P_k^0(A_k^-)\Bigr]\\
&= \max_k \Bigl[φ(b_{1k},\ldots,1,\ldots,b_{mk})\,\frac{1-b_{ki}}{1-b_{ik}b_{ki}} + φ\Bigl(\frac{1}{b_{k1}},\ldots,1,\ldots,\frac{1}{b_{km}}\Bigr)\frac{b_{ki}(1-b_{ik})}{1-b_{ik}b_{ki}}\Bigr]\\
&≤ \max_{k≠l} \Bigl[φ(b_{1k},\ldots,1,\ldots,b_{mk})\,\frac{1-b_{kl}}{1-b_{lk}b_{kl}} + φ\Bigl(\frac{1}{b_{k1}},\ldots,1,\ldots,\frac{1}{b_{km}}\Bigr)\frac{b_{kl}(1-b_{lk})}{1-b_{lk}b_{kl}}\Bigr].
\end{aligned}$$


We formulate the obtained result.

Theorem 3.2. Under the boundedness condition (5), the inequality (9) holds for any information divergence $φ(P_1,\ldots,P_m)$ such that the convex function $φ(u_1,\ldots,u_m)$ attains its minimum at $(1,\ldots,1)$.

It is easy to see that for convex functions φ the inequality (9) implies that of Theorem 2.1.

References

[1] D. A. Bloch and L. E. Moses, Nonoptimally weighted least squares, Amer. Statist. 42 (1988), 50–53.
[2] T. M. Cover, M. A. Freedman and M. E. Hellman, Optimal finite memory learning algorithms for the finite sample problem, Information and Control 30 (1976), 49–85.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[4] L. Devroye, Non-Uniform Random Variate Generation, Springer, New York, 1986.
[5] G. S. Fishman, Monte Carlo: Concepts, Algorithms and Applications, Springer, New York, 1996.
[6] L. Györfi and T. Nemetz, f-dissimilarity: A generalization of the affinity of several distributions, Ann. Inst. Statist. Math. 30 (1978), 105–113.
[7] G. Pólya and G. Szegő, Problems and Theorems in Analysis. Volume 1: Series, Integral Calculus, Theory of Functions, Springer, New York, 1972.
[8] A. L. Rukhin, Lower bound on the error probability for families with bounded likelihood ratios, Proc. Amer. Math. Soc. 119 (1993), 1307–1314.
[9] —, Recursive testing of multiple hypotheses: Consistency and efficiency of the Bayes rule, Ann. Statist. 22 (1994), 616–633.
[10] —, Change-point estimation: linear statistics and asymptotic Bayes risk, Math. Methods Statist. 5 (1996), 412–431.
[11] J. W. Tukey, Approximate weights, Ann. Math. Statist. 19 (1948), 91–92.
[12] I. Vajda, Theory of Statistical Inference and Information, Kluwer, Dordrecht, 1989.

Andrew L. Rukhin

Department of Mathematics and Statistics
University of Maryland at Baltimore County
1000 Hilltop Circle
Baltimore, Maryland 21250
U.S.A.

E-mail: rukhin@math.umbc.edu

Received on 16.9.1996;

revised version on 10.12.1996
