Anna Janicka

Probability Calculus 2019/2020 Lecture 4

1. Real-valued Random Variables

We already know how to describe the results of a random experiment in terms of a formal mathematical construction, i.e. the probability triple (Ω, F, P). More often than not, however, we are not interested in the precise outcome of the experiment, but rather in a function of the result. For example, stock market processes may be thought of in terms of random experiments (i.e. we assume that stocks rise and fall randomly, according to some rules). What an investor is interested in is, perhaps, not the outcome of the experiment as such, but rather the value of his portfolio – a function of the price movements, weighted by the stocks owned. Formally, we will be interested in a function X defined over Ω, which transforms elementary events into real numbers (for now, one-dimensional). If what interests us are real values, then natural questions are of the type: what is the probability that the value of the portfolio is not more than $1 mln? This means that we want sets of the form "X does not exceed a" to be events, i.e.

X⁻¹((−∞, a]) = {ω ∈ Ω : X(ω) ≤ a} ∈ F, which leads us to the following definition:

Definition 1. A real-valued random variable is any function X : Ω → R such that for all a ∈ R the set X⁻¹((−∞, a]) is an event, i.e. X⁻¹((−∞, a]) ∈ F.

If Ω is countable (finite or infinite) and we have F = 2^Ω, then any function X : Ω → R will be a random variable.

Let us now look at some very simple examples.

(1) We toss a coin twice. Let X denote the number of heads. We have Ω = {(H, H), (H, T ), (T, H), (T, T )}, and the function X((H, H)) = 2, X((H, T )) = X((T, H)) = 1, X((T, T )) = 0.

(2) We roll a die twice, and let X denote the sum of numbers obtained. We have Ω = {(x, y) : x, y ∈ {1, 2, 3, 4, 5, 6}}, X((x, y)) = x + y.

(3) We randomly choose a point from the interval [0, 2]. Let X denote the distance from this point to the nearest integer. We have Ω = [0, 2] and

X(ω) =

ω if ω ∈ [0, 1/2],
|ω − 1| if ω ∈ (1/2, 3/2],
2 − ω if ω ∈ (3/2, 2].
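To make these constructions concrete, here is a minimal sketch in Python (illustrative code, not part of the original lecture; the function names are ours) of the random variables from examples (1) and (3):

```python
# A minimal sketch: the random variables of examples (1) and (3)
# written as plain Python functions.

# Example (1): Omega for two coin tosses; X counts heads.
omega_coin = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def X_heads(w):
    return w.count("H")

print([X_heads(w) for w in omega_coin])  # [2, 1, 1, 0]

# Example (3): Omega = [0, 2]; X is the distance to the nearest integer.
def X_nearest_int(w):
    if w <= 0.5:
        return w
    elif w <= 1.5:
        return abs(w - 1)
    return 2 - w

print(X_nearest_int(0.3), X_nearest_int(1.2), X_nearest_int(1.9))
```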

If we have more than one random variable defined over the same space Ω, we may perform any sensible operation of our choice – such as addition, subtraction, multiplication, division (if not by zero) – and as a result, we will obtain random variables. Also, if we transform a random variable with a "decent" function f : R → R (a Borel-measurable function, i.e. one for which preimages of Borel sets are Borel sets), f(X) will also be a random variable. That is, if X and Y are random variables, then Z₁ = X · Y², Z₂ = sin(X) + e^Y, Z₃ = ln((XY² + 1)^4 + 1) etc. are also random variables.

We have hinted above that we may be interested in asking questions about the probability that the value of X will be equal to a specific value, fall in a specific interval, etc. In order to be able to ask these questions, we have to make sure that we will be able to comfortably assign probability to the values of X. Since we have probability P over (Ω, F), this assignment is quite straightforward, as illustrated by the following examples:

(1) We toss a symmetric coin twice, and let X denote the number of heads. What are the probabilities of the specific outcomes?

We will adopt a useful notation: P(X ∈ A) = P(X⁻¹(A)).

We have P(X = 0) = 1/4, P(X = 1) = 1/2 and P(X = 2) = 1/4 – i.e. we get a probability distribution on R. What would happen if we were interested in Y, the number of tails?


We would also have P(Y = 0) = 1/4, P(Y = 1) = 1/2 and P(Y = 2) = 1/4. We see that two different random variables – X and Y – have the same probability distribution.
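This computation is easy to mechanize; the sketch below (our own illustration) pushes the uniform probability on Ω forward through each variable:

```python
# A minimal sketch: the distributions of X (heads) and Y (tails) obtained by
# pushing the uniform probability on Omega forward through each variable.
from collections import Counter
from fractions import Fraction

omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def distribution(variable):
    mu = Counter()
    for w in omega:                 # symmetric coin: each outcome has P = 1/4
        mu[variable(w)] += Fraction(1, 4)
    return dict(mu)

X = lambda w: w.count("H")
Y = lambda w: w.count("T")
print(distribution(X))  # {2: Fraction(1, 4), 1: Fraction(1, 2), 0: Fraction(1, 4)}
print(distribution(Y))  # the same distribution, although Y is not X
```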

(2) From a circle of radius equal to 1 we randomly draw a point. Let X denote the distance of this point from the center of the circle. This random variable may assume only values from the range [0, 1]. For any a ∈ [0, 1], we have P(X ∈ [0, a]) = πa²/π = a², which means that we can assign probability to intervals of the type [0, a]. This probability may be extended to probability over any events on the real line.
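A quick Monte Carlo sanity check (our own illustration, not from the lecture) confirms the formula P(X ≤ a) = a²: sample points uniformly from the unit disk by rejection and count how many fall within distance a of the center.

```python
# Monte Carlo check of P(X <= a) = a^2 for the distance X from the center.
import random

def sample_disk():
    while True:  # rejection sampling from the bounding square [-1, 1]^2
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

n, a = 100_000, 0.6
hits = 0
for _ in range(n):
    x, y = sample_disk()
    if x * x + y * y <= a * a:
        hits += 1
print(hits / n, "vs", a * a)  # both close to 0.36
```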

In the above examples, we have seen how to transform the probability over the initial sample space (Ω, F ) to a probability over (R, B(R)) with the use of the random variable X. Different Ωs may lead to substantially different probability models on (R, B(R)) (such as discrete or continuous). For a given Ω, we may have different probabilities over (R, B(R)) for various random variables X (or the same probabilities for different random variables...). This leads us to the definition of a probability distribution connected with the variable X:

Definition 2. The probability distribution of a random variable X (real-valued) is the probability µX on (R, B(R)), such that µX(A) = P(X ∈ A).

We will often write X ∼ µ to denote the fact that the distribution of X is given by µ. Note that – as we have already seen – different random variables may have the same probability distribution, so from the distribution of a random variable one cannot recover the definition of the random variable. This is the reason why in most cases we will "forget" about the way a random variable was defined (i.e. forget about Ω etc.), and concentrate on the probability distribution of this random variable.

Let us now look at some common distribution types.

(1) We roll a die. Let X denote the number obtained. In this case, µX is a probability distribution concentrated on the set {1, 2, 3, 4, 5, 6}, i.e. for k = 1, 2, . . . , 6 we have

µX({k}) = 1/6 > 0,

and for any other value of k this probability would be null. Note that when describing probability over countable sample spaces, it was sufficient to describe the probability of simple events only, and the probability of all other events could easily be calculated on this basis. It is the same in the case of distributions concentrated on countable subsets of R: the description of the probability distribution may be derived from the probabilities of the outcomes with positive probability. Therefore, the formula above may be expanded to

µX(A) = (1/6) Σ_{k=1}^{6} 1_A(k), for all A ∈ B(R).

(2) The above is an example of a discrete probability distribution: we will call a distribution discrete if there exists an (at most) countable subset S of R such that µ(S) = 1.

This distribution is unequivocally defined by the probabilities of the elements of S: for any A ∈ B(R), we have

µ(A) = Σ_{s∈A} µ({s}).

This is why, when we deal with a discrete probability distribution, the distribution is described by means of a set of pairs of values and corresponding (positive) probabilities {(sᵢ, pᵢ) : sᵢ ∈ S, pᵢ > 0, Σᵢ pᵢ = 1} (typically presented in a table).
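In code, such a table is naturally a dictionary mapping values to probabilities; the sketch below (ours, using the die from example (1)) also shows how µ(A) is computed by summing over A:

```python
# A minimal sketch of the table {(s_i, p_i)}: a discrete distribution stored
# as a dict, with mu(A) computed by summing probabilities over A.
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}  # the die from example (1)

def mu(dist, A):
    """Probability of the set A under a discrete distribution."""
    return sum((p for s, p in dist.items() if s in A), Fraction(0))

print(mu(die, {2, 4, 6}))        # 1/2 -- probability of an even number
assert sum(die.values()) == 1    # the p_i sum to 1
```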

(3) Binomial distribution with parameters n and p (denoted Bin(n, p)) – this is the distribution of the number of successes in a Bernoulli Scheme with n trials and a probability of success in a single trial equal to p. As we already know, the probabilities are:

µ({k}) = P(X = k) = (n choose k) p^k (1 − p)^(n−k), k = 0, 1, 2, . . . , n.
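The formula translates directly into code; a minimal sketch (ours), with math.comb supplying the binomial coefficient:

```python
# Bin(n, p) probabilities computed straight from the formula.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
print(pmf)        # probabilities of k = 0, ..., 5 successes
print(sum(pmf))   # approximately 1.0
```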

(3)

(4) Geometric distribution with parameter p ∈ (0, 1), denoted Geom(p) – this is the distribution of the number of the trial in which a success appeared for the first time in a series of Bernoulli trials with probability of success in a single trial equal to p. This distribution is concentrated on the set {1, 2, . . .}, and we have

µX({k}) = P(X = k) = (1 − p)^(k−1) p, k = 1, 2, . . . .

Sometimes the geometric distribution is defined slightly differently (shifted – the number of failures before the first success). This latter distribution is concentrated on the set {0, 1, 2, . . .}, and may be written as

µY({k}) = P(Y = k) = (1 − p)^k p, k = 0, 1, 2, . . . . (We have Y = X − 1.)
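The two conventions differ only by the shift Y = X − 1, which a short sketch (ours) makes explicit:

```python
# Both conventions: X counts the trial of the first success (values 1, 2, ...),
# while Y = X - 1 counts the failures before it.
def geom_pmf(k, p):          # P(X = k), k = 1, 2, ...
    return (1 - p) ** (k - 1) * p

def geom_shifted_pmf(k, p):  # P(Y = k), k = 0, 1, 2, ...
    return (1 - p) ** k * p

p = 0.25
for k in range(1, 6):        # P(X = k) equals P(Y = k - 1), since Y = X - 1
    assert abs(geom_pmf(k, p) - geom_shifted_pmf(k - 1, p)) < 1e-12
print([geom_pmf(k, p) for k in range(1, 6)])
```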

(5) Poisson distribution with parameter λ > 0, denoted Poiss(λ) – this is a distribution concentrated on the set {0, 1, 2, . . .}, such that

µ({k}) = P(X = k) = (λ^k / k!) e^(−λ), k = 0, 1, 2, . . . .

As we already know, this is the limit distribution for binomial distributions Bin(n, pₙ) when n → ∞ and npₙ → λ.
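A numeric illustration of this limit (our own sketch, not from the lecture): Bin(n, λ/n) probabilities approach Poiss(λ) probabilities as n grows.

```python
# Bin(n, lambda/n) probabilities approach Poiss(lambda) probabilities.
from math import comb, exp, factorial

def poiss_pmf(k, lam):
    return lam**k / factorial(k) * exp(-lam)

lam, k = 2.0, 3
for n in (10, 100, 1000):
    p = lam / n
    binom = comb(n, k) * p**k * (1 - p)**(n - k)  # P(X = k) under Bin(n, p)
    print(n, round(binom, 6), "->", round(poiss_pmf(k, lam), 6))
```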

(6) Uniform distribution over the interval [a, b], denoted U(a, b) – this corresponds to randomly drawing a number from the interval [a, b]. From geometric models of probability, we know that for an interval [c, d] ⊆ [a, b] we have

µX([c, d]) = P(X ∈ [c, d]) = |[c, d]| / |[a, b]| = (d − c)/(b − a),

which may be extended to any set A ∈ B(R):

µX(A) = µX(A ∩ [a, b]) = P(X ∈ A ∩ [a, b]) = |A ∩ [a, b]| / |[a, b]| = |A ∩ [a, b]| / (b − a).

It will be useful to rewrite this probability in what, for now, seems a more complicated form:

µX(A) = ∫_{A∩[a,b]} 1/(b − a) dx.
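The integral form can be checked numerically; a minimal sketch (ours, assuming scipy is available):

```python
# mu_X([c, d]) for U(a, b) via numeric integration of the constant density.
from scipy.integrate import quad

a, b = 0.0, 2.0
c, d = 0.5, 1.25

def density(x):
    return 1 / (b - a) if a <= x <= b else 0.0

integral, _ = quad(density, c, d)
print(integral, "vs", (d - c) / (b - a))  # both equal 0.375
```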

This last example is different from the previous ones in that we did not describe the probability that X will assume a given value; rather, we talked about the probabilities of sets (just as in the geometric probability case). It is the simplest example of a random variable assuming uncountably many values (i.e. values from a given range). This example may be complicated slightly:

(7) Assume a wild bee leaves its nest (located at point 0 on a real line) and flies left with probability 1/3, and right with probability 2/3. If it flies left, it comes across a stick lying on [−3, −1], covered uniformly with honey. It then randomly chooses a point from this stick to sit upon (all places are "equally attractive" in terms of honey). If the wild bee flies right, it comes across a stick lying on [2, 3], also uniformly covered with honey, and also chooses a location to sit upon randomly. Using the same mechanism as in the previous example, we may write that for any Borel subset A ⊆ [−3, −1], we have

µX(A) = P(X ∈ A) = ∫_A (1/3) · 1/(−1 − (−3)) dx,

and for any Borel subset A ⊆ [2, 3], we have

µX(A) = P(X ∈ A) = ∫_A (2/3) · 1/(3 − 2) dx.

In the general case, we may therefore write

µX(A) = ∫_A g(x) dx,


where

g(x) =

1/6 if x ∈ [−3, −1),
2/3 if x ∈ [2, 3],
0 otherwise.
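The bee's density g can be written out and checked numerically; a sketch (ours, assuming scipy is available):

```python
# The density g from the bee example, checked numerically.
from scipy.integrate import quad

def g(x):
    if -3 <= x < -1:
        return 1 / 6   # (1/3) * 1/(-1 - (-3)): left stick, uniform on [-3, -1]
    if 2 <= x <= 3:
        return 2 / 3   # (2/3) * 1/(3 - 2): right stick, uniform on [2, 3]
    return 0.0

total, _ = quad(g, -4, 4, points=[-3, -1, 2, 3])  # total mass must be 1
print(total)                # approximately 1.0
print(quad(g, -3, -1)[0])   # P(the bee flies left) = 1/3
```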

In this latter case, it became slightly more evident why we may wish to use integral notation (though, because the draws from the ranges were uniform, these integrals were not yet necessary). There is no reason, however, why we should constrain ourselves only to cases where probability is distributed uniformly (where the layer of honey has the same thickness everywhere); once it is not, the integral notation – where we describe the probability connected with a given range (set) on the real line by a function which is not necessarily constant – becomes crucial.

Definition 3. A random variable X has a continuous distribution if there exists a function g : R → R₊ such that for any set A ∈ B(R), µX(A) = P(X ∈ A) = ∫_A g(x) dx.

g is called the probability density function of X.

A density function must be nonnegative (strictly speaking, it would be acceptable to allow negative values at some isolated points – all we will ever do with a density function is integrate it, and the result of integration is not affected by changing the value of the integrated function at isolated points). Another feature of the density function is that, since the probability of the whole space – in this case R – must be equal to 1, we must have

∫_{−∞}^{+∞} g(x) dx = 1.

A density function determines the distribution of a random variable unequivocally.

Going back to examples:

(6) – cont. For a uniform distribution U(a, b) we have g(x) = (1/(b − a)) 1_{[a,b]}(x).

(8) Exponential distribution with parameter λ > 0, denoted Exp(λ). This is a distribution with density

g(x) = λe^(−λx) 1_{[0,∞)}(x).

Note that we would get the same distribution if the density function were equal to g(x) = λe^(−λx) 1_{(0,∞)}(x).
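A quick check (our own sketch, assuming scipy is available) that the density integrates to 1 and that the two indicator conventions give the same integral:

```python
# The Exp(lambda) density integrates to 1 under either indicator convention.
from math import exp, inf
from scipy.integrate import quad

lam = 1.5
g_closed = lambda x: lam * exp(-lam * x) if x >= 0 else 0.0  # 1_[0, inf)
g_open = lambda x: lam * exp(-lam * x) if x > 0 else 0.0     # 1_(0, inf)

print(quad(g_closed, 0, inf)[0])  # approximately 1.0
print(quad(g_open, 0, inf)[0])    # the same value
```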

(9) Standard normal distribution, denoted N(0, 1). This is a distribution with density

g(x) = (1/√(2π)) e^(−x²/2).

This is the distribution with what is probably the most widely known density function: the bell-shaped curve. This distribution (or rather its more general versions) is very common and often appears when we describe the distributions of weight, size, IQ etc. in a given population. This is not a coincidence – the Central Limit Theorem, which we will talk about in due time, provides an answer as to why this distribution is so common in nature.

(10) The more general version of the normal distribution, denoted N(a, σ²), is defined for any a ∈ R (the location parameter) and σ > 0 (the scale parameter) by the density function

g_{a,σ²}(x) = (1/(√(2π)σ)) exp(−(x − a)²/(2σ²)).
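A minimal sketch of this density (ours), checked for sanity against scipy.stats.norm, which parameterizes the distribution by location a and scale σ:

```python
# The N(a, sigma^2) density, compared against scipy.stats.norm.
from math import sqrt, pi, exp
from scipy.stats import norm

def normal_pdf(x, a=0.0, sigma=1.0):
    return 1 / (sqrt(2 * pi) * sigma) * exp(-((x - a) ** 2) / (2 * sigma**2))

for x in (-1.0, 0.0, 2.5):
    assert abs(normal_pdf(x, 1.0, 2.0) - norm.pdf(x, loc=1.0, scale=2.0)) < 1e-12
print(normal_pdf(0.0))  # 1/sqrt(2*pi), approximately 0.3989
```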


(11) Additionally, for a ∈ R and σ = 0, we define N(a, 0) as the single-point distribution δ_a (Dirac delta), such that

δ_a(A) = 1_A(a) = 1 if a ∈ A, 0 otherwise.
