Random Variables- Discrete and Continuous



Discrete Random Variables

A random variable is essentially a random number. As motivation for a definition, let us consider an example. A coin is thrown three times, and the sequence of heads and tails is observed; thus,

$\Omega= \{hhh, hht, htt, hth, ttt, tth, thh, tht\}$

Examples of random variables defined on $\Omega$ are (1) the total number of heads, (2) the total number of tails, and (3) the number of heads minus the number of tails. Each of these is a real-valued function defined on $\Omega$; that is, each is a rule that assigns a real number to every point $\omega \in \Omega$ . Since the outcome in $\Omega$ is random, the corresponding number is random as well.

In general, a random variable is a function from $\Omega$ to the real numbers. Because the outcome of the experiment with sample space $\Omega$ is random, the number produced by the function is random as well. It is conventional to denote random variables by italic uppercase letters from the end of the alphabet. For example, we might define $X$ to be the total number of heads in the experiment described above.

A discrete random variable is a random variable that can take on only a finite or at most a countably infinite number of values. The random variable $X$ just defined is a discrete random variable since it can take on only the values 0, 1, 2, and 3. For an example of a random variable that can take on a countably infinite number of values, consider an experiment that consists of tossing a coin until a head turns up and defining $Y$ to be the total number of tosses. The possible values of $Y$ are 1, 2, 3, . . . . In general, a countably infinite set is one that can be put into one-to-one correspondence with the positive integers.
If the coin is fair, then each of the outcomes in $\Omega$ above has probability $1/8$, from which the probabilities that $X$ takes on the values 0, 1, 2, and 3 can be easily computed:

$P(X = 0) = 1/8$
$P(X = 1) = 3/8$
$P(X = 2) = 3/8$
$P(X = 3) = 1/8$
Generally, the probability measure on the sample space determines the probabilities of the various values of $X$; if those values are denoted by $x_1, x_2, \ldots$, then there is a function $p$ such that $p(x_i) = P(X = x_i)$ and $\sum_i p(x_i) = 1$. This function is called the probability mass function, or the frequency function, of the random variable $X$. The figure below shows a graph of $p(x)$ for the coin-tossing experiment. The frequency function completely describes the probability properties of the random variable.
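The frequency function for the coin-tossing example can be checked by direct enumeration. The sketch below (plain Python, with hypothetical variable names) tabulates $p(x)$ over the eight equally likely outcomes:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 8 equally likely outcomes of three fair coin tosses and
# tabulate the frequency function p(x) = P(X = x), X = number of heads.
outcomes = list(product("ht", repeat=3))
p = {x: Fraction(0) for x in range(4)}
for omega in outcomes:
    p[omega.count("h")] += Fraction(1, len(outcomes))

assert p[0] == Fraction(1, 8)
assert p[1] == Fraction(3, 8)
assert p[2] == Fraction(3, 8)
assert p[3] == Fraction(1, 8)
assert sum(p.values()) == 1   # the frequency function sums to 1
```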


In addition to the frequency function, it is sometimes convenient to use the cumulative distribution function (cdf) of a random variable, which is defined to be 
$F(x) = P(X \le x), \quad -\infty < x < \infty$

Cumulative distribution functions are usually denoted by uppercase letters and frequency functions by lowercase letters. The figure below is a graph of the cumulative distribution function of the random variable $X$ of the preceding paragraph. Note that the cdf jumps wherever $p(x) > 0$ and that the jump at $x_i$ is $p(x_i)$. For example, if $0 < x < 1$, then $F(x) = 1/8$; at $x = 1$, $F(x)$ jumps to $F(1) = 4/8 = 1/2$. The jump at $x = 1$ is $p(1) = 3/8$.

The cumulative distribution function is non-decreasing and satisfies
 
$\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
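These properties can be illustrated with the coin-tossing example; the sketch below builds $F$ from the probabilities $p(0) = 1/8$, $p(1) = p(2) = 3/8$, $p(3) = 1/8$ computed earlier:

```python
from fractions import Fraction

# cdf F(x) = P(X <= x) for the three-coin-toss example; the cdf jumps
# at each x_i by p(x_i) and is non-decreasing from 0 to 1.
p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def F(x):
    return sum(px for xi, px in p.items() if xi <= x)

assert F(-1) == 0                        # limiting value far to the left
assert F(0.5) == Fraction(1, 8)          # value for 0 < x < 1
assert F(1) == Fraction(1, 2)            # F jumps to 4/8 at x = 1
assert F(1) - F(0.5) == Fraction(3, 8)   # jump size equals p(1)
assert F(10) == 1                        # limiting value far to the right
```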


In the case of two discrete random variables $X$ and $Y$, taking on possible values $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, $X$ and $Y$ are said to be independent if, for all $i$ and $j$,

$P(X = x_i \text{ and } Y = y_j) = P(X = x_i)P(Y = y_j)$

The definition is extended to collections of more than two discrete random variables in the obvious way; for example, $X$, $Y$, and $Z$ are said to be mutually independent if, for all $i$, $j$, and $k$,

$P(X = x_i , Y = y_j , Z = z_k) = P(X = x_i )P(Y = y_j )P(Z = z_k)$
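The product condition can be checked by enumeration. As a small sketch (the choice of variables here is hypothetical, not from the text), take $X$ to be the indicator of a head on toss 1 and $Y$ the indicator of a head on toss 2 in three fair coin tosses:

```python
from itertools import product
from fractions import Fraction

# Verify the independence product formula by enumeration over the
# three-toss sample space, for all four (i, j) value pairs.
outcomes = list(product("ht", repeat=3))

def prob(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

for i in (0, 1):
    for j in (0, 1):
        joint = prob(lambda w: (w[0] == "h") == bool(i) and (w[1] == "h") == bool(j))
        marg = prob(lambda w: (w[0] == "h") == bool(i)) * prob(lambda w: (w[1] == "h") == bool(j))
        assert joint == marg   # P(X = i and Y = j) = P(X = i)P(Y = j)
```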

We next discuss some common discrete distributions that arise in applications.

Bernoulli Random Variables
A Bernoulli random variable takes on only two values: 0 and 1, with probabilities $1 − p$ and $p,$ respectively. Its frequency function is thus
$p(1) = p$
$p(0) = 1 − p$
$p(x) = 0, \quad \text{if } x \ne 0 \text{ and } x \ne 1$
An alternative and sometimes useful representation of this function is

$p(x)=\left\{\begin{matrix}
p^x(1-p)^{1-x}, & \text{if } x = 0 \text{ or } x = 1 \\
0, & \text{otherwise}
\end{matrix}\right.$

If $A$ is an event, then the indicator random variable, $I_A$, takes on the value 1 if $A$ occurs and the value 0 if $A$ does not occur:
$I_A(\omega)=\left\{\begin{matrix}
1, & \text{if } \omega \in A\\
0, & \text{otherwise}
\end{matrix}\right.$
$I_A$ is a Bernoulli random variable. In applications, Bernoulli random variables often occur as indicators. A Bernoulli random variable might take on the value 1 or 0 according to whether a guess was a success or a failure.

The Binomial Distribution
Suppose that $n$ independent experiments, or trials, are performed, where $n$ is a fixed number, and that each experiment results in a “success” with probability $p$ and a “failure” with probability $1 − p$. The total number of successes, $X$, is a binomial random variable with parameters $n$ and $p$. For example, a coin is tossed 10 times and the total number of heads is counted (“head” is identified with “success”).
The probability that $X = k$, or $p(k)$, can be found in the following way: Any particular sequence of $k$ successes occurs with probability $p^k(1 − p)^{n−k}$ , from the multiplication principle. The total number of such sequences is $\binom{n}{k}$, since there are $\binom{n}{k}$ ways to assign $k$ successes to $n$ trials. $P(X = k)$ is thus the probability of any particular sequence times the number of such sequences:

$p(k)=\binom{n}{k}p^k(1-p)^{n-k}$
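This frequency function is easy to evaluate directly; a minimal sketch using the 10-toss coin example ($n = 10$, $p = 0.5$), with the function name chosen for illustration:

```python
from math import comb

# Binomial frequency function p(k) = C(n,k) p^k (1-p)^(n-k).
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
# The probabilities over k = 0, ..., n sum to 1 (binomial theorem).
assert abs(sum(binom_pmf(k, n, p) for k in range(n + 1)) - 1.0) < 1e-12
# Five heads in ten fair tosses: C(10,5)/2^10 = 252/1024.
assert abs(binom_pmf(5, n, p) - 252 / 1024) < 1e-15
```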

Two binomial frequency functions are shown in the figure below. Note how the shape varies as a function of $p$.
A random variable with a binomial distribution can be expressed in terms of independent Bernoulli random variables, a fact that will be quite useful for analyzing some properties of binomial random variables in later chapters of this book. Specifically, let $X_1, X_2, \ldots, X_n$ be independent Bernoulli random variables with $P(X_i = 1) = p$. Then $Y = X_1 + X_2 + \cdots + X_n$ is a binomial random variable.
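This representation can be checked numerically: convolving $n$ Bernoulli frequency functions reproduces the binomial frequency function exactly. A sketch (the convolution helper is an illustration, not from the text):

```python
from math import comb

# Exact convolution of two discrete pmfs given as {value: probability}.
def convolve(pa, pb):
    out = {}
    for a, qa in pa.items():
        for b, qb in pb.items():
            out[a + b] = out.get(a + b, 0.0) + qa * qb
    return out

n, p = 10, 0.3
pmf = {0: 1.0}                          # pmf of an empty sum
for _ in range(n):
    pmf = convolve(pmf, {0: 1 - p, 1: p})   # add one Bernoulli(p)

# The distribution of X_1 + ... + X_n matches binomial(n, p).
for k in range(n + 1):
    assert abs(pmf[k] - comb(n, k) * p**k * (1 - p)**(n - k)) < 1e-12
```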

The Geometric Distribution
The geometric distribution is also constructed from independent Bernoulli trials, but from an infinite sequence. On each trial, a success occurs with probability $p$, and $X$ is the total number of trials up to and including the first success. In order that $X = k$, there must be $k - 1$ failures followed by a success.
From the independence of the trials, this occurs with probability
$p(k) = P(X = k) = (1 - p)^{k-1} p, \quad k = 1, 2, 3, \ldots$

Note that these probabilities sum to 1:

$\sum_{k=1}^{\infty}(1 - p)^{k-1} p = p \sum_{j=0}^\infty (1-p)^j = p \cdot \frac{1}{1-(1-p)} = 1$
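A quick numerical check of this identity, together with the related fact that $P(X > m) = (1-p)^m$ (no success in the first $m$ trials), as a minimal sketch:

```python
# Geometric frequency function p(k) = (1-p)^(k-1) p with p = 0.25.
p = 0.25
pmf = lambda k: (1 - p) ** (k - 1) * p

# A long partial sum is numerically 1, matching the geometric series.
assert abs(sum(pmf(k) for k in range(1, 200)) - 1.0) < 1e-12

# Tail probability: P(X > m) = (1 - p)^m.
m = 5
assert abs(sum(pmf(k) for k in range(m + 1, 200)) - (1 - p) ** m) < 1e-12
```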

The Hypergeometric Distribution
Suppose that an urn contains $n$ balls, of which $r$ are black and $n−r$ are white. Let $X$ denote the number of black balls drawn when taking $m$ balls without replacement.

$P(X=k)=\frac{{}\binom{r}{k}\binom{n-r}{m-k}}{\binom{n}{m}}$

$X$ is a hypergeometric random variable with parameters $r, n,$ and $m$.
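The frequency function can be evaluated with binomial coefficients; summing it over the feasible values of $k$ gives 1 by Vandermonde's identity. A sketch with illustrative parameter values:

```python
from math import comb

# Hypergeometric frequency function: n balls, r black, m drawn
# without replacement, k black balls obtained.
def hypergeom_pmf(k, n, r, m):
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)

n, r, m = 20, 7, 5
# k must satisfy k <= r, k <= m, and m - k <= n - r.
ks = range(max(0, m - (n - r)), min(r, m) + 1)
assert abs(sum(hypergeom_pmf(k, n, r, m) for k in ks) - 1.0) < 1e-12
```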

The Poisson Distribution
The Poisson frequency function with parameter $λ (λ > 0)$ is
$P(X = k) = \frac{λ^k}{k!}e^{−λ}, k = 0, 1, 2, . . .$

Since $e^\lambda = \sum_{k=0}^\infty (\lambda^k/k!)$, it follows that the frequency function sums to 1. The figure below shows four Poisson frequency functions. Note how the shape varies as a function of $\lambda$.
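A numerical check of this normalization, as a minimal sketch:

```python
from math import exp, factorial

# Poisson frequency function P(X = k) = lambda^k e^(-lambda) / k!.
def poisson_pmf(k, lam):
    return lam**k / factorial(k) * exp(-lam)

lam = 3.0
# A long partial sum is numerically 1, since sum_k lambda^k/k! = e^lambda.
assert abs(sum(poisson_pmf(k, lam) for k in range(100)) - 1.0) < 1e-12
```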

The Poisson distribution can be derived as a limit of the binomial distribution as the number of trials $n$ tends to infinity and the probability of success $p$ tends to zero with $np = \lambda$ held fixed. The binomial frequency function is
$P(k)=\frac{n!}{(n-k)!k!}p^k(1-p)^{n-k}$

Setting $np = λ$, this expression becomes

$P(k)=\frac{n!}{(n-k)!k!}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k}$

$P(k)=\frac{\lambda^k}{k!} \frac{n!}{(n-k)!}\frac{1}{n^k}\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k}$

As $n \to \infty$,

$\frac{\lambda}{n} \to 0$

$\frac{n!}{(n-k)!n^k} \to 1$

$\left(1-\frac{\lambda}{n}\right)^n \to e^{-\lambda}$
and
$\left(1-\frac{\lambda}{n}\right)^{-k} \to 1$

Thus

$P(k) \to \frac{\lambda^k}{k!} e^{-\lambda}$
which is the Poisson frequency function.
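The convergence can be observed numerically; the sketch below (with arbitrarily chosen $\lambda$ and $k$) shows the gap between the binomial$(n, \lambda/n)$ and Poisson$(\lambda)$ probabilities shrinking as $n$ grows:

```python
from math import comb, exp, factorial

# Compare binomial(n, lambda/n) with Poisson(lambda) at a fixed k.
lam, k = 2.0, 3
poisson = lam**k / factorial(k) * exp(-lam)

gaps = []
for n in (10, 100, 10000):
    p = lam / n
    gaps.append(abs(comb(n, k) * p**k * (1 - p)**(n - k) - poisson))

assert gaps[0] > gaps[1] > gaps[2]   # the approximation improves with n
assert gaps[2] < 1e-3                # very close by n = 10000
```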

Continuous Random Variables
In applications, we are often interested in random variables that can take on a continuum of values rather than a finite or countably infinite number. For example, a model for the lifetime of an electronic component might be that it is random and can be any positive real number. For a continuous random variable, the role of the frequency function is taken by a density function, $f(x)$, which has the properties that $f(x) \ge 0$, $f$ is piecewise continuous, and $\int_{-\infty}^{\infty}f(x)\,dx = 1$. If $X$ is a random variable with a density function $f$, then for any $a < b$, the probability that $X$ falls in the interval $(a, b)$ is the area under the density function between $a$ and $b$:

$P(a < X < b) =\int_{a}^{b}f(x) dx$

One consequence of this definition is that the probability that a continuous random variable $X$ takes on any particular value is 0:

$P(X=c)=\int_{c}^{c}f(x) dx=0$

The cumulative distribution function of a continuous random variable $X$ is defined in the same way as for a discrete random variable:
$F(x) = P(X ≤ x)$

$F(x)$ can be expressed in terms of the density function:

$F(x)=\int_{-\infty}^x f(u) du$

From the fundamental theorem of calculus, if $f$ is continuous at $x$, then $f(x) = F'(x)$.
The cdf can be used to evaluate the probability that $X$ falls in an interval:

$P(a \le X \le b)=\int_a^b f(x)dx= F(b)-F(a)$

A uniform random variable on the interval [0, 1] is a model for what we mean when we say “choose a number at random between 0 and 1.” Any real number in the interval is a possible outcome, and the probability model should have the property that the probability that $X$ is in any subinterval of length $h$ is equal to $h$. The following density function does the job:

$f(x)=\left\{\begin{matrix}
1, & 0 \le x \le 1\\
0, & x < 0 \text{ or } x > 1
\end{matrix}\right.$
This is called the uniform density on [0, 1]. The uniform density on a general interval $[a, b]$ is

$f(x)=\left\{\begin{matrix}
1/(b-a), & a \le x \le b\\
0, & x < a \text{ or } x > b
\end{matrix}\right.$

From this definition, we see that the cdf of a uniform random variable on [0, 1] is
$F(x)=\left\{\begin{matrix}
0,& x \le 0\\
x,& 0 \le x \le 1\\
1,& x \ge 1
\end{matrix}\right.$
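With this cdf, the probability that $X$ falls in a subinterval of [0, 1] equals the subinterval's length, as the model requires. A minimal sketch:

```python
# Uniform cdf on [0, 1]: F(x) = 0 for x <= 0, x on [0, 1], 1 for x >= 1.
def F(x):
    return min(max(x, 0.0), 1.0)

# P(a < X < b) = F(b) - F(a) = b - a for a subinterval of [0, 1].
a, b = 0.2, 0.7
assert abs((F(b) - F(a)) - (b - a)) < 1e-12

# Limiting values of the cdf.
assert F(-3.0) == 0.0 and F(3.0) == 1.0
```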

The Exponential Density
The exponential density function is
$f(x)=\left\{\begin{matrix}
\lambda e^{-\lambda x},& x \ge 0\\
0, & x < 0\\
\end{matrix}\right.$

Like the Poisson distribution, the exponential density depends on a single parameter, $λ > 0$, and it would therefore be more accurate to refer to it as the family of exponential densities. Several exponential densities are shown in Figure below. Note that as $λ$ becomes larger, the density drops off more rapidly.



The cumulative distribution function is easily found:

$F(x)=\int_{-\infty}^x f(u)du=\left\{\begin{matrix}
1-e^{-\lambda x},& x \ge 0\\
0, & x < 0\\
\end{matrix}\right.$
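This closed form can be checked against a numerical integral of the density; the midpoint-rule integration below is just an illustration, not part of the text:

```python
from math import exp

# Integrate the exponential density lambda*e^(-lambda*u) from 0 to x by a
# midpoint Riemann sum, and compare with F(x) = 1 - e^(-lambda x).
lam, x, n = 1.5, 2.0, 100000
dx = x / n
integral = sum(lam * exp(-lam * (i + 0.5) * dx) * dx for i in range(n))

assert abs(integral - (1 - exp(-lam * x))) < 1e-8
```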

The Gamma Density
The gamma density function depends on two parameters, $α$ and $λ$:

$g(t)=\frac{\lambda ^\alpha}{\Gamma (\alpha)}t^{\alpha-1}e^{-\lambda t}, \quad t \ge 0$

For $t < 0$, $g(t) = 0$. In order that the density be well defined and integrate to 1, it is necessary that $\alpha > 0$ and $\lambda > 0$. The gamma function, $\Gamma(x)$, is defined as

$\Gamma(x)=\int_{0}^\infty u^{x-1}e^{-u}\,du, \quad x > 0$

Note that if $α = 1$, the gamma density coincides with the exponential density. The parameter $α$ is called a shape parameter for the gamma density, and $λ$ is called a scale parameter. Varying $α$ changes the shape of the density, whereas varying $λ$ corresponds to changing the units of measurement (say, from seconds to minutes) and does not affect the shape of the density.
The figure below shows several gamma densities. Gamma densities provide a fairly flexible class for modeling nonnegative random variables.
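The reduction to the exponential density at $\alpha = 1$ can be verified numerically, since $\Gamma(1) = 1$. A minimal sketch (function names are illustrative):

```python
from math import gamma, exp

# Gamma density g(t) = lambda^alpha / Gamma(alpha) * t^(alpha-1) * e^(-lambda t).
def gamma_pdf(t, alpha, lam):
    if t < 0:
        return 0.0
    return lam**alpha / gamma(alpha) * t**(alpha - 1) * exp(-lam * t)

# With alpha = 1 the gamma density equals the exponential density.
lam = 2.0
for t in (0.1, 1.0, 3.5):
    assert abs(gamma_pdf(t, 1.0, lam) - lam * exp(-lam * t)) < 1e-12
```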


The Normal Distribution
The normal distribution plays a central role in probability and statistics, for reasons that will become apparent in later chapters of this book. This distribution is also called the Gaussian distribution after Carl Friedrich Gauss, who proposed it as a model for measurement errors. The central limit theorem justifies the use of the normal distribution in many applications. Roughly, the central limit theorem says that if a random variable is the sum of a large number of independent random variables, it is approximately normally distributed. The normal distribution has been used as a model for such diverse phenomena as a person’s height, the distribution of IQ scores, and the velocity of a gas molecule. The density function of the normal distribution depends on two parameters, $μ$ and $σ$ (where $−∞ < μ < ∞, σ > 0):$

$f(x)=\frac{1}{\sigma\sqrt{2 \pi}}e ^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty$

The parameters $\mu$ and $\sigma$ are called the mean and standard deviation of the normal density. As shorthand for the statement “$X$ follows a normal distribution with parameters $\mu$ and $\sigma$,” it is convenient to use $X \sim N(\mu, \sigma^2)$. From the form of the density function, we see that the density is symmetric about $\mu$, that is, $f(\mu - x) = f(\mu + x)$, where it has a maximum, and that the rate at which it falls off is determined by $\sigma$. The figure below shows several normal densities. Normal densities are sometimes referred to as bell-shaped curves. The special case for which $\mu = 0$ and $\sigma = 1$ is called the standard normal density. Its cdf is denoted by $\Phi$ and its density by $\phi$ (not to be confused with the empty set).
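The symmetry about $\mu$ is easy to verify, and the standard normal cdf $\Phi$ has no elementary closed form but can be written with the error function as $\Phi(x) = (1 + \operatorname{erf}(x/\sqrt{2}))/2$. A minimal sketch:

```python
from math import sqrt, pi, exp, erf

# Normal density with parameters mu and sigma.
def normal_pdf(x, mu, sigma):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Symmetry about mu: f(mu - x) = f(mu + x).
mu, sigma = 1.0, 2.0
assert abs(normal_pdf(mu - 0.7, mu, sigma) - normal_pdf(mu + 0.7, mu, sigma)) < 1e-15

# Standard normal cdf via the error function.
def Phi(x):
    return (1 + erf(x / sqrt(2))) / 2

assert abs(Phi(0.0) - 0.5) < 1e-15   # symmetry gives Phi(0) = 1/2
```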

The Beta Density
The beta density is useful for modeling random variables that are restricted to the interval [0, 1]:

$f(u)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}u^{a-1}(1-u)^{b-1}, \quad  0 \le u \le  1$

Figure below shows beta densities for various values of $a$ and $b$. Note that the case $a = b = 1$ is the uniform distribution. The beta distribution is important in Bayesian statistics, as you will see later.
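The reduction to the uniform density at $a = b = 1$ can be checked directly, since $\Gamma(2)/(\Gamma(1)\Gamma(1)) = 1$. A minimal sketch:

```python
from math import gamma

# Beta density f(u) = Gamma(a+b)/(Gamma(a)Gamma(b)) * u^(a-1) (1-u)^(b-1).
def beta_pdf(u, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * u**(a - 1) * (1 - u)**(b - 1)

# With a = b = 1 the beta density is constant 1 on (0, 1): the uniform density.
for u in (0.1, 0.5, 0.9):
    assert abs(beta_pdf(u, 1.0, 1.0) - 1.0) < 1e-12
```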



