Probability and Statistics

Axioms of Probability Theory

  • $P(E)\geq 0$.
  • $P(\Omega)=1$.
  • For any sequence of mutually exclusive events $E_1, E_2, \cdots$, we have

$$P\left(\bigcup_{i=1}^{\infty} E_i\right)=\sum_{i=1}^{\infty} P(E_i)$$

We then have the following propositions:

  • $P(E^c)=1-P(E)$
  • $E\subset F \implies P(E)\leq P(F)$
  • $P(E\cup F) = P(E)+P(F) - P(E\cap F)$

Principles for Counting

Multiplication Principle

Suppose that $r$ experiments are to be performed. If experiment $1$ has $n_1$ possible outcomes, and for each outcome of experiment $1$, experiment $2$ has $n_2$ possible outcomes, and so forth, then there is a total of

$$n_1\times n_2\times\cdots\times n_r$$

possible outcomes for the $r$ experiments.

Addition Principle

Suppose that the set $S$ is partitioned into pairwise disjoint parts $S_1, S_2, \cdots, S_n$. The number of objects in $S$ can be determined by finding the number of objects in each of the parts and adding the numbers so obtained.

Subtraction Principle

Let $S$ be a set and $U$ be a larger set containing $S$. We denote by $A$ the set of all elements of $U$ that are not in $S$. Then

$$|S| = |U|-|A|$$

Division Principle

Let $S$ be a finite set that is partitioned into $k$ parts in such a way that each part contains the same number of objects. Then the number of parts in the partition is given by the rule

$$k=\frac{|S|}{\text{number of objects in a part}}$$

Permutations and Combinations

Permutations

Permutations are ordered arrangements of the elements of a set. With $n$ objects, all considered distinct from one another, there are $n!$ permutations.

If $n_1$ of them are alike, $n_2$ are alike, $\cdots$, $n_r$ are alike, then there are

$$\frac{n!}{n_1!\times\cdots\times n_r!}$$

permutations of the $n$ objects. Necessarily we have $\sum_{k}n_k \leq n$.

Combinations

The number of different groups of $k$ objects that can be formed from a total of $n$ objects is given by the choose function:

$${n \choose k} = \begin{cases}\frac{n!}{k!(n-k)!} & 0\leq k\leq n \\ 0 & k<0 \lor k>n\end{cases}$$

Multinomial Coefficients

Let $n_1, \cdots, n_k\in \mathbb{N}$, where $\sum n_i=n$. The number of possible partitions of $n$ objects into $k$ distinct groups of sizes $n_1, n_2, \cdots, n_k$ is given by:

$${n \choose {n_1, n_2, \cdots, n_k}} = \frac{n!}{n_1!\times n_2!\times\cdots\times n_k!}$$
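As a quick sanity check, the multinomial coefficient can be computed directly from the factorial formula. A minimal Python sketch (the helper name `multinomial` is just illustrative, not a standard library function):

```python
from math import factorial, comb, prod

def multinomial(*sizes):
    """n! / (n_1! * n_2! * ... * n_k!) for group sizes n_1, ..., n_k."""
    n = sum(sizes)
    return factorial(n) // prod(factorial(s) for s in sizes)

# Partition 10 objects into groups of sizes 5, 3, 2.
print(multinomial(5, 3, 2))              # 2520
# The binomial coefficient is the special case with two groups of sizes k and n - k.
print(multinomial(3, 7) == comb(10, 3))  # True
```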

Conditional Probability

Let $A, B\subset S$ be two events such that $P(A)>0$. The conditional probability of $B$ given $A$ is defined by

$$P(B|A) = \frac{P(A\cap B)}{P(A)}$$

Proposition

  • We fix an event $A$ with $P(A)>0$. The function $P(\cdot|A): S\to [0,1]$, which associates to any event $B\subset S$ the quantity $P(B|A)$, is a probability on the same sample space $S$.
  • The multiplication rule: We assume that $A_1, \cdots, A_n$ are events such that $P(\cap_k A_k)>0$; then

$$P(A_1\cap \cdots \cap A_n)=P(A_1)P(A_2|A_1)P(A_3|A_2\cap A_1)\cdots P(A_n|A_1\cap\cdots\cap A_{n-1})$$

  • Bayes Formula:

$$P(A|B) = \frac{P(A)}{P(B)}P(B|A)$$

  • Another form of Bayes: If $A$ and $B$ are two events, then

$$P(B) = P(B|A)P(A) + P(B|A^c)(1-P(A))$$

  • Total Probability Law (extended Bayes Formula): The Bayes formula can be generalised as follows. We assume that $A_1, \cdots, A_n$ are mutually disjoint events such that $\cup_{k=1}^{n}A_k = S$ (such a sequence is called a partition of $S$). Then we have (a numerical sketch follows this list):

$$P(B)=\sum_{k=1}^{n}P(B|A_k)P(A_k)$$

  • Another form of TPL: suppose the sample space $S$ is divided into 3 disjoint events $B_1, B_2, B_3$. Then for any event $A$ we have:

$$P(A) = P(A\cap B_1) + P(A\cap B_2) + P(A\cap B_3)$$
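A small numerical sketch of the total probability law and Bayes formula, using a made-up two-outcome testing scenario (all numbers are illustrative only):

```python
# Hypothetical disease-testing numbers, purely for illustration.
p_disease = 0.01                  # P(A): prior probability of the disease
p_pos_given_disease = 0.95        # P(B|A): test sensitivity
p_pos_given_healthy = 0.05        # P(B|A^c): false-positive rate

# Total probability law: P(B) = P(B|A)P(A) + P(B|A^c)(1 - P(A))
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes formula: P(A|B) = P(A) / P(B) * P(B|A)
p_disease_given_pos = p_disease / p_pos * p_pos_given_disease

print(f"P(B)   = {p_pos:.4f}")
print(f"P(A|B) = {p_disease_given_pos:.4f}")   # roughly 0.16
```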

Independence

Definition: Two events $A$ and $B$ are independent if

$$P(A\cap B) = P(A)P(B)$$

Events $A$ and $B$ are independent if the probability that one occurred is not affected by knowledge that the other occurred. Formally, if $P(A)$ and $P(B)$ are not $0$,

$$P(A|B) = P(A), \qquad P(B|A)=P(B)$$

Proposition If $E$ and $F$ are independent, so are $E$ and $F^c$.

Independence of Several Events

Definition Events $A_1, \cdots, A_n$ are mutually independent if for any subset $\{A_{i_1}, A_{i_2}, \cdots, A_{i_m}\} \subset \{A_1, \cdots, A_n\}$ with $2\leq m\leq n$:

$$P\left(\cap_{k=1}^m A_{i_k}\right)=P(A_{i_1})\cdots P(A_{i_m})$$

Random Variables

Definition A random variable is a real-valued function on the sample space:

$$X: S\to\mathbb{R}, \quad s\mapsto X(s), \; \forall s\in S$$

  • A r.v. $X$ is a discrete random variable if there exists a countable set $K\subset \mathbb{R}$ such that $P(X\in K)=1$.
  • Let $X$ be a discrete random variable; its probability mass function (PMF) is the real map $P_X$ defined by

$$P_X:\mathbb{R}\to\mathbb{R}, \quad P_X(x)=P(X=x), \; \forall x\in\mathbb{R}$$

Expectation and Variance

Definition Consider a discrete random variable $X: S\to\mathbb{R}$ with $X(S)=\{r_1,\cdots,r_n\}$. Then

  • The expected value of $X$ (expectation or mean value) is defined as

$$E[X]=\sum_{k=1}^n r_kP(X=r_k)$$

  • The variance of $X$, denoted by $Var(X)$, is defined as

$$Var(X) = E[X^2]-E[X]^2 = \sum_{k=1}^n (r_k)^2P(X=r_k)-\left(\sum_{k=1}^n r_kP(X=r_k)\right)^2$$

  • The square root of the variance is called the standard deviation of $X$ and is denoted by $\sigma_X=\sqrt{Var(X)}$.
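These definitions translate directly into code. A minimal sketch that computes $E[X]$, $Var(X)$ and $\sigma_X$ from a PMF stored as a dictionary (the distribution below is just an example):

```python
from math import sqrt

# Example PMF: a loaded three-sided die, values r_k -> P(X = r_k).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

mean = sum(r * p for r, p in pmf.items())
second_moment = sum(r**2 * p for r, p in pmf.items())
variance = second_moment - mean**2          # Var(X) = E[X^2] - E[X]^2
std_dev = sqrt(variance)

print(mean, variance, std_dev)              # 2.1  0.49  0.7
```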

Deluge of Discrete Distributions

Bernoulli Distribution $X$ is $1$ on success and $0$ on failure, with success and failure defined by the context.

Binomial Distribution, denoted by $Bin(n,p)$, models the number of successes in $n$ independent $Bernoulli(p)$ trials.

Hypergeometric Distribution A hypergeometric random variable $X$ with parameters $n, M, m$ counts the number of elements having a specified attribute in a random sample of size $n$, taken without replacement from a population of size $M$ in which exactly $m$ members possess the attribute. We denote $X\sim H(n,M,m)$.

Geometric Distribution models the number of tails before the first head in a sequence of coin flips or, more generally, the number of failures before the first success in a sequence of Bernoulli trials.

Memoryless Property of Geometric Distribution A random variable $X\sim G(p)$ has the lack-of-memory property, i.e.

$$P(X=n+k|X>n) = P(X=k), \; \forall n,k\in\mathbb{N}$$

Poisson Distribution A discrete random variable $X$ is a Poisson random variable with parameter $\lambda >0$ if its probability mass function is given by

$$P_X(x)=\begin{cases}e^{-\lambda}\frac{\lambda^x}{x!} & x=0,1,\cdots \\ 0 & \text{otherwise}\end{cases}$$

  • The function $P_X$ is indeed a probability mass function, as it satisfies all the axioms.
  • The Poisson distribution is considered to be the law of rare events. Poisson random variables often arise in the modelling of the frequency of occurrence of a certain event during a specified period of time.
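A short sketch checking numerically that the Poisson PMF behaves as claimed: it sums to $1$ and its mean comes out close to $\lambda$. Truncating the sum at 100 terms is an arbitrary choice that makes the tail negligible for this $\lambda$:

```python
from math import exp, factorial

lam = 3.5  # an arbitrary rate parameter lambda > 0

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x) if x >= 0 else 0.0

total = sum(poisson_pmf(x, lam) for x in range(100))
mean = sum(x * poisson_pmf(x, lam) for x in range(100))

print(round(total, 10))  # ~1.0: the PMF sums to one (numerically)
print(round(mean, 10))   # ~3.5: the mean of a Poisson(lambda) variable is lambda
```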

Continuous Random Variable

Definition A random variable $X$ is called a continuous random variable if $P(X=x)=0$ for all $x\in\mathbb{R}$.

Definition Let $X$ be a random variable; then the cumulative distribution function of $X$ is the map $F_X: \mathbb{R}\to\mathbb{R}$ defined by

$$F_X(a) = P(X\leq a)$$

for all $a\in\mathbb{R}$.

Proposition Let $X$ be a random variable; then $F_X$ satisfies the following properties:

  • $F_X$ is non-decreasing.
  • $F_X$ is right-continuous, i.e. $F_X(x+0)=F_X(x)$.
  • $F_X(-\infty) := \lim_{a\to-\infty} F_X(a)=0$.
  • $F_X(\infty) := \lim_{a\to\infty}F_X(a)=1$.

Note that, while being continuous from the right, a CDF $F_X$ is not left-continuous in general. In particular, the left limit of $F_X$ at $a$ can be defined as

$$F_X(a-):=\lim_{t\to a^-}F_X(t)=P(X<a)$$

Propositions For all $a,b\in\mathbb{R}$, $a<b$, the following hold:

  • $P(a<X\leq b)=F_X(b)-F_X(a)$
  • $P(a\leq X\leq b)=F_X(b)-F_X(a-)$
  • $P(a<X<b) = F_X(b-)-F_X(a)$
  • $P(a\leq X<b)=F_X(b-)-F_X(a-)$

Proposition A random variable is continuous if and only if its CDF is a continuous function.

Density

Definition Let $X$ be a continuous random variable.

  • A non-negative function $f_X:\mathbb{R}\to\mathbb{R}$ is called a Probability Density Function (PDF) of $X$ if, for all $a,b\in\mathbb{R}$ with $a<b$,

$$P(a\leq X\leq b)=\int_{a}^b f_X(x)dx$$

  • If the distribution of $X$ has a PDF, we say that $X$ is an absolutely continuous random variable.

Remarks The density $f_X(x)$ of an absolutely continuous random variable is the analogue of the probability mass function $p_X(x)$ of a discrete random variable. The important differences are:

  • Unlike $p_X(x)$, the PDF $f_X(x)$ is not a probability. You have to integrate it to get a probability.
  • Since $f_X(x)$ is not a probability, there is no restriction that $f_X(x)$ be less than or equal to $1$.

Proposition A continuous random variable $X$ has a PDF if and only if there exists a non-negative function $f: \mathbb{R}\to\mathbb{R}$ such that

$$F_X(a)=\int_{-\infty}^a f(x)dx, \quad \forall a\in\mathbb{R}$$

In this case, $f$ is a PDF of $X$. Moreover, for all $x\in\mathbb{R}$ where $f$ is continuous, $f(x)=F_X'(x)$.
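A numerical illustration of this relation, using an exponential distribution with rate $\lambda$ (CDF $F_X(x)=1-e^{-\lambda x}$ for $x\geq 0$): a centred finite difference of the CDF recovers the density $f_X(x)=\lambda e^{-\lambda x}$. The rate and the step size `h` are arbitrary choices.

```python
from math import exp

lam = 2.0  # rate parameter of an exponential distribution (illustrative choice)

def cdf(x):
    return 1 - exp(-lam * x) if x >= 0 else 0.0

def pdf(x):
    return lam * exp(-lam * x) if x >= 0 else 0.0

h = 1e-6
for x in [0.5, 1.0, 2.0]:
    numeric = (cdf(x + h) - cdf(x - h)) / (2 * h)   # F_X'(x) by central difference
    print(f"x={x}: F'_X(x) ~ {numeric:.6f}, f_X(x) = {pdf(x):.6f}")
```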

Procedure Given a random variable $X$, there is a procedure for finding a PDF.

  • Check for continuity of $F_X$. If it holds, proceed with the next step.
  • Check at which points $F_X'$ exists.
  • If $F_X'$ exists and is continuous outside a finite or countable subset of the real line, that is, if $\mathbb{R}-A$ is at most countable, where $A=\{x\in\mathbb{R}: F_X'(x) \text{ exists and is continuous}\}$, then define $f_X(x)=F'_X(x)$ for all $x\in A$.

Proposition Once we have a PDF of a random variable, we are able to compute any probability depending on that random variable. Let $X$ be an absolutely continuous random variable; then

$$P(X\in A)=\int_{A} f(x)dx, \; \forall A\subset \mathbb{R}$$

Remark Let $f$ be a PDF; then we have

$$\int_{-\infty}^{\infty} f(x)dx=1$$

If $f:\mathbb{R}\to\mathbb{R}$ is a non-negative function satisfying the equality above, then $f$ is a PDF. We can therefore construct a PDF starting from any real map that does not change sign and that has a finite non-zero integral on $\mathbb{R}$.

Let $g: \mathbb{R}\to\mathbb{R}$ be such that $g$ does not change sign and

$$\int_{\mathbb{R}}g(x)dx=c\in\mathbb{R}-\{0\}$$

then we can define a function $f:\mathbb{R}\to\mathbb{R}$ by $f(x)=\frac{1}{c}g(x)$ for all $x\in\mathbb{R}$. $f$ is then a PDF.
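For instance, $g(x)=e^{-x^2}$ is positive with $\int_{\mathbb{R}} g\,dx = \sqrt{\pi}$, so $f=g/\sqrt{\pi}$ is a PDF. A sketch that normalises $g$ numerically (the grid spacing and the cut-off at $|x|=10$ are arbitrary choices):

```python
import numpy as np

# A non-negative function with a finite, non-zero integral over R.
def g(x):
    return np.exp(-x**2)

# Approximate the integral of g by a Riemann sum on a wide grid;
# the tails beyond |x| = 10 are negligible for this g.
dx = 1e-4
xs = np.arange(-10, 10, dx)
c = g(xs).sum() * dx                    # ~ sqrt(pi)

def f(x):
    return g(x) / c                     # f = g / c is a PDF

print(c, np.sqrt(np.pi))                # the two agree closely
print(f(xs).sum() * dx)                 # ~ 1.0: f integrates to one
```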

Expectation and Variance

Definition Let $X$ be an absolutely continuous random variable. Then

  • The expectation, or expected value, or mean, of $X$ is denoted by $E[X]$ and defined by

$$E[X]=\int_{-\infty}^{\infty} xf_X(x)dx$$

  • The variance of $X$ is defined by

$$Var(X)=E[X^2]-E[X]^2$$

  • The standard deviation is defined as

$$\sigma(X)=\sqrt{Var(X)}$$

Proposition If $g$ is a function, then

$$E[g(X)] = \int_{-\infty}^{\infty}g(x)f_X(x)dx$$

It follows that

$$E[a+bX]=a+bE[X]$$
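A sketch checking both facts numerically for an exponential density $f_X(x)=\lambda e^{-\lambda x}$: $E[g(X)]$ is approximated as $\int g(x)f_X(x)\,dx$, and the linearity $E[a+bX]=a+bE[X]$ follows. The rate, grid and constants $a, b$ are arbitrary choices.

```python
import numpy as np

lam = 2.0                                   # exponential rate (illustrative)
dx = 1e-4
xs = np.arange(0, 50, dx)                   # the density is 0 for x < 0
f = lam * np.exp(-lam * xs)                 # f_X(x)

def expect(g_vals):
    """E[g(X)] = integral of g(x) f_X(x) dx, approximated by a Riemann sum."""
    return (g_vals * f * dx).sum()

EX = expect(xs)                             # E[X] = 1/lambda = 0.5
a, b = 3.0, -4.0
print(EX)                                   # ~0.5
print(expect(a + b * xs), a + b * EX)       # both ~1.0: E[a+bX] = a + b E[X]
```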

Exponential Distribution

Exponential random variables are the continuous analogue of geometric random variables. They are also related to Poisson random variables, in that they describe the time that passes between consecutive occurrences of a certain event.

Memoryless Suppose that the probability that a taxi arrives within the first five minutes is $p$. If I wait 5 minutes and in fact no taxi arrives, then the probability that a taxi arrives within the next 5 minutes is still $p$.

The memorylessness of the exponential distribution is analogous to the memorylessness of the discrete geometric distribution, where having flipped 5 tails in a row gives no information about the next 5 flips. Indeed, the exponential distribution is precisely the continuous counterpart of the geometric distribution, which models the waiting time for a discrete process to change state.

Proposition A positive continuous random variable $X$ has the lack-of-memory property, i.e.

$$P(X>s+t\mid X>s) = P(X>t), \quad \forall s,t\geq 0$$

if and only if $X$ is exponentially distributed.

Normal Distribution

Standard Normal Distribution The standard normal distribution $\mathcal{N}(0,1)$ has mean $0$ and variance $1$. We use

$$\varphi(z)=\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}$$

for the standard normal density (PDF), and

$$\Phi(z)=\int_{-\infty}^z \varphi(x)dx$$

for the standard normal distribution (CDF).

Other Normal Distributions The normal distribution $\mathcal{N}(\mu,\sigma^2)$ has the following density (PDF):

$$f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Proposition If $X\sim\mathcal{N}(\mu, \sigma^2)$ and $Y=aX+b$, then $Y\sim \mathcal{N}(a\mu+b, a^2\sigma^2)$.

Proposition If $X\sim \mathcal{N}(\mu, \sigma^2)$, then $Z:=\frac{X-\mu}{\sigma}\sim\mathcal{N}(0,1)$.

Laplace Limit Theorem

Let $S_n$ denote the number of successes that occur when $n$ independent trials, each resulting in a success with probability $p$, are performed. Then for any $a<b$,

$$\lim_{n\to\infty}P\left(a\leq \frac{S_n-np}{\sqrt{np(1-p)}}\leq b\right)=\Phi(b)-\Phi(a)$$

We define the standardised sum as

$$Z_n:= \frac{S_n-np}{\sqrt{np(1-p)}}$$
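A sketch comparing an exact binomial probability with the normal approximation suggested by the theorem (the parameters and the interval are arbitrary; `Phi` is implemented via the error function):

```python
from math import comb, erf, sqrt

n, p = 100, 0.3

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Exact: P(25 <= S_n <= 35) for S_n ~ Bin(n, p)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(25, 36))

# Approximation: standardise the endpoints and use Phi(b) - Phi(a)
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = Phi((35 - mu) / sigma) - Phi((25 - mu) / sigma)

print(exact, approx)   # the two values are close for moderately large n
```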

Distribution of Function of a Random Variable

Theorem Let $X$ be an absolutely continuous random variable and

$$Y=g(X)$$

where $g$ is a strictly monotone and differentiable function. Then the density function of $Y$ is given by

$$f_Y(y)=\begin{cases}f_X(g^{-1}(y)) \times \left|\frac{d}{dy}g^{-1}(y)\right| & y=g(x)\text{ for some }x\\ 0 & \text{otherwise}\end{cases}$$

Procedure Two-step procedure (a worked sketch follows this list):

  • Find the CDF of $Y$ as

$$F_Y(y)=P(Y\leq y) = P(g(X)\leq y)$$

  • Differentiate

$$f_Y(y)=\frac{dF_Y}{dy}(y)$$
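As an illustration of the two-step procedure (an assumed example, not one from these notes): let $X$ be uniform on $(0,1)$ and $Y=-\ln X$. Then $F_Y(y)=P(-\ln X\leq y)=P(X\geq e^{-y})=1-e^{-y}$ for $y>0$, so differentiating gives $f_Y(y)=e^{-y}$, i.e. $Y$ is exponential with rate 1. A quick simulation agrees:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # X ~ Uniform(0, 1)
y = -np.log(x)                              # Y = g(X) = -ln(X)

# Empirical check of F_Y(t) = 1 - e^{-t} at a few points.
for t in [0.5, 1.0, 2.0]:
    print(t, (y <= t).mean(), 1 - np.exp(-t))
```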

Joint Distributions

Discrete Joint PMF Suppose $X$ takes values in $\{x_1, x_2, \cdots, x_n\}$ and $Y$ takes values in $\{y_1, y_2, \cdots, y_m\}$. The ordered pair $(X, Y)$ takes values in the product $\{(x_1, y_1), (x_1, y_2), \cdots, (x_n, y_m)\}$. The joint probability mass function (joint PMF) of $X$ and $Y$ is the function $P_{X, Y}(x_i, y_j)$ giving the probability of the joint outcome $\{X=x_i, Y=y_j\}$. We can organise this in a joint probability table.

Properties The joint probability mass function (PMF) satisfies:

  • $0\leq P_{X, Y}(x_i, y_j)\leq 1$.
  • The total probability is $1$: $\sum_{i=1}^n\sum_{j=1}^m P_{X,Y}(x_i, y_j)=1$.
  • Computing probabilities: for any $A\subset \{x_1, \cdots, x_n\}\times \{y_1, \cdots, y_m\}$, we have $P((X, Y)\in A) = \sum_{(x,y)\in A} P_{X,Y}(x, y)$.

Continuous Joint Distributions If $X$ takes values in $[a,b]$ and $Y$ takes values in $[c,d]$, then $(X, Y)$ takes values in $[a,b]\times [c,d]$. The joint PDF is denoted by $f(x,y)$.

Properties The joint probability density function (PDF) satisfies:

  • $0\leq f_{X,Y}(x,y)$.
  • The total probability is $1$: $\int_{y\in\mathbb{R}}\int_{x\in\mathbb{R}} f_{X, Y}(x, y)dxdy = 1$.
  • Computing probabilities: for any $x_1\leq x_2$ and $y_1\leq y_2$, we calculate the probability by

$$P(x_1\leq X\leq x_2, y_1\leq Y\leq y_2) = \int_{y\in[y_1, y_2]}\int_{x\in[x_1, x_2]} f_{X, Y}(x,y)dxdy$$

Marginal PMF and PDF

Discrete Case Let $X, Y$ be discrete random variables defined on the same sample space; then

$$P_X(x)=\sum_{y\in\mathbb{R}}P_{X,Y}(x, y), \; \forall x\in\mathbb{R}$$

and analogously,

$$P_Y(y)=\sum_{x\in\mathbb{R}}P_{X, Y}(x,y), \; \forall y\in\mathbb{R}$$

Continuous Case Let $X,Y: S\to\mathbb{R}$ be continuous random variables with a joint PDF. Then the marginal densities are given by

$$f_X(x) = \int_{\mathbb{R}}f_{X, Y}(x,y)dy, \; \forall x\in\mathbb{R}$$

$$f_Y(y) = \int_{\mathbb{R}}f_{X, Y}(x,y)dx, \; \forall y\in\mathbb{R}$$

Cumulative Distribution Function

Definition Let $X, Y$ be random variables defined on the same sample space $S$. The joint cumulative distribution function of $X$ and $Y$ is the map $F_{X, Y}:\mathbb{R}^2\to[0,1]$ defined, for all $a,b\in\mathbb{R}$, by

$$F_{X,Y}(a,b)=P(X\leq a, Y\leq b)$$

Discrete Case

$$F_{X,Y}(a,b) = \sum_{x\leq a}\sum_{y\leq b}P_{X,Y}(x,y)$$

Continuous Case

$$F_{X,Y}(a,b) = \int_{-\infty}^b\int_{-\infty}^a f_{X,Y}(x,y)dxdy$$

$$f_{X,Y}(x,y)=\frac{\partial^2F_{X,Y}}{\partial x\partial y}(x,y)$$

Properties

  • $F(x,y)$ is non-decreasing. That is, as $x$ or $y$ increases, $F(x,y)$ increases or remains constant.
  • $F(x,y)=0$ at the lower left of its range. If the lower left is $(-\infty, -\infty)$, then this means

$$\lim_{(x,y)\to (-\infty,-\infty)}F(x,y)=0$$

  • $F(x,y)=1$ at the upper right of its range.

Independence

Definition The random variables $X$ and $Y$ are independent if, for any two sets $A\subset\mathbb{R}$ and $B\subset \mathbb{R}$, the events $\{X\in A\}$ and $\{Y\in B\}$ are independent.

If $X$ and $Y$ are independent, then

  • Discrete random variables The joint probability mass function is a product of the marginal probability mass functions

$$P_{X,Y}(x,y)=P_X(x)P_Y(y)$$

  • Continuous random variables The joint probability density function is a product of the marginal probability density functions

$$f_{X,Y}(x,y)=f_X(x)f_Y(y)$$

Sums of Independent Random Variables

Discrete Case of 2 Variables For the distribution of $Z=X+Y$ in the discrete case, we have

$$P_Z(z)=\sum_x P_X(x)P_Y(z-x)$$

Continuous Case of 2 Variables For the distribution of $Z=X+Y$ in the continuous case, we have

$$f_Z(z)=\int_{-\infty}^{\infty}f_X(x)f_Y(z-x)dx$$
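A discrete sketch of the convolution formula: the PMF of the sum of two independent fair dice, computed as $P_Z(z)=\sum_x P_X(x)P_Y(z-x)$:

```python
# PMFs of two independent fair dice.
p_x = {k: 1/6 for k in range(1, 7)}
p_y = {k: 1/6 for k in range(1, 7)}

# Convolution: P_Z(z) = sum_x P_X(x) P_Y(z - x)
p_z = {}
for z in range(2, 13):
    p_z[z] = sum(p_x[x] * p_y.get(z - x, 0.0) for x in p_x)

print(p_z[7])                         # 6/36 ~ 0.1667, the most likely total
print(round(sum(p_z.values()), 10))   # 1.0: p_z is a valid PMF
```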

Conditional Distributions

Discrete Case The conditional probability mass function of $X$ given $\{Y=y\}$ is defined as

$$P_{X|Y}(x|y)=\frac{P_{X,Y}(x,y)}{P_Y(y)}, \quad x\in X(S)$$

Properties

  • $0\leq P_{X|Y}(x_i|y)\leq 1$, for all $x_i\in X(S)$.
  • The total probability is $1$, i.e.

$$\sum_{x_i\in X(S)}P_{X|Y}(x_i|y)=1$$

  • Computing probabilities: for any $A\subset X(S)$, we have

$$P(X\in A| Y=y)=\sum_{x\in A}P_{X|Y}(x|y)$$

Continuous Case The conditional probability density function of $X$ given $\{Y=y\}$, for jointly continuous random variables $X$ and $Y$, is defined as

$$f_{X|Y}(x|y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}$$

Properties

  • $0\leq f_{X|Y}(x|y)$
  • The total probability is $1$, i.e.

$$\int_{x\in\mathbb{R}}f_{X|Y}(x|y)dx=1$$

  • Computing probabilities: for any $x_1\leq x_2$, we have

$$P(x_1\leq X\leq x_2\mid Y=y)=\int_{x\in[x_1, x_2]}f_{X|Y}(x|y)dx$$

Joint PDF for functions of random variables

Theorem Let $X_1$ and $X_2$ be random variables with joint density function $f_{X_1,X_2}$, and let $Y_1=g_1(X_1, X_2)$ and $Y_2=g_2(X_1, X_2)$. If:

  • The following equations have unique solutions, say $x_1=h_1(y_1, y_2)$ and $x_2=h_2(y_1, y_2)$:

$$\begin{cases}y_1=g_1(x_1, x_2)\\ y_2=g_2(x_1, x_2)\end{cases}$$

  • $g_1$ and $g_2$ have continuous partial derivatives and $J(x_1, x_2):=\frac{\partial g_1}{\partial x_1}\frac{\partial g_2}{\partial x_2}-\frac{\partial g_1}{\partial x_2}\frac{\partial g_2}{\partial x_1}\neq 0$ at all points $(x_1, x_2)$.

Then $Y_1$ and $Y_2$ are jointly continuous, with density function given by

$$f_{Y_1, Y_2}(y_1, y_2)=f_{X_1, X_2}(x_1, x_2)|J(x_1, x_2)|^{-1}$$

where $x_1=h_1(y_1, y_2)$ and $x_2=h_2(y_1, y_2)$.

Expectations of Functions of 2 Random Variables

Let $g$ be a function of two variables. If $X$ and $Y$ have a joint probability mass function $P_{X, Y}(x,y)$, then

$$E[g(X,Y)]=\sum_x\sum_y g(x,y)P_{X,Y}(x,y)$$

If $X$ and $Y$ have a joint probability density function $f_{X,Y}(x,y)$, then

$$E[g(X,Y)]=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y)f_{X,Y}(x,y)dxdy$$

Linearity of the Expectation

Choosing $g(x,y)=x+y$ and using induction, we can show

$$E[X_1+\cdots+X_n]=\sum_{k=1}^n E[X_k]$$

Choosing $g(x,y)=ax+by$ for $a,b\in\mathbb{R}$:

$$E[aX+bY]=aE[X]+bE[Y]$$

and by induction:

$$E\left[\sum_{i=1}^na_iX_i\right]=\sum_{i=1}^n a_iE[X_i]$$

Covariance and Correlation

Definition

The covariance between $X$ and $Y$ is

$$Cov(X,Y)=E[(X-E[X])(Y-E[Y])]$$

The correlation coefficient is defined as

$$Corr(X, Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$$

Properties of Covariance

  • $Cov(X,X)=Var(X)$
  • $Cov(X,Y)=Cov(Y,X)$
  • $Cov(aX, Y)=aCov(X,Y)=Cov(X, aY)$
  • If $X$ and $Y$ are independent, then $Cov(X,Y)=0$ and $Corr(X,Y)=0$.

Note: The converse is not true; if the covariance is $0$, the variables may not be independent.

Properties of Correlation

  • $Corr(X,Y)$ is the covariance of the standardised versions of $X$ and $Y$.
  • $Corr(X, Y)$ is dimensionless, i.e. it is a ratio.
  • $-1\leq Corr(X,Y)\leq 1$
  • $Corr(X,Y)=1$ if and only if $Y=aX+b$ with $a>0$.
  • $Corr(X,Y)=-1$ if and only if $Y=aX+b$ with $a<0$.

Proposition

$$Cov\left(\sum_{i=1}^nX_i, \sum_{j=1}^mY_j\right)=\sum_{i=1}^n \sum_{j=1}^m Cov(X_i, Y_j)$$

$$Var\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n Var(X_i) + 2\sum_{i<j}Cov(X_i, X_j)$$
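A simulation sketch of the last identity for two correlated random variables: $Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)$. The construction of $X$ and $Y$ below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)     # Y is deliberately correlated with X

cov_xy = np.cov(x, y)[0, 1]
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * cov_xy
corr = cov_xy / np.sqrt(np.var(x) * np.var(y))

print(lhs, rhs)   # the two sides agree up to sampling noise
print(corr)       # ~0.51, and always between -1 and 1
```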

Conditional Expectation

Jointly Discrete Case

For $X$ and $Y$ discrete random variables,

$$E[X|Y=y]=\sum_{k}kP(X=k|Y=y)$$

Jointly Continuous Case

For $X$ and $Y$ jointly continuous random variables,

$$E[X|Y=y]=\int_{\mathbb{R}}xf_{X|Y}(x|y)dx$$

Note: $E[X|Y]$ is a random variable, since its value depends on the realisation of $Y$. The same holds for $Var(X|Y)$.

Tower Property

$$E[X]=E[E[X|Y]]$$

$$Var(E(Y|X))+E(Var(Y|X))=Var(Y)$$
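A small simulation check of the tower property, with an arbitrary two-stage experiment: first draw $Y$ uniformly from $\{1,\dots,6\}$, then, given $Y=y$, draw $X$ uniformly from $\{1,\dots,y\}$. Averaging the conditional means $E[X\mid Y]$ over $Y$ reproduces $E[X]$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
y = rng.integers(1, 7, size=n)                   # Y ~ Uniform{1,...,6}
x = np.floor(rng.uniform(0, y)).astype(int) + 1  # X | Y=y ~ Uniform{1,...,y}

e_x = x.mean()                                   # E[X] from simulation
e_cond = (y + 1) / 2                             # E[X | Y=y] = (y + 1)/2
print(e_x, e_cond.mean())                        # both ~ 2.25 = E[E[X|Y]]
```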

Large Numbers

Markov's and Chebyshev's Inequality

Markov's Inequality Let $X$ be a random variable with $X\geq 0$; then for all $a>0$ we have

$$P(X\geq a)\leq \frac{E[X]}{a}$$

  • Use a bit of information about a distribution to learn something about probabilities of extreme values.
  • If $X\geq0$ and $E[X]$ is small, then $X$ is unlikely to be very large.

Chebyshev's Inequality Let $X$ be a random variable with $\mu=E[X]< \infty$ and $\sigma^2=Var(X)<\infty$; then for all $k>0$ we have

$$P(|X-\mu|\geq k)\leq \frac{\sigma^2}{k^2}$$

  • Random variable $X$, with finite mean $\mu$ and variance $\sigma^2$.
  • If the variance is small, then $X$ is unlikely to be too far from the mean.
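A simulation sketch checking both inequalities on an exponential random variable with mean 1 (so $\mu=1$, $\sigma^2=1$); the thresholds $a$ and $k$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0, E[X] = 1, Var(X) = 1

a = 5.0
print((x >= a).mean(), 1.0 / a)                  # Markov: P(X >= a) <= E[X]/a

k = 3.0
print((np.abs(x - 1.0) >= k).mean(), 1.0 / k**2) # Chebyshev: P(|X - mu| >= k) <= sigma^2/k^2
```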

Weak Law of Large Numbers

  • Intuitively, an expectation can be thought of as the average of the outcomes over an infinite repetition of the same experiment.
  • If so, the observed average in a finite number of repetitions (which is called the sample mean) should approach the expectation, as the number of repetitions increases.
  • This is a vague statement, which is made more precise by so-called laws of large numbers.

Theorem Let $X_1, X_2, \cdots$ be a sequence of independent and identically distributed random variables, each having a finite mean $E[X_i]=\mu$. Then for all $\epsilon>0$ we have

$$P\left(\left|\frac{X_1+\cdots+X_n}{n}-\mu\right|\geq \epsilon\right)\to 0 \text{ as } n\to\infty$$

Interpretation

  • One experiment:
    • many measurements, $X_i=\mu+W_i$
    • $W_i$: measurement noise. $E[W_i]=0$, independent $W_i$.
    • Sample mean $M_n$ is unlikely to be far off from the true mean $\mu$.
  • Many independent repetitions of the same experiment:
    • Event $A$, with $P=P(A)$.
    • $X_i$ is the indicator of event $A$.
    • The sample mean $M_n$ is the empirical frequency of event $A$.
    • Empirical frequency is unlikely to be far off from the true probability $P$.

Central Limit Theorem

Let $X_1,X_2,\cdots$ be a sequence of independent and identically distributed random variables, each having a finite mean $E[X_i]=\mu$ and variance $\sigma^2$. Then for $a\in\mathbb{R}$:

$$P\left(\frac{X_1+\cdots+X_n-n\mu}{\sigma\sqrt{n}}\leq a\right)\to \frac{1}{\sqrt{2\pi}}\int_{-\infty}^a e^{-\frac{x^2}{2}}dx \text{ as }n\to\infty$$
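A simulation sketch of the CLT: standardised sums $Z_n$ of i.i.d. uniform variables (an arbitrary non-normal choice, with $\mu=0.5$ and $\sigma^2=1/12$) already have a distribution close to $\Phi$ for moderate $n$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)
n, reps = 50, 200_000
mu, sigma = 0.5, sqrt(1 / 12)                    # mean and std of Uniform(0, 1)

samples = rng.uniform(0, 1, size=(reps, n))
z = (samples.sum(axis=1) - n * mu) / (sigma * sqrt(n))   # standardised sums Z_n

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))     # standard normal CDF
for a in [-1.0, 0.0, 1.5]:
    print(a, (z <= a).mean(), Phi(a))            # empirical vs. normal CDF
```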

Different Scalings of the Sums of Random Variables

  • $X_1, \cdots, X_n$: i.i.d. random variables with finite mean $\mu$ and variance $\sigma^2$.
  • $S_n=X_1+\cdots+X_n$ with variance $n\sigma^2$.
  • $\frac{S_n}{\sqrt{n}}=\frac{X_1+\cdots+X_n}{\sqrt{n}}$ with variance $\sigma^2$.
  • $Z_n=\frac{S_n-\mu n}{\sigma\sqrt{n}} = \frac{X_1+\cdots+X_n-\mu n}{\sigma\sqrt{n}}$ with variance $1$ and mean $0$.

Usefulness of CLT

  • Universal and easy to apply, only means and variances matter.
  • Fairly accurate computational shortcut.
  • Justification of normal models.

Strong Law of Large Numbers

The strong law of large numbers establishes almost sure convergence.

Let $X_1,X_2,\cdots$ be a sequence of independent and identically distributed random variables, each having a finite mean $E[X_i]=\mu$. Then, with probability 1, we have

$$\frac{X_1+\cdots+X_n}{n}\to \mu \text{ as }n\to\infty$$