Suppose that r experiments are to be performed. If experiment 1 has n1 possible outcomes, for each outcome of experiment 1 experiment 2 has n2 possible outcomes, and so on for each subsequent experiment, then there are a total of
n1×n2×⋯×nr
possible outcomes for the r experiments.
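For example, if a meal consists of one of 3 appetizers, one of 4 mains and one of 2 desserts, there are 3×4×2 = 24 possible meals.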
Addition Principle
Suppose that the set S is partitioned into pairwise disjoint parts S1,S2,⋯,Sn. The number of objects in S can be determined by finding the number of objects in each of the parts, and adding the numbers so obtained.
Subtraction Principle
Let S be a set and U be a larger set containing S. We denote by A the set containing all elements in U that are not in S. Then
∣S∣=∣U∣−∣A∣
Division Principle
Let S be a finite set that is partitioned into k parts in such a way that each part contains the same number of objects. Then the number of parts in the partition is given by the rule k = ∣S∣/r, where r is the common number of objects in each part.
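For example, the number of ways to seat n people around a circular table, where two seatings that differ by a rotation are considered the same, is n!/n = (n−1)!: each circular arrangement corresponds to exactly n of the n! linear orderings.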
Permutations are ordered arrangements of elements of a set. With n objects, there are n! permutations (assuming every object is distinguishable from all the others).
If, among the n objects, n1 are alike, n2 are alike, ⋯, and nr are alike, then there are
n!/(n1!×⋯×nr!)
permutations of n objects. Necessarily we have ∑knk≤n.
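For example, the word MISSISSIPPI has 11 letters, of which 4 are I, 4 are S, 2 are P and 1 is M, so it has 11!/(4!×4!×2!×1!) = 34650 distinct rearrangements.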
Combinations
The number of different groups of k objects that can be formed from a total of n objects is given by the choose function:
\binom{n}{k} = n!/(k!(n−k)!) for 0≤k≤n, and \binom{n}{k} = 0 for k<0 or k>n
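For example, the number of ways to choose a committee of 2 from 5 people is \binom{5}{2} = 5!/(2!×3!) = 10.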
Multinomial Coefficients
Let n1,⋯,nk∈N, where ∑ni=n. The number of possible partitions of n objects into k distinct groups of sizes n1,n2,⋯,nk is given by the multinomial coefficient:
n!/(n1!×n2!×⋯×nk!)
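For example, 10 players can be split into a starting group of 5, a second group of 3 and a bench of 2 in 10!/(5!×3!×2!) = 2520 ways.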
Let A,B⊂S be two events, such that P(A)>0. The conditional probability of B given A is defined by
P(B∣A) = P(A∩B)/P(A)
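For example, when rolling a fair die, let A = {the outcome is even} and B = {the outcome is 2}; then P(B∣A) = P(A∩B)/P(A) = (1/6)/(1/2) = 1/3.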
Proposition
We fix an event A with P(A)>0. The function P(⋅∣A), defined on the events of S and taking values in [0,1], which associates to any event B⊂S the quantity P(B∣A), is a probability on the same sample space S.
The multiplication rule: We assume that A1,⋯,An are events such that P(∩kAk)>0, then
P(A1∩⋯∩An)=P(A1)P(A2∣A1)P(A3∣A1∩A2)⋯P(An∣A1∩⋯∩An−1)
Bayes Formula:
P(A∣B) = P(B∣A)P(A)/P(B)
Another form of Bayes: If A and B are two events, then
P(B)=P(B∣A)P(A)+P(B∣Ac)(1−P(A))
Total Probability Law (extended Bayes Formula) The Bayes formula can be generalised as follows. We assume that A1,⋯An are mutually disjoint events such that ∪k=1nAk=S (such a sequence is called a partition of S). Then we have
P(B)=∑k=1nP(B∣Ak)P(Ak)
Another form of TPL: suppose the sample space S is divided into 3 disjoint events B1,B2,B3. Then for any event A we have:
P(A) = P(A∣B1)P(B1) + P(A∣B2)P(B2) + P(A∣B3)P(B3)
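The two formulas can be combined numerically; the following sketch uses made-up values for a prevalence P(A), a detection rate P(B∣A) and a false-positive rate P(B∣Ac), computes P(B) by the total probability law and then P(A∣B) by Bayes formula.

```python
# Hypothetical numbers: a test for a condition with 1% prevalence.
p_A = 0.01           # P(A): person has the condition
p_B_given_A = 0.95   # P(B | A): test is positive given the condition
p_B_given_Ac = 0.02  # P(B | A^c): false positive rate

# Total probability law: P(B) = P(B|A)P(A) + P(B|A^c)(1 - P(A))
p_B = p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)

# Bayes formula: P(A | B) = P(B|A)P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B

print(f"P(B)   = {p_B:.4f}")          # ~0.0293
print(f"P(A|B) = {p_A_given_B:.4f}")  # ~0.3242
```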
Events A and B are independent if the probability that one occurred is not affected by knowledge that the other occurred. Formally, if P(A) and P(B) are not 0,
P(A∣B)=P(A) and P(B∣A)=P(B), or equivalently P(A∩B)=P(A)P(B)
Proposition If E and F are independent, so are E and Fc.
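Sketch: P(E∩Fc) = P(E) − P(E∩F) = P(E) − P(E)P(F) = P(E)(1−P(F)) = P(E)P(Fc), so E and Fc satisfy the product characterisation of independence.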
Bernoulli Distribution X is 1 on success and 0 on failure, with success and failure defined by the context.
Binomial Distribution, denoted by Bin(n,p), models the number of successes in n independent Bernoulli(p) trials.
Hypergeometric Distribution A hypergeometric random variable X with parameters n,N,m counts the number of elements having a specified attribute in a random sample of size n, taken without replacement from a population of size N in which exactly m members possess the attribute. We denote X∼H(n,N,m).
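These PMFs can be evaluated with scipy.stats; the parameter values below are arbitrary and chosen only to illustrate the calls.

```python
from scipy import stats

print(stats.bernoulli.pmf(1, 0.3))      # P(Bernoulli(0.3) = 1)
print(stats.binom.pmf(3, 10, 0.3))      # P(Bin(10, 0.3) = 3)

# Hypergeometric H(n=10, N=50, m=12); scipy's hypergeom.pmf takes
# (k, population size, marked in population, sample size).
print(stats.hypergeom.pmf(3, 50, 12, 10))  # P(H(10, 50, 12) = 3)
```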
Geometric Distribution models the number of Bernoulli trials needed to obtain the first success, e.g. the number of coin flips up to and including the first head.
Memoryless Property of Geometric Distribution A random variable X∼G(p) has the lack-of-memory property, i.e.
P(X=n+k∣X>n)=P(X=k),∀n,k∈N
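A minimal numerical check of this identity, under the convention that X counts the number of trials up to and including the first success, so P(X=k)=(1−p)^(k−1)p; the values of p, n and k are arbitrary.

```python
p = 0.3

def pmf(k):           # P(X = k) for X ~ G(p), k = 1, 2, ...
    return (1 - p) ** (k - 1) * p

def sf(n):            # P(X > n)
    return (1 - p) ** n

n, k = 4, 3
lhs = pmf(n + k) / sf(n)   # P(X = n + k | X > n)
rhs = pmf(k)               # P(X = k)
print(lhs, rhs)            # both equal (1 - p)^(k-1) * p = 0.147
```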
Poisson Distribution A discrete random variable X is a Poisson random variable with parameter λ>0 if its probability mass function is given by
PX(x) = e^{−λ} λ^x / x! for x=0,1,⋯, and PX(x)=0 otherwise
The function PX is indeed a probability mass function: it is non-negative and, by the Taylor expansion of e^λ, it sums to 1 over x=0,1,⋯, so it satisfies all the axioms.
The Poisson distribution is considered to be the law of rare events. Poisson random variables often arise in the modelling of the frequency of occurrence of a certain event during a specified period of time.
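The law-of-rare-events interpretation can be checked numerically: for large n and small p with np=λ fixed, Bin(n,p) probabilities are close to Poisson(λ) probabilities. A small sketch with λ=2 (an arbitrary choice):

```python
from math import comb, exp, factorial

lam, n = 2.0, 10_000
p = lam / n

for x in range(5):
    binom = comb(n, x) * p**x * (1 - p) ** (n - x)   # Bin(n, p) PMF
    poisson = exp(-lam) * lam**x / factorial(x)      # Poisson(lam) PMF
    print(x, round(binom, 6), round(poisson, 6))     # values nearly coincide
```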
Definition A random variable X is called a continuous random variable if P(X=x)=0 for all x∈R.
Definition Let X be a random variable, then the cumulative distribution function of X is the map FX:R→R defined by
FX(a)=P(X≤a)
for all a∈R.
Proposition Let X be a random variable, then FX satisfies the following properties:
FX is non-decreasing.
FX is right continuous, i.e. FX(x+):=lim_{t→x⁺}FX(t)=FX(x).
FX(−∞):=lima→−∞FX(a)=0.
FX(∞):=lima→∞FX(a)=1.
Note that, while being continuous from the right, a CDF FX is not left-continuous in general. In particular, the left limit of FX at a can be defined as
FX(a−):=lim_{t→a⁻}FX(t)=P(X<a)
Propositions For all a,b∈R, a<b, the following hold
P(a<X≤b)=FX(b)−FX(a)
P(a≤X≤b)=FX(b)−FX(a−)
P(a<X<b)=FX(b−)−FX(a)
P(a≤X<b)=FX(b−)−FX(a−)
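For example, if X is the outcome of a fair die, then P(2≤X≤4) = FX(4) − FX(2−) = 4/6 − 1/6 = 1/2, while P(2<X≤4) = FX(4) − FX(2) = 4/6 − 2/6 = 1/3.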
Proposition A random variable is continuous if and only if its CDF is a continuous function.
If there exists a non-negative function fX:R→R such that, for all a,b∈R with a<b,
P(a≤X≤b)=∫_{a}^{b}fX(x)dx,
then fX is called a Probability Density Function (PDF) of X.
If the distribution of X has a PDF, we say that X is an absolutely continuous random variable.
Remarks The density fX(x) of an absolutely continuous random variable is the analogue of the probability mass function pX(x) of a discrete random variable. The important differences are
Unlike pX(x), the PDF fX(x) is not a probability. You have to integrate it to get a probability.
Since fX(x) is not a probability, there is no restriction that fX(x) be less than or equal to 1.
Proposition A continuous random variable X has a PDF if and only if there exists a non-negative function f:R→R such that
FX(a)=∫−∞af(x)dx,∀a∈R
In this case, f is a PDF of X. Moreover, at every x∈R where f is continuous, f(x)=FX′(x).
Procedure Given a random variable X, there is a procedure for finding a PDF.
Check for continuity of FX. If it holds, proceed with the next step.
Check at which points FX′ exists.
If FX′ exists and is continuous outside a finite or countable subset of the real line, that is, if R∖A is at most countable, where A={x∈R : FX′(x) exists and is continuous}, then define fX(x)=FX′(x) for all x∈A (and fX(x)=0, say, for x∉A).
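For example, if X is uniform on [0,1], then FX(a)=0 for a<0, FX(a)=a for 0≤a≤1 and FX(a)=1 for a>1. FX is continuous, and FX′ exists and is continuous outside the finite set {0,1}, so a PDF is fX(x)=1 for x∈(0,1) and fX(x)=0 otherwise.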
Proposition Once we have a PDF of a random variable, we are able to compute any probability depending on that random variable. Let X be an absolutely continuous random variable, then
P(X∈A)=∫Af(x)dx,∀A⊂R
Remark Let f be a PDF, then we have
∫_{−∞}^{∞}f(x)dx=1
If f:R→R is a non-negative function satisfying the equality above, then f is a PDF. We can therefore construct a PDF starting from any real map that does not change sign and has a finite non-zero integral on R.
Let g:R→R such that g does not change sign and
∫Rg(x)dx=c∈R−{0}
then we can define a function f:R→R by f(x)=g(x)/c for all x∈R; f is then a PDF.
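For example, g(x)=x for x∈[0,2] and g(x)=0 otherwise does not change sign and ∫Rg(x)dx=2, so f(x)=g(x)/2, i.e. f(x)=x/2 on [0,2] and 0 otherwise, is a PDF.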
Exponential Distribution Exponential random variables are the continuous analogue of geometric random variables. They are also related to Poisson random variables, in that they describe the time elapsed between consecutive occurrences of a certain event.
Memoryless Suppose that the probability that a taxi arrives within the first five minutes is p. If I wait 5 minutes and in fact no taxi arrives, then the probability that a taxi arrives within the next 5 minutes is still p.
The memorylessness of the exponential distribution is analogous to the memorylessness of the discrete geometric distribution, where having flipped 5 tails in a row gives no information about the next 5 flips. Indeed, the exponential distribution is precisely the continuous counterpart of the geometric distribution, which models the waiting time for a discrete process to change state.
Proposition A positive continuous random variable X has the lack-of-memory property, i.e.
P(X>s+t ∣ X>s) = P(X>t), ∀s,t≥0,
if and only if X is exponentially distributed.
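A quick Monte Carlo sanity check of this property; the rate λ, the times s and t, and the sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, s, t = 1.5, 0.4, 0.7
x = rng.exponential(scale=1 / lam, size=1_000_000)

lhs = np.mean(x[x > s] > s + t)   # estimate of P(X > s + t | X > s)
rhs = np.mean(x > t)              # estimate of P(X > t)
exact = np.exp(-lam * t)          # theoretical value e^{-lambda * t}
print(lhs, rhs, exact)            # the three values nearly agree
```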
Let Sn denote the number of successes that occur when n independent trials, each resulting in a success with probability p, are performed. Then for any a<b,
P(a ≤ (Sn−np)/√(np(1−p)) ≤ b) → (1/√(2π)) ∫_{a}^{b} e^{−x²/2} dx as n→∞.
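A short sketch comparing the exact binomial probability with the limiting normal integral; n, p, a and b are arbitrary illustrative values, and the standard normal CDF is computed via the error function.

```python
from math import comb, erf, sqrt

def Phi(x):                      # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p = 100, 0.4
mu, sigma = n * p, sqrt(n * p * (1 - p))
a, b = -1.0, 1.0                 # interval for the standardized variable

# Exact: P(a <= (S_n - np)/sigma <= b) for S_n ~ Bin(n, p)
lo, hi = mu + a * sigma, mu + b * sigma
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k)
            for k in range(n + 1) if lo <= k <= hi)

approx = Phi(b) - Phi(a)         # normal approximation
print(exact, approx)
```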
Discrete Joint PMF Suppose X takes values {x1,x2,⋯,xn} and Y takes values {y1,y2,⋯,ym}. The ordered pair (X,Y) takes values in the product {(x1,y1),(x1,y2),⋯(xn,ym)}. The joint probability mass function (joint PMF) of X and Y is the function PX,Y(xi,yj) giving the probability of the joint outcome {X=xi,Y=yj}. We can organise this in a joint probability table.
Properties The joint probability mass function (PMF) satisfies the following:
0≤PX,Y(xi,yj)≤1.
The total probability is 1, ∑i=1n∑j=1mPX,Y(xi,yj)=1.
Computing Probabilities: for any A⊂{1,⋯,n}×{1,⋯,m}, we have P((X,Y)∈A)=∑(x,y)∈APX,Y(x,y).
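A joint PMF can be stored as a table and the properties above checked directly; the table below is a made-up example for two random variables taking three values each.

```python
import numpy as np

# Made-up joint PMF: rows index the values of X, columns the values of Y.
P = np.array([[0.10, 0.05, 0.15],
              [0.20, 0.10, 0.10],
              [0.05, 0.15, 0.10]])

assert np.isclose(P.sum(), 1.0)   # total probability is 1

# P((X, Y) in A) by summing the selected entries, here for A = {(i, j): i <= j}.
prob_A = sum(P[i, j] for i in range(3) for j in range(3) if i <= j)
print(prob_A)
```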
Continuous Joint Distributions Suppose X takes values in [a,b] and Y takes values in [c,d]; then (X,Y) takes values in the rectangle [a,b]×[c,d]. The joint PDF is denoted fX,Y(x,y).
Properties The joint probability density function (PDF) satisfies the following:
0≤fX,Y(x,y).
Total probability is 1, ∫y∈R∫x∈RfX,Y(x,y)dxdy=1.
Computing probabilities: for any x1≤x2 and y1≤y2, we calculate the probability by
P(x1≤X≤x2, y1≤Y≤y2) = ∫_{y1}^{y2}∫_{x1}^{x2} fX,Y(x,y) dx dy
Definition Let X,Y be random variables defined on the same sample space S, the joint cumulative distribution function of X and Y is the map FX,Y:R2→[0,1] defined by: for all a,b∈R,
FX,Y(a,b)=P(X≤a,Y≤b)
Discrete Case
FX,Y(a,b)=∑x≤a∑y≤bPX,Y(x,y)
Continuous Case
FX,Y(a,b)=∫−∞b∫−∞afX,Y(x,y)dxdy
fX,Y(x,y) = ∂²FX,Y(x,y)/∂x∂y
Properties
F(x,y) is non-decreasing. That is, as x or y increases F(x,y) increases or remains constant.
F(x,y)=0 at the lower left of its range. If the lower left is (−∞,−∞), then this means F(x,y)→0 as x→−∞ or as y→−∞.
Intuitively, an expectation can be thought of as the average of the outcomes over an infinite repetition of the same experiment.
If so, the observed average in a finite number of repetitions (which is called the sample mean) should approach the expectation, as the number of repetitions increases.
This is a vague statement, which is made more precise by so-called laws of large numbers.
Weak Law of Large Numbers Let X1,X2,⋯ be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi]=μ. Then for all ϵ>0, we have
P(∣(X1+⋯+Xn)/n − μ∣ > ϵ) → 0 as n→∞.
Central Limit Theorem (CLT) Let X1,X2,⋯ be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi]=μ and variance σ2. Then for all a∈R:
P((X1+⋯+Xn − nμ)/(σ√n) ≤ a) → (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx as n→∞
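A quick simulation of the statement: the empirical CDF of the standardized sums (X1+⋯+Xn−nμ)/(σ√n) for i.i.d. Uniform(0,1) random variables is compared with the standard normal CDF at a few points; the choice of distribution, n and the number of repetitions are arbitrary.

```python
import numpy as np
from math import erf, sqrt

def Phi(a):                        # standard normal CDF
    return 0.5 * (1 + erf(a / sqrt(2)))

rng = np.random.default_rng(1)
n, reps = 50, 200_000
mu, sigma = 0.5, sqrt(1 / 12)      # mean and std of Uniform(0, 1)

x = rng.random((reps, n))                         # reps samples of (X_1, ..., X_n)
z = (x.sum(axis=1) - n * mu) / (sigma * sqrt(n))  # standardized sums

for a in (-1.0, 0.0, 1.0):
    print(a, np.mean(z <= a), Phi(a))             # empirical vs normal CDF
```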
Different Scalings of the Sums of Random Variables
X1,⋯,Xn i.i.d. random variables with finite mean μ and variance σ2.
Sn=X1+⋯+Xn with the variance nσ2.
Sn/n = (X1+⋯+Xn)/n with the variance σ2/n.
Zn = (Sn − nμ)/(σ√n) = (X1+⋯+Xn − nμ)/(σ√n) with the variance 1 and the mean 0.
Usefulness of CLT
Universal and easy to apply, only means and variances matter.
The strong law of large numbers establishes almost sure convergence.
Let X1,X2,⋯ be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi]=μ. Then, with probability 1, we have
(X1+⋯+Xn)/n → μ as n→∞.
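A short simulation of the statement: running sample means of i.i.d. exponential random variables with mean μ=2 (an arbitrary choice) settle towards μ as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 2.0
x = rng.exponential(scale=mu, size=100_000)   # i.i.d. with mean mu

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])             # approaches mu = 2.0
```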