Sunday, March 30, 2014
Normal Distribution
|
\begin{aligned}
X &\sim N(\mu,\sigma^2) \\
f_X(x) &= \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}
\end{aligned}
|
In fact, any PDF of the following form with α > 0 is a normal distribution:
|
\begin{aligned}
f_X(x) = c \cdot e^{-(\alpha x^2 + \beta x + \gamma)}
\end{aligned}
|
Note
f_X(x) is at its peak when x = μ. It's therefore not difficult to show (by minimizing the exponent, i.e. completing the square) that:
|
\begin{aligned}
\mu = -\frac{\beta}{2\alpha} \qquad \sigma^2 = \frac{1}{2\alpha}
\end{aligned}
|
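A quick numerical check of this claim, as a sketch (the coefficients α, β, γ are made up; scipy is assumed available): normalize e^{−(αx²+βx+γ)} and compare it against the N(−β/2α, 1/(2α)) density.

```python
import numpy as np
from scipy import integrate, stats

# Arbitrary exponent coefficients with alpha > 0 (made-up values).
alpha, beta, gamma = 2.0, -3.0, 1.0

def g(x):
    # Unnormalized PDF of the form e^{-(alpha x^2 + beta x + gamma)}.
    return np.exp(-(alpha * x**2 + beta * x + gamma))

# Normalizing constant: divide by the integral of the unnormalized density.
Z, _ = integrate.quad(g, -np.inf, np.inf)

# Parameters predicted by completing the square.
mu = -beta / (2 * alpha)
sigma = np.sqrt(1 / (2 * alpha))

xs = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 7)
print(np.allclose(g(xs) / Z, stats.norm.pdf(xs, mu, sigma)))  # True
```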
Posterior probability:
|
\begin{aligned}
f_{\Theta|X}(\theta\,|\,x) &= \frac{f_\Theta(\theta) \cdot f_{X|\Theta}(x\,|\,\theta)}{f_X(x)} \\
f_X(x) &= \int f_\Theta(\theta) \cdot f_{X|\Theta}(x\,|\,\theta)\,d\theta \\
f_\Theta(\theta) &: \text{prior distribution} \\
f_{\Theta|X}(\theta\,|\,x) &: \text{posterior distribution, output of Bayesian inference}
\end{aligned}
|
Given the prior distribution of Θ and some observations of X, we can express the posterior distribution of Θ (as a function of θ) and do a point estimation. The normal distribution is particularly nice because the point estimation reduces to finding the peak of the posterior distribution, which in turn means finding the minimum of the exponent, a quadratic function of θ, via differentiation.
Estimate with single observation
|
\begin{aligned}
X &= \Theta + W \qquad \Theta, W \sim N(0,1) \text{ independent} \\
\hat{\Theta}_{MAP} &= \hat{\Theta}_{LMS} = \mathbb{E}[\Theta\,|\,X] = \frac{X}{2} \\
\mathbb{E}[(\Theta-\hat{\Theta})^2\,|\,X=x] &= \frac{1}{2} \\
\hat{\Theta} &: \text{(point) estimator, a random variable} \\
\hat{\theta} &: \text{estimate, a number} \\
\text{MAP} &: \text{Maximum a posteriori probability} \\
\text{LMS} &: \text{Least Mean Squares}
\end{aligned}
|
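A brute-force check of this result, as a sketch (the observed value and grid bounds are arbitrary): compute the posterior numerically and confirm that the peak and mean both sit at x/2, with conditional variance 1/2.

```python
import numpy as np
from scipy import stats

x = 1.7                             # an arbitrary observed value
theta = np.linspace(-6, 6, 20001)   # grid over the parameter
d = theta[1] - theta[0]

# Posterior ~ prior N(0,1) times likelihood N(theta,1), normalized on the grid.
post = stats.norm.pdf(theta, 0, 1) * stats.norm.pdf(x, theta, 1)
post /= post.sum() * d

map_est = theta[np.argmax(post)]                      # posterior peak (MAP)
mean_est = (theta * post).sum() * d                   # posterior mean (LMS)
var_est = ((theta - mean_est)**2 * post).sum() * d    # conditional variance

print(map_est, mean_est, var_est)   # ≈ x/2, x/2, 1/2
```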
Estimate with multiple observations
|
\begin{aligned}
X_i &= \Theta + W_i, \quad i = 1, \cdots, n \\
\Theta &\sim N(x_0,\sigma_0^2) \qquad W_i \sim N(0,\sigma_i^2) \qquad \Theta, W_1, \cdots, W_n \text{ independent} \\
f_{\Theta|X}(\theta\,|\,x) &= c \cdot e^{-\text{quad}(\theta)} \\
\text{quad}(\theta) &= \frac{(\theta-x_0)^2}{2\sigma_0^2} + \frac{(\theta-x_1)^2}{2\sigma_1^2} + \cdots + \frac{(\theta-x_n)^2}{2\sigma_n^2} \\
\hat{\Theta}_{MAP} &= \hat{\Theta}_{LMS} = \mathbb{E}[\Theta\,|\,X] = \frac{\sum_{i=0}^n x_i/\sigma_i^2}{\sum_{i=0}^n 1/\sigma_i^2} \\
\mathbb{E}[(\Theta-\hat{\Theta})^2] &= \mathbb{E}[(\Theta-\hat{\Theta})^2\,|\,X=x] = \text{var}(\Theta\,|\,X=x) = \frac{1}{\sum_{i=0}^n 1/\sigma_i^2} & \text{mean squared error}
\end{aligned}
|
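The estimator is just a precision-weighted average, so it is one line of numpy; a sketch with made-up values, where index 0 carries the prior's mean x_0 and standard deviation σ_0:

```python
import numpy as np

# Index 0 plays the role of the prior; the rest are observations (made up).
x = np.array([0.0, 1.2, 0.8, 1.5])      # x_0 .. x_n
sigma = np.array([1.0, 0.5, 0.5, 2.0])  # sigma_0 .. sigma_n

w = 1 / sigma**2                     # precision weights
theta_hat = (w * x).sum() / w.sum()  # MAP = LMS estimate
mse = 1 / w.sum()                    # conditional variance = mean squared error

print(theta_hat, mse)
```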
Θ as an m-dimensional vector with n observations
|
\begin{aligned}
f_{\Theta|X}(\theta\,|\,x) &= \frac{1}{f_X(x)} \prod_{j=1}^m f_{\Theta_j}(\theta_j) \prod_{i=1}^n f_{X_i|\Theta}(x_i\,|\,\theta) & \text{posterior distribution}
\end{aligned}
|
As with the normal distribution above, we can then differentiate quad(θ) with respect to each θj and set the derivatives to zero, solving m linear equations with m unknowns for the point estimate of Θ.
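A sketch of the mechanics, assuming a hypothetical linear model X_i = a_i · Θ + W_i with Θ_j, W_i ∼ N(0,1) all independent (the matrix A and data below are made up for illustration): setting the m partial derivatives of quad(θ) to zero yields the linear system (I + AᵀA)θ = Aᵀx.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 10                      # dimension of Theta, number of observations

A = rng.normal(size=(n, m))       # hypothetical coefficients a_i (made up)
theta_true = rng.normal(size=m)
x = A @ theta_true + rng.normal(size=n)   # X_i = a_i . Theta + W_i

# quad(theta) = |theta|^2/2 + |x - A theta|^2/2; its gradient vanishes at
# the solution of the m-by-m linear system (I + A^T A) theta = A^T x.
theta_map = np.linalg.solve(np.eye(m) + A.T @ A, A.T @ x)
print(theta_map)
```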
Source: MITx 6.041x, Lecture 15.
Independence
Probabilistic models that do not interact with each other and have no common sources of uncertainty.
|
\begin{aligned}
\mathbb{P}(A \cap B) &= \mathbb{P}(A) \cdot \mathbb{P}(B) & \text{iff } A \text{ and } B \text{ are independent} \\
p_{X|A}(x) &= p_X(x) & \text{for all } x \text{ iff } X \text{ and } A \text{ are independent} \\
p_{X,Y}(x,y) &= p_X(x) \cdot p_Y(y) & \text{for all } x, y \text{ iff } X \text{ and } Y \text{ are independent} \\
p_{X,Y,Z}(x,y,z) &= p_X(x) \cdot p_Y(y) \cdot p_Z(z) & \text{for all } x, y, z \text{ iff } X, Y \text{ and } Z \text{ are independent}
\end{aligned}
|
Note it's always true that
|
\begin{aligned}
f_{X,Y}(x,y) &= f_{X|Y}(x\,|\,y) \cdot f_Y(y) & \text{by conditional probability}
\end{aligned}
|
But
|
\begin{aligned}
f_{X|Y}(x\,|\,y) \cdot f_Y(y) &= f_X(x) \cdot f_Y(y) & \text{for all } x, y \text{ iff } X, Y \text{ are independent}
\end{aligned}
|
Expectation
In general,
|
\begin{aligned}
\mathbb{E}[g(X,Y)] &\ne g(\mathbb{E}[X], \mathbb{E}[Y]) & \text{e.g. } \mathbb{E}[XY] \ne \mathbb{E}[X]\,\mathbb{E}[Y]
\end{aligned}
|
It's however always true that
|
\begin{aligned}
\mathbb{E}[aX+b] &= a\,\mathbb{E}[X] + b & \text{linearity of expectation}
\end{aligned}
|
But if X and Y are independent, then
|
\begin{aligned}
\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] \quad \text{and} \quad \mathbb{E}[g(X)\,h(Y)] = \mathbb{E}[g(X)]\,\mathbb{E}[h(Y)]
\end{aligned}
|
Variance
In general,
|
\begin{aligned}
\text{var}(X+Y) \ne \text{var}(X) + \text{var}(Y)
\end{aligned}
|
It's however always true that
|
\begin{aligned}
\text{var}(aX) = a^2\,\text{var}(X) \quad \text{and} \quad \text{var}(X+a) = \text{var}(X)
\end{aligned}
|
But if X and Y are independent, then
|
\begin{aligned}
\text{var}(X+Y) = \text{var}(X) + \text{var}(Y)
\end{aligned}
|
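A Monte Carlo illustration of both independence identities, as a sketch (the distributions and sample size are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(1.0, 2.0, size=1_000_000)
Y = rng.exponential(3.0, size=1_000_000)  # generated independently of X

print((X * Y).mean(), X.mean() * Y.mean())  # ≈ equal: E[XY] = E[X]E[Y]
print((X + Y).var(), X.var() + Y.var())     # ≈ equal: variances add

Z = X + 0.5 * Y                             # NOT independent of X
print((X * Z).mean(), X.mean() * Z.mean())  # clearly differ
```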
Source: MITx 6.041x, Lecture 7.
Thursday, March 27, 2014
Random Variables
Uniform from a to b
Discrete:
|
\begin{aligned}
p_X(x) &= \frac{1}{b-a+1} \\
\mathbb{E}[X] &= \frac{a+b}{2} \qquad \text{var}(X) = \frac{1}{12}(b-a)(b-a+2) \\
\mathbb{P}(a \le x \le b) &= \sum_{a \le x \le b} p_X(x)
\end{aligned}
|
Continuous:
|
\begin{aligned}
f_X(x) &= \frac{1}{b-a} \\
\mathbb{E}[X] &= \frac{a+b}{2} \qquad \text{var}(X) = \frac{(b-a)^2}{12} \\
\mathbb{P}(a \le x \le b) &= \int_a^b f_X(x)\,dx
\end{aligned}
|
Bernoulli with parameter p \in [0, 1]
|
\begin{aligned}
p_X(0) &= 1-p \qquad p_X(1) = p \\
\mathbb{E}[X] &= p \qquad \text{var}(X) = p - p^2 \le \frac{1}{4} \quad \text{(max variance)}
\end{aligned}
|
Binomial with parameter p \in [0, 1]
Model number of successes (k) in a given number of independent trials (n):
|
\begin{aligned}
p_X(k) &= \mathbb{P}(X=k) = {n \choose k}p^k(1-p)^{n-k} \\
\mathbb{E}[X] &= n\cdot \color{blue}{p} \quad var(X) = n\cdot\color{blue}{(p - p^2)} \\
\end{aligned}
|
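The mean and variance formulas can be checked directly against the PMF; a sketch using scipy.stats (n and p are arbitrary):

```python
import numpy as np
from scipy import stats

n, p = 20, 0.3
k = np.arange(n + 1)
pmf = stats.binom.pmf(k, n, p)   # P(X = k) for k = 0..n

mean = (k * pmf).sum()
var = ((k - mean)**2 * pmf).sum()
print(mean, n * p)               # 6.0  6.0
print(var, n * (p - p**2))       # 4.2  4.2
```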
Poisson with parameter \lambda > 0
Limit of the binomial with large n, small p, and moderate \lambda = np, which is the arrival rate.
Model number of arrivals S:
|
\begin{aligned}
p_S(k) &\to \frac{\lambda^k}{k!}e^{-\lambda} \qquad \mathbf{E}(S) = \lambda \qquad \text{var}(S) = \lambda \\
\end{aligned}
|
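The limit is easy to see numerically: hold \lambda = np fixed while n grows, and the binomial PMF converges to the Poisson PMF (a sketch; \lambda and k are arbitrary).

```python
from scipy import stats

lam, k = 3.0, 2   # fixed arrival rate and a sample count (arbitrary)
for n in (10, 100, 1000, 10000):
    p = lam / n   # shrink p as n grows so that np = lambda stays fixed
    print(n, stats.binom.pmf(k, n, p))
print("Poisson:", stats.poisson.pmf(k, lam))
```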
Beta with parameters (\alpha, \beta)
Infer the posterior of the unknown bias \Theta of a coin, given k heads in n (fixed) tosses:
|
\begin{aligned}
f_{\Theta|K}(\theta\,|\,k) &= \frac{1}{d(n,k)} \theta^k (1-\theta)^{n-k} \\
\int_0^1 \theta^\alpha (1-\theta)^\beta\,d\theta &= \frac{\alpha! \, \beta!}{(\alpha+\beta+1)!} & \text{beta distribution}
\end{aligned}
|
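A check of the closed-form integral for a few small integer \alpha and \beta, as a sketch using scipy.integrate.quad:

```python
from math import factorial
from scipy import integrate

# Compare the integral against alpha! beta! / (alpha + beta + 1)!.
for a, b in [(2, 3), (5, 1), (4, 4)]:
    val, _ = integrate.quad(lambda t, a=a, b=b: t**a * (1 - t)**b, 0, 1)
    print(val, factorial(a) * factorial(b) / factorial(a + b + 1))
```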
Geometric with parameter p \in [0, 1]
Model number of trials (k) until a success:
|
\begin{aligned}
p_X(k) = \mathbb{P}(X = k) = (1-p)^{k-1}p \quad \mathbb{E}[X] = \frac{1}{p} \quad var(X) = \frac{1-p}{p^2} \\
\end{aligned}
|
Exponential with parameter \lambda > 0
Model amount of time elapsed (x) until a success:
|
\begin{aligned}
f_X(x) &= \lambda e^{-\lambda x} \quad \mathbb{E}[X] = \frac{1}{\lambda} \quad var(X) = \frac{1}{\lambda^2} \\
\mathbb{P}(X > a) &= \int_a^\infty \lambda e^{-\lambda x} \, dx = e^{-\lambda a} \\
\mathbb{P}(T - t > x\, |\, T > t) &= e^{-\lambda x} = \mathbb{P}(T > x) & \text{Memorylessness!} \\
\mathbb{P}(0 \le T \le \delta) &\approx \lambda\delta \approx \mathbb{P}(t \le T \le t+\delta\,|\, T > t) & \mathbb{P}(\text{success in each } \delta \text{ time step}) \approx \lambda\delta\\
\end{aligned}
|
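Memorylessness shows up directly in simulation: among samples with T > t, the residual time T − t still looks Exponential(\lambda); a sketch (\lambda, t, x arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t, x = 0.5, 2.0, 1.0

T = rng.exponential(1 / lam, size=1_000_000)  # numpy uses scale = 1/lambda

survivors = T[T > t]                 # condition on T > t
print((survivors - t > x).mean())    # empirical P(T - t > x | T > t)
print(np.exp(-lam * x))              # P(T > x) = e^{-lambda x}
```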
Normal (Gaussian)
|
\begin{aligned}
N(0,1): f_X(x) &= \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \\
N(\mu,\sigma^2): f_X(x) &= \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2} \\
\end{aligned}
|
\qquad If X \thicksim N(\mu,\sigma^2) and Y = aX + b, then Y \thicksim N(a\mu + b, a^2\sigma^2)
\qquad If X = \Theta + W where W \thicksim N(0,\sigma^2), indep. of \Theta, then f_{X|\Theta}(x\,|\,\theta) = f_W(x-\theta); or X \thicksim N(\Theta,\sigma^2)
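A sanity check of the linear-transformation rule by simulation, as a sketch (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, a, b = 1.0, 2.0, -3.0, 5.0

X = rng.normal(mu, sigma, size=1_000_000)
Y = a * X + b                            # linear transformation of a normal

print(Y.mean(), a * mu + b)              # ≈ 2.0 vs 2.0
print(Y.std(), abs(a) * sigma)           # ≈ 6.0 vs 6.0
```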
Cumulative distribution function (CDF)
Discrete:
|
\begin{aligned}
F_X(x) = \mathbb{P}(X \le x) = \sum_{k \le x} p_X(k) \\
\end{aligned}
|
Continuous:
|
\begin{aligned}
F_X(x) = \mathbb{P}(X \le x) = \int_{-\infty}^x f_X(t)\,dt \quad \therefore \frac{d}{dx}F_X(x) = f_X(x) \\
\end{aligned}
|
Source: MITx 6.041x, Lecture 5, 6, 8, 14.
