Sunday, March 30, 2014
Normal Distribution
\(\displaystyle X \thicksim N(\mu, \sigma^2) \qquad f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / 2 \sigma^2} \)
In fact, any PDF of the following form with \(\alpha > 0\) is a normal distribution:
  |
\[
\begin{aligned}
f_X(x) &= c \cdot\color{blue}{e^{-(\alpha x^2 + \beta x + \gamma )}} \\
\end{aligned}
\] |
Note that \(f_X(x)\) peaks at \(x = \mu\), where the exponent is smallest. Minimizing the exponent (and matching the coefficient of \(x^2\)) shows that:
  |
\[
\begin{aligned}
\mu = \color{blue}{-\frac{\beta}{2\alpha}} \qquad \sigma^2 = \color{red}{\frac{1}{2\alpha}} \\
\end{aligned}
\] |
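As a short worked step (not in the original notes), expanding the exponent of the normal PDF and matching coefficients against \(\alpha x^2 + \beta x + \gamma\) gives the same result:
\[
\begin{aligned}
\frac{(x-\mu)^2}{2\sigma^2} &= \frac{1}{2\sigma^2}x^2 - \frac{\mu}{\sigma^2}x + \frac{\mu^2}{2\sigma^2} \\
\alpha = \frac{1}{2\sigma^2}, \quad \beta = -\frac{\mu}{\sigma^2} \quad &\Rightarrow \quad \sigma^2 = \frac{1}{2\alpha}, \quad \mu = -\beta\sigma^2 = -\frac{\beta}{2\alpha}
\end{aligned}
\]
Equivalently, setting the derivative of the exponent to zero, \(2\alpha x + \beta = 0\), locates the peak at \(x = -\beta / 2\alpha\).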
Posterior probability:
  |
\[
\begin{aligned}
\color{teal}{f_{\Theta|X}(\theta\,|\,x)} &\color{teal}{=} \color{teal}{\frac{f_\Theta(\theta) \cdot f_{X|\Theta}(x\,|\,\theta)}{f_X(x)}} \\
f_X(x) &= \int f_\Theta(\theta) \cdot f_{X|\Theta}(x\,|\,\theta) \, d\theta \\
f_\Theta(\theta) &\text{: prior distribution} \\
f_{\Theta|X}(\theta\,|\,x) &\text{: posterior distribution, output of Bayesian inference}\\
\end{aligned}
\] |
Given the prior distribution of \(\Theta\) and some observations of \(X\), we can express the posterior distribution of \(\Theta\) (as a function of \(\theta\)) and compute a point estimate. The normal distribution is particularly convenient: the point estimate is the peak of the posterior, which is found by minimizing the exponent, a quadratic function of \(\theta\), via differentiation.
Estimate with single observation
  |
\[
\begin{aligned}
X &= \Theta + W \qquad \Theta, W: N(0,1) \qquad \text{independent } \Theta, W \\
\widehat{\Theta}_{\text{MAP}} &= \widehat{\Theta}_{\text{LMS}} = \mathbb{E}[\Theta\,|\,X] = \frac{X}{2} \qquad \mathbb{E}[(\Theta - \widehat{\Theta})^2\,|\,X=x] = \color{red}{1 \over 2} \\
\widehat{\Theta} &\text{: (point) estimator - a random variable } \quad \hat{\theta} \text{: estimate - a number } \quad \text{MAP: Maximum a posteriori probability } \quad \text{LMS: Least Mean Squares}
\end{aligned}
\] |
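A sketch of where \(X/2\) and \(1/2\) come from, using the coefficient formulas \(\mu = -\beta/2\alpha\) and \(\sigma^2 = 1/2\alpha\) from the start of these notes (the intermediate algebra is filled in here and is not part of the original):
\[
\begin{aligned}
f_{\Theta|X}(\theta\,|\,x) &= c \cdot e^{-\theta^2/2} \cdot e^{-(x-\theta)^2/2} = c \cdot e^{-(\theta^2 - x\theta + x^2/2)} \\
\alpha = 1, \quad \beta = -x \quad &\Rightarrow \quad \hat{\theta} = -\frac{\beta}{2\alpha} = \frac{x}{2}, \qquad var(\Theta\,|\,X=x) = \frac{1}{2\alpha} = \frac{1}{2}
\end{aligned}
\]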
Estimate with multiple observations
  |
\[
\begin{aligned}
X_1 &= \Theta + W_1 \qquad \Theta \thicksim N(x_0,\sigma_0^2) \qquad W_i \thicksim N(0, \sigma_i^2) \\
\vdots \\
X_n &= \Theta + W_n \qquad \Theta, W_1, \ldots, W_n \text{ independent} \\
f_{\Theta|X}(\theta\,|\,x) &= c \cdot e^{-\text{quad}(\theta)} \\
\text{quad}(\theta) &= \frac{(\theta - x_0)^2}{2\sigma_0^2} + \frac{(\theta - x_1)^2}{2\sigma_1^2} + \cdots + \frac{(\theta - x_n)^2}{2\sigma_n^2} \\
\widehat{\Theta}_{\text{MAP}} &= \widehat{\Theta}_{\text{LMS}} = \mathbb{E}[\Theta\,|\,X] = \color{red}{\frac{1}{\displaystyle{\sum_{i=0}^n \frac{1}{\sigma_i^2}}}}\displaystyle{\sum_{i=0}^n\frac{x_i}{\sigma_i^2}} \\
\mathbb{E}[(\Theta - \widehat{\Theta})^2] &= \mathbb{E}[(\Theta - \widehat{\Theta})^2\color{blue}{\,|\,X=x}] = \mathbb{E}[(\Theta - \color{blue}{\widehat{\theta}})^2\,|\,X=x] \\
&= var(\Theta\,|\,X=x) = \color{red}{1 \over \displaystyle{\sum_{i=0}^n\frac{1}{\sigma_i^2}}} & \text{mean squared error}
\end{aligned}
\] |
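The estimator above is a precision-weighted average of the prior mean \(x_0\) and the observations. A minimal Python sketch, assuming variances are passed rather than standard deviations (the function name `normal_map_estimate` and its argument names are made up for illustration):

```python
def normal_map_estimate(x0, var0, xs, noise_vars):
    """Posterior mean (MAP = LMS estimate) and mean squared error.

    x0, var0   : mean and variance of the normal prior on Theta
    xs         : observed values x_1..x_n
    noise_vars : variances sigma_1^2..sigma_n^2 of the noise terms W_i
    """
    if len(xs) != len(noise_vars):
        raise ValueError("need one noise variance per observation")
    # Treat the prior mean as "observation 0" with variance sigma_0^2.
    values = [x0] + list(xs)
    variances = [var0] + list(noise_vars)
    precision = sum(1.0 / v for v in variances)  # sum of 1 / sigma_i^2
    estimate = sum(x / v for x, v in zip(values, variances)) / precision
    mse = 1.0 / precision                        # var(Theta | X = x)
    return estimate, mse


# Single-observation case from the previous section: Theta, W ~ N(0, 1).
estimate, mse = normal_map_estimate(0.0, 1.0, [2.0], [1.0])
```

With one observation \(x = 2\) and unit variances this reduces to \(x/2\) with mean squared error \(1/2\), matching the single-observation result.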
\(\Theta\) as an \(m\)-dimensional vector with \(n\) observations
  |
\[
\begin{aligned}
f_{\Theta|X}(\theta\,|\,x) &= \frac{1}{f_X(x)} \prod_{j=1}^{m} f_{\Theta_j}(\theta_j) \prod_{i=1}^n f_{X_i|\Theta}(x_i\,|\,\theta) & \text{posterior distribution} \\
\end{aligned}
\] |
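When the priors and the noise are normal, the posterior again has the form \(c \cdot e^{-\text{quad}(\theta)}\). We can then differentiate quad(\(\theta\)) with respect to each \(\theta_j\) and set the derivatives to zero, solving \(m\) linear equations in \(m\) unknowns for the point estimate of \(\Theta\).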
Source: MITx 6.041x, Lecture 15.
Independence
Probabilistic models that do not interact with each other and have \(\color{blue}{\text{no common sources}}\) of uncertainty.
  |
\[
\begin{aligned}
\mathbb{P}(A \cap B) &= \mathbb{P}(A) \cdot \mathbb{P}(B) & \text{iff } A \text{ and } B \text{ are independent}\\
p_{X|A}(x) &= p_X(x) & \text{for all } x \text{ iff } X \text{ and } A \text{ are independent}\\
p_{X,Y}(x,y) &= p_X(x)\cdot p_Y(y) & \text{for all } x,y \text{ iff } X \text{ and } Y \text{ are independent} \\
p_{X,Y,Z}(x,y,z) &= p_X(x)\cdot p_Y(y)\cdot p_Z(z) & \text{for all } x,y,z \text{ iff } X,Y \text{ and } Z \text{ are independent} \\
\end{aligned}
\] |
Note it's always true that
  |
\[
\begin{aligned}
f_{X,Y}(x,y) &= f_{X|Y}(x\,|\,y)\cdot f_Y(y) &\text{by the definition of conditional probability}\\
\end{aligned}
\] |
But
  |
\[
\begin{aligned}
f_{X|Y}(x\,|\,y)\cdot f_Y(y) &= f_X(x)\cdot f_Y(y) & \text{for all } x,y \text{ iff } X,Y \text{ are independent} \\
\end{aligned}
\] |
Expectation
In general,
  |
\[
\begin{aligned}
\mathbb{E}\left[g(X,Y)\right] \ne g\big(\mathbb{E}[X], \mathbb{E}[Y]\big) \quad \text{e.g. } \mathbb{E}[XY] \ne \mathbb{E}[X]\mathbb{E}[Y]
\end{aligned}
\] |
It's however always true that
  |
\[
\begin{aligned}
\mathbb{E}[aX + b] &= a\mathbb{E}[X] + b & \text{Linearity of Expectation}
\end{aligned}
\] |
But if \(X\) and \(Y\) are independent, then
  |
\[
\begin{aligned}
\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] \quad \text{ and } \quad \mathbb{E}\left[g(X)h(Y)\right] = \mathbb{E}\left[g(X)\right]\mathbb{E}\left[h(Y)\right]
\end{aligned}
\] |
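The product rule can be checked exactly on a small example. The sketch below is illustrative only: it assumes a fair six-sided die and compares an independent pair against the fully dependent case \(Y = X\), using exact fractions rather than simulation.

```python
from fractions import Fraction
from itertools import product


def expectation(pmf, g):
    """E[g(outcome)] for a finite PMF given as {outcome: probability}."""
    return sum(p * g(outcome) for outcome, p in pmf.items())


die = {k: Fraction(1, 6) for k in range(1, 7)}

# Independent X, Y: the joint PMF is the product of the marginals.
joint_indep = {(x, y): die[x] * die[y] for x, y in product(die, repeat=2)}
# Fully dependent pair: Y = X.
joint_dep = {(x, x): die[x] for x in die}

e_x = expectation(die, lambda x: x)
e_xy_indep = expectation(joint_indep, lambda xy: xy[0] * xy[1])
e_xy_dep = expectation(joint_dep, lambda xy: xy[0] * xy[1])

print(e_x * e_x, e_xy_indep, e_xy_dep)
```

In the independent case \(\mathbb{E}[XY]\) equals \(\mathbb{E}[X]\mathbb{E}[Y] = 49/4\); with \(Y = X\) it is \(\mathbb{E}[X^2] = 91/6\) instead.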
Variance
In general,
  |
\[
\begin{aligned}
var(X + Y) \ne var(X) + var(Y) \\
\end{aligned}
\] |
It's however always true that
  |
\[
\begin{aligned}
var(aX) = a^2var(X) \quad \text{and} \quad var(X+a) = var(X)
\end{aligned}
\] |
But if \(X\) and \(Y\) are independent, then
  |
\[
\begin{aligned}
var(X + Y) = var(X) + var(Y) \\
\end{aligned}
\] |
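A small exact check of the additivity rule, again assuming a fair six-sided die (an illustrative choice, not from the lecture). The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def variance(pmf):
    """Variance of a finite PMF given as {value: probability}."""
    mean = sum(p * v for v, p in pmf.items())
    return sum(p * (v - mean) ** 2 for v, p in pmf.items())


def pmf_of_sum(joint):
    """PMF of X + Y from a joint PMF given as {(x, y): probability}."""
    total = {}
    for (x, y), p in joint.items():
        total[x + y] = total.get(x + y, 0) + p
    return total


die = {k: Fraction(1, 6) for k in range(1, 7)}
joint_indep = {(x, y): die[x] * die[y] for x, y in product(die, repeat=2)}
joint_dep = {(x, x): die[x] for x in die}  # Y = X, so X + Y = 2X

var_x = variance(die)
var_sum_indep = variance(pmf_of_sum(joint_indep))
var_sum_dep = variance(pmf_of_sum(joint_dep))

print(var_x, var_sum_indep, var_sum_dep)
```

For independent dice the variances add (\(35/12 + 35/12 = 35/6\)), while for \(Y = X\) the sum is \(2X\) and the scaling rule gives \(4 \cdot 35/12 = 35/3\).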
Source: MITx 6.041x, Lecture 7.
Thursday, March 27, 2014
Random Variables
Uniform from \(a\) to \(b\)
Discrete:
  |
\[
\begin{aligned}
p_X(x) &= \frac{1}{b-a+1} \quad
\mathbb{E}[X] = \frac{a+b}{2} \quad var(X) = \frac{1}{12}(b-a)\cdot(b-a\color{blue}{+2}) \\
\mathbb{P}(a \le X \le b) &= \sum_{a \le x \le b}p_X(x) \\
\end{aligned}
\] |
Continuous:
  |
\[
\begin{aligned}
f_X(x) &= \frac{1}{b-a} \quad \mathbb{E}[X] = \frac{a+b}{2} \quad var(X) = \frac{(b-a)^2}{12} \\
\mathbb{P}(a \le X \le b) &= \int_a^b f_X(x)\,dx \\
\end{aligned}
\] |
Bernoulli with parameter \(p \in [0, 1]\)
  |
\[
\begin{aligned}
p_X(0) = 1-p \quad p_X(1) = p \quad \mathbb{E}[X] = \color{blue}{p} \quad var(X) = \color{blue}{p - p^2} \le {1 \over 4} \quad(\text{max variance}) \\
\end{aligned}
\] |
Binomial with parameters \(n\) and \(p \in [0, 1]\)
Model number of successes (\(k\)) in a given number of independent trials (\(n\)):
  |
\[
\begin{aligned}
p_X(k) &= \mathbb{P}(X=k) = {n \choose k}p^k(1-p)^{n-k} \\
\mathbb{E}[X] &= n\cdot \color{blue}{p} \quad var(X) = n\cdot\color{blue}{(p - p^2)} \\
\end{aligned}
\] |
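The PMF, mean, and variance can be checked numerically. A minimal sketch using only the standard library, with \(n = 10\) and \(p = 0.3\) chosen arbitrarily for illustration:

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success prob p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                          # should be 1
mean = sum(k * q for k, q in enumerate(pmf))              # n * p
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf)) # n * (p - p^2)

print(total, mean, var)
```

Up to floating-point rounding, the results agree with \(np = 3\) and \(n(p - p^2) = 2.1\).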
Poisson with parameter \(\lambda > 0\)
Limit of the binomial with large \(n\), small \(p\), and moderate \(\lambda = np\), which is the arrival rate.
Model number of arrivals \(S\):
  |
\[
\begin{aligned}
p_S(k) &\to \frac{\lambda^k}{k!}e^{-\lambda} \qquad \mathbb{E}[S] = \lambda \qquad var(S) = \lambda \\
\end{aligned}
\] |
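To illustrate the limit, the sketch below compares a binomial PMF with large \(n\) and small \(p\) against the Poisson PMF with \(\lambda = np\). The values \(n = 1000\), \(p = 0.003\) are arbitrary choices for the example.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(S = k) for a Poisson with arrival rate lam."""
    return lam**k * exp(-lam) / factorial(k)


n, p = 1000, 0.003
lam = n * p  # moderate arrival rate, about 3

# Largest pointwise difference between the two PMFs over k = 0..20.
max_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21)
)
print(max_gap)
```

The gap is small here and shrinks further as \(n\) grows with \(np\) held fixed.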
Beta with parameters \((\alpha, \beta)\)
Infer the posterior distribution of the unknown bias \(\Theta\) of a coin, given \(k\) heads in \(n\) (fixed) tosses and assuming a uniform prior on \(\Theta\):
  |
\[
\begin{aligned}
f_{\Theta|K}(\theta\,|\,k) &= \frac{1}{d(n,k)} \theta^k (1-\theta)^{n-k} \\
\int_0^1 \theta^\alpha (1-\theta)^\beta\,d\theta &= \frac{\alpha! \, \beta!}{(\alpha+\beta+1)!} & \text{beta distribution}
\end{aligned}
\] |
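As a worked step (filled in here, and assuming a uniform prior on \(\Theta\)), setting \(\alpha = k\) and \(\beta = n-k\) in the integral gives the normalizing constant, and applying it once more with \(\alpha = k+1\) gives the posterior mean:
\[
\begin{aligned}
d(n,k) &= \int_0^1 \theta^k (1-\theta)^{n-k}\,d\theta = \frac{k! \, (n-k)!}{(n+1)!} \\
\mathbb{E}[\Theta\,|\,K=k] &= \frac{1}{d(n,k)} \int_0^1 \theta^{k+1} (1-\theta)^{n-k}\,d\theta = \frac{(k+1)! \, (n-k)!}{(n+2)!} \cdot \frac{(n+1)!}{k! \, (n-k)!} = \frac{k+1}{n+2}
\end{aligned}
\]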
Geometric with parameter \(p \in [0, 1]\)
Model number of trials (\(k\)) until a success:
  |
\[
\begin{aligned}
p_X(k) = \mathbb{P}(X = k) = (1-p)^{k-1}p \quad \mathbb{E}[X] = \frac{1}{p} \quad var(X) = \frac{1-p}{p^2} \\
\end{aligned}
\] |
Exponential with parameter \(\lambda > 0\)
Model amount of time elapsed (\(x\)) until a success:
  |
\[
\begin{aligned}
f_X(x) &= \lambda e^{-\lambda x} \quad \mathbb{E}[X] = \frac{1}{\lambda} \quad var(X) = \frac{1}{\lambda^2} \\
\mathbb{P}(X > a) &= \int_a^\infty \lambda e^{-\lambda x} \, dx = e^{-\lambda a} \\
\mathbb{P}(T - t > x\, |\, T > t) &= e^{-\lambda x} = \mathbb{P}(T > x) & \text{Memorylessness!} \\
\mathbb{P}(0 \le T \le \delta) &\approx \lambda\delta \approx \mathbb{P}(t \le T \le t+\delta\,|\, T > t) & \mathbb{P}(\text{success in any time step of length } \delta) \approx \lambda\delta\\
\end{aligned}
\] |
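The memorylessness property follows directly from the tail formula \(\mathbb{P}(X > a) = e^{-\lambda a}\); the intermediate step, spelled out here for completeness:
\[
\begin{aligned}
\mathbb{P}(T - t > x\,|\,T > t) &= \frac{\mathbb{P}(T > t + x)}{\mathbb{P}(T > t)} = \frac{e^{-\lambda(t+x)}}{e^{-\lambda t}} = e^{-\lambda x}
\end{aligned}
\]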
Normal (Gaussian)
  |
\[
\begin{aligned}
N(0,1): f_X(x) &= \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \\
N(\mu,\sigma^2): f_X(x) &= \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2} \\
\end{aligned}
\] |
\(\qquad\)If \(X \thicksim N(\mu,\sigma^2)\) and \(Y = aX + b\), then \(Y \thicksim N(a\mu + b,a^2\sigma^2) \)
\(\qquad\)If \(X = \Theta + W\) where \(W \thicksim N(0,\sigma^2)\) is independent of \(\Theta\), then \(f_{X|\Theta}(x\,|\,\theta) = f_W(x-\theta)\); that is, given \(\Theta = \theta\), \(X \thicksim N(\theta,\sigma^2)\)
Cumulative distribution function (CDF)
Discrete:
  |
\[
\begin{aligned}
F_X(x) = \mathbb{P}(X \le x) = \sum_{k \le x} p_X(k) \\
\end{aligned}
\] |
Continuous:
  |
\[
\begin{aligned}
F_X(x) = \mathbb{P}(X \le x) = \int_{-\infty}^x f_X(t)\,dt \quad \therefore \frac{d}{dx}F_X(x) = f_X(x) \\
\end{aligned}
\] |
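The relation \(\frac{d}{dx}F_X(x) = f_X(x)\) can be checked numerically for the normal distribution. The sketch below uses the standard identity \(F_X(x) = \frac{1}{2}\big(1 + \text{erf}\big(\frac{x-\mu}{\sigma\sqrt{2}}\big)\big)\), which is not stated in the notes, and a central finite difference; the function names are illustrative.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return exp(-z * z / 2) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def numeric_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.0
slope = numeric_derivative(normal_cdf, x)  # approximates dF/dx at x
density = normal_pdf(x)                    # f_X(x)
print(slope, density)
```

The two printed values agree to several decimal places, as expected from differentiating the CDF.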
Source: MITx 6.041x, Lectures 5, 6, 8, 14.