Some personal notes on probability theory
A self-contained summary of some key results in probability theory. I’ll mostly summarize the results from Durrett’s Probability: Theory and Examples, Bandit Algorithms (Lattimore and Szepesvári) and some other sources.
0. Measure-theoretic probability
($\sigma$-algebra and probability measure) Let $\Omega$ be a set of outcomes. A set of events $\mathcal{F} \subseteq 2^\Omega$ is a $\sigma$-algebra if $\Omega\in\mathcal{F}$, $A^c \in \mathcal{F}$ for all $A \in \mathcal{F}$ and $\bigcup_{i=1}^\infty A_i \in \mathcal{F}$ for every countable collection $A_1, A_2, \ldots \in \mathcal{F}$. A function $\mathbb{P}: \mathcal{F} \to \mathbb{R}$ is a probability measure if $\mathbb{P}(\Omega) = 1$, $\mathbb{P}(A) \ge 0$ for all $A \in \mathcal{F}$, and $\mathbb{P}\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \mathbb{P}(A_i)$ for every countable collection $A_1, A_2, \ldots \in \mathcal{F}$ with $A_i \cap A_j = \emptyset$ for all $i \neq j$ (countable additivity). In particular, $\mathbb{P}(A^c) = 1 - \mathbb{P}(A)$ for all $A \in \mathcal{F}$. A set $\mathcal{G}$ is a sub-$\sigma$-algebra of $\mathcal{F}$ if $\mathcal{G} \subseteq \mathcal{F}$ and $\mathcal{G}$ is a $\sigma$-algebra. The restriction of $\mathbb{P}$ to $\mathcal{G}$ is denoted by $\mathbb{P}|_{\mathcal{G}}$.
For an event $A\in\mathcal{F}$, we denote the probability of $A$ by $\mathbb{P}(A)$.
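To make the axioms concrete, here is a minimal Python sketch (my own, not from the sources above) that brute-force checks the $\sigma$-algebra and probability-measure axioms on a small finite outcome set; the function names are purely illustrative.

```python
# Brute-force check of the sigma-algebra and probability-measure axioms on a
# finite outcome set. On a finite collection, closure under pairwise unions is
# equivalent to closure under countable unions.
omega = frozenset({1, 2, 3, 4})
# F generated by the partition {1, 2} / {3, 4}
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
# P assigns mass 1/2 to each cell of the partition
P = {frozenset(): 0.0, frozenset({1, 2}): 0.5, frozenset({3, 4}): 0.5, omega: 1.0}

def is_sigma_algebra(F, omega):
    contains_omega = omega in F
    closed_complement = all(omega - A in F for A in F)
    closed_union = all((A | B) in F for A in F for B in F)
    return contains_omega and closed_complement and closed_union

def is_probability_measure(P, F, omega):
    if abs(P[omega] - 1.0) > 1e-12 or any(p < 0 for p in P.values()):
        return False
    # additivity on disjoint pairs (enough on a finite sigma-algebra)
    return all(abs(P[A] + P[B] - P[A | B]) < 1e-12
               for A in F for B in F if not (A & B))

print(is_sigma_algebra(F, omega), is_probability_measure(P, F, omega))  # True True
```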
(Measurable space) A measurable space is a pair $(\Omega, \mathcal{F})$ where $\Omega$ is a set and $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$.
($\mathcal{F}/\mathcal{G}$-measurable map) Let $(\Omega, \mathcal{F})$ be a measurable space, let $\mathcal{X}$ be any set and let $\mathcal{G}\subseteq 2^{\mathcal{X}}$. A map $X:\Omega \to \mathcal{X}$ is $\mathcal{F}/\mathcal{G}$-measurable if $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{G}$. Here, $\mathcal{G}$ need not be a $\sigma$-algebra. If $X$ is $\mathcal{F}/\mathcal{G}$-measurable, then it is also $\mathcal{F}/\sigma(\mathcal{G})$-measurable, where $\sigma(\mathcal{G})$ is the smallest $\sigma$-algebra containing $\mathcal{G}$.
Given a map $X:\Omega\to\mathcal{X}$ between measurable spaces $(\Omega, \mathcal{F})$ and $(\mathcal{X}, \mathcal{G})$, we define \(\sigma(X) = \{X^{-1}(A): A\in\mathcal{G}\}\) to be the $\sigma$-algebra generated by $X$. Here, the term “generated” means that $\sigma(X)$ is the smallest $\sigma$-algebra containing $X^{-1}(A)$ for all $A \in \mathcal{G}$. The map $X$ is $\mathcal{F}/\mathcal{G}$-measurable if and only if $\sigma(X) \subseteq \mathcal{F}$. In fact, $\sigma(X)$ is a sub-$\sigma$-algebra of $\mathcal{F}$ and is also the smallest sub-$\sigma$-algebra for which $X$ is measurable. Furthermore, if $\mathcal{G} = \sigma(\mathcal{A})$ itself is generated by a set $\mathcal{A}\subseteq 2^{\mathcal{X}}$, it is enough to check that \(X^{-1}(\mathcal{A}) = \{X^{-1}(A) : A\in\mathcal{A}\}\) is a subset of $\mathcal{F}$.
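As a small illustration (my own, not from the sources), the following sketch computes $\sigma(X)$ on a finite outcome set by collecting the preimages of all subsets of the range of $X$; the names are made up for the example.

```python
# sigma(X) on a finite space: the collection of preimages X^{-1}(A) over all
# subsets A of the range of X. Here Omega models a die roll and X its parity.
from itertools import combinations

omega = [1, 2, 3, 4, 5, 6]
X = lambda w: w % 2                     # X(w) = parity of the roll

values = sorted(set(map(X, omega)))
subsets_of_range = [set(c) for r in range(len(values) + 1)
                    for c in combinations(values, r)]

sigma_X = {frozenset(w for w in omega if X(w) in A) for A in subsets_of_range}
print(sorted((sorted(E) for E in sigma_X), key=lambda e: (len(e), e)))
# [[], [1, 3, 5], [2, 4, 6], [1, 2, 3, 4, 5, 6]]
```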
(Borel $\sigma$-algebra) If $\mathcal{G}$ is the set of all open intervals in $\mathbb{R}$, then the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-algebra containing $\mathcal{G}$, i.e. $\mathcal{B}(\mathbb{R}) = \sigma(\mathcal{G})$.
(Random variable) A random variable on a measurable space $(\Omega, \mathcal{F})$ is an $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable map $X:\Omega \to \mathbb{R}$.
$\color{darkgreen}{\text{(Lemma 0.1: Doob–Dynkin lemma (also known as factorization lemma))}}$ Assume that we are given measurable spaces $(\Omega, \mathcal{F})$, $(\mathcal{X}, \mathcal{G})$ and $(\mathcal{Y}, \mathcal{H})$, and that $X:\Omega\to\mathcal{X}$ and $Y:\Omega\to\mathcal{Y}$ are random elements (the generalization of random variables to arbitrary measurable spaces). If $(\mathcal{Y}, \mathcal{H})$ is a Borel space, then $Y$ is $\sigma(X)$-measurable if and only if there exists a $\mathcal{G}/\mathcal{H}$-measurable map $f:\mathcal{X}\to\mathcal{Y}$ such that $Y = f\circ X$. Here, saying $Y$ is $\sigma(X)$-measurable means that $\sigma(Y) \subseteq \sigma(X)$, i.e. knowing $X$ determines $Y$.
(Filtration) Given a measurable space $(\Omega, \mathcal{F})$, a filtration is a sequence \((\mathcal{F}_t)_{t=0}^n\) of sub-$\sigma$-algebras of $\mathcal{F}$ such that \(\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n\). Also we define \(\mathcal{F}_\infty = \sigma\left(\bigcup_{t=0}^\infty \mathcal{F}_t\right)\) to be the smallest $\sigma$-algebra containing the union of all \(\mathcal{F}_t\).
A sequence of random variables \((X_t)_{t=1}^n\) is adapted to the filtration \(\mathbb{F} = (\mathcal{F}_t)_{t=0}^n\) if \(X_t\) is \(\mathcal{F}_{t}\)-measurable for all $t\in[n]$. For brevity, we say \((X_t)_t\) is $\mathbb{F}$-adapted. Finally, \((X_t)_t\) is $\mathbb{F}$-predictable if $X_t$ is \(\mathcal{F}_{t-1}\)-measurable for all $t\in[n]$.
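A minimal finite example (my own construction) may help: with $\Omega$ the space of three coin flips and $\mathcal{F}_t$ generated by the first $t$ flips, a map is $\mathcal{F}_t$-measurable exactly when it is constant on the atoms of $\mathcal{F}_t$.

```python
# Adaptedness on the finite space of 3 coin flips. F_t is generated by the
# first t coordinates, so its atoms are the groups of outcomes sharing them.
from itertools import product

omega = list(product([0, 1], repeat=3))

def atoms(t):
    groups = {}
    for w in omega:
        groups.setdefault(w[:t], set()).add(w)
    return list(groups.values())

def measurable(f, t):
    # f is F_t-measurable iff f is constant on every atom of F_t
    return all(len({f(w) for w in atom}) == 1 for atom in atoms(t))

S = lambda t: (lambda w: sum(w[:t]))            # running sum after t flips
bet = lambda t: (lambda w: 1 + sum(w[:t - 1]))  # round-t bet uses only rounds < t

print(measurable(S(2), 2))    # True : (S_t) is adapted
print(measurable(S(3), 2))    # False: S_3 is not F_2-measurable
print(measurable(bet(3), 2))  # True : (bet_t) is predictable
```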
(Conditional probability) Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $A, B\in\mathcal{F}$ with $\mathbb{P}(B) > 0$. The conditional probability of $A$ given $B$ is defined as \(\mathbb{P}(A|B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}\). Intuitively, this tells us how large the portion of $A$ is inside $B$.
(Independence) Two events $A, B\in\mathcal{F}$ are independent if $\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B)$. Equivalently, when $\mathbb{P}(B) > 0$, $A$ and $B$ are independent if and only if $\mathbb{P}(A|B) = \mathbb{P}(A)$. Two random variables $X$ and $Y$ are independent if $\sigma(X)$ and $\sigma(Y)$ are independent $\sigma$-algebras, i.e. $\mathbb{P}(X\in A, Y\in B) = \mathbb{P}(X\in A)\mathbb{P}(Y\in B)$ for all $A, B\in\mathcal{B}(\mathbb{R})$.
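For a concrete check (illustrative only), take the uniform measure on two fair coin flips and compare $\mathbb{P}(A\cap B)$ with $\mathbb{P}(A)\mathbb{P}(B)$ by direct enumeration.

```python
# Independence by enumeration on two fair coin flips (uniform measure).
from itertools import product

omega = list(product(["H", "T"], repeat=2))     # each outcome has probability 1/4
P = lambda event: sum(event(w) for w in omega) / len(omega)

A = lambda w: w[0] == "H"        # first flip is heads
B = lambda w: w[1] == "H"        # second flip is heads
D = lambda w: "H" in w           # at least one head

print(P(lambda w: A(w) and B(w)) == P(A) * P(B))  # True : A and B are independent
print(P(lambda w: A(w) and D(w)) == P(A) * P(D))  # False: A and D are not
```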
(Expectation) Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, the expectation of a random variable $X:\Omega\to\mathbb{R}$ is defined as its Lebesgue integral with respect to $\mathbb{P}$:
\[\mathbb{E}[X] = \int_\Omega X(\omega) d\mathbb{P}(\omega)\]To rigorously define what exactly this is, we shall need two facts about integration. First, the integral of an indicator function is the probability of the corresponding event: if \(X(\omega) = \mathbb{I}\{\omega\in A\}\) for some $A\in\mathcal{F}$, then $\int_\Omega X(\omega) d\mathbb{P}(\omega) = \mathbb{P}(A)$. Second, the integral is a linear operator.
Now, if \(X(\omega) = \sum_{i=1}^n a_i \mathbb{I}\{\omega\in A_i\}\) for some constants $a_i\in\mathbb{R}$ and events $A_i\in\mathcal{F}$, then we can write the integral as:
\[\int_\Omega X(\omega) d\mathbb{P}(\omega) = \sum_{i=1}^n a_i \int_\Omega \mathbb{I}\{\omega\in A_i\} d\mathbb{P}(\omega) = \sum_{i=1}^n a_i \mathbb{P}(A_i)\]Such an $X$ is called a simple function. For a general nonnegative $X$, the expectation is then defined by approximation from below with simple functions:
\[\int_\Omega X(\omega) d\mathbb{P}(\omega) = \sup\left\{\int_\Omega h(\omega) d\mathbb{P}(\omega) : \text{$h$ is simple and $0\le h \le X$ pointwise} \right\}\]For a general $X$ we write $X = X^+ - X^-$ with $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$, and set $\mathbb{E}[X] = \mathbb{E}[X^+] - \mathbb{E}[X^-]$ whenever at least one of the two terms is finite.
(Stochastic process) A stochastic process on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is a collection of random variables $(X_t)_{t\in T}$ indexed by a set $T$.
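Returning to the expectation: as a concrete illustration of the simple-function approximation, here is a small numerical sketch (mine, not from the sources), with the uniform measure on a fine grid standing in for Lebesgue measure on $[0,1]$.

```python
# Approximating E[X] from below by simple functions for Omega = [0, 1] with the
# uniform measure and X(w) = w**2, whose exact expectation is 1/3.
import numpy as np

grid = np.linspace(0, 1, 1_000_001)    # fine grid standing in for Omega = [0, 1]
X = grid ** 2

for n in [1, 2, 4, 8, 16]:
    # dyadic simple function h_n <= X: round X down to a multiple of 2^{-n}
    h_n = np.floor(X * 2**n) / 2**n
    print(n, h_n.mean())               # integral of h_n; increases toward 1/3
```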
($\mathbb{P}$-almost surely) Two random variables $X$ and $Y$ are equal $\mathbb{P}$-almost surely, written $X = Y$ $\mathbb{P}$-a.s., if $\mathbb{P}(X = Y) = 1$. To put it another way, they disagree only on a set of measure zero.
(Martingale) Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and \(\mathbb{F} = (\mathcal{F}_t)_{t=1}^n\) a filtration of $\mathcal{F}$ (we allow $n = \infty$). An $\mathbb{F}$-adapted sequence of random variables \((X_t)_{t\in\mathbb{N}_+}\) is an $\mathbb{F}$-martingale if: (a) each \(X_t\) is integrable and (b) \(\mathbb{E}[X_t | \mathcal{F}_{t-1}] = X_{t-1}\) almost surely for all \(t\in\{2, 3, \ldots\}\). If the equality in (b) is replaced by $\le$ or $\ge$, we call \((X_t)_t\) a supermartingale or a submartingale respectively.
A fair betting game is one real-life example: if \(S_t\) is the total money you have after $t$ rounds of a fair game, then \((S_t)_t\) can be shown to be a martingale.
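A quick simulation sketch (mine, not from the sources): with i.i.d. fair $\pm 1$ bets, the running total is a martingale, and the conditional mean $\mathbb{E}[S_t \mid S_{t-1} = s]$ estimated from many simulated paths is close to $s$.

```python
# Fair +/-1 betting game: estimate E[S_t | S_{t-1} = s] from simulated paths
# and check that it is close to s, as the martingale property requires.
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_rounds, t = 200_000, 10, 5
bets = rng.choice([-1, 1], size=(n_paths, n_rounds))   # mean-zero increments
S = bets.cumsum(axis=1)

prev, curr = S[:, t - 1], S[:, t]
for s in map(int, np.unique(prev)):
    sel = prev == s
    print(f"E[S_{t} | S_{t-1} = {s:+d}] ~ {curr[sel].mean():+.3f}  (n = {sel.sum()})")
```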
(Stopping time) Let \(\mathbb{F} = (\mathcal{F}_t)_{t\in\mathbb{N}}\) be a filtration. A random variable \(\tau:\Omega\to\mathbb{N}\cup\{\infty\}\) is a stopping time with respect to \(\mathbb{F}\) if for all \(t\in\mathbb{N}\), the event \(\{\tau \le t\} \in \mathcal{F}_t\) (or we can say \(\mathbb{I}\{\tau \le t\}\) is \(\mathcal{F}_t\)-measurable). The $\sigma$-algebra at the stopping time $\tau$ is:
\[\mathcal{F}_\tau = \{A \in \mathcal{F}_\infty : A \cap \{\tau \le t\} \in \mathcal{F}_t \text{ for all $t$} \}\]
$\color{darkblue}{(\text{Theorem 0.1: Doob's optional stopping})}$
Let $\mathbb{F} = (\mathcal{F}_t)_{t\in\mathbb{N}}$ be a filtration, $(X_t)_{t\in\mathbb{N}}$ an $\mathbb{F}$-adapted martingale and $\tau$ an $\mathbb{F}$-stopping time such that at least one of the following conditions holds:
1. $\exists n\in\mathbb{N}$ such that $\mathbb{P}(\tau > n) = 0$, i.e. $\tau$ is almost surely bounded.
2. $\mathbb{E}[\tau] < \infty$ and $\exists c\in\mathbb{R}$ such that for all $t\in\mathbb{N}$, $\mathbb{E}\left[|X_{t+1} - X_t| ~\middle| \mathcal{F}_t\right] \le c$ almost surely.
3. $\exists c$ such that $|X_{t\land \tau}| \le c$ almost surely for all $t\in\mathbb{N}$.
Then, $X_\tau$ is almost surely well defined and $\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$.
One practical consequence of this theorem is that if you play a fair betting game and try to outsmart the casino by stopping at a well-chosen time, say the first time \(S_t \ge \$100\), you should expect to spend eternity in the casino: if $S_t$ is the total money you have after $t$ rounds of a fair game, then either $\mathbb{E}[S_\tau] = \mathbb{E}[S_0]$ (no expected gain) or $\mathbb{E}[\tau] = \infty$.
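The following simulation sketch (illustrative, with arbitrary numbers) plays a fair $\pm 1$ game up to a finite horizon $T$ and stops either when the total first reaches $+10$ or at the horizon; since the stopping time $\tau \wedge T$ is bounded (condition 1), the stopped value has mean equal to the initial value (zero), even though most paths do reach the target.

```python
# Trying to beat a fair game by stopping at +10. The stopped value still has
# mean ~ 0: the frequent small wins are offset by rare large losses on paths
# that never reach the target before the horizon.
import numpy as np

rng = np.random.default_rng(1)
n_paths, T, target = 20_000, 1_000, 10
S = rng.choice([-1, 1], size=(n_paths, T)).cumsum(axis=1)

reached = (S >= target).any(axis=1)
first_hit = (S >= target).argmax(axis=1)       # 0 for paths that never reach the target
stop = np.where(reached, first_hit, T - 1)     # stop at the target or at the horizon
stopped_value = S[np.arange(n_paths), stop]

print("P(reach +10 before T):", reached.mean())
print("E[S at stopping time]:", stopped_value.mean())   # close to 0, not 10
```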
$\color{darkblue}{(\text{Theorem 0.2: Maximal inequality (also known as Ville's inequality)})}$ Let $(X_t)_{t=0}^\infty$ be a supermartingale with $X_t\ge 0$ almost surely for all $t$. Then, for any $\epsilon>0$, $$ \mathbb{P}\left(\sup_{t\in\mathbb{N}} X_t \ge \epsilon\right) \le \frac{\mathbb{E}[X_0]}{\epsilon}. $$
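A simulation sketch of the bound (my own toy example): $X_t = \prod_{s\le t} U_s$ with $U_s$ i.i.d. Uniform$[0,2]$ is a nonnegative martingale (hence supermartingale) with $X_0 = 1$, so $\mathbb{P}(\sup_t X_t \ge 5) \le 1/5$.

```python
# Ville's inequality check: X_t = product of i.i.d. Uniform[0, 2] factors is a
# nonnegative martingale with X_0 = 1, so P(sup_t X_t >= eps) <= 1/eps.
# (The supremum over a finite horizon slightly underestimates the full supremum.)
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps, eps = 50_000, 200, 5.0
U = rng.uniform(0, 2, size=(n_paths, n_steps))
X = U.cumprod(axis=1)

p_hat = (X.max(axis=1) >= eps).mean()
print(f"P(sup X_t >= {eps}) ~ {p_hat:.4f}  <=  E[X_0] / eps = {1 / eps:.4f}")
```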
1. Law of large numbers
(Convergence in probability) We say that $X_n$ converges to $X$ in probability if for every $\epsilon > 0$, $\lim_{n \to \infty} \Pr(|X_n - X| > \epsilon) = 0$, denoted by $X_n \overset{p}{\longrightarrow} X$.
(Convergence in $L^r$) We say that $X_n$ converges to $X$ in $L^r$ if $\lim_{n \to \infty} \mathbb{E}(|X_n - X|^r) = 0$, denoted by $X_n \overset{L^r}{\longrightarrow} X$.
(Uncorrelated variables) Two random variables $X$ and $Y$ are uncorrelated if $\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y)$.
$\color{darkgreen}{\text{(Lemma 1.1)}}$ Let $X_1, X_2, \ldots$ be a sequence of uncorrelated random variables with $\mathbb{E}(X_i^2) < \infty$. Then $\mathbb{V}(X_1 + \ldots + X_n) = \mathbb{V}(X_1) + \ldots + \mathbb{V}(X_n)$.
$\color{darkblue}{(\text{Theorem 1.1: $L^2$ weak law})}$ Let $X_1, X_2, \ldots$ be a sequence of uncorrelated random variables with $\mathbb{E}(X_i) = \mu$ and $\mathbb{V}(X_i) \le C < \infty$. If $S_n := X_1 + \ldots + X_n$, then $S_n/n \overset{L^2}{\longrightarrow} \mu$.
Proof. We have that $\mathbb{E}[|S_n/n - \mu|^2] = \mathbb{V}(S_n/n) = \frac{\mathbb{V}(S_n)}{n^2} \le \frac{Cn}{n^2} = \frac{C}{n} \to 0$ as $n \to \infty$. Here, we used Lemma 1.1 to compute the variance of $S_n$.
(Chebyshev’s inequality) For any random variable $X$ and any $\epsilon, r > 0$, $\Pr(|X| \ge \epsilon) \le \epsilon^{-r}\mathbb{E}(|X|^r)$.
One small remark is that Chebyshev’s inequality is generally not sharp (try, for example, bounding the tail of a standard normal); however, there are distributions for which it holds with equality, so it cannot be improved in general.
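To see the slack in the standard-normal case (an illustrative computation), compare the exact tail $\mathbb{P}(|Z| \ge \epsilon)$ with the Chebyshev bound $\epsilon^{-2}\mathbb{E}[Z^2] = \epsilon^{-2}$:

```python
# Chebyshev with r = 2 vs. the exact standard-normal tail: the bound holds but
# is loose here. P(|Z| >= eps) = erfc(eps / sqrt(2)) for Z ~ N(0, 1).
from math import erfc, sqrt

for eps in [1.0, 2.0, 3.0]:
    exact = erfc(eps / sqrt(2))
    print(f"eps = {eps}:  exact tail = {exact:.4f},  Chebyshev bound = {1 / eps**2:.4f}")
```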
$\color{darkgreen}{\text{(Lemma 1.2: Convergence in $L^r$ implies convergence in probability)}}$ For $r > 0$, if $Z_n \overset{L^r}{\longrightarrow} 0$, then $Z_n \overset{p}{\longrightarrow} 0$.
Proof. Since $\mathbb{E}(|Z_n|^r) \to 0$, the claim follows directly from Chebyshev’s inequality applied to $Z_n$: for any $\epsilon > 0$, $\Pr(|Z_n| \ge \epsilon) \le \epsilon^{-r}\mathbb{E}(|Z_n|^r) \to 0$.
$\color{darkblue}{(\text{Theorem 1.2: Weak law of large numbers})}$ Let $X_1, X_2, \ldots$ be a sequence of uncorrelated random variables with $\mathbb{E}(X_i) = \mu$ and $\mathbb{V}(X_i) \le C < \infty$. If $S_n := X_1 + \ldots + X_n$, then $S_n/n \overset{p}{\longrightarrow} \mu$.
Proof. Follows from Lemma 1.2 with $r = 2$ and the $L^2$ weak law.
If we relax the assumption of bounded variance, we can still obtain a weak law of large numbers under an i.i.d. assumption and a weaker tail condition.
$\color{darkblue}{(\text{Theorem 1.3: Weak law of large numbers with relaxed variance assumption})}$ Let $X_1, X_2, \ldots$ be a sequence of i.i.d random variables with $x\cdot\Pr(|X_i| > x) \to 0$ as $x \to \infty$. Let $\mu_n := \mathbb{E}[X_1\cdot\mathbb{I}\{|X_1| \le n\}]$. Then, $S_n/n - \mu_n \overset{p}{\longrightarrow} 0$.
$\color{darkblue}{(\text{Theorem 1.4: Weak law of large numbers for i.i.d random variables})}$ Let $X_1, X_2, \ldots$ be a sequence of i.i.d random variables with $\mathbb{E}[|X_1|] < \infty$. Let $\mu := \mathbb{E}[X_1]$. Then, $S_n/n \overset{p}{\longrightarrow} \mu$.
Proof sketch. Apply Theorem 1.3; the tail condition $x\cdot\Pr(|X_1| > x) \to 0$ follows from $\mathbb{E}[|X_1|] < \infty$, and $\mu_n \to \mu$ by the dominated convergence theorem.
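A closing simulation sketch (mine): Theorem 1.4 applies to a Pareto distribution with shape $\alpha = 1.5$ and scale $1$, which has finite mean $\alpha/(\alpha - 1) = 3$ but infinite variance, so the $L^2$-based Theorems 1.1 and 1.2 do not apply; the empirical probability $\mathbb{P}(|S_n/n - \mu| > \epsilon)$ still shrinks with $n$.

```python
# Weak law for i.i.d. Pareto(1.5) samples: finite mean (= 3), infinite variance.
# The fraction of sample means further than eps from mu shrinks as n grows.
import numpy as np

rng = np.random.default_rng(3)
alpha, mu, eps, n_reps = 1.5, 3.0, 0.5, 1_000
for n in [100, 1_000, 10_000]:
    X = 1.0 + rng.pareto(alpha, size=(n_reps, n))   # classical Pareto with x_m = 1
    p_hat = (np.abs(X.mean(axis=1) - mu) > eps).mean()
    print(f"n = {n:>6}:  P(|S_n/n - mu| > {eps}) ~ {p_hat:.3f}")
```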