| author | Prefetch | 2022-10-14 23:25:28 +0200 |
|---|---|---|
| committer | Prefetch | 2022-10-14 23:25:28 +0200 |
| commit | 6ce0bb9a8f9fd7d169cbb414a9537d68c5290aae (patch) | |
| tree | a0abb6b22f77c0e84ed38277d14662412ce14f39 /source/know/concept/binomial-distribution | |
Initial commit after migration from Hugo
Diffstat (limited to 'source/know/concept/binomial-distribution')
-rw-r--r-- | source/know/concept/binomial-distribution/index.md | 220 |
1 file changed, 220 insertions, 0 deletions
diff --git a/source/know/concept/binomial-distribution/index.md b/source/know/concept/binomial-distribution/index.md
new file mode 100644
index 0000000..c25da3d
--- /dev/null
+++ b/source/know/concept/binomial-distribution/index.md
@@ -0,0 +1,220 @@

---
title: "Binomial distribution"
date: 2021-02-26
categories:
- Statistics
- Mathematics
layout: "concept"
---

The **binomial distribution** is a discrete probability distribution
describing a **Bernoulli process**: a set of $N$ independent trials where
each has only two possible outcomes, "success" and "failure",
the former with probability $p$ and the latter with $q = 1 - p$.
The binomial distribution then gives the probability
that $n$ out of the $N$ trials succeed:

$$\begin{aligned}
    \boxed{
        P_N(n) = \binom{N}{n} \: p^n q^{N - n}
    }
\end{aligned}$$

The first factor is known as the **binomial coefficient**, which describes the
number of microstates (i.e. permutations) that have $n$ successes out of $N$ trials.
These happen to be the coefficients in the polynomial $(a + b)^N$,
and can be read off of Pascal's triangle.
It is defined as follows:

$$\begin{aligned}
    \boxed{
        \binom{N}{n} = \frac{N!}{n! (N - n)!}
    }
\end{aligned}$$

The remaining factor $p^n (1 - p)^{N - n}$ is then just the
probability of attaining each microstate.

The expected or mean number of successes $\mu$ after $N$ trials is as follows:

$$\begin{aligned}
    \boxed{
        \mu = N p
    }
\end{aligned}$$

<div class="accordion">
<input type="checkbox" id="proof-mean"/>
<label for="proof-mean">Proof</label>
<div class="hidden">
<label for="proof-mean">Proof.</label>
The trick is to treat $p$ and $q$ as independent until the last moment:

$$\begin{aligned}
    \mu
    &= \sum_{n = 0}^N n \binom{N}{n} p^n q^{N - n}
    = \sum_{n = 0}^N \binom{N}{n} \Big( p \pdv{(p^n)}{p} \Big) q^{N - n}
    \\
    &= p \pdv{}{p} \sum_{n = 0}^N \binom{N}{n} p^n q^{N - n}
    = p \pdv{}{p} (p + q)^N
    = N p (p + q)^{N - 1}
\end{aligned}$$

Inserting $q = 1 - p$ then gives the desired result.
</div>
</div>

Meanwhile, we find the following variance $\sigma^2$,
with $\sigma$ being the standard deviation:

$$\begin{aligned}
    \boxed{
        \sigma^2 = N p q
    }
\end{aligned}$$

<div class="accordion">
<input type="checkbox" id="proof-var"/>
<label for="proof-var">Proof</label>
<div class="hidden">
<label for="proof-var">Proof.</label>
We use the same trick to calculate $\overline{n^2}$
(the mean squared number of successes):

$$\begin{aligned}
    \overline{n^2}
    &= \sum_{n = 0}^N n^2 \binom{N}{n} p^n q^{N - n}
    = \sum_{n = 0}^N \binom{N}{n} \Big( p \pdv{}{p} \Big)^2 p^n q^{N - n}
    \\
    &= \Big( p \pdv{}{p} \Big)^2 \sum_{n = 0}^N \binom{N}{n} p^n q^{N - n}
    = \Big( p \pdv{}{p} \Big)^2 (p + q)^N
    \\
    &= N p \pdv{}{p} p (p + q)^{N - 1}
    = N p \big( (p + q)^{N - 1} + (N - 1) p (p + q)^{N - 2} \big)
    \\
    &= N p + N^2 p^2 - N p^2
\end{aligned}$$

In the last step, we have used $p + q = 1$.
Combining this result with the earlier expression $\mu = N p$, we find the variance $\sigma^2$:

$$\begin{aligned}
    \sigma^2
    &= \overline{n^2} - \mu^2
    = N p + N^2 p^2 - N p^2 - N^2 p^2
    = N p (1 - p)
\end{aligned}$$

By inserting $q = 1 - p$, we arrive at the desired expression.
</div>
</div>
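The boxed results above are easy to check numerically.
Below is a minimal sketch in Python (standard library only;
the values $N = 50$ and $p = 0.3$ are arbitrary choices for illustration)
that evaluates $P_N(n)$ by direct summation and compares the sums against $N p$ and $N p q$:

```python
from math import comb

# Arbitrary example values for illustration
N, p = 50, 0.3
q = 1 - p

def P(n):
    """Binomial probability P_N(n) = C(N, n) p^n q^(N - n)."""
    return comb(N, n) * p**n * q**(N - n)

ns = range(N + 1)
total    = sum(P(n) for n in ns)                    # should be 1
mean     = sum(n * P(n) for n in ns)                # should equal N p
variance = sum(n**2 * P(n) for n in ns) - mean**2   # should equal N p q

print(f"normalization: {total:.12f}")
print(f"mean:     {mean:.6f}   (N p   = {N * p:.6f})")
print(f"variance: {variance:.6f}   (N p q = {N * p * q:.6f})")
```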
As $N \to \infty$, the binomial distribution
turns into the continuous normal distribution,
a fact that is sometimes called the **de Moivre-Laplace theorem**:

$$\begin{aligned}
    \boxed{
        \lim_{N \to \infty} P_N(n) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\Big(\!-\!\frac{(n - \mu)^2}{2 \sigma^2} \Big)
    }
\end{aligned}$$

<div class="accordion">
<input type="checkbox" id="proof-normal"/>
<label for="proof-normal">Proof</label>
<div class="hidden">
<label for="proof-normal">Proof.</label>
We take the Taylor expansion of $\ln\!\big(P_N(n)\big)$
around the mean $\mu = N p$:

$$\begin{aligned}
    \ln\!\big(P_N(n)\big)
    &= \sum_{m = 0}^\infty \frac{(n - \mu)^m}{m!} D_m(\mu)
    \quad \mathrm{where} \quad
    D_m(n) = \dvn{m}{\ln\!\big(P_N(n)\big)}{n}
\end{aligned}$$

We use Stirling's approximation $\ln(n!) \approx n \ln(n) - n$
to calculate the factorials in $D_m$:

$$\begin{aligned}
    \ln\!\big(P_N(n)\big)
    &= \ln(N!) - \ln(n!) - \ln\!\big((N - n)!\big) + n \ln(p) + (N - n) \ln(q)
    \\
    &\approx \ln(N!) - n \big( \ln(n)\!-\!\ln(p)\!-\!1 \big) - (N\!-\!n) \big( \ln(N\!-\!n)\!-\!\ln(q)\!-\!1 \big)
\end{aligned}$$

For $D_0(\mu)$, we need to use a stronger version of Stirling's approximation,
$\ln(n!) \approx n \ln(n) - n + \frac{1}{2} \ln(2 \pi n)$,
to get a non-zero result. We take advantage of $N - N p = N q$:

$$\begin{aligned}
    D_0(\mu)
    &= \ln(N!) - \ln\!\big((N p)!\big) - \ln\!\big((N q)!\big) + N p \ln(p) + N q \ln(q)
    \\
    &= \Big( N \ln(N) - N + \frac{1}{2} \ln(2 \pi N) \Big)
    - \Big( N p \ln(N p) - N p + \frac{1}{2} \ln(2 \pi N p) \Big)
    \\
    &\qquad - \Big( N q \ln(N q) - N q + \frac{1}{2} \ln(2 \pi N q) \Big)
    + N p \ln(p) + N q \ln(q)
    \\
    &= N \ln(N) - N (p + q) \ln(N) + N (p + q) - N - \frac{1}{2} \ln(2 \pi N p q)
    \\
    &= - \frac{1}{2} \ln(2 \pi N p q)
    = \ln\!\Big( \frac{1}{\sqrt{2 \pi \sigma^2}} \Big)
\end{aligned}$$

Next, we expect that $D_1(\mu) = 0$, because $P_N(n)$ has its maximum at $n = \mu$.
This is indeed the case:

$$\begin{aligned}
    D_1(n)
    &= - \big( \ln(n)\!-\!\ln(p)\!-\!1 \big) + \big( \ln(N\!-\!n)\!-\!\ln(q)\!-\!1 \big) - 1 + 1
    \\
    &= - \ln(n) + \ln(N - n) + \ln(p) - \ln(q)
    \\
    D_1(\mu)
    &= \ln(N q) - \ln(N p) + \ln(p) - \ln(q)
    = \ln(N p q) - \ln(N p q)
    = 0
\end{aligned}$$

For the same reason, we expect that $D_2(\mu)$ is negative.
We find the following expression:

$$\begin{aligned}
    D_2(n)
    &= - \frac{1}{n} - \frac{1}{N - n}
    \qquad
    D_2(\mu)
    = - \frac{1}{N p} - \frac{1}{N q}
    = - \frac{p + q}{N p q}
    = - \frac{1}{\sigma^2}
\end{aligned}$$

The higher-order derivatives tend to zero for $N \to \infty$, so we discard them:

$$\begin{aligned}
    D_3(n)
    = \frac{1}{n^2} - \frac{1}{(N - n)^2}
    \qquad
    D_4(n)
    = - \frac{2}{n^3} - \frac{2}{(N - n)^3}
    \qquad
    \cdots
\end{aligned}$$

Putting everything together, for large $N$,
the Taylor series approximately becomes:

$$\begin{aligned}
    \ln\!\big(P_N(n)\big)
    \approx D_0(\mu) + \frac{(n - \mu)^2}{2} D_2(\mu)
    = \ln\!\Big( \frac{1}{\sqrt{2 \pi \sigma^2}} \Big) - \frac{(n - \mu)^2}{2 \sigma^2}
\end{aligned}$$

Taking $\exp$ of this expression then yields a normalized Gaussian distribution.
</div>
</div>


## References
1.  H. Gould, J. Tobochnik,
    *Statistical and thermal physics*, 2nd edition,
    Princeton.
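To see the de Moivre-Laplace limit in action, here is a similar minimal sketch
(again Python standard library only; $N = 500$ and $p = 0.4$ are arbitrary choices)
that compares the exact binomial probabilities with the limiting Gaussian near the mean:

```python
from math import comb, exp, pi, sqrt

# Arbitrary example values; larger N gives closer agreement
N, p = 500, 0.4
q = 1 - p
mu, var = N * p, N * p * q

def binom(n):
    """Exact binomial probability P_N(n)."""
    return comb(N, n) * p**n * q**(N - n)

def gauss(n):
    """Normal approximation from the de Moivre-Laplace theorem."""
    return exp(-(n - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

# Compare the two within two standard deviations of the mean
lo, hi = int(mu - 2 * sqrt(var)), int(mu + 2 * sqrt(var))
for n in range(lo, hi + 1, 5):
    print(f"n = {n:3d}   binomial = {binom(n):.6e}   gaussian = {gauss(n):.6e}")
```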