Diffstat (limited to 'source/know/concept/binomial-distribution/index.md')
-rw-r--r-- source/know/concept/binomial-distribution/index.md 220
1 files changed, 220 insertions, 0 deletions
diff --git a/source/know/concept/binomial-distribution/index.md b/source/know/concept/binomial-distribution/index.md
new file mode 100644
index 0000000..c25da3d
--- /dev/null
+++ b/source/know/concept/binomial-distribution/index.md
@@ -0,0 +1,220 @@
+---
+title: "Binomial distribution"
+date: 2021-02-26
+categories:
+- Statistics
+- Mathematics
+layout: "concept"
+---
+
+The **binomial distribution** is a discrete probability distribution
+describing a **Bernoulli process**: a set of $N$ independent trials, where
+each has only two possible outcomes, "success" and "failure",
+the former with probability $p$ and the latter with $q = 1 - p$.
+The binomial distribution then gives the probability
+that $n$ out of the $N$ trials succeed:
+
+$$\begin{aligned}
+ \boxed{
+ P_N(n) = \binom{N}{n} \: p^n q^{N - n}
+ }
+\end{aligned}$$
+
+The first factor is known as the **binomial coefficient**, which gives the
+number of microstates (i.e. distinct orderings of successes and failures) that have $n$ successes out of $N$ trials.
+These happen to be the coefficients in the expansion of the polynomial $(a + b)^N$,
+and can be read off of Pascal's triangle.
+It is defined as follows:
+
+$$\begin{aligned}
+ \boxed{
+ \binom{N}{n} = \frac{N!}{n! (N - n)!}
+ }
+\end{aligned}$$
+
+The remaining factor $p^n q^{N - n}$ is then just the
+probability of attaining each such microstate.
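+
+As a small worked example, consider $N = 4$ trials with $p = q = 1/2$
+(values chosen purely for illustration, e.g. four fair coin flips).
+There are six microstates with $n = 2$ successes
+(SSFF, SFSF, SFFS, FSSF, FSFS and FFSS, with S for success and F for failure),
+each occurring with probability $p^2 q^2 = 1/16$:
+
+$$\begin{aligned}
+    \binom{4}{2}
+    = \frac{4!}{2! \: 2!}
+    = 6
+    \qquad \implies \qquad
+    P_4(2)
+    = 6 \cdot \Big( \frac{1}{2} \Big)^2 \Big( \frac{1}{2} \Big)^2
+    = \frac{3}{8}
+\end{aligned}$$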
+
+The expected or mean number of successes $\mu$ after $N$ trials is as follows:
+
+$$\begin{aligned}
+ \boxed{
+ \mu = N p
+ }
+\end{aligned}$$
+
+<div class="accordion">
+<input type="checkbox" id="proof-mean"/>
+<label for="proof-mean">Proof</label>
+<div class="hidden">
+<label for="proof-mean">Proof.</label>
+The trick is to treat $p$ and $q$ as independent until the last moment:
+
+$$\begin{aligned}
+ \mu
+ &= \sum_{n = 0}^N n \binom{N}{n} p^n q^{N - n}
+ = \sum_{n = 0}^N \binom{N}{n} \Big( p \pdv{(p^n)}{p} \Big) q^{N - n}
+ \\
+ &= p \pdv{}{p}\sum_{n = 0}^N \binom{N}{n} p^n q^{N - n}
+ = p \pdv{}{p}(p + q)^N
+ = N p (p + q)^{N - 1}
+\end{aligned}$$
+
+Inserting $q = 1 - p$ then gives the desired result.
+</div>
+</div>
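+
+As a quick sanity check of the mean, take $N = 2$ (an illustrative choice),
+where the sum can be written out by hand
+using $P_2(0) = q^2$, $P_2(1) = 2 p q$ and $P_2(2) = p^2$:
+
+$$\begin{aligned}
+    \mu
+    = 0 \cdot q^2 + 1 \cdot 2 p q + 2 \cdot p^2
+    = 2 p (q + p)
+    = 2 p
+\end{aligned}$$
+
+This agrees with $\mu = N p$ for $N = 2$.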
+
+Meanwhile, we find the following variance $\sigma^2$,
+with $\sigma$ being the standard deviation:
+
+$$\begin{aligned}
+ \boxed{
+ \sigma^2 = N p q
+ }
+\end{aligned}$$
+
+<div class="accordion">
+<input type="checkbox" id="proof-var"/>
+<label for="proof-var">Proof</label>
+<div class="hidden">
+<label for="proof-var">Proof.</label>
+We use the same trick to calculate $\overline{n^2}$
+(the mean squared number of successes):
+
+$$\begin{aligned}
+ \overline{n^2}
+ &= \sum_{n = 0}^N n^2 \binom{N}{n} p^n q^{N - n}
+ = \sum_{n = 0}^N \binom{N}{n} \Big( p \pdv{}{p}\Big)^2 p^n q^{N - n}
+ \\
+ &= \Big( p \pdv{}{p}\Big)^2 \sum_{n = 0}^N \binom{N}{n} p^n q^{N - n}
+ = \Big( p \pdv{}{p}\Big)^2 (p + q)^N
+ \\
+ &= N p \pdv{}{p}\Big( p (p + q)^{N - 1} \Big)
+ = N p \big( (p + q)^{N - 1} + (N - 1) p (p + q)^{N - 2} \big)
+ \\
+ &= N p + N^2 p^2 - N p^2
+ \quad \mathrm{using} \quad
+ p + q = 1
+\end{aligned}$$
+
+Using this and the earlier expression $\mu = N p$, we find the variance $\sigma^2$:
+
+$$\begin{aligned}
+ \sigma^2
+ &= \overline{n^2} - \mu^2
+ = N p + N^2 p^2 - N p^2 - N^2 p^2
+ = N p (1 - p)
+\end{aligned}$$
+
+By inserting $q = 1 - p$, we arrive at the desired expression.
+</div>
+</div>
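+
+The variance formula can be verified by hand for a small case, say $N = 2$ (chosen for illustration),
+with $P_2(0) = q^2$, $P_2(1) = 2 p q$, $P_2(2) = p^2$ and $\mu = N p = 2 p$:
+
+$$\begin{aligned}
+    \overline{n^2}
+    &= 0^2 \cdot q^2 + 1^2 \cdot 2 p q + 2^2 \cdot p^2
+    = 2 p q + 4 p^2
+    \\
+    \sigma^2
+    &= \overline{n^2} - \mu^2
+    = 2 p q + 4 p^2 - (2 p)^2
+    = 2 p q
+\end{aligned}$$
+
+This matches $\sigma^2 = N p q$ for $N = 2$.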
+
+As $N \to \infty$, the binomial distribution
+turns into the continuous normal distribution,
+a fact that is sometimes called the **de Moivre-Laplace theorem**:
+
+$$\begin{aligned}
+ \boxed{
+ \lim_{N \to \infty} P_N(n) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\Big(\!-\!\frac{(n - \mu)^2}{2 \sigma^2} \Big)
+ }
+\end{aligned}$$
+
+<div class="accordion">
+<input type="checkbox" id="proof-normal"/>
+<label for="proof-normal">Proof</label>
+<div class="hidden">
+<label for="proof-normal">Proof.</label>
+We take the Taylor expansion of $\ln\!\big(P_N(n)\big)$
+around the mean $\mu = Np$:
+
+$$\begin{aligned}
+ \ln\!\big(P_N(n)\big)
+ &= \sum_{m = 0}^\infty \frac{(n - \mu)^m}{m!} D_m(\mu)
+ \quad \mathrm{where} \quad
+ D_m(n) = \dvn{m}{\ln\!\big(P_N(n)\big)}{n}
+\end{aligned}$$
+
+We use Stirling's approximation to calculate the factorials in $D_m$:
+
+$$\begin{aligned}
+ \ln\!\big(P_N(n)\big)
+ &= \ln(N!) - \ln(n!) - \ln\!\big((N - n)!\big) + n \ln(p) + (N - n) \ln(q)
+ \\
+ &\approx \ln(N!) - n \big( \ln(n)\!-\!\ln(p)\!-\!1 \big) - (N\!-\!n) \big( \ln(N\!-\!n)\!-\!\ln(q)\!-\!1 \big)
+\end{aligned}$$
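+
+For reference, the form of Stirling's approximation used above is the first one below;
+a stronger version adds a logarithmic correction, shown second.
+As an illustrative check (the value $n = 10$ is arbitrary), $\ln(10!) \approx 15.104$,
+whereas the two approximations give roughly $13.026$ and $15.096$, respectively:
+
+$$\begin{aligned}
+    \ln(n!)
+    \approx n \ln(n) - n
+    \qquad \quad
+    \ln(n!)
+    \approx n \ln(n) - n + \frac{1}{2} \ln(2 \pi n)
+\end{aligned}$$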
+
+For $D_0(\mu)$, we need to use a stronger version of Stirling's approximation
+to get a non-zero result. We take advantage of $N - N p = N q$:
+
+$$\begin{aligned}
+ D_0(\mu)
+ &= \ln(N!) - \ln\!\big((N p)!\big) - \ln\!\big((N q)!\big) + N p \ln(p) + N q \ln(q)
+ \\
+ &= \Big( N \ln(N) - N + \frac{1}{2} \ln(2\pi N) \Big)
+ - \Big( N p \ln(N p) - N p + \frac{1}{2} \ln(2\pi N p) \Big) \\
+ &\qquad - \Big( N q \ln(N q) - N q + \frac{1}{2} \ln(2\pi N q) \Big)
+ + N p \ln(p) + N q \ln(q)
+ \\
+ &= N \ln(N) - N (p + q) \ln(N) + N (p + q) - N - \frac{1}{2} \ln(2\pi N p q)
+ \\
+ &= - \frac{1}{2} \ln(2\pi N p q)
+ = \ln\!\Big( \frac{1}{\sqrt{2\pi \sigma^2}} \Big)
+\end{aligned}$$
+
+Next, we expect that $D_1(\mu) = 0$, because $P_N(n)$ has its maximum at $\mu$ for large $N$.
+This is indeed the case:
+
+$$\begin{aligned}
+ D_1(n)
+ &= - \big( \ln(n)\!-\!\ln(p)\!-\!1 \big) + \big( \ln(N\!-\!n)\!-\!\ln(q)\!-\!1 \big) - 1 + 1
+ \\
+ &= - \ln(n) + \ln(N - n) + \ln(p) - \ln(q)
+ \\
+ D_1(\mu)
+ &= \ln(N q) - \ln(N p) + \ln(p) - \ln(q)
+ = \ln(N p q) - \ln(N p q)
+ = 0
+\end{aligned}$$
+
+For the same reason, we expect that $D_2(\mu)$ is negative.
+We find the following expression:
+
+$$\begin{aligned}
+ D_2(n)
+ &= - \frac{1}{n} - \frac{1}{N - n}
+ \qquad
+ D_2(\mu)
+ = - \frac{1}{Np} - \frac{1}{Nq}
+ = - \frac{p + q}{N p q}
+ = - \frac{1}{\sigma^2}
+\end{aligned}$$
+
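+The higher-order derivatives vanish faster than $D_2$ for $N \to \infty$:
+at $n = \mu$, $D_m$ scales at most as $N^{1 - m}$, while typically $|n - \mu| \sim \sigma \sim \sqrt{N}$,
+so the $m$th term of the series scales at most as $N^{1 - m/2}$, and we discard all terms with $m \ge 3$: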
+
+$$\begin{aligned}
+ D_3(n)
+ = \frac{1}{n^2} - \frac{1}{(N - n)^2}
+ \qquad
+ D_4(n)
+ = - \frac{2}{n^3} - \frac{2}{(N - n)^3}
+ \qquad
+ \cdots
+\end{aligned}$$
+
+Putting everything together, for large $N$,
+the Taylor series approximately becomes:
+
+$$\begin{aligned}
+ \ln\!\big(P_N(n)\big)
+ \approx D_0(\mu) + \frac{(n - \mu)^2}{2} D_2(\mu)
+ = \ln\!\Big( \frac{1}{\sqrt{2\pi \sigma^2}} \Big) - \frac{(n - \mu)^2}{2 \sigma^2}
+\end{aligned}$$
+
+Taking $\exp$ of this expression then yields a normalized Gaussian distribution.
+</div>
+</div>
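+
+To get a feeling for the quality of this approximation,
+consider $N = 100$ and $p = q = 1/2$ (illustrative values),
+such that $\mu = 50$ and $\sigma^2 = 25$.
+At the peak $n = \mu$, the exact probability and the Gaussian value are, respectively:
+
+$$\begin{aligned}
+    P_{100}(50)
+    = \binom{100}{50} \frac{1}{2^{100}}
+    \approx 0.0796
+    \qquad \quad
+    \frac{1}{\sqrt{2 \pi \sigma^2}}
+    = \frac{1}{\sqrt{50 \pi}}
+    \approx 0.0798
+\end{aligned}$$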
+
+
+## References
+1. H. Gould, J. Tobochnik,
+ *Statistical and thermal physics*, 2nd edition,
+ Princeton.