---
title: "Conditional expectation"
sort_title: "Conditional expectation"
date: 2021-10-23
categories:
- Mathematics
- Statistics
- Measure theory
- Stochastic analysis
layout: "concept"
---

Recall that the expectation value $$\mathbf{E}[X]$$ of a [random variable](/know/concept/random-variable/) $$X$$ is a function of the probability space $$(\Omega, \mathcal{F}, P)$$ on which $$X$$ is defined, and the definition of $$X$$ itself. The **conditional expectation** $$\mathbf{E}[X|A]$$ is the expectation value of $$X$$ given that an event $$A$$ has occurred, i.e. only the outcomes $$\omega \in \Omega$$ satisfying $$\omega \in A$$ should be considered. If $$A$$ is obtained by observing a variable, then $$\mathbf{E}[X|A]$$ is a random variable in its own right.

Consider two random variables $$X$$ and $$Y$$ on the same probability space $$(\Omega, \mathcal{F}, P)$$, and suppose that $$\Omega$$ is discrete. If $$Y = y$$ has been observed, then the conditional expectation of $$X$$ given the event $$Y = y$$ is as follows:

$$\begin{aligned}
\mathbf{E}[X | Y \!=\! y]
= \sum_{x} x \: Q(X \!=\! x)
\qquad \quad
Q(X \!=\! x)
= \frac{P(X \!=\! x \cap Y \!=\! y)}{P(Y \!=\! y)}
\end{aligned}$$

Where $$Q$$ is a renormalized probability function, which assigns zero to all events incompatible with $$Y = y$$. If we allow $$\Omega$$ to be continuous, then from the definition of $$\mathbf{E}[X]$$, we know that the following Lebesgue integral can be used, which we call $$f(y)$$:

$$\begin{aligned}
\mathbf{E}[X | Y \!=\! y]
= f(y)
= \int_\Omega X(\omega) \dd{Q(\omega)}
\end{aligned}$$

However, this is only valid if $$P(Y \!=\! y) > 0$$, which is a problem for continuous sample spaces $$\Omega$$. Sticking with the assumption $$P(Y \!=\! y) > 0$$, notice that:

$$\begin{aligned}
f(y)
= \frac{1}{P(Y \!=\! y)} \int_\Omega X(\omega) \dd{P(\omega \cap Y \!=\! y)}
= \frac{\mathbf{E}[X \cdot I(Y \!=\! y)]}{P(Y \!=\! y)}
\end{aligned}$$

Where $$I$$ is the indicator function, equal to $$1$$ if its argument is true, and $$0$$ if not. Multiplying the definition of $$f(y)$$ by $$P(Y \!=\! y)$$ then leads us to:

$$\begin{aligned}
\mathbf{E}[X \cdot I(Y \!=\! y)]
&= f(y) \cdot P(Y \!=\! y)
\\
&= \mathbf{E}[f(Y) \cdot I(Y \!=\! y)]
\end{aligned}$$

Recall that because $$Y$$ is a random variable, $$\mathbf{E}[X|Y] = f(Y)$$ is too. In other words, $$f$$ maps $$Y$$ to another random variable, which, thanks to the *Doob-Dynkin lemma* (see [random variable](/know/concept/random-variable/)), means that $$\mathbf{E}[X|Y]$$ is measurable with respect to $$\sigma(Y)$$. Intuitively, this makes sense: $$\mathbf{E}[X|Y]$$ cannot contain more information about events than the $$Y$$ it was calculated from.

This suggests a straightforward generalization of the above: instead of a specific value $$Y = y$$, we can condition on *any* information from $$Y$$. If $$\mathcal{H} = \sigma(Y)$$ is the information generated by $$Y$$, then the conditional expectation $$\mathbf{E}[X|\mathcal{H}] = Z$$ is $$\mathcal{H}$$-measurable, and given by a $$Z$$ satisfying:

$$\begin{aligned}
\boxed{
\mathbf{E}\big[X \cdot I(H)\big]
= \mathbf{E}\big[Z \cdot I(H)\big]
}
\end{aligned}$$

For any $$H \in \mathcal{H}$$. Note that $$Z$$ is almost surely unique: *almost* because it could take any value for an event $$A$$ with zero probability $$P(A) = 0$$. Fortunately, if there exists a continuous $$f$$ such that $$\mathbf{E}[X | \sigma(Y)] = f(Y)$$, then $$Z = \mathbf{E}[X | \sigma(Y)]$$ is unique.
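To make the defining property more concrete, the following sketch checks it by brute force on a small discrete probability space. The specific choices of $$\Omega$$, $$P$$, $$X$$ and $$Y$$ are arbitrary illustrations, not taken from the source text: $$Z = \mathbf{E}[X | \sigma(Y)]$$ is computed cell-by-cell with the renormalized sum from above, and the partial-averaging identity is then verified for the generating events $$H = \{Y \!=\! y\}$$.

```python
# Minimal numerical sketch (illustrative choices, not from the source):
# a uniform discrete probability space, on which we verify the defining
# property E[X * I(H)] = E[Z * I(H)] for Z = E[X | sigma(Y)].
from fractions import Fraction

omega = range(6)
P = {w: Fraction(1, 6) for w in omega}   # uniform probability measure

def X(w): return w**2          # some random variable X(omega)
def Y(w): return w % 2         # Y splits Omega into even and odd outcomes

def expect(f):
    """E[f] = sum of f(omega) * P(omega) over the discrete sample space."""
    return sum(f(w) * P[w] for w in omega)

def Z(w):
    """Z(omega) = E[X | Y = Y(omega)]: average X over the cell {Y = Y(omega)}."""
    cell = [v for v in omega if Y(v) == Y(w)]
    return sum(X(v) * P[v] for v in cell) / sum(P[v] for v in cell)

# Partial-averaging check for the generating events H = {Y = y}:
for y in (0, 1):
    lhs = expect(lambda w: X(w) * (Y(w) == y))   # E[X * I(Y = y)]
    rhs = expect(lambda w: Z(w) * (Y(w) == y))   # E[Z * I(Y = y)]
    print(y, lhs, rhs, lhs == rhs)
```

Since exact fractions are used, the two sides agree exactly for every generating event, as the defining property demands.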
## Properties

A conditional expectation defined in this way has many useful properties, most notably linearity: $$\mathbf{E}[aX \!+\! bY | \mathcal{H}] = a \mathbf{E}[X|\mathcal{H}] + b \mathbf{E}[Y|\mathcal{H}]$$ for any $$a, b \in \mathbb{R}$$.

The **tower property** states that if $$\mathcal{F} \supset \mathcal{G} \supset \mathcal{H}$$, then $$\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbf{E}[X|\mathcal{H}]$$. Intuitively, this works as follows: suppose person $$G$$ knows more about $$X$$ than person $$H$$; then $$\mathbf{E}[X | \mathcal{H}]$$ is $$H$$'s expectation, $$\mathbf{E}[X | \mathcal{G}]$$ is $$G$$'s "better" expectation, and $$\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}]$$ is $$H$$'s prediction of what $$G$$'s expectation will be. However, $$H$$ does not have access to $$G$$'s extra information, so $$H$$'s best prediction is simply $$\mathbf{E}[X | \mathcal{H}]$$.

The **law of total expectation** says that $$\mathbf{E}[\mathbf{E}[X | \mathcal{G}]] = \mathbf{E}[X]$$, and follows from the above tower property by choosing $$\mathcal{H}$$ to contain no information: $$\mathcal{H} = \{ \varnothing, \Omega \}$$.

Another useful property is that $$\mathbf{E}[X | \mathcal{H}] = X$$ if $$X$$ is $$\mathcal{H}$$-measurable. In other words, if $$\mathcal{H}$$ already contains all the information extractable from $$X$$, then we know $$X$$'s exact value. Conveniently, this generalizes to products: $$\mathbf{E}[XY | \mathcal{H}] = X \mathbf{E}[Y | \mathcal{H}]$$ if $$X$$ is $$\mathcal{H}$$-measurable, since $$X$$'s value is known and can simply be factored out.

Armed with this definition of conditional expectation, we can define other conditional quantities, such as the **conditional variance** $$\mathbf{V}[X | \mathcal{H}]$$:

$$\begin{aligned}
\mathbf{V}[X | \mathcal{H}]
= \mathbf{E}[X^2 | \mathcal{H}] - \big[\mathbf{E}[X | \mathcal{H}]\big]^2
\end{aligned}$$

The **law of total variance** then states that $$\mathbf{V}[X] = \mathbf{E}[\mathbf{V}[X | \mathcal{H}]] + \mathbf{V}[\mathbf{E}[X | \mathcal{H}]]$$. This, together with the tower property and the law of total expectation, is checked numerically in the sketch at the end of this page.

Likewise, we can define the **conditional probability** $$P$$, **conditional distribution function** $$F_{X|\mathcal{H}}$$, and **conditional density function** $$f_{X|\mathcal{H}}$$ analogously to their non-conditional counterparts:

$$\begin{aligned}
P(A | \mathcal{H})
= \mathbf{E}[I(A) | \mathcal{H}]
\qquad
F_{X|\mathcal{H}}(x)
= P(X \le x | \mathcal{H})
\qquad
f_{X|\mathcal{H}}(x)
= \dv{F_{X|\mathcal{H}}}{x}
\end{aligned}$$

## References
1.  U.H. Thygesen,
    *Lecture notes on diffusions and stochastic differential equations*,
    2021, Polyteknisk Kompendie.
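As a sanity check of the properties above, the following sketch verifies the tower property, the law of total expectation, and the law of total variance exactly on a small discrete space. The choices of $$\Omega$$, $$X$$, $$Y$$ and $$W$$ below are again arbitrary illustrations; the nesting $$\sigma(W) \subset \sigma(Y)$$ holds by construction, because $$W$$ is a function of $$Y$$.

```python
# Minimal numerical sketch (illustrative choices, not from the source):
# exact checks of the tower property, the law of total expectation, and
# the law of total variance, with G = sigma(Y) finer than H = sigma(W).
from fractions import Fraction

omega = range(12)
P = {w: Fraction(1, 12) for w in omega}   # uniform probability measure

def X(w): return (w - 5)**2   # some random variable
def Y(w): return w % 4        # generates G = sigma(Y)
def W(w): return w % 2        # W = Y mod 2, so sigma(W) is a sub-sigma-algebra of sigma(Y)

def expect(f):
    """E[f] = sum of f(omega) * P(omega) over the discrete sample space."""
    return sum(f(w) * P[w] for w in omega)

def cond_expect(f, label):
    """Return the map w -> E[f | sigma(label)](w): average f over the cell of w."""
    def g(w):
        cell = [v for v in omega if label(v) == label(w)]
        return sum(f(v) * P[v] for v in cell) / sum(P[v] for v in cell)
    return g

def var(f):
    """V[f] = E[f^2] - (E[f])^2."""
    return expect(lambda w: f(w)**2) - expect(f)**2

E_X_given_G = cond_expect(X, Y)            # E[X | G]
E_X_given_H = cond_expect(X, W)            # E[X | H]
tower       = cond_expect(E_X_given_G, W)  # E[ E[X | G] | H ]

def V_X_given_H(w):
    """Conditional variance V[X | H] = E[X^2 | H] - (E[X | H])^2."""
    return cond_expect(lambda v: X(v)**2, W)(w) - E_X_given_H(w)**2

# Tower property: E[E[X|G]|H] = E[X|H], pointwise on Omega.
print(all(tower(w) == E_X_given_H(w) for w in omega))

# Law of total expectation: E[E[X|G]] = E[X].
print(expect(E_X_given_G) == expect(X))

# Law of total variance: V[X] = E[V[X|H]] + V[E[X|H]].
print(var(X) == expect(V_X_given_H) + var(E_X_given_H))
```

All three checks print `True`, since on a finite space with exact fractions the identities hold without any rounding error.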