
# Conditional expectation

Recall that the expectation value $\mathbf{E}[X]$ of a random variable $X$ depends on both the probability space $(\Omega, \mathcal{F}, P)$ on which $X$ is defined, and on the definition of $X$ itself.

The conditional expectation $\mathbf{E}[X|A]$ is the expectation value of $X$ given that an event $A$ has occurred, i.e. only the outcomes $\omega \in \Omega$ satisfying $\omega \in A$ should be considered. If $A$ is obtained by observing a variable, then $\mathbf{E}[X|A]$ is a random variable in its own right.

Consider two random variables $X$ and $Y$ on the same probability space $(\Omega, \mathcal{F}, P)$, and suppose that $\Omega$ is discrete. If $Y = y$ has been observed, then the conditional expectation of $X$ given the event $Y = y$ is as follows:

\begin{aligned} \mathbf{E}[X | Y \!=\! y] = \sum_{x} x \: Q(X \!=\! x) \qquad \quad Q(X \!=\! x) = \frac{P(X \!=\! x \cap Y \!=\! y)}{P(Y \!=\! y)} \end{aligned}

Where $Q$ is a renormalized probability function, which assigns zero to all events incompatible with $Y = y$. If we allow $\Omega$ to be continuous, then from the definition of $\mathbf{E}[X]$, we know that the following Lebesgue integral can be used, which we call $f(y)$:

\begin{aligned} \mathbf{E}[X | Y \!=\! y] = f(y) = \int_\Omega X(\omega) \dd{Q(\omega)} \end{aligned}

However, this is only valid if $P(Y \!=\! y) > 0$, which is a problem for continuous sample spaces $\Omega$, where a single value $Y = y$ typically has probability zero. Sticking with the assumption $P(Y \!=\! y) > 0$, notice that:

\begin{aligned} f(y) = \frac{1}{P(Y \!=\! y)} \int_\Omega X(\omega) \dd{P(\omega \cap Y \!=\! y)} = \frac{\mathbf{E}[X \cdot I(Y \!=\! y)]}{P(Y \!=\! y)} \end{aligned}

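Where $I$ is the indicator function, equal to $1$ if its argument is true, and $0$ if not. Multiplying the definition of $f(y)$ by $P(Y \!=\! y)$ then leads us to the following, where the second step uses that $f(Y) = f(y)$ whenever $Y = y$, and that $\mathbf{E}[I(Y \!=\! y)] = P(Y \!=\! y)$: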

\begin{aligned} \mathbf{E}[X \cdot I(Y \!=\! y)] &= f(y) \cdot P(Y \!=\! y) \\ &= \mathbf{E}[f(Y) \cdot I(Y \!=\! y)] \end{aligned}
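To make this identity concrete, here is a minimal sketch for a discrete $\Omega$, assuming a hypothetical example of two fair dice where $X$ is their sum and $Y$ is the first die. The names `omega`, `expectation` and `f` are illustrative choices, and exact fractions are used so that the equalities can be checked without rounding:

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: ordered pairs of two fair dice, all equally likely.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):  # sum of both dice
    return w[0] + w[1]

def Y(w):  # value of the first die
    return w[0]

def expectation(g):
    """E[g] as a sum over outcomes of g(w) * P(w)."""
    return sum(g(w) * P[w] for w in omega)

def f(y):
    """E[X | Y = y], computed with the renormalized probability Q."""
    p_y = sum(P[w] for w in omega if Y(w) == y)
    return sum(X(w) * P[w] / p_y for w in omega if Y(w) == y)

for y in range(1, 7):
    lhs = expectation(lambda w: X(w) * (Y(w) == y))     # E[X I(Y=y)]
    rhs = expectation(lambda w: f(Y(w)) * (Y(w) == y))  # E[f(Y) I(Y=y)]
    assert lhs == rhs == f(y) * Fraction(1, 6)
    print(y, f(y))
```

In this example $f(y) = y + 7/2$: the first die is known, and the second contributes its unconditional mean.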

Recall that because $Y$ is a random variable, $\mathbf{E}[X|Y] = f(Y)$ is too. In other words, $f$ maps $Y$ to another random variable, which, thanks to the Doob-Dynkin lemma (see random variable), means that $\mathbf{E}[X|Y]$ is measurable with respect to $\sigma(Y)$. Intuitively, this makes sense: $\mathbf{E}[X|Y]$ cannot contain more information about events than the $Y$ it was calculated from.

This suggests a straightforward generalization of the above: instead of a specific value $Y = y$, we can condition on any information from $Y$. If $\mathcal{H} = \sigma(Y)$ is the information generated by $Y$, then the conditional expectation $\mathbf{E}[X|\mathcal{H}] = Z$ is $\mathcal{H}$-measurable, and given by a $Z$ satisfying:

\begin{aligned} \boxed{ \mathbf{E}\big[X \cdot I(H)\big] = \mathbf{E}\big[Z \cdot I(H)\big] } \end{aligned}

For any $H \in \mathcal{H}$. Note that $Z$ is almost surely unique: “almost” because its value can be changed arbitrarily on an event $A$ with zero probability $P(A) = 0$ without affecting the expectations above. Fortunately, if there exists a continuous $f$ such that $\mathbf{E}[X | \sigma(Y)] = f(Y)$, then $Z = \mathbf{E}[X | \sigma(Y)]$ is unique.
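On a finite sample space, the boxed definition can be checked by brute force. The sketch below assumes a hypothetical six-outcome space and takes $\mathcal{H}$ to be generated by a partition of $\Omega$, so that its events are exactly the unions of cells; $Z$ is then the probability-weighted average of $X$ over each cell. The probabilities, the choice of $X$ and the helper names are all illustrative:

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space with six outcomes.
P = {0: Fraction(1, 12), 1: Fraction(1, 6), 2: Fraction(1, 4),
     3: Fraction(1, 4), 4: Fraction(1, 6), 5: Fraction(1, 12)}
X = {w: w * w for w in P}  # an arbitrary random variable

# H is generated by this partition: its events are all unions of cells.
partition = [{0, 1}, {2, 3, 4}, {5}]

def cond_exp(X, partition):
    """Return Z = E[X | H] as a dict that is constant on each cell."""
    Z = {}
    for cell in partition:
        p_cell = sum(P[w] for w in cell)
        avg = sum(X[w] * P[w] for w in cell) / p_cell
        for w in cell:
            Z[w] = avg
    return Z

def expectation_on(V, event):
    """E[V * I(event)]."""
    return sum(V[w] * P[w] for w in event)

Z = cond_exp(X, partition)

# Enumerate every event in H, including the empty set and Omega itself.
events = [set().union(*cells)
          for r in range(len(partition) + 1)
          for cells in combinations(partition, r)]

assert all(expectation_on(X, H) == expectation_on(Z, H) for H in events)
```

Because `Z` is constant on each cell, it is $\mathcal{H}$-measurable by construction, and the final assertion confirms the defining property for all eight events in $\mathcal{H}$.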

## Properties

A conditional expectation defined in this way has many useful properties, most notably linearity: $\mathbf{E}[aX \!+\! bY | \mathcal{H}] = a \mathbf{E}[X|\mathcal{H}] + b \mathbf{E}[Y|\mathcal{H}]$ for any $a, b \in \mathbb{R}$.

The tower property states that if $\mathcal{F} \supset \mathcal{G} \supset \mathcal{H}$, then $\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbf{E}[X|\mathcal{H}]$. Intuitively, this works as follows: suppose person $G$ knows more about $X$ than person $H$. Then $\mathbf{E}[X | \mathcal{H}]$ is $H$’s expectation, $\mathbf{E}[X | \mathcal{G}]$ is $G$’s “better” expectation, and $\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}]$ is $H$’s prediction about what $G$’s expectation will be. However, $H$ does not have access to $G$’s extra information, so $H$’s best prediction is simply $\mathbf{E}[X | \mathcal{H}]$.

The law of total expectation says that $\mathbf{E}[\mathbf{E}[X | \mathcal{G}]] = \mathbf{E}[X]$, and follows from the above tower property by choosing $\mathcal{H}$ to contain no information: $\mathcal{H} = \{ \varnothing, \Omega \}$.
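Both statements can be verified numerically on a small example. The following sketch assumes eight equally likely outcomes with arbitrary values for $X$, and represents $\mathcal{G}$ and $\mathcal{H}$ by nested partitions named `fine` and `coarse`; these names and numbers are hypothetical and only serve as an illustration:

```python
from fractions import Fraction

# Hypothetical setup: eight equally likely outcomes, arbitrary values for X.
omega = range(8)
P = {w: Fraction(1, 8) for w in omega}
X = {w: Fraction(v) for w, v in zip(omega, [3, 1, 4, 1, 5, 9, 2, 6])}

# Nested information: each cell of `coarse` (H) is a union of cells of `fine` (G).
fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]
trivial = [set(omega)]  # the sigma-algebra containing only the empty set and Omega

def cond_exp(V, partition):
    """E[V | sigma(partition)]: the P-weighted average of V on each cell."""
    out = {}
    for cell in partition:
        avg = sum(V[w] * P[w] for w in cell) / sum(P[w] for w in cell)
        out.update({w: avg for w in cell})
    return out

# Tower property: E[E[X|G]|H] = E[X|H].
assert cond_exp(cond_exp(X, fine), coarse) == cond_exp(X, coarse)

# Law of total expectation: conditioning on the trivial sigma-algebra gives E[X].
mean_X = sum(X[w] * P[w] for w in omega)
assert cond_exp(cond_exp(X, fine), trivial) == {w: mean_X for w in omega}
```

Conditioning on `trivial` produces a constant random variable, which is why the law of total expectation appears here as a special case of the tower property.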

Another useful property is that $\mathbf{E}[X | \mathcal{H}] = X$ if $X$ is $\mathcal{H}$-measurable. In other words, if $\mathcal{H}$ already contains all the information extractable from $X$, then we know $X$’s exact value. Conveniently, this can easily be generalized to products: if $X$ is $\mathcal{H}$-measurable, then $\mathbf{E}[XY | \mathcal{H}] = X \mathbf{E}[Y | \mathcal{H}]$. Since $X$’s value is known, it can simply be factored out.
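As a sketch of why the factoring works, consider the special case where $X = I(G)$ is the indicator of some event $G \in \mathcal{H}$; this restriction is an assumption made here to keep the argument short. For any $H \in \mathcal{H}$ we also have $G \cap H \in \mathcal{H}$, so the defining property of $\mathbf{E}[Y | \mathcal{H}]$ can be applied to the event $G \cap H$:

\begin{aligned} \mathbf{E}\big[I(G) \: Y \cdot I(H)\big] = \mathbf{E}\big[Y \cdot I(G \cap H)\big] = \mathbf{E}\big[\mathbf{E}[Y | \mathcal{H}] \cdot I(G \cap H)\big] = \mathbf{E}\big[I(G) \: \mathbf{E}[Y | \mathcal{H}] \cdot I(H)\big] \end{aligned}

Since $I(G) \: \mathbf{E}[Y | \mathcal{H}]$ is $\mathcal{H}$-measurable, it satisfies the definition of $\mathbf{E}[I(G) \: Y | \mathcal{H}]$. A general $\mathcal{H}$-measurable $X$ is then handled by the usual approximation with linear combinations of such indicators.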

Armed with this definition of conditional expectation, we can define other conditional quantities, such as the conditional variance $\mathbf{V}[X | \mathcal{H}]$:

\begin{aligned} \mathbf{V}[X | \mathcal{H}] = \mathbf{E}[X^2 | \mathcal{H}] - \big[\mathbf{E}[X | \mathcal{H}]\big]^2 \end{aligned}

The law of total variance then states that $\mathbf{V}[X] = \mathbf{E}[\mathbf{V}[X | \mathcal{H}]] + \mathbf{V}[\mathbf{E}[X | \mathcal{H}]]$.
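This can be verified from the definition of $\mathbf{V}[X | \mathcal{H}]$ and the law of total expectation. Writing out both terms on the right-hand side:

\begin{aligned} \mathbf{E}\big[\mathbf{V}[X | \mathcal{H}]\big] &= \mathbf{E}\big[\mathbf{E}[X^2 | \mathcal{H}]\big] - \mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]^2\big] = \mathbf{E}[X^2] - \mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]^2\big] \\ \mathbf{V}\big[\mathbf{E}[X | \mathcal{H}]\big] &= \mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]^2\big] - \big(\mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]\big]\big)^2 = \mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]^2\big] - \big(\mathbf{E}[X]\big)^2 \end{aligned}

Adding these, the terms $\mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]^2\big]$ cancel, leaving $\mathbf{E}[X^2] - (\mathbf{E}[X])^2 = \mathbf{V}[X]$.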

Likewise, we can define the conditional probability $P$, conditional distribution function $F_{X|\mathcal{H}}$, and conditional density function $f_{X|\mathcal{H}}$ like their non-conditional counterparts:

\begin{aligned} P(A | \mathcal{H}) = \mathbf{E}[I(A) | \mathcal{H}] \qquad F_{X|\mathcal{H}}(x) = P(X \le x | \mathcal{H}) \qquad f_{X|\mathcal{H}}(x) = \dv{F_{X|\mathcal{H}}}{x} \end{aligned}
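The first two definitions can be illustrated on a discrete example. The sketch below again assumes two fair dice, with $\mathcal{H} = \sigma(Y)$ generated by the first die and $X$ the sum of both; the helper names `cond_exp`, `cond_prob` and `cond_cdf` are hypothetical. Since $X$ is discrete here, the conditional density does not exist as an ordinary function, so only $P(A | \mathcal{H})$ and $F_{X|\mathcal{H}}$ are computed:

```python
from fractions import Fraction
from itertools import product

# Two fair dice; H = sigma(Y) is generated by the value of the first die.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}
partition = [[w for w in omega if w[0] == y] for y in range(1, 7)]

def cond_exp(V):
    """E[V | H] as a dict over outcomes: cell-wise P-weighted average of V."""
    out = {}
    for cell in partition:
        avg = sum(V(w) * P[w] for w in cell) / sum(P[w] for w in cell)
        out.update({w: avg for w in cell})
    return out

def cond_prob(A):
    """P(A | H) = E[I(A) | H], where the event A is given as a predicate."""
    return cond_exp(lambda w: int(A(w)))

def cond_cdf(X, x):
    """F_{X|H}(x) = P(X <= x | H)."""
    return cond_prob(lambda w: X(w) <= x)

def total(w):  # X: the sum of both dice
    return w[0] + w[1]

F7 = cond_cdf(total, 7)  # a random variable: one value per outcome

# Averaging over H recovers the unconditional probability P(X <= 7).
assert sum(F7[w] * P[w] for w in omega) == Fraction(21, 36)
```

Note that `F7` depends on the outcome only through the first die, as expected for an $\mathcal{H}$-measurable quantity.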

## References

1. U.H. Thygesen, Lecture notes on diffusions and stochastic differential equations, 2021, Polyteknisk Kompendie.