
Conditional expectation

Recall that the expectation value \(\mathbf{E}[X]\) of a random variable \(X\) depends on the probability space \((\Omega, \mathcal{F}, P)\) on which \(X\) is defined, and on the definition of \(X\) itself.

The conditional expectation \(\mathbf{E}[X|A]\) is the expectation value of \(X\) given that an event \(A\) has occurred, i.e. only the outcomes \(\omega \in \Omega\) satisfying \(\omega \in A\) should be considered. If \(A\) is obtained by observing a variable, then \(\mathbf{E}[X|A]\) is a random variable in its own right.

Consider two random variables \(X\) and \(Y\) on the same probability space \((\Omega, \mathcal{F}, P)\), and suppose that \(\Omega\) is discrete. If \(Y = y\) has been observed, then the conditional expectation of \(X\) given the event \(Y = y\) is as follows:

\[\begin{aligned} \mathbf{E}[X | Y \!=\! y] = \sum_{x} x \: Q(X \!=\! x) \qquad \quad Q(X \!=\! x) = \frac{P(X \!=\! x \cap Y \!=\! y)}{P(Y \!=\! y)} \end{aligned}\]
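To make the discrete formula concrete, here is a minimal sketch in Python. The sample space (two fair dice), the variables `X` (sum of both dice) and `Y` (first die), and the helper name `cond_exp_given_value` are all assumptions chosen for illustration; they do not come from the text above.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: ordered pairs of two fair dice, all equally likely.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """Sum of both dice."""
    return w[0] + w[1]

def Y(w):
    """Value of the first die."""
    return w[0]

def cond_exp_given_value(X, Y, y):
    """E[X | Y = y], computed from the renormalized probability Q."""
    p_y = sum(P[w] for w in omega if Y(w) == y)
    if p_y == 0:
        raise ValueError("P(Y = y) must be positive")
    # Summing X(w) P(w) over outcomes with Y(w) = y is the same as
    # summing x P(X = x and Y = y) over all values x.
    return sum(X(w) * P[w] for w in omega if Y(w) == y) / p_y

# The function f(y) = E[X | Y = y], tabulated for every possible y.
f = {y: cond_exp_given_value(X, Y, y) for y in range(1, 7)}
```

In this example `f[y]` equals \(y + 7/2\): the observed first die, plus the mean of the unobserved second die.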

Here, \(Q\) is a renormalized probability function, which assigns zero to all events incompatible with \(Y = y\). If we allow \(\Omega\) to be continuous, then from the definition of \(\mathbf{E}[X]\), we know that the following Lebesgue integral can be used, which we call \(f(y)\):

\[\begin{aligned} \mathbf{E}[X | Y \!=\! y] = f(y) = \int_\Omega X(\omega) \dd{Q(\omega)} \end{aligned}\]

However, this is only valid if \(P(Y \!=\! y) > 0\), which is a problem for continuous sample spaces \(\Omega\), where a continuous \(Y\) typically has \(P(Y \!=\! y) = 0\) for every \(y\). Sticking with the assumption \(P(Y \!=\! y) > 0\), notice that:

\[\begin{aligned} f(y) = \frac{1}{P(Y \!=\! y)} \int_\Omega X(\omega) \dd{P(\omega \cap Y \!=\! y)} = \frac{\mathbf{E}[X \cdot I(Y \!=\! y)]}{P(Y \!=\! y)} \end{aligned}\]

Here, \(I\) is the indicator function, equal to \(1\) if its argument is true, and \(0\) if not. Multiplying the definition of \(f(y)\) by \(P(Y \!=\! y)\) then leads us to the following, where the last step uses that \(f(Y) = f(y)\) whenever \(Y = y\):

\[\begin{aligned} \mathbf{E}[X \cdot I(Y \!=\! y)] &= f(y) \cdot P(Y \!=\! y) \\ &= \mathbf{E}[f(Y) \cdot I(Y \!=\! y)] \end{aligned}\]

Recall that because \(Y\) is a random variable, \(\mathbf{E}[X|Y] = f(Y)\) is too. In other words, \(f\) maps \(Y\) to another random variable, which, thanks to the Doob-Dynkin lemma (see random variable), means that \(\mathbf{E}[X|Y]\) is measurable with respect to \(\sigma(Y)\). Intuitively, this makes sense: \(\mathbf{E}[X|Y]\) cannot contain more information about events than the \(Y\) it was calculated from.

This suggests a straightforward generalization of the above: instead of a specific value \(Y = y\), we can condition on any information from \(Y\). If \(\mathcal{H} = \sigma(Y)\) is the information generated by \(Y\), then the conditional expectation \(\mathbf{E}[X|\mathcal{H}] = Z\) is \(\mathcal{H}\)-measurable, and given by a \(Z\) satisfying:

\[\begin{aligned} \boxed{ \mathbf{E}\big[X \cdot I(H)\big] = \mathbf{E}\big[Z \cdot I(H)\big] } \end{aligned}\]

This must hold for any \(H \in \mathcal{H}\). Note that \(Z\) is almost surely unique: "almost" because it could take any value on an event \(A\) with zero probability \(P(A) = 0\). Fortunately, if there exists a continuous \(f\) such that \(\mathbf{E}[X | \sigma(Y)] = f(Y)\), then \(Z = \mathbf{E}[X | \sigma(Y)]\) is unique.
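On a finite sample space, the defining property can be checked by brute force. The sketch below again assumes two fair dice with `X` the sum and `Y` the first die (illustrative choices, not from the text). It uses the fact that every event in \(\sigma(Y)\) has the form \(\{Y \in B\}\) for some set \(B\) of values of \(Y\).

```python
from fractions import Fraction
from itertools import combinations, product

# Assumed sample space: two fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

X = lambda w: w[0] + w[1]   # sum of both dice
Y = lambda w: w[0]          # first die

def expect(g):
    """Plain expectation E[g] on the finite space."""
    return sum(g(w) * P[w] for w in omega)

# f(y) = E[X I(Y = y)] / P(Y = y), tabulated once.
values = sorted({Y(w) for w in omega})
f = {y: expect(lambda w: X(w) * (Y(w) == y)) / expect(lambda w: Y(w) == y)
     for y in values}

Z = lambda w: f[Y(w)]       # candidate for E[X | sigma(Y)]

# Every H in sigma(Y) is {Y in B} for a subset B of the values of Y.
events = [set(c) for r in range(len(values) + 1)
          for c in combinations(values, r)]

# Check E[X I(H)] = E[Z I(H)] for each of these events.
property_holds = all(
    expect(lambda w: X(w) * (Y(w) in B)) == expect(lambda w: Z(w) * (Y(w) in B))
    for B in events
)
```

Exact `Fraction` arithmetic is used so that the equality test is not affected by floating-point rounding.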


A conditional expectation defined in this way has many useful properties, most notably linearity: \(\mathbf{E}[aX \!+\! bY | \mathcal{H}] = a \mathbf{E}[X|\mathcal{H}] + b \mathbf{E}[Y|\mathcal{H}]\) for any \(a, b \in \mathbb{R}\).

The tower property states that if \(\mathcal{F} \supset \mathcal{G} \supset \mathcal{H}\), then \(\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbf{E}[X|\mathcal{H}]\). Intuitively, this works as follows: suppose person \(G\) knows more about \(X\) than person \(H\). Then \(\mathbf{E}[X | \mathcal{H}]\) is \(H\)’s expectation, \(\mathbf{E}[X | \mathcal{G}]\) is \(G\)’s “better” expectation, and \(\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}]\) is \(H\)’s prediction about what \(G\)’s expectation will be. However, \(H\) does not have access to \(G\)’s extra information, so \(H\)’s best prediction is simply \(\mathbf{E}[X | \mathcal{H}]\).

The law of total expectation says that \(\mathbf{E}[\mathbf{E}[X | \mathcal{G}]] = \mathbf{E}[X]\), and follows from the above tower property by choosing \(\mathcal{H}\) to contain no information: \(\mathcal{H} = \{ \varnothing, \Omega \}\).
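The tower property and the law of total expectation can both be verified numerically on a small example. This is a sketch under assumed choices: three fair coin flips, a hypothetical variable `X`, and σ-algebras generated by the first two flips (`G`) and the first flip only (`H`). The helper `cond_exp` is an illustrative name; it averages over the level sets of a labelling function, which is how conditioning on a finite generated σ-algebra works.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: three fair coin flips (0 = tails, 1 = heads).
omega = list(product((0, 1), repeat=3))
P = {w: Fraction(1, 8) for w in omega}

def expect(g):
    """Plain expectation E[g]."""
    return sum(g(w) * P[w] for w in omega)

def cond_exp(g, label):
    """E[g | sigma(label)], returned as a function of the outcome w."""
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return lambda w: num[label(w)] / den[label(w)]

X = lambda w: w[0] + 2 * w[1] + 4 * w[2]   # hypothetical example variable
G = lambda w: (w[0], w[1])                 # G knows the first two flips
H = lambda w: w[0]                         # H knows only the first flip

inner = cond_exp(X, G)                     # E[X | G]
lhs = cond_exp(inner, H)                   # E[E[X | G] | H]
rhs = cond_exp(X, H)                       # E[X | H]

tower_holds = all(lhs(w) == rhs(w) for w in omega)
total_expectation_holds = expect(inner) == expect(X)
```

Since `H` is a function of `G`, the σ-algebra generated by `H` is contained in the one generated by `G`, matching the nesting required above.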

Another useful property is that \(\mathbf{E}[X | \mathcal{H}] = X\) if \(X\) is \(\mathcal{H}\)-measurable. In other words, if \(\mathcal{H}\) already contains all the information extractable from \(X\), then we know \(X\)’s exact value. Conveniently, this can easily be generalized to products: if \(X\) is \(\mathcal{H}\)-measurable, then \(\mathbf{E}[XY | \mathcal{H}] = X \mathbf{E}[Y | \mathcal{H}]\), because \(X\)’s value is known and can simply be factored out.
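Both statements can be illustrated with a small check. The setup is an assumption made for this sketch: two fair dice, \(\mathcal{H}\) generated by the first die, `X` the square of the first die (hence \(\mathcal{H}\)-measurable), and `Y` the sum of both dice.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: two fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def cond_exp(g, label):
    """E[g | sigma(label)], returned as a function of the outcome w."""
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return lambda w: num[label(w)] / den[label(w)]

H = lambda w: w[0]             # information: the first die
X = lambda w: w[0] ** 2        # depends only on the first die
Y = lambda w: w[0] + w[1]      # also depends on the second die

e_x = cond_exp(X, H)                          # E[X | H]
e_y = cond_exp(Y, H)                          # E[Y | H]
e_xy = cond_exp(lambda w: X(w) * Y(w), H)     # E[XY | H]

known_value = all(e_x(w) == X(w) for w in omega)
factors_out = all(e_xy(w) == X(w) * e_y(w) for w in omega)
```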

Armed with this definition of conditional expectation, we can define other conditional quantities, such as the conditional variance \(\mathbf{V}[X | \mathcal{H}]\):

\[\begin{aligned} \mathbf{V}[X | \mathcal{H}] = \mathbf{E}[X^2 | \mathcal{H}] - \big[\mathbf{E}[X | \mathcal{H}]\big]^2 \end{aligned}\]

The law of total variance then states that \(\mathbf{V}[X] = \mathbf{E}[\mathbf{V}[X | \mathcal{H}]] + \mathbf{V}[\mathbf{E}[X | \mathcal{H}]]\).
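As a sketch of why this holds, assume \(X\) has a finite second moment, so that all terms below exist. Applying the law of total expectation to both terms on the right-hand side gives:

\[\begin{aligned} \mathbf{E}\big[\mathbf{V}[X | \mathcal{H}]\big] &= \mathbf{E}\big[\mathbf{E}[X^2 | \mathcal{H}]\big] - \mathbf{E}\big[\big(\mathbf{E}[X | \mathcal{H}]\big)^2\big] = \mathbf{E}[X^2] - \mathbf{E}\big[\big(\mathbf{E}[X | \mathcal{H}]\big)^2\big] \\ \mathbf{V}\big[\mathbf{E}[X | \mathcal{H}]\big] &= \mathbf{E}\big[\big(\mathbf{E}[X | \mathcal{H}]\big)^2\big] - \big(\mathbf{E}\big[\mathbf{E}[X | \mathcal{H}]\big]\big)^2 = \mathbf{E}\big[\big(\mathbf{E}[X | \mathcal{H}]\big)^2\big] - \big(\mathbf{E}[X]\big)^2 \end{aligned}\]

Adding these two lines, the terms \(\mathbf{E}\big[(\mathbf{E}[X | \mathcal{H}])^2\big]\) cancel, leaving \(\mathbf{E}[X^2] - (\mathbf{E}[X])^2 = \mathbf{V}[X]\).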

Likewise, we can define the conditional probability \(P\), conditional distribution function \(F_{X|\mathcal{H}}\), and conditional density function \(f_{X|\mathcal{H}}\) like their non-conditional counterparts:

\[\begin{aligned} P(A | \mathcal{H}) = \mathbf{E}[I(A) | \mathcal{H}] \qquad F_{X|\mathcal{H}}(x) = P(X \le x | \mathcal{H}) \qquad f_{X|\mathcal{H}}(x) = \dv{F_{X|\mathcal{H}}}{x} \end{aligned}\]
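The first two definitions can be tried out on a finite sample space. As before, the setup is an assumption for illustration: two fair dice, \(\mathcal{H}\) generated by the first die, and `X` the sum of both dice. The helper names `cond_exp`, `cond_prob` and `cond_cdf` are hypothetical. Because this `X` is discrete, \(F_{X|\mathcal{H}}\) is a step function, so the conditional density is not computed here.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: two fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def cond_exp(g, label):
    """E[g | sigma(label)], returned as a function of the outcome w."""
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return lambda w: num[label(w)] / den[label(w)]

H = lambda w: w[0]             # information: the first die
X = lambda w: w[0] + w[1]      # sum of both dice

def cond_prob(A, label):
    """P(A | H) = E[I(A) | H], where A is given as a predicate on outcomes."""
    return cond_exp(lambda w: 1 if A(w) else 0, label)

def cond_cdf(x, label):
    """F_{X|H}(x) = P(X <= x | H), itself a random variable."""
    return cond_prob(lambda w: X(w) <= x, label)

F4 = cond_cdf(4, H)            # P(X <= 4 | first die)

# Averaging P(A | H) over all outcomes recovers the unconditional P(A).
p_unconditional = sum(F4(w) * P[w] for w in omega)
```

For instance, if the first die shows 1, then `F4` gives the probability that the second die is at most 3.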


  1. U.H. Thygesen, Lecture notes on diffusions and stochastic differential equations, 2021, Polyteknisk Kompendie.

© Marcus R.A. Newman, a.k.a. "Prefetch". Available under CC BY-SA 4.0.