Categories: Mathematics, Measure theory, Statistics, Stochastic analysis.

Conditional expectation

Recall that the expectation value $\mathbf{E}[X]$ of a random variable $X$ is a function of the probability space $(\Omega, \mathcal{F}, P)$ on which $X$ is defined, and the definition of $X$ itself.

The conditional expectation $\mathbf{E}[X|A]$ is the expectation value of $X$ given that an event $A$ has occurred, i.e. only the outcomes $\omega \in \Omega$ satisfying $\omega \in A$ should be considered. If $A$ is obtained by observing a variable, then $\mathbf{E}[X|A]$ is a random variable in its own right.

Consider two random variables $X$ and $Y$ on the same probability space $(\Omega, \mathcal{F}, P)$, and suppose that $\Omega$ is discrete. If $Y = y$ has been observed, then the conditional expectation of $X$ given the event $Y = y$ is as follows:

$$\begin{aligned} \mathbf{E}[X | Y \!=\! y] = \sum_{x} x \: Q(X \!=\! x) \qquad \quad Q(X \!=\! x) = \frac{P(X \!=\! x \cap Y \!=\! y)}{P(Y \!=\! y)} \end{aligned}$$

Where $Q$ is a renormalized probability function, which assigns zero to all events incompatible with $Y = y$. If we allow $\Omega$ to be continuous, then from the definition of $\mathbf{E}[X]$, we know that the following Lebesgue integral can be used, which we call $f(y)$:

$$\begin{aligned} \mathbf{E}[X | Y \!=\! y] = f(y) = \int_\Omega X(\omega) \dd{Q(\omega)} \end{aligned}$$

However, this is only valid if $P(Y \!=\! y) > 0$, which is a problem for continuous sample spaces $\Omega$. Sticking with the assumption $P(Y \!=\! y) > 0$ for now, notice that:

$$\begin{aligned} f(y) = \frac{1}{P(Y \!=\! y)} \int_\Omega X(\omega) \dd{P(\omega \cap Y \!=\! y)} = \frac{\mathbf{E}[X \cdot I(Y \!=\! y)]}{P(Y \!=\! y)} \end{aligned}$$

Where $I$ is the indicator function, equal to $1$ if its argument is true, and $0$ if not. Multiplying the definition of $f(y)$ by $P(Y \!=\! y)$ then leads us to:

$$\begin{aligned} \mathbf{E}[X \cdot I(Y \!=\! y)] &= f(y) \cdot P(Y \!=\! y) \\ &= \mathbf{E}[f(Y) \cdot I(Y \!=\! y)] \end{aligned}$$

Recall that because $Y$ is a random variable, $\mathbf{E}[X|Y] = f(Y)$ is too. In other words, $f$ maps $Y$ to another random variable, which, thanks to the Doob-Dynkin lemma (see random variable), means that $\mathbf{E}[X|Y]$ is measurable with respect to $\sigma(Y)$. Intuitively, this makes sense: $\mathbf{E}[X|Y]$ cannot contain more information about events than the $Y$ it was calculated from.
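These discrete identities are easy to check numerically. The following sketch uses a hypothetical example not taken from the text: a fair six-sided die, with $X$ the face value and $Y$ its parity, and exact rational arithmetic to avoid rounding issues.

```python
from fractions import Fraction

# Hypothetical setup: fair six-sided die, X = face value, Y = parity.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}      # uniform probability measure
X = lambda w: w
Y = lambda w: w % 2

def f(y):
    """E[X | Y = y] = E[X * I(Y = y)] / P(Y = y)."""
    p_y = sum(P[w] for w in omega if Y(w) == y)
    return sum(X(w) * P[w] for w in omega if Y(w) == y) / p_y

# f(y) averages X over the outcomes compatible with Y = y:
assert f(0) == 4    # (2 + 4 + 6) / 3
assert f(1) == 3    # (1 + 3 + 5) / 3

# The identity E[X * I(Y = y)] = E[f(Y) * I(Y = y)] holds for both y:
for y in (0, 1):
    lhs = sum(X(w) * P[w] for w in omega if Y(w) == y)
    rhs = sum(f(Y(w)) * P[w] for w in omega if Y(w) == y)
    assert lhs == rhs
```

Note that $f(Y)$ is constant on each parity class, i.e. it only depends on the information carried by $Y$, in line with the measurability statement above.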

This suggests a straightforward generalization of the above: instead of a specific value $Y = y$, we can condition on any information from $Y$. If $\mathcal{H} = \sigma(Y)$ is the information generated by $Y$, then the conditional expectation $\mathbf{E}[X|\mathcal{H}] = Z$ is $\mathcal{H}$-measurable, and given by a $Z$ satisfying:

$$\begin{aligned} \boxed{ \mathbf{E}\big[X \cdot I(H)\big] = \mathbf{E}\big[Z \cdot I(H)\big] } \end{aligned}$$

For any $H \in \mathcal{H}$. Note that $Z$ is almost surely unique: "almost" because it could take any value on an event $A$ with zero probability $P(A) = 0$. Fortunately, if there exists a continuous $f$ such that $\mathbf{E}[X | \sigma(Y)] = f(Y)$, then $Z = \mathbf{E}[X | \sigma(Y)]$ is unique.
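This partial-averaging property can be verified exhaustively on a small space. Below is a minimal sketch, again assuming a hypothetical fair die with $X$ the face value and $\mathcal{H} = \sigma(Y)$ generated by the parity, so that $\sigma(Y)$ contains only four events:

```python
from fractions import Fraction

# Hypothetical setup: fair die, X = face value, H = sigma(parity).
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}

evens, odds = {2, 4, 6}, {1, 3, 5}
sigma_Y = [set(), evens, odds, evens | odds]   # every event in sigma(Y)

# Z = E[X | H]: average X over the atom (evens or odds) containing each outcome
Z = {}
for atom in (evens, odds):
    avg = sum(X[w] * P[w] for w in atom) / sum(P[w] for w in atom)
    for w in atom:
        Z[w] = avg

# Partial averaging: E[X * I(H)] = E[Z * I(H)] for every H in sigma(Y)
for H in sigma_Y:
    assert sum(X[w] * P[w] for w in H) == sum(Z[w] * P[w] for w in H)
```

Since $\Omega$ here is finite and $P$ has no null events, this $Z$ is the unique conditional expectation; the almost-sure caveat only matters when zero-probability events exist.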


A conditional expectation defined in this way has many useful properties, most notably linearity: $\mathbf{E}[aX \!+\! bY | \mathcal{H}] = a \mathbf{E}[X|\mathcal{H}] + b \mathbf{E}[Y|\mathcal{H}]$ for any $a, b \in \mathbb{R}$.
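Linearity can be confirmed pointwise on a toy example; the die, the partition, and the coefficients below are hypothetical illustrations, not from the text:

```python
from fractions import Fraction

# Hypothetical setup: fair die; the sigma-algebra is generated by parity.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}

def cond_exp(V, atoms):
    """E[V | sigma(atoms)]: average V over the atom containing each outcome."""
    Z = {}
    for atom in atoms:
        avg = sum(V[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            Z[w] = avg
    return Z

atoms = [{2, 4, 6}, {1, 3, 5}]
X = {w: Fraction(w) for w in omega}          # face value
Y = {w: Fraction(w * w) for w in omega}      # its square
a, b = Fraction(2), Fraction(-3)

lhs = cond_exp({w: a * X[w] + b * Y[w] for w in omega}, atoms)
EX, EY = cond_exp(X, atoms), cond_exp(Y, atoms)
for w in omega:
    assert lhs[w] == a * EX[w] + b * EY[w]   # linearity holds pointwise
```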

The tower property states that if $\mathcal{F} \supset \mathcal{G} \supset \mathcal{H}$, then $\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbf{E}[X|\mathcal{H}]$. Intuitively, this works as follows: suppose person $G$ knows more about $X$ than person $H$; then $\mathbf{E}[X | \mathcal{H}]$ is $H$'s expectation, $\mathbf{E}[X | \mathcal{G}]$ is $G$'s "better" expectation, and $\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}]$ is $H$'s prediction about what $G$'s expectation will be. However, $H$ does not have access to $G$'s extra information, so $H$'s best prediction is simply $\mathbf{E}[X | \mathcal{H}]$.
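The tower property can be checked with two nested partitions of a hypothetical fair die's sample space: the pair partition $\{1,2\},\{3,4\},\{5,6\}$ plays $\mathcal{G}$, and a coarser split plays $\mathcal{H} \subset \mathcal{G}$.

```python
from fractions import Fraction

# Hypothetical setup: fair die. G is generated by the pairs {1,2},{3,4},{5,6};
# H is the coarser sigma-algebra generated by {1,2,3,4} and {5,6}, so G ⊃ H.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}

def cond_exp(V, atoms):
    """E[V | sigma(atoms)]: average V over the atom containing each outcome."""
    Z = {}
    for atom in atoms:
        avg = sum(V[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            Z[w] = avg
    return Z

G = [{1, 2}, {3, 4}, {5, 6}]
H = [{1, 2, 3, 4}, {5, 6}]

inner = cond_exp(X, G)         # E[X | G], person G's "better" expectation
tower = cond_exp(inner, H)     # E[ E[X|G] | H ], H's prediction of it
direct = cond_exp(X, H)        # E[X | H], H's own expectation
assert tower == direct
```

On $\{1,2,3,4\}$, for instance, averaging $G$'s atom-wise values $3/2$ and $7/2$ gives $5/2$, exactly $H$'s own average of the faces $1$ through $4$.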

The law of total expectation says that $\mathbf{E}[\mathbf{E}[X | \mathcal{G}]] = \mathbf{E}[X]$, and follows from the above tower property by choosing $\mathcal{H}$ to contain no information: $\mathcal{H} = \{ \varnothing, \Omega \}$.
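A quick numerical check of this law, assuming the same hypothetical fair-die setup with $\mathcal{G}$ generated by parity:

```python
from fractions import Fraction

# Hypothetical setup: fair die; G is generated by the parity of the outcome.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}

def cond_exp(V, atoms):
    """E[V | sigma(atoms)]: average V over the atom containing each outcome."""
    Z = {}
    for atom in atoms:
        avg = sum(V[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            Z[w] = avg
    return Z

Z = cond_exp(X, [{2, 4, 6}, {1, 3, 5}])
E_Z = sum(Z[w] * P[w] for w in omega)    # E[ E[X|G] ]
E_X = sum(X[w] * P[w] for w in omega)    # E[X]
assert E_Z == E_X == Fraction(7, 2)      # both equal 7/2
```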

Another useful property is that $\mathbf{E}[X | \mathcal{H}] = X$ if $X$ is $\mathcal{H}$-measurable. In other words, if $\mathcal{H}$ already contains all the information extractable from $X$, then we know $X$'s exact value. Conveniently, this can easily be generalized to products: $\mathbf{E}[XY | \mathcal{H}] = X \mathbf{E}[Y | \mathcal{H}]$ if $X$ is $\mathcal{H}$-measurable: since $X$'s value is known, it can simply be factored out.
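The pull-out property can be illustrated with a hypothetical fair die where $\mathcal{H}$ is generated by the pairs $\{1,2\},\{3,4\},\{5,6\}$; an $\mathcal{H}$-measurable $X$ is then any variable that is constant on each pair.

```python
from fractions import Fraction

# Hypothetical setup: fair die; H is generated by the pairs {1,2},{3,4},{5,6}.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}
atoms = [{1, 2}, {3, 4}, {5, 6}]

def cond_exp(V, atoms):
    """E[V | sigma(atoms)]: average V over the atom containing each outcome."""
    Z = {}
    for atom in atoms:
        avg = sum(V[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            Z[w] = avg
    return Z

X = {w: Fraction((w + 1) // 2) for w in omega}   # constant per atom: H-measurable
Y = {w: Fraction(w) for w in omega}

lhs = cond_exp({w: X[w] * Y[w] for w in omega}, atoms)   # E[XY | H]
EY = cond_exp(Y, atoms)
for w in omega:
    assert lhs[w] == X[w] * EY[w]    # X factors out of the conditional expectation
```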

Armed with this definition of conditional expectation, we can define other conditional quantities, such as the conditional variance $\mathbf{V}[X | \mathcal{H}]$:

$$\begin{aligned} \mathbf{V}[X | \mathcal{H}] = \mathbf{E}[X^2 | \mathcal{H}] - \big[\mathbf{E}[X | \mathcal{H}]\big]^2 \end{aligned}$$

The law of total variance then states that $\mathbf{V}[X] = \mathbf{E}[\mathbf{V}[X | \mathcal{H}]] + \mathbf{V}[\mathbf{E}[X | \mathcal{H}]]$.
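Both the conditional variance formula and the law of total variance can be checked exactly on the hypothetical fair-die example, with $\mathcal{H}$ generated by parity:

```python
from fractions import Fraction

# Hypothetical setup: fair die; H is generated by the parity of the outcome.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}
atoms = [{2, 4, 6}, {1, 3, 5}]

def cond_exp(V, atoms):
    """E[V | sigma(atoms)]: average V over the atom containing each outcome."""
    Z = {}
    for atom in atoms:
        avg = sum(V[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            Z[w] = avg
    return Z

def expect(V):
    return sum(V[w] * P[w] for w in omega)

E_X_H = cond_exp(X, atoms)
X2 = {w: X[w] ** 2 for w in omega}
E_X2_H = cond_exp(X2, atoms)
# V[X|H] = E[X^2|H] - (E[X|H])^2, evaluated pointwise
V_X_H = {w: E_X2_H[w] - E_X_H[w] ** 2 for w in omega}

V_X = expect(X2) - expect(X) ** 2                                     # V[X]
E_V = expect(V_X_H)                                                   # E[V[X|H]]
V_E = expect({w: E_X_H[w] ** 2 for w in omega}) - expect(E_X_H) ** 2  # V[E[X|H]]
assert V_X == E_V + V_E == Fraction(35, 12)
```

Here the within-parity variance contributes $8/3$ and the variance between the two parity means contributes $1/4$, which indeed sum to the total $35/12$.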

Likewise, we can define the conditional probability $P$, conditional distribution function $F_{X|\mathcal{H}}$, and conditional density function $f_{X|\mathcal{H}}$ like their non-conditional counterparts:

$$\begin{aligned} P(A | \mathcal{H}) = \mathbf{E}[I(A) | \mathcal{H}] \qquad F_{X|\mathcal{H}}(x) = P(X \le x | \mathcal{H}) \qquad f_{X|\mathcal{H}}(x) = \dv{F_{X|\mathcal{H}}}{x} \end{aligned}$$
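The first of these definitions is easy to evaluate on the hypothetical fair-die example: take the event $A = \{\text{roll} \ge 5\}$ and condition on the parity.

```python
from fractions import Fraction

# Hypothetical setup: fair die; H is generated by parity; A = {roll >= 5}.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}
atoms = [{2, 4, 6}, {1, 3, 5}]

def cond_exp(V, atoms):
    """E[V | sigma(atoms)]: average V over the atom containing each outcome."""
    Z = {}
    for atom in atoms:
        avg = sum(V[w] * P[w] for w in atom) / sum(P[w] for w in atom)
        for w in atom:
            Z[w] = avg
    return Z

I_A = {w: Fraction(1 if w >= 5 else 0) for w in omega}   # indicator of A
P_A_H = cond_exp(I_A, atoms)                             # P(A | H) = E[I(A) | H]

assert P_A_H[2] == Fraction(1, 3)   # given even: only the 6 lands in A
assert P_A_H[1] == Fraction(1, 3)   # given odd:  only the 5 lands in A
```

Like $\mathbf{E}[X|\mathcal{H}]$ itself, $P(A|\mathcal{H})$ is a random variable: it is constant on each atom of $\mathcal{H}$.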

