Categories:
Mathematics,
Measure theory,
Statistics,
Stochastic analysis.
Conditional expectation
Recall that the expectation value E[X]
of a random variable X
is a function of the probability space (Ω,F,P)
on which X is defined, and the definition of X itself.
The conditional expectation E[X∣A]
is the expectation value of X given that an event A has occurred,
i.e. only the outcomes ω∈Ω
satisfying ω∈A should be considered.
If A is obtained by observing a random variable,
then E[X∣A] is a random variable in its own right.
Consider two random variables X and Y
on the same probability space (Ω,F,P),
and suppose that Ω is discrete.
If Y=y has been observed,
then the conditional expectation of X
given the event Y=y is as follows:
E[X∣Y=y] = ∑_x x Q(X=x)
Q(X=x) = P(X=x ∩ Y=y) / P(Y=y)
Where Q is a renormalized probability function,
which assigns zero to all events incompatible with Y=y.
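This discrete formula can be checked by direct computation. Below is a minimal sketch; the joint distribution is a made-up example, and exact fractions are used so the renormalization by P(Y=y) is visible:

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X=x ∩ Y=y) on a small discrete space.
joint = {
    (0, 0): F(1, 4), (0, 1): F(1, 4),
    (1, 0): F(1, 8), (1, 1): F(3, 8),
}

def cond_expectation(joint, y):
    """E[X | Y=y] via the renormalized probability Q."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)  # P(Y=y)
    # Q(X=x) = P(X=x ∩ Y=y) / P(Y=y); sum x·Q(X=x) over x.
    return sum(x * p / p_y for (x, yy), p in joint.items() if yy == y)

print(cond_expectation(joint, 1))  # (3/8) / (5/8) = 3/5
```

Events incompatible with Y=y are simply skipped, which is exactly what assigning them zero probability under Q amounts to.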
If we allow Ω to be continuous,
then, from the definition of E[X],
we know that the following Lebesgue integral can be used,
which we call f(y):
E[X∣Y=y] = f(y) = ∫_Ω X(ω) dQ(ω)
However, this is only valid if P(Y=y)>0,
which is a problem for continuous sample spaces Ω.
Sticking with the assumption P(Y=y)>0, notice that:
f(y) = (1 / P(Y=y)) ∫_Ω X(ω) dP(ω ∩ Y=y) = E[X⋅I(Y=y)] / P(Y=y)
Where I is the indicator function,
equal to 1 if its argument is true, and 0 if not.
Multiplying the definition of f(y) by P(Y=y) then leads us to:
E[X⋅I(Y=y)]=f(y)⋅P(Y=y)=E[f(Y)⋅I(Y=y)]
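This identity can be checked by simulation. The model below is a made-up assumption (Y a fair coin, X = Y plus uniform noise, so f(y) = y + 1/2); both sides should come out near f(1)⋅P(Y=1):

```python
import random

random.seed(0)
N = 100_000

# Hypothetical model: Y is a fair coin, X = Y + uniform noise on [0, 1],
# so f(y) = E[X | Y=y] = y + 1/2.
samples = []
for _ in range(N):
    y = random.randint(0, 1)
    samples.append((y + random.random(), y))

lhs = sum(x for x, y in samples if y == 1) / N        # ≈ E[X·I(Y=1)]
rhs = sum(y + 0.5 for x, y in samples if y == 1) / N  # ≈ E[f(Y)·I(Y=1)]
print(abs(lhs - rhs))  # small: both ≈ f(1)·P(Y=1) = 1.5 · 0.5 = 0.75
```

On {Y=y} the random variable f(Y) is the constant f(y), which is why the two expectations agree.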
Recall that because Y is a random variable,
E[X∣Y]=f(Y) is too.
In other words, f maps Y to another random variable,
which, thanks to the Doob-Dynkin lemma
(see random variable),
means that E[X∣Y] is measurable with respect to σ(Y).
Intuitively, this makes sense:
E[X∣Y] cannot contain more information about events
than the Y it was calculated from.
This suggests a straightforward generalization of the above:
instead of a specific value Y=y,
we can condition on any information from Y.
If H=σ(Y) is the information generated by Y,
then the conditional expectation E[X∣H]=Z
is H-measurable, and given by a Z satisfying:
E[X⋅I(H)]=E[Z⋅I(H)]
For any H∈H. Note that Z is almost surely unique:
“almost” because it may take any value
on an event A with zero probability P(A)=0.
Fortunately, if there exists a continuous f
such that E[X∣σ(Y)]=f(Y),
then Z=E[X∣σ(Y)] is unique.
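The defining property can be verified exhaustively on a finite space, where a partition stands in for the σ-algebra and σ(Y) consists of all unions of the sets {Y=y}. The space and the variables below are arbitrary made-up choices:

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical discrete probability space: Ω = {0,1,2,3}, uniform P.
omega = [0, 1, 2, 3]
P = {w: F(1, 4) for w in omega}
X = {0: 1, 1: 2, 2: 3, 3: 4}   # some random variable X(ω)
Y = {0: 0, 1: 0, 2: 1, 3: 1}   # Y(ω): generates H = σ(Y)

# Z = E[X|H]: constant on each set {Y=y}, equal to X's average there.
def z(w):
    ws = [v for v in omega if Y[v] == Y[w]]
    return sum(X[v] * P[v] for v in ws) / sum(P[v] for v in ws)

# σ(Y) = all unions of the atoms {Y=y}: here ∅, {0,1}, {2,3}, Ω.
atoms = [frozenset(v for v in omega if Y[v] == y) for y in set(Y.values())]
events = [frozenset().union(*c) for r in range(len(atoms) + 1)
          for c in combinations(atoms, r)]

for H in events:  # check E[X·I(H)] = E[Z·I(H)] for every H ∈ σ(Y)
    lhs = sum(X[w] * P[w] for w in H)
    rhs = sum(z(w) * P[w] for w in H)
    assert lhs == rhs
```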
Properties
A conditional expectation defined in this way has many useful properties,
most notably linearity:
E[aX+bY∣H]=aE[X∣H]+bE[Y∣H]
for any a,b∈R.
The tower property states that if F⊃G⊃H,
then E[E[X∣G]∣H]=E[X∣H].
Intuitively, this works as follows:
suppose person G knows more about X than person H,
then E[X∣H] is H’s expectation,
E[X∣G] is G’s “better” expectation,
and then E[E[X∣G]∣H]
is H’s prediction about what G’s expectation will be.
However, H does not have access to G’s extra information,
so H’s best prediction is simply E[X∣H].
The law of total expectation says that
E[E[X∣G]]=E[X],
and follows from the above tower property
by choosing H to contain no information:
H={∅,Ω}.
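Both the tower property and the law of total expectation can be demonstrated on a finite space, with nested partitions standing in for the nested σ-algebras (the space and X below are made-up choices):

```python
from fractions import Fraction as F

# Hypothetical finite space with uniform P; partitions stand in for σ-algebras.
omega = range(8)
P = {w: F(1, 8) for w in omega}
X = {w: w * w for w in omega}  # arbitrary random variable

def cond_exp(Z, partition):
    """E[Z | partition] as a function ω → value: Z's average over ω's block."""
    out = {}
    for block in partition:
        avg = sum(Z[w] * P[w] for w in block) / sum(P[w] for w in block)
        for w in block:
            out[w] = avg
    return out

G = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]  # finer information
H = [{0, 1, 2, 3}, {4, 5, 6, 7}]      # coarser: each block is a union of G-blocks

inner = cond_exp(X, G)                          # E[X|G]
assert cond_exp(inner, H) == cond_exp(X, H)     # tower property

# Law of total expectation: take the trivial partition {Ω}.
EX = sum(X[w] * P[w] for w in omega)
assert cond_exp(inner, [set(omega)]) == {w: EX for w in omega}
```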
Another useful property is that E[X∣H]=X
if X is H-measurable.
In other words, if H already contains
all the information extractable from X,
then we know X’s exact value.
Conveniently, this can easily be generalized to products:
E[XY∣H]=XE[Y∣H]
if X is H-measurable:
since X’s value is known, it can simply be factored out.
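The factoring rule can also be checked on a small example: below, X is constant on each block of the partition (i.e. H-measurable), while Y is not. The numbers are arbitrary:

```python
from fractions import Fraction as F

# Hypothetical space: Ω = {0,1,2,3} uniform; the partition plays the role of H.
omega = [0, 1, 2, 3]
P = {w: F(1, 4) for w in omega}
blocks = [{0, 1}, {2, 3}]
X = {0: 5, 1: 5, 2: -2, 3: -2}  # constant on blocks ⇒ H-measurable
Y = {0: 1, 1: 2, 2: 3, 3: 4}

def cond_exp(Z):
    out = {}
    for b in blocks:
        avg = sum(Z[w] * P[w] for w in b) / sum(P[w] for w in b)
        for w in b:
            out[w] = avg
    return out

XY = {w: X[w] * Y[w] for w in omega}
EY = cond_exp(Y)
# E[XY|H] = X·E[Y|H] because X is already known given H.
assert cond_exp(XY) == {w: X[w] * EY[w] for w in omega}
```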
Armed with this definition of conditional expectation,
we can define other conditional quantities,
such as the conditional variance V[X∣H]:
V[X∣H] = E[X²∣H] − (E[X∣H])²
The law of total variance then states that
V[X]=E[V[X∣H]]+V[E[X∣H]].
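The law of total variance can be verified exactly on a small mixture model; the model below (a fair Y choosing between two biased coins) is a made-up assumption:

```python
from fractions import Fraction as F

# Hypothetical mixture: Y ∈ {0,1} fair; given Y=y, X is Bernoulli(q[y]).
q = {0: F(1, 4), 1: F(3, 4)}
p = {(x, y): F(1, 2) * (q[y] if x == 1 else 1 - q[y])
     for x in (0, 1) for y in (0, 1)}

def E(g):  # expectation of g(x, y) under the joint distribution
    return sum(g(x, y) * pr for (x, y), pr in p.items())

EX = E(lambda x, y: x)
VX = E(lambda x, y: x * x) - EX**2

# Given Y=y, X is Bernoulli: conditional mean q[y], variance q[y](1 - q[y]).
cm = lambda x, y: q[y]
cv = lambda x, y: q[y] * (1 - q[y])

# V[X] = E[V[X|Y]] + V[E[X|Y]]
assert VX == E(cv) + (E(lambda x, y: cm(x, y) ** 2) - E(cm) ** 2)
```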
Likewise, we can define the conditional probability P,
conditional distribution function FX∣H,
and conditional density function fX∣H
like their non-conditional counterparts:
P(A∣H) = E[I(A)∣H]
F_{X∣H}(x) = P(X≤x ∣ H)
f_{X∣H}(x) = dF_{X∣H}/dx
References
- U.H. Thygesen,
Lecture notes on diffusions and stochastic differential equations,
2021, Polyteknisk Kompendie.