---
title: "Conditional expectation"
date: 2021-10-23
categories:
- Mathematics
- Statistics
- Measure theory
- Stochastic analysis
layout: "concept"
---

Recall that the expectation value $\mathbf{E}[X]$
of a [random variable](/know/concept/random-variable/) $X$
is determined by the probability space $(\Omega, \mathcal{F}, P)$
on which $X$ is defined, together with the definition of $X$ itself.

The **conditional expectation** $\mathbf{E}[X|A]$
is the expectation value of $X$ given that an event $A$ has occurred,
i.e. only the outcomes $\omega \in \Omega$
satisfying $\omega \in A$ should be considered.
If $A$ is obtained by observing a random variable,
whose value is not known in advance,
then $\mathbf{E}[X|A]$ is a random variable in its own right.

Consider two random variables $X$ and $Y$
on the same probability space $(\Omega, \mathcal{F}, P)$,
and suppose that $\Omega$ is discrete.
If $Y = y$ has been observed,
then the conditional expectation of $X$
given the event $Y = y$ is as follows:

$$\begin{aligned}
    \mathbf{E}[X | Y \!=\! y]
    = \sum_{x} x \: Q(X \!=\! x)
    \qquad \quad
    Q(X \!=\! x)
    = \frac{P(X \!=\! x \cap Y \!=\! y)}{P(Y \!=\! y)}
\end{aligned}$$

Where $Q$ is a renormalized probability function,
which assigns zero to all events incompatible with $Y = y$.
If we allow $\Omega$ to be continuous,
then from the definition of $\mathbf{E}[X]$,
we know that the following Lebesgue integral can be used,
which we call $f(y)$:

$$\begin{aligned}
    \mathbf{E}[X | Y \!=\! y]
    = f(y)
    = \int_\Omega X(\omega) \dd{Q(\omega)}
\end{aligned}$$

However, this is only valid if $P(Y \!=\! y) > 0$,
which is a problem for continuous sample spaces $\Omega$,
where a single value $y$ typically has probability zero.
Sticking with the assumption $P(Y \!=\! y) > 0$, notice that:

$$\begin{aligned}
    f(y)
    = \frac{1}{P(Y \!=\! y)} \int_\Omega X(\omega) \dd{P(\omega \cap Y \!=\! y)}
    = \frac{\mathbf{E}[X \cdot I(Y \!=\! y)]}{P(Y \!=\! y)}
\end{aligned}$$

Where $I$ is the indicator function,
equal to $1$ if its argument is true, and $0$ if not.
Multiplying the definition of $f(y)$ by $P(Y \!=\! y)$ then leads us to:

$$\begin{aligned}
    \mathbf{E}[X \cdot I(Y \!=\! y)]
    &= f(y) \cdot P(Y \!=\! y)
    \\
    &= \mathbf{E}[f(Y) \cdot I(Y \!=\! y)]
\end{aligned}$$
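
To make the discrete formula and the identity above concrete,
here is a minimal Python sketch on a small sample space.
The setup is purely illustrative and not taken from the source:
we assume two fair dice, with $X$ their sum and $Y$ the first die,
and the helper names `expect` and `f` are our own.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, each outcome has probability 1/36.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    return w[0] + w[1]  # sum of both dice

def Y(w):
    return w[0]  # first die only

def expect(g):
    """E[g]: sum of g(w) P(w) over all outcomes w."""
    return sum(g(w) * P[w] for w in omega)

def f(y):
    """E[X | Y = y], using the renormalized probabilities Q."""
    p_y = sum(P[w] for w in omega if Y(w) == y)
    return sum(X(w) * P[w] / p_y for w in omega if Y(w) == y)

for y in range(1, 7):
    p_y = expect(lambda w: int(Y(w) == y))
    lhs = expect(lambda w: X(w) * (Y(w) == y))     # E[X I(Y = y)]
    rhs = expect(lambda w: f(Y(w)) * (Y(w) == y))  # E[f(Y) I(Y = y)]
    assert lhs == f(y) * p_y == rhs
    print(y, f(y))
```

For this choice, $f(y) = y + 7/2$: the first die is known,
and the second contributes its unconditional mean.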

Recall that because $Y$ is a random variable,
$\mathbf{E}[X|Y] = f(Y)$ is too.
In other words, $f$ maps $Y$ to another random variable,
which, thanks to the *Doob-Dynkin lemma*
(see [random variable](/know/concept/random-variable/)),
means that $\mathbf{E}[X|Y]$ is measurable with respect to $\sigma(Y)$.
Intuitively, this makes sense:
$\mathbf{E}[X|Y]$ cannot contain more information about events
than the $Y$ it was calculated from.

This suggests a straightforward generalization of the above:
instead of a specific value $Y = y$,
we can condition on *any* information from $Y$.
If $\mathcal{H} = \sigma(Y)$ is the information generated by $Y$,
then the conditional expectation $\mathbf{E}[X|\mathcal{H}] = Z$
is $\mathcal{H}$-measurable, and given by a $Z$ satisfying:

$$\begin{aligned}
    \boxed{
        \mathbf{E}\big[X \cdot I(H)\big]
        = \mathbf{E}\big[Z \cdot I(H)\big]
    }
\end{aligned}$$

For any $H \in \mathcal{H}$. Note that $Z$ is almost surely unique:
*almost* because its values can be changed arbitrarily
on any event $A$ with zero probability $P(A) = 0$
without affecting the expectations in the boxed equation.
Fortunately, if there exists a continuous $f$
such that $\mathbf{E}[X | \sigma(Y)] = f(Y)$,
then $Z = \mathbf{E}[X | \sigma(Y)]$ is unique.
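
On a finite sample space, the boxed property can be checked by brute force,
because every $H \in \sigma(Y)$ is a union of level sets of $Y$.
The following Python sketch does this under assumptions of our own choosing
(two fair dice, $X$ their sum, $Y$ their maximum);
it is an illustration, not a general-purpose implementation.

```python
from fractions import Fraction
from itertools import combinations, product

# Assumed example: two fair dice, uniform probability.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

X = lambda w: w[0] + w[1]  # sum of the dice
Y = lambda w: max(w)       # largest of the two dice

def expect(g):
    return sum(g(w) * P[w] for w in omega)

# Atoms of sigma(Y): the level sets {Y = y}.
atoms = {}
for w in omega:
    atoms.setdefault(Y(w), set()).add(w)

# Z = E[X | sigma(Y)] is constant on each atom,
# equal to the P-weighted average of X over that atom.
Z = {}
for atom in atoms.values():
    avg = sum(X(w) * P[w] for w in atom) / sum(P[w] for w in atom)
    for w in atom:
        Z[w] = avg

# Every event in sigma(Y) is a union of atoms (including the empty union).
keys = list(atoms)
events = [set().union(*(atoms[y] for y in combo))
          for r in range(len(keys) + 1)
          for combo in combinations(keys, r)]

checked = 0
for H in events:
    lhs = expect(lambda w: X(w) * (w in H))  # E[X I(H)]
    rhs = expect(lambda w: Z[w] * (w in H))  # E[Z I(H)]
    assert lhs == rhs
    checked += 1
```

Since $Y$ takes six values here, the loop covers all $2^6 = 64$ events in $\sigma(Y)$.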


## Properties

A conditional expectation defined in this way has many useful properties,
most notably linearity:
$\mathbf{E}[aX \!+\! bY | \mathcal{H}] = a \mathbf{E}[X|\mathcal{H}] + b \mathbf{E}[Y|\mathcal{H}]$
for any $a, b \in \mathbb{R}$.

The **tower property** states that if $\mathcal{F} \supset \mathcal{G} \supset \mathcal{H}$,
then $\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbf{E}[X|\mathcal{H}]$.
Intuitively, this works as follows:
suppose person $G$ knows more about $X$ than person $H$,
then $\mathbf{E}[X | \mathcal{H}]$ is $H$'s expectation,
$\mathbf{E}[X | \mathcal{G}]$ is $G$'s "better" expectation,
and then $\mathbf{E}[\mathbf{E}[X|\mathcal{G}]|\mathcal{H}]$
is $H$'s prediction about what $G$'s expectation will be.
However, $H$ does not have access to $G$'s extra information,
so $H$'s best prediction is simply $\mathbf{E}[X | \mathcal{H}]$.
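
The tower property can be verified numerically on a finite sample space.
The sketch below is only an illustration under assumed choices:
two fair dice, $X$ their product, $\mathcal{G}$ generated by the first die,
and $\mathcal{H}$ generated by the parity of the first die, so that $\mathcal{G} \supset \mathcal{H}$.
The helper `cond_exp` is hypothetical and only works for $\sigma$-algebras generated by a labelling function.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, uniform probability.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def cond_exp(g, label):
    """E[g | sigma(label)] as a dict {outcome: value}.

    Averages g over each level set of `label`, weighted by P.
    """
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return {w: num[label(w)] / den[label(w)] for w in omega}

X = lambda w: w[0] * w[1]        # product of the dice
G = lambda w: w[0]               # G knows the first die
H = lambda w: w[0] % 2           # H only knows its parity

inner = cond_exp(X, G)                    # E[X | G]
outer = cond_exp(lambda w: inner[w], H)   # E[ E[X | G] | H ]
direct = cond_exp(X, H)                   # E[X | H]

assert outer == direct
```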

The **law of total expectation** says that
$\mathbf{E}[\mathbf{E}[X | \mathcal{G}]] = \mathbf{E}[X]$,
and follows from the above tower property
by choosing $\mathcal{H}$ to contain no information:
$\mathcal{H} = \{ \varnothing, \Omega \}$.
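
As a cross-check, the same result can be read off directly
from the boxed defining property, without invoking the tower property.
Since $\Omega \in \mathcal{G}$, we may choose the event $H = \Omega$,
for which $I(\Omega) = 1$ everywhere:

$$\begin{aligned}
    \mathbf{E}\big[\mathbf{E}[X | \mathcal{G}]\big]
    = \mathbf{E}\big[\mathbf{E}[X | \mathcal{G}] \cdot I(\Omega)\big]
    = \mathbf{E}\big[X \cdot I(\Omega)\big]
    = \mathbf{E}[X]
\end{aligned}$$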

Another useful property is that $\mathbf{E}[X | \mathcal{H}] = X$
if $X$ is $\mathcal{H}$-measurable.
In other words, if $\mathcal{H}$ already contains
all the information extractable from $X$,
then we know $X$'s exact value.
Conveniently, this can easily be generalized to products:
$\mathbf{E}[XY | \mathcal{H}] = X \mathbf{E}[Y | \mathcal{H}]$
if $X$ is $\mathcal{H}$-measurable:
since $X$'s value is known, it can simply be factored out.
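
A small numerical check of this factoring rule, again as a sketch under assumed choices:
two fair dice, $\mathcal{H}$ generated by the first die,
$X$ equal to the first die (hence $\mathcal{H}$-measurable),
and $Y$ the squared sum of both dice.
The helper `cond_exp` is our own and only handles $\sigma$-algebras generated by a labelling function.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, uniform probability.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def cond_exp(g, label):
    """E[g | sigma(label)] as a dict {outcome: value}."""
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return {w: num[label(w)] / den[label(w)] for w in omega}

label = lambda w: w[0]             # H = sigma(first die)
X = lambda w: w[0]                 # H-measurable
Y = lambda w: (w[0] + w[1]) ** 2   # not H-measurable

lhs = cond_exp(lambda w: X(w) * Y(w), label)   # E[XY | H]
ey = cond_exp(Y, label)                        # E[Y | H]
rhs = {w: X(w) * ey[w] for w in omega}         # X E[Y | H]

assert lhs == rhs
```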

Armed with this definition of conditional expectation,
we can define other conditional quantities,
such as the **conditional variance** $\mathbf{V}[X | \mathcal{H}]$:

$$\begin{aligned}
    \mathbf{V}[X | \mathcal{H}]
    = \mathbf{E}[X^2 | \mathcal{H}] - \big[\mathbf{E}[X | \mathcal{H}]\big]^2
\end{aligned}$$

The **law of total variance** then states that
$\mathbf{V}[X] = \mathbf{E}[\mathbf{V}[X | \mathcal{H}]] + \mathbf{V}[\mathbf{E}[X | \mathcal{H}]]$.
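
The law of total variance can be checked exactly on a finite sample space.
This sketch assumes, for illustration only, two fair dice,
with $X$ their sum and $\mathcal{H}$ generated by the first die;
`expect` and `cond_exp` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, uniform probability.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def expect(g):
    return sum(g(w) * P[w] for w in omega)

def cond_exp(g, label):
    """E[g | sigma(label)] as a dict {outcome: value}."""
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return {w: num[label(w)] / den[label(w)] for w in omega}

X = lambda w: w[0] + w[1]   # sum of the dice
label = lambda w: w[0]      # H = sigma(first die)

m = cond_exp(X, label)                           # E[X | H]
m2 = cond_exp(lambda w: X(w) ** 2, label)        # E[X^2 | H]
cond_var = {w: m2[w] - m[w] ** 2 for w in omega} # V[X | H]

var_X = expect(lambda w: X(w) ** 2) - expect(X) ** 2
mean_of_var = expect(lambda w: cond_var[w])      # E[ V[X | H] ]
var_of_mean = expect(lambda w: m[w] ** 2) - expect(lambda w: m[w]) ** 2

assert var_X == mean_of_var + var_of_mean
```

In this example both terms equal $35/12$, the variance of a single die:
one comes from the unobserved second die, the other from the spread of the first.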

Likewise, we can define the **conditional probability** $P$,
**conditional distribution function** $F_{X|\mathcal{H}}$,
and **conditional density function** $f_{X|\mathcal{H}}$
like their non-conditional counterparts:

$$\begin{aligned}
    P(A | \mathcal{H})
    = \mathbf{E}[I(A) | \mathcal{H}]
    \qquad
    F_{X|\mathcal{H}}(x)
    = P(X \le x | \mathcal{H})
    \qquad
    f_{X|\mathcal{H}}(x)
    = \dv{F_{X|\mathcal{H}}}{x}
\end{aligned}$$
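
As an illustration of the first two definitions,
the sketch below computes $P(A | \mathcal{H})$ and $F_{X|\mathcal{H}}(x)$ for an assumed example:
two fair dice, $X$ their sum, $A$ the event $X \ge 10$, and $\mathcal{H}$ generated by the first die.
The density is omitted, since the derivative is not meaningful for discrete dice.
The helper names are our own.

```python
from fractions import Fraction
from itertools import product

# Assumed example: two fair dice, uniform probability.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def expect(g):
    return sum(g(w) * P[w] for w in omega)

def cond_exp(g, label):
    """E[g | sigma(label)] as a dict {outcome: value}."""
    num, den = {}, {}
    for w in omega:
        k = label(w)
        num[k] = num.get(k, 0) + g(w) * P[w]
        den[k] = den.get(k, 0) + P[w]
    return {w: num[label(w)] / den[label(w)] for w in omega}

X = lambda w: w[0] + w[1]    # sum of the dice
A = lambda w: X(w) >= 10     # event of interest
label = lambda w: w[0]       # H = sigma(first die)

# P(A | H) = E[I(A) | H]
p_A = cond_exp(lambda w: int(A(w)), label)

def cond_cdf(x):
    """F_{X|H}(x) = P(X <= x | H), as a dict {outcome: value}."""
    return cond_exp(lambda w: int(X(w) <= x), label)

# Averaging the conditional probability recovers P(A),
# in line with the law of total expectation.
assert expect(lambda w: p_A[w]) == expect(lambda w: int(A(w)))
```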


## References
1.  U.H. Thygesen,
    *Lecture notes on diffusions and stochastic differential equations*,
    2021, Polyteknisk Kompendie.