---
title: "Pulay mixing"
sort_title: "Pulay mixing"
date: 2021-03-02
categories:
- Numerical methods
layout: "concept"
---

Some numerical problems are most easily solved *iteratively*,
by generating a sequence of approximate "solutions" $$f_1$$, $$f_2$$, etc.
that converges towards the true $$f_\infty$$.
**Pulay mixing**, also often called
**direct inversion in the iterative subspace** (DIIS),
can speed up the convergence for some types of problems,
and also helps to avoid periodic divergences.

The key concept it relies on is the **residual vector** $$R_n$$
of the $$n$$th iteration, which measures the error of the current $$f_n$$.
Its exact definition can vary,
but it is generally the difference between
the input $$f_n$$ of the $$n$$th iteration
and the raw resulting output $$f_n^\mathrm{new}$$:

$$\begin{aligned}
	R_n
	\equiv R[f_n]
	\equiv f_n^\mathrm{new}[f_n] - f_n
\end{aligned}$$

It is not always clear what to do with $$f_n^\mathrm{new}$$.
Directly using it as the next input ($$f_{n+1} = f_n^\mathrm{new}$$)
often leads to oscillation,
and linear mixing ($$f_{n+1} = (1\!-\!c) f_n + c f_n^\mathrm{new}$$)
can take a very long time to converge properly.
Pulay mixing offers an improvement.
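
To make this concrete, the sketch below (in Python with NumPy) implements
the residual and the two naive update rules for a *hypothetical* fixed-point
problem $$f = g(f)$$; the map `g` and the mixing parameter `c` are
placeholder choices, not part of the method itself:

```python
import numpy as np

def g(f):
    """Hypothetical iteration map: one step of some solver, f -> f_new."""
    A = np.diag([0.5, 0.8, 0.9])        # stand-in for the real update rule
    b = np.ones(3)
    return A @ f + b

f = np.zeros(3)                         # initial guess f_1
R = g(f) - f                            # residual R_1 = f_1^new - f_1

f_direct = g(f)                         # direct update: may oscillate
c = 0.3                                 # linear-mixing parameter
f_linear = (1 - c) * f + c * g(f)       # linear mixing: stable but slow
```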

The idea is to construct the next iteration's input $$f_{n+1}$$
as a linear combination of the previous inputs $$f_1$$, $$f_2$$, ..., $$f_n$$,
such that it is as close as possible to the optimal $$f_\infty$$:

$$\begin{aligned}
	\boxed{
		f_{n+1}
		= \sum_{m = 1}^n \alpha_m f_m
	}
\end{aligned}$$

To do so, we make two assumptions.
First, that the current $$f_n$$ is already close to $$f_\infty$$,
so such a linear combination makes sense.
Second, that the iteration is linear,
so the raw output $$f_{n+1}^\mathrm{new}$$
is also a linear combination *with the same coefficients*:

$$\begin{aligned}
	f_{n+1}^\mathrm{new}
	= \sum_{m = 1}^n \alpha_m f_m^\mathrm{new}
\end{aligned}$$

We will revisit these assumptions later.
The point is that $$R_{n+1}$$ can now also be written
as a linear combination of old residuals $$R_m$$:

$$\begin{aligned}
	R_{n+1}
	= f_{n+1}^\mathrm{new} - f_{n+1}
	= \sum_{m = 1}^n \alpha_m f_m^\mathrm{new} - \sum_{m = 1}^n \alpha_m f_m
	= \sum_{m = 1}^n \alpha_m R_m
\end{aligned}$$

The goal is to choose the coefficients $$\alpha_m$$ such that
the norm of the error $$|R_{n+1}| \approx 0$$,
subject to the following constraint to preserve the normalization of $$f_{n+1}$$:

$$\begin{aligned}
	\sum_{m=1}^n \alpha_m = 1
\end{aligned}$$

We thus want to minimize the following quantity,
where $$\lambda$$ is a [Lagrange multiplier](/know/concept/lagrange-multiplier/):

$$\begin{aligned}
	\inprod{R_{n+1}}{R_{n+1}} + \lambda \sum_{m = 1}^n \alpha_m^*
	= \sum_{m=1}^n \alpha_m^* \Big( \sum_{k=1}^n \alpha_k \inprod{R_m}{R_k} + \lambda \Big)
\end{aligned}$$

By differentiating the right-hand side with respect to $$\alpha_m^*$$
and demanding that the result is zero,
we get a cheap-to-solve system of equations, in matrix form:

$$\begin{aligned}
	\begin{bmatrix}
		\inprod{R_1}{R_1} & \cdots & \inprod{R_1}{R_n} & 1 \\
		\vdots & \ddots & \vdots & \vdots \\
		\inprod{R_n}{R_1} & \cdots & \inprod{R_n}{R_n} & 1 \\
		1 & \cdots & 1 & 0
	\end{bmatrix}
	\cdot
	\begin{bmatrix}
		\alpha_1 \\ \vdots \\ \alpha_n \\ \lambda
	\end{bmatrix}
	=
	\begin{bmatrix}
		0 \\ \vdots \\ 0 \\ 1
	\end{bmatrix}
\end{aligned}$$
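
This system stays small (one row per stored iteration), so solving it is cheap.
A minimal sketch, assuming real-valued residuals stored as the rows of a
NumPy array `R_hist`, so that $$\inprod{R_m}{R_k}$$ reduces to an ordinary
dot product:

```python
import numpy as np

def pulay_coefficients(R_hist):
    """Solve the bordered system above for the mixing coefficients.

    R_hist has shape (n, D): one stored residual vector per row.
    Returns (alpha, lam); note that -lam predicts <R_{n+1}|R_{n+1}>.
    """
    n = R_hist.shape[0]
    B = np.zeros((n + 1, n + 1))
    B[:n, :n] = R_hist @ R_hist.T       # overlaps <R_m|R_k>
    B[:n, n] = 1.0                      # constraint column
    B[n, :n] = 1.0                      # constraint row
    rhs = np.zeros(n + 1)
    rhs[n] = 1.0                        # enforces sum_m alpha_m = 1
    sol = np.linalg.solve(B, rhs)
    return sol[:n], sol[n]
```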

From this, we can also see that the Lagrange multiplier
$$\lambda = - \inprod{R_{n+1}}{R_{n+1}}$$,
where $$R_{n+1}$$ is the *predicted* residual of the next iteration,
subject to the two assumptions.
This fact makes $$\lambda$$ a useful measure of convergence.

In practice, the earlier inputs $$f_1$$, $$f_2$$, etc.
are much further from $$f_\infty$$ than $$f_n$$,
so usually only the most recent $$N\!+\!1$$ inputs $$f_{n - N}, ..., f_n$$ are used.
This also keeps the matrix small:

$$\begin{aligned}
	f_{n+1}
	= \sum_{m = n-N}^n \alpha_m f_m
\end{aligned}$$
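
In code, this truncation is just a sliding window over the stored inputs
and residuals, for example with fixed-length deques
(the window size `N` here is an arbitrary placeholder):

```python
from collections import deque

N = 5                           # keep only the N+1 most recent iterations
f_hist = deque(maxlen=N + 1)    # inputs f_{n-N}, ..., f_n
R_hist = deque(maxlen=N + 1)    # matching residuals R_{n-N}, ..., R_n
```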

You might be confused by the absence of any $$f_m^\mathrm{new}$$
in the creation of $$f_{n+1}$$, as if the iteration's outputs are being ignored.
This is due to the first assumption,
which states that $$f_n^\mathrm{new}$$ and $$f_n$$ are already similar,
such that they are basically interchangeable.

Although those assumptions will clearly become more realistic as $$f_n \to f_\infty$$,
they might be very dubious at first.
Consequently, the early iterations may get "trapped"
in a suboptimal subspace spanned by $$f_1$$, $$f_2$$, etc.
Think of it like this:
we would be varying up to $$n$$ coefficients $$\alpha_m$$
to try to optimize a $$D$$-dimensional $$f_{n+1}$$, where usually $$D \gg n$$.
It is almost impossible to find a decent optimum in this way!

This problem is easy to fix,
by mixing in a small amount of the raw residuals $$R_m$$
to "nudge" $$f_{n+1}$$ towards the right subspace,
where $$\beta \in [0,1]$$ is a tunable parameter:

$$\begin{aligned}
	\boxed{
		f_{n+1}
		= \sum_{m = n-N}^n \alpha_m (f_m + \beta R_m)
	}
\end{aligned}$$

In this way, the raw outputs $$f_m^\mathrm{new}$$ are (rightfully) included via $$R_m$$,
but we still give more weight to iterations with smaller residuals.
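
Combining the pieces sketched above (the hypothetical map `g`, the helper
`pulay_coefficients`, and a sliding window), a complete Pulay/DIIS driver
could look as follows; the window size, $$\beta$$, tolerance, and iteration
cap are all placeholder choices:

```python
import numpy as np
from collections import deque

def pulay_mix(g, f0, N=5, beta=0.5, tol=1e-10, max_iter=100):
    """Iterate f -> g(f) with Pulay/DIIS mixing until the residual is small."""
    f_hist = deque(maxlen=N + 1)
    R_hist = deque(maxlen=N + 1)
    f = np.asarray(f0, dtype=float)
    for _ in range(max_iter):
        R = g(f) - f                    # residual of the current input
        f_hist.append(f)
        R_hist.append(R)
        if np.linalg.norm(R) < tol:
            break
        alpha, _ = pulay_coefficients(np.array(R_hist))
        # next input: sum_m alpha_m (f_m + beta R_m)
        f = sum(a * (fm + beta * Rm)
                for a, fm, Rm in zip(alpha, f_hist, R_hist))
    return f

f_opt = pulay_mix(g, np.zeros(3))       # reuses the example map g from before
```

On the first iteration the window holds a single residual, so the update
reduces to plain damped mixing $$f_2 = f_1 + \beta R_1$$;
the full Pulay extrapolation only kicks in once a few iterations have been stored.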

Pulay mixing is very effective for certain types of problems,
e.g. density functional theory,
where it can accelerate convergence by up to two orders of magnitude!



## References
1.  P. Pulay,
    [Convergence acceleration of iterative sequences. The case of SCF iteration](https://doi.org/10.1016/0009-2614(80)80396-4),
    1980, Elsevier.