--- title: "Pulay mixing" firstLetter: "P" publishDate: 2021-03-02 categories: - Numerical methods date: 2021-03-02T19:11:51+01:00 draft: false markup: pandoc --- # Pulay mixing Some numerical problems are most easily solved *iteratively*, by generating a series $\rho_1$, $\rho_2$, etc. converging towards the desired solution $\rho_*$. **Pulay mixing**, also often called **direct inversion in the iterative subspace** (DIIS), is an effective method to speed up convergence, which also helps to avoid periodic divergences. The key concept it relies on is the **residual vector** $R_n$ of the $n$th iteration, which in some way measures the error of the current $\rho_n$. Its exact definition varies, but is generally along the lines of the difference between the input of the iteration and the raw resulting output: $$\begin{aligned} R_n = R[\rho_n] = \rho_n^\mathrm{new}[\rho_n] - \rho_n \end{aligned}$$ It is not always clear what to do with $\rho_n^\mathrm{new}$. Directly using it as the next input ($\rho_{n+1} = \rho_n^\mathrm{new}$) often leads to oscillation, and linear mixing ($\rho_{n+1} = (1\!-\!f) \rho_n + f \rho_n^\mathrm{new}$) can take a very long time to converge properly. Pulay mixing offers an improvement. The idea is to construct the next iteration's input $\rho_{n+1}$ as a linear combination of the previous inputs $\rho_1$, $\rho_2$, ..., $\rho_n$, such that it is as close as possible to the optimal $\rho_*$: $$\begin{aligned} \boxed{ \rho_{n+1} = \sum_{m = 1}^n \alpha_m \rho_m } \end{aligned}$$ To do so, we make two assumptions. Firstly, the current $\rho_n$ is already close to $\rho_*$, so that such a linear combination makes sense. Secondly, the iteration is linear, such that the raw output $\rho_{n+1}^\mathrm{new}$ is also a linear combination with the *same coefficients*: $$\begin{aligned} \rho_{n+1}^\mathrm{new} = \sum_{m = 1}^n \alpha_m \rho_m^\mathrm{new} \end{aligned}$$ We will return to these assumptions later. The point is that $R_{n+1}$ is also a linear combination: $$\begin{aligned} R_{n+1} = \rho_{n+1}^\mathrm{new} - \rho_{n+1} = \sum_{m = 1}^n \alpha_m \rho_m^\mathrm{new} - \sum_{m = 1}^n \alpha_m \rho_m = \sum_{m = 1}^n \alpha_m R_m \end{aligned}$$ The goal is to choose the coefficients $\alpha_m$ such that the norm of the error $|R_{n+1}| \approx 0$, subject to the following constraint to preserve the normalization of $\rho_{n+1}$: $$\begin{aligned} \sum_{m=1}^n \alpha_m = 1 \end{aligned}$$ We thus want to minimize the following quantity, where $\lambda$ is a [Lagrange multiplier](/know/concept/lagrange-multiplier/): $$\begin{aligned} \braket{R_{n+1}}{R_{n+1}} + \lambda \sum_{m = 1}^n \alpha_m = \sum_{m=1}^n \alpha_m \Big( \sum_{k=1}^n \alpha_k \braket{R_m}{R_k} + \lambda \Big) \end{aligned}$$ By differentiating the right-hand side with respect to $\alpha_m$, we get a system of equations that we can write in matrix form, which is relatively cheap to solve numerically: $$\begin{aligned} \begin{bmatrix} \braket{R_1}{R_1} & \cdots & \braket{R_1}{R_n} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ \braket{R_n}{R_1} & \cdots & \braket{R_n}{R_n} & 1 \\ 1 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \\ \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} \end{aligned}$$ This method is very effective. However, in practice, the earlier inputs $\rho_1$, $\rho_2$, etc. 
This method is very effective.
However, in practice, the earlier inputs $\rho_1$, $\rho_2$, etc.
are much further from $\rho_*$ than $\rho_n$,
so usually only the most recent $N\!+\!1$ inputs $\rho_{n-N}$, ..., $\rho_n$ are used:

$$\begin{aligned}
    \rho_{n+1}
    = \sum_{m = n-N}^n \alpha_m \rho_m
\end{aligned}$$

You might be confused by the absence of all $\rho_m^\mathrm{new}$
in the creation of $\rho_{n+1}$, as if the iteration's outputs are being ignored.
This is due to the first assumption,
which states that $\rho_n^\mathrm{new}$ and $\rho_n$ are already similar,
such that they are interchangeable.

Speaking of which, about those assumptions:
while they will clearly become more accurate as $\rho_n$ approaches $\rho_*$,
they might be very dubious in the beginning.
A consequence of this is that the early iterations might get "trapped"
in a suboptimal subspace spanned by $\rho_1$, $\rho_2$, etc.
To put it another way, we would be varying $n$ coefficients $\alpha_m$
to try to optimize a $D$-dimensional $\rho_{n+1}$,
where in general $D \gg n$, at least in the beginning.

There is an easy fix to this problem:
add a small amount of the raw residual $R_m$
to "nudge" $\rho_{n+1}$ towards the right subspace,
where $\beta \in [0,1]$ is a tunable parameter:

$$\begin{aligned}
    \boxed{
        \rho_{n+1}
        = \sum_{m = n-N}^n \alpha_m (\rho_m + \beta R_m)
    }
\end{aligned}$$

In other words, we end up introducing a small amount of the raw outputs $\rho_m^\mathrm{new}$,
while still giving more weight to iterations with smaller residuals.
Pulay mixing is very effective:
it can accelerate convergence by up to one order of magnitude!

## References
1.  P. Pulay,
    [Convergence acceleration of iterative sequences. The case of SCF iteration](https://doi.org/10.1016/0009-2614(80)80396-4),
    1980, Elsevier.
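To tie everything together, below is a minimal sketch of a complete fixed-point loop
using truncated-history Pulay mixing with the $\beta$-nudge described above.
It assumes the iteration is available as a black-box function `step(rho)`
returning $\rho^\mathrm{new}[\rho]$ as a 1D NumPy array;
the names and the default values of `N`, `beta` and `tol`
are illustrative choices, not prescribed by the method:

```python
import numpy as np

def pulay_mix(step, rho0, N=5, beta=0.1, tol=1e-8, max_iter=200):
    """Fixed-point iteration rho -> step(rho), accelerated by Pulay mixing.

    Only the most recent N+1 inputs and residuals are kept, and the new
    input is nudged along the residuals by the small factor beta.
    """
    inputs, residuals = [], []
    rho = rho0
    for _ in range(max_iter):
        rho_new = step(rho)
        R = rho_new - rho                    # residual of this iteration
        if np.linalg.norm(R) < tol:
            return rho_new

        # Store this iteration, discarding anything beyond the window.
        inputs = (inputs + [rho])[-(N + 1):]
        residuals = (residuals + [R])[-(N + 1):]

        # Solve the bordered system for the coefficients (see the earlier sketch).
        n = len(residuals)
        Rmat = np.array(residuals)
        A = np.zeros((n + 1, n + 1))
        A[:n, :n] = Rmat @ Rmat.T
        A[:n, n] = 1.0
        A[n, :n] = 1.0
        b = np.zeros(n + 1)
        b[n] = 1.0
        alpha = np.linalg.solve(A, b)[:n]

        # Next input: mix the stored inputs, each nudged by beta * residual.
        rho = sum(a * (x + beta * r) for a, x, r in zip(alpha, inputs, residuals))

    raise RuntimeError("Pulay mixing did not converge within max_iter")
```

As a sanity check on this sketch: setting `N = 0` keeps only the latest iterate,
in which case the update reduces to plain linear mixing with $f = \beta$.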