Categories: Numerical methods.

Some numerical problems are most easily solved *iteratively*, by generating a series \(\rho_1\), \(\rho_2\), etc. converging towards the desired solution \(\rho_*\). **Pulay mixing**, also often called **direct inversion in the iterative subspace** (DIIS), can speed up the convergence for some types of problems, and also helps to avoid periodic divergences.

The key concept it relies on is the **residual vector** \(R_n\) of the \(n\)th iteration, which in some way measures the error of the current \(\rho_n\). Its exact definition varies, but is generally along the lines of the difference between the input of the iteration and the raw resulting output:

\[\begin{aligned} R_n = R[\rho_n] = \rho_n^\mathrm{new}[\rho_n] - \rho_n \end{aligned}\]

It is not always clear what to do with \(\rho_n^\mathrm{new}\). Directly using it as the next input (\(\rho_{n+1} = \rho_n^\mathrm{new}\)) often leads to oscillation, and linear mixing (\(\rho_{n+1} = (1\!-\!f) \rho_n + f \rho_n^\mathrm{new}\)) can take a very long time to converge properly. Pulay mixing offers an improvement.

The idea is to construct the next iteration’s input \(\rho_{n+1}\) as a linear combination of the previous inputs \(\rho_1\), \(\rho_2\), …, \(\rho_n\), such that it is as close as possible to the optimal \(\rho_*\):

\[\begin{aligned} \boxed{ \rho_{n+1} = \sum_{m = 1}^n \alpha_m \rho_m } \end{aligned}\]

To do so, we make two assumptions. Firstly, the current \(\rho_n\) is already close to \(\rho_*\), so that such a linear combination makes sense. Secondly, the iteration is linear, such that the raw output \(\rho_{n+1}^\mathrm{new}\) is also a linear combination with the *same coefficients*:

\[\begin{aligned} \rho_{n+1}^\mathrm{new} = \sum_{m = 1}^n \alpha_m \rho_m^\mathrm{new} \end{aligned}\]

We will return to these assumptions later. The point is that \(R_{n+1}\) is also a linear combination:

\[\begin{aligned} R_{n+1} = \rho_{n+1}^\mathrm{new} - \rho_{n+1} = \sum_{m = 1}^n \alpha_m \rho_m^\mathrm{new} - \sum_{m = 1}^n \alpha_m \rho_m = \sum_{m = 1}^n \alpha_m R_m \end{aligned}\]

The goal is to choose the coefficients \(\alpha_m\) such that the norm of the error \(|R_{n+1}| \approx 0\), subject to the following constraint to preserve the normalization of \(\rho_{n+1}\):

\[\begin{aligned} \sum_{m=1}^n \alpha_m = 1 \end{aligned}\]

We thus want to minimize the following quantity, where \(\lambda\) is a Lagrange multiplier:

\[\begin{aligned} \braket{R_{n+1}}{R_{n+1}} + \lambda \sum_{m = 1}^n \alpha_m^* = \sum_{m=1}^n \alpha_m^* \Big( \sum_{k=1}^n \alpha_k \braket{R_m}{R_k} + \lambda \Big) \end{aligned}\]

By differentiating the right-hand side with respect to \(\alpha_m^*\) and demanding that the result is zero, we get a system of equations that we can write in matrix form, which is cheap to solve:

\[\begin{aligned} \begin{bmatrix} \braket{R_1}{R_1} & \cdots & \braket{R_1}{R_n} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ \braket{R_n}{R_1} & \cdots & \braket{R_n}{R_n} & 1 \\ 1 & \cdots & 1 & 0 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \\ \lambda \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} \end{aligned}\]

From this, we can also see that the Lagrange multiplier \(\lambda = - \braket{R_{n+1}}{R_{n+1}}\), where \(R_{n+1}\) is the *predicted* residual of the next iteration, subject to the two assumptions.

However, in practice, the earlier inputs \(\rho_1\), \(\rho_2\), etc. are much further from \(\rho_*\) than \(\rho_n\), so usually only the most recent \(N\!+\!1\) inputs \(\rho_{n - N}\), …, \(\rho_n\) are used:

\[\begin{aligned} \rho_{n+1} = \sum_{m = n-N}^n \alpha_m \rho_m \end{aligned}\]

You might be confused by the absence of any \(\rho_m^\mathrm{new}\) in the creation of \(\rho_{n+1}\), as if the iteration’s outputs are being ignored. This is due to the first assumption, which states that \(\rho_n^\mathrm{new}\) are \(\rho_n\) are already similar, such that they are basically interchangeable.

Speaking of which, about those assumptions: while they will clearly become more accurate as \(\rho_n\) approaches \(\rho_*\), they might be very dubious in the beginning. A consequence of this is that the early iterations might get “trapped” in a suboptimal subspace spanned by \(\rho_1\), \(\rho_2\), etc. To say it another way, we would be varying \(n\) coefficients \(\alpha_m\) to try to optimize a \(D\)-dimensional \(\rho_{n+1}\), where in general \(D \gg n\), at least in the beginning.

There is an easy fix to this problem: add a small amount of the raw residual \(R_m\) to “nudge” \(\rho_{n+1}\) towards the right subspace, where \(\beta \in [0,1]\) is a tunable parameter:

\[\begin{aligned} \boxed{ \rho_{n+1} = \sum_{m = N}^n \alpha_m (\rho_m + \beta R_m) } \end{aligned}\]

In other words, we end up introducing a small amount of the raw outputs \(\rho_m^\mathrm{new}\), while still giving more weight to iterations with smaller residuals.

Pulay mixing is very effective for certain types of problems, e.g. density functional theory, where it can accelerate convergence by up to two orders of magnitude!

- P. Pulay, Convergence acceleration of iterative sequences. The case of SCF iteration, 1980, Elsevier.

© Marcus R.A. Newman, a.k.a. "Prefetch".
Available under CC BY-SA 4.0.