This tutorial assumes a basic familiarity both with Markov Chain Monte Carlo (MCMC) sampling methods and with linear algebra. Before you get started, make sure you're comfortable with the language used when discussing MCMC analyses, and that you know what an eigenvector is.
Let's take a look at using umbrella sampling to model a complicated distribution. First things first, we'll simulate some ground-truth data.
import numpy as np
import numpy.polynomial.polynomial as poly
import matplotlib.pyplot as plt
# set the seed for reproducibility
np.random.seed(234)
# generate a random polynomial
xpts = np.random.uniform(0,10, size=6)
ypts = np.random.uniform(3,6, size=6)
pfit = poly.polyfit(xpts,ypts,5)
# here's the simulated probability density function (pdf)
Npdf = int(1e6)
x_pdf = np.linspace(0,10,Npdf)
y_pdf = poly.polyval(x_pdf, pfit)
y_pdf = y_pdf/(np.sum(y_pdf)/Npdf*(x_pdf.max()-x_pdf.min()))
plt.figure()
plt.rc("font", size=16)
plt.plot(x_pdf, y_pdf/y_pdf.max(), c="k", lw=2)
plt.xlabel(r"$x$")
plt.show()
For the target distribution above, there is a low-probability "valley" at $x\approx5$ which can cause a bottleneck for the sampler, leading to inefficient exploration of the posterior. The idea behind umbrella sampling is that rather than sampling from the full distribution all at once, we'll break the complicated distribution, $\pi(x)$, into several smaller windows, sample from each window separately, and then recombine the sub-samples, $\pi_i(x)$, into a single joint posterior distribution. We'll want to center one of our windows on the low-probability region around $x\approx5$, so we'll need at least three windows for this problem: the "valley" umbrella, plus one on either side to cover the rest of the distribution.
Let's mark the rough edges of our windows like so:
plt.figure()
plt.rc("font", size=16)
plt.plot(x_pdf, y_pdf/y_pdf.max(), c="k", lw=2)
plt.axvspan(0, 4, color="C0", alpha=0.2)
plt.axvspan(1, 9, color="C1", alpha=0.2)
plt.axvspan(6, 10, color="C2", alpha=0.2)
plt.xlabel("x")
plt.show()
For reasons which will become clear in a moment, we need our windows to overlap a bit. Each window will also be assigned its own bias function, $\psi(x)$, sometimes called simply a bias or an umbrella. The bias functions serve to restrict the sampler to a given window and to ensure that samples are drawn from the low-probability regions. Let's define our biases now.
import scipy.stats as stats
def psi_0(x):
    return 2*stats.triang(c=0.5, loc=-4, scale=8).pdf(x)

def psi_1(x):
    return stats.triang(c=0.5, loc=1, scale=8).pdf(x)

def psi_2(x):
    return 2*stats.triang(c=0.5, loc=6, scale=8).pdf(x)
psi_fxns = [psi_0, psi_1, psi_2]
Nwin = len(psi_fxns)
plt.figure()
plt.rc("font", size=16)
plt.plot(x_pdf, y_pdf/y_pdf.max(), c="k", lw=3)
for i in range(Nwin):
    plt.plot(x_pdf, psi_fxns[i](x_pdf)/np.max(psi_fxns[i](x_pdf)), lw=3, ls="--")
    plt.text(5*i, 1.1, r"$\psi_{0}$".format(i), color="C{0}".format(i), fontsize=24, ha="center")
plt.savefig("/Users/research/Desktop/umbrella_small.png")
plt.show()
I've used tent biases (triangular potentials) because they're easy to set up and perform calculations with, but you can use whatever functional form you like. As long as (1) the windows overlap, and (2) we sample adequately from each window, the results will be insensitive to any particular choice of $\psi$.
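To make that concreteness claim a bit more tangible, here's a quick sketch of what an alternative set of umbrellas might look like, using Gaussian bumps instead of tents. The centers and widths below are just illustrative choices of mine (they aren't used anywhere else in this tutorial); the only real requirements are that neighboring windows overlap and that each one gets sampled well.
# hypothetical alternative: Gaussian umbrellas instead of tents
# (centers/widths chosen by eye; not used in the rest of this tutorial)
def psi_0_alt(x):
    return stats.norm(loc=0, scale=2).pdf(x)

def psi_1_alt(x):
    return stats.norm(loc=5, scale=2).pdf(x)

def psi_2_alt(x):
    return stats.norm(loc=10, scale=2).pdf(x)

psi_fxns_alt = [psi_0_alt, psi_1_alt, psi_2_alt]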
Now that we've got our umbrellas defined, it's time to sample! For a real problem, you'd use a Monte Carlo sampler such as PyMC3 or emcee, but today we're gonna cheat a little and use np.random to draw samples directly from the known target distribution.
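If you're curious what the non-cheating version might look like, here's a rough sketch of sampling a single biased window with emcee. Treat it as illustrative only: log_prob_biased is a helper I made up for this aside, the walker and step counts are arbitrary, and in a real problem the target term would be your model's log-posterior rather than the polynomial pdf we built above.
# rough sketch: sampling window 1 with emcee instead of np.random
# (illustrative only; in a real problem the target would be your log-posterior)
import emcee

def log_prob_biased(theta, psi):
    x = theta[0]
    if (x < 0) or (x > 10):
        return -np.inf
    p = poly.polyval(x, pfit) * psi(x)
    if p <= 0:
        return -np.inf
    return np.log(p)

nwalkers, ndim = 16, 1
p0 = np.random.uniform(4, 6, size=(nwalkers, ndim))   # start the walkers inside window 1
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob_biased, args=(psi_1,))
sampler.run_mcmc(p0, 5000)
biased_chain = sampler.get_chain(discard=1000, flat=True)[:,0]
But as promised, for this tutorial we'll just draw from the known pdf directly: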
# draw samples from each window
Nsamp = Npdf//2
samples = np.zeros((Nwin,Nsamp))
for i, psi in enumerate(psi_fxns):
    prob = psi(x_pdf)*y_pdf
    prob /= np.sum(prob)
    samples[i] = np.random.choice(x_pdf, p=prob, size=Nsamp)
# make plots
fig, ax = plt.subplots(1,2, figsize=(12,4))
plt.rc("font", size=16)
bins = np.linspace(0,10,41)
for i, samp in enumerate(samples):
    ax[0].hist(samp, bins=bins, density=True, histtype="step", lw=2, label=r"$\pi_{0}$".format(i))
ax[0].set_yticks([])
ax[0].legend(loc="upper center")
y, x = np.histogram(np.hstack(samples), bins=bins, density=True)
x = 0.5*(x[1:]+x[:-1])
ax[1].plot(x, y, lw=2, color="k", label="samples")
ax[1].fill_between(x_pdf, y_pdf, color="lightgrey", alpha=0.5, label="truth")
ax[1].legend(loc="upper center")
ax[1].set_yticks([])
plt.show()
The left panel above shows the biased posterior samples from each window. Notice how $\pi_1$ is only weakly bimodal despite the deep valley in the target distribution. This leveling effect is the purpose of the biasing function $\psi_1$, which has helped us draw samples from the low probability region of $\pi$.
The right panel shows the posterior results when we naively combine all biased samples into a single distribution. It's not the worst estimate in the world, but clearly we need to account for the effects of the (known) biasing potentials and the (unknown) zero-point offsets between each of the three sub-distributions. Here's what the sub-distributions look like after reweighting samples to remove the effects of each $\psi_i$:
plt.figure()
plt.rc("font", size=16)
bins = np.linspace(0,10,81)
# weights for each window removing bias
for i, psi in enumerate(psi_fxns):
    w = 1/psi(samples[i])
    w /= np.sum(w)
    y, x = np.histogram(samples[i], bins=bins, weights=w)
    x = 0.5*(x[1:]+x[:-1])
    plt.plot(x[y>0], y[y>0], lw=2)
# ground truth renormalized for visualization
plt.fill_between(x_pdf, y_pdf/3, color="lightgrey", alpha=0.5, label="truth")
plt.legend(loc="upper right")
plt.yticks([])
plt.show()
The individual sub-distributions now follow the curvature of the target distribution, but they're offset from one another by an unknown amount. To account for these offsets, we need to calculate the relative window weights, $z_i$. We can make a crude estimate of $z_i$ by simply averaging the debiased samples within each overlap region. To track our calculations, we'll create a 3x2 matrix, $\Phi$, where each row contains the average of $\pi_i$ within each of the two overlap regions (one value per column).
Phi = np.zeros((3,2))
for i, psi in enumerate(psi_fxns):
    w = 1/psi(samples[i])
    w /= np.sum(w)
    y, x = np.histogram(samples[i], bins=[0,1,4,6,9,10], weights=w)
    Phi[i,0] = np.mean(y[1])   # overlap region between windows 0 & 1
    Phi[i,1] = np.mean(y[3])   # overlap region between windows 1 & 2
print(Phi.round(2))
After computing the averages, we get:
\begin{equation}\tag{1} \Phi = \begin{bmatrix} 0.71 & 0 \\ 0.5 & 0.39 \\ 0 & 0.73 \\ \end{bmatrix} \end{equation}
Let's consider what each matrix element means. $\Phi_{01} = 0$ indicates that no samples from window 0 fall in the overlap region between windows 1 & 2, and likewise $\Phi_{20} = 0$ indicates that no samples from window 2 fall in the overlap region between windows 0 & 1. These zero elements are expected from our construction of the windows. $\Phi_{00} = 0.71$ and $\Phi_{10}=0.5$ give us the average values of $\pi_0$ vs $\pi_1$ in their mutual overlap region, and likewise $\Phi_{11} = 0.39$ and $\Phi_{21}=0.73$ give the average values of $\pi_1$ vs $\pi_2$ in theirs. Renormalizing everything relative to $\pi_1$ (which overlaps both of the other distributions) yields:
z_wham = np.array([Phi[0,0]/Phi[1,0], 1.0, Phi[2,1]/Phi[1,1]])
plt.figure()
plt.rc("font", size=16)
bins = np.linspace(0,10,81)
for i, psi in enumerate(psi_fxns):
    w = 1/psi(samples[i])
    w /= np.sum(w)
    y, x = np.histogram(samples[i], bins=bins, weights=w)
    x = 0.5*(x[1:]+x[:-1])
    plt.plot(x[y>0], y[y>0]/z_wham[i], lw=3, ls="--")
# ground truth renormalized for visualization
plt.fill_between(x_pdf, y_pdf/6, color="lightgrey", alpha=0.5, label="truth")
plt.yticks([])
plt.legend(loc="upper right")
plt.show()
That looks pretty good! Don't be fooled, though - our crude weighting method only worked this well because we're dealing with a simple toy problem. For real world applications, you'll need something a little more sophisticated.
Now, calculating $z_i$ by hand is a good learning experience, but what we really want is an algorithm for determining window weights automatically given a set of umbrellas, $\psi_i$, and corresponding biased sub-samples, $\pi_i$. As a reminder, here's what everything looked like immediately after sampling:
# make plots
fig, ax = plt.subplots(1,2, figsize=(12,4))
plt.rc("font", size=16)
bins = np.linspace(0,10,41)
ax[0].plot(x_pdf, y_pdf/y_pdf.max(), c="k", lw=3)
for i in range(Nwin):
    ax[0].plot(x_pdf, psi_fxns[i](x_pdf)/np.max(psi_fxns[i](x_pdf)), lw=3, ls="--")
    ax[0].text(5*i, 1.1, r"$\psi_{0}$".format(i), color="C{0}".format(i), fontsize=24, ha="center")
    ax[1].hist(samples[i], bins=bins, density=True, histtype="step", lw=2, label=r"$\pi_{0}$".format(i))
ax[1].set_yticks([])
ax[1].legend(loc="upper center")
plt.savefig("/Users/research/Desktop/umbrella.png")
plt.show()
We can summarize the entire end-to-end procedure for calculating window weights as
\begin{equation}\tag{2} z_i = \int \psi_i(x)\,\pi(x)\,dx = \langle \psi_i \rangle_{\pi} \end{equation}
where $\langle f \rangle_{\pi}$ denotes an average of some function $f$ with respect to $\pi$. In other words, to determine each $z_i$ we're going to take the average of each $\psi_i$ weighted by the empirically sampled target distribution, $\pi$.
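Since we cheated and tabulated the target pdf on a grid for this toy problem, we can actually check Equation (2) by brute-force numerical integration. Keep in mind that this shortcut isn't available in a real application (as the next paragraph explains); the snippet below, and the name z_direct, are just my own sanity check.
# direct numerical integration of Eq. (2), only possible because the toy pdf is known
dx = x_pdf[1] - x_pdf[0]
z_direct = np.array([np.sum(psi(x_pdf)*y_pdf)*dx for psi in psi_fxns])
z_direct /= np.sum(z_direct)   # normalize to sum to 1, matching the convention used later
print("z from direct integration:", z_direct.round(4))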
If the full target distribution, $\pi(x)$, were known ahead of time then solving this integral would be easy. But of course $\pi$ is not known - it's what we're trying to determine! Furthermore, we don't actually have samples of $\pi$ yet, we have three sets of biased sub-samples, $\pi_i$, so we'll need to compute $\langle \psi_i \rangle_{\pi_j}$ for each $(i,j)$ and then combine these to estimate $\langle \psi_i \rangle_{\pi}$. The challenge is that this final combination step depends on $z$, making the whole process a bit circular.
Different methods for implementing umbrella sampling more or less come down to different methods for solving the integral in Equation (2). The most popular method is the Weighted Histogram Analysis Method (WHAM; Kumar+ 1992), which works by binning the data and computing a histogram in the overlap region. The procedure we walked through above is essentially a single-bin implementation of WHAM. Another popular method is the Multistate Bennett Acceptance Ratio (MBAR; Shirts & Chodera 2008), which does not require discretization of the data.
Recently, Thiede+ 2016 and Dinner+ 2017 demonstrated that this calculation can be recast as an eigenvector problem, which they call the Eigenvector Method of Umbrella Sampling (EMUS). I like linear algebra - it's fast and robust and feels a bit like dark magic sometimes - so we'll use EMUS from here on out.
We can restate the integral in Equation (2) above as an explicit sum over the biased sub-distributions:
\begin{equation}\tag{3} z_i = \sum_{j=1}^N \Bigg\langle \frac{\psi_i(x)}{\sum_{k=1}^N \psi_k(x)/z_k} \Bigg\rangle_{\pi_j} \end{equation}
Notice that $z$ shows up on both the left side and the right side of the equation. This is the circularity I alluded to earlier, and it means that we'll eventually have to solve for $z$ iteratively. To do so using EMUS, we first define a square overlap matrix, $F$, with each element $(i,j)$ defined as
\begin{equation}\tag{4} F_{ij} = \Bigg\langle \frac{\psi_j/z_i}{\sum_{k=1}^N \psi_k/z_k} \Bigg\rangle_{\pi_i} \end{equation}
As its name implies, $F$ tracks the extent to which samples drawn within one window fall under the umbrella of any other window. I'll demonstrate how to calculate $F$ in a moment, but for now let's jump ahead to the results for our example problem.
After running the EMUS algorithm, the overlap matrix for our example problem comes out as:
\begin{equation}\tag{5} F = \begin{bmatrix} 0.83 & 0.07 & 0 \\ 0.39 & 0.72 & 0.31 \\ 0 & 0.08 & 0.8 \\ \end{bmatrix} \end{equation}
The zero-valued corner element $F_{02}$ reflects the fact that umbrellas $\psi_0$ and $\psi_2$ don't overlap, so there are no samples of $\pi_0$ which fall under $\psi_2$. By the same logic, no samples of $\pi_2$ fall under $\psi_0$, so $F_{20} = 0$ as well.
The on-diagonal terms all have relatively large values because each set of sub-samples $\pi_i$ falls under its own umbrella $\psi_i$. The smaller value of $F_{11}$ compared to $F_{00}$ and $F_{22}$ reflects the fact that most of the samples of $\pi_1$ are drawn from regions of the target distribution where $\psi_1$ is small, i.e. the peak of $\psi_1$ corresponds to the valley in $\pi$.
The small value of $F_{01} = 0.07$ indicates that most samples of $\pi_0$ are drawn from locations where $\psi_1$ is small (or zero), whereas the moderate value of $F_{10} = 0.39$ indicates that a substantial fraction of samples of $\pi_1$ fall under umbrella $\psi_0$. Comparing $F_{01}$ vs $F_{10}$ gives us the relative weights of $z_0$ vs $z_1$. This same argument can be applied to the remaining two off diagonal terms $F_{12}$ and $F_{21}$.
So that's $F$, but what we really care about are the window weights, $z_i$. We'd like to be able to calculate $z_i$ using linear algebra, so let's first define $z \equiv [z_1,z_2,...,z_N]$ as a vector. Taking the product of $z$ and the $j^{th}$ column of $F$ yields
\begin{equation}\tag{6} \sum_{i=1}^N z_i F_{ij} = \sum_{i=1}^N \Bigg\langle \frac{\psi_j}{\sum_{k=1}^N \psi_k/z_k} \Bigg\rangle_{\pi_i} = \langle \psi_j \rangle_{\pi} \end{equation}
Recall from Equation (2) that $\langle \psi_j \rangle_{\pi} = z_j$, so $\sum_i z_i F_{ij} = z_j$. Considering all columns in $F$ simultaneously yields the left eigenvalue problem
\begin{equation}\tag{7} zF = z \end{equation}
which, when solved, provides an estimate of the window weights.
If we knew $F$ a priori, finding the eigenvalues and eigenvectors of Equation (7) would be a straightforward application of linear algebra. But, in practice, we need to estimate both $z$ and $F$ from our empirical samples, $\pi_i$. As we noted earlier, this must be done iteratively. Our strategy will be to pick a starting guess for $z$ and calculate a first estimate of $F$ from Equation (4). We'll then use our estimate of $F$ to calculate an updated value for $z$ using Equation (7), and then iterate between Equations (4) and (7) until the result converges.
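Before we build that iteration, here's a quick sketch of what the "straightforward linear algebra" step looks like on its own: the left eigenvectors of $F$ are the right eigenvectors of $F^T$, so numpy's eigensolver can tackle Equation (7) directly. I'm plugging in the rounded matrix quoted in Equation (5) purely for illustration (so expect only approximate numbers), and F_eq5 / z_eig are just names I made up for this aside.
# solve zF = z directly with numpy, using the rounded overlap matrix from Eq. (5)
F_eq5 = np.array([[0.83, 0.07, 0.  ],
                  [0.39, 0.72, 0.31],
                  [0.  , 0.08, 0.8 ]])
evals, evecs = np.linalg.eig(F_eq5.T)
z_eig = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])   # eigenvector with eigenvalue closest to 1
z_eig /= np.sum(z_eig)
print("z from eigensolver:", z_eig.round(4))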
With all the necessary background theory covered, we're now finally in a position to calculate our window weights. First we'll need to define a couple of functions for iteratively solving Equations (4) and (7) for $F$ and $z$, respectively. Credit where credit is due: the functions below have been adapted and simplified from a more complete EMUS implementation by Matthews+ 2018. Their code contains a lot of flexibility for automatic integration with the popular MCMC sampler emcee and for improved efficiency when the rough posterior geometry is not well understood prior to sampling. We don't need all that right now, so I've built a stripped-down version of their functions for the sake of clarity in this tutorial. Once you've got a good grasp of the examples here, I encourage you to take a look at the original source code. Here's my simplified version:
from scipy import linalg
# Equation (4)
def F_iter(psi_fxns, samples, z):
    Nwin = len(psi_fxns)
    F = np.zeros((Nwin,Nwin))
    for i in range(Nwin):
        denom = 0.
        for k in range(Nwin):
            denom += psi_fxns[k](samples[i])/z[k]
        for j in range(Nwin):
            num = psi_fxns[j](samples[i])/z[i]
            F[i,j] = np.mean(num/denom)
    return F
# Equation (7)
def z_iter(F, tol=1.E-10, max_iter=100):
    # the left eigenvector z of F with eigenvalue 1 satisfies z(I-F) = 0,
    # i.e. z is orthogonal to the columns of M = I-F; the last column of Q
    # from a QR factorization of M provides such a vector
    M = np.eye(len(F))-F
    q,r = linalg.qr(M)
    z = q[:,-1]
    z /= np.sum(z)
    # polish the solution using the power method,
    # which isolates the largest (aka dominant) eigenvalue
    for itr in range(max_iter):
        znew = np.dot(z,F)
        tv = np.abs(znew[z > 0] - z[z > 0])
        tv = tv/z[z > 0]
        maxresid = np.max(tv)
        if maxresid < tol:
            break
        else:
            z = znew
    # return normalized (by convention)
    return z/np.sum(z)
As a starting estimate for $z_i$, we'll take the average of each $\psi_i$ with respect to $\pi_i$.
z = np.zeros(Nwin)
for i, psi in enumerate(psi_fxns):
    z[i] = np.mean(psi(samples[i]))
z /= np.sum(z)
print("Starting estimate for z:", z.round(4))
Now all that's left to do is run the calculation. We'll print out the first and last iteration and you'll see that the results converge quite rapidly.
# first iteration of Eq. 4 & 7
F = F_iter(psi_fxns, samples, z)
z = z_iter(F)
print("\n\nITERATION 1")
print("\n F\n--\n\n", F.round(4))
print("\n z\n--\n", z.round(4))
# iterate until convergence
for n in range(100):
    # for numerical stability
    z_old = np.copy(z)
    z_old[z_old < 1e-100] = 1e-100
    # Eq. 4 & 7 again
    F = F_iter(psi_fxns, samples, z)
    z = z_iter(F)
    # check if we have converged
    if np.max(np.abs(z-z_old)/z_old) < 1e-10:
        break
print("\n\nITERATION {0}".format(n+1))
print("\n F\n--\n\n", F.round(4))
print("\n z\n--\n", z.round(4))
We reached our convergence tolerance after just four iterations, and after just a single iteration the results were already very close to their final values. What's more, we'll reach convergence nearly as fast regardless of our starting estimate for $z$. Try running this code yourself with a few different guesses for $z$ and you'll see what I mean.
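Here's one such experiment as a minimal sketch (the uniform starting guess and the _alt variable names are my own choices): restart the iteration from a deliberately uninformative guess and confirm that it lands in the same place.
# restart the F/z iteration from a uniform starting guess (illustrative experiment)
z_alt = np.ones(Nwin)/Nwin
for n_alt in range(100):
    z_old = np.copy(z_alt)
    F_alt = F_iter(psi_fxns, samples, z_alt)
    z_alt = z_iter(F_alt)
    if np.max(np.abs(z_alt - z_old)/z_old) < 1e-10:
        break
print("converged after {0} iterations:".format(n_alt+1), z_alt.round(4))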
EMUS is both fast and robust - exactly what we want in an algorithm. All that's left to do is plot our results and make sure they're accurate (spoiler alert: they are).
weights = []
for i in range(Nwin):
    # compute weights as in Eq. 4 of Matthews et al.
    wd = 0.
    for k in range(Nwin):
        wd += psi_fxns[k](samples[i])/z[k]
    weights.append(1./wd)
weights = np.array(weights).flatten()
sub_samples = np.random.choice(np.array(samples).flatten(), p=weights/np.sum(weights), size=Nsamp)
# histogram using all weighted samples
fig, ax = plt.subplots(1,2, figsize=(13,4))
ax[0].hist(np.array(samples).flatten(), weights=weights, bins=bins, histtype="step", density=True, lw=2, color="k", label='samples')
ax[0].fill_between(x_pdf, y_pdf, label='truth', color='lightgrey', alpha=0.5)
ax[0].set_yticks([])
ax[0].legend(frameon=False, loc='upper right')
ax[0].set_title("all weighted samples")
# histogram using sub-samples
ax[1].hist(sub_samples, bins=bins, histtype="step", density=True, lw=2, color="r", label='samples')
ax[1].fill_between(x_pdf, y_pdf, label='truth', color='lightgrey', alpha=0.5)
ax[1].set_yticks([])
ax[1].legend(frameon=False, loc='upper right')
ax[1].set_title("sub-sampled draws")
plt.show()
The left plot shows the weighted histogram calculated using all samples, whereas the right plot shows an unweighted histogram of a subset of samples selected via weighted draws from the full set of samples. As expected, the results are equivalent.
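If you'd like something more quantitative than eyeballing the plots, a sketch like the following (my addition; the variable names are arbitrary) compares the weighted-sample estimate against the tabulated truth bin by bin:
# bin-by-bin comparison of the weighted-sample estimate against the known pdf
y_est, edges = np.histogram(np.array(samples).flatten(), weights=weights, bins=bins, density=True)
centers = 0.5*(edges[1:] + edges[:-1])
y_true = np.interp(centers, x_pdf, y_pdf)
print("mean absolute error:", np.mean(np.abs(y_est - y_true)).round(4))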