Introduction to Approximate Bayesian Computation (ABC)

Many of the posts in this blog have been concerned with using MCMC based methods for Bayesian inference. These methods are typically “exact” in the sense that they have the exact posterior distribution of interest as their target equilibrium distribution, but are obviously “approximate”, in that for any finite amount of computing time, we can only generate a finite sample of correlated realisations from a Markov chain that we hope is close to equilibrium.

Approximate Bayesian Computation (ABC) methods go a step further, and generate samples from a distribution which is not the true posterior distribution of interest, but a distribution which is hoped to be close to the real posterior distribution of interest. There are many variants on ABC, and I won’t get around to explaining all of them in this blog. The wikipedia page on ABC is a good starting point for further reading. In this post I’ll explain the most basic rejection sampling version of ABC, and in a subsequent post, I’ll explain a sequential refinement, often referred to as ABC-SMC. As usual, I’ll use R code to illustrate the ideas.

Basic idea

There is a close connection between “likelihood free” MCMC methods and those of approximate Bayesian computation (ABC). To keep things simple, consider the case of a perfectly observed system, so that there is no latent variable layer. Then there are model parameters \theta described by a prior \pi(\theta), and a forwards-simulation model for the data x, defined by \pi(x|\theta). It is clear that a simple algorithm for simulating from the desired posterior \pi(\theta|x) can be obtained as follows. First simulate from the joint distribution \pi(\theta,x) by simulating \theta^\star\sim\pi(\theta) and then x^\star\sim \pi(x|\theta^\star). This gives a sample (\theta^\star,x^\star) from the joint distribution. A simple rejection algorithm which rejects the proposed pair unless x^\star matches the true data x clearly gives a sample from the required posterior distribution.

Exact rejection sampling

  • 1. Sample \theta^\star \sim \pi(\theta^\star)
  • 2. Sample x^\star\sim \pi(x^\star|\theta^\star)
  • 3. If x^\star=x, keep \theta^\star as a sample from \pi(\theta|x), otherwise reject.
  • 4. Return to step 1.

This algorithm is exact, and for discrete x will have a non-zero acceptance rate. However, in most interesting problems the rejection rate will be intolerably high. In particular, the acceptance rate will typically be zero for continuous valued x.

ABC rejection sampling

The ABC “approximation” is to accept values provided that x^\star is “sufficiently close” to x. In the first instance, we can formulate this as follows.

  • 1. Sample \theta^\star \sim \pi(\theta^\star)
  • 2. Sample x^\star\sim \pi(x^\star|\theta^\star)
  • 3. If \Vert x^\star-x\Vert< \epsilon, keep \theta^\star as a sample from \pi(\theta|x), otherwise reject.
  • 4. Return to step 1.

Euclidean distance is usually chosen as the norm, though any norm can be used. This procedure is “honest”, in the sense that it produces exact realisations from

\theta^\star\big|\Vert x^\star-x\Vert < \epsilon.

For suitable small choice of \epsilon, this will closely approximate the true posterior. However, smaller choices of \epsilon will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional x, where it is often unrealistic to expect a close match between all components of x and the simulated data x^\star, even for a good choice of \theta^\star. In this case, it makes more sense to look for good agreement between particular aspects of x, such as the mean, or variance, or auto-correlation, depending on the exact problem and context.

In the simplest case, this is done by forming a (vector of) summary statistic(s), s(x^\star) (ideally a sufficient statistic), and accepting provided that \Vert s(x^\star)-s(x)\Vert<\epsilon for some suitable choice of metric and \epsilon. We will return to this issue in a subsequent post.

Inference for an intractable Markov process

I’ll illustrate ABC in the context of parameter inference for a Markov process with an intractable transition kernel: the discrete stochastic Lotka-Volterra model. A function for simulating exact realisations from the intractable kernel is included in the smfsb CRAN package discussed in a previous post. Using pMCMC to solve the parameter inference problem is discussed in another post. It may be helpful to skim through those posts quickly to become familiar with this problem before proceeding.

So, for a given proposed set of parameters, realisations from the process can be sampled using the functions simTs and stepLV (from the smfsb package). We will use the sample data set LVperfect (from the LVdata dataset) as our “true”, or “target” data, and try to find parameters for the process which are consistent with this data. A fairly minimal R script for this problem is given below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
require(smfsb)
data(LVdata)
 
N=1e5
message(paste("N =",N))
prior=cbind(th1=exp(runif(N,-6,2)),th2=exp(runif(N,-6,2)),th3=exp(runif(N,-6,2)))
rows=lapply(1:N,function(i){prior[i,]})
message("starting simulation")
samples=lapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
message("finished simulation")
 
distance<-function(ts)
{
  diff=ts-LVperfect
  sum(diff*diff)
}
 
message("computing distances")
dist=lapply(samples,distance)
message("distances computed")
 
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=prior[dist<cutoff,]
 
op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)

This script should take 5-10 minutes to run on a decent laptop, and will result in histograms of the posterior marginals for the components of \theta and \log(\theta). Note that I have deliberately adopted a functional programming style, making use of the lapply function for the most computationally intensive steps. The reason for this will soon become apparent. Note that rather than pre-specifying a cutoff \epsilon, I’ve instead picked a quantile of the distance distribution. This is common practice in scenarios where the distance is difficult to have good intuition about. In fact here I’ve gone a step further and chosen a quantile to give a final sample of size 1000. Obviously then in this case I could have just selected out the top 1000 directly, but I wanted to illustrate the quantile based approach.

One problem with the above script is that all proposed samples are stored in memory at once. This is problematic for problems involving large numbers of samples. However, it is convenient to do simulations in large batches, both for computation of quantiles, and also for efficient parallelisation. The script below illustrates how to implement a batch parallelisation strategy for this problem. Samples are generated in batches of size 10^4, and only the best fitting samples are stored before the next batch is processed. This strategy can be used to get a good sized sample based on a more stringent acceptance criterion at the cost of addition simulation time. Note that the parallelisation code will only work with recent versions of R, and works by replacing calls to lapply with the parallel version, mclapply. You should notice an appreciable speed-up on a multicore machine.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)
 
N=1e5
bs=1e4
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))
 
distance<-function(ts)
{
  diff=ts[,1]-LVprey
  sum(diff*diff)
}
 
post=NULL
for (i in 1:batches) {
  message(paste("batch",i,"of",batches))
  prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
  rows=lapply(1:bs,function(i){prior[i,]})
  samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
  dist=mclapply(samples,distance)
  dist=sapply(dist,c)
  cutoff=quantile(dist,1000/N)
  post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))
 
op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)

Note that there is an additional approximation here, since the top 100 samples from each of 10 batches of simulations won’t correspond exactly to the top 1000 samples overall, but given all of the other approximations going on in ABC, this one is likely to be the least of your worries.

Now, if you compare the approximate posteriors obtained here with the “true” posteriors obtained in an earlier post using pMCMC, you will see that these posteriors are really quite poor. However, this isn’t a very fair comparison, since we’ve only done 10^5 simulations. Jacking N up to 10^7 gives the ABC posterior below.

ABC posterior from 10^7 iterations
ABC posterior from 10^7 samples

This is a bit better, but really not great. There are two basic problems with the simplistic ABC strategy adopted here, one related to the dimensionality of the data and the other the dimensionality of the parameter space. The most basic problem that we have here is the dimensionality of the data. We have 16 (bivariate) observations, so we want our stochastic simulation to shoot at a point in a 16- or 32-dimensional space. That’s a tough call. The standard way to address this problem is to reduce the dimension of the data by introducing a few carefully chosen summary statistics and then just attempting to hit those. I’ll illustrate this in a subsequent post. The other problem is that often the prior and posterior over the parameters are quite different, and this problem too is exacerbated as the dimension of the parameter space increases. The standard way to deal with this is to sequentially adapt from the prior through a sequence of ABC posteriors. I’ll examine this in a future post as well, once I’ve also posted an introduction to the use of sequential Monte Carlo (SMC) samplers for static problems.

Further reading

For further reading, I suggest browsing the reference list of the Wikipedia page for ABC. Also look through the list of software on that page. In particular, note that there is a CRAN package, abc, providing R support for ABC. There is a vignette for this package which should be sufficient to get started.

Advertisements
Report this ad
Advertisements
Report this ad

Published by

darrenjw

I am Professor of Stochastic Modelling within the School of Mathematics & Statistics at Newcastle University, UK. I am also a computational systems biologist.

9 thoughts on “Introduction to Approximate Bayesian Computation (ABC)”

  1. Thanks for this post. I do find the wikipedia page on ABC (also available as a PLOS article by Sunnåker and colleagues) useful, and the practical example presented there quite captivating. However both me and my students find that Wikipedia article a little confusing (I am writing this as a ‘warning’ to the occasional reader) as it first presents an experiment represented as a hidden Markov model, but then the whole inference seems to be based on a system observed without error (i.e. the ‘gamma’ parameter is ignored during the inference). Other than that it’s a very good presentation.

    1. First I would like to thank Darren Wilkinson for a very nice entry on ABC, and for refering to our topic page review on the topic. I would also like to thank Umberto for information regarding the missing value of gamma in the example in the article, and for the kind words regarding our article. However, I just happen to see this post by chance, and since the publication of the ABC review we haven’t been contacted by anyone regarding the value of gamma, so we haven’t had the chance to add the information about gamma until now.

      Actually, we used gamma = 0.8 in the example (and not 1, which would have meant no measurement noise). However, this information somehow disappeared during the preparation of the PLoS article. To make the example as clear as possible we assumed that the value of gamma (=0.8) is known prior to the analysis, but in general gamma can of course be estimated as well for this model. We apologize that the information regarding the value of gamma was missing in the example.

      Consider the sequence used in the article: AAAABAABBAAAAAABAAAA. If gamma would be (close to) one (negligible measurement noise) it would mean that the only chance for a flip would be that a true flip has occured (not caused by measurement noise). Since we have 6 (of out 21 possible) switches between A and B in the sequence, the posterior for theta would not have the highest mass close to 0 (as in the example figure), but at a larger theta value. I tried to recompute the theoretical posterior for gamma at 1, which confirms this conclusion. In principle, it is therefore possible to see that gamma cannot have been set close to 1 in the computation of the posterior. Finally, note that theta appears to have a lower value that that used to generate the sequence (0.25), mainly due to that the sequence is short in combination with fast switches from B back to A (making experimental noise more likely than true switches in the system).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.