AlphaFold: Machine learning for protein structure prediction

In 2018, a group of computer scientists at DeepMind revealed a new method for protein structure prediction, called AlphaFold. In that year’s CASP competition, which benchmarks the state-of-the-art for protein structure prediction, AlphaFold swept the competition, generating more accurate predictions than any other research group.

AlphaFold has received considerable attention for this achievement, and a few weeks ago they published a scientific paper with the details of their new method. Since protein structure prediction often appears in Foldit puzzles, we wanted to review the AlphaFold method with Foldit players!

This blogpost is meant to summarize this exciting progress from AlphaFold, with an overview of their method, and some thoughts about the expected impact on protein research.

Machine Learning and Neural Nets

AlphaFold comes from DeepMind, a company well-known for tackling hard problems with machine learning algorithms. In 2016, a DeepMind program called AlphaGo famously beat a world-champion player of Go, a classic Chinese board game that is notoriously difficult for computer programs.

Machine learning (ML) is a branch of computer science that deals with self-improving algorithms. An ML algorithm is set up to perform a well-defined task, with a well-defined measure of performance. Over a “training” period, the algorithm is able to evaluate its own performance at the task and iteratively make changes that improve its performance.

One popular type of ML algorithm is a neural net, so called because it is inspired by the organization of neurons in the brain. Just like a web of neurons that communicate through synapses, a neural net is a web of virtual “nodes” that pass signals to one another. Typically, each node performs a simple mathematical operation on received signals (for example, testing if the sum of the signals exceeds some threshold), then passes on the new signal to downstream nodes. Training a neural net involves tuning the operations at each node so that the entire network produces the desired output from the training inputs.
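To make this concrete, here is a minimal sketch in Python with NumPy (purely illustrative, and far simpler than anything in AlphaFold) of the node operation described above: each node takes a weighted sum of its incoming signals and passes on the result only if it is positive. Training a network amounts to tuning these weights.

```python
import numpy as np

def relu(x):
    # A simple nonlinear operation: pass the signal on only if it is positive
    return np.maximum(0.0, x)

def forward(x, layers):
    """Push an input signal through a stack of layers.
    Each layer is a (weights, biases) pair; each node computes a weighted
    sum of its incoming signals, then applies the nonlinearity."""
    for weights, biases in layers:
        x = relu(weights @ x + biases)
    return x

# A toy network matching the diagram below: 3 inputs -> 4 hidden nodes -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(1, 4)), np.zeros(1))]
print(forward(np.array([0.5, -1.0, 2.0]), layers))
```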


A diagram of a simple neural network (from Wikimedia Commons). Signals are passed between nodes, each of which performs some simple (nonlinear) operation on the received signal and passes on the result. This network contains a single hidden layer of 4 nodes; the AlphaFold neural net contains hundreds of layers with thousands of nodes.

Neural nets have been very useful for abstracting information from complex inputs. A popular application of neural nets is the image recognition problem: the input is a 2D array of colored pixels, and the task is to classify the depicted object.

The AlphaFold algorithm is a neural net, very similar to the kind used for image recognition. In this case, the input is information about the protein sequence, and the task is to predict the distance between each pair of residues in the folded protein.

Predicted Contacts vs. Predicted Distances

Many Foldit players will already be familiar with the concept of predicted contacts. These are residues in a protein that are predicted to be close to one another (“in contact”) in the folded structure, even if they are not neighbors in the protein sequence.

These predictions come from covariance patterns that emerge during evolution. We can observe these patterns by comparing very similar protein sequences in different organisms. For instance, we could compare the hemoglobin sequence in humans, chimps, dogs, mice, etc., and look for positions that tend to co-vary (i.e. two residues that seem to change together, as if they depend on one another). Strong covariance between two residues usually suggests that those residues interact with one another in the folded structure, through side-chain packing, H-bonding, electrostatics, etc.
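As an illustrative sketch of what "covariance" means here, the toy Python function below scores two alignment columns by their mutual information. This is a deliberately simple stand-in: real methods like GREMLIN fit more sophisticated statistical models that account for indirect correlations between positions.

```python
from collections import Counter
from math import log

def mutual_information(column_i, column_j):
    """Score co-variation between two alignment columns.
    column_i and column_j are equal-length lists of amino-acid letters,
    one per related sequence. High MI suggests the positions change together."""
    n = len(column_i)
    pi = Counter(column_i)
    pj = Counter(column_j)
    pij = Counter(zip(column_i, column_j))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy alignment: positions that mutate together score higher than
# positions that vary independently.
col1 = list("EEKKEKKE")
col2 = list("KKEEKEEK")  # always the charge-swapped partner of col1
col3 = list("ALAVLAVL")  # varies independently of col1
print(mutual_information(col1, col2))  # high
print(mutual_information(col1, col3))  # lower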


Cartoon diagram of covariance (from GREMLIN). (Left) In these two related protein structures, the red and green residues interact with one another. When one of these mutates during the course of evolution, its partner may also have to mutate to maintain the interaction. (Right) Even when we don’t know the structure of these proteins, we can see evidence of this interaction when we compare lots of related protein sequences. The two positions in the dashed boxes display strong covariance.

One of the key insights of the AlphaFold group was to take these predictions a step further: instead of using covariance to predict whether two residues are “in contact” (a simple yes/no), AlphaFold attempts to predict the distance between the two residues (a range of values between 2 and 20 Å). These predictions are more difficult to make, but successful predictions provide much richer information about the folded protein structure.

We should note that, in 2018, a few other research groups were also using neural networks to predict distances—not just AlphaFold. The second insight of AlphaFold concerns their ability to generate a folded protein structure from predicted distances. They represent each distance prediction as a smooth restraint function, which allows them to employ a simple technique called gradient descent, directly folding the protein into a structure compatible with their predicted distances.

Predicted distances for residue pairs. (a) Similar to a contact map, this plot shows the predicted distance between every pair of residues in the structure. (b) For each pair of residues, the neural net produces a probability distribution over distances. For the pair marked by the blue star in (a), we can see the probability distribution favors a distance of about 8 Å. (c) The probability distribution is converted to a smooth restraint function, where the lowest point of the function corresponds to the favored distance (in this case, 8 Å). A simple gradient descent algorithm allows AlphaFold to efficiently fold a protein structure that optimizes all of their distance predictions.
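Here is a minimal sketch of that idea in Python, with a single residue pair instead of thousands, and a numerical gradient instead of the spline fits a real implementation would use. It is meant only to show how a smooth restraint lets plain gradient descent pull a structure toward the predicted distance.

```python
import numpy as np

# Distance bins and a toy predicted distribution for one residue pair,
# peaked around 8 Å (the real method has one distribution per pair).
bins = np.linspace(2.0, 20.0, 37)
probs = np.exp(-0.5 * ((bins - 8.0) / 1.5) ** 2)
probs /= probs.sum()

def restraint(d):
    """Smooth restraint: negative log-probability of the predicted
    distribution, interpolated so it has a gradient everywhere."""
    return np.interp(d, bins, -np.log(probs + 1e-9))

def restraint_grad(d, eps=1e-4):
    # Numerical gradient of the restraint (a real implementation would
    # use splines or automatic differentiation for an analytic gradient).
    return (restraint(d + eps) - restraint(d - eps)) / (2 * eps)

# Gradient descent: pull two toy "residues" toward the favored distance.
xyz = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])  # start 3 Å apart
for step in range(200):
    diff = xyz[1] - xyz[0]
    d = np.linalg.norm(diff)
    g = restraint_grad(d) * diff / d      # chain rule: dE/d(coordinates)
    xyz[0] += 0.05 * g                    # move both residues downhill
    xyz[1] -= 0.05 * g
print(np.linalg.norm(xyz[1] - xyz[0]))    # ≈ 8 Å, the restraint minimum
```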

Finally, AlphaFold combines their distance predictions with the Rosetta energy function (the same energy function used by Foldit) to refine their final folded structure.

AlphaFold Performance in CASP

The Critical Assessment of protein Structure Prediction (CASP) is an opportunity for different researchers to compare their structure prediction methods in a head-to-head competition. The CASP organizers collect unpublished protein structures and challenge researchers to predict the structures based on their protein sequence. Because the true protein structures are unpublished, all the predictions are “blind,” and all the participants can evaluate their methods on a level playing field, starting from the same information.

AlphaFold’s neural net was able to make remarkably accurate distance predictions for many of the targets of the 2018 CASP competition, and these predictions guided them to protein models that were very similar to the true structures. The best way to visualize AlphaFold’s success is to look at their summed Z-score for all targets in the Free Modeling category.
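For readers curious about the ranking metric, here is a small sketch of how a summed Z-score can be computed. Note that the official CASP formula includes details (such as flooring strongly negative Z-scores) that are omitted here for simplicity.

```python
import numpy as np

def summed_z_scores(scores):
    """scores[g][t]: model quality (e.g., GDT_TS) for group g on target t.
    For each target, convert scores to Z-scores across groups, then sum
    each group's Z-scores over all targets."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    return z.sum(axis=1)

# Toy example: 3 groups, 4 targets
print(summed_z_scores([[60, 55, 70, 40],
                       [50, 50, 65, 35],
                       [45, 40, 50, 30]]))
```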

Rankings from the 2018 CASP Free Modeling category (from CASP13). The y-axis shows the summed Z-score across all targets in the category, with all competing groups on the x-axis. The leftmost bar represents the AlphaFold group.

This is an incredible achievement, and AlphaFold represents a significant step forward in protein structure prediction, but the structure prediction problem is still far from “solved.” For most natural proteins, AlphaFold relies heavily on covariance patterns, and often struggles when the target has very few related sequences (covariance is harder to detect with just a few related sequences). However, even with zero related sequences AlphaFold can still make distance predictions, albeit with lower confidence. AlphaFold showed this by correctly predicting the structure of Foldit3, a protein designed by Foldit players, with no related sequences and no co-variance information!

One scientific limit of AlphaFold is that it suffers from the “black box” problem. Neural nets like the AlphaFold algorithm are considered “black box” techniques because their inner workings are hard to interpret. It is very difficult for us to deconstruct a neural network to figure out exactly what concepts the algorithm is “learning” about proteins. In other words, AlphaFold has improved our ability to predict a protein structure from its sequence, but it hasn’t directly increased our understanding of how protein sequence relates to structure.

Impact of AlphaFold

Since AlphaFold’s debut in 2018, many other research groups have begun experimenting with machine learning for predicting residue distances. Just this month, shortly after AlphaFold published their method, researchers at the Baker Lab published trRosetta, which builds on the AlphaFold method (see PDF from the Baker Lab website).

The Baker Lab researchers realized that a neural net could be trained to predict not just the distance between two residues, but also the relative orientation of those two residues. By training an algorithm to predict both distance and orientation between residues, the Baker group was able to make protein models with even greater accuracy!

Building on AlphaFold with trRosetta. (a) The AlphaFold neural net predicts only the distance between residue pairs. We can also train the neural net to predict the orientation of residue pairs (defined by several angles and torsions). (b) These angle and torsion predictions can also be converted into smooth restraint functions, which is key for applying the predictions to a protein model. (c) The orientation predictions improve the accuracy of final protein models for a set of CASP targets.
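To give a feel for these orientation features, the sketch below computes the planar angle and torsion that such features are built from. The atom choices in the final comment follow our reading of the trRosetta paper, but treat the details as approximate.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle (radians) defined by four atoms: the angle between
    the plane of (p0, p1, p2) and the plane of (p1, p2, p3)."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(m @ n2, n1 @ n2)

def angle(p0, p1, p2):
    """Planar angle (radians) at the middle atom p1."""
    v1 = (p0 - p1) / np.linalg.norm(p0 - p1)
    v2 = (p2 - p1) / np.linalg.norm(p2 - p1)
    return np.arccos(np.clip(v1 @ v2, -1.0, 1.0))

# Example: torsion of four points, prints ≈ 1.57 rad (90°)
print(dihedral(np.array([1.0, 0, 0]), np.array([0.0, 0, 0]),
               np.array([0.0, 1, 0]), np.array([0.0, 1, 1])))

# For a residue pair (i, j), the trRosetta features are roughly:
#   d     = |CB_i - CB_j|                    (distance)
#   omega = dihedral(CA_i, CB_i, CB_j, CA_j)
#   theta = dihedral(N_i,  CA_i, CB_i, CB_j)
#   phi   = angle(CA_i, CB_i, CB_j)
```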

The CASP competition returns in the summer of 2020, and it will be very exciting to see how other groups have incorporated AlphaFold’s progress into their own prediction methods!

However, Foldit is unlikely to see any immediate changes as a direct result of AlphaFold’s success.

Since Foldit was launched in 2008, our focus has been gradually shifting away from protein structure prediction. The main reason for this is that we think Foldit players have more to contribute to other problems, like protein design or building models into cryoEM data. It’s likely that we can use distance predictions to help with these tasks (for example, to check whether the distance predictions for a designed sequence are compatible with the designed structure, as sketched below), but for now we are still evaluating the most effective ways to use neural nets for these kinds of problems!
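As a hypothetical illustration of that parenthetical check (the function names and the tolerance are ours, not an existing Foldit or Rosetta API), one could compare a designed model's distance map against a neural net's predicted distances:

```python
import numpy as np

def distance_map(ca_coords):
    """Pairwise distance matrix from a model's C-alpha coordinates (N x 3)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def compatibility(predicted, model_coords, tolerance=4.0):
    """Fraction of residue pairs whose model distance falls within
    `tolerance` of the predicted distance. `predicted` is an N x N matrix."""
    actual = distance_map(model_coords)
    i, j = np.triu_indices_from(actual, k=1)
    return np.mean(np.abs(actual[i, j] - predicted[i, j]) < tolerance)

# Toy usage: a random 50-residue "model" and a noisy "prediction"
coords = np.random.rand(50, 3) * 30
pred = distance_map(coords) + np.random.randn(50, 50)
print(compatibility(pred, coords))
```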

Special thanks goes to Baker Lab scientist Ivan Anishchenko for contributions to this blog post!

( Posted by bkoep | Fri, 01/31/2020 - 00:25 | 8 comments )
Susume: Improve prediction for foldit designs?

The first hurdle for a foldit design making its way to the wet lab is that Rosetta has to be able to predict its structure. I wonder if augmenting the Rosetta runs with restraints from AlphaFold would let Rosetta find the fold for some foldit designs that it currently can't solve.

spvincent:

There's a domain from a protein called Streptococcus G protein that completely changes its structure from a 3-helix bundle to a 1-helix/4-strand sheet when a single amino acid is changed.

https://www.pnas.org/content/106/50/21011
http://conflux.mwclarkson.com/2009/12/a-single-residue-dictates-a-fold/

Original paper

https://www.pnas.org/content/106/50/21149

I wonder how well AlphaFold would work here? Not only AlphaFold: a case like this must pose real problems for any protein prediction method that relies on sequence similarity to determine secondary and tertiary structure.

neilpg628 (Foldit Staff): Streptococcus G

This is true; it is one of those pathological cases that scientists have found. It is a pretty small protein, which allows a larger fraction of the secondary structure to be changed by one residue. It's also interesting that binding affinity remains almost the same between the two structures.

I agree though that AlphaFold would not perform very well on a sequence like this. It doesn't look like there is much covariance across multiple native proteins.

spvincent:

tx: I've been wondering about this protein, and this blog post seemed to be as good a place as any to ask about it. It's all very well to say it's a pathological case, but without some understanding of why this is so, isn't there going to be a slight question mark hanging over protein structures determined using this method?

agcohn821 (Foldit Staff):

Hi Spvincent! Great question! I have passed this along to the team!

Go Science member: Human net

I just wonder whether our sharing isn't a kind of human net.

We try different paths and start again from the best shared result. This is done manually by sharing with the group, or automatically with ourselves via the save.SaveSolution() and save.LoadSolution() commands. At the end, we don't know which strategy won, but the best solution that emerges is likely to win the competition.

The network of parallel solutions avoids path-dependent determinism.

Would it be interesting for Rosetta@home to implement this strategy? (Automatic shares to a pool of good candidates following different criteria, and automatic loading of the currently selected solutions following those criteria.)

(I use this strategy in filtered puzzles when I have enough resources available: optimizing the bonus in tracks that run parallel to those optimizing the score.)

neilpg628 (Foldit Staff): Human Net

You could argue that way, though AlphaFold uses existing structure motifs from a wide range of PDBs to predict contact distances. What you're describing sounds like true ab initio folding, by predicting the structure from nothing more than multiple gradient descent-guided paths across the energy landscape.

*** A More powerful partial NN Update ***

Most NNs emulate discrete operations or programs instead of partial programs. NNs that emulate partial programs would be much more powerful; for example, if 2 operational modules are combined, you could achieve not 1 + 1 = 2 operational equivalency, but 1.5 + 1.5 = 3 program modules, or even 1.9 + 1.9 = 3.8 programs, by combining only 2 artificial neural networks!
