How can predicting protein folding speed up drug discovery?


I'm asking this as a layperson without much knowledge in biology, so please correct me if my understanding is wrong.

Recently, DeepMind's AlphaFold managed to predict protein structure from amino acid sequence with stunning accuracy. We are being told that this could "pave ways toward advances in drug discovery."

But I fail to see how that can happen. From my understanding, the number of possible protein structures is infinite, and if we tweak a protein structure that works well against a virus, it will either work better or worse, but the key point is that we don't know until we try. So essentially this is still a hit-and-miss thing, depending purely on dumb luck.

So how can the fast, accurate prediction of protein folding help in drug discovery?

Firstly, protein structures are not infinite. Most proteins adopt a specific structure.
Drugs carry out their function by binding to their target protein. Structure prediction helps the drug discovery process in two ways:

  1. it allows identification of pockets (where drugs can bind) in target proteins whose structures have not yet been solved using experimental methods
  2. it allows in silico experimentation, i.e. you can take a large number of molecules and simulate whether they will bind to a specific location in your target

I would suggest going through these reviews to get a better overview of the importance of fast and accurate structure determination in drug discovery:

  1. Verlinde and Hol, 1994, Structure
  2. van Montfort and Workman, 2017, Essays in Biochemistry

Simulating ligand-receptor binding usually starts with retrieving an X-ray crystal structure of the target protein (the receptor) from a bioinformatics database. The X-ray data capture important features of the protein, one of which is its conformation (its folded state). Obtaining X-ray crystallography data for a new polypeptide is an extremely hard, time-consuming process that requires advanced techniques and careful purification of the protein from biological systems. Therefore, if an algorithm can predict the folding of a protein correctly (at least to some extent), other algorithms no longer have to depend on experimental results. Fast, parallel docking surveys (virtual screening) would become much easier to run, eventually leading to new medications being designed by computer simulation, which costs far less than testing each candidate molecule one by one on animals.
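To make the idea of virtual screening a bit more concrete, here is a minimal Python sketch of the screening loop. Everything in it is an assumption made for illustration: the function names, the made-up molecule library, and especially `toy_binding_score`, which simply counts how many ligand atoms sit inside a spherical pocket. Real docking software uses far more elaborate scoring, so treat this only as the shape of the workflow (score every candidate against a known pocket, keep the best ones).

```python
import heapq
import math

def toy_binding_score(ligand_atoms, pocket_center, pocket_radius):
    """Hypothetical stand-in for a docking score.

    Returns the fraction of ligand atoms that fall inside a spherical pocket.
    Real docking programs use much richer physics-based or learned scoring.
    """
    inside = sum(
        1 for atom in ligand_atoms
        if math.dist(atom, pocket_center) <= pocket_radius
    )
    return inside / len(ligand_atoms)

def virtual_screen(library, pocket_center, pocket_radius, top_k=2):
    """Score every molecule in the library and return the top_k (score, name) pairs."""
    scored = (
        (toy_binding_score(atoms, pocket_center, pocket_radius), name)
        for name, atoms in library.items()
    )
    return heapq.nlargest(top_k, scored)

# Made-up library: molecule name -> list of (x, y, z) atom positions in angstroms.
LIBRARY = {
    "mol_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "mol_b": [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    "mol_c": [(9.0, 9.0, 9.0), (8.0, 8.0, 8.0)],
}

hits = virtual_screen(LIBRARY, pocket_center=(0.0, 0.0, 0.0), pocket_radius=2.0)
for score, name in hits:
    print(f"{name}: {score:.2f}")
```

The pocket position is the part that a predicted structure would supply when no experimental structure exists; the loop itself stays the same.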


In this year’s CASP, AlphaFold predicted the structure of dozens of proteins with a margin of error of just 1.6 angstroms—that’s 0.16 nanometers, or atom-sized. This far outstrips all other computational methods and for the first time matches the accuracy of techniques used in the lab, such as cryo-electron microscopy, nuclear magnetic resonance and x-ray crystallography. These techniques are expensive and slow: it can take hundreds of thousands of dollars and years of trial and error for each protein. AlphaFold can find a protein’s shape in a few days.

The breakthrough could help researchers design new drugs and understand diseases. In the longer term, predicting protein structure will also help design synthetic proteins, such as enzymes that digest waste or produce biofuels. Researchers are also exploring ways to introduce synthetic proteins that will increase crop yields and make plants more nutritious.

“It’s a very substantial advance,” says Mohammed AlQuraishi, a systems biologist at Columbia University who has developed his own software for predicting protein structure. “It's something I simply didn't expect to happen nearly this rapidly. It's shocking, in a way.”

“This really is a big deal,” says David Baker, head of the Institute for Protein Design at the University of Washington and leader of the team behind Rosetta, a family of protein analysis tools. “It’s an amazing achievement, like what they did with Go.”

Astronomical numbers

Identifying a protein’s structure is very hard. For most proteins, researchers have the sequence of amino acids in the ribbon but not the contorted shape they fold into. And there are typically an astronomical number of possible shapes for each sequence. Researchers have been wrestling with the problem at least since the 1970s, when Christian Anfinsen won the Nobel prize for showing that sequences determined structure.

The launch of CASP in 1994 gave the field a boost. Every two years, the organizers release 100 or so amino acid sequences for proteins whose shapes have been identified in the lab but not yet made public. Dozens of teams from around the world then compete to find the correct way to fold them up using software. Many of the tools developed for CASP are already used by medical researchers. But progress was slow, with two decades of incremental advances failing to produce a shortcut to painstaking lab work.

CASP got the jolt it was looking for when DeepMind entered the competition in 2018 with its first version of AlphaFold. It still could not match the accuracy of a lab but it left other computational techniques in the dust. Researchers took note: soon many were adapting their own systems to work more like AlphaFold.

This year more than half of the entries use some form of deep learning, says John Moult, the co-founder and chair of CASP. The accuracy overall was higher as a result. Baker’s new system, called trRosetta, uses some of DeepMind’s ideas from 2018. But it still came a “very distant second,” he says.

In CASP, results are scored using what’s known as a global distance test (GDT), which measures on a scale from 0 to 100 how close a predicted structure is to the actual shape of a protein identified in lab experiments. The latest version of AlphaFold scored well for all proteins in the challenge. But it got a GDT score above 90 for around two thirds of them. Its GDT for the hardest proteins was 25 points higher than the next best team, says John Jumper, who heads up the AlphaFold team at DeepMind. In 2018 the lead was around six points.

A score above 90 means that any differences between the predicted structure and the actual structure could be down to experimental errors in the lab rather than a fault in the software. It could also mean that the predicted structure is a valid alternative configuration to the one identified in the lab, within the range of natural variation.

According to Jumper, there were four proteins in the competition that independent judges had not finished working on in the lab and AlphaFold’s predictions pointed them towards the correct structures.

AlQuraishi thought it would take researchers 10 years to get from AlphaFold’s 2018 results to this year’s. This is close to the physical limit for how accurate you can get, he says. “These structures are fundamentally floppy. It doesn’t make sense to talk about resolutions much below that.”

Puzzle pieces

AlphaFold builds on the work of hundreds of researchers around the world. DeepMind also drew on a wide range of expertise, putting together a team of biologists, physicists and computer scientists. Details of how it works will be released this week at the CASP conference and in a peer-reviewed article in a special issue of the journal Proteins next year. But we do know that it uses a form of attention network, a deep-learning technique that allows an AI to train by focusing on parts of a larger problem. Jumper compares the approach to assembling a jigsaw: it pieces together local chunks first before fitting these into a whole.

DeepMind trained AlphaFold on around 170,000 proteins taken from the protein data bank, a public repository of sequences and structures. It compared multiple sequences in the data bank and looked for pairs of amino acids that often end up close together in folded structures. It then uses this data to guess the distance between pairs of amino acids in structures that are not yet known. It is also able to assess how accurate these guesses are. Training took “a few weeks,” using computing power equivalent to between 100 and 200 GPUs.
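One classic way to spot pairs of positions that change together across related sequences is to measure the mutual information between columns of a sequence alignment. The sketch below is not AlphaFold's method, only a simple, well-known stand-in for the idea of comparing multiple sequences. The toy alignment and the function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    total = 0.0
    for (a, b), joint in count_ab.items():
        p_ab = joint / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total

def top_covarying_pairs(alignment, k=1):
    """Return the k column pairs with the highest mutual information."""
    length = len(alignment[0])
    columns = [[seq[i] for seq in alignment] for i in range(length)]
    scores = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scores, reverse=True)[:k]

# Made-up alignment of four related sequences (one letter per amino acid).
# Columns 1 and 3 change together (K with E, R with D); column 0 varies independently.
ALIGNMENT = ["AKLE", "GKLE", "ARLD", "GRLD"]

score, i, j = top_covarying_pairs(ALIGNMENT)[0]
print(f"columns {i} and {j} co-vary most strongly ({score:.2f} bits)")
```

Positions that co-vary like this are candidates for being close together in the folded structure, which is the kind of signal a learned model can exploit at much larger scale.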


In our study published in Nature, we demonstrate how artificial intelligence research can drive and accelerate new scientific discoveries. We’ve built a dedicated, interdisciplinary team in hopes of using AI to push basic research forward: bringing together experts from the fields of structural biology, physics, and machine learning to apply cutting-edge techniques to predict the 3D structure of a protein based solely on its genetic sequence.

Our system, AlphaFold – described in peer-reviewed papers now published in Nature and PROTEINS – is the culmination of several years of work, and builds on decades of prior research using large genomic datasets to predict protein structure. The 3D models of proteins that AlphaFold generates are far more accurate than any that have come before, marking significant progress on one of the core challenges in biology. The AlphaFold code used at CASP13 is available on GitHub for anyone interested in learning more or replicating our results. We’re also excited by the fact that this work has already inspired other, independent implementations, including a model described in a separate paper and a community-built, open source implementation.

What is the protein folding problem?

Proteins are large, complex molecules essential to all of life. Nearly every function that our body performs, such as contracting muscles, sensing light, or turning food into energy, relies on proteins, and how they move and change. What any given protein can do depends on its unique 3D structure. For example, antibody proteins utilised by our immune systems are ‘Y-shaped’, and form unique hooks. By latching on to viruses and bacteria, these antibody proteins are able to detect and tag disease-causing microorganisms for elimination. Collagen proteins are shaped like cords, which transmit tension between cartilage, ligaments, bones, and skin. Other types of proteins include Cas9, which, using CRISPR sequences as a guide, acts like scissors to cut and paste sections of DNA; antifreeze proteins, whose 3D structure allows them to bind to ice crystals and prevent organisms from freezing; and ribosomes, which act like a programmed assembly line, helping to build proteins themselves.

The recipes for those proteins - called genes - are encoded in our DNA. An error in the genetic recipe may result in a malformed protein, which could result in disease or death for an organism. Many diseases, therefore, are fundamentally linked to proteins. But just because you know the genetic recipe for a protein doesn’t mean you automatically know its shape. Proteins are comprised of chains of amino acids (also referred to as amino acid residues). But DNA only contains information about the sequence of amino acids - not how they fold into shape. The bigger the protein, the more difficult it is to model, because there are more interactions between amino acids to take into account. As demonstrated by Levinthal’s paradox, it would take longer than the age of the known universe to randomly enumerate all possible configurations of a typical protein before reaching the true 3D structure - yet proteins themselves fold spontaneously, within milliseconds. Predicting how these chains will fold into the intricate 3D structure of a protein is what’s known as the “protein folding problem” - a challenge that scientists have worked on for decades. This unsolved problem has already inspired countless developments, from spurring IBM’s efforts in supercomputing (BlueGene), to novel citizen science efforts (Folding@home and FoldIt) to new engineering realms, such as rational protein design.

Why is protein folding important?

I think that we shall be able to get a more thorough understanding of the nature of disease in general by investigating the molecules that make up the human body, including the abnormal molecules, and that this understanding will permit the problem of disease to be attacked in a more straightforward manner such that new methods of therapy will be developed.

Scientists have long been interested in determining the structures of proteins because a protein’s form is thought to dictate its function. Once a protein’s shape is understood, its role within the cell can be guessed at, and scientists can develop drugs that work with the protein’s unique shape.

Over the past five decades, researchers have been able to determine shapes of proteins in labs using experimental techniques like cryo-electron microscopy, nuclear magnetic resonance and X-ray crystallography, but each method depends on a lot of trial and error, which can take years of work, and cost tens or hundreds of thousands of dollars per protein structure. This is why biologists are turning to AI methods as an alternative to this long and laborious process for difficult proteins. The ability to predict a protein’s shape computationally from its genetic code alone – rather than determining it through costly experimentation – could help accelerate research.

How can AI make a difference?

Fortunately, the field of genomics is quite rich in data thanks to the rapid reduction in the cost of genetic sequencing. As a result, deep learning approaches to the prediction problem that rely on genomic data have become increasingly popular in the last few years. To catalyse research and measure progress on the newest methods for improving the accuracy of predictions, a biennial global competition called CASP (Critical Assessment of protein Structure Prediction) was established in 1994, and has become the gold standard for assessing predictive techniques. We’re indebted to decades of prior work by the CASP organisers, as well as to the thousands of experimentalists whose structures enable this kind of assessment.

DeepMind’s work on this problem resulted in AlphaFold, which we submitted to CASP13. We’re proud to be part of what the CASP organisers have called “unprecedented progress in the ability of computational methods to predict protein structure,” placing first in rankings among the teams that entered (our entry is A7D).

Our team focused specifically on the problem of modelling target shapes from scratch, without using previously solved proteins as templates. We achieved a high degree of accuracy when predicting the physical properties of a protein structure, and then used two distinct methods to construct predictions of full protein structures.

Using neural networks to predict physical properties

Both of these methods relied on deep neural networks that are trained to predict properties of the protein from its genetic sequence. The properties our networks predict are: (a) the distances between pairs of amino acids and (b) the angles between chemical bonds that connect those amino acids. The first development is an advance on commonly used techniques that estimate whether pairs of amino acids are near each other.

We trained a neural network to predict a distribution of distances between every pair of residues in a protein (visualised in Figure 2). These probabilities were then combined into a score that estimates how accurate a proposed protein structure is. We also trained a separate neural network that uses all distances in aggregate to estimate how close the proposed structure is to the right answer.
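As a rough illustration of how per-pair distance probabilities could be combined into a single score for a proposed structure, the Python sketch below sums the negative log-probability that each predicted distribution assigns to the distance actually found in the candidate. The bin boundaries, the toy probabilities and the choice of a plain negative log-likelihood are all assumptions made for this example; the text above does not specify how the combination was done.

```python
import math

# Hypothetical distance bins in angstroms: <4, 4-8, 8-12, >=12.
BIN_EDGES = [4.0, 8.0, 12.0]

def bin_index(distance):
    """Return the index of the distance bin that contains `distance`."""
    for index, edge in enumerate(BIN_EDGES):
        if distance < edge:
            return index
    return len(BIN_EDGES)

def negative_log_likelihood(coords, predicted):
    """Score a candidate structure: lower means it agrees better with the predictions.

    `predicted` maps a residue pair (i, j) to a list of bin probabilities.
    """
    total = 0.0
    for (i, j), probabilities in predicted.items():
        distance = math.dist(coords[i], coords[j])
        # Clamp to avoid log(0) if a bin was given zero probability.
        probability = max(probabilities[bin_index(distance)], 1e-9)
        total -= math.log(probability)
    return total

# Made-up predictions for a three-residue toy chain.
PREDICTED = {
    (0, 1): [0.10, 0.70, 0.10, 0.10],
    (0, 2): [0.05, 0.15, 0.70, 0.10],
    (1, 2): [0.10, 0.70, 0.10, 0.10],
}

extended = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
collapsed = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

print("extended :", round(negative_log_likelihood(extended, PREDICTED), 2))
print("collapsed:", round(negative_log_likelihood(collapsed, PREDICTED), 2))
```

In this toy case the extended candidate gets the lower (better) score because its distances fall in the bins the predictions favour.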

Figure 2: Two ways of visualising the accuracy of AlphaFold’s predictions. The top figure features the distance matrices for three proteins. The brightness of each pixel represents the distance between the amino acids in the sequence comprising the protein–the brighter the pixel, the closer the pair. Shown in the top row are the real, experimentally determined distances and, in the bottom row, the average of AlphaFold’s predicted distance distributions. Importantly, these match well on both global and local scales. The bottom panels represent the same comparison using 3D models, featuring AlphaFold’s predictions (blue) versus ground-truth data (green) for the same three proteins.

Using these scoring functions, we were able to search the protein landscape to find structures that matched our predictions. Our first method built on techniques commonly used in structural biology, and repeatedly replaced pieces of a protein structure with new protein fragments. We trained a generative neural network to invent new fragments, which were used to continually improve the score of the proposed protein structure.

The second method optimised scores through gradient descent - a mathematical technique commonly used in machine learning for making small, incremental improvements - which resulted in highly accurate structures. This technique was applied to entire protein chains rather than to pieces that must be folded separately before being assembled into a larger structure, to simplify the prediction process.
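The flavour of this second approach can be sketched with a toy problem: start from random 3D coordinates and repeatedly nudge them downhill so that the distances between points approach a set of target distances. The loss function, step size, number of steps and the four-point target below are assumptions chosen for illustration, not the settings used in AlphaFold.

```python
import math
import random
from itertools import combinations

def loss_and_gradient(coords, targets):
    """Squared error between current and target pair distances, with its gradient."""
    gradient = [[0.0, 0.0, 0.0] for _ in coords]
    loss = 0.0
    for (i, j), target in targets.items():
        diff = [a - b for a, b in zip(coords[i], coords[j])]
        distance = math.sqrt(sum(c * c for c in diff)) or 1e-9  # avoid division by zero
        error = distance - target
        loss += error * error
        for axis in range(3):
            g = 2.0 * error * diff[axis] / distance
            gradient[i][axis] += g
            gradient[j][axis] -= g
    return loss, gradient

def fit_coordinates(targets, n_points, steps=500, learning_rate=0.05, seed=0):
    """Plain gradient descent on the coordinates; returns final coords and loss history."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_points)]
    history = []
    for _ in range(steps):
        loss, gradient = loss_and_gradient(coords, targets)
        history.append(loss)
        coords = [
            [c - learning_rate * g for c, g in zip(point, grad)]
            for point, grad in zip(coords, gradient)
        ]
    return coords, history

# Target distances taken from a made-up four-point reference shape (angstroms).
REFERENCE = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
TARGETS = {
    (i, j): math.dist(REFERENCE[i], REFERENCE[j])
    for i, j in combinations(range(len(REFERENCE)), 2)
}

coords, history = fit_coordinates(TARGETS, n_points=len(REFERENCE))
print(f"loss at start: {history[0]:.3f}, loss at end: {history[-1]:.6f}")
```

Because only distances are constrained, the fitted points can come out rotated or mirrored relative to the reference; the loss history is the thing to watch.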

The AlphaFold version used at CASP13 is available on Github for anyone interested in learning more, or replicating our protein folding results.

What happens next?

While we’re thrilled by the success of our protein folding model, there’s still much to be done in the realm of protein biology, and we’re excited to continue our efforts in this field. We’re committed to establishing ways that AI can contribute to basic scientific discovery, with the hope of making real-world impact. This approach might serve to ultimately improve our understanding of the body and how it works, enabling scientists to target and design new, effective cures for diseases more efficiently. Scientists have only mapped structures for about half of all the proteins made by human cells. Some rare diseases involve mutations in a single gene, resulting in a malformed protein which can have profound effects on the health of an entire organism. A tool like AlphaFold might help rare disease researchers predict the shape of a protein of interest rapidly and economically. As scientists acquire more knowledge about the shapes of proteins and how they operate through simulations and models, this method may eventually help us contribute to efficient drug discovery, while also reducing the costs associated with experimentation. Our hope is that AI will be useful for disease research, and ultimately improve the quality of life for millions of patients around the world.

But potential benefits aren’t restricted to health alone - understanding protein folding will assist in protein design, which could unlock a tremendous number of benefits. For example, advances in biodegradable enzymes - which can be enabled by protein design - could help manage pollutants like plastic and oil, helping us break down waste in ways that are more friendly to our environment. In fact, researchers have already begun engineering bacteria to secrete proteins that will make waste biodegradable, and easier to process.

The success of our first foray into protein folding is indicative of how machine learning systems can integrate diverse sources of information to help scientists come up with creative solutions to complex problems at speed. Just as we’ve seen how AI can help people master complex games through systems like AlphaGo and AlphaZero, we similarly hope that one day, AI breakthroughs will help serve as a platform to advance our understanding of fundamental scientific problems, too.

It’s exciting to see these early signs of progress in protein folding, demonstrating the utility of AI for scientific discovery. Even though there’s a lot more work to do before we’re able to have a quantifiable impact on treating diseases, managing waste, and more, we know the potential is enormous. With a dedicated team focused on delving into how machine learning can advance the world of science, we’re looking forward to seeing the many ways our technology can make a difference.


This work was done in collaboration with Andrew Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Sandy Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David Jones, David Silver, Koray Kavukcuoglu and Demis Hassabis


Proteins are essential to life, supporting practically all its functions. They are large complex molecules, made up of chains of amino acids, and what a protein does largely depends on its unique 3D structure. Figuring out what shapes proteins fold into is known as the “protein folding problem”, and has stood as a grand challenge in biology for the past 50 years. In a major scientific advance, the latest version of our AI system AlphaFold has been recognised as a solution to this grand challenge by the organisers of the biennial Critical Assessment of protein Structure Prediction (CASP). This breakthrough demonstrates the impact AI can have on scientific discovery and its potential to dramatically accelerate progress in some of the most fundamental fields that explain and shape our world.

A protein’s shape is closely linked with its function, and the ability to predict this structure unlocks a greater understanding of what it does and how it works. Many of the world’s greatest challenges, like developing treatments for diseases or finding enzymes that break down industrial waste, are fundamentally tied to proteins and the role they play.

We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment.

Professor John Moult

Co-Founder and Chair of CASP, University of Maryland

This has been a focus of intensive scientific research for many years, using a variety of experimental techniques to examine and determine protein structures, such as nuclear magnetic resonance and X-ray crystallography. These techniques, as well as newer methods like cryo-electron microscopy, depend on extensive trial and error, which can take years of painstaking and laborious work per structure, and require the use of multi-million dollar specialised equipment.

The ‘protein folding problem’

In his acceptance speech for the 1972 Nobel Prize in Chemistry, Christian Anfinsen famously postulated that, in theory, a protein’s amino acid sequence should fully determine its structure. This hypothesis sparked a five decade quest to be able to computationally predict a protein’s 3D structure based solely on its 1D amino acid sequence as a complementary alternative to these expensive and time consuming experimental methods. A major challenge, however, is that the number of ways a protein could theoretically fold before settling into its final 3D structure is astronomical. In 1969 Cyrus Levinthal noted that it would take longer than the age of the known universe to enumerate all possible configurations of a typical protein by brute force calculation – Levinthal estimated 10^300 possible conformations for a typical protein. Yet in nature, proteins fold spontaneously, some within milliseconds – a dichotomy sometimes referred to as Levinthal’s paradox.
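The scale of Levinthal's estimate is easier to appreciate with a quick back-of-the-envelope calculation. The Python sketch below works in powers of ten. The figure of 10^300 conformations comes from the paragraph above; the sampling rate of 10^13 conformations per second and the age of the universe (about 13.8 billion years) are assumptions added for this illustration.

```python
import math

CONFORMATIONS_LOG10 = 300        # Levinthal's estimate quoted above
SAMPLES_PER_SECOND_LOG10 = 13    # assumption: 10^13 conformations tried per second
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumption: roughly 13.8 billion years
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def log10_universe_ages_needed(conformations_log10, rate_log10):
    """How many ages of the universe a brute-force search would take, as a power of ten."""
    search_seconds_log10 = conformations_log10 - rate_log10
    universe_seconds_log10 = math.log10(AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR)
    return search_seconds_log10 - universe_seconds_log10

exponent = log10_universe_ages_needed(CONFORMATIONS_LOG10, SAMPLES_PER_SECOND_LOG10)
print(f"Brute force would need about 10^{exponent:.0f} times the age of the universe")
```

Even if the assumed sampling rate were wrong by many orders of magnitude, the conclusion would not change, which is why exhaustive search is not a realistic route to a structure.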


Results from the CASP14 assessment

In 1994, Professor John Moult and Professor Krzysztof Fidelis founded CASP as a biennial blind assessment to catalyse research, monitor progress, and establish the state of the art in protein structure prediction. It is both the gold standard for assessing predictive techniques and a unique global community built on shared endeavour. Crucially, CASP chooses protein structures that have only very recently been experimentally determined (some were still awaiting determination at the time of the assessment) to be targets for teams to test their structure prediction methods against; they are not published in advance. Participants must blindly predict the structure of the proteins, and these predictions are subsequently compared to the ground truth experimental data when they become available. We’re indebted to CASP’s organisers and the whole community, not least the experimentalists whose structures enable this kind of rigorous assessment.


The main metric used by CASP to measure the accuracy of predictions is the Global Distance Test (GDT) which ranges from 0-100. In simple terms, GDT can be approximately thought of as the percentage of amino acid residues (beads in the protein chain) within a threshold distance from the correct position. According to Professor Moult, a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods.
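The "percentage of residues within a threshold distance" idea can be sketched in a few lines of Python. This is a simplified version: it assumes the predicted and experimental structures have already been superimposed, and it averages over the commonly used cutoffs of 1, 2, 4 and 8 angstroms. The full CASP procedure also searches over superpositions, which is left out here. The coordinates are made up.

```python
import math

def simplified_gdt(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each cutoff distance of the reference.

    Assumes both structures are already superimposed and list one (x, y, z)
    position per residue, in the same order.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("structures must be non-empty and the same length")
    deviations = [math.dist(p, r) for p, r in zip(predicted, reference)]
    fractions = [
        sum(1 for d in deviations if d <= cutoff) / len(deviations)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)

# Made-up four-residue example: the prediction is off by 0.5, 1.5, 3.0 and 10.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

print(simplified_gdt(predicted, reference))  # 56.25
```

A perfect prediction scores 100, and a single badly placed residue only lowers the score a little, which is one reason this style of metric is more forgiving of local errors than RMSD.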

In the results from the 14th CASP assessment, released today, our latest AlphaFold system achieves a median score of 92.4 GDT overall across all targets. This means that our predictions have an average error (RMSD) of approximately 1.6 Angstroms (0.16 nanometers), which is comparable to the width of an atom (about 0.1 of a nanometer). Even for the very hardest protein targets, those in the most challenging free-modelling category, AlphaFold achieves a median score of 87.0 GDT (data available here).
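For readers unfamiliar with RMSD (root-mean-square deviation), the short Python sketch below shows how it is computed from two sets of matched atom positions, along with the angstrom-to-nanometer conversion. It assumes the two structures are already superimposed; the coordinates and function names are made up for illustration.

```python
import math

def rmsd(predicted, reference):
    """Root-mean-square deviation between matched positions (same units as the input)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("structures must be non-empty and the same length")
    squared = [math.dist(p, r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

def angstrom_to_nm(value):
    """1 angstrom is 0.1 nanometer."""
    return value / 10.0

# Made-up example: four atoms displaced by 1, 1, 2 and 2 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 2.0, 0.0), (11.4, 2.0, 0.0)]

error = rmsd(predicted, reference)
print(f"RMSD: {error:.2f} angstroms = {angstrom_to_nm(error):.3f} nm")
```

Because the deviations are squared before averaging, a few large errors dominate the result, so RMSD and GDT can rank the same predictions differently.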

Improvements in the median accuracy of predictions in the free modelling category for the best team in each CASP, measured as best-of-5 GDT.

Two examples of protein targets in the free modelling category. AlphaFold predicts highly accurate structures measured against experimental result.

These exciting results open up the potential for biologists to use computational structure prediction as a core tool in scientific research. Our methods may prove especially helpful for important classes of proteins, such as membrane proteins, that are very difficult to crystallise and therefore challenging to experimentally determine.

This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology. It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research.

Professor Venki Ramakrishnan

Nobel Laureate and President of the Royal Society

Our approach to the protein folding problem

We first entered CASP13 in 2018 with our initial version of AlphaFold, which achieved the highest accuracy among participants. Afterwards, we published a paper on our CASP13 methods in Nature with associated code, which has gone on to inspire other work and community-developed open source implementations. Now, new deep learning architectures we’ve developed have driven changes in our methods for CASP14, enabling us to achieve unparalleled levels of accuracy. These methods draw inspiration from the fields of biology, physics, and machine learning, as well as of course the work of many scientists in the protein folding field over the past half-century.

A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.
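The "spatial graph" picture can be made concrete with a small Python sketch: each residue is a node, and two residues are joined by an edge when they lie within some cutoff distance of each other. The 8 angstrom cutoff and the toy coordinates below are assumptions for illustration; the text above does not state what counts as "close proximity".

```python
import math
from itertools import combinations

def spatial_graph(coords, cutoff=8.0):
    """Build an adjacency map: residue index -> set of residue indices within `cutoff`."""
    graph = {index: set() for index in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Made-up coordinates (angstroms): three residues close together, one far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]

for residue, neighbours in spatial_graph(coords).items():
    print(residue, sorted(neighbours))
```

In a real protein, edges between residues that are far apart in the sequence but close in space are the informative ones, since they describe how the chain folds back on itself.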

By iterating this process, the system develops strong predictions of the underlying physical structure of the protein and is able to determine highly-accurate structures in a matter of days. Additionally, AlphaFold can predict which parts of each predicted protein structure are reliable using an internal confidence measure.

We trained this system on publicly available data consisting of around 170,000 protein structures from the protein data bank together with large databases containing protein sequences of unknown structure. It uses approximately 16 TPUv3s (which is 128 TPUv3 cores or roughly equivalent to 100-200 GPUs) run over a few weeks, a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today. As with our CASP13 AlphaFold system, we are preparing a paper on our system to submit to a peer-reviewed journal in due course.

An overview of the main neural network model architecture. The model operates over evolutionarily related protein sequences as well as amino acid residue pairs, iteratively passing information between both representations to generate a structure.

The potential for real-world impact

When DeepMind started a decade ago, we hoped that one day AI breakthroughs would help serve as a platform to advance our understanding of fundamental scientific problems. Now, after 4 years of effort building AlphaFold, we’re starting to see that vision realised, with implications for areas like drug design and environmental sustainability.

Professor Andrei Lupas, Director of the Max Planck Institute for Developmental Biology and a CASP assessor, let us know that, “AlphaFold’s astonishingly accurate models have allowed us to solve a protein structure we were stuck on for close to a decade, relaunching our effort to understand how signals are transmitted across cell membranes.”

We’re optimistic about the impact AlphaFold can have on biological research and the wider world, and excited to collaborate with others to learn more about its potential in the years ahead. Alongside working on a peer-reviewed paper, we’re exploring how best to provide broader access to the system in a scalable way.

In the meantime, we’re also looking into how protein structure predictions could contribute to our understanding of specific diseases with a small number of specialist groups, for example by helping to identify proteins that have malfunctioned and to reason about how they interact. These insights could enable more precise work on drug development, complementing existing experimental methods to find promising treatments faster.

AlphaFold is a once in a generation advance, predicting protein structures with incredible speed and precision. This leap forward demonstrates how computational methods are poised to transform research in biology and hold much promise for accelerating the drug discovery process.

Arthur D. Levinson, PhD, Founder & CEO Calico, Former Chairman & CEO, Genentech

We’ve also seen signs that protein structure prediction could be useful in future pandemic response efforts, as one of many tools developed by the scientific community. Earlier this year, we predicted several protein structures of the SARS-CoV-2 virus, including ORF3a, whose structures were previously unknown. At CASP14, we predicted the structure of another coronavirus protein, ORF8. Impressively quick work by experimentalists has now confirmed the structures of both ORF3a and ORF8. Despite their challenging nature and having very few related sequences, we achieved a high degree of accuracy on both of our predictions when compared to their experimentally determined structures.

As well as accelerating understanding of known diseases, we’re excited about the potential for these techniques to explore the hundreds of millions of proteins we don’t currently have models for – a vast terrain of unknown biology. Since DNA specifies the amino acid sequences that comprise protein structures, the genomics revolution has made it possible to read protein sequences from the natural world at massive scale – with 180 million protein sequences and counting in the Universal Protein database (UniProt). In contrast, given the experimental work needed to go from sequence to structure, only around 170,000 protein structures are in the Protein Data Bank (PDB). Among the undetermined proteins may be some with new and exciting functions and – just as a telescope helps us see deeper into the unknown universe – techniques like AlphaFold may help us find them.

Unlocking new possibilities

AlphaFold is one of our most significant advances to date but, as with all scientific research, there are still many questions to answer. Not every structure we predict will be perfect. There’s still much to learn, including how multiple proteins form complexes, how they interact with DNA, RNA, or small molecules, and how we can determine the precise location of all amino acid side chains. In collaboration with others, there’s also much to learn about how best to use these scientific discoveries in the development of new medicines, ways to manage the environment, and more.

For all of us working on computational and machine learning methods in science, systems like AlphaFold demonstrate the stunning potential for AI as a tool to aid fundamental discovery. Just as 50 years ago Anfinsen laid out a challenge far beyond science’s reach at the time, there are many aspects of our universe that remain unknown. The progress announced today gives us further confidence that AI will become one of humanity’s most useful tools in expanding the frontiers of scientific knowledge, and we’re looking forward to the many years of hard work and discovery ahead!

Until we’ve published a paper on this work, please cite:

High Accuracy Protein Structure Prediction Using Deep Learning

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Kathryn Tunyasuvunakool, Olaf Ronneberger, Russ Bates, Augustin Žídek, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Anna Potapenko, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Martin Steinegger, Michalina Pacholska, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis.

In Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book), 30 November - 4 December 2020.

Attention visualization

We visualized the attention layers of the BERT model for the ligand arm. We expected the attention layers to emphasize and highlight the fragments responsible for binding in both ligands and receptors. As we mentioned above, the attention within the BERT architecture is represented by 6 layers with 12 matrices of weights, each responsible for highlighting various regions of a molecule. We tried two ways of matrix-to-vector transformation in order to map the weights onto the string representation vector: the first way takes the first row of matrix weights only, the other averages over the columns.
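The two matrix-to-vector reductions described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names and the toy 3x3 matrix are hypothetical, and "averages over the columns" is assumed here to mean one mean value per column (that is, per token position).

```python
from typing import List

Matrix = List[List[float]]


def first_row(attn: Matrix) -> List[float]:
    """Reduction 1: keep only the first row of the attention matrix."""
    return list(attn[0])


def column_mean(attn: Matrix) -> List[float]:
    """Reduction 2 (assumed reading): average each column over all rows."""
    n_rows = len(attn)
    n_cols = len(attn[0])
    return [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]


# Toy attention matrix for a 3-token string (hypothetical values).
toy_attention = [
    [0.50, 0.25, 0.25],
    [0.00, 0.50, 0.50],
    [0.25, 0.25, 0.50],
]

# Either vector has one weight per token and can be overlaid on the string.
print(first_row(toy_attention))
print(column_mean(toy_attention))
```

Both reductions give one weight per position in the string representation, which is what makes it possible to color molecular fragments by attention.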

However, the attention focus within both methods was spread across molecules instead of particular sites. We illustrate it with one of the examples in the Figure below.

In the left plot we show an active site of human thrombin (receptor, shown as orange surface) and a bound ligand (blue sticks) as taken from the Protein Data Bank; the green ovals tag the groups that actually interact. The right plot shows a planar projection of the ligand: again green ovals mark the interacting groups, while the red and green spots indicate molecular fragments that attention assumes to be responsible for binding the ligand to the receptor. One can see that the attention is spread beyond the interacting groups, which might indicate that binding depends on the full topology of a structure rather than on particular sites.

"A very special moment"

Living cells are comprised of billions of different proteins, each of which has a complex 3D shape that defines what it does and how it works.

More than 200 million proteins have been discovered, and the number continues to rise. But we know the exact shape of only a few hundred thousand.

Each protein is a string built from 20 amino acids, arranged in different orders. Their interactions with each other make the protein fold, with scientist Cyrus Levinthal estimating in 1969 that there were some 10^300 possible conformations for a typical protein.

A major goal of computational biologists has therefore been to work out how to predict a protein's shape just from looking at a string of amino acids.

Using brute force computing is essentially impossible, given the astronomical number of configurations, so scientists have increasingly looked to artificial intelligence as a way to achieve this goal.

Enter AlphaFold. Trained on the sequences and structures of about 170,000 proteins mapped out by the RCSB Protein Data Bank and other protein databases, it can accurately predict the shape of proteins simply from their sequence of amino acids. The system was trained on 128 TPUv3 cores for a "few weeks."

This month, AlphaFold defeated around 100 teams in a protein-structure prediction challenge called Critical Assessment of Structure Prediction (CASP), set up 25 years ago to encourage research in the field.

"We have been stuck on this one problem – how do proteins fold up – for nearly 50 years," CASP co-founder and chair Professor John Moult said.

"To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we'd ever get there, is a very special moment."

CASP uses a metric called the Global Distance Test (GDT), which scores from 0 to 100 how closely the predicted positions in a protein structure match the correct, experimentally determined positions.
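To make the 0 to 100 scale concrete, here is a simplified sketch of GDT_TS, the commonly reported variant of the metric. It averages the percentage of residues whose predicted position lies within 1, 2, 4, and 8 ångströms of the experimental one. The cutoffs, the function name, and the toy distances are not taken from the text above. The full procedure also searches over many superpositions of the two structures; this sketch assumes the per-residue distances are already given for one fixed superposition.

```python
from typing import Sequence

# Distance cutoffs in angstroms used by the GDT_TS variant.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float],
           cutoffs: Sequence[float] = GDT_TS_CUTOFFS) -> float:
    """Simplified GDT_TS for one fixed superposition.

    distances: per-residue distance (angstroms) between the predicted
    and the experimental position of each residue.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff, then the mean over cutoffs.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical five-residue example: most residues are close, one is far off.
toy_distances = [0.5, 0.9, 1.5, 3.0, 9.0]
print(f"GDT_TS = {gdt_ts(toy_distances):.1f}")
```

A prediction in which every residue sits within 1 Å scores 100, while one with every residue more than 8 Å away scores 0.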

A score of 90 GDT has long been considered to be the benchmark to beat, as it is similar to what can be obtained from experimental lab methods (which can take months or years, require expensive equipment, and still fail). In the latest CASP assessment, AlphaFold achieved a median score of 92.4 GDT across all targets – an average error of about the width of an atom.

For the hardest protein targets, it had a median score of 87.0 GDT.

"This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology," said Nobel laureate and President of the Royal Society, Professor Venki Ramakrishnan.

"It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research."

The hope is that the breakthrough will allow scientists to understand how proteins work, making drug development easier, and potentially setting humanity on a path towards being able to develop enzymes that can eat plastic, or absorb carbon.

Significant work is still required, particularly on how proteins combine to form complexes, and how they interact with RNA, DNA, and small molecules.

"AlphaFold is a once in a generation advance, predicting protein structures with incredible speed and precision," said Arthur Levinson, former CEO of Genentech and current CEO of Alphabet's Calico. "This leap forward demonstrates how computational methods are poised to transform research in biology and hold much promise for accelerating the drug discovery process."

DeepMind plans to publish a paper detailing AlphaFold's achievement, but it was coy on whether it would release the algorithm itself.

"We're right at the beginning of exploring how best to enable other groups to use our structure predictions," the company said.

BC: Betweenness Centrality
H3Africa: Human Heredity and Health in Africa
HTS: High Throughput Sequencing
HUMA: Human Mutation Analysis web server
L: Shortest path
PRIMO: Protein Interactive Modeling web server
RIN: Residue Interaction Network
RMSD: Root Mean Square Deviation
RMSF: Root Mean Square Fluctuation
VAPOR: Variant Analysis Portal


How Do Proteins Fold? Levinthal’s Paradox and the Protein Folding Problem

Now we know that the form of a protein is tightly related to its function. Knowing a protein’s 3D structure is a huge hint for understanding how the protein works, and that information can be used for different purposes: to control or modify the protein’s function, predict which molecules bind to it, understand various biological interactions, assist drug discovery, or even design our own proteins.

Yet, one of the biggest challenges of biology has been to accurately predict the 3D native structure of the protein from only its 1D sequence of amino acid residues. Why is this a big problem?

The protein folding problem is stated in Levinthal’s paradox:

“Finding the native folded state of a protein by a random search among all possible configurations can take an enormously long time. Yet proteins can fold in seconds or less.”

From a general physicochemical point of view, how can proteins adopt their unique 3D native structure (a global free energy minimum) in a biologically reasonable time without exhaustively enumerating all possible conformations? The paradox arises under the assumption that proteins randomly search configurations until the native form is reached.

Levinthal believed that proteins must solve the problem by folding through predetermined pathways.
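A back-of-the-envelope calculation shows why a random search cannot be what happens. The sketch below uses the figure of some 10^300 conformations quoted earlier on this page; the sampling rate of 10^13 conformations per second is purely an illustrative assumption, as are the constant and function names.

```python
CONFORMATIONS = 10**300           # estimate attributed to Levinthal in the text
SAMPLES_PER_SECOND = 10**13       # ASSUMPTION: hypothetical sampling rate
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10   # roughly 13.8 billion years


def exhaustive_search_seconds(conformations: int, rate: int) -> int:
    """Seconds needed to visit every conformation once at a fixed rate."""
    return conformations // rate


def exhaustive_search_years(conformations: int, rate: int) -> float:
    """Same duration expressed in years."""
    return exhaustive_search_seconds(conformations, rate) / SECONDS_PER_YEAR


years = exhaustive_search_years(CONFORMATIONS, SAMPLES_PER_SECOND)
print(f"Exhaustive search: about {years:.2e} years")
print(f"Ages of the universe: about {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Even with a much faster assumed sampling rate the result stays absurdly large, whereas real proteins fold in seconds or less, so folding cannot proceed by exhaustive random search.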

Future promises

This is just a first step in biological discovery; a lot more work needs to be done before exact protein structures can be reliably predicted. Success would transform biology, giving us a quantifiable impact on treating diseases, managing the environment, and more.

This article is based on research by DeepMind’s scientists. The research paper published by DeepMind’s team is cited below.

De novo structure prediction with deep-learning based scoring
R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. F. G. Green, C. Qin, A. Zidek, A. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, D. T. Jones, D. Silver, K. Kavukcuoglu, D. Hassabis, A. W. Senior.
In Thirteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstracts), 1–4 December 2018.