Why are some protein sequences known but their 3D structure isn't?

Why are there some proteins that have a known amino acid sequence, but their 3D structure is not known? Wouldn't finding the former in a lab lead to the discovery of the latter? Please correct me if I have misunderstood something.

Protein sequencing is a nicely constrained problem: you have a one-dimensional sequence of amino acids drawn from a limited set of options (made a bit more complicated by post-translational modifications, but not much more so). Because it's one-dimensional, it's a problem you can solve by chopping up a protein into little bits, using mass differences between amino acids to identify their constituents, and determining the order from that distribution. If a DNA (or mRNA) sequence is known, it becomes even easier - you can skip the protein sequencing process and get the amino acid sequence directly from the nucleic acid sequence and the genetic code.
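
That last step, reading the amino acid sequence straight off a nucleic acid sequence, can be sketched in a few lines of Python. The codon table here is a small excerpt of the standard genetic code (the real table has 64 entries), so this is an illustration rather than a usable tool:

```python
# Toy illustration: translating a short DNA coding sequence into its
# amino acid sequence via the genetic code. The codon table below is
# a small excerpt of the standard 64-codon table.
CODON_TABLE = {
    "ATG": "M", "TTT": "F", "TTC": "F", "AAA": "K", "AAG": "K",
    "GGT": "G", "GGC": "G", "TGG": "W", "TAA": "*", "TAG": "*", "TGA": "*",
}

def translate(dna: str) -> str:
    """Read the DNA three bases at a time and map each codon to its
    one-letter amino acid code, stopping at a stop codon (*)."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X = codon not in this excerpt
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTAAATGGTAA"))  # M, F, K, W, then stop -> "MFKW"
```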

By comparison, protein folding is an absolute nightmare to solve for. Chemical bonds between amino acids are not rigid, they can bend and twist in all directions. The conformation of those bonds also depends not just on adjacent amino acids (as in a 1-D problem) but potentially on any other amino acid in the sequence (not to mention external influences… ).

In a large molecule like a protein there is a massive degrees-of-freedom problem. From Wikipedia, describing Levinthal's paradox:

In 1969, Cyrus Levinthal noted that, because of the very large number of degrees of freedom in an unfolded polypeptide chain, the molecule has an astronomical number of possible conformations. An estimate of 3^300 (or 10^143) was made in one of his papers[1] (often incorrectly cited as the 1968 paper[2]). For example, a polypeptide of 100 residues will have 99 peptide bonds, and therefore 198 different phi and psi bond angles. If each of these bond angles can be in one of three stable conformations, the protein may misfold into a maximum of 3^198 different conformations (including any possible folding redundancy). Therefore, if a protein were to attain its correctly folded configuration by sequentially sampling all the possible conformations, it would require a time longer than the age of the universe to arrive at its correct native conformation.
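
The arithmetic in the quote is easy to check. The sampling rate below (10^13 conformations per second) is an assumed, generous figure for the sake of the estimate, not a number from Levinthal:

```python
# Back-of-the-envelope check of Levinthal's numbers for a 100-residue
# polypeptide: 99 peptide bonds -> 198 phi/psi angles, 3 stable
# conformations each, sampled at an assumed 10^13 conformations/second.
residues = 100
angles = (residues - 1) * 2            # 198 phi and psi angles
conformations = 3 ** angles            # ~3e94 possible conformations
rate = 10 ** 13                        # assumed sampling rate (per second)
seconds_needed = conformations / rate
age_of_universe_s = 4.3e17             # ~13.8 billion years, in seconds

print(f"{conformations:.2e} conformations")
print(f"{seconds_needed / age_of_universe_s:.2e} universe ages to sample them all")
```

Even at that absurdly fast sampling rate, exhaustive search would take vastly longer than the age of the universe, which is exactly the paradox.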

Now, of course that's not the actual process that proteins use to fold (they don't iterate through all possible combinations, they settle through an energy landscape where only certain intermediate conformations are realized), and we can use that in computational models to solve protein structures more quickly than the age of the universe, but it's still quite a slow process. Projects like Folding@home have aimed to distribute the computational load among unused processing power in devices around the world, including idle gaming consoles and personal computers, but there are a great many protein structures to solve.

It's possible to get a general picture of protein shape using imaging techniques like X-ray crystallography or cryo-EM, and for some purposes these techniques give a lot of information, but these techniques are also by no means simple and can be prone to errors.

To answer why sequences are known before structures, it is worth outlining the typical workflow for a biochemical researcher. Briefly, sequence always comes before structure because you need the sequence to determine the structure. As with anything else you would like to investigate, you have to start with the information you already have. For protein research, the modern workflow usually runs as follows:


1. Isolate some bacteria or fungi from, e.g., the ocean or any other environment, and sequence their whole genome (DNA). This is very realistic to do, and no longer that expensive.

2. Once you have the genome sequence, there is a lot of bioinformatic work to be done to annotate it. This means identifying coding regions, e.g. for proteins. There are programs that are very good at this, because we already have a lot of information on what is encoded in living organisms (based on experimental data and years of research).

3. The DNA annotation programs can assign thousands of proteins in one genome. These protein sequences are then uploaded to the relevant databases for other people to view and work with. Note that these protein sequences are NOT experimentally validated. They are, however, assumed to be correct with some statistical confidence (which is usually justified nowadays, given the overwhelming amount of collected knowledge and sophisticated software).

4. Scientists (biochemists and others) can then work with these protein sequences to find out whether they actually are what the programs assume. This involves the bottleneck of actually being able to produce and purify the protein of interest (which may be very difficult).

5. After experimentally confirming that the protein actually has the function you are interested in, and being able to produce and purify it, one would typically want to determine its structure. This is because the three-dimensional structure of a protein can explain how and why it works the way it does. This is, however, difficult to do experimentally (as well described by the other answer).

Can you trust homology-models?

As a scientist working with protein structure and function, I would also note that (in my opinion) you cannot completely trust structures solved purely computationally (i.e. homology models). These homology models are simply based on actual structures that have been experimentally validated (e.g. crystallographically). Even though homology models are very useful when you do not have a better structure, you can never be completely sure that they are correct (they are, after all, models built on top of other structural models; i.e. models of models).

The active site of an enzyme is of great interest for understanding how it catalyzes its reaction. It is of vital importance to know how the catalytic residues are structurally arranged in the active site in order to understand, and even modify, the enzyme's catalytic behaviour. Even if a homology model is 98% correct, the 2% error could lie in the placement of catalytically important residues, and you cannot know for sure where the error is. One should therefore be very careful not to put too much weight on a homology model. In summary, if you don't have an experimentally validated structure (which is difficult to get) you can never be completely sure of what is going on (or at least, you would be working in the dark, looking for effects).

Experimentally validated structures:

I would also like to add that X-ray crystal structures are, as of today, the gold standard for protein structure information (although cryo-EM is catching up(!), and NMR structures give a lot of information about dynamics). You should check out the PDB database. If you have a high-resolution structure of, e.g., 1.1 Å, you are approaching atomic resolution and can even see the rings in the aromatic amino acid side chains (which is very cool!).

To answer your question in brief:

Sequence always comes before structure, as you cannot experimentally determine a structure without the sequence (which is also required to computationally model the structure). The protein sequences are simply assigned by sophisticated programs from the DNA sequences. After you have the sequence, you need to experimentally validate that the assigned protein sequence is correct. Only after all this work can you start to determine its three-dimensional structure… through a lot of hard work.

3D models reveal why some animals don't get coronavirus

This technology could help prevent future coronavirus outbreaks.

Early on in the pandemic, it became clear Covid-19 had made the leap from animals to humans.

The exact chain of transmission isn't known, but the science so far suggests bats played a starring role. After a tiger contracted the virus, scientists started to ask: What other animals can get Covid-19?

A new study published Thursday in PLOS Computational Biology offers molecular clues to which of the animals we come in closest contact with are most susceptible to coronavirus. And, perhaps, more importantly, the study shows which animals are least susceptible to infection.

Pangolins, which were blamed in the past for spreading Covid-19 to humans, score most highly on the susceptible list. Mice are the least susceptible. Cats fall somewhere in the middle, and, despite reports of dogs with Covid-19, the study's results were inconclusive when it comes to man's best friend.

The key, this study suggests, may lie in a single molecule carried by some animals and not others.

Protein Power — Covid-19 infection occurs when the spike protein of SARS-CoV-2 binds to specific receptors on cells, allowing the virus to enter animal (and human) cells and start replicating.

The receptor in question is known as the ACE2 receptor protein, and it lies on the surface of the cell. It is this protein that forms the basis of the new study. The researchers used a unique form of computer modeling to generate 3D protein models.

"Our hypothesis was that there must be similarities in the amino acid sequence of the ACE2 receptor of susceptible species and that's exactly what we found," João Rodrigues, lead author on the study and a postdoctoral research fellow in structural biology at Stanford University, tells Inverse.

The 3D models enabled the researchers to test how the virus' spike protein interacts with receptor proteins from the cells of 28 different animals as diverse as guinea pigs and ducks.

To see whether the proteins on the animal's cells interacted with the virus' spike protein, the researchers used a scientific measurement known as a HADDOCK score, named for the gruff Captain Archibald Haddock from the Tintin comics.

This figure from the study shows how each animal's HADDOCK score compares:

"The HADDOCK score is an indicator of how well two proteins fit together, sort of like a key in a lock," Rodrigues says.

Some proteins fit better than others — much like Cinderella's slipper, if the proteins don't gel together, then the virus can't enter the cell. As a result, the HADDOCK score can reveal any given animal's likelihood of becoming infected. Surprisingly given their reputation as plague-carriers, rats' scores suggest they are less likely to get coronavirus than humans, cats, or even cows.

"Good fits will have lower scores. In our study, non-susceptible species have higher scores than susceptible species," Rodrigues says.

The higher the HADDOCK score, the less susceptible the species is to the coronavirus. But to fully understand the implications of each animal's score, it has to be weighed in relation to other animals' — a mouse scores -93.2 in the model, for example, which may not seem great, but it's considerably higher than humans' score of -116.2.
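
The relative nature of the comparison is worth making concrete. Using only the two scores quoted in the article (more negative means a better predicted spike/ACE2 fit, hence greater susceptibility):

```python
# The two HADDOCK scores quoted in the article; lower (more negative)
# indicates a better predicted fit between spike and ACE2, i.e. a more
# susceptible species. Sorting ascending ranks humans as more
# susceptible than mice.
scores = {"human": -116.2, "mouse": -93.2}
ranked = sorted(scores, key=scores.get)  # most susceptible first

print(ranked)  # ['human', 'mouse']
```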

"This difference in score is because the mouse ACE2 has certain mutations compared to the human variant that we predict make it bind less well to the viral spike protein," Rodrigues says.

Most of the non-susceptible species in their model also have this same mutation inhibiting protein binding, Rodrigues explains. The mutation is the key to understanding why some animals are susceptible to the coronavirus, while others are not.

Future Coronaviruses — The researchers hope the protein modeling technology they use in this study could help prevent future novel coronavirus outbreaks in humans.

"Armed with this knowledge, we should be able to build models that predict — emphasis on predict — which species are susceptible to this and other coronaviruses and that could be potential animal reservoirs," Rodrigues says.

Essentially, if you understand which animals are capable of becoming infected with coronaviruses at all, you can potentially stop the chain of transmission to humans.

"Our protocol is readily applicable to other coronaviruses, as long as we know the structures of the viral spike protein and of the receptor to which it binds to," Rodrigues says.

Limitations — The researchers are upfront about two key limitations to their study. The research uses 3D protein models, and does not look at real-time Covid-19 cases.

"First, while our models do agree with experimental data, for the most part, there is always a degree of uncertainty due to the computational nature of our work," Rodrigues says.

"This means we can make educated guesses about how the virus spike binds to the hosts' ACE2 receptors and about which amino acids of the receptor play an important role in this process," Rodrigues says.

"It does not mean, however, that our results can be used to say, enact policies affecting animal health or that the general public should look at our results as a 'ruler' for risk for Covid for pets," he adds.

And while the binding of spike protein to the ACE2 receptor is important, it's "only one early step of the entire viral infection process," Rodrigues says.

"So, even if our models correctly predict strong binding of the spike protein to ACE2 there is a chance that other subsequent steps fail and therefore there is no productive infection," he says.

Therapy Time — Recent developments in artificial intelligence have helped tackle what's known as the "protein folding problem": predicting the 3D shape a protein will fold into from its amino acid sequence alone.

DeepMind's AI technology AlphaFold enables scientists to predict protein structure using its models. It opens a new pathway for biological research.

The timing is good, too, as Rodrigues' team hopes other scientists will use their findings to create therapeutic drugs to tackle the coronavirus' harmful effects on the human body.

"Since we made our protocols completely open-source, interested researchers can build on our results and refine them to their liking, for example, to test which variants of human ACE2 would bind the spike protein best," Rodrigues says.

The proposed therapy works through mutations that "enhance binding" of ACE2 to the spike protein, according to Rodrigues' model.

"One form of therapy being developed is to create artificial versions of human ACE2 that have these and other mutations and use them as 'traps' for the virus," Rodrigues says.

By tricking the virus into binding to the traps rather than our own ACE2 receptors, it would allow the body to "buy time for our immune system to mount a counter-attack," Rodrigues says.

No, DeepMind has not solved protein folding

This week DeepMind has announced that, using artificial intelligence (AI), it has solved the 50-year-old problem of ‘protein folding’. The announcement was made as the results were released from the 14th and latest competition on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14). The competition pits teams of computational scientists against one another to see whose method is the best at predicting the structures of protein molecules – and DeepMind’s solution, ‘AlphaFold 2’, emerged as the clear winner.

Don’t believe everything you read in the media

There followed much breathless reporting in the media that AI can now be used to accurately predict the structures of proteins – the molecular machinery of every living thing. Previously the laborious experimental work of solving protein structures was the domain of protein crystallographers, NMR spectroscopists and cryo-electron microscopists, who worked for months and sometimes years to work out each new structure.

Should the experimentalists now all quit the lab and leave the field to DeepMind?

No, they shouldn’t, for several reasons.

Firstly, there is no doubt that DeepMind have made a big step forward. Of all the teams competing against one another they are so far ahead of the pack that the other computational modellers may be thinking about giving up. But we are not yet at the point where we can say that protein folding is ‘solved’. For one thing, only two-thirds of DeepMind’s solutions were comparable to the experimentally determined structure of the protein. This is impressive but you have to bear in mind that they didn’t know exactly which two-thirds of their predictions were closest to correct until the comparison with experimental solutions was made.* Would you buy a satnav that was only 67% accurate?

So a dose of realism is required. It is also difficult to see right now, despite DeepMind’s impressive performance, that this will immediately transform biology.

Impressive predictions – but how do you know they’re correct?

Alphafold 2 will certainly help to advance biology. For example, as already reported, it can generate folded structure predictions that can then be used to solve experimental structures by crystallography (and probably other techniques). So this will help the science of structure determination go a bit faster in some cases.

However, despite some of the claims being made, we are not at the point where this AI tool can be used for drug discovery. For DeepMind’s structure predictions (111 in all), the average or root-mean-squared difference (RMSD) in atomic positions between the prediction and the actual structure is 1.6 Å (0.16 nm). That’s about the size of a bond-length.

That sounds pretty good but it’s not clear from DeepMind’s announcement how that number is calculated. It might be calculated only by comparing the positions of the alpha-Carbon atoms in the protein backbone – a reasonable way to estimate the accuracy of the overall fold of the protein. Or, it might be calculated over all the atomic positions, a much more rigorous test. If it is the latter, then an RMSD of 1.6 Å is an even more impressive result.

But it’s still not nearly good enough for delivering reliable insights into protein chemistry or drug design. To do that, we want to be confident of atomic positions to within a margin of around 0.3 Å. AlphaFold 2’s best prediction has an RMSD for all atoms of 0.9 Å. Many of the predictions contributing to their average of 1.6 Å will have deviations in atomic positions even greater than that. So, despite the claims, we’re not yet ready to use Alphafold 2 to create new drugs.

There are other reasons not to believe that the protein folding problem is ‘solved’. AI methods rely on learning the rules of protein folding from existing protein structures. This means that it may find it more difficult to predict the structures of proteins with folds that are not well represented in the database of solved structures.

Also, as reported in Nature, the method cannot yet reliably tackle predictions of proteins that are components of multi-protein complexes. These are among the most interesting biological entities in living things (e.g. ribosomes, ion channels, polymerases). So there is quite a large territory remaining where AlphaFold 2 cannot take us. The experimentalists, who have been successful in mapping out the structures of complexes of growing complexity, still have a lot of valuable work to do.

While all of the above is supposed to sound a note of caution to counter some of the more hyperbolic claims that have been heard in the media in recent days, I still want to emphasise my admiration for the achievements of the AlphaFold team. They have clearly made a very significant advance.

That advance will be much clearer once their peer-reviewed paper is published (we should not judge science by press releases), and once the tool is openly available to the academic community – or indeed anyone who wants to study protein structure.

Update (02 Dec, 18:43): This post was updated to provide a clearer explanation of the RMSD measures used to compare predicted and experimentally determined protein structures. I am very grateful to Prof Leonid Sazanov who pointed out some necessary corrections and additions on Twitter.

*Update (12 Dec, 15:35): Strictly this is true, but it misses the more important point that the score given to each structure prediction (GDT_TS) broadly correlates with the closeness of its match to the experimental structure. As a result, I have deleted my SatNav crack.

For a deeply informed and very measured assessment of what DeepMind has actually achieved in CASP14, please read this blogpost by Prof. Mohammed AlQuraishi, who knows this territory much better than I do. His post is pretty long but you can skip the technical bits explaining how AlphaFold 2 works. He gives a very good account of the nature of DeepMind’s advance. In AlQuraishi’s view, AlphaFold 2 does represent a solution to the protein structure prediction problem, though he is careful to define what he means by a solution. He also acknowledges that there are still some significant improvements to be made to the programme, but regards these as more of an engineering challenge than a scientific one. He agrees that AlphaFold 2 won’t be used any time soon for drug design work. AlQuraishi also gives an excellent overview of the implications of this work for protein folders, structural biologists and biotechnologists in general, and offers some very interesting thoughts on the differences between DeepMind’s approach to research and that of more traditional academic groups.

Why is AlphaFold2 considered “revolutionary”?

A research team from DeepMind joined CASP13 (the 13th competition) in 2018 with AlphaFold, a program based on a “deep neural network” (a network with many stacked layers; this one had about 21 million parameters), trained on a large amount of sequence and structural information for 29,000 known proteins. Although AlphaFold won CASP13, its GDT-TS for the most difficult targets was only 58.9, which was impressive but not that much better than the runner-up teams’ scores of around 52.

In 2020, DeepMind made the leap in CASP14 with AlphaFold2, which achieved a median GDT-TS of 92.4. This means the prediction had more than 92% of the amino acids in the protein in the correct conformation! This is the first time a computer model in the competition has reached a level of accuracy comparable to that of experimental techniques such as X-ray crystallography.

Examples of the comparison between experimental and computational structures. Left: a pretty good GDT score (0.64), where model and experimental results (purple vs. green) match fairly closely. Right: a bad GDT score (0.23), where the model and experimental (red vs. green) shapes are very different. AlphaFold2 had a GDT score higher than 0.9, and you can see images of its models on DeepMind’s blog. (Image modified from Hou et al. (2019), courtesy of open access.)

Even for the set of targets classified as the most difficult, the median score for AlphaFold2 was 87. Compared to the best scores, which hovered around 40 merely 4-5 years ago, this was indeed tremendous progress, and no wonder it led to much excitement in the press. While DeepMind has still not officially released the algorithmic details of AlphaFold2, even experts in the field believe that revolutionary steps were taken to make such an impressive improvement in prediction possible. Many people, myself included, eagerly await the official publication from DeepMind, since it is likely that their developments can find applications in other AI-based problems related to proteins.
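
To make the GDT-TS numbers above more concrete, here is a simplified sketch of how such a score is computed. Real CASP scoring superposes the structures and searches over residue subsets; this toy version just thresholds per-residue alpha-carbon deviations, which are invented values:

```python
# Simplified sketch of GDT_TS: the average of the percentages of
# residues whose C-alpha atoms lie within 1, 2, 4 and 8 Angstroms of
# their experimental positions (after superposition, assumed done).
def gdt_ts(deviations):
    """deviations: per-residue C-alpha distances in Angstroms."""
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    n = len(deviations)
    fractions = [sum(d <= c for d in deviations) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Invented example: 10 residues, mostly well modelled, two badly off.
print(gdt_ts([0.5, 0.8, 0.4, 1.5, 0.9, 0.6, 3.0, 0.7, 9.0, 12.0]))  # 72.5
```

A score in the 90s, as AlphaFold2 achieved, therefore means nearly every residue sits within tight distance cutoffs of the experimental structure.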

Reversing Denaturation

It is often possible to reverse denaturation because the primary structure of the polypeptide, the covalent bonds holding the amino acids in their correct sequence, is intact. Once the denaturing agent is removed, the original interactions between amino acids return the protein to its original conformation and it can resume its function. However, denaturation can be irreversible in extreme situations, like frying an egg. The heat from a pan denatures the albumin protein in the liquid egg white and it becomes insoluble. The protein in meat also denatures and becomes firm when cooked.

Figure 1: Denaturing a protein is occasionally irreversible: (Top) The protein albumin in raw and cooked egg white. (Bottom) A paperclip analogy visualizes the process: when cross-linked, paperclips (‘amino acids’) no longer move freely; their structure is rearranged and ‘denatured’.

Chaperone proteins (or chaperonins) are helper proteins that provide favorable conditions for protein folding to take place. The chaperonins clump around the forming protein and prevent other polypeptide chains from aggregating. Once the target protein folds, the chaperonins dissociate.

AlphaFold 2 Explained: A Semi-Deep Dive

At the end of last month, DeepMind, Google’s machine learning research branch known for building bots that beat world champions at Go and StarCraft II, hit a new benchmark: accurately predicting the structure of proteins. If their results are as good as the team claims, their model, AlphaFold, could be a major boon for both drug discovery and fundamental biological research. But how does this new neural-network-based model work? In this post, I’ll try to give you a brief but semi-deep dive behind both the machine learning and biology that power this model.

First, a quick biology primer: The functions of proteins in the body are entirely defined by their three-dimensional structures. For example, it’s the notorious “spike proteins” studding the coronavirus that allow the virus to enter our cells. Meanwhile, mRNA vaccines like Moderna’s and Pfizer’s replicate the shape of those spike proteins, causing the body to produce an immune response. But historically, determining protein structures (via experimental techniques like X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy) has been difficult, slow, and expensive. Plus, for some types of proteins, these techniques don’t work at all.

In theory, though, the entirety of a protein’s 3D shape should be determined by the string of amino acids that make it up. And we can determine a protein’s amino acid sequences easily, via DNA sequencing (remember from Bio 101 how your DNA codes for amino acid sequences?). But in practice, predicting protein structure from amino acid sequences has been a hair-pullingly difficult task we’ve been trying to solve for decades.

This is where AlphaFold comes in. It’s a neural-network-based algorithm that’s performed astonishingly well on the protein folding problem, so much so that it seems to rival in quality the traditional slow and expensive imaging methods.

Sadly for nerds like me, we can’t know exactly how AlphaFold works because the official paper has yet to be published and peer reviewed. Until then, all we have to go on is the company’s blog post. But since AlphaFold 2 is actually an iteration on a slightly older model (AlphaFold 1) published last year, we can make some pretty good guesses. In this post, I’ll focus on two core pieces: the underlying neural architecture of AlphaFold 2 and how it managed to make effective use of unlabeled data.

First, this new breakthrough is not so different from a similar AI breakthrough I wrote about a few months ago, GPT-3. GPT-3 was a large language model built by OpenAI that could write impressively human-like poems, sonnets, jokes, and even code samples. What made GPT-3 so powerful was that it was trained on a very, very large dataset, and based on a type of neural network called a “Transformer.”

Transformers, invented in 2017, really do seem to be the magic machine learning hammer that cracks open problems in every domain. In an intro machine learning class, you’ll often learn to use different model architectures for different data types: convolutional neural networks for analyzing images, recurrent neural networks for analyzing text. Transformers were originally invented to do machine translation, but they appear to be effective much more broadly, able to understand text, images, and, now, proteins. So one of the major differences between AlphaFold 1 and AlphaFold 2 is that the former used convolutional neural networks (CNNs) while the new version uses Transformers.

Now let’s talk about the data that was used to train AlphaFold. According to the blog post, the model was trained on a public dataset of 170,000 proteins with known structures, and a much larger database of protein sequences with unknown structures. The public dataset of known proteins serves as the model’s labeled training dataset, a ground truth. Size is relative, but based on my experience, 170,000 “labeled” examples is a pretty small training dataset for such a complex problem. That says to me the authors must have done a good job of taking advantage of that “unlabeled” dataset of proteins with unknown structures.

But what good is a dataset of protein sequences with mystery shapes? It turns out that figuring out how to learn from unlabeled data (“unsupervised learning”) has enabled lots of recent AI breakthroughs. GPT-3, for example, was trained on a huge corpus of unlabeled text data scraped from the web. Given a slice of a sentence, it had to predict which words came next, a task known as “next word prediction,” which forced it to learn something about the underlying structure of language. The technique has been adapted to images, too: slice an image in half, and ask a model to predict what the bottom of the image should look like just from the top.

The idea is that, if you don’t have enough data to train a model to do what you want, train it to do something similar on a task that you do have enough data for, a task that forces it to learn something about the underlying structure of language, or images, or proteins. Then you can fine-tune it for the task you really wanted it to do.

One extremely popular way to do this is via embeddings. Embeddings are a way of mapping data to vectors whose positions in space capture meaning. One famous example is Word2Vec: it’s a tool for taking a word (e.g. “hammer”) and mapping it to n-dimensional space so that similar words (“screwdriver,” “nail”) are mapped nearby. And, like GPT-3, it was trained on a dataset of unlabeled text.
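
The "nearby in space" idea is usually measured with cosine similarity. The 3-dimensional vectors below are invented for illustration; real embeddings are learned from data and have hundreds of dimensions:

```python
import math

# Toy illustration of the embedding idea: similar concepts get vectors
# that point in similar directions. These 3-D vectors are invented;
# real Word2Vec embeddings are learned and much higher-dimensional.
def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

hammer      = (0.9, 0.1, 0.2)
screwdriver = (0.8, 0.2, 0.3)
poem        = (0.1, 0.9, 0.7)

# The two tools are far more similar to each other than to "poem":
print(cosine(hammer, screwdriver) > cosine(hammer, poem))  # True
```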

So what’s the equivalent of Word2Vec for molecular biology? How do we squeeze knowledge from amino acid chains with unknown, unlabeled structures? One technique is to look at clusters of proteins with similar amino acid sequences. Often, one protein sequence might be similar to another because the two share a similar evolutionary origin. The more similar those amino acid sequences, the more likely those proteins serve a similar purpose for the organisms they’re made in, which means, in turn, they’re more likely to share a similar structure.

So the first step is to determine how similar two amino acid sequences are. To do that, biologists typically compute something called an MSA or Multiple Sequence Alignment. One amino acid sequence may be very similar to another, but it may have some extra or “inserted” amino acids that make it longer than the other. MSA is a way of adding gaps to make the sequences line up as closely as possible.

According to the diagram in DeepMind’s blog post, MSA appears to be an important early step in the model.
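
The gap-insertion idea behind alignment can be shown with a toy pairwise example. This is a basic Needleman-Wunsch global alignment of two sequences; real MSA tools align many sequences at once with far more sophisticated scoring, so treat this as a sketch of the principle only:

```python
# Toy global alignment (Needleman-Wunsch): gaps ('-') are inserted so
# that two related sequences line up as closely as possible. The
# scoring values (match/mismatch/gap) are simple illustrative choices.
def align(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback to recover the aligned strings.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

# The second (made-up) sequence has an extra 'inserted' residue, W:
print(align("MKTAYIA", "MKTWAYIA"))  # ('MKT-AYIA', 'MKTWAYIA')
```

The gap in the first sequence is exactly the kind of adjustment the article describes: it absorbs the insertion so the rest of the residues line up.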


The method of homology modeling is based on the observation that protein tertiary structure is better conserved than amino acid sequence. [3] Thus, even proteins that have diverged appreciably in sequence but still share detectable similarity will also share common structural properties, particularly the overall fold. Because it is difficult and time-consuming to obtain experimental structures from methods such as X-ray crystallography and protein NMR for every protein of interest, homology modeling can provide useful structural models for generating hypotheses about a protein's function and directing further experimental work.

There are exceptions to the general rule that proteins sharing significant sequence identity will share a fold. For example, a judiciously chosen set of mutations of less than 50% of a protein can cause the protein to adopt a completely different fold. [7] [8] However, such a massive structural rearrangement is unlikely to occur in evolution, especially since the protein is usually under the constraint that it must fold properly and carry out its function in the cell. Consequently, the roughly folded structure of a protein (its "topology") is conserved longer than its amino-acid sequence and much longer than the corresponding DNA sequence; in other words, two proteins may share a similar fold even if their evolutionary relationship is so distant that it cannot be discerned reliably. For comparison, the function of a protein is conserved much less than the protein sequence, since relatively few changes in amino-acid sequence are required to take on a related function.

The homology modeling procedure can be broken down into four sequential steps: template selection, target-template alignment, model construction, and model assessment. [3] The first two steps are often essentially performed together, as the most common methods of identifying templates rely on the production of sequence alignments; however, these alignments may not be of sufficient quality because database search techniques prioritize speed over alignment quality. These processes can be performed iteratively to improve the quality of the final model, although quality assessments that are not dependent on the true target structure are still under development.

Optimizing the speed and accuracy of these steps for use in large-scale automated structure prediction is a key component of structural genomics initiatives, partly because the resulting volume of data will be too large to process manually and partly because the goal of structural genomics requires providing models of reasonable quality to researchers who are not themselves structure prediction experts. [3]

The critical first step in homology modeling is the identification of the best template structure, if indeed any are available. The simplest method of template identification relies on serial pairwise sequence alignments aided by database search techniques such as FASTA and BLAST. More sensitive methods based on multiple sequence alignment – of which PSI-BLAST is the most common example – iteratively update their position-specific scoring matrix to successively identify more distantly related homologs. This family of methods has been shown to produce a larger number of potential templates and to identify better templates for sequences that have only distant relationships to any solved structure. Protein threading, [9] also known as fold recognition or 3D-1D alignment, can also be used as a search technique for identifying templates to be used in traditional homology modeling methods. [3] Recent CASP experiments indicate that some protein threading methods such as RaptorX indeed are more sensitive than purely sequence(profile)-based methods when only distantly-related templates are available for the proteins under prediction. When performing a BLAST search, a reliable first approach is to identify hits with a sufficiently low E-value, which are considered sufficiently close in evolution to make a reliable homology model. Other factors may tip the balance in marginal cases; for example, the template may have a function similar to that of the query sequence, or it may belong to a homologous operon. However, a template with a poor E-value should generally not be chosen, even if it is the only one available, since it may well have a wrong structure, leading to the production of a misguided model. A better approach is to submit the primary sequence to fold-recognition servers [9] or, better still, consensus meta-servers which improve upon individual fold-recognition servers by identifying similarities (consensus) among independent predictions.
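The E-value filtering step described above might look like the following sketch. The hit records (PDB ids, E-values, identities) and the cutoff are invented for illustration, not taken from any real BLAST output.

```python
# Hypothetical parsed BLAST hits: (template PDB id, E-value, % identity).
# All names and values here are made up for illustration.
hits = [
    ("1abc", 1e-40, 62.0),
    ("2def", 1e-12, 35.0),
    ("3ghi", 0.5,   22.0),   # poor E-value: a risky template choice
]

E_VALUE_CUTOFF = 1e-5  # an illustrative conservative threshold (assumption)

def reliable_templates(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only hits considered close enough in evolution for a
    trustworthy model, ranked best (lowest E-value) first."""
    kept = [h for h in hits if h[1] <= cutoff]
    return sorted(kept, key=lambda h: h[1])

print(reliable_templates(hits))  # the E-value 0.5 hit is dropped
```

In marginal cases, the surviving candidates would then be re-ranked by the secondary factors the text mentions (shared function, homologous operon membership).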

Often several candidate template structures are identified by these approaches. Although some methods can generate hybrid models with better accuracy from multiple templates, [9] [10] most methods rely on a single template. Therefore, choosing the best template from among the candidates is a key step, and can affect the final accuracy of the structure significantly. This choice is guided by several factors, such as the similarity of the query and template sequences, of their functions, and of the predicted query and observed template secondary structures. Perhaps most important are the coverage of the aligned regions (the fraction of the query sequence structure that can be predicted from the template) and the plausibility of the resulting model. Thus, sometimes several homology models are produced for a single query sequence, with the most likely candidate chosen only in the final step.

It is possible to use the sequence alignment generated by the database search technique as the basis for the subsequent model production; however, more sophisticated approaches have also been explored. One proposal generates an ensemble of stochastically defined pairwise alignments between the target sequence and a single identified template as a means of exploring "alignment space" in regions of sequence with low local similarity. [11] "Profile-profile" alignments first generate a sequence profile of the target and systematically compare it to the sequence profiles of solved structures; the coarse-graining inherent in the profile construction is thought to reduce noise introduced by sequence drift in nonessential regions of the sequence. [12]
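A sequence profile of the kind these methods compare can be sketched as per-column residue frequencies over a (gapped) alignment. The toy three-sequence alignment below is invented for illustration; real profiles also include pseudocounts and background-frequency weighting.

```python
from collections import Counter

def column_profile(alignment):
    """Per-column residue frequencies of a gapped multiple alignment --
    the kind of coarse-grained summary that profile-profile methods
    compare instead of raw sequences."""
    length = len(alignment[0])
    profile = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != '-']
        counts = Counter(residues)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile

# Toy alignment of three hypothetical homologs.
aln = ["ACD-E",
       "ACD-E",
       "GCDKE"]
for i, col in enumerate(column_profile(aln)):
    print(i, col)
```

Strongly conserved columns (like the C and D here) dominate the comparison, while drift in variable columns is averaged out.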

Given a template and an alignment, the information contained therein must be used to generate a three-dimensional structural model of the target, represented as a set of Cartesian coordinates for each atom in the protein. Three major classes of model generation methods have been proposed. [13] [14]

Fragment assembly

The original method of homology modeling relied on the assembly of a complete model from conserved structural fragments identified in closely related solved structures. For example, a modeling study of serine proteases in mammals identified a sharp distinction between "core" structural regions conserved in all experimental structures in the class, and variable regions typically located in the loops where the majority of the sequence differences were localized. Thus unsolved proteins could be modeled by first constructing the conserved core and then substituting variable regions from other proteins in the set of solved structures. [15] Current implementations of this method differ mainly in the way they deal with regions that are not conserved or that lack a template. [16] The variable regions are often constructed with the help of fragment libraries.

Segment matching

The segment-matching method divides the target into a series of short segments, each of which is matched to its own template fitted from the Protein Data Bank. Thus, sequence alignment is done over segments rather than over the entire protein. Selection of the template for each segment is based on sequence similarity, comparisons of alpha carbon coordinates, and predicted steric conflicts arising from the van der Waals radii of the divergent atoms between target and template. [17]

Satisfaction of spatial restraints

The most common current homology modeling method takes its inspiration from calculations required to construct a three-dimensional structure from data generated by NMR spectroscopy. One or more target-template alignments are used to construct a set of geometrical criteria that are then converted to probability density functions for each restraint. Restraints applied to the main protein internal coordinates – protein backbone distances and dihedral angles – serve as the basis for a global optimization procedure that originally used conjugate gradient energy minimization to iteratively refine the positions of all heavy atoms in the protein. [18]
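A toy version of restraint satisfaction can be sketched as gradient descent on a harmonic (squared-error) penalty over target distances. Real programs such as MODELLER optimize probability density functions over many restraint types and all heavy atoms; the three "atoms", the target distances, and the learning rate below are invented for illustration.

```python
import math, random

def refine(coords, restraints, steps=2000, lr=0.01):
    """Nudge atom positions by gradient descent so pairwise distances
    approach template-derived targets. coords: list of [x, y, z];
    restraints: list of (i, j, target_distance)."""
    for _ in range(steps):
        for i, j, target in restraints:
            dx = [coords[i][k] - coords[j][k] for k in range(3)]
            d = math.sqrt(sum(c * c for c in dx)) or 1e-9
            # Gradient of (d - target)^2: pulls or pushes the pair
            # toward the target separation.
            g = 2 * (d - target) / d
            for k in range(3):
                coords[i][k] -= lr * g * dx[k]
                coords[j][k] += lr * g * dx[k]
    return coords

random.seed(0)
coords = [[random.random() for _ in range(3)] for _ in range(3)]
restraints = [(0, 1, 3.8), (1, 2, 3.8), (0, 2, 5.5)]  # made-up targets, in Å
coords = refine(coords, restraints)
for i, j, t in restraints:
    print(f"{i}-{j}: target {t}, got {math.dist(coords[i], coords[j]):.2f}")
```

After optimization the distances sit at their restraint targets, which is the skeleton of the NMR-inspired procedure the text describes.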

This method has been dramatically expanded to apply specifically to loop modeling, which can be extremely difficult due to the high flexibility of loops in proteins in aqueous solution. [19] A more recent expansion applies the spatial-restraint model to electron density maps derived from cryoelectron microscopy studies, which provide low-resolution information that is not usually itself sufficient to generate atomic-resolution structural models. [20] To address the problem of inaccuracies in initial target-template sequence alignment, an iterative procedure has also been introduced to refine the alignment on the basis of the initial structural fit. [21] The most commonly used software in spatial restraint-based modeling is MODELLER, and a database called ModBase has been established for reliable models generated with it. [22]

Regions of the target sequence that are not aligned to a template are modeled by loop modeling; they are the most susceptible to major modeling errors and occur with higher frequency when the target and template have low sequence identity. The coordinates of unmatched sections determined by loop modeling programs are generally much less accurate than those obtained from simply copying the coordinates of a known structure, particularly if the loop is longer than 10 residues. The first two sidechain dihedral angles (χ1 and χ2) can usually be estimated within 30° for an accurate backbone structure; however, the later dihedral angles found in longer side chains such as lysine and arginine are notoriously difficult to predict. Moreover, small errors in χ1 (and, to a lesser extent, in χ2) can cause relatively large errors in the positions of the atoms at the terminus of the side chain; such atoms often have a functional importance, particularly when located near the active site.

Assessment of homology models without reference to the true target structure is usually performed with two methods: statistical potentials or physics-based energy calculations. Both methods produce an estimate of the energy (or an energy-like analog) for the model or models being assessed; independent criteria are needed to determine acceptable cutoffs. Neither of the two methods correlates exceptionally well with true structural accuracy, especially on protein types underrepresented in the PDB, such as membrane proteins.

Statistical potentials are empirical methods based on observed residue-residue contact frequencies among proteins of known structure in the PDB. They assign a probability or energy score to each possible pairwise interaction between amino acids and combine these pairwise interaction scores into a single score for the entire model. Some such methods can also produce a residue-by-residue assessment that identifies poorly scoring regions within the model, though the model may have a reasonable score overall. [23] These methods emphasize the hydrophobic core and solvent-exposed polar amino acids often present in globular proteins. Examples of popular statistical potentials include Prosa and DOPE. Statistical potentials are more computationally efficient than energy calculations. [23]
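The scoring scheme can be sketched as follows. The pairwise "energies" here are invented for illustration only, not values from Prosa or DOPE; real potentials are derived from contact statistics over the whole PDB and cover all residue pairs.

```python
# Hypothetical contact "energies" in arbitrary units, standing in for
# values derived from observed contact frequencies in known structures
# (negative = favorable). These numbers are invented for illustration.
PAIR_ENERGY = {
    frozenset(["L", "V"]): -0.8,   # hydrophobic pair: common in protein cores
    frozenset(["D", "K"]): -0.5,   # salt bridge: favorable
    frozenset(["D", "D"]): +0.6,   # like charges: rarely observed in contact
}

def model_score(sequence, contacts):
    """Combine pairwise statistical-potential terms over a model's
    contact list (pairs of residue indices that are close in 3-D)
    into a single score for the whole model."""
    total = 0.0
    for i, j in contacts:
        pair = frozenset([sequence[i], sequence[j]])
        total += PAIR_ENERGY.get(pair, 0.0)  # unseen pairs treated as neutral
    return total

seq = "LVDKD"
print(model_score(seq, [(0, 1), (2, 3)]))  # favorable packing: negative score
print(model_score(seq, [(2, 4)]))          # D-D clash: positive (worse) score
```

A model whose 3-D arrangement creates favorable contacts scores lower (better) than one that forces unfavorable pairs together, which is the whole assessment principle in miniature.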

Physics-based energy calculations aim to capture the interatomic interactions that are physically responsible for protein stability in solution, especially van der Waals and electrostatic interactions. These calculations are performed using a molecular mechanics force field; proteins are normally too large even for semi-empirical quantum mechanics-based calculations. The use of these methods is based on the energy landscape hypothesis of protein folding, which predicts that a protein's native state is also its energy minimum. Such methods usually employ implicit solvation, which provides a continuous approximation of a solvent bath for a single protein molecule without necessitating the explicit representation of individual solvent molecules. A force field specifically constructed for model assessment is known as the Effective Force Field (EFF) and is based on atomic parameters from CHARMM. [24]

A very extensive model validation report can be obtained using the Radboud Universiteit Nijmegen "What Check" software, which is one option of the Radboud Universiteit Nijmegen "What If" software package; it produces a many-page document with extensive analyses of nearly 200 scientific and administrative aspects of the model. "What Check" is available as a free server; it can also be used to validate experimentally determined structures of macromolecules.

One newer method for model assessment relies on machine learning techniques such as neural nets, which may be trained to assess the structure directly or to form a consensus among multiple statistical and energy-based methods. Results using support vector machine regression on a jury of more traditional assessment methods outperformed common statistical, energy-based, and machine learning methods. [25]

Structural comparison methods

The assessment of homology models' accuracy is straightforward when the experimental structure is known. The most common method of comparing two protein structures uses the root-mean-square deviation (RMSD) metric to measure the mean distance between the corresponding atoms in the two structures after they have been superimposed. However, RMSD does underestimate the accuracy of models in which the core is essentially correctly modeled, but some flexible loop regions are inaccurate. [26] A method introduced for the modeling assessment experiment CASP is known as the global distance test (GDT) and measures the total number of atoms whose distance from the model to the experimental structure lies under a certain distance cutoff. [26] Both methods can be used for any subset of atoms in the structure, but are often applied to only the alpha carbon or protein backbone atoms to minimize the noise created by poorly modeled side chain rotameric states, which most modeling methods are not optimized to predict. [27]
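Both metrics are easy to state in code. The sketch below assumes the two Cα coordinate sets have already been optimally superimposed (the Kabsch algorithm is the usual way; omitted here for brevity), and the coordinates are invented to show how one badly modeled loop atom inflates RMSD while GDT still reports that most of the model is good.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between corresponding atoms of two
    already-superimposed coordinate sets."""
    n = len(model)
    return math.sqrt(sum(math.dist(a, b) ** 2 for a, b in zip(model, reference)) / n)

def gdt_score(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT-style score: average percentage of atoms whose distance to
    their reference position falls under each cutoff (the cutoffs here
    follow CASP's GDT_TS convention)."""
    n = len(model)
    fractions = []
    for c in cutoffs:
        within = sum(1 for a, b in zip(model, reference) if math.dist(a, b) <= c)
        fractions.append(within / n)
    return 100 * sum(fractions) / len(fractions)

# Four Cα atoms: three modeled well, one loop atom 9 Å off (made-up data).
ref   = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)]
model = [(0, 0, 0), (3.9, 0, 0), (7.6, 0.5, 0), (11.4, 0, 9.0)]
print(f"RMSD = {rmsd(model, ref):.2f} Å")   # dominated by the one bad atom
print(f"GDT  = {gdt_score(model, ref):.1f}") # still reports 3/4 atoms good
```

The single 9 Å outlier drives the RMSD to about 4.5 Å even though the core is nearly perfect, while GDT stays at 75: exactly the bias the text describes.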

Several large-scale benchmarking efforts have been made to assess the relative quality of various current homology modeling methods. CASP is a community-wide prediction experiment that runs every two years during the summer months and challenges prediction teams to submit structural models for a number of sequences whose structures have recently been solved experimentally but have not yet been published. Its partner CAFASP has run in parallel with CASP but evaluates only models produced via fully automated servers. Continuously running experiments that do not have prediction 'seasons' focus mainly on benchmarking publicly available webservers. LiveBench and EVA run continuously to assess participating servers' performance in prediction of imminently released structures from the PDB. CASP and CAFASP serve mainly as evaluations of the state of the art in modeling, while the continuous assessments seek to evaluate the model quality that would be obtained by a non-expert user employing publicly available tools.

The accuracy of the structures generated by homology modeling is highly dependent on the sequence identity between target and template. Above 50% sequence identity, models tend to be reliable, with only minor errors in side chain packing and rotameric state, and an overall RMSD between the modeled and the experimental structure falling around 1 Å. This error is comparable to the typical resolution of a structure solved by NMR. In the 30–50% identity range, errors can be more severe and are often located in loops. Below 30% identity, serious errors occur, sometimes resulting in the basic fold being mis-predicted. [13] This low-identity region is often referred to as the "twilight zone" within which homology modeling is extremely difficult, and to which it is possibly less suited than fold recognition methods. [28]
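These identity bands translate directly into a rule of thumb; the thresholds in this sketch are taken from the text above, and the tier descriptions are shorthand, not a calibrated error model.

```python
def expected_model_quality(percent_identity):
    """Rule-of-thumb reliability bands for a homology model as a
    function of target-template sequence identity."""
    if percent_identity > 50:
        return "reliable: ~1 Å RMSD, minor side-chain errors"
    if percent_identity >= 30:
        return "moderate: errors concentrated in loops"
    return "twilight zone: fold itself may be mis-predicted"

for pid in (72, 41, 18):
    print(pid, "->", expected_model_quality(pid))
```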

At high sequence identities, the primary source of error in homology modeling derives from the choice of the template or templates on which the model is based, while lower identities exhibit serious errors in sequence alignment that inhibit the production of high-quality models. [6] It has been suggested that the major impediment to quality model production is inadequacies in sequence alignment, since "optimal" structural alignments between two proteins of known structure can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure. [29]

Attempts have been made to improve the accuracy of homology models built with existing methods by subjecting them to molecular dynamics simulation in an effort to improve their RMSD to the experimental structure. However, current force field parameterizations may not be sufficiently accurate for this task, since homology models used as starting structures for molecular dynamics tend to produce slightly worse structures. [30] Slight improvements have been observed in cases where significant restraints were used during the simulation. [31]

The two most common and large-scale sources of error in homology modeling are poor template selection and inaccuracies in target-template sequence alignment. [6] [32] Controlling for these two factors by using a structural alignment, or a sequence alignment produced on the basis of comparing two solved structures, dramatically reduces the errors in final models; these "gold standard" alignments can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure. [29] Results from the most recent CASP experiment suggest that "consensus" methods collecting the results of multiple fold recognition and multiple alignment searches increase the likelihood of identifying the correct template; similarly, the use of multiple templates in the model-building step may be worse than the use of the single correct template but better than the use of a single suboptimal one. [32] Alignment errors may be minimized by the use of a multiple alignment even if only one template is used, and by the iterative refinement of local regions of low similarity. [3] [11] A lesser source of model errors is errors in the template structure. The PDBREPORT database lists several million, mostly very small but occasionally dramatic, errors in experimental (template) structures that have been deposited in the PDB.

Serious local errors can arise in homology models where an insertion or deletion mutation or a gap in a solved structure results in a region of target sequence for which there is no corresponding template. This problem can be minimized by the use of multiple templates, but the method is complicated by the templates' differing local structures around the gap and by the likelihood that a missing region in one experimental structure is also missing in other structures of the same protein family. Missing regions are most common in loops where high local flexibility increases the difficulty of resolving the region by structure-determination methods. Although some guidance is provided even with a single template by the positioning of the ends of the missing region, the longer the gap, the more difficult it is to model. Loops of up to about 9 residues can be modeled with moderate accuracy in some cases if the local alignment is correct. [3] Larger regions are often modeled individually using ab initio structure prediction techniques, although this approach has met with only isolated success. [33]

The rotameric states of side chains and their internal packing arrangement also present difficulties in homology modeling, even in targets for which the backbone structure is relatively easy to predict. This is partly due to the fact that many side chains in crystal structures are not in their "optimal" rotameric state as a result of energetic factors in the hydrophobic core and in the packing of the individual molecules in a protein crystal. [34] One method of addressing this problem requires searching a rotameric library to identify locally low-energy combinations of packing states. [35] It has been suggested that a major reason that homology modeling is so difficult when target-template sequence identity lies below 30% is that such proteins have broadly similar folds but widely divergent side chain packing arrangements. [4]

Uses of the structural models include protein–protein interaction prediction, protein–protein docking, molecular docking, and functional annotation of genes identified in an organism's genome. [36] Even low-accuracy homology models can be useful for these purposes, because their inaccuracies tend to be located in the loops on the protein surface, which are normally more variable even between closely related proteins. The functional regions of the protein, especially its active site, tend to be more highly conserved and thus more accurately modeled. [13]

Homology models can also be used to identify subtle differences between related proteins that have not all been solved structurally. For example, the method was used to identify cation binding sites on the Na + /K + ATPase and to propose hypotheses about different ATPases' binding affinity. [37] Used in conjunction with molecular dynamics simulations, homology models can also generate hypotheses about the kinetics and dynamics of a protein, as in studies of the ion selectivity of a potassium channel. [38] Large-scale automated modeling of all identified protein-coding regions in a genome has been attempted for the yeast Saccharomyces cerevisiae, resulting in nearly 1000 quality models for proteins whose structures had not yet been determined at the time of the study, and identifying novel relationships between 236 yeast proteins and other previously solved structures. [39]

Secondary Structure

The local folding of the polypeptide in some regions gives rise to the secondary structure of the protein. The most common are the α-helix and β-pleated sheet structures (Figure 4). Both structures are held in shape by hydrogen bonds. In the α-helix, the hydrogen bonds form between the oxygen atom in the carbonyl group of one amino acid and another amino acid that is four amino acids farther along the chain.

Figure 4. The α-helix and β-pleated sheet are secondary structures of proteins that form because of hydrogen bonding between carbonyl and amino groups in the peptide backbone. Certain amino acids have a propensity to form an α-helix, while others have a propensity to form a β-pleated sheet.

Every helical turn in an alpha helix has 3.6 amino acid residues. The R groups (the variant groups) of the polypeptide protrude out from the α-helix chain. In the β-pleated sheet, the “pleats” are formed by hydrogen bonding between atoms on the backbone of the polypeptide chain. The R groups are attached to the carbons and extend above and below the folds of the pleat. The pleated segments align parallel or antiparallel to each other, and hydrogen bonds form between the partially positive nitrogen atom in the amino group and the partially negative oxygen atom in the carbonyl group of the peptide backbone. The α-helix and β-pleated sheet structures are found in most globular and fibrous proteins and they play an important structural role.


Proteins are chains of amino acids joined together by peptide bonds. Many conformations of this chain are possible due to the rotation of the chain about each alpha-carbon atom (Cα atom). It is these conformational changes that are responsible for differences in the three-dimensional structure of proteins. Each amino acid in the chain is polar, i.e. it has separated positive and negative charged regions with a free carbonyl group, which can act as a hydrogen bond acceptor, and an NH group, which can act as a hydrogen bond donor. These groups can therefore interact in the protein structure. The 20 amino acids can be classified according to the chemistry of the side chain, which also plays an important structural role. Glycine takes on a special position, as it has the smallest side chain, only one hydrogen atom, and therefore can increase the local flexibility in the protein structure. Cysteine, on the other hand, can react with another cysteine residue and thereby form a cross link stabilizing the whole structure.

The protein structure can be considered as a sequence of secondary structure elements, such as α helices and β sheets, which together constitute the overall three-dimensional configuration of the protein chain. In these secondary structures regular patterns of H bonds are formed between neighboring amino acids, and the amino acids have similar Φ and ψ angles.

The formation of these structures neutralizes the polar groups on each amino acid. The secondary structures are tightly packed in the protein core in a hydrophobic environment. Each amino acid side group has a limited volume to occupy and a limited number of possible interactions with other nearby side chains, a situation that must be taken into account in molecular modeling and alignments. [1]

α helix

The α helix is the most abundant type of secondary structure in proteins. The α helix has 3.6 amino acids per turn with an H bond formed between every fourth residue; the average length is 10 amino acids (3 turns) or 10 Å but varies from 5 to 40 (1.5 to 11 turns). The alignment of the H bonds creates a dipole moment for the helix with a resulting partial positive charge at the amino end of the helix. Because this region has free NH2 groups, it will interact with negatively charged groups such as phosphates. The most common location of α helices is at the surface of protein cores, where they provide an interface with the aqueous environment. The inner-facing side of the helix tends to have hydrophobic amino acids and the outer-facing side hydrophilic amino acids. Thus, every third or fourth amino acid along the chain will tend to be hydrophobic, a pattern that can be quite readily detected. In the leucine zipper motif, a repeating pattern of leucines on the facing sides of two adjacent helices is highly predictive of the motif. A helical-wheel plot can be used to show this repeated pattern. Other α helices buried in the protein core or in cellular membranes have a higher and more regular distribution of hydrophobic amino acids, and are highly predictive of such structures. Helices exposed on the surface have a lower proportion of hydrophobic amino acids. Amino acid content can be predictive of an α-helical region. Regions richer in alanine (A), glutamic acid (E), leucine (L), and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y), and serine (S) tend to form an α helix. Proline destabilizes or breaks an α helix but can be present in longer helices, forming a bend.
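The helical-wheel periodicity described above can be detected numerically via the hydrophobic moment: project each residue's hydropathy onto its angular position around an ideal helix (3.6 residues/turn, i.e. 100° per residue) and measure the resultant vector. This sketch uses the standard Kyte-Doolittle hydropathy values for a few residues; the two example sequences are invented.

```python
import math

# Kyte-Doolittle hydropathy values (standard scale) for a few residues.
HYDROPATHY = {"A": 1.8, "L": 3.8, "E": -3.5, "K": -3.9, "M": 1.9,
              "S": -0.8, "F": 2.8, "Q": -3.5, "I": 4.5, "D": -3.5}

def hydrophobic_moment(sequence, degrees_per_residue=100.0):
    """Magnitude of the hydrophobic moment: large when hydrophobic
    residues cluster on one face of an ideal α helix (amphipathic),
    small when hydropathy is spread evenly around the wheel."""
    x = y = 0.0
    for i, aa in enumerate(sequence):
        theta = math.radians(degrees_per_residue * i)
        h = HYDROPATHY[aa]
        x += h * math.cos(theta)
        y += h * math.sin(theta)
    return math.hypot(x, y)

amphipathic = "LKELKELKE"   # hydrophobic L every ~3-4 residues: one face
uniform     = "LLLLLLLLL"   # hydrophobic everywhere: no preferred face
print(hydrophobic_moment(amphipathic))
print(hydrophobic_moment(uniform))
```

The periodic sequence scores a much larger moment than the uniformly hydrophobic one, which is the quantitative version of "every third or fourth residue hydrophobic" being readily detectable.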

β sheet

β sheets are formed by H bonds between an average of 5–10 consecutive amino acids in one portion of the chain with another 5–10 farther down the chain. The interacting regions may be adjacent, with a short loop in between, or far apart, with other structures in between. Every chain may run in the same direction to form a parallel sheet, every other chain may run in the reverse chemical direction to form an antiparallel sheet, or the chains may be parallel and antiparallel to form a mixed sheet. The pattern of H bonding is different in the parallel and antiparallel configurations. Each amino acid in the interior strands of the sheet forms two H bonds with neighboring amino acids, whereas each amino acid on the outside strands forms only one bond with an interior strand. Looking across the sheet at right angles to the strands, more distant strands are rotated slightly counterclockwise to form a left-handed twist. The Cα atoms alternate above and below the sheet in a pleated structure, and the R side groups of the amino acids alternate above and below the pleats. The Φ and Ψ angles of the amino acids in sheets vary considerably in one region of the Ramachandran plot. It is more difficult to predict the location of β-sheets than of α-helices. The situation improves somewhat when the amino acid variation in multiple sequence alignments is taken into account.

Loops

Some parts of the protein have fixed three-dimensional structure but do not form any regular structures. They should not be confused with disordered or unfolded segments of proteins or random coil, an unfolded polypeptide chain lacking any fixed three-dimensional structure. These parts are frequently called "loops" because they connect β-sheets and α-helices. Loops are usually located at the protein surface, and therefore mutations of their residues are more easily tolerated. Having more substitutions, insertions, and deletions in a certain region of a sequence alignment may be an indication of a loop. The positions of introns in genomic DNA may correlate with the locations of loops in the encoded protein. Loops also tend to have charged and polar amino acids and are frequently a component of active sites.

Proteins may be classified according to both structural and sequence similarity. For structural classification, the sizes and spatial arrangements of secondary structures described in the above paragraph are compared in known three-dimensional structures. Classification based on sequence similarity was historically the first to be used. Initially, similarity based on alignments of whole sequences was performed. Later, proteins were classified on the basis of the occurrence of conserved amino acid patterns. Databases that classify proteins by one or more of these schemes are available. In considering protein classification schemes, it is important to keep several observations in mind. First, two entirely different protein sequences from different evolutionary origins may fold into a similar structure. Conversely, the sequence of an ancient gene for a given structure may have diverged considerably in different species while at the same time maintaining the same basic structural features. Recognizing any remaining sequence similarity in such cases may be a very difficult task. Second, two proteins that share a significant degree of sequence similarity either with each other or with a third sequence also share an evolutionary origin and should share some structural features as well. However, gene duplication and genetic rearrangements during evolution may give rise to new gene copies, which can then evolve into proteins with new function and structure. [1]

Terms used for classifying protein structures and sequences

The more commonly used terms for evolutionary and structural relationships among proteins are listed below. Many additional terms are used for various kinds of structural features found in proteins. Descriptions of such terms may be found at the CATH Web site, the Structural Classification of Proteins (SCOP) Web site, and a Glaxo Wellcome tutorial on the Swiss bioinformatics Expasy Web site.

Active site: a localized combination of amino acid side groups within the tertiary (three-dimensional) or quaternary (protein subunit) structure that can interact with a chemically specific substrate and that provides the protein with biological activity. Proteins of very different amino acid sequences may fold into a structure that produces the same active site.

Architecture: the relative orientations of secondary structures in a three-dimensional structure without regard to whether or not they share a similar loop structure.

Fold (topology): a type of architecture that also has a conserved loop structure.

Blocks: a conserved amino acid sequence pattern in a family of proteins. The pattern includes a series of possible matches at each position in the represented sequences, but there are not any inserted or deleted positions in the pattern or in the sequences. By way of contrast, sequence profiles are a type of scoring matrix that represents a similar set of patterns that includes insertions and deletions.

Class: a term used to classify protein domains according to their secondary structural content and organization. Four classes were originally recognized by Levitt and Chothia (1976), and several others have been added in the SCOP database. Three classes are given in the CATH database: mainly-α, mainly-β, and α–β, with the α–β class including both alternating α/β and α+β structures.

Core: the portion of a folded protein molecule that comprises the hydrophobic interior of α-helices and β-sheets. The compact structure brings together side groups of amino acids into close enough proximity so that they can interact. When comparing protein structures, as in the SCOP database, the core is the region common to most of the structures that share a common fold or that are in the same superfamily. In structure prediction, the core is sometimes defined as the arrangement of secondary structures that is likely to be conserved during evolutionary change.
Domain (sequence context): a segment of a polypeptide chain that can fold into a three-dimensional structure irrespective of the presence of other segments of the chain. The separate domains of a given protein may interact extensively or may be joined only by a length of polypeptide chain. A protein with several domains may use these domains for functional interactions with different molecules.

Family (sequence context): a group of proteins of similar biochemical function that are more than 50% identical when aligned. This same cutoff is still used by the Protein Information Resource (PIR). A protein family comprises proteins with the same function in different organisms (orthologous sequences) but may also include proteins in the same organism (paralogous sequences) derived from gene duplication and rearrangements. If a multiple sequence alignment of a protein family reveals a common level of similarity throughout the lengths of the proteins, PIR refers to the family as a homeomorphic family. The aligned region is referred to as a homeomorphic domain, and this region may comprise several smaller homology domains that are shared with other families. Families may be further subdivided into subfamilies or grouped into superfamilies based on respective higher or lower levels of sequence similarity. The SCOP database reports 1296 families and the CATH database (version 1.7 beta) reports 1846 families. When the sequences of proteins with the same function are examined in greater detail, some are found to share high sequence similarity; they are obviously members of the same family by the above criteria. However, others are found that have very little, or even insignificant, sequence similarity with other family members. In such cases, the family relationship between two distant family members A and C can often be demonstrated by finding an additional family member B that shares significant similarity with both A and C; thus, B provides a connecting link between A and C.
Another approach is to examine distant alignments for highly conserved matches. At a level of 50% identity, proteins are likely to have the same three-dimensional structure, and the identical atoms in the sequence alignment will also superimpose within approximately 1 Å in the structural model. Thus, if the structure of one member of a family is known, a reliable prediction may be made for a second member of the family, and the higher the identity level, the more reliable the prediction. Protein structural modeling can be performed by examining how well the amino acid substitutions fit into the core of the three-dimensional structure.

Family (structural context): as used in the FSSP database (Families of structurally similar proteins) and the DALI/FSSP Web site, two structures that have a significant level of structural similarity but not necessarily significant sequence similarity.

Fold: similar to structural motif, but includes a larger combination of secondary structural units in the same configuration. Thus, proteins sharing the same fold have the same combination of secondary structures that are connected by similar loops. An example is the Rossmann fold, comprising several alternating α helices and parallel β strands. In the SCOP, CATH, and FSSP databases, the known protein structures have been classified into hierarchical levels of structural complexity with the fold as a basic level of classification.

Homologous domain (sequence context): an extended sequence pattern, generally found by sequence alignment methods, that indicates a common evolutionary origin among the aligned sequences. A homology domain is generally longer than a motif. The domain may include all of a given protein sequence or only a portion of the sequence. Some domains are complex and made up of several smaller homology domains that became joined to form a larger one during evolution. A domain that covers an entire sequence is called the homeomorphic domain by PIR (Protein Information Resource).
Module: a region of conserved amino acid patterns comprising one or more motifs and considered to be a fundamental unit of structure or function. The presence of a module has also been used to classify proteins into families.

Motif (sequence context): a conserved pattern of amino acids that is found in two or more proteins. In the Prosite catalog, a motif is an amino acid pattern that is found in a group of proteins that have a similar biochemical activity, and that often is near the active site of the protein. Examples of sequence motif databases are the Prosite catalog and the Stanford Motifs Database. [2]

Motif (structural context): a combination of several secondary structural elements produced by the folding of adjacent sections of the polypeptide chain into a specific three-dimensional configuration. An example is the helix-loop-helix motif. Structural motifs are also referred to as supersecondary structures and folds.

Position-specific scoring matrix (sequence context, also known as a weight or scoring matrix): represents a conserved region in a multiple sequence alignment with no gaps. Each matrix column represents the variation found in one column of the multiple sequence alignment.

Position-specific scoring matrix—3D (structural context): represents the amino acid variation found in an alignment of proteins that fall into the same structural class. Matrix columns represent the amino acid variation found at one amino acid position in the aligned structures.

Primary structure: the linear amino acid sequence of a protein, which chemically is a polypeptide chain composed of amino acids joined by peptide bonds.

Profile (sequence context): a scoring matrix that represents a multiple sequence alignment of a protein family. The profile is usually obtained from a well-conserved region in a multiple sequence alignment. The profile is in the form of a matrix with each column representing a position in the alignment and each row one of the amino acids.
Matrix values give the likelihood of each amino acid at the corresponding position in the alignment. The profile is moved along the target sequence to locate the best-scoring regions by a dynamic programming algorithm. Gaps are allowed during matching, and a gap penalty is included as a negative score when no amino acid is matched. A sequence profile may also be represented by a hidden Markov model, referred to as a profile HMM.

Profile (structural context): a scoring matrix that represents which amino acids should fit well and which should fit poorly at sequential positions in a known protein structure. Profile columns represent sequential positions in the structure, and profile rows represent the 20 amino acids. As with a sequence profile, the structural profile is moved along a target sequence to find the highest possible alignment score by a dynamic programming algorithm. Gaps may be included and receive a penalty. The resulting score provides an indication as to whether or not the target protein might adopt such a structure.

Quaternary structure: the three-dimensional configuration of a protein molecule comprising several independent polypeptide chains.

Secondary structure: the interactions that occur between the C, O, and NH groups on amino acids in a polypeptide chain to form α-helices, β-sheets, turns, loops, and other forms, and that facilitate the folding into a three-dimensional structure.

Superfamily: a group of protein families of the same or different lengths that are related by distant yet detectable sequence similarity. Members of a given superfamily thus have a common evolutionary origin. Originally, Dayhoff defined the cutoff for superfamily status as a chance of 10⁻⁶ that the sequences are unrelated, on the basis of an alignment score (Dayhoff et al. 1978). Proteins with few identities in an alignment of the sequences but with a convincingly common number of structural and functional features are placed in the same superfamily.
At the level of three-dimensional structure, superfamily proteins will share common structural features such as a common fold, but there may also be differences in the number and arrangement of secondary structures. The PIR resource uses the term homeomorphic superfamilies to refer to superfamilies that are composed of sequences that can be aligned from end to end, representing a sharing of a single sequence homology domain, a region of similarity that extends throughout the alignment. This domain may also comprise smaller homology domains that are shared with other protein families and superfamilies. Although a given protein sequence may contain domains found in several superfamilies, thus indicating a complex evolutionary history, sequences will be assigned to only one homeomorphic superfamily based on the presence of similarity throughout a multiple sequence alignment. The superfamily alignment may also include regions that do not align, either within or at the ends of the alignment. In contrast, sequences in the same family align well throughout the alignment.

Supersecondary structure: a term with similar meaning to structural motif.

Tertiary structure: the three-dimensional or globular structure formed by the packing together or folding of secondary structures of a polypeptide chain. [1]
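The position-specific scoring matrix and profile entries above describe a per-column scoring table that is slid along a target sequence. A minimal Python sketch of the ungapped case follows; the four-column alignment block, the uniform background frequencies, and the add-one pseudocounts are all invented simplifications (real profiles use empirical background frequencies and dynamic programming with gap penalties):

```python
import math

# Toy gapless alignment block: each column models one position.
block = ["ACDE", "ACDF", "SCDE", "ACNE"]
alphabet = "ACDEFGHIKLMNPQRSTVWY"
background = 1.0 / len(alphabet)  # uniform background, a simplification

n_cols = len(block[0])
pssm = []
for col in range(n_cols):
    counts = {aa: 1.0 for aa in alphabet}  # add-one pseudocounts
    for seq in block:
        counts[seq[col]] += 1.0
    total = sum(counts.values())
    # Log-odds score of each residue versus the background frequency.
    pssm.append({aa: math.log2((counts[aa] / total) / background) for aa in alphabet})

def best_window(target):
    """Slide the ungapped PSSM along target; return (best score, offset)."""
    best = (float("-inf"), -1)
    for i in range(len(target) - n_cols + 1):
        score = sum(pssm[j][target[i + j]] for j in range(n_cols))
        best = max(best, (score, i))
    return best

score, offset = best_window("GGACDEGG")
print(offset)  # → 2, the ACDE-like window
```

A positive total score at the best offset indicates the window resembles the modeled region more than random background, which is the same logic a profile search applies at full scale.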

Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins based only on knowledge of their amino acid sequence. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm (or a similar one, e.g., STRIDE) applied to the crystal structure of the protein. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins. [1]

The best modern methods of secondary structure prediction in proteins have been claimed to reach 80% accuracy by using machine learning and sequence alignments. [3] This high accuracy allows the use of the predictions as a feature for improving fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.

Background

Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, [4] [5] [6] [7] [8] focused on identifying likely alpha helices and were based mainly on helix-coil transition models. [9] Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60-65% accurate, and often underpredict beta sheets. [1] The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment, by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern machine learning methods such as neural nets and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins. [10] The theoretical upper limit of accuracy is around 90%, [10] partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints. Moreover, typical secondary structure prediction methods do not account for the influence of tertiary structure on the formation of secondary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.

Historical perspective

To date, over 20 different secondary structure prediction methods have been developed. One of the first algorithms was the Chou-Fasman method, which relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure. [11] The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50-60% accurate in predicting secondary structures. [1]
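The propensity idea behind Chou-Fasman can be sketched in a few lines of Python. This is a deliberately crude reduction: it only averages helix propensities over a sliding window, ignoring the method's actual nucleation and extension rules, and the propensity table is abridged and approximate:

```python
# Approximate helix propensities for a handful of residues (illustrative
# values in the spirit of the Chou-Fasman tables, not the official ones).
HELIX_PROPENSITY = {
    "A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45,
    "G": 0.57, "P": 0.57, "S": 0.77, "V": 1.06,
}

def helix_like_windows(seq, width=6, default=1.0):
    """Return start indices of windows whose mean helix propensity exceeds 1.0."""
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        mean = sum(HELIX_PROPENSITY.get(aa, default) for aa in window) / width
        if mean > 1.0:
            hits.append(i)
    return hits

# A made-up sequence: helix-favoring residues followed by helix breakers.
print(helix_like_windows("AEMLAEGPGSGPSG"))  # → [0, 1, 2]
```

The window over the Ala/Glu/Leu-rich start scores above 1.0, while the glycine/proline-rich tail does not, mirroring how propensity-based methods flag helix-prone stretches.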

The next notable method, the GOR method, is based on information theory. It uses the more powerful probabilistic technique of Bayesian inference. [12] The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the conditional probability of the amino acid assuming each structure given the contributions of its neighbors (it does not assume that the neighbors have that same structure). The approach is both more sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities are only strong for a small number of amino acids such as proline and glycine. Weak contributions from each of many neighbors can add up to strong effects overall. The original GOR method was roughly 65% accurate and is dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicted as loops or disorganized regions. [1]
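The key difference from Chou-Fasman, summing contributions from neighboring residues rather than scoring each residue in isolation, can be illustrated with a toy GOR-style scorer. The information values and the proline rule below are invented for illustration; real GOR parameters are tabulated from solved structures for every amino acid at every window offset:

```python
WINDOW = 8  # GOR considers neighbors within +/-8 positions

# Toy self-information values for a residue being in a helix.
SELF_INFO = {"A": 0.5, "E": 0.5, "L": 0.4, "G": -0.6, "P": -0.9}

def neighbor_info(aa, offset):
    # Toy rule: proline disrupts helices a few residues around itself.
    if aa == "P" and abs(offset) <= 3:
        return -0.4
    return 0.0

def helix_score(seq, i):
    """Sum self and neighbor information for residue i being in a helix."""
    score = SELF_INFO.get(seq[i], 0.0)
    for d in range(-WINDOW, WINDOW + 1):
        j = i + d
        if d != 0 and 0 <= j < len(seq):
            score += neighbor_info(seq[j], d)
    return score

seq = "AAELAAPAAELAA"  # invented sequence with one proline
print(round(helix_score(seq, 0), 2), round(helix_score(seq, 6), 2))  # → 0.5 -0.9
```

Residues near the proline are penalized even though their own propensities are favorable, which is the "weak contributions from many neighbors" effect described above.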

Another big step forward was the use of machine learning methods. Artificial neural network methods were the first to be applied; using solved structures as training sets, they identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of the hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet. [1] PSIPRED and JPRED are among the best-known neural network programs for protein secondary structure prediction. More recently, support vector machines have proven particularly useful for predicting the locations of turns, which are difficult to identify with statistical methods. [13] [14]

Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as backbone dihedral angles in unassigned regions. Both SVMs [15] and neural networks [16] have been applied to this problem. [13] More recently, real-valued torsion angles have been accurately predicted by SPINE-X and successfully employed for ab initio structure prediction. [17]

Other improvements

Secondary structure formation is reported to depend on factors beyond the protein sequence itself. For example, secondary structure tendencies are reported to depend also on the local environment, [18] the solvent accessibility of residues, [19] the protein structural class, [20] and even the organism from which the proteins are obtained. [21] Based on such observations, some studies have shown that secondary structure prediction can be improved by the addition of information about protein structural class, [22] residue accessible surface area, [23] [24] and contact number. [25]

The practical role of protein structure prediction is now more important than ever. [26] Massive amounts of protein sequence data are produced by modern large-scale DNA sequencing efforts such as the Human Genome Project. Despite community-wide efforts in structural genomics, the output of experimentally determined protein structures—typically by time-consuming and relatively expensive X-ray crystallography or NMR spectroscopy—is lagging far behind the output of protein sequences.

Protein structure prediction remains an extremely difficult and unresolved undertaking. The two main problems are the calculation of protein free energy and finding the global minimum of this energy. A protein structure prediction method must explore the space of possible protein structures, which is astronomically large. These problems can be partially bypassed in "comparative" or homology modeling and fold recognition methods, in which the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. De novo protein structure prediction methods, on the other hand, must explicitly resolve these problems. The progress and challenges in protein structure prediction have been reviewed by Zhang. [27]

Before modelling

Most tertiary structure modelling methods, such as Rosetta, are optimized for modelling the tertiary structure of single protein domains. A step called domain parsing, or domain boundary prediction, is usually done first to split a protein into potential structural domains. As with the rest of tertiary structure prediction, this can be done comparatively from known structures [28] or ab initio with the sequence only (usually by machine learning, assisted by covariation). [29] The structures for individual domains are docked together in a process called domain assembly to form the final tertiary structure. [30] [31]

Ab initio protein modelling

Energy- and fragment-based methods

Ab initio (or de novo) protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search possible solutions (i.e., global optimization of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins. To predict protein structure de novo for larger proteins will require better algorithms and larger computational resources such as those afforded by powerful supercomputers (such as Blue Gene or MDGRAPE-3) or distributed computing (such as Folding@home, the Human Proteome Folding Project, and Rosetta@home). Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ab initio structure prediction an active research field. [27]
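The "stochastic search with a suitable energy function" described above can be illustrated with a toy Metropolis Monte Carlo run. Everything here is invented for illustration: the energy function is a smooth made-up landscape over torsion-like angles, not a physical force field, and the move set is a simple Gaussian perturbation:

```python
import math
import random

random.seed(0)  # deterministic toy run

def energy(angles):
    # Invented landscape: each angle "prefers" -60 degrees (helix-like).
    return sum((a + 60.0) ** 2 / 1000.0 for a in angles)

def metropolis(n_angles=10, steps=5000, temperature=1.0):
    """Return (starting energy, best energy seen) for one random-start run."""
    angles = [random.uniform(-180.0, 180.0) for _ in range(n_angles)]
    current = energy(angles)
    start, best = current, current
    for _ in range(steps):
        i = random.randrange(n_angles)
        old = angles[i]
        angles[i] = max(-180.0, min(180.0, old + random.gauss(0.0, 10.0)))
        proposed = energy(angles)
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if proposed <= current or random.random() < math.exp((current - proposed) / temperature):
            current = proposed
        else:
            angles[i] = old  # reject: restore the previous angle
        best = min(best, current)
    return start, best

start, best = metropolis()
print(best < start)  # the search finds much lower-energy conformations
```

On a real protein the landscape is rugged and high-dimensional, which is why such searches demand the vast computational resources mentioned above.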

As of 2009, a 50-residue protein could be simulated atom-by-atom on a supercomputer for 1 millisecond. [32] As of 2012, comparable stable-state sampling could be done on a standard desktop with a new graphics card and more sophisticated algorithms. [33] Much larger simulation timescales can be achieved using coarse-grained modeling. [34] [35]

Evolutionary covariation to predict 3D contacts

As sequencing became more commonplace in the 1990s, several groups used protein sequence alignments to predict correlated mutations, and it was hoped that these coevolved residues could be used to predict tertiary structure (by analogy to distance constraints from experimental procedures such as NMR). The assumption is that when single-residue mutations are slightly deleterious, compensatory mutations may occur to restabilize residue-residue interactions. This early work used what are known as local methods to calculate correlated mutations from protein sequences, but suffered from indirect false correlations that result from treating each pair of residues as independent of all other pairs. [36] [37] [38]

In 2011, a different, global statistical approach demonstrated that predicted coevolved residues were sufficient to predict the 3D fold of a protein, provided there are enough sequences available (>1,000 homologous sequences are needed). [39] The method, EVfold, uses no homology modeling, threading, or 3D structure fragments and can be run on a standard personal computer even for proteins with hundreds of residues. The accuracy of the contacts predicted using this and related approaches has now been demonstrated on many known structures and contact maps, [40] [41] [42] including the prediction of experimentally unsolved transmembrane proteins. [43]
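The simplest "local" covariation statistic mentioned above is the mutual information between two alignment columns. The toy alignment below is invented so that columns 0 and 2 co-vary perfectly (mimicking a compensatory contact) while column 1 varies independently; real contact prediction needs large alignments and, as the text explains, global statistical models to remove indirect correlations:

```python
import math
from collections import Counter

# Invented toy alignment: columns 0 and 2 are perfectly coupled (A<->E,
# G<->D), while column 1 (K/R) varies independently of both.
msa = ["AKE", "ARE", "GKD", "GRD", "AKE", "GRD"]

def column(msa, j):
    return [seq[j] for seq in msa]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    ci, cj = Counter(column(msa, i)), Counter(column(msa, j))
    cij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), nab in cij.items():
        p_ab = nab / n
        mi += p_ab * math.log2(p_ab / ((ci[a] / n) * (cj[b] / n)))
    return mi

print(round(mutual_information(msa, 0, 2), 3))  # → 1.0 (coupled pair)
print(round(mutual_information(msa, 0, 1), 3))  # → 0.082 (independent pair)
```

Ranking residue pairs by such scores, then feeding the top pairs to a structure builder as distance constraints, is the basic recipe the local methods followed.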

Comparative protein modeling

Comparative protein modeling uses previously solved structures as starting points, or templates. This is effective because it appears that, although the number of actual proteins is vast, there is a limited set of tertiary structural motifs to which most proteins belong. It has been suggested that there are only around 2,000 distinct protein folds in nature, though there are many millions of different proteins. Comparative protein modeling can be combined with evolutionary covariation in structure prediction. [44]

These methods may also be split into two groups: [27]

Homology modeling is based on the reasonable assumption that two homologous proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment. [45] Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences. [46]

Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence to the structure, thus yielding possible three-dimensional models. This type of method is also known as 3D-1D fold recognition due to its compatibility analysis between three-dimensional structures and linear protein sequences. This method has also given rise to methods performing an inverse folding search by evaluating the compatibility of a given structure with a large database of sequences, thus predicting which sequences have the potential to produce a given fold.

Modeling of side-chain conformations

Accurate packing of the amino acid side chains represents a separate problem in protein structure prediction. Methods that specifically address the problem of predicting side-chain geometry include dead-end elimination and self-consistent mean field methods. Low-energy side-chain conformations are usually determined on a rigid polypeptide backbone, using a set of discrete side-chain conformations known as "rotamers." The methods attempt to identify the set of rotamers that minimizes the model's overall energy.
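Dead-end elimination, mentioned above, prunes rotamers that can never be part of the lowest-energy assignment: rotamer r at a residue is eliminated if some alternative t is better than r in every possible context. A sketch on an invented two-residue problem (the rotamer names and energies are arbitrary; real implementations work over full rotamer libraries and precomputed pairwise energy tables):

```python
# Invented self energies E_self[(residue, rotamer)] and pairwise energies.
E_self = {
    (0, "r0"): 0.0, (0, "r1"): 5.0,
    (1, "s0"): 0.0, (1, "s1"): 1.0,
}
E_pair = {
    ((0, "r0"), (1, "s0")): -1.0, ((0, "r0"), (1, "s1")): 0.0,
    ((0, "r1"), (1, "s0")): -2.0, ((0, "r1"), (1, "s1")): -1.0,
}
rotamers = {0: ["r0", "r1"], 1: ["s0", "s1"]}

def pair(i, r, j, s):
    key = ((i, r), (j, s)) if i < j else ((j, s), (i, r))
    return E_pair[key]

def dee_eliminate(i, r):
    """True if some alternative rotamer t dominates r at residue i
    (the classic dead-end elimination criterion)."""
    for t in rotamers[i]:
        if t == r:
            continue
        gap = E_self[(i, r)] - E_self[(i, t)]
        for j in rotamers:
            if j != i:
                # Worst case for r: the neighbor rotamer minimizing the gap.
                gap += min(pair(i, r, j, s) - pair(i, t, j, s)
                           for s in rotamers[j])
        if gap > 0:
            return True
    return False

print(dee_eliminate(0, "r1"))  # → True: r1's self energy outweighs its pair bonus
```

Eliminating dominated rotamers shrinks the combinatorial search space before any global optimization is attempted, which is what makes exact side-chain packing tractable in practice.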

These methods use rotamer libraries, which are collections of favorable conformations for each residue type in proteins. Rotamer libraries may contain information about the conformation, its frequency, and the standard deviations about mean dihedral angles, which can be used in sampling. [47] Rotamer libraries are derived from structural bioinformatics or other statistical analysis of side-chain conformations in known experimental structures of proteins, such as by clustering the observed conformations for tetrahedral carbons near the staggered (60°, 180°, -60°) values.

Rotamer libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent. Backbone-independent rotamer libraries make no reference to backbone conformation, and are calculated from all available side chains of a certain type (for instance, the first example of a rotamer library, done by Ponder and Richards at Yale in 1987). [48] Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for α-helix, β-sheet, or coil secondary structures. [49] Backbone-dependent rotamer libraries present conformations and/or frequencies dependent on the local backbone conformation as defined by the backbone dihedral angles ϕ and ψ, regardless of secondary structure. [50]

The modern versions of these libraries as used in most software are presented as multidimensional distributions of probability or frequency, where the peaks correspond to the dihedral-angle conformations considered as individual rotamers in the lists. Some versions are based on very carefully curated data and are used primarily for structure validation, [51] while others emphasize relative frequencies in much larger data sets and are the form used primarily for structure prediction, such as the Dunbrack rotamer libraries. [52]

Side-chain packing methods are most useful for analyzing the protein's hydrophobic core, where side chains are more closely packed; they have more difficulty addressing the looser constraints and higher flexibility of surface residues, which often occupy multiple rotamer conformations rather than just one. [53] [54]

In the case of complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy, protein–protein docking methods can be used to predict the structure of the complex. Information on the effect of mutations at specific sites on the affinity of the complex helps in understanding the complex structure and in guiding docking methods.

A great number of software tools for protein structure prediction exist. Approaches include homology modeling, protein threading, ab initio methods, secondary structure prediction, and transmembrane helix and signal peptide prediction. Some recent successful methods based on the CASP experiments include I-TASSER, HHpred and AlphaFold.

Evaluation of automatic structure prediction servers

CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides an opportunity to assess the quality of available human, non-automated methodology (human category) and of automatic servers for protein structure prediction (server category, introduced in CASP7). [55]

The CAMEO3D (Continuous Automated Model EvaluatiOn) server evaluates automated protein structure prediction servers on a weekly basis, using blind predictions for newly released protein structures. CAMEO publishes the results on its website.
