THE STRUCTURE OF PROTEINS
This page explains how amino acids combine to make proteins and what is meant by the primary, secondary and tertiary structures of proteins. Quaternary structure isn't covered. It only applies to proteins consisting of more than one polypeptide chain. There is a mention of quaternary structure on the IB chemistry syllabus, but on no other UK-based syllabus at this level.
Note: Quaternary structure can be very complicated, and I don't know exactly what depth the IB syllabus wants for this (which is why I haven't included it). I suspect what is wanted is fairly trivial. IB students should ask the advice of their teacher or lecturer.
The primary structure of proteins
Drawing the amino acids
In chemistry, if you were to draw the structure of a general 2-amino acid, you would probably draw it like this:
However, for drawing the structures of proteins, we usually twist it so that the "R" group sticks out at the side. It is much easier to see what is happening if you do that.
That means that the two simplest amino acids, glycine and alanine, would be shown as:
Peptides and polypeptides
Glycine and alanine can combine together with the elimination of a molecule of water to produce a dipeptide. It is possible for this to happen in one of two different ways - so you might get two different dipeptides.
In each case, the linkage shown in blue in the structure of the dipeptide is known as a peptide link. In chemistry, this would also be known as an amide link, but since we are now in the realms of biochemistry and biology, we'll use their terms.
If you joined three amino acids together, you would get a tripeptide. If you joined lots and lots together (as in a protein chain), you get a polypeptide.
A protein chain will have somewhere in the range of 50 to 2000 amino acid residues. You have to use this term because strictly speaking a peptide chain isn't made up of amino acids. When the amino acids combine together, a water molecule is lost. The peptide chain is made up from what is left after the water is lost - in other words, it is made up of amino acid residues.
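The loss of water can be made concrete with a little arithmetic. The sketch below is only an illustration: it lists the possible orderings of a set of amino acids and estimates the molar mass of the resulting peptide by subtracting one water molecule for each peptide link formed. The function names are invented for this example, and the molar masses are rounded values (glycine 75.07, alanine 89.09, water 18.015 g/mol).

```python
from itertools import permutations

# Approximate molar masses in g/mol (rounded values, for illustration only).
AMINO_ACID_MASS = {"Gly": 75.07, "Ala": 89.09}
WATER_MASS = 18.015


def peptide_mass(residues):
    """Estimate the molar mass of a linear peptide.

    Joining n amino acids forms n - 1 peptide links, and each link
    eliminates one molecule of water.
    """
    total = sum(AMINO_ACID_MASS[name] for name in residues)
    return total - (len(residues) - 1) * WATER_MASS


def possible_peptides(amino_acids):
    """List every distinct ordering, written N-terminal first."""
    return sorted({"-".join(order) for order in permutations(amino_acids)})


if __name__ == "__main__":
    for name in possible_peptides(["Gly", "Ala"]):
        print(name, round(peptide_mass(name.split("-")), 2))
```

The two dipeptides contain the same residues and so have the same mass, but they are different compounds because the order along the chain differs.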
By convention, when you are drawing peptide chains, the -NH2 group which hasn't been converted into a peptide link is written at the left-hand end. The unchanged -COOH group is written at the right-hand end.
The end of the peptide chain with the -NH2 group is known as the N-terminal, and the end with the -COOH group is the C-terminal.
A protein chain (with the N-terminal on the left) will therefore look like this:
The "R" groups come from the 20 amino acids which occur in proteins. The peptide chain is known as the backbone, and the "R" groups are known as side chains.
Note: In the case where the "R" group comes from the amino acid proline, the pattern is broken. In this case, the hydrogen on the nitrogen nearest the "R" group is missing, and the "R" group loops around and is attached to that nitrogen as well as to the carbon atom in the chain.
I mention this for the sake of completeness - not because you would be expected to know about it in chemistry at this introductory level.
The primary structure of proteins
Now there's a problem! The term "primary structure" is used in two different ways.
At its simplest, the term is used to describe the order of the amino acids joined together to make the protein. In other words, if you replaced the "R" groups in the last diagram by real groups you would have the primary structure of a particular protein.
This primary structure is usually shown using abbreviations for the amino acid residues. These abbreviations commonly consist of three letters or one letter.
Using three letter abbreviations, a bit of a protein chain might be represented by, for example:
If you look carefully, you will spot the abbreviations for glycine (Gly) and alanine (Ala) amongst the others.
If you followed the protein chain all the way to its left-hand end, you would find an amino acid residue with an unattached -NH2 group. The N-terminal is always written on the left of a diagram for a protein's primary structure - whether you draw it in full or use these abbreviations.
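As a small illustration of the two abbreviation systems, the sketch below converts a chain written with three-letter codes into the equivalent one-letter string. The codes are the standard ones for the 20 amino acids found in proteins; the function name and the example chain are made up for this illustration.

```python
# Standard three-letter and one-letter codes for the 20 protein amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(chain):
    """Convert e.g. 'Gly-Ala-Ser' to 'GAS'.

    The chain is read left to right, so the N-terminal residue stays
    on the left in the output as well.
    """
    return "".join(THREE_TO_ONE[residue] for residue in chain.split("-"))


if __name__ == "__main__":
    # A hypothetical fragment, not taken from any particular protein.
    print(to_one_letter("Gly-Ala-Ser-Cys-Lys"))
```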
The wider definition of primary structure includes all the features of a protein which are a result of covalent bonds. Obviously, all the peptide links are made of covalent bonds, so that isn't a problem.
But there is an additional feature in proteins which is also covalently bound. It involves the amino acid cysteine.
If two cysteine side chains end up next to each other because of folding in the peptide chain, they can react to form a sulphur bridge. This is another covalent link and so some people count it as a part of the primary structure of the protein.
Because of the way sulphur bridges affect the way the protein folds, other people count this as a part of the tertiary structure (see below). This is obviously a potential source of confusion!
Important: You need to know where your particular examiners are going to include sulphur bridges - as a part of the primary structure or as a part of the tertiary structure. You need to check your current syllabus and past papers. If you are studying a UK-based syllabus and haven't got these, follow this link to find out how to get hold of them.
The secondary structure of proteins
Within the long protein chains there are regions in which the chains are organised into regular structures known as alpha-helices (alpha-helixes) and beta-pleated sheets. These are the secondary structures in proteins.
These secondary structures are held together by hydrogen bonds. These form as shown in the diagram between one of the lone pairs on an oxygen atom and the hydrogen attached to a nitrogen atom:
Although the hydrogen bonds are always between C=O and H-N groups, the exact pattern of them is different in an alpha-helix and a beta-pleated sheet. When you get to them below, take some time to make sure you see how the two different arrangements work.
Important: If you aren't happy about hydrogen bonding and are unsure about what this diagram means, follow this link before you go on. What follows is difficult enough to visualise anyway without having to worry about what hydrogen bonds are as well!
You must also find out exactly how much detail you need to know about this next bit. It may well be that all you need is to have heard of an alpha-helix and know that it is held together by hydrogen bonds between the C=O and N-H groups. Once again, you need to check your syllabus and past papers - particularly mark schemes for the past papers.
In an alpha-helix, the protein chain is coiled like a loosely-coiled spring. The "alpha" means that if you look down the length of the spring, the coiling is happening in a clockwise direction as it goes away from you.
Note: If your visual imagination is as hopeless as mine, the only way to really understand this is to get a bit of wire and coil it into a spring shape. A bit of computer lead would do.
In truth, if you are a chemistry student, you are very unlikely to need to know this. If protein secondary structure is on your syllabus, your examiners are most likely only to want you to know how the structures are held together by hydrogen bonding. Check past papers to be sure.
If you are reading this as a biochemistry or biology student, and have been given some other way of recognising an alpha-helix, stick to whatever method you have been given.
The next diagram shows how the alpha-helix is held together by hydrogen bonds. This is a very simplified diagram, missing out lots of atoms. We'll talk it through in some detail after you have had a look at it.
What's wrong with the diagram? Two things:
First of all, only the atoms on the parts of the coils facing you are shown. If you try to show all the atoms, the whole thing gets so complicated that it is virtually impossible to understand what is going on.
Secondly, I have made no attempt whatsoever to get the bond angles right. I have deliberately drawn all of the bonds in the backbone of the chain as if they lie along the spiral. In truth they stick out all over the place. Again, if you draw it properly it is virtually impossible to see the spiral.
So, what do you need to notice?
Notice that all the "R" groups are sticking out sideways from the main helix.
Notice the regular arrangement of the hydrogen bonds. All the N-H groups are pointing upwards, and all the C=O groups pointing downwards. Each of them is involved in a hydrogen bond.
And finally, although you can't see it from this incomplete diagram, each complete turn of the spiral has 3.6 (approximately) amino acid residues in it.
If you had a whole number of amino acid residues per turn, each group would have an identical group underneath it on the turn below. Hydrogen bonding can't happen under those circumstances.
Each turn has 3 complete amino acid residues and two atoms from the next one. That means that each turn is offset from the ones above and below, such that the N-H and C=O groups are brought into line with each other.
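The offset between turns is easy to check numerically. The sketch below is a geometric idealisation only: it looks at nothing but the angle of each residue around the axis of the helix, assuming evenly spaced residues. With 3.6 residues per turn each residue is 100 degrees further round than the last. The function name is invented for this example, and the residues-per-turn figure is passed as ten times its value so that the arithmetic stays in whole numbers.

```python
def angle_around_axis(index, residues_per_turn_x10=36):
    """Angle in degrees of residue number `index` around the helix axis.

    `residues_per_turn_x10` is ten times the number of residues per turn
    (36 means 3.6), so that only integer arithmetic is needed.
    """
    return (index * 3600 // residues_per_turn_x10) % 360


if __name__ == "__main__":
    # 3.6 residues per turn: successive turns are offset from each other.
    print([angle_around_axis(i) for i in range(9)])
    # A whole number (4) per turn: every turn repeats the same four angles.
    print([angle_around_axis(i, 40) for i in range(9)])
```

With 3.6 residues per turn the pattern of angles only repeats after 18 residues (5 complete turns), whereas with 4 residues per turn it repeats on every turn.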
In a beta-pleated sheet, the chains are folded so that they lie alongside each other. The next diagram shows what is known as an "anti-parallel" sheet. All that means is that next-door chains are heading in opposite directions. Given the way this particular folding happens, that would seem to be inevitable.
It isn't, in fact, inevitable! It is possible to have some much more complicated folding so that next-door chains are actually heading in the same direction. We are getting well beyond the demands of UK A level chemistry (and its equivalents) now.
The folded chains are again held together by hydrogen bonds involving exactly the same groups as in the alpha-helix.
Note: Note that there is no reason why these sheets have to be made from four bits of folded chain alongside each other as shown in this diagram. That was an arbitrary choice which produced a diagram which fitted nicely on the screen!
The tertiary structure of proteins
What is tertiary structure?
The tertiary structure of a protein is a description of the way the whole chain (including the secondary structures) folds itself into its final 3-dimensional shape. This is often simplified into models like the following one for the enzyme dihydrofolate reductase. Enzymes are, of course, based on proteins.
Note: This diagram was obtained from the RCSB Protein Data Bank. If you want to find more information about dihydrofolate reductase, their reference number for it is 7DFR.
There is nothing particularly special about this enzyme in terms of structure. I chose it because it contained only a single protein chain and had examples of both types of secondary structure in it.
The model shows the alpha-helices in the secondary structure as coils of "ribbon". The beta-pleated sheets are shown as flat bits of ribbon ending in an arrow head. The bits of the protein chain which are just random coils and loops are shown as bits of "string".
The colour coding in the model helps you to track your way around the structure - going through the spectrum from dark blue to end up at red.
You will also notice that this particular model has two other molecules locked into it (shown as ordinary molecular models). These are the two molecules whose reaction this enzyme catalyses.
What holds a protein into its tertiary structure?
The tertiary structure of a protein is held together by interactions between the side chains - the "R" groups. There are several ways this can happen.
Some amino acids (such as aspartic acid and glutamic acid) contain an extra -COOH group. Some amino acids (such as lysine) contain an extra -NH2 group.
You can get a transfer of a hydrogen ion from the -COOH to the -NH2 group to form zwitterions just as in simple amino acids.
You could obviously get an ionic bond between the negative and the positive group if the chains folded in such a way that they were close to each other.
Notice that we are now talking about hydrogen bonds between side groups - not between groups actually in the backbone of the chain.
Lots of amino acids contain groups in the side chains which have a hydrogen atom attached to either an oxygen or a nitrogen atom. This is a classic situation where hydrogen bonding can occur.
For example, the amino acid serine contains an -OH group in the side chain. You could have a hydrogen bond set up between two serine residues in different parts of a folded chain.
You could easily imagine similar hydrogen bonding involving -OH groups, or -COOH groups, or -CONH2 groups, or -NH2 groups in various combinations - although you would have to be careful to remember that a -COOH group and an -NH2 group would form a zwitterion and produce stronger ionic bonding instead of hydrogen bonds.
van der Waals dispersion forces
Several amino acids have quite large hydrocarbon groups in their side chains. A few examples are shown below. Temporary fluctuating dipoles in one of these groups could induce opposite dipoles in another group on a nearby folded chain.
The dispersion forces set up would be enough to hold the folded structure together.
Important: If you aren't happy about van der Waals dispersion forces you should follow this link.
To understand how computer algorithms can be used to predict the secondary and tertiary structures of proteins
The student should be able to:
Protein structure may be considered at a variety of levels (for further information see Web-based tutorial ):
1° (primary) structure is the actual amino acid sequence of the protein
2° (secondary) structure refers to the localized organization of parts of the polypeptide chain (e.g. α helix, β sheet, turn etc.)
3° (tertiary) structure describes the three-dimensional organization of all the atoms in the polypeptide
4° (quaternary) structure refers to the organization of a protein composed of more than one polypeptide chain
This module deals with the prediction of the secondary and tertiary structure of proteins. The most direct route to the study of protein structure is the use of techniques such as X-ray crystallography and NMR to determine the atomic co-ordinates of a protein. However, whilst there are over 100,000 entries in the primary protein sequence databases, there are only just over 12,000 entries in the protein structure databases. In consequence, a variety of methods are in development to predict secondary and tertiary structure from the 1° sequence information, and this is the topic covered by this module. In truth this is an enormous subject worthy of a course all to itself, so only a somewhat superficial view can be presented here. More detailed tutorials and guides, such as "Sisyphus and protein structure prediction", "Pedestrian guide to analysing sequence databases" and "A Guide to protein structure Prediction", are available on the Web.
Secondary structure prediction
The most successful area of protein structure prediction deals with secondary structure and related topics including the interaction of proteins with membranes.
Signal peptides (or signal sequences) are short N-terminal amino acid sequences that target the protein for membrane translocation and are removed after translocation. SignalP predicts signal peptide cleavage sites in Gram-positive, Gram-negative and eukaryotic amino acid sequences. http://www.cbs.dtu.dk/services/SignalP/caution.html
TargetP predicts the subcellular location of eukaryotic protein sequences. The subcellular location assignment is based on the predicted presence of any of the N-terminal presequences chloroplast transit peptide, mitochondrial targeting peptide, or secretory pathway signal peptide
Trans-membrane α helices
Many proteins in the cell are integral membrane proteins that have one or more segments embedded in the membrane. In transmembrane proteins one or more segments of the protein completely traverse the phospholipid bilayer, and these membrane-spanning domains are always α helices or multiple β strands. Arguably, the most successful area in secondary structure prediction is the prediction of trans-membrane α helices. There are a variety of computational approaches which offer 90% accuracy or more in such predictions. We will focus on one of the approaches, known as TMHMM, although there are others such as TopPred2, MEMSAT, DAS and PHDhtm, which you might have a look at.
The large majority of trans-membrane α helices consist of an unusually long stretch of hydrophobic amino acid residues, and it is this feature that many programs employ to identify such potential α helices. The helix also has a topology, i.e. whether it runs inwards or outwards. Positively charged residues, arginine and lysine, play a central role in determining the orientation since they are primarily found in non-transmembrane parts of the polypeptide on the cytoplasmic side. TMHMM employs a hidden Markov model which maps closely onto these features to make highly accurate predictions of trans-membrane α helices.
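The idea of looking for a long hydrophobic stretch can be sketched with a simple sliding-window average. This is not how TMHMM works (TMHMM uses a hidden Markov model); it is the older and much cruder hydropathy-plot approach, shown here only to illustrate the feature being exploited. The scores are the Kyte-Doolittle hydropathy values. The window of 19 residues and the threshold of 1.6 are conventional choices for that scale and should be treated as assumptions, as should the function names and the test sequence.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(sequence, window=19):
    """Mean hydropathy of every window of `window` residues."""
    scores = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        scores.append(sum(KYTE_DOOLITTLE[aa] for aa in segment) / window)
    return scores


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows hydrophobic enough to be
    candidate membrane-spanning segments."""
    scores = window_hydropathy(sequence, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


if __name__ == "__main__":
    # Hypothetical sequence: charged ends around a 20-residue leucine run.
    toy = "KKRDE" + "L" * 20 + "DEKRK"
    print(candidate_tm_windows(toy))
```

A window-average method of this kind says nothing about which way round the helix sits in the membrane, which is one reason model-based methods do better.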
Have a look at the output from a typical TMHMM analysis of the lactose permease (LacY) from E. coli. Notice it has 12 predicted trans-membrane α helices with their polarity clearly indicated.
α helices and β sheets etc.
One of the first predictive algorithms for secondary structure, GOR (Garnier, Osguthorpe & Robson, 1978), was developed through a co-operation between a laboratory interested in developing the theory for protein secondary structure prediction methods and a laboratory interested in applying and comparing such methods. The GOR algorithm unambiguously assigns each residue to one conformational state of α-helix, extended chain, reverse turn or coil. In its initial form GOR was roughly 50% accurate on a test sample of 26 proteins. GOR has now been through a series of developments, and version IV of GOR has a mean accuracy of 64.4% for a three-state prediction. The program gives two outputs: the first, eye-friendly one gives the sequence and the predicted secondary structure in rows (H=helix, E=extended or beta strand and C=coil); the second gives the probability values for each secondary structure at each amino acid position. The predicted secondary structure is the one of highest probability compatible with a predicted helix segment of at least four residues and a predicted extended segment of at least two residues.
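The final step, turning per-residue probabilities into a string of H, E and C, can be sketched as follows. This is a simplified illustration and not the GOR program's own procedure: it picks the most probable state at each position and then relabels as coil any helix run shorter than four residues or any extended run shorter than two. Falling back to coil is an assumption made for brevity; the real program chooses the most probable state that is compatible with the length constraints. The function name and input format are invented for this example.

```python
from itertools import groupby

# Minimum run length allowed for each state (from the rule quoted above).
MIN_RUN = {"H": 4, "E": 2, "C": 1}


def assign_states(probabilities):
    """Turn a list of {'H': p, 'E': p, 'C': p} dicts into a state string.

    Step 1: take the most probable state at each residue.
    Step 2: relabel runs that are too short as coil (a simplification).
    """
    raw = [max(p, key=p.get) for p in probabilities]
    smoothed = []
    for state, run in groupby(raw):
        length = len(list(run))
        kept = state if length >= MIN_RUN[state] else "C"
        smoothed.append(kept * length)
    return "".join(smoothed)


if __name__ == "__main__":
    def favour(state):
        # Hypothetical probabilities that strongly favour one state.
        probs = {"H": 0.1, "E": 0.1, "C": 0.1}
        probs[state] = 0.8
        return probs

    print(assign_states([favour(s) for s in "HHEEHHHHE"]))
```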
There are a number of other secondary structure prediction approaches including PSIPRED, PHD, NNPREDICT, PROF and ZPRED. Most of these servers expect the input to be an alignment of multiple sequences, which enhances the accuracy of the predictions.
Jpred was developed as a result of a study to test and compare different secondary structure prediction methods. Jpred takes a single input sequence and scans it against a non-redundant sequence database. The hits are aligned with CLUSTALW (v1.7) and the alignment is submitted to MULPRED, which uses a combination of single sequence methods that are combined to give a prediction profile, from which a consensus is taken. The methods used within MULPRED are Lim, GOR, Chou-Fasman, Rose and Wilmot/Thornton turn prediction methods. The accuracy of Jpred is approximately 73%.
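The idea of taking a consensus from several methods can be illustrated with a simple majority vote at each position. This is a sketch of the general idea only; it is not the procedure used by Jpred or MULPRED, whose details are not given here. The function name is invented, and resolving ties by falling back to coil is an assumption.

```python
from collections import Counter


def consensus(predictions, tie_state="C"):
    """Majority vote, position by position, over equal-length
    prediction strings made of H, E and C."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        # If the top two states have the same count, there is no majority.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie_state)
        else:
            result.append(ranked[0][0])
    return "".join(result)


if __name__ == "__main__":
    # Three hypothetical predictions for the same five residues.
    print(consensus(["HHHEC", "HHEEC", "HCEEC"]))
```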
Secondary structure elements are observed to combine in specific geometric arrangements known as motifs or super-secondary structures (see Web-based tutorial ) e.g. coiled coils, helix-turn-helix etc.
Coiled coils are another structural feature of proteins which sometimes separate domains. Coiled coils comprise two, three or four amphipathic α helices wrapped round one another. Coiled coil motifs are particularly amenable to computer-based prediction because of the characteristic repeating pattern of hydrophobic residues spaced every four and then three residues apart. This pattern forms a heptad repeat (abcdefg)n of amino acids in which positions a and d tend to be hydrophobic and positions e and g are predominantly charged residues. Predictions of coiled coils can be obtained at PAIRCOIL and MULTICOIL. The leucine zipper structure is adopted by one family of the coiled coil proteins. Leucine zippers have a characteristic leucine repeat: Leu-X6-Leu-X6-Leu-X6-Leu (where X may be any residue), and TRESPASSER will detect such motifs with a high degree of accuracy.
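The leucine repeat Leu-X6-Leu-X6-Leu-X6-Leu translates directly into a regular expression over one-letter codes. The sketch below is a bare pattern match and nothing more; it is not a description of how TRESPASSER works, and a match on its own does not show that a sequence really forms a leucine zipper. The function name and test sequence are invented for this example.

```python
import re

# L, then any six residues, repeated three times, then a final L.
# The lookahead lets overlapping matches be reported as well.
LEUCINE_REPEAT = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_repeats(sequence):
    """Return the 0-based start position of every Leu-X6 repeat match."""
    return [match.start() for match in LEUCINE_REPEAT.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence with leucines at positions 2, 9, 16 and 23.
    toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
    print(find_leucine_repeats(toy))
```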
The helix-turn-helix motif occurs in many DNA binding proteins and can be predicted using HTH.
Integrated structure prediction
There is a variety of servers which offer a secondary structure prediction integrated with a variety of other analyses.
secondary structure,
solvent accessibility,
globular regions,
transmembrane helices,
coiled-coil regions,
a multiple sequence alignment (i.e. database search),
ProSite sequence motifs,
low-complexity regions (SEG),
ProDom domain assignments.
Tertiary structure prediction
This component of the module, more than any other, can only skim the surface of a complex and extensive topic. An excellent and more detailed introduction to the topic is provided in "A Guide to Structure Prediction (version 2)".
The ultimate objective in protein structure prediction is to use ab initio methods to accurately predict the tertiary structure of a protein from its primary structure using purely physico-chemical information. However, such approaches are prevented at present by a lack of some of the basic information required, combined with the enormous computational complexity of the task.
Tertiary structure describes the folding of the polypeptide chain to assemble the different secondary structure elements into a particular arrangement. Just as helices, sheets etc. are the units of secondary structure, so the folds/domains are the units of tertiary structure. In multidomain proteins, tertiary structure includes the arrangement of domains relative to each other as well as the arrangement of residues within the domain. The terms "domain" and "fold" to a large extent mean the same thing, though definitions may vary. Domains are regions of contiguous polypeptide chain that have been described as compact, local, and semi-independent units. A fold is defined as a component of tertiary structure in which the proteins have the same major secondary structures in the same arrangement with the same topological connections. There are glossaries of the different protein folds / domains.
The overall strategy for secondary structure prediction is summarized by the following flowchart
An excellent, more detailed, interactive flowchart has been produced by Robert Russell.
The first step in any attempt to predict the tertiary structure of a protein is to search the sequence databases for proteins that show sequence similarity. If the result of the search includes a protein of known structure then the route of choice is homology modelling. If there is no homologue in the structural databases then things become rather more difficult, but not impossible. Even with no homologues of known structure it may be possible to use fold recognition methods. There is a so-called "twilight area" of 20-30% sequence identity, where it is difficult to assess whether
One of the most important recent advances in sequence comparison has been the development of both gapped BLAST and PSI-BLAST (position-specific iterated BLAST). Both of these have made BLAST much more sensitive, and the latter is able to detect very remote homologues by taking the results of one search, constructing a profile and then using this to search the database again to find other homologues.
The most successful tool for prediction of 3D structure is homology modelling. An approximate 3D model can be built for a protein, if it has "significant similarity" to a protein of known structure. So what is "significant similarity"? The answer is about 30% identity. At this level of identity it is possible to construct a model which has a correct fold structure, but may have inaccurate loops. Above levels of 90% sequence identity, homology modelling is about as accurate as the experimental determination of a protein structure.
Part of the problem of homology modelling at lower levels of similarity is aligning the sequences correctly. Sequence alignments are more or less straightforward at levels above 30% pairwise sequence identity. The region between 20 and 30% sequence identity is frequently referred to as the twilight zone.
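The percentages quoted above refer to pairwise sequence identity, which can be worked out from an alignment as in the sketch below. Definitions of percent identity vary; this one counts identical residues over the aligned columns that contain no gap, which is an assumption. The thresholds of 30% and 20% are the approximate figures given in the text, and the function names and example alignment are invented for this illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def modelling_outlook(identity):
    """Rough interpretation using the approximate thresholds in the text."""
    if identity >= 30:
        return "homology modelling feasible"
    if identity >= 20:
        return "twilight zone"
    return "below the twilight zone"


if __name__ == "__main__":
    # Hypothetical alignment: 4 identical residues in 5 ungapped columns.
    identity = percent_identity("ACDE-G", "ACDFHG")
    print(identity, modelling_outlook(identity))
```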
It has long been recognised that proteins often adopt similar folds despite lack of significant sequence or functional similarity. Fortunately, certain folds crop up time and time again in proteins, and so fold recognition methods for predicting protein structure can be very effective. Methods of fold recognition attempt to detect similarities between the 3D structure of proteins that do not exhibit significant sequence similarity. There are numerous different approaches to fold recognition, though "threading" is a common feature of several of them. Some fold recognition programs can be accessed through the Web, e.g. TOPITS and 3D-PSSM. If you have predicted that the protein under study contains a particular fold then it is important to establish which other proteins contain a similar fold by looking at databases such as SCOP (Structural Classification of Proteins) or CATH (Protein Structure Classification).
Threading takes the query sequence of unknown structure and threads it through the atomic co-ordinates of a protein whose structure is known. The query sequence is moved residue by residue through the template sequence and calculations are carried out to determine the degree of "fitness" of the alignment by a variety of methods which could include thermodynamic criteria, solvent accessibility, secondary structure information etc. Such approaches are quite computationally intensive, but there are freely accessible Web-based sites which will carry out a threading analysis, e.g. bioinbgu.
Building the model
Sophisticated and usually expensive software is commercially available for carrying out tertiary structure predictions, but there is a freely accessible Web-based modelling server. SWISS-MODEL is an Automated Protein Modelling Server running at the GlaxoWellcome Experimental Research in Geneva, Switzerland. When a sequence is submitted to SWISS-MODEL the sequence of events is as follows:
1. BLASTP2 finds all similarities of target sequence with sequences of known structure.
2. Templates with sequence identities above 25% and projected model size larger than 20 residues are selected. This step also detects domains which can be modelled based on unrelated templates.
3. ProModII then generates the models in which the key process is the production of a framework which represents the topology of corresponding atoms in the query sequence and the template(s).
4. Energy minimisation analysis is done for all models.
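The template selection in step 2 amounts to a simple filter on the search hits, which can be sketched as follows. This illustrates the selection rule as stated above and is not code from SWISS-MODEL; the `TemplateHit` record, its field names and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TemplateHit:
    """One hit against a sequence of known structure (hypothetical record)."""
    template_id: str
    identity_percent: float
    model_size: int  # projected number of residues that could be modelled


def select_templates(hits, min_identity=25.0, min_size=20):
    """Keep hits with identity above 25% and projected model size
    larger than 20 residues, as described in step 2."""
    return [hit for hit in hits
            if hit.identity_percent > min_identity
            and hit.model_size > min_size]


if __name__ == "__main__":
    hits = [
        TemplateHit("template_A", 42.0, 150),  # passes both tests
        TemplateHit("template_B", 22.0, 200),  # identity too low
        TemplateHit("template_C", 60.0, 15),   # too short
    ]
    print([hit.template_id for hit in select_templates(hits)])
```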
CPHmodels is another Web based homology modelling server.
1. Use TMHMM to predict whether the human integrin beta subunit is likely to be an integral membrane protein and, if so, how many trans-membrane domains it has.
2. What advantages might TMHMM have over TopPred (see the original TMHMM paper)?
3. Use GORIV to do a secondary structure prediction on the alpha chain of human hemoglobin. Compare the predictions to those of NNSSP.
4. Determine whether the human transcription factor AP-1 (proto-oncogene C-JUN) has a coiled coil motif
5. Does the E. coli Lac repressor contain any recognizable folds?
Erik L.L. Sonnhammer, Gunnar von Heijne, and Anders Krogh: A hidden Markov model for predicting transmembrane helices in protein sequences. In Proc. of Sixth Int. Conf. on Intelligent Systems for Molecular Biology, p 175-182. Ed. J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen. Menlo Park, CA: AAAI Press, 1998.
Garnier J, Osguthorpe DJ, Robson B (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120(1):97-120
Accuracy of structure prediction methods
Tertiary structure prediction tools and structure databases
Comprehensive lists of structure prediction sites can be found at: