1. CHALLENGES TO STRUCTURAL BIOLOGY IN THE GENOME ERA
In the genome era, the challenge to structural biologists is defined
as follows: To determine the three-dimensional structures of a
representative set of proteins such that all further studies of
protein function, e.g. in a medical-pharmacological context, may
be carried out on a firm structural basis. This challenge cannot
be met in the conventional way whereby a protein crystallographer
or an NMR spectroscopist applies her or his sophisticated methods
to the study of that single protein structure that seems the most
interesting at the time. To be sure, this approach has been tremendously
successful over the last decade, filling the Protein Data Bank
at an ever increasing speed with structures of ever increasing
beauty, complexity and biological relevance1. However, in the
light of the above challenge, an all-out approach to structure
determination is needed in much the same way as it was and is
very successfully applied to genome research.
This approach has become known as "structural genomics".
1.1. Structural genomics
The term "structural genomics" has been in use for quite
some time, but has acquired a completely new meaning very recently.
Traditionally, it represented an effort to characterize the (physical)
structure of a complete genome by gene mapping and sequencing 2.
Now, it stands for initiatives inspired by the genome
sequencing projects that aim at the determination of three-dimensional
protein structures in a systematic way3-6. The approaches taken
towards this goal fall into two broad categories:
(1) In the first, the emphasis is on determining the structures
of a set of proteins or protein domains which would yield a complete
representation of all protein (domain) folds present in the biosphere.
This approach is based on the notion that the number of folding
types (folds) for globular protein domains is not unlimited 7-9.
Very probably, it does not exceed the number of structure entries
now present in the Protein Data Bank. One may therefore hope to
cover the complete universe of three-dimensional protein structures
within a few years, provided that it is possible to identify new
folds from protein sequence. Computer-based methods for fold recognition
are currently being developed in a number of laboratories 10-12.
In a small bacterial genome, fold assignment with high confidence
is possible only for a small subset of coding sequences13.
However, advances in biocomputing methodology are likely to improve
the success rate in the near future14. A convenient
route towards fast structure determination targets proteins from
hyperthermophilic bacteria or archaea, because they can be easily
purified from recombinant Escherichia coli cells and lend themselves
especially to crystallization or NMR structure determination.
A number of crystal structures of these proteins have already been
determined 15-17. It is hoped that knowledge of a representative
set of protein domain structures will enable complete
fold prediction for newly sequenced genomes by homology modelling.
The availability of the predicted tertiary folds for most proteins
in a genome would in itself be of enormous value for many fields
of biological research. In addition, it may considerably facilitate
the detailed structure determination by protein crystallography
and NMR spectroscopy of those proteins for which this is deemed
necessary.
(2) A second approach to structural genomics focusses on structure
analysis methodology. Here, the main idea is to closely cooperate
with and learn from the genome sequencing projects. The use of
the wide variety of available coding sequences and efforts towards
parallelisation and automation of structure analysis are unifying
features of this approach. As before, bioinformatics will play
an important role in this brand of structural genomics for the
identification of relevant proteins or protein domains that are
amenable to structure analysis. The RIKEN NMR structure determination
project 18 exemplifies the technology-oriented structural
genomics efforts by attempting to establish a facility for the
broad-scale analysis of three-dimensional protein structures in
solution. The Berlin "Protein Structure Factory" initiative
belongs to the same category of structural genomics. However,
by employing both X-ray diffraction and NMR methods it does not
rely on one structure analysis technique exclusively. A main ingredient
of the Protein Structure Factory is the close collaboration with
the German Human Genome Project (DHGP).
Common to all structural genomics initiatives are efforts to identify
and eliminate bottlenecks in the structure determination process.
For example, it is generally agreed that the availability of bright
synchrotron beamlines is a prerequisite for the successful use
of diffraction methods 19. Membrane proteins, constituting
up to 30% of the protein inventory of an organism and against
which more than 50% of the currently used and tested drugs are
targeted, represent the most persistent bottleneck for all analytical
methods, because they are water-soluble only in the presence of
detergents and are difficult to overproduce in the quantities
required for biophysical studies.
2. THE "PROTEIN STRUCTURE FACTORY": AN INTEGRATIVE APPROACH
The term "Protein Structure Factory" was chosen to represent
a common initiative of the DHGP and structural biologists from
the Berlin area aimed at the broad-scale analysis of proteins.
The Protein Structure Factory will be established to characterize
proteins encoded by the genes or cDNAs available at the Berlin
Resource Center of DHGP. At a later stage, it may analyze various
sets of input proteins selected by criteria of potential structural
novelty or medical or biotechnological usefulness. It represents
an integrative approach to structure analysis combining the computer-based
analysis of genes by bioinformatics techniques, automated gene
expression and purification of gene products, generation of a
biophysical fingerprint of the proteins and the determination
of their three-dimensional structures either in solution by NMR
spectroscopy or in the crystalline state by X-ray diffraction.
Here we briefly describe the main features of the planned Protein
Structure Factory.
2.1. Bioinformatics
Bioinformatics has two main tasks in the Protein Structure Factory:
To predict what can be done and to propose what should be done.
Predicting what can be done is equivalent to identifying proteins
that will permit their three-dimensional structures to be determined
by X-ray crystallography or NMR spectroscopy. These proteins will
have some properties in common. They will be soluble in aqueous
buffers up to a critical concentration, they will have a defined
globular structure, and this structure will be stable for at least
as long as it takes to grow crystals and expose them to X-rays or to
measure the NMR spectra. Proteins that contain long stretches of hydrophobic
or charged amino-acid residues, have extended sequence repeats
or use a limited repertoire of amino acids over long polypeptide
segments often do not display these properties. However, they
may still contain single or multiple domains that permit structure
analysis. In addition, bioinformatics will provide valuable information
aiding the structure determination by predicting sites of post-translational
modification and identifying proteins of known, similar tertiary
structure. Structural prediction will be used to decide whether
a given protein will be studied by NMR spectroscopy or by X-ray
diffraction or, for the latter case, whether its structure analysis
will require experimental phase determination or can be based
on a homologous model.
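The sequence properties listed above lend themselves to a simple pre-screen. The following Python sketch is purely illustrative and is not part of any Protein Structure Factory software: the residue classes, the window width, all thresholds and the function names are assumptions made for this example. It flags long hydrophobic or charged runs, and uses the Shannon entropy of the residue composition in a sliding window as a crude proxy for a limited amino-acid repertoire.

```python
import math
from collections import Counter

# Assumed residue classes (one-letter codes); other groupings are equally defensible.
HYDROPHOBIC = set("AVLIMFWC")
CHARGED = set("DEKR")


def longest_run(seq, alphabet):
    """Return the length of the longest run of consecutive residues in `alphabet`."""
    best = current = 0
    for residue in seq:
        current = current + 1 if residue in alphabet else 0
        best = max(best, current)
    return best


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of `segment`."""
    counts = Counter(segment)
    total = len(segment)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def lowest_window_entropy(seq, width=20):
    """Lowest composition entropy found in any window of `width` residues."""
    if len(seq) <= width:
        return shannon_entropy(seq)
    return min(shannon_entropy(seq[i:i + width])
               for i in range(len(seq) - width + 1))


def flag_sequence(seq, max_hydrophobic_run=18, max_charged_run=10,
                  min_entropy=2.2, width=20):
    """Return reasons why `seq` may be a poor target (all thresholds hypothetical)."""
    seq = seq.upper()
    flags = []
    if longest_run(seq, HYDROPHOBIC) >= max_hydrophobic_run:
        flags.append("long hydrophobic stretch")
    if longest_run(seq, CHARGED) >= max_charged_run:
        flags.append("long charged stretch")
    # Note: sequences much shorter than `width` have low entropy by construction.
    if seq and lowest_window_entropy(seq, width) < min_entropy:
        flags.append("low-complexity segment")
    return flags
```

A flagged sequence need not be discarded: as noted above, it may still contain domains that permit structure analysis, so a filter of this kind would more plausibly be used to propose domain boundaries than to reject whole proteins.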
To propose what should be done is the more challenging task. It
is equivalent to finding proteins with interesting properties
such as novel folds or a function in biochemical pathways that
may be associated with disease. The more interesting a protein
appears, the more effort will have to be invested in its structure
analysis. Computational tools for functional sequence assignments
are currently being developed 21. This work addresses
questions concerning the subcellular localization of proteins,
their membership in families defined by function22-24
and their involvement in pathological states 25,26.
2.2. Automated gene expression
The method of choice to produce recombinant proteins for structural
and biophysical studies is the heterologous expression of their
genes in E. coli. Proteins that cannot be synthesized in E. coli
may alternatively be made in Saccharomyces cerevisiae or Pichia
pastoris. For structure analysis by X-ray diffraction methods,
the methionine residues of many proteins will have to be replaced
by selenomethionine. Likewise, NMR structure determination will
often require that the proteins be labelled with 13C
and/or 15N which can be introduced through cell growth
on media containing these isotopes in the form of 13C
glucose or 15NH4Cl.
Within the Protein Structure Factory, gene expression systems
will be obtained either by the cloning of PCR products or by the
direct construction of cDNA libraries in expression vectors (expression
libraries) 27. Both techniques will rely on the automated
manipulation of clones in multi-well microtiter plates or on high-density
membrane filters. Methods for the detection of protein coding
or novel clones with antibodies directed against protein tags
or by oligonucleotide fingerprinting are available27,28.
2.3. Purification of tagged proteins
The concept of the Protein Structure Factory requires the high-throughput
production of highly pure proteins in about 50 mg quantities for
structure analysis by NMR spectroscopy and X-ray crystallography.
This is accomplished in two production units for the parallel
fermentation and online purification of recombinant organisms
(one for E. coli and one for S. cerevisiae or P.
pastoris). E. coli is the organism of first choice,
since it can be cultivated easily and offers a large number of
readily available expression systems. Genes exhibiting low expression
in E. coli or yielding proteins which are produced as inclusion
bodies are expressed in yeast.
The recombinant organisms will be cultivated synchronously in
a battery of fermenters (Fig. 1). The cells from the different
fermenters are homogenised successively with a high pressure homogeniser.
The solubilised proteins are separated from the biomass by microfiltration
and the processed filtrate is then concentrated by ultrafiltration.
The following purification of the recombinant proteins takes advantage
of two tags of these proteins: a his6-tag and a strep-tag 29,30
whose corresponding DNA sequences are fused to the 3'- and 5'-terminus
of the protein-coding gene. This allows a highly efficient separation
of the recombinant protein from host cell protein. In the first
step, the recombinant proteins are successively bound to a Ni-NTA
column and eluted with imidazole. A second affinity chromatography
on a streptavidin matrix is applied for the final purification.
This semi-automated production and online purification will require
two days for proteins synthesized in E. coli or three days
for proteins from yeast. This production unit is designed to provide
several homogeneous proteins for structure analysis per day.
2.4. Biophysical characterization
The goal of this unit is to characterize the proteins, as they
become available from expression and purification, by conventional
spectroscopic and calorimetric techniques. It will mainly serve
to confirm and to complement the information obtained from biocomputing
for further structure determination. The proteins will be analysed
with respect to their secondary structure and stability as a function
of temperature and pH.
The following techniques will be employed: Fourier-transform infrared
spectroscopy (FTIR), to obtain secondary structure information by
analysing the amide bands; circular dichroism spectroscopy (CD), to
confirm the data obtained by FTIR; fluorescence spectroscopy, to
investigate stability as a function of pH; and differential scanning
calorimetry (DSC), to measure thermal stability.
Automated routines for the data acquisition and evaluation procedures
will be necessary to keep pace with the expected throughput of
proteins. Some of these routines are already available; others
have yet to be developed.
In summary, this unit will furnish biophysical parameters concerning
secondary structure and conformational stability of proteins,
independent of and preliminary to the determination of high-resolution
structures. It will help to establish experimental conditions
for protein crystallization and NMR studies. The biophysical data
may also be useful in those cases where high-resolution protein
structures cannot be obtained.
2.5. NMR spectroscopy
The role of NMR will be in the structure determination of protein
domains and of their functional complexes, and in the investigation
of ligand binding to help in the design of bioactive small molecules.
For this purpose, it is necessary to automate the key steps in
the NMR structure determination procedure. These include data
acquisition, sequence-specific resonance assignments and structure
calculation. Currently, the spectral assignments, especially those
of the NOESY spectrum, take weeks to months to complete.
For all three steps, some concepts and algorithms for automating
the procedures exist, and more need to be developed.
Automated data acquisition is probably the easiest task in this
project. It includes the definition of a data set which is suited
for automated interpretation. Most modern NMR spectrometers already
provide features which allow one to automate the data acquisition
itself. The critical step for being able to determine the structures
of a large number of proteins is in the necessary automation of
the assignment procedure. To date, a number of computer programs
for this purpose are available 31, but all of them still require
manual intervention. Most of these software packages
require peak lists obtained from the multi-dimensional spectra,
which usually contain false peaks generated from noise or artifacts.
The program logic is generally not capable of handling this
problem. In the context of the Protein Structure Factory, it is
necessary to develop new software that works directly
on the spectra and recognizes peaks, noise and
artifacts as such. On the basis of a data set comprising CBCANNH,
CBCA(CO)NNH, HCCH-COSY, HCCH-TOCSY, and amino-acid-sensitive
experiments, it is expected that the program will generate a list
of chemical shifts comprising those of all protons, carbons and
nitrogens present in the protein that can be used to evaluate
the three-dimensional NOESY spectra.
This peak list is then subjected to an automated structure calculation
protocol proposed by M. Nilges 32, which essentially
allows one to assign the NOESY spectra during the structure calculation.
In this manner, approximately three months
of manual work should be saved per structure. It is expected that
the NMR structures of proteins with up to 120 amino acids can
be solved routinely, provided that their solubility is high enough and that
sufficient signal-to-noise can be obtained in the 2- and 3-dimensional
spectra. The Protein Structure Factory also provides means to
exploit the structural information generated. In this context,
NMR spectroscopy will be used to study ligand-protein interactions
in screening campaigns to detect binding in a site-specific manner.
This information will be used to optimize ligands.
2.6. Protein crystallization
At present, the crystallization of proteins is still the bottleneck
in the structure determination by means of X-ray diffraction.
There is no simple correlation between properties of proteins
and the large number of parameters that have to be considered
during crystallization. Consequently, the crystallization of proteins
is mostly an empirical process that requires a broad screening
of different crystallization conditions. In the Protein Structure
Factory, it is planned to have available a large number of purified
proteins or protein domains per year that are considered for crystallization.
Since a manual optimization of crystallization conditions on the
projected scale is not feasible, the development and the utilization
of a crystallization robot is a key issue of the crystal structure
determination within the Protein Structure Factory. The necessary
innovations will rely on two well established groups with ample
experience in protein crystallography and in the construction
of robots.
It is planned to build a crystallization robot that pipettes
protein solutions and a buffer screen consisting of about 100
different conditions (pH, buffer, salt, polyethylene glycol, alcohols)
for "hanging drop" vapor diffusion experiments
33: a drop consisting of protein and buffer is equilibrated
against the buffer at about twice the concentration, so that the
protein solution in the drop is brought to supersaturation and
eventually to crystallization. This is set up in trays with 24
wells, and the trays are automatically stored at two temperatures,
preferably 4°C and 18°C. The robot examines the trays
by light scattering to monitor aggregation of protein and, if
possible, nucleation, and in later stages the trays will be observed
by microscopes with suitable software to automatically recognize
crystalline material.
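The bookkeeping implied by this setup can be sketched in a few lines of Python. The figures of about 100 conditions, 24-well trays and two storage temperatures come from the text; the 4 x 6 well layout, the well labels and the function names are assumptions made for this illustration and do not describe the actual robot software.

```python
import math

ROWS = "ABCD"   # assumed 4 x 6 layout of a 24-well tray
COLUMNS = 6


def tray_plan(n_conditions=100, wells_per_tray=24, temperatures_c=(4, 18)):
    """Count trays and drops needed to screen one protein at each temperature."""
    trays_per_temperature = math.ceil(n_conditions / wells_per_tray)
    return {
        "trays_per_temperature": trays_per_temperature,
        "total_trays": trays_per_temperature * len(temperatures_c),
        "total_drops": n_conditions * len(temperatures_c),
        "unused_wells_per_temperature":
            trays_per_temperature * wells_per_tray - n_conditions,
    }


def well_position(condition_index):
    """Map a zero-based condition index to (tray number, well label)."""
    wells_per_tray = len(ROWS) * COLUMNS
    tray, offset = divmod(condition_index, wells_per_tray)
    row, column = divmod(offset, COLUMNS)
    return tray + 1, f"{ROWS[row]}{column + 1}"
```

With the default values, one protein occupies five trays per temperature, that is, ten trays and 200 drops in total. Multiplied by the number of proteins entering crystallization, this gives a sense of why manual setup and inspection are not feasible on the projected scale.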
2.7. Acquisition of X-ray diffraction data using synchrotron
radiation
The use of synchrotron radiation will be crucial to the Protein
Structure Factory: high brilliance and tuneable wavelengths are
prerequisites for fast data collection, the use of small crystals
and multiwavelength anomalous diffraction (MAD) phasing19.
An example of a diffraction image obtained from a small crystal
at a synchrotron is shown in Fig. 2. With the opening of BESSY
II, direct access to a third-generation XUV storage ring source
with excellent conditions is available nearby. However, to shift
the maximum of the emitted spectrum towards the X-ray range, a
high-field multipole wiggler has to be installed as has been done
at other medium energy storage rings (ALS34, MAX II35,
ELETTRA).
Two beamlines are planned within the Protein Structure Factory:
the central beamline is optimized for rapidly measuring high resolution
MAD data sets. This MAD beamline will be equipped with a focussing
premirror, a double crystal monochromator and a refocussing mirror
to serve in the wavelength range from 0.7 Å to 2.75 Å,
which covers the absorption edges of all commonly used heavy atoms36.
To make use of the expected short exposure times, a state-of-the-art
CCD detector with a fast bus and a high-capacity storage system will
be installed at the MAD station. This will be especially useful
in cases when fine slicing down to 0.1° is employed.
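The quoted wavelength range can be related to absorption-edge energies through E [keV] ≈ 12.398 / λ [Å]. The short Python sketch below performs this conversion and checks a handful of edges against the 0.7 Å to 2.75 Å window; the selection of edges is only an illustrative sample, with energies rounded from standard tabulations, and the function names are hypothetical. It also estimates the number of frames produced by fine slicing, where the total rotation range is an assumption supplied by the caller and is not specified in the text.

```python
import math

HC_KEV_ANGSTROM = 12.39842  # Planck constant times speed of light, in keV * Angstrom

# Illustrative sample of absorption edges (keV, rounded tabulated values).
EDGES_KEV = {
    "Fe K": 7.112,
    "Pt L-III": 11.564,
    "Au L-III": 11.919,
    "Hg L-III": 12.284,
    "Se K": 12.658,
    "Br K": 13.474,
}


def wavelength_to_energy_kev(wavelength_angstrom):
    """Convert an X-ray wavelength in Angstrom to a photon energy in keV."""
    return HC_KEV_ANGSTROM / wavelength_angstrom


def energy_to_wavelength(energy_kev):
    """Convert a photon energy in keV to a wavelength in Angstrom."""
    return HC_KEV_ANGSTROM / energy_kev


def edges_in_range(min_wavelength=0.7, max_wavelength=2.75):
    """Return the sample edges whose wavelength lies inside the beamline range."""
    reachable = {}
    for name, energy in EDGES_KEV.items():
        wavelength = energy_to_wavelength(energy)
        if min_wavelength <= wavelength <= max_wavelength:
            reachable[name] = round(wavelength, 4)
    return reachable


def frames_for_sweep(total_rotation_deg, slice_deg=0.1):
    """Number of images for a rotation sweep (rounded first to absorb float error)."""
    return math.ceil(round(total_rotation_deg / slice_deg, 6))
```

For example, the selenium K edge used for selenomethionine-labelled proteins lies near 0.98 Å, well inside the range, and a hypothetical 180° sweep at 0.1° per image yields 1800 frames per wavelength, which illustrates the demands placed on detector readout and storage.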
The other beamline is designed as a constant-energy station with
a selectable wavelength around 0.9 Å and will be used for
the fast checking of crystal quality and further preliminary examinations.
It will accept radiation from the side portion of the wiggler
fan and will be equipped with a premirror and a bent crystal monochromator
to select the appropriate wavelength and to focus and deflect
the X-ray beam. Both stations will be equipped with gaseous nitrogen
cooling and both need highly automated beamline control, efficient
software protocols and organization schemes to act as a high-throughput
system.
The high-throughput determination of three-dimensional protein structures
based on the X-ray diffraction data collected at the synchrotron
beamlines (see above) will have to employ robust and efficient
methods at four essential steps: Phasing, model building, refinement
and quality control. In some cases it will be possible to use
homologous protein or domain structures for molecular replacement
phasing. As the Protein Data Bank grows and the techniques for
detecting homology at the level of three-dimensional structure
improve, the frequency with which such search models are available
will increase substantially. Crystal structures can be solved
easily if the structural similarity of a search model is high
enough.
The analysis of protein structures with unpredicted fold requires
experimental phase determination. Once dreaded because of the
tedious trial-and-error searching for isomorphous derivatives,
phasing has become a routine process with the advent of MAD methods
37. All proteins produced in recombinant E. coli
can be labelled with heavy-atom markers in the form of selenomethionine
and thus subjected to MAD phasing. The power of MAD phasing may
be appreciated from Fig. 3 comparing the experimental electron
density (from MAD) with the final, refined density in a portion
of the structure of bovine adrenodoxin, Adx (4-108)38.
Here, the two iron atoms of the protein were sufficient for MAD
phasing to produce density that not only clearly reveals the protein
atoms around the C-terminus of Adx (4-108) but even some of the
water molecules bound in this region.
Currently, methods for semi-automated model building into electron-density
maps39 and structure refinement 40 are being
developed in a number of laboratories. These methods will be incorporated
into the crystal structure determination process of the Protein
Structure Factory. Finally, it will be necessary to stringently
assess the quality of the determined structures41 before
they are allowed to enter a database.
Genomics does not end when all base pairs of DNA have been sequenced. On
the contrary, it may be argued that the interesting part of the work
- aimed at understanding whole organisms by starting from the
molecules of life - is the one involving studies of structure
and function of the gene products. Structural genomics approaches
such as the one described above and large-scale, high-throughput
functional studies (functional genomics 42) are starting
to provide the tools for performing these analyses.
We are grateful to Jürgen J. Müller (Max-Delbrück-Centrum)
for providing figures 2 and 3. Supported by the Bundesministerium
für Bildung und Forschung through the Leitprojekt Proteinstrukturfabrik.
1. Abola, E.E., Sussman, J.L., Prilusky, J. & Manning, N.O.
(1997) Protein Data Bank archives of three-dimensional macromolecular
structures. Methods Enzymol. 277, 556-571.
2. McKusick, V.A. (1997) Genomics: Structural and functional studies
of genomes. Genomics 45, 244-249.
3. Terwilliger, T.C., Waldo, G., Peat, T.S., Newman, J.M., Chu,
K. & Berendzen, J. (1998) Class-directed structure determination:
Foundation for a protein structure initiative. Protein Sci. 7, 1851-1856.
4. Shapiro, L. & Lima, C.D. (1998) The Argonne Structural Genomics
Workshop: Lamaze class for the birth of a new science. Structure
6, 265-267.
6. Koonin, E.V., Tatusov, R.L. & Galperin, M.Y. (1998) Beyond
complete genomes: from sequence to structure and function. Current
Opinion Struct. Biol. 8, 355-363.
7. Finkelstein, A.V. & Ptitsyn, O.B. (1987) Why do all globular
proteins fit the limited set of folding patterns?
Prog. Biophys. Mol. Biol. 50, 171-190.
8. Chothia, C. (1992) One thousand protein families for the molecular
biologist. Nature 357, 543-544.
9. Orengo, C.A., Jones, D.T. & Thornton, J.M. (1994) Protein
superfamilies and domain superfolds. Nature 372, 631-634.
10. Bork, P. & Eisenberg, D. (1998) Sequences and topology.
Deriving biological knowledge from genomic sequences. Current Opinion
Struct. Biol. 8, 331-332.
11. Fischer, D. & Eisenberg, D. (1996) Protein fold recognition
using sequence-derived predictions. Protein Sci. 5, 947-955.
12. Rice, D.W. & Eisenberg, D. (1997) A 3D-1D substitution matrix
for protein fold recognition that includes predicted secondary structure
of the sequence. J. Mol. Biol. 267, 1026-1038.
13. Fischer, D. & Eisenberg, D. (1997) Assigning folds to the
proteins encoded by the genome of Mycoplasma genitalium. Proc. Natl.
Acad. Sci. USA 94, 11929-11934.
14. Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev,
S., Yuan, Y. & Bork, P. (1998)
Homology-based fold predictions for Mycoplasma genitalium proteins.
J. Mol. Biol. 280, 323-326.
15. Kim, K.K., Hung, L.-W., Yokota, H., Kim, R. & Kim, S.-H.
(1998) Crystal structures of eukaryotic translation initiation factor
5A from Methanococcus jannaschii at 1.8 Å resolution. Proc. Natl.
Acad. Sci. USA 95, 10419-10424.
16. Lim, J.-H., Yu, Y.G., Han, Y.S., Cho, S.-j., Ahn, B.-Y., Kim,
S.-H. & Cho, Y. (1997) The crystal structure of an Fe-superoxide
dismutase from the hyperthermophile Aquifex pyrophilus at 1.9 Å
resolution: Structural basis for thermostability. J. Mol. Biol.
270, 259-274.
17. Kim, K.K., Kim, R. & Kim, S.-H. (1998) Crystal structure
of a small heat-shock protein. Nature 394, 595-599.
18. Saegusa, A. (1998) Japan's genome programme goes ahead, with
protein analysis. Nature 392, 219.
19. Kim, S.-H. (1998) Shining light on structural genomics. Nature
Struct. Biol. 5, 643-645.
21. Bork, P. & Koonin, E.V. (1998) Predicting functions from
protein sequences - where are the bottlenecks?
Nature Genetics 18, 313-318.
22. Schultz, J., Milpetz, F., Bork, P. & Ponting, C.P. (1998)
SMART, a simple modular architecture research tool:
identification of signaling domains. Proc. Natl. Acad. Sci. USA
95, 5857-5864.
23. Bork, P., Dandekar, T., Eisenhaber, F. & Huynen, M. (1998)
Characterization of targeting domains by sequence analysis: glycogen-binding
domains in protein phosphatases. J. Mol. Med. 76, 77-79.
24. Yuan, Y., Schultz, J., Mlodzik, M. & Bork, P. (1997) Secreted
Fringe-like signaling molecules may be glycosyltransferases. Cell
88, 9-11.
25. Mushegian, A.R., Bassett, D.E., Jr., Boguski, M., Bork, P. &
Koonin, E.V. (1997) Positionally cloned human disease genes: New
motifs and evolutionary conservation. Proc. Natl. Acad. Sci. USA
94, 5831-5836.
26. Bork, P., Hofmann, K., Bucher, P., Neuwald, A., Altschul, S.F.
& Koonin, E.V. (1997) A superfamily of conserved domains in
DNA damage-responsive cell cycle checkpoint proteins. FASEB J. 11,
68-76.
27. Maier, E., Meier-Ewert, S., Bancroft, D. & Lehrach, H. (1997)
Automated array technologies for gene expression profiling. Drug
Discovery Today 2, 315-324.
28. Maier, E., Meier-Ewert, S., Ahmadi, R., Curtis, J. & Lehrach,
H. (1994) Application of robotic technology to automated sequence
fingerprint analysis by oligonucleotide hybridisation. J. Biotech.
35, 191-203.
29. Hochuli, E., Bannwarth, W., Döbeli, H., Gentz, R. & Stüber,
D. (1988) Genetic approach to facilitate purification of recombinant
proteins with a novel metal chelate adsorbent. Bio/Technology 6,
1321-1325.
30. Schmidt, T.G.M. & Skerra, A. (1994) One-step affinity purification
of bacterially produced proteins by means of the "Strep-tag"
and immobilized recombinant core streptavidin. J. Chromatogr. A
676, 337-345.
31. Oschkinat, H. & Croft, D. (1994) Automated assignment of
multidimensional nuclear magnetic resonance spectra. Methods Enzymol.
239, 308-318.
32. Nilges, M., Macias, M.C., O'Donoghue, S.I. & Oschkinat,
H. (1997) Automated NOESY interpretation with ambiguous distance
restraints: the refined NMR solution structure of the pleckstrin
homology domain from β-spectrin. J. Mol. Biol. 269, 408-422.
33. Weber, P.C. (1997) Overview of protein crystallization methods.
Methods Enzymol. 276, 13-22.
34. Earnest, T. (1995) Conceptual Design Report for ALS Beamline
5.0, Lawrence Berkeley Laboratory PN941209-2.
35. Svensson, L.A., Ståhl, K., Cerenius, Y., Oskarsson, Å.,
Albertsson, J. & Liljas, A. (1997) A new beamline for crystallographic
measurements at the MAX II synchrotron, Lund, Sweden, Annual Report
182.
38. Müller, A., Müller, J.J., Muller, Y.A., Uhlmann, H.,
Bernhardt, R. & Heinemann, U. (1998) New aspects of electron
transfer revealed by the crystal structure of a truncated bovine
adrenodoxin, Adx(4-108). Structure 6, 269-280.
39. Fortier, S., Chiverton, A., Glasgow, J. & Leherte, L. (1997)
Critical-point analysis in protein electron-density map interpretation.
Methods Enzymol. 277, 131-157.
40. Lamzin, V.S. & Wilson, K.S. (1997) Automated refinement
for protein crystallography. Methods Enzymol. 277, 269-305.
41. Dodson, E.J., Davies, G.J., Lamzin, V.S., Murshudov, G.N. &
Wilson, K.S. (1998) Validation tools: can they indicate the information
content of macromolecular crystal structures? Structure 6, 685-690.