ABSTRACT
The astonishing diversity in structure and function of extracellular matrix (ECM) proteins originates from different combinations of domains. These are defined as autonomously folding units. Many domains are similar in sequence and structure, indicating common ancestry. Evolutionarily homologous domains are, however, often functionally very different, which renders function prediction from sequence difficult. Related and different domains are frequently repeated in the same or in different polypeptide chains. Common assembly domains include α-helical coiled-coil domains and collagen triple helices. Other domains have been shown to be involved in assembly to other ECM proteins or in cell binding and cell signalling. The function of most of the domains, however, remains to be elucidated. ECM proteins are rather recent ‘inventions’, and most occur either in plants or mammals but not in both. Their creation by domain shuffling involved a number of different mechanisms at the DNA level in which introns played an important role.
INTRODUCTION
The extracellular matrix (ECM) is not just the glue between cells, as was believed for a long time. It is instead a highly elaborate association of proteins, proteoglycans and glycosaminoglycans, each of which has a specialized function in fulfilling the manifold purposes of the ECM. The main purpose is to serve cells as a substrate for growth and to provide a stable structure around them. This is a fundamental precondition for the existence of multicellular organisms. The central systems in eukaryotes (neural, circulatory, digestive and fertilization systems) evolved within and along with the ECM.
The ECM has to serve two masters: it must be a pleasant living space for the cell and a suitable scaffold for functional elements of the organism. To fulfil these purposes, a huge set of proteins and proteoglycans of unusual size and shape have evolved, which furnish the tissue with its distinct features and anchor the cell in its surroundings. Electron microscopy gives insight into a strange microcosm of crosses, spiders, strings of pearls, brushes, dumb-bells, rods and other oddities. The astonishing diversity in structure and function of proteins in this bizarre arsenal, however, originates from a building set of a limited number of modules.
Here we review the present knowledge of the fundamentals of domain organization and scrutinize functional assignments derived from experimental data and from sequence homology. The complex multidomain organization of the multifunctional ECM proteins offers a fascinating view on mechanisms of evolution. The apparent redundancy of certain ECM proteins opens questions on selective forces in evolution.
In order to provide a concise overview, only the more recent primary publications and review articles could be cited. References to the original literature can be found in these.
EXTRACELLULAR MATRIX PROTEINS ARE MOSAIC MULTIDOMAIN PROTEINS BUILT OF MODULAR UNITS
Extracellular matrix proteins are typical multidomain proteins (Table 1). Most domains show identity with domains of the same protein or with domains found either in other ECM proteins or in multidomain proteins not normally classified as ECM proteins. These include cell adhesion molecules (CAMs and cadherins), many cellular receptors including integrins, and proteins of the immune and complement system and of the blood clotting cascade. Because of this wide and repeating distribution of domains, these types of proteins have been termed mosaic proteins built of modular units (Doolittle, 1985, 1992; Doolittle et al., 1986).
Table 1 summarizes the domain organizations as revealed by sequence information for a large number of ECM proteins. Table 1 is not complete and additional compilations can be found in Bork (1991, 1992), Baron et al. (1991), Engel (1991), Patthy (1991a,b), Bork and Doolittle (1992) and Kreis and Vale (1993). Such comparisons demonstrate the widespread distribution of domains in different classes of proteins. Examples are the EGF domains in proteins of the ECM, in blood clotting and complement systems and in a number of cell-surface receptors. IgG-domains occur not only in the immunoglobulin family, but also in proteoglycans, in cell-adhesion molecules such as N-CAM, and in receptors recognizing growth factors and carbohydrates. One of the most widespread modules is the fibronectin type 3 (F3) domain, of which more than 300 variants in about 70 proteins (not counting species redundancies) have been detected so far. They are found in both extracellular and intracellular proteins; for example, in the muscle protein twitchin and in the cytosolic domain of the integrin subunit β4.
It has to be pointed out that all classifications contain a degree of ambiguity and uncertainty, in some cases because of very low sequence identities, which might not reflect common evolutionary descent. It has been argued that in some cases similar domains were produced by convergent evolution, for example as discussed for EF-hand domains (Kretsinger, 1987).
DOMAINS ARE AUTONOMOUS STRUCTURAL UNITS
Domains may be defined by the sequence blocks which are repeated in the same protein or reoccur in different proteins. Often, however, it is difficult to define the exact starts and ends of domains on this basis. Recognition of a linker sequence, which is normally hydrophilic, may help in some cases. Domains are often encoded by single exons, but this cannot be an absolute rule since introns can be secondarily introduced into exons during evolution. In the present work a domain is defined as an autonomous, independently folded structural unit. The most stringent proof for the structural independence of domains comes from three-dimensional structures. These have been derived by NMR and X-ray diffraction for a number of domains (Table 2; Baron et al., 1991). Earlier indications of a conformational independence of fibronectin and laminin domains were based on circular dichroism studies, which indicated additivity of the spectra of different fragments (Odermatt et al., 1982; Ott et al., 1982). A powerful method for distinguishing individual domains in regions with sequence repeats is based on the resistance of recombinantly prepared fragments to proteolysis (Winograd et al., 1991). The structural integrity of separated domains has also been demonstrated for many ECM proteins by the preserved biological functions of domains and fragments. Recently, the three-dimensional structures of a pair of fibronectin type 1 (F1) domains in fibronectin (Williams et al., 1993), a pair of complement control (CO) domains in factor H (Barlow et al., 1993) and a lectin-EGF-module pair (Graves et al., 1994) were resolved. The secondary structure of each module within a pair conformed closely with the structure of the separated single domains, implying that modules fold entirely autonomously within intact proteins.
THE DOMAIN ORGANIZATION OF ECM PROTEINS
The example of the F1 pair in fibronectin demonstrated the potential of NMR for elucidation of the geometry of domain organizations. The NMR technique is limited, however, to structures with a molar mass below about 20 000. X-ray analysis is applicable to larger structures, but in this case crystallisation of large ECM molecules is a severe problem. Electron microscopy, therefore, is one of the most powerful techniques for elucidation of larger domain organizations (Engel, 1994). Fibronectin (Hynes, 1990; Odermatt et al., 1982), laminin (Beck et al., 1990), thrombospondin (Lawler, 1986) and tenascin (Spring et al., 1989; Erickson, 1993) are examples in which this technique, in combination with hydrodynamic data and sequence analyses, yielded detailed information (Fig. 1). EGF domains are found in all four of these proteins, in linear arrangements with a repeat of 2-2.5 nm per domain. The F1, F2 and F3 domains in fibronectin are also in a linear array which is, however, strongly dependent on ionic strength (Markovic et al., 1983), indicating a solvent-dependent internal association of domains. Likewise, the arms of thrombospondin and cartilage oligomeric matrix protein (Mörgelin et al., 1992) are extended. The stretched arms of laminin and fibronectin exhibit a limited flexibility comparable with that of actin filaments (Engel et al., 1981); hence these domains are not loosely connected but interact with each other (as exemplified by the rigid end-to-end structure of the F1 pair). In contrast, only small constraints on the flexibility of domains were seen for the CO- and the LE-EGF-pairs.
The extended arrangements mentioned so far are formed by sequential arrangements of small globular domains in a single chain. In two types of domain, however, long linear structures are formed by several chains. These are the collagen triple helices formed by three chains with Gly-X-Y repeats, and the α-helical coiled-coil structures in which two to five chains with heptad repeats of nonpolar residues are connected (Cohen and Parry, 1990, 1994; Lupas et al., 1991). The length of these structures is highly variable. Coiled-coil structures in COMP and thrombospondin are not longer than 50 residues/chain (7.5 nm; Efimov et al., 1994). They are of similar size in tenascin (Spring et al., 1989) but longer, consisting of 600 residues (76 nm), in laminin (Beck et al., 1990). Collagen triple helices range from 45 residues/chain in the N-terminal small triple helix of collagen III (Bruckner et al., 1978) to 8 000 residues/chain (2400 nm) in some worm collagens (Gaill et al., 1991). Thus, an amazing diversity of forms of ECM proteins can be built up from the modular pool.
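The residue counts and lengths quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch that divides each reported length by the corresponding number of residues per chain to obtain the implied axial rise per residue. Only the figures quoted in the text are used; the dictionary and function names are illustrative assumptions, not part of the original work.

```python
# (residues per chain, reported length in nm); all figures are those quoted in the text.
REPORTED = {
    "coiled coil, COMP/thrombospondin": (50, 7.5),
    "coiled coil, laminin": (600, 76.0),
    "collagen triple helix, worm collagen": (8000, 2400.0),
}


def rise_per_residue(residues, length_nm):
    """Implied axial rise per residue (nm) for a rod of the given length."""
    return length_nm / residues


for name, (residues, length_nm) in REPORTED.items():
    print(f"{name}: {rise_per_residue(residues, length_nm):.3f} nm/residue")
```

With these inputs the coiled coils come out at 0.150 and 0.127 nm per residue and the collagen triple helix at 0.300 nm per residue, so for the same number of residues a triple helix is roughly twice as long as a coiled coil.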
FUNCTIONAL PREDICTIONS FOR MODULES BASED ON PRIMARY SEQUENCE HOMOLOGY ARE OFTEN WRONG
Elucidation of the functions of individual domains in ECM proteins is a challenging but very time-consuming and difficult task. An outstanding success was the identification of the cell binding site containing Arg-Gly-Asp in fibronectin. In the pioneering work by Ruoslahti (1988), this site was identified in the 10th F3 domain of fibronectin. An exposed and flexible three-dimensional structure has been recently demonstrated for the Arg-Gly-Asp region both in this domain (Baron et al., 1991) and in disintegrins (Blobel and White, 1992). When it was found that cell attachment by several other ECM proteins could be inhibited by Arg-Gly-Asp peptides, it was initially thought that a universally valid principle had been discovered. As a consequence, many putative cell attachment sites were predicted from sequence data. It is now realized that this does not hold true: many cell attachment processes are Arg-Gly-Asp independent and many major cell attachment sites do not contain this sequence. It was even found that an F3 domain in tenascin, which contains an Arg-Gly-Asp sequence at a location similar to that in the classic domain in fibronectin, is not involved in attachment, although in the isolated recombinantly prepared domains the tripeptide sequence was active (Aukhil et al., 1993). Instead, another domain mediates Arg-Gly-Asp independent cell binding in native tenascin (Spring et al., 1989). As was pointed out by Ruoslahti (1988) and Hynes (1990), but ignored by many others, attachment is usually highly conformation dependent (Deutzmann et al., 1990) and, as for fibronectin (Hynes, 1990), more than one binding site may be involved.
Another example of inaccurate prediction of functions based on sequence similarity relates to the EGF domains. It is an appealing concept that some of the EGF-like domains in ECM proteins may act as localized signals for growth and differentiation, which may act in a specific and vectorial way on adjacent cells. Indeed, growth-promoting functions have been experimentally shown for laminin, thrombospondin and tenascin (Engel, 1989). For laminin, which is amongst the first ECM molecules expressed in mammalian embryonic development, it was possible to localize this function to fragment P1 (Panayotou et al., 1989), which comprises short-arm regions of the α, β and γ chains with about 25 EGF-like repeats in total. Unambiguous proof is missing, however, for an EGF domain being the active functional site in the very large fragment P1. Furthermore, it is clear that not all EGF domains in ECM and other proteins exhibit growth factor-like functions. The best demonstrated function of the laminin type EGF (EG’) domains (Table 2) is to provide a very specific binding site for the C-terminal nidogen domain N3 (Mayer et al., 1993). The three-dimensional structure of the nidogen binding EGF-domain will be known soon and it is hoped that details of its specific function will be explored. Other functions, like Ca2+ binding of specialized EGF domains (EG*), have been demonstrated (Table 2) and have been correlated with the three-dimensional structure of the EG* domains (Baron et al., 1991). The functions of most other EGF-like domains remain unexplored.
Functional predictions that are entirely based on recognition of a general sequence motif are usually wrong. Very specific information like calcium-binding motifs in EGF- or EF-hand domains might be helpful, but even in these cases there have been many disappointing experiences. We urge, therefore, that the frequently used term ‘putative functional domains’ should be avoided, since it can lead to confusion when ‘putative’ is inadvertently omitted (e.g. in the next review).
Another argument for the functional promiscuity of domains comes from estimates of the number of protein families. Recent genome sequencing efforts show that about one third of sequenced open reading frames belong to families that already have members in the databanks. From these data Chothia (1992) estimated that about 1500 different protein families exist. Even if there are ten times more families, because of biases in the databanks, the number of functions greatly exceeds the number of basic protein structures. Thus the prototype of each family is modelled differently to fulfil specific functions. Extracellular modules provide examples of great functional variety being achieved from a few basic structures. Just as no one would claim to predict the antigen from the sequence of an antibody, we feel the elucidation of functions of protein modules should rely principally on experimental effort, not sequence comparisons.
DOMAINS MAY FUNCTION INDEPENDENTLY OR IN COMBINATION WITH OTHER DOMAINS
Perhaps the most important function of coiled-coil and collagenous domains is to connect subunits within a single molecule, in which they may exhibit a concerted function. This is clearly demonstrated by thrombospondins (Lawler, 1986; Lawler et al., 1993) and COMP (Mörgelin et al., 1992) in which 3 or 5 identical chains are combined. These all point in the same direction and hence the C-terminal cell binding domains of these molecules and other domains are brought into close vicinity (Fig. 1). This alignment may be important for simultaneous recognition of multiple receptor sites at the cell surface. Although details of the binding mechanism have not yet been explored, the situation may be comparable to the binding of the hexameric ‘flower bouquet’-shaped first component of complement C1q, which binds to clusters of IgG. In this example it was demonstrated that sufficient binding strength is only produced by multivalent binding (Tschopp et al., 1980). This affords a mechanism for discriminating between clustered and isolated IgG molecules at a cell surface.
In many collagens, several globular domains are combined by association of three chains in the collagen triple helix (for example, collagens IV, VI and XII; Table 1). Von Willebrand type A (VA) domains are involved in the self-assembly of some collagens and have frequently been designated as collagen binding domains, although direct proof for this activity is missing in most cases (Colombatti and Bonaldo, 1991).
Laminin is comparable, in that three different chains α, β and γ are connected by a coiled-coil domain. Many genetically distinct variants of these chains have been found (Paulsson, 1993) and these are combined to give distinct laminin isoforms. Some isoforms are transiently expressed at restricted sites, suggesting specialized functions. The assembly of the three different chains is highly specific and correct assembly is crucial for cell binding of laminin by α6β1 integrin and for the promotion of neurite outgrowth (Hunter et al., 1992; Deutzmann et al., 1990; Sung et al., 1993).
THE TIME COURSE OF EVOLUTION OF ECM PROTEINS
Doolittle (1985, 1992) attempted to group proteins according to their time of invention. He classified ECM proteins as very recent inventions, each of which is found in animals or plants but not in both, nor in prokaryotes. This suggests that ECM proteins arose around the time that plants and animals diverged, perhaps 1 billion years ago (Doolittle, 1985). It has been proposed that modern mosaic proteins are the result of efficient mechanisms of exon shuffling (Patthy, 1991b). For several ECM proteins for which sufficient sequences from phylogenetically distant organisms were available, phylogenetic trees were constructed. The construction of the dendrograms utilized the method of maximum parsimony, which determines the tree requiring the minimum number of base substitutions (alternative phylogenetic reconstruction methods are possible). As an example, the phylogenetic tree of the thrombospondin gene family is shown (Fig. 2; modified from Lawler et al., 1993, by inclusion of COMP). The dendrogram is based on a comparison of the C-terminal six EF domains and the TC domain (Table 1). Lawler et al. (1993) were able to assign a very rough time scale to the dendrogram by calibration with two phylogenetic events. This is possible by assuming a constant average rate of evolutionary divergence for the protein region under consideration. Different proteins change at very different rates but each rate is approximately constant (Doolittle, 1992); this rate constancy may apply to individual domains within mosaic proteins, but not to the entire protein. From Fig. 2, we can infer that the C-terminal portions of thrombospondins and COMP had a common ancestor earlier than 900 million years ago. At this time a gene duplication (open box 1) resulted in the two branches: one with a precursor for thrombospondins 1 and 2, and the other with a precursor for thrombospondins 3 and 4 plus COMP.
One can also deduce from the phylogenetic tree, and the domain distribution in different branches, that either the N-terminal domains PR3 EG of thrombospondins 1 and 2 were inserted into the thrombospondin 1/2 branch between 900 and 600 million years ago, or alternatively, they were already present in the common precursor and the thrombospondin type 3/4/COMP branch subsequently lost these domains. The resolution of such phylogenetic trees is insufficient to resolve whether COMP diverged from thrombospondins 3 and 4 before or after the branching off of thrombospondin type 3. However, an early branching of COMP would support a model that evolution proceeded in the direction of increasing domain complexity. It is important to note, however, that phylogenetic trees derived from certain domains of a multidomain protein are not able to predict the history of other domains in the same protein, hence it remains unclear whether the three EGF domains common to all proteins (Table 1) were present in the precursor. It will be interesting to scrutinize the phylogeny of different domains within one protein and hence gain more insight into the pattern and timing of domain acquisition within multidomain proteins.
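The maximum parsimony criterion used for the dendrograms in this section can be illustrated with a small sketch. The code below scores fixed candidate trees with Fitch's counting procedure, a standard way of obtaining the minimum number of substitutions a given tree requires, and then keeps the tree with the lowest score. It is an assumption-laden toy example: the four taxa A to D, their four-letter sequences and all function names are invented for illustration and are not data or software from the studies cited here.

```python
def fitch_site(tree, states):
    """Fitch's procedure for one alignment column.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping leaf name to the character observed at this column
    Returns (possible ancestral states, minimum substitutions in the subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No shared state: one substitution is needed on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column minimum substitution counts over the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences (invented for this sketch).
ALIGNMENT = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTGC"}

# The three possible groupings of four taxa into two pairs.
CANDIDATES = {
    "(A,B),(C,D)": (("A", "B"), ("C", "D")),
    "(A,C),(B,D)": (("A", "C"), ("B", "D")),
    "(A,D),(B,C)": (("A", "D"), ("B", "C")),
}

SCORES = {name: parsimony_score(tree, ALIGNMENT) for name, tree in CANDIDATES.items()}
BEST = min(SCORES, key=SCORES.get)
print(SCORES, "most parsimonious:", BEST)
```

For this toy alignment the grouping (A,B),(C,D) requires 4 substitutions, against 5 and 6 for the alternatives, and is therefore preferred. Real analyses search over many more topologies and, as noted above, such a tree only describes the history of the domains that were aligned.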
MECHANISMS OF THE EVOLUTION OF ECM PROTEINS
New proteins come from old proteins as the result of gene duplications followed by base substitutions (Doolittle, 1992). This very general statement also applies to mosaic proteins. It is obvious from Table 1, however, that in their case individual domains can also be rearranged extensively, somewhat like mobile elements (Doolittle, 1992).
Mechanisms of gene duplication by unequal crossing-over between sister chromosomes containing the genes are described in textbooks. Unequal cross-overs can also readily extend tandemly repeated genes into long series. Duplications, deletions, inversions, conversions, slippages and translocations of DNA segments can arise as the result of erratic rejoining of fragments. The genomic sequence of chromosome III of C. elegans (Wilson et al., 1994) and a comparative study of large DNA sequences of mouse and man (Koop and Hood, 1994) revealed an intriguing view on gene organization, with evidence for duplications, inversions and other gene rearrangements. Gene rearrangements are rare events catalyzed by the enzymes that mediate normal recombination processes; in the example of thrombospondins they resulted in gene duplications at time intervals in the range of 100 million years (see Fig. 2). For mosaic proteins it is generally believed that these processes are speeded up by the presence of introns. The most trivial reason for the higher speed of this process is the possibility of breaking and rejoining the DNA anywhere in the long introns on either side of an exon. Thus a large number of different possible breakages could lead to exon shuffling; in each case the exon is left intact, which in turn encodes a stable protein domain in the majority of extracellular modules.
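How unequal crossing-over changes the number of tandem repeats can be made concrete with a short sketch. It treats each chromatid as a list of intact exons and only allows breakpoints between exons, which corresponds to the case of breakage within the flanking introns described above. The function name and the exon labels are hypothetical and serve only to illustrate the bookkeeping.

```python
def unequal_crossover(repeats_a, repeats_b, break_a, break_b):
    """Exchange the distal parts of two tandem-repeat arrays.

    break_a and break_b are the numbers of intact repeats kept from the start
    of each array, so every breakpoint lies between two repeats (in an intron).
    Equal breakpoints model a normal cross-over; unequal ones a misaligned one.
    """
    if not 0 <= break_a <= len(repeats_a) or not 0 <= break_b <= len(repeats_b):
        raise ValueError("breakpoint outside the repeat array")
    recombinant_1 = repeats_a[:break_a] + repeats_b[break_b:]
    recombinant_2 = repeats_b[:break_b] + repeats_a[break_a:]
    return recombinant_1, recombinant_2


# Two chromatids, each carrying three copies of a repeated exon (labels invented).
chromatid_a = ["E1", "E2", "E3"]
chromatid_b = ["e1", "e2", "e3"]

aligned = unequal_crossover(chromatid_a, chromatid_b, 2, 2)
misaligned = unequal_crossover(chromatid_a, chromatid_b, 2, 1)
print([len(r) for r in aligned])     # copy numbers after an aligned exchange
print([len(r) for r in misaligned])  # copy numbers after a misaligned exchange
```

An aligned exchange leaves three repeats on each product, whereas misalignment by one repeat yields one product with four repeats and one with two. Repeating such events can therefore lengthen or shorten a series of domains while every exon stays intact.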
Exon shuffling could involve transposable elements (which make up about 10% of the genome of higher eukaryotes); examples can be seen ‘in flagranti’ in several human genetic disorders. Duplication or deletion of exons is also the cause of several human genetic disorders (Bates and Lehrach, 1994; Makalowski et al., 1994). The possibility of an exon variation at the DNA level by reverse transcription of an alternative splice variant at the RNA level may also be considered. Reverse transcription may occur by mammalian enzymes with reverse transcriptase activity or with the help of virus systems (Fink, 1987).
Details of alternative possible mechanisms for exon shuffling have been discussed by Rogers (1990) and Patthy (1991b). Several examples strongly suggest that exons can be inserted into preexisting introns. However, domain shuffling is clearly not the result of just one mechanism, nor is it the only process operating in multidomain protein evolution. This becomes evident with the increasing number of observations in which exons do not correspond to protein domains, or in which domains consist of several exons (e.g. F3 domains are encoded by two exons, yet no half F3 has been found to date). Even harder to reconcile with a dominant role for exon shuffling is the observation that exon-intron boundaries, within the same domain organization, can differ from one species to another. Extensive rearrangements following the presumed duplication of a common primordial gene were shown for the genes of the β and γ chains of laminin (Kallunki et al., 1991).
An example in which the relative contributions of exon shuffling and other processes were compared relates to the EF-hand calcium modulated proteins. Extensive analysis revealed a random distribution of introns over ‘domain’ and ‘interdomain’ space, and that some introns were acquired after a four-domain precursor was formed. It was therefore concluded that in the evolution of the widely distributed EF-hand protein family, exon shuffling played little if any role (Kretsinger and Nakayama, 1993).
The evolution of genes for the modular ECM proteins may have been further complicated by horizontal gene transfer; for example, Bork and Doolittle (1992) suggested that bacteria may have acquired an F3 domain from animals.
WHAT WERE THE SELECTION PRESSURES AFFECTING THE EVOLUTION OF ECM PROTEINS?
Unequal crossing-over may lead to either an increase or a decrease in the number of repeated domains. This suggests that the large number of repeated domains present in ECM proteins may be the result of natural selection. One reason for the large number of domains in an ECM protein may be the need to prevent diffusion of domains with specific activities into the otherwise open extracellular space: this could be achieved simply by making the proteins very large. For example, this allows domains with cell signalling activity to be localized at specific sites, and allows the extracellular environment to be varied by time-dependent and vectorial expression of different proteins (Engel, 1989). Another common feature of the large and extended ECM proteins is their ability to bridge between distant sites, for example between cellular receptors and other parts of the matrix. Clearly the development of specialized assembly domains was a prerequisite for developing multifunctional large molecules with the potential of forming higher macromolecular organisations. The α-helical coiled-coil domains are also found in many cytoskeletal proteins but collagen triple helices are specific for extracellular proteins. In addition to other functions, collagen triple helices are essential for the formation of collagen fibres, cuticle structures and networks. They contribute essentially to the mechanical properties of tissues of larger organisms.
Of course, ECM proteins contain a large number of domains with much more specific functions than spacing, assembly or support; a few are listed in Table 2. In addition, it must be stressed that such specific functions have been elucidated for only a small percentage of the known domains.
This rather simplified interpretation of selection pressures is apparently contradicted by the complete lack of phenotype resulting from genetic elimination of certain ECM proteins (even those implicated in important functions). In contrast to fibronectin, which is absolutely required in early stages of embryonic development, no phenotype was detectable in transgenic mice after knock-outs of tenascin and S-type lectin (George et al., 1993; Poirier and Robertson, 1993; Saga et al., 1992). One explanation may be that hitherto unrecognized subtle functions of these proteins may cause a small increase of fitness, which is not obvious in the phenotype, and would only be apparent in the appropriate population size and natural environment (as seen for transgenic mice lacking metallothioneins; Michalska and Choo, 1993). Alternatively, there may be functional redundancies between some ECM proteins with similar domains, which can at least in part fulfil the function of the deficient protein. Even in this case, however, selective advantages may be necessary to maintain such a redundancy in a population. These could be selection for a subtle divergent function, an increased fidelity for a certain process or an enhanced efficiency of a cumulative function.