ABSTRACT
The discovery of hemoglobins in virtually all kingdoms of organisms has shown (1) that the ancestral gene for hemoglobin is ancient, and (2) that hemoglobins can serve additional functions besides transport of oxygen between tissues, ranging from intracellular oxygen transport to catalysis of redox reactions. These different functions of the hemoglobins illustrate the acquisition of new roles by a pre-existing structural gene, which requires changes not only in the coding regions but also in the regulatory elements of the genes. The evolution of different regulated functions within an ancient gene family allows an examination of the types of biosequence data that are informative for various types of issues. Alignment of amino acid sequences is informative for the phylogenetic relationships among the hemoglobins in bacteria, fungi, protists, plants and animals. Although many of these diverse hemoglobins are induced by low oxygen concentrations, to date none of the molecular mechanisms for their hypoxic induction shows common regulatory proteins; hence, a search for matches in non-coding DNA sequences would not be expected to be fruitful. Indeed, alignments of non-coding DNA sequences do not reveal significant matches even between mammalian α- and β-globin gene clusters, which diverged approximately 450 million years ago and are still expressed in a coordinated and balanced manner. They are in very different genomic contexts that show pronounced differences in regulatory mechanisms. The α-globin gene is in constitutively active chromatin and is encompassed by a CpG island, which is a dominant determinant of its regulation, whereas the β-globin gene is in A+T-rich genomic DNA. Non-coding sequence matches are not seen between avian and mammalian β-globin gene clusters, which diverged approximately 250 million years ago, despite the fact that regulation of both gene clusters requires tissue-specific activation of a chromatin domain regulated by a locus control region. The cis-regulatory sequences needed for domain opening and enhancement do show common binding sites for transcription factors. In contrast, alignments of non-coding sequences from species representing multiple eutherian mammalian orders, some of which diverged as long as 135 million years ago, are reliable predictors of novel cis-regulatory elements, both proximal and distal to the genes. Examples include a potential target for the hematopoietic transcription factor TAL1.
INTRODUCTION
Many heme-binding proteins with diverse functions are known, including electron-transferring cytochromes, intracellular peroxidases and lignin-degrading extracellular peroxidases. Some of the most abundant hemoproteins, the hemoglobins, allow the reversible binding of oxygen to the heme. They are not commonly implicated in catalysis or electron transfer; instead, the heme-bound iron stays in its 2+ (FeII) oxidation state. Hemoglobins are usually thought of as the major proteins in erythrocytes circulating in the blood of vertebrates, carrying the oxygen inhaled by the lungs to the respiring tissues in the body. Hemoglobins were first found in blood simply because they are so abundant, with a concentration in normal human blood of 15 g per 100 ml. But it has become clear that hemoglobins are very widespread in the biosphere, are found in all groups of organisms, including prokaryotes, fungi, plants and animals, and carry out many different functions, including catalysis.
The widespread and diverse hemoglobins appear to be encoded by orthologous genes, i.e. the phylogenetic analysis indicates that the genes are descended from an ancient, common ancestral gene. Thus, the different functions of the hemoglobins illustrate the acquisition of new roles by a pre-existing structural gene, which requires changes not only in the coding regions but also in the regulatory elements of the genes. This paper will review this broad distribution of hemoglobins and use that as a context in which to examine some aspects of the evolution of regulation in this gene family. The evolution of different regulated functions within an ancient gene family also allows an analysis of the types of biosequence comparisons that are informative for various kinds of biological questions and that will be a parallel theme of this paper.
HEMOGLOBIN EVOLUTION FROM BACTERIA TO MAN ILLUSTRATES THE DIFFERING ROLES OF HEMOGLOBIN
Animal hemoglobins
The hemoglobin that can be readily isolated from the blood of any vertebrate is a heterotetramer of two α-globin and two β-globin polypeptides, with a heme tightly bound to a pocket in each globin monomer. The movements and interactions between the α- and β-globin subunits lead to the cooperative binding of oxygen to this hemoglobin, allowing it to pick up oxygen readily in the lungs and to unload it efficiently in the peripheral respiring tissues (Table 1). The amino acid sequences of the α- and β-globins are approximately 50% identical, regardless of which vertebrate species is the source, arguing that these two genes are descended from a common ancestor approximately 450 million years ago, in the ancestral jawed vertebrate (Goodman et al. 1987). Both α- and β-globins are about equally divergent from the monomeric myoglobin, an oxygen storage and delivery protein found in many tissues. It lacks the exquisite cooperativity of the blood hemoglobins, but its relationship to them is clear from both the primary sequence and the virtually identical three-dimensional structures, each containing the globin fold (Dickerson and Geis, 1983). Further studies have found hemoglobins in jawless vertebrates and in diverse invertebrates ranging from flies (arthropods) through earthworms (annelids) to nematodes (Riggs, 1991; Dixon et al. 1992; Sherman et al. 1992). The amino acid sequences of invertebrate hemoglobins can be aligned with those of vertebrate hemoglobins (examples are shown in Fig. 1). In a parsimony analysis of these aligned amino acid sequences, the vertebrate and invertebrate hemoglobins form separate, distinct, monophyletic clades within the overall tree for hemoglobins (Fig. 2). Thus, the primary structures of invertebrate hemoglobins are related, but somewhat distantly, to those of vertebrate globins. In some invertebrates, the large extracellular hemoglobins are fusion proteins composed of multiple copies of the familiar monomeric globins. As hemoglobins are found in more and more distantly related species, the estimated time for the last common ancestral hemoglobin gene moves further back, to at least 670 million years ago in the case of the invertebrate/vertebrate divergence (Goodman et al. 1988) (see Fig. 3).
Plant hemoglobins: symbiotic and nonsymbiotic
Plants not only make oxygen during photosynthesis, but they also use it for respiration via the electron transfer chain in mitochondria. Recent studies show that hemoglobins are widely used in plants to bind and transfer that oxygen. The first plant hemoglobins were discovered in the root nodules of legumes (reviewed in Appleby, 1984). These nodules are a symbiosis between rhizobial bacteria and the plant to allow fixation (reduction) of atmospheric nitrogen into a usable form, ammonia, which eventually appears in amino acids and other building blocks for the cells. Reduction of nitrogen consumes large amounts of energy, and the nodules have an abundant, plant-encoded hemoglobin, called leghemoglobin, that facilitates the diffusion of oxygen to the respiring bacteriods in the root nodule (Appleby, 1984). In addition, the binding of oxygen to leghemoglobin may help sequester the oxygen away from the nitrogen-fixing machinery, which is readily poisoned by oxygen (Dickerson and Geis, 1983). Although the amino acid sequences of leghemoglobins differ from those of vertebrate globin genes at approximately 80% of the positions, leghemoglobin folds into the same three-dimensional structure as the animal globins (Vainshtein et al. 1975). Thus, initial investigations clearly showed that some plants had hemoglobins, but it was thought that they were limited to legumes.
The recent discovery of hemoglobins in a large variety of plants, including many nonleguminous species, strongly argues that the leghemoglobins are a specialized product of divergence from an ancient plant hemoglobin gene – a gene that is itself descended from a hemoglobin gene in the last common ancestor to plants and animals. Hemoglobins distinct from leghemoglobin, called nonsymbiotic hemoglobins, have been discovered in root nodules of a nonleguminous plant (Appleby et al. 1983), in plants that do not form nodules (Bogusz et al. 1988) and in the monocotyledon cereals (Taylor et al. 1994). Recently, a nonsymbiotic hemoglobin gene was discovered in the legume soybean (Glycine max) that is distinct from the well-known leghemoglobin genes found in the same plant (Andersson et al. 1996). Thus, two different types of hemoglobin have been discovered in plants, a nonsymbiotic type that is widely distributed and perhaps ubiquitous among species and a symbiotic type that is induced upon nodulation. The nonsymbiotic hemoglobins are synthesized in a wide range of tissues, including stems and young leaves of mature plants, seed cotyledons and young shoots. Although messenger RNA from the soybean gene is also present in root nodules, it is much less abundant than that for the leghemoglobins. The expression pattern indicates that this protein has a more generalized function in plants, such as facilitating oxygen diffusion to rapidly respiring cells (Andersson et al. 1996).
A cladistic analysis of the amino acid sequences of plant hemoglobins (Andersson et al. 1996) generates two distinct branches, one with the symbiotic hemoglobins (characterized by the leghemoglobins) and the other with the nonsymbiotic hemoglobins (Fig. 2). Since the latter nonsymbiotic hemoglobins have been found in a wide range of plant species, these observations strongly support the hypothesis that a gene encoding the nonsymbiotic hemoglobin was present in the ancestor to plants (Fig. 3). It is likely that the symbiotic hemoglobins arose via duplication of an ancestral gene followed by divergence to fulfil more specialized functions in root nodules.
A common ancestral gene for plant and animal hemoglobins
The hemoglobin gene inferred to be present in the ancestor to plants was probably related to the hemoglobin gene in the ancestor to mammals, i.e. there was a gene for hemoglobin in the last common ancestor to plants and animals. The evidence for this conclusion is based on the number and positions of introns in the contemporary genes. The plant hemoglobin genes (both symbiotic and nonsymbiotic) are separated into four exons by three introns (Jensen et al. 1981; Brisson and Verma, 1982; Andersson et al. 1996), as illustrated in Fig. 3.
The first and third introns are in positions homologous to those of the two introns found in vertebrate α- and β-globin and in myoglobin genes. The second plant intron interrupts the region coding for the E helix of hemoglobin. A similar intron/exon structure is found for the hemoglobin genes in the nematodes Pseudoterranova and Ascaris (Dixon et al. 1992; Sherman et al. 1992), which may represent an older structure than the two-intron form found in the annelid Lumbricus (Jhiang et al. 1988) or the intron-less form found in the insect Chironomus (Antoine and Niessing, 1984). Thus, one can propose that the ancestor to plants and animals had a hemoglobin gene with three introns (Fig. 3). This arrangement has been retained in all the plant hemoglobin genes, both symbiotic and nonsymbiotic, and also in certain nematodes. The central intron was lost prior to the divergence of annelids and arthropods and, hence, is absent in all vertebrate hemoglobin and myoglobin genes. Other nematode hemoglobin genes have lost one or more introns from the ancestral three-intron structure (reviewed in Goldberg, 1995).
Although this model is attractive in its simplicity, the assignment of the central intron as homologous between plant and nematode hemoglobins is not definitive. Alignment of plant and nematode hemoglobin sequences shows several matches on both sides of the E helix, but the region interrupted by the central intron is completely different (Fig. 1). Simply starting from the predicted beginning of the E helix in the published alignments (Sherman et al. 1992), the central intron in nematodes interrupts the eighth codon, whereas the central intron in plants falls between the fourteenth and fifteenth codons encoding this helix. Thus, the introns appear to be in slightly different places and in different phases. Whether this is the result of extensive divergence from a common ancestor, resulting in intron ‘sliding,’ or independent insertions of introns (Dixon and Pohajdak, 1992) requires further study (discussed in Goldberg, 1995). Irrespective of whether the central intron was present early or was inserted later, the sequence relationships and overall gene structures argue strongly for an ancestral hemoglobin gene being present more than 1500 million years ago, prior to the divergence of plants and animals. It is also clear that the first and third introns are at least as old as the last common ancestor to plants and animals.
Hemoglobins in protists, fungi and bacteria
Recent studies show that the hemoglobin gene is truly ancient, preceding the divergence of prokaryotes and eukaryotes, and it now appears in a variety of exon/intron arrangements (Fig. 3). The subfamily of hemoglobins found in protists forms a distinct branch on cladograms (Zhu and Riggs, 1992), illustrated in Fig. 2. The hemoglobin gene in the protozoan Tetrahymena has no introns (Takagi et al. 1993), the homologous gene in Paramecium has a single intron that does not correspond to any other globin gene intron (Yamauchi et al. 1995), and one hemoglobin gene in the alga Chlamydomonas has three introns, at least two of which are in unique locations (Couture et al. 1994). Since these appear to be homologous genes, one must propose many introns in the ancestral gene followed by differential loss, substantial intron sliding or repeated insertions of introns to obtain the contemporary structures. Stoltzfus et al. (1994) have argued that this diversity in gene structure is incompatible with the idea that exons encode discrete units of protein structure.
The hemoglobins found in these unicellular organisms show more biochemical reactivities and hence have been implicated in a wider variety of potential functions that those traditionally associated with animal and plant hemoglobins (Table 1). The hemoglobins in Chlamydomonas are found in the chloroplast and are light-inducible. Unlike the case with many other oxyhemoglobins, O2 bound to the Chlamydomonas hemoglobin can be reduced (Couture and Guertin, 1996), which raises the interesting possibility that this hemoglobin could serve as an electron acceptor, perhaps in the electron transport system or another redox system. Alternatively, it could serve to trap oxygen to protect oxygen-labile proteins or perhaps deliver oxygen to the cytochrome oxidase of the respiratory chain (Couture et al. 1994). A flavohemoglobin from the yeast Saccharomyces is a fusion of a heme-binding domain and an FAD-binding domain (Zhu and Riggs, 1992). In contrast to most other hemoglobins, its production is induced by high levels of oxygen (Zhu and Riggs, 1992; Crawford et al. 1995), and it may play a role in protecting the cell from oxidative stress (Zhao et al. 1996).
Hemoglobins have been found in many bacteria as well. Like the yeast protein, the flavohemoglobins from Escherichia coli (Vasudevan et al. 1991), Bacillus subtilis (LaCelle et al. 1996) and Alcaligenes eutrophus (Cramm et al. 1994) have two domains, one for binding heme and one for binding a flavin cofactor. The flavohemoglobins from yeast and bacteria form a distinct clade (Fig. 2). The three-dimensional structure has been determined for the Alcaligenes flavohemoglobin (Ermler et al. 1995). The structure corresponds to the classical globin fold, demonstrating the homology between the bacterial and eukaryotic hemoglobins. The Alcaligenes flavohemoglobin has been implicated in catalyzing a reduction reaction, transferring a hydride ion from NADH to FAD and then the two electrons, via the heme, to a still-unknown substrate (Ermler et al. 1995). The hemoglobin in the sliding bacterium Vitreoscilla is not fused with a flavoprotein domain (Wakabayashi et al. 1986), but it falls within the clade with the flavohemoglobins (Fig. 2). Its three-dimensional structure also conforms to that of the globin fold (Tarricone et al. 1997). Like the Bacillus flavohemoglobin and many other hemoglobins, and in contrast to the regulation of the yeast hemoglobin, it is induced when cells are grown in low-oxygen (hypoxic or anaerobic) conditions (Dikshit et al. 1990). Its ability to complement deficiencies of terminal cytochrome oxidases in E. coli suggests that this hemoglobin can receive electrons during respiration (Dikshit et al. 1992). A hemoglobin encoded within a nif operon in the cyanobacterium Nostoc commune (Potts et al. 1992) is similar to the hemoglobins found in the unicellular eukaryotes Chlamydomonas, Tetrahymena and Paramecium.
Thus, hemoglobins are found in virtually all kingdoms of organisms, including eubacteria, unicellular eukaryotes, plants and animals. No hemoglobins have been reported in the archaebacteria to date, but it would be surprising if they were truly absent. Although the hemoglobin genes have been diverging for an extremely long time, as much as 1800 million years, relationships among them can be analyzed by multiple alignment of amino acid sequences of the encoded proteins followed by construction of phylogenetic trees (Table 2). There is a consistent function associated with many of the hemoglobins – effective transport of oxygen. This can be achieved by making very large amounts of hemoglobin in particular cells (e.g. erythrocytes) for transport through the body or in specialized locations requiring intense respiration (e.g. nitrogen-fixing root nodules). Alternatively, smaller amounts of proteins, such as myoglobin or the nonsymbiotic hemoglobins in plants, can be made in virtually all cells, perhaps providing an efficient intracellular oxygen-delivery system to the respiring mitochondria and chloroplasts (Wittenberg and Wittenberg, 1987). This latter function appears to be very widespread in the biosphere, being found in plants, animals and algae. When the respiratory electron transport system is not in a separate intracellular compartment, as in bacteria, some species still utilize hemoglobin under hypoxic conditions, perhaps to provide oxygen as the terminal electron acceptor. Indeed, the flavohemoglobins appear to catalyze redox reactions, with the heme playing a direct role in electron transfers as it does with the cytochromes. Perhaps these latter proteins provide clues about the function of the ancestral hemoglobins.
Common origins for many hemoproteins
Is there an evolutionary connection between the hemoglobins, the cytochromes that pass electrons down an energy gradient in respiration and in photosynthesis, and other hemoproteins that catalyze oxidations? Keilin (1966) suggested some time ago that hemoglobins may have evolved from heme enzymes that utilize oxygen (discussed in Riggs, 1991). Irrespective of the relationships among the proteins that bind them, it is likely that metal-bound porphyrin rings or related compounds were present at the time that photosynthesis evolved; indeed, they may have been utilized then as now in capturing light energy. One can speculate on interesting scenarios for the use of hemoproteins once oxygen appeared via photosynthesis. Given the capacity of oxygen to damage various cellular components, oxygen-binding hemoproteins may have functioned initially to protect cells from this reactive species. Once the utility of oxygen as an electron acceptor was realized in the evolution of respiratory chains, hemoproteins could serve as electron-transfer agents (leading to contemporary cytochromes) and oxygen-bound hemoproteins could serve as the terminal electron acceptors. Further gene duplications and divergence would allow the capacity to catalyze other redox reactions to evolve. The intracellular oxygen-transport properties may have arisen from a need to scavenge scarce oxygen to provide it for the respiratory chain (leading to contemporary myoglobins and nonsymbiotic hemoglobins). In multicellular organisms, the oxygen-scavenging hemoglobins could evolve into the abundant hemoglobins now used to transport oxygen.
Primary sequence relationships may not be particularly useful in testing these proposed connections, since the ancestral amino acid sequence may have diverged beyond recognition after billions of years of evolution. A more useful guide will be the determination of three-dimensional structures by X-ray crystallography (Table 2). In this regard, it is notable that the light-harvesting biliprotein C-phycocyanin, from the cyanobacterium Mastigocladus laminosus, has a three-dimensional structure very similar to that of a globin (Schirmer et al. 1985), suggesting a common ancestry (Fig. 3). Although this is not a heme-binding protein per se, it does bind a linear tetrapyrrole pigment derived from heme. Heme binds between two α-helices, coordinated to histidine, in proteins as diverse as lignin peroxidase (Edwards et al. 1993) and cytochrome b562 (Mathews et al. 1979), but the topology of these helices differs from the globin fold. Is this divergent or convergent evolution? As more structures are determined, it will be highly instructive to see which distant relationships among hemoproteins will be confirmed and to determine how far the superfamily of hemoglobin genes reaches.
MECHANISMS OF REGULATING HEMOGLOBIN GENE EXPRESSION
Response to O2
Given the differing roles of the hemoglobin proteins described above, it is not surprising that the regulation of the genes encoding them can differ dramatically. Since all these genes appear to be descended from a common ancestral gene (and hence are orthologous), the variations in regulatory mechanisms present examples of alterations of control sequences during evolution that allow pre-existing protein-coding genes to be adapted to different functions. In some cases, the regulatory changes and evolutionary distance may be so large that no remnant of the ancestral state is left to guide inferences from sequence alignments. For example, consider the regulation of various hemoglobin genes by O2 levels. Production of many of the known hemoglobins is induced by low O2 concentrations, as may be expected for proteins used for O2 transport. However, the mechanism can be direct, as in several bacteria, or quite indirect, as it is in mammals (Fig. 4). Expression of the bacterial hemoglobin gene from Vitreoscilla (Dikshit et al. 1990) and the flavohemoglobin gene from Bacillus subtilis (LaCelle et al. 1996) is induced at low O2 concentrations, but different proteins have been implicated in the regulation of these bacterial genes. The common anaerobic regulator FNR can be used to regulate positively expression of the Vitreoscilla hemoglobin gene (Joshi and Dikshit, 1994) and possibly of the Alcaligenes flavohemoglobin gene (Cramm et al. 1994). FNR, the fumarate nitrate reduction protein, induces expression of a large number of genes when O2 concentration is low and electron transport switches to alternative electron acceptors such as fumarate and nitrate. The O2 sensor in this case is well understood (Rouault and Klausner, 1996); under normal O2 conditions, FNR is an apo-protein with no iron–sulfur (Fe–S) cluster but, when O2 levels are low (i.e. anaerobic conditions), a 4Fe–4S cluster is formed. The FNR protein with the 4Fe–4S cluster is an active transcriptional regulator that induces expression of many genes, including the Vitreoscilla hemoglobin gene. A two-component regulatory system involving the ResD and ResE proteins has been implicated in the anaerobic induction of the flavohemoglobin gene hmp from Bacillus subtilis (LaCelle et al. 1996). As is characteristic of two-component regulatory systems in bacteria, one protein, ResD, is the response regulator and the other, ResE, is the histidine protein kinase that transduces a signal. The ResD/ResE system was discovered recently (Sun et al. 1996), and currently both the nature of the oxygen sensor and the mechanism of induction of the target genes are unknown. The FNR protein is also involved in regulation of the B. subtilis hmp gene, but indirectly via its role in increasing levels of nitrite (LaCelle et al. 1996).
Like these bacterial genes, vertebrate hemoglobin production is also increased under conditions of hypoxia, but by the indirect mechanism of increased erythropoiesis (Fig. 4). The low O2 concentration is actually sensed by cells in the kidney and liver, where it signals an increase in production of the hormone erythropoietin via response elements in a 3₲ enhancer and the promoter of the gene (reviewed in Huang et al. 1997). Erythropoietin then acts on the erythroid progenitor cells in the bone marrow to increase proliferation, to stimulate further erythroid differentiation and to block apoptosis (Mason-Garcia and Beckman, 1991; Witthuhn et al. 1993; Migliaccio et al. 1996). This then leads to increased production of erythrocytes, each carrying abundant hemoglobin. Thus, the O2-sensing system in vertebrates does not act directly on the hemoglobin genes, but rather acts on a hormone gene in a different tissue, eventually leading to an increase in the number of cells carrying hemoglobin. This may be viewed as an elaborate adaptation to the need for carrying oxygen to the many tissues in the body of a vertebrate, whereas hemoglobin regulation in bacteria only needs to serve the requirements of a single cell.
Much information has been gathered about the proteins and events that lead to the increased production of erythropoietin in the hepatoma cell line Hep3B. The protein HIF1 (hypoxia induction factor 1) plays a key role in increasing the production of both erythropoietin and other proteins that respond to hypoxic stress (Wang and Semenza, 1993). HIF1 is a heterodimer (Wang et al. 1995), and the primary regulation is exerted on the synthesis and stability of the HIF1α subunit (Huang et al. 1996). The HIF1β subunit is identical to the ARNT protein, the aryl hydrocarbon receptor nuclear transport protein; its concentration does not change in response to O2 (Wang et al. 1995). However, the HIF1α subunit is synthesized under low-O2 conditions in a process that appears to be under translational control. Under normal-O2 conditions, the concentration of the HIF1α subunit declines rapidly, leading to a decrease in erythropoietin production. The protein HNF4 also plays a critical positive role in the tissue-specific and hypoxia-inducible expression of the erythropoietin (Epo) gene (Galson et al. 1995). The nature of the O2 sensor in kidney and liver cells is still not defined, although some studies have implicated a hemoprotein in this process (Goldberg et al. 1988; Huang et al. 1997). In particular, one model is that the oxidation state of the Fe determines the conformation of the heme and that the conformation with reduced Fe signals the pathway leading to increased erythropoietin production.
The flavohemoglobin gene YHB1 from the yeast Saccharomyces cerevisiae is also regulated by oxygen, but in the opposite way – it is induced by high levels of O2. The HAP1 and HAP2/3/4 proteins have been implicated in this aerobic induction (Crawford et al. 1995); these proteins respond to intracellular heme concentrations. The yeast HAP2 and HAP3 proteins are homologous to two subunits of the heteromeric CCAAT-binding protein CP1 (also known as NF-Y and CBF), which are implicated in activated expression of all the mammalian globin genes (discussed below). Thus, despite the dissimilarities in the oxygen response and the greater complexity of the mammalian mechanism, homologous transcription factors are implicated in the regulation of homologous genes in yeast and mammals.
Little commonality is obvious from these disparate regulatory systems, but in only one example has a protein with a demonstrated capacity to regulate gene expression by sensing O2 levels, the FNR protein, been placed in the pathway. Proteins containing Fe–S clusters can respond reversibly to changes in oxidative conditions, whether by increasing the stability of a 4Fe–4S cluster in the case of FNR and the mammalian IRP1, or iron response protein (Rouault and Klausner, 1996), or by changing the conformation of a stable 2Fe–2S cluster in the SoxR protein. This latter protein binds to cognate sites in the promoter of the soxS gene under both anaerobic and aerobic conditions, but it changes its conformation (and apparently that of the promoter) under aerobic conditions to increase expression of the soxS gene (Hidalgo et al. 1997). The resulting increase in concentration of the SoxS protein induces expression of many genes involved in protection from oxidative stress. It is tantalizing to speculate that the initial monitors of O2 levels (O2 sensors) could be Fe–S proteins in many of the regulatory systems discussed here. The hemoproteins implicated in induction of erythropoeitin in mammals and YHB1 in yeast could be acting downstream of the initial sensor. Further studies should test this possibility. It should be noted that oxyR regulation in E. coli responds to low oxygen concentrations without the involvment of an Fe–S cluster protein, so other types of oxygen sensor molecules are known (Storz et al. 1990). Even if Fe–S proteins are implicated more broadly in O2 sensing, the large evolutionary distance between the species examined may preclude a clear determination of any ancestral relationships.
In contrast, evolutionary approaches provide a means to analyze the substantial amount of information available about the regulatory elements of hemoglobin genes in plants and animals. These show distinctive features that illustrate how DNA sequences have evolved to allow different homologous coding sequences (hemoglobin genes) to be expressed in different tissues, at different stages of development and at differing levels.
Plant hemoglobins
The leghemoglobin genes are expressed only in the nitrogen-fixing root nodules after the symbiotic bacteria have invaded, and they are expressed at high levels. In constrast, the nonsymbiotic hemoglobin genes are expressed in all tissues examined but at lower levels. The currently defined promoter elements for the two classes of gene are shown in Fig. 5. The promoters for leghemoglobin genes have ‘nodulin boxes’ that are critical for nodule-specific expression (Ramlov et al. 1993; Szczyglowski et al. 1994), but the promoters of genes for the nonsymbiotic plant hemoglobins lack this motif, having instead their own common conserved motifs (Andersson et al. 1996). One model for the role of these nodulin boxes is that specific activator proteins bind to these sequences in nodules, leading to high levels of expression of the leghemoglobin genes. Research into the regulation of the nonsymbiotic hemoglobin genes is at an early stage, and it will be most informative to investigate any similarities with the regulation in other species. The analysis of upstream promoter sequences has been useful in these plant hemoglobin genes (Table 2; Fig. 5).
Paradoxes in vertebrate globin evolution: α-globin versus β-globin gene regulation
An enormous amount of research has been devoted to understanding the regulation of vertebrate hemoglobin genes, including tissue- and developmental-stage-specificity and balanced production of the globin chains. Given the descent of α- and β-globin genes from a common ancestor (Figs 3, 6), one might have thought that their coordinated and balanced expression to produce the heterotypic tetramer α2β2 in erythrocytes would be the easiest aspect of regulation to explain. Since the two genes would have been identical after the initial duplication, with identical regulatory elements, one might have expected selection to keep the regulatory elements very similar. However, much has changed between the α-like and β-like globin gene clusters since their duplication. They are now on separate chromosomes in birds and mammals, and in mammals they are in radically different genomic contexts (Fig. 7). The β-globin gene clusters have an A+T content comparable to the bulk of mammalian DNA, they have no CpG islands, and the locus is in a DNAase-accessible, ‘open’ chromatin conformation only in erythroid cells, where the genes are expressed (reviewed in Collins and Weissman, 1984). In contrast, the α-like globin gene clusters are highly G+C-rich, they have a CpG island associated with each active gene, and the locus is in a constitutively ‘open’ chromatin conformation in all cells (Craddock et al. 1995). Tissue-specific gene expression is frequently correlated with an increased accessibility of the chromatin in a locus only in expressing cells, but this is not the case for the α-like globin genes of mammals.
Despite all these differences, expression of the α-globin and β-globin genes is appropriately balanced in erythroid cells, apparently by rather different mechanisms. Fig. 5 shows some of the better-characterized protein-binding sites in the proximal regulatory elements (roughly the 200 base pairs 5₲ to the cap site of the genes). Comparing the human β- and α-globin genes (Efstratiadis et al. 1980), one sees the TATA motif, to which the general transcription factor TFIID binds, and the CCAAT motif, to which trans-activators such as CP1 can bind. However, other protein-binding sites are quite different. In particular, the 5₲ flanking region and much of the α-globin gene are contained within a CpG island, which contains notable binding sites for Sp1, a relative of Sp1 called αIRP, and other less well-characterized proteins (Barnhart et al. 1988; Kim et al. 1988; Yost et al. 1993; Rombel et al. 1995). Aside from the TATA and CCAAT motifs, the protein-binding sites are completely different in the proximal regulatory region of the human β-globin gene. The CpG island encompassing the 5₲ flanking region and much of the gene is a key component of the cis-regulatory elements for the α-globin gene of rabbits and humans, possibly through its effects on chromatin structure (Pondel et al. 1995; Shewchuk and Hardison, 1997), but no CpG island is found at any of the β-like globin genes.
As can be seen in Figs 2 and 3, the mammalian α- and β-globin genes are relatively close to each other on a phylogenetic scale that includes many taxa (animals, plants, fungi and bacteria), but this time frame is in fact quite long relative to the evolution of regulatory elements. Currently, the coordinated regulation of α- and β-globin genes is more paradoxical than clear. Differences are even seen between the distal elements that regulate the α- globin and β-globin gene clusters, called the locus control region (or LCR) for β-globin genes (reviewed in Grosveld et al. 1993) and HS-40 for the α-globin genes (Higgs et al. 1990). As illustrated in Fig. 8, one powerful enhancing region of the β-globin LCR, called DNAase hypersensitive site 2, or HS2, and α-globin HS-40 are each composed of binding sites for transcription factors NFE2 (a member of the AP1 family), GATA1 or its relatives and a family of proteins that bind to the DNA sequences that include a CACC motif, generically referred to as CACBPs (Talbot et al. 1990; Jarman et al. 1991). In both cases, these distal regulatory sites cause a large increase in the level of expression of the target genes (Fig. 7). However, the similarities appear to end there. HS2 serves as an enhancer within the context of a much larger β-globin LCR, which also acts to open the chromatin over a discrete locus in erythroid cells, whereas HS-40 is currently the only characterized erythroid-specific, distal regulator, and it enhances globin gene expression within a locus that is part of a large block of constitutively active chromatin, which also includes several ubiquitously expressed genes. Thus, domain opening does not appear to play a role in regulation of α-globin genes (Craddock et al. 1995), but it is a key initial step in the regulation of β-globin genes (Groudine et al. 1983; Forrester et al. 1990).
Further insights into the evolution of coordinated expression between α- and β-globin genes may be gleaned by further analysis of the globin gene clusters in the amphibian Xenopus (Hosbach et al. 1983) or the zebrafish Danio rerio (Chan et al. 1997), in which the α-globin genes and β-globin genes are still closely linked. For instance, it would be very helpful to know the location and composition of the LCR in these cases.
Avian and mammalian β-globin genes
Since comparisons of mammalian α- and β-globin genes, whose ancestors diverged early in the vertebrate lineage approximately 450 million years ago, show more differences than similarities, one might expect comparisons over a shorter phylogenetic distance to reveal information about regulation, such as alignments within a gene lineage but between families of vertebrates (Fig. 6). Extensive sequences are available for both mammalian (e.g. human) and avian (chicken) β-globin gene clusters, and in both cases the genes are expressed only in erythroid cells in a developmentally regulated manner. One might anticipate common aspects of regulation, and in general this is true. Both human and chicken β-globin gene clusters are in an ‘open’ chromatin domain only in erythroid cells, and DNAase hypersensitive sites (HSs) appear in the promoters only at the developmental stage at which the gene is expressed (reviewed in Evans et al. 1990; Felsenfeld, 1993). Thus, alterations in both overall and specific chromatin structure are critical to the regulation of both gene clusters. In both species, the expression of the genes is controlled by both distal and proximal regulatory sequences. One may further anticipate that these common features of gene organization and regulation would be reflected in sequence comparisons, but that is largely not the case.
A comparison of the complete sequences of the chicken (Reitman et al. 1993) and human (Collins and Weissman, 1984) β-globin gene clusters shows a simple and somewhat disappointing pattern (Fig. 9, lower panel). The sequence matches are restricted to portions of the protein-coding regions. Each β-related globin gene in humans is equally distant from each of the β-related globin genes in chickens, e.g. the human ε-globin gene is no more closely related to the chicken ε-globin gene than it is to the adult βA-globin gene, despite having the same name. This suggests that the series of gene duplications and divergences that gave rise to the β-globin gene clusters occurred independently in the lineages to the ancestral mammal and to the ancestral bird, and that is consistent with the inferences drawn from phylogenetic reconstructions based on the amino acid sequences of the proteins (Fig. 6).
No statistically significant alignments are seen in the promoter regions or in the distal control elements. The chicken β/ε enhancer shows no striking matches to any portion of the human β-globin gene cluster, nor does the human β-globin LCR resemble any of the sequences around the 5₲ HSs in the chicken gene cluster. At least for this particular gene cluster, the comparison between birds and mammals is too distant to discern candidates for cis-regulatory sequences by pairwise sequence alignments. The situation may be different for the α-globin gene clusters, since the human ζ-globin and avian π- globin genes are orthologous and restricted to embryonic erythroid expression (Proudfoot et al. 1982). A complete DNA sequence of the chicken α-globin gene cluster including the region comparable to HS-40 will be highly informative.
Even though the pairwise alignments did not reveal sufficiently long matching segments to be significant, a comparison of the protein-binding sites in known regulatory regions shows that some of the same proteins are used in both species. For instance, the chicken β/ε enhancer has binding sites for AP1/NFE2, a CACBP and GATA1 (Evans et al. 1990), strikingly similar to the array in HS2 of the human β-globin LCR and the α-globin HS-40 (Fig. 8). Alignments of long DNA sequences are not expected to be able to identify single, isolated bindings sites simply from the similarity score. A typical protein-binding site is usually 6–8 base pairs long, and some variation in the binding sites can occur without affecting binding affinity (and hence could be tolerated even in a region under strong selection). Thus, a functional binding site could comprise a string of as few as six nucleotides, only four of which match in a pairwise comparison. Strings meeting that criterion occur randomly at too high a frequency to allow one to distinguish functional binding sites from random matches. However, a group of conserved binding sites could score as a significant alignment if the order and spacing are also conserved. In the three cases discussed here, the order NFE2/AP1–CACBP–GATA1 is the same, but the spacing differs. The inability to detect similar regulatory regions between chicken and human using nucleotide identities as the basis for a similarity score illustrates the need for the development of software that identifies all potential protein-binding sites and searches for similar patterns within these binding sites. This also has been espoused as a good approach for analyzing sequences between the pufferfish Fugu and humans (Aparicio et al. 1995).
The general result is that homologous proteins are playing important, and probably similar, roles in the regulation of the the β-globin gene clusters in birds and mammals, even though the pairwise alignments do not reveal these as conserved elements. Hence, the comparison of transcription factor binding sites is more informative than the alignment of non-coding DNA sequences (Table 2).
Mammalian γ- and β-globins: phylogenetic footprinting and differential phylogenetic footprints
In contrast to the previous comparisons, the detailed study of globin gene clusters in many mammalian species has provided a rich resource of information from which to glean further insight not only into the evolution of the gene clusters but also into their regulation. The β-globin gene clusters have been extensively studied in human, the prosimian galago, the lagamorph rabbit, the artiodactyls goat and cow, and the rodent mouse. Diagrams of these gene clusters are shown in Fig. 10, and aspects of their evolution and regulation have been reviewed (Collins and Weissman, 1984; Goodman et al. 1987; Hardison, 1991; Hardison and Miller, 1993). The ε-globin gene is at the 5₲ end of all the mammalian globin gene clusters and is expressed only in embryonic red cells. In most species, expression of the γ-globin gene is also limited to embryonic red cells, but in anthropoid mammals its expression continues and predominates in fetal red cells. The appearance of this new pattern of fetal expression of the γ-globin genes coincides roughly with the duplication of the genes in primate evolution, which leads to the hypothesis that the duplication allowed the changes that caused the fetal recruitment (Fitch et al. 1991). The β-globin gene is expressed after birth in all mammals, but in galago, mouse and rabbit its expression initiates and predominates in the fetal liver (arguing that fetal expression of the β-globin gene is the ancestral state). The recruitment of γ-globin genes for fetal expression in anthropoid primates is accompanied by a corresponding delay in expression of the β-globin gene.
Both the invariant patterns in gene regulation in mammals as well as the changes in expression pattern of the anthropoid γ- and β-globin genes should be reflected in changes in cis-acting DNA sequences regulating the expression of the genes. The invariant patterns should be reflected in DNA sequences in the control regions that change very slowly over evolutionary time, which are recognizable as conserved sequence blocks or phylogenetic footprints. Comparisons of the DNA sequences of entire globin gene clusters reveal regions of high similarity extending for over 1000 base pairs 5₲ to the orthologous genes and in long regions throughout the distal LCR (Li et al. 1990; Hardison and Miller, 1993; Hardison et al. 1997b; Slightom et al. 1997), as illustrated for the human–mouse sequence comparison in Fig. 9 (top panel).
The LCR has been implicated in opening the chromatin in the β-globin gene domain to allow or stimulate high-level gene expression (Fig. 7), and the 5₲ flanking regions contain the proximal promoters (to approximately −100) plus upstream regulators involved in induction and silencing (reviewed in Stamatoyannopoulos and Nienhuis, 1994). Thus, the overall pattern of sequence conservation outside the coding regions includes most of the currently characterized regulatory regions, although in many cases the sequence conservation extends beyond them, indicating the possibility that more will be discovered. This general pattern of sequence conservation between eutherian mammals (human and mouse in particular) localized to known or testable regulatory regions is seen for several other mammalian loci, demonstrating the utility of this general approach (Hardison et al. 1997a).
Some of the protein-binding sites flanking the globin genes are shown in more detail in Fig. 5. The TATA box (the binding site for TFIID), the CCAAT box (the binding site for CP1 and other families of proteins) and the CACC box (the binding site for EKLF) were initially recognized as conserved regions (Efstratiadis et al. 1980; Lacy and Maniatis, 1980), as was the DRE in mammalian β-globin genes (Stuve and Myers, 1990). Important sequence motifs at approximately −160 were noted as conserved (Hardison, 1983) prior to the discovery of the proteins binding to them, such as the GATA1-binding site in the promoters for ε- and γ-globin genes (Tsai et al. 1989; Gong et al. 1991) and the BB1-binding site in the promoter for β-globin genes (Antoniou et al. 1988; Macleod and Plumb, 1991). In other cases, such as the AP1/NFE2 binding sites and the GATA1 sites in the locus control regions and enhancers of globin genes, specific binding and evidence of conservation were discovered at approximately the same time (deBoer et al. 1988; Ney et al. 1990; Talbot et al. 1990). Even in extensively studied regions, such as HS2 of the LCR, a conserved E box sequence was the initial observation (Hardison et al. 1993) that led to the recent discovery of the importance of this region in full enhancement by this element, via the action of basic helix–loop–helix proteins such as TAL1 and USF and/or a novel factor called HS2NFE5 (Lam and Bresnick, 1996; Elnitski et al. 1997). Many of the protein-binding sites in the proximal regulatory regions of the human globin genes are conserved in the orthologous genes in all mammals examined, in keeping with important roles in regulation and illustrative of the power of the phylogenetic approach.
Differences in the patterns of expression can be analyzed by a differential phylogenetic footprinting approach (Gumucio et al. 1994). One striking example is a region in the γ-globin gene promoter that is conserved in anthropoid primates, but is different in other mammals. This is a binding site for a factor called the stage-selector protein, or SSP (Fig. 5), that has been implicated in the differential expression of γ-globin and β-globin genes (Jane et al. 1992). SSP is a heterodimer (Jane et al. 1995) between CP2 (Lim et al. 1992) and some other protein. Interestingly, the NFE4 protein implicated in the stage-specific expression of the chicken β-globin gene (Foley and Engel, 1992) also contains CP2 as part of a heterodimer (Jane et al. 1995). Thus, this approach of looking for patterns of conservation consistent with differential gene regulation in eutherian mammals has indeed led to the discovery of a protein that is probably involved in that differential expression.
Another level of differential analysis compares the proximal regulatory regions of genes expressed at different times of development, i.e. mammalian ε-, γ- and β-globin genes. As illustrated in Fig. 5, most of the protein-binding sites are different in these promoters. For instance, βDRF, EKLF and BB1BP have been implicated only in the regulation of the β-globin gene (Evans et al. 1990). A similar but distinctive CACC motif is found in a comparable position in the 5₲ flank of all three genes, but EKLF is active only at the β-globin CACC box (Donze et al. 1995; Perkins et al. 1995), leaving open the important possibility that other CACBPs, perhaps active only at one developmental stage, are regulating γ- and ε-globin genes. A GATA-binding site is conserved at approximately the same position in both γ- and ε-globin gene 5₲ flanking regions, but the comparable region for β-globin does not have a conserved GATA site (Hardison et al. 1994). Even conserved DNA sequence motifs with very similar sequences may serve as binding sites for different proteins. A CCAAT motif located at approximately −80 in all the vertebrate globin gene promoters can be bound by a heteromeric complex called CP1, NF-Y or CBF (Hooft van Huijsduijnen et al. 1990, and references therein). However, preparations of CP1 bind much more strongly to the CCAAT box in the α-globin gene promoter than in the β-globin gene promoter (Cohen et al. 1986). Also, multiple additional proteins bind to the CCAAT box, some of which have been implicated in the activation of β-globin gene expression (deBoer et al. 1988; Delvoye et al. 1993).
Thus, non-coding DNA sequence alignments of these groups of orthologous genes in different eutherian mammals reveal protein-binding sites important for regulated expression (Table 2). The differences in the arrays of proteins functioning at ε-, γ- and β-globin genes indicate that a distinctive battery of proteins functions in the promoter for each type of gene. Indeed, this implication is consistent with the observation that cis-acting sequences needed for stage-specific regulation of expression map close to the genes (Trudel and Costantini, 1987, and references therein). In the context of the evolutionary tree shown in Fig. 6, these results show that, for the globin gene clusters, comparisons among orthologous genes that share a common ancestor early in the eutherian lineage are useful for revealing conserved cis-regulatory sites (mostly protein-binding sites). However, comparisons between paralogous genes resulting from gene duplications as recently as the divergence between the ancestor to both β-globin and ε-globin genes in the mammalian lineage do not reveal common regulatory elements. Hence, it is not surprising that sequence comparisons between genes whose ancestor diverged even earlier, e.g. mammalian versus avian β-globin genes, or mammalian α-versus β-globin genes, do not reveal matches in regulatory elements.
These conclusions apply equally well to distal regulatory elements such as the LCR. Comparisons of orthologous sequences among eutherian mammals show highly conserved sequences throughout the β-globin LCR, and these correlate precisely with regulatory elements (reviewed in Hardison et al. 1997b). Some parts of the LCR, in particular those homologous to DNAase hypersensitive sites 1, 2 and 3, are conserved in the marsupial and monotreme mammals (R. Hope, R. Baird, J. Kulibawa and M. Goodman, personal communication). As the conservation of the LCR is mapped even more deeply on an evolutionary tree, the issue of its origin comes into tighter focus. This important element has been implicated in initiating and maintaining an open chromatin domain in the otherwise highly repressed nucleus of erythroid cells. Thus, is may have arisen around the time that vertebrate erythrocytes evolved to carry large amounts of hemoglobin. One might expect similar sequences to be detectable in many vertebrates but, in fact, homologous sequences were not seen even in the avian–mammalian comparison, despite the fact that functionally analogous DNA sequences are known. Since homologous LCRs are seen early in the mammalian lineage (prior to the eutherian–metatherian split), but they are not detected in avians, the question arises as to whether the avian (or mammalian) LCRs have been rearranged to the point where they are no longer detectable in these comparisons or whether the distance is just too great. Analysis of more intermediate species would help answer this question.
Identification of conserved and differentially conserved sequences is an important guide to identifying cis-regulatory sequences and is helpful in finding the many proteins involved in regulation of the genes. This approach, plus many studies on the function of these sequences, has revealed a highly complex array of regulatory sequences and proteins. However, sequence analysis provides little insight into the important issues of how these DNA sequences, with proteins bound, work together to accomplish the several levels of regulation. Fundamental questions about the mechanism of action of the LCR in domain opening, the identification and mechanisms of proteins required for developmental control and the possibility of interaction between the promoters and the LCR (and hence possible competition as a mechanism for regulation) remain unresolved (Tuan et al. 1992; Martin et al. 1996; Wijgerde et al. 1996) and require further study.
CONCLUDING REMARKS
DNA and protein sequence comparisons and alignments allow one to apply the principles of evolutionary biology to learn much about genes and their regulation, but one needs to compare sequences from species or genes separated by an appropriate distance to obtain useful information. In some cases, the amino acid sequences may be so different that comparisons of three-dimensional structures are needed to deduce truly ancient relationships, e.g. among different hemoproteins such as a hemoglobin, ligninases, cytochromes, etc. Comparisons of the amino acid sequences of proteins are highly informative within a family, with members ranging from bacteria to mammals. Comparisons of gene structure are clearly informative for globin genes only from the ancestor to plants and animals, although the information in gene structures from protists needs more analysis. Deductions on gene regulation based on sequence analysis between different vertebrate families, such as birds and mammals, may need the development of new software analyzing protein-binding sites. Alignments of non-coding DNA sequences in a group of mammals are highly informative about regulatory elements, but similar analyses between birds and mammals have been uninformative, at least for the β-globin gene cluster. Thus, depending on the type of question being asked, sequence or structural comparisons will be informative, but the appropriate phylogenetic distance needs to be employed. The choice of species will probably need to be varied for different loci, given the differences in evolutionary rates for various loci, but for many mammalian loci, comparisons between human and mouse are highly informative for studies of regulatory regions.
ACKNOWLEDGEMENTS
Work from this laboratory was supported by PHS grants 1RO1 DK27635, 1RO1 LM05773 and 1RO1 LM05110. I thank Dr W. Miller for Fig. 9.