Current Biology, Vol. 14, 1397–1404, August 10, 2004, ©2004 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.cub.2004.07.029

High Coding Density on the Largest Paramecium tetraurelia Somatic Chromosome

Marek Zagulski,1 Jacek K. Nowak,1 Anne Le Mouël,2 Mariusz Nowacki,1,2 Andrzej Migdalski,1 Robert Gromadka,1 Benjamin Noël,3 Isabelle Blanc,3 Philippe Dessen,4 Patrick Wincker,5 Anne-Marie Keller,3 Jean Cohen,3 Eric Meyer,2,* and Linda Sperling3,*

1 Institute of Biochemistry and Biophysics, DNA Sequencing Laboratory, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland
2 Laboratoire de Génétique Moléculaire, ENS, 46 rue d'Ulm, 75005 Paris, France
3 Centre de Génétique Moléculaire, CNRS, 91198 Gif-sur-Yvette cedex, France
4 Institut Gustave-Roussy, 39 rue Camille Desmoulins, 94805 Villejuif cedex, France
5 Genoscope - Centre National de Séquençage, 2 rue Gaston Crémieux, CP5706, 91057 Evry cedex, France

*Correspondence: [email protected] (E.M.); [email protected] (L.S.)

Summary

Paramecium, like other ciliates, remodels its entire germline genome at each sexual generation to produce a somatic genome stripped of transposons and other multicopy elements [1]. The germline chromosomes are fragmented by a DNA elimination process that targets heterochromatin to give a reproducible set of some 200 linear molecules 50 kb to 1 Mb in size [2]. These chromosomes are maintained at a ploidy of 800n in the somatic macronucleus and assure all gene expression. We isolated and sequenced the largest megabase somatic chromosome in order to explore its organization and gene content. The AT-rich (72%) chromosome is compact, with very small introns (average size 25 nt), short intergenic regions (median size 202 nt), and a coding density of at least 74%, higher than that reported for budding yeast (70%) or any other free-living eukaryote. Similarity to known proteins could be detected for 57% of the 460 potential protein coding genes. Thirty-two of the proteins are shared with vertebrates but absent from yeast, consistent with the morphogenetic complexity [3] of Paramecium, a long-standing model for differentiated functions shared with metazoans but often absent from simpler eukaryotes [4]. Extrapolation to the whole genome suggests that Paramecium has at least 30,000 genes.
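The whole-genome extrapolation can be illustrated with a short calculation. The sketch below is not the authors' procedure; it simply scales the 460 genes found on the 984,602 nt chromosome to the approximately 75 Mb macronuclear genome size quoted later in the text, under the assumption (ours) that gene density is uniform across all macronuclear chromosomes. The function name is hypothetical.

```python
# Figures quoted in the paper.
CHROMOSOME_BP = 984_602              # finished megabase chromosome sequence
CHROMOSOME_GENES = 460               # predicted protein-coding genes
MACRONUCLEAR_GENOME_BP = 75_000_000  # approximate (~75 Mb) macronuclear genome


def extrapolate_gene_count(genes: int, region_bp: int, genome_bp: int) -> float:
    """Scale a regional gene count to a whole genome.

    Assumption (not from the paper): gene density is uniform, so the
    count scales linearly with sequence length.
    """
    return genes * genome_bp / region_bp


estimate = extrapolate_gene_count(
    CHROMOSOME_GENES, CHROMOSOME_BP, MACRONUCLEAR_GENOME_BP
)
print(f"Naive whole-genome estimate: about {estimate:,.0f} genes")
```

Under this uniform-density assumption the estimate lands somewhat above the conservative lower bound of 30,000 genes stated in the text.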

Results and Discussion

The 50–60 germline chromosomes of the ~100 Mb Paramecium genome are reproducibly rearranged during macronuclear development, with elimination of 10%–15% of the DNA and amplification to high copy number. Numerous (~60,000) short unique copy elements (internal eliminated sequences or IES) are precisely excised, which is necessary for gene expression since most genes are interrupted by IESs [5]. Transposons and other repeated sequences are removed by an imprecise mechanism leading to fragmentation of the germline chromosomes into smaller macronuclear chromosomes, which are healed by de novo telomere addition [2]. Since several rounds of endoreplication precede DNA elimination, considerable heterogeneity exists at chromosome ends, even within a single macronucleus. The finished megabase chromosome sequence of 984,602 nt is a single contiguous sequence close to the expected 1 Mb size, and conceptual and experimental restriction digestions with the rare cutter BsiWI are in good agreement (Supplemental Figure S1 online). Telomeric repeats at somatic chromosome ends consist of approximately 30 copies of A3C3 and A2C4 hexanucleotides branched randomly over ~1 kb regions [6]. We mapped reads with telomeric repeats from whole genome shotgun (WGS) primary sequence data to the chromosome. (The megabase shotgun library, made by restriction digestion and tailing, does not contain telomeres. Primary sequence data from the Paramecium WGS project [http://www.genoscope.cns.fr/] is available in the trace archive [http://www.ncbi.nlm.nih.gov/Traces/].) Many reads mapped to the left end of the sequence, but few reads mapped to the right end (data not shown). However, we found tandem repeats of the P126 element, a 126 nt motif related to G protein WD40 repeats, at both ends of the chromosome.
The P126 element was originally identified by Forney and Rodkey [7] in the immediate subtelomeric region of several macronuclear chromosomes, so this is a good indication that the sequence does extend to the telomeric region at both ends and represents an entire macronuclear chromosome. Strikingly, the ratio of observed/expected CpG dinucleotides for the sequence is 0.40, while this ratio is close to 1.0 for the other dinucleotides. DNA methylation on cytosine, which has not been reported in Paramecium, might explain this severe CpG depression. It could occur in the germline micronucleus or at a specific developmental stage, as discussed in the Supplemental Data. Table 1 compares other characteristics of the chromosome with those of Dictyostelium chromosome 2 [8] and of the Plasmodium [9] and yeast [10] genomes. The


Table 1. Megabase Chromosome Characteristics and Comparison to Other Organisms

Characteristic                        | P. tetraurelia | D. discoideum | P. falciparum | S. cerevisiae
Size (bp)                             | 984,602        | 7,520,000     | 22,853,764    | 12,495,682
G+C content overall (%)               | 27.5           | 22.2          | 19.4          | 38.3
  In predicted coding regions (%)     | 29.2           | 28            | 23.7          | NA
  In predicted intergenic regions (%) | 15             | 14            | 13.6          | NA
No. of genes                          | 460            | 2,799         | 5,268         | 5,770
Genes with introns (%)                | 84             | 68            | 54            | 5
Number of introns                     | 1,061          | 3,587         | 7,406         | 272
Number of exons                       | 1,521          | 6,398         | 12,674        | NA
Average number of exons per CDS       | 3.3            | 2.3           | 2.4           | ~1
Average intron length (bp)            | 24.8           | 177           | 178.7         | 287 (a)
Average exon length (bp)              | 482            | 711           | 979           | NA
Average gene length (bp)              | 1,651          | 1,626         | 2,283         | 1,424
Percent coding                        | 74.4           | 60.5          | 52.6          | 70.5
Gene density (kb/gene)                | 2.14           | 2.6           | 4.3           | 2.09
Number of tRNA genes                  | 1 (b)          | 73            | 43            | 275

The characteristics of the megabase DNA sequence were calculated by Artemis. The percent coding is exclusive of introns (with introns, 77.1%). Only three previously characterized genes were detected by a BLASTN homology search of the sequence against the public nucleotide database. The only predicted tRNA gene, with a marginally significant Cove score, is uncertain. The sequence contains no rRNA genes, as expected, since they are tandemly arranged on dedicated rDNA chromosomes. Two inverted DNA repeats, of unknown origin and significance, are separated by ~1.3 and ~22 kb, respectively, for the smaller and the larger repeat; the smaller repeat is GC rich. Smaller repeat: 132845..133153 and 136154..136462, size 308 bp; larger repeat: 267647..268472 and 289000..289826, size 826 bp. Comparison data are for D. discoideum chromosome 2 [8] and for the P. falciparum and S. cerevisiae genomes as presented in [9] and [10]. NA, not available.
(a) Calculated from the 256 spliceosomal introns in Yeast Intron DataBase [40].
(b) Cys, anticodon GCA, Cove score 56.41 (516706..516635).
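Several entries in the P. tetraurelia column of Table 1 can be cross-checked against one another. The sketch below is an illustrative consistency check, not part of the original analysis. It recomputes the gene density, the average number of exons per CDS, and the coding percentage with introns included, using only values quoted in the table; the intron contribution is approximated (our assumption) as number of introns times average intron length.

```python
# Values from the P. tetraurelia column of Table 1.
SIZE_BP = 984_602
N_GENES = 460
N_INTRONS = 1_061
N_EXONS = 1_521
MEAN_INTRON_BP = 24.8
PERCENT_CODING = 74.4  # exclusive of introns

# Gene density: chromosome length divided by gene count, in kb per gene.
gene_density_kb = SIZE_BP / N_GENES / 1000

# Average number of exons per CDS.
exons_per_cds = N_EXONS / N_GENES

# Approximate share of the chromosome occupied by introns, in percent
# (assumption: total intron length = count x mean length).
intron_percent = 100 * N_INTRONS * MEAN_INTRON_BP / SIZE_BP

# Coding percentage when introns are not excluded.
percent_with_introns = PERCENT_CODING + intron_percent

print(f"gene density: {gene_density_kb:.2f} kb/gene")
print(f"exons per CDS: {exons_per_cds:.1f}")
print(f"introns: {intron_percent:.2f}% of the chromosome")
print(f"coding, introns included: {percent_with_introns:.1f}%")
```

The recomputed values agree with the tabulated 2.14 kb/gene and 3.3 exons per CDS, and with the 77.1% figure given in the table legend.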

72.5% AT content, although lower than that of Plasmodium or of Dictyostelium, is one of the highest so far reported for a eukaryotic genome. Gene models were annotated manually and using GlimmerM (Figure 1 and Supplemental Table at http://paramecium.cgm.cnrs-gif.fr/megabase/), resulting in a predicted coding density of 74% (77% if introns are not excluded from the calculation). The only eukaryotic genome reported to have higher coding density is that of the parasite Encephalitozoon cuniculi [11]. It is important to note that few Paramecium genes have been studied to date and no complete, annotated ciliate genome is available. The closest complete genome, that of Plasmodium, is separated by at least 1 billion years, the time at which the alveolates, comprising ciliates and apicomplexan parasites, branched from other eukaryotes [12]. Plasmodium, moreover, has lost genes owing to its parasitic lifestyle. The gene models presented here must therefore be considered preliminary until other data become available to improve gene finder training and to validate the predicted CDS, especially those with limited sequence similarity to known proteins. The introns are smaller in size and more frequent than in Dictyostelium, Plasmodium, or yeast. Indeed, Paramecium introns are among the smallest known, ranging in size from 20 to 35 nt. They contain the canonical GT...AG splice site junctions of spliceosomal introns. Most of the genes on the megabase chromosome (84%) are interrupted by one or more introns, and their size distribution is the same as for the 462 introns annotated during a pilot project of random single-run sequencing of the Paramecium macronuclear genome [13]. Figure 2 shows the size distribution of the intergenic regions; half are smaller than 200 nt, and the intervals are even smaller between convergent genes. It has been shown for one locus on a different chromosome that two convergent genes use the same 26 bp between their respective stop codons, on opposite strands, as 3′ UTR (D. Kobric and R.E. Pearlman, personal communication). The few large intergenic regions have a size distribution similar to that of the annotated megabase genes, suggesting that these regions may contain additional CDSs or pseudogenes. Detection of the only currently annotated pseudogene, PTMB.411c, involved comparison with a series of paralogs found elsewhere in the genome. Alignment revealed that this CDS is indeed an RNA N6-adenine methylase pseudogene, presenting several deletions, mutated splice site junctions, and in-frame stop codons. The paucity of identified pseudogenes may reflect their tendency to be associated with heterochromatin, as in many species [14], and thus to be eliminated during development of the macronucleus: the first pseudogene that was identified in Paramecium is near a telomere and is severely underamplified with respect to the bulk of the macronuclear DNA [15]. It is also possible that the Paramecium genome rapidly loses sequences not required for function, so that pseudogenes do not remain recognizable for long. As shown in Table 2, homologs could be detected for 260 (57%) of the predicted proteins. The evidence was provided by examination of matches against the swissprot+sptrembl protein database, NCBI's Conserved Domain Database, and the InterPro protein domain database, as detailed in Experimental Procedures. If we remove the 30 gene models that we consider uncertain for lack of either sequence similarity with known proteins, structural motifs, or paralogs elsewhere in the Paramecium genome (white boxes in Figure 1), 40% of the predicted proteins remain orphans, presumably representing phylum- or species-specific proteins. Half of the predicted proteins have at least one InterPro protein signature. The ten most frequent InterPro signatures are given in Table 2 along with the percentage

Figure 1. Map of the Megabase Chromosome The map shows the position and the strand for each predicted CDS of the megabase chromosome. Each line represents 100 kb, with tick marks every 10 kb. Asterisks associated with the CDS number indicate a predicted signal peptide. Gene ontology molecular function terms were mapped using the GOA project gene associations for significant BLASTP matches and the interpro2go mappings for InterPro domains, as detailed in Experimental Procedures. The term assignments are distributed as follows: binding, 11; catalytic activity, 97; chaperone activity, 3; enzyme regulator activity, 1; molecular function unknown, 92; motor activity, 1; nucleic acid binding, 18; signal transducer activity, 3; structural molecule activity, 15; transcription regulator activity, 2; transporter activity, 16. If several terms were found for a given gene, the most specific one was adopted (e.g., “nucleic acid binding” rather than “binding” or “catalytic activity” for an RNA helicase).


Figure 2. Size of Intergenic Regions The figure shows a histogram of the size of megabase chromosome intergenic regions (transparent bars with black borders), superimposed on a histogram of the size of the CDSs (gray bars with no borders), between 0 and 3000 nt, with 100 nt bins. There are CDSs larger than 3000 nt, but there are essentially no intergenic regions larger than 3000 nt. The intergenic regions were further classed according to the orientation of the flanking CDSs. The median distance between tandem genes is 198.5 nt; between convergent genes, 144 nt; and between divergent genes, 309 nt. The median CDS size is 1210 nt.
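The classification of intergenic regions by the orientation of their flanking CDSs, as described in the Figure 2 legend, can be sketched as follows. This is a minimal illustration, not the authors' Perl/Bioperl code; the `CDS` record, the function names, and the toy coordinates are all hypothetical, and the gap is assumed to be measured between annotated CDS boundaries.

```python
from statistics import median
from typing import NamedTuple


class CDS(NamedTuple):
    start: int   # 1-based, inclusive, leftmost coordinate
    end: int     # 1-based, inclusive, rightmost coordinate
    strand: str  # "+" or "-"


def classify_pair(left: CDS, right: CDS) -> str:
    """Classify two adjacent CDSs by relative orientation."""
    if left.strand == right.strand:
        return "tandem"
    if left.strand == "+" and right.strand == "-":
        return "convergent"  # the two 3' ends face each other
    return "divergent"       # the two 5' ends face each other


def intergenic_medians(cdss: list[CDS]) -> dict[str, float]:
    """Median intergenic size (nt) for each orientation class."""
    ordered = sorted(cdss, key=lambda c: c.start)
    sizes: dict[str, list[int]] = {"tandem": [], "convergent": [], "divergent": []}
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1
        if gap >= 0:  # skip overlapping annotations
            sizes[classify_pair(left, right)].append(gap)
    return {kind: median(vals) for kind, vals in sizes.items() if vals}


# Toy coordinates, invented for illustration only.
toy = [
    CDS(1, 1000, "+"),
    CDS(1201, 2000, "+"),  # tandem with the previous CDS, gap 200
    CDS(2145, 3000, "-"),  # convergent with the previous CDS, gap 144
    CDS(3310, 4000, "+"),  # divergent from the previous CDS, gap 309
]
print(intergenic_medians(toy))
```

Applied to the real annotation, a calculation of this kind would yield the per-class medians reported in the legend (198.5, 144, and 309 nt).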

of proteins in other organisms that contain each of the domains. As the genes on this chromosome represent only ~1.5% of the organism's gene complement, extrapolation to the whole genome requires caution. Nonetheless, the MORN motif appears to be strikingly more frequent in Paramecium than in the other organisms, while K+ channel pore region domains, protein kinase domains, and metallo-phosphoesterase domains are overrepresented, especially compared to their frequency in unicellular eukaryotes. As previously discussed in the context of the random survey project [13], Paramecium seems to devote an important part of its coding capacity to proteins involved in signaling pathways, which may have evolved as part of the organism's repertoire of responses to environmental challenge. The presence of four K+ channel genes on the chromosome is compatible with the recent prediction that Paramecium contains at least 200 K+ channel genes, significantly more than man [16]. The authors suggest that different sets of K+ channels, which are located in ciliary membranes and involved in ciliary motility, might be expressed under different environmental conditions. Given the size of the intergenic regions on the megabase chromosome (Figure 2), it is likely that many genes have extremely small promoters, so we looked for clusters of functionally related genes that might be cotranscribed using the Gene Ontology (GO) term assignments and the InterPro signatures. We did not find any evidence for clusters of genes involved in the same biological process or pathway. However, we did identify six examples of clusters of two or three paralogous genes (Supplemental Table S2). These paralogous genes undoubtedly originate from quite ancient duplications, given the low percentage of amino acid identity between the proteins (25%–56%). In the case of the cluster of three actin genes (PTMB.200, 201c, 202), each gene has closer homologs not only in the Paramecium genome (which contains at least 20 actin or actin-like genes according to preliminary analysis of the primary sequence data [not shown]), but also in other species.

Table 2. Characterization of Putative CDSs

Feature                                 | Number | Percent
Predicted CDS                           | 460    | 100
Homology to known proteins or domains   | 260    | 56.5
Uncertain proteins                      | 30     | 6.5
Hypothetical proteins                   | 170    | 37
Pseudogenes                             | 1      | 0.2
CDS with a signal peptide               | 31     | 6.7
CDS with multiple transmembrane helices | 54     | 12
CDS with an InterPro match              | 227    | 49

Most Frequent InterPro Domains (Number = megabase genes with the domain; organism columns = % of genes with the domain)

InterPro ID | Domain                       | Number | Pt   | Dd   | Sc  | At   | Ce    | Dm    | Hs
IPR000719   | Protein kinase               | 32     | 6.9  | 1.9  | 1.9 | 4.0  | 2.5   | 1.9   | 2.3
IPR001841   | Zn-finger, RING              | 7      | 1.5  | 0.8  | 0.6 | 1.9  | 0.8   | 1.0   | 1.4
IPR001680   | G-protein beta WD-40 repeat  | 7      | 1.5  | 1.1  | 1.6 | 0.9  | 0.7   | 1.2   | 1.3
IPR002048   | Calcium-binding EF-hand      | 6      | 1.3  | 0.9  | 0.2 | 0.6  | 0.5   | 0.7   | 1.0
IPR000379   | Esterase/lipase/thioesterase | 6      | 1.3  | NA   | 0.6 | 0.8  | 0.6   | 0.9   | 0.4
IPR003593   | AAA ATPase                   | 5      | 1.1  | 1.1  | 1.3 | 1.2  | 1.6   | 0.9   | 0.5
IPR003409   | MORN motif                   | 5      | 1.1  | NA   | 0   | 0.06 | 0.005 | 0.006 | 0.05
IPR001806   | RAS GTPase superfamily       | 4      | 0.87 | 0.86 | 0.6 | 0.4  | 0.4   | 0.6   | 0.7
IPR004842   | Metallo-phosphoesterase      | 4      | 0.87 | NA   | 0.3 | 0.3  | 0.4   | 0.3   | 0.1
IPR001622   | K+ channel, pore region      | 4      | 0.87 | NA   | 0   | 0.1  | 0.5   | 0.3   | 0.4

The upper part of the table describes features of the 460 annotated CDSs. The CDSs with homology to known proteins were identified by evaluation of sequence similarity to known proteins or domains, as described in Experimental Procedures. The lower part of the table presents the occurrence of the 10 most frequent InterPro domains of the Paramecium megabase chromosome, compared to the 30 most frequent domains of Dictyostelium chromosome 2 as given in [8] and the frequency of domains in fully sequenced eukaryotes (InterPro database). The number of megabase genes with each domain is given in the first column, followed by the percentage of genes from each organism with the given domain. Pt, P. tetraurelia; Dd, D. discoideum; Sc, S. cerevisiae; At, A. thaliana; Ce, C. elegans; Dm, D. melanogaster; Hs, H. sapiens. NA, not among the 30 most frequent Dictyostelium chromosome 2 domains.
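One simple way to read the lower part of Table 2 is to compare the Paramecium frequency of each domain with the highest frequency among the other organisms. The sketch below does this for three rows copied from the table. The fold-ratio measure and the function name are our own illustrative choices, not a statistic used in the paper, and NA entries (domain not among the 30 most frequent on Dictyostelium chromosome 2) are simply skipped.

```python
# Percent of genes carrying each domain, copied from Table 2.
# None stands for NA (not among the 30 most frequent D. discoideum domains).
TABLE2 = {
    "Protein kinase": {
        "Pt": 6.9, "Dd": 1.9, "Sc": 1.9, "At": 4.0, "Ce": 2.5, "Dm": 1.9, "Hs": 2.3,
    },
    "MORN motif": {
        "Pt": 1.1, "Dd": None, "Sc": 0.0, "At": 0.06, "Ce": 0.005, "Dm": 0.006, "Hs": 0.05,
    },
    "K+ channel, pore region": {
        "Pt": 0.87, "Dd": None, "Sc": 0.0, "At": 0.1, "Ce": 0.5, "Dm": 0.3, "Hs": 0.4,
    },
}


def fold_over_highest_other(row: dict, focal: str = "Pt") -> float:
    """Ratio of the focal organism's frequency to the highest other frequency."""
    others = [v for org, v in row.items() if org != focal and v is not None]
    return row[focal] / max(others)


for domain, row in TABLE2.items():
    print(f"{domain}: {fold_over_highest_other(row):.1f}x the highest other organism")
```

Because the chromosome carries only a small sample of the gene complement, ratios of this kind are indicative at best.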


We compared the set of predicted proteins with proteomes from complete eukaryotic genomes to screen for interesting patterns of incidence among species. We did not find evidence for proteins uniquely shared with Plasmodium. However, 32 proteins are shared by vertebrates (represented by man) but not yeasts (represented by S. cerevisiae and S. pombe). Among the latter, most are also shared with invertebrates (represented by C. elegans and D. melanogaster) and/or plants (represented by A. thaliana) and a few with filamentous fungi (represented by N. crassa). After validation by homology searches against the entire protein database, we focused our attention on the 24 proteins presented in Table 3. Two groups seem particularly noteworthy. First, a few putative proteins are shared with man but not fungi or invertebrates, including PTMB.423c, which has a very significant match with a hypothetical human protein, and PTMB.114, which matches an interferon-induced guanylate binding protein found only in vertebrates and plants. A second group consists of five proteins shared with plants but not animals. Since Plasmodium contains a relic plastid [9], it was of interest to see whether any of these genes might come from plastids, in support for the theory of a single symbiogenetic origin of chloroplasts and subsequent lateral transfer of plastids from a red alga to the corticoflagellate ancestor of chromalveolates [17], before the divergence of ciliates and apicomplexa. Four of the proteins in this group (PTMB.151c, PTMB.356, PTMB.361c, and PTMB.376c) do have homologs in bacteria, including cyanobacteria, and one of them, PTMB.356, is a probable lysophospholipase also found among Plasmodium apicoplast proteins. However, in all cases, the best matches are with plants and apicomplexa, not with cyanobacteria, so we find no phylogenetic support for a plastid origin of any of these genes. 
Other Paramecium proteins, implicated in axonemal structures, are conserved in plants, animals, ciliates, and flagellates but absent from fungi (the radial spoke head protein PTMB.212c, absent from A. thaliana but present in algae such as Chlamydomonas). Many additional proteins are absent from the reference yeasts, including some proteins that have been studied in ciliates. Copines (PTMB.394c) are Ca2+-dependent phospholipid binding proteins first discovered in Paramecium [18]; Myb-related transcription factors (PTMB.149) have been described in spirotrich ciliates [19]; and UNC119 orthologs (PTMB.296), found up until now only in animals, have been characterized in Paramecium (D. Gogendeau and F. Koll, personal communication). Another protein shared with animals is an inositol-1,4,5 triphosphate receptor (PTMB.445c), a ligand-gated calcium ion channel that modulates Ca2+ release from intracellular stores. Although the Paramecium protein shares only 25% amino acid identity with its vertebrate homologs, the protein contains the ryanodin/IP3 homology domain, a Ca2+/Na+ pore domain, and the characteristic six transmembrane helices located at the C terminus of the molecule. It will be interesting to see whether this protein is involved in morphogenesis in Paramecium, since it has been shown that cortical pattern is respecified by lithium ions [20] in a manner reminiscent of the teratogenic effects of lithium during amphibian embryogenesis. All of the genes on the chromosome and their annotations are available (gene table at http://paramecium.cgm.cnrs-gif.fr/megabase; Generic Genome Browser at http://paramecium.cgm.cnrs-gif.fr/cgi-bin/gbrowse).

Examination of the megabase chromosome illustrates the rich diversity in functions and origins of Paramecium protein-coding genes. Given the ~75 Mb size of the macronuclear genome, we can extrapolate to a surprisingly large gene complement for a unicellular organism (>30,000 genes). Since powerful direct and reverse genetic techniques are available, Paramecium could be a system of choice for functional analyses of many genes shared with vertebrates and/or apicomplexan parasites but not always present in model eukaryotes such as yeast, worm, or fly.

In conclusion, our analysis of the largest DNA molecule in the Paramecium macronucleus provides a first glimpse of a chromosome "stripped for action" by DNA rearrangements that remove all the germline heterochromatic sequences. As complete ciliate genomes become available, it should be possible to see whether the unusually high coding content of this molecule is a characteristic result of the differentiation of a somatic nucleus or whether other forces are driving the Paramecium genome toward prokaryote-like compactness.

Experimental Procedures

Paramecium Strain and Culture

Paramecium tetraurelia strain d4-2 is an entirely homozygous wild-type laboratory stock [21]. Cells were cultured using standard methods [22].

Chromosome Isolation and Shotgun Library Construction

The megabase chromosome was isolated by clamped homogeneous electric field (CHEF) gel electrophoresis as previously described [23]. Young cells (3 divisions post autogamy) were used for a first separation at 70 s pulse frequency. In order to eliminate small DNA molecules, the region between 600 kb and the limit mobility zone was cut out and subjected to electrophoresis on a second gel using 113 s pulses, to allow optimal separation in the megabase region. The largest chromosome was cut out of the gel and the DNA was purified using agarase (Sigma). Library construction, optimized for small amounts of DNA, involved partial restriction digestion with a very frequent cutter, Tsp509I. Size-selected partial digestion products (1–3 kb) were cloned by a tailing procedure in pCRScript (Stratagene).

Sequence Determination

Clones from the shotgun library were amplified and the PCR products were sequenced using dye-terminator chemistry (DYEnamic ET Terminator Cycle Sequencing Kit, Amersham US81050) with MegSeqR or MegSeqU primers. Bases were called using Phred [24] and assembled using GAP4 [25]. Approximately 6200 plasmids from the shotgun DNA library were sequenced. Final shotgun assembly (4.5× coverage) did not cover the whole chromosome, and 40 gaps were found. Gaps were filled either by primer walking or by multiplex PCR [26]. The entire sequence of the 980 kb region was determined with a statistical error rate of 1/700,000. The entire chromosome was covered by 60 overlapping PCR products (15–25 kb). Restriction digestion patterns with several different enzymes were fully compatible with the sequence.

Annotation

Artemis [27] was used for annotation, and the entire analysis relied on custom Perl scripts written using the Bioperl library [28]. At the outset of the project, too few Paramecium genes were available for training of an ab initio gene finder. We therefore manually annotated


Table 3. Species Distribution of Conserved Proteins

Putative Protein | Length (aa) | Group | Best Match | Accession Number | Definition | E Value | Identity (%) | Overlap

Proteins Shared with Vertebrates (and Plants), Absent from Yeasts, Fly, and Worm
PTMB.114 | 835 | VP | H. sapiens | GBP1_HUMAN | interferon-induced guanylate-binding protein 1 | 3e-52 | 28 | 0.70
PTMB.212c | 452 | V | H. sapiens | Q9NQ10 | DJ412I7.1 (similar to radial spokehead protein) | 2e-30 | 25 | 1.00
PTMB.423c | 970 | V | H. sapiens | Q8TBY9 | hypothetical protein | 3e-98 | 28 | 0.96

Proteins Shared with Plants but Not with Animals
PTMB.151c | 341 | P | A. thaliana | Q9C6B3 | hypothetical protein | 5e-39 | 35 | 0.69
PTMB.164c | 494 | PNF | A. thaliana | Q9ZSA3 | T4B21.12 (putative calcium-dependent protein kinase) | 2e-27 | 26 | 0.71
PTMB.356 | 415 | PF | A. thaliana | O23287 | hypothetical protein | 6e-23 | 25 | 0.72
PTMB.361c | 420 | PF | P. falciparum | PF10_0306 | hypothetical protein | 6e-23 | 34 | 0.45
PTMB.376c | 391 | PNY | A. thaliana | Q9LDU9 | T10F20.3 protein (F2H15.21 protein) | 9e-63 | 37 | 0.95

Proteins Shared Uniquely with Animals
PTMB.56 | 1196 | VI | H. sapiens | Q8NE11 | hypothetical protein | 0.0 | 37 | 1.00
PTMB.153 | 315 | VI | H. sapiens | Q8NHP5 | similar to CG17349 gene product | 1e-54 | 50 | 0.62
PTMB.199 | 1360 | VI | H. sapiens | O60332 | hypothetical protein KIAA0590 | 1e-167 | 27 | 1.00
PTMB.227 | 339 | VI | C. elegans | YLS2_CAEEL | hypothetical protein F09G8.2 in chr. III precursor | 2e-29 | 28 | 1.00
PTMB.378 | 559 | I | D. melanogaster | Q9VZ56 | CG1637 protein | 3e-36 | 27 | 0.79
PTMB.296 | 174 | VI | H. sapiens | U119_HUMAN | Unc-119 protein homolog (retinal protein 4) | 3e-26 | 33 | 1.00
PTMB.414 | 708 | VIF | H. sapiens | Q8NEM8 | similar to ATP/GTP-binding protein | 3e-57 | 35 | 0.57
PTMB.445c | 2910 | VI | H. sapiens | IP3S_HUMAN | inositol 1,4,5-triphosphate receptor type 2 | 5e-34 | 25 | 0.40

Other Proteins Shared with Plants and Animals but Absent from Yeasts
PTMB.142c | 390 | VIPN | H. sapiens | HPPD_HUMAN | 4-hydroxyphenylpyruvate dioxygenase (EC 1.13.11.27) (4HPPD) (HPD) (HPPDase) | 1e-125 | 56 | 0.96
PTMB.149 | 490 | VIPN | H. sapiens | MYBB_HUMAN | Myb-related protein B (B-Myb) | 6e-32 | 45 | 0.28
PTMB.175c | 247 | VIP | H. sapiens | ATTY_HUMAN | tyrosine aminotransferase (EC 2.6.1.5) | 8e-28 | 32 | 0.93
PTMB.180c | 549 | VIPF | A. thaliana | Q9SLI8 | F20D21.27 protein | 1e-108 | 40 | 0.92
PTMB.256c | 263 | VIPN | H. sapiens | NUHM_HUMAN | NADH-ubiquinone oxidoreductase 24 kDa subunit, mitochondrial precursor (EC 1.6.5.3) (EC 1.6.99.3) | 2e-53 | 50 | 0.76
PTMB.357c | 412 | VIPF | H. sapiens | Q9Y377 | CGI-67 protein | 2e-35 | 35 | 0.56
PTMB.360c | 218 | VIP | H. sapiens | GILT_HUMAN | gamma interferon inducible lysosomal thiol reductase precursor | 1e-19 | 29 | 0.92
PTMB.394c | 534 | VIP | H. sapiens | CNE5_HUMAN | copine V | 2e-81 | 35 | 1.00

Putative proteins of the chromosome shared with vertebrates (V, represented by H. sapiens), plants (P, represented by A. thaliana), and/or invertebrates (I, represented by D. melanogaster and C. elegans) but (with the exception of PTMB.376c) absent from yeasts (Y, represented by S. cerevisiae and S. pombe). N indicates that the protein is shared with N. crassa, and F indicates that the protein is also shared by P. falciparum. The columns, from left to right, give the name of the Paramecium CDS whose product is under consideration, the length of the putative protein, the distribution among the different groups of organisms, and the characteristics of the best match: species, accession number, definition, E-value, amino acid identity, and overlap (the fraction of the query that is covered by the subject match). We note that PTMB.256c is a subunit of respiratory complex I, which is present in mitochondria of many species including human, Paramecium, and some yeasts such as S. pombe and Y. lipolytica, but absent from S. cerevisiae.
The complete list of genes with vertebrate homologs but absent from yeast is: PTMB.56 WD-40 repeat protein, PTMB.62 hypothetical protein with arrestin domain, PTMB.68 K+ channel, PTMB.113 K+ channel, PTMB.114 Guanylate binding protein, PTMB.142c 4-hydroxyphenyl pyruvate dioxygenase, PTMB.149 Myb-related protein, PTMB.153 Conserved hypothetical protein, PTMB.171c Guanylyl cyclase, PTMB.175c Tyrosine aminotransferase, PTMB.180c Phosphatase regulatory subunit, PTMB.189c Phosphatidyl inositol-4-phosphate-5 kinase, PTMB.199 Conserved WD-40 and TPR repeat-containing protein, PTMB.206c Phosphatidyl inositol-4-phosphate-5 kinase, PTMB.212c Radial spoke protein, PTMB.214c Conserved hypothetical protein, PTMB.227 DNase II-like, PTMB.231c Prenylcysteine lyase, PTMB.252c hypothetical protein with LITAF membrane-association domain, PTMB.256c NADH-ubiquinone oxidoreductase 24 kDa subunit, PTMB.296 UNC-119, PTMB.353 MORN repeat protein, PTMB.357c Conserved protein, alpha/beta hydrolase, PTMB.360c Gamma-interferon inducible lysosomal thiol reductase-like protein, PTMB.361c MORN repeat protein, PTMB.387c MORN repeat protein, PTMB.394c Copine, PTMB.403c K+ channel, PTMB.414 Zn-carboxypeptidase, PTMB.422c Guanylate nucleotide binding protein, PTMB.423c WD-40 protein, and PTMB.445c Inositol 1,4,5-triphosphate receptor.
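The one-letter group codes in Table 3 lend themselves to simple set logic. The sketch below is a hypothetical illustration of how a code such as "VIPN" could be decoded and assigned to one of the four sections of the table; the function names and rule ordering are ours, and the actual screen in the paper relied on BLASTP matches followed by visual inspection of the alignments.

```python
ANIMALS = {"V", "I"}  # vertebrates, invertebrates


def absent_from_yeasts(code: str) -> bool:
    """True when the group code lacks Y (S. cerevisiae and S. pombe)."""
    return "Y" not in code


def table3_section(code: str) -> str:
    """Assign a Table 3 group code to one of the table's four sections."""
    groups = set(code)
    if not groups & ANIMALS:
        return "plants, not animals" if "P" in groups else "unclassified"
    if "V" in groups and "I" not in groups:
        return "vertebrates (and plants), not fly or worm"
    if "P" not in groups:
        return "animals only"
    return "plants and animals"


# A few group codes copied from Table 3.
examples = {
    "PTMB.114": "VP",
    "PTMB.376c": "PNY",
    "PTMB.378": "I",
    "PTMB.445c": "VI",
    "PTMB.142c": "VIPN",
}
for name, code in examples.items():
    note = "" if absent_from_yeasts(code) else " (also found in yeasts)"
    print(f"{name}: {table3_section(code)}{note}")
```

Note that PTMB.376c is the only listed protein whose code contains Y, matching the exception stated in the legend.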

genes using multiple lines of evidence: GC content, codon bias, and sequence similarity with known proteins and intron predictions (profile HMMs were built using the HMMER-2.2 package [29], using introns from a pilot random sequencing project [13] and from Paramecium genes in the Invertebrate division of the public nucleotide database). Once enough genes had been annotated manually, GlimmerM was trained and used to complete the structural annotation (see below). tRNA predictions were made using tRNAscan-SE-1.23 [30]. The predicted proteins were extracted from the gene models, and adjustment of the gene models as well as functional annotation were based on case-by-case examination of all the evidence provided by (1) BLASTP matches (cutoff E < 10⁻², BLOSUM62 matrix and default parameters) against the swissprot+sptrembl database and ClustalW multiple alignments with the five best matches, (2) RPSBLAST matches (cutoff E < 10⁻¹) with NCBI's Conserved Domain Database [31], (3) InterPro matches found using InterProScan v3.1 [32] with default parameters, (4) signal peptide (SignalP-2.0 [33]), transmembrane helix (TMHMM [34]), and coiled coil [35] predictions, and (5) identification of close Paramecium paralogs by TBLASTN (cutoff E < 10⁻⁴⁰) against the primary WGS sequence data. GO [36] Molecular Function terms were mapped using GOA gene associations and interpro2go mappings [37]. The terms were further mapped to high-level goaslim function terms, generating the classification shown in Figure 1. A gene table with the evidence used for annotation is available at http://paramecium.cgm.cnrs-gif.fr/megabase/. The annotation can be viewed using the Generic Genome Browser [38] at http://paramecium.cgm.cnrs-gif.fr/cgi-bin/gbrowse. Translated CDSs were compared to complete proteomes (H. sapiens, A. thaliana, D. melanogaster, C. elegans, S. cerevisiae, and S. pombe from http://www.ebi.ac.uk/proteome/, P. falciparum from http://plasmodb.org, and N. crassa from http://www.broad.mit.edu/annotation/fungi/neurospora/; December 2003) using BLASTP. Species with a match with an E-value < 10⁻³ were scored after visual inspection of the alignments. Proteins with interesting patterns of occurrence among these species were then used to search the nonredundant protein database at NCBI for validation and in order to examine homologs in all available species using the "taxonomy report" feature.

Gene Prediction with GlimmerM

GlimmerM v3 [39] was used for ab initio gene prediction. The source code was modified to take into account the Paramecium genetic code.
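To illustrate why the genetic code matters for gene prediction: in the ciliate nuclear code used by Paramecium (NCBI translation table 6), TAA and TAG encode glutamine and TGA is the only stop codon. The sketch below is a minimal, self-contained illustration of that difference and says nothing about how the GlimmerM source was actually modified; the table construction, function name, and toy sequence are our own.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, in NCBI order (codons enumerated T, C, A, G).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}
# Ciliate nuclear code: TAA and TAG are read as glutamine (Q); TGA remains a stop.
CILIATE_CODE = dict(STANDARD_CODE, TAA="Q", TAG="Q")


def translate(cds: str, table: dict) -> str:
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = table[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_cds = "ATGTAAAAATAGTGA"  # invented sequence, for illustration only
print(translate(toy_cds, STANDARD_CODE))  # truncated at the first TAA
print(translate(toy_cds, CILIATE_CODE))   # reads through TAA and TAG
```

A gene finder that assumed the standard code would truncate most Paramecium open reading frames at the first in-frame TAA or TAG, which is why such an adaptation is needed before training.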
For training, a reference set of genes was constituted. The set is comprised of 51 Paramecium tetraurelia genes from the Invertebrate division of the public nucleotide database, 66 conserved genes from the Paramecium WGS project that were validated by ClustalW multiple alignment with homologs from other species, and 141 genes from the manual annotation of the megabase chromosome consisting either of conserved genes validated by multiple alignment with homologs from other species or genes validated by nucleotide alignment with Paramecium paralogs found in the primary data from the Paramecium WGS project. Given the small size of Paramecium introns and gene density higher than that of the Plasmodium falciparum genome for which GlimmerM was originally designed, we found the best predictions by changing the splice site filter size for training from 60 nt to 24 nt. Although the accuracy of the predictions is expected to improve as the size of the training set increases, performance based on the initial set of 262 reference genes is already quite useful as judged by comparison of the GlimmerM gene models with CDS from the manual annotation of the entire megabase chromosome (Supplemental Table S3).

Supplemental Data

Supplemental Data, including tables, a figure, and a discussion of dinucleotide frequencies, can be found at http://www.currentbiology.com/cgi/content/full/14/15/1397/DC1. A gene table is available at http://paramecium.cgm.cnrs-gif.fr/megabase/. The annotation can be viewed with the Generic Genome Browser at http://paramecium.cgm.cnrs-gif.fr/cgi-bin/gbrowse.
Acknowledgments
We gratefully acknowledge support from the CNRS for construction of a European network (GDRE Paramecium Genomics), the Polish Ministry of Science (grant KBN 3P04A00625), the Ministère de l’Education Nationale, de la Recherche et de la Technologie (MENRT) Program “Centre de Ressources Biologiques” (J.C.), the MENRT Program “Recherche fondamentale en Microbiologie et Maladies infectieuses et parasitaires” (E.M.), the Association pour la Recherche sur le Cancer (grant # 5733, E.M.), and the Ligue Nationale contre le Cancer (grant # 75/01-RS/73, E.M.). M.N. was supported by a graduate studentship from the MENRT, and M.N. and J.K.N. received support from the CNRS in the framework of the Polish-French Centre for Plant Biotechnology. I.B. received support through the MENRT program “Centres de Ressources Biologiques.” We are indebted to Janine Beisson and Christopher J. Herbert for constructive criticism of the manuscript. We thank Lawrence Aggerbeck and the Gif-Orsay Microarray Platform for helping us index the shotgun library.

Received: January 21, 2004
Revised: June 14, 2004
Accepted: June 14, 2004
Published: August 10, 2004

References
1. Jahn, C.L., and Klobutcher, L.A. (2002). Genome remodeling in ciliated protozoa. Annu. Rev. Microbiol. 56, 489–520.
2. Le Mouël, A., Butler, A., Caron, F., and Meyer, E. (2003). Developmentally regulated chromosome fragmentation linked to imprecise elimination of repeated sequences in paramecia. Eukaryot. Cell 2, 1076–1090.
3. Jerka-Dziadosz, M., and Beisson, J. (1990). Genetic approaches to ciliate pattern formation: from self-assembly to morphogenesis. Trends Genet. 6, 41–45.
4. Görtz, H.D. (1988). Paramecium (Berlin: Springer-Verlag).
5. Gratias, A., and Bétermier, M. (2001). Developmentally regulated excision of internal DNA sequences in Paramecium aurelia. Biochimie 83, 1009–1022.
6. Baroin, A., Prat, A., and Caron, F. (1987). Telomeric site position heterogeneity in macronuclear DNA of Paramecium primaurelia. Nucleic Acids Res. 15, 1717–1728.
7. Forney, J., and Rodkey, K. (1992). A repetitive DNA sequence in Paramecium macronuclei is related to the beta subunit of G proteins. Nucleic Acids Res. 20, 5397–5402.
8. Glöckner, G., Eichinger, L., Szafranski, K., Pachebat, J.A., Bankier, A.T., Dear, P.H., Lehmann, R., Baumgart, C., Parra, G., Abril, J.F., et al. (2002). Sequence and analysis of chromosome 2 of Dictyostelium discoideum. Nature 418, 79–85.
9. Gardner, M.J., Hall, N., Fung, E., White, O., Berriman, M., Hyman, R.W., Carlton, J.M., Pain, A., Nelson, K.E., Bowman, S., et al. (2002). Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498–511.
10. Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., et al. (1996). Life with 6000 genes. Science 274, 563–567.
11. Katinka, M.D., Duprat, S., Cornillot, E., Metenier, G., Thomarat, F., Prensier, G., Barbe, V., Peyretaillade, E., Brottier, P., Wincker, P., et al. (2001). Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 414, 450–453.
12. Baldauf, S.L., Roger, A.J., Wenk-Siefert, I., and Doolittle, W.F. (2000). A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290, 972–977.
13. Sperling, L., Dessen, P., Zagulski, M., Pearlman, R.E., Migdalski, A., Gromadka, R., Froissard, M., Keller, A.M., and Cohen, J. (2002). Random sequencing of Paramecium somatic DNA. Eukaryot. Cell 1, 341–352.
14. Dasilva, C., Hadji, H., Ozouf-Costaz, C., Nicaud, S., Jaillon, O., Weissenbach, J., and Crollius, H.R. (2002). Remarkable compartmentalization of transposable elements and pseudogenes in the heterochromatin of the Tetraodon nigroviridis genome. Proc. Natl. Acad. Sci. USA 99, 13636–13641.
15. Dubrana, K., and Amar, L. (2000). Programmed DNA underamplification in Paramecium primaurelia. Chromosoma 109, 460–466.
16. Haynes, W.J., Ling, K.-Y., and Kung, C. (2003). PAK paradox: Paramecium appears to have more K(+)-channel genes than human. Eukaryot. Cell 2, 737–745.
17. Cavalier-Smith, T. (2002). The phagotrophic origin of eukaryotes and phylogenetic classification of Protozoa. Int. J. Syst. Evol. Microbiol. 52, 297–354.
18. Creutz, C.E., Tomsig, J.L., Snyder, S.L., Gautier, M.C., Skouri, F., Beisson, J., and Cohen, J. (1998). The copines, a novel class of C2 domain-containing, calcium-dependent, phospholipid-binding proteins conserved from Paramecium to humans. J. Biol. Chem. 273, 1393–1402.
19. Yang, T., Perasso, R., and Baroin-Tourancheau, A. (2003). Myb genes in ciliates: a common origin with the myb protooncogene? Protist 154, 229–238.
20. Beisson, J., and Ruiz, F. (1992). Lithium-induced respecification of pattern in Paramecium. Dev. Genet. 13, 194–202.
21. Sonneborn, T.M. (1974). Paramecium aurelia. In Handbook of Genetics, R. King, ed. (New York: Plenum Publishing), pp. 469–594.
22. Sonneborn, T.M. (1970). Methods in Paramecium research. Methods Cell. Physiol. 4, 241–339.
23. Caron, F. (1992). A high degree of macronuclear chromosome polymorphism is generated by variable DNA rearrangements in Paramecium primaurelia during macronuclear differentiation. J. Mol. Biol. 225, 661–678.
24. Ewing, B., and Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194.
25. Staden, R. (1996). The Staden sequence analysis package. Mol. Biotechnol. 5, 233–241.
26. Tettelin, H., Radune, D., Kasif, S., Khouri, H., and Salzberg, S. (1999). Optimized multiplex PCR: efficiently closing a whole-genome shotgun sequencing project. Genomics 62, 500–507.
27. Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., and Barrell, B. (2000). Artemis: sequence visualization and annotation. Bioinformatics 16, 944–945.
28. Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G., Korf, I., Lapp, H., et al. (2002). The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12, 1611–1618.
29. Eddy, S. (1996). Hidden Markov models. Curr. Opin. Struct. Biol. 6, 361–365.
30. Lowe, T., and Eddy, S. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.
31. Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., et al. (2003). CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31, 383–387.
32. Zdobnov, E., and Apweiler, R. (2001). InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848.
33. Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10, 1–6.
34. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580.
35. Lupas, A., Van Dyke, M., and Stock, J. (1991). Predicting coiled coils from protein sequences. Science 252, 1162–1164.
36. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29.
37. Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann, W., Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A., et al. (2003). The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 13, 662–672.
38. Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., et al. (2002). The generic genome browser: a building block for a model organism system database. Genome Res. 12, 1599–1610.
39. Salzberg, S., Pertea, M., Delcher, A., Gardner, M., and Tettelin, H. (1999). Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24–31.
40. Lopez, P.J., and Séraphin, B. (2000). YIDB: the Yeast Intron DataBase. Nucleic Acids Res. 28, 85–86.

Accession Numbers
The Accession Number for the megabase chromosome sequence is CR548612.