Genomes, phylogeny and lateral gene transfer
desdevises.free.fr/ISMB2011
Yves Desdevises
Observatoire Océanologique de Banyuls, Université Pierre et Marie Curie, France
Outline
• Molecular phylogenetics
• Phylogenomics
• Lateral gene transfer
• Illustrated uses of evolutionary genomics in a microalgae-virus association
A few words on molecular phylogenetics
• Goal: propose a hypothesis of relationships between several taxa
• Phylogeny = tree
• Speciation: binary
• Hypothesis
[Figure: tree relating taxa A, B and C; phylogenetic network]
Phylogenetic trees
[Figure: the same phylogenetic tree of wrasse species (genera Symphodus, Ctenolabrus, Labrus, Cheilinus, Epibulus, Stetojulis, Halichoeres, Labropsis, Anampses, Labroides, Labrichthys, Coris, Hemigymnus, Thalassoma, Pictilabrus, Notolabrus, Bodianus, Clepticus) plus Pagrus major, drawn in several layouts]
Molecular data
• Most common source of data for phylogeny
• Nucleotides or amino acids (for ancient divergences)
• Important step: alignment (with the help of alignment software)
• Use of evolutionary models to build trees
• Gene tree ≠ species tree
• Genes: orthologous or paralogous
[Figure: gene tree showing an ancestral gene, a duplication, and the resulting gene copies (a, b, c and A, B, C), with orthologs and paralogs indicated]
Making a molecular phylogeny
[Flowchart]
• Data: DNA, AA, ...
• Alignment: software + eye
• Data quality: saturation, homogeneity, ...
• Characters or distances? Choice of method depends on data type and number of taxa
• Characters (sites, changes): BI and ML (model?), MP (weighting?)
• Distances (model?): with an optimality criterion (ME, ...) or without (NJ, ...)
• Tree(s)
• Validation: bootstrap, ...
Optimality criteria
• Different methods to choose the “best tree” from the alignment
• Hypothesis on how evolution works
• Different in different methods
  - Number of steps (parsimony)
  - Sum of branch lengths (minimum evolution)
  - Likelihood (ML, with evolutionary model)
Parsimony
• Method based on individual characters, associated with cladistics
• “Ockham’s razor”: favour the simplest solution
• Assess character fit to trees by character mapping via parsimony
• Optimisation is different on different trees
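The "number of steps" criterion can be illustrated with a minimal sketch of Fitch's small-parsimony count on rooted binary trees. The tree encoding (nested tuples), the function names and the toy alignment are assumptions made for this illustration, not part of the original slides.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree   : nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states : dict mapping each leaf name to its observed state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):            # leaf: its own state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                           # children agree: no extra step
            return common
        changes += 1                         # children disagree: one step
        return left | right

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the Fitch score over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy alignment and two candidate topologies
alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTGT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_1, alignment))   # 4 steps
print(parsimony_length(tree_2, alignment))   # 6 steps
```

Under parsimony, tree_1 would be preferred here because it requires fewer steps. Real analyses search over many topologies rather than comparing two by hand.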
Distances
• Assessment of the mean number of changes between two taxa
• Based on distances, not individual characters
• Data are sometimes available only as distances (e.g. DNA/DNA hybridization temperature); otherwise, data are transformed into a distance matrix
• Mainly used for molecular data; molecular distances can be corrected using models of sequence evolution (same as ML)
• Main method: neighbor-joining (NJ)
• Very fast
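As an illustration of a corrected molecular distance, the sketch below computes the observed proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, one of the simplest models of sequence evolution. The function names and the example sequences are hypothetical; the slides do not specify which model is used.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; distance is undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor(p)
print(p, round(d, 4))
```

The corrected distance is always at least as large as the p-distance because it accounts for multiple substitutions at the same site. A matrix of such distances is the input to methods such as neighbor-joining.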
Maximum likelihood
• Maximum likelihood = ML
• Method based on individual characters
• Uses an explicit evolutionary model (DNA or AA)
• The most computationally demanding of these methods
• Model very important: only for molecular data
• ML finds (one) tree (and model parameters) maximizing the probability of the data
Bayesian inference
• Recent and now widely used method
• Uses Bayes' formula to generate the posterior probability of parameters (among which topology and branch lengths), based on previous knowledge of the data: the prior probability
• Tree (as well as parameters of the model such as substitution rates) with confidence intervals (support values) for clades
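For reference, Bayes' formula as applied to phylogeny can be written as follows. The symbols are notation chosen here, not taken from the slides: \tau is the tree topology, \theta gathers branch lengths and model parameters, and D is the alignment.

```latex
P(\tau, \theta \mid D) = \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)}
```

The left-hand side is the posterior probability, P(D \mid \tau, \theta) is the likelihood (the same quantity that ML maximizes), and P(\tau, \theta) is the prior probability.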
Validation
• Trees can be validated using resampling procedures such as the bootstrap (resampling with replacement)
• Assessment of clade support: add some noise to (alter) the data and rebuild the tree; repeat this for many altered datasets
• The stronger a clade, the more often it appears among the resulting trees: % = bootstrap support
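The resampling step can be sketched as below. To stay short, the sketch replaces real tree building with a deliberately crude stand-in (the pair of taxa with the fewest differences), so it only illustrates how columns are resampled and how a support percentage is counted. All names and the toy alignment are assumptions.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the two taxa with the fewest differences."""
    taxa = sorted(alignment)
    pairs = ((a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:])
    best = min(
        pairs,
        key=lambda pair: sum(
            x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]])
        ),
    )
    return frozenset(best)


def bootstrap_support(alignment, clade, replicates=200, seed=0):
    """Percentage of pseudo-replicates in which `clade` is recovered."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == frozenset(clade)
        for _ in range(replicates)
    )
    return 100 * hits / replicates


# Hypothetical toy alignment
toy = {"A": "ACGTACGTAA", "B": "ACGTACGTAC", "C": "ACTTAGGTCC", "D": "TCTTAGGACC"}
print(bootstrap_support(toy, {"A", "B"}))
```

In practice each pseudo-replicate is analysed with the same tree-building method as the original data, and support is reported for every clade of the tree.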
Supertrees
• Combine trees with partially overlapping taxa
• Bigger tree
• Many methods (at least 17)
• Uses of supertrees
  - Combining trees from different data/studies
  - Phylogenomics: genes are often unequally present in the taxa under study
  - Metagenomics: taxa partially and unequally represented in sequences
➡ Many gaps in the matrix:
• Supermatrix (as is)
• Design several complete sub-matrices, compute subtrees, build supertree
• e.g. Sargasso Sea environmental sequences
...
Phylogenomics
• Genomes: more accurate and precise phylogenies? Not so simple...
• Very large dataset: computation difficult
• Genomes are plastic: duplications (total, partial), fusions, chromosome fissions, LGT, ...
• No good model of genomic evolution
• Stochastic (random) error decreases, but only by increasing the number of characters
• The possibility of systematic error remains, for example caused by a wrong choice of method or model
• 3 main biases
• Composition bias: sequences with the same composition tend to cluster
  - Check from sequences
• Long branch attraction
  - Good taxon sampling
• Heterotachy: substitution rate change through time for fixed positions
  - Hard to detect and correct
Genomes
• More characters
• New character types: gene order, gene content, nucleotide signature (DNA strings), rare genomic changes
• 2 main approaches
  - Classical: sequences (gene concatenation) and phylogeny (supermatrix or supertree)
  - Whole genome features: gene order, gene content, DNA string
• + 1: rare genomic changes
Classical methods
• Resolution of difficult phylogenetic problems (e.g. Tree of Life, Eukaryotes, Bilateria)
• Evolution of gene groups (e.g. family): mutations, selective pressure, divergence, duplications, ...
• Identification of lateral gene transfer (which brings noise into the tree-like signal)
• Example: classical tree of deuterostomes
• Genomic data (Nature, 2006)
  - 146 genes
  - Classical methods: sequences
  - Bias control
• Example: Eukaryote phylogeny (2009, 2010)
• Example: Tree of Life (Science, 2006)
• But is the history of life really tree like?
Lateral Gene Transfer
• LGT in the tree of life
• Lateral gene transfer is increasingly recognized as an important factor shaping the evolution of life
• The current debate is no longer about the existence of LGT but about its importance: can we still consider that the evolution of life is mainly tree-like?
• No (?) in Prokaryotes
• Yes (?) in Eukaryotes
Methods to unveil LGT
• Compositional methods
• Comparison of evolutionary rates
• Look for similar sequences in databases through BLAST
• Phylogenetic approach
Compositional methods
• Look via bioinformatics in complete genomes for
  - atypical nucleotide composition in putatively transferred genes
  - atypical codon usage patterns
• Only for recent transfer events (before homogenization)
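A compositional screen of this kind can be sketched as follows, using GC content as the composition measure and a z-score against the rest of the genome to flag atypical genes. The threshold, the function names and the toy gene set are assumptions for illustration. Real screens also use codon usage and more robust statistics.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C among unambiguous bases."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / total


def atypical_genes(genes, z_threshold=2.0):
    """Names of genes whose GC content deviates strongly from the genome-wide mean.

    genes : dict mapping gene name to nucleotide sequence
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    if sd == 0:                      # all genes have the same composition
        return []
    return sorted(name for name, value in gc.items() if abs(value - mu) / sd > z_threshold)


# Hypothetical genome: nine genes at 50 % GC and one AT-rich gene
genome = {f"gene{i}": "ACGT" * 5 for i in range(9)}
genome["foreign"] = "AT" * 10
print(atypical_genes(genome))
```

A flagged gene is only a candidate: atypical composition can have other causes, and older transfers are no longer detectable once the gene has been homogenized to its new genome.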
Evolutionary rates
• Compare pairwise distances between gene orthologs within families vs distances between genomes (from a reference tree): if no LGT, these distances should be roughly equal
• Another possibility is to compare instantaneous substitution matrices in genes vs genomes
• In case of LGT, these rates should differ
BLAST and similarity
• Find homologs of a query sequence in databases, genomes, ... via a similarity search (e.g. BLAST)
• Pattern of gene presence/absence in organisms = phyletic pattern
• Identification of LGT for genes with unusual affiliation
• Drawback: similarity does not necessarily mean evolutionary proximity
• Sensitive to taxa/gene representation in databases
Phylogenetic approach
• Look for individual gene trees incongruent with a reference phylogeny
• Reference tree: rDNA, genomes, gene concatenation, consensus tree, supertree, ...
• Need well supported trees
• Test for incongruence between topologies
• Cannot detect LGT between neighbours
• Between symbionts with complete genomes available: blast symbiont ORFs against host genome to identify putatively transferred genes
• Cophylogenetic methods (gene tree within species tree) can be used to infer a scenario for the LGT
• Example: frp gene acquired in red algae and green plants from δ-proteobacteria
• Example: multiple transfers virus-to-host and host-to-virus between Emiliania huxleyi and EhV86
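The comparison of a gene tree with a reference phylogeny can be sketched by listing the clades of each rooted tree and reporting those found in only one of them. This is a simplified illustration: the nested-tuple encoding, the function names and the taxa are hypothetical, and the sketch ignores branch support and the statistical tests of incongruence that a real analysis needs.

```python
def clades(tree):
    """Set of clades (as frozensets of leaf names) of a rooted tree given as nested tuples."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def clade_conflicts(gene_tree, reference_tree):
    """Clades present only in the gene tree, and clades present only in the reference."""
    gene, reference = clades(gene_tree), clades(reference_tree)
    return gene - reference, reference - gene


# Hypothetical reference and gene trees on the same four taxa
reference_tree = ((("A", "B"), "C"), "D")
gene_tree = ((("A", "C"), "B"), "D")
only_gene, only_reference = clade_conflicts(gene_tree, reference_tree)
print(sorted(sorted(c) for c in only_gene))        # [['A', 'C']]
print(sorted(sorted(c) for c in only_reference))   # [['A', 'B']]
```

An unexpected grouping in the gene tree, such as A with C here, is a candidate signal of LGT only if it is well supported and the reference tree is reliable.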
Case study: Prasinophyte microalgae and their viruses
Hosts: Prasinophyceae
Chlorophyta: green algae (Order Mamiellales, ubiquitous picophytoplankton)
3 main genera, 6 complete genomes to date
• Ostreococcus (3 genomes)
• Bathycoccus (1 genome)
• Micromonas (2 genomes)
[Images: Ostreococcus, Bathycoccus, Micromonas; Chrétiennot-Dinet et al. (1995)]
Host phylogeny (SSU rDNA)
[Figure: SSU rDNA tree of host strains, with node support values from 0.63 to 1. Ostreococcus: RCC344, RCC356, O. lucimarinus CCMP2972, RCC1108, O. tauri RCC745, RCC1107. Bathycoccus: B. prasinos RCC1105, B. prasinos RCC464. Micromonas: M. pusilla RCC497, M. pusilla CCMP1545, RCC1109, RCC828, RCC451]
Viruses
Phycodnavirus, Prasinovirus
Important role in the regulation of phytoplanktonic populations
[Image credit: ML Escande, OOB]
Giant virus ("Girus"): 100-200 nm
Large genomes: about 200 kb (closely related Chlorovirus: almost 400 kb!)
Prasinovirus tree from partial DNA polymerase (about 600 bp)
[Figure: tree with groups labelled OxV, MpV and BpV]
Prasinovirus genomes
6 genomes (4 correspond to hosts with complete genomes)
• 3 OxV: 1 OlV + 2 OtV
• 2 BpV
• 1 MpV
OtV5 genome
Genome comparisons
Phylogeny from genomic data
Global tree based on ß DNA polymerase (DP): Eukaryotes, Eubacteria, Archaea, and viruses
[Figure: global DP tree with support values at nodes and scale bars of 0.1. Taxa include green algae (Ostreococcus tauri, Ostreococcus lucimarinus, Bathycoccus, Micromonas CCMP1545 and RCC299, Chlamydomonas, Chlorella), land plants (Physcomitrella, Oryza, Sorghum, Arabidopsis), Polysphondylium, Homo, archaea (Thermococcus, Methanosarcina, Methanococcoides, Methanosaeta), bacteria, and viruses (OtV1, OtV5, OlV1, MpV1, BpV1, BpV2, PbCV strains, ATCV, EhV86, CeV, Mimivirus)]
Focus on prasinoviruses and their hosts, based on the concatenation of 5 shared genes: DP, PCNA, large and small subunits of ribonucleotide reductase, thymidine synthase
[Figure: tree of prasinoviruses, other viruses and their hosts, with support values at nodes and scale bars of 0.1]
• Green algal dsDNA viruses are monophyletic
• Global coevolution with algal hosts
• Evolutionary divergence: host > virus
• Recent colonisation of hosts by viruses?
• Higher evolutionary rate in hosts?
Lateral gene transfers in prasinoviruses
Viruses are known as "bags of genes" or "gene robbers", stealing genes from their hosts: LGT
• Suspected to be vectors of gene transfers between eukaryotes
• Virus strains can recombine within hosts (e.g. H1N1)
General methodology for identifying LGT
• Define candidate genes for transfer via BLAST: present in host and viruses
• Find the same genes in different taxa (using BLAST, GenBank, ...): make a dataset with the most closely related hits (BLAST), reference taxa, and the candidate gene in host and virus
• Align sequences and make a tree
• Look at the tree to identify LGT
Host-virus LGT in OtV5?
• Blast each viral ORF against the host genome and keep ORFs meeting specific criteria (AA identity > 45 % over > 50 AA)
• Blastp against GenBank nr, keep all viral ORFs with the host among the 50 best blast hits (BBHs), and retrieve these BBHs
• Keep these sequences if a similar gene function is known in phycodnaviruses
• 6 candidates for LGT
• Make a phylogenetic tree for each candidate, adding host and virus sequences to the alignment + other BBHs (and reference sequences)
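The first filtering step (AA identity > 45 % over > 50 AA) can be sketched as below, assuming the BLAST results were saved in the standard tabular format (-outfmt 6), whose first four columns are query ID, subject ID, percent identity and alignment length. The function name, ORF identifiers and hit lines are hypothetical.

```python
def candidate_orfs(blast_tab_lines, min_identity=45.0, min_length=50):
    """Return viral ORFs with at least one host hit above both thresholds.

    blast_tab_lines : iterable of tab-separated BLAST lines (tabular format,
                      columns: qseqid, sseqid, pident, length, ...)
    """
    keep = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):   # skip blanks and comments
            continue
        fields = line.rstrip("\n").split("\t")
        query, identity, length = fields[0], float(fields[2]), int(fields[3])
        if identity > min_identity and length > min_length:
            keep.add(query)
    return sorted(keep)


# Hypothetical hits of viral ORFs against a host genome
hits = [
    "orf_001\thost_gene_17\t62.5\t210\t70\t3\t1\t210\t5\t214\t1e-60\t230",
    "orf_002\thost_gene_03\t38.0\t180\t100\t5\t1\t180\t2\t181\t1e-10\t60",
    "orf_003\thost_gene_44\t80.0\t40\t8\t0\t1\t40\t10\t49\t1e-05\t45",
]
print(candidate_orfs(hits))   # ['orf_001']
```

The ORFs kept at this stage are only candidates: the later steps (search against GenBank nr, check of known gene function, and a phylogenetic tree per candidate) are what support or reject an LGT hypothesis.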
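[Figure: gene trees for the LGT candidates (pyrophosphatase, GDP-mannose, unknown function, topoisomerase, ribonucleoside-diphosphate reductase), each annotated with a verdict: mostly "NO", otherwise "?" or "? Maybe..."]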
LGT from the new genomic data?
• These virus genomes possess unique pathways for AA synthesis, never seen in any virus before
• These biosynthesis pathways are not shared by all prasinovirus genomes
• An HSP70 gene is found only in the BpV genome
• Do the genes involved originate from a lateral transfer?
• LGT from host or other sources?
Different AA synthesis pathways in related virus genomes!
• Only in MpV and OtV: LGT?
• Only in OtV: LGT?
[Figures: gene trees for acetolactate synthase, asparagine synthase, dehydroquinate synthase and HSP 70, placing the viral sequences (OtV1, OtV5, OlV1, MpV1; BpV1 and BpV2 for HSP 70) among sequences from host green algae (Ostreococcus, Bathycoccus, Micromonas), other algae and land plants, bacteria and other taxa]
Evolution of a gene family
Example of capsid protein in prasinoviruses
Phycodnaviruses are icosahedral, with a capsid formed by different proteins (capsomers, or capsid-like proteins (clp)), which have probably evolved via duplications
What does comparative genomics tell us about clp evolution in prasinoviruses within phycodnaviruses?
Capsomer evolution
[Figure: phylogenetic tree of capsid-like proteins clp1 to clp8 from OtV1, OtV5, OlV1, MpV1, BpV1 and BpV2, with sequences from other phycodnaviruses (ATCV1, PBCV NY2A, PpV, PoV, HaV); node support values from 0.55 to 1; scale bar 0.2]
• 8 putative capsid genes in prasinovirus genomes (7 in BpV)
• Many duplications
• Phylogenetic tree including other phycodnaviruses available
[Figure: annotated tree of capsid-like proteins (clp1 to clp8) in prasinoviruses and other phycodnaviruses, with a loss marked by X]
• Ancestral copy in prasinoviruses = clp6
• Evolution via duplications
• Loss of clp1 in BpV
To conclude...
• All these questions can only be studied using comparative genomics
• Obtaining genomes is becoming easier, faster, and cheaper, but analyzing them requires more and more human skills: this is where the bottleneck is now, so get to genomics!