Genomes, phylogeny and lateral gene transfer
desdevises.free.fr/ISMB2011
Yves Desdevises
Observatoire Océanologique de Banyuls, Université Pierre et Marie Curie, France

Outline
• Molecular phylogenetics
• Phylogenomics
• Lateral gene transfer
• Illustrated uses of evolutionary genomics in a microalgae-virus association
A few words on molecular phylogenetics
• Goal: propose a hypothesis of relationships between several taxa
• Phylogeny = tree
• Speciation: binary
• Hypothesis
[Figure: a tree relating taxa A, B and C, and a phylogenetic network]
Phylogenetic trees
[Figure: phylogeny of fish species (Symphodus, Labrus, Ctenolabrus, Cheilinus, Halichoeres, Thalassoma and others, with Pagrus major), drawn in several layouts]
Molecular data
• Most common source of data for phylogeny
• Nucleotides or amino acids (for ancient divergences)
• Important step: alignment (with the help of alignment software)
• Use of evolutionary models to build trees
• Gene tree ≠ species tree
• Genes: orthologous or paralogous
[Figure: tree of an ancestral gene with a duplication, giving copies A, B, C and a, b, c; pairs of orthologs and paralogs are marked]
Making a molecular phylogeny
[Flowchart]
• Data: DNA, AA, ...
• Alignment: software + eye
• Data quality: saturation, homogeneity, ...
• Method (depends on data type and number of taxa): characters or distances
- Characters: MP (weighting of sites or changes?), ML and BI (model?)
- Distances (model?): with an optimality criterion (ME, ...) or without (NJ, ...)
• Tree(s)
• Validation: bootstrap, ...
Optimality criteria
• Different methods to choose the “best tree” from the alignment
• Hypothesis on how evolution works
• Different in different methods
- Number of steps (parsimony)
- Sum of branch lengths (minimum evolution)
- Likelihood (ML, with evolutionary model)
Parsimony
• Method based on individual characters, associated with cladistics
• “Ockham’s razor”: favour the simplest solution
• Assess the fit of characters to trees by mapping them onto each tree via parsimony
• Optimisation is different on different trees
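To make "number of steps" concrete, here is a minimal sketch of character mapping with the Fitch algorithm, one common way to count parsimony steps (the slides do not say which algorithm is meant). The nested-tuple tree encoding, the function names and the toy alignment are assumptions made for illustration.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree   -- nested 2-tuples of tip names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping each tip name to its observed state
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):      # tip: the state set is the observation
            return {states[node]}
        left, right = visit(node[0]), visit(node[1])
        if left & right:               # children share a state: no change here
            return left & right
        steps += 1                     # no shared state: count one change
        return left | right

    visit(tree)
    return steps


def tree_length(tree, alignment):
    """Sum the Fitch step counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Toy alignment, invented for illustration
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
print(tree_length((("A", "B"), ("C", "D")), alignment))  # 2
print(tree_length((("A", "C"), ("B", "D")), alignment))  # 4
```

The same characters need 2 steps on the first tree and 4 on the second, so parsimony prefers the first.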
Distances
• Assessment of the mean number of changes between two taxa
• Based on distances, not individual characters
• Data are sometimes available only as distances (e.g. DNA/DNA hybridization temperature); if not, the data are transformed into a distance matrix
• Mainly used for molecular data; molecular distances can be corrected using models of sequence evolution (the same as in ML)
• Main method: neighbor-joining (NJ)
• Very fast
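As an illustration of a model-corrected distance, the sketch below computes the observed proportion of differences (p-distance) and its Jukes-Cantor (JC69) correction. JC69 is chosen here only because it is the simplest substitution model; the slides do not name a model, and the function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(seq1, seq2):
    """Expected substitutions per site under JC69: d = -3/4 ln(1 - 4p/3)."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:                      # saturation: the correction is undefined
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy sequences, invented for illustration: 1 difference over 8 sites
print(p_distance("ACGTACGT", "ACGTACGA"))     # 0.125
print(jc69_distance("ACGTACGT", "ACGTACGA"))  # slightly larger than 0.125
```

The corrected distance exceeds the observed one because some sites may have changed more than once.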
Maximum likelihood
• Maximum Likelihood = ML
• Method based on individual characters
• Uses an explicit evolutionary model (DNA or AA)
• The most computationally complex method
• The model is very important: only for molecular data
• ML finds (one) tree (and model parameters) maximizing the probability of the data
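The idea of "maximizing the probability of the data" can be shown on the smallest possible case: two sequences and one parameter, the distance d between them. This sketch assumes the JC69 model (not named on the slide) and uses a plain grid search; real ML programs optimise the topology, branch lengths and model parameters jointly. All names are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-probability of a pairwise alignment under JC69 at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e           # a site keeps the same base
    p_diff = 0.25 - 0.25 * e           # a site shows one specific other base
    return (n_sites * math.log(0.25)   # base frequencies are all 1/4
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid value of d with the highest log-likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# 10 differences over 100 sites (numbers invented for illustration)
print(ml_distance(100, 10))
```

For this simple case the grid optimum agrees, up to the grid step, with the closed-form JC69 correction -3/4 ln(1 - 4p/3) for p = 0.1.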
Bayesian inference
• Recent and now widely used method
• Uses Bayes' formula to generate the posterior probability of parameters (including topology and branch lengths), combining the data with previous knowledge: the prior probability
• Tree (as well as model parameters such as substitution rates) with confidence intervals (support values) for clades
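Written out, Bayes' formula for tree inference takes the following form. The symbols are chosen here for illustration and do not come from the slides: \tau stands for the topology with its branch lengths, \theta for the parameters of the substitution model, and D for the alignment.

```latex
P(\tau, \theta \mid D) = \frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)}
```

P(D \mid \tau, \theta) is the same likelihood that ML maximizes, P(\tau, \theta) is the prior, and the left-hand side is the posterior probability. The support value of a clade is the share of posterior probability carried by the trees that contain it.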
Validation
• Trees can be validated using resampling procedures such as the bootstrap (resampling with replacement)
• Assessment of clade support: add some noise to (alter) the data and rebuild the tree; do this for many altered datasets
• The stronger a clade is, the more often it appears in the rebuilt trees: % = bootstrap support
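The resampling loop can be sketched as follows. The tree-building step is deliberately replaced by a toy stand-in (`closest_pair`, which returns only the pair of most similar sequences as a single "clade"); this is an assumption made to keep the example self-contained, and in practice any tree-building method would take its place.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(seq[i] for i in picks) for t, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for a tree builder: the two most similar taxa as one clade."""
    taxa = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:]]

    def n_diff(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    return {frozenset(min(pairs, key=n_diff))}


def bootstrap_support(alignment, build_clades, n_replicates=200, seed=0):
    """Percentage of replicates in which each clade is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(build_clades(resample_columns(alignment, rng)))
    return {clade: 100.0 * n / n_replicates for clade, n in counts.items()}


# Toy alignment, invented for illustration
toy = {"A": "AAAAAAAA", "B": "AAAAAAAA", "C": "CCCCCCCC"}
print(bootstrap_support(toy, closest_pair))
```

Here A and B are identical in every column, so the clade {A, B} is recovered in every replicate.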
Supertrees
• Combine trees with partially overlapping taxa
• Bigger tree
• Many methods (at least 17)
• Uses of supertrees
- Combining trees from different data/studies
- Phylogenomics: genes are often unequally present in the taxa under study
- Metagenomics: taxa partially and unequally represented in sequences
➡ Many gaps in the matrix:
- Supermatrix (as is)
- Design several complete sub-matrices, compute subtrees, build supertree
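One widely used supertree method, matrix representation with parsimony (MRP), is not named on the slides but shows how trees with partially overlapping taxa can be combined. Each clade of each source tree becomes one binary character, and taxa missing from a tree are coded "?". The sketch below only builds that matrix; analysing it with parsimony would give the supertree. The tree encoding and function names are assumptions.

```python
def tips(node):
    """Set of tip names below a node (trees are nested tuples of names)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tips(child) for child in node))


def internal_clades(tree):
    """Tip sets of all internal nodes except the root."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(tips(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def mrp_matrix(trees):
    """One column per clade: 1 = in clade, 0 = in tree only, ? = not in tree."""
    all_taxa = sorted(frozenset().union(*(tips(t) for t in trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        present = tips(tree)
        for clade in internal_clades(tree):
            for taxon in all_taxa:
                rows[taxon] += ("1" if taxon in clade
                                else "0" if taxon in present else "?")
    return rows


# Two toy source trees sharing only taxa B and C (invented for illustration)
source_trees = [(("A", "B"), "C"), (("B", "D"), "C")]
print(mrp_matrix(source_trees))
```

Taxon A is absent from the second tree and D from the first, which is where the "?" entries (the gaps in the matrix) come from.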
• e.g. Sargasso Sea environmental sequences
...
Phylogenomics
• Genomes: more accurate and precise phylogenies? Not so simple...
• Very large datasets: computation is difficult
• Genomes are plastic: duplications (total, partial), fusions, chromosome fissions, LGT, ...
• No good model of genomic evolution
• Increasing the number of characters only reduces stochastic (random) error
• The possibility of systematic error remains, for example caused by a wrong choice of method or model
• 3 main biases
• Composition bias: sequences with the same composition tend to cluster
- Check from sequences
• Long branch attraction
- Good taxon sampling
• Heterotachy: substitution rate change through time for fixed positions
- Hard to detect and correct
Genomes
• More characters
• New character types: gene order, gene content, nucleotide signature (DNA strings), rare genomic changes
• 2 main approaches
- Classical: sequences (gene concatenation) and phylogeny (supermatrix or supertree)
- Whole genome features: gene order, gene content, DNA string
• + 1: rare genomic changes
Classical methods
• Resolution of difficult phylogenetic problems (e.g. Tree of Life, Eukaryotes, Bilateria)
• Evolution of gene groups (e.g. family): mutations, selective pressure, divergence, duplications, ...
• Identification of lateral gene transfer (which brings noise into the tree-like signal)
• Example: classical tree of deuterostomes
• Genomic data (Nature, 2006)
- 146 genes
- Classical methods: sequences
- Bias control
• Example: Eukaryote phylogeny (2009, 2010)
• Example: Tree of Life (Science, 2006)
• But is the history of life really tree-like?
Lateral Gene Transfer
• LGT in the tree of life
• Lateral gene transfer is increasingly recognized as an important factor shaping the evolution of life
• The current debate is no longer about the existence of LGT but about its importance: can we still consider that the evolution of life is mainly tree-like?
• No (?) in Prokaryotes
• Yes (?) in Eukaryotes
Methods to unveil LGT
• Compositional methods
• Comparison of evolutionary rates
• Look for similar sequences in databases through BLAST
• Phylogenetic approach
Compositional methods
• Look via bioinformatics in complete genomes for
- atypical nucleotide composition in putatively transferred genes
- atypical codon usage patterns
• Only for recent transfer events (before homogenization)
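A minimal sketch of the nucleotide-composition screen: compute the GC content of every gene in a genome and flag those that deviate strongly from the genome-wide mean. GC content as the composition measure, the z-score cutoff of 2 and the function names are all assumptions made for illustration; published methods use richer statistics such as codon usage.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_genes(genes, z_cutoff=2.0):
    """Names of genes whose GC content lies more than z_cutoff SDs from the mean.

    genes -- dict mapping gene name to DNA sequence
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu, sigma = mean(gc.values()), pstdev(gc.values())
    if sigma == 0:                     # all genes identical in composition
        return []
    return [name for name, value in gc.items()
            if abs(value - mu) / sigma > z_cutoff]


# Toy genome, invented for illustration: five typical genes and one GC-rich gene
genes = {f"gene{i}": "ATGC" for i in range(5)}
genes["foreign"] = "GGCC"
print(atypical_genes(genes))  # ['foreign']
```

A gene transferred long ago would have drifted towards the composition of its new genome and would no longer be flagged, which is why the approach only detects recent events.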
Evolutionary rates
• Compare pairwise distances between gene orthologs within families vs distances between genomes (from a reference tree): if there is no LGT, these distances should be roughly equal
• Another possibility is to compare instantaneous substitution matrices in genes vs genomes
• In case of LGT, these rates should differ
BLAST and similarity
• Find homologs of a query sequence in databases, genomes, ... via a similarity search (e.g. BLAST)
• Pattern of gene presence/absence in organisms = phyletic pattern
• Identification of LGT for genes with unusual affiliation
• Drawback: similarity does not necessarily mean evolutionary proximity
• Sensitive to taxa/gene representation in databases
Phylogenetic approach
• Look for individual gene trees incongruent with a reference phylogeny
• Reference tree: rDNA, genomes, gene concatenation, consensus tree, supertree, ...
• Need well supported trees
• Test for incongruence between topologies
• Cannot detect LGT between neighbours
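A simple way to quantify incongruence between a gene tree and a reference tree is the Robinson-Foulds distance, sketched here in its rooted form as the number of clades found in one tree but not in the other. This is a descriptive measure, not the statistical test of incongruence mentioned above, and the tree encoding and names are assumptions.

```python
def clades(tree):
    """Set of clades (tip sets of internal nodes) of a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found


def rooted_rf_distance(tree1, tree2):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree1) ^ clades(tree2))


# Toy trees, invented for illustration
reference = ((("A", "B"), "C"), "D")
gene_tree = ((("A", "C"), "B"), "D")
print(rooted_rf_distance(reference, gene_tree))  # 2
print(rooted_rf_distance(reference, reference))  # 0
```

A distance of 0 means identical topologies; a transfer between distant lineages tends to move a sequence across the tree and raise the distance, while a transfer between neighbours may leave the topology unchanged.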
• Between symbionts with complete genomes available: blast symbiont ORFs against host genome to identify putatively transferred genes
• Cophylogenetic methods (gene tree within species tree) can be used to infer a scenario for the LGT
• Example: frp gene acquired in red algae and green plants from ∂-proteobacteria
• Example: multiple transfers virus-to-host and host-to-virus between Emiliania huxleyi and EhV86
Case study: Prasinophyte microalgae and their viruses
Hosts: Prasinophyceae
• Chlorophyta: green algae (Order Mamiellales, ubiquitous picophytoplankton)
• 3 main genera, 6 complete genomes to date
- Ostreococcus (3 genomes)
- Bathycoccus (1 genome)
- Micromonas (2 genomes)
[Figure: Ostreococcus, Bathycoccus and Micromonas; Chrétiennot-Dinet et al. (1995)]
Host phylogeny (SSU rDNA)
[Figure: SSU rDNA tree of host strains in three genera; node support values range from 0.63 to 1]
- Ostreococcus: RCC344, RCC356, O. lucimarinus CCMP2972, RCC1108, O. tauri RCC745, RCC1107
- Bathycoccus: B. prasinos RCC1105, B. prasinos RCC464
- Micromonas: M. pusilla RCC497, M. pusilla CCMP1545, RCC1109, RCC828, RCC451
Viruses
• Phycodnavirus
• Prasinovirus
• Important role in the regulation of phytoplanktonic populations
[Image: ML Escande, OOB]
• Giant virus ("Girus"): 100-200 nm
• Large genomes: about 200 kb (closely related Chlorovirus: almost 400 kb!)
Prasinovirus tree from partial DNA polymerase (about 600 bp)
[Figure: tree with three groups of viruses: OxV, MpV and BpV]
Prasinovirus genomes
• 6 genomes (4 correspond to hosts with complete genomes)
- 3 OxV: 1 OlV + 2 OtV
- 2 BpV
- 1 MpV
OtV5 genome
Genome comparisons
Phylogeny from genomic data
• Global tree based on ß DNA polymerase (DP): Eukaryotes, Eubacteria, Archaea, and viruses
• Focus on prasinoviruses and their hosts based on the concatenation of 5 genes in common: DP, PCNA, large and small subunits of ribonucleotide reductase, thymidine synthase
[Figure: the two trees (scale bars 0.1). Labels include the hosts Ostreococcus tauri, O. lucimarinus, Micromonas and Bathycoccus; the prasinoviruses OtV1, OtV5, OlV1, MpV1, BpV1 and BpV2; other viruses (PBCV, ATCV, EhV86, Mimivirus); and other taxa such as Chlamydomonas, Arabidopsis, Homo, Thermococcus and Methanosarcina]
• Green algal dsDNA viruses are monophyletic
• Global coevolution with algal hosts
• Evolutionary divergence: host > virus
- Recent colonisation of hosts by viruses?
- Higher evolutionary rate in hosts?
[Figure: the same two trees, annotated with these points]
Lateral gene transfers in prasinoviruses
• Viruses are known as "bags of genes" or "gene robbers", stealing genes from their hosts: LGT
• Suspected to be vectors of gene transfers between eukaryotes
• Virus strains can recombine within hosts (e.g. H1N1)
General methodology for identifying LGT
• Define a candidate gene for transfer via BLAST: present in host and viruses
• Find the same gene in different taxa (using BLAST, GenBank, ...): make a dataset with the most closely related hits (BLAST), reference taxa, and the candidate gene in host and virus
• Align sequences and make a tree
• Look at the tree to identify LGT
Host-virus LGT in OtV5?
• BLAST each viral ORF against the host genome and keep the ORFs meeting specific criteria (AA identity > 45 % over > 50 AA)
• Blastp against GenBank nr, keep all viral ORFs with the host among the 50 best BLAST hits (BBHs), and retrieve these BBHs
• Keep these sequences if a similar gene function is known in Phycodnaviruses
• 6 candidates for LGT
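The first filter (AA identity > 45 % over > 50 AA) can be sketched as below, assuming the hits were saved in the standard 12-column tabular BLAST format, where column 1 is the query, column 3 the percent identity and column 4 the alignment length. The ORF and gene names in the example are hypothetical, and the later steps (nr search, function check) are not covered.

```python
def candidate_orfs(blast_tabular_lines, min_identity=45.0, min_length=50):
    """Viral ORFs with at least one host hit above both thresholds.

    blast_tabular_lines -- iterable of tab-separated lines with the columns
    qseqid, sseqid, pident, length, ... (standard tabular BLAST output)
    """
    kept = set()
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):   # skip blanks, comments
            continue
        fields = line.rstrip("\n").split("\t")
        query, identity, length = fields[0], float(fields[2]), int(fields[3])
        if identity > min_identity and length > min_length:
            kept.add(query)
    return sorted(kept)


# Hypothetical hits, invented for illustration
hits = [
    "orf_001\thost_g1\t62.5\t120\t45\t0\t1\t120\t10\t129\t1e-40\t150",
    "orf_002\thost_g2\t38.0\t200\t124\t2\t1\t200\t5\t204\t1e-10\t60",
    "orf_003\thost_g3\t80.0\t30\t6\t0\t1\t30\t1\t30\t1e-05\t40",
]
print(candidate_orfs(hits))  # ['orf_001']
```

The second hit fails on identity and the third on length, so only the first ORF goes on to the next step.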
• Make a phylogenetic tree for each candidate, adding host and virus sequences to the alignment + other BBHs (and reference sequences)
[Figures: trees for the candidates, each marked with a verdict on LGT. Candidates named: pyrophosphatase, GDP-mannose, unknown, topoisomerase, ribonucleoside-diphosphate reductase. Verdicts: "NO" (four times), "?" and "? Maybe..."]
LGT from the new genomic data?
• These virus genomes possess unique pathways for AA synthesis, never seen in any virus before
• These biosynthesis pathways are not shared by all prasinovirus genomes
• An HSP70 gene is found only in the BpV genome
• Do the genes involved originate from a lateral transfer?
• LGT from host or other sources?
Different AA synthesis pathways in related virus genomes
[Figure: pathway comparison, annotated "Only in MpV and OtV: LGT?" (twice) and "Only in OtV: LGT?"]
[Figures: gene trees for acetolactate synthase, asparagine synthase, dehydroquinate synthase and HSP 70, with sequences from prasinoviruses (OtV1, OtV5, OlV1, MpV1, BpV1, BpV2), their hosts (Ostreococcus, Micromonas, Bathycoccus), and other eukaryotes and bacteria]
Evolution of a gene family
Example of capsid protein in prasinoviruses
• Phycodnaviruses are icosahedral, with a capsid formed by different proteins (capsomers, or capsid-like proteins (clp)), which have probably evolved via duplications
• What does comparative genomics tell us about clp evolution in prasinoviruses within phycodnaviruses?
Capsomer evolution
• 8 putative capsid genes in prasinovirus genomes (7 in BpV)
• Many duplications
• Phylogenetic tree including other phycodnaviruses available
[Figure: tree of capsid-like proteins clp1 to clp8 from OtV1, OtV5, OlV1, MpV1, BpV1 and BpV2 (no clp1 in BpV), with sequences from ATCV1, PBCV NY2A, PpV, PoV and HaV; node support values from 0.55 to 1; scale bar 0.2]
• Evolution via duplications
• Loss of clp1 in BpV
• Ancestral copy in Prasinoviruses = clp6
To conclude...
• All these questions can only be studied using comparative genomics
• Obtaining genomes is increasingly easy, fast, and cheap, but analyzing them requires more and more human skills: this is where the bottleneck is now, so get into genomics!