Big slides - Yves Desdevises .fr


Genomes, phylogeny and lateral gene transfer desdevises.free.fr/ISMB2011

Yves Desdevises Observatoire Océanologique de Banyuls Université Pierre et Marie Curie France

Outline
• Molecular phylogenetics
• Phylogenomics
• Lateral gene transfer
• Illustrated uses of evolutionary genomics in a microalgae-virus association

A few words on molecular phylogenetics

• Goal: propose a hypothesis of relationships between several taxa

• Phylogeny = tree
• Speciation: binary
• Hypothesis

[Figure: example tree relating three taxa A, B and C]

Phylogenetic network and phylogenetic trees

[Figure: the same set of fish taxa (Symphodus, Ctenolabrus, Labrus, Cheilinus, Epibulus, Stetojulis, Halichoeres, Labropsis, Anampses, Labroides, Labrichthys, Coris, Hemigymnus, Thalassoma, Pictilabrus, Notolabrus, Bodianus and Clepticus species, plus Pagrus major) shown as a phylogenetic network and as phylogenetic trees drawn in several layouts]

Molecular data

• Most common source of data for phylogeny
• Nucleotides or amino acids (for ancient divergences)
• Important step: alignment (with the help of alignment software)

• Use of evolutionary models to build trees

• Gene tree ≠ species tree
• Genes: orthologous or paralogous

[Figure: tree of an ancestral gene that undergoes a duplication, giving gene copies a, b, c and A, B, C; pairs of copies are labelled as orthologs or paralogs]

Making a molecular phylogeny

[Flowchart, in reading order]
• Data: DNA, AA, ...
• Alignment: software + eye
• Data quality: saturation, homogeneity, ...
• Method (chosen according to data type and number of taxa): characters or distances
  - Characters: BI and ML (model?), MP (weighting of sites, changes?)
  - Distances (model?): with an optimality criterion (ME, ...) or without (NJ, ...)
• Tree(s)
• Validation: bootstrap, ...

Optimality criteria
• Different methods to choose the “best tree” from the alignment
• Hypothesis on how evolution works
• Different in different methods
  - Number of steps (parsimony)
  - Sum of branch lengths (minimum evolution)
  - Likelihood (ML, with evolutionary model)

Parsimony

• Method based on individual characters, associated with cladistics
• “Ockham’s razor”: favour the simplest solution
• Assess character fit to trees by character mapping via parsimony
• Optimisation is different on different trees
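To make "number of steps" concrete, here is a minimal sketch of Fitch's algorithm, one standard way of counting the minimum number of changes a single character requires on a given rooted binary tree. The nested-tuple tree encoding, the taxon names and the character states are illustrative assumptions, not data from these slides.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one character on a rooted binary tree.

    tree   : nested 2-tuples with leaf names as strings (hypothetical encoding)
    states : dict mapping each leaf name to its observed character state
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):              # leaf: its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                             # children agree: no change needed
            return common
        steps += 1                             # children disagree: one change
        return left | right

    state_set(tree)
    return steps


# Made-up character for four taxa, mapped on two competing trees.
character = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(fitch_steps(tree_1, character))  # 1
print(fitch_steps(tree_2, character))  # 2
```

Under the parsimony criterion tree_1 is preferred for this character because it needs fewer steps; a real analysis sums the steps over all characters and searches over many trees.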

Distances

• Assessment of the mean number of changes between two taxa
• Based on distances, not individual characters
• Data sometimes available only as distances (e.g. DNA/DNA hybridization temperature); otherwise, the data are transformed into a distance matrix
• Mainly used for molecular data; molecular distances can be corrected using models of sequence evolution (the same as in ML)
• Main method: neighbor-joining (NJ)
• Very fast
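As an illustration of a distance "corrected using a model of sequence evolution", the sketch below computes the observed proportion of differing sites (p-distance) and applies the Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3). JC69 is chosen here only because it is the simplest model; the two sequences are made up.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring positions with a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (same length)")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """JC69-corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: distance is undefined under JC69 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Made-up aligned sequences: 2 differences over 10 sites.
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGTAA"
p = p_distance(seq_1, seq_2)
d = jukes_cantor(p)
print(p, round(d, 4))
```

The corrected distance is larger than the raw proportion because the model accounts for multiple substitutions hitting the same site. A matrix of such pairwise distances is the input of methods like NJ.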

Maximum likelihood

• Maximum Likelihood = ML
• Method based on individual characters
• Uses an explicit evolutionary model (DNA or AA)
• The most computationally demanding method
• Model very important: only for molecular data
• ML finds (one) tree (and model parameters) maximizing the probability of the data
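The idea of "maximizing the probability of the data" can be sketched on the smallest possible case: two aligned sequences and a single parameter, the distance d between them, under the JC69 model. This is a toy illustration with made-up counts, not how tree-search programs are implemented; the grid search and its step size are arbitrary choices.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diffs):
    """Log-likelihood of a pairwise alignment at distance d under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e        # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * e        # probability of one specific different base
    n_same = n_sites - n_diffs
    # 0.25 is the equilibrium frequency of the base at the first sequence
    return n_same * math.log(0.25 * p_same) + n_diffs * math.log(0.25 * p_diff)


def ml_distance(n_sites, n_diffs, step=0.001, d_max=3.0):
    """Distance on a regular grid that maximizes the log-likelihood."""
    grid = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diffs))


# Made-up data: 100 aligned sites, 20 of which differ.
best_d = ml_distance(n_sites=100, n_diffs=20)
print(best_d)
```

For this simple model the maximum coincides (up to the grid resolution) with the analytical Jukes-Cantor distance, about 0.233 for 20 % of differing sites. With real trees the same principle applies, but topology, all branch lengths and model parameters are optimized together.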

Bayesian inference
• Recent and now widely used method
• Uses Bayes' formula to generate the posterior probability of parameters (among which topology and branch lengths), based on previous knowledge about the data: the prior probability
• Produces a tree (as well as model parameters such as substitution rates) with confidence intervals (support values) for clades
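A minimal sketch of Bayes' formula applied to tree topologies: posterior ∝ prior × likelihood, normalized over the candidate trees. The three topologies, the flat prior and the log-likelihood values are invented for illustration; real programs typically approximate the posterior by MCMC sampling rather than by enumerating trees.

```python
import math


def posterior(priors, log_likelihoods):
    """Posterior probability of each hypothesis from priors and log-likelihoods."""
    log_post = {t: math.log(priors[t]) + log_likelihoods[t] for t in priors}
    top = max(log_post.values())               # subtract the max to avoid underflow
    weights = {t: math.exp(v - top) for t, v in log_post.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Hypothetical values: three rooted topologies for taxa A, B, C.
priors = {"((A,B),C)": 1 / 3, "((A,C),B)": 1 / 3, "((B,C),A)": 1 / 3}
log_likelihoods = {"((A,B),C)": -1000.0, "((A,C),B)": -1002.0, "((B,C),A)": -1005.0}

post = posterior(priors, log_likelihoods)
for topology, prob in post.items():
    print(topology, round(prob, 3))
```

The posterior probability of a topology (or of a clade, summed over the trees that contain it) plays the role of the support value mentioned above.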

Validation
• Trees can be validated using resampling procedures such as the bootstrap (resampling with replacement)
• Assessment of clade support: add some noise to (alter) the data and rebuild the tree; repeat this for many altered datasets
• The stronger a clade, the more often it appears among these trees: % = bootstrap support
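The bootstrap procedure can be sketched as follows: alignment columns are resampled with replacement, the analysis is repeated on each pseudo-replicate, and the support of a grouping is the percentage of replicates in which it is recovered. To keep the example short, tree building is replaced by a deliberately crude stand-in (the pair of sequences with the smallest number of differences); the alignment, the number of replicates and the seed are made up.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (same columns for all taxa)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the fewest differences."""
    names = sorted(alignment)

    def differences(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda pair: differences(*pair))


def bootstrap_support(alignment, pair, replicates=200, seed=1):
    """Percentage of pseudo-replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Made-up alignment of four taxa, 20 sites each.
alignment = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGTACGA",
    "C": "ACGAACGTTCGTACCTAGGT",
    "D": "TCGAACGTTCGTACCTAGGA",
}
support = bootstrap_support(alignment, ("A", "B"))
print(f"support for (A,B): {support:.1f} %")
```

In a real analysis `closest_pair` would be replaced by the full tree-building method (NJ, MP, ML, ...), and support would be recorded for every clade of the tree.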

Supertrees

• Combine trees with partially overlapping taxa
• Bigger tree
• Many methods (at least 17)
• Uses of supertrees
  - Combining trees from different data/studies
  - Phylogenomics: genes are often unequally present in the taxa under study
  - Metagenomics: taxa partially and unequally represented in sequences
➡ Many gaps in the matrix:
  - Supermatrix (as is)
  - Design several complete sub-matrices, compute subtrees, build supertree
  - e.g. Sargasso Sea environmental sequences

...

Phylogenomics

• Genomes: more accurate and precise phylogenies? Not so simple...
• Very large datasets: computation is difficult
• Genomes are plastic: duplications (total, partial), fusions, chromosome fissions, LGT, ...
• No good model of genomic evolution
• Stochastic (random) error decreases simply by increasing the number of characters
• The possibility of systematic error remains, for example caused by a wrong choice of method or model
• 3 main biases
  - Composition bias: sequences with the same composition tend to cluster (check from sequences)
  - Long branch attraction (good taxon sampling)
  - Heterotachy: the substitution rate at a given position changes through time (hard to detect and correct)

Genomes

• More characters
• New character types: gene order, gene content, nucleotide signature (DNA strings), rare genomic changes
• 2 main approaches
  - Classical: sequences (gene concatenation) and phylogeny (supermatrix or supertree)
  - Whole genome features: gene order, gene content, DNA string
  - + 1: rare genomic changes

Classical methods

• Resolution of difficult phylogenetic problems (e.g. Tree of Life, Eukaryotes, Bilateria)
• Evolution of gene groups (e.g. family): mutations, selective pressure, divergence, duplications, ...
• Identification of lateral gene transfer (which brings noise into the tree-like signal)
• Example: classical tree of deuterostomes
• Genomic data (Nature, 2006)
  - 146 genes
  - Classical methods: sequences
  - Bias control

• Example: Eukaryote phylogeny (2009, 2010)

• Example: Tree of Life (Science, 2006)

• But is the history of life really tree like?

Lateral Gene Transfer

• LGT in the tree of life
• Lateral gene transfer is increasingly recognized as an important factor shaping the evolution of life
• The current debate is no longer about the existence of LGT but about its importance: can we still consider that the evolution of life is mainly tree-like?
  - No (?) in Prokaryotes
  - Yes (?) in Eukaryotes

Methods to unveil LGT
• Compositional methods
• Comparison of evolutionary rates
• Look for similar sequences in databases through BLAST
• Phylogenetic approach

Compositional methods

• Look via bioinformatics in complete genomes for
  - atypical nucleotide composition in putatively transferred genes
  - atypical codon usage patterns
• Only for recent transfer events (before homogenization)
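A compositional screen can be sketched with the simplest possible statistic, GC content: genes whose GC content deviates strongly from the genome-wide value are flagged as candidates. The gene names, the sequences and the 0.15 threshold are arbitrary assumptions for illustration; real screens use richer statistics (e.g. codon usage) and a proper null distribution.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    bases = [base for base in seq.upper() if base in "ACGT"]
    return sum(base in "GC" for base in bases) / len(bases)


def atypical_genes(genes, threshold=0.15):
    """Genes whose GC content differs from the pooled value by more than threshold."""
    genome_gc = gc_content("".join(genes.values()))
    flagged = {}
    for name, seq in genes.items():
        deviation = gc_content(seq) - genome_gc
        if abs(deviation) > threshold:
            flagged[name] = round(deviation, 3)
    return flagged


# Made-up "genome": four genes at 50 % GC and one AT-rich gene.
genes = {
    "gene_a": "ATGC" * 10,
    "gene_b": "ATGCGCAT" * 5,
    "gene_c": "GCATATGC" * 5,
    "gene_d": "CATGCATG" * 5,
    "gene_e": "ATATTTAAAT" * 4,
}
flagged = atypical_genes(genes)
print(flagged)
```

As noted above, such a signal fades with time: a transferred gene gradually takes on the composition of its new genome, so only recent events can be detected this way.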

Evolutionary rates

• Compare pairwise distances between gene orthologs within families vs distances between genomes (from a reference tree): if there is no LGT, these distances should be roughly equal
• Another possibility is to compare instantaneous substitution matrices in genes vs genomes
• In case of LGT, these rates should differ

BLAST and similarity

• Find homologs of a query sequence in databases, genomes, ... via a similarity search (e.g. BLAST)

• Pattern of gene presence/absence in organisms = phyletic pattern

• Identification of LGT for genes with unusual affiliation

• Drawback: similarity does not necessarily mean evolutionary proximity

• Sensitive to taxa/gene representation in databases

Phylogenetic approach

• Look for individual gene trees incongruent with a reference phylogeny
• Reference tree: rDNA, genomes, gene concatenation, consensus tree, supertree, ...
• Need well supported trees
• Test for incongruence between topologies
• Cannot detect LGT between neighbours
• Between symbionts with complete genomes available: BLAST symbiont ORFs against the host genome to identify putatively transferred genes
• Cophylogenetic methods (gene tree within species tree) can be used to infer a scenario for the LGT
• Example: frp gene acquired in red algae and green plants from ∂-proteobacteria
• Example: multiple transfers virus-to-host and host-to-virus between Emiliania huxleyi and EhV86
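A first look at incongruence between a gene tree and a reference tree can be sketched by comparing their clades: groupings present in one tree but not in the other (the idea behind the Robinson-Foulds distance, here in its rooted form). The nested-tuple encoding and the taxon names are hypothetical, and a raw count of conflicting clades is not a statistical test; as stated above, conclusions require well supported trees and a proper test of incongruence.

```python
def clades(tree):
    """Set of clades (as frozensets of leaf names) of a rooted tree, root excluded."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # leaf
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)                  # the root clade carries no information
    return found


def conflicting_clades(gene_tree, reference_tree):
    """Clades found in exactly one of the two trees."""
    return clades(gene_tree) ^ clades(reference_tree)


# Hypothetical example: in the gene tree, alga_2 groups with a bacterium.
reference = ((("alga_1", "alga_2"), "plant"), ("bacterium_1", "bacterium_2"))
gene_tree = (("alga_1", "plant"), (("alga_2", "bacterium_1"), "bacterium_2"))

conflicts = conflicting_clades(gene_tree, reference)
print(len(conflicts))  # 6
for clade in sorted(conflicts, key=sorted):
    print(sorted(clade))
```

A gene whose position in its own tree contradicts the reference (here alga_2 nested among bacteria) is a candidate for LGT, to be confirmed with support values and a topology test.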

Case study: Prasinophyte microalgae and their viruses

Hosts: Prasinophyceae
• Chlorophyta: green algae (Order Mamiellales, ubiquitous picophytoplankton)
• 3 main genera, 6 complete genomes to date
  - Ostreococcus (3 genomes)
  - Bathycoccus (1 genome)
  - Micromonas (2 genomes)

[Images labelled Ostreococcus, Bathycoccus and Micromonas; credit: Chrétiennot-Dinet et al. (1995)]

Host phylogeny (SSU rDNA)

[Figure: tree of Ostreococcus strains (RCC344, RCC356, O. lucimarinus CCMP2972, RCC1108, O. tauri RCC745, RCC1107), Bathycoccus prasinos strains (RCC1105, RCC464) and Micromonas strains (M. pusilla RCC497, M. pusilla CCMP1545, RCC1109, RCC828, RCC451); node support values range from 0.63 to 1]

Viruses
• Phycodnavirus
• Prasinovirus
• Important role in the regulation of phytoplanktonic populations
• Giant virus ("Girus"): 100-200 nm
• Large genomes: about 200 kb (closely related Chlorovirus: almost 400 kb!)

[Image credit: ML Escande, OOB]

Prasinovirus tree from partial DNA polymerase (about 600 bp)

[Figure: tree with three groups of viruses labelled OxV, MpV and BpV]

Prasinovirus genomes
• 6 genomes (4 correspond to hosts with complete genomes)
  - 3 OxV: 1 OlV + 2 OtV
  - 2 BpV
  - 1 MpV

OtV5 genome

Genome comparisons


Phylogeny from genomic data

Global tree based on ß DNA polymerase (DP): Eukaryotes, Eubacteria, Archaea, and viruses

[Figure: unrooted tree including green algae (Ostreococcus tauri, O. lucimarinus, Bathycoccus, Micromonas CCMP1545 and RCC299, Chlamydomonas, Chlorella), other eukaryotes (e.g. Physcomitrella, Oryza, Sorghum, Arabidopsis, Homo), archaea (e.g. Thermococcus, Methanosarcina) and viruses (OtV1, OtV5, OlV1, MpV1, BpV1, BpV2, PbCV strains, ATCV, EhV86, CeV, Mimivirus); scale bars 0.1; node support shown as pairs of values such as 1/100, 1/94, 1/81 and 0.88/76]

Focus on prasinoviruses and their hosts based on the concatenation of 5 genes in common: DP, PCNA, lsu and ssu Ribonucleotide reductase, Thymidine synthase

[Figure: unrooted tree of prasinoviruses (OtV1, OtV5, OlV1, MpV1, BpV1, BpV2), other viruses (PbCV strains, ATCV, EhV86, CeV, Mimivirus) and algal hosts (Ostreococcus tauri, O. lucimarinus, Bathycoccus, Micromonas CCMP1545 and RCC299), together with other eukaryotes and archaea; scale bars 0.1]

• Green algal dsDNA viruses monophyletic
• Global coevolution with algal hosts
• Evolutionary divergence: Host > virus
• Recent colonisation of hosts by viruses?
• Higher evolutionary rate in hosts?

Lateral gene transfers in prasinoviruses

• Viruses are known as "bags of genes", or "gene robbers", stealing genes from their hosts: LGT
• Suspected to be vectors of gene transfers between eukaryotes
• Virus strains can recombine within hosts (e.g. H1N1)

General methodology for identifying LGT
• Define a candidate gene for transfer via BLAST: present in host and viruses
• Find the same gene in different taxa (using BLAST, GenBank, ...): make a dataset with the most closely related hits (BLAST), reference taxa, and the candidate gene in host and virus
• Align the sequences and build a tree
• Look at the tree to identify LGT

Host-virus LGT in OtV5?
• BLAST each viral ORF against the host genome and keep the ORFs meeting specific criteria (AA ID > 45 % on > 50 AA)
• Blastp against GenBank nr, keep all viral ORFs with the host among the 50 best BLAST hits (BBHs), and retrieve these BBHs
• Keep these sequences if a similar gene function is known in Phycodnaviruses
• 6 candidates for LGT
• Make a phylogenetic tree for each candidate, adding host and virus sequences to the alignment + other BBHs (and reference sequences)
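The first filtering step (AA identity > 45 % over more than 50 aligned residues) can be sketched as a filter on BLAST tabular output. The sketch assumes the standard 12-column tabular format (query, subject, % identity, alignment length, ...), takes alignment length as a proxy for "on > 50 AA", and uses made-up ORF and host identifiers; it is not the pipeline actually used for OtV5.

```python
import csv
import io


def candidate_orfs(blast_tsv, min_identity=45.0, min_length=50):
    """Query IDs with at least one hit above both thresholds.

    blast_tsv: text in BLAST tabular format; only columns 1 (query ID),
    3 (% identity) and 4 (alignment length) are used here.
    """
    keep = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row or row[0].startswith("#"):  # skip blank and comment lines
            continue
        query, identity, length = row[0], float(row[2]), int(row[3])
        if identity > min_identity and length > min_length:
            keep.add(query)
    return sorted(keep)


# Made-up hits of viral ORFs against a host proteome.
hits = "\n".join([
    "orf_001\thost_a\t62.5\t210\t78\t1\t5\t214\t10\t219\t1e-80\t250",
    "orf_002\thost_b\t38.0\t120\t74\t2\t1\t120\t3\t122\t1e-10\t60",
    "orf_003\thost_c\t71.0\t32\t9\t0\t1\t32\t40\t71\t1e-05\t45",
])
print(candidate_orfs(hits))  # ['orf_001']
```

Here orf_002 fails the identity threshold and orf_003 the length threshold. The ORFs retained at this stage are only candidates: the decision about LGT comes from the phylogenetic tree built for each of them.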

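[Figure: phylogenetic trees for the candidate genes, with labels including pyrophosphatase, GDP-mannose, unknown, topoisomerase and ribonucleoside-diphosphate reductase; the verdicts on LGT written on the trees are four "NO", one "?" and one "? Maybe..."]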

LGT from the new genomic data?
• These virus genomes possess unique pathways for AA synthesis, never seen in any virus before
• These biosynthesis pathways are not shared by all prasinovirus genomes
• A HSP70 gene is found only in the BpV genome
• Do the genes involved originate from a lateral transfer?
• LGT from host or other sources?

Different AA synthesis pathways in related virus genomes!

• Only in MpV and OtV: LGT?
• Only in OtV: LGT?

[Figure: four gene trees (acetolactate synthase, asparagine synthase, dehydroquinate synthase and HSP 70), each combining prasinovirus sequences (OtV, OlV, MpV or BpV depending on the gene) with sequences from green algae (Ostreococcus, Bathycoccus, Micromonas, Chlamydomonas, Chlorella), land plants and other organisms, including bacteria]

Evolution of a gene family
Example of capsid protein in prasinoviruses

Phycodnaviruses are icosahedral, with a capsid formed by different proteins (capsomers, or capsid-like proteins (clp)), which have probably evolved via duplications

What does comparative genomics tell us about clp evolution in prasinoviruses within phycodnaviruses?

Capsomers evolution

[Figure: tree of capsid-like proteins clp1 to clp8 from OtV1, OtV5, OlV1, MpV1, BpV1 and BpV2, together with sequences from other phycodnaviruses (ATCV1, PBCV NY2A, PpV, PoV, HaV); node support values range from 0.55 to 1; scale bar 0.2]

• 8 putative capsid genes in prasinovirus genomes (7 in BpV)
• Many duplications
• Phylogenetic tree including other PhycoDNAviruses available

[Figure: the capsid-like protein tree repeated, with an annotation (X)]

• Ancestral copy in Prasinoviruses = clp6
• Evolution via duplications
• Loss of clp1 in BpV

To conclude...
• All these questions can only be studied using comparative genomics
• Obtaining genomes is increasingly easy, fast, and cheap, but analyzing them requires more and more human skills: this is where the bottleneck is now, so get to genomics!