Phylogenetic reconstruction Computer Lab - Yves Desdevises

Sep 7, 2010

If needed, you can download software and instructions from the following link: http://desdevises.free.fr/Phylogenetic_reconstruction/Courses_and_Labs.html

Create a folder, with subfolders, on your computer to hold input and output files: they tend to multiply very rapidly! Of course, you may use any other software you are familiar with. Take your time, and do not hesitate to use the help, example files and manuals of the various programs you will use.

1. Morphology

Software: Text editor (NotePad, …) or Excel

- Define morphological characters on species. You can find some information on the FAO species identification sheets and on FishBase (http://www.fishbase.org/search.php?lang=English)

- Code these characters:
  o Presence/absence if possible (0/1)
  o Otherwise, use additive binary coding (00, 10, 01, 11, ...) or multiple states (0, 1, 2), ordered or not

- Enter the characters in a matrix:
  o Rows: species. Always keep the same species names, each as a single word (e.g. D_cervinus), for all analyses
  o Columns: characters (one word each)
  o Save the matrix in text format with a clear name (e.g. sparmorpho.txt)
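The matrix layout described above can be sketched in Python. This is only an illustration, not part of the lab software: the function name write_matrix, the tab separator, the "species" header cell and the character states are assumptions made for the example. The check on single-word names mirrors the naming rule given above.

```python
import os
import re
import tempfile


def write_matrix(path, characters, rows):
    """Write a tab-separated matrix: a header row, then one row per species."""
    for name in list(characters) + list(rows):
        # Names must be a single word: no blanks or tabs.
        if not re.fullmatch(r"\S+", name):
            raise ValueError(f"name must be a single word: {name!r}")
    lines = ["species\t" + "\t".join(characters)]
    for species, states in rows.items():
        if len(states) != len(characters):
            raise ValueError(f"{species}: expected {len(characters)} states")
        lines.append(species + "\t" + "\t".join(str(s) for s in states))
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


# Hypothetical data: the states below are placeholders, not real observations.
characters = ["incisive", "molar", "spine"]
rows = {"D_cervinus": [1, 1, 0], "S_maena": [0, 0, 1]}
out_path = os.path.join(tempfile.mkdtemp(), "sparmorpho.txt")
matrix_lines = write_matrix(out_path, characters, rows)
print("\n".join(matrix_lines))
```

Keeping the names free of blanks at this stage avoids having to rename taxa later, since the same names are reused in every analysis.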

Phylogeny from morphological characters

Software: PAUP or Phylip, TreeRot, and TreeView or FigTree, …

- First, convert the matrix to the format used by PAUP (Nexus): remove the first line (species) and import this new text file into PAUP, which must already be running. On a PC, PAUP works from commands entered below the window. On a Mac, you need to open the Terminal and put all input files in your home folder. To import sparmorpho.txt, type:
  tonexus format=text fromfile=sparmorpho.txt tofile=sparmorpho.nex datatype=standard
  o In the file you obtain (sparmorpho.nex), add a new line called Charlabels between the lines Format and Matrix; it contains the character labels:


  Charlabels incisive molar pectoral anal spine tc mouth can_ant can_lat rv rh tp tl eye rose pa po g villi;

  You can use the edit command to open and modify a file in PAUP

- Parsimony analysis of this matrix can now start. The outgroup can be defined before or after the analysis, as follows:
  outgroup S_maena
  o Run the analysis, with a search algorithm suited to your number of taxa:
    Exhaustive: alltrees
    Branch-and-Bound: bandb
    Heuristic: hsearch
  o Save the tree with branch lengths and an appropriate name:
    savetrees brlens=yes file=[filename]

- This tree can be viewed and edited in PAUP (showtrees), or in software with a GUI, such as TreeView or FigTree

- If you obtain several trees, you can compute a consensus:
  contree [strict consensus. For a majority-rule consensus: contree / majrule=yes]
  To save the consensus, type:
  contree / treefile=[filename] majrule=yes

Note: PAUP commands are listed here: http://paup.csit.fsu.edu/Cmd_ref_v2.pdf
See also the PAUP website, which has a very informative FAQ: http://paup.csit.fsu.edu/paupfaq/faq.html

You can put all commands in the input file ("batch file"); the complete analysis is then run when you execute the file. To do that, put the commands between begin paup; and end; with ; at the end of each command, as follows (default options need not be given if unchanged):

begin paup;
set criterion=parsimony; [default option]
outgroup S_maena;
hsearch; [for a heuristic search]
savetrees brlens=yes file=sparmorpho.tre;
end;

Of course, you can use other options, and perform various tests (PTP, ...). PAUP can also be used with TreeRot to compute decay indices (see http://people.bu.edu/msoren/TreeRot.html)
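To make the layout of such a batch file concrete, here is a minimal Python sketch that assembles a Nexus file with a data block, a Charlabels line and a PAUP block. It is not a replacement for tonexus: the function name to_nexus, the character states and the choice to derive the symbols list from the data are assumptions made for the example.

```python
def to_nexus(rows, charlabels, paup_commands=()):
    """Return Nexus text for a standard (morphological) matrix.

    rows maps each species name to a string of states, one symbol per character.
    paup_commands, if given, are written in a PAUP block after the data.
    """
    nchar = len(charlabels)
    for name, states in rows.items():
        if len(states) != nchar:
            raise ValueError(f"{name}: expected {nchar} states, got {len(states)}")
    # Symbols actually used in the matrix, ignoring the missing-data code.
    symbols = "".join(sorted({s for states in rows.values() for s in states} - {"?"}))
    width = max(len(name) for name in rows) + 2
    lines = [
        "#NEXUS",
        "begin data;",
        f"dimensions ntax={len(rows)} nchar={nchar};",
        f'format datatype=standard symbols="{symbols}" missing=?;',
        "charlabels " + " ".join(charlabels) + ";",
        "matrix",
    ]
    lines += [name.ljust(width) + states for name, states in rows.items()]
    lines += [";", "end;"]
    if paup_commands:
        lines.append("begin paup;")
        # Every command ends with exactly one semicolon.
        lines += [command.rstrip(";") + ";" for command in paup_commands]
        lines.append("end;")
    return "\n".join(lines) + "\n"


# Hypothetical states; the commands are those of the batch example above.
nexus = to_nexus(
    {"D_cervinus": "110", "S_maena": "001"},
    ["incisive", "molar", "spine"],
    ["outgroup S_maena", "hsearch", "savetrees brlens=yes file=sparmorpho.tre"],
)
print(nexus)
```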


With Phylip:

- Make a copy of sparmorpho. Replace the character names on the first row with the number of species and the number of characters, separated by a blank space. Rename this matrix infile and save it as text with spaces (this is Phylip format)

- Put infile in the exe folder inside the Phylip folder. Phylip is a package containing numerous programs. They produce output files named outtree (tree) and outfile (results). It is important to rename these files so that subsequent analyses do not overwrite them

- Run Pars and analyse infile

- Recover the output files: outfile and outtree. You can give them more meaningful names (e.g. phylomor for the tree)

- You can view and modify the tree in TreeView, NJ-plot, or FigTree, and look at the results in a text editor such as Wordpad or Word

- If you obtain several trees, you can compute a consensus with Consense. To do so, rename your input file intree (duplicate and rename phylomor), place it in the exe folder, and launch Consense. You can change the consensus type via the command C. The output file outtree (to be renamed) can be viewed with TreeView or FigTree, for example
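The Phylip input format used above can be sketched as follows. The function name to_phylip and the states are assumptions for illustration. The sketch relies on the classic Phylip convention that each species name occupies exactly 10 characters, padded with blanks, which is why short names such as S_maena are followed by spaces.

```python
def to_phylip(rows):
    """Return Phylip text: a 'ntax nchar' header, then 10-character names and states."""
    nchar = len(next(iter(rows.values())))
    lines = [f"{len(rows)} {nchar}"]
    for name, states in rows.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        if len(states) != nchar:
            raise ValueError(f"{name}: expected {nchar} states")
        # Pad the name with blanks to exactly 10 characters.
        lines.append(name.ljust(10) + states)
    return "\n".join(lines) + "\n"


# Hypothetical states for two of the species named in this lab.
infile_text = to_phylip({"D_cervinus": "110", "S_maena": "001"})
print(infile_text)
```

Names longer than 10 characters are rejected rather than truncated, so that species names stay identical across analyses.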

2. Molecular characters

Sequence retrieval and management

Software: Seaview (or BioEdit, ClustalX, MAFFT, …)

- Go to NCBI (http://www.ncbi.nlm.nih.gov/)

- Find nucleotide sequences (Search) from the organisms under study

- Compile them in Fasta format in a text file, one file for each sequence type if there are several

- Open this file in Seaview, clean the sequences if needed by cutting unwanted residues, and check whether a manual alignment is possible (Props ➙ Allow seq. Editing). Align the sequences (Align ➙ Align all; the options let you choose between ClustalW2 and Muscle). You can also use MAFFT (http://align.bmr.kyushu-u.ac.jp/mafft/online/server/)

- It is often a good idea to check the alignment (e.g. in Seaview) and to improve it manually. You can use GBlocks (http://molevol.cmima.csic.es/castresana/Gblocks_server.html) to keep only the well-aligned regions. GBlocks requires the alignment to be exported in Fasta format.
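As a complement to the visual check, a short Python sketch can verify that a Fasta file really is an alignment (all sequences the same length) and report how gappy each column is. This is not what GBlocks does; it is only a quick sanity check. The function names and the sequences below are invented for the example.

```python
def read_fasta(text):
    """Parse Fasta text into a dict {name: sequence}; the name is the first word after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


def gap_fraction_per_column(records):
    """Return, for each alignment column, the fraction of sequences showing a gap ('-')."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length: this is not an alignment")
    seqs = list(records.values())
    return [sum(seq[i] == "-" for seq in seqs) / len(seqs) for i in range(lengths.pop())]


# Invented toy alignment, for illustration only.
example = """>Bathycoccus
GAAAC-GC
>Ostreococcus
GAA-CTGC
>Micromonas
GAA-CTGC
"""
alignment = read_fasta(example)
fractions = gap_fraction_per_column(alignment)
print(fractions)
```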

Model selection

Software: PAUP, ModelTest, MrModeltest

- Execute your sequence dataset in PAUP

- Execute the modelblockPAUPb10 file (in the paupblock folder inside the Modeltest folder)

- Run the model.scores file in ModelTest: in Windows, an MS-DOS window must be used, and a command line entered, such as
  modeltest3.7

- You can also use the ModelTest server: http://darwin.uvigo.es/software/modeltest_server.html

- The output file contains the selected model and a command block for PAUP

- MrModeltest works in exactly the same way, but generates command blocks for PAUP and MrBayes

3. Molecular phylogeny

Parsimony, maximum likelihood and distance

Software: SeaView, PhyML, PAUP, Phylip

With SeaView (which contains a PhyML function for ML analysis, and uses a Phylip module for parsimony):

- Open SeaView and open the alignment

- You can select species (use the mouse to create a group: Species ➙ Create group) and/or sites (Site ➙ Create set, then select with the mouse at the bottom of the window)

- Try different tree reconstructions by varying methods and parameters via the Trees menu. You can validate your tree using a bootstrap analysis. For model-based analyses (e.g. ML), set the model and parameters you obtained from ModelTest.

- Save your tree(s) via File in the tree window, giving clear names to the trees (e.g. phylomol.tre). You can save trees as text (Newick or Phylip format, which can be read by software such as FigTree) or as graphic files (pdf). Do not forget to root your tree and show clade support values.

With PAUP

- It is easier to enter the commands directly in the input file, below the data matrix, and to execute it, for example:

  begin paup;
  log file=log.txt start;
  set criterion=parsimony;
  hsearch nreps=10 addseq=random swap=tbr;
  savetrees file=mp.tre brlens;
  set criterion=distance;
  dset distance=logdet objective=me;
  hsearch nreps=10 addseq=random swap=tbr;
  savetrees file=me.tre brlens;
  set criterion=likelihood;
  lset nst=2 basefreq=empirical rates=gamma ncat=4;
  lset tratio=estimate shape=estimate;
  lscore 1;
  lset tratio=previous shape=previous;
  hsearch nreps=1 swap=tbr start=1;
  savetrees file=ml.tre brlens;
  log stop;
  end;

- Copying the command block obtained from the ModelTest analysis automatically applies the best ML model to the data, for ML and distance analyses, depending on the optimality criterion chosen, for example:

  […]
  Bathycoccus   GAAACTGC[…]
  Ostreococcus  GAAACTGC[…]
  ;
  End;

  [!Likelihood settings from best-fit model (TrN+I+G) selected by hLRT in Modeltest 3.8 on Mon Jul 21 09:36:36 2008]

  BEGIN PAUP;
  Lset Base=(0.2618 0.2011 0.2659) Nst=6 Rmat=(1.0000 2.6267 1.0000 1.0000 5.4868) Rates=gamma Shape=0.7304 Pinvar=0.5973;
  END;

With PhyML

- PhyML can be run directly from its website (http://atgc.lirmm.fr/phyml/); a local version can also be downloaded

- The input file must be in Phylip format

- Set the various parameters and run the analysis

4. Molecular phylogeny via Bayesian inference

Software: MrBayes, Tracer

- MrBayes uses Nexus files. However, it is less tolerant than PAUP of variants of the Nexus format. For example, Nucleotide must be replaced by DNA as the datatype in the input file

- Enter commands directly after the data matrix. Examples:

  Template for non-coding sequences:

  begin mrbayes;
  log start replace;
  set autoclose = no nowarn=yes;
  lset nst=6 rates = invgamma;
  mcmc ngen=500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes;
  sumt burnin=1250 contype = halfcompat;
  log stop;
  end;

  Template for coding sequences (for a codon partition model):

  begin mrbayes;
  log start replace;
  set autoclose = no nowarn=yes;
  charset 1st_pos = 1-.\3;
  charset 2nd_pos = 2-.\3;
  charset 3rd_pos = 3-.\3;
  partition by_codon = 3:1st_pos,2nd_pos,3rd_pos;
  set partition = by_codon;
  lset applyto = (all) nst=6 rates = invgamma;
  prset applyto = (all);
  unlink revmat=(all) shape=(all) pinvar=(all) statefreq=(all) tratio=(all);
  mcmc ngen=500000 printfreq=1000 samplefreq=100 nchains=4 temp=0.2 savebrlens=yes;
  sumt burnin=1250 contype = halfcompat;
  log stop;
  end;

  Template for protein sequences:

  begin mrbayes;
  log start replace;
  set autoclose = no nowarn=yes;
  lset rates = invgamma;
  prset aamodelpr = mixed;
  mcmc ngen=100000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes;
  sumt burnin=250 contype = halfcompat;
  log stop;
  end;

- Put the input file (e.g. dataBI.nex) in the same folder as MrBayes

- Run MrBayes and type execute dataBI.nex

- Many output files are produced. The consensus tree file is the .con file

A great Wiki manual for MrBayes can be found here: http://mrbayes.csit.fsu.edu/wiki/index.php/Manual

You can inspect your output parameter files with Tracer, to ensure that the different runs have converged correctly, and that the sampled trees are not autocorrelated (ESS > 100). To do so, load the .p files by clicking on “+” and trace the statistics you want to observe. You can select two runs at the same time to compare them.
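The burnin values in the MrBayes templates above can be related to the chain length with a little arithmetic. The sketch below assumes that burnin in sumt counts sampled trees rather than generations, and that the first 25% of the samples are discarded; that fraction is inferred from the template values and is not stated in the templates themselves. The function name is hypothetical.

```python
def burnin_samples(ngen, samplefreq, fraction=0.25):
    """Number of sampled trees to discard as burn-in.

    ngen: number of MCMC generations; samplefreq: one sample every samplefreq generations.
    fraction: share of the samples treated as burn-in (assumed to be 25% here).
    """
    n_samples = ngen // samplefreq  # samples taken after generation 0
    return int(n_samples * fraction)


# Values taken from the templates above.
print(burnin_samples(500000, 100))  # non-coding and coding templates
print(burnin_samples(100000, 100))  # protein template
```

If you change ngen or samplefreq, the burnin value should be recomputed in the same way rather than copied from a template.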


5. Tests of topology - Supertrees

Software: PAUP, Rainbow

Topology comparison:

Several trees must be stored in memory. For example, compute an NJ tree and an MP tree from the same data, and keep both in PAUP memory. To do so, it is important to distinguish trees in PAUP memory from trees in files. Save the tree obtained from the first analysis (e.g. NJ) with an explicit name (e.g. Spar16S_NJtree), then perform the second analysis (e.g. MP), and add the previous tree to the new tree(s) now stored in memory (which you may also save, e.g. Spar16S_MPtree) by typing gettrees file=Spar16S_NJtree mode=7 (to keep trees from both places, see the PAUP command manual). Trees must be in Nexus format.

Kishino-Hasegawa test (PAUP):

- 2 trees must be stored in memory
  o If criterion = parsimony
    pscores 1-2 / khtest; [all trees is the default option]
  o If criterion = likelihood
    lscores 1-2 / khtest;

Shimodaira-Hasegawa test (PAUP):

- 2 or more trees must be stored in memory
  set criterion = likelihood
  lscores 1-2 / shtest; [a KH test is performed at the same time]

Symmetric difference (PAUP):

- 2 or more trees must be stored in memory

- Compute the observed symmetric difference
  treedist metric=symdiff [default option]

- Generate a null distribution of differences between randomly generated trees (e.g. 100)
  generatetrees random ntrees=100 [default option]
  treedist metric=symdiff [default option] fd=yes [default option] showall=no (if you do not want to see all tree distances)

- See where your observed metric falls within the null distribution (be careful: the null hypothesis is incongruence!) to assess its significance
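Once the distances have been read from the PAUP output, the last step can be sketched as below. Because the null hypothesis is incongruence, the observed distance is significant when it is smaller than most distances between random trees, so the lower tail is used. The function name, the "+1" correction (which counts the observed value as part of the distribution) and the null distances are assumptions for illustration; real values must come from your own analysis.

```python
def lower_tail_p(observed, null_distances):
    """Proportion of the null distribution at or below the observed distance.

    The observed value is counted as one member of the distribution (the +1 terms),
    so the smallest attainable p-value is 1 / (n + 1).
    """
    as_small = sum(1 for d in null_distances if d <= observed)
    return (as_small + 1) / (len(null_distances) + 1)


# Invented null distribution of 100 symmetric differences between random trees.
null = [20 + (i % 7) for i in range(100)]

p_congruent = lower_tail_p(8, null)     # observed distance far below the null
p_unremarkable = lower_tail_p(22, null) # observed distance well inside the null
print(p_congruent, p_unremarkable)
```

With 100 random trees the p-value cannot go below 1/101, which is one reason to generate more random trees when a finer resolution is needed.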

SOWH test (PAUP and SeqGen):

- Quite complex… (e.g. see http://people.virginia.edu/~drt3b/protocols/sowhTest.php)

Supertrees:

- Several input trees in Nexus format, overlapping at least partially, can be combined into a single MRP matrix (in Nexus format) with Rainbow. For example, you can make several overlapping trees from your datasets on Sparids by duplicating one of them, deleting taxa in each copy, and building trees with your favorite reconstruction method. All input trees must be written in a single Nexus file, with each tree on a single line in parenthetical format, such as:

  #nexus
  begin trees;
  tree source 1 = ((((P_creatopus,P_carneipes),(P_bulleri,…;
  tree source 2 = ((Diomedea_exulans,O_leucorhoa…;
  tree source 3 = ((P_hypoleuca,(((P_alba,(((Pterodroma …;
  end;

- Open this input file via File ➙ Load

- In the Supertree menu, choose Create MRP matrix…. You can weight input trees differentially, remove trees, or add trees from other files. When you are done, click OK

- An MRP matrix in Nexus format is created and can be analysed via parsimony in PAUP

- PAUP can be called from Rainbow via the Supertree menu
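To show what an MRP matrix contains, here is a minimal Python sketch of the usual matrix-representation coding: each clade of each source tree becomes one binary character, coded 1 for taxa inside the clade, 0 for other taxa of that tree, and ? for taxa absent from that tree. It is not Rainbow's implementation. The trees, the taxon names A to E and the nested-tuple representation are assumptions for the example, and the all-zero outgroup row that is commonly added for rooting is omitted.

```python
def leaves(node):
    """Set of taxon names below a node; a node is a name or a tuple of child nodes."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def clades(tree):
    """Leaf sets of all internal nodes except the root, in pre-order."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def mrp_matrix(trees):
    """Return {taxon: string of 0/1/?}, with one column per clade of each source tree."""
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    matrix[taxon] += "?"   # taxon missing from this source tree
                elif taxon in clade:
                    matrix[taxon] += "1"   # taxon inside the clade
                else:
                    matrix[taxon] += "0"   # taxon in the tree but outside the clade
    return matrix


# Two hypothetical, partially overlapping source trees.
source_1 = (("A", "B"), ("C", "D"))
source_2 = ((("B", "C"), "D"), "E")
mrp = mrp_matrix([source_1, source_2])
for taxon, row in mrp.items():
    print(taxon, row)
```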

6. Cophylogeny

Software: CopyCat, TreeMap 1.0, Jane

Global fit method

- For a distance-based analysis (global fit method), start CopyCat. At startup, the program asks you where to put the results: enter the path to an appropriate folder on your computer

- You need 3 files, with perfectly matching names:
  o An association file in text format, listing on each line a symbiont (parasite, virus) and its associated host, separated by a tab. Generalist symbionts, if any, must be repeated, as must host species with several symbionts:
    OtV5	RCC745
    OtV3	RCC745
    MicCV1	CCMP1545
    MicCV1	RCC299
    …
  o The symbiont tree in Newick format (Phylip): ((((OtV5:0.024,OtV3:0.017):0.004,(OlV158:0.009,BpV132:0.018)…
  o The host tree in the same format

- Go directly to the second tab, Configuration and Execution of Parafit (the first tab is for working on sequences from GenBank)
  o Select the number of permutations (999 is a good number)
  o You can correct principal coordinates for negative values. Patristic distances are not supposed to produce negative values, but you will be able to test various options in different analyses
  o Select the Association file with the corresponding button
  o Choose create distance matrix from host tree (you may already have a distance matrix ready, or compute it from the sequences, but we will use our tree)
  o Do the same thing for the parasite tree
  o Click validate the specified data. Here you may get a message warning you that some hosts have no parasites, which is possible (but not the contrary!), and does not require removing these hosts from the analysis
  o Click start analysis on this machine and wait for the analysis to run. If the analysis does not start, find the ParafitWrapper executable Jar file (most likely in the defaultwdir) and double-click on it

- Go to the third tab to see the results
  o Select the parasite distance matrix produced by Parafit: it should be in the newly created ID_[X] directory (generally in the defaultwdir), and be called [Name].parasites.dist
  o Do the same thing for the host distance file [Name].hosts.dist
  o Select the Parafit output Hostpara.out (be careful to rename successive output files to avoid overwriting them)
  o Click show resolved Parafit results, and interpret the results
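Since the three input files must use perfectly matching names, a quick check before launching CopyCat can save time. The Python sketch below is not part of CopyCat: the function names are hypothetical, the tip-name extraction is a simple regular expression that assumes unquoted labels without blanks, and the two small trees (topologies and most branch lengths) are invented for the example.

```python
import re


def tip_names(newick):
    """Tip labels of a Newick string: labels that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def check_associations(assoc_text, symbiont_tree, host_tree):
    """Return a list of problems found in a tab-separated symbiont/host association file."""
    symbionts = tip_names(symbiont_tree)
    hosts = tip_names(host_tree)
    problems = []
    linked = set()
    for line in assoc_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(f"expected 2 tab-separated names: {line!r}")
            continue
        symbiont, host = fields
        linked.add(symbiont)
        if symbiont not in symbionts:
            problems.append(f"symbiont not in tree: {symbiont}")
        if host not in hosts:
            problems.append(f"host not in tree: {host}")
    # Hosts without symbionts are allowed, but every symbiont needs a host.
    for symbiont in sorted(symbionts - linked):
        problems.append(f"symbiont without host: {symbiont}")
    return problems


# Invented toy trees using names from the association example above.
symbiont_tree = "((OtV5:0.024,OtV3:0.017):0.004,MicCV1:0.03);"
host_tree = "((RCC745:0.01,CCMP1545:0.02):0.01,RCC299:0.02);"
associations = "OtV5\tRCC745\nOtV3\tRCC745\nMicCV1\tCCMP1545\nMicCV1\tRCC299\n"

print(check_associations(associations, symbiont_tree, host_tree))
```

A name that differs only by case (Otv5 instead of OtV5, for instance) is reported twice: once as missing from the tree, and once because the real tip is left without a host.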

Event-based methods

- For an event-based analysis, start TreeMap 1.0. TreeMap 1.0 does not allow event costs to be modified. TreeMap 2.0 can do that, but only works on old non-Intel Macintoshes... Alternatively, you can use TreeMap 3.0, still a beta version, written in Java and intended to work on all platforms, but which currently runs only on Mac OS X. Jane can also be used, but does not perform tests

- TreeMap (and Jane) uses special Nexus input files, which you can create via the New item in the File menu. TreeMap requires rooted, fully resolved trees in Nexus or Phylip format, which you input via the appropriate buttons; then edit the host-parasite associations and click OK. Another way to proceed is to create an input file based on the example files supplied with TreeMap

- TreeMap shows you the Tanglegram (nodes can be rotated), a Reconstruction made by reconciling the two trees without host switches, a Branch Lengths window displaying copaths for cospeciating pairs, and a Histogram window. Menu items are active or not depending on the window at the forefront. The Histogram window is intended to assess whether the cost of your observed reconstruction is significantly higher than the cost computed from random associations, which requires the generation of a null distribution (the histogram)

- Reconstructions, including host switches, can be found via heuristic searches (finding only one reconstruction) or exact searches (very long! For simple cases only) via the Reconstruction menu, with the Reconstruction window at the forefront. TreeMap tries to maximise the number of cospeciations, and many reconstructions with the same maximum number of cospeciations can be produced. Reconstructions can be cleared or stored via the Reconstruction menu

- The significance of the maximum number of cospeciations found by TreeMap can be assessed from the Randomisation menu with the Reconstruction window at the forefront. You can randomise the host tree, the parasite tree, or both, which may give different results! You have to choose the best option depending on your biological question. Use the proportional-to-distinguishable model to generate random trees, and enter an appropriate number of random trees (e.g. 100, 1000). Results are displayed in the Histogram window

- Randomising branch lengths aims at assessing the significance of the correlation between branch lengths. This cannot be done via classical testing because of the tree-like structure of the data. The statistic tested here is the observed correlation, shown in the Branch Lengths window. Only cospeciating pairs are compared, so this depends on the reconstruction. If you have additive trees, copaths are plotted; if the trees are ultrametric, you should plot coalescence times (choose via the View menu), and the Yule item becomes available in the Randomisation ➙ Branch Lengths… menu

- To use Jane, open your Nexus file (File ➙ Open Tree), define event costs via Settings ➙ Set Costs, and the host switch distance via the same menu. Population size and Number of Iterations are parameters of the search algorithm (the more the better, but the longer)

- Once solutions are computed, open them by clicking on the corresponding line
