PHYML User's guide (PHYLIP-like interface)
Overview
PHYML is a program that implements a new method for building phylogenies from DNA and protein sequences using maximum likelihood. Data sets can be analysed under several models of evolution (JC69, K80, F81, F84, HKY85, TN93 and GTR for nucleotides; Dayhoff, JTT, mtREV, WAG, DCMut, RtREV, CpREV, VT, Blosum62 and MtMam for amino acids). A discrete-gamma model (Yang, 1994) is implemented to accommodate rate variation among sites. Invariable sites can also be taken into account. PHYML has been compared with several other programs using extensive simulations. The results indicate that its topological accuracy is at least as high as that of fastDNAml, and that it is much faster.
The PHYLIP-like interface
Download the binary files. You can run PHYML by double-clicking on the "phyml" file or by opening a shell window and typing "phyml" without parameters. The interactive command-line interface is PHYLIP-like: you change the default value of an option by typing its corresponding character, and you validate your settings by typing 'Y'. PHYML produces several result files:
<sequence file name>_phyml_lk.txt : likelihood value(s)
<sequence file name>_phyml_tree.txt : inferred tree(s)
<sequence file name>_phyml_stat.txt : detailed execution stats
<sequence file name>_phyml_boot_trees.txt : bootstrap trees (special case)
<sequence file name>_phyml_boot_stats.txt : bootstrap statistics (special case)
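The naming scheme above can be sketched in a few lines of Python, for instance to check which result files exist after a run. This is only an illustration: the function name expected_outputs is hypothetical, and it assumes the suffix is appended to the complete sequence file name (extension included) in the same directory.

```python
from pathlib import Path

# Suffixes taken from the list of result files in this guide.
SUFFIXES = {
    "likelihood": "_phyml_lk.txt",
    "tree": "_phyml_tree.txt",
    "stats": "_phyml_stat.txt",
    "boot_trees": "_phyml_boot_trees.txt",
    "boot_stats": "_phyml_boot_stats.txt",
}

def expected_outputs(seq_file):
    """Map each kind of result to the file name PHYML is expected to write.

    Assumption: the suffix is appended to the full input file name.
    """
    return {kind: Path(str(seq_file) + suffix) for kind, suffix in SUFFIXES.items()}

if __name__ == "__main__":
    for kind, path in expected_outputs("seqs.phy").items():
        print(f"{kind:12s} {path}  exists={path.exists()}")
```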
Here are the possible uses of PHYML:
One data set, one starting tree
Standard analysis under a given substitution model; PHYML returns
the inferred tree. In addition, a special option performs a
non-parametric bootstrap analysis on the original data set. PHYML then
returns the bootstrap tree with branch lengths and bootstrap values,
using standard NEWICK format (an option gives the pseudo trees in a
*_boot_trees.txt file).
Several data sets, one starting tree
Several standard analyses start from the same initial tree with
different data sets, without the bootstrap option.
The results are given in the order of the data sets.
This can be used to process multiple genes in a supertree approach.
One data set, several starting trees
Several standard analyses of the same data set using different starting
trees, without the bootstrap option.
All results are given in the order of the trees. Moreover, the most
likely tree is provided in the *_best_stat.txt and *_best_tree.txt
files.
This should be used to avoid being trapped in local optima and thus
obtain better trees. Fast parsimony methods can be used to obtain a set
of starting trees.
Several data sets, several starting trees
Several standard runs, where each data set is analysed with the
corresponding starting tree, without the bootstrap option.
The results are given in the order of the data sets.
This can be used to compare the likelihoods of various trees on
different data sets.
Options
Sequences
The input sequence file is a standard PHYLIP file of
aligned DNA or amino-acid sequences. It should look like this in
interleaved format:
5 60
Tax1 CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAG
Tax2 CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGG
Tax3 CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGG
Tax4 TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGG
Tax5 CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGG
GAAATGGTCAATATTACAAGGT
GAAATGGTCAACATTAAAAGAT
GAAATCGTCAATATTAAAAGGT
GAAATGGTCAATCTTAAAAGGT
GAAATGGTCAATATTAAAAGGT
The same data set in sequential format:
5 60
Tax1 CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
Tax2 CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
Tax3 CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
Tax4 TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
Tax5 CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
The first line gives the number of taxa, a space, then the number of characters for each taxon. A species name MUST NOT exceed 50 characters. Blanks within the species name are NOT allowed; however, one or more blanks MUST appear at the end of each species name.
In a sequence, three special characters '.', '-' and '?' may be used: a dot '.' means the same character as in the first sequence, a dash '-' means an alignment gap and a question mark '?' means an undetermined nucleotide. Sites at which one or more sequences contain '-' are NOT excluded from the analysis: gaps are treated as unknown characters (like '?') on the grounds that ''we don't know what would be there if something were there'' (J. Felsenstein, PHYLIP documentation). Finally, standard ambiguity characters are accepted for nucleotides (Table 1) and for amino acids (Table 2).
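The rules above can be illustrated with a small reader for the sequential format. This is a minimal sketch, not PHYML's own parser: the function name parse_sequential is hypothetical, and it assumes that each taxon occupies exactly one line.

```python
def parse_sequential(text, max_name_len=50):
    """Parse one sequential PHYLIP data set into a list of (name, sequence).

    Assumption: every taxon is written on a single line, name first.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(tok) for tok in lines[0].split()[:2])
    records = []
    for line in lines[1:n_taxa + 1]:
        parts = line.split(None, 1)  # the name ends at the first blank
        if len(parts) != 2:
            raise ValueError(f"no blank after species name: {line!r}")
        name, seq = parts[0], "".join(parts[1].split()).upper()
        if len(name) > max_name_len:
            raise ValueError(f"species name too long: {name!r}")
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        if records:
            # A dot repeats the character found in the first sequence.
            first = records[0][1]
            seq = "".join(f if c == "." else c for c, f in zip(seq, first))
        records.append((name, seq))
    if len(records) != n_taxa:
        raise ValueError("fewer sequences than announced on the first line")
    return records

if __name__ == "__main__":
    example = "3 8\nTax1 ACGT-ACG\nTax2 ..C...?.\nTax3 ACGTNACG\n"
    for name, seq in parse_sequential(example):
        print(name, seq)
```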
CAPTION: Table 1 - Nucleotide character coding
Character Nucleotide
A Adenine
G Guanine
C Cytosine
T Thymine
U Uracil
M A or C
R A or G
W A or T
S C or G
Y C or T
K G or T
B C or G or T
D A or G or T
H A or C or T
V A or C or G
N or X or ? unknown
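Table 1 can be written as a lookup from each character to the set of bases it may stand for. The sketch below is illustrative only; it assumes that U is handled like T and that a gap '-' is treated as unknown, as stated in the text above. The helper name compatible is hypothetical.

```python
# Table 1 as a lookup: character -> bases it may represent.
# Assumptions: U is handled like T, and '-' is treated like an unknown.
AMBIGUITY = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT", "?": "ACGT", "-": "ACGT",
}

def compatible(a, b):
    """True if the two characters could denote the same nucleotide."""
    return bool(set(AMBIGUITY[a.upper()]) & set(AMBIGUITY[b.upper()]))

if __name__ == "__main__":
    print(compatible("R", "A"))  # R is A or G
    print(compatible("R", "Y"))  # purine versus pyrimidine
```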
CAPTION: Table 2 - Amino-acid character coding
Character Amino-acid
A Alanine
R Arginine
N or B Asparagine
D Aspartic acid
C Cysteine
Q or Z Glutamine
E Glutamic acid
G Glycine
H Histidine
I Isoleucine
L Leucine
K Lysine
M Methionine
F Phenylalanine
P Proline
S Serine
T Threonine
W Tryptophan
Y Tyrosine
V Valine
X or ? unknown
Data type
This indicates whether the sequence file contains DNA or amino-acid
sequences. The default choice is to analyse DNA sequences.
Sequence format
The input sequences can be either in interleaved (default) or
sequential format, see "Sequences" above.
Number of data sets
Multiple data sets are allowed, e.g. to perform bootstrap analysis
using SEQBOOT (from the PHYLIP package). In this case, the data sets
are given one after the other, in the formats explained above. For
example (with three data sets):
5 60
Tax1 CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
Tax2 CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
Tax3 CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
Tax4 TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
Tax5 CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
5 60
Tax1 CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
Tax2 CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
Tax3 CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
Tax4 TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
Tax5 CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
5 60
Tax1 CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
Tax2 CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
Tax3 CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
Tax4 TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
Tax5 CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
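A file laid out as above can be split back into its individual data sets by reading the taxon count on each header line. The sketch below is only an illustration under a simplifying assumption: sequential format with one line per taxon. The function name split_datasets is hypothetical.

```python
def split_datasets(text):
    """Split concatenated sequential PHYLIP data sets into separate blocks.

    Assumption: sequential format, one line per taxon.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    blocks, i = [], 0
    while i < len(lines):
        n_taxa = int(lines[i].split()[0])  # header line: "<taxa> <characters>"
        block = lines[i:i + n_taxa + 1]
        if len(block) != n_taxa + 1:
            raise ValueError("truncated data set")
        blocks.append("\n".join(block))
        i += n_taxa + 1
    return blocks

if __name__ == "__main__":
    text = "2 4\nA ACGT\nB ACGA\n2 4\nA ACGT\nB ACGA\n"
    print(len(split_datasets(text)), "data sets found")
```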
Perform bootstrap and Number of pseudo data sets
When there is only one data set you can ask PHYML to generate
bootstrapped pseudo data sets from this original data set. PHYML then
returns the bootstrap tree with branch lengths and bootstrap values,
using standard NEWICK format. The "Print pseudo trees" option gives the
pseudo trees in a *_boot_trees.txt file.
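For readers unfamiliar with the non-parametric bootstrap, a pseudo data set is usually built by drawing alignment columns with replacement until the original length is reached. The sketch below shows this generic resampling step; it is not PHYML's own code, and the function name bootstrap_alignment is hypothetical.

```python
import random

def bootstrap_alignment(seqs, rng=None):
    """Build one pseudo data set by resampling columns with replacement.

    seqs is a list of aligned sequences of equal length.
    """
    rng = rng or random.Random()
    n_sites = len(seqs[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every sequence,
    # so each sampled site stays intact across taxa.
    return ["".join(s[c] for c in cols) for s in seqs]

if __name__ == "__main__":
    alignment = ["CCATCTCACG", "CCATCTCACG", "CCATCTCCCG"]
    for row in bootstrap_alignment(alignment, random.Random(1)):
        print(row)
```

Each pseudo data set would then be analysed like the original one, and the bootstrap value of a branch is the proportion of pseudo trees that contain it.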
Substitution model
A nucleotide or amino-acid substitution model. For DNA sequences, the
default choice is HKY85 (Hasegawa et al., 1985). This model is
analogous to K80 (Kimura, 1980), but allows for different base
frequencies. The other models are JC69 (Jukes and Cantor, 1969), K80
(Kimura, 1980), F81 (Felsenstein, 1981), F84 (Felsenstein, 1989), TN93
(Tamura and Nei, 1993) and GTR (e.g., Lanave et al. 1984, Tavaré 1986,
Rodriguez et al. 1990). The rate matrices of these models are given in
Swofford et al. (1996).
It is also possible to specify a custom substitution model, since
time-reversible DNA substitution models are defined by six substitution
rate parameters and four equilibrium frequencies. The substitution rates
are defined by a string of six digits:
digit 1   digit 2   digit 3   digit 4   digit 5   digit 6
A<->C     A<->G     A<->T     C<->G     C<->T     G<->T
Substitution types that carry the same digit share the same rate parameter. 000000 defines a model where the six relative rate parameters are equal: this corresponds to the JC69 model if the equilibrium frequencies are equal (0.25), or the F81 model if they are different. 010010 corresponds to a model where the A<->G and C<->T rates share one parameter that is optimised independently of the other rates: this is the K80 model if base frequencies are equal (0.25), or the HKY85 model if they are different. 010020 is the TN93 model. 012345 is the GTR model. This notation is very concise and makes it possible to define a wide range of models in a single framework.
For amino-acid sequences, the default choice is JTT (Jones, Taylor and Thornton, 1992). The other models are Dayhoff (Dayhoff et al., 1978), mtREV (as implemented in Yang's PAML), WAG (Whelan and Goldman, 2001), DCMut (Kosiol and Goldman, 2005), RtREV (Dimmic et al., 2002), CpREV (Adachi et al., 2000), VT (Muller and Vingron, 2000), Blosum62 (Henikoff and Henikoff, 1992) and MtMam (Cao et al., 1998).
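The six-digit notation can be decoded mechanically, as the following sketch shows. It only covers the named models listed above; the function names rate_classes and model_name are hypothetical, and any code not mentioned in this guide is simply reported as "custom".

```python
# Order of the six substitution types, as given in this guide.
PAIRS = ("A<->C", "A<->G", "A<->T", "C<->G", "C<->T", "G<->T")

# (code, equal base frequencies?) -> model name, limited to the cases above.
NAMED = {
    ("000000", True): "JC69", ("000000", False): "F81",
    ("010010", True): "K80", ("010010", False): "HKY85",
    ("010020", False): "TN93", ("012345", False): "GTR",
}

def rate_classes(code):
    """Group the six substitution types by the digit they share."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a string of six digits")
    classes = {}
    for pair, digit in zip(PAIRS, code):
        classes.setdefault(digit, []).append(pair)
    return classes

def model_name(code, equal_freqs):
    """Name of the model, or 'custom' if this guide does not name it."""
    return NAMED.get((code, equal_freqs), "custom")

if __name__ == "__main__":
    print(rate_classes("010010"))
    print(model_name("010010", equal_freqs=False))
```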
Base frequency estimates
Under most of the nucleotide-based models (all except JC69 and K80), base
frequencies can be estimated from the data (empirical) or adjusted so
as to maximise the likelihood (ML). The latter makes the program slower.
Comparing the results obtained under the two options might be useful
when analysing sequences that correspond to concatenations of several
genes with different nucleotide compositions.
Transition / transversion ratio
With DNA sequences, it is possible to set the transition/transversion
ratio (except for the JC69 and F81 models) or to estimate its value by
maximising the likelihood of the phylogeny. The latter makes the program
slower. The default value is 4.0. The definition of the
transition/transversion ratio is the same as in PAML (Yang, 1994). In
PHYLIP, the ''transition/transversion rate ratio'' is used instead. 4.0
in PHYML roughly corresponds to 2.0 in PHYLIP.
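The correspondence between 4.0 and 2.0 can be checked with the usual relation between a rate ratio kappa and the expected ratio of transitions to transversions. The sketch below assumes a K80 or HKY85-type parameterisation, with the PHYML value playing the role of kappa; PHYLIP's own F84 parameterisation differs in general, so the result should be read as approximate. The function name expected_tstv is hypothetical.

```python
def expected_tstv(kappa, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Expected transition/transversion ratio for a given kappa.

    Assumes an HKY85-type model; freqs are the equilibrium
    frequencies in the order A, C, G, T.
    """
    a, c, g, t = freqs
    purines, pyrimidines = a + g, c + t
    return kappa * (a * g + c * t) / (purines * pyrimidines)

if __name__ == "__main__":
    # With equal base frequencies the expected ratio is kappa / 2.
    print(expected_tstv(4.0))
```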
Proportion of invariable sites
The default is to consider that the data set does not contain
invariable sites (0.0). However, this proportion can be set to any
value in the 0.0-1.0 range. This parameter can also be estimated by
maximising the likelihood of the phylogeny. The latter makes the program
slower.
Number of substitution rate categories
The default is to have all sites evolve at the same rate, hence
a single substitution rate category. A discrete-gamma distribution
can be used to account for variable substitution rates among sites, in
which case the number of categories that defines this distribution is
supplied by the user. The higher this number, the better the
fit to the continuous distribution. With a discrete-gamma distribution,
the default is to use four categories; in this case the likelihood of
the phylogeny at one site is averaged over four conditional likelihoods
corresponding to four rates, and the computation of the likelihood is
four times slower than with a single rate. Fewer than four or more
than eight categories are not recommended. In the first case, the discrete
distribution is a poor approximation of the continuous one. In the
second case, the computational burden becomes high and a higher number
of categories is not likely to enhance the accuracy of phylogeny
estimation.
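The averaging described above amounts to a mean of the conditional likelihoods, one per rate category, with every category given the same weight. A minimal sketch follows; the function name site_log_likelihood is hypothetical and the conditional likelihoods are made-up numbers used only for illustration.

```python
import math

def site_log_likelihood(conditional_likelihoods):
    """Log-likelihood at one site under equally probable rate categories.

    conditional_likelihoods holds the likelihood of the site computed
    separately for each rate category.
    """
    k = len(conditional_likelihoods)
    return math.log(sum(conditional_likelihoods) / k)

if __name__ == "__main__":
    # Four categories, hence four likelihood evaluations for this site.
    print(site_log_likelihood([0.1, 0.2, 0.3, 0.4]))
```

Since one conditional likelihood is needed per category, the cost grows linearly with the number of categories.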
Gamma distribution parameter
The shape of the gamma distribution is defined by this numerical
parameter. The higher its value, the lower the variation of
substitution rates among sites (this option is used only with more
than one substitution rate category). The default value is 1.0, which
corresponds to moderate variation. Values below about 0.7
correspond to high variation, values between 0.7 and 1.5 to
moderate variation, and higher values to low variation.
This value can be fixed by the user. It can also be estimated by
maximising the likelihood of the phylogeny.
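To see how the shape parameter translates into rates, the sketch below computes the rate of each category for a gamma distribution with mean 1, cut into equally probable categories and represented by the mean of each category (one of the schemes described by Yang, 1994). Whether PHYML uses category means or medians is not stated here, so this choice is an assumption; all function names are hypothetical. Only the standard library is used.

```python
import math

def reg_lower_gamma(a, x, max_terms=1000):
    """Regularised lower incomplete gamma P(a, x), from its power series."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0 / a
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < total * 1e-16:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_quantile(a, p):
    """Value x such that P(a, x) = p, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(a, hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(a, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k):
    """Mean rate of each of k equally probable categories (overall mean 1)."""
    # Work with y = alpha * rate, which follows a Gamma(alpha, 1) law.
    cuts = [0.0] + [gamma_quantile(alpha, i / k) for i in range(1, k)]
    masses = [reg_lower_gamma(alpha + 1.0, c) for c in cuts] + [1.0]
    return [k * (hi - lo) for lo, hi in zip(masses, masses[1:])]

if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, [round(r, 4) for r in discrete_gamma_rates(alpha, 4)])
```

Small shape values give widely spread rates, large values give rates close to 1, in line with the description above.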
Starting tree(s)
Used as the starting tree(s) to be refined by the maximum likelihood
algorithm. The default is to use a BIONJ distance-based tree. It is
also possible to supply one or several trees, one per line in the
file, written in the standard parenthesis representation (NEWICK
format); the branch lengths must be given, and the tree(s) must be
unrooted. Labels on branches (such as bootstrap proportions) are
supported. For example, a tree with four taxa named A, B, C, and D,
with a bootstrap value equal to 90 on its internal branch, should
look like this:
(A:0.02,B:0.004,(C:0.1,D:0.04)90:0.05);
If you supply several trees and analyse several data sets, the number
of trees must match the number of data sets.
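A quick sanity check on a starting tree is to count the subtrees attached to the outermost parentheses. By a common convention (an assumption here, not something PHYML is documented to enforce in this guide), an unrooted tree is written with three or more top-level subtrees and a rooted one with two. The function names below are hypothetical, and quoted labels containing parentheses or commas are not handled.

```python
def top_level_subtrees(newick):
    """Count the subtrees attached directly to the outermost parentheses."""
    depth, count = 0, 0
    for ch in newick.strip():
        if ch == "(":
            depth += 1
            if depth == 1:
                count = 1  # the first subtree starts here
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            count += 1  # a comma at depth 1 separates two subtrees
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return count

def looks_unrooted(newick):
    """True if the tree has at least three top-level subtrees."""
    return top_level_subtrees(newick) >= 3

if __name__ == "__main__":
    print(looks_unrooted("(A:0.02,B:0.004,(C:0.1,D:0.04)90:0.05);"))
```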
Optimise starting tree(s) options
You can optimise the starting tree(s) in three ways:
- You can optimise the topology, the branch lengths and rate parameters
(transition/transversion ratio, proportion of invariable sites, gamma
distribution parameter).
- You can keep the topology and optimise the branch lengths and rate
parameters (it is not possible to optimise the tree topology and keep
the branch lengths).
- You can ask for no optimisation; PHYML then just returns the
likelihood of the starting tree(s).
References
Z. Yang (1994) J. Mol. Evol. 39, 306-14.
S. Ota & W.-H. Li (2001) Mol. Biol. Evol. 18, 1983-1992.
N. Saitou & M. Nei (1987) Mol. Biol. Evol. 4(4), 406-425.
W. Bruno, N. D. Socci, & A. L. Halpern (2000) Mol. Biol.
Evol. 17, 189-197.
J. Felsenstein (1989) Cladistics 5, 164-166.
G. J. Olsen, H. Matsuda, R. Hagstrom, & R. Overbeek (1994)
CABIOS 10, 41-48.
N. Goldman (1993) J. Mol. Evol. 36, 182-198.
M. Kimura (1980) J. Mol. Evol. 16, 111-120.
T. H. Jukes & C. R. Cantor (1969) in Mammalian Protein Metabolism,
ed. H. N. Munro. (Academic Press, New York) Vol. III, pp. 21-132.
M. Hasegawa, H. Kishino, & T. Yano (1985) J. Mol. Evol. 22, 160-174.
J. Felsenstein (1981) J. Mol. Evol. 17, 368-376.
David L. Swofford, Gary J. Olsen, Peter J. Waddell, & David M. Hillis
(1996) in Molecular Systematics, eds. David M. Hillis, Craig Moritz, &
Barbara K. Mable. (Sinauer Associates, Inc., Sunderland, Massachusetts,
USA).
K. Tamura & M. Nei (1993) Mol. Biol. Evol. 10, 512-526.
C. Lanave, G. Preparata, C. Saccone, & G. Serio (1984) A new method
for calculating evolutionary substitution rates. J. Mol. Evol.
20, 86-93.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. (1978). A model of
evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of
Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical
Research Foundation, Washington DC, pp. 345-352.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid
generation of mutation data matrices from protein sequences. CABIOS 8:
275-282.
S. Whelan & N. Goldman (2001) A general empirical model of
protein evolution derived from multiple protein families using a
maximum-likelihood approach. Mol. Biol. Evol. 18, 691-699.
Dimmic, M. W., J. S. Rest, D. P. Mindell, and R. A. Goldstein. 2002.
rtREV: An amino acid substitution matrix for inference of retrovirus
and reverse transcriptase phylogeny. Journal of Molecular Evolution 55:
65-73.
Adachi, J., P. Waddell, W. Martin, and M. Hasegawa. 2000. Plastid
genome phylogeny and a model of amino acid substitution for proteins
encoded by chloroplast DNA. Journal of Molecular Evolution 50:348-358.
Muller, T., and M. Vingron. 2000. Modeling amino acid replacement.
Journal of Computational Biology 7:761-776.
Henikoff, S., and J. G. Henikoff. 1992. Amino acid substitution
matrices from protein blocks. Proc. Natl. Acad. Sci., U.S.A.
89:10915-10919.
Cao, Y. et al. 1998. Conflict amongst individual mitochondrial
proteins in resolving the phylogeny of eutherian orders. Journal of
Molecular Evolution 15:1600-1611.