Last update on 25. Nov 2018.

    Column statistic

    OCCURRENCE

    ARB_NT/SAI/Create SAI from Sequences/Positional Variability ...

     

    DESCRIPTION

    Calculates the base frequencies and the positional variability for each column independently.

    It uses the parsimony method to find the minimum number of mutations for each site, as implied by the specified topology.
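
    The per-site mutation count can be illustrated with Fitch's small-parsimony algorithm, one standard way to obtain the minimum number of mutations on a fixed topology. This is only a sketch: whether ARB uses exactly this procedure is an assumption, and the tree representation (nested 2-tuples of leaf names) and the function name are hypothetical.

```python
def fitch_min_mutations(tree, column):
    """Return the minimum number of mutations for one alignment column.

    tree   -- binary tree as nested 2-tuples of leaf names (hypothetical format)
    column -- dict mapping each leaf name to its character in this column
    """
    mutations = 0

    def state_set(node):
        nonlocal mutations
        if isinstance(node, str):        # leaf: only its observed character
            return {column[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                       # children can agree: no mutation needed
            return common
        mutations += 1                   # children disagree: count one mutation
        return left | right

    state_set(tree)
    return mutations


tree = (("a", "b"), ("c", "d"))
print(fitch_min_mutations(tree, {"a": "A", "b": "A", "c": "G", "d": "G"}))
```

    The recursion is fine for small examples; a tree with many thousands of species would need an iterative traversal.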

    The calculation is performed for the sequences of all species in the tree. For best results, use one of the largest trees available. The tree should have been optimized using ARB_PARSIMONY.

    The result can be used by:

    • Parsimony to weight the characters properly
    • Neighbour joining to estimate the distances more accurately.
    • Filter (read notes below)

    Resulting SAI will contain the following character codes:

    '.'                          Less than 10% valid characters
    '-'                          No mutations.
    '0123456789ABCDE...'         Mutation rate category
    The higher the digit or letter of the mutation rate category, the more conserved the site is. Moving two positions to the right in the list of characters approximately halves the mutation rate (see the explicit mapping below).

    Valid characters are "ACGTUacgtu" for DNA/RNA (or all amino acid codes for AA sequences).
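
    The '.' code depends on the fraction of valid characters in a column. A minimal sketch of that check for DNA/RNA follows. The function names are hypothetical, and the assumption that every sequence in the tree counts towards the total is not stated in this help page.

```python
VALID_NUCLEOTIDES = set("ACGTUacgtu")   # valid characters for DNA/RNA


def valid_fraction(column_chars, valid=VALID_NUCLEOTIDES):
    """Fraction of characters in one column that are valid."""
    chars = list(column_chars)
    if not chars:
        return 0.0
    return sum(c in valid for c in chars) / len(chars)


def is_undersampled(column_chars, threshold=0.10):
    """True if the column would get the '.' code (less than 10% valid)."""
    return valid_fraction(column_chars) < threshold


print(valid_fraction("ACGT----.."))
print(is_undersampled("A" + "-" * 19))
```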

     

    NOTES

    In contrast to consensus and max-frequency SAIs, the positional variability SAI is calculated based on the specified topology.

    Later, that PVP-SAI might be used as a filter to further optimize that topology. When you filter out columns with high variability, topology changes that imply an increased number of mutations in these columns receive no penalty.

    Repeating these two steps over several iterations might lead to a systematic error:

    • variable columns will tend to become even more variable and
    • conserved columns will tend to become even more conserved.

    The systematic error caused by this effect will probably mostly amplify topological errors of the initial tree. To avoid that problem, the tree should also be optimized using other filters (e.g. max-frequency). This is especially true for the initial tree optimization.
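
    The filtering step discussed in these notes can be sketched as a column mask derived from the SAI string. This is an illustration, not ARB's filter implementation: the function names and the default cut-off are hypothetical, and keeping '-' columns while dropping '.' columns is an assumption.

```python
import string

CATEGORIES = string.digits + string.ascii_uppercase   # '0' (variable) .. 'Z' (conserved)


def variability_filter(sai, min_category="3"):
    """Return one boolean per column: True means the column is kept."""
    threshold = CATEGORIES.index(min_category)
    keep = []
    for code in sai:
        if code == "-":
            keep.append(True)            # no mutations: assumed to be kept
        elif code in CATEGORIES:
            keep.append(CATEGORIES.index(code) >= threshold)
        else:
            keep.append(False)           # '.' or unknown code: assumed dropped
    return keep


def apply_filter(sequence, keep):
    """Remove the filtered-out columns from an aligned sequence."""
    return "".join(c for c, k in zip(sequence, keep) if k)


mask = variability_filter("0.-3Z", min_category="3")
print(apply_filter("ACGTU", mask))
```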

     

    WARNINGS

    If you only have small trees (<100 species), using this function makes little sense.

     

    Mapping of site mutation rate to categories:

    mutation rate      category
    45.8% .. 75%          0     (max. possible mutation rate is ~75%)
    36.5% .. 45.8%        1
    28.2% .. 36.5%        2
    21.3% .. 28.2%        3
    15.7% .. 21.3%        4
    11.5% .. 15.7%        5
     8.3% .. 11.5%        6
     6.0% ..  8.3%        7
     4.3% ..  6.0%        8
     3.1% ..  4.3%        9
     2.2% ..  3.1%        A
     1.5% ..  2.2%        B
     1.1% ..  1.5%        C
     0.78% .. 1.1%        D
     0.55% .. 0.78%       E
     0.39% .. 0.55%       F
     0.28% .. 0.39%       G
     0.20% .. 0.28%       H
     0.14% .. 0.20%       I
    mutations/million  category
    976 .. 1400          J
    691 ..  975          K
    489 ..  690          L
    346 ..  488          M
    245 ..  345          N
    173 ..  244          O
    123 ..  172          P
     87 ..  122          Q
     62 ..   86          R
     44 ..   61          S
     31 ..   43          T
     22 ..   30          U
     16 ..   21          V
     11 ..   15          W
      8 ..   10          X
      6 ..    7          Y
      1 ..    5          Z
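
    The mapping above can be expressed as a lookup over the lower bound of each category, with percentages converted to mutations per million. This is a sketch with a hypothetical function name. The handling of values that fall exactly on a boundary, above 75%, or between 0 and 1 per million is an assumption, since the table does not specify it.

```python
import string

CATEGORIES = string.digits + string.ascii_uppercase   # '0' .. '9', 'A' .. 'Z'

# Lower bound of each category in mutations per million, taken from the table.
LOWER_BOUNDS_PER_MILLION = [
    458000, 365000, 282000, 213000, 157000,            # 0-4
    115000, 83000, 60000, 43000, 31000,                # 5-9
    22000, 15000, 11000, 7800, 5500, 3900, 2800, 2000, 1400,                  # A-I
    976, 691, 489, 346, 245, 173, 123, 87, 62, 44, 31, 22, 16, 11, 8, 6, 1,   # J-Z
]


def rate_category(mutations_per_million):
    """Map a site mutation rate to its category character."""
    if mutations_per_million <= 0:
        return "-"                       # no mutations
    for category, lower in zip(CATEGORIES, LOWER_BOUNDS_PER_MILLION):
        if mutations_per_million >= lower:
            return category
    return "Z"                           # below 1 per million: assumed 'Z'


print(rate_category(50000))   # a rate of 5%
print(rate_category(1000))
```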
     

    BUGS

    No bugs known