In a typical RNA, the base frequencies are not equally distributed. Especially among the archaea we find extremely G+C-rich sequences. This has led to a number of new rate corrections, algorithms and programs which:
- calculate the average G+C content of all/two sequences
- correct the distance.
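As a minimal sketch of the first step only (this is not ARB's own code), the average G+C content of one, two or all sequences could be computed as below. The function name gc_content and the decision to skip gaps and ambiguity codes are assumptions made for this illustration. The distance correction itself is not shown.

```python
def gc_content(*sequences):
    """Return the G+C fraction over the unambiguous bases of all given sequences."""
    gc = 0
    total = 0
    for seq in sequences:
        for base in seq.upper():
            if base in "GC":
                gc += 1
                total += 1
            elif base in "AUT":
                total += 1
            # Gaps ('-', '.') and ambiguity codes are skipped (assumption).
    # An input without any unambiguous base yields 0.0 instead of an error.
    return gc / total if total else 0.0


pair_gc = gc_content("GGCCGA", "AUGCUA")  # average over two sequences
```

Passing every sequence of the alignment gives the overall average; passing two gives the pairwise average.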
Further research showed, however, that the G+C frequencies are not equally distributed within a sequence either. Helical parts in particular have a significantly higher G+C content than non-helical parts. One straightforward algorithm would calculate the frequencies independently for each column. For small datasets, though, the resulting frequencies would look like random data, because too few examples are analyzed.
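The column-by-column approach can be sketched as follows. This is an illustration under stated assumptions, not ARB's implementation: the name column_gc_frequencies is hypothetical, the alignment is a list of equally long strings, and gaps and ambiguity codes are ignored.

```python
def column_gc_frequencies(alignment):
    """Return the G+C fraction of every alignment column, each estimated on its own."""
    freqs = []
    for col in range(len(alignment[0])):
        gc = 0
        total = 0
        for seq in alignment:
            base = seq[col].upper()
            if base in "GC":
                gc += 1
                total += 1
            elif base in "AUT":
                total += 1
        # A column without any unambiguous base has no estimate at all.
        freqs.append(gc / total if total else None)
    return freqs


# Toy alignment of four sequences (made-up data, for illustration only).
toy_alignment = ["GCAU", "GCAU", "GAGU", "-CCU"]
toy_freqs = column_gc_frequencies(toy_alignment)
```

With only four sequences each column estimate can take just a handful of values, which is why such estimates look noisy on small datasets.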
In ARB we implemented a combination of the two approaches. Let's say we want to estimate a parameter 'P' with a maximum variance 'maxvar', so we need a minimum number of samples 'minsap'.
You can give your favorite method a higher weight by controlling the smoothing parameter:
Less smoothing -> independent parameter estimates
More smoothing -> clustered parameter estimates
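One way to picture the effect of the smoothing parameter is a simple shrinkage estimate that blends each column's own frequency with a pooled frequency. The sketch below is an assumption made for illustration and is not ARB's actual formula: the function smoothed_estimates and the way 'smoothing' acts as a number of pseudo-samples are hypothetical, and pooling over all columns stands in for whatever clustering ARB uses.

```python
def smoothed_estimates(counts, totals, smoothing):
    """Blend per-column frequencies with the pooled frequency.

    counts[i] / totals[i] is the independent estimate for column i.
    'smoothing' acts like a number of pseudo-samples drawn from the pool:
    0 keeps the independent estimates, large values approach the pooled one.
    """
    all_samples = sum(totals)
    pooled = sum(counts) / all_samples if all_samples else 0.0
    result = []
    for count, total in zip(counts, totals):
        weight = total + smoothing
        if weight == 0:
            # No data and no smoothing: fall back to the pooled value.
            result.append(pooled)
        else:
            result.append((count + smoothing * pooled) / weight)
    return result


# Made-up G+C counts and sample sizes for four columns.
independent = smoothed_estimates([3, 3, 2, 0], [3, 4, 4, 4], smoothing=0)
clustered = smoothed_estimates([3, 3, 2, 0], [3, 4, 4, 4], smoothing=1e9)
```

In this sketch, columns with few samples are pulled toward the pooled value more strongly than columns with many samples.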
To get a good tree, we recommend that you try all selections.