Main topics:

Graph Aligner (SINA)

OCCURRENCE
DESCRIPTION
SINA documentation online
SINA version
OPTIONS
Verbosity
TRICKS
ADVANCED OPTIONS
NOTES
WARNINGS
BUGS

OCCURRENCE

ARB Editor -> Edit -> Prototypical Graph Aligner

DESCRIPTION

SINA is an alternative to the integrated aligners. It has been developed for the SILVA project.

Unlike the integrated aligners, SINA

uses aligned sequences from the reference database as reference to align the selected sequences.
employs full dynamic programming to create the alignment.
considers all selected relatives at once, instead of falling back to less similar sequences only if the current sequence is missing bases (e.g. because it is a partial sequence).

SINA documentation online

https://sina.readthedocs.io/_/downloads/en/stable/pdf/

Note: parts of this document were taken from the above SINA documentation.

SINA version

Please note this documentation applies to the modified SINA version 1.7.2 which (optionally) is delivered with arb.

Modifications applied:

fix build with arb7 + gcc 7.5
define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2)
fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character)
new option '--dont-expect-start' (allows working with databases not containing 'start'-fields)

ARB still supports SINA version 1.3. If there is no option to select the reference database, you are either using version 1.3 or you are trying to use a (newer) SINA version which does not yet support the "ARB7.1" CLI interface.

In case you are using version 1.3, you may want to use the old, corresponding version of this helppage, which may be accessed via http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781

OPTIONS

Select the sequences to be aligned as usual ("Current Species", "Selected Species" or "Marked Species").

Select a PT-Server or SINA kmer-search:

You may select a PT-Server that will be used for reference search. Make sure it is up to date and contains all sequences you want to be considered as reference.

Alternatively select '-undefined-' to use the sina-internal engine for reference search. SINA will maintain an index file (.sidx) that will be stored next to the reference database file. It will automatically get updated after the reference database has changed.

Select a reference database:

The sequences from the reference database will be used as references when aligning the sequences in the current database.

Normally you will want to use the current database as reference database. This can be done in two ways:

select 'Last saved' to use the last saved state of the current database.
select 'Current' to use the loaded database. This is currently not possible with PT-Server.

Alternatively you may specify any other database as reference database via 'Explicit as [selected]'. This could e.g. be the same database used to build the PT-Server.

This allows you, e.g., to use a high quality database containing only typestrains as reference while working on small, specialized databases.

It will also avoid polluting the set of references with the state of your current working dataset. So you don't have to fear that some badly aligned sequence from your working set will be used as a reference.

The effective reference database path will be shown in the input field below.

Important note:

The previously integrated SINA version 1.3 always aligned versus the state of the referenced sequences in the CURRENT DATABASE.

SINA version 1.7 always aligns versus the state of the referenced sequences in the REFERENCE DATABASE (which may, but does not have to be the same!).

When using SINA version 1.7 with the PT-Server, please MAKE SURE you specify the same database used to calculate the PT-Server as reference database.

Follow these steps:

1. save the reference database (e.g. as ref.arb)
2. start arb on ref.arb and calculate a PT-Server
3. go back to your working database, open the sina window, specify the calculated PT-Server AND specify the saved ref.arb as reference database
4. restart at step 1 whenever you need to update your references

If SINA detects any inconsistencies between the PT-Server and the reference database, it will silently try to update the PT-Server database.

Excluded references:

Some sequences will not be used as references:

sequences with fewer than 10 gaps are considered not aligned and will not be used as references.
if "Realign" (see advanced options) is checked, a sequence will never be used as a reference for itself.
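
As a rough illustration of the two exclusion rules above, the following Python sketch filters a list of candidate references. It is not SINA's code: the function name, the (name, sequence) tuple layout and the assumption that both '-' and '.' count as gaps are made up for this example.

```python
GAP_CHARS = "-."  # assumption: both characters denote gaps in aligned data


def usable_references(candidates, query_name, realign=True, min_gaps=10):
    """Return the names of candidates that pass the two exclusion rules.

    candidates: list of (name, aligned_sequence) tuples (hypothetical layout).
    """
    kept = []
    for name, aligned in candidates:
        gaps = sum(aligned.count(c) for c in GAP_CHARS)
        if gaps < min_gaps:
            continue  # fewer than 10 gaps: considered not aligned
        if realign and name == query_name:
            continue  # with "Realign", a sequence is never its own reference
        kept.append(name)
    return kept


refs = [
    ("A", "ACGU" + "-" * 12 + "ACGU"),  # 12 gaps: usable
    ("B", "ACGUACGUACGU"),              # no gaps: considered unaligned
    ("Q", "AC" + "." * 10 + "GU"),      # the query sequence itself
]
print(usable_references(refs, "Q"))                 # realign on
print(usable_references(refs, "Q", realign=False))  # realign off
```

With "Realign" on, only A remains; with it off, the query Q may also serve as its own reference.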

Some options define additional requirements for the chosen reference sequence set (see the options below, esp. ADVANCED OPTIONS).

Decide what to do with possible overhang ("Overhang placement"). If your sequence extends beyond the reference sequences on either side of the alignment, those bases cannot be aligned properly. Three options of handling this situation are supported:

"keep attached"

just leave them dangling, directly attached to the last base that could be aligned properly

"move to edge"

move them out to the very beginning and end of the alignment. This allows you to easily spot sequences with overhang, and decide what to do yourself. Recommended, but only if you check your sequences after alignment!

"remove"

automatically remove these bases.
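
The three placements can be pictured with a small Python sketch. This is only an illustration under simplifying assumptions, not what SINA does internally: the aligned part is given as a row padded with '-' to the full alignment width, the overhang bases are passed separately, and the function name is hypothetical. The overhang is written in lower case here purely to make it visible.

```python
def place_overhang(core, left, right, mode):
    """Place overhang bases around an aligned row.

    core:  aligned row (gaps as '-') spanning the full alignment width
    left:  unaligned bases preceding the first aligned base
    right: unaligned bases following the last aligned base
    """
    if mode == "remove":
        return core  # overhang bases are simply dropped
    first = next(i for i, c in enumerate(core) if c != "-")
    last = max(i for i, c in enumerate(core) if c != "-")
    if len(left) > first or len(right) > len(core) - 1 - last:
        raise ValueError("not enough free columns for the overhang")
    if mode == "keep attached":
        l_start = first - len(left)        # directly before first aligned base
        r_start = last + 1                 # directly after last aligned base
    elif mode == "move to edge":
        l_start = 0                        # very beginning of the alignment
        r_start = len(core) - len(right)   # very end of the alignment
    else:
        raise ValueError("unknown mode: " + mode)
    row = list(core)
    row[l_start:l_start + len(left)] = left
    row[r_start:r_start + len(right)] = right
    return "".join(row)


core = "----ACG-U---"
for mode in ("keep attached", "move to edge", "remove"):
    print(mode.ljust(14), place_overhang(core, "gg", "cc", mode))
```

In the "move to edge" row the overhang is separated from the aligned bases by gaps, which is what makes such sequences easy to spot.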

Choose "Handling of unmappable insertions":

Configures how the alignment width is preserved.

"Shift surrounding bases"

The alignment is executed without constraining insertion sizes. Insertions for which insufficient columns exist between the adjoining aligned bases are force fitted into the alignment using NAST. That is, the minimum number of aligned bases to the left and right of the insertion are moved to accommodate the insertion. This mode will add warnings to the log for each sequence in which aligned bases had to be moved.

"Forbid during DP alignment"

The alignment is executed using a scoring scheme disallowing insertions for which insufficient columns exist in the alignment. This mode causes fewer “misalignments” than the shift mode as it computes the best alignment under the constraint that no columns may be added to the alignment. However, it will not show if the computed alignment suffered from a lack of empty columns.

"Delete bases"

The alignment is executed without constraining insertion sizes. Insertions larger than the number of columns between the adjoining aligned bases are truncated. While this mode yields the most accurate alignment for sequences with large insertions, it should be used with care as it modifies the original sequence.
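
What the three modes have in common is the question whether an insertion fits into the empty columns between its two adjoining aligned bases. The Python sketch below shows only that test and the truncation performed by "Delete bases"; the shifting and the constrained scoring of the other two modes are not reproduced. All names are hypothetical, and cutting the insertion at its end is an assumption of this example.

```python
def free_columns(left_col, right_col):
    """Number of empty alignment columns strictly between two aligned bases."""
    return right_col - left_col - 1


def delete_bases(insertion, left_col, right_col):
    """Truncate an insertion to the available columns.

    Returns (kept_bases, number_of_removed_bases).
    Which end is cut is an assumption made for this sketch.
    """
    room = free_columns(left_col, right_col)
    kept = insertion[:room]
    return kept, len(insertion) - len(kept)


# Five inserted bases, but only three empty columns between columns 100 and 104
print(delete_bases("ACGUA", 100, 104))
# An insertion that fits is left untouched
print(delete_bases("AC", 100, 104))
```

The second value of the result shows why this mode needs care: any non-zero count means bases of the original sequence were lost.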

Choose "Character Case":

Configures which bases should be written using lower case characters.

"Do not modify"

All bases will be written using the case they had in the input data.

"Show unaligned bases as lower case"

Aligned bases will be written in upper case; unaligned bases will be written in lower case. This serves to mark sections of the query sequences that could not be aligned because they were insertions (internal or edge) with respect to any of the reference sequences.

"Uppercase all"

All bases will use upper case characters.
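
The effect of the three settings can be summarized in a few lines of Python. This is an illustrative sketch only; the function name and the boolean mask marking which bases were aligned are assumptions of the example.

```python
def apply_case(bases, aligned_mask, mode):
    """Return the bases with the case rule of the given mode applied.

    aligned_mask: one boolean per base, True if the base was aligned
    (hypothetical representation chosen for this sketch).
    """
    if mode == "Do not modify":
        return bases
    if mode == "Uppercase all":
        return bases.upper()
    if mode == "Show unaligned bases as lower case":
        return "".join(
            b.upper() if is_aligned else b.lower()
            for b, is_aligned in zip(bases, aligned_mask)
        )
    raise ValueError("unknown mode: " + mode)


bases = "ACguACgu"
mask = [False, False, True, True, True, True, True, True]
for mode in ("Do not modify", "Show unaligned bases as lower case", "Uppercase all"):
    print(mode.ljust(36), apply_case(bases, mask, mode))
```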

Define "Family conservation weight": (default 1)

Adjust the weight factor for the frequency at which a node was observed in the reference alignment. Use 0 to disable weighting. This feature prefers the more common placement for bases with inconsistent alignment in the reference database.

Define "Size of full-length sequences": (default 1400)

Set the minimum length a reference sequence is required to have to be considered full length. See also "Minimal number of full length sequences" in ADVANCED OPTIONS below.

Select a "Protection Level" higher than that of the sequences if you want the alignment software to actually modify the bases. Choose a lower protection level to execute a "dry run", not changing anything. Note that sequences with a protection level of zero will always be changed.

Verbosity

All output will be printed to the console that opens when you start sina.

Several options allow you to change the noisiness:

When "Show changed sections" is checked, sina shows differences between the inferred alignment and the original alignment.

That output will be colorized if "color bases" is checked.

Check "Show statistic" to show the distance to original alignment.

TRICKS

If you want to see how the alignment that would be produced by the graph aligner differs from your current alignment, and why the program would act that way, you can set the protection level to "0" and the Logging level to "debug". The output on the console will now include all differing sections of the alignment and the matching parts of the reference sequences.

ADVANCED OPTIONS

Select the "Show advanced options" Button at the top to gain access to the you-may-now-shoot-yourself-in-the-foot-severely dialog window.

Don't be surprised if the graph aligner crashes after you entered silly values here. No sanity check of your options is done.

Pos.Var:

Select a positional variability filter. If possible, use the filter appropriate for the type of sequences you want aligned. Positional variability statistics will be considered when placing the individual bases.

Field used for automatic filter selection:

Selects a database field whose values are used to determine the positional variability filter by majority vote among the selected reference sequences. Since the filters are usually computed at domain level, this approach is usually sufficient to select an appropriate filter. For SILVA databases, the field 'tax_slv' contains appropriate data.

Turn check:

If selected (default), sequences will be automatically reversed and/or complemented if this will likely improve the alignment.

Realign:

If selected, the sequence itself is excluded from the result of the executed PT-Server family search. If deselected, the alignment of an identical sequence found by the PT-Server is copied.

Gap insertion/extension penalties: (default is 5/2)

You can change the penalties associated with opening and extending gaps.

Match/mismatch scores: (default is 2/-1)

Configures the scores given for a match (should be positive) and a mismatch (should be negative).

Family search min/min_score/max: (default 40/0.7/40)

The first value tells the graph aligner how many sequences it should try to always use. The second value determines the minimal identity with the target sequence additional reference sequences should have. The third value selects the maximal number of sequences to be used as a reference.
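
One way to read the interplay of the three values is sketched below in Python. It is an interpretation of the description above, not SINA's implementation; the function name and the (name, score) hit layout are hypothetical.

```python
def select_family(hits, fs_min=40, min_score=0.7, fs_max=40):
    """Pick reference names from search hits given as (name, score) tuples.

    The best fs_min hits are always taken; further hits are added only
    while their score reaches min_score, up to fs_max hits in total.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    chosen = ranked[:fs_min]
    for name, score in ranked[fs_min:fs_max]:
        if score < min_score:
            break  # hits are sorted, so all remaining ones are worse
        chosen.append((name, score))
    return [name for name, _ in chosen]


hits = [("a", 0.95), ("b", 0.90), ("c", 0.80), ("d", 0.60), ("e", 0.50)]
print(select_family(hits, fs_min=2, min_score=0.7, fs_max=4))
print(select_family(hits, fs_min=4, min_score=0.7, fs_max=4))
```

Under this reading, the second value only has an effect when the first value is smaller than the third; with the defaults 40/0.7/40 the best 40 hits are used regardless of their score.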

Minimal number of full length sequences: (default 1)

Set the minimum number of full length (see "Size of full-length sequences" setting above) reference sequences that must be included in the selected reference set. The search will proceed regardless of other settings until this setting has been satisfied. If it cannot be satisfied by any sequence in the reference database, the query sequence will be discarded. This setting exists to ensure that the entire length of the query sequence will be covered in the presence of partial sequences contained within your reference database.

Family search oligo length/mismatches: (default 10/0)

The first value sets the size of k for the reference search (size of kmer). For SSU rRNA sequences, the default of 10 is a good value. For different sequence types, different values may perform better. For 5S, for example, 6 has been shown to be more effective.

The second value allows k-mer matches in the reference database to contain n mismatches. This feature is only supported by the pt-server search engine and requires substantial additional compute time (in particular for n > 1).

Minimal reference sequence length: (default 150)

Set the minimum length reference sequences are required to have. Sequences shorter than this will not be included in the selection.

Note: If you are working with particularly short reference sequences, you will need to lower this setting to allow any reference sequences to be found.

Alignment bounds: (default 0/0)

These values set the beginning and the end of the gene within the reference alignment. See "Number of references required to touch bounds" for more information.

Number of references required to touch bounds: (default: 0)

Similar to "Minimal number of full length sequences", this option requires a total of n sequences to cover each the beginning and the end of the gene within the alignment.

This option is more precise than "Minimal number of full length sequences", but requires that the column numbers for the range in which the full gene is expected be specified via "Alignment bounds" (see above).

Save used references in 'used_rels': (default is off)

Writes the names of the alignment reference sequences into the field 'used_rels'. This option allows using 'Mark by reference' to highlight the reference sequences used to align a given query sequence.

Store highest identity in 'align_ident_slv': (default is off)

Computes the highest similarity the aligned query sequence has with any of the sequences in the alignment reference set. The value is written to the field 'align_ident_slv'.

Disable fast search: (default is to use fast search)

Use all k-mers occurring in the query sequence in the search. By default, only k-mers starting with an A are used for extra performance.
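
The difference between the two settings can be illustrated with a short Python sketch. The function name is hypothetical and k is set to 3 only to keep the example readable.

```python
def query_kmers(seq, k=10, fast=True):
    """Return the set of k-mers of seq that would be used for the search."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if fast:
        # fast search: keep only the k-mers starting with an A
        kmers = {kmer for kmer in kmers if kmer.startswith("A")}
    return kmers


seq = "ACGAUACG"
print(sorted(query_kmers(seq, k=3, fast=True)))
print(sorted(query_kmers(seq, k=3, fast=False)))
```

Fewer k-mers have to be looked up in fast mode, which is where the performance gain comes from.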

Score search results by absolute oligo match count: (default is off)

Use absolute (number of shared k-mers) match scores in the kmer search rather than relative (number of shared k-mers divided by length of reference sequence) match scores.
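
The two scoring variants can rank the same references differently, as the following Python sketch shows. It is a simplified illustration with hypothetical names; counting each distinct k-mer only once is an assumption of this example.

```python
def kmers(seq, k):
    """Set of distinct k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def match_score(query, ref, k, absolute=False):
    """Score a reference against a query by shared k-mers."""
    shared = len(kmers(query, k) & kmers(ref, k))
    if absolute:
        return shared            # number of shared k-mers
    return shared / len(ref)     # divided by length of reference sequence


query = "ACGUACGU"
short_ref = "ACGUAC"                    # partial sequence, 6 bases
long_ref = "ACGUACGU" + "G" * 12        # contains the query, 20 bases

for absolute in (False, True):
    print(
        "absolute" if absolute else "relative",
        match_score(query, short_ref, 4, absolute),
        match_score(query, long_ref, 4, absolute),
    )
```

Here the relative score favors the short reference, while the absolute score favors the long one that shares more k-mers in total.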

Suppress warnings about missing 'start' field: (default is off)

This option suppresses warnings about missing 'start' fields. It allows using sina with databases that do not use the 'start' field without getting flooded with warnings.

SINA command: (default "arb_sina.sh")

If arb has problems finding the sina binary for whatever reason, you may specify an explicit path here. Please note that doing so will stop a fat-tarball-installation from working!

NOTES

SINA automatically decides the number of threads being used.

When recording macros acting on the SINA window, problems with the used macro-IDs may occur ("XSINA" vs. "SINA" prefix), with ARB complaining about unknown macro ids (e.g. something like "Unknown action 'SINA/CURR_PT_SERVER' in macro"). This is caused by the way the sina window toggles between 'normal' and 'advanced' options. To avoid this, toggle between advanced and normal at the start of your macro. If problems persist, manually correct the action prefixes in your macro.

WARNINGS

When using SINA 1.3 you have to make sure that the alignment selected in 'Alignment Administration' is the same alignment as used in the ARB_EDIT4 instance. Starting with version 1.7.2, SINA will always use the same alignment as the editor.