Sequence alignment of proteins and nucleotides is very important for biologists: it is the first step in many evolutionary and functional studies. These sequences have a very precise function and have mutated over time. Thus, sequences that are similar probably have similar functions, and similarity between two sequences is usually indicative of common ancestry. By comparing homologous characters, we can reconstruct the evolutionary events that led from the common ancestor to the extant sequences.
To compare two or more sequences, we use several sequence alignment strategies. All of these strategies involve identifying the correct locations of the deletions, insertions and substitutions that have occurred in a set of sequences since their divergence from a common ancestor. Two different modes of alignment are used:
a local alignment will align part of a sequence with part of other sequences in an optimal way.
a global alignment will compare each element of a sequence with each element of the other sequences, over their entire length.
Their usage is different: global alignment algorithms are used in comparative and evolutionary studies. However, two genes in different species may be similar over short regions but very different over the remaining parts of the gene, so a global alignment, which tries to align the entire sequences, would not find these homologies, whereas a local alignment would. Local alignment methods have their greatest utility in database searching and retrieval.
Before the computer age, biologists aligned sequences manually. When there are only a few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection, but this method is subjective and does not scale.
A first algorithmic solution to this problem was proposed by Gibbs and McIntyre in 1970: the dot-matrix method. The two sequences are written as the row and column headers of a two-dimensional matrix, and a dot is put in the matrix at each position where the elements of the two sequences are identical. The alignment is then defined by a path from the top-left cell to the bottom-right cell. This path is composed of diagonal steps, either through a dot (a match) or through an empty cell (a mismatch), and of horizontal and vertical steps, which correspond to insertions or deletions. One problem with this method is that the alphabet for nucleotides is small, only four elements, so for two random elements the probability of a match is twenty-five percent, and the matrices contain a lot of noise. Window filtering can be used to remove the noise and keep only the significant runs of matches. Another problem with this method is that it delivers a possible alignment, but it may not be the best one.
With the help of computers, more sophisticated algorithms can be implemented: they are based on scoring matrices and gap penalties. The idea is to find the alignment that most accurately reflects the evolutionary relationships between the sequences. This is taken to be the optimal alignment, in which the numbers of gaps and mismatches are minimized according to certain criteria. Unfortunately, reducing the number of mismatches increases the number of gaps, and vice versa. Precise scoring schemes have therefore been defined that combine a gap penalty and a scoring matrix to reach a compromise and obtain this optimal alignment. The matrices are derived from observation of real sequences: how likely it is that a given amino acid is replaced by another, according to their similarity in shape or chemical structure. The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution, compared with the frequency of point substitutions.
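To make the dot-matrix method, window filtering and the scoring of an alignment more concrete, here is a minimal Python sketch. The function names, the window size, and the flat scores (match = 1, mismatch = -1, gap = -2) are illustrative assumptions and do not come from any published scheme. In particular, the flat match/mismatch score only stands in for a real scoring matrix, and the filter marks only the starting cell of each window that passes the threshold.

```python
def dot_matrix(seq_a, seq_b):
    """Return a matrix with True wherever seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def window_filter(seq_a, seq_b, window=3, min_matches=3):
    """Keep a dot at (i, j) only if the diagonal window starting at (i, j)
    contains at least `min_matches` identical positions.

    `window` and `min_matches` are hypothetical parameters chosen for
    illustration; only the first cell of each qualifying window is marked.
    """
    filtered = [[False] * len(seq_b) for _ in seq_a]
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                filtered[i][j] = True
    return filtered


def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Score two gapped strings of equal length, column by column.

    A '-' in either string counts as a gap. The flat match/mismatch
    scores are an assumption standing in for a full scoring matrix.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    # Print the filtered dot matrix: '*' marks the start of a matching window.
    for row in window_filter("GATTACA", "TTAC", window=3, min_matches=3):
        print("".join("*" if cell else "." for cell in row))
    # Compare a gapped alignment with an ungapped one under the same scheme.
    print(score_alignment("GATTACA", "GA-TACA"))
    print(score_alignment("GATTACA", "GACTACA"))
```

With these assumed scores, the alignment of GATTACA with GA-TACA has six matches and one gap, for a score of 4, while GATTACA against GACTACA has six matches and one mismatch, for a score of 5. Changing the gap penalty relative to the mismatch score shifts which of two candidate alignments is preferred, which is the compromise described above.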
The algorithms that apply these scoring schemes cannot simply try every possible alignment to find the optimal one, because the search space is far too large. For a simple sequence...