Periodicity and Structure of DNA Sequences Using Signal Processing Methods


Abstract
It is known that all characteristics of living beings are determined by their gene sequences. Certain sequences of nucleotides in the DNA give definite properties to living organisms. There are few existing methods for comparing genomic sequences by their properties, and almost all of them compare the actual sequences rather than their transforms. We therefore explore the possibility of using signal processing methods in DNA research.

We apply a simple mapping function to transform genomic sequences into numerical signals, so that they can be projected into the time-frequency domain.
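As a minimal sketch of such a mapping, the Python snippet below turns a DNA string into a list of numbers. The particular values assigned to A, C, G and T, and the names `NUCLEOTIDE_VALUES` and `dna_to_signal`, are assumptions made for illustration; the mapping actually used in the project is not stated here.

```python
# Hypothetical mapping, chosen only for illustration: the project summary
# does not say which numerical values it assigns to the nucleotides.
NUCLEOTIDE_VALUES = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def dna_to_signal(sequence):
    """Map a DNA string to a list of floats, one sample per nucleotide."""
    return [NUCLEOTIDE_VALUES[base] for base in sequence.upper()]


if __name__ == "__main__":
    print(dna_to_signal("ACGTTGCA"))
```

Any one-to-one assignment of numbers to nucleotides would serve the same purpose; the choice affects the resulting spectra, so the same mapping has to be used for every sequence that is compared.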

The purpose of the project is to select sequences and find similarities in their structure and properties using Gabor functions.

The problems
Problem 1 – The Search

Two sequences are given: a ‘long’ DNA sequence of length L and a ‘short’ sequence of length p ( p < L ).

The aim is to find out whether the long sequence contains a subsequence of length p that is similar to the short sequence and, if it does, to determine where. We also want to measure the degree of their similarity. A linear search is impractical, since DNA sequences may be very long and the number of possible nucleotide combinations grows exponentially with length (4^p for a subsequence of length p). Moreover, because the sequences may be similar but not identical, checking for exact matches is pointless, which is another shortcoming of the linear search.

Problem 2 – The Comparison

Genome sequences of different animals are processed in order to reveal how closely related they are in light of evolutionary theory. Two DNA sequences of evolutionarily related animals are investigated, using the previously defined method, to point out their genomic similarity.

Problem 3 – Investigation of Periodicity in DNA Sequence

The goal is to reveal and investigate possible periodicities in a given DNA sequence.
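The idea of looking for periodicity in the frequency domain can be sketched in a few lines of Python. The sketch below is a simplified stand-in, not the project's method: it uses a plain discrete Fourier transform over the whole sequence instead of a Gabor transform, and it represents the sequence by four binary indicator signals (one per nucleotide), which is an assumption made here. It relies on the standard relation that a spectral peak at index k of a length-N transform corresponds to a period of N / k samples.

```python
import cmath


def indicator(sequence, base):
    """Binary signal: 1.0 where the sequence has `base`, else 0.0."""
    return [1.0 if s == base else 0.0 for s in sequence]


def dft_power(signal):
    """Power spectrum of the mean-removed signal for k = 0 .. N/2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    return power


def dominant_period(sequence):
    """Period (in nucleotides) of the strongest spectral component."""
    n = len(sequence)
    total = [0.0] * (n // 2 + 1)
    for base in "ACGT":
        for k, p in enumerate(dft_power(indicator(sequence, base))):
            total[k] += p
    # Skip k = 0 (the constant component) when looking for the peak.
    k_peak = max(range(1, len(total)), key=lambda k: total[k])
    return n / k_peak


if __name__ == "__main__":
    # Artificial, perfectly periodic test sequence (not real genomic data).
    print(dominant_period("ATG" * 20))
```

A whole-sequence transform like this reports only one global period; locating where in the sequence a periodicity occurs requires a windowed transform such as the Gabor transform.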

The basic approach

The solution of all the problems mentioned above is largely based on the same principles and consists of six basic stages:
1) Description of character strings using numerical sequences

2) The digital signals are projected into the time-frequency domain using the Gabor transform

3) The obtained matrices are compared. An appropriate error function is chosen and undergoes a special normalization

4) The previously defined error function is applied to the difference of the absolute values of the matrix elements, and the process is repeated in a so-called “sliding window” fashion. The value and location of the minimum error are found, giving an approximate location of the best match

5) Fine-tuning phase. A properly defined simple search is applied in order to pinpoint the exact location

6) Different variations of the algorithm described above are used to solve the defined problems
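The stages above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the project's implementation: the nucleotide mapping, the Gaussian window parameters (`window_len`, `hop`, `sigma`), the particular normalization of the L1 error, and all function names are hypothetical choices made here. The Gabor transform is approximated by a short-time Fourier transform with a Gaussian window.

```python
import cmath
import math

# Hypothetical mapping (stage 1); the project's own mapping is not given.
MAPPING = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def to_signal(seq):
    return [MAPPING[b] for b in seq]


def gabor_magnitudes(signal, window_len=8, hop=4, sigma=2.0):
    """Stage 2: magnitudes of a Gaussian-windowed short-time DFT."""
    centre = (window_len - 1) / 2.0
    window = [math.exp(-0.5 * ((i - centre) / sigma) ** 2)
              for i in range(window_len)]
    rows = []
    for start in range(0, len(signal) - window_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(window_len)]
        rows.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / window_len)
                    for i, x in enumerate(frame)))
            for k in range(window_len // 2 + 1)
        ])
    return rows


def normalized_l1_error(a, b):
    """Stages 3-4: L1 distance between magnitude matrices, scaled to [0, 1].

    Dividing by the total magnitude is an assumed normalization.
    """
    diff = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(x + y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return diff / total if total else 0.0


def best_match(long_seq, short_seq, step=1):
    """Stage 4: slide a window of length p and keep the minimum error."""
    target = gabor_magnitudes(to_signal(short_seq))
    p = len(short_seq)
    best_pos, best_err = -1, float("inf")
    for pos in range(0, len(long_seq) - p + 1, step):
        candidate = gabor_magnitudes(to_signal(long_seq[pos:pos + p]))
        err = normalized_l1_error(candidate, target)
        if err < best_err:
            best_pos, best_err = pos, err
    return best_pos, best_err


if __name__ == "__main__":
    # Made-up test sequence, not real genomic data.
    long_seq = "ACGTTGCAAGCTTAGGCTAACCGTATGCAGTCAGTTGACCATGGACT"
    print(best_match(long_seq, long_seq[12:28]))
```

In this sketch, a coarse pass with a larger `step` followed by a pass with `step=1` around the coarse minimum would play the role of the fine-tuning phase in stage 5.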


Threshold ≈ 0.4

The obtained error value of 0.35, which is below the threshold, suggests that foxes and wolves are likely to be moderately related species. The error value of 0.49, which is above the threshold, leads to the conclusion that foxes and bats are likely to be far less evolutionarily related than foxes and wolves.

Tools
The project was programmed in Matlab 6.5 on a PC platform.

Conclusions
An advanced DSP method of DNA analysis based on Gabor transforms was successfully applied to the chosen test cases. Several ways of constructing a fast Gabor transform were suggested and partially implemented, and the complexity of the Gabor transform in its various forms was estimated. A convenient error measure based on the L1 norm was found, together with an appropriate normalization that yields universal, invariant numbers for use in comparisons. The implementation possibilities were demonstrated with multiple examples.

A set of test problems was proposed and analyzed using the features mentioned above. The DSP method succeeded in solving various DNA search and analysis problems, even when mutations and insertions from foreign organisms were introduced into the genomic sequences.

Moreover, it was confirmed that Gabor matrices, represented graphically as colormaps, allow local periodicities in the genomic sequence to be detected. The known phenomenon of triplet-based periodicity in protein-coding regions was shown both in calculation and visually. A useful formula for the relation between periodicity and frequency was presented and implemented.

Several important biological theories, such as evolutionary relatedness, were tested and found compatible with the results of our experiments. The genomic difference between animals of the same species (Drosophila melanogaster) living in different parts of the world, however small, was detected, suggesting that external variables such as climate have some effect on genomic composition over generations. Experiments with species that have different degrees of similarity between them produced excellent results, supporting the earlier statement that closely related species (fox and wolf in this case) should have more similar genomic sequences than species that other criteria, such as appearance and way of life, judge to be more distantly related (such as fox and bat).

The use of DSP methods clearly has high potential for future scientific development, and various fields of science could benefit from applying them.

Acknowledgment
We would like to thank our supervisor Nagesh Subbanna and our lab engineer Johanan Erez for their support and guidance.

We are also grateful to the Ollendorff Minerva Center for supporting this project.