Amature Bioinformatics — How to Write an Algorithm for Finding the Most Frequent Kmers within DNA in Python 3, Part 1
Often times in genetics it helps to think of the genome as an encoded message, which in fact it is, and like an encoded message when a certain pattern appears very frequently it often indicates some kind of importance within the message. Therefore, finding these messages and determining their level of frequency is important. The trouble is that the size of the genome is enormous, and we need to be smart about the way we look for these sequences.
Ultimatley our goal will be to create a function how takes in a DNA sequence and the length of the pattern we are looking for, call it k, and return the pattern that shows up the most. We will refer to a DNA pattern of length k as a k-mer and will use this interchangeably with pattern. So let’s take a naive approach to this problem for now and discuss why it is an issue.
To start we will need a method to count these patterns so let’s create a simple pattern count function. The goal of this function will be to take in a DNA sequence and a k-mer as an input then return the number of times that pattern occured, including overlapping sequences.
Example: pattern_count(AAA, AA) = 2
Note: This need to include overlapping patterns means that we cannot simply use the .count() method built into python. This method does not account for overlapping k-mers.
The function works by taking a window of length k, where k is the length of the pattern we are trying to find and walking it along seq, where seq is our DNA sequence. Then at each step check if the pattern matches the window and add 1 to the accumulator, count, if it matches.
With this function created let us move on to creating a our most_frequent_kmer function.
This function takes a window of length k and runs the pattern_count function from above onto the entire length of the DNA sequence at each step through the for loop then appends this number to the count array. Then finds the maximum value within the count array and reasigns this to a new variable called max_count which is used as a reference in the second for loop. The second for loop will walk through each number we collected into the count array and check if the number is matches the max_count value we determined. If it matches we will add the corresponding kmer to the set freq_count which will function as our collection of most frequent kmers. A set is used to collect the kmers because sets by their nature cannot have repeat elements, this allows us to avoid adding in a bunch of duplicate elements just to remove them again.
While this funciton completes the task, it is VERY slow!!! Running this function on a seq of just ~1000 base pairs (bp) long and a K=13 can take ~0.15s. This may not seem like a long time, but when we consider that a typical bacteria genome is ~4 Billion bp long this one function can take upwards of ~140 hours!
So why is this so slow? At each step in the for loop we find a new pattern and comparing that pattern to the entire sequence of DNA and the for loop will iterate a number of times equal to the size of the DNA seq creating a lot of redundent calculations. To compound onto this each comparison of window to pattern within the pattern_count function is itself a short loop that checks each character between window and pattern. In summary, this funciton will make a number of steps somewhere on the order of |seq|²*k.
What we really want is to only loop through the DNA sequence once, collect all of the sequence counts into a single bin and find the most frequent pattern after this. Reducing the number of steps to |seq|*k.
This can be accomplished with the use of a lexicograph which we will talk more about in the next article.
- Finding the most repeated kmer within a genome can give us valuable information to their function
- Genomes are too large to approach with a simple algorithm to find the most number of kmers and we really do not want to loop through a sequence more than necessary