In the previous article we spoke about some of the performance issues that can occur with the size of the data sets that bioinformatics deals with. This article we will revise our previous algorithm to utilize a Lexicographically ordered list of all the possible Kmers (DNA sequences of length k, where k is an integer).

Recall in the previous algorithm we had slid a window of length k along some genome multiple times to count all the frequency of all the patterns within that genome, causing our algorithm to very taxing for larger data sets. …

Often times in genetics it helps to think of the genome as an encoded message, which in fact it is, and like an encoded message when a certain pattern appears very frequently it often indicates some kind of importance within the message. Therefore, finding these messages and determining their level of frequency is important. The trouble is that the size of the genome is enormous, and we need to be smart about the way we look for these sequences.

Ultimatley our goal will be to create a function how takes in a DNA sequence and the length of the pattern…

alec vis

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store