卖萌的弱渣

I am stupid, I am hungry.

Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example

Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,

Return:

[“AAAAACCCCC”, “CCCCCAAAAA”].

Solution

  • use hash map to store the appearance times of each 10-letter substring
  • don’t use substring as key, use 20bit number, use 2 bit to distinguish (A,C,G,T) 4 letters
(Repeated-DNA-Sequence.py) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class Solution(object):
    def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        ret = []
        if not s:
            return ret
        # use 2 bits to represent DNA, We just need 20bit to represent a 10-letter DNA
        dna = {'A':0,'C':1,'G':2,'T':3}
        hash_map = dict()

        sum = 0
        for x in range(len(s)):
            sum = ((sum<<2) + dna[s[x]]) & 0xFFFFF # 20bit
            if x< 9:
                continue

            hash_map[sum] = hash_map.get(sum, 0) + 1
            if hash_map[sum] ==2:
                ret.append(s[x-9:x+1])
        return ret