算法的帮助!快速算法在搜索的字符串与合作伙伴算法、字符串、合作伙伴、快速

2023-09-11 03:30:07 作者:逝水无痕

我要寻找一个巨大的字符串(这是一个亿万由数十亿个字符的生物体的基因组序列)的快速算法搜索的目的。

I am looking for a fast algorithm for search purpose in a huge string (it's a organism genome sequence composed of hundreds of millions to billions of chars).

有只有4个字符{A,C,G,T} present在此字符串,而A只能搭配T,而C对,G。

There are only 4 chars {A,C,G,T} present in this string, and "A" can only pair with "T" while "C" pairs with "G".

现在我正在寻找两个子(用{minLen,MAXLEN}之间的两个子串{intervalMinLen,intervalMaxLen}之间的长度的限制,和区间长度)可以对彼此antiparallely。

Now I am searching for two substrings (with length constraint of both substring between {minLen, maxLen}, and interval length between {intervalMinLen, intervalMaxLen}) that can pair with one another antiparallely.

例如, 该字符串是:ATCAG GACCA TACGC CTGAT

For example, The string is: ATCAG GACCA TACGC CTGAT

约束:minLen = 4,MAXLEN = 5,intervalMinLen = 9,intervalMaxLen = 10

Constraints: minLen = 4, maxLen = 5, intervalMinLen = 9, intervalMaxLen = 10

结果应该是

ATCAG配对CTGAT

"ATCAG" pair with "CTGAT"

TCAG配对CTGA

在此先感谢。

更新:我已经有方法来确定两个字符串是否可以对彼此。唯一担心的是做穷举搜索是非常耗时。

Update: I already have the method to determine whether two string can pair with one another. The only concern is doing exhaustive search is very time consuming.

推荐答案

我认为这是一个有趣的问题,所以我放在一起的基础上考虑'foldings'程序,它会扫描向外可能的对称匹配来自不同的折点。如果N个核苷酸的数量,M是'maxInterval-minInterval',你应该有运行时间为O(N * M)。我可能已经错过了一些边界情况,所以使用code小心,但它的工作所提供的例子。请注意,我用的填充中间缓冲区来存储的基因组,因为这会减少在内部循环所需的边界情况的比较次数;这种权衡额外的内存分配更好的速度。随意如果你做任何修正或改进编辑的帖子。

I thought this was an interesting problem, so I put together a program based on considering 'foldings', which scans outward for possible symmetrical matches from different 'fold points'. If N is the number of nucleotides and M is 'maxInterval-minInterval', you should have running time O(N*M). I may have missed some boundary cases, so use the code with care, but it does work for the example provided. Note that I've used a padded intermediate buffer to store the genome, as this reduces the number of comparisons for boundary cases required in the inner loops; this trades off additional memory allocation for better speed. Feel free to edit the post if you make any corrections or improvements.

class Program
{
    public sealed class Pairing
    {
        public int Index { get; private set; }

        public int Length { get; private set; }

        public int Offset { get; private set; }

        public Pairing(int index, int length, int offset)
        {
            Index = index;
            Length = length;
            Offset = offset;
        }
    }

    public static IEnumerable<Pairing> FindPairings(string genome, int minLen, int maxLen, int intervalMinLen, int intervalMaxLen)
    {
        int n = genome.Length;
        var padding = new string((char)0, maxLen);
        var padded = string.Concat(padding, genome, padding);

        int start = (intervalMinLen + minLen)/2 + maxLen;
        int end = n - (intervalMinLen + minLen)/2 + maxLen;

        //Consider 'fold locations' along the genome
        for (int i=start; i<end; i++)
        {
            //Consider 'odd' folding (centered on index) about index i
            int k = (intervalMinLen+2)/2;
            int maxK = (intervalMaxLen + 2)/2;
            while (k<=maxK)
            {
                int matchLength = 0;
                while (IsPaired(padded[i - k], padded[i + k]) && (k <= (maxK+maxLen)))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i-k - maxLen, matchLength, 2*k - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }

            //Consider 'even' folding (centered before index) about index i
            k = (intervalMinLen+1)/2;
            while (k <= maxK)
            {
                int matchLength = 0;
                while (IsPaired(padded[i - (k+1)], padded[i + k]) && (k<=maxK+maxLen))
                {
                    matchLength++;

                    if (matchLength >= minLen && matchLength <= maxLen)
                    {
                        yield return new Pairing(i - (k+1) - maxLen, matchLength, 2*k + 1 - (matchLength-1));
                    }
                    k++;
                }
                k++;
            }
        }
    }

    private const int SumAT = 'A' + 'T';
    private const int SumGC = 'G' + 'C';
    private static bool IsPaired(char a, char b)
    {
        return (a + b) == SumAT || (a + b) == SumGC;
    }


    static void Main(string[] args)
    {
        string genome = "ATCAGGACCATACGCCTGAT";
        foreach (var pairing in FindPairings(genome, 4, 5, 9, 10))
        {
            Console.WriteLine("'{0}' pair with '{1}'",
                              genome.Substring(pairing.Index, pairing.Length),
                              genome.Substring(pairing.Index + pairing.Offset, pairing.Length));
        }
        Console.ReadKey();
    }


}