快速算法独特的集文本两个很长的序列很长、序列、算法、独特

2023-09-11 04:29:37 作者:我非柠檬为何心酸

我需要比较的X和Y染色体的DNA序列,和寻找模式(约50-75个碱基对组成)所特有的Y染色体。请注意,这些序列的部分可以在染色体重复。这需要快速进行(BLAST需要47个天,需要几个小时或更小)。是否有任何特别的算法或方案适合于这种比较?此外,速度是关键在这里。

一个我把这个SO的一个原因是,从特定应用领域以外的人,谁可以提出自己的字符串比较使用在日常使用的算法,可以适用于我们利用获得的视角。所以,不要害羞!

解决方案 在构建序列X中的后缀树秒。 对于序列Y每个首发位置我,寻找Y [i..i + 75]在S.如果不匹配,可以发现在首发位置i(即,如果查找失败J&LT后面的字符串; 75个核苷酸配对),那么你已经找到了长度-J字符串唯一为y。 在最小的这类字符串在所有起始位置我是最短的唯一的字符串(或只是暂停后,你发现任何这样的字符串,如果你不感兴趣的最小化的长度)。

总时间:O(| X | + M | Y |),其中m为最大字符串长度(如:M = 75)

有基于广义后缀树可能是更高效的算法。

I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days, need a few hours or less). Are there any algorithms or programs in particular suited to this kind of comparison? Again, speed is the key here.

One of the reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use, that may apply to our use. So don't be shy!

解决方案 Python利用深度学习进行文本摘要的综合指南 附教程

Build a suffix tree S on sequence X. For each starting position i in sequence Y, look for the string Y[i..i+75] in S. If no match can be found starting at position i (i.e. if lookup fails after j < 75 nucleotides matched) then you have found a length-j string unique to Y. The smallest such string over all starting positions i is the shortest unique string (or just halt after you find any such string if you aren't interested in minimising the length).

Total time: O(|X| + m|Y|) where m is the maximum string length (e.g. m = 75).

There are probably even more efficient algorithms based on generalised suffix trees.