最佳数据结构的纵横字谜搜索字谜、数据结构

2023-09-10 23:55:39 作者:大胆地往前走

我有一个庞大的数据库解决字谜,包括一个字和描述。 我的应用程序允许搜索特定长度上的特定位置的单词和字符(这样做是艰辛的道路......经过的所有文字和检查每个)。 加上说明搜索(如有必要)

I have a large database for solving crossword puzzles, consisting of a word and a description. My application allows searching for words of a specific length and characters on specific positions (this is done the hard way ... go through all words and check each). Plus a search by description (if necessary)

例如找字_ _ A _ _ B(6字母组成的单词,第三个字符A和上次B)

For instance find word _ _ A _ _ B (6 letter word, third character A and last B)

我想指数在这样的方式,搜索将是非常快的话。 我的第一个想法是使用一个平衡的树结构,任何其他建议?

I would like to index the words in such way that the searching would be really fast. My first idea was to use a balanced tree structure, any other suggestion?

推荐答案

好了,我要提出一些奇怪的,但是从 C ++未来我一直在使用加速很长一段时间,我都来看看多指标库。

Okay, I am going to propose something weird, but coming from C++ I have been using Boost for a long time and I have come to see the MultiIndex library.

该库的想法是创建一个集合,但有很多不同的方式进行查询。它可以建模,实际上,一个数据库。

The idea of this library is to create one collection, but have many different ways to query it. It could model, in fact, a database.

所以,让我们把我们的话在一个表中,并把必要的指标到位:

So, let's put our words in a table, and put the necessary indexes in place:

word                     |length|c0|c1|c2| ... |c26|
-------------------------|------|--|--|--| ... |---|
Singapour                |9     |S |i |n | ... |0  |

现在的查询将如下所示:

Now the query will look like:

Select word From table Where length=9 And c2='n' And c8='u';

很容易,不是吗?

Easy enough isn't it ?

有关最大效率,表应在长度分配,并且将索引(每CX列1)应当是本地的分区。

For maximum efficiency, the table should be partitioned on the length, and the indexes (one per cX column) should be local to the partition.

对于你的内存解决方案,具有单位长度一个容器,包含多达指标的长度,每个索引是一个哈希表指向排序列表(更容易合并)

For an in-memory solution you would have one container per length, containing as many indexes as the length, each index being a hash table pointing to a sorted list (easier merge)

下面是一个Python说明:

Here is a python description:

class Dictionary:
  def __init__(self, length):
    self.length = length
    self.words = set([])
    self.indexes = collections.defaultdict(set)

  def add(self, word):
    if len(word) != self.length:
      raise RuntimeException(word + ' is not ' + `self.length` + ' characters long')

    if word in self.words:
      raise RuntimeException(word + ' is already in the dictionary')

    self.words.add(word)

    for i in range(0,length):
      self.indexes[(i,word[i])].add(word)

  def search(self, list):
    """list: list of tuples (position,character)
    """
    def compare(lhs,rhs): return cmp(len(lhs),len(rhs))

    sets = [self.indexes[elem] for elem in list]
    sets.sort(compare)
    return reduce(intersection, sets)

我自愿提供的长度参数,以减少哈希的大小,从而使搜索更好。此外,组由长度排序,使得交叉点的计算是更好:)

I have voluntarily provided the length argument, to minimize the size of the hashes and thus make the search better. Also, the sets are sorted by length so that the computation of the intersection be better :)

继续前进,对其他解决方案进行测试,如果你想:)

Go ahead and test it against other solutions if you wish :)