寻找最频繁子树的(解析)树木的集合子树、树木、频繁

2023-09-11 02:20:32 作者:初遇

我有树的节点被标记(但不是唯一)的集合。特别的树是从经过分析的句子(集合看到 http://en.wikipedia.org/wiki/Treebank )。我希望从集合中提取最常见的子树 - 性能(还)没有问题。我很感激​​算法(最好的Java)或指针到做到这一点的树库的工具。注意,顺序子节点是很重要的。

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.

修改 @mjv。我们正在努力在有限的领域(化学),其中有一个程式化的语言,所以树木的varirty不是很大 - 可能类似于儿童读者。简单树猫坐在垫子上。

EDIT @mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".

<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>

下面的句子包含词类的两个相同的子树(实际令牌猫,垫不在匹配重要)。因此算法将需要检测这种。请注意,并非所有的nounPhrases是相同的 - 大黑猫可能是:

Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:

      <nounPhrase>
        <article/>
        <adjective/>
        <adjective/>
        <noun/>
      </nounPhrase>

句子的长度将更长 - 15到30节点之间。我希望得到有用的结果,从1000树木。如果不采取超过一天左右,这是可以接受的。

The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.

显然较短的树越频繁,就会使nounPhrase是很常见的。

Obviously the shorter the tree the more frequent, so nounPhrase will be very common.

修改如果这是被压扁的树来解决那么我认为这将是关系到最长的公共子串,而不是最长公共序列。但请注意,我并​​不只想最长的 - 我想所有这些足够长的时间是有趣的清单(标准尚未确定)

EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).

推荐答案

寻找最频繁子树集合中,创建子树的紧凑的形式,然后遍历每个子树,并使用HashSet的计算它们的出现。 30个节点是太大了一个完美的哈希 - 这是每个节点只有一位,你需要这么多,表明它是否是一个兄弟姐妹或孩子

Finding the most frequent subtrees in the collection, create a compact form of the subtree, then iterate every subtree and use a hashset to count their occurrences. 30 nodes is too big for a perfect hash - it's only about one bit per node, and you need that much to indicate whether it's a sibling or a child.

这问题不是LCS - 最常见的序列是不相关的最长公共子序列。最频繁的子树是发生最多。

That problem isn't LCS - the most common sequence isn't related to the longest common subsequence. The most frequent subtree is that which occurs the most.

这应该是在最坏的情况下,O(NL ^ 2)对于长度为L N树(假设含L-节点的子树的测试平等是O(L))。

It should be at worst case O(N L^2) for N trees of length L (assuming testing equality of a subtree containing L nodes is O(L)).