寻找最频繁子树的（解析）树木的集合子树、树木、频繁

2023-09-11 02:20:32 作者：初遇

我有树的节点被标记（但不是唯一）的集合。特别的树是从经过分析的句子（集合看到 http://en.wikipedia.org/wiki/Treebank ）。我希望从集合中提取最常见的子树 - 性能（还）没有问题。我很感激算法（最好的Java）或指针到做到这一点的树库的工具。注意，顺序子节点是很重要的。

I have a collection of trees whose nodes are labelled (but not uniquely). Specifically the trees are from a collection of parsed sentences (see http://en.wikipedia.org/wiki/Treebank). I wish to extract the most common subtrees from the collection - performance is not (yet) an issue. I'd be grateful for algorithms (ideally Java) or pointers to tools which do this for treebanks. Note that order of child nodes is important.

修改 @mjv。我们正在努力在有限的领域（化学），其中有一个程式化的语言，所以树木的varirty不是很大 - 可能类似于儿童读者。简单树猫坐在垫子上。

EDIT @mjv. We are working in a limited domain (chemistry) which has a stylised language so the varirty of the trees is not huge - probably similar to children's readers. Simple tree for "the cat sat on the mat".

<sentence>
  <nounPhrase>
    <article/>
    <noun/>
  </nounPhrase>
  <verbPhrase>
    <verb/>
    <prepositionPhrase>
      <preposition/>
      <nounPhrase>
        <article/>
        <noun/>
      </nounPhrase>
    </prepositionPhrase>
  </verbPhrase>
</sentence>

下面的句子包含词类的两个相同的子树（实际令牌猫，垫不在匹配重要）。因此算法将需要检测这种。请注意，并非所有的nounPhrases是相同的 - 大黑猫可能是：

Here the sentence contains two identical part-of-speech subtrees (the actual tokens "cat". "mat" are not important in matching). So the algorithm would need to detect this. Note that not all nounPhrases are identical - "the big black cat" could be:

      <nounPhrase>
        <article/>
        <adjective/>
        <adjective/>
        <noun/>
      </nounPhrase>

句子的长度将更长 - 15到30节点之间。我希望得到有用的结果，从1000树木。如果不采取超过一天左右，这是可以接受的。

The length of sentences will be longer - between 15 to 30 nodes. I would expect to get useful results from 1000 trees. If this does not take more than a day or so that's acceptable.

显然较短的树越频繁，就会使nounPhrase是很常见的。

Obviously the shorter the tree the more frequent, so nounPhrase will be very common.

修改如果这是被压扁的树来解决那么我认为这将是关系到最长的公共子串，而不是最长公共序列。但请注意，我并不只想最长的 - 我想所有这些足够长的时间是有趣的清单（标准尚未确定）

EDIT If this is to be solved by flattening the tree then I think it would be related to Longest Common Substring, not Longest Common Sequence. But note that I don't necessarily just want the longest - I want a list of all those long enough to be "interesting" (criterion yet to be decided).

寻找最频繁子树的（解析）树木的集合子树、树木、频繁

推荐答案