Clustering tree-structured data

2023-09-11 03:09:53 Author: 专注卖萌20年


Suppose we are given data in a semi-structured format as a tree. As an example, the tree could be a valid XML document or a valid JSON document. You could also imagine it as a Lisp-like S-expression or a (generalized) algebraic data type in Haskell or OCaml.


We are given a large number of "documents" in this tree structure. Our goal is to cluster the documents that are similar. By clustering, we mean a way to divide the documents into j groups, such that the elements in each group resemble one another.


I am sure there are papers out there describing approaches, but since I am not well versed in the area of AI/clustering/machine learning, I want to ask someone who is what to look for and where to dig.


My current approach is something like this:

I want to convert each document into an n-dimensional vector and run k-means clustering on the vectors. To do this, I recursively walk the document tree and compute a vector for each level. At an internal vertex of the tree, I recurse on all subvertices and then sum their vectors. Each time I recurse, a damping factor is applied, so the farther down the tree I go, the less it matters. The document's final vector is the one computed at the root of the tree. Depending on the data in the leaves, I apply a function that maps the data into a vector.
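A minimal sketch of this recursive vectorization, assuming internal nodes are `(label, children)` pairs and leaves are plain values, and using a toy hash-based leaf embedding (the `leaf_vector` function, the fixed dimension, and the decay value are illustrative assumptions, not part of the original description):

```python
import numpy as np

def leaf_vector(data, dim=8):
    # Hypothetical leaf embedding: hash the leaf data into a fixed-size
    # vector. Any function mapping leaf data into R^dim would do here.
    v = np.zeros(dim)
    v[hash(data) % dim] = 1.0
    return v

def tree_vector(node, decay=0.5, dim=8):
    # node is either a leaf value or a (label, children) pair.
    if not isinstance(node, tuple):
        return leaf_vector(node, dim)
    label, children = node
    v = leaf_vector(label, dim)
    # Sum the child vectors, damped so deeper levels matter less.
    for child in children:
        v += decay * tree_vector(child, decay, dim)
    return v
```

The document vector is then `tree_vector(root)`, and these vectors can be fed directly to k-means. Note the weakness discussed below: two trees sharing only deep substructure get almost no overlap, because the decay suppresses exactly those parts.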


But surely there are better approaches. One weakness of my approach is that it will only similarity-cluster trees whose top structures closely resemble each other. If the similarity is present but occurs farther down the tree, my approach probably won't work very well.


I imagine there are solutions in full-text search as well, but I do want to take advantage of the semi-structure present in the data.


As suggested, one needs to define a distance function between documents. Without such a function, we cannot apply a clustering algorithm.


In fact, the question may really be about that very distance function and examples thereof. I want documents whose elements near the root are the same to cluster close to each other. The farther down the tree we go, the less it matters.


I want to cluster stack traces from programs. These are well-formed tree structures, where the functions close to the root are the inner functions that fail. I need a decent distance function between stack traces that likely occurred because the same event happened in the code.

Recommended answer


Given the nature of your problem (stack traces), I would reduce it to a string matching problem. Representing a stack trace as a tree is a bit of overhead: for each element in the stack trace, you have exactly one parent.


If string matching would indeed be more appropriate for your problem, you can run through your data, map each node onto a hash and create for each 'document' its n-grams.

For example:

Mapping:

Exception A -> 0
Exception B -> 1
Exception C -> 2
Exception D -> 3


Doc A: 0-1-2
Doc B: 1-2-3


2-grams for doc A: X0, 01, 12, 2X


2-grams for doc B: X1, 12, 23, 3X


Using the n-grams, you will be able to cluster similar sequences of events regardless of the root node (in this example, the shared 2-gram 12).
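A sketch of this in Python, with the `X` boundary markers from the example above; the Jaccard similarity at the end is one assumed choice of set similarity, and `1 - jaccard` gives a distance usable with any clustering algorithm:

```python
def ngrams(frames, n=2):
    # Pad with boundary markers 'X' so the first and last frames also
    # appear in context, matching the X0, 01, 12, 2X example above.
    padded = ["X"] + list(frames) + ["X"]
    return {tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of two n-gram sets: |intersection| / |union|.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For Doc A (0-1-2) and Doc B (1-2-3), the only shared 2-gram is (1, 2), out of seven distinct 2-grams total, so their similarity is 1/7 regardless of the differing root events.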


However, if you are still convinced that you need trees instead of strings, you must consider the following: finding similarities between trees is a lot more complex. You will want to find similar subtrees, with subtrees that are similar over a greater depth yielding a better similarity score. For this purpose, you will need to discover closed subtrees (subtrees that are the base subtrees of the trees that extend them). What you don't want is a feature collection containing subtrees that are very rare, or that are present in every document you are processing (which is what you will get if you do not look for frequent patterns).

Here is something concrete:

http://portal.acm.org/citation.cfm?id=1227182
http://www.springerlink.com/content/yu0bajqnn4cvh9w9/


Once you have your frequent subtrees, you can use them in the same way as you would use the n-grams for clustering.
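A minimal sketch of that last step, assuming each document has already been reduced to a set of pattern identifiers (frequent subtrees or n-grams); the support thresholds are illustrative assumptions for filtering out patterns that are too rare or near-universal, as the answer advises:

```python
from collections import Counter

def frequent_patterns(docs, min_support=0.2, max_support=0.9):
    # docs: list of sets of pattern ids (subtree or n-gram identifiers).
    # Keep patterns that are neither too rare nor present in almost
    # every document.
    counts = Counter(p for doc in docs for p in doc)
    n = len(docs)
    return sorted(p for p, c in counts.items()
                  if min_support <= c / n <= max_support)

def feature_vector(doc, patterns):
    # One binary feature per frequent pattern; the resulting vectors
    # can be fed to k-means or any other clustering algorithm.
    return [1 if p in doc else 0 for p in patterns]
```

With tight thresholds, a pattern appearing in every document contributes nothing to separating clusters and is dropped, while a pattern appearing in only one document is too rare to define a group.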