Help Understanding Cross Validation and Decision Trees



I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially, Cross Validation allows you to alternate between training and testing when your dataset is relatively small, so as to get the best possible estimate of your error. A very simple algorithm goes something like this:

1. Decide on the number of folds you want (k).
2. Subdivide your dataset into k folds.
3. Use k-1 folds as a training set to build a tree.
4. Use the testing set to estimate statistics about the error in your tree.
5. Save your results for later.
6. Repeat steps 3-5 k times, leaving out a different fold for your test set.
7. Average the errors across your iterations to predict the total error.
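A minimal sketch of those steps, assuming scikit-learn (the question doesn't name a library, and the iris data is just a stand-in for "your dataset"):

    # Minimal sketch of the k-fold procedure above, assuming scikit-learn.
    # The iris data and variable names are illustrative placeholders.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    k = 10                                            # step 1: choose k
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kf.split(X):           # step 2: k folds
        tree = DecisionTreeClassifier()
        tree.fit(X[train_idx], y[train_idx])          # step 3: train on k-1 folds
        acc = tree.score(X[test_idx], y[test_idx])    # step 4: error on held-out fold
        errors.append(1.0 - acc)                      # step 5: save the result
                                                      # step 6: the loop repeats k times
    total_error = np.mean(errors)                     # step 7: average the errors
    print("estimated error:", total_error)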


The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal errors (although that doesn't make it optimal, just that it performed best on the fold it was given; maybe using stratification will help, but everything I've read says it only helps a little bit).


As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across the k trees when each tree could vary in what it chooses to split on, etc.?


What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.


Any help with this little wrinkle would be much appreciated.

Answer

"The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick?"


The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error rate and the deviation relative to this average, which can be useful in assessing the level of precision one can expect from the application. One of the things cross validation can help assert is whether the training data is big enough.
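As a sketch of what "qualifying the model" might look like in practice (scikit-learn and the choice of 10 folds are my assumptions, not part of the answer):

    # Cross validation qualifies the model: it yields an average score and
    # a spread, not a single chosen tree. X, y as in the sketch above.
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
    print("average accuracy:", scores.mean())  # expected precision of the model
    print("deviation:", scores.std())          # spread relative to that average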


With regard to selecting a particular tree, you should instead run yet another training pass on 100% of the training data available, as this typically will produce a better tree. (The downside of the cross validation approach is that we need to divide the [typically small] amount of training data into "folds", and, as you hint in the question, this can lead to trees that are either overfit or underfit for particular data instances.)
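Continuing the same hypothetical sketch, the tree you actually keep would then be fit on all of the data:

    # Once cross validation says the model is acceptable, train the final
    # tree on 100% of the available training data.
    final_tree = DecisionTreeClassifier().fit(X, y)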


In the case of decision trees, I'm not sure what your reference to statistics gathered in the nodes and used to prune the tree pertains to. Maybe a particular use of cross-validation-related techniques?
