Help Understanding Cross Validation and Decision Trees



I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially, Cross Validation allows you to alternate between training and testing when your dataset is relatively small, so as to get the best possible estimate of your error. A very simple algorithm goes something like this:

1. Decide on the number of folds you want (k).
2. Subdivide your dataset into k folds.
3. Use k-1 folds as a training set to build a tree.
4. Use the testing set to estimate statistics about the error in your tree.
5. Save your results for later.
6. Repeat steps 3-5 k times, leaving out a different fold for your test set.
7. Average the errors across your iterations to predict the total error.
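A minimal sketch of those steps, assuming scikit-learn (the question doesn't name a library, and the iris data is just a stand-in for "your dataset"):

    # Minimal sketch of the k-fold procedure above, assuming scikit-learn.
    # The iris data and variable names are illustrative placeholders.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    k = 10                                            # step 1: choose k
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kf.split(X):           # step 2: k folds
        tree = DecisionTreeClassifier()
        tree.fit(X[train_idx], y[train_idx])          # step 3: train on k-1 folds
        acc = tree.score(X[test_idx], y[test_idx])    # step 4: error on held-out fold
        errors.append(1.0 - acc)                      # step 5: save the result
                                                      # step 6: the loop repeats k times
    total_error = np.mean(errors)                     # step 7: average the errors
    print("estimated error:", total_error)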


The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick? One idea I had was to pick the one with minimal errors (although that doesn't make it optimal, just that it performed best on the fold it was given; maybe using stratification will help, but everything I've read says it only helps a little bit).


As I understand cross validation, the point is to compute in-node statistics that can later be used for pruning. So really each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're averaging your error, how do you merge these stats within each node across the k trees when each tree could vary in what it chooses to split on, etc.?


What's the point of calculating the overall error across each iteration? That's not something that could be used during pruning.


Any help with this little wrinkle would be much appreciated.

Answer

"The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split the same way, etc. Which tree do you pick?"


The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error rate and the deviation relative to this average, which can be useful in assessing the level of precision one can expect from the application. One of the things cross validation can help assert is whether the training data is big enough.
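As a sketch of what "qualifying the model" might look like in practice (scikit-learn and the choice of 10 folds are my assumptions, not part of the answer):

    # Cross validation qualifies the model: it yields an average score and
    # a spread, not a single chosen tree. X, y as in the sketch above.
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
    print("average accuracy:", scores.mean())  # expected precision of the model
    print("deviation:", scores.std())          # spread relative to that average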


With regard to selecting a particular tree, you should instead run yet another training pass on 100% of the training data available, as this typically will produce a better tree. (The downside of the cross validation approach is that we need to divide the [typically small] amount of training data into "folds", and, as you hint in the question, this can lead to trees that are either overfit or underfit for particular data instances.)
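Continuing the same hypothetical sketch, the tree you actually keep would then be fit on all of the data:

    # Once cross validation says the model is acceptable, train the final
    # tree on 100% of the available training data.
    final_tree = DecisionTreeClassifier().fit(X, y)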


In the case of decision trees, I'm not sure what your reference to statistics gathered in the nodes and used to prune the tree pertains to. Maybe a particular use of cross-validation-related techniques?
