
【速搜问答】What is decision tree learning?




Decision tree learning, as used in statistics, data mining and machine learning, employs a decision tree as a predictive model to predict the class label of a sample. Such a decision tree is also called a classification tree or a regression tree. In these tree structures, leaf nodes give class labels and internal nodes represent attributes.

In decision analysis, a decision tree can be used to represent a decision process explicitly. In data mining, a decision tree describes data rather than decisions.

General

Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the target value of a sample.

[Figure: a decision tree describing the survival of passengers on the Titanic]

A tree is trained by splitting the training set into subsets according to a test on an attribute. The process is repeated recursively on each derived subset, which is why it is known as recursive partitioning; the recursion stops when all samples in a subset have the same class label. This top-down induction of decision trees (TDIDT) is a greedy algorithm, and it is by far the most commonly used training strategy.

The data comes in records of the form (x, Y) = (x1, x2, ..., xk, Y), where the vector x holds the input attributes x1, ..., xk used for the prediction and Y is the target value to be predicted.
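As a rough illustration of this recursive procedure, here is a minimal sketch in Python. The function names (build_tree, best_split) are our own, the impurity measure is a simple majority-vote error count rather than any particular algorithm's metric, and categorical attributes are split by an equals/not-equals test:

```python
from collections import Counter

def majority_errors(labels):
    """Samples a leaf would misclassify if it predicted the majority class."""
    return len(labels) - max(Counter(labels).values())

def best_split(rows, labels):
    """Greedily pick the (attribute, value) test leaving the fewest errors."""
    best, best_cost = None, majority_errors(labels)
    for j in range(len(rows[0])):                    # every attribute x_j ...
        for v in {row[j] for row in rows}:           # ... and every observed value
            left = [y for row, y in zip(rows, labels) if row[j] == v]
            right = [y for row, y in zip(rows, labels) if row[j] != v]
            if left and right:
                cost = majority_errors(left) + majority_errors(right)
                if cost < best_cost:
                    best, best_cost = (j, v), cost
    return best

def build_tree(rows, labels):
    """Top-down induction (TDIDT): split, recurse until each subset is pure."""
    if len(set(labels)) == 1:                        # stopping rule from the text
        return labels[0]                             # leaf node: a class label
    split = best_split(rows, labels)
    if split is None:                                # no test improves purity
        return Counter(labels).most_common(1)[0][0]
    j, v = split
    eq_rows, eq_labels, ne_rows, ne_labels = [], [], [], []
    for row, label in zip(rows, labels):
        if row[j] == v:
            eq_rows.append(row); eq_labels.append(label)
        else:
            ne_rows.append(row); ne_labels.append(label)
    return {"test": (j, v),                          # internal node: attribute test
            "eq": build_tree(eq_rows, eq_labels),
            "ne": build_tree(ne_rows, ne_labels)}

# Example: a tiny training set with two categorical attributes.
print(build_tree([["sunny", "hot"], ["rainy", "hot"], ["rainy", "cool"]],
                 ["no", "no", "yes"]))
```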

Types of decision trees

In data mining, there are two main types of decision trees:

Classification trees, whose output is the class label of a sample.

Regression trees, whose output is a real number (for example, the price of a house, or a patient's length of stay in a hospital).

The term Classification And Regression Tree (CART) covers both of the above types and was first introduced by Breiman et al. Classification trees and regression trees have some similarities as well as differences, such as how they decide where to split; a minimal sketch of both kinds follows.
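As a quick illustration of the two types, here is a sketch assuming scikit-learn is installed; the toy data and targets are made up purely for demonstration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data: two numeric input attributes per sample.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]

# Classification tree: the target is a class label.
clf = DecisionTreeClassifier().fit(X, ["no", "yes", "no", "yes"])
print(clf.predict([[1, 1]]))       # -> ['yes']

# Regression tree: the target is a real number (e.g. a price).
reg = DecisionTreeRegressor().fit(X, [10.0, 42.0, 12.0, 40.0])
print(reg.predict([[1, 1]]))       # -> [42.]
```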

Some ensemble methods construct more than one tree (see the sketch after this list):

Bagging, an early ensemble method, trains multiple decision trees on samples drawn with replacement (bootstrap samples) and produces the final prediction by voting.

Random forests, which use multiple decision trees to improve classification performance.

Boosted trees, which can be used for regression analysis as well as classification.

Rotation forests, in which each tree is trained on input attributes first transformed with principal component analysis (PCA).
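A brief sketch of the first three methods, assuming scikit-learn is available (rotation forests have no scikit-learn implementation and are omitted); the synthetic data is for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

# Synthetic classification data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Bagging: many trees on bootstrap samples, combined by voting.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Random forest: bagging plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosted trees: trees added sequentially, each correcting its predecessors.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

for model in (bagging, forest, boosting):
    print(type(model).__name__, round(model.score(X, y), 3))
```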

There are many other decision tree algorithms; common ones include:

The ID3 algorithm

The C4.5 algorithm

CHAID (CHi-squared Automatic Interaction Detector), which performs multi-level splits while growing the tree.

MARS, which can handle numerical data better.

Metrics

Decision trees are usually built top-down, choosing at each step the attribute that "best" splits the data. "Best" means that the training subsets in the child nodes become as pure as possible. Different algorithms use different metrics to define "best"; this section introduces some of the most common ones.

Gini impurity

In the CART algorithm, Gini impurity is the probability that a randomly chosen sample from the subset would be classified incorrectly: the probability of the sample being chosen multiplied by the probability of it being misclassified. The Gini impurity of a node is zero when all samples in it belong to a single class.

Suppose Y can take the values {1, 2, ..., m}, and let f_i be the fraction of samples in the node labeled with class i. The Gini impurity is then I_G(f) = Σ_i f_i (1 − f_i) = 1 − Σ_i f_i².
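In code, this quantity is nearly a one-liner; a minimal sketch (the function name gini is ours, not from a library):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # 0.0 -- a pure node
print(gini(["a", "a", "b", "b"]))  # 0.5 -- an evenly mixed node
```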

Information gain

ID3, C4.5 and C5.0 generate trees using information gain, which is based on the concept of entropy from information theory.
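Concretely, information gain is the entropy of the parent node minus the weighted entropy of the child subsets; a minimal sketch using the standard definitions (entropy H = −Σ_i p_i log₂ p_i):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class-label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Drop in entropy from the parent to the weighted child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A perfect split of a 50/50 node yields 1 bit of information.
print(information_gain(["yes", "yes", "no", "no"],
                       [["yes", "yes"], ["no", "no"]]))  # 1.0
```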

Advantages of decision trees

Compared with other data mining algorithms, decision trees have many advantages:

Easy to understand and interpret. People readily grasp the meaning of a decision tree.

Little data preparation is needed. Other techniques often require data normalization.

Able to handle both numerical and categorical data. Other techniques can often handle only one data type; for example, association rules can only handle categorical data, while neural networks can only handle numerical data.

Uses a white-box model, so the output is easy to explain from the structure of the model. A neural network, by contrast, is a black-box model whose output is hard to explain.

The model's performance can be validated on a test set, so the reliability of the model can be assessed.

Robust. Handles noise well.

Handles large-scale data well.

Disadvantages

Training an optimal decision tree is an NP-complete problem. In practice, therefore, decision trees are trained with heuristic search, such as a greedy algorithm, that reaches a local optimum; such algorithms cannot guarantee an optimal tree.

An overly complex decision tree fails to generalize well to data outside the training set; this is known as overfitting. Pruning mechanisms can mitigate the problem, as in the sketch below.
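A rough sketch of two common pruning strategies, assuming scikit-learn; the depth limit and cost-complexity parameter shown are illustrative values, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Pre-pruning: stop growing the tree early by capping its depth.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Post-pruning: grow fully, then cut back with minimal cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print(shallow.get_depth(), pruned.get_depth())
```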

Some problems are hard for a decision tree to express, such as the XOR problem, for which the tree grows excessively large. The only remedies are to reformulate the problem domain or to use more time-consuming learning algorithms (such as statistical relational learning or inductive logic programming).

For data with categorical attributes, information gain is somewhat biased (in particular, toward attributes with many distinct values).

Extensions

Decision graphs

In a decision tree, every path from the root node to a leaf node proceeds by conjunction (AND). In a decision graph, minimum message length (MML) can be used to join two or more paths by disjunction (OR).

Searching with evolutionary algorithms

Evolutionary algorithms can be used to avoid getting stuck in local optima.

