Decision Trees and Overfitting - Simplified
The decision tree algorithm can become exceedingly difficult, so here's a simple explanation.
AI is less of a closed door nowadays. How many “free” or online trainings are out there? We’ve all lost count.
I plan on creating an end-to-end introduction to AI (not specific to machine learning, NLP, or deep learning). Before I evolve that introduction further, the time feels right to explore the decision tree algorithm.
As I typically note, if you want a deep dive into algorithms, you are much better off purchasing a comprehensive textbook; otherwise, start with high-level introductions like this one. Want to advance to coding in the same algorithmic problem space? That’s a separate endeavor (a multi-part series), at least in the context of how I write. Take, for instance, my multi-part series on SciPy: the second post was coding-centric (here [1]).
Let us get right to it.
The algorithmic goal
The decision tree method is generally used for supervised learning problems (I am omitting unsupervised decision tree implementations [4]). Its purpose is to learn a model from the training data that can then generate predictions on test data it has not yet seen. Decision trees are a machine learning technique used for both regression and classification problems. The core principle underlying decision trees is partitioning the feature space into regions and then using these regions to forecast the target variable.
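As a concrete illustration of that goal, here is a minimal sketch. I'm assuming scikit-learn and its bundled iris dataset purely for illustration; neither is prescribed by this post.

```python
# A minimal supervised-learning sketch with scikit-learn (an assumed library
# choice, not one named in this post): the tree learns regions of the feature
# space from labeled training data, then predicts on unseen samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)   # CART-style classifier
clf.fit(X_train, y_train)                      # learn the partitioning of the feature space
print(clf.predict(X_test[:5]))                 # predictions on unseen samples
print(clf.score(X_test, y_test))               # accuracy on the held-out split
```

A regression problem works the same way with DecisionTreeRegressor in place of the classifier.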
Types (a couple of examples)
There are many distinct types of decision tree algorithms; however, CART (Classification and Regression Trees) and ID3 (Iterative Dichotomiser 3) are used more often than not. In each of these algorithms, the data set is sliced recursively along one or more features until each resulting region contains samples that share the same target value [2][3].
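For a rough mapping onto a library, scikit-learn's tree module is CART-based, and switching its split criterion from Gini to entropy gives an ID3-flavored impurity measure on the same recursive-splitting machinery. This is an assumption about tooling on my part, not something either algorithm requires.

```python
# scikit-learn's trees are an optimized CART implementation (library choice is
# mine, not the post's); the criterion argument selects the impurity measure.
from sklearn.tree import DecisionTreeClassifier

cart_style = DecisionTreeClassifier(criterion="gini")     # CART's default measure
id3_style = DecisionTreeClassifier(criterion="entropy")   # ID3-style information gain
```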
How it works
The concept of decision trees may be understood in a very straightforward manner. The aim is to produce a model that minimizes some cost function associated with making predictions on new data points. To accomplish this, we need to locate splits in our data that optimize some criterion, such as maximizing information gain (IG) or minimizing Gini impurity. Once these optimal splits are identified, the final model is the resulting tree: a prediction is made by following its branches down to a leaf node and taking the class label with the highest predicted probability at that leaf.
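Here is a small from-scratch sketch of those two split criteria; the labels and the candidate split are invented purely for illustration.

```python
# A from-scratch sketch of the two criteria mentioned above; the parent node's
# labels and the left/right split are made up for illustration.
import numpy as np

def gini(labels):
    """Gini impurity of one node: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy of one node: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]          # a perfect candidate split
print(gini(parent))                           # 0.5 (maximally mixed, two classes)
print(information_gain(parent, left, right))  # 1.0 (all uncertainty removed)
```

The split the algorithm keeps at each node is simply the candidate with the best value of the chosen criterion.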
Decision trees make predictions by learning a series of if-then-else conditions from the data. Each condition at a node of the tree (the root node or an internal node) leads to a new set of child nodes with new conditions. This continues until every training sample has been routed to a leaf node, at which point the tree has been “fitted” [5]. After being trained on past data, the model can tell, when given new data it has not seen before, which path through the tree leads to which result, using the splits learned during training to classify or predict values that were not present when the model was first built. In the end, decision tree algorithms learn complex relationships between inputs and outputs by breaking them down into smaller pieces that build up into the function that best fits the evidence.
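One way to see those learned if-then-else conditions is to print the fitted tree's rules. The sketch below assumes scikit-learn's export_text, which is just one convenient tool for this.

```python
# Printing the learned rules with scikit-learn's export_text (an assumed
# tooling choice): each line is a condition, and following the branches a new
# sample satisfies traces the path from the root to the leaf that supplies
# its prediction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

print(export_text(clf, feature_names=list(iris.feature_names)))
```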
Overfitting issues and mitigations
Overfitting is a problem in machine learning in which a model performs well on training data but does not generalize well to new samples [9]. It often happens (but is not limited to cases) when the model is too complicated for the data being used. Because few constraints are placed on a decision tree’s ability to keep splitting, decision trees are especially susceptible to overfitting.
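A quick way to observe that gap, under the same assumed scikit-learn setup and an arbitrary dataset choice: fit an unconstrained tree and compare its training accuracy with its held-out accuracy.

```python
# A small illustration of the gap described above: an unconstrained tree
# memorizes the training split yet scores lower on held-out data. The dataset
# and split settings are arbitrary choices for this sketch.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", deep_tree.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", deep_tree.score(X_test, y_test))    # noticeably lower
```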
When using decision tree algorithms, preventing overfitting can be accomplished in a number of different ways, some of which are as follows:
— Pre-pruning is the technique of stopping the growth of the tree once it reaches a particular size or depth limit [7] (a pre- and post-pruning sketch follows this list).
— Post-pruning entails growing the entire tree [8], after which unnecessary or nearly identical nodes are eliminated one by one until only the nodes that optimize performance are left. After being pruned, the tree will be smaller and easier to interpret, without excessive loss of accuracy (also covered in the first sketch below).
— K-fold cross-validation: split your dataset into K train/test folds, train and evaluate once per fold, measure the performance of each run, and then take the average of those results (see the second sketch after this list).
— Bootstrap aggregating (bagging): fit multiple models on random subsamples of your original dataset (drawn with replacement), then combine their predictions by averaging or voting [6] (also in the second sketch).
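To make the first two bullets concrete, here is a minimal sketch, assuming scikit-learn (my choice of library, not one named in the post): pre-pruning via a depth and leaf-size cap, and post-pruning via cost-complexity pruning (ccp_alpha) applied after growing the full tree.

```python
# A sketch of both pruning styles with scikit-learn (an assumed library
# choice). Pre-pruning caps the tree while it grows; post-pruning grows the
# full tree and then trims it via cost-complexity pruning (ccp_alpha).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth at a depth / leaf-size limit.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path of the full tree,
# then refit with a chosen ccp_alpha; larger alphas remove more of the
# nearly redundant nodes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # an arbitrary mid-path choice
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha,
                                     random_state=0).fit(X_train, y_train)

for name, model in [("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "depth:", model.get_depth(), "test accuracy:", model.score(X_test, y_test))
```

In practice you would tune the depth limit or ccp_alpha on validation data rather than picking them arbitrarily as done here.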
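And a second sketch, under the same scikit-learn assumption, for the last two bullets: k-fold cross-validation averages the score over several train/test splits, and bagging combines trees fit on bootstrap resamples by majority vote.

```python
# K-fold cross-validation and bagging with scikit-learn (an assumed library
# choice); the parameter values here are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# K-fold: evaluate the same model on 5 different train/test splits and average.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())

# Bagging: 100 trees, each fit on a bootstrap sample drawn with replacement,
# combined by majority vote (BaggingClassifier's default base estimator is a
# decision tree).
bagged = BaggingClassifier(n_estimators=100, random_state=0)
print("bagged 5-fold mean accuracy:", cross_val_score(bagged, X, y, cv=5).mean())
```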
Thank you.