Random Forest Algorithm in Machine Learning with Python

What Are Random Forests?

Random forests are a type of machine learning algorithm used for decision-making, particularly in classification and regression tasks. They resemble decision trees but are more complex and effective, especially with large datasets.

A random forest is essentially a collection of multiple decision trees grouped together. Unlike a single decision tree, which may overfit or give biased results, the random forest algorithm averages the outcomes of all its trees, reducing errors and providing more accurate predictions.

One key feature of random forests is that their decision trees are largely uncorrelated, meaning they bring different perspectives to the table. This diversity helps in capturing various aspects of the data, leading to more reliable results. However, care must be taken to limit overly deep trees, whose irrelevant splits can introduce noise and reduce the model’s effectiveness.

In essence, random forests leverage the “wisdom of crowds” concept—more trees mean more knowledge, leading to better decisions.

Why the Name “Random”?

Two key concepts give the algorithm its name (both are illustrated in the sketch below):

  1. Random sampling of the training data set when building each tree.
  2. Random subsets of the features considered when splitting each node.
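
Both sources of randomness map directly onto parameters of scikit-learn’s RandomForestClassifier. As a minimal sketch (assuming a feature matrix X_train and labels y_train are already defined, which is not shown here):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True: each tree is trained on a random sample of the rows, drawn with replacement
# max_features='sqrt': each split considers only a random subset of the features
rf_demo = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=0)
rf_demo.fit(X_train, y_train)

The values shown are the library defaults for classification; they are spelled out only to make the two kinds of randomness explicit.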

How Do Random Forests Work?

Random forests work by combining multiple decision trees to improve the accuracy and reliability of predictions. The key to their success lies in the low correlation between the trees. This diversity ensures that individual errors in one tree are offset by others, similar to diversifying investments in a financial portfolio.

To create a random forest, the algorithm uses random samples of data, often with replacement (bootstrapping). This randomization helps the trees remain different enough to provide a wide range of perspectives, reducing the risk of overfitting.

During training, each tree is built using these random samples, and when it’s time to make predictions, the forest averages the results from all the trees. This process, called “bagging” (bootstrap aggregating), ensures that the final prediction is more accurate than any single tree’s result.
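
To make the bagging idea concrete, here is an illustrative sketch only, not scikit-learn’s internal implementation: several decision trees are trained on bootstrap samples and their predictions are combined by majority vote. The names X_train, y_train and X_test are assumed to be pandas objects holding a binary classification problem.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trees = []
for _ in range(30):
    # Bootstrap: sample row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    t = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    t.fit(X_train.iloc[idx], y_train.iloc[idx])
    trees.append(t)

# Aggregate: majority vote across the 30 trees (assumes 0/1 labels)
all_preds = np.stack([t.predict(X_test) for t in trees])
y_vote = np.round(all_preds.mean(axis=0)).astype(int)

RandomForestClassifier performs these steps internally and adds per-split feature subsampling on top.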

To keep the trees from becoming too similar to one another, the algorithm also limits the number of features each tree may consider at any given split, typically a small random subset of all available features. This maintains diversity between trees and prevents them from all learning the same patterns, which would undermine the benefit of averaging.

In essence, random forests are like enhanced decision trees, with the added benefit of combining multiple perspectives to reach more accurate conclusions.

When Should Random Forests Be Used?

Random forests are versatile and can be used in most situations where decision trees are applicable. They’re especially useful when dealing with large datasets that may have complex correlations. A random forest allows you to handle vast amounts of data and still generate reliable solutions.

When to Use Decision Trees:
  • Small Data Sets: Decision trees are ideal for smaller datasets where speed and simplicity are more important than extreme accuracy.
  • Less Critical Decisions: Use decision trees when the decision is important but not critical, such as in business scenarios where the outcome isn’t highly sensitive.
When to Use Random Forests:
  • Large Data Sets: Random forests are better suited for handling massive datasets where accuracy is crucial.
  • High-Stakes Decisions: They are ideal in sensitive situations, like processing data for government organizations or large corporations, where accuracy is paramount.

Summary: Use decision trees for simpler, less data-intensive tasks. Opt for random forests when accuracy and the ability to process large amounts of data are essential.

Random Forest Classifier in Python

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Fit a random forest of 30 trees on the cleaned training data
rf = RandomForestClassifier(n_estimators=30)
rf.fit(X_train_clean, y_train_clean)

# Evaluate on the test set: true labels first, predictions second
y_pred = rf.predict(X_test_clean)
print(classification_report(y_test_clean, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.80      0.85        55
           1       0.74      0.86      0.79        36

    accuracy                           0.82        91
   macro avg       0.82      0.83      0.82        91
weighted avg       0.83      0.82      0.83        91

Random Forest Feature Importance

In a trained random forest, we can easily access the feature importances, which measure how much each feature reduces impurity across the splits of the forest. The values sum to one, so each can be read as the feature’s relative weight in training and prediction. To visualize them, we create a sorted pandas Series of the top ten features, which is then plotted horizontally.

import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances, indexed by feature name
importance_rf = pd.Series(rf.feature_importances_, index=X_train_clean.columns)
importance_rf_sorted = importance_rf.sort_values()

importance_rf_sorted.nlargest(10).plot(kind='barh', color='orange')
plt.title("Feature Importance Random Forest")
plt.show()

Feature (global) importance obtained by fitting a Random Forest Classifier on the Heart Disease dataset.

The following code snippet shows the role of pruning in controlling overfitting: it plots how training and test accuracy vary as max_depth varies.

max_depth = range(1, 20)
train_scores = []
test_scores = []

for a in max_depth:
    tree = RandomForestClassifier(random_state=0, max_depth=a)
    tree.fit(X_train_clean, y_train_clean)
    train_scores.append(tree.score(X_train_clean, y_train_clean))
    test_scores.append(tree.score(X_test_clean, y_test_clean))

# Plot train and test accuracy against max_depth
plt.plot(max_depth, train_scores, label='train accuracy')
plt.plot(max_depth, test_scores, label='test accuracy')
plt.xlabel('max_depth')
plt.ylabel('Random Forest Accuracy')
plt.legend()
plt.show()

The role of pruning in controlling overfitting.

We’ve applied various models to this dataset. Random Forests perform well as an ensemble method, but Logistic Regression outperforms the others, which is expected given the small dataset and linear relationships among some features. To demonstrate this, we implement a function that outputs the scores of the models we’ve examined in this section.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def fitting_models():
    lr = LogisticRegression()
    dt = DecisionTreeClassifier()
    svc = SVC()
    rfc = RandomForestClassifier()
    clfs = [('Logistic Regression', lr),
            ('Decision Tree', dt),
            ('Support Vector Classifier', svc),
            ('Random Forest Classifier', rfc)]

    for name, clf in clfs:
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        score = format(accuracy_score(y_test, pred), '.4f')
        print("{} : {}".format(name, score))

fitting_models()

Logistic Regression : 0.9533
Decision Tree : 0.8422
Support Vector Classifier : 0.4867
Random Forest Classifier : 0.9467

Random Forest Regressor

For the sake of completeness, we also show an example of a Random Forest regressor. We will make use of the hourly-aggregated data from the Washington D.C. Bike Sharing system, publicly available at https://www.capitalbikeshare.com/systemdata. This dataset was used by Fanaee and Gama (2013) in their nice, well-written paper.

# DataIngestion is a custom helper class (defined elsewhere in the project)
# that loads the CSV, drops unwanted columns, and separates features from the target.
di = DataIngestion(df='bike_sharing.csv', col_to_drop=None, col_target='cnt')
X_rf = di.features()
y_rf = di.target()

 

X_train, X_test, y_train, y_test = train_test_split(X_rf, y_rf,test_size=0.3, random_state=42)

X_train = X_train.reset_index(drop=True)

y_train = y_train.reset_index(drop=True)

X_test = X_test.reset_index(drop=True)

y_test = y_test.reset_index(drop=True)

 

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE

rf = RandomForestRegressor(n_estimators=30)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Root mean squared error on the test set
rmse_test = MSE(y_test, y_pred) ** (1/2)
print('RMSE of RF (Test Set): {:.4f}'.format(rmse_test))

RMSE of RF (Test Set): 60.3499

Random Forest – FAQs

What is the advantage and drawback of Random Forests compared to Decision Trees?

Advantage: Random Forests typically offer better predictive power than Decision Trees.

Drawback: Decision Trees are more interpretable because you can visualize the splits that lead to predictions, something that’s not possible with Random Forests.
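
As a hedged illustration of that interpretability gap, a single shallow tree can be drawn with scikit-learn’s plot_tree; the variables X_train_clean and y_train_clean reuse those from the classifier example above, and max_depth=3 is an arbitrary value chosen only to keep the plot readable.

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Fit one small tree purely to inspect its splits
dt_small = DecisionTreeClassifier(max_depth=3, random_state=0)
dt_small.fit(X_train_clean, y_train_clean)

plt.figure(figsize=(12, 6))
plot_tree(dt_small, feature_names=list(X_train_clean.columns), filled=True)
plt.show()

The individual trees of a forest can be drawn this way too, but the combined model has no single, readable set of splits, which is the drawback mentioned above.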

How do I know how many trees I should use?

To determine the number of trees, start by experimenting. Manually tweaking and tuning your model usually reveals the best value quicker than expected. This trial-and-error approach is common in building Machine Learning models, where we test different hyperparameters like the number of trees. In Part 10, we’ll explore k-Fold Cross Validation and Grid Search, powerful techniques to find the optimal number of trees and other hyperparameters.
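
As a minimal sketch of that tuning process (not the exact Part 10 code), scikit-learn’s GridSearchCV combines Grid Search with k-Fold Cross Validation over n_estimators; the candidate values below are arbitrary examples.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 30, 100, 300]}  # arbitrary candidates
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_clean, y_train_clean)

print(grid.best_params_)  # number of trees with the best mean CV accuracy
print(grid.best_score_)   # that mean cross-validated accuracy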

When to Use Random Forest vs. Other Models?

The best approach is to try all models. Thanks to templates, it takes only about 10 minutes, which is minor compared to other parts of a data science project, like Data Preprocessing. By testing various models and comparing results (using cross-validation, which we’ll cover in Part 10), you increase your chances of getting better results.

However, for shortcuts:

  1. Linear Problems:
    • Use Simple Linear Regression if you have one feature.
    • Use Multiple Linear Regression if you have multiple features.
  2. Non-Linear Problems:
    • Consider Polynomial Regression, SVR, Decision Tree, or Random Forest.

In Part 10, you’ll learn how to select the best model using k-Fold Cross Validation, which evaluates model performance and helps you choose the best one. If you’re eager, feel free to jump directly to Part 10.

How do we decide how many trees would be enough to get a relatively accurate result? 

To decide how many trees are enough for an accurate result, start with experimentation and manual tweaking. Use enough trees to achieve good accuracy, but keep in mind that beyond a certain point additional trees mainly add training time rather than predictive power. In Part 10, you’ll learn how to find the optimal number using Parameter Tuning, specifically Grid Search with k-Fold Cross Validation.

Why do we get different results between Python and R?

The difference in results between Python and R is likely due to the random split of data. If you perform cross-validation on all models in both languages (covered in Part 10), you’ll likely get similar mean accuracy.
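
As a small sketch of that check, scikit-learn’s cross_val_score reports a mean cross-validated accuracy that should be broadly comparable to what R produces on the same data; X and y below are placeholders for the full feature matrix and label vector.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y: placeholders for the full feature matrix and labels
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=10, scoring='accuracy')
print('Mean CV accuracy: {:.4f} (+/- {:.4f})'.format(scores.mean(), scores.std()))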

Can we select the most significant variables thanks to the p-value like we did in R before? 

No, you can’t use p-values with Random Forests: unlike linear or logistic regression, they don’t fit coefficients with associated significance tests, so feature selection via p-values isn’t possible. However, you can rely on the forest’s own feature importances (shown earlier in this section, and sketched below) or perform feature extraction, which will be covered in Part 9 – Dimensionality Reduction. This can be applied to Random Forests to reduce the number of features.
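
As a hedged sketch of that alternative, the impurity-based importances computed earlier can drive feature selection directly through scikit-learn’s SelectFromModel; the threshold='mean' choice below is arbitrary.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rf_sel = RandomForestClassifier(n_estimators=100, random_state=0)
rf_sel.fit(X_train_clean, y_train_clean)

# Keep only features whose importance exceeds the mean importance
selector = SelectFromModel(rf_sel, threshold='mean', prefit=True)
X_train_reduced = selector.transform(X_train_clean)
print(X_train_clean.shape, '->', X_train_reduced.shape)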
