update explanation on nr of trees in GBDT #799

Merged: 3 commits, Jan 29, 2025
15 changes: 8 additions & 7 deletions notebooks/ensemble_ex_03.ipynb
@@ -101,20 +101,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both gradient boosting and random forest models improve when increasing the\n",
"number of trees in the ensemble. However, the scores reach a plateau where\n",
"adding new trees just makes fitting and scoring slower.\n",
"Random forest models improve when increasing the number of trees in the\n",
"ensemble. However, the scores reach a plateau where adding new trees just\n",
"makes fitting and scoring slower.\n",
"\n",
"To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
"Gradient boosting models overfit when the number of trees is too large. To\n",
"avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
"offers an early-stopping option. Internally, the algorithm uses an\n",
"out-of-sample set to compute the generalization performance of the model at\n",
"each addition of a tree. Thus, if the generalization performance is not\n",
"improving for several iterations, it stops adding trees.\n",
"\n",
"Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
"of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
"that the gradient boosting fitting stops after adding 5 trees that do not\n",
"improve the overall generalization performance."
"of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
"such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
"deterioration of the overall generalization performance."
]
},
{
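The change above describes early stopping in gradient boosting. As a minimal sketch of what the exercise asks for, assuming scikit-learn's `GradientBoostingRegressor` and the California housing data; the dataset choice and variable names are illustrative assumptions, not part of this diff:

```python
# A sketch of the exercise, not the official solution: fit a gradient
# boosting model with a deliberately oversized number of trees and let
# early stopping cut it short.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)

# With n_iter_no_change=5, fitting stops once 5 consecutive trees fail to
# improve the score on an internal validation split, whose size is
# controlled by validation_fraction (10% by default).
gbdt = GradientBoostingRegressor(
    n_estimators=1_000, n_iter_no_change=5, random_state=0
)
gbdt.fit(data_train, target_train)
```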
17 changes: 9 additions & 8 deletions notebooks/ensemble_sol_03.ipynb
@@ -129,20 +129,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Both gradient boosting and random forest models improve when increasing the\n",
"number of trees in the ensemble. However, the scores reach a plateau where\n",
"adding new trees just makes fitting and scoring slower.\n",
"Random forest models improve when increasing the number of trees in the\n",
"ensemble. However, the scores reach a plateau where adding new trees just\n",
"makes fitting and scoring slower.\n",
"\n",
"To avoid adding new unnecessary tree, unlike random-forest gradient-boosting\n",
"Gradient boosting models overfit when the number of trees is too large. To\n",
"avoid adding a new unnecessary tree, unlike random-forest gradient-boosting\n",
"offers an early-stopping option. Internally, the algorithm uses an\n",
"out-of-sample set to compute the generalization performance of the model at\n",
"each addition of a tree. Thus, if the generalization performance is not\n",
"improving for several iterations, it stops adding trees.\n",
"\n",
"Now, create a gradient-boosting model with `n_estimators=1_000`. This number\n",
"of trees is certainly too large. Change the parameter `n_iter_no_change` such\n",
"that the gradient boosting fitting stops after adding 5 trees that do not\n",
"improve the overall generalization performance."
"of trees is certainly too large. Change the parameter `n_iter_no_change`\n",
"such that the gradient boosting fitting stops after adding 5 trees to avoid\n",
"deterioration of the overall generalization performance."
]
},
{
@@ -167,7 +168,7 @@
"source": [
"We see that the number of trees used is far below 1000 with the current\n",
"dataset. Training the gradient boosting model with the entire 1000 trees would\n",
"have been useless."
"have been detrimental."
]
},
{
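The solution's claim that the number of trees stays far below 1000 can be verified on the fitted estimator, which exposes the count selected by early stopping. Continuing the sketch above, so `gbdt` is assumed already fitted:

```python
# n_estimators_ holds the number of trees actually kept once early
# stopping triggered; expect a value well below the requested 1_000.
print(f"Number of trees fitted: {gbdt.n_estimators_}")
```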
15 changes: 8 additions & 7 deletions python_scripts/ensemble_ex_03.py
@@ -64,20 +64,21 @@
 # Write your code here.

 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding a new unnecessary tree, unlike random-forest gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops after adding 5 trees to avoid
+# deterioration of the overall generalization performance.

 # %%
 # Write your code here.
17 changes: 9 additions & 8 deletions python_scripts/ensemble_sol_03.py
@@ -86,20 +86,21 @@
 )

 # %% [markdown]
-# Both gradient boosting and random forest models improve when increasing the
-# number of trees in the ensemble. However, the scores reach a plateau where
-# adding new trees just makes fitting and scoring slower.
+# Random forest models improve when increasing the number of trees in the
+# ensemble. However, the scores reach a plateau where adding new trees just
+# makes fitting and scoring slower.
 #
-# To avoid adding new unnecessary tree, unlike random-forest gradient-boosting
+# Gradient boosting models overfit when the number of trees is too large. To
+# avoid adding a new unnecessary tree, unlike random-forest gradient-boosting
 # offers an early-stopping option. Internally, the algorithm uses an
 # out-of-sample set to compute the generalization performance of the model at
 # each addition of a tree. Thus, if the generalization performance is not
 # improving for several iterations, it stops adding trees.
 #
 # Now, create a gradient-boosting model with `n_estimators=1_000`. This number
-# of trees is certainly too large. Change the parameter `n_iter_no_change` such
-# that the gradient boosting fitting stops after adding 5 trees that do not
-# improve the overall generalization performance.
+# of trees is certainly too large. Change the parameter `n_iter_no_change`
+# such that the gradient boosting fitting stops after adding 5 trees to avoid
+# deterioration of the overall generalization performance.

 # %%
 # solution
@@ -110,7 +111,7 @@
 # %% [markdown] tags=["solution"]
 # We see that the number of trees used is far below 1000 with the current
 # dataset. Training the gradient boosting model with the entire 1000 trees would
-# have been useless.
+# have been detrimental.

 # %% [markdown]
 # Estimate the generalization performance of this model again using the
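The final instruction above is truncated, so the exact procedure it requests is not visible in this diff. As one hedged possibility, estimating the generalization performance of the early-stopped model with cross-validation could look like this, continuing the sketch above; the use of `cross_validate` and the default R2 scorer are assumptions on my part:

```python
# Hypothetical follow-up: cross-validate the early-stopped model.
# cross_validate refits a clone of gbdt on each fold; for a regressor
# the default score is R2.
from sklearn.model_selection import cross_validate

cv_results = cross_validate(gbdt, data, target, n_jobs=2)
scores = cv_results["test_score"]
print(f"Mean R2 across folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```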