Describe the issue linked to the documentation
The dataset in this example is sorted by label: the first 500 tweets are toxic and the rest are non-toxic.
The example does not explicitly shuffle it, but it still works because cross_validate uses a stratified k-fold by default for classification.
However, that is not immediately obvious, and I think there has been some discussion recently in scikit-learn that it might be better not to stratify by default. So I think it would be good to actually shuffle the dataset.
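Not sure where the best place to do it would be: in the hosted zip file, in the fetcher, in the example, or by passing a shuffling splitter to cross_validate (it has no shuffle parameter of its own, so something like cv=StratifiedKFold(shuffle=True)).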
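To make the concern concrete, here is a minimal sketch of how the splitters behave on label-sorted data. The total of 1000 samples, the dummy feature matrix, and the helper name `toxic_fraction_per_fold` are assumptions for illustration only; this is not the dataset or the code from the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical stand-in for the dataset: 500 toxic labels followed by
# 500 non-toxic ones (the real total size is an assumption here).
y = np.array([1] * 500 + [0] * 500)
X = np.zeros((len(y), 1))  # features do not affect how the splitters cut the data


def toxic_fraction_per_fold(cv):
    """Return the fraction of toxic samples in each test fold of `cv`."""
    return [float(y[test].mean()) for _, test in cv.split(X, y)]


# Contiguous folds on sorted labels: most test folds contain a single class.
plain = toxic_fraction_per_fold(KFold(n_splits=5))
# plain == [1.0, 1.0, 0.5, 0.0, 0.0]

# Stratification restores the class balance in every fold.
stratified = toxic_fraction_per_fold(StratifiedKFold(n_splits=5))
# stratified == [0.5, 0.5, 0.5, 0.5, 0.5]

# Shuffling also mixes the classes, without relying on stratification.
shuffled = toxic_fraction_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))

print(plain, stratified, shuffled)
```

If the default ever stopped being stratified, the first case is what the example would silently fall back to, which is why an explicit shuffle seems safer.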
WDYT @Vincent-Maladiere
Suggest a potential alternative/fix
No response