shuffling the toxicity dataset? #1234

Open
@jeromedockes

Description

Describe the issue linked to the documentation

The dataset in this example is sorted by label: the first 500 tweets are toxic and the rest are non-toxic.

The example does not explicitly shuffle it, but it still works because cross_validate uses a stratified k-fold by default for classification. However, that is not immediately obvious, and I think there has been some discussion recently in scikit-learn that it might be better not to stratify by default. So I think it might be good to actually shuffle the dataset.
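To make the point concrete, here is a minimal sketch on a synthetic label vector, not the real dataset. The 500 toxic labels come first as described above; the count of 500 non-toxic labels and the helper name `toxic_fraction_per_test_fold` are assumptions made only for illustration. It compares the test folds produced by an unshuffled `KFold` with those of `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical stand-in for the dataset: labels sorted by class,
# 500 toxic (1) first, then non-toxic (0). The non-toxic count is assumed.
y = np.array([1] * 500 + [0] * 500)
X = np.zeros((len(y), 1))  # features do not matter for how folds are built


def toxic_fraction_per_test_fold(splitter):
    """Return the fraction of toxic labels in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


# Unshuffled KFold cuts the sorted data into contiguous blocks,
# so most test folds contain a single class.
plain = toxic_fraction_per_test_fold(KFold(n_splits=5))

# StratifiedKFold keeps the class ratio in every fold, even without shuffling.
stratified = toxic_fraction_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With the unshuffled `KFold`, four of the five test folds contain only one class, which is the situation the example would end up in if stratification stopped being the default.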

Not sure where would be the best place to do it -- in the hosted zip file, in the fetcher, in the example, or by passing a shuffling splitter such as KFold(shuffle=True) as cv to cross_validate (cross_validate itself has no shuffle parameter).
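For the two options that live in the example itself, a rough sketch could look like the following. It uses synthetic data and a `LogisticRegression` as placeholders, both of which are assumptions and not what the actual example uses. Shuffling is configured either on the data with `sklearn.utils.shuffle` or on the splitter passed as `cv`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.utils import shuffle

# Toy data sorted by label, standing in for the real dataset (assumption).
rng = np.random.default_rng(0)
y = np.array([1] * 500 + [0] * 500)
X = rng.normal(size=(len(y), 2)) + y[:, None]  # weakly informative features

# Option A: shuffle rows and labels together inside the example.
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)

# Option B: leave the data sorted and pass a splitter that shuffles.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(LogisticRegression(), X, y, cv=cv)

print(results["test_score"].mean())
```

Option A keeps the `cross_validate` call unchanged, while Option B makes the shuffling visible at the point where the folds are defined. Fixing `random_state` keeps the example reproducible in both cases.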

WDYT @Vincent-Maladiere

Suggest a potential alternative/fix

No response
