Introduce a "join" operator for simpler join execution #1237

Open
@rcap107

Description

Problem Description

The current implementations of Joiner and MultiAggJoiner have some limitations, caused in part by the fact that they need to follow the scikit-learn estimator template:

  • They implement only the left join. This makes sense for an estimator because the number of samples must remain constant. However, a user may expect to be able to perform any other kind of join (inner, outer, anti...), since pandas and polars support these in their own join operations.
  • They are hard to put in production because the join tables are defined in the init and may change between the init and the moment the join is executed.
    ( @Vincent-Maladiere )
  • In general, the fit/transform structure makes them clunky to use if the user only needs to perform multiple joins and does not care about putting them into a pipeline.

I think it would be useful to have a more lightweight "join operator" that implements the join without the constraints of the estimator.

Feature Description

In contrast with the current implementation, a join_tables operator could look similar to this:

joined_table = skrub.join_tables(
    main_table,
    aux_tables=[aux_table_1, aux_table_2, ...],
    left_on=["key1", "key2"],
    right_on=["id1", "id2"],
    how="inner",
)

I am calling this an "operator" because it will operate directly on the given tables, and is stateless.

It should be possible to reuse most of the machinery that has already been implemented in the Joiners, so it should not be too complicated to implement.
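
To make the proposal more concrete, here is a minimal sketch of what a stateless join_tables could do. It is not skrub code: it assumes pandas dataframes and uses `DataFrame.merge` in place of the existing Joiner machinery. It also assumes that every auxiliary table uses the same `right_on` key names, and that no auxiliary table already has a non-key column named like one of the `left_on` keys. The `_aux{i}` suffix for clashing column names is an arbitrary choice made for illustration.

```python
import pandas as pd


def join_tables(main_table, aux_tables, left_on, right_on, how="left"):
    """Hypothetical stateless join: merge each auxiliary table into main_table in turn."""
    if len(left_on) != len(right_on):
        raise ValueError("left_on and right_on must have the same length")
    joined = main_table
    for i, aux in enumerate(aux_tables, start=1):
        # Rename the auxiliary keys to the main key names so that the result
        # keeps a single set of key columns, whatever the join type.
        renamed = aux.rename(columns=dict(zip(right_on, left_on)))
        joined = joined.merge(
            renamed, on=left_on, how=how, suffixes=("", f"_aux{i}")
        )
    return joined


main_table = pd.DataFrame(
    {"key1": [1, 2, 3], "key2": ["a", "b", "c"], "x": [10, 20, 30]}
)
aux_table_1 = pd.DataFrame({"id1": [1, 2], "id2": ["a", "b"], "y": [0.1, 0.2]})
aux_table_2 = pd.DataFrame({"id1": [2, 3], "id2": ["b", "c"], "z": [True, False]})

inner = join_tables(
    main_table,
    aux_tables=[aux_table_1, aux_table_2],
    left_on=["key1", "key2"],
    right_on=["id1", "id2"],
    how="inner",
)
left = join_tables(
    main_table,
    aux_tables=[aux_table_1, aux_table_2],
    left_on=["key1", "key2"],
    right_on=["id1", "id2"],
    how="left",
)
```

Since nothing is stored between calls, the tables are read at the moment of the join, and the number of rows in the result depends on `how`: here the inner join keeps only the row matched in both auxiliary tables, while the left join keeps all rows of main_table.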

Alternative Solutions

No response

Additional Context

No response
