Skip to content

issues with dask interpolate #185

Open
@knaaptime

Description

i've finally had some time to take a closer look at the dask backend, so starting a thread here to triage some thoughts. In this gist I'm looking at timing for some pretty big datasets (source=2.7m, target=137k). tl;dr,:

  • the single core interpolation takes 4-4.5 minutes,
  • the parallelized version takes 3.5 min
  • the dask-based version takes 1.5 min
    • (thats the generic one without doing any config. If i try to use distributed, everything crashes)

So this is really nice when you've got a large enough dataset, and it will be great to get it fleshed out for the other variable types. Now that i've interacted with the code a little more:

  • i'm not sure what the dtype error is in the currently failing tests, but im not seeing it locally
  • why do categoricals need to be categorical dtype? Usually they are just strings, so this line can be a little confusing when you get back AttributeError: Can only use .cat accessor with a 'category' dtype. If dask absolutely requires a categorical type, we should do a check or conversion
    • what's the purpose of category_vars? The categorical variables should already come back formatted like that, so this block is basically just doing the same thing but forcing categorical instead of using unique
  • if not given, the id col should probably be the target gdf index
    • we probably want to drop the dask index ('hilbert_distance' in the current example) that comes back from area_interpolate_dask, and probably instead set the index to id_col
  • the real work here is still done by the area_interpolate function, and dask is acting as a scheduler (basically a drop in replacement for joblib in the parallelized implementation). So all actual computation of extensive/intensive/categorical happens in the workers, and just needs to be stuck back together at the end. So shouldn't we have a single aggregation instead of multiple based on variable type? the weighting and summations should've already happened by this point, so everything should get the same groupby/apply?

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions