I've finally had some time to take a closer look at the dask backend, so I'm starting a thread here to triage some thoughts. In this gist I'm looking at timings for some pretty big datasets (source = 2.7M, target = 137k). tl;dr:
- the single-core interpolation takes 4-4.5 minutes
- the parallelized version takes 3.5 minutes
- the dask-based version takes 1.5 minutes
- (that's the generic one without doing any config. If I try to use `distributed`, everything crashes. Rough invocation sketch below.)
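
For context, this is roughly how I'm driving the three variants; the variable names, the extensive variable, the partition count, and the exact `area_interpolate_dask` keywords below are illustrative shorthand rather than the precise API:

```python
import dask_geopandas
from tobler.area_weighted import area_interpolate, area_interpolate_dask

# source/target are ordinary GeoDataFrames (~2.7M and ~137k features)

# 1) single core
single = area_interpolate(source, target, extensive_variables=["population"])

# 2) parallelized (joblib under the hood)
parallel = area_interpolate(
    source, target, extensive_variables=["population"], n_jobs=-1
)

# 3) dask-based: spatially partition both frames first; spatial_shuffle()
#    is what leaves the 'hilbert_distance' index mentioned below
dsource = dask_geopandas.from_geopandas(source, npartitions=16).spatial_shuffle()
dtarget = dask_geopandas.from_geopandas(target, npartitions=16).spatial_shuffle()
dasked = area_interpolate_dask(
    dsource, dtarget, id_col="target_id", extensive_variables=["population"]
).compute()

# wrapping the last step in a dask.distributed Client() is the part that
# currently crashes for me
```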
So this is really nice when you've got a large enough dataset, and it will be great to get it fleshed out for the other variable types. Now that I've interacted with the code a little more:
- I'm not sure what the dtype error is in the currently failing tests, but I'm not seeing it locally
- why do categoricals need to be categorical dtype? Usually they are just strings, so this line can be a little confusing when you get back `AttributeError: Can only use .cat accessor with a 'category' dtype`. If dask absolutely requires a categorical dtype, we should do a check or conversion (see the conversion sketch after this list)
- what's the purpose of `category_vars`? The categorical variables should already come back formatted like that, so this block is basically just doing the same thing but forcing categorical instead of using `unique`
- if not given, the `id_col` should probably default to the target gdf index
- we probably want to drop the dask index ('hilbert_distance' in the current example) that comes back from `area_interpolate_dask`, and probably instead set the index to `id_col` (index sketch after this list)
- the real work here is still done by the `area_interpolate` function, and dask is acting as a scheduler (basically a drop-in replacement for joblib in the parallelized implementation). So all the actual computation of extensive/intensive/categorical variables happens in the workers and just needs to be stuck back together at the end. Shouldn't we have a single aggregation instead of several split by variable type? The weighting and summations should have already happened by that point, so everything should get the same groupby/apply (aggregation sketch below)?
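
On the categorical-dtype point: if dask really does need `category` dtype, a defensive check/conversion up front would avoid the confusing `.cat` error. A minimal sketch (the helper name and where it gets called are just illustrative):

```python
import pandas as pd

def ensure_categorical(df, categorical_variables):
    """Coerce plain string/object columns to 'category' dtype so the
    downstream .cat accessor works instead of raising AttributeError."""
    for col in categorical_variables:
        if not isinstance(df[col].dtype, pd.CategoricalDtype):
            df[col] = df[col].astype("category")
    return df
```

If the conversion has to happen after the frames are already dask-backed, `DataFrame.categorize(columns=...)` should do the same coercion and also makes the categories known across partitions.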
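On the index point, picking up `dasked` from the sketch above (the computed GeoDataFrame, with `target_id` standing in for whatever `id_col` is), the cleanup could be as simple as:

```python
# drop the 'hilbert_distance' partitioning index left over from
# spatial_shuffle() and key the result on the target identifier instead
dasked = dasked.reset_index(drop=True).set_index("target_id")
```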
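And a sketch of the single-aggregation idea from the last bullet: if every chunk a worker hands back already holds weighted, id-keyed partial values, then reassembly can be one concatenate-and-groupby regardless of variable type (`chunk_results` and `id_col` are placeholders for whatever the dask graph actually passes around):

```python
import pandas as pd

def combine_chunks(chunk_results, id_col):
    """Reassemble per-partition outputs of area_interpolate into one frame.

    Assumes each chunk already contains weighted/allocated values for its
    share of the target geometries, so summing partial contributions per
    target id is all that is left, for extensive, intensive, and
    categorical columns alike.
    """
    combined = pd.concat(chunk_results)
    # geometry gets re-attached from the target afterwards, so only the
    # numeric value columns are aggregated here
    return combined.groupby(id_col).sum(numeric_only=True)
```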