I've finally had some time to take a closer look at the dask backend, so I'm starting a thread here to triage some thoughts. In this gist I'm looking at timings for some pretty big datasets (source = 2.7M, target = 137k). tl;dr:
- the single-core interpolation takes 4-4.5 minutes
- the parallelized version takes 3.5 minutes
- the dask-based version takes 1.5 minutes
- (that's the generic one without doing any config. If I try to use `distributed`, everything crashes. Rough invocation sketch below.)
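
For context, this is roughly how I'm driving the three variants; the variable names, the extensive variable, the partition count, and the exact `area_interpolate_dask` keywords below are illustrative shorthand rather than the precise API:

```python
import dask_geopandas
from tobler.area_weighted import area_interpolate, area_interpolate_dask

# source/target are ordinary GeoDataFrames (~2.7M and ~137k features)

# 1) single core
single = area_interpolate(source, target, extensive_variables=["population"])

# 2) parallelized (joblib under the hood)
parallel = area_interpolate(
    source, target, extensive_variables=["population"], n_jobs=-1
)

# 3) dask-based: spatially partition both frames first; spatial_shuffle()
#    is what leaves the 'hilbert_distance' index mentioned below
dsource = dask_geopandas.from_geopandas(source, npartitions=16).spatial_shuffle()
dtarget = dask_geopandas.from_geopandas(target, npartitions=16).spatial_shuffle()
dasked = area_interpolate_dask(
    dsource, dtarget, id_col="target_id", extensive_variables=["population"]
).compute()

# wrapping the last step in a dask.distributed Client() is the part that
# currently crashes for me
```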
So this is really nice when you've got a large enough dataset, and it will be great to get it fleshed out for the other variable types. Now that I've interacted with the code a little more:
- I'm not sure what the dtype error is in the currently failing tests, but I'm not seeing it locally
- why do categoricals need to be categorical dtype? Usually they are just strings, so this line can be a little confusing when you get back `AttributeError: Can only use .cat accessor with a 'category' dtype`. If dask absolutely requires a categorical dtype, we should do a check or conversion (see the conversion sketch after this list)
- what's the purpose of `category_vars`? The categorical variables should already come back formatted like that, so this block is basically just doing the same thing but forcing categorical instead of using `unique`
- if not given, the `id_col` should probably default to the target gdf index
- we probably want to drop the dask index ('hilbert_distance' in the current example) that comes back from `area_interpolate_dask`, and probably instead set the index to `id_col` (index sketch after this list)
- the real work here is still done by the `area_interpolate` function, and dask is acting as a scheduler (basically a drop-in replacement for joblib in the parallelized implementation). So all the actual computation of extensive/intensive/categorical variables happens in the workers and just needs to be stuck back together at the end. Shouldn't we have a single aggregation instead of several split by variable type? The weighting and summations should have already happened by that point, so everything should get the same groupby/apply (aggregation sketch below)?
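
On the categorical-dtype point: if dask really does need `category` dtype, a defensive check/conversion up front would avoid the confusing `.cat` error. A minimal sketch (the helper name and where it gets called are just illustrative):

```python
import pandas as pd

def ensure_categorical(df, categorical_variables):
    """Coerce plain string/object columns to 'category' dtype so the
    downstream .cat accessor works instead of raising AttributeError."""
    for col in categorical_variables:
        if not isinstance(df[col].dtype, pd.CategoricalDtype):
            df[col] = df[col].astype("category")
    return df
```

If the conversion has to happen after the frames are already dask-backed, `DataFrame.categorize(columns=...)` should do the same coercion and also makes the categories known across partitions.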
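On the index point, picking up `dasked` from the sketch above (the computed GeoDataFrame, with `target_id` standing in for whatever `id_col` is), the cleanup could be as simple as:

```python
# drop the 'hilbert_distance' partitioning index left over from
# spatial_shuffle() and key the result on the target identifier instead
dasked = dasked.reset_index(drop=True).set_index("target_id")
```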
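And a sketch of the single-aggregation idea from the last bullet: if every chunk a worker hands back already holds weighted, id-keyed partial values, then reassembly can be one concatenate-and-groupby regardless of variable type (`chunk_results` and `id_col` are placeholders for whatever the dask graph actually passes around):

```python
import pandas as pd

def combine_chunks(chunk_results, id_col):
    """Reassemble per-partition outputs of area_interpolate into one frame.

    Assumes each chunk already contains weighted/allocated values for its
    share of the target geometries, so summing partial contributions per
    target id is all that is left, for extensive, intensive, and
    categorical columns alike.
    """
    combined = pd.concat(chunk_results)
    # geometry gets re-attached from the target afterwards, so only the
    # numeric value columns are aggregated here
    return combined.groupby(id_col).sum(numeric_only=True)
```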