[Bug] Empty dataset not empty #300

Open
@danielfromearth

Description

Summary


An inappropriate fill value is set when l2ss-py creates an empty dataset copy. Instead of being truly empty, the copy holds a "valid" value in a data variable rather than a true fill value, which causes subsequent processing to fail.

Description of the problem


When there are no data points that match the requested spatiotemporal conditions, l2ss-py creates an empty dataset copy here. @ank1m and I discovered an edge case where a valid value is placed in the new, copied variable instead of the expected null or fill value. This occurred for the following "ground_pixel_quality_flag" variable, which notably has an integer type (int32) and no declared '_FillValue' attribute:

Here is a screenshot showing the variable in a TEMPO collection:
[image: screenshot of the variable]

Since this variable, "support_data/ground_pixel_quality_flag", doesn't have a '_FillValue', l2ss-py tries to create an empty array using np.nan instead. But, because this variable is of type 'int32', it can't use np.nan!

Instead, the code raises a
RuntimeWarning: invalid value encountered in cast
  multiarray.copyto(a, fill_value, casting='unsafe')
and then falls back to using 0 instead of np.nan.

However, 0 is a valid value for this variable (see the valid_min and valid_max attributes in the above screenshot), so subsequent operations see a valid array, rather than an empty, or all-fill-value, array.
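
To illustrate why np.nan cannot serve as the fill for this variable, here is a minimal sketch using only NumPy. The helper name `can_hold_nan` is hypothetical and is not part of l2ss-py. Integer dtypes have no bit pattern reserved for "missing", so whatever ends up in the array after the cast is an ordinary integer; the exact result of forcing NaN into an integer array is not shown because it can differ between platforms and NumPy versions.

```python
import numpy as np


def can_hold_nan(dtype):
    """Return True if arrays of this dtype can store NaN (hypothetical helper)."""
    # Only inexact (floating-point and complex) dtypes have a NaN representation.
    return bool(np.issubdtype(np.dtype(dtype), np.inexact))


print(can_hold_nan(np.float32))  # True
print(can_hold_nan(np.int32))    # False

# Every int32 bit pattern is an ordinary number in this range, so a value
# left behind by an unsafe cast from NaN looks like real data downstream.
info = np.iinfo(np.int32)
print(info.min, info.max)
```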

Impact


This causes a failure during the service chain call below, when the "Stitchee" service tries to determine whether the files coming from l2ss-py are empty. Stitchee considers the file "not empty" here because the variable's single value is neither a fill value nor null.
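
The effect can be sketched with a simplified emptiness check. This is not Stitchee's actual implementation; the function `looks_empty` and its `fill_value` parameter are assumptions made for illustration. It treats a variable as empty only when every element is NaN or equals the declared fill value, which is the kind of test that a lone 0 in an int32 variable would fail.

```python
import numpy as np


def looks_empty(values, fill_value=None):
    """Hypothetical check: True if every element is NaN or the fill value."""
    values = np.asarray(values)
    if values.size == 0:
        return True
    is_missing = np.zeros(values.shape, dtype=bool)
    # NaN only exists for floating-point (and complex) dtypes.
    if np.issubdtype(values.dtype, np.inexact):
        is_missing |= np.isnan(values)
    # Without a declared fill value there is nothing else to compare against.
    if fill_value is not None:
        is_missing |= values == fill_value
    return bool(is_missing.all())


# An int32 variable with no '_FillValue' that ended up holding 0:
print(looks_empty(np.zeros(1, dtype=np.int32)))  # False

# The same variable filled with a value declared as the fill:
filled = np.full(1, -2147483647, dtype=np.int32)
print(looks_empty(filled, fill_value=-2147483647))  # True
```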

Steps to reproduce


The following request currently fails: https://harmony.uat.earthdata.nasa.gov/C1262899916-LARC_CLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?forceAsync=true&granuleId=G1269044803-LARC_CLOUD%2CG1269044708-LARC_CLOUD%2CG1269044681-LARC_CLOUD%2CG1269044688-LARC_CLOUD%2CG1269044514-LARC_CLOUD%2CG1269044741-LARC_CLOUD%2CG1269044710-LARC_CLOUD%2CG1269044439-LARC_CLOUD%2CG1269044715-LARC_CLOUD%2CG1269044815-LARC_CLOUD%2CG1269044726-LARC_CLOUD%2CG1269044787-LARC_CLOUD%2CG1269044827-LARC_CLOUD%2CG1269044658-LARC_CLOUD%2CG1269044679-LARC_CLOUD%2CG1269044727-LARC_CLOUD&subset=lat(32.56485%3A42.82943)&subset=lon(-135.7248%3A-52.76692)&subset=time(%222024-08-02T00%3A00%3A00.000Z%22%3A%222024-08-02T10%3A39%3A37.000Z%22)&concatenate=true&skipPreview=true

Desired change


An appropriate fill or null value for each variable's dtype should be used when creating an "empty" dataset.

I think that means the dataset copy in l2ss should either:

  • take into account the dtype, and use the appropriate default _FillValue for that dtype to begin with (such as from netCDF4.default_fillvals), or
  • catch the invalid type warning, and then determine an appropriate _FillValue
