Bug when row-filtering with null values? #957

@yohplala

Description

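import os

import pandas as pd
from pandas.testing import assert_frame_equal

from fastparquet import ParquetFile, write


# `tempdir` is a pytest fixture that provides a temporary directory.
def test_categorical_with_nulls_and_filters(tempdir):
    """Test categorical data with nulls and read with filters"""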
    fn = os.path.join(str(tempdir), 'test.parquet')
    # Create DataFrame with categorical and nullable columns
    df = pd.DataFrame({
        'cat_col': ['A', 'B', None, 'C'] * 2,
        'filter_col': list(range(8)),
        'nullable_int': pd.array([1, None, 3, 4] * 2, dtype="Int64")
    })
    df['cat_col'] = df['cat_col'].astype('category')
    print("df")
    print(df)
    # Write DataFrame
    write(fn, df, file_scheme='hive', row_group_offsets=[0, 4])
    # Test reading with row_filter and value filter
    pf = ParquetFile(fn)
    # Test with row_filter=True and filter_col > 6
    df_filtered = pf.to_pandas(filters=[('filter_col', '>', 6)], row_filter=True)
    expected = df[df['filter_col'] > 6].reset_index(drop=True)
    print("df_filtered")
    print(df_filtered)
    print("expected")
    print(expected)
    assert_frame_equal(df_filtered, expected)

Running this test shows:

testing.pyx:173: AssertionError
---------------------------- Captured stdout call -----------------------------
df
  cat_col  filter_col  nullable_int
0       A           0             1
1       B           1          <NA>
2     NaN           2             3
3       C           3             4
4       A           4             1
5       B           5          <NA>
6     NaN           6             3
7       C           7             4

df_filtered
  cat_col  filter_col   nullable_int
0     NaN           7  2267141176816

expected
  cat_col  filter_col  nullable_int
0       C           7             4
=========================== short test summary info ===========================
FAILED fastparquet/test/test_output.py::test_categorical_with_nulls_and_filters - AssertionError: DataFrame.iloc[:, 0] (column name="cat_col") are different

DataFrame.iloc[:, 0] (column name="cat_col") values are different (100.0 %)
[index]: [0]
[left]:  [NaN]
Categories (3, object): ['A', 'B', 'C']
[right]: ['C']
Categories (3, object): ['A', 'B', 'C']
At positional index 0, first diff: nan != C
============================== 1 failed in 8.17s ==============================

Describe the issue:

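This test appears to show a limitation when combining row filtering with null values in the same DataFrame: the single filtered row comes back with NaN in cat_col instead of 'C', and with 2267141176816 in nullable_int instead of 4.
I have pushed this test case in PR #956 and will try to investigate.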
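
In the meantime, a possible workaround is to let fastparquet prune only whole row groups and to apply the row-level condition in pandas afterwards. The sketch below is untested against this exact setup and does not fix the underlying issue. The helper names `apply_row_mask` and `read_with_pandas_filter` are hypothetical, and it assumes that leaving `row_filter` at its default avoids the faulty code path.

```python
import pandas as pd


def apply_row_mask(df: pd.DataFrame, column: str, threshold: int) -> pd.DataFrame:
    """Keep rows where `column` > `threshold`, using plain pandas."""
    return df[df[column] > threshold].reset_index(drop=True)


def read_with_pandas_filter(path: str, column: str, threshold: int) -> pd.DataFrame:
    """Hypothetical helper: prune row groups in fastparquet, filter rows in pandas."""
    # Imported here so that apply_row_mask stays usable without fastparquet.
    from fastparquet import ParquetFile

    pf = ParquetFile(path)
    # Without row_filter=True, `filters` only skips whole row groups,
    # so the returned frame may still contain non-matching rows.
    coarse = pf.to_pandas(filters=[(column, ">", threshold)])
    # Assumption: the row-level selection is safe to do in pandas.
    return apply_row_mask(coarse, column, threshold)
```

With the file from the test above, `read_with_pandas_filter(fn, 'filter_col', 6)` would be the replacement for the `pf.to_pandas(..., row_filter=True)` call.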

Environment:

  • Python version: 3.13
  • Operating System: Windows
  • Install method (conda, pip, source): a mix of a source install and unzipping Windows wheels to retrieve the compiled code.
