
[Converter] Equality Deletes Conversion with Enforce Primary Key Uniqueness support #552


Merged

Conversation

Zyiqin-Miranda
Member

Summary

This PR adds support for converting equality deletes while enforcing primary key uniqueness. Test cases have been added to verify correctness.

How it works
For an Iceberg table, within each partition (a code sketch of steps 4 through 7 follows this list):

  1. Fetch all equality deletes and data files that belong to the partition.
  2. Group equality deletes with the data files they apply to. Per the Iceberg spec, an equality delete applies to all data files that have a strictly smaller sequence number.
  3. Sort the equality delete files and data files by sequence number. When downloaded, the data file table gains additional file_path and ordered record index columns.
  4. After converting, we produce position delete table A, which deletes the records matched by the equality deletes. Data table a contains all remaining records.
  5. Append any data files that were not affected by equality deletes to data table a, producing data table b.
  6. Drop duplicates on data table b, appending a unique record index based on the final table so that which record is kept stays deterministic. This produces position delete table B, which represents the duplicate primary keys to be deleted.
  7. The final output is a set of position delete files containing position deletes A + B.
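The following PyArrow sketch illustrates steps 4 through 7. It is a minimal illustration under assumed names, not the converter's actual API: the function convert_equality_deletes and the columns primary_key, file_path, and record_idx are hypothetical for this example.

import pyarrow as pa
import pyarrow.compute as pc

def convert_equality_deletes(
    data_table: pa.Table,        # concatenated data files, already sorted by
                                 # sequence number, with "file_path" and
                                 # "record_idx" columns added at download time
    equality_deletes: pa.Table,  # table of primary keys to delete
    pk: str = "primary_key",     # assumed primary key column name
) -> tuple[pa.Table, pa.Table]:
    """Return (position_delete_table, surviving_unique_data_table)."""
    # Step 4: rows whose primary key matches an equality delete become
    # position deletes (table A); the remaining rows form data table a/b.
    deleted = pc.is_in(data_table[pk], value_set=equality_deletes[pk])
    pos_delete_a = data_table.filter(deleted).select(["file_path", "record_idx"])
    remaining = data_table.filter(pc.invert(deleted))

    # Step 6: enforce primary key uniqueness. A global row index plus a
    # group-by deterministically keeps one record per key (here, the last
    # in sort order) and position-deletes the rest (table B).
    remaining = remaining.append_column(
        "global_idx", pa.array(range(len(remaining)), type=pa.int64()))
    kept_idx = remaining.group_by(pk).aggregate([("global_idx", "max")])
    kept = pc.is_in(remaining["global_idx"],
                    value_set=kept_idx["global_idx_max"])
    pos_delete_b = remaining.filter(pc.invert(kept)).select(
        ["file_path", "record_idx"])
    unique_data = remaining.filter(kept).drop(["global_idx"])

    # Step 7: the final output combines position deletes A + B.
    return pa.concat_tables([pos_delete_a, pos_delete_b]), unique_data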


Checklist

  • Unit tests covering the changes have been added
    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed


@pdames pdames self-requested a review May 20, 2025 18:16
Member

@pdames pdames left a comment


LGTM! Just a few minor comments to address, then feel free to merge. Cool to see Iceberg equality delete support working together with enforcement of unique primary keys!

@@ -639,3 +645,195 @@ def test_converter_pos_delete_multiple_identifier_fields_success(

# Assert elements are the same, disregarding ordering in the list
assert sorted(pk_combined_res) == sorted(expected_result_tuple_list)


Member

@pdames pdames May 20, 2025


Can we add a test case to ensure that our existing file-sequence-number-based sort is stable when multiple files share the same sequence number?

Member Author


Good catch! I'll add a secondary sort key based on file path. I noticed that the files Spark writes out have a prefix in the file name representing the order of the file within the same snapshot, e.g. s3://xxxx/partitionKey=8/primaryKey_bucket=83/00000-58-3229fae0-4316-4ea4-9e09-367a5d1b96f9-00001.parquet, where the prefix is observed to be the order in which files were added within this transaction.
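For illustration, the tie-breaking sort could look like this (the attribute names sequence_number and file_path are assumptions for this sketch):

# Primary sort on sequence number; break ties on file path so the order
# stays deterministic when multiple files share the same sequence number.
sorted_files = sorted(all_files, key=lambda f: (f.sequence_number, f.file_path))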

Comment on lines 222 to 224
f"Length of data file table remaining plus length of pos delete table should match origin data file table length"
f"But got {len(position_delete_table)} pos delete, {len(remaining_data_table)} equality delete, "
f"doesn't equal to original data table length: {len(data_file_table)}"
Member


Suggested change
f"Length of data file table remaining plus length of pos delete table should match origin data file table length"
f"But got {len(position_delete_table)} pos delete, {len(remaining_data_table)} equality delete, "
f"doesn't equal to original data table length: {len(data_file_table)}"
f"Expected undeleted data file record count plus length of pos deletes to match original data file record count of {len(data_file_table)}, "
f"but found {len(position_delete_table)} pos deletes + {len(remaining_data_table)} equality deletes."

Comment on lines 185 to 186
f"Length of all data files list: {len(set(all_data_files))} should be greater than"
f"Length of corresponding data files list: {len(set(data_files_downloaded))}"
Member


Suggested change
f"Length of all data files list: {len(set(all_data_files))} should be greater than"
f"Length of corresponding data files list: {len(set(data_files_downloaded))}"
f"Length of all data files ({len(set(all_data_files))}) should never be less than "
f"the length of candidate equality delete data files ({len(set(data_files_downloaded))})"

replace_delete_snapshot.append_data_file(data_file)
if to_be_deleted_files:
for delete_file in to_be_deleted_files:
print(f"debug_delete_file_snapshot:{delete_file}")
Member


Should this be removed (or converted to a debug log)?
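For example, the print could become a module-level debug log; a sketch using the standard logging module (deltacat's own logger helper may differ):

import logging

logger = logging.getLogger(__name__)

for delete_file in to_be_deleted_files:
    logger.debug("delete_file_snapshot: %s", delete_file)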

with append_delete_files_override(tx.update_snapshot()) as append_snapshot:
if new_position_delete_files:
for data_file in new_position_delete_files:
print(f"debug_append_snapshot_data_file:{data_file}")
Member


Should this be removed (or converted to a debug log)?

@Zyiqin-Miranda
Member Author

Zyiqin-Miranda commented May 28, 2025

Thanks @pdames, could you review the changes in this commit? It passes the filesystem through for the case where the path doesn't contain enough information to infer which filesystem it is, but the user has explicitly passed one in.
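A sketch of the intent, assuming resolve_path_and_filesystem is built on pyarrow.fs and takes an optional filesystem argument (the actual signature in the PR may differ):

from pyarrow import fs

def resolve_path_and_filesystem(path, filesystem=None):
    # Prefer an explicitly supplied filesystem; only infer one from the
    # path when none is given (inference fails for scheme-less paths).
    if filesystem is not None:
        return path, filesystem
    inferred_fs, relative_path = fs.FileSystem.from_uri(path)
    return relative_path, inferred_fs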

Member

@pdames pdames left a comment


Good catch on the missing user-specified filesystem assignment in pyarrow utils!

@Zyiqin-Miranda Zyiqin-Miranda merged commit 055cb9c into ray-project:2.0 May 30, 2025
3 checks passed
@Zyiqin-Miranda Zyiqin-Miranda deleted the find-deletes-with-duplicates branch May 30, 2025 19:47
Zyiqin-Miranda added a commit to Zyiqin-Miranda/deltacat that referenced this pull request May 30, 2025
…ueness support (ray-project#552)

* Rebase changes

* [Converter] Add assertion for correctness; Add aggregate stats and additional logging; Code clean-up

* [Converter] Equality Convertion with Enforce Primary Key Uniqueness support

* [Converter] Additional code clean-up

* Address comments

* [Bug fix] Pass filesystem to resolve_path_and_filesystem function

* Remove print statement

---------

Co-authored-by: Miranda <[email protected]>