
Feat: support bit_count function #1602


Merged 17 commits into apache:main on May 30, 2025

Conversation

kazantsev-maksim
Contributor

@kazantsev-maksim kazantsev-maksim commented Apr 2, 2025

Which issue does this PR close?

Related to Epic: #240
bit_count: SELECT bit_count(0) => 0
DataFusion Comet's bit_count has the same behavior as Spark's bit_count function.
Spark docs: https://spark.apache.org/docs/latest/api/sql/index.html#bit_count

Closes #.

Rationale for this change

Defined under Epic: #240

What changes are included in this PR?

bitwise_count.rs: implementation of the bit_count function.
planner.rs: maps Spark's bit_count expression to the DataFusion Comet bit_count physical expression.
expr.proto: bit_count has been added.
QueryPlanSerde.scala: a bit_count pattern-matching case has been added.
CometExpressionSuite.scala: a new UT has been added for the bit_count function.
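The semantics the new expression has to match can be sketched in plain Rust (a conceptual sketch, not the PR's actual code): Spark's bit_count is a population count over the two's-complement representation of the input.

```rust
// Conceptual sketch of Spark-compatible bit_count semantics:
// count the set bits in the two's-complement representation.
fn bit_count(v: i64) -> i32 {
    v.count_ones() as i32
}

fn main() {
    assert_eq!(bit_count(0), 0);   // SELECT bit_count(0) => 0
    assert_eq!(bit_count(7), 3);   // 0b111
    assert_eq!(bit_count(-1), 64); // all 64 bits set in two's complement
}
```

Note in particular the negative case: because the count runs over the two's-complement bits, bit_count(-1) is the full bit width of the type, not 1.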

How are these changes tested?

A new UT has been added.

val table = "bitwise_count_test"
withTable(table) {
  sql(s"create table $table(col1 long, col2 int, col3 short, col4 byte) using parquet")
  sql(s"insert into $table values(1111, 2222, 17, 7)")
Contributor

Do you mind adding random number cases?

Contributor Author

Yep. Added tests with random data.
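A property-style version of such random-value cases can be sketched in plain Rust (illustrative only; the PR's actual tests are Scala/Spark SQL): for pseudo-random inputs, the built-in population count must agree with a naive per-bit sum.

```rust
// Property-style sketch: for pseudo-random inputs, count_ones must
// agree with a naive per-bit sum over the two's-complement bits.
fn bit_count(v: i64) -> i32 {
    v.count_ones() as i32
}

fn naive_bit_count(v: i64) -> i32 {
    (0..64).map(|i| ((v >> i) & 1) as i32).sum()
}

fn main() {
    // xorshift64 keeps the example dependency-free
    let mut s: u64 = 0x9E37_79B9_7F4A_7C15;
    for _ in 0..1_000 {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        let v = s as i64;
        assert_eq!(bit_count(v), naive_bit_count(v));
    }
}
```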

Contributor

Can we also add a test with a Parquet file not created by Spark (see makeParquetFileAllTypes) and see if this reports correct results with unsigned int columns?

Contributor

CometTestBase.makeParquetFileAllTypes has all the integer types you may want to test.
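For context on why unsigned columns are the interesting case (a conceptual sketch; the assumption here is that Spark widens unsigned Parquet integers such as UINT32 to the next larger signed type): bit_count must see the widened non-negative value, because reinterpreting the raw bits as signed and then widening would sign-extend and change the answer.

```rust
// Sketch: an unsigned 32-bit Parquet value exposed as a signed 64-bit
// value keeps its bit count, while reinterpret-then-widen sign-extends
// and inflates it.
fn main() {
    let raw: u32 = u32::MAX;
    let widened = raw as i64;        // 4294967295, the widened non-negative value
    let reinterpreted = raw as i32;  // -1 if the bits are reinterpreted as signed
    assert_eq!(widened.count_ones(), 32);                // expected result
    assert_eq!((reinterpreted as i64).count_ones(), 64); // sign-extended: wrong
}
```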

@codecov-commenter

codecov-commenter commented Apr 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59.47%. Comparing base (f09f8af) to head (b73149d).
Report is 218 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1602      +/-   ##
============================================
+ Coverage     56.12%   59.47%   +3.34%     
- Complexity      976     1147     +171     
============================================
  Files           119      128       +9     
  Lines         11743    12532     +789     
  Branches       2251     2356     +105     
============================================
+ Hits           6591     7453     +862     
+ Misses         4012     3889     -123     
- Partials       1140     1190      +50     


@kazuyukitanimura (Contributor) left a comment

Sorry that this one has dragged on for so long; I hope we can merge soon, but one more question:

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
    let arg = self.arg.evaluate(batch)?;
    match arg {
        ColumnarValue::Array(array) => {
Contributor

Any possibility that we get a dictionary?

Contributor Author

It seems impossible to me, but if you have a case study on how this can be tested, I'm ready to do it.

Contributor

You will need many repeated values in order to create dictionary values. If you use something like https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L940 you should be able to test it.
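The idea behind dictionary-encoded input can be sketched without Arrow (a plain-Rust sketch of the concept, not Comet's actual handling): repeated values are stored once in a small values buffer, and per-row keys index into it, so the kernel only needs to run over the values.

```rust
// Plain-Rust sketch of dictionary encoding: repeated values are stored
// once in `values`, and `keys` index into them. Applying bit_count to
// the values buffer alone covers every row.
fn bit_count(v: i64) -> i32 {
    v.count_ones() as i32
}

fn main() {
    let values = vec![7i64, -1, 0];        // distinct values (small buffer)
    let keys = vec![0usize, 0, 1, 2, 0];   // many repeats make encoding worthwhile
    let counted: Vec<i32> = values.iter().map(|&v| bit_count(v)).collect();
    let result: Vec<i32> = keys.iter().map(|&k| counted[k]).collect();
    assert_eq!(result, vec![3, 3, 64, 0, 3]);
}
```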

@mbutrovich
Contributor

mbutrovich commented Apr 30, 2025

> DataFusion Comet's bit_count has the same behavior as Spark's bit_count function

If this is the case, can we delegate to a ScalarFunc expression instead of creating a new one? Similar to:
https://github.com/apache/datafusion-comet/pull/1700/files#diff-602f1658f4020a9dc8a47f49ac1411735d56d6612aa971751435104e301a9035

Kazantsev Maksim added 3 commits May 1, 2025 18:09
@kazantsev-maksim
Contributor Author

@mbutrovich I couldn't find a built-in implementation of bit_count in the DataFusion project, but I rewrote it using ScalarFunc without adding a new proto expr.

let operand = $OPERAND
    .as_any()
    .downcast_ref::<$DT>()
    .expect("compute_op failed to downcast array");
Contributor

Perhaps report $DT as well in case of failure?
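The suggestion can be sketched with `std::any` (illustrative only; the PR's actual code is a macro over Arrow array types, and `downcast_or_report` is a hypothetical helper name): include the expected type name in the panic message so a failed downcast reports what it was trying to downcast to.

```rust
use std::any::Any;

// Hypothetical helper: like `.expect(...)`, but the panic message also
// names the type the downcast was attempting.
fn downcast_or_report<T: Any>(v: &dyn Any) -> &T {
    v.downcast_ref::<T>().unwrap_or_else(|| {
        panic!(
            "compute_op failed to downcast array to {}",
            std::any::type_name::<T>()
        )
    })
}

fn main() {
    let boxed: Box<dyn Any> = Box::new(42i64);
    assert_eq!(*downcast_or_report::<i64>(boxed.as_ref()), 42);
}
```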

Contributor Author

fixed


}
}

test("bitwise_count - random values (native parquet gen)") {
@kazantsev-maksim (Contributor Author) commented May 28, 2025

> CometTestBase.makeParquetFileAllTypes has all the integer types you may want to test.

@parthchandra I've already added the test using makeParquetFileAllTypes, or does it still need to be improved?

Contributor

Oh, I must have missed this. This is sufficient, I think. (Thank you!)


@kazantsev-maksim
Contributor Author

Thank you for the feedback! @kazuyukitanimura @parthchandra

@kazuyukitanimura kazuyukitanimura merged commit 53c724e into apache:main May 30, 2025
69 checks passed
@kazuyukitanimura
Contributor

Merged Thank you @kazantsev-maksim @parthchandra @mbutrovich
