Skip to content

Regression in substr performance from 37.1.0 to 38.0.0 #712

Closed
@timsaucer

Description

@timsaucer

Describe the bug

In 37.1.0 and earlier the substr function in python called the built in scalar function Substr, which allowed for being called with either two or three arguments (string, start position, length). In 38.0.0 this now uses a macro that generates a single form that only takes two arguments and the user needs to call substring if they want to specify length.

To Reproduce

import datafusion
from datafusion import lit, functions as F
ctx = datafusion.SessionContext()
df = ctx.from_pydict({"a": [1]})
df.select(F.substr(lit("test text"), lit(0), lit(3)).alias("subst")).show()

In 37.1.0 produces:

DataFrame()
+-------+
| subst |
+-------+
| te    |
+-------+

In 38.0.0 produces:

Traceback (most recent call last):
  File "/Users/tsaucer/src/personal/datafusion-python/examples/test_substr.py", line 22, in <module>
    df.select(F.substr(lit("test text"), lit(0), lit(3)).alias("subst")).show()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: substr() takes 2 positional arguments but 3 were given

Expected behavior

No change in performance, but it is probably better to have users update to calling substring. Currently this function is not exposed, but it is included as part of PR #711

Additional context

This two line commit will resolve the issue:

259f79c

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions