-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
read_csv
and scan_csv
producing different result on CSV fields with invalid quoting
#22395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Notes import polars as pl
pl.read_csv(b"""\
a,b
1,2"3
4,5
""") Panics on debug at
|
scan_csv is correct here. As the second value is malformed and the new line should never be respected as it is in the middle of a quote field that never finishes. |
I can give this a try, with some input from the core team on expected behavior and implementation. Observations:
malformed = b"""\
a,b
1,2"3
4,5
"""
Design goals:
Decision points:
(*) for the original
and illustration of the control flag from another tool (
If the core team can weigh in what we want, I can work on the code. Should be doable once we have clarity on the desired behavior. |
Does it really need an additional flag or should this just fall under the I think there's an additional wrinkle. Shouldn't these two return the same number of rows: (pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
.select(pl.nth(1))
.collect()
)
# returns 2 rows of "2"3", "5" in contrast to (pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
.select(pl.nth(0))
.collect()
)
# returns just 1 row 1 and also shouldn't the order of the malformed column not matter? (pl.scan_csv(b"""2"3,1\n5,4""", has_header=False)
.select(pl.nth(1))
.collect()
)
# returns 2 rows
(pl.scan_csv(b"""2"3,1\n5,4""", has_header=False)
.select(pl.nth(0))
.collect()
)
# returns 2 rows Maybe it should raise for any RFC1480 violations and if |
We should read in as 2 rows for the example in the issue post, regardless of the projection. I believe this is also the simplest solution - this isn't a case that we want to make any large changes for as the CSV is technically malformed. |
read_csv
and scan_csv
producing different resultread_csv
and scan_csv
producing different result on malformed CSV
read_csv
and scan_csv
producing different result on malformed CSVread_csv
and scan_csv
producing different result on CSV fields with invalid quoting
If we are okay with this and consider this important enough, I will go ahead and prepare a code change whereby:
This should make everything consistent (including all cases as above), which is the real priority imo. It is not as easy as it sounds though - the reader/parser logic is split over multiple places. |
This isn't an easy one. I already looked into it and it requires changes to the simd code. You would always need to do more work during simd to be able to raise the error on invalid files. |
Thanks @ritchie46 . After further investigation, here is the proposed plan forward. The moment we deviate from 'strict' (RFC4180), things get messy, yet this is common practice in many environments and tools. Some of the following parsing features emerge (hereofter 'quote' is typically the 1-byte doublequote character):
Polars assumes strict (RFC4180), but has slightly inconsistent approaches for handling malformed data: (a) in Now, changing the core logic is likely to trigger regression issues (functional and/or performance). Therefore, the proposed plan forward:
In addition:
Feedback appreciated, hopefully this is on the right track (my first deep-dive into the world of SIMD). |
I don't want to deviate from rfc1480.
This isn't allowed. Only we don't recognize the error. I think the way forward is recognizing the error on the initial pass. I believe we alreayd have the SIMD information anyhow. So I agree with point 1, but not with point 2. I don't think we should go down that path. People then should fix there csv-files with fixing software. |
Checks
Reproducible example
Log output
Issue description
Follow on from the initial report #22392
With
.collect(projection_pushdown=False)
(orread_csv
) 2 rows are produced.2 rows are also produced with
quote_char
set to the empty string.Expected behavior
Same result.
Installed versions
main
The text was updated successfully, but these errors were encountered: