fix: Improve consistency between `read_csv` and `scan_csv` #22876
Conversation
Codecov Report. Attention: Patch coverage is

```
@@            Coverage Diff            @@
##             main   #22876    +/-   ##
=========================================
  Coverage   80.99%   80.99%
=========================================
  Files        1674     1671      -3
  Lines      236899   236959     +60
  Branches     2787     2792      +5
=========================================
+ Hits       191882   191933     +51
- Misses      44350    44356      +6
- Partials      667      670      +3
```
```rust
// check malformed: quotes are not allowed in unenclosed fields
if self.quoting {
    let has_quote_mask =
        simd_bytes.simd_eq(self.simd_quote_char).to_bitmask();
```
I don't think it's worth taking the performance hit of this extra code in the hot loop for the malformed-quote case, especially as it affects performance in benchmarks.

I think the best strategy here is to just "let it fail", i.e. proceed to parsing as usual and allow it to fail.

With regard to the original issue at #22876, I believe the key requirement on our side is that we behave consistently (i.e. return the same result for `read_csv` / `scan_csv` / `scan_csv().select(nth(0))`). It's fine even if the behavior isn't correct, since the CSV itself is malformed.
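To make the consistency requirement concrete, here is a stdlib-only illustration (not the Polars implementation): a lenient parser gives *some* deterministic answer for a malformed line, and consistency just means every code path (full read vs. projected read) returns that same answer.

```python
import csv
import io

# Quotes inside unenclosed fields are invalid per RFC 4180, but Python's
# csv module (non-strict) simply keeps them as literal characters.
malformed = 'a"b,c"d,e\n'

# Full read (analogous to read_csv materializing every column).
full = next(csv.reader(io.StringIO(malformed)))

# Projected read (analogous to scan_csv().select(nth(0))): parse the same
# way, then keep only column 0 -- the value must match the full read.
projected = next(csv.reader(io.StringIO(malformed)))[0]

print(full)       # ['a"b', 'c"d', 'e']
print(projected)  # a"b
assert projected == full[0]
```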
Thanks for clarifying. I simplified the change and took out the hot-loop, field-based detection of double quotes in an unenclosed field (for both SIMD and non-SIMD).

Detecting malformed data is now limited to row-count mismatches and will not panic. This makes `read_csv` and `scan_csv` consistent. However, it means we are blind to many malformed patterns, e.g. `a"b,c"d,e` is malformed but will not be detected.

In addition, `read_csv` and `scan_csv` are not consistent with `scan_csv/select(pl.nth())`, since the second pass uses two types of logic depending on whether all projected columns have been completed or not. If we want to fix that, we may need to swap `skip_this_line()` with the `SplitFields` iterator for parsing consistency. Thoughts?

*edit: Note that it should be easy to track in pass 1 whether the chunk has any quotes at all, offering an option to check more deeply only if quotes have been detected. If we do this, there should not be any performance impact on CSV files without quotes. Just a thought.
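The two-pass idea above can be sketched as follows (hypothetical helper names, not the Polars code): pass 1 only records whether the chunk contains a quote byte at all, and the more expensive per-field check runs only when it does, so quote-free files pay almost nothing.

```python
def chunk_has_quotes(chunk: bytes, quote: bytes = b'"') -> bool:
    # Pass 1: a single cheap scan (memchr-style substring search).
    return quote in chunk

def quote_in_unenclosed_field(line: bytes, sep: bytes = b",", quote: bytes = b'"') -> bool:
    # Deeper check: a field counts as "enclosed" only if it *starts* with a
    # quote; a quote anywhere else marks the line as malformed.
    # NOTE: this naive split is wrong for separators inside enclosed fields;
    # a real parser must track quoting state while splitting.
    for field in line.split(sep):
        if not field.startswith(quote) and quote in field:
            return True
    return False

chunk = b'a"b,c"d,e'
if chunk_has_quotes(chunk):                       # only then check deeper
    print(quote_in_unenclosed_field(chunk))       # True: malformed
print(chunk_has_quotes(b"a,b,c"))                 # False: skip deep check
```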
> we may need to swap `skip_this_line()` with the `SplitFields` iterator for parsing consistency.

I don't think the performance impact of that would be acceptable. If there's no other way to give the same output for all cases, I think it's fine to just rely on the fact that we always error by default when we see that the height of the parsed DataFrame doesn't match. I would also change the title of this PR to "raise error by default on invalid CSV quotes".
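The fallback check being relied on here can be sketched as a simple guard (hypothetical names, not the Polars API): instead of detecting bad quoting in the hot loop, compare the parsed height against the line count from the counting pass and raise by default on a mismatch.

```python
def check_chunk_height(parsed_height: int, counted_lines: int) -> None:
    # Malformed quoting desynchronizes the parser from the line counter,
    # so a mismatch is a cheap, after-the-fact malformed-CSV signal.
    if parsed_height != counted_lines:
        raise ValueError(
            f"CSV appears malformed: parsed {parsed_height} rows "
            f"but counted {counted_lines} lines"
        )

check_chunk_height(10, 10)        # well-formed chunk: passes silently
try:
    check_chunk_height(9, 10)     # desynchronized chunk: errors by default
except ValueError as exc:
    print(exc)
```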
```rust
@@ -441,7 +450,11 @@ impl<'a> CoreReader<'a> {
            let result = slf
                .read_chunk(b, projection, 0, count, Some(0), b.len())
                .and_then(|mut df| {
                    debug_assert!(df.height() <= count);
                    if df.height() > count {
```
```diff
-                    if df.height() > count {
+                    if df.height() != count {
```
I had that originally, but it fails when the CSV has comment lines inside a chunk (see tests `test_csv_multi_char_comment` and `test_count_csv_comment_char`). The clean way would be to have `read_chunk` return (or mutate) `rows_parsed` in addition to the `DataFrame` and make the check strict (`rows_parsed != count`). Ok with that?
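The `rows_parsed` proposal can be sketched like this (a toy reader, not the Polars `read_chunk` signature): the reader reports how many physical rows it consumed alongside the parsed data, so the caller can make the strict `rows_parsed != count` check even when comment lines legitimately shrink the DataFrame height.

```python
def read_chunk(lines: list[str], comment_prefix: str = "#") -> tuple[list[list[str]], int]:
    rows: list[list[str]] = []
    rows_parsed = 0
    for line in lines:
        rows_parsed += 1                    # every physical line is consumed
        if line.startswith(comment_prefix):
            continue                        # comment lines produce no row
        rows.append(line.split(","))
    return rows, rows_parsed

data = ["a,b", "# a comment", "c,d"]
rows, rows_parsed = read_chunk(data)
print(len(rows), rows_parsed)               # 2 3: height shrank, count didn't
assert rows_parsed == len(data)             # strict check still passes
```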
I think you can add a check for `comment_char` to the if statement here then:

`height > count || (height < count && !has_comment_char)`

Sharing WIP - some tests are failing until resolved.
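The suggested predicate can be sketched in isolation (not the Polars code): when a comment char is configured, a DataFrame shorter than the line count can be legitimate, so only flag a height overshoot, or an undershoot when no comment char exists to explain it.

```python
def height_mismatch(height: int, count: int, has_comment_char: bool) -> bool:
    # height > count is always malformed; height < count is malformed only
    # when comment lines cannot account for the missing rows.
    return height > count or (height < count and not has_comment_char)

print(height_mismatch(12, 10, False))  # True: more rows than counted lines
print(height_mismatch(8, 10, True))    # False: comments may drop rows
print(height_mismatch(8, 10, False))   # True: rows lost with no comments
```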
fixes #22395

This commit improves consistency between `read_csv()` and `scan_csv()` for some types of malformed CSV data, issuing a warning when such data is detected. A decision is needed on the scope, as this may impact performance; see below.

In scope:

a) (Cheap) Return a warning instead of panicking when `read_chunk()` returns more lines than `CountLines` found, and add a similar warning to `scan_csv()`. This makes `read_csv` consistent with `scan_csv`. We may want to error instead?
b) Updated documentation.
c) Added tests.

TBD in/out of scope (currently in):

d) (Not cheap) Lazy_quotes_1: looks for quotes in unenclosed fields in the `SplitFields` iterator; detect and warn when quotes are found in an unenclosed field, except for columns to the RHS of the last projected column.

Out of scope:

e) Allow lazy_quotes_1 with projection, specifically when columns to the RHS of the last projected column have an odd number of quotes.
f) Lazy_quotes_2: does not (yet) check for single quotes in enclosed fields, as this is more ambiguous and further impacts performance. The exception is the line count check above.
g) Leading whitespace in fields.
A performance regression was observed for the unenclosed-fields lazy_quotes_1 check (d above). Caveat: my dev machine is not well suited for benchmarking; it is a lightweight portable laptop running Windows + WSL. It shows some performance regression on a single large CSV file (NYC dataset, 24M records, short fields, unenclosed), on the order of +9% in instruction count and +15% in wall time.

Details: (non-representative) performance assessment
- with malformed check (nodebug-release): commit 1d60b0488b29fe6c7d4292a7f49da6e1f025ca23
- without malformed check (nodebug-release): commit 5be0a9b
General notes on performance: overall, I suspect some improvements are still possible for `read_csv` if we consider the following:

(A) Internal
(B) User facing: offer a `fast-mode` option to end users in `read_csv()`. When enabled, Polars asserts well-formed data and ignores all quotes.

Combined, this would present three approaches: (i) fast mode (expects well-formed data, no quotes, no checks); (ii) default mode (expects well-formed data including quotes, limited checking); (iii) forgiving mode (feature flags to allow RFC exceptions).

Kindly review: `warning` vs `error`. My take is to start with `warning` now and move to `error` when forgiving feature flags are available.