
fix: Improve consistency between read_csv and scan_csv #22876


Draft · kdn36 wants to merge 5 commits into main

Conversation

@kdn36 kdn36 commented May 22, 2025

fixes #22395

This PR improves consistency between read_csv() and scan_csv() by issuing a warning for some types of malformed CSV data instead of panicking. A decision is needed on the scope, as the extra checking may impact performance; see below. A minimal reproduction follows.
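
For context, a minimal Python sketch in the spirit of #22395 (the file name and malformed content are illustrative, and the exact outcome depends on the polars version):

```python
import polars as pl

# An unenclosed field containing a quote character is malformed per RFC 4180.
# Before this change, read_csv and scan_csv could disagree on such input
# (one raising or misparsing where the other succeeded).
with open("malformed.csv", "wb") as f:  # illustrative file name
    f.write(b'a,b,c\n1,2"x,3\n4,5,6\n')

eager = pl.read_csv("malformed.csv")
lazy = pl.scan_csv("malformed.csv").collect()
print(eager.equals(lazy))  # goal: identical results, or the same warning/error
```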

In scope:
a) (Cheap) Return a warning instead of panicking when read_chunk() returns more lines than CountLines found, and add a similar warning to scan_csv(). This makes read_csv consistent with scan_csv. We may want to error instead?
b) Updated documentation.
c) Added tests.

TBD in/out of scope (currently in):
d) (Not cheap) Lazy_quotes_1: look for quotes in unenclosed fields in the SplitFields iterator; detect and warn when quotes are found in an unenclosed field, except for columns to the right of the last projected column.

Out of scope:
e) Allow lazy_quotes_1 with projection, specifically when columns to the right of the last projected column have an odd number of quotes.
f) Lazy_quotes_2: does not (yet) check for single quotes in enclosed fields, as this is more ambiguous and further impacts performance. The exception is the line count check above.
g) Leading whitespace in fields.

A performance regression was observed for the unenclosed-fields lazy_quotes_1 check (d above). Caveat: my dev machine is not good for benchmarking; it is a lightweight portable laptop running Windows + WSL. It shows some regression on a single large CSV file (NYC dataset, 24M records, short unenclosed fields) on the order of +9% in instruction count and +15% in wall time.

Details: (non-representative) performance assessment

(with malformed check)
(nodebug-release)
commit 1d60b0488b29fe6c7d4292a7f49da6e1f025ca23

denecker@DESKTOP-V7KT63E:~/my-polars/issue_22395$ ~/poop/zig-out/bin/poop 'python nyc_read.py'
Benchmark 1 (3 runs): python nyc_read.py
  measurement          mean ± σ            min … max           outliers
  wall_time          6.14s  ±  723ms    5.47s  … 6.91s           0 ( 0%)
  peak_rss           8.43GB ± 6.33MB    8.43GB … 8.44GB          0 ( 0%)
  cpu_cycles          102G  ±  487M      101G  …  102G           0 ( 0%)
  instructions        141G  ±  395K      141G  …  141G           0 ( 0%)
  cache_references    561M  ± 34.0M      541M  …  600M           0 ( 0%)
  cache_misses        333M  ± 4.69M      327M  …  336M           0 ( 0%)
  branch_misses       306M  ± 9.31M      298M  …  316M           0 ( 0%)

(without malformed check)
(nodebug-release)
commit 5be0a9b

denecker@DESKTOP-V7KT63E:~/my-polars/issue_22395$ ~/poop/zig-out/bin/poop 'python nyc_read.py'
Benchmark 1 (3 runs): python nyc_read.py
  measurement          mean ± σ            min … max           outliers
  wall_time          5.33s  ±  169ms    5.22s  … 5.53s           0 ( 0%)
  peak_rss           8.43GB ±  967KB    8.43GB … 8.44GB          0 ( 0%)
  cpu_cycles         90.0G  ±  332M     89.7G  … 90.4G           0 ( 0%)
  instructions        129G  ±  463K      129G  …  129G           0 ( 0%)
  cache_references    540M  ± 17.3M      526M  …  560M           0 ( 0%)
  cache_misses        331M  ± 1.67M      329M  …  333M           0 ( 0%)
  branch_misses       288M  ± 14.1M      280M  …  305M           0 ( 0%)

General notes on performance: I suspect some overall improvements are still possible for read_csv if we consider the following.
(A) Internal

  1. Use a bounded queue between pass 1 and pass 2 to provide back-pressure. This should keep the passes in sync and the cache hot (requires investigation; see the sketch after this list).
  2. Review L3 cache size optimization (individual chunks are 16 MB, per thread?).
  3. In pass 2, consider collecting character indices for further processing rather than restarting `next()` for every field. Relevant when fields are short.
  4. I noticed SIMD_SIZE is hardcoded to 64 bytes (AVX-512), but many Intel laptops only support 32 bytes (AVX2). (FWIW, hardcoding SIMD_SIZE to 32 bytes showed some improvement on my machine.)
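
To illustrate idea (A)1, a minimal, language-agnostic sketch of bounded-queue back-pressure using Python threads; this is not polars code, and the chunk handling is only a stand-in for the real Rust passes:

```python
import queue
import threading

# A bounded queue between pass 1 (line counting) and pass 2 (field parsing):
# once `maxsize` chunks are in flight, pass 1 blocks, keeping the two passes
# in sync so recently scanned chunks stay cache-hot.
in_flight: queue.Queue = queue.Queue(maxsize=4)

def pass_1(chunks: list[bytes]) -> None:
    for chunk in chunks:
        line_count = chunk.count(b"\n")     # stand-in for CountLines
        in_flight.put((chunk, line_count))  # blocks when the queue is full
    in_flight.put(None)                     # sentinel: no more chunks

def pass_2(results: list[int]) -> None:
    while (item := in_flight.get()) is not None:
        chunk, line_count = item
        results.append(line_count)          # stand-in for field parsing

results: list[int] = []
consumer = threading.Thread(target=pass_2, args=(results,))
consumer.start()
pass_1([b"a,b\nc,d\n"] * 16)
consumer.join()
print(sum(results))  # 32 lines parsed, never more than 4 chunks in flight
```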

(B) User facing:

  1. Introduce a fast-mode option for end users in read_csv(). When enabled, Polars asserts well-formed data and ignores all quotes.
  2. Introduce feature flag(s) for lazy-quotes options with a separate code path. This protects performance on well-formed CSV. Consider memchr to manage code complexity.
    Combined, this would present three approaches (sketched below): (i) fast mode (expects well-formed data, no quotes, no checks), (ii) default mode (expects well-formed data including quotes, limited checking), (iii) forgiving mode (feature flags to allow RFC exceptions).
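
To make the three modes concrete, a sketch of how they might surface in the Python API; note that fast_mode and lazy_quotes are hypothetical parameter names for illustration only, and neither exists in read_csv today:

```python
import polars as pl

# (i) fast mode: asserts well-formed data without quoting; skips all checks.
df = pl.read_csv("data.csv", fast_mode=True)    # hypothetical parameter

# (ii) default mode: expects well-formed data including quotes; limited checks.
df = pl.read_csv("data.csv")

# (iii) forgiving mode: tolerates RFC exceptions at some performance cost.
df = pl.read_csv("data.csv", lazy_quotes=True)  # hypothetical parameter
```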

Kindly review:

  • Warning vs. error: my take is to start with a warning now and move to an error once forgiving feature flags are available.
  • Scope: how much performance impact is acceptable to detect malformed data? IMO, the next step is a reliable benchmark on a dedicated machine.

@github-actions bot added labels: fix (Bug fix), python (Related to Python Polars), rust (Related to Rust Polars) on May 22, 2025

codecov bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.99%. Comparing base (5be0a9b) to head (fba49d6).
Report is 14 commits behind head on main.

Files with missing lines | Patch % | Lines
crates/polars-io/src/csv/read/splitfields.rs | 96.77% | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #22876   +/-   ##
=======================================
  Coverage   80.99%   80.99%           
=======================================
  Files        1674     1671    -3     
  Lines      236899   236959   +60     
  Branches     2787     2792    +5     
=======================================
+ Hits       191882   191933   +51     
- Misses      44350    44356    +6     
- Partials      667      670    +3     


```rust
// check malformed: quotes are not allowed in unenclosed fields
if self.quoting {
    let has_quote_mask =
        simd_bytes.simd_eq(self.simd_quote_char).to_bitmask();
```
@nameexhaustion (Collaborator) commented May 22, 2025

I don't think it's worth taking the performance hit of this extra code in the hot-loop for the malformed quote case, especially as it affects the performance in benchmarks.

I think the best strategy here is to just "let it fail" - i.e. proceed to parsing as usual and allow it to fail.

With regard to the original issue (#22395), I believe the key requirement on our side is that we behave consistently (i.e. return the same result for read_csv / scan_csv / scan_csv().select(nth(0))). It's fine even if the behavior isn't correct, since the CSV itself is malformed.
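
A sketch of that consistency requirement written as a check (the file name is illustrative, and on malformed input the shared outcome may be an error raised by all three paths rather than a DataFrame):

```python
import polars as pl

path = "malformed.csv"  # any CSV with invalid quoting

# All three code paths should return the same result -- or fail the same way.
eager = pl.read_csv(path)
lazy = pl.scan_csv(path).collect()
projected = pl.scan_csv(path).select(pl.nth(0)).collect()

assert lazy.equals(eager)
assert projected.equals(eager.select(pl.nth(0)))
```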

@kdn36 (Contributor, Author) commented May 23, 2025

Thanks for clarifying. I simplified the change and took out the hot-loop, field-based detection of double quotes in unenclosed fields (for both SIMD and non-SIMD).

Detecting malformed data is now limited to row-count mismatches and will not panic. This makes read_csv and scan_csv consistent. However, it means we are blind to many malformed patterns: e.g. 'a"b,c"d,e' is malformed but will not be detected.

In addition, read_csv and scan_csv are not consistent with scan_csv().select(pl.nth()), since the second pass uses two types of logic depending on whether all projected columns have been completed. If we want to fix that, we may need to swap skip_this_line() for the SplitFields iterator for parsing consistency. Thoughts?

Edit: note that it should be easy to track in pass 1 whether a chunk contains any quotes at all, offering an option to run the deeper check only when quotes have been detected. If we do this, there should be no performance impact on CSV files without quotes (a sketch follows). Just a thought.
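
A minimal Python illustration of that idea; the real pass 1 is SIMD Rust inside polars-io, and splitting on commas here is only a stand-in for the actual parser:

```python
def parse_chunk(chunk: bytes) -> list[list[bytes]]:
    # Pass 1 already touches every byte to count lines; remembering whether a
    # quote was seen adds almost nothing to that scan.
    has_quotes = b'"' in chunk

    rows = [line.split(b",") for line in chunk.splitlines()]

    # Run the deeper malformed-quote check only when the chunk can actually
    # contain quoting problems; quote-free files skip it entirely.
    if has_quotes:
        for row in rows:
            for field in row:
                if b'"' in field and not (
                    field.startswith(b'"') and field.endswith(b'"')
                ):
                    raise ValueError(f"quote in unenclosed field: {field!r}")
    return rows

print(parse_chunk(b"a,b\nc,d\n"))  # no quotes at all: deep check is skipped
print(parse_chunk(b'"a",b\n'))     # enclosed quotes: passes the check
```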

@nameexhaustion (Collaborator) commented May 26, 2025

> we may need to swap skip_this_line() with the SplitFields iterator for parsing consistency.

I don't think the performance impact of that would be acceptable. If there's no other way to give the same output for all cases, I think it's fine to rely on the fact that we always error by default when the height of the parsed DataFrame doesn't match. I would also change the title of this PR to "raise error by default on invalid CSV quotes".

```rust
@@ -441,7 +450,11 @@ impl<'a> CoreReader<'a> {
let result = slf
    .read_chunk(b, projection, 0, count, Some(0), b.len())
    .and_then(|mut df| {
        debug_assert!(df.height() <= count);
        if df.height() > count {
```
Collaborator

Suggested change:

```diff
-            if df.height() > count {
+            if df.height() != count {
```

@kdn36 (Contributor, Author) commented May 23, 2025

I had that originally, but it fails when the CSV has comment lines inside a chunk (see tests test_csv_multi_char_comment and test_count_csv_comment_char). The clean way would be for read_chunk to return (or mutate) rows_parsed in addition to the DataFrame and to make the check strict (rows_parsed != count). OK with that?

Collaborator

I think you can then add a check for comment_char to the if statement here:

```rust
height > count || (height < count && !has_comment_char)
```

@kdn36 kdn36 marked this pull request as draft May 23, 2025 17:46
@kdn36 (Contributor, Author) commented May 23, 2025

Sharing WIP; some tests are failing until the open items below are resolved.
Open items:

  • Strict line-count check when there are comments in a chunk (ping @nameexhaustion).
  • Consistency when select(pl.nth()) is used; separate PR?

Development

Successfully merging this pull request may close these issues.

read_csv and scan_csv producing different result on CSV fields with invalid quoting