
fix: Improve consistency between read_csv and scan_csv #22876


Draft · kdn36 wants to merge 5 commits into main

Conversation

@kdn36 kdn36 commented May 22, 2025

fixes #22395

This PR improves consistency between read_csv() and scan_csv() by issuing a warning for some types of malformed CSV data instead of panicking. A decision is needed on the scope, as the extra checking may impact performance; see below. A minimal reproduction follows.
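
For context, a minimal Python sketch in the spirit of #22395 (the file name and malformed content are illustrative, and the exact outcome depends on the polars version):

```python
import polars as pl

# An unenclosed field containing a quote character is malformed per RFC 4180.
# Before this change, read_csv and scan_csv could disagree on such input
# (one raising or misparsing where the other succeeded).
with open("malformed.csv", "wb") as f:  # illustrative file name
    f.write(b'a,b,c\n1,2"x,3\n4,5,6\n')

eager = pl.read_csv("malformed.csv")
lazy = pl.scan_csv("malformed.csv").collect()
print(eager.equals(lazy))  # goal: identical results, or the same warning/error
```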

In scope:
a) (Cheap) Return a warning instead of panicking when read_chunk() returns more lines than CountLines found, and add a similar warning to scan_csv(). This makes read_csv consistent with scan_csv. We may want to error instead?
b) Updated documentation.
c) Added tests.

TBD in/out of scope (currently in):
d) (Not cheap) Lazy_quotes_1: look for quotes in unenclosed fields in the SplitFields iterator; detect and warn when quotes are found in an unenclosed field, except for columns to the right of the last projected column.

Out of scope:
e) Allow lazy_quotes_1 with projection, specifically when columns to the right of the last projected column have an odd number of quotes.
f) Lazy_quotes_2: does not (yet) check for single quotes in enclosed fields, as this is more ambiguous and further impacts performance. The exception is the line count check above.
g) Leading whitespace in fields.

A performance regression was observed for the unenclosed-fields lazy_quotes_1 check (d above). Caveat: my dev machine is not good for benchmarking; it is a lightweight portable laptop running Windows + WSL. It shows some regression on a single large CSV file (NYC dataset, 24M records, short unenclosed fields) on the order of +9% in instruction count and +15% in wall time.

Details: (non-representative) performance assessment

(with malformed check)
(nodebug-release)
commit 1d60b0488b29fe6c7d4292a7f49da6e1f025ca23

denecker@DESKTOP-V7KT63E:~/my-polars/issue_22395$ ~/poop/zig-out/bin/poop 'python nyc_read.py'
Benchmark 1 (3 runs): python nyc_read.py
  measurement          mean ± σ            min … max           outliers
  wall_time          6.14s  ±  723ms    5.47s  … 6.91s           0 ( 0%)
  peak_rss           8.43GB ± 6.33MB    8.43GB … 8.44GB          0 ( 0%)
  cpu_cycles          102G  ±  487M      101G  …  102G           0 ( 0%)
  instructions        141G  ±  395K      141G  …  141G           0 ( 0%)
  cache_references    561M  ± 34.0M      541M  …  600M           0 ( 0%)
  cache_misses        333M  ± 4.69M      327M  …  336M           0 ( 0%)
  branch_misses       306M  ± 9.31M      298M  …  316M           0 ( 0%)

(without malformed check)
(nodebug-release)
commit 5be0a9b

denecker@DESKTOP-V7KT63E:~/my-polars/issue_22395$ ~/poop/zig-out/bin/poop 'python nyc_read.py'
Benchmark 1 (3 runs): python nyc_read.py
  measurement          mean ± σ            min … max           outliers
  wall_time          5.33s  ±  169ms    5.22s  … 5.53s           0 ( 0%)
  peak_rss           8.43GB ±  967KB    8.43GB … 8.44GB          0 ( 0%)
  cpu_cycles         90.0G  ±  332M     89.7G  … 90.4G           0 ( 0%)
  instructions        129G  ±  463K      129G  …  129G           0 ( 0%)
  cache_references    540M  ± 17.3M      526M  …  560M           0 ( 0%)
  cache_misses        331M  ± 1.67M      329M  …  333M           0 ( 0%)
  branch_misses       288M  ± 14.1M      280M  …  305M           0 ( 0%)

General notes on performance: I suspect some overall improvements are still possible for read_csv if we consider the following.
(A) Internal

  1. Use a bounded queue between pass 1 and pass 2 to provide back-pressure. This should keep the passes in sync and the cache hot (requires investigation; see the sketch after this list).
  2. Review L3 cache size optimization (individual chunks are 16 MB, per thread?).
  3. In pass 2, consider collecting character indices for further processing rather than restarting `next()` for every field. Relevant when fields are short.
  4. I noticed SIMD_SIZE is hardcoded to 64 bytes (AVX-512), but many Intel laptops only support 32 bytes (AVX2). (FWIW, hardcoding SIMD_SIZE to 32 bytes showed some improvement on my machine.)
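
To illustrate idea (A)1, a minimal, language-agnostic sketch of bounded-queue back-pressure using Python threads; this is not polars code, and the chunk handling is only a stand-in for the real Rust passes:

```python
import queue
import threading

# A bounded queue between pass 1 (line counting) and pass 2 (field parsing):
# once `maxsize` chunks are in flight, pass 1 blocks, keeping the two passes
# in sync so recently scanned chunks stay cache-hot.
in_flight: queue.Queue = queue.Queue(maxsize=4)

def pass_1(chunks: list[bytes]) -> None:
    for chunk in chunks:
        line_count = chunk.count(b"\n")     # stand-in for CountLines
        in_flight.put((chunk, line_count))  # blocks when the queue is full
    in_flight.put(None)                     # sentinel: no more chunks

def pass_2(results: list[int]) -> None:
    while (item := in_flight.get()) is not None:
        chunk, line_count = item
        results.append(line_count)          # stand-in for field parsing

results: list[int] = []
consumer = threading.Thread(target=pass_2, args=(results,))
consumer.start()
pass_1([b"a,b\nc,d\n"] * 16)
consumer.join()
print(sum(results))  # 32 lines parsed, never more than 4 chunks in flight
```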

(B) User facing:

  1. Introduce a fast-mode option for end users in read_csv(). When enabled, Polars asserts well-formed data and ignores all quotes.
  2. Introduce feature flag(s) for lazy-quotes options with a separate code path. This protects performance on well-formed CSV. Consider memchr to manage code complexity.
    Combined, this would present three approaches (sketched below): (i) fast mode (expects well-formed data, no quotes, no checks), (ii) default mode (expects well-formed data including quotes, limited checking), (iii) forgiving mode (feature flags to allow RFC exceptions).
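
To make the three modes concrete, a sketch of how they might surface in the Python API; note that fast_mode and lazy_quotes are hypothetical parameter names for illustration only, and neither exists in read_csv today:

```python
import polars as pl

# (i) fast mode: asserts well-formed data without quoting; skips all checks.
df = pl.read_csv("data.csv", fast_mode=True)    # hypothetical parameter

# (ii) default mode: expects well-formed data including quotes; limited checks.
df = pl.read_csv("data.csv")

# (iii) forgiving mode: tolerates RFC exceptions at some performance cost.
df = pl.read_csv("data.csv", lazy_quotes=True)  # hypothetical parameter
```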

Kindly review:

  • Warning vs. error: my take is to start with a warning now and move to an error once forgiving feature flags are available.
  • Scope: how much performance impact is acceptable to detect malformed data? IMO, the next step is a reliable benchmark on a dedicated machine.

@github-actions bot added labels: fix (Bug fix), python (Related to Python Polars), rust (Related to Rust Polars) on May 22, 2025

codecov bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.99%. Comparing base (5be0a9b) to head (fba49d6).
Report is 14 commits behind head on main.

Files with missing lines | Patch % | Lines
crates/polars-io/src/csv/read/splitfields.rs | 96.77% | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #22876   +/-   ##
=======================================
  Coverage   80.99%   80.99%           
=======================================
  Files        1674     1671    -3     
  Lines      236899   236959   +60     
  Branches     2787     2792    +5     
=======================================
+ Hits       191882   191933   +51     
- Misses      44350    44356    +6     
- Partials      667      670    +3     


```rust
// check malformed: quotes are not allowed in unenclosed fields
if self.quoting {
    let has_quote_mask =
        simd_bytes.simd_eq(self.simd_quote_char).to_bitmask();
```
@nameexhaustion (Collaborator) commented May 22, 2025

I don't think it's worth taking the performance hit of this extra code in the hot-loop for the malformed quote case, especially as it affects the performance in benchmarks.

I think the best strategy here is to just "let it fail" - i.e. proceed to parsing as usual and allow it to fail.

With regard to the original issue (#22395), I believe the key requirement on our side is that we behave consistently (i.e. return the same result for read_csv / scan_csv / scan_csv().select(nth(0))). It's fine even if the behavior isn't correct, since the CSV itself is malformed.
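
A sketch of that consistency requirement written as a check (the file name is illustrative, and on malformed input the shared outcome may be an error raised by all three paths rather than a DataFrame):

```python
import polars as pl

path = "malformed.csv"  # any CSV with invalid quoting

# All three code paths should return the same result -- or fail the same way.
eager = pl.read_csv(path)
lazy = pl.scan_csv(path).collect()
projected = pl.scan_csv(path).select(pl.nth(0)).collect()

assert lazy.equals(eager)
assert projected.equals(eager.select(pl.nth(0)))
```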

@kdn36 (Contributor, Author) commented May 23, 2025

Thanks for clarifying. I simplified the change and took out the hot-loop, field-based detection of double quotes in unenclosed fields (for both SIMD and non-SIMD).

Detecting malformed data is now limited to row-count mismatches and will not panic. This makes read_csv and scan_csv consistent. However, it means we are blind to many malformed patterns: e.g. 'a"b,c"d,e' is malformed but will not be detected.

In addition, read_csv and scan_csv are not consistent with scan_csv().select(pl.nth()), since the second pass uses two types of logic depending on whether all projected columns have been completed. If we want to fix that, we may need to swap skip_this_line() for the SplitFields iterator for parsing consistency. Thoughts?

Edit: note that it should be easy to track in pass 1 whether a chunk contains any quotes at all, offering an option to run the deeper check only when quotes have been detected. If we do this, there should be no performance impact on CSV files without quotes (a sketch follows). Just a thought.
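
A minimal Python illustration of that idea; the real pass 1 is SIMD Rust inside polars-io, and splitting on commas here is only a stand-in for the actual parser:

```python
def parse_chunk(chunk: bytes) -> list[list[bytes]]:
    # Pass 1 already touches every byte to count lines; remembering whether a
    # quote was seen adds almost nothing to that scan.
    has_quotes = b'"' in chunk

    rows = [line.split(b",") for line in chunk.splitlines()]

    # Run the deeper malformed-quote check only when the chunk can actually
    # contain quoting problems; quote-free files skip it entirely.
    if has_quotes:
        for row in rows:
            for field in row:
                if b'"' in field and not (
                    field.startswith(b'"') and field.endswith(b'"')
                ):
                    raise ValueError(f"quote in unenclosed field: {field!r}")
    return rows

print(parse_chunk(b"a,b\nc,d\n"))  # no quotes at all: deep check is skipped
print(parse_chunk(b'"a",b\n'))     # enclosed quotes: passes the check
```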

@nameexhaustion (Collaborator) commented May 26, 2025

> we may need to swap skip_this_line() with the SplitFields iterator for parsing consistency.

I don't think the performance impact of that would be acceptable. If there's no other way to give the same output for all cases, I think it's fine to rely on the fact that we always error by default when the height of the parsed DataFrame doesn't match. I would also change the title of this PR to "raise error by default on invalid CSV quotes".

```rust
@@ -441,7 +450,11 @@ impl<'a> CoreReader<'a> {
let result = slf
    .read_chunk(b, projection, 0, count, Some(0), b.len())
    .and_then(|mut df| {
        debug_assert!(df.height() <= count);
        if df.height() > count {
```
Collaborator

Suggested change:

```diff
-            if df.height() > count {
+            if df.height() != count {
```

@kdn36 (Contributor, Author) commented May 23, 2025

I had that originally, but it fails when the CSV has comment lines inside a chunk (see tests test_csv_multi_char_comment and test_count_csv_comment_char). The clean way would be for read_chunk to return (or mutate) rows_parsed in addition to the DataFrame and to make the check strict (rows_parsed != count). OK with that?

Collaborator

I think you can then add a check for comment_char to the if statement here:

```rust
height > count || (height < count && !has_comment_char)
```

@kdn36 kdn36 marked this pull request as draft May 23, 2025 17:46
@kdn36 (Contributor, Author) commented May 23, 2025

Sharing WIP; some tests are failing until the open items below are resolved.
Open items:

  • Strict line-count check when there are comments in a chunk (ping @nameexhaustion).
  • Consistency when select(pl.nth()) is used; separate PR?

Development

Successfully merging this pull request may close these issues.

read_csv and scan_csv producing different result on CSV fields with invalid quoting