
read_csv and scan_csv producing different result on CSV fields with invalid quoting #22395


Open
cmdlineluser opened this issue Apr 24, 2025 · 9 comments · May be fixed by #22876
Labels
A-io-csv (Area: reading/writing CSV files) · accepted (Ready for implementation) · bug (Something isn't working) · P-medium (Priority: medium) · python (Related to Python Polars)

Comments

@cmdlineluser
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(0))
   .collect()
)

# shape: (1, 1)
# ┌──────────┐
# │ column_1 │
# │ ---      │
# │ i64      │
# ╞══════════╡
# │ 1        │
# └──────────┘

Log output

_init_credential_provider_builder(): credential_provider_init = None
polars-stream: updating graph state
async thread count: 4
polars-stream: running in_memory_sink in subgraph
polars-stream: running MultiScan[csv] in subgraph
[MultiScanTaskInitializer]: spawn_background_tasks(), 1 sources, reader name: csv, ReaderCapabilities(PRE_SLICE)
[MultiScanTaskInitializer]: predicate: None, skip files mask: None, predicate to reader: None
[MultiScanTaskInitializer]: scan_source_idx: 0 extra_ops: ExtraOperations { row_index: None, pre_slice: None, cast_columns_policy: ErrorOnMismatch, missing_columns_policy: Raise, include_file_paths: None, predicate: None } 
[MultiScanTaskInitializer]: Readers init range: 0..1 (1 / 1 files)
[ReaderStarter]: max_concurrent_scans: 1
[MultiScan]: Initialize source 0
[ReaderStarter]: scan_source_idx: 0
[ReaderStarter]: max_concurrent_scans is 1, waiting..
[AttachReaderToBridge]: got reader, n_readers_received: 1
[CsvFileReader]: project: 1 / 2, slice: None, row_index: None
[CsvSource]: Start line splitting
[AttachReaderToBridge]: ApplyExtraOps::Noop
[ReaderStarter]: Stopping (no more readers)
[MultiScanState]: Readers disconnected
polars-stream: done running graph phase
polars-stream: updating graph state

Issue description

Follow-on from the initial report: #22392.

With .collect(projection_pushdown=False) (or with read_csv), 2 rows are produced.
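For reference, this is the example above with pushdown disabled:

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(0))
   .collect(projection_pushdown=False)
)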

shape: (2, 1)
┌──────────┐
│ column_1 │
│ ---      │
│ i64      │
╞══════════╡
│ 1        │
│ 4        │
└──────────┘

Two rows are also produced with quote_char set to the empty string.

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False, quote_char="")
   .select(pl.nth(0))
   .collect(projection_pushdown=False)
)
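
This yields the same 2-row result (reconstructed here; with quoting disabled, 2"3 is simply a literal string in the second column):

shape: (2, 1)
┌──────────┐
│ column_1 │
│ ---      │
│ i64      │
╞══════════╡
│ 1        │
│ 4        │
└──────────┘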

Expected behavior

Same result.

Installed versions

main

@cmdlineluser cmdlineluser added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 24, 2025
@nameexhaustion nameexhaustion added accepted Ready for implementation P-high Priority: high A-io-csv Area: reading/writing CSV files and removed needs triage Awaiting prioritization by a maintainer labels Apr 24, 2025
@nameexhaustion
Collaborator

Notes

import polars as pl

pl.read_csv(b"""\
a,b
1,2"3
4,5
""")

Panics on debug at

debug_assert!(df.height() <= count);

using the above MRE.

@ritchie46
Member

scan_csv is correct here. The second value is malformed, and the newline should never be respected, as it falls in the middle of a quoted field that never finishes.

@kdn36
Contributor

kdn36 commented May 14, 2025

I can give this a try, with some input from the core team on expected behavior and implementation.

Observations:

  • The CSV is technically malformed per RFC 4180 (*); see also Wikipedia. RFC 4180 reserves the double quote for 'enclosing' a field (marking its start and end).
  • However, many (most?) parser implementations seem to quietly accept a single mid-field double quote (Python csv, pandas, Excel). The RFC also advises "be liberal in what you accept" (etc.).
  • In the code, read_csv panics because the [row] count() function sees 0 rows while read_chunk() returns a df with 2 rows, triggering the debug_assert when comparing 0 vs 2.
  • For scan_csv, count() similarly sees 0 rows while read_chunk() returns a df with 2 rows. There is no debug_assert check, so it proceeds (df with 2 rows). In addition, if pl.nth(0) is used, it captures only 1 row (!); see the details below.
malformed = b"""\
a,b
1,2"3
4,5
"""
>>> print(pl.scan_csv(malformed)
...    .collect()
... )
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ 2"3 │
│ 4   ┆ 5   │
└─────┴─────┘

>>> print(pl.scan_csv(malformed)
...    .select(pl.nth(0))
...    .collect()
... )
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘

Design goals:

  • Design goal 1: consistent behavior between read_csv and scan_csv.
  • Design goal 2: a binary outcome, either success or a non-panicking error (no partial processing or silently dropped rows).
  • Design goal 3 (TBD): controllable behavior (see for inspiration the --lazy-quotes flag in https://miller.readthedocs.io/en/latest/reference-main-flag-list/#csv-only-flags), which would mean an extra flag.
  • Design goal 4 (TBD): a sound default for this flag.
  • Design goal 5: well-documented behavior.

Decision points:

  • Decision point 1: do we want strict enforcement of RFC 4180 by default, or a more relaxed approach in line with industry practice? (My vote: I lean towards the 'liberal' default, unless that goes against Polars' overall philosophy.) Would we be creating a breaking change?
  • Decision point 2: do we want an extra flag to control the behavior (my vote: yes)?

(*) For the original test.txt from #22392:

$ mlr --csv check test.txt
mlr :  parse error on line 25187, column 425: bare " in non-quoted-field

and an illustration of the control flag in mlr:

$ cat some.csv
1,2"3
4,5
$ ./mlr --csv nothing some.csv
mlr: mlr: CSV header/data length mismatch 1 != 2 at filename some.csv row 2.
$ ./mlr --csv --lazy-quotes nothing some.csv
$

If the core team can weigh in on what we want, I can work on the code. It should be doable once we have clarity on the desired behavior.

@deanm0000
Collaborator

Does it really need an additional flag, or should this just fall under the ignore_errors umbrella so that it "works" like read_csv or like setting projection_pushdown=False?

I think there's an additional wrinkle: shouldn't these two queries return the same number of rows?

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(1))
   .collect()
)
# returns 2 rows: '2"3' and '5'

in contrast to

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(0))
   .collect()
)
# returns just 1 row: 1

and also, shouldn't the position of the malformed column be irrelevant?

(pl.scan_csv(b"""2"3,1\n5,4""", has_header=False)
   .select(pl.nth(1))
   .collect()
)
# returns 2 rows
(pl.scan_csv(b"""2"3,1\n5,4""", has_header=False)
   .select(pl.nth(0))
   .collect()
)
# returns 2 rows

Maybe it should raise for any RFC 4180 violation, and if ignore_errors=True it could just turn off projection pushdown for this case?
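
A rough sketch of the user-facing shape of that idea. Note that ignore_errors is an existing read_csv/scan_csv parameter, but the fallback semantics shown here are hypothetical, not current Polars behavior:

# Hypothetical semantics (not current behavior): strict parsing raises on
# the bare quote, while ignore_errors=True falls back to the slower,
# non-pushdown path and returns both rows.
malformed = b"""1,2"3\n4,5"""
pl.scan_csv(malformed, has_header=False).select(pl.nth(0)).collect()
# -> would raise on the invalid quoting
pl.scan_csv(malformed, has_header=False, ignore_errors=True).select(pl.nth(0)).collect()
# -> would return the 2-row result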

@nameexhaustion nameexhaustion added P-medium Priority: medium and removed P-high Priority: high labels May 15, 2025
@nameexhaustion
Collaborator

We should read this in as 2 rows for the example in the issue post, regardless of the projection. I believe this is also the simplest solution; this isn't a case we want to make any large changes for, as the CSV is technically malformed.

@nameexhaustion nameexhaustion changed the title read_csv and scan_csv producing different result read_csv and scan_csv producing different result on malformed CSV May 15, 2025
@nameexhaustion nameexhaustion changed the title read_csv and scan_csv producing different result on malformed CSV read_csv and scan_csv producing different result on CSV fields with invalid quoting May 15, 2025
@kdn36
Contributor

kdn36 commented May 15, 2025

If we are okay with this and consider this important enough, I will go ahead and prepare a code change whereby:

  • a mid-field single double quote is considered acceptable ('RFC 4180 with exceptions')
  • a warning will be issued when this condition is met
  • including documentation and testing

This should make everything consistent (including all the cases above), which is the real priority IMO. It is not as easy as it sounds, though; the reader/parser logic is split across multiple places.

@ritchie46
Member

This isn't an easy one. I already looked into it, and it requires changes to the SIMD code. You would always need to do more work in the SIMD path to be able to raise an error on invalid files.

@kdn36
Contributor

kdn36 commented May 18, 2025

Thanks @ritchie46. After further investigation, here is the proposed plan forward.

The moment we deviate from 'strict' (RFC 4180), things get messy, yet deviating is common practice in many environments and tools. The following parsing features emerge (hereafter 'quote' means the 1-byte double-quote character):

  • [lazy_quote_1] allow quotes in non-enclosed fields and treat them as literals. Usually allowed, not ambiguous, not too complicated (see the sketch after this list).
  • [lazy_quote_2] allow single quotes in enclosed fields. Often allowed, but ambiguous: should 2 consecutive quotes now be treated as 1 or 2, and what is the (new) end-of-field logic? (For example, Excel/Python csv/pandas treat this differently than, say, Go's encoding/csv or mlr.) Seemingly more complicated in SIMD (TBC).
  • [trim_leading_whitespace] allow whitespace at the start of a field
  • [newlines_in_row] allow newline characters inside a row
  • [comment_lines] allow comment lines (already supported)
  • and more... some tools even support a 'fast mode' (no quote characters)
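
To make lazy_quote_1 concrete, here is a minimal pure-Python field splitter contrasting it with strict mode. This is only an illustrative sketch, not the Polars implementation (which is SIMD-based); the function name is made up, and edge cases such as embedded newlines are deliberately ignored:

def split_fields(line: str, lazy_quote_1: bool = False) -> list[str]:
    """Split one CSV line; strict mode rejects a bare quote mid-field."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"' and line[i + 1 : i + 2] == '"':
                buf.append('"')  # doubled quote inside an enclosed field
                i += 1
            elif c == '"':
                in_quotes = False  # closing quote ends the enclosed field
            else:
                buf.append(c)
        elif c == '"' and not buf:
            in_quotes = True  # quote at field start opens an enclosed field
        elif c == '"' and lazy_quote_1:
            buf.append(c)  # lazy_quote_1: mid-field quote is a literal
        elif c == '"':
            raise ValueError('bare " in non-quoted field')  # strict, per RFC 4180
        elif c == ',':
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append(''.join(buf))
    return fields

split_fields('1,2"3', lazy_quote_1=True)   # ['1', '2"3']
split_fields('1,2"3')                      # ValueError: bare " in non-quoted field

The lazy mode matches what the SplitFields next() method effectively does today, while strict mode matches the mlr error shown earlier; the inconsistency described below comes from mixing the two.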

Polars assumes strict (RFC 4180) but has slightly inconsistent approaches to handling malformed data: (a) in count() (first pass, not parallel), (b) the SplitFields next() method (second pass, parallel), (c) skip_this_line() / find_quoted() (when nth() is used), and there are more. Interestingly, the SplitFields next() method actually follows the lazy_quote_1 approach above, unlike the other methods, hence the observed inconsistencies.

Now, changing the core logic is likely to trigger regression issues (functional and/or performance).

Therefore, the proposed plan forward:

  1. (This issue) Out of the box, Polars sticks to strict. Calls to read_csv()/scan_csv(), with or without nth(), are consistent. Report an error when malformed CSV is detected (caveat: not all cases can be detected; a hypothetical sketch of the user-facing outcome follows this list). For implementation, the SplitFields next() logic, including SIMD, can be updated relatively easily to detect lazy_quote_1 violations; the other cases are TBD, I think. Recommend third-party tools to end users for normalizing the CSV first as a workaround.
  2. (Enhancement) Introduce lazy-quote features behind user-facing flags, starting with lazy_quote_1. The feature flags will help avoid regressions and localize the inevitable performance degradation. This requires documentation and benchmarking. Implementation is TBD; consider a separate code path and memchr instead of SIMD, as the logic may get (too) complicated (e.g. switching between literal mode and toggle mode, error handling, etc.). Details to be worked out.
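
For step 1, the user-facing outcome could look roughly as follows. This is purely hypothetical: the exception type and message are illustrative, and nothing here is implemented (ComputeError is an existing Polars exception class):

import polars as pl

try:
    pl.read_csv(b'1,2"3\n4,5', has_header=False)
except pl.exceptions.ComputeError as exc:
    # hypothetical message: invalid quoting, bare '"' in a non-quoted field
    print(exc)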

In addition:

  • The skip_this_line() code, which gets called with nth(), does not use SIMD? It is used with projection, and perhaps we should optimize it as well.
  • (code anomaly) The call to read_chunk() has chunk_size (bytes) hard-coded to usize::MAX, which subsequently gets mapped to n_lines (rows) in the call to parse_lines(). Looks off?

Feedback appreciated, hopefully this is on the right track (my first deep-dive into the world of SIMD).

@kdn36 kdn36 linked a pull request May 22, 2025 that will close this issue
@ritchie46
Member

I don't want to deviate from RFC 4180.

    Each field may or may not be enclosed in double quotes (however
    some programs, such as Microsoft Excel, do not use double quotes
    at all). If fields are not enclosed with double quotes, then
    double quotes may not appear inside the fields.

This isn't allowed; we just don't recognize the error. I think the way forward is recognizing the error on the initial pass. I believe we already have the SIMD information anyhow. So I agree with point 1, but not with point 2. I don't think we should go down that path; people should then fix their CSV files with fixing software.
