
read_csv and scan_csv producing different result on CSV fields with invalid quoting #22395


Open
cmdlineluser opened this issue Apr 24, 2025 · 9 comments · May be fixed by #22876
Labels
A-io-csv (Area: reading/writing CSV files) · accepted (Ready for implementation) · bug (Something isn't working) · P-medium (Priority: medium) · python (Related to Python Polars)

Comments

@cmdlineluser
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(0))
   .collect()
)

# shape: (1, 1)
# ┌──────────┐
# │ column_1 │
# │ ---      │
# │ i64      │
# ╞══════════╡
# │ 1        │
# └──────────┘

Log output

_init_credential_provider_builder(): credential_provider_init = None
polars-stream: updating graph state
async thread count: 4
polars-stream: running in_memory_sink in subgraph
polars-stream: running MultiScan[csv] in subgraph
[MultiScanTaskInitializer]: spawn_background_tasks(), 1 sources, reader name: csv, ReaderCapabilities(PRE_SLICE)
[MultiScanTaskInitializer]: predicate: None, skip files mask: None, predicate to reader: None
[MultiScanTaskInitializer]: scan_source_idx: 0 extra_ops: ExtraOperations { row_index: None, pre_slice: None, cast_columns_policy: ErrorOnMismatch, missing_columns_policy: Raise, include_file_paths: None, predicate: None } 
[MultiScanTaskInitializer]: Readers init range: 0..1 (1 / 1 files)
[ReaderStarter]: max_concurrent_scans: 1
[MultiScan]: Initialize source 0
[ReaderStarter]: scan_source_idx: 0
[ReaderStarter]: max_concurrent_scans is 1, waiting..
[AttachReaderToBridge]: got reader, n_readers_received: 1
[CsvFileReader]: project: 1 / 2, slice: None, row_index: None
[CsvSource]: Start line splitting
[AttachReaderToBridge]: ApplyExtraOps::Noop
[ReaderStarter]: Stopping (no more readers)
[MultiScanState]: Readers disconnected
polars-stream: done running graph phase
polars-stream: updating graph state

Issue description

Follow-on from the initial report: #22392.

With .collect(projection_pushdown=False) (or with read_csv), 2 rows are produced.
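For reference, this is the example above with pushdown disabled:

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(0))
   .collect(projection_pushdown=False)
)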

shape: (2, 1)
┌──────────┐
│ column_1 │
│ ---      │
│ i64      │
╞══════════╡
│ 1        │
│ 4        │
└──────────┘

Two rows are also produced with quote_char set to the empty string.

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False, quote_char="")
   .select(pl.nth(0))
   .collect(projection_pushdown=False)
)
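
This yields the same 2-row result (reconstructed here; with quoting disabled, 2"3 is simply a literal string in the second column):

shape: (2, 1)
┌──────────┐
│ column_1 │
│ ---      │
│ i64      │
╞══════════╡
│ 1        │
│ 4        │
└──────────┘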

Expected behavior

Same result.

Installed versions

main

@cmdlineluser cmdlineluser added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 24, 2025
@nameexhaustion nameexhaustion added accepted Ready for implementation P-high Priority: high A-io-csv Area: reading/writing CSV files and removed needs triage Awaiting prioritization by a maintainer labels Apr 24, 2025
@nameexhaustion
Collaborator

Notes

import polars as pl

pl.read_csv(b"""\
a,b
1,2"3
4,5
""")

Panics on debug at

debug_assert!(df.height() <= count);

using the above MRE.

@ritchie46
Member

scan_csv is correct here. The second value is malformed, and the newline should never be respected, as it falls in the middle of a quoted field that never finishes.

@kdn36
Contributor

kdn36 commented May 14, 2025

I can give this a try, with some input from the core team on expected behavior and implementation.

Observations:

  • The CSV is technically malformed per RFC 4180 (*); see also Wikipedia. RFC 4180 reserves the double quote for 'enclosing' a field (marking its start and end).
  • However, many (most?) parser implementations seem to quietly accept a single mid-field double quote (Python csv, pandas, Excel). The RFC also advises "be liberal in what you accept" (etc.).
  • In the code, read_csv panics because the [row] count() function sees 0 rows while read_chunk() returns a df with 2 rows, triggering the debug_assert when comparing 0 vs 2.
  • For scan_csv, count() similarly sees 0 rows while read_chunk() returns a df with 2 rows. There is no debug_assert check, so it proceeds (df with 2 rows). In addition, if pl.nth(0) is used, it captures only 1 row (!); see the details below.
malformed = b"""\
a,b
1,2"3
4,5
"""
>>> print(pl.scan_csv(malformed)
...    .collect()
... )
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ 2"3 │
│ 4   ┆ 5   │
└─────┴─────┘

>>> print(pl.scan_csv(malformed)
...    .select(pl.nth(0))
...    .collect()
... )
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
└─────┘

Design goals:

  • Design goal 1: consistent behavior between read_csv and scan_csv.
  • Design goal 2: a binary outcome, either success or a non-panicking error (no partial processing or silently dropped rows).
  • Design goal 3 (TBD): controllable behavior (see for inspiration the --lazy-quotes flag in https://miller.readthedocs.io/en/latest/reference-main-flag-list/#csv-only-flags), which would mean an extra flag.
  • Design goal 4 (TBD): a sound default for this flag.
  • Design goal 5: well-documented behavior.

Decision points:

  • Decision point 1: do we want strict enforcement of RFC 4180 by default, or a more relaxed approach in line with industry practice? (My vote: I lean towards the 'liberal' default, unless that goes against Polars' overall philosophy.) Would we be creating a breaking change?
  • Decision point 2: do we want an extra flag to control the behavior (my vote: yes)?

(*) For the original test.txt from #22392:

$ mlr --csv check test.txt
mlr :  parse error on line 25187, column 425: bare " in non-quoted-field

and an illustration of the control flag in mlr:

$ cat some.csv
1,2"3
4,5
$ ./mlr --csv nothing some.csv
mlr: mlr: CSV header/data length mismatch 1 != 2 at filename some.csv row 2.
$ ./mlr --csv --lazy-quotes nothing some.csv
$

If the core team can weigh in on what we want, I can work on the code. It should be doable once we have clarity on the desired behavior.

@deanm0000
Collaborator

Does it really need an additional flag, or should this just fall under the ignore_errors umbrella so that it "works" like read_csv or like setting projection_pushdown=False?

I think there's an additional wrinkle: shouldn't these two queries return the same number of rows?

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(1))
   .collect()
)
# returns 2 rows: '2"3' and '5'

in contrast to

(pl.scan_csv(b"""1,2"3\n4,5""", has_header=False)
   .select(pl.nth(0))
   .collect()
)
# returns just 1 row: 1

and also, shouldn't the position of the malformed column be irrelevant?

(pl.scan_csv(b"""2"3,1\n5,4""", has_header=False)
   .select(pl.nth(1))
   .collect()
)
# returns 2 rows
(pl.scan_csv(b"""2"3,1\n5,4""", has_header=False)
   .select(pl.nth(0))
   .collect()
)
# returns 2 rows

Maybe it should raise for any RFC 4180 violation, and if ignore_errors=True it could just turn off projection pushdown for this case?
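
A rough sketch of the user-facing shape of that idea. Note that ignore_errors is an existing read_csv/scan_csv parameter, but the fallback semantics shown here are hypothetical, not current Polars behavior:

# Hypothetical semantics (not current behavior): strict parsing raises on
# the bare quote, while ignore_errors=True falls back to the slower,
# non-pushdown path and returns both rows.
malformed = b"""1,2"3\n4,5"""
pl.scan_csv(malformed, has_header=False).select(pl.nth(0)).collect()
# -> would raise on the invalid quoting
pl.scan_csv(malformed, has_header=False, ignore_errors=True).select(pl.nth(0)).collect()
# -> would return the 2-row result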

@nameexhaustion nameexhaustion added P-medium Priority: medium and removed P-high Priority: high labels May 15, 2025
@nameexhaustion
Collaborator

We should read this in as 2 rows for the example in the issue post, regardless of the projection. I believe this is also the simplest solution; this isn't a case we want to make any large changes for, as the CSV is technically malformed.

@nameexhaustion nameexhaustion changed the title read_csv and scan_csv producing different result read_csv and scan_csv producing different result on malformed CSV May 15, 2025
@nameexhaustion nameexhaustion changed the title read_csv and scan_csv producing different result on malformed CSV read_csv and scan_csv producing different result on CSV fields with invalid quoting May 15, 2025
@kdn36
Contributor

kdn36 commented May 15, 2025

If we are okay with this and consider this important enough, I will go ahead and prepare a code change whereby:

  • a mid-field single double quote is considered acceptable ('RFC 4180 with exceptions')
  • a warning will be issued when this condition is met
  • including documentation and testing

This should make everything consistent (including all the cases above), which is the real priority IMO. It is not as easy as it sounds, though; the reader/parser logic is split across multiple places.

@ritchie46
Member

This isn't an easy one. I already looked into it, and it requires changes to the SIMD code. You would always need to do more work in the SIMD path to be able to raise an error on invalid files.

@kdn36
Contributor

kdn36 commented May 18, 2025

Thanks @ritchie46. After further investigation, here is the proposed plan forward.

The moment we deviate from 'strict' (RFC 4180), things get messy, yet deviating is common practice in many environments and tools. The following parsing features emerge (hereafter 'quote' means the 1-byte double-quote character):

  • [lazy_quote_1] allow quotes in non-enclosed fields and treat them as literals. Usually allowed, not ambiguous, not too complicated (see the sketch after this list).
  • [lazy_quote_2] allow single quotes in enclosed fields. Often allowed, but ambiguous: should 2 consecutive quotes now be treated as 1 or 2, and what is the (new) end-of-field logic? (For example, Excel/Python csv/pandas treat this differently than, say, Go's encoding/csv or mlr.) Seemingly more complicated in SIMD (TBC).
  • [trim_leading_whitespace] allow whitespace at the start of a field
  • [newlines_in_row] allow newline characters inside a row
  • [comment_lines] allow comment lines (already supported)
  • and more... some tools even support a 'fast mode' (no quote characters)
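
To make lazy_quote_1 concrete, here is a minimal pure-Python field splitter contrasting it with strict mode. This is only an illustrative sketch, not the Polars implementation (which is SIMD-based); the function name is made up, and edge cases such as embedded newlines are deliberately ignored:

def split_fields(line: str, lazy_quote_1: bool = False) -> list[str]:
    """Split one CSV line; strict mode rejects a bare quote mid-field."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"' and line[i + 1 : i + 2] == '"':
                buf.append('"')  # doubled quote inside an enclosed field
                i += 1
            elif c == '"':
                in_quotes = False  # closing quote ends the enclosed field
            else:
                buf.append(c)
        elif c == '"' and not buf:
            in_quotes = True  # quote at field start opens an enclosed field
        elif c == '"' and lazy_quote_1:
            buf.append(c)  # lazy_quote_1: mid-field quote is a literal
        elif c == '"':
            raise ValueError('bare " in non-quoted field')  # strict, per RFC 4180
        elif c == ',':
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append(''.join(buf))
    return fields

split_fields('1,2"3', lazy_quote_1=True)   # ['1', '2"3']
split_fields('1,2"3')                      # ValueError: bare " in non-quoted field

The lazy mode matches what the SplitFields next() method effectively does today, while strict mode matches the mlr error shown earlier; the inconsistency described below comes from mixing the two.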

Polars assumes strict (RFC 4180) but has slightly inconsistent approaches to handling malformed data: (a) in count() (first pass, not parallel), (b) the SplitFields next() method (second pass, parallel), (c) skip_this_line() / find_quoted() (when nth() is used), and there are more. Interestingly, the SplitFields next() method actually follows the lazy_quote_1 approach above, unlike the other methods, hence the observed inconsistencies.

Now, changing the core logic is likely to trigger regression issues (functional and/or performance).

Therefore, the proposed plan forward:

  1. (This issue) Out of the box, Polars sticks to strict. Calls to read_csv()/scan_csv(), with or without nth(), are consistent. Report an error when malformed CSV is detected (caveat: not all cases can be detected; a hypothetical sketch of the user-facing outcome follows this list). For implementation, the SplitFields next() logic, including SIMD, can be updated relatively easily to detect lazy_quote_1 violations; the other cases are TBD, I think. Recommend third-party tools to end users for normalizing the CSV first as a workaround.
  2. (Enhancement) Introduce lazy-quote features behind user-facing flags, starting with lazy_quote_1. The feature flags will help avoid regressions and localize the inevitable performance degradation. This requires documentation and benchmarking. Implementation is TBD; consider a separate code path and memchr instead of SIMD, as the logic may get (too) complicated (e.g. switching between literal mode and toggle mode, error handling, etc.). Details to be worked out.
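
For step 1, the user-facing outcome could look roughly as follows. This is purely hypothetical: the exception type and message are illustrative, and nothing here is implemented (ComputeError is an existing Polars exception class):

import polars as pl

try:
    pl.read_csv(b'1,2"3\n4,5', has_header=False)
except pl.exceptions.ComputeError as exc:
    # hypothetical message: invalid quoting, bare '"' in a non-quoted field
    print(exc)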

In addition:

  • The skip_this_line() code, which gets called with nth(), does not use SIMD? It is used with projection, and perhaps we should optimize it as well.
  • (code anomaly) The call to read_chunk() has chunk_size (bytes) hard-coded to usize::MAX, which subsequently gets mapped to n_lines (rows) in the call to parse_lines(). Looks off?

Feedback appreciated, hopefully this is on the right track (my first deep-dive into the world of SIMD).

@kdn36 kdn36 linked a pull request May 22, 2025 that will close this issue
@ritchie46
Member

I don't want to deviate from RFC 4180.

    Each field may or may not be enclosed in double quotes (however
    some programs, such as Microsoft Excel, do not use double quotes
    at all). If fields are not enclosed with double quotes, then
    double quotes may not appear inside the fields.

This isn't allowed; we just don't recognize the error. I think the way forward is recognizing the error on the initial pass. I believe we already have the SIMD information anyhow. So I agree with point 1, but not with point 2. I don't think we should go down that path; people should then fix their CSV files with fixing software.
