Replies: 1 comment 1 reply
-
Hey, I tested the provided file with both DuckDB and PyArrow, and I agree the file is corrupted.

DuckDB:

```
dd$ ./duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from read_parquet('~/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2');
Invalid Error: don't know what type:
```
PyArrow:

```
~$ python3
Python 3.12.7 (main, Oct 1 2024, 11:15:50) [GCC 14.2.1 20240910] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> pq.read_table('~/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Unexpected end of stream: Page was smaller (236) than expected (310)
>>> import pyarrow
>>> pyarrow.__version__
'17.0.0'
```
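The PyArrow message is the most specific of the errors here: "Page was smaller (236) than expected (310)" means a page header declared more bytes than actually remain at that point in the file, which is consistent with a truncated or corrupted file. A toy stdlib-only illustration of the shape of that check (not PyArrow's actual implementation):

```python
def read_page(buf: bytes, offset: int, declared_size: int) -> bytes:
    """Toy analogue of a reader's page check: the page header claims
    `declared_size` bytes follow at `offset`, but fewer may remain."""
    available = len(buf) - offset
    if available < declared_size:
        raise EOFError(
            f"Unexpected end of stream: Page was smaller ({available}) "
            f"than expected ({declared_size})"
        )
    return buf[offset : offset + declared_size]

# A healthy page: 236 bytes declared, 236 bytes available.
page = read_page(b"x" * 300, offset=64, declared_size=236)
print(len(page))  # 236

# A truncated file: 310 bytes declared, only 236 left (same numbers as above).
try:
    read_page(b"x" * 300, offset=64, declared_size=310)
except EOFError as exc:
    print(exc)  # Unexpected end of stream: Page was smaller (236) than expected (310)
```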
~~Feel free to raise an issue if you would prefer the error handling to take care of this instead of panicking 👍~~
EDIT: actually I just tested off the latest branch of arrow-rs:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() {
    let file =
        std::fs::File::open("/home/jeffrey/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2")
            .unwrap();
    let parquet_reader = ParquetRecordBatchReaderBuilder::try_new(file)
        .unwrap()
        .build()
        .unwrap();
    let mut batches = Vec::new();
    for batch in parquet_reader {
        batches.push(batch.unwrap());
    }
}
```

Running this gives an error `Result` (the panic below comes from the example's own `unwrap()`) and not an internal panic:

```
arrow-rs$ cargo run --example read_parquet
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/examples/read_parquet`
thread 'main' panicked at parquet/./examples/read_parquet.rs:12:28:
called `Result::unwrap()` on an `Err` value: ParquetError("EOF: Invalid page header")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
arrow-rs$
```
So it seems this case is already properly handled.
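For quick triage on other suspect files, before reaching for a full reader, you can at least check the file's structural envelope with the stdlib alone: a Parquet file starts with `PAR1` and ends with a 4-byte little-endian footer length followed by `PAR1`. This is only a hedged sketch that catches gross truncation; a file can pass it and still have a corrupt data page, as this one apparently does:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def quick_parquet_check(data: bytes) -> bool:
    """Cheap envelope check: magic bytes at both ends and a footer
    length that fits inside the file. Passing this does NOT prove the
    data pages themselves are intact."""
    if len(data) < 12:  # head magic + footer length + tail magic
        return False
    if data[:4] != PARQUET_MAGIC or data[-4:] != PARQUET_MAGIC:
        return False
    # The 4 bytes before the tail magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return footer_len <= len(data) - 12

# A minimal well-formed shell passes; a truncated copy does not.
fake = PARQUET_MAGIC + b"\x00" * 16 + struct.pack("<I", 16) + PARQUET_MAGIC
print(quick_parquet_check(fake))       # True
print(quick_parquet_check(fake[:-3]))  # False
```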
-
I'm using serde_arrow for reading and writing Parquet files. Writing succeeds without issues, but when reading, I encounter the following error in some files:
Writing Code:
Error message:
Additional Information:
Questions:
Is this error caused by file corruption?
The file
222-2024-12-30-sleep_feature.zstd.parquet.2.zip