Closed
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)
Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
from random import choices
import os

import psutil

import polars as pl


def print_memory_usage(msg: str):
    process = psutil.Process(os.getpid())
    rss_mb = process.memory_info().rss / (1024 * 1024)
    print(f"{msg}: RSS Memory Usage = {rss_mb:.2f} MB")


col_count = 100
row_count = 10_000
repetitions = 1000
num_range = 100

cols = {str(i): choices(range(num_range), k=row_count) for i in range(col_count)}
pl.DataFrame(cols).write_parquet("leaky.parquet")

print_memory_usage("Base level")

# Part 1: Scan files and store the collected schemas in a list.
schemas = []
for f in (pl.scan_parquet("leaky.parquet") for _ in range(repetitions)):
    schemas.append(f.collect_schema())
print_memory_usage("Storing schemas in a list")

# Part 2: Scan files and store the LazyFrames in a list.
results = []
for f in (pl.scan_parquet("leaky.parquet") for _ in range(repetitions)):
    results.append(f)
print_memory_usage("Storing lazyframes in a list")

# Part 3: Scan files and store both the LazyFrames and the collected schemas in lists.
schemas = []
results = []
for f in (pl.scan_parquet("leaky.parquet") for _ in range(repetitions)):
    schemas.append(f.collect_schema())
    results.append(f)
print_memory_usage("Storing lazyframes AND schemas in a list")
Log output
Base level: RSS Memory Usage = 69.44 MB
Storing schemas in a list: RSS Memory Usage = 90.69 MB
Storing lazyframes in a list: RSS Memory Usage = 91.39 MB
Storing lazyframes AND schemas in a list: RSS Memory Usage = 293.79 MB
Issue description
Storing LazyFrame instances in a list causes a significant memory increase if collect_schema() was called on them beforehand. This happens even when the same file is scanned multiple times. My task involves scanning many parquet files into LazyFrames, determining a common schema, casting all frames to that schema, and finally concatenating and writing them into a single large parquet file (a sketch of this workflow follows below).
The issue seems similar to #22366, but applying the solution from the resulting PR #24513 did not resolve it.
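For context, here is a minimal sketch of that workflow, not my exact code; the file names and the simplistic first-dtype-wins unification rule are illustrative assumptions:

import polars as pl

files = ["part1.parquet", "part2.parquet", "part3.parquet"]  # hypothetical inputs
frames = [pl.scan_parquet(f) for f in files]

# Collecting every schema is the step that appears to pin extra memory per LazyFrame.
schemas = [lf.collect_schema() for lf in frames]

# Toy unification: union of all column names; the first dtype seen wins
# (a real implementation would resolve dtype conflicts).
common = {}
for schema in schemas:
    for name, dtype in schema.items():
        common.setdefault(name, dtype)

# Cast each frame's existing columns to the common dtypes, then concatenate
# (diagonal concat fills in columns missing from a frame) and write once.
casted = [
    lf.cast({name: common[name] for name in schema})
    for lf, schema in zip(frames, schemas)
]
pl.concat(casted, how="diagonal").sink_parquet("combined.parquet")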
Expected behavior
Part 3 of the above example should not increase memory usage by more than parts 1 and 2 combined, i.e. it should end at roughly 113 MB instead of the observed 293.79 MB.
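That figure simply adds the per-part increases from the log output above to the part 2 level:

base = 69.44          # MB, baseline
after_part1 = 90.69   # +21.25 MB for 1000 stored schemas
after_part2 = 91.39   # +0.70 MB for 1000 stored LazyFrames

expected_part3 = after_part2 + (after_part1 - base) + (after_part2 - after_part1)
print(f"{expected_part3:.2f} MB")  # 113.34 MB, versus the observed 293.79 MB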
Installed versions
--------Version info---------
Polars: 1.37.1
Index type: UInt32
Platform: Windows-11-10.0.26100-SP0
Python: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
Runtime: rt32
----Optional dependencies----
Azure CLI <not installed> (the "az" command is either misspelled or could not be found)
adbc_driver_manager <not installed>
altair <not installed>
azure.identity <not installed>
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec <not installed>
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib <not installed>
numpy <not installed>
openpyxl <not installed>
pandas <not installed>
polars_cloud <not installed>
pyarrow <not installed>
pydantic <not installed>
pyiceberg <not installed>
sqlalchemy <not installed>
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>