Storing LazyFrames in a list results in significant memory increase if collect_schema() was called before. #26257

@woernerm

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from random import choices
import os
import psutil
import polars as pl

def print_memory_usage(msg: str):
    process = psutil.Process(os.getpid())
    rss_mb = process.memory_info().rss / (1024 * 1024)
    print(f"{msg}: RSS Memory Usage = {rss_mb:.2f} MB")


col_count = 100
row_count = 10_000
repetitions = 1000
num_range = 100

cols = {str(i): choices(range(num_range), k=row_count) for i in range(col_count)}
pl.DataFrame(cols).write_parquet("leaky.parquet")

print_memory_usage("Base level")

# Part 1: Scan files and store the collected schemas in a list.
schemas = []
for f in (pl.scan_parquet("leaky.parquet") for _ in range(repetitions)):
    schemas.append(f.collect_schema())
print_memory_usage("Storing schemas in a list")

# Part 2: Scan files and store the LazyFrames in a list
results = []
for f in (pl.scan_parquet("leaky.parquet") for _ in range(repetitions)):
    results.append(f)
print_memory_usage("Storing lazyframes in a list")

# Part 3: Scan files and store both the LazyFrames and the collected schemas in lists.
schemas = []
results = []
for f in (pl.scan_parquet("leaky.parquet") for _ in range(repetitions)):
    schemas.append(f.collect_schema())
    results.append(f)
print_memory_usage("Storing lazyframes AND schemas in a list")

Log output

Base level: RSS Memory Usage = 69.44 MB
Storing schemas in a list: RSS Memory Usage = 90.69 MB
Storing lazyframes in a list: RSS Memory Usage = 91.39 MB
Storing lazyframes AND schemas in a list: RSS Memory Usage = 293.79 MB

Issue description

Storing LazyFrame instances in a list results in a significant memory increase if collect_schema() was called on them beforehand. This happens even if the same file is scanned multiple times. My task involves scanning many Parquet files into LazyFrames, determining a common schema, casting all frames to that schema, and finally concatenating them and writing the result to a single large Parquet file.

The issue seems to be similar to #22366, but applying the solution from the resulting PR #24513 did not resolve it.

Expected behavior

Part 3 of the above example should not increase memory usage by more than Part 1 and Part 2 combined, so RSS should end up at roughly 113 MB rather than the observed 294 MB.
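The 113 MB estimate follows directly from the log output. A small sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# RSS values in MB, copied from the log output in this report.
base = 69.44
after_part1 = 90.69   # schemas only
after_part2 = 91.39   # LazyFrames only
after_part3 = 293.79  # LazyFrames and schemas together

part1_delta = after_part1 - base         # cost of storing 1000 schemas
part2_delta = after_part2 - after_part1  # cost of storing 1000 LazyFrames

# Expectation: Part 3 adds at most what Part 1 and Part 2 added together.
expected_part3 = after_part2 + part1_delta + part2_delta
excess = after_part3 - expected_part3

print(f"Expected after part 3: {expected_part3:.2f} MB")
print(f"Observed after part 3: {after_part3:.2f} MB")
print(f"Unexplained increase:  {excess:.2f} MB")
```

That leaves roughly 180 MB that neither the schemas nor the LazyFrames account for on their own.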

Installed versions

--------Version info---------
Polars:              1.37.1
Index type:          UInt32
Platform:            Windows-11-10.0.26100-SP0
Python:              3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
Runtime:             rt32

----Optional dependencies----
Azure CLI            The command "az" is either misspelled or
could not be found.
<not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
polars_cloud         <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)
