I am working with a dataset that fits into RAM, but I cannot run certain methods, such as `.drop_nulls()`, without getting an out-of-memory error. I am using Polars 0.18.4 with 14 GB of RAM on WSL.
`df.estimated_size()` says my dataset is around 6 GB when I read it in. The schema of my data is:

```
index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
```
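For reference, this is roughly how I load the data and check its size. The file name and reader are placeholders (the actual source format may differ); only the `estimated_size` and `schema` calls matter here:

```python
import polars as pl

# Placeholder path and reader; the real source may be a different format.
df = pl.read_csv("data.csv")

# Estimated in-memory size of the DataFrame, in gigabytes.
print(df.estimated_size("gb"))  # roughly 6
print(df.schema)
```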
When I try to use `.drop_nulls()`, I get an out-of-memory error. To sidestep this, I wrote a custom function:
```python
def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Drop rows that have a null in any of the subset columns, one column at a time.
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())
    return lf


df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()
```
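This is intended to behave like the built-in call below (dropping any row with a null in one of the subset columns), which is essentially the call that runs out of memory for me:

```python
# Built-in equivalent of the workaround above; this is what fails with OOM.
df = df.drop_nulls(subset=["col1", "col2"])
```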
However, I have not been able to think of a similar trick for `.unique()`. Is there a way to make `.unique()` take less memory? I have tried:
```python
df = df.lazy().unique(subset=["col1", "col2"]).collect(streaming=True)
```
as well as deduplicating slice by slice:

```python
def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    parts = []
    for chunk in df.iter_slices(n_rows=n_rows):
        # Deduplicate each slice on its own, then stitch the pieces back together.
        parts.append(chunk.unique(subset=subset))
    return pl.concat(parts)
```
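For completeness, I assume the fully lazy version of the streaming attempt would look roughly like this, with the whole query planned before anything is materialized. The CSV reader and path are placeholders for however the data is actually loaded, and I have not verified that this reduces memory use:

```python
import polars as pl

# Placeholder source; the real data may come from a different format.
df = (
    pl.scan_csv("data.csv")
    # maintain_order=False gives the engine more freedom; I am not sure
    # whether it is required for the streaming path.
    .unique(subset=["col1", "col2"], maintain_order=False)
    .collect(streaming=True)
)
```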
For now, I am using:

```python
pl.from_pandas(df.collect().to_pandas().drop_duplicates(subset=["col1", "col2"]))
```

I am curious why Polars is not more memory-efficient than Pandas in this case, and whether there are any better solutions.