I am working with a dataset that fits into RAM, but I cannot run certain methods, such as `.drop_nulls()`, without getting an out-of-memory error. I am using Polars 0.18.4 with 14 GB of RAM on WSL.
`df.estimated_size()` says my dataset is around 6 GB when I read it in. The schema of my data is:

```
index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
```
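For reference, this is roughly how I load the data and check its size. The file name and reader are placeholders (the actual source format may differ); only the `estimated_size` and `schema` calls matter here:

```python
import polars as pl

# Placeholder path and reader; the real source may be a different format.
df = pl.read_csv("data.csv")

# Estimated in-memory size of the DataFrame, in gigabytes.
print(df.estimated_size("gb"))  # roughly 6
print(df.schema)
```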
When I try to use `.drop_nulls()`, I get an out-of-memory error. To sidestep this, I wrote a custom function:
```python
def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Drop rows that have a null in any of the subset columns, one column at a time.
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())
    return lf


df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()
```
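This is intended to behave like the built-in call below (dropping any row with a null in one of the subset columns), which is essentially the call that runs out of memory for me:

```python
# Built-in equivalent of the workaround above; this is what fails with OOM.
df = df.drop_nulls(subset=["col1", "col2"])
```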
However, I have not been able to think of a similar trick for `.unique()`. Is there a way to make `.unique()` take less memory? I have tried:
```python
df = df.lazy().unique(subset=["col1", "col2"]).collect(streaming=True)
```
as well as deduplicating slice by slice:

```python
def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    parts = []
    for chunk in df.iter_slices(n_rows=n_rows):
        # Deduplicate each slice on its own, then stitch the pieces back together.
        parts.append(chunk.unique(subset=subset))
    return pl.concat(parts)
```
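For completeness, I assume the fully lazy version of the streaming attempt would look roughly like this, with the whole query planned before anything is materialized. The CSV reader and path are placeholders for however the data is actually loaded, and I have not verified that this reduces memory use:

```python
import polars as pl

# Placeholder source; the real data may come from a different format.
df = (
    pl.scan_csv("data.csv")
    # maintain_order=False gives the engine more freedom; I am not sure
    # whether it is required for the streaming path.
    .unique(subset=["col1", "col2"], maintain_order=False)
    .collect(streaming=True)
)
```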
For now, I am using:

```python
pl.from_pandas(df.collect().to_pandas().drop_duplicates(subset=["col1", "col2"]))
```

I am curious why Polars is not more memory-efficient than Pandas in this case, and whether there are any better solutions.