I am working with a dataset that fits into RAM, but I cannot run certain methods, such as `.drop_nulls()`, without getting an out-of-memory error. I am using Polars 0.18.4 with 14 GB of RAM on WSL.
`df.estimated_size()` says my dataset is around 6 GB when I read it in. The schema of my data is:

```
index: Int64
first_name: Utf8
last_name: Utf8
race: Utf8
pct_1: Float64
pct_2: Float64
pct_3: Float64
pct_4: Float64
```
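For reference, this is roughly how I load the data and check its size. The file name and reader are placeholders (the actual source format may differ); only the `estimated_size` and `schema` calls matter here:

```python
import polars as pl

# Placeholder path and reader; the real source may be a different format.
df = pl.read_csv("data.csv")

# Estimated in-memory size of the DataFrame, in gigabytes.
print(df.estimated_size("gb"))  # roughly 6
print(df.schema)
```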
When I try to use `.drop_nulls()`, I get an out-of-memory error. To sidestep this, I wrote a custom function:
```python
def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    # Drop rows that have a null in any of the subset columns, one column at a time.
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())
    return lf


df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()
```
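This is intended to behave like the built-in call below (dropping any row with a null in one of the subset columns), which is essentially the call that runs out of memory for me:

```python
# Built-in equivalent of the workaround above; this is what fails with OOM.
df = df.drop_nulls(subset=["col1", "col2"])
```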
However, I have not been able to think of a similar trick for `.unique()`. Is there a way to make `.unique()` take less memory? I have tried:
```python
df = df.lazy().unique(subset=["col1", "col2"]).collect(streaming=True)
```
as well as deduplicating slice by slice:

```python
def unique(df: pl.DataFrame, subset: list[str], n_rows: int = 100_000) -> pl.DataFrame:
    parts = []
    for chunk in df.iter_slices(n_rows=n_rows):
        # Deduplicate each slice on its own, then stitch the pieces back together.
        parts.append(chunk.unique(subset=subset))
    return pl.concat(parts)
```
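For completeness, I assume the fully lazy version of the streaming attempt would look roughly like this, with the whole query planned before anything is materialized. The CSV reader and path are placeholders for however the data is actually loaded, and I have not verified that this reduces memory use:

```python
import polars as pl

# Placeholder source; the real data may come from a different format.
df = (
    pl.scan_csv("data.csv")
    # maintain_order=False gives the engine more freedom; I am not sure
    # whether it is required for the streaming path.
    .unique(subset=["col1", "col2"], maintain_order=False)
    .collect(streaming=True)
)
```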
For now, I am using:

```python
pl.from_pandas(df.collect().to_pandas().drop_duplicates(subset=["col1", "col2"]))
```

I am curious why Polars is not more memory-efficient than Pandas in this case, and whether there are any better solutions.