I am receiving an out of memory error when attempting to use the
.unique() method on a dataset that fits into my computer’s RAM.
My laptop has 16 GB of RAM, and I am running WSL with 14 GB allocated to it. I have tested the dataset with the
.estimated_size() method, and it is around 6 GB:
size = pl.read_parquet("data.parquet").estimated_size()  # ~6 GB
df = pl.scan_parquet("data.parquet")  # use LazyFrames
I am using Polars version 0.18.4 and the LazyFrames feature. I am unable to perform operations such as
.unique() and .drop_nulls() without getting SIGKILLed.
I have attempted to sidestep this issue by writing a custom function, but this still results in an out of memory error.
def iterative_drop_nulls(lf: pl.LazyFrame, subset: list[str]) -> pl.LazyFrame:
    for col in subset:
        lf = lf.filter(~pl.col(col).is_null())
    return lf

df = df.pipe(iterative_drop_nulls, ["col1", "col2"]).collect()
I have also tried using
.collect(streaming=True), but to no avail.
I have found a workaround using Pandas:
df = pl.from_pandas(pd.read_parquet("data.parquet").drop_duplicates())
This works, but it doesn’t explain why I am getting an out of memory error when using Polars, given that the dataset fits into RAM. Is there something I can do to make
.unique() take less memory?
The out of memory error you’re experiencing when using the
.unique() method in Polars is likely because the method materializes the whole dataset and builds a hash table of row values to detect duplicates. Even though the dataset itself fits into your computer’s RAM, peak memory during the operation can be a multiple of the dataset’s size.
To make the
.unique() method take less memory, you can try a few possible solutions:
Increase the available memory: If possible, try allocating more memory to your WSL environment. You mentioned that WSL currently gets 14 GB, so consider raising that limit to give the operation more headroom.
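For WSL 2 specifically, the memory cap is set in the .wslconfig file in your Windows user profile; the value below is illustrative:

```ini
[wsl2]
memory=15GB
```

Run wsl --shutdown afterwards for the change to take effect.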
Use chunked processing: Instead of deduplicating the entire dataset at once, you can process it in smaller slices. Polars provides a
.slice() method that lets you take contiguous row ranges, deduplicate each range separately, and then deduplicate the much smaller union. This can help reduce peak memory usage. Here’s an example:
chunk_size = 100_000  # adjust the chunk size as needed
lf = pl.scan_parquet("data.parquet")
n_rows = lf.select(pl.count()).collect().item()
unique_values = []
for offset in range(0, n_rows, chunk_size):
    unique_values.append(lf.slice(offset, chunk_size).collect().unique())
result = pl.concat(unique_values).unique()
This code takes slices of
chunk_size rows, deduplicates each slice, then concatenates the per-slice unique rows and deduplicates the combined result. Note that the final
.unique() still has to hold the union of per-slice unique rows in memory, so this helps most when the data contains many duplicates.
Use a combination of lazy operations: Instead of using
.unique() directly, you can try using a combination of lazy operations to achieve the same result. For example, you can use
.groupby() followed by
.agg() to keep one row per distinct key:
unique_values = df.groupby(["col1", "col2"]).agg(pl.all().first()).collect()
This code groups the dataset by the specified columns and keeps the first row of each group (
pl.all() covers the remaining columns). The result is equivalent to .unique(subset=["col1", "col2"], keep="first").
Try these approaches and see if they help reduce the memory usage when using the
.unique() method in Polars.