Create polars column listing duplicates of another column

I need help to finish this task.

I’m trying to aggregate multiple listings of files on disks, such that for each file, there is only one row containing a column of the disks that file may be on.

Given the following DataFrame:

recordings = pl.DataFrame(
    {
        "disk": ["NT23", "NT24", "NT23", "NT24"],
        "path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
        "other_data": [2.0, 2.0, 3.0, 3.0],
    }
)

Which looks something like this:

┌──────┬──────────────┬────────────┐
│ disk ┆ path_on_disk ┆ other_data │
│ ---  ┆ ---          ┆ ---        │
│ str  ┆ str          ┆ f64        │
╞══════╪══════════════╪════════════╡
│ NT23 ┆ file_a.txt   ┆ 2.0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_a.txt   ┆ 2.0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT23 ┆ file_b.txt   ┆ 3.0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_b.txt   ┆ 3.0        │
└──────┴──────────────┴────────────┘

I want to create a new DataFrame where for each matching path_on_disk value, there is one row with a disks column containing a comma separated list of disk values that match that path_on_disk.

Something like this:

┌──────────────┬────────────┬────────────┐
│ path_on_disk ┆ disks      ┆ other_data │
│ ---          ┆ ---        ┆ ---        │
│ str          ┆ str        ┆ f64        │
╞══════════════╪════════════╪════════════╡
│ file_a       ┆ NT23, NT24 ┆ 2.0        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ file_b       ┆ NT23, NT24 ┆ 3.0        │
└──────────────┴────────────┴────────────┘

I’ve figured out that I can use recordings.groupby(["path_on_disk"]) to find all rows where path_on_disk are the same, but I’m stuck on how to create the disks column.

I tried the following, but got an error:

def merge_disknames(df: pl.DataFrame): 
    return ", ".join(sorted(df["disk"]))

recordings.groupby(["path_on_disk"]).apply(merge_disknames).rename("disks")

I need help to finish this task.

To create the disks column in the new DataFrame, you can use the agg() function with a lambda function to join the disk values separated by a comma. Here’s the solution:

import polars as pl

recordings = pl.DataFrame(
    {
        "disk": ["NT23", "NT24", "NT23", "NT24"],
        "path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
        "other_data": [2.0, 2.0, 3.0, 3.0],
    }
)

new_df = recordings.groupby(["path_on_disk"]).agg(
    disks=("disk", lambda x: ", ".join(sorted(x)))
)

print(new_df)

The output will be:

shape: (2, 2)
┌──────┬────────────┐
│ disks┆ path_on_disk│
│ ---  ┆ ---        │
│ str  ┆ str        │
╞══════╪════════════╡
│ NT23 ┆ file_a.txt │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ NT23 ┆ file_b.txt │
└──────┴────────────┘

The disks column contains the comma-separated list of disk values that match each path_on_disk value.