I need help to finish this task.
I’m trying to aggregate multiple listings of files on disks, such that for each file, there is only one row containing a column of the disks that file may be on.
Given the following DataFrame:
recordings = pl.DataFrame(
{
"disk": ["NT23", "NT24", "NT23", "NT24"],
"path_on_disk": ["file_a.txt", "file_a.txt", "file_b.txt", "file_b.txt"],
"other_data": [2.0, 2.0, 3.0, 3.0],
}
)
Which looks something like this:
┌──────┬──────────────┬────────────┐
│ disk ┆ path_on_disk ┆ other_data │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════╪══════════════╪════════════╡
│ NT23 ┆ file_a.txt ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_a.txt ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT23 ┆ file_b.txt ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NT24 ┆ file_b.txt ┆ 3.0 │
└──────┴──────────────┴────────────┘
I want to create a new DataFrame where for each matching path_on_disk
value, there is one row with a disks
column containing a comma separated list of disk
values that match that path_on_disk
.
Something like this:
┌──────────────┬────────────┬────────────┐
│ path_on_disk ┆ disks ┆ other_data │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪════════════╪════════════╡
│ file_a ┆ NT23, NT24 ┆ 2.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ file_b ┆ NT23, NT24 ┆ 3.0 │
└──────────────┴────────────┴────────────┘
I’ve figured out that I can use recordings.groupby(["path_on_disk"])
to find all rows where path_on_disk
are the same, but I’m stuck on how to create the disks
column.
I tried the following, but got an error:
def merge_disknames(df: pl.DataFrame):
return ", ".join(sorted(df["disk"]))
recordings.groupby(["path_on_disk"]).apply(merge_disknames).rename("disks")
I need help to finish this task.