Prevent Overlapping Ranges in Summary Groups

Gideon75 · June 26, 2023, 2:29am

I am working with the R programming language and have the following dataset on medical characteristics of patients and disease prevalence:

set.seed(123)
library(dplyr)

Patient_ID = 1:5000
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  <- as.factor(status )

height = rnorm(5000, 150, 10)
weight = rnorm(5000, 90, 10)
hospital_visits = sample.int(20,  5000, replace = TRUE)

################

disease = sample(c(TRUE, FALSE), 5000, replace = TRUE)

###################
my_data = data.frame(Patient_ID, gender, status, height, weight, hospital_visits, disease)

I am trying to calculate disease proportions within nested groups, which requires creating groups of patients whose ranges do not overlap. I am using the code below:

final = my_data |>
    group_by(gender, status) |>
    mutate(low_height = height < quantile(height, .2)) |>
    group_by(gender, status, low_height) |>
    mutate(low_weight = weight < quantile(weight, .2)) |>
    group_by(gender,  status, low_height, low_weight) |>
    mutate(low_visit = hospital_visits  < quantile(hospital_visits , .2)) |>
    group_by(gender, status, low_height, low_weight, low_visit) |>
    summarise(across(c(height, weight, hospital_visits),
                     ## list custom stats here:
                     list(min = \(xs) min(xs, na.rm = TRUE),
                     max = \(xs) max(xs, na.rm = TRUE)
                     ),
                     .names = "{.col}_{.fn}"
    ),
    prop_disease = sum(disease)/n(),
    ## etc.
    )

final$low_height = final$low_weight = final$low_visit = NULL

However, when I look at the results from this code I can see that overlapping height ranges have been created, which violates the original condition of non-overlapping groups. Can someone please show me if there is a way to fix this problem?

Abelardo20 · June 26, 2023, 11:50pm

To check whether the nested groups overlap, you can keep your pipeline as it is and add a check after it, as follows:

final = my_data |>
  group_by(gender, status) |>
  mutate(low_height = height < quantile(height, .2)) |>
  group_by(gender, status, low_height) |>
  mutate(low_weight = weight < quantile(weight, .2)) |>
  group_by(gender, status, low_height, low_weight) |>
  mutate(low_visit = hospital_visits  < quantile(hospital_visits , .2)) |>
  group_by(gender, status, low_height, low_weight, low_visit) |>
  summarise(across(c(height, weight, hospital_visits),
                   ## list custom stats here:
                   list(min = \(xs) min(xs, na.rm = TRUE),
                        max = \(xs) max(xs, na.rm = TRUE)
                   ),
                   .names = "{.col}_{.fn}"
  ),
  prop_disease = sum(disease)/n(),
  ## etc.
)

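# Check that every combination of grouping flags defines exactly one group
# (this has to run before the flag columns are dropped)
overlapping_groups <- final %>%
  group_by(gender, status, low_height, low_weight, low_visit) %>%
  summarise(n = n(), .groups = "drop") %>%
  filter(n > 1)

overlapping_groups

# Drop the helper flag columns once the check is done
final$low_height = final$low_weight = final$low_visit = NULL

The code generates an overlapping_groups data frame that lists any combination of gender, status and the three flags that appears in more than one row of final. If overlapping_groups has no rows, every combination defines exactly one group, so no patient is counted in two groups. Note that this confirms the groups are distinct; it does not compare the min/max ranges themselves. Rows that share the same low_height flag are split by weight and visits rather than by height, so their height_min/height_max ranges can still overlap.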
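As a rough sketch of what a check on the ranges themselves could look like (this is not the code from the posts above: the toy data, the smaller sample size, and the names toy, flagged, bands, height_band_overlaps and subgroups are all assumptions made for illustration), the example below repeats the nesting for height and weight only. It compares the height range of the low-height band with the rest within each gender/status pair, then lists the height ranges of the weight subgroups inside each band.

```r
library(dplyr)

set.seed(123)
n <- 1000  # smaller than the original 5000, for illustration only

# Toy data with the same kind of columns as my_data (assumed, not the original)
toy <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  status = factor(sample(c("Immigrant", "Citizen"), n, replace = TRUE)),
  height = rnorm(n, 150, 10),
  weight = rnorm(n, 90, 10)
)

# Same nesting idea as in the question, restricted to height and weight
flagged <- toy |>
  group_by(gender, status) |>
  mutate(low_height = height < quantile(height, .2)) |>
  group_by(gender, status, low_height) |>
  mutate(low_weight = weight < quantile(weight, .2)) |>
  ungroup()

# Height range of each band (low / not low) per gender-status pair
bands <- flagged |>
  group_by(gender, status, low_height) |>
  summarise(height_min = min(height),
            height_max = max(height),
            .groups = "drop")

# overlap is TRUE if the low band reaches into the not-low band
height_band_overlaps <- bands |>
  group_by(gender, status) |>
  summarise(
    overlap = max(height_max[low_height]) >= min(height_min[!low_height]),
    .groups = "drop"
  )

# Height ranges and sizes of the weight subgroups inside each band
subgroups <- flagged |>
  group_by(gender, status, low_height, low_weight) |>
  summarise(n = n(),
            height_min = min(height),
            height_max = max(height),
            .groups = "drop")

height_band_overlaps
subgroups
```

Because the low-height flag is defined by a single threshold per gender/status pair, overlap should come out FALSE for every pair. The rows of subgroups that share a low_height value are separated by weight only, so their height ranges will generally overlap even though each patient is counted in exactly one row.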