Convert n key-value pairs to rows with PySpark and Spark SQL

I’m trying to normalize the properties struct in the following schema into rows of property, attribute, and value:

    "data": {
        "type": "properties",
        "collection": [
            {
                "objectId": 23,
                "category": "Things",
                "externalId": "e3e052f9-0156-11d5-9301-0000863f27ad-00000017",
                "properties": {
                    "parentKey1": {
                        "key1": "Text of some description",
                        "key2": "0.00",
                        "key3": "",
                        "key4": "None",
                        "key5": "",
                        "key6": "",
                    },
                    "parentKey2": {
                        "key1": "0",
                        "key2": "",
                        "key3": "",
                        "key4": "3Zu5Bv0LOHrPC10026FoUw"
                    },
                    "parentKey3": {
                        "key1": "8355711",
                        "key2": "No",
                        "key3": "Default",
                        "key4": "64",
                        "key5": "50"
                    },
                    "parentKey4": {
                        "key1": "Some description",
                        "key2": ""
                    }
                }
            }
          ]
    }

I’m trying to transform the data into a format like:

    property     attribute   value
    parentKey1   key1        Text of some description
    parentKey1   key2        0.00
    parentKey1   key3
    parentKey1   key4        None
    parentKey1   key5
    parentKey1   key6
    parentKey2   key1        0
    parentKey2   key2
    parentKey2   key3
    parentKey2   key4        3Zu5Bv0LOHrPC10026FoUw
    parentKey3   key1        8355711
    parentKey3   key2        No
    parentKey3   key3        Default
    parentKey3   key4        64
    parentKey3   key5        50
    parentKey4   key1        Some description
    parentKey4   key2

I’m trying to achieve this using Spark SQL, but I’d be interested in a PySpark solution as well.

Here’s a PySpark solution to transform the data into the desired format:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, col, explode, lit, struct

    # Create a SparkSession
    spark = SparkSession.builder.getOrCreate()

    # Read the JSON file; multiLine is needed because the document spans multiple lines
    df = spark.read.option("multiLine", True).json("data.json")

    # Explode the collection array so each element becomes its own row
    items = df.select(explode("data.collection").alias("item"))

    # Promote the nested property structs (parentKey1, parentKey2, ...) to top-level columns
    props = items.select("item.properties.*")

    # Use the schema to enumerate parent and child field names and build one
    # (property, attribute, value) struct per leaf field
    pairs = []
    for parent in props.schema.fieldNames():
        for child in props.schema[parent].dataType.fieldNames():
            pairs.append(
                struct(
                    lit(parent).alias("property"),
                    lit(child).alias("attribute"),
                    col(f"{parent}.{child}").alias("value"),
                )
            )

    # Explode the array of structs into rows and flatten it into three columns
    result = props.select(explode(array(*pairs)).alias("kv")).select("kv.*")

    # Show the result
    result.show(truncate=False)

This solution assumes the JSON is stored in a file named “data.json” and is wrapped in an enclosing `{ }` so it parses as a single JSON document. Adjust the file path as needed.

The result DataFrame will have the desired format with columns “property”, “attribute”, and “value”. You can further process or save the result as needed.
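
Since the question also asks about Spark SQL, here is a rough equivalent. It is only a sketch under a couple of assumptions: the file is read with an explicit schema that types `properties` as a map of maps (rather than the struct Spark would infer), so that SQL’s `explode` can unpivot it, and the view name `raw` and file name `data.json` are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (ArrayType, LongType, MapType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema: properties becomes map<string, map<string, string>>
    # instead of an inferred struct, so explode() can iterate over its entries
    schema = StructType([
        StructField("data", StructType([
            StructField("type", StringType()),
            StructField("collection", ArrayType(StructType([
                StructField("objectId", LongType()),
                StructField("category", StringType()),
                StructField("externalId", StringType()),
                StructField("properties",
                            MapType(StringType(), MapType(StringType(), StringType()))),
            ]))),
        ])),
    ])

    spark.read.schema(schema).option("multiLine", True).json("data.json") \
        .createOrReplaceTempView("raw")

    # Explode the array, then the outer map (parent keys), then the inner map (attributes)
    result = spark.sql("""
        SELECT property, attribute, value
        FROM raw
        LATERAL VIEW explode(data.collection) c AS item
        LATERAL VIEW explode(item.properties) p AS property, attrs
        LATERAL VIEW explode(attrs) a AS attribute, value
    """)
    result.show(truncate=False)

Either way the output should match the table above, and because nothing is hard-coded, the map-of-maps variant keeps working if new parent keys or attributes appear in the data.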