Convert n key-value pairs to rows with PySpark and Spark SQL

I’m trying to normalize the properties struct in the following schema into rows of property, attribute, and value:

    "data": {
        "type": "properties",
        "collection": [
            {
                "objectId": 23,
                "category": "Things",
                "externalId": "e3e052f9-0156-11d5-9301-0000863f27ad-00000017",
                "properties": {
                    "parentKey1": {
                        "key1": "Text of some description",
                        "key2": "0.00",
                        "key3": "",
                        "key4": "None",
                        "key5": "",
                        "key6": "",
                    },
                    "parentKey2": {
                        "key1": "0",
                        "key2": "",
                        "key3": "",
                        "key4": "3Zu5Bv0LOHrPC10026FoUw"
                    },
                    "parentKey3": {
                        "key1": "8355711",
                        "key2": "No",
                        "key3": "Default",
                        "key4": "64",
                        "key5": "50"
                    },
                    "parentKey4": {
                        "key1": "Some description",
                        "key2": ""
                    }
                }
            }
          ]
    }

I’m trying to transform the data into a format like:

    property     attribute   value
    parentKey1   key1        Text of some description
    parentKey1   key2        0.00
    parentKey1   key3
    parentKey1   key4        None
    parentKey1   key5
    parentKey1   key6
    parentKey2   key1        0
    parentKey2   key2
    parentKey2   key3
    parentKey2   key4        3Zu5Bv0LOHrPC10026FoUw
    parentKey3   key1        8355711
    parentKey3   key2        No
    parentKey3   key3        Default
    parentKey3   key4        64
    parentKey3   key5        50
    parentKey4   key1        Some description
    parentKey4   key2

I’m trying to achieve this using Spark SQL, but I’d be interested in a PySpark solution as well.

Here’s a PySpark solution to transform the data into the desired format:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, col, explode, lit, struct

    # Create a SparkSession
    spark = SparkSession.builder.getOrCreate()

    # Read the JSON file; multiLine is needed because the document spans multiple lines
    df = spark.read.option("multiLine", True).json("data.json")

    # Explode the collection array so each element becomes its own row
    items = df.select(explode("data.collection").alias("item"))

    # Promote the nested property structs (parentKey1, parentKey2, ...) to top-level columns
    props = items.select("item.properties.*")

    # Use the schema to enumerate parent and child field names and build one
    # (property, attribute, value) struct per leaf field
    pairs = []
    for parent in props.schema.fieldNames():
        for child in props.schema[parent].dataType.fieldNames():
            pairs.append(
                struct(
                    lit(parent).alias("property"),
                    lit(child).alias("attribute"),
                    col(f"{parent}.{child}").alias("value"),
                )
            )

    # Explode the array of structs into rows and flatten it into three columns
    result = props.select(explode(array(*pairs)).alias("kv")).select("kv.*")

    # Show the result
    result.show(truncate=False)

This solution assumes the JSON is stored in a file named “data.json” and is wrapped in an enclosing `{ }` so it parses as a single JSON document. Adjust the file path as needed.

The result DataFrame will have the desired format with columns “property”, “attribute”, and “value”. You can further process or save the result as needed.
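
Since the question also asks about Spark SQL, here is a rough equivalent. It is only a sketch under a couple of assumptions: the file is read with an explicit schema that types `properties` as a map of maps (rather than the struct Spark would infer), so that SQL’s `explode` can unpivot it, and the view name `raw` and file name `data.json` are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (ArrayType, LongType, MapType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema: properties becomes map<string, map<string, string>>
    # instead of an inferred struct, so explode() can iterate over its entries
    schema = StructType([
        StructField("data", StructType([
            StructField("type", StringType()),
            StructField("collection", ArrayType(StructType([
                StructField("objectId", LongType()),
                StructField("category", StringType()),
                StructField("externalId", StringType()),
                StructField("properties",
                            MapType(StringType(), MapType(StringType(), StringType()))),
            ]))),
        ])),
    ])

    spark.read.schema(schema).option("multiLine", True).json("data.json") \
        .createOrReplaceTempView("raw")

    # Explode the array, then the outer map (parent keys), then the inner map (attributes)
    result = spark.sql("""
        SELECT property, attribute, value
        FROM raw
        LATERAL VIEW explode(data.collection) c AS item
        LATERAL VIEW explode(item.properties) p AS property, attrs
        LATERAL VIEW explode(attrs) a AS attribute, value
    """)
    result.show(truncate=False)

Either way the output should match the table above, and because nothing is hard-coded, the map-of-maps variant keeps working if new parent keys or attributes appear in the data.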