I am working with PySpark and a Spark DataFrame in which each row contains the following information (the keys are always the same, although the values inside "tree", "grass" and "weed" may vary):
{tree={in_season=true, index={color=null, category=null, value=null}, display_name=Tree, data_available=false}, weed={in_season=false, index={color=null, category=null, value=null}, display_name=Weed, data_available=false}, grass={in_season=true, index={color=null, category=null, value=null}, display_name=Grass, data_available=false}}
What I'm trying to do is keep only some of the fields; for example, from "tree", keep "in_season", "index -> value" and "display_name", among others.
The dataframe has the following schema:
df2.printSchema()

root
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- types: string (nullable = true)
 |-- plants: string (nullable = true)
What I have tried so far is to use StructType() as follows:
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = ArrayType(StructType([StructField("tree", StringType())]))
df3 = df2.withColumn("tree", from_json(df2.types, schema))
The result I am getting is NULL for each row of the dataframe.
Is there any other way to do this, or do I need to build the StructType schema differently?
Thank you very much in advance for the help!
For your problem, it may be useful to use explode. Here is a link to an article that covers it: PySpark explode array and map columns to rows.