How can I extract a value from a JSON string in PySpark when the key to look up comes from another column of the same row? My data has the following format, and I would like to create the "value" column:
id_1 | id_2 | json_string | value |
---|---|---|---|
1 | 1001 | {"1001":106, "2200":101} | 106 |
1 | 2200 | {"1001":106, "2200":101} | 101 |
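from pyspark.sql.functions import col, concat, get_json_object, lit

# attempt: build the JSON path from the id_2 column
df_2 = df.withColumn(
    'value', get_json_object(col('json_string'), concat(lit('$.'), col('id_2')))
)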
This raises the error "Column is not iterable".
However, inserting the key manually works, i.e.:
df_2 = df.withColumn(
    'value', get_json_object(col('json_string'), '$.1001')
)
Any tips on solving this problem? Inserting the "id_2" values manually is not an option, since the dataset contains many thousands of keys and the real json_string is much longer, with many more key-value pairs.
Super thankful for any suggestions!
Regards
Answer
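The Python function get_json_object expects its path argument to be a plain string, so passing a Column built with concat() fails. You can call get_json_object inside expr() instead: the SQL expression accepts concat("$.", id2) as the path, so the key is taken from id2 row by row.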
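from pyspark.sql import SparkSession, functions as func

# reuse the active session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
data_ls = [
    ("1", "1001", '''{"1001":106, "2200":101}'''),
    ("1", "2200", '''{"1001":106, "2200":101}''')
]
data_sdf = spark.createDataFrame(data_ls, ("id1", "id2", "jstr"))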
# +---+----+--------------------+
# |id1| id2| jstr|
# +---+----+--------------------+
# | 1|1001|{"1001":106, "220...|
# | 1|2200|{"1001":106, "220...|
# +---+----+--------------------+
# build the path "$.<id2>" per row inside the SQL expression
(
    data_sdf
    .withColumn('val', func.expr('get_json_object(jstr, concat("$.", id2))'))
    .show(truncate=False)
)
# +---+----+------------------------+---+
# |id1|id2 |jstr |val|
# +---+----+------------------------+---+
# |1 |1001|{"1001":106, "2200":101}|106|
# |1 |2200|{"1001":106, "2200":101}|101|
# +---+----+------------------------+---+
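To sanity-check the expected values without a Spark session, the per-row lookup can be sketched in plain Python. This is only an illustration of what the expression is used for here, not how Spark implements it: lookup_value is a hypothetical helper name, it only handles flat top-level keys, and it returns the parsed JSON value (an int here), whereas get_json_object returns the value as a string.

```python
import json


def lookup_value(json_string, key):
    """Return the value stored under `key` in a flat JSON object, or None.

    Hypothetical helper: for a single row, it mirrors what
    get_json_object(jstr, concat("$.", id2)) is used for in the answer.
    """
    try:
        parsed = json.loads(json_string)
    except (TypeError, ValueError):
        # missing or malformed JSON: no value to return
        return None
    if not isinstance(parsed, dict):
        return None
    # JSON object keys are always strings, so normalise the lookup key
    return parsed.get(str(key))


# same sample rows as in the answer: (id1, id2, jstr)
rows = [
    ("1", "1001", '{"1001":106, "2200":101}'),
    ("1", "2200", '{"1001":106, "2200":101}'),
]

values = [lookup_value(jstr, id2) for _, id2, jstr in rows]
print(values)  # [106, 101]
```

Comparing a handful of rows against this helper is a quick way to confirm that the Spark column picks the key from id2 rather than a fixed key.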