Problem
When creating a DataFrame using Row(), you pass named arguments that define the column names and values, with the arguments in a different order for each row. In the output, you notice that the column values are assigned in the order they were passed in, not to the columns you named.
Example code
from pyspark.sql import Row
row1 = Row(FirstColumn=1, SecondColumn=2)
row2 = Row(SecondColumn=3, FirstColumn=4)
df = spark.createDataFrame([row1, row2])
df.show()
Expected output
+-----------+------------+
|FirstColumn|SecondColumn|
+-----------+------------+
|          1|           2|
|          4|           3|
+-----------+------------+
Actual output
+-----------+------------+
|FirstColumn|SecondColumn|
+-----------+------------+
|          1|           2|
|          3|           4|
+-----------+------------+
Cause
When you create a Row() with named arguments, the resulting object is a subclass of tuple, not a dictionary. The values are stored in the order they were passed in, and createDataFrame() assigns them to columns by position, not by field name, so no mapping from argument name to column occurs.
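A minimal sketch of this behavior, assuming Spark 3.0 or later (where Row preserves the order in which the keyword arguments were passed):
from pyspark.sql import Row

row2 = Row(SecondColumn=3, FirstColumn=4)

# Row is a subclass of tuple, so the values are stored positionally.
print(isinstance(row2, tuple))  # True
print(row2[0], row2[1])         # 3 4 -- the order the values were passed in
print(row2.FirstColumn)         # 4  -- name-based access works on the Row itself,
                                #      but createDataFrame() matches values by position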
Solution
Create the DataFrame from a list of dictionaries, or convert each row with the row.asDict() method. Because a dictionary maps each value to its key, the values are assigned to the correct columns regardless of the order in which they are passed.
To create the DataFrame from a list of dictionaries, adapt the following example code.
data = [
    {"FirstColumn": 1, "SecondColumn": 2},
    {"SecondColumn": 3, "FirstColumn": 4}
]

df2 = spark.createDataFrame(data)
df2.show()
Alternatively, to use the row.asDict() method, adapt the following example code.
from pyspark.sql import Row

row1 = Row(FirstColumn=1, SecondColumn=2)
row2 = Row(SecondColumn=3, FirstColumn=4)

# Convert each Row to a dictionary so the values are mapped to columns by name.
row1_dict = row1.asDict()
row2_dict = row2.asDict()

df = spark.createDataFrame([row1_dict, row2_dict])
df.show()
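If you have many Row objects, the same conversion can be applied in a single pass. A minimal sketch, assuming the Row objects are collected in a list named rows and an existing spark session:
from pyspark.sql import Row

rows = [
    Row(FirstColumn=1, SecondColumn=2),
    Row(SecondColumn=3, FirstColumn=4),
]

# Convert every Row to a dictionary before building the DataFrame,
# so each value is matched to its column by name rather than by position.
df = spark.createDataFrame([r.asDict() for r in rows])
df.show()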