You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error.
Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables.
When Apache Spark performs a JDBC write, each partition of the DataFrame is written to the SQL table. This is generally done as a single JDBC transaction per partition, in order to avoid repeatedly inserting data. However, if a task fails after its commit occurs, but before the stage completes, the partition can be written again on retry, and duplicate data is copied into the SQL table.
A PrimaryKeyViolation error occurs when a write operation attempts to insert a duplicate entry for the primary key.
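The failure mode described above can be reproduced in miniature. The following is only a simulation, assuming an in-memory SQLite database as a stand-in for the real SQL target; the names write_partition, simulate_retry, and target are hypothetical, and this is not Spark's own write path. It shows that writing the same committed partition a second time produces duplicate rows when there is no primary key, and a constraint error when there is one.

```python
import sqlite3


def write_partition(conn, table, rows):
    """Insert one partition's rows as a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(f"INSERT INTO {table} (id, value) VALUES (?, ?)", rows)


def simulate_retry(with_primary_key):
    """Write the same partition twice, as a retried task would."""
    conn = sqlite3.connect(":memory:")
    id_column = "id INTEGER PRIMARY KEY" if with_primary_key else "id INTEGER"
    conn.execute(f"CREATE TABLE target ({id_column}, value TEXT)")

    partition = [(1, "a"), (2, "b")]
    write_partition(conn, "target", partition)  # first attempt commits

    # Assumption: the task is reported as failed after the commit,
    # so the scheduler runs it again with the same rows.
    try:
        write_partition(conn, "target", partition)
        outcome = "duplicates"
    except sqlite3.IntegrityError:
        outcome = "primary key violation"

    row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    conn.close()
    return outcome, row_count


print(simulate_retry(with_primary_key=False))
print(simulate_retry(with_primary_key=True))
```

Without the constraint the table ends up holding every row twice; with it, the second attempt is rejected and rolled back, which corresponds to the job failure described in the problem statement.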
You should use a temporary table to buffer the write, and ensure there is no duplicate data.
- Verify that speculative execution is disabled in your Spark configuration (spark.speculation false). Speculative execution is disabled by default.
- Create a temporary table on your SQL database.
- Modify your Spark code to write to the temporary table.
- After the Spark writes have completed, check the temporary table to ensure there is no duplicate data.
- Merge the temporary table with the target table on your SQL database.
- Delete the temporary table.
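The database-side steps above (check for duplicates, merge, delete) can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database in place of the real SQL server; the table names staging and target and the helpers merge_staging and run_demo are hypothetical, and the exact merge syntax will differ between database vendors.

```python
import sqlite3


def merge_staging(conn):
    """Check staging for duplicates, merge it into target, then drop it.

    Returns the number of keys that appeared more than once in staging.
    """
    # Check the staging table for keys that were written more than once.
    duplicate_keys = conn.execute(
        "SELECT id FROM staging GROUP BY id HAVING COUNT(*) > 1"
    ).fetchall()

    # Merge: DISTINCT collapses exact duplicate rows; keys already present
    # in target are skipped. Rows that share an id but differ in value
    # would still raise a constraint error, which is a genuine conflict.
    with conn:
        conn.execute(
            """
            INSERT INTO target (id, value)
            SELECT DISTINCT id, value FROM staging
            WHERE id NOT IN (SELECT id FROM target)
            """
        )

    # Drop staging only after the merge has committed.
    conn.execute("DROP TABLE staging")
    return len(duplicate_keys)


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
    # The staging table has no constraints, so a retried write cannot fail.
    conn.execute("CREATE TABLE staging (id INTEGER, value TEXT)")

    rows = [(1, "a"), (2, "b")]
    # Simulate a partition that was written twice because of a task retry.
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows + rows)
    conn.commit()

    duplicate_key_count = merge_staging(conn)
    merged = conn.execute("SELECT id, value FROM target ORDER BY id").fetchall()
    staging_tables = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE name = 'staging'"
    ).fetchone()[0]
    conn.close()
    return duplicate_key_count, merged, staging_tables


print(run_demo())
```

On the Spark side, the write to the staging table would presumably look something like df.write.jdbc(url, "staging", mode="append", properties=props) in PySpark, where url and props are placeholders for your own connection details.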
This workaround should only be used if you encounter the listed data duplication issue, as there is a small performance penalty when compared to Spark jobs that write directly to the target table.