Problem
You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error.
Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you notice duplicate rows in recently written data.
Cause
When Apache Spark performs a JDBC write, each partition of the DataFrame is written to the SQL table by a separate task, generally as a single JDBC transaction so that rows are not inserted repeatedly. However, if a task fails after its transaction commits, but before the final stage completes, Spark retries the task and re-runs the already-committed insert, so duplicate data can be copied into the SQL table.
The PrimaryKeyViolation error occurs when a write operation attempts to insert a row whose primary key value already exists in the table.
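For reference, the vulnerable pattern is the standard JDBC append shown in the following Scala sketch. The connection URL, credentials, and table names are illustrative placeholders, and source_data stands in for whatever DataFrame you are writing.

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.getOrCreate()

val jdbcUrl = "jdbc:sqlserver://<host>:1433;database=<db>"  // placeholder
val connectionProps = new Properties()
connectionProps.put("user", "<user>")
connectionProps.put("password", "<password>")

// Each DataFrame partition is written by its own task in its own
// transaction. If a task's transaction commits but the task is
// retried before the stage finishes, that partition's rows are
// inserted a second time.
spark.table("source_data")  // hypothetical source DataFrame
  .write
  .mode(SaveMode.Append)
  .jdbc(jdbcUrl, "target_table", connectionProps)
```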
Solution
You should use a temporary table to buffer the write, and verify that it contains no duplicate data before merging it into the target table.
- Verify that speculative execution is disabled in your Spark configuration: spark.speculation false. This is disabled by default. (Steps 1-3 are sketched in the first code example after this list.)
- Create a temporary table on your SQL database.
- Modify your Spark code to write to the temporary table instead of the target table.
- After the Spark writes have completed, check the temporary table for duplicate rows (see the second example after this list).
- Merge the temporary table into the target table on your SQL database.
- Delete the temporary table. (Steps 5 and 6 are sketched in the third example after this list.)
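The first sketch below covers steps 1-3, assuming a SQL Server-style connection; the JDBC URL, credentials, and the staging_table and source_data names are placeholders, not values from this article.

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.getOrCreate()

// Step 1: confirm speculative execution is off (it is by default),
// so Spark never runs two copies of the same write task at once.
require(!spark.sparkContext.getConf.getBoolean("spark.speculation", false),
  "Disable spark.speculation before writing over JDBC")

val jdbcUrl = "jdbc:sqlserver://<host>:1433;database=<db>"  // placeholder
val props = new Properties()
props.put("user", "<user>")
props.put("password", "<password>")

// Steps 2-3: write to the temporary (staging) table instead of the
// target table. SaveMode.Overwrite lets Spark create or replace it.
spark.table("source_data")  // hypothetical source DataFrame
  .write
  .mode(SaveMode.Overwrite)
  .jdbc(jdbcUrl, "staging_table", props)
```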
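For step 4, one way to check for duplicates is to read the staging table back with Spark and fail fast if any key value appears more than once. This sketch reuses jdbcUrl and props from above, and assumes a key column named id; substitute your table's primary key.

```scala
import org.apache.spark.sql.functions.{col, count}

val staged = spark.read.jdbc(jdbcUrl, "staging_table", props)

// Group on the assumed key column and keep any value seen more than once.
val duplicates = staged
  .groupBy(col("id"))
  .agg(count("*").as("cnt"))
  .filter(col("cnt") > 1)

require(duplicates.isEmpty,
  "Duplicate keys found in staging_table; clean them up before merging")
```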
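Steps 5 and 6 run on the SQL database itself, so the final sketch executes them from the driver over a plain JDBC connection. The MERGE statement uses SQL Server syntax and hypothetical table and column names; adjust both for your database and schema.

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection(jdbcUrl, "<user>", "<password>")
try {
  val stmt = conn.createStatement()
  // Step 5: insert only the staged rows whose keys are not already in
  // the target, so the merge itself cannot violate the primary key.
  stmt.executeUpdate(
    """MERGE INTO target_table AS t
      |USING staging_table AS s
      |  ON t.id = s.id
      |WHEN NOT MATCHED THEN
      |  INSERT (id, value) VALUES (s.id, s.value);""".stripMargin)
  // Step 6: drop the staging table once the merge has committed.
  stmt.executeUpdate("DROP TABLE staging_table")
} finally {
  conn.close()
}
```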