Problem
You’re using Scala in a notebook to parallelize write operations across multiple tables, as in the following example.
%scala
def f2(x: String): Unit = {
  val data = Array(<your-array>)
  val dataRDD = sc.parallelize(data)
  val dataDF = dataRDD.toDF()
  val tempFile = "<your-base-path>".replaceAll("destinationTableName", x)
  dataDF.write.mode("overWrite").parquet(tempFile)
  println(x)
}

val tables = Array(<your-table-names>)
sc.parallelize(tables).foreach(f2)
When you run the cell, you receive the following error in the cell output.
Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 10) (10.101.174.123 executor 0): java.lang.NullPointerException: Cannot invoke "org.apache.spark.SparkContext.parallelize$default$2()" because the return value of "$line1b8899e3320d4fb98e335d30f4db19a03.$read$$iw$$iw.sc()" is null
Cause
The SparkContext (sc) exists only on the driver. Calling sc.parallelize(tables).foreach(f2) ships f2 to the executors as Spark tasks, and on an executor sc is null, so the nested sc.parallelize call inside f2 cannot run.
In the example in the previous section, each task therefore throws the NullPointerException as soon as it tries to build its DataFrame and perform the write.
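To make the distinction concrete, the following sketch contrasts the two patterns. It assumes a notebook where sc and spark.implicits._ are already available, and table_a, table_b, and the sample values are made-up placeholders.
%scala
// Safe: the loop itself runs on the driver, where sc is defined, so nested Spark calls work.
Seq("table_a", "table_b").foreach { name =>
  val df = sc.parallelize(Seq(1, 2, 3)).toDF()
  println(s"$name: ${df.count()} rows")
}

// Unsafe: distributing the same loop with sc.parallelize runs the body on executors,
// where sc is null, so the nested sc.parallelize throws a NullPointerException.
// sc.parallelize(Seq("table_a", "table_b")).foreach { name =>
//   sc.parallelize(Seq(1, 2, 3)).toDF()
// }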
Solution
Iterate over the table names sequentially with a standard Scala foreach loop instead of distributing them with sc.parallelize.
Because the loop runs on the driver, sc remains available inside f2 and the tables are written one after another, avoiding the stage failure. Each individual write is still executed as a distributed Spark job, so you keep parallelism within each table's write.
%scala
import spark.implicits._ // needed for toDF(); notebooks typically import this automatically

// f2 is invoked on the driver, so sc is available; each write still runs as a distributed job.
def f2(x: String): Unit = {
  val data = Array(<your-array>)
  val dataRDD = sc.parallelize(data)
  val dataDF = dataRDD.toDF()
  val tempFile = "<your-base-path>".replaceAll("destinationTableName", x)
  dataDF.write.mode("overwrite").parquet(tempFile) // replace any existing output for this table
  println(x)
}

val tables = Array(<your-table-names>)
tables.foreach(f2) // sequential loop on the driver; no use of sc inside executor tasks
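If you still need the per-table writes to overlap, keep the concurrency on the driver rather than on executors. The sketch below is a variation, not part of the fix above: it reuses the f2 and tables definitions from the previous cell and submits each write from a separate driver thread with Scala Futures, which Spark supports because jobs submitted from different driver threads can run concurrently.
%scala
// Optional variation (assumes f2 and tables are defined as in the cell above).
// All calls to sc stay on the driver; only the job submission is done concurrently.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val writes = tables.toSeq.map(table => Future(f2(table)))
Await.result(Future.sequence(writes), Duration.Inf) // block until every table has been written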