Problem
When implementing the Alternating Least Squares (ALS) algorithm with PySpark (the Apache Spark Python API), you notice slow model fitting, particularly with large datasets, even when the compute has a significant number of cores available.
Cause
Slow model fitting results from two factors acting together: the compute-intensive operations ALS performs and its default configuration. ALS relies on large matrix computations, which are CPU-intensive.
Regarding the default configuration, ALS sets the number of user blocks and item blocks (numUserBlocks and numItemBlocks) to ten. These blocks determine how the computation is partitioned, so only about ten CPU cores are used even if the compute has more available, leading to inefficient resource usage.
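The effect of the default of ten on core usage can be illustrated with a small calculation. This is a simplified model, not a description of the Spark scheduler: it assumes one task per block and one core per task, and the function names are hypothetical.

```python
def max_busy_cores(num_blocks: int, total_cores: int) -> int:
    """Upper bound on cores working at once, assuming one task per block."""
    return min(num_blocks, total_cores)


def utilization(num_blocks: int, total_cores: int) -> float:
    """Fraction of the available cores that can be kept busy in this model."""
    return max_busy_cores(num_blocks, total_cores) / total_cores


# Compare the default of ten blocks with one block per core.
for cores in (20, 40):
    print(f"{cores} cores, 10 blocks: {utilization(10, cores):.0%} busy")
    print(f"{cores} cores, {cores} blocks: {utilization(cores, cores):.0%} busy")
```

Under these assumptions, a 40-core compute running with ten blocks keeps at most a quarter of its cores busy.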
Solution
Override the default number of ALS blocks so that it matches the total number of cores available in the compute.
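from pyspark.ml.recommendation import ALS

als = ALS()
num_cores = sc.defaultParallelism  # Default parallelism, typically the total number of CPU cores on the cluster
als.setNumBlocks(num_cores)  # Sets both numUserBlocks and numItemBlocks to the number of CPU cores
model = als.fit(ratings_df)  # ratings_df is a placeholder for your training DataFrame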
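For a slightly fuller picture, the same idea can be wrapped in a reusable function. This is a minimal sketch, not an official recipe: the function names and the column names (userId, movieId, rating) are hypothetical, and it assumes the training data is a DataFrame with one row per rating.

```python
def block_params(num_cores: int) -> dict:
    """Keyword arguments that set both ALS block counts to the core count."""
    if num_cores < 1:
        raise ValueError("num_cores must be a positive integer")
    return {"numUserBlocks": num_cores, "numItemBlocks": num_cores}


def fit_als(ratings_df, user_col="userId", item_col="movieId", rating_col="rating"):
    """Fit ALS with one block per core. The column names are placeholders."""
    # Imported here so block_params can be used without a Spark installation.
    from pyspark.ml.recommendation import ALS
    from pyspark.sql import SparkSession

    # Reuses the existing session (for example, the one in a notebook).
    spark = SparkSession.builder.getOrCreate()
    num_cores = spark.sparkContext.defaultParallelism

    als = ALS(
        userCol=user_col,
        itemCol=item_col,
        ratingCol=rating_col,
        **block_params(num_cores),
    )
    return als.fit(ratings_df)
```

Passing numUserBlocks and numItemBlocks to the constructor has the same effect as calling setNumBlocks, and keeps the whole configuration in one place.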
Databricks also recommends using compute-optimized instances in order to obtain more cores.
Further reading: Empirical experiment
The following are the results of an experiment consisting of four trials that compare memory-optimized and compute-optimized compute. All trials use the same data size of 3.3 GB and incur approximately 6 DBUs per hour.
Trials 1 and 2 use memory-optimized compute. Trial 1 uses the default setting of 10 ALS blocks, resulting in a processing time of 5 minutes. In Trial 2, the ALS blocks are explicitly set to match the number of cores available (20), reducing the time to 3 minutes.
Trials 3 and 4 use compute-optimized compute. Trial 3 uses the default setting of 10 ALS blocks and has a processing time of 3 minutes. Compared with Trial 1, which uses memory-optimized compute with the same default setting, the compute-optimized compute performs better.
In Trial 4, the ALS blocks are explicitly set to match the number of cores available (40). This trial has the lowest execution time (2 minutes), making it the optimal configuration for maximizing performance.
Trial | Data size | DBU usage | Compute | ALS blocks | Time
Trial 1 | 3.3 GB | 6.12 DBU/hr | Memory optimized, 5 workers | Default (10) | 5 min
Trial 2 | 3.3 GB | 6.12 DBU/hr | Memory optimized, 5 workers | 20 (number of cores) | 3 min
Trial 3 | 3.3 GB | 6 DBU/hr | Compute optimized (c4.2XL), 5 workers | Default (10) | 3 min
Trial 4 (optimal config) | 3.3 GB | 6 DBU/hr | Compute optimized, 5 workers | 40 (number of cores) | 2 min
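The timings in the table can be turned into speedups relative to Trial 1 with a few lines of arithmetic. The sketch below uses only the reported times; because they are rounded to whole minutes, the ratios are approximate.

```python
# Processing times in minutes, as reported in the table.
trial_minutes = {"Trial 1": 5, "Trial 2": 3, "Trial 3": 3, "Trial 4": 2}

baseline = trial_minutes["Trial 1"]

# Speedup = baseline time divided by the trial's time.
speedups = {name: baseline / minutes for name, minutes in trial_minutes.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x relative to Trial 1")
```

On these figures, matching the blocks to the cores on compute-optimized compute (Trial 4) is 2.5 times faster than the default configuration on memory-optimized compute (Trial 1).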