Slow model fitting when implementing Alternating Least Squares using Apache Spark PySpark

Override the block default to match the total cores available and consider using compute-optimized instances.

Written by Amruth Ashoka

Last published at: November 12th, 2024

Problem

When implementing the Alternating Least Squares (ALS) algorithm in Apache Spark PySpark, you notice slow model fitting, even with a significant amount of computational resources, particularly on large datasets.

 

Cause

Slow model fitting typically stems from two compounding factors: the compute-intensive operations inherent to ALS, and its default configuration. ALS alternates between solving large least-squares problems for the user and item factor matrices, and these matrix computations are CPU-intensive.

Regarding the default configuration, PySpark's ALS splits the factorization into ten blocks by default, meaning only ten CPU cores do useful work even if the compute has more available, leading to inefficient resource usage. You can confirm the defaults as shown below.
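
As a quick check, the block defaults are visible on the estimator itself. This minimal sketch only reads PySpark's documented default values:

from pyspark.ml.recommendation import ALS

als = ALS()
# Both block parameters default to 10, regardless of how many cores the cluster has
print(als.getNumUserBlocks())  # 10
print(als.getNumItemBlocks())  # 10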

 

Solution

Change the default number of blocks to the total number of cores available in the compute. 

from pyspark.ml.recommendation import ALS

als = ALS()
num_cores = sc.defaultParallelism  # Gets the total number of CPU cores on the cluster
als.setNumBlocks(num_cores)  # Overrides the number of blocks to match the number of CPU cores
model = als.fit(ratings)  # ratings is your training DataFrame of user-item interactions
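
For context, the following is a minimal end-to-end sketch. The column names and the tiny sample DataFrame are illustrative assumptions, not part of the original example; substitute your own ratings data. It assumes a Databricks notebook or similar environment where spark is a live SparkSession.

from pyspark.ml.recommendation import ALS

# Illustrative ratings data; replace with your real user-item-rating DataFrame
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 0, 1.0)],
    ["userId", "itemId", "rating"],
)

num_cores = spark.sparkContext.defaultParallelism  # Total cores across the cluster

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating")
als.setNumBlocks(num_cores)  # Split the factorization into one block per core
model = als.fit(ratings)

predictions = model.transform(ratings)  # Score the training set as a sanity check
predictions.show()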

 

Databricks also recommends using compute-optimized instances to obtain more cores.

 

Further reading: Empirical experiment

The following are the results of a four-trial experiment comparing memory-optimized and compute-optimized compute. All trials use the same 3.3 GB dataset and cost approximately 6 DBUs per hour.

Trials 1 and 2 use memory-optimized compute. Trial 1 keeps the default setting of 10 ALS blocks, resulting in a processing time of 5 minutes. In Trial 2, the ALS blocks are explicitly set to match the number of cores available (20), reducing the time to 3 minutes.

Trials 3 and 4 use compute-optimized compute. Trial 3 keeps the default setting of 10 ALS blocks and finishes in 3 minutes; compared with Trial 1, which uses memory-optimized compute at the same settings, the compute-optimized compute clearly performs better.

In Trial 4, the ALS blocks are explicitly set to match the number of cores available (40), producing the lowest execution time (2 minutes) and making this the optimal configuration for maximizing performance.

 

| Trial | Data size | DBU usage | Compute | ALS blocks | Time |
|---|---|---|---|---|---|
| Trial 1 | 3.3 GB | 6.12 DBU/hr | Memory optimized, r6id.XL, 5 workers | Default (10) | 5 min |
| Trial 2 | 3.3 GB | 6.12 DBU/hr | Memory optimized, r6id.XL, 5 workers | 20 (matches cores) | 3 min |
| Trial 3 | 3.3 GB | 6 DBU/hr | Compute optimized, c4.2XL, 5 workers | Default (10) | 3 min |
| Trial 4 (optimal config) | 3.3 GB | 6 DBU/hr | Compute optimized, c4.2XL, 5 workers | 40 (matches cores) | 2 min |