Loops in Databricks vs. Parallel Processing

Loops and conditions are unavoidable in both application development and data processing. For a data engineer, however, they become a nightmare if not handled correctly.

I will not deep dive into how loops are handled in Databricks, but in short, the problem is that the same block of code is processed one iteration after another: serially. That means if a single block takes 30 milliseconds to execute and you need to loop through it 100 times, it will take 3,000 milliseconds (3 seconds).
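To make the arithmetic concrete, here is a trivial sketch; process_block is just a stand-in that sleeps for 30 milliseconds:

```python
import time

def process_block():
    # Stand-in for one iteration's work; sleeps for ~30 ms
    time.sleep(0.03)

start = time.time()
for _ in range(100):
    process_block()  # each iteration waits for the previous one to finish
print(f"Total: {time.time() - start:.2f} s")  # prints roughly 3 seconds
```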

Now, the 30 milliseconds I mentioned above can vary depending on what kind of processing you have. If each iteration does things like calling an external service or creating DataFrames, you will need more computational power and the execution time will increase.

Another point is that iterating over a DataFrame builds up a query plan that grows with each successive iteration. Meaning, the 100th iteration might take 10 times longer than the first few.
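You can see this for yourself with a toy loop. Each withColumn below stacks another transformation onto the logical plan; nothing executes yet, but planning gets slower as the plan grows:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# Each iteration adds another layer to the logical plan instead of
# executing anything; the plan (and planning time) keeps growing.
for i in range(100):
    df = df.withColumn(f"col_{i}", F.col("id") + i)

df.explain()  # the printed plan now reflects all 100 accumulated steps
```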

The reason I mentioned 'data engineer' specifically is that we need to work with very large sets of data, thousands and sometimes millions of rows. And sometimes these datasets grow exponentially between each time we need to process them.

I had to deal with such a dataset and that is why I thought it might be good to write about this.

I needed to do sentiment analysis on text comments, and the number of comments has grown exponentially over the years. The last year I had to run the sentiment analysis, there were over 30,000 comments. So I had to call the external service for each comment, get the results into a DataFrame, and save them to a CSV file. This was sometimes running overnight!
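Roughly, the original serial version looked something like the sketch below. The service URL and the get_sentiment helper are placeholders rather than the actual service I used, and df is assumed to be a Spark DataFrame with a "comment" column:

```python
import requests
import pandas as pd

def get_sentiment(text: str) -> str:
    # Placeholder for the external sentiment service call
    resp = requests.post("https://example.com/sentiment", json={"text": text})
    return resp.json()["label"]

comments = df.select("comment").collect()  # pull every row to the driver

results = []
for row in comments:  # one blocking service call per comment
    results.append({"comment": row["comment"],
                    "sentiment": get_sentiment(row["comment"])})

pd.DataFrame(results).to_csv("/dbfs/tmp/sentiment_results.csv", index=False)
```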

As expected, I tried many tweaks, such as caching, increasing computational power, and improving the code. But none of them worked or brought performance to a satisfactory level. I even tried a different mechanism called checkpointing, but I could not get far with it due to the nature of the processing I needed to do. Still, I was able to bring the total execution time down to 5 hours.
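For anyone curious, checkpointing looks something like this: you point Spark at a durable directory, then periodically checkpoint the DataFrame to truncate the accumulated lineage. The path and the transform function here are just illustrative stand-ins:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/dbfs/tmp/checkpoints")  # example path

df = spark.range(1000)

def transform(df, i):
    # Stand-in for the real per-iteration logic
    return df.withColumn(f"col_{i}", F.col("id") + i)

for i in range(100):
    df = transform(df, i)
    if i % 10 == 0:
        df = df.checkpoint()  # materialise the data and truncate the lineage
```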

Not only does this take so much time, it is also costly, as you have your cluster running for hours.

This is when parallel processing became my solution. It makes sense: we need to process very large datasets that grow exponentially. When working with Databricks, we have the ability to scale the resources, and we are charged only for the time we use them. And Databricks runs on a Spark cluster, where processing can be distributed across multiple machines.

However, one thing I had to do was rewrite my code to use distributed processing in Spark; that is, restructure it so the work runs in parallel across all the available machines rather than serially.
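As an illustration, one common way to push per-row work onto the executors is a pandas UDF. This is a minimal sketch rather than the exact route I took (which I describe next); it reuses the hypothetical get_sentiment helper from the earlier sketch:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.pandas_udf(StringType())
def sentiment_udf(comments: pd.Series) -> pd.Series:
    # Runs on the executors, so batches of comments are scored across
    # the cluster instead of one at a time on the driver
    return comments.apply(get_sentiment)

scored = df.withColumn("sentiment", sentiment_udf(F.col("comment")))
```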

In this case, I broke my DataFrame into subsets (I wrote these out as a series of files, each containing 10 rows of data) and processed those files in parallel.
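In code, the parallel driver looked roughly like this. The process_chunk function is a stand-in for the per-file work; on Databricks, it could be a dbutils.notebook.run call against a worker notebook that reads one file, calls the sentiment service, and writes its results:

```python
from concurrent.futures import ThreadPoolExecutor

# ~30k comments split into files of 10 rows each
n_chunks = 3000
chunk_paths = [f"/dbfs/tmp/chunks/part_{i:04d}.csv" for i in range(n_chunks)]

def process_chunk(path: str) -> None:
    # Stand-in for the per-file work: read the file, call the sentiment
    # service, write the results. On Databricks this could be
    # dbutils.notebook.run("score_comments", 0, {"path": path}).
    print(f"processing {path}")

with ThreadPoolExecutor(max_workers=8) as pool:  # tune to your cluster
    list(pool.map(process_chunk, chunk_paths))
```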

With this, I was able to bring the 5 hours down to just 30 minutes.

That brought the cost down and improved productivity.


