A partition is a chunk of your data. When loading from HDFS, Spark creates a partition for each HDFS block of your files (128 MB by default). This is normally fine, but there are times when you want to repartition, and Spark has a few functions to do that. The most commonly used are `repartition` and `coalesce`.
`repartition` asks Spark to create partitions of roughly equal size and distribute them around your cluster.
This causes a shuffle, because data has to move around the cluster. Sorts and joins also shuffle, and shuffling is an expensive operation.
You can see how often you are shuffling by looking at how many stages your Spark application creates: each new stage begins at a shuffle boundary.
`coalesce` reduces the number of partitions to the number you supply, giving you larger partitions. Unlike `repartition`, it won't cause a shuffle: it merges existing partitions in place. If there are already fewer partitions than the number you ask for, it does nothing. In short, `coalesce` gives you fewer, larger partitions.
When to use
You only ever want to use `repartition` if all of the following are true:
- You have done some filtering, removing a lot of your data
- You are about to perform another operation that benefits from running in parallel
- In your last run, or during benchmarking, you noticed that the vast majority of your data ended up in only a few of your partitions. That meant your application wasn't taking advantage of all the cores available to it, because most cores finished their share of the data and then sat idle.
If any of these is not true, in particular number 3, you should resist the urge to call `repartition`; it is most probably doing more harm than good.
Cover image from Unsplash.