Important interview point: coalesce() vs repartition() performance evaluation

M S Dillibabu
Apr 4, 2022

We may think that coalesce is always the best approach for reducing the number of partitions compared with repartition. It is in many cases, but not all. Consider the example below, where repartition delivers better performance than coalesce.

NOTE: It depends on the transformations we are performing on the DataFrame.

repartition: preserves the spark.sql.shuffle.partitions value, performs a full shuffle, and creates a new set of the requested partitions, so one extra stage is created for the shuffle (wide dependency).

coalesce: in effect it overrides the spark.sql.shuffle.partitions value with our specified value. It won't create new partitions; instead it merges the partitions that already exist, and no full shuffle happens (narrow dependency). The sketch below shows the difference.
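Here is a minimal sketch of that difference, assuming a local SparkSession; the app name and the range data are made up purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-vs-repartition") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

val df = spark.range(0, 1000000).toDF("id")

// repartition adds an Exchange (full shuffle) to the plan: wide dependency
df.repartition(5).explain() // plan contains Exchange RoundRobinPartitioning(5)

// coalesce merges existing partitions with no shuffle: narrow dependency
df.coalesce(5).explain()    // plan contains Coalesce 5, no extra Exchange

println(df.repartition(5).rdd.getNumPartitions) // 5
println(df.coalesce(5).rdd.getNumPartitions)    // 5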

You have two DataFrames, df1 and df2; you need to join them and write the output as 5 part files in an optimized way. Would you use coalesce or repartition?

val df3 = df1.join(df2, df1("id") === df2("id"), "inner")

df3.coalesce(5).write.parquet(path) or df3.repartition(5).write.parquet(path)

General assumption:

By default the join will create 200 part files, so we would go for coalesce(5): we need to decrease the part files from 200 to 5, and coalesce avoids a full shuffle.
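The 200 comes from the shuffle-partitions setting, which you can check on your own session:

// defaults to "200" unless overridden in the cluster or session config
println(spark.conf.get("spark.sql.shuffle.partitions"))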

I have seen many blogs on the internet saying: use coalesce if you want to decrease the number of partitions, use repartition if you want to increase them.

That advice is good in general, but it is not a silver bullet.

Look at the timings (screenshot from my run):

repartition took: 13.40 min
coalesce took: 15.14 min

Reason —

If you use coalesce(5), the 5 replaces the spark.sql.shuffle.partitions value of 200, and the whole join has to run in 5 partitions. That means only 5 tasks run in parallel, i.e., only 5 cores are used for joining the data even though we have enough cores and RAM; we are under-utilizing the cluster. (It will not create 200 part files; the join as well as the output uses only 5 partitions.)

If you use repartition(5), the spark.sql.shuffle.partitions value is preserved, so the join runs in 200 partitions, which means 200 parallel tasks. Only after the join does an extra shuffle reduce the 200 partitions to 5 part files.
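A sketch of how you can verify this yourself, reusing the spark session from the earlier sketch (df1/df2 here are stand-ins built from ranges; the exact plan strings vary by Spark version):

// stand-in DataFrames for the df1/df2 of the example
val df1 = spark.range(0, 1000000).toDF("id")
val df2 = spark.range(0, 1000000).toDF("id")
val df3 = df1.join(df2, df1("id") === df2("id"), "inner")

// coalesce(5): Coalesce 5 sits above the join with no extra Exchange,
// so the join stage itself runs with only 5 tasks
df3.coalesce(5).explain()

// repartition(5): the join stage keeps 200 shuffle partitions, and an extra
// Exchange RoundRobinPartitioning(5) after the join produces the 5 files
df3.repartition(5).explain()

The Stages tab of the Spark UI shows the per-stage task counts for both variants.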

So whether to use coalesce or repartition depends on the transformations you have in the code, not on the silver-bullet rule you see on the internet.
