Best Notebook for Testing/Developing Spark Code on Google Cloud Clusters

M S Dillibabu
2 min read · May 24, 2022

I have prepared a PoC-style comparison of Jupyter, Databricks, and Zeppelin notebooks on top of a Dataproc cluster in Google Cloud, to find the best notebook for executing Spark code.

(Figure: comparison of notebooks)

The billing information and all other comparisons above were done on a single node with 15 GB of RAM and 4 cores (an n1-standard-4 node), plus some local storage, which differs for each notebook.

GCP Vertex AI internally uses Jupyter notebooks, and it offers two types: managed and user-managed. A managed notebook is a modified version of Jupyter with extra features that are very useful on a Dataproc cluster (it was developed specifically for Google clusters), but its one limitation is that it doesn't provide a root user, so we can't install spylon-kernel to run Spark Scala code. A user-managed notebook lacks a few features compared to managed notebooks, but it does give root access, so we can install third-party components such as the Scala Spark kernel (spylon-kernel). Next is the Databricks notebook. Databricks is a company founded specifically around Spark, so its notebook has many advanced features and built-in optimizations, and we can use Python, Scala, and SQL in the same notebook.
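On a user-managed notebook, where root access is available, adding Scala support is a two-step setup. A minimal sketch, assuming the standard spylon-kernel package from PyPI and that the commands are run on the notebook VM itself:

```shell
# Install the spylon-kernel package on the notebook VM
pip install spylon-kernel

# Register the kernel with Jupyter so a "spylon-kernel" (Scala) option
# appears in the notebook launcher
python -m spylon_kernel install
```

After a kernel restart, new notebooks created with the spylon kernel run Spark Scala code; on a managed notebook this fails at the first step because there is no root user.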

I personally found the Vertex AI managed (Jupyter) notebook to be the best option for executing Spark code on Dataproc clusters in Google Cloud. This differs across clouds; on Azure, for example, Databricks is the best option. The managed notebook has an auto start/stop option, so billing covers only actual utilization: if the instance is idle and stopped, there is no billing.

We can create Dataproc clusters directly, and at creation time we can enable the Jupyter and Zeppelin notebook components with the command below or through the GUI. Compared to a Databricks cluster, there is no auto-stop option for Dataproc clusters, only auto-deletion.

gcloud dataproc clusters create clustername \
  --project projectname \
  --region us-central1 \
  --zone us-central1-a \
  --no-address \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 200 \
  --num-workers 3 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 250 \
  --enable-component-gateway \
  --optional-components=ANACONDA,JUPYTER,ZEPPELIN \
  --max-idle 30000s  # auto-deletes the cluster after this idle period
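Although there is no auto-stop, the cluster can be paused manually to cut compute billing without losing its configuration. A sketch using the standard gcloud stop/start subcommands (the cluster name and region here are placeholders matching the create command above):

```shell
# Stop the cluster to pause compute billing without deleting it
gcloud dataproc clusters stop clustername --region=us-central1

# Start it again when needed
gcloud dataproc clusters start clustername --region=us-central1
```

This is a manual workaround; unlike Databricks, nothing stops the cluster automatically when it goes idle, only the --max-idle deletion above.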
