How to use Jupyter Notebook with Spark on Ubuntu


In this post, I outline the steps I used to enable Jupyter notebooks to run Spark on my Ubuntu machine. I followed, with a few tweaks, the instructions in a blog post by The Data Incubator. I have the Python 2.7 version of Anaconda, which comes with Jupyter pre-installed.

Install Java

$sudo apt-get install default-jre
$sudo apt-get install default-jdk

The Java installation can be verified using

$java -version

I get the following response from this command:

openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

Install Spark

Download Spark from http://spark.apache.org/downloads.html and extract the contents of the .tgz file as follows.

$tar zxvf spark-2.0.1-bin-hadoop2.4.tgz

Setup paths

Add the following two lines to your shell’s startup script, ~/.bashrc in my case. I installed Spark in the directory /usr/local/share/spark/.

export SPARK_HOME=/usr/local/share/spark/spark-2.0.1-bin-hadoop2.4 
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib

Now reload ~/.bashrc so that the new variables take effect in the current shell.

$source ~/.bashrc
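
To check that these variables are visible to Python (and hence to a Jupyter notebook launched from that shell), a quick sketch like the following can be run in a Python session; the values printed should match the paths set above.

import os

# SPARK_HOME should point at the extracted Spark directory
print os.environ.get("SPARK_HOME")

# PYTHONPATH should include $SPARK_HOME/python and $SPARK_HOME/python/lib
print os.environ.get("PYTHONPATH")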

Install Apache Toree

Apache Toree provides a Jupyter kernel that connects to Spark. Install Toree and register the kernel with Jupyter as follows.

$pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
$jupyter toree install --user

Install py4j

PySpark uses py4j to communicate with the JVM, so install it as well.

$pip install py4j
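
With the paths set and py4j installed, pyspark should now be importable from a plain Python session. A minimal sanity check:

# pyspark should import once PYTHONPATH and py4j are in place
import pyspark
print pyspark.__version__   # should match the downloaded release, 2.0.1 here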

That’s it! We can now use PySpark from a Jupyter notebook.

from pyspark import SparkContext

# "local[*]" runs Spark locally on all available cores; "temp" is the application name
sc = SparkContext("local[*]", "temp")
print sc.version

This prints the Spark version:

2.0.1
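
As a further check that Spark is actually doing work, here is a small sketch that runs a couple of actions on an RDD, using the sc created above:

# Build a small RDD and run a couple of simple actions on it
rdd = sc.parallelize(range(100))
print rdd.filter(lambda x: x % 2 == 0).count()   # 50 even numbers
print rdd.sum()                                   # 0 + 1 + ... + 99 = 4950

# Stop the context when finished to release local resources
sc.stop()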