How to use Jupyter Notebook with Spark on Ubuntu
In this post, I outline the steps I used to get Jupyter notebooks running Spark on my Ubuntu machine. I followed, with a few tweaks, the instructions in a blog post by The Data Incubator. I have the Python 2.7 version of Anaconda, which comes with Jupyter pre-installed.
Install Java
$sudo apt-get install default-jre
$sudo apt-get install default-jdk
The Java installation can be verified using
$java -version
I get the following response from this command
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Install Spark
Download Spark from http://spark.apache.org/downloads.html and simply extract the contents of the .tgz file as follows.
$tar zxvf spark-2.0.1-bin-hadoop2.4.tgz
Setup paths
Add the following two lines to your shell's startup script, ~/.bashrc in my case. I installed Spark in the directory /usr/local/share/spark/.
export SPARK_HOME=/usr/local/share/spark/spark-2.0.1-bin-hadoop2.4
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
Now source the updated ~/.bashrc so the changes take effect in the current shell.
$source ~/.bashrc
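As a quick, optional sanity check, a Python session started from the same shell should now see these settings; a minimal sketch of such a check:

import os
import sys

# SPARK_HOME should point at the extracted Spark directory
print os.environ.get('SPARK_HOME')

# the directories added via PYTHONPATH should show up on sys.path
print [p for p in sys.path if 'spark' in p.lower()]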
Install Apache Toree
Install Toree and configure Jupyter to run Toree as follows.
$pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
$jupyter toree install --user
Install py4j
$pip install py4j
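To double-check that both pyspark and py4j can now be found, a minimal check like this can be run from a plain Python prompt:

import pyspark

# should resolve to the copy of pyspark shipped inside $SPARK_HOME/python
print pyspark.__file__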
That’s it! Now we can use PySpark from Jupyter Notebooks.
from pyspark import SparkContext
sc = SparkContext("local[*]", "temp")
print sc.version
2.0.1
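To go one step beyond printing the version, a tiny computation confirms that the local workers actually run jobs. Continuing with the sc created above (the numbers are just illustrative):

# distribute a small list and compute the sum of squares on the local workers
rdd = sc.parallelize(range(10))
print rdd.map(lambda x: x * x).sum()   # 285

sc.stop()   # release the context when done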