How to use Jupyter Notebook with Spark on Ubuntu
In this post, I outline the steps I used to get Jupyter notebooks running Spark on my Ubuntu machine. I followed, with a few tweaks, the instructions in a blog post by The Data Incubator. I have the Python 2.7 version of Anaconda, which comes with Jupyter pre-installed.
Install Java
$sudo apt-get install default-jre
$sudo apt-get install default-jdk
The Java installation can be verified using
$java -version
I get the following response from this command
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Install Spark
Download Spark from http://spark.apache.org/downloads.html and simply extract the contents of the .tgz file as follows.
$tar zxvf spark-2.0.1-bin-hadoop2.4.tgz
Setup paths
Add the following two lines to your shell's startup script, ~/.bashrc in my case. I installed Spark in the directory /usr/local/share/spark/.
export SPARK_HOME=/usr/local/share/spark/spark-2.0.1-bin-hadoop2.4
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
Now source the updated ~/.bashrc so the changes take effect in the current shell.
$source ~/.bashrc
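As a quick, optional sanity check, a Python session started from the same shell should now see these settings; a minimal sketch of such a check:

import os
import sys

# SPARK_HOME should point at the extracted Spark directory
print os.environ.get('SPARK_HOME')

# the directories added via PYTHONPATH should show up on sys.path
print [p for p in sys.path if 'spark' in p.lower()]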
Install Apache Toree
Install Toree and configure Jupyter to run Toree as follows.
$pip install https://dist.apache.org/repos/dist/dev/incubator/toree/0.2.0/snapshots/dev1/toree-pip/toree-0.2.0.dev1.tar.gz
$jupyter toree install --user
Install py4j
$pip install py4j
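To double-check that both pyspark and py4j can now be found, a minimal check like this can be run from a plain Python prompt:

import pyspark

# should resolve to the copy of pyspark shipped inside $SPARK_HOME/python
print pyspark.__file__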
That’s it! Now we can use PySpark from Jupyter Notebooks.
from pyspark import SparkContext
sc = SparkContext("local[*]", "temp")
print sc.version
2.0.1
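To go one step beyond printing the version, a tiny computation confirms that the local workers actually run jobs. Continuing with the sc created above (the numbers are just illustrative):

# distribute a small list and compute the sum of squares on the local workers
rdd = sc.parallelize(range(10))
print rdd.map(lambda x: x * x).sum()   # 285

sc.stop()   # release the context when done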