When it comes to exploring and analysing large amounts of data, few tools beat the Apache Spark - IPython Notebook combination. However, in my journey to set up and use Apache Spark for my work, I often had to go through a lot of trial and error as the instructions often assume certain knowledge of Spark's internal settings. I had to plough through documentation in various places before I could get a reasonably working Spark + IPython setup going. So here I chronicle my experience in setting up Apache Spark for use with IPython notebooks, attempting at each step to explain the rationale and the settings used.
Although my aim is to hand-hold the reader through the setup process, I still have to make certain basic assumptions so that this post does not become overly long by trying to cover every possible setup condition.
I will be using Ubuntu 16.04
Other Linux distributions would do just fine
The steps would be quite similar in Mac OS as well
All commands are executed as the default ubuntu user.
You can execute the commands as another user, but I would recommend that the account be part of the sudo group.
Actually that's about all I would assume.
Preparing the Linux environment
Installing Apache Spark
Starting Apache Spark with IPython
Step 1: Preparing the Linux environment
Install the JDK by typing:
sudo apt-get install default-jdk
Next, add JAVA_HOME to your environment by appending the appropriate export lines to your shell profile (e.g. ~/.bashrc).
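As a sketch, assuming Ubuntu's default-jdk package (which links the installed JDK at /usr/lib/jvm/default-java), the lines could look like this; adjust the path if your JDK lives elsewhere:

```shell
# Hypothetical path: Ubuntu's default-jdk package symlinks the JDK at
# /usr/lib/jvm/default-java; adjust to match your installed JDK.
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH="$PATH:$JAVA_HOME/bin"
```

Remember to source the profile (or open a new terminal) so the variables take effect.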
Next you would need to install Python. You can do it using Anaconda's Python distribution. I am using Python 3 here but Python 2 would do just fine as well.
Once you've downloaded the Anaconda Python package from here (say the package file name is
Anaconda3-4.4.0-Linux-x86_64.sh), install it by typing:
cd <folder_where_you_saved_anaconda>
bash Anaconda3-4.4.0-Linux-x86_64.sh -p /usr/local/anaconda
The -p flag tells the script to install Anaconda Python into the /usr/local/anaconda folder.
When the installation is done, do a quick check on the ownership of the Anaconda folder by typing:
ls -al /usr/local
See if the owner and owner group are both
ubuntu (or whatever user name/group you are using). If not (it might say
root if you decide to fully automate this installation process with a script), change the ownership to yourself by typing:
sudo chown -R ubuntu:ubuntu /usr/local/anaconda
This is only necessary if you intend to install extra packages using
pip rather than
conda (Anaconda's own package manager).
Step 2: Installing Apache Spark
Download the latest Spark package from the Spark homepage. I recommend the pre-built versions.
Once downloaded untar the package into
/usr/local as well by typing:
tar -xvf <path_to_spark>.tar.gz -C /usr/local/
Then add SPARK_HOME to your environment variables, again by appending export lines to your shell profile.
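For example, the lines might look like the following; the Spark folder name below is an assumption, so use whatever version you actually untarred into /usr/local:

```shell
# Folder name is an assumption -- match it to the Spark tarball you untarred.
export SPARK_HOME=/usr/local/spark-2.2.0-bin-hadoop2.7
export PATH="$PATH:$SPARK_HOME/bin"
```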
And that's it! Spark is installed.
Step 3: Starting IPython Notebook with Apache Spark
To start Spark with IPython, type in the following command (or save it in a shell script for reusability):
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser" $SPARK_HOME/bin/pyspark --master local[2]
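If you go the shell-script route, a minimal wrapper might look like this (the script name is my own choice, and local[2] is just the example above; adjust the master URL as needed):

```shell
#!/usr/bin/env bash
# start-spark-notebook.sh -- hypothetical wrapper for launching
# a Jupyter notebook backed by PySpark. Assumes SPARK_HOME is set.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
exec "$SPARK_HOME/bin/pyspark" --master "local[2]"
```

Make it executable with chmod +x and you can relaunch the notebook with a single command.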
Let's break the command down. The first part,
PYSPARK_DRIVER_PYTHON=jupyter, tells Spark to use Jupyter as the driver Python. The second part passes options or arguments to the driver Python and tells Spark to start a Jupyter notebook without automatically opening a browser window. The third part invokes Spark in Python in local mode with 2 cores.
If you are deploying Spark to a cluster of nodes, change --master local to your cluster's master URL (for a standalone cluster this takes the form spark://<master_host>:7077, where 7077 is the standalone master's default port).
Other notes if deploying to a cluster...
Apache Spark configurations are stored in the following files in $SPARK_HOME/conf:
spark-defaults.conf stores the basic or cluster-wide configurations.
spark-env.sh stores the node-specific configuration, for example if you want a particular worker node to have a slightly different configuration from the default. This file is read when Spark is initiated on the node itself.
Settings in spark-env.sh will override settings in spark-defaults.conf.
It is useful to set spark.pyspark.python to the proper Python path.
This is especially true when using Anaconda as the driver Python: to make sure that Spark uses Anaconda instead of the default system Python, point spark.pyspark.python at the Anaconda Python binary.
In this way, all the packages available in Anaconda will also be available to Spark. For example, one might want to use numpy functions in UDFs.
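Assuming Anaconda was installed to /usr/local/anaconda as in Step 1, the relevant line in spark-defaults.conf might look like this:

```
# $SPARK_HOME/conf/spark-defaults.conf
# Path assumes the Anaconda install location used in Step 1.
spark.pyspark.python    /usr/local/anaconda/bin/python
```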
If a worker node has a different Python path, set it in that node's own spark-env.sh instead.
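For instance, a worker whose Anaconda lives somewhere else (the path below is purely illustrative) could set the PYSPARK_PYTHON environment variable in its local spark-env.sh:

```shell
# $SPARK_HOME/conf/spark-env.sh on that worker node
# (hypothetical path -- adjust to the node's actual Python binary)
export PYSPARK_PYTHON=/opt/anaconda/bin/python
```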