Setting up a two-node Spark cluster

Spark’s official website introduces Spark as a general engine for large-scale data processing. Spark is increasingly popular among data mining practitioners because of the support it provides for building distributed data mining/processing applications.

In this post I’m going to describe how to set up a two-node Spark cluster on two separate machines. To play around and experiment with Spark, I’ll also be using IPython Notebook, which will act as the driver program.

Spark will be sitting on two machines, one acting as the master and the other as the slave. If you need to add more slave nodes, simply follow the steps in the section below on each machine you want to run a Spark slave on.

Installing Spark

Since Spark runs on Java, before starting, make sure that a JDK is installed and JAVA_HOME is set properly.
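A quick way to sanity-check this prerequisite (a sketch in Python 3; it only reports on the current shell's environment):

```python
import os
import shutil

# `java` must be resolvable on PATH and JAVA_HOME must be exported
# for Spark's launch scripts to find the JDK.
java_on_path = shutil.which('java') is not None
java_home = os.environ.get('JAVA_HOME')

print('java on PATH:', java_on_path)
print('JAVA_HOME:', java_home if java_home else '(not set)')
```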

  1. Download and extract Spark. You can download the latest version from the Spark downloads page. At the time of writing, the latest version available was 1.4.1.
  2. In your home directory, create a directory named spark and move the Spark distribution into it.
    mkdir spark
    mv spark-1.4.1-bin-hadoop2.6.tgz spark
  3. Extract the contents inside that folder.
    cd spark
    tar -zxvf spark-1.4.1-bin-hadoop2.6.tgz
  4. Set SPARK_HOME property on both machines
    • Add the following entry to ~/.bashrc or ~/.bash_profile (make sure you run source ~/.bashrc afterwards, or log out and log back in to the remote machine.)
export SPARK_HOME=/home/ubuntu/spark/spark-1.4.1-bin-hadoop2.6

You have to do this on both machines. (If you don’t have two machines, you can try this out on a single machine.)

Configuring SSH

For cluster-related communication, the Spark master should be able to make passwordless SSH logins to the Spark slave. This is how you enable it.

  1. Generate a new key pair

ssh-keygen -t rsa

This creates the key pair and saves it in the ~/.ssh directory. You have to do the same on the other machine.

  2. Copy the public key of the master node and add it as an authorized key on the slave node

On master node

cat ~/.ssh/

On slave node

Open authorized_keys on the slave node and paste the public key

vi ~/.ssh/authorized_keys

  3. Alternatively, run ssh-copy-id from the master node

ssh-copy-id ubuntu@slave-node

ubuntu is the name of the user on the slave machine.

Now from the master node, try to log into the slave node

ssh ubuntu@slave-node

If the keys were added successfully, you should be able to log in without being prompted for a password.

You need an SSH server running on the slave node to try this. To install an SSH server, you can run

sudo apt-get install openssh-server

Then start the server using the following command.

sudo service ssh start
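The passwordless-login check above can also be scripted. A minimal sketch (the user and host names are just the ones from this example; BatchMode makes ssh fail instead of prompting for a password):

```python
import subprocess

def ssh_command(user, host, timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return ['ssh',
            '-o', 'BatchMode=yes',                 # never ask for a password
            '-o', 'ConnectTimeout=%d' % timeout,   # give up quickly
            '%s@%s' % (user, host),
            'true']                                # run a no-op remotely

def passwordless_ok(user, host):
    """Return True only if key-based login to user@host works."""
    return subprocess.call(ssh_command(user, host)) == 0

# Against the slave from this post:
#   print(passwordless_ok('ubuntu', 'slave-node'))
cmd = ssh_command('ubuntu', 'slave-node')
print(' '.join(cmd))
```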

Starting Spark

  1. On the master node run

cd $SPARK_HOME
./sbin/start-master.sh

When Spark starts up, details of the nodes get written to a log file, which includes the Spark URL of the master node. It usually looks like this: spark://<host-name>:7077.

  2. Start the slave nodes, giving the URL of the master.

cd $SPARK_HOME
./sbin/start-slave.sh <spark-master-URL>

Go to the web console of the Spark master and check the status of the cluster. If the slave node starts up correctly and joins the cluster, you should see details of the worker node under the Workers section.
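Besides eyeballing the web console, the standalone master also exposes its status as JSON on its web UI port (8080 by default; the endpoint path below is an assumption based on the standalone master's UI, so verify it against your version). A sketch that counts workers in the ALIVE state:

```python
import json
try:
    from urllib2 import urlopen            # Python 2
except ImportError:
    from urllib.request import urlopen     # Python 3

def alive_workers(status):
    """Count workers reported in the ALIVE state in the master's status JSON."""
    return sum(1 for w in status.get('workers', []) if w.get('state') == 'ALIVE')

# Against a running master (host name is an assumption):
#   status = json.load(urlopen('http://<master-host>:8080/json'))
#   print(alive_workers(status))

# Abridged example of the shape such a status document takes:
sample = {'workers': [{'state': 'ALIVE'}, {'state': 'DEAD'}]}
print(alive_workers(sample))  # 1
```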


Setting up IPython Notebook

IPython Notebook is widely used by data scientists to log and present their findings in a reproducible, interactive way. Since I was running Spark on a remote machine, what I wanted was a driver program running as a server on the same network as Spark, which I could access remotely to submit new jobs. IPython Notebook suited this very well.

  1. Log into the master node

  2. Download and install IPython Notebook.

If you already have pip installed, you can do this with:

pip install "ipython[notebook]"

For more installation options you can refer to their official website.

  3. Create a new profile for spark in IPython Notebook

ipython profile create spark
[ProfileCreate] Generating default config file: u'/home/ubuntu/.ipython/profile_spark/'

  4. Create the file ~/.ipython/profile_spark/startup/ and add the following

The following code snippet was taken from

import os
import sys

# Configure the environment
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/ubuntu/spark/spark-1.4.1-bin-hadoop2.6'

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
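One gotcha worth noting: depending on the Spark distribution, the py4j sources ship as a zip under python/lib and may also need to be on sys.path before pyspark can be imported. A hedged sketch that could be appended to the startup file (the exact zip name varies by version, hence the glob):

```python
import glob
import os
import sys

SPARK_HOME = os.environ.get('SPARK_HOME',
                            '/home/ubuntu/spark/spark-1.4.1-bin-hadoop2.6')

# py4j is pyspark's bridge to the JVM; some distributions keep it
# zipped under python/lib rather than unpacked under python/build.
py4j_zips = glob.glob(os.path.join(SPARK_HOME, 'python', 'lib', 'py4j-*-src.zip'))
for zip_path in py4j_zips:
    sys.path.insert(0, zip_path)

print(py4j_zips)  # empty here unless SPARK_HOME points at a real distribution
```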

  5. Start the IPython Notebook

ipython notebook --profile spark

  6. You can access the IPython Notebook UI at http://spark.master:8888
From the main page, create a new Notebook and then add the following lines.

from pyspark import SparkContext
# Getting spark context by connecting to an existing cluster
sc = SparkContext( 'spark://apim-gwm:7077', 'pyspark')

  7. After executing the above code, if you go to the Spark web console, you should see pyspark listed as a Running Application.
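To confirm the notebook can actually push work to the cluster, a trivial job in a new cell is enough. A sketch (the PySpark calls are in comments because they need the live SparkContext from above; the expected value is just the same sum computed locally):

```python
# In a notebook cell, with `sc` created as shown above:
#
#     rdd = sc.parallelize(range(100))
#     total = rdd.sum()
#
# While this runs, the job is visible in the Spark web console.
# The result should match the same computation done locally:
total = sum(range(100))
print(total)  # 4950
```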