Spark’s official website introduces Spark as a general engine for large-scale data processing. Spark is becoming increasingly popular among data mining practitioners because of the support it provides for building distributed data mining and processing applications.
In this post I’m going to describe how to set up a two-node Spark cluster on two separate machines. To play around and experiment with Spark, I’ll also be using IPython Notebook, which will act as the driver program.
Spark will be sitting on two machines, one acting as the master and the other as the slave. If you need to add more slave nodes, you can simply follow the steps in the section below on each machine you want to run a Spark slave on.
Since Spark runs on Java, before starting with this, make sure that a JDK is installed and JAVA_HOME is properly set.
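For example, you can verify the JDK and set JAVA_HOME like this (the JDK path below is just an example; point it at wherever your JDK is actually installed):

```shell
# Verify that a JDK is available on the machine
java -version

# Point JAVA_HOME at your JDK installation (example path for Ubuntu)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
```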
- Download and extract Spark. You can download the latest version from the http://spark.apache.org/downloads.html page. At the time of writing, the latest version available was 1.4.1.
- In your home directory, create a directory named spark and move the Spark distribution into that folder.
mkdir spark
mv spark-1.4.1-bin-hadoop2.6.tgz spark
- Extract contents into the folder.
cd spark
tar -zxvf spark-1.4.1-bin-hadoop2.6.tgz
- Set the SPARK_HOME environment variable on both machines
- Add the following entries to ~/.bashrc or ~/.bash_profile (make sure you run source ~/.bashrc afterwards, or log out and log back in to the remote machine.)
export SPARK_HOME=/home/ubuntu/spark/spark-1.4.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
You have to do this on both machines. (In case you don’t have two machines, you can try this out on a single machine.)
For cluster-related communication, the Spark master should be able to create passwordless SSH logins to the Spark slave. This is how you enable it.
1. Generate a new key pair
This will create the key pair and save it in the ~/.ssh directory. You have to do the same on the other machine.
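The key pair can be generated with ssh-keygen. A minimal sketch (an RSA key with an empty passphrase is assumed here; adjust the options to your security requirements):

```shell
# Create ~/.ssh if it doesn't exist yet
mkdir -p ~/.ssh

# Generate an RSA key pair with an empty passphrase at the default location
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
```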
2. Copy the public key of the master node and add it as an authorized key on the slave node
On master node
On slave node
Open authorized_keys on the slave node and paste in the master’s public key
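A sketch of this step (the key placeholder below must be replaced with the actual public key printed on the master):

```shell
# On the master node: print the public key so you can copy it
cat ~/.ssh/id_rsa.pub

# On the slave node: append the copied key to the list of authorized keys
echo "<public-key-copied-from-master>" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```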
3. Run ssh-copy-id from master node
ubuntu is the name of the user on the slave machine.
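Assuming the slave machine’s hostname is slave-node (a placeholder), the command looks like:

```shell
# Copies the master's public key into the slave's authorized_keys;
# "ubuntu" is the user on the slave machine, "slave-node" its hostname
ssh-copy-id ubuntu@slave-node
```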
Now, from the master node, try to log into the slave node.
If keys were added successfully, you should be able to log into the machine.
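For example (slave-node is again a placeholder hostname):

```shell
# Should log in without prompting for a password if the keys were set up
ssh ubuntu@slave-node
```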
You need an SSH server running on the slave node to try this. To install an SSH server, you can run
sudo apt-get install openssh-server
Then start the server using the following command.
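On Ubuntu this is typically done as follows (the service name can vary across distributions):

```shell
# Start the OpenSSH server daemon
sudo service ssh start
```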
Starting the Spark cluster
1. On the master node run
When Spark starts, details of the node are written to a log file, which includes the Spark URL of the master node. It usually looks like this: spark://<host-name>:7077.
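In Spark 1.4.x the master is started with the scripts under sbin:

```shell
# Start the Spark master daemon; its logs (including the spark:// URL)
# are written under $SPARK_HOME/logs by default
$SPARK_HOME/sbin/start-master.sh
```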
2. Start the slave nodes, giving the URL of the master.
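On each slave machine, this can be done with start-slave.sh (the master URL below is a placeholder; use the spark:// URL printed in the master’s log):

```shell
# Start a worker and point it at the master's spark:// URL
$SPARK_HOME/sbin/start-slave.sh spark://<master-host-name>:7077
```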
Go to the web console of the Spark master (http://<master-host-name>:8080 by default) and check the status of the cluster. If the slave node starts up correctly and joins the cluster, you should see the details of the worker node under the Workers section.
Setting up IPython Notebook
IPython Notebook is widely used by data scientists to log and present their findings in a reproducible and interactive way. Since I was running Spark on a remote machine, what I was looking for was a driver program running as a server, on the same network as Spark, which I could access remotely to submit new jobs. IPython Notebook suited this very well.
1. Log into the master node
2. Download and install IPython Notebook.
If you already have pip installed you can do this by:
pip install "ipython[notebook]"
For more installation options you can refer to their official website.
3. Create a new profile for spark in IPython Notebook
ipython profile create spark
[ProfileCreate] Generating default config file: u'/home/ubuntu/.ipython/profile_spark/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/ubuntu/.ipython/profile_spark/ipython_notebook_config.py'
4. Create the file ~/.ipython/profile_spark/startup/00-pyspark-setup.py and add the following
The following code snippet was taken from https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python.
import os
import sys

# Configure the environment
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/ubuntu/spark/spark-1.4.1-bin-hadoop2.6'

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
5. Start the IPython Notebook. (By default the notebook server listens only on localhost; to reach it from another machine you may need to pass --ip with the master node’s address.)
ipython notebook --profile spark
6. You can access the IPython Notebook UI at http://spark.master:8888 (replace spark.master with the hostname of your master node; 8888 is the default notebook port).
From the main page, create a new notebook and then add the following lines.
from pyspark import SparkContext

# Getting spark context by connecting to an existing cluster
sc = SparkContext('spark://apim-gwm:7077', 'pyspark')
7. After executing the above code, if you go to the Spark web console, you should see pyspark listed as a running application.