August 2019 | Dios Kurniawan
I wrote this post to share how to install PySpark on MacOS. I was reformatting my laptop last week and found it difficult to reinstall PySpark because the Apache Spark documentation does not say much about MacOS. If you Google it, you will find quite a few different ways of doing this, but I can assure you that what I describe here is the most straightforward way to install PySpark as a standalone setup on MacOS.
A standalone setup is ideal if you need to write PySpark code but do not own or do not have access to a Hadoop / Spark cluster. If you only need to write PySpark programs locally on your laptop without actually running them on a cluster of machines, for example at home or in a coffee shop, then this is the way to go.
To install PySpark, follow the 7 easy steps below. This assumes that you are starting with a clean machine. Be warned: you will need a fast internet connection because the software to download is quite large.
Steps:
1. Download and install Java (JDK) SE 8 if you haven’t done so. Beware: do not use any other version. If you install a newer version, you will have to downgrade. You can check which Java version is already on your machine as shown below.
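To see which Java version, if any, is currently installed, you can run the command below in a Terminal; for Java SE 8 it should report a version starting with 1.8:
java -version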
2. Download and install Anaconda (http://anaconda.com). Pick Python 3.x instead of Python 2.x. Test your installation by creating and running a simple Python program before you proceed to the next step, for example as shown below.
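One quick way to confirm that Anaconda’s Python 3 is the one picked up in a new Terminal is something like the following (the exact version number will differ on your machine, and the one-liner is just an example of a simple program):
python3 --version
python3 -c "print('hello from Anaconda')"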
3. Download and install Homebrew (https://brew.sh). Homebrew is a package manager for MacOS; we will need it to install Spark. Once Homebrew is installed, open a new Terminal and run this command to get the core Apache Spark package:
brew install apache-spark
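To confirm the package was installed, you can ask Homebrew which version it put on your machine (the version number it prints will depend on when you run it):
brew list --versions apache-spark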
4. Once finished, go to your Spark directory /usr/local/Cellar/apache-spark as shown below (replace the numbers with the actual version installed on your computer; in my case it is “2.4.5”), then edit the .bash_profile file in your home directory:
cd /usr/local/Cellar/apache-spark/2.4.5
nano ~/.bash_profile
Add these new lines at the very bottom of the file, then save and close:
# Adjust the version number to match the Spark you installed
export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.5
# Make the pyspark launcher open a Jupyter Notebook, using Python 3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
# Convenience alias for starting a local PySpark session with 2 cores
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
5. Apply the changes by sourcing the .bash_profile file you just edited:
source ~/.bash_profile
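If you want to double-check that the new variables took effect in the current shell, printing one of them should show the Spark path you set above:
echo $SPARK_PATH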
6. Next, download and install Findspark and PySpark using Conda:
conda install -c conda-forge findspark
conda install pyspark
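Once Conda finishes, you can verify that the pyspark package is importable from Python (note that the version it reports comes from the Conda package and may not exactly match the Spark version installed by Homebrew):
python -c "import pyspark; print(pyspark.__version__)"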
7. Test your installation by starting a new Python 3 program as shown below (you can also use a Jupyter Notebook):
import findspark
findspark.init()  # locate the Spark installation and add it to the Python path
from pyspark import SparkContext
sc = SparkContext(appName="dios")  # start a local Spark context
If the program returns no error, that’s it, you’ve got your PySpark environment ready!
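If you prefer to run this test from the Terminal rather than interactively, you can also save the four lines above into a file (the name test_spark.py below is just an example) and run it with the Anaconda Python:
python test_spark.py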