Apache Spark is an analytics engine that can process data in a standalone configuration or in a distributed enterprise setup. Apache Spark has APIs for major programming languages such as Scala, Java, and Python, and these APIs let you make quick work of batch or streaming data processing.

Since the release of the Raspberry Pi 4 (4 GB+), the Pi has become increasingly capable of running projects larger than a personal website or robotics playground. With the recent performance and memory increases, the Raspberry Pi is able to host components of enterprise applications.

In another tutorial, you can learn how to set up Apache Kafka on your Raspberry Pi.

With only a few simple steps you can set up Apache Spark on the Raspberry Pi 4 (Raspbian) as a standalone instance and run through a few demos to get a feel for the potential behind Apache Spark.

By copying the examples given, you can get everything working in less than 10 minutes.

1. Update the Raspberry Pi

It’s always a good idea to fetch and install the latest updates before starting a new project, especially if your RPi has been sitting for a while.

# Fetch new updates
sudo apt-get update

# Download / Install updates
sudo apt-get upgrade

2. Download Apache Spark and Extract

Go to the Apache Spark Downloads page and download the latest release files. At this time, the latest version is 3.0.1. We will be using the package type “Pre-built for Apache Hadoop 2.7”. However, the setup of Hadoop is not required at this time.

The result will be a download such as: “spark-3.0.1-bin-hadoop2.7.tgz”

# Navigate to the directory containing the downloaded file
# Extract and decompress the archive into a new directory (the target directory must already exist)
tar xvfz <path to tgz file> -C <path to extract to>

This step extracts the tarball and places its contents in the target directory.
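For example, assuming the archive was saved to /home/pi/Downloads and you want Spark to live under your home directory (illustrative paths; adjust them to match your own download location and username):

# Example with assumed, concrete paths
tar xvfz /home/pi/Downloads/spark-3.0.1-bin-hadoop2.7.tgz -C /home/pi

# Confirm the new directory is there
ls /home/pi/spark-3.0.1-bin-hadoop2.7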

3. Update the $PATH

In order to make using Spark easy, we want to append Spark's /bin directory to the PATH, allowing us to interact with Spark from any terminal. In my case, this required updating ~/.profile with the following syntax.

Replace <your_username> with the username you are logged in with; most likely this will be “pi”. Also, replace <spark_directory> with the remaining path to wherever you extracted Spark.

Note: you can use whichever text editor you wish. I like to use nano.

cd ~
sudo nano .profile

# ^^^ The rest of your .profile will be here ^^^
# Scroll to the bottom and add

PATH=$PATH:/home/<your_username>/<spark_directory>/bin

# Save and Close the file
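As a concrete example, if Spark were extracted to /home/pi/spark-3.0.1-bin-hadoop2.7 (an assumed location; substitute your own), the line added to ~/.profile would read:

PATH=$PATH:/home/pi/spark-3.0.1-bin-hadoop2.7/bin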

Restart the RPi so that the changes to your PATH take effect.
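If you would rather not reboot right away, you can also apply the change to your current terminal session; any terminal opened after a reboot or fresh login will pick it up from ~/.profile automatically.

# Optional: apply the updated PATH to the current terminal without rebooting
source ~/.profile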

4. Start the spark-shell

After restarting your Raspberry Pi, open a new terminal and test the newly updated PATH. If everything is configured correctly, the following command will start the Spark shell:

spark-shell

Apache Spark is started! Take note of a few things and then give it a whirl.

(a) Spark context Web UI available at http://<local_ip_of_your_PI>:4040
(b) Pay attention to the Scala and Java versions
(c) It will warn you that it was unable to “load native-hadoop library”. That is fine, as we do not require Hadoop for this project.

You can follow the URL mentioned by Spark to open the web dashboard. The dashboard shows a bit of information, including jobs that have run or are currently running.

5. Run the first test Job

One of the classic tests for Apache Spark is to parse a text file and count all of the occurrences of specific words. Our example will be no different.

(a) Create a text file anywhere on your Pi; the sample code below reads it from /home/pi/Documents/hello_world
(b) Update the provided sample code to point at your file
(c) Run it!

// Within the Spark shell (Scala)

val map = sc.textFile("/home/pi/Documents/hello_world").flatMap(line => line.split(" ")).map(word => (word, 1))

val counts = map.reduceByKey(_+_)

counts.coalesce(1).saveAsTextFile("/home/pi/Desktop/spark_001")

Explaining the code:

Assuming that the text file contains something like: Hello Hello World

sc.textFile("/home/pi/Documents/hello_world") reads the text file and emits each line of the file as a separate element, splitting on the end of each line.

// Line(s)
“Hello Hello World”

flatMap(line => line.split(" ")).map(word => (word, 1)) takes each of the lines from the text file and breaks it into words by splitting on the space character. Next, every word that comes out of the flatMap is paired into a tuple containing the word itself and a 1.

The 1 indicates that this word occurred 1 time.

// Map
(“Hello”, 1)
(“Hello”, 1)
(“World”, 1)
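You can reproduce this intermediate step with plain Scala in the spark-shell; no RDD is needed, and the sample string is just an illustration:

// Plain Scala check of the split-and-pair step
"Hello Hello World".split(" ").map(word => (word, 1))
// result: Array((Hello,1), (Hello,1), (World,1))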

map.reduceByKey(_+_) merges the collection of tuples so that each key appears only once, adding each 1 to the running total already associated with that key.

// Reduced Map, counts
(“Hello”, 2)
(“World”, 1)

counts.coalesce(1).saveAsTextFile("/home/pi/Desktop/spark_001") writes the result of the counts to a text file for you to see. By adding coalesce(1) you are telling Spark to use a single partition, so the result lands in one output file instead of one file per partition.

The output will be a new directory called “spark_001”. Inside, you will see a file named “part-00000”; open it and take a look.
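If you just want to peek at the result without writing a file, you can also print it straight from the shell. This pulls the data back to the driver, which is fine for a tiny dataset like this one:

// Print the counted tuples directly in the spark-shell
counts.collect().foreach(println)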

Review the Apache Spark Web UI at http://localhost:4040 to see your job in the Job history.

More

Although this example used a basic text file, take note that Spark can read from and write to many data sources: MongoDB, Apache Kafka, and Cassandra, just to name a few.
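As a rough sketch of what that can look like, here is a minimal Structured Streaming read from a Kafka topic back out to the console. It assumes you launched spark-shell with the Kafka connector package (for example, --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1) and that a broker is actually running; the broker address and topic name below are placeholders for your own setup.

// Minimal sketch: stream records from a Kafka topic to the console
// Assumes the spark-sql-kafka connector is on the classpath and a broker is reachable;
// "localhost:9092" and "test-topic" are placeholders for your own broker and topic
val stream = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("subscribe", "test-topic").
  load()

val query = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").
  writeStream.
  format("console").
  start()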