
Apache Spark

Official Documentation

Installation

Local Virtual Environment

Step 1: Java

Make sure you have a Java runtime that PySpark supports (Java 8 or 11 for recent 3.x releases). If not, you can install Java 11 using brew:

bash
# Install Java 11 (Arm Compatible)
brew install openjdk@11
bash
# Symlink
sudo ln -sfn $HOMEBREW_PREFIX/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
bash
# Verify the installation
java --version

Step 2: New environment

bash
poetry new spark-project-name
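
This scaffolds a minimal project layout, roughly the following (Poetry 1.x defaults; the exact files vary by version):

bash
# Resulting layout (version-dependent)
# spark-project-name/
# ├── pyproject.toml
# ├── README.md
# ├── spark_project_name/
# │   └── __init__.py
# └── tests/
#     └── __init__.py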

Step 3: PySpark dependency

Run steps 3, 4, and 5 inside the environment created by Poetry:

bash
poetry shell

Then you can install the following dependencies:

bash
poetry add pyspark
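
If you need to match a specific Spark cluster version, Poetry lets you pin the release line (the version below is just an example):

bash
# Pin PySpark to a specific release line (example version)
poetry add pyspark@^3.5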

Step 4: Install JupyterLab

bash
poetry add jupyterlab

Step 5: Start JupyterLab

bash
jupyter-lab
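
By default this starts a server on port 8888 and opens your browser; both flags below are optional:

bash
# Optional: pick a different port and skip auto-opening the browser
jupyter-lab --port 8889 --no-browser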

Step 6: Test PySpark

Create a new notebook inside JupyterLab and run a quick sanity check; a fuller example lives in the Testing Spark section below.
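
A minimal sketch of such a check:

python
from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession
spark = SparkSession.builder.appName('smoke-test').getOrCreate()

# If the JVM and Python sides can talk, this prints ids 0 through 4
spark.range(5).show()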

Local Container

Prerequisites:

  • Docker

Run the jupyter/pyspark-notebook image:

bash
docker run -p 8888:8888 jupyter/pyspark-notebook

Access the notebook at http://localhost:8888/ (the container logs print a URL with an access token) and start coding.
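
To keep your notebooks on the host instead of inside the container, you can mount a directory; the image's default working directory is /home/jovyan/work:

bash
# Mount the current directory into the container's work folder
docker run -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook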

Spark on Kubernetes

TIP

Work in progress

Testing Spark

python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a local SparkSession
spark = (
    SparkSession
    .builder
    .appName('test')
    .getOrCreate()
)

# Reduce log noise
spark.sparkContext.setLogLevel('WARN')

data = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

# Explicit schema for the DataFrame (all fields nullable)
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)

df.printSchema()
df.show(truncate=False)
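
If the session starts correctly, printSchema() prints the schema below, followed by the five rows from df.show():

text
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)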

Spark Datetime Patterns
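
Spark 3.x uses its own datetime pattern letters (yyyy, MM, dd, HH, mm, ss, and so on) for parsing and formatting. A minimal sketch, reusing the spark session from the example above:

python
from pyspark.sql import functions as F

# Parse a string column with an explicit pattern, then format it back out
dates = spark.createDataFrame([("2024-01-31",)], ["raw"])
dates = (
    dates
    .withColumn("parsed", F.to_date("raw", "yyyy-MM-dd"))
    .withColumn("formatted", F.date_format("parsed", "dd/MM/yyyy"))
)
dates.show()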

Basic Operations

Basic operations are the ones you will use most often in your day-to-day work with Spark.


Helpers

Boilerplate code that you can use to help you with your Spark jobs.


Important Concepts

Important concepts that you should know when working with Spark.


Optimization

Optimization techniques that you can use to improve the performance of your Spark jobs.


Feel free to use any content here.