# Apache Spark

[Official Documentation](https://spark.apache.org/docs/latest/)
## Installation
### Local Virtual Environment
#### Step 1: Java

Make sure you have Java 11 installed on your machine. If not, you can install it using brew:
```bash
# Install Java 11 (ARM compatible)
brew install openjdk@11
```
```bash
# Symlink the JDK so the system Java wrappers can find it
sudo ln -sfn $HOMEBREW_PREFIX/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
```
Verify the installation:

```bash
java --version
```
#### Step 2: New environment

Create a new project with Poetry:

```bash
poetry new spark-project-name
```
#### Step 3: PySpark dependency

Run steps 3, 4, and 5 inside the environment created by Poetry:
```bash
poetry shell
```
Then you can install the following dependency:
```bash
poetry add pyspark
```
#### Step 4: Install JupyterLab

```bash
poetry add jupyterlab
```
#### Step 5: Start JupyterLab

```bash
jupyter-lab
```
#### Step 6: Test PySpark

Create a new notebook inside JupyterLab and run the code from the [Testing Spark](#testing-spark) section below.
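Before running the full example, you can confirm that PySpark is importable from the notebook (a minimal sanity check):

```python
# Sanity check: PySpark should import cleanly and report its version
import pyspark

print(pyspark.__version__)
```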
### Local Container
Prerequisites:

- Docker
Start a container from the `jupyter/pyspark-notebook` image:

```bash
docker run -p 8888:8888 jupyter/pyspark-notebook
```
Access the notebook at http://localhost:8888/ (the container logs print the URL with an access token) and start coding.
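To verify that Spark works inside the container, here is a minimal sketch you can run in a new notebook (the app name is arbitrary):

```python
# Minimal smoke test: start a local SparkSession and run a trivial job
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("container-test").getOrCreate()

df = spark.range(10)  # single-column DataFrame with ids 0..9
print(spark.version)
print(df.count())  # expected: 10
```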
### Spark on Kubernetes

> **TIP**: Work in progress.
## Testing Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create (or reuse) a local SparkSession
spark = (
    SparkSession
    .builder
    .appName('test')
    .getOrCreate()
)
spark.sparkContext.setLogLevel('WARN')

# Sample rows: (firstname, middlename, lastname, id, gender, salary)
data = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

# Explicit schema; the third argument marks each field as nullable
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
```
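If the schema and rows print as expected, the installation works. As an optional follow-up, here is a sketch of a couple of common DataFrame operations on the same `df` (not part of the original test):

```python
from pyspark.sql import functions as F

# Drop the -1 placeholder salary, then average salaries per gender
(
    df
    .filter(F.col("salary") > 0)
    .groupBy("gender")
    .agg(F.avg("salary").alias("avg_salary"))
    .show()
)
```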
## Useful Links
- **Basic Operations**: the most common operations that you will use in your day-to-day work with Spark.
- **Helpers**: boilerplate code that you can use to help you with your Spark jobs.
- **Important Concepts**: important concepts that you should know when working with Spark.
- **Optimization**: optimization techniques that you can use to improve the performance of your Spark jobs.