# Apache Spark

[Official Documentation](https://spark.apache.org/docs/latest/)
## Installation

### Local Virtual Environment
#### Step 1: Java

Make sure you have Java 11 installed on your machine. If not, you can install it using brew:

```bash
# Install Java 11 (ARM compatible)
brew install openjdk@11
```

```bash
# Symlink the Homebrew JDK so macOS tooling can find it
sudo ln -sfn $HOMEBREW_PREFIX/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
```

Verify the installation:

```bash
java --version
```

#### Step 2: New environment
```bash
poetry new spark-project-name
```

#### Step 3: PySpark dependency
Run steps 3, 4 and 5 inside the environment created by Poetry:

```bash
poetry shell
```

Then you can install the following dependencies:
```bash
poetry add pyspark
```

#### Step 4: Install JupyterLab
```bash
poetry add jupyterlab
```

#### Step 5: Start JupyterLab
```bash
jupyter-lab
```

#### Step 6: Test PySpark
Create a new notebook inside JupyterLab, and run the code from the Testing Spark section below to verify that everything works.
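If you want a quicker sanity check first, a minimal sketch like the following should print `10` (it assumes only the `pyspark` dependency installed above):

```python
from pyspark.sql import SparkSession

# Quick sanity check: build a local session and run a single action
spark = SparkSession.builder.appName("smoke-test").getOrCreate()
print(spark.range(10).count())  # should print 10
spark.stop()
```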
### Local Container
Pre-requisites:
- Docker
```bash
docker run -p 8888:8888 jupyter/pyspark-notebook
```

Access the notebook at http://localhost:8888/ and start coding.
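To confirm that the image's bundled Spark is available, a minimal check in a new notebook cell is:

```python
import pyspark

# The jupyter/pyspark-notebook image ships with Spark preinstalled
print(pyspark.__version__)
```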
### Spark on Kubernetes
> **TIP:** Work in progress.
## Testing Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create (or reuse) a local SparkSession
spark = (
    SparkSession
    .builder
    .appName('test')
    .getOrCreate()
)
spark.sparkContext.setLogLevel('WARN')

# Sample rows: (firstname, middlename, lastname, id, gender, salary)
data = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

# Explicit schema; the final argument marks each field as nullable
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
```

If everything is set up correctly, this prints the six-field schema followed by the five sample rows.

## Useful Links
### Basic Operations

Basic operations are the most common operations that you will use in your day-to-day work with Spark.
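For instance, a few common DataFrame operations (a minimal sketch reusing the `df` and `spark` from the Testing Spark snippet above):

```python
from pyspark.sql import functions as F

# Select a subset of columns
df.select("firstname", "lastname", "salary").show()

# Filter rows with a condition
df.filter(F.col("salary") > 3000).show()

# Group and aggregate
df.groupBy("gender").agg(F.avg("salary").alias("avg_salary")).show()

# Add a derived column
df.withColumn("salary_x2", F.col("salary") * 2).show()
```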
### Helpers

Boilerplate code to help you with your Spark jobs.
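As an illustration, a hypothetical helper for building a SparkSession with a few common local-mode settings (the function name and config values are assumptions for the sketch, not part of this repo):

```python
from pyspark.sql import SparkSession

def get_spark(app_name: str = "my-app", shuffle_partitions: int = 8) -> SparkSession:
    """Hypothetical boilerplate: build a local SparkSession with common settings."""
    return (
        SparkSession
        .builder
        .appName(app_name)
        .master("local[*]")  # use all local cores
        .config("spark.sql.shuffle.partitions", shuffle_partitions)  # fewer partitions suit small local jobs
        .getOrCreate()
    )

spark = get_spark("example")
spark.sparkContext.setLogLevel("WARN")
```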
### Important Concepts

Concepts that you should know when working with Spark.
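One such concept is lazy evaluation: transformations only build a query plan, and nothing executes until an action is called. A small sketch (assuming the `spark` session from above):

```python
# Transformations are lazy: nothing is computed yet
numbers = spark.range(1_000_000)      # DataFrame of ids 0..999999
evens = numbers.filter("id % 2 = 0")  # still just a query plan

# Actions trigger execution
print(evens.count())                  # runs the job; prints 500000
```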
### Optimization

Techniques that you can use to improve the performance of your Spark jobs.
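Two common examples are caching a DataFrame that is reused, and broadcasting a small table in a join. A sketch, assuming the `df` and `spark` from the Testing Spark snippet (the `genders` lookup table is invented for illustration):

```python
from pyspark.sql import functions as F

# Cache a DataFrame that several downstream queries will reuse
df.cache()
df.count()  # first action materializes the cache

# Broadcast a small lookup table so the join avoids shuffling the large side
genders = spark.createDataFrame(
    [("M", "Male"), ("F", "Female")], ["gender", "gender_label"]
)
joined = df.join(F.broadcast(genders), on="gender", how="left")
joined.show(truncate=False)
```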