Performing business intelligence (BI) analysis using Apache Spark doesn’t need an expensive cluster. In this tutorial, we’ll use AWS CLI to provision a simple but powerful Apache Spark environment on an EC2 instance, perfect for running ad-hoc BI analysis from spreadsheet data. We’ll also cover smart ways to shut down the instance when you’re done to avoid unnecessary costs.
What You’ll Learn
- Launching an EC2 instance with Spark and Python via AWS CLI
- Uploading and processing Excel files with Spark
- Running PySpark analysis scripts
- Exporting data for BI tools
- Stopping or terminating the instance post-analysis
Prerequisites
- AWS CLI installed and configured (
aws configure
) - An existing EC2 Key Pair (
.pem
file) - Basic knowledge of Python or Spark
Step 1: Launch an EC2 Instance with Spark Using AWS CLI
We’ll use an Ubuntu AMI and install Spark, Java, and the required Python libraries via a user-data script.
🔸 Create a user-data script: spark-bootstrap.sh
#!/bin/bash
# Install Java, Python tooling, and helper packages
apt update -y
apt install -y openjdk-11-jdk python3-pip wget unzip
# xlrd is needed by pandas to read legacy .xls files; openpyxl covers .xlsx
pip3 install pandas openpyxl xlrd pyspark findspark matplotlib notebook
# Download and unpack Spark (older releases move to https://archive.apache.org/dist/spark/)
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 /opt/spark
# Make Spark and Java available in login shells
echo 'export SPARK_HOME=/opt/spark' >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> /etc/profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> /etc/profile
Make it executable:
chmod +x spark-bootstrap.sh
🔸 Launch the EC2 Instance
# Ubuntu 20.04 AMI (AMI IDs are region-specific; use a current Ubuntu AMI for your region)
aws ec2 run-instances \
  --image-id ami-0c94855ba95c71c99 \
  --count 1 \
  --instance-type t3.medium \
  --key-name YOUR_KEY_PAIR_NAME \
  --security-groups default \
  --user-data file://spark-bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=SparkBI}]'
Replace YOUR_KEY_PAIR_NAME with your EC2 key pair name, and make sure the image ID is a valid Ubuntu AMI in your region. Also confirm that the default security group allows SSH (port 22) from your IP, since we’ll connect over SSH in the next steps.
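Before connecting, you can wait until the instance is running (and give the user-data script a few minutes to finish installing Spark). A quick check, assuming the SparkBI tag from above:
aws ec2 wait instance-running \
  --filters "Name=tag:Name,Values=SparkBI"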
🗂️ Step 2: Upload Your Excel File to the Instance
🔸 Find the Public IP of Your Instance
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=SparkBI" \
--query "Reservations[*].Instances[*].PublicIpAddress" \
--output text
Upload your Excel file (sales_report.xls):
scp -i your-key.pem sales_report.xls ubuntu@<EC2_PUBLIC_IP>:/home/ubuntu/
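Then SSH into the instance to run the analysis (the default user is ubuntu on Ubuntu AMIs):
ssh -i your-key.pem ubuntu@<EC2_PUBLIC_IP>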
🧠 Step 3: Create and Run Your PySpark Script
Create sales_analysis.py:
import pandas as pd
from pyspark.sql import SparkSession

# Convert the Excel file to CSV with pandas, since Spark has no built-in Excel reader
xls_file = "sales_report.xls"
csv_file = "sales_report.csv"
df = pd.read_excel(xls_file)
df.to_csv(csv_file, index=False)

# Load the CSV into Spark with inferred column types
spark = SparkSession.builder.appName("SalesBI").getOrCreate()
df_spark = spark.read.csv(csv_file, header=True, inferSchema=True)

# Sample analysis: total sales per region
df_spark.groupBy("Region").sum("Sales").show()
Run it on EC2:
bash:
spark-submit sales_analysis.py
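If spark-submit isn’t found in your session, the PATH exports from the bootstrap script live in /etc/profile; you can source that file or call Spark by its full install path:
bash:
source /etc/profile
/opt/spark/bin/spark-submit sales_analysis.py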
📊 Step 4: Export Data for BI Tools
You can save output as CSV for use in Power BI, Excel, or Apache Superset:
python:
df_spark.groupBy("Product").sum("Sales").write.csv("product_sales_output", header=True)
Spark writes the result as a directory of part files, so use scp -r to download the whole folder:
bash:
scp -i your-key.pem -r ubuntu@<EC2_PUBLIC_IP>:product_sales_output/ .
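If your BI tool expects a single CSV, you can merge the part files locally, keeping only one header line. A minimal sketch using standard coreutils (the part-file names may vary slightly):
bash:
head -n 1 product_sales_output/part-00000*.csv > product_sales.csv
tail -q -n +2 product_sales_output/part-*.csv >> product_sales.csv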
💰 Step 5: Stop or Terminate EC2 to Save Costs
Stop the Instance (preserves the root EBS volume and your data; compute charges stop, but you keep paying a small amount for EBS storage)
bash:
aws ec2 stop-instances --instance-ids i-xxxxxxxxxxxxxxxxx
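When you’re completely done, terminate the instance instead; this deletes the instance (and, by default, its root volume), so nothing keeps billing. You can look up the instance ID by its tag first:
bash:
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=SparkBI" \
  --query "Reservations[*].Instances[*].InstanceId" \
  --output text
aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx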
🧭 Pro Tips
- Use Amazon S3 for persistent storage between sessions (see the example after this list).
- For automation, script the entire process into AWS CloudFormation or a Makefile.
- If you’re doing frequent BI work, consider using Amazon EMR Serverless or SageMaker Studio.
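For the S3 tip, a minimal sketch, assuming you have a bucket (the bucket name my-bi-bucket here is a placeholder) and that either the instance has S3 permissions or you run this from your own machine:
bash:
# Push results to S3 so they survive instance termination
aws s3 cp product_sales_output/ s3://my-bi-bucket/product_sales_output/ --recursive
# Pull them back in a later session
aws s3 cp s3://my-bi-bucket/product_sales_output/ product_sales_output/ --recursive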
Conclusion
With just a few CLI commands and a smart use of EC2, you can spin up a complete Apache Spark BI analysis environment. It’s flexible, cost-efficient, and cloud-native.
💡 Don’t forget to stop or terminate the EC2 instance when not in use to save on costs!