Build a Spark-Based BI Environment on AWS EC2 Using AWS CLI

Performing business intelligence (BI) analysis using Apache Spark doesn’t need an expensive cluster. In this tutorial, we’ll use AWS CLI to provision a simple but powerful Apache Spark environment on an EC2 instance, perfect for running ad-hoc BI analysis from spreadsheet data. We’ll also cover smart ways to shut down the instance when you’re done to avoid unnecessary costs.

What You’ll Learn

  • Launching an EC2 instance with Spark and Python via AWS CLI
  • Uploading and processing Excel files with Spark
  • Running PySpark analysis scripts
  • Exporting data for BI tools
  • Stopping or terminating the instance post-analysis

Prerequisites

  • AWS CLI installed and configured (aws configure; a quick check follows this list)
  • An existing EC2 Key Pair (.pem file)
  • Basic knowledge of Python or Spark
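
Not sure the CLI is ready? Two quick checks confirm which identity and default region your commands will use:

aws sts get-caller-identity    # shows the AWS account and IAM identity in use
aws configure get region       # shows the default region the commands below will run in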

Step 1: Launch an EC2 Instance with Spark Using AWS CLI

We’ll use an Ubuntu AMI and install Spark, Java, and the required Python libraries via a user-data script.

🔸 Create a user-data script: spark-bootstrap.sh

#!/bin/bash
# Install Java, pip, and the Python libraries used in this tutorial
# (xlrd reads .xls files, openpyxl reads .xlsx)
apt update -y
apt install -y openjdk-11-jdk python3-pip wget unzip
pip3 install pandas openpyxl xlrd pyspark findspark matplotlib notebook

# Download and unpack Spark. Superseded releases move from downloads.apache.org
# to https://archive.apache.org/dist/spark/, so adjust the URL if this one 404s.
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 /opt/spark

# Expose Spark and Java to login shells
echo 'export SPARK_HOME=/opt/spark' >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> /etc/profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> /etc/profile

Make it executable (optional; the AWS CLI only reads the file locally):

chmod +x spark-bootstrap.sh

🔸 Launch the EC2 Instance

aws ec2 run-instances \
  --image-id YOUR_UBUNTU_AMI_ID \
  --count 1 \
  --instance-type t3.medium \
  --key-name YOUR_KEY_PAIR_NAME \
  --security-groups default \
  --user-data file://spark-bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=SparkBI}]'

Replace YOUR_KEY_PAIR_NAME with your EC2 key pair name and YOUR_UBUNTU_AMI_ID with a current Ubuntu 20.04 AMI for your region. Also make sure the security group you launch into allows inbound SSH (port 22) from your IP, or the ssh/scp steps below won’t work.
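
AMI IDs differ by region. One way to find a current Ubuntu 20.04 AMI for your region is to query Canonical’s public images (099720109477 is Canonical’s AWS account ID):

aws ec2 describe-images \
  --owners 099720109477 \
  --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*" \
  --query "sort_by(Images, &CreationDate)[-1].ImageId" \
  --output text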

🗂️ Step 2: Upload Your Excel File to the Instance

🔸 Find the Public IP of Your Instance

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=SparkBI" \
  --query "Reservations[*].Instances[*].PublicIpAddress" \
  --output text
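
Before uploading anything, it’s worth SSHing in to confirm the user-data bootstrap has finished; it runs asynchronously and can take a few minutes after launch:

ssh -i your-key.pem ubuntu@<EC2_PUBLIC_IP>
tail -n 50 /var/log/cloud-init-output.log    # bootstrap log on Ubuntu
spark-submit --version                       # should report Spark 3.5.0 once setup completes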

🔸 Upload Your Excel File (sales_report.xls)

scp -i your-key.pem sales_report.xls ubuntu@<EC2_PUBLIC_IP>:/home/ubuntu/

🧠 Step 3: Create and Run Your PySpark Script

sales_analysis.py:

import pandas as pd
from pyspark.sql import SparkSession

xls_file = "sales_report.xls"
csv_file = "sales_report.csv"

# Spark's core distribution has no Excel reader, so convert the sheet to CSV
# with pandas first (xlrd handles .xls, openpyxl handles .xlsx)
df = pd.read_excel(xls_file)
df.to_csv(csv_file, index=False)

# Load the CSV into Spark and let it infer column types
spark = SparkSession.builder.appName("SalesBI").getOrCreate()
df_spark = spark.read.csv(csv_file, header=True, inferSchema=True)

# Sample analysis: total sales per region
df_spark.groupBy("Region").sum("Sales").show()

Run it on EC2:

spark-submit sales_analysis.py

📊 Step 4: Export Data for BI Tools

You can save output as CSV for use in Power BI, Excel, or Apache Superset:

df_spark.groupBy("Product").sum("Sales").write.csv("product_sales_output", header=True)

Spark writes the result as a directory of part files, so download the whole folder with scp -r:

scp -i your-key.pem -r ubuntu@<EC2_PUBLIC_IP>:product_sales_output/ .
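
Because header=True was set on the write, every part-*.csv in the folder starts with its own header row. If your BI tool expects a single file, one simple way to merge the parts locally while keeping only the first header is awk (the merged filename here is just a suggestion):

awk 'FNR==1 && NR!=1 {next} {print}' product_sales_output/part-*.csv > product_sales.csv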

💰 Step 5: Stop or Terminate EC2 to Save Costs

🔸 Stop the Instance (preserves your data; compute charges stop, and you pay only for the attached EBS storage, which is billed per GB-month)

aws ec2 stop-instances --instance-ids i-xxxxxxxxxxxxxxxxx
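
If you don’t have the instance ID handy, look it up by the Name tag set at launch:

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=SparkBI" \
  --query "Reservations[*].Instances[*].InstanceId" \
  --output text

To pick the work back up later, start the stopped instance again; once you’re completely finished, terminate it (this deletes the instance and, by default, its root EBS volume, so download anything you still need first):

aws ec2 start-instances --instance-ids i-xxxxxxxxxxxxxxxxx
aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx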

🧭 Pro Tips

  • Use Amazon S3 for persistent storage between sessions (a quick example follows this list).
  • For automation, script the entire process into AWS CloudFormation or a Makefile.
  • If you’re doing frequent BI work, consider using Amazon EMR Serverless or SageMaker Studio.
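
For example, with an IAM role (or credentials) on the instance that can write to a bucket, you can push results to S3 before shutting down; the bucket name here is a placeholder:

aws s3 cp product_sales_output/ s3://your-bi-bucket/product_sales_output/ --recursive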

Conclusion

With just a few CLI commands and smart use of EC2, you can spin up a complete Apache Spark BI analysis environment. It’s flexible, cost-efficient, and cloud-native.

💡 Don’t forget to stop or terminate the EC2 instance when not in use to save on costs!