Build a Spark-Based BI Environment on AWS EC2 Using AWS CLI

Performing business intelligence (BI) analysis using Apache Spark doesn’t need an expensive cluster. In this tutorial, we’ll use AWS CLI to provision a simple but powerful Apache Spark environment on an EC2 instance, perfect for running ad-hoc BI analysis from spreadsheet data. We’ll also cover smart ways to shut down the instance when you’re done to avoid unnecessary costs.

What You’ll Learn

  • Launching an EC2 instance with Spark and Python via AWS CLI
  • Uploading and processing Excel files with Spark
  • Running PySpark analysis scripts
  • Exporting data for BI tools
  • Stopping or terminating the instance post-analysis

Prerequisites

  • AWS CLI installed and configured (aws configure)
  • An existing EC2 Key Pair (.pem file)
  • Basic knowledge of Python or Spark

Step 1: Launch an EC2 Instance with Spark Using AWS CLI

We’ll use an Ubuntu AMI and install Spark, Java, and required Python libraries via user data script.

🔸 Create a user-data script: spark-bootstrap.sh

#!/bin/bash
apt update -y
apt install -y openjdk-11-jdk python3-pip wget unzip
pip3 install pandas openpyxl pyspark findspark matplotlib notebook

wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 /opt/spark

echo 'export SPARK_HOME=/opt/spark' >> /etc/profile
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> /etc/profile
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> /etc/profile

Make it readable:

chmod +x spark-bootstrap.sh

🔸 Launch the EC2 Instance

aws ec2 run-instances \
  --image-id ami-0c94855ba95c71c99 \  # Ubuntu 20.04
  --count 1 \
  --instance-type t3.medium \
  --key-name YOUR_KEY_PAIR_NAME \
  --security-groups default \
  --user-data file://spark-bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=SparkBI}]'

Replace YOUR_KEY_PAIR_NAME with your EC2 key name.

🗂️ Step 2: Upload Your Excel File to the Instance

🔸 Find the Public IP of Your Instance

aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=SparkBI" \
  --query "Reservations[*].Instances[*].PublicIpAddress" \
  --output text

Upload your Excel file (sales_report.xls)

scp -i your-key.pem sales_report.xls ubuntu@<EC2_PUBLIC_IP>:/home/ubuntu/

🧠 Step 3: Create and Run Your PySpark Script

sales_analysis.py:

import os
import pandas as pd
from pyspark.sql import SparkSession

xls_file = "sales_report.xls"
csv_file = "sales_report.csv"

df = pd.read_excel(xls_file)
df.to_csv(csv_file, index=False)

spark = SparkSession.builder.appName("SalesBI").getOrCreate()
df_spark = spark.read.csv(csv_file, header=True, inferSchema=True)

# Sample Analysis
df_spark.groupBy("Region").sum("Sales").show()

Run it on EC2:

bash:
spark-submit sales_analysis.py

📊 Step 4: Export Data for BI Tools

You can save output as CSV for use in Power BI, Excel, or Apache Superset:

python:
df_spark.groupBy("Product").sum("Sales").write.csv("product_sales_output", header=True)

Use scp to download:

bash:
scp -i your-key.pem -r ubuntu@<EC2_PUBLIC_IP>:product_sales_output/ .

💰 Step 5: Stop or Terminate EC2 to Save Costs

Stop the Instance (preserves data, costs ~$0.01/hr for EBS)

bash:
aws ec2 stop-instances --instance-ids i-xxxxxxxxxxxxxxxxx

🧭 Pro Tips

  • Use Amazon S3 for persistent storage between sessions.
  • For automation, script the entire process into AWS CloudFormation or a Makefile.
  • If you’re doing frequent BI work, consider using Amazon EMR Serverless or SageMaker Studio.

Conclusion

With just a few CLI commands and a smart use of EC2, you can spin up a complete Apache Spark BI analysis environment. It’s flexible, cost-efficient, and cloud-native.

💡 Don’t forget to stop or terminate the EC2 instance when not in use to save on costs!

Exploring AI to the deep end – Facebook Reels

Today was exploring more of AI tools and many more.

Using chat-gpt generated a prompt for leonardo ai to create a subtle scenery.

Using GIMP on Ubuntu the scene was upscaled and cropped to 810:1440 ie 9:16 resolution. Used AI Music to generate a 30 second lo-fi bgm.mp3. using a PHP script generated 840 lines in format fname brightness contrast with values in a gradual change. This information was used in a loop for invoking imagemagic convert and create 840 frames. finally ffmpeg -i “%04d.jpg” -i bgm.mp3 -r 30 dusk.mp4. since frame rate is 30 and frames are 840, the video is of length 28 seconds and can be shared on facebook as a reel. That was my target.

Prompt in Chat GPT:

Suggest a detailed prompt for leonardo ai to generate a scenic view of country side just before sunset and the sun is behind the viewer.

The response, trimmed to the relevant portion which was further given to Leonardo AI:

A breathtaking countryside landscape bathed in the warm glow of the golden hour, just before sunset. The sun is behind the viewer, casting long, soft shadows over rolling green hills. A winding dirt road leads through fields of wildflowers in shades of lavender, yellow, and white, gently swaying in the breeze. A rustic wooden fence lines the path, guiding the eye toward a quaint farmhouse with a red roof and smoke curling from its chimney. Lush trees with golden-hued leaves dot the horizon, their branches swaying gently. A calm river reflects the soft, pastel hues of the evening sky—lavender, peach, and rose gold—while birds soar gracefully in the distance. The scene exudes warmth, tranquility, and a sense of peaceful solitude.

This was of dimension 1472×832 though I had set 16:9 resolution in the Leonardo image generator app.

Since I am on XUbuntu here I used GIMP to first upscale. As height required was 1440 the width proportionally came to 2548 once the scaling was done, chose canvas size from the image menu. setting width to 810 and position to center, since I needed 9:16 for Facebook reel.

Now for progressive frame generation a short script was written. For which the max thresholds were identified using GIMP. Colors -> Brightness & Contrast then in the slider, manually tweaking until the image was fully black. tried to fix these values to easy to calculate. And arrived at Brightness -120 and Contrast + 60. With a frame rate of 30 per second, a 28 second video will need 840 frames. So applying that brightness is 0 to -120 in 840 frames which evaluates to reduce by 1 in every 7 frames, whereas contrast is 0 to 60 and that evaluates to increase of 1 in every 14 frames. This was implemented using php scripting.

<?php

/*
brightness    0 => -120  7:1
Contrast      0 => 60   14:1

frames 840
*/

$list = range(1,840);

$bt = 0;
$ct = 0;

$bv = 0;
$cv = 0;

foreach($list as $sn){
   
   if($bt == 7){
   	$bv += 1;
   	$bt = 0;
   }
   
   if($ct == 14){
   	$cv += 1;
   	$ct = 0;
   }
      
   $bt++;
   $ct++;
   
   echo str_pad($sn, 4, '0', STR_PAD_LEFT)," $bv $cv","\n";
}

?>

This was further run from the command line and the output captured in a text file. Further a while loop creates the frames using image magik convert utility.

php -q bnc.php > list.txt

mkdir fg

cat list.txt | while read fi bv cv; do convert scene.jpg -brightness-contrast -${bv}x${cv} fg/${fi}.jpg ; done

cd fg
ffmpeg -i %04d.jpg -i /home/jijutm/Downloads/bgm-sunset.mp3 -r 30 ../sunset-reel.mp4

The bgm-sunset.mp3 was created using AI music generator and edited in audacity for special effects like fade in fade out etc.

Why this workflow is effective:

Automation: The PHP script and ImageMagick loop automate the tedious process of creating individual frames, saving a lot of time and effort.
Cost-effective: Using open-source tools like GIMP and FFmpeg keeps the cost down.
Flexibility: This approach gives a high degree of control over every aspect of the video, from the scenery to the music and the visual effects.
Efficient: By combining the strengths of different AI tools and traditional image/video processing software, this streamlined workflow is defined that gets the job done quickly and effectively.

The final reel on facebook page , see that also.

Ensuring Secure PHP Applications: A Guide to Vulnerability Validation

Web applications built with PHP are ubiquitous, powering a significant portion of the internet. However, their widespread use also makes them a frequent target for malicious actors. Ensuring the security of these applications is paramount, especially when development is handled by a third-party team. This article outlines a comprehensive approach to validating PHP code for vulnerabilities, minimizing risks and protecting sensitive data.

The Importance of Proactive Security:

Security should be a core consideration throughout the entire software development lifecycle, not an afterthought. Addressing vulnerabilities after deployment is significantly more costly and time-consuming than preventing them in the first place. Proactive security measures, including thorough code validation, are crucial for mitigating risks and maintaining a secure application.

Key Vulnerabilities to Watch For:

Several common vulnerabilities frequently plague PHP applications. Understanding these weaknesses is the first step in preventing them:

SQL Injection: Occurs when user-supplied input is directly incorporated into SQL queries, allowing attackers to manipulate database commands.
Cross-Site Scripting (XSS): Enables attackers to inject malicious scripts into web pages viewed by other users, potentially stealing cookies or redirecting users to phishing sites.
Cross-Site Request Forgery (CSRF): Exploits the trust a website has in a user’s browser, allowing attackers to perform unauthorized actions on behalf of the user.
File Inclusion: Arises when user input is used to dynamically include files, potentially allowing attackers to execute arbitrary code.
Command Injection: Happens when user input is used in system commands, allowing attackers to execute commands on the server.
Session Management Issues: Weaknesses in session handling can lead to session hijacking or other security breaches.
Improper Error Handling: Displaying sensitive information in error messages can provide valuable information to attackers.

A Multi-Layered Approach to Validation:

Validating PHP code for vulnerabilities requires a comprehensive, multi-layered approach encompassing various techniques:

  1. Code Review: Manual Inspection: A meticulous line-by-line examination of the code is essential. This process should focus on identifying patterns indicative of the vulnerabilities listed above. Special attention should be paid to areas where user input is processed or used in database queries, file operations, or system commands.
    Peer Review: Involving other experienced developers in the review process offers a fresh perspective and increases the likelihood of identifying overlooked issues.
  2. Automated Tools: Static Application Security Testing (SAST): SAST tools analyze the source code without executing it, identifying potential vulnerabilities based on predefined rules and patterns. These tools can flag issues like SQL injection, XSS, and other common weaknesses. Examples include PHPStan, Psalm, and RIPS.
    Dynamic Application Security Testing (DAST): DAST tools test the application in a runtime environment, simulating real-world attacks to uncover vulnerabilities that might not be apparent through static analysis. Tools like OWASP ZAP, Acunetix, and Netsparker fall into this category.
  3. Best Practices and Secure Coding Standards: Adherence to Standards: Following established secure coding guidelines, such as those provided by OWASP, is crucial. These guidelines provide a framework for writing secure code and minimizing vulnerabilities.
    Input Validation and Sanitization: Rigorous input validation and sanitization are essential for preventing many common vulnerabilities. All user inputs should be validated on both the client-side and server-side, and potentially harmful characters should be escaped or removed.
    Principle of Least Privilege: Granting only the necessary permissions to users and processes minimizes the potential damage from a successful attack.
    Regular Updates: Keeping PHP, libraries, frameworks, and the operating system up-to-date is crucial for patching known vulnerabilities.

Specific Considerations When Working with Third-Party Teams:

Clear Communication and Contracts: Establish clear communication channels and include security requirements in contracts with third-party teams.
Code Ownership and Access: Define code ownership and ensure access to the source code for thorough review.
Regular Security Audits: Conduct regular security audits of the application, especially after major updates or releases.
Vulnerability Disclosure Policy: Establish a clear vulnerability disclosure policy to handle security issues responsibly.

For a PHP code quality analyzer plugin for VS Code, the most popular choice is “PHPStan” which is a static code analysis tool that effectively detects potential errors and type issues in your PHP code without needing to actually run it, providing comprehensive insights into code quality. (source Google Search!

Conclusion:

Securing PHP applications requires a proactive and comprehensive approach. By implementing the strategies outlined in this article, including thorough code review, the use of automated tools, adherence to secure coding practices, and careful management of third-party relationships, organizations can significantly reduce the risk of vulnerabilities and protect their valuable data. Remember that security is an ongoing process, and continuous monitoring, testing, and improvement are essential for maintaining a secure application.

Setting session timestamp limited to script scope in PHP

Lets start with analyzing a use case. Well at Saturn we required to run a cron
job which was written in php, with a lot of use for date functions, every day
starting Jan 1 2001 to Dec 31 2012. I did not dig much into override_function
to change the behavior of the date function, instead a small class was written
to handle the ticks, but the loop was run using a history table in mysql.

Continue reading “Setting session timestamp limited to script scope in PHP”

MsSQL Export to CSV

In the near past I got frustrated searching for a tool to export from MsSQL express to proper CSV. As always the frustration lead me to google, and found the post on stackoverflow and to digital point forums and a lot of other places. All these had some or the other issues with us. One being the MsSQL not hosted on our servers and we dont have access to the GUI tools, the second is that I am a bit towards the fag end when Microsoft Technologies are in the anvil. Finally there was no other way and what I did was to migrate our PHP MySQL wrapper to support MsSQL.

Continue reading “MsSQL Export to CSV”

Class DumpIO – Inspired by Apache mod_dumpio but reluctant to restart webserver

Though there are far and wide systems for live debugging, and the sort, for forensic or load analysis, our php-extjs framework did not have anything pre planned, other than some query loggers, and background processing systems to log into db etc. While recently the Master MySQL server started showing variations in the cacti patterns (normal was about 4 to 20 in working window, but was steady between 35 and 40 in the tantrum period), we started to worry and could not identify the situation. Also restarting all application servers and clearing the session store would immediately drop the MySQL fsync graph to a standard pattern. This is the time when I looked for a input logger for Apache, and found about the dumpio, but needed the webserver to be restarted. Actually the time was ripe that the application was in a tantrum, and the MySQL graphs showing about 35 fsyncs average.

Revisiting Importance of event logging in server side scripting and other articles on the net, the out come was a class with a single static method. This was designed to pick and log any input. This was later moved as the lite version, and a full version capable of capturing the output also was built.
Continue reading “Class DumpIO – Inspired by Apache mod_dumpio but reluctant to restart webserver”

PHP Sessions in Memcache – Locking Issues

Actually it is ages since I sat down to scribble something. Well this one could not be avoided. Hence here it is.

In one of our FTE projects, we had faced a complication that Memcached on one node was using 100% cpu and php-cgi was complainging that the same node was not permitting any more memcached connections. The configuration was as what all says, session.save_handler = memcache, session.save_path = “tcp://:11211,tcp://:11211,tcp://:11211″. It was giving jitters to the night support, that this used to happen at the worse time when most of the clients are using the application. And eventually that memcached needed to be restarted, kicking all users out and every one has to login back from the login page. Now during the past weeks it was so horrible that we marked a portion of the ramdisk from one least loaded nodes and used nfs to export this to all the nodes for a file based sessions store.
Continue reading “PHP Sessions in Memcache – Locking Issues”

MariaFramework 0.22 released

The MariaFramework or phpmf has been released with a couple of new enhancements. The MariaFramework portal is updated with some new plugins for phpmf. Will dive into what the details of the plugins are in another post. For the time the enhancements of the all new MariaFramework.
Continue reading “MariaFramework 0.22 released”

php smtp email direct to mail box delivery

For sending status mails, with varying from addresses, for several of our projects at Saturn, we were using the phpmailer which uses our smtp server with authentication. Well our smtp host had a limitation of 250 emails per day. When our requirements grew out of this limit, mails started to pile up. Sure I could install exim4 or sendmail on my boxes, and that is what I did for immediate resolution. But here comes a new requirement, that the mails sent should be marked as such, and those which failed should be marked with the exact response of the receiving end mailserver.

At this point I thought about an SMTP direct to mail box Delivery system. My favorite language being PHP, and primary library being Google, I tried all possible ways, according to me, and they were not the right ones as I came to know later. All these did not get me in the right direction. And finally thought about writing one. Here too, being lazy, wanted to have the code from some ones work to ignite me. Okay I found the phpbb’s smtp.php referred on the net, and the function smtpmail from the same was the right choice.
Continue reading “php smtp email direct to mail box delivery”