Slide 01 — Project Overview

AWS Smart Forecast

End-to-end retail forecasting and business analytics on AWS

  • Cloud-based forecasting demo built on the Rossmann Store Sales dataset
  • Designed to showcase production-style data engineering, ML forecasting, and BI delivery
  • Covers the full workflow from raw CSV ingestion to dashboard-ready forecasts
  • Built for both technical teams and business stakeholders
Slide 02 — Business Problem & Value

Business challenge and client value

Why retail forecasting needs more than a model

  • Daily store sales fluctuate due to promotions, holidays, seasonality, and store-specific factors
  • Retail data often arrives fragmented, inconsistent, and not directly usable for forecasting
  • Business teams need reliable forecasts for inventory, staffing, and promotion planning
  • Leaders need fast insights via dashboards and simple interfaces, not manual SQL workflows
  • The solution combines automation, forecasting, and analytics in one pipeline
Slide 03 — End-to-End Architecture

End-to-end AWS architecture for forecasting

From raw data to business-facing insights

  • Amazon S3 data lake with layered design (Bronze → Silver → Gold)
  • AWS Glue Crawler + Data Catalog for schema discovery and metadata management
  • Amazon Athena for validation and analytical SQL queries
  • AWS Glue Jobs (PySpark) for ETL, cleaning, joins, and feature engineering
  • Amazon SageMaker (XGBoost) for training and forecast generation
  • Amazon QuickSight for dashboards, KPIs, and monitoring
  • Optional chat analytics layer: Lambda + Athena + Streamlit + OpenAI
Slide 04 — Architecture Diagram

Architecture diagram: End-to-end AWS pipeline

Data flow from ingestion to BI/chat — modular, scalable, and traceable

End-to-end overview
Diagram components: Raw data (CSV ingestion); Amazon S3 Data Lake (Bronze, Silver, Gold); AWS Glue (ETL & Features); SageMaker (Train & Forecast); QuickSight (Dashboards); Athena (SQL & Validation); optional Chat (NLQ): Streamlit (UI), Lambda (orchestration), OpenAI (mapping), with validated query templates via Lambda + Athena. Legend: main flow; validation & ad-hoc.
Main flow: Raw data → S3 (Bronze/Silver/Gold) → Glue (ETL & features) → SageMaker (train & forecast) → QuickSight (dashboards). Secondary path: S3 → Athena (SQL & validation). Optional: Chat (NLQ) uses Athena via validated query templates.
Slide 05 — Data Engineering Pipeline

Data engineering pipeline (Bronze → Silver → Gold)

Reliable, queryable, and ML-ready data preparation

  • Bronze (Raw): CSV files stored in S3 as received for reproducibility
  • Catalog & validation: Glue Crawler registers datasets; Athena checks schema, completeness, and anomalies
  • Silver (Cleaned): PySpark ETL joins sales and store metadata, fixes types, handles nulls, removes invalid rows
  • Storage optimization: Parquet output partitioned by store / year / month for faster Athena scans
  • Gold (Engineered): feature-enriched dataset for forecasting, dashboards, and downstream analytics
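
The Silver step above can be illustrated with a minimal, standard-library sketch. It is not the project's PySpark job: the sample rows, the function names (`clean_and_join`, `partition_prefix`), and the choice to keep a missing competition distance as `None` are assumptions made for illustration. The column names follow the public Rossmann CSVs.

```python
from datetime import date

# Hypothetical sample rows using Rossmann-style column names.
sales_rows = [
    {"Store": "1", "Date": "2015-07-31", "Sales": "5263", "Open": "1"},
    {"Store": "2", "Date": "2015-07-31", "Sales": "", "Open": "1"},     # missing sales
    {"Store": "1", "Date": "2015-08-01", "Sales": "-10", "Open": "1"},  # invalid (negative)
    {"Store": "3", "Date": "2015-08-01", "Sales": "0", "Open": "0"},
]
store_rows = [
    {"Store": "1", "StoreType": "c", "CompetitionDistance": "1270"},
    {"Store": "3", "StoreType": "a", "CompetitionDistance": ""},  # null distance
]


def clean_and_join(sales, stores):
    """Join sales with store metadata, fix types, and drop invalid rows."""
    store_by_id = {int(s["Store"]): s for s in stores}
    cleaned = []
    for row in sales:
        if row["Sales"] == "":
            continue  # drop rows with missing sales
        sales_value = float(row["Sales"])
        if sales_value < 0:
            continue  # drop rows with impossible values
        store_id = int(row["Store"])
        meta = store_by_id.get(store_id)
        if meta is None:
            continue  # inner join: skip sales without store metadata
        distance = meta["CompetitionDistance"]
        cleaned.append({
            "store": store_id,
            "date": date.fromisoformat(row["Date"]),
            "sales": sales_value,
            "open": row["Open"] == "1",
            "store_type": meta["StoreType"],
            # Assumption: keep unknown distances as None rather than imputing.
            "competition_distance": float(distance) if distance else None,
        })
    return cleaned


def partition_prefix(record):
    """Build a Hive-style partition prefix (store / year / month)."""
    d = record["date"]
    return f"store={record['store']}/year={d.year}/month={d.month}"


silver = clean_and_join(sales_rows, store_rows)
for record in silver:
    print(partition_prefix(record), record["sales"])
```

Hive-style `key=value` prefixes are what let Athena prune partitions when a query filters on store, year, or month.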
Slide 06 — Feature Engineering

Feature engineering for real-world retail demand

Transforming raw sales into predictive business signals

  • Lag features (e.g., lag_1, lag_7, lag_14) capture short-term memory and weekly behavior
  • Moving averages (e.g., ma_7, ma_30) smooth volatility and represent trends
  • Promo, holiday, and operational flags model demand shocks and store availability
  • Temporal features (weekday, month, week_of_year, year) encode seasonality
  • Store metadata (assortment, competition distance, store type) adds business context
  • Result: a richer feature matrix that improves forecast quality and interpretability
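
The lag and moving-average features above can be sketched in plain Python for a single store's daily series. The function name `add_lag_and_ma` is hypothetical. Computing each moving average over the days before the target day only is an assumption made here to keep the feature free of the value being predicted; the source does not state which window the project uses.

```python
def add_lag_and_ma(sales, lags=(1, 7, 14), windows=(7, 30)):
    """Build lag_N and ma_N features for one store.

    `sales` must be a chronologically ordered list of daily values.
    Features without enough history are set to None.
    """
    rows = []
    for i, value in enumerate(sales):
        row = {"sales": value}
        for lag in lags:
            row[f"lag_{lag}"] = sales[i - lag] if i >= lag else None
        for w in windows:
            # Mean of the w days before day i (day i itself is excluded).
            row[f"ma_{w}"] = sum(sales[i - w:i]) / w if i >= w else None
        rows.append(row)
    return rows


# Hypothetical series: 40 days of steadily rising sales for one store.
daily_sales = list(range(1, 41))
features = add_lag_and_ma(daily_sales)
print(features[30])
```

In a multi-store dataset the same logic would run per store, so that one store's history never leaks into another store's lags.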
Slide 07 — Model Training & Validation

Forecasting with SageMaker (XGBoost) and time-aware validation

Forecast quality measured on future periods, not random samples

  • XGBoost regression model trained in Amazon SageMaker on the Gold dataset
  • Chronological split to avoid leakage (no future information in training; 70% train / 15% validation / 15% test)
  • Hyperparameter tuning and early stopping to improve generalization
  • One consistent ML workflow supports multiple prediction scenarios
  • Forecast outputs stored in S3 for dashboards and query-based reporting
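
A minimal sketch of the 70% / 15% / 15% chronological split described above. The function name and the sample rows are hypothetical, and splitting on distinct calendar dates (rather than on row positions) is an assumption made here so that no date appears in more than one set when several stores share the same days.

```python
from datetime import date, timedelta


def chronological_split(rows, train_pct=70, val_pct=15, date_key="date"):
    """Split rows into train / validation / test by date, oldest first."""
    dates = sorted({r[date_key] for r in rows})
    n = len(dates)
    # Integer arithmetic avoids floating-point surprises at the cut points.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    train_dates = set(dates[:train_end])
    val_dates = set(dates[train_end:val_end])
    train = [r for r in rows if r[date_key] in train_dates]
    val = [r for r in rows if r[date_key] in val_dates]
    test = [
        r for r in rows
        if r[date_key] not in train_dates and r[date_key] not in val_dates
    ]
    return train, val, test


# Hypothetical data: 100 days for two stores.
start = date(2015, 1, 1)
rows = [
    {"store": s, "date": start + timedelta(days=d), "sales": 100 + d}
    for d in range(100)
    for s in (1, 2)
]
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))
```

Every training date precedes every validation date, and every validation date precedes every test date, which is the property that keeps future information out of training.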

Project metrics: RMSE 559.23, MAPE 5.25%, Bias -236.90 (Accuracy ~94.75%, i.e., 100% − MAPE).
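
For reference, the metrics above can be computed as in the sketch below. The exact definitions used in the project are not stated, so these are assumptions: bias is taken as the mean of (forecast − actual), MAPE skips days with zero actual sales, and accuracy is 100 − MAPE, which is consistent with the reported 94.75%. The sample numbers are illustrative only.

```python
import math


def forecast_metrics(actual, forecast):
    """Return RMSE, MAPE (%), bias, and accuracy (%) for paired series."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal length")
    pairs = list(zip(actual, forecast))
    errors = [f - a for a, f in pairs]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    bias = sum(errors) / len(errors)  # negative means under-forecasting
    # Assumption: days with zero actual sales are excluded from MAPE.
    nonzero = [(a, f) for a, f in pairs if a != 0]
    if not nonzero:
        raise ValueError("MAPE is undefined when all actual values are zero")
    mape = 100 * sum(abs(f - a) / abs(a) for a, f in nonzero) / len(nonzero)
    return {"rmse": rmse, "mape": mape, "bias": bias, "accuracy": 100 - mape}


# Illustrative values, not project data.
metrics = forecast_metrics(actual=[100, 200, 400], forecast=[90, 220, 380])
print(metrics)
```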

Slide 08 — Forecast Delivery

From forecasts to decisions

Business-facing delivery through dashboards and natural-language analytics

  • QuickSight dashboards show Forecast vs Actual trends, KPI cards, and error hotspots
  • Store-level filters support operational review and management reporting
  • Promotion impact and high-deviation stores are highlighted in dedicated views
  • Streamlit chat UI lets business users ask questions in natural language
  • Lambda + Athena backend executes controlled, validated query templates
  • OpenAI supports question interpretation and structured query mapping (no direct raw SQL access)
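
The "validated query templates" idea above can be sketched as follows. Everything here is hypothetical: the template names, the `forecasts` table and its columns, the shape of the `intent` dictionary returned by the language model, and the validation limits. The point is only that the model selects a template and supplies parameters, while the SQL text itself is fixed in code.

```python
from datetime import date

# Hypothetical allow-list of SQL templates with positional placeholders.
TEMPLATES = {
    "store_forecast_vs_actual": {
        "sql": (
            "SELECT date, actual_sales, forecast_sales FROM forecasts "
            "WHERE store = ? AND date BETWEEN ? AND ?"
        ),
        "params": ("store_id", "start_date", "end_date"),
    },
    "top_error_stores": {
        "sql": (
            "SELECT store, AVG(ABS(forecast_sales - actual_sales)) AS mae "
            "FROM forecasts GROUP BY store ORDER BY mae DESC LIMIT ?"
        ),
        "params": ("limit",),
    },
}


def _is_int_in_range(value, low, high):
    # bool is a subclass of int, so it is rejected explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and low <= value <= high


def _is_iso_date(value):
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


VALIDATORS = {
    "store_id": lambda v: _is_int_in_range(v, 1, 10_000),
    "limit": lambda v: _is_int_in_range(v, 1, 50),
    "start_date": _is_iso_date,
    "end_date": _is_iso_date,
}


def build_query(intent):
    """Map a model-produced intent to (fixed SQL text, validated parameters)."""
    name = intent.get("template")
    if not isinstance(name, str) or name not in TEMPLATES:
        raise ValueError("unknown template")
    template = TEMPLATES[name]
    params = intent.get("params", {})
    if not isinstance(params, dict) or set(params) != set(template["params"]):
        raise ValueError("unexpected or missing parameters")
    values = []
    for param_name in template["params"]:
        value = params[param_name]
        if not VALIDATORS[param_name](value):
            raise ValueError(f"invalid value for {param_name}")
        values.append(str(value))
    return template["sql"], values


sql, values = build_query({
    "template": "store_forecast_vs_actual",
    "params": {"store_id": 42, "start_date": "2015-07-01", "end_date": "2015-07-31"},
})
print(sql, values)
```

In this sketch the backend would pass the SQL text and the parameter list to Athena as a parameterized query, so user-supplied text is never concatenated into the statement.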
Slide 09 — Summary

Technical Strength & Business Relevance

Why this project matters for real client work

  • End-to-end AWS data + ML architecture (S3, Glue, Athena, SageMaker, QuickSight)
  • Strong data engineering execution (ETL, cleaning, joins, partitioning, feature pipelines)
  • Practical forecasting workflow design with leakage-aware validation
  • Business-ready outputs: dashboards, KPI monitoring, and conversational analytics
  • Modular design adaptable to retail, e-commerce, demand planning, and operations