Demand Prediction System for E-commerce
A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) to find the best approach.
Table of Contents
- Overview
- Features
- Project Structure
- Installation
- Dataset
- Usage
- Model Details
- Evaluation Metrics
- Visualizations
- Example Predictions
- Technical Details
Overview
This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:
- Machine Learning Models: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
- Time-Series Models: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)
The system automatically selects the best performing model across both approaches.
Key Capabilities:
- Predicts sales quantity for products on future dates (ML models)
- Predicts overall daily demand (Time-series models)
- Handles temporal patterns and seasonality
- Considers price, discount, category, and date features (ML models)
- Captures time-series patterns and trends (Time-series models)
- Automatically selects the best model from multiple candidates
- Provides comprehensive evaluation metrics
- Compares ML vs Time-Series approaches
Features
- Data Preprocessing: Automatic handling of missing values, date feature extraction
- Feature Engineering:
- Date features (day, month, day_of_week, weekend, year, quarter)
- Categorical encoding (product_id, category)
- Feature scaling
- Multiple Models:
- Machine Learning Models:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor (optional)
- Time-Series Models:
- ARIMA (AutoRegressive Integrated Moving Average)
- Prophet (Facebook's time-series forecasting tool)
- Model Selection: Automatic best model selection based on R2 score
- Evaluation Metrics: MAE, RMSE, and R2 Score
- Visualizations:
- Demand trends over time
- Monthly average demand
- Feature importance
- Model comparison
- Model Persistence: Save and load trained models using joblib
- Future Predictions: Predict demand for any product on any future date
Project Structure
demand_prediction/
│
├── data/
│   └── sales.csv                      # Sales dataset
│
├── models/                            # Generated during training
│   ├── best_model.joblib              # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib   # Best time-series model (if TS is best)
│   ├── preprocessing.joblib           # Encoders and scaler (for ML models)
│   ├── model_metadata.json            # Model metadata (legacy)
│   └── all_models_metadata.json       # All models comparison metadata
│
├── plots/                             # Generated during training
│   ├── demand_trends.png              # Time series plot
│   ├── monthly_demand.png             # Monthly averages
│   ├── feature_importance.png         # Feature importance (ML models)
│   ├── model_comparison.png           # Model metrics comparison (all models)
│   └── timeseries_predictions.png     # Time-series model predictions
│
├── generate_dataset.py                # Script to generate synthetic dataset
├── train_model.py                     # Main training script
├── predict.py                         # Prediction script
├── app.py                             # Streamlit dashboard (interactive web app)
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
Installation
Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
Step 1: Navigate to Project Directory
cd demand_prediction
Step 2: Create Virtual Environment (Recommended)
Why use a virtual environment?
- Keeps project dependencies isolated from your system Python
- Prevents conflicts with other projects
- Makes it easier to manage package versions
- Best practice for Python projects
Quick Setup (Recommended):
Windows:
setup_env.bat
Linux/Mac:
chmod +x setup_env.sh
./setup_env.sh
Manual Setup:
Windows:
python -m venv venv
venv\Scripts\activate
Linux/Mac:
python3 -m venv venv
source venv/bin/activate
After activation, you should see (venv) in your terminal prompt.
To deactivate later:
deactivate
Step 3: Install Dependencies
pip install -r requirements.txt
Note: If you don't want to use XGBoost, you can remove it from requirements.txt. The system still runs without it and simply skips XGBoost model training.
Alternative (without virtual environment): If you prefer not to use a virtual environment, you can install directly:
pip install -r requirements.txt
However, this is not recommended as it may cause conflicts with other Python projects.
Step 4: Generate Dataset
If you don't have a dataset, generate a synthetic one:
python generate_dataset.py
This will create data/sales.csv with realistic e-commerce sales data.
Dataset
The dataset should contain the following columns:
- product_id: Unique identifier for each product (integer)
- date: Date of sale (YYYY-MM-DD format)
- price: Product price (float)
- discount: Discount percentage (0-100, float)
- category: Product category (string)
- sales_quantity: Target variable - number of units sold (integer)
Dataset Format Example
product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
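The column contract above can be checked before training. The following is a minimal, dependency-free sketch using only the standard library; `REQUIRED_COLUMNS`, `parse_sales_row`, and `load_sales` are hypothetical names chosen for illustration and are not taken from the project's scripts, which may well use pandas instead.

```python
import csv
import io
from datetime import datetime

# Columns listed in the Dataset section of this README.
REQUIRED_COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def parse_sales_row(row):
    """Convert one CSV row (a dict of strings) into typed values.

    Raises ValueError if a field cannot be parsed or the discount is outside 0-100.
    """
    discount = float(row["discount"])
    if not 0 <= discount <= 100:
        raise ValueError(f"discount out of range: {discount}")
    return {
        "product_id": int(row["product_id"]),
        "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
        "price": float(row["price"]),
        "discount": discount,
        "category": row["category"],
        "sales_quantity": int(row["sales_quantity"]),
    }


def load_sales(file_obj):
    """Read all rows from an open CSV file object, checking the header first."""
    reader = csv.DictReader(file_obj)
    missing = set(REQUIRED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [parse_sales_row(row) for row in reader]


# The two sample rows from the format example above.
sample = io.StringIO(
    "product_id,date,price,discount,category,sales_quantity\n"
    "1,2020-01-01,499.99,10,Electronics,45\n"
    "2,2020-01-01,29.99,0,Clothing,120\n"
)
rows = load_sales(sample)
```

To validate a real file, the same function could be called as `load_sales(open("data/sales.csv", newline=""))`.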
Usage
Step 1: Train the Model
Train the model using the sales dataset:
python train_model.py
This will:
- Load and preprocess the data
- Extract features from dates
- Encode categorical variables
- Train multiple ML models (Linear Regression, Random Forest, XGBoost)
- Prepare time-series data (aggregate daily sales)
- Train time-series models (ARIMA, Prophet)
- Evaluate each model using MAE, RMSE, and R2 Score
- Compare ML vs Time-Series models
- Select the best model automatically (across all model types)
- Save the model and preprocessing objects
- Generate visualizations
Output:
- Best model saved to models/best_model.joblib (ML) or models/best_timeseries_model.joblib (TS)
- Preprocessing objects saved to models/preprocessing.joblib (for ML models)
- Visualizations saved to the plots/ directory
- All models metadata saved to models/all_models_metadata.json
Step 2: Make Predictions
For ML Models (product-specific predictions):
Predict demand for a specific product on a date:
python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics
Parameters for ML Models:
- --product_id: Product ID (integer, required)
- --date: Date in YYYY-MM-DD format (required)
- --price: Product price (float, required)
- --discount: Discount percentage 0-100 (float, default: 0)
- --category: Product category (string, required)
- --model_type: Model type, one of auto (default), ml, or timeseries
For Time-Series Models (overall daily demand):
Predict total daily demand across all products:
python predict.py --date 2024-01-15 --model_type timeseries
Parameters for Time-Series Models:
- --date: Date in YYYY-MM-DD format (required)
- --model_type: Set to timeseries to use time-series models
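The flags described above map naturally onto `argparse`. The sketch below shows one way such a command line could be parsed; it is not the actual contents of predict.py. The helper names (`iso_date`, `build_parser`, `parse_args`) are hypothetical, and requiring the product fields for every model type except `timeseries` is an assumption about how `auto` behaves.

```python
import argparse
from datetime import datetime


def iso_date(text):
    """Parse a YYYY-MM-DD string; argparse reports a ValueError as a usage error."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def build_parser():
    parser = argparse.ArgumentParser(description="Predict product demand")
    parser.add_argument("--product_id", type=int, help="Product ID")
    parser.add_argument("--date", type=iso_date, required=True, help="YYYY-MM-DD")
    parser.add_argument("--price", type=float, help="Product price")
    parser.add_argument("--discount", type=float, default=0.0, help="Percent, 0-100")
    parser.add_argument("--category", help="Product category")
    parser.add_argument(
        "--model_type", choices=["auto", "ml", "timeseries"], default="auto"
    )
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: product-level fields are only optional for time-series forecasts.
    if args.model_type != "timeseries":
        missing = [
            name for name in ("product_id", "price", "category")
            if getattr(args, name) is None
        ]
        if missing:
            parser.error("missing required arguments: " + ", ".join(missing))
    if not 0 <= args.discount <= 100:
        parser.error("--discount must be between 0 and 100")
    return args


ts_args = parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
ml_args = parse_args(
    ["--product_id", "1", "--date", "2024-01-15", "--price", "100",
     "--discount", "10", "--category", "Electronics"]
)
```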
Example Predictions:
# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics
# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing
# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries
# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports
Step 3: Launch Interactive Dashboard (Optional)
Launch the Streamlit dashboard for interactive visualization and predictions:
streamlit run app.py
The dashboard will open in your default web browser (usually at http://localhost:8501).
Dashboard Features:
Sales Trends Page
- Interactive filters (category, product, date range)
- Daily sales trends visualization
- Monthly sales trends
- Category-wise analysis
- Price vs demand relationship
- Real-time statistics and metrics
Demand Prediction Page
- Interactive prediction interface
- Select model type (Auto/ML/Time-Series)
- For ML models:
- Product selection dropdown
- Category selection
- Price and discount sliders
- Date picker
- Product statistics display
- For Time-Series models:
- Date picker for future predictions
- Overall daily demand forecast
- Prediction insights and recommendations
Model Comparison Page
- Side-by-side model performance comparison
- MAE, RMSE, and R2 Score metrics
- Visual charts comparing all models
- Best model highlighting
- Model type indicators (ML vs Time-Series)
Dashboard Screenshots:
- Interactive widgets for easy data exploration
- Real-time predictions with visual feedback
- Comprehensive model comparison charts
Model Details
Models Trained
Linear Regression
- Simple linear model
- Fast training and prediction
- Good baseline model
Random Forest Regressor
- Ensemble of decision trees
- Handles non-linear relationships
- Provides feature importance
- Hyperparameters:
- n_estimators: 100
- max_depth: 15
- min_samples_split: 5
- min_samples_leaf: 2
XGBoost Regressor (Optional)
- Gradient boosting algorithm
- Often provides best performance
- Handles complex patterns
- Hyperparameters:
- n_estimators: 100
- max_depth: 6
- learning_rate: 0.1
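The hyperparameters listed above for the ML models can be gathered in one place, with optional libraries skipped when they are not installed. This is a sketch under stated assumptions rather than the project's training code: `MODEL_SPECS` and `build_available_models` are hypothetical names, and guarding every import (including scikit-learn) is done here only so the snippet runs in any environment.

```python
import importlib

# Hyperparameters as documented in this README.
# Each entry: display name -> (module path, class name, constructor keyword arguments).
MODEL_SPECS = {
    "Linear Regression": ("sklearn.linear_model", "LinearRegression", {}),
    "Random Forest": (
        "sklearn.ensemble",
        "RandomForestRegressor",
        {"n_estimators": 100, "max_depth": 15,
         "min_samples_split": 5, "min_samples_leaf": 2},
    ),
    "XGBoost": (
        "xgboost",
        "XGBRegressor",
        {"n_estimators": 100, "max_depth": 6, "learning_rate": 0.1},
    ),
}


def build_available_models(specs=MODEL_SPECS):
    """Instantiate every model whose library can be imported.

    Returns (models, skipped): a dict of name -> estimator, and a list of
    names that were skipped because the library is not installed.
    """
    models, skipped = {}, []
    for name, (module_name, class_name, params) in specs.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            skipped.append(name)
            continue
        models[name] = getattr(module, class_name)(**params)
    return models, skipped


models, skipped = build_available_models()
for name in skipped:
    print(f"{name} not available, skipping")
```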
ARIMA (AutoRegressive Integrated Moving Average)
- Classic time-series forecasting model
- Captures trend and short-term autocorrelation (explicit seasonality requires the seasonal variant, SARIMA)
- Automatically selects best order (p, d, q)
- Works on aggregated daily sales data
- Uses chronological train/validation split
Prophet (Facebook's Time-Series Forecasting)
- Designed for business time series
- Handles seasonality (weekly, yearly)
- Robust to missing data and outliers
- Works on aggregated daily sales data
- Uses chronological train/validation split
Model Comparison: ML vs Time-Series
Machine Learning Models:
- Pro: Predict per-product demand
- Pro: Use product features (price, discount, category)
- Pro: Can handle new products with similar features
- Con: May not capture long-term temporal patterns as well
Time-Series Models:
- Pro: Capture temporal patterns and trends
- Pro: Handle seasonality automatically
- Pro: Good for overall demand forecasting
- Con: Predict aggregate demand, not per-product
- Con: Don't use product-specific features
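The system automatically selects the best model based on R2 score across all model types. Because the ML models are scored on per-product rows and the time-series models on aggregated daily totals, the two groups' R2 values are measured on different targets, so treat the cross-type ranking as indicative rather than exact.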
Feature Engineering
For ML Models:
The system extracts the following features from the input data:
Date Features:
- day: Day of month (1-31)
- month: Month (1-12)
- day_of_week: Day of week (0=Monday, 6=Sunday)
- weekend: Binary indicator (1 if weekend, 0 otherwise)
- year: Year
- quarter: Quarter of year (1-4)
Original Features:
- product_id: Encoded as categorical
- price: Numerical (scaled)
- discount: Numerical (scaled)
- category: Encoded as categorical
Total Features: 10 features after encoding and scaling
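The six date features can be derived from a single date value. A minimal sketch using only the standard library follows; `extract_date_features` is a hypothetical name, and the real training script may compute the same values with pandas.

```python
from datetime import date


def extract_date_features(d):
    """Return the six date features described above for a datetime.date."""
    day_of_week = d.weekday()  # 0 = Monday ... 6 = Sunday
    return {
        "day": d.day,
        "month": d.month,
        "day_of_week": day_of_week,
        "weekend": int(day_of_week >= 5),  # Saturday or Sunday
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
    }


# 2024-07-06 is a Saturday in the third quarter.
features = extract_date_features(date(2024, 7, 6))
```

Together with product_id, price, discount, and category, these six values make up the 10 features.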
For Time-Series Models:
- Data is aggregated by date (total daily sales)
- Uses chronological split (80% train, 20% validation)
- Prophet automatically handles:
- Weekly seasonality
- Yearly seasonality
- Trend components
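The aggregation and chronological split described above can be sketched as follows. This is an illustrative, dependency-free version; `aggregate_daily` and `chronological_split` are hypothetical names, and the sample rows are made up for the example.

```python
from collections import defaultdict


def aggregate_daily(rows):
    """Sum sales quantity per date and return (date, total) pairs in date order.

    Dates are ISO strings (YYYY-MM-DD), which sort chronologically as text.
    """
    totals = defaultdict(int)
    for day, quantity in rows:
        totals[day] += quantity
    return sorted(totals.items())


def chronological_split(series, train_fraction=0.8):
    """Split an ordered series so that every validation point is later than training.

    Unlike a random split, this never lets the model see the future.
    """
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Made-up (date, sales_quantity) rows, deliberately out of order.
rows = [
    ("2020-01-03", 10),
    ("2020-01-01", 45),
    ("2020-01-01", 120),
    ("2020-01-02", 30),
    ("2020-01-04", 5),
    ("2020-01-05", 8),
]
daily = aggregate_daily(rows)
train, valid = chronological_split(daily)
```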
Evaluation Metrics
The system evaluates models using three metrics:
MAE (Mean Absolute Error)
- Average absolute difference between predicted and actual values
- Lower is better
- Units: same as target variable (sales quantity)
RMSE (Root Mean Squared Error)
- Square root of average squared differences
- Penalizes large errors more than MAE
- Lower is better
- Units: same as target variable (sales quantity)
R2 Score (Coefficient of Determination)
- Proportion of variance explained by the model
- Range: -∞ to 1 (1 is perfect prediction)
- Higher is better
- Used for model selection
Model Selection: The model with the highest R2 score is selected as the best model.
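For reference, the three metrics can be written out directly from their definitions. The sketch below uses plain Python for clarity; in practice the training script most likely relies on library implementations, and the function names and sample numbers here are illustrative only.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the actual values are constant")
    return 1 - ss_res / ss_tot


# Made-up values for illustration.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 10]

mae = mean_absolute_error(actual, predicted)       # 0.75
rmse = root_mean_squared_error(actual, predicted)  # about 0.866
r2 = r2_score(actual, predicted)                   # 0.85
```

Here RMSE exceeds MAE because squaring weights the three unit-sized errors more heavily than the zero error.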
Visualizations
The training script generates several visualizations:
Demand Trends Over Time (plots/demand_trends.png)
- Shows total daily sales quantity over the entire time period
- Helps identify overall trends and patterns
Monthly Average Demand (plots/monthly_demand.png)
- Bar chart showing average sales by month
- Reveals seasonal patterns (e.g., holiday season spikes)
Feature Importance (plots/feature_importance.png)
- Shows which features are most important for predictions
- Only available for tree-based models (Random Forest, XGBoost)
Model Comparison (plots/model_comparison.png)
- Side-by-side comparison of all models (ML and Time-Series)
- Color-coded: ML models (blue) vs Time-Series models (orange/red)
- Shows MAE, RMSE, and R2 Score for each model
Time-Series Predictions (plots/timeseries_predictions.png)
- Actual vs predicted plots for ARIMA and Prophet models
- Shows how well time-series models capture temporal patterns
- Only generated if time-series models are available
Example Predictions
Here are some example predictions to demonstrate the system:
# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)
# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)
# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)
Technical Details
Data Preprocessing Pipeline
- Date Conversion: Convert date strings to datetime objects
- Feature Extraction: Extract temporal features from dates
- Missing Value Handling: Fill missing values with median (if any)
- Categorical Encoding: Label encode product_id and category
- Feature Scaling: Standardize numerical features using StandardScaler
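Two of the preprocessing steps above, median imputation and standardization, are simple enough to show in plain Python. This is a sketch of the arithmetic only, with hypothetical function names; the project itself uses StandardScaler, which, like this sketch, divides by the population standard deviation.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up price column with one missing value.
prices = [1.0, None, 3.0]
filled = fill_missing_with_median(prices)  # [1.0, 2.0, 3.0]
scaled = standardize(filled)
```

At prediction time the mean and standard deviation must come from the training data, which is why the fitted scaler is saved alongside the model.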
Model Training Pipeline
- Data Splitting: 80% training, 20% validation
- Model Training: Train all available models
- Evaluation: Calculate MAE, RMSE, and R2 for each model
- Selection: Choose model with highest R2 score
- Persistence: Save model, encoders, and scaler
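The selection step reduces to picking the entry with the highest R2 and recording the outcome. The sketch below illustrates this under stated assumptions: the metric values are placeholders and not measured results, and the JSON layout is a hypothetical schema that may differ from the real all_models_metadata.json.

```python
import json


def select_best_model(results):
    """Return the name of the model with the highest R2 score."""
    return max(results, key=lambda name: results[name]["r2"])


# Placeholder numbers for illustration only; they are not real benchmark results.
results = {
    "Linear Regression": {"type": "ml", "mae": 12.0, "rmse": 16.0, "r2": 0.60},
    "Random Forest": {"type": "ml", "mae": 7.0, "rmse": 10.0, "r2": 0.85},
    "ARIMA": {"type": "timeseries", "mae": 90.0, "rmse": 120.0, "r2": 0.40},
}

best_name = select_best_model(results)

# Hypothetical metadata layout: the winner plus every model's metrics.
metadata = {"best_model": best_name, "models": results}
payload = json.dumps(metadata, indent=2)
```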
Prediction Pipeline
- Load Model: Load trained model and preprocessing objects
- Feature Preparation: Extract features from input parameters
- Encoding: Encode categorical variables using saved encoders
- Scaling: Scale features using saved scaler
- Prediction: Make prediction using loaded model
- Post-processing: Ensure non-negative predictions
Handling Unseen Data
The prediction script handles cases where:
- Product ID was not seen during training (uses default encoding)
- Category was not seen during training (uses default encoding)
Warnings are displayed in such cases.
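One way to implement this fallback is a lookup that warns and returns a sentinel for unknown labels. The source does not say which value the "default encoding" is, so the `-1` used here is an assumption, as are the function names.

```python
import warnings


def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(values)))}


def encode_with_default(mapping, value, default=-1):
    """Encode a label, falling back to a default for labels unseen in training.

    Assumption: -1 is used as the fallback so it cannot collide with a real label.
    """
    if value not in mapping:
        warnings.warn(
            f"{value!r} was not seen during training; using default encoding {default}"
        )
        return default
    return mapping[value]


category_mapping = fit_label_mapping(["Electronics", "Clothing", "Toys", "Clothing"])
known = encode_with_default(category_mapping, "Toys")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unknown = encode_with_default(category_mapping, "Garden")
```

Predictions for unseen products or categories should be treated with caution, since the model has no training signal for the fallback value.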
Learning Points
This project demonstrates:
- Supervised Learning: Regression problem solving
- Feature Engineering: Creating meaningful features from raw data
- Model Comparison: Training and evaluating multiple models
- Model Selection: Automatic best model selection
- Model Persistence: Saving and loading trained models
- Production-Ready Code: Clean, modular, well-documented code
- Time Series Features: Extracting temporal patterns
- Categorical Encoding: Handling categorical variables
- Feature Scaling: Normalizing features for better performance
- Evaluation Metrics: Understanding different regression metrics
Troubleshooting
Issue: "Model not found"
Solution: Run python train_model.py first to train and save the model.
Issue: "XGBoost not available"
Solution: Install XGBoost with pip install xgboost. If you leave it uninstalled, the system still works and simply skips the XGBoost model.
Issue: "Category not seen during training"
Solution: This is handled automatically with a warning. The system uses a default encoding.
Issue: Poor prediction accuracy
Solutions:
- Ensure you have sufficient training data
- Check that input features are in the same range as training data
- Try retraining with different hyperparameters
- Consider adding more features or more training data
Notes
The synthetic dataset generator creates realistic patterns including:
- Weekend effects (higher sales on weekends)
- Seasonal patterns (holiday season spikes)
- Price and discount effects
- Category-specific base prices
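The patterns listed above can be combined multiplicatively to produce synthetic rows. The sketch below is not the contents of generate_dataset.py: every constant (base prices, the 1.3 weekend factor, the 1.5 November-December factor, the noise level) is an assumption picked for illustration, and the function names are hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical category-specific base prices.
BASE_PRICES = {"Electronics": 500.0, "Clothing": 50.0, "Toys": 100.0}


def expected_demand(day, price, discount, base_price, base_demand=50.0):
    """Mean demand before noise; all multipliers are illustrative assumptions."""
    weekend_factor = 1.3 if day.weekday() >= 5 else 1.0
    holiday_factor = 1.5 if day.month in (11, 12) else 1.0
    price_factor = base_price / price          # cheaper than usual -> more demand
    discount_factor = 1.0 + discount / 100.0   # bigger discount -> more demand
    return base_demand * weekend_factor * holiday_factor * price_factor * discount_factor


def generate_rows(start, days, seed=42):
    """Generate one row per product per day, reproducibly for a given seed."""
    rng = random.Random(seed)
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        for product_id, (category, base_price) in enumerate(BASE_PRICES.items(), start=1):
            discount = rng.choice([0, 10, 20])
            price = round(base_price * rng.uniform(0.9, 1.1), 2)
            mean = expected_demand(day, price, discount, base_price)
            quantity = max(0, round(rng.gauss(mean, mean * 0.1)))  # never negative
            rows.append({
                "product_id": product_id,
                "date": day.isoformat(),
                "price": price,
                "discount": discount,
                "category": category,
                "sales_quantity": quantity,
            })
    return rows


rows = generate_rows(date(2024, 1, 1), days=7)
```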
For production use, consider:
- Using real historical data
- Retraining models periodically
- Adding more features (promotions, weather, etc.)
- Implementing model versioning
- Adding prediction confidence intervals
License
This project is provided as-is for educational purposes.
Author
Created as a complete machine learning project demonstrating demand prediction for e-commerce.
Happy Predicting!