
Demand Prediction System for E-commerce

A complete machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised learning regression models with time-series models (ARIMA, Prophet) to find the best approach.

📋 Table of Contents

  • Overview
  • Features
  • Project Structure
  • Installation
  • Dataset
  • Usage
  • Model Details
  • Evaluation Metrics
  • Visualizations
  • Example Predictions
  • Technical Details
  • Learning Points
  • Troubleshooting
  • Notes
  • License
  • Author

🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

  1. Machine Learning Models: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
  2. Time-Series Models: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best performing model across both approaches.

Key Capabilities:

  • Predicts sales quantity for products on future dates (ML models)
  • Predicts overall daily demand (Time-series models)
  • Handles temporal patterns and seasonality
  • Considers price, discount, category, and date features (ML models)
  • Captures time-series patterns and trends (Time-series models)
  • Automatically selects the best model from multiple candidates
  • Provides comprehensive evaluation metrics
  • Compares ML vs Time-Series approaches

✨ Features

  • Data Preprocessing: Automatic handling of missing values, date feature extraction
  • Feature Engineering:
    • Date features (day, month, day_of_week, weekend, year, quarter)
    • Categorical encoding (product_id, category)
    • Feature scaling
  • Multiple Models:
    • Machine Learning Models:
      • Linear Regression
      • Random Forest Regressor
      • XGBoost Regressor (optional)
    • Time-Series Models:
      • ARIMA (AutoRegressive Integrated Moving Average)
      • Prophet (Facebook's time-series forecasting tool)
  • Model Selection: Automatic best model selection based on R2 score
  • Evaluation Metrics: MAE, RMSE, and R2 Score
  • Visualizations:
    • Demand trends over time
    • Monthly average demand
    • Feature importance
    • Model comparison
  • Model Persistence: Save and load trained models using joblib
  • Future Predictions: Predict demand for any product on any future date

๐Ÿ“ Project Structure

demand_prediction/
│
├── data/
│   └── sales.csv                    # Sales dataset
│
├── models/                          # Generated during training
│   ├── best_model.joblib            # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib # Best time-series model (if TS is best)
│   ├── preprocessing.joblib         # Encoders and scaler (for ML models)
│   ├── model_metadata.json          # Model metadata (legacy)
│   └── all_models_metadata.json     # All models comparison metadata
│
├── plots/                           # Generated during training
│   ├── demand_trends.png            # Time series plot
│   ├── monthly_demand.png           # Monthly averages
│   ├── feature_importance.png       # Feature importance (ML models)
│   ├── model_comparison.png         # Model metrics comparison (all models)
│   └── timeseries_predictions.png   # Time-series model predictions
│
├── generate_dataset.py              # Script to generate synthetic dataset
├── train_model.py                   # Main training script
├── predict.py                       # Prediction script
├── app.py                           # Streamlit dashboard (interactive web app)
├── requirements.txt                 # Python dependencies
└── README.md                        # This file

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Step 1: Navigate to Project Directory

cd demand_prediction

Step 2: Create Virtual Environment (Recommended)

Why use a virtual environment?

  • Keeps project dependencies isolated from your system Python
  • Prevents conflicts with other projects
  • Makes it easier to manage package versions
  • Best practice for Python projects

Quick Setup (Recommended):

Windows:

setup_env.bat

Linux/Mac:

chmod +x setup_env.sh
./setup_env.sh

Manual Setup:

Windows:

python -m venv venv
venv\Scripts\activate

Linux/Mac:

python3 -m venv venv
source venv/bin/activate

After activation, you should see (venv) in your terminal prompt.

To deactivate later:

deactivate

Step 3: Install Dependencies

pip install -r requirements.txt

Note: If you don't want to use XGBoost, remove it from requirements.txt. The system will still work and will simply skip XGBoost model training.

Alternative (without virtual environment): If you prefer not to use a virtual environment, you can install directly:

pip install -r requirements.txt

However, this is not recommended as it may cause conflicts with other Python projects.

Step 4: Generate Dataset

If you don't have a dataset, generate a synthetic one:

python generate_dataset.py

This will create data/sales.csv with realistic e-commerce sales data.
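
For reference, a generator like this could be sketched as follows. This is a hypothetical sketch only: the categories, base prices, 20-product catalogue, and effect sizes below are illustrative assumptions, not the exact logic of generate_dataset.py.

```python
# Hypothetical sketch of a synthetic-sales generator; the real generate_dataset.py may differ.
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", "2023-12-31", freq="D")
base_prices = {"Electronics": 500, "Clothing": 50, "Toys": 30, "Sports": 75}  # assumed categories

rows = []
for product_id in range(1, 21):                          # assumed 20-product catalogue
    category = list(base_prices)[product_id % len(base_prices)]
    base_price = base_prices[category]
    for date in dates:
        price = round(base_price * rng.uniform(0.8, 1.2), 2)
        discount = int(rng.choice([0, 5, 10, 20, 25]))
        demand = 50.0                                    # arbitrary baseline demand
        demand *= 1.3 if date.dayofweek >= 5 else 1.0    # weekend effect
        demand *= 1.5 if date.month == 12 else 1.0       # holiday-season spike
        demand *= 1 + discount / 100                     # discounts lift demand
        demand *= base_price / price                     # simple price sensitivity
        rows.append((product_id, date.date(), price, discount, category,
                     int(rng.poisson(demand))))

Path("data").mkdir(exist_ok=True)
pd.DataFrame(rows, columns=["product_id", "date", "price", "discount",
                            "category", "sales_quantity"]).to_csv("data/sales.csv", index=False)
```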

📊 Dataset

The dataset should contain the following columns:

  • product_id: Unique identifier for each product (integer)
  • date: Date of sale (YYYY-MM-DD format)
  • price: Product price (float)
  • discount: Discount percentage (0-100, float)
  • category: Product category (string)
  • sales_quantity: Target variable - number of units sold (integer)

Dataset Format Example

product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
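
A quick way to sanity-check the file with pandas (a minimal sketch, assuming the columns listed above):

```python
import pandas as pd

df = pd.read_csv("data/sales.csv", parse_dates=["date"])
print(df.dtypes)                                          # confirm column types
print(df["date"].min(), "to", df["date"].max())           # date coverage
print(df.groupby("category")["sales_quantity"].mean())    # average demand per category
```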

💻 Usage

Step 1: Train the Model

Train the model using the sales dataset:

python train_model.py

This will:

  1. Load and preprocess the data
  2. Extract features from dates
  3. Encode categorical variables
  4. Train multiple ML models (Linear Regression, Random Forest, XGBoost)
  5. Prepare time-series data (aggregate daily sales)
  6. Train time-series models (ARIMA, Prophet)
  7. Evaluate each model using MAE, RMSE, and R2 Score
  8. Compare ML vs Time-Series models
  9. Select the best model automatically (across all model types)
  10. Save the model and preprocessing objects
  11. Generate visualizations

Output:

  • Best model saved to models/best_model.joblib (ML) or models/best_timeseries_model.joblib (TS)
  • Preprocessing objects saved to models/preprocessing.joblib (for ML models)
  • Visualizations saved to plots/ directory
  • All models metadata saved to models/all_models_metadata.json

Step 2: Make Predictions

For ML Models (product-specific predictions):

Predict demand for a specific product on a date:

python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics

Parameters for ML Models:

  • --product_id: Product ID (integer, required)
  • --date: Date in YYYY-MM-DD format (required)
  • --price: Product price (float, required)
  • --discount: Discount percentage 0-100 (float, default: 0)
  • --category: Product category (string, required)
  • --model_type: Model type - auto (default), ml, or timeseries

For Time-Series Models (overall daily demand):

Predict total daily demand across all products:

python predict.py --date 2024-01-15 --model_type timeseries

Parameters for Time-Series Models:

  • --date: Date in YYYY-MM-DD format (required)
  • --model_type: Set to timeseries to use time-series models

Example Predictions:

# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports

Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

streamlit run app.py

The dashboard will open in your default web browser (usually at http://localhost:8501).

Dashboard Features:

  1. 📈 Sales Trends Page

    • Interactive filters (category, product, date range)
    • Daily sales trends visualization
    • Monthly sales trends
    • Category-wise analysis
    • Price vs demand relationship
    • Real-time statistics and metrics
  2. 🔮 Demand Prediction Page

    • Interactive prediction interface
    • Select model type (Auto/ML/Time-Series)
    • For ML models:
      • Product selection dropdown
      • Category selection
      • Price and discount sliders
      • Date picker
      • Product statistics display
    • For Time-Series models:
      • Date picker for future predictions
      • Overall daily demand forecast
    • Prediction insights and recommendations
  3. 📊 Model Comparison Page

    • Side-by-side model performance comparison
    • MAE, RMSE, and R2 Score metrics
    • Visual charts comparing all models
    • Best model highlighting
    • Model type indicators (ML vs Time-Series)

Dashboard Screenshots:

  • Interactive widgets for easy data exploration
  • Real-time predictions with visual feedback
  • Comprehensive model comparison charts
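
A rough sketch of how such a Streamlit app can be wired together. The page names and widgets here are illustrative and not necessarily how app.py is actually organized:

```python
# Illustrative page layout; the real app.py may organize its pages and widgets differently.
import pandas as pd
import streamlit as st

df = pd.read_csv("data/sales.csv", parse_dates=["date"])
page = st.sidebar.radio("Page", ["Sales Trends", "Demand Prediction", "Model Comparison"])

if page == "Sales Trends":
    category = st.sidebar.selectbox("Category", sorted(df["category"].unique()))
    subset = df[df["category"] == category]
    st.line_chart(subset.groupby("date")["sales_quantity"].sum())   # daily trend for the category
    st.metric("Total units sold", int(subset["sales_quantity"].sum()))
```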

🤖 Model Details

Models Trained

  1. Linear Regression

    • Simple linear model
    • Fast training and prediction
    • Good baseline model
  2. Random Forest Regressor

    • Ensemble of decision trees
    • Handles non-linear relationships
    • Provides feature importance
    • Hyperparameters:
      • n_estimators: 100
      • max_depth: 15
      • min_samples_split: 5
      • min_samples_leaf: 2
  3. XGBoost Regressor (Optional)

    • Gradient boosting algorithm
    • Often provides best performance
    • Handles complex patterns
    • Hyperparameters:
      • n_estimators: 100
      • max_depth: 6
      • learning_rate: 0.1
  4. ARIMA (AutoRegressive Integrated Moving Average)

    • Classic time-series forecasting model
    • Captures trends and seasonality
    • Automatically selects best order (p, d, q)
    • Works on aggregated daily sales data
    • Uses chronological train/validation split
  5. Prophet (Facebook's Time-Series Forecasting)

    • Designed for business time series
    • Handles seasonality (weekly, yearly)
    • Robust to missing data and outliers
    • Works on aggregated daily sales data
    • Uses chronological train/validation split

Model Comparison: ML vs Time-Series

Machine Learning Models:

  • ✅ Predict per-product demand
  • ✅ Use product features (price, discount, category)
  • ✅ Can handle new products with similar features
  • ❌ May not capture long-term temporal patterns as well

Time-Series Models:

  • ✅ Capture temporal patterns and trends
  • ✅ Handle seasonality automatically
  • ✅ Good for overall demand forecasting
  • ❌ Predict aggregate demand, not per-product
  • ❌ Don't use product-specific features

The system automatically selects the best model based on R2 score across all model types.

Feature Engineering

For ML Models:

The system extracts the following features from the input data:

Date Features:

  • day: Day of month (1-31)
  • month: Month (1-12)
  • day_of_week: Day of week (0=Monday, 6=Sunday)
  • weekend: Binary indicator (1 if weekend, 0 otherwise)
  • year: Year
  • quarter: Quarter of year (1-4)

Original Features:

  • product_id: Encoded as categorical
  • price: Numerical (scaled)
  • discount: Numerical (scaled)
  • category: Encoded as categorical

Total Features: 10 features after encoding and scaling
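
A minimal sketch of the date-feature extraction with pandas (assuming the date column parses as datetime; day_of_week follows the pandas convention where Monday is 0):

```python
import pandas as pd

df = pd.read_csv("data/sales.csv", parse_dates=["date"])
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek              # 0 = Monday, 6 = Sunday
df["weekend"] = (df["day_of_week"] >= 5).astype(int)     # Saturday/Sunday flag
df["year"] = df["date"].dt.year
df["quarter"] = df["date"].dt.quarter
```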

For Time-Series Models:

  • Data is aggregated by date (total daily sales)
  • Uses chronological split (80% train, 20% validation)
  • Prophet automatically handles:
    • Weekly seasonality
    • Yearly seasonality
    • Trend components
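
A sketch of the time-series preparation and model fitting under these assumptions (the ds/y column names are required by Prophet; the fixed ARIMA order shown is illustrative, not the automatically selected one):

```python
import pandas as pd
from prophet import Prophet                        # pip install prophet
from statsmodels.tsa.arima.model import ARIMA      # pip install statsmodels

df = pd.read_csv("data/sales.csv", parse_dates=["date"])
daily = df.groupby("date")["sales_quantity"].sum().reset_index()

split = int(len(daily) * 0.8)                      # chronological 80/20 split
train, valid = daily.iloc[:split], daily.iloc[split:]

# Prophet expects the columns to be named "ds" (date) and "y" (value)
m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.fit(train.rename(columns={"date": "ds", "sales_quantity": "y"}))
prophet_pred = m.predict(valid.rename(columns={"date": "ds"})[["ds"]])["yhat"]

# ARIMA on the same aggregated daily series (fixed order shown for brevity)
arima = ARIMA(train["sales_quantity"], order=(5, 1, 2)).fit()
arima_pred = arima.forecast(steps=len(valid))
```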

📈 Evaluation Metrics

The system evaluates models using three metrics:

  1. MAE (Mean Absolute Error)

    • Average absolute difference between predicted and actual values
    • Lower is better
    • Units: same as target variable (sales quantity)
  2. RMSE (Root Mean Squared Error)

    • Square root of average squared differences
    • Penalizes large errors more than MAE
    • Lower is better
    • Units: same as target variable (sales quantity)
  3. R2 Score (Coefficient of Determination)

    • Proportion of variance explained by the model
    • Range: -∞ to 1 (1 is perfect prediction)
    • Higher is better
    • Used for model selection

Model Selection: The model with the highest R2 score is selected as the best model.
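
All three metrics map directly onto scikit-learn helpers; a small evaluation helper might look like this:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return the three metrics used to compare models."""
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": float(r2_score(y_true, y_pred)),
    }

print(evaluate([10, 20, 30], [12, 18, 33]))   # {'MAE': 2.33..., 'RMSE': 2.38..., 'R2': 0.915}
```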

📊 Visualizations

The training script generates several visualizations:

  1. Demand Trends Over Time (plots/demand_trends.png)

    • Shows total daily sales quantity over the entire time period
    • Helps identify overall trends and patterns
  2. Monthly Average Demand (plots/monthly_demand.png)

    • Bar chart showing average sales by month
    • Reveals seasonal patterns (e.g., holiday season spikes)
  3. Feature Importance (plots/feature_importance.png)

    • Shows which features are most important for predictions
    • Only available for tree-based models (Random Forest, XGBoost)
  4. Model Comparison (plots/model_comparison.png)

    • Side-by-side comparison of all models (ML and Time-Series)
    • Color-coded: ML models (blue) vs Time-Series models (orange/red)
    • Shows MAE, RMSE, and R2 Score for each model
  5. Time-Series Predictions (plots/timeseries_predictions.png)

    • Actual vs predicted plots for ARIMA and Prophet models
    • Shows how well time-series models capture temporal patterns
    • Only generated if time-series models are available

🔮 Example Predictions

Here are some example predictions to demonstrate the system:

# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)

🔧 Technical Details

Data Preprocessing Pipeline

  1. Date Conversion: Convert date strings to datetime objects
  2. Feature Extraction: Extract temporal features from dates
  3. Missing Value Handling: Fill missing values with median (if any)
  4. Categorical Encoding: Label encode product_id and category
  5. Feature Scaling: Standardize numerical features using StandardScaler
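
The five steps can be collected into a single function. This is a sketch of the approach; train_model.py may differ in details:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess(df: pd.DataFrame):
    """Apply steps 1-5 and return features, target, and fitted preprocessing objects."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])                      # 1. date conversion
    df["day"] = df["date"].dt.day                                # 2. date feature extraction
    df["month"] = df["date"].dt.month
    df["day_of_week"] = df["date"].dt.dayofweek
    df["weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["year"] = df["date"].dt.year
    df["quarter"] = df["date"].dt.quarter
    for col in ["price", "discount", "sales_quantity"]:          # 3. median fill for missing values
        df[col] = df[col].fillna(df[col].median())
    encoders = {c: LabelEncoder().fit(df[c]) for c in ["product_id", "category"]}
    for c, enc in encoders.items():                              # 4. label-encode categoricals
        df[c] = enc.transform(df[c])
    scaler = StandardScaler()                                    # 5. scale numerical features
    df[["price", "discount"]] = scaler.fit_transform(df[["price", "discount"]])
    features = ["product_id", "price", "discount", "category",
                "day", "month", "day_of_week", "weekend", "year", "quarter"]
    return df[features], df["sales_quantity"], encoders, scaler
```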

Model Training Pipeline

  1. Data Splitting: 80% training, 20% validation
  2. Model Training: Train all available models
  3. Evaluation: Calculate MAE, RMSE, and R2 for each model
  4. Selection: Choose model with highest R2 score
  5. Persistence: Save model, encoders, and scaler
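
A sketch of this pipeline for the ML side, reusing preprocess() from the section above and evaluate() from the Evaluation Metrics section. Hyperparameters follow the Model Details section; the random split and the exact metadata layout are assumptions:

```python
# Builds on preprocess() and evaluate() defined in the sketches above.
import json
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y, encoders, scaler = preprocess(pd.read_csv("data/sales.csv"))
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, max_depth=15,
                                          min_samples_split=5, min_samples_leaf=2,
                                          random_state=42),
}
try:
    from xgboost import XGBRegressor  # optional dependency
    candidates["XGBoost"] = XGBRegressor(n_estimators=100, max_depth=6, learning_rate=0.1)
except ImportError:
    pass  # XGBoost not installed: train the remaining models

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = {"model": model, **evaluate(y_val, model.predict(X_val))}

best = max(results, key=lambda n: results[n]["R2"])             # highest R2 wins
Path("models").mkdir(exist_ok=True)
joblib.dump(results[best]["model"], "models/best_model.joblib")
joblib.dump({"encoders": encoders, "scaler": scaler}, "models/preprocessing.joblib")
with open("models/all_models_metadata.json", "w") as f:
    json.dump({n: {k: float(v) for k, v in r.items() if k != "model"}
               for n, r in results.items()}, f, indent=2)
```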

Prediction Pipeline

  1. Load Model: Load trained model and preprocessing objects
  2. Feature Preparation: Extract features from input parameters
  3. Encoding: Encode categorical variables using saved encoders
  4. Scaling: Scale features using saved scaler
  5. Prediction: Make prediction using loaded model
  6. Post-processing: Ensure non-negative predictions
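
The same pipeline, sketched for an ML model. The function name predict_demand is illustrative (predict.py exposes this via the CLI instead), and the layout of the saved preprocessing objects is assumed to match the training sketch above:

```python
import joblib
import pandas as pd

def predict_demand(product_id, date, price, discount, category):
    model = joblib.load("models/best_model.joblib")                  # 1. load model
    prep = joblib.load("models/preprocessing.joblib")                #    and preprocessing objects
    enc, scaler = prep["encoders"], prep["scaler"]

    d = pd.to_datetime(date)                                         # 2. feature preparation
    row = pd.DataFrame([{
        "product_id": enc["product_id"].transform([product_id])[0],  # 3. encoding
        "price": price, "discount": discount,
        "category": enc["category"].transform([category])[0],
        "day": d.day, "month": d.month, "day_of_week": d.dayofweek,
        "weekend": int(d.dayofweek >= 5), "year": d.year, "quarter": d.quarter,
    }])
    row[["price", "discount"]] = scaler.transform(row[["price", "discount"]])  # 4. scaling
    prediction = float(model.predict(row)[0])                        # 5. prediction
    return max(0.0, prediction)                                      # 6. never negative

print(predict_demand(1, "2024-01-15", 100.0, 10.0, "Electronics"))
```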

Handling Unseen Data

The prediction script handles cases where:

  • Product ID was not seen during training (uses default encoding)
  • Category was not seen during training (uses default encoding)

Warnings are displayed in such cases.
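
One way to implement such a fallback (a sketch; the default encoding actually used by predict.py may differ):

```python
import warnings
from sklearn.preprocessing import LabelEncoder

def safe_encode(encoder: LabelEncoder, value):
    """Encode a value, falling back to a default code for labels unseen at training time."""
    if value in encoder.classes_:
        return int(encoder.transform([value])[0])
    warnings.warn(f"'{value}' was not seen during training; using default encoding 0.")
    return 0  # assumed default: the first training-time class

# usage: safe_encode(encoders["category"], "Furniture") warns and returns 0
```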

🎓 Learning Points

This project demonstrates:

  1. Supervised Learning: Regression problem solving
  2. Feature Engineering: Creating meaningful features from raw data
  3. Model Comparison: Training and evaluating multiple models
  4. Model Selection: Automatic best model selection
  5. Model Persistence: Saving and loading trained models
  6. Production-Ready Code: Clean, modular, well-documented code
  7. Time Series Features: Extracting temporal patterns
  8. Categorical Encoding: Handling categorical variables
  9. Feature Scaling: Normalizing features for better performance
  10. Evaluation Metrics: Understanding different regression metrics

๐Ÿ› Troubleshooting

Issue: "Model not found"

Solution: Run python train_model.py first to train and save the model.

Issue: "XGBoost not available"

Solution: Install XGBoost with pip install xgboost. Alternatively, the system works without it and simply skips the XGBoost model.

Issue: "Category not seen during training"

Solution: This is handled automatically with a warning. The system uses a default encoding.

Issue: Poor prediction accuracy

Solutions:

  • Ensure you have sufficient training data
  • Check that input features are in the same range as training data
  • Try retraining with different hyperparameters
  • Consider adding more features or more training data

๐Ÿ“ Notes

  • The synthetic dataset generator creates realistic patterns including:

    • Weekend effects (higher sales on weekends)
    • Seasonal patterns (holiday season spikes)
    • Price and discount effects
    • Category-specific base prices
  • For production use, consider:

    • Using real historical data
    • Retraining models periodically
    • Adding more features (promotions, weather, etc.)
    • Implementing model versioning
    • Adding prediction confidence intervals

📄 License

This project is provided as-is for educational purposes.

👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.


Happy Predicting! 🚀
