
Demand Prediction System for E-commerce

A machine learning and time-series forecasting system for predicting product demand (sales quantity) in e-commerce. It compares supervised regression models with time-series models (ARIMA, Prophet) and selects the best-performing approach.

🎯 Overview

This project implements a demand prediction system that uses historical sales data to forecast future product demand. The system compares two approaches:

Machine Learning Models: Treat demand prediction as a regression problem using product features (price, discount, category, date features)
Time-Series Models: Treat demand prediction as a time-series problem using historical patterns (ARIMA, Prophet)

The system automatically selects the best performing model across both approaches.

Key Capabilities:

Predicts sales quantity for products on future dates (ML models)
Predicts overall daily demand (Time-series models)
Handles temporal patterns and seasonality
Considers price, discount, category, and date features (ML models)
Captures time-series patterns and trends (Time-series models)
Automatically selects the best model from multiple candidates
Provides comprehensive evaluation metrics
Compares ML vs Time-Series approaches

✨ Features

Data Preprocessing: Automatic handling of missing values, date feature extraction
Feature Engineering:
- Date features (day, month, day_of_week, weekend, year, quarter)
- Categorical encoding (product_id, category)
- Feature scaling
Multiple Models:
- Machine Learning Models:
  - Linear Regression
  - Random Forest Regressor
  - XGBoost Regressor (optional)
- Time-Series Models:
  - ARIMA (AutoRegressive Integrated Moving Average)
  - Prophet (Facebook's time-series forecasting tool)
Model Selection: Automatic best model selection based on R2 score
Evaluation Metrics: MAE, RMSE, and R2 Score
Visualizations:
- Demand trends over time
- Monthly average demand
- Feature importance
- Model comparison
Model Persistence: Save and load trained models using joblib
Future Predictions: Predict demand for any product on any future date

📁 Project Structure

demand_prediction/
│
├── data/
│   └── sales.csv                    # Sales dataset
│
├── models/                          # Generated during training
│   ├── best_model.joblib           # Best ML model (if ML is best)
│   ├── best_timeseries_model.joblib # Best time-series model (if TS is best)
│   ├── preprocessing.joblib        # Encoders and scaler (for ML models)
│   ├── model_metadata.json         # Model metadata (legacy)
│   └── all_models_metadata.json    # All models comparison metadata
│
├── plots/                           # Generated during training
│   ├── demand_trends.png           # Time series plot
│   ├── monthly_demand.png          # Monthly averages
│   ├── feature_importance.png      # Feature importance (ML models)
│   ├── model_comparison.png        # Model metrics comparison (all models)
│   └── timeseries_predictions.png  # Time-series model predictions
│
├── generate_dataset.py             # Script to generate synthetic dataset
├── train_model.py                  # Main training script
├── predict.py                      # Prediction script
├── app.py                          # Streamlit dashboard (interactive web app)
├── requirements.txt                # Python dependencies
└── README.md                       # This file

🚀 Installation

Prerequisites

Python 3.8 or higher
pip (Python package manager)

Step 1: Navigate to Project Directory

cd demand_prediction

Step 2: Create Virtual Environment (Recommended)

Why use a virtual environment?

Keeps project dependencies isolated from your system Python
Prevents conflicts with other projects
Makes it easier to manage package versions
Best practice for Python projects

Quick Setup (Recommended):

Windows:

setup_env.bat

Linux/Mac:

chmod +x setup_env.sh
./setup_env.sh

Manual Setup:

Windows:

python -m venv venv
venv\Scripts\activate

Linux/Mac:

python3 -m venv venv
source venv/bin/activate

After activation, you should see (venv) in your terminal prompt.

To deactivate later:

deactivate

Step 3: Install Dependencies

pip install -r requirements.txt

Note: XGBoost is optional. If you don't want to use it, remove it from requirements.txt; training then simply skips the XGBoost model.

Alternative (without virtual environment): If you prefer not to use a virtual environment, you can install directly:

pip install -r requirements.txt

However, this is not recommended as it may cause conflicts with other Python projects.

Step 4: Generate Dataset

If you don't have a dataset, generate a synthetic one:

python generate_dataset.py

This will create data/sales.csv with realistic e-commerce sales data.
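A generator of this kind might look roughly like the sketch below. It is not the contents of generate_dataset.py: the base prices, boost factors, and noise level are invented for illustration, and only the column names come from the dataset description in this README. It writes to an in-memory buffer rather than data/sales.csv so it can be run without touching the project files.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical base prices per category (illustrative values only).
BASE_PRICE = {"Electronics": 500.0, "Clothing": 50.0, "Toys": 100.0}
COLUMNS = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def generate_rows(start, days, seed=0):
    """Yield one synthetic sales row per product per day."""
    rng = random.Random(seed)
    products = list(enumerate(BASE_PRICE, start=1))  # (product_id, category)
    for offset in range(days):
        day = start + timedelta(days=offset)
        weekend_boost = 1.3 if day.weekday() >= 5 else 1.0  # Saturday/Sunday
        holiday_boost = 1.5 if day.month in (11, 12) else 1.0
        for product_id, category in products:
            discount = rng.choice([0, 10, 20])
            price = round(BASE_PRICE[category] * rng.uniform(0.9, 1.1), 2)
            # Cheaper and more heavily discounted items sell more units.
            base = 2000.0 / (price ** 0.5)
            demand = base * weekend_boost * holiday_boost * (1 + discount / 100)
            quantity = max(0, round(demand + rng.gauss(0, 5)))
            yield [product_id, day.isoformat(), price, discount, category, quantity]


def write_csv(rows, handle):
    """Write the header followed by all rows."""
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    writer.writerows(rows)


buffer = io.StringIO()
rows = list(generate_rows(date(2020, 1, 1), days=14))
write_csv(rows, buffer)
print(buffer.getvalue().splitlines()[0])
```

Passing a fixed seed keeps the output reproducible, which makes it easier to compare training runs.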

📊 Dataset

The dataset should contain the following columns:

product_id: Unique identifier for each product (integer)
date: Date of sale (YYYY-MM-DD format)
price: Product price (float)
discount: Discount percentage (0-100, float)
category: Product category (string)
sales_quantity: Target variable - number of units sold (integer)

Dataset Format Example

product_id,date,price,discount,category,sales_quantity
1,2020-01-01,499.99,10,Electronics,45
2,2020-01-01,29.99,0,Clothing,120
...
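Before training, it can help to check that a file matches this schema. The following is a minimal sketch using only the standard library; the function name load_sales is hypothetical and the training script may well use pandas instead.

```python
import csv
import io
from datetime import datetime

REQUIRED = ["product_id", "date", "price", "discount", "category", "sales_quantity"]


def load_sales(handle):
    """Parse sales rows, converting each column to the documented type."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        row = {
            "product_id": int(raw["product_id"]),
            "date": datetime.strptime(raw["date"], "%Y-%m-%d").date(),
            "price": float(raw["price"]),
            "discount": float(raw["discount"]),
            "category": raw["category"],
            "sales_quantity": int(raw["sales_quantity"]),
        }
        if not 0 <= row["discount"] <= 100:
            raise ValueError(f"line {line_no}: discount out of range")
        rows.append(row)
    return rows


sample = io.StringIO(
    "product_id,date,price,discount,category,sales_quantity\n"
    "1,2020-01-01,499.99,10,Electronics,45\n"
    "2,2020-01-01,29.99,0,Clothing,120\n"
)
sales = load_sales(sample)
print(len(sales), sales[0]["price"])
```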

💻 Usage

Step 1: Train the Model

Train the model using the sales dataset:

python train_model.py

This will:

Load and preprocess the data
Extract features from dates
Encode categorical variables
Train multiple ML models (Linear Regression, Random Forest, XGBoost)
Prepare time-series data (aggregate daily sales)
Train time-series models (ARIMA, Prophet)
Evaluate each model using MAE, RMSE, and R2 Score
Compare ML vs Time-Series models
Select the best model automatically (across all model types)
Save the model and preprocessing objects
Generate visualizations

Output:

Best model saved to models/best_model.joblib (ML) or models/best_timeseries_model.joblib (TS)
Preprocessing objects saved to models/preprocessing.joblib (for ML models)
Visualizations saved to plots/ directory
All models metadata saved to models/all_models_metadata.json

Step 2: Make Predictions

For ML Models (product-specific predictions):

Predict demand for a specific product on a date:

python predict.py --product_id 1 --date 2024-01-15 --price 100 --discount 10 --category Electronics

Parameters for ML Models:

--product_id: Product ID (integer, required)
--date: Date in YYYY-MM-DD format (required)
--price: Product price (float, required)
--discount: Discount percentage 0-100 (float, default: 0)
--category: Product category (string, required)
--model_type: Model type - auto (default), ml, or timeseries

For Time-Series Models (overall daily demand):

Predict total daily demand across all products:

python predict.py --date 2024-01-15 --model_type timeseries

Parameters for Time-Series Models:

--date: Date in YYYY-MM-DD format (required)
--model_type: Set to timeseries to use time-series models
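The flags above could be declared with argparse along the following lines. This is a sketch of one possible parser, not the actual code in predict.py; in particular, it assumes that the product-level flags are required whenever --model_type is not timeseries, including in auto mode.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Predict product demand")
    parser.add_argument("--product_id", type=int, help="Product ID (ML models)")
    parser.add_argument("--date", required=True, help="Date in YYYY-MM-DD format")
    parser.add_argument("--price", type=float, help="Product price (ML models)")
    parser.add_argument("--discount", type=float, default=0.0,
                        help="Discount percentage, 0-100")
    parser.add_argument("--category", help="Product category (ML models)")
    parser.add_argument("--model_type", choices=["auto", "ml", "timeseries"],
                        default="auto")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Product-level flags are only mandatory when an ML model may be used.
    if args.model_type != "timeseries":
        missing = [name for name in ("product_id", "price", "category")
                   if getattr(args, name) is None]
        if missing:
            parser.error("missing required arguments: "
                         + ", ".join("--" + name for name in missing))
    return args


args = parse_args(["--date", "2024-01-15", "--model_type", "timeseries"])
print(args.model_type, args.discount)
```

Validating the conditional requirement after parsing keeps a single parser for both modes instead of two subcommands.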

Example Predictions:

# ML Model - Electronics product with discount
python predict.py --product_id 1 --date 2024-06-15 --price 500 --discount 20 --category Electronics

# ML Model - Clothing product without discount
python predict.py --product_id 5 --date 2024-12-25 --price 50 --discount 0 --category Clothing

# Time-Series Model - Overall daily demand
python predict.py --date 2024-07-06 --model_type timeseries

# Auto-detect best model (default)
python predict.py --product_id 10 --date 2024-07-06 --price 75 --discount 15 --category Sports

Step 3: Launch Interactive Dashboard (Optional)

Launch the Streamlit dashboard for interactive visualization and predictions:

streamlit run app.py

The dashboard will open in your default web browser (usually at http://localhost:8501).

Dashboard Features:

📈 Sales Trends Page
- Interactive filters (category, product, date range)
- Daily sales trends visualization
- Monthly sales trends
- Category-wise analysis
- Price vs demand relationship
- Real-time statistics and metrics
🔮 Demand Prediction Page
- Interactive prediction interface
- Select model type (Auto/ML/Time-Series)
- For ML models:
  - Product selection dropdown
  - Category selection
  - Price and discount sliders
  - Date picker
  - Product statistics display
- For Time-Series models:
  - Date picker for future predictions
  - Overall daily demand forecast
- Prediction insights and recommendations
📊 Model Comparison Page
- Side-by-side model performance comparison
- MAE, RMSE, and R2 Score metrics
- Visual charts comparing all models
- Best model highlighting
- Model type indicators (ML vs Time-Series)

Dashboard Highlights:

Interactive widgets for easy data exploration
Real-time predictions with visual feedback
Comprehensive model comparison charts

🤖 Model Details

Models Trained

Linear Regression
- Simple linear model
- Fast training and prediction
- Good baseline model
Random Forest Regressor
- Ensemble of decision trees
- Handles non-linear relationships
- Provides feature importance
- Hyperparameters:
  - n_estimators: 100
  - max_depth: 15
  - min_samples_split: 5
  - min_samples_leaf: 2
XGBoost Regressor (Optional)
- Gradient boosting algorithm
- Often provides the best performance
- Handles complex patterns
- Hyperparameters:
  - n_estimators: 100
  - max_depth: 6
  - learning_rate: 0.1
ARIMA (AutoRegressive Integrated Moving Average)
- Classic time-series forecasting model
- Captures trends and autocorrelation; seasonality is modeled only when a seasonal order is included (SARIMA)
- Automatically selects best order (p, d, q)
- Works on aggregated daily sales data
- Uses chronological train/validation split
Prophet (Facebook's Time-Series Forecasting)
- Designed for business time series
- Handles seasonality (weekly, yearly)
- Robust to missing data and outliers
- Works on aggregated daily sales data
- Uses chronological train/validation split
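With scikit-learn, the ML candidates and the hyperparameters listed above could be set up as in this sketch. The function name build_ml_models and the random_state seed are assumptions added for illustration; the README does not say how train_model.py constructs its models.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def build_ml_models(random_state=42):
    """Return the candidate regressors keyed by display name."""
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(
            n_estimators=100,
            max_depth=15,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=random_state,  # assumption: seed for reproducibility
        ),
    }
    try:
        # XGBoost is optional; skip it when the package is not installed.
        from xgboost import XGBRegressor
        models["XGBoost"] = XGBRegressor(
            n_estimators=100,
            max_depth=6,
            learning_rate=0.1,
            random_state=random_state,
        )
    except ImportError:
        print("XGBoost not available, skipping")
    return models


models = build_ml_models()
print(sorted(models))
```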

Model Comparison: ML vs Time-Series

Machine Learning Models:

✅ Predict per-product demand
✅ Use product features (price, discount, category)
✅ Can handle new products with similar features
❌ May not capture long-term temporal patterns as well

Time-Series Models:

✅ Capture temporal patterns and trends
✅ Handle seasonality automatically
✅ Good for overall demand forecasting
❌ Predict aggregate demand, not per-product
❌ Don't use product-specific features

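The system automatically selects the best model based on R2 score across all model types. Because the ML models are scored on per-product quantities and the time-series models on aggregated daily totals, the two groups' R2 values are computed on different targets; treat a cross-group ranking as a rough guide.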

Feature Engineering

For ML Models:

The system extracts the following features from the input data:

Date Features:

day: Day of month (1-31)
month: Month (1-12)
day_of_week: Day of week (0=Monday, 6=Sunday)
weekend: Binary indicator (1 if weekend, 0 otherwise)
year: Year
quarter: Quarter of year (1-4)

Original Features:

product_id: Encoded as categorical
price: Numerical (scaled)
discount: Numerical (scaled)
category: Encoded as categorical

Total Features: 10 features after encoding and scaling
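The six date features can be derived from a single date as in the sketch below. It uses the standard library for clarity; the helper name date_features is hypothetical and the project itself may compute these columns with pandas.

```python
from datetime import date


def date_features(day):
    """Return the six calendar features for a datetime.date."""
    day_of_week = day.weekday()  # 0 = Monday, 6 = Sunday
    return {
        "day": day.day,
        "month": day.month,
        "day_of_week": day_of_week,
        "weekend": 1 if day_of_week >= 5 else 0,
        "year": day.year,
        "quarter": (day.month - 1) // 3 + 1,
    }


features = date_features(date(2024, 7, 6))
print(features)
```

Together with product_id, price, discount, and category, these six values make up the 10 model inputs.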

For Time-Series Models:

Data is aggregated by date (total daily sales)
Uses chronological split (80% train, 20% validation)
Prophet automatically handles:
- Weekly seasonality
- Yearly seasonality
- Trend components
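The aggregation and chronological split described above can be sketched as follows. The function names are hypothetical and the toy rows are made up for the example; the key point is that the series is sorted by date and cut without shuffling, so the validation period is strictly later than the training period.

```python
from collections import defaultdict
from datetime import date


def daily_totals(rows):
    """Sum sales_quantity per date and return (date, total) pairs in date order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"]] += row["sales_quantity"]
    return sorted(totals.items())


def chronological_split(series, train_fraction=0.8):
    """Split without shuffling so validation data is strictly later in time."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


rows = [
    {"date": date(2020, 1, 1 + i // 2), "sales_quantity": 10 + i}
    for i in range(20)  # two products per day over ten days
]
series = daily_totals(rows)
train, valid = chronological_split(series)
print(len(train), len(valid))
```

A random split would leak future observations into training and make the time-series models look better than they are.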

📈 Evaluation Metrics

The system evaluates models using three metrics:

MAE (Mean Absolute Error)
- Average absolute difference between predicted and actual values
- Lower is better
- Units: same as target variable (sales quantity)
RMSE (Root Mean Squared Error)
- Square root of average squared differences
- Penalizes large errors more than MAE
- Lower is better
- Units: same as target variable (sales quantity)
R2 Score (Coefficient of Determination)
- Proportion of variance explained by the model
- Range: -∞ to 1 (1 is perfect prediction)
- Higher is better
- Used for model selection

Model Selection: The model with the highest R2 score is selected as the best model.
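For reference, the three metrics can be computed by hand as in this sketch. The function name regression_metrics is hypothetical; the training script more likely calls the equivalent functions in sklearn.metrics.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot  # undefined when all actual values are equal
    return {"mae": mae, "rmse": rmse, "r2": r2}


metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)
```

R2 turns negative whenever the model's squared error exceeds that of always predicting the mean, which is why its range has no lower bound.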

📊 Visualizations

The training script generates several visualizations:

Demand Trends Over Time (plots/demand_trends.png)
- Shows total daily sales quantity over the entire time period
- Helps identify overall trends and patterns
Monthly Average Demand (plots/monthly_demand.png)
- Bar chart showing average sales by month
- Reveals seasonal patterns (e.g., holiday season spikes)
Feature Importance (plots/feature_importance.png)
- Shows which features are most important for predictions
- Only available for tree-based models (Random Forest, XGBoost)
Model Comparison (plots/model_comparison.png)
- Side-by-side comparison of all models (ML and Time-Series)
- Color-coded: ML models (blue) vs Time-Series models (orange/red)
- Shows MAE, RMSE, and R2 Score for each model
Time-Series Predictions (plots/timeseries_predictions.png)
- Actual vs predicted plots for ARIMA and Prophet models
- Shows how well time-series models capture temporal patterns
- Only generated if time-series models are available

🔮 Example Predictions

Here are some example predictions to demonstrate the system:

# Example 1: Electronics on a weekday
python predict.py --product_id 1 --date 2024-03-15 --price 500 --discount 10 --category Electronics
# Expected: Moderate demand (weekday, some discount)

# Example 2: Clothing on weekend
python predict.py --product_id 5 --date 2024-07-06 --price 50 --discount 20 --category Clothing
# Expected: Higher demand (weekend, good discount)

# Example 3: Holiday season prediction
python predict.py --product_id 10 --date 2024-12-20 --price 100 --discount 25 --category Toys
# Expected: High demand (holiday season, good discount)

🔧 Technical Details

Data Preprocessing Pipeline

Date Conversion: Convert date strings to datetime objects
Feature Extraction: Extract temporal features from dates
Missing Value Handling: Fill any missing values with the column median
Categorical Encoding: Label encode product_id and category
Feature Scaling: Standardize numerical features using StandardScaler
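The last three steps might look like this with NumPy and scikit-learn. The toy arrays are invented for the example, and which columns the project actually scales is an assumption here (price and discount only).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows: price has one missing value (NaN).
categories = np.array(["Electronics", "Clothing", "Toys", "Clothing"])
price = np.array([500.0, 50.0, np.nan, 30.0])
discount = np.array([10.0, 0.0, 25.0, 20.0])

# Missing values: replace NaN with the column median.
price = np.where(np.isnan(price), np.nanmedian(price), price)

# Categorical encoding: map each label to an integer code.
category_encoder = LabelEncoder()
category_codes = category_encoder.fit_transform(categories)

# Feature scaling: zero mean and unit variance per numeric column.
scaler = StandardScaler()
numeric_scaled = scaler.fit_transform(np.column_stack([price, discount]))

features = np.column_stack([category_codes, numeric_scaled])
print(category_encoder.classes_, features.shape)
```

The fitted category_encoder and scaler are the objects that need to be saved alongside the model, since prediction must apply exactly the same transformations.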

Model Training Pipeline

Data Splitting: 80% training, 20% validation
Model Training: Train all available models
Evaluation: Calculate MAE, RMSE, and R2 for each model
Selection: Choose model with highest R2 score
Persistence: Save model, encoders, and scaler
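The selection step reduces to picking the entry with the highest R2. In the sketch below the metric values are placeholders invented for the example, not results from this project, and the metadata layout is hypothetical rather than the actual schema of all_models_metadata.json.

```python
import json


def select_best(results):
    """Return the name of the model with the highest R2 score."""
    return max(results, key=lambda name: results[name]["r2"])


# Placeholder validation metrics (made up for illustration).
results = {
    "Linear Regression": {"mae": 9.1, "rmse": 12.4, "r2": 0.61},
    "Random Forest": {"mae": 5.2, "rmse": 7.8, "r2": 0.85},
    "ARIMA": {"mae": 40.3, "rmse": 55.0, "r2": 0.42},
}
best = select_best(results)
metadata = {"best_model": best, "models": results}
print(json.dumps(metadata, indent=2))
```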

Prediction Pipeline

Load Model: Load trained model and preprocessing objects
Feature Preparation: Extract features from input parameters
Encoding: Encode categorical variables using saved encoders
Scaling: Scale features using saved scaler
Prediction: Make prediction using loaded model
Post-processing: Ensure non-negative predictions

Handling Unseen Data

The prediction script handles cases where:

Product ID was not seen during training (uses default encoding)
Category was not seen during training (uses default encoding)

Warnings are displayed in such cases.
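One way to implement such a fallback is sketched below. The class SafeLabelEncoder is hypothetical, and the README does not say which code serves as the default; using 0 is an assumption, and it means an unseen label is treated like the first known label.

```python
import warnings


class SafeLabelEncoder:
    """Label encoder that maps unseen labels to a default code."""

    def __init__(self, default_code=0):
        self.default_code = default_code
        self.codes = {}

    def fit(self, labels):
        # Assign codes in sorted order, like scikit-learn's LabelEncoder.
        self.codes = {label: i for i, label in enumerate(sorted(set(labels)))}
        return self

    def transform_one(self, label):
        if label not in self.codes:
            warnings.warn(f"{label!r} not seen during training; using default encoding")
            return self.default_code
        return self.codes[label]


encoder = SafeLabelEncoder().fit(["Electronics", "Clothing", "Toys"])
print(encoder.transform_one("Toys"))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = encoder.transform_one("Garden")
print(fallback, len(caught))
```

Predictions for unseen products or categories should be read with caution, because the model has no information specific to them.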

🎓 Learning Points

This project demonstrates:

Supervised Learning: Regression problem solving
Feature Engineering: Creating meaningful features from raw data
Model Comparison: Training and evaluating multiple models
Model Selection: Automatic best model selection
Model Persistence: Saving and loading trained models
Production-Ready Code: Clean, modular, well-documented code
Time Series Features: Extracting temporal patterns
Categorical Encoding: Handling categorical variables
Feature Scaling: Normalizing features for better performance
Evaluation Metrics: Understanding different regression metrics

🐛 Troubleshooting

Issue: "Model not found"

Solution: Run python train_model.py first to train and save the model.

Issue: "XGBoost not available"

Solution: Install XGBoost with pip install xgboost. If you leave it uninstalled, the system still works and simply skips the XGBoost model.

Issue: "Category not seen during training"

Solution: This is handled automatically with a warning. The system uses a default encoding.

Issue: Poor prediction accuracy

Solutions:

Ensure you have sufficient training data
Check that input features are in the same range as training data
Try retraining with different hyperparameters
Consider adding more features or more training data

📝 Notes

The synthetic dataset generator creates realistic patterns including:
- Weekend effects (higher sales on weekends)
- Seasonal patterns (holiday season spikes)
- Price and discount effects
- Category-specific base prices
For production use, consider:
- Using real historical data
- Retraining models periodically
- Adding more features (promotions, weather, etc.)
- Implementing model versioning
- Adding prediction confidence intervals

📄 License

This project is provided as-is for educational purposes.

👤 Author

Created as a complete machine learning project demonstrating demand prediction for e-commerce.

Happy Predicting! 🚀
