Model Card for MM_Tiser_Qwen3_VL_FT_v2
Multimodal vision–language model for temporal reasoning question answering on synthetic charts and timelines, obtained by fine-tuning Qwen/Qwen3-VL-8B-Instruct on a TISER-derived dataset where each textual temporal context is rendered as a Gantt, line, or scatter chart paired with QA examples.
Model Details
Model Description
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct specialized for question answering over temporal charts generated from the TISER temporal reasoning benchmark.
The model receives a chart image and a natural language question, and it is trained with a Chain-of-Thought (CoT) plus reflection system prompt that structures reasoning inside <reasoning>, <timeline>, <reflection>, and <answer> tags, while the supervision focuses on producing the correct answer.
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been adapted from the automatically generated template.
- Developed by: Dancat et al. (academic project at Politecnico di Torino)
- Funded by [optional]: Academic / non-commercial research project
- Shared by [optional]: Dancat
- Model type: Multimodal vision–language (VLM) decoder fine-tuned for chart QA and temporal reasoning
- Language(s) (NLP): English
- License: Same license as Qwen/Qwen3-VL-8B-Instruct (see the base model repository)
- Finetuned from model [optional]: Qwen/Qwen3-VL-8B-Instruct
Model Sources [optional]
- Repository: Dancat/MM_Tiser_Qwen3_VL_FT_v2
- Paper [optional]: TISER / timeline self-reflection temporal reasoning paper (TISER benchmark)
- Demo [optional]: (To be added, e.g. Gradio Space using this model)
Uses
Direct Use
The model is intended for research and experimentation on temporal reasoning over synthetic charts derived from textual temporal contexts. Typical direct uses include:
- Visual question answering on Gantt-style timelines (e.g., “Which event was ongoing in 1917?”, “Which event lasted from 1915 to 1917?”).
- Temporal reasoning over line and scatter plots where events or intervals are encoded along a time axis.
- Probing how a multimodal LLM internalizes temporal relations when the context is presented visually rather than as pure text.
The fine-tuning prompt format used during training is:
System message (summarized):
The assistant must use a Chain-of-Thought (CoT) approach with reflection to answer queries about charts. It follows these steps inside tags:
- Step 1: Reason through the visual data step by step within the <reasoning> tags.
- Step 2: Identify relevant temporal events for answering the question within <timeline> tags, assuming relations are unidirectional.
- Step 3: Reflect on the reasoning and the timeline inside <reflection> tags to check for errors or improvements.
- Step 4: Adjust the reasoning if needed; if more reasoning is required, go back to Step 1, otherwise proceed.
- Step 5: Provide the final concise answer inside the <answer> tags. If the answer is a number, output only the number; otherwise output the entity/event only.
The message states that <reasoning>, <reflection>, and <timeline> are internal reasoning sections, that text should be written as paragraphs (no lists), and that the response must be entirely contained within <answer> tags.
User message template: Question: {example['question']}
Temporal context: The provided chart contains the temporal context for this question. Important: Use the chart to reason about the order, overlap and duration of events, and answer exactly what is asked in the question. When the question asks about a specific date or date range, identify the event whose interval actually includes that date or fully covers that range. If the chart does not provide enough information to answer, answer Unknown.
Downstream Use [optional]
Downstream uses may include:
- Plugging the model into a larger RAG or agent pipeline that needs to read timeline-like plots, project Gantt charts, or synthetic historical timelines.
- Further fine-tuning on domain-specific temporal data (e.g. project management charts, scientific timelines, annotated financial charts), possibly keeping the same CoT/reflection prompting structure.
Out-of-Scope Use
The model is not designed or evaluated for:
- Open-domain chatting or general-purpose conversation.
- Safety-critical applications (medical, legal, financial decisions, etc.) without human oversight.
- Real-world numerical forecasting on raw time series; it reasons over symbolic intervals rendered in charts, not continuous sensor or market data.
- Malicious or harmful uses, including discrimination, surveillance, or privacy-violating applications.
Bias, Risks, and Limitations
The model inherits biases and limitations from the base Qwen/Qwen3-VL-8B-Instruct pretraining, and this synthetic fine-tuning does not remove them.
The training data consist of synthetic stories and temporal contexts (TISER-style), so performance may not transfer directly to messy real-world charts, hand-drawn timelines, or noisy scanned documents.
The model assumes that all relevant temporal information is fully encoded in the chart; if the chart is ambiguous or omits necessary details, answers may be incorrect or hallucinated.
The CoT + reflection instructions encourage structured reasoning, but they do not guarantee correctness, faithfulness, or the absence of spurious correlations.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. For real-world use, keep a human in the loop, validate answers with ground truth when possible, and avoid deploying the model in high-stakes settings without extensive evaluation. When studying robustness or fairness, construct evaluation sets that vary chart style, noise level, time scale, and the phrasing of temporal questions.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
model_id = "Dancat/MM_Tiser_Qwen3_VL_FT_v2"
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_id,
dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)
image = Image.open("example_chart.png").convert("RGB")
question = "Which event was ongoing in 1917?"
system_message = (
"You are an AI assistant that uses a Chain of Thought (CoT) approach with reflection "
"to answer queries about charts. Follow these steps:\n\n"
"Step 1. Reason through the visual data step by step within the <reasoning> tags.\n"
"Step 2. Given your previous reasoning, identify relevant temporal events in the given "
"context for answering the given question within <timeline> tags. Assume relations in "
"the context are unidirectional.\n"
"Step 3. Reflect on your reasoning and the timeline to check for any errors or "
"improvements within the <reflection> tags.\n"
"Step 4. Make any necessary adjustments based on your reflection. If there is "
"additional reasoning required, go back to Step 1 (reason through the visual data "
"step-by-step), otherwise move to the next step (Step 5).\n"
"Step 5. Provide your final, concise answer within the <answer> tags. If the answer "
"is a number, just output the number nothing else. Otherwise output the entity or "
"event, without any additional comments.\n\n"
"Important: The <reasoning>, <reflection> and <timeline> sections are for your "
"internal reasoning process. All the reflection and the timeline have to be contained "
"inside the thinking section.\n"
"Do not use enumerations or lists when writing, use plain text instead such as "
"paragraphs.\n"
"The response to the query must be entirely contained within the <answer> tags.\n\n"
"Use the following format for your response:\n\n"
"<reasoning>\n"
"[Your step-by-step reasoning goes here. This is your internal thought process.]\n"
"<timeline>\n"
"[Relevant temporal events for answering the given question.]\n"
"</timeline>\n"
"<reflection>\n"
"[Your reflection on your reasoning, checking for errors or improvements]\n"
"</reflection>\n"
"[Any adjustments to your thinking based on your reflection]\n"
"</reasoning>\n"
"<answer>\n"
"[Your final, concise answer to the query.]\n"
"</answer>\n"
"When answering, always follow these rules:\n"
"- Use the chart to reason about the order, overlap, and duration of events, "
"and answer exactly what is asked in the question.\n"
"- Identify the event or interval that actually covers the requested date or "
"date range on the timeline.\n"
"- If the requested period is a range (e.g. 2006–2007), the correct event must "
"cover the whole range, not just its start or end.\n"
"- If no event covers the requested date or the whole requested range, answer "
"'Unknown' (or the event labeled as Unknown in the chart).\n"
"- Never pick an event only because it is the last or the most recent one. "
"Always check whether its interval includes the queried date(s).\n"
)
messages = [
{
"role": "system",
"content": [{"type": "text", "text": system_message}],
},
{
"role": "user",
"content": [
{"type": "image", "image": image},
{
"type": "text",
"text": (
f"Question: {question}\n\n"
"Temporal context: The provided chart contains the temporal context "
"for this question.\n"
"Important: Use the chart to reason about the order, overlap and "
"duration of events, and answer exactly what is asked in the "
"question. When the question asks about a specific date or date "
"range, identify the event whose interval actually includes that "
"date or fully covers that range. If the chart does not provide "
"enough information to answer, answer Unknown."
),
},
],
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
with torch.inference_mode():
    # The CoT + reflection format produces reasoning before the <answer> tag,
    # so leave enough room for the full response.
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
generated_ids = out[0][inputs["input_ids"].shape[1]:]
answer = processor.decode(generated_ids, skip_special_tokens=True)
print(answer)
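Because the fine-tuned model is prompted to wrap its final answer in <answer> tags, the raw generation usually also contains reasoning text. A minimal post-processing sketch that keeps only the answer span from the answer string produced above (the extract_answer helper is illustrative, not part of the released code):

import re

def extract_answer(text: str) -> str:
    # Keep only the content of the last <answer>...</answer> block;
    # fall back to the raw text if the tags are missing.
    matches = re.findall(r"<answer>\s*(.*?)\s*</answer>", text, flags=re.DOTALL)
    return matches[-1] if matches else text.strip()

print(extract_answer(answer))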
Training Details
Training Data
The training data are derived from the TISER temporal reasoning benchmark and its component datasets (e.g., TempReason, TimeQA, TGQA, etc.). Each textual temporal context is parsed to extract events and intervals, and then rendered into a chart image (Gantt chart, line plot, or scatter plot) using matplotlib. The original TISER questions and answers are kept as text, and multiple questions that refer to the same temporal “story” reuse the same chart image to avoid visual duplication and reduce storage and indexing costs. A smaller “tiny” subset is additionally prepared for quick sanity checks and qualitative evaluation.
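The plotting code itself belongs to the dataset-generation pipeline and is not included in this card. The sketch below illustrates, with a made-up event table and simplified styling, how such a Gantt-style chart can be rendered with matplotlib:

import matplotlib.pyplot as plt

# Illustrative event table parsed from a temporal context: (label, start_year, end_year)
events = [
    ("World War I", 1914, 1918),
    ("Battle of Verdun", 1916, 1916),
    ("Russian Revolution", 1917, 1917),
]

fig, ax = plt.subplots(figsize=(8, 0.6 * len(events) + 1))
for i, (label, start, end) in enumerate(events):
    # Each event becomes a horizontal bar spanning its interval;
    # point events get a small minimum width so they stay visible.
    ax.barh(i, max(end - start, 0.2), left=start, height=0.5)
ax.set_yticks(range(len(events)))
ax.set_yticklabels([label for label, _, _ in events])
ax.set_xlabel("Year")
ax.grid(axis="x", linestyle="--", alpha=0.5)
fig.tight_layout()
fig.savefig("example_chart.png", dpi=150)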
Training Procedure
The model is fine-tuned with supervised learning on pairs of (chart image, question) with a CoT + reflection style target, using TRL’s SFTTrainer and a LoRA configuration on top of Qwen/Qwen3-VL-8B-Instruct. The chat format contains a system message that defines a multi-step reasoning protocol with <reasoning>, <timeline>, <reflection>, and <answer> tags, and a user message that includes the resized chart image and a textual question plus instructions about using the chart and answering “Unknown” if necessary.
Preprocessing [optional]
- Parse each TISER temporal context into a structured event table with fields such as task label, start time, and end time.
- Generate a chart image from this structure:
  - Gantt chart for interval-focused views of events.
  - Line or scatter plot for alternative temporal visualizations over the same time axis. The plotting code controls background, colors, gridlines, labels, and dynamic figure size so that dense contexts remain legible.
- Build the chat example:
  - System message: Chain-of-Thought and reflection instructions specifying that reasoning, timeline, and reflection content must go into the corresponding tags and that the final answer must be inside <answer> tags.
  - User message: includes the chart image and a text block: "Question: {question} Temporal context: The provided chart contains the temporal context for this question. Important: Use the chart to reason about the order, overlap and duration of events, and answer exactly what is asked in the question. When the question asks about a specific date or date range, identify the event whose interval actually includes that date or fully covers that range. If the chart does not provide enough information to answer, answer Unknown."
- Tokenization and image preprocessing are handled via AutoProcessor for Qwen3-VL, with a custom data collator that masks prompt tokens so that the loss focuses on the assistant's answer tokens (see the illustrative masking sketch below).
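The actual collator is not reproduced in this card. The sketch below shows one way such prompt masking can be implemented, assuming a ChatML-style chat template in which assistant turns start with an <|im_start|>assistant marker (the marker string and the helper name are assumptions, not the project's code):

import torch

ASSISTANT_MARKER = "<|im_start|>assistant\n"  # assumed chat-template marker

def mask_prompt_tokens(input_ids: torch.Tensor, tokenizer) -> torch.Tensor:
    # Build labels where everything up to (and including) the last assistant
    # marker, plus padding, is set to -100 so the loss covers only the answer.
    marker_ids = tokenizer(ASSISTANT_MARKER, add_special_tokens=False)["input_ids"]
    labels = input_ids.clone()
    for row in range(input_ids.size(0)):
        ids = input_ids[row].tolist()
        start = 0
        for i in range(len(ids) - len(marker_ids), -1, -1):
            if ids[i:i + len(marker_ids)] == marker_ids:
                start = i + len(marker_ids)
                break
        labels[row, :start] = -100
        labels[row][input_ids[row] == tokenizer.pad_token_id] = -100
    return labels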
Training Hyperparameters
- Training regime: bf16 mixed precision (bf16=True, fp16=False, tf32=True)
- Base model: Qwen/Qwen3-VL-8B-Instruct with LoRA adapters
- Epochs: 1
- Per-device train batch size: 2
- Gradient accumulation steps: 4 (effective batch size 8 per optimization step)
- Learning rate: 1e-4
- LR scheduler: cosine schedule with warmup_steps=10
- Weight decay: 0.01
- Max gradient norm: 1.0
- Logging: logging_steps=10
- Evaluation: eval_strategy="steps", eval_steps=80
- Checkpointing: save_strategy="steps", save_steps=80
- Trainer: TRL SFTTrainer with report_to="trackio" for experiment tracking (see the configuration sketch below)
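For reference, a configuration sketch that mirrors the settings above; the LoRA rank, alpha, dropout, and target modules are placeholders, since the exact adapter settings are not listed in this card:

from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter settings; the values below are illustrative placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Training arguments matching the hyperparameters listed above.
training_args = SFTConfig(
    output_dir="mm_tiser_qwen3_vl_ft_v2",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    fp16=False,
    tf32=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=80,
    save_strategy="steps",
    save_steps=80,
    report_to="trackio",
)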
Speeds, Sizes, Times [optional]
- Model size: ~8B parameters (base Qwen3-VL-8B-Instruct plus LoRA adapters)
- Hardware: 1× NVIDIA A100 80 GB GPU
- Main fine-tuning run: ~247 optimization steps, taking about 6 hours including evaluation and periodic checkpointing
- Checkpoints: Saved at intervals of 80 steps in the training output directory, plus a consolidated final model
Evaluation
Testing Data, Factors & Metrics
Testing Data
Evaluation is performed on a held-out split of the TISER-derived chart QA dataset, stratified by the original textual dataset (e.g., TempReason, TimeQA, TGQA) to maintain coverage across different temporal reasoning styles. A much smaller “tiny” split is also used in the associated notebook for quick interactive tests and qualitative examinations of model behavior.
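One way to build such a stratified split with the datasets library, assuming each example carries a source_dataset column naming its original benchmark (the file name and column name are assumptions about the dataset schema):

from datasets import load_dataset

ds = load_dataset("json", data_files="tiser_chart_qa.jsonl", split="train")

# stratify_by_column requires a ClassLabel column, so encode the source first.
ds = ds.class_encode_column("source_dataset")
splits = ds.train_test_split(test_size=0.1, stratify_by_column="source_dataset", seed=42)
train_ds, eval_ds = splits["train"], splits["test"]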
Factors
Potential evaluation factors include:
- Question type: identification (“which event”), duration queries (“how many years”), ordering (“which happened first”), overlap, and date-range coverage.
- Chart type: Gantt vs line vs scatter charts, which influence how intervals and events are visually encoded.
- Original dataset source: differences between benchmarks such as TimeQA, TGQA, or TempReason.
Metrics
- Exact Match (EM): a binary measure of whether the normalized model prediction exactly matches the normalized ground-truth answer (case-insensitive, punctuation stripped).
- F1 score: token-level F1 between normalized prediction and ground truth, capturing partial overlap for multi-token answers.
These metrics are standard in textual QA benchmarks and are directly applicable to temporal QA where answers are short phrases or entities.
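A minimal reference implementation of both metrics, following the normalization described above (lowercasing, punctuation stripping, whitespace collapsing):

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)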
Results
Full quantitative results over all TISER subsets are still being consolidated. Preliminary experiments on small held-out subsets and the “tiny” split suggest improved alignment between answers and chart intervals compared to the base model, particularly for questions that require checking that an event fully covers a date range rather than only its start or end. Once the full evaluation is complete, detailed EM and F1 numbers per dataset and per chart type will be added here.
Summary
The fine-tuned model shows promising improvements on synthetic temporal chart QA tasks but has not yet been extensively benchmarked on real-world charts or beyond the TISER-derived setting. Users should treat the current results as preliminary and perform task-specific evaluation before deployment.
Model Examination [optional]
No dedicated interpretability or probing analyses are included at this time. Future model examination could involve:
- Visualizing attention maps over chart regions for different question types.
- Analyzing error patterns (e.g., off-by-one-year mistakes, confusion between overlapping events).
- Studying how the CoT + reflection tags influence internal representations and whether they lead to more faithful temporal reasoning.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator (https://mlco2.github.io/impact#compute) presented in Lacoste et al. (2019) (https://arxiv.org/abs/1910.09700).
- Hardware Type: 1× NVIDIA A100 80 GB GPU
- Hours used: ≈6 hours for the main fine-tuning run, plus additional time for experiments and debugging
- Cloud Provider: GPU cloud provider (e.g., Vast.ai or similar)
- Compute Region: Typically an EU data center, depending on the selected node
- Carbon Emitted: Not directly measured; should be estimated based on GPU type, power draw, utilization, and total runtime using the referenced calculator
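As an illustration of how such an estimate can be computed, the snippet below combines the 6-hour runtime with assumed values for average power draw, datacenter PUE, and grid carbon intensity; none of the assumed numbers are measurements from the actual run:

# Only the 6-hour runtime comes from this card; the other values are assumptions.
hours = 6.0                  # main fine-tuning run
avg_power_kw = 0.4           # assumed average draw of one A100 (its TDP)
pue = 1.2                    # assumed datacenter power usage effectiveness
carbon_intensity = 0.3       # assumed grid intensity in kg CO2eq per kWh

energy_kwh = hours * avg_power_kw * pue
co2_kg = energy_kwh * carbon_intensity
print(f"~{energy_kwh:.1f} kWh, ~{co2_kg:.1f} kg CO2eq")  # roughly 2.9 kWh, 0.9 kg CO2eq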
Technical Specifications [optional]
Model Architecture and Objective
The model architecture is that of Qwen/Qwen3-VL-8B-Instruct, a multimodal vision–language model combining:
- A vision encoder that processes chart images into visual embeddings.
- A text tokenizer/embedding layer for natural language inputs.
- A large autoregressive transformer decoder that integrates visual and textual information.
During fine-tuning the objective is standard next-token prediction over the assistant's response in the chat format; masking in the data collator ensures that the loss is applied primarily to the tokens corresponding to the answer segment within the <answer> tags.
Compute Infrastructure
Hardware
- Single NVIDIA A100-SXM4-80GB GPU with sufficient high-bandwidth memory for 8B-parameter multimodal training.
- CPU, RAM, and storage sufficient to handle the image dataset, checkpoints, and logging (e.g., several hundred GB of disk space for charts and model outputs).
Software
- Python 3.11 (or compatible)
- transformers (version supporting Qwen3-VL-8B-Instruct and multimodal generation)
- trl for supervised fine-tuning (SFT) with SFTTrainer
- peft for LoRA configuration and adapter management
- accelerate for distributed/accelerated training setup
- datasets for loading and handling JSONL datasets
- qwen-vl-utils for Qwen3-VL specific utilities
- matplotlib, Pillow, numpy, pandas for temporal context parsing, chart generation, and image preprocessing
- Experiment tracking library (e.g., TrackIO) for metrics logging and visualization
Citation [optional]
If you use this model, please also cite the Qwen3-VL technical report and the TISER temporal reasoning work.
BibTeX:
@misc{qwen3technicalreport,
title = {Qwen3 Technical Report},
author = {Qwen Team},
year = {2025},
eprint = {2505.09388},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2505.09388}
}
@article{tiser2025,
title = {Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models},
author = {Authors of the TISER paper},
journal = {arXiv preprint arXiv:2504.05258},
year = {2025},
url = {https://arxiv.org/abs/2504.05258}
}
Model Card Authors
- Daniele Catalano (s349472@studenti.polito.it)
- Francesco Dal Cero (s342631@studenti.polito.it)
- Ramadan Mehmetaj (s346213@studenti.polito.it)
- Samuele Caruso (s349506@studenti.polito.it)