datause-extraction
This repository contains the fine-tuned LoRA adapter weights for dataset mention extraction, trained on top of the base model fastino/gliner2-large-v1.
It classifies spans into three categories:
- named_data: Proper named datasets, surveys, censuses, or registries (e.g., Demographic and Health Survey, LFS, UNHCR PRIMES).
- descriptive_data: Data resources described by their producer or characteristics rather than a proper name (e.g., World Bank household surveys, spatial socioeconomic data sets).
- vague_data: General references containing a data noun but lacking enough specificity to identify the exact source (e.g., administrative data, project statistics).
Rationale and Context: Forced Displacement, Refugees, and FCV
In Fragility, Conflict, and Violence (FCV) settings, tracking how datasets are used is crucial for coordinating development and humanitarian aid. Research on forced displacement and refugee integration relies heavily on specific household surveys, operational registries, and geographic vulnerability datasets.
By automating the extraction of these references from project documents, appraisal papers, and academic studies, this model helps map data usage, highlight under-analyzed areas, and evaluate the policy impact of statistical capacity investments.
Data Sources & Domain Coverage
The model is specialized in the socio-economic development and forced displacement domains, with strong representation of:
- Humanitarian Registries & Briefs: UNHCR registration databases (PRIMES), Refugee Socio-Economic Inclusion Surveys (SEIS), Durable Solutions reports, and Protection Monitoring tools.
- Development Economics & Surveys: World Bank Project Appraisal Documents (PADs), Living Standards Measurement Study (LSMS), Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and national censuses.
- FCV/Geospatial Data: Livelihood surveys, cash-based intervention tracking, and geographic data (e.g., Shuttle Radar Topography Mission, flood hazard mapping, population distribution layers).
Model Performance
The adapter was evaluated on the canonical layout-aware, project-purged Holdout v10 dataset (flat_ner_holdout_v10.jsonl / ai4data/datause-holdout) at a confidence threshold of 0.40 (Jaccard matching threshold = 0.50):
| Evaluation Set | TP | FP | FN | Precision | Recall | F0.5 Score |
|---|---|---|---|---|---|---|
| Positive-Only Records (465 chunks w/ mentions) | 576 | 51 | 152 | 91.9% | 79.1% | 0.8900 |
| All Records (Full set of 1,149 chunks) | 576 | 136 | 152 | 80.9% | 79.1% | 0.8054 |
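The precision, recall, and F0.5 values in the table follow directly from the TP, FP, and FN counts. The sketch below recomputes them. It also shows one way a predicted span could be scored against a gold span with a Jaccard overlap, assuming lowercased whitespace tokens; the evaluation description does not say how the overlap is tokenized, so treat that part as an assumption. The helper names `prf_beta` and `token_jaccard` are illustrative and are not taken from the project's evaluation code.

```python
def prf_beta(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Return (precision, recall, F-beta) from raw match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


def token_jaccard(pred: str, gold: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (assumed matching scheme)."""
    a, b = set(pred.lower().split()), set(gold.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Counts copied from the table above.
rows = {
    "positive_only": (576, 51, 152),
    "all_records": (576, 136, 152),
}
for name, (tp, fp, fn) in rows.items():
    p, r, f = prf_beta(tp, fp, fn)
    print(f"{name}: precision={p:.1%} recall={r:.1%} F0.5={f:.4f}")
```

With beta = 0.5, precision is weighted more heavily than recall, which is why the extra false positives in the full set lower the score more than an equal number of misses would.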
How to Use
You can use this model either through the high-level ai4data library wrappers or directly through the gliner2 library interface.
Option 1: Using the ai4data Library (Recommended)
The ai4data Python package automatically handles base model initialization, adapter downloads, token chunking, and post-filtering:
from ai4data import extract_from_text
text = (
    "To analyze the impact of infrastructure spillovers, we combine data from the "
    "2010 Ghana Living Standards Survey (GLSS) with production records for 17 "
    "large-scale gold mines."
)
# Extract dataset mentions using this specific adapter
result = extract_from_text(
    text,
    adapter_id="ai4data/datause-extraction",
    include_confidence=True
)
for ds in result.get("datasets", []):
    print(f"Dataset: {ds['dataset_name']}")
    print(f"Confidence: {ds['dataset_confidence']:.3f}")
    print(f"Section: {ds['section_context']}")
    print("-" * 30)
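Chunking is needed when a document is longer than the model can process in one pass. As a rough illustration of the idea, the sketch below splits text into overlapping windows of whitespace tokens, so that a mention straddling a window boundary still appears whole in at least one chunk. The function name, window size, and overlap are hypothetical; this is not the ai4data implementation, which may tokenize and split differently.

```python
def chunk_tokens(text: str, max_tokens: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of whitespace tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    if not tokens:
        return []
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Tiny demonstration: 10 tokens, windows of 4, overlap of 1.
demo = " ".join(f"t{i}" for i in range(10))
for chunk in chunk_tokens(demo, max_tokens=4, overlap=1):
    print(chunk)
```

Each chunk can then be passed to the extractor separately, with duplicate mentions from the overlapping regions merged afterwards.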
Option 2: Using the Raw gliner2 Interface
If you are using the raw weights directly as a LoRA adapter, you must first load the base model (fastino/gliner2-large-v1) and then apply the adapter:
import torch
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download
# 1. Initialize base model
kwargs = {}
if torch.cuda.is_available():
    kwargs["map_location"] = "cuda"
elif torch.backends.mps.is_available():
    kwargs["map_location"] = "mps"
else:
    kwargs["map_location"] = "cpu"
model = GLiNER2.from_pretrained("fastino/gliner2-large-v1", **kwargs)
# 2. Download and apply the LoRA adapter weights
adapter_path = snapshot_download("ai4data/datause-extraction")
model.load_adapter(adapter_path)
# 3. Perform inference
text = (
    "To analyze the impact of infrastructure spillovers, we combine data from the "
    "2010 Ghana Living Standards Survey (GLSS) with production records for 17 "
    "large-scale gold mines."
)
labels = ["named_data", "descriptive_data", "vague_data"]
predictions = model.predict_entities(text, labels, threshold=0.40)
for entity in predictions:
    print(f"Text: {entity['text']} | Label: {entity['label']} | Score: {entity['score']:.3f}")
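When the same document is processed in several passes, the same mention can appear more than once. The sketch below shows one possible way to deduplicate predictions and group them by label. It assumes each prediction is a dict with `text`, `label`, and `score` keys, as the loop above expects. The helper name `group_mentions` is hypothetical, and the sample scores are invented for illustration rather than produced by the model.

```python
from collections import defaultdict


def group_mentions(predictions, min_score=0.40):
    """Keep the best-scoring prediction per (label, text) and group texts by label."""
    best = {}
    for ent in predictions:
        if ent["score"] < min_score:
            continue  # drop low-confidence predictions
        key = (ent["label"], ent["text"].strip().lower())
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    grouped = defaultdict(list)
    for ent in sorted(best.values(), key=lambda e: e["score"], reverse=True):
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Illustrative input only: these scores are invented, not model output.
sample = [
    {"text": "Ghana Living Standards Survey", "label": "named_data", "score": 0.93},
    {"text": "ghana living standards survey", "label": "named_data", "score": 0.71},
    {"text": "production records", "label": "vague_data", "score": 0.52},
    {"text": "household data", "label": "vague_data", "score": 0.21},
]
print(group_mentions(sample))
```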