UA-datasets¤
Unified, lightweight access to Ukrainian NLP benchmark datasets (QA, Text Classification, POS tagging) with automatic download, caching and consistent iteration.
UA-datasets is maintained by FIdo.ai (machine learning research division of the non-profit student organization FIdo at the National University of Kyiv-Mohyla Academy) for research purposes.
Features at a glance¤
| Capability | Description |
|---|---|
| Unified API | len(ds), indexing, iteration across all datasets |
| Resilient downloads | Retries, integrity / basic validation, fallback filenames (UA-SQuAD val) |
| Minimal deps | Core loaders rely only on the standard library |
| Consistent samples | QA: HF-style dict (id, title, context, question, answers, is_impossible); Classification (title, text, label, tags?); POS (tokens, tags) |
| Frequency helpers | Simple methods for label/answer frequency analysis |
| Ready for tooling | Works seamlessly with uv, ruff, mypy, pytest, pre-commit |
Available Datasets¤
| Task | Dataset | Class | Splits | Notes |
|---|---|---|---|---|
| Question Answering | UA-SQuAD | UaSquadDataset |
train, val |
SQuAD-style JSON; legacy val filename fallbacks |
| Text Classification | UA-News | NewsClassificationDataset |
train, test |
CSV (title,text,target[,tags]); optional tag parsing |
| POS Tagging | Mova Institute POS | MovaInstitutePOSDataset |
corpus | CoNLL-U like format; yields (tokens, tags) |
Quick Start¤
from pathlib import Path
from ua_datasets.question_answering import UaSquadDataset
ds = UaSquadDataset(root=Path("./data/ua_squad"), split="train", download=True)
print(f"Examples: {len(ds)}")
ex = ds[0]
print(ex["question"], "->", ex["answers"]["text"]) # list of answers (possibly empty if impossible)
Text classification:
from ua_datasets.text_classification import NewsClassificationDataset
news = NewsClassificationDataset(root=Path("./data/ua_news"), split="train", download=True)
title, text, label, tags = news[0]
POS tagging:
from ua_datasets.token_classification import MovaInstitutePOSDataset
pos = MovaInstitutePOSDataset(root=Path("./data/mova_pos"), download=True)
tokens, tags = pos[0]
Installation¤
Choose one method:
Using uv (recommended)¤
uv add ua-datasets
Via pip¤
pip install ua_datasets
From source (editable)¤
git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
pip install -e .
Benchmarks & Acknowledgements¤
- Benchmarks: See Benchmarks for leaderboard scaffolding.
- Acknowledgements: See Acknowledgements for dataset contributors.
Citation¤
If you found this library useful in academic research, please cite:
@software{ua_datasets_2021,
author = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and Shpihanovych, Vladyslav},
month = oct,
title = {ua_datasets: a collection of Ukrainian language datasets},
url = {https://github.com/fido-ai/ua-datasets},
version = {1.0.0},
year = {2021}
}
⭐ Consider starring the project on GitHub to support visibility.