Home
UA-datasets in a nutshell¤
UA-datasets is a collection of Ukrainian language datasets. Our aim is to build a benchmark for research related to natural language processing in Ukrainian.
This library is provided by FIdo.ai (machine learning research division of the non-profit student's organization FIdo, National University of Kyiv-Mohyla Academy) for research purposes.
Availabel datasets¤
- Question answering: UA-SQuAD
- Text classification: UA-News
- Part-of-speech tagging: Mova Institute Part of Speech Dataset
Installation¤
The library can be installed from PyPi in your virtual environment (e.g. venv, conda env)
pip install ua_datasets
Quick example¤
from ua_datasets import UaSquadDataset
qa_dataset = UaSquadDataset("data/", download=True)
for question, context, answer in qa_dataset:
print("Question: " + question)
print("Context: " + context)
print("Answer: " + answer)
Citation¤
If you found this library useful in academic research, please cite:
@software{ua_datasets_2021,
author = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and
Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and
Shpihanovych, Vladyslav},
month = oct,
title = {ua_datasets: a collection of Ukrainian language datasets},
url = {https://github.com/fido-ai/ua-datasets},
version = {0.0.1},
year = {2021}
}
(Also consider starring the project on GitHub!)