UA-News
Dataset Summary
Ukrainian News (UA-News) is a collection of more than 150 thousand news articles gathered from more than 20 news resources. Dataset samples are divided into 5 categories: політика (politics), спорт (sports), новини (news), бізнес (business), and технології (technology). The dataset is provided by the non-profit student organization FIdo.ai (the machine learning research division of FIdo, National University of Kyiv-Mohyla Academy) for research purposes in data mining (classification, clustering, keyword extraction, etc.).
Dataset development is still in progress.
Dataset Structure
Parameters:
- root: Directory path
- download: Whether to download data
- split: Which split of the data to load (train or test)
- return_tags: Whether to return text keywords
Splits:
- Train:
  - File size: 324 MB
  - Number of samples: 120417
  - Target distribution:
    - політика: 40364 (33.5%)
    - спорт: 40364 (33.5%)
    - новини: 40364 (33.5%)
    - бізнес: 40364 (33.5%)
    - технології: 40364 (33.5%)
- Test:
  - File size: 81 MB
  - Number of samples: 30105
  - Target distribution:
    - політика: 40364 (33.5%)
    - спорт: 40364 (33.5%)
    - новини: 40364 (33.5%)
    - бізнес: 40364 (33.5%)
    - технології: 40364 (33.5%)
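The per-class counts and percentages of a split can be recomputed directly from its samples. The following is a minimal sketch, assuming each sample is a dict with a `target` field holding one of the five category labels. The helper name `target_distribution` and the toy records are hypothetical and are not part of the dataset or its API.

```python
from collections import Counter


def target_distribution(samples):
    """Return {label: (count, percent)} for an iterable of sample dicts."""
    counts = Counter(sample["target"] for sample in samples)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


# Toy records for illustration only; they are not taken from the dataset.
toy_samples = [
    {"target": "спорт"},
    {"target": "спорт"},
    {"target": "новини"},
    {"target": "бізнес"},
]

for label, (count, percent) in target_distribution(toy_samples).items():
    print(f"{label}: {count} ({percent}%)")
```

Running the same helper over a full split gives a quick check that the counts sum to the number of samples in that split.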
Data sample:
{
    "title": 'На Донеччині зафіксували сьомий випадок коронавірусу',
    "text": 'Про це повідомив голова Донецької ОДА Павло Кириленко в Facebook ...',
    "tags": ['Донецька область', 'COVID-19', 'Новини'],
    "target": 'новини'
}
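The record layout above can be checked programmatically before feeding samples into a pipeline. This is a minimal sketch, assuming samples are plain dicts with the four fields shown and that `target` is always one of the five categories named in the summary. The names `CATEGORIES` and `validate_sample` are hypothetical helpers, not part of the library.

```python
# The five category labels listed in the dataset summary.
CATEGORIES = {"політика", "спорт", "новини", "бізнес", "технології"}


def validate_sample(sample):
    """Return a list of problems found in one sample dict (empty if none)."""
    problems = []
    for key in ("title", "text", "target"):
        if not isinstance(sample.get(key), str):
            problems.append(f"'{key}' should be a string")
    tags = sample.get("tags")
    if not (isinstance(tags, list) and all(isinstance(t, str) for t in tags)):
        problems.append("'tags' should be a list of strings")
    if sample.get("target") not in CATEGORIES:
        problems.append("'target' is not one of the five categories")
    return problems


# The record from the data sample; the text stays truncated as in the original.
sample = {
    "title": "На Донеччині зафіксували сьомий випадок коронавірусу",
    "text": "Про це повідомив голова Донецької ОДА Павло Кириленко в Facebook ...",
    "tags": ["Донецька область", "COVID-19", "Новини"],
    "target": "новини",
}
print(validate_sample(sample))
```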
Example of usage
Our API
from ua_datasets import NewsClassificationDataset

# Load the training split; return_tags=True also yields the keyword tags
train_data = NewsClassificationDataset(root='data/', split='train', return_tags=True)
for title, text, tags, target in train_data:
    print(title, text, tags, target)
Hugging Face 🤗 API
from datasets import load_dataset

dataset = load_dataset("FIdo-AI/ua-news")
for item in dataset["train"]:
    title, text, tags, target = item["title"], item["text"], item["tags"], item["target"]
    print("Title: " + title)
    print("Text: " + text)
    # tags is a list of keywords, so join it before concatenating with a string
    print("Tags: " + ", ".join(tags))
    print("Target: " + target)
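For classification experiments, the string targets usually need to be mapped to integer class ids. Here is a minimal sketch, assuming the five category names from the summary and samples shaped like the dicts yielded above. The ordering of `LABELS`, the helper `to_xy`, and the choice to concatenate title and text are illustrative assumptions, not part of either API.

```python
# Category names from the dataset summary; this ordering is an arbitrary choice.
LABELS = ["політика", "спорт", "новини", "бізнес", "технології"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}


def to_xy(items):
    """Split an iterable of sample dicts into input texts and integer class ids."""
    texts, ids = [], []
    for item in items:
        # Joining title and text is one possible input format, assumed here.
        texts.append(item["title"] + "\n" + item["text"])
        ids.append(LABEL_TO_ID[item["target"]])
    return texts, ids


# Toy records for illustration only; they are not taken from the dataset.
toy_items = [
    {"title": "t1", "text": "a", "target": "спорт"},
    {"title": "t2", "text": "b", "target": "технології"},
]
texts, ids = to_xy(toy_items)
print(ids)
```

The same function could be applied to `dataset["train"]` from the Hugging Face example, since each item there exposes the same `title`, `text`, and `target` keys.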