Table 1 The roles of different subsets of the 2018–2020 Slovenian Twitter dataset

From: Evolution of topics and hate speech in retweet network communities

Dataset          Period               No. of tweets  Role
All tweets       Jan. 2018–Dec. 2020  12,961,136     Collection, hate speech classification and topic detection
Original tweets  Jan. 2018–Dec. 2020  8,363,271      Hate speech modeling
Retweets         Jan. 2018–Dec. 2020  4,597,865      Network construction and community detection
Training set     Dec. 2017–Jan. 2020  50,000         Hate speech model training and cross validation
Evaluation set   Feb. 2020–Aug. 2020  10,000         Hate speech model evaluation
Of the almost 13 million tweets collected, a sample of the original tweets is used for hate speech annotation and for training and evaluating the classification models. The retweets are used to construct retweet networks and to detect communities. All tweets are automatically classified by the hate speech classification model and are used to detect topics.
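The split described in the table can be illustrated with a minimal sketch. It assumes a hypothetical record layout (the `user` and `retweet_of_user` fields are invented for illustration and are not the dataset's schema), and it assumes that retweet edges point from the retweeted author to the retweeting user. It covers only the partition into original tweets and retweets and the counting of weighted retweet edges; community detection, hate speech classification, and topic detection are not shown.

```python
from collections import Counter

# Hypothetical toy records; the field names are assumptions for illustration,
# not the schema of the Slovenian Twitter dataset.
tweets = [
    {"id": 1, "user": "a", "retweet_of_user": None},
    {"id": 2, "user": "b", "retweet_of_user": "a"},
    {"id": 3, "user": "c", "retweet_of_user": "a"},
    {"id": 4, "user": "b", "retweet_of_user": "a"},
    {"id": 5, "user": "c", "retweet_of_user": None},
]


def split_tweets(records):
    """Partition records into original tweets and retweets."""
    originals = [t for t in records if t["retweet_of_user"] is None]
    retweets = [t for t in records if t["retweet_of_user"] is not None]
    return originals, retweets


def retweet_edges(retweets):
    """Count directed edges (retweeted author, retweeting user).

    The edge direction is an assumption; the weight is the number of retweets.
    """
    return Counter((t["retweet_of_user"], t["user"]) for t in retweets)


originals, retweets = split_tweets(tweets)
edges = retweet_edges(retweets)

# The two subsets partition the collection, as in the table:
# 8,363,271 original tweets + 4,597,865 retweets = 12,961,136 tweets.
assert len(originals) + len(retweets) == len(tweets)
assert 8_363_271 + 4_597_865 == 12_961_136
```

The resulting edge counts could then be passed to a graph library for community detection; which algorithm to apply is left open here.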