| | No. of tweets | Alpha | Acc | F1(A) | F1(U) |
|---|---|---|---|---|---|
| Self-agreement | 5,981 | 0.79 | 0.88 | 0.92 | 0.87 |
| Inter-annotator agreement | 53,831 | 0.60 | 0.79 | 0.85 | 0.75 |
| Classification model | | | | | |
| Training set | 50,000 | 0.61 | 0.80 | 0.85 | 0.77 |
| Evaluation set | 10,000 | 0.57 | 0.80 | 0.86 | 0.71 |
- Three measures are used: ordinal Krippendorff's Alpha and accuracy (Acc), computed over both classes, and \(F_{1}\) for the acceptable (A) and unacceptable (U) classes. The first row gives the self-agreement of individual annotators, and the second row the inter-annotator agreement between different annotators. The last two rows give the evaluation results of the classification model, on the training set (by cross-validation) and on the out-of-sample evaluation set, respectively. Note that the model performance is comparable to the inter-annotator agreement.
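For reference, the sketch below shows one way to compute these three measures for a set of gold labels and model predictions. It is a minimal illustration, not the authors' pipeline: it assumes binary labels (1 = acceptable, 0 = unacceptable) and uses the `krippendorff` PyPI package for ordinal Alpha and `scikit-learn` for accuracy and per-class \(F_{1}\); the label arrays are hypothetical.

```python
# Minimal sketch (not the authors' code) of the three reported measures:
# ordinal Krippendorff's Alpha, accuracy, and per-class F1.
# Label encoding assumed here: 1 = acceptable (A), 0 = unacceptable (U).
import krippendorff  # pip install krippendorff
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and model predictions for a handful of tweets.
gold = [1, 1, 0, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Alpha treats the two label sources as two annotators of the same units;
# level_of_measurement="ordinal" matches the ordinal Alpha in the table.
alpha = krippendorff.alpha(reliability_data=[gold, pred],
                           level_of_measurement="ordinal")

acc = accuracy_score(gold, pred)
f1_a = f1_score(gold, pred, pos_label=1)  # F1(A): acceptable class
f1_u = f1_score(gold, pred, pos_label=0)  # F1(U): unacceptable class

print(f"Alpha={alpha:.2f}  Acc={acc:.2f}  F1(A)={f1_a:.2f}  F1(U)={f1_u:.2f}")
```

The same computation applies to each row of the table: for self- and inter-annotator agreement the two label sequences come from human annotators, while for the model rows one sequence is the gold annotation and the other the model's predictions.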