Table 1 Benchmarking of text corpora used for Doc2Vec training

From: From free text to clusters of content in health records: an unsupervised graph partitioning approach

Hyper-parameters                           NRLS                    Wikipedia
Window Size  Minimum Count  Subsampling    1M      2M      13M+    5M+
15           5              0.001          765     755     836     531
5            5              0.001          807     775     798     580
5            20             0.001          801     785     809     587
5            20             0.00001        -       -       379     465
15           20             0.00001        -       -       387     424

A Doc2Vec model was trained on three corpora of NRLS records of different sizes and on a corpus of Wikipedia articles, using a variety of hyper-parameters. The scores represent the quality of the vectors inferred using the corresponding model, i.e., the number of correct assignments out of 1500. Boldface identifies the best computational result.
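
Because each score in Table 1 is a count of correct assignments out of 1500, it can be converted to an accuracy by simple division. The following is a minimal Python sketch of that arithmetic, assuming that a dash in the table means no score was reported (represented here as `None`). The names `scores`, `accuracy`, and `best_configuration` are illustrative and do not come from the paper.

```python
# Scores from Table 1: number of correct assignments out of 1500.
# Assumption: a dash in the table means "not reported", stored here as None.
TOTAL = 1500

# Keys are (window size, minimum count, subsampling); values map corpus -> score.
scores = {
    (15, 5, 0.001): {"NRLS 1M": 765, "NRLS 2M": 755, "NRLS 13M+": 836, "Wikipedia 5M+": 531},
    (5, 5, 0.001): {"NRLS 1M": 807, "NRLS 2M": 775, "NRLS 13M+": 798, "Wikipedia 5M+": 580},
    (5, 20, 0.001): {"NRLS 1M": 801, "NRLS 2M": 785, "NRLS 13M+": 809, "Wikipedia 5M+": 587},
    (5, 20, 0.00001): {"NRLS 1M": None, "NRLS 2M": None, "NRLS 13M+": 379, "Wikipedia 5M+": 465},
    (15, 20, 0.00001): {"NRLS 1M": None, "NRLS 2M": None, "NRLS 13M+": 387, "Wikipedia 5M+": 424},
}


def accuracy(score, total=TOTAL):
    """Convert a raw score into a fraction of correct assignments."""
    return None if score is None else score / total


def best_configuration(table):
    """Return (hyper-parameters, corpus, score) for the highest reported score."""
    candidates = [
        (score, params, corpus)
        for params, row in table.items()
        for corpus, score in row.items()
        if score is not None  # skip unreported entries
    ]
    score, params, corpus = max(candidates, key=lambda c: c[0])
    return params, corpus, score


params, corpus, score = best_configuration(scores)
print(params, corpus, score, f"{accuracy(score):.1%}")
# prints: (15, 5, 0.001) NRLS 13M+ 836 55.7%
```

Under these assumptions, the highest score in the table is 836 out of 1500 (about 55.7%), obtained on the 13M+ NRLS corpus with window size 15, minimum count 5, and subsampling 0.001.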