Table 1 Benchmarking of text corpora used for Doc2Vec training

From: From free text to clusters of content in health records: an unsupervised graph partitioning approach

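| Window Size | Minimum Count | Subsampling | NRLS 1M | NRLS 2M | NRLS 13M+ | Wikipedia 5M+ |
|---|---|---|---|---|---|---|
| 15 | 5 | 0.001 | 765 | 755 | 836 | 531 |
| 5 | 5 | 0.001 | 807 | 775 | 798 | 580 |
| 5 | 20 | 0.001 | 801 | 785 | 809 | 587 |
| 5 | 20 | 0.00001 | - | - | 379 | 465 |
| 15 | 20 | 0.00001 | - | - | 387 | 424 |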

  1. A Doc2Vec model was trained on three corpora of NRLS records of different sizes and a corpus of Wikipedia articles using a variety of hyper-parameters. The scores represent the quality of the vectors inferred using the corresponding model, i.e., the number of correct assignments out of 1500. Boldface identifies the best computational result
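
Because each score is a count of correct assignments out of 1500, it can be read as an accuracy. The following is a minimal sketch of that arithmetic, not part of the original study: the names `SCORES`, `accuracy`, and `best_configuration` are hypothetical, the corpus labels combine the table's column headings, and a "-" entry is treated here as "no score reported", which is an assumption.

```python
# Scores copied from Table 1: number of correct assignments out of 1500.
TOTAL = 1500

# Keys are (window size, minimum count, subsampling).
# None stands in for the "-" entries (assumed to mean "no score reported").
SCORES = {
    (15, 5, 1e-3): {"NRLS 1M": 765, "NRLS 2M": 755, "NRLS 13M+": 836, "Wikipedia 5M+": 531},
    (5, 5, 1e-3): {"NRLS 1M": 807, "NRLS 2M": 775, "NRLS 13M+": 798, "Wikipedia 5M+": 580},
    (5, 20, 1e-3): {"NRLS 1M": 801, "NRLS 2M": 785, "NRLS 13M+": 809, "Wikipedia 5M+": 587},
    (5, 20, 1e-5): {"NRLS 1M": None, "NRLS 2M": None, "NRLS 13M+": 379, "Wikipedia 5M+": 465},
    (15, 20, 1e-5): {"NRLS 1M": None, "NRLS 2M": None, "NRLS 13M+": 387, "Wikipedia 5M+": 424},
}


def accuracy(score, total=TOTAL):
    """Convert a raw score into a fraction of correct assignments."""
    return None if score is None else score / total


def best_configuration(scores):
    """Return (hyper-parameters, corpus, score) for the highest reported score."""
    candidates = [
        (score, params, corpus)
        for params, row in scores.items()
        for corpus, score in row.items()
        if score is not None
    ]
    score, params, corpus = max(candidates, key=lambda item: item[0])
    return params, corpus, score


if __name__ == "__main__":
    params, corpus, score = best_configuration(SCORES)
    print(params, corpus, score, f"{accuracy(score):.1%}")
```

Under these assumptions, the highest score in the table is 836 (NRLS 13M+, window size 15, minimum count 5, subsampling 0.001), which corresponds to roughly 55.7% of the 1500 assignments.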