Improving topic modeling through homophily for legal documents

Topic modeling that can automatically assign topics to legal documents is very important in the domain of computational law. The relevance of the modeled topics strongly depends on the legal context they are used in. At the same time, references to laws and prior cases are key elements for judges to rule on a case. Taken together, these references form a network whose structure can be analysed with network analysis. However, the content of the referenced documents may not always be accessible. Even in that case, the reference structure itself shows that documents share latent similar characteristics. We propose to use this latent structure to improve topic modeling of law cases through document homophily. In this paper, we explore the use of homophily networks extracted from two types of references, prior cases and statute laws, to enhance topic modeling on legal case documents. We conduct a detailed analysis of a dataset of rich legal cases, the COLIEE dataset, to create these networks. The homophily networks consist of nodes for legal cases and weighted edges for the two families of references between the case nodes. We further propose models that use the edge weights for topic modeling. In particular, we propose a cutting model and a weighting model to improve the relational topic model (RTM). The cutting model uses edges with weights higher than a threshold as document links in RTM; the weighting model uses the edge weights to weight the link probability function in RTM. The weights can be obtained either from the co-citations or from the cosine similarity based on an embedding of the homophily networks. Experiments show that using the homophily networks for topic modeling significantly outperforms previous studies, and that the weighting model is more effective than the cutting model.

In computational law, topic modeling is one key task for many higher-level goals such as law document retrieval, law document clustering for exploration, or the comparison of legal documents (Katz et al. 2011; Lu et al. 2011; O'Neill et al. 2016; Wang et al. 2017; Yoshioka et al. 2018). To date, LDA is still the favored model for topic modeling, but it does not take into account the larger context of laws and cited cases. Ambiguity may still arise in the topics because law is a specific domain: different real-world contexts can lead to the breach of the same law, while the same real-world context can breach different laws, meaning that the relevance of the modeled topics strongly depends on the legal context they are used in (Kanapala et al. 2019). In judgement cases, the citations to preexisting cases and laws are as important as the content to render a decision, similarly to scientific work.
Academic evaluation has frequently used citation analysis (Hirsch 2005), despite controversies that have been raised even recently (Renoust et al. 2017). Using scientific citation networks has also improved the quality of topic modeling (Chang and Blei 2009). As with academic citation networks, the structure of legal cases often follows a directed acyclic graph (DAG). The structure of these DAGs may however differ. While academic citation networks are often deep, with many authors citing one another and even self-citations, legal citation networks tend to be shallower. This results in a flat organization with few cross-references between cases but a clear emphasis on precedent cases forming jurisprudence (Pelc 2014). As a consequence, the very shape of networks built from legal citations shows a different topology from those built from academic citations.
In addition, since the content of the cited documents is not always available, network analysis then relies on the investigation of co-citation patterns (Kim 2013; Khanam and Wagh 2017). This is sometimes referred to as homophily (McPherson et al. 2001; Borgatti et al. 2009): the property of similar entities to agglomerate, and the similarity implied between two linked entities. Homophily always corresponds to a bipartite structure, which can be projected into a single-type network. Topic modeling is one approach to using homophily in the projections (Renoust et al. 2014). In the context of computational law, retrieval and topic modeling give rise to open challenges and public datasets such as the COLIEE data (Yoshioka et al. 2018). The COLIEE dataset provides a testbed for legal information extraction and entailment. It provides over 6k cases from the Canadian Federal Court spanning about 40 years, with very rich annotations including, among many different entities, citations to past cases, rulings, and laws.
Our work contributes a methodology for topic modeling of legal documents when the content of cited documents is not available. We propose to automatically build networks from the cases of the COLIEE dataset. After analyzing the dataset, we construct a homophily network consisting of nodes for legal cases and weighted edges for the references. There are two major types of citations: the first refers to prior cases, the other to statute laws. We use these two types of citations to explore the similarity of cases by constructing homophily relationships between them, and we then use case homophily to improve topic modeling for legal cases. In particular, we build on the relational topic model (RTM) (Chang and Blei 2009), which uses the links between documents during topic modeling. We compare different strategies for using the edge weights in the homophily network as link information for RTM.
This work is an invited extension of the original presentation (Ashihara et al. 2019). In this paper, we extend our previous work as follows:
• We improve the strategy of using homophily relationships for topic modeling. Previously, we only set a threshold on the edge weights in the homophily network to decide whether a link should be used for RTM, which we call the cutting model. In this paper, we also propose to use the weights to weight the link probability in RTM, which we call the weighting model. Experiments show that our weighting model significantly outperforms our previous cutting model.
• In addition to the product or the sum of prior case and statute information, we propose a list of weighting methods based on fuzzy logic aggregation (Detyniecki et al. 2000) for the weighting model, showing coherence scores similar to the simple weighting model and better than the previous cutting model.
• We also investigate the use of a kernel such as Node2vec (Grover and Leskovec 2016) to embed the homophily network in a low-dimensional space. Experiments show that our Node2vec model also significantly outperforms our previous cutting model.
• We analyze the topic words of the best performing topic model in detail, and verify the effectiveness of the proposed models for legal case topic modeling.
The remainder of the paper is organized as follows: Sect. 2 presents the related work; Sect. 3 introduces the COLIEE dataset (Yoshioka et al. 2018) along with its characteristics; Sect. 4 presents homophily network modeling for this dataset; Sect. 5 describes topic modeling for legal cases and our proposed models for improving RTM using homophily networks. We report experiments and results in Sect. 6 and conclude the paper in Sect. 7.

Related work
In this section, we present related work on both legislation networks and topic modeling, and discuss how our work differs from it.

Legislation networks
The interest in network analysis for legal documents has been increasing significantly in recent years. Many previous studies have shown that the analysis of legal networks is closely related to complex networks (Fowler et al. 2007; Kim 2013; Pelc 2014; Koniaris et al. 2017; Khanam and Wagh 2017; Lettieri et al. 2018; Lee et al. 2019). Fowler et al. (2007) developed a centrality measure based on Authorities and Hubs (Kleinberg 1999), dedicated to citations of cases in the US Supreme Court, over a citation network of more than 26k cases. In order to find complex network properties and homophily behavior in a treaty network, Kim (2013) explored a structure consisting of 1k citations for 747 treaties. Pelc (2014) investigated the fundamental concept of precedent, i.e., previous deliberations being cited in cases, in international commercial cases. They also conducted a centrality study of Authorities and Hubs, which confirms that the network structure is relevant to predicting case outcomes. From the Official Journal of the European Union, Koniaris et al. (2017) built a law reference network and showed that it has the temporal evolution and multi-scale structure of multilayer complex networks. With betweenness centrality, Khanam and Wagh (2017) proposed a citation analysis of judgements in Indian courts. The EUCaseNet project (Lettieri et al. 2018) should also be underlined: it combines network analysis and centrality-based visualization to explore the entire EU case law corpus. Lee et al. (2019) explored court decision versus constitution article patterns in Korea, conducting topic analysis on the main clusters. Because we investigate the Federal Court of Canada case law network in this paper, our target is close to these studies. We investigate homophily in our analysis, which has been illustrated by all of the studies above.
Although applying topic modeling for legislation networks is not new, we take the further step of using network analysis to improve topic modeling. Different from previous studies, we improve topic modeling with the case co-citation structure, and feed back to homophily of documents in ways of topic proximity.

Topic modeling
Latent Dirichlet allocation (LDA), introduced by Blei et al. (2003), is the seminal topic model. As a graphical model, LDA can learn from observed documents to infer hidden word-topic and document-topic distributions. We describe LDA in detail in Sect. 5. The correlated topic model, proposed by Blei and Lafferty (2007), models topic occurrences using the logistic normal distribution for LDA. The dynamic topic model proposed by Blei et al. (2006) models temporal information in sequence data. Most topic models are unsupervised, but supervised topic modeling has been studied too. The supervised LDA proposed by Blei and McAuliffe (2007) can model topics of responses and documents. Supervised LDA is suitable for data such as product reviews, which have both evaluation scores and corresponding product descriptions. Ideal point topic models, proposed by Nguyen et al. (2015), assume that the responses are also hidden. RTM models the topics of document pairs that share links, e.g., references (Chang and Blei 2009); we describe RTM in detail in Sect. 5. Collaborative topic models, proposed by Wang and Blei (2011), can make recommendations for user preferences using user data. Recent studies have also tried to bridge topic models with text representation methods based on word embeddings. For example, Das et al. (2015) modeled topics with distributions over word embeddings instead of word types and showed that the proposed model is more robust for handling out-of-vocabulary words; similarly, Dieng et al. (2019) developed an embedded topic model that models words with categorical distributions over word embeddings and topic embeddings. We think that this can be an interesting direction for our future work.
In the context of joint network and topic modeling, Liu et al. (2009) proposed a framework to perform LDA-based topic modeling and author community discovery simultaneously; Zhu et al. (2013) proposed a mixed-topic link model for joint topic modeling and link prediction; Brochier et al. (2020) proposed a topic-word attention mechanism to generate document network embeddings via the interaction between topic and word embeddings. Different from previous studies, in this paper, we apply RTM for legal case analysis and improve it via co-citation homophily networks. To the best of our knowledge, this is the first work that utilizes co-citation homophily networks for topic modeling.

Data
We collect our data from the Competition on Legal Information Extraction/Entailment (COLIEE 2018) (Yoshioka et al. 2018). For our task, we study the Case Law Competition Data Corpus, which has also been used in Tasks 1 and 2 of COLIEE 2018. The data consists of 6154 cases from the Federal Court of Canada over a period of approximately 40 years, ranging from 1974 to 2016; note that most cases in the corpus date after 1986. This corpus is very rich. Each case is a textual document containing multiple parts, including a summary of the court, the case content, references to relevant past cases and statutes, rulings, counsels, legal topics of interest, solicitors, miscellaneous information, and important facts.
In this paper, we only focus on the prior cases and statutes noticed to form our networks. In the text input, they are delimited by the following paragraph titles:
• Cases Noticed: past trials that are relevant to this trial.
• Statutes Noticed: laws referred to in order to give the verdict of the trial.
Each reference occupies one line. Recall that, as these are Canadian cases, they may be written in both English and French. We found that only 5576 cases among all the cases refer to prior cases or statutes noticed.
The reference destination is always very detailed, and references can point to specific paragraphs or chapters. If we directly used these as basic units for network modeling, only a small number of references would be shared across cases, making the network very sparse. Therefore, we consider references at the level of the full case or statute article. Each case or statute is identified by a year, a title, and references. Parsing is conducted by looking for the year structure, and titles at this higher granularity become nodes (Table 1). Cases without year information are discarded; in total, these correspond to 39 cited cases. We also store the year information along with the nodes.

Legal network
In this section, we first describe the network model in Sect. 4.1. Second, we provide details about the construction of the underlying homophily network in Sect. 4.2. Finally, in Sect. 4.3 we describe the use of Node2vec to embed the homophily network in a low-dimensional space.

Network structure
In our data, each case refers to a set of prior cases and statutes noticed. In our network G = (V, E), every case is represented as a node v_1 ∈ V, and a prior case or statute noticed is represented as another node v_2 ∈ V. We treat each citation as a link e = (v_1, v_2) ∈ E. Figure 1 shows an overview of our network modeling. Our initial set contains |C| = 5539 cases. Each case c ∈ C may refer to several prior cases p ∈ P, where |P| = 25,112 in total, and may also reference statutes s ∈ S, where |S| = 1288. Citations to cases can date back to the eighteenth century; note that the year information cannot be guaranteed to be reliable for cases before the eighteenth century, which amount to only 78 cases.
With the above network modeling, our network includes 31,976 nodes, and 53,554 links. Note that reliable year information is only available for 29,319 nodes. We can separate the network into two sub-networks. The first, G P , is constructed using only the cases and their cited prior cases, consisting of 29,952 nodes with 53,554 edges. The second, G S , only considers the cases and their cited statutes, consisting of 4441 nodes with 6453 edges.
These two networks present one main connected component G consisting of 30,456 nodes with 52,453 links, which covers most nodes and edges. The node/link number for G P is 27,353/44,871, and 4125/6150 for G S . We further investigate the possibility of looking for case communities from these networks via Louvain clustering (Blondel et al. 2008) and modularity (Newman 2006). The main components of G, G P , and G S show a modularity Q G = 0.739 with 34 communities, a modularity Q G P = 0.762 for 45 communities, and a modularity Q G S = 0.747 for 27 communities, respectively. This is illustrated in Fig. 2.
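To make the community analysis above concrete, the following is a minimal, stdlib-only sketch of Newman's modularity Q, the quantity that Louvain clustering optimizes. The graph, the partition, and all names here are toy illustrations, not the paper's data; in practice one would use a library implementation (e.g. networkx with a Louvain routine).

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community id.
    Q = (fraction of intra-community edges) - (expected fraction under the
    degree-preserving null model).
    """
    m = len(edges)
    degree = defaultdict(int)
    intra = 0  # edges whose two endpoints share a community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            intra += 1
    deg_sum = defaultdict(int)  # total degree per community
    for node, c in community.items():
        deg_sum[c] += degree[node]
    return intra / m - sum((d / (2 * m)) ** 2 for d in deg_sum.values())

# Two triangles joined by a single bridge edge: a clear 2-community split.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community)
```

A good partition of a clustered graph yields Q well above 0, in line with the Q ≈ 0.74 values reported above for the citation networks.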
If the communities were extremely imbalanced (consider one extremely large community surrounded by very small ones), homophily would be less likely to be found, because most documents would have higher chances of sharing the same few common characteristics. Finding community structures within the citation network confirms that we can leverage these structures using homophily. In other words, there are groups of documents that share latent characteristics and that may be differentiated enough from other groups.

Homophily network
Prior case and statute citations form two bipartite structures in our network model: case-prior case relationships from G_P, and case-statute relationships from G_S. Bipartite projections into one-mode networks exhibit a complex network structure (Guillaume and Latapy 2006). Thus, we further derive three one-mode networks, G′_P = (C, E′_P), G′_S = (C, E′_S), and G′ = (C, E′), in order to analyze homophily, where C represents the case nodes, the E′ represent edges, and G′ is the combination of G′_P and G′_S. Homophily is the property of linked entities to share some existing characteristics, which may also be shared within a group or cluster. Although this may be captured through entity-characteristic associations, naturally forming bipartite networks, other models, including multilayer networks and hypergraphs, could also be investigated (Renoust 2013). In the context of homophily, multilayer networks and bipartite graphs are equivalent (Renoust 2014), and hypergraphs admit the incidence/Levi graph representation (Levi 1942). All of them may be projected into one-mode networks connecting the entities that may be linked by homophily. This single-mode projection is the artifact that enables us to alleviate the limitation of not having access to the cited content while still embedding the network structure within our topic modeling.
To this end, we project the two bipartite relationships onto case-case relationships. Let u ∈ C and v ∈ C be two initial cases under investigation. We assign to each of these cases a set of references R_u = {r_1, r_2, r_3, ...}, where each r_x represents either a prior case or a statute noticed. In a projected network G′, the original cases {u, v} ⊆ C become the nodes, and there exists a link e = (u, v) ∈ E′ if and only if the intersection of their reference sets is non-empty: R_u ∩ R_v ≠ ∅. A reference r_x ∈ P is a prior case in the network induced by prior cases, G′_P; a reference r_x ∈ S is a statute law in the network induced by statutes, G′_S; and a reference r_x ∈ S ∪ P can be either a case or a statute noticed in the general projected network G′.
We can obtain the weight of each link with the following two methods: w_n, the number of shared citations between the two cases, or w_j, the Jaccard index of their reference sets (Jaccard 1901):
$$w_n(u, v) = |R_u \cap R_v|, \qquad w_j(u, v) = \frac{|R_u \cap R_v|}{|R_u \cup R_v|}. \qquad (1)$$
The resulting networks are very dense: the numbers of nodes and edges are 4803/286,435 for G′_P, 3138/379,447 for G′_S, and 5576/643,729 for G′, respectively. We can see that there is only little overlap between links induced by prior cases and by statutes. Restricting to the main components of these networks, we obtain sizes of 4244/286,403, 3033/379,426, and 4870/643,725 nodes and edges for G′_P, G′_S, and G′, respectively. G′_P has modularity Q_{G′_P} = 0.428 for 14 communities, G′_S has modularity Q_{G′_S} = 0.542 for 13 communities, and G′ has modularity Q_{G′} = 0.502 for 7 communities. Figure 3 visualizes these networks with their communities.
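The projection and the two weights can be sketched in a few lines. This is a toy illustration: the case names and reference sets below are made up, and only the logic (link iff shared references; shared-citation count w_n and Jaccard index w_j) comes from the text.

```python
from itertools import combinations

def project(references):
    """Project case -> reference-set data onto a weighted case-case network.

    references: dict mapping each case to the set of items it cites
    (prior cases and/or statutes).  Returns {(u, v): (w_n, w_j)} where
    w_n = |R_u & R_v| (shared citations) and w_j is the Jaccard index.
    """
    weights = {}
    for u, v in combinations(sorted(references), 2):
        shared = references[u] & references[v]
        if shared:  # a link exists iff R_u ∩ R_v is non-empty
            union = references[u] | references[v]
            weights[(u, v)] = (len(shared), len(shared) / len(union))
    return weights

# Hypothetical reference sets for three cases.
refs = {
    "case1": {"statuteA", "priorX", "priorY"},
    "case2": {"statuteA", "priorX"},
    "case3": {"priorZ"},
}
weights = project(refs)  # case3 shares nothing, so it gets no link
```

Running the full case-prior-case and case-statute reference sets through this projection yields the G′_P, G′_S, and G′ networks described above.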

Embedding of homophily network with Node2vec
Homophily networks can be embedded in a low-dimensional space using kernels such as Node2vec (Grover and Leskovec 2016). Node2vec aims at embedding networks in low-dimensional representations while preserving their properties, such as node neighborhoods, roles, or communities based on homophily. Node2vec builds on the skip-gram model (Mikolov et al. 2013) used for embedding words by leveraging their contexts. To compute node embeddings, Node2vec runs the skip-gram model over random walks in the graph, allowing it to represent the neighbourhood and the overall position of each node in the graph. Our aim is to embed the homophily networks and exploit the semantics of their embeddings as provided by Node2vec. We can then obtain the weight of each link by computing the cosine similarity between the two nodes of the link under consideration. For a link l = (n_i, n_j), one can compute the weight
$$w_s(l) = \mathrm{sim}(v_{n_i}, v_{n_j}), \qquad (2)$$
where v_x is the embedding of node x provided by Node2vec and sim(·, ·) is the cosine similarity. Using cosine similarity to weight links in the network is more general and robust, as this weight incorporates both the similarity of the two nodes and the similarity of their neighborhoods. The final Node2vec embedding of the case and statute homophily networks is shown in Fig. 4, visualized with a dimensionality-reduction technique similar to t-SNE. Node2vec embeddings preserve the property of homophily: as shown in Fig. 4, one can see the different communities, based on prior cases for G′_P, statutes for G′_S, and both prior cases and statutes for G′. Node2vec can push words of similar documents into the same topics, because the nodes are cases/statutes and Node2vec leverages only random walks to cluster nodes.
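The cosine weight w_s of Eq. (2) can be sketched as follows. The 2-dimensional vectors below are toy stand-ins for the 50-dimensional Node2vec embeddings used in the paper; in practice they would come from a Node2vec implementation run over the homophily network.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def edge_weights(edges, embedding):
    """w_s for each link: cosine similarity of its two node embeddings."""
    return {(i, j): cosine(embedding[i], embedding[j]) for i, j in edges}

# Toy 2-d "embeddings" standing in for Node2vec vectors of three cases.
embedding = {"c1": [1.0, 0.0], "c2": [1.0, 1.0], "c3": [0.0, 1.0]}
w_s = edge_weights([("c1", "c2"), ("c2", "c3")], embedding)
```

Because the embedding places whole neighbourhoods close together, two cases get a high w_s even if they never co-cite the same document directly, which is the robustness argument made above.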

Relational topic model with complex network
In this section, the classical topic model of LDA (Blei et al. 2003) is first introduced, and second the RTM (Chang and Blei 2009) is described (as illustrated in the left and right parts of Fig. 5, respectively). We then integrate the weights obtained by the homophily relationships to RTM for legal document analysis.
LDA (Blei et al. 2003) is a generative model. In LDA, documents are represented as mixtures of latent topics, and each topic is characterized by a distribution over the vocabulary of the documents. α and β are corpus-level parameters, sampled when generating the corpus. θ_d are document-level variables, sampled for each document. We assume that words in documents are generated according to topic probabilities. From α and θ_d, the topic appearance probability is generated for document d. Then, the word-topic assignment z_{d,n} is generated in document d. Finally, the word x_{d,n}, the nth word in the dth document, is generated from the occurrence probabilities of the vocabulary in topic k, given by β_k and z_{d,n}. The LDA model only focuses on the documents themselves, without considering the relationships among documents.
The link relationships between documents are considered in RTM (Chang and Blei 2009). As in LDA, documents are first generated from topic distributions in RTM. Next, document links are modeled as binary variables, one link per pair of documents. Given a pair of documents d and d′, a binary link indicator is drawn as
$$y_{d,d'} \mid z_d, z_{d'} \sim \psi\left(\cdot \mid z_d, z_{d'}\right),$$
where ψ is the link probability function for the document pair. Both a sigmoid and an exponential form have been used for ψ; we adopt the exponential, because it performs better in the experiments reported in Chang and Blei (2009). ψ depends on the topic assignments z_d and z_{d′}, which generate the words of the two documents, and is calculated as
$$\psi_e(y = 1) = \exp\left(\eta^\top (\bar{z}_d \circ \bar{z}_{d'}) + \nu\right), \qquad (3)$$
where $\bar{z}_d = \frac{1}{N_d} \sum_n z_{d,n}$, ∘ denotes the element-wise product, and the coefficients η and the intercept ν are parameters. They can be estimated in closed form by maximizing a regularized likelihood; the closed-form updates involve the vector 1 whose elements are all 1 and a scalar ρ that controls the frequency of the pseudo-observations of non-links, where the summation is computed over all possible document pairs (we refer the reader to Chang and Blei (2009) for the exact updates).
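The exponential link probability of Eq. (3) is easy to compute once the per-document topic proportions are known. In this sketch the topic assignments, η, and ν are illustrative values, not estimates from data.

```python
import math

def zbar(z_assignments, K):
    """Empirical topic proportions: mean of the per-word topic draws z_{d,n}."""
    counts = [0.0] * K
    for z in z_assignments:
        counts[z] += 1
    n = len(z_assignments)
    return [c / n for c in counts]

def psi_exp(zbar_d, zbar_e, eta, nu):
    """Exponential link probability: exp(eta . (zbar_d o zbar_e) + nu),
    where 'o' is the element-wise product."""
    s = sum(h * a * b for h, a, b in zip(eta, zbar_d, zbar_e))
    return math.exp(s + nu)

K = 3
zd = zbar([0, 0, 1, 2], K)  # document d: topics [0.5, 0.25, 0.25]
ze = zbar([0, 1, 1, 1], K)  # document d': topics [0.25, 0.75, 0.0]
p = psi_exp(zd, ze, eta=[2.0, 2.0, 2.0], nu=-1.0)
```

The more two documents' topic proportions overlap, the larger the element-wise product and hence the link probability, which is exactly the pressure RTM exerts on topic assignments of linked documents.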
We use the homophily network structure presented in Sect. 4.2 to improve on RTM. We propose two models, the cutting model and the weighting model, in order to strengthen the effect of co-citation patterns in the networks. Prior cases and statutes can have different influences on the judgement of current cases, but these influences vary case by case and are thus data-driven; therefore, we design different weighting schemes to model them. The motivation behind the cutting/weighting models is to take the best of both prior cases and statutes: besides using the networks of prior cases only or the networks of statutes only, we also propose to use both by aggregating their weights.

Cutting model
The cutting model sets a threshold to filter out noisy co-citations with low weights, while keeping the most influential statute laws and prior cases. In this model, we use either w_n or w_j from Eq. (1) to cut inefficient links, keeping only the links with edge weights higher than a threshold. The retained links are then used in RTM as document links. In our experiments, we tried different thresholds on the edge weights in the three homophily networks G′_P, G′_S, and G′.
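The cutting model reduces to a one-line filter over the weighted edge dictionary produced by the projection; the weights below are made-up examples.

```python
def cut_links(weighted_edges, threshold):
    """Cutting model: keep only the links whose weight meets the threshold.
    The surviving pairs become the document links fed to RTM."""
    return [(u, v) for (u, v), w in weighted_edges.items() if w >= threshold]

# Hypothetical w_n weights (shared-citation counts) for three links.
w_n = {("c1", "c2"): 7, ("c1", "c3"): 2, ("c2", "c3"): 12}
links = cut_links(w_n, threshold=5)  # drops the weak ("c1", "c3") link
```

The same filter applies unchanged to the Jaccard weights w_j, only with thresholds in [0, 1] such as the 0.50 and 0.25 used in the experiments.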

Weighting model
In this model, instead of cutting links by their edge weights, we keep all the links but use the edge weights to weight the link probability function, i.e., Eq. (3). Note that among the citation-based weights we only use w_j for weighting, because of the probability characteristic of Eq. (3). Formally, we compare the following weighting methods:
• w = w_j^P: only the edge weights w_j in G′_P.
• w = w_j^S: only the edge weights w_j in G′_S.
• w = w_s^P: only the edge weights w_s in G′_P.
• w = w_s^S: only the edge weights w_s in G′_S.
• w = w_j^P + w_j^S: the sum of the edge weights w_j in G′_P and G′_S.
• w = w_j^P × w_j^S: the product of the edge weights w_j in G′_P and G′_S.
• w = w_s^P + w_s^S: the sum of the edge weights w_s in G′_P and G′_S.
• w = w_s^P × w_s^S: the product of the edge weights w_s in G′_P and G′_S.
Fuzzy logic aggregation, extensively studied in Detyniecki's thesis (Detyniecki et al. 2000), can help balance the different types of weights. Fuzzy logic, especially triangular-norm (t-norm) fuzzy logic, which guarantees the triangular inequality in probabilistic spaces, generalizes intersection in a lattice and conjunction in logic, offering many aggregation operators that define conjunction for values within [0, 1] (Schweizer and Sklar 2011). Each t-norm operator T is associated with an s-norm (t-conorm) S through De Morgan's law (Hurley 2005): S(x, y) = 1 − T(1 − x, 1 − y). The t-norm is the standard semantics for conjunction in fuzzy logic, so the t-norm/s-norm couple acts as AND/OR operators on real values in [0, 1]. We experiment with several fuzzy operators. Besides the classical sum/product, we also consider the family of Hamacher t-norms (Hamacher 1976), defined for λ ≥ 0 as
$$T_{H,\lambda}(x, y) = \frac{xy}{\lambda + (1 - \lambda)(x + y - xy)}, \qquad (4)$$
the family of Yager t-norms (Yager 1980), defined for λ > 0 as
$$T_{Y,\lambda}(x, y) = \max\left(0,\; 1 - \left((1 - x)^{\lambda} + (1 - y)^{\lambda}\right)^{1/\lambda}\right), \qquad (5)$$
and the Einstein sum (Einstein 1916)
$$T_{E}(x, y) = \frac{x + y}{1 + xy}; \qquad (6)$$
recall that the Hamacher family contains the classical product/sum as a special case. Fuzzy logic aggregators can also be used as weighting methods. In addition to the methods presented above, we thus consider the following weighting strategies:
• w = T_{H,λ}(w_j^P, w_j^S): the aggregation of w_j in G′_P and G′_S, using the Hamacher t-norm.
• w = T_{Y,λ}(w_j^P, w_j^S): the aggregation of w_j in G′_P and G′_S, using the Yager t-norm.
• w = T_E(w_j^P, w_j^S): the aggregation of w_j in G′_P and G′_S, using the Einstein sum.
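The three fuzzy aggregators can be implemented directly. This sketch uses the standard parameterizations of the Hamacher and Yager families (under which λ = 1 recovers the plain product for Hamacher); the paper's exact parameter convention may differ, and the example weights are made up.

```python
def t_hamacher(x, y, lam=0.5):
    """Hamacher t-norm family (lam >= 0); lam = 1 gives the plain product
    in this standard parameterization."""
    if x == y == 0.0:
        return 0.0  # avoid 0/0 when lam = 0
    return (x * y) / (lam + (1 - lam) * (x + y - x * y))

def t_yager(x, y, lam=2.0):
    """Yager t-norm family (lam > 0)."""
    return max(0.0, 1 - ((1 - x) ** lam + (1 - y) ** lam) ** (1 / lam))

def s_einstein(x, y):
    """Einstein sum (an s-norm): (x + y) / (1 + x*y)."""
    return (x + y) / (1 + x * y)

# Hypothetical Jaccard weights for one link in G'_P and G'_S.
w_p, w_s = 0.6, 0.3
aggregated = {
    "hamacher": t_hamacher(w_p, w_s),
    "yager": t_yager(w_p, w_s),
    "einstein": s_einstein(w_p, w_s),
}
```

The t-norms (AND-like) stay at or below min(w_p, w_s)'s neighbourhood, while the Einstein sum (OR-like) lands above max(w_p, w_s), which is the trade-off the weighting strategies exploit.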
In summary, the sum and product methods model the use of prior cases and statutes in an OR and an AND relation, respectively. The fuzzy logic aggregation can further trade off between the OR and AND relations. The same weighting strategies are also used to aggregate the weights obtained with the Node2vec embedding, where w_s is used instead of w_j.

Experiments
We used the Canadian law corpus (Sect. 3) to conduct experiments on topic modeling.

Settings
For preprocessing, we performed the following. Words were lemmatized and lowercased using NLTK (Loper and Bird 2006). We also discarded word tokens containing non-alphabetic characters. In addition, we excluded stop words using the English stop word list in NLTK. We compared the homophily networks G′_P, G′_S, and G′. Using the edge weights obtained from Eqs. (1) and (2), we either selected the edges to use according to a threshold in the cutting model (Sect. 5.1) or weighted the edges using the methods of the weighting model (Sect. 5.2). We used a publicly available RTM implementation for all our experiments; its parameters were trained using the obtained network information and documents. The number of topics and the maximum number of iterations were set to 200 and 10, respectively.
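The preprocessing pipeline can be sketched as below. The stop-word set here is a tiny stand-in (the paper uses NLTK's English list), and the paper additionally lemmatizes with NLTK, which this stdlib-only sketch omits.

```python
import re

# A tiny stand-in stop-word list; the paper uses NLTK's full English list.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "is", "it"}

def preprocess(text):
    """Lowercase, keep purely alphabetic tokens, drop stop words.
    (The paper additionally lemmatizes each token with NLTK.)"""
    tokens = re.findall(r"\S+", text.lower())
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

tokens = preprocess("The Court dismissed the appeal under s. 96 of the Act")
```

Tokens such as "s." and "96" are removed by the alphabetic filter, which matches the rule of discarding tokens containing non-alphabetic characters.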
The Node2vec kernel is used with 50 dimensions and its in-out hyperparameter set to 0.5 to preserve the homophily property; the remaining parameters are kept at their defaults.
As a baseline, we compared with LDA (Blei et al. 2003), which does not use link information. Furthermore, we compared with another baseline, RTM (w → ∞), an RTM that does not use any link information. For all experiments, we used the default values for the corpus-level parameters α and β (see Fig. 5), namely 0.1 and 0.01, respectively.

Evaluation metrics
To evaluate the output topics of the different models, we mainly used the coherence score (Newman et al. 2010). Coherence measures the similarity among the output topic words and is commonly used for evaluating topic modeling performance. Coherence can be computed as
$$\mathrm{coherence} = \frac{2}{N(N-1)} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \mathrm{sim}(word_k, word_l),$$
where word_k and word_l are the kth and lth topic words output by a topic model, and N is the number of output topic words. sim(·, ·) is the cosine similarity of two words, calculated by representing the two words with GloVe840B word embeddings (Pennington et al. 2014). Note that we used the top ten words output by the topic models to calculate coherence.
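The embedding-based coherence score amounts to an average pairwise cosine similarity over the top topic words. The 2-dimensional vectors below are toy stand-ins for the 300-dimensional GloVe embeddings used in the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def coherence(top_words, embedding):
    """Mean pairwise cosine similarity over the top-N topic words."""
    n = len(top_words)
    total = sum(cosine(embedding[top_words[k]], embedding[top_words[l]])
                for k in range(n - 1) for l in range(k + 1, n))
    return total / (n * (n - 1) / 2)

# Toy 2-d vectors standing in for GloVe embeddings.
emb = {"court": [1.0, 0.0], "judge": [0.9, 0.1], "banana": [0.0, 1.0]}
```

A topic whose top words are semantically related ("court", "judge") scores higher than one containing an off-topic word ("banana"), which is what the metric is meant to capture.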
In addition, we also report the C_V (Röder et al. 2015), C_UMass (Mimno et al. 2011), and C_UCI (Newman et al. 2010) scores to confirm the performance consistency among different evaluation metrics. The C_V metric is reported to have the strongest correlation with human ratings. C_V is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity.
C_UMass is an intrinsic, asymmetrical confirmation measure between top word pairs that accounts for the ordering among the top words of a topic. C_UMass is computed as
$$C_{UMass} = \frac{2}{N(N-1)} \sum_{x=2}^{N} \sum_{y=1}^{x-1} \log \frac{P(w_x, w_y) + \epsilon}{P(w_y)},$$
where P(w_x, w_y) denotes the probability of co-occurrence of the words w_x and w_y, and P(w_y) denotes the frequency of the word w_y in the corpus. Word probabilities are estimated from document frequencies in the original documents used for learning the topics.
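A minimal sketch of the UMass measure, using the common count-based smoothing D(w_x, w_y) + 1 rather than the probability form with ε; the documents below are toy sets of words, and the first-listed top word is assumed to occur in the corpus (its document frequency is the denominator).

```python
import math

def c_umass(top_words, documents):
    """UMass coherence: mean of log((D(w_x, w_y) + 1) / D(w_y)) over the
    ordered pairs of top words, with D(.) a document-frequency count."""
    def df(*words):  # number of documents containing all the given words
        return sum(all(w in doc for w in words) for doc in documents)

    score, pairs = 0.0, 0
    for x in range(1, len(top_words)):
        for y in range(x):
            score += math.log((df(top_words[x], top_words[y]) + 1)
                              / df(top_words[y]))
            pairs += 1
    return score / pairs

docs = [{"court", "judge"}, {"court", "appeal"}, {"judge", "appeal"}]
score = c_umass(["court", "judge"], docs)
```

Because it only needs document frequencies from the training corpus itself, C_UMass is intrinsic, unlike C_UCI below which draws its statistics from an external corpus.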
Unlike C_UMass, C_UCI is an extrinsic measure based on pointwise mutual information (PMI), using an external resource such as Wikipedia to estimate the PMI statistics. C_UCI is calculated as
$$C_{UCI} = \frac{2}{N(N-1)} \sum_{x=1}^{N-1} \sum_{y=x+1}^{N} \mathrm{PMI}(w_x, w_y), \qquad \mathrm{PMI}(w_x, w_y) = \log \frac{P(w_x, w_y) + \epsilon}{P(w_x)\,P(w_y)}.$$
C_UCI is thus similar to the coherence score, with sim(·, ·) replaced by pointwise mutual information.

Table 2 Coherence scores of topics from G′_P and G′_S against w_n, w_j, cutting and weighting models

Results
The coherence scores are shown in Table 2, comparing the case network G′_P with the statute network G′_S, for the baselines and for both the cutting and weighting models. In the table, w_x with x ∈ {n, j, s} denotes the edge weight as defined in Eqs. (1) and (2). In the cutting model, an edge between two nodes is created in RTM and used for training only when its weight satisfies w_x ≥ w_e; w_x ≥ 0 means that all edges are used regardless of their weights, while w_x → ∞ means that no link is used, which is the RTM baseline described in Sect. 6.1. In the weighting model, all edges between two nodes are created and trained with RTM, but they are weighted by the different methods described in Sect. 5.2.

In all cases, our RTM models given link information show higher performance than the baselines, i.e., LDA and RTM with w_x → ∞. In addition, the cutting model, which uses a weight threshold to cut links, improves over including all links (w_x ≥ 0) without weighting. This means that some links carry noisy information that has a negative effect when all links are treated equally. Comparing the settings that create links from prior cases (G′_P) or from statute laws (G′_S), we do not observe a significant difference, indicating that both types of citations are good sources for improving topic modeling. There is also no large difference between w_n and w_j, where w_n is the number of common citations between two cases and w_j treats homophily as similarity. In the cutting model, when we only use prior-case citations, w_n ≥ 5 performs best in G′_P; when we only use statutes, w_n ≥ 10 performs best in G′_S. Likewise, w_j ≥ 0.50 and w_j ≥ 0.25 perform best for G′_P and G′_S, respectively. Comparing the performance of the cutting model to that of the weighting model, the weighting model performs better whether citations come from prior cases (G′_P) or statute laws (G′_S).
This indicates the importance of weighting links so as to distinguish good links from noisy ones in RTM. Comparing w_j and w_s in the weighting model, w_j performs slightly better than w_s.

We also evaluated the best balance between G′_P and G′_S for the combined links of G′. For the cutting model, we use the best thresholds from Table 2 to balance the weights w^P_x and w^S_x. Table 3 shows the results, where we also report the C_V, C_UMass, and C_UCI scores to confirm the consistency among the different evaluation metrics. As Table 3 shows, the metrics are generally very consistent, so we discuss the results based on the coherence scores only. Although the combined cutting model performs better than the no-link model (w → ∞), it performs similarly to using each link type separately: in the cutting model, the coherence score is not necessarily improved by combining both types of links, even when the best threshold is used for each citation type. The results in Table 3 show a similar trend for the weighting model: the coherence score is not necessarily improved by using both types of links, regardless of whether w_j or w_s is used or which weighting method is applied. Nevertheless, the weighting model still outperforms the cutting model. The best weighting methods are the simple product and the Hamacher product (smooth product, see Eq. (4)), which means that simple weighting strategies work well when both prior cases and statutes are present (an AND operator). Cosine similarity based on Node2vec is very consistent, as the different weighting methods yield close coherence scores; this result with the Node2vec similarity w_s can be interpreted as the embedding generalizing the homophily networks well. Node2vec does not, however, improve over the simple and fuzzy weighting strategies.
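A minimal sketch of the w_s weight, assuming precomputed Node2vec embeddings of the case nodes (the embedding training is not shown and the vectors below are hypothetical):

```python
from math import sqrt

# Hypothetical Node2vec embeddings of case nodes; in the paper these
# come from embedding the homophily network.
emb = {
    "A": [0.9, 0.1, 0.3],
    "B": [0.8, 0.2, 0.4],
    "C": [-0.5, 0.9, 0.0],
}

def w_s(u, v):
    """Edge weight w_s: cosine similarity between node embeddings."""
    a, b = emb[u], emb[v]
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Cases whose embeddings point in similar directions (here A and B) receive a high w_s, while dissimilar ones (A and C) receive a low or negative score.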
The best weighting strategy for Node2vec is the sum, which can be interpreted as meaning that either prior cases or statutes alone are enough to generate coherent topic models (an OR operator).
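The three aggregation strategies discussed above can be sketched as follows. The Hamacher product is written in its standard t-norm form ab/(a+b−ab), which we assume matches Eq. (4); the weight values are hypothetical:

```python
def product(a, b):
    """Simple product (AND-like aggregation)."""
    return a * b

def hamacher(a, b):
    """Hamacher product t-norm (the 'smooth product'), with the
    usual convention that it is 0 when both arguments are 0."""
    return 0.0 if a == b == 0 else (a * b) / (a + b - a * b)

def weight_sum(a, b):
    """Sum (OR-like aggregation), the best strategy for Node2vec."""
    return a + b

w_P, w_S = 0.5, 0.25  # hypothetical normalized prior-case / statute weights
combined = {
    "product": product(w_P, w_S),
    "hamacher": hamacher(w_P, w_S),
    "sum": weight_sum(w_P, w_S),
}
```

The product-style aggregations reward pairs linked by both citation types, while the sum keeps a pair strongly linked as soon as either type supports it.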
In addition, we investigated the impact of the number of topics on the topic models. Table 4 (coherence scores over different numbers of topics for G′_P, G′_S, and G′ in the cutting model) lists the coherence scores for the baseline LDA, RTM (w → ∞), and the cutting model for topic numbers |T| of 200, 100, 50, and 10, with the best weight setting for G′_P, G′_S, and G′, respectively. The cutting model performs best at |T| = 50 when using G′_P, and at |T| = 10 when using G′_S and G′. Although there is some variability, coherence tends to increase as the number of topics decreases. Table 5 shows the coherence for the weighting model over the same topic numbers |T| with the different weighting methods (footnote 8). We observe the same trend as for the cutting model: the small topic number |T| = 10 yields the best coherence score. Moreover, the weighting model still outperforms the cutting model as |T| varies, and w = w^P_j + w^S_j at |T| = 10 performs best among all the models. For the weighting model, |T| = 10 is best with both w_j and w_s, obtained respectively from the co-citation homophily networks and from the cosine similarity based on the Node2vec embedding of the homophily networks. We omit the results of the fuzzy aggregations over different topic numbers, as they are similar to those of the simple weighting schemes. In conclusion, we recommend using both prior cases and statutes to generate topic models when available; otherwise prior cases, and finally statutes. We also recommend small topic numbers, e.g., 10.

To understand the uncertainty of the topic models on our data, we calculated the standard deviation (footnote 9). Table 6 shows the standard deviations for the baseline models and our best model with respect to the coherence metric at |T| = 10. The standard deviations are consistently small, meaning that the results are stable.
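The mean/standard-deviation summary over repeated runs can be sketched as follows. The coherence scores below are hypothetical placeholders that only mimic the reported trend (smaller |T| scoring higher), not values from Tables 4-6:

```python
from statistics import mean, stdev

def summarize(scores_by_T):
    """Mean and standard deviation of coherence over repeated runs,
    keyed by topic number |T|."""
    return {T: (mean(s), stdev(s)) for T, s in scores_by_T.items()}

# Hypothetical coherence scores from three repeated runs per |T|
runs = {10: [0.62, 0.63, 0.61], 50: [0.58, 0.59, 0.57], 200: [0.51, 0.52, 0.50]}
summary = summarize(runs)
best_T = max(summary, key=lambda T: summary[T][0])  # |T| with highest mean
```

A small standard deviation relative to the gap between settings, as in this toy example, is what supports the stability claim above.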
To understand the performance of the topic models qualitatively, we also analyzed the output topic words. Table 7 shows topic word examples from the best-performing model (i.e., the weighting model with w = w^P_j + w^S_j and |T| = 10) along with their topic IDs. These topic words are very informative and relate to the areas of law in Canada (footnote 10). From them, we can easily identify topics such as patents, immigration and refugees, contracts, property, human rights, labour and employment, and procedural law, and through these predicted topics we can easily understand the topics of the important legal cases covered in the COLIEE dataset.
We further investigated the difference between topic similarity (ψ) and homophily (w_x) as outputs of our models. The difference between the resulting weights is shown in Fig. 6 (comparison of the best w_n and w_j against the topic similarity ψ). The shapes of the topic similarity distributions are very similar with respect to the best weights w_n. However, when we inspect the most similar cases, different results may be obtained. With the case-based topic similarity, at ψ = 0.65, w_n ≥ 1, and w_j ≥ 0.05 for G′_P, the closest cases are #3451 and #1276. In both examples, the applicants asked for a judicial review for humanitarian and compassionate relief; one application was accepted (#3451) and the other rejected (#1276).

Footnote 9: We did not conduct bootstrapping because, in our context, it may not be straightforwardly applicable: we would need a very specific sampling method that preserves the citation graph structure and its modularity for homophily to remain relevant.

Footnote 10: https://en.wikipedia.org/wiki/Law_of_Canada#Procedural_law
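A sketch of a document-level topic similarity ψ between the topic distributions of two cases. The Bhattacharyya coefficient used here is an assumed choice (the measure is not fixed in this sketch), and the distributions are hypothetical:

```python
from math import sqrt

def topic_similarity(theta_u, theta_v):
    """Similarity psi between two document-topic distributions,
    here taken as the Bhattacharyya coefficient (assumed form):
    sum_t sqrt(theta_u[t] * theta_v[t]), which is 1 for identical
    distributions and 0 for disjoint supports."""
    return sum(sqrt(p * q) for p, q in zip(theta_u, theta_v))

# Hypothetical topic distributions over |T| = 4 topics
psi = topic_similarity([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1])
```

Pairs can then be ranked either by ψ or by a homophily weight w_x, which is exactly the comparison made in Fig. 6.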

Conclusion
We presented a novel analysis method for the COLIEE legal case dataset. Thanks to homophily, we improve topic modeling using the citation structure, even without access to the content of the cited documents. We built networks composed of thousands of cases and references, where the references belong to two types of citations: prior cases and statute laws. We explored these two types of citations to investigate citation homophily among cases, and proposed a cutting model and a weighting model that use these references to improve the RTM. Experiments indicated that the weighting model outperforms the cutting model and significantly improves topic modeling performance. In addition, the predicted topics are very informative for legal case analysis. We publish both our data and code online (footnote 11) for further research.
In future work, we first intend to use a multilayer network model with Detangler (Renoust et al. 2015) to visualize the overlap of topics based on topic content and similarity, in order to evaluate the capacity of our topic models to relate similar documents. Second, we plan to combine all the other items contained in the data, such as counselors. Lastly, we plan to further investigate the extraction of new links in the dataset with our topic modeling, allowing us to explore the homophily between the cited cases and laws.
So far we have applied this method only in the context of legal documents and computational law. However, there are many similar data that our framework could work on. For instance, in scientific papers, co-citations can be of different types depending on where they occur, such as the "introduction", "related work", or "core" sections, which serve different functions in a paper. Co-citations can also be typed by who makes them: the authors themselves, co-authors, or other researchers. These different types of co-citations can be used for homophily-based topic modeling in our framework. In news data, there can be co-references to news agencies, events, and named entities; these different types of co-references likewise play different roles and can be used for homophily-based topic modeling in our framework as well. This work therefore invites further investigation of a more general method of homophily-based topic modeling that would fit a larger set of application contexts, including citation networks, but could also extend to other relations implying homophily.