A top-down supervised learning approach to hierarchical multi-label classification in networks

Node classification is the task of inferring or predicting missing node attributes from information available for other nodes in a network. This paper presents a general prediction model for hierarchical multi-label classification, where the attributes to be inferred can be specified as a strict poset. It is based on a top-down classification approach that addresses hierarchical multi-label classification with supervised learning by building a local classifier per class. The proposed model is showcased with a case study on the prediction of gene functions for Oryza sativa Japonica, a variety of rice. It is compared to the Hierarchical Binomial-Neighborhood, a probabilistic model, by evaluating both approaches in terms of prediction performance and computational cost. The results in this work support the working hypothesis that the proposed model can achieve good levels of prediction efficiency, while scaling up in relation to the state of the art.


Introduction
Network representations provide a formal framework to specify relationships between interconnected entities (nodes). In a number of scenarios, such frameworks annotate nodes with attributes to naturally identify groups whose members are related by, e.g., a particular similarity measure. In analyzing networks with node attributes, most studies assume that a node can take a finite number of possible values, each of which is called a class. A class may represent the gender of a user in a social network or the function associated to a protein in a protein-protein interaction network.

The Problem
The task of inferring or predicting missing node attributes from information available for other nodes in a network is called node classification (also referred to as attribute prediction) [4]. Several formulations of the node classification problem have been proposed over the past decades. The problem of inferring an attribute from exactly one of two classes is referred to as binary classification [17]. The extension of the binary classification problem to any finite (non-zero) number of classes is referred to as multi-class classification [22]. Furthermore, each type of problem is categorized as a multi-label classification if a node is allowed to be simultaneously associated to more than one class [25].
arXiv:2203.12569v1 [cs.LG] 23 Mar 2022
For most techniques that address the above-mentioned problems, node classification is generally carried out independently for each class [1,4,15,18,35]. The main limitation of such compartmentalized approaches is that they ignore hidden relationships among classes, even when certain class relationships may serve as an input to improve the accuracy of attribute prediction.
In practice, classes may have explicit relations specifying their dependencies. For example, this is the case with the Gene Ontology hierarchy, because genes and proteins associated to a function must also be associated to the ancestors of that function. To address this limitation, the authors of [31] define dependencies between classes as ancestral relations by means of a hierarchy represented by a directed acyclic graph. A connection from class C_1 to class C_2 in the hierarchy means that every node with attribute C_1 also has attribute C_2 (i.e., the nodes having attribute C_1 are a subset of the nodes having attribute C_2).
Formally, a classification problem is considered hierarchical if and only if its hierarchy of classes is a strict partial order (i.e., a strict poset or, equivalently, a directed acyclic graph). A strict poset (C, ≺) over a finite set C of classes defines a binary relation ≺ on C that is asymmetric, anti-reflexive, and transitive. For instance, the hierarchy of biological processes can be defined over a strict poset according to which the functions cell death, programmed cell death, and apoptotic process are ordered by cell death ≺ programmed cell death ≺ apoptotic process [13]. Note that the transitive property of the order ≺ guarantees that cell death is also an ancestor of apoptotic process. Note also that, since the strict poset is anti-reflexive, no process can be an ancestor of itself. Finally, the asymmetry property guarantees that apoptotic process cannot be an ancestor of programmed cell death. Indeed, any strict poset can be represented as a directed acyclic graph (DAG), a concept more closely related to graph and network analysis.
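As an illustration of these three properties, the following minimal Python sketch builds the order ≺ from the two direct relations of the example and checks that the result is a strict poset. The function names transitive_closure and is_strict_poset are hypothetical and chosen only for this example; a pair (a, b) is read here as "a is an ancestor of b".

```python
from itertools import product


def transitive_closure(edges):
    """Return all (ancestor, descendant) pairs implied by the direct edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # iterate over a snapshot, since the closure grows inside the loop
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def is_strict_poset(relation):
    """Check that a set of pairs is anti-reflexive, asymmetric, and transitive."""
    anti_reflexive = all(a != b for a, b in relation)
    asymmetric = all((b, a) not in relation for a, b in relation)
    transitive = all(
        (a, d) in relation
        for (a, b), (c, d) in product(relation, repeat=2)
        if b == c
    )
    return anti_reflexive and asymmetric and transitive


# Direct ancestral relations from the biological process example
direct = {
    ("cell death", "programmed cell death"),
    ("programmed cell death", "apoptotic process"),
}
order = transitive_closure(direct)
```

The direct relations alone are not transitive; only their closure, which adds the pair (cell death, apoptotic process), passes the check.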
Given a graph with some labeled nodes (i.e., nodes associated to classes) and a class hierarchy, the expected outcome of a hierarchical classification problem is a collection of predicted associations between nodes and classes. An inconsistent prediction for a hierarchical multi-label classification problem refers to the fact that a node is inferred to have a particular class C, but the outcome of the classifier fails to infer the node's association to all ancestor classes of C. In other words, an inconsistent prediction does not satisfy the ancestral relations for some class C. In many scenarios, it is desirable to rule out inconsistent predictions: that is, if a classifier predicts a particular class C for a node, then it should also predict all the ancestors of C for that node; conversely, if a classifier does not predict C for a node, then it should not predict any of C's descendants for that node. This constraint is often referred to as the true-path rule in Gene Ontology [2,32].
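The consistency condition can be made concrete with a short Python sketch that lists, for the set of classes predicted for one node, every ancestor that is missing. The hierarchy below and the names ancestors and violations are hypothetical and serve only as an example; the hierarchy is given as a map from each class to its direct parents.

```python
def ancestors(cls, parents):
    """Collect every ancestor of cls, given a map class -> set of direct parents."""
    seen, stack = set(), list(parents.get(cls, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def violations(predicted, parents):
    """Return sorted (class, missing ancestor) pairs that break the true-path rule."""
    return sorted(
        (c, a)
        for c in predicted
        for a in ancestors(c, parents)
        if a not in predicted
    )


# Hypothetical hierarchy: E has parents B and C, which both have parent A.
parents = {"B": {"A"}, "C": {"A"}, "E": {"B", "C"}}

consistent = {"A", "B", "C", "E"}
inconsistent = {"B", "E"}  # E and B are predicted without ancestors A and C
```

A prediction is consistent exactly when violations returns an empty list for it.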
The efforts to classify nodes generally aim to comply with the underlying ancestral relations between classes in scenarios where such a hierarchy is known (and thus avoid inconsistent predictions). Consider a social network where nodes represent individuals and node attributes represent different levels of education. If a predictor outputs an individual without an undergraduate degree as a candidate for having a graduate degree, it fails, in many reasonable scenarios, to comply with the hierarchical organization of educational levels, thereby violating the true-path rule.
Hierarchical classification problems may further be categorized depending on the approach used for training the underlying model. On the one hand, a top-down approach involves a binary classifier for each class in the hierarchy. In this case, the classifier associated to each class is trained iteratively from the roots (i.e., the classes without ancestors in the hierarchy) to the leaves (i.e., the classes without descendants in the hierarchy). In addition, local information about the ancestors and descendants of a class in the hierarchy is used to avoid independent predictions. On the other hand, a big-bang approach involves a multi-label classifier that considers the entire hierarchy of ancestral relations at once. The multi-label classifier is trained just once with the information of every class in the hierarchy and its dependencies.

Related Work
Several studies have applied both top-down and big-bang approaches across different domains [16,10,5,26]. The authors of [16] propose a top-down approach, called Hierarchical Binomial-Neighborhood (HBN), to predict protein functions in the yeast Saccharomyces cerevisiae. They show that the hierarchical structure of functions can be exploited to completely avoid inconsistent predictions and, at the same time, outperform approaches based on independent class prediction. However, they point out that the main limitation of their approach is the high computational effort required for assigning probability weights to every protein-function pair. The authors of [26] introduce a top-down approach based on Chained Path Evaluation (CPE), which trains a classifier for each non-leaf class (i.e., each class with at least one descendant) in the hierarchy. Information on ancestral relations is included in the classifier by adding an extra feature with the prediction of the parents of each class. As in [16], the computational cost of the CPE model grows exponentially as a function of the number of paths in the hierarchy. The use of big-bang approaches is, in general, also limited by their high computational demands. In [10], for example, the authors present a big-bang approach that addresses hierarchical multi-label classification based on Predictive Clustering Trees (PCTs). The computational cost of the PCT approach is directly proportional to the size of the hierarchy.
Other studies address the node classification problem and obtain state-of-the-art performance for different case studies (see, e.g., [1,7,15,18,21,35]). However, they do not take into account dependencies between classes (hierarchical or not), for they focus on multi-class instead of multi-label problems. For this reason, such developments cannot be compared directly to assess hierarchical multi-label classification prediction.

Main Contribution
This work introduces a top-down classification approach that addresses hierarchical multi-label classification (HMC) using supervised learning. Given a network G = (V_G, E_G), an assignment of classes to nodes in the network, and a class hierarchy specified as a directed acyclic graph H = (V_H, E_H), the hierarchical multi-label classification problem is addressed by building a binary classifier for each class. Classifiers are built iteratively from the roots of the hierarchy to the leaves. The approach uses a correction mechanism to guarantee that the true-path rule is satisfied by the classifier's outcome; it is enforced by computing cumulative probabilities along the paths of classes in the input hierarchy.
The results in this work support the working hypothesis that the proposed approach can achieve good levels of prediction efficiency, while scaling up in relation to the state of the art. This approach is showcased with a case study on the prediction of gene functions for Oryza sativa Japonica, a variety of rice. It is compared to the probabilistic HBN model [16] by evaluating both approaches in terms of prediction performance (by means of the true positive and true negative rates) and in terms of their computational cost (by means of a comparison of the execution time). In the case study, the prediction task uses two inputs, namely, (i) a gene co-expression network (GCN), in which a node represents a gene and a class of a node represents a gene function; and (ii) the hierarchical structure of biological processes defined in [13]. The goal of the prediction task is to infer gene attributes from 15 sub-hierarchies grouping 1 938 biological processes associated to 19 663 genes.
Outline. The remainder of the paper is organized as follows. Section 2 introduces the approach for node classification where classes have a hierarchical organization. Section 3 describes the problem of predicting gene functions. It also presents the results of applying the proposed model to Oryza sativa Japonica. Finally, Section 4 draws some concluding remarks and future research directions.

Hierarchical Classification
This section presents a top-down classification approach in the form of a supervised learning model for hierarchical multi-label classification.
The inputs of the model are a graph G = (V_G, E_G) specifying an undirected network with nodes V_G and edges E_G, a directed acyclic graph H = (V_H, E_H), with vertices V_H and edges E_H disjoint from V_G and E_G, respectively, representing the hierarchy of classes, and a function φ : V_G → 2^{V_H} with a partial assignment of classes to nodes in the network. For v ∈ V_G, the set φ(v) ⊆ V_H is the collection of classes initially associated to v. It is assumed that φ satisfies the true-path rule for the hierarchy H, meaning that if a node v satisfies C ∈ φ(v) for a class C ∈ V_H, then φ(v) must contain all the ancestors of C in H. As mentioned in the introduction, the DAG H uniquely represents a strict poset. The goal of the model is then to build a new function from V_G to 2^{V_H} containing predicted associations between nodes and classes that satisfy the true-path rule. Figure 1A depicts an example of the input of the model where the nodes of the network G are labeled with classes A-E and the hierarchy of classes H is a DAG. According to the true-path rule, nodes labeled with class E are also related to classes A, B, and C. The objective is to predict new associations between nodes and classes for nodes either with or without labels.
The rest of this section is devoted to describing the main steps behind the construction of the supervised learning model.

Hierarchy Normalization
Hierarchies are represented as directed acyclic graphs where, in general, nodes can have any (finite) number of parents. Since the approach presented in this work assumes that every node has at most one parent, a topological traversal algorithm for directed graphs (see, e.g., [19]) is used to transform H into a tree, when required. In this way, the resulting model can take as input any hierarchy.
This algorithm uses the structure of H (not its tree version) and its distribution of classes. Given an ancestral relation A → B (i.e., class A is a direct ancestor of class B), a weight w(A, B) for such an edge is defined as the ratio between the number of nodes associated to B (i.e., the size of the set φ^{-1}(B)) and the number of nodes associated to A (i.e., the size of the set φ^{-1}(A)). Since all nodes associated to B must be associated to A, then by definition each weight satisfies 0 ≤ w(A, B) ≤ 1. When a class B has several parents A_1, ..., A_k in H, the parent of B in the resulting tree is the node A_j maximizing w(A_j, B) among all the A_i's. Ties are broken arbitrarily. This process can be effectively computed in time and space O(|V_H| + |E_H|), namely, in resources linear in the size of H. Such a process, based on a topological-sorting traversal, is described in Algorithm 1.1. Finally, note that the topological sorting of the vertices of H in Algorithm 1.1 can be exploited to compute the value of function w(·, ·) by dynamic programming in space Θ(|V_H|). More precisely, a function ρ : V_H → N assigning to each class B_i its number of descendants ρ(B_i) in H can be computed from the direct descendants of B_i, which are processed before B_i in the topological sorting of V_H.
Algorithm 2.1: Topological-sorting based traversal for hierarchy normalization
In the rest of this paper, it is assumed that hierarchy H is indeed a tree T.
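The parent-selection rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name normalize_hierarchy, the edge-list input, and the count dictionary (class to number of associated network nodes) are hypothetical, and the traversal is a standard Kahn-style topological order.

```python
from collections import defaultdict, deque


def normalize_hierarchy(edges, count):
    """Keep, for each class of a DAG, only its highest-weight parent.

    edges: iterable of (parent, child) pairs of the hierarchy H.
    count: dict class -> number of network nodes associated to that class.
    Returns a dict child -> chosen parent (roots do not appear as keys).
    """
    children, parents = defaultdict(list), defaultdict(list)
    indegree, classes = defaultdict(int), set()
    for a, b in edges:
        children[a].append(b)
        parents[b].append(a)
        indegree[b] += 1
        classes.update((a, b))

    def weight(a, b):
        # w(A, B) = |phi^-1(B)| / |phi^-1(A)|; assumed 0 when A has no nodes
        return count[b] / count[a] if count[a] else 0.0

    tree_parent = {}
    queue = deque(c for c in classes if indegree[c] == 0)  # start at the roots
    while queue:  # each class and each edge is visited once
        a = queue.popleft()
        if parents[a]:
            # max() keeps the first maximal parent, an arbitrary tie-break
            tree_parent[a] = max(parents[a], key=lambda p: weight(p, a))
        for b in children[a]:
            indegree[b] -= 1
            if indegree[b] == 0:
                queue.append(b)
    return tree_parent


# Hypothetical hierarchy in which class E has two parents, B and C
edges = [("A", "B"), ("A", "C"), ("B", "E"), ("C", "E"), ("B", "D")]
count = {"A": 100, "B": 40, "C": 60, "D": 10, "E": 20}
tree = normalize_hierarchy(edges, count)
```

In this example w(B, E) = 20/40 and w(C, E) = 20/60, so B is kept as the parent of E. Since the numerator is fixed for a given child, the rule selects the parent with the fewest associated nodes.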

The Model
Given the network G and the hierarchy tree T, the model is built in a process comprising three stages. Figure 2 depicts the general approach.
Stage 0: data pre-processing. In this stage, topological features of G and T, and hierarchical information in T are readied and combined for supervised learning.
Classes that are too specific or too general are ignored in the prediction to avoid overfitting and learning bias. In the case study presented in Section 3, a class is defined as too specific or too general if it is associated to fewer than 5 or more than 300 genes, respectively [16]. As a result, the input hierarchy T can be split into several sub-trees, each one representing a sub-hierarchy T' = (V_{T'}, E_{T'}) with V_{T'} ⊆ V_T and E_{T'} ⊆ E_T, over which the model is applied independently. That is, a sub-hierarchy T' is a subset of the classes and ancestral relations in T. As a matter of fact, this situation arises in the case study presented in Section 3. Furthermore, each sub-hierarchy T' is associated to the subgraph of G consisting of all nodes labeled with the root class of T'. In this way, sub-hierarchies are considered independent problems with smaller inputs (à la divide and conquer).
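The filtering and splitting step can be illustrated with a small Python sketch. It assumes that the tree is given as a child-to-parent map and that class sizes are available in a dictionary; the function name split_sub_hierarchies, the example classes, and their sizes are hypothetical. The bounds 5 and 300 are the ones used in the case study.

```python
from collections import defaultdict


def split_sub_hierarchies(tree_parent, count, low=5, high=300):
    """Group the retained classes of a tree by the root of their sub-hierarchy.

    tree_parent: dict child -> parent describing the tree T (roots are absent).
    count: dict class -> number of associated network nodes.
    """
    kept = {c for c, n in count.items() if low <= n <= high}

    def root_of(c):
        # climb while the parent is also retained; stop at the top of the sub-tree
        while tree_parent.get(c) in kept:
            c = tree_parent[c]
        return c

    groups = defaultdict(set)
    for c in kept:
        groups[root_of(c)].add(c)
    return dict(groups)


# Hypothetical tree: A is the root; B and C are its children, and so on.
tree_parent = {"B": "A", "C": "A", "D": "B", "E": "D", "F": "C", "G": "F"}
count = {"A": 1000, "B": 250, "C": 400, "D": 8, "E": 3, "F": 120, "G": 40}
subs = split_sub_hierarchies(tree_parent, count)
```

Here A and C are too general and E is too specific, so the tree falls apart into two sub-hierarchies rooted at B and F.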
For each sub-hierarchy T' in T, datasets are built based on two types of topological properties, namely, hand-crafted features and node embeddings. For the first type, properties of the nodes in V_G such as degree, average neighbor degree, centrality, and eccentricity are computed. Additionally, for each class C in T', two features are computed to represent the probability of a node being associated to C and its parent in H based on the information of the neighborhood. For C and its parent, a node and its neighbors, and the associations between the neighbors and both classes, these new features represent the ratio between the number of neighbors associated to each class and the total number of neighbors. For the second type of properties, continuous representations capturing the characteristics of the nodes in G (i.e., node embeddings) are computed using node2vec [14].
Figure 2: The approach is split into three stages: data pre-processing, class prediction, and performance evaluation. The approach is applied for every resulting sub-hierarchy of H independently. Ancestral relations between classes are included in the model as features with the prediction of ancestors and are represented by the upward arrow in the prediction stage. In addition, a correction mechanism for inconsistencies is included by means of cumulative probabilities, which are computed according to the path of classes in the sub-hierarchy. If the probability of association between a node and a class is close to zero, then the cumulative probability of the association between the same node and the descendant classes will be close to zero as well.
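The two neighborhood-based features can be sketched as follows. The adjacency and label dictionaries, the gene identifiers, and the function names neighbor_ratio and class_features are hypothetical; the value assigned to isolated nodes is an assumption made here, since the text does not specify it.

```python
def neighbor_ratio(adjacency, labels, node, cls):
    """Fraction of the neighbors of a node that are associated to class cls."""
    neighbors = adjacency.get(node, ())
    if not neighbors:
        return 0.0  # assumption: isolated nodes get a ratio of zero
    hits = sum(1 for u in neighbors if cls in labels.get(u, ()))
    return hits / len(neighbors)


def class_features(adjacency, labels, node, cls, parent_cls):
    """Neighborhood features of a node for a class and for the parent of that class."""
    return (
        neighbor_ratio(adjacency, labels, node, cls),
        neighbor_ratio(adjacency, labels, node, parent_cls),
    )


# Hypothetical network: gene g1 has four neighbors with known labels
adjacency = {"g1": {"g2", "g3", "g4", "g5"}}
labels = {"g2": {"A", "B"}, "g3": {"A"}, "g4": set(), "g5": {"A", "B"}}
features = class_features(adjacency, labels, "g1", "B", "A")
```

For g1, two of four neighbors carry class B and three of four carry its parent A, which gives the feature pair (0.5, 0.75).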

Hierarchical classification model
Stage 1: hierarchical classification. This stage comprises a top-down approach combining different supervised machine learning techniques/tools. It builds prediction classifiers for each sub-hierarchy T' independently. The approach uses stratified k-fold cross-validation, the Synthetic Minority Over-sampling Technique (SMOTE) [6], hyper-parameter tuning [3], and a binary classifier (e.g., XGBoost [8] or graph convolutional networks [18]). These techniques are combined sequentially in a pipeline, which is used iteratively from the root to the leaves of each sub-hierarchy T'. Note that, since the top-down approach builds a different classifier for each class in the sub-hierarchy, the proposed model can be used for multi-class and multi-label problems. As a result, nodes can be independently associated to multiple classes.
The combination of the above-mentioned techniques/tools makes up the core of the approach, and each technique has a different objective. Stratified k-fold cross-validation aims to mitigate overfitting by randomly splitting the dataset into k disjoint subsets (folds) in which the distribution of the labels is similar. In this approach, 5 folds are used for cross-validation, that is, the train-test ratio is 80/20. Over-sampling aims to overcome learning bias by handling imbalanced datasets for underrepresented classes. SMOTE synthesizes new examples of the minority class from the existing ones. Hyper-parameter tuning aims to improve the performance of the prediction by optimizing parameters of the classifier such as, e.g., learning rate, number of estimators, and maximum depth of trees.
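To illustrate what stratification means for the 80/20 split, the following Python sketch deals the samples of each label round-robin into k folds. It is a simplified stand-in written for this example (the function name stratified_folds is hypothetical); the text does not state which implementation was used, and library implementations such as StratifiedKFold in scikit-learn serve the same purpose.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k disjoint folds with similar label proportions."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)  # deal each label round-robin
    return folds


# Imbalanced binary labels: 100 positive and 900 negative samples
labels = [1] * 100 + [0] * 900
folds = stratified_folds(labels, k=5)
test_fold = folds[0]
train = [i for fold in folds[1:] for i in fold]
```

With 5 folds, each test fold holds 200 samples (20% of the data), 20 of which are positive, so every fold preserves the 1:9 label ratio of the full dataset.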
Two types of classifiers were used, namely, XGBoost [8] gradient boosting decision trees and graph convolutional networks [18]. XGBoost was chosen for interpretability [11,28] and graph convolutional networks for state-of-the-art performance. In general, any other binary classifier can be used in this stage. The typical parameter values used for XGBoost classifiers are: gbtree booster, area under the precision-recall curve (aucpr) as evaluation metric, learning rate (eta) of 0.05, maximum tree depth (max_depth) of 6, subsample ratio (subsample) of 0.9, and minimum sum of instance weight in a child (min_child_weight) of 3. For the graph convolutional networks, the implementation by [9] was used with the following parameters: 16 layers of 16 units each, ReLU activation function, dropout rate of 50%, learning rate of 0.01, and binary cross-entropy loss function.
Further details of the implementation can be found in the repository https://github.com/migueleci/node_classification.
Classifiers for each class in T' are built independently, so that there is no relation between their predictions. Including information from the ancestors of a class C into its classifier is not enough to avoid inconsistent predictions. For this reason, a correction mechanism is included in this stage. Since ensuring the true-path rule is key in the proposed approach, this stage computes cumulative probabilities along the paths of classes in T'. Namely, the probability of association between a node v and C is directly related to the predicted probabilities of the node being associated to all ancestors of C. Intuitively, the principle is as follows: if the probability of association of C to v is close to zero, then the probability of association for all descendant classes of C to v will be close to zero as well. The main consequence of enforcing the principle is that the classification computed from the cumulative probability satisfies the true-path rule and removes the inconsistencies in the prediction.
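One way to realize this principle is sketched below in Python. The text does not give the exact formula, so the sketch assumes that the cumulative probability of a class is the product of its own predicted probability and those of all its ancestors; the function name cumulative_probabilities, the example tree, and the probability values are hypothetical.

```python
def cumulative_probabilities(local, tree_parent):
    """Multiply the local probability of each class by those of all its ancestors.

    local: dict class -> probability predicted by the classifier of that class.
    tree_parent: dict child -> parent for one sub-hierarchy (the root is absent).
    """
    cumulative = {}

    def resolve(c):
        if c not in cumulative:
            parent = tree_parent.get(c)
            above = resolve(parent) if parent is not None else 1.0
            cumulative[c] = local[c] * above  # assumed product rule
        return cumulative[c]

    for c in local:
        resolve(c)
    return cumulative


# Hypothetical sub-hierarchy: A is the root, B and D are its children, C is below B
tree_parent = {"B": "A", "C": "B", "D": "A"}
local = {"A": 0.75, "B": 0.125, "C": 0.875, "D": 0.5}
cumulative = cumulative_probabilities(local, tree_parent)

# A single threshold on the cumulative values yields a consistent prediction
predicted = {c for c, p in cumulative.items() if p >= 0.3}
```

In this example the local predictions are inconsistent, because C has a high probability while its parent B has a low one. Under the product assumption, the cumulative value never increases along a path from the root, so applying one common threshold cannot select a class without selecting its ancestors.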
Stage 2: performance evaluation. This stage comprises the evaluation of the metrics used for measuring the prediction performance of the classifiers. Performance evaluation focuses on recall (true positive rate) and precision scores. It also evaluates the precision-recall curve instead of the accuracy, loss, or ROC curves. This is mainly because datasets are often imbalanced (w.r.t. the positive class in a binary classification), thus both positive and negative classes of the binary classifier need to be analyzed separately. Recall and precision scores are computed from the predicted cumulative probabilities as a function of the optimum threshold, which is defined as the threshold that maximizes the F1 score from the precision-recall curve for the cumulative probabilities.
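The selection of the optimum threshold can be illustrated with a short Python sketch that scans the observed scores as candidate thresholds and keeps the one with the highest F1 score. The function name best_f1_threshold and the example data are hypothetical; in practice, a library routine such as precision_recall_curve in scikit-learn can supply the candidate thresholds.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, precision, recall, f1) maximizing F1 over observed scores."""
    best = (None, 0.0, 0.0, 0.0)
    for t in sorted(set(scores)):
        pairs = list(zip(y_true, scores))
        tp = sum(1 for y, s in pairs if s >= t and y == 1)
        fp = sum(1 for y, s in pairs if s >= t and y == 0)
        fn = sum(1 for y, s in pairs if s < t and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[3]:
            best = (t, precision, recall, f1)
    return best


# Hypothetical cumulative probabilities for eight nodes and one class
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
threshold, precision, recall, f1 = best_f1_threshold(y_true, scores)
```

For these data the best threshold is 0.6, at which all three positives are recovered with one false positive (precision 0.75, recall 1.0).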

Gene Function Prediction
This section presents a case study on the prediction of gene functions (i.e., biological processes in which genes are involved) for the Oryza sativa Japonica rice variety. First, the problem of predicting gene functions is introduced. Then, the results of applying the approach proposed in Section 2 to this problem are described. The probabilistic approach proposed in [16] is used as a baseline for comparison.

Gene Co-expression Networks
High-throughput sequencing technologies have enabled the identification of numerous genes and gene products. However, biological processes in which many such genes are involved remain largely unknown (i.e., relations between genes and biological processes have not been comprehensively validated through in vivo experimentation) [27]. Identifying the functions of genes is key to enhancing the understanding of how to characterize the genome of a particular organism. In general, traditional in silico approaches to predict gene functions consider each function as an independent class. The task is generally defined as a binary classification problem based on gene expression.
Genes (or gene products) can be associated to more than one biological process and such processes may be related (e.g., by ancestral relationships). The assignment of functions to genes obeys the true-path rule [32]. Consequently, efforts to predict whether a gene is associated to a particular function should consider the ancestral relations of that function. Ignoring such a hierarchical structure leads to biological inconsistencies in the outcome of the prediction. In contrast, when a gene is associated to multiple biological processes without ancestral relations, the prediction is done for each one of the functions independently.
As an example, consider two biological processes in which a gene may be involved: response to external stimulus and detection of light stimulus. According to the hierarchy of biological processes in Figure 3, the former function is an ancestor of the latter [2]. By the true-path rule, if a gene is associated to detection of light stimulus, it must be associated to response to external stimulus.
In addition, a common approach to integrate large volumes of transcriptional data and synthesize the hierarchical structure of gene functions is to characterize gene co-expression networks (GCNs). GCNs have been used to infer biological processes and pathways based on highly correlated expression patterns between genes [24,33,34]. It is well-known that co-expressed genes, i.e., genes with similar expression profiles, tend to share the same function or be related to the same regulatory pathway [12,30,36].

Predicting Gene Functions in Oryza sativa Japonica
The goal of this case study is to predict gene functions, that is, the biological processes in which some genes are involved. The problem is tackled by using the model proposed in Section 2 on the GCN of Oryza sativa Japonica [23] and a hierarchy of biological processes for this organism [29]. The computational experiments supporting the results in this section have been executed in a cluster with 5 nodes, each one with 64GB of memory and an AMD Opteron™ Processor 6376 with 64 CPU cores. Formally, a gene co-expression network is represented as an undirected, weighted graph G = (V_G, E_G, f), built from empirical data, where genes are represented by nodes V_G, edges E_G denote co-expression relationships, and the weight f : E_G → R_{≥0} measures the level of co-expression between genes. Additionally, the graph H = (V_H, E_H) is a directed acyclic graph (DAG) which represents the hierarchical organization of biological processes, where E_H represents the ancestral relations between functions. Genes are associated to one or more biological processes through a function φ : V_G → 2^{V_H}, where V_H denotes the set of all biological processes. The predictive model combines the existing set of labels in φ, topological properties of G, and the hierarchical information of H to obtain a new labeling function using the hierarchical multi-label classification approach. As a result, the new labeling function contains suggestions of previously unidentified associations between genes and functions satisfying the true-path rule.
The set of known associations between genes and functions used in this work contains 19 663 rice genes, 550 813 co-expression relations, 3 743 biological processes, 220 598 assignments of functions to genes, and 7 185 ancestral relations of functions (all biological processes belong to the same hierarchy). To avoid overfitting and learning bias in the proposed model, only those functions associated to more than 4 and at most 300 genes are considered [16]. Under this criterion, 1 938 functions (52%) are used for prediction. As a result, the function hierarchy breaks down into 27 sub-hierarchies, of which 12 correspond to isolated functions or small sub-hierarchies (fewer than 7 functions). The 15 remaining sub-hierarchies are described in Table 1, sorted from the smallest to the largest in terms of the number of functions |V_{T'}| and the number of genes associated with each of them.

Table 1. Columns: Root, Func, Genes, Desc.
The prediction performance of the proposed approach is compared with the HBN model presented in [16]. The HBN model uses a top-down approach that integrates relational data of a protein-protein interaction (PPI) network with the hierarchical data of biological processes with the objective of predicting protein functions. For this case study, the HBN model is adapted to the problem of predicting gene functions based on GCNs. To predict the probability of a gene g being associated to function A, the local neighborhood information of g in the GCN and the ancestors of A in the hierarchy are considered. The HBN model computes the probability of gene g being associated to function A obeying the true-path rule.
The figures in this section show the mean performance of the proposed approach and the HBN model across multiple experiments, in which each experiment represents the mean performance across the k folds used for cross-validation. In all of them, the variation (error bar or standard deviation) is not included because it is negligible (and can add visual noise to the plots). Figure 4 illustrates the performance of the proposed approach using XGBoost (XGB) and graph convolutional network (GraphCN) classifiers and the HBN model measured with the area under the ROC curve and the average precision score. Note that their performance seems to be similar in most sub-hierarchies and it is not possible to conclude which one performs better from Figure 4. However, since only biological processes associated to more than 4 and at most 300 genes are considered (less than 2% of the genes in the GCN), datasets generated for the filtered biological processes are highly imbalanced. For this reason, the area under the ROC curve is not suitable for the case study (this measure is biased towards the over-represented class in the classification task), and the analysis should focus on other metrics such as recall and F1 score instead.
A clear difference between the performance of the proposed approach and the HBN model is observed when the confusion matrices are analyzed. Figure 5 shows the true positive rate (or the measure of recall) and the true negative rate for the 15 sub-hierarchies. Note that the true positive rate of the proposed approach is higher than that of the HBN model for most of the sub-hierarchies, whereas the true negative rate of the HBN model is higher for all sub-hierarchies. However, the HBN model is biased towards the negative class because the probability predicted by the HBN model for most of the associations between genes and functions is close to zero. As the datasets are highly imbalanced, the performance in terms of the positive class is key to determine which approach is adequate. Recall that a dataset is said to be imbalanced for binary classification if one of the classes is under-represented in relation to the other one, i.e., the number of instances related to one class is much higher than the number of instances related to the other. For example, a dataset with 1 000 instances that has 900 negative and 100 positive samples is imbalanced.
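The effect of imbalance on the metrics can be seen on the 900/100 example with a few lines of Python. The sketch evaluates a hypothetical classifier that always predicts the negative class; the function name rates is chosen only for this illustration, and the classifier is not meant to represent either of the compared models.

```python
def rates(y_true, y_pred):
    """True positive rate, true negative rate, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "tnr": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }


# 100 positive and 900 negative samples, and a classifier that always says "negative"
y_true = [1] * 100 + [0] * 900
always_negative = [0] * 1000
result = rates(y_true, always_negative)
```

Such a classifier reaches an accuracy of 0.9 and a true negative rate of 1.0 while identifying no positive association at all (true positive rate 0.0), which is why the analysis centers on the positive class.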
The true positive rate illustrated in Figure 5 shows that the proposed approach outperforms the HBN model in the identification of the (positive) associations between genes and functions. The performance varies between XGBoost and graph convolutional networks, but both classifiers have better overall performance than the HBN model. The results suggest that graph convolutional networks are better for small sub-hierarchies, while XGBoost is better for larger ones. Even though the true negative rate of the HBN model is close to 1 for all sub-hierarchies, as illustrated in Figure 5, the performance of the proposed approach in terms of the harmonic mean of recall and precision (i.e., F1 score) is better than that of the HBN model. Figure 6 presents the F1 score of the proposed approach and the HBN model for the 15 sub-hierarchies. In this case study there is no observable correlation between the size/depth/span of a hierarchy and the prediction performance, according to the experiments. This is consistent with the overall computational complexity of the algorithms. Likewise, there is no experimental evidence to suggest that some degree of correlation exists between the number of labeled nodes and the prediction performance. However, these observations need to be further investigated with other case studies.
Finally, the execution time of the proposed approach and the HBN model is illustrated in Figure 7. The execution time for the graph convolutional network classifier is not included because the experiments were executed on CPUs rather than GPUs. It is known that neural networks run much faster on GPUs; thus, it would not be fair to make a comparison with the available data. Note that the execution time is measured in seconds and plotted on a logarithmic scale. Except for the smallest sub-hierarchy (GO:0040007), the execution time of the proposed approach, using the XGBoost classifier, is better than that of the HBN model. On average, the execution time of the HBN model is approximately 4 times that of the proposed approach.

Conclusion and Future Work
By combining different techniques from machine learning, the hierarchical multi-label classification model presented in this paper introduces an approach to address the node classification problem for scenarios in which nodes can have attributes obeying a hierarchical organization.Taken into account hierarchical dependencies is shown to be a key aspect for obtaining more consistent predictions that satisfy the true-path rule.
A baseline comparison between the proposed approach using two different classification methods, namely, gradient boosting decision trees and graph convolutional networks, and the HBN model introduced by [16] is presented. Both approaches are applied to the problem of predicting gene function in the rice variety Oryza sativa Japonica. The proposed hierarchical multi-label classification approach outperforms the HBN model in two aspects. First, using topological information of the network is a key feature to obtain the overall best prediction performance. In such a setting, the true positive rates of the proposed approach are significantly higher than those of the HBN model, whereas the true negative rates yield similar values (close to one). This result suggests that the proposed approach can lead to good prediction of associations between genes and functions in Oryza sativa Japonica and, potentially, in other organisms.
For scenarios in which the classes of the hierarchy are under-represented, i.e., datasets are imbalanced, it is important to center the performance analysis on metrics that are not biased by the imbalanced dataset. Such metrics include the true positive rate (or recall), the true negative rate, and the F1 score. Other widely used metrics, like the area under the ROC curve and the average precision, can be misleading for evaluating the performance of a classifier under such conditions. Second, the proposed approach with the XGBoost classifier runs, on average, approximately 4 times faster than the HBN model. The reduction in computational cost of the proposed top-down approach can be attributed to the fact that it predicts the probability of associations between a class and every node of the network at the same time. Also, the efficient transformation of the DAG into a tree helps in making the proposed approach relevant for analyzing larger networks and hierarchies.
Finally, although the performance of the proposed approach is promising, it requires gathering sufficient information about node classes, which in some cases is incomplete or unavailable. For example, information about gene functions is limited for many genes and gene products. For some organisms there is no such information available at all. The shortage of information may lead to over-fitting or learning bias in the approach, and consequently to misleading conclusions. Including other networks as additional sources of information for the classification problem is an interesting direction for future work. Other networks can be added with the help of transfer learning techniques, for example, by creating new features that aggregate the information extracted from other networks and that can be integrated into the proposed approach as additional input to improve the prediction performance. Furthermore, other approaches such as semi-supervised and transductive learning can also be considered for future work to handle the amount of data required for training.

Fig. 1: A. The classification approach gets as input a network with a node attribute, a set of known associations between nodes and classes, and the hierarchy of ancestral relations represented as a DAG. Note that there are more nodes associated to class B than to class C. B. The DAG representation of the hierarchy is transformed into a tree using a topological traversal algorithm, based on the distribution of the classes in the network. Since classes B and C are both ancestors of E, and the ratio of nodes associated to E and C is higher than the ratio for E and B (w(C, E) > w(B, E)), the algorithm removes edge (B, E) and returns a tree.
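The edge selection described in this caption can be sketched as follows. This is a simplified illustration, not the algorithm of the paper: it assumes that the weight w(p, c) is the number of nodes associated to the child class c divided by the number associated to the parent class p, and it keeps, for every class with several parents, only the parent with the highest weight. The function name `dag_to_tree`, the input dictionaries, and the counts are hypothetical. Because each choice depends only on a class and its direct parents, the sketch does not need an explicit topological traversal.

```python
def dag_to_tree(parents, counts):
    """Keep one parent per class.

    parents: dict mapping each class to the list of its parent classes.
    counts:  dict mapping each class to the number of associated nodes.
    Returns a dict mapping each class to its single parent (None for a root).
    """
    def weight(parent, child):
        # Assumed definition of w(parent, child); see the lead-in above.
        return counts[child] / counts[parent] if counts[parent] else 0.0

    tree = {}
    for child, candidates in parents.items():
        if not candidates:
            tree[child] = None  # root of the hierarchy
        else:
            tree[child] = max(candidates, key=lambda p: weight(p, child))
    return tree


# Hypothetical hierarchy shaped like the one in the caption:
# E has two parents, B and C, and more nodes are associated to B than to C.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["B", "C"]}
counts = {"A": 100, "B": 60, "C": 30, "D": 20, "E": 15}

tree = dag_to_tree(parents, counts)
print(tree)
```

With these counts, w(C, E) = 15/30 is larger than w(B, E) = 15/60, so the edge (B, E) is dropped and E keeps C as its only parent.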

Fig. 2: Framework of the hierarchical multi-label classification approach. The approach is split into three stages: data pre-processing, class prediction, and performance evaluation. The approach is applied to every resulting sub-hierarchy H independently. Ancestral relations between classes are included in the model as features with the prediction of ancestors and are represented by the upward arrow in the prediction stage. In addition, a correction mechanism for inconsistencies is included by means of cumulative probabilities, which are computed according to the path of classes in the sub-hierarchy. If the probability of association between a node and a class is close to zero, then the cumulative probability of the association between the same node and the descendant classes will be close to zero as well.
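The correction mechanism mentioned in this caption can be illustrated with a minimal sketch. It assumes that the cumulative probability of a class is the product of the local classifier outputs along the path from the root of the sub-hierarchy to that class; the caption only states that the computation follows the path, so the product is an assumption made here. The function name `cumulative_probabilities` and the probabilities are hypothetical.

```python
def cumulative_probabilities(tree, local_probs):
    """Combine local predictions along root-to-class paths for one node.

    tree:        dict mapping each class to its parent (None for the root).
    local_probs: dict mapping each class to the local classifier output.
    Returns a dict mapping each class to its cumulative probability.
    """
    cumulative = {}

    def resolve(cls):
        # Memoized walk towards the root; assumes `tree` has no cycles.
        if cls not in cumulative:
            parent = tree[cls]
            upstream = 1.0 if parent is None else resolve(parent)
            cumulative[cls] = upstream * local_probs[cls]
        return cumulative[cls]

    for cls in tree:
        resolve(cls)
    return cumulative


# Hypothetical sub-hierarchy and local predictions for a single node.
tree = {"A": None, "B": "A", "C": "A", "E": "C"}
local_probs = {"A": 0.9, "B": 0.8, "C": 0.05, "E": 0.95}

cum = cumulative_probabilities(tree, local_probs)
for cls, p in cum.items():
    print(f"{cls}: {p:.5f}")
```

Here the local classifier for E is confident, but the low probability for its parent C drives the cumulative probability of E close to zero, so a class never scores higher than its ancestors and the prediction stays consistent with the true-path rule.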

Fig. 7: Execution time of the hierarchical multi-label classification approach with the XGBoost (XGB) classifier and the HBN model for the prediction of the 15 sub-hierarchies. The execution time is measured in seconds and plotted on a logarithmic scale.

Table 1: Sub-hierarchies generated for the gene co-expression network of Oryza sativa Japonica.

Prediction performance of the hierarchical multi-label classification approach with XGBoost (XGB) and graph convolutional network (GraphCN) classifiers, and the probabilistic model (HBN) for the 15 sub-hierarchies of Oryza sativa Japonica. Performance is measured with the area under the ROC curve and the average precision score. Note that the notation used for the graph convolutional network is GraphCN to distinguish it from the gene co-expression network (GCN).
True positive rate (or recall) and true negative rate of the hierarchical multi-label classification approach with XGBoost (XGB) and graph convolutional network (GraphCN) classifiers, and the HBN model for the 15 sub-hierarchies generated for Oryza sativa Japonica.