This section presents the approach for gene function prediction using HMC to create a predictor, enriched with the information of the features created in “Clustering-based feature extraction” section.

The GO hierarchy is defined as a directed acyclic graph (DAG) containing three main types of annotations: biological processes, molecular functions, and cellular component (Gene Ontology Consortium 2019). This work focuses on biological processes, i.e., a subgraph of the GO hierarchy that contains 28 roots (i.e., functions in the GO hierarchy with null indegree). This subgraph is denoted as \(H=(A,R)\), where *A* is the set of biological processes and *R* the binary relation representing ancestral relations between pairs of biological processes (i.e., \((a,b)\in R\) means that function *b* is ancestor of function *a* in the GO hierarchy). The topological-sorting traversal algorithm presented in Romero et al. (2022) is used to transform the GO hierarchy of biological processes into a tree. As a result, the hierarchy is split into several components, i.e., subtrees of *H* called sub-hierarchies. Each sub-hierarchy, \(H'=(A',R')\) with \(A'\subseteq A\), \(R'\subseteq R\), and \(r\in A'\) the root, is associated to a subgraph \(G'=(V',E',w)\) containing all genes \(v\in V\) associated to *r*, i.e., \(V'=\phi ^{-1}(r)\). Note that, the proposed approach is independently applied to each sub-hierarchy.

The inputs of the approach are a sub-hierarchy \(H'=(A',R')\), a subgraph of the GCN, denoted by \(G'=(V',E',w)\), where \(V'\subseteq V\) and \(E'\subseteq E\), an annotation function \(\phi :V\rightarrow 2^{A'}\), the matrices \(J_G\) and \(J_F\) resulting from “Clustering-based feature extraction” section, and a constant value \(c\in [0,1]\) for feature selection. The output is a function \(\psi : V' \times A'\rightarrow [0,1]\), specifying, for each gene \(v \in V'\), the probability \(\psi (v,a)\) of *v* being associated to function \(a\in A'\).

First, sub-matrices \(J_G'\) and \(J_F'\) are created from \(J_G\) and \(J_F\), by respectively considering only the genes \(V'\subseteq V\) and functions \(A'\subseteq A\). These sub-matrices represent structural properties of the GCN subgraph \(G'\), and associations between genes and functions based on multiple partitions of each graph. Figure 3 illustrates the prediction approach. The reminder of this section is devoted to detailing the prediction approach.

SHAP filters the extracted features with more impact in the prediction task, and HMC is used to predict associations between genes and functions without inconsistencies (i.e., complying the true-path rule). Since local HMC methods use more than one predictor per hierarchy, the feature selection is executed for each predictor independently, considering only the features related to the functions being predicted, denoted by \(A''\subseteq A'\). For example, consider the function hierarchy and a local classifier per level method depicted in Fig. 4. The predictor for level 1 predicts functions *a* and *b*, so only the features associated to functions *a* and *b* are considered for the feature selection.

### Feature selection

The aim of feature selection is to produce a matrix \(J : V' \times \Theta (c) \rightarrow [0, 1]\) by selecting a reduced number of significant features from \(J_G'\) and \(J_F'\). The number of selected features is denoted by \(0 \le \Theta (c)\le 2m\cdot |A''|\), where \(m\cdot |A''|\) is the number of features in each matrix \(J_G'\) and \(J_F'\), denoted as *q* (that is \(q=m\cdot |A''|\)).

Feature selection is conveyed from \(J_G'\) and \(J_F'\) to *J* using SHAP. Let \(J_{G+F}'\) denote the matrix resulting from extending \(J_G'\) with the *q* features of \(J_F'\). That is, for each \(v \in V'\), the expression \(J_{G+F}'(v, \_)\) denotes a function with domain [0, 2*q*) and range [0, 1], where the values in [0, *q*) denote the *p*-values associated to *v* in *G* and the values in [*q*, 2*q*) the ones associated to *v* in the enriched version of *G*. For each entry \(J_{G+F}'(v, j)\), with \(v \in V'\) and \(0 \le j < 2q\), the mean absolute SHAP value \(s_{(v,j)}\) is computed after a large enough number of Shapely values are computed (executions of SHAP). Features are selected based on the cutoff

$$\begin{aligned} c \cdot \sum _{j=0}^{2q-1}s_{(v,j)}, \end{aligned}$$

i.e., on the sum of mean absolute values by a factor of the input constant *c*. The first \(\Theta (c)\) features, sorted from greater to lower mean absolute SHAP value, are selected as to reach the given cutoff.

Note that the input constant *c* is key for selecting the number of significant features. The idea is to set *c* so as to find a balance between prediction efficiency and the computational cost of building the predictor.

### Training and prediction

This stage comprises a process that combines two supervised machine learning techniques/tools to build the predictor \(\psi\). In particular, stratified *k*-fold cross-validation and hierarchical multi-label classification are used sequentially in a pipeline.

The pipeline takes as input the matrix *J*, which specifies the significant features of \(J_G'\) and \(J_F'\), the sub-hierarchy \(H'\) and the annotation function \(\phi\). First, *k*-fold is applied to split the dataset into *k* different folds for cross validation (note that *k* is not related to the input *K*). That is, each fold is used as a test set, while the remaining \(k-1\) folds are used for training. Recall that *k*-fold cross validation aims to overcome overfitting in training. Furthermore, one or multiple random forest classifiers are build and used for prediction, the number of classifiers depends on the HMC method. Randoms forest is selected for this approach since it is a tree-based and multi-label classification algorithm, which is interpretable (SHAP can be applied). The parameter values used for random forest classifiers, differently from the default scikit-learn values, are: 200 estimators (*n_estimators*) and minimum number of samples of 5 (*min_samples_split*).

Additionally, some HMC methods require an extra step to keep prediction consistent w.r.t. the sub-hierarchy \(H'\) (i.e., comply the true-path rule). The probability of association between a function \(v\in V'\) and a function \(a\in A'\) must be lower than the probability of association between the same gene and the ancestor of *a* in \(H'\). To satisfy this constraint cumulative probabilities are computed throughout the paths in \(H'\). That is, for each gene \(v\in V\) and functions \((a,b)\in R\), the predicted probability of the association between *v* and *a* is multiplied by the predicted probability of association between *v* and *b* (its ancestor). This process is repeated for every path in the hierarchy from the root to the leaves.

The output of this stage is the predictor \(\psi\), i.e., the probabilities of associations between the genes in \(V'\) and functions \(A'\). Note that the predictor \(\psi\) satisfies the true-path rule.

### Performance evaluation

It is often the case in HMC datasets that individual classes have few positive instances. In genome annotation, typically only a few genes are associated to specific functions. This implies that for most classes (deeper in the hierarchy), the number of negative instances by far exceeds the number of positive instances. Hence, the real focus is recognizing the positive instances (predict associations between genes and functions), rather than correctly predicting the negative ones (predict that a function is not associated to a given gene). Although ROC curves are better known, their area under the curve is higher if a model correctly predicts negative instances, which is not suitable for HMC problems.

For this reasons, the measures (based on the precision-recall (PR) curve) introduced by Vens et al. (2008) are used for evaluation.

#### Area under the average PR curve

The first metric transforms the multi-label problem into a binary one by computing the precision and recall for all functions \(A'\) together. This corresponds to micro-averaging the precision and recall.

The output of the prediction stage are the probabilities of associations between genes \(V'\) and functions \(A'\). Thereby, instead of selecting a single threshold to compute precision and recall, multiple thresholds are used to create a PR curve. In the PR curve each point represent the precision and recall for a give threshold that can be computed as:

$$\begin{aligned} \overline{\text {Prec}} = \frac{\sum _{i}TP_i}{\sum _{i}TP_i+\sum _{i}FP_i}, \quad \text {and} \quad \overline{\text {Rec}}=\frac{\sum _{i}TP_i}{\sum _{i}TP_i+\sum _{i}FN_i}. \end{aligned}$$

Note that *i* ranges over all functions \(A'\), i.e., precision and recall are computed for all functions together. The area under this curve is denoted as AU(\(\overline{\text {PRC}}\)).

#### Average area under the PR curves

The second metric corresponds to the (weighted) average of the areas under the PR curves for all functions \(A'\). This metric, referred as macro-average of precision and recall, can be computed as follows:

$$\begin{aligned} \overline{\text {AUPRC}}_{w_1,w_2,\dots ,w_{|A'|}} = \sum _{i}w_i\cdot \text {AUPRC}_i. \end{aligned}$$

If the weights of all functions are the same (i.e., \(1/|A'|\)) the metric is denoted as \(\overline{\text {AUPRC}}\). In addition, weights can also be defined based on the number of genes associated to functions in \(\phi\), i.e., \(w_a=|\phi ^{-1}(a)|/\sum _i |\phi ^{-1}(i)|\) for \(a\in A\). In the later case, denoted as \(\overline{\text {AUPRC}_w}\), more frequent functions get higher weight. Note that one point in the weighted PR curve corresponds to the (weighted) average of the AUPRC of all functions \(A'\) given a threshold.