Feature extraction using Spectral Clustering for Gene Function Prediction using Hierarchical Multi-label Classification

Gene annotation addresses the problem of predicting unknown associations between gene and functions (e.g., biological processes) of a specific organism. Despite recent advances, the cost and time demanded by annotation procedures that rely largely on in vivo biological experiments remain prohibitively high. This paper presents a novel in silico approach for to the annotation problem that combines cluster analysis and hierarchical multi-label classification (HMC). The approach uses spectral clustering to extract new features from the gene co-expression network (GCN) and enrich the prediction task. HMC is used to build multiple estimators that consider the hierarchical structure of gene functions. The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world. The results illustrate how in silico approaches are key to reduce the time and costs of gene annotation. More specifically, they highlight the importance of: (i) building new features that represent the structure of gene relationships in GCNs to annotate genes; and (ii) taking into account the structure of biological processes to obtain consistent predictions.


Introduction
Identifying the association of genes to functions is key to gain insight into how genomes serve as blueprints for life, e.g., to develop treatments for specific conditions or enhance tolerance to environmental stresses [29,34,37]. Numerous studies have used co-expression data to predict specific biological functions and processes [23,26,31,33]. Intuitively, genes are reported to co-express whenever they are simultaneously active, which suggests that they are associated to one or more common biological processes.
Under this hypothesis, characterizing gene interactions as a gene coexpression network (GCN) may assist to identify unknown functional annotations in a genome. Co-expression networks are generally represented as undirected weighted graphs, where vertices denote genes and weighted edges indicate the strength of the co-expression between two genes. A detailed analysis of the structure and distribution of gene relationships in GCNs provides additional clues that facilitate the prediction of gene functions [32].
However, the cost and time requirements to annotate genes using in vivo biological experimentation remains prohibitively high [4,42]. To overcome this limitation, hybrid approaches that integrate existing knowledge of gene-function associations and in silico methods have been proposed [5,7,17,27]. While they have shown great promise, given the extreme combinatorial nature of the problem, annotating genes in an efficient manner remains an open challenge.
Functional annotations are defined by the Gene Ontology (GO), which contains three main types of annotations: biological processes, molecular functions, and cellular component [8]. These annotations, commonly known as GO terms, are structured in a hierarchy and defined as a directed acyclic graph (DAG). Gene annotation approaches generally ignore the relationships among biological processes, even though these relationships are key to improve the accuracy and avoid inconsistency in predictions. A prediction is said to be inconsistent w.r.t. the GO hierarchy when a gene is inferred to have a particular function a, but it is not inferred to have all ancestor of a. In other words, an inconsistent prediction states that the prediction does not satisfy the ancestral relations between GO terms. Satisfying ancestral constraints is often referred to as the true-path rule in GO [32,1] and as the hierarchical constraint in HMC [35].
This paper presents a feature extraction approach for in silico annotation of genes. It follows a network-based approximation that uses cluster analysis and hierarchical multi-label classification (HMC) for building a predictor that assigns functions to genes satisfying the true-path rule. Cluster analysis plays the role of enriching the information available for predicting gene-function associations by extracting new features that represent structural properties of the GCN. That is, co-expression relations are used to identify gene clusters that ultimately help in associating functions to genes (i.e., guilt by association, see [24]). It has been shown in [28] that new features built from the GCN and associations between genes and functions with the spectral clustering algorithm are key to improve the prediction performance in the gene annotation problem. The results in [28] show that using other features associated to structural properties of the GCN and gene functional information lead to lower performance.
Furthermore, the extracted features are filtered (using SHAP) based on their impact in the prediction task and HMC is used to predict genefunction associations that take into account the relations between biolog-ical functions. The proposed approach illustrates how the performance of gene annotation is improved by combining: (i) new information extracted from the GCN; and (ii) classification methods that consider the relation between gene functions. This approach is applied to a case study on Zea mays, one of the most dominant and productive crops. Zea mays serves a variety of purposes, including animal feed and derivatives for human consumption and ethanol [41]. The co-expression information used in the study is imported from the ATTED-II database [21]. The resulting GCN, modeled as a weighted graph, comprises 26 131 vertices (i.e., genes) and 44 621 533 edges. The functional information (i.e., known gene-function associations) is taken from DAVID Bioinformatics Resources [10]. It contains a total of 255 865 annotations of biological processes for maize, i.e., pathways to which a gene contributes. The results highlight the importance of extracted features that represent structural properties of the GCN and the hierarchical structure of biological processes with HMC to improve prediction performance. Ultimately, the results provide experimental (in silico) evidence that the proposed approach is a viable and promising approximation to gene function prediction.
This paper is a significant extended version of [28] that: -Addresses the gene function prediction as a hierarchical multi-label classification problem by considering the structure of gene functions. That is ancestral relationships are represented as a DAG [8]. -Analyzes a larger functional database for the case study of maize.
The number of genes associated to at least one function increased from 5 361 to 10 049. The new dataset consists of 255 865 associations between genes and functions, and 7 021 relations between functions. -Concludes that the ancestral relations between functions and the features extracted from the GCN improve the prediction performance in the gene function prediction task when addressed as a hierarchical multi-label classification problem.
The remainder of the paper is organized as follows. Section 2 reviews some preliminaries. Section 3 introduces the approach to extract features from the gene co-expression network using cluster analysis. The proposed approach to predict gene functions, based on hierarchical multi-label classification is presented in Section 4. Section 5 presents the case study for the Zea mays species. Finally, Section 6 draws some concluding remarks and future research directions.
This section presents preliminaries on spectral clustering, gene co-expression networks, gene function prediction, hierarchical multi-label classification, and SHAP feature contribution.

Spectral clustering
The aim of applying cluster analysis on a network is to identify groups of vertices sharing a (parametric) notion of similarity [39,25]. Usually, distance or centrality metrics are used for clustering. Spectral clustering is a clustering method with foundations in algebraic graph theory [11]. It has been shown that spectral clustering has better overall performance across different areas of applications [20]. Given a graph G, the spectral clustering decomposition of G can be represented by the equation L = D − A, where L is the Laplacian, D is the degree (i.e., a diagonal matrix with the number of edges incident to each node), and A the adjacency matrices of G. Spectral clustering uses, say, the n eigenvectors associated to the n smallest nonzero eigenvalues of L. In this way, each node of the graph gets a coordinate in R n . The resulting collection of eigenvectors serve as input to a clustering algorithm (e.g., k-means) that groups the nodes in n clusters.

Gene Co-expression Network
A gene co-expression network (GCN) is represented as an undirected graph where each vertex represents a gene and each edge the level of co-expression between two genes. Definition 1. Let V be a set of genes, E a set of edges that connect pairs of genes, and w : E → R ≥0 a weight function. A (weighted) gene co-expression network is a weighted graph G = (V, E, w).
The set of genes V in a co-expression network is particular to the genome under study. The correlation of expression profiles between each pair of genes is measured, commonly, using the Pearson correlation coefficient. Every pair of genes is assigned and ranked according to a relationship measure, and a threshold is used as a cut-off value to determine E. The weight function w denotes the strength of the co-expression between each pair of genes in V . For example, in the ATTED-II database, the co-expression relation between any pair of genes is measured as a z-score expressed as a function of the co-expression index LS (Logit Score) [21,22].

Gene Function Prediction
In an annotated gene co-expression network, each gene is associated with the collection of biological functions to which it is related (e.g., through in vivo experiments).
Definition 2. Let A be a set of biological functions. An annotated gene co-expression network is a gene co-expression network G = (V, E, w) complemented with an annotation function φ : The problem of predicting gene functions can be explained as follows. Given an annotated co-expression network G = (V, E, w) with annotation function φ, the goal is to use the information represented by φ, together with additional information (e.g., features of G), to obtain a function ψ : V → 2 A that extends φ. Associations between genes and functions not present in φ have either not been found through in vivo experiments, or do not exist in a biological sense. The new associations identified by ψ are a suggestion of functions that need to be verified through in vivo experiments. The function ψ can be built from a predictor of gene functions, e.g., based on a supervised machine learning model.

Hierarchical Multi-label Classification
Node classification refers to the task of predicting a node class for an input data based on the information of other nodes in the network [2]. In general, node classification problems can be categorize into three different types: binary classification refers to predict one attribute (target) with two classes (for example, positive and negative) [12]; multi-class classification refers to the case where the attribute to be predicted has more than two classes and are mutually exclusive (for example, the brand of a car) [18]; and multi-label classification refers to predicting an attribute with at least two classes, but where an instance could be associated to more than one class (for example, the gene function prediction problem) [36].
Although the aforementioned prediction methods are frequently used, they do not consider hierarchical relations between classes. For such scenarios, hierarchical multi-label classification (HMC) addresses the task of structured output prediction where the classes are organized into a hierarchy and an instance may belong to multiple classes. In many problems, such as gene function prediction, classes inherently satisfy these conditions [14]. Authors in [30] expose that there are two types of methods to explore the hierarchical structure. First, top-down or local classifiers refer to partially predict the classes in the hierarchy from the top to the bottom. Second, big-bang or global classifiers refer to use a single classifier that considers the entire hierarchy at once.
Classifiers that ignore the class relationships, by predicting only the leaf classes in the hierarchy or predicting each class independently, often lead to inconsistent predictions. This refers to the fact that a node is inferred to have a particular class a, but the outcome of the classifier fails to infer the node's association to all ancestor classes of a in the hierarchy. In other words, an inconsistent prediction states that the prediction does not satisfy the hierarchy for some class a. Satisfying ancestral constraints is often referred to as the true-path rule in GO [32,1] and as the hierarchical constraint in HMC [35]. Figure 1 illustrates the four HMC methods used in this work: Local classifier per node (lcn) consists of training one binary classifier for each class in the hierarchy except the root. Local classifier per parent node (lcpn) consists of training a multi-label classifier for each parent node in the hierarchy to distinguish between its child classes. Local classifier per level (lcl ) consists of training one multi-label classifier for each level of the class hierarchy except for the root. Global classifier consists of building a single multi-label classifier taking into account the hierarchy as a whole during a single run. The global classifier can assign classes at potentially every level of the hierarchy to an instance.

SHAP feature contribution
The performance of classification algorithms is partly determined by the features used to train a particular predictor. SHAP (SHapley Additive exPlanation) is a framework that computes the importance values for each feature in a dataset using concepts from game theory [15,16]. SHAP assigns Shapely values to explain which features in the model are the most important for prediction by calculating the changes in the prediction when features are conditioned. Given a predictor and a training set, SHAP computes a matrix with the same dimensions of the predictor's output containing the Shapely values for each instance and class. For example, in a binary classification problem and a training set of n instances, the output of SHAP is a matrix of dimension n × 2 (there are two classes, positive and negative). In multi-label classification problems, the output is a matrix of dimension n×2 for each class, since classes are not mutually exclusive and the outcome is either positive or negative for each class.

Clustering-based Feature Extraction
The approach for extracting features from the GCN using a clustering algorithm and Gene Ontology term enrichment is presented. It combines information from the GCN, and the associations between genes and functions to create features capturing topological properties of the GCN. The inputs of the approach are a GCN, denoted by G = (V, E, w), a set of (biological) functions A, an annotation function φ : V → 2 A , and a set K = {k 0 , . . . , k m−1 } for sampling the number of clusters. The annotation function φ must satisfy true-path rule for the GO hierarchy [1,32]. That is, if a gene is associated to a function, then it must also be associated to every ancestor of the function in the hierarchy, and if a gene is not associated to a function, then it must not be associated to any of its descendants.
The outputs are two feature matrices J G and J F , of dimension V × A · K → [0, 1], specifying the likelihood of the genes V to be associated to the functions in A when the graph is decomposed in m clusters. Matrices J G and J F correspond to the GCN (that is the graph G) and an affinity graph defined the next subsection.
The feature extraction approach consists of three stages, which are depicted in Figure 2. First, an affinity graph F with information in φ is created from G. Second, the spectral clustering algorithm is applied to both G and its enriched version F for the m different number of clusters specified in K. Third, the Gene Ontology term enrichment technique is used to create m features for each function a ∈ A, corresponding to the number of clusters in K.
The clustering-based feature extraction approach consists of three stages. Namely, creation of affinity graph, clustering computation, and Gene Ontology term enrichment. Its inputs are a GCN, denoted by G = (V, E, w), a set of functions A, an annotation function φ : V → 2 A , and a set K = {k0, . . . , km−1}. Its output are two feature matrices (for both G and its enriched version F ) of dimension V ×A·K → [0, 1] that specify how likely it is for the genes to be associated to the functions in A when the graph is decomposed m clusters, each of size ki, for 0 ≤ i ≤ m.

Affinity Graph Creation
An affinity graph F = (V, E, w F ) between G and φ is built. Its weight function is defined as the mean between the co-expression weight specified by w and the proportion of shared functions between genes specified by φ.
where max(w) denotes the maximum value in the range of w (which exists because w is finite).
Under the assumption that at least one element in the range of w is greater than 1, it is guaranteed that the range of ). This is indeed the case, in practice, because the co-expression between two genes in the GCN is quantified in terms of the z-score, which is highly unlikely to be 1 for all pairs of genes.

Gene Clustering
The spectral clustering algorithm is applied independently to each graph X ∈ {G, F } to decompose X (i.e., group the genes V ) using the number of clusters specified by K = {k 0 , . . . , k m−1 }. The decomposition of X is performed m times, once per k in K. The adjacency matrices of the weighted and undirected graphs G and F are used as the precomputed affinity matrices required for the spectral clustering algorithm. The outcome of the clustering algorithm is an assignment from nodes to clusters of size k, for each k ∈ K. More precisely, the outputs of this stage are the matrices

Gene Enrichment
The goal of this stage is to produce a matrix J X : V × A · K → [0, 1] for each X ∈ {G, F }, specifying how likely it is for the genes to be associated to every function a ∈ A when X is decomposed in the given number of clusters.
For each decomposition from the previous stage (i.e., each column of the matrices I X ) and function a ∈ A, the resulting clusters are used to compute whether a significant number of members associated to function a is (locally) present. Intuitively, if genes that are grouped together have a strong co-expression relation and most of the group are associated to gene function a, then the remaining genes are also likely to be associated to a (i.e., guilt by association, see [24]). In this way, for each v ∈ V , a ∈ A, and k ∈ K, the entry J X (v, a · k) is a p-value indicating if the function a is over-represented in the decomposition of k clusters of X. This process is commonly known as Gene Ontology term enrichment and may use different statistical tests, such as, Fisher's exact test [38].

Hierarchical Multi-label Classification for Gene Function Prediction
This section presents the approach for gene function prediction using HMC to create a predictor, enriched with the information of the features created in Section 3.
The GO hierarchy is defined as a directed acyclic graph (DAG) containing three main types of annotations: biological processes, molecular functions, and cellular component [8]. This work focuses on biological processes, i.e., a subgraph of the GO hierarchy that contains 28 roots (i.e., functions in the GO hierarchy with null indegree). This subgraph is denoted as H = (A, R), where A is the set of biological processes and R the binary relation representing ancestral relations between pairs of biological processes (i.e., (a, b) ∈ R means that function b is ancestor of function a in the GO hierarchy). The topological-sorting traversal algorithm presented in [27] is used to transform the GO hierarchy of biological processes into a tree. As a result, the hierarchy is split into several components, i.e., subtrees of H called sub-hierarchies. Each sub-hierarchy, H = (A , R ) with A ⊆ A, R ⊆ R, and r ∈ A the root, is associated to a subgraph G = (V , E , w) containing all genes v ∈ V associated to r, i.e., V = φ −1 (r). Note that, the proposed approach is independently applied to each sub-hierarchy.
The inputs of the approach are a sub-hierarchy H = (A , R ), a subgraph of the GCN, denoted by G = (V , E , w), where V ⊆ V and E ⊆ E, an annotation function φ : V → 2 A , the matrices J G and J F resulting from Section 3, and a constant value c ∈ [0, 1] for feature selection. The output is a function ψ : V × A → [0, 1], specifying, for each gene v ∈ V , the probability ψ(v, a) of v being associated to function a ∈ A .
First, sub-matrices J G and J F are created from J G and J F , by respectively considering only the genes V ⊆ V and functions A ⊆ A. These sub-matrices represent structural properties of the GCN subgraph G , and associations between genes and functions based on multiple partitions of each graph. Figure 3 illustrates the prediction approach. The reminder of this section is devoted to detailing the prediction approach.
SHAP filters the extracted features with more impact in the prediction task, and HMC is used to predict associations between genes and functions without inconsistencies (i.e., complying the true-path rule). Since local HMC methods use more than one predictor per hierarchy, the feature selection is executed for each predictor independently, considering only the features related to the functions being predicted, denoted by  i.e., on the sum of mean absolute values by a factor of the input constant c. The first Θ(c) features, sorted from greater to lower mean absolute SHAP value, are selected as to reach the given cutoff.
Note that the input constant c is key for selecting the number of significant features. The idea is to set c so as to find a balance between prediction efficiency and the computational cost of building the predictor.

Training and Prediction
This stage comprises a process that combines two supervised machine learning techniques/tools to build the predictor ψ. In particular, stratified k-fold cross-validation and hierarchical multi-label classification are used sequentially in a pipeline.
The pipeline takes as input the matrix J, which specifies the significant features of J G and J F , the sub-hierarchy H and the annotation function φ. First, k-fold is applied to split the dataset into k different folds for cross validation (note that k is not related to the input K). That is, each fold is used as a test set, while the remaining k − 1 folds are used for training. Recall that k-fold cross validation aims to overcome overfitting in training. Furthermore, one or multiple random forest classifiers are build and used for prediction, the number of classifiers depends on the HMC method. Randoms forest is selected for this approach since it is a tree-based and multi-label classification algorithm, which is interpretable (SHAP can be applied). The parameter values used for random forest classifiers, differently from the default scikit-learn values, are: 200 estimators (n estimators) and minimum number of samples of 5 (min samples split).
Additionally, some HMC methods require an extra step to keep prediction consistent w.r.t. the sub-hierarchy H (i.e., comply the true-path rule). The probability of association between a function v ∈ V and a function a ∈ A must be lower than the probability of association between the same gene and the ancestor of a in H . To satisfy this constraint cumulative probabilities are computed throughout the paths in H . That is, for each gene v ∈ V and functions (a, b) ∈ R, the predicted probability of the association between v and a is multiplied by the predicted probability of association between v and b (its ancestor). This process is repeated for every path in the hierarchy from the root to the leaves.
The output of this stage is the predictor ψ, i.e., the probabilities of associations between the genes in V and functions A . Note that the predictor ψ satisfies the true-path rule.

Performance Evaluation
It is often the case in HMC datasets that individual classes have few positive instances. In genome annotation, typically only a few genes are associated to specific functions. This implies that for most classes (deeper in the hierarchy), the number of negative instances by far exceeds the number of positive instances. Hence, the real focus is recognizing the positive instances (predict associations between genes and functions), rather than correctly predicting the negative ones (predict that a function is not associated to a given gene). Although ROC curves are better known, their area under the curve is higher if a model correctly predicts negative instances, which is not suitable for HMC problems.
For this reasons, the measures (based on the precision-recall (PR) curve) introduced by [35] are used for evaluation.
Area under the average PR curve. The first metric transforms the multilabel problem into a binary one by computing the precision and recall for all functions A together. This corresponds to micro-averaging the precision and recall.
The output of the prediction stage are the probabilities of associations between genes V and functions A . Thereby, instead of selecting a single threshold to compute precision and recall, multiple thresholds are used to create a PR curve. In the PR curve each point represent the precision and recall for a give threshold that can be computed as: Note that i ranges over all functions A , i.e., precision and recall are computed for all functions together. The area under this curve is denoted as AU(PRC).
Average area under the PR curves. The second metric corresponds to the (weighted) average of the areas under the PR curves for all functions A . This metric, referred as macro-average of precision and recall, can be computed as follows: If the weights of all functions are the same (i.e., 1/|A |) the metric is denoted as AUPRC. In addition, weights can also be defined based on the number of genes associated to functions in φ, i.e., w a = |φ −1 (a)|/ i |φ −1 (i)| for a ∈ A. In the later case, denoted as AUPRC w , more frequent functions get higher weight. Note that one point in the weighted PR curve corresponds to the (weighted) average of the AUPRC of all functions A given a threshold.

Case study: Zea mays
Next section describes a case study on applying the feature extraction and prediction approach presented in Sections 3 and 4 to maize (Zea mays). First, the maize data used for the case study is described. Second, the proposed approach is applied to the maize data. Lastly, the performance of the proposed approach is compared to two models trained using each set of features J G and J F , independently.

Data Description and Feature Extraction
The co-expression information used in the study is imported from the ATTED-II database [21]. The gene co-expression network G = (V, E, w) comprises 26 131 vertices (genes) and 44 621 533 edges. In this case, a zscore threshold of 1 is used as the cut-off measure for G, i.e., E contains edges e that satisfy w(e) ≥ 1 (most of them satisfying w(e) > 1). Note that the highest value is assigned to the strongest connections. The functional information for this network is taken from DAVID Bioinformatics Resources [10] (2021 update); it contains annotations of biological processes, i.e., pathways to which a gene contributes. It is important to note that genes may be associated to several biological processes, and biological processes may be associated to multiple genes. The database comprises 3 924 biological processes A and 7 021 ancestral relations R between these functions, that represent the hierarchy H = (A, R) of the GO [8]. A total of 255 865 association between genes and functions are considered, these associations represent the annotation function φ : The feature extraction approach is applied with the inputs G, A, φ and K = {10, 20, . . . , 100} (values are incremented in steps of 10 up to 100). The outputs are the feature matrices J G and J F that specify how likely it is for the maize genes V to be associated to the biological processes A when the graph is decomposed in the number of clusters in K.
Moreover, only functions associated to more than 200 genes have been considered, so the number of functions in the resulting sub-hierarchies is tractable regarding the dimension of the output of SHAP (see Section 2). Recall that the Gene Ontology hierarchy splits into 28 subhierarchies when considering only biological processes. Additionally, all sub-hierarchies with less than 10 functions are discarded and the topologicalsorting algorithm introduced in [27] is used to transform the sub-hierarchies, represented as DAGs, into trees. For each ancestral relation (a, b) ∈ R (b is ancestor of a), the algorithm assigns a weight as the ratio of the number of genes associated to the a to the number of genes associated to b. Then, for each function a ∈ A with more than one parent, only the one with the higher weight remains (ties are broken arbitrarily).
As result, there are 5 sub-hierarchies of biological processes. Table 1 describes each sub-hierarchy H , starting by the root term r and its description, following the number of functions A and the number of genes V in the associated GCN subgraph G . The prediction approach is applied to each sub-hierarchy H independently. The remaining input parameter for the prediction approach is c = 0.9 (recall that this parameter is used to filter the most relevant features according to their mean SHAP value).  Table 1. Resulting sub-hierarchies H of biological processes for maize. The identifier and description of each root function r is presented in the first and second columns, respectively. The third column shows the number of functions A within each subhierarchy and the fourth column shows the number maize genes in the GCN subgraph G associated to H . The last column shows the number of functions per level, e.g., the first sub-hierarchy has 3 levels and there are 5, 5, and 2 functions on each level.    Figure 6 presents the prediction performance of the proposed approach measured with the AU(PRC) (denoted as micro) for four HMC methods, namely, local classifier per node (lcn), local classifier per parent node (lcpn), local classifier per level (lcl ), and global classifier. In general, it can be seen that all methods get a high area under the average PR curve, but the global classifier outperforms the local methods for all subhierarchies. The proposed approach identifies the associations between genes and functions by using the features extracted from the GCN G and the affinity graph F , and considering the ancestral relations of the biological processes. The global method obtains the best performance, followed by the lcpn and the lcl. Using multi-label classifiers is better than using a binary classifier for each function, i.e., lcn method.

GO
The micro score measures the overall performance of all functions within a sub-hierarchy without distinguishing between them. as macro. The macro score measure the prediction performance for each function individually and then takes the average. The conclusion is similar, the global method outperforms the local ones. Finally, Figure 8 illustrates the prediction performance measured with the AUPRC w , denoted as macro weighted. This score weights the individual performance of each function according to the number of genes associated to it. Thereby, the leaves and deeper functions in a sub-hierarchy always get lower weight than the others. Note that the deeper a functions is in a sub-hierarchy, the lower the predicted probabilities becomes. The global method outperforms the locals again. The conclusion is consistent with the three metrics, using clustering techniques to extract features from the GCN and considering the hierarchical structure of the biological processes seems to be key for the gene function production task.
It has been shown in [28] that the new features built from the GCN, and the associations between genes and functions with the spectral clustering algorithm are key to improve the prediction performance in the gene annotation problem (w.r.t. other features of the GCN and gene functional information). However, the feature extraction approach presented in Section 3 produces two different sets of features, namely, J G and J F , that are combined and used for prediction. The individual relevance of each set of features for the gene annotation problem is analyzed by (i) looking at the distribution of the filtered features for the global method and (ii) comparing the performance of the prediction task using each set of features independently. Table 2 presents the number of extracted and filtered features used for the global method per sub-hierarchy. Recall that the features are filtered using the mean SHAP values to select the more important ones with a cutoff defined by the input constant c.  Figure 9 illustrates the distribution of the filtered features for the global method per sub-hierarchy. Note that, even though the features from the affinity graph F (i.e., J F ) are more important, features from the GCN G (i.e., J G ) are also selected for all sub-hierarchies. Figure 10 shows the prediction performance of the global HMC method trained using the features J G and J F independently, and the proposed approach (i.e., their combination) measured with AU(PRC) and AUPRC. The combination of both sets of features, extracted from the GCN and the affinity graph is key to improve the performance of the proposed approach for all subhierarchies.

Related Work
Zhou et al. [41] presented an approach to predict functions of maize proteins using graph convolutional networks. In particular, an amino acid sequence of proteins and the GO hierarchy were used to predict functions of proteins with a deep graph convolutional network model (DeepGOA). Their results showed that DeepGOA is a powerful tool to integrate amino acid data and the GO structure to accurately annotate proteins. Similarly, the work presented in [6] aims to predict the phenotypes and functions associated to maize genes using: (i) hierarchical clustering based on datasets of transcriptome (set of molecules produced in transcription) and metabolome (set of metabolites found within an organism); and (ii) GO enrichment analyses. Their results showed that profiling individual plants is a promising experimental design for narrowing down the lab-field gap. Gligorijević et al. [9] proposed a network fusion method based on multimodal deep autoencoders to extract high-level features of proteins from multiple interaction networks. This method, called deepNF, relied on a deep learning technique that captures relevant protein features from different complex, non-linear interaction networks. Their results showed that extracting new features from biological networks is key to annotate gene with functions. The work in [40] is also closely related. They presented Gene Ontology hierarchy preserving hashing (HPHash), a gene function prediction method that retains the hierarchical order between GO terms. It used a hierarchy preserving hashing technique based on the taxonomic similarity between terms to capture the GO hierarchy. Hashing functions were used to compress the gene-term association matrix, where the semantic similarity between genes was used to predict the functions of the genes. Their results showed that HPHash preserves the GO hierarchy and improves prediction performance.
In addition, the authors in [3] presented iFeature, a Python-based toolkit for generating numerical feature representation schemes from protein sequences. It integrated algorithms for feature clustering, selection, and dimensionality reduction to facilitate training, analysis, and benchmarking of machine-learning models. In a related way, Mu et al. [19] showed that feature extraction of protein sequences is helpful for prediction of protein functions or interactions. They introduced FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model for protein sequences that combines graphical and statistical features. Their results showed that similarity analysis of protein sequences has applications in the study of gene annotation, gene function prediction, identification and construction of gene families, and gene discovery.

Concluding Remarks and Future Work
By combining network-based modeling, cluster analysis, interpretable machine learning, and hierarchical multi-label classification, the approach presented in this paper introduces a novel method to address the gene function prediction problem. It aims to predict the association probability between each gene and function by taking advantage of the GCN spectral decomposition, the information available of associations between genes and functions, and the ancestral relations between the functions (i.e., the GO hierarchy).
A case study on Zea mayz (maize) is presented. Using the structural information of the gene co-expression network (extracted by a spectral clustering algorithm) and considering the hierarchical structure of the biological processes (using HMC) seems to be the key for the improved performance of the proposed approach. More precisely, the global HMC method, which considers all features available for a sub-hierarchy to build a single classifier, outperforms the other methods in relation to the three metrics that were used (namely, AU(PRC) , AUPRC , and AUPRC w ).
The results presented in [28] show that the features extracted from the GCN using spectral clustering lead to better prediction performance in the gene function prediction task (addressed as an independent binary classification problem per function). In this work, it has been shown that considering the ancestral relations between functions to produce an outcome that satisfies its hierarchical structure (i.e., complies the truepath rule or hierarchical constraint), based on the features extracted from the GCN, improves the performance in the gene function prediction task (addressed as a hierarchical multi-label classification problem).
Two main lines of work can be considered for future work. First, applying the proposed approach to identify genes associated to specific stresses (e.g., low temperature, salinity) can help to reduce the set of candidate genes that respond to treatments for in vivo validation. Second, exploring transfer learning techniques (especially, domain adaptation) to enrich the building of the classifiers using information from other organisms (datasets), not only can lead to higher prediction performance, but also can enable the proposed approach on organisms without a wealth of significant functional information.

Availability of data and materials
The datasets analyzed for the current study are publicly available from different sources. They can be found in the following locations: -Gene co-expression data of Oryza sativa Japonica is available on ATTED-II [21]. -Functional data of rice genes is available on the DAVID Bioinformatics Resources [10]. -Hierarchical data of Gene Ontology terms are available on the GOA-TOOLS Python library [13].
The data collected, cleaned, and processed from the above sources as used in the case study can be requested to the authors.