
Feature extraction with spectral clustering for gene function prediction using hierarchical multi-label classification

Abstract

Gene annotation addresses the problem of predicting unknown associations between genes and functions (e.g., biological processes) of a specific organism. Despite recent advances, the cost and time demanded by annotation procedures that rely largely on in vivo biological experiments remain prohibitively high. This paper presents a novel in silico approach to the annotation problem that combines cluster analysis and hierarchical multi-label classification (HMC). The approach uses spectral clustering to extract new features from the gene co-expression network (GCN) and enrich the prediction task. HMC is used to build multiple estimators that consider the hierarchical structure of gene functions. The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world. The results illustrate how in silico approaches are key to reducing the time and costs of gene annotation. More specifically, they highlight the importance of: (1) building new features that represent the structure of gene relationships in GCNs to annotate genes; and (2) taking into account the structure of biological processes to obtain consistent predictions.

Introduction

Identifying the association of genes to functions is key to gain insight into how genomes serve as blueprints for life, e.g., to develop treatments for specific conditions or enhance tolerance to environmental stresses (Rust et al. 2002; Vandepoele et al. 2009; Yandell and Ence 2012). Numerous studies have used co-expression data to predict specific biological functions and processes (Oti et al. 2008; Romero et al. 2020; Stuart 2003; van Dam et al. 2017). Intuitively, genes are reported to co-express whenever they are simultaneously active, which suggests that they are associated to one or more common biological processes.

Under this hypothesis, characterizing gene interactions as a gene co-expression network (GCN) may assist to identify unknown functional annotations in a genome. Co-expression networks are generally represented as undirected weighted graphs, where vertices denote genes and weighted edges indicate the strength of the co-expression between two genes. A detailed analysis of the structure and distribution of gene relationships in GCNs provides additional clues that facilitate the prediction of gene functions (Valentini 2009).

However, the cost and time required to annotate genes using in vivo biological experimentation remain prohibitively high (Cho et al. 2015; Zhou et al. 2005). To overcome this limitation, hybrid approaches that integrate existing knowledge of gene-function associations and in silico methods have been proposed (Cho et al. 2016; Deng et al. 2003; Luo et al. 2007; Romero et al. 2022). While they have shown great promise, given the extreme combinatorial nature of the problem, annotating genes in an efficient manner remains an open challenge.

Functional annotations are defined by the Gene Ontology (GO), which contains three main types of annotations: biological processes, molecular functions, and cellular components (Gene Ontology Consortium 2019). These annotations, commonly known as GO terms, are structured in a hierarchy and defined as a directed acyclic graph (DAG). Gene annotation approaches generally ignore the relationships among biological processes, even though these relationships are key to improving accuracy and avoiding inconsistent predictions. A prediction is said to be inconsistent w.r.t. the GO hierarchy when a gene is inferred to have a particular function a, but it is not inferred to have all ancestors of a. In other words, an inconsistent prediction does not satisfy the ancestral relations between GO terms. Satisfying ancestral constraints is often referred to as the true-path rule in GO (Valentini 2009; Ashburner et al. 2000) and as the hierarchical constraint in HMC (Vens et al. 2008).

This paper presents a feature extraction approach for in silico annotation of genes. It follows a network-based approximation that uses cluster analysis and hierarchical multi-label classification (HMC) for building a predictor that assigns functions to genes satisfying the true-path rule. Cluster analysis plays the role of enriching the information available for predicting gene-function associations by extracting new features that represent structural properties of the GCN. That is, co-expression relations are used to identify gene clusters that ultimately help in associating functions to genes (i.e., guilt by association, see Petsko (2009)). It has been shown in Romero et al. (2022) that new features built with the spectral clustering algorithm from the GCN and the associations between genes and functions are key to improving the prediction performance in the gene annotation problem. The results in Romero et al. (2022) also show that using other features associated to structural properties of the GCN and gene functional information leads to lower performance.

Furthermore, the extracted features are filtered (using SHAP) based on their impact in the prediction task and HMC is used to predict gene-function associations that take into account the relations between biological functions. The proposed approach illustrates how the performance of gene annotation is improved by combining: (1) new information extracted from the GCN; and (2) classification methods that consider the relation between gene functions.

This approach is applied to a case study on Zea mays, one of the most dominant and productive crops. Zea mays serves a variety of purposes, including animal feed and derivatives for human consumption and ethanol (Zhou et al. 2020). The co-expression information used in the study is imported from the ATTED-II database (Obayashi et al. 2018). The resulting GCN, modeled as a weighted graph, comprises 26,131 vertices (i.e., genes) and 44,621,533 edges. The functional information (i.e., known gene-function associations) is taken from DAVID Bioinformatics Resources (Huang et al. 2009). It contains a total of 255,865 annotations of biological processes for maize, i.e., pathways to which a gene contributes. The results highlight the importance of extracted features that represent structural properties of the GCN and the hierarchical structure of biological processes with HMC to improve prediction performance. Ultimately, the results provide experimental (in silico) evidence that the proposed approach is a viable and promising approximation to gene function prediction.

This paper is a significant extended version of Romero et al. (2022) that:

  • Addresses gene function prediction as a hierarchical multi-label classification problem by considering the structure of gene functions. That is, ancestral relationships are represented as a DAG (Gene Ontology Consortium 2019).

  • Analyzes a larger functional database for the case study of maize. The number of genes associated to at least one function increased from 5361 to 10,049. The new dataset consists of 255 865 associations between genes and functions, and 7021 relations between functions.

  • Concludes that the ancestral relations between functions and the features extracted from the GCN improve the prediction performance in the gene function prediction task when addressed as a hierarchical multi-label classification problem.

The remainder of the paper is organized as follows. “Preliminaries” section reviews some preliminaries. “Clustering-based feature extraction” section introduces the approach to extract features from the gene co-expression network using cluster analysis. The proposed approach to predict gene functions, based on hierarchical multi-label classification, is presented in “Hierarchical multi‑label classification for gene function prediction” section. “Case study: Zea mays” section presents the case study for the Zea mays species. Finally, “Related work and concluding remarks” section draws some concluding remarks and future research directions.

Preliminaries

This section presents preliminaries on spectral clustering, gene co-expression networks, gene function prediction, hierarchical multi-label classification, and SHAP feature contribution.

Spectral clustering

The aim of applying cluster analysis on a network is to identify groups of vertices sharing a (parametric) notion of similarity (Yu 2003; Rodriguez et al. 2019). Usually, distance or centrality metrics are used for clustering. Spectral clustering is a clustering method with foundations in algebraic graph theory (Jia et al. 2014). It has been shown that spectral clustering has better overall performance across different areas of applications (Murugesan et al. 2021). Given a graph G, the spectral clustering decomposition of G can be represented by the equation \({\mathbf {L}} = {\mathbf {D}} - {\mathbf {A}}\), where \({\mathbf {L}}\) is the Laplacian matrix, \({\mathbf {D}}\) is the degree matrix (i.e., a diagonal matrix with the number of edges incident to each node), and \({\mathbf {A}}\) is the adjacency matrix of G. Spectral clustering uses, say, the n eigenvectors associated to the n smallest nonzero eigenvalues of \({\mathbf {L}}\). In this way, each node of the graph gets a coordinate in \({\mathbb {R}}^n\). The resulting collection of eigenvectors serves as input to a clustering algorithm (e.g., k-means) that groups the nodes into n clusters.
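As an illustration of the decomposition just described, the following minimal sketch builds the Laplacian of a small graph and extracts the eigenvectors of its smallest nonzero eigenvalues. The toy graph, the function name `spectral_embedding`, and the numerical tolerance are assumptions made for this example only. For brevity, a sign split on a single eigenvector stands in for the k-means step, which is enough to separate two groups.

```python
import numpy as np

def spectral_embedding(adjacency, n_vectors):
    """Return, for each node, its coordinates in R^n_vectors."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))      # degree matrix
    L = D - A                       # Laplacian L = D - A
    # eigh returns the eigenvalues of a symmetric matrix in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    # keep the eigenvectors of the smallest nonzero eigenvalues
    # (1e-9 is an assumed tolerance for "zero")
    nonzero = np.where(eigenvalues > 1e-9)[0][:n_vectors]
    return eigenvectors[:, nonzero]

# Hypothetical toy graph: triangles (0,1,2) and (3,4,5) joined by edge 2-3
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

coords = spectral_embedding(A, n_vectors=1)
# Sign split used here in place of k-means; it only handles two clusters
labels = (coords[:, 0] > 0).astype(int)
```

Which of the two groups receives label 0 or 1 depends on the sign of the eigenvector returned by the solver, so only the grouping itself is meaningful.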

Gene co-expression network

A gene co-expression network (GCN) is represented as an undirected graph where each vertex represents a gene and each edge the level of co-expression between two genes.

Definition 1

Let V be a set of genes, E a set of edges that connect pairs of genes, and \(w:E \rightarrow {\mathbb {R}}_{\ge 0}\) a weight function. A (weighted) gene co-expression network is a weighted graph \(G = (V, E, w)\).

The set of genes V in a co-expression network is particular to the genome under study. The correlation of expression profiles between each pair of genes is measured, commonly, using the Pearson correlation coefficient. Every pair of genes is assigned and ranked according to a relationship measure, and a threshold is used as a cut-off value to determine E. The weight function w denotes the strength of the co-expression between each pair of genes in V. For example, in the ATTED-II database, the co-expression relation between any pair of genes is measured as a z-score expressed as a function of the co-expression index LS (Logit Score) (Obayashi et al. 2018; Obayashi and Kinoshita 2011).

Gene function prediction

In an annotated gene co-expression network, each gene is associated with the collection of biological functions to which it is related (e.g., through in vivo experiments).

Definition 2

Let A be a set of biological functions. An annotated gene co-expression network is a gene co-expression network \(G = (V, E, w)\) complemented with an annotation function \(\phi : V \rightarrow 2^{A}\).

The problem of predicting gene functions can be explained as follows. Given an annotated co-expression network \(G = (V, E, w)\) with annotation function \(\phi\), the goal is to use the information represented by \(\phi\), together with additional information (e.g., features of G), to obtain a function \(\psi : V \rightarrow 2^A\) that extends \(\phi\). Associations between genes and functions not present in \(\phi\) have either not been found through in vivo experiments, or do not exist in a biological sense. The new associations identified by \(\psi\) are a suggestion of functions that need to be verified through in vivo experiments. The function \(\psi\) can be built from a predictor of gene functions, e.g., based on a supervised machine learning model.

Hierarchical multi-label classification

Node classification refers to the task of predicting the class of a node based on the information of other nodes in the network (Bhagat et al. 2011). In general, node classification problems can be categorized into three different types: binary classification refers to predicting one attribute (target) with two classes (for example, positive and negative) (Khan and Madden 2010); multi-class classification refers to the case where the attribute to be predicted has more than two classes that are mutually exclusive (for example, the brand of a car) (Mills 2021); and multi-label classification refers to predicting an attribute with at least two classes, but where an instance could be associated to more than one class (for example, the gene function prediction problem) (Xu et al. 2020).

Although the aforementioned prediction methods are frequently used, they do not consider hierarchical relations between classes. For such scenarios, hierarchical multi-label classification (HMC) addresses the task of structured output prediction where the classes are organized into a hierarchy and an instance may belong to multiple classes. In many problems, such as gene function prediction, classes inherently satisfy these conditions (Levatić et al. 2015). Authors in Silla and Freitas (2011) expose that there are two types of methods to explore the hierarchical structure. First, top-down or local classifiers refer to partially predict the classes in the hierarchy from the top to the bottom. Second, big-bang or global classifiers refer to use a single classifier that considers the entire hierarchy at once.

Classifiers that ignore the class relationships, by predicting only the leaf classes in the hierarchy or predicting each class independently, often lead to inconsistent predictions. This refers to the fact that a node is inferred to have a particular class a, but the outcome of the classifier fails to infer the node’s association to all ancestor classes of a in the hierarchy. In other words, an inconsistent prediction states that the prediction does not satisfy the hierarchy for some class a. Satisfying ancestral constraints is often referred to as the true-path rule in GO (Valentini 2009; Ashburner et al. 2000) and as the hierarchical constraint in HMC (Vens et al. 2008).
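The notion of inconsistency can be made concrete with a short check. The sketch below is only an illustration: the class names, the child-to-parents map, and the function names are hypothetical, and the hierarchy is assumed to be given as a DAG in which each class lists its direct parents.

```python
def ancestors(term, parents):
    """Collect all ancestors of `term` from a child -> list-of-parents map."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen

def is_consistent(predicted, parents):
    """True when every predicted class comes with all of its ancestors."""
    return all(ancestors(t, parents).issubset(predicted) for t in predicted)

# Hypothetical DAG: d has two parents, a and b
parents = {"a": ["r"], "b": ["r"], "c": ["a"], "d": ["a", "b"]}

consistent = is_consistent({"r", "a", "c"}, parents)    # c with a and r
inconsistent = is_consistent({"r", "c"}, parents)       # c without a
```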

Fig. 1

Example of global and local methods for hierarchical multi-label classification. Given a hierarchy of classes (r, a, b, c, d, e, and f), the dashed boxes show the number of classifiers required for each method. Note that the lcn, lcpn, lcl, and global classifiers require 6, 4, 3, and 1 predictors, respectively

Figure 1 illustrates the four HMC methods used in this work: Local classifier per node (lcn) consists of training one binary classifier for each class in the hierarchy except the root. Local classifier per parent node (lcpn) consists of training a multi-label classifier for each parent node in the hierarchy to distinguish between its child classes. Local classifier per level (lcl) consists of training one multi-label classifier for each level of the class hierarchy except for the root. Global classifier consists of building a single multi-label classifier taking into account the hierarchy as a whole during a single run. The global classifier can assign classes at potentially every level of the hierarchy to an instance.
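The number of predictors required by each method follows directly from the shape of the hierarchy. The following sketch counts them for a hypothetical tree over the classes r, a, b, c, d, e, and f; the tree used here is an assumption chosen for the example and need not match the one drawn in Fig. 1.

```python
def count_classifiers(children, root):
    """Count the predictors needed by each HMC method for a tree hierarchy."""
    depth = {root: 0}
    queue = [root]
    while queue:
        node = queue.pop(0)
        for child in children.get(node, []):
            depth[child] = depth[node] + 1
            queue.append(child)
    return {
        "lcn": len(depth) - 1,                              # every class except the root
        "lcpn": sum(1 for n in depth if children.get(n)),   # every node with children
        "lcl": max(depth.values()),                         # every level below the root
        "global": 1,                                        # one model for the whole hierarchy
    }

# Hypothetical hierarchy: r -> a, b; a -> c; b -> d; d -> e, f
children = {"r": ["a", "b"], "a": ["c"], "b": ["d"], "d": ["e", "f"]}
counts = count_classifiers(children, "r")
```

For this particular tree the counts are 6, 4, 3, and 1 for lcn, lcpn, lcl, and global, respectively.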

SHAP feature contribution

The performance of classification algorithms is partly determined by the features used to train a particular predictor. SHAP (SHapley Additive exPlanation) is a framework that computes the importance values for each feature in a dataset using concepts from game theory (Lundberg and Lee 2017; Lundberg et al. 2020). SHAP assigns Shapley values to explain which features in the model are the most important for prediction by calculating the changes in the prediction when features are conditioned. Given a predictor and a training set, SHAP computes a matrix with the same dimensions as the predictor’s output containing the Shapley values for each instance and class. For example, in a binary classification problem and a training set of n instances, the output of SHAP is a matrix of dimension \(n\times 2\) (there are two classes, positive and negative). In multi-label classification problems, the output is a matrix of dimension \(n\times 2\) for each class, since classes are not mutually exclusive and the outcome is either positive or negative for each class.
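To give some intuition for the game-theoretic quantity behind SHAP, the sketch below computes exact Shapley values by enumerating all feature coalitions. It does not use the SHAP library, which relies on model-specific algorithms and approximations; the toy value function and all names are hypothetical and serve only to illustrate the definition.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values for a value function defined on feature coalitions."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # weight of a coalition of this size in the Shapley formula
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # marginal contribution of feature i to this coalition
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi

# Hypothetical model output: feature 0 adds 3, feature 1 adds 1, feature 2 is ignored
def toy_value(coalition):
    return 3.0 * (0 in coalition) + 1.0 * (1 in coalition)

phi = shapley_values(toy_value, 3)
```

In this additive example each feature receives exactly its own contribution, and the ignored feature receives zero. The enumeration grows exponentially with the number of features, which is why it is only practical for toy cases.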

Clustering-based feature extraction

The approach for extracting features from the GCN using a clustering algorithm and Gene Ontology term enrichment is presented. It combines information from the GCN, and the associations between genes and functions to create features capturing topological properties of the GCN.

The inputs of the approach are a GCN, denoted by \(G=(V,E,w)\), a set of (biological) functions A, an annotation function \(\phi :V\rightarrow 2^A\), and a set \(K=\{k_0,\dots ,k_{m-1}\}\) for sampling the number of clusters. The annotation function \(\phi\) must satisfy the true-path rule for the GO hierarchy (Ashburner et al. 2000; Valentini 2009). That is, if a gene is associated to a function, then it must also be associated to every ancestor of the function in the hierarchy, and if a gene is not associated to a function, then it must not be associated to any of its descendants.

The outputs are two feature matrices \(J_G\) and \(J_F\), of dimension \(V\times A\cdot K\rightarrow [0,1]\), specifying the likelihood of the genes V to be associated to the functions in A when the graph is decomposed into each of the m numbers of clusters in K. Matrices \(J_G\) and \(J_F\) correspond to the GCN (that is, the graph G) and an affinity graph defined in the next subsection.

Fig. 2

The clustering-based feature extraction approach consists of three stages, namely, creation of the affinity graph, clustering computation, and Gene Ontology term enrichment. Its inputs are a GCN, denoted by \(G=(V,E,w)\), a set of functions A, an annotation function \(\phi :V\rightarrow 2^A\), and a set \(K=\{k_0,\dots ,k_{m-1}\}\). Its outputs are two feature matrices (for both G and its enriched version F) of dimension \(V\times A\cdot K\rightarrow [0,1]\) that specify how likely it is for the genes to be associated to the functions in A when the graph is decomposed m times, once into \(k_i\) clusters for each \(0\le i< m\)

The feature extraction approach consists of three stages, which are depicted in Fig. 2. First, an affinity graph F with information in \(\phi\) is created from G. Second, the spectral clustering algorithm is applied to both G and its enriched version F for the m different number of clusters specified in K. Third, the Gene Ontology term enrichment technique is used to create m features for each function \(a\in A\), corresponding to the number of clusters in K.

Affinity graph creation

An affinity graph \(F = (V, E, w_F)\) between G and \(\phi\) is built. Its weight function is defined as the mean between the co-expression weight specified by w and the proportion of shared functions between genes specified by \(\phi\).

Definition 3

The weight function \(w_F : V\times V \rightarrow [0,1]\) is defined for any \(u,v \in V\) as

$$\begin{aligned} w_F(u,v) = \frac{1}{2}\left( \frac{w(u,v) -1}{\max (w)-1}+\frac{|\phi (u) \cap \phi (v)|}{|\phi (u) \cup \phi (v)|}\right) , \end{aligned}$$

where \(\max (w)\) denotes the maximum value in the range of w (which exists because w is finite).

Under the assumption that at least one element in the range of w is greater than 1, it is guaranteed that the range of \(w_F\) is [0, 1] (because \(w: V \times V \rightarrow [1,\infty )\)). This is indeed the case, in practice, because the co-expression between two genes in the GCN is quantified in terms of the z-score, which is highly unlikely to be 1 for all pairs of genes.
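The weight of Definition 3 can be sketched in a few lines, taking the proportion of shared functions to be the number of common functions divided by the number of functions of either gene. The function name and the example values are hypothetical. Returning 0 for the shared-function term when neither gene has any annotation is an assumption of this sketch, since the definition does not cover that case explicitly.

```python
def affinity_weight(w_uv, w_max, funcs_u, funcs_v):
    """Mean of the rescaled co-expression weight and the proportion of shared functions."""
    # rescale the co-expression weight from [1, w_max] to [0, 1]
    coexpression = (w_uv - 1.0) / (w_max - 1.0)
    union = funcs_u | funcs_v
    # assumption: two unannotated genes share no functions
    shared = len(funcs_u & funcs_v) / len(union) if union else 0.0
    return 0.5 * (coexpression + shared)

# Hypothetical values: z-score 3 with a maximum of 5, one shared function out of three
example = affinity_weight(3.0, 5.0, {"a", "b"}, {"b", "c"})
```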

Gene clustering

The spectral clustering algorithm is applied independently to each graph \(X \in \{G, F\}\) to decompose X (i.e., group the genes V) using the number of clusters specified by \(K=\{k_0,\dots ,k_{m-1}\}\). The decomposition of X is performed m times, once per k in K. The adjacency matrices of the weighted and undirected graphs G and F are used as the precomputed affinity matrices required for the spectral clustering algorithm. The outcome of the clustering algorithm is an assignment from nodes to clusters of size k, for each \(k\in K\). More precisely, the outputs of this stage are the matrices \(I_X : V \times K \rightarrow [0,1]\), where each column \(0\le i < m\) represents the decomposition of X in \(k_i\) clusters.

Gene enrichment

The goal of this stage is to produce a matrix \(J_X : V \times A \cdot K \rightarrow [0,1]\) for each \(X \in \{G, F\}\), specifying how likely it is for the genes to be associated to every function \(a\in A\) when X is decomposed in the given number of clusters.

For each decomposition from the previous stage (i.e., each column of the matrices \(I_X\)) and function \(a\in A\), the resulting clusters are used to compute whether a significant number of members associated to function a is (locally) present. Intuitively, if genes that are grouped together have a strong co-expression relation and most of the group are associated to gene function a, then the remaining genes are also likely to be associated to a (i.e., guilt by association, see Petsko (2009)). In this way, for each \(v \in V\), \(a\in A\), and \(k \in K\), the entry \(J_X(v, a\cdot k)\) is a p-value indicating if the function a is over-represented in the decomposition of k clusters of X. This process is commonly known as Gene Ontology term enrichment and may use different statistical tests, such as, Fisher’s exact test (Yon Rhee et al. 2008).
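As an illustration of the enrichment test, the one-sided Fisher's exact test for over-representation corresponds to the upper tail of a hypergeometric distribution. The sketch below computes this p-value for a single cluster and a single function; the function name, argument names, and counts are hypothetical, and it is assumed here that the resulting p-value is assigned to every gene of the cluster.

```python
from math import comb

def enrichment_p_value(total, annotated, cluster_size, annotated_in_cluster):
    """P(at least `annotated_in_cluster` annotated genes in a random cluster)."""
    upper = min(cluster_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(total - annotated, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return favourable / comb(total, cluster_size)

# Hypothetical counts: 20 genes, 5 annotated with function a,
# and a cluster of 5 genes containing 4 of the annotated ones
p = enrichment_p_value(total=20, annotated=5, cluster_size=5, annotated_in_cluster=4)
```

A small p-value indicates that the function is over-represented in the cluster, so values close to 0 are the informative ones in the feature matrices.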

Hierarchical multi-label classification for gene function prediction

This section presents the approach for gene function prediction using HMC to create a predictor, enriched with the information of the features created in “Clustering-based feature extraction” section.

The GO hierarchy is defined as a directed acyclic graph (DAG) containing three main types of annotations: biological processes, molecular functions, and cellular components (Gene Ontology Consortium 2019). This work focuses on biological processes, i.e., a subgraph of the GO hierarchy that contains 28 roots (i.e., functions in the GO hierarchy with null indegree). This subgraph is denoted as \(H=(A,R)\), where A is the set of biological processes and R the binary relation representing ancestral relations between pairs of biological processes (i.e., \((a,b)\in R\) means that function b is an ancestor of function a in the GO hierarchy). The topological-sorting traversal algorithm presented in Romero et al. (2022) is used to transform the GO hierarchy of biological processes into a tree. As a result, the hierarchy is split into several components, i.e., subtrees of H called sub-hierarchies. Each sub-hierarchy, \(H'=(A',R')\) with \(A'\subseteq A\), \(R'\subseteq R\), and \(r\in A'\) the root, is associated to a subgraph \(G'=(V',E',w)\) containing all genes \(v\in V\) associated to r, i.e., \(V'=\phi ^{-1}(r)\). Note that the proposed approach is independently applied to each sub-hierarchy.

The inputs of the approach are a sub-hierarchy \(H'=(A',R')\), a subgraph of the GCN, denoted by \(G'=(V',E',w)\), where \(V'\subseteq V\) and \(E'\subseteq E\), an annotation function \(\phi :V\rightarrow 2^{A'}\), the matrices \(J_G\) and \(J_F\) resulting from “Clustering-based feature extraction” section, and a constant value \(c\in [0,1]\) for feature selection. The output is a function \(\psi : V' \times A'\rightarrow [0,1]\), specifying, for each gene \(v \in V'\), the probability \(\psi (v,a)\) of v being associated to function \(a\in A'\).

First, sub-matrices \(J_G'\) and \(J_F'\) are created from \(J_G\) and \(J_F\), by respectively considering only the genes \(V'\subseteq V\) and functions \(A'\subseteq A\). These sub-matrices represent structural properties of the GCN subgraph \(G'\), and associations between genes and functions based on multiple partitions of each graph. Figure 3 illustrates the prediction approach. The remainder of this section is devoted to detailing the prediction approach.

Fig. 3

The prediction approach mainly consists of two stages, feature selection with SHAP and hierarchical multi-label classification. Its inputs are a sub-hierarchy \(H'=(A',R')\), a subgraph of the GCN \(G'=(V',E',w)\), an annotation function \(\phi :V\rightarrow 2^{A'}\) that satisfies the sub-hierarchy \(H'\), the sub-matrices of \(J_G\) and \(J_F\) containing only the functions \(A'\) and genes \(V'\), and a constant value \(c\in [0,1]\) for feature selection. Its output is a function \(\psi : V' \times A'\rightarrow [0,1]\), which indicates, for each gene \(v \in V'\), the probability \(\psi (v,a)\) of v being associated to function \(a\in A'\)

SHAP is used to keep the extracted features with the most impact on the prediction task, and HMC is used to predict associations between genes and functions without inconsistencies (i.e., complying with the true-path rule). Since local HMC methods use more than one predictor per hierarchy, the feature selection is executed for each predictor independently, considering only the features related to the functions being predicted, denoted by \(A''\subseteq A'\). For example, consider the function hierarchy and a local classifier per level method depicted in Fig. 4. The predictor for level 1 predicts functions a and b, so only the features associated to functions a and b are considered for the feature selection.

Fig. 4

Gene function prediction considering the function hierarchy and using a local classifier per level method. The predictor for the level 1 predicts functions a and b, so only the features from \(J_G'\) and \(J_F'\) associated to functions a and b are considered for the feature selection

Feature selection

The aim of feature selection is to produce a matrix \(J : V' \times \Theta (c) \rightarrow [0, 1]\) by selecting a reduced number of significant features from \(J_G'\) and \(J_F'\). The number of selected features is denoted by \(0 \le \Theta (c)\le 2m\cdot |A''|\), where \(m\cdot |A''|\) is the number of features in each matrix \(J_G'\) and \(J_F'\), denoted as q (that is \(q=m\cdot |A''|\)).

Features are selected from \(J_G'\) and \(J_F'\) into J using SHAP. Let \(J_{G+F}'\) denote the matrix resulting from extending \(J_G'\) with the q features of \(J_F'\). That is, for each \(v \in V'\), the expression \(J_{G+F}'(v, \_)\) denotes a function with domain [0, 2q) and range [0, 1], where the values in [0, q) denote the p-values associated to v in G and the values in [q, 2q) the ones associated to v in the enriched version of G. For each entry \(J_{G+F}'(v, j)\), with \(v \in V'\) and \(0 \le j < 2q\), the mean absolute SHAP value \(s_{(v,j)}\) is computed after a large enough number of Shapley values are computed (executions of SHAP). Features are selected based on the cutoff

$$\begin{aligned} c \cdot \sum _{j=0}^{2q-1}s_{(v,j)}, \end{aligned}$$

i.e., on the sum of mean absolute values scaled by the input constant c. The first \(\Theta (c)\) features, sorted from highest to lowest mean absolute SHAP value, are selected so as to reach the given cutoff.
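One possible reading of this selection rule is sketched below: features are taken in decreasing order of importance until their accumulated importance reaches the cutoff. The sketch assumes a single mean absolute SHAP value per feature; the function name and the example values are hypothetical.

```python
def select_features(mean_abs_shap, c):
    """Indices of the top features whose accumulated importance reaches c * total."""
    cutoff = c * sum(mean_abs_shap)
    # feature indices sorted from highest to lowest mean absolute SHAP value
    order = sorted(range(len(mean_abs_shap)),
                   key=lambda j: mean_abs_shap[j], reverse=True)
    selected, accumulated = [], 0.0
    for j in order:
        if accumulated >= cutoff:
            break
        selected.append(j)
        accumulated += mean_abs_shap[j]
    return selected

# Hypothetical importances for four features; c = 0.75 keeps the top two
importances = [4.0, 1.0, 2.0, 1.0]
kept = select_features(importances, c=0.75)
```

With these values the cutoff is 6, which is reached by features 0 and 2. Setting c to 1 keeps every feature, and setting it to 0 keeps none.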

Note that the input constant c is key for selecting the number of significant features. The idea is to set c so as to find a balance between prediction efficiency and the computational cost of building the predictor.

Training and prediction

This stage comprises a process that combines two supervised machine learning techniques/tools to build the predictor \(\psi\). In particular, stratified k-fold cross-validation and hierarchical multi-label classification are used sequentially in a pipeline.

The pipeline takes as input the matrix J, which specifies the significant features of \(J_G'\) and \(J_F'\), the sub-hierarchy \(H'\) and the annotation function \(\phi\). First, k-fold is applied to split the dataset into k different folds for cross validation (note that k is not related to the input K). That is, each fold is used as a test set, while the remaining \(k-1\) folds are used for training. Recall that k-fold cross validation aims to mitigate overfitting in training. Furthermore, one or multiple random forest classifiers are built and used for prediction; the number of classifiers depends on the HMC method. Random forest is selected for this approach since it is a tree-based, multi-label classification algorithm that is interpretable (SHAP can be applied). The parameter values used for the random forest classifiers that differ from the default scikit-learn values are: 200 estimators (n_estimators) and a minimum of 5 samples required to split a node (min_samples_split).

Additionally, some HMC methods require an extra step to keep predictions consistent w.r.t. the sub-hierarchy \(H'\) (i.e., to comply with the true-path rule). The probability of association between a gene \(v\in V'\) and a function \(a\in A'\) must be lower than or equal to the probability of association between the same gene and the ancestor of a in \(H'\). To satisfy this constraint, cumulative probabilities are computed throughout the paths in \(H'\). That is, for each gene \(v\in V'\) and functions \((a,b)\in R'\), the predicted probability of the association between v and a is multiplied by the predicted probability of association between v and b (its ancestor). This process is repeated for every path in the hierarchy from the root to the leaves.
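The cumulative computation can be sketched for a single gene as follows. The example assumes a tree-shaped sub-hierarchy given as a child-to-parent map; the class names, probabilities, and function name are hypothetical.

```python
def enforce_true_path(raw, parent, root):
    """Multiply each class probability by the cumulative probability of its parent."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    cumulative = {root: raw[root]}
    stack = [root]
    while stack:                      # traverse from the root towards the leaves
        node = stack.pop()
        for child in children.get(node, []):
            cumulative[child] = raw[child] * cumulative[node]
            stack.append(child)
    return cumulative

# Hypothetical sub-hierarchy r -> a, b and a -> c, with raw predictions for one gene
parent = {"a": "r", "b": "r", "c": "a"}
raw = {"r": 1.0, "a": 0.5, "b": 0.25, "c": 0.5}
psi = enforce_true_path(raw, parent, "r")
```

Because every factor lies in [0, 1], the resulting probability of a class can never exceed that of its parent, which is what the constraint requires.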

The output of this stage is the predictor \(\psi\), i.e., the probabilities of associations between the genes in \(V'\) and functions \(A'\). Note that the predictor \(\psi\) satisfies the true-path rule.

Performance evaluation

It is often the case in HMC datasets that individual classes have few positive instances. In genome annotation, typically only a few genes are associated to specific functions. This implies that for most classes (deeper in the hierarchy), the number of negative instances by far exceeds the number of positive instances. Hence, the real focus is recognizing the positive instances (predict associations between genes and functions), rather than correctly predicting the negative ones (predict that a function is not associated to a given gene). Although ROC curves are better known, their area under the curve is higher if a model correctly predicts negative instances, which is not suitable for HMC problems.

For these reasons, the measures (based on the precision-recall (PR) curve) introduced by Vens et al. (2008) are used for evaluation.

Area under the average PR curve

The first metric transforms the multi-label problem into a binary one by computing the precision and recall for all functions \(A'\) together. This corresponds to micro-averaging the precision and recall.

The outputs of the prediction stage are the probabilities of associations between genes \(V'\) and functions \(A'\). Thereby, instead of selecting a single threshold to compute precision and recall, multiple thresholds are used to create a PR curve. In the PR curve, each point represents the precision and recall for a given threshold, which can be computed as:

$$\begin{aligned} \overline{\text {Prec}} = \frac{\sum _{i}TP_i}{\sum _{i}TP_i+\sum _{i}FP_i}, \quad \text {and} \quad \overline{\text {Rec}}=\frac{\sum _{i}TP_i}{\sum _{i}TP_i+\sum _{i}FN_i}. \end{aligned}$$

Note that i ranges over all functions \(A'\), i.e., precision and recall are computed for all functions together. The area under this curve is denoted as AU(\(\overline{\text {PRC}}\)).
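A single point of this curve can be computed as in the following sketch, which pools true positives, false positives, and false negatives over all functions before dividing. The data and names are hypothetical. Returning a precision of 1 when nothing is predicted positive is a convention assumed here, not one stated above.

```python
def micro_precision_recall(y_true, y_prob, threshold):
    """Micro-averaged precision and recall over all genes and functions."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(true_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth:
                tp += 1
            elif predicted:
                fp += 1
            elif truth:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0   # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: three genes (rows) and two functions (columns)
y_true = [[1, 0], [1, 1], [0, 0]]
y_prob = [[0.9, 0.6], [0.4, 0.8], [0.2, 0.1]]
point = micro_precision_recall(y_true, y_prob, threshold=0.5)
```

Sweeping the threshold over the predicted probabilities yields the points of the curve whose area is AU(\(\overline{\text {PRC}}\)).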

Average area under the PR curves

The second metric corresponds to the (weighted) average of the areas under the PR curves for all functions \(A'\). This metric, referred to as the macro-average of precision and recall, can be computed as follows:

$$\begin{aligned} \overline{\text {AUPRC}}_{w_1,w_2,\dots ,w_{|A'|}} = \sum _{i}w_i\cdot \text {AUPRC}_i. \end{aligned}$$

If the weights of all functions are the same (i.e., \(1/|A'|\)), the metric is denoted as \(\overline{\text {AUPRC}}\). In addition, weights can also be defined based on the number of genes associated to functions in \(\phi\), i.e., \(w_a=|\phi ^{-1}(a)|/\sum _i |\phi ^{-1}(i)|\) for \(a\in A'\). In the latter case, denoted as \(\overline{\text {AUPRC}_w}\), more frequent functions get higher weight. Note that one point in the weighted PR curve corresponds to the (weighted) average of the AUPRC of all functions \(A'\) given a threshold.
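Both variants of the metric can be sketched from per-function areas that are assumed to be already available. The function name, the areas, and the gene counts below are hypothetical.

```python
def average_auprc(auprc, gene_counts=None):
    """Uniform or frequency-weighted average of per-function AUPRC values."""
    functions = list(auprc)
    if gene_counts is None:
        # uniform weights 1/|A'|
        weights = {a: 1 / len(functions) for a in functions}
    else:
        # weights proportional to the number of genes annotated with each function
        total = sum(gene_counts[a] for a in functions)
        weights = {a: gene_counts[a] / total for a in functions}
    return sum(weights[a] * auprc[a] for a in functions)

# Hypothetical per-function areas and annotation counts
auprc = {"a": 0.5, "b": 0.25}
gene_counts = {"a": 3, "b": 1}
uniform = average_auprc(auprc)
weighted = average_auprc(auprc, gene_counts)
```

Here the frequency-weighted value is higher than the uniform one because the more frequent function a also has the larger area.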

Case study: Zea mays

This section describes a case study that applies the feature extraction and prediction approach presented in the “Clustering-based feature extraction” and “Hierarchical multi‑label classification for gene function prediction” sections to maize (Zea mays). First, the maize data used for the case study are described. Second, the proposed approach is applied to the maize data. Lastly, the performance of the proposed approach is compared to two models trained using each set of features \(J_G\) and \(J_F\), independently.

Data description and feature extraction

The co-expression information used in the study is imported from the ATTED-II database (Obayashi et al. 2018). The gene co-expression network \(G = (V, E, w)\) comprises 26 131 vertices (genes) and 44 621 533 edges. In this case, a z-score threshold of 1 is used as the cut-off measure for G, i.e., E contains edges e that satisfy \(w(e) \ge 1\) (most of them satisfying \(w(e) >1\)). Note that higher values indicate stronger connections. The functional information for this network is taken from DAVID Bioinformatics Resources (Huang et al. 2009) (2021 update); it contains annotations of biological processes, i.e., pathways to which a gene contributes. It is important to note that genes may be associated to several biological processes, and biological processes may be associated to multiple genes. The database comprises 3 924 biological processes A and 7 021 ancestral relations R between these functions, which represent the hierarchy \(H=(A,R)\) of the GO (Gene Ontology Consortium 2019). A total of 255 865 associations between genes and functions are considered; these associations represent the annotation function \(\phi :V\rightarrow 2^A\).

The feature extraction approach is applied with the inputs G, A, \(\phi\) and \(K=\{10,20,\dots ,100\}\) (values are incremented in steps of 10 up to 100). The outputs are the feature matrices \(J_G\) and \(J_F\) that specify how likely it is for the maize genes V to be associated to the biological processes A when the graph is decomposed into each number of clusters in K.

Moreover, only functions associated to more than 200 genes are considered, so that the number of functions in the resulting sub-hierarchies remains tractable given the dimension of the output of SHAP (see “Preliminaries” section). Recall that the Gene Ontology hierarchy splits into 28 sub-hierarchies when considering only biological processes. Additionally, all sub-hierarchies with fewer than 10 functions are discarded, and the topological-sorting algorithm introduced in Romero et al. (2022) is used to transform the sub-hierarchies, represented as DAGs, into trees. For each ancestral relation \((a,b)\in R\) (b is an ancestor of a), the algorithm assigns a weight equal to the ratio of the number of genes associated to a to the number of genes associated to b. Then, for each function \(a\in A'\) with more than one parent, only the parent with the highest weight remains (ties are broken arbitrarily).
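The frequency filter and the parent-selection rule can be sketched as below. This is only an illustration of the weighting step, not the full topological-sorting algorithm of Romero et al. (2022). It assumes that each pair in the input is a direct (child, parent) relation, and the function names and toy hierarchy are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def frequent_functions(gene_counts: Dict[str, int], min_genes: int = 200) -> Set[str]:
    """Keep functions associated to more than `min_genes` genes."""
    return {f for f, n in gene_counts.items() if n > min_genes}


def dag_to_tree(
    relations: Iterable[Tuple[str, str]],
    gene_counts: Dict[str, int],
    min_genes: int = 200,
) -> Dict[str, str]:
    """Return a child -> parent map with a single parent per function."""
    kept = frequent_functions(gene_counts, min_genes)
    best: Dict[str, Tuple[float, str]] = {}
    for child, parent in relations:
        if child not in kept or parent not in kept:
            continue
        # Weight of relation (a, b): genes associated to a over genes associated to b.
        weight = gene_counts[child] / gene_counts[parent]
        # Strict comparison: on ties the first parent seen is kept (arbitrary).
        if child not in best or weight > best[child][0]:
            best[child] = (weight, parent)
    return {child: parent for child, (_, parent) in best.items()}


# Hypothetical DAG in which function "c" has two parents.
counts = {"root": 1000, "p1": 500, "p2": 800, "c": 400, "rare": 50}
edges = [("p1", "root"), ("p2", "root"), ("c", "p1"), ("c", "p2"), ("rare", "p1")]
tree = dag_to_tree(edges, counts)
```

Here "c" keeps "p1" as its parent because 400/500 is larger than 400/800, and "rare" is dropped by the frequency filter.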

Table 1 Resulting sub-hierarchies \(H'\) of biological processes for maize

As a result, there are 5 sub-hierarchies of biological processes. Table 1 describes each sub-hierarchy \(H'\), starting with the root term r and its description, followed by the number of functions \(A'\) and the number of genes \(V'\) in the associated GCN subgraph \(G'\). The prediction approach is applied to each sub-hierarchy \(H'\) independently. The remaining input parameter for the prediction approach is \(c=0.9\) (recall that this parameter is used to filter the most relevant features according to their mean SHAP value). Figure 5 depicts the number of classifiers trained per HMC method and sub-hierarchy. Note that the global method requires one classifier per hierarchy, while the lcn requires \(|A'|-1\) classifiers.

Fig. 5

Number of classifiers trained per HMC method and sub-hierarchy. The lcn requires \(|A'|-1\) classifiers. The lcpn requires as many classifiers as functions with children in \(H'\). The lcl requires as many classifiers as the number of levels in \(H'\). Finally, the global method requires one classifier per hierarchy
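The counts described in the caption can be derived from a tree given as a child-to-parent map, as in the following sketch. The function name and toy tree are hypothetical. Counting the levels below the root for the lcl method is an assumption of this example, since the text does not state whether the root level is included.

```python
from typing import Dict


def classifier_counts(parent_of: Dict[str, str]) -> Dict[str, int]:
    """Number of classifiers per HMC method for a tree given as child -> parent."""
    nodes = set(parent_of) | set(parent_of.values())

    def depth(node: str) -> int:
        # Number of edges between the node and the root.
        d = 0
        while node in parent_of:
            node = parent_of[node]
            d += 1
        return d

    return {
        "lcn": len(nodes) - 1,                   # one per function except the root
        "lcpn": len(set(parent_of.values())),    # one per function with children
        "lcl": max(depth(n) for n in nodes),     # one per level below the root (assumed)
        "global": 1,                             # one per hierarchy
    }


# Hypothetical tree: root -> {p1, p2}, p1 -> {c}.
tree_counts = classifier_counts({"p1": "root", "p2": "root", "c": "p1"})
```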

Summary of results

Fig. 6

Prediction performance of the proposed approach measured with the area under the average PR curve, i.e., AU(\(\overline{\text {PRC}}\)). The performance is measured independently per sub-hierarchy

Figure 6 presents the prediction performance of the proposed approach measured with the AU(\(\overline{\text {PRC}}\)) (denoted as micro) for four HMC methods, namely, local classifier per node (lcn), local classifier per parent node (lcpn), local classifier per level (lcl), and global classifier. In general, all methods achieve a high area under the average PR curve, but the global classifier outperforms the local methods for all sub-hierarchies. The proposed approach identifies the associations between genes and functions by using the features extracted from the GCN G and the affinity graph F, and by considering the ancestral relations of the biological processes. The global method obtains the best performance, followed by the lcpn and the lcl. Using multi-label classifiers is better than using a binary classifier for each function, i.e., the lcn method.

Fig. 7

Prediction performance of the proposed approach measured with the average area under the PR curve, i.e., \(\overline{\text {AUPRC}}\). The performance is measured independently per sub-hierarchy

The micro score measures the overall performance of all functions within a sub-hierarchy without distinguishing between them. Figure 7 presents the prediction performance measured with the \(\overline{\text {AUPRC}}\), denoted as macro. The macro score measures the prediction performance for each function individually and then takes the average. The conclusion is similar: the global method outperforms the local ones.

Fig. 8

Prediction performance of the proposed approach measured with the weighted average area under the PR curve, i.e., \(\overline{\text {AUPRC}_w}\). The performance is measured independently per sub-hierarchy

Finally, Fig. 8 illustrates the prediction performance measured with the \(\overline{\text {AUPRC}_w}\), denoted as macro weighted. This score weights the individual performance of each function according to the number of genes associated to it. Thereby, the leaves and deeper functions in a sub-hierarchy always get lower weight than the others. Note that the deeper a function is in a sub-hierarchy, the lower its predicted probabilities become. The global method again outperforms the local ones. The conclusion is consistent across the three metrics: using clustering techniques to extract features from the GCN and considering the hierarchical structure of the biological processes seem to be key for the gene function prediction task.

Table 2 Number of extracted and filtered features used for the global method per sub-hierarchy

It has been shown in Romero et al. (2022) that the new features built with the spectral clustering algorithm from the GCN and from the associations between genes and functions are key to improving the prediction performance in the gene annotation problem (w.r.t. other features of the GCN and gene functional information). However, the feature extraction approach presented in “Clustering-based feature extraction” section produces two different sets of features, namely, \(J_G\) and \(J_F\), that are combined and used for prediction. The individual relevance of each set of features for the gene annotation problem is analyzed by (i) looking at the distribution of the filtered features for the global method and (ii) comparing the performance of the prediction task using each set of features independently. Table 2 presents the number of extracted and filtered features used for the global method per sub-hierarchy. Recall that the features are filtered using the mean SHAP values to select the most important ones with a cutoff defined by the input constant c.
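One possible reading of the cutoff c is sketched below: features are ranked by their mean absolute SHAP value and the smallest prefix whose cumulative share of the total importance reaches c is kept. This interpretation, the function name, and the feature names are assumptions made for illustration; the exact rule is defined in the “Preliminaries” section and may differ.

```python
from typing import Dict, List


def filter_by_importance(mean_abs_shap: Dict[str, float], c: float = 0.9) -> List[str]:
    """Keep top-ranked features until their cumulative importance share reaches c."""
    total = sum(mean_abs_shap.values())
    ranked = sorted(mean_abs_shap.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[str] = []
    cumulative = 0.0
    for name, value in ranked:
        if cumulative >= c * total:
            break
        kept.append(name)
        cumulative += value
    return kept


# Hypothetical mean |SHAP| values for features drawn from J_F and J_G.
importance = {"JF_k10": 0.5, "JG_k10": 0.3, "JF_k20": 0.15, "JG_k20": 0.05}
selected = filter_by_importance(importance, c=0.9)
```

With these toy values, the first three features cover 95% of the total importance, so the last one is filtered out.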

Fig. 9

Distribution of the filtered features from \(J_G\) and \(J_F\) for the global method per sub-hierarchy

Fig. 10

Prediction performance of the global method trained using the features \(J_G\) and \(J_F\) independently, and the proposed approach (i.e., their combination) measured with AU(\(\overline{\text {PRC}}\)) and \(\overline{\text {AUPRC}}\). The performance is measured independently per sub-hierarchy

Figure 9 illustrates the distribution of the filtered features for the global method per sub-hierarchy. Note that, even though the features from the affinity graph F (i.e., \(J_F\)) are more important, features from the GCN G (i.e., \(J_G\)) are also selected for all sub-hierarchies. Figure 10 shows the prediction performance of the global HMC method trained using the features \(J_G\) and \(J_F\) independently, and the proposed approach (i.e., their combination) measured with AU(\(\overline{\text {PRC}}\)) and \(\overline{\text {AUPRC}}\). The combination of both sets of features, extracted from the GCN and the affinity graph, is key to improving the performance of the proposed approach for all sub-hierarchies.

Related work and concluding remarks

Related work

Zhou et al. (2020) presented an approach to predict functions of maize proteins using graph convolutional networks. In particular, an amino acid sequence of proteins and the GO hierarchy were used to predict functions of proteins with a deep graph convolutional network model (DeepGOA). Their results showed that DeepGOA is a powerful tool to integrate amino acid data and the GO structure to accurately annotate proteins. Similarly, the work presented in Cruz et al. (2020) aims to predict the phenotypes and functions associated to maize genes using: (i) hierarchical clustering based on datasets of transcriptome (set of molecules produced in transcription) and metabolome (set of metabolites found within an organism); and (ii) GO enrichment analyses. Their results showed that profiling individual plants is a promising experimental design for narrowing down the lab-field gap. Gligorijević et al. (2018) proposed a network fusion method based on multimodal deep autoencoders to extract high-level features of proteins from multiple interaction networks. This method, called deepNF, relied on a deep learning technique that captures relevant protein features from different complex, non-linear interaction networks. Their results showed that extracting new features from biological networks is key to annotating genes with functions. The work in Zhao et al. (2019) is also closely related. They presented Gene Ontology hierarchy preserving hashing (HPHash), a gene function prediction method that retains the hierarchical order between GO terms. It used a hierarchy preserving hashing technique based on the taxonomic similarity between terms to capture the GO hierarchy. Hashing functions were used to compress the gene-term association matrix, where the semantic similarity between genes was used to predict the functions of the genes. Their results showed that HPHash preserves the GO hierarchy and improves prediction performance.

In addition, the authors in Chen et al. (2018) presented iFeature, a Python-based toolkit for generating numerical feature representation schemes from protein sequences. It integrated algorithms for feature clustering, selection, and dimensionality reduction to facilitate training, analysis, and benchmarking of machine-learning models. In a related way, Mu et al. (2021) showed that feature extraction of protein sequences is helpful for prediction of protein functions or interactions. They introduced FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model for protein sequences that combines graphical and statistical features. Their results showed that similarity analysis of protein sequences has applications in the study of gene annotation, gene function prediction, identification and construction of gene families, and gene discovery.

Concluding remarks and future work

By combining network-based modeling, cluster analysis, interpretable machine learning, and hierarchical multi-label classification, the approach presented in this paper introduces a novel method to address the gene function prediction problem. It aims to predict the association probability between each gene and function by taking advantage of the GCN spectral decomposition, the information available of associations between genes and functions, and the ancestral relations between the functions (i.e., the GO hierarchy).

A case study on Zea mays (maize) is presented. Using the structural information of the gene co-expression network (extracted by a spectral clustering algorithm) and considering the hierarchical structure of the biological processes (using HMC) seem to be key to the improved performance of the proposed approach. More precisely, the global HMC method, which considers all features available for a sub-hierarchy to build a single classifier, outperforms the other methods in relation to the three metrics that were used (namely, AU(\(\overline{\text {PRC}}\)), \(\overline{\text {AUPRC}}\), and \(\overline{\text {AUPRC}_w}\)).

The results presented in Romero et al. (2022) show that the features extracted from the GCN using spectral clustering lead to better prediction performance in the gene function prediction task (addressed as an independent binary classification problem per function). In this work, it has been shown that considering the ancestral relations between functions to produce an outcome that satisfies its hierarchical structure (i.e., complies with the true-path rule or hierarchical constraint), based on the features extracted from the GCN, improves the performance in the gene function prediction task (addressed as a hierarchical multi-label classification problem).
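To make the hierarchical constraint concrete, the sketch below caps the predicted probability of each function by that of its parent, so that a gene is never more likely to be associated to a function than to any of its ancestors. This is a generic post-processing illustration of the true-path rule under the assumption of a tree-shaped hierarchy; it is not the HMC procedure used in the paper, and all names and values are hypothetical.

```python
from typing import Dict


def enforce_true_path(probs: Dict[str, float], parent_of: Dict[str, str]) -> Dict[str, float]:
    """Return probabilities in which no function exceeds any of its ancestors."""
    consistent: Dict[str, float] = {}

    def resolve(node: str) -> float:
        if node not in consistent:
            p = probs[node]
            if node in parent_of:
                # A child can be at most as likely as its (already capped) parent.
                p = min(p, resolve(parent_of[node]))
            consistent[node] = p
        return consistent[node]

    for node in probs:
        resolve(node)
    return consistent


# Hypothetical predictions for one gene on the chain root -> p1 -> c.
raw = {"root": 1.0, "p1": 0.6, "c": 0.8}
fixed = enforce_true_path(raw, {"p1": "root", "c": "p1"})
```

In this toy case the probability of "c" is lowered from 0.8 to 0.6, the value of its parent "p1".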

Two main lines of work can be considered for future work. First, applying the proposed approach to identify genes associated to specific stresses (e.g., low temperature, salinity) can help to reduce the set of candidate genes that respond to treatments for in vivo validation. Second, exploring transfer learning techniques (especially domain adaptation) to enrich the building of the classifiers using information from other organisms (datasets) can not only lead to higher prediction performance, but also enable the proposed approach to be applied to organisms without a wealth of functional information.

Availability of data and materials

The datasets analyzed for the current study are publicly available from different sources. They can be found in the following locations: Gene co-expression data of Zea mays are available on ATTED-II (Obayashi et al. 2018). Functional data of maize genes are available on the DAVID Bioinformatics Resources (Huang et al. 2009). Hierarchical data of Gene Ontology terms are available through the GOATOOLS Python library (Klopfenstein et al. 2018). The data collected, cleaned, and processed from the above sources as used in the case study can be requested from the authors. A workflow implementation is publicly available: Project name: clustering_hmc. Project home page: https://github.com/migueleci/clustering_hmc. Operating system(s): platform independent. Programming language: Python 3. Other requirements: None. License: GNU GPL v3.

Abbreviations

AUPRC: Area under precision-recall curve
DAG: Directed acyclic graph
GO: Gene ontology
GCN: Gene co-expression network
HMC: Hierarchical multi-label classification
lcl: Local classifier per level
lcn: Local classifier per node
lcpn: Local classifier per parent node

Acknowledgements

Not applicable.

Funding

This work was partially funded by the OMICAS program: Optimización Multiescala In-silico de Cultivos Agrícolas Sostenibles (Infraestructura y Validación en Arroz y Caña de Azúcar), anchored at the Pontificia Universidad Javeriana in Cali and funded within the Colombian Scientific Ecosystem by The World Bank, the Colombian Ministry of Science, Technology and Innovation, the Colombian Ministry of Education and the Colombian Ministry of Industry and Tourism, and ICETEX, under GRANT ID: FP44842-217-2018. The second author was partially supported by Fundación CeiBA.

Author information

Contributions

MR and OR proposed the original idea. JF and CR provided advice on algorithmic concepts and implementation. MR and OR structured the methodology and performed the analysis. MR, OR, JF, and CR wrote the manuscript. All authors read and approved the final manuscript.

Authors' information

Miguel Romero Ph.D. Student in Engineering and Applied Sciences at the Pontificia Universidad Javeriana, in Cali (Colombia). He earned a B.S. degree in Economics and Systems Engineering from the Escuela Colombiana de Ingeniería Julio Garavito, Bogotá (Colombia). He has experience in rewrite logic, network analysis, machine learning, algorithms, and competitive programming. He works on the development of mathematical models and algorithms that allow identifying, from in silico omic characterization, the expression of phenotypic traits in different varieties of crops.

Oscar Ramírez M.Sc. Student in Engineering at the Pontificia Universidad Javeriana, in Cali (Colombia). He earned a B.S. degree in Electrical Engineering from the Escuela Colombiana de Ingeniería Julio Garavito, Bogotá (Colombia). He has experience in network analysis and algorithms. He works on the development of mathematical models and algorithms for the identification of uncharacterized associations between genes and biological functions in Zea mays, especially those associations related to key stress types, such as low temperature tolerance and common rust resistance.

Jorge Finke Professor in the Department of Electronics and Computer Science at the Pontificia Universidad Javeriana, in Cali (Colombia). He earned a B.S., a M.Sc. and a Ph.D. degree in Systems theory from The Ohio State University, Columbus, OH. He has more than 10 years of experience in developing representations of complex, large-scale data and innovative approaches to predictive modeling and analysis, including cutting-edge research in academia as well as real-world consulting for companies across multiple industries. His motivation is to discover new and disruptive opportunities that enable organizations to solve problems and achieve strategic goals.

Camilo Rocha Associate Professor in the Department of Electronics and Computer Science at the Pontificia Universidad Javeriana, in Cali (Colombia). In February 2020, he was appointed Dean of Engineering and Sciences. He earned a B.S. and a M.Sc. degree in Informatics from the Universidad de los Andes (Bogotá), and a M.Sc. degree in Mathematics and a Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign. His main research interests are in formal methods, algorithms, and software engineering, more specifically in techniques for building reliable software systems.

Corresponding author

Correspondence to Miguel Romero.

Ethics declarations

Ethics approval and consent to participate

All data were anonymized and collected in accordance with paragraph 23 of the German federal law, German Protection against Infection Act (“Infektionsschutzgesetz”), which regulates the prevention and control of infectious diseases in humans. Therefore, ethical approval and informed consent were not required.

Consent for publication

Not applicable, because all data displayed in this publication are surveillance-based data, obtained in accordance with the German Protection against Infection Act (“Infektionsschutzgesetz”).

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Romero, M., Ramírez, O., Finke, J. et al. Feature extraction with spectral clustering for gene function prediction using hierarchical multi-label classification. Appl Netw Sci 7, 28 (2022). https://doi.org/10.1007/s41109-022-00468-w

  • DOI: https://doi.org/10.1007/s41109-022-00468-w

Keywords

  • Hierarchical classification
  • Supervised learning
  • Spectral clustering
  • SHAP values
  • Gene function prediction
  • Zea mays