A transversal approach to compute semantic similarity between genes

Julie Chabalier

Nicolas Garcelon

Marc Aubry

Anita Burgun

E.A 3888 - Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 RENNES Cedex, France 33(0)2.99.28.42.15

E.A 3888 - Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 RENNES Cedex, France 33(0)2.99.28.42.15

UMR 6061, CNRS, Faculté de Médecine, IFR 140, 35043 RENNES Cedex, France 33(0)2.23.23.45.76

E.A 3888 - Faculté de Médecine, Université de Rennes 1, IFR 140, 35043 RENNES Cedex, France 33(0)2.99.28.42.15


ABSTRACT

Interpreting the results of transcriptomic data analysis is a challenge in bioinformatics. Most of the time, interpretation consists in clustering genes according to their expression pattern and then searching databases for the annotations related to these genes. This pre-clustering limits the data interpretation. In order to better identify the studied genes and to highlight the functional relationships between them, we propose a transversal approach that clusters the genes according to biological (Gene Ontology vocabulary), medical (UMLS terminology), genomic (sequence feature retrieval) and expression (experiment results) annotations. From these clusters, the semantic similarities between genes are computed in a vector space model. The transversal analysis is applied to a set of genes involved in enterocyte differentiation. This approach results in several gene functional networks that proved to be biologically relevant.

Keywords Ontology, semantic similarity, annotation, transcriptome.

1. INTRODUCTION

Many methods have been developed to analyze series of microarray experiments [1]. Typically, experiment results are interpreted with a two-step approach: 1) the genes are organized into clusters depending on their differential expression pattern and 2) for each cluster, a function is assigned to each gene product. This second step, called functional annotation, consists in searching databases or ontologies for information on individual genes (sequences, motifs, terms, etc.). For each expression cluster, the meaning of the gene information is usually evaluated biologically through an existing model. Thus, a hypothetical model can be built.

The biological interpretation is a difficult and tedious task. Even if several methods have been developed to facilitate the annotation work [2,3,4], this analysis step is done manually most of the time. Working with data clustered according to the expression level can limit the interpretation and seems to be the bottleneck of these approaches. Indeed, it can be difficult to identify functional relationships between genes if they belong to different expression groups. Recent applications, such as GARBAN [5], propose a global approach to classify the data without expression pattern pre-clustering. However, these approaches essentially concern biological annotation through the Gene Ontology™ (GO) [6] and do not handle the contextual knowledge of data, such as sequence features or medical knowledge.

As opposed to the two-step approach, we define a transversal analysis that supports a parallel gene clustering according to different kinds of knowledge: 1) biological knowledge, provided by the GO terms, which are organized according to three hierarchies: Biological Process, Molecular Function and Cellular Component, 2) medical knowledge, supplied by the Unified Medical Language System® (UMLS) [7]; the UMLS currently integrates one million biomedical concepts from more than 100 vocabularies, including the Medical Subject Headings (MeSH), 3) genomic knowledge, corresponding to the sequence features of the studied genes, and 4) expression pattern knowledge, provided by experimental results. Establishing relations between these clusters makes it possible to assign functional relationships to a set of genes in order to infer knowledge on a biological model.

Two genes can be functionally related in several complementary ways: 1) according to the GO terminology, they can be involved in the same biological process (for example iron ion transport), in which they can carry out a specific molecular function (for example ferric iron binding) and 2) according to the UMLS, they can be involved in the same disease (for example liver disease). Therefore, the more information two genes share, the more functionally related they are. On this assumption, in order to assign relationships automatically to a set of genes, it is necessary to take into account the different descriptions of each gene. According to the transversal analysis, a gene is described as belonging to different clusters (related to GO and UMLS vocabularies, expression patterns and sequence features).

Methods have been proposed to establish functional relationships between genes [8,9]. These methods, based on term similarity, compare the different terms assigned to a pair of genes in order to calculate the semantic similarity between these genes. However, these approaches essentially deal with GO terms; the knowledge provided by the other descriptors is not integrated.
Furthermore, these methods do not always meaningfully estimate similarity between genes because of the hierarchy organizing the terms assigned to genes. For example, two genes associated with several terms that are respectively parent and child in the hierarchy do not have the highest similarity [10].

To address this issue, our transversal approach aims to combine different types of knowledge on the studied genes in order to predict functional relationships between genes. It consists in a parallel clustering of genes based on different kinds of knowledge. From these clusters, the semantic similarity between genes is computed in a vector space model [11]. Genes are described as vectors of annotations inferred from clusters. Comparing these vectors results in a matrix of gene similarity. Through the gene clustering, different kinds of knowledge can be integrated in the vectors. The hierarchy problem, encountered with GO and UMLS terminologies, is avoided by 1) associating a level to each term, 2) selecting an appropriate level to cluster the genes and 3) inferring the vectors according to the terms associated with this level. Indeed, selecting a level limits the number of terms linked by a direct hierarchical relation.

The transversal analysis is applied to a set of genes involved in enterocyte differentiation. These genes were previously studied by a transcriptomic approach based on the standard two-step approach [12]. The transversal approach results in several biologically relevant networks of functionally similar genes in the gene collection.

This paper is organized as follows. First, the principle of the transversal analysis is introduced. Then, we present the gene clustering methods and the vector space model used to compute the gene semantic similarity. Finally, the biomedical application is presented, followed by a discussion.

2. METHOD

2.1 Principle of the transversal analysis

For a microarray experiment, the spotted genes are grouped according to the three GO hierarchies. The resulting gene clusters are named bio-ontological clusters. At the same time, the genes are clustered according to 1) the medical concepts related to them through co-occurrences and hierarchical and associative relationships in the UMLS, which yields medontological clusters, 2) their sequence features (motifs of the studied genes), which yields genomic clusters, and 3) their expression level under the experimental conditions (up-regulated, down-regulated and invariant genes), which yields expression clusters.

Figure 1 shows an example of two bio-ontological clusters and their relations with expression clusters according to two experimental conditions. These relations underline genes differentially expressed in the same biological process under specific conditions, and we can hypothesize that these genes are functionally related. To verify this assumption, the relations with the other kinds of clusters are established. Genes that belong to the same medontological cluster could support the hypothesis and, in the same way, sequence features known to be involved in the same molecular function can facilitate the interpretation of results. Thus, the transversal approach based on relations between gene clusters can improve a biological model or predict a new model according to some a priori hypotheses.

Figure 1. Schematic representation of the transversal analysis. Genes are clustered according to two biological processes (A and B), represented in the center of the schema, and to expression pattern clusters (up-regulated (+), down-regulated (-) and invariant (=)) for Condition 1 and Condition 2, at the left and right of the schema. Each gene is linked to itself across the different clusters, underlining genes differentially expressed within a similar biological process.

2.2 Biomedical annotation retrieval and gene clustering

The gene biomedical annotations are retrieved through the public BioMedical Knowledge Extractor system (BioMeKE) [13]. BioMeKE provides access to information through the systematic investigation of a gene; it combines several relevant resources such as UMLS and GO.

2.2.1 GO clustering genes: Minimal Level method

The GO terms that annotate a set of genes are located at different levels of the hierarchies. In order to cluster the genes sharing a common term, we have to compare annotations at the same level. Thus, for each term associated with a gene, the ancestor terms are computed. Their relative position in the hierarchy, named level, is calculated. It corresponds to the number of relations between a term and the root term (Figure 2). As each term may be linked to more than one parent, the level of a term may differ according to the path towards the root term. The level, Lev_T, corresponds to the shortest path. This method, named the ML method (for Minimal Level method), calculates the minimal number of relations, Rel, between a term T and the root term, T_r:

$$ Lev_T = \min\left( \sum_{T}^{T_r} Rel \right) $$

where the sum counts the relations along one path from T to T_r and the minimum is taken over all such paths.

Hence, the genes sharing the same annotating term at a specific level are clustered. The ML method may be applied to the three GO hierarchies (bio-ontological clusters).

Figure 2. The GO term levels. According to the ML method, each term is associated with its minimal level in the hierarchy (L0, L1, L2, Ln). Annotated genes (g1, g2, g3) may be clustered according to these levels (right table). Term clusters listed in the right table: Ta: (g1,g2,g3); Tb: (g2,g3); Tc: (g1,g3); Td: (g2,g3); Th: (g2); Tg: (g3).
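The ML method described in section 2.2.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy hierarchy (`root`, `A` to `E`) and the gene annotations are all hypothetical, and the hierarchy is assumed to be given as a child-to-parents mapping. The minimal level is obtained by a breadth-first search from the root, and a gene joins the cluster of every term of the selected level that is either one of its annotating terms or one of their ancestors.

```python
from collections import defaultdict, deque

def minimal_levels(parents, root):
    """Map each term to the minimal number of relations separating it from the root."""
    children = defaultdict(set)
    for term, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(term)
    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children[term]:
            if child not in levels:  # first visit in a BFS is the shortest path
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels

def ancestors(term, parents):
    """Return all ancestor terms of `term` (the term itself excluded)."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def cluster_genes_at_level(annotations, parents, root, level):
    """Cluster genes under the terms whose minimal level equals `level`."""
    levels = minimal_levels(parents, root)
    clusters = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            for candidate in {term} | ancestors(term, parents):
                if levels.get(candidate) == level:
                    clusters[candidate].add(gene)
    return dict(clusters)

def count_clusters_by_level(annotations, parents, root):
    """Number of gene clusters obtained at each level below the root."""
    levels = minimal_levels(parents, root)
    return {
        lvl: len(cluster_genes_at_level(annotations, parents, root, lvl))
        for lvl in sorted(set(levels.values())) if lvl > 0
    }

# Hypothetical toy hierarchy: D is reachable through A (2 relations) or through C (3 relations).
parents = {"A": ["root"], "B": ["root"], "C": ["A"], "D": ["A", "C"], "E": ["B"]}
annotations = {"g1": {"C"}, "g2": {"D"}, "g3": {"D", "E"}}

levels = minimal_levels(parents, "root")
clusters_l1 = cluster_genes_at_level(annotations, parents, "root", 1)
counts = count_clusters_by_level(annotations, parents, "root")
best_level = max(counts, key=counts.get)  # level with the most clusters
```

In this toy case the term `D` receives level 2 because its shortest path to the root goes through `A`. Picking the level with the largest number of clusters is one possible selection rule; it mirrors the choice reported in the results but is written here as an assumption.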

2.2.2 UMLS clustering genes: Disorder projection

The upper level of the UMLS, the Semantic Network, groups concepts according to the semantic types that have been assigned to them. In order to reduce the complexity of this network, the semantic types have been aggregated into semantic groups [14]. To obtain medical information complementary to GO terms, we consider only the UMLS annotations related to pathologies. To this end, we select the concepts categorized by the semantic types associated with the Disorders semantic group. Thus, the selected concepts correspond to the projection of a filter on the semantic group Disorders. Genes sharing a common UMLS concept are clustered.

2.3 Computing gene similarities

Gene clusters, based on GO and UMLS, are used to calculate the semantic similarity between genes.

Similarity values for a pair of genes described by GO terms are calculated with a vector space model (VSM). VSMs are mainly used in information retrieval to compute the similarity between documents described as vectors of keywords [11]. Recently, this method has been used to identify associative relations between terms in the GO [15]. A collection of genes annotated with GO is analogous to a collection of scientific articles indexed with the MeSH controlled vocabulary.

The clusters of genes introduced in section 2.2 correspond to a matrix of specific-level GO terms by genes. This matrix consists of binary values indicating the presence or absence of an association between a GO term and a gene. We therefore transpose this matrix in order to obtain a matrix of genes by terms (Figure 3). The similarity between two vectors is represented by the angle between them. Given a pair of genes, g_1 and g_2, the semantic similarity, Sim(g_1, g_2), may be defined as the cosine of the angle between the two annotation vectors, which corresponds to a normalized dot product of the vectors:

$$ Sim(g_1, g_2) = \frac{\vec{g_1} \cdot \vec{g_2}}{|\vec{g_1}| \times |\vec{g_2}|} $$

Figure 3. Similarity in the vector space model from GO annotated gene clusters.

A weight may be applied to each binary association in order to lower the importance of an association between a gene and a GO term when the term is associated with many genes in the given collection. Indeed, we consider that a term is not representative of a gene if it annotates most of the genes in the collection. This weighting scheme is known as inverse document frequency (idf). Let N be the total number of genes in the collection and n_t the number of genes annotated by the term t. The weight is defined as follows:

$$ idf_t = \log \frac{N}{n_t} $$

The gene semantic similarity is computed pairwise for all genes in the collection. As the studied genes are involved in iron absorption, they share numerous annotations. Applying a weight to binary associations may therefore remove numerous associations. Consequently, we computed the semantic similarity both with and without the weighting scheme. Furthermore, in order to retrieve genes functionally related to a given biological process, our approach is restricted to terms from the Biological Process GO hierarchy (BP terms). This results in a half-matrix for the given gene collection. We use an arbitrary threshold of 0.6 for the normalized dot product in order to select the pairs of genes that present a high degree of similarity. The concepts filtered by the Disorders semantic group (UMLS concepts) may be incorporated in the matrix in order to support the hypothesis that genes involved in the same biological process should share a common pathology. We discuss, in the last section, the contribution of UMLS concepts to the semantic similarity of gene pairs.

As a result of the VSM method, several networks of gene semantic similarities can be constructed.
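The similarity computation of section 2.3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: genes are represented as sets of terms (equivalent to binary vectors), the idf weight is assumed to replace the value 1 in each vector component, pairs are kept when the similarity is greater than or equal to the threshold, and the gene and term names are hypothetical.

```python
import math

def idf_weights(gene_terms):
    """idf_t = log(N / n_t), with N genes in total and n_t genes annotated by t."""
    n_genes = len(gene_terms)
    counts = {}
    for terms in gene_terms.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: math.log(n_genes / n) for term, n in counts.items()}

def cosine(terms1, terms2, weights=None):
    """Cosine of the angle between two annotation vectors.

    With weights=None the vectors are binary; otherwise each component
    present in a gene takes the weight of its term (an assumption).
    """
    def weight(term):
        return 1.0 if weights is None else weights[term]

    dot = sum(weight(t) ** 2 for t in terms1 & terms2)
    norm1 = math.sqrt(sum(weight(t) ** 2 for t in terms1))
    norm2 = math.sqrt(sum(weight(t) ** 2 for t in terms2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def similar_pairs(gene_terms, threshold=0.6, weights=None):
    """Half-matrix of pairwise similarities, restricted to pairs reaching the threshold."""
    genes = sorted(gene_terms)
    pairs = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            sim = cosine(gene_terms[g1], gene_terms[g2], weights)
            if sim >= threshold:
                pairs[(g1, g2)] = sim
    return pairs

# Hypothetical annotations: t1 annotates every gene in the collection.
gene_terms = {"g1": {"t1", "t2"}, "g2": {"t1", "t2", "t3"}, "g3": {"t1", "t4"}}

binary_pairs = similar_pairs(gene_terms)
idf = idf_weights(gene_terms)
weighted_pairs = similar_pairs(gene_terms, weights=idf)
```

In this toy collection the term `t1` annotates all three genes, so its idf weight is log(3/3) = 0 and it no longer contributes to any similarity. The pair (g1, g2) passes the 0.6 threshold with binary vectors but not with weighted ones, which shows how weighting can remove associations when genes share many annotations.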

3. RESULTS

In order to better understand the transcriptional events underlying intestinal iron absorption, the CaCo-2 colon adenocarcinoma cell line was studied [12]. As these cells spontaneously differentiate in culture, they were used to characterize genes whose expression varies during differentiation by means of a transcriptomic approach based on microarray experiments. We have applied the transversal analysis to this gene collection. The genes are clustered according to expression data, BP terms and UMLS concepts. We then compare the different clustering levels in order to compute, at the most appropriate level, the semantic similarity between genes in the collection.

3.1 Gene clustering

3.1.1 Transcriptome clustering

Starting from the 726 genes spotted on the DNA chips, the experiments led to the identification of 186 significantly expressed genes: 50 down-regulated, 80 up-regulated and 56 invariant genes [12]. These genes are considered to form three expression clusters.

3.1.2 GO clustering

In order to obtain genes that are functionally related within a biological process, the 186 genes are clustered according to the BP terms, using the ML method presented in section 2.2.1. Table 1 shows the number of clusters and the number of genes according to the level of the hierarchy. The sixth level contains the highest number of processes. Consequently, we applied the VSM method to the sixth-level gene clusters.

Table 1. Number of clusters and genes according to process levels

Process level   Number of clusters   Number of genes
2               2                    6
3               13                   88
4               49                   87
5               65                   80
6               78                   74
7               71                   58
8               58                   47
9               31                   30
10              18                   17

3.1.3 UMLS clustering

Only seven per cent of the genes in the collection are annotated by one or more UMLS concepts. However, the number of medical clusters is 691. This high number of clusters is balanced by the fact that most clusters share the same genes, which means that the same concept annotates many genes in the collection.

3.2 Computing semantic similarity

The semantic similarity is computed on the sixth-level gene clusters. First, the BP term clusters are used to obtain binary (unweighted) associations between terms and genes. This results in fifteen functional networks, i.e. gene similarity networks, six of which contain more than two genes. To interpret these results, each gene in a network is labeled with the expression cluster to which it belongs. Three networks appear to be biologically relevant:

- The first network (Figure 4A) concerns genes involved in protein biosynthesis, with up-regulated and invariant genes coding for ribosomal proteins (RPL39, RPS7, RPL41, RPL35A, RPS3, and RPL7A) and down-regulated genes coding for translation initiation factors (EIF4A2, EIF3S2, and EIF3S8).

- The second network (Figure 4B) corresponds to an up-regulated cluster in which genes are involved in ion transport, more particularly in metal ion transport (TF, SLC21A9, SLC11A2, SLC2CA3, and AKR1C2).

- The third network (Figure 4C) links three up-regulated genes coding for apolipoproteins (APOA1, APOB, and APOC3).

Figure 4. Semantic similarity gene networks according to the Gene Ontology terms. Numbers above the relation between genes represent the degree of semantic similarity. A) Protein biosynthesis network, B) Ion transport network, C) Apolipoprotein network.

Taking into account the UMLS concepts associated with genes, we obtained an additional down-regulated network including well-known genes involved in the cell cycle process (NME1, NME2, HMGB1, STK15, FN1, and CDC2) (Figure 5). These results are computed from binary matrices. However, when weights reflecting the relative representation of terms in the gene collection are applied, as described in section 2.3, no functional network is retrieved.

Figure 5. Cell cycle gene similarity network.
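The paper does not state how gene pairs are assembled into networks. One plausible reading, given here only as an assumption, is that each retained pair is an edge and each network is a connected component of the resulting graph. The Python sketch below follows that reading; the gene identifiers, similarity values and expression labels are hypothetical and do not reproduce the reported results.

```python
from collections import defaultdict

def similarity_networks(pairs):
    """Group genes into networks, taken here as connected components
    of the graph whose edges are the retained gene pairs (an assumption)."""
    neighbours = defaultdict(set)
    for g1, g2 in pairs:
        neighbours[g1].add(g2)
        neighbours[g2].add(g1)
    seen, networks = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            gene = stack.pop()
            if gene not in component:
                component.add(gene)
                stack.extend(neighbours[gene] - component)
        seen |= component
        networks.append(component)
    return networks

def label_with_expression(network, expression):
    """Attach to each gene of a network the expression cluster it belongs to."""
    return {gene: expression.get(gene, "unknown") for gene in sorted(network)}

# Hypothetical pairs that passed the similarity threshold.
pairs = {("gA", "gB"): 0.8, ("gB", "gC"): 0.7, ("gD", "gE"): 0.9}
expression = {"gA": "up", "gB": "invariant", "gC": "down", "gD": "up", "gE": "up"}

networks = similarity_networks(pairs)
larger_networks = [n for n in networks if len(n) > 2]
labels = label_with_expression(networks[0], expression)
```

Labeling each network with expression clusters, as in the last call, makes it visible when a single network mixes up-regulated, down-regulated and invariant genes.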

4. DISCUSSION

This paper presents a new approach to interpreting the results of transcriptomic data analysis. The approach combines different kinds of knowledge by clustering genes according to GO and UMLS annotations. From these clusters, the vector space model makes it possible to retrieve functional gene networks based on gene semantic similarity.

This transversal analysis is applied to a collection of genes involved in enterocyte differentiation. Gene clustering according to BP terms results in three biologically relevant functional gene networks. By adding UMLS concepts, a new network appears. According to experts, all these networks are relevant. It is important to note that the similarity networks obtained by our approach are independent of the expression level. Indeed, Figure 4A shows a protein biosynthesis network in which genes are differentially expressed along this process. This result indicates that a transversal approach is more appropriate for studying a biological process than a classical two-step approach, which begins by clustering genes according to their expression patterns.

The keystone of our approach is to combine a transversal gene clustering according to multiple sources with a method that integrates the information provided by this clustering, i.e. descriptors of gene subsets, in order to compute the similarity between genes. To validate our approach, the BP terms and the UMLS concepts are used to support the hypothesis that genes involved in the same biological process should share a common pathology. Even if the filtered UMLS annotations are not numerous, they provide a gain of knowledge. Indeed, adding UMLS concepts to the gene-by-term matrices results in a new relevant similarity network. As shown in [13], UMLS annotations appear to be complementary to GO annotations. However, as anticipated in section 2.3, the idf weighting scheme appears to be too restrictive for this gene collection.
We explain this lack of results by the nature of the genes spotted on the DNA chips. Indeed, the CaCo-2 colon adenocarcinoma cell line is known to be useful for gaining insight into the mechanisms by which enterocytes adapt iron absorption [12]. Therefore, the spotted genes share numerous annotations related to iron metabolism, transport protein biosynthesis, and iron-associated pathologies. Consequently, applying a weight that lowers the importance of an association between a gene and a term when the term is associated with many genes in the collection removed numerous associations.

Through the clustering method, our approach makes it possible to select the GO level according to the set of genes to analyze. Indeed, as mentioned in [8], the depth of GO (and UMLS) depends on the biological knowledge rather than on anything intrinsic to the terms. This current knowledge differs according to the biological domain considered. As the most informative level, i.e. the level where genes are associated with the maximum number of concepts, appears to be correlated with the nature of the genes, the level is tested for each gene collection before the clustering. Current work consists in applying the ML method before the clustering according to the UMLS concepts. It could be interesting to compare the levels obtained across studies of different biological domains.

Recent works propose alternative approaches to cluster genes according to GO terms and to compute semantic similarity between genes. As presented in [16,17], a method has been developed to categorize a set of genes using semantic similarity in GO. Rather than clustering genes according to a common annotation, this method ranks representative terms for a gene set. The requirements of our approach are not compatible with this method: in order to compare genes, we need their specificities, i.e. the specific annotation terms of each gene. The methods described in [8,9] exploit the usage of terms in a corpus to give a measure of information content. Including this measure, the similarity for a pair of genes is calculated from the semantic similarity between GO terms. Our approach, using VSM, is not based on information content. However, this lack of precision in the term similarity is balanced by the possible integration of different kinds of knowledge during the analysis of a gene set. Indeed, the presence or absence of any association between a gene and a descriptor (term, motif, promoter, etc.) can be integrated in an annotation vector.

Furthermore, by computing the gene semantic similarity from annotations based on GO, we face the hierarchy problem described in [10]. Clustering genes by selecting terms associated with a specific level of the GO organization limits the presence of hierarchically related terms during the vector construction. We are thus working on a means to attribute specific weights to related terms that are present at a common GO level.
Future work will study the function of gene products in the different similarity networks in order to integrate unannotated genes. Previous research showed significant relationships between the semantic similarity of gene pairs and their sequence-based similarity [8]. We therefore plan to integrate the sequence feature information given by the genomic clusters, and the GO clusters related to the Molecular Function hierarchy, into the VSM matrix entries. We will have to evaluate the relative contribution of integrating various descriptors into annotation vectors. The VSM approach will facilitate the attribution of weights according to the importance of the information related to the gene functional relationships.

5. ACKNOWLEDGEMENTS

We gratefully acknowledge Fleur Mougin for helpful discussions and Gwenaëlle Marquet for contributing to this work.

6. REFERENCES

[1] Slonim D.K. (2002), From patterns to pathways: gene expression data analysis comes of age. Nat Genet., 32 Suppl: 502-8.

[2] Robinson P.N., Wollstein A., Böhme U., and Beattie B. (2004), Ontologizing gene-expression microarray data: characterizing clusters with Gene Ontology. Bioinformatics, 20: 979-981.

[3] Smid M., Lambert C., Dorssers J. (2004), GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms. Bioinformatics, 20: 2618-2625.

[4] Volinia S., Evangelisti R., Francioso F., Arcelli D., Carella M., and Gasparini P. (2004), GOAL: automated Gene Ontology analysis of expression profiles. Nucleic Acids Res., 32: W492-W499.

[5] Martinez-Cruz L.A., Rubio A., Martinez-Chantar M.L., Labarga A., Barrio I., Podhorski A., Segura V., Sevilla Campo J.L., Avila M.A., Mato J.M. (2003), GARBAN: genomic analysis and rapid biological annotation of cDNA microarray and proteomic data. Bioinformatics, 19(16): 2158-60.

[6] Gene Ontology Consortium (2004), The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 1, Database issue: D258-261.

[7] Bodenreider O. (2004), The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res., 32, Database issue: D267-70.

[8] Lord P., Stevens R., Brass A. and Goble C. (2003), Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 19: 1275-1283.

[9] Wang H., Azuaje F., Bodenreider O. and Dopazo J. (2004), Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. In Proc. of IEEE 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology, La Jolla, CA, USA, 25-31.

[10] Azuaje F., Wang H. and Bodenreider O. (2005), Ontology-driven similarity approaches to supporting gene functional assessment. Proceedings of the ISMB'2005 SIG meeting on Bio-ontologies.

[11] Baeza-Yates R. and Ribeiro-Neto B. (1999), Modern information retrieval, 513 p., ACM Press; Addison-Wesley, New York; Harlow, England.

[12] Bedrine-Ferran H., Le Meur N., Gicquel I., Le Cunff M., Soriano N., Guisle I., Mottier S., Monnier A., Teusan R., Fergelot P., Le Gall J.Y., Leger J., Mosser J. (2004), Transcriptome variations in human CaCo-2 cells: a model for enterocyte differentiation and its link to iron absorption. Genomics, 83(5): 772-89.

[13] Marquet G., Guérin E., Moussouni F., Loréal O., and Burgun A. (2005), UMLS-based biomedical annotation of functional genomic data. Proceedings of the French conference of bioinformatics: JOBIM'2005.

[14] McCray A.T., Burgun A., Bodenreider O. (2001), Aggregating UMLS semantic types for reducing conceptual complexity. Medinfo, 10(Pt 1): 216-20.

[15] Bodenreider O., Aubry M., Burgun A. (2005), Non-lexical approaches to identifying associative relations in the Gene Ontology. Proceedings of the Pacific Symposium on Biocomputing 2005, World Scientific, p. 91-102.

[16] Joslyn C., Mniszewski S.M., Fulmer A.W., and Heaton G.G. (2004), The Gene Ontology Categorizer. Bioinformatics, 20 Suppl 1: 169-177.

[17] Verspoor K., Cohn J., Joslyn C., Mniszewski S., Rechtsteiner A., Rocha L.M., Simas T. (2005), Protein annotation as term categorization in the gene ontology using word proximity networks. BMC Bioinformatics, 6 Suppl 1: S20.