Querying a bioinformatic data sources registry with concept lattices Nizar Messai, Marie-Dominique Devignes, Amedeo Napoli and Malika Smail-Tabbone
LORIA – UMR 7503 – BP 239, 54506 Vandoeuvre-lès-Nancy
ICCS 2005, Kassel – July 18-22, 2005
Querying a bioinformatic data sources registry with concept lattices – p.1/21
Outline 1. Motivation 2. BioRegistry: data source metadata repository 3. FCA for classifying and querying data sources 4. Ontology-based query refinement 5. Conclusion and future work
Outline 1. Motivation 1.1 Bioinformatic data sources on the web 1.2 Existing solutions 1.3 Challenge
2. BioRegistry: data source metadata repository 3. FCA for classifying and querying data sources 4. Ontology-based query refinement 5. Conclusion and future work
1.1 Bioinformatic data sources on the web
Bioinformatic data sources available on the Web: 719 in 2005 (171 more than in 2004)
Diversity of contents (e.g. particular/any organism(s))
Different data types (e.g. nucleic/proteic sequences)
Different data qualities (e.g. update, revision, annotation)
New data sources keep appearing
1.2 Existing solutions
Thematic portals
  Access to a collection of selected data sources
  Correspond to given points of view
  Limited search capabilities
Structured catalogs
  Bioinformatic data source catalog: DBcat
  Small set of "free text" metadata
  No longer maintained (since 2001)
1.3 Challenge
Improve data source identification through:
  gathering metadata in a structured repository
  taking into account existing domain ontologies
  organising data sources for browsing and querying
Outline 1. Motivation 2. BioRegistry: data source metadata repository 2.1 BioRegistry model 2.2 A subpart of the BioRegistry
3. FCA for classifying and querying data sources 4. Ontology-based query refinement 5. Conclusion and future work
2.1 BioRegistry model
BioRegistry: associate metadata to the data sources (from ontologies)
Idea: extract properties on the data sources from these metadata
A formal context: data sources × properties
2.2 A subpart of the BioRegistry
Data source properties extracted from the BioRegistry

Data Source      Sequence                     Organism            Manual Revision
---------------  ---------------------------  ------------------  ---------------
Swissprot (S1)   Proteic (PS)                 Any Organism (AO)   Yes
RefSeq (S2)      Nucleic (NS), Proteic (PS)   Any Organism (AO)   Yes
TIGR-HGI (S3)    Nucleic (NS)                 Human (Hu)          No
GPCRDB (S4)      Proteic (PS)                 Any Organism (AO)   Yes
HUGE (S5)        Nucleic (NS), Proteic (PS)   Human (Hu)          No
ENSEMBL (S6)     Nucleic (NS)                 Animal (An)         No
MGDB (S7)        Proteic (PS)                 Mouse (Mo)          No
VGB (S8)         Nucleic (NS)                 Vertebrate (Ve)     No
Ontologies used to valuate the properties (from NCBI)
Corresponding formal context (rows: sources, columns: metadata)

Source  NS  PS  AO  An  Ve  Hu  Mo  MR
------  --  --  --  --  --  --  --  --
S1      0   1   1   0   0   0   0   1
S2      1   1   1   0   0   0   0   1
S3      1   0   0   0   0   1   0   0
S4      0   1   1   0   0   0   0   1
S5      1   1   0   0   0   1   0   0
S6      1   0   0   1   0   0   0   0
S7      0   1   0   0   0   0   1   0
S8      0   1   0   0   1   0   0   0
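The formal context above can be written down directly as a mapping from sources to property sets. The following is a minimal sketch in Python, not the BioRegistry implementation: the names `ATTRIBUTES`, `ROWS`, `CONTEXT`, `common_properties` and `sources_with` are illustrative assumptions, and the rows are transcribed from the binary table. The two functions are the usual FCA derivation operators (shared properties of a set of sources, and sources having a set of properties).

```python
# Column order of the formal context (assumed names, taken from the table header).
ATTRIBUTES = ["NS", "PS", "AO", "An", "Ve", "Hu", "Mo", "MR"]

# One bit string per source, transcribed row by row from the binary table.
ROWS = {
    "S1": "01100001",
    "S2": "11100001",
    "S3": "10000100",
    "S4": "01100001",
    "S5": "11000100",
    "S6": "10010000",
    "S7": "01000010",
    "S8": "01001000",
}

# Source -> set of properties it has.
CONTEXT = {
    source: {attr for attr, bit in zip(ATTRIBUTES, bits) if bit == "1"}
    for source, bits in ROWS.items()
}


def common_properties(sources):
    """Properties shared by all given sources (the intent of a source set)."""
    props = set(ATTRIBUTES)
    for source in sources:
        props &= CONTEXT[source]
    return props


def sources_with(properties):
    """Sources that have all given properties (the extent of a property set)."""
    wanted = set(properties)
    return {source for source, props in CONTEXT.items() if wanted <= props}


if __name__ == "__main__":
    print(sorted(common_properties({"S3", "S5"})))
    print(sorted(sources_with({"PS", "AO", "MR"})))
```

A pair (sources, properties) is a formal concept when each side is the derivation of the other, for example the sources returned by `sources_with({"PS", "AO", "MR"})` together with their common properties.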
Outline 1. Motivation 2. BioRegistry: data source metadata repository 3. FCA for classifying and querying data sources 3.1 Methodology 3.2 Data source classification 3.3 Query 3.4 Data source retrieval algorithm 3.5 Problem
4. Ontology-based query refinement 5. Conclusion and future work
3.1 Methodology
3.2 Data source classification
Incremental construction of the concept lattices [Godin et al. 1995]
  Add new data sources (Registry updating)
  Insert queries (Registry querying)
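The idea of incremental construction can be illustrated on concept intents alone. The sketch below is a simplification and not the algorithm of Godin et al., which also maintains the lattice edges: it only shows that adding one object (a data source or a query) requires intersecting its property set with the intents already known. The names `ALL_PROPERTIES`, `SOURCES` and `add_object` are assumptions made for this example.

```python
# All properties of the formal context; the full set is the intent of the bottom concept.
ALL_PROPERTIES = frozenset({"NS", "PS", "AO", "An", "Ve", "Hu", "Mo", "MR"})

# Sources and their properties, transcribed from the formal context of section 2.2.
SOURCES = {
    "S1": {"PS", "AO", "MR"},
    "S2": {"NS", "PS", "AO", "MR"},
    "S3": {"NS", "Hu"},
    "S4": {"PS", "AO", "MR"},
    "S5": {"NS", "PS", "Hu"},
    "S6": {"NS", "An"},
    "S7": {"PS", "Mo"},
    "S8": {"PS", "Ve"},
}


def add_object(intents, object_properties):
    """Return the set of concept intents after adding one object.

    Every new intent is the intersection of the new object's properties
    with an intent that already exists; existing intents are kept.
    """
    obj = frozenset(object_properties)
    return intents | {intent & obj for intent in intents}


# Start from the empty registry and add the sources one by one.
intents = {ALL_PROPERTIES}
for properties in SOURCES.values():
    intents = add_object(intents, properties)

# Inserting a query works the same way as adding a data source.
intents_with_query = add_object(intents, {"NS", "Hu", "MR"})

if __name__ == "__main__":
    for intent in sorted(intents_with_query, key=lambda s: (len(s), sorted(s))):
        print(sorted(intent))
```

Treating a query as one more object is what allows the same update step to serve both registry updating and registry querying.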
3.3 Query
A query is a set of properties
Example: "Data sources that are manually revised and contain nucleic sequences of the human organism"
  nucleic sequences (NS)
  human organism (Hu)
  manually revised (MR)
Transform the query into a concept:
Query concept = ({Query}, {nucleic sequences (NS), Human (Hu), Manual Revision (MR)}) = ({Query}, {NS, Hu, MR})
3.4 Data source retrieval algorithm
Insert the query concept into the concept lattice [Carpineto 2000]
Search relevant data sources: a data source is relevant to a query if it shares at least one of its properties
Step 0: Locate the new query concept in the resulting lattice and begin the result construction:
  Result = Ø
Step 1: Get the query concept subsumers and continue the result construction:
  Result = 1) S3, S5 (Hu, NS), S2 (NS, MR)
Step 2: Continue with the subsumers of the concepts reached in Step 1:
  Result = 1) S3, S5 (Hu, NS), S2 (NS, MR) 2) S1, S4 (MR), S6 (NS)
Step 3: A concept with an empty intension is reached: end of the algorithm, return the result
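The steps above can be sketched without building the whole lattice. The code below is an illustrative reconstruction under stated assumptions, not the authors' implementation: it computes only the intents of the query concept and of its subsumers (intersections of the query with source property sets), then walks upward level by level until the empty intension is reached. The names `CONTEXT`, `subsumer_intents` and `retrieve` are hypothetical, and the context is transcribed from the binary table of section 2.2.

```python
from itertools import combinations

# Formal context of section 2.2: source -> properties.
CONTEXT = {
    "S1": {"PS", "AO", "MR"},
    "S2": {"NS", "PS", "AO", "MR"},
    "S3": {"NS", "Hu"},
    "S4": {"PS", "AO", "MR"},
    "S5": {"NS", "PS", "Hu"},
    "S6": {"NS", "An"},
    "S7": {"PS", "Mo"},
    "S8": {"PS", "Ve"},
}


def subsumer_intents(context, query):
    """Intents of the query concept and of every concept that subsumes it."""
    intents = {frozenset(query)}
    intents |= {frozenset(props & query) for props in context.values()}
    changed = True
    while changed:  # close the family under intersection
        changed = False
        for a, b in combinations(list(intents), 2):
            if a & b not in intents:
                intents.add(a & b)
                changed = True
    return intents


def retrieve(context, query):
    """Return ranks[k]: sources first reached k steps above the query concept,
    each mapped to the properties it shares with the query."""
    query = frozenset(query)
    intents = subsumer_intents(context, query)
    ranks, seen = [], set()
    level, visited = {query}, {query}
    while level:
        found = {}
        for intent in level:
            if not intent:  # empty intension: nothing is shared with the query
                continue
            for source, props in context.items():
                if intent <= props and source not in seen:
                    found[source] = set(props & query)
        if any(level):  # at least one non-empty intension at this level
            ranks.append(found)
        seen.update(found)
        # Move to the direct subsumers (largest strictly smaller intents).
        next_level = set()
        for intent in level:
            smaller = [c for c in intents if c < intent]
            for c in smaller:
                if c not in visited and not any(c < d for d in smaller):
                    next_level.add(c)
        visited |= next_level
        level = next_level
    return ranks


if __name__ == "__main__":
    for rank, sources in enumerate(retrieve(CONTEXT, {"NS", "Hu", "MR"})):
        print(rank, {s: sorted(p) for s, p in sorted(sources.items())})
```

Here `ranks[0]` holds the sources that satisfy the whole query (none in this example), `ranks[1]` corresponds to Step 1 and `ranks[2]` to Step 2. A query whose only property is absent from the context, such as `{"Ch"}`, yields no source at all.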
3.5 Problem
When query properties are not in the context
Examples:
1- Query concept = ({Query}, {Chicken (Ch)})
   Result = Ø, although data sources dealing with vertebrates can be interesting
2- Query concept = ({Query}, {Eukaryote (Eu)})
   Result = Ø, although data sources dealing with animals can be interesting
Idea: Ontology-based query refinement
Outline 1. Motivation 2. BioRegistry: data source metadata repository 3. FCA for classifying and querying data sources 4. Ontology-based query refinement 4.1 Generalisation refinement 4.2 Specialisation refinement
5. Conclusion and future work
4.1 Generalisation refinement
Generalisation refinement
  Add to the query the ancestors of the considered property in the ontology
  Only those that are in the formal context
Refined query:
  Query concept = ({Query}, {Ve, An, AO})
New result:
  Result = 1) S6 (An) 1) S8 (Ve) 1) S1, S2, S4 (AO)
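Generalisation refinement amounts to climbing the ontology from the query property and keeping the ancestors that occur in the formal context. The sketch below assumes a simplified, hypothetical fragment of the organism hierarchy (`PARENT`), chosen only so that it agrees with the refined query shown above for Chicken (Ch); it is not the NCBI taxonomy itself, and `generalise` is an illustrative name.

```python
# Hypothetical, simplified ontology fragment: property -> direct parent.
PARENT = {
    "Ch": "Ve",  # Chicken    -> Vertebrate
    "Hu": "Ve",  # Human      -> Vertebrate
    "Mo": "Ve",  # Mouse      -> Vertebrate
    "Ve": "An",  # Vertebrate -> Animal
    "An": "Eu",  # Animal     -> Eukaryote
    "Eu": "AO",  # Eukaryote  -> Any Organism
}

# Properties that appear as columns of the formal context.
CONTEXT_PROPERTIES = {"NS", "PS", "AO", "An", "Ve", "Hu", "Mo", "MR"}


def generalise(prop, parent=PARENT, context_properties=CONTEXT_PROPERTIES):
    """Ancestors of `prop` in the ontology that are present in the formal context."""
    refined = set()
    node = parent.get(prop)
    while node is not None:
        if node in context_properties:
            refined.add(node)
        node = parent.get(node)
    return refined


if __name__ == "__main__":
    print(sorted(generalise("Ch")))
```

Under these assumptions, `generalise("Ch")` gives the properties Ve, An and AO. Eukaryote (Eu) is skipped because it is not a column of the formal context.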
4.2 Specialisation refinement
Specialisation refinement
  Add to the query the descendants of the considered property in the ontology
  Only those that are in the formal context
Refined query:
  Query concept = ({Query}, {An, Ve, Hu, Mo})
New result:
  Result = 1) S6 (An) 1) S8 (Ve) 1) S3, S5 (Hu) 1) S7 (Mo)
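Specialisation refinement is the mirror operation: descend from the query property and keep the descendants that occur in the formal context. As before, this is a sketch over a hypothetical, simplified ontology fragment (`CHILDREN`) that merely agrees with the refined query shown above for Eukaryote (Eu); `specialise` is an illustrative name.

```python
# Hypothetical, simplified ontology fragment: property -> direct children.
CHILDREN = {
    "AO": ["Eu"],              # Any Organism -> Eukaryote
    "Eu": ["An"],              # Eukaryote    -> Animal
    "An": ["Ve"],              # Animal       -> Vertebrate
    "Ve": ["Ch", "Hu", "Mo"],  # Vertebrate   -> Chicken, Human, Mouse
}

# Properties that appear as columns of the formal context.
CONTEXT_PROPERTIES = {"NS", "PS", "AO", "An", "Ve", "Hu", "Mo", "MR"}


def specialise(prop, children=CHILDREN, context_properties=CONTEXT_PROPERTIES):
    """Descendants of `prop` in the ontology that are present in the formal context."""
    refined = set()
    stack = list(children.get(prop, []))
    while stack:
        node = stack.pop()
        if node in context_properties:
            refined.add(node)
        stack.extend(children.get(node, []))
    return refined


if __name__ == "__main__":
    print(sorted(specialise("Eu")))
```

Under these assumptions, `specialise("Eu")` gives An, Ve, Hu and Mo. Chicken (Ch) is a descendant of Eukaryote in the fragment but is left out because no data source in the context carries it.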
Outline 1. Motivation 2. BioRegistry: data source metadata repository 3. FCA for classifying and querying data sources 4. Ontology-based query refinement 5. Conclusion and future work
5 Conclusion and future work
Conclusion
  Classification of data sources according to their metadata
  Identification of relevant data sources for a given query
  Ontology-based query refinement
Future work
  Refine the definition of relevance (take into account some preferences)
  Define an order for data source composition (case of complex queries)
Thank you for your attention