SPATIAL DATABASES INTEGRATION: INTERPRETATION OF MULTIPLE REPRESENTATIONS BY USING MACHINE LEARNING TECHNIQUES

David SHEEREN
COGIT Laboratory - Institut Géographique National
2-4 av. Pasteur - 94165 St Mandé Cedex - France
[email protected]
LIP6 Laboratory - AI Section - University of Paris 6 - France

ABSTRACT

Many geographical databases exist to represent the same part of the world, seen at different levels of detail and from different points of view. The use and management of these databases sometimes require that they be integrated into a single database. An important issue for integration is the ability to analyze and understand the differences among multiple representations. These differences can be explained by differing specifications and by updates, but can also be due to errors during data capture. In this paper, we describe an approach to detect differences in representation between two geographical datasets, and we present experiments made to acquire rules through an inductive learning process meant to interpret these differences automatically.

1. INTRODUCTION

Geographical databases are traditionally designed for a particular field of application, such as the production of topographic maps, the management of cadastral parcels, regional planning, and so on. The same part of the real world can thus be described in several independent databases, with different levels of detail and different points of view. Combining the various aspects of each database by means of their integration is becoming a topical issue. Unifying heterogeneous data sources by making explicit the relations existing between them can help to:

- increase the potential for developing applications from these sources; some applications can benefit from using databases with multiple representations [1,2]
- maintain the databases and propagate updates [3]
- carry out quality analyses, by using one database to control another or by identifying inconsistencies [4,5].

The integration process concerns both the data schemas and the data instances of the databases. One of the main steps is to define the unified schema of the unified database, and the way this schema is related to the original database schemas. The other main step is to populate the unified database with objects from the initial databases. These two steps require a precise understanding of the content of each database, and of the differences between them. Although research has been undertaken in the last decade on various specific phases of the integration process [6,7,8], some key issues still need solving to achieve this goal. One of them, relating to the data instances, is the lack of methods to detect and assess the consistency among multiple representations of geographical objects [9,10]. Generally, the phase of populating the unified database is regarded only as a matching problem, and links between homologous objects are established and validated without analyzing the differences between them [11]. This step is however essential. Depending on the representation used, queries issued against the unified system can return different results. For example, it is quite possible to have different values for the same attribute of homologous road objects, or different results for a shortest path computation. In most cases, the differences are the consequence of specification differences between the databases and are thus justified. Specifications describe which objects of the real world should be represented in the database and detail the process of capturing them into the system. Nevertheless, differences can also occur because of errors during data capture or updates. Such differences are abnormal and can lead to inconsistent results in the system.

Our research problem falls within that context. We are defining a rule-based reasoning system which is able to decide, after a data matching process, whether matching pairs correspond to equivalent real-world objects, to an update between the databases, or to the consequence of an error in one of the databases. This automatic interpretation process is based on knowledge that can stem from the specifications of each database and from experts in the field. One of the main difficulties in building this expert system is the knowledge acquisition step [12]. Analyzing the specifications of each database and defining the rules are weighty tasks. In addition, the pertinent information is not always well identified, and specifications in the databases can be imprecise and incomplete. This knowledge acquisition bottleneck is well known in the artificial intelligence community. It has led to the emergence of machine learning techniques [13], which are precisely the subject of our paper.

In this article, we describe an approach to detect differences in representation between two heterogeneous datasets, and we present experiments made to acquire rules through an inductive learning process meant to interpret these differences automatically. The paper is structured as follows. In the next section, we introduce the characteristics of the datasets, their specifications and the learning problem. Then we present the acquisition step of the examples. Next, we describe the supervised learning procedure and discuss the experimental results. Finally, we conclude the study and indicate further work.

2. LEARNING PROBLEM

The datasets considered in this study relate to the building theme of two of the French National Mapping Agency (IGN) geographical databases: BDTOPO and BDCARTO. The examined area surrounds the city of Orleans (figure 1).

BDTOPO is a highly detailed topographic database with a one-meter resolution. This database was developed in particular to produce maps at the 1:25'000 scale, derived from aerial photographs. The building theme used for the experiments dates from 1998. The objects are represented by polygons using the following capture constraints: "In general, buildings are not generalized. Their individuality is maintained within the limits of the planimetric precision, i.e. one meter" [14]. The theme contains only premises for residential use, excluding industrial, agricultural and commercial buildings.

BDCARTO is a decametric geographical database meant in particular to produce maps at scales ranging from 1:100'000 to 1:250'000. It aims at satisfying application needs at regional and departmental levels. The database originates from two different sources: SPOT images for the land use theme, and scanned maps at a scale of 1:50'000 for the remainder of the database. The building theme is one of the modalities of the land use theme, and the dataset we used for our study dates from 1993. The specifications are defined as follows: "The building theme is composed of areas with predominantly private housing. The minimal surface of an area must be greater than 8 hectares. Small parcels are grouped if their surface does not reach the threshold, only if these are less than 100 meters apart" [15]. The building theme can contain schools, universities and hospitals in addition to private houses.


Figure 1. Extracts of the building theme of the two geographical databases used: BDCARTO (a) and BDTOPO (b).

Given the specifications and the up-to-dateness of the BDTOPO and BDCARTO databases, differences in representation between the datasets are expected. These differences relate to the geometry, topology and semantics of the objects, but we are only interested here in the geometrical differences. We would like to acquire explicit knowledge, in the form of classification rules, for determining the origin of each geometrical difference. These rules will be used in an expert system.

Because of the difficulty of extracting rules directly from the specifications of each database, we have decided to examine the possibilities of machine learning techniques, in particular supervised learning techniques [12,16,17]. These require a set of examples with their classification (in our case, a set of differences with their origin identified) in order to derive classification rules automatically. Formally, the learning problem can be stated as follows: given a set of examples S = {(xi, yi)} with yi = f(xi), find the classification function f. The xi are the input training data, i.e. the geometrical differences between the datasets, described by a set of measures. The yi are the output classes, i.e. the origins of the geometrical differences, known beforehand for the training data. The function f is the classifier that needs to be learned, i.e. the rules for determining the origin of the geometrical differences. After this inductive process, we will thus be able to specify automatically, for new examples with unknown classification, whether differences in representation are equivalencies, inconsistencies or updates. We consider representations as equivalent if they respect their own capture constraints and correspond to the same entity of the real world. Inconsistent representations are the consequence of errors during data capture or during the matching process. Updates obviously concern representations relating to different periods. The learning process therefore requires a set of training examples, whose acquisition is described below.

3. CONSTRUCTION OF THE TRAINING EXAMPLES

Several steps were necessary to build the training sets. First, we brought the representations closer together to facilitate their comparison and to detect differences. The BDCARTO dataset was simulated from the BDTOPO dataset by respecting the specifications of the less detailed representation. A set of differences was identified, and some of them were extracted and classified as equivalencies. Then, the two datasets were matched and another set of equivalencies was highlighted. Finally, the examples for the learning procedure were defined from the remaining differences, after an aggregation and a characterization stage.

Detecting differences requires selecting one dataset as the reference. It is indeed possible to have objects in the first dataset that do not exist in the second dataset, and vice versa. In this approach, the reference is BDTOPO. According to the specifications of the two databases, we can consider that all buildings in BDTOPO exist in BDCARTO in another form, because the BDTOPO dataset is more recent than the BDCARTO dataset. In the opposite case, detection and interpretation would have had to be undertaken in both directions. The general process of detecting differences, simulating BDCARTO from BDTOPO and constructing the training examples is detailed in the next sections.

3.1. Simulation of the dataset of BDCARTO with the dataset of BDTOPO

Differences in representation cannot be analyzed without transforming the highly detailed dataset into the less detailed one. In other words, it is necessary in our approach to simulate the representation of BDCARTO from the BDTOPO dataset in order to highlight discrepancies between the representations. This is the first step of the process.
We have created building areas by expanding each small house of BDTOPO with a buffer of 50 meters in radius, and we have merged the connected ones (figure 2). The radius was selected to take into account the specifications of BDCARTO, which require amalgamating small parcels that are less than 100 meters apart. Once this has been done, we can already notice some differences in representation. Several areas in the new BDTOPO dataset do not exist in the BDCARTO dataset, and a first interpretation procedure is possible. We can consider that all new areas with a surface lower than 8 hectares are equivalencies: since the minimal surface threshold in the BDCARTO specifications is fixed at 8 hectares, it is perfectly normal that small areas in BDTOPO do not appear in BDCARTO. The simulation is however not yet complete. To finish it, we eliminate the equivalencies from the BDTOPO dataset and continue the process with the remainder of the data. An excerpt of the results can be viewed in figure 2. Let us mention that the differences designated here as equivalencies could also be updates: the BDTOPO dataset dates from 1998, whereas the BDCARTO dataset is 5 years older. Nevertheless, even if the dataset had been updated, the objects would still not appear in BDCARTO because of the minimal surface threshold fixed in the specifications. We can therefore consider all of these differences to be equivalencies.
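The following sketch illustrates this buffer-and-merge simulation. It assumes Shapely geometries; the function name, the use of Shapely itself and the square-meter units are our assumptions, as the paper does not describe its tooling.

```python
# Minimal sketch of the simulation step (hypothetical implementation).
from shapely.ops import unary_union

BUFFER_RADIUS = 50.0  # meters: half the 100 m grouping distance of the BDCARTO specs
MIN_AREA = 80_000.0   # 8 hectares, in square meters

def simulate_bdcarto(bdtopo_buildings):
    """Expand each building by a 50 m buffer, merge connected buffers,
    and split the aggregates on the 8 ha threshold of the specifications."""
    merged = unary_union([b.buffer(BUFFER_RADIUS) for b in bdtopo_buildings])
    parts = list(getattr(merged, "geoms", [merged]))  # MultiPolygon or single Polygon
    # Aggregates below 8 ha are legitimately absent from BDCARTO:
    # they are classified as equivalencies and removed from further processing.
    equivalencies = [p for p in parts if p.area < MIN_AREA]
    simulated = [p for p in parts if p.area >= MIN_AREA]
    return simulated, equivalencies
```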

[Figure 2 shows, starting from the BDTOPO and BDCARTO extracts: buffer construction, buffer selection, the simulated BDCARTO, and their overlay.]
Figure 2. Simulation of BDCARTO from the BDTOPO dataset.

3.2. Matching spatial datasets

At the end of the simulation, the correspondences between the datasets have not yet been computed, and it is necessary to proceed with this task. Our matching procedure is very simple in this experiment: we consider that a building of BDTOPO matches an area of BDCARTO if the building overlaps one or more areas. Here we therefore work with the highly detailed representation of BDTOPO, not with the abstracted one. Buildings that are not matched are discrepancies requiring interpretation. The other buildings can be classified as equivalencies.

3.3. Buildings aggregation and characterization

The matching process has enabled us to establish correspondences with the BDCARTO dataset for the majority of BDTOPO objects. Nevertheless, there still remain some differences which are more difficult to classify in this state. If we use the initial BDTOPO building representation to distinguish the remaining differences in terms of equivalence, inconsistency and update, we can only rely on a distance criterion. This is not sufficient for the learning procedure, because this attribute is not discriminating enough. It is thus necessary at this stage to change the representation, in order to facilitate the automatic production of rules meant to interpret the discrepancies.

The change of representation was carried out by aggregating BDTOPO buildings in two different ways: by using buffers, and by computing a Delaunay triangulation.

The first way, i.e. the aggregation with buffers, is similar to the simulation step of BDCARTO, but only applies to the differences remaining after the matching step. We created buffers of 50 meters in radius around each building of BDTOPO and dissolved the boundaries between them where they were connected. This operation gave us a first set of building groups. Then, we repeated the operation with a radius of 35 meters, but starting from the boundary of each group and working towards the inner side. This enabled us to create groups that are more homogeneous in terms of difference classes. These groups should have constituted the objects from which the training examples could be constructed. Unfortunately, some of them still present heterogeneity: several aggregated buildings should have remained separate in order to ensure the construction of homogeneous class examples. This first method thus does not seem really appropriate, which explains the use of the second aggregation method.

The second method is less immediate. It can be compared to graph-based clustering methods, for which a number of different approaches have been proposed [18,19]. First, we computed the gravity center of each unclassified building of BDTOPO. Then, we triangulated this set of nodes with the Delaunay method (figure 3a). Finally, we filtered the edges of the triangulation in order to obtain representative clusters (figure 3b). The first criterion used to filter the neighborhood graph is a length criterion, whose value was determined empirically: any edge of the graph longer than 115 meters was deleted. The second criterion is an intersection criterion, defined to correct some defects encountered in the first method: any edge that intersects one or more building areas of BDCARTO was eliminated. An illustration of these defects, and of their correction by means of the triangulation method, is presented in figure 4.
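A sketch of this triangulate-and-filter clustering is given below, assuming SciPy for the Delaunay triangulation and Shapely for the geometric tests; the union-find component extraction and all names are illustrative, while the 115-meter threshold is the empirical value quoted above.

```python
# Hypothetical sketch of the graph-based clustering of unmatched buildings.
import numpy as np
from scipy.spatial import Delaunay
from shapely.geometry import LineString

MAX_EDGE_LENGTH = 115.0  # meters, determined empirically in the paper

def cluster_buildings(buildings, bdcarto_areas):
    """Triangulate building centroids, filter edges, return index clusters."""
    centroids = np.array([[b.centroid.x, b.centroid.y] for b in buildings])
    tri = Delaunay(centroids)  # needs at least 3 non-collinear points
    # Collect the unique edges of the triangulation.
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    # Keep an edge only if it is short enough and crosses no BDCARTO area.
    kept = []
    for a, b in edges:
        seg = LineString([centroids[a], centroids[b]])
        if seg.length > MAX_EDGE_LENGTH:
            continue
        if any(seg.intersects(area) for area in bdcarto_areas):
            continue
        kept.append((a, b))
    # Connected components of the filtered graph are the clusters (union-find).
    parent = list(range(len(buildings)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for a, b in kept:
        parent[find(a)] = find(b)
    clusters = {}
    for i in range(len(buildings)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```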


Figure 3. Extract of the BDCARTO dataset (in red) and of the BDTOPO buildings that do not match BDCARTO, from which the Delaunay triangulation has been computed (a) and filtered (b).

Finally, we constructed an object around each computed cluster, as in the buffer method. The buildings and edges of each sub-graph were expanded with a 15-meter-radius buffer, giving a set of objects to be used in the learning procedure (figure 4).

[Figure 4 shows, for a set of discrepancies: the buffers method (first buffers, then resulting buffers), and the graph-based clustering method combined with the buffers method (Delaunay triangulation, filtering, buffers built from the sub-graphs).]

Figure 4. The two methods used to aggregate discrepancies, and an illustration of their results.

It now remains to characterize each group obtained with the graph-based clustering method, in order to prepare the learning procedure and to complete the acquisition of the examples corresponding to discrepancies. All groups of buildings are described by a few attributes considered pertinent and representative. This is a critical step, because the selected attributes influence the results of the learning process: they must contain sufficient information to discriminate the groups into homogeneous subsets (i.e. the classes).

Eight attributes have been chosen for each cluster (a computation sketch follows the list):

- the number of buildings in BDTOPO
- the area of the group
- the perimeter of the group
- the density of BDTOPO buildings within the group
- the distance between the center of gravity of the group and the BDCARTO area
- the distance between the BDTOPO building nearest to the BDCARTO area and that area
- the distance between the BDTOPO building farthest from the BDCARTO area and that area
- the compactness of the group
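As announced above, here is a hedged sketch of how these descriptors could be computed with Shapely. The density and compactness formulas are our assumptions, since the paper gives no explicit definitions, and all names are illustrative.

```python
# Hypothetical computation of the eight descriptors for one group.
import math
from shapely.ops import unary_union

def characterize(buildings, group, bdcarto_area):
    """buildings: BDTOPO polygons of the cluster; group: the buffered
    cluster polygon; bdcarto_area: the corresponding BDCARTO building area."""
    distances = [b.distance(bdcarto_area) for b in buildings]
    return {
        "Nb": len(buildings),
        "Area": group.area,
        "Perimeter": group.length,  # .length of a polygon is its perimeter
        # Assumed definition: built-up ratio within the group.
        "Density": unary_union(buildings).area / group.area,
        "Dist_gravity": group.centroid.distance(bdcarto_area),
        "Dist_nearest": min(distances),
        "Dist_farthest": max(distances),
        # Assumed definition: Miller's index, 4*pi*A/P^2 (1 for a circle);
        # one common formula among several, the paper does not say which it used.
        "Compactness": 4 * math.pi * group.area / group.length ** 2,
    }
```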

Measures were carried out for each cluster, and we obtained a set of characterized groups which constitute our examples for the learning process. The different steps of the construction of the training sets described in these sections are summarized in figure 5.

[Figure 5 is a flowchart. Inputs: the BDTOPO dataset and specifications, and the BDCARTO dataset and specifications. Steps: creation and aggregation of buffers; buffer aggregates under 8 ha yield a first set of equivalencies, while the others go through spatial data matching; buildings that overlay a BDCARTO area yield a second set of equivalencies, the others a set of discrepancies; buildings aggregation and characterization of the discrepancies finally yield the set of examples for the learning process.]
Figure 5. The construction of the training examples: simulation, matching and aggregation operations.

4. LEARNING AND INTERPRETATION PROCEDURE

4.1. Induction and results

As mentioned above, the machine learning techniques used in this study are supervised learning techniques. This implies that the classification of each training example is known. From these examples, and depending on the chosen algorithm, a set of rules or a decision tree is generated. The first task to be undertaken before initiating the inductive procedure is therefore the interpretation of each cluster constituting a discrepancy between the BDTOPO and BDCARTO datasets. We have used three different maps to perform this classification:

- a topographic map at a scale of 1:25'000, dating back to 1991; this map is produced from the BDTOPO database and is much older than the dataset used;
- a map at a scale of 1:50'000, dating from 2000; some of the themes existing in BDCARTO are derived from this document;
- a map at a scale of 1:100'000, dating from 2001; this map is produced from BDCARTO.

Through compiling and comparing this information, we have interactively classified each of the previously created groups. Three classes have been retained:

- the class of equivalencies: it corresponds to differences considered normal and justified by experts in the field. In this case, unlike the first sets of equivalencies determined previously, it is impossible to explain the differences by means of the explicit specifications of the databases; they can only be grasped through implicit knowledge stemming from capture operators and their experience;
- the class of updates: these could be highlighted with the different maps;
- the class of errors: these result, for the most part, from the matching process.

With this interpretation step accomplished and the description of the examples completed, we can start the learning process. We have used the C4.5 algorithm to perform this task [20]. The algorithm requires an attribute-value list as representation language for the input data (table 1) and relies on an entropy measure to subdivide the training set and provide classification rules. We have tried to learn rules in two different ways: in one step, by directly using the examples labeled with one of the three classes of differences; and in two steps, by first learning rules to separate the errors from the other differences, and then learning rules to classify these other differences, i.e. the equivalencies and the updates.

Table 1. An extract of the input data for the learning procedure

ID  Nb  Area      Perimeter  Density  Dist_gravity  Dist_nearest  Dist_farthest  Compactness  Class
1   2   4501.166  333.149    0.035    52.400        8.585         78.819         0.648        Equivalence
2   1   1937.692  160.202    0.114    38.191        40.221        40.221         1.207        Equivalence
3   8   11303.30  608.507    0.119    5.332         6.447         55.992         0.488        Error
4   1   1654.495  146.191    0.099    8.271         8.039         8.039          1.238        Update
5   2   4501.166  333.149    0.035    51.672        11.968        165.672        0.361        Equivalence
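A minimal sketch of this supervised step is shown below. C4.5 itself is not available in scikit-learn, so an entropy-based CART decision tree stands in for it here; the data are the five rows of table 1, and everything else is illustrative.

```python
# Stand-in for the C4.5 induction step, using scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["Nb", "Area", "Perimeter", "Density",
            "Dist_gravity", "Dist_nearest", "Dist_farthest", "Compactness"]

# The five examples of table 1; the last column (Class) is the target.
X = [
    [2, 4501.166, 333.149, 0.035, 52.400, 8.585, 78.819, 0.648],
    [1, 1937.692, 160.202, 0.114, 38.191, 40.221, 40.221, 1.207],
    [8, 11303.30, 608.507, 0.119, 5.332, 6.447, 55.992, 0.488],
    [1, 1654.495, 146.191, 0.099, 8.271, 8.039, 8.039, 1.238],
    [2, 4501.166, 333.149, 0.035, 51.672, 11.968, 165.672, 0.361],
]
y = ["Equivalence", "Equivalence", "Error", "Update", "Equivalence"]

clf = DecisionTreeClassifier(criterion="entropy")  # entropy splits, as in C4.5
clf.fit(X, y)
print(export_text(clf, feature_names=FEATURES))    # the tree as readable rules
```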

An example of a rule obtained by C4.5 is reported below:

  If the number of buildings of BDTOPO > 4
  If the density of buildings in BDTOPO within the group <= 0.488
  If the distance between the center of gravity of the group and the BDCARTO area > 53.28
  If the distance between the center of gravity of the group and the BDCARTO area