How to Select the Best Dataset for a Task?

Eva Grum*, Berengere Vasseur**

* Institute for Geoinformation and Cartography, Technical University of Vienna, Gusshausstraße 27-29, 1040 Vienna, [email protected]
** Université de Provence, Laboratoire des Sciences de l'Information et des Systèmes (LSIS), rue Joliot-Curie, 13453 Marseille cedex 13, [email protected]

Abstract. A simple, algorithmic method to select the most appropriate dataset for a task is described and demonstrated. It is based on an assessment of the data quality of all objects represented in the dataset and a similar assessment of the quality needed for the individual data elements used in the decision process. An algorithmic comparison of the data quality of the data source with the requirements leads to a numeric assessment of the suitability of the dataset. The dataset with the highest suitability score should be selected.

Introduction

The selection of a dataset from several potentially useful ones is most often done intuitively. We describe here a rational method for comparing the suitability of different datasets with respect to an intended use. It is based on a detailed assessment of the data quality for each theme represented in the dataset. A similar assessment of the required quality of each theme is made for the intended task. An algorithmic comparison then shows which datasets are suitable, and the numeric assessment of suitability indicates which one is the most appropriate choice.

The paper demonstrates the method using a somewhat simplified example: two datasets are available and are assessed for use by a tourist and by fire-fighters. The datasets are patterned after the multi-purpose digital map of Vienna [ref to wilmersdorfer] and a digital tourist map [ref hoelzel?].

The paper is structured as follows: the second section describes the terminology used and the ontology; this covers the user task and its requirements, what data quality is, and how the usability of a map can be described. The third section shows how to describe data quality for themes and the principles of the assessment; the numeric values are arranged in a matrix for each dataset and give, for each data element type, an assessment for different quality aspects. In the fourth section the user's task is described along similar lines. The fifth section then explains the element-by-element comparison and how a suitability measure for each combination of dataset and user requirement is computed. The paper concludes with a suggestion of how to integrate K.O. criteria, following the suggestion by Maslow (see [jahn issdq]), into the assessment, and lists open questions for future work.

Ontology and terminology

In this section we explain how we conceptualize the situation and introduce the terms we will use in the sequel.

We start with the general situation of a user who must make a decision, or expects to make repeated decisions, of a spatial nature. There are many such tasks, each of which has different requirements for the data input that is necessary and for other data inputs that merely improve the decision. The running example in this paper concentrates on the decisions necessary to navigate in a city, i.e. the decision about turning left or right at a street corner. We build here on extensive previous work, in particular by Krek in her Ph.D. thesis [ref]. Other spatial decisions have different requirements for the necessary data, but they can be handled with the same logic to capture the user requirements.

The user has the choice of several datasets which are potentially beneficial for his task, i.e. they have the potential to allow him to make better decisions than without them. The datasets contain data describing different aspects of reality, which we will call themes. The datasets also differ in how well their contents correspond to the reality they represent. We will use the term data quality to mean the correspondence between an object in reality and its representation in the dataset. Data quality will be differentiated into several aspects like precision, completeness, or up-to-dateness. This results in a description of the data quality of the datasets which is in principle independent of a task.

To decide on suitability, the information offered by a dataset must be compared against the requirements. It is not possible to assess the suitability of a dataset independently of the intended use, but the data quality assessment as suggested above is independent of the task. The proposed method results in a systematic, programmable procedure to compute a usability value for each combination of a user requirement profile and a data quality description of a dataset. The higher the usability value, the more suitable the dataset is for the corresponding task.

Description of the data quality of a data set

This section describes how we assess the quality of the datasets. It follows suggestions found in many places in the literature where metadata and data quality descriptions are discussed. A dataset contains data about objects in the world; the data can be grouped into themes, each describing a specific type of attribute of the objects. For example: street name, diameter of water main, the position of an ATM, or the number of the bus line serving a bus stop. There is in theory an infinite number of themes under which the objects in the world can be classified and an equally infinite number of attributes for each. The agency constructing a dataset decides which attributes are collected for which objects; an assessment of the dataset is restricted to this finite subset of themes.

The quality assessment is generalized for the theme and usually not given for each individual data element (cf. measurement based cadastre [Navratil?]). For example, a dataset is assessed to contain 95% of all street names correctly. Such a generalized assessment is appropriate, because the procedure to collect the data is usually the same for the whole dataset and thus the quality is uniform.

The quality of a theme in a dataset can be assessed according to different aspects; a dataset in which a theme is not very complete but very up-to-date is hard to compare with a different dataset in which the same theme is very complete but not so up-to-date. The method suggested here assumes that these different aspects can be assessed separately. We will use precision, completeness, and up-to-dateness as individual aspects. These are the most often mentioned aspects and are covered in several metadata standards [refs: stds, Dublin core]. The method described does not depend on these specific aspects, but on the (admittedly questionable) assumption that assessments of data quality aspects can be made independently. It is obvious from the example above that a dataset of street names which was very precise and complete when collected but not updated for several years is comparable in number of errors to a not-very-precise or not-very-complete but completely updated dataset. Overcoming the limitations introduced by this assumption is left to future research.

The assessment for precision is the value of the standard deviation of point locations (in m), the assessment for completeness is the percentage of correctly included objects out of the total of objects which should be included (this measures primarily omission; commissions are relatively rare in cartography), and the year of data collection indicates how up-to-date the dataset is.

The two datasets used here as examples for content in publicly available maps useful for in-city navigation are the multi-purpose map of the city of Vienna [ref. http://www.muvis.at/udk/udk/html/detail_a6579382-c268-11d2-9a86-080000507261.html; I believe there is also an article by Wilmersdorf, please reference it] and the city map of Vienna [ref.]. These datasets are used as examples for typical content only. The datasets overlap in some of their themes. The following two lists of fictitious quality assessments give values for the themes which are included in the datasets; empty cells should be read as null values. The data quality assessment values are for demonstration and do not reflect the data quality found in the products distributed by the Magistrate Wien. [Include here the city map and the multi-purpose map assessments (but not the numeric assessment for each); that is half a page of matrices, A and B, with the themes column appearing only once.]
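To make the structure of such an assessment matrix concrete, the following is a minimal sketch of how it could be held in software. All theme names and numbers are hypothetical placeholders, not the actual Vienna figures; missing cells are represented as None (null values), in line with the note above.

```python
# Sketch: per-theme quality assessment of a dataset (hypothetical values).
# One row per theme with precision (std. dev. in m), completeness (fraction)
# and the year of data collection; None marks an empty cell.
from typing import Dict, Optional

QualityRow = Dict[str, Optional[float]]

city_map: Dict[str, QualityRow] = {
    "street name":  {"precision_m": 1.0,  "completeness": 0.95, "year": 2002},
    "bus stop":     {"precision_m": 10.0, "completeness": 0.90, "year": 2001},
    "fire hydrant": {"precision_m": None, "completeness": None, "year": None},  # theme not included
}

multi_purpose_map: Dict[str, QualityRow] = {
    "street name":  {"precision_m": 0.01, "completeness": 0.99, "year": 2003},
    "bus stop":     {"precision_m": 1.0,  "completeness": 0.80, "year": 2000},
    "fire hydrant": {"precision_m": 0.01, "completeness": 0.98, "year": 2003},
}
```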

Description of the user task

The suitability of a dataset does not automatically follow from the data quality description. One cannot determine the suitability for a task just by studying the data quality description; one has to consider the task and the decision which should be made with the information from the dataset. The difficulty so far has been describing the user's decision situation. Krek has shown how the quality of the spatial information affects the quality of the decision [ref]; for a given decision, it is possible to identify which data elements influence the decision and to ignore the data quality of all the others. Only the data quality of the data elements which influence the decision affects the quality of the decision, and the different data quality aspects affect the decision differently.

It is possible to assess for each theme how much it influences a specific decision. It is not possible to make these assessments independently of the decision a user has to take. We select three different user groups, for which we have constructed data requirement profiles (again, for demonstration purposes; a rational chain of reasoning to deduce these requirements from actual descriptions of the tasks is left for future work). The assessment follows the list of themes provided in the datasets and uses the same data quality aspects. It would be possible to use just the themes of importance for a task (the ones not required are left blank here) or to add themes which are not provided in the datasets; this would not influence the result. The two tasks we consider are visiting the city and responding to an emergency. The two exemplary user types here are a tourist who visits Vienna and the Vienna fire brigade. These seem to be sufficiently different, stereotypical uses, suitable for a demonstration of the method to select the best dataset for a task. [Insert matrices C and D here. What happened to the truck driver?] Data elements which are present in a dataset but not used at all, and cannot possibly influence the decision to be made, are considered clutter: they make it more difficult to use the important data and require mental operations to ignore them. This can be assessed as a negative value for completeness, as sketched below.
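A requirement profile can be sketched in the same structure as the dataset assessment above. The themes and numbers below are hypothetical placeholders; the only point illustrated is that a theme considered clutter for a task gets a negative completeness requirement, and that aspects not required are left as None.

```python
# Sketch: requirement profile for one user type (hypothetical values),
# expressed in the same units as the dataset assessment so that the same
# normalization formulae can be applied to both (see next section).
tourist_requirements: Dict[str, QualityRow] = {
    "street name":  {"precision_m": 10.0, "completeness": 0.90, "year": 2000},
    "bus stop":     {"precision_m": 10.0, "completeness": 0.95, "year": 2003},
    "fire hydrant": {"precision_m": None, "completeness": -0.20, "year": None},  # clutter for a tourist
}
```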

Calculation of usability

Usability results from the comparison of the user requirements and the data quality available in the dataset. This is done in two steps:

1. normalize the assessed data quality aspects and the requirements to a percentage in the range 0% to 100% (i.e. a value between 0 and 1);
2. compare the requirements with the provided data quality.

Normalization

The numeric assessments of precision and up-to-dateness are on different scales and must be made comparable. All values must be normalized to the range 0 to 1 (or 0% to 100%). Completeness is already a percentage and is carried forward (note: it can be negative, indicating that the theme is considered clutter for this task; the range of admissible values is effectively -1 to 1). For up-to-dateness we apply a 5% penalty per year; the formula is: 100% - 5% x (current year (2004) - year of the last update). For precision we use an (intuitively guessed) table: 1 cm – 99%, 1 m – 90%, 10 m – 80%. More research is necessary to understand the influence of these normalizations. Differences in the formulae should cancel out, as the same formulae are used to convert both the data quality aspects and the requirements.
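The following is a minimal sketch of the normalization step, following the formulae above. The up-to-dateness penalty and the completeness pass-through come directly from the text; the log-linear interpolation between the three precision anchor points (1 cm, 1 m, 10 m) is our own assumption, since only the anchor values are given.

```python
# Sketch of the normalization step (assumptions noted in the comments).
import math

CURRENT_YEAR = 2004

def normalize_up_to_dateness(year_of_last_update: int) -> float:
    """100% minus a 5% penalty per year since the last update (floored at 0)."""
    return max(0.0, 1.0 - 0.05 * (CURRENT_YEAR - year_of_last_update))

def normalize_completeness(completeness: float) -> float:
    """Completeness is already a fraction; negative values mark clutter."""
    return completeness

def normalize_precision(std_dev_m: float) -> float:
    """Map the standard deviation of point locations to a score using the
    anchor points 0.01 m -> 0.99, 1 m -> 0.90, 10 m -> 0.80; the log-linear
    interpolation in between is an assumption, not given in the text."""
    anchors = [(0.01, 0.99), (1.0, 0.90), (10.0, 0.80)]
    if std_dev_m <= anchors[0][0]:
        return anchors[0][1]
    if std_dev_m >= anchors[-1][0]:
        return anchors[-1][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= std_dev_m <= x1:
            t = (math.log10(std_dev_m) - math.log10(x0)) / (math.log10(x1) - math.log10(x0))
            return y0 + t * (y1 - y0)
    return anchors[-1][1]
```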

Compare requirements

To compare the requirements with the offered quality, a weighted mean is computed over each theme and each data quality aspect, where the (normalized) requirement is taken as the weight and the provided quality as the value. If, for a requirement of 100%, the data have a quality of 100%, the product of the two values is again 100%, and this contributes to a perfect overall score of 100%. If a requirement is not fulfilled, the contribution to the score is smaller. The calculations can be seen in the following spreadsheet, which gives the normalization and the computation of the weighted mean for the use of the city map by the tourist. The total score is XX percent; the same calculation for the multi-purpose map gives only YY percent. Taking however the requirements of the fire brigade, the city map scores only XX, much less than the YY% of the multi-purpose map.
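A sketch of this comparison is given below, building on the hypothetical data structures and normalization functions from the earlier sketches. The treatment of missing cells (a theme or aspect that is not assessed contributes a quality of 0) and the use of absolute weights in the denominator so that clutter penalties reduce the score are our own assumptions; the text only specifies the weighted mean itself.

```python
# Sketch: usability as a weighted mean over all themes and quality aspects,
# with normalized requirements as weights and normalized offered quality as values.
def normalize_row(row: QualityRow) -> Dict[str, Optional[float]]:
    """Normalize one theme row (precision_m, completeness, year) to 0..1 scores."""
    return {
        "precision": normalize_precision(row["precision_m"]) if row.get("precision_m") is not None else None,
        "completeness": normalize_completeness(row["completeness"]) if row.get("completeness") is not None else None,
        "up_to_dateness": normalize_up_to_dateness(int(row["year"])) if row.get("year") is not None else None,
    }

def usability(dataset: Dict[str, QualityRow], requirements: Dict[str, QualityRow]) -> float:
    """Weighted mean: weight = normalized requirement, value = normalized offered quality."""
    weighted_sum, weight_sum = 0.0, 0.0
    for theme, req_row in requirements.items():
        req = normalize_row(req_row)
        offered = normalize_row(dataset.get(theme, {}))
        for aspect, weight in req.items():
            if weight is None:
                continue                        # aspect not required for this task
            value = offered.get(aspect) or 0.0  # missing quality counts as 0 (assumption)
            weighted_sum += weight * value
            weight_sum += abs(weight)           # abs() so clutter weights still count in the denominator
    return weighted_sum / weight_sum if weight_sum else 0.0

# Example with the hypothetical profiles above:
# print(usability(city_map, tourist_requirements))
# print(usability(multi_purpose_map, tourist_requirements))
```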

Conclusion

The method to calculate the usability of a dataset for a task presented here emerged from a method to describe the suitability of a dataset in a matrix, where one axis listed the available data themes and the other axis the requirements for the task [ref vasseur]. The method proposed here separates the assessment of the dataset into one table and the required data quality into another table, essentially splitting the single matrix into two relations. This permits us to compare m datasets for n tasks without redoing all the assessments, reducing the effort from (n x m) to (n + m).

The method described here does not yet include the ideas suggested by Jahn [issdq paper]; Jahn discusses the pyramid of needs described by Maslow [ref], which is very similar to the often used K.O. criteria. Selection processes – for example when acquiring a delivery truck – are often structured into a (small) number of K.O. criteria, i.e. requirements which must absolutely be fulfilled, and other desirable criteria, where gradual fulfilment is acceptable. For example: a truck must have wheels and a motor to be useful at all; these are therefore K.O. criteria. The color of the upholstery or the size of the gas tank may correspond more or less to what we expect. The same principle applies when selecting a dataset: if data which is crucial for the decision is not present, then the dataset is not usable, even if it fulfils many other desirable aspects. A tourist who intends to use public transportation will not be satisfied with a map without the bus lines and bus stops, independently of how nicely the sights are shown or how precisely the locations of the fire hydrants are marked!

This is a report about work in progress. A number of interesting questions remain open:

- Is it possible to integrate the different data quality aspects into a single assessment? Is this integration independent of the task?
- Is the weighted mean, perhaps augmented with a set of K.O. criteria, a suitable method to calculate a comprehensive assessment?
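As an illustration of how K.O. criteria could be layered on top of the weighted mean, the following hypothetical sketch rejects a dataset outright if a crucial theme is entirely missing. This extension is only suggested here, not part of the method described above; the parameter ko_themes and the rejection rule are our own assumptions.

```python
# Hypothetical sketch: K.O. criteria as a filter before the weighted mean.
def usability_with_ko(dataset, requirements, ko_themes) -> float:
    for theme in ko_themes:
        row = dataset.get(theme, {})
        if not any(v is not None for v in row.values()):
            return 0.0   # a crucial theme is missing: dataset unusable for this task
    return usability(dataset, requirements)

# e.g. a tourist relying on public transportation:
# usability_with_ko(city_map, tourist_requirements, ko_themes=["bus stop"])
```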

Acknowledgements

This work was funded as part of the REVIGIS project. [Give details; Gruber has them.] We also thank the Magistrate Wien for providing us with datasets and descriptions, and ask them to excuse the liberties we have taken in assessing their quality for this demonstration.