User Preferences for Access to Textual Information

Thibault Roy and Stéphane Ferrari
GREYC - CNRS UMR 6072, Computer Science Laboratory, University of Caen, F-14032 Caen Cedex, France
{Thibault.Roy, Stephane.Ferrari}@info.unicaen.fr

Abstract

Accessing textual information is still a complex task when the user has to browse through large collections of texts or long documents. Improving the user's satisfaction requires taking his preferences into account. We propose a model and the related tools to help a user build his own semantic lexicons and use them for tasks such as searching for information in a specific domain. The process can be realised through graphical user interfaces, step by step, in an incremental approach. In order to show the flexibility of the model, we present two experiments with different tasks and contexts: accessing information and studying a linguistic figure.

1 Introduction

The number of textual documents produced and exchanged each day on the Web and on various public and professional networks does not cease increasing. The traditional tools for accessing the content of such sets of documents (search engines, for instance) do not fully satisfy their users (see e.g. [17] for a comparative study of search engines). One reason for this dissatisfaction is the lack of consideration for the user's point of view and knowledge. We therefore propose a user-centred model for describing personal lexical knowledge through a graphical interface. In a search task, these lexical representations are projected onto texts, which can then be classified using the user's knowledge. The result of the whole process is rendered with the ProxiDocs platform, which provides the user with interactive maps and hypertexts enriched with mark-up directly related to his own choices and preferences.

Section 2 is an overview of the model's main principles and of the tools developed for building lexical resources and using them on a collection. Section 3 presents an experiment in information retrieval, which is the classic use of the model and tools. In order to illustrate the high flexibility of this model and these tools, section 4 presents a second experiment in which NLP specialists perform a completely different task for research purposes: observing conceptual metaphors in a domain-specific corpus. In the last section, we briefly discuss our results and conclude by pointing out the main directions for further work.

2 Models and Tools

2.1 LUCIA: a Model for Representing User's Knowledge on Domains

2.1.1 Main Principles

The LUCIA model, proposed by V. Perlerin [9], is a differential one, inspired by F. Rastier's work on Interpretative Semantics [11]. The basic hypothesis is the following: when describing the things we want to talk about, in order to set their semiotic value, we just have to differentiate them from the things for which they could be mistaken. Furthermore, in this model, the user has a central role. He is the one who describes the domains of his choice, according to his own point of view and with his own words. Domain descriptions are not supposed to be exhaustive, but they reflect the user's point of view and vocabulary. The principle for knowledge representation is structuring and describing lexical items (i.e. words and compounds) according to two main criteria:

• bringing together similar lexical items;
• describing local differences between close items.

Such a representation is called a device. The user can define a device for each domain of interest. A device is a set of tables bringing together lexical units of a same semantic category, according to the user's point of view. In each table, the user has to make the differences between lexical units explicit with couples of attributes and values. The following example illustrates these notions.

2.1.2 Examples of LUCIA Devices

This section illustrates the use of the model on a device representing knowledge about cinema (further LUCIA devices are available at http://www.info.unicaen.fr/~troy/dispositifs/). Let us consider the following lexical items, translations of the ones observed in a French corpus: actor, director, cameraman, montage specialist, minor actor, soundman, filmmaker, Jean-Pierre Jeunet, Steven Spielberg, Georges Lucas, Alfred Hitchcock, John Woo, etc. With these lexical units, it is possible to build a first set of LUCIA tables in order to bring them together. Table 1 shows an example of such a first step. It is recommended to use the model in such an incremental approach, with step-by-step enrichments.

Table 1. Bringing similar words together

| Staff    | actor, director, cameraman, montage specialist, minor actor, soundman, filmmaker |
| Director | Jean-Pierre Jeunet, Steven Spielberg, Georges Lucas, Alfred Hitchcock, John Woo  |

The differentiation between close lexical items, i.e. items in a same table, can be realised in a second step, by defining and using attributes and values. Here, for instance, two attributes can characterise the Staff table items: Professional, with values Yes vs. No, and Job, with values Playing a part vs. Technical vs. Direction. Another point of view can be reflected in the Director table, using an attribute Nationality with values American vs. French vs. English vs. Chinese. Such choices result in the device shown in table 2.

Table 2. Differentiating similar words

Staff table:
| Lexical units                           | Professional | Job            |
| actor                                   | Yes          | Playing a part |
| director, filmmaker                     | Yes          | Direction      |
| cameraman, montage specialist, soundman | Yes          | Technical      |
| minor actor                             | No           | Playing a part |
|                                         | No           | Direction      |
|                                         | No           | Technical      |

Director table:
| Lexical units                   | Nationality |
| Steven Spielberg, Georges Lucas | American    |
| Jean-Pierre Jeunet              | French      |
| Alfred Hitchcock                | English     |
| John Woo                        | Chinese     |

Cells can be blank in LUCIA tables, when the user finds no relevant lexical unit described by the combination of attributes and values on the same line (e.g. the last two lines of the Staff table in table 2). Finally, the user can specify inheritance links showing that the lexicon of a whole table is related to a specific line of another one. In the example, the Director table can be linked to the line Professional: Yes and Job: Direction of the Staff table. This means that each lexical unit of the Director table inherits the attributes and values of the linked line. These links are used in further analysis.
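To make the notion of device more concrete, the sketch below renders the cinema device as a plain Python data structure. This is an illustrative assumption only: the class names (Table, Line) and the layout are ours, not the XML format actually used by the tools described in section 2.2.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Line:
        units: list[str]          # lexical units brought together on this line (may be empty)
        features: dict[str, str]  # attribute -> value couples differentiating the line

    @dataclass
    class Table:
        name: str
        lines: list[Line]
        # Inheritance link: (table name, line index) whose attributes and
        # values every lexical unit of this table inherits.
        inherits_from: Optional[tuple[str, int]] = None

    staff = Table("Staff", [
        Line(["actor"], {"Professional": "Yes", "Job": "Playing a part"}),
        Line(["director", "filmmaker"], {"Professional": "Yes", "Job": "Direction"}),
        Line(["cameraman", "montage specialist", "soundman"],
             {"Professional": "Yes", "Job": "Technical"}),
        Line(["minor actor"], {"Professional": "No", "Job": "Playing a part"}),
        Line([], {"Professional": "No", "Job": "Direction"}),   # blank cell
        Line([], {"Professional": "No", "Job": "Technical"}),   # blank cell
    ])

    director = Table("Director", [
        Line(["Steven Spielberg", "Georges Lucas"], {"Nationality": "American"}),
        Line(["Jean-Pierre Jeunet"], {"Nationality": "French"}),
        Line(["Alfred Hitchcock"], {"Nationality": "English"}),
        Line(["John Woo"], {"Nationality": "Chinese"}),
    ], inherits_from=("Staff", 1))  # inherits Professional: Yes and Job: Direction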

2.2 User-centred Tools

2.2.1 VisualLuciaBuilder: Building LUCIA Devices

Figure 1. VisualLuciaBuilder's interface. [figure]

VisualLuciaBuilder is an interactive tool for building LUCIA devices. It allows a user to create and revise devices step by step through a graphical interface. This GUI (see figure 1) contains three distinct zones.

• Zone 1 contains one or more lists of lexical units selected by the user. These lists can be built automatically in interaction with a corpus, and the user can add, modify or delete lexical units.
• Zone 2 presents one or more lists of attributes and attribute values, as defined by the user.
• Zone 3 is the area where the user "draws" his LUCIA devices. He can create and name new tables, drag and drop lexical units from zone 1 into the tables, as well as attributes and values from zone 2, etc. He can also associate a colour with each table and device.

The tool allows SVG export of the devices (Scalable Vector Graphics is a text-based graphics language of the W3C, describing images with vector shapes, text and embedded raster graphics; for the specification see http://www.w3.org/TR/SVG/). The lexical representations are stored in an XML format for further use (revision or application).

2.2.2 ProxiDocs: Projecting LUCIA Devices on a Corpus

The ProxiDocs tool [13] builds global representations from LUCIA devices and a collection of texts. It returns maps built from the distribution of the lexicon of the LUCIA devices in the corpus (other graphical representations of a corpus can also be returned by the tool, such as the "cloud" of lexical units presented in the next section). Maps reveal proximities and links between texts or between sets of texts. This tool follows a line of work in NLP that proposes to visualise sets of texts in 2- or 3-dimensional spaces; see e.g. [16, 15, 6, 12, 8, 3] for such works, each one using a specific visualisation method.

In the first stage, ProxiDocs counts how many lexical units from each device (a list of graphical forms is associated with each lexical unit) appear in each text of the set. A list of numbers is thus associated with each text: an N-dimensional vector, where N is the number of devices specified by the user. The next stage consists in projecting the N-dimensional vectors into a 2- or 3-dimensional space we can visualise; the Principal Components Analysis (PCA) method [2] is used to realise this projection. Each text can then be represented by a point on a map, and proximity between points informs the user that domain similarities exist between the related documents. In order to emphasise such proximities, the clustering method called Ascendant Hierarchical Clustering (AHC) [2] is applied, and maps representing groups of texts can be built from the clusters. Analysis reports are also returned to the user, with information about the most frequent lexical units, attributes and values, etc.
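The two analysis stages can be summarised with the following sketch. It is an assumed reconstruction rather than ProxiDocs code: devices are reduced to flat lexicons, occurrences are counted by naive substring matching, and scikit-learn stands in for the PCA and AHC implementations.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import AgglomerativeClustering

    def device_vector(text, devices):
        """One coordinate per device: total occurrences of its lexical units."""
        t = text.lower()
        return [sum(t.count(unit.lower()) for unit in lexicon)
                for lexicon in devices.values()]

    def map_corpus(texts, devices, n_clusters=5):
        # Stage 1: one N-dimensional vector per text (N = number of devices).
        vectors = np.array([device_vector(text, devices) for text in texts])
        # Stage 2: PCA projection of the vectors onto a 2-dimensional map.
        points = PCA(n_components=2).fit_transform(vectors)
        # Emphasising proximities: ascendant (agglomerative) hierarchical clustering.
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)
        return points, labels  # 2D coordinates and a cluster label for each text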

All maps and texts are interactive, linking to each other and to the source documents, providing the user with a helpful tool for accessing the textual information of the corpus. Examples of maps built with ProxiDocs are shown in the two following sections, dedicated to experiments using this tool.

3 Experiment 1: Accessing Information

3.1 Context and Materials

The first experiment concerns information retrieval and document scanning on the Web. The objective is to perform a search for information on the Web in a broad context: "european decisions". This search is realised with regard to the domains of interest to the user. The domains representing the user's point of view are agriculture, pollution, road safety, space, sport and computer science. These six domains are represented by LUCIA devices built using the VisualLuciaBuilder tool (the devices are available in SVG format at http://www.info.unicaen.fr/~troy/smap/). The devices contain from 3 to 5 tables and from 30 to 60 lexical units. Some common attributes are used to structure the devices, such as the attribute Role in the domain, with the values Object, Agent and Phenomenon, and the attribute Evaluation, with the values Good and Bad.

In order to constitute the collection of texts, the keywords "european decision" were searched, for texts in English, using the Yahoo engine (French portal, http://www.yahoo.fr). The first 150 links returned were automatically collected, and the textual part of these documents, which came in three formats (HTML, PDF and DOC), was automatically isolated in order to constitute a corpus of text documents of between 1,000 and 50,000 tokens each, as sketched below.
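A sketch of this corpus-constitution step, under stated assumptions: the 150 result URLs are already collected in a list named urls, and only the HTML case is shown (the experiment also handled PDF and DOC documents).

    import requests
    from bs4 import BeautifulSoup

    def textual_part(url):
        """Fetch a page and strip the mark-up, keeping only the text."""
        html = requests.get(url, timeout=10).text
        return BeautifulSoup(html, "html.parser").get_text(separator=" ")

    corpus = []
    for url in urls:  # assumed given: the 150 links returned for "european decision"
        text = textual_part(url)
        if 1000 <= len(text.split()) <= 50000:  # keep the stated token range
            corpus.append(text)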

ProxiDocs was then used to project the devices on the corpus, building both "clouds" of lexical units and maps of texts, discussed in the following.

3.2 Results and Discussion

Figure 2. Cloud showing frequent words. [figure]

Figure 2 is called a "cloud" of lexical units (such "clouds" were introduced on the Web site TagCloud, http://www.tagcloud.com/, to give a global view on blogs). It reveals which lexical units from the selected devices have been found in the documents of the corpus. They are sorted in alphabetical order, and their size is proportional to their number of occurrences in the corpus. Here, lexical units from the computer science domain are particularly present, with the words programme, network, Microsoft, software, etc. Some words from the pollution domain and from the agriculture domain are also emphasised.

Such clouds constitute a first corpus analysis which can help the user access textual information by simply bringing frequent terms to the fore, according to his own lexicon.
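As an illustration of the rendering principle (alphabetical order, size proportional to frequency), here is a minimal sketch producing such a cloud as HTML; the output format and the linear scaling are assumptions, not the tool's actual rendering.

    def cloud_html(counts, min_pt=10, max_pt=40):
        """counts maps each lexical unit to its number of occurrences in the corpus."""
        top = max(counts.values())
        spans = [
            f'<span style="font-size: {min_pt + (max_pt - min_pt) * n / top:.0f}pt">{unit}</span>'
            for unit, n in sorted(counts.items())  # alphabetical order
        ]
        return " ".join(spans)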

Figure 3. Map of clusters. [figure]

Figure 3 reveals proximities between documents according to the user's devices. Each disc on the map represents a cluster: its size is commensurate with the number of documents contained in the cluster, its colour is that of the device most represented in the cluster, and its label contains the five most frequent lexical units (a possible computation is sketched below). The map itself is interactive: each disc is also a hypertext link to a description of the cluster, which shows, sorted by frequency, the lexical units, the attributes and the values found in the cluster, etc.

The map, like the previous cloud, reveals that the computer science domain is particularly well represented in the corpus. The largest disc (manually annotated as group 1 in figure 3) has the colour of this domain, but an analysis of this cluster shows that its documents are related to many themes (health, politics, broadcasting of information, etc.). Computer science is not really the main theme, but rather a vector of communication often present in this corpus, whatever the theme. The attributes and values frequently repeated in the documents of group 1 are Object type, with values hardware and software, and Activity type, with value job. They highlight that the documents of this group mostly talk about the objects and jobs of computer science. Group 2 is dominated by the pollution domain. Here, an analysis of the cluster shows documents really dealing with problems related to pollution, and more particularly with European decisions on sustainable development, where the couples (attribute: value) State: gas and Evaluation: bad are the most frequent. These two groups illustrate two different interpretations of proximity on the maps.
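A possible computation of a disc's colour and label, under the same assumptions as the pipeline sketch of section 2.2.2 (flat lexicons, naive substring counting); it is not the actual ProxiDocs code.

    from collections import Counter

    def describe_cluster(cluster_texts, devices):
        unit_counts, device_counts = Counter(), Counter()
        for text in cluster_texts:
            t = text.lower()
            for name, lexicon in devices.items():
                for unit in lexicon:
                    n = t.count(unit.lower())
                    unit_counts[unit] += n
                    device_counts[name] += n
        colour = device_counts.most_common(1)[0][0]         # most represented device
        label = [u for u, _ in unit_counts.most_common(5)]  # five most frequent units
        return colour, label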

The graphical outputs presented in this section provide the user with personalised help for accessing textual information, reflecting the way his own knowledge of the domains he describes relates to the documents of a collection. This is the main objective of the model and tools developed. The next section presents a completely different kind of experiment, to show the flexibility and adaptability of this model and these tools.

4 Experiment 2: Conceptual Metaphors

In this second experiment, the objective is a corpus-oriented study of the way the lexicon related to conceptual metaphors is used. A possible application of such a study in NLP (Natural Language Processing) is assistance with text interpretation or semantic analysis. This work was realised within a project called IsoMeta, which stands for isotopy and metaphor. Rather than one isolated experiment, IsoMeta involved a set of experiments, in an incremental approach. The first part, now completed, consisted in adapting the LUCIA model for lexical representation in order to characterise the main properties of metaphorical meanings; it is presented in 4.1. The second part, presented in 4.2, is a study of what could be called the metaphoricity of texts in a domain-specific corpus.

4.1 Constraints on the Model for Metaphor Characterisation

This work is based on the existence of recurrent metaphoric systems in a domain-specific corpus. It is closely related to conceptual metaphors as introduced by Lakoff and Johnson [7], more specifically ones with a common target domain, which is the theme of the corpus. Previous works have already shown different conceptual metaphors in a corpus of articles about the stock market, extracted from the French newspaper Le Monde: "the meteorology of the stock market", "the health of economics", "the war in finance", etc. The first part of the IsoMeta project focussed on how the LUCIA model for lexical representation could help describe a specific metaphorical meaning. Rather than changing the core of the LUCIA model, a protocol for building the lexical representations has been defined, with constraints taking the main properties of metaphors into account. The first property is the existence, for a conceptual metaphor, of a source domain and a target domain. The second is an underlying analogy between the source and the target of a metaphor, which is the comparison point of view.

The last property is the possible transfer of new meaning from the source, then considered as a vehicle, to the target, which is the novelty point of view. (The comparison and novelty points of view are used by D. Fass [4] to discriminate between different approaches to metaphor. The hypotheses on metaphors studied in the IsoMeta project are not detailed in this paper; please refer to previous works, e.g. [1, 10], for specific information. For further work on metaphors, tropes and rhetoric, see also [5].)

4.1.1 Source and Target Domains

Conceptual metaphors involve a source domain and a target domain. Thus, a first constraint consists in building a LUCIA device for the source domain and another one for the target domain. For instance, to study "the meteorology of the stock market", a device describing the lexicon related to meteorology must be built, and another one for the stock market lexicon. But conceptual metaphors only involve semantic domains, and when they are used in language, the resulting figure is not necessarily a metaphor. It can be a conventional one, lexicalised, and no longer perceived as a metaphor. For instance, in the corpus, the French word "baromètre" (barometer) is commonly used to talk about stock indexes. It can be considered a lexicalisation, and "baromètre" becomes a word of the stock market lexicon. In this case, using the LUCIA model, the word is simply considered polysemous, and can be described in both devices, once for each of its meanings. For the purposes of this study, describing the meaning related to the conventional metaphor is forbidden: the word must not appear in the target device. The goal here is to use the model to "rebuild" the metaphorical meaning, not to literally code it as an ad hoc resource. The other constraints must help this "rebuilding".

4.1.2 Analogy and Novelty

The analogy between the source and the target of a metaphor is usually a clue for semantic analysis in NLP. In the LUCIA model, the constraint reflecting this analogy is a set of common attributes shared by the source and target devices. For instance, the couple of attribute and value (tool: prevision) can be used to describe barometer in the source domain. The same couple can also be used in a description from the target device, e.g. for computer simulation. Thus, this shared attribute and value reflect the underlying analogy between the two domains, and allow rebuilding the conventional metaphorical meaning of barometer in a sentence like:

The Dow Jones is a stock exchange barometer.

The novelty property consists in using metaphor to bring something new into the target domain. For instance, in:

The storm has now reached the stock markets.

storm not only denotes agitation, it also differs from other words denoting the same kind of turbulence: wind, breeze, tornado, etc. Therefore, the strength of the phenomenon is the piece of new information this particular word brings to the target. A storm on a financial place is not only agitation, it is a strong, violent one. An attribute strength with the value high is enough to help interpret the novelty part of the metaphorical meaning in the previous example. The novelty property can be rendered if the corresponding attributes are well identified as being "transferable" from the source domain to the target domain. This is the same configuration as for the shared attributes. Therefore, the constraints for analogy and novelty can finally be viewed as a single one: a set of "sharable" attributes must exist for the descriptions of the source and the target domains, clearly identified as transferable to reflect metaphorical meanings.
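A minimal sketch of this single constraint, with assumed names and a deliberately tiny lexicon; each word's description is flattened to one dictionary of attribute: value couples.

    # Source device (meteorology) and target device (stock market). The
    # conventional metaphorical meaning of "barometer" is deliberately NOT
    # coded in the target device: it must be rebuilt through shared couples.
    source_device = {
        "barometer": {"tool": "prevision"},
        "storm": {"phenomenon": "agitation", "strength": "high"},
    }
    target_device = {
        "computer simulation": {"tool": "prevision"},
        "stock index": {},
    }
    # Attributes identified as sharable/transferable from source to target.
    transferable = {"tool", "phenomenon", "strength"}

    def shared_couples(source_word, target_word):
        """Attribute: value couples supporting the analogy between two words."""
        src, tgt = source_device[source_word], target_device[target_word]
        return {a: v for a, v in src.items()
                if a in transferable and tgt.get(a) == v}

    # shared_couples("barometer", "computer simulation") -> {"tool": "prevision"},
    # rebuilding the conventional metaphorical meaning of "barometer".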

4.2 Maps and Texts "Metaphoricity"

In the second part of the IsoMeta project, the previous protocol is used to study multiple conceptual metaphors in the same domain-specific corpus. A LUCIA device is built for each domain: the three source domains, meteorology, war and health, as well as the unique target domain, stock market. Words from the three source domains can be used with both metaphorical and literal meanings in this corpus. Usually, NLP approaches to metaphor focus on locally disambiguating such polysemy; our hypothesis is that the language of a whole text may be viewed as more or less metaphorical. Therefore, experiment 2 consists in using the ProxiDocs tools to classify texts according to the lexical resources related to conceptual metaphors. Results are detailed in [14].

Figure 4. Cartography reflecting the "metaphoricity" of texts. [figure]

Figure 4 shows the most relevant results. After the analysis of both kinds of maps, texts and clusters, three zones can be drawn. Zone A contains texts in which mostly literal meanings are used, e.g.:

Pour se déplacer (...), des officiers de la guérilla utilisent les motos récupérées pendant les attaques. (For their movements, the guerrilla officers use the motorbikes recovered during the attacks.) Le Monde, 13/04/1987

where the war lexicon is not metaphorical. Zone B contains mostly conventional metaphors, e.g.:

En neuf mois, six firmes sur les trente-trois OPA ont été l'objet de véritables batailles boursières. (In nine months, 6 firms out of the 33 takeover bids were subjected to real financial battles.) Le Monde, 26/09/1988

where the phrase "bataille boursière" (financial battle) is a common one. Zone C contains rarer and more varied metaphors, e.g.:

Porteur du terrible virus de la défiance, il se propage à la vitesse de l'éclair et les tentatives désespérées de réanimation (...) sont inopérantes. (Carrying the dreadful virus of distrust, it spreads in a flash and the desperate attempts at reanimation are vain.) Le Monde, 30/10/1987

The maps reveal what can be called the "metaphoricity" of texts, from degree zero at the top of the maps to the highest degree at the bottom. The use of the model and tools presented here shows their high flexibility: a user may add his own rules, like the protocol defined for building devices, in order to fulfill his own task involving semantic access to a collection of texts.
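The paper derives metaphoricity from the position of texts on the maps rather than from a closed formula. Purely as an assumed illustration of the underlying intuition, and not the authors' measure, one could compare the weight of the source-domain lexicons to that of the target-domain lexicon in each text:

    def metaphoricity_share(text, source_devices, target_lexicon):
        """Assumed indicator: share of source-domain occurrences among all
        device occurrences found in the text (NOT the measure used in [14])."""
        t = text.lower()
        src = sum(t.count(u.lower()) for lex in source_devices.values() for u in lex)
        tgt = sum(t.count(u.lower()) for u in target_lexicon)
        return src / (src + tgt) if src + tgt else 0.0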

5 Conclusion

In this paper, we presented a user-centred approach for accessing textual information. Founded on a model for lexical representation, a set of interactive tools has been developed to help the user specify his own point of view on a domain and use this knowledge to browse through a text collection. Two very different experiments illustrate their use. The second experiment clearly shows that a user can easily appropriate the model and adapt it to a task far from access to textual information. This result raises interesting questions we cannot answer in the scope of this paper: what is the role of the graphical tools in the process of appropriation? Can models and tools be both flexible and not diverted? For the time being, our perspectives mostly concern the evaluation of the model in a well-defined task with a large number of users. A protocol must be defined to characterise the contribution of the user's point of view. Such an evaluation also raises interesting questions we hope to answer in future work.

References

[1] P. Beust, S. Ferrari, and V. Perlerin. NLP model and tools for detecting and interpreting metaphors in domain-specific corpora. Proceedings of the Corpus Linguistics 2003 Conference, 16:114-123, 2003.
[2] J.-M. Bouroche and G. Saporta. L'analyse des données. Presses Universitaires de France, Paris, 1980.
[3] W. Chung, H. Chen, and J. Nunamaker. Business intelligence explorer: A knowledge map framework for discovering business intelligence on the web. Proceedings of the 36th Hawaii International Conference on System Sciences, 2002.
[4] D. Fass. Processing Metaphor and Metonymy. Ablex Publishing Corporation, Greenwich, Connecticut, 1997.
[5] S. Ferrari. Rhétorique et compréhension. In G. Sabah, editor, Compréhension des langues et interaction, chapter 7, pages 195-224. Lavoisier, Paris, 2006.
[6] M. A. Hearst. TileBars: Visualization of term distribution information in full text information access. Proceedings of ACM SIGCHI, pages 59-66, 1995.
[7] G. Lakoff and M. Johnson. Metaphors We Live By. University of Chicago Press, Chicago, 1980.
[8] J. Lamping. A focus+context technique based on hyperbolic geometry for viewing large hierarchies. Proceedings of ACM SIGCHI, pages 401-408, 1995.
[9] V. Perlerin. Sémantique légère pour le document. PhD thesis, University of Caen / Basse-Normandie, Caen, 2004.
[10] V. Perlerin, P. Beust, and S. Ferrari. Computer-assisted interpretation in domain-specific corpora: the case of the metaphor. Proceedings of NODALIDA'03, the 14th Nordic Conference on Computational Linguistics, 2003.
[11] F. Rastier. Sémantique interprétative. Presses Universitaires de France, Paris, 1987.
[12] G. Robertson, J. Mackinlay, and S. Card. Cone Trees: Animated 3D visualizations of hierarchical information. Proceedings of ACM SIGCHI, pages 189-194, 1991.
[13] T. Roy and P. Beust. Un outil de cartographie et de catégorisation thématique de corpus. Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data, 2:978-987, 2004.
[14] T. Roy, S. Ferrari, and P. Beust. Étude de métaphores conceptuelles à l'aide de vues globales et temporelles sur corpus. Verbum ex machina - Proceedings of TALN'06, the 13th Conference on Natural Language Processing, 1:580-589, 2006.
[15] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[16] B. Shneiderman. The eyes have it: a task by data type taxonomy for information visualization. Proceedings of Visual Languages, pages 336-343, 1996.
[17] J. Véronis. A comparative study of six search engines. Author's blog: http://aixtal.blogspot.com/2006/03/search-and-winner-is.html, March 2006.