Semi-structured Data, DEA I3

Grégory Cobéna, http://www-rocq.inria.fr/verso/

DEA I3: Information, Interaction, Intelligence

Course: Semi-structured Data

Change Control in XML

20/12/2002



DEA I3 - Données semi-structurées - Grégory Cobéna

Objectives

Understand the management of dynamic data:
• At large scale: the case of a Web data warehouse (Xyleme)
• At the scale of the XML document: the case of version management





Motivations: at the scale of the Web

Change control is first of all about:
• Discovering data sources and XML documents, on the Web or on an intranet
• Tracking these documents over time
• Extracting knowledge about what changes: the documents, their properties, their content

Motivations: at the scale of the document

In which cases does the notion of change arise?
• When one is interested in the evolution over time of a given document. Example: an XML file describing an address book
• When one manages several documents and studies inter-document changes. Example: an XML file describing two car models, a Peugeot 307 and a 206

Stakes

• Semi-structured data must provide a more precise description than plain text, with well-defined semantics
• Managing changes in semi-structured data is even more complex than in relational databases

Course outline

• Xyleme
  – A large-scale XML data warehouse
  – Integration of Web data
  – Active monitoring of Web data
• XML Diff
  – Representation of changes
  – Detection of changes

Part One: Xyleme

A Dynamic Warehouse for the XML Data of the Web

Organization

1. The Web and XML
2. Xyleme
3. Data Acquisition and Maintenance
4. XML Repository, Semantic Data Integration and Query Processing
5. Query Subscription
Conclusion

(Part I: Xyleme) 1. The Web and XML


The Web today

• Terabytes of data
• A lot of public pages
  – 1 billion in 06/2000
  – several million servers
• Private web: pages that are not publicly available
• Deep web: data hidden behind forms

HTML = Hypertext Language

HTML: text + presentation. Where is the data?

Information system (table):
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

HTML page (text):
"The X23 new camera replaces the X22. It comes equipped with a flash (worth by itself 53.99 $) and provides great quality for only 359.99 $."

HTML page → information system: hard

XML = Semistructured Data

XML: data + structure. Semistructured: more flexible.

Information system (table):
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

XML document (most markup was lost in extraction; surviving fragments):
< product reference="X23"> … camera … 359.99 …
< product reference="R2D2"> … Robot … 19350 …

XML document → information system: easy

XML: Tree Types

product-table
  product
    reference
    designation
    price
    description

Semantics and structure are in paths:
• product-table/product/reference
• product-table/product/price

(Part I: Xyleme) 2. A Dynamic Warehouse for the XML Data of the Web

Xyleme Research

Project Xyleme at INRIA (1999-2000): explore XML + Web + DBMS to make the Web a knowledge database

• INRIA
  – Sophie Cluet: databases (OQL…)
  – Serge Abiteboul: semi-structured data + Web
  – Guy Ferran: ex O2 Technology
• Mannheim University: Guido Moerkotte
• Université d’Orsay: Marie Christine Rousset
• CNAM: Dan Vodislav

Xyleme Company

Started September 2000 (25 employees at the end of 2001)

Market challenges:
• Few XML documents are available on the Web (because of weak software support)
• The company is focusing on private XML: press, publishers, financial data, biology…

Technology:
• Scalability for large amounts of data
• Internet (+focus) / intranet support
• Monitoring and version management
• Heterogeneous data integration

Architecture

• Cluster of PCs
• Developed with Linux and C++
• Communications
  – local: CORBA
  – external: HTTP
• Distribution between autonomous machines
• Now: Web Services

Functional Architecture

[Diagram components: User Interface, Web Interface, Internet, Xyleme Interface, Query Processor, Semantic Module, Change Control, Repository and Index Manager, Acquisition (Loader & Crawler).]

Architecture

[Diagram: machines connected by Ethernet, with access to the Internet. Machine roles shown: Acquisition and Maintenance; Loader | Query; Change Control and Semantic Integration; Repository; Index.]

(Part I: Xyleme) 3. Data Acquisition and Maintenance, Page Importance

Goals

• Discover XML pages on the Web that are of interest for customers
  – for this, crawl the Web (HTML + XML)
• Keep them up to date
• Do this under bounded resources:
  – memory for known URLs
  – bandwidth

Life Cycle of a Page in Xyleme

• The URL of D is discovered as a link in another page (or published by a customer)
• The page scheduler decides to read D
  – the metadata of D is read (type, last_date_update...)
  – the document D is loaded
• The document D is (re)read regularly

Main Issues

• Loading of pages
  – we can load up to 5 million pages/day on a standard PC
  – the main cost is the Internet connection
• Metadata management (access to disk)
• Page scheduling
  – decide which page to read or refresh next

Page Importance

• Definition: important pages are linked to by important pages
• Offline algorithm (used by Google)
• Our online algorithm (M. Preda, S. Abiteboul, G. Cobena)
  – does not require maintaining graph information
  – faster convergence with focused crawling
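The recursive definition of importance can be illustrated with a small offline computation. The sketch below is a plain power iteration over a link graph held in memory, with no damping factor. It is not the online algorithm mentioned on the slide, and the function name, the graph and the iteration count are assumptions made for illustration only.

```python
def page_importance(links, iterations=50):
    """Offline importance by power iteration (illustrative sketch).

    links maps each page to the list of pages it links to.
    Each page repeatedly gives its score, in equal shares, to its targets.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_score = {p: 0.0 for p in pages}
        for page in pages:
            # Assumption: a page without outgoing links spreads its score evenly.
            targets = links.get(page) or pages
            share = score[page] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return score


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = page_importance(graph)
```

In this toy graph nobody links to "d", so its importance drops to zero, while "c" (linked to by three pages) ends up more important than "b". Note that the whole graph must be available, which is exactly what the online approach avoids.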

(Part I: Xyleme) 4. XML Repository: Semantic Data Integration and Query Processing

Querying Language

• Today: a mix of OQL and XQL
• We are currently moving to XQuery (which is also a mix of OQL and XQL…)

Select boss/Name, boss/Phone
From comp in BusinessDomain, boss in comp//Manager
Where comp/Product contains “Xyleme”

Web Heterogeneity

• Semantic domains, e.g., cinema
• Many possible types for data in this domain, many DTDs
• Semantic integration
  – one abstract DTD for the domain
  – gives the illusion that the system maintains a homogeneous database for this domain

1 domain = 1 abstract DTD

Indexing

• Standard inverted index
  – word → documents that contain this word
• Xyleme index
  – word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
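The difference between the two indexes can be sketched in a few lines. This is an illustration only, not Xyleme's index: the element identifier used here is simply the element's position in document order, and only the text directly inside an element is indexed. Both choices, and all names, are assumptions.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_documents(docs):
    """docs maps a document id to its XML text.

    Returns (doc_index, elem_index):
      doc_index:  word -> set of document ids            (standard inverted index)
      elem_index: word -> set of (document id, element id) pairs
    """
    doc_index = defaultdict(set)
    elem_index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # Hypothetical element identifier: rank of the element in document order.
        for elem_id, elem in enumerate(root.iter()):
            for word in re.findall(r"\w+", (elem.text or "").lower()):
                doc_index[word].add(doc_id)
                elem_index[word].add((doc_id, elem_id))
    return doc_index, elem_index


docs = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
    "d2": "<product><name>robot</name></product>",
}
doc_index, elem_index = index_documents(docs)
```

With the element-level index, a condition such as "description contains flash" can be narrowed down to candidate elements before any document is read, which is the goal stated on the slide.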

I.4.1 Xyleme: Semantic Data Integration

Data Integration

• One application domain, several schemas
  – heterogeneous vocabulary and structure
• Xyleme semantic integration
  – gives the illusion that the system maintains a homogeneous database for each domain
  – abstracts a set of DTDs into an abstract DTD = a hierarchy of pertinent terms for a particular domain
• Domains: business, culture, tourism, biology, …







Technology in short

• Cluster DTDs into application domains
• For an application domain, semi-automatically:
  – organize tags into a hierarchy of concepts using thesauri such as WordNet and other linguistic tools
  – this provides the abstract DTD for the particular domain
  – generate mappings between concrete DTDs and the abstract one

I.4.2 Xyleme: Query Processing

Xyleme Query Language

A mix of OQL and XQL; the W3C standard will be used once there is one.

Select product/name, product/price
From doc in catalogue, product in doc/product
Where product//components contains “flash”
  and product/description contains “camera”

Principle of Querying

Query on the abstract DTD → mappings between concrete and abstract DTDs → union of concrete queries (possibly with joins)

Mappings:
• catalogue/product/price
  ⇒ d1//camera/price
  ⇒ d2/product/cost
• catalogue/product/description
  ⇒ d1//camera/description
  ⇒ d2/product/info, ref ⇒ d2/description
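The translation from an abstract query to a union of concrete queries can be sketched as a table look-up. The mapping table below reuses the paths shown on the slide; the function name and the grouping by source are assumptions, and the join through "ref" needed for the d2 description is deliberately left out.

```python
# Abstract path -> concrete paths, taken from the slide (d2 join on "ref" omitted).
MAPPINGS = {
    "catalogue/product/price": ["d1//camera/price", "d2/product/cost"],
    "catalogue/product/description": ["d1//camera/description", "d2/product/info"],
}


def translate(abstract_paths, mappings=MAPPINGS):
    """Group the concrete paths needed by a query by concrete source (d1, d2, ...).

    The abstract query is then answered by the union of one concrete
    query per source.
    """
    per_source = {}
    for path in abstract_paths:
        for concrete in mappings.get(path, []):
            source = concrete.split("/", 1)[0]
            per_source.setdefault(source, []).append(concrete)
    return per_source


plan = translate(["catalogue/product/price", "catalogue/product/description"])
```

Each key of the result corresponds to one concrete query; sources with no mapping for a requested path are the ones where relaxation or a join would be needed.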

Query Processing

1. Partial translation, from abstract to concrete, to identify “machines” with relevant data
2. Algebraic rewriting, linear search strategy based on simple heuristics: in priority, use in-memory indexes and minimize communication
3. Decomposition into local physical subplans and installation
4. Execution of plans
5. If needed, relaxation

Query processing

Essential use of a smart index combining full-text and structure

I.4.3 Xyleme: Repository

Storage System: Xyleme Store

• Efficient storage of trees in variable-length records within fixed-length pages
• Balancing of tree branches in case of overflow
  – minimize the number of I/Os for direct access and scanning
  – good compromise: compaction / access time

Tree Balancing in Xyleme Store

[Diagram: a tree stored across Records 1 to 4. Two overflow cases are shown: a sub-tree placed in another page, and additional children placed in another page.]

Questions?

(Part I: Xyleme) 5. Change Control

The Web changes all the time

• Data acquisition + maintenance
  – keep the warehouse up-to-date
• Version management
  – representation and storage of change (see Part II)
• Change monitoring
  – query subscription

Subscription Language

• SQL-like language based on ‘atomic events’
• Combines the use of monitoring queries and continuous queries
• The language can be extended by adding new types of atomic events
• Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Example

subscription myPaintings
% what are the new painting entries in the Musée d’Orsay site
monitoring newPainting
select URL
where URL extends www.musee-orsay.fr/*      (atomic events)
  and contains “Monet”
% manage the changes in the expositions
continuous delta Exposition
select ... from ... where
when monthly
notify daily % send me a daily report

Step 1: Atomic Event Detection

[Diagram components: XML loader, metadata manager, complex event detection. A document d (5 million pages/day) is annotated with the atomic events detected on it: d/46, then d/46,67 (document & alerts).]

• atomic event 46: URL matches pattern www.musee-orsay.fr/*
• atomic event 67: XML document contains the tag with the value “Monet”

Alerters

• Each alerter can be viewed as a plug-in that acts on a document flow
• All sorts of atomic events can be detected: URL pattern detection, keywords, XPath expressions, page rank…
• Can be distributed
• Some advanced alerts are:
  – long string look-ups
  – finding XML patterns (e.g. XPath)
  – comparing digital signatures of text documents (copy tracker)

URL Patterns Detection (1)

• Supported patterns: URL | prefix* | *suffix
• Using a hash table: try all possible patterns
  – Example: http://www.inria.fr/verso/index.html
  – Test: http://www.inria.fr/verso/*, http://www.inria.fr/*
• Each test is O(1); total test time is O(n), where n is the length of the URL
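The hash-table approach can be sketched as follows: every prefix of the URL that ends at a "/" is turned into a candidate pattern and looked up in a set of registered patterns. This is an illustration under simplifying assumptions (only exact URLs and prefix patterns cut at "/" boundaries, no "*suffix" patterns); the function name and the pattern set are hypothetical.

```python
def matching_patterns(url, patterns):
    """Return the registered patterns that match url, shortest prefix first.

    patterns is a set of strings that are either exact URLs or
    prefixes ending with '/*'. Each look-up in the set is a hash-table test.
    """
    hits = []
    for i, ch in enumerate(url):
        if ch == "/":
            candidate = url[: i + 1] + "*"
            if candidate in patterns:
                hits.append(candidate)
    if url in patterns:
        hits.append(url)
    return hits


registered = {
    "http://www.inria.fr/verso/*",
    "http://www.inria.fr/*",
    "http://www.musee-orsay.fr/*",
}
hits = matching_patterns("http://www.inria.fr/verso/index.html", registered)
```

The number of look-ups is bounded by the length of the URL, whatever the number of registered patterns, which is what makes the approach attractive with millions of subscriptions. Building and hashing each candidate string has its own cost here; a real implementation would presumably hash incrementally.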

URL Patterns Detection (2)

• Using a tree: navigate the tree until a leaf is encountered
• Example: [tree figure not recovered]
• Patricia trees?

Keywords Sequence Algorithm

• Detect: « Air France »
• Solution:
  – a tree of backward keyword sequences
  – a context memory with O(1) update cost
• The tree is implemented over a hash table

Simple XPath filtering Algorithm

• Problem:
  – detect CONTAINS « word »
• Solution:
  – reverse the path expression
  – use postfix order
  – use one stack for ‘//’ and another stack for ‘/’

Simple XPath filter example: understanding the tree structure in postfix order

Consider the tree (markup lost in extraction; reconstructed from the node list): A contains B, B contains two C elements, and the first C contains the text toto.

Nodes come as:
• toto (id=1, level=4)
• C (id=2, level=3)
• C (id=3, level=3)
• B (id=4, level=2)
• A (id=5, level=1)
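The postfix numbering of the example can be reproduced with a short traversal. This sketch assumes the tree is the one implied by the node list (A > B > two C elements, the first containing the text toto); the function name and the tuple layout are illustrative choices.

```python
import xml.etree.ElementTree as ET


def postfix_nodes(xml_text):
    """Return (label, id, level) triples in postfix order.

    Text content is treated as a leaf node one level below its element;
    ids are assigned in the order in which nodes are completed.
    """
    root = ET.fromstring(xml_text)
    visited = []

    def visit(elem, level):
        if elem.text and elem.text.strip():
            visited.append((elem.text.strip(), level + 1))
        for child in elem:
            visit(child, level + 1)
        # An element is emitted only after all of its content.
        visited.append((elem.tag, level))

    visit(root, 1)
    return [(label, i + 1, level) for i, (label, level) in enumerate(visited)]


nodes = postfix_nodes("<A><B><C>toto</C><C/></B></A>")
```

Because a word is seen before any of its ancestors, a filter can remember the word when it arrives and test each ancestor name as the enclosing elements close.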

Simple XPath: Example

CONTAINS toto is detected by:
• « toto »::ancestor
• When « toto » is detected, it is stored
• For each ancestor of « toto », the name is compared to the tag of the expression
• All tests are executed using a hash table

Stemming

• On the alerter
  – Example: Éléphant → ELEPHANT
  – Do it for 500 documents/second
  – Noise may be introduced (example: tâche = tache)
• On the subscription manager
  – To avoid duplicate registration of similar events
  – To show the user how their query is stemmed
• Real stemmers: chevaux → cheval
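The light normalization applied on the alerter (as opposed to a real stemmer) can be sketched with the standard library: accents are removed and the word is upper-cased. The function name is an assumption; the examples are those of the slide.

```python
import unicodedata


def normalize(word):
    """Strip diacritics and upper-case a word (not a linguistic stemmer)."""
    decomposed = unicodedata.normalize("NFD", word)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.upper()
```

For instance, normalize("Éléphant") gives "ELEPHANT", while "tâche" and "tache" both become "TACHE", which is the kind of noise the slide warns about. Mapping "chevaux" to "cheval" needs a real stemmer and is out of reach of this function.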

Step 2: Complex Event Detection

[Diagram components: XML loader, HTML parser, complex event detection.]

• Millions of alerts per day
• Millions of subscriptions
• complex event 12: 67 & 46 (XML document contains the tag with value “Monet” and URL matches pattern www.musee-orsay.fr/*)

Complex Events Algorithm

• The formal problem is NP-hard
• We proposed several possible algorithms
• Experimental (simulation) values showed the effectiveness of our solutions
• The hash-tree based algorithm is well suited for our application:
  – 10 million complex events
  – 1 million atomic events
  – 100 atomic events detected per document
• 0.8 ms to process a document
• ~2 million documents per day (on each PC)
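To make the problem concrete, here is a naive baseline for detecting complex events, each defined as a conjunction of atomic events. It uses a counting index from atomic events to the complex events that mention them. It is not the hash-tree algorithm referred to on the slide; all names and the sample events are assumptions.

```python
from collections import defaultdict


def build_index(complex_events):
    """complex_events maps a complex event id to a frozenset of atomic event ids."""
    index = defaultdict(list)
    for complex_id, atoms in complex_events.items():
        for atom in atoms:
            index[atom].append(complex_id)
    return index


def detect(complex_events, index, detected_atoms):
    """Return the complex events all of whose atomic events were detected."""
    counts = defaultdict(int)
    for atom in set(detected_atoms):
        for complex_id in index.get(atom, ()):
            counts[complex_id] += 1
    return sorted(
        cid for cid, n in counts.items() if n == len(complex_events[cid])
    )


events = {12: frozenset({46, 67}), 13: frozenset({46, 99})}
index = build_index(events)
fired = detect(events, index, [46, 67])
```

For the document of the running example (atomic events 46 and 67), only complex event 12 fires. The cost of this baseline grows with the number of complex events attached to each detected atomic event, which is where a more specialised structure would matter at the scale quoted above.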

Step 3: Notification Processor

[Diagram components: complex event detection (alerts), clock, triggers, continuous queries, notification/monitoring, notification/results, Reporter. Volume: millions of notifications/day.]

Architecture

[Diagram components: Xyleme Alerter (documents), Complex Event Detection, Xyleme Subscription Manager (Subscription Manager, Trigger Engine, SQL), Xyleme Query Processor, Xyleme Reporter (Reporter, SQL), Web Browser.]

Monitoring Applications

Copy tracking

• Example: a press agency wants to check that people are not illegally publishing copies of its wires
• Need to react fast to changes: an illegal copy of a wire may stay online for only a couple of days

Steps:
1. Slice the document
2. Filter the flow of candidate documents (query to a search engine, or specific crawl + pre-filter)
3. Detection

Web portal management

• Standard portal management
  – unreachable pages
  – dangling pointers
  – incorrect pages (e.g., do not parse)
  – detection of interesting pages on the Web
  – etc.
• Portal archiving
• Subscription and notification

Web surveillance

• Applications
  – anti-criminal and anti-terrorist intelligence, e.g., detecting suspicious acquisition of chemical products
  – business intelligence, e.g., discovering potential customers, partners, competitors
• Find the data (crawl the Web)
• Monitor the changes
  – new pages, deleted pages, changes in a page
• Classify information and extract data of interest
  – data mining, text understanding, knowledge representation and extraction, linguistics… very AI

Conclusion & Perspectives

• Focus crawling on important pages
  – refine the notion of importance
  – improve the discovery of important pages
• Improve change control accuracy
  – semantic web
  – real-time advanced processing

Questions?

Part Two: XML Diff

Versions

Objectives:
• Temporal queries (persistent identification of nodes)
• Version some documents or some sites (store a ‘delta’)
• Change monitoring (query changes)

We proposed a representation of changes:
• “Change-Centric Management of Versions” (VLDB 2001)

We developed a diff algorithm for XML:
• “Detecting Changes in XML Documents”, G. Cobena, S. Abiteboul, A. Marian, ICDE 2002 (San Jose)

(Part II: XML Diff) 1. Detecting Changes in XML Documents

Grégory Cobéna, Serge Abiteboul, Amélie Marian
INRIA Rocquencourt, Columbia University

Introduction

Plan: algorithms for detecting changes in XML documents
• An XML diff algorithm
• A comparative study for XML change detection

Overview

• Motivations
• State of the art
• Change model
• Algorithm
  – trade-off between ‘quality’ and speed
  – quasi-linear time and space complexity
• Experiments
  – synthetic and real-world experiments

Motivations

• Monitoring XML data on the Web (B. Nguyen, S. Abiteboul, G. Cobéna, M. Preda, SIGMOD 2001)
• Change-centric management of versions in an XML warehouse (A. Marian, S. Abiteboul, G. Cobéna, L. Mignet, VLDB 2001)
• Learning about changes
• Architecture and requirements (‘speed’)
• Multiple optimality criteria (‘quality’)

II.1.1 XML Diff: What is a diff?

• Unix diff: shows the lines that differ between two text files
• String diff: shows which symbols have changed
• XML diff: shows which parts of the tree have been modified, inserted or deleted
• In fact, all these problems are very similar

The String Edit Problem

• Consider the string: abcdefg
• How to transform it into: bczdeyz?
• Possible solutions:
  – delete all 7 chars and insert 7 other chars
  – update a into b, b into c, c into z, f into y, g into z
  – mix both solutions
• Question: what is the shortest edit sequence?

Solving the String-Edit-Problem

If we know how to transform S1 into S2, then we know how to transform S1x into S2y.

Conversely, to find the shortest path for transforming S1x into S2y, it is sufficient to compare the following transformations:
• S1 into S2, then x into y
• S1x into S2, then insert y
• delete x, and then S1 into S2y

String Edit Problem: the algorithm

• Two strings S1 and S2
• Cost(x,y) represents the shortest edit cost to transform S1[1..x] into S2[1..y]
• The cost is the sum of the individual costs of each edit operation (insert, delete, update)
• Then, Cost(x,y) is the min of:
  – Cost(x-1,y-1) + update_cost(S1[x],S2[y])
  – Cost(x-1,y) + delete_cost(S1[x])
  – Cost(x,y-1) + insert_cost(S2[y])

A Quadratic Solution

The solution is to represent all possible paths on a matrix M[0..|S1|][0..|S2|]:
• M[x,y] represents the cost of transforming S1[1..x] into S2[1..y]
• M[x,y] can be computed using M[x-1,y-1], M[x-1,y] and M[x,y-1]
• M[0,i] and M[i,0] are obvious
• Thus, M[|S1|,|S2|] can be computed

Note that the number of paths is exponential, but the cost remains quadratic: time and space cost is O(|S1|*|S2|).
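The recurrence and the matrix M translate directly into a short dynamic program. The sketch below assumes unit costs by default and an update cost of zero when the two symbols are equal; the function and parameter names are illustrative.

```python
def edit_cost(s1, s2, insert_cost=1, delete_cost=1, update_cost=1):
    """Cost of the cheapest edit sequence turning s1 into s2.

    m[x][y] is the cost of transforming the first x symbols of s1
    into the first y symbols of s2.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    # First column and first row: delete everything / insert everything.
    for x in range(1, rows):
        m[x][0] = m[x - 1][0] + delete_cost
    for y in range(1, cols):
        m[0][y] = m[0][y - 1] + insert_cost
    for x in range(1, rows):
        for y in range(1, cols):
            same = s1[x - 1] == s2[y - 1]
            m[x][y] = min(
                m[x - 1][y - 1] + (0 if same else update_cost),
                m[x - 1][y] + delete_cost,
                m[x][y - 1] + insert_cost,
            )
    return m[-1][-1]
```

With unit costs, edit_cost("abcdefg", "bczdeyz") is 4: delete a, insert z after c, then update f into y and g into z. This "mixed" solution is cheaper than the five updates or the fourteen delete/insert operations discussed earlier.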

State of the art (1): the string edit problem

• O(|x|*|y|) solution with a directed acyclic graph
• Best result is an O(|s|^2 / log |s|) solution over a finite alphabet

[Diagram: edit graph between the source string and the destination string (letters A, B, C); edges are labelled "insert C (cost=1)", "delete C (cost=1)" and "do nothing (cost=0)".]

Questions?

Extending the String problem

• Compute M[x,y] only close to the diagonal (E. Myers)
  – finds the solution in O(n*D), where n is the size of the largest string and D the distance between the two strings
• Adapt M[x,y] to work on trees (S. Chawathe)
  – remove some edges to ensure that deleting a node will delete the subtree rooted at that node (and conversely for insert)

State of the art (2): tree pattern matching for XML

• Kuo-Chung Tai, Lu, Selkow: based on the string edit problem
  – in XML, many labels are identical
• Unix Diff, Sun DiffML
• LaDiff (MH-Diff): Chawathe, Rajaraman, Garcia-Molina, J. Widom
  – matching criteria to compare nodes and subtrees
  – quadratic in the ‘distance’ between both trees
• IBM diff, available at alphaWorks

Data Model

Issue: persistent identification of nodes

[Diagram: two versions of a Catalog tree. Each product (Pr) has a name (N) and a price (P).]
• Version 1: Camera 300, TV 100, VCR 200
• Version 2: TV 100, DVD 500, VCR 150

Change Model

• Attach persistent identifiers:
  – to every node = XID
  – to the document = XID-map
• Represent changes with a delta
  – delta = set of changes
  – nice mathematical properties
• Change-centric Management of Versions, VLDB 2001

Algorithm: Intuition

[Diagram: the two versions of the Catalog tree, with XIDs attached to the nodes in postfix order and the Delete, Update and Insert operations highlighted.]

• Version 1: Camera 300, TV 100, VCR 200; XID-map: (1-16|17)
• Version 2: TV 100, DVD 500, VCR 150; new XID-map: (6-10,17-21,11-16|22)
• Diff(V1,V2):
  – delete(5): the Camera product is deleted
  – update(13,150): the VCR price is updated from 200 to 150
  – insert(16,2,(17-21)): the DVD product is inserted under the Catalog node

Objectives

• Assign persistent identifiers by matching nodes
• Compute a representation of changes between the two documents
• Also:
  – Constraint-awareness: follow DTD specifications
  – Correctness: no change is missed

Representation of Changes: Example

[XML delta example; markup lost in extraction. Surviving text values: Camera $300, DVD $400, $200 $150.]

II.1.2 XML Diff: The XyDiff Algorithm

Phase 1: Identify Subtrees

• One traversal of the tree
• Use ID attributes from the DTD to match nodes (or forbid matching)
• Compute for every subtree:
  – signature
  – weight
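A single post-order traversal is enough to attach a signature and a weight to every subtree. The sketch below is one possible way to do it, not the definition used by XyDiff: the signature is a SHA-1 hash of the tag, the text and the child signatures, and the weight formula (1 per element, plus 1 + log of the text length for text) is an assumption. Attributes and tail text are ignored for brevity.

```python
import hashlib
import math
import xml.etree.ElementTree as ET


def annotate(elem, table):
    """Post-order pass: return (signature, weight) of elem.

    Every visited subtree is appended to table as (signature, weight, element).
    """
    h = hashlib.sha1()
    h.update(elem.tag.encode("utf-8"))
    weight = 1.0
    text = (elem.text or "").strip()
    if text:
        h.update(b"|text:" + text.encode("utf-8"))
        weight += 1.0 + math.log(len(text))  # assumed weight of a text node
    for child in elem:
        child_signature, child_weight = annotate(child, table)
        h.update(b"|" + child_signature.encode("ascii"))
        weight += child_weight
    signature = h.hexdigest()
    table.append((signature, weight, elem))
    return signature, weight


v1 = ET.fromstring(
    "<Catalog><Pr><N>TV</N><P>100</P></Pr><Pr><N>VCR</N><P>200</P></Pr></Catalog>"
)
v2 = ET.fromstring(
    "<Catalog><Pr><N>VCR</N><P>150</P></Pr><Pr><N>TV</N><P>100</P></Pr></Catalog>"
)
table1, table2 = [], []
annotate(v1, table1)
annotate(v2, table2)
```

Identical subtrees, such as the TV product in both versions, receive the same signature wherever they appear, so a hash table keyed by signature finds candidate matches directly; the weight then tells which matches are worth considering first.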



Phase 2: Bottom-Up + Lazy-Down Propagation

Let L be the list of all subtrees in the second document.
For each subtree S in L (in decreasing weight order):
1. Find all identical subtrees in the first document
2. Select acceptable matches
3. If there is at least one match, choose the best candidate
   – remove S and all its subtrees from L
4. Propagate [very carefully] the matching to parents and ancestors

Phase 3: Optimization

• Some nodes are unmatched after the previous phases
• Use previous results to propagate matching [now a bit less carefully]
• First, bottom-up:
  – propagate matching to ancestors
• Then a quick top-down pass:
  – propagate matching to descendant nodes based on element names, if there is no ambiguity

Phase 4: Construct the delta

• Find inserted/deleted nodes
• Find “easy” move operations: parent node changed
• Find “complex” moves = reordering children
  – largest common subsequence (weighted)
  – example: A, B, C, D, E, F becomes E, D, A, B, C, F; the largest common subsequence is A, B, C, F, so nodes D and E are ‘moved’
  – complexity is quadratic
  – we approximate the solution in linear time
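The reordering example can be checked with the classical quadratic dynamic program for the longest common subsequence. This is the exact, unweighted version, shown only to make the notion of a "moved" child concrete; it is not the linear-time approximation mentioned above, and the function name is an assumption.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of sequences a and b (O(|a|*|b|))."""
    n, m = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    result, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


old_children = list("ABCDEF")
new_children = list("EDABCF")
kept = longest_common_subsequence(old_children, new_children)
moved = [child for child in new_children if child not in kept]
```

Children in the common subsequence keep their relative order and need no operation; every other child is reported as moved. In the weighted variant, the subsequence would be chosen to maximise the total weight of the children that stay in place.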

Key aspect: the weight of trees

The choice affects speed and quality.

• Select acceptable match
  – use locality (e.g. find matching ancestors) to avoid wrong matches
  – two small trees are matched if some ancestors are matching; for large trees, further look-up is accepted
• Propagation
  – propagation should try not to induce wrong matches
  – the intuition is that large matching subtrees are more relevant: the larger the tree, the more we propagate the matching to ancestors

Algorithm: Tuning

• The definition of weight affects both speed and quality
• Look-up and propagation distances
• Trade-off: quality vs. speed
• We exhibit in the paper some bounds that guarantee linear time complexity

Complexity: n*log(n)

• Phase 1 (identification) is one traversal of the tree
• Phase 2 (propagation) is n times ‘get best candidate’ in the worst case
  – the look-up level is designed to keep the cost of ‘get best candidate’ in O(log(n))
  – uses some pre-computed indexes
• Phase 3 (optimization) is designed to be linear
• Phase 4 (delta construction) is linear
  – the longest common subsequence of children is approximated

Experiments

• Simulator of changes on XML documents
• Speed and quality evaluation on synthetic data
• Comparison with Unix Diff on web data

Simulator of changes

• Parameters control the number of delete/insert/move operations
• Input: XML document D
• Outputs:
  – delta of changes
  – XML document D’ = delta(D)

Synthetic Data: Speed of the algorithm

Experimental verification that the algorithm is quasi-linear.

[Figure: plot not recovered; surviving labels: “Typical Pattern”, “1Mb”.]

Synthetic Data: Quality of the algorithm

• Start from document D
  – generate changes over D: D’ = delta(D)
  – give (D, D’) to XyDiff and compute a delta
• The size of the computed delta is comparable to the ‘original’ delta size
• For large deltas, XyDiff finds more efficient operations

[Figure: size ratio of the diff over the original delta (axis ticks 0 to 4); surviving label: “Typical Pattern”.]

Comparison of the size of results: XyDiff vs. UnixDiff

• Experiments on 10,000 XML web documents that changed
  – at the time of that experiment, we had to crawl 10 million web pages to find them ☺
• 80% of the documents are below the size of (UnixDiff*1.2)
• Almost all are below UnixDiff*2
• Of course, the delta of XyDiff contains much more information

Perspectives

• Larger-scale experiments on web data
• Learn about changes:
  – frequency, patterns, …
  – obtain statistics for DTD and XML Schema
  – use the statistics to learn about changes and improve XyDiff for typed XML data
• Use XML diff to observe changes between websites

Conclusion

• A novel algorithm for XML diff in quasi-linear time
• XML specificities are used to improve quality
• Available as open-source freeware at: http://www-rocq.inria.fr/~cobena/XyDiffWeb/

XyDiff in Xyleme Architecture

[Diagram components: Web Crawler, XML Loader, Alerter, XyDiff, Storage. Labels: V(n) of the XML document; Delta(V(n-1),V(n)).]

Questions?

(Part II: XML Diff) Comparative study of change detection in XML

Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)

Context

• Consider change control in XML data warehouses
• We want to understand changes
• We have only the old and new versions of documents
• A diff needs to be computed

Organization

• Motivations
• Data Model
• Representing Changes
  – version management and querying
  – comparison of change representation models
  – experiments
• Detecting Changes
  – state of the art in change detection
  – performance analysis and experiments
  – quality analysis and experiments
• Summary

Motivations

Motivations: Representing Changes

• Version management: the representation should allow for effective storage strategies
• Temporal databases: support for persistent identification of nodes is mandatory
• Monitoring: information about changes is used to support triggers or detect events
• Note: HTML or XHTML documents may be used

Motivations: Detecting Changes

• Correctness: the diff programs miss no changes
• Minimality of the result is important to save storage space and network bandwidth
• Semantics: some algorithms take more of the semantics of XML documents into account
• Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
• ‘Move operations’: some algorithms support move operations whereas others don’t; this impacts both the performance of the tool and the quality of results

II.2.1 XML Diff: Comparing Data Models

Data Model (quick overview)

• Operations are:
  (i) insert, delete applied to leaves or subtrees
  (ii) update of text nodes
  (iii) move applied to a subtree root, moving the entire subtree
• An edit cost is assigned to each operation; usually, the cost is 1 per node touched
• The semantics of move is to identify subtrees even when their context has changed
• We use the notion of a mapping between the two trees: each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A)

Data Model: Intuition

[Figure: two trees rooted at 'root', with nodes a, b, c, x and y. One panel shows "Selkow's model: delete 'b'", the other "Tai's model: delete 'b'".]
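The difference between the two models can be sketched in a few lines of Python. In Selkow's model, insertions and deletions apply at the leaves, so removing an internal node 'b' means removing its whole subtree. In Tai's model, 'b' can be removed alone and its children are attached to its parent. The tree shape used here (root with children a, b, c, and b with children x, y) is an assumption, since the original figure is not recoverable, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def example_tree() -> Node:
    # Assumed shape: root -> a, b, c and b -> x, y.
    return Node("root", [Node("a"), Node("b", [Node("x"), Node("y")]), Node("c")])


def show(node: Node) -> str:
    """Compact textual form, e.g. root(a,b(x,y),c)."""
    if not node.children:
        return node.label
    return f"{node.label}({','.join(show(c) for c in node.children)})"


def delete_selkow(root: Node, label: str) -> None:
    """Selkow-style delete: the node disappears with its whole subtree."""
    root.children = [c for c in root.children if c.label != label]
    for child in root.children:
        delete_selkow(child, label)


def delete_tai(root: Node, label: str) -> None:
    """Tai-style delete: the node's children take its place under its parent."""
    new_children = []
    for child in root.children:
        delete_tai(child, label)
        if child.label == label:
            new_children.extend(child.children)
        else:
            new_children.append(child)
    root.children = new_children


selkow = example_tree()
delete_selkow(selkow, "b")
tai = example_tree()
delete_tai(tai, "b")
print(show(selkow))  # root(a,c)
print(show(tai))     # root(a,x,y,c)
```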

II.2.2 XML Diff Representing Changes

Representing Changes

Version Management

• There are several version management strategies. For instance, when only deltas are stored, their size must be kept small.
• We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
• A simple text-based version management is possible but cannot be used for querying.

Querying Changes

• Labeling nodes with prefix+postfix identifiers improves querying algorithms
• Labeling nodes with persistent identifiers improves temporal databases
• There is no short labeling scheme that is good for both
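To make the labeling trade-off concrete, here is a minimal sketch of prefix+postfix labeling, assuming a tree given as nested `(name, children)` tuples with unique, hypothetical node names. Each node receives its rank in a pre-order and in a post-order traversal; a node u is then an ancestor of v exactly when u comes before v in pre-order and after v in post-order, so the test needs no tree traversal.

```python
def assign_labels(tree):
    """Map each node name to its (pre-order rank, post-order rank)."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        for child in children:
            visit(child)
        labels[name] = (pre, counters["post"])
        counters["post"] += 1

    visit(tree)
    return labels


def is_ancestor(labels, u, v):
    """True if u is a proper ancestor of v, using the labels only."""
    return labels[u][0] < labels[v][0] and labels[u][1] > labels[v][1]


# Hypothetical document shape, for illustration only.
TREE = ("catalog", [
    ("product1", [("name1", []), ("price1", [])]),
    ("product2", [("name2", []), ("price2", [])]),
])

labels = assign_labels(TREE)
print(is_ancestor(labels, "product1", "price1"))  # True
print(is_ancestor(labels, "product1", "price2"))  # False

# Inserting a node shifts the ranks of later nodes, so these labels are
# not persistent across versions.
GROWN = ("catalog", [("product0", [])] + TREE[1])
relabeled = assign_labels(GROWN)
```

After the insertion, `product2` no longer carries the label it had in the first version, which illustrates why a labeling that is good for queries is not automatically good for temporal identification.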

Our Example

Two versions of the example document (text content only):

• Notebook 2200MHz Pentium4 $1999; Digital Camera Fuji FinePix 2600Z $299
• Notebook 2200MHz Pentium4 $1999; Digital Camera Fuji FinePix 2600Z Not Available

Different representations

Change Models: XUpdate

[Example delta; markup not recoverable. Surviving text value: "$299". Surviving callout: "XPath expression".]
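The idea of addressing the changed node with an XPath expression can be sketched with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. This is not XUpdate syntax. The element names (`Catalog`, `Product`, `Name`, `Description`, `Price`) are hypothetical, since the markup of the example is lost, and the direction of the change (from "$299" to "Not Available") is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup around the text values of the running example.
DOC = """<Catalog>
  <Product>
    <Name>Notebook</Name>
    <Description>2200MHz Pentium4</Description>
    <Price>$1999</Price>
  </Product>
  <Product>
    <Name>Digital Camera</Name>
    <Description>Fuji FinePix 2600Z</Description>
    <Price>$299</Price>
  </Product>
</Catalog>"""


def apply_update(root, path, new_text):
    """Replace the text of the single node selected by `path`."""
    target = root.find(path)
    if target is None:
        raise LookupError(f"no node matches {path!r}")
    target.text = new_text


root = ET.fromstring(DOC)
# The node is located by its position in the document, not by an identifier.
apply_update(root, "./Product[2]/Price", "Not Available")
```

Because the path depends on positions, the same expression would select a different node if a product were inserted before the camera. This is the kind of fragility that persistent identifiers avoid.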

Change Models: DeltaXML (Example)

[Example delta; markup not recoverable. Surviving text values: "Not Available", "$399".]

• Same look'n'feel as the document
• Mentions some unchanged nodes
• The order is important (no ids, no move)

Change Models: XyDelta (Example)

[Example delta; markup not recoverable. Surviving text values: "$399", "Not Available".]

Callouts on the example:

• Persistent identifiers
• What is the parent node?

Change Models: Microsoft XDL (Example)

[Example delta; markup not recoverable. Surviving text value: "$299".]

Callouts on the example:

• Identify nodes
• Element node
• Verify consistency

Summary

Nice features that some models lack

• Validation by a DTD (may be a problem for DeltaXML, XyDelta)
• Verify the source document (only XDL)
• Support of 'move' operations (only XyDelta and XDL)
• Backward deltas (only XyDelta)
• Monitoring the delta (only XUpdate and DeltaXML)

Unique advantages of XyDelta

• A formal model and nice mathematical properties
• Persistent identification of nodes (at least as an option)

Still missing for all of them

• A framework for querying
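The "backward deltas" feature can be illustrated with a minimal sketch: if each operation records both the old and the new value of a node addressed by a persistent identifier, the delta can be inverted mechanically. This is not the XyDelta format. The `Update` record, the identifiers and the dictionary-based document are all hypothetical, and the direction of the price change is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Update:
    node_id: int  # persistent identifier of a text node
    old: str
    new: str


def apply_delta(doc: Dict[int, str], delta: List[Update]) -> Dict[int, str]:
    """Return a new version of `doc` with every update applied in order."""
    result = dict(doc)
    for op in delta:
        # Consistency check: the delta must match the version it is applied to.
        if result.get(op.node_id) != op.old:
            raise ValueError(f"delta does not match node {op.node_id}")
        result[op.node_id] = op.new
    return result


def invert(delta: List[Update]) -> List[Update]:
    """Backward delta: undo the operations in reverse order."""
    return [Update(op.node_id, op.new, op.old) for op in reversed(delta)]


v1 = {1: "Notebook", 2: "$1999", 3: "Digital Camera", 4: "$299"}
forward = [Update(4, "$299", "Not Available")]
v2 = apply_delta(v1, forward)
restored = apply_delta(v2, invert(forward))
```

A delta that stored only the new value could not be inverted without keeping the previous version, so a completed forward delta lets a version store keep the latest document and walk backwards on demand.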

Storage Experiments

[Figure: "Comparing Delta Size" for DeltaXML and XyDelta. Axis labels: Edit Cost and File Size, on logarithmic scales.]

Identifiers save space when there are few updates.

Change Models: Conclusion

• Change monitoring is easier with DeltaXML and XUpdate
• Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
• Future work:
  • It is not yet clear how to query changes
  • Define transaction or synchronization protocols

II.2.3 Detecting Changes

State of the art

Algorithms:

• Based on the String Edit Problem (1966)
• Tree-to-tree correction

These find the Minimum Edit Script in O(m*n) time and space, where m and n are the sizes of the two documents.

Other algorithms:

• Run in linear time or close to it
• Match nodes or subtrees depending on their content
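The O(m*n) bound comes from the classic dynamic program for the string edit problem, sketched below for sequences with unit costs. It is given only to show where the quadratic time and space come from; it is not the tree algorithm of any tool discussed here.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence `a` into sequence `b`."""
    m, n = len(a), len(b)
    # dist[i][j] is the cost of turning a[:i] into b[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete the first i items of a
    for j in range(n + 1):
        dist[0][j] = j  # insert the first j items of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # delete a[i-1]
                dist[i][j - 1] + 1,  # insert b[j-1]
                substitution,        # keep or replace
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

The table has (m+1)*(n+1) cells and each is filled in constant time, hence the O(m*n) time and space. On large documents this is what motivates the faster, approximate algorithms.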

Experiments: Speed of several algorithms

Algorithms: Overview

[Figure: "From:" and "To:" versions of an example document; markup not recoverable.]

• The cheapest choice would be to move the two elements (cost=2). But finding the best script with 'move' operations is NP-hard.
• Without 'move', the minimum edit script consists of deleting the two elements and then inserting them (cost=4) (MMDiff).
• Preprocessing often consists of mapping identical subtrees. In this case, an additional 'move' operation is needed (cost=5).
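The preprocessing step of mapping identical subtrees can be sketched with subtree signatures: hash every subtree of the old document bottom-up, then walk the new document top-down and stop at the first subtree whose signature already exists in the old one. This is a generic illustration under simplifying assumptions (trees as nested `(label, children)` tuples, hypothetical labels, SHA-1 as the signature), not the implementation of any particular tool.

```python
import hashlib


def annotate(node):
    """Return (signature, label, annotated children) for a (label, children) tree."""
    label, children = node
    kids = [annotate(child) for child in children]
    digest = hashlib.sha1(label.encode("utf-8") + b"\x00")
    for kid in kids:
        digest.update(kid[0].encode("ascii"))
    return (digest.hexdigest(), label, kids)


def all_signatures(annotated, out):
    """Collect the signatures of every subtree into the set `out`."""
    out.add(annotated[0])
    for kid in annotated[2]:
        all_signatures(kid, out)
    return out


def matched_subtrees(old, new):
    """Labels of the largest subtrees of `new` that also occur in `old`."""
    old_signatures = all_signatures(annotate(old), set())
    found = []

    def walk(annotated):
        signature, label, kids = annotated
        if signature in old_signatures:
            found.append(label)  # whole subtree matched, no need to descend
        else:
            for kid in kids:
                walk(kid)

    walk(annotate(new))
    return found


old = ("root", [("a", [("x", [])]), ("b", [("y", [])])])
new = ("root", [("b", [("y", [])]), ("a", [("x", [])]), ("c", [])])
print(matched_subtrees(old, new))  # ['b', 'a']
```

The subtrees are matched regardless of their position, which is what allows a later phase to report them as moved instead of deleted and reinserted. Each node is hashed once per document, so this phase is linear in the size of the inputs.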

Experiments: Quality (measured by the Edit Cost)

Experiments: Speed (focus on DeltaXML)

Comparison summary

• Many other algorithms offer no advantage
• MMDiff is the reference for quality
• DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
• Performance measurements for Microsoft's tool will be available soon; it seems comparable in performance to DeltaXML

Other issues

Constrained Diff is often interesting:

• Using 'keys' to match specific nodes (e.g. DeltaXML)
• Using XMLSchema or DTD information
• Time-constrained diff (e.g. XyDiff)

Postprocessing of results?
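Matching on keys can be sketched as follows, assuming each product carries a hypothetical `sku` attribute that acts as its key. Nodes with the same key in both versions are matched directly, without comparing their content or position. The element and attribute names are illustrative only, and the code uses Python's standard `xml.etree.ElementTree`, not the API of any diff tool.

```python
import xml.etree.ElementTree as ET


def match_by_key(old_root, new_root, tag, key):
    """Split the keys of `tag` elements into matched, deleted and inserted."""
    old = {e.get(key): e for e in old_root.iter(tag) if e.get(key) is not None}
    new = {e.get(key): e for e in new_root.iter(tag) if e.get(key) is not None}
    matched = sorted(old.keys() & new.keys())
    deleted = sorted(old.keys() - new.keys())
    inserted = sorted(new.keys() - old.keys())
    return matched, deleted, inserted


OLD = ET.fromstring(
    '<Catalog><Product sku="nb1"/><Product sku="cam7"/></Catalog>'
)
NEW = ET.fromstring(
    '<Catalog><Product sku="cam7"/><Product sku="tv3"/></Catalog>'
)

matched, deleted, inserted = match_by_key(OLD, NEW, "Product", "sku")
```

Here `cam7` is matched even though it changed position, while `nb1` is reported as deleted and `tv3` as inserted. A diff tool can then restrict its more expensive matching to the nodes that have no key.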

What's next?

Representing Changes:

• Unify and improve existing features
• Support queries!
• Chain versions?

Change Detection:

• We are currently working on Microsoft's XML Diff
• Use XMLSchema (or DTD) information
• Mining changes? Use learning?

Questions?

Thank you