Large-scale, Parallel Automatic Patent Annotation - Thomas Heitz


Large-scale, Parallel Automatic Patent Annotation
Thomas Heitz & GATE Team
Computer Science Dept. - NLP Group - Sheffield University
Patent Information Retrieval 2008

30 October 2008


Overview


Automatic Patent Annotation

Objectives
Fully automatic method.
Scaling up without sacrificing computational performance or accuracy.

Methods
Keyword-based queries: 10 degree, 20 degree Celsius, 18 °F, etc.
Semantic annotation-based queries: measurement.unit = 'degree Celsius', measurement.value = {10,30}; these will find the Fahrenheit equivalents as well.
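A minimal Python sketch of why a semantic query can also find Fahrenheit values: each annotated measurement is normalised to one unit before the comparison. The `Measurement` class and the function names are hypothetical, and the sketch assumes that {10,30} denotes a range from 10 to 30; the slides do not show how the query engine works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str  # "degree Celsius" or "degree Fahrenheit" in this sketch


def to_celsius(m: Measurement) -> float:
    """Normalise a temperature measurement to degrees Celsius."""
    if m.unit == "degree Celsius":
        return m.value
    if m.unit == "degree Fahrenheit":
        return (m.value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported unit: {m.unit}")


def in_celsius_range(m: Measurement, low: float, high: float) -> bool:
    """True if the measurement, once normalised, lies within [low, high]."""
    return low <= to_celsius(m) <= high


# Hypothetical annotations, as they might be extracted from patent text.
annotations = [
    Measurement(20, "degree Celsius"),
    Measurement(68, "degree Fahrenheit"),  # equal to 20 degrees Celsius
    Measurement(18, "degree Fahrenheit"),  # below 0 degrees Celsius
]

# Semantic query: unit = 'degree Celsius', value between 10 and 30.
hits = [m for m in annotations if in_celsius_range(m, 10, 30)]
```

A keyword query for "20 degree Celsius" would only return the first annotation, whereas this query also returns the 68 °F one.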


Large-scale parallel Information Extraction

System characteristics
Insufficient training data for learning ⇒ rule-based system.
Robust, scalable ⇒ shallow IE (deep IE in PatExpert [16]).
Large volume of data ⇒ automatic and parallel.


Results

Performance and quality
Processed 1.3 million patents in 6 days with 12 parallel processes.
Strict precision and recall greater than 90% for most annotations.
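As a rough back-of-the-envelope check, the reported figures can be turned into a throughput estimate. The sketch assumes the six days were continuous wall-clock time and that the load was spread evenly over the 12 processes; neither is stated on the slides.

```python
patents = 1_300_000   # from the slide
days = 6              # from the slide
processes = 12        # from the slide

seconds = days * 24 * 60 * 60

# Overall throughput of the whole run.
patents_per_second = patents / seconds

# Average time one process spends on one patent.
seconds_per_patent_per_process = seconds * processes / patents

print(f"{patents_per_second:.1f} patents/s overall")
print(f"{seconds_per_patent_per_process:.1f} s per patent per process")
```

Under these assumptions the run averages about 2.5 patents per second, or a little under 5 seconds per patent for each process.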


Contents

1 Task: patent annotation
2 Tools: GATE gazetteers and rules
3 Experiments: large scale and parallel
4 Evaluation: gold standard

Task: patent annotation

Patent data and structure

Dataset from Matrixware
American patents (USPTO): 1.3 million, 108 GB, average file size 85 KB.
European patents (EPO): 27 thousand, 780 MB, average file size 29 KB.

Structure in three main parts
The first page containing bibliographical data and abstract, the description of the invention, the usage of the invention, the claim part and the bibliography part.


[Figure: Section annotations (EPO)]


Section annotations

Sections
The BibliographicData, Abstract and Claims sections are pre-existing.
heading annotations give the beginning of a section, if present.
Keywords are used to guess the section type.
About 20 section types.
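The keyword-based guess can be illustrated with a small Python sketch. The section type names are taken from the gold standard tables in this deck, but the keyword lists and the function name are hypothetical; the actual Section Finder is implemented with GATE resources that are not shown on the slides.

```python
# Hypothetical keyword lists, one per section type (only a few of the ~20 types).
SECTION_KEYWORDS = {
    "BackgroundArt": ("background",),
    "DrawingDescription": ("drawing", "figures"),
    "Examples": ("example",),
    "SummaryOfTheInvention": ("summary",),
    "TechnicalField": ("technical field", "field of the invention"),
}


def guess_section_type(heading):
    """Return the first section type whose keyword occurs in the heading, else None."""
    text = heading.lower()
    for section_type, keywords in SECTION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return section_type
    return None


print(guess_section_type("BACKGROUND OF THE INVENTION"))
print(guess_section_type("Brief Description of the Drawings"))
```

Headings that match no keyword return None in this sketch, leaving the section type undecided.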


[Figure: Reference annotations (USPTO)]


Reference annotations

References
Claim, Example, Figure, Formula and Table references are quite straightforward, except for intervals like "Fig. 1 to 3 and 5".
Patent references are a lot more difficult because of the variability of their format.
Literature references are harder still; for example, author names can take numerous formats: Warwel, S.; S. Warwel; Siegfried Warwel; etc.
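To illustrate the interval case, here is a minimal Python sketch that expands a reference such as "Fig. 1 to 3 and 5" into the individual figure numbers. The function name and the regular expression are assumptions made for illustration; the slides handle such references with JAPE rules, not with this code.

```python
import re


def expand_reference_numbers(text):
    """Expand the numbers in a reference, e.g. 'Fig. 1 to 3 and 5' -> [1, 2, 3, 5]."""
    numbers = []
    # Each match is either a range ("1 to 3", "1-3") or a single number ("5").
    for match in re.finditer(r"(\d+)(?:\s*(?:to|-)\s*(\d+))?", text):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        numbers.extend(range(start, end + 1))
    return numbers


print(expand_reference_numbers("Fig. 1 to 3 and 5"))
```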


[Figure: Measurement annotations (EPO)]


Measurement annotations

Measurements
Most measurements comprise a scalarValue followed by a unit, e.g. 350 nm.
Two scalarValue annotations, with or without a unit, can be contained in an interval, e.g. 150 to 350 nm.
So many measurement units exist that we used an ontology populated from a database.
One-letter units are ambiguous.
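The two patterns described above can be sketched with regular expressions in Python. The tiny unit list stands in for the ontology of units and deliberately omits one-letter units such as "m", which would match far too much; all names here are hypothetical and the real system uses gazetteers and JAPE rules instead.

```python
import re

UNITS = ("nm", "mm", "cm", "kg")  # hypothetical stand-in for the unit ontology
UNIT = "|".join(sorted(UNITS, key=len, reverse=True))
NUMBER = r"\d+(?:\.\d+)?"

# "150 to 350 nm" or "150 nm to 350 nm": two scalar values, the first unit optional.
INTERVAL = re.compile(rf"({NUMBER})\s*(?:({UNIT})\s+)?to\s+({NUMBER})\s*({UNIT})\b")
# "350 nm": one scalar value followed by a unit.
MEASUREMENT = re.compile(rf"({NUMBER})\s*({UNIT})\b")


def find_measurements(text):
    """Return (kind, matched text) pairs; intervals take priority over single values."""
    found = []
    covered = []
    for match in INTERVAL.finditer(text):
        found.append(("interval", match.group(0)))
        covered.append(match.span())
    for match in MEASUREMENT.finditer(text):
        inside = any(s <= match.start() and match.end() <= e for s, e in covered)
        if not inside:
            found.append(("measurement", match.group(0)))
    return found


print(find_measurements("a layer of 350 nm"))
print(find_measurements("from 150 to 350 nm thick"))
```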

Tools: GATE gazetteers and rules

GATE

GATE and ANNIE
GATE [5], the General Architecture for Text Engineering, is a framework providing support for a variety of language engineering tasks.
It includes a vanilla information extraction system, ANNIE.
The processing resources we use from ANNIE are a tokeniser, a completely customised gazetteer, and finite-state transduction grammars.


Gazetteers

Reference and measurement unit gazetteers
The rules use clue words, such as "Table" followed by a number for table references.
We use gazetteers to annotate such clue words with all their inflections.
For references: 314 entries.
For measurement units: more than 30K entries (created automatically from a database).
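A gazetteer lookup of clue words and their inflections can be approximated in Python with a dictionary of surface forms. The entries below are a hypothetical handful, not the 314 entries used in the experiments, and the function is only a sketch of the idea, not the GATE gazetteer implementation.

```python
import re

# Hypothetical entries: surface form (lower case) -> reference type.
REFERENCE_GAZETTEER = {
    "table": "Table", "tables": "Table",
    "fig.": "Figure", "figs.": "Figure", "figure": "Figure", "figures": "Figure",
    "claim": "Claim", "claims": "Claim",
}


def lookup_clue_words(text):
    """Return (start, end, reference type) for every clue word found in the text."""
    results = []
    for match in re.finditer(r"[A-Za-z]+\.?", text):
        word = match.group(0).lower()
        end = match.end()
        # A trailing dot may be sentence punctuation rather than part of the entry.
        if word not in REFERENCE_GAZETTEER and word.endswith("."):
            word, end = word[:-1], end - 1
        entry = REFERENCE_GAZETTEER.get(word)
        if entry:
            results.append((match.start(), end, entry))
    return results


text = "See Tables 1 and 2 and Fig. 3"
for start, end, kind in lookup_clue_words(text):
    print(kind, repr(text[start:end]))
```

A rule can then look for such a clue word followed by a number instead of enumerating every spelling itself.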


Annotation rules

GATE JAPE
We use GATE JAPE rules. A rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS).
The LHS consists of an annotation pattern to be matched in the text.
The RHS declares the action to take when the pattern specified in the LHS is found in the document.


To find a measurement, e.g. 350 nm.
In total, 30 rules are used for measurements.

Measurement annotation rule

    Rule: Measurement
    (
      // LHS
      {Number} {Unit}
    ):match
    -->
      // RHS
      :match.Measurement = {}
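The LHS/RHS split can be mimicked in a few lines of Python. This is a simplified emulation for illustration only: it treats the document as a list of annotations sorted by offset and matches a sequence of annotation types, whereas JAPE compiles rules into finite-state transducers over GATE's annotation graph. The `Annotation` class and `apply_rule` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    start: int
    end: int


def apply_rule(annotations, lhs, result_type):
    """Emulate a rule: where the type sequence `lhs` matches, emit a `result_type` annotation."""
    ordered = sorted(annotations, key=lambda a: a.start)
    output = []
    i = 0
    while i + len(lhs) <= len(ordered):
        window = ordered[i:i + len(lhs)]
        if [a.type for a in window] == list(lhs):
            # RHS action: create one annotation spanning the whole match.
            output.append(Annotation(result_type, window[0].start, window[-1].end))
            i += len(lhs)
        else:
            i += 1
    return output


# Annotations for the text "about 350 nm": "350" is a Number, "nm" is a Unit.
existing = [Annotation("Number", 6, 9), Annotation("Unit", 10, 12)]
print(apply_rule(existing, ("Number", "Unit"), "Measurement"))
```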


To find a literature reference, e.g. see: Peacock, R. D. “The Chemistry of Technetium and Rhenium” Elsevier: Amsterdam, 1966.
24 rules are used for references.

Literature annotation rule

    Rule: Literature
    (
      // LHS
      {LiteratureContext}
      (
        {LiteratureStart} {LiteratureEnd}
      ):match
    ):match-with-context
    -->
      // RHS
      :match.Literature = {}


Application

Application pipeline

Phase   GATE processing resource
1       Section Finder
2       English Tokeniser
3       Patent-specific gazetteers
4       Reference Finder
5       Measurements Finder
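The order of the phases matters because later resources consume annotations produced by earlier ones. The Python sketch below checks such an ordering; the resource names come from the table above, but the "requires" and "produces" sets are assumptions (Token and Lookup are the usual GATE annotation types for tokeniser and gazetteer output, while the dependencies of the finders are guessed from the rule slides).

```python
# (resource name, annotation types required, annotation types produced)
PIPELINE = [
    ("Section Finder", set(), {"Section"}),
    ("English Tokeniser", set(), {"Token"}),
    ("Patent-specific gazetteers", set(), {"Lookup"}),
    ("Reference Finder", {"Token", "Lookup"}, {"Reference"}),      # assumed inputs
    ("Measurements Finder", {"Token", "Lookup"}, {"Measurement"}),  # assumed inputs
]


def run_pipeline(pipeline):
    """Walk the phases in order and fail if a resource lacks its input annotations."""
    available = set()
    for phase, (name, requires, produces) in enumerate(pipeline, start=1):
        missing = requires - available
        if missing:
            raise RuntimeError(f"phase {phase} ({name}) is missing {sorted(missing)}")
        available |= produces
    return available


print(sorted(run_pipeline(PIPELINE)))
```

Running the finders before the tokeniser and gazetteers would raise an error in this sketch, which is the constraint the phase numbering expresses.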

Experiments: large scale and parallel

Setup

Large Data Collider (LDC)
Our experiments were carried out on the IRF’s LDC with Java (jrockit-R27.4.0-jdk1.5.0_12), using up to 12 processes.
The LDC is an SGI Altix 4700 system comprising 20 nodes, each with four 1.4 GHz Itanium cores and 18 GB RAM.
In comparison, we found processing to be 4x faster on a 2.4 GHz Intel Core 2.

Specific applications
GATE batch mode: dispatches the files to process to several GATE applications; does not stop on error.
GATE benchmarking: generates time stamps for each resource and displays charts from them.
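The batch-mode behaviour (dispatch files to several workers, keep going on error) can be sketched in Python as follows. This is not the GATE batch tool: `annotate` is a hypothetical stand-in for running one GATE application on one file, and threads are used here only to keep the sketch self-contained, whereas the experiments used separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(path):
    """Hypothetical stand-in for running one GATE application on one file."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"{path}: annotated"


def safe_annotate(path):
    """Run annotate() and record a failure instead of stopping the batch."""
    try:
        return path, annotate(path), None
    except Exception as error:
        return path, None, str(error)


def run_batch(paths, workers=12):
    """Dispatch the files to a pool of workers; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe_annotate, paths))


for path, result, error in run_batch(["a.xml", "b.bad", "c.xml"], workers=2):
    print(path, "->", result if error is None else f"FAILED ({error})")
```

Collecting the error per file, rather than letting the exception propagate, is what lets a run over more than a million documents finish despite a few malformed inputs.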


Optimisation

Benchmarking and refactoring
Benchmarking of each processing resource.
Removal of unnecessary resources, such as the ANNIE morphological analyser and named entity recognition, to keep only the tokeniser.
Optimisation of the JAPE rules for which the benchmarking detects abnormal execution times.
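Per-resource benchmarking amounts to time-stamping each stage of the pipeline and looking for outliers. A minimal Python sketch, assuming each processing resource can be modelled as a function from document to document (the names are hypothetical, and this is not the GATE benchmarking tool):

```python
import time


def benchmark(resources, document):
    """Run each (name, function) resource in order and record its elapsed time."""
    timings = {}
    for name, resource in resources:
        started = time.perf_counter()
        document = resource(document)
        timings[name] = time.perf_counter() - started
    return document, timings


def slowest(timings):
    """Name of the resource with the largest elapsed time."""
    return max(timings, key=timings.get)


# Toy resources standing in for real processing resources.
resources = [("upper", str.upper), ("strip", str.strip)]
document, timings = benchmark(resources, "  some patent text  ")
print(document, "| slowest resource:", slowest(timings))
```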


Performance

[Figure: Baseline vs. optimized]

Evaluation: gold standard

Patent Gold Standard

Creation of the Gold Standard
Selection of patents from two very different fields: mechanical engineering and biomedical technology.
Manual annotation of USPTO and EPO patents by more than 10 people, with several annotators for each patent.
In total: 51 patents, 2.5 million characters.


Statistics on Gold Standard

Annotation type            USPTO    EPO
Section.Abstract              23     28
S.BackgroundArt               19     22
S.BestMode                     2      5
S.BibliographicData           23     28
S.Bibliography                 0      8
S.Claims                      23      0
S.CrossReferenceToR.A.         6      1
S.DetailedDescription         11     18
S.DisclosureOfInvention        3      6
S.DrawingDescription          16     20
S.Effects                      1      2
S.Examples                    17     25
S.PreferredEmbodiment         10      7
S.PriorArt                     4      6
S.Sponsorship                  2      0
S.SummaryOfTheInvent.         20     18
S.TechnicalField              14     17
S.UsageOfInvention             1      6
Annotations/Doc              8.5      8

Annotation type            USPTO    EPO
Reference.Claim              352      2
R.Example                     99    264
R.Figure                     375    570
R.Formula                     79     66
R.Literature                 114    488
R.Patent                      92    182
R.Table                       59    105
Annotations/Doc               51     60

Annotation type            USPTO    EPO
M.scalarValue               1998   3409
Measurement.unit            1613   2994
M.interval                   432    375
Annotations/Doc              176    242


Results on Gold Standard: micro-averaged precision (P.), recall (R.) and F1, in %

                           USPTO              EPO
Annotation type         P.   R.   F1      P.   R.   F1
S.BackgroundArt         74   74   74      56   68   61
S.DrawingDescr.         75   75   75      84   80   82
Section.Examples        65   65   65      61   56   58
S.SummaryOf.            89   80   84      83   83   83
S.TechnicalField        80   57   67      94   94   94

Reference.Claim        100  100  100     100  100  100
R.Example               97  100   99     100   99   99
R.Figure                99   99   99      99   98   98
R.Formula               99   99   99     100  100  100
R.Literature            69   75   72      70   74   72
R.Patent                76   77   77      72   84   78
R.Table                100   98   99     100  100  100

M.scalarValue           96   93   94      94   92   93
Measurement.unit        95   92   93      94   93   93
M.interval              93   92   93      82   81   82
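For reference, micro-averaging pools the counts of all documents before computing the ratios, and F1 is the harmonic mean of precision and recall. The Python sketch below assumes that, under strict matching, only annotations with exactly matching spans count as correct; the per-document counts in the example are hypothetical, since the slides report only the final percentages.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """counts: (correct, spurious, missing) per document; returns (P, R, F1)."""
    counts = list(counts)
    correct = sum(c for c, _, _ in counts)
    spurious = sum(s for _, s, _ in counts)
    missing = sum(m for _, _, m in counts)
    precision = correct / (correct + spurious) if correct + spurious else 0.0
    recall = correct / (correct + missing) if correct + missing else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts for two documents.
print(micro_average([(8, 2, 0), (1, 0, 3)]))

# Consistency check against one table row (S.SummaryOf., USPTO: P. 89, R. 80).
print(round(f1_score(89, 80)))
```

The second call reproduces the F1 of 84 reported for that row, which supports reading the table columns as P., R. and F1 for each patent office.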


[Figure: Section annotation: Examples (EPO)]


[Figure: Reference annotation: Literature (USPTO)]


[Figure: Measurement annotation: interval (EPO)]


Conclusion
Fully automatic method that scales up (million patents, 100 GB).
Quality close to human annotators.

Perspectives
Machine learning from annotated patents.
Semantic queries with a Patent Ontology.
