Overview
Large-scale, Parallel Automatic Patent Annotation
Thomas Heitz & GATE Team
Computer Science Dept. - NLP Group - Sheffield University
Patent Information Retrieval 2008
30 October 2008
T. Heitz & GATE Team - NLP Group - Sheffield University
Large-scale, Parallel Automatic Patent Annotation
1 / 33
Automatic Patent Annotation

Objectives
Fully automatic method.
Scaling up without sacrificing computational performance or accuracy.

Methods
Keyword-based queries: 10 degree, 20 degree Celsius, 18 °F, etc.
Semantic-annotation-based queries: measurement.unit = 'degree Celsius', measurement.value = {10,30}; these also find the Fahrenheit equivalents.
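To illustrate why a query over semantic annotations can also retrieve Fahrenheit values, here is a minimal Python sketch. It assumes that each measurement annotation carries a unit and a numeric value, and that {10,30} denotes the range from 10 to 30 degrees Celsius. The `Measurement` class and the two functions are hypothetical names for illustration and are not part of GATE.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical stand-in for a measurement annotation."""
    unit: str     # e.g. "degree Celsius" or "degree Fahrenheit"
    value: float  # scalar value as written in the patent


def to_celsius(m: Measurement) -> float:
    """Normalise a temperature measurement to degrees Celsius."""
    if m.unit == "degree Celsius":
        return m.value
    if m.unit == "degree Fahrenheit":
        return (m.value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported unit: {m.unit}")


def matches_celsius_range(m: Measurement, low: float, high: float) -> bool:
    """True if the normalised measurement lies in [low, high] degrees Celsius."""
    return low <= to_celsius(m) <= high


annotations = [
    Measurement("degree Celsius", 20),
    Measurement("degree Fahrenheit", 68),  # equals 20 degrees Celsius
    Measurement("degree Fahrenheit", 18),  # about -7.8 degrees Celsius
]
hits = [m for m in annotations if matches_celsius_range(m, 10, 30)]
```

A keyword query for "degree Celsius" would miss the second measurement, whereas the normalised comparison keeps it and rejects the third.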
Large-scale parallel Information Extraction

System characteristics
Insufficient training data for learning ⇒ rule-based system.
Robust, scalable ⇒ shallow IE (deep IE in PatExpert [16]).
Large volume of data ⇒ automatic and parallel.
Results

Performance and quality
Processed 1.3 million patents in 6 days with 12 parallel processes.
Strict precision and recall greater than 90% for most annotations.
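As a rough back-of-the-envelope reading of these figures, the short Python calculation below derives the implied throughput. It assumes the 6 days were continuous wall-clock time and that the 12 processes were evenly loaded, neither of which is stated on the slide.

```python
PATENTS = 1_300_000  # patents processed (from the slide)
DAYS = 6             # duration (from the slide), assumed continuous
PROCESSES = 12       # parallel processes (from the slide), assumed evenly loaded

seconds = DAYS * 24 * 60 * 60
patents_per_second = PATENTS / seconds
seconds_per_patent_per_process = seconds * PROCESSES / PATENTS

print(f"{patents_per_second:.2f} patents/s overall")
print(f"{seconds_per_patent_per_process:.1f} s per patent per process")
```

Under these assumptions the run averages about 2.5 patents per second overall, or a little under 5 seconds per patent within each process.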
Contents
1. Task: patent annotation
2. Tools: GATE gazetteers and rules
3. Experiments: large scale and parallel
4. Evaluation: gold standard
Task: patent annotation
Patent data and structure

Dataset from Matrixware
American patents (USPTO): 1.3 million, 108 GB, average file size 85 KB.
European patents (EPO): 27 thousand, 780 MB, average file size 29 KB.

Structure in three main parts
The first page containing bibliographic data and abstract, the description of the invention, the usage of the invention, the claims part and the bibliography part.
[Figure: Section annotations (EPO)]
Section annotations

Sections
The BibliographicData, Abstract and Claims sections are pre-existing.
heading annotations give the beginning of a section, if present.
Keywords are used to guess the section type.
About 20 section types.
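The keyword-based guess of a section type can be sketched as follows in Python. The keyword lists are invented for illustration, since the slides do not give the keywords actually used, and `guess_section_type` is a hypothetical helper rather than a GATE component.

```python
from typing import Optional

# Hypothetical keyword lists; the real ones are not given in the slides.
SECTION_KEYWORDS = {
    "BackgroundArt": ["background"],
    "DrawingDescription": ["drawing", "figures"],
    "PriorArt": ["prior art"],
    "TechnicalField": ["technical field", "field of the invention"],
    "Examples": ["example"],
}


def guess_section_type(heading: str) -> Optional[str]:
    """Return the first section type whose keyword occurs in the heading."""
    text = heading.lower()
    for section_type, keywords in SECTION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return section_type
    return None  # no keyword matched: leave the section untyped
```

For instance, a heading such as "BRIEF DESCRIPTION OF THE DRAWINGS" would be typed as DrawingDescription, while a heading with no known keyword stays untyped.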
[Figure: Reference annotations (USPTO)]
Reference annotations

References
Claim, Example, Figure, Formula and Table references are quite straightforward, except for intervals like Fig. 1 to 3 and 5.
Patent references are much more difficult because of the variability of their format.
Literature references are harder still; for example, author names can take numerous formats: Warwel, S.; S. Warwel; Siegfried Warwel; etc.
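The interval case mentioned above can be made concrete with a small Python sketch that expands a figure reference into the individual figure numbers. The function name and the regular expression are illustrative assumptions; the system itself handles such cases with JAPE rules.

```python
import re
from typing import List


def expand_figure_numbers(reference: str) -> List[int]:
    """Expand e.g. 'Fig. 1 to 3 and 5' into [1, 2, 3, 5]."""
    numbers: List[int] = []
    # Each match is either a range ("a to b" or "a-b") or a single number.
    for start, end in re.findall(r"(\d+)(?:\s*(?:to|-)\s*(\d+))?", reference):
        first = int(start)
        last = int(end) if end else first
        numbers.extend(range(first, last + 1))
    return numbers
```

With this sketch, "Fig. 1 to 3 and 5" yields the figure numbers 1, 2, 3 and 5.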
[Figure: Measurement annotations (EPO)]
Measurement annotations

Measurements
Most measurements comprise a scalarValue followed by a unit, e.g. 350 nm.
Two scalarValues, with or without a unit, can be contained in an interval, e.g. 150 to 350 nm.
A large number of measurement units exist, so we used an ontology populated from a database.
One-letter units are ambiguous.
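A rough Python sketch of the two shapes described above (scalar value plus unit, and an interval of two scalar values) is given below. It is a simplification under stated assumptions: the unit list is a tiny invented sample rather than the database-derived ontology, one-letter units are deliberately left out because they are ambiguous, and an interval is only recognised when its second value carries a unit.

```python
import re
from typing import List, Tuple

# Tiny illustrative unit list; the described system uses far more units.
UNITS = ["nm", "mm", "kg", "ml"]
UNIT = "|".join(UNITS)
NUMBER = r"\d+(?:\.\d+)?"

# An interval: two scalar values, the first optionally followed by a unit.
INTERVAL = re.compile(rf"({NUMBER})\s*(?:{UNIT})?\s*to\s*({NUMBER})\s*({UNIT})\b")
# A simple measurement: one scalar value followed by a unit.
SIMPLE = re.compile(rf"({NUMBER})\s*({UNIT})\b")


def find_measurements(text: str) -> List[Tuple]:
    """Return ('interval', low, high, unit) and ('measurement', value, unit) tuples."""
    found: List[Tuple] = []
    covered = []  # character spans already claimed by intervals
    for m in INTERVAL.finditer(text):
        found.append(("interval", float(m.group(1)), float(m.group(2)), m.group(3)))
        covered.append(m.span())
    for m in SIMPLE.finditer(text):
        # Skip values that are already part of a recognised interval.
        if not any(s <= m.start() < e for s, e in covered):
            found.append(("measurement", float(m.group(1)), m.group(2)))
    return found
```

Intervals are matched first so that "150 to 350 nm" is reported once as an interval instead of also producing a separate measurement for "350 nm".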
Tools: GATE gazetteers and rules
GATE

GATE and ANNIE
GATE [5], the General Architecture for Text Engineering, is a framework providing support for a variety of language engineering tasks. It includes a vanilla information extraction system, ANNIE.
The processing resources we use from ANNIE are the tokeniser, a completely customised gazetteer, and finite state transduction grammars.
Gazetteers

Reference and measurement unit gazetteers
The rules use clue words, such as Table followed by a number for table references.
We use gazetteers to annotate such clue words with all their inflections.
For references: 314 entries.
For measurement units: more than 30K entries (created automatically from a database).
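The idea of a gazetteer that lists every inflection of a clue word can be sketched in Python as a plain lookup table. The entries and the `lookup` helper below are hypothetical examples and do not reproduce the 314 entries of the actual reference gazetteer or the GATE gazetteer file format.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt: each surface form (including inflections and
# abbreviations) maps to the reference type it signals.
REFERENCE_GAZETTEER: Dict[str, str] = {
    "table": "Table", "tables": "Table",
    "figure": "Figure", "figures": "Figure", "fig.": "Figure", "figs.": "Figure",
    "claim": "Claim", "claims": "Claim",
}


def lookup(tokens: List[str]) -> List[Tuple[int, str]]:
    """Return (token index, reference type) for every clue word found."""
    return [
        (i, REFERENCE_GAZETTEER[token.lower()])
        for i, token in enumerate(tokens)
        if token.lower() in REFERENCE_GAZETTEER
    ]
```

The rules can then test for a clue-word annotation followed by a number without having to enumerate the inflections themselves.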
Annotation rules

GATE JAPE
We use GATE JAPE rules, each consisting of two parts: a left-hand side (LHS) and a right-hand side (RHS).
The LHS consists of an annotation pattern to be matched in the text.
The RHS declares the action to take when the pattern specified in the LHS is found in the document.
To find a Measurement, e.g. 350 nm.

Measurement annotation rule

    Rule: Measurement
    (
      // LHS
      {Number} {Unit}
    ):match
    -->
      // RHS
      :match.Measurement = {}

In total, 30 rules are used for measurements.
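To make the LHS/RHS split of this rule more tangible, the Python sketch below emulates it over a list of existing annotations. This is only an illustration of the matching idea, not how the JAPE engine is implemented: the `Annotation` class is a hypothetical simplification, and "followed by" is reduced to adjacency in start-offset order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """Hypothetical, simplified annotation: a type over a character span."""
    type: str
    start: int
    end: int


def apply_measurement_rule(annotations: List[Annotation]) -> List[Annotation]:
    """LHS: a Number directly followed by a Unit. RHS: create a Measurement."""
    ordered = sorted(annotations, key=lambda a: a.start)
    created = []
    for left, right in zip(ordered, ordered[1:]):
        if left.type == "Number" and right.type == "Unit":
            # RHS: the new annotation spans the whole matched pattern.
            created.append(Annotation("Measurement", left.start, right.end))
    return created


# "350 nm": Number over characters 0-3, Unit over characters 4-6.
result = apply_measurement_rule([Annotation("Number", 0, 3), Annotation("Unit", 4, 6)])
```

Here `result` holds a single Measurement annotation covering characters 0 to 6, mirroring the `:match.Measurement = {}` action of the rule.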
To find a literature reference, e.g. see: Peacock, R. D. “The Chemistry of Technetium and Rhenium” Elsevier: Amsterdam, 1966.

Literature annotation rule

    Rule: Literature
    (
      // LHS
      {LiteratureContext}
      ({LiteratureStart} {LiteratureEnd}):match
    ):match-with-context
    -->
      // RHS
      :match.Literature = {}

24 rules are used for references.
Application

Application pipeline

Phase   GATE processing resource
1       Section Finder
2       English Tokeniser
3       Patent-specific gazetteers
4       Reference Finder
5       Measurements Finder
Experiments: large scale and parallel
Setup

Large Data Collider (LDC)
Our experiments were carried out on the IRF's LDC with Java (jrockit-R27.4.0-jdk1.5.0_12) and up to 12 processes.
SGI Altix 4700 system comprising 20 nodes, each with four 1.4 GHz Itanium cores and 18 GB RAM.
In comparison, we found processing to be 4x faster on an Intel Core 2 at 2.4 GHz.

Specific applications
GATE batch mode: dispatches the files to be processed to several GATE applications; does not stop on error.
GATE benchmarking: generates time stamps for each resource and displays charts from them.
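The batch-mode behaviour (dispatch files to several workers and keep going when one file fails) can be sketched in Python as below. This is an analogy and not the GATE batch tool: it uses threads from the standard library where the experiments used separate processes, and `run_batch` and `fake_annotate` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple


def run_batch(files: List[str], annotate: Callable[[str], int],
              workers: int = 12) -> Tuple[Dict[str, int], Dict[str, str]]:
    """Annotate files in parallel; record failures instead of stopping."""

    def safe(path: str) -> Tuple[str, Optional[int], Optional[str]]:
        try:
            return path, annotate(path), None
        except Exception as exc:  # one bad file must not halt the batch
            return path, None, str(exc)

    results: Dict[str, int] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, count, error in pool.map(safe, files):
            if error is None:
                results[path] = count
            else:
                errors[path] = error
    return results, errors


def fake_annotate(path: str) -> int:
    """Hypothetical stand-in for running a GATE application on one file."""
    if "broken" in path:
        raise ValueError("malformed XML")
    return len(path)  # pretend this is the number of annotations created
```

Collecting the errors per file, rather than raising, lets a run over a very large collection finish and be inspected afterwards.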
Optimisation

Benchmarking and refactoring
Benchmarking of each processing resource.
Removal of unnecessary resources, such as the ANNIE morphological analyser and named entity recognition, keeping only the tokeniser.
Optimisation of the JAPE rules for which benchmarking detected abnormal execution times.
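Per-resource benchmarking of a pipeline can be sketched in a few lines of Python. The sketch assumes that each processing resource can be modelled as a function from document to document; `run_with_timing` and `slowest` are hypothetical helpers and not the GATE benchmarking tool.

```python
import time
from typing import Callable, Dict, List, Tuple

Resource = Tuple[str, Callable[[str], str]]


def run_with_timing(pipeline: List[Resource], document: str,
                    totals: Dict[str, float]) -> str:
    """Run each resource in order and accumulate its elapsed time by name."""
    for name, resource in pipeline:
        started = time.perf_counter()
        document = resource(document)
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - started
    return document


def slowest(totals: Dict[str, float]) -> str:
    """Name of the resource that consumed the most time overall."""
    return max(totals, key=totals.get)
```

Accumulating the totals over many documents points to the resource worth removing or rewriting first, which is the kind of decision described above.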
Performance
[Figure: Baseline vs. optimized]
Evaluation: gold standard
Patent Gold Standard

Creation of the Gold Standard
Selection of patents from two very different fields: mechanical engineering and biomedical technology.
Manual annotation of USPTO and EPO patents by more than 10 people, with several annotators for each patent.
In total: 51 patents, 2.5 million characters.
Statistics on Gold Standard

Annotation type            USPTO    EPO
Section.Abstract              23     28
S.BackgroundArt               19     22
S.BestMode                     2      5
S.BibliographicData           23     28
S.Bibliography                 0      8
S.Claims                      23      0
S.CrossReferenceToR.A.         6      1
S.DetailedDescription         11     18
S.DisclosureOfInvention        3      6
S.DrawingDescription          16     20
S.Effects                      1      2
S.Examples                    17     25
S.PreferredEmbodiment         10      7
S.PriorArt                     4      6
S.Sponsorship                  2      0
S.SummaryOfTheInvent.         20     18
S.TechnicalField              14     17
S.UsageOfInvention             1      6
Annotations/Doc              8.5      8

Annotation type            USPTO    EPO
Reference.Claim              352      2
R.Example                     99    264
R.Figure                     375    570
R.Formula                     79     66
R.Literature                 114    488
R.Patent                      92    182
R.Table                       59    105
Annotations/Doc               51     60

Annotation type            USPTO    EPO
M.scalarValue               1998   3409
Measurement.unit            1613   2994
M.interval                   432    375
Annotations/Doc              176    242
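The Annotations/Doc rows appear to be the total number of annotations divided by the number of gold-standard documents. The short Python check below tests that reading on the Reference counts, assuming 23 USPTO and 28 EPO documents (the Section.Abstract counts, on the assumption of one abstract per patent).

```python
# Reference counts copied from the gold-standard statistics above.
uspto_refs = [352, 99, 375, 79, 114, 92, 59]
epo_refs = [2, 264, 570, 66, 488, 182, 105]

# Assumption: one abstract per patent, so these are the document counts.
USPTO_DOCS, EPO_DOCS = 23, 28

uspto_per_doc = sum(uspto_refs) / USPTO_DOCS
epo_per_doc = sum(epo_refs) / EPO_DOCS
```

Rounded to whole numbers, the two averages agree with the reported 51 and 60 reference annotations per document.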
Results on Gold Standard: micro-averaged precision (P.), recall (R.) and F1, in %

                        USPTO              EPO
Annotation type        P.    R.    F1     P.    R.    F1
S.BackgroundArt        74    74    74     56    68    61
S.DrawingDescr.        75    75    75     84    80    82
Section.Examples       65    65    65     61    56    58
S.SummaryOf.           89    80    84     83    83    83
S.TechnicalField       80    57    67     94    94    94

Reference.Claim       100   100   100    100   100   100
R.Example              97   100    99    100    99    99
R.Figure               99    99    99     99    98    98
R.Formula              99    99    99    100   100   100
R.Literature           69    75    72     70    74    72
R.Patent               76    77    77     72    84    78
R.Table               100    98    99    100   100   100

M.scalarValue          96    93    94     94    92    93
Measurement.unit       95    92    93     94    93    93
M.interval             93    92    93     82    81    82
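For readers less familiar with the measures, the Python sketch below shows how micro-averaged precision and recall are usually computed (by pooling counts over all documents before dividing) and how F1 follows from them. The per-document counts are invented for illustration; only the formulas are standard.

```python
from typing import List, Tuple


def micro_average(counts: List[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall from per-document (TP, FP, FN) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two documents, for illustration only.
p, r = micro_average([(8, 2, 0), (1, 0, 3)])
```

Applied to reported values, `f1(89, 80)` gives about 84 and `f1(80, 57)` about 67, in line with the USPTO F1 figures for S.SummaryOf. and S.TechnicalField.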
[Figure: Section annotation: Examples (EPO)]
[Figure: Reference annotation: Literature (USPTO)]
[Figure: Measurement annotation: interval (EPO)]
Conclusion
Fully automatic method that scales up (a million patents, 100 GB).
Quality close to that of human annotators.

Perspective
Machine learning from annotated patents.
Semantic queries with a Patent Ontology.