PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT

March 22, 2021

Hamid Bekamiri, Daniel S. Hain, Roman Jurowetzki

Computer Science

Economics

Machine Learning

Econometrics

This study provides an efficient approach for using text data to calculate patent-to-patent (p2p) technological similarity, and presents a hybrid framework for leveraging the resulting p2p similarity for applications such as semantic search and automated patent classification. We create embeddings using Sentence-BERT (SBERT) based on patent claims. We leverage SBERT's efficiency in creating embedding distance measures to map p2p similarity in large sets of patent data. We deploy our framework for classification with a simple K-Nearest Neighbors (KNN) model that predicts the Cooperative Patent Classification (CPC) of a patent based on the class assignment of the K patents with the highest p2p similarity. We thereby validate that the p2p similarity captures the patents' technological features in terms of CPC overlap, and at the same time demonstrate the usefulness of this approach for automatic patent classification based on text data. Furthermore, the presented classification framework is simple, and its results are easy for end-users to interpret and evaluate. In the out-of-sample model validation, we are able to perform a multi-label prediction of all assigned CPC classes on the subclass (663) level on 1,492,294 patents with an accuracy of 54% and F1 score > 66%, which suggests that our model outperforms the current state-of-the-art in text-based multi-label and multi-class patent classification. We furthermore discuss the applicability of the presented framework for semantic IP search, patent landscaping, and technology intelligence. We finally point towards a future research agenda for leveraging multi-source patent embeddings, assessing their appropriateness across applications, and improving and validating patent embeddings by creating domain-expert curated Semantic Textual Similarity (STS) benchmark datasets.
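The KNN step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are toy stand-ins for SBERT claim embeddings, and the function names (`cosine`, `predict_cpc`), the vote-share rule, and the `min_share` threshold are all hypothetical choices made here for illustration. The abstract states only that CPC classes are predicted from the class assignments of the K most similar patents.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_cpc(query, corpus, k=3, min_share=0.5):
    """Multi-label KNN prediction (hypothetical voting rule).

    corpus: list of (embedding, set_of_cpc_labels) pairs.
    A label is predicted when at least `min_share` of the k nearest
    neighbours carry it. This threshold is an assumption, not a value
    taken from the paper.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(label for _, labels in neighbours for label in labels)
    return {label for label, n in votes.items() if n / len(neighbours) >= min_share}


# Toy data: in practice each vector would be an SBERT embedding of a patent claim.
corpus = [
    ([0.9, 0.1, 0.0], {"G06F", "G06N"}),
    ([0.8, 0.2, 0.1], {"G06N"}),
    ([0.7, 0.3, 0.0], {"G06F"}),
    ([0.0, 0.1, 0.9], {"A61K"}),
    ([0.1, 0.0, 0.8], {"A61K", "C07D"}),
]
query = [0.85, 0.15, 0.05]
predicted = predict_cpc(query, corpus, k=3)
print(sorted(predicted))
```

Because the prediction is simply a tally over the retrieved neighbours, an end-user can inspect those neighbours directly to see why a class was assigned, which matches the interpretability point made in the abstract. An exhaustive sort is used here for clarity; on a corpus of over a million patents an approximate nearest-neighbour index would be the more realistic choice.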

A Survey on Sentence Embedding Models Performance for Patent Analysis

April 28, 2022

95% Match

Hamid Bekamiri, Daniel S. Hain, Roman Jurowetzki

Computation and Language

Patent data is an important source of knowledge for innovation research, while the technological similarity between pairs of patents is a key enabling indicator for patent analysis. Recently researchers have been using patent vector space models based on different NLP embedding models to calculate the technological similarity between pairs of patents to help better understand innovations, patent landscaping, technology mapping, and patent quality evaluation. More often than ...


A comparative analysis of embedding models for patent similarity

March 25, 2024

94% Match

Grazia Sveva Ascione, Valerio Sterzi

Computation and Language

Information Retrieval

Machine Learning

This paper makes two contributions to the field of text-based patent similarity. First, it compares the performance of different kinds of patent-specific pretrained embedding models, namely static word embeddings (such as word2vec and doc2vec models) and contextual word embeddings (such as transformer-based models), on the task of patent similarity calculation. Second, it compares specifically the performance of Sentence Transformers (SBERT) architectures with different trai...


Semantic Similarity Matching for Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method

January 6, 2024

94% Match

Liqiang Yu, Bo Liu, Qunwei Lin, ... , Che Chang

Computation and Language

Artificial Intelligence

In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. Firstly, this study addresses these challenges, recognizing early CPC work while acknowledging past struggles with language barriers and document intricacy. Secondly, it underscores the persisting difficulties of CPC research. To overcome these challenges an...


PatentBERT: Patent Classification with Fine-Tuning a pre-trained BERT Model

May 14, 2019

94% Match

Jieh-Sheng Lee, Jieh Hsiang

Computation and Language

Machine Learning

In this work we focus on fine-tuning a pre-trained BERT model and applying it to patent classification. When applied to large datasets of over two million patents, our approach outperforms the previous state of the art, an approach using a CNN with word embeddings. In addition, we focus on patent claims without other parts in patent documents. Our contributions include: (1) a new state-of-the-art method based on pre-trained BERT model and fine-tuning for patent classification, (2) a ...


PaECTER: Patent-level Representation Learning using Citation-informed Transformers

February 29, 2024

93% Match

Mainak Ghosh, Sebastian Erhardt, Michael E. Rose, ... , Dietmar Harhoff

Information Retrieval

Computation and Language

Machine Learning

PaECTER is a publicly available, open-source document-level encoder specific for patents. We fine-tune BERT for Patents with examiner-added citation information to generate numerical representations for patent documents. PaECTER performs better in similarity tasks than current state-of-the-art models used in the patent domain. More specifically, our model outperforms the next-best patent specific pre-trained language model (BERT for Patents) on our patent citation prediction ...


Enhancing Patent Retrieval using Text and Knowledge Graph Embeddings: A Technical Note

November 3, 2022

93% Match

L Siddharth, Guangtong Li, Jianxi Luo

Information Retrieval

Patent retrieval influences several applications within engineering design research, education, and practice as well as applications that concern innovation, intellectual property, and knowledge management etc. In this article, we propose a method to retrieve patents relevant to an initial set of patents, by synthesizing state-of-the-art techniques among natural language processing and knowledge graph embedding. Our method involves a patent embedding that captures text, citat...


A Text-Embedding-based Approach to Measure Patent-to-Patent Technological Similarity -- Workflow, Code, and Applications

March 27, 2020

93% Match

Daniel Hain, Roman Jurowetzki, ... , Patrick Wolf

Digital Libraries

This paper describes an efficiently scalable approach to measure technological similarity between patents by combining embedding techniques from natural language processing with nearest-neighbor approximation. Using this methodology we are able to compute existing similarities between all patents, which in turn enables us to represent the whole patent universe as a technological network. We validate both technological signature and similarity in various ways, and demonstrate ...


Text Similarity in Vector Space Models: A Comparative Study

September 24, 2018

92% Match

Omid Shahmirzadi, Adam Lugowski, Kenneth Younge

Computation and Language

Machine Learning

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models to perform this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedd...


PatentMatch: A Dataset for Matching Patent Claims & Prior Art

December 27, 2020

92% Match

Julian Risch, Nicolas Alder, ... , Ralf Krestel

Information Retrieval

Digital Libraries

Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain and the patent-domain-specific language. For these reasons, we address the computer-assisted search for prior art by creating a tra...


BERT based patent novelty search by training claims to their own description

March 1, 2021

92% Match

Michael Freunek, André Bodmer

stat.ML

cs.CL

cs.LG

econ.EM

math.ST

stat.TH

In this paper we present a method to concatenate patent claims to their own description. By applying this method, BERT trains suitable descriptions for claims. Such a trained BERT (claim-to-description-BERT) could be able to identify novelty relevant descriptions for patents. In addition, we introduce a new scoring scheme, relevance scoring or novelty scoring, to process the output of BERT in a meaningful way. We tested the method on patent applications by training BERT on t...
