Extend Named Entity Recogniser (NER) to label new entities with spaCy

Category: IT Technology · Published: 4 years ago

Figure 1: Colour-coded recognised entities

This post assumes that the reader has some notion of entity extraction from text and wants to understand which state-of-the-art techniques exist for recognising new custom entities, and how to use them. If you are new to the NER problem, please read about it here first.

With that said, the purpose of this post is to describe how to use a pretrained natural language processing (NLP) core model from spaCy to learn to recognise new entities. The existing spaCy core models are trained to recognise the entity types listed in Figure 2.

Figure 2: Existing entities recognised by spaCy core models (source)

Nonetheless, a user might want to define their own entities to suit the problem at hand. In such a case the preexisting entities are insufficient, and one needs to train the NLP model to do the job. Thanks to spaCy's documentation and pretrained models, this is not very difficult.

If you do not want to read further and would rather learn by using it, please go to this Jupyter notebook; it is self-contained. Regardless, I would recommend reading the rest of this post as well.

Data Preprocessing

Like any supervised learning algorithm, which requires inputs and outputs to learn from, here the input is text and the output is encoded according to the BILUO scheme shown in Figure 3. Other schemes exist, but Ratinov and Roth showed that the minimal Begin, In, Out (IOB) scheme is more difficult to learn than the BILUO scheme, which explicitly marks boundary tokens. An example of IOB encoding provided by spaCy is consistent with this argument. Thus, from here on, any mention of an annotation scheme refers to BILUO.

Figure 3: BILUO scheme

A short example of BILUO encoded entities is shown in the following figure.

Figure 4: Entities encoded with the BILUO scheme
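Since the figures did not survive here, the scheme can also be sketched in a few lines of plain Python. This is only an illustration of the tagging rules, not spaCy code: the function name `biluo_encode`, the example sentence, and the use of inclusive token indices are all assumptions made for this sketch.

```python
from typing import List, Tuple


def biluo_encode(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with BILUO labels.

    Each span is (first_token, last_token, label), with both indices inclusive.
    """
    tags = ["O"] * n_tokens  # O: outside any entity
    for first, last, label in spans:
        if first == last:
            tags[first] = f"U-{label}"  # U: unit, a single-token entity
        else:
            tags[first] = f"B-{label}"  # B: first token of the entity
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"  # I: inside the entity
            tags[last] = f"L-{label}"  # L: last token of the entity
    return tags


tokens = ["Alex", "moved", "to", "New", "York", "City"]
tags = biluo_encode(len(tokens), [(0, 0, "PERSON"), (3, 5, "GPE")])
print(list(zip(tokens, tags)))
```

The explicit `L-` and `U-` tags are what distinguish BILUO from IOB: the end of an entity and single-token entities are marked directly rather than inferred from the next tag.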

There are three possible ways to encode your data with the BILUO scheme. The first is to create a spaCy doc, save each token to a text file, and then label each token there.
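A minimal sketch of this first approach follows. It assumes spaCy v3 and uses a blank English pipeline so that it runs without downloading a model. The helper name `annotation_template` and the tab-separated "token, tag" layout are hypothetical choices for this sketch, not a format prescribed by spaCy.

```python
import spacy

# Tokenizer-only English pipeline; no pretrained model is needed for this step.
nlp = spacy.blank("en")


def annotation_template(text: str) -> str:
    """Return one 'token<TAB>O' line per token, ready for hand labelling."""
    doc = nlp(text)
    return "\n".join(f"{token.text}\tO" for token in doc)


template = annotation_template("I like New York City.")
print(template)
```

Every token starts out as `O`; the annotator then replaces the tag on entity tokens, for example `B-CITY`, `I-CITY`, `L-CITY`, where `CITY` is a made-up label for illustration.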

Annotating token by token in this way makes it easier to label the data and to transform it into input-output pairs in a format accepted by the spaCy NLP model. The data then has to be read back from the file and converted into the object form that the model accepts.
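Reading the labelled data back could look roughly like this. The file layout is an assumption (one tab-separated token and tag per line, with a blank line between sentences), and `read_annotations` is a hypothetical helper; the sketch works on a list of lines so that it does not depend on a particular file.

```python
from typing import List, Tuple


def read_annotations(lines: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Group 'token<TAB>tag' lines into (words, tags) pairs, one per sentence."""
    sentences = []
    words, tags = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line closes the current sentence
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        word, tag = line.split("\t")
        words.append(word)
        tags.append(tag)
    if words:  # flush the final sentence if the input has no trailing blank line
        sentences.append((words, tags))
    return sentences


sample = [
    "I\tO", "like\tO", "New\tB-CITY", "York\tI-CITY", "City\tL-CITY", ".\tO",
    "",
    "Berlin\tU-CITY", "rocks\tO",
]
data = read_annotations(sample)
print(data)
```

With a real file, the same function can be called as `read_annotations(open(path, encoding="utf-8").readlines())`, where `path` is wherever the labelled tokens were saved.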

Another method uses offset indices: the start and end character indices of each entity are given along with its label, so that the begin, inside, and last parts of the entity are clubbed together into a single span.
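As a sketch of the offset format, the following assumes spaCy v3, where the helper `offsets_to_biluo_tags` lives in `spacy.training` (spaCy v2 exposed it under a different name in `spacy.gold`). The sentence and the `CITY` label are invented for illustration.

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # tokenizer only; no model download needed

text = "I like New York City."
# (start_char, end_char, label); text[7:20] == "New York City"
entities = [(7, 20, "CITY")]

doc = nlp.make_doc(text)
tags = offsets_to_biluo_tags(doc, entities)
print(list(zip([token.text for token in doc], tags)))
```

The offsets must line up with token boundaries. If they do not, spaCy marks the affected tokens with `-` instead of a BILUO tag, which is a useful check on hand-written offsets.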

The third method is similar to the first, except that here we fix our own tokens and label them, instead of generating tokens with the NLP model and then labelling them. While this can also work, in my experiments I found that it rather degrades performance.
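One way to supply your own tokens, sketched under the assumption of spaCy v3, is to build a `Doc` directly from a word list and convert hand-written BILUO tags to offsets. The words, spacing flags, and `CITY` label are illustrative, not taken from the original notebook.

```python
import spacy
from spacy.tokens import Doc
from spacy.training import Example, biluo_tags_to_offsets

nlp = spacy.blank("en")

# Tokens fixed by hand rather than produced by the tokenizer.
words = ["I", "like", "New", "York", "City", "."]
spaces = [True, True, True, True, False, False]  # is each token followed by a space?
tags = ["O", "O", "B-CITY", "I-CITY", "L-CITY", "O"]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
offsets = biluo_tags_to_offsets(doc, tags)  # character offsets into doc.text
example = Example.from_dict(doc, {"entities": offsets})

gold = [(ent.text, ent.label_) for ent in example.reference.ents]
print(offsets, gold)
```

A possible reason for the degraded performance is a mismatch between hand-made tokens and the tokens the model produces at prediction time, although that is a guess rather than something the experiments here established.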

Training

After preprocessing the data and preparing it for training, we need to add the new entity labels to the model's NER pipe. The core spaCy models have three pipeline components: Tagger, Parser, and NER. We also need to disable the tagger and parser components, since we will only be training the NER pipe, although one can train all the components simultaneously. Find more here.
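The training step can be sketched as below. This is not the original notebook code: it assumes spaCy v3, and to stay runnable without a model download it starts from a blank English pipeline instead of a pretrained core model. With a pretrained pipeline one would load it with `spacy.load`, fetch the existing component with `nlp.get_pipe("ner")`, and call `nlp.resume_training()` in place of `nlp.initialize()`. The training sentences and the `CITY` label are made up.

```python
import random

import spacy
from spacy.training import Example

# Hypothetical training data in the offset format.
TRAIN_DATA = [
    ("I like New York City.", {"entities": [(7, 20, "CITY")]}),
    ("Berlin is cold in winter.", {"entities": [(0, 6, "CITY")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register every new label with the NER component.
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# Train only the NER pipe; any other components are disabled inside the block.
with nlp.select_pipes(enable=["ner"]):
    optimizer = nlp.initialize(lambda: examples)
    for _ in range(20):
        random.shuffle(examples)
        losses = {}
        # drop=0.0 switches dropout off for this demonstration.
        nlp.update(examples, drop=0.0, sgd=optimizer, losses=losses)

print(ner.labels, losses)
```

Two sentences are far too few for a useful model; the point is only to show where the labels are added, where other components are disabled, and where the dropout value is passed.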

During training, the dropout is set to 0.0 to deliberately overfit the model and show that it can learn to recognise all the new entities. The trained model is then run on the text to check the result.

spaCy also provides a way to visualise colour-coded entities (as in Figure 1) in a web browser or notebook through its displaCy visualiser.
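A small sketch of the visualisation, assuming spaCy v3. The entity is set by hand here so that the example does not depend on a trained model; with a trained pipeline, `doc = nlp(text)` would supply the entities instead.

```python
import spacy
from spacy import displacy

nlp = spacy.blank("en")
doc = nlp.make_doc("I like New York City.")
# Entity set manually for the demo; "CITY" is an illustrative label.
doc.ents = [doc.char_span(7, 20, label="CITY")]

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])
```

Inside a notebook, calling `displacy.render(doc, style="ent")` displays the highlighted text inline, and `displacy.serve(doc, style="ent")` starts a local web server for viewing it in a browser.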

Caveats

The process described here for training new entities may seem easy, but it comes with a warning. During training, the new model can forget how to recognise the old entities, so it is highly recommended to mix in some text annotated with the previously learned entities, unless the old entities are not essential to solving the problem. Secondly, it might be better to learn a more specific entity than a generalised one.
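Mixing old and new examples can be as simple as the sketch below. The helper `mix_training_data`, its `revision_per_new` ratio, and the sample data are all hypothetical; the right proportion of old-entity examples depends on the problem and needs experimentation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_data(
    new: Sequence[T],
    revision: Sequence[T],
    revision_per_new: float = 1.0,
    seed: int = 0,
) -> List[T]:
    """Combine new-entity examples with a sample of old-entity examples."""
    rng = random.Random(seed)
    # Never ask for more revision examples than are available.
    n_revision = min(len(revision), round(len(new) * revision_per_new))
    mixed = list(new) + rng.sample(list(revision), n_revision)
    rng.shuffle(mixed)
    return mixed


new_examples = [
    ("I like New York City.", {"entities": [(7, 20, "CITY")]}),
    ("Berlin is cold in winter.", {"entities": [(0, 6, "CITY")]}),
]
# Examples carrying entity types the pretrained model already knows.
old_examples = [
    ("Apple hired Tim.", {"entities": [(0, 5, "ORG"), (12, 15, "PERSON")]}),
    ("She joined Google.", {"entities": [(11, 17, "ORG")]}),
    ("Mary lives here.", {"entities": [(0, 4, "PERSON")]}),
]

mixed = mix_training_data(new_examples, old_examples, revision_per_new=1.0)
print(len(mixed))
```

The old-entity examples can come from existing annotated data or from running the original pretrained model over raw text and keeping its predictions as annotations.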

Conclusion

We saw that it is not very hard to get started with learning new entities, but one does need to experiment with different annotation techniques and choose what works best for the given problem.

Additional Notes

  • This post is an extension of the example provided by spaCy here.
  • The entire code can be accessed in this Jupyter notebook. The readme also explains how to install the spaCy library and how to debug errors during installation and loading of the pretrained model.
  • Read this paper by Akbik et al. It should help in understanding the algorithm behind sequence labelling, i.e. multi-word entities.
