Generating headlines from news articles using SOTA summarizer based on BERT



At Axel Springer, Europe's largest digital publishing house, we own a lot of news articles from various media outlets such as Welt, Bild, Business Insider and many more. Arguably, the most important part of a news article is its title, and it is not surprising that journalists tend to spend a fair amount of their time coming up with a good one. For this reason, it was an interesting research question for us at Axel Springer AI whether we could create an NLP model that generates quality headlines from Welt news articles (see Figure 1). This could, for example, serve as inspiration for our journalists when creating SEO titles, something they often don't have time for (in fact, we are working together with our colleagues from SPRING on an SEO title generator).


Figure 1: One example from our Welt.de headline generator.

In the process of our research, we created a library called "Headliner" to generate our headlines. Headliner is a sequence modeling library that eases the training and, in particular, the deployment of custom sequence models. In this article we will go through the main features of the library and explain why we decided to create it. And we love open source, so you can find the code on GitHub if you want to try it out.

It’s not a new topic

Generating news headlines is not a new topic. Konstantin Lopyrev already used deep learning in 2015 to generate headlines from articles of six major news agencies, including The New York Times and the Associated Press. Specifically, he used an encoder-decoder neural network architecture (LSTM units and attention, see Figure 2) to solve this particular problem. In general, generating headlines can be seen as a text summarization problem, and a lot of research has been done in this area. A gentle introduction to the topic can be found on Machine Learning Mastery or FloydHub.


Figure 2: Encoder-decoder sequence-to-sequence model.

Why Headliner?

When we started with this project, we did some research on existing libraries. In fact, there are many libraries out there, such as Facebook's fairseq, Google's seq2seq, and OpenNMT. Although those libraries are great, they have a few drawbacks for our use case: fairseq doesn't focus much on production, and Google's seq2seq is not actively maintained. OpenNMT was the closest one to match our requirements as it has a strong focus on production. We, however, wanted to provide a leaner repository that could easily be customized and extended by the user.

Therefore, we built our library with the following goals in mind:

  • Provide an easy-to-use API for both training and deployment
  • Leverage all the new features from TensorFlow 2.x, such as tf.function, tf.keras.layers, etc.
  • Be modular by design and easily connectable with other libraries like Hugging Face's transformers or spaCy
  • Be extensible for different encoder-decoder models
  • Work on large data

Building the library

We approached the problem scientifically by starting with the basics and iteratively increasing complexity. Consequently, we first built a simple baseline repository in TensorFlow 2.0, which had just been released at the time. During this process we learned to appreciate many of TensorFlow's new features that really ease development, specifically:

  • Integration of Keras with model subclassing API
  • Default eager execution that enables developers to debug into the execution of the computation graphs
  • The possibility to write graph operations in natural Python syntax that AutoGraph converts into graph code (see the sketch below)
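To make these points concrete, here is a minimal sketch of a subclassed Keras model with a tf.function-decorated training step. It is illustrative code, not taken from the Headliner repository: removing the decorator runs the same Python eagerly, which is what makes step-by-step debugging possible, while with the decorator AutoGraph traces the function into a graph.

import tensorflow as tf

# A tiny subclassed Keras model (illustrative only, not Headliner code).
class TinyEncoder(tf.keras.Model):
    def __init__(self, vocab_size=1000, units=64):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, units)
        self.gru = tf.keras.layers.GRU(units)
        self.out = tf.keras.layers.Dense(vocab_size)

    def call(self, token_ids):
        x = self.embedding(token_ids)
        x = self.gru(x)
        return self.out(x)

model = TinyEncoder()
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# AutoGraph turns this natural Python into a graph; drop the decorator
# to execute eagerly and step through the computation in a debugger.
@tf.function
def train_step(token_ids, labels):
    with tf.GradientTape() as tape:
        logits = model(token_ids)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss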

The repository consists of separate modules for data preprocessing, vectorization and model training, which makes it easy to test, customize and extend. For example, you can easily integrate your own custom tokenizer into the training pipeline and ship it with the model.


Many machine learning repositories do not pay much attention to the deployment of trained models. For example, it is pretty common to see code for model inference depend on global parameter settings. This is dangerous, since a deployed model is bound to the exact preprocessing logic used during its training, which means that even a slight change in the inference setup can mess up predictions pretty badly. A better strategy is to serialize all preprocessing logic together with the model. This is realized in Headliner by bundling together all modules involved in preprocessing and inference.


Figure 3: Model structure for (de)-serialization.
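The underlying pattern can be sketched as follows. This is only an illustration of the principle, with placeholder names, not Headliner's actual implementation: the preprocessing and vectorization objects are serialized right next to the model weights, and loading restores both together, so inference always uses exactly the preprocessing the model was trained with.

import os
import pickle
import tensorflow as tf

def save_bundle(model: tf.keras.Model, preprocessor, vectorizer, path: str):
    # Serialize the model weights together with its preprocessing objects.
    os.makedirs(path, exist_ok=True)
    model.save_weights(os.path.join(path, 'weights'))
    with open(os.path.join(path, 'preprocessing.pkl'), 'wb') as f:
        pickle.dump({'preprocessor': preprocessor, 'vectorizer': vectorizer}, f)

def load_bundle(model: tf.keras.Model, path: str):
    # Restore the weights and the exact preprocessing used during training.
    model.load_weights(os.path.join(path, 'weights'))
    with open(os.path.join(path, 'preprocessing.pkl'), 'rb') as f:
        preprocessing = pickle.load(f)
    return model, preprocessing['preprocessor'], preprocessing['vectorizer']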

Once the codebase was built, we started to add more complex models such as attention-based recurrent networks and the Transformer. Finally, we implemented a SOTA summarizer based on fine-tuning a pre-trained BERT language model.

Welcome BertSum


Recently, a fine-tuned BERT model achieved state-of-the-art performance for abstractive text summarization across several datasets [1]. The authors made two key adjustments to the BERT model: first, a customized data preprocessing and second, a specific optimization schedule for training. We integrated those adjustments into separate preprocessing modules that can be used out of the box; try out the tutorial here!

To make the text consumable for BERT, it is necessary to split it into sentences enclosed in special tokens. Here is an example of the resulting format:

[CLS] First sentence of the article. [SEP] [CLS] Second sentence of the article. [SEP] [CLS] Third sentence of the article. [SEP] ...

In Headliner this preprocessing step is performed by the BertPreprocessor class, which internally uses a spaCy sentencizer to do the splitting. The article is then mapped to two sequences: first the token index sequence and second the segment index sequence, which distinguishes the individual sentences (see Figure 4).


Figure 4: Architecture of the original BERT model (left) and BertSum from [1] (right). For BertSum multiple sentences are enclosed by [CLS] and [SEP] tokens, and segment embeddings are used to distinguish multiple sentences.
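A rough sketch of what such a preprocessing step does is shown below, assuming spaCy 3.x and the bert-base-uncased tokenizer from Hugging Face's transformers. The function is illustrative rather than Headliner's internal code; it only demonstrates the sentence splitting, the [CLS]/[SEP] wrapping and the alternating segment ids from [1].

import spacy
from transformers import BertTokenizer

nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')  # rule-based sentence splitting
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def to_bert_input(article: str):
    # Split the article into sentences and enclose each in [CLS] ... [SEP].
    sentences = [sent.text for sent in nlp(article).sents]
    tokens, segment_ids = [], []
    for i, sentence in enumerate(sentences):
        sentence_tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
        tokens += sentence_tokens
        # Alternate segment ids (0, 1, 0, 1, ...) so that the segment
        # embeddings distinguish neighbouring sentences, as in BertSum.
        segment_ids += [i % 2] * len(sentence_tokens)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids, segment_ids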

The BertSum model is composed of a pre-trained BERT model as encoder and a standard Transformer as decoder. The pre-trained encoder is carefully fine-tuned, whereas the decoder is trained from scratch. To deal with this mismatch, it is necessary to employ two separate optimizers with different learning rates and schedules. We found that training is quite sensitive to hyperparameters such as learning rate, batch size and dropout, which requires some tuning for each dataset. We trained the model on a dataset of 500k articles with headlines from the WELT newspaper and were quite impressed by the results; an example is shown below:

(input) Drei Arbeiter sind in der thailändischen Hauptstadt Bangkok vom 69 Stockwerk des höchsten Wolkenkratzers des Landes in den Tod gestürzt. Die Männer befanden sich mit zwei weiteren Kollegen auf einer am 304 Meter hohen Baiyoke-Hochhaus herabgelassenen Arbeitsbühne, um Werbung anzubringen, wie die Polizei am Montag mitteilte. Plötzlich sein ein Stützkabel gerissen, worauf die Plattform in zwei Teile zerbrochen sei. Nur zwei der fünf Männer konnten sich den Angaben zufolge rechtzeitig an den Resten der Arbeitsbühne festklammern. Sie wurden später vom darunter liegenden Stockwerk aus gerettet.
(target) [CLS] Unfälle: Drei Arbeiter stürzten in Bangkok vom 69. Stock in den Tod [SEP]
(prediction) [CLS] Unglücke: Drei Arbeiter stürzen von höchstem Wolkenkratzer in Bangkok in den Tod [SEP]
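As for the two-optimizer training mentioned above, it can be sketched roughly as follows. This is a simplified illustration with made-up hyperparameter values, not the exact schedule from [1] or Headliner's implementation: the pre-trained encoder gets a small peak learning rate and a long warmup so its weights change slowly, while the randomly initialized decoder gets a larger learning rate and a shorter warmup.

import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # Noam-style schedule: linear warmup followed by inverse-sqrt decay.
    def __init__(self, peak_lr, warmup_steps):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        progress = tf.cast(step, tf.float32) / float(self.warmup_steps)
        return self.peak_lr * tf.minimum(progress, tf.math.rsqrt(progress + 1e-9))

# Hypothetical values: the pre-trained encoder is updated cautiously,
# the freshly initialized decoder more aggressively.
encoder_optimizer = tf.keras.optimizers.Adam(WarmupSchedule(peak_lr=2e-5, warmup_steps=20000))
decoder_optimizer = tf.keras.optimizers.Adam(WarmupSchedule(peak_lr=1e-4, warmup_steps=10000))

def apply_split_update(tape, loss, encoder, decoder):
    # Compute gradients once, then update encoder and decoder separately.
    enc_vars, dec_vars = encoder.trainable_variables, decoder.trainable_variables
    enc_grads, dec_grads = tape.gradient(loss, [enc_vars, dec_vars])
    encoder_optimizer.apply_gradients(zip(enc_grads, enc_vars))
    decoder_optimizer.apply_gradients(zip(dec_grads, dec_vars))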

[1] Liu, Y. and Lapata, M., 2019. Text Summarization with Pretrained Encoders. arXiv preprint arXiv:1908.08345.

How to use the library

To get started, you can use the library out-of-the-box to train a summarizer model. Just install Headliner via pip:

pip install headliner

All you need to do is provide the data as a list (or generator) of string tuples for input and target. Then you create a summarizer model and a trainer. After training, the model can be saved to a folder and loaded for inference. In this minimalistic example, the trainer takes care of the data preprocessing and vectorization using a simple word-based tokenizer:

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

data = [('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are great, but I have other plans.', 'I like you.')]

# train summarizer and save model
summarizer = TransformerSummarizer(num_layers=1)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=2)
summarizer.save('/tmp/summarizer')

# load model and do a prediction
summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict('You are the stars, earth and sky for me!')

For further information on how to use the library, have a look at our tutorials on GitHub.

Summary

In this article, we presented our library Headliner, which we used internally in our research to generate news headlines. We showed that it's really easy to use. We also talked about BertSum, a state-of-the-art approach for text summarization, which is implemented in our library as well. Please check out our library and give us feedback.

If you found this article useful, give it some claps so others can find it too, and share it with your friends. Follow us on Medium (Christian Schäfer and Dat Tran) to stay up-to-date with our work. Thanks for reading!

