Generating headlines from news articles using SOTA summarizer based on BERT



At Axel Springer, Europe's largest digital publishing house, we own a lot of news articles from various media outlets such as Welt, Bild, Business Insider and many more. Arguably, the most important part of a news article is its title, and it is not surprising that journalists tend to spend a fair amount of their time coming up with a good one. For this reason, it was an interesting research question for us at Axel Springer AI whether we could create an NLP model that generates quality headlines from Welt news articles (see Figure 1). This could, for example, serve as inspiration for our journalists when creating SEO titles, something they often don't have time for (in fact, we are working together with our colleagues from SPRING on an SEO title generator).


Figure 1: One example from our Welt.de headline generator.

In the process of our research, we created a library called "Headliner" to generate our headlines. Headliner is a sequence modeling library that eases the training and, in particular, the deployment of custom sequence models. In this article we will go through the main features of the library and explain why we decided to create it. And we love open source, so you can find the code on GitHub if you want to try it out.

It’s not a new topic

Generating news headlines is not a new topic. Konstantin Lopyrev already used deep learning in 2015 to generate headlines from articles of six major news agencies, including The New York Times and the Associated Press. Specifically, he used an encoder-decoder neural network architecture (LSTM units and attention, see Figure 2) to solve this particular problem. In general, generating headlines can be seen as a text summarization problem, and a lot of research has been done in this area. A gentle introduction to the topic can be found on Machine Learning Mastery or FloydHub.


Figure 2: Encoder-decoder sequence-to-sequence model.

Why Headliner?

When we started with this project, we did some research on existing libraries. In fact, there are many libraries out there, such as Facebook's fairseq, Google's seq2seq, and OpenNMT. Although those libraries are great, they have a few drawbacks for our use case: fairseq doesn't focus much on production, and Google's seq2seq is not actively maintained. OpenNMT was the closest one to match our requirements as it has a strong focus on production. We, however, wanted to provide a leaner repository that could easily be customized and extended by the user.

Therefore, we built our library with the following goals in mind:

  • Provide an easy-to-use API for both training and deployment
  • Leverage all the new features from TensorFlow 2.x, such as tf.function, tf.keras.layers, etc.
  • Be modular by design and easily connectable with other libraries like Hugging Face's transformers or spaCy
  • Be extensible for different encoder-decoder models
  • Work on large data

Building the library

We approached the problem scientifically by starting with the basics and iteratively increasing complexity. Consequently, we first built a simple baseline repository in TensorFlow 2.0, which had just been released at the time. During this process we learned to appreciate many of TensorFlow's new features that really ease development, specifically:

  • Integration of Keras with model subclassing API
  • Default eager execution that enables developers to debug into the execution of the computation graphs
  • The possibility to write graph operations in natural Python syntax that AutoGraph converts into graph code (see the sketch below)
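To make these points concrete, here is a minimal sketch of a subclassed Keras model with a tf.function-decorated training step. It is illustrative code, not taken from the Headliner repository: removing the decorator runs the same Python eagerly, which is what makes step-by-step debugging possible, while with the decorator AutoGraph traces the function into a graph.

import tensorflow as tf

# A tiny subclassed Keras model (illustrative only, not Headliner code).
class TinyEncoder(tf.keras.Model):
    def __init__(self, vocab_size=1000, units=64):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, units)
        self.gru = tf.keras.layers.GRU(units)
        self.out = tf.keras.layers.Dense(vocab_size)

    def call(self, token_ids):
        x = self.embedding(token_ids)
        x = self.gru(x)
        return self.out(x)

model = TinyEncoder()
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# AutoGraph turns this natural Python into a graph; drop the decorator
# to execute eagerly and step through the computation in a debugger.
@tf.function
def train_step(token_ids, labels):
    with tf.GradientTape() as tape:
        logits = model(token_ids)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss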

The repository consists of separate modules for data preprocessing, vectorization and model training, which makes it easy to test, customize and extend. For example, you can easily integrate your own custom tokenizer into the training pipeline and ship it with the model.


Many machine learning repositories do not pay much attention to the deployment of trained models. For example, it is pretty common to see code for model inference depend on global parameter settings. This is dangerous, since a deployed model is bound to the exact preprocessing logic used during its training, which means that even a slight change in the inference setup can mess up predictions pretty badly. A better strategy is to serialize all preprocessing logic together with the model. This is realized in Headliner by bundling together all modules involved in preprocessing and inference.


Figure 3: Model structure for (de)-serialization.
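The underlying pattern can be sketched as follows. This is only an illustration of the principle, with placeholder names, not Headliner's actual implementation: the preprocessing and vectorization objects are serialized right next to the model weights, and loading restores both together, so inference always uses exactly the preprocessing the model was trained with.

import os
import pickle
import tensorflow as tf

def save_bundle(model: tf.keras.Model, preprocessor, vectorizer, path: str):
    # Serialize the model weights together with its preprocessing objects.
    os.makedirs(path, exist_ok=True)
    model.save_weights(os.path.join(path, 'weights'))
    with open(os.path.join(path, 'preprocessing.pkl'), 'wb') as f:
        pickle.dump({'preprocessor': preprocessor, 'vectorizer': vectorizer}, f)

def load_bundle(model: tf.keras.Model, path: str):
    # Restore the weights and the exact preprocessing used during training.
    model.load_weights(os.path.join(path, 'weights'))
    with open(os.path.join(path, 'preprocessing.pkl'), 'rb') as f:
        preprocessing = pickle.load(f)
    return model, preprocessing['preprocessor'], preprocessing['vectorizer']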

Once the codebase was built, we started to add more complex models such as attention-based recurrent networks and the Transformer. Finally, we implemented a SOTA summarizer based on fine-tuning a pre-trained BERT language model.

Welcome BertSum


Recently, a fine-tuned BERT model achieved state-of-the-art performance for abstractive text summarization across several datasets [1]. The authors made two key adjustments to the BERT model: first, a customized data preprocessing and second, a specific optimization schedule for training. We integrated those adjustments into separate preprocessing modules that can be used out of the box; try out the tutorial here!

To make the text consumable for BERT, it is necessary to split it into sentences enclosed in special tokens. Here is an example of the resulting format:

[CLS] First sentence of the article. [SEP] [CLS] Second sentence of the article. [SEP] [CLS] Third sentence of the article. [SEP] ...

In Headliner this preprocessing step is performed by the BertPreprocessor class, which internally uses a spaCy sentencizer to do the splitting. The article is then mapped to two sequences: first the token index sequence and second the segment index sequence, which distinguishes the individual sentences (see Figure 4).


Figure 4: Architecture of the original BERT model (left) and BertSum from [1] (right). For BertSum multiple sentences are enclosed by [CLS] and [SEP] tokens, and segment embeddings are used to distinguish multiple sentences.
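A rough sketch of what such a preprocessing step does is shown below, assuming spaCy 3.x and the bert-base-uncased tokenizer from Hugging Face's transformers. The function is illustrative rather than Headliner's internal code; it only demonstrates the sentence splitting, the [CLS]/[SEP] wrapping and the alternating segment ids from [1].

import spacy
from transformers import BertTokenizer

nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')  # rule-based sentence splitting
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def to_bert_input(article: str):
    # Split the article into sentences and enclose each in [CLS] ... [SEP].
    sentences = [sent.text for sent in nlp(article).sents]
    tokens, segment_ids = [], []
    for i, sentence in enumerate(sentences):
        sentence_tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
        tokens += sentence_tokens
        # Alternate segment ids (0, 1, 0, 1, ...) so that the segment
        # embeddings distinguish neighbouring sentences, as in BertSum.
        segment_ids += [i % 2] * len(sentence_tokens)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids, segment_ids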

The BertSum model is composed of a pre-trained BERT model as encoder and a standard Transformer as decoder. The pre-trained encoder is carefully fine-tuned, whereas the decoder is trained from scratch. To deal with this mismatch, it is necessary to employ two separate optimizers with different learning rates and schedules. We found that training is quite sensitive to hyperparameters such as learning rate, batch size and dropout, which requires some tuning for each dataset. We trained the model on a dataset of 500k articles with headlines from the WELT newspaper and were quite impressed by the results; an example is shown below:

(input) Drei Arbeiter sind in der thailändischen Hauptstadt Bangkok vom 69 Stockwerk des höchsten Wolkenkratzers des Landes in den Tod gestürzt. Die Männer befanden sich mit zwei weiteren Kollegen auf einer am 304 Meter hohen Baiyoke-Hochhaus herabgelassenen Arbeitsbühne, um Werbung anzubringen, wie die Polizei am Montag mitteilte. Plötzlich sein ein Stützkabel gerissen, worauf die Plattform in zwei Teile zerbrochen sei. Nur zwei der fünf Männer konnten sich den Angaben zufolge rechtzeitig an den Resten der Arbeitsbühne festklammern. Sie wurden später vom darunter liegenden Stockwerk aus gerettet.
(target) [CLS] Unfälle: Drei Arbeiter stürzten in Bangkok vom 69. Stock in den Tod [SEP]
(prediction) [CLS] Unglücke: Drei Arbeiter stürzen von höchstem Wolkenkratzer in Bangkok in den Tod [SEP]
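As for the two-optimizer training mentioned above, it can be sketched roughly as follows. This is a simplified illustration with made-up hyperparameter values, not the exact schedule from [1] or Headliner's implementation: the pre-trained encoder gets a small peak learning rate and a long warmup so its weights change slowly, while the randomly initialized decoder gets a larger learning rate and a shorter warmup.

import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # Noam-style schedule: linear warmup followed by inverse-sqrt decay.
    def __init__(self, peak_lr, warmup_steps):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        progress = tf.cast(step, tf.float32) / float(self.warmup_steps)
        return self.peak_lr * tf.minimum(progress, tf.math.rsqrt(progress + 1e-9))

# Hypothetical values: the pre-trained encoder is updated cautiously,
# the freshly initialized decoder more aggressively.
encoder_optimizer = tf.keras.optimizers.Adam(WarmupSchedule(peak_lr=2e-5, warmup_steps=20000))
decoder_optimizer = tf.keras.optimizers.Adam(WarmupSchedule(peak_lr=1e-4, warmup_steps=10000))

def apply_split_update(tape, loss, encoder, decoder):
    # Compute gradients once, then update encoder and decoder separately.
    enc_vars, dec_vars = encoder.trainable_variables, decoder.trainable_variables
    enc_grads, dec_grads = tape.gradient(loss, [enc_vars, dec_vars])
    encoder_optimizer.apply_gradients(zip(enc_grads, enc_vars))
    decoder_optimizer.apply_gradients(zip(dec_grads, dec_vars))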

[1] Liu, Y. and Lapata, M., 2019. Text Summarization with Pretrained Encoders. arXiv preprint arXiv:1908.08345.

How to use the library

To get started, you can use the library out-of-the-box to train a summarizer model. Just install Headliner via pip:

pip install headliner

All you need to do is provide the data as a list (or generator) of string tuples for input and target. Then you create a summarizer model and a trainer. After training, the model can be saved to a folder and loaded for inference. In this minimalistic example, the trainer takes care of the data preprocessing and vectorization using a simple word-based tokenizer:

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

data = [('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are great, but I have other plans.', 'I like you.')]

# train summarizer and save model
summarizer = TransformerSummarizer(num_layers=1)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=2)
summarizer.save('/tmp/summarizer')

# load model and do a prediction
summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict('You are the stars, earth and sky for me!')

For further information on how to use the library, have a look at our tutorials on GitHub.

Summary

In this article, we presented our library Headliner, which we used internally in our research to generate news headlines. We showed that it's really easy to use. We also talked about BertSum, a state-of-the-art approach for text summarization, which is implemented in our library as well. Please check out our library and give us feedback.

If you found this article useful, give it some claps so others can find it too, and share it with your friends. Follow us on Medium (Christian Schäfer and Dat Tran) to stay up-to-date with our work. Thanks for reading!

