Blender Bot — Part 1: The Data

“I’m good at the art of assimilation. I have watched & listened & learned. I knew nothing, but I studied the ways of men and slowly learnt how to ruin, how to hate, how to debase, how to humiliate. And at the feet of my master, I learnt the highest of human skills, the skill that no other creature owns. I finally learnt how to lie!”

Can you guess who uttered those lines? Let me give you 2 options:

A)A ChatBot B) The Creature from Frankenstein.

I’ll reveal the answer at the end of this article.

Introduction:

The purpose of a good chatbot is to observe, listen, learn and study the ways of men (and women!), and to pick up many different human skills, in order to engage in a good conversation. Within the purview of conversational agency, chatbots serve in 2 different settings.

  1. Goal Oriented Dialog: These are the ones engaged in Online Ticket / Restaurant booking and other customer services. They usually have a fixed set of “intents” and corresponding “responses” (as well as “actions” mapped to the intents, that are taken in the background). They also have a knowledge base (database) at their disposal which they can access by way of API calls.
  2. Open Domain Dialog: These are the ones that can engage in open-ended chit-chat across a wide range of topics. Recent advancements in open-domain chatbots have been vastly due to scaling neural network models — having more parameters and training on huge corpora. E.g., Meena from Google: Meena has 2.6B parameters and is trained on 341GB of text from social media conversations. Compared to OpenAI GPT-2, Meena has 1.7x greater model complexity and was trained on 8.5x more data.
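To make the goal-oriented setting concrete, here is a minimal sketch of an intent-to-response mapping with a background "action" hook. Every name, phrase list and response string below is illustrative, not from any real system:

```python
# Minimal sketch of a goal-oriented dialog agent: a fixed set of intents,
# each mapped to a canned response (and, in a real system, a background action
# such as an API call to a booking service). All names are illustrative.

INTENTS = {
    "book_table": ["book a table", "reserve a table", "restaurant booking"],
    "book_ticket": ["book a ticket", "movie ticket", "train ticket"],
}

RESPONSES = {
    "book_table": "Sure, for how many people and at what time?",
    "book_ticket": "Sure, which ticket would you like to book?",
    None: "Sorry, I can only help with table and ticket bookings.",
}

def detect_intent(utterance: str):
    """Match the utterance against known intent phrases (naive keyword match)."""
    text = utterance.lower()
    for intent, phrases in INTENTS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return None

def respond(utterance: str) -> str:
    intent = detect_intent(utterance)
    # A real agent would trigger the mapped action (DB lookup / API call) here.
    return RESPONSES[intent]
```

Real goal-oriented systems replace the keyword matcher with a trained intent classifier, but the intent → response/action structure is the same.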

But the researchers from FAIR attempt to show that scaling alone is insufficient and there is a lot more to be accounted for to generate a good conversation — for the chatbot to display human-like traits, like:

  1. Personality
  2. Engagingness
  3. Empathy
  4. Domain knowledge/expertise

Enter “Blender Bot”, FAIR’s champion conversational agent which they have open sourced recently.

In this 3-part series about Blender, I will explain, one by one, the data sets used, the evaluation methods, the workings of the transformer architectures used, and the model architecture with its training objectives. In this first part, let us discuss the data sets in detail and also look at an overview of the limitations and failure cases. The paper is a bit system-heavy, so prior understanding of Attention, Transformers, BERT and Language Models in general would help tie all the pieces together seamlessly (not required for Part 1).

Data Sets:

Different data sets and (proxy) tasks are used during the pre-training and fine-tuning stages of the model.

Pre-training:

BERT is pre-trained on the Toronto Book Corpus and Wikipedia. Such training will not help in this case, because we are dealing with dialog generation and not just sentence associations. Therefore, public-domain data from Reddit and its subreddits is used as the source of truth, from which around 1.5B training examples are generated. The goal here is to generate a comment, conditioned on the full thread leading up to it. Cleaning the Reddit data is a challenging process. A comment is filtered out in any of the following cases:

  1. if the author is a known bot
  2. if it is from a non-English subreddit
  3. if it is longer than 2048 characters or fewer than 5 characters
  4. if it contains a URL
  5. if it starts with a non-ASCII character
  6. if it is at a depth > 7 in the thread
  7. if it is a removed/deleted comment
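The heuristics above can be sketched as a single filter function. The bot list, subreddit list and field names below are illustrative assumptions, not the actual filtering code from the paper:

```python
import re

MAX_DEPTH = 7
KNOWN_BOTS = {"AutoModerator"}               # illustrative stand-in list
NON_ENGLISH_SUBREDDITS = {"de", "france"}    # illustrative stand-in list

def keep_comment(comment: dict) -> bool:
    """Apply the heuristic filters listed above; return True to keep the comment."""
    body = comment["body"]
    if comment["author"] in KNOWN_BOTS:                  # rule 1: known bot
        return False
    if comment["subreddit"] in NON_ENGLISH_SUBREDDITS:   # rule 2: non-English sub
        return False
    if not (5 <= len(body) <= 2048):                     # rule 3: length bounds
        return False
    if re.search(r"https?://", body):                    # rule 4: contains a URL
        return False
    if body and not body[0].isascii():                   # rule 5: non-ASCII start
        return False
    if comment["depth"] > MAX_DEPTH:                     # rule 6: too deep in thread
        return False
    if body in ("[removed]", "[deleted]"):               # rule 7: removed/deleted
        return False
    return True
```

Each rule is cheap to evaluate, which matters when the filter has to run over billions of raw comments.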

Despite this cleaning, the data still suffers from toxicity and noise, and from the fact that the threads are not 2-way conversations but group discussions.

Fine-Tuning:

Fine-tuning of transformer models is usually done on data sets that are more relevant and closer to the downstream tasks. Along the same lines, fine-tuning for Blender is done on crowdsourced, cleaner, smaller, 2-way conversational datasets. Let's look at each of them in detail.

  1. ConvAI2: This is the dataset from the ConvAI2 Challenge in NeurIPS-2018. It is based on the Persona Chat dataset. Here, each of the 2 speakers is given a role to play based on sentences describing their persona, which were also separately crowdsourced (both speakers can see their own persona description, but cannot see their partner’s persona). The task thus involves getting to know the other speaker and engaging them in friendly conversation, both asking and answering questions. The use of a “persona” gives improved consistency for the bot.
  2. Empathetic Dialogues (ED): This dataset was benchmarked in “Towards empathetic open-domain conversation models: A new benchmark and dataset”, Rashkin et al., 2019. Here, in each dialogue, one speaker describes a personal situation and the other plays a “listener” role, displaying empathy during the discussion. Fine-tuning models on this dataset helps them display more empathy in human evaluations.
  3. Wizard of Wikipedia (WoW): The task here involves discussing a given topic in depth, where the goal is to both engage the partner as well as display expert knowledge. The two participants, however, are not quite symmetric: one will play the role of a knowledgeable expert (which we refer to as the wizard) while the other is a curious learner (the apprentice). One of them will choose a topic. The wizard has access to an information retrieval system that shows them paragraphs from Wikipedia possibly relevant to the conversation, which are unobserved by the apprentice. Before each conversation turn the wizard can read these paragraphs and then potentially base their next reply on that observed knowledge. The goal of collecting this dataset is to then replace the human wizard with a learned agent that will speak to a human apprentice instead.
  4. Blended Skill Talk (BST): A small crowdsourced dataset of about 5k conversations, where the participants were asked to display all 3 qualities — personality, empathy and expertise — during a conversation, whenever appropriate. As inspiration, one of the two workers is shown responses from models that have each been trained towards a specific skill; that worker is free to use, modify or ignore those responses. Thus, each conversation involves an “unguided” speaker and a “guided” speaker, with the unguided speaker talking first. Whenever it is the guided speaker’s turn to respond, they are shown three suggested responses, one each from three single-task poly-encoder models (more on this in the next part) trained on the ConvAI2, ED and WoW datasets. All the conversations in BST are fully annotated with the skill represented. Once the dataset is collected, an oracle model is trained by combining the models trained towards the individual capabilities, in 2 variants: a) the poly-encoder is first pre-trained on the Reddit dataset and then fine-tuned on the individual datasets; b) it is pre-trained on the Reddit dataset and then fine-tuned on the BST dataset itself. Given below is an example conversation between 2 crowd-workers from the BST dataset.
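To make the fine-tuning setup concrete, here is a minimal sketch of how a persona-conditioned dialogue might be flattened into a (context, target) pair for training. The `your persona:` prefix and the exact layout are assumptions for illustration, not the precise ParlAI serialization:

```python
# Sketch: flatten persona lines plus dialogue history into one context string;
# the final utterance becomes the generation target. Format is an assumption.

def build_example(persona, history):
    """Return a (context, target) training pair from persona + dialogue turns."""
    context_lines = [f"your persona: {p}" for p in persona] + history[:-1]
    context = "\n".join(context_lines)
    target = history[-1]          # model learns to generate the last turn
    return context, target

persona = ["i have two dogs.", "i work as a teacher."]
history = ["hi! do you have any pets?", "yes, i have two dogs. do you?"]
context, target = build_example(persona, history)
```

The same flattening idea applies to WoW (where retrieved Wikipedia paragraphs join the context) and ED (where the situation description does).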

Sample conversation from the BlendedSkillTalk dataset, annotated with four conversation mode types: PB: personal background; K: knowledge; S: personal situation; E: empathy. The guided (G) and unguided (U) workers are given personas and a topic. The conversation has been seeded with two utterances sampled from a WoW conversation. Suggestions selected by the guided worker are shown shaded in grey.

Safety:

BST and the other datasets on which fine-tuning is done are all crowdsourced, and the crowd-workers were given explicit instructions not to use toxic or biased language, so they are generally safer to train on. But remember that pre-training is done on Reddit data — which is prone to contain negative training samples in abundance. A classifier trained to identify toxic language is used at inference time to help mitigate this problem.
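Such an inference-time safety gate can be sketched as follows. The word-list "classifier" below is a trivial stand-in for the trained toxicity classifier, and the fallback line is invented for illustration:

```python
# Sketch of a safety gate at inference time: a toxicity classifier screens each
# candidate reply, and unsafe candidates are replaced with a canned deflection.
# The word-list check stands in for a real learned classifier.

UNSAFE_WORDS = {"idiot", "stupid"}  # stand-in vocabulary, not a real model
FALLBACK = "Hey, do you want to talk about something else?"

def is_toxic(text: str) -> bool:
    """Stub classifier: flag a reply if it contains a listed unsafe word."""
    return any(word in text.lower().split() for word in UNSAFE_WORDS)

def safe_reply(candidate: str) -> str:
    """Return the generated candidate only if the classifier deems it safe."""
    return FALLBACK if is_toxic(candidate) else candidate
```

The key design point is that the filter sits outside the generator, so the (Reddit-pre-trained) model itself is unchanged.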

Model Limitations & Failure Cases:

  1. Vocabulary Usage: Generative transformer models that employ beam search decoding show an inclination towards generating common words too frequently and rarer words too infrequently. For example, the most commonly occurring 3-grams in the dataset, like “do you like”, “have any hobbies” and “lot of fun”, are repeated over and over again.
  2. Nontrivial repetition: The models often repeat what is said to them. For instance, they’ll say that they had a pet dog if a conversation partner mentions a pet dog, or that they like the same bands as the person they’re speaking with.
  3. Contradiction and forgetfulness: Blender models contradict themselves, albeit to a lesser degree in the larger models. They also fail to make the logical link that they shouldn’t ask questions they’ve asked before (to avoid the appearance of “forgetting”).
  4. Knowledge and factual correctness: It’s relatively easy to manoeuvre Blender models into making factual errors, particularly when exploring a topic deeply.
  5. Conversation length and memory: Blender conversations would likely become dull and repetitive over the course of several days or weeks of conversation, especially considering that Blender can't remember earlier conversations.
  6. Deeper understanding: The Blender models lack the ability to learn concepts through further conversation, and they have no way of grounding to entities, actions, and experiences in the real world.

In the image below, you can find examples of failure cases:

Example of Failure Cases.

Issues Identified:

  1. Example 1: non-trivial repetition
  2. Example 2: forgetfulness
  3. Example 3: contradiction; Georgia is not in the Midwest
  4. Example 4: hallucinating knowledge, wrongly connecting games with the makers

In the next part, we will explore the transformer architecture used in Blender, known as the Poly-Encoder, and see how it is superior to other variants like the Bi- and Cross-Encoder for the same task of Multi-Sentence Scoring.

And finally, the answer to the question we saw at the beginning of the article: B) The Creature from Frankenstein! If you guessed A) A ChatBot, well, you are close, for it won't be long before we can chat about the vileness of humans with a well-rounded conversational agent!

References:

  1. About Blender: https://www.kdnuggets.com/2020/05/facebook-open-sources-blender-largest-open-domain-chatbot.html
  2. Blender Bot research: https://arxiv.org/abs/2004.13637
  3. Blender Bot recipe: https://parl.ai/projects/recipes/
  4. Blended Skill Talk: https://arxiv.org/abs/2004.08449
  5. Wizard of Wikipedia: https://arxiv.org/abs/1811.01241
  6. About Meena: https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
