I had a goal in mind: to create an AI service that is helpful to people and super simple at the same time. After fiddling around with GPT-2, I realized that it has immense creative potential that could prove useful in creative text generation.
That is why I created NameKrea, an AI that generates domain names. The domain name generator business has been around online for a long time, but it hasn't seen this level of content quality. (If you want to learn more about the project's ideation phase and productivity tips, here is the first part of the article.)
Let me walk you through how I built an AI service that generates domain names and business ideas!
Introduction
After scraping around 100,000 websites from the Majestic Million top 1 million domain list, I fine-tuned the 355M parameter GPT-2 model. The results are weirdly accurate and creative at the same time; I will share some of them below.
GPT-2 is able to understand context if there is enough training data. To train it we need lots of data, and this can easily be collected by scraping the meta descriptions of websites. Luckily there is no shortage of websites on the internet :)
Fine-tuning GPT-2 works by reading the training data line by line from a CSV file. Before we start scraping, we need to define a data structure that the algorithm can understand. For that I take a rather simplistic approach: feed GPT-2 one line of text per domain, consisting of the meta description followed by the domain name. A single entry in our training data will look like the following:
Create an account or log into Facebook. Connect with friends, family and other people you know. Share photos and videos, send messages and get updates. = @ = facebook.com
As you can see, we first feed in the meta description of the given domain and then use a delimiter that doesn't exist in normal text. You can choose anything that does not normally appear in natural text. I have chosen this delimiter: = @ =
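Purely as an illustration, a helper that assembles such a training line could look like the sketch below (the function name make_training_line is mine, not something from the repo):

# Hypothetical helper: join a meta description and its domain into one
# training line using the " = @ = " delimiter described above.
def make_training_line(description: str, domain: str) -> str:
    return f"{description} = @ = {domain}"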
Step 1: Scraping
As you might assume, it would take a significant amount of time to manually copy and paste the meta descriptions of all these domains. We need to come up with a scraping algorithm that can generate clean training data for us.
Cleanliness of the data is important, since most machine learning models rely heavily on data quality. Your machine learning model can only be as good as your training data. Therefore:
When training a machine learning model, always remember: Trash in, trash out!
So what do I mean by clean data? First of all, GPT-2 is trained mostly on English data scraped from all over the internet, so we need to make sure that we collect meta description data in English. Secondly, many websites have meta descriptions that use emojis and other special characters. We don't want any of those in our final collected data.
If we design a scraping algorithm, it should be able to filter and extract data with the following logic (a minimal filtering sketch follows the list):
- English only
- No emojis, smileys or the like. Just bare English text.
- Only collect data from a range of TLDs (like .com, .net, .org...)
- Be fast! We need multiprocessing to fetch data from multiple domains at the same time, otherwise it will take ages to scrape the data.
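Here is a minimal sketch of such a filter. It assumes the langdetect package for language detection; the function name is_clean_description and the exact TLD whitelist are my illustrative choices, not necessarily what scraper.py does:

from langdetect import detect  # pip install langdetect

ALLOWED_TLDS = (".com", ".net", ".org")

def is_clean_description(domain, description):
    # Only keep domains from the TLD whitelist
    if not domain.endswith(ALLOWED_TLDS):
        return False
    # Reject emojis and other non-ASCII characters: keep bare English text only
    if not description.isascii():
        return False
    # Keep descriptions that are detected as English
    try:
        return detect(description) == "en"
    except Exception:  # langdetect raises on empty or undecodable text
        return False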
Now that we have decided on our main requirements, let's move on to building our scraper!
Python has a lot of great packages for scraping, such as BeautifulSoup. It has many features that make it possible to start scraping websites in an instant. We will use this library to fetch data from the domains and then write it into a CSV file.
For some reason GitHub Gist embeds are not working properly here, so have a look at scraper.py in the source code at the GitHub repo of namekrea.
First of all, scraper.py reads domain names from the Majestic top 1 million domain list and then starts the scraping process.
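Since the Gist embed is missing, here is only a rough sketch of the idea, assuming requests and BeautifulSoup for fetching the meta description and a thread pool for concurrency (the actual scraper.py splits the work across 5 threads and writes 5 separate files). It reuses the is_clean_description filter sketched above, and the "Domain" column name of the Majestic CSV is an assumption:

import csv
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def load_domains(path="majestic_million.csv", limit=100_000):
    # Assumes the Majestic Million CSV exposes a "Domain" column
    with open(path, newline="") as f:
        return [row["Domain"] for row in csv.DictReader(f)][:limit]

def fetch_meta_description(domain):
    # Fetch the homepage and pull out the <meta name="description"> tag, if any
    try:
        resp = requests.get(f"http://{domain}", timeout=5)
        soup = BeautifulSoup(resp.text, "html.parser")
        tag = soup.find("meta", attrs={"name": "description"})
        return tag["content"].strip() if tag and tag.get("content") else ""
    except requests.RequestException:
        return ""

def scrape(domains, out_path="scraped.txt"):
    # Fetch many domains in parallel and write one training line per kept domain
    with ThreadPoolExecutor(max_workers=16) as pool, open(out_path, "w") as out:
        for domain, desc in zip(domains, pool.map(fetch_meta_description, domains)):
            if desc and is_clean_description(domain, desc):  # filter from the sketch above
                out.write(f"{desc} = @ = {domain}\n")

if __name__ == "__main__":
    scrape(load_domains())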
Note: After running scraper.py you will end up with 5 different files from 5 different threads. You need to combine those files into one and turn them into a CSV file, otherwise fine-tuning will not be possible.
The .txt output from scraper.py will look like this:
Create an account or log into Facebook. Connect with friends, family and other people you know. Share photos and videos, send messages and get updates. = @ = facebook.com
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for. = @ = google.com
Enjoy the videos and music you love, upload original content, and share it all with friends, family, and the world on YouTube. = @ = youtube.com
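A minimal sketch of the merge step could look like this (the scraped_*.txt file pattern and the single-column CSV layout are my assumptions; the real preprocessing lives in the repo):

import csv
import glob

# Merge the per-thread .txt outputs into one single-column CSV for fine-tuning
with open("training_data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for path in sorted(glob.glob("scraped_*.txt")):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    writer.writerow([line])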
Once you are done scraping the data, we can continue to the next step.
Step 2: Fine Tune it!
GPT-2 is huge! The medium-sized pre-trained model has 355 million parameters! Fine-tuning this kind of architecture is definitely not feasible on an ordinary laptop CPU. On my setup I used 2x 1070 Ti GPUs, and it took around 2 hours to reach good-quality output.
Let’s have a look at the general architecture of the project to understand how to train this model:
So first of all, we scrape the data and combine the text files into a CSV to make them usable by the model_trainer.py script.
When training is complete, we will load the checkpoint in text_generator.py to generate domain names at random.
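For reference, fine-tuning with the gpt-2-simple package looks roughly like the sketch below; the step count and file names are illustrative assumptions, and model_trainer.py in the repo holds the actual settings:

import gpt_2_simple as gpt2

model_name = "355M"                        # the medium-sized pre-trained model
gpt2.download_gpt2(model_name=model_name)  # one-time download into models/355M

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="training_data.csv",  # the merged CSV from the scraping step
              model_name=model_name,
              steps=5000,                   # illustrative; adjust for your hardware
              run_name="run1",
              save_every=500,
              sample_every=500)

gpt-2-simple treats a .csv dataset as one sample per row and wraps each row in <|startoftext|> / <|endoftext|> tokens, which is why generation in the next step truncates at <|endoftext|>.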
Step 3: AI Generated Domain Names
Randomly generated domain names are cool, but they are not very useful if we cannot give prompts to GPT-2. Luckily this is possible with the prefix parameter:
import gpt_2_simple as gpt2

# model_name is defined earlier in text_generator.py (e.g. '355M')
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)  # loads the fine-tuned checkpoint from checkpoint/run1
gpt2.generate(sess, model_name=model_name,
              # run_name=run_name, checkpoint_dir=checkpoint_dir,
              temperature=0.8, include_prefix=True, prefix='The best e-cigarette',
              truncate='<|endoftext|>', nsamples=10, batch_size=2, length=128)
I can’t stop laughing when I see the result:
The best e-cigarette with a nicotine level of over 99%. E-cigarette is more than just a vape. It is a COMMUNICATION APP. Use it as a portable vaporizer. Or put it in your pocket and control your vape with a smartphone. Vaporsca is the best e-cigarette comparison site.
A vape and also a COMMUNICATION APP?? I can certainly say that this thing is very creative :D