Neural network for generating bread recipes

栏目: IT技术 · 发布时间: 4年前

内容简介:In 2017, a friend gave me some sourdough starter to make bread with, and ever since then, my life has changed. It sounds cheesy, but I discovered a hobby that has led me to buy almost 200 pounds of flour at a time (seriously), develop a biweekly pizza baki

The bread was actually tasty!

In 2017, a friend gave me some sourdough starter to make bread with, and ever since then, my life has changed. It sounds cheesy, but I discovered a hobby that has led me to buy almost 200 pounds of flour at a time (seriously), develop a biweekly pizza baking habit, and dream of what bread I’m going to make in the coming days!

Neural network for generating bread recipes

On the left, a loaf of sourdough bread I made with a recipe generated from my neural network model. On the right, the code to analyze and predict bread recipes. (Image by author)

Because I spend a lot of time baking sourdough and experimenting with new formulas, I wanted to see if I could create an artificial intelligence-powered recipe generator that would predict something for me to make! One of my go-to websites for technique, tips and tricks has been the helpful bread baking forum, The Fresh Loaf , where people ask questions and post recipes. My idea was to scrape this website and get data to train a neural network to generate new bread recipes — and that’s what I did. At the end of this project, I was able to achieve my goal: to bake a machine learning-inspired loaf of bread, with ingredients predicted with a neural network.

Since there are multiple components to this project, I am breaking them down in a few posts. The first part is here on Medium, and the rest is on my blog, https://pratima.io. All the code I used for the project is in this repo .

Neural network for generating bread recipes

My walnut sourdough loaf, with the accompanying picture of the insides (called the crumb shot) showing purple streaks due to the tannins in walnuts. Chemistry! (Image by author)

Outline of the project

I followed the following steps during this project to obtain a recipe-generating model:

  • First I scraped the Fresh Loaf website to get a list of recipes, cleaning and visualizing it to understand the trends in the dataset (Part I — what you’re reading right now)
  • Then I explored the data using NLP tools to gain further insight into what people on the blog are saying in their posts ( Part II )
  • Lastly, I trained a neural network to learn a word-based language model and generated sentences from it to get new recipes ( Part III )

In this post, I’ll be describing the data collection and data exploration parts of the project.

Scraping the website to get data

I used the urllib library in python to query the website for webpages and Beautiful Soup to parse the HTML code. Inspecting the source code for blog posts on The Fresh Loaf, I realized that the ones containing recipes had the class node-type-blog , while other posts had other classes like node-type-forum ; so I made sure to only grab pages with the former class. Then I had to identify where the body of the blog containing the text was. The tags had a lot of nesting, and to me, were a little bit of a mess, since I don't look at HTML code very often. The <div> elements of this class contained both the blog text as well as the comments and ads, but I only wanted to isolate the recipe. So I decided to use the prettify() function in Beautiful Soup and examine the resulting string to see where the body of the text was.

import urllib.request as urlib
from bs4 import BeautifulSoup
def get_recipe(page_idx):
try:
page_url = f’http://www.thefreshloaf.com/node/{page_idx}'
page = urlib.urlopen(page_url)
soup = BeautifulSoup(page, ‘html.parser’)
# only process pages that are blog posts, aka contain recipes
if ‘node-type-blog’ in soup.body[‘class’]:
print(f’blog at index {page_idx}’)
soup = soup.prettify()
soup = soup[soup.find(‘title’):]
soup = soup[soup.find(‘content=”’)+len(‘content=”’):]
end = soup.find(‘“‘)
return preprocess(soup[:end])
except Exception:
print(f’Page: http://www.thefreshloaf.com/node/{page_idx} not found!’)

The tags in this string made it quite easy to find the recipe body (contained in the content section) and after isolating that string, I used a quick preprocessing function to get rid of any HTML remnants that ended up in the recipe’s text.

def preprocess(text):
  text = text.replace(u’<br/>’, ‘’)
  text = text.replace(‘(<a).*(>).*(</a>)’, ‘’)
  text = text.replace(‘(&amp)’, ‘’)
  text = text.replace(‘(&gt)’, ‘’)
  text = text.replace(‘(&lt)’, ‘’)
  text = text.replace(u’\xa0', ‘ ‘)
  return text

I scraped ~10000 posts to end up with 1257 recipes that I collected into a text document, each separated by a new line. I could keep scraping for more but the website stopped responding after querying it for a few hours, so I decided to stop and understand the data from here.

Data cleaning and exploration

To begin, I loaded the text file into a Jupyter notebook and calculated the length and unique words in each recipe. Upon inspecting the data, I found that a lot of authors referred to other url’s where they obtained inspiration for their recipe and I removed these “words” from the text. I also tokenized the recipe by removing stop words using the NLTK corpus . However, a lot of authors describe their own thoughts and procedures so removing the stop words “i” and “me” would lead to grammatical issues while training a language model; I retained these stop words in the text.

def tokenize(text):
  punctuation_map = str.maketrans(‘’, ‘’, string.punctuation)
  stopwords_list = stopwords.words(‘english’)
  stopwords_list.remove(‘i’)
  stopwords_list.remove(‘me’)
  stopwords_list.append(‘com’)
  stopwords_set = set(stopwords_list)
  text = text.split()
  # remove website links in text
  text = [word for word in text if not (‘http’ in word or ‘www’ in word)]
  text = [word.translate(punctuation_map).lower() for word in text]
  tokenized_words = [word for word in text if word not in stopwords_set]
  return tokenized_words

Neural network for generating bread recipes

The recipes have an average length of 294 words with a standard deviation of 618. That’s quite a variation in recipe lengths! They also contain 166 unique words on average, again with a high standard deviation of 174 words. This high standard deviation indicates a large tail, and indeed, upon examining the data, I found a few recipes with over 1200 words that are skewing these statistics. One post even has >20000 words!

Neural network for generating bread recipes

Code to show histogram of recipe lengths. (Image by author)

To get a visual representation of what the texts of these recipes contain, I created a word cloud based on the words in the recipes. This package creates an image of the top terms after removing stop words. As suspected, the most common words are ‘bread’, ‘dough’, ‘loaf’, ‘time’ and ‘water’. Some of the other words that pop out to me are ‘stretch’, ‘fold’ and ‘starter’. From my experience baking bread, time is an important ingredient, and it checks out that it’s quite a frequent word. I’m also happy to see stretch and fold there, as this is a common technique to develop strength in the dough and lots of people seem to use it! It’s surprising that starter shows up but the word ‘yeast’ does not, as bread can be leavened using both a wild-yeast starter as well as standard commercial baking yeast. It seems as though the users of The Fresh Loaf, like me, prefer using sourdough starters to make their bread rise and coax out fun flavor profiles due to its wild nature.

Neural network for generating bread recipes
Word cloud generated from the body of the recipes. (Image by author)

More quantitatively, I made a long list of all the words and counted the number of times each one occurred to plot the 20 most common words occurring in the recipes. As with the word cloud above, we see ‘dough’, ‘flour’, ‘water’ and units of time in this list of frequently used terms.

Neural network for generating bread recipes

Histogram showing the frequency of the top 20 words that occur in the recipes. (Image by author)

Phew — that’s a lot of stuff! Head on to my website for parts II and III to learn more about the dataset and how I trained a neural network that told me how to make bread :-)


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

演算法圖鑑

演算法圖鑑

石田保輝、宮崎修一 / 陳彩華 / 臉譜 / 2017-12 / TWD450

★日本超人氣演算法學習書 ★逾50萬次下載量,「Apple年度最佳APP」書籍化! ★隨書附贈獨家贈品「圖形搜尋和排序圖解記憶表」 ★★ 讀再多文字解說都看不懂?沒關係,全部畫給你看,一次弄懂演算法到底是什麼!★★ ●直觀理解,從基礎開始學習,一用就上手的演算法專書! ●全圖像化step by step,完整拆解制霸AI時代的演算法精髓! ●詳解演算法的奧妙、執......一起来看看 《演算法圖鑑》 这本书的介绍吧!

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码

正则表达式在线测试
正则表达式在线测试

正则表达式在线测试

HSV CMYK 转换工具
HSV CMYK 转换工具

HSV CMYK互换工具