Neural network for generating bread recipes

The bread was actually tasty!

In 2017, a friend gave me some sourdough starter to make bread with, and ever since then, my life has changed. It sounds cheesy, but I discovered a hobby that has led me to buy almost 200 pounds of flour at a time (seriously), develop a biweekly pizza baking habit, and dream of what bread I’m going to make in the coming days!

On the left, a loaf of sourdough bread I made with a recipe generated from my neural network model. On the right, the code to analyze and predict bread recipes. (Image by author)

Because I spend a lot of time baking sourdough and experimenting with new formulas, I wanted to see if I could create an artificial intelligence-powered recipe generator that would predict something for me to make! One of my go-to websites for technique, tips, and tricks has been the helpful bread baking forum, The Fresh Loaf, where people ask questions and post recipes. My idea was to scrape this website and get data to train a neural network to generate new bread recipes, and that's what I did. At the end of this project, I was able to achieve my goal: to bake a machine learning-inspired loaf of bread, with ingredients predicted by a neural network.

Since there are multiple components to this project, I am breaking it down into a few posts. The first part is here on Medium, and the rest is on my blog, https://pratima.io. All the code I used for the project is in this repo.

My walnut sourdough loaf, with the accompanying picture of the insides (called the crumb shot) showing purple streaks due to the tannins in walnuts. Chemistry! (Image by author)

Outline of the project

I followed these steps to obtain a recipe-generating model:

  • First, I scraped The Fresh Loaf website to get a list of recipes, cleaning and visualizing it to understand the trends in the dataset (Part I — what you’re reading right now)
  • Then I explored the data using NLP tools to gain further insight into what people on the blog are saying in their posts (Part II)
  • Lastly, I trained a neural network to learn a word-based language model and generated sentences from it to get new recipes (Part III)

In this post, I’ll be describing the data collection and data exploration parts of the project.

Scraping the website to get data

I used the urllib library in Python to query the website for webpages and Beautiful Soup to parse the HTML. Inspecting the source code of blog posts on The Fresh Loaf, I realized that the ones containing recipes had the class node-type-blog, while other posts had classes like node-type-forum, so I made sure to only grab pages with the former class. Then I had to identify where the body of the blog containing the text was. The tags had a lot of nesting and, to me, were a bit of a mess, since I don't look at HTML very often. The <div> elements of this class contained the blog text as well as comments and ads, but I only wanted to isolate the recipe. So I decided to use the prettify() function in Beautiful Soup and examine the resulting string to see where the body of the text was.

import urllib.request as urlib
from bs4 import BeautifulSoup

def get_recipe(page_idx):
    try:
        page_url = f'http://www.thefreshloaf.com/node/{page_idx}'
        page = urlib.urlopen(page_url)
        soup = BeautifulSoup(page, 'html.parser')
        # only process pages that are blog posts, aka contain recipes
        if 'node-type-blog' in soup.body['class']:
            print(f'blog at index {page_idx}')
            soup = soup.prettify()
            soup = soup[soup.find('title'):]
            soup = soup[soup.find('content="') + len('content="'):]
            end = soup.find('"')
            return preprocess(soup[:end])
    except Exception:
        print(f'Page: http://www.thefreshloaf.com/node/{page_idx} not found!')

The tags in this string made it quite easy to find the recipe body (contained in the content section) and after isolating that string, I used a quick preprocessing function to get rid of any HTML remnants that ended up in the recipe’s text.

import re

def preprocess(text):
    # strip HTML remnants left over in the extracted recipe body
    text = text.replace('<br/>', '')
    text = re.sub(r'(<a).*(>).*(</a>)', '', text)  # drop anchor tags
    text = re.sub(r'&amp;?', '', text)
    text = re.sub(r'&gt;?', '', text)
    text = re.sub(r'&lt;?', '', text)
    text = text.replace('\xa0', ' ')
    return text

I scraped ~10,000 posts and ended up with 1,257 recipes, which I collected into a text document, one recipe per line. I could have kept scraping for more, but the website stopped responding after a few hours of queries, so I decided to stop and understand the data I had.
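The post doesn't show the collection loop itself; here is a minimal sketch, with scrape_range and save_recipes as hypothetical helper names. The fetch function is passed in as a parameter so the sketch composes with the get_recipe function defined earlier:

```python
def scrape_range(fetch, start=1, stop=10000):
    # try every node index in the range; fetch returns None for
    # pages that are missing or are not blog posts
    recipes = []
    for idx in range(start, stop):
        recipe = fetch(idx)
        if recipe:
            recipes.append(recipe)
    return recipes

def save_recipes(recipes, path='recipes.txt'):
    # one recipe per line, as described in the text
    with open(path, 'w', encoding='utf-8') as f:
        for recipe in recipes:
            f.write(recipe.replace('\n', ' ').strip() + '\n')
```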

Data cleaning and exploration

To begin, I loaded the text file into a Jupyter notebook and calculated the length and number of unique words in each recipe. Upon inspecting the data, I found that a lot of authors linked to URLs where they obtained inspiration for their recipe, so I removed these “words” from the text. I also tokenized the recipes, removing stop words using the NLTK corpus. However, many authors describe their own thoughts and procedures, so removing the stop words “i” and “me” would cause grammatical issues when training a language model; I retained those two stop words in the text.

import string
from nltk.corpus import stopwords

def tokenize(text):
    punctuation_map = str.maketrans('', '', string.punctuation)
    stopwords_list = stopwords.words('english')
    stopwords_list.remove('i')
    stopwords_list.remove('me')
    stopwords_list.append('com')
    stopwords_set = set(stopwords_list)
    text = text.split()
    # remove website links in text
    text = [word for word in text if not ('http' in word or 'www' in word)]
    text = [word.translate(punctuation_map).lower() for word in text]
    tokenized_words = [word for word in text if word not in stopwords_set]
    return tokenized_words

The recipes have an average length of 294 words with a standard deviation of 618. That’s quite a variation in recipe lengths! They also contain 166 unique words on average, again with a high standard deviation of 174 words. This high standard deviation indicates a long tail, and indeed, upon examining the data, I found a few recipes with over 1,200 words that skew these statistics. One post even has more than 20,000 words!

Code to show histogram of recipe lengths. (Image by author)
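The plotting code above survives only as an image; a minimal reconstruction, assuming the scraped recipes are already loaded as a list of strings, might look like this:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

def recipe_stats(recipes):
    # per-recipe word counts and unique-word counts
    lengths = np.array([len(r.split()) for r in recipes])
    uniques = np.array([len(set(r.split())) for r in recipes])
    return lengths, uniques

def plot_length_histogram(lengths, path='recipe_lengths.png'):
    # histogram of recipe lengths, as in the figure above
    plt.figure()
    plt.hist(lengths, bins=50)
    plt.xlabel('Recipe length (words)')
    plt.ylabel('Number of recipes')
    plt.savefig(path)
```

np.mean(lengths) and np.std(lengths) on the full dataset give the averages and standard deviations quoted above.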

To get a visual representation of what the texts of these recipes contain, I created a word cloud based on the words in the recipes. This package creates an image of the top terms after removing stop words. As suspected, the most common words are ‘bread’, ‘dough’, ‘loaf’, ‘time’ and ‘water’. Some of the other words that pop out to me are ‘stretch’, ‘fold’ and ‘starter’. From my experience baking bread, time is an important ingredient, and it checks out that it’s quite a frequent word. I’m also happy to see stretch and fold there, as this is a common technique to develop strength in the dough, and lots of people seem to use it! It’s surprising that starter shows up but the word ‘yeast’ does not, as bread can be leavened with either a wild-yeast starter or standard commercial baking yeast. It seems as though the users of The Fresh Loaf, like me, prefer using sourdough starters, with their wild microbes, to make their bread rise and coax out fun flavor profiles.

Word cloud generated from the body of the recipes. (Image by author)

More quantitatively, I made a long list of all the words and counted the number of times each one occurred to plot the 20 most common words occurring in the recipes. As with the word cloud above, we see ‘dough’, ‘flour’, ‘water’ and units of time in this list of frequently used terms.

Histogram showing the frequency of the top 20 words that occur in the recipes. (Image by author)
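The counting itself can be sketched with collections.Counter, assuming the recipes have already been run through the tokenize function above (top_words is a hypothetical helper name):

```python
from collections import Counter

def top_words(tokenized_recipes, n=20):
    # flatten the tokenized recipes into one stream of words and count them
    counts = Counter(word for recipe in tokenized_recipes for word in recipe)
    return counts.most_common(n)
```

Called on the full dataset, this returns (word, count) pairs ready to plot as a bar chart like the one above.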

Phew — that’s a lot of stuff! Head over to my website for Parts II and III to learn more about the dataset and how I trained a neural network that told me how to make bread :-)

