Summary: Python word clouds (1): a simple English word cloud
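One of Python's cooler tricks is how easily it can produce a word cloud.
The open-source code for this project is on GitHub:
https://github.com/amueller/word_cloud
Note: before running the examples, delete the wordcloud folder inside the repository (presumably so that import wordcloud resolves to the installed package rather than the source directory).
Part of what a word cloud does is NLP (processing the text) and part is image processing.
The walkthrough below uses one of the examples from the word_cloud repository on GitHub: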
from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

d = path.dirname(__file__)

# Read the whole text.
text = open(path.join(d, 'alice.txt')).read()

# read the mask image
# taken from
# http://www.stencilry.org/stencils/movies/alice%20in%20wonderland/255fk.jpg
alice_mask = np.array(Image.open(path.join(d, "alice_mask.png")))

stopwords = set(STOPWORDS)
stopwords.add("said")

wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=stopwords)
# generate word cloud
wc.generate(text)

# store to file
wc.to_file(path.join(d, "alice.png"))

# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()
Original image:
Result:
(Picture of Alice and the rabbit)
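In this script:
text: the whole document, read into a string
alice_mask: the picture, loaded as an array
stopwords: the set of words to exclude from the cloud
WordCloud: sets the properties of the word cloud
generate: builds the word cloud from the text
to_file: saves the image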
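To make the mask step more concrete, here is a minimal sketch of how a mask array can be turned into a map of blocked pixels. It follows the rule that pure-white pixels are masked out and every other pixel is free to draw on. The helper name to_boolean_mask and the tiny arrays are made up for illustration; a real mask would come from Image.open as in the script above.

```python
import numpy as np


def to_boolean_mask(mask):
    """Return True where a pixel is pure white, i.e. where no word may be drawn."""
    if mask.ndim == 2:
        # Greyscale image: white is the value 255.
        return mask == 255
    if mask.ndim == 3:
        # Colour image: white means the first three channels are all 255.
        return np.all(mask[:, :, :3] == 255, axis=-1)
    raise ValueError("Got mask of invalid shape: %s" % str(mask.shape))


# Hypothetical 3x4 greyscale mask: the right-hand column is white.
grey = np.array([[0, 0, 10, 255],
                 [0, 0, 200, 255],
                 [0, 0, 0, 255]], dtype=np.uint8)
blocked_grey = to_boolean_mask(grey)

# Hypothetical 3x4 RGBA mask with the same white column (alpha is ignored).
rgba = np.zeros((3, 4, 4), dtype=np.uint8)
rgba[:, 3, :3] = 255
blocked_rgba = to_boolean_mask(rgba)

print(blocked_grey)
print(blocked_rgba)
```

Note that a light grey pixel such as 200 still counts as drawable; only exact white is blocked. In the Alice example, that means the white background around the figure is excluded and the words should end up inside the dark shape.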
Opening wordcloud.py shows the parameters of the WordCloud class:
"""Word cloud object for generating and drawing.

Parameters
----------
font_path : string
    Font path to the font that will be used (OTF or TTF).
    Defaults to DroidSansMono path on a Linux machine. If you are on
    another OS or don't have this font, you need to adjust this path.

width : int (default=400)
    Width of the canvas.

height : int (default=200)
    Height of the canvas.

prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical.
    If prefer_horizontal < 1, the algorithm will try rotating the word
    if it doesn't fit. (There is currently no built-in way to get only
    vertical words.)

mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not
    None, width and height will be ignored and the shape of mask will be
    used instead. All white (#FF or #FFFFFF) entries will be considered
    "masked out" while other entries will be free to draw on. [This
    changed in the most recent version!]

scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images,
    using scale instead of larger canvas size is significantly faster, but
    might lead to a coarser fit for the words.

min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this
    size.

font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but
    give a worse fit.

max_words : number (default=200)
    The maximum number of words.

stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS
    list will be used.

background_color : color value (default="black")
    Background color for the word cloud image.

max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, height of the image is
    used.

mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and
    background_color is None.

relative_scaling : float (default=.5)
    Importance of relative word frequencies for font-size. With
    relative_scaling=0, only word-ranks are considered. With
    relative_scaling=1, a word that is twice as frequent will have twice
    the size. If you want to consider the word frequencies and not only
    their rank, relative_scaling around .5 often looks good.

    .. versionchanged: 2.0
        Default is now 0.5.

color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation,
    font_path, random_state that returns a PIL color for each word.
    Overwrites "colormap".
    See colormap for specifying a matplotlib colormap instead.

regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text.
    If None is specified, ``r"\w[\w']+"`` is used.

collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.

    .. versionadded: 2.0

colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word.
    Ignored if "color_func" is specified.

    .. versionadded: 2.0

normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word
    appears with and without a trailing 's', the one with trailing 's'
    is removed and its counts are added to the version without
    trailing 's' -- unless the word ends with 'ss'.

Attributes
----------
``words_`` : dict of string to float
    Word tokens with associated frequency.

    .. versionchanged: 2.0
        ``words_`` is now a dictionary

``layout_`` : list of tuples (string, int, (int, int), int, color))
    Encodes the fitted word cloud. Encodes for each word the string, font
    size, position, orientation and color.

Notes
-----
Larger canvases will make the code significantly slower. If you need a
large word cloud, try a lower canvas size, and set the scale parameter.

The algorithm might give more weight to the ranking of the words
than their actual frequencies, depending on the ``max_font_size`` and the
scaling heuristic.
"""
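In summary:
font_path: path to the font file to use
width and height: width and height of the canvas
prefer_horizontal: how often words are placed horizontally rather than vertically
mask: the mask array; it defines the region words may be drawn in (white pixels are excluded) and overrides width and height
scale: scaling factor between the layout computation and the final drawing
min_font_size: the smallest font size to use
max_words: the maximum number of words in the cloud
stopwords: words to leave out
background_color: background color of the word cloud image
max_font_size: font size of the largest word
mode: the color mode of the image; with "RGBA" and background_color=None the background is transparent
relative_scaling: how strongly relative word frequency affects font size
regexp: the regular expression used to split the text into tokens
collocations: whether to include two-word collocations (bigrams)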
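The effect of relative_scaling is easier to see with numbers. The sketch below applies the font-size update rule from generate_from_frequencies to a list of normalized frequencies and ignores everything else (placement, shrinking when a word does not fit, min_font_size). The function name font_size_sequence and the sample values are made up for illustration.

```python
def font_size_sequence(frequencies, start_size, relative_scaling):
    """Font size per word, using only the relative_scaling update rule.

    frequencies must be sorted in descending order with the largest equal to 1.
    """
    sizes = []
    font_size = start_size
    last_freq = 1.0
    for freq in frequencies:
        if relative_scaling != 0:
            # Blend the frequency ratio to the previous word with a constant 1.
            ratio = freq / float(last_freq)
            factor = relative_scaling * ratio + (1 - relative_scaling)
            font_size = int(round(factor * font_size))
        sizes.append(font_size)
        last_freq = freq
    return sizes


freqs = [1.0, 0.5, 0.25]
rank_only = font_size_sequence(freqs, 100, 0)      # frequencies ignored
balanced = font_size_sequence(freqs, 100, 0.5)     # the documented default
proportional = font_size_sequence(freqs, 100, 1)   # size tracks frequency

print(rank_only, balanced, proportional)
```

With relative_scaling=1 each halving of the frequency halves the size, which matches the docstring's "twice as frequent, twice the size". With relative_scaling=0 this rule never changes the size, so any shrinking comes only from words failing to fit.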
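Stepping into generate with a debugger shows two calls:
words = process_text(text) returns the word frequencies of the text
generate_from_frequencies builds the word cloud from those words and frequencies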
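As a rough idea of what the word-frequency step produces, here is a simplified stand-in written with the standard library only. It tokenizes with the default regexp from the docstring, drops stopwords, and counts what is left. It is not the library's process_text: it lowercases everything and skips collocations and plural normalization. The name count_words and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def count_words(text, stopwords, regexp=r"\w[\w']+"):
    """Tokenize text, drop stopwords (case-insensitively) and count the rest."""
    tokens = re.findall(regexp, text)
    blocked = {word.lower() for word in stopwords}
    kept = [token.lower() for token in tokens if token.lower() not in blocked]
    return dict(Counter(kept))


sample = "Alice said hello. The Rabbit said nothing, and Alice said goodbye."
frequencies = count_words(sample, stopwords={"said", "the", "and"})
print(frequencies)
```

The default pattern needs at least two characters per token, so single-letter words such as "a" never reach the counting step. A dictionary like this is the kind of input generate_from_frequencies expects.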
Below is the implementation of generate_from_frequencies:
def generate_from_frequencies(self, frequencies, max_font_size=None):
    """Create a word_cloud from words and frequencies.

    Parameters
    ----------
    frequencies : dict from string to float
        A contains words and associated frequency.

    max_font_size : int
        Use this font-size instead of self.max_font_size

    Returns
    -------
    self

    """
    # make sure frequencies are sorted and normalized
    frequencies = sorted(frequencies.items(), key=item1, reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))
    frequencies = frequencies[:self.max_words]

    # largest entry will be 1
    max_frequency = float(frequencies[0][1])

    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]

    if self.random_state is not None:
        random_state = self.random_state
    else:
        random_state = Random()

    if self.mask is not None:
        mask = self.mask
        width = mask.shape[1]
        height = mask.shape[0]
        if mask.dtype.kind == 'f':
            warnings.warn("mask image should be unsigned byte between 0"
                          " and 255. Got a float array")
        if mask.ndim == 2:
            boolean_mask = mask == 255
        elif mask.ndim == 3:
            # if all channels are white, mask out
            boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)
        else:
            raise ValueError("Got mask of invalid shape: %s"
                             % str(mask.shape))
    else:
        boolean_mask = None
        height, width = self.height, self.width
    occupancy = IntegralOccupancyMap(height, width, boolean_mask)

    # create image
    img_grey = Image.new("L", (width, height))
    draw = ImageDraw.Draw(img_grey)
    img_array = np.asarray(img_grey)
    font_sizes, positions, orientations, colors = [], [], [], []

    last_freq = 1.

    if max_font_size is None:
        # if not provided use default font_size
        max_font_size = self.max_font_size

    if max_font_size is None:
        # figure out a good font size by trying to draw with
        # just the first two words
        if len(frequencies) == 1:
            # we only have one word. We make it big!
            font_size = self.height
        else:
            self.generate_from_frequencies(dict(frequencies[:2]),
                                           max_font_size=self.height)
            # find font sizes
            sizes = [x[1] for x in self.layout_]
            font_size = int(2 * sizes[0] * sizes[1] / (sizes[0] + sizes[1]))
    else:
        font_size = max_font_size

    # we set self.words_ here because we called generate_from_frequencies
    # above... hurray for good design?
    self.words_ = dict(frequencies)

    # start drawing grey image
    for word, freq in frequencies:
        # select the font size
        rs = self.relative_scaling
        if rs != 0:
            font_size = int(round((rs * (freq / float(last_freq))
                                   + (1 - rs)) * font_size))
        if random_state.random() < self.prefer_horizontal:
            orientation = None
        else:
            orientation = Image.ROTATE_90
        tried_other_orientation = False
        while True:
            # try to find a position
            font = ImageFont.truetype(self.font_path, font_size)
            # transpose font optionally
            transposed_font = ImageFont.TransposedFont(
                font, orientation=orientation)
            # get size of resulting text
            box_size = draw.textsize(word, font=transposed_font)
            # find possible places using integral image:
            result = occupancy.sample_position(box_size[1] + self.margin,
                                               box_size[0] + self.margin,
                                               random_state)
            if result is not None or font_size < self.min_font_size:
                # either we found a place or font-size went too small
                break
            # if we didn't find a place, make font smaller
            # but first try to rotate!
            if not tried_other_orientation and self.prefer_horizontal < 1:
                # NOTE: as quoted, both branches evaluate to Image.ROTATE_90
                orientation = (Image.ROTATE_90 if orientation is None else
                               Image.ROTATE_90)
                tried_other_orientation = True
            else:
                font_size -= self.font_step
                orientation = None

        if font_size < self.min_font_size:
            # we were unable to draw any more
            break

        x, y = np.array(result) + self.margin // 2
        # actually draw the text
        draw.text((y, x), word, fill="white", font=transposed_font)
        positions.append((x, y))
        orientations.append(orientation)
        font_sizes.append(font_size)
        colors.append(self.color_func(word, font_size=font_size,
                                      position=(x, y),
                                      orientation=orientation,
                                      random_state=random_state,
                                      font_path=self.font_path))
        # recompute integral image
        if self.mask is None:
            img_array = np.asarray(img_grey)
        else:
            img_array = np.asarray(img_grey) + boolean_mask
        # recompute bottom right
        # the order of the cumsum's is important for speed ?!
        occupancy.update(img_array, x, y)
        last_freq = freq

    self.layout_ = list(zip(frequencies, font_sizes, positions,
                            orientations, colors))
    return self
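The function above relies on an integral image ("find possible places using integral image") to find room for each word. The article does not show IntegralOccupancyMap itself, so the following is only a plain-Python sketch of the general idea: a summed-area table lets you check in constant time whether a rectangle covers any occupied pixel. All names here (integral_image, free_positions, sample_position as a free function) and the small grid are made up for illustration and are not the library's code.

```python
import random


def integral_image(grid):
    """Summed-area table of a 2D list, padded with a leading zero row and column."""
    height, width = len(grid), len(grid[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for i in range(height):
        row_sum = 0
        for j in range(width):
            row_sum += grid[i][j]
            table[i + 1][j + 1] = table[i][j + 1] + row_sum
    return table


def free_positions(grid, box_h, box_w):
    """All top-left (row, col) corners where a box_h x box_w box covers only zeros."""
    table = integral_image(grid)
    height, width = len(grid), len(grid[0])
    spots = []
    for i in range(height - box_h + 1):
        for j in range(width - box_w + 1):
            # Sum of the box from four table lookups.
            covered = (table[i + box_h][j + box_w] - table[i][j + box_w]
                       - table[i + box_h][j] + table[i][j])
            if covered == 0:
                spots.append((i, j))
    return spots


def sample_position(grid, box_h, box_w, rng):
    """Pick one free corner at random, or None if the box fits nowhere."""
    spots = free_positions(grid, box_h, box_w)
    return rng.choice(spots) if spots else None


# Hypothetical occupancy grid: 1 marks a pixel already covered by a word or the mask.
occupied = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
spots = free_positions(occupied, 2, 2)
choice = sample_position(occupied, 2, 2, random.Random(0))
no_room = sample_position(occupied, 3, 4, random.Random(0))
print(spots, choice, no_room)
```

When no position is returned, the loop in generate_from_frequencies first tries the other orientation and then reduces the font size by font_step, until it either finds room or drops below min_font_size.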
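Regrettably, wordcloud does not use the OpenCV library; with OpenCV it could probably produce even more striking results.
As a next step, I may bring in OpenCV and similar tools to work on the areas where the word cloud falls short.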