Calculate the frequencies of words, pairs of words and more in a Wikipedia dataset

word-frequencies

Calculate the frequencies of words, pairs of words, etc. in a Wikipedia dataset. One use-case is to create lists of popular, distinct words for use in games, passphrase generation, and similar applications.

Installation

This crate is not published to crates.io yet, so you will need to first install Rust, clone this repository locally, then run:

cargo install --path . --force

This will put a word-frequencies binary into your $HOME/.cargo/bin folder, which you can then add to your PATH environment variable.

Usage

Run word-frequencies --help for general usage instructions, or e.g. word-frequencies split --help for a specific subcommand. Below is an end-to-end example of using word-frequencies to count unigrams (words) and bigrams (pairs of words), and then extract the most frequent words.

1. Wikipedia dataset download

First download the Wikipedia dataset for the language that you care about.

2a. Mac and Linux command-line example

Let's assume the Wikipedia dataset is downloaded to $HOME/datasets/wikipedia/plwiki-20200113-cirrussearch-content.json.gz.

With this file downloaded, first split it into multiple pieces; this step also Unicode-normalizes the input. Each output file contains one line per Wikipedia article.

word-frequencies split \
    --input-path $HOME/datasets/wikipedia/plwiki-20200113-cirrussearch-content.json.gz \
    --output-dir $HOME/datasets/wikipedia/plwiki-20200113-split
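
To illustrate what Unicode normalization does to the input, here is a minimal Python sketch. It assumes NFKC, the same form the Wiktionary script later in this README uses; the form that split actually applies is not stated here, and normalize_line is a hypothetical helper name, not part of word-frequencies.

```python
import unicodedata


def normalize_line(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: return text in the given normalization form.

    NFKC is an assumption; the split subcommand may use a different form.
    """
    return unicodedata.normalize(form, text)


# The same Polish word can be encoded with combining marks:
# "z" + U+0307 (dot above) and "o" + U+0301 (acute accent).
decomposed = "z\u0307o\u0301\u0142w"
composed = normalize_line(decomposed)

# After normalization each accented letter is a single code point,
# so both spellings of the word are counted as the same string.
print(len(decomposed), len(composed))  # 6 4
```

Without this step, visually identical words with different underlying code points would be counted separately.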

After splitting you can create a frequencies file, which contains counts for unigrams (single words) and bigrams (pairs of words):

word-frequencies create-frequencies \
    --input-dir $HOME/datasets/wikipedia/plwiki-20200113-split \
    --output-file $HOME/datasets/wikipedia/plwiki-20200113-split/plwiki-20200113-frequencies.txt \
    --language pl

This will create a compressed file plwiki-20200113-frequencies.txt.gz. If you zless it, you can see that it contains counts you could use to build a language model.
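
As a rough illustration of what counting unigrams and bigrams means, and of how such counts feed a simple language model, here is a Python sketch. It is not the tool's implementation: lowercasing and whitespace tokenization are assumptions, and count_ngrams and bigram_probability are hypothetical names.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_ngrams(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Hypothetical helper: count unigrams and bigrams, one article per line."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for line in lines:
        # Assumption: lowercase and split on whitespace.
        tokens = line.lower().split()
        unigrams.update(tokens)
        # Pair each token with its successor; pairs never span two articles.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(unigrams: Counter, bigrams: Counter, first: str, second: str) -> float:
    """Rough maximum-likelihood estimate of P(second | first), no smoothing."""
    if unigrams[first] == 0:
        return 0.0
    return bigrams[(first, second)] / unigrams[first]


articles = ["ala ma kota", "kot ma ale"]
unigrams, bigrams = count_ngrams(articles)
print(unigrams["ma"])                                       # 2
print(bigrams[("ala", "ma")])                               # 1
print(bigram_probability(unigrams, bigrams, "ma", "kota"))  # 0.5
```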

For now, if you only care about the K most popular unigrams, e.g. the top 10k words, you can run:

word-frequencies top-k-words \
    --number-of-words 10000 \
    --input-file $HOME/datasets/wikipedia/plwiki-20200113-split/plwiki-20200113-frequencies.txt.gz \
    --output-file $HOME/datasets/wikipedia/plwiki-20200113-split/plwiki-20200113-top-10k.txt
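
Conceptually, selecting the top K words is a partial sort over the unigram counts. The Python sketch below shows one way to do it on an in-memory mapping; the layout of the frequencies file is not described here, so the sketch does not parse it, and top_k_words is a hypothetical function name with alphabetical tie-breaking as an assumption.

```python
import heapq
from typing import Dict, List


def top_k_words(counts: Dict[str, int], k: int) -> List[str]:
    """Hypothetical helper: the k most frequent words, ties broken alphabetically."""
    # Sorting key: highest count first, then alphabetical order.
    # heapq.nsmallest avoids fully sorting the vocabulary when k is small.
    best = heapq.nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in best]


counts = {"w": 50, "i": 40, "na": 40, "z": 10}
print(top_k_words(counts, 3))  # ['w', 'i', 'na']
```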

2b. Windows command-line example

TODO: this works, but the commands still need to be written out and tested.

Dictionary sources

English

From https://packages.debian.org/sid/wordlist download the wamerican, wbritish, and wcanadian standard lists (around 103k words each), then concatenate, sort, and de-dupe them:

cat wamerican/usr/share/dict/american-english \
    wbritish/usr/share/dict/british-english \
    wcanadian/usr/share/dict/canadian-english | sort | uniq > en.txt

Note that http://wordlist.aspell.net/12dicts-readme/ is another great resource for curated English words.

Polish

The Debian wpolish dictionary is surprisingly low quality, so I scraped Wiktionary to build a Polish dictionary; see below.

Using Wiktionary

I haven't ironed this out, but here is some quick Python code that converts Wiktionary dataset dumps (from the same links as above) into dictionary files. You can then put these into the "dictionaries" sub-folder and re-run.

#!/usr/bin/env python3
"""Build plain-text word lists from Wiktionary CirrusSearch dumps."""

import gzip
import json
import unicodedata


def main():
    main_en()
    main_pl()


def extract_words(dump_path, field, marker, output_path):
    """Write the sorted single-word titles whose `field` list contains `marker`."""
    # A set drops duplicates that can appear once titles are NFKC-normalized.
    words = set()
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            data = json.loads(line)
            # Skip the bulk-index action lines; only page documents carry "language".
            if "language" not in data:
                continue
            # Skip multi-word entries.
            if " " in data["title"]:
                continue
            # .get() avoids a KeyError on pages that lack the field.
            if marker not in (data.get(field) or []):
                continue
            words.add(unicodedata.normalize("NFKC", data["title"]))
    # Explicit UTF-8 so non-ASCII words are written correctly on every platform.
    with open(output_path, "w", encoding="utf-8") as f_out:
        for word in sorted(words):
            f_out.write("{0}\n".format(word))


def main_pl():
    # Polish entries are identified by the Polish-language template.
    extract_words(
        "plwiktionary-20200113-cirrussearch-content.json.gz",
        "template",
        "Szablon:język polski",
        "pl.txt",
    )


def main_en():
    # English entries are identified by an "English" section heading.
    extract_words(
        "enwiktionary-20200113-cirrussearch-content.json.gz",
        "heading",
        "English",
        "en.txt",
    )


if __name__ == "__main__":
    main()

TODOs

  • Need tests.
  • Option to specify your own dictionary file, that way we don't need to keep adding dictionaries to the binary.
  • Very memory inefficient, need ~15GB RAM for English.
    • Try interning Strings, I think the string copying is a big culprit.
    • If still not good enough then use SQLite to count words.
  • Make minimum article count in create-frequencies an input parameter.
  • Once crate is published update installation instructions.
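
The SQLite idea from the list above could look roughly like the following Python sketch, which keeps counts in a database table rather than in an in-memory map. This is only an illustration of the approach, not a planned design: the counts table, its columns, and count_words_sqlite are hypothetical names, and a real implementation would live in the Rust code.

```python
import sqlite3
from typing import Iterable


def count_words_sqlite(conn: sqlite3.Connection, words: Iterable[str]) -> None:
    """Hypothetical helper: accumulate word counts in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for word in words:
        # Create the row on first sight, then increment it.
        conn.execute("INSERT OR IGNORE INTO counts (word, n) VALUES (?, 0)", (word,))
        conn.execute("UPDATE counts SET n = n + 1 WHERE word = ?", (word,))
    conn.commit()


# ":memory:" keeps the demo self-contained; a file path would keep counts on disk.
conn = sqlite3.connect(":memory:")
count_words_sqlite(conn, "ala ma kota a kot ma ale".split())
print(conn.execute("SELECT n FROM counts WHERE word = 'ma'").fetchone()[0])  # 2
```

The trade-off is speed: every word becomes a database write, so this exchanges memory for I/O.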

Testing commands for older English dataset

word-frequencies split \
    --input-path $HOME/datasets/wikipedia/enwiki-20191202-cirrussearch-content.json.gz \
    --output-dir $HOME/datasets/wikipedia/enwiki-20191202-split
word-frequencies create-frequencies \
    --input-dir $HOME/datasets/wikipedia/enwiki-20191202-split \
    --output-file $HOME/datasets/wikipedia/enwiki-20191202-split/enwiki-20191202-frequencies.txt \
    --language en
word-frequencies top-k-words \
    --number-of-words 10000 \
    --input-file $HOME/datasets/wikipedia/enwiki-20191202-split/enwiki-20191202-frequencies.txt.gz \
    --output-file $HOME/datasets/wikipedia/enwiki-20191202-split/enwiki-20191202-top-10k.txt
echo done

License

word-frequencies is distributed under the terms of the Apache License (Version 2.0). See LICENSE for details.

