The NLP Project
Natural Language Processing — Beginner to Advanced (Part-3)
Advanced Lexical Processing — how to handle miscellaneous noise in the textual data and how to build your own spelling corrector.
In the previous part of the series ‘The NLP Project’, we learned the basic lexical processing techniques such as removing stop words, tokenization, stemming, and lemmatization.
Even after going through all those preprocessing steps, a lot of noise remains in the textual data: spelling mistakes, both accidental and deliberate (informal words such as ‘lol’, ‘u’, ‘gud’, etc.), as well as spelling variations of a word that arise from different pronunciations (e.g. Mumbai, Bombay).
The two methods that we talked about in the last part, stemming and lemmatization, are both forms of a technique called canonicalization. Simply put, canonicalization means reducing a word to its base form.
There are scenarios where we cannot canonicalize a word using stemming and lemmatization alone, so we need other techniques to canonicalize words correctly. For example, if the word ‘allowing’ is misspelled as ‘alowing’, stemming the misspelled word gives us a redundant token ‘alow’, and lemmatization does not work at all, since it relies on correct dictionary spellings.
The same problem arises when the same word is pronounced and spelled differently in different dialects. For example, the word ‘humour’ in British English is spelled ‘humor’ in American English. Both are correct spellings, but the forms ‘humouring’ and ‘humoring’ will stem to two different base forms.
Phonetic Hashing
Some words are pronounced differently in different languages and, as a result, end up being written differently. For example, a common Indian surname such as ‘Srivastava’ is also spelled ‘Shrivastava’, ‘Shrivastav’, or ‘Srivastav’. Therefore, we want to reduce all variants of a particular word to one single, common form.
To achieve this, we use a technique called phonetic hashing. Phonetic hashing buckets words with the same phonemes (the smallest units of sound) together and gives all their variations the same hash code. So, the words ‘color’ and ‘colour’ get the same code.
Phonetic hashing is a technique used to canonicalize words that have different variants but the same phonetic characteristics, that is, the same pronunciation.
Phonetic hashing relies on Soundex algorithms, which calculate the hash code of a given word. These algorithms differ from language to language; American Soundex is the most popular one, and it is the one we will use here. It also doesn't matter which language the input word comes from: as long as the words sound similar, they get the same hash code.
The phonetic hash is a four-character code. So, let us calculate the hash code (also known as the Soundex code) of the word ‘Chennai’ by applying the following steps:
1. Retain the first letter of the word as is. Soundex is based on the rationale that English pronunciation depends on the first letter and the pattern of consonants. So, C becomes the first letter of the code.
2. Remove all the vowels. So we have ‘CHNN’ now.
3. Replace the remaining consonants (after the first letter) with digits as follows:
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
- h, w, y → unencoded/removed
After encoding, the code becomes C55.
4. If two or more letters with the same number are adjacent in the original name (before step 1), retain only the first of them; also, two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter. Here we have two adjacent 5’s, so we merge them into one and get C5.
5. If the word has too few letters to assign three numbers, append zeros until there are three numbers; if there are four or more numbers, retain only the first three. Since we have the first letter and only one number, we append two zeros at the end of the encoding to make it a four-character code. Hence, we get C500 as the final code for ‘Chennai’.
Thus, we see that the Soundex code for city names like ‘Bengaluru’ and ‘Bangalore’ is the same, B524, which is what we were aiming for.
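To make these steps concrete, here is a minimal Python sketch of the American Soundex hashing described above (the function name soundex() and the treatment of ‘y’ as a vowel are simplifications of my own):

def soundex(word):
    "A rough sketch of the American Soundex code described in the steps above."
    word = word.upper()
    mapping = {}
    for letters, digit in [('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                           ('L', '4'), ('MN', '5'), ('R', '6')]:
        for letter in letters:
            mapping[letter] = digit

    # encode every letter: a digit, '.' for h/w (ignored), '0' for vowels (y treated like a vowel here)
    encoded = [mapping.get(ch, '.' if ch in 'HW' else '0') for ch in word]

    digits = []
    prev = encoded[0]                      # the merging rule also applies to the first letter
    for code in encoded[1:]:
        if code == '.':                    # h/w do not separate letters with the same number
            continue
        if code != '0' and code != prev:   # drop vowels and adjacent repeated numbers
            digits.append(code)
        prev = code

    # retain the first letter, pad with zeros, and keep a four-character code
    return (word[0] + ''.join(digits) + '000')[:4]

print(soundex('Chennai'))    # C500
print(soundex('Bengaluru'))  # B524
print(soundex('Bangalore'))  # B524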
So, we saw how to deal with different pronunciations of a particular word. Next, let us see how to deal with misspellings.
Edit Distance
Simply put, the edit distance is the minimum number of edits required to convert one string into another, that is, to turn a given source string into the target string.
Now, you must be wondering what counts as an edit. The possible edits are as follows:
- Insertion of a character into a given string. To convert the misspelled word ‘Aquire’ to its correct spelling ‘Acquire’, we need to insert the character ‘c’ in the given misspelled string.
- Deletion of a character from a given string. To convert ‘neccessary’ to ‘necessary’, we need to delete the extra ‘c’ from the given string.
- Substitution of a character in a given string. To convert ‘Absense’ to its correct form ‘Absence’, we need to substitute the character ‘s’ with the character ‘c’.
The most common edit distance algorithm is the Levenshtein edit distance, which you can learn about in detail here . There is one more possible edit, the ‘transpose’ operation, where you swap two adjacent characters of a given string; this operation is only counted in the Damerau-Levenshtein edit distance.
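To make this concrete, here is a minimal dynamic-programming sketch of the Levenshtein distance (the function name levenshtein() is my own; NLTK, for example, ships a ready-made nltk.edit_distance() function):

def levenshtein(source, target):
    "Minimum number of insertions, deletions and substitutions to turn `source` into `target`."
    # previous[j] holds the distance between the current prefix of `source` and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution (or match)
        previous = current
    return previous[-1]

print(levenshtein('Aquire', 'Acquire'))        # 1 (one insertion)
print(levenshtein('neccessary', 'necessary'))  # 1 (one deletion)
print(levenshtein('Absense', 'Absence'))       # 1 (one substitution)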
Spelling Corrector
A spelling corrector is a widely used application. If you have the autocorrect feature enabled on your phone, incorrect spellings get replaced by correct ones. Another example: when you use a search engine such as Google and mistype a word, it suggests the correct spelling. Spell correction is an important part of lexical processing, and in many applications it forms an initial preprocessing layer.
We saw how to calculate the edit distance between two words. Now we will use the concept of edit distance to build a spelling corrector, namely the famous Norvig spell corrector.
We need a large corpus of text with proper spelling, which will also serve as the dictionary for our spell corrector. Here, the large corpus ‘big.txt’ is nothing but a book: ‘The Adventures of Sherlock Holmes’, available in text format on Project Gutenberg’s website.
import re
from collections import Counter

# function to tokenise words
def words(document):
    "Convert text to lower case and tokenise the document"
    return re.findall(r'\w+', document.lower())

# create a frequency table of all the words of the document
all_words = Counter(words(open('big.txt').read()))
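As a quick sanity check of the frequency table (the exact numbers depend on the corpus you load):

print(len(all_words))            # number of distinct word types in the corpus
print(all_words.most_common(5))  # the most frequent words, typically 'the', 'of', 'and', ...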
There are five main functions that we will use to find the correct spelling of a word and they are as follows:
1. edits_one(): this function creates all the possible strings that are one edit distance away from the input word. Most of the strings it creates are garbage, i.e. they are not valid English words. For example, if we pass the word ‘laern’ (a misspelling of ‘learn’) to edits_one(), its output will contain strings such as ‘lgern’, which is one edit away from ‘laern’ but is not an English word. Only a subset of the output will be actual English words; a quick check follows the code below.
def edits_one(word):
    "Create all edits that are one edit away from `word`."
    alphabets = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + c + right for left, right in splits for c in alphabets]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabets]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    return set(deletes + inserts + replaces + transposes)
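A quick check of what edits_one() produces for the ‘laern’ example above (the exact count depends on the word length):

candidates = edits_one('laern')
print(len(candidates))         # a few hundred candidate strings
print('learn' in candidates)   # True, the intended word is one edit (a transpose) away
print('lgern' in candidates)   # True, but so is plenty of garbage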
2. edits_two(): this function creates all the possible strings that are two edits away from the input word (the code below returns them lazily as a generator). Most of these will also be garbage.
def edits_two(word):
    "Create all edits that are two edits away from `word`."
    return (e2 for e1 in edits_one(word) for e2 in edits_one(e1))
3. known(): this function filters out the valid English words from a collection of candidate strings. It uses the frequency distribution all_words, built from the seed document, as its dictionary: candidates produced by edits_one() and edits_two() that are not in this dictionary are discarded, as the short example after the code below shows.
def known(words):
    "The subset of `words` that appear in the `all_words`."
    return set(word for word in words if word in all_words)
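Continuing the ‘laern’ example, only the valid dictionary words survive the filter; with ‘big.txt’ as the corpus this is typically just ‘learn’:

print(known(edits_one('laern')))            # e.g. {'learn'}
print(known(['laern', 'learn', 'lgern']))   # e.g. {'learn'}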
4. possible_corrections(): this function takes the input word and uses the above three functions to return possible correct spellings for it. First, it checks the spelling of the input word; if the word is already present in the dictionary, it is returned as is, since it is already a correct dictionary word.
If the spelling is incorrect, it looks for all dictionary words that are one edit away from the input word and returns them. If there are none, it looks for all dictionary words that are two edits away and returns those. If there are no such words either, the input word itself is returned, which means the spell corrector could not find any alternatives. A couple of examples follow the code below.
def possible_corrections(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits_one(word)) or known(edits_two(word)) or [word])
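For example, with the ‘laern’ misspelling used earlier (the exact suggestions depend on the corpus):

print(possible_corrections('laern'))   # e.g. {'learn'}
print(possible_corrections('learn'))   # {'learn'}, since it is already a dictionary word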
5. prob(): the prob() function computes the probability of a word, that is, its relative frequency in the text corpus. Among the suggested corrections, the one with the highest probability, in other words the one that occurs most frequently in our corpus, is chosen as the correction.
This is exactly why we need a large text corpus instead of a plain dictionary. A dictionary only lists the correct English words, whereas a text corpus contains the correct words and also lets us build a frequency distribution over them.
def prob(word, N=sum(all_words.values())):
    "Probability of `word`: Number of appearances of 'word' / total number of tokens"
    return all_words[word] / N
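For instance (Counter returns 0 for words that never appear in the corpus, so unseen words get a probability of 0.0):

print(prob('the'))     # a very common word, so a relatively high probability
print(prob('zzyzx'))   # 0.0, since this string is not in the corpus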
Now, we’re almost done building the spell corrector. We just need to put all the pieces of the code together and wrap them up in a new function that uses all the functions created till now.
def spell_check(word):
    "Return the most probable spelling correction for `word` out of all the `possible_corrections`."
    correct_word = max(possible_corrections(word), key=prob)
    if correct_word != word:
        return "Did you mean " + correct_word + "?"
    else:
        return "Correct spelling."
Voila! We have successfully created a simple yet very effective spelling corrector. We can now use it to correct spellings in any given text.
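For example, with all_words built from ‘big.txt’ as above:

print(spell_check('laern'))   # e.g. Did you mean learn?
print(spell_check('learn'))   # Correct spelling.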
That’s it for part 3, folks. We have covered the whole of lexical processing in part 2 and part 3. Now we will move on to the next major part of NLP: Syntactic Processing. Cheers!