A simple Algorithm to Create your Gibberish Challenge
Jul 30 · 8 min read
Social media is one of the most powerful platforms of the 21st century. It has become a primary channel for digital marketing, expressing opinions, making friends, entertainment, and whatnot. Social media apps keep innovating to keep users engaged, showing them the right content based on their behavior. New challenges also trend from time to time, often with celebrities posting about them. One challenge trending at the moment is the “Guess the Gibberish challenge”.
I am sure you must have taken or at least seen people take this challenge. In this challenge, some gibberish text appears on the screen that looks like nonsense but sounds like something meaningful when read aloud. Let’s see some examples.
I have personally tried it and can vouch that it can be addictive at times. Being an algorithm enthusiast, the first thing that came to my mind was how I could create such a challenge myself. I did some research and created a simple version of the game. This simple version can be made more sophisticated with a few tweaks, which I will discuss at the end as an open-ended problem. There can be many different ways of creating it. In this blog post, I will discuss one such way. All the code used in this blog post can be found here. So, let’s get started.
Phonetic Algorithms
As per Wikipedia, phonetics is the science of the sounds of the human voice. Subject-matter experts in this field are called phoneticians. The linguistics-based study of phonetics is termed phonology.
A phonetic algorithm is an algorithm for indexing words by their pronunciation. These algorithms make it possible to identify words with a similar pronunciation.
The first question that may come to the reader’s mind is why we are discussing phonetics and phonetic algorithms. The answer is that in the “Guess the Gibberish challenge”, the gibberish sounds similar to a meaningful phrase, and decoding that phrase is the goal of the game. Intuitively, a phonetic algorithm should help with that. There are many good phonetic algorithms, and a popular and simple one is the Soundex algorithm.
Soundex Algorithm
Soundex is a phonetic algorithm for indexing names by sound. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.
The Soundex algorithm encodes an English word as a letter followed by ‘k-1’ digits, where ‘k’ is the desired length of the code in characters (standard Soundex uses k = 4).
The algorithm is very well explained on Wikipedia. Taking parts from it with slight simplifications, this is how it looks:
Step 1: Retain the first letter of the word and drop all other occurrences of a, e, i, o, u, y, h, w.
Step 2: Replace consonants with digits as follows (after the first letter):
★ b, f, p, v → 1
★ c, g, j, k, q, s, x, z → 2
★ d, t → 3
★ l → 4
★ m, n → 5
★ r → 6
The logic behind Step 2:
Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.
Step 3: If two or more letters with the same number are adjacent in the original word, only retain the first letter; also, two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
Step 4: If the final encoding is shorter than k characters, pad it with 0’s. If it is longer than k characters, retain only the first k.
Using this algorithm with k = 4, both ‘Robert’ and ‘Rupert’ return the same string ‘R163’ while ‘Rubin’ yields ‘R150’. ‘Ashcraft’ and ‘Ashcroft’ both yield ‘A261’. ‘Tymczak’ yields ‘T522’ not ‘T520’ (the chars ‘z’ and ‘k’ in the name are coded as 2 twice since a vowel lies in between them). ‘Pfister’ yields ‘P236’ not ‘P123’ (the first two letters have the same number and are coded once as ‘P’), and ‘Honeyman’ yields ‘H555’.
In Python, the fuzzy package provides a good implementation of Soundex and other phonetic algorithms.
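For readers who want to trace Steps 1 to 4 by hand, here is a minimal sketch of standard Soundex in plain Python. It is my own illustration, not the code of the fuzzy package or of the original post. The names `CODES` and `soundex` are hypothetical, and the sketch assumes plain ASCII input.

```python
# Step 2 table: consonant -> digit (vowels, y, h and w have no entry).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word, k=4):
    """Return a k-character Soundex code (a sketch, ASCII letters only)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]                 # Step 1: keep the first letter
    prev = CODES.get(letters[0], "")     # Step 3 also applies to the first letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # h/w do not break a run of equal digits
        digit = CODES.get(ch, "")
        if digit == "":
            prev = ""                    # a vowel lets the same digit be coded again
            continue
        if digit != prev:
            encoded += digit
        prev = digit
    return encoded[:k].ljust(k, "0")     # Step 4: truncate or pad with zeros


for name in ("Robert", "Rupert", "Rubin", "Ashcraft", "Tymczak", "Pfister", "Honeyman"):
    print(name, soundex(name))
```

Running it reproduces the codes quoted above: R163, R163, R150, A261, T522, P236 and H555.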
A Slight Modification To Soundex
We will use the Soundex algorithm to generate the encoding with a slight variation in Step 1.
Instead of deleting all occurrences of a, e, i, o, u, y, h, w, we will further cluster/number them.
★ e, i, y → 7
★ o, u → 8
★ a, h, w → ignored
Reasoning: e, i, and y sound similar; for example, ‘pic’, ‘pec’, and ‘pyc’ sound alike.
I can’t use the fuzzy package because I want to make these modifications to the original Soundex algorithm. I found this great implementation by Carlos Delgado. It’s not completely correct (for instance, it does not apply the ‘h’/‘w’ rule or the first-letter rule from Step 3), but it is good enough for our use case.
The modified Soundex Algorithm:
def get_soundex_modified(name):
    # Get the modified Soundex code for the string
    name = name.upper()
    soundex = ""
    soundex += name[0]  # the first character is kept as it is

    # Numbering of letters based on phonetic similarity;
    # "." marks letters that are ignored (removed further below)
    dictionary = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5",
                  "R": "6", "EIY": "7", "OU": "8", "AHW": "."}

    for char in name[1:]:
        for key in dictionary.keys():
            if char in key:
                code = dictionary[key]
                # Skip a code that repeats the one just emitted
                if code != soundex[-1]:
                    soundex += code

    soundex = soundex.replace(".", "")
    # We prefer an 8-character output
    soundex = soundex[:8].ljust(8, "0")
    return soundex
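To see what the modified encoder produces, here is a compact, table-driven sketch that is intended to mirror the function above. The names `LETTER_TO_CODE` and `encode_modified` are mine and purely illustrative. It checks the ‘pic’, ‘pec’, ‘pyc’ reasoning from earlier.

```python
# Same clusters as in the post; "." marks letters that are ignored.
CLUSTERS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5",
            "R": "6", "EIY": "7", "OU": "8", "AHW": "."}

# Flatten to a per-letter lookup, e.g. "B" -> "1", "E" -> "7".
LETTER_TO_CODE = {letter: code
                  for letters, code in CLUSTERS.items()
                  for letter in letters}


def encode_modified(word, length=8):
    """Modified Soundex sketch: first letter, then cluster digits, zero-padded."""
    word = word.upper()
    out = word[0]
    for ch in word[1:]:
        code = LETTER_TO_CODE.get(ch)
        # Characters outside the table are skipped; a repeated code is emitted once.
        if code is not None and code != out[-1]:
            out += code
    return out.replace(".", "")[:length].ljust(length, "0")


for w in ("pic", "pec", "pyc", "under"):
    print(w, encode_modified(w))
```

The three look-alike words all map to ‘P7200000’, which is exactly the collision the extra vowel clusters are meant to produce, and ‘under’ maps to ‘U5376000’.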
Guess The Gibberish Challenge Algorithm
With the slight modification to the Soundex algorithm, we are in a position to complete the “Guess the Gibberish Challenge” algorithm. The steps are:
Step 1: Given the sentence, take one word at a time and generate an 8-character encoding using the modified Soundex algorithm above. For example, the word ‘under’ gets an encoding of ‘U5376000’.
Step 2: Take the encoding from Step 1, go through it one character at a time, and do the following:
- If the character is a letter, keep it as it is.
- If the character is a digit, randomly pick a letter from that cluster. For example, if the digit is 2, we can randomly pick any one of c, g, j, k, q, s, x, and z. The clusters are:
★ b, f, p, v → 1
★ c, g, j, k, q, s, x, z → 2
★ d, t → 3
★ l → 4
★ m, n → 5
★ r → 6
★ e, i, y → 7
★ o, u → 8
- If the character is 0 or no characters of the encoding are left, we are done with that word.
Step 3: Repeat the same process for all the words in the sentence.
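The three steps above can be sketched as follows. This is a hedged illustration rather than the post’s own code. The function names `code_to_gibberish` and `sentence_to_gibberish` are hypothetical, and the `encoder` argument is assumed to be any function that returns a modified Soundex code for a word, such as `get_soundex_modified`.

```python
import random

# Clusters from the post: digit -> letters in that sound group.
CLUSTERS = {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l",
    "5": "mn", "6": "r", "7": "eiy", "8": "ou",
}


def code_to_gibberish(code, rng=random):
    """Step 2: turn one encoding such as 'U5376000' into a gibberish word."""
    out = []
    for ch in code:
        if ch == "0":              # padding reached: the word is finished
            break
        if ch in CLUSTERS:         # digit: pick any letter from its cluster
            out.append(rng.choice(CLUSTERS[ch]))
        else:                      # letter: the retained first character
            out.append(ch)
    return "".join(out)


def sentence_to_gibberish(sentence, encoder, rng=random):
    """Steps 1 and 3: encode every word, then decode each code into gibberish."""
    return " ".join(code_to_gibberish(encoder(word), rng)
                    for word in sentence.split())


rng = random.Random(42)            # seeded only so the demo is repeatable
print(code_to_gibberish("U5376000", rng))
```

Because the letters are drawn at random, the same encoding can yield different gibberish on each run; for ‘U5376000’ the result is always ‘U’ followed by one letter each from m/n, d/t, e/i/y, and then r.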
Let’s visualize our gibberish output for some proverbs.
The output is gibberish, yet it sounds similar to the actual proverbs. That’s our simple “Guess the Gibberish Challenge” solution.
Possible Enhancements
In the present approach, we create gibberish one word at a time, which need not be the case. Also, the first character remains the same in both the original input and the gibberish output. Both limitations can be worked around. I will suggest two possible ways to enhance our “Guess the Gibberish” algorithm and leave the rest to the audience as an open-ended problem to show creativity.
- Some smart heuristic to split and combine the words intelligently and then perform the encoding using the modified Soundex algorithm. For example, one interesting gibberish for “United Kingdom” is “Ewe night ted king dumb”.
- The first character doesn’t have to be the same in both the original input and in the gibberish output. For example, one possible way of encoding the letter ‘U’ can be ‘Ewe’.
Conclusion
In this blog post, we developed an interesting and simple algorithm to create our own “Guess the Gibberish Challenge”. In the process, we also learned about phonetics, phonology, and phonetic algorithms, and looked at ways to make the challenge harder. Hope you liked it. All the code used in this blog post can be found here.
If you have any doubts or queries, do reach out to me. I will be interested to know if you think of some more possible enhancements to it.
About the author:
Abhishek Mungoli is a seasoned Data Scientist with experience in the ML field and a Computer Science background, spanning various domains, with a problem-solving mindset. He has excelled at various Machine Learning and Optimization problems specific to Retail. He is enthusiastic about implementing Machine Learning models at scale and sharing knowledge via blogs, talks, meetups, and papers.
My motive is always to reduce the toughest things to their simplest form. I love problem-solving, data science, product development, and scaling solutions. I love exploring new places and working out in my leisure time. Follow me on Medium, Linkedin or Instagram and check out my previous posts. I welcome feedback and constructive criticism. Some of my blogs:
- An unsupervised Mathematical Scoring Model
- Dimensionality Reduction: PCA versus Autoencoders
- Experience the power of the Genetic Algorithm
- 5 Mistakes every Data Scientist should avoid
- Decomposing Time Series in a simple & intuitive way
- How GPU Computing literally saved me at work?
- Information Theory & KL Divergence Part I and Part II
- Process Wikipedia Using Apache Spark to Create Spicy Hot Datasets
- A Semi-Supervised Embedding based Fuzzy Clustering
- Compare which Machine Learning Model performs Better
- Analyzing Fitbit Data to Demystify Bodily Pattern Changes Amid Pandemic Lockdown
- Myths and Reality around Correlation
- A Guide to Becoming Business-Oriented Data Scientist