There is hardly a week when you go to Google News and don’t find a news article about phishing. Just in the last week, hackers sent phishing emails to Disney+ subscribers, ‘Shark Tank’ star Barbara Corcoran lost almost $400K in a phishing scam, a bank issued phishing warnings, and almost three-quarters of all phishing websites now use SSL. Since phishing is such a widespread problem in the cybersecurity domain, let us take a look at the application of machine learning for phishing website detection. Although there have been many articles and research papers on this topic [Malicious URL Detection] [Phishing Website Detection by Visual Whitelists] [Novel Techniques for Detecting Phishing], they do not always provide open-source code or dive deep into the analysis. This post is written to address these gaps. We will use a large phishing website corpus and apply a few simple machine learning methods to achieve highly accurate results.
Data
The best part about tackling this problem with machine learning is the availability of well-collected phishing website data sets, one of which was collected by folks at the Universiti Malaysia Sarawak. The ‘Phishing Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking’ consists of 30,000 websites, of which 15,000 are phishing and 15,000 are legitimate. Each website in the data set comes with HTML code, whois info, the URL, and all the files embedded in the web page. This is a goldmine for someone looking to apply machine learning to phishing detection. There are several ways this data set can be used. We could try to detect phishing websites by looking at the URLs and whois information and manually extracting features, as some previous studies have done [1]. Instead, we are going to use the raw HTML code of the web pages to see whether we can effectively combat phishing websites by building a machine learning system. Among URLs, whois information, and HTML code, the last is the hardest for an attacker to obfuscate or change when trying to evade detection, hence the use of HTML code in our system. Another approach is to combine all three sources, which should give better and more robust results, but for the sake of simplicity we will only use HTML code and show that it alone yields effective phishing website detection. One final note on the data set: we will only use 20,000 samples in total because of computing constraints, and we will only consider websites written in English since data for other languages is sparse.
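As a concrete sketch, here is how loading the HTML samples and labels might look. The directory layout (`phishing/` and `legitimate/` folders of one HTML file per site) is a hypothetical assumption for illustration; the actual archive may be organized differently.

```python
import os
import random

def load_corpus(root, per_class=10000, seed=0):
    """Load raw HTML strings and binary labels from a directory tree.

    Assumes a hypothetical layout (not necessarily the dataset's real one):
        root/phishing/*.html   -> label 1
        root/legitimate/*.html -> label 0
    per_class caps each class at 10,000 to get the 20,000-sample subset.
    """
    samples = []
    for label_dir, label in (("phishing", 1), ("legitimate", 0)):
        path = os.path.join(root, label_dir)
        # Sort for reproducibility, then truncate to the per-class cap.
        for name in sorted(os.listdir(path))[:per_class]:
            with open(os.path.join(path, name), encoding="utf-8", errors="ignore") as f:
                samples.append((f.read(), label))
    # Shuffle deterministically so the two classes are interleaved.
    random.Random(seed).shuffle(samples)
    return samples
```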
Byte Pair Encoding for HTML Code
To the untrained eye, HTML code does not read like a simple language, and developers often do not follow good practices when writing it. This makes it hard to parse HTML code and extract words/tokens. Another challenge is the rarity of many words and tokens in HTML code: if a web page uses a special library with a complex name, we might not find that name on any other website. Finally, since we want to deploy our system in the real world, new web pages may use completely different libraries and coding practices that our model has never seen before. All of this makes it hard to use a simple language tokenizer that splits code into tokens on spaces or any other tag or character. Fortunately, we have an algorithm called Byte Pair Encoding (BPE) that splits text into sub-word tokens based on frequency and solves the challenge of unknown words. In BPE, we start by considering each character as a token and iteratively merge the most frequent adjacent pair of tokens. For instance, if a new word “googlefacebook” appears, BPE will split it into “google” and “facebook”, since those words are likely frequent in the corpus. BPE has been widely used in recent deep learning models [2].
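The merge procedure described above can be sketched in a few lines of plain Python. This is a toy illustration of the algorithm on a made-up four-word corpus, not the tokenizer we actually use; it shows how an unseen concatenation like “googlefacebook” falls apart into known subwords.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges: start from characters, repeatedly merge the
    most frequent adjacent pair of tokens across the corpus."""
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:  # every word collapsed to a single token
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for word, count in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Tokenize an unseen word by replaying the learned merges in order."""
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Toy corpus with word frequencies (hypothetical numbers).
corpus = {"google": 50, "facebook": 40, "div": 30, "script": 25}
merges = learn_bpe(corpus, num_merges=30)
print(segment("googlefacebook", merges))  # → ['google', 'facebook']
```

Because “google” and “facebook” each collapse into a single token during training, the never-before-seen word “googlefacebook” is segmented into exactly those two pieces.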
There are several libraries for training BPE on a text corpus. We will use a great one called tokenizers by Huggingface. It is extremely easy to follow the instructions on the GitHub repository of the library. We train BPE with a vocabulary size of 10,000 tokens on top of the raw HTML data. The beauty of BPE is that it automatically separates HTML keywords such as “tag”, “script”, and “div” into individual tokens, even though these tags are mostly written with brackets in an HTML file, e.g., “<script>” or “<div>”.
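A minimal training sketch with the Huggingface tokenizers library could look as follows. The tiny inline corpus is a stand-in assumption; in the real pipeline the iterator would yield the 20,000 raw HTML documents, and the `Whitespace` pre-tokenizer is one reasonable choice among several.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical toy corpus standing in for the raw HTML files.
html_corpus = [
    '<div class="login"><script src="app.js"></script></div>',
    '<div id="main"><p>Enter your password</p></div>',
] * 100

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation runs before applying BPE merges,
# so "<script>" yields the pieces "<", "script", ">".
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=10000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(html_corpus, trainer=trainer)

print(tokenizer.encode('<script>alert("hi")</script>').tokens)
```

Because “script” and “div” occur frequently in the corpus, BPE learns them as whole tokens, which is exactly the behavior described above.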
