Identifying Malicious URLs

1. Introduction

These days, more and more cyber threats are being delivered via web pages. Most commonly, these pages are reached from URLs contained within phishing emails or in callbacks made by malware. However, those are not the only means of delivering a malicious web page to a user. Recently, attackers have also been proliferating exploits within benign-looking advertisements, using the wide reach of online advertising networks. From an information security standpoint, it is therefore important to study means by which such malicious URLs can be identified so that threats can be effectively contained.

Google Safe Browsing warning a user about a potentially malicious site


2. Discussion on identification methodologies

There are two main classes of methodologies for identifying malicious URLs: (1) traditional signature-based methods which detect malicious URLs based on a database of known threats and (2) heuristic methods which are capable of detecting even unknown threats using certain features known to exist predominantly in malicious URLs.

2.1 Signature-based methods

2.1.1 Blacklists

Traditional signature-based methods are highly accurate and simple to implement. For instance, one of the easiest ways to do this may be to maintain a database of known malicious URLs and/or the relevant malicious content (hashed) so that a request may be blocked whenever its URL or content matches an entry in the signature database.

Example of a signature database
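To make the idea concrete, here is a minimal sketch of such a hash-based lookup. The database is just an in-memory set here, and the blacklisted entry is a made-up placeholder; a real deployment would use persistent storage fed by threat intelligence.

import hashlib

def sha1(data):
    return hashlib.sha1(data.encode('utf-8')).hexdigest()

# Illustrative in-memory signature database; the entry below is a placeholder.
blacklisted_url_hashes = {sha1("http://malicious.example/dropper.exe")}
blacklisted_content_hashes = set()

def is_blacklisted(url, content=None):
    # Block the request if either the URL or the (hashed) page content matches a known signature.
    if sha1(url) in blacklisted_url_hashes:
        return True
    if content is not None and sha1(content) in blacklisted_content_hashes:
        return True
    return False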

Of course, a smart adversary will be able to bypass this detection scheme simply by making small but inconsequential changes to the URL (e.g. appending queries to the URL) and/or its content (e.g. adding a timestamp in the page content). Therefore, one might suggest using a pattern-matching approach instead to identify blacklisted URLs from the database. However, pattern-matching comes with its own set of tradeoffs: the creation of rules is more challenging and error-prone, and the matching process itself is much more computationally expensive.
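For comparison, a pattern-matching variant might look like the sketch below. The regular expressions are entirely made up for illustration, and each rule has to be written and maintained by hand - which is where the extra effort and computational cost come in.

import re

# Hypothetical hand-written patterns; real rules would come from threat intelligence.
blacklist_patterns = [
    re.compile(r"^https?://[^/]*\.badsite\.example/"),   # any path on a known bad domain
    re.compile(r"/dropper\.exe(\?.*)?$"),                # known payload name, ignoring appended queries
]

def matches_blacklist_pattern(url):
    return any(pattern.search(url) for pattern in blacklist_patterns)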

Since maintaining a blacklist is troublesome, one may consider delegating the task to external blacklists if one is willing to stomach the risk of trusting external vendors. Here are some potentially helpful blacklists:



Besides the uphill task of building a blacklist, we know that a blacklist cannot detect unknown threats. Hence, a blacklist is unlikely to stop a sophisticated attacker, and a holistic detection system must still incorporate heuristic methods to stop more advanced attacks.

2.1.2 Whitelists

Before we condemn the signature-based approach, I would like to suggest a potentially useful role for signatures - as a whitelist instead, to lower the rate of false positives (FPs) flagged by a system - a common challenge faced by companies in the information security industry. In this case, one need not fight evasive adversaries to generate up-to-date and accurate threat signatures. Instead, one may collaborate with end-users and content providers to whitelist URLs that are deemed trustworthy.

Example of a whitelist

Though the task of building a whitelist may seem daunting at first, it does not have to be. To begin with, one may start with a small whitelist, since having a small whitelist can only be better than having none. Then, through collaborations and discoveries, one builds up the whitelist iteratively - possibly taking one's time in doing so, since there is no race against threat actors. This iterative process might even be easier than building a blacklist, because legitimate content providers are not seeking to evade detection. On the contrary, they want their URLs to be recognised by the system so that users are not denied access to their services. Therefore, developing a whitelist does not have to be laborious, and the payoff might be extremely rewarding in the form of fewer FPs.
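As a sketch, the whitelist check itself can be as simple as matching a URL's hostname against a set of trusted domains. The domains below are placeholders; in practice they would come from the collaboration described above.

from urllib.parse import urlparse

# Placeholder entries; real ones come from collaboration with users and content providers.
trusted_domains = {"example.com", "trusted-partner.example"}

def is_whitelisted(url):
    hostname = urlparse(url).hostname or ""
    # Trust the domain itself and any of its subdomains.
    return any(hostname == d or hostname.endswith("." + d) for d in trusted_domains)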

2.1.3 Blacklists + Whitelists

Using both a blacklist and whitelist for signature-based detection allows one to quickly and accurately detect known threats and trusted websites. URLs which are not discernible by the lists should then be further evaluated by heuristic methods to test a URL for certain features known to exist predominantly in malicious URLs. Here's the high-level idea in pseudocode:

if is_blacklisted(url):
    return MALICIOUS
if is_whitelisted(url):
    return CLEAN
return heuristic_analysis(url)

I will further elaborate on some heuristic methods for analysing a URL below.


2.2 Heuristic methods

Heuristic methods are excellent for identifying unknown threats. However, due to the probabilistic nature of such methodologies, they also produce false classifications. In the following subsections, we discuss some of the more telling features that are known to exist predominantly in malicious URLs and that may be used by an intelligent classifier to classify URLs accordingly.

2.2.1 Page Reputation

In research by Choi, Zhu and Lee (2011), link popularity - which is estimated by counting the number of incoming links from other webpages - was used as a proxy measure for page reputation. The researchers found that link popularity is highly discriminative for both malicious URL detection and attack-type identification. For instance, malicious URLs tend to have low link popularity, in contrast to benign - especially popular - websites, which tend to have high link popularity. The link popularity data used in the study was obtained from several search engines, with the data from AltaVista, AllTheWeb and Yahoo producing the more accurate classifications. Today, AltaVista and AllTheWeb both belong to Yahoo, but there are still many alternatives for checking a link's popularity besides Yahoo (e.g. Alexa).

Detection statistics reported in the research by Choi, Zhu and Lee (2011)

Another popular measure of page reputation is page ranking. In fact, page ranking also takes into consideration the link popularity of a page. Generally, the higher the page ranking of a site, the more reputable the site is. Since malicious pages tend to be relatively new and short-lived, they generally have a low or non-existent page ranking (Ranganayakulu, 2013). Just like link popularity, such page ranking data may be obtained from online sources such as Alexa.
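Once link-popularity and page-ranking figures have been obtained from such sources, they might be folded into the classification as simple features. The sketch below assumes the figures have already been fetched; the thresholds are entirely made up for illustration.

def reputation_features(link_popularity, page_rank):
    # link_popularity: number of incoming links reported by a search engine or popularity service
    # page_rank: site ranking from a service such as Alexa (None if the site is unranked)
    return {
        "low_link_popularity": link_popularity < 10,                             # illustrative threshold
        "unranked_or_poorly_ranked": page_rank is None or page_rank > 1000000,   # illustrative threshold
    }

# These booleans would be fed into a classifier (or a simple scoring rule) alongside other features.
features = reputation_features(link_popularity=3, page_rank=None)
suspicion_score = sum(features.values())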

2.2.2 Traffic Analysis

Similar to link popularity and page ranking, traffic analysis gives one an idea of how popular a certain page is - which is then used as a proxy for the page's reputation. One way to do this is to collect URL traffic statistics, where possible, and then use those statistics to determine how popular or reputable a page is. Just like link popularity, a high-traffic URL is generally considered more trustworthy. However, unlike link popularity, this data is not easily available. Also, end-users may have privacy concerns since their traffic will be recorded. Nevertheless, this issue may be mitigated by storing such statistics only in aggregated or hashed form.
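A minimal sketch of the privacy-preserving variant might hash each URL before counting it, so that raw browsing history is never stored. The salt and the in-memory counter below are placeholders for whatever storage the deployment actually uses.

import hashlib
from collections import Counter

SALT = b"deployment-specific-secret"   # placeholder; keeps the hashes from being trivially reversible
traffic_counts = Counter()             # hashed URL -> number of requests seen

def record_visit(url):
    digest = hashlib.sha256(SALT + url.encode("utf-8")).hexdigest()
    traffic_counts[digest] += 1

def popularity(url):
    digest = hashlib.sha256(SALT + url.encode("utf-8")).hexdigest()
    return traffic_counts[digest]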

2.2.3 Jaccard Measure of a URL

In research by Yadav, Reddy and Reddy (2010) on detecting algorithmically generated malicious domain names, the Jaccard Index (JI) was found to be the best-performing measure, achieving 100% detection with 0% false positives in their study. The classifier was even able to identify names generated by Kwyjibo, a publicly available tool that can generate names which are pronounceable yet not in the English dictionary (Yadav, Reddy and Reddy, 2010).

The index is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are sets of bigrams in this context. However, the formula had to be adapted for this specific use, since a word's collection of bigrams can contain duplicates. Also, since the computation steps are not explicitly outlined in the paper, I made a few assumptions about their implementation and came up with a short proof-of-concept script for calculating the Jaccard measure of a URL:


from collections import defaultdict

def populate(db):
    # Populate the database here: db should map each bigram to the list of
    # non-malicious dictionary words containing that bigram.
    pass

def get_bigrams(word):
    # Count the bigrams in a word (duplicates are kept as counts).
    bigrams = defaultdict(int)
    for i in range(len(word) - 1):
        bigrams[word[i:i+2]] += 1
    return bigrams

def get_jaccard_index(a, b):
    # Not the standard way of calculating JI, since a word can contain duplicate
    # bigrams; the counts are therefore included in both numerator and denominator.
    union = sum(a.values()) + sum(b.values())
    intersection = sum(a[bigram] + b[bigram] for bigram in a if bigram in b)
    return float(intersection) / union if union else 0.0

def get_max_ji(test_bigrams, control_words):
    # Score the test word against every candidate control word and keep the best match.
    if not control_words:
        return 0.0
    return max(get_jaccard_index(test_bigrams, get_bigrams(word)) for word in control_words)

def get_jaccard_measure(word):
    test_bigrams = get_bigrams(word)
    # Candidate control words are those sharing at least one bigram with the test word.
    control_words = set()
    for bigram in test_bigrams:
        control_words.update(database[bigram])
    return get_max_ji(test_bigrams, control_words)

def average(s):
    return float(sum(s)) / len(s) if s else 0.0

database = defaultdict(list)
populate(database)

# Split the URL's hostname into its labels and average their Jaccard measures.
url = 'example.com'  # placeholder hostname under test
jaccard_measures = [get_jaccard_measure(test_word) for test_word in url.split('.')]
avg_jaccard_measure = average(jaccard_measures)

Though the methodology is highly accurate, the researchers acknowledged that it requires a good database of non-malicious words along with tolerance for high computation time. If these two factors are not a concern, one may consider using such a measure to detect algorithmically generated URLs.

2.2.4 Static Analysis

According to Hou et al. (2009), attackers often set invalid (negative) content-lengths in the headers of malicious websites as part of buffer overflow exploits. Some malicious websites also contain zero-sized iframes that render a malicious script. In addition, the number of suspicious native JavaScript functions such as escape(), eval(), link(), unescape(), exec() and search() might also indicate whether a website is malicious, since these functions are used in some cross-site scripting and web-based malware distribution activities. Therefore, one may also use these properties as features in the classification of a URL.
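A rough sketch of how such static features might be extracted from a page's headers and source is shown below. The patterns are simplified for illustration; a real implementation would use a proper HTML/JavaScript parser rather than string and regex matching.

import re

SUSPICIOUS_JS_FUNCTIONS = ["escape(", "eval(", "link(", "unescape(", "exec(", "search("]

def static_features(headers, html):
    # headers: dict of HTTP response headers; html: the raw page source.
    content_length = headers.get("Content-Length")
    return {
        "negative_content_length": content_length is not None and int(content_length) < 0,
        "zero_sized_iframe": bool(re.search(r'<iframe[^>]*(width|height)\s*=\s*["\']?0', html, re.I)),
        "suspicious_js_calls": sum(html.count(f) for f in SUSPICIOUS_JS_FUNCTIONS),
    }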

2.2.5 Dynamic Analysis

As attacks become more sophisticated, the only way to detect malicious behaviour in a URL might be to actually pay a visit to the page. This method involves analysing the behaviour of a sandboxed machine during the visit to a suspected URL. In this analysis, one may look for certain behavioural signatures such as:

  • The number of new processes created
  • The amount of data sent (exfiltrated)
  • The presence of unexpected changes to the file system
  • The presence of unexpected changes to the registry (Windows)

Of course, performing dynamic analysis comes with its own set of challenges. For instance, one might need to analyse websites in different environments because the exploit may be targeting a specific operating system or browser plugin. Also, one may wish to anonymise the network address of the VM so as to hide any analysis activity from the adversary. To simplify this process of dynamic analysis, one may consider using the patented MVX technology from FireEye.
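As a sketch, the behavioural signatures listed above could be turned into a simple score computed from whatever report the sandbox produces. The report fields and weights below are hypothetical and would depend entirely on the sandbox in use.

def score_sandbox_report(report):
    # 'report' is a hypothetical dict summarising what the sandbox observed while visiting the URL.
    score = 0
    score += 2 * report.get("new_processes", 0)              # unexpected child processes
    score += report.get("bytes_sent", 0) // (1024 * 1024)    # roughly one point per MB sent out
    score += 3 * len(report.get("filesystem_changes", []))   # unexpected file writes
    score += 3 * len(report.get("registry_changes", []))     # unexpected registry writes (Windows)
    return score

report = {"new_processes": 4, "bytes_sent": 0,
          "filesystem_changes": ["C:\\Users\\victim\\AppData\\evil.exe"], "registry_changes": []}
verdict = "SUSPICIOUS" if score_sandbox_report(report) >= 5 else "NEEDS FURTHER REVIEW"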


2.2.6 Aggregated Analyses

If you are looking for a quick and easy way to profile a URL, you may want to consider trying services such as VirusTotal or URLVoid. These services submit URLs to various anti-virus vendors and URL characterisation tools that use some of the techniques described above, before returning the aggregated analyses to the user. The user may then use the aggregated result (detection ratio/safety reputation) or its individual breakdown as part of his/her analysis.

Screenshot of VirusTotal
Screenshot of URLVoid
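Such a report can also be pulled programmatically. The sketch below assumes VirusTotal's v2 public API and a valid API key; the exact endpoint and response fields may differ for other services or later API versions.

import requests

API_KEY = "YOUR_VIRUSTOTAL_API_KEY"   # placeholder

def virustotal_url_report(url):
    resp = requests.get(
        "https://www.virustotal.com/vtapi/v2/url/report",
        params={"apikey": API_KEY, "resource": url},
    )
    report = resp.json()
    # 'positives' scanners out of 'total' flagged the URL.
    return report.get("positives", 0), report.get("total", 0)

positives, total = virustotal_url_report("http://example.com/")
print("Detection ratio: %d/%d" % (positives, total))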

2.2.7 Other Features

Some other features which are worth mentioning include:

3. Limitations

Despite our best efforts to detect threats, there will always be new techniques to evade such detection. For instance, search engine optimization and link farming are techniques that can be used to bolster a malicious page's ranking/popularity on search engines. The use of HTTPS also prevents a non-intrusive analyst from inspecting a URL's content in transit. WHOIS protection services can likewise mask a domain registrant's information from an analyst. Finally, URL shortening services are blurring the line between URLs for benign sites and URLs for malicious sites.
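One small mitigation for the URL-shortening problem is to expand a shortened link by following its redirects before analysis, as sketched below; the shortened link shown is a placeholder.

import requests

def expand_url(short_url, timeout=5):
    # Follow redirects without downloading the body, then return the final destination URL.
    resp = requests.head(short_url, allow_redirects=True, timeout=timeout)
    return resp.url

# expanded = expand_url("https://bit.ly/xxxxxxx")   # placeholder shortened link
# The expanded URL can then be run through the blacklist/whitelist and heuristic checks above.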

4. Way forward

Cyber criminals will continue to find new ways to mask the telltale signs of malicious websites. However, they will never be able to mask the malicious behaviour of the page, because they depend on those exact behaviours to gain entry, establish persistence and subsequently exfiltrate data from the victim's computer. Therefore, while security services should still use traditional methods of detecting malicious URLs, they should also invest in their dynamic analysis capabilities so as to ensure that they are always a step ahead of attackers.
