This post and the full series were written jointly with Ana Isabel Prieto, Sergio Villanueva and Luis Búrdalo.
The Internet opens up a world of possibilities for personal development and for carrying out many everyday activities, and it has become an indispensable part of today's society. The network hosts hundreds of millions of domains, although unfortunately not all of them are safe. Malicious domains are those used by cybercriminals to connect to command and control servers, steal credentials through phishing campaigns or distribute malware.
In many cases, these domains share lexical characteristics that stand out at first glance. For example, TLDs such as xyz, top, space, info and email are relatively common in phishing campaigns. Similarly, attackers use DGA (Domain Generation Algorithm) techniques to create random-looking domains, such as istgmxdejdnxuyla[.]ru, to exfiltrate information. Other striking properties include excessive hyphens, many domain levels, or attempts to impersonate legitimate organizations, as in amazon.ytjksb[.]com and amazon.getfreegiveaway[.]xyz.
With digitization on the rise, the users of an organization connect to thousands of different domains, which makes it difficult to spot malicious ones among so much legitimate traffic. A medium-sized organization logs traffic to between 3,000 and 5,000 domains per day, a volume that makes manual analysis unfeasible. Traditionally, part of this detection process is automated with pattern-matching rules, for example rules that flag domains whose TLD (Top Level Domain) is common in phishing campaigns, domains that contain the name of a large company without belonging to it, or domains longer than X characters.
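As an illustration, the kind of pattern-matching rules described above could be sketched as follows. This is a minimal sketch, not the rule set of any real product: the TLD list is taken from the examples in this post, while the `BRANDS` mapping, the `MAX_LENGTH` threshold and the `rule_hits` function are hypothetical names and values chosen for the example.

```python
from __future__ import annotations

# TLDs mentioned in this post as common in phishing campaigns.
SUSPICIOUS_TLDS = {"xyz", "top", "space", "info", "email"}
# Hypothetical mapping: brand name -> its legitimate registered domain.
BRANDS = {"amazon": "amazon.com"}
# Hypothetical stand-in for the "more than X characters" rule.
MAX_LENGTH = 30


def rule_hits(domain: str) -> list[str]:
    """Return the names of the rules that a domain triggers."""
    domain = domain.lower().rstrip(".")
    labels = domain.split(".")
    hits = []

    # Rule 1: the TLD appears in the list of suspicious TLDs.
    if labels[-1] in SUSPICIOUS_TLDS:
        hits.append("suspicious_tld")

    # Rule 2: a brand name appears, but the domain is neither the
    # legitimate domain nor one of its subdomains.
    for brand, legit in BRANDS.items():
        is_legit = domain == legit or domain.endswith("." + legit)
        if brand in domain and not is_legit:
            hits.append("brand_impersonation")
            break

    # Rule 3: the full domain is longer than the threshold.
    if len(domain) > MAX_LENGTH:
        hits.append("too_long")

    return hits


if __name__ == "__main__":
    for d in ("amazon.getfreegiveaway.xyz", "amazon.ytjksb.com", "www.amazon.com"):
        print(d, rule_hits(d))
```

Rules like these are easy to audit, but every list and threshold has to be maintained by hand, which is part of the motivation for the Machine Learning approach discussed next.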
In recent years, Artificial Intelligence techniques, especially Machine Learning algorithms, have become popular for cybersecurity tasks such as the detection of malicious domains. In this and the following articles we will see an example of how these techniques can drastically reduce the amount of information that security analysts have to process manually and automate the detection task as far as possible.
Before moving on to the next article, it is worth recalling the difference between supervised and unsupervised learning algorithms. Broadly speaking, supervised algorithms require a set of previously labeled data to be trained to solve the problem, while unsupervised algorithms do not require such prior labeling, since they base their operation on the search for patterns already existing in the data.
In the case of this post, domain detection is carried out using unsupervised algorithms, so there is no need for a training dataset with domains labeled as malicious or not.
To apply unsupervised classification algorithms, it is essential to build a robust dataset, with a sufficiently large and varied number of domains, together with a series of characteristics or variables that describe them.
For the characterization of domains, a series of metrics will be calculated that define the lexical characteristics of each complete domain, its Second Level Domain (SLD) and its Top Level Domain (TLD). These lexical characteristics can be grouped into the following categories:
- Counts of the character types that appear in the full domain, SLD and TLD. For example: number of letters, digits, special characters, periods, hyphens, uppercase and lowercase characters, etc.
- Length of the whole domain, SLD and TLD.
- Ratios between different characteristics already calculated, which allow these characteristics to be related to each other. For example: the ratio between the number of digits and letters of a given domain.
- Shannon entropy. This metric measures how random the characters of a word are, in this case of a domain name. Computer-generated malicious domains are usually very random, so they tend to have a high Shannon entropy compared with the lower values typical of legitimate domains.
- Presence of certain words in the domain name. For example, attackers often include the word “login” in domains designed to capture user credentials, so it is important to identify when it appears.
- Presence of the TLD in lists of the most common and suspicious TLDs.
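To make these categories more concrete, the following sketch computes a few of the features listed above for a single domain. It is only an illustration under stated assumptions: the function names, the feature names, the keyword list and the TLD list are hypothetical (the TLDs are the examples from this post), and the SLD is taken naively as the label just before the TLD, which ignores multi-part suffixes such as co.uk.

```python
import math
from collections import Counter

# Hypothetical lists for the example; real lists would be much longer.
SUSPICIOUS_TLDS = {"xyz", "top", "space", "info", "email"}
KEYWORDS = ("login",)


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_features(domain: str) -> dict:
    """Compute a small subset of the lexical features for one domain."""
    domain = domain.strip().rstrip(".")
    labels = domain.split(".")
    tld = labels[-1].lower()
    # Naive assumption: the SLD is the label immediately before the TLD.
    sld = labels[-2].lower() if len(labels) >= 2 else ""

    letters = sum(ch.isalpha() for ch in domain)
    digits = sum(ch.isdigit() for ch in domain)

    return {
        # Lengths of the full domain, SLD and TLD.
        "length": len(domain),
        "sld_length": len(sld),
        "tld_length": len(tld),
        # Character counts.
        "n_letters": letters,
        "n_digits": digits,
        "n_dots": domain.count("."),
        "n_hyphens": domain.count("-"),
        # Ratio between two counts already calculated.
        "digit_letter_ratio": digits / letters if letters else 0.0,
        # Randomness of the SLD.
        "sld_entropy": shannon_entropy(sld),
        # Keyword and TLD list membership.
        "has_keyword": any(k in domain.lower() for k in KEYWORDS),
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
    }


if __name__ == "__main__":
    for d in ("istgmxdejdnxuyla.ru", "google.com", "login-secure.example.xyz"):
        print(d, lexical_features(d))
```

Running it on a DGA-style domain such as the one shown earlier and on a well-known legitimate domain gives a noticeably higher SLD entropy for the former. Entropy also tends to grow with the length of the name, so it is better treated as one signal among several than as a detector on its own.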
Having characterized the set of domains in the database with their respective lexical variables, the following post will present in detail the different unsupervised classification algorithms that will be used to detect which domains have significantly different lexical characteristics from the rest and, therefore, can be labeled as anomalous.