DGA

Posted on Apr 11, 2024 by

 107

What Is DGA?

A Domain Generation Algorithm (DGA) produces large numbers of pseudo-random domain names from inputs such as character sequences, timestamps, dictionary words, or other predefined values. Malware in centralized botnets uses these generated domains to locate and communicate with its command and control (C&C) servers while evading detection by domain name blacklists.

Harm of DGA

As Internet technologies evolve, malware has proliferated and now poses a significant threat to cybersecurity; it is the primary tool for cybercriminals seeking illicit gains online. Illicit activities such as pornography, gambling, and fraud continue to thrive, and the ability to generate illegitimate domain names at will makes traditional blacklisting ineffective at protecting enterprise and community networks. Domain Generation Algorithms (DGAs) are the root of this problem. Despite efforts by law enforcement, technology firms, and hosting providers to block illegitimate domains, malware keeps using DGAs to generate new random domain names, which keeps illicit websites reachable for command and control or data exchange.

Malicious domain names are generated rapidly and in large numbers, which makes them hard to block.

Using a DGA, attackers can generate thousands of such domain names every day. Even with blacklists configured on network security devices, blocking all of them is practically impossible.

The significant level of randomness inherent in Domain Generation Algorithms (DGAs) poses considerable challenges for detection.

Various malware families use DGAs to generate vast numbers of pseudo-random domain names. Although the names look random, they are fully determined by the algorithm and its seed, so the same list can be regenerated at will. Most generated domain names are never registered; only a fraction are registered so that infected hosts can contact servers to retrieve data or carry out malicious activities. Moreover, when one domain name is blocked, attackers quickly register another from the DGA-generated list. As a result, network security devices have great difficulty identifying and neutralizing the many malicious domain names in circulation.

Malware authors use continuous resolution, camouflage, and lurking tactics to evade detection.

Most DGA domain names never resolve on the Internet, because registering such vast quantities is impractical. Instead, attackers use the same seed and algorithm as the malware to generate an identical domain name list, and then register only a subset of it for their command and control (C&C) servers. The malware keeps resolving names from the list until it finds an available C&C server, which makes its operations hard to block.


DGA Classification

By Seed

The seed is the key input parameter that a Domain Generation Algorithm (DGA) uses to generate domain names. Different seeds yield different DGA domain names, and seeds can be drawn from many kinds of input, such as dates, trending words from social networks, random numbers, and dictionary words. The DGA builds a character prefix from the seed and appends a top-level domain (TLD) to produce the final algorithmically generated domain (AGD).
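
The seed-to-AGD flow described above can be sketched in a few lines, for example to pre-calculate candidate names for a blocklist. This is only an illustration under stated assumptions: the function name generate_domains, the seed strings, the use of SHA-256, and the 12-character prefix are all hypothetical choices and do not reproduce any real malware family.

```python
import hashlib
from datetime import date


def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Derive `count` pseudo-random domain names from a seed string and a date."""
    domains = []
    for index in range(count):
        # The same seed, date and index always hash to the same digest.
        material = f"{seed}-{day.isoformat()}-{index}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Use the first 12 hex characters as the prefix, then append the TLD.
        domains.append(digest[:12] + tld)
    return domains


day = date(2024, 4, 11)
# Anyone who knows the seed and the date computes exactly the same list.
assert generate_domains("example-seed", day) == generate_domains("example-seed", day)
# A different seed yields a different list.
print(generate_domains("example-seed", day))
print(generate_domains("other-seed", day))
```

Because the output depends only on the seed and the date, a defender who recovers both can compute the same list as the attacker.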

Seeds are characterized along two properties: time dependence and determinism. Time-dependent seeds incorporate temporal data, such as the system time of the compromised host or the date field of an HTTP response. Deterministic seeds rely on inputs that can be computed in advance, which is the case for most mainstream DGAs and allows their AGDs to be pre-calculated. However, certain DGAs incorporate inputs that cannot be predicted. For example, the infamous malware Bedep uses foreign exchange reference rates published by the European Central Bank (ECB) as one of its seeds. Torpig, on the other hand, uses trending keywords from a prominent social networking site as seeds, so that its domain names can only be registered within a narrow time window.

Based on these two seed properties, DGAs are classified into four categories: time-dependent and deterministic (TDD), time-dependent and non-deterministic (TDN), time-independent and non-deterministic (TIN), and time-independent and deterministic (TID).
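
The four labels are simply the combinations of two yes/no properties. A minimal sketch, in which the helper name classify_seed is hypothetical:

```python
def classify_seed(time_dependent: bool, deterministic: bool) -> str:
    """Return the TDD/TDN/TID/TIN label for a DGA's seed properties."""
    time_part = "TD" if time_dependent else "TI"
    determinism_part = "D" if deterministic else "N"
    return time_part + determinism_part


# A seed built only from the current date can be computed in advance.
print(classify_seed(time_dependent=True, deterministic=True))   # TDD
# A seed that mixes the date with data that is not yet published.
print(classify_seed(time_dependent=True, deterministic=False))  # TDN
```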

By Generation Scheme

DGAs use different generation schemes, each with its own technique for constructing domain names:

Arithmetic-based: This approach computes a sequence of values that are mapped to ASCII codes to form the DGA domain name. It is the most widely used scheme because of its simplicity.

Hash-based: Domain names are derived from hexadecimal hash values generated by hashing algorithms like MD5 or SHA-256. This scheme enhances unpredictability and complexity in domain generation.

Wordlist-based: Utilizing a predefined dictionary, this method selects words to construct domain names, reducing the inherent randomness of character selection. These dictionaries are often embedded within malicious software or sourced from publicly available resources.

Permutation-based: The characters of an initial domain name are rearranged to produce multiple variations of it. This technique introduces diversity while keeping a common underlying structure.
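
The four schemes can be compared side by side with toy generators. The sketch below is illustrative only: every function name, constant, word list, and TLD is an assumption made for the example, and none of it corresponds to a specific malware family.

```python
import hashlib
import string
from itertools import islice, permutations

LETTERS = string.ascii_lowercase


def arithmetic_domain(seed: int, length: int = 10, tld: str = ".net") -> str:
    """Arithmetic-based: step a linear congruential generator and map values to letters."""
    state = seed
    chars = []
    for _ in range(length):
        state = (state * 1103515245 + 12345) % 2**31
        chars.append(LETTERS[(state >> 16) % 26])
    return "".join(chars) + tld


def hash_domain(seed: str, length: int = 12, tld: str = ".info") -> str:
    """Hash-based: use the hexadecimal digest of the seed as the name."""
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()[:length] + tld


def wordlist_domain(seed: int, words: list[str], parts: int = 2, tld: str = ".com") -> str:
    """Wordlist-based: pick dictionary words by index instead of single characters."""
    chosen = []
    for _ in range(parts):
        chosen.append(words[seed % len(words)])
        seed //= len(words)
    return "".join(chosen) + tld


def permutation_domains(base: str, limit: int = 3, tld: str = ".org") -> list[str]:
    """Permutation-based: rearrange the characters of one base name."""
    return ["".join(p) + tld for p in islice(permutations(base), limit)]


print(arithmetic_domain(2024))
print(hash_domain("2024-04-11"))
print(wordlist_domain(5, ["red", "blue", "green"]))  # greenblue.com
print(permutation_domains("abc"))  # ['abc.org', 'acb.org', 'bac.org']
```

Note how the wordlist output looks far less random than the arithmetic and hash outputs, which is why wordlist-based names are harder to spot with purely statistical checks.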

DGA Detection Methods

Supervised Learning

In supervised learning, common algorithms like decision trees and random forests are utilized to identify DGA domain names. These algorithms rely on labeled datasets for training, enabling them to classify domain names effectively based on predefined features.
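
As an illustration of what "predefined features" might look like, the sketch below turns a domain name into a small numeric feature vector and pairs it with a label. The feature set, the helper name extract_features, and the two sample domains are assumptions chosen for the example; real detectors typically use many more features and far larger datasets.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def extract_features(domain: str) -> dict[str, float]:
    """Compute simple lexical features from the leftmost label of a domain name."""
    label = domain.lower().split(".")[0]  # simplification: ignores subdomain structure
    if not label:
        raise ValueError("domain has an empty first label")
    counts = Counter(label)
    size = len(label)
    # Shannon entropy of the character distribution, in bits per character.
    entropy = sum((n / size) * math.log2(size / n) for n in counts.values())
    letters = [c for c in label if c.isalpha()]
    return {
        "length": float(size),
        "digit_ratio": sum(c.isdigit() for c in label) / size,
        "vowel_ratio": sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0,
        "entropy": entropy,
    }


# Hypothetical labeled samples: 0 = normal, 1 = DGA.
labeled = [("google.com", 0), ("xkq7vz2mwp9rtb4j.com", 1)]
X = [list(extract_features(name).values()) for name, _ in labeled]
y = [label for _, label in labeled]
print(X, y)
```

The resulting X and y have the shape that tree-based classifiers expect as training input, for example the fit(X, y) method of scikit-learn's RandomForestClassifier.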

Unsupervised Learning

Unlike supervised learning, unsupervised learning models such as K-means do not require labeled datasets for training, which is their main advantage. K-means, a widely used clustering algorithm, groups domain names by the similarity of their features, so DGA domain names can be separated from normal ones without labeled data.
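
A minimal pure-Python sketch of K-means on two assumed features (label length and character entropy) shows the idea. The sample domains, the feature choice, and the deterministic initialization from the first k points are simplifications made for the example; the features are also left unscaled, so length dominates the distance here.

```python
import math
from collections import Counter


def features(domain: str) -> tuple[float, float]:
    """Map a domain to (label length, character entropy in bits)."""
    label = domain.split(".")[0]
    counts = Counter(label)
    size = len(label)
    entropy = sum((n / size) * math.log2(size / n) for n in counts.values())
    return (float(size), entropy)


def kmeans(points, k=2, iterations=20):
    """Plain K-means; the first k points serve as the initial centroids."""
    centroids = list(points[:k])
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids


domains = [
    "google.com",
    "xkq7vz2mwp9rtb4j.com",
    "github.com",
    "p3zr8wq1ynv6kd0s.com",
    "python.org",
]
assignment, centroids = kmeans([features(d) for d in domains])
for name, cluster in zip(domains, assignment):
    print(cluster, name)
```

The algorithm only produces unlabeled groups; an analyst or a follow-up rule still has to decide which cluster is the suspicious one.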

Registration Status

The registration status of a domain name, including its registration date, expiration date, and payment status, provides valuable insights into its nature. By analyzing these registration details on business platforms, high-risk domain names can be profiled based on criteria such as registration dates and payment amounts, aiding in their identification.

Threat Intelligence

Utilizing threat intelligence platforms and DGA datasets facilitates the detection of known DGA domain names, enhancing cybersecurity measures against malicious activities.

Entropy-based Analysis

Entropy, a measure of uncertainty in random variables, serves as a key metric for distinguishing DGA domain names. Typically, domain names generated using random algorithms exhibit higher entropy than regular domain names, enabling their classification based on entropy levels.
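
For a string whose characters occur with relative frequencies p_i, the Shannon entropy is H = -sum(p_i * log2(p_i)), measured in bits per character. The sketch below computes it for a few sample names; the function name and the samples are chosen for illustration, and any cut-off value used to separate DGA names from normal ones would have to be tuned on real data.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Equivalent to -sum(p * log2(p)) with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


for name in ("google", "wikipedia", "xkq7vz2mwp9rtb4j"):
    print(f"{name}: {shannon_entropy(name):.2f} bits")
```

Entropy grows with the number of distinct characters, so it also tends to grow with name length; comparing names of similar length gives a fairer picture.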

Hidden Markov Model

The hidden Markov model (HMM) analyzes the transition probabilities between characters in domain name strings to classify them. DGA domain names, characterized by high randomness, exhibit statistical features that differ from those of normal domain names, which makes this method effective for detection.
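
The idea of scoring a name by its character transition probabilities can be sketched with a plain first-order Markov chain over visible characters, which is a deliberately simpler stand-in for the model named above. The tiny training list, the add-one smoothing, and the function names are assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def train_transitions(names):
    """Count character-to-character transitions, with ^ and $ as start and end markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        padded = "^" + name + "$"
        for current, following in zip(padded, padded[1:]):
            counts[current][following] += 1
    return counts


def average_log_likelihood(name, counts):
    """Mean log-probability per transition; lower values mean less familiar names."""
    padded = "^" + name + "$"
    vocabulary = len(ALPHABET) + 1  # possible next symbols: alphabet plus end marker
    total = 0.0
    for current, following in zip(padded, padded[1:]):
        row = counts.get(current, {})
        row_total = sum(row.values())
        # Add-one smoothing so unseen transitions keep a small non-zero probability.
        probability = (row.get(following, 0) + 1) / (row_total + vocabulary)
        total += math.log(probability)
    return total / (len(padded) - 1)


benign = ["google", "facebook", "youtube", "wikipedia", "amazon",
          "twitter", "instagram", "linkedin", "netflix", "microsoft"]
model = train_transitions(benign)
for candidate in ("facebook", "goodbook", "xkq7vz2mwp9r"):
    print(candidate, round(average_log_likelihood(candidate, model), 3))
```

Names built from transitions that are common in the training set score higher than random-looking strings; a realistic model would need a much larger corpus of normal domain names.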

Deep Learning Models

Deep learning models leverage neural networks trained on both known DGA and normal domain names. Despite being less transparent and more complex than traditional models, deep learning models have demonstrated superior effectiveness in accurately identifying DGA domain names, leading to their widespread adoption in cybersecurity products.