What are phishing attacks? 

Phishing attacks have become more prominent and prevalent in recent years. In particular, our research into the cyber threat landscape over the last few months has shown a dramatic increase in the volume of phishing campaigns observed by our customers. 

The most basic way to detect phishing is by using denylists of phishing URLs. However, our research showed that, in many cases, the lifetime of phishing URLs is less than 24 hours, which renders the denylist approach largely ineffective.  

At VMware, we use multiple approaches to detect phishing attacks. The one we’ve found to be the most promising uses visual representation of the website to recognize phishing. In this blog post, we’ll discuss how this approach works in greater detail. If you need an overview of the more general idea behind phishing detection using image similarity, visit our previous blog post.

Not every hash function is a cryptographic hash function 

As one part of VMware’s phishing detection, we store information about the visual representation of every analyzed URL: that is, we calculate perceptual hashes of the screenshots of rendered web pages. 

While many hash functions, especially cryptographic hash functions, are constructed to generate very different, random-looking outputs for even slightly different inputs, perceptual hash functions generate similar outputs for similar inputs. This means that for websites with very close visual representations, the hashes will be similar or even equal.
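The contrast can be illustrated with a minimal sketch. It assumes a toy 4×4 grayscale image stored as nested lists and uses a simple average hash (one bit per pixel) as a stand-in for a perceptual hash; the helper names are illustrative and not part of any product API.

```python
import hashlib

def average_hash(pixels):
    """Toy perceptual hash: bit is 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def sha256_bits(pixels):
    """Cryptographic hash of the same pixel data, unpacked into 256 bits."""
    data = bytes(p for row in pixels for p in row)
    digest = hashlib.sha256(data).digest()
    return [(byte >> i) & 1 for byte in digest for i in range(8)]

def hamming(a, b):
    """Number of positions in which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

# A 4x4 grayscale image and a copy with one pixel changed slightly.
image_a = [[10, 10, 200, 200] for _ in range(4)]
image_b = [row[:] for row in image_a]
image_b[0][0] = 12

perceptual_diff = hamming(average_hash(image_a), average_hash(image_b))
crypto_diff = hamming(sha256_bits(image_a), sha256_bits(image_b))
print(perceptual_diff, crypto_diff)
```

In this sketch the average hash is identical for both images, while the SHA-256 digests differ in many bit positions even though only one pixel changed by two gray levels.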

At the same time, the calculation of the image hashes is quite fast, and the compression property of the hash function allows one to compare different hash values at low computational cost.  

There are different hash functions that can be used for this purpose, e.g., Average Hash, Difference Hash, pHash / DCT Hash, or Wavelet Hash. They perform differently depending on the type of distortion between the two compared images.

To show in more detail how these hash functions work on images, let’s take the pHash (or DCT Hash) function as an example; all the functions listed above share the first two processing steps. The hash functions do not operate directly on the original images but first convert the data to a simpler representation in order to increase performance. First, each image is converted to grayscale. Second, the image is scaled down to fit the designated hash size.

Visual Comparison of Phishing Web Page

Conversion of the original screenshot to a scaled-down version and then to its grayscale representation (left to right). This is a real-world example of a Bank of America phishing page.
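The two shared preprocessing steps can be sketched in plain Python. This is an illustration under stated assumptions, not the production pipeline: the image is a nested list of RGB tuples, grayscale conversion uses the common ITU-R BT.601 luma weights, downscaling is a simple block average, and the image dimensions are assumed to be divisible by the target size.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to gray levels (BT.601 luma weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def downscale(gray, size):
    """Shrink a grayscale image to size x size by averaging pixel blocks.

    Assumes the height and width are multiples of `size`.
    """
    block_h = len(gray) // size
    block_w = len(gray[0]) // size
    result = []
    for by in range(size):
        row = []
        for bx in range(size):
            block = [gray[y][x]
                     for y in range(by * block_h, (by + 1) * block_h)
                     for x in range(bx * block_w, (bx + 1) * block_w)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result

# Hypothetical 4x4 "screenshot": left half white, right half black.
white, black = (255, 255, 255), (0, 0, 0)
screenshot = [[white, white, black, black] for _ in range(4)]

gray = to_grayscale(screenshot)
thumbnail = downscale(gray, 2)
print(thumbnail)
```

A real implementation would typically rely on an image library with a higher-quality resampling filter; the block average is only meant to make the idea concrete.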

In the case of pHash, we perform scaling not only to fit the hash size, but also to reduce the computational cost of the algorithm. After downsizing the image (usually to 32×32 pixels), pHash calculates the discrete cosine transform (DCT) of the image data. This is a data transformation that is commonly used in image compression techniques, such as JPEG. pHash computes the different frequencies present in the image, assigns a scalar value to each of them, and arranges these values in a new 32×32 matrix, with frequencies ordered from lowest to highest (meaning the top-left entry gives information about the lowest frequencies of the image and the bottom-right entry about the highest).
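To make the frequency ordering concrete, here is a minimal pure-Python sketch of a two-dimensional DCT (an unnormalized DCT-II applied to rows, then to columns). The function names are illustrative, and a real implementation would normally use an optimized library routine instead.

```python
import math

def dct_1d(values):
    """Unnormalized one-dimensional DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(values))
            for k in range(n)]

def dct_2d(matrix):
    """Two-dimensional DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed [vertical freq][horizontal freq].
    return [list(row) for row in zip(*cols)]

# A flat image: all of its energy ends up in the top-left coefficient.
flat = [[1.0] * 4 for _ in range(4)]
flat_coeffs = dct_2d(flat)

# Vertical stripes: energy moves to higher horizontal frequencies (first row).
stripes = [[0.0, 255.0, 0.0, 255.0] for _ in range(4)]
stripes_coeffs = dct_2d(stripes)

print(round(flat_coeffs[0][0], 6))
print([round(c, 1) for c in stripes_coeffs[0]])
```

For the flat image only the top-left coefficient is non-zero, while the striped image also produces large values further to the right in the first row, that is, at higher horizontal frequencies.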

In the next step, the DCT matrix is further reduced by retaining only the top-left 8×8 pixel values. This can be done because the information on higher frequencies is typically not very meaningful. Afterwards, the average across these values is computed. However, the top-left value is excluded from this calculation, as it contains information about solid colors and its value can differ significantly from the others, which would skew the average. This will also ensure that this so-called flat image information will not be part of the calculated hash. As a result, large portions of solid color won’t influence the hash value, making it more robust against image manipulations like cropping, in which solid parts are removed. 
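The reduction and averaging step might look like the following sketch. The matrix values are made up for illustration, and a 2×2 block is used instead of 8×8 to keep the numbers easy to check by hand.

```python
def low_frequency_block(dct, size=8):
    """Keep only the top-left size x size coefficients of a DCT matrix."""
    return [row[:size] for row in dct[:size]]

def mean_without_dc(block):
    """Average of all coefficients except the top-left (flat color) term."""
    values = [v for row in block for v in row]
    return sum(values[1:]) / (len(values) - 1)

# Hypothetical 3x3 DCT matrix with a dominant top-left value.
dct = [[1000.0, 2.0, 0.1],
       [4.0,    6.0, 0.2],
       [0.3,    0.1, 0.0]]

block = low_frequency_block(dct, size=2)
average = mean_without_dc(block)
naive_average = sum(v for row in block for v in row) / 4

print(average, naive_average)  # 4.0 253.0
```

Including the top-left term pushes the average far above every other coefficient, which is the skew that excluding it avoids.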

After calculating the average of the DCT values, the hash value is computed. In this example, the hash value is 64 bits in size because an 8×8 DCT matrix was retained. Each bit of the hash corresponds to one field of the DCT matrix; it is set to 0 if the value in the matrix is below the average and to 1 if it is above. Finally, to compare two images for similarity, the hashes of both images are computed and compared. Their Hamming distance, the number of bit positions in which the two hashes differ, can then be used to express the difference between the two images.
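A minimal sketch of the bit assignment and the comparison follows. It packs the bits into a Python integer in row-major order; treating a value exactly equal to the average as 0 is an assumption, since that case is not specified above.

```python
def hash_from_block(block, average):
    """Build an integer hash: one bit per coefficient, 1 if above the average."""
    bits = 0
    for row in block:
        for value in row:
            bits = (bits << 1) | (1 if value > average else 0)
    return bits

def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")

# Small 2x2 example (a real pHash would use an 8x8 block and yield 64 bits).
example_hash = hash_from_block([[9.0, 1.0], [5.0, 7.0]], average=4.0)
print(bin(example_hash))  # 0b1011

# Two hypothetical 64-bit hashes that differ in a single bit.
distance = hamming_distance(0xF0F0F0F0F0F0F0F0, 0xF0F0F0F0F0F0F0F1)
print(distance)  # 1
```

Because the comparison is a single XOR followed by a bit count, checking one hash against many stored hashes stays cheap.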

Image hashing vs. URL denylists 

Every day, VMware’s Network Detection and Response platform uses Network Traffic Analysis to evaluate more than 175,000 web pages to protect customers from phishing and other web-based threats. At the same time, fewer than 25,000 different visual representations are observed. Focusing only on phishing attacks, the VMware NDR platform recognizes about 1,500 unique URLs per day, of which only about 500 have a distinct visual representation.
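The gap between the number of URLs and the number of visual representations can be pictured as a grouping of URLs by their image hash. The URLs and hash values below are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical (url, perceptual hash) pairs observed in one day.
observations = [
    ("http://login-a.example/signin", 0xF0F0F0F0F0F0F0F0),
    ("http://login-b.example/signin", 0xF0F0F0F0F0F0F0F0),
    ("http://login-c.example/verify", 0xF0F0F0F0F0F0F0F0),
    ("http://other.example/account", 0x0123456789ABCDEF),
]

# Group URLs that render to the same visual representation.
urls_by_hash = defaultdict(list)
for url, page_hash in observations:
    urls_by_hash[page_hash].append(url)

print(len(observations), "URLs,", len(urls_by_hash), "visual representations")
```

Here four distinct URLs collapse into two visual representations, so a list keyed by hash stays much smaller than a list keyed by URL.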

Conclusion 

For phishing detection, perceptual hashing enables comparison of the visual representation of web pages. Keeping a set of hashes for well-known phishing campaigns allows us to apply a denylist approach that detects phishing pages regardless of the underlying code of the web page or the domain on which it is hosted.
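A hash-based denylist lookup could be sketched as follows. The campaign labels, hash values, and the distance threshold of 6 bits are all assumptions chosen for the example, not values used by the platform.

```python
def match_campaign(page_hash, known_hashes, max_distance=6):
    """Return (label, distance) of the closest known hash, or None if too far."""
    best = None
    for known_hash, label in known_hashes.items():
        distance = bin(page_hash ^ known_hash).count("1")
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (label, distance)
    return best

# Hypothetical denylist of perceptual hashes for known phishing campaigns.
known_hashes = {
    0xFFFF0000FFFF0000: "hypothetical-campaign-a",
    0x00FF00FF00FF00FF: "hypothetical-campaign-b",
}

near_match = match_campaign(0xFFFF0000FFFF0003, known_hashes)
no_match = match_campaign(0x0F0F0F0F0F0F0F0F, known_hashes)
print(near_match)  # ('hypothetical-campaign-a', 2)
print(no_match)    # None
```

A linear scan is enough for a sketch; with a large denylist, an index structure suited to Hamming-distance search would be a natural refinement.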

Although the simple principle of using perceptual hashes to cluster phishing pages is already quite effective, there are more sophisticated methods based on artificial intelligence that can be used to improve image-based phishing detection. We’ll explore this in upcoming blog articles.