Theses Doctoral

Detection, Triage, and Attribution of PII Phishing Sites

Roellke, Dennis

Stolen personally identifiable information (PII) can be abused to perform a multitude of crimes in the victim’s name. For instance, credit card information can be used in drug business, Social Security Numbers and health ID’s can be used in insurance fraud, and passport data can be used for human trafficking or in terrorism. Even Information typically considered publicly available (e.g. name, birthday, phone number, etc.) can be used for unauthorized registration of services and generation of new accounts using the victim’s identity (unauthorized account creation). Accordingly, modern phishing campaigns have outlived the goal of account takeover and are trending towards more sophisticated goals.

While criminal investigations in the real world evolved over centuries, digital forensics is only a few decades into the art. In digital forensics, threat analysts have pioneered the field of enhanced attribution - a study of threat intelligence that aims to find a link between attacks and attackers. Their findings provide valuable information for investigators, ultimately bolster takedown efforts and help determine the proper course of legal action. Despite an overwhelming offer of security solutions today suggesting great threat analysis capabilities, vendors only share attack signatures and additional intelligence remains locked into the vendor’s ecosystem. Victims often hesitate to disclose attacks, fearing reputation damage and the accidental revealing of intellectual property. This phenomenon limits the availability of postmortem analysis from real-world attacks and often forces third-party investigators, like government agencies, to mine their own data.

In the absence of industry data, it can be promising to actively infiltrate fraudsters in an independent sting operation. Intuitively, undercover agents can be used to monitor online markets for illegal offerings and another common industry practice is to trap attackers in monitored sandboxes called honeypots. Using honeypots, investigators lure and deceive an attacker into believing an attack was successful while simultaneously studying the attacker’s behavior. Insights gathered from this process allow investigators to examine the latest attack vectors, methodology, and overall trends. For either approach, investigators crave additional information about the attacker, such that they can know what to look for. In the context of phishing attacks, it has been repeatedly proposed to "shoot tracers into the cloud", by stuffing phishing sites with fake information that can later be recognized in one way or another. However, to the best of our knowledge, no existing solution can keep up with modern phishing campaigns, because they focus on credential stuffing only, while modern campaigns steal more than just user credentials — they increasingly target PII instead.We observe that the use of HTML form input fields is a commonality among both credential stealing and identity stealing phishing sites and we propose to thoroughly evaluate this feature for the detection, triage and attribution of phishing attacks. This process includes extracting the phishing site’s target PII from its HTML <label> tags, investigating how JavaScript code stylometry can be used to fingerprint a phishing site for its detection, and determining commonalities between the threat actor’s personal styles.

Our evaluation shows that <input> tag identifiers, and <label> tags are the most important features for this machine learning classification task, lifting the accuracy from 68% without these features to up to 92% when including them. We show that <input> tag identifiers and code stylometry can also be used to decide if a phishing site uses cloaking. Then we propose to build the first denial-of-phishing engine (DOPE) that handles all phishing; both Credential Stealing and PII theft. DOPE analyzes HTML <label> tags to learn which information to provide, and we craft this information in a believable manner, meaning that it can be expected to pass credibility tests by the phisher.


  • thumnail for Roellke_columbia_0054D_17596.pdf Roellke_columbia_0054D_17596.pdf application/pdf 1.84 MB Download File

More About This Work

Academic Units
Computer Science
Thesis Advisors
Stolfo, Salvatore
Ph.D., Columbia University
Published Here
December 7, 2022