Phishing IJCA Paper

International Journal of Computer Applications (0975 – 8887)
Volume 185 – No. 11, May 2023
Phishing Detection Implementation using Databricks and Artificial Intelligence
Dinesh Kalla, Microsoft | Colorado Technical University, Charlotte, NC 28273
Fnu Samaah, Northeastern Illinois University | Harrisburg University, Des Plaines, IL 60016
Sivaraju Kuraku, PhD, Uptycs | University of the Cumberlands, Austin, Texas 78758
Nathan Smith, Colorado Technical University, San Diego, California
ABSTRACT
Phishing is a fraudulent activity that involves tricking people into disclosing
personal or financial information by impersonating a legitimate company or
individual. The increasingly complex nature of phishing has drawn the attention
of criminals, who see it as a profitable and simple way to get sensitive
information. As a result of the negative impact of phishing assaults on both
individuals and companies, efficient detection and prevention measures have been
developed. This paper surveys numerous approaches for detecting and thwarting
phishing attacks. The research introduces the Phishcatch algorithm, which has
shown substantial success in identifying phishing emails and alerting consumers
to fraudulent attempts. Phishcatch studies user behavior on websites and limits
access if any suspicious behavior is found. Phishcatch is a vital instrument in
the battle against phishing attempts, with an accuracy and detection rate of
90%. Furthermore, this article explains the steps in developing, testing and
implementing successful anti-phishing algorithms.

General Terms
Pattern Recognition, Security Awareness, Stemming and Lemmatization, Cyber
Security, Spam Detection, stop words.

Keywords
Phishing, NLTK, Natural Language Processing, Azure Databricks, Spam, Security
Situational Awareness, Credential Theft, Python, Machine Learning, Stemming and
Lemmatization, Naïve Bayes, Artificial Intelligence.

1. INTRODUCTION
Phishing is a cyber-attack that uses deceit to steal sensitive information from
unsuspecting persons, such as passwords or credit card details [1]. This is
usually accomplished by creating a phony website or email that looks authentic
to deceive people into submitting their personal information [1]. Phishing
assaults have been proven to have disastrous outcomes for individuals and
corporations. Phishing is estimated to cost the US economy billions yearly [1].
Companies pay direct costs associated with phishing attacks, such as data
breaches and productivity loss, while consumers bear indirect costs, such as
higher pricing for goods and services [1]. Furthermore, phishing may have severe
consequences for individuals, including financial ruin and mental misery.

In the United States alone, phishing attempts cost more than $5 billion in
damages in 2018, with businesses bearing the brunt of the losses due to the
theft of client data or financial information [1]. Furthermore, these companies
frequently suffer brand harm, which causes customers to switch to competing
products [1]. Fines are imposed on entities such as healthcare institutions that
fail to protect sensitive customer information as required by regulations such
as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) [1].
Even though anybody may be a victim of phishing, people should be aware of the
warning signs of phishing attempts and only supply personal information when
they are confident they are dealing with a trustworthy website or organization.

The Phishcatch algorithm is one approach that can help secure sensitive
information internationally, notably in the United States, where phishing causes
significant financial losses [1]. The Phishcatch algorithm uses a combination of
heuristics and machine learning approaches that are constantly updated to
increase its effectiveness [1]. The Phishcatch algorithm has a significant
benefit in that it does not rely on signatures or known URLs to detect phishing
attempts, making it very successful against new and unknown assaults [12].
Furthermore, the Phishcatch algorithm can detect cross-site scripting (XSS) and
man-in-the-middle (MitM) assaults, both of which are becoming more common [12].
The Phishcatch algorithm, which has proven to be a powerful weapon in the battle
against phishing, has already been used by several significant corporations. The
Phishcatch algorithm's continued growth and enhancement will likely play an
important role in protecting sensitive information in the future.

Businesses that accept online payments or have an extensive client base are more
likely to be targeted by phishers since they have valuable data [13].
Furthermore, firms that have recently been the victim of a security incident or
data breach may be targeted to exploit prospective weaknesses. It is proposed
that the Phishcatch algorithm, which employs Machine Learning and the Natural
Language Toolkit (NLTK) package, be used to increase the application's
performance [3]. Machine learning and artificial intelligence can dramatically
improve the Phishcatch algorithm's capacity to detect and thwart phishing
assaults. The NLTK library is vital in constructing machine learning models [5].
Emotet is one of the most recent growing cybersecurity threats [18]. As a
result, numerous businesses have been tasked with developing a Phishcatch
algorithm that uses machine learning and natural language processing to battle
the growing volume of phishing emails and cyber threats.

2. PHISHING MECHANISM
Phishing assaults, generally carried out using a misleading email with a link to
download an infected file, are one of the most common malware delivery methods
[1]. When the user clicks on the link, the malware is triggered and can execute
on the user's machine. The characteristics of this sort of malware vary
depending on the operating system in use. For example, there are two sorts of
Windows viruses: those that penetrate systems using a web browser, known as a
"drive-by download," and
those that enter the system via a file downloaded from the
internet, known as an "injector" [1].
Regardless of how a phishing assault is carried out, all
approaches entail some kind of deceit. Attackers may, for
example, send an email purporting to be from a respected
institution and request that the receiver click on a link or open
an attachment. This link might take the victim to a bogus
website designed to steal sensitive information, or the attacker
could directly request personal information via a form or email
response [1]. Phishing attacks are usually difficult to detect
because attackers typically employ genuine-looking logos and
graphics in their communications. However, grammatical
errors or a sender email address that does not match the
organization's name can indicate a phishing email [1].
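One of the indicators mentioned above, a sender address that does not match the organization's name, can be checked mechanically. The following is a minimal sketch rather than part of the Phishcatch algorithm: the function names and the example domains are hypothetical, and it assumes the expected domain of the claimed organization is already known.

```python
from email.utils import parseaddr


def sender_domain(from_header: str) -> str:
    """Return the lower-cased domain of the address in a From header."""
    _display_name, address = parseaddr(from_header)
    # rpartition yields an empty domain when the address has no "@".
    return address.rpartition("@")[2].lower()


def domain_mismatch(from_header: str, expected_domain: str) -> bool:
    """True when the sender's domain is neither the expected domain
    nor one of its subdomains (hypothetical helper for illustration)."""
    domain = sender_domain(from_header)
    expected = expected_domain.lower()
    return not (domain == expected or domain.endswith("." + expected))


# Example: a look-alike domain is flagged, a genuine subdomain is not.
suspicious = domain_mismatch('"Example Bank" <support@examp1e-bank.com>', "example-bank.com")
genuine = domain_mismatch("alerts@mail.example-bank.com", "example-bank.com")
```

A check like this only covers one signal; a mismatch is a reason for suspicion, not proof of phishing, and a matching domain does not rule out a spoofed header.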
Fig 1: Phishing Mechanism

Phishing attacks are carried out by attackers that send unsolicited
communications to users to deceive them into opening an attachment or clicking
on a link that may contain malware [3]. The attacker can write the email by hand
or use an automated program. These emails frequently appear to come from
reputable sources like banks, media sites, or government agencies. They may
include attachments that appear to be vital updates or significant information
but are actually harmful links or downloads [1]. Such emails often play on
employees' welfare concerns, such as the promise of better working conditions or
a pay raise.

To be successful, phishing attempts must get sensitive information such as
usernames and passwords. In phishing assaults, passwords are typically the most
targeted information. Attackers also require computer access to conduct phishing
assaults, either by sending emails with dangerous attachments or by installing
malware on the victim's computer using remote access tools (RATs) [3]. The
attackers gather information by disseminating their email addresses, malicious
links, and files to understand better how people use their computers and what
kind of data they keep online. Following the click of the link, attackers may
exploit other vulnerabilities in the compromised machine, such as remote access
tools or web browser exploits, to steal victims' financial information, such as
bank account numbers and credit card details [1].

Fig 2: Most Targeted Industries (APWG Q3 Report, 2022)

Phishing attacks are a continuous and broad danger to individuals and
organizations, with some industries being more actively targeted owing to the
volume of sensitive information they contain [11]. Financial institutions,
social networking platforms, and SaaS/Webmail providers are among the most often
targeted businesses due to the vast quantities of personal and financial data
they collect that may be exploited to conduct fraud or identity theft [11].
Attackers can acquire access to client credit cards, bank account information,
and email correspondence by exploiting holes in these firms' security systems
[11]. The attackers' primary goal is to get passwords and usernames that may be
utilized to compromise the resources and infrastructure of these businesses.
However, businesses such as Bitcoin, which have a robust security structure and
have yet to be effectively hacked, are not popular targets [11]. Other
industries, such as transportation, telecommunications, and eCommerce, may be
less commonly attacked since more data is needed to infiltrate them, and a
breach results in a significant loss of resources and a compromised security
system [11].

Fig 3: Phishing Attacks 2021-2022 (APWG Q3 Report, 2022)

The graph above shows a continuous increase in phishing assaults from October
2021 to September 2022. The number of phishing events increased in the second
half of 2021, reaching over 3,000 attacks per day, according to the APWG Trends
Report Q3 2022 [4]. This is a concerning trend since attackers are persistent
and growing more skilled. According to the
survey, phishing assaults mainly target corporate personnel via social media
sites like LinkedIn and Facebook. This method works because many employees have
personal profiles on these platforms, and attackers know that if they are
already linked with someone at the firm, they are more likely to click on a link
or attachment. The increased usage of the internet by organizations and
consumers is to blame for the increase in cyberattacks. Cybercriminals use
advanced social engineering techniques to trick their victims into clicking on
malicious links or opening infected files that appear to be legitimate files
from trusted parties such as banks or email providers, exposing them to malware
infections such as ransomware or banking Trojans.

Fig 4: Brands Attacked 2021-2022 (APWG Q3 Report, 2022)

Phishing assaults on e-commerce and retail enterprises decreased in the third
quarter of 2020, but attacks on SaaS stayed constant. This rise might be
ascribed to the influence of COVID-19, which required more individuals to work
online, creating a bigger pool of possible targets for attackers [1]. Financial
institutions were the most targeted industry, accounting for 23.4% of assaults.
However, this was down from 27.6% in the previous quarter. Interestingly,
December had the fewest phishing assaults of any month, most likely because many
businesses take a vacation over the holiday season, giving less potential for
attacks [1]. Companies should continually monitor changes in attacker technology
and tactics to keep up with developing threats.

3. DIFFERENT TYPES OF PHISHING
3.1 Search Engines Phishing
Phishing through search engines is a technique that uses search engines such as
Google and Bing to place malicious links on search engine results pages (SERPs).
These malicious links are designed to look like real ones and lure users into
clicking them. Phishing through search engines has been around for years, but it
has become more common recently due to the increased use of mobile devices and
social media platforms like Facebook and Twitter.

3.2 Vishing
Vishing is a phishing assault in which a fraudster calls or texts the victim
using their phone number or other information, such as their name and address
[1]. The attacker may pose as an employee of a well-known corporation, such as
Apple or Verizon, or a government body, such as the IRS, and ask the victim
personally identifying questions [1]. The attacker aims to get sensitive
information from the victim, which he or she may subsequently sell on the black
market or use to perpetrate identity theft against the victim [1].

Vishing attacks can be challenging to detect because the attacker may employ
sophisticated ways to seem official, such as faking the phone number to match
the actual organization or agency they claim to represent [1]. Victims may also
trust a phone call or text message more than an email or website, rendering them
more vulnerable to assault [1].

To avoid vishing attacks, being wary of unsolicited phone calls or text messages
is critical, especially if they appear to be from a well-known organization or
agency [1]. Personal information, such as passwords or social security numbers,
should only be given over the phone or text message if the caller's identity can
be verified [1]. Anyone with doubts or suspicions should contact the firm or
agency directly using a recognized and trustworthy phone number [1]. Individuals
may reduce their vulnerability to vishing attacks and protect their personal and
financial information by being aware and taking the required steps.

3.3 Smishing
Smishing is sending fraudulent text messages with URLs that look legitimate but
lead to fraud, malware, or other types of cyberattacks [7]. The communications
might be sent by attackers impersonating reputable persons, such as business
acquaintances, to gather sensitive information like income or working
circumstances. Smishing attacks are potent because they use social engineering
tactics to generate a feeling of urgency and encourage the victim to act without
investigating the message's legitimacy. Individuals and companies can suffer
significant repercussions from these assaults, including identity theft,
financial loss, and data breaches. As a result, individuals and companies must
be aware and take precautions against these sorts of attacks.

3.4 Key logger
Keylogging is capturing every keystroke a computer user makes to obtain
sensitive information such as credit card numbers or passwords [2]. The stolen
data is typically preserved in a log file for the attacker to retrieve later.
Notably, most keylogging data is retained on websites without two-factor
authentication, making it more straightforward for attackers to get unauthorized
access to critical information [2]. This type of cyber assault, which can result
in serious repercussions such as identity theft and financial loss, is
frequently used as a weapon for corporate espionage or to get access to
sensitive information saved on a victim's computer. As a result, it is critical
for computer users to apply robust security measures to protect themselves
against keylogging assaults, such as utilizing antivirus software and avoiding
questionable internet downloads.

3.5 Social Engineering
Social engineering is a common phishing technique that involves leveraging
victims' trust or gullibility to deceive them into disclosing personal
information [17]. To persuade victims to provide their passwords or account
numbers, attackers may masquerade as customer care representatives or employ
other deceit. Unlike other phishing strategies that rely on infrastructure
penetration, social engineering persuades individuals to give access to crucial
information willingly. The efficacy of social engineering strategies is based on
their capacity to affect victims' psychology and instill a sense of urgency or
trust in them, motivating them to reveal sensitive information [18]. Individuals
and organizations must thus be aware of the many types of social engineering
strategies and take caution when dealing with unknown sources or revealing
personal information.
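The sense of urgency that smishing and social engineering messages rely on can be approximated with a simple keyword heuristic. This is an illustrative sketch only, not the paper's method: the cue list `URGENCY_CUES` and the function `urgency_score` are hypothetical, and a real system would learn such cues from labeled data rather than hard-code them.

```python
import re

# Hypothetical cue phrases chosen for illustration; not taken from the paper.
URGENCY_CUES = (
    "urgent",
    "immediately",
    "verify your account",
    "suspended",
    "act now",
    "final notice",
)


def urgency_score(message: str) -> int:
    """Count how many distinct urgency cues appear in the message."""
    # Lower-case and collapse whitespace so multi-word cues match across line breaks.
    text = re.sub(r"\s+", " ", message.lower())
    return sum(1 for cue in URGENCY_CUES if cue in text)


# Example: a pressuring message scores higher than an ordinary one.
pressuring = urgency_score("URGENT: your account is suspended. Verify your account immediately!")
ordinary = urgency_score("See you at lunch tomorrow")
```

A score like this is best treated as one feature among several, since legitimate messages can also sound urgent.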
3.6 Domain spoofing
Domain spoofing is a typical phishing method in which attackers establish fake
websites that seem identical to authentic ones to collect sensitive information
from unsuspecting victims [5]. These websites are meant to look and feel like
the real thing, making it very difficult for consumers to tell the difference.
Attackers can employ a variety of ways to attract people to their fake websites,
such as phishing emails or social media messages with a link to the false
website. When a user inputs their login credentials or other sensitive
information on the faked website, the attackers can gather and exploit this
information to gain access to the real website or carry out other harmful
operations. Domain spoofing may have profound effects, ranging from identity
theft to financial losses. Thus, users must be aware and cautious when providing
personal information online.

3.7 Website forgery
Website forgery is a fraudulent practice in which a website that seems real but
is phony is created [19]. The fraudster may use stolen identities or data to
impersonate an actual website to establish a phony website. The false website
might be hosted on another server, and the fraudster will use various tactics to
trick the visitor into thinking they are visiting the actual website. This is a
typical sort of cyber assault in which the attacker attempts to steal sensitive
information from unsuspecting victims. Website forgery is especially problematic
since it is sometimes impossible for people to recognize that the website they
are viewing is a fabrication. When a victim inputs personal or financial
information on a false website, the attacker can access that information and use
it for nefarious reasons. As a result, it is critical for users to be cautious
when inputting sensitive information online and to confirm the legitimacy of
websites before entering any information.

3.8 Trojan
A Trojan horse is malicious software that may install additional apps on a
user's computer without their knowledge or consent [12]. This malware is
frequently sent via email attachments or websites that have been hijacked by
hackers or malware writers looking to enhance the capabilities of their
dangerous programs. Trojan horses are well-known for their ability to disguise
themselves as legitimate applications or software updates, deceiving unwary
users into downloading and installing them. Once installed, the Trojan horse can
perform a variety of operations on the user's computer, such as stealing
personal data, installing further malware, or providing the attacker remote
access. As a result, users must be cautious and follow best practices to protect
themselves from such unwanted assaults.

3.9 Malware
Malware is a term that refers to any program that is meant to disrupt a computer
system or steal personal information from users who download it onto their
machines, typically without their knowledge or agreement. Cybercriminals
generally construct this malware, which is then deployed in a variety of
methods, including phishing assaults, spam emails, and keyboard loggers [9].
Phishing assaults have grown in popularity in recent years, and they frequently
use malware to infect victims' systems and steal important data. Malware can
also be used to carry out other sorts of assaults, such as ransomware and
distributed denial-of-service (DDoS) attacks, which can be extremely damaging to
organizations and people [10]. As a result, it is critical for people and
companies to take precautions against malware, such as installing antivirus
software, routinely upgrading software and operating systems, and being wary of
strange emails or other communications that may include malware.

3.10 Ransomware
Ransomware is malicious software that encrypts a user's data and prevents it
from being accessed until a ransom is paid [5]. This sort of assault may be
disastrous for people and businesses, resulting in the loss of critical data and
severe financial impact. The attacker encrypts the data, and the victim must pay
a ransom to get the decryption key required to unlock it. This tactic is
frequently used to extort money from individuals or businesses. After receiving
money, the attacker may or may not supply the victim with the decryption key.
The growing prevalence of ransomware attacks emphasizes the importance of
proactively protecting personal and business data from cyber threats.

3.11 Malvertising
Malvertising is a cybercriminal practice in which attackers pay third-party
businesses to put advertisements on websites under their control. The
advertisements are intended to lure users to the attackers' websites, where they
can steal sensitive information or exploit vulnerabilities in the consumers'
machines or networks [12]. Users can be directed to these sites by clicking on
the advertisement or visiting the website. The attackers can then acquire access
to the accounts and passwords of the companies, allowing them to carry out
attacks on the firms' systems. This activity is an increasing source of worry
for both corporations and people, as it may cause major financial and
reputational harm. Users must consequently be attentive and take adequate
precautions to protect themselves from such assaults.

3.12 Spear Phishing
Spear phishing is a type of email phishing in which an attacker sends an email
to a user posing as a trusted individual, such as an attorney, employer, or
university, requesting confidential information such as login credentials,
social security numbers, and other personal data that could be used to gain
access to accounts or steal money online [15]. Spear phishing is a highly
targeted kind of phishing that focuses on specific persons or organizations to
make the assault look more legitimate and customized. These assaults are
frequently effective because the attacker has done research on the target and
may use that knowledge to construct a compelling message that seems real.

3.13 Session Hijacking
Session hijacking is a malicious technique to gain unauthorized access to a
user's session ID and intercept their data. Attackers achieve this by exploiting
vulnerabilities in session management procedures, which can be present in
cookies, tokens, or URLs [9]. The attacker can then use this information to
impersonate the user, access their confidential information, and carry out
unauthorized transactions. Session hijacking is a serious threat that can result
in identity theft, account misuse, and financial fraud, among other consequences
[9]. To prevent session hijacking attacks, it is essential to ensure proper
session management techniques, including session timeouts, secure session
storage, and using HTTPS to encrypt session data [8].

3.14 Content injection
Content injection is a cyberattack in which harmful code is placed into genuine
website content without the website owner's or user's knowledge or consent [15].
The inserted code may modify or replace current content with malicious code,
giving the impression that the website has been altered. Content
injection attacks are designed to steal sensitive data from genuine websites or
to divert users to phishing websites where their personal information can be
obtained. To carry out the content injection attack and achieve their goals, the
attackers may employ a variety of techniques, including cross-site scripting
(XSS) and SQL injection (SQLi). Content injection attacks are a severe danger to
website owners and users since they can cause data breaches, financial loss, and
reputational harm to impacted companies.

3.15 Link Manipulation
Users can be directed to other web pages by using web links. On the other hand,
attackers might exploit this feature by building malicious links that redirect
visitors to fraudulent sites rather than the intended destination. As a result,
attackers can intercept any cookie provided along with the link [15]. This
allows them to mimic the user and conduct nefarious activities such as stealing
sensitive data or starting fraudulent transactions. As a result, users should be
cautious when clicking on links, particularly those they do not recognize or
that look suspicious. Putting in place security measures like two-factor
authentication can also help reduce the danger of such assaults.

3.16 Whaling
Whaling is a sort of cyber assault that includes duping consumers into clicking
on malware-infected links or advertisements, generally via email or social media
postings [19]. The advertisements are frequently directed towards prominent
social media sites such as Facebook and Twitter. Attackers often flood a large
number of unprotected computers with traffic in order to overwhelm them and
cause them to crash or freeze. This may cause severe interruption for
organizations and individuals, as well as potential data breaches and data loss.
Whaling assaults are growing more complex, with attackers employing social
engineering tactics to produce persuasive phishing emails and websites that
consumers find difficult to differentiate from legitimate ones. As a result,
companies and individuals must be aware and take proactive actions to defend
themselves against whaling assaults.

3.17 Email/spam
Email-based phishing attacks are a common form of cybercrime in which attackers
send fraudulent emails that appear to come from a legitimate source, such as a
financial institution or an online service, to trick users into divulging
sensitive information. This technique is known as email phishing [18]. Despite
its prevalence, email phishing has several limitations, including the
possibility of emails getting lost in transit or ending up in spam folders,
making it less reliable than other phishing attacks. Nonetheless, email phishing
remains a popular technique for cybercriminals due to its potential to target a
large number of users and its ability to appear trustworthy to victims.

3.18 Web-based delivery
Phishing attempts frequently employ web-based delivery techniques, such as
redirecting visitors to fake websites or utilizing pop-ups and malicious code
[2]. These assaults can have profound implications, such as money loss, identity
theft, and malware installation on victims' machines. Individuals and businesses
must take precautions to defend themselves from such assaults, such as learning
about the strategies employed by attackers, deploying security software, and
keeping alert to strange emails and communications.

4. PURPOSE BEHIND THE PHISHING
4.1 Identity theft
Phishing is a social engineering assault used to gain personal information,
particularly login credentials. In this attack, an email or SMS message seems to
originate from a trusted source, but it is really delivered by an unknown sender
[14]. The email or message may contain what appear to be normal links or
attachments. However, they are meant to download malicious software onto the
target's computer or mobile device to acquire sensitive information such as
login credentials or financial information [14]. These phishing assaults can
have major ramifications, including identity theft, financial loss, and malware
installation on victims' PCs [14].

4.2 Financial Gain
Phishing is frequently driven by financial gain, with criminals using the scam
to trick customers into disclosing their bank account passwords and transferring
data [8]. This type of assault is known as "phishing." It involves mimicking
emails from financial institutions, banks, and other trustworthy businesses to
obtain personal information from unsuspecting victims.

4.3 Password Harvesting
Password harvesting, also known as credential stuffing, is a specific purpose of
phishing attacks in which attackers attempt to steal user credentials to gain
access to online accounts [9]. Attackers employ automated programs to collect
passwords from the machines of a large number of online users without their
knowledge or agreement. This occurs when consumers click on links in phishing
emails that redirect them to unfamiliar websites where they are requested to
input their credentials into an automated form. Attackers utilize this form to
acquire the user's login details for future fraudulent operations.

4.4 Gain recognition
Phishing attacks that target high-profile individuals or organizations are often
motivated by the desire to gain recognition [17]. Attackers can increase their
visibility and notoriety by successfully tricking a well-known entity into
revealing sensitive information. This recognition can also lead to attention
from other members of the hacking community and potentially lead to financial
gain or status within the community. Therefore, high-profile individuals and
organizations are at increased risk of becoming targets of phishing attacks due
to their status and visibility.

4.5 Exploit security hole
Security vulnerabilities are frequently employed in phishing attacks, in which
attackers exploit flaws in an organization's security architecture to obtain
unauthorized access to sensitive data [3]. One of the most prevalent methods
attackers use is sending seemingly regular emails with links to malicious
websites. When a user clicks on the link, the attacker can circumvent security
protections and obtain access to sensitive information. These assaults may harm
enterprises, resulting in substantial financial losses and reputational damage.
As a result, companies must have robust security measures in place to defend
themselves from such assaults.

4.6 Brand Tarnishing
Brand tarnishing is a strategy used by attackers to destroy the reputation of
businesses or persons by disseminating false or bad information [16]. Attackers
use phishing emails that appear to come from a genuine source, such as the
targeted firm, but include provocative or objectionable information. Another
technique is to steal client data from the firm and then post it critical to deploy robust authentication procedures such as
online, which can severely harm the company's reputation. Brand tarnishing is a significant concern since it may result in considerable financial losses and harm the targeted company's reputation.

4.7 Data Theft
Data theft is a primary objective of phishing attacks. Attackers use various tactics to trick individuals into divulging sensitive information such as passwords or credit card numbers, which they then use for their own purposes [8]. Attackers can also steal data by tricking individuals into downloading malware onto their computers or networks. Once the attacker has obtained the data, they can use it to commit identity theft or financial fraud, or sell the information on the black market.

5. CAUSES OF PHISHING
5.1 Security Flaws
Attackers can access systems by exploiting security weaknesses in software and hardware [3]. The most prevalent security problem is a design defect that arises when software is distributed without sufficient testing. Because of this design weakness, attackers can abuse the program and obtain unauthorized computer access.

5.2 Weak passwords
A weak password is not a flaw in the software itself; rather, it is the user's responsibility to create a strong password and update it frequently. Phishing attacks may readily exploit a weak password [12]. Users and administrators may end up with weak passwords through carelessness in password selection or because they choose basic passwords such as "password" or "123456." Furthermore, individuals frequently use the same password for several accounts, exposing all of their accounts if the password is compromised [8].

5.3 Non-secure desktop
A non-secure desktop is one that does not have the latest security updates. A desktop that lacks the most recent security patches makes a machine more vulnerable to phishing attacks [7]. Furthermore, users should be aware of the kind and version of their browsers, as certain browsers are more secure than others. If a browser is compromised, it may be used to launch phishing attacks.

5.4 No User Awareness
A non-secure desktop environment employs web browsers such as Internet Explorer 11 and other out-of-date versions that are vulnerable to phishing attempts [5]. Users can reduce this risk by disabling all extensions in Internet Explorer 11 and using contemporary browsers such as Microsoft Edge, Firefox, or Chrome as the default browser in Windows 10/8/7. By doing so, users can better defend themselves against phishing attempts and other online hazards.

5.5 Weak auth or no MFA
Weak authentication methods and the absence of multi-factor authentication (MFA) can expose systems to phishing attacks. Attackers can acquire access to systems through weak username and password combinations, potentially resulting in the theft of sensitive data such as financial and personal information. Files holding sensitive information on individuals, their family members, and workers, which might be exploited for identity theft, may fall into this category. It is therefore advisable to use MFA to reduce the dangers of such attacks [5].

5.6 Access control list
An access control list (ACL) is a set of permissions allocated to a specific object, such as a file or folder, that indicates which people or groups have access to the item and what degree of access they have [5]. ACLs are frequently used to limit access to sensitive data or resources so that only authorized users can access them. If an ACL is configured incorrectly or is out of date, it can lead to security vulnerabilities and raise the risk of phishing attempts by allowing unauthorized users to access sensitive data or network resources. As a result, it is critical to evaluate and update ACLs on a regular basis to ensure that they are correctly set and effective in safeguarding the network against unwanted access.

5.7 Software Updates
Successful phishing attacks are frequently caused by out-of-date software. Examples include web browsers, antivirus software, and operating systems. When these software packages are not updated frequently, they might become exploitable, exposing the system to attack. To reduce the danger of these sorts of attacks, it is critical to keep software up to date.

5.8 Browser Vulnerabilities
When using a browser like Internet Explorer or Firefox to explore the internet, users are generally requested to install software updates to check for security vulnerabilities in their system each time they visit a website. It is, nevertheless, critical to use caution when upgrading this program. Before installing any updates, users should check whether any security updates are available. This is because attackers may exploit weaknesses in obsolete software and take advantage of users unaware of the latest security upgrades to launch phishing attacks. To reduce the danger of such attacks, it is recommended that all software on a system be kept up to date.

5.9 Open ports and misconfigured services exposed to the internet
Open ports and misconfigured services exposed to the internet constitute a severe security risk because they allow unprotected traffic from untrusted sources to access devices or computers without authorization, even when firewalls or other security controls are in place. Some frequently used ports, such as 80 (HTTP) and 443 (HTTPS), are exposed to the public, rendering them vulnerable to malevolent actors. Similarly, misconfigured services such as SMTP and FTP can be used by anybody on the internet without authorization, raising the danger of phishing attempts. To reduce this risk, it is advised that open ports be secured and services be correctly configured, and that extra security measures such as network segmentation and intrusion detection systems be employed.

5.10 Poor Endpoint Detection
When endpoint detection is inadequate, phishing attempts may frequently evade security protections and gain access to critical information. Maintaining security requires ensuring that a reliable endpoint detection mechanism is in place. Organizations must have the right technologies in place to detect and stop phishing efforts; failing to do so might expose people to fraud [1]. As a result, it is critical to have a well-designed endpoint detection and response system that can detect and respond to phishing attacks.
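The periodic ACL review recommended in Section 5.6 can be partly automated. The following is a minimal sketch under stated assumptions, not a production tool: the `AclEntry` record, the 90-day review window, and the list of overly broad groups are all hypothetical and chosen only for illustration. Real systems expose ACLs in their own formats.

```python
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical record for one ACL entry; real ACL formats differ per system.
@dataclass(frozen=True)
class AclEntry:
    resource: str        # file or folder the entry protects
    principal: str       # user or group that is granted access
    permission: str      # "read", "write", or "full"
    last_reviewed: date  # when an administrator last confirmed the entry


# Assumed policy values, used only in this example.
REVIEW_WINDOW = timedelta(days=90)
BROAD_PRINCIPALS = {"Everyone", "Authenticated Users"}


def audit_acl(entries, today):
    """Return (entry, reason) pairs for entries that need attention."""
    findings = []
    for entry in entries:
        # Out-of-date entries: not reviewed within the assumed window.
        if today - entry.last_reviewed > REVIEW_WINDOW:
            findings.append((entry, "review overdue"))
        # Misconfigured entries: a broad group holds more than read access.
        if entry.principal in BROAD_PRINCIPALS and entry.permission != "read":
            findings.append((entry, "broad group has more than read access"))
    return findings


entries = [
    AclEntry("/finance/payroll.xlsx", "alice", "write", date(2023, 4, 1)),
    AclEntry("/finance/payroll.xlsx", "Everyone", "full", date(2023, 4, 20)),
    AclEntry("/hr/contracts", "bob", "read", date(2022, 11, 3)),
]

findings = audit_acl(entries, today=date(2023, 5, 1))
for entry, reason in findings:
    print(f"{entry.resource} [{entry.principal}]: {reason}")
```

With the sample entries, the check flags the entry granted to the broad group and the entry whose review is overdue, while the recently reviewed entry for a single user passes.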
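The advice in Section 5.9 to find and secure open ports can be illustrated with a simple TCP connection test. This is a minimal sketch using only the Python standard library; the function names `is_port_open` and `exposed_ports` and the port list are assumptions made for this example, and such a check should only be run against hosts one is authorized to test.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat the port as not open.
        return False


def exposed_ports(host, ports, timeout=1.0):
    """Return the subset of ports that accept TCP connections."""
    return [p for p in ports if is_port_open(host, p, timeout)]


if __name__ == "__main__":
    # Illustrative port list: FTP, SMTP, HTTP, HTTPS, and the default RDP port.
    ports_to_check = [21, 25, 80, 443, 3389]
    print(exposed_ports("127.0.0.1", ports_to_check))
```

A connection test only shows that a port accepts connections. Whether the service behind it is misconfigured still has to be judged separately.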
6. PHISHING DETECTION
6.1 Domain name detection
Inspecting domain names is a common way to identify fraudulent websites. Legitimate websites are often recognizable by their domain name, which usually matches the company or organization's name. Fake websites, on the other hand, often use domain names that resemble legitimate ones but contain slight variations that can be difficult to detect. For instance, a domain name like "paypal.com-scam.com" is an example of a fake website created to deceive unsuspecting users into revealing their sensitive information. Proper domain name detection is therefore crucial in detecting and preventing phishing attacks.

6.2 Language Used
Language detection is another technique that can be used to identify phishing attacks. Phishing emails are typically written in poor English or use unusual phrases not commonly found in professional communications. If an email appears suspicious because of its language, it is advisable to verify whether it is a phishing attempt.

6.3 UI Detection
UI detection, also known as user interface detection, is a method of detecting phishing attacks by inspecting the user interface of a website or email. Fake websites and emails frequently have poor visual quality or misspellings in the text, which might indicate a phishing attempt. Users can often tell whether a website or email is valid by inspecting its design and layout [18].

6.4 Signature
Many phishing emails are poorly designed and contain spelling and grammatical mistakes. By checking for these distinctive email features, analysts can detect phishing attempts. Such features might include misspellings, grammatical errors, or strange layouts. Employees should be careful when opening emails with strange signatures since they might be phishing attempts.

6.5 Tools to detect
To identify phishing emails, individuals or organizations can use several technologies. These technologies often employ heuristics and machine learning approaches to identify suspected phishing emails. Commonly used tools include PhishMe, KnowBe4, and FireEye [20]. When the program generates an alert, employees must proceed cautiously before opening the email.

6.6 Suspicious attachments
Detecting suspicious email attachments is an important component of phishing protection. Suspicious attachments can be recognized by examining the attachment's nature, size, and source. Any attachment that asks the receiver to activate macros should be viewed cautiously, as activating macros can allow malware to execute on the user's device [20]. To avoid potential danger, users should confirm the safety of attachments by contacting the sender personally or via other methods before opening them.

6.7 Suspicious links
One approach for spotting suspicious links is to move the mouse pointer over the link without clicking on it. This allows users to examine the link's actual URL and verify that it matches the displayed content. Furthermore, some web browsers and email clients include built-in security capabilities that detect and alert users to potentially hazardous links. Users should be cautious when clicking on links, since they might lead to fraudulent websites that steal personal information or install malware on the user's computer.

6.8 A message with a sense of urgency
Attackers frequently send phishing emails that create a sense of urgency to trick victims into acting without thinking. They use fear or haste to induce panic, causing the victim to act without considering the implications. To avoid falling for these scams, it is critical to stay calm and double-check the message's validity before acting.

6.9 Awareness creation
Increasing staff awareness is critical in spotting phishing attacks. Educational seminars and workshops are powerful tools for informing employees about the nature of phishing attempts and how to spot them [8]. To raise employee awareness, IT departments can provide a variety of materials such as online training modules, pamphlets, and guidelines. If given the required skills and information, employees may become essential members of the organization's security team and play a critical role in identifying and stopping phishing attempts.

6.10 Unbelievable deals and offers
Unbelievable discounts and offers are often a vehicle for phishing attempts. Such messages may ask for email verification, which entails following a verification link or entering an email address into a form. Furthermore, phishing attempts may require users to check their accounts to claim an offer. Individuals and workers must be watchful and avoid such offers and agreements.

7. PREVENTION OF PHISHING
7.1 Enforcing strong passwords
One of the most effective ways to avoid phishing attempts is to enforce strong passwords. Passwords should be at least eight characters long and comprise a combination of lowercase and uppercase letters, numbers, and symbols. Furthermore, users should avoid using the same password for different accounts to reduce the risk of unauthorized access to sensitive information [7]. By following these recommended practices, users can lower their chances of falling victim to phishing attempts that leverage weak passwords.

7.2 Implement MFA
Multi-factor authentication (MFA) can significantly reduce the risk of successful phishing attacks by adding a layer of security to the login process. With MFA, users must provide two or more forms of identification when attempting to access their account, such as a password and a text message code. By requiring multiple forms of identification, MFA makes it more challenging for attackers to bypass account security measures with stolen passwords alone [25].

7.3 Creating security awareness programs
Security awareness programs help educate users about protecting themselves from phishing scams and other types of cybercrime. Such a program teaches them how criminals use social engineering techniques like pretending to be a company or authority figure, for example in an email that claims to come from a user's bank or credit card company and reports fraud on an account.

7.4 Monitoring open RDP ports
Attackers can use the redirection of Remote Desktop Protocol (RDP) ports to redirect RDP connections, resulting in a denial-of-service attack by consuming network bandwidth and resources. RDP ports 3389, 3390, 3394, and 4100 are often used. It is recommended to avoid open RDP ports by monitoring them and swiftly shutting them when they are found to be used for something other than their original purpose.

7.5 Hardening conditional access policies
Conditional access is a type of access control in which users are permitted access to a resource only if specific conditions are satisfied. Location, time of day, device type, and user identity are examples of such conditions [8]. Conditional access can be used to restrict access to sensitive information such as social security numbers and credit card data and to guarantee that only authorized individuals have access to customer and financial records.

Using temporary passwords, granted to users depending on predefined criteria, is one form of conditional access. When accessing particular resources, for example, a user may be prompted to submit a temporary password texted to their mobile device. By requiring an extra layer of authentication, this helps to prevent unauthorized access. Organizations may strengthen their security posture and safeguard sensitive information from unwanted access by introducing conditional access restrictions.

7.6 Security policies
Incorporating security policies, such as those issued by the National Institute of Standards and Technology (NIST), can help avoid phishing attempts. These policies establish best practices for safeguarding systems and sensitive data and should be followed by all workers who have access to member accounts and systems. Such rules should be included in an organization's Information Security Policy and Procedures Manual and other related guidelines and paperwork [25].

7.7 Avoiding clicking links and attachments
Receiving an email with a link or attachment should trigger suspicion, and clicking on it should be done cautiously. Before clicking [10], it is best to confirm the integrity of the link or file by checking the URL or email address with a reliable source. If the link or attachment is opened, the user should ensure that it leads to the correct page. Otherwise, it is best to approach it with care.

7.8 Spam Guarding
The deployment of spam guarding services at both the organization's email server and user levels is an effective anti-phishing tactic. These tools block unsolicited commercial email from entering the network and keep emails carrying harmful code, such as viruses or worms, out of the computer system.

7.9 Install antivirus and anti-spam software.
Several antivirus applications, some of them free, may be used to protect computers against malware such as viruses and spyware. Examples include Microsoft Security Essentials, Norton 360, and McAfee's Internet Security package [6]. Additionally, the user should install anti-spam software on their device to filter out unwanted emails and prevent them from reaching the user's inbox without the recipient's consent.

8. PHISHING DETECTION IMPLEMENTATION
Several Microsoft Azure technologies are used to create the phishing detection system. Azure Data Factory is used for data migration, allowing files to be transported from many sources to Azure Data Lake Gen2. Azure Databricks is used for the Python and R code, all of which is kept in Databricks notebooks.

Spark clusters are used to run the Databricks Python notebooks. Azure Databricks is a fully managed first-party service that enables an open data lakehouse in Azure. A lakehouse built on top of an open data lake supports a variety of analytical workloads under common governance across the entire data estate, including data science, data engineering, machine learning, AI, and SQL-based analytics. Data Lake Gen2 is used to store the pre- and post-processing datasets, and Power BI is used to produce telemetry reports by connecting to Data Lake Gen2.

Data collection is the first stage of the phishing detection implementation. Data is collected from numerous sources and transported to Azure Data Lake Gen2 storage using Azure Data Factory. Within Azure Data Factory, a self-hosted integration runtime is used to move data from on-premises systems to Azure Data Lake cloud storage. The collected dataset, in CSV format, is read using the Pandas package and labeled. The NLTK library supplies the stop-word list and the Porter stemmer, which are used to delete unnecessary words and reduce each word to its base root. Regular expressions are used to remove special characters from the dataset.

Following data cleaning, text preparation is carried out, in which the words of the email body are converted to lowercase and each word is separated into its own column. The count vectorizer from the scikit-learn package turns unique words into columns. The labels of the primary dataset are then converted into the Boolean values 0 and 1 to distinguish spam from valid emails. Both datasets are then evaluated using a train-test split and the Naive Bayes package. The confusion matrix from the scikit-learn package is used to test and validate predictions. The accuracy score measures the algorithm's accuracy across all 11 datasets.

Emails identified as spam and those identified as not spam are transferred back into the data lake and on to Power BI to create an efficiency report for the spam detection tool. Overall, developing a phishing detection algorithm necessitates using multiple Azure technologies and modules for data cleaning, preprocessing, visualization, and testing.
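The cleaning and text-preparation steps described in Section 8 can be sketched as follows. This is an illustrative approximation rather than the authors' notebook code. To keep the example runnable without downloads, NLTK's English stop-word list and Porter stemmer are replaced here by a tiny hand-written stop-word set and a naive suffix stripper, both of which are assumptions introduced for this sketch.

```python
import re

# Tiny illustrative stop-word set; a stand-in for NLTK's English stop words.
STOP_WORDS = {
    "a", "an", "and", "at", "for", "in", "is", "it",
    "of", "on", "the", "this", "to", "you", "your",
}


def crude_stem(word):
    """Rough stand-in for a Porter stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(body):
    """Lowercase, remove special characters, drop stop words, then stem."""
    text = re.sub(r"[^a-z\s]", " ", body.lower())
    return [crude_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


sample = "URGENT: Your account is suspended!!! Verify it now at http://example.test"
print(preprocess(sample))
```

The output is a list of lowercase, stemmed tokens with punctuation and stop words removed, which is the form a count vectorizer would then turn into word-count columns.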
Fig 5: Phishing Detection Implementation
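The classification and validation steps of the implementation (word counts, Naive Bayes, confusion matrix, accuracy score) can be sketched in plain Python. The paper uses scikit-learn for these steps; the hand-written multinomial Naive Bayes below is a substitute chosen so the example is self-contained. The toy emails, the manual train/test split, and the encoding 1 = spam, 0 = legitimate are assumptions for illustration and do not reproduce the paper's data or results.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing."""
    vocab = {tok for doc in docs for tok in doc}
    counts = {0: Counter(), 1: Counter()}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)  # per-class word counts (bag of words)
    n_docs = Counter(labels)
    model = {"vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in (0, 1):
        model["log_prior"][c] = math.log(n_docs[c] / len(docs))
        total = sum(counts[c].values()) + alpha * len(vocab)
        model["log_like"][c] = {
            tok: math.log((counts[c][tok] + alpha) / total) for tok in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the higher log-probability for one document."""
    scores = {}
    for c in (0, 1):
        scores[c] = model["log_prior"][c] + sum(
            model["log_like"][c][tok] for tok in doc if tok in model["vocab"]
        )
    return max(scores, key=scores.get)


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] with class 1 (spam) as positive."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, already-tokenized emails (invented for this example).
train_docs = [
    ["verify", "account", "urgent", "password"],
    ["win", "prize", "click", "link", "urgent"],
    ["meeting", "agenda", "attached", "monday"],
    ["project", "report", "review", "meeting"],
]
train_labels = [1, 1, 0, 0]
test_docs = [["urgent", "click", "verify"], ["monday", "project", "meeting"]]
test_labels = [1, 0]

model = train_nb(train_docs, train_labels)
preds = [predict(model, doc) for doc in test_docs]
print(confusion_matrix(test_labels, preds))
print(accuracy(test_labels, preds))
```

On data this small the score says nothing about real-world performance. The sketch only shows how the predicted labels feed the confusion matrix and the accuracy score.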
9. EXPERIMENTS AND RESULTS
The research findings reveal that the phishing detection tool detects phishing emails with high efficiency and accuracy. The program was evaluated on 11 datasets, each of which had 500 emails, including a mix of spam and legitimate emails. The phishing detection tool's total accuracy was 94%, with an efficiency of 92% or higher in most of the tests conducted.

Table 1: Experimental Results
Datasets   Number of Emails   Spam detected   Not Detected   Efficiency
1          500                106             9              0.96
2          500                94              6              0.95
3          500                93              7              0.94
4          500                95              5              0.93
5          500                94              6              0.92
6          500                94              6              0.94
7          500                92              8              0.94
8          500                93              7              0.95
9          500                94              6              0.93
10         500                95              5              0.94
11         500                96              4              0.92
12         5500               1099            16             0.98

When the datasets were split, the accuracy of the prediction model was lower than when it was evaluated on the entire dataset. According to the findings, the greater the dataset, the more accurate the prediction model. Furthermore, Fig. 7 depicts the efficacy of the phishing detection tool, which reveals that the tool was more than 92% efficient in most of the tests done. The tool was determined to be 94% efficient and accurate on average. Finally, the experimental findings show that the phishing detection tool successfully identifies phishing emails and may be used to safeguard individuals and businesses from phishing attacks.

Fig 6: Phishing Detected vs Not Detected

The phishing detection tool's effectiveness was assessed by testing it on 11 datasets, each comprising 500 emails (5,500 in total). As shown in Fig 7, the developed prediction model was more than 92% efficient in most of the experiments and 94% efficient and accurate on average.

Fig 7: Efficiency of the Phishing Detection Tool

Furthermore, the prediction model's accuracy was in the 92% to 96% range for the split datasets with 500 emails each. The prediction model's accuracy was 98% for the whole dataset of 5,500 emails. These findings suggest that the greater the dataset, the more accurate the prediction model will be.

10. CONCLUSION
Overall, this research paper highlights the importance of taking proactive measures to prevent phishing attacks. It also provides valuable insights into developing and testing a phishing detection algorithm using Natural Language Processing and Python. While the algorithm demonstrates high accuracy in detecting phishing emails, there is still room for improvement, particularly in considering attachments and subject lines.

Future research could focus on integrating these tools with popular email services and developing real-time alert systems for users. Overall, this paper contributes to the ongoing efforts to improve cybersecurity and protect against phishing attacks.

11. REFERENCES
[1] Akinyelu, A. A. (2019). Machine learning and nature-inspired based phishing detection: a literature survey. International Journal on Artificial Intelligence Tools, 28(05), 1930002.
[2] Chan, J. M., Van Blarigan, E. L., Langlais, C. S., Zhao, S., Ramsdill, J. W., Daniel, K., ... & Winters-Stone, K. M. (2020). Feasibility and acceptability of a remotely delivered, web-based behavioral intervention for men with prostate cancer: a four-arm randomized controlled pilot trial. Journal of medical Internet research, 22(12), e19238.
[3] Dinesh K; Nathan S. "Study and Analysis of Chat GPT and its Impact on Different Fields of Study." Volume. 8 Issue. 3, March - 2023, International Journal of Innovative Science and Research Technology (IJISRT), www.ijisrt.com. ISSN - 2456-2165, PP :- 827-833. https://doi.org/10.5281/zenodo.7767675
[4] Hakim, Z. M., Ebner, N. C., Oliveira, D. S., Getz, S. J., Levin, B. E., Lin, T., ... & Wilson, R. C. (2021). The Phishing Email Suspicion Test (PEST) is a lab-based task for evaluating the cognitive mechanisms of phishing detection. Behavior research methods, 53, 1342-1352.
[5] Homayoun, S., Dehghantanha, A., Ahmadzadeh, M., Hashemi, S., Khayami, R., Choo, K. K. R., & Newton, D. E. (2019). DRTHIS: Deep ransomware threat hunting and intelligence system at the fog layer. Future Generation Computer Systems, 90, 94–104.
[6] Kalla, D., & Samaah, F. (2020a). Chatbot for Medical Treatment using NLTK Lib. IOSR Journal of Computer Engineering, 22(1), 50–56. https://doi.org/10.9790/0661-2201035056
[7] Lopez-Aguilar, P., & Solanas, A. (2021). The Role of Phishing Victims’ Neuroticism: Reasons Behind the Lack of Consensus. Int'l J. Info. Sec. & Cybercrime, 10, 75.
[8] Marzuki, K., Hanif, N., & Hariyadi, I. P. (2022). Application of Domain Keys Identified Mail, Sender Policy Framework, Anti-Spam, and Antivirus: The Analysis on Mail Servers. International Journal of Electronics and Communications Systems, 2(2), 65-73.
[9] Mishra, S., & Soni, D. (2021). Dsmishsms-a system to detect smishing sms. Neural Computing and Applications, pp. 1–18.
[10] Negassa, M. D., Mallie, D. T., & Gemeda, D. O. (2020). Forest cover change detection using Geographic Information Systems and remote sensing techniques: a spatiotemporal study on Komto Protected Forest priority area, East Wollega Zone, Ethiopia. Environmental Systems Research, 9, 1-14.
[11] Oesch, S., & Ruoti, S. (2020, August). That was then; this is now: A security evaluation of password generation, storage, and autofill in browser-based password managers. In Proceedings of the 29th USENIX Conference on Security Symposium (pp. 2165-2182).
[12] Petelka, J., Zou, Y., & Schaub, F. (2019, May). Put your warning where your link is: Improving and evaluating email phishing warnings. In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1-15).
[13] Qwaider, S. R. H. (2019). ANALYSIS AND EVALUATION OF CYBERSECURITY TECHNIQUES FOR SOCIAL ENGINEERING (Doctoral dissertation).
[14] Riadi, I., Umar, R., Busthomi, I., & Muhammad, A. W. (2022). Block-hash of blockchain framework against man-in-the-middle attacks. Register: Jurnal Ilmiah Teknologi Sistem Informasi, 8(1), 1-9.
[15] Sahingoz, O. K., Buber, E., Demir, O., & Diri, B. (2019). Machine learning-based phishing detection from URLs. Expert Systems with Applications, 117, 345-357.
[16] Sharma, A., Gupta, P., & Noida, I. (2020). COVID 19 PANDEMIC: IMPACT ON BUSINESS AND CYBER SECURITY CHALLENGES. Journal of Emerging Technologies and Innovative Research (JETIR), 7(7).
[17] Shen, G., Link, S. S., Tao, X., & Frankfort, B. J. (2020). Modeling a potential SANS countermeasure by manipulating the translaminar pressure difference in mice. npj Microgravity, 6(1), 19.
[18] Kuraku, S.; Kalla, D. Emotet Malware–A Banking Credentials Stealer. Iosr J. Comput. Eng. 2020, 22, 31–41.
[19] Xu, D. (2019). Jamming-assisted legitimate surveillance of suspicious interference networks with successive interference cancellation. IEEE Communications Letters, 24(2), 396–400.
[20] Yathiraju, N., Jakka, G., Parisa, S. K., & Oni, O. (2022). Cybersecurity Capabilities in Developing Nations and Its Impact on Global Security: A Survey of Social Engineering Attacks and Steps for Mitigation of These Attacks. In Cybersecurity Capabilities in Developing Nations and Its Impact on Global Security (pp. 110-132). IGI Global.
[21] Zhang, L., Tan, S., Wang, Z., Ren, Y., Wang, Z., & Yang, J. (2020, December). Viblive: A continuous liveness detection for a secure voice user interface in an IoT environment. In Annual Computer Security Applications Conference (pp. 884-896).
IJCATM : www.ijcaonline.org
