An adaptive approach for internet phishing detection based on log data

The Internet has become central to daily social, financial, and other activities. A very large number of customers use the Internet to conduct business and make purchases, so billions of dollars are transferred online every day. Such a large amount of money attracts cybercriminals. Fraud is one of the most dangerous of their methods, especially phishing, where attackers try to steal user credentials using fraudulent emails, fake websites, or both. The system proposed in this paper comprises three parts: efficient data extraction from the web log file through data collection and preprocessing; a web usage mining procedure that extracts features describing user behavior; and a URL analysis that extracts features for detecting phishing website addresses. The features from the last two parts are then combined, giving sixty-three features in total. Finally, a classification algorithm (Random Forest) is applied to determine whether website addresses are phishing or legitimate. The performance of the suggested algorithm is assessed using a confusion matrix and a number of metrics that show the robustness of the proposed system.


Introduction
Developments in communications and information technology (IT) in recent years have led to very large growth in web services such as shopping, banking, e-commerce, games, forums, and file sharing [1]. Internet users are exposed to several types of phishing. Through fraudulent emails or fake websites, attackers try to obtain sensitive information from users, such as credentials and passwords [2]. A phishing attacker uses social engineering techniques to imitate legitimate websites and lure users to phishing web pages in various ways [3]. A common method asks the user to follow a malicious link in order to reset sensitive information, which directs the user to a phishing website [4]. Phishing attacks are among the most serious threats to web-based services, including financial institutions and e-commerce, and to individuals [2] [5]. According to a report by the Anti-Phishing Working Group (APWG) for the first quarter of 2021, the number of phishing attacks doubled during 2020 and then peaked in January 2021 [6]. In general, phishing detection techniques fall into two main categories: blacklist-based and heuristic-based. The first technique compares the requested URL with those in a list of known phishing URLs. Recent studies have shown that blacklists are ineffective against the number of sites hosted daily [7] [4]. Conversely, the heuristic technique applies machine learning algorithms to features extracted from web pages, such as URL features, or from web usage, such as user behavior. Based on these features, a web page is classified as legitimate or phishing. The second method is considered more effective, fast, and reliable because of its ability to detect new phishing websites [7].

Literature review
The authors of [8] describe the advantages and disadvantages of machine learning and explain why it is important to apply these techniques to identify and detect phishing in order to obtain the right anti-phishing tools.

Review related concepts
A phishing page is a fake web page made to resemble a legitimate one. Attackers most often imitate well-known pages to increase the user's confidence in, and access to, the page. The aim is to steal users' sensitive and personal information [9]. Phishing attacks are divided into two groups: (A) Social engineering, an act that influences a person in order to achieve desired goals, which includes obtaining information from the target in order to take a particular action. (B) Technical subterfuge, a set of common scam methods in which fraudsters send malicious code attached to fraudulent emails or fraudulent websites, for example through XSS-based programming, session hijacking, or phishing software [10]. Security experts and researchers have taken advanced steps to address phishing using multiple techniques, which can be categorized as user training, blacklist-based, and heuristic-based. Heuristic-based detection has two common methods: URL analysis and page content analysis, the latter including learning user behavior. URL analysis extracts features from a web page link and analyzes them to classify the page as phishing or legitimate [12].
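
As an illustration of URL analysis, the following minimal Python sketch extracts a few lexical features from a web page link. The feature names and the choice of features are assumptions made for illustration only; they are not the 63 features used in this paper.

```python
import re
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a few illustrative lexical features from a URL.

    The features below are hypothetical examples, not the paper's feature set.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        # A dotted-quad host (raw IP address) instead of a domain name.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        # Number of non-empty path segments, e.g. /login/verify -> 2.
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }
```

A feature dictionary of this kind could be turned into one row of the input table for a classifier.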

Related works
The detection of phishing content has received critical attention in recent years because of the explosive growth of content transmitted over the Internet through a wide range of social media applications. However, distinguishing legitimate, authenticated content remains challenging because fraudulent or phishing content is complex to detect: it may change over time and cannot be described in a formal manner. In this section, we review the most closely related works on detecting phishing content. V. Preethi et al. (2016) presented PrePhish, a machine learning technique for analyzing whether URLs are fraudulent. The URL features used for machine learning are based on an actual data set, with a range value and a limit value set for each feature. Three basic classifiers, Random Forest, Naive Bayes, and SVM, are used to increase security. The results showed a high predictive level, with an accuracy of 97.83% and an error of 1.82% [14]. Pratik Patel et al. (2016) presented a study on tackling phishing that explains phishing methods and the methods used to detect them. In addition to phishing prevention methods, it also provides an effective model for detecting and preventing malicious attacks [15]. Nandhini.S et al. (2017) conducted a study aimed at identifying the features that most affect classification performance in detecting fraudulent sites. After applying a number of algorithms to these features, they found that the random forest algorithm gives the highest percentage of correctly classified cases [13]. Hesham Abusaimeh et al. (2021) proposed combining three algorithms (random forest, decision tree, and support vector machine) to detect phishing sites, and also used these models separately for comparison with the proposed model. The combined model detected phishing sites more accurately than any of the three models used alone, with an accuracy of 98.52% [21]. P. Kalaharsha et al. (2021) discussed different types of phishing attacks and phishing website detection techniques. The techniques include list-based, visual measurement, machine learning, and heuristic approaches, as well as different performance methods for data sets. This information is important for helping end users combat phishing sites [22].

Research methodology
In this section the model used to detect phishing sites is described as well as the data set, algorithms, and metrics used in the evaluation of the model.

Why heuristic based phishing detection
This technique depends on the characteristics of phishing sites or the behavior of the attackers. Although it yields highly accurate results, it is not always possible to guarantee that the characteristics important for phishing detection are present or selected. If the chosen method (technique and features) is effective in identifying phishing, attacks can be detected at zero hour. In this respect the technique contrasts with blacklist technology. It also responds very quickly compared with the visual similarity technique because it requires no initial database of legitimate images and involves no comparison of images against such a database. The computational cost is therefore lower than that of the visual similarity assessment technique. It is also more useful than a blacklist or whitelist approach when a new phishing attack has to be detected.
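
The difference between the two approaches at zero hour can be sketched in Python as follows. This is a toy example under stated assumptions: the blacklist entry, the three hand-written rules, and their thresholds are all hypothetical, and a fixed rule set stands in for the learned classifier used in this paper.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of previously reported phishing URLs.
BLACKLIST = {"http://known-bad.example/login"}


def blacklist_check(url: str) -> bool:
    """Flag a URL only if it has already been reported."""
    return url in BLACKLIST


def heuristic_check(url: str) -> bool:
    """Flag a URL from its own characteristics (illustrative rules only)."""
    host = urlparse(url).hostname or ""
    suspicious_signs = [
        host.count("-") >= 2,  # many hyphens in the host name
        host.count(".") >= 4,  # deeply nested subdomains
        "@" in url,            # '@' can hide the real destination
    ]
    return any(suspicious_signs)
```

A URL that has never been reported is missed by the blacklist lookup, whereas the heuristic check can still flag it from its characteristics.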

Phishing data set
Machine learning was used to develop the proposed phishing attack detection model, with data selected for training and for validation. To develop the model, a phishing training dataset of approximately 96012 entries was collected from the Aalto University, Finland (AU) dataset repository. During preprocessing and cleaning of outlier data, 102 records were found to be outliers, and the data were then used to train and test the model. A Random Forest algorithm was chosen for classification; it is one of the most popular algorithms for identifying whether websites are phishing or legitimate.

Adaptive random forest algorithm
Random forest is a supervised learning algorithm used for classification and regression tasks. The "forest" it builds, that is, the classifier, is a collection of multiple decision trees. Randomness is added to the model when the decision trees are generated: a random subset of features is considered when splitting each node. Each decision tree casts a unit vote based on its prediction, and the most popular category for the input data is chosen. Random forest also measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training.
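
The mechanism described above (bootstrap samples, a random feature subset at each node, and majority voting) can be sketched in pure Python. This is a simplified illustration, not the implementation used in this paper: the function names, the Gini impurity criterion, and the default parameters are assumptions, and feature importance is not computed.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return (impurity, feature, threshold) with the lowest weighted Gini."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, n_sub, depth, rng):
    """Grow a tree; a random subset of n_sub features is drawn at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority  # leaf node (labels must not be tuples)
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), n_sub))
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], n_sub, depth - 1, rng),
        build_tree([r for r, _ in right], [y for _, y in right], n_sub, depth - 1, rng),
    )


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, labels, n_trees=15, n_sub=1, depth=3, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(
            build_tree([rows[i] for i in idx], [labels[i] for i in idx], n_sub, depth, rng)
        )
    return forest


def predict_forest(forest, row):
    """Each tree casts one vote; the most popular class wins."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]
```

In practice a library implementation would normally be preferred; the sketch only makes the voting and feature-subset steps explicit.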

Design flowchart of the phishing attack detection model
The following figure describes the proposed system design for detecting phishing attacks, which starts from entering the weblog, continues through the analyses, and ends with detecting phishing sites.
Step 4: Classify all instances in TD.
After collecting data from the web server, preprocessing, web usage mining, and URL analysis are performed to extract features. This process yielded 63 features that are influential in phishing detection.
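
The log preprocessing and web usage mining steps can be sketched as follows. The sketch assumes the weblog is in Common Log Format, which this paper does not state, and the per-client features it aggregates are hypothetical examples rather than the features actually used.

```python
import re
from collections import defaultdict

# Assumed Common Log Format; the actual log format is not given in the paper.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log(lines):
    """Yield one dict per well-formed log line; malformed lines are skipped."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


def usage_features(records):
    """Aggregate simple per-client usage features (illustrative only)."""
    stats = defaultdict(lambda: {"requests": 0, "errors": 0, "urls": set()})
    for rec in records:
        entry = stats[rec["ip"]]
        entry["requests"] += 1
        # Count 4xx and 5xx responses as errors.
        if rec["status"].startswith(("4", "5")):
            entry["errors"] += 1
        entry["urls"].add(rec["url"])
    return {
        ip: {
            "requests": entry["requests"],
            "errors": entry["errors"],
            "distinct_urls": len(entry["urls"]),
        }
        for ip, entry in stats.items()
    }
```

Usage features of this kind would then be combined with the URL features before classification.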

Time and space complexity
Time complexity is the amount of time it takes a computer to run a given algorithm. Expressed as a function of input length, it measures the time taken to execute each statement of code in the algorithm. Space complexity represents the total amount of memory an algorithm or process uses, including the memory for its input values, to execute and produce the result.

Performance evaluation
These are automatic algorithms for quality assessment that can analyze data and report on its quality without human involvement.

Confusion matrix
For classification problems, the confusion matrix is a widely used measure [22].
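
As a minimal sketch, the four cells of a binary confusion matrix can be counted as follows. Treating "phishing" as the positive class and the label strings themselves are assumptions made for illustration.

```python
def confusion_matrix(y_true, y_pred, positive="phishing"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing correctly detected
            else:
                fp += 1  # legitimate site wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing site missed
            else:
                tn += 1  # legitimate site correctly passed
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
```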

Precision and recall (P/R)
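Precision indicates how well the model predicts positive values: it is the fraction of predicted positives that are truly positive. Recall is a useful metric for determining the model's ability to find positive cases: it is the fraction of actual positives that the model predicts as positive. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, the formulas for precision and recall are as follows [23].
$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$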
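
A minimal Python sketch of both ratios, computed from confusion-matrix counts (true positives `tp`, false positives `fp`, false negatives `fn`), is given below. Returning 0.0 when a denominator is zero is a convention assumed here, not something specified in this paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```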

F-measure (F)
The F-measure, also known as the F-value, is a weighted harmonic mean of precision and recall [24].
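
The weighted harmonic mean can be sketched with the standard F-beta form, where the weight `beta` is a parameter introduced here for illustration. With `beta = 1` precision and recall are weighted equally; which weighting this paper used is not stated, so the default is an assumption.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator
```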

Kappa statistic (KS)
The Kappa statistic is used to measure inter-rater agreement for categorical items [21].
$$KS = \frac{p_o - p_e}{1 - p_e} \quad (5)$$
where $p_o$ is the relative observed agreement among raters and $p_e$ is the hypothetical probability of chance agreement.
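
For a classifier, the two "raters" are the true labels and the predicted labels. The following sketch computes the statistic in that setting, estimating the chance agreement from the label frequencies of each rater; the function name is hypothetical.

```python
from collections import Counter


def kappa_statistic(y_true, y_pred) -> float:
    """Kappa between true and predicted labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: probability both raters pick the same class at random.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)
```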

Mean absolute error (MAE)
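The mean absolute error is a quantity used to measure how close predictions are to the eventual outcomes [21].
$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| f_i - y_i \right| \quad (6)$$
where $f_i$ is the prediction value, $y_i$ is the true value, and $n$ is the number of samples.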
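
A minimal Python sketch of the mean absolute error, assuming class labels encoded as numbers (for example 1 for phishing and 0 for legitimate):

```python
def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference between predictions and true values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)
```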

Root mean square error (RMSE)
The root mean square error (RMSE) is a frequently used measure of the differences between the sample values estimated or predicted by a model and the observed values [21].
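
A minimal Python sketch of the RMSE in its standard form, the square root of the mean squared difference between predicted and observed values, again assuming numerically encoded labels:

```python
import math


def root_mean_square_error(y_true, y_pred) -> float:
    """Square root of the mean squared difference between the two sequences."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```
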
After applying the previous measures to the specified data set, the following results were obtained.

Experimental parameters
To compare the proposed model with other fraud detection methods, the same inputs were tested individually on four detectors, SVM, Naïve Bayes, KNN, and Decision Tree, in addition to the proposed system. The results of the other algorithms differed, and each recorded a lower accuracy than the proposed system, as shown in the following table.

Correctly and incorrectly classified instances
Figure 4.1 shows which instances are correctly classified and which are incorrectly classified. This indicates the performance of the proposed model and its high ability to detect malicious websites.
Figure 5. Analysis of correctly/incorrectly classified instances of the proposed system

Conclusion and recommendations
The phishing attack is one of the most sophisticated web attacks and is considered a serious threat to website users. This paper proposed a model based on the Random Forest algorithm for classifying and detecting phishing sites using 63 features that are important in identifying phishing from URL, domain, or path characteristics. The performance of the classification algorithm, with feature selection based on a classifier attribute evaluator, was evaluated using a phishing dataset combining URL-, domain-, and path-based features. The evaluation shows that the proposed model achieves a high accuracy of 96.91% and a low error rate of 0.03% compared to other existing machine learning-based models.