Enhancement performance of random forest algorithm via one hot encoding for IoT IDS

The random forest algorithm is one of the important supervised machine learning (ML) algorithms. In the present paper, the accuracy of the random forest (RF) algorithm is improved by using the one-hot encoding method. An Intrusion Detection System (IDS) can be defined as a system that predicts security vulnerabilities within network traffic and is located out of band on the network infrastructure. It does not affect the efficiency of the network because it analyzes a copy of the inline traffic flow and reports results to the administrator by raising alerts. However, because an IDS is a listening-only system, it cannot take automatic action to prevent a detected attack or security vulnerability from infecting the system; it only provides information about the source address from which the break-in starts, the address of the target, and the type of suspected attack. The IoTID20 dataset, which has three targets, is used to verify the improved algorithm. The proposed system is compared with state-of-the-art approaches and shows superiority over them.


Introduction
The Internet of Things (IoT) can be defined as a network of several million computers that are linked together by the internet for the purpose of collecting, exchanging, and processing data. In other words, any smart device connected to a wireless network can be managed and communicated with. IoT applications can be used in a variety of settings, including homes, businesses, vehicles, and industries. The IoT is extremely flexible and has a significant economic effect on society. The IoT market is expected to consist of over 84 billion connected devices that generate 186 zettabytes of data by 2025 [1,2]. Having so many devices attached to the internet is a major security issue. Not all data transmitted over an IoT system is secured, making it susceptible to malicious attacks. To build a stable and safe network, a lot of research is being carried out to improve IoT security. The fundamental principles of confidentiality, integrity, and authenticity should be followed when developing solutions. Developing different types of security mechanisms for IoT networks, such as intrusion detection systems (IDS), has become very important, and the concept of applying machine learning to IDS is gaining a lot of attention [3]. Intrusion detection systems are applications or services that can be used to track or diagnose unusual behavior in a network or device. A number of ML approaches are being used to detect anomalies in IoT networks, with promising findings [4,5]. Host-based IDSs and network-based IDSs are the two major categories of intrusion detection systems. A host-based IDS uses information from the host, such as system logs, to track and protect a single device or network. A network-based IDS tracks an entire network by accessing and evaluating the flows that exist within it. Packet-based IDS and flow-based IDS are the two types of network-based IDS.
Packet-based IDS uses network packet information, such as payload or header information, and is also known as traditional IDS [6]. Flow-based IDS, on the other hand, analyzes and monitors network anomalies using network flow characteristics such as data rate and byte information, and is also known as Network Behavior Analysis [6]. Several supervised and unsupervised machine learning models are used to classify malicious or unusual network behavior. Traditional protection methods are difficult to apply directly to IoT devices because of their computational and fundamental resource constraints. Rule-based detection techniques, on the other hand, have produced effective results [7,8]. As IoT environments and technologies evolve, anomaly-based detection mechanisms may become increasingly important. Machine learning algorithms can benefit greatly from the big data generated by IoT devices, which they can analyze to produce meaningful predictions and interpretations. As a result, using machine learning to secure IoT systems is seen as the best way of protecting them from intrusion attacks, particularly through the detection of any unusual behavior in the system. ML also performs well in other fields [9,10]. The rest of the paper is structured as follows. Work related to the proposed approach is discussed in the second section. The IoTID20 dataset is described in the third section. In the fourth section, the proposed architecture for anomaly detection in IoT systems is presented. The results and analyses of the proposal are discussed in Section 5. Finally, the study concludes with suggestions for future research in the final section.

Related work
In 2021, Qaddoura et al. [11] introduced a three-stage solution that included clustering with reduction, oversampling, and classification using a Single Hidden Layer Feed-Forward Neural Network (SLFN). The paper's innovation lies in the data reduction and oversampling approaches used to generate balanced and usable training data, as well as in the hybrid use of unsupervised and supervised approaches for detecting intrusion activities. The tests were divided into four stages and evaluated in terms of precision, recall, accuracy, and G-mean: measuring the impact of clustering on data reduction, the framework's performance against basic classifiers, the impact of the oversampling method, and a comparison against basic classifiers. On the chosen IoTID20 [12] dataset, the SLFN classification method combined with Support Vector Machine-based Synthetic Minority Oversampling Technique (SVM-SMOTE) with a ratio of 0.90 and a k value of 3 for the k-means++ clustering method produced superior performance, with a score of 98.4%. Qaddoura et al. [13] in 2021 introduced a deep multilayer classification method with two components: an oversampling strategy using SMOTE to address the problem of imbalanced datasets, and a two-level classification scheme intended to improve the classification outcomes. At the first level of classification, the SLFN technique predicts intrusion and normal activities. At the second level, a DNN predicts the type of intrusion activity. The tests were carried out on IoTID20 [12] and demonstrated that the suggested solution produced better performance. Ullah and Mahmoud in 2020 [12] used feature similarity, feature ranking, and various ML classification approaches to analyze and compare the IoTID20 dataset.
For the normalization and interpretation of the IoTID20 dataset, they used ML algorithms and column normalization approaches. The detection capability of a machine learning algorithm is harmed by correlated features, so twelve correlated attributes were removed from the IoTID20 dataset. The Shapiro-Wilk algorithm, which tests the normality of the distribution of each feature's values, was used to rank the features in the IoTID20 dataset. More than 70% of the features were ranked with a score greater than 0.50, indicating that they have a high rank. The binary, category, and subcategory label datasets were all evaluated. Machine learning models were built using ensemble, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression, Linear Discriminant Analysis, RF, and Decision Tree classifiers. To evaluate the effectiveness of the different classifiers, they used several K-fold cross-validation settings, including 3-, 5-, and 10-fold cross-validation tests. The highest accuracies were achieved by the Ensemble, Random Forest, and Decision Tree classifiers, while the lowest accuracies were achieved by SVM and Logistic Regression. In 2020, Yang and Shami suggested an adaptive LightGBM model for IoT data analytics [14,15], which has high precision while using little time and memory. The proposed model can automatically adapt to the ever-changing data streams of complex IoT systems owing to the incorporation of their novel drift-handling algorithm, Optimized Adaptive and Sliding Windowing (OASW), an ensemble ML algorithm (i.e. LightGBM), and a hyperparameter optimization method, Particle Swarm Optimization (PSO). Experiments on two public IoT anomaly detection datasets, IoTID20 [12,8] and NSL-KDD [13], were used to test and discuss the suggested process.
On the IoTID20 and NSL-KDD datasets, the approach outperforms many state-of-the-art drift adaptation methods in terms of detecting IoT attacks and adapting to concept drift, with accuracy values of 99.92% and 98.31%, respectively. Farah in 2020 [3] used two freely accessible simulated datasets, IoTID20 [12] and Bot-IoT [16], for training and testing; both were developed to capture IoT network traffic under various attacks such as Denial of Service (DoS) and scanning. The machine learning models applied to these datasets were tested within each dataset before being evaluated across datasets. There was a large variation in the results obtained using the two datasets. Supervised machine learning models were developed and tested for binary classification, which differentiated between normal and anomalous cases, as well as for multiclass classification, which classified the type of attack on the IoT network. To ensure that the model works well, the flow identifiers of a network packet, such as the source and destination IP addresses, port numbers, and timestamp, had to be removed. Since attackers will use various IP addresses and times to initiate attacks on the network, a model trained using these features may fail to generalize well when deployed. Furthermore, ten more features were removed because they only had a single value and did not add much value to the machine learning models. The models were then trained using a total of 67 features. The models work well and are capable of detecting irregularities in the data. On both datasets, the decision tree, k-NN, and ensemble scores are all above 0.95. Assi and Sadiq in 2018 [17] proposed a feature selection strategy based on the Modified Artificial Immune System. The suggested algorithm makes use of the benefits of the Artificial Immune System to optimize the efficiency and the randomization of features.
Experiments on the NSL-KDD dataset [18] revealed that the proposed algorithm was more effective than other feature selection algorithms (best first search, correlation, and information gain). In 2017, Assi and Sadiq suggested five core classification methods to classify network attacks using the NSL-KDD dataset [13] and three feature selection strategies [19]. These methods include the J48 decision tree, SVM, Decision Table, Bayes Network, and Back Propagation Neural Network. The feature selection approaches include correlation-based feature selection (CFS), information gain (IG), and decision tables. A number of trials were carried out using NSL-KDD training and testing data for the general attack case (Anomaly and Normal), as well as for 4 different types of attacks: DoS, U2R, R2L, and probing. The J48 classification scheme produces the highest performance, 80.3 percent when using the testing dataset and 93.9 percent when using the training dataset.

Random forest
Random forests can be defined as an ensemble learning method for classification, regression, and other tasks. They operate by constructing a large number of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). Random decision forests correct for decision trees' habit of overfitting to their training set. RFs are a way of averaging several deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the cost of a slight rise in bias and some loss of interpretability, but it typically results in a substantial improvement in the performance of the final model. Forests pool the efforts of many decision trees; in this way, the teamwork of multiple trees improves on the performance of a single random tree [20]. A hybrid feature selection for Random Forest has been proposed that depends on two measures, Information Gain and Gini Index, combined in different weight-based percentages. The key plan is to measure the Information Gain for all randomly selected features and then look for the split point in the node that offers the best value of a hybrid Gini Index equation [22]. A random forest algorithm has also been proposed that uses an accuracy-based ranking, relying on the accuracy of each single tree from the previous Random Forest assessment. The proposed model consists of two primary stages: the training process, responsible for the development of the trees, and the evaluation phase, which contains two test tiers (evaluation test and accuracy test) [23].
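The two split-quality measures named above, Information Gain and Gini Index, can be illustrated with a minimal sketch. The weighting used by the hybrid equation is not given in this paper, so the code only computes each measure separately for one candidate split; the function names and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


def weighted_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Toy node (hypothetical data): a split that separates the two classes perfectly.
parent = ["Normal", "Normal", "Anomaly", "Anomaly"]
left, right = parent[:2], parent[2:]
gain = information_gain(parent, left, right)
impurity = weighted_gini(left, right)
```

A tree-building procedure would evaluate such measures for every candidate split point and keep the split with the highest gain or the lowest weighted impurity.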

Enhancement Performance of Random Forest Algorithm
One-hot encoding is a process of representing categorical values as binary vectors. Many learning algorithms either use distances between samples or learn a single weight per feature. The latter is true for linear models like logistic regression, which are simple to understand. Many machine learning algorithms cannot work directly on categorical data, which must therefore be converted to numbers; this is where the one-hot encoding technique plays a significant role. A natural first choice is to use integer encoding directly, but this is only appropriate when the categories have a natural ordinal relationship. In the one-hot encoding technique, the categorical values in the data are first mapped to integer values, and each integer value is then converted to a binary vector. In the proposed system framework, shown in Figure 3, the three targets are one-hot encoded, which results in two labels for binary, five labels for category, and nine labels for subcategory. One-hot encoding is done by using one for the corresponding label and zero for the other labels, and the resulting vector is used as the target of a multi-class classification problem, where we have a vector of target classes. One-hot encoding makes the training process more effective and lets the built model learn more efficiently, as it gives the model more expressive power to learn a probability-like number for each possible label value. This can help make the problem easier for the model to learn. When one-hot encoding is used for the output variable, it offers a more nuanced set of predictions than a single label. The random forest classifier is then trained using the training set, and the trained model is tested using the testing set. The proposed system achieved high accuracies: 99.9% for the binary label, 99.3% for the category label, and 95.8% for the subcategory label.
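As a minimal sketch of the encoding step described above, the following pure-Python code one-hot encodes a list of labels and decodes a vector back to its label. The helper names are hypothetical, and the five category names are taken from the illustrative "Category" example used in this section rather than from the dataset files.

```python
def one_hot_encode(labels, classes):
    """Return one binary vector per label: 1 at the label's class position, 0 elsewhere."""
    index = {name: position for position, name in enumerate(classes)}
    vectors = []
    for label in labels:
        row = [0] * len(classes)   # zero for all other labels
        row[index[label]] = 1      # one for the corresponding label
        vectors.append(row)
    return vectors


def decode(vector, classes):
    """Map a one-hot (or probability-like) vector back to the most likely class name."""
    return classes[vector.index(max(vector))]


# Illustrative category names (assumed ordering; integer codes 0..4).
CATEGORIES = ["Scan", "Mirai", "MITM", "ARP", "DOS"]

targets = one_hot_encode(["MITM", "Scan", "DOS"], CATEGORIES)
# targets[0] is the vector for "MITM": a 1 in the third position only.
recovered = decode(targets[0], CATEGORIES)
```

Unlike the integer codes, no vector is numerically "between" two others, so the encoding does not suggest an ordering among the categories.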
To represent the one-hot encoding process more formally, let $D = \{d_1, d_2, \ldots, d_n\}$ represent the labeled network traffic data in industrial environments, given as the input set, and let $C = \{c^{(1)}, c^{(2)}, \ldots, c^{(n)}\}$ represent the corresponding labels for $D$, as shown in Table 1. Each instance $d_i$ is described by $m$ features $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$. Suppose $c^{(1)} = l_1$, $c^{(2)} = l_2$, and $c^{(3)} = l_3$, and the other response values in $C$ are each either $l_1$, $l_2$, or $l_3$. The one-hot encoding process then assumes a vector whose size equals the number of distinct labels, that is, three in this example, and sets each entry of the vector to 0 or 1, where 0 means that the data instance does not correspond to that label and 1 means that it has that label. Table 1 shows this example and its encoding process. For more clarification, suppose we have a single categorical feature "Category" with the values "Scan", "Mirai", "MITM", "ARP", and "DOS". Assume, without loss of generality, that these are encoded as 0, 1, 2, 3, and 4. Integer encoding is insufficient for categorical variables that have no such ordinal relationship. In fact, using this encoding lets the model assume a natural ordering between the categories, which may result in poor performance or unexpected results (predictions halfway between categories). In this case, one-hot encoding can be applied to the integer representation: the integer-encoded variable is removed and a new binary variable is added for each unique integer value. But with the one-hot encoding, the representation is [

Via One Hot Encoding
The accuracy is computed by counting the correct predictions while looping through the testing dataset; this count is then divided by the dataset size and multiplied by 100 to obtain the final percentage. Random forests are a collection of classification and regression trees, each trained on a training dataset generated by random selection from the initial dataset. When building a tree, a randomly selected set of test records that contains no records from that tree's training dataset is used as a group for testing the trees within the forest. To calculate the accuracy of each tree, take its error rate in determining the class of the inputs from the test group; to determine the error rate of the forest, take the error rate over all the trees in the forest. According to the out-of-bag estimate, this rule splits the data into two parts, allowing the model to be checked on the same dataset with a precision comparable to that of a separate test. To classify a new record, its values are passed to all of the trees in the random forest. Each tree predicts the class of this record, and the forest then makes a decision based on the majority vote of the trees [24].
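The random selection of training records, the resulting out-of-bag test group, and the majority vote described above can be sketched as follows. This is a simplified illustration under stated assumptions (sampling with replacement, one bootstrap sample per tree); the function names are hypothetical and no actual trees are trained here.

```python
import random
from collections import Counter


def bootstrap_sample(n_records, rng):
    """Draw n_records indices with replacement; indices never drawn form the out-of-bag group."""
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_records) if i not in chosen]
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)

# Hypothetical predictions of three trees for one new record.
forest_decision = majority_vote(["DoS", "Scan", "DoS"])
```

Each tree would be trained on the records indexed by `in_bag` and tested on those indexed by `out_of_bag`, so no tree is tested on records it was trained on.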

Dataset description
The IoTID20 dataset [12] includes intrusion and normal activities recorded by notebooks, tablets, and smartphones in a smart home IoT network with a Wi-Fi router, an SKT NGU device, and an EZVIZ camera. There are 83 attributes and 625,783 instances in the dataset. The nominal attributes are omitted, resulting in a dataset with 79 features. The binary intrusion detection label, the category label, and the sub-category label are all present in the dataset. The IoTID20 dataset's binary, category, and sub-category labels are listed in Table 2.

Input: Training Dataset, Number of Features, select target
Output: Tree of random forest
Begin:
Step 1: Count the number of classes for the target in the training dataset
Step 2: Add feature columns, equal in number to the classes and named after them, to the training dataset
Step 3: K = the number of instances in the training dataset
Step 4: Loop i from 1 to K
Begin:
a. put 1 in the cell that represents the class
b. put 0 in the other cells
End
Step 5: TG = Create array of target classes
Step 1: Count the number of classes for the target in the testing dataset
Step 2: Add feature columns, equal in number to the classes and named after them, to the testing dataset
Step 3: K = the number of instances in the testing dataset
Step 4: Loop i from 1 to K
Begin:
a. put 1 in the cell that represents the class
b. put 0 in the other cells
End
Step 5: TG = Create array of target classes
Step 6: TD = Drop all the columns for target classes from the testing dataset
Step 7: Predictions = Random Forest algorithm (T, TD, TG)
Step 8: Right = 0
Step 9: Loop i from 1 to K
Begin:
if Predictions[i] = TG[i]; Then: Right = Right + 1
End
Step 10: Accuracy = (100 * Right / K)
End

Table 2. IoTID20 dataset

Table 3 lists the precise distribution of dataset records between normal and intrusion activities. By introducing more damaging risks, IoT systems have increased the attack surface. Denial of Service (DoS), Man-in-the-Middle (MITM), Distributed DoS (DDoS), and active scanning were the main malicious activities injected and tracked to produce the dataset. The type of DoS attack considered is one that floods synchronize (SYN) packets into TCP-based connections. SYN packets are typically used to create TCP connections between the communicating parties by reserving resources on both sides, primarily ports and buffers. Such flooding may be used to target the availability of the server and/or victim machines.
Furthermore, flooding with acknowledgment (ACK), Hypertext Transfer Protocol (HTTP), and User Datagram Protocol (UDP) packets was used to introduce DDoS attacks in the form of IoT Mirai. A brute force attack has also been used to decrypt data and compromise confidentiality. MITM has also been used to poison the Address Resolution Protocol (ARP) table and map the attacker's Media Access Control (MAC) address to the router's Internet Protocol (IP) address. As a result, the intruder can impersonate the network router and disrupt interactions between network entities. The key goal of this attack is to sniff or manipulate data that is being transmitted [11]. The IoTID20 dataset is split into two parts: 75% for training and 25% for testing. Figure 4 shows a comparison between the proposed system for predicting the binary (normal/anomaly) target and other state-of-the-art studies in terms of accuracy. The accuracy is defined as the total number of correct predictions divided by the total number of predictions. The accuracy of the proposed system and of the systems proposed by Farah [3] and Qaddoura et al. [13], discussed in the related work section, approaches 100%. The overall evaluation shows the superiority of our proposed algorithm. Figure 6 shows the accuracy of the algorithms proposed by Ullah and Mahmoud [12] for predicting the sub-category on the IoTID20 dataset. They achieved high accuracy using their decision tree, random forest, and ensemble algorithms, but their proposed system uses all the data in the training step, which makes the algorithms prone to overfitting and makes the results not dependable.
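The 75%/25% split and the accuracy measure defined above can be sketched in a few lines. This is an illustrative sketch only: the shuffling, the seed, and the function names are assumptions, and the toy predictions are not results from the paper.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and return (training part, testing part)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy_percent(predictions, targets):
    """Correct predictions divided by total predictions, multiplied by 100."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equally long")
    right = 0
    for predicted, target in zip(predictions, targets):
        if predicted == target:    # one-hot vectors are compared element-wise
            right += 1
    return 100.0 * right / len(targets)


train, test = train_test_split(range(100))

# Toy one-hot predictions for a binary target: three of four are correct.
toy_predictions = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_targets = [[1, 0], [0, 1], [0, 1], [0, 1]]
toy_accuracy = accuracy_percent(toy_predictions, toy_targets)
```

Keeping the testing part disjoint from the training part is what allows the reported accuracy to reflect performance on unseen records.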

Results
On the other hand, very small microstrip filters and antennas [25][26] can be utilized to improve the portability of IoT systems with efficient wireless communication.

Figure 6. Accuracy of algorithms proposed by Ullah and Mahmoud [12] predicting sub-category (Normal/ Host Port/ Port OS/ ACK Flooding/ Host BruteForce/ HTTP Flooding/ UDP Flooding/ MITM ARP Spoofing/ Synflooding).

Conclusion
Intruders will initiate more disruptive cyber-attacks due to the rapid development of IoT devices. With disruptive operations, an attacker aims to deplete the target IoT network's resources. Researchers and company owners are concerned about the reliability of IoT networks, which has a significant effect on the availability of the services provided by IoT devices as well as on the safety of users connecting to the network. An intrusion detection mechanism protects the network by detecting malicious activity. The nominal attributes are omitted from the IoTID20 dataset, resulting in a dataset with 79 features. Label, category, and subcategory are the three targets in the dataset. The dataset is then divided into two parts: training and testing. The three targets are one-hot encoded, yielding two binary labels, five category labels, and nine subcategory labels. The training set is then used for training the random forest classifier, and the trained model is then evaluated with the testing set. The suggested scheme reached high accuracies, with a binary label accuracy of 99.9%, a category label accuracy of 99.3%, and a subcategory label accuracy of 95.8%. The methodology outperforms currently available state-of-the-art approaches.