Detecting insider threats within institutions using the CERT dataset and different ML techniques


Insider attacks account for roughly 25% of all threats against organizations, and this share is increasing gradually [3]. Reported insider-threat incidents include destroyed data, stolen customer records and trade secrets, theft of personally identifiable information, and destruction of IT systems [4]. Network-connected systems are an essential and sensitive part of any company, where data associated with employees and clients, and the confidence placed in them, are stored and processed. The insider threat is therefore a major cybersecurity threat that must be addressed as a top priority to ensure the continued safety of these systems and hence the functions of the institution. A technical report issued by the CERT Insider Threat Center defines the threat as actions by malicious or unintentional persons whose permission to access the institution's network, systems, and data is used in a way that negatively affects the confidentiality of those data. Malicious activity relevant to insider threats can be carried out deliberately by spiteful insiders, for example by sabotaging information systems or revealing confidential information, as well as by unintentional insiders, for example through negligent use of authorized resources [4]. Compared with traditional intrusion detection tasks, insider threat detection poses additional challenges: the insider is authorized to access the network sites and subsystems belonging to the organization, possesses sufficient knowledge of the institution's security layers, and, in most institutions, the activities of malicious insiders are irregular. Consequently, the data available to describe their activity and movement are poorly documented and scarce [5].
The challenges of insider threat detection also originate from the process requirements and the wide range of data kinds that must be investigated in institutional conditions: network traffic, web access logs, files, email history, and employee information. The data prepared by a corporation also fluctuate greatly. Consequently, only a small portion of institutions have the tools and human resources to characterize user behavior and intent from monitored data. This paper performs an evaluation of ML application routines for understanding enterprise network systems and malicious insiders within institutions. A model is proposed, and workflows for detecting user-focused internal threats are evaluated, from collecting and preprocessing data, to analyzing the data with ML techniques, to reporting and analyzing the resulting information. The designed methodology intends to help cybersecurity analysts learn from only a small fraction of common employee actions, identify malicious insider activity, distinguish threats within anonymized data, and produce valuable insights [3].
The research intends to make the following contributions: (1) realistic requirements are hypothesized for training ML procedures so that the outcomes simulate real-world environments, highlighting comparative differences with training under various conditions; (2) data-processing technologies are explored for their effect on obtaining multiple levels of data accuracy with informative detail for data analysis [3]; and (3) a comprehensive results-reporting process is provided for the different ML techniques, and malicious cases are inspected, to give better insight into performance.
The rest of the paper is organized as follows. Section II discusses behavioral and definitional issues. Section III presents the background and related work. Section IV presents the proposed methodology, with the dataset, data preprocessing steps, feature extraction, normalization, the ML algorithms used, results, and evaluation metrics. Section V presents conclusions and suggested future work.

II. Behavioral and Definitional Issues
The constructs of the research should be clearly described. [6] [7] Once validity proofs are defined, they are bound to the context of the research in which they are presented. [8] For instance, one difficulty in characterizing insider threats arises from how people and establishments describe an "insider", that is, a member of the organization. Public-private partnerships, moreover, blur the boundaries that separate insiders from outsiders [9], producing separate and problematic analytical divisions. Therefore, how the term operates requires critical investigation to determine its relevance and generality.
Insider Threat Descriptions. Representations of an insider threat fluctuate extensively [10], [11]. All too often, the concept is left indeterminate, with studies indicating non-adaptive organizational behavior with little modification. For instance, "Insider threat indicates harmful acts that trusted insiders may commit" [12], and "A malicious insider appears when a trusted user of the information system behaves in a way that the security policy defines as unacceptable." Likewise, the omission to report questionable behavior has been described as a "passive" insider threat. [13] Overly broad interpretations are a crucial concern for valid identification and prediction, since under them all employees would be identified as insider threats.
Intentional explanations have been judged, highlighting considerations of intent [14] or malice. Additional definitions also involve intentional behavior that is not malicious [15, 16] or do not directly address intention. [17] Under these descriptions, insider threats comprise behaviors that conflict with the methods, procedures, and priorities that the institution has deliberately or accidentally set. Examples of insider threats include destroying or deleting important knowledge assets, performing unauthorized modification of data or records, copying under unauthorized conditions, extracting data or records, intended or accidental interruption of networks, destruction of facilities, eavesdropping, and packet sniffing. From this standpoint, the precise motives matter less than the fact that the behaviors diverge from acceptable norms [18].
Researchers should additionally attempt to provide a basic distinction among types of insider threats according to data from employee profiles. [19] As Salem et al. [20] represent it, there are three general kinds of methods: host-based user profiling (web and operating systems), network-based sensors (user behavior on the network), and integrated procedures that join the characteristics of sensors and user profiling. Which method to utilize depends on the kind of insider threat to be discovered; moreover, that will differ for unintended insider threats, which may be the consequence of social-engineering efforts [21, 22].
Band et al. applied a case-based method to represent the analytical portions of three behavior variations. [23] However, the implications of insider behavior differ according to whether it reflects disguise or leakage. [20] Disguisers are assumed to have obtained the identification and authorization of a network employee. [24, 20, 25] Considering that employees work in a comparatively uniform manner, disguisers are discovered by discerning differences in user action that do not match the actual profile. Leakers, instead, are authorized employees who act with apparent cohesion within the network. Since authorized people may legitimately access the same file locations, detecting disclosure is also more complex. Crucially, the leak may even transpire outside the network.
While these are sometimes referred to as "intentions," limited detail is given about those purposes or about how they produce different behavior. An understanding of the attacker's psychology is also lacking [26]. Anderson and Pearson [34], in their account of workplace incivility, observe that treating "deviant" behavior in the workplace in isolation disregards personal impulses and the communicative circumstances required to accurately predict behavior within an organization. Many insider threat models fail to recognize this possibility or to identify key social dimensions [26], [27]. Using transcripts, Searle and Rice [28] incorporated characteristics of impulse into their insider threat categorization. Drawing on previous interviews related to three significant circumstances within the institution, they recognized four distinct types of insiders: insiders who failed to control their behavior and inadvertently violated the rules (the omitters); those who regularly engaged in abnormal behavior that might indicate a serious violation of workplace standards; those whose systematic commitment to grave and secondary violations of workplace standards was joined with correlated aggressive (retaliatory) or negative (withdrawal) behaviors directed towards individuals or the institution (the avengers); and, furthermore, individuals who failed to report suspicious behavior. While these divisions are informative, their investigations do not consistently link them to basic psychological dimensions such as impulse.

III. Related works
Detecting insider threats is a hard research problem, not only for the research community but also for national organizations and agencies and for cybersecurity companies. The US National Insider Threat Task Force and the CERT Insider Threat Center provide collective guidance to help prevent and alleviate insider threats in organizational settings. [29] [30] The referenced guide describes 20 practices institutions should implement across the corporation to avert and discover insider threats, in addition to case examinations of companies that have failed to detect insiders [29]. Liu et al. provided an overview of the research literature on insider threats and associated cybersecurity concerns, including malware and advanced persistent threats [31]. Moreover, Homoliak et al. introduced a taxonomy structure and a novel categorization of insider threats [32]. Since the threat problem is associated with human factors, several research techniques approach the difficulty by applying psychological patterns and decision-making principles [33] [34] [35]. Padayachee applied contingency criminology hypotheses to envision insider threats, as well as speculative mitigation measures to help relieve insider threats [33].
Greitzer et al. suggested a framework for predictive modeling that integrates many data sources and psychological/motivational determinants to assist the analyst in discovering essential risk behavior, implemented criminology-of-opportunity approaches to visualize insider threats, and proposed opportunity-limitation measures to help mitigate insider threats [34].
Legg et al. suggested a structure for modeling insider threats that depends on observations of psychological and behavioral aspects. The structure supports the analyst in clarifying and representing possible insider threats across multiple domains, such as organizational policies and human behavior [35].
A huge quantity of data is given to and gained by any organization every day. Solutions based on ML are among the most promising techniques for solving the cybersecurity challenges of the current era [36].
The benefit of ML is the capability to learn automatically from a massive quantity of data and to distinguish the patterns that most closely characterize malicious actions or anomalous behaviors [37].
Researchers have further examined malicious user indicators within guided studies that could inform the design of an insider threat detection system [38].
Anomaly detection represents a common method that uses ML techniques for detecting insider threats: models of typical employee behavior are created, and an anomaly is described as a deviation from normal behavior. The anomaly signal, in this situation, refers to changes in the user's behavior as a potential first sign of insider threats. Parveen et al. proposed a graph-based model with a learning-guided method for discovering insider threats according to data flows. To do this, quantitative pattern dictionaries are generated for each piece of data, and a piece of data is considered anomalous if it has a high distance from the dictionary's usual patterns [39].
An additional method for detecting anomalies models sequences of human activities in order to detect abnormal sequences.
Rashid et al. proposed a system with a hidden Markov model that learns, on a weekly basis, the sequence of normal activities for each user from their common activities; a sequence of abnormal activity detected by the model indicates an insider threat [40]. Many insider threat discovery systems have been supported by the DARPA (Defense Advanced Research Projects Agency) ADAMS (Anomaly Detection at Multiple Scales) project, whose purpose is to "identify patterns and anomalies in very large data sets" for preventing and detecting insider threats [41] [42] [43].
Various algorithms have been used to detect anomalies, including hidden Markov models and Gaussian mixture models, applied to grouped user activity log data to identify insider threat indicators [44].
Eldardery et al. used anomaly detectors combined in hybrid form on user activity logs to discover two classes of insiders: intrusive insiders from the blend, and insiders whose behavior contained abnormal actions [41].
Disguise detection policies have been suggested based on the discovery of anomalies in user search and file access behaviors [45] [46].
Javai et al. applied various machine learning methods to organized data to detect abnormality and initial indications of "withdrawal", as both may indicate internal threats [41].
The capabilities of various ML techniques, such as Bayesian-based approaches [48][49], decision trees [47], and self-organizing maps [50], have also been tested for discovering insider threats. Real-time learning methods have been used to distinguish the conditions for unusual user behaviors.
Tor et al. suggested an approach for discovering abnormality using an ensemble of deep neural networks, with one model per user; the neural networks are used to build abnormality scores [51].
Bose et al. proposed a model that uses supervised and unsupervised scalable learning techniques to integrate streams of heterogeneous data to recognize abnormalities and insider threats [52].
On the other side, Le et al. applied conventional genetic programming algorithms supporting two behavioral theories, static and dynamic, to check evolutionary probability computation in detecting insider threats [53]. In this paper, different ML techniques have been applied to the CERT dataset, after preprocessing and feature extraction, for detecting insider threats, and the performance of these techniques has been compared.

IV. Proposed Methodology

Dataset
For insider threat recognition, many suitable datasets are not available. Therefore, the insider threat dataset published by CERT (Carnegie Mellon University) has been applied in this research [54]. The "R4.1 and R4.2" datasets have been used for the analysis. This dataset contains six kinds of data logs: HTTP, logon, email, psychometric, device, and file. All activities of 1000 employees over 17 months are contained in this dataset [55].
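As an illustration of the log format, the sketch below parses a single logon-style record with Python's standard csv module. The column names mirror the dataset's logon log (id, date, user, pc, activity), but the record values themselves are invented for this example.

```python
import csv
import io

# Hedged sketch: parsing one CERT-style logon record. The header mirrors
# the r4.2 logon log layout (id, date, user, pc, activity); the values in
# the data row are invented for illustration.
sample = io.StringIO(
    "id,date,user,pc,activity\n"
    "{A1}-0001,01/02/2010 08:04:00,ACME/U0001,PC-1234,Logon\n"
)
record = next(csv.DictReader(sample))
print(record["user"], record["activity"])   # ACME/U0001 Logon
```

In practice each of the six log types would be read the same way and keyed by user ID for the aggregation step described later.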

Preprocessing
Pre-processing is significant not only for identifying insider threats appropriately but also for cybersecurity duties in general. The best monitoring method, coupled with suitable data accumulation, allows the successful employment of machine learning procedures and assists security analysts in making the right decisions. The data collected from institutional environments come from wide and variable sources, and this variety of resources produces data with different patterns. [29] [56] In this paper the data have been organized into two main categories. The first category contains the activity log data. It represents the real-time data sources that need to be accumulated and prepared in a suitable time and form for speedy detection of, and response to, malicious and anomalous activity. This category comes from different system logs such as file accesses, firewall logs, emails, captured network traffic, and web activity. The second category contains the organizational structure and user information, including employee information, relationships with other employees, and the employee's role in the institution. This second category acts as the context or background data and contains the more complex data, including behavioral models and user psychometrics. To assist in manipulating the data and creating features, user context models are built for all users in the institution. These models include the auxiliary information associated with every user, such as restricted devices, relations with other employees, tasks, working hours, authorized access, etc. Based on the user context models, feature vectors are created that summarize user activity, immediately and periodically, from the input data.
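As a minimal sketch of how the two categories can be combined, the example below enriches an activity-log event with user-context information (here, authorized working hours) so that later features can flag out-of-hours actions. The context fields and hours are invented for illustration, not taken from the dataset.

```python
# Hypothetical user-context model: maps a user ID to auxiliary
# information such as role and authorized working hours (24h clock).
USER_CONTEXT = {
    "u1": {"role": "engineer", "work_hours": (8, 18)},  # 08:00-18:00
}

def out_of_hours(user, hour):
    """Return True when an event's hour falls outside the user's
    authorized working hours from the context model."""
    start, end = USER_CONTEXT[user]["work_hours"]
    return not (start <= hour < end)

print(out_of_hours("u1", 23))   # True: 23:00 is outside 08:00-18:00
print(out_of_hours("u1", 9))    # False: 09:00 is within working hours
```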

Feature extraction
Feature extraction is performed using the input data and the user context to obtain user data vectors that are used for training the ML models. First, data are aggregated from the different resources by user ID according to aggregation conditions C, such as the number of performed tasks and the time duration. Second, a numeric vector Xc, also called a data instance, is generated by applying feature extraction to the aggregated data. The numeric vector, of fixed length N, summarizes the user's actions and includes the user's information. Categorical data have been encoded in numerical form so that the ML techniques can be applied. The provided features are frequency features that represent the number of actions of each type performed by the user during the aggregation window, over the fields date, user, PC, activity, and vector.
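The two steps above (aggregation by user and window, then frequency vectorization) can be sketched as follows. The action vocabulary and the event tuples are invented for illustration; the paper's exact aggregation conditions C are not reproduced here.

```python
from collections import Counter, defaultdict

# Assumed fixed action vocabulary; the vector length N = len(ACTIONS).
ACTIONS = ["Logon", "Logoff", "Connect", "Disconnect", "Email", "File", "Http"]

def aggregate(events):
    """Step 1: group (user, day, activity) events into per-(user, day)
    counters, mimicking aggregation by user ID over a time window."""
    buckets = defaultdict(Counter)
    for user, day, activity in events:
        buckets[(user, day)][activity] += 1
    return buckets

def to_vector(counts):
    """Step 2: fixed-length frequency vector Xc over the vocabulary."""
    return [counts.get(a, 0) for a in ACTIONS]

events = [("u1", "2010-01-02", "Logon"),
          ("u1", "2010-01-02", "Logon"),
          ("u1", "2010-01-02", "Email")]
xc = to_vector(aggregate(events)[("u1", "2010-01-02")])
print(xc)   # [2, 0, 0, 0, 1, 0, 0]
```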

Normalization
Raw data are not ordinarily in a state proper for processing by data mining and machine learning techniques. Data normalization is a method of molding raw data values into forms with characteristics more useful for modeling and analysis. It is necessary to prevent features with huge values from dominating the results; the term normalization thus refers to producing a normalized feature. Normalization aims to guarantee that a whole set of values shares a common characteristic, so that all the features are expressed on an equivalent unit of measurement. There are various normalization techniques, specifically Min-Max normalization, Z-score normalization, and decimal scaling normalization [57]. The Min-Max normalization method will be addressed because it is the one implemented in this paper.
Min-Max normalization applies a linear transformation to the source data: v′ = ((v − min) / (max − min)) × (new_max − new_min) + new_min, where min and max are the minimum and maximum of the original feature values and [new_min, new_max] is the target range.
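A minimal sketch of this transformation (not the paper's implementation) for a single feature column, defaulting to the common [0, 1] target range:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linear rescaling: v' = (v - min)/(max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant feature: map everything to new_min
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

print(min_max([10, 20, 30]))   # [0.0, 0.5, 1.0]
```

The guard for a constant feature avoids division by zero, a common edge case when a frequency feature never varies for any user.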

Classification
A. Random Forest
As its name suggests, a random forest consists of a large number of individual decision trees that operate as an ensemble in order to obtain better prediction performance. Each tree in the random forest produces a prediction for a class, and the class that receives the most votes becomes the model's prediction. The fundamental idea of the random forest is simple yet powerful: the wisdom of crowds. In data-science terms, the random forest model works well for the following reason: a large number of relatively uncorrelated trees operating as a committee will outperform any single-component technique. The key is the low correlation among the models. Just as low-correlation assets are collected together to construct a portfolio greater than the sum of its parts, independent models can generate ensemble forecasts of higher accuracy than any individual forecast. The reason for this striking effect is that the trees protect each other from their individual errors: some trees may have a high error rate, but many other trees will be accurate, so as a collection the trees can pull the prediction in the correct direction. The requirements for a random forest to operate well are therefore:
 There should be some actual signal in the built features, so that models created with those features perform better than random guessing.
 The predictions produced by the individual trees should have low correlations with each other.
B. Naïve Bayes
Naïve Bayes is a probabilistic classifier whose performance depends on Bayes' theorem, with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian inference models, but combined with kernel density estimation they can achieve high accuracy levels.
The naïve Bayes classification model is highly scalable, requiring a number of parameters linear in the number of features (predictors) of a learning problem. Maximum-likelihood training can be accomplished by evaluating a closed-form expression, which takes linear time, rather than by the costly iterative approximation used by various other classification methods [58].
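To make the closed-form, linear-time training concrete, here is a from-scratch categorical naïve Bayes sketch with Laplace smoothing; fitting is a single counting pass over the data. The toy events and labels are invented, and this is an illustrative sketch rather than the classifier configuration actually used in the experiments.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, X, y):
        self.classes = Counter(y)                # class priors via counts
        self.n = len(y)
        self.counts = defaultdict(Counter)       # (class, feature idx) -> value counts
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.counts[(yi, j)][v] += 1
        return self

    def predict(self, x):
        def log_post(c):
            lp = math.log(self.classes[c] / self.n)
            for j, v in enumerate(x):
                cnt = self.counts[(c, j)]
                # add-one smoothing so unseen values never zero out a class
                lp += math.log((cnt[v] + 1) / (self.classes[c] + len(cnt) + 1))
            return lp
        return max(self.classes, key=log_post)

# Invented toy events: (channel, time-of-day) -> label.
X = [("usb", "night"), ("usb", "night"), ("web", "day"), ("web", "day")]
y = ["malicious", "malicious", "normal", "normal"]
model = NaiveBayes().fit(X, y)
print(model.predict(("usb", "night")))   # malicious
```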

C. 1-Nearest Neighbor
1NN is one of the lazy-learner ML techniques: no model is built in advance, and the work happens only when a test sample is ready to be classified. The proximity between the test instance and all instances of the training set is calculated using a proximity measure, the nearest neighbor is chosen according to its proximity to the test sample, and the class of the test sample is taken to be the class of that nearest training instance [47]. Euclidean distance, Manhattan distance, and Minkowski distance are examples of proximity measures. The following equation shows the Euclidean distance used in this research [58]:
d(x, y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
where x and y are two points in an n-dimensional feature space and n is the number of dimensions.
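A minimal sketch of the 1NN rule with this distance, on invented two-feature points (for example, normalized frequency features):

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) over the feature dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def predict_1nn(train, test_point):
    """Lazy learner: return the label of the single closest training instance."""
    nearest = min(train, key=lambda pair: euclidean(pair[0], test_point))
    return nearest[1]

# Invented labeled feature vectors for illustration.
train = [([0.1, 0.2], "normal"), ([0.9, 0.8], "malicious")]
print(predict_1nn(train, [0.85, 0.9]))   # malicious
```

Because every prediction scans the whole training set, 1NN trades training cost for query cost, which is why normalization (previous subsection) matters: unscaled features would dominate the distance.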

Evaluation
Performance measures vary in scalability and behavior; in this research two performance measures have been used, the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE).
 Root Mean Square Error (RMSE)
RMSE is a commonly applied measure of the differences between the test-set values predicted by a classifier and the actual values. It represents the square root of the second sample moment of the differences between estimated and actual values, i.e., the quadratic mean of these differences: RMSE = √( (1/n) Σᵢ₌₁ⁿ (yᵢ − xᵢ)² ). These differences are called residuals when the computations are performed over the data used for fitting, and are called errors when calculated out-of-sample. RMSE serves to aggregate the magnitudes of the estimation errors for multiple data instances into a single measure of predictive power. RMSE is an accuracy measure used to compare the forecasting errors of various classification models on a particular dataset, not between datasets, as it is scale-dependent.

 Mean Absolute Error (MAE)
MAE is a statistical measure of the difference between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of estimated versus actual values, subsequent time against initial time, and one method of measurement against an alternative method. MAE is computed using the following formula:
MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ − xᵢ|
where yᵢ is the estimated value, xᵢ is the actual value, and n is the total number of instances.
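Both measures can be sketched directly from their formulas; the actual/predicted vectors below are invented toy values, not the paper's results.

```python
import math

def rmse(actual, predicted):
    """Root of the mean squared residual: sqrt((1/n) * sum (y_i - x_i)^2)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean of the absolute residuals: (1/n) * sum |y_i - x_i|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual, predicted = [1, 0, 1, 1], [1, 1, 1, 0]
print(mae(actual, predicted))    # 0.5
print(rmse(actual, predicted))   # ~0.7071
```

Note that RMSE ≥ MAE always holds, with equality when all residuals have the same magnitude; a large gap between them indicates a few large errors.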

Results
The results have been obtained by applying three ML algorithms: Random Forest, Naïve Bayes, and 1-Nearest Neighbor. The first configuration achieved an accuracy of 89.59040507% with an error rate of 10.40959493%. Applying the 1-Nearest Neighbor technique with a training set of 66%, and the remainder as the test set, gave the results summarized in figure (7): an accuracy of 94.68205476% and an error rate of 5.317945236%.

V. Conclusion and future works
The research has some issues that should be discussed in this section: 1- The Random Forest technique with a 66% training split performs nearly equivalently to Random Forest with 10-fold cross-validation, because the dataset is large and in both cases there is enough training data. 2- Feature selection methods do not work properly with this kind of dataset because of the data types of some of its features. 3- The ML model has been trained and tested using the five effective features date, user, source, action, and vector, following the Towards Data Science website. 4- Body tracking and gait analysis using depth sensors will be added to the research in the future, for confidentiality purposes. 5- As future work, gesture recognition and body-language analysis can also be added, to discover cases of lying and anxiety.