View of Phishing Webpage Detection using Hybrid Features and Deep Learning Techniques

(1)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2817

Phishing Webpage Detection using Hybrid Features and Deep Learning Techniques

Arivukarasi M1_{and Antonidoss A}2

1_{Computer science and Engineering, Hindustan institute of technology and science, chennai, India} 2_{Computer science and Engineering, Hindustan institute of technology and science, chennai, India}

Article History: Received: 10 January 2021; Revised: 12 February 2021; Accepted: 27 March 2021; Published

online: 28 April 2021

Abstract: Phishing is a sort of online assault that endeavors to cheat delicate data of organization clients. Current phishing

website page discovery strategies primarily utilize manual component assortment, and there are issues that include extraction is muddled and the conceivable connection between's highlights can't be stayed away from. To tackle the issues, another phishing site page location model is proposed, among which the primary segments are programmed taking in portrayals from multi-perspectives highlights through portrayal learning and extricating highlights by mixture profound learning organization. Initially, the model treats URL, HTML page substance, and document object model design of pages as character arrangements individually, also, utilizes portrayal learning innovation to consequently get familiar with the portrayal of the pages; at that point, sends different portrayals to a half breed profound learning network made out of a convolutional neural network also, a bidirectional long and transient memory network through various channels to separate neighborhood and worldwide highlights, and utilize the consideration component to fortify the influence of significant highlights; finally, the yield of various channels is combined to acknowledge classification expectation. Through four arrangements of trials to check the recognition impact of the model, the outcomes show that the general classification impact of the model is superior to the current exemplary phishing site page discovery strategies, the exactness arrives at 98.05%, and the bogus positive rate is just 0.25%. It is demonstrated that the techniques of extricating website page highlights from all angles through portrayal learning and cross breed profound learning organization can successfully improve the discovery impact of phishing site pages..

Keywords: Phishing, Convoluational Neural Network, LSTM, URL

1. Introduction

Phishing is a sort of assault that assailants utilize social designing furthermore, specialized mask and other assault strategies to swindle clients to visit counterfeit site pages by sending beguiling spam, real-time correspondence messages, and so forth, to prompt clients to unveil their own character, financial account, and other delicate data. As indicated by the most recent report of the APWG, the all out number of phishing website pages in the second quarter of 2020 expanded by 14.9% over a similar period in 2019 [1]. The proceeded development of phishing assaults contrarily affects the sound improvement of the Web. In the hostile and cautious game with phishing, phishing website page investigation and location innovation have been persistently created, and the customary phishing website page discovery techniques, for example, boycott based [2], heuristic based [3], [4], visual closeness based [5], [6], and machine learning-based [7][11] techniques are proposed, and arising discovery techniques dependent on profound learning [13][21] are likewise proposed lately. Customary phishing site page discovery techniques are chiefly founded on the examination what's more, demonstrating of physically separated multi-source highlights for example, URL highlights, page content highlights, and page underlying highlights, which once showed solid protection from phishing assaults. In any case, as the iterative update speed of phishing website pages is quickened and the assaults are more hesitant, the customary strategies can just keep on giving more gritty investigation and concentrate more highlights, coming about in a fiasco of highlight measurements, and simultaneously can't dodge the conceivable connection between's highlights [11]. As of late, profound learning has been investigated and applied in the location of phishing site pages with its amazing programmed include extraction abilities. Be that as it may, due to the semi-organized nature of pages, it is muddled to naturally separate highlights from URL, page content, site page structure, and different viewpoints. Along these lines, existing investigations normally separate highlights just from a solitary perspective like URL also, absence of far reaching learning of website page highlights, so the recognition impact actually should be improved. Targeting taking care of the above issues, a phishing page identification methodW2V is proposed to naturally extricate page includes in multi-viewpoints. First and foremost, the model takes the URL, page substance, and document object model design of the website pages as text, and develops corpora dependent on characters, words, and sentences from these messages separately, and utilize the portrayal learning innovation to consequently gain proficiency with the multidimensional portrayals of pages; at that point, these portrayal vectors are contribution to various channels for include extraction. Convoluational Neural Network is utilized to separate nearby highlights, following by BiLSTM to get setting semantics and reliance highlights, and afterward the consideration component is utilized to fortify the influence of the significant highlights. At last, a classifier is utilized for classification expectation. This model consequently learns the qualities of phishing site pages without earlier information, maintains a strategic distance from the subjectivity of physically choosing includes, and can exploit a

(2)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2818

cross breed profound learning organization to separate highlights, so it accomplishes ideal discovery results. To the most awesome aspect our insight, this is the first examination to utilize portrayal learning innovation in Characteristic Language Preparing to naturally remove page substance and document object model underlying highlights, and to accomplish multidimensional include extraction along with consequently extricated URL highlights. Notice that in the paper, the distinction between phishing site pages and favorable website pages is their deceitful expectation rather than their appearances. Phishing site pages are intended to swindle clients of key private data, while considerate site pages are committed to drawing in individuals to peruse dully. So what the model needs to find is the dormant contrast among phishing and generous pages. Specifically, the vital commitments in this work are recorded as follows:

The paper takes the URL, page substance, and Document Object Model structure of the pages as text, and utilizations portrayal learning innovation to consequently get familiar with the portrayal of the site pages taking all things together measurements.

A crossover profound learning model that wires Convoluational Neural Network, BiLSTM, what's more, consideration is addressed.

Further, four analyses on theW2V are directed from various perspectives. The outcomes show that the classification execution is acceptable. The paper is coordinated as follows. In Segment II,

we present related deals with phishing site page discovery. At that point, the system and the itemized interaction of W2V is depicted in Segment III. In Segment IV, the presentation of theW2V is assessed. At long last, we finish up the paper and talk about future works.

2. Related Works

Scientists have proposed a progression of phishing website page identification techniques, including the generally utilized customary phishing page location techniques and the arising profound learning-based strategies.

2.1 A. Customary phishing website page identification techniques

Conventional phishing website page discovery techniques principally incorporate four classes: Techniques dependent on a boycott. These techniques identify phishing pages essentially dependent on the boycott by coordinating URL and other data without bogus positives. Yet, they can't effectively distinguish phishing pages not recorded on the boycott. Delegate applications are Google Chrome and different undertakings [2]. - Strategies dependent on heuristic standards. These techniques plan also, execute heuristic standards dependent on the similitudes existing between phishing pages. Regular investigates incorporate CANTINAC [3], PhishDetector [4], and so on Heuristic guidelines can distinguish most unreported phishing pages progressively, however the reason is that the measurable qualities of phishing pages are remarkable and fluffy coordinating advancements are utilized, so the Bogus Positive Rate (FPR) is high. ® Techniques in light of visual likeness. These techniques convert the pages to be distinguished into pictures, and afterward analyze the highlight vectors of the site pages to be tried and the objective pages through picture handling innovations [5]. An ordinary strategy is proposed in [6]. In spite of the fact that there are some new explores proposed as of late [7], such strategies are still weak to phishing site pages which are not outwardly like the objective pages. Techniques dependent on AI. They treat phishing page discovery as a classification or bunching issue, and utilize the comparing AI calculations to assemble recognition models [8]. Among them, the bunching strategies first partition the site pages into a few bunches, and afterward recognize the phishing site pages from the kindhearted pages by checking the bunches [9]. Then again, the classification techniques develop classifiers as indicated by the qualities of the marked examples, and afterward map the unlabeled examples to phishing or amiable [10], [11]. Among existed investigates, PCA-RF accomplished best in class execution with an exactness of 99.55% [12]. Because of the prevalent versatility, adaptability furthermore, exactness, the AI based techniques became standard among the over four kinds of techniques. The adequacy of AI based strategies for the most part relies upon the nature of the extricated highlights, so the strategies center around how to extricate and choose more powerful highlights. Regular highlights removed from phishing website pages normally incorporate URL factual attributes, page content attributes (page format, topic, and so forth) [8], and so on To oppose avoidance assaults from aggressors, the number of removed highlights is expanding. For instance, Google Chrome has extricated 2130-dimensional highlights for phishing identification [9], which significantly builds the intricacy of demonstrating, yet leaves the identification proficiency to be improved. Simultaneously, these strategies are avoided by the aggressors once the calculations or highlights are known to the phisher.

2.2 B. Profound Learning Based Phishing Site Page Identification Techniques

As of late, profound learning has been utilized in different fields as an option in contrast to conventional AI techniques furthermore, has made extraordinary progress. A few analysts have too applied it to phishing website

(3)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2819

page discovery [13][21]. Agreeing to whether the profound neural organization is utilized to remove includes naturally, these investigations are partitioned into three classes: Strategies dependent on artificial include designing. These strategies follow the possibility of customary phishing page discovery exploration to artificially extricate includes as the contribution of profound neural organizations. The contrasts between them incorporate which highlights are extricated and which profound neural organizations are utilized for learning. For instance, writing [13] separates exemplary highlights, for example, URL highlights, space highlights, site page substance and encoding highlights as data sources, and uses profound feed forward neural organization for discovery, while [14] removes 56-dimensional highlights from the URL, page substance, and document object model construction, and utilizations an Auto- Encoder to identify phishing site pages. Be that as it may, these strategies still can't evade the predisposition brought about by human experience. - Strategies dependent on programmed highlight learning. These techniques first revamp the first information from the website pages into a structure that can be learned by the neural organizations, and afterward extricate includes naturally by the profound neural organizations, finally utilize a conventional AI classifier to build up a classification model. As per various wellsprings of the first information, this sort of technique can be isolated into two sub-types: URL-based strategies and page content-based strategies. Among them, a great deal of investigates has been done on the URL-based strategies since URLs are not difficult to acquire furthermore, manage. For instance, writing [15][18] take URLs as text, and utilizations LSTM [15], [16], Denoising Autoencoder [17], Convoluational Neural Network [18] individually to describe URLs as highlight vectors with fixed-length. The page content-based strategies see page content as text all things being equal, and endeavor to naturally gain proficiency with the trademark portrayal of the pages from the page content [19], [20]. For instance, writing [20] extricates a progression of semantic highlights of phishing website pages receiving the W2V model. Since they can evade human inclination, the strategies dependent on programmed highlight learning have solid speculation capacity and are more appropriate for the short life cycle and quick emphasis of phishing pages. Be that as it may, this sort of technique for the most part utilizes unique, single info, for example, URL or page content. Contrasted and the conventional multi-faceted highlights, it needs extensive examination of the website pages, so the identification precision should be improved. ® Mixture techniques. The artificial highlights and programmed highlights are utilized at the same time as the contribution of the classification model. For instance, writing [21] first utilizes the half breed Convoluational Neural Network-LSTM model to extricate URL includes naturally, and afterward consolidates it with customary artificial highlights, for example, page content highlights to structure a multi-dimensional component and put it into XGBoost for classification. Albeit the strategy utilizes the benefits of the over two sorts of techniques, it can't dodge the influence of human experience. To take care of the difficult that current strategies can't take in the portrayal of website pages from multi-perspectives consequently, W2V proposed in this paper is planned to gain includes consequently from three perspectives, including URL, page substance, and Document Object Model structure, which don't need earlier information about phishing. At that point a half and half profound learning network dependent on convoluational neural network and BiLSTM is utilized, which can exploit convoluational neural network to remove the nearby highlights, and afterward exploit BiLSTM to extricate the worldwide semantic includes, and embracing a consideration system after BiLSTM to reinforce the learning impact.

3. Proposed Method

In this segment, the conventional assertion of phishing recognition is given firstly, and afterward the general structure ofW2V and its key advancements are really expounded. In this segment, the proper assertion of The objective ofW2V is to arrange the site pages to be tried as phishing or favorable, so the issue is viewed as a double arrangement issue. Think about a bunch of site pages

A. The General System

TheW2V comprises of five parts as demonstrated in Figure.1:

➢ Data assortment. Acquire phishing site pages and benevolent pages from PhishTank and Alexa sites separately, to shape a dataset;

➢ Site page parsing. Concentrate the first information of URL, page substance, and document object model construction of every website page to develop the relating quantity; also, document object model ➢ Website page portrayal. Utilize the word inserting innovation in NLP to learn relating portrayals of the

(4)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2820

content,

Figure 1. Proposed Model framework.

➢ Highlight extraction. Info different portrayal vectors to various cross breed convoluational neural network-BiLSTM organizations to remove neighborhood and worldwide highlights, and afterward join the consideration component to reinforce significant highlights;

➢ Classification expectation. Link the multi-channel yield vectors, and utilize the classifier to decide the classification of the tried page. Since the fundamental component of the model is to address pages and concentrate includes on the whole parts of the site pages, it is named W2V. The vital advancements in theW2V model are portrayed beneath.

B. Website Page Corpora Development

Most related investigates have zeroed in on consequently learning website page highlights from URLs [15][18], [21], in light of the fact that URLs are normal character successions and can be effectively vectorized without preprocessing. Yet, the actual URL doesn't cover all the construction and semantic data of the phishing website page. Thusly, W2V learns the extensive include portrayal of the website pages from the three angles of URL, page substance, and DOCUMENT OBJECT MODEL structure. To learn these portrayals naturally, it is important to remove the comparing quantity from the site pages.

(5)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2821

Figure 2. Example of URL and word level quantity

1) Url quantity

Drawing on the information preparing technique for the writing [18], the URLs are prepared in units of characters and words separately.

A: Character-Level quantity

While developing a character-level quantity, every URL is viewed as a character succession, and each character succession is standardized to a xed-length by the capture or then again zero paddings, and all character successions structure a succession set. An aggregate of 96 letters, numbers and exceptional characters with high recurrence in the succession set are chosen to shape a character jargon, including a unique image <UNK> for supplanting rare characters and placeholder <PAD>, and afterward a novel number is doled out to each character in the jargon. At long last, every URL character succession in the succession set is encoded, that is, the comparing number is utilized to supplant the first character one by one, so the one-dimensional advanced vector named One- Hot code relating to every URL character arrangement is acquired. This comprises the character-level quantity of URLs.

B: Word-Level quantity

On the off chance that seeing URL as a blend of words, the quintessence of the URL can be perceived from a more significant level. The development of a word-level quantity is like that of the characterlevel quantity. The thing that matters is that URLs are divided into word successions rather than character groupings. The wordlevel parting of URLs isolates every URL into the convention, hostname, way, le name, and boundary parts as per the underlying qualities of the URL, by separators '.', ':', '//', and so on, to feature the succession connection between the parts. Fig.2 shows an illustration of the development of character quantity and word quantity for URL.

2) Page Content quantity

The page content quantity is separated into a word-level quantity also, a sentence-level quantity. The development interaction is comparable to the URL quantity, so it won't be rehashed here. It ought to be noticed that the

(6)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2822

sentence is isolated by the character ''.''; in request to encourage preparing, the mixed media content, HTML labels, CSS styles, and some other data of the HTML archive are taken out, just the content data is held.

3) Document object model Structure quantity

A HTML report is a commonplace semi-organized archive. HTML labels in it have a settled relationship and reflect the progressive construction of the website page, which can be described by the document object model. For effortlessness, while framing a document object model quantity, just the labels that make up the document object model are thought of, what's more, their characteristics, text, and remark hubs are disregarded. The development of the document object model quantity is isolated into two stages: building a label grouping and developing a quantity from the label grouping. To start with, develop the document object model label arrangement for every site page. The cycle is as per the following: Parse the HTML report and get the root hub of the document object model tree, and use it as the current layer and the first component of the label succession; - Starting from the current layer, use expansiveness first system to navigate layer by layer. The specific strategy is to navigate the kid hubs of the hubs in the current layer from left to right, what's more, save them in arrangement; ® Repeat - until all layers are examined, and get back to the label arrangement. Subsequent to finishing the above advances, the website page will be changed over into a grouping of document object model labels. Fig.3 is a HTML archive, and Fig.4 shows how it is changed over into the grouping of document object model labels. At that point, the document object model label groupings framed by all pages are collected, and the HTML labels are viewed as the words establishing the groupings, consequently building a word-level quantity.

D. Website page representation

The numerous corpora built in the last area have all things considered finished the jargon planning of every quantity. These mappings are sorts of One-Hot encoding. One-Hot is otherwise called the slightest bit compelling encoding. Its guideline is to utilize N-cycle status registers to encode N states. Each state relates to a free compelling register bit. One-Hot can't mirror the affiliation and semantic data of the corpora, and the encoding result is moderately meagre, so it can just be utilized for primer quantization. As of late, the word vector innovations in NLP have been widely considered and applied, which are for planning characters or words from a word reference to low-dimensional vectors. They can lessen the measurement, yet in addition catch the setting data of the current characters or words in the succession, so are regularly used to become familiar with the portrayal

(7)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2823

Figure 4. Document Object Model Tag

of text-like quantity. Notwithstanding, common word vector age techniques, for example, W2V are pre-prepared models [22], and their muddled pre-preparing interaction will bring an immense weight to the discovery of phishing website pages. To this end, a simplified word vector development technique is utilized inW2V, furthermore, the One-Hot lattice after starter quantization is installed as word vectors by a solitary layer neural organization. The inserting layer and the ensuing element extraction furthermore, classification parts are mutually enhanced through back propagation to steadily upgrade the semantic portrayal capacity of the model.

E. Feature Extraction

In the examination and utilization of profound learning, convoluational neural network is the most generally utilized one since it is acceptable at separating nearby highlights from information, yet it does not have the capacity to learn logical data; and then again, LSTM, as a time-recursive neural organization, is only appropriate for handling grouping data, so as of late, the blend of convoluational neural network and LSTM applied to different sorts of exploration has arisen and succeeded [23], [24]. Motivated by this, W2V means to utilize a mixture convoluational neural network-LSTM conspire for highlight extraction. Simultaneously, to defeat the deficiencies that LSTM just thinks about the forward data furthermore, overlooks the regressive data, BiLSTM with a bidirectional consecutive construction is utilized rather than LSTM to incorporate all setting data into the model. Moreover, to reinforce the influence of significant highlights, the consideration system that has been broadly contemplated and utilized lately is applied to the yield of BiLSTM, so that the identification capacity of the model can be improved.

1) CONVOLUATIONAL NEURAL NETWORK

The Convoluational Neural Network planned in W2V comprises of a convolutional layer and a pooling layer. At the convolutional layer, numerous convolution portions perform convolution procedure on the information vectors to create various element maps; at the pooling layer, the component of the element map is decreased by maximum pooling.

(8)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2824

Figure 5. Lstm cell structure.

2) CLASSIFICATION FUNCTION

Subsequent to separating the highlights, the element vectors yield from each channel are linked to frame the combination vector Xi of the site page I, and the classification forecast of the site page is accomplished through the completely associated layer and the sigmoid capacity. During the preparation cycle, a cross-entropy misfortune work is utilized to compute the blunder between the genuine worth and the anticipated one. Leave ji alone the anticipated worth and j'i the genuine classification, at that point the misfortune work is:

4. Experimental results and analysis

To check the viability of the W2V model, four arrangements of analyses are intended to attempt to answer the following inquiries:

1) Question 1: Compared with the exemplary phishing site page discovery techniques, how powerful is the identification of the W2V model?

2) Question 2: Does the multifaceted portrayal gained from the first data of the site pages utilizing the portrayal learning technique successfully improve the discovery result?

3) Question 3: Does the crossover convoluational neural network-BiLSTM organization for highlight extraction in theW2V model have benefits over other run of the mill profound learning organizations?

4) Question 4: Does utilizing the consideration component in the W2V model improve execution? The exploratory advancement climate is appeared in Table 1.

Table 1. Experiment environment

Os Processor Memory Environment Development

Language Windows 8 Intel core

i5

16gb PyCharm Python3.5

(9)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2825

Evaluation parameter Formula Accuracy (TP+TN)/(TP+FP+TN+FN) Precision TP/(TP+FP) Recall TP/(TP+FN) F1 (2*Precision*recall)/(precision+recall) FPR FP/(TN+FP)

Table 3. Word 2 Vector

Parameter Value CONVOLUATIONAL NEURAL NETWORK pool_size 4 CONVOLUATIONAL NEURAL NETWORK Kernel_size 150 Strides 1 BiLSTM cell no 150 Batch 76 Dropout 0.7 Epoch 20 Learning rate 0.001

The pages utilized in the analyses come from the genuine organization climate. The kind site pages assortment is from Alexa. Alexa is a site kept up by Amazon that distributes the world rankings of sites. It has a gigantic number of URLs and gritty site positioning data. We gather pages in the top rundown given by Alexa which are considered as kindhearted pages. In the wake of filtering out a few invalid, mistake, and copy pages, we gathered 23,000 typical site pages from Alexa. The phishing site page assortment comes from Phish- Tank.com. PhishTank is a globally notable phishing site page assortment site that gives a convenient and legitimate rundown of phishing pages. PhishTank gathers a suspected phish put together by anybody and afterward verifies it as per if it has a deceitful endeavor prior to distributing. Because of the short endurance season of phishing site pages, we gathered 21,000 phishing pages recorded on PhishTank from September 2020 to November 2020, and pre processed the site pages that didn't meet the language structure rules. The proportion of the preparation set to the test set is 0.80:0.20.

1) Evaluation parameter

To sum up different assessment pointers in the literary works, the most generally utilized are the accompanying: Accuracy, Precision, Genuine Positive Rate, which is comparable to Recall, FPR, F1-measure, and their computation equations are appeared in Table 2. Among them, TP indicates the quantity of favourable site pages effectively classified as kind hearted site pages, FP indicates the quantity of phishing site pages classified as kind site pages, TN indicates the quantity of phishing pages classified as phishing site pages, and FN indicates the quantity of favourable pages classified as phishing site pages. F1 is the consonant mean of Precision and Recall, which can thoroughly reflect the exhibition of the strategy.

2) Baselines

Exemplary phishing website page identification strategies looked at with W2V incorporate PCA-RF [12], CANTINAC [3], URLNet [18], and MPURNN [15]. Among them, PCA-RF is a commonplace AI approach dependent on artificial and TABLE 3. Boundaries of W2V. heuristic highlights, which has the cutting edge

(10)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2826

execution. Then again, CANTINAC is the most perceived heuristic technique. The two techniques physically extricate highlights from all parts of URL, page substance, and document object model structure; Then again, URLNet and MPURNN are profound learning techniques, which both consequently take in highlights from URL. The thing that matters is that URLNet performs include extraction by convoluational neural network after character-level and word-level installing, while MPURNN just installs characters, and afterward separates highlights through LSTM. When contrasting the element extraction techniques in the W2V, the exemplary single profound learning networks convoluational neural network, Recurrent Neural Network, LSTM, just as the cross breed network convoluational neural network-LSTM and convoluational neural network-BiLSTM without adding consideration were chosen. Notice that the customary administered AI techniques like Sequential Minimal Optimization, Bayesian Network, Support Vector Machine, also, AdaBoost are not analyzed in our examinations, on the grounds that from the exploratory correlation in [12], PCA-RF performed the best out of every one of these baselines.

3) Parameter Setting

The boundary settings of theW2V are appeared in Table 3. Furthermore, the lengths of the URLs, the HTML content, what's more, the document object model construction of every website page are conflicting, and should be set to a fixed length during the computation. Concurring to the dispersion insights of various lengths, set the URL length, HTML length, and document object model structure length to 201, 1100 what's more, 2100 separately.

5. RESULTS EVALUATION

1) EXPERIMENT 1: W2V DETECTION EFFECT

Question 1 means to get to the location impact of W2V. To address Question 1, test 1 looks at W2V with exemplary phishing website page discovery techniques PCA-RF, CANTINAC, URLNet, and MPU RNN. The outcomes are appeared in Table 4. As can be seen from Table 4, as a rule, the discovery impact of the PCA-RF is the awesome, W2V shows imperfect outcomes and CANTINA Chas the third spot. This is on the grounds that these three techniques have done multi-angles learning on the website pages, and W2V accomplishes lower execution than PCA-RF in light of the fact that lacking sufficient information when utilizing profound learning organizations, yet its presentation is near PCA-RF and still better compared to CANTINAC for finding out additional inert data through profound learning and portrayal learning. URLNet and MPURNN just gain from URLs and utilize a solitary profound learning network for highlight extraction, the fundamental highlights of site pages are not scholarly enough, so the classification impact isn't ideal. The computational intricacy of the strategies relies upon the extraction and registering the highlights from pages. In try 1, CANTINAC spend the most brief activity time on the grounds that subsequent to getting highlights physically, it just necessities to utilize a straightforward heuristic principle to settle on a choice, while profound learning-based techniques, for example, W2V and URLNet need to run numerous ages to get the best outcome. Since PCA-RF utilizes the outfit strategy, it is additionally more slow than the heuristic technique. W2V invested the longest running energy, since it necessities to acknowledge portrayal learning before profound inclining.

Table 4. Comparison with classic phishing detection.

Model Accuracy Precision TPR F1 FPR

W2V 0.9804 0.9767 0.9825 0.9904 0.0023

PCA-RF 0.9820 0.9810 0.9728 0.9812 0.0032

CANTINA+ 0.9652 0.9810 0.9652 0.9650 0.0078

URLNet 0.9530 0.9519 0.7502 0.8017 0.0021

MPURNN 0.9221 0.9431 0.8881 0.9150 0.0352

Table 5. Various feature combinations.

Features ACC P TPR F1 FPR

Document object model 0.9071 0.9871 0.9521 0.9040 0.1121

Url 0.9371 0.9532 0.9060 0.9312 0.0158

Html 0.9655 0.9681 0.9583 0.9623 0.0082

Document object model+url 0.9532 0.9541 0.9726 0.9508 0.0018

Document object model+html

(11)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2827

Url+html 0.9731 0.9721 0.9712 0.9711 0.0132 Url+html+document object model 0.9803 0.9756 0.9721 0.9807 0.0013

Table 6. Various feature extraction models

Model Acc P TPR F1 FPR Convoluational Neural Netork -BiLSTM 0.9804 0.9756 0.9726 0.9804 0.0024 Convoluational Neural Netork-LSTM 0.9450 0.9883 0.9654 0.9706 0.0031 LSTM 0.9554 0.9814 0.9223 0.9515 0.0060 Recurrent Neural Network 0.9554 0.9812 0.9245 0.9524 0.0062 Convoluational Neural Network 0.9678 0.9743 0.9653 0.9700 0.0084

Table 7. Detection effect of attention process

W2V Acc P TPR F1 FPR With attention 0.9804 0.9764 0.9721 0.9807 0.0024 Without attention 0.9732 0.9789 0.9655 0.9722 0.0040

2) EXPERIMENT 2: EFFECTS OF MULTI-FACETED HIGHLIGHT LEARNING

Question 2 intends to consider the need and impact of taking in page highlights from numerous angles. To answer Question 2, explore 2 consolidates various highlights gained from URL, page substance, and document object model construction to shape various arrangements of various contributions to look at the effect of different highlights on the location results. Table 5 shows the location results got by consolidating various highlights. It tends to be seen from Table 5 that, the best outcomes have been accomplished from exhaustive learning highlights through the blend of URL, page substance, and document object model structure. In the current examination, despite the fact that there are very few investigates on phishing website pages from each of the three parts of URL, page content, and document object model structure, it isn't unexpected to examine one or a mix of the two as the exploration object. This shows that the data in various pieces of the site page can reflect a few qualities of the phishing page, for model, the substance of the page can reflect semantic attributes, the document object model design can reflect primary qualities, and so on, and the blend of them will definitely improve Table 6. Identification impacts of various element extraction models. Table 7. Identification impact of consideration system. the identification impact. This likewise clarifies that the strategy for learning two parts of data in Table 5 is superior to the strategy for taking in highlights from just a single viewpoint.

3) EXPERIMENT 3: THE EFFECTIVENESS OF HYBRID CONVOLUATIONAL NEURAL NETWORK-BiLSTM DEEP LEARNING NETWORK

Question 3 means to think about the impact of highlight extraction utilizing the crossover profound learning network Convoluational Neural Network-BiLSTM in the W2V model. To respond to Question 3, analyze 3 supplanted Convoluational Neural Network-BiLSTM in theW2V with Convoluational Neural Network-LSTM, LSTM, RNN, and Convoluational Neural Network separately. The correlation results are appeared in Table 6. It very well may be seen from Table 6 that, the recognition efficiency of a solitary organization model is significantly lower than that of a cross breed one. Contrasted and a solitary model, a mixture model can separate the inactive highlights of phishing website pages from various levels, which is worthy of top to bottom exploration and application. It likewise shows that the Convoluational Neural Network-BiLSTM network has a

(12)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2828

preferred classification discovery impact over the Convoluational Neural Network-LSTM, which implies that it is important to perform bidirectional include extraction.

4) EXPERIMENT 4: EFFECTIVENESS OF CONSIDERATION MECHANISM

Question 4 inspects the adequacy of utilizing the consideration system in the W2V. In this way, the discovery impact of the W2V with and without consideration system is thought about. The outcomes are appeared in Table 7. It very well may be seen from Table 7 that the impact of the W2V model with the consideration system is significantly better than that without the consideration system. This shows that expanded thoughtfulness regarding the yield of BiLSTM can feature significant component data and viably improve the classification recognition impact. To all the more likely represent the improvement of the classification impact of the consideration component, Fig.6 shows the changes in Accuracy and Loss during preparing and testing with or without the consideration. It tends to be seen from Fig.6 that the model with consideration meets quicker and the preparation and testing measure is more steady. The over four gatherings of trials show that the W2V model can address pages in multi-viewpoints

Figure 6. Different effect of the notice functions.

through portrayal learning, utilize the mixture profound learning network convoluational neural network-BiLSTM for highlight extraction, and use consideration component to additionally improve classification execution. These procedures are for the most part attainable and powerful. The expectation impact of the W2V model is ideal.

6. Conclusion

A phishing site page identification model W2V dependent on portrayal learning and profound learning is proposed in the paper. The model uses the portrayal learning innovation in NLP to completely become familiar with the portrayal of website pages from the URL, page substance, and document object model structure; at that point develop a multi-channel crossover profound learning organization to separate the profound secret highlights of the pages and afterward utilize the consideration component to fortify the influence of significant highlights; finally, the element extraction after effects of various channels are combined for classification expectation. Four sets of tests verified the classification consequences of the W2V model from various points. With the quick improvement of portrayal learning innovation, the profound portrayal learning on the chart, in particular Graph Neural Network, has been broadly examined. The exploration object of this paper, website pages, are connected to one another, normally shaping a chart. The most effective method to delve profound into the connection attributes of phishing website pages, to build a huge diagram structure that can reflect the phishing attributes, and utilize amazing examination and handling abilities of Graph Neural Network to find more separated phishing site page identification strategy is our further examination bearing.

(13)

Turkish Journal of Computer and Mathematics Education Vol.12 No.10 (2021), 2817-2829

Research Article

2829

References

1. J. Ma, et.al., “Learning to detect malicious urls,” ACM Trans. Intell. Syst. Technol., May 2012.

2. P. M, S. L. et al., ‘‘A static approach to detect drive-by-download attacks on Webpages,’’ in Proc. Int. Conf. Control Commun. Comput., Dec. 2013.

3. G. Xiang, et al., ‘‘CANTINA+: A feature rich machine learning framework for detecting phishing Web sites,’’ACM Trans. Sep. 2011.

4. M. Moghimi ‘‘New rule-based phishing detection method,’’ Expert Syst. Appl., Jul. 2016,

5. M. N. Raj et al., ‘‘A survey on phishing detection based on visual similarity of Web pages,’’ Int. J. Sci. Res. Sci., Eng. Technol., Jul. 2018,.

6. J.-X. Cao, ‘‘A phishing Web pages detection algorithm based on nested structure of Earth Mover’s distance (NestedEMD),’’ Chin. J. Comput., Aug. 2009.

7. R. S. Rao, ‘‘Two level filtering mechanism to detect phishing sites using lightweight visual similarity approach,’’ J. Ambient Intell. Humanized Comput., Sep. 2020.

8. Aleroud, ‘‘Phishing environments, techniques, and countermeasures: A survey,’’ Comput. Secur., Jul. 2017.

9. K. Sahu, ‘‘Kernel K-means clustering for phishing Website and malware categorization,’’ Int. J. Comput. Appl., Feb. 2015.

10. S. Lee ,‘‘WarningBird: A near real-time detection system for suspicious URLs in Twitter stream,’’ IEEE Trans. Dependable Secure Comput., , May 2013.

11. B. Liang, ‘‘Cracking classifiers for evasion: A case study on the Google’s phishing pages filter,’’ in Proc. 25th Int. Conf. World Wide Web, Montréal, QC, Canada, Apr. 2016.

12. R. S. Rao, ‘‘Detection of phishing Websites using an efficient feature-based machine learning framework,’’ Neural Comput. Appl., Aug. 2019.

13. G. Vrbancic, ‘‘Swarm intelligence approaches for parameter setting of deep learning neural network: Case study on phishing Websites classification,’’ in Proc. 8th Int. Conf. Web Intell., Mining Semantics (WIMS), 2018, pp. 1–8.

14. J. Feng, ‘‘A phishing Webpage detection method based on stacked autoencoder and correlation coefficients,’’ J. Comput. Inf. Technol., vol. 27, no. 2, pp. 41–54, Nov. 2019.

15. A. C. Bahnsen, et al.,‘‘Classifying phishing URLs using recurrent neural networks,’’ in Proc. APWG Symp. Electron. Crime Res. (eCrime), Apr. 2017.

16. W. Chen, et al., ‘‘Phishing detection research based on LSTM recurrent neural network,’’ in Proc. Int. Conf. Pioneering, China, 2018.

17. S. Douzi, et al., ‘‘Advanced phishing filter using autoencoder and denoising autoencoder,’’ in BDIOT, 2017.

18. H. Le, et al.,‘‘URLNet: Learning a URL representation with deep learning for malicious URL detection,’’ CoRR, Feb. 2018.

19. P. Yi, et al., ‘‘ phishing web page detection using a deep learning framework,’’ Wireless Commun. Mobile Comput., 2018.

20. X. Zhang, et al., ‘‘The phishing detection performance by semantic analysis,’’ Big Data, Dec. 2017. 21. P. Yang, et al., ‘‘Phishing Website detection based on multidimensional features driven by deep

learning,’’ IEEE Access, 2019.

22. T. Mikolov,et al., ‘‘Distribution of words and phrases and their compositionality,’’ in Proc. Adv. Neural Inf. Process. Syst., 2013.

23. P. Sun, et al., ‘‘Extracting features using CNN-LSTM hybrid network for intrusion detection system,’’ Secur. Commun. Netw., 2020.

24. M. Umer, ‘‘Fake news stance detection using deep learning ’’, IEEE Access, 2020.

25. W. Kong, ‘‘Short-term residential load forecasting based on deep learning technique,’’IEEE Trans. Smart Grid, Jan. 2019.