
SMART HOME SECURITY SOLUTIONS USING FACIAL AUTHENTICATION AND SPEAKER RECOGNITION THROUGH ARTIFICIAL NEURAL NETWORKS

Navya Saxena

B.Tech Student, Vellore Institute of Technology, Vellore, India

Devina Varshney

B.Tech Student, Vellore Institute of Technology, Vellore, India

Abstract—In this paper, we implement a holistic Smart Home Security solution that improves privacy and security using two independent and emerging technologies: facial authentication and speaker recognition. With the help of our application, users can monitor their home through a mobile phone, tablet, or PC. The method takes a real-time feed of the person at the door and authenticates the recognized face against a database of owners that maps faces to names. Speech recognition is used to doubly check the output of facial authentication. The entire process is carried out with the help of neural networks. If there is an unauthorized person at the door, an alert is triggered and the owner receives a notification of the unauthorized access, along with the choice of whether or not to add the person to the database. The accuracy of the proposed model is 87.5% for facial authentication and 84.62% for speaker authentication. The main novelty of our research is identifying faces through masks, which properly verifies the identity of the person and proves beneficial not only in the current COVID scenario but also in cases of theft and burglary, by alerting the owner about the anomaly. This smart security system can therefore be extended to applications such as banks, malls, and offices, and is not limited to homes.

Keywords—Artificial Neural Networks, Facial Authentication, Gaussian Mixture Model, Internet of Things, One-shot Learning, Siamese Neural Network, Smart Home Security, Speaker Recognition

I. INTRODUCTION

Crimes such as burglary and theft are serious concerns for any household. To be free from constant worry, especially at night, people can install smart home security systems accessible through a single device. Usually, people have cameras installed at all entrances of their homes for security and are able to see each person visiting. Some advanced home security systems offer face recognition as well, which can be carried out by a method that combines geometrical feature points and low-level visual features [1]. But criminals can bypass such a system by showing a photograph of the owner or of a member of the household.


We propose a home security method that authenticates not only by face but also by speech. Every person visiting the house is doubly authenticated through a live feed whose input is compared against the names stored in the owner's database. Speaker recognition is included so that no one can enter by merely showing a picture of an identified member; they must also say a passphrase, which runs through the speech recognition system to dually authenticate them [2]. Existing members are easily authenticated and allowed to enter the house, with a notification sent to the owner's device. When there is an unidentified person at any of the entrances, an alert is sent to the owner along with the person's picture. The owner can then add the person to the database with the click of a button, or allow them to enter just that one time without adding them, thereby not granting lifelong permission to enter.

Since the onset of the COVID-19 pandemic, people must wear masks whenever they meet, which has made it difficult for facial recognition systems to recognize the person behind the mask. Our research incorporates the study of how a facial recognition system can recognize a person even when they are wearing a mask. The study includes the various methods possible for facial recognition through minimal feature extraction, focusing especially on the eye region of the face. This research will help not only during COVID-19 but also when a burglar or thief is wearing a mask, since the recognition system can be run by the authorities to identify the criminal.

Our research involves finding the most accurate artificial neural network for facial authentication, finding an accurate model for speaker recognition, and combining them into an efficient home security system. The whole system works through a single application using the Internet of Things: a network connecting the application on a personal device such as a mobile phone or tablet, the cameras, and the voice-capturing devices.

Even though there are many ways to implement face authentication, we have chosen artificial neural networks because of their ability to learn and model the non-linear, complex relationships needed for pictorial inputs. An artificial neural network converts the input image into a vector that can then be manipulated mathematically. Artificial neural networks also give high accuracy, which is essential when authentication is used for security, and they can prove very useful for multimodal facial biometric recognition [3].

Security and privacy are important for every individual. Having a local database of identified people helps with both: the data in the security system travels over the local network itself, keeping both the users and their data safe.

A smart home security system will not only make residents feel safer in their own home but also make the home technologically stronger. This project can be extended as a security system for banks, malls, and other places requiring security.

II. LITERATURE SURVEY

Smart home devices account for a large portion of the consumer IoT market, but they pose security risks, and little is understood about how homeowners' views of security risk affect their decisions to use smart home technology. [4] evaluated a new model of how perceived security risk influences the intention to use smart home devices. Another method can be seen in [5], which describes a smart home safety framework based on a refined version of the blockchain called a consortium blockchain. [6] classified the natural access points in the home as primary and secondary according to their use; logic-based sensing is implemented by observing normal user behavior at these access points and requesting user verification where necessary. A gap here is user-behavior prediction, which could be improved by analyzing different user actions at home to make smart home security better.

Facial recognition technology is being used in both the private and public sectors for a variety of purposes, ranging from physical security to customized shopping experiences; however, it is unclear how consumers perceive this new technology in terms of utility, risk, and comfort. [7] address these questions. Deep Siamese networks have recently been used for pair-wise face matching to increase robustness to intra-class variations. Although these networks can improve state-of-the-art accuracy, the lack of prior information from the target domain necessitates collecting a large number of images to account for all potential capture conditions, which is impractical in many real-world surveillance applications [8]. Previous research in [9] proposed a multi-task convolutional neural network (CNN) for face recognition, with identity classification as the primary task and pose, illumination, and expression (PIE) estimation as side tasks; a dynamic-weighting scheme was also devised to automatically assign loss weights to each side task, solving the critical task-balancing issue of multi-task learning. A weighted mixture deep neural network (WMDNN) has been proposed to automatically extract features suited to facial expression recognition (FER), with pre-processing methods such as face detection, rotation adjustment, and data augmentation used to limit the FER regions [10]. [11] proposed a comprehensive network framework for capturing identity information from facial features and their relationships. In another proposed method, facial expressions from a smile were analyzed and used for facial verification; the 3D authentication feature [12] is as important to the user as security and provides an easy way to authenticate the right user. [13] proposed a method that uses powerful appearance- and time-dependent local features expressing a person's face during speech in relation to its spatial and temporal elements. [14] proposed EchoPrint, a novel user-friendly authentication system that incorporates acoustics for secure and easy-to-use authentication without requiring any special hardware. Remote authentication involves the transmission of encrypted information as well as visual and audio signals (photos/videos, personal voice, etc.); [15] suggested a strong authentication method based on semantic segmentation, chaotic encryption, and data hiding. A new multi-user framework has been developed for large volumes of image data [16], in which various biometric features such as iris, facial, and fingerprint features are used to ensure the validity and security of each user's unique data. [17] designed a system using Fourier optics and neural networks, employing an advanced real-time optical Fourier-plane correlator for face recognition and feature extraction.

In recent years, real-time speech recognition has been widely used in intelligent voice toys, industrial control, and intelligent rehabilitation as a primary cross-cutting technology in the field of artificial intelligence. Since real-time speech recognition based on embedded technology has obvious advantages in device scale, power consumption, and R&D costs, it has become a popular carrier for efficient speech recognition. [18] built a basic machine learning framework based on Markov random field theory combined with machine learning theory, and researched a real-time speech vocabulary-matching recognition algorithm on this framework in order to realize a simple and functional real-time speech recognition system on embedded hardware. Another common method uses artificial neural networks: carefully built deep autoencoders are proposed in [19] to generate effective bimodal features from audio and visual stream inputs. Speaker recognition can be divided into text-dependent and text-independent speaker verification. [20] show that systems based on deep neural networks have significantly outperformed GMMs for text-dependent speaker verification. [21][24] have recently demonstrated that similar ideas can be applied to the text-dependent speaker verification task, inspired by the success of deep neural networks (DNNs) in text-independent speaker recognition; they discussed new developments in their state-of-the-art i-vector-based approach to text-dependent speaker verification, which also employs various DNN techniques. [22] proposed a novel model to improve the recognition accuracy of short-utterance speaker recognition systems. On the other hand, [23] investigated the use of joint factor analysis (JFA) for text-dependent speaker recognition with random digit strings. A unique method of combining facial expressions with voice is used in [25]. A new data rating is designed in [26] to detect deviations from optimal speech quality. In [27], a new cross-entropy-guided measure is proposed to indirectly assess automatic speech recognition performance on enhanced speech without running ASR tests directly.

[28] removed mask objects from facial images, a challenging problem because a facial mask often covers a large area of the face extending beyond the lower border of the chin, and masked/unmasked facial pairs do not exist for training. Findings of [29] at the group level corroborated previous findings showing the significance of the eye area for face recognition; they also showed that, from the observer's perspective, face-processing ability is linked to a systematic increase in the use of the eye region, especially the left eye. [30] proposed a face-mask recognition approach based on the Gaussian Mixture Model for fraud prevention; in comparison to other conventional face recognition approaches, their methodology improves the ability to identify abnormal faces wearing sunglasses, masks, or respirators, reducing the risk such rare faces pose in the security sector. [31] proposed eye-movement and iris-based authentication (EmIr-Auth), a novel biometric authentication system for remote operators. [32] addressed biometric identification and the problem of providing accurate identification results using a low-frequency eye tracker. [33] proposed a privacy-preserving technique for "controlled information release" that masks an original face image and prevents biometric feature leakage while still identifying the individual.

III. PROBLEM STATEMENT

The research done on smart home security using facial authentication and speaker recognition through artificial neural networks is limited in comparison to the research done on each of them individually. This scarcity reflects the challenges researchers face when performing such tasks independently. For face authentication, researchers prefer Support Vector Machines (SVMs) or Convolutional Neural Networks (CNNs). A gap here is that SVMs do not perform well when the number of features is greater than the number of samples. And even though a standard CNN is considered better than an SVM for image inputs, since it detects important features automatically without human supervision, it may falter when the dataset is too small. This is exactly the situation in smart home security: the dataset is small because there are only a limited number of images per individual, yet the network must still detect the required features from the data available. A Siamese Neural Network tackles this problem, as it can work with a limited dataset using a one-shot learning approach. Another important gap in face authentication is that if the individual is wearing a mask, the system is unable to identify them because most of their facial features are hidden. This creates a problem for individuals trying to enter a home where face authentication is required, especially during the COVID-19 pandemic, as the system will not accept them. To tackle this, the system can focus on the eye region and run biometrics based on the individual's iris using modified neural network models. On the other hand, most speaker recognition systems use text-independent methods based on Gaussian Mixture Models or i-vectors. Even though these methods are effective, they can be bypassed by an unidentified individual who has a voice recording of an identified individual. This can be solved with a specified passphrase that must be spoken for recognition, creating dual authentication: the system first uses speech recognition to verify the phrase and then uses voice authentication to identify the individual.
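
As a concrete illustration of the passphrase stage, the sketch below uses the open-source SpeechRecognition package for the phrase check; the passphrase value and file path are hypothetical placeholders, not part of the proposed system's specification.

```python
# Minimal passphrase check: transcribe a short clip and compare it with
# the homeowner's chosen phrase. Only if the phrase matches does the
# system go on to speaker (voice) authentication.
import speech_recognition as sr

PASSPHRASE = "open sesame"  # hypothetical phrase chosen by the homeowner

def passphrase_matches(wav_path: str) -> bool:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)          # read the whole clip
    try:
        text = recognizer.recognize_google(audio)  # speech-to-text
    except (sr.UnknownValueError, sr.RequestError):
        return False                               # unintelligible or API error
    return text.strip().lower() == PASSPHRASE
```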

The main target of this paper is to fill the above gaps and provide better home security through the contributions below. The following aims guide the building of a smart home security solution using facial authentication and speaker recognition through artificial neural networks:

• To propose an effective facial authentication system using Siamese Neural Networks.

• To train a Gaussian Mixture Model (GMM) on Mel-Frequency Cepstral Coefficient (MFCC) features extracted from audio .wav files, for speech authentication and speaker recognition.

• To perform masked facial recognition through minimal feature extraction, focusing on the eye region of the face.

These efforts aim to achieve high classification accuracy for facial authentication, masked facial recognition, and speaker recognition, providing a better home security solution.

IV. RESEARCH FRAMEWORK

A. Overall Architecture

As shown in the proposed model in Figure 1, an image of the individual is captured at the entrance using a camera placed at the door. The captured image is first checked to detect whether the person is wearing a mask. It is then sent to the 'Face Recognition and Authentication System' if the person is not wearing a mask, or to the 'Masked-Face Recognition and Authentication System' if they are. If the face recognition stage does not recognize the individual, the system does not move on to speaker recognition; instead it breaks out and immediately sends a notification to the owner along with the individual's image. If the system does recognize the individual, they are asked to speak the passphrase, which is taken as input by the Speaker Recognition System. This system performs text-dependent voice authentication: a speech recognition model identifies the speaker and recognizes the spoken phrase, comparing it with the security system's passphrase. If either the speaker is not identified or the spoken phrase does not match the passphrase, the system breaks out and sends an alert to the owner along with the individual's image. If the individual is identified as an existing speaker and says the correct passphrase, the speaker's name is compared with the name recognized by the facial authentication system. If both systems identify the same person, the individual is allowed to enter the house and a notification is sent to the owner. If the two systems identify the individual as two different persons, an alert is likewise sent to the owner along with the individual's image.

Fig. 1. Overall System Architecture
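
The decision logic of Figure 1 can be summarized in a short control-flow sketch. Every helper it calls (detect_mask, recognize_face, recognize_masked_face, passphrase_matches, identify_speaker, notify_owner) is an illustrative placeholder for the subsystems described in the following subsections, not a fixed API.

```python
# Control-flow sketch of the overall architecture. `subsystems` is any
# object exposing the six hypothetical helpers named above.
def handle_visitor(image, audio_wav, subsystems):
    s = subsystems

    # 1. Route the image to the unmasked or masked face pipeline.
    if s.detect_mask(image):
        face_name = s.recognize_masked_face(image)
    else:
        face_name = s.recognize_face(image)
    if face_name is None:                          # face not in database
        s.notify_owner(image, reason="unknown face")
        return False

    # 2. Text-dependent speaker stage: verify the phrase, then the voice.
    if not s.passphrase_matches(audio_wav):
        s.notify_owner(image, reason="wrong passphrase")
        return False
    speaker_name = s.identify_speaker(audio_wav)

    # 3. Both modalities must agree on the same identity.
    if speaker_name != face_name:
        s.notify_owner(image, reason="identity mismatch")
        return False

    s.notify_owner(image, reason=f"{face_name} entered")  # entry notification
    return True
```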

B. Proposed Methodology

The entire project can be divided into three components: the Face Recognition and Authentication System, the Masked-Face Recognition and Authentication System, and the Speaker Recognition and Authentication System.

a) Face Recognition and Authentication System

After the image of the individual at the entrance is captured, if they are not wearing a mask, the captured image is sent to the 'Face Recognition and Authentication' System, which uses a Siamese Neural Network and one-shot learning with the FaceNet model. The dataset for a household will have limited images, possibly even a single image per person. A Siamese Neural Network works efficiently in such a case because it distinguishes two classes by measuring their similarity instead of learning the characteristics of each class: by studying pairs of pictures, it can tell whether a pair belongs to two separate classes or to the same class. The authentication system feeds a pair of images, the captured image and an image from the dataset, to the same Siamese Neural Network, which outputs their two feature vectors. The distance between the two outputs is then calculated, which amounts to comparing the features: the calculated distance indicates the similarity between the two images, being low for same-class pairs and high for different-class pairs. In addition to the Siamese Neural Network, the FaceNet model is used for faster computation. It evaluates two pairs simultaneously, calculating the distance between the captured image and one dataset image and between the captured image and another dataset image; the smaller distance shows which dataset image the captured image resembles more. Keeping that image as a base, the captured image is compared with the remaining images in the dataset, and whenever a pair yields a lower distance, it becomes the new base. If the system produces a recognized name for the individual, it moves on to speaker authentication; if not, an alert is sent to the owner.
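
A minimal sketch of this verification step is shown below, assuming embed() maps an aligned face crop to a FaceNet-style embedding (e.g., from a pretrained network); the distance threshold is an illustrative assumption, not a value from the paper.

```python
# Nearest-embedding identification: keep the enrolled face whose stored
# embedding is closest (lowest L2 distance) to the captured face.
import numpy as np

THRESHOLD = 0.7  # assumed maximum L2 distance for a "same person" match

def identify(captured_image, gallery, embed):
    """embed: function mapping an aligned face crop to an embedding vector.
    gallery: dict mapping each enrolled name to a stored embedding."""
    query = embed(captured_image)
    best_name, best_dist = None, np.inf
    for name, reference in gallery.items():
        dist = np.linalg.norm(query - reference)  # low for same-class pairs
        if dist < best_dist:
            best_name, best_dist = name, dist     # new base pair
    # Reject if even the best match is farther than the threshold.
    return best_name if best_dist < THRESHOLD else None  # None -> alert owner
```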

Fig. 2. Face Recognition and Authentication System Architecture

Fig. 3. Siamese Neural Network Architecture

b) Masked-Face Recognition and Authentication System

If the person in the captured real-time image is wearing a mask, traditional facial authentication systems do not work well and give lower accuracy. In these cases, frontal face images are extracted at high quality so that identifying the concealed face is no longer so difficult. Although the mask covers most of the face, features of the upper part, such as the eyes and eyebrows, can still be used to keep a face recognition system usable. The basic premise is to remove the distortion caused by the mask and give priority to the useful exposed facial features. Our proposed masked facial authentication approach rests on two main aspects: a built-in database to which we add both masked and unmasked faces, and the appropriate use of the useful uncovered facial features. We apply different attention weights to the important visible facial features, such as the forehead and facial contour, which effectively addresses the problem of unequally distributed discriminative facial information. If the identified face is registered in the database, the system moves on to speaker authentication; otherwise, an alert is sent to the owner.
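
The eye-region emphasis can be sketched as a simple pre-processing step before embedding; the crop fraction below is an illustrative assumption rather than a tuned value, and embed() is the same assumed embedding function as in the unmasked path.

```python
# Masked-face path sketch: keep only the visible upper band of the face
# (forehead, eyebrows, eyes) and embed that region.
import numpy as np

def upper_face_crop(face, keep_fraction=0.55):
    """Keep the top part of an aligned H x W x 3 face image.
    keep_fraction = 0.55 is an assumed value, not from the paper."""
    height = face.shape[0]
    return face[: int(height * keep_fraction), :, :]

def masked_embedding(face, embed):
    visible = upper_face_crop(face)    # discard the masked lower face
    emb = embed(visible)               # same embedding function as before
    return emb / np.linalg.norm(emb)   # re-normalize after cropping
```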

Fig. 4. Masked-Face Recognition and Authentication System Architecture

c) Speaker Recognition and Authentication System

Once the face is authenticated, the system moves to speaker authentication. To add a new user, three audio clips of 2 seconds each are recorded and converted to .wav files, and their features are extracted using Mel-Frequency Cepstral Coefficients (MFCCs). A Gaussian Mixture Model (GMM) is fitted to the features of all three clips and stored in the database. For authentication, the user speaks for 2 seconds; the audio is converted to a test .wav file and its MFCC features are extracted. The extracted features are scored against every GMM in the database, and the model with the maximum score is chosen as the identified speaker. If the user is identified, the identity is compared with the one recognized during facial authentication: if the two identities match, the individual is allowed to enter the house; if they do not match, an alert is sent to the owner. If the user does not exist in the database, the system outputs 'unknown' and an alert is sent to the owner, who has the option to add the individual to the database.
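
A minimal enrollment-and-identification sketch of this stage is given below, using librosa for MFCC extraction and scikit-learn for the GMMs; the number of MFCC coefficients and mixture components are illustrative choices, not values fixed by the paper.

```python
# Enroll each speaker as one GMM over MFCC frames; identify a test clip
# by the enrolled model with the highest average log-likelihood.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.T                       # shape: (frames, coefficients)

def enroll(wav_paths):
    """Fit one GMM on the features of all enrollment clips of a speaker."""
    feats = np.vstack([mfcc_features(p) for p in wav_paths])
    return GaussianMixture(n_components=16, covariance_type="diag").fit(feats)

def identify_speaker(test_wav, models):
    """models: dict name -> fitted GMM. Highest average log-likelihood wins."""
    feats = mfcc_features(test_wav)
    scores = {name: gmm.score(feats) for name, gmm in models.items()}
    return max(scores, key=scores.get)
```

In practice the 'unknown' case described above would also need a score threshold: if even the best-scoring model falls below it, the speaker is rejected rather than assigned to the nearest enrolled identity.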


V. RESULTS AND DISCUSSION

A. Face Recognition and Authentication

Dataset: The training dataset for the facial authentication model consists of 50 images of each of 8 people. The testing dataset includes 8 images, both masked and unmasked, of 3 people.

Input: Input is taken from the live camera feed, with the masked or unmasked person standing close to the camera. The facial features are extracted from the live input and fed into the FaceNet model. In the case of masked facial recognition, the features from the upper region of the face, such as the eyes and eyebrows, are extracted and taken as input to the FaceNet model for authentication.

Output: If the user is registered, the model outputs their corresponding name. If the user is not registered in the database, the model outputs "Not Found".

Fig. 6. Facial authentication of unmasked faces. Displays the name of the recognized person

Fig. 7. Facial authentication of masked faces. Displays the name of the recognized person


Figure 8 shows the confusion matrix of the Facial Recognition and Authentication model. For User 0 and User 2 the model gives only true positive values (i.e., it predicted the actual user). However, for User 1, the model gave 1 true positive and 1 false negative (i.e., it predicted another user instead of the actual user).

Fig. 9. Classification Report for Facial Recognition and Authentication

Figure 9 shows the classification report for our proposed model. The individual precision, recall, and F1-score for each user are shown along with the overall values. For testing, 4 images of user 'Devina', 2 images of user 'Emilia Clarke', and 2 images of user 'Gauri' were used. According to the weighted average, the total precision for our proposed model is 0.92, recall is 0.88, and F1-score is 0.87. The Facial Recognition and Authentication model reports an accuracy of 87.5%.
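
These figures can be reproduced with scikit-learn's standard metrics; the label vectors below are illustrative stand-ins consistent with the reported counts (one misclassified image out of eight), not the authors' actual predictions.

```python
# Recomputing the reported metrics (accuracy 87.5%, weighted
# precision/recall/F1 of 0.92/0.88/0.87) from hypothetical predictions.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["Devina"] * 4 + ["Emilia Clarke"] * 2 + ["Gauri"] * 2
# Assume one of the eight test images is misclassified (7/8 = 87.5%).
y_pred = ["Devina"] * 4 + ["Emilia Clarke", "Gauri"] + ["Gauri"] * 2

print(confusion_matrix(y_true, y_pred, labels=["Devina", "Emilia Clarke", "Gauri"]))
print(classification_report(y_true, y_pred))  # per-class and weighted averages
print(accuracy_score(y_true, y_pred))         # 0.875
```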

TABLE I. COMPARISON WITH EXISTING MODELS FOR FACIAL AUTHENTICATION

DeepID2 [35]
Methodology: Uses deep learning with both face identification and verification signals as supervision. The face identification task extracts DeepID2 features from different identities by exploiting inter-personal variations, while the face verification task pulls together DeepID2 features extracted from the same identity, reducing intra-personal variations. Uses the Joint-Bayesian method.
Training dataset: 202,599 images of 10,177 subjects (private); evaluated on LFW.
Accuracy ± Std (%): 95.43

DeepFace [36]
Methodology: Utilizes 3D face modeling to apply a piecewise affine transformation, with a nine-layer deep neural network of about 120 million parameters. Uses locally connected layers without weight sharing instead of standard convolutional layers.
Training dataset: 4.4M images of 4,030 subjects (private); evaluated on LFW.
Accuracy ± Std (%): 95.92 ± 0.29

FaceNet [37]
Methodology: Uses a deep convolutional network trained to directly learn a mapping from face images to a compact Euclidean space where distances correspond to a measure of face similarity, using only 128 bytes per face.
Training dataset: 260M images of 8M subjects (private); evaluated on LFW.
Accuracy ± Std (%): —

Sparse Representation-based Classification [38]
Methodology: Using sparse representation computed by ℓ1-minimization, a general classification algorithm for object recognition is devised that handles two issues in face recognition: feature extraction and robustness to occlusion.
Training dataset: 48 video sequences and 64,204 face images from the ChokePoint database.
Metrics: 52.4 ± 0.32 (pAUC), 47.5 ± 0.031 (AUPR)

Our Proposed Model
Methodology: Feeds a pair of images, the captured image and an image from the dataset, to the same Siamese Neural Network. In addition, the FaceNet model evaluates two pairs simultaneously by calculating the distance between the captured image and each of two dataset images.
Training dataset: Custom, private dataset (50 images of each of 8 users for training; 8 images for testing).
Accuracy ± Std (%): 87.5

B. Speaker Recognition and Authentication System

Dataset: Since a smart home security database will have a limited dataset, the prototype contains the voice recordings of 9 speakers. Each speaker records 3 voice clips of 2 seconds each for training. For testing, the speaker speaks for 2 seconds and the prediction is made by comparing the clip against the existing GMM models.

Input: The speaker speaks for 2 seconds; the recording is converted into a test .wav file.

Output: This .wav file is scored against the existing Gaussian Mixture Models, and the identity whose model gives the maximum score is returned as the output.


Figure 10 shows the confusion matrix for predicting the speaker. It is an 8x8 matrix comparing the actual target values with those predicted by our proposed model. For Speakers 0 through 6, the model predicted only true positive values (i.e., it predicted the actual speaker). However, for Speaker 7, the model gave only 1 true positive and 2 false negatives (i.e., it predicted another speaker instead of the actual speaker).

Fig. 11. Classification Report for Speaker Authentication

Figure 11 shows the classification report for our proposed model. The individual precision, recall, and F1-score for each actual speaker are shown along with the overall values. For calculating the values, Navya, Speaker26, Speaker27, Speaker28, Speaker29, Speaker30, Speaker34, and unknown had 1, 1, 1, 2, 1, 2, 2, and 3 testing cases respectively. The total precision for our proposed model is 0.91, recall is 0.85, and F1-score is 0.83. The accuracy of our proposed model is 84.62%.

TABLE II. COMPARISON WITH EXISTING MODELS FOR SPEAKER AUTHENTICATION

GMM and CNN Hybrid [19]
Methodology: Achieves successful training and identification for a limited number of biometric samples by training the pre-processed speech spectrum in deep networks to extract the deep features of the full frequency spectrum of short utterances.
Training dataset: Speech spectrum images of the speech samples of 50 people; each sample contains 10 speech spectrum pictures.
Metrics: Equal Error Rate = 2.5%; Accuracy = 87% for 5,000 iterations.

Joint Factor Analysis [20]
Methodology: Tested the model on many subsystems and their fusion; the best EER was obtained when all 6 systems were fused together.
Training dataset: RSR2015 (part III) dataset.
Metrics: Equal Error Rates of 2.01% (male) and 3.19% (female).

HMM-Based i-Vector Extractor [21]
Methodology: Preconditions i-vectors with a regularized version of within-class covariance normalization, which can be robustly calculated phrase-dependently on the minimal datasets available for the text-dependent task.
Training dataset: RSR2015 (143 female and 157 male speakers; more than 60 different utterances spoken by all speakers).
Metrics: Equal Error Rate (male) = 1.11%.

Our Proposed Model
Methodology: Implemented using MFCC (Mel-Frequency Cepstral Coefficient) feature extraction, whose output is passed to a GMM (Gaussian Mixture Model); the prediction is based on the maximum score calculated with respect to each model.
Training dataset: Custom, private dataset (9 speakers for training; 13 test cases).
Metrics: Accuracy = 84.62%.

Comparing one-shot learning with classification, we see that classification requires a large number of images of each class for training. Also, if we need to test the model on a class not given during training, we cannot expect a result. If a new class has to be added to the training dataset, many images of that class are required and the model has to be retrained. This poses a problem when the number of classes is large and changes dynamically, which in turn increases computation and training costs. One-shot classification in the proposed model, however, requires only one training example per class: the network does not learn to classify an image into a class, instead using a similarity function to check how similar two input images are.
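
The practical consequence, under the same embed()/gallery assumptions as the earlier sketches, is that enrolling or revoking a household member is a dictionary update rather than a retraining run:

```python
# One-shot enrollment: one reference image per person, no retraining.
def add_member(name, face_image, gallery, embed):
    gallery[name] = embed(face_image)    # store a single reference embedding

def remove_member(name, gallery):
    gallery.pop(name, None)              # revoking access is just a delete
```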

The proposed model is 87.5% accurate for facial recognition and authentication and 84.62% accurate for speaker authentication. Its accuracy is lower than, but close to, that of the existing models. The existing models were trained and tested on large datasets, whereas a smart home security database will be small; on small datasets, the existing models might achieve lower accuracy, since those methods require many training samples per user. Hence the proposed models are more efficient for smaller datasets with less training data.

VI. CONCLUSION

In this paper, a Smart Home Security Solution comprising models for facial and speaker recognition has been proposed for user authentication. A Siamese Neural Network with FaceNet based on one-shot learning is used for facial authentication, and a Gaussian Mixture Model with MFCC feature extraction is used for speaker authentication. Pre-processing is applied to the captured image and the audio of the user. Based on the extracted features, the minimum distance is taken for facial recognition and the maximum score for speaker recognition, and the user is classified as either a member in the database or unidentified. For small datasets, the proposed models are more efficient than state-of-the-art models, which require large datasets for training. The model also recognizes not only unmasked but also masked faces; for a masked user, the eye and nose region should be clearly visible. The proposed model reports an accuracy of 87.5% for facial authentication and 84.62% for speaker authentication. As future scope, masked users could be recognized solely on the basis of their eye region. This smart security solution can also be extended to malls, offices, and other places requiring security.


REFERENCES

[1] Klobas, J. E., McGill, T., & Wang, X. (2019). How perceived security risk affects intention to use smart home devices: a reasoned action explanation. Computers & Security, 87, 101571.

[2] Arif, Samrah, et al. "Investigating smart home security: Is blockchain the answer?." IEEE Access 8 (2020): 117802-117816.

[3] Jose, Arun Cyril, and Reza Malekian. "Improving smart home security: Integrating logical sensing into smart home." IEEE Sensors Journal 17.13 (2017): 4269-4286.

[4] Seng, S., Al-Ameen, M. N., & Wright, M. (2021). A First Look into Users’ Perceptions of Facial Recognition in the Physical World. Computers & Security, 102227.

[5] Mokhayeri, F., & Granger, E. (2019). Video face recognition using siamese networks with block-sparsity matching. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(2), 133-144.

[6] Yin, X., & Liu, X. (2017). Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing, 27(2), 964-975.

[7] Yang, Biao, et al. "Facial expression recognition using weighted mixture deep neural network based on double-channel facial images." IEEE Access 6 (2017): 4630-4640.

[8] Kim, Seong Tae, and Yong Man Ro. "Attended relation feature representation of facial dynamics for facial authentication." IEEE Transactions on Information Forensics and Security 14.7 (2018): 1768-1778.

[9] Banerjee, Debdeep, and Kevin Yu. "3D face authentication software test automation." IEEE Access 8 (2020): 46546-46558.

[10] Castiglione, Aniello, Michele Nappi, and Stefano Ricciardi. "Trustworthy Method for Person Identification in IIoT Environments by Means of Facial Dynamics." IEEE Transactions on Industrial Informatics 17.2 (2020): 766-774.

[11] Zhou, Bing, et al. "Robust Human Face Authentication Leveraging Acoustic Sensing on Smartphones." IEEE Transactions on Mobile Computing (2021).

[12] Ntalianis, Klimis, and Nicolas Tsapatsoulis. "Remote authentication via biometrics: a robust video-object steganographic mechanism over wireless networks." IEEE Transactions on Emerging Topics in Computing 4.1 (2015): 156-174.

[13] Tarannum, Ayesha, et al. "An Efficient Multi-Modal Biometric Sensing and Authentication Framework for Distributed Applications." IEEE Sensors Journal 20.24 (2020): 15014-15025.

[14] Natheem, M. Syed, R. Narayanan, and Pragash N. Nagaiyan. "Advanced face recognition system using Fourier Optics and Neural Networks." 2013 Tenth International Conference on Wireless and Optical Communications Networks (WOCN). IEEE, 2013.

[15] He, Y., & Dong, X. (2020). Real time speech recognition algorithm on embedded system based on continuous Markov model. Microprocessors and Microsystems, 75, 103058.

[16] Rahmani, M. H., Almasganj, F., & Seyyedsalehi, S. A. (2018). Audio-visual feature fusion via deep neural networks for automatic speech recognition. Digital Signal Processing, 82, 54-63.


[17] Dey, S., Motlicek, P., Madikeri, S., & Ferras, M. (2017). Template-matching for text-dependent speaker verification. Speech communication, 88, 96-105.

[18] Zeinali, H., Sameti, H., & Burget, L. (2017). Text-dependent speaker verification based on i-vectors, neural networks and hidden Markov models. Computer Speech & Language, 46, 53-71.

[19] Liu, Z., Wu, Z., Li, T., Li, J., & Shen, C. (2018). GMM and CNN hybrid method for short utterance speaker recognition. IEEE Transactions on Industrial Informatics, 14(7), 3244-3252.

[20] Stafylakis, T., Alam, M. J., & Kenny, P. (2016). Text-dependent speaker recognition with random digit strings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(7), 1194-1203.

[21] Zeinali, H., Sameti, H., & Burget, L. (2017). HMM-based phrase-independent i-vector extractor for text-dependent speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(7), 1421-1435.

[22] Meng, Zibo, et al. "Improving speech related facial action unit recognition by audiovisual information fusion." IEEE transactions on cybernetics 49.9 (2018): 3293-3306.

[23] Asaei, Afsaneh, Milos Cernak, and Hervé Bourlard. "Perceptual information loss due to impaired speech production." IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.12 (2017): 2433-2443.

[24] Chai, Li, et al. "A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2020): 106-117.

[25] Din, Nizam Ud, et al. "A novel GAN-based network for unmasking of masked face." IEEE Access 8 (2020): 44276-44287.

[26] Royer, J., Blais, C., Charbonneau, I., Déry, K., Tardif, J., Duchaine, B., ... & Fiset, D. (2018). Greater reliance on the eye region predicts better face recognition ability. Cognition, 181, 12-20.

[27] Chen, Q., & Sang, L. (2018). Face-mask recognition for fraud prevention using Gaussian mixture model. Journal of Visual Communication and Image Representation, 55, 795-801.

[28] Ma, Zhuo, et al. "EmIr-Auth: Eye Movement and Iris-Based Portable Remote Authentication for Smart Grid." IEEE Transactions on Industrial Informatics 16.10 (2019): 6597-6606.

[29] Lyamin, Andrey V., and Elena N. Cherepovskaya. "An approach to biometric identification by using low-frequency eye tracker." IEEE Transactions on Information Forensics and Security 12.4 (2016): 881-891.

[30] Chamikara, M. A. P., Bertok, P., Khalil, I., Liu, D., & Camtepe, S. (2020). Privacy preserving face recognition utilizing differential privacy. Computers & Security, 97, 101951.

[31] Nautsch, A., Jiménez, A., Treiber, A., Kolberg, J., Jasserand, C., Kindt, E., ... & Busch, C. (2019). Preserving privacy in speaker and speech characterisation. Computer Speech & Language, 58, 441-480.

[32] Vasanthi, M., & Seetharaman, K. (2020). Facial image recognition for biometric authentication systems using a combination of geometrical feature points and low-level visual features. Journal of King Saud University-Computer and Information Sciences.


[33] Mahmood, A. (2019). A Solution to the Security Authentication Problem in Smart Houses Based on Speech. Procedia Computer Science, 155, 606-611.

[34] Tiong, L. C. O., Kim, S. T., & Ro, Y. M. (2020). Multimodal facial biometrics recognition: Dual-stream convolutional neural networks with multi-feature fusion layers. Image and Vision Computing, 102, 103977.

[35] Sun, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. arXiv preprint arXiv:1406.4773.

[36] Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1701-1708).

[37] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "Facenet: A unified embedding for face recognition and clustering." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[38] Wright, John, et al. "Robust face recognition via sparse representation." IEEE transactions on pattern analysis and machine intelligence 31.2 (2008): 210-227.
