Design of automatic speech recognition in noisy environments enhancement and modification

Recurrent neural networks (RNN) and feed-forward multi-layer perceptrons (MLP) are proposed for detecting the presence or absence of speech in continuous voice signals under a variety of background noise levels. Detailed performance evaluations of voice activity detection were conducted on the Aurora2 and Aurora3 corpora. According to this study, the best results are achieved when a recurrent neural network is fed with both ASR-specific features and additional acoustic features. Detailed ASR experiments are also reported on Aurora2 and on the French, Romanian and Norwegian portions of the Aurora3 corpus. When the noise presence probability is used to modify the phone posterior probabilities used for decoding, the WER is reduced by up to 10.3 percent.


Introduction
Automatic speech recognition systems for multiple languages are trained on large corpora covering a wide range of speakers. These corpora are typically collected in controlled environments with little or no noise, so performance often degrades when the systems operate in real-world noise. Noise in non-speech segments can cause erroneous word insertions in the recognized sentence, whereas in low Signal-to-Noise Ratio (SNR) regions some phonemes may be masked, resulting in erroneous word deletions. Noise reduction techniques often improve recognition accuracy, but these errors persist even after they have been applied. In theory, errors could be reduced further by training on corpora collected in noisy environments, but this is impractical. In practice, some of the benefit can be obtained by introducing an accurate Voice Activity Detector (VAD) that is robust in noisy conditions. A variety of approaches have been proposed to estimate the likelihood that speech is present, including linear estimation and logistic regression [1]. Other methods are based on higher-order statistics (HOS), the log-likelihood ratio, an a priori speech absence probability computed from the minimum values of the smoothed power spectrum, and a Laplacian-Gaussian model. Features that can effectively identify speech in a variety of acoustic conditions are discussed in [2]. An artificial neural network can be used to estimate, for each speech frame, the likelihood of speech or silence; such a network was recently presented in [3], where this likelihood is scaled and used as a new observation in the acoustic Hidden Markov Models (HMM). This study also proposes a Noise Presence Probability (NPP), the probability that speech is absent, but with two important differences.
This probability is incorporated in a non-linear gain function that depends on the estimated strength of the non-speech state of the acoustic model. In a hybrid ANN/HMM ASR system, this function can be used to alter the non-speech posterior probability computed by the artificial neural network of the hybrid system. The posterior probabilities of the other phones are then adjusted without changing their relative values, which were computed by a discriminatively trained model. This reduces the confusion between speech and non-speech. The ANN that computes NPP is called the neural VAD. It is trained to distinguish voice from non-voice across various noise types and signal-to-noise ratios (SNR). Section 2 describes the ASR system and the proposed VAD, Section 3 details the ANN used to compute the noise presence probability, and Section 4 reports the results obtained on Aurora2 and on the French, Romanian and Norwegian components of Aurora3.

Use of the noise presence probability
The neural VAD is integrated with the hybrid system according to the design shown in Figure 1. The phonetic expert is the hybrid HMM-NN ASR defined in [3]. The Voice/Noise expert is the neural VAD, which is specifically designed and trained to distinguish voice from noise; it uses a large multi-condition training set and operates on a signal whose stationary noise has already been reduced by a speech enhancement algorithm. Both a Recurrent Neural Network (RNN) and a Multi-Layer Perceptron (MLP) have been considered. After the stationary noise has been removed, the signal energy and the Perceptual Linear Prediction (PLP) coefficients with J-RASTA filtering (J-Rasta PLP) [4] are extracted from the speech signal. Some additional features are computed for Voice/Noise discrimination; examples are the spectral entropy and the difference between the signal energy and the (MSE). The phonetic expert computes P(C|Y) from the spectral parameters and their first and second derivatives. The Voice/Noise expert, which computes NPP, receives the same parameters along with the additional features. NPP is then used, together with an estimate of the intrinsic strength of the background noise, to compute the non-linear gain that modulates the P(C|Y) calculated by the phonetic expert.
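As an illustration of one of the additional Voice/Noise features named above, the following is a minimal sketch of spectral entropy for a single frame. The Hann window, the naive DFT and the function names are assumptions of this sketch; the paper does not specify how the feature is computed in the actual front end.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame, via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(frame, eps=1e-12):
    """Shannon entropy (bits) of the power spectrum normalized to sum to one."""
    power = power_spectrum(frame)
    total = sum(power) + eps
    return -sum((p / total) * math.log2(p / total + eps) for p in power)


# Toy frames (assumed data): a pure tone concentrates energy in a few bins,
# while white noise spreads it over the whole spectrum.
random.seed(0)
n = 64
tone = [math.sin(2 * math.pi * 8 * i / n) for i in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
h_tone = spectral_entropy(tone)
h_noise = spectral_entropy(noise)
```

A frame whose energy is concentrated in a few bins, such as a voiced sound, gives a lower entropy than a noise-like frame, which is what makes the feature useful for Voice/Noise discrimination.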

Figure 1. Proposed system design framework
An output node of the ASR system's ANN provides an estimate of the posterior probability of background noise, P(BGN|Y). Many insertion and deletion errors are caused by a poor estimate of P(BGN|Y). Given the input signal Y, the neural Voice Activity Detector computes NPP = P(noise-only|Y). In the absence of speech, the noise presence probability is used instead of P(BGN|Y) to generate a new posterior estimate P'(BGN|Y). The expression for P'(BGN|Y) involves a gain function of NS, where NS estimates the "intrinsic strength" of the ANN output node that computes the posterior probability P(BGN|Y). The gain function is bounded by a minimum value m and a maximum value. The "intrinsic strength" of the BGN node varies with the training material for each ANN model, even within a single language. It can be assessed in two ways.

2-Average P(BGN|Y) when no voice activity is detected by the VAD
The noisy speech model in the frequency domain can be written as Y(t,k) = X(t,k) + D(t,k), where X(t,k) is the speech signal, D(t,k) the noise signal and Y(t,k) the noisy signal. A non-linear function such as the square root or the fourth root is then used to further compress the intrinsic-strength estimate, giving a more convenient rescaling of its dynamics. For both definitions of intrinsic strength, the differences between the two roots are inconsequential. According to the gain function, if the background noise is intrinsically strong, noise regions are amplified less than voice regions; the opposite is true if the BGN is weak. After P(BGN|Y) has been modulated by the gain to obtain P'(BGN|Y), all the other outputs P(Ci|Y) of the NN acoustic model are rescaled so that the outputs sum to one.
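The intrinsic-strength estimates and the final renormalization can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the 0.6 threshold is taken from the EM-SA-NPP-1 condition reported in the results section, and applying the root compression to the averaged estimate is an assumption. The gain function itself is not reproduced in the text, so the modulated value P'(BGN|Y) is simply passed in by the caller.

```python
def intrinsic_strength_mean(p_bgn, npp, threshold=0.6, power=0.5):
    """Mean of P(BGN|Y) over frames the VAD marks as noise-only (NPP > threshold),
    compressed by a root: power=0.5 is the square root, 0.25 the fourth root.
    Compressing the mean is an assumption of this sketch."""
    selected = [p for p, q in zip(p_bgn, npp) if q > threshold]
    if not selected:
        return None  # no noise-only frame has been observed
    return (sum(selected) / len(selected)) ** power


def intrinsic_strength_instant(p_bgn_frame, power=0.5):
    """Root-compressed instantaneous P(BGN|Y) for a single frame."""
    return p_bgn_frame ** power


def replace_bgn_posterior(posteriors, bgn_index, new_bgn):
    """Set the BGN output to new_bgn and rescale the other outputs so that the
    vector sums to one and the ratios between the phone posteriors are kept."""
    rest = 1.0 - posteriors[bgn_index]
    if rest <= 0.0:
        return list(posteriors)  # all mass is on BGN; nothing can be rescaled
    scale = (1.0 - new_bgn) / rest
    out = [p * scale for p in posteriors]
    out[bgn_index] = new_bgn
    return out


# Toy values (assumed): output 0 is the BGN node, outputs 1 and 2 are phones.
ns_mean = intrinsic_strength_mean([0.9, 0.1, 0.7], [0.8, 0.2, 0.7], power=1.0)
ns_inst = intrinsic_strength_instant(0.25)
adjusted = replace_bgn_posterior([0.5, 0.3, 0.2], bgn_index=0, new_bgn=0.2)
```

Because every non-BGN output is multiplied by the same factor, the relative values produced by the discriminatively trained network are preserved, which is the property the text requires.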

Structures for the neural VAD
To estimate the probability that only background noise is present, a two-output MLP has been developed. The ANN input is fed with seven adjacent frames; each frame comprises the 13 cepstral coefficients and the total energy, together with their first and second time derivatives. Other auxiliary features have also been considered: periodicity and pitch, which can be extracted by a pitch tracker, and other measures such as the signal-to-noise ratio and the difference between (MSE) and the current energy. In the following, "AuxF" refers to this set of complementary features. The first hidden layer is composed of 315 units, each locally connected to the features of the central frame as well as to the right and left contexts of the four frames under consideration. The second hidden layer, with 52 units, is fully connected to the first. Both a multi-layer perceptron and a recurrent neural network with feedback on the second-layer nodes have been implemented as neural Voice Activity Detectors, the MLP being the simpler of the two. The output is a softmax layer with two units, which compute the posterior probabilities of Voice and Background Noise. The neural VAD must be trained on a multi-style corpus that contains many languages as well as many distinct kinds of noise at varying levels. To help the neural Voice Activity Detector, a noise reduction method [6] is applied during front-end processing to reduce stationary noise. The training and test corpora were labeled by forced segmentation with the conventional acoustic models, and the resulting phonetic labels were mapped to two classes (voice, noise). Several tests have been carried out to evaluate the performance of the neural Voice Activity Detector. Table I contains the results.
• ETSI NE VAD: the Voice Activity Detector used in the noise estimation module of the ETSI AFE;
• NVAD: the neural Voice Activity Detector, based on J-Rasta PLP features with Spectral Attenuation noise reduction;
• NVAD + AuxF: the same detector with the auxiliary features added;
• RNNVAD + AuxF: the recurrent neural network version, with the auxiliary features.
According to the results, the neural Voice Activity Detector outperforms the energy-based ETSI Noise-Est Voice Activity Detector. Among the neural VAD variants, the auxiliary features (fundamental frequency, periodicity, signal-to-noise ratio, entropy, and noise energy difference) generally improve performance, as expected. The RNN yields a further improvement.
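The seven-frame input window and the two-unit softmax output described in this section can be sketched as follows. The edge padding by frame repetition, the toy feature dimension and the function names are assumptions of this sketch; the hidden layers are omitted, and the scores fed to the softmax are placeholder values rather than network outputs.

```python
import math


def stack_context(frames, left=3, right=3):
    """Concatenate each frame with `left` preceding and `right` following frames
    (seven frames in total by default). At the edges the first or last frame is
    repeated, which is an assumption of this sketch."""
    last = len(frames) - 1
    windows = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        windows.append(window)
    return windows


def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (assumed): 10 frames with 3 features each; frame t holds the value t.
frames = [[float(t)] * 3 for t in range(10)]
windows = stack_context(frames)

# Placeholder scores for the two output units (Voice, Background Noise).
p_voice, p_noise = softmax([1.2, -0.4])
```

Each window has seven times the per-frame feature dimension, and the second softmax output plays the role of the noise presence probability for that frame.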

Recognition results
ASR tests were carried out on test sets from the Aurora2 and Aurora3 corpora, following the architecture described in Section 2. The results, expressed as percent WER, are presented in Table II. The phonetic expert is a standard Loquendo hybrid HMM-NN model, as released for French, Romanian, Norwegian and United States English, among other languages. The Voice/Noise expert is the NVAD listed in Table I, the variant with no auxiliary features. It was selected for the recognition experiments because it offers the best compromise between accuracy and computational complexity: the fundamental frequency features do improve the results, but at the cost of roughly doubling the front-end computation time. The row headers of Table II correspond to the following conditions:
• EM-SA: standard J-Rasta PLP with SNR-dependent modified Ephraim-Malah spectral attenuation [11];
• EM-SA-NPP-1: as EM-SA, but with the phone posterior probabilities modified by the gain function, and NS computed as the mean of the background noise state probability over frames with NPP > 0.6 (gain function maximum 1.6 and m = 0.0);
• EM-SA-NPP-2: as EM-SA-NPP-1, but with NS computed as the square root of the instantaneous background noise state probability (gain function maximum 1.6 and m = 0.0).
The confidence intervals for the word error rate are shown in parentheses. Using NPP to re-modulate the posterior probabilities of background noise and phonemes proves highly effective, especially when noise induces a high rate of insertions and deletions. In the Aurora3 French digits, the pre-plosive pause of geminate plosives is filled by noise, which causes the word to be deleted or split into two wrong words. The best overall setup, EM-SA-NPP-2, yields its largest improvement on Aurora3 Romanian (10.3% error reduction), and its improvement is consistent across Aurora2 and Aurora3 (7.8 percent). When the residual error is mostly due to substitutions, the technique has limited effect, because it targets deletions and insertions only; Aurora3 Norwegian is an example. In order to make a direct comparison with the ETSI Noise Est Voice Activity Detector values.
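Since the discussion above separates insertions and deletions from substitutions, the following sketch shows how a WER can be split into these three error types with a standard minimum edit-distance alignment, and how a relative error reduction is computed. Reading "E.R." as relative error reduction is an assumption, and the sentences and WER values in the example are toy data, not results from the paper.

```python
def error_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum edit-distance
    alignment between two word strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diag = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match = diag
            else:
                match = (diag[0] + 1, diag[1] + 1, diag[2], diag[3])
            up = table[i - 1][j]
            deletion = (up[0] + 1, up[1], up[2] + 1, up[3])
            left = table[i][j - 1]
            insertion = (left[0] + 1, left[1], left[2], left[3] + 1)
            table[i][j] = min(match, deletion, insertion)
    _, subs, dels, ins = table[len(ref)][len(hyp)]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER in percent: (S + D + I) divided by the number of reference words."""
    subs, dels, ins = error_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / len(reference.split())


def relative_error_reduction(baseline_wer, new_wer):
    """Relative reduction of the WER, in percent of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy example: "two" is deleted and "five" is inserted.
counts = error_counts("one two three four", "one three four five")
wer = word_error_rate("one two three four", "one three four five")
reduction = relative_error_reduction(20.0, 15.0)
```

A method that only removes insertions and deletions leaves the substitution count unchanged, which is why its effect is limited on test sets dominated by substitutions.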

Conclusions
This study has proposed the combined use of a phonetic expert and a Voice/Noise expert to improve automatic speech recognition accuracy in noisy environments. The integration is achieved by modulating the outputs of the hybrid HMM-NN system with the results of a noise-robust neural VAD. Various neural VAD architectures and features have been evaluated on multiple test sets, both noisy (Aurora2 and Aurora3) and clean (TIMIT). The results of the standard ETSI NoiseEst VAD are always inferior to those of the neural VAD. After noise reduction, the neural VAD was used to re-modulate the background noise and phoneme likelihoods. Consistent error reductions were obtained on the test sets in which insertion and deletion errors are more prevalent.