Abstract
This dissertation proposes an effective feature compensation scheme based on a speech model to achieve robust speech recognition. RATZ (Multivariate Gaussian Based Cepstral Normalization) is reviewed as the representative GMM (Gaussian Mixture Model) based feature compensation method. For implementation under limited resources, such as in an embedded system, several alternatives to RATZ are proposed. The considerable computational load of conventional RATZ can be significantly reduced by employing a Gaussian selection technique. The proposed algorithm is based on interpolated RATZ and is modified to suit a frame-synchronous recognition system. It achieves performance equivalent to that of the original isolated RATZ at a far lower computational load.
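The GMM-based compensation described above can be illustrated with a minimal sketch. It assumes the standard RATZ form, in which each clean-speech component k has a mean correction r_k and the clean feature is estimated as the noisy feature minus the posterior-weighted corrections. The function names, the diagonal-covariance assumption, and the top-N rule used for Gaussian selection are illustrative choices, not details confirmed by the dissertation.

```python
import math

def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector y."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
        for yi, m, v in zip(y, mean, var)
    )

def ratz_compensate(y, weights, noisy_means, noisy_vars, corrections, top_n=None):
    """Estimate the clean feature as y - sum_k P(k|y) * r_k.

    noisy_means / noisy_vars describe the noisy-speech GMM, and
    corrections[k] is the mean shift r_k of component k (assumed given).
    If top_n is set, only the top_n highest-scoring components are kept
    and their posteriors are renormalized (hypothetical selection rule).
    """
    # Weighted log-likelihood of every component for this frame.
    scores = [
        math.log(w) + log_gauss_diag(y, m, v)
        for w, m, v in zip(weights, noisy_means, noisy_vars)
    ]
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    kept = order if top_n is None else order[:top_n]

    # Posteriors over the kept components, computed stably in the log domain.
    peak = scores[kept[0]]
    unnorm = {k: math.exp(scores[k] - peak) for k in kept}
    total = sum(unnorm.values())

    x_hat = list(y)
    for k, u in unnorm.items():
        post = u / total
        for d in range(len(y)):
            x_hat[d] -= post * corrections[k][d]
    return x_hat
```

This sketch still scores every component before truncating, so it only shortens the correction sum. A practical Gaussian selection scheme would presumably use a cheaper preselection step, for example scoring a small set of cluster centroids first, so that most component likelihoods are never evaluated.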
Conventional RATZ requires off-line training with a noisy speech database and is not suitable for online adaptation. The proposed scheme eliminates the need for a noisy speech database in off-line training by employing the parallel model combination technique to estimate the correction factors. Applying the model combination technique to the mixture model alone, rather than to the entire HMM, makes online model combination possible. Exploiting the availability of noise models from off-line sources, we accomplish online adaptation via MAP (Maximum A Posteriori) estimation. In addition, a real-time channel estimation procedure is derived within the proposed framework. For a more efficient implementation, a selective model combination scheme is proposed, which reduces the computational complexity. Representative experimental results indicate that the proposed algorithm is effective in realizing robust speech recognition under the combined adverse conditions of additive background noise and channel distortion.
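As a rough illustration of how correction factors might be obtained from a clean-speech component and a noise model without any noisy training data, the sketch below applies the widely used log-normal approximation of parallel model combination to a single diagonal Gaussian. It assumes the statistics are given in the log-spectral domain (cepstral statistics would first need an inverse DCT), that speech and noise add in the linear spectral domain, and that the channel is a single scalar gain. The function names are hypothetical, and the dissertation's exact formulation may differ.

```python
import math

def pmc_lognormal(clean_mean, clean_var, noise_mean, noise_var, gain=1.0):
    """Combine one clean-speech Gaussian with one noise Gaussian.

    All inputs are per-dimension log-spectral statistics. Returns the
    mean and variance of the resulting noisy-speech Gaussian.
    """
    noisy_mean, noisy_var = [], []
    for mx, vx, mn, vn in zip(clean_mean, clean_var, noise_mean, noise_var):
        # Map log-domain Gaussian statistics to linear-domain moments.
        lin_mx = math.exp(mx + 0.5 * vx)
        lin_vx = lin_mx ** 2 * (math.exp(vx) - 1.0)
        lin_mn = math.exp(mn + 0.5 * vn)
        lin_vn = lin_mn ** 2 * (math.exp(vn) - 1.0)

        # Speech (scaled by the assumed channel gain) and noise add linearly.
        lin_my = gain * lin_mx + lin_mn
        lin_vy = gain ** 2 * lin_vx + lin_vn

        # Map the combined moments back to the log domain.
        vy = math.log(lin_vy / lin_my ** 2 + 1.0)
        my = math.log(lin_my) - 0.5 * vy
        noisy_mean.append(my)
        noisy_var.append(vy)
    return noisy_mean, noisy_var

def correction_factors(clean_mean, clean_var, noise_mean, noise_var, gain=1.0):
    """Mean shift r and variance shift R between noisy and clean statistics."""
    my, vy = pmc_lognormal(clean_mean, clean_var, noise_mean, noise_var, gain)
    r = [a - b for a, b in zip(my, clean_mean)]
    R = [a - b for a, b in zip(vy, clean_var)]
    return r, R
```

Because the combination is applied per mixture component and involves only a few exponentials and logarithms per dimension, it is cheap enough to rerun whenever the noise estimate changes, which is consistent with the online use described above.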
In the conventional GMM-based method, feature restoration is accomplished by MMSE (Minimum Mean Squared Error) estimation, in which the posterior probability determines the extent of compensation. Since the noisy speech is "incomplete", compensation weighted by the posterior can yield an ambiguous feature. In the proposed method, we identify the components that are likely to diminish the discriminative property of the speech feature and re-compose the mixture model by excluding these competing components. Candidates for distinctive features are estimated from the re-composed model. The final feature is selected using two measures: the likelihood averaged over similar states and the standard deviation of the likelihood across dissimilar states. The experimental results show that the suggested algorithm is effective in producing more distinctive features and thus improves recognition performance in noisy environments.