Array-based Spectro-temporal Masking For Automatic Speech Recognition

Moghimi, Amir Reza

doi:10.1184/R1/6714845.v1

thesis

posted on 2014-05-01, 00:00 authored by Amir Reza Moghimi

Over the years, a variety of array processing techniques have been applied to the problem of enhancing degraded speech to improve automatic speech recognition. In this context, linear beamforming has long been the approach of choice, for reasons including good performance, robustness and analytical simplicity. While various nonlinear techniques (typically based to some extent on the study of auditory scene analysis) have also been of interest, they tend to lag behind their linear counterparts in terms of simplicity, scalability and flexibility. Nonlinear techniques are also more difficult to analyze and lack the systematic descriptions available in the study of linear beamformers. This work focuses on a class of nonlinear processing known as time-frequency (T-F) masking, also called spectro-temporal masking, whose variants comprise a significant portion of the existing techniques. T-F masking accepts or rejects individual time-frequency cells based on some estimate of local signal quality. Analyses are developed that attempt to mirror the beam patterns used to describe linear processing, leading to a view of T-F masking as "nonlinear beamforming". Two distinct formulations of these "nonlinear beam patterns" are developed, based on different metrics of the algorithms' behavior; these formulations are modeled in a variety of scenarios to demonstrate the flexibility of the idea. While these patterns are not quite as simple or all-encompassing as traditional beam patterns in microphone-array processing, they do accurately represent the behavior of masking algorithms in analogous and intuitive ways. In addition to analyzing this class of nonlinear masking algorithm, we also attempt to improve its performance in a variety of ways. Improvements are proposed to the baseline two-channel version of masking by addressing both the mask estimation and the signal reconstruction stages; the latter is addressed more successfully than the former.
Furthermore, while these approaches have been shown to outperform linear beamforming in two-sensor arrays, extensions to larger arrays have been few and unsuccessful. We find that combining beamforming and masking is a viable method of bringing the benefits of masking to larger arrays. As a result, a hybrid beamforming-masking approach, called "post-masking", is developed that improves upon the performance of MMSE beamforming (and can be used with any beamforming technique), with the potential for even greater improvement in the future.
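
As a rough illustration of the accept/reject idea described in the abstract, the Python sketch below builds a binary mask over a small grid of time-frequency cells and applies it. It is not the algorithm developed in the thesis: the function names, the 0 dB threshold, and the toy cell and local-SNR values are all assumptions made for illustration, and how the local quality estimate is actually obtained is not specified here.

```python
def binary_mask(local_snr_db, threshold_db=0.0):
    """Return a 0/1 mask: 1.0 where the local quality estimate exceeds the threshold.

    local_snr_db is a list of frames, each a list of per-bin estimates in dB
    (a hypothetical stand-in for "some estimate of local signal quality").
    """
    return [[1.0 if snr > threshold_db else 0.0 for snr in frame]
            for frame in local_snr_db]


def apply_mask(cells, mask):
    """Multiply each time-frequency cell by its mask value (accept or reject)."""
    return [[cell * m for cell, m in zip(frame, mask_frame)]
            for frame, mask_frame in zip(cells, mask)]


# Toy input: 2 frames x 3 frequency bins. All values are made up.
cells = [[1 + 1j, 2 + 0j, 0.5j],
         [0.1 + 0j, 3 - 1j, 1 + 0j]]
local_snr_db = [[6.0, -3.0, 0.5],
                [-10.0, 12.0, -0.1]]

mask = binary_mask(local_snr_db)
masked = apply_mask(cells, mask)
```

In this toy setup, cells whose assumed local SNR is at or below 0 dB are zeroed and the rest pass through unchanged. In a hybrid scheme such as the "post-masking" mentioned above, `cells` would presumably hold the beamformer output rather than a single microphone signal, but that mapping is an assumption of this sketch.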

History

Date

2014-05-01

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Richard Stern

Keywords

speech recognition audio signal processing array processing t-f masking beamforming

Licence

In Copyright

