Institute of Information Science, Academia Sinica

Events


A Deep Learning Perspective on Acoustic Signal Processing


  • Speaker: Prof. Chin-Hui Lee (李錦輝) (School of Electrical and Computer Engineering, Georgia Institute of Technology)
    Host: Keh-Yih Su (蘇克毅)
  • Time: 2014-11-18 (Tue.) 15:00 ~ 16:00
  • Venue: Auditorium 106, New IIS Building
Abstract

In contrast to conventional model-based acoustic signal processing, we formulate a given acoustic signal processing problem in a novel deep learning framework as finding a nonlinear mapping function between the observed signal and the desired targets. Monte Carlo techniques are often required to generate a large collection of signal pairs in order to learn the often-complicated structure of the mapping functions. In the case of speech enhancement, a large training set encompassing many possible combinations of speech and noise types is first designed, so that a wide range of additive noises in real-world situations can be handled. Next, deep neural network (DNN) architectures are employed as nonlinear regression functions to ensure a powerful approximation capability. In the case of source separation, a similar simulation methodology can be adopted. In the case of speech bandwidth expansion, the target wideband signals can be filtered and down-sampled to create the needed narrowband training examples. Finally, in the case of acoustic de-reverberation, a wide variety of simulated room impulse responses is needed to generate a good training set.
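The simulation step for speech enhancement described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaker's actual pipeline: the function names (`power`, `mix_at_snr`, `make_pairs`), the use of plain Python lists for waveforms, and the choice of a fixed list of signal-to-noise ratios are all hypothetical. Each clean utterance is mixed with a randomly chosen noise segment, scaled to hit a target SNR, to produce one (noisy input, clean target) training pair for a regression model.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a waveform given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean-to-noise power ratio equals `snr_db`, then add.

    Returns (noisy, scaled_noise). Both inputs must have the same length.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean, p_noise = power(clean), power(noise)
    if p_clean == 0.0 or p_noise == 0.0:
        raise ValueError("signals must have non-zero power")
    # Solve p_clean / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


def make_pairs(clean_utts, noise_clips, snrs_db, rng):
    """Monte Carlo generation of (noisy, clean) pairs (hypothetical helper).

    For every clean utterance and every SNR, draw a random noise clip and a
    random offset into it. Each noise clip is assumed to be at least as long
    as each clean utterance.
    """
    pairs = []
    for clean in clean_utts:
        for snr_db in snrs_db:
            noise = rng.choice(noise_clips)
            start = rng.randrange(len(noise) - len(clean) + 1)
            segment = noise[start:start + len(clean)]
            noisy, _ = mix_at_snr(clean, segment, snr_db)
            pairs.append((noisy, clean))
    return pairs
```

In a sketch like this, the noisy waveform would typically be converted to spectral features before being fed to the DNN, with the matching clean features as the regression target; that feature extraction step is omitted here.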

When reconstructing the desired target signals, some additional techniques may be required to estimate the noise, the interfering speaker, or the missing phase information in order to enhance the quality of the synthesized signals. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over conventional techniques in speech enhancement, speech source separation, bandwidth expansion and voice conversion. It is also interesting to observe that the proposed DNN approach can serve as an acoustic preprocessing front-end for robust speech recognition, improving performance with or without post-processing.