Embeddings for DNN speaker adaptive training
In this work, we investigate the use of embeddings for speaker-adaptive training of DNNs (DNN-SAT), focusing on a small amount of adaptation data per speaker. DNN-SAT can be viewed as learning a mapping from each embedding to transformation parameters that are applied to the shared parameters of the DNN. We investigate different approaches to applying these transformations and find that, with a good training strategy, a multi-layer adaptation network applied to all hidden layers is no more effective than a single linear layer acting on the embeddings to transform the input features. In the second part of our work, we evaluate different embeddings (i-vectors, x-vectors and deep CNN embeddings) on an additional speaker recognition task in order to gain insight into what should characterize an embedding for DNN-SAT. We find that the speaker recognition performance of a given representation is not correlated with its ASR performance; in fact, the ability to capture more speech attributes than speaker identity alone was the most important characteristic of the embeddings for effective DNN-SAT ASR. Our best models achieved relative WER gains of 4% and 9% over DNN baselines using speaker-level cepstral mean normalisation (CMN) and a fully speaker-independent model, respectively.
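To make the simpler of the two adaptation schemes concrete, the sketch below maps a fixed speaker embedding through a single linear layer to an additive shift of the input acoustic features before they enter the shared DNN. This is a minimal sketch under stated assumptions, not the paper's implementation: the PyTorch module, class name, and dimensions (a 100-dimensional i-vector-like embedding, 40-dimensional features) are illustrative, and a real DNN-SAT setup would train the control layer jointly with the shared acoustic model on the full training set.

```python
# Minimal sketch (assumed PyTorch code, not from the paper): a single linear
# layer maps a speaker embedding to an additive shift of the input acoustic
# features, which are then passed to the shared (speaker-independent) DNN.
import torch
import torch.nn as nn


class LinearInputAdaptation(nn.Module):
    def __init__(self, embed_dim=100, feat_dim=40, hidden_dim=512, num_pdfs=2000):
        super().__init__()
        # Control layer: embedding -> feature-space shift (the "single linear layer").
        self.control = nn.Linear(embed_dim, feat_dim)
        # Shared DNN parameters, common to all speakers.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_pdfs),
        )

    def forward(self, feats, embedding):
        # feats: (batch, time, feat_dim); embedding: (batch, embed_dim)
        shift = self.control(embedding).unsqueeze(1)  # broadcast over time
        return self.shared(feats + shift)             # per-frame senone logits


# Toy usage: 40-dim filterbank-like features, 100-dim embedding per utterance.
model = LinearInputAdaptation()
feats = torch.randn(8, 200, 40)     # batch of 8 utterances, 200 frames each
embedding = torch.randn(8, 100)     # one embedding per speaker/utterance
logits = model(feats, embedding)    # shape: (8, 200, 2000)
```

In the alternative scheme compared in the paper, a multi-layer adaptation network conditioned on the embedding would instead modulate every hidden layer of the shared DNN; the reported finding is that, with a good training strategy, this brings no gain over a single input-feature transform of this kind.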