Chinese-pipeline: Final Report

2019/08/21

Welcome to the Chinese Pipeline project

In this post, we show the work we did on the Chinese Pipeline for Red Hen Lab as part of GSoC 2019.
Please click here to access my repository and click here to see the previous blog posts.
Please feel free to email liuziyi219@gmail.com if you have any questions.

Please note that this is just a summary of what I did this summer. For more details, you can read my previous blog posts or the README in my repository.

Automatic Speech Recognition for Chinese

The project is mainly based on the open-source project DeepSpeech2 on PaddlePaddle, released by Baidu. My work was to improve the initial ASR pipeline, which was built last year by Zhaoqing Xu.

1. Using WebRTCVad to split the audio

In the initial version, the audio was split every 10 seconds, which could cut a complete sentence in two and hurt the performance of the pipeline. So I used a tool called WebRTCVad, which classifies a piece of audio data as voiced or unvoiced. You can see the introduction at the link. Because some of the split audio segments could still be extremely long, and to make the tool easier to use, I made some adjustments to the code; you can see the update here.
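To make the idea concrete, here is a minimal sketch of how per-frame voiced/unvoiced decisions could be grouped into segments with a cap on segment length. It is not the code in the repository. The function name, parameters, and defaults are hypothetical, and the flags are passed in as plain booleans so the sketch runs without audio. In practice they could come from a frame classifier such as py-webrtcvad's `Vad.is_speech`.

```python
from typing import List, Tuple


def segment_frames(voiced: List[bool],
                   frame_ms: int = 30,
                   min_silence_frames: int = 10,
                   max_segment_ms: int = 10_000) -> List[Tuple[int, int]]:
    """Group per-frame voiced flags into (start_ms, end_ms) segments.

    Hypothetical helper for illustration only. A segment is closed when
    a long enough pause is seen, or when it reaches max_segment_ms.
    """
    segments: List[Tuple[int, int]] = []
    start = None   # index of the first frame of the open segment
    silence = 0    # consecutive unvoiced frames inside the open segment
    max_frames = max_segment_ms // frame_ms

    for i, is_voiced in enumerate(voiced):
        if start is None:
            # Skip leading silence; open a segment on the first voiced frame.
            if is_voiced:
                start, silence = i, 0
            continue

        silence = 0 if is_voiced else silence + 1
        length = i - start + 1

        if silence >= min_silence_frames:
            # Long pause: close the segment right after the last voiced frame.
            end = i - silence + 1
            segments.append((start * frame_ms, end * frame_ms))
            start, silence = None, 0
        elif length >= max_frames:
            # Forced cut so that no segment grows without bound.
            segments.append((start * frame_ms, (i + 1) * frame_ms))
            start, silence = None, 0

    if start is not None:
        # Flush the last open segment, dropping trailing silence.
        end = len(voiced) - silence
        segments.append((start * frame_ms, end * frame_ms))
    return segments
```

The forced cut is a simple way to bound segment length. It can still cut inside a sentence, so a real pipeline might prefer to cut at the longest short pause inside an over-long segment.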

2. Running the pipeline on a Chinese news test set

Because Red Hen did not have annotated data for testing, we manually annotated 150 minutes of speech as a test set. The data is from Xinwenlianbo. You can see a detailed introduction to the test set here.
Then we tested the pipeline on the test set, and the performance was much better than before. Here are the experimental results.
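This summary does not say which metric was used. A common choice for Chinese ASR is character error rate (CER), so the sketch below assumes CER and shows how it could be computed from a reference transcript and a recognized transcript. The function names are hypothetical, and this is not the evaluation code from the repository.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Spaces are removed first, since Chinese text is not space-delimited.
    """
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("你好世界", "你好")` is 0.5, because two of the four reference characters are missing from the hypothesis.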

3. Xunfei SDK

Xunfei is a powerful tool for Chinese speech recognition. This approach was rejected in the end, but since I spent a week on it, I still show my work here. Also, using the Xunfei SDK can help us produce more test data without manual annotation.

4. Tuning the hyper-parameters

The hyper-parameters alpha (language model weight) and beta (word insertion weight) of the CTC beam search decoder often have a significant impact on the decoder’s performance. It is better to re-tune them on the validation set whenever the acoustic model is updated. The code and usage are in the README link. We spent a week tuning the hyper-parameters. One thing to note is that tuning takes a long time: with the test set we made, the tuning process may take a few days.
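As a rough illustration of why tuning is slow, here is a minimal grid-search sketch over alpha and beta. It is only a sketch under assumptions, not the tuning script referenced in the README. The `evaluate` callback is hypothetical and stands in for decoding the whole validation set with the given weights and returning an error rate.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Return num evenly spaced values from start to stop, inclusive."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + k * step for k in range(num)]


def grid_search(evaluate: Callable[[float, float], float],
                alphas: Sequence[float],
                betas: Sequence[float]) -> Tuple[float, float, float]:
    """Try every (alpha, beta) pair and return the one with the lowest error.

    evaluate is a hypothetical callback: in a real run it would decode
    the validation set with these weights and return the error rate.
    """
    best = None
    for alpha, beta in product(alphas, betas):
        err = evaluate(alpha, beta)
        if best is None or err < best[2]:
            best = (alpha, beta, err)
    if best is None:
        raise ValueError("alphas and betas must not be empty")
    return best
```

Each call to `evaluate` is a full decoding pass, and the grid needs `len(alphas) * len(betas)` of them, so even a modest grid multiplies the decoding time many times over.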

5. Deploying the pipeline on the CWRU HPC server

Since the pipeline had already been deployed on the server once, I only needed to add the new functions, rewrite some code, and debug the whole pipeline. I spent almost two weeks fixing all the bugs, including video conversion problems, output errors, etc. Here is how to use the ASR pipeline on the CWRU HPC server.

NLP Pipeline

We did not make much progress on this part. We discussed the data format, and then I preprocessed the text. The problem is that splitting text into sentences requires punctuation in the text, and the results from our ASR pipeline do not contain punctuation. This makes sentence splitting difficult.
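The difficulty can be seen with a small punctuation-based splitter. This is a minimal sketch for illustration, not code from the project, and the function name is hypothetical. It splits on common Chinese and ASCII sentence-ending marks, so text without punctuation comes back as a single chunk.

```python
import re
from typing import List

# One sentence = a run of non-terminator characters plus an optional terminator.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at sentence-ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


punctuated = "今天天气很好。我们去公园吧！"
asr_style = "今天天气很好我们去公园吧"  # no punctuation, like ASR output

print(split_sentences(punctuated))
print(split_sentences(asr_style))
```

With punctuated input the splitter returns two sentences, while the unpunctuated ASR-style input stays as one long string.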

Acknowledgement

A huge thanks goes to my GSoC mentors, Xu Zhaoqing and Professor Mark Turner, for their patience and support. They always responded quickly to my questions and helped me with the issues I faced. I would also like to thank Professor Francis Steen for all the help he provided. I also thank all the Red Hen Lab members, who showed great kindness; thank you so much for your help. Finally, I express my gratitude to Google Summer of Code. Thanks for all the things you do; I really enjoyed this experience in GSoC 2019.

Original author: Ziyi Liu

Original link: http://liuziyi219.github.io/2019/08/21/Chinese-pipeline-Final-Report/

Published: August 21st 2019, 4:54:03 am

Updated: August 24th 2019, 1:24:49 pm
