Chinese Pipeline: An introduction to the Chinese news test set

 2019/06/16   Share

I spent more than a week building this test set. It contains almost 150 minutes of speech. I prepared a reference transcript for every audio clip, so we can measure exactly how well the Chinese ASR pipeline performs.
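The post does not say which metric is used to score the pipeline against the reference transcripts. A common choice for Chinese ASR is the character error rate (CER), so the following is only a minimal sketch under that assumption. The function names `edit_distance` and `cer` are illustrative and not part of the pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref.

    Spaces are dropped first, on the assumption that they only stand in
    for punctuation in the reference transcripts.
    """
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, `cer("白俄罗斯总统", "白俄罗斯总同")` counts one substitution over six reference characters.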

Sources

The data comes from 新闻联播 (Xinwen Lianbo). We extracted five episodes, dated 2019.03.01 to 2019.03.05. The videos come from Red Hen's gallina, and the scripts come from an official website that publishes the daily news script.

Audio length

Most clips are between ten and twenty seconds long. After the first split, clips longer than thirty seconds were re-segmented and clips shorter than three seconds were recombined. We also kept a few shorter and longer clips for testing.
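The length rule above can be sketched as a small check over the clips produced by the first split. This is an illustration, not the script actually used: the names `wav_duration` and `plan` are hypothetical, and only the three-second and thirty-second thresholds come from the text.

```python
import wave

MIN_SEC = 3.0   # clips shorter than this get recombined
MAX_SEC = 30.0  # clips longer than this get re-segmented


def wav_duration(path: str) -> float:
    """Duration of a PCM .wav file in seconds, read from its header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def plan(duration_sec: float) -> str:
    """Decide what to do with a clip after the first split."""
    if duration_sec > MAX_SEC:
        return "resegment"
    if duration_sec < MIN_SEC:
        return "merge"
    return "keep"
```

Running `plan(wav_duration(path))` over every clip gives a list of files that still need attention before the set is final.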

Audio cuts

Every episode contains many advertisements and weather forecasts with loud background sound, so I cut those parts. I also kept several clips that contain only background sound as distractors, to see whether the pipeline recognizes them as non-speech.

Text

Each transcript has the same name as its audio file.
If the audio is named “xwlb0304-161.wav”, its transcript is “xwlb0304-161.txt”.
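Because of this naming convention, the transcript for a clip can be found by swapping the file extension. The sketch below shows one way to do that and to list clips that have no transcript yet. The helper names are hypothetical, and it assumes audio and text files sit in the same directory, which the post does not state.

```python
from pathlib import Path


def transcript_for(audio_path) -> Path:
    """Path of the transcript that belongs to an audio file."""
    return Path(audio_path).with_suffix(".txt")


def find_unpaired(audio_dir) -> list:
    """All .wav files in audio_dir that have no matching .txt file."""
    return sorted(
        p for p in Path(audio_dir).glob("*.wav")
        if not transcript_for(p).exists()
    )
```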

We converted all Arabic numerals into Chinese characters and replaced punctuation with spaces.
Here is an example transcript:

成员国元首先举行小范围会谈 随后邀请观察员国阿富汗总统加尼 白俄罗斯总统卢卡申科 伊朗总统鲁哈尼 蒙古国总统巴特图勒嘎以及有关国际和地区组织代表参加大范围会谈
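A rough sketch of this normalization follows. It is not the script used for the test set. In particular, it reads numbers digit by digit (so “2019” becomes “二零一九”), which is a simplifying assumption: quantities that are spoken as full numbers need a proper number-to-words conversion or manual correction. The name `normalize` is hypothetical.

```python
import re

# Digit-by-digit mapping; an assumption, see the note above.
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))


def normalize(text: str) -> str:
    """Replace Arabic digits with Chinese characters and punctuation with spaces."""
    text = "".join(DIGITS.get(ch, ch) for ch in text)
    # Any run of non-word characters (punctuation, whitespace) or underscores
    # collapses into a single space; CJK characters count as word characters.
    text = re.sub(r"[\W_]+", " ", text)
    return text.strip()
```

For example, `normalize("2019年3月1日。")` returns `"二零一九年三月一日"`.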

Besides the host’s script, the text also includes some interviews that do not appear in the news script, so these parts had to be transcribed by hand. Many of the interviewees speak in dialect.

Some characteristics of this dataset

The wording is very terse and can be obscure even for native speakers. Take “建言资政”: you can work out the meaning from the characters, but it is hard to come up with those characters immediately when listening.
There are many names, both of people and of places. The place names are often uncommon; one might be a little-known village. If the pipeline fails to recognize a name and treats it as an ordinary word, or even splits it and recombines the pieces with neighbouring characters, the result will not be satisfactory.
There is some foreign-language speech: not much, but it appears in almost every episode.

Author: Ziyi Liu

Original link: http://liuziyi219.github.io/2019/06/16/Chinese-Pipeline-week3/

Published: June 16th 2019, 2:32:14 am

Updated: June 16th 2019, 4:34:18 am
