Ziyi's Blog

Chinese Pipeline: An introduction to the Chinese news test set

2019/06/16

I spent more than a week building this test set. It contains almost 150 minutes of speech. I prepared a correct transcript for each audio clip, so we can measure exactly how well the Chinese ASR pipeline performs.
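
One common way to score an ASR system against reference transcripts like these is the character error rate (CER): the character-level edit distance between reference and hypothesis, divided by the reference length. The post does not say which metric the pipeline evaluation uses, so the sketch below is only an illustration under that assumption; the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # character missing from the hypothesis
                curr[j - 1] + 1,         # extra character in the hypothesis
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumed metric, not confirmed by the post)."""
    # The reference texts use spaces in place of punctuation,
    # so all whitespace is dropped before comparing.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        # Background-only clips have no reference text and need separate handling.
        raise ValueError("reference is empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("白俄罗斯总统", "白俄罗斯总同")` counts one substituted character out of six. Clips that contain only background sound have an empty reference, so they would have to be scored separately, for instance by checking whether the pipeline outputs any text at all.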

Sources

The data comes from 新闻联播 (Xinwen Lianbo). We extracted five episodes, dated 2019.03.01 to 2019.03.05. The videos are from Red Hen gallina, and the scripts are from an official website that publishes the news script every day.

Audio length

Most audio clips are between ten and twenty seconds long. After the first split, clips longer than thirty seconds were re-segmented, and clips shorter than three seconds were recombined. We also kept a few shorter or longer clips for testing.
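
The length rules above can be sketched as a small pass over segment boundaries. This is a minimal sketch rather than the procedure actually used for the test set: it assumes segments are given as contiguous `(start, end)` pairs in seconds, merges short segments into a neighbour first, and then cuts long segments into equal pieces, whereas the real split points were presumably chosen at pauses. The names `rebalance`, `MAX_LEN` and `MIN_LEN` are hypothetical.

```python
import math

MAX_LEN = 30.0  # seconds; longer segments are re-segmented
MIN_LEN = 3.0   # seconds; shorter segments are recombined


def rebalance(segments):
    """Merge too-short segments, then split too-long ones.

    segments: list of (start, end) in seconds, sorted and contiguous.
    """
    # Pass 1: merge a short segment into the previous one; if the previous
    # one is itself still short (e.g. a short first segment), keep merging.
    merged = []
    for start, end in segments:
        if merged and (end - start < MIN_LEN
                       or merged[-1][1] - merged[-1][0] < MIN_LEN):
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # Pass 2: cut each long segment into equal pieces of at most MAX_LEN.
    result = []
    for start, end in merged:
        pieces = max(1, math.ceil((end - start) / MAX_LEN))
        step = (end - start) / pieces
        for k in range(pieces):
            result.append((start + k * step, start + (k + 1) * step))
    return result
```

Merging before splitting means the equal-length pieces of a long segment are always at least fifteen seconds long, so the second pass cannot create new short clips. Merging only makes sense for segments that were adjacent in the original recording, not across a removed advertisement.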

Audio cuts

Every episode's audio contains many advertisements and weather forecasts with loud background sound, so I cut those parts. I also kept several clips that contain only background sound as distractors, to see whether the pipeline recognizes them as non-speech.

Text

Each text file has the same name as its corresponding audio file.
For example, if the audio is named “xwlb0304-161.wav”, then its text is “xwlb0304-161.txt”.
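
Because the two files differ only in their extension, pairing audio with transcripts is a simple suffix swap. The sketch below assumes the `.wav` and `.txt` files sit in the same directory, which the post does not state; `transcript_path` and `find_pairs` are hypothetical helper names.

```python
from pathlib import Path


def transcript_path(audio_path):
    """Return the transcript path that shares the audio file's name."""
    return Path(audio_path).with_suffix(".txt")


def find_pairs(directory):
    """Return (pairs, missing): matched (wav, txt) paths, and wavs without a txt."""
    pairs, missing = [], []
    for wav in sorted(Path(directory).glob("*.wav")):
        txt = transcript_path(wav)
        if txt.is_file():
            pairs.append((wav, txt))
        else:
            missing.append(wav)
    return pairs, missing
```

Checking the `missing` list before an evaluation run is a cheap way to catch clips that were renamed or re-segmented without updating their transcript.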

We converted all Arabic numerals into Chinese characters and replaced punctuation with spaces.
Here is an example text:

成员国元首先举行小范围会谈 随后邀请观察员国阿富汗总统加尼 白俄罗斯总统卢卡申科 伊朗总统鲁哈尼 蒙古国总统巴特图勒嘎以及有关国际和地区组织代表参加大范围会谈
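
A rough sketch of this kind of normalization is shown below. It is not the procedure used to build the test set: the digit handling is a simple digit-by-digit mapping, which is only right for numbers that are read out digit by digit, such as years (2019 becomes 二零一九). Quantities such as 150 (一百五十) or decimals need their spoken form and are not covered here. The function name `normalize` is hypothetical.

```python
import re

# Digit-by-digit mapping: a simplifying assumption, correct for years only.
DIGITS = str.maketrans("0123456789", "零一二三四五六七八九")


def normalize(text: str) -> str:
    """Replace punctuation with spaces and map Arabic digits to Chinese characters."""
    # In Python 3, \w is Unicode-aware, so Chinese characters and digits are
    # kept while both ASCII and full-width punctuation are replaced.
    text = re.sub(r"[^\w\s]+", " ", text)
    text = text.translate(DIGITS)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())
```

For instance, `normalize("2019年3月1日")` yields `二零一九年三月一日`. Applying the same function to the ASR output before scoring keeps punctuation and digit formatting from being counted as recognition errors.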

Besides the host’s script, the text also includes some interviews that are not recorded in the news script, so this part required manual transcription. Many of these interviews are spoken in dialects.

Some characteristics of this dataset

  1. The wording is very succinct, sometimes obscure even for native speakers. For example, with “建言资政” you can work out the meaning from the characters, but it is hard to come up with those characters immediately when listening.
  2. There are many names, both personal names and place names. The place names are often uncommon; one might be an unknown village. If the pipeline fails to recognize a name and treats it as an ordinary word, or even splits the name and recombines the pieces with neighbouring characters, the result will not be satisfactory.
  3. There are foreign languages: not much, but they appear in almost every episode.