
Chinese-Pipeline: ASR for Chinese Pipeline

2019/08/18

The project is mainly based on DeepSpeech2 on PaddlePaddle, an open-source project released by Baidu. For the configuration, I strongly recommend using our Singularity recipe to avoid potential problems.
For the code, click here.

Prerequisites

For Red Hen Lab participants, all of the configuration has already been set up on the server.

Python==2.7
Singularity==2.5.1
CUDA==7.5

Data Preparation

Data Description

What we use in this project is the Chinese video news data collected by Red Hen Lab, including Xinwenlianbo (新闻联播), Xinwen 1+1 (新闻1+1), etc. The video lengths vary from 20 to 40 minutes (some of them may include advertisements).

Data Extraction

To extract the mp4 files from ./tv, you can use the command below:
# set $PWDDIR to the directory where you want to save the videos
find . -name '*_CN_*.mp4' ! -iname "*CGTN*" -exec cp {} "$PWDDIR" \;

Format Conversion

First we need to convert the video to audio so that we can apply our pipeline. The tool we use here is FFmpeg.
The usage is:
ffmpeg -i file.mp4 -ac 1 -ar 32000 file.wav
-ac sets the number of channels; '-ac 1' makes the audio mono (otherwise it will be stereo).
-ar sets the sample rate; 16000 and 32000 are both acceptable.
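
If you have many videos to convert, a small wrapper around FFmpeg saves typing. Below is a minimal sketch; the directory names are placeholders, and the flags match the command above:

# A minimal batch-conversion sketch; the directory names below are
# placeholders, adjust them to your own layout.
import os
import subprocess

video_dir = 'videos'
audio_dir = 'audio'

if not os.path.isdir(audio_dir):
    os.makedirs(audio_dir)

for name in sorted(os.listdir(video_dir)):
    if not name.endswith('.mp4'):
        continue
    wav_name = os.path.splitext(name)[0] + '.wav'
    # Same flags as above: mono channel (-ac 1), 32 kHz sample rate (-ar)
    subprocess.check_call([
        'ffmpeg', '-i', os.path.join(video_dir, name),
        '-ac', '1', '-ar', '32000',
        os.path.join(audio_dir, wav_name)])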

Split the audio

The tool we use here is WebRTCVad. A VAD (voice activity detector) classifies a piece of audio data as voiced or unvoiced.
The details of the usage can be found here.
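
For illustration, here is a minimal sketch of frame-level classification with the webrtcvad package; the frame length (30 ms) and the aggressiveness level (2) are assumptions you may need to tune for your data:

# A minimal frame-level VAD sketch using the webrtcvad package.
# Frame length (30 ms) and aggressiveness (2) are assumptions to tune.
import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least) to 3 (most)

wf = wave.open('file.wav', 'rb')
sample_rate = wf.getframerate()  # must be 8000/16000/32000/48000 Hz
pcm = wf.readframes(wf.getnframes())
wf.close()

frame_bytes = int(sample_rate * 0.03) * 2  # 30 ms of 16-bit mono PCM

# Mark each frame as voiced or unvoiced; contiguous voiced runs
# can then be cut out as the split audio segments.
flags = []
for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    frame = pcm[offset:offset + frame_bytes]
    flags.append(vad.is_speech(frame, sample_rate))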

Prepare the Manifest

A manifest is a file that describes the data, including each audio file's path and its duration. It is made up of small JSON objects, one per line, which look like:
{"audio_filepath": "data/split/2018-01/2018-01-31_0410_CN_HNTV1_午间新闻/2018-01-31_0410_CN_HNTV1_午间新闻001.wav", "duration": 10.0, "text": ""}
You can find the script manifest.py in the repository to create the manifest.
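
For illustration, a generator can be sketched as below; this is not the repository's manifest.py, and the directory names are placeholders:

# A minimal manifest-generation sketch (not the repository's manifest.py).
# The directory and output names are placeholders.
import json
import os
import wave

wav_dir = 'data/split'
manifest_path = 'manifest.infer'

with open(manifest_path, 'w') as out:
    for root, _, files in os.walk(wav_dir):
        for name in sorted(files):
            if not name.endswith('.wav'):
                continue
            path = os.path.join(root, name)
            wf = wave.open(path, 'rb')
            duration = wf.getnframes() / float(wf.getframerate())
            wf.close()
            # "text" stays empty: we are decoding, not training
            out.write(json.dumps({"audio_filepath": path,
                                  "duration": duration,
                                  "text": ""}) + '\n')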

Hyper-parameters Tuning

The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) for the CTC beam search decoder often have a significant impact on the decoder’s performance. It would be better to re-tune them on the validation set when the acoustic model is renewed.

tune.py performs a 2-D grid search over the hyper-parameters $\alpha$ and $\beta$. You must provide the ranges of $\alpha$ and $\beta$, as well as the number of attempts for each.

Tuning with GPU:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python tune.py \
--trainer_count 8 \
--alpha_from 1.0 \
--alpha_to 3.2 \
--num_alphas 45 \
--beta_from 0.1 \
--beta_to 0.45 \
--num_betas 8

The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and can optionally draw the error surface. Note that tuning may take a lot of time, so it is better not to set the ranges of alpha and beta too wide.
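
For intuition, the flags above define an evenly spaced 45 x 8 grid, so 360 decoding runs in total. A sketch of how such a grid can be enumerated:

# A sketch of the grid the flags above define: 45 x 8 = 360 (alpha, beta)
# pairs, each decoded and scored on the validation set.
import itertools
import numpy as np

alphas = np.linspace(1.0, 3.2, 45)  # --alpha_from / --alpha_to / --num_alphas
betas = np.linspace(0.1, 0.45, 8)   # --beta_from / --beta_to / --num_betas

for alpha, beta in itertools.product(alphas, betas):
    pass  # decode the validation set with (alpha, beta) and record the CER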

If you need to use the test set from Red Hen Lab, here is an introduction to it:
The test set is in /mnt/rds/redhen/gallina/Chinese_Pipeline/ziyiliu. The audio files are in /results, and the annotated transcripts are in /script. This test set contains only 150 minutes of speech.

Infer

To run inference on your data, we need the following files:

Infer Manifest

This is the file generated in the Prepare the Manifest step above.

Mean&Stddev

This file stores the mean and standard deviation of the audio features, which are used for z-score normalization.
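
For reference, z-score normalization standardizes each feature dimension using the precomputed statistics:

$$\hat{x} = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation estimated from the sampled audio features.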

python code/compute_mean_std.py \
--num_samples 2000 \
--specgram_type linear \
--manifest_paths {your manifest path here} \
--output_path {your expected output path}

You can check compute_mean.sh and compute_mean_std.py for more details.

Vocabulary

This is the file that lists all the words (in Chinese we say characters, same thing) in your data. Note that all the generated words come from the vocabulary, so if you didn't put an expected word here, it's impossible to generate it. We just use the vocab.txt that Baidu provides. You can find vocab.txt in the code directory.
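
As a quick sanity check, you can verify that the characters you expect in the output are covered by the vocabulary. A minimal sketch, assuming vocab.txt holds one character per line:

# A sanity-check sketch: report characters that vocab.txt cannot produce.
# Assumes vocab.txt holds one character per line (an assumption).
import io

with io.open('vocab.txt', encoding='utf-8') as f:
    vocab = set(line.strip('\n') for line in f)

expected = u'新闻联播'
missing = [ch for ch in expected if ch not in vocab]
if missing:
    print('not in vocabulary: ' + ' '.join(missing))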

Speech & Language Model

We use the Aishell model here, which is trained on 151 hours of the Aishell dataset. For the language model, we use Mandarin LM Large, which has no pruning and contains about 3.7 billion n-grams.

After all the preparation, you can fill them in at the right places in infer.sh. The code in the repository covers the whole workflow, from extracting the video to running the inference, so you only need to run infer.sh.

Running Code at CWRU HPC

Get into the container

module load singularity/2.5.1
module load cuda/7.5
export SINGULARITY_BINDPATH="/mnt"
srun -p gpu -C gpup100 --mem=100gb --gres=gpu:2 --pty bash
cd /mnt/rds/redhen/gallina/Singularity/Chinese_Pipeline/ziyiliu
singularity shell -e --nv ziyiliu.simg

Run the infer.sh

cd code
./infer.sh

Check the results

cd ../new_text

Results

An excerpt of a sample output file:

TOP|20190505183301|2019-05-05_1833_CN_CCTV13_新闻1+1
COL|Communication Studies Archive, UCLA
UID|e1bcc5b0-6f68-11e9-9863-eb97ad55d029
DUR|00:26:54
VID|720x576|1024x576
SRC|Changsha, China
TTL|News 1+1
CMT|
ASR_01|CMN
LBT|2019-05-06 02:33:01 Asia/Shanghai
ASR_01|2019-08-18 07:55|Source_Program=Baidu DeepSpeech2,infer.sh|Source_Person=Zhaoqing Xu,Shuwei Xu,Ziyi Liu|Codebook=Chinese Speech to Text
20190505183301.000|20190505183311.950|ASR_01|一只很可爱的目的和他们的男孩安东江湖的美人模样还行不
20190505183311.950|20190505183323.560|ASR_01|创业的感受将是你我还是我的国家我还没有
20190505183323.560|20190505183324.430|ASR_01|没
20190505183324.430|20190505183328.240|ASR_01|我曾好好地对本行业的公司
20190505183328.240|20190505183333.100|ASR_01|比如原来的因为中国在黄金上的突破
20190505183333.100|20190505183341.230|ASR_01|也还是要亲口说出而没有这个机会来了还是中国的年的太空
20190505183341.230|20190505183343.930|ASR_01|同时很多的欧美国
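
The body lines follow a simple start|end|tag|text layout, so they are easy to post-process. A minimal parsing sketch, with the field layout read off the sample above (header lines such as TOP and UID are skipped):

# A minimal sketch for reading the timestamped ASR lines shown above.
# The start|end|ASR_01|text layout is read off the sample; header lines
# (TOP, COL, UID, ...) are skipped.
import io

def read_asr_lines(path):
    segments = []
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('|')
            if len(parts) == 4 and parts[2] == 'ASR_01':
                start, end, _, text = parts
                segments.append((start, end, text))
    return segments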