Ziyi's Blog

Chinese Pipeline: Usage of WebrtcVAD

2019/06/08

WebrtcVAD

A VAD (voice activity detector) classifies a piece of audio data as voiced or unvoiced.

Installation

Install the webrtcvad module

pip install webrtcvad

Preparing the audios

Redhen only has MP4-format videos, so we need to extract the audio from each video with FFmpeg.

FFmpeg is a powerful format-conversion tool. After installing FFmpeg, you can convert a video like this:

ffmpeg -i XXX.mp4 -hide_banner -loglevel 0 -ac 1 -ar 32000 xxx.wav

XXX.mp4 is the input video and xxx.wav is the output audio. '-ac' sets the number of audio channels and '-ar' sets the sample rate.
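The conversion step can be batched with a small Python helper. This is just a sketch: `ffmpeg_cmd` and `convert_all` are hypothetical names of my own, and it assumes `ffmpeg` is on the PATH.

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(video: Path, out_dir: Path) -> list:
    """Build the FFmpeg command shown above: mono, 32 kHz WAV output."""
    wav = out_dir / (video.stem + ".wav")
    return [
        "ffmpeg", "-i", str(video),
        "-hide_banner", "-loglevel", "0",
        "-ac", "1",      # mono, required by the webrtcvad API
        "-ar", "32000",  # 32 kHz, one of the accepted sample rates
        str(wav),
    ]

def convert_all(video_dir: Path, out_dir: Path) -> None:
    """Convert every .mp4 under video_dir (requires ffmpeg installed)."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        subprocess.run(ffmpeg_cmd(video, out_dir), check=True)
```

Separating command construction from execution also makes it easy to inspect the exact FFmpeg invocation before running a large batch.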

Split the audio

To run the files:

python audiosplit.py <aggressiveness> <padding duration> <path to wav file/directory> <path to output directory>

The complete code (audiosplit.py) and the sample I used can be found in the repository.

Usually, the result won’t be satisfying after the first split: some cuts will be quite long. To deal with audio files longer than 30 seconds (or any other threshold you choose), you can decrease the padding duration and re-split them. You don’t need to select them yourself: just pass the directory containing all the files, and every audio file longer than 30 s will be selected and re-split.
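The selection step above can be sketched with the standard-library wave module; `wav_duration` and `too_long` are illustrative names, not part of the original script.

```python
import wave
from pathlib import Path

def wav_duration(path) -> float:
    """Duration of a WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

def too_long(directory, limit_s=30.0):
    """Return the WAV files under `directory` longer than `limit_s` seconds,
    i.e. the cuts that should be re-split with a smaller padding duration."""
    return [p for p in sorted(Path(directory).glob("*.wav"))
            if wav_duration(p) > limit_s]
```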

Explanations of parameters

Several parameters need to be explained. Some of them are decisive for the result, some are not, but introducing all of them makes the program easier to understand.

aggressiveness mode

You can see ‘aggressiveness’ in the command under ‘Split the audio’. This parameter controls how aggressively non-speech is filtered out. It is an integer between 0 and 3: 0 is the least aggressive about filtering out non-speech, 3 the most aggressive. If you set the aggressiveness to 0, you will find many non-speech parts lasting 1 or 2 seconds in the result. However, if you set it to 3, the audio will be split into hundreds of tiny parts, which is not the result we want either. After several tests, I found that the splitting performs best with the aggressiveness mode set to 2.

sample_rate

The API only accepts sample rates of 8000, 16000, 32000 and 48000 Hz. The higher the sample rate, the better the audio quality. You need to set the sample rate when converting the video to audio.

num_channels

The API only accepts mono (single-channel) audio; this also needs to be set when converting.

frame_duration_ms

This defines how long each frame is; the API accepts frames of 10, 20 or 30 ms. I have tried both 20 ms and 30 ms, and there was no difference between the results.
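The frame length in bytes follows directly from the sample rate and frame duration (16-bit mono PCM is 2 bytes per sample). A small sketch, with names of my own choosing:

```python
def frame_bytes(sample_rate, frame_duration_ms, sample_width=2):
    """Size of one frame in bytes for 16-bit mono PCM audio."""
    return int(sample_rate * frame_duration_ms / 1000) * sample_width

def frames(audio, sample_rate, frame_duration_ms=30):
    """Yield fixed-size frames from raw PCM bytes, dropping any partial tail,
    so each chunk is a valid input for the VAD."""
    n = frame_bytes(sample_rate, frame_duration_ms)
    for offset in range(0, len(audio) - n + 1, n):
        yield audio[offset:offset + n]
```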

padding_duration_ms

This is the most decisive parameter, since webrtcvad-based splitting works by testing the padding window around each frame. It determines how long a pause must be before a cut is made. If we set the padding duration too small, the audio will be split into many short parts, some lasting only 2 or 3 seconds, which does not fit the later model testing. So after testing 200 ms, 250 ms, 280 ms and 300 ms, Zhaoqing and I both concluded that 300 ms is best for cutting the audio.
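The VAD itself only classifies individual frames; the actual cutting comes from a padded window over the per-frame decisions. A simplified, pure-Python illustration of that padding idea (the function name and the 0.9 ratio threshold are my own choices, not the script's exact code):

```python
import collections

def segments(voiced_flags, frame_ms=30, padding_ms=300, ratio=0.9):
    """Cut segments from a stream of per-frame voiced/unvoiced decisions.
    Returns (start, end) frame indices for each detected speech segment."""
    num_padding = padding_ms // frame_ms           # frames in the padding window
    window = collections.deque(maxlen=num_padding)
    triggered = False
    out, start = [], 0
    for i, voiced in enumerate(voiced_flags):
        window.append(voiced)
        if not triggered:
            # enter a segment when the window is mostly voiced
            if sum(window) > ratio * num_padding:
                triggered, start = True, i - len(window) + 1
                window.clear()
        else:
            # leave the segment when the window is mostly unvoiced
            if sum(1 for v in window if not v) > ratio * num_padding:
                out.append((start, i))
                triggered = False
                window.clear()
    if triggered:
        out.append((start, len(voiced_flags) - 1))
    return out
```

This makes the tradeoff visible: with a smaller `padding_ms` the window is shorter, so briefer pauses already end a segment and you get many short cuts, which is exactly why 300 ms worked better than 200 ms here.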
