# crnn-audio-classification

**Repository Path**: linglin1978/crnn-audio-classification

## Basic Information

- **Project Name**: crnn-audio-classification
- **Description**: UrbanSound classification using Convolutional Recurrent Networks in PyTorch
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-03-15
- **Last Updated**: 2021-03-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# PyTorch Audio Classification: Urban Sounds

Classification of variable-length audio using a CNN + LSTM architecture on the [UrbanSound8K](https://urbansounddataset.weebly.com/urbansound8k.html) dataset.

Example results:

### Contents

- [Models](#models)
- [Inference](#inference)
- [Training](#training)
- [Evaluation](#evaluation)
- [To Do](#to-do)

#### Dependencies

- [soundfile](https://pypi.org/project/SoundFile/): audio loading
- [torchparse](https://github.com/ksanjeevan/torchparse): easy model definition from a .cfg file
- [torchaudio_contrib](https://github.com/keunwoochoi/torchaudio-contrib): audio transforms on GPU

#### Features

- Easily define a CRNN in .cfg format
- Spectrogram computation on GPU
- Audio data augmentation: cropping, white noise, time stretching (using a phase vocoder on GPU!)

### Models

CRNN architecture:

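For orientation, the network is a stack of convolutional blocks feeding a 2-layer LSTM and a small dense classifier, matching the printout in the next section. Below is a minimal, self-contained PyTorch sketch of that layout; it is not the repository's actual `AudioCRNN` class, and the mel-spectrogram front-end (`MelspectrogramStretch`) and variable-length sequence handling are omitted.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    """Rough stand-in mirroring the printed conv -> LSTM -> dense stack below."""

    def __init__(self, n_classes=10):
        super().__init__()

        def block(c_in, c_out, pool):
            # conv -> batchnorm -> ELU -> maxpool -> dropout, as in the printout
            return [nn.Conv2d(c_in, c_out, kernel_size=3), nn.BatchNorm2d(c_out),
                    nn.ELU(), nn.MaxPool2d(pool), nn.Dropout(0.1)]

        self.convs = nn.Sequential(*block(1, 32, 3), *block(32, 64, 4), *block(64, 64, 4))
        self.recur = nn.LSTM(input_size=128, hidden_size=64, num_layers=2, batch_first=True)
        self.dense = nn.Sequential(nn.Dropout(0.3), nn.BatchNorm1d(64),
                                   nn.Linear(64, n_classes))

    def forward(self, spec):
        # spec: (batch, 1, mel_bands, frames) mel-spectrogram
        x = self.convs(spec)                            # -> (batch, 64, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # -> (batch, time', 128)
        out, _ = self.recur(x)
        return self.dense(out[:, -1])                   # classify from the last time step

model = CRNNSketch()
dummy = torch.randn(2, 1, 128, 400)  # two fake 128-band mel-spectrograms
print(model(dummy).shape)            # torch.Size([2, 10])
```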
Printing model defined with [torchparse](https://github.com/ksanjeevan/torchparse):

```
AudioCRNN(
  (spec): MelspectrogramStretch(num_bands=128, fft_len=2048, norm=spec_whiten, stretch_param=[0.4, 0.4])
  (net): ModuleDict(
    (convs): Sequential(
      (conv2d_0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=[0, 0])
      (batchnorm2d_0): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (elu_0): ELU(alpha=1.0)
      (maxpool2d_0): MaxPool2d(kernel_size=3, stride=3, padding=0, dilation=1, ceil_mode=False)
      (dropout_0): Dropout(p=0.1)
      (conv2d_1): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=[0, 0])
      (batchnorm2d_1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (elu_1): ELU(alpha=1.0)
      (maxpool2d_1): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
      (dropout_1): Dropout(p=0.1)
      (conv2d_2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=[0, 0])
      (batchnorm2d_2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (elu_2): ELU(alpha=1.0)
      (maxpool2d_2): MaxPool2d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
      (dropout_2): Dropout(p=0.1)
    )
    (recur): LSTM(128, 64, num_layers=2)
    (dense): Sequential(
      (dropout_3): Dropout(p=0.3)
      (batchnorm1d_0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (linear_0): Linear(in_features=64, out_features=10, bias=True)
    )
  )
)

Trainable parameters: 139786
```

### Usage

#### Inference

Run inference on an audio file:

```bash
./run.py /path/to/audio/file.wav -r path/to/saved/model.pth
```
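Under the hood, `run.py` loads the waveform and forwards it through the restored model. The sketch below only illustrates the loading/preprocessing side with `soundfile`; the mono averaging mirrors the `ProcessChannels(mode=avg)` transform listed under Training, and restoring `AudioCRNN` from `model.pth` (handled by `run.py`) is not shown.

```python
import soundfile as sf
import torch

# Illustrative preprocessing only; rebuilding the model from model.pth is run.py's job.
waveform, sample_rate = sf.read("/path/to/audio/file.wav", dtype="float32")
if waveform.ndim > 1:                 # stereo clip -> mono by averaging channels,
    waveform = waveform.mean(axis=1)  # analogous to ProcessChannels(mode=avg)
batch = torch.from_numpy(waveform).unsqueeze(0)  # (1, num_samples) mini-batch
print(batch.shape, sample_rate)
```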

#### Training

```bash
./run.py train -c config.json --cfg arch.cfg
```

##### Augmentation

Dataset transforms:

```bash
Compose(
    ProcessChannels(mode=avg)
    AdditiveNoise(prob=0.3, sig=0.001, dist_type=normal)
    RandomCropLength(prob=0.4, sig=0.25, dist_type=half)
    ToTensorAudio()
)
```

As well as [time stretching](https://github.com/keunwoochoi/torchaudio-contrib/blob/781fe10ee0ee6ccab4628c7e0a56ce8e3add0502/torchaudio_contrib/layers.py#L236):
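The waveform-level transforms above can be pictured as simple NumPy operations. The sketch below gives hypothetical re-implementations of `AdditiveNoise` and `RandomCropLength`, with parameter meanings inferred from the printed names; the repository's own transform classes are the actual reference.

```python
import random
import numpy as np

def additive_noise(wave, prob=0.3, sig=0.001):
    """With probability `prob`, add white Gaussian noise of std `sig`."""
    if random.random() < prob:
        wave = wave + np.random.normal(0.0, sig, size=wave.shape).astype(wave.dtype)
    return wave

def random_crop_length(wave, prob=0.4, sig=0.25):
    """With probability `prob`, keep a random contiguous chunk of the clip."""
    if random.random() < prob:
        keep = 1.0 - min(abs(np.random.normal(0.0, sig)), 0.9)  # half-normal crop fraction
        n = max(1, int(len(wave) * keep))
        start = random.randint(0, len(wave) - n)
        wave = wave[start:start + n]
    return wave

clip = np.random.randn(44100).astype(np.float32)  # one second of dummy audio at 44.1 kHz
augmented = random_crop_length(additive_noise(clip))
print(len(clip), "->", len(augmented))
```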

##### TensorboardX
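Training progress can be monitored with TensorboardX. A minimal sketch of the kind of scalar logging involved is below; tag names, values, and the log directory are illustrative assumptions, not the repository's exact ones.

```python
from tensorboardX import SummaryWriter

# Log a few scalars per epoch so `tensorboard --logdir runs` can plot training curves.
writer = SummaryWriter("runs/crnn_example")
for epoch in range(5):
    fake_train_loss = 1.0 / (epoch + 1)      # placeholder values
    fake_val_accuracy = 0.50 + 0.05 * epoch
    writer.add_scalar("loss/train", fake_train_loss, epoch)
    writer.add_scalar("metrics/val_accuracy", fake_val_accuracy, epoch)
writer.close()
```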

#### Evaluation

```bash
./run.py eval -r /path/to/saved/model.pth
```

Then obtain the defined metrics:

```bash
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:03<00:00, 12.68it/s]
{'avg_precision': '0.725', 'avg_recall': '0.719', 'accuracy': '0.804'}
```

A short sketch of how these macro-averaged metrics can be computed is given after the To Do list below.

##### 10-Fold Cross Validation

| Arch | Accuracy | AvgPrecision(macro) | AvgRecall(macro) |
|----------|:-------------:|------:|------:|
| CNN | 71.0% | 63.4% | 63.5% |
| CRNN | 72.3% | 64.3% | 65.0% |
| CRNN(Bidirectional, Dropout) | 73.5% | 65.5% | 65.8% |
| CRNN(Dropout) | 73.0% | 65.5% | 65.7% |
| CRNN(Bidirectional) | 72.8% | 64.3% | 65.2% |

Per-fold metrics for CRNN(Bidirectional, Dropout):

| Fold | Accuracy | AvgPrecision(macro) | AvgRecall(macro) |
|----------|:-------------:|------:|------:|
|1|73.1%|65.1%|66.1%|
|2|80.7%|69.2%|68.9%|
|3|62.8%|57.3%|57.5%|
|4|73.6%|65.2%|64.9%|
|5|78.4%|70.3%|71.5%|
|6|73.5%|65.5%|65.9%|
|7|74.6%|67.0%|66.6%|
|8|66.7%|62.3%|61.7%|
|9|71.7%|60.7%|62.7%|
|10|79.9%|72.2%|71.8%|

### To Do

- [ ] commit jupyter notebook dataset exploration
- [x] use torchaudio-contrib for STFT transforms
- [x] CRNN entirely defined in .cfg
- [x] Some bug in 'infer'
- [x] Run 10-fold Cross Validation
- [ ] Comment things
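The macro-averaged metrics reported above are per-class precision/recall averaged over the 10 classes. The snippet below illustrates that definition with scikit-learn on dummy labels, and checks that the summary-table accuracy for CRNN(Bidirectional, Dropout) equals the mean of its per-fold accuracies; it is an illustration only, not the repository's evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Macro averaging: compute precision/recall per class, then average over classes.
y_true = np.random.randint(0, 10, size=200)   # dummy labels over the 10 UrbanSound classes
y_pred = np.random.randint(0, 10, size=200)
print({
    "accuracy": round(accuracy_score(y_true, y_pred), 3),
    "avg_precision": round(precision_score(y_true, y_pred, average="macro", zero_division=0), 3),
    "avg_recall": round(recall_score(y_true, y_pred, average="macro", zero_division=0), 3),
})

# The 73.5% summary figure is the mean of the per-fold accuracies listed above.
folds = [73.1, 80.7, 62.8, 73.6, 78.4, 73.5, 74.6, 66.7, 71.7, 79.9]
print(f"mean fold accuracy: {np.mean(folds):.1f}%")  # 73.5%
```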