Evaluate Robustness of Convolutional Neural Networks (CNNs) with CIFAR100-C and CIFAR10-C datasets

Shakti Wadekar
Jan 12, 2022

Frequent questions that arise when we think about evaluating a model's robustness are: Which datasets should we use for the robustness check? Which metrics are used in this evaluation? And which recent works have used these datasets and metrics to evaluate their models' robustness? This article introduces one such method for evaluating a convolutional neural network (CNN) model's robustness to common image corruptions. Github Code here.

Overview:

I : Datasets: CIFAR100-C and CIFAR10-C

II: Evaluation Metric: mCE

III: Code for CIFAR100-C and CIFAR10-C evaluation

The datasets and metrics in this article are based on the ICLR 2019 paper: BENCHMARKING NEURAL NETWORK ROBUSTNESS TO COMMON CORRUPTIONS AND PERTURBATIONS

I : DATASETS

What are CIFAR100-C and CIFAR10-C datasets?

The standard CIFAR100 and CIFAR10 datasets each provide 10,000 (10K) images as a test dataset. The above paper applies 15 common corruptions at 5 severity levels (1 to 5) to these test images, producing a total of 10,000*15*5 = 750,000 (750K) images per dataset for robustness evaluation.

Each severity level is applied to the 10K test images to produce a corresponding set of 10K corrupted images. Five severities of one corruption therefore give 50K images, and 15 different corruptions give the 750,000 (750K) images used for robustness testing.

CIFAR100-C dataset download link.

CIFAR10-C dataset download link.
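
If you want to sanity-check the downloaded files before running any evaluation code, here is a minimal sketch. It assumes the standard layout of the Zenodo archives, where each corruption is a single .npy file with the five severities stacked in blocks of 10,000 images (severity 1 first) plus a shared labels.npy:

import numpy as np

# Assumes the archive was unzipped into ./dataset/CIFAR-10-C/
images = np.load("dataset/CIFAR-10-C/gaussian_noise.npy")   # one corruption, all severities
labels = np.load("dataset/CIFAR-10-C/labels.npy")

print(images.shape)   # expected: (50000, 32, 32, 3) -> 5 severities x 10,000 images
print(labels.shape)   # expected: (50000,)

severity = 3                                         # severity levels 1..5
start, end = (severity - 1) * 10000, severity * 10000
severity_images = images[start:end]                  # the 10K images at this severity
severity_labels = labels[start:end]                  # labels repeat for every severity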

What are these common corruptions?


There are 15 common corruptions, each applied at 5 different severities, giving 75 distinct corruption settings in total. Below are the descriptions of each corruption from the paper; a small illustrative sketch of how severity parameterizes a corruption follows the list.

1. Gaussian noise: This corruption can appear in low-lighting conditions.

2. Shot noise: Also called Poisson noise, is electronic noise caused by the discrete nature of light itself.

3. Impulse noise: Is a color analogue of salt-and-pepper noise and can be caused by bit errors.

4. Defocus blur: Occurs when an image is out of focus.

5. Frosted Glass Blur: Appears with “frosted glass” windows or panels.

6. Motion blur: Appears when a camera is moving quickly.

7. Zoom blur: Occurs when a camera moves toward an object rapidly.

8. Snow: Is a visually obstructive form of precipitation.

9. Frost: Forms when lenses or windows are coated with ice crystals.

10. Fog: Shrouds objects and is rendered with the diamond-square algorithm.

11. Brightness: Varies with daylight intensity.

12. Contrast: Can be high or low depending on lighting conditions and the photographed object’s color.

13. Elastic: Transformations stretch or contract small image regions.

14. Pixelation: Occurs when upsampling a low-resolution image.

15. JPEG: Is a lossy image compression format which introduces compression artifacts.
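
To make "severity" concrete, here is a small illustrative sketch of how a severity-parameterized corruption such as Gaussian noise could be implemented. The noise levels are made-up placeholders, not the exact values used in the paper or its corruption-generation code:

import numpy as np

def gaussian_noise(image, severity=1):
    # Illustrative only: these std values are hypothetical, not the paper's settings.
    stds = [0.04, 0.06, 0.08, 0.09, 0.10]
    x = image.astype(np.float32) / 255.0
    x = x + np.random.normal(scale=stds[severity - 1], size=x.shape)
    return (np.clip(x, 0.0, 1.0) * 255).astype(np.uint8)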

II: Evaluation Metric

Do NOT train models on these datasets

These datasets are to be used ONLY for evaluation/testing

Metric for the ImageNet-C dataset:

For ImageNet-C, the paper normalizes a model f's error by AlexNet's error. The corruption error for corruption c is

CE_c^f = ( Σ_{s=1}^{5} E_{s,c}^f ) / ( Σ_{s=1}^{5} E_{s,c}^{AlexNet} ),

where E_{s,c} is the classification error at severity s of corruption c. mCE is then the average of CE_c^f over the 15 corruptions.

Metric for CIFAR100-C and CIFAR10-C:

Recent works have slightly modified the above equation for evaluation on the CIFAR100-C and CIFAR10-C datasets. Specifically, for CIFAR100-C and CIFAR10-C the corruption errors are simply averaged over all severities and all common corruptions (without AlexNet normalization), and this average is taken as mCE. The AugMix paper (ICLR 2020) calls this mCE, and the AugMax paper (NeurIPS 2021) reports RA (Robustness Accuracy), i.e., 1 − mCE.

The Common-Corruption-Error is given by the simple (un-normalized) error formula:

E_{s,c} = (number of misclassified images) / (total number of test images) = 1 − accuracy on corruption c at severity s

The Common-Corruption-Error of one common corruption, averaged over all severities, is calculated as follows:

CE_c = (1/5) * Σ_{s=1}^{5} E_{s,c}

Hence, the mean of the Common-Corruption-Errors over all common corruptions gives us the Mean Corruption Error (mCE):

mCE = (1/15) * Σ_{c=1}^{15} CE_c
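
Putting the CIFAR-C version of the metric into code, here is a minimal sketch (the error values are random placeholders, and the 15 × 5 array layout is an assumption of this example):

import numpy as np

# errors[c, s] = classification error of model f on corruption c at severity s
# (15 corruptions x 5 severities); random placeholders for illustration.
errors = np.random.rand(15, 5)

ce_per_corruption = errors.mean(axis=1)   # CE_c = (1/5) * sum_s E_{s,c}
mCE = ce_per_corruption.mean()            # mCE  = (1/15) * sum_c CE_c
RA = 1.0 - mCE                            # AugMax-style Robustness Accuracy

print(f"mCE = {mCE:.4f}, RA = {RA:.4f}")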

III: Code for CIFAR100-C and CIFAR10-C

Github Link to code.

Two ways to run with the above code:

[With pretrained weights] The first is a standalone script: you already have a pretrained weights file (.pth) and the model definition, which you simply call in the standalone code to get the evaluation on CIFAR100-C and CIFAR10-C. More details below.

[Train and evaluate] The second way is to train the model from scratch and evaluate it on both the standard CIFAR100/10 datasets and the CIFAR100-C/10-C datasets. More details below.

First way: Standalone code for evaluating robustness error

File Name: robustness_standalone.py

Command line arguments needed: -wp (weights_pth) and -dp (dataset_pth)

How to run?

Step I: Import the model definition file. As an example, I have imported the resnet and mobilenetv2 definitions from the models folder.

Step II: Edit main() in robustness_standalone.py to call the relevant function: for CIFAR100-C call mCE_cifar100() and for CIFAR10-C call mCE_cifar10() (a rough sketch of what such a function does is shown after the command below).

Step III: Create a ‘dataset’ folder and download & unzip the CIFAR-100-C and CIFAR-10-C datasets into it.

Step IV: Provide 2 paths when running from command line:

1. trained weights file using ‘-wp’

2. CIFAR-100-C or CIFAR-10-C dataset path using ‘-dp’

Command: python robustness_standalone.py -wp path/to/weightsfile -dp path/to/cifar100-c
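
The function names mCE_cifar100() and mCE_cifar10() come from the repo; their exact implementation is not reproduced here. The following sketch only shows what such an evaluation loop typically looks like, assuming the standard CIFAR-C .npy layout and a PyTorch model already placed on the target device (in practice you would also apply the same input normalization used during training):

import numpy as np
import torch

CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
    "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
    "brightness", "contrast", "elastic_transform", "pixelate",
    "jpeg_compression",
]

@torch.no_grad()
def corruption_errors(model, dataset_path, device="cuda", batch_size=1000):
    """Sketch of a CIFAR-10-C/CIFAR-100-C evaluation loop (not the repo's exact code)."""
    model.eval()
    labels = torch.from_numpy(np.load(f"{dataset_path}/labels.npy")).long()
    errors = []
    for corruption in CORRUPTIONS:
        images = np.load(f"{dataset_path}/{corruption}.npy")           # (50000, 32, 32, 3), uint8
        x = torch.from_numpy(images).permute(0, 3, 1, 2).float() / 255.0
        wrong = 0
        for i in range(0, len(x), batch_size):                         # simple batching
            logits = model(x[i:i + batch_size].to(device))
            wrong += (logits.argmax(1).cpu() != labels[i:i + batch_size]).sum().item()
        errors.append(wrong / len(x))                                   # error over all 5 severities
    return errors

# mCE = sum(corruption_errors(model, "dataset/CIFAR-10-C")) / len(CORRUPTIONS)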

Second way: Train and evaluate standard error and robustness error

How to run?

I have provided ResNet and MobileNetV2 examples in this code. Model definitions of ResNet and MobileNetV2 are present in ./models folder.

Configuration files (.yml files in the ./configs folder) provide the code with model hyperparameters and training parameters such as the model name, number of classes, convolution stride of the first layer, model checkpoint saving path, best-model saving path, optimizer, loss function, dataset paths, etc.
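
As a rough illustration of how such a config might be consumed, here is a minimal sketch; the file name and keys below are hypothetical placeholders, not necessarily the names used in the repo's .yml files:

import yaml

# Hypothetical config path and keys, for illustration only.
with open("configs/mbnetv2_c100.yml") as f:
    cfg = yaml.safe_load(f)

model_name = cfg["model_name"]     # e.g. "mobilenetv2"
num_classes = cfg["num_classes"]   # 100 for CIFAR-100, 10 for CIFAR-10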

Create a ‘dataset’ folder and download & unzip the CIFAR-100-C and CIFAR-10-C datasets into it.

Command : python main.py -cp configs/resnet50.py

Command : python main.py -cp configs/mbnetv2_c100.py

main.py file execution flow:

1. Loads the configuration file and sets the random seed.

2. Creates the dataloaders (CIFAR100/10).

3. Loads the model using the config and the model definitions in the ./models folder.

4. Passes the config and model to the Train class and runs the training for-loop.

5. Creates a ‘results_modelname’ folder for saving the model checkpoint, best model, graphs and config file.

6. Saves a model checkpoint after every epoch and the best model based on evaluation accuracy.

7. At the end of training, evaluates and prints the best model’s accuracy on CIFAR100/10.

8. Evaluates the model’s robustness and prints mCE on CIFAR100-C/10-C.

9. Saves all training, evaluation, testing and robustness metrics to a .txt file.

10. Generates and saves training & evaluation loss and accuracy plots.

11. Finally, copies the config file into the results folder for reference.

If you use a conda environment: I have provided my conda environment file, environment.yml.

To dive deeper into the dataset details and metrics, take a look at the papers mentioned in the references. I hope this article helps you get started with evaluating the robustness of convolutional neural network models.

References:

1. Hendrycks, D. and Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. ICLR 2019.

2. Hendrycks, D. et al. AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty. ICLR 2020.

3. Wang, H. et al. AugMax: Adversarial Composition of Random Augmentations for Robust Training. NeurIPS 2021.