“Anime Character Recognition Using Intermediate Features Aggregation”, Edwin Arkel Rios, Min-Chun Hu, Bo-Cheng Lai, 2022-05-27:

In this work we study the problem of anime character recognition. Anime refers to animation produced within Japan and to work derived from or inspired by it.

We propose a novel Intermediate Features Aggregation classification head, which helps smooth the optimization landscape of Vision Transformers (ViTs) by adding skip connections between intermediate layers and the classification head, improving relative classification accuracy by up to 28%. The proposed model, named Animesion, is the first end-to-end framework for large-scale anime character recognition.
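The idea of routing intermediate layers to the classification head via skip connections can be sketched as follows. This is an illustrative reconstruction, not the authors' exact implementation: it assumes a ViT that exposes the hidden states of each encoder layer, takes the [CLS] token from every layer, and averages them before a single linear classifier.

```python
import torch
import torch.nn as nn

class IFAHead(nn.Module):
    """Sketch of an Intermediate Features Aggregation head (hypothetical
    details): the [CLS] token of each intermediate layer is skip-connected
    to the head, normalized, averaged, and classified."""

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, seq_len, hidden_dim) tensor per layer
        cls_tokens = torch.stack([self.norm(h[:, 0]) for h in hidden_states])
        # Aggregate across layers, then classify the pooled representation.
        return self.classifier(cls_tokens.mean(dim=0))

# Toy usage with 4 layers, batch size 2, 17 tokens, hidden dim 64:
head = IFAHead(hidden_dim=64, num_classes=10)
states = [torch.randn(2, 17, 64) for _ in range(4)]
logits = head(states)
print(logits.shape)  # torch.Size([2, 10])
```

Averaging is only one possible aggregation; concatenation followed by a projection would be a drop-in alternative.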

We conduct extensive experiments using a variety of classification models, including CNNs and self-attention based ViTs. We also adapt the Vision-Language Transformer (ViLT), a multimodal variant of ViT, to incorporate external Danbooru tag data for classification, without additional multimodal pre-training.
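A minimal sketch of the ViLT-style idea of fusing tag tokens with image patches in one encoder is shown below. All component sizes and names here are illustrative assumptions, not the authors' configuration: tag IDs are embedded, concatenated with projected patch embeddings and a [CLS] token, and the joint sequence passes through a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class TagAugmentedViT(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): image
    patches and Danbooru-style tag tokens share one Transformer encoder,
    with no separate multimodal pre-training stage."""

    def __init__(self, vocab_size=1000, dim=64, patch_dim=48, num_classes=10):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)   # stand-in patch embedding
        self.tag_embed = nn.Embedding(vocab_size, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches: torch.Tensor, tag_ids: torch.Tensor):
        b = patches.size(0)
        # Joint sequence: [CLS] + image patch tokens + tag tokens.
        x = torch.cat([self.cls.expand(b, -1, -1),
                       self.patch_proj(patches),
                       self.tag_embed(tag_ids)], dim=1)
        return self.head(self.encoder(x)[:, 0])  # classify from [CLS]

# Toy usage: batch of 2 images (16 patches each) with 5 tags per image.
model = TagAugmentedViT()
logits = model(torch.randn(2, 16, 48), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 10])
```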

Our results yield new insights into how hyperparameters such as input sequence length and mini-batch size, as well as architectural variations, affect the transfer learning performance of Vi(L)Ts.

We release our source code and pre-trained model checkpoints to encourage and facilitate further research in this domain.

3.1. Data: We use the DanbooruAnimeFaces (DAF) dataset in our experiments. DAF is a subset of the 2018 release of Danbooru. Due to its extremely long-tailed distribution, we only keep classes with at least 20 samples, resulting in 463,437 images of 3,263 characters. We split these into training, validation, and testing sets using ratios of 0.7, 0.1, and 0.2, respectively. Since the original dataset only contains face crops, we also sample full-body images by resizing the original images from Danbooru20xx, and coin this set DAFull. Furthermore, we include description tags from Danbooru20xx as additional multimodal data.
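The preprocessing described above can be sketched in a few lines. This is a generic reconstruction of the described procedure (drop classes below a minimum sample count, then split 70/10/20), not the authors' released script; sample format and seed are assumptions.

```python
import random
from collections import Counter

def filter_and_split(samples, min_count=20, ratios=(0.7, 0.1, 0.2), seed=0):
    """Keep only classes with >= min_count samples, then shuffle and
    split into train/val/test by the given ratios (sketch)."""
    counts = Counter(label for _, label in samples)
    kept = [s for s in samples if counts[s[1]] >= min_count]

    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(kept)

    n = len(kept)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (kept[:n_train],
            kept[n_train:n_train + n_val],
            kept[n_train + n_val:])

# Toy usage: two frequent characters and one rare one (dropped).
data = ([(f"img{i}", "char_a") for i in range(30)]
        + [(f"img{i}", "char_b") for i in range(25)]
        + [(f"img{i}", "char_c") for i in range(5)])
train, val, test = filter_and_split(data)
print(len(train), len(val), len(test))  # 38 5 12
```

A per-class (stratified) split would preserve the 70/10/20 ratio within every character; the global split shown here is the simpler variant.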