“Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, 2023-06-18 ():
In recent years, advances have been made in classification and object detection for anime. However, these works do not exploit the tags and text descriptions attached to anime data at creation time, which restricts both the methods and the data to a single modality and consequently limits performance.
In this paper, we propose a novel multimodal deep learning network for anime character identification and tag prediction that exploits multimodal data. Since text annotations accompanying anime are often missing in realistic scenarios, we introduce curriculum learning into the transformer so that inference remains possible with only one modality.
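One common way to realize this idea is a curriculum over modality dropout: the probability of masking the text input grows during training, so the model gradually learns to rely on the image alone. The schedule and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import random


def drop_text_probability(epoch, total_epochs, max_p=0.5):
    """Hypothetical curriculum schedule: the chance of masking the text
    modality grows linearly from 0 to max_p over the course of training."""
    return max_p * min(1.0, epoch / max(1, total_epochs - 1))


def maybe_mask_text(text_tokens, epoch, total_epochs, rng=random):
    """With curriculum-scheduled probability, replace the text input with
    None (which a downstream model could map to a learned 'missing-modality'
    embedding); otherwise pass the tokens through unchanged."""
    if rng.random() < drop_text_probability(epoch, total_epochs):
        return None
    return text_tokens
```

Under this sketch, early epochs train almost always with both modalities, while late epochs frequently see image-only batches, matching the deployment case where tags or descriptions are absent.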
Another challenge is that no existing dataset meets our demand for large-scale multimodal deep learning. To train the proposed network, we construct a new anime dataset, Dan:mul, which contains over 1.6M images spread across more than 14K categories, with an average of 24 tags per image. To the best of our knowledge, this is the first dataset specifically designed for multimodal anime character identification.
With the trained network, we can identify anime characters in images and generate the related tags. Experiments show that our method achieves state-of-the-art performance on Dan:mul for anime character identification.
…A. Dataset Construction: We build Dan:mul from an existing large online anime database, Danbooru. To ensure the generality and quality of the dataset, we collect the latest version of the database (Danbooru2021) and use only images in the 512px subset. To simplify character identification into a classification task, we keep only images in which a single anime character appears. In addition, since our method is based on supervised multimodal learning, image classes with fewer than 10 images are removed to avoid the long-tail distribution problem. These steps yield the image part of the dataset, containing 1,616,238 images across 14,413 categories.
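The two filtering steps above can be sketched as a small function. The record schema and names here are illustrative assumptions, not the authors' actual pipeline.

```python
from collections import Counter


def filter_dataset(records, min_class_size=10):
    """Sketch of the Dan:mul filtering steps (hypothetical schema:
    each record is a dict with an 'id' and a list of 'characters')."""
    # Step 1: keep only images containing exactly one character,
    # so identification reduces to single-label classification.
    single = [r for r in records if len(r["characters"]) == 1]

    # Step 2: drop classes with fewer than min_class_size images
    # to avoid an extreme long-tail distribution.
    counts = Counter(r["characters"][0] for r in single)
    return [r for r in single if counts[r["characters"][0]] >= min_class_size]
```

Counting class sizes after the single-character filter (rather than before) ensures the 10-image threshold reflects the images that actually remain in the classification task.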
…Dan:mul and DAF:re are both constructed from the Danbooru database; their statistics are shown in Table 2. The main differences are:
Dan:mul is nearly 4× larger in dataset size.
Our image resolution is 4× higher than DAF:re, and the number of categories has grown more than 4×, making the dataset more general and more challenging by comparison.
Our tag set, even after comprehensive importance filtering, is still more than 2× larger than that of DAF:re, so more textual information can be combined for multimodal learning.