Abstract
Multiple modalities capture the different aspects through which a data source conveys information. Modern-day social media platforms are among the primary sources of multimodal data: users share information through different modes of expression, posting textual content alongside multimedia such as images and videos. Multimodal information
embedded in such posts could be useful in predicting
their popularity. To the best of our knowledge, however, no multimodal dataset exists for popularity prediction of social media photos. In this work, we propose a multimodal dataset consisting of content, context, and social information for popularity prediction. Specifically, we augment the SMP-T1 dataset for social media prediction from the ACM Multimedia Grand Challenge 2017 with image content, titles, descriptions, and tags. We then propose a multimodal
approach that exploits visual features (i.e., content
information), textual features (i.e., contextual information),
and social features (e.g., average views and group counts)
to predict the popularity of social media photos in terms of view
counts. Experimental results confirm that our multimodal approach achieves performance comparable to the state of the art despite using only half of the SMP-T1 training data.