The rapid growth of fake news on social media is a serious societal concern. Fake news is usually created by manipulating images, text, audio, and video, which indicates a need for multimodal fake news detection systems. Existing multimodal systems tend to approach the problem by adding a sub-task, such as an event discriminator or finding correlations across modalities. Their detection results depend heavily on that sub-task: without sub-task training, fake news detection performance degrades by 10% on average. To solve this issue, we introduce SpotFake, a multimodal framework for fake news detection. Our proposed solution detects fake news without relying on any additional sub-task. It exploits both the textual and visual features of an article. Specifically, we use language models (such as BERT) to learn text features, and image features are learned from VGG-19 pre-trained on the ImageNet dataset. All experiments are performed on two publicly available datasets, Twitter and Weibo. The proposed model outperforms the current state of the art on the Twitter and Weibo datasets by 3.27% and 6.83%, respectively.
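
To make the two-branch idea concrete, the following is a minimal sketch of combining a text feature vector and an image feature vector for a single binary decision, with no auxiliary sub-task. It is not the SpotFake implementation. The abstract does not state how the modalities are combined, so concatenation is an assumption, as are the logistic classifier head, the feature sizes (768 for a BERT-base encoder, 4096 for a VGG-19 fully connected layer), and the helper names `fuse` and `predict_fake_probability`. Random vectors stand in for real encoder outputs so that the example runs without any model downloads.

```python
import math
import random

TEXT_DIM = 768    # hidden size of BERT-base (assumed variant)
IMAGE_DIM = 4096  # width of a VGG-19 fully connected layer (assumed tap point)


def fuse(text_features, image_features):
    """Concatenate the two modality vectors into one joint representation."""
    return list(text_features) + list(image_features)


def predict_fake_probability(fused, weights, bias):
    """Hypothetical classifier head: sigmoid of a weighted sum of the fused features."""
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))


rng = random.Random(0)

# Random stand-ins for the outputs of the text and image encoders.
text_features = [rng.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)]
image_features = [rng.uniform(0.0, 1.0) for _ in range(IMAGE_DIM)]

fused = fuse(text_features, image_features)

# Untrained, randomly initialised weights; a real system would learn these.
weights = [rng.uniform(-0.01, 0.01) for _ in range(len(fused))]
probability = predict_fake_probability(fused, weights, bias=0.0)
label = "fake" if probability >= 0.5 else "real"
```

In this sketch the only training signal would be the fake/real label itself, which mirrors the abstract's point that detection does not depend on a separately trained sub-task.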