MPST: Movie Plot Synopses with Tags

gandham vignesh babu
6 min read · Jul 24, 2019

AUTOMATED MOVIE TAGGING

Social tagging of movies reveals a wide range of heterogeneous information about movies, like the genre, plot structure, soundtracks, metadata, visual and emotional experiences. Such information can be valuable in building automatic systems to create tags for movies. Automatic tagging systems can help recommendation engines to improve the retrieval of similar movies as well as help viewers to know what to expect from a movie in advance.

In this case study, we set out to the task of collecting a corpus of movie plot synopses and tags. We describe a methodology that enabled us to build a fine-grained set of around 70 tags exposing heterogeneous characteristics of movie plots and the multi-label associations of these tags with some 14000 movie plot synopses. We investigate how these tags correlate with movies and the flow of emotions throughout different types of movies. Finally, we use this corpus to explore the feasibility of inferring tags from plot synopses. We expect the corpus will be useful in other tasks where analysis of narratives is relevant.

INTRODUCTION

Tags should express plot-related attributes that are easy to understand by people. The goal is to predict tags from the written movie plots. Therefore relevant tags are those that capture properties of movie plots (e.g. structure of the plot, genre, emotional experience, storytelling style), and not attributes of the movie foreign to the plot, such as metadata.

The tag set should not be redundant. Because we are interested in designing methods to automatically assign tags, having multiple tags that represent the same property is not desirable. For example, tags like cult, cult film, cult movie are closely related and should all be mapped to a single tag. Tags should be well represented. For each tag, there should be a sufficient number of plot synopses, so that the process of characterizing a tag does not become difficult for a machine learning system due to data sparseness.

DATA

The data consists of the fields imdb_id, title, plot_synopsis, tags, split, and synopsis_source.

imdb_id: the identifier assigned to the movie by the IMDb website. This is a unique value for each movie.

title: the title of the movie.

plot_synopsis: a written summary of the movie's story. Features are extracted from this field, and from those features the tags are predicted.

tags: the target variables to be predicted. These are the words that describe plot-related attributes of the movie, such as its genre.

split: indicates whether a record belongs to the train, test, or validation set. The training data is used to fit the models, the validation data to select the best hyperparameters, and the test data to evaluate the final models.

synopsis_source: the source of the synopsis, either IMDb or Wikipedia.
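As a sketch, the corpus can be loaded with pandas and the comma-separated tags column split into lists for multi-label work. The rows below are illustrative stand-ins, not actual corpus entries, and the real file name may differ:

```python
import pandas as pd

# Tiny stand-in for the MPST corpus; the real CSV has the same six columns.
data = pd.DataFrame([
    {"imdb_id": "tt0000001", "title": "Example Movie A",
     "plot_synopsis": "Two imprisoned men bond over a number of years...",
     "tags": "prison, friendship", "split": "train", "synopsis_source": "imdb"},
    {"imdb_id": "tt0000002", "title": "Example Movie B",
     "plot_synopsis": "The aging patriarch of a crime family...",
     "tags": "violence, murder", "split": "test", "synopsis_source": "wikipedia"},
])

# Tags arrive as one comma-separated string; turn them into lists.
data["tag_list"] = data["tags"].str.split(", ")
print(data[["title", "tag_list", "split"]])
```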

EXPLORATORY DATA ANALYSIS:

After removing duplicates from the dataset, we analyze:

  1. The number of tags per movie.
  2. The most common tags.
  3. The frequency of tags in the dataset.
  4. The number of features to extract.
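The first three analyses above can be sketched in a few lines; the tag lists here are illustrative, while in the real corpus they come from the `tags` column:

```python
from collections import Counter

import pandas as pd

# Illustrative per-movie tag lists.
tag_lists = [
    ["murder", "violence", "flashback"],
    ["romantic"],
    ["murder"],
    ["violence", "romantic"],
]

# 1. Distribution of the number of tags per movie.
tags_per_movie = pd.Series([len(t) for t in tag_lists])
print(tags_per_movie.describe())

# 2./3. Most common tags and their frequencies in the corpus.
tag_freq = Counter(tag for tags in tag_lists for tag in tags)
print(tag_freq.most_common(20))
```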

Analyzing the top 20 tags and their frequencies, we conclude from the EDA that a few movies have a large number of tags, but most movies are tagged with only one or two. Murder, violence, flashback, and romantic are the most frequent tags in the corpus.

We also visualize box plots and violin plots of the data. This helps in spotting outliers and understanding the distribution of tag occurrences.

DATA PREPROCESSING:

The data requires some preprocessing before we train the models and make predictions.

Hence, in the preprocessing phase, we perform the steps below in order:

  1. Remove the HTML tags.
  2. Remove punctuation and special characters.
  3. Remove words containing alphanumeric characters (letters mixed with digits).
  4. Remove stop words.
  5. Convert the words to lower case.
  6. Discard words whose length is less than 2.
  7. Perform stemming and lemmatization.
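The steps above can be sketched as a single cleaning function. This is a minimal version: the stop-word list is a small illustrative set (in practice one would use `nltk.corpus.stopwords`, which requires `nltk.download('stopwords')`), and lemmatization with `WordNetLemmatizer` is omitted because it needs an extra WordNet download:

```python
import re

from nltk.stem import PorterStemmer

# Small illustrative stop-word set; use nltk.corpus.stopwords in practice.
STOPWORDS = {"the", "a", "an", "and", "is", "in", "of", "to"}
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)               # 1. strip HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)        # 2. punctuation/specials
    text = re.sub(r"\b\w*\d\w*\b", " ", text)          # 3. tokens containing digits
    words = text.lower().split()                       # 5. lower-case
    words = [w for w in words if w not in STOPWORDS]   # 4. drop stop words
    words = [w for w in words if len(w) > 1]           # 6. drop words shorter than 2
    return " ".join(stemmer.stem(w) for w in words)    # 7. stemming

print(preprocess("<b>The killer's 2nd murder is revealed in a flashback!</b>"))
```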

TRAINING THE MODELS AND PREDICTION:

We define various models over the features generated from the plot synopses and train them with machine learning algorithms such as Logistic Regression and SVM (Support Vector Machine).

We encode the features as BOW (Bag of Words), TF-IDF (Term Frequency-Inverse Document Frequency), Word2Vec, and TF-IDF-weighted Word2Vec.
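The BOW and TF-IDF encodings can be sketched with scikit-learn's vectorizers; the two synopses below are made-up examples:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

synopses = [
    "a detective investigates a murder in the city",
    "two friends fall in love during a murder trial",
]

# Bag of Words: raw term counts over unigrams and bigrams.
bow = CountVectorizer(ngram_range=(1, 2))
X_bow = bow.fit_transform(synopses)

# TF-IDF: the same n-grams, but down-weighting terms common to most synopses.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(synopses)

print(X_bow.shape, X_tfidf.shape)  # same vocabulary, so same shapes
```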

We also extract features from NLTK part-of-speech tags, keeping the verbs, nouns, pronouns, adjectives, and adverbs. In addition, we encode the plot synopsis features with the GloVe (Global Vectors) model.

We perform hyperparameter tuning with RandomizedSearchCV.

We use the OneVsRest classifier, in which one binary classifier is fitted per class, with that class fitted against all the other classes.
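The OneVsRest wrapper and randomized search fit together as sketched below. The data is a random toy stand-in and the candidate values for the regularization strength `C` are assumptions for illustration, not the grid used in the original experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
# Toy multi-label data: 40 samples, 10 features, 3 binary tag indicators.
X = rng.randn(40, 10)
Y = (rng.rand(40, 3) > 0.5).astype(int)

# One logistic regression per tag; C tuned by randomized search
# with micro-F1 as the scoring metric.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
search = RandomizedSearchCV(
    clf,
    param_distributions={"estimator__C": [0.01, 0.1, 1, 10, 100]},
    n_iter=5, scoring="f1_micro", cv=3, random_state=0,
)
search.fit(X, Y)
print(search.best_params_)
```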

We use GloVe, an unsupervised learning algorithm for generating vector representations of words. Training is done on a word co-occurrence matrix built from a corpus, and the resulting representations contain structure useful for many other tasks.
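One common way to turn pre-trained GloVe vectors into a synopsis-level feature, which may be what is meant here, is to average the vectors of the words in the synopsis. The 4-dimensional vectors below are toy values; real GloVe vectors are loaded from a downloaded file (one word and its vector per line) and typically have 50-300 dimensions:

```python
import numpy as np

# Toy stand-in for a pre-trained GloVe embedding table.
glove = {
    "murder": np.array([0.1, 0.3, -0.2, 0.5]),
    "detective": np.array([0.2, 0.1, -0.1, 0.4]),
}

def synopsis_vector(tokens, embeddings, dim=4):
    # Average the vectors of in-vocabulary words; zero vector if none match.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = synopsis_vector(["the", "detective", "murder"], glove)
print(v)
```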

PERFORMANCE MEASURES:

We consider F1-score, precision, and recall as the performance measures for multi-label classification. We use Logistic Regression and SVM with the OneVsRest classifier, since this is a multi-label classification problem.

We run the analysis using unigrams, bigrams, trigrams, and higher-order n-grams over all 71 tags of the data and record the performance. Since this is multi-label classification, we use the F1-score as the scoring metric while performing the randomized search. The F1-score is the harmonic mean of precision and recall, and it comes in two flavors: the micro-averaged F1-score and the macro-averaged F1-score.

In micro-averaging, the individual TPs, FPs, and FNs are summed up over all classes first; micro-precision and micro-recall are computed from these pooled counts, and the micro-averaged F1-score is the harmonic mean of the two.

In macro-averaging, the F1-score is computed separately for each class and then averaged, giving every tag equal weight regardless of how frequent it is.

Hamming loss is the fraction of labels that are incorrectly predicted, i.e., the ratio of wrong labels to the total number of labels.
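All three measures are available in scikit-learn; the label matrices below are made-up toy values (rows = movies, columns = tags):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Micro: pool TP/FP/FN over all tags, then one F1. Macro: per-tag F1, averaged.
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Hamming loss: wrong label cells divided by total label cells (here 2 of 9).
hl = hamming_loss(y_true, y_pred)

print(micro, macro, hl)
```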

We obtained an F1-score of about 0.3 with encoding techniques like bag of words and TF-IDF. Using the feature-engineered model with NLTK part-of-speech + GloVe features, we obtained F1-scores of 0.41 and 0.44 with the Logistic Regression and SVM models respectively.

CONCLUSION:

We have presented a new corpus of ≈70 fine-grained tags and their associations with ≈14K plot synopses of movies. In order to create the tagset, we tackled the challenge of extracting tags related to movie plots from noisy and redundant tag spaces created by user communities in MovieLens and IMDb. In this regard, we describe the methodology for creating the fine-grained tagset and mapping the tags to the plot synopses.

We present an analysis, where we try to find out the correlations between tags. These correlations seem to portray a reasonable set of movie types based on what we expect from certain types of movies in the real world. We also try to analyze the structure of some plots by tracking the flow of emotions throughout the synopses, where we observed that movies with similar tag groups seem to have similarities in the flow of emotions throughout the plots.
