Movie Rating Prediction with Machine Learning Algorithms on IMDB Data Set

D. ABİDİN, C. BOSTANCI, A. SİTE

Manisa Celal Bayar University, Manisa/Turkey, didem.abidin@cbu.edu.tr
Manisa Celal Bayar University, Manisa/Turkey, cse.canerbostanci@gmail.com

International Conference on Advanced Technologies, Computer Engineering and Science (ICATCES’18), May 11-13, 2018, Safranbolu, Turkey

Abstract – Predicting movie success with machine learning algorithms has become a very popular research area. Many algorithms can be applied to a data set to predict movie success if the data set is prepared and represented properly. In this study, we explain how IMDB movie data was used for movie rating prediction. The data set extracted from IMDB was formatted and prepared for data mining algorithms. These algorithms were executed in the WEKA environment, and their movie rating performances and confusion matrices were obtained. The seven machine learning algorithms used performed well on the data set, with accuracies ranging from 73.5% to 92.7%. The Random Forest algorithm had the best performance, 92.7%, which is the highest score obtained among similar studies.

Keywords - Machine learning, WEKA, movie prediction, IMDB.

I. INTRODUCTION

The movie industry has expanded worldwide, and people everywhere now have the chance to watch a movie on the day it is released. There is a huge sector behind the preparation phases of each film, and many directors and movie stars have risen to fame through it. Every year, hundreds of films are produced, in genres varying from comedy to romance or war to science fiction. To keep track of every movie produced, an online platform was needed. The Internet Movie Database (IMDB) is the most popular platform for reaching information about a rich collection of movies [1]. The IMDB web site contains downloadable raw data about the movies, including cast, directors, genres, crew, scriptwriters, summaries, gross and even user ratings. This data is used for data mining on the movies to make predictions on their user ratings. The data used in this study was extracted from the IMDB web site.

There are studies in the literature which also make movie predictions, and some of them use similar data mining algorithms. Table 1 shows the studies in the literature about movie prediction, together with the data mining algorithms applied in each. According to the table, Latif and Afzal [2] studied data mining algorithms similar to the ones in this study. They used an IMDB data set and obtained a movie prediction performance of 82.42% with the J48 algorithm. They also achieved a prediction performance of 79.07% with MLP and 79.52% with PART.

Butler et al. [3], who also worked on movie prediction on an IMDB data set, used different algorithms than we did, and their performances are not as successful as those of [2].

Among these movie prediction studies, Lee et al. [4] tried many data mining algorithms on IMDB data and obtained the best movie prediction results with the Random Forest algorithm (86.4%) and MLP (84%). In another study, Yu used the J48 algorithm for movie prediction on IMDB data [5] and got a performance of only 49%. The study of Nithin et al. uses no data mining algorithms in common with ours, but it also addresses movie success prediction [6]; its most successful algorithm was linear regression, with a performance of only 50.7%.

Papers are not the only publications on movie prediction; there are also theses which apply data mining to movie data from different points of view. For example, Tashman examined data mining techniques for analyzing film industry success [7] and concluded that the former careers of the actors and actresses carry the most predictive information. Persson also made movie rating predictions in his bachelor thesis [8] with random forests and support vector machines, collecting data from three different resources: IMDB, Rotten Tomatoes [9] and MovieLens [10].

Table 1 – Studies in literature.

Reference | Aim | Algorithm/Technique
[Our Study] | Movie prediction | J48, MLP, Random Forest, Bagging, BayesNet, LMT, PART
[2] | Movie prediction | Logistic Regression, Simple Logistic, MLP, J48, Naïve Bayes, PART
[3] | Movie prediction | KNN, Decision Trees, Gaussian Naïve Bayes
[4] | Movie prediction | Gradient Tree Boosting, Random Forests, Logistic Regression, Linear Discriminant, MLP, Adaptive Tree Boosting, Support Vector Classifier
[5] | Movie rating prediction | SMO, Naïve Bayes, Logistic Regression, J48
[6] | Movie success prediction | Linear Regression, Logistic Regression, SVM

In this study, we used seven machine learning algorithms to make a movie rating prediction. We chose the most popular algorithms used in prediction; as the result of our preliminary tests, the most successful seven of them are explained here.

The layout of the paper is as follows: In Section 2, the workflow of the system and the machine learning algorithms used in the study are explained. In Section 3, the data set and its preparation are introduced. In Section 4, the experimental results are given, and Section 5 is the discussion part of this study.

II. MATERIAL AND METHOD

Classification is one of the most commonly used machine learning techniques. It is a supervised method that classifies items into several groups. The data set is divided into two groups, a training set and a test set: one for building a learning model and the other for testing it. In this study, seven classification algorithms were applied using the WEKA (Waikato Environment for Knowledge Analysis) environment [11].
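The training/test division described above can be sketched in plain Python. This is a generic illustration, not the WEKA internals; the 70/30 ratio and fixed seed are arbitrary choices for the example:

```python
import random

def train_test_split(instances, train_fraction=0.7, seed=42):
    """Shuffle the instances and divide them into a training and a test set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = instances[:]     # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

movies = list(range(100))       # stand-in for 100 movie instances
train, test = train_test_split(movies)
print(len(train), len(test))    # 70 30
```

The model is built on `train` only, and `test` is kept aside to measure how well the learned model generalizes.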

To make a proper analysis on the data set, the data should be cleaned and formatted; this phase is called “preprocessing”. Then, the machine learning algorithms are executed. These algorithms first build a model on the data; once they have learned from the model built, they perform a classification for the movie ratings. The workflow of the study is given in Figure 1.

Figure 1: The workflow of this study.

Our workflow completely fits the CRISP-DM (CRoss Industry Standard Process for Data Mining) process model [12], which was stated in 1996. The steps of CRISP-DM for data mining applications are business understanding, data understanding, data preparation, modeling, evaluation and deployment [13]. We apply the data understanding, data preparation, modeling and evaluation steps exactly. The machine learning algorithms used in the study are explained briefly in the following subsections.

A. J48

Quinlan’s first algorithm, ID3 [14], was improved into the C4.5 algorithm, and J48 [15] is the open-source Java implementation of C4.5 [16]. C4.5 is a classifier which accepts nominal classes. It can use both discrete and continuous attributes as well as training data with missing attribute values. The additional features of J48 are accounting for missing values, decision tree pruning, continuous attribute value ranges and derivation of rules. The WEKA tool provides a number of options associated with tree pruning; in case of potential overfitting, pruning can be used to make the model more precise. Without pruning, the classification is performed recursively until every single leaf is pure, that is, until the classification of the training data is as perfect as possible [17].
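The core of ID3/C4.5-style tree induction is choosing, at each node, the attribute with the highest information gain. A minimal sketch of that criterion (illustrative only; the real C4.5 additionally uses gain ratio and handles continuous split points):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label="class"):
    """Entropy reduction achieved by splitting `rows` on attribute `attr`."""
    total = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return total - remainder

# Toy data (made up): 'long' movies here all rate high, so the split is informative.
rows = [
    {"long": True,  "class": "high"}, {"long": True,  "class": "high"},
    {"long": False, "class": "low"},  {"long": False, "class": "high"},
]
print(round(information_gain(rows, "long"), 3))  # 0.311
```

The attribute with the largest gain becomes the test at the node, and the procedure recurses on each branch.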

B. MLP

MLP algorithm can be viewed as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation. This transformation projects the input data into a space where it becomes linearly separable. This intermediate layer is referred to as a hidden layer. A single hidden layer is sufficient to make MLPs a universal approximator [18]. The power of the multilayer perceptron comes precisely from non-linear activation functions. Almost any non-linear function can be used for this purpose, except for polynomial functions [19].
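The role of the hidden layer and the non-linear activation can be seen on XOR, the classic example of a problem that is not linearly separable. The weights below are hand-picked for illustration (a trained MLP would learn them, and would use a smooth activation such as sigmoid or tanh rather than a step):

```python
def step(x):
    """A non-linear activation; real MLPs use sigmoid/tanh, but the idea is the same."""
    return 1 if x > 0 else 0

def xor_mlp(x1, x2):
    """2-2-1 network with hand-picked weights. No single-layer perceptron can
    compute XOR, but one hidden layer with a non-linear activation can."""
    h1 = step(x1 + x2 - 0.5)    # hidden unit: fires if at least one input is 1
    h2 = step(x1 + x2 - 1.5)    # hidden unit: fires only if both inputs are 1
    return step(h1 - h2 - 0.5)  # output: "at least one, but not both" = XOR

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

The hidden layer transforms the inputs into a space where the two classes become linearly separable, which is exactly the intuition stated above.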

C. Random Forest

Random forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [20]. It is a classifier that builds many classification trees as a forest of random decision trees, each constructed using a random subset of the features [21]. Each tree gives a classification (a vote), and the algorithm chooses the classification having the most votes among all the trees in the forest [22].
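The voting step at the end can be sketched in a few lines (the per-tree votes below are hypothetical cluster labels, not results from the paper):

```python
from collections import Counter

def forest_predict(tree_votes):
    """Each tree casts one class vote; the forest returns the majority class."""
    return Counter(tree_votes).most_common(1)[0][0]

votes = ["7", "8", "7", "7", "6"]   # hypothetical cluster votes from five trees
print(forest_predict(votes))        # 7
```

Because each tree sees a different random feature subset, the trees err in different ways, and the majority vote averages those errors out.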

D. Bagging

Bagging is an algorithm designed to improve the performance of machine learning algorithms, and it also helps to reduce overfitting. It is a variance reduction technique for a given base procedure, such as a decision tree: to grow each tree, a random bootstrap selection is made from the examples in the training set [23]. Bagging is a smoothing operation which turns out to be advantageous when aiming to improve the predictive performance of regression or classification trees [24].
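The resampling step of bagging can be sketched as follows. This draws a bootstrap sample of the same size as the training set, with replacement, so some examples typically appear more than once while others are left out:

```python
import random

def bootstrap_sample(data, seed=0):
    """Draw len(data) examples with replacement — the resampling step of bagging."""
    rng = random.Random(seed)   # fixed seed only so the example is reproducible
    return [rng.choice(data) for _ in data]

data = list(range(10))
sample = bootstrap_sample(data)
print(len(sample))  # 10 — same size as the original, but usually with repeats
```

One base model is trained per bootstrap sample, and their predictions are aggregated (by voting for classification, by averaging for regression).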

E. BayesNet

BayesNet belongs to the family of probabilistic graphical models (GMs). These graphical structures are used to represent knowledge about an uncertain domain; each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies are often estimated using known statistical and computational methods. Hence, Bayesian networks (BNs) combine principles from graph theory, probability theory, computer science and statistics. BNs correspond to a GM structure known as a directed acyclic graph (DAG) [25], which is popular in the statistics, machine learning and artificial intelligence communities [26]. BayesNet improves the performance of naive Bayesian classifiers by avoiding unwarranted assumptions about independence [27].
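The DAG factorization can be made concrete with a two-node network. The variables and all probability values below are invented for illustration; they are not estimates from the paper's data:

```python
# Minimal two-node Bayesian network: HighBudget -> HighRating.
p_budget = {True: 0.3, False: 0.7}                       # P(HighBudget)
p_rating_given_budget = {True:  {True: 0.6, False: 0.4}, # P(HighRating | HighBudget)
                         False: {True: 0.2, False: 0.8}}

def joint(budget, rating):
    """The joint factorizes along the DAG: P(B, R) = P(B) * P(R | B)."""
    return p_budget[budget] * p_rating_given_budget[budget][rating]

# Marginal P(HighRating) by summing the joint over the parent's values:
p_rating = sum(joint(b, True) for b in (True, False))
print(round(p_rating, 2))  # 0.32
```

In a larger network the same product runs over every node given its parents, which is what lets a BN relax the naive Bayes assumption that all attributes are independent given the class.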

F. Logistic Model Tree (LMT)

LMT basically consists of a standard decision tree structure with logistic regression functions at the leaves, much like a model tree is a regression tree with regression functions at the leaves. As in ordinary decision trees, a test on one of the attributes is associated with every inner node [28]. LMT is a supervised training algorithm that combines logistic regression and decision trees [29].


G. PART

The PART algorithm is used for generating rules in classification rule mining [30]. It is a separate-and-conquer rule learner: the algorithm produces ordered sets of rules called “decision lists”. A new instance is compared to each rule in the list in turn and is assigned the class of the first matching rule. PART builds a partial C4.5 decision tree in each iteration and makes the “best” leaf into a rule [31].
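A decision list of the kind PART produces can be sketched as an ordered set of (condition, class) pairs with a default class; first match wins. The rules and thresholds below are invented for illustration, not rules learned from the paper's data:

```python
# Hypothetical decision list over the movie attributes used in this paper.
rules = [
    (lambda m: m["numVotes"] > 500000 and m["gross"] > 100, "8"),
    (lambda m: m["runtimeMinutes"] < 70, "4"),
]
DEFAULT = "6"

def classify(movie):
    for condition, cls in rules:   # try the rules strictly in order
        if condition(movie):
            return cls             # first matching rule assigns the class
    return DEFAULT                 # no rule fired

print(classify({"numVotes": 900000, "gross": 250, "runtimeMinutes": 142}))  # 8
print(classify({"numVotes": 1000, "gross": 1, "runtimeMinutes": 95}))       # 6
```

The ordering matters: once a rule fires, later rules are never consulted, which is why the list is learned one rule at a time on the instances the previous rules did not cover.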

III. DATA SET DESCRIPTION

A. The Movie Dataset

In this study, we used seven popular machine learning algorithms [32][33] for movie analysis on our data set. This data set was extracted from the IMDB web site [1]. The data set given on the website covers about 270,000 movies; among these, only the ones which have gross data are used. The gross values were not included in the downloadable data set, but that information was given on the web pages of some of the movies. Since this gross value is one of the most important attributes of the data set, a second data extraction was done through the movie web pages: we wrote a script to extract the gross separately, and it was merged with the formerly extracted data set. The data set was prepared in three steps, which are described in the following subsections.

B. Data Extraction

The movie data set was extracted from the IMDB web site with one important piece of information missing: the “gross” field, which is available on the IMDB web pages. To include that column in the data set, a Python script was implemented, and the gross data was added for the corresponding movies as an additional column with a join operation. The raw data set had 86 columns including “gross”. We extracted 10843 movies in this way.

C.Data Preprocessing

Among the extracted movie data, some movies had the same names. To eliminate the duplications, the movies were distinguished with respect to their movie identification numbers (tconst). After the elimination of the duplications and the removal of redundant columns, we had a data set of 6840 movies. Animation movies were then excluded, so that the remaining data have the type “movie”, and we finally obtained a data set of 6548 movies.

The given information about the movies may vary in the size of the cast or the number of directors. For example, there are some movies with six directors, whereas most movies have one or at most two. The maximum number of directors in a film determines the number of columns that the directors occupy, and the same holds for the stars and writers of the movies. Including movies with many directors, stars or writers means many null fields for movies with only a few of them. For this reason, for the 6548 movies in the final data set, we did not include the columns for directors and stars; including or excluding director and star data had no effect on the performances of the classification algorithms.
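The deduplication step by tconst can be sketched as follows (the rows shown are just two well-known IMDB identifiers used as an example):

```python
def deduplicate(movies):
    """Keep one row per IMDB identifier (tconst), as in the preprocessing step."""
    seen = {}
    for m in movies:
        seen.setdefault(m["tconst"], m)   # first occurrence wins, repeats are dropped
    return list(seen.values())

raw = [{"tconst": "tt0111161", "title": "The Shawshank Redemption"},
       {"tconst": "tt0111161", "title": "The Shawshank Redemption"},
       {"tconst": "tt0120689", "title": "The Green Mile"}]
print(len(deduplicate(raw)))  # 2
```

Keying on the identifier rather than the title is what makes this safe for the movies that share the same name.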

The preprocessed .csv file has 6 columns, which are explained below:

gross: Revenues in movie theaters in the USA.
startYear: Release date of the movie in the USA.
runtimeMinutes: The duration of the movie.
numVotes: Total number of votes on IMDB for a movie.
averageRating: Rating value on IMDB for a movie.
cluster: Rating cluster with values 0-9. This is the only column whose value is calculated by a computer program, to obtain the 10 clusters of the movie ratings. The performance of the machine learning algorithms is measured according to this column.

D. Data Transformation

To prepare the data set for WEKA, a normalization step was needed. The aim of this step was to make the data in all 6 columns have values on the same scale. These are the normalization steps:

 Data set was sorted by the “gross” column and numbers from 1 to 6548 were assigned to the instances according to this sorting.

 The same sorting and enumerating operations were applied to the data set by the “numVotes” column.

 Normalization of the values of an instance is done according to (1).

 Clusters of instances were obtained by a Python script given in Figure 2.

Figure 2: Pseudocode for the data set normalization.

In the machine learning algorithms used, the “numVotes” column has the highest priority; for this reason, the weight of this column in the normalization algorithm has the highest value. Movies released before 2000 may have worse gross values although they have a very high number of votes, which is why we keep the weight of the “gross” column low.
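A rank-based normalization along the lines of the steps above can be sketched as follows. Since equation (1) and the Figure 2 pseudocode are not reproduced here, the combination rule and the weights (0.7 for numVotes, 0.3 for gross) are placeholders chosen only to reflect the stated priority of numVotes over gross, not the paper's actual formula:

```python
def rank_positions(values):
    """Map each value to its 1-based position after sorting (steps 1 and 2 above)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

# Hypothetical weights: numVotes dominates, gross is kept low (see the text above).
W_VOTES, W_GROSS = 0.7, 0.3

def cluster(vote_rank, gross_rank, n):
    """Combine the normalized ranks into a 0-9 cluster label."""
    score = (W_VOTES * vote_rank + W_GROSS * gross_rank) / n   # in (0, 1]
    return min(int(score * 10), 9)

n = 4                                   # four toy movies instead of 6548
votes = [120, 4500, 90000, 310]
gross = [1.2, 30.0, 250.0, 0.4]
vr, gr = rank_positions(votes), rank_positions(gross)
print([cluster(v, g, n) for v, g in zip(vr, gr)])  # [3, 7, 9, 4]
```

Working on rank positions rather than raw values puts both columns on the same 1..n scale, which is the point of the sorting-and-enumerating steps.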


E. The .csv File

At the end of the data transformation step, the .csv file to be used in WEKA is obtained. Figure 3 shows the final status of the .csv file.

Figure 3 - Definition of attributes in .csv file.

IV. EXPERIMENTAL RESULTS

We analyzed the performance of the seven machine learning algorithms using the WEKA environment. The classification operation is performed by choosing the “Classify” tab and the proper algorithm under this tab (weka/classifiers/). As the test option, “Cross Validation” with 10 folds is used. With this option, there is no need to divide the data into separate training and test sets: cross-validation uses the whole data set for learning, holding out a different fold for testing in each round. To obtain the results, all algorithms were used with their default parameters.
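The k-fold split behind this option can be sketched as index bookkeeping (a generic illustration of cross-validation, not WEKA's internal fold assignment, which also stratifies by class):

```python
def kfold_indices(n, k=10):
    """Split instance indices 0..n-1 into k folds; each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        train = [i for i in range(n) if i not in set(test)]
        yield train, test

n = 25                                  # toy data set size
splits = list(kfold_indices(n, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 20 5
```

Every instance is tested exactly once across the k rounds, and the reported accuracy is the aggregate over all rounds, which is why the whole data set contributes to the evaluation.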

In Table 2, the percentage of correctly classified instances for each algorithm is given. According to the results, Random Forest gives the best classification percentage, and the Bagging and LMT algorithms have remarkably successful results close to it. Unexpectedly, MLP gives one of the lowest performance values.

Table 2 – Percentages of Correctly Classified Instances for Machine Learning Algorithms

Algorithm | % Correctly Classified
J48 | 87.9
MLP | 76.1
RF | 92.7
Bagging | 90.2
BayesNet | 73.5
LMT | 90.1
PART | 88.0

The confusion matrix for the results of Random Forest is given in Figure 4. According to the recall and precision values, rating cluster 0 is the best classified cluster, whereas the cluster with the highest-rated movies (cluster 9) has the lowest recall and precision values. Cluster 9 contains fewer movies, which affects its precision and recall percentages.

Figure 4: Confusion matrix for the Random Forest algorithm.

The misclassified instances tend to fall into the neighboring clusters of the expected cluster. For example, the misclassified instances of cluster 8 are spread over cluster 7 and cluster 9, and the number of instances misclassified into cluster 7 (17) is larger than the number misclassified into cluster 9 (3).
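The per-cluster recall and precision discussed here come straight from the rows and columns of the confusion matrix. A sketch with a toy 3-cluster matrix (the numbers are illustrative, not the Figure 4 values):

```python
# Toy confusion matrix: rows = actual cluster, columns = predicted cluster.
cm = [[50,  5,  0],
      [ 4, 40,  6],
      [ 0,  3, 12]]

def recall(cm, c):
    """Share of actual class-c instances that were predicted as c (row-wise)."""
    return cm[c][c] / sum(cm[c])

def precision(cm, c):
    """Share of class-c predictions that were actually c (column-wise)."""
    return cm[c][c] / sum(row[c] for row in cm)

print(round(recall(cm, 2), 2), round(precision(cm, 2), 2))  # 0.8 0.67
```

A small class like cluster 9 has short row and column sums, so a handful of misclassifications moves its recall and precision much more than they would for a large cluster, which matches the observation above.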

Kappa, precision and F-measure values for each machine learning algorithm are given in Figure 5.

Figure 5: Kappa, precision and F-measure graphics.

According to the results, the movies expected to be in cluster 5 tend to be misclassified more often than the other movies. The reason for this is the runtimeMinutes value: there are very long movies expected to be classified in cluster 5, but because of the way the clusters are calculated in our application, they cannot be classified as expected. On the other hand, the runtimeMinutes column has a remarkable effect on determining the cluster of a movie, as happens with “The Green Mile”.
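The kappa and F-measure statistics reported in Figure 5 can be computed from a confusion matrix as follows (toy numbers, shown for a 2x2 matrix to keep the arithmetic visible):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e) for a confusion matrix."""
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(len(cm))) / n          # observed agreement
    p_e = sum(sum(cm[i]) * sum(r[i] for r in cm)             # chance agreement from
              for i in range(len(cm))) / n ** 2              # row/column marginals
    return (p_o - p_e) / (1 - p_e)

cm = [[45, 5],
      [10, 40]]   # purely illustrative counts
print(round(cohen_kappa(cm), 2), round(f_measure(0.9, 0.8), 2))  # 0.7 0.85
```

Kappa corrects raw accuracy for agreement that would occur by chance given the class distribution, which is why it is reported alongside the plain classification percentages.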

MLP is one of the least successful machine learning algorithms, especially in determining the correct clusters of the movies with low rating values. Conversely, MLP is the most successful algorithm at classifying the best movies, those of cluster 9.

Therefore, these results show that machine learning techniques can be applied to movie score prediction effectively.

V. DISCUSSION AND FUTURE WORK

In this study, 6548 movies were used as the data set for an analysis with machine learning algorithms, and the performances of these algorithms were compared to similar studies. First, the data was gathered and preprocessed to obtain normalized data with no null fields. Then, the data set was used for executing the machine learning algorithms in WEKA. BayesNet was the least successful algorithm among the chosen ones. The misclassified instances are mostly distributed to the neighboring clusters, particularly around cluster 5 and cluster 6. Three of the algorithms (Random Forest, Bagging and LMT) achieved the classification of movie ratings with a performance of over 90%. As the results show, machine learning algorithms can be used for movie rating prediction.

As future work, we plan to improve the data set by including more columns like genre and MPAA rating. Oscar awards may also have an effect on the new data set and help obtain more accurate classification results. The next study will explore the genres of the movies and make a prediction on this column. Another future work will be studying animation movies.

REFERENCES

[1] Internet Movie Database. URL: https://www.imdb.com/
[2] M.H. Latif, H. Afzal, “Prediction of movies popularity using machine learning techniques”, International Journal of Computer Science and Network Security, vol. 16, pp. 127-131, August 2016.
[3] C.D. Butler et al., “Predicting movie success using machine learning algorithms”, Proceedings of the Fifteenth LACCEI International Multi-Conference for Engineering, Education Technology, Boca Raton, Florida, USA, July 19-21, 2017.
[4] K. Lee, J. Park, I. Kim, Y. Choi, “Predicting movie success with machine learning techniques: ways to improve accuracy”, Information Systems Frontiers, 2016. https://doi.org/10.1007/s10796-016-9689-z
[5] T. Yu, “On predicting the movie ratings”, Carnegie Mellon University Human-Computer Interaction Institute, 2017.
[6] Nithin V.R. et al., “Predicting Movie Success Based on IMDB Data”, International Journal of Data Mining and Techniques, vol. 3, pp. 365-368, June 2014.
[7] M. Tashman, “The Association Between Film Industry Success and Prior Career History: A Machine Learning Approach”, Master's thesis, Harvard Extension School, 2015.
[8] K. Persson, “Predicting movie ratings: A comparative study on random forests and support vector machines”, Bachelor Degree Project in Informatics, University of Skövde, 2015.
[9] Rotten Tomatoes. URL: https://www.rottentomatoes.com/ Date of access: March 26, 2018.
[10] MovieLens. URL: https://movielens.org/ Date of access: March 26, 2018.
[11] E. Frank, I.H. Witten, Data Mining, Morgan Kaufmann Publishers, 2000.
[12] R. Wirth, J. Hipp, “CRISP-DM: Towards a Standard Process Model for Data Mining”, 2000.
[13] C. Shearer, “The CRISP-DM Model: The New Blueprint for Data Mining”, Journal of Data Warehousing, vol. 5, no. 4, pp. 13-22, 2000.
[14] J.R. Quinlan, “Induction of Decision Trees”, Machine Learning, vol. 1, pp. 81-106, 1986.
[15] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Press, San Francisco, USA, 2005.
[16] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[17] G. Kaur, A. Chhabra, “Improved J48 Classification Algorithm for the Prediction of Diabetes”, International Journal of Computer Applications (0975-8887), vol. 98, no. 22, July 2014.
[18] Deep Learning 0.1 Documentation. URL: http://deeplearning.net/tutorial/mlp.html Date of access: March 20, 2018.
[19] M.-C. Popescu, V.E. Balas, L. Perescu-Popescu, N. Mastorakis, “Multilayer Perceptron and Neural Networks”, WSEAS Transactions on Circuits and Systems, vol. 8, issue 7, pp. 579-588, July 2009.
[20] L. Breiman, “Random Forests”, Machine Learning, pp. 5-32, 2001. doi:10.1023/A:1010933404324
[21] T.K. Ho, “Random Decision Forests”, Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, pp. 278-282, 1995.
[22] F.A. Fontana, M.V. Mäntylä, M. Zanoni, A. Marino, “Comparing and Experimenting Machine Learning Techniques for Code Smell Detection”, Empirical Software Engineering, vol. 21, pp. 1143-1191, 2016.
[23] L. Breiman, “Bagging Predictors”, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[24] P. Bühlmann, B. Yu, “Analyzing bagging”, Annals of Statistics, vol. 30, pp. 927-961, 2002.
[25] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, CA, 1988.
[26] I. Ben-Gal, “Bayesian Networks”, in F. Ruggeri, F. Faltin, R. Kenett (eds.), Encyclopedia of Statistics in Quality & Reliability, Wiley & Sons, 2007.
[27] N. Friedman, D. Geiger, M. Goldszmidt, “Bayesian network classifiers”, Machine Learning, vol. 29, pp. 131-163, 1997.
[28] N. Landwehr, M. Hall, E. Frank, Logistic Model Trees, Kluwer Academic Publishers, 2006.
[29] N. Landwehr, M. Hall, E. Frank, “Logistic Model Trees”, Machine Learning, vol. 59, p. 161, 2005.
[30] S. Vijayarani, M. Divya, “An Efficient Algorithm for Classification Rule Hiding”, International Journal of Computer Applications (0975-8887), vol. 33, no. 3, pp. 39-45, November 2011.
[31] V.S. Parsania, N.N. Jani, N.H. Bhalodiya, “Applying Naïve Bayes, BayesNet, PART, JRip and OneR Algorithms on Hypothyroid Database for Comparative Analysis”, International Journal of Darshan Institute on Engineering Research & Emerging Technologies, vol. 3, no. 1, 2014.
[32] X. Wu et al., “Top 10 Algorithms in Data Mining”, Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[33] Meenakshi, Geetika, “Survey on Classification Methods using WEKA”, International Journal of Computer Applications, vol. 86, no. 18, pp. 16-19.
