
Movies Data Analytics using K-means Clustering

February 18, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 18 years of experience, he has held prominent positions at organizations such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

The movie data analytics project seeks to identify commonalities between groups of viewers in order to build a system that suggests movies to them, one of the most widely studied problems in movie data analysis. To investigate the traits that people share in their movie tastes, based on how they rate films, we will examine the MovieLens movie-ratings dataset. The dataset consists of two files; we will import and use both. We use the k-means clustering technique to build a movie recommendation system, offering viewers recommendations that are more relevant to them based on their prior viewing habits. To focus on the movies that viewers like the most, we include only data from users who have given movies a rating of 4 or higher. Python and its libraries NumPy, Pandas, Matplotlib, and Scikit-Learn are used throughout, and the reader is assumed to be familiar with them.

Using bots to gather information and material from a website is known as web scraping. Unlike screen scraping, which copies only the pixels displayed onscreen, web scraping collects the underlying HTML code and, with it, the data stored in a database. The scraper can then replicate a whole website's content elsewhere. Python has several libraries that support web scraping for movie data analysis, and Beautiful Soup is one of them. The data used here relates to movies and ratings; it was collected through web scraping and taken from an existing database, MovieLens.
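As a rough illustration of the idea, the sketch below parses a small inline HTML string with Beautiful Soup instead of fetching a live page. The markup, class names, and movie titles are made up for the example and are not the page the article's data came from.

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page (hypothetical markup).
html = """
<ul>
  <li class="movie"><span class="title">Movie A</span><span class="rating">4.5</span></li>
  <li class="movie"><span class="title">Movie B</span><span class="rating">3.0</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk every movie entry and pull out the text of its title and rating.
records = []
for item in soup.find_all("li", class_="movie"):
    title = item.find("span", class_="title").get_text(strip=True)
    rating = float(item.find("span", class_="rating").get_text(strip=True))
    records.append({"title": title, "rating": rating})

print(records)
```

On a real site the HTML would first be downloaded, and the tag and class names would have to match that site's markup.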

The dataset contains the movie ID (movieid), the title of the movie, its genre, and rating details.
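To make the structure concrete, here is a minimal sketch that uses two tiny in-memory tables in place of the two dataset files. The column names (`movieId`, `title`, `genres`, `userId`, `rating`) and all values are assumptions chosen for illustration; the real files would normally be loaded with `pd.read_csv`.

```python
import pandas as pd

# Tiny in-memory stand-ins for the two files (hypothetical values).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Romance|Drama", "Sci-Fi|Action", "Action"],
})
ratings = pd.DataFrame({
    "userId": [10, 10, 11, 11, 12],
    "movieId": [1, 2, 2, 3, 1],
    "rating": [4.5, 3.0, 5.0, 4.0, 2.5],
})

# Attach title and genre information to every rating.
rated = ratings.merge(movies, on="movieId", how="inner")

# Keep only ratings of 4 or higher.
liked = rated[rated["rating"] >= 4.0]

print(liked[["userId", "title", "rating"]])
```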


To study users' favorite genres, we take a group of users and compute each user's average rating across all science fiction films and across all romance films.
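One way to compute these per-user averages with pandas is sketched below. The helper `genre_average`, the column names, and the sample values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical sample data; genres are stored as pipe-separated strings.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "genres": ["Romance", "Sci-Fi", "Romance|Comedy", "Sci-Fi|Action"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "movieId": [1, 2, 3, 1, 2, 4],
    "rating": [5.0, 2.0, 4.0, 1.0, 4.0, 5.0],
})

def genre_average(ratings, movies, genre):
    """Mean rating per user over movies whose genre string contains `genre`."""
    ids = movies.loc[movies["genres"].str.contains(genre, regex=False), "movieId"]
    subset = ratings[ratings["movieId"].isin(ids)]
    return subset.groupby("userId")["rating"].mean()

# One row per user, one column per genre average.
genre_ratings = pd.DataFrame({
    "avg_romance_rating": genre_average(ratings, movies, "Romance"),
    "avg_scifi_rating": genre_average(ratings, movies, "Sci-Fi"),
})

print(genre_ratings)
```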

The full dataset contains 25,000,095 ratings of 62,423 movies.

We then restrict the data to users who are interested in either romance or science fiction movies. Grouping these users gives a subset that can be used for further clustering analysis.

The derived subset contains 302 records.
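The article does not spell out the exact filtering rule, so the following is only a sketch of one plausible approach: keep users who rate one genre high and the other low. The thresholds `HIGH` and `LOW` and the sample averages are assumptions.

```python
import pandas as pd

# Hypothetical per-user genre averages.
genre_ratings = pd.DataFrame(
    {
        "avg_romance_rating": [4.5, 1.0, 3.0, 2.0, 4.8],
        "avg_scifi_rating": [2.0, 4.5, 3.1, 4.2, 4.7],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

HIGH, LOW = 3.5, 2.5  # assumed thresholds, not taken from the article

# A user is kept when they clearly prefer one genre over the other.
likes_romance = (genre_ratings["avg_romance_rating"] >= HIGH) & (
    genre_ratings["avg_scifi_rating"] <= LOW
)
likes_scifi = (genre_ratings["avg_scifi_rating"] >= HIGH) & (
    genre_ratings["avg_romance_rating"] <= LOW
)

biased = genre_ratings[likes_romance | likes_scifi]
print(len(biased), "users kept")
```

Users who like both genres equally, or neither, drop out, which is what makes the resulting subset deliberately biased.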

We can see that there are 183 records in total, each with an average rating for science fiction films and for romance films. We will now visualize the data to get a clear picture of the biased dataset and its features.
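A scatter plot with one dot per user is a simple way to look at such a subset. The sketch below uses synthetic averages rather than the article's records, and the choice of which genre goes on which axis is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic per-user averages (hypothetical data, not the article's subset).
rng = np.random.default_rng(0)
avg_scifi = rng.uniform(0.5, 5.0, size=50)
avg_romance = rng.uniform(0.5, 5.0, size=50)

fig, ax = plt.subplots(figsize=(6, 6))
points = ax.scatter(avg_scifi, avg_romance, s=30)
ax.set_xlim(0, 5)
ax.set_ylim(0, 5)
ax.set_xlabel("Average sci-fi rating")
ax.set_ylabel("Average romance rating")
ax.set_title("One dot per user")
```

In an interactive session, `plt.show()` displays the figure.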


K-means clustering is an unsupervised learning algorithm that divides an unlabeled dataset into clusters. Here, K specifies the number of clusters to be produced and is chosen in advance: if K=2 there will be two clusters, if K=3 there will be three, and so on. It provides a practical way to discover groups in unlabeled data automatically, without any labeled training examples. The technique is centroid-based, so each cluster has a centroid associated with it. The algorithm's primary goal is to minimize the sum of squared distances between each data point and the centroid of its cluster. The bias that we introduced earlier is now clearly visible in the data. We take this further by using K-means to divide the sample into two groups.
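Here is a minimal sketch of K-means with K=2 using scikit-learn. The data is synthetic: two made-up groups of users, one favoring romance and one favoring sci-fi. The last lines check that the model's `inertia_` is the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic users; columns are [avg sci-fi rating, avg romance rating].
rng = np.random.default_rng(42)
romance_fans = rng.normal(loc=[1.5, 4.5], scale=0.3, size=(30, 2))
scifi_fans = rng.normal(loc=[4.5, 1.5], scale=0.3, size=(30, 2))
X = np.vstack([romance_fans, scifi_fans])

# Fit K-means with two clusters and get a cluster label for every user.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Recompute the objective by hand: squared distance of each point
# to the centroid of the cluster it was assigned to, summed up.
assigned_centroids = kmeans.cluster_centers_[labels]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print("centroids:\n", kmeans.cluster_centers_)
print("inertia:", kmeans.inertia_, "manual:", manual_inertia)
```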

With two clusters, the grouping is driven by how each individual rated romance films: one group consists of people who gave romance movies an average rating of 3 or higher, and the other of those who gave them an average rating below 3. The dataset can then be divided into three groups by setting K=3, and clustering is performed on that basis. Similarly, we can build 4 clusters, with the data segmented differently for each choice of K.
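To see how the segmentation changes with K, the sketch below fits K-means for K = 2, 3, and 4 on the same synthetic data and reports how many users fall into each cluster. The four blob centers are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic users drawn around four made-up taste profiles.
rng = np.random.default_rng(5)
blob_centers = [[1.0, 1.0], [1.0, 4.0], [4.0, 1.0], [4.0, 4.0]]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(25, 2)) for c in blob_centers])

# Fit one model per K and record the size of every cluster.
sizes = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sizes[k] = np.bincount(labels, minlength=k)
    print(f"K={k}: cluster sizes = {sizes[k].tolist()}")
```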

The number of clusters, k, is a parameter required by some clustering algorithms, including K-means, and determining a suitable value is a crucial part of the analysis. If k is set too high, individual points begin to form clusters of their own; if k is set too low, dissimilar data points are grouped together. Choosing k well gives the clustering the right level of granularity. The elbow method is used to estimate how many clusters a dataset should be divided into. It works by plotting the K values in ascending order against the total error obtained with each K; the point where the curve bends and the error stops dropping sharply suggests a suitable K.
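The elbow method can be sketched as follows, using scikit-learn's `inertia_` (sum of squared distances to the nearest centroid) as the total error. The data is synthetic, with three made-up groups, so the bend is expected around K=3.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (hypothetical).
rng = np.random.default_rng(1)
blob_centers = [[1.0, 1.0], [4.0, 4.0], [1.0, 4.0]]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in blob_centers])

# Fit K-means for each K in ascending order and record the total error.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, err in zip(k_values, inertias):
    print(f"K={k}: total squared error = {err:.1f}")

# Plot K against the error; the "elbow" is where the curve flattens.
fig, ax = plt.subplots()
ax.plot(k_values, inertias, marker="o")
ax.set_xlabel("K")
ax.set_ylabel("Total squared error (inertia)")
```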


The silhouette coefficient, also known as the silhouette score, is a metric for assessing the quality of a clustering. Its value lies between -1 and 1. A value near 1 indicates that the clusters are well separated from one another; a value near 0 indicates that the clusters overlap, so the distance between them is not significant; and a value near -1 indicates that points have been assigned to the wrong clusters.
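The sketch below computes the silhouette score with scikit-learn's `silhouette_score` for a clean two-cluster fit on synthetic data, and then again after randomly shuffling the labels to mimic a poor assignment. The data and the shuffling step are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic groups of users (hypothetical data).
rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(loc=[1.5, 4.5], scale=0.3, size=(30, 2)),
    rng.normal(loc=[4.5, 1.5], scale=0.3, size=(30, 2)),
])

# Score the clustering found by K-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# Shuffle the labels to simulate points placed in the wrong clusters.
shuffled = rng.permutation(labels)
bad_score = silhouette_score(X, shuffled)

print(f"K-means labels:  {score:.2f}")
print(f"shuffled labels: {bad_score:.2f}")
```

The first score should be close to 1, while the shuffled labels should score far lower.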

According to the plot, the best choices of K are 7, 22, 27, and 31. Increasing the number of clusters beyond that range produces the poorest clusters, as measured by the silhouette score. We choose K = 7 because it produces the best result and is the easiest to visualize.
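A selection like this can be made by computing the silhouette score for a range of K values and comparing them. The sketch below does so on synthetic data with three made-up groups, so the winning K here differs from the article's; the range of K tried is also an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical).
rng = np.random.default_rng(1)
blob_centers = [[1.0, 1.0], [4.0, 4.0], [1.0, 4.0]]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in blob_centers])

# Silhouette score needs at least two clusters, so start at K=2.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette = {scores[k]:.3f}")

# Pick the K with the highest score.
best_k = max(scores, key=scores.get)
print("best K:", best_k)
```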

So far we have examined only romance and science fiction films. Let's extend the analysis to other genres by including action movies.


After adding action movies, we select the dataset again and reapply k-means clustering. This time, we evaluate the clusters using the silhouette score, assess the error values, and ultimately use the result to make recommendations.

In this plot, the x and y axes still show the romance and sci-fi ratings. The size of each dot shows the rating for action movies (the bigger the dot, the higher the action rating). We can observe that the clustering changes drastically when the action genre is included. The more information we incorporate into our k-means model, the more similar the preferences within each group become. The drawback of this way of charting is that we lose the ability to visualize the data properly once three or more dimensions are being analyzed. So, in the part that follows, we'll look at additional plotting techniques for accurately visualizing clusters with up to five dimensions.
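A plot of this kind can be sketched as follows: K-means is fitted on three features, two of them go on the axes, and the third controls the dot size. The data is synthetic, and the column order, axis assignment, and size scaling factor are assumptions; K=7 is reused from the earlier choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic per-user averages; assumed column order: romance, sci-fi, action.
rng = np.random.default_rng(3)
X = rng.uniform(0.5, 5.0, size=(200, 3))

# Cluster on all three genre averages.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Two genres on the axes, action rating as dot size, cluster as color.
fig, ax = plt.subplots(figsize=(6, 6))
points = ax.scatter(X[:, 0], X[:, 1], s=X[:, 2] * 20, c=labels, cmap="tab10")
ax.set_xlabel("Average romance rating")
ax.set_ylabel("Average sci-fi rating")
ax.set_title("Dot size = average action rating")
```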