Things You Should Know | Set up Google Colab and Load Data

We said it all starts with data. A recommender system cannot exist without it.

Let me tell you the story of how I figured this out for myself, beyond just hearing about the importance of data. Spotify once recommended a song to me at 9:05 pm. Ever since, I have been listening to it on repeat. I was mind-blown. My friends had never recommended songs that I liked this much. So I decided to examine the data about the song. The genre was indie, it was about 3.5 minutes long, and it was released in 2014.

Shortly after, I realized that this data about the song was not telling me much, except that my favorite genre is indie, so the song being indie checked out.

From there, I decided to request a copy of my Spotify data to get more insight into my patterns. I examined many other songs I had liked on Spotify. To my surprise, most of them were between 3 and 4.5 minutes long, and many of them were released between 2010 and 2015. All of a sudden, the recommendation started making sense. I also found out that at least 3 times a week, I start a listening session at around 9 pm (when I saw this, I immediately remembered that this is my typical workout time).

It took me about 2 weeks to get my data and about a day of analysis to find all the trends. Spotify’s recommender system, on the other hand, does this continuously for millions of users.

But eventually, I went from mind-blown to understanding that recommenders only exist because they can collect and store so much data. We will not collect any data for the recommender now (although, wouldn’t it be fun?), but we will use a movie dataset.

Before we move further, open Google Colab (https://colab.research.google.com). This is your personal coding notebook, where you will be able to run your code.

It’s free and ready to use! We will use Python, which Google Colab supports out of the box.

If you feel unsure about your Python skills or would like a refresher, we have created a document to get you up to speed on the basics needed to complete the recommender. Now let’s start with your first lines of code.

As is the case for many data scientists, our first line imports the libraries we will need. In this case, to get data into our notebook, we need to import pandas.

import pandas 

Now that we have pandas up and running, let’s load the data. We will use two datasets for our recommender. As this recommender will suggest movies, we need a list of movies and their ratings.

Movies

Ratings

You do not need to download anything. Simply go back to your Google Colab notebook and import both datasets using the GitHub links above.

movies = pandas.read_csv('https://github.com/aptitude-learn/recommender-system-launchpad/raw/main/movies.csv')

ratings = pandas.read_csv('https://github.com/aptitude-learn/recommender-system-launchpad/raw/main/ratings.csv')
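If you are curious what read_csv does with the text it receives, here is a minimal sketch that runs without any network access. The CSV text, the title column, and the rows are made up for illustration; only the movieId column name comes from this lesson.

```python
import io
import pandas

# Hypothetical CSV text standing in for a remote file; the 'title'
# column and both rows are illustrative assumptions.
csv_text = "movieId,title\n1,Movie A\n2,Movie B\n"

# read_csv accepts a URL, a local path, or a file-like object.
# The first line of the text becomes the column names.
sample = pandas.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)  # the type pandas inferred for each column
```

The two read_csv calls above work the same way, except that pandas fetches the text from the URL first.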

Now that we have both datasets in the notebook, I suggest you take a look at them. You can do that with pandas DataFrame methods such as head() or tail().

There are plenty of other ways to examine datasets, which you can find in the pandas documentation.
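As a minimal sketch of that first look, the snippet below uses a tiny stand-in table so you can see what each call returns. The name sample_movies, the title and genres columns, and every row are assumptions made up for illustration; in your notebook you would call the same methods on movies and ratings.

```python
import pandas

# Hypothetical stand-in for the movies dataset; the 'title' and
# 'genres' columns and all rows are illustrative assumptions.
sample_movies = pandas.DataFrame({
    'movieId': [1, 2, 3, 4, 5, 6],
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E', 'Movie F'],
    'genres': ['Comedy', 'Drama', 'Action', 'Comedy', 'Horror', 'Drama'],
})

first_rows = sample_movies.head()     # first 5 rows by default
last_rows = sample_movies.tail(2)     # last 2 rows
n_rows, n_cols = sample_movies.shape  # (number of rows, number of columns)

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
sample_movies.info()  # column names, dtypes, and non-null counts
```

head() and tail() show actual rows, while shape and info() summarize the table as a whole, so it can help to run both kinds of check.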

Next, we will combine these two datasets into one, again using pandas. If you have examined them closely, you will have noticed that both datasets have one thing in common: the movie ID.

As such, we will merge them using the movie ID.

data = pandas.merge(movies, ratings, on='movieId')
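To see what the merge does to individual rows, here is a minimal sketch on two tiny stand-in tables. The names sample_movies and sample_ratings, the title, userId, and rating columns, and all values are assumptions made up for illustration; only movieId comes from this lesson.

```python
import pandas

# Hypothetical stand-ins for the two datasets; only 'movieId' is
# confirmed by the lesson, everything else is illustrative.
sample_movies = pandas.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
sample_ratings = pandas.DataFrame({
    'userId': [10, 10, 11, 12],
    'movieId': [1, 2, 1, 4],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# By default, merge performs an inner join: a row is kept only when
# its movieId appears in both tables.
sample_data = pandas.merge(sample_movies, sample_ratings, on='movieId')
print(sample_data)
```

In this sketch, each rating row picks up the title of its movie. Movie C has no ratings, and the rating for movieId 4 has no matching movie, so neither appears in the result.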

Now take another look at the data. Get a sense of what your new data looks like.

It should look something like this:
