
Data Science in R

An Analysis of the Popularity of Songs on Spotify and YouTube


Course: Data Science in R (Spring 2020)

Project Link: 

* Note that our exploratory data analysis is divided into two parts: EDA and EDA2.


  • For our project, the main questions and hypotheses we explore with the Spotify dataset center on music and artist trends. Spotify has become one of the most popular music streaming services, offering a wide variety of music from artists across many genres. With its millions of users, the platform generates an enormous amount of data on music and artist trends. To analyze these popularity patterns in the music industry, we worked with a main dataset titled “Spotify- All Time Top 2000’s Mega Dataset”, which covers songs from 1956 to 2019 and which we named spotify. For each song, the dataset also provides characteristics such as the tempo and a loudness score.

  • The other dataset we used is the “Trending YouTube Video Statistics” dataset (saved as USvideos.csv; the cleaned version is loaded as “youtube_music”), which contains the top trending videos on the platform and lets us analyze music video and artist popularity. We wanted to explore whether the trends we see in the Spotify dataset also appear in the YouTube dataset, so we compared the popularity of artists between Spotify and YouTube.

Interactive Components: 

  • For our interactive component, we made two Shiny apps for two plots on the EDA 1 page. The first one is for the plot “Popularity Vs. Danceability”, which we made interactive using Shiny’s “brush” selection. If you drag to select a group of dots in the plot, the app shows the range of danceability and popularity covered by the selected dots. It also shows the other characteristics of each selected dot (each dot is one song), including energy, acousticness, beats per minute, etc. This makes it possible to see how the characteristics we measure vary with popularity.

  • The second one is for the last plot on our EDA 1 page, “Predicted Vs. Actual”. Clicking a dot in the graph shows the corresponding predicted and actual scores, as well as the difference between the two, which is the residual. This helps show how well our model predicts popularity: the smaller the residual, the closer the predicted score was to the actual observed score. The residuals and the graph suggest that our model is on the right track toward explaining the popularity score of a song.
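The two interactions described above can be sketched in a single small Shiny app. This is only an illustration under stated assumptions, not the project's code: the data are simulated, the column names (danceability, energy, bpm, popularity) and the linear model are placeholders, and the input and output IDs are made up. It relies on Shiny's brushedPoints() for the brush selection and nearPoints() for the click.

```r
library(shiny)

# Simulated stand-in for the spotify data; column names are assumptions.
set.seed(1)
songs <- data.frame(
  danceability = round(runif(200, 10, 95)),
  energy       = round(runif(200, 5, 100)),
  bpm          = round(runif(200, 60, 200))
)
songs$popularity <- round(pmin(100, pmax(
  0, 30 + 0.3 * songs$danceability + rnorm(200, sd = 12)
)))

# Placeholder model; the project's actual model is not reproduced here.
fit <- lm(popularity ~ danceability + energy + bpm, data = songs)
songs$predicted <- as.numeric(fitted(fit))
songs$residual  <- songs$popularity - songs$predicted  # actual minus predicted

# Range of danceability and popularity across a set of selected rows.
selection_ranges <- function(sel) {
  if (nrow(sel) == 0) return(NULL)
  data.frame(
    variable = c("danceability", "popularity"),
    min = c(min(sel$danceability), min(sel$popularity)),
    max = c(max(sel$danceability), max(sel$popularity))
  )
}

ui <- fluidPage(
  plotOutput("dance_plot", brush = "dance_brush"),  # drag to select points
  tableOutput("brush_ranges"),
  tableOutput("brush_rows"),
  plotOutput("fit_plot", click = "fit_click"),      # click a single point
  tableOutput("click_row")
)

server <- function(input, output, session) {
  output$dance_plot <- renderPlot({
    plot(songs$danceability, songs$popularity,
         xlab = "Danceability", ylab = "Popularity")
  })

  # Rows inside the brushed rectangle (zero rows when nothing is selected).
  brushed <- reactive({
    brushedPoints(songs, input$dance_brush,
                  xvar = "danceability", yvar = "popularity")
  })
  output$brush_ranges <- renderTable(selection_ranges(brushed()))
  output$brush_rows   <- renderTable(brushed())

  output$fit_plot <- renderPlot({
    plot(songs$predicted, songs$popularity,
         xlab = "Predicted", ylab = "Actual")
    abline(0, 1, lty = 2)  # points on this line have a residual of zero
  })

  # Closest point to the click, with its predicted, actual, and residual.
  output$click_row <- renderTable({
    nearPoints(songs, input$fit_click,
               xvar = "predicted", yvar = "popularity",
               maxpoints = 1)[, c("predicted", "popularity", "residual")]
  })
}

app <- shinyApp(ui, server)
# Launch from an interactive session with: runApp(app)
```

Keeping the range summary in a plain function (selection_ranges) means it can be checked without launching the app.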


  • In our plots, the variables in the Spotify dataset don’t have an extremely strong relationship with popularity. The parts that do follow a relationship are captured by our model, while the variability our variables can’t explain may be due to factors not included in our spotify dataset.

  • For EDA2, we therefore looked deeper into the relationship between artists and popularity, and examined the popularity of songs on YouTube to see whether it correlates with their popularity on Spotify. We found that artists who are popular on Spotify are not necessarily popular on YouTube: none of Spotify’s top 5 artists overlapped with YouTube’s top 5 music artists. Other factors may therefore account for the discrepancy in artist popularity between Spotify and YouTube users, such as marketing strategies that drive users to watch an artist’s music videos, or how many times a song can be listened to before it gets boring.

  • To explore this topic further in the future, it would be important to also look at other music streaming services similar to Spotify, such as Pandora, iTunes music, and even Soundcloud.
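The top-5 comparison between Spotify and YouTube described above can be sketched as follows. This is a minimal illustration, not the project's code: the rows and artist names are made up, the column names (artist, popularity, views) are assumptions, and ranking by mean popularity on Spotify and by total views on YouTube is just one possible definition of "top". The same helper could be pointed at a third service's data in the same way.

```r
# Made-up toy rows; artist names, scores, and column names are placeholders.
spotify_toy <- data.frame(
  artist     = c("Artist A", "Artist A", "Artist B", "Artist C",
                 "Artist D", "Artist E", "Artist F"),
  popularity = c(80, 90, 70, 60, 50, 40, 30)
)
youtube_toy <- data.frame(
  artist = c("Artist F", "Artist F", "Artist G", "Artist H",
             "Artist I", "Artist J", "Artist A"),
  views  = c(900, 100, 800, 700, 600, 500, 400)
)

# Rank artists by an aggregate of one score column and keep the top n names.
top_artists <- function(df, score_col, agg, n = 5) {
  totals <- vapply(split(df[[score_col]], df$artist), agg, numeric(1))
  head(names(sort(totals, decreasing = TRUE)), n)
}

spotify_top <- top_artists(spotify_toy, "popularity", mean)
youtube_top <- top_artists(youtube_toy, "views", sum)

# Artists that appear in both top-5 lists (empty when there is no overlap).
overlap <- intersect(spotify_top, youtube_top)
print(overlap)
```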
