Will I Ever Make It in the Music Industry?
- Ronald Daley

- Oct 20, 2019
Updated: Nov 20, 2019
Like most people, I love music, and I especially love discovering new music. In high school, my friends and I started a rap group and made music after school. The group didn't last long because we all went our separate ways after graduation. I had so much fun making music that I didn't want to stop, so I decided to go "solo dolo" and become an independent artist.
I spent countless hours in the studio reviewing every small detail of my vocal recordings. I wanted to "optimize" my sound so the song not only delivered the intended message (usually about some college party or heartbreak haha), but also had, as my brother often says, "repeat value". Meaning, is the song worth listening to over and over? I just knew that if my song was a hit with other folks and had that "repeat value", I was guaranteed to sign a record deal, make millions, and buy my family a new house in Jamaica. Unfortunately, my music career never "took off", so instead I decided to focus on Engineering and become a Data Scientist.

My dream of rocking a Lollapalooza stage alongside Kid Cudi, with him doing his iconic humming while I'm spitting bars, was short-lived. Ever since I parted ways with the music industry, I have always wondered whether there was a quantitative way to predict the success of a song, and which audio characteristics matter most for hit songs on the Billboard charts in the United States.
That curiosity led to what I decided to explore for my third project at the Metis Data Science Bootcamp. My goal for this project was to create a model that, given a set of audio features, predicts a song's peak position on the Billboard Hot 100 chart. In addition to building the model, I used the audio features from Spotify's Web API to identify which ones have the most significant impact on a song's peak position.
Alright, let's go do Data Scientist things!!

Data
There were two major sources for the data. The first source was the Billboard Hot 100 weekly chart. I used the Beautiful Soup library in Python to scrape 4 weeks of data from Billboard.com. The Billboard Hot 100 is the music industry's standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales (physical and digital), radio play, and online streaming in the United States.
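As a rough illustration of the scraping step, here is a minimal Beautiful Soup sketch. The HTML snippet, the CSS class names (chart-row, rank, title, artist), and the parse_chart helper are all hypothetical stand-ins, not Billboard.com's real markup or the code from this project. The real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only. This is NOT Billboard.com's
# actual HTML; the class names below are made up.
SAMPLE_HTML = """
<ol class="chart">
  <li class="chart-row">
    <span class="rank">1</span>
    <span class="title">Song A</span>
    <span class="artist">Artist X</span>
  </li>
  <li class="chart-row">
    <span class="rank">2</span>
    <span class="title">Song B</span>
    <span class="artist">Artist Y</span>
  </li>
</ol>
"""


def parse_chart(html):
    """Turn one chart page into a list of {rank, title, artist} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select("li.chart-row"):
        rows.append({
            "rank": int(row.select_one(".rank").get_text(strip=True)),
            "title": row.select_one(".title").get_text(strip=True),
            "artist": row.select_one(".artist").get_text(strip=True),
        })
    return rows


chart = parse_chart(SAMPLE_HTML)
for entry in chart:
    print(entry["rank"], entry["title"], "by", entry["artist"])
```

Running the same parser over each week's page and stacking the results would give one table of weekly chart entries.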
The second source was audio features from Spotify's Web API. Spotify has an API endpoint called get_audio_features. The endpoint allows you to get song features like loudness, instrumentalness (how likely a track is to contain no vocals), energy, liveness (the presence of a live audience), speechiness, song duration, etc. Additionally, I used Python's spotipy library to connect to Spotify's Web API and bring in more granular details about each song.
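Pulling features for a whole chart's worth of songs can be sketched like this. The fetch_audio_features helper and the FakeClient class are my own illustrative names, not part of spotipy. The only library call assumed is the client's audio_features method, which in spotipy takes a list of up to 100 track IDs and returns one dictionary per track (entries may be None for tracks it cannot find). The fake client is there so the sketch runs without credentials.

```python
def fetch_audio_features(client, track_ids, batch_size=100):
    """Collect audio features for many tracks, one batch of IDs at a time."""
    features = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        # Skip entries that come back as None (tracks the API cannot find).
        features.extend(f for f in client.audio_features(batch) if f is not None)
    return features


class FakeClient:
    """Offline stand-in for a spotipy.Spotify client (illustration only)."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        # Made-up feature values; a real client returns Spotify's numbers.
        return [{"id": t, "tempo": 120.0, "valence": 0.5} for t in tracks]


fake = FakeClient()
rows = fetch_audio_features(fake, [f"track{i}" for i in range(250)])
print(len(rows), "tracks fetched in", fake.calls, "requests")
```

With real credentials, the fake client would be swapped for a spotipy.Spotify instance, for example one built with spotipy's SpotifyClientCredentials auth manager.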
Tools
All of my analysis for this project was done in Python. Data collection and exploratory data analysis were done with the pandas, Beautiful Soup, and spotipy packages. Data visualization was done with the matplotlib and seaborn packages. All models were built and scored with the sklearn package.
Process
As I began my analysis, I used lasso regression on my original dataset to quickly eliminate the non-significant features. It reduced the features from 20 to 9. I then performed cross-validation by fitting a series of regression models to my training data and identifying the model that performed best on my validation data without overfitting when scored on the testing data. My linear regression model did the best job of minimizing the errors when predicting peak rank. To quantify how good the model really was, I computed its root mean squared error (RMSE): its predictions are typically off by about 12 chart positions. For my purposes, that is good enough to move forward with the model.
Once I identified the best model, I wanted to do a bit more "digging" into the numbers behind the important audio features of the top songs. Since I had already identified the most significant audio characteristics for predicting peak rank, I decided to look at the averages of those characteristics for songs in the top 25% of the Billboard Hot 100. Those numbers gave me a better idea of how top songs are being created and which audio characteristics I should target for my next song if I ever want to make a second run at the music industry.
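The lasso, linear regression, and RMSE steps can be sketched with sklearn as follows. This is a minimal sketch on synthetic data, not my actual dataset or code: the 400 fake "songs", the three features that truly matter, and the train/test split settings are all assumptions chosen so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real Billboard/Spotify dataset):
# 400 "songs" with 20 features, of which only the first 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = 50 + 10 * X[:, 0] - 8 * X[:, 1] + 6 * X[:, 2] + rng.normal(scale=5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 1: lasso with a cross-validated penalty. Features are standardized
# first so the penalty treats every feature on the same scale.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X_train, y_train)
kept = np.flatnonzero(lasso.named_steps["lassocv"].coef_)

# Step 2: plain linear regression on the features the lasso kept.
linreg = LinearRegression().fit(X_train[:, kept], y_train)

# Step 3: RMSE on held-out data, in the same units as the target.
predictions = linreg.predict(X_test[:, kept])
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))

print("features kept:", kept.tolist())
print("test RMSE:", round(rmse, 2))
```

Because RMSE is in the same units as the target, an RMSE of 12 on peak position reads as "typically off by about 12 chart positions".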
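The "top 25%" averages can be sketched with pandas. The numbers below are made up for illustration, the column names are hypothetical, and I am assuming here that "top 25%" means a peak position of 25 or better on the Hot 100.

```python
import pandas as pd

# Made-up toy data for illustration; these are not real chart results.
songs = pd.DataFrame({
    "peak_position": [1, 5, 12, 20, 33, 47, 60, 78, 85, 99],
    "valence": [0.8, 0.7, 0.6, 0.9, 0.4, 0.5, 0.3, 0.4, 0.2, 0.3],
    "tempo": [120, 128, 100, 140, 90, 95, 110, 85, 80, 100],
    "speechiness": [0.20, 0.10, 0.30, 0.20, 0.05, 0.04, 0.06, 0.03, 0.05, 0.04],
})

# Assumption: top 25% of a 100-position chart = peak position 25 or better.
top_quarter = songs[songs["peak_position"] <= 25]

# Average of each audio feature among those top songs.
feature_means = top_quarter[["valence", "tempo", "speechiness"]].mean()
print(feature_means)
```

Comparing feature_means against the same averages for the rest of the chart is a quick way to see which characteristics separate the top songs.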
Here's a link to my GitHub page with my Python code and presentation.
Results
In the end, I was able to create a linear regression model that predicts a song's peak position from its most significant audio features. I also discovered that happier songs with a higher tempo and more lyrics (dance, EDM, rap, etc.) tend to peak higher on the Billboard Hot 100.
Thanks to this project, I have learned a lot about the "inside secrets" of the music industry, but I am loving Data Science too much to go back to making music full-time. :)
RAD Guy

