What Makes a Hit Song? · Harish Yerra

Have you ever wondered what qualities make a song more likely to be a hit? I was curious to see if my own musical tastes aligned with the data, so I decided to use a little bit of data science to help us discover what makes a song popular. This project will use publicly available Spotify data, perform exploratory data analysis, and then model the data using traditional machine learning methods to discover what audio features distinguish hit songs from non-hit songs.

Datasets

Spotify Dataset (1921-2020)

This dataset contains audio features of 600K+ songs released between 1921 and 2020. We intentionally avoid using artist information during the modeling phase, because we want to learn what audio features make a song popular.

Feature	Type	Range	Description
`name`	`string`	N/A	The name of the song
`artists`	`string`	N/A	The artist(s) who produced the song
`duration_ms`	`int`	$[0, \infty)$	The length of the song in milliseconds
`explicit`	`bool`	$\{0, 1\}$	Whether the song contained explicit content
`danceability`	`float`	$[0, 1]$	How suitable a track is for dancing based on features like beat strength, tempo, rhythm stability, and overall regularity
`energy`	`float`	$[0, 1]$	A perceptual measure of intensity and activity
`loudness`	`float`	$(-\infty, 0]$ dB	The overall loudness of a track in decibels
`mode`	`int`	$\{0, 1\}$	Indicates the modality, major (1) or minor (0), of a track. Major and minor are different scale structures loosely associated with brighter or darker emotional qualities, respectively
`speechiness`	`float`	$[0, 1]$	Detects the presence of spoken words in a track. Values above 0.66 likely correspond to tracks made up entirely of spoken words. Values between 0.33-0.66 contain both music and speech (like rap). Values less than 0.33 most likely represent music
`acousticness`	`float`	$[0, 1]$	Confidence measure of whether the track is acoustic. 1.0 represents high confidence the track is acoustic
`instrumentalness`	`float`	$[0, 1]$	Confidence measure of whether the track contains no vocals. Values closer to 1.0 represent higher confidence that the track is instrumental
`valence`	`float`	$[0, 1]$	Describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive while tracks with low valence sound more negative
`tempo`	`float`	$[0, \infty)$	The overall speed or pace of a given track, measured in beats per minute

Billboard Top 100 (1958-2024)

This dataset contains tracks that made it to the top 100 between 1958 and 2024, along with how long each track stayed on the chart and the peak position it attained. It also contains metadata that identifies each track, so that we can join it with the Spotify dataset.

Feature	Type	Description
`title`	`string`	The name of the song
`performer`	`string`	The artist(s) who produced the song
`peak_pos`	`int`	The highest position it reached on the top charts
`wks_on_chart`	`int`	How many weeks the song stayed in the top 100

We treat a song as a hit if it appears in this dataset, i.e., it made the Billboard Top 100 at any point between 1958 and 2024. Joining this with the Spotify dataset on song title and primary artist gives us a labeled set of songs along with their audio features.
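The labeling step can be sketched in a few lines. This is a minimal, library-free illustration rather than the exact code behind this project: the helper names (`join_key`, `label_hits`), the lowercase-and-strip normalization, and the sample rows are all assumptions, and it assumes the primary artist has already been extracted from the `artists` and `performer` fields.

```python
def join_key(title, primary_artist):
    """Build a normalized (title, artist) key. The normalization is an assumption."""
    return (title.strip().lower(), primary_artist.strip().lower())


def label_hits(spotify_rows, billboard_rows):
    """Return a copy of each Spotify row with an added is_hit flag (1 or 0)."""
    hit_keys = {join_key(row["title"], row["performer"]) for row in billboard_rows}
    labeled = []
    for row in spotify_rows:
        key = join_key(row["name"], row["artists"])
        labeled.append({**row, "is_hit": int(key in hit_keys)})
    return labeled


# Hypothetical rows, made up for illustration only.
spotify_rows = [
    {"name": "Song A", "artists": "Artist One", "energy": 0.81},
    {"name": "Song B", "artists": "Artist Two", "energy": 0.35},
]
billboard_rows = [
    {"title": "song a ", "performer": "ARTIST ONE", "peak_pos": 12, "wks_on_chart": 9},
]

labeled = label_hits(spotify_rows, billboard_rows)
print([(row["name"], row["is_hit"]) for row in labeled])
```

An exact-match key like this may miss songs whose titles or artist names are written differently in the two sources, so a real join would likely need more cleanup than shown here.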

Exploratory Data Analysis

Feature Distribution

Before we get into modeling, let’s first create some graphs to help us better understand the structure of the data.

The data seems to suggest that songs tend to lean more upbeat, and you can see that especially in the energy and valence charts that heavily skew left. Surprisingly, danceability is much more symmetric, even though you’d expect it to track energy and valence. We can also see that songs lean toward produced (non-acoustic) and vocal-driven, as shown by the strong right skew for acousticness and instrumentalness.

Relationship Between Features and Top Charts

Now that we understand the feature distribution a bit better, let’s see if we can find any relationship between the features and whether or not a song makes it to the top charts.

Acousticness and energy seem to be the strongest features that help us distinguish between hit and non-hit songs. We can see that hit songs are less acoustic than their non-hit counterparts given the dramatic difference in the tail from 0.7-1.0. We also notice that hit songs are more upbeat in general compared to non-hit songs, given the stronger left skew for energy, valence, and danceability. Hit songs also tend to be louder, given the longer left tail in loudness for non-hit songs.

It’s hard to make any real observations from duration, explicit, instrumentalness, speechiness, mode, and tempo as the graph of hit songs vs. non-hit songs looks very similar for these features.

Time’s Effect on Hit Rate

Let’s see how the hit rate has changed over time.

We notice there is some volatility in the percentage of songs that were hits in the early years. In general, the percentage of songs that were hits seems to be decreasing over time. A likely explanation is that more songs get released each year while the Billboard chart stays a fixed size, so the fraction that ends up as hits drops.

Feature Correlation

Now, let’s take a look at how correlated the features are. This is an important question to ask during feature selection when modeling the data. Highly correlated features become redundant. It’s also interesting to see what features are highly correlated. For example, I’d expect energy and acousticness to be inversely correlated as acoustic songs are usually more calming.

The correlation matrix does reveal some things that we’d expect. Louder songs tend to have more energy (strongly correlated), and acoustic songs have less energy (negatively correlated). It’s also interesting to see that danceable songs are happier given the positive correlation with valence. Conversely, speechiness and loudness don’t have much correlation, which makes sense as spoken words shouldn’t affect the loudness of the track.

One thing to keep in mind is that some variables which you’d expect to be negatively correlated, like speechiness and instrumentalness, seem to have almost no correlation. A likely reason is that both features are concentrated near 0 for most tracks. Correlation measures how closely two variables move together as they deviate from their means. When most of the data sits in one place and only a small tail moves, those few points contribute little to the overall coefficient, so there’s no clear signal for correlation to pick up on.
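A small synthetic example makes this concrete. The numbers below are made up and are not drawn from the Spotify data: 90 tracks sit at 0 for both features, 5 are speech-heavy, and 5 are instrumental. The two features never co-occur in the tail, yet the Pearson correlation comes out only weakly negative (roughly -0.05). The `pearson` helper is a hand-rolled sketch written for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic values (an assumption for illustration): most tracks are at 0 for both.
speechiness = [0.0] * 90 + [0.8] * 5 + [0.0] * 5
instrumentalness = [0.0] * 90 + [0.0] * 5 + [0.8] * 5

r = pearson(speechiness, instrumentalness)
print(round(r, 3))
```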

Modeling

From the data, it looks like there are some signals (loudness, acousticness, energy, etc.) that we can try to incorporate to predict whether or not a song will be a hit. We will focus on simple, classical machine learning methods due to their interpretability. Let’s see which features logistic regression and tree-based models think are most important for determining whether a song is a hit.

Setup

We split the data into 80% training and 20% test, stratified by whether or not the song was a hit, so both sets keep the same hit rate. Only ~3.6% of songs are hits, which is a heavy class imbalance that shapes how we evaluate the models.

Given this, accuracy is not a good metric, since a trivial model that classifies every song as non-hit will reach 96.4% accuracy. Instead, we use precision-recall area under the curve (PR-AUC) as our evaluation metric. Precision is the fraction of songs we predicted as hits that actually are hits. Recall is the fraction of actual hit songs that we caught. The curve plots these against each other across all classification thresholds, and the area tells us what the average precision of the model is across all recall levels. A perfect model has an area of 1.0, which means it has 100% precision at every recall level. A model with no skill, one whose predictions don’t correlate with the true label, has an area equal to the base rate (0.036 here).
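One common way to turn the curve into a single number is average precision: rank the songs by predicted score, record the precision each time a true hit appears, and average those values. The sketch below is a minimal illustration and not necessarily the exact computation behind the numbers in this post. The function name `average_precision` and the toy labels and scores are made up, and the code assumes there are no tied scores.

```python
def average_precision(labels, scores):
    """Average of the precision values measured at the rank of each true hit."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits_seen += 1
            precision_sum += hits_seen / rank  # precision at this recall level
    return precision_sum / hits_seen if hits_seen else 0.0


# Toy example: the two hits are ranked 1st and 3rd out of 5 songs.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2

# A perfect ranking puts every hit ahead of every non-hit.
perfect = average_precision(labels, [0.9, 0.1, 0.8, 0.2, 0.3])
print(round(ap, 3), perfect)
```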

For each model, we tune hyperparameters with 5-fold cross-validation on the training set, optimizing for PR-AUC. Logistic regression features are standardized so the coefficients are directly comparable across features.
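Standardizing means rescaling each feature to zero mean and unit standard deviation, so a coefficient describes the effect of a one-standard-deviation change regardless of the original units. Here is a minimal sketch with made-up values. The `standardize` helper and the use of the population standard deviation are assumptions, and in a real pipeline the mean and standard deviation would be computed on the training set only.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]


# Made-up values on very different scales (BPM vs. dB).
tempo = [80.0, 100.0, 120.0, 140.0]
loudness = [-12.0, -9.0, -6.0, -3.0]

tempo_z = standardize(tempo)
loudness_z = standardize(loudness)

# Both features are evenly spaced, so they map to the same z-scores
# even though their raw units differ.
print([round(z, 3) for z in tempo_z])
print([round(z, 3) for z in loudness_z])
```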

Logistic Regression

As a quick reminder, logistic regression models take the form:

$$\hat{y} = \sigma(\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n)$$

where $x_1 \ldots x_n$ is the feature vector, $x_0=1$ (so $\theta_0$ can be a bias term), and $\sigma(z) = \frac{1}{1 + e^{-z}}$. The model produces values in the range $(0, 1)$. You choose a threshold $c \in [0, 1]$ such that the predicted value is $1$ if $\hat{y} \geq c$ and $0$ otherwise.
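The prediction rule above translates almost directly into code. This is a minimal sketch with made-up weights. The function names and the numbers are assumptions for illustration, not the fitted model from this post.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, features):
    """theta[0] is the bias term (it multiplies x_0 = 1); the rest pair with features."""
    z = theta[0] + sum(t * x for t, x in zip(theta[1:], features))
    return sigmoid(z)


def predict(theta, features, threshold=0.5):
    """Return 1 (hit) if the predicted probability reaches the threshold, else 0."""
    return int(predict_proba(theta, features) >= threshold)


# Hypothetical weights: bias, then one weight per (standardized) feature.
theta = [-3.0, 0.5, -1.0]
features = [1.0, 2.0]

p = predict_proba(theta, features)  # sigmoid(-3.0 + 0.5 - 2.0) = sigmoid(-4.5)
print(round(p, 4), predict(theta, features), predict(theta, features, threshold=0.01))
```

Lowering the threshold trades precision for recall, which is why the evaluation looks at the whole precision-recall curve rather than a single cutoff.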

Interpreting the Weights

The weights below are a bit surprising given our data analysis. We thought energy would be a strong predictor, but it looks relatively less important compared to the other features. Even more surprising is how important instrumentalness is, given how little variance there is in that feature.

So the natural question is: did we do something wrong?

Well, not necessarily. Logistic regression has many limitations, and an important one is that it struggles to model data that isn’t linearly separable in the feature space. To simplify our analysis, we’ll look at each feature by itself. In order for logistic regression to do well, we want there to be a transition point $c \in \mathbb{R}$ such that a feature being above or below $c$ reliably separates hits from non-hits. If the cutoff is clean, logistic regression can fit a steeper S-shaped curve around it and minimize loss since it’s more confident.

To understand our features better, let’s plot hit songs vs. non-hit songs against energy and instrumentalness.

When it comes to energy, we see that no good transition point exists, meaning that even if there were some kind of relationship between energy and hit vs. non-hit songs, logistic regression would struggle to find it. Thus, logistic regression assigns a small weight to energy.

Now let’s contrast this to instrumentalness, which has a stronger negative weight.

Much better! Even though both classes have a strong cluster of points around 0, we still have a clear signal in the tail. After a transition point of $c \approx 0.3$, the number of songs that are hits falls pretty substantially, while a good number of non-hits still sit above that cutoff. This reveals a monotonic pattern where the probability of a hit decreases as instrumentalness increases. While this definitely isn’t linearly separable, there’s a much better signal that $c$ can give for instrumentalness than energy, which the model is able to use to fit a tighter curve.

Decision Tree, Random Forest, XGBoost

If you need a refresher, StatQuest has great walkthroughs of decision trees, random forests, gradient boost, and XGBoost.

The core idea behind tree-based models is that each node in a tree splits the data on one feature by some criterion (e.g. explicit = True or instrumentalness > 0.5). You can measure how effective the split was by comparing the Gini impurity of the parent node with that of the resulting child nodes. The Gini impurity of a node is given by:

$$G = 1 - \sum_{i=1}^C p_i^2$$

where $p_i$ is the fraction of samples in the node that belong to class $i$, and $C$ is the total number of classes. A score of 0 means the node is pure, and all of the samples in that node agree on the class.

The Gini gain (a.k.a. decrease in impurity), which measures the effectiveness of the split, is given by:

$$\Delta G = G_{\text{parent}} - \sum_{j=1}^K \frac{N_{\text{child}_j}}{N_\text{parent}} G_{\text{child}_j}$$

where $K$ is the number of children, $N_{\text{child}_j}$ is the number of samples in child $j$, and $N_\text{parent}$ is the number of samples in the parent node. A higher $\Delta G$ means the split was more effective.
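Both formulas are easy to check by hand on a tiny node. The sketch below is illustrative only. The function names and the eight made-up labels (1 = hit, 0 = non-hit) are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """G = 1 - sum of squared class fractions; 0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n_parent = len(parent)
    weighted_children = sum(
        len(child) / n_parent * gini_impurity(child) for child in children
    )
    return gini_impurity(parent) - weighted_children


# Made-up node with 2 hits and 6 non-hits, split into two children.
parent = [1, 1, 0, 0, 0, 0, 0, 0]
left = [1, 1, 0]          # mixed child
right = [0, 0, 0, 0, 0]   # pure child

g_parent = gini_impurity(parent)        # 1 - (0.25^2 + 0.75^2) = 0.375
gain = gini_gain(parent, [left, right])  # 0.375 - (3/8) * (4/9)
print(round(g_parent, 3), round(gain, 3))
```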

The mean decrease in impurity is a per-feature aggregate of the gain. To compute it, you collect every split the feature was a part of. You then extract the gain, multiply it by a sample fraction (number of samples that were part of the node divided by total number of samples), and sum the weighted values together. After you compute the mean decrease in impurity of each feature, you normalize it across all features. A larger mean decrease in impurity corresponds to the feature being more important.

XGBoost takes a different approach by default. Instead of summing sample-weighted gains, it computes the mean gain per split: you collect every split the feature was a part of, take the gain from each, and average them. Like before, you normalize across all features. The main difference is that mean gain per split is not sample weighted.
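To see how the two aggregations can disagree, here is a minimal sketch over a handful of made-up splits from a single tree. The split records, gain values, and function names are all hypothetical. Also note that in XGBoost the per-split gain comes from its own training objective rather than from Gini impurity, so only the aggregation step is being illustrated.

```python
from collections import defaultdict


def normalize(scores):
    """Scale importances so they sum to 1 across features."""
    total = sum(scores.values())
    return {feature: value / total for feature, value in scores.items()}


def mean_decrease_in_impurity(splits, total_samples):
    """Sum each feature's gains, weighted by the fraction of samples at the node."""
    weighted = defaultdict(float)
    for split in splits:
        fraction = split["n_samples"] / total_samples
        weighted[split["feature"]] += split["gain"] * fraction
    return normalize(weighted)


def mean_gain_per_split(splits):
    """Average each feature's gains, ignoring how many samples reached the node."""
    gains = defaultdict(list)
    for split in splits:
        gains[split["feature"]].append(split["gain"])
    return normalize({f: sum(g) / len(g) for f, g in gains.items()})


# Hypothetical splits: 'explicit' has a large gain but touches few samples.
splits = [
    {"feature": "acousticness", "gain": 0.02, "n_samples": 1000},
    {"feature": "acousticness", "gain": 0.01, "n_samples": 600},
    {"feature": "explicit", "gain": 0.05, "n_samples": 40},
]

mdi = mean_decrease_in_impurity(splits, total_samples=1000)
mean_gain = mean_gain_per_split(splits)
print({f: round(v, 3) for f, v in mdi.items()})
print({f: round(v, 3) for f, v in mean_gain.items()})
```

With these numbers, the sample-weighted importance ranks acousticness far above explicit, while the mean gain per split ranks explicit higher, because the one explicit split is sharp but only reaches 40 of the 1000 samples.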

Interpreting the Feature Importance

Looking at the feature importance, it’s nice to see that the models identified acousticness as an important feature, agreeing with our initial analysis when we explored the data. It’s also important to notice the difference in feature importance between decision tree/random forest and XGBoost. This can largely be attributed to the differences in the mean decrease in impurity and the mean gain per split metrics. For instance, very few songs in our dataset are explicit. Thus, the sample-weighted mean decrease in impurity ranks explicit lower than mean gain per split, which doesn’t factor in sample size.

The EDA showed hits do skew slightly higher in energy, but the tree models suggest that the distributional shift isn’t enough for clean classification. Looking at the scatter plot of hit songs vs. non-hit songs against energy makes this concrete:

The hit and non-hit clusters overlap across the entire energy range. Even though the distributional difference was present, there wasn’t a clear place to split energy.

Final Thoughts

Predicting what songs are going to be hits based on audio features alone is difficult. Even our best model (XGBoost) only reaches a PR-AUC near 0.10. While this is better than the base rate of 0.036 (~3.6% of songs are hits), it’s still not amazing.

The curve shows precision falling steeply as recall grows, meaning even small concessions to recall cost a lot of precision. In practice, this means the model can confidently flag a handful of hits. If you cast a wider net (increase recall), you start bringing in too many non-hits. This shows that our audio features do carry signal, but not enough to predict hits reliably on their own.

This suggests a few points of improvement:

- Can we look at how the audio features in a song change across time? Maybe songs that start with less energy but become more energetic are popular.
- Audio features alone may not be enough, and we may need to factor in the lyrics. Things like catchiness, language, sentiment, etc. can be important for gauging popularity.
- Non-audio features like artist popularity are likely stronger indicators of whether a song will be a hit, so they are worth adding.

But the audio features still tell us something. We can see real distributional shifts in the features of hit and non-hit songs, and our models did show a noticeable improvement over the base rate.

As to whether these trends match my own music taste, they mostly do. I prefer vocal-driven, upbeat songs, which is roughly what the models picked out.