In a recent blog post I discussed the possibility of using a proxy measure for quality. Rather than ask about quality directly, could we find another metric in (ideally free and accessible) existing data about user behaviour that we could rely on to predict quality?
We decided to investigate the idea empirically using samples and data from Freesound.org. Freesound is an online repository of sound files, consisting of anything users choose to submit. Visitors to the site can find a vast collection of audio samples from field recordings to synthesised music, all available to download/use under Creative Commons licences. Freesound also provides users (and researchers like us) with metadata for each sound file in the repository – how often it has been downloaded, when it was uploaded, what its average rating is, and so on. This is all great information for us to use when exploring the effect of quality on the real-life decision-making behaviour of Freesound’s users.
Imagine you are a sound designer. You are making a soundscape of an outdoor scene and decide your mix is missing some birdsong in the background. You visit Freesound and search for “Birdsong”. Hundreds of matches are returned. How do you decide which sample to download and use? Which ones will you listen to?
Typically, users in these kinds of scenarios do not spend hours going through the list of possible options, listening to each and picking the best one. Instead, we take shortcuts: which is already the most downloaded? Which ones have previous visitors rated most highly? And so forth.
The problem here, from a quality perspective, is one of a feedback loop – sound files which are already towards the top of rankings are more likely to be listened to (and downloaded, rated, and so on) by virtue of their position. How confident can we be that the most popular samples also genuinely represent those of the highest quality? Are users of Freesound using the quality of the sound file to guide their behaviour or are they biased by the previous ratings, etc, of other users?
To try to tease apart these questions we designed a simple Web experiment. We downloaded a set of samples from Freesound along with all the metadata available for each sound file. To parallel previous work we have conducted, we decided birdsong would make an interesting and enjoyable category for participants to listen to.
We then presented each sample randomly paired with another sample and asked participants to rate which of the pair was better, and by how much, on a 7-point scale. This way, over time, we established an average ‘score’ for quality for each sample (relative to all other samples).
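The scoring above can be sketched in a few lines of Python. This is a minimal illustration, not our actual analysis code: the sample names and the encoding of the 7-point scale as a signed −3..+3 rating (positive meaning the first sample of the pair was preferred) are assumptions made for the example.

```python
from collections import defaultdict

def pairwise_quality_scores(comparisons):
    """Average relative quality score per sample from pairwise ratings.

    `comparisons` is an iterable of (sample_a, sample_b, rating) tuples.
    The rating is the 7-point scale recoded as -3..+3: positive values
    mean sample_a was preferred, negative mean sample_b was preferred.
    (This encoding is an assumption for illustration.)
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for a, b, rating in comparisons:
        totals[a] += rating   # a's score rises when a is preferred
        totals[b] -= rating   # b's score falls symmetrically
        counts[a] += 1
        counts[b] += 1
    # A sample's score is its mean signed rating across all its pairings
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical file names, for illustration only
comparisons = [
    ("robin.wav", "sparrow.wav", 2),   # robin clearly preferred
    ("robin.wav", "wren.wav", 1),      # robin slightly preferred
    ("sparrow.wav", "wren.wav", -3),   # wren strongly preferred
]
scores = pairwise_quality_scores(comparisons)
# robin ends up positive, sparrow negative, wren in between
```

Over thousands of such comparisons the averages stabilise, giving each sample a relative quality score centred around zero.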
The participants in our Web experiment, however, had none of the additional information about the samples that would have been available to them had they found them on Freesound (such as play counts, number of downloads, etc.). Because our measure of quality across the clips was independent of this Freesound data, we could explore which metric(s) of user behaviour on Freesound best predict the quality of a sound file.
We downloaded a large amount of metadata for each sound file, including number of downloads, average user rating on Freesound, sample length, and number of days since upload. Additionally, it makes little sense to treat a sample which received 1000 downloads in 1 week as being equal to another with 1000 downloads over the space of 5 years, so we computed a standardised variable of number of downloads per day.
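That standardisation is straightforward to compute. A minimal sketch, using hypothetical dates and download counts to reproduce the 1-week-versus-5-years contrast from the text:

```python
from datetime import date

def downloads_per_day(num_downloads, upload_date, as_of):
    """Standardise a raw download count by the sample's age in days."""
    age_days = max((as_of - upload_date).days, 1)  # guard against same-day uploads
    return num_downloads / age_days

# 1,000 downloads in one week vs. 1,000 downloads over five years
# (dates are hypothetical, chosen only to illustrate the contrast)
recent = downloads_per_day(1000, date(2013, 12, 25), as_of=date(2014, 1, 1))
old = downloads_per_day(1000, date(2009, 1, 1), as_of=date(2014, 1, 1))
# recent is roughly 143 downloads/day; old is roughly 0.5 downloads/day
```

The raw counts are identical, but the standardised variable separates the two samples by more than two orders of magnitude.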
Preliminary analyses (based on the quality scores from around 6,500 comparisons) showed that, of the variables listed, only the average Freesound rating and downloads per day significantly correlated with our independently obtained quality scores.
We can see in the figure below the relationship between mean quality scores and the number of downloads per day. Each dot on the graph represents one of the 75 samples of bird song in our test set.
The quality scores (Y axis) are the mean scores provided by our web participants for each sample. Because, over several thousand comparisons, each sound sample was paired with every other sample, we begin to see which samples are consistently rated as being better (those with positive scores), which samples are on average no better or worse (those with scores around zero), and which ones have a tendency to be rated as being the worst of the pairs (those with negative scores).
The values along the X axis represent log transformed values for downloads per day. Because they have been transformed it might not be immediately obvious what they are showing us – essentially, the number of downloads per day increases from left to right across the scale.
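The transform itself is just a logarithm applied to the standardised download rate. A small sketch with hypothetical values (the exact base of the logarithm used in our analysis is not stated here; base 10 is assumed for the example):

```python
import math

# Hypothetical downloads-per-day values spanning several orders of magnitude
raw_dpd = [0.02, 0.5, 12.0, 140.0]

# Log-transforming compresses the heavy right tail so the scatter plot
# spreads the samples out evenly instead of bunching most near zero
log_dpd = [math.log10(d) for d in raw_dpd]
# roughly [-1.7, -0.3, 1.1, 2.1] -- still increasing left to right
```

The ordering of the samples is preserved, which is why "downloads per day increases from left to right" still holds on the transformed axis.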
The figure shows us that, broadly, samples which are rated more highly by participants in our Web experiment tend to also be the ones downloaded most often on Freesound. We can explore the nature of this relationship more rigorously with statistical analyses.
To do this we conducted a series of analyses involving a statistical technique called regression. Most people will be familiar with the concept of correlation: an estimate of the degree of association between two variables. Regression is a related technique that allows us to go a few steps further. Regression attempts to describe the degree to which change in an outcome variable (in this case, audio quality) is dependent on change in one (or more) predictor variables.
In short, the regression model we applied revealed significant predictive value of the downloads per day variable (R² = .312, p < .001). Including the average Freesound rating provided no additional predictive value to the model. In other words, our simple linear model shows that around 31% of the variability in quality scores can be accounted for by just the number of downloads per day of the audio sample. This finding might not be strong enough to accurately predict the quality of any individual sample, but it does imply that we can fairly confidently predict the best and worst of a set of samples simply by using their relative download figures.
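For readers curious about the mechanics, a single-predictor regression of this kind can be written out by hand. This is a generic ordinary-least-squares sketch on synthetic data, not our actual analysis or results; the coefficients and the noise level below are invented for illustration.

```python
import random

def simple_linear_regression(xs, ys):
    """Ordinary least squares with one predictor; returns slope, intercept, R^2."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic stand-in for 75 samples: quality loosely tracks log downloads/day
random.seed(0)
log_dpd = [random.uniform(-4, 2) for _ in range(75)]
quality = [0.4 * x + random.gauss(0, 0.8) for x in log_dpd]
slope, intercept, r2 = simple_linear_regression(log_dpd, quality)
# r2 reports the proportion of variance in quality explained by the predictor
```

In practice a library routine such as `scipy.stats.linregress` would also report the p-value, but the arithmetic it performs is exactly the least-squares calculation above.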
Great. But why is this finding important?
For several reasons. Firstly, from the perspective of individual users, anyone curious about the quality of their own submissions or recordings can estimate how any individual sample compares to others within a particular set of recordings (without having to wait months or years for a large number of ratings to accumulate).
Alternatively, for users trying to find good quality sound files among hundreds (or more) of candidates, our data suggest a shortcut that is likely to surface the higher quality recordings in the pool.
Most important, however, at least for our purposes on the Good Recording project, is the implication for our machine learning algorithms. In order to train a system to detect quality in audio we need to provide the system with relevant input data. Lots and lots of data. Access to a wide range of audio content isn’t a problem – the difficult part is providing the appropriate conditions for the machine to learn to distinguish between good and poor quality. Typically this process is labour- and time-intensive. It requires human judgements of audio samples, which can be a slow, tedious, and expensive process.
If, however, we can replicate and extend the findings discussed above to other sound types and categories (beyond birdsong), we open up the possibility of simply data-mining audio from sources such as Freesound. If our quality predictor variable can be shown to reliably distinguish between, say, the best 10% and worst 10% in a particular set of sound files, we would be able to easily acquire the appropriate input data with which to train a machine, without requiring the intermediary step of human listening tests.