Building on our work on bird detection, we explored how (and how well) we can name all bird species audible in an audio recording.
Formally, we treat the task as follows: Given an audio recording of arbitrary length, and optionally the recording time and location, return the detection confidence for each of a predefined set of possible species.
We obtained an annotated dataset from a scientific challenge on bird audio identification that we participated in. It contains 36,496 recordings of 1500 South American bird species to learn from. The recordings were made by ornithologists and hobbyists and uploaded to the public xeno-canto website (feel free to explore!). Each recording is labelled with the one species among the 1500 that the recordist aimed to capture, and sometimes with additional species audible in the background.
Recordings are of different lengths and quality, and the annotations do not include precise timing information. For example, for the first recording above, we are only told that somewhere, there is a Dark-breasted Spinetail, a Palm Tanager, a Silver-beaked Tanager, and a Great Antshrike.
As in our bird detection work, our first step was to compute a spectrogram for each recording. Similar to the images provided by xeno-canto for the examples above, spectrograms capture the frequencies of all audible sounds over time, resulting in distinct patterns for different bird calls. We then trained Convolutional Neural Networks (CNNs) to predict which of the 1500 species are audible within a recording. The training algorithm repeatedly went through many 30-second excerpts of the recordings, each time adapting the network parameters to get closer to the intended outcome. This way, the predictions for, say, a Palm Tanager become more and more dependent on patterns that occur in all Palm Tanager recordings, and not in any others. This allows the networks to learn to distinguish birds even without precise timing annotations, as long as enough examples are available.
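To make the first step more concrete, here is a minimal sketch of computing a magnitude spectrogram and cutting a random fixed-length excerpt from it, using only NumPy. The function names, the sample rate, the frame length and the hop size are illustrative assumptions and not the settings used in our system; the CNN itself is not shown.

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Magnitude spectrogram via a short-time Fourier transform.

    frame_len and hop are assumed values chosen for illustration.
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def random_excerpt(spec, excerpt_frames, rng):
    """Pick a random excerpt of excerpt_frames frames along the time axis."""
    if len(spec) <= excerpt_frames:
        return spec  # recording is shorter than the excerpt: use all of it
    start = rng.integers(0, len(spec) - excerpt_frames + 1)
    return spec[start:start + excerpt_frames]

# Synthetic stand-in for a recording: two seconds of a 3 kHz tone.
sample_rate = 22050  # assumed sample rate
t = np.arange(2 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 3000 * t)

spec = spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=0)))
peak_hz = peak_bin * sample_rate / 1024  # close to the 3 kHz of the tone

# For real recordings, a 30-second excerpt would span this many frames:
frames_per_30s = int(30 * sample_rate / 512)

rng = np.random.default_rng(0)
excerpt = random_excerpt(spec, excerpt_frames=40, rng=rng)
```

During training, a new random excerpt would be drawn each time a recording is visited, so that over many passes the network sees different parts of the same recording under the same recording-level labels.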
To improve results, we trained a second network to predict the species only from the date, time and geocoordinates of a recording. This is of course not enough for an accurate classification, but it will capture which species are more likely to be heard in a specific season, at a specific time of day, or location. As both networks — the audio network and the metadata network — output confidences for all 1500 species, we can simply average these confidences to include both sources of information.
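The averaging step can be sketched in a few lines. The function name and the three-species toy vectors below are made up for illustration; in the real setting, each vector would hold 1500 confidences.

```python
import numpy as np

def combine_confidences(audio_conf, meta_conf):
    """Average the per-species confidences of the two networks."""
    audio_conf = np.asarray(audio_conf, dtype=float)
    meta_conf = np.asarray(meta_conf, dtype=float)
    return (audio_conf + meta_conf) / 2

# Toy example with 3 species instead of 1500 (values are invented).
audio_conf = [0.5, 0.4, 0.1]  # the audio network slightly prefers species 0
meta_conf = [0.1, 0.7, 0.2]   # date, time and location favour species 1
combined = combine_confidences(audio_conf, meta_conf)
best_species = int(np.argmax(combined))
```

In this toy case the metadata tips the decision towards species 1, even though the audio network alone would have picked species 0.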
To assess how well the networks work on data not included in training, we kept aside 10% of the 36,496 recordings. For 62% of these, the audio network correctly predicts the foreground species the recordist meant to capture, out of 1500 possible species. The second network gets 21% correct just from the date, time and location. Combining the two, we get 70% correct. Finally, using multiple audio and metadata networks, we correctly predict the foreground species in 75% of recordings.
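The evaluation can be sketched as follows: set aside a random 10% of the recordings, then count how often the species with the highest confidence matches the foreground species. The function names, the random seed and the toy confidences are assumptions for illustration; the exact split we used is not reproduced here.

```python
import numpy as np

def holdout_split(n_items, holdout_fraction=0.1, seed=0):
    """Randomly split item indices into a training and a held-out part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_holdout = int(round(n_items * holdout_fraction))
    return order[n_holdout:], order[:n_holdout]

def top1_accuracy(confidences, foreground):
    """Fraction of recordings whose top-scoring species is the foreground one."""
    predicted = np.argmax(confidences, axis=1)
    return float(np.mean(predicted == np.asarray(foreground)))

train_idx, heldout_idx = holdout_split(36496)

# Toy confidences for 4 held-out recordings and 3 species (invented values).
confidences = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4],
                        [0.6, 0.3, 0.1]])
foreground = [0, 1, 2, 1]
accuracy = top1_accuracy(confidences, foreground)
```

Here three of the four toy recordings are classified correctly. The same measure applies unchanged to the averaged confidences of several networks.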
The scientific competition we participated in used another 12,347 recordings to compare the participating teams. Among six international groups, we came in second, for detecting both the foreground and the background species.
Note that results improve when reducing the number of species that need to be distinguished. For example, for a random subset of 150 species, the networks get 90% correct.
For a detailed description of our work on bird identification, please refer to the following scientific publication:
Jan Schlüter: Bird Identification from Timestamped, Geotagged Audio Recordings. In Working Notes of CLEF, Avignon, France, 2018.
Get In Touch
This project merely serves to demonstrate what is possible with current technology. If you work on a related problem — be it in academia or industry — we are highly interested in hearing from you! Please send us an email or give us a call and we will figure out a way to collaborate.
Image credits: Speaker icon, public domain; Clock icon by simpleicon.com, CC by 3.0; Calendar icon by Subhashish Panigrahi, CC by-sa 4.0; Map pin icon, CC0 1.0; Gull silhouette by Julian Herzog, CC by 4.0; Pigeon silhouette by Nevit Dilmen, CC by-sa 3.0; Pelican silhouette by Julian Herzog, CC by-sa 3.0