What is a podcast? I expect we all have our own slightly idiosyncratic answer to that question. For me, podcasts are defined partly by some software I use on my phone that regularly downloads audio files from various feeds, which I periodically listen to, often as I'm doing other things. Most of these recordings feature two or more people engaged in a long-form conversation. Some of these take the form of interviews, in which a regular host interviews a rotating series of guests. Others feature rotating subsets of the same group of hosts, talking together about various issues. A few shows, like Hardcore History, are just long monologues, not so different from an audiobook.
For you it may be different. Some of the available feeds feature narrative content. Others are primarily music. A popular form of podcast is something like highly-produced audio journalism or documentaries, with one or more hosts reporting out a story, featuring interviews, audio clips, sound effects, and re-creations. Clearly all of these, especially the last, bear some relation to traditional radio, except that they can be listened to on demand, rather than broadcast. And yet, increasingly people seem to refer to things like video series on YouTube as podcasts, even if they are not delivered via a dedicated audio feed. Others apparently primarily consume podcasts in written form, only reading transcripts, rather than ever listening to a show.
As a step towards understanding the podcast ecosystem in more detail, my PhD student Ben Litterer recently built and released the Structured Podcast Research Corpus (SPoRC), together with David Jurgens and me. As a first offering, we decided to focus on a slice of time, collecting everything we could from the months of May and June 2020. For the purposes of this work, we basically define a podcast as any RSS feed that distributes audio files.
To build this dataset, we started with Podcast Index, an open source database of podcast feeds, to which podcast creators are encouraged to submit their shows. The resulting list is not fully comprehensive, both because some shows may not be indexed in the database, and because entire feeds and particular episodes may eventually be taken down or go offline. Nevertheless, we were able to get a huge sample of over one million episodes from that two-month period.
Using the RSS feeds available from Podcast Index, we downloaded all the corresponding episodes from our time period of interest, and then transcribed these using OpenAI's Whisper model. For a subset of around 400,000, we also diarized the audio files, to separate the speech out into individual speakers, and extracted basic audio features at the word level, such as pitch and other aspects of the frequency spectrum. Due to their large size, we deleted the original audio files, but have made the output data available via a dataset on HuggingFace.
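To make the working definition concrete, here is a minimal Python sketch of how one might pull audio enclosures out of an RSS feed and keep only the episodes published in May or June 2020. This is not the pipeline we used for SPoRC. The function name `audio_episodes`, the sample feed, and the example.com URLs are all hypothetical, and the sketch assumes a reasonably well-formed RSS 2.0 feed with `pubDate` and `enclosure` elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Collection window from the post: May and June 2020 (end is exclusive).
START = datetime(2020, 5, 1, tzinfo=timezone.utc)
END = datetime(2020, 7, 1, tzinfo=timezone.utc)

# Hypothetical feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example Show</title>
<item><title>In window</title>
  <pubDate>Mon, 04 May 2020 10:00:00 +0000</pubDate>
  <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="1"/></item>
<item><title>Too late</title>
  <pubDate>Mon, 03 Aug 2020 10:00:00 +0000</pubDate>
  <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="1"/></item>
<item><title>Video only</title>
  <pubDate>Mon, 15 Jun 2020 10:00:00 +0000</pubDate>
  <enclosure url="https://example.com/ep3.mp4" type="video/mp4" length="1"/></item>
</channel></rss>"""


def audio_episodes(rss_text, start=START, end=END):
    """Return (title, audio_url) pairs for items that have an audio
    enclosure and a pubDate falling in the half-open window [start, end)."""
    root = ET.fromstring(rss_text)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        pub_date = item.findtext("pubDate")
        if enclosure is None or not pub_date:
            continue  # no media file or no date: skip
        if not enclosure.get("type", "").startswith("audio/"):
            continue  # keep audio only, per the working definition
        try:
            published = parsedate_to_datetime(pub_date)
        except (TypeError, ValueError):
            continue  # unparseable date
        if published.tzinfo is None:
            # Assumption: treat dates without a time zone as UTC.
            published = published.replace(tzinfo=timezone.utc)
        if start <= published < end:
            title = item.findtext("title", default="")
            episodes.append((title, enclosure.get("url")))
    return episodes


print(audio_episodes(SAMPLE_FEED))
```

On the sample feed, only the first item survives: the second falls outside the window, and the third is not audio. Real feeds are messier than this (missing MIME types, odd date formats, dead links), which is part of why any such collection ends up incomplete.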
Digging into this data, we can provide some basic characterization of what a podcast is, at least within the limits of our technical definition. Within our sample, the median number of episodes per show is just one during our two-month collection period. However, nearly as many shows release one episode per month or one per week. The median length of episodes is about 30 minutes, with a handful going on for more than three hours. A majority of episodes appear to have one or two speakers, with two being slightly more common than one. Surprisingly to us, the most common self-assigned primary category for shows was Religion. It turns out that a huge number of (mostly Christian) services are recorded and made available online via RSS feeds, which thus meet our definition as podcasts.
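For readers who want to compute this kind of summary themselves, the following is a small sketch using only Python's standard library. The records are made-up toy values, not numbers from SPoRC, and the `(show_id, minutes)` layout and the `summarize` helper are assumptions made for illustration rather than the schema of the released dataset.

```python
from collections import Counter
from statistics import median

# Hypothetical toy records: (show_id, episode_length_in_minutes).
EPISODES = [
    ("show_a", 28.0),
    ("show_b", 45.0),
    ("show_b", 50.0),
    ("show_c", 12.0),
    ("show_d", 31.0),
    ("show_d", 30.0),
    ("show_d", 190.0),
    ("show_e", 22.0),
]


def summarize(records):
    """Compute per-show and per-episode summary statistics."""
    # Number of episodes each show released in the collection window.
    per_show = Counter(show for show, _ in records)
    lengths = [minutes for _, minutes in records]
    return {
        "median_episodes_per_show": median(per_show.values()),
        "median_length_minutes": median(lengths),
        "episodes_over_three_hours": sum(m > 180 for m in lengths),
    }


print(summarize(EPISODES))
```

Note that the two medians are taken over different units: the first is over shows, the second over episodes, so a few prolific shows can influence the episode-level numbers much more than the show-level ones.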
What kinds of things are podcasts about? Below is an interactive version of a figure from our paper, which shows a t-SNE plot of a sample of 25,000 episodes represented according to their topic distribution (inferred via LDA), and coloured according to podcasts' self-assigned primary category. As you select parts of the figure, the histogram on the right shows the most common primary topics included in your selection. As can be seen by exploring the space, there are many clear clusters that fit neatly within named categories. Within Sports, for example, there are obvious clusters for football, baseball, wrestling, and so on, as well as somewhat more niche fandoms, such as comics, movies, and books.
Other topics cut across a few or many categories. Shows devoted to spirituality end up somewhere among Business, Religion, and Health, for example. One of the most cross-cutting topics in this dataset is the discussion of racial justice and George Floyd, who was killed during the time period we study. In the middle of the figure we can find many more general topics, like family, which may appear on many shows across many categories.
To get a slightly more fine-grained sense of what typical podcasts are about, I ran a larger topic model on a sample of 200,000 transcripts, using 500 topics. Looking through these, I did a rough manual categorization for certain themes that jumped out. For some of these, I'll give a rough estimate of the percentage of ALL episodes that are primarily about this area (comprising one or more topics), but please keep in mind that these are very rough estimates, so I wouldn't put too much weight on them. In particular, most of these numbers are likely underestimates, since many episodes end up with a less clear primary topic, but might still be about many other things.
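To illustrate how such percentages can be computed, and why they tend to be underestimates, here is a toy sketch. It assumes each episode has a vector of topic weights, assigns each episode to its single highest-weight topic, and then counts episodes by a hand-made topic-to-theme mapping. The topic indices, theme labels, weights, and the `theme_shares` helper are all hypothetical, and this is only one plausible reading of "primarily about", not necessarily the exact procedure behind the numbers in this post.

```python
# Hypothetical mapping from topic index to a manually assigned theme.
# Topic 3 is left unmapped, standing in for a miscellaneous topic.
THEME_OF_TOPIC = {0: "football", 1: "football", 2: "music"}

# Hypothetical per-episode topic weights (each row sums to 1).
DOC_TOPIC = [
    [0.70, 0.10, 0.10, 0.10],  # primary topic 0 -> football
    [0.10, 0.60, 0.20, 0.10],  # primary topic 1 -> football
    [0.10, 0.10, 0.70, 0.10],  # primary topic 2 -> music
    [0.30, 0.00, 0.30, 0.40],  # primary topic 3 -> no theme
    [0.05, 0.05, 0.10, 0.80],  # primary topic 3 -> no theme
]


def theme_shares(doc_topic, theme_of_topic):
    """Percentage of ALL documents whose highest-weight topic maps to each theme."""
    if not doc_topic:
        return {}
    counts = {}
    for weights in doc_topic:
        # Index of the single topic with the largest weight.
        primary = max(range(len(weights)), key=weights.__getitem__)
        theme = theme_of_topic.get(primary)
        if theme is not None:
            counts[theme] = counts.get(theme, 0) + 1
    total = len(doc_topic)
    return {theme: 100.0 * n / total for theme, n in counts.items()}


print(theme_shares(DOC_TOPIC, THEME_OF_TOPIC))
```

In this toy data, the fourth episode puts 30% of its weight on a football topic, but because its top topic is the miscellaneous one, it does not count towards football at all. That is the sense in which shares based on primary topics will usually understate how much of the corpus touches on a given theme.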
Among sports, it appears that football and basketball are the two most popular (1.6% and 1.3% of all episodes, respectively). After that, we have wrestling (0.5%), soccer (0.5%), and baseball (0.4%). Obviously these categories are imprecise, and not necessarily fairly compared, but overall this seems plausible. Farther down the list, things potentially get a bit combined, but there are obvious topics for UFC, boxing, golf, horse racing, rugby, hockey, etc. Basically, if you can think of any moderately popular sport, there are probably quite a few podcasts about it.
Other categories like Business, Health, and Society are a bit harder to parse into clear themes, but for Business, for example, entrepreneurship and careers generally seem to be big. Surprisingly to me, discussion of crypto does not seem to be particularly common on podcasts. The only obvious crypto topic (bitcoin crypto blockchain currency exchange) accounts for only about 0.2% of episodes.
Within entertainment, music appears to be dominant (around 2.2%), followed by video games and movies (both about 1.3%). Within more niche fandoms, Star Wars appears to beat out Star Trek (0.3% vs 0.1%), and Harry Potter is more popular than Game of Thrones, though both are negligible in terms of popularity.
Interestingly, the most dramatic skew is within Religion. Among those topics that can be easily identified as related to particular religions, more than 12% of all episodes in our data end up assigned to topics related to Christianity. By contrast, the numbers for Judaism, Islam, and Mormonism are much smaller (0.4%, 0.3%, and less than 0.1%, respectively).
Probably the most surprising results here are what we don't see a lot of. One of these is politics. There are a few clear topics related to politics, elections, and so forth. However, none of these are particularly prevalent. More likely, politics is threaded through other discussion, such as topics focused on policing, race, COVID-19, and so forth. However, as a simple test case, even the word "trump" only occurs with any reasonable prominence in two topics, and each of these accounts for less than half a percent of episodes. (As a reminder, these are episodes from May and June 2020.)
The other big surprise to me is that there isn't a bigger slice of topics that are identifiable as related to true crime. Anecdotally, I think of true crime as being one of the most popular genres of podcasts. However, in this data, there is only one topic that clearly relates to it (murder found police case crime killed evidence), and it is only about half a percent of all episodes. True Crime also exists as a self-assigned category, but it's similarly less than half a percent as a primary choice.
Once again, these numbers are very heuristic; many episodes end up associated with more miscellaneous topics, such as those that are primarily names, or things which seem to be more about advertising or the nature of podcasting itself.
However, probably the biggest limitation here is that these numbers tell us nothing about popularity. It could be, for example, that true crime is an extremely popular genre to listen to, but that there aren't that many shows devoted to it. To me, that would be surprising, given that I would expect supply to largely follow demand, but perhaps the cost of creating such a show is high, or it is effectively impossible to compete with the best known shows. By contrast, perhaps recording a show about a particular fandom is relatively low cost for the hosts (in terms of their time), and they are doing it without expecting to find a large audience.
All to say, there is much that we don't yet know about the podcast ecosystem. Our paper contains many more in-depth analyses of this dataset, including looking at networks that form via guests that appear on multiple shows, as well as the way discussions of George Floyd spread throughout parts of the ecosystem. Above all, we want to encourage people to take podcasts seriously as a medium for studying the spread of news, information, and cultural interactions, as well as a rich source of data on how people interact. We're also planning to release additional future datasets that will extend our work beyond this first sample, and we're hopeful that SPoRC will help people begin to dig more into these questions.