Get BUSCO gene descriptions from OrthoDB


BUSCO lacks descriptions of the ortholog groups it searches for. You can query OrthoDB via its API to map BUSCO IDs and pull the information. Below is a short R snippet to automate this and produce a nice table.

BUSCO is a great tool to assess the completeness of eukaryotic genome assemblies. It let's you search your assemblies for a sets of conserved near-universal single copy orthologs genes, and returns useful statistics on their absence and presence in the data set at hand. Because the chosen orthologs are ubiquitous, complete assemblies should contain most of these markers. It also is a good way for comparing assembly versions - the more complete markers, the better. But also the numbers of fragmented and duplicated markers can give you useful information about assembly qualities.

Recently, I used BUSCO to assess four assemblies of quite similar strains of the marine flagellate Cafeteria roenbergensis. Among other things, I noticed that in all strain the same set of BUSCOs was missing. This made me curious, and I wondered if this might reflect some interesting biology - certain pathways missing, … Unfortunately, the output generated by BUSCO does not contain any descriptive information about the different gene clusters, just the BUSCO IDs for each cluster; apparently a issue that others encountered before, and could not really solve (Get BUSCO gene descriptions - Biostars).

So, here is how I solved it. Most BUSCO data sets are generated from OrthoDB. If you search OrthoDB with BUSCO IDs, you will immediately be forwarded to the respective ortho-group entry containing all the metainformation we are interested in. Because I want to retrieve metadata for a large number of BUSCO IDs, I wrote a small R script that queries the OrthoDB API, and returns a nice table with the results.

library(tidyverse)
library(jsonlite)

busco_ids <- c("EOG093714Q2", "EOG09370GHP", "EOG09370OS5")
names(busco_ids) <- busco_ids # so map_df gets .id right

# a tibble of interpro profile associated with each busco_id
busco_ipr <- map_df(busco_ids, .id="busco_id",  function(busco_id){
  write(busco_id, stderr()) # just so we can monitor progress
  # map the BUSCO ID to OrthoDB group ID
  query <- read_json(paste0("https://www.orthodb.org/search?query=", busco_id))
  odb_id <- query$data[1]
  # get all info on the orthogroup
  odb_info <- read_json(paste0("https://www.orthodb.org/group?id=", odb_id),
    simplifyVector = TRUE)
  # return the interpro table
  odb_info$data$interpro_domains
})

busco_ipr %>% select(busco_id, description)

And this is how the table looks like:

     busco_id                                              description
1 EOG093714Q2                          Mediator complex, subunit Med31
2 EOG093714Q2       Mediator complex, subunit Med31 domain superfamily
3 EOG09370GHP                       Reduced growth phenotype protein 1
4 EOG09370OS5 Ribosomal protein S2, flavodoxin-like domain superfamily
5 EOG09370OS5                                     Ribosomal protein S2
6 EOG09370OS5      Ribosomal protein S2, bacteria/mitochondria/plastid
7 EOG09370OS5                     Ribosomal protein S2, conserved site