Genome coverage, literally speaking. The challenge of annotating 200 genomes with 4 million publications

Research output: Contribution to journal › Article › peer-review


  • Paul Janssen
  • Leonid Goldovsky
  • Victor Kunin
  • Nikos Darzentas
  • Christos Ouzounis


In late 2004, 200 complete genomes had been sequenced and made available to the research community. At the time of writing this viewpoint, that number had further risen to 221 and will have undoubtedly increased again before publication. These genomes, which represent a wide range of species from archaea to human, are a highly valuable knowledge resource for the scientific community. However, the sequencing of a full genome is just the first step in research; it must be followed by the functional characterization of genes and proteins. In this context, it is interesting to see how well represented these sequenced species are in terms of publications. We have thus obtained the number of abstracts published per species and normalized that count by the number of genes in that species to obtain a comparable measure for the number of publications per gene for all completed and published genomes. This simple measure highlights the current knowledge gap between various organisms and could further serve as a guideline for selecting genomes for sequencing projects, high-throughput functional genomics and database annotation efforts.
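The normalization described above (abstracts per species divided by the number of genes in that species) can be sketched as follows. This is a minimal illustration and not the authors' actual pipeline: the function names, the species labels, and all counts below are hypothetical placeholders chosen only to show the arithmetic.

```python
from typing import Dict, List, Tuple


def publications_per_gene(abstracts: int, genes: int) -> float:
    """Return the number of abstracts per gene for one species."""
    if genes <= 0:
        raise ValueError("gene count must be positive")
    return abstracts / genes


def rank_species(counts: Dict[str, Tuple[int, int]]) -> List[Tuple[str, float]]:
    """Rank species from best to least covered by literature.

    `counts` maps a species name to (abstract count, gene count).
    """
    scores = {
        name: publications_per_gene(abstracts, genes)
        for name, (abstracts, genes) in counts.items()
    }
    # Highest publications-per-gene first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical placeholder counts, NOT real data from the article.
example_counts = {
    "species_a": (12000, 4000),
    "species_b": (150, 3000),
    "species_c": (900, 1800),
}

ranking = rank_species(example_counts)
for name, score in ranking:
    print(f"{name}: {score:.2f} abstracts per gene")
```

Species at the bottom of such a ranking would be the ones with the widest gap between available sequence and published knowledge, which is the kind of signal the abstract suggests could guide annotation and functional genomics efforts.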


Original language: English
Pages (from-to): 397-399
Journal: EMBO Reports
Issue number: 5
Publication status: Published - 1 May 2005


  • genomics, genome projects, genome literature, text mining, species knowledge index, SKI

ID: 71686