The pan-ecological gene universe of metagenomics

This resource accompanies a submitted manuscript exploring the gene content of microbiomes (microbial communities) across 17 different ecologies, including humans, animals, and other environmental sources. The link below leads to a download page for our gene resources, and the abstract describes our findings, which notably include an unprecedented appreciation for the conservation of certain genes across different environments.

When our work is available online, we will include a link to the manuscript here. Below you can find an abstract summarizing our findings as well as an introductory figure describing the content of our database.


You can access our gene catalog files (FASTA files, functional gene annotations, etc.) at our figshare resource.


The human microbiome consists of microbes with pan-ecological evolutionary origins, yet a systematic gene-level analysis of microbial life across ecologies is lacking. We quantified the gene content from 14,183 samples across 17 ecologies – 6 human-associated, 7 non-human-host associated (e.g. mouse gut), and 4 in other environmental niches (e.g. soil). At 30% amino acid identity, we identified 117,629,181 non-redundant genes across all samples, 66% of which were singletons, only being observed in one sample. We quantified the genetic similarity and “uniqueness” between different ecologies, showing that sites like the human vaginal and skin ecologies had low genetic alpha-diversity yet high beta-diversity, indicating few species but high pangenomic variation. We further identified a set of 1,864 sequences conserved across all ecologies, which indicates an overwhelming gene-level conservation to microbial life despite extreme taxonomic variation. However, using 90% amino acid clustering identity, we did not observe any globally conserved genes, even those known to be present in all bacteria. This indicates that prior studies, which cluster at, for example, 95% nucleotide identity, have not estimated microbial gene content accurately. We additionally found genes that were differentially abundant in particular groups of ecologies (e.g. human gut and non-human gut genes), identifying discrete functions among these groups. We showed that genes associated with pathogenic taxa tend to be the most likely to appear in multiple ecologies. We provide our databases, as well as the sets of genes described above, as a resource at .
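To make two of the quantities in the abstract concrete, here is a minimal Python sketch of how singletons (genes observed in only one sample) and genes conserved across all ecologies could be counted from a presence table. It is an illustration only, not the pipeline used in the manuscript: the function name `summarize_gene_presence`, the two dictionary inputs, and the toy data are all hypothetical, and the gene identifiers are assumed to be cluster representatives that have already been computed.

```python
from collections import defaultdict


def summarize_gene_presence(sample_genes, sample_ecology):
    """Return (singletons, conserved) for a toy gene presence table.

    sample_genes:   dict mapping sample id -> set of gene cluster ids (hypothetical input)
    sample_ecology: dict mapping sample id -> ecology label (hypothetical input)
    """
    gene_samples = defaultdict(set)    # gene -> samples it was observed in
    gene_ecologies = defaultdict(set)  # gene -> ecologies it was observed in
    for sample, genes in sample_genes.items():
        for gene in genes:
            gene_samples[gene].add(sample)
            gene_ecologies[gene].add(sample_ecology[sample])

    # Ecologies that are actually represented by at least one sample.
    all_ecologies = {sample_ecology[s] for s in sample_genes}

    # Singletons: genes seen in exactly one sample.
    singletons = {g for g, s in gene_samples.items() if len(s) == 1}
    # Conserved: genes seen in every represented ecology.
    conserved = {g for g, e in gene_ecologies.items() if e == all_ecologies}
    return singletons, conserved


# Toy data, invented for illustration.
toy_genes = {"s1": {"g1", "g2"}, "s2": {"g1", "g3"}, "s3": {"g1", "g2"}}
toy_ecology = {"s1": "gut", "s2": "soil", "s3": "skin"}
singletons, conserved = summarize_gene_presence(toy_genes, toy_ecology)
```

In this toy table, `g3` appears in a single sample and `g1` appears in all three ecologies. Note that both counts depend on the clustering identity used to define a "gene", which is the effect the abstract describes when comparing 30% and 90% amino acid identity.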

Database overview


Overview of database and genetic similarity between ecologies. A) Statistics regarding the gene and sample content of our database at the 30% clustered sequence identity and the high-level analytical steps we took in the manuscript. B) Hierarchical clustering on the gene overlap between ecologies as a function of iterative sampling. Each cell represents the average number of genes shared between two ecologies (on rows and columns) after 50 random samplings. Cell color is in units of number of genes. Color of text corresponds to broader ecology class. C) The average number of genes shared between any two ecologies. This number is the sum of the rows, not counting the diagonal. D) The average number of genes unique to (found only in) a given ecology. This number corresponds to the values in the diagonal.
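The iterative sampling behind panels B and D can be sketched roughly as follows. This is a hedged illustration, not the manuscript's code: it assumes each iteration draws the same number of samples from every ecology and pools their genes, which the caption does not state explicitly. The function name `mean_shared_genes`, its parameters, and the input dictionaries are hypothetical.

```python
import random
from itertools import combinations


def mean_shared_genes(ecology_samples, sample_genes, n_samples, n_iter=50, seed=0):
    """Average shared and unique gene counts over repeated random samplings.

    ecology_samples: dict mapping ecology label -> list of sample ids (hypothetical input)
    sample_genes:    dict mapping sample id -> set of gene cluster ids (hypothetical input)
    n_samples:       samples drawn per ecology per iteration (assumed equal for all ecologies)
    """
    rng = random.Random(seed)
    ecologies = sorted(ecology_samples)
    shared = {pair: 0 for pair in combinations(ecologies, 2)}
    unique = {e: 0 for e in ecologies}

    for _ in range(n_iter):
        # Pool the genes of a random draw of samples from each ecology.
        pooled = {}
        for e in ecologies:
            drawn = rng.sample(sorted(ecology_samples[e]), n_samples)
            pooled[e] = set().union(*(sample_genes[s] for s in drawn))

        # Off-diagonal cells: genes shared by a pair of ecologies.
        for a, b in shared:
            shared[(a, b)] += len(pooled[a] & pooled[b])

        # Diagonal cells: genes found only in one ecology.
        for e in ecologies:
            others = set().union(*(pooled[o] for o in ecologies if o != e))
            unique[e] += len(pooled[e] - others)

    # Convert accumulated counts to averages over all iterations.
    shared = {pair: total / n_iter for pair, total in shared.items()}
    unique = {e: total / n_iter for e, total in unique.items()}
    return shared, unique
```

Drawing an equal number of samples per ecology is one way to keep heavily sampled ecologies from dominating the overlap counts; other normalizations are possible, and the manuscript should be consulted for the procedure actually used.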

The old version of this resource, comprising only human and gut microbiome data, is available at:


For questions or concerns, please contact