Welcome to ResMicroDb

1. Introduction

1.1 Overview of ResMicroDb

Respiratory diseases pose a major burden to global public health. Distinct anatomical sites within the human respiratory tract harbor unique resident microbial communities, which play important roles in susceptibility, clinical progression, and outcomes of both infectious and non-infectious diseases. The number of studies on the respiratory microbiome has surged recently, along with a substantial increase in data. However, there is a notable lack of a comprehensive respiratory microbiome database.

To fill this gap, we introduce ResMicroDb, a comprehensive database and analysis platform for the human respiratory microbiome. ResMicroDb integrates 106,464 samples from 489 publications and 514 projects, covering 10 sample sites and 146 phenotypes. Specifically, ResMicroDb provides the following data resources: (1) standardized microbiome profiles for all samples generated using a unified processing pipeline; (2) a total of 32 manually curated metadata factors; (3) distributions of 3,490 microbial taxa across sample sites, phenotypes, countries, and age groups; and (4) 11,908 microbe-disease associations identified from 132 case-control studies. In addition, ResMicroDb offers three online tools for further analysis: (1) Microbiome Composition, which generates microbiome profiles for selected samples; (2) Sample Similarity Search, which finds the samples in the database most similar to a user-provided query; and (3) Cross-study Analysis, which explores both shared and unique microbial characteristics across cohorts, diseases, and sample sites. Overall, this database and its integrated analysis tools are intended as a versatile resource for a broad spectrum of respiratory microbiome studies.

Figure 1. Overview of ResMicroDb

1.2 Citation

Please cite: ResMicroDb: A Comprehensive Database and Analysis Platform of Respiratory Microbiome.

1.3 Contact us

If you have any questions, suggestions, or comments, please feel free to contact us via email (limk@cncb.ac.cn).

Address:

National Genomics Data Center

Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences

No. 104 building, No.1 Beichen West Road, Chaoyang District

1.4 Licenses

ResMicroDb is free for academic use only. For any commercial use, please contact us for commercial licensing terms.

2. Materials and Methods

2.1 Data Collection

We searched the PubMed database up to January 31, 2025 using the keywords “human” AND “[respiratory]” AND “microbiome”. The term “[respiratory]” was also substituted with specific respiratory site names, such as “oropharynx”, “trachea”, and “lung”. Of the 10,586 records retrieved, we manually reviewed the full text of each publication and excluded those that did not focus on the respiratory microbiome or lacked accessible raw data. Finally, 489 publications were retained.

Search Query for Respiratory Microbiome Articles

2.1.2 Raw sequencing data download

We downloaded raw microbiome sequencing data in FASTQ format from NCBI SRA, EBI ENA, and NGDC GSA.

2.1.3 Metadata download and curation

To ensure maximal completeness and accuracy of the metadata, we manually curated and standardized most metadata obtained from the BioSample database. The curation process involved cross-referencing the original research articles associated with each BioProject and verifying the consistency between the published data, supplementary materials, and the corresponding BioSample metadata. Metadata for all samples is standardized according to the structured curation model below.

Items | Description | Value (for accession fields: prefixes of accession numbers)
Basic information
Run | Accession number of each sample from data resource | SRR, ERR, DRR or CRR
Project ID | Accession number of each BioProject from data resource | PRJNA, PRJEB, PRJDB or PRJCA (some projects do not have a BioProject accession number; their SRAStudy accession number is used instead)
BioSample | Accession number of each BioSample in data resource | SAMN, SAMEA, SAMD or SAMC
PMID | Publication in which the sample is described | PubMed ID
Sequencing Strategy
Sequencing Type | Controlled vocabulary | 16S, Metagenomics, Metatranscriptomics
Library Layout | Controlled vocabulary | SINGLE or PAIRED
Platform | Controlled vocabulary | ILLUMINA, LS454, BGISEQ, ION_TORRENT
Model | Controlled vocabulary | Illumina MiSeq, Illumina NovaSeq 6000, Illumina HiSeq 2500, etc.
Biological Condition
Phenotype | Controlled vocabulary | Terms or identifiers from multiple ontologies, used to establish a standardized framework for disease names and definitions: Experimental Factor Ontology (EFO; main source), Mondo Disease Ontology (MONDO), NCI Thesaurus OBO Edition (NCIT), SNOMED CT (International Edition) (SNOMED), Human Disease Ontology (DOID), Human Phenotype Ontology (HP), TOXic Process Ontology (TXPO), and Ontology for MIRNA Target (OMIT)
Disease Stage | The disease stage of samples | Conclusion term
Complication | The complication of samples | Conclusion term
Smoke | Controlled vocabulary | Non-smoker, Smoker, Ex-Smoker
Recent Antibiotics Use | Controlled vocabulary | Yes, No
Antibiotics Used | The antibiotics used for the samples | Conclusion term
Sample Characteristic
Sample Site | Controlled vocabulary | Nasal, Nasopharynx, Oropharynx, Pharynx, Throat, Sputum, Trachea, Bronchus, BALF, Lung Tissue
Sample Type | Sample site as recorded in the original publication/study | Conclusion term
Sex | Controlled vocabulary | Male, Female
Age | Statistical data | The age of samples, recorded as either a specific value or an interval
Age Group | Controlled vocabulary | 0-3, 3-18, 18-35, 35-45, 45-60, 60-75, 75+
BMI | Statistical data | The BMI of samples, recorded as either a specific value or an interval
Patient ID | A unique identifier for each patient as recorded in the original publication/study | Conclusion term
Time Point | The time point at which the sample was collected | Conclusion term
#Reads | Statistical data | The number of reads in the sample
Shannon | Statistical data | Shannon-Wiener index, a measure of taxa diversity in a community
Chao1 | Statistical data | Chao1 index, an estimator of taxa richness in a community
Observed | Statistical data | The observed number of unique taxa within a community
Geographic Location
Continent | Controlled vocabulary | Europe, North America, Asia, Africa, Oceania, South America
Country | Controlled vocabulary | United States, China, Netherlands, United Kingdom, etc.
Location | Controlled vocabulary | Bilthoven, Copenhagen, New York, Shenzhen, etc.
Latitude | The geographic latitude coordinate | Generated by geocoding the 'Country' and 'Location' fields using Google Maps
Longitude | The geographic longitude coordinate | Generated by geocoding the 'Country' and 'Location' fields using Google Maps
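
As an illustration of how a numeric age could be mapped onto the Age Group vocabulary, here is a minimal Python sketch. The table does not state which group a boundary age (for example, exactly 3 or 18 years) belongs to, so the sketch assumes that boundary ages fall into the older group. The names `BOUNDS`, `LABELS`, and `age_group` are hypothetical and are not part of ResMicroDb.

```python
import bisect

# Boundaries between adjacent groups, in years.
# Assumption: a boundary age (e.g. exactly 3) is assigned to the older group.
BOUNDS = [3, 18, 35, 45, 60, 75]
LABELS = ["0-3", "3-18", "18-35", "35-45", "45-60", "60-75", "75+"]


def age_group(age_years):
    """Return the Age Group label for an age given in years."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many boundaries are <= age_years,
    # which is exactly the index of the matching label.
    return LABELS[bisect.bisect_right(BOUNDS, age_years)]


for age in (2.5, 3, 40, 80):
    print(age, age_group(age))
```

Ages recorded as an interval rather than a single value would need a separate rule (for example, using the interval midpoint), which is not covered here.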

2.2 Data Processing

2.2.1 Quality control and Taxonomic assignment

For 16S rRNA gene sequencing data (Figure 2, left), we performed: (1) Read merging and quality control using VSEARCH; (2) Denoising and chimera removal to generate Amplicon Sequence Variants (ASVs) using USEARCH; (3) Taxonomy classification using the QIIME2 feature-classifier plugin and the SILVA ribosomal RNA database (v138).

For metagenomic and metatranscriptomic sequencing data (Figure 2, right), we performed: (1) Quality control using fastp; (2) Removal of host (T2T-CHM13) and contaminant (UniVec) sequences using Bowtie2; (3) Ribosomal RNA removal using SortMeRNA; (4) Taxonomic classification using Kraken2 with a custom database (NCBI nucleotide sequences for bacteria/archaea/viruses/fungi plus T2T-CHM13); (5) Abundance estimation at the species level using Bracken.

Figure 2. Microbiome sequencing data analysis pipeline

2.2.2 Case-control Analysis

For the 132 case-control studies collected, we performed microbial biomarker identification, biodiversity analysis, and co-occurrence network construction for each individual study to systematically compare microbiome differences between the disease and control groups. Additionally, Cross-study Analysis was conducted to identify consistent associations across studies.

2.2.2.1 Identification of microbe-disease associations (markers)

Marker taxa are genera or species that differ significantly in abundance between case and control groups according to differential abundance analysis (DAA), defined as absolute gFold Change > 0.1 and BH-adjusted p-value <= 0.2.
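
The marker criteria can be illustrated with a short Python sketch. It applies the standard Benjamini-Hochberg (BH) adjustment to a list of raw p-values and then checks the two thresholds stated above. This is a generic illustration rather than the ResMicroDb pipeline itself; the function names `bh_adjust` and `is_marker` are hypothetical, and the example p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_marker(gfold_change, p_adj, min_effect=0.1, max_p_adj=0.2):
    """Apply the thresholds from the text: |gFold Change| > 0.1 and p.adj <= 0.2."""
    return abs(gfold_change) > min_effect and p_adj <= max_p_adj


raw_p = [0.01, 0.04, 0.03, 0.5]          # illustrative values only
print(bh_adjust(raw_p))
print(is_marker(gfold_change=-0.3, p_adj=0.2))
```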

We employed multiple DAA methods to provide users with diverse options. These methods include the Wilcoxon rank-sum test, fastANCOM, ALDEx2, MaAsLin2, and ZicoSeq, with MaAsLin2 set as the default method.

For effect size quantification, we employed generalized fold change (gFold Change). Generalized fold change is the mean of the differences between two distributions at several quantiles and can therefore resolve differences in low-prevalence taxa. The formula is as follows:

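$$\mathrm{gFoldChange} = \frac{1}{n} \sum_{q \in Q} \left( \log_{10}\left(X_{\mathrm{case}}(q) + \epsilon\right) - \log_{10}\left(X_{\mathrm{control}}(q) + \epsilon\right) \right) \tag{1}$$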

Where:

  • Q: Set of quantiles, {0.1,0.2,…,0.9}

  • Xcase(q), Xcontrol(q): The q-th quantile of relative abundance in the case and control groups, respectively.

  • ϵ: Pseudocount, 1e-6

  • n: Number of quantiles evaluated (n = 10)
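
The definition above can be sketched in a few lines of Python. The sketch averages the log10 differences over whatever set of quantiles is passed in and divides by the number of quantiles supplied. Two points are assumptions, not statements about the ResMicroDb implementation: the quantiles are computed by linear interpolation, and the default quantile set is 0.1 to 0.9 in steps of 0.1. The helper names are hypothetical.

```python
import math


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def gfold_change(case, control, quantiles=None, eps=1e-6):
    """Mean difference of log10 quantiles between case and control abundances."""
    if quantiles is None:
        # Assumed default: 0.1, 0.2, ..., 0.9
        quantiles = [k / 10 for k in range(1, 10)]
    case_sorted, control_sorted = sorted(case), sorted(control)
    diffs = [
        math.log10(quantile(case_sorted, q) + eps)
        - math.log10(quantile(control_sorted, q) + eps)
        for q in quantiles
    ]
    return sum(diffs) / len(diffs)


# A taxon that is uniformly ten times more abundant in cases gives a value close to 1.
print(gfold_change([0.1] * 5, [0.01] * 5))
```

Because the statistic compares quantiles rather than means, a taxon present in only part of the samples still contributes through its upper quantiles.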

2.2.2.2 Diversity Analysis

Alpha diversity was assessed using the Shannon index, and group comparisons were made using the Wilcoxon rank-sum test.
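
For reference, the Shannon index of a single sample can be computed as in the following minimal sketch. It assumes the natural logarithm, since the base used by ResMicroDb is not stated here. The function names are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def observed_taxa(counts):
    """Number of taxa with a non-zero count in the sample."""
    return sum(1 for c in counts if c > 0)


sample = [10, 10, 10, 10, 0]   # illustrative read counts per taxon
print(shannon_index(sample), observed_taxa(sample))
```

Four equally abundant taxa give H = ln(4), and a sample dominated by a single taxon gives H = 0.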

Beta diversity was evaluated through principal coordinates analysis (PCoA) based on Bray-Curtis distance, with group differences tested using permutational multivariate analysis of variance (PERMANOVA).

2.2.2.3 Microbial Co-occurrence Network Analysis

We constructed microbial co-occurrence networks for disease and control groups separately using the NetCoMi package, with microbial interactions calculated by the SparCC method and network differences between groups compared using the netCompare function.

Network comparisons were performed across various properties of the whole network, the largest connected component, and individual nodes.

Network properties
Network feature | Description
Number of components | Number of connected components. Since a single node is connected to itself by the trivial path, each single node is a component.
Clustering coefficient | A measure of the network's "cliquishness", defined as the arithmetic mean of the local clustering coefficient defined by Barrat et al. It quantifies the degree to which nodes tend to cluster together.
Positive edge percentage | Percentage of edges with a positive estimated association out of the total number of edges. It reflects the balance of cooperative versus competitive interactions.
Edge density | The ratio of the actual number of edges to the largest possible number of edges in the graph, assuming that no multi-edges are present. It measures the overall connectivity saturation of the network.
Natural connectivity | A robustness measure for complex networks, corresponding to the average eigenvalue of the adjacency matrix.
Relative LCC size | The proportion of nodes within the largest connected component (LCC) relative to the total number of nodes in the entire network.
Clustering coefficient | A measure of the cliquishness or degree to which nodes cluster together, calculated for the LCC.
Positive edge percentage | The proportion of positive edges relative to the total number of edges, calculated for the LCC.
Edge density | The ratio of actual edges to the maximum possible number of edges, calculated for the LCC.
Natural connectivity | A measure of structural robustness based on path redundancy, calculated for the LCC.
Vertex connectivity | The minimum number of vertices that must be removed to disconnect the LCC. It measures vulnerability to node loss.
Edge connectivity | The minimum number of edges that must be removed to disconnect the LCC. It measures vulnerability to link disruption.
Average dissimilarity | The mean of the dissimilarity values (e.g., 1 - correlation) across all edges in the LCC.
Average path length | The mean of the shortest paths in the LCC. The average path length of an empty network is 1.
Degree centrality | The number of connections (edges) a node has. Higher values indicate a more central position within the network.
Betweenness centrality | A measure of how often a node lies on the shortest paths between other nodes. It identifies crucial "bridges" or bottlenecks.
Closeness centrality | A measure of how close a node is to all other reachable nodes, computed as the normalized inverse of its total shortest-path distance to those nodes.
Eigenvector centrality | A measure of a node's influence, where connections to other highly influential nodes contribute more to its score.
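
To make a few of these definitions concrete, the sketch below computes edge density, positive edge percentage, and unnormalized degree for a small undirected network given as an edge list. It is a plain-Python illustration under simple assumptions (no multi-edges, no self-loops) and does not reproduce NetCoMi output, which may apply additional normalization. The function name and the example taxa and association values are hypothetical.

```python
def network_summary(n_nodes, edges):
    """Summarize an undirected network.

    edges: list of (node_u, node_v, association) tuples, one per edge.
    """
    n_edges = len(edges)
    max_edges = n_nodes * (n_nodes - 1) / 2   # largest possible number of edges
    degree = {}
    for u, v, _ in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    positive = sum(1 for _, _, w in edges if w > 0)
    return {
        "edge_density": n_edges / max_edges if max_edges else 0.0,
        "positive_edge_percentage": 100.0 * positive / n_edges if n_edges else 0.0,
        "degree": degree,   # nodes without edges are absent (degree 0)
    }


edges = [
    ("Streptococcus", "Prevotella", 0.6),
    ("Prevotella", "Veillonella", -0.4),
    ("Streptococcus", "Veillonella", 0.5),
]
print(network_summary(n_nodes=4, edges=edges))
```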

3. Database Usage

3.1 Home

The Home page provides a general overview of ResMicroDb and its main features. Users can:

  • Perform quick searches via the search box.

  • Click on icons or statistics to navigate to corresponding database sections.

3.2 Sample Sites

The Sample Sites page displays 10 respiratory sites included in ResMicroDb.

a) Associated metadata: Categorizes samples based on metadata such as "Phenotype" and "Sequencing Type." By clicking Microbiome Composition, users can explore detailed microbial profiles (only available for sample sizes ≥10).

b) Dominant taxa: Shows the average relative abundance of the top 15 genera across all nasopharynx samples, healthy samples (including Control samples), and disease samples. Columns represent individual samples, with hierarchical clustering performed using the Bray-Curtis distance matrix and the Ward.D2 method.

c) Taxonomy tree: Displays the taxonomic hierarchy of the top 100 most prevalent genera in all nasopharynx samples, healthy samples (including Control samples), and disease samples. Solid dots indicate genera with a prevalence greater than 0.6, representing core taxa.
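
As a rough illustration of the core-taxa criterion, the following sketch computes prevalence as the fraction of samples in which a genus is present and keeps genera above the 0.6 threshold. Treating any abundance greater than zero as "present" is an assumption, and the function names and example values are hypothetical.

```python
def prevalence(abundances):
    """Fraction of samples in which the taxon is present (abundance > 0)."""
    return sum(1 for a in abundances if a > 0) / len(abundances)


def core_taxa(table, threshold=0.6):
    """table maps a taxon name to its per-sample relative abundances."""
    return sorted(t for t, values in table.items() if prevalence(values) > threshold)


table = {
    "Streptococcus": [0.2, 0.1, 0.3, 0.0, 0.4],   # present in 4 of 5 samples
    "Moraxella": [0.0, 0.0, 0.5, 0.1, 0.0],       # present in 2 of 5 samples
}
print(core_taxa(table))
```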

3.3 Taxa

3.3.1 Overview of All Taxa

The Taxa page displays 3,490 taxa (1,117 genera and 2,373 species) included in ResMicroDb, along with their presence across healthy (including Control samples) and disease samples, as well as the number of microbe-disease associations.

Users can click on any taxon to view detailed information. (The quick search bar on the Home page can also be used for access)

3.3.2 Taxa Details

Here we use Streptococcus as an example to show the contents of this page.

a) Introduction: Includes lineage, NCBI TaxID, description, and links to external databases such as the NCBI Taxonomic Database, Wikipedia, and BugSigDB.

b) Distribution in different sample sites: Displays the distribution of Streptococcus across different sample sites and sequencing types, showing abundance and prevalence in both healthy (including Control samples) and diseased samples. Clicking on a visualization shows the distribution under additional metadata (e.g., distribution in nasopharynx across phenotypes, countries, and age groups).

c) Marker taxon: Lists diseases associated with Streptococcus. Markers are identified on a per-project basis. (Refer to the Identification of Microbe-Disease Associations for detailed marker identification process.)

3.4 Phenotypes

3.4.1 Overview of all phenotypes

The Phenotypes page displays 146 phenotypes included in ResMicroDb, along with the number of associated samples, studies, publications, and microbe-disease associations.

Users can click any phenotype to view detailed information. (The quick search tool on the Home page can also be used for access.)

3.4.2 Phenotype details

a) Basic information: Provides a brief description based on the Experimental Factor Ontology (EFO) and other phenotype ontology databases. It also summarizes the number of associated samples, studies, and publications.

b) Associated marker: Displays microbe-disease associations related to the phenotype. By clicking on Cross-study analysis, users can explore robust microbial markers across projects for the selected sample site and disease.

c) Associated metadata: Categorizes samples based on metadata such as "Sample Site" and "Sequencing Type." Clicking on Microbiome Composition allows users to explore detailed microbial profiles and Cross-study Analysis allows users to conduct cross-study analysis based on the selected sample site (only available for sample sizes ≥10).

3.5 Associations/Markers

The Associations page displays 11,908 microbe-disease associations included in ResMicroDb (default method: MaAsLin2). Users can:

  • Filter associations using the left "Filter by metadata" panel

  • Customize the displayed information columns in the table using the top "Display columns" panel

3.6 Samples

The Samples page displays 106,464 samples/runs included in ResMicroDb. Users can:

  • Filter samples using the left "Filter by metadata" panel

  • Customize the displayed information columns in the table using the top "Display columns" panel

Details about sample curation and standardization can be found on the Metadata curation page.

3.7 Projects

3.7.1 Overview of all projects

The Projects page displays 514 projects included in ResMicroDb. Users can:

  • Filter projects using the left "Filter by metadata" panel

  • Customize the displayed information columns in the table using the top "Display columns" panel

Users can click any project to view detailed information. (The quick search tool on the Home page can also be used for access.)

3.7.2 Project details

a) Basic information: Provides an introduction to the project, links to related publications, and details about associated samples, sequencing types, sample sites, phenotypes, countries, sample counts, and average reads per sample.

b) Associated samples: Contains a table listing all runs/samples included in the project.

3.8 Publications

The Publications page displays 489 publications included in ResMicroDb.

3.9 Microbiome Composition

The Microbiome Composition tool allows users to analyze the microbial composition of a selected population of interest.

Follow these steps to perform the analysis:

  1. Choose "Sample Site", "Phenotype", or "Sequencing Type" to select a population you are interested in.

  2. Click on any row to switch your selection. A "√" symbol will appear at the far left of the selected row, and the row will be highlighted in gray.

  3. Click "Analysis" to get the results.

Figure 3.1 Steps to perform the Microbiome Composition analysis

The results will appear on the page.

a) Dominant taxa: Displays the average relative abundance of the top 15 taxa across selected samples. Each column represents a sample, with hierarchical clustering performed using the Bray-Curtis distance matrix and the Ward.D2 method.

b) Average taxonomic composition: Shows the average microbial composition of all selected samples.

3.10 Sample Similarity Search

Sample Similarity Search identifies the most similar samples in the database based on a user's query. It provides integrated statistical analyses and visualizations using downloadable metadata and microbial composition data.

Figure 3.2 Overview of Sample Similarity Search

Given two microbiome samples A and B, the similarity between them is defined as:

$$\mathrm{Similarity}(A,B) = 1 - \mathrm{Distance}(A,B) \tag{2}$$

We provide three distance metrics, which are widely used in microbiome studies.

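  1. Bray-Curtis

$$D_{BC}(A,B) = 1 - \frac{2\sum_i \min\left(p_i^A, p_i^B\right)}{\sum_i p_i^A + \sum_i p_i^B} \tag{3}$$

  2. Jensen-Shannon Divergence (JSD)

$$D_{JS}(A,B) = \frac{1}{2} D_{KL}(P_A \,\|\, M) + \frac{1}{2} D_{KL}(P_B \,\|\, M) \tag{4}$$

  3. Jaccard

$$D_J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{5}$$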

Where:

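  • A and B refer to two microbiome samples represented by their taxonomic profiles.

  • $p_i^A$: Relative abundance of taxon i in sample A

  • $p_i^B$: Relative abundance of taxon i in sample B

  • $M = \frac{P_A + P_B}{2}$ and $D_{KL}(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i}$, where $P_A$ and $P_B$ are the relative abundance profiles of samples A and B

  • $|A \cap B|$: Number of taxa present in both samples

  • $|A \cup B|$: Total number of unique taxa in either sample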
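
The three distance measures and the similarity defined above can be sketched in plain Python as follows. Profiles are represented as dictionaries mapping taxon names to relative abundances. The sketch assumes non-empty profiles and uses a base-2 logarithm for the Jensen-Shannon divergence so that it lies between 0 and 1; the logarithm base used by ResMicroDb is not stated here. All function names are hypothetical.

```python
import math


def bray_curtis(pa, pb):
    """Bray-Curtis distance between two abundance profiles."""
    taxa = set(pa) | set(pb)
    shared = sum(min(pa.get(t, 0.0), pb.get(t, 0.0)) for t in taxa)
    total = sum(pa.values()) + sum(pb.values())
    return 1.0 - 2.0 * shared / total


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(P || Q) in bits; q must cover p's support."""
    return sum(v * math.log2(v / q[t]) for t, v in p.items() if v > 0)


def jensen_shannon(pa, pb):
    """Jensen-Shannon divergence with the midpoint distribution M."""
    taxa = set(pa) | set(pb)
    m = {t: (pa.get(t, 0.0) + pb.get(t, 0.0)) / 2 for t in taxa}
    return 0.5 * kl_divergence(pa, m) + 0.5 * kl_divergence(pb, m)


def jaccard(pa, pb):
    """Jaccard distance on taxon presence/absence."""
    present_a = {t for t, v in pa.items() if v > 0}
    present_b = {t for t, v in pb.items() if v > 0}
    return 1.0 - len(present_a & present_b) / len(present_a | present_b)


def similarity(distance):
    """Similarity = 1 - Distance."""
    return 1.0 - distance


a = {"Streptococcus": 0.5, "Prevotella": 0.5}
b = {"Streptococcus": 0.5, "Veillonella": 0.5}
for name, fn in [("Bray-Curtis", bray_curtis), ("JSD", jensen_shannon), ("Jaccard", jaccard)]:
    print(name, similarity(fn(a, b)))
```

Bray-Curtis and JSD use the abundances themselves, whereas Jaccard only considers which taxa are present, so the three measures can rank the same candidate samples differently.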

 

Follow these steps for analysis:

  1. Upload Your Data: Provide the microbial composition data for a sample in CSV format. The first column should contain genus or species names, and the second column should list their relative abundances.

  2. Sample Selection Filters: Select the appropriate taxonomic level for your sample. Then choose whether to restrict the comparison to a specific sample site, whether to include healthy samples, and whether to include infant samples.

  3. Distance Measure: Choose a method for calculating similarity.

  4. Click "Run" to start the search. Results will typically be generated within 10–30 seconds.
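
Before uploading, it can help to check that a query file matches the two-column layout described in step 1. The sketch below parses such a CSV and rescales the values so that they sum to 1. Whether the tool expects a header row or pre-normalized values is not stated here, so the sketch simply skips any row whose second field is not numeric. The function names and example values are hypothetical.

```python
import csv
import io


def read_profile(csv_text):
    """Parse 'taxon,abundance' rows into a dict, skipping non-numeric rows."""
    profile = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        try:
            value = float(row[1])
        except ValueError:
            continue  # header line or malformed row
        profile[row[0].strip()] = value
    return profile


def normalize(profile):
    """Rescale abundances so that they sum to 1."""
    total = sum(profile.values())
    if total <= 0:
        raise ValueError("profile has no positive abundance")
    return {taxon: value / total for taxon, value in profile.items()}


text = "Genus,Abundance\nStreptococcus,6\nPrevotella,3\nVeillonella,1\n"
query = normalize(read_profile(text))
print(query)
```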

Figure 3.3 Steps to perform the Sample Similarity Search

The results will appear on the page.

a) Associated samples: Displays the 500 most similar samples along with their associated metadata.

b) Associated samples statistics: Visualizes the similarity metrics of the most similar samples and the distributions of sample sites and phenotypes. Users can adjust the similarity threshold to refine the displayed results.

c) Associated samples microbial composition: Displays microbial composition comparisons between the query sample and the most similar samples.

3.11 Cross-study Analysis

Cross-study Analysis enables comparative analysis of microbial biomarkers, alpha and beta diversity indices, and co-occurrence networks across studies. It supports cross-study analyses to identify robust microbial biomarkers and ecological characteristics across cohorts and countries, cross-disease analyses to uncover disease-specific and shared microbial features, and cross-sample site analyses to contrast microbial communities across sample sites.

Example 1: In ResMicroDb, a disease may span multiple projects. Cross-study comparisons can identify robust microbial characteristics associated with a specific disease (e.g., chronic rhinosinusitis) across multiple cohorts within a selected sample site.

Example 2: Similarly, a sample site may contain samples from multiple diseases. Cross-study comparisons can identify shared and unique microbial characteristics across different diseases within the same sample site (e.g., sputum). These characteristics may either be pan-disease characteristics (shared across diseases) or disease-specific ones (unique to a particular disease).

Example 3: Similarly, a phenotype may be related to multiple sample sites. Cross-study comparisons can identify shared and unique microbial characteristics across different sample sites under the same condition (e.g., COPD).

Here we use Example 1 as a case to show the contents of this tool. Follow these steps to perform the analysis:

  1. Select the sample site or disease of interest.

  2. Add relevant projects to the cart (up to 30 projects). The red number indicates the total number of selected projects. Users can review project details before adding them.

  3. Click "Cross-Study Comparison" to start the analysis.

Figure 3.4 Steps to perform the Cross-study Analysis

3.11.1 Marker

The heatmap below displays consistent and non-consistent disease-associated microbial markers across datasets for chronic rhinosinusitis in nasal samples. Users can further customize the analysis by applying the following filters:

  • DAA Method: Select the differential abundance analysis method.

  • Taxonomy level: Select the taxonomy level (e.g., genus for 16S sequencing, or species for metagenomics and metatranscriptomics).

  • Exclude conflicting markers: Exclude/Include markers with inconsistent enrichment directions across different studies.

  • #Projects: Specify the number of projects that report the marker taxon as significantly different between case and control groups.

  • |gFold change|: Filter by effect size.

  • P.adj Value: Apply a threshold based on the Benjamini-Hochberg adjusted p-value.
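
The combined effect of these filters can be illustrated with a small sketch. It takes per-project differential abundance results, keeps taxa that pass the effect size and adjusted p-value thresholds, optionally drops taxa whose enrichment direction conflicts between projects, and requires a minimum number of supporting projects. The record layout, the function name, the default thresholds for `min_projects`, and the project identifiers are all hypothetical and only serve the example.

```python
def consistent_markers(results, min_projects=2, min_abs_gfc=0.1,
                       max_p_adj=0.2, exclude_conflicting=True):
    """results: list of dicts with keys 'project', 'taxon', 'gfc', 'p_adj'."""
    directions = {}
    for r in results:
        # Keep only significant records, remembering the enrichment direction.
        if abs(r["gfc"]) > min_abs_gfc and r["p_adj"] <= max_p_adj:
            sign = 1 if r["gfc"] > 0 else -1
            directions.setdefault(r["taxon"], {})[r["project"]] = sign
    selected = {}
    for taxon, by_project in directions.items():
        conflicting = len(set(by_project.values())) > 1
        if exclude_conflicting and conflicting:
            continue
        if len(by_project) >= min_projects:
            selected[taxon] = sorted(by_project)
    return selected


results = [
    {"project": "P1", "taxon": "Streptococcus", "gfc": 0.5, "p_adj": 0.01},
    {"project": "P2", "taxon": "Streptococcus", "gfc": 0.3, "p_adj": 0.05},
    {"project": "P1", "taxon": "Prevotella", "gfc": 0.4, "p_adj": 0.02},
    {"project": "P2", "taxon": "Prevotella", "gfc": -0.4, "p_adj": 0.02},
    {"project": "P1", "taxon": "Moraxella", "gfc": 0.6, "p_adj": 0.01},
]
print(consistent_markers(results))
```

In this example only Streptococcus is reported: Prevotella is enriched in opposite directions in the two projects, and Moraxella is supported by a single project.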

Figure 3.5 Marker results in Cross-study Analysis

3.11.2 Diversity

Users can compare alpha diversity (Shannon index) between disease cases and healthy controls across different studies.

Figure 3.6 Alpha diversity in Cross-study Analysis

Users can compare beta diversity between disease cases and healthy controls across different studies.

Figure 3.7 Beta diversity in Cross-study Analysis

3.11.3 Network

Users can compare co-occurrence networks between disease cases and healthy controls across different studies.

Figure 3.8 Network in Cross-study Analysis

3.12 Statistics

The Statistics page provides an overview of the data in ResMicroDb, summarized across three dimensions: sample metadata, sequencing strategy, and publications.

3.13 Download

The Download page provides links to download the Metadata and Abundance tables for all samples included in ResMicroDb.