Abstract
Acute respiratory distress syndrome (ARDS) is a clinically defined syndrome of acute hypoxaemic respiratory failure secondary to non-cardiogenic pulmonary oedema. It arises from a diverse set of triggers and encompasses marked biological heterogeneity, complicating efforts to develop effective therapies. An extensive body of recent work (including transcriptomics, proteomics, and genome-wide association studies) has sought to identify proteins/genes implicated in ARDS pathogenesis. These diverse studies have not been systematically collated and interpreted.
To solve this, we performed a systematic review and computational integration of existing omics data implicating host response pathways in ARDS pathogenesis. We identified 40 unbiased studies reporting associations, correlations, and other links with genes and single nucleotide polymorphisms (SNPs), from 6,856 ARDS patients.
We used meta-analysis by information content (MAIC) to integrate and evaluate these data, ranking over 7,000 genes and SNPs and weighting cumulative evidence for association. Functional enrichment of strongly-supported genes revealed cholesterol metabolism, endothelial dysfunction, innate immune activation and neutrophil degranulation as key processes. We identify 51 hub genes, most of which are potential therapeutic targets. To explore biological heterogeneity, we conducted a separate analysis of ARDS severity/outcomes, revealing distinct gene associations and tissue specificity. Our large-scale integration of existing omics data in ARDS enhances understanding of the genomic landscape by synthesising decades of data from diverse sources. The findings will help researchers refine hypotheses, select candidate genes for functional validation, and identify potential therapeutic targets and repurposing opportunities. Our study and the publicly available computational framework represent an open, evolving platform for interpretation of ARDS genomic data.
Introduction
The acute respiratory distress syndrome (ARDS) is clinically defined as acute hypoxaemic respiratory failure due to non-cardiogenic pulmonary oedema1. It occurs following a variety of insults, both pulmonary and extra-pulmonary. While this definition has been useful in identifying patients at risk of serious morbidity and death2, it overlooks the underlying biology and masks heterogeneity3. Arguably, this has contributed to limited success in developing therapeutics4. In contrast, a biological definition of ARDS may provide the lever necessary for future drug discovery5.
Functional genomics technologies enable hypothesis-free disease characterisation at unprecedented resolution. The emergence of coronavirus disease 2019 (COVID-19) has provided an opportunity to test genetic approaches to drug discovery in a homogeneous subset of ARDS patients. A notable success is the finding that baricitinib, a Janus kinase inhibitor, reduces mortality in patients hospitalised with COVID-196. A priori support for baricitinib7 was greatly enhanced following the discovery of a causal link between elevated tyrosine kinase 2 (TYK2) expression and severe COVID-19 in genome-wide association studies (GWAS)8. The availability of comparable omics data for non-COVID ARDS is limited.
An unresolved challenge is how large omics data can be effectively exploited9. Specifically, how can we combine data from heterogeneous sources to derive new insights or recalibrate our understanding in the light of new data? We have proposed meta-analysis by information content (MAIC) as a data-driven, algorithmic method for combining gene lists from diverse sources10. MAIC is agnostic to the quality or methodology of the sources and combines ranked or un-ranked gene sets by calculating weights for each list and gene, and iteratively updating them to converge on a ranked meta-list. We have successfully applied MAIC to host-genomics studies of influenza A10 and coronavirus infection8,11, and shown that it out-performs existing algorithms when combining ranked and un-ranked lists obtained from heterogeneous sources12.
In this work, we present a living meta-analysis by information content of ARDS host genomics studies. This serves as an open-source resource for gene prioritisation, functional genomics, and drug target discovery. An interactive interface can be accessed at https://baillielab.net/maic/ards, alongside a complementary R package.
Results
Systematic review
We first conducted a systematic review of existing genome-wide studies, which reported associations between genes, transcripts, or proteins and ARDS susceptibility, severity, survival, or phenotype. Our search yielded 8,937 unique citations (Fig. S1). We retrieved 74 articles for full-text evaluation and included 40 in our meta-analysis13–52. These 40 studies produced 44 unique gene lists (22 transcriptomic, 13 proteomic, and 9 based on genome-wide association studies (GWAS); see Table 1). Three studies reported results from multiple methodologies33,38,44, and several used more than one tissue type18,21,32. Excluding GWAS, 14 gene lists (40%) were derived from lung or airways samples, and 21 (60%) from blood. We could not retrieve one gene list26. No whole-genome sequencing GWAS were found, and only 36% (n=8) of transcriptomic lists used next-generation sequencing techniques. The earliest included study was published in 200418; however, almost half (n=19, 47.5%) were published in the last 5 years.
Most studies aimed to identify genes or proteins associated with ARDS susceptibility (n=27, 67.5%). The remainder examined associations with survival (n=6, 15%), sub-phenotype (n=4, 10%), disease progression (n=2, 5%), or severity (n=1, 2.5%). In total, studies included 6,856 patients with ARDS.
Meta-analysis by information content (MAIC)
We analysed all 43 available gene lists using MAIC. Lists were categorised by method (i.e., GWAS, transcriptomics, and proteomics) and technique (e.g., RNA-seq, mass spectrometry; see Table 1). In total, we ranked 7,085 unique genes (or SNPs), with a median of 27 genes per gene list (range 1-4,954). The top 100 ranked genes are summarised in Figure 1. Most genes were found in a single category (n=5,866, 82.8%); only 157 (2.2%) were identified in ≥ 3 categories, with the maximum number of categories supporting a gene being 5 (Figure 1). Similarly, few genes (n=362, 5.1%) were identified by more than one method, with only AKR1B10, HINT1, HSPG2, S100A11, and SLC18A1 present in transcriptomic, proteomic, and GWAS-based lists.
To prioritise genes for further investigation, we used the unit invariant knee method53 to identify the inflection point in the MAIC score curve. This prioritised 1,306 genes with scores above this point (Figure 1). These genes were more likely to be found in ≥ 2 lists or categories and by more than one method (Figure 1).
To assess the influence of individual lists, we calculated the total MAIC score (totMS), reflecting the sum of gene scores across each list (Fig. S2), and the contributing total MAIC score (ctotMS), measuring the sum of each list's gene scores which contribute to a gene’s overall MAIC score. To obtain relative values, we divided the totMS/ctotMS for each list by the total across all included lists. This demonstrated that only 10 lists (from 9 studies) contributed >1% by either metric (Tab. S1). Notably, the RNA-seq list from Sarma et al.44 accounts for >50%, a function of its length. To account for this, we normalised totMS/ctotMS by the number of genes per list; along with the proportion of replicated genes in each list, this provides an alternative perspective, with several proteomic studies ranking highly (Fig. S2).
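The two list-level metrics can be illustrated with a short sketch. It follows the glossary definitions of totMS and ctotMS only; the function name, the input layout, and the tie-breaking rule (the first list encountered wins a tie) are assumptions made for illustration and are not taken from the ARDSMAICr package.

```python
from collections import defaultdict


def list_contributions(gene_scores, categories):
    """Illustrative totMS / ctotMS calculation.

    gene_scores: {list_name: {gene: gene_score}}  (hypothetical layout)
    categories:  {list_name: category}
    """
    # totMS: sum of every gene score in a list.
    tot = {name: sum(scores.values()) for name, scores in gene_scores.items()}

    # For each (gene, category) pair, keep only the list with the greatest
    # gene score; only that score contributes to the gene's MAIC score.
    best = {}
    for name, scores in gene_scores.items():
        cat = categories[name]
        for gene, score in scores.items():
            key = (gene, cat)
            if key not in best or score > best[key][1]:
                best[key] = (name, score)

    # ctotMS: sum of the contributing gene scores per list.
    contributing = defaultdict(float)
    for name, score in best.values():
        contributing[name] += score
    ctot = {name: contributing.get(name, 0.0) for name in gene_scores}

    def relative(values):
        total = sum(values.values())
        return {k: (v / total if total else 0.0) for k, v in values.items()}

    return tot, ctot, relative(tot), relative(ctot)
```

Dividing each list's totMS or ctotMS by its number of genes would give the length-normalised variant described above.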
Comparison with existing ARDS sources and COVID-19
To place our meta-analysis results in context, we evaluated the overlap between the genes prioritised by MAIC and those from two established resources: BioLitMine54, using an ARDS MeSH search, and the ARDS Database of Genes55 (Fig. S3a and Fig. S3c). A search using BioLitMine identified 271 ARDS-associated genes, of which 142 (52.4%) were present in our analysis. Almost half of the overlapping genes (n = 63, 44.4%) ranked within our prioritised set (Tab. S2).
After correcting for historical gene symbol aliases, we matched 4 additional genes from the BioLitMine search. A further 104 genes were supported by just a single publication (Fig. S3b). For each of the remaining 21 genes, we obtained the 100 most co-expressed genes using ARCHS456 (returning data for 18) and assessed the overlap of these sets with the results of ARDS MAIC; two-thirds exhibited <50% overlap (Fig. S3b). Of the 239 genes catalogued in the ARDS Database of Genes, 177 (74.1%) were also found in our study. However, both sources contain gene associations which lack genome-wide support.
Finally, we compared the overlap between genes ranked by ARDS MAIC and those identified in a previous MAIC of the host response to coronaviruses11 (Fig. S3d). In total, 2,606 genes (36.8%) were shared, of which 143 were prioritised by both analyses (Fig. S3e).
Tissue and cell-specific expression
While most gene lists were derived from blood sampling, most genes were identified in airways samples (n=5,847, 82.5%) (Fig. S4a). This was equally the case for the prioritised gene set; however, the majority of these genes were also identified in blood samples (n=818, 62.6%) (Fig. S4b). Among genes uniquely identified in lists obtained from blood samples (n=1,238), almost three-quarters are known to be expressed in the lung (HPA scRNA-seq data, ≥ 5 normalised transcripts per million (nTPM)), with a quarter being highly-expressed (≥ 100 nTPM) (Fig. S4c).
For prioritised genes found in lists obtained from airways sampling, there was a wide variety of cell-specific expression (Fig. S4d). However, in the smaller set of prioritised genes identified solely in lists employing blood sampling, clusters of expression specific to neutrophils, T cells, and monocytes were evident (Fig. S4e). Cell-type specific gene enrichment analysis suggests innate immune as well as epithelial and endothelial cell types are enriched among genes identified in airways samples (Fig. S4f). However, enrichment of epithelial and endothelial cells was not evident for prioritised genes identified from blood sampling alone (Fig. S4g).
Functional enrichment
Having identified a set of prioritised genes, we undertook several functional enrichment analyses. First, we performed over-representation analysis (ORA). In Reactome, 51 terms were significantly enriched (P < 0.001) (Figure 3). Not unexpectedly, neutrophil degranulation and several innate immune pathways (e.g., IL-10 signalling, interferon signalling, MHC II antigen presentation, TLR4 cascade) featured heavily. However, multiple pathways associated with cholesterol biology and metabolism (e.g., chylomicron assembly/remodelling, GLUT4 translocation, TP53 regulation of metabolic genes, insulin regulation) were also over-represented. Similarly, lipid and cholesterol metabolism, as well as hyperlipidaemia, were over-represented in KEGG and WikiPathways (Fig. S5a and Fig. S5b). In an enrichment analysis using the GWAS Catalog, the prioritised set of genes was associated with asthma (adult onset/time to onset), monocyte, lymphocyte, and eosinophil counts, aspartate aminotransferase levels, and levels of apolipoprotein A1 (Fig. S5d).
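For readers unfamiliar with over-representation analysis, the underlying test for a single term can be sketched as a one-sided hypergeometric tail probability. This is a generic illustration, not the g:Profiler procedure used here, and it applies no multiple-testing correction (g:SCS is not reproduced); the function and argument names are assumptions.

```python
from math import comb


def ora_p_value(universe_size, term_size, query_size, overlap):
    """P(X >= overlap) when drawing `query_size` genes without replacement
    from a universe of `universe_size` genes, `term_size` of which are
    annotated to the term of interest."""
    upper = min(term_size, query_size)
    denominator = comb(universe_size, query_size)
    numerator = sum(
        comb(term_size, k) * comb(universe_size - term_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / denominator
```

In this framing, the query is the prioritised gene set, the universe is all annotated genes, and a small P value indicates that the term's genes appear in the query more often than expected by chance.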
Next, we used the prioritised set of genes to create a protein-protein interaction (PPI) network. We graph-clustered this network, identifying 48 clusters with ≥ 5 members. Among the 10 largest clusters, we found programs associated with the proteasome, cholesterol metabolism, interferon signalling, IL-6 signalling, and the complement cascade (Fig. S6). We then sought to use the PPI network to identify hub genes using an ensemble of topological methods. This analysis suggests 51 genes as being central to the wider network, which fall into clear clusters implicating plausible biological pathways, including innate immune cytokine signalling, and interferon response (Figure 2). The majority of hub genes (n=31, 61%) are currently druggable and include targets such as IL-6, IL-17A, IL-18, and MAP3K14.
Sub-groups
To address the disparate range of study designs included in the overall analysis, we applied MAIC to key subsets of gene lists, with two different study designs: studies of ARDS versus non-ARDS controls (i.e. presence/absence of ARDS) (n=28) and studies of ARDS survival and severity (n=7) (Figure 3).
For ARDS vs. non-ARDS controls, there were 15 transcriptomic (54%), 7 GWAS (25%), and 6 proteomic studies (21%). Together, these studies included 5,713 patients with ARDS. MAIC ranked 2,096 genes (Figure 3). The majority of these (n=1,222; 58%) were unique to this sub-group (Figure 3). Most were identified in blood, with a small fraction found solely in airways samples. The inflection point method prioritised the top-ranked 130 genes (Fig. S7a). In comparison to the BioLitMine search and the ARDS Database of Genes, 71/271 and 117/239 genes were found among this sub-group, respectively (Fig. S7b). A single study, a microarray-based transcriptomic list from Juss et al.30, contributed the largest total MAIC score in this analysis (Tab. S3). ORA using Reactome, KEGG, and WikiPathways identified 25 significantly enriched pathways, including multiple terms related to cholesterol metabolism and glycolysis (Figure 4). A consensus of topological models identified 7 hub genes within a PPI network of prioritised genes. These genes cluster in a single grouping related to cholesterol metabolism (Figure 3).
In the survival/severity analysis, there were 8 gene lists, consisting of 3 transcriptomic lists (37.5%), 3 proteomic lists (37.5%), and 2 very small GWAS (25%). Together, these studies included 644 patients with ARDS. MAIC ranked 463 genes (Figure 3). Approximately half of these (n=238, 51%) were unique to survival-based lists. In contrast to the ARDS vs. non-ARDS analysis, most survival genes were found in airways samples. Thirty-three genes were prioritised (Fig. S7d). In total, 32/271 of the BioLitMine ARDS-associated genes and 23/239 of the ARDS Database of Genes genes were found among the ARDS MAIC survival set (Fig. S7e). The proteomic and transcriptomic lists from Bhargava et al.15 and Morrell et al.40 each contributed approximately 30% of the summed ctotMS of all included gene lists (Tab. S4). IL-10 and IL-18 signalling pathways were both significantly enriched in ORA (Figure 4). Graph-based (MCL) clustering of the prioritised set of survival genes identified a single large cluster of immune-related genes including IL-10, CXCL8, TNFRSF1A, and IL2RA (Figure 4).
Discussion
Our large-scale meta-analysis of the genomic landscape of ARDS prioritises 1,306 genes. Using wide inclusion criteria, we capture a diverse range of study designs and methods; the subsequent application of MAIC downgrades noisy, irrelevant, or low-quality information. These results have three main applications. First, they can be used to better understand the pathobiology of ARDS, providing a resource to prioritise future in-vitro and in-vivo studies and permitting comparisons between important sub-groups. Second, they prioritise therapeutic targets, serving as a resource against which novel and repurposed treatments can be screened. Third, they serve as a base for quantifying the novelty or additive nature of future ARDS studies using high-throughput technologies.
Our review included 40 studies taking a genome/proteome-wide approach with a variety of aims and methods. The rate at which this form of study is being published is increasing; half of all studies were published in the last 5 years and a quarter since 2020. However, few studies employed next-generation sequencing (NGS) techniques or equivalent, and only two were single-cell RNA-seq studies. A partial explanation may be the emergence of COVID-19, which is likely to have consumed the attention of many research teams active in this field. We anticipate that an increasing number of non-COVID ARDS single-cell and NGS studies will emerge in the coming years. This reinforces the requirement for methods capable of meta-analysing multi-omic data57. Less obviously, a minority of studies have sampled the lung in ARDS, with only four examining the bulk transcriptome in the distal airspace. Reliance on information derived from blood samples may present a skewed picture of the pathobiology of ARDS and may be a missed opportunity to identify novel targets in the lung58.
A key advantage of the MAIC approach is its ability to integrate diverse data sources and deprioritise irrelevant information or noise. Traditional methods of gene list meta-analysis rely on simple vote counting or robust rank aggregation59. Instead, MAIC applies a data-derived weighting to each gene list, allows the investigator to define granular categorisation (preventing any one particular method from overwhelming the analysis), and permits the inclusion of both ranked and unranked lists. For data structures common in biological research (high noise, heterogeneity between studies, large input lists), MAIC outperformed other methods in a comprehensive simulation12. We have previously used MAIC to identify anti-viral genes in response to influenza A infection10 and COVID-1911.
Our results reinforce existing associations and reveal some new insights. The functional prominence of innate immunity and cytokine signalling - in particular neutrophil-related activity - is well described in the ARDS literature60, as is the high ranking of genes such as CXCL861, IL-1862, MMP963, and MUC164. However, we also identify several genes that are consistently highly ranked in multiple studies, but have not been extensively discussed in the literature. Histidine triad nucleotide binding protein 1 (HINT1), ranked 10th in our MAIC analysis, is one of only 5 genes to have support from GWAS, transcriptomics, and proteomics methods. To our knowledge, no role for HINT1 has previously been suggested in ARDS65. However, HINT1 has been implicated in T-cell response66, immunoregulation67, and apoptosis65. There is significant enrichment of cholesterol uptake, efflux, and esterification pathways among prioritised genes68,69. Stratification by sub-group revealed a tight cluster of genes important in cholesterol metabolism at the hub of those prioritised in ARDS vs. non-ARDS controls. This is of considerable therapeutic relevance given the potential role of drugs targeting this pathway in ARDS therapy70,71.
Multiple distinct pathways were identified in the setting of ARDS vs. non-ARDS controls, including type I interferon signalling72, MHC class II antigen presentation73, cell-cell adhesion74, and natural killer cell cytotoxicity75. In contrast, genes prioritised in our severity/outcome analysis are functionally more homogeneous and related to cytokine signalling, in particular IL-10 and IL-18 signalling. Our approach cannot determine whether this indicates a real difference between pathways active in ARDS (from the ARDS vs. non-ARDS analysis) and pathways associated with severity/survival, or an imbalance of study design or a lack of statistical power to detect some pathways in the severity/survival studies that have been conducted. We provide an open platform and associated tools to enable deeper mining of the output, allowing others to re-analyse the data based on alternative sub-group divisions or to integrate unseen information (https://github.com/baillielab/ARDSMAICr). In future, the addition of data from new technologies, and in greater scale and precision from existing technologies, is expected to substantially improve this analysis. For this reason, we consider this report to be the beginning of an ongoing, community-led multi-omic data integration.
Our approach has limitations. The majority of the original studies do not have designs that support causal inference, so we make no attempt to determine causality. Different methodologies, such as large-scale GWAS/meta-analysis and genome-wide summary Mendelian randomisation, may support causal inference in future. At present the available GWAS data are underpowered for this purpose. We purposefully excluded single-gene or candidate genetics studies. In the case of a gene with extensive evidence from the latter, our methodology may underestimate its association with ARDS. However, these study designs are subject to other limitations, such as publication and investigator biases and spurious associations arising from underpowered studies76. The limitations of the available data prevented us from accounting for direction of expression or effect. For a given gene, if the direction of expression differs between studies, we may therefore overestimate the strength of evidence associated with that gene. This also limits the scope of functional enrichment analyses which can be performed. Finally, the paucity of available data, and in particular the limited number of studies reporting data from ARDS subtypes or single-cell transcriptomics (or proteomics) studies, is an unavoidable limitation. It is likely that many pathological perturbations are highly cell-type and -state specific, and specific to distinct underlying disease processes, which may not be apparent in bulk analyses of heterogeneous tissues identified using syndromic definitions77.
Our study provides a first step in systematically integrating decades of work in ARDS. Our results implicate potential therapeutic targets including interferon signalling and cholesterol metabolism dysregulation. Enrichment patterns and sub-group differences also give clues to genomic drivers of susceptibility, outcomes, and mortality. We show that combining existing data reveals new insights that were not observed in the original studies, and provide a framework for a living summary of the genomic landscape of ARDS.
Methods
The systematic review and meta-analysis protocol was registered with the International Prospective Register of Systematic Reviews (PROSPERO; CRD42022306270). The review is reported in compliance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines78.
Search strategy and selection criteria
A detailed description of our search strategy and eligibility criteria is provided in the Supplementary Methods. Briefly, we searched MEDLINE, Embase, bioRxiv, medRxiv, the ARDS Database of Genes55, and the NCBI Gene Expression Omnibus from inception to April 1st, 2023 without language restrictions. We also performed single-level backwards and forwards citation searches using SpiderCite79 and hand-searched recent review articles80–83.
We included human genome-wide studies reporting associations between genes, transcripts, or proteins and ARDS susceptibility, severity, survival, or phenotype, accepting any contemporaneous ARDS definition. We excluded paediatric studies (age < 18 years), animal studies, in-vitro human ARDS models, candidate in-vivo or in-vitro studies (< 50 genes/proteins), candidate gene associations, and studies with < 5 patients per arm (except scRNA-seq).
Outcomes
We retrieved ranked lists of genes associated with the ARDS host response, preferring measures of significance and adjusted P values over raw P values when multiple ranking measures were used. We obtained both summary lists (all implicated genes) and author-defined subgroup lists. To combine subgroup lists into summary lists, we took the minimum P value or maximum effect size. We excluded genes below the author-defined threshold for significance/effect magnitude. If unavailable, we excluded genes with P > 0.05, z-score < 1.96, or log fold change < 1.5.
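The combination of subgroup lists into a summary list can be sketched as follows for the P value case. This is an illustration under stated assumptions, not the extraction code used in the study: each subgroup list is assumed to be a mapping from gene symbol to P value, the fallback cut-off of 0.05 is applied, and ties are broken alphabetically.

```python
def summarise_subgroups(subgroup_lists, p_threshold=0.05):
    """Merge author-defined subgroup lists into one ranked summary list.

    subgroup_lists: iterable of {gene: p_value} mappings (hypothetical layout).
    Returns gene symbols ordered by their minimum P value across subgroups.
    """
    best = {}
    for subgroup in subgroup_lists:
        for gene, p_value in subgroup.items():
            # Keep the smallest P value seen for each gene.
            if gene not in best or p_value < best[gene]:
                best[gene] = p_value

    # Exclude genes whose best P value exceeds the threshold.
    kept = {gene: p for gene, p in best.items() if p <= p_threshold}

    # Rank by P value, breaking ties alphabetically for reproducibility.
    return sorted(kept, key=lambda gene: (kept[gene], gene))
```

The same pattern would apply to effect sizes by taking the maximum rather than the minimum and reversing the sort order.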
Study selection and data extraction
Article titles and abstracts from our search were stored in Zotero v6.0-beta (Corporation for Digital Scholarship, United States). Titles were initially screened by one author using Screenatron79. Two authors then independently screened abstracts against eligibility criteria, with a third resolving inconsistencies. Full texts and supplements of eligible studies were retrieved and inclusion adjudicated by consensus.
Data were extracted by one author and cross-checked by a second. Gene, transcript, or protein identifiers were mapped to HGNC symbols or Ensembl/RefSeq equivalents if no HGNC symbol was available. Unannotated SNPs were searched in NCBI dbSNP. miRBase (University of Manchester, United Kingdom) provided miRNA symbols. For microarray probes without symbols, we used the DAVID Gene Accession Conversion tool (Laboratory of Human Retrovirology and Immunoinformatics, Frederick National Laboratory for Cancer Research, United States) to map them to HGNC symbols. We extracted information relating to study design, methodology, tissue/cell type, demographics, ARDS aetiology, risk factors, severity, and outcomes.
Meta-analysis by information content (MAIC)
The MAIC algorithm has been described in detail8,10,11,84. Full documentation and the source code are available at https://github.com/baillielab/maic. Briefly, MAIC combines ranked and unranked lists of related named entities, such as genes, from heterogeneous experimental categories, without prior regard to the quality of each source. The algorithm makes four key assumptions: (1) genes associated with ARDS exist as true positives, (2) a gene is more likely to be a true positive if it is found in more than one source, (3) the probability of being a true positive is enhanced if the gene appears in a list that contains a higher proportion of replicated genes, and (4) the probability is further enhanced if it is found in more than one category of experiment. Based on these assumptions, MAIC compares lists with each other, forming a weighting for each source based on its information content, which is then used to calculate a score for each gene. The output is a ranked list summarising the total information supporting each gene’s association with ARDS. We have shown MAIC outperforms available algorithms, especially with ranked and unranked heterogeneous data84.
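The iterative weighting idea can be conveyed with a deliberately simplified sketch for unranked lists. It is not the implementation at https://github.com/baillielab/maic: the update rule (list weight as the square root of the mean support its genes receive from other lists), the stopping rule, and all names are assumptions chosen to mirror the four assumptions listed above, and ranked lists are not handled.

```python
from math import sqrt


def gene_support(gene, lists, categories, weights, exclude=None):
    """Sum, over categories, of the largest weight among lists containing the gene."""
    best_per_category = {}
    for name, members in lists.items():
        if name == exclude or gene not in members:
            continue
        cat = categories[name]
        best_per_category[cat] = max(best_per_category.get(cat, 0.0), weights[name])
    return sum(best_per_category.values())


def maic_like(lists, categories, max_iter=200, tol=1e-9):
    """lists: {list_name: set_of_genes}; categories: {list_name: category}."""
    weights = {name: 1.0 for name in lists}
    for _ in range(max_iter):
        updated = {}
        for name, members in lists.items():
            # A list is informative if its genes are corroborated elsewhere.
            support = [
                gene_support(g, lists, categories, weights, exclude=name)
                for g in members
            ]
            updated[name] = sqrt(sum(support) / len(support)) if support else 0.0
        converged = max(abs(updated[n] - weights[n]) for n in lists) < tol
        weights = updated
        if converged:
            break

    genes = set().union(*lists.values())
    scores = {g: gene_support(g, lists, categories, weights) for g in genes}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked, weights
```

In this toy version, a gene found in several lists from different categories accumulates one weight per category, whereas repeated appearances within one category count only once, through the best-weighted list. A list with no replicated genes receives a weight of zero, which is a property of the sketch rather than a statement about MAIC itself.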
As our primary analysis, we performed MAIC on all summary gene lists, regardless of study focus. Lists were assigned categories based on their methodology and experimental technique: genome-wide association study (GWAS) - genotyping, GWAS - whole exome sequencing, transcriptomics - microarray, transcriptomics - RNA-sequencing (RNA-seq), transcriptomics - single cell RNA-seq (scRNA-seq), proteomics - mass spectrometry, and proteomics - other. For secondary analyses, we performed MAIC on subsets of lists based on study focus (i.e., susceptibility to ARDS or survival/severity).
In secondary analyses, we repeated this pipeline for gene lists arising from studies in which the focus was ARDS vs. non-ARDS controls or ARDS survival/severity.
For each MAIC iteration, we prioritised genes with sufficient evidential support for further study (i.e., the gene set before which information content diminished such that there was little/no corroboration for the remainder’s ARDS association). We used the unit invariant knee method53,85 to identify the elbow point in the best-fit curve of MAIC scores. Genes with values above this point were prioritised for downstream analyses.
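The general idea of locating an elbow in a descending score curve can be sketched with a simple heuristic: rescale ranks and scores to the unit square and take the point farthest from the straight line joining the first and last points. This is a stand-in for illustration and is not the unit invariant knee method cited above, which may select a different point; the function name is an assumption.

```python
def elbow_index(scores):
    """Return the index of the elbow in a list of scores sorted in descending order."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three scores to locate an elbow")
    highest, lowest = scores[0], scores[-1]
    if highest == lowest:
        raise ValueError("scores are constant; no elbow exists")

    best_index, best_distance = 0, -1.0
    for i, score in enumerate(scores):
        x = i / (n - 1)                              # rank rescaled to [0, 1]
        y = (score - lowest) / (highest - lowest)    # score rescaled to [0, 1]
        # The chord from (0, 1) to (1, 0) is the line x + y = 1, so the
        # perpendicular distance to it is proportional to |x + y - 1|.
        distance = abs(x + y - 1.0)
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index
```

Genes whose scores exceed the score at the returned index would then form the prioritised set.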
ARDS literature and SARS-CoV-2 associations
We used BioLitMine54 to query the NCBI Gene database for genes associated with the Medical Subject Heading (MeSH) term “Respiratory Distress Syndrome, Acute”, generating a list of genes and publications. We descriptively compared the overlap between this list and the MAIC-ranked gene list. Similar comparisons were made between the ARDS MAIC results and the gene set in the ARDS Database of Genes55 and a prior MAIC of SARS-CoV-2 host genomics11.
Tissue expression and enrichment
Transcript and protein expression data for genes included in ARDS MAIC were retrieved from the Human Protein Atlas (HPA, version 21.0)86. We investigated mRNA expression in a consensus scRNA-seq dataset of 81 cells from 31 sources (https://www.proteinatlas.org/about/assays+annotation#singlecell_rna) and in the HPA RNA-seq blood dataset87, containing expression levels in 18 immune cell types and total peripheral blood mononuclear cells. To investigate protein expression, we retrieved tissue-specific expression scores from the HPA88. We conducted cell-type specific enrichment analysis using WebCSEA89 and extracted the top 20 general cell types for each query.
Functional enrichment
We performed functional enrichment of genes against the universe of all annotated genes using g:Profiler90. The following data sources were used: Kyoto Encyclopaedia of Genes and Genomes (KEGG)91, Reactome92, WikiPathways93, and Gene Ontology94. Multiple testing was corrected for using the g:SCS algorithm90, with a threshold of P < 0.01. Input lists were ordered by MAIC score where appropriate. In the case of GO cellular component terms, we used the REVIGO tool to perform multi-dimensional scaling of the matrix of all pairwise semantic similarities95. Enrichment was also performed against the National Human Genome Research Institute GWAS Catalog96 using the Enrichr web-interface97. Protein-protein interaction enrichment was performed using STRING v1198. We included all possible interaction sources but specified a minimum interaction score of 0.7. We used the whole annotated genome as the statistical background. The MCL (Markov Clustering) algorithm[PMID: 22144159] was applied to the resulting network with an inflation parameter of 3. Clusters were annotated by hand having considered enrichment against KEGG, Reactome, and WikiPathways. To identify hub genes within the PPI network, we used cytoHubba99 and Cytoscape100. The highest ranked genes by Maximum Neighbourhood Component (MNC), Maximal Clique Centrality (MCC), Density of MNC (DMNC), Edge Percolated Component (EPC), and node degree were retrieved. The intersecting genes of these methods were deemed hub genes. Hub genes were searched for in the Drug Gene Interaction Database101 to identify if they were present in the druggable genome. The Drug Gene Interaction Database (DGIdb) was queried for each ranked gene102.
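The final intersection step for hub genes can be sketched as below. The sketch assumes that per-gene scores for each topological method have already been exported (for example from cytoHubba) into plain mappings; the function name, the `top_n` cut-off, and the input layout are illustrative assumptions, and the topological metrics themselves are not computed.

```python
def hub_genes(method_scores, top_n):
    """Intersect the top-ranked genes across several topological methods.

    method_scores: {method_name: {gene: score}}  (hypothetical layout)
    top_n: number of highest-scoring genes taken from each method
    """
    top_sets = []
    for scores in method_scores.values():
        # Rank by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
        top_sets.append(set(ranked[:top_n]))
    # A hub gene must appear in the top set of every method.
    return set.intersection(*top_sets) if top_sets else set()
```

Requiring agreement across all methods is conservative: a gene ranked highly by only some metrics is excluded, which keeps the hub set small at the cost of sensitivity.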
Software and code availability
MAIC is implemented in Python v3.9.7 (Python Software Foundation, Wilmington, United States). All other analyses were performed with R v4.2.2 (R Core Team, R Foundation for Statistical Computing, Vienna, Austria). Code required to reproduce the analyses is available at https://github.com/JonathanEMillar/ards_maic_analysis. An R package (ARDSMAICr) containing the data used in this manuscript and several functions helpful in analyses is available at https://github.com/baillielab/ARDSMAICr.
Data Availability
All data produced are available online at https://github.com/JonathanEMillar/ards_maic_manuscript/tree/main/data
Supplementary Material
Supplementary Methods
Search Strategy
We used the following strategy to search MEDLINE and a direct translation to search Embase.
1 exp Respiratory Distress Syndrome, Adult/
2 "acute lung injury*".ti,ab,kf,kw
3 1 OR 2
4 "gene*".mp
5 "genome*".mp
6 "transcript*".mp
7 "protein*".mp
8 4 OR 5 OR 6 OR 7
9 3 AND 8
10 ("COVID-19*" OR "COVID19*" OR "COVID-2019*" OR "covid").ti,ab,kf,kw
11 ("SARS-CoV-2*" OR "SARSCov-2*" OR "SARSCoV2*" OR "SARS-CoV2").ti,sh,kf,kw
12 ("2019-nCoV*" OR "2019nCoV*" OR "19-nCoV*" OR "19nCoV*" OR "nCoV2019*" OR "nCoV-2019*" OR "nCoV19*" OR "nCoV-19*").ti,ab,kf,kw
13 10 OR 11 OR 12
14 9 NOT 13
15 Letter.pt OR Conference Abstract.pt OR Conference Paper.pt OR Conference Review.pt OR Editorial.pt OR Erratum.pt OR Review.pt OR Note.pt OR Tombstone.pt
16 14 NOT 15
17 exp *adolescence/ or exp *adolescent/ or exp *child/ or exp *childhood disease/ or exp *infant disease/ or (adolescen* or babies or baby or boy? or boyfriend or boyhood or girlfriend or girlhood or child or child* or child*3 or children* or girl? or infan* or juvenil* or juvenile* or kid? or minors or minors* or neonat* or neo-nat* or newborn* or new-born* or paediatric* or peadiatric* or pediatric* or perinat* or preschool* or puber* or pubescen* or school* or teen* or toddler? or underage? or under-age? or youth*).ti,kw
18 16 NOT 17
19 ((exp animal/ or nonhuman/) NOT exp human/)
20 18 NOT 19
21 limit 20 to yr="1967-Current"
Inclusion criteria
Inclusion:
Human studies: in-vivo or in-vitro
Adults (age ≥ 18 years)
Acute Respiratory Distress Syndrome (ARDS)
– by any contemporaneous definition
Accepted methodologies:
– CRISPR screen
– RNAi screen
– Protein-protein interaction study
– Host proteins incorporated into virion or virus-like particle
– Genome wide association study
– Transcriptomic study
– Proteomic study
Exclusion:
Children (age < 18 years)
Animal studies
Meta-analyses, in-silico analyses, or re-analysis of previously published data
Excluded methodologies:
– In-vitro human studies simulating ARDS
– Candidate in-vivo or in-vitro transcriptomic or proteomic studies (defined as those investigating < 50 genes)
– Candidate gene association studies
– Studies including fewer than 5 individuals in either the control or ARDS arm
Supplementary Results
Supplementary Data
Supplementary Data File 1. Raw gene list input to MAIC. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_1.csv
Supplementary Data File 2. MAIC output - overall. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_2.csv
Supplementary Data File 3. BioLitMine and ARDS Database of Genes results. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_3.csv
Supplementary Data File 4. MAIC output - ARDS vs. non-ARDS controls sub-group. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_4.csv
Supplementary Data File 5. MAIC output - survival sub-group. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_5.csv
Supplementary Data File 6. Functional enrichment results - overall. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_6.csv
Supplementary Data File 7. Functional enrichment results - ARDS vs. non-ARDS controls sub-group. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_7.csv
Supplementary Data File 8. Functional enrichment results - survival sub-group. https://github.com/JonathanEMillar/ards_maic_manuscript/Supplementary_Data_File_8.csv
Glossary
- MAIC score
- the score assigned by MAIC to a given gene considering all lists.
- Gene score
- the score assigned by MAIC to a given gene in a given list.
- Total MAIC score
- the sum of all scores assigned by MAIC to genes in a given list.
- Contributing total MAIC score
- the sum of all scores assigned by MAIC to genes in a given list where that score contributes to the MAIC score for that gene (i.e., excluding those gene scores that are not used because a gene score from another list in the same category is greater).
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.
- 81.
- 82.
- 83.
- 84.
- 85.
- 86.
- 87.
- 88.
- 89.
- 90.
- 91.
- 92.
- 93.
- 94.
- 95.
- 96.
- 97.
- 98.
- 99.
- 100.
- 101.
- 102.