87
views
0
recommends
+1 Recommend
1 collections
    0
    shares
      scite_
       
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Enhancing the Accuracy of Network Medicine Through Understanding the Impact of Sample Size in Gene Co-expression Networks

      Published
      research-article

            Abstract

            Network medicine relies on RNA sequencing to infer gene co-expression networks, which are crucial to identify functional gene clusters and gene regulatory interactions, and offer a deeper understanding of disease phenotypes and drug mechanisms. It remains unknown, however, how many samples do we need to make reliable predictions. Here, we propose a power-law model to predict the relationship between the number of inferred significant interactions and sample size, allowing us to quantitatively link sample size to the accuracy of the inferred networks. We apply our model to investigate the effect of sample size on biomarker discovery and differentiation of protein–protein interactions from non-interacting pairs, ultimately unveiling the critical role of data quality in generating meaningful predictions in network medicine.

            Main article text

            INTRODUCTION

            Network medicine has benefited significantly from rapid advances in next generation sequencing (NGS) technologies such as microarrays, RNA sequencing (RNA-seq), and single-cell RNA-seq. Through the quantification of context-specific mRNA abundance and the mapping of this information to molecular networks, these techniques offered a deeper understanding of the interconnectivity of biological systems, knowledge that informed personalized diagnoses and therapy choices for patients. Among these technologies, RNA-seq stands out due to its extensive availability of datasets and a comprehensive array of methods for reliable data processing and analysis. RNA-seq provides a more robust and detailed examination of gene expression compared to microarrays, offering improved sensitivity and specificity in detecting transcripts and isoforms, allowing for the discovery of novel transcripts and splice junctions. 1,2 When compared to single-cell RNA-seq, bulk RNA-seq is less expensive, less labor-intensive, and less time-consuming, making it more convenient for large-scale studies. As a result, RNA-seq has led to a mechanistic understanding of multiple disease phenotypes and drug mechanisms, 3,4 and has also reached the clinic, benefiting patients by offering personalized medicine approaches for conditions such as rheumatoid arthritis (RA) 5,6 or ulcerative colitis. 7,8 While single-cell RNA-seq ability to capture cell-to-cell variability offers immense potential for research and clinical applications, it is currently more expensive and complex, with ongoing development needed to fully process its complexity.

            One way network medicine leverages sequencing data is by creating gene co-expression networks, which reveal potential associations between genes based on similarities in their expression patterns. These networks have a number of applications, including the identification of changes between co-expression networks from different conditions through differential co-expression analysis, 911 the identification of modules of co-expressed genes that perform similar functions, 9,12 and the inference of gene regulatory networks involving transcription factors, 13 offering potentially novel insights into the role of cellular interconnectedness in health and disease.

            Note that many different methods have been proposed to construct gene co-expression networks, including Pearson’s and Spearman’s correlation coefficients, weighted gene co-expression network analysis, 14 weighted topological overlap, 15 tree-based ensemble methods like GENIE3, 16 and mutual information-based approaches like ARACNe. 17 Each method captures a different aspect of gene expression-based relationships, each with its advantages and limitations, allowing for analyses tailored to specific research questions and datasets.

            A crucial aspect of interpreting NGS data is the accurate assessment of the appropriate sample number needed to extract the desired mechanistic information. Decisions pertaining to sample sizes not only impact the cost and time invested in a study, but also play a significant role in the accuracy and the reliability of the predictions obtained using co-expression networks. It is therefore crucial to understand the factors that determine the minimal sample size necessary for optimal results. We lack, however, an understanding of the underlying statistical patterns that govern the sample size effect and impact the predictive power of gene co-expression networks. To be sure, previous research has confirmed that networks constructed from a larger number of samples are able to recover a higher number of functional associations 18,19 and display increased reproducibility when compared to networks built from smaller sample sizes. 20,21 Yet, these studies have been limited to relatively small sample sizes (e.g., a maximum of 50 samples from the same dataset 20 ), limiting our ability to explore the impact of sample size on the predictive power of the emerging networks. Other studies explored larger sample sizes by combining different datasets from the same organism (e.g., 2000 samples 21 ), which in turn increased the noise, hence limiting the predictive power. With the increasing use of NGS techniques in the clinic, 58 datasets spanning thousands of individuals are increasingly available; yet, critical questions regarding the impact of sample size on the definition of links within these networks, changes in their topology and modularity, as well as their application in disease studies remain unaddressed. In this work, we address these gaps in knowledge by developing a model to predict the growth of significant correlations in co-expression networks as a function of sample size. The model allows us to accurately estimate the proportion of significant correlations within gene co-expression networks extracted from a gene expression dataset. Additionally, it facilitates the comparison of different datasets based on the rates at which their significant correlations converge with increasing sample size. We show that a key factor is sample homogeneity, measured by both the percentage of weakly correlated genes in the network and the standard deviation (SD) of the gene expression associated with each gene, demonstrating that highly heterogeneous gene expression datasets need larger sample size and tend to have more limited predictive outcomes. In addition, we examine the stability of co-expression weights, finding that networks built from <100 individuals tend to be less stable, measured by its variance. Finally, we investigate the effect of sample size on biomarker discovery and differentiation of protein–protein interactions from non-interacting pairs using co-expression networks, finding that larger sample sizes lead to more stable and accurate predictions. Overall, our findings offer actionable insights into the optimal use of gene co-expression networks in network medicine and highlight the importance of sample size and data quality in extracting meaningful results.

            RESULTS AND DISCUSSION

            Constructing Gene Co-expression Networks

            To investigate the relationship between sample size and network quality, we constructed a benchmark of gene co-expression networks with sample sizes ranging from 20 to 1080 from four datasets: (i) The Cancer Genome Atlas (TCGA), containing RNA-seq data of 40,019 genes and 11,123 samples from 10,261 subjects and 33 cancer types; (ii) GTEx V8, 22 containing RNA-seq data of 56,200 genes and 17,271 samples from 932 subjects and 54 tissue types; (iii) RA dataset, 6 of 24,219 genes and 584 whole blood samples collected from 261 RA patients treated with adalimumab; (iv) GSE193677 dataset, containing RNA-seq data of 18,197 genes and 2490 samples from different tissues and conditions. For each tissue in the datasets (e.g., cancer tissues, healthy tissues, and whole blood cells affected by RA), we constructed the co-expression networks using Pearson’s correlation (ρ), keeping only significant correlations (p-value < 0.05, Bonferroni corrected). We opted for Pearson’s correlation to maintain simplicity in the analysis. Additionally, to assess the reproducibility of our findings, we constructed the same networks using Spearman’s correlation. Altogether, we created one of the most comprehensive datasets of gene co-expression networks to study the effect of the sample size on network quality ( Figure 1 ).

            Figure 1 provides an overview of the methodology used to construct gene co-expression network datasets in four steps. (A) Collection of four gene expression datasets: TCGA, GTEx, rheumatoid arthritis (RA) dataset, and GSE193677 dataset. (B) Gene pre-processing: transform gene identifiers to HGNC symbols and filter out low expression counts. (C) Sub-sampling: random selection of sample groups of varying sizes (20-max) with replacement. (D) Network generation: construction of co-expression networks using Pearsonâs correlation, keeping only statistically significant correlations (p < 0.05, Bonferroni corrected).
            Figure 1.

            Overview of the methodology used to create the gene co-expression network dataset. The process includes four steps: (A) Collection of four gene expression datasets: TCGA (RNA-seq data from 11,123 samples and 10,261 subjects), GTEx (RNA-seq data from 17,271 samples and 932 subjects), RA dataset (584 whole blood samples from 261 RA patients), and GSE193677 dataset (RNA-seq data from 2490 samples and 1162 CD and UC patients); (B) Pre-processing of genes: transform the gene identifiers to HGNC names and filter out the genes with low expression counts; (C) Sub-sampling: random selection of sample groups of varying sizes (20–max) with replacement; (D) Network generation: construction of co-expression networks using Pearson’s correlation with significant correlations (p-value < 0.05, Bonferroni corrected). CD, Crohn’s disease; HGNC, HUGO Gene Nomenclature Committee; RA, rheumatoid arthritis; RNA-seq, RNA sequencing; TCGA, The Cancer Genome Atlas; UC, ulcerative colitis.

            The Discovery Rate of New Significant Links in Gene Co-expression Networks Follows a Power Law

            To understand the effect of sample size on the number of statistically significant links in gene co-expression networks, we first plot the number of links of each network in relation to the sample size used to create the network. We find that, as sample size increases, the number of statistically significant links in the network also increases ( Figure 2A ). However, the effect is non-linear, observing a saturation effect, implying that the number of newly added significant links decreases as the sample size increases. This sublinear pattern is similar to the law of diminishing marginal utility, a phenomenon in economics where the willingness of customers to purchase a product decreases as their consumption increases. 23 To understand the origin of this saturation effect, we compared the number of links at a given sample size (Sn ) with the number of links at the previous sample size (Sn − 1), by measuring the discovery rate:

            (1) f(n)=SnSn1Sn1,

            where n is the sample size. We empirically observed that for each dataset, the discovery rate decreases with the sample size ( Figure 2B ), finding a scaling law that is well approximated by the power law:

            (2) f(n)=bnα.

            Here, α is a non-negative scaling exponent which we call the discovery exponent as it controls the discovery rate of new statistically significant links. Indeed, for α = 0, each new sample contributes the same number of new links, whereas if α → ∞, new samples contribute to the discovery of a diminishing number of new links. In Eq. 2, b is a normalization constant obtained from the empirical fit. The scaling relation (Eq. 2) allows us to analytically derive the relationship between the number of statistically significant links (S) and the number of the samples (n), obtaining the equation (see Materials and Methods):

            (3) S(n)=Lebn(1α)1α.

            In Eq. 3, L is the number of links expected in a complete network, which can be calculated from the total number of genes (NG ) in the dataset via NG (NG 1)/2. Equation 3 helps to estimate the values of the exponent α and the constant b, ultimately allowing us to predict the proportion of statistically significant links for a specific sample size ( Figure 2AD ), capturing the decreasing information gain offered by each new sample.

            To unveil the implications of the discovery rate, we compared the power-law-based model (Eq. 3) with two alternative models: an exponential decay model and a constant discovery model representing random behavior ( Figure 2B ). The exponential decay model, defined as f(n) = be−λn , captures a decline in the discovery of significant links that aligns with the data but is less pronounced at smaller sample sizes. In contrast, the constant discovery model assumes a uniform probability of discovering new links, described by f(n) = b. Using the TCGA breast cancer (TCGA-BRCA) dataset, we observed a marked discrepancy between the constant discovery predicted by the random model and the progressive decline observed in the power-law and exponential decay models. Notably, the power-law model exhibited a superior fit to the data, particularly at smaller sample sizes, compared to the exponential decay model. These results underscore the limitations of the random and exponential decay models in describing the observed dynamics in gene co-expression networks and highlight the significance of the power-law decay in capturing the discovery rate.

            Figure 2 shows the power-law scaling relationship between sample size and significant gene co-expression network links across datasets. (A) A scatterplot illustrates how the number of significant links decreases as sample size increases, following a diminishing return pattern. (B) A model comparison confirms that the power-law model best describes this relationship. (C) A plot compares empirical and predicted fractions of significant correlations in GTEx breast tissue and TCGA breast cancer data. (D) A table summarizes model fit (R²), error (ε), and the power-law exponent (α) across six datasets. (E-G) Scatterplots examine relationships between α and network properties, including strong correlations (E), high standard deviation genes (F), and the sample size needed for 50% significant links (G), revealing predictive power when using α.
            Figure 2.

            Uncovering the power-law scaling relationship between sample size and gene co-expression network link significance. (A) The scatterplot illustrates a decrease in the number of significant links discovered in the network as the sample size used to create the network increases, indicating a diminishing return pattern. The lines in the plot represent the fitting of the analytical model derived from the power-law scaling relation for two different datasets (GTEx breast mammary tissue and TCGA breast cancer), showing a highly accurate fitting (adjusted R2 of approximately 1). (B) The scatterplot compares the fitting of three models—the power-law model (red, R2 = 0.69), the exponential decay model (orange, R2 = 0.51), and the constant discovery model (gray, R2 = 0)—on data from TCGA breast cancer, depicting the relationship between the link discovery rate and sample size. The superior accuracy of the power-law model supports the hypothesis that the discovery of significant links in gene co-expression networks follows a power-law distribution. (C) The plot predicts the network evolution across sample size, comparing empirical values (as points) to analytical model predictions (as lines) for GTEx breast mammary tissue and TCGA breast cancer datasets. (D) The table summarizes the adjusted R2 and mean relative error (ε) values for three models: the power-law model, the exponential decay model (both depicted in Figure 2B ), and the analytical model derived from the power-law relationship (shown in Figure 2A and C ). These metrics were calculated across six datasets (UC inflamed, GTEx breast, GTEx lung, R. arthritis, TCGA lung cancer, and TCGA breast cancer). The results demonstrate improvements in adjusted R2 and reductions in ε when applying the power-law and analytical models, as compared to the exponential decay model. Additionally, for the analytical model, the table includes the discovery exponent (α), which is indicative of the convergence rate. Higher α values correspond to faster-converging datasets (e.g., GTEx breast, α = 1.81) compared to slower-converging datasets (e.g., TCGA breast cancer, α = 1.48). (E) The scatterplot depicts the linear relationship (adjusted R2 of 0.88 and Pearson’s correlation of 0.94) between the discovery rate (α) and the fraction of correlations above 0.2 predicted by the model for 22 subsets of 4 datasets (R. arthritis in orange, GTEx in green, TCGA in blue, and GSE193677 in pink). (F) The scatterplot reveals a relationship (adjusted R2 of 0.479 and Pearson’s correlation of −0.71) between α and the fraction of genes with SD above 10 for 22 subsets. (G) The scatterplot shows the linear relationship (adjusted R2 of 0.865 and Pearson’s correlation of −0.933) between α and the logarithm of the number of samples predicted by the model to achieve 50% of significant links for 16 subsets. The relationship indicates that the greater the α of the model, the smaller the number of samples required to get 50% of significant links, unveiling the potential of the model to effectively predict the number of significant links in co-expression networks. R. arthritis, rheumatoid arthritis; TCGA, The Cancer Genome Atlas; UC, ulcerative colitis.

            Taken together, Eq. 2 and Eq. 3 represent our first key result, quantifying a saturation effect in gene co-expression networks, where the rate of new links discovered decreases as the sample size increases. We find that this saturation effect follows a scaling relationship between the new links discovered and the sample size, allowing us to predict the number of links for a specific sample size. Changes in exponent α reveal different discovery rates, with α = 0 corresponding to the random reference model, while higher values of α imply faster convergence toward a higher rate of discovery of significant links.

            The Power-law Model Predicts the Evolution of the Co-expression Networks Across Sample Size

            Understanding how the number of statistically significant gene co-expression links evolves with sample size is essential for evaluating the reliability of co-expression networks. To facilitate comparisons of the sample size effect across different datasets, we focus on the fraction of significant links, defined as the number of significant links (S) (where p-value < 0.05) divided by the maximum number of possible links in a network (L). This normalization enables a consistent comparison of the sample size required to achieve the same fraction of significant links across datasets ( Figure 2C ). It is important to clarify that L represents the theoretical statistical limit of significant links and does not imply greater biological co-expression. Instead, L reflects the total number of links that could achieve statistical significance given sufficient sample size, serving as a measure of statistical reliability rather than biological relevance. For example, our model predicts that 350 samples are sufficient in the GTEx breast mammary tissue dataset to recover 50% of statistically significant links, whereas 3910 samples are required for the TCGA-BRCA dataset to achieve the same fraction. This indicates that the GTEx dataset has a substantially higher discovery rate of statistically significant links compared to the TCGA-BRCA dataset. However, this does not suggest that the GTEx dataset is more biologically relevant; rather, it highlights that fewer samples are needed in the GTEx dataset for the correlations to achieve statistical reliability.

            To understand the roots of this difference, we note that the discovery rate is captured by the discovery exponent α: datasets with a higher α converge faster to L. As α and, hence, the convergence increase, the probability of discovering new links decreases. We also find that datasets with a higher α have lower percentages of very weakly correlated genes (ρ < 0.2, Figure 2E ), representing links which require more samples to confirm their statistical significance.

            To better understand the differences between datasets, in each dataset we calculate the mean, SD, and coefficient of variation of gene expression and explore their relationship with α ( Supplementary Figure S1 ). We find that there is an inverse relationship between SD, reflecting the dispersion of data as a measure of the heterogeneity (i.e., variability) of gene expression, and α ( Figure 2F ). Indeed, while all datasets have a similarly shaped SD distribution, those having a smaller α are more skewed toward a higher SD, indicating that they display higher heterogeneity. However, we did not observe sufficient correlation (Pearson’s correlation −0.71, R2 0.479) between α and the proportion of genes with high SD to derive a direct model that predicts the evolution of co-expression networks as sample size increases ( Figure 2F ).

            Finally, we employed our model to estimate the α and sample size necessary to obtain a given percentage of significant links in the co-expression network. To be specific, we used our model to calculate the number of samples needed for each dataset to reveal 50% of the network links. Then, we assessed the relationship between the α for each dataset and the logarithm of the sample size necessary to reveal 50% of the links (see Supplementary Figure S2 for all linear and logarithm plots). We obtained a highly accurate inverse relationship ( Figure 2G , Pearson’s correlation −0.93, adjusted R2 0.87), suggesting the utility of α in elucidating how co-expression networks evolve with the addition of more samples. More importantly, the obtained inverse relationship could be used as a rule of thumb to estimate the α and sample size necessary to obtain an arbitrary proportion of statistically significant links.

            To ensure the robustness of our findings, we repeated the analysis using gene co-expression networks constructed with Spearman’s correlation ( Supplementary Figure S3 ). The results observed with Spearman’s correlation are consistent with those obtained using Pearson’s, reinforcing the validity of our model across different statistical methods for network generation.

            In summary, Eq. 3 allows us to predict the number of significant links within gene co-expression networks for a given sample size. Our findings indicate that datasets with a lower percentage of weakly correlated genes (and consequently, less variability in gene expression) exhibit co-expression networks that converge more rapidly. We established an inverse relationship between α and the sample size required to reveal a given percentage of links, allowing us to infer the number of samples necessary to obtain a network with certain fraction of significant links.

            The Correlation Values of Individual Links Converge as Sample Size Increases

            Correlations, as measured by their Pearson’s correlation coefficient (ρ), are jointly characterized by the statistical significance of the links and the weight in gene co-expression networks. Previous studies have observed that larger correlations require smaller sample sizes to reach statistical significance. 24 To understand the relationship between correlation value and sample size, we divided the weights into five groups based on the strength of their Pearson’s correlation coefficient (very strong: ≥0.8; strong: <0.8 and ≥0.6; moderate: <0.6 and ≥0.4; weak: <0.4 and ≥0.2; very weak: <0.2). As expected, we found that stronger correlations require fewer samples to reach statistical significance ( Figure 3A ). The minimum sample size required for a correlation to become statistically significant can be calculated using the equation of the t-statistic for testing the significance of a correlation coefficient 25 (see Supplementary Text S1 for derivation)

            (4) tα= ρn21ρ2,

            where tα is the t-test estimator for a given level of significance (α), ρ is the Pearson’s correlation coefficient, and n is the sample size. Equation 4 allows us to test the hypothesis that the correlation is non-zero, and by isolating the sample size, we can estimate the minimum sample number, n, at which a correlation value becomes significant. For example, according to Eq. 4, we need 218 samples from the GTEx whole blood dataset to detect correlations above 0.4 with statistical significance (p-value < 0.05).

            Figure 3 explores the impact of sample size on co-expression values in GTEx whole blood networks. (A) A violin plot shows p-value distributions across sample sizes and correlation strengths, revealing that strong correlations require fewer samples for significance. (B) A scatterplot illustrates the fraction of significant correlations by sample size, with different colors representing correlation strength categories, a power-law model prediction (red line) and predicted convergence points (red dots). (C) A line plot tracks correlation variations for five randomly selected links across sample sizes, with shaded areas showing variation and red dots marking consistently significant correlations. (D) A half-violin plot visualizes the distribution of standard deviations of Pearson correlations across different sample sizes, showing decreased variability with larger sample sizes.
            Figure 3.

            The influence of sample size on co-expression values. (A) The violin plot displays the distribution of p-values for co-expression links across different sample sizes and correlation strength categories. Correlation strengths are classified as follows: very strong (r ≥ 0.8, purple), strong (0.6 ≤ r < 0.8, blue), moderate (0.4 ≤ r < 0.6, green), weak (0.2 ≤ r < 0.4, pink), and very weak (r < 0.2). Each category consists of 100,000 randomly selected links from the GTEx whole blood co-expression networks, with weights remaining within the same strength category regardless of sample size. The plot illustrates that stronger correlations require fewer samples to reach statistical significance. (B) The scatterplot depicts the fraction of significant links in the GTEx whole blood co-expression networks (y-axis) and the sample size used to create the networks (x-axis). Different levels of correlation strength are represented by colored dots. The red line represents the prediction of the power-law model, and the red points above the line are the predicted convergence points at specific levels of correlation using Eq. 4. These predictions coincide with the sample sizes where the correlation levels converge. (C) The plot illustrates the variation in correlation values of five randomly selected links with different correlation strengths from GTEx whole blood networks across five replicates of the same sample size for various sample sizes. The line represents the mean value among the five replicates, and the area represents the minimum and maximum values. The red dots indicate links with significant correlations (p-adj < 0.05) in all the replicates. The variation of the correlation values decreases as the sample size increases and is lower in strong correlations. (D) The half violin plot shows the distribution of SD of the Pearson’s correlation of 1 million randomly selected links across five different co-expression networks from the GTEx whole blood dataset, all sharing the same sample size. The plot shows how this variability decreases as the sample size used to create the networks increases. SD, standard deviation.

            However, we also find that the number of statistically significant links of a given strength is not constant, but rather exhibits variation in relation to the sample size, with greater variation observed for weaker correlations and smaller sample sizes ( Figure 3B ). In other words, there is always some variance around a given correlation value, even if this value is statistically significant, and this variance is related to the sample size. To further investigate this randomness, we measured the correlations of five randomly chosen links of different strengths as the sample size varied ( Figure 3C ), finding that the variability of the co-expression weights decreases with increasing sample size. To confirm these findings on a larger scale, we randomly selected 1 million links and calculated the SD of their co-expression weights across repetitions of each sample size ( Figure 3D ). We find that the SD of the co-expression of the majority of links decreases with increasing sample size, and appears to converge after a sample size of 100, a behavior replicated in other datasets as well ( Supplementary Figure S4 ). Taken together, when we explore the impact of sample size and randomness of sample selection on the correlation values in gene co-expression networks, we find that even though a correlation may be deemed statistically significant, there is always some variance around the correlation value related to the sample size. Yet, as the sample size increases, the variability of correlation values decreases and converges toward a stable value.

            Small Sample Sizes Negatively Impact Strong Correlations

            Strong correlations, which we define as those with an absolute value ≥0.6, play a crucial role in gene co-expression networks, particularly for predicting gene regulatory interactions and protein–protein interactions. Given their importance, we investigated how sample size influences both the significance and stability of these strong correlations.

            To evaluate the significance of strong correlations, we applied our power-law model to determine the fraction of significant correlations across different sample sizes, and we use Eq. 4 to calculate the sample size at which all strong correlations become significant (88 samples). We then compared the evolution of strong correlations across four datasets, each characterized by different discovery exponent ( Figure 4A ). We observed that datasets with higher discovery exponents converge more rapidly toward 100% strong correlations, consistent with our previous findings. This implies that, while 88 samples are sufficient for obtaining significant strong correlations, smaller sample sizes, especially in heterogeneous datasets, are less likely to yield a substantial fraction of these correlations.

            Figure 4 examines the impact of sample size on the significance and stability of strong correlations. (A) A line plot depicts the percentage of significant strong correlations (Pearson â¥0.6) across sample sizes for four datasets, showing that datasets with a higher discovery exponent reach convergence faster. (B) A line plot illustrates the percentage of remaining strong correlations (Pearson â¥0.6) from networks of sample size 240 that remain in smaller networks, revealing a substantial loss at lower sample sizes. (C) A line plot shows the percentage of remaining strong scores from networks of sample size 240 in lower sample size networks using Pearson, Spearman, GENIE3 and ARACNe, highlighting differences in stability across methods.
            Figure 4.

            Impact of sample size on the significance and stability of strong correlations. (A) Percentage of significant strong correlations (defined as correlations ≥0.6) across sample sizes for four different datasets (GSE193677 UC/CD inflamed in pink, GTEx breast mammary tissue in light green, GTEx lung in dark green, and TCGA breast cancer in blue). Datasets with higher discovery exponent converge more rapidly toward 100% of strong correlations. (B) Percentage of remaining strong correlations (Pearson ≥ 0.6) extracted from a sample size of 240 in lower sample sizes across different datasets. A significant loss of strong correlations is observed at small sample sizes. (C) Percentage of remaining strong scores extracted from a sample size of 240 in lower sample sizes for the GTEx breast dataset across different methods. Strong scores are defined as correlation ≥0.6 for Pearson and Spearman, score ≥0.01 for GENIE3, and score ≥1 for ARACNe. CD, Crohn’s disease; TCGA, The Cancer Genome Atlas; UC, ulcerative colitis.

            To assess stability, we selected all strong correlations at the sample size of 240 and measured the percentage of these correlations that are still strong at lower sizes ( Figure 4B ). We find that very small sample sizes (20–40) result in the loss of many strong correlations observed only at higher sample sizes. We extended this analysis to networks constructed using different gene co-expression methods, specifically ARACNe 17 and GENIE3, 16 where strong correlations were defined as scores ≥1 for ARACNe and ≥0.01 for GENIE3 ( Figure 4C ). The trends were consistent, with ARACNe and GENIE3 showing an even more pronounced loss of strong correlations at small sample sizes compared to Pearson and Spearman. While this effect is primarily observed in very small sample sizes and varies considerably between datasets, it underscores the detrimental impact of small sample sizes on the stability of strong correlations. These findings emphasize the critical role of sample size in ensuring reliable identification of strong correlations in gene co-expression network analyses.

            Sample Size Improves the Identification of Biomarkers Using Differential Co-expression Analysis

            Gene co-expression networks are frequently used to identify biomarkers and understand the regulatory mechanisms behind a phenotype. This is achieved by comparing samples from different conditions, such as disease versus healthy state or response versus non-response to a treatment. One popular method for this purpose is differential co-expression analysis, which involves comparing the topology of gene co-expression networks observed under different conditions and identifying modules with significant changes across the conditions. 911 Little is known, however, on how sample size affects the predictive power of differential co-expression analysis. To address this, we used the co-expression differential network analysis (CoDiNA) method 26 to focus on breast cancer, using 1095 samples from the TCGA-BRCA dataset to represent the disease and 113 control samples from the healthy breast tissue as the baseline. We applied stretched normalization using the min–max method to ensure all networks have the same weight boundaries and then applied CoDiNA to the networks of different conditions with the same sample size. CoDiNA classifies genes into three categories based on how their associated co-expression links change across different conditions. These categories include: “common” nodes, which have links present in both networks with the same sign; “different” nodes, which have links present in both networks with opposite signs; and “specific” nodes, which have links only present in one of the networks and are referred to as “disease-specific” if present in the disease-based network or “normal-specific” otherwise. The algorithm ranks the biomarker predictions based on the number of links associated with the biomarker category, the top-ranked biomarkers being the ones that have a higher number of links associated with the biomarker category.

            In this case study, we focused on the classification of biomarkers for 1905 genes linked to breast cancer according to various literature-curated disease–gene association databases and genome-wide association studies (GWAS), as described in the Materials and Methods. We observed that the majority of genes were classified as “normal-specific” or “common,” with a smaller proportion classified as “disease-specific” and none classified as “different” ( Figure 5A ). The large group of “normal-specific” biomarkers correspond to genes highly co-expressed in the control network but not in the disease network, suggesting potential loss of function in these genes as a result of the disease. This trend is consistent with the well-established observation that a significant proportion of mutations associated with breast cancer are categorized as deleterious (mutations that cause a loss of function in the gene implicated). 27 The presence of a substantial group of “common” biomarkers, which are highly co-expressed in both disease and control networks, is also expected given the large number of biomarkers analyzed. Many of these common biomarkers may represent low-frequency variants or subtype-specific markers not exclusively tied to disease progression. We further found that an increase in sample size resulted in an increase in the number of common and specific genes, particularly normal-specific genes, reinforcing the idea of loss of function in many breast cancer-related genes in the disease state.

            Figure 5 explores the impact of sample size on the stability of breast cancer biomarker identification using differential co-expression analysis. (A) A stacked plot visualizes the classification of breast cancer disease genes into four categories: common, disease-specific, normal-specific, and undefined. The plot highlights a trend of increasing normal-specific classifications as sample size grows. (B) A line plot tracks the ranking stability of four breast cancer transcription factors (NR3C2, ZNF652, EGR3, and ARNT2) across different sample sizes, showing how their classification shifts with more data. (C) A line plot displays the fraction of stable disease genes within each biomarker category as sample size increases. The plot reveals that normal-specific genes exhibit greater stability with larger sample sizes.
            Figure 5.

            Impact of sample size on the stability of breast cancer biomarker identification using differential co-expression analysis. (A) Breast cancer disease genes are categorized into four groups and color-coded accordingly: common (blue) for genes with consistent co-expression patterns in both conditions, disease-specific (red) for genes exhibiting higher connectivity in the disease state, normal-specific (green) for genes showing increased connectivity in the healthy state, and undefined (gray) for genes that cannot be significantly assigned to a specific category. Changes in classification are represented by lighter colors, while unchanged classifications are depicted by darker colors. The plot highlights the shifting classifications of each breast cancer disease gene, allowing visual identification of any alterations in classification with varying sample sizes. Notably, the plot demonstrates an increase in the normal-specific classification as the sample size expands, suggesting a potential association of breast cancer disease genes with this particular category. (B) Classification and ranking of breast cancer transcription factors (NR3C2, ZNF652, EGR3, and ARNT2) across different sample sizes. The plot showcases the classification and relative ranking of these genes, providing insights into their stability or variation with changes in sample size. (C) The fraction of breast cancer genes that maintain stability in their category is depicted, along with a visual representation of the stability across different biomarker categories. This analysis offers a comprehensive view of the stability of breast cancer genes within their assigned biomarker categories at each sample size.

            We also monitored the classification and ranking of four well-characterized tumor suppressor transcription factors (NR3C2, ZNF652, EGR3, and ARNT2) that serve as biomarkers for breast cancer. 28 All four transcription factors were accurately classified as normal-specific by the algorithm ( Figure 5B ). Among them, NR3C2 ranked within the top 1000 genes out of 12,808, while ARNT2 and ZNF652 were ranked lower, though still within the top 50%. Interestingly, the effect of sample size on these predictions was minimal, with only ZNF652 showing a noticeable improvement in classification as sample size increased.

            We further examined the stability of biomarker classification as a function of sample size. Our analysis revealed that increasing sample size leads to decreased changes in the biomarker category for specific genes ( Figure 5A ), and increased stability of genes remaining in the same category, particularly for normal-specific genes, which represent the majority ( Figure 5C ).

            In summary, our findings indicate that increasing sample size not only improves the classification of molecular biomarkers, but also enhances the stability of gene co-expression network predictions. This case study also shows that some predictions are accurate and the stability of these predictions increases, but the signal is very small. The diminished signal is rooted in limited sample size (100) and the very high heterogeneity of breast cancer, two features that are indicators of poor performance according to our earlier results shown in Figure 2 .

            Relevance Predicting Protein–protein Interactions

            Gene co-expression networks are often used to uncover the molecular mechanisms behind biological processes. 9 Of the interactions that underlie these mechanisms, protein–protein interactions are often associated with high levels of gene co-expression, as their gene products tend to be similarly expressed to facilitate the interaction. 29,30 Hence, gene co-expression is often used to predict functional relationships between proteins, validate experimental results, or eliminate false interactions, 31 raising the question: How does the effect of sample size affect the co-expression of gene pairs encoding direct protein–protein interactions? To answer this, we utilized a comprehensive human protein–protein interactome compiled from 21 public databases containing various types of experimentally derived protein–protein interaction data 32 (see Materials and Methods). We used Pearson’s correlation to extract co-expression data from the GTEx whole blood network, which is based on gene expression that is not biased toward any particular disease.

            To investigate the relationship between gene co-expression and direct protein–protein interactions, we compared the co-expression of protein pairs involved in experimentally validated direct protein–protein interactions to a negative set of randomly selected protein pairs with a distance >3 in the human protein–protein interactome network. 33 We found that the direct protein–protein interactions have a higher median co-expression compared to the negative set of protein pairs, in line with our starting hypothesis. Furthermore, as the sample size increases, the variance of the distributions decreases, and the co-expression weights of distant protein pairs shift toward smaller values, thereby enhancing the distinction between distant pairs and the direct interaction group (Wilcoxon test, p-adj < 0.05 Holm method; Supplementary Table S1 ; Figure 6A ). We further grouped the protein pairs based on their shortest distance in the network and compared their co-expression. We find that protein pairs that are close in the protein–protein interactome have higher median co-expression weights compared to the average protein pairs, and that increasing the sample size enhances their differentiation (Friedman test; Dunn’s post-hoc test, p-adj < 0.05 Holm method; Supplementary Table S2 ; Figure 6B ). Finally, we investigated whether sample size improves the predictive power of co-expression weights on physical binding. We calculated the area under the receiver operating characteristic curve (AUROC), which assesses the ability to distinguish between binding and non-binding protein pairs based on co-expression. An AUROC of 0.5 indicates random chance, while an AUROC of 1 signifies accurate predictions ( Figure 6C ). We find a moderate yet monotonous increase in the AUROCs as sample size increased, from 0.67 at a sample size of 20 to 0.71 at a sample size of 300. In summary, our findings indicate that larger sample sizes improve the accuracy and significance of gene co-expression analysis in differentiating direct protein–protein interactions from other pairs, highlighting the importance of sufficient sample sizes if we wish to identify and validate relevant biological interactions.

            Figure 6 explores the influence of sample size on proteinâprotein interaction (PPI) co-expression. (A) A violin plot compares co-expression weights between direct PPIs and distant protein pairs (shortest path >3), showing higher co-expression for PPIs and reduced variance with larger sample sizes. (B) A violin plot illustrates the distribution of co-expression weights for protein pairs grouped by their shortest path distance in the interactome, revealing that distant protein pairs have lower co-expression, with this trend strengthening as sample size increases. (C) A line plot tracks the discriminative power of co-expression in distinguishing binding from non-binding protein pairs, measured by AUROC, which increases slightly from 0.67 at a sample size of 20 to 0.71 at 300. Data for all panels is from GTEx whole blood dataset.
            Figure 6.

            Influence of sample size on protein–protein interaction co-expression. (A) Distribution of co-expression weights between direct PPIs and protein pairs with a shortest path distance >3 in the protein–protein interactome. The plot highlights that PPIs show higher co-expression than distant protein pairs (distance > 3). As sample size increases, the variance of the distributions decreases, with a noticeable shift toward lower co-expression for distant protein pairs. (B) Distribution of co-expression weights for protein pairs grouped by their shortest path distance within the protein–protein interactome. The plot reveals that protein pairs with larger distances typically exhibit lower co-expression, with this trend becoming more pronounced as sample size increases. (C) Impact of sample size on the ability to discriminate between binding and non-binding based on co-expression. The AUROC is used as a measure of discriminative power. The plot shows that the AUROC slightly increases with sample size, ranging from 0.67 at a sample size of 20 to 0.71 at a sample size of 300. Note that co-expression weights were extracted from the GTEx whole blood dataset for analysis in panels A, B, and C. AUROC, area under the receiver operating characteristic curve; PPI, protein–protein interaction.

            CONCLUSIONS

            In conclusion, our study provides novel and valuable insights into the pivotal role of sample size in gene co-expression network analysis within the field of network medicine. Our main discovery, a novel power-law-based statistical pattern captured by Eq. 2 and Eq. 3, allowed us to develop a predictive model that accurately estimates the informational yield derived from gene co-expression networks. This model not only facilitates dataset comparisons based on the discovery exponent (α) of significant correlations with increasing sample size, but also enables the estimate of the sample size required to achieve a desired percentage of significant links. Notably, we found a robust inverse relationship between the discovery exponent, α, and the necessary sample size, allowing us to formulate a practical rule of thumb in estimating the sample size requirements to significant links. Our investigation also assesses the impact of sample size on different network medicine applications, including biomarker discovery and differentiation of protein–protein interactions from non-interacting pairs, and reveals that larger sample sizes lead to more stable and accurate predictions. These findings underscore the critical importance of sample size and data heterogeneity in gene co-expression network analyses, particularly in relation to predictive outcomes in highly heterogeneous gene expression datasets. Furthermore, our study highlights the inherent instability of networks constructed from fewer than 100 individuals, thereby emphasizing the need for meticulous consideration of sample size during study design, data collection, and analysis.

            While our study offers significant insights into gene co-expression network analysis within network medicine, it is essential to acknowledge its limitations. Notably, we did not include single-cell RNA-seq data in our analysis due to several reasons. Firstly, incorporating single-cell data would not only require the consideration of the number of individuals, but also the number of cells, incorporating a new level of complexity. Additionally, we recognize that the field of single-cell RNA-seq, while rapidly evolving, has yet to reach a level of relevance for direct application in clinical medicine and diagnosis. Therefore, our focus remained on traditional RNA-seq datasets, commonly used in current medical research and practice. Nonetheless, as single-cell technologies continue to mature and gain relevance, future studies should explore their integration into gene co-expression network analyses, offering a more comprehensive understanding of cellular heterogeneity and its implications for network medicine.

            Additionally, our study focused specifically on gene co-expression networks, rather than the inference of gene regulatory networks. While gene regulatory networks provide deeper insights into the mechanisms of gene regulation, the complexity of such networks would further increase the computational and statistical demands of our analysis. However, the insights regarding sample size derived from our study can be translated to the inference of gene regulatory networks. Larger sample sizes would similarly enhance the robustness and accuracy of regulatory network predictions, improving the ability to detect meaningful regulatory interactions.

            Overall, our work provides valuable insights into the optimal use of gene co-expression networks in network medicine, highlighting the importance of sample size and data heterogeneity. The discovered power-law scaling relationship offers an accurate estimation of the information yield from gene co-expression networks for a given number of samples and facilitates the optimization of network generation in network medicine pipelines. Our findings will have important implications for the design and analysis of experiments in network medicine that use RNA-seq as a basis, providing a foundation for future research to refine and enhance our understanding of gene co-expression networks and their potential applications in human health. Interested readers can further explore the results of our study at https://joaquimaguirreplans.shinyapps.io/samplesizeshiny/.

            MATERIALS AND METHODS

            Gene Expression Datasets

            We used four RNA-seq gene expression datasets to test our hypothesis. The first dataset (named RA dataset) was provided by Scipher Medicine and it comprises 584 whole blood samples extracted from 261 patients with RA. 6 The samples taken before treatment (baseline) were selected and the median expression of the samples from the same patients was calculated, obtaining a final dataset of 260 samples and 13,757 genes. The second dataset (named GTEx dataset) was retrieved from the GTEx database 22 (downloaded on August 8, 2022), from which five subsets corresponding to five different tissues were derived: breast (168 female samples, 18,132 genes), lung (576 samples, 18,030 genes), skin (692 samples, 17,804 genes), thyroid (651 samples, 17,870 genes), and whole blood (742 samples, 15,148 genes). The third dataset (named TCGA dataset) was retrieved from the TCGA database (downloaded on November 18, 2022), from which 11 subsets corresponding to 11 different tissues were derived: breast invasive carcinoma (BRCA, 1082 female samples, 19,118 genes), BRCA subtype luminal A (566 samples, 19,181 genes), BRCA subtype luminal B (217 samples, 19,017 genes), kidney renal clear cell carcinoma (532 samples, 19,310 genes), kidney renal papillary cell carcinoma (290 samples, 18,549 genes), lung adenocarcinoma (516 samples, 19,200 genes), lung squamous cell carcinoma (501 samples, 19,537 genes), thyroid carcinoma (504 samples, 18,827 genes), breast tissue (112 female samples, 19,826), kidney tissue (129 samples, 19,424 genes), and lung tissue (108 samples, 19,141 genes). The fourth dataset (named GSE193677 dataset) was retrieved from Gene Expression Omnibus (GEO) 34 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE193677), from which we classified samples into five conditions: Crohn’s disease rectum (115 samples, 18,197 genes), ulcerative colitis rectum samples (136 samples, 18,197 genes), control rectum samples (225 samples, 18,197 genes), inflamed rectum samples (502 samples, 18,197 genes), and non-inflamed rectum samples (851 samples, 18,197 genes). The genes of the datasets were initially filtered to select the genes properly annotated in the HUGO Gene Nomenclature Committee (HGNC) database (i.e., the ones that have HGNC symbol and gene classification). Then, genes with a large abundance of low values were removed to avoid biases when calculating the correlations. Specifically, the removed genes met at least one of these conditions: the sum or SD of the expression of all samples was 0, the mean or median expression was below 5, or more than half of the samples were empty counts (0). We did not apply any normalization strategy, as unnormalized counts have been discussed as a valid strategy for gene co-expression analysis. 11,35

            Gene Co-expression Networks

            We selected groups of samples of different sizes using random sampling with replacement, starting from size 20 and increasing the size by 20 until reaching the maximum number of samples in the dataset. We repeated the selection five times for each size. For each group of samples, we created co-expression networks using Pearson’s and Spearman’s correlation, calculated using the function rcorr of the R package Hmisc. 36 For each pair of genes, the methods provide us with a weight and a p-value (in the case of the three initial methods) that was corrected for multiple testing using Bonferroni correction, with the R function p.adjust. We also created co-expression networks using ARACNe and GENIE3 to assess the stability of strong correlations. ARACNe co-expression networks were calculated using the functions build.mim and aracne of the R package minet, 37 whereas GENIE3 co-expression networks were created using the R package GENIE3. It is important to note that GENIE3 weights are intended solely for ranking links and do not carry statistical significance. In this study, we used the GENIE3 weights exclusively for filtering strong links, applying a threshold of 0.01 based on the observed weight distribution within the network, without assigning any statistical interpretation to the scores. The code is available at https://github.com/Barabasi-Lab/SampleSize.

            Modeling Significant Interactions of Gene Co-expression Networks

            In this study, we used a set of conditions based on our understanding of how new significant links emerge in a network:

            1. Each new sample reveals a fraction of new significant links.

            2. Significant links cannot become insignificant.

            3. All significant links are hidden at sample size 0.

            4. All the links in the network become significant at infinite sample size.

            We defined the discovery rate, which is the fraction of new significant links gained at sample size n, f(n), as:

            (1) f(n)=SnSn1Sn1,

            where S(n) is the number of significant correlations at sample size n. f(n) follows a power-law function, defined by the following equation:

            (2) f(n)=bnα,

            where α is the scaling exponent and b is a normalization constant. When n tends to infinite (n → ∞), the fraction of new links discovered tends to 0 (f(n) → 0), and the number of links approaches to the maximum number of links in the network, L (S(n) → L).

            We rewrote Eq. 1 as a differential equation and substituted it in Eq. 2 to:

            (5) f(n)=SnSn1Sn1=ΔsΔn·1S=bnα

            We abbreviated S(n) as S for simplicity. We rewrote the equation as:

            (6) LS=0L1SΔ(S/L)=Nn=0nbnαΔ(n/n)

            where n is the maximum sample size in our dataset. Since L » 1 and n » 1, then ΔS/L → 0 and Δn/n → 0, which is infinitesimal in the differential. We substituted variables o = Δn/n and S′ = ΔS/L. Then, we changed the symbol and substituted back:

            (7) 1SdS=b(n)α+1oαdoLlnS=[bα+1nα+1]+cS(n)=ecebnα+1α+1

            If α > 1 and n → ∞, then the exponential part exp (bn α + 1/(−α + 1)) → exp (0) = 1. If x → 0, the exponential part converges to zero: exp (bn α + 1/(−α + 1)) → exp (−∞) = 0. Thus, to match the boundary condition, ec = L, and then we obtain the final equation:

            (3) S(n)=Lebn(1α)1α

            The function described by Eq. 3 is known as a stretched exponential, which is obtained by inserting a fractional power law into an exponential function. Equation 3 allows us to determine the value of L, the maximum number of links that the network can have, using the formula L = NG (NG − 1)/2, where NG is the number of genes in the network. The remaining unknown parameters (α, b) can be estimated using linear fits on empirical fractional data. To do so, we applied the following method. First, we transformed Eq. 3 by taking the logarithm of both sides to obtain:

            (8) lnS=lnL+lnebnα+1α+1lnLlnS=bnα+1α1

            We then added logarithms to both sides of Eq. 8, to obtain the following equation:

            (9) ln(ln Lln S)=ln(b)ln(α1)+(α+1)ln(n)

            By applying a linear fit model to the transformed equations, we obtained two coefficients corresponding to the values of −α + 1 and ln (b) − ln (α − 1), from which we calculate the values of α and b ( Figure 2B ). If the value of L is not known, we can apply Brent’s minimization algorithm to find the value of L that minimizes the error in the linear fit. This method is resolution-limited and depends on the amount of available data, including the number of samples, size of intervals, and number of replicates.

            Human Protein–Protein Interactome

            The human protein–protein interactome, reported in the study by Gysi and Barabási, 38 was compiled from 21 public databases containing various types of experimentally derived protein–protein interaction data 32 : (i) binary protein–protein interactions, derived from high-throughput yeast-two hybrid (Y2H) experiments (HI-Union 39 ), three-dimensional (3D) protein structures (Interactome3D, 40 Instruct, 41 Insider 42 ), or literature curation (PINA, 43 MINT, 44 LitBM17, 39 Interactome3D, 40 Instruct, 41 Insider, 42 BioGrid, 45 HINT, 46 HIPPIE, 47 APID, 48 InWeb 49 ); (ii) protein–protein interactions identified by affinity purification followed by mass spectrometry present in BioPlex2, 50 QUBIC, 51 CoFrac, 52 HINT, 46 HIPPIE, 47 APID, 48 LitBM17, 39 and InWeb 49 ; (iii) kinase-substrate interactions from KinomeNetworkX 53 and PhosphoSitePlus 54 ; (iv) signaling interactions from SignaLink 55 and InnateDB 56 ; and (v) regulatory interactions derived by the ENCODE consortium.

            Disease–gene Associations

            We considered 38 around 130 databases of disease–gene associations and selected those that (i) were not compiled from other data sources, and (ii) provided at least one kind of evidence type classified as: strong (functional evidence using an experimental essay); weak (GWAS evidence but no experimental validation); inferred (relying on bioinformatics or single nucleotide polymorphisms from imputation in GWAS); not compatible ((l)ncRNA, miRNA, and other transcripts with or without experimental validation). For each database, we kept the disease name, gene converted to HGNC names, and evidence level. At the end, we combined the following data sources: ClinGen, ClinVar, 57 CTD, 58 Disease Enhancer, 59 DisGeNET, 60,61 GWAS Catalog, 62 HMDD, 63 IncBook, 64 LncRNA disease, 65,66 LOVD, 67 Monarch, 68 OMIM, 69 Orphanet, PheGenI, 70 and PsyGeNet. 71

            Differential Co-expression Analysis

            We used the CoDiNA method 26 to calculate the differential co-expression between networks from 1095 breast carcinoma samples and 113 healthy breast tissue control samples, both subsets extracted from the TCGA-BRCA dataset. Before applying CoDiNA, we calculated the consensus networks between the five replicate gene co-expression networks from each sample size and condition. To calculate the consensus networks, we applied the function wTO.Consensus from the package wTO. 15 Then, we applied stretched normalization using the min–max method (i.e., the minimum value is transformed into 0, the maximum value is transformed into 1, and the rest of the values are transformed into decimals between 0 and 1), to ensure that all networks have the same weight boundaries. Finally, we applied CoDiNA to the networks of different conditions with the same sample size. CoDiNA classifies genes into three categories based on how their associated co-expression links change across different conditions. These categories include: “common” nodes, which have links present in both networks with the same sign; “different” nodes, which have links present in both networks with opposite signs; and “specific” nodes, which have links only present in one of the networks and are referred to as “disease-specific” if present in the disease-based network or “normal-specific” otherwise. The algorithm ranks the nodes based on the number of links associated with the category, the top-ranked nodes being the ones that have a higher number of links associated with the category.

            AUTHOR CONTRIBUTIONS

            Albert-Laszlo Barabasi, Deisy Morselli Gysi, Susan Dina Ghiassian, and Viatcheslav R. Akmaev designed the study. Joaquim Aguirre-Plans, Bingsheng Chen, and Deisy Morselli Gysi performed the analysis. All authors drafted, read, and approved the manuscript.

            CONFLICTS OF INTEREST

            Albert-Laszlo Barabasi, Susan Dina Ghiassian, Alex Jones, Viatcheslav R. Akmaev, and Alif Saleh are or have been part of Scipher Medicine, Inc., which applies network medicine strategies to biomarker development and personalized drug selection. All other authors declare no conflicts of interest.

            DATA AND CODE AVAILABILITY

            The data and code utilized in the preparation of this manuscript are publicly accessible and can be obtained from the following sources:

            References

            1. Rai MF, Tycksen ED, Sandell LJ, Brophy RH. Advantages of RNA-seq compared to RNA microarrays for transcriptome profiling of anterior cruciate ligament tears. J Orthop Res. 2018. Vol. 36:484–497. [Cross Ref]

            2. Rao MS, van Vleet TR, Ciurlionis R, et al.. Comparison of RNA-seq and microarray gene expression platforms for the toxicogenomic evaluation of liver from short-term rat toxicity studies. Front Genet. 2019. Vol. 9:636. [Cross Ref]

            3. Sonawane AR, Weiss ST, Glass K, Sharma A. Network medicine in the age of biomedical big data. Front Genet. 2019. Vol. 10:294[Cross Ref]

            4. Buphamalai P, Kokotovic T, Nagy V, Menche J. Network analysis reveals rare disease signatures across multiple levels of biological organization. Nat Commun. 2021. Vol. 12:6306[Cross Ref]

            5. Mellors T, Withers JB, Ameli A, et al.. Clinical validation of a blood-based predictive test for stratification of response to tumor necrosis factor inhibitor therapies in rheumatoid arthritis patients. Netw Syst Med. 2020. Vol. 3:91–104. [Cross Ref]

            6. Cohen S, Wells AF, Curtis JR, et al.. A molecular signature response classifier to predict inadequate response to tumor necrosis factor-α inhibitors: The NETWORK-004 prospective observational study. Rheumatol Ther. 2021. Vol. 8:1159–1176. [Cross Ref]

            7. Ghiassian SD, Voitalov I, Withers JB, Santolini M, Saleh A, Akmaev VR. Network-based response module comprised of gene expression biomarkers predicts response to infliximab at treatment initiation in ulcerative colitis. Transl Res. 2022. Vol. 246:78–86. [Cross Ref]

            8. Voitalov I, Zhang L, Kilpatrick C, et al.. The module triad: A novel network biology approach to utilize patients’ multi-omics data for target discovery in ulcerative colitis. Sci Rep. 2022. Vol. 12:21685[Cross Ref]

            9. van Dam S, Võsa U, van der Graaf A, Franke L, de Magalhães JP. Gene co-expression analysis for functional classification and gene-disease predictions. Brief Bioinform. 2018. Vol. 19:575–592. [Cross Ref]

            10. Gysi DM, Nowick K. Construction, comparison and evolution of networks in life sciences and other disciplines. J R Soc Interface. 2020. Vol. 17:20190610. [Cross Ref]

            11. Chowdhury HA, Bhattacharyya DK, Kalita JK. (Differential) co-expression analysis of gene expression: A survey of best practices. IEEE/ACM Trans Comput Biol Bioinform. 2020. Vol. 17:1154–1173. [Cross Ref]

            12. Paci P, Fiscon G, Conte F, Wang R-S, Farina L, Loscalzo J. Gene co-expression in the interactome: Moving from correlation toward causation via an integrated approach to disease module discovery. NPJ Syst Biol Appl. 2021. Vol. 7:3[Cross Ref]

            13. Glass K, Huttenhower C, Quackenbush J, Yuan G-C. Passing messages between biological networks to refine predicted interactions. PLoS One. 2013. Vol. 8:e64832. [Cross Ref]

            14. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics. 2008. Vol. 9:559. [Cross Ref]

            15. Gysi DM, Voigt A, Fragoso TM, Almaas E, Nowick K. wTO: An R package for computing weighted topological overlap and a consensus network with integrated visualization tool. BMC Bioinformatics. 2018. Vol. 19:392[Cross Ref]

            16. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010. Vol. 5:e12776. [Cross Ref]

            17. Margolin AA, Nemenman I, Basso K, et al.. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006. Vol. 7 Suppl 1:S7. [Cross Ref]

            18. Ballouz S, Verleyen W, Gillis J. Guidance for RNA-seq co-expression network construction and analysis: Safety in numbers. Bioinformatics. 2015. Vol. 31:2123–2130. [Cross Ref]

            19. Liesecke F, De Craene J-O, Besseau S, et al.. Improved gene co-expression network quality through expression dataset down-sampling and network aggregation. Sci Rep. 2019. Vol. 9:14431[Cross Ref]

            20. Ovens K, Eames BF, McQuillan I. The impact of sample size and tissue type on the reproducibility of gene co-expression networksProceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics; New York, NY, USA: Association for Computing Machinery. 2020. p. 1–10. [Cross Ref]

            21. Gibson SM, Ficklin SP, Isaacson S, Luo F, Feltus FA, Smith MC. Massive-scale gene co-expression network construction and robustness testing using random matrix theory. PLoS One. 2013. Vol. 8:e55871. [Cross Ref]

            22. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020. Vol. 369:1318–1330. [Cross Ref]

            23. Kenton W. The law of diminishing marginal utility: How it works, with examplescited 24 August 2023Investopedia [Internet]. https://www.investopedia.com/terms/l/lawofdiminishingutility.asp

            24. Bravais A. Analyse mathématique sur les probabilités des erreurs de situation d’un point. Paris: Impr. Royale. 1844

            25. Pearson K. Contributions to the mathematical theory of evolution. Philos Trans R Soc Lond A. 1894. Vol. 185:71–110

            26. Gysi DM, de Miranda Fragoso T, Zebardast F, et al.. Whole transcriptomic network analysis using co-expression differential network analysis (CoDiNA). PLoS One. 2020. Vol. 15:e0240523. [Cross Ref]

            27. Walsh MF, Nathanson KL, Couch FJ, Offit K. Genomic biomarkers for breast cancer risk. Adv Exp Med Biol. 2016. Vol. 882:1–32. [Cross Ref]

            28. Liu J, Liu Z, Zhou Y, et al.. Identification of a novel transcription factor prognostic index for breast cancer. Front Oncol. 2021. Vol. 11:666505. [Cross Ref]

            29. Vinayagam A, Zirin J, Roesel C, et al.. Integrating protein-protein interaction networks with phenotypes reveals signs of interactions. Nat Methods. 2014. Vol. 11:94–99. [Cross Ref]

            30. Szklarczyk D, Gable AL, Lyon D, et al.. STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019. Vol. 47:D607–D613. [Cross Ref]

            31. Karimizadeh E, Sharifi-Zarchi A, Nikaein H, et al.. Analysis of gene expression profiles and protein-protein interaction networks in multiple tissues of systemic sclerosis. BMC Med Genomics. 2019. Vol. 12:199[Cross Ref]

            32. Gysi DM, do Valle Í, Zitnik M, et al.. Network medicine framework for identifying drug-repurposing opportunities for COVID-19. Proc Natl Acad Sci U S A. 2021. Vol. 118:e2025581118. [Cross Ref]

            33. Chatterjee A, Walters R, Shafi Z, et al.. Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat Commun. 2023. Vol. 14:1989[Cross Ref]

            34. Argmann C, Hou R, Ungaro RC, et al.. Biopsy and blood-based molecular biomarker of inflammation in IBD. Gut. 2023. Vol. 72:1271–1287. [Cross Ref]

            35. Johnson KA, Krishnan A. Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. Genome Biol. 2022. Vol. 23:1[Cross Ref]

            36. Harrell FE. Hmisc: Harrell Miscellaneous. 2022. https://hbiostat.org/R/Hmisc/

            37. Meyer PE, Lafitte F, Bontempi G. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 2008. Vol. 9:461. [Cross Ref]

            38. Gysi DM, Barabási A-L. Noncoding RNAs improve the predictive power of network medicine. Proc Natl Acad Sci U S A. 2023. Vol. 120:e2301342120. [Cross Ref]

            39. Luck K, Kim D-K, Lambourne L, et al.. A reference map of the human binary protein interactome. Nature. 2020. Vol. 580:402–408. [Cross Ref]

            40. Mosca R, Céol A, Aloy P. Interactome3D: Adding structural details to protein networks. Nat Methods. 2013. Vol. 10:47–53. [Cross Ref]

            41. Meyer MJ, Das J, Wang X, Yu H. INstruct: A database of high-quality 3D structurally resolved protein interactome networks. Bioinformatics. 2013. Vol. 29:1577–1579. [Cross Ref]

            42. Meyer MJ, Beltrán JF, Liang S, et al.. Interactome INSIDER: A structural interactome browser for genomic studies. Nat Methods. 2018. Vol. 15:107–114. [Cross Ref]

            43. Cowley MJ, Pinese M, Kassahn KS, et al.. PINA v2.0: Mining interactome modules. Nucleic Acids Res. 2012. Vol. 40:D862–D865. [Cross Ref]

            44. Licata L, Briganti L, Peluso D, et al.. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 2012. Vol. 40:D857–D861. [Cross Ref]

            45. Chatr-Aryamontri A, Oughtred R, Boucher L, et al.. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017. Vol. 45:D369–D379. [Cross Ref]

            46. Das J, Yu H. HINT: High-quality protein interactomes and their applications in understanding human disease. BMC Syst Biol. 2012. Vol. 6:92[Cross Ref]

            47. Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH. HIPPIE v2.0: Enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Res. 2017. Vol. 45:D408–D414. [Cross Ref]

            48. Alonso-López D, Campos-Laborie FJ, Gutiérrez MA, et al.. APID database: Redefining protein-protein interaction experimental evidences and binary interactomes. Database J Biol Databases Curation. 2019. Vol. 2019:baz005. [Cross Ref]

            49. Li T, Wernersson R, Hansen RB, et al.. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat Methods. 2017. Vol. 14:61–64. [Cross Ref]

            50. Huttlin EL, Bruckner RJ, Paulo JA, et al.. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017. Vol. 545:505–509. [Cross Ref]

            51. Hein MY, Hubner NC, Poser I, et al.. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell. 2015. Vol. 163:712–723. [Cross Ref]

            52. Wan C, Borgeson B, Phanse S, et al.. Panorama of ancient metazoan macromolecular complexes. Nature. 2015. Vol. 525:339–344. [Cross Ref]

            53. Cheng F, Jia P, Wang Q, Zhao Z. Quantitative network mapping of the human kinome interactome reveals new clues for rational kinase inhibitor discovery and individualized cancer therapy. Oncotarget. 2014. Vol. 5:3697–3710. [Cross Ref]

            54. Hornbeck PV, Zhang B, Murray B, Kornhauser JM, Latham V, Skrzypek E. PhosphoSitePlus, 2014: Mutations, PTMs and recalibrations. Nucleic Acids Res. 2015. Vol. 43:D512–D520. [Cross Ref]

            55. Fazekas D, Koltai M, Türei D, et al.. SignaLink 2 – A signaling pathway resource with multi-layered regulatory networks. BMC Syst Biol. 2013. Vol. 7:7[Cross Ref]

            56. Breuer K, Foroushani AK, Laird MR, et al.. InnateDB: Systems biology of innate immunity and beyond--recent updates and continuing curation. Nucleic Acids Res. 2013. Vol. 41:D1228–D1233. [Cross Ref]

            57. Landrum MJ, Chitipiralla S, Brown GR, et al.. ClinVar: Improvements to accessing data. Nucleic Acids Res. 2020. Vol. 48:D835–D844. [Cross Ref]

            58. Davis AP, Grondin CJ, Johnson RJ, et al.. Comparative toxicogenomics database (CTD): Update 2021. Nucleic Acids Res. 2021. Vol. 49:D1138–D1143. [Cross Ref]

            59. Zhang G, Shi J, Zhu S, et al.. DiseaseEnhancer: A resource of human disease-associated enhancer catalog. Nucleic Acids Res. 2018. Vol. 46:D78–D84. [Cross Ref]

            60. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, et al.. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020. Vol. 48:D845–D855. [Cross Ref]

            61. Piñero J, Bravo À, Queralt-Rosinach N, et al.. DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 2017. Vol. 45:D833–D839. [Cross Ref]

            62. Buniello A, MacArthur JAL, Cerezo M, et al.. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019. Vol. 47:D1005–D1012. [Cross Ref]

            63. Huang Z, Shi J, Gao Y, et al.. HMDD v3.0: A database for experimentally supported human microRNA-disease associations. Nucleic Acids Res. 2019. Vol. 47:D1013–D1017. [Cross Ref]

            64. Ma L, Cao J, Liu L, et al.. LncBook: A curated knowledgebase of human long non-coding RNAs. Nucleic Acids Res. 2019. Vol. 47:D128–D134. [Cross Ref]

            65. Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D. LncRNADisease 2.0: An updated database of long non-coding RNA-associated diseases. Nucleic Acids Res. 2019. Vol. 47:D1034–D1037. [Cross Ref]

            66. Chen G, Wang Z, Wang D, et al.. LncRNADisease: A database for long-non-coding RNA-associated diseases. Nucleic Acids Res. 2013. Vol. 41:D983–D986. [Cross Ref]

            67. Fokkema IFAC, Taschner PEM, Schaafsma GCP, Celli J, Laros JFJ, den Dunnen JT. LOVD v.2.0: The next generation in gene variant databases. Hum Mutat. 2011. Vol. 32:557–563. [Cross Ref]

            68. Shefchek KA, Harris NL, Gargano M, et al.. The Monarch Initiative in 2019: An integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2020. Vol. 48:D704–D715. [Cross Ref]

            69. McKusick VA. Mendelian inheritance in man and its online version, OMIM. Am J Hum Genet. 2007. Vol. 80:588–604. [Cross Ref]

            70. Ramos EM, Hoffman D, Junkins HA, et al.. Phenotype-Genotype Integrator (PheGenI): Synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur J Hum Genet. 2014. Vol. 22:144–147. [Cross Ref]

            71. Gutiérrez-Sacristán A, Grosdidier S, Valverde O, et al.. PsyGeNET: A knowledge platform on psychiatric disorders and their genes. Bioinformatics. 2015. Vol. 31:3075–3077. [Cross Ref]

            Author and article information

            Journal
            nsm
            Network and Systems Medicine
            ScienceOpen (Berlin )
            2941-251X
            24 May 2025
            : 1
            : 1
            : 1-15
            Affiliations
            [1 ] Network Science Institute and Department of Physics, Northeastern University, Boston, MA, USA ( https://ror.org/04t5xt781)
            [2 ] US Department of Veteran Affairs, Boston, MA, USA ( https://ror.org/05rsv9s98)
            [3 ] STALICLA Discovery and Data Science Unit, Barcelona, Spain;
            [4 ] Scipher Medicine Corporation, Waltham, MA, USA;
            [5 ] Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA ( https://ror.org/04b6nzv94)
            [6 ] Department of Network and Data Science, Central European University, Budapest, Hungary ( https://ror.org/02zx40v98)
            Author notes
            *Correspondence to: Albert-Laszlo Barabasi, Network Science Institute and Department of Physics, Northeastern University, 177 Huntington Ave #1010, Boston, MA 02115, USA. E-mail: barabasi@ 123456gmail.com
            Author information
            https://orcid.org/0000-0002-1971-1350
            https://orcid.org/0000-0003-3595-3285
            https://orcid.org/0000-0003-4745-6331
            https://orcid.org/0000-0002-5771-8182
            https://orcid.org/0000-0002-4028-3522
            Article
            NSM.25.1.0002
            10.14293/NSM.25.1.0002
            f48459eb-f3b4-4269-925a-108c616b38a8
            2025 The Author(s).

            This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

            History
            : 18 October 2024
            : 28 January 2025
            Page count
            Figures: 6, References: 71, Pages: 15
            Funding
            Funded by: NIH
            Award ID: 1P01HL132825
            Funded by: Department of Veterans Affairs
            Award ID: 36C24120D0027
            Funded by: Scipher Inc.
            Award ID: 21-C-01472
            Funded by: European Union’s Horizon 2020
            Award ID: 810115—DYNASNET
            We thank R. Dorantes-Gilardi, F. Nasirian, and E. Battistella for the fruitful discussions and Scipher Medicine for the support. This research was supported in part by an NIH award 1P01HL132825, a Department of Veterans Affairs award Contract No. 36C24120D0027, and Scipher Inc. Agreement 21-C-01472. Albert-Laszlo Barabasi is supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 810115—DYNASNET.
            Categories
            RESEARCH ARTICLE

            RNA-seq,sample size,gene co-expression networks,network medicine

            Comments

            Comment on this article