Code
library("reshape2")
library("ggplot2")
library("GENIE3")
library("igraph")
library("PRROC")
library("rfPermute")This practical’s aim is to perform network inference using GENIE3 (Huynh-Thu et al. 2010). The dataset used to illustrate this practical is an extract of the expression data and reconstructed network described in (Nicolas et al. 2012), that is related to the bacteria Bacillus subtilis. Note that gene and sample names have been anonymized.
Be also sure to have the proper R packages installed (usage of renv is strongly recommended):
library("reshape2")
library("ggplot2")
library("GENIE3")
library("igraph")
library("PRROC")
library("rfPermute")Expression data is included in the file ../data/expr.csv and are loaded with:
expr <- read.table("../data/expr.csv", sep = "\t", header = TRUE)
dim(expr)[1] 457 270The data are organized with genes in rows and samples in columns, the first column being the gene name:
head(expr[, 1:10])A global visualization of the gene expression is available using a heatmap:
df_heatmap <- melt(expr, id.vars = "Name")
names(df_heatmap) <- c("gene", "sample", "expression")
p <- ggplot(df_heatmap, aes(sample, gene)) +
  geom_tile(aes(fill = expression), color = "white") +
  scale_fill_gradient(low = "yellow", high = "red") +
  ylab("genes ") + xlab("samples") + theme(axis.text = element_blank()) +
  labs(fill = "Expression level")
pNetwork inference will be performed with GENIE3:
?GENIE3GENIE3 requires an expression matrix with genes in rows and gene names as row names:
expr_matrix <- as.matrix(expr[, 2:ncol(expr)])
rownames(expr_matrix) <- expr$Name
expr_matrix <- expr_matrix
set.seed(1055)
res_GENIE3 <- GENIE3(expr_matrix, nTrees = 50, verbose = TRUE)Note: This simulation is given as a mere illustration of the process. However, we can not expect much from the results: * 50 trees is probably not enough to obtain a good performance; * \(\sigma\) factors, which are major regulators in bacteria, are not specified in the original data (where they could have been).
The result is then compared with the ground true network, available in ../data/net.rds as an igraph object:
ref_net <- readRDS("../data/net.rds")
ref_netIGRAPH afc5784 DN-- 457 710 -- 
+ attr: name (v/c)
+ edges from afc5784 (vertex names):
 [1] W0285->W0285 W0285->F0297 W0285->C0769 W0285->O0665 W0285->X0991
 [6] W0285->K0548 W0285->Y0468 W0285->Z0620 W0285->X0574 W0285->P0189
[11] W0285->V0342 W0285->J0957 W0285->F0308 W0285->X0773 W0285->R0994
[16] W0285->Q0692 W0285->Z0954 W0285->D0661 W0285->S0907 W0285->Y0771
[21] W0285->A0935 W0285->O0136 W0285->V0123 W0285->G0870 W0285->J0696
[26] W0285->S0104 W0285->U0802 W0285->I0168 W0285->F0181 W0285->D0014
[31] W0285->V0861 W0285->N0624 W0285->X0018 W0285->D0148 W0285->Z0131
[36] W0285->P0275 W0285->F0851 W0285->Y0754 W0285->O0313 W0285->Z0109
+ ... omitted several edgesedge_density(ref_net)[1] 0.003407041The true network can be displayed using:
par(mar = rep(0, 4))
set.seed(1121)
plot(ref_net, vertex.size = 3, vertex.color = "lightgreen", 
     vertex.frame.color = "lightgreen", edge.color = "grey", 
     vertex.label = rep(NA, vcount(ref_net)))For the sake of simplicity, we will further use the undirected version of the network. We also extract the corresponding adjacency matrix:
undirected_ref <- as.undirected(ref_net, mode = "collapse")
ref_adj <- as_adj(undirected_ref, sparse = FALSE)
diag(ref_adj) <- 0Weight distribution of the solution is explored to visually set a relevant threshold:
all_weights <- c(res_GENIE3[upper.tri(res_GENIE3)], 
                 res_GENIE3[lower.tri(res_GENIE3)])
df <- data.frame("weights" = all_weights)
p <- ggplot(df, aes(x = weights)) + geom_histogram() + theme_bw() + 
  scale_x_log10() + geom_vline(xintercept = 1e-2, color = "darkred")
pThe corresponding (undirected) network is then deduced:
net1 <- res_GENIE3
net1[res_GENIE3 < 1e-2] <- 0
net1[res_GENIE3 >= 1e-2] <- 1
net1 <- graph_from_adjacency_matrix(net1, mode = "max")
net1IGRAPH 7138c93 UN-- 457 12142 -- 
+ attr: name (v/c)
+ edges from 7138c93 (vertex names):
 [1] W0285--C0769 W0285--P0255 W0285--S0077 W0285--K0840 W0285--Y0311
 [6] W0285--N0741 W0285--N0010 W0285--F0096 W0285--N0057 W0285--D0254
[11] W0285--U0658 W0285--Z0569 W0285--J0880 W0285--D0054 W0285--L0362
[16] W0285--E0486 W0285--E0301 W0285--Z0025 W0285--V0132 W0285--Y0083
[21] W0285--R0994 W0285--V0864 W0285--B0497 W0285--K0663 W0285--S0155
[26] W0285--R0764 W0285--O0113 W0285--V0123 W0285--X0082 W0285--C0834
[31] W0285--U0180 W0285--V0928 W0285--G0023 W0285--E0461 W0285--W0606
[36] W0285--N0763 W0285--X0912 W0285--P0487 W0285--E0983 W0285--F0717
+ ... omitted several edgesIt has a density equal to:
edge_density(net1)[1] 0.1165304which is larger than the density of the true network.
Visually, it also looks quite different:
par(mar = rep(0, 4))
set.seed(1112)
plot(net1, vertex.size = 3, vertex.color = "lightgreen", 
     vertex.frame.color = "lightgreen", edge.color = "grey", 
     vertex.label = rep(NA, vcount(net1)))It is mostly explained by a very different degree distribution (not using prior information on \(\sigma\) factor is probably instrumental in this difference):
df <- data.frame("true" = degree(ref_net), "GENIE3" = degree(net1))
p <- ggplot(df, aes(x = true, y = GENIE3)) + geom_point(alpha = 0.4) +
  theme_bw()
pAnother approach would find a threshold based on a realistic value of the network density (obtained from the true network, here):
ref_density <- edge_density(undirected_ref)
max_weight <- pmax(res_GENIE3[upper.tri(res_GENIE3)],
                   t(res_GENIE3)[upper.tri(res_GENIE3)])
thresh <- quantile(max_weight, probs = 1 - ref_density)
thresh 99.31955% 
0.05164564 This threshold is used to define a second network:
net2 <- res_GENIE3
net2[res_GENIE3 < thresh] <- 0
net2[res_GENIE3 >= thresh] <- 1
net2 <- graph_from_adjacency_matrix(net2, mode = "max")
net2IGRAPH ec7c922 UN-- 457 709 -- 
+ attr: name (v/c)
+ edges from ec7c922 (vertex names):
 [1] W0285--C0769 W0285--Y0311 W0285--N0010 W0285--V0123 W0285--F0717
 [6] W0285--T0776 F0297--R0154 C0769--X0082 C0769--L0809 C0769--G0023
[11] C0769--P0487 C0769--X0681 C0769--S0522 C0769--T0353 E0873--L0863
[16] P0255--I0595 P0255--B0357 P0255--V0583 P0255--C0521 O0665--N0927
[21] O0665--S0636 O0665--H0534 R0740--P0524 R0740--U0911 P0524--U0911
[26] J0702--Z0491 J0702--O0570 J0702--I0968 X0991--C0593 X0991--R0994
[31] X0991--R0578 X0991--K0643 D0934--Z0569 D0934--R0910 D0934--H0859
[36] T0196--L0743 T0196--V0933 Y0468--J0269 V0395--J0880 V0395--E0301
+ ... omitted several edgesThis network has a density close to the true network, as expected:
edge_density(net2)[1] 0.006804484But it still does not match the true network visually:
par(mar = rep(0, 4))
set.seed(1112)
plot(net2, vertex.size = 3, vertex.color = "lightgreen", 
     vertex.frame.color = "lightgreen", edge.color = "grey", 
     vertex.label = rep(NA, vcount(net1)))And, again, it has a very different degree distribution:
df <- data.frame("true" = degree(ref_net), "GENIE3" = degree(net2))
p <- ggplot(df, aes(x = true, y = GENIE3)) + geom_point(alpha = 0.4) +
  theme_bw()
pROC and PR curves are finally obtained, using the set of weights (maximum between the two weights for each edge) for positive (true) and negative (wrong) edges:
true_edges <- res_GENIE3
true_edges[ref_adj == 0] <- 0
true_edges <- pmax(true_edges[upper.tri(true_edges)], 
                   t(true_edges)[upper.tri(true_edges)])
true_edges <- true_edges[true_edges != 0]
wrong_edges <- res_GENIE3
wrong_edges[ref_adj == 1] <- 0
wrong_edges <- pmax(wrong_edges[upper.tri(wrong_edges)], 
                    t(wrong_edges)[upper.tri(wrong_edges)])
wrong_edges <- wrong_edges[wrong_edges != 0]
roc <- roc.curve(scores.class0 = wrong_edges, scores.class1 = true_edges,
                 curve = TRUE)
roc
  ROC curve
    Area under curve:
     0.501703 
    Curve for scores from  1.636326e-06  to  0.1349187 
    ( can be plotted with plot(x) )plot(roc)pr <- pr.curve(scores.class0 = wrong_edges, scores.class1 = true_edges,
               curve = TRUE)
pr
  Precision-recall curve
    Area under curve (Integral):
     0.9929469 
    Area under curve (Davis & Goadrich):
     0.9929468 
    Curve for scores from  1.636326e-06  to  0.1349187 
    ( can be plotted with plot(x) )plot(pr)rfPermute can be used to obtain a \(p\)-value describing the confidence of each edge. It requires to be run on each gene used as a target for the prediction, in turn, which would be too long for this practical. We just illustrate its use with one of the gene, with a very reduced (and not relevant) number of permutations:
expr_df <- data.frame(t(expr_matrix))
set.seed(1241)
res_rfPermute <- rfPermute(W0285 ~ ., data = expr_df, num.rep = 10, 
                           p = round(sqrt(500)))The results contain a \(p\)-value field that is summarized below:
res_rfPermuteAn rfPermute model
               Type of random forest: regression 
                     Number of trees: 500 
No. of variables tried at each split: 152 
       No. of permutation replicates: 10 
                          Start time: 2023-09-18 13:06:53 
                            End time: 2023-09-18 13:09:12 
                            Run time: 2.32 mins summary(res_rfPermute$pval[, 2, "scaled"])   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09091 1.00000 1.00000 0.89872 1.00000 1.00000 Using a threshold of 10% on the \(p\)-value, we obtain the predictors of “W0285” that we can compare to the true neighborhood:
ref_nei <- neighborhood(undirected_ref, order = 1, nodes = "W0285")[[1]]$name
cat("reference:", ref_nei, "\n")reference: W0285 F0297 C0769 O0665 X0991 K0548 Y0468 Z0620 X0574 P0189 V0342 J0957 F0308 X0773 R0994 Q0692 Z0954 D0661 S0907 Y0771 A0935 O0136 V0123 G0870 J0696 S0104 U0802 I0168 F0181 D0014 V0861 N0624 X0018 D0148 Z0131 P0275 F0851 Y0754 O0313 Z0109 J0269 L0777 H0545 U0226 E0325 F0118 R0995 U0641 E0434 B0166 O0495 N0566 B0883 Z0078 W0606 J0233 N0278 Z0843 G0686 I0949 X0293 Y0904 N0763 N0896 S0139 N0094 Z0444 A0410 B0719 R0480 P0046 Q0690 R0906 C0200 X0912 W0547 C0837 D0649 M0953 U0978 F0161 M0393 H0637 K0305 O0093 O0789 B0300 U0423 A0241 A0437 P0206 S0759 P0407 S0176 R0578 T0352 Y0723 K0383 X0916 J0779 K0466 G0932 W0720 K0219 S0095 N0458 R0154 T0042 L0040 P0487 E0983 S0636 N0704 S0435 F0717 E0542 S0421 A0781 T0869 H0534 C0450 J0612 S0373 G0669 N0586 A0494 A0289 R0296 pred_nei <- names(which(res_rfPermute$pval[, 2, "scaled"] < 0.1))
cat("predicted:", pred_nei, "\n")predicted: S0077 K0840 Y0311 N0741 N0010 P0156 U0658 E0301 C0799 M0635 Y0083 K0563 V0123 X0082 C0834 E0325 G0023 L0040 F0717 N0671 J0732 B0594 K0756 T0776 S0522 N0430 L0129 The intersection between the two neighborhood is finally obtained with:
intersect(ref_nei, pred_nei)[1] "V0123" "E0325" "L0040" "F0717"Note: Again, the previous analysis is not satisfactory. What would be sounder would be to use the results of GENIE3 as a prior list of interactions to test and to perform rfPermute only to edges that have been selected during this prior step.
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] rfPermute_2.5.2 PRROC_1.3.1     igraph_1.5.1    GENIE3_1.22.0  
[5] ggplot2_3.4.3   reshape2_1.4.4 
loaded via a namespace (and not attached):
 [1] randomForest_4.7-1.1 gtable_0.3.4         jsonlite_1.8.7      
 [4] dplyr_1.1.3          compiler_4.3.1       tidyselect_1.2.0    
 [7] Rcpp_1.0.11          stringr_1.5.0        swfscMisc_1.6.5     
[10] parallel_4.3.1       scales_1.2.1         yaml_2.3.7          
[13] fastmap_1.1.1        R6_2.5.1             plyr_1.8.8          
[16] labeling_0.4.3       generics_0.1.3       knitr_1.43          
[19] tibble_3.2.1         munsell_0.5.0        pillar_1.9.0        
[22] rlang_1.1.1          utf8_1.2.3           stringi_1.7.12      
[25] xfun_0.40            cli_3.6.1            withr_2.5.0         
[28] magrittr_2.0.3       digest_0.6.33        grid_4.3.1          
[31] rstudioapi_0.15.0    lifecycle_1.0.3      vctrs_0.6.3         
[34] evaluate_0.21        glue_1.6.2           farver_2.1.1        
[37] codetools_0.2-19     fansi_1.0.4          colorspace_2.1-0    
[40] rmarkdown_2.24       tools_4.3.1          pkgconfig_2.0.3     
[43] htmltools_0.5.6