学习WGCNA总结


在转录组数据处理过程中我们经常会用到差异表达分析这一概念,通过比较不同处理或不同组织间基因表达量(FPKM)差异来寻找特异基因,但这前提是你的不同处理或不同组织样本较少,当不同处理或组织有较多样本,如40个,此时的两两比较有780组比较^_^,这根本不是我们想要的结果;

此时就需要WGCNA(weighted gene co-expression network analysis)将复杂的数据进行归纳整理。除了这种最常见的比较差异表达,我们还想知道在不同处理或不同组织间是否有些基因的表达存在内在的联系或相关性?WGCNA同样可以帮助我们预测基因间的相互作用关系。

WGCNA is based on correlation and not differential expression comparisons.

WGCNA术语

权重(weghted)

Module

模块(module):表达模式相似的基因分为一类,这样的一类基因成为模块;

Eigengene

Eigengene(eigen- +‎ gene):基因和样本构成的矩阵,https://en.wiktionary.org/wiki/eigengene;

Adjacency Matrix

邻近矩阵:是图的一种存储形式,用一个一维数组存放图中所有顶点数据;用一个二维数组存放顶点间关系(边或弧)的数据,这个二维数组称为邻接矩阵;

Topological Overlap Matrix(TOM)

整体思路

先对数据进行处理→分层聚类→表达模式相似的基因组成模块→研究某一个模块中相关基因的功能富集(GO,KEGG),各个模块与样本表型数据间的相关性,各个模块与样本本身间的相关性(没有表型数据的情况,如不同组织)→具体到特定模块后分析其所包含基因间的相互作用网络关系,并找出其中的关键基因。

分析构建的网络寻找以下有用信息

  • 这类处于调控网络中心的基因称为核心基因(hub gene),这类基因通常是转录因子等关键的调控因子,是值得我们优先深入分析和挖掘的对象。
  • 在网络中,被调控线连接的基因,其表达模式是相似的。那么它们潜在有相似的功能。所以,在这个网络中,如果线条一端的基因功能是已知的,那么就可以预测线条另一端的功能未知的基因也有相似的功能。
  • R脚本

    输入数据为RNA-seq不同处理或组织所有样本的FPKM值组成的矩阵,切记含有 0 的要去掉;

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    153
    154
    155
    156
    157
    158
    159
    160
    161
    162
    163
    164
    165
    166
    167
    168
    169
    170
    171
    172
    173
    174
    175
    176
    177
    178
    179
    180
    181
    182
    183
    184
    185
    186
    187
    188
    189
    190
    191
    192
    193
    194
    195
    196
    197
    198
    199
    200
    201
    202
    203
    204
    205
    206
    207
    208
    209
    210
    211
    212
    213
    214
    215
    216
    217
    218
    219
    220
    221
    222
    223
    224
    225
    226
    227
    228
    229
    230
    231
    232
    233
    234
    235
    236
    237
    238
    239
    240
    241
    242
    243
    244
    245
    246
    247
    248
    249
    250
    251
    252
    253
    254
    255
    256
    257
    258
    259
    260
    261
    262
    263
    264
    265
    266
    267
    268
    269
    270
    271
    272
    273
    274
    275
    276
    277
    278
    279
    280
    281
    282
    283
    284
    285
    286
    287
    288
    289
    290
    291
    292
    293
    294
    295
    296
    297
    298
    299
    300
    301
    302
    303
    304
    305
    306
    307
    308
    309
    310
    311
    312
    313
    314
    315
    316
    317
    318
    319
    320
    321
    322
    323
    324
    325
    326
    327
    328
    329
    330
    331
    332
    333
    334
    335
    336
    337
    338
    339
    340
    341
    342
    343
    344
    345
    346
    347
    348
    349
    350
    351
    352
    353
    354
    355
    356
    357
    358
    359
    360
    361
    362
    363
    364
    365
    366
    367
    368
    369
    370
    371
    372
    373
    374
    375
    setwd("F:/WGCNA")
    library(WGCNA)
    options(stringsAsFactors = FALSE)
    enableWGCNAThreads()
    #############################################################################
    ####################### 一、 数据读入,处理和保存 ##############################
    #############################################################################
    fpkm<- read.csv("trans_counts.counts.matrix.TMM_normalized.FPKM.nozero.csv")
    #~~~~~~~~~~~~~~~~~~
    head(fpkm)
    GeneID PN_1_TPM PN_1_07_TPM PN_1_08_TPM PN_2_TPM PN_2_09_TPM PN_2_10_TPM
    1 MSTRG.1 0.000000 1.456143 1.093308 0.204315 0.000000 0.000000
    2 MSTRG.2 1.516181 0.849313 2.010783 1.567867 2.045446 2.246402
    3 MSTRG.3 1.305084 1.207246 0.889166 1.470162 0.340003 0.421222
    4 MSTRG.4 2.744250 2.791988 2.500786 2.719017 1.954149 2.468110
    5 MSTRG.5 1.946825 1.470012 1.263171 0.205806 1.644505 1.638583
    6 MSTRG.6 1.325277 0.793530 1.932236 1.210156 1.834274 2.153466
    #~~~~~~~~~~~~~~~~~~
    dim(fpkm)
    names(fpkm)
    datExpr0=as.data.frame(t(fpkm[,-c(1)]));
    names(datExpr0)=fpkm$trans;
    rownames(datExpr0)=names(fpkm)[-c(1)];
    #data<-log10(date[,-1]+0.01)
    # *************************************************************
    # 检测输入基因是否含有缺失值,并处理 ******************************
    # *************************************************************
    gsg = goodSamplesGenes(datExpr0, verbose = 3);
    gsg$allOK
    #~~~~~~~~~~~~~~~~~~
    # 如果上一步返回TRUE则跳过此步,如果返回FALSE则执行如下if语句去掉存在较多缺失值的基因所在行
    if (!gsg$allOK)
    {
    # Optionally, print the gene and sample names that were removed:
    if (sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
    if (sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
    # Remove the offending genes and samples from the data:
    datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
    }
    # 再次检测
    dim(datExpr0)
    gsg = goodSamplesGenes(datExpr0, verbose = 3);
    gsg$allOK
    #~~~~~~~~~~~~~~~~~~
    # ***************************************************************
    # 聚类检测输入样本是否含有异常值(obvious outliers),并处理 ********
    # ***************************************************************
    sampleTree = hclust(dist(datExpr0), method = "average")
    #sizeGrWindow(12,9)
    par(cex = 0.6)
    par(mar = c(0,4,2,0))
    plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
    cex.axis = 1.5, cex.main = 2)
    abline(h = 80000, col = "red");
    clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize = 10)
    table(clust)
    keepSamples = (clust==1)
    datExpr = datExpr0[keepSamples, ]
    nGenes = ncol(datExpr)
    nSamples = nrow(datExpr)
    save(datExpr, file = "AS-green-FPKM-01-dataInput.RData")
    #############################################################################
    ############################ 二、 选择合适的阀值 ##############################
    #############################################################################
    powers = c(c(1:10), seq(from = 12, to=20, by=2))
    # Call the network topology analysis function
    sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
    # Plot the results:
    ##sizeGrWindow(9, 5)
    par(mfrow = c(1,2));
    cex1 = 0.9;
    # Scale-free topology fit index as a function of the soft-thresholding power
    plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
    main = paste("Scale independence"));
    text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    labels=powers,cex=cex1,col="red");
    # this line corresponds to using an R^2 cut-off of h
    abline(h=0.90,col="red")
    # Mean connectivity as a function of the soft-thresholding power
    plot(sft$fitIndices[,1], sft$fitIndices[,5],
    xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
    main = paste("Mean connectivity"))
    text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

    #############################################################################
    ############################ 三、 网络构建和可视化 #############################
    #############################################################################
    #=====================================================================================
    #=====================================================================================
    # 网络构建有两种方法,One-step和Step-by-step;
    # 第一种:一步法进行网络构建
    #=====================================================================================
    #=====================================================================================

    ################################################################################
    ### 3.1 一步法网络构建:One-step network construction and module detection ######
    #3.1.1. 网络构建 ###############################################################
    ###############################################################################
    dim(datExpr)
    net = blockwiseModules(datExpr, power = 6, maxBlockSize = 6000,
    TOMType = "unsigned", minModuleSize = 30,
    reassignThreshold = 0, mergeCutHeight = 0.25,
    numericLabels = TRUE, pamRespectsDendro = FALSE,
    saveTOMs = TRUE,
    saveTOMFileBase = "AS-green-FPKM-TOM",
    verbose = 3)
    table(net$colors)
    ###############################################################################
    #3.1.2. 绘画结果展示 ###########################################################
    ###############################################################################
    # open a graphics window
    #sizeGrWindow(12, 9)
    # Convert labels to colors for plotting
    mergedColors = labels2colors(net$colors)
    # Plot the dendrogram and the module colors underneath
    plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
    "Module colors",
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05)
    ###############################################################################
    #3.1.3. 结果保存 ###############################################################
    ###############################################################################
    moduleLabels = net$colors
    moduleColors = labels2colors(net$colors)
    table(moduleColors)
    MEs = net$MEs;
    geneTree = net$dendrograms[[1]];
    save(MEs, moduleLabels, moduleColors, geneTree,
    file = "AS-green-FPKM-02-networkConstruction-auto.RData")
    ###############################################################################
    #3.1.4. 导出网络到 Cytoscape ###################################################
    ###############################################################################
    # Recalculate topological overlap if needed
    TOM = TOMsimilarityFromExpr(datExpr, power = 6);
    # Read in the annotation file
    # annot = read.csv(file = "GeneAnnotation.csv");
    # Select modules需要修改,根据table(moduleColors)的结果选择需要导出的模块颜色
    modules = c("turquoise", "blue");
    # Select module probes选择模块探测
    probes = names(datExpr)
    inModule = is.finite(match(moduleColors, modules));
    modProbes = probes[inModule];
    #modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
    # Select the corresponding Topological Overlap
    modTOM = TOM[inModule, inModule];
    dimnames(modTOM) = list(modProbes, modProbes)
    # Export the network into edge and node list files Cytoscape can read
    cyt = exportNetworkToCytoscape(modTOM,
    edgeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
    nodeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
    weighted = TRUE,
    threshold = 0.02,
    nodeNames = modProbes,
    #altNodeNames = modGenes,
    nodeAttr = moduleColors[inModule]);
    #################################################################################################
    #3.1.5. 分析网络可视化,用heatmap可视化权重网络,heatmap每一行或列对应一个基因,颜色越深表示有较高的邻近
    #################################################################################################
    options(stringsAsFactors = FALSE);
    lnames = load(file = "AS-green-FPKM-01-dataInput.RData");
    lnames
    lnames = load(file = "AS-green-FPKM-02-networkConstruction-auto.RData");
    lnames
    nGenes = ncol(datExpr)
    nSamples = nrow(datExpr)
    #====================================
    #3.1.5.1. 可视化全部基因网络 ==========
    #====================================
    # Calculate topological overlap anew: this could be done more efficiently by saving the TOM
    # calculated during module detection, but let us do it again here.
    dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
    # Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
    plotTOM = dissTOM^7;
    # Set diagonal to NA for a nicer plot
    diag(plotTOM) = NA;
    # Call the plot function
    #sizeGrWindow(9,9)
    TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
    #====================================
    #3.1.5.2. 随便选取1000个基因来可视化 ==
    #====================================
    nSelect = 1000
    # For reproducibility, we set the random seed
    set.seed(10);
    select = sample(nGenes, size = nSelect);
    selectTOM = dissTOM[select, select];
    # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
    selectTree = hclust(as.dist(selectTOM), method = "average")
    selectColors = moduleColors[select];
    # Open a graphical window
    #sizeGrWindow(9,9)
    # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
    # the color palette; setting the diagonal to NA also improves the clarity of the plot
    plotDiss = selectTOM^7;
    diag(plotDiss) = NA;
    TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
    #=====================================================================================
    #=====================================================================================
    # 第二种:一步步的进行网络构建
    #=====================================================================================
    #=====================================================================================

    ###############################################################################
    ### 3.2 Step-by-step network construction and module detection ################
    ###############################################################################
    #2.选择合适的阀值,同上
    ###############################################################################
    #3.2.1. 网络构建 ###############################################################
    ###############################################################################
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # (1) Co-expression similarity and adjacency ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    softPower = 6;
    adjacency = adjacency(datExpr, power = softPower);
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #(2) 邻近矩阵到拓扑矩阵的转换,Turn adjacency into topological overlap ~~~~~
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    TOM = TOMsimilarity(adjacency);
    dissTOM = 1-TOM
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # (3) 聚类拓扑矩阵 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #Call the hierarchical clustering function
    geneTree = hclust(as.dist(dissTOM), method = "average");
    # Plot the resulting clustering tree (dendrogram)
    #sizeGrWindow(12,9)
    plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
    labels = FALSE, hang = 0.04);
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #(4) 聚类分支的休整dynamicTreeCut ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # We like large modules, so we set the minimum module size relatively high:
    minModuleSize = 30;
    # Module identification using dynamic tree cut:
    dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
    deepSplit = 2, pamRespectsDendro = FALSE,
    minClusterSize = minModuleSize);
    table(dynamicMods)
    ###############################################################################
    #3.2.2. 绘画结果展示 ###########################################################
    ###############################################################################
    # Convert numeric lables into colors
    dynamicColors = labels2colors(dynamicMods)
    table(dynamicColors)
    # Plot the dendrogram and colors underneath
    #sizeGrWindow(8,6)
    plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05,
    main = "Gene dendrogram and module colors")
    ###############################################################################
    #3.2.3. 聚类结果相似模块的融合,Merging of modules whose expression profiles are very similar
    ###############################################################################
    #在聚类树中每一leaf是一个短线,代表一个基因,
    #不同分之间靠的越近表示有高的共表达基因,将共表达极其相似的modules进行融合
    ###############################################################################
    # Calculate eigengenes
    MEList = moduleEigengenes(datExpr, colors = dynamicColors)
    MEs = MEList$eigengenes
    # Calculate dissimilarity of module eigengenes
    MEDiss = 1-cor(MEs);
    # Cluster module eigengenes
    METree = hclust(as.dist(MEDiss), method = "average");
    # Plot the result
    #sizeGrWindow(7, 6)
    plot(METree, main = "Clustering of module eigengenes",
    xlab = "", sub = "")
    #选择有75%相关性的进行融合
    MEDissThres = 0.25
    # Plot the cut line into the dendrogram
    abline(h=MEDissThres, col = "red")
    # Call an automatic merging function
    merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
    # The merged module colors
    mergedColors = merge$colors;
    # Eigengenes of the new merged modules:
    mergedMEs = merge$newMEs;
    #绘制融合前(Dynamic Tree Cut)和融合后(Merged dynamic)的聚类图
    #sizeGrWindow(12, 9)
    #pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6)
    plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
    c("Dynamic Tree Cut", "Merged dynamic"),
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05)
    #dev.off()
    # 只是绘制融合后聚类图
    plotDendroAndColors(geneTree,mergedColors,"Merged dynamic",
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05)
    ###############################################################################
    #3.2.4. 结果保存 ###############################################################
    ###############################################################################
    # Rename to moduleColors
    moduleColors = mergedColors
    # Construct numerical labels corresponding to the colors
    colorOrder = c("grey", standardColors(50));
    moduleLabels = match(moduleColors, colorOrder)-1;
    MEs = mergedMEs;
    # Save module colors and labels for use in subsequent parts
    save(MEs, moduleLabels, moduleColors, geneTree, file = "AS-green-FPKM-02-networkConstruction-stepByStep.RData")
    ###############################################################################
    #3.2.5. 导出网络到Cytoscape ####################################################
    ###############################################################################
    # Recalculate topological overlap if needed
    TOM = TOMsimilarityFromExpr(datExpr, power = 6);
    # Read in the annotation file
    # annot = read.csv(file = "GeneAnnotation.csv");
    # Select modules需要修改
    modules = c("brown", "red");
    # Select module probes
    probes = names(datExpr)
    inModule = is.finite(match(moduleColors, modules));
    modProbes = probes[inModule];
    #modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
    # Select the corresponding Topological Overlap
    modTOM = TOM[inModule, inModule];
    dimnames(modTOM) = list(modProbes, modProbes)
    # Export the network into edge and node list files Cytoscape can read
    cyt = exportNetworkToCytoscape(modTOM,
    edgeFile = paste("AS-green-FPKM-Step-by-step-CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
    nodeFile = paste("AS-green-FPKM-Step-by-step-CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
    weighted = TRUE,
    threshold = 0.02,
    nodeNames = modProbes,
    #altNodeNames = modGenes,
    nodeAttr = moduleColors[inModule]);
    #################################################################################################
    #3.2.6. 分析网络可视化,用heatmap可视化权重网络,heatmap每一行或列对应一个基因,颜色越深表示有较高的邻近
    #################################################################################################
    options(stringsAsFactors = FALSE);
    lnames = load(file = "AS-green-FPKM-01-dataInput.RData");
    lnames
    lnames = load(file = "AS-green-FPKM-02-networkConstruction-stepByStep.RData");
    lnames
    nGenes = ncol(datExpr)
    nSamples = nrow(datExpr)
    #====================================
    #3.2.6.1. 可视化全部基因网络 ==========
    #====================================
    # Calculate topological overlap anew: this could be done more efficiently by saving the TOM
    # calculated during module detection, but let us do it again here.
    dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
    # Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
    plotTOM = dissTOM^7;
    # Set diagonal to NA for a nicer plot
    diag(plotTOM) = NA;
    # Call the plot function
    #sizeGrWindow(9,9)
    TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
    #====================================
    #3.2.6.2. 随便选取1000个基因来可视化 ==
    #====================================
    nSelect = 1000
    # For reproducibility, we set the random seed
    set.seed(10);
    select = sample(nGenes, size = nSelect);
    selectTOM = dissTOM[select, select];
    # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
    selectTree = hclust(as.dist(selectTOM), method = "average")
    selectColors = moduleColors[select];
    # Open a graphical window
    #sizeGrWindow(9,9)
    # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
    # the color palette; setting the diagonal to NA also improves the clarity of the plot
    plotDiss = selectTOM^7;
    diag(plotDiss) = NA;
    TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
    #此处画的是根据基因间表达量进行聚类所得到的各模块间的相关性图
    MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
    MET = orderMEs(MEs)
    sizeGrWindow(7, 6)
    plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)

    部分结果图简单解释

    Cytoscape生成网络图

    只需要第二个edges文件就能构建网络图。导入该文件后,在软件的导入设置中,将第一列设置为fromNode,第二列设置为toNode,最后把第三列设为网络关系属性,完成设置,便可生成网络图了。

    WGCNA样本要求

    由于WGCNA是基于相关系数的表达调控网络分析方法。当样本数过低的时候,相关系数的计算是不可靠的,得到的调控网络价值不大。所以,我们推荐的样本数如下:

  • 当独立样本数≥8(非重复样本)时,可以考虑基于Pearson相关系数的WGCNA共表达网络的方法(效果看实际情况而定);
  • 当样本数≥15(可以包含生物学重复)时,WGCNA方法会有更好的效果。
  • 当样品数<8时,不建议进行该项分析。
  • 该方法对于不同材料或不同组织进行分析更有意义,对于不同时间点处理相同样品意义不大。
  • 报错暨解决办法

    错误1

    运行 pickSoftThreshold 函数报错如下

    1
    2
    3
    4
    5
    > sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
    pickSoftThreshold: will use block size 773.
    pickSoftThreshold: calculating connectivity for given powers...
    ..working on genes 1 through 773 of 57862
    Error in { : task 1 failed - "'x' has a zero dimension."

    检查 datExpr 变量

    1
    2
    > dim(datExpr)
    [1] 0 50057

    重新dim检查之前的datExpr变量,绝对不是0行,而 datExpr变量的最近一次处理是 keepSamples = (clust==1)
    察看keepSamples = (clust==1)之后的keepSamples变量

    1
    2
    > keepSamples
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

    都是FALSE???,所以引起后续datExpr = datExpr0[keepSamples, ]的错误;
    继续往上查找变量clust,即clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize = 10)cutreeStatic 函数会通过cutHeightminSize 的设定对聚类树进行cut去掉异常的样本,

    cutHeight :height at which branches are to be cut.
    minSize :minimum number of object on a branch to be considered a cluster.

    所以调整相应的cutHeightminSize 参数,察看 clust变量的值,保留可行的样本;

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    > clust
    [1] 1 2 1 2 1 1 2 1 1 1 1 1
    > keepSamples = (clust==1)
    > keepSamples
    [1] TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
    > clust2
    [1] 0 0 0 0 0 0 0 0 0 0 0 0
    > keepSamples = (clust2==0)
    > keepSamples
    [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

    或者简单暴力手动去除刚开始输入数据中异常样本,不做 聚类检测输入样本是否含有异常值(obvious outliers)这一过程;

    错误2

    按照WGCNA手册第五步Network visualization using WGCNA functions时报错如下:

    1
    2
    3
    4
    > TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")

    Error in .heatmap(as.matrix(dissim), Rowv = as.dendrogram(dendro, hang = 0.1), :
    row dendrogram ordering gave index of wrong length

    看到row dendrogram ordering gave index of wrong length这句报错内容,分别察看plotTOM, geneTree, moduleColors这三个变量length;

    1
    2
    3
    > dim(plotTOM)
    > geneTree
    > moduleColors

    果然,三者的length不同,发现geneTree少了一些,往回找geneTree来源 geneTree = net$dendrograms[[1]],net来源于网络构建过程:

    1
    2
    3
    4
    5
    6
    7
    net = blockwiseModules(datExpr, power = 6,
    TOMType = "unsigned", minModuleSize = 30,
    reassignThreshold = 0, mergeCutHeight = 0.25,
    numericLabels = TRUE, pamRespectsDendro = FALSE,
    saveTOMs = TRUE,
    saveTOMFileBase = "femaleMouseTOM",
    verbose = 3)

    所以,这是问题所在,继续察看文档发现基于内存等因素blockwiseModules函数默认最大maxBlockSize=5000(即最大5000个基因数目),而我们的数据超过了这个值,所以函数自动将输入datExpr数据做了拆分处理,而解决办法也很简单,设置maxBlockSize参数大于我们的值(dim(datExpr)所显示的数据行数) 即可。

    相关讨论

    signed vs unsigned network

    I think the “sign” represents the sign of weight on the edges. If you care about the sign, for example sign represents positive or negative regulation between two nodes, then you need to choose “signed”. On the other hand, if the weight just represent the strength of relatedness between two nodes, you might don’t care positive or negative, so you need to choose “unsigned”. That really depends on your needs. 【WGCNA signed vs unsigned network

    Should you use a signed or unsigned network? By and large, I recommend using one of the signed varieties, for two main reasons. First, more often than not, direction does matter: it is important to know where node profiles go up and where they go down, and mixing negatively correlated nodes together necessarily mixes the two directions together. Second, negatively correlated nodes often belong to different categories. For example, in gene expression data, negatively correlated genes tend to come from biologically very different categories. It is true that some pathways or processes involve pairs of genes that are negatively correlated; if there are enough negatively correlated genes, they will form a module on their own and the two modules can then be analyzed together. (For the advanced practitioner, another option is to use the fuzzy module membership measure based on the module eigengene to attach a few strongly negatively correlated genes to a module after the modules have been identified). 【Signed or unsigned: which network type is preferable?

    Should I filter probesets or genes

    Probesets or genes may be filtered by mean expression or variance (or their robust analogs such as median and median absolute deviation, MAD) since low-expressed or non-varying genes usually represent noise. Whether it is better to filter by mean expression or variance is a matter of debate; both have advantages and disadvantages, but more importantly, they tend to filter out similar sets of genes since mean and variance are usually related.

    We do not recommend filtering genes by differential expression. WGCNA is designed to be an unsupervised analysis method that clusters genes based on their expression profiles. Filtering genes by differential expression will lead to a set of correlated genes that will essentially form a single (or a few highly correlated) modules. It also completely invalidates the scale-free topology assumption, so choosing soft thresholding power by scale-free topology fit will fail. 【Data analysis questions

    power must be between 1 and 30

    First, the user should ensure that variables (probesets, genes etc.) have not been filtered by differential expression with respect to a sample trait. See item 2 above for details about beneficial and detrimental filtering genes or probesets.

    If the scale-free topology fit index fails to reach values above 0.8 for reasonable powers (less than 15 for unsigned or signed hybrid networks, and less than 30 for signed networks) and the mean connectivity remains relatively high (in the hundreds or above), chances are that the data exhibit a strong driver that makes a subset of the samples globally different from the rest. The difference causes high correlation among large groups of genes which invalidates the assumption of the scale-free topology approximation.

    Lack of scale-free topology fit by itself does not invalidate the data, but should be looked into carefully. It always helps to plot the sample clustering tree and any technical or biological sample information below it as in Figure 2 of Tutorial I, section 1; strong clusters in the clustering tree indicate globally different groups of samples. It could be the result a technical effect such as a batch effect, biological heterogeneity (e.g., a data set consisting of samples from 2 different tissues), or strong changes between conditions (say in a time series). One should investigate carefully whether there is sample heterogeneity, what drives the heterogeneity, and whether the data should be adjusted (see previous point).

    If the lack of scale-free topology fit turns out to be caused by an interesting biological variable that one does not want to remove (i.e., adjust the data for), the appropriate soft-thresholding power can be chosen based on the number of samples as in the table below. This table has been updated in December 2017 to make the resulting networks conservative.

    Number of samples Unsigned and signed hybrid networks Signed networks
    Less than 20 9 18
    20-30 8 16
    30-40 7 14
    more than 40 6 12
    tiramisutes wechat
    欢迎关注