学习WGCNA总结


在转录组数据处理过程中我们经常会用到差异表达分析这一概念,通过比较不同处理或不同组织间基因表达量(FPKM)差异来寻找特异基因,但这前提是你的不同处理或不同组织样本较少,当不同处理或组织有较多样本,如40个,此时的两两比较有780组比较^_^,这根本不是我们想要的结果;

此时就需要WGCNA(weighted gene co-expression network analysis)将复杂的数据进行归纳整理。除了这种最常见的比较差异表达,我们还想知道在不同处理或不同组织间是否有些基因的表达存在内在的联系或相关性?WGCNA同样可以帮助我们预测基因间的相互作用关系。

WGCNA术语

权重(weghted)

Module

模块(module):表达模式相似的基因分为一类,这样的一类基因成为模块;

Eigengene

Eigengene(eigen- +‎ gene):基因和样本构成的矩阵,https://en.wiktionary.org/wiki/eigengene;

Adjacency Matrix

邻近矩阵:是图的一种存储形式,用一个一维数组存放图中所有顶点数据;用一个二维数组存放顶点间关系(边或弧)的数据,这个二维数组称为邻接矩阵;

Topological Overlap Matrix(TOM)

整体思路

先对数据进行处理→分层聚类→表达模式相似的基因组成模块→研究某一个模块中相关基因的功能富集(GO,KEGG),各个模块与样本表型数据间的相关性,各个模块与样本本身间的相关性(没有表型数据的情况,如不同组织)→具体到特定模块后分析其所包含基因间的相互作用网络关系,并找出其中的关键基因。

分析构建的网络寻找以下有用信息

  • 这类处于调控网络中心的基因称为核心基因(hub gene),这类基因通常是转录因子等关键的调控因子,是值得我们优先深入分析和挖掘的对象。
  • 在网络中,被调控线连接的基因,其表达模式是相似的。那么它们潜在有相似的功能。所以,在这个网络中,如果线条一端的基因功能是已知的,那么就可以预测线条另一端的功能未知的基因也有相似的功能。
  • R脚本

    输入数据为RNA-seq不同处理或组织所有样本的FPKM值组成的矩阵,切记含有 0 的要去掉;

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    153
    154
    155
    156
    157
    158
    159
    160
    161
    162
    163
    164
    165
    166
    167
    168
    169
    170
    171
    172
    173
    174
    175
    176
    177
    178
    179
    180
    181
    182
    183
    184
    185
    186
    187
    188
    189
    190
    191
    192
    193
    194
    195
    196
    197
    198
    199
    200
    201
    202
    203
    204
    205
    206
    207
    208
    209
    210
    211
    212
    213
    214
    215
    216
    217
    218
    219
    220
    221
    222
    223
    224
    225
    226
    227
    228
    229
    230
    231
    232
    233
    234
    235
    236
    237
    238
    239
    240
    241
    242
    243
    244
    245
    246
    247
    248
    249
    250
    251
    252
    253
    254
    255
    256
    257
    258
    259
    260
    261
    262
    263
    264
    265
    266
    267
    268
    269
    270
    271
    272
    273
    274
    275
    276
    277
    278
    279
    280
    281
    282
    283
    284
    285
    286
    287
    288
    289
    290
    291
    setwd("F:/WGCNA")
    library(WGCNA)
    options(stringsAsFactors = FALSE)
    enableWGCNAThreads()
    #1. 数据读入,处理和保存
    fpkm<- read.csv("trans_counts.counts.matrix.TMM_normalized.FPKM.nozero.csv")
    head(fpkm)
    dim(fpkm)
    names(fpkm)
    datExpr0=as.data.frame(t(fpkm[,-c(1)]));
    names(datExpr0)=fpkm$trans;
    rownames(datExpr0)=names(fpkm)[-c(1)];
    #data<-log10(date[,-1]+0.01)
    gsg = goodSamplesGenes(datExpr0, verbose = 3);
    gsg$allOK
    sampleTree = hclust(dist(datExpr0), method = "average")
    #sizeGrWindow(12,9)
    par(cex = 0.6)
    par(mar = c(0,4,2,0))
    plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
    cex.axis = 1.5, cex.main = 2)
    abline(h = 80000, col = "red");
    clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize = 10)
    table(clust)
    keepSamples = (clust==1)
    datExpr = datExpr0[keepSamples, ]
    nGenes = ncol(datExpr)
    nSamples = nrow(datExpr)
    save(datExpr, file = "AS-green-FPKM-01-dataInput.RData")
    #2. 选择合适的阀值
    powers = c(c(1:10), seq(from = 12, to=20, by=2))
    # Call the network topology analysis function
    sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
    # Plot the results:
    ##sizeGrWindow(9, 5)
    par(mfrow = c(1,2));
    cex1 = 0.9;
    # Scale-free topology fit index as a function of the soft-thresholding power
    plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
    main = paste("Scale independence"));
    text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
    labels=powers,cex=cex1,col="red");
    # this line corresponds to using an R^2 cut-off of h
    abline(h=0.90,col="red")
    # Mean connectivity as a function of the soft-thresholding power
    plot(sft$fitIndices[,1], sft$fitIndices[,5],
    xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
    main = paste("Mean connectivity"))
    text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

    #=====================================================================================
    # 网络构建有两种方法,One-step和Step-by-step;
    # 第一种:一步法进行网络构建
    #=====================================================================================

    #3. 一步法网络构建:One-step network construction and module detection
    net = blockwiseModules(datExpr, power = 6, maxBlockSize = 6000,
    TOMType = "unsigned", minModuleSize = 30,
    reassignThreshold = 0, mergeCutHeight = 0.25,
    numericLabels = TRUE, pamRespectsDendro = FALSE,
    saveTOMs = TRUE,
    saveTOMFileBase = "AS-green-FPKM-TOM",
    verbose = 3)
    table(net$colors)
    #4. 绘画结果展示
    # open a graphics window
    #sizeGrWindow(12, 9)
    # Convert labels to colors for plotting
    mergedColors = labels2colors(net$colors)
    # Plot the dendrogram and the module colors underneath
    plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
    "Module colors",
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05)
    #5.结果保存
    moduleLabels = net$colors
    moduleColors = labels2colors(net$colors)
    table(moduleColors)
    MEs = net$MEs;
    geneTree = net$dendrograms[[1]];
    save(MEs, moduleLabels, moduleColors, geneTree,
    file = "AS-green-FPKM-02-networkConstruction-auto.RData")
    #6. 导出网络到Cytoscape
    # Recalculate topological overlap if needed
    TOM = TOMsimilarityFromExpr(datExpr, power = 6);
    # Read in the annotation file
    # annot = read.csv(file = "GeneAnnotation.csv");
    # Select modules需要修改,选择需要导出的模块颜色
    modules = c("turquoise", "blue");
    # Select module probes选择模块探测
    probes = names(datExpr)
    inModule = is.finite(match(moduleColors, modules));
    modProbes = probes[inModule];
    #modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
    # Select the corresponding Topological Overlap
    modTOM = TOM[inModule, inModule];
    dimnames(modTOM) = list(modProbes, modProbes)
    # Export the network into edge and node list files Cytoscape can read
    cyt = exportNetworkToCytoscape(modTOM,
    edgeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
    nodeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
    weighted = TRUE,
    threshold = 0.02,
    nodeNames = modProbes,
    #altNodeNames = modGenes,
    nodeAttr = moduleColors[inModule]);
    #=====================================================================================
    # 分析网络可视化,用heatmap可视化权重网络,heatmap每一行或列对应一个基因,颜色越深表示有较高的邻近
    #=====================================================================================
    options(stringsAsFactors = FALSE);
    lnames = load(file = "AS-green-FPKM-01-dataInput.RData");
    lnames
    lnames = load(file = "AS-green-FPKM-02-networkConstruction-auto.RData");
    lnames
    nGenes = ncol(datExpr)
    nSamples = nrow(datExpr)
    #1. 可视化全部基因网络
    # Calculate topological overlap anew: this could be done more efficiently by saving the TOM
    # calculated during module detection, but let us do it again here.
    dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
    # Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
    plotTOM = dissTOM^7;
    # Set diagonal to NA for a nicer plot
    diag(plotTOM) = NA;
    # Call the plot function
    #sizeGrWindow(9,9)
    TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
    #随便选取1000个基因来可视化
    nSelect = 1000
    # For reproducibility, we set the random seed
    set.seed(10);
    select = sample(nGenes, size = nSelect);
    selectTOM = dissTOM[select, select];
    # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
    selectTree = hclust(as.dist(selectTOM), method = "average")
    selectColors = moduleColors[select];
    # Open a graphical window
    #sizeGrWindow(9,9)
    # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
    # the color palette; setting the diagonal to NA also improves the clarity of the plot
    plotDiss = selectTOM^7;
    diag(plotDiss) = NA;
    TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
    #=====================================================================================
    # 第二种:一步步的进行网络构建
    #=====================================================================================
    ###################Step-by-step network construction and module detection
    #2.选择合适的阀值,同上
    #3. 网络构建:(1) Co-expression similarity and adjacency
    softPower = 6;
    adjacency = adjacency(datExpr, power = softPower);
    #(2) 邻近矩阵到拓扑矩阵的转换,Turn adjacency into topological overlap
    TOM = TOMsimilarity(adjacency);
    dissTOM = 1-TOM
    # (3) 聚类拓扑矩阵
    #Call the hierarchical clustering function
    geneTree = hclust(as.dist(dissTOM), method = "average");
    # Plot the resulting clustering tree (dendrogram)
    #sizeGrWindow(12,9)
    plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
    labels = FALSE, hang = 0.04);
    #(4) 聚类分支的休整dynamicTreeCut
    # We like large modules, so we set the minimum module size relatively high:
    minModuleSize = 30;
    # Module identification using dynamic tree cut:
    dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
    deepSplit = 2, pamRespectsDendro = FALSE,
    minClusterSize = minModuleSize);
    table(dynamicMods)
    #4. 绘画结果展示
    # Convert numeric lables into colors
    dynamicColors = labels2colors(dynamicMods)
    table(dynamicColors)
    # Plot the dendrogram and colors underneath
    #sizeGrWindow(8,6)
    plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05,
    main = "Gene dendrogram and module colors")
    #5. 聚类结果相似模块的融合,Merging of modules whose expression profiles are very similar
    #在聚类树中每一leaf是一个短线,代表一个基因,
    #不同分之间靠的越近表示有高的共表达基因,将共表达极其相似的modules进行融合
    # Calculate eigengenes
    MEList = moduleEigengenes(datExpr, colors = dynamicColors)
    MEs = MEList$eigengenes
    # Calculate dissimilarity of module eigengenes
    MEDiss = 1-cor(MEs);
    # Cluster module eigengenes
    METree = hclust(as.dist(MEDiss), method = "average");
    # Plot the result
    #sizeGrWindow(7, 6)
    plot(METree, main = "Clustering of module eigengenes",
    xlab = "", sub = "")
    #选择有75%相关性的进行融合
    MEDissThres = 0.25
    # Plot the cut line into the dendrogram
    abline(h=MEDissThres, col = "red")
    # Call an automatic merging function
    merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
    # The merged module colors
    mergedColors = merge$colors;
    # Eigengenes of the new merged modules:
    mergedMEs = merge$newMEs;
    #绘制融合前(Dynamic Tree Cut)和融合后(Merged dynamic)的聚类图
    #sizeGrWindow(12, 9)
    #pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6)
    plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
    c("Dynamic Tree Cut", "Merged dynamic"),
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05)
    #dev.off()
    # 只是绘制融合后聚类图
    plotDendroAndColors(geneTree,mergedColors,"Merged dynamic",
    dendroLabels = FALSE, hang = 0.03,
    addGuide = TRUE, guideHang = 0.05)
    #5.结果保存
    # Rename to moduleColors
    moduleColors = mergedColors
    # Construct numerical labels corresponding to the colors
    colorOrder = c("grey", standardColors(50));
    moduleLabels = match(moduleColors, colorOrder)-1;
    MEs = mergedMEs;
    # Save module colors and labels for use in subsequent parts
    save(MEs, moduleLabels, moduleColors, geneTree, file = "AS-green-FPKM-02-networkConstruction-stepByStep.RData")
    #6. 导出网络到Cytoscape
    # Recalculate topological overlap if needed
    TOM = TOMsimilarityFromExpr(datExpr, power = 6);
    # Read in the annotation file
    # annot = read.csv(file = "GeneAnnotation.csv");
    # Select modules需要修改
    modules = c("brown", "red");
    # Select module probes
    probes = names(datExpr)
    inModule = is.finite(match(moduleColors, modules));
    modProbes = probes[inModule];
    #modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
    # Select the corresponding Topological Overlap
    modTOM = TOM[inModule, inModule];
    dimnames(modTOM) = list(modProbes, modProbes)
    # Export the network into edge and node list files Cytoscape can read
    cyt = exportNetworkToCytoscape(modTOM,
    edgeFile = paste("AS-green-FPKM-Step-by-step-CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
    nodeFile = paste("AS-green-FPKM-Step-by-step-CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
    weighted = TRUE,
    threshold = 0.02,
    nodeNames = modProbes,
    #altNodeNames = modGenes,
    nodeAttr = moduleColors[inModule]);
    #=====================================================================================
    # 分析网络可视化,用heatmap可视化权重网络,heatmap每一行或列对应一个基因,颜色越深表示有较高的邻近
    #=====================================================================================
    options(stringsAsFactors = FALSE);
    lnames = load(file = "AS-green-FPKM-01-dataInput.RData");
    lnames
    lnames = load(file = "AS-green-FPKM-02-networkConstruction-stepByStep.RData");
    lnames
    nGenes = ncol(datExpr)
    nSamples = nrow(datExpr)
    #1. 可视化全部基因网络
    # Calculate topological overlap anew: this could be done more efficiently by saving the TOM
    # calculated during module detection, but let us do it again here.
    dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
    # Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
    plotTOM = dissTOM^7;
    # Set diagonal to NA for a nicer plot
    diag(plotTOM) = NA;
    # Call the plot function
    #sizeGrWindow(9,9)
    TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
    #随便选取1000个基因来可视化
    nSelect = 1000
    # For reproducibility, we set the random seed
    set.seed(10);
    select = sample(nGenes, size = nSelect);
    selectTOM = dissTOM[select, select];
    # There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
    selectTree = hclust(as.dist(selectTOM), method = "average")
    selectColors = moduleColors[select];
    # Open a graphical window
    #sizeGrWindow(9,9)
    # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
    # the color palette; setting the diagonal to NA also improves the clarity of the plot
    plotDiss = selectTOM^7;
    diag(plotDiss) = NA;
    TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
    #此处画的是根据基因间表达量进行聚类所得到的各模块间的相关性图
    MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
    MET = orderMEs(MEs)
    sizeGrWindow(7, 6)
    plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)

    部分结果图简单解释

    Cytoscape生成网络图

    只需要第二个edges文件就能构建网络图。导入该文件后,在软件的导入设置中,将第一列设置为fromNode,第二列设置为toNode,最后把第三列设为网络关系属性,完成设置,便可生成网络图了。

    WGCNA样本要求

    由于WGCNA是基于相关系数的表达调控网络分析方法。当样本数过低的时候,相关系数的计算是不可靠的,得到的调控网络价值不大。所以,我们推荐的样本数如下:

  • 当独立样本数≥8(非重复样本)时,可以考虑基于Pearson相关系数的WGCNA共表达网络的方法(效果看实际情况而定);
  • 当样本数≥15(可以包含生物学重复)时,WGCNA方法会有更好的效果。
  • 当样品数<8时,不建议进行该项分析。
  • 该方法对于不同材料或不同组织进行分析更有意义,对于不同时间点处理相同样品意义不大。
  • 报错暨解决办法

    按照WGCNA手册第五步Network visualization using WGCNA functions时报错如下:

    1
    2
    3
    4
    > TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")

    Error in .heatmap(as.matrix(dissim), Rowv = as.dendrogram(dendro, hang = 0.1), :
    row dendrogram ordering gave index of wrong length

    看到row dendrogram ordering gave index of wrong length这句报错内容,分别察看plotTOM, geneTree, moduleColors这三个变量length;

    1
    2
    3
    > dim(plotTOM)
    > geneTree
    > moduleColors

    果然,三者的length不同,发现geneTree少了一些,往回找geneTree来源 geneTree = net$dendrograms[[1]],net来源于网络构建过程:

    1
    2
    3
    4
    5
    6
    7
    net = blockwiseModules(datExpr, power = 6,
    TOMType = "unsigned", minModuleSize = 30,
    reassignThreshold = 0, mergeCutHeight = 0.25,
    numericLabels = TRUE, pamRespectsDendro = FALSE,
    saveTOMs = TRUE,
    saveTOMFileBase = "femaleMouseTOM",
    verbose = 3)

    所以,这是问题所在,继续察看文档发现blockwiseModules函数默认最大maxBlockSize=5000,而我们的数据超过了这个值,所以函数自动做了拆分处理,而解决办法也很简单,设置maxBlockSize参数大于我们的值即可。

    tiramisutes wechat
    欢迎关注