hope

PsctH: a plant single-cell transcriptome database

2021-10-21T13:55:09.000Z

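As the basic building blocks of all living organisms, cells play a vital role in sustaining life. Single-cell transcriptome sequencing, a key tool for studying cellular heterogeneity, has advanced rapidly in recent years and pushed biological research into the single-cell era. In plant science, however, single-cell research is still in its infancy and available resources are very limited; preparing single-cell suspensions (that is, protoplast suspensions) and annotating cell clusters remain the two main obstacles.
Recently, Plant Biotechnology Journal published online a paper from Professor Shuangxia Jin's group in the cotton genetic improvement team at Huazhong Agricultural University, titled "Plant Single Cell Transcriptome Hub (PsctH): an integrated online tool to explore the plant single-cell transcriptome landscape". The paper presents PsctH, an integrated plant single-cell transcriptome database that provides a comprehensive single-cell marker gene resource and a workflow for single-cell research.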

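PsctH has four main sections: a manual for preparing plant single-cell suspensions, a customizable plant single-cell transcriptome analysis pipeline, a plant single-cell marker gene resource, and raw plant single-cell sequencing datasets. Together they cover the whole course of a single-cell study.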

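Unlike animal cells, plant cells have a cell wall, which makes single-cell suspensions harder to prepare. Drawing on the lab's earlier experience with protoplast dissociation, the authors compared and evaluated different protoplast preparation protocols and provide a practical, efficient manual for preparing plant single-cell suspensions. It covers tissue sample preparation and dissection, enzymatic digestion, purification, and checking the integrity of individual cells. PsctH also provides a flexible plant single-cell transcriptome (scRNA-seq) analysis pipeline covering quality control, normalization, clustering, and marker gene identification. Given the complexity and data volume of single-cell studies, online analysis is not offered for now; instead, users configure the key parameters and obtain a tailored R analysis script. An R environment file (SingleCellCondaEnvironment.yml) is provided as well, so that researchers can recreate the analysis environment with conda and run the generated R script.
On the other hand, because reliable marker gene resources are scarce, plant single-cell studies usually spend a great deal of time on RNA in situ hybridization or reporter gene transformation to obtain solid experimental evidence for assigning cell clusters to tissue types. The authors therefore surveyed published plant single-cell studies and collected 98 marker genes (all validated by RNA in situ hybridization or reporter genes) for 51 cell types in 9 tissues or sub-tissues from 5 plant species (Arabidopsis, maize, rice, peanut, and tomato). These are stored and displayed in the MarkerGeneDB section of PsctH.
PsctH also offers raw plant single-cell sequencing datasets and literature mining for plant single-cell publications. As single-cell research moves quickly, the database is updated regularly (both new features and existing content, especially the marker gene resource), and researchers in the field are welcome to contribute resources and suggestions.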

Nature Methods | University of Toronto improves CRISPRi silencing of target genes by optimizing the KRAB domain

2020-10-07T06:16:32.000Z

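Introduction to CRISPRi

The simplicity and efficiency of the CRISPR-Cas9 system quickly made it a favorite in life science labs. More recently the CRISPR-Cas9 toolbox has expanded considerably, adding CRISPRi and CRISPRa. When catalytically inactive Cas9 (dCas9) is targeted to a promoter region, it sterically hinders the transcription machinery and thereby represses gene expression. Fusing dCas9 to a repressor domain such as the Krüppel-associated box (KRAB) gives efficient transcriptional silencing. This process is called CRISPRi (CRISPR interference). Because the DNA itself is not altered, CRISPRi produces a reversible knockdown rather than a knockout. Likewise, dCas9 fused to transcriptional activators (such as VP64 and p65) can be targeted to promoter and enhancer regions to upregulate gene expression, which is called CRISPRa (CRISPR activation) (Figure 1).
Figure 1. dCas9-mediated gene regulation systems. (Cell. 2013 Jul 18; 154(2): 442–451)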

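CRISPRi versus RNAi

Reading that description of CRISPRi, you may well ask: how does it differ from RNAi, which has fallen from grace? And how does it differ from CRISPR?
First, compared with RNAi:

- ① Different targets: RNAi targets RNA, whereas CRISPRi targets DNA at the transcription start site (TSS).
- ② Different cellular compartments: RNAi generally cleaves its target in the cytoplasm, whereas CRISPRi acts in the nucleus.
- ③ Different off-target rates: RNAi appears to have more off-target effects, whereas the CRISPRi complex must sit near a transcription start site to act, which reduces off-target effects.

In addition, placing a conditional promoter in front of the dCas9-KRAB fusion enables conditional CRISPRi. Expressing several sgRNAs, promoters, and dCas9-KRAB in tandem allows multiplexed CRISPRi, silencing several genes at once (Zheng, Y., Shen, W., Zhang, J. et al. CRISPR interference-based specific and efficient gene inactivation in the brain. Nat Neurosci 21, 447–454 (2018). https://doi.org/10.1038/s41593-018-0077-5 ).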

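CRISPRi versus CRISPR

Although gene knockout with CRISPR tools has important applications in gene function studies and trait improvement, it has limitations. For example, complete loss of a gene's function can produce an extreme phenotype, whereas breeding usually aims for intermediate phenotypes. CRISPRi can also target lncRNAs, microRNAs, antisense transcripts, and nuclear-localized transcripts, making it a powerful tool for studying non-protein-coding genes (Figure 2).
For all its strengths, CRISPRi still has some "minor issues": silencing of the target gene can be incomplete; the PAM requirement limits the number of targetable sequences to some extent; chromatin state and modifications may affect binding of dCas9-gRNA to DNA; and when the target sequence overlaps another gene or lies in a bidirectional promoter, CRISPRi may also affect expression of neighboring genes.
Figure 2. Schematic of loss-of-function and gain-of-function systems. (ACS Chem Biol. 2018 Feb 16;13(2):406-416)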

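Optimizing CRISPRi

Recently, Nature Methods published online a paper from the University of Toronto and the Canadian Institute for Advanced Research (CIFAR) titled "An efficient KRAB domain for CRISPRi applications in human cells". By optimizing the KRAB domain, the authors improved silencing of target genes and identified ZIM3 KRAB–dCas9 as the most efficient gene-silencing construct among those tested.
The authors fused 57 human KRAB domains to the N terminus of dCas9 and, using lentiviral transduction, measured their activity when recruited to two different reporter constructs. In one setup, the dCas9-KRAB fusions were targeted to two sites in the SV40 promoter, which drives expression of enhanced green fluorescent protein (EGFP) in human embryonic kidney 293T (HEK293T) cells. In the other, the fusions were targeted to a 7×TetO array downstream of a PGK1-EGFP reporter in K562 cells (Figure 3). Both reporter cell lines were derived from single-cell clones to ensure consistent reporter and gRNA expression levels.
Figure 3. The KRAB domain screening system.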

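In both systems, relative to the Nanoluc–dCas9 control, the dCas9-KRAB fusions showed different levels of target silencing 21 days after recruitment, ranging from almost complete silencing to none. Results were broadly consistent between the two screening systems (R² = 0.59), suggesting that the differences in silencing are a property of the KRAB domains rather than of the cell type. The ZIM3 KRAB domain gave strong silencing in both systems (Figure 4).
Figure 4. Silencing of the target gene by different dCas9-KRAB fusions.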
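
To make the agreement statistic quoted above concrete, here is a minimal sketch of how an R² between two screens could be computed as the squared Pearson correlation. The expression values are hypothetical placeholders, not data from the paper, and the paper may have computed its R² differently.

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Hypothetical residual reporter expression (fraction of control) for five KRAB domains
hek293t = [0.05, 0.20, 0.45, 0.70, 0.95]
k562 = [0.10, 0.15, 0.60, 0.55, 0.90]
print(round(r_squared(hek293t, k562), 2))
```

A value near 1 would mean the two reporter systems rank the domains almost identically; a value near 0 would mean the ranking depends on the system.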

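The authors then delivered ZIM3 KRAB–dCas9, the existing KOX1 KRAB–dCas9 and KOX1 KRAB–MeCP2–dCas9, and the negative control Nanoluc–dCas9 by lentiviral transduction to target five endogenous promoters in HEK293T cells, and measured silencing after 9 days by reverse transcription quantitative PCR (RT-qPCR). At four of the promoters, ZIM3 KRAB silenced significantly better than KOX1 KRAB or KOX1 KRAB–MeCP2. ZIM3 KRAB also repressed CD81 expression more strongly than the other two constructs (Figure 5).
Figure 5. Targeting of the ERK1, SEL1L, and CD81 promoters and RT–qPCR expression levels of the corresponding genes.

In summary, the authors identified a highly efficient KRAB domain that silences target genes more effectively and is less sensitive to gRNA choice. Beyond silencing efficiency, the ZIM3 KRAB fusion is smaller than the KOX1 KRAB–MeCP2 construct, which leaves room for more gRNAs in tandem and suits large-scale multigene screens.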

Scraping the journal Plant Cell

2020-08-29T05:42:20.000Z

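Scrape the titles and links of all Articles published in Plant Cell from its first issue to the present, and call the Baidu Translate API to translate the titles into Chinese.

Usage

First register a Baidu Translate account and obtain the API credentials, then edit lines 44 and 45 of the `Spider_Plant_Cell.py` script 👇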

appid = ''
secretKey = ''

Edit line 227 to choose the first and last year to scrape 👇

film = parse_detail_page(1989, 2021, tool)

Then run python Spider_Plant_Cell.py to start scraping.
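
For readers wondering what the two credentials are used for, the following is a minimal sketch of how a request to the Baidu general translation API is typically signed. The endpoint URL, the parameter names, and the helper `build_translate_query` are assumptions for illustration and are not taken from Spider_Plant_Cell.py; check them against Baidu's current documentation before relying on them.

```python
import hashlib
import random
from urllib.parse import urlencode

# Assumed endpoint of the Baidu general translation API
API_URL = "https://fanyi-api.baidu.com/api/trans/vip/translate"

def build_translate_query(title, appid, secret_key, salt=None):
    """Build the request URL for one English-to-Chinese title translation."""
    if salt is None:
        salt = str(random.randint(32768, 65536))
    # Signature: MD5 over appid + query text + salt + secret key, in that order
    raw = (appid + title + salt + secret_key).encode("utf-8")
    sign = hashlib.md5(raw).hexdigest()
    params = {"q": title, "from": "en", "to": "zh",
              "appid": appid, "salt": salt, "sign": sign}
    return API_URL + "?" + urlencode(params)

# Placeholder credentials; real values come from the Baidu developer console
url = build_translate_query("A CLASSY RNA Silencing Signaling Mutant in Arabidopsis",
                            appid="my-appid", secret_key="my-secret", salt="12345")
print(url)
```

The sketch only builds the URL and sends nothing over the network, so it can be run without an account to see how the signature changes with the inputs.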

Output

CSV file

2004	A CDC45 Homolog in Arabidopsis Is Essential for Meiosis, as Shown by RNA Interference–Induced Gene Silencing	如RNA干扰诱导的基因沉默所示，拟南芥CDC45同系物对减数分裂至关重要	http://www.plantcell.org/content/16/1/99
2007	A CLASSY RNA Silencing Signaling Mutant in Arabidopsis	拟南芥一个RNA沉默信号突变株	http://www.plantcell.org/content/19/5/1439
2007	A CRM Domain Protein Functions Dually in Group I and Group II Intron Splicing in Land Plant Chloroplasts	陆地植物叶绿体中CRM结构域蛋白在Ⅰ组和Ⅱ组内含子剪接中的双重功能	http://www.plantcell.org/content/19/12/3864
2015	A Cascade of Sequentially Expressed Sucrose Transporters in the Seed Coat and Endosperm Provides Nutrition for the Arabidopsis Embryo	种皮和胚乳中蔗糖转运蛋白的级联表达为拟南芥胚胎提供了营养	http://www.plantcell.org/content/27/3/607
2011	A Case for Spatial Regulation in Tetrapyrrole Biosynthesis	四吡咯生物合成的空间调控	http://www.plantcell.org/content/23/12/4167
2001	A Cell Plate–Specific Callose Synthase and Its Interaction with Phragmoplastin	细胞板特异性胼胝质合成酶及其与胞浆蛋白的相互作用	http://www.plantcell.org/content/13/4/755
2009	A Cell Wall–Degrading Esterase of Xanthomonas oryzae Requires a Unique Substrate Recognition Module for Pathogenesis on Rice	水稻黄单胞菌的细胞壁降解酯酶需要一个独特的底物识别模块来研究水稻的发病机制	http://www.plantcell.org/content/21/6/1860

Title word cloud

Source code

Nature Communications: the chromosome-scale reference genome of the "King of Spices"

2019-10-13T01:36:59.000Z

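Black pepper (Piper nigrum, 2n = 52) is an evergreen tropical vine in the genus Piper, family Piperaceae, order Piperales, within the magnoliids. It is the spice with the longest history of use worldwide and is known as the "King of Spices". In medieval Europe pepper was a symbol of social status and wealth; it was priced above gold and silver and even circulated as currency. Major historical events such as the Visigoths' invasion of Rome, the Crusades, and the opening of new sea routes were closely tied to the struggle for pepper supplies and trading rights, and pepper is regarded as an important commodity that fostered cultural exchange between East and West. Today Hainan Island is China's main pepper-growing region, accounting for more than 95% of national output, and pepper has become an important tropical crop industry on which the income of over a million people in tropical areas depends. The Spice and Beverage Research Institute of the Chinese Academy of Tropical Agricultural Sciences (CATAS), located in the Xinglong Overseas Chinese Tourism Economic Zone in Wanning, southeastern Hainan, is mainly responsible for research and industrial technology development for tropical spice and beverage crops such as pepper, coffee, and cocoa.

Since acquiring a high-performance computing cluster in 2014, the National Key Laboratory of Crop Genetic Improvement at Huazhong Agricultural University has built up solid bioinformatics capabilities. It has completed genome maps for major crops including allotetraploid upland and sea-island cotton (see "Biomarker helps publish the tetraploid sea-island and upland cotton genomes in Nature Genetics") and maize (see "Nature Genet. | Jianbing Yan's group at Huazhong Agricultural University releases a high-quality tropical maize genome and clones a maize yield gene"). On this basis, the Spice and Beverage Research Institute of CATAS joined Huazhong Agricultural University to build a high-quality genome map of black pepper, an important tropical crop, a project that also serves the national Belt and Road Initiative.

On 16 October 2019, Nature Communications published online the paper "The chromosome-scale reference genome of black pepper provides insight into piperine biosynthesis", completed by the Spice and Beverage Research Institute of CATAS together with Huazhong Agricultural University, the Academy of Sciences Malaysia, and others, seven institutions in all. The study produced a fine chromosome-level genome map of the Chinese pepper cultivar "Reyin No. 1" (the first reported genome assembly for a species in the order Piperales of the magnoliids), described the genomic features and evolutionary position of pepper, and examined in depth the piperine biosynthesis network and its key genes and gene families, offering new insight into angiosperm evolution and piperine biosynthesis.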

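1. Building a high-quality chromosome-level pepper genome

Black pepper is the first species in Piperales, within the basal angiosperms, to have its whole genome sequenced. k-mer analysis estimated a genome size of about 761.74 Mb, heterozygosity of 1.33%, and a repeat content of 59.54%, which makes it a complex genome with both high heterozygosity and high repeat content. The authors therefore combined four technologies, PacBio long reads, 10X Genomics, BioNano single-molecule optical maps based on direct label and stain (DLS), and Hi-C chromosome conformation capture, for de novo assembly and chromosome anchoring, and used Illumina short reads to correct potential InDels and small misassemblies. The initial assembly from PacBio and 10X Genomics data ("Piper_nigrum_v1") contained 1,277 scaffolds with an N50 of 2.3 Mb and a total size of 791.0 Mb. After BioNano- and Hi-C-assisted scaffolding, the final chromosome-level assembly "Piper_nigrum_v3" was 761.2 Mb in 45 scaffolds with an N50 of 29.8 Mb. Evaluation with CEGMA and BUSCO indicated high completeness and accuracy.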
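
Since N50 is the headline statistic in the assembly comparison above, here is a minimal sketch of how it is commonly computed: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembly size. The scaffold lengths are hypothetical toy values, not the pepper assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that pushes the running sum to >= 50% of the total
        if running * 2 >= total:
            return length

# Hypothetical scaffold lengths in Mb
toy_scaffolds = [40, 30, 20, 5, 3, 2]
print(n50(toy_scaffolds))
```

Merging many short scaffolds into a few long ones, as the BioNano and Hi-C steps do, raises N50 even when the total assembly size barely changes.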

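2. Structural and functional annotation of pepper genes

Repeats were first annotated and masked by combining de novo prediction with homology-based searches. Repeats make up 54.85% of the genome; interspersed repeats account for 54.01%, including 40.55% long terminal repeat retrotransposons (LTR-RTs), mainly LTR/Gypsy (27.63%) and LTR/Copia (9.95%). Gene structures were then predicted with the BRAKER2 pipeline, combining ab initio prediction, Piper protein sequences from UniProt, and RNA-seq and Iso-seq transcriptome data, which gave 63,466 protein-coding genes. The annotation also includes 1,514 transfer RNAs (tRNAs), 1,206 ribosomal RNAs (rRNAs), 1,533 small nuclear RNAs (snRNAs), 256 microRNAs (miRNAs), 6,509 long non-coding RNAs (lncRNAs), 59 transcription factor (TF) families, 157 transcriptional regulators (TRs), and 646 chromatin regulators (CRs). For functional annotation, 48,277 and 46,256 genes had hits in the NR and UniProt databases, respectively, and InterProScan identified 3,652 protein families and 2,071 GO categories. KEGG annotation assigned KO functions to 11,362 protein-coding genes, and 57,700 genes were annotated to 330 metabolic pathways. Secondary metabolite annotation identified 10 gene clusters related to alkaloid metabolism.

(a) Pepper chromosomes. (b-d) Distribution of GC content, repeat density, and gene density in 500 kb windows. (e-l) Gene expression in different tissues (from outside to inside: 2 MAP, 4 MAP, 6 MAP, 8 MAP, root, stem, leaf, and flower). (m) Distribution of secondary metabolite genes (green squares) and piperine pathway genes (red squares). (n) Syntenic blocks between chromosomes; band width is proportional to block size.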

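3. A whole-genome duplication event identified in pepper

Synteny analysis found 1,295 syntenic blocks within the pepper genome, covering about 45.10% of all annotated genes; 66.0% of the paralogs lie on different chromosomes and 34.0% on the same chromosome. Dot plots likewise show many duplicated regions within and between chromosomes. The distribution of synonymous substitution rates (Ks) for reciprocal best-hit gene pairs and for gene pairs in syntenic blocks has a single clear peak at about 0.1. This establishes a whole-genome duplication (WGD) in pepper, dated to roughly 17.2-17.9 million years ago (MYA).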
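
A Ks peak is usually converted to an age with the standard relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below applies that relation; the rate used is a placeholder chosen for illustration and is not a value taken from the paper, which may have used a different calibration.

```python
def wgd_age_mya(ks_peak, subs_per_site_per_year):
    """Convert a Ks peak to an age in million years using T = Ks / (2 * r)."""
    if subs_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    return ks_peak / (2.0 * subs_per_site_per_year) / 1e6

# The rate below is an illustrative assumption, not a value reported in the paper
print(round(wgd_age_mya(0.1, 2.85e-9), 1))
```

The factor of 2 reflects that both duplicated copies accumulate substitutions independently after the duplication.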

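4. The phylogenetic position of pepper and the magnoliids

Comparative genomic and phylogenetic analysis of 21 representative species established the phylogenetic position of pepper and of the magnoliids. Across pepper, 9 dicots, 3 monocots, 3 other magnoliids, Amborella, 2 gymnosperms, and 2 lower-plant outgroups (21 species in total), 82 single-copy orthologs were identified. A species tree and divergence times estimated from these 82 genes place the magnoliids as sister to the monocot-dicot clade and Piperales as sister to Magnoliales-Laurales, with divergence at about 175-187 MYA (95% HPD).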

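5. A first look at the molecular features of piperine biosynthesis in pepper

Analysis of gene family expansion and of RNA-seq data from different tissues pointed to the key metabolic processes in piperine synthesis: the phenylpropanoid pathway, the lysine pathway, and acyl transfer. A total of 91 gene families have expanded in pepper, 35 of them significantly (family-wide P-value ≤ 0.01).

Genes in the pepper-specific expanded families are significantly enriched for functions in secondary metabolism and disease defense. RNA-seq across tissues showed that some genes of the phenylpropanoid and lysine pathways are specifically and highly expressed in the berry, and that expansion of the BAHD-AT and SCPL-AT families is accompanied by high expression in berry tissue, linking phenylpropanoid and lysine metabolism. Sequence-level analysis of the expanded families in the phenylpropanoid and lysine pathways and of the BAHD-AT and SCPL-AT genes detected varying degrees of purifying and positive selection.
To understand piperine biosynthesis in pepper more fully, the authors concentrated on the biological processes related to it and on the expansion of the gene families involved. Comparative genomic and gene family analyses revealed marked changes in the LDC genes of lysine metabolism, the GTF, CYP, and HCT genes of phenylpropanoid metabolism, and the BAHD-AT and SCPL-AT families. Phenylpropanoid and amino acid pathways and acyltransferases are ubiquitous in plant secondary metabolism. Coffee, apple, grape, cacao, tea, pineapple, and sweet orange, for example, are all rich in phenylpropanoid derivatives, and in these species the corresponding secondary metabolites are likewise made by bringing the two processes (phenylpropanoid and lysine metabolism) together. In chili pepper in particular, capsaicinoids derive from the phenylpropanoid and branched-chain fatty acid pathways, and in tobacco the precursors of nicotine come from terpenoid and amino acid metabolism. In lotus, opium poppy, Macleaya cordata, and papaya, synthesis of lysine-derived quinoline alkaloids also combines phenylpropanoid and lysine metabolism. Piperine synthesis, however, starts with decarboxylation of lysine and amine oxidation, which is clearly different from benzylisoquinoline alkaloid biosynthesis, in which two tyrosine molecules are combined. Precursors from phenylpropanoid and lysine metabolism are then joined by an acyltransferase to form piperine. The convergence of phenylpropanoid and lysine metabolism, and in particular lysine decarboxylation, amine oxidation, and acyl transfer, are the main features of piperine synthesis. Evidence at the gene, transcript, and sequence-evolution levels shows what is specific about piperine biosynthesis in pepper and gives a first account of its molecular mechanism.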

6. Author contributions

C.H. and S.J. designed and supervised the research. Z.X. performed the genome assemblies and annotation. Z.X. and L.H. performed the transcriptome and phylogenetic analysis. L.H., H.W., X.Q., L.Y. and L.T. collected materials for sequencing and generated transcriptome data. Z.X., L.H., R.F. and B.W. analysed the RNA-seq data. M.W., D.Y., S.S., W.L., C.S., H.D., J.W., K.L. and X.Z. provided constructive comments and suggestions on data analysis. Z.X. and L.H. wrote the paper with input from all other authors. All authors approved the paper.

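7. Funding

The study was supported by the Ministry of Agriculture and Rural Affairs species resource conservation program, the Natural Science Foundation of Hainan Province, and the Central Public-interest Scientific Institution Basal Research Fund, among others. It reflects the emphasis the Chinese Academy of Tropical Agricultural Sciences places on basic and frontier research, its "Ten-Hundred-Thousand" science and technology program, and its push for original results. Researcher Chaoyun Hao of the Spice and Beverage Research Institute of CATAS and Professor Shuangxia Jin of Huazhong Agricultural University are co-corresponding authors. Associate Researcher Lisong Hu of the same institute and Dr. Zhongping Xu of Huazhong Agricultural University are co-first authors.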

8. How to cite this article

The chromosome-scale reference genome of black pepper provides insight into piperine biosynthesis
Type: journal article
Authors: Lisong Hu, Zhongping Xu, Maojun Wang, Rui Fan, Daojun Yuan, Baoduo Wu, Huasong Wu, Xiaowei Qin, Lin Yan, Lehe Tan, Soonliang Sim, Wen Li, Christopher A. Saski, Henry Daniell, Jonathan F. Wendel, Keith Lindsey, Xianlong Zhang, Chaoyun Hao, Shuangxia Jin
URL: https://www.nature.com/articles/s41467-019-12607-6
Rights: 2019 The Author(s)
Volume: 10
Issue: 1
Pages: 1-11
Journal: Nature Communications
ISSN: 2041-1723
Date: 2019-10-16
Journal abbreviation: Nat Commun
DOI: 10.1038/s41467-019-12607-6
Accessed: 10/23/2019, 11:55:31 AM
Language: en
Abstract: Black pepper (Piper nigrum) belongs to the long-isolated lineage of basal angiosperm and its fruit has been used for food spice and phytomedicines for thousands of years. Here, the authors assemble the reference genome of this species and analyze gene families associated with piperine biosynthesis.

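9. Social media coverage

WeChat posts
- Nature Communications | Huazhong Agricultural University and the CATAS Spice and Beverage Research Institute decode the chromosome-scale reference genome of black pepper, the "King of Spices"
- The pepper genome matters for far more than piperine synthesis
- Found it! Why pepper is so pungent
- Chinese scientists find the reason for pepper's pungency: which genes are at work?
- Hainan Daily: CATAS and partners complete a fine chromosome-level genome map of pepper

10. Easter egg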

http://tiramisutes.github.io/SCIFigure/

NC | Codling moth genome sequencing reveals mechanisms of chemosensation and insecticide resistance

2019-09-18T15:14:25.000Z

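The codling moth (Cydia pomonella), commonly called a fruit borer, belongs to the family Tortricidae in the order Lepidoptera. In China it occurs only in northern regions such as Xinjiang and Gansu and is a quarantine pest elsewhere in the country. It attacks apple, pear, apricot, peach, cherry, plum, and other fruit trees. The larvae bore into the fruit, which lowers fruit quality and causes heavy fruit drop. Adults lay eggs singly on fruit or leaves. Early in the season, when the fruit is still hard, newly hatched larvae mostly enter through the calyx or stem cavity; later, when the flesh is soft, they enter through the fruit surface. Once inside, larvae bore to the core, feed preferentially on the seeds, and push frass out of the fruit. Several larvae can feed in one fruit at the same time, and a single larva can move on to damage two or more fruits. Mature larvae leave the fruit and crawl along the branches and trunk to spin cocoons and pupate under the bark. The codling moth causes substantial economic losses in apple and other commercial fruit crops.

Recently, more than 20 research institutions in China and abroad, including the Chinese Academy of Agricultural Sciences, the Chinese Academy of Sciences, Zhejiang University, and the University of Kansas in the United States, jointly published online in Nature Communications the paper "A chromosome-level genome assembly of Cydia pomonella provides insights into chemical ecology and insecticide resistance". The study combined Illumina, PacBio, BioNano, and Hi-C technologies to sequence the genome of the codling moth, the major pest of apple.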

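Genome assembly

The researchers reared codling moths captured in the field in the Jiuquan area of Gansu Province, then selected 42 females for genomic DNA extraction and sequencing. k-mer analysis and flow cytometry put the genome size at about 630 Mb. They built four short-insert (180, 300, 500, and 800 bp) and three long-insert (3, 8, and 10 kb) Illumina libraries (about 390X genome coverage) and generated 86X PacBio data. The initial assembly was 682.49 Mb in 2,221 contigs, with a contig N50 of 862 kb. BioNano-assisted assembly yielded 1,717 scaffolds with a total size of 772.89 Mb and a scaffold N50 of 8.9 Mb. With Hi-C, 1,108 scaffolds were anchored to 29 chromosomes (27 autosomes plus the two sex chromosomes, Z and W), placing 97% of the genome sequence on chromosomes. The authors also noted that, because the codling moth genome is highly heterozygous, Hi-C scaffolding introduced a large number of gaps. To check the assembly, they produced two versions: (1) PacBio plus Hi-C, and (2) super-scaffolds from PacBio, BioNano, and Hi-C combined. Whole-genome alignment showed very high collinearity between the two, and the combined PacBio, BioNano, and Hi-C super-scaffolds were used for downstream analysis.
BUSCO analysis against the Arthropoda database found 98.5% of the orthologs in the assembly. To further validate quality, the authors generated 71 Gb of Nanopore data, of which 99% of reads mapped to the assembly. Of 15,000 consensus transcripts from PacBio RNA sequencing, 93% aligned to the assembly. Because lepidopteran genomes typically show very high synteny, the codling moth genome was compared with that of Spodoptera litura; chromosome linkage and gene order were highly conserved between the two. Together these analyses support the accuracy and completeness of the assembly.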

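Genome annotation

Repeats account for 42.87% of the genome. OMIGA combined with 28 RNA-seq datasets initially predicted 16,997 protein-coding genes. Gene families that are widely studied in insects and important for adaptation were annotated manually: 85 odorant receptor genes, 65 gustatory receptor genes, 39 ionotropic receptor genes, 50 odorant-binding protein genes, 28 chemosensory protein genes, 136 P450 genes, 47 ABC transporter genes, 73 carboxylesterase genes, 30 glutathione S-transferase genes, 9 nicotinic acetylcholine receptor genes, 2 acetylcholinesterase genes, and 1 voltage-gated sodium channel gene. After removing redundancy, the final set contained 17,184 protein-coding genes.
Non-coding RNA annotation identified 82 snRNAs, 137,752 piRNAs, 2,435 tRNAs, 334 rRNAs, and 217 miRNAs.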

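Comparative genomics

Comparing the codling moth with 19 other insects spanning seven orders (Lepidoptera, Diptera, Coleoptera, Hymenoptera, Hemiptera, Isoptera, and Orthoptera) identified 2,124 single-copy orthologs. A species tree built from 500 of them indicates that the codling moth diverged from the Obtectomera lepidopterans about 141 million years ago.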

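Synteny, karyotype evolution, and sex chromosomes

The codling moth genome is highly syntenic with that of the lepidopteran Spodoptera litura, which has the ancestral karyotype of 31 chromosomes. Earlier cytogenetic work showed that the codling moth has 27 autosomes, that females carry a W chromosome, and that a neo-Z sex chromosome arose by fusion of the ancestral Z with an autosome homologous to chromosome 15 of the silkworm. Comparison with the S. litura genome confirmed this fusion, which produced the largest chromosome in the codling moth genome. Fusion events were also seen for codling moth chromosome 2 (derived from S. litura chromosomes 5 and 22) and chromosome 3 (derived from S. litura chromosomes 7 and 8).

Resequencing data from three females and three males confirmed that the Z chromosome (chr1) and part of the W chromosome (chr29) are present in the assembly. Cytogenetic analysis showed that the Z and W share almost no sequence, indicating that the neo-W segment has been lost or has almost completely degenerated.
Comparison with the rest of the genome showed that chr29 has a higher GC content. As a typically non-recombining, hemizygous chromosome, the lepidopteran W is highly degenerate, with few genes and an enrichment of repeats; apart from TEs, no protein-coding genes were identified on chr29. Repeat analysis showed that the overall repeat content of chr29 is not markedly higher than that of other chromosomes, but the W carries a higher proportion of long terminal repeat (LTR) retrotransposons and DNA transposons.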

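Duplication of the odorant receptor gene OR3 enhances the ability to find food and mates

In insects the chemosensory system governs many behaviors, such as locating food, shelter, mates, and oviposition sites. It therefore plays an important role in determining a species' invasive ability, especially for oligophagous insects such as the codling moth. Studies of codling moth chemoreception have shown that both sexes are strongly attracted to the plant volatile pear ester, and that pear ester greatly increases the male response to codlemone (the main sex pheromone produced by females). However, the mechanism behind the synergistic response of males to pear ester and codlemone remains unclear.
The high-quality assembly provides a basis for studying chemosensory genes in the codling moth. The authors identified 85 odorant receptor genes in the genome; phylogenetic analysis showed that the pheromone receptor clade has expanded in the codling moth. Analysis of chromosomal positions further showed that CpomOR3a and CpomOR3b have the same gene length and the same exon-intron structure and lie in tandem on chromosome 17, 9,812 nt apart. To confirm that the OR3 duplication is fixed rather than segregating (some individuals with one copy, others with two), the authors verified its presence in all resequenced codling moth individuals.

With the duplication established, does it help the codling moth sense pear ester? The authors first confirmed the duplication of CpomOR3a and CpomOR3b by PCR. FPKM values from RNA-seq of 24 different tissues then showed that CpomOR3b is expressed only in adult antennae (male FPKM = 1050.17, female FPKM = 4014.13), whereas CpomOR3a is expressed in the antennae of adult females (FPKM = 88916.8) and males (FPKM = 41170.2) and also in adult heads (FPKM = 14771.4–68715.2) and larval heads (FPKM = 7627.82). Compared with the other CpomOR genes, the duplicated OR3 genes (CpomOR3a and CpomOR3b) have the highest expression in adult antennae. Fluorescence in situ hybridization on adult antennae showed that, although each gene has some non-co-localized neurons in which it alone is expressed, the two are mainly expressed in neighboring but distinct neurons within the same sensillum. In addition, both CpomOR3a and CpomOR3b are co-expressed with the OR co-receptor CpomORco in adult antennae. These results indicate that the two copies have different expression patterns and suggest that they may have undergone neofunctionalization and acquired different functions.
CpomOR3a has been reported as a putative pear ester receptor. Can CpomOR3b, which shares 89% sequence similarity with it, also detect this compound and contribute to the moth's high sensitivity to it? The authors co-expressed CpomOR3a or CpomOR3b with CpomORco in Xenopus oocytes and used two-electrode voltage clamp to record the response of each receptor to pear ester and other chemicals. The two copies had similar response profiles: both responded strongly to pear ester, and both also responded to some degree to the sex pheromone codlemone. siRNAs were then injected into late-stage pupae to knock down CpomOR3a and CpomOR3b individually or together; qPCR showed that knocking down one paralog did not affect expression of the other. Electroantennography 72 hours after injection showed that male responses to both pear ester and codlemone were impaired under every treatment. In females, by contrast, the response to pear ester was impaired only when both CpomOR3a and CpomOR3b were knocked down. Silencing CpomOR3b alone also significantly reduced the ability of males to track codlemone.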
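
The expression comparison above is stated in FPKM. As a reminder of what that unit normalizes for, here is a minimal sketch of the standard FPKM formula; the counts are hypothetical and unrelated to the CpomOR3 data.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)

# Hypothetical: 500 fragments on a 2 kb gene in a library of 10 million mapped fragments
print(fpkm(500, 2000, 10_000_000))
```

Because the value is scaled by both gene length and library size, FPKM values for two genes of equal length, such as a pair of tandem duplicates, can be compared directly within a sample.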

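GWAS identifies SNPs associated with insecticide resistance

At present the codling moth is controlled mainly with chemical insecticides, but clear resistance has emerged. Understanding the resistance mechanisms is important for further pest control. The main mechanisms in the codling moth are increased detoxification enzyme activity and reduced sensitivity of target proteins to insecticides. The authors identified 667 potential insecticide resistance genes in the codling moth, including 434 detoxification genes, 45 insecticide target genes, 124 cuticle genes, 47 ABC transporters, and 12 aquaporins. Earlier studies indicated that P450 genes confer broad-spectrum insecticide resistance on the codling moth, so the authors examined the P450 genes in depth.
The chromosomal distribution of the 146 P450 genes shows 16 clusters containing three or more P450 genes. The cluster on chr20 contains 11 genes, including 3 CYP6AE genes.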

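To test whether the larger number of P450 genes gives the codling moth insecticide resistance, the authors resequenced six randomly chosen individuals from each of three populations (S, Raz, and Rv) at 40X. The S population is insecticide-susceptible and has not been exposed to any insecticide since 1995. The Raz population is resistant; its larvae have been exposed to azinphos-methyl since 1997, and relative to S it shows 7-fold resistance to azinphos-methyl and 130-fold resistance to carbaryl. Larvae of the Rv population have been exposed to deltamethrin since 1995, and relative to S the population shows 140-fold resistance to deltamethrin.
Between S and Raz, GWAS identified 109 SNPs with significantly different allele frequencies in exons of the 667 resistance-related genes. Between S and Rv, 242 significantly different SNPs were found in exons of resistance-related genes; 18 SNPs were shared by Raz and Rv. For 11 of these SNPs, dozens of individuals per population were analyzed further by Sanger sequencing, which confirmed fixed differences between S and Raz or Rv at 7 of them. The validated SNPs in the muscarinic acetylcholine receptor (mAChR), the octopamine β receptor, and the P450 gene CYP6B2 drew the authors' attention because these genes had never been reported to be involved in insecticide resistance in Lepidoptera.
The authors then annotated the 5' UTRs of the P450 genes by RACE. 5' UTRs were obtained for 69 of the 136 P450 genes, and these 69 genes were aligned to the genome scaffolds to annotate transcription start sites and promoters. Across the 136 P450 genes, GWAS identified 128 and 203 SNPs that differ between S and Raz and between S and Rv, respectively. In the promoters of the 69 genes, 9 and 10 SNPs differ between S and Raz and between S and Rv, respectively. Notably, three SNPs in the promoter of CYP6B2 are present in both Raz and Rv: A52T: A (−52)T, T(−57)T, and T(−110)G (gene ID: CPOM05212). qPCR showed that CYP6B2 is constitutively overexpressed in both resistant populations relative to S (241.4-fold in Raz and 77.3-fold in Rv), suggesting that these three SNPs play an important role in regulating CYP6B2 expression.
To confirm that CYP6B2 expression is linked to insecticide resistance, the authors knocked down CYP6B2 by siRNA injection in fourth-instar larvae of the sequenced Jiuquan strain. After 48 hours CYP6B2 expression had fallen by 55%, and the RNAi larvae were fed azinphos-methyl, deltamethrin, or imidacloprid at the LC50 concentration. Survival of larvae fed azinphos-methyl and deltamethrin was 31.1% and 45.6%, respectively, significantly lower than in the GFP-injected control or the untreated group. Knocking down CYP6B2 therefore significantly increases sensitivity to azinphos-methyl and deltamethrin, whereas sensitivity to imidacloprid was unaffected. Taken together, CYP6B2 plays a key role in resistance to two widely used insecticides.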

Original article: https://www.nature.com/articles/s41467-019-12175-9#Sec9

Installing Node.js and npm on Linux without root

2019-08-29T11:54:34.000Z

Node.js is a JavaScript runtime built on Chrome's V8 engine.
npm is the package manager installed together with Node.js; it solves many problems in deploying Node.js code.

Download the source package

wget https://nodejs.org/dist/v10.16.3/node-v10.16.3.tar.gz
tar -zxvf node-v10.16.3.tar.gz
cd node-v10.16.3

Configure, compile, install

module load GCC/6.2.0-2.27
./configure --prefix=$PWD
make && make install

Set up the Node.js environment

export NODE_HOME=/public/home/software/node-v10.16.3
export PATH=$NODE_HOME/bin:$PATH
export NODE_PATH=$NODE_HOME/lib/node_modules

Check that the installation succeeded

$ node -v
v10.16.3
$ npm -v
6.9.0

Docker and Singularity together: building bioinformatics analysis pipelines

2019-08-29T08:03:25.000Z

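Use cases

As lightweight virtual machines, containers offer a choice of system environments beyond the host's. Once software and its dependencies have been packaged into a container, that complex environment runs unchanged on many platforms without being configured again and again, which greatly reduces the workload.

Docker is currently the mainstream container technology. It was first used in the internet industry, where software products need rapid iteration, and it greatly simplified deployment and improved hardware utilization. More recently it has also been adopted in application systems in many specialized fields.

Docker is an open-source application container engine written in Go and released under the Apache 2.0 license.

Docker lets developers package an application and its dependencies into a lightweight, portable container and then ship it to any popular Linux machine; it can also be used for virtualization.

Containers are fully sandboxed and have no interfaces to one another (much like iPhone apps). More importantly, their performance overhead is very low.

Advantages of Docker:

Lightweight: compared with virtual machines, Docker uses few extra resources
Easy to maintain: the environment is defined in a Dockerfile, which is easy to update and share
Simple to operate: Docker is easy to use and cheap to learn

In bioinformatics, some complex software is now packaged and distributed with Docker so that it can be configured once and used many times, and with the rise of the cloud Docker is also used to build private bioinformatics cloud platforms. Even so, because of constraints around permissions and resource management, Docker has not caught on in traditional HPC clusters (bare-metal nodes + operating system + job scheduler + high-speed interconnect).

Singularity was therefore developed as a container tool specifically for traditional HPC clusters. It lets ordinary users run prepackaged container images on a cluster, works well with job schedulers, and is used in the same way as any other application. Another big advantage is that Singularity can use Docker images directly, which greatly widens its usefulness.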

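Installation

Installing Docker or Singularity on a server requires root privileges, so one option is to install Docker on your own computer to build images, then upload them to the server and run them with Singularity.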

Image collections

Docker Hub

Singularity Hub

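BioContainers, a collection of bioinformatics tools similar to Bioconductor and Bioconda.

Docker registry mirrors in China

DaoCloud: to address the stability and speed problems users in China face with Docker Hub, DaoCloud offers a new-generation mirror service that is free of charge.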

Docker examples

Searching for images
Use docker search to find CentOS-related images with at least 25 stars.

docker search -f stars=25 centos

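Download an image

# Pull a remote image
docker pull biocontainers/bowtie2
# Load an image from a local tar archive
docker load -i docker_rnaseq_ref.tar

List downloaded images

docker images

Show the help for docker run and the usage of a tool image

docker run --help
docker run biocontainers/bowtie2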

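The important options are:

--rm: remove the container automatically when it exits;
-v: mount a local directory, used to specify where the input data and output results live;

Remove a local image

docker image rm [OPTIONS] <image1> [<image2> ...]
docker image rm biocontainers/bowtie2

The -f option in [OPTIONS] forces removal.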

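Building your own Docker image

There are several ways to build an image; here we mainly use a Dockerfile. The main instructions are introduced briefly below (they are case-insensitive, but by convention written in uppercase); see the official documentation for details:

FROM: specifies the base image; it must be the first instruction (only ARG may come before it).

MAINTAINER: specifies the image author (deprecated in favor of LABEL).

LABEL: attaches metadata to the image.

ENV: sets environment variables.

ADD & COPY: copy local files into the image. COPY only accepts local files; ADD also accepts a URL, and automatically unpacks local tar archives.

RUN: runs a command. Each RUN instruction executes in a new container and its result is committed as an image layer that becomes the base for the next RUN, so shell state such as the working directory or exported variables does not carry over between RUN instructions; commands that must run together should be chained with &&. A Dockerfile can contain multiple RUN instructions, executed in order.

CMD: the command to run when the container starts.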

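First pull the official image of the operating system you normally use, for example:

docker search -f stars=25 centos
docker pull centos

Then create a directory to hold the files used while building the image, download the source package of the software, and create a Dockerfile.

Next, build the image:

docker build -t macs2:ubuntu.v1 .

Here the part before : is the image name and the part after it is the tag. The . means the current directory (a full path also works) and tells Docker where to find the Dockerfile.

Because the build needs network access, the whole process may be slow. Once it succeeds, the image can be seen with docker images.

Docker Hub has a large number of ready-made images; search for one, open its Dockerfile, and use it as a reference for your own.

Then log in with docker login. Once logged in, tag the image with your Docker Hub namespace (docker tag macs2:ubuntu.v1 wenlongshen/macs2:ubuntu.v1) and push it with docker push wenlongshen/macs2:ubuntu.v1. You can now view and manage it in your own Docker Hub account, and likewise fetch it for local use with docker pull wenlongshen/macs2:ubuntu.v1.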

Using Singularity

Singularity has many commands; the most commonly used are pull, run, exec, shell, and build.

Download an image

singularity pull qiime2_core_2018.11.sif docker://qiime2/core:2018.11

Import data

singularity exec -B /share/exercise/qiime2/emp-single-end-sequences qiime2_core_2018.11.sif qiime tools import --type EMPSingleEndSequences --input-path /share/exercise/qiime2/emp-single-end-sequences --output-path emp-single-end-sequences.qza

Because the absolute path /share/exercise/qiime2/emp-single-end-sequences does not exist inside the container, the -B option is needed to bind the host path /share/exercise/qiime2/emp-single-end-sequences into the container.
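
When the same image is run on many datasets, it can help to assemble the command programmatically so that every input directory is bound. The following is a minimal sketch under that assumption; the helper `singularity_exec` is hypothetical and only builds the argument list using the exec subcommand and -B option shown above.

```python
import shlex

def singularity_exec(image, command, binds=()):
    """Return the argv list for `singularity exec`, binding each host path."""
    argv = ["singularity", "exec"]
    for host_path in binds:
        # -B host_path makes the host directory visible at the same path in the container
        argv += ["-B", host_path]
    argv.append(image)
    argv += list(command)
    return argv

data_dir = "/share/exercise/qiime2/emp-single-end-sequences"
argv = singularity_exec(
    "qiime2_core_2018.11.sif",
    ["qiime", "tools", "import",
     "--type", "EMPSingleEndSequences",
     "--input-path", data_dir,
     "--output-path", "emp-single-end-sequences.qza"],
    binds=[data_dir],
)
# Print the command; pass argv to subprocess.run(argv, check=True) to execute it
print(shlex.join(argv))
```

Keeping the command as a list rather than a single string avoids quoting problems when paths contain spaces.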

References

Docker —— 从入门到实践
生物信息分析流程
singularity
Singularity入门之安装

A brief introduction to Go and installing modules

2019-08-29T03:17:40.000Z

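Go is an open-source programming language developed by Google that makes it easy to build simple, reliable, and efficient software.

Installing Go

Download the latest precompiled Go release from https://golang.org/dl/ (which does not seem to open from within China), verify its integrity, and unpack it; it is then ready to run.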

wget -c https://storage.googleapis.com/golang/go1.12.9.linux-amd64.tar.gz
shasum -a 256 go1.12.9.linux-amd64.tar.gz
tar -xvzf go1.12.9.linux-amd64.tar.gz

Add environment variables

export PATH=$PATH:/path/go/bin
export GOPATH="/path/go/src"
export GOBIN="/path/go/bin"

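Check the installation

go version

Use go env to check that the relevant paths are set correctly.

Using Go modules

Go modules were officially introduced in Go 1.11 to manage package dependencies. Dependencies are tracked by version number, and go.mod records the modules a project depends on.

A go.mod file looks like this: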

module github.com/martian-lang/martian
 
go 1.12
 
require (
	github.com/cloudfoundry/gosigar v1.1.0
	github.com/dustin/go-humanize v0.0.0-20180713052910-9f541cc9db5d
	github.com/google/shlex v0.0.0-20150127133951-6f45313302b9
	github.com/martian-lang/docopt.go v0.0.0-20180828184714-57cc8f5f669d
	github.com/satori/go.uuid v1.1.1-0.20160713180306-0aa62d5ddceb
	github.com/golang/sys v0.0.0-20190209173611-3b5209105503
	github.com/golang/tools v0.0.0-20190219175448-49d818b07734
)

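module defines the module name of the project.
go 1.12 states the Go version the project uses.
require lists the modules the project depends on and their versions.

A module is a unit made up of a set of related packages.

Go projects are usually placed in the src directory.

Like CRAN and CPAN, Go modules have repositories: mainly golang.org (the official one), https://goproxy.io/ (the largest proxy), and GitHub.

golang.org

Just as a Python module is installed with pip install XXX, a Perl module with cpanm XXX, and an R package with install.packages("XXX"), a Go module can be installed directly from the command line.
For example, to install the sys module:

go get -u golang.org/x/sys

But the result is an error. That should not happen: tutorials all over the web, and even the official site, say to download it this way: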

go: golang.org/x/sys@v0.0.0-20190209173611-3b5209105503: unrecognized import path "golang.org/x/sys" (https fetch: Get https://golang.org/x/sys?go-get=1: dial tcp 216.239.37.1:443: i/o timeout)
go: error loading module requirements

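(┬＿┬) The cause is the Great Firewall: Go is developed by Google... So is it unusable? goproxy.io and GitHub come to the rescue.

goproxy.io

A diagram on the official site shows what goproxy.io does: it rescues us from this predicament...

# Enable Go modules
export GO111MODULE=on
# Set the GOPROXY environment variable
export GOPROXY=https://goproxy.io

Then download the required module again:

go get -u golang.org/x/sys

github

There are two ways:
Clone the relevant GitHub repository into $GOPATH/src/golang.org/x/, then cd there and run go install golang.org/x/sys: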

mkdir -p $GOPATH/src/golang.org/x/
cd $GOPATH/src/golang.org/x/
git clone https://github.com/golang/sys.git sys
go install golang.org/x/sys

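If go install prints nothing, the installation succeeded.

Or directly:

go get -u -v gitlab.com/groupName/projectName

In summary, what works is to replace any golang.org reference in the go.mod file of the module you need with the corresponding GitHub location.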
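
One way to express that substitution without editing the require lines by hand is Go's `replace` directive in go.mod. The sketch below generates such directives from a go.mod text. It assumes that each golang.org/x/NAME module is mirrored at github.com/golang/NAME; the helper name `replace_directives` is hypothetical, and this is a swapped-in variant of the manual edit described above, not the author's exact procedure.

```python
import re

# Matches require entries such as "golang.org/x/sys v0.0.0-2019..."
GOLANG_X = re.compile(r"^\s*(golang\.org/x/(\S+))\s+(\S+)")

def replace_directives(go_mod_text):
    """Return go.mod `replace` lines pointing golang.org/x modules at GitHub mirrors."""
    lines = []
    for line in go_mod_text.splitlines():
        match = GOLANG_X.match(line)
        if match:
            path, name, version = match.groups()
            lines.append(f"replace {path} => github.com/golang/{name} {version}")
    return lines

example = """module example.com/demo

require (
	golang.org/x/sys v0.0.0-20190209173611-3b5209105503
	github.com/google/shlex v0.0.0-20150127133951-6f45313302b9
)
"""
print("\n".join(replace_directives(example)))
```

Appending the printed lines to go.mod keeps the original import paths in the source code while fetching the code from GitHub.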

My Go environment is as follows:

GOARCH="amd64"
GOBIN="/public/home/zpxu/software/go/bin"
GOCACHE="/public/home/zpxu/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/public/home/zpxu/software/go/src"
GOPROXY="https://goproxy.io"
GORACE=""
GOROOT="/public/home/zpxu/software/go"
GOTMPDIR=""
GOTOOLDIR="/public/home/zpxu/software/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/public/home/zpxu/software/cellranger/martian/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build473583117=/tmp/go-build -gno-record-gcc-switches"

References

How to Install GoLang (Go Programming Language) in Linux
Go中的modules初步使用

Heads-up: Arabidopsis plus gene family analysis can make it into Cell

2019-08-24T15:10:02.000Z

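The spread of sequencing technology has greatly accelerated genomics. If "DNA sequencing at 40: past, present and future" shows the grand sweep of DNA sequencing technology, then the Arabidopsis 1001 Genomes Project (http://1001genomes.org ) is only the tip of the iceberg in that picture. Yet since its launch in 2008 it has driven life science research forward and sped up our understanding of gene function and species diversity. The 2016 paper analyzing the whole-genome sequences of 1,135 Arabidopsis genomes marked the end of the project's first phase, but genome-based Arabidopsis research has never paused.

bioRxiv recently posted "Chromosome-level assemblies of multiple Arabidopsis thaliana accessions reveal hotspots of genomic rearrangements". Using PacBio sequencing and chromosome-level assembly of seven Arabidopsis accessions, it reveals about 350 hotspot regions of chromosomal rearrangement that arose in response to stress over long-term evolution.

Background

Speaking of stress: ever since plants chose the land as their home after the Cambrian explosion, "unmoved by wind and rain, steady as a mountain" has been a fitting description of them. To survive, plants evolved their own immune system for detecting potential pathogens. Plant immunity relies on a range of immune receptors, and coevolution between plants and microbes has promoted receptor diversity. Two types of receptor typically activate plant immune signaling: cell-surface proteins that recognize microbe-associated molecular patterns (MAMPs), which trigger pattern-triggered immunity (PTI), and intracellular proteins that sense pathogen effectors. Pathogens release effectors to disrupt or interfere with PTI and infect the plant more easily. On top of PTI, plants deploy many kinds of nucleotide-binding leucine-rich repeat (NLR) immune receptors as a second layer of defense, effector-triggered immunity (ETI). Recognition of a pathogen by an NLR receptor triggers the hypersensitive response (HR), which blocks infection. NLRs are multidomain proteins with a conserved architecture: a C-terminal leucine-rich repeat (LRR) domain, a central nucleotide-binding domain (NBD), and a variable N-terminal domain that recognizes pathogen-derived effectors directly or indirectly, mainly a TIR (Toll/interleukin-1 receptor) or CC (coiled-coil) domain. Just yesterday, two back-to-back papers in Science reported that the TIR domain can cleave the metabolic cofactor NAD+ (nicotinamide adenine dinucleotide), which acts as a cell death signal in the disease resistance response (see: Science back-to-back | A new mechanism for NLR receptor-mediated plant disease resistance). These key domains are the basis for identifying NLR genes and analyzing the family.
Because NLR genes matter so much for plant immunity and breeding, they have been identified and analyzed phylogenetically in a large number of species. But the NLR family is extremely polymorphic, with widespread presence-absence variants (PAVs) even between closely related individuals, so NLR diversity is still not clearly understood.
Recently, the team of Felix Bemm at the Max Planck Institute for Developmental Biology in Germany published online in Cell the paper "A Species-Wide Inventory of NLR Genes and Alleles in Arabidopsis thaliana". They applied RenSeq (Resistance gene enrichment Sequencing, which can re-annotate the NLR gene family in sequenced plant genomes and rapidly map resistance loci in segregating populations) to 64 Arabidopsis accessions of diverse geographic origin and carried out a pan-NLRome analysis. This allowed them to define the core NLR complement, catalog domain architecture diversity, describe new domain architectures, assess presence-absence polymorphism in non-core NLRs, and anchor the positions of non-canonical NLR genes on the Col-0 reference genome.

Results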

NLR Discovery

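Using RenSeq with single-molecule real-time (SMRT) sequencing, 65 NLR complements were built from 64 Arabidopsis accessions, containing 13,167 annotated NLR genes, with 167 to 251 genes per accession. Between 47% and 71% of the NLR genes in each genome lie in clusters, and some NLR genes are arranged head-to-head (defined as paired NLRs), with 10 to 34 such pairs per accession. NLR genes were grouped into four classes by domain content: TIR-NLR (TNL), CC-NLR (CNL), CCR-NLR (RNL), and NB-and-LRR-only proteins (NL). In every accession TNLs were the most numerous, followed by NLs, CNLs, and RNLs.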

Diversity of NLR Domain Architectures

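Of the 13,167 NLR genes, 663 encode at least one non-canonical NLR domain, representing 36 distinct Pfam domains. A hallmark of NLRome diversity across species is variation in the relative proportions of different domain architectures. This Arabidopsis pan-NLRome study identified 97 distinct architectures, of which only 22 are present in the Col-0 reference genome and only 48 have been reported in Col-0 or other Brassicaceae.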

The Pan-NLRome

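To understand variation in NLR number and diversity, the NLR genes from the 64 sequenced accessions were clustered into orthogroups (OGs) by sequence similarity. Fewer than 10% of the genes (1,663) were singletons; the remaining 11,497 genes fell into 464 OGs. 95% of these 464 OGs could be recovered from any 38 accessions. OGs were further classified by size, domain architecture, and structural features. The core NLRome contains only 106 OGs (23%), corresponding to 6,080 genes (53%), each found in at least 52 accessions. The shell NLRome is somewhat larger at 143 OGs (31%), corresponding to 3,932 genes (34%), found in at least 13 but fewer than 52 accessions. The cloud NLRome comprises 215 OGs (46%), corresponding to 1,485 genes (13%), found in at most 12 accessions.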
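
The core, shell, and cloud thresholds above translate directly into a small classifier. This is a minimal sketch using the cut-offs stated in the text (at least 52, 13 to 51, and at most 12 of 64 accessions); the orthogroup names and counts are hypothetical.

```python
from collections import Counter

def classify_og(n_accessions):
    """Assign an orthogroup to core, shell, or cloud by the number of accessions carrying it."""
    if n_accessions >= 52:
        return "core"
    if n_accessions >= 13:
        return "shell"
    return "cloud"

# Hypothetical orthogroup -> number of accessions in which it is found
og_presence = {"OG_a": 64, "OG_b": 52, "OG_c": 51, "OG_d": 13, "OG_e": 12, "OG_f": 1}
labels = {og: classify_og(n) for og, n in og_presence.items()}
print(Counter(labels.values()))
```

The same three-way split is widely used in pan-genome studies, although the exact thresholds vary from study to study.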

Genomic Placement of Non-reference OGs

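A total of 296 OGs are absent from the Col-0 reference genome: 6 core, 205 cloud, and 85 shell. The study anchored OGs to the Col-0 reference genome by synteny, obtaining 42 synteny sub-networks.

OG102 and OG211 cluster in a new NLR region. The newly anchored OGs also include one CNL pair and three TNL pairs.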

Pan-NLRome Diversity

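NLR genes were classified by domain architecture using an orthogonal approach (http://www.pharmtech.com/orthogonal-approach-biosimilarity ), and sequence diversity was assessed as an indicator of the evolutionary pressures shaping the pan-NLRome. With 32 randomly chosen accessions, average nucleotide diversity reached 95% of saturation. Haplotype diversity, by contrast, needed 49 accessions to saturate, which suggests that new haplotypes arise not only by mutation but also through intragenic rearrangement and gene conversion. This was confirmed in about three quarters of the OGs (74%) and agrees with reports that intragenic rearrangement can lead to functional divergence. Clustered OGs have significantly higher nucleotide diversity than non-clustered OGs, consistent with relaxed selection after gene duplication. Although the NLR classes differ in their overall presence-absence polymorphism profiles, average nucleotide diversity within OGs is similar for CNLs, TNLs, and NLs. RNLs have lower average nucleotide diversity, in line with their conserved function.

Branch-by-branch selection analysis identified 131 OGs with episodic positive selection on at least one branch, most of them in the core (50) or shell (73) NLRome. Site-specific selection analysis revealed 543 core and shell OGs that may have experienced constant (46%), pervasive (30%), or episodic (24%) positive selection. Invariant codons, indicating constant purifying selection, were found in all types (core and shell), classes (TNL, CNL, RNL, and NL), and pairing states (paired and unpaired), whereas the subclasses show uneven patterns of positive selection.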

Linking Diversity to Known Function

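To link NLRome diversity to known gene function, OGs were further grouped into those conferring resistance to adapted biotrophs, to non-adapted biotrophs, and to hemibiotrophs. OGs conferring resistance to adapted biotrophs were significantly more numerous than the other categories, suggesting that host-adapted biotrophic pathogens are stronger drivers of NLR diversity. These OGs also have higher Tajima's D values, indicating that they have experienced balancing selection as well as positive selection.

Data analysis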

One-way ANOVA with post-hoc Tukey HSD Test Calculator

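End of the main text; this line is the divider

Looking at the paper as a whole, this gene family analysis involves SMRT RenSeq sequencing, something close to de novo genome assembly, structural and functional gene annotation and classification (with several rounds of manual curation to keep the NLR annotation accurate), followed by comparative genomics, pan-NLRome analysis, clustering, and sequence-level selection analysis. Best of all, every polished figure comes with an explanation of how it was drawn, R scripts included. It really does cover everything, quite unlike the gene family papers you see everywhere: find the target genes in a single species, plot their chromosomal distribution, throw in gene structures and motifs whether or not they are useful, maybe compute a simple Ka/Ks, then patch in some transcriptome data and a few qRT-PCR results. Well worth studying.
Finally, a comment from a senior labmate.

Original article: A Species-Wide Inventory of NLR Genes and Alleles in Arabidopsis thaliana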

Notes on drawing a standard map of China in R (Sichuan and Chongqing separated, with Taiwan and the South China Sea islands)

2019-06-19T06:33:27.000Z

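Resequencing papers often use a map to show where the sequenced samples came from. Maps touch on sovereignty and are a sensitive matter, but in R a correct one can be reproduced easily (read: with some effort).

1. Comparing approaches

ggplot2

Any component of the figure can be adjusted flexibly, and two or more dimensions of data can be added to the map.

maps

Simple and easy to use, but its original base map of China did not separate Sichuan and Chongqing. They are separated now, but each province's outline still differs slightly from the official map (National Geomatics Center of China).

googleVis

A basic map-drawing approach that can still only display one dimension of data, and the maps depend on Google Maps.

REmap

Developed by Chinese authors on top of Baidu's ECharts. Pros: maps are quick and convenient to draw, second-level maps at province and city level are very accurate, and it can draw flashy migration maps and heat maps; recommended tutorial: http://lchiffon.github.io/REmap/ . Cons: like googleVis it can only display one dimension of data, and place names on the map can only be shown in Chinese.

2. Downloading map data

Download GADM data

However, the map of China downloaded from the GADM site does not include Taiwan, so it gets a firm thumbs-down.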

GIS data

http://cos.name/wp-content/uploads/2009/07/chinaprovinceborderdata_tar_gz.zip
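The download consists mainly of three files with the administrative map data of China: bou2_4p.dbf, bou2_4p.shp, and bou2_4p.shx.
If Chinese province names come out garbled, set Sys.setlocale("LC_ALL", "chinese").
The administrative map data contain 925 records, each with the following fields:
area (AREA)
perimeter (PERIMETER)
various codes; ADCODE99 is the region code defined by the National Geomatics Center of China, six digits made up of two digits each for province, prefecture or city, and county
Chinese name (NAME), which is GBK-encoded. The province names can be converted with the iconv encoding conversion function: table(iconv(map$NAME, from = "GBK"))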
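
To see why the explicit GBK conversion matters, the short sketch below shows in Python what happens to a GBK-encoded name when it is decoded with the right and with a wrong encoding. It is only an illustration of the encoding issue and is not part of the R workflow; the sample name is arbitrary.

```python
# Bytes as they would sit in a GBK-encoded NAME field of the .dbf file
raw = "北京市".encode("gbk")

def decode_name(field_bytes):
    """Decode a GBK-encoded NAME field into a readable string."""
    return field_bytes.decode("gbk")

print(decode_name(raw))
# The same bytes under a wrong single-byte encoding come out garbled
print(raw.decode("latin-1"))
```

The iconv call in R plays the same role as the explicit decode here.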

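After unpacking, put the three files in the same directory; although only the .shp file is read, the .shx and .dbf files must be in the same directory for the read to succeed.
However, the data do not include the line work for the nine-dash line in the South China Sea. Because the South China Sea islands are small, without the nine-dash line they can be almost impossible to make out when the map is shown at a small size.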

3. Drawing the map

1. Preparation

setwd("F:/Rwork/china_map")
library(maptools)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
Sys.setlocale("LC_ALL", "chinese")

2. Map Data

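Download the GIS data: http://cos.name/wp-content/uploads/2009/07/chinaprovinceborderdata_tar_gz.zip
After unpacking, put the three files in the current directory (getwd()).
Although only the .shp file is read, the .shx and .dbf files must be in the same directory for the read to succeed.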

map_data <- readShapePoly("bou2_4p.shp")
names(map_data)
map_data@data$ID <- row.names(map_data@data)
# Drop records containing NA
map_data@data <- na.omit(map_data@data)
nrow(map_data@data)
# Optional: filter by area (AREA), mainly to drop the Nansha Islands and the many small islands around the South China Sea
Fmap_data <- subset(map_data, AREA > 0.005)
nrow(Fmap_data@data)

Add province names in English

Create a data.frame called cnmapdf which contains id, prov_en and prov_cn and key map plotting info;

prov_cn <- unique(map_data$NAME)
prov_en <- c("Heilongjiang", "Inner Mongolia", "Xinjiang", "Jilin",
             "Liaoning", "Gansu", "Hebei", "Beijing", "Shanxi",
             "Tianjin", "Shaanxi", "Ningxia", "Qinghai", "Shandong",
             "Tibet", "Henan", "Jiangsu", "Anhui", "Sichuan", "Hubei",
             "Chongqing", "Shanghai", "Zhejiang", "Hunan", "Jiangxi",
             "Yunnan", "Guizhou", "Fujian", "Guangxi", "Taiwan", 
             "Guangdong", "Hong Kong", "Hainan")

prov <- data.frame(prov_cn, prov_en)
id_prov <- map_data@data %>%
  mutate(prov_en = sapply(NAME, function(x) prov$prov_en[which(prov_cn == x)])) %>%
  mutate(prov_cn = as.character(NAME),prov_en = as.character(prov_en)) %>%
  select(id = ID, prov_cn, prov_en)

cnmapdf <- plyr::join(fortify(map_data), id_prov, by = "id")
head(cnmapdf)

Add coordinates of the provincial capitals

cap_coord <- c(
  "Beijing", "北京", "Beijing", 116.4666667, 39.9,
  "Shanghai", "上海", "Shanghai", 121.4833333, 31.23333333,
  "Tianjin", "天津", "Tianjin", 117.1833333, 39.15,
  "Chongqing", "重庆", "Chongqing", 106.5333333, 29.53333333,
  "Harbin", "哈尔滨", "Heilongjiang", 126.6833333, 45.75,
  "Changchun", "长春", "Jilin", 125.3166667, 43.86666667,
  "Shenyang", "沈阳", "Liaoning", 123.4, 41.83333333,
  "Hohhot", "呼和浩特", "Inner Mongolia", 111.8, 40.81666667,
  "Shijiazhuang", "石家庄", "Hebei", 114.4666667, 38.03333333,
  "Taiyuan", "太原", "Shanxi", 112.5666667, 37.86666667,
  "Jinan", "济南","Shandong", 117, 36.63333333,
  "Zhengzhou", "郑州", "Henan", 113.7, 34.8, 
  "Xi'an", "西安", "Shaanxi", 108.9, 34.26666667,
  "Lanzhou", "兰州", "Gansu", 103.8166667, 36.05,
  "Yinchuan", "银川", "Ningxia", 106.2666667, 38.33333333,
  "Xining", "西宁", "Qinghai", 101.75, 36.63333333,
  "Urumqi", "乌鲁木齐", "Xinjiang", 87.6, 43.8,
  "Hefei", "合肥", "Anhui", 117.3, 31.85,
  "Nanjing", "南京", "Jiangsu", 118.8333333, 32.03333333,
  "Hangzhou", "杭州", "Zhejiang", 120.15, 30.23333333,
  "Changsha", "长沙", "Hunan", 113, 28.18333333,
  "Nanchang", "南昌", "Jiangxi", 115.8666667, 28.68333333,
  "Wuhan", "武汉", "Hubei", 114.35, 30.61666667,
  "Chengdu", "成都", "Sichuan", 104.0833333, 30.65,
  "Guiyang", "贵阳", "Guizhou", 106.7, 26.58333333,
  "Fuzhou", "福州", "Fujian", 119.3, 26.08333333,
  "Taibei", "台北", "Taiwan", 121.5166667, 25.05,
  "Guangzhou", "广州", "Guangdong", 113.25, 23.13333333,
  "Haikou", "海口", "Hainan", 110.3333333, 20.03333333,
  "Nanning", "南宁", "Guangxi", 108.3333333, 22.8,
  "Kunming", "昆明", "Yunnan", 102.6833333, 25,
  "Lhasa", "拉萨", "Tibet", 91.16666667, 29.66666667,
  "Hong Kong", "香港", "Hong Kong", 114.1666667, 22.3,
  "Macau", "澳门", "Macau", 113.5, 22.2)

cap_coord <- as.data.frame(matrix(cap_coord, nrow = 34, byrow = TRUE))
names(cap_coord) <- c("city_en", "city_cn", "prov_en", "long", "lat")
cap_coord <- cap_coord %>%
  mutate(prov_en = as.vector(prov_en),
         city_en = as.vector(city_en),
         city_cn = as.vector(city_cn),
         cap_long = as.double(as.vector(long)),
         cap_lat = as.double(as.vector(lat))) %>%
  select(prov_en, city_en, city_cn, cap_long, cap_lat)

head(cap_coord)
cnmapdf <- plyr::join(cnmapdf, cap_coord, by = "prov_en", type = "full")

Extracting the South China Sea region and adding the nine-dash line

Ncnmapdf <- cnmapdf
Ncnmapdf$class<-rep("Mainland",nrow(Ncnmapdf))
Width<-9
Height<-9
long_Start<-124
lat_Start<-16

df_Nanhai<-Ncnmapdf[Ncnmapdf$long>106.55 & Ncnmapdf$long<123.58,]
df_Nanhai<-df_Nanhai[df_Nanhai$lat>4.61 & df_Nanhai$lat<25.45,]

min_long<-min(df_Nanhai$long, na.rm = TRUE)
min_lat<-min(df_Nanhai$lat, na.rm = TRUE)
max_long<-max(df_Nanhai$long, na.rm = TRUE)
max_lat<-max(df_Nanhai$lat, na.rm = TRUE)

df_Nanhai$long<-(df_Nanhai$long-min_long)/(max_long-min_long)*Width+long_Start
df_Nanhai$lat<-(df_Nanhai$lat-min_lat)/(max_lat-min_lat)*Height+lat_Start
df_Nanhai$class<-rep("NanHai",nrow(df_Nanhai))


Ncnmapdf<-rbind(Ncnmapdf,df_Nanhai)

#df_NanHaiLine:Nanhai Line
df_NanHaiLine <- read.csv("中国南海九段线.csv")  
colnames(df_NanHaiLine)<-c("long","lat","ID")

df_NanHaiLine$long<-(df_NanHaiLine$long-min_long)/(max_long-min_long)*Width+long_Start
df_NanHaiLine$lat<-(df_NanHaiLine$lat-min_lat)/(max_lat-min_lat)*Height+lat_Start
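
The inset relies on one linear rescaling, applied to both the island polygons and the nine-dash line: each coordinate is normalized to [0, 1] over the source range and then stretched to the inset size and shifted to the inset origin. A minimal Python sketch of that mapping follows; the function name `rescale` and the sample longitudes are illustrative assumptions, not part of the original R code, which takes the minimum and maximum from the filtered data.

```python
def rescale(values, src_min, src_max, size, start):
    """Linearly map values from [src_min, src_max] onto [start, start + size]."""
    span = src_max - src_min
    if span == 0:
        raise ValueError("source range must have non-zero width")
    return [(v - src_min) / span * size + start for v in values]

# Hypothetical sample longitudes spanning the filter bounds used above.
longs = [106.55, 115.0, 123.58]
# Width = 9 and long_Start = 124, as in the R code.
inset_longs = rescale(longs, min(longs), max(longs), size=9, start=124)
print(inset_longs)
```

Using the same minimum and maximum for the polygons and for the line is what keeps the two layers aligned inside the frame.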

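3. Drawing the map

Plotting a single province

By default the map looks somewhat flattened. This is because longitude and latitude are treated as ordinary data and plotted with equal weight on a Cartesian coordinate system. Mapping the Earth's spherical surface onto a plane is a problem for which geography has a whole family of dedicated algorithms (map projections), so a map should be drawn in a geographic coordinate system rather than a plain Cartesian one. For this purpose the ggplot2 package in R provides the coord_map() function.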
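
To see where the flattening comes from: one degree of longitude covers less ground than one degree of latitude, by a factor of roughly cos(latitude). The Python sketch below computes the y/x aspect ratio that would compensate for this near a given latitude. It only illustrates the geometry under a spherical-Earth assumption; it is not how coord_map() is implemented, and the function name is made up for this example.

```python
import math

def lon_lat_aspect(lat_deg):
    """Approximate y/x aspect ratio at which one degree of longitude and
    one degree of latitude cover the same ground distance near lat_deg
    (spherical-Earth approximation)."""
    return 1.0 / math.cos(math.radians(lat_deg))

shanghai_lat = 31.23  # latitude of Shanghai from the capital-coordinate table
print(round(lon_lat_aspect(shanghai_lat), 3))
```

At the equator the ratio is 1, and it grows with latitude, which is why an uncorrected plot of a mid-latitude region looks stretched east to west.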

shanghai <- cnmapdf[cnmapdf$prov_en == "Shanghai",]
shanghai %>%
  ggplot(aes(x = long, y = lat, group = group, fill=factor(prov_en))) +
  geom_polygon( color = "grey") +
  coord_map() +
  ggtitle("上海直辖市") +
  xlab("经度") +
  ylab("纬度") +
  scale_fill_brewer(palette="Paired")

Plotting several provinces

map1 <- cnmapdf %>%
  filter(prov_en %in% c("Jiangsu", "Zhejiang", "Shanghai"))  %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = prov_cn), color = "grey")

coord_delta_cap <- subset(cap_coord, prov_en %in% c("Zhejiang", "Shanghai", "Jiangsu"))
map1 +
  geom_point(data = coord_delta_cap, aes(x = cap_long, y = cap_lat)) +
  geom_text(data = coord_delta_cap, aes(cap_long, cap_lat - .25, label = city_cn)) +
  coord_map() +
  ggtitle("长江三角洲") +
  xlab("经度") +
  ylab("纬度") +
  scale_fill_brewer(palette="Set2")

National map with an enlarged inset of the South China Sea islands

nb.cols <- length(unique(Ncnmapdf$prov_en))
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)

map00 <- ggplot()+
  geom_polygon(data=Ncnmapdf, aes(x=long, y=lat, group=interaction(class,group),fill=prov_cn),colour="grey",size=0.25)+ 
  # Map of China: the mainland plus the South China Sea islands drawn inside the rectangle
  geom_rect(aes(xmin=long_Start, xmax=long_Start+Width+1, ymin=lat_Start-1, ymax=lat_Start+Height),fill=NA, colour="black",size=0.25)+
  # Draw the rectangular frame
  geom_line(data=df_NanHaiLine, aes(x=long, y=lat, group=ID), colour="black", size=1)+  
  # Draw the nine-dash line inside the rectangular frame
  #scale_fill_manual(values = colorRampPalette(rev(brewer.pal(11,'Spectral')))(33))+
  theme(legend.position = "none") +
  scale_fill_manual(values = mycolors) +
  coord_cartesian()+
  ylim(15,55)+
  theme(
    #legend.position=c(0.15,0.2),
    legend.background = element_blank()
  )

Ncoord_delta_cap <- subset(cap_coord, prov_en %in% unique(Ncnmapdf$prov_en))
# Avoid overlapping place names
library(ggrepel)
spec.city <- c("香港","澳门")
Ncap_map_data01 <- Ncoord_delta_cap[Ncoord_delta_cap$city_cn %in% spec.city,]
Ncap_map_data02 <- Ncoord_delta_cap[!Ncoord_delta_cap$city_cn %in% spec.city,]
map00 + geom_point(data=Ncap_map_data02,aes(x=cap_long, y= cap_lat),shape=1,colour="white") +
          geom_text(data=Ncap_map_data02,aes(x=cap_long, y= cap_lat,label=city_cn)) +
          geom_text_repel(data=Ncap_map_data01,aes(x=cap_long, y= cap_lat,label=city_cn)) +
          annotate("text", x=132.6, y=15.5, label="南海诸岛") +
          theme_void() + 
          theme(legend.position = "none")

National map

map0 <- cnmapdf %>%
  filter(prov_en %in% unique(cnmapdf$prov_en))  %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = "white"), color = "grey") +
  scale_fill_identity()
coord_delta_cap <- subset(cap_coord, prov_en %in% unique(cnmapdf$prov_en))
# Avoid overlapping place names
library(ggrepel)
spec.city <- c("香港","澳门")
cap_map_data01 <- coord_delta_cap[coord_delta_cap$city_cn %in% spec.city,]
cap_map_data02 <- coord_delta_cap[!coord_delta_cap$city_cn %in% spec.city,]
cnmap <-  map0 + geom_point(data=cap_map_data02,aes(x=cap_long, y= cap_lat),shape=1,colour="white") +
          geom_text(data=cap_map_data02,aes(x=cap_long, y= cap_lat,label=city_cn)) +
          geom_text_repel(data=cap_map_data01,aes(x=cap_long, y= cap_lat,label=city_cn)) +
          coord_map() +
          theme_void() + 
          theme(legend.position = "none") +
          scale_fill_identity()
cnmap

map1 <- cnmapdf %>%
  filter(prov_en %in% unique(cnmapdf$prov_en))  %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = prov_cn), color = "grey")

coord_delta_cap <- subset(cap_coord, prov_en %in% unique(cnmapdf$prov_en))

nb.cols <- length(unique(coord_delta_cap$prov_en))
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)

map1 +
  geom_point(data = coord_delta_cap, aes(x = cap_long, y = cap_lat)) +
  geom_text(data = coord_delta_cap, aes(cap_long, cap_lat - .25, label = city_cn)) +
  coord_map() +
  ggtitle("中国") +
  xlab("经度") +
  ylab("纬度") +
  theme_void() + 
  theme(legend.position = "none") +
  scale_fill_manual(values = mycolors)

Avoiding overlapping place names

library(ggrepel)
spec.city <- c("香港","澳门")
cap_map_data1 <- coord_delta_cap[coord_delta_cap$city_cn %in% spec.city,]
cap_map_data2 <- coord_delta_cap[!coord_delta_cap$city_cn %in% spec.city,]
map1 + geom_point(data=cap_map_data2,aes(x=cap_long, y= cap_lat),shape=1,colour="white") +
  geom_text(data=cap_map_data2,aes(x=cap_long, y= cap_lat,label=city_cn)) +
  geom_text_repel(data=cap_map_data1,aes(x=cap_long, y= cap_lat,label=city_cn)) +
  coord_map() +
  theme_void() + 
  theme(legend.position = "none") +
  scale_fill_manual(values = mycolors)

Highlighting selected provinces on the national map

https://cosx.org/2009/07/drawing-china-map-using-r/

par(mar=rep(0,4))
library(maps)
library(mapdata)
getColor = function(mapdata, provname, provcol, othercol){
	f = function(x, y) ifelse(x %in% y, which(y == x), 0)
	colIndex = sapply(mapdata@data$NAME, f, provname)
	fg = c(othercol, provcol)[colIndex + 1]
	return(fg)
}
provname = c("北京市", "天津市", "上海市", "重庆市")
provcol = c("red", "green", "yellow", "purple")
plot(map_data, col = getColor(map_data, provname, provcol, "white"))
points(cap_coord$cap_long, cap_coord$cap_lat, pch = 19, col = rgb(0, 0, 0, 0.5))
text(cap_coord$cap_long, cap_coord$cap_lat, cap_coord[, 3], cex = 0.9, col = rgb(0,0, 0, 0.7), 
     pos = c(2, 4, 4, 4, 3, 4, 2, 3, 4, 2, 4, 2, 2, 4, 3, 2, 1, 3, 1, 1, 2, 3, 2, 2, 1, 2, 4, 3, 1, 2, 2, 4, 4, 2))
axis(1, lwd = 0); axis(2, lwd = 0); axis(3, lwd = 0); axis(4, lwd = 0)
as.character(na.omit(unique(map_data@data$NAME)))

Highlighting selected provinces on the national map (recommended)

provname = c("北京市", "天津市", "上海市", "重庆市")
provcol = c("red", "green", "yellow", "purple")
getColors = function(mapdata, provname, provcol, othercol){
	f = function(x, y) ifelse(x %in% y, which(y == x), 0)
	colIndex = sapply(mapdata$prov_cn, f, provname)
	fg = c(othercol, provcol)[colIndex + 1]
	return(fg)
}
mc=getColors(cnmapdf, provname, provcol, "white")
map2 <- cnmapdf %>%
  filter(prov_en %in% unique(cnmapdf$prov_en))  %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = mc), color = "grey")

coord_delta_cap <- subset(cap_coord, prov_en %in% unique(cnmapdf$prov_en))
# Avoid overlapping place names
library(ggrepel)
spec.city <- c("香港","澳门")
cap_map_data21 <- coord_delta_cap[coord_delta_cap$city_cn %in% spec.city,]
cap_map_data22 <- coord_delta_cap[!coord_delta_cap$city_cn %in% spec.city,]
map2 + geom_point(data=cap_map_data22,aes(x=cap_long, y= cap_lat),shape=1,colour="white") +
  geom_text(data=cap_map_data22,aes(x=cap_long, y= cap_lat,label=city_cn)) +
  geom_text_repel(data=cap_map_data21,aes(x=cap_long, y= cap_lat,label=city_cn)) +
  coord_map() +
  theme_void() + 
  theme(legend.position = "none") +
  scale_fill_identity()

4. Adding data to the map

The example data can be downloaded from the National Bureau of Statistics of China.

Heatmap

democn <- read.csv("China_pop.csv", stringsAsFactors = F, check.names=FALSE)
library(tidyr)
library(reshape2)
democndf <- melt(democn,variable.name ="year", value.name = "population")
head(spread(democndf, year, population))

map2df <- cnmapdf %>% 
  plyr::join(subset(democndf, year == "2018年"), by = "prov_cn") %>%
  mutate(population = as.numeric(population))
  
map2df %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = population), color = "grey") +
  geom_point(data=cap_map_data22,aes(x=cap_long, y= cap_lat),shape=1,colour="white") +
  geom_text(data=cap_map_data22,aes(x=cap_long, y= cap_lat,label=city_cn),size=2) +
  geom_text_repel(data=cap_map_data21,aes(x=cap_long, y= cap_lat,label=city_cn),size=2) +
  scale_fill_gradient(low = "red", high = "yellow") +
  theme_void()

Multiple panels

map3df <- cnmapdf %>% 
  plyr::join(democndf, by = "prov_cn") %>%
  mutate(population = as.numeric(population)) %>%
  na.omit()
map3df %>%
  ggplot(aes(x = long, y = lat, group = group, fill = population)) +
  geom_polygon(color = "grey", lwd = .1) +
  coord_equal() +
  facet_wrap(~year)

Bubbles

map1 + 
  geom_point(data = map2df, aes(cap_long, cap_lat, size = population), shape = 21, fill="#9070c7",colour="grey", alpha = .5) +
  scale_size_area(max_size=5) +
  geom_text(data=cap_map_data2,aes(x=cap_long, y= cap_lat,label=city_cn),size=2,vjust=0,nudge_y=0.5) +
  geom_text_repel(data=cap_map_data1,aes(x=cap_long, y= cap_lat,label=city_cn),size=2,vjust=0,nudge_y=0.5) +
  coord_map() +
  theme_void() + 
  theme(legend.position = "none") +
  scale_fill_manual(values = mycolors)

Bar

map1 + 
  geom_errorbar(data=map2df,aes(x=cap_long, ymin=cap_lat, ymax=cap_lat + population/3000 ),
                colour="blue",size=2, width=0,alpha=0.5) +
  geom_text(data=cap_map_data2,aes(x=cap_long, y= cap_lat,label=city_cn),size=2,vjust=0,nudge_y=0.5) +
  geom_text_repel(data=cap_map_data1,aes(x=cap_long, y= cap_lat,label=city_cn),size=2,vjust=0,nudge_y=0.5) +
  coord_map() +
  theme_void() + 
  theme(legend.position = "none") +
  scale_fill_manual(values = mycolors)

5. World map

library(rworldmap)
met <- as.data.frame(read.csv("MetObjects_5k-sample.csv"))
countries.met <- as.data.frame(table(met$Country))
head(countries.met)
colnames(countries.met) <- c("country", "value")
matched <- joinCountryData2Map(countries.met, joinCode="NAME", nameJoinColumn="country")
mapCountryData(matched, nameColumnToPlot="value", mapTitle="Met Collection Country Sample", catMethod = "pretty", colourPalette = "heat",oceanCol="aliceblue")

Showing a single region only

mapCountryData(matched, nameColumnToPlot="value", mapTitle="Met Collection in Eurasia", mapRegion="Eurasia", colourPalette="heat", catMethod="pretty", oceanCol="aliceblue")

library(ggplot2)
library(dplyr)

WorldData <- map_data('world') %>% filter(region != "Antarctica") %>% fortify

df <- data.frame(region=c('Hungary','Lithuania','Argentina'), 
                 value=c(4,10,11), 
                 stringsAsFactors=FALSE)

p <- ggplot() +
    geom_map(data = WorldData, map = WorldData,
                  aes(x =long , y = lat, group = group, map_id=region),
                  fill = "white", colour = "#7f7f7f", size=0.5) + 
    geom_map(data = df, map=WorldData, aes(fill=value, map_id=region),colour="#7f7f7f", size=0.5) +
    coord_map("rectangular", lat0=0, xlim=c(-180,180), ylim=c(-60, 90)) +
    scale_fill_continuous(low="thistle2", high="darkred", guide="colorbar") +
    scale_y_continuous(breaks=c()) +
    scale_x_continuous(breaks=c()) +
    labs(fill="legend", title="Title", x="", y="") +
    theme(text = element_text( color = "#FFFFFF")
        ,panel.background = element_rect(fill = "aliceblue")
        ,plot.background = element_rect(fill = "aliceblue")
        ,panel.grid = element_blank()
        ,plot.title = element_text(size = 30)
        ,plot.subtitle = element_text(size = 10)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "right"
        )
    #theme_bw()
p

6. Info

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.936 
## [2] LC_CTYPE=Chinese (Simplified)_China.936   
## [3] LC_MONETARY=Chinese (Simplified)_China.936
## [4] LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rworldmap_1.3-6    reshape2_1.4.3     tidyr_0.8.3       
##  [4] mapdata_2.3.0      maps_3.3.0         ggrepel_0.8.0     
##  [7] RColorBrewer_1.1-2 ggplot2_3.1.0      dplyr_0.8.0.1     
## [10] maptools_0.9-5     sp_1.3-1          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       pillar_1.3.1     compiler_3.4.2   plyr_1.8.4      
##  [5] tools_3.4.2      dotCall64_1.0-0  digest_0.6.18    evaluate_0.13   
##  [9] tibble_2.0.1     gtable_0.2.0     lattice_0.20-38  pkgconfig_2.0.2 
## [13] rlang_0.3.1      mapproj_1.2.6    yaml_2.2.0       spam_2.2-2      
## [17] xfun_0.5         withr_2.1.2      stringr_1.4.0    knitr_1.21      
## [21] fields_9.8-3     grid_3.4.2       tidyselect_0.2.5 glue_1.3.0      
## [25] R6_2.4.0         foreign_0.8-71   rmarkdown_1.11   purrr_0.3.1     
## [29] searcher_0.0.3   magrittr_1.5     scales_1.0.0     htmltools_0.3.6 
## [33] assertthat_0.2.0 colorspace_1.4-0 labeling_0.3     stringi_1.3.1   
## [37] lazyeval_0.2.1   munsell_0.5.0    crayon_1.3.4

7. References

R Visual. - China Map Part II
https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/
https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/
Drawing a standard map of China

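Molecular biology or bioinformatics: that is the question

2019-04-18T05:40:50.000Z

Molecular biology is the study of living things at the molecular level. It sits between biology and chemistry, and its scope overlaps with genetics, biochemistry, and biophysics. It mainly seeks to understand how the different systems in a cell interact, including the relationships among DNA, RNA, and protein biosynthesis and how those interactions are regulated. In our own research this mostly means dissecting gene function and metabolic pathways.
Bioinformatics applies methods from applied mathematics, informatics, statistics, and computer science to biological questions. Its raw material and its results are biological data of all kinds, its tool is the computer, and its methods cover searching (collecting and filtering), processing (editing, organizing, managing, and displaying), and exploiting (computing, simulating) those data. The main current research directions are sequence alignment, sequence assembly, gene finding, genetic recombination, protein structure prediction, gene expression, prediction of protein interactions, and the construction of evolutionary models.
Ever since Michael Waterman pioneered the use of mathematical and computational methods in biological research, this interdisciplinary field has been a rising star that keeps working its way into more areas of biology. The 21st century is the century of the life sciences, but it is also the information age. If the molecular biologist knows no bioinformatics and the bioinformatician knows no molecular biology, how far can either of them go?
A craftsman who wants to do good work must first sharpen his tools. Every trade has its indispensable tools, and research is no exception.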

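1. Browser

The best browser is, of course, Google Chrome (no objections accepted): clean, ad-free, and easy on the eyes. Add a few extensions (http://tiramisutes.github.io/tiramisutes.github.io/2015/07/25/chrome.html) and it becomes more powerful still.

2. Literature search

The best tool here is, of course, the almighty Google (again, no objections accepted).

Failing that, NCBI's PubMed will do. Paired with Scholarscope (http://blog.scholarscope.cn/) it can also show a journal's ranking within its field, its impact factor (IF), and similar information.

And please, never again say that you cannot download a paper (sci-hub.tw).

3. NCBI databases

NCBI is one of the main database resources in biology, and looking up a sequence and then running a quick BLAST search is routine. The figure below should give you more options for future searches.

4. Reference management

Papers are mainly used in two ways: reading them, and citing them while writing. Everyone knows the industry giant EndNote, but the open-source Zotero can make up for the shortcomings of an unlicensed copy of EndNote. In the age of the cloud, going without synchronization feels unsafe. Zotero saves a paper with one click on the web page and handles reference management, syncing, and even reading on a phone.

5. Text editing

For viewing sequence files such as FASTA, the difference only shows when you compare: the same file opened in different programs reads completely differently. Sublime (http://www.sublimetext.com/) is lightweight, opens the vast majority of file types, supports extensions, and is well worth having.

6. Data safety

Better safe than sorry: backing up hard-won experimental data in real time is never a mistake, and the combination of a hard drive and the cloud is best. The free tier of Jianguoyun (坚果云) and the paid tier of Baidu Cloud are both options. A free Baidu Cloud account also works with some extra setup; for details see: Scheduled backups from Linux and Windows to Baidu Cloud

I am a divider line

Good. If you have read this far, equip your workstation with whichever of the tools above you like, then move on to the next level 👇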

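7. Sequence alignment

On Windows, the simple, convenient, and intuitive DNAMAN is recommended as a basic tool.

If you have too many sequences, it is better to run the computation on Linux. The first choice there is MAFFT, which is widely regarded as one of the most accurate aligners and has few parameters to tune.

8. Building phylogenetic trees

NJ, ML, and BI are the mainstream tree-building methods today, while MP is used less often. Each method has its strengths and its weaknesses. Because tree construction is a statistical problem, any tree we build is only an estimate or simulation of the true evolutionary relationships. With an appropriate method the inferred tree will be close to the true tree, so for a given data set it is advisable to build the phylogeny with two or more methods and compare the results.

Method	Suitable for	Advantages	Disadvantages
Neighbor-joining (NJ)	Closely related sequences with small evolutionary distances	Fast, with relatively accurate results	All sites in the sequence are treated equally, and the evolutionary distances among the analysed sequences must not be too large
Maximum likelihood (ML)	Comparisons across species; given a model, ML yields the tree that best fits the evolutionary facts	Sound statistical basis; with large samples the likelihood approach gives minimum-variance parameter estimates; once the evolutionary model is fixed, ML is the tree-building algorithm that best matches the evolutionary facts	Computationally heavy and slow; depends on a suitable substitution model
Bayesian inference (BI)	Large and complex data sets	Solid mathematical and statistical foundation; can handle complex, realistic evolutionary models	Sensitive to the evolutionary model; involves many statistical assumptions and parameters
Maximum parsimony (MP)	Sequences that differ at few bases; roughly equal rates of change at every site; no strong transition/transversion bias; a large number of bases examined (more than a few thousand)	Useful for certain special kinds of molecular data, such as insertions and deletions	Only suitable for N ≤ 12 sequences; performs poorly when back mutations or parallel mutations are common; the inferred tree is not unique; highly divergent sequences can cause long-branch attraction and an incorrect tree

Trees are usually classified as rooted or unrooted. A rooted tree reflects the temporal order in which the species (or genes) on the tree evolved, whereas an unrooted tree only reflects the distances between taxa and says nothing about their common ancestor.

Bootstrap is a commonly used method for assessing a phylogenetic tree. The bootstrap value on a branch is the percentage of replicates in which that branch was recovered, and a value above 70 is generally taken to indicate a reliable branch.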
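
The bootstrap value described here is simply a proportion, which the following Python sketch makes concrete. It assumes a deliberately simplified representation that real tree software does not use: each replicate tree is a set of clades, and each clade is a frozenset of taxon names. All names and data in it are hypothetical.

```python
def bootstrap_support(clade, replicate_trees):
    """Percentage of bootstrap replicates whose tree contains `clade`.

    For this sketch only, a tree is a set of clades and a clade is a
    frozenset of taxon names.
    """
    if not replicate_trees:
        raise ValueError("need at least one replicate")
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return 100.0 * hits / len(replicate_trees)

# Hypothetical toy data: four replicates over taxa A-D.
ab = frozenset({"A", "B"})
cd = frozenset({"C", "D"})
ac = frozenset({"A", "C"})
replicates = [{ab, cd}, {ab, cd}, {ab}, {ac}]

print(bootstrap_support(ab, replicates))  # 75.0
print(bootstrap_support(cd, replicates))  # 50.0
```

With the usual threshold of 70, the (A, B) branch in this toy example would count as supported and the (C, D) branch would not.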

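On Windows, MEGA is recommended: it handles alignment and tree building in one place and supports the most commonly used ML and NJ algorithms.
If you have too many sequences, it is better to run the computation on Linux; recommended software includes PHYLIP, RAxML, IQ-TREE, and MrBayes.

For beautifying the tree, iTOL is recommended; everyone who has used it speaks well of it.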

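Notes on uploading PacBio, BioNano, and genome data to NCBI

2019-04-05T01:28:11.000Z

1. Uploading raw PacBio BAM files

Since the PacBio Sequel platform, the raw data coming off PacBio instruments have been in BAM format. How should they be uploaded to NCBI?

NCBI's SRA_metadata_acc.xlsx template lists the PacBio format as the HDF5 format of the PacBio RS platform, while the bam option at the top of the list is treated as an alignment file, which requires the reference genome to be given in the assembly column.

So what to do? The ever-helpful NCBI staff offered this advice 👇

For unaligned bam files please enter ‘unaligned’ in the ‘assembly’ column.

2. Uploading BioNano data

Go to Supplementary Files and choose to upload either the raw BioNano map data or data from the hybrid assembly process.
According to the description, the available file types are CMAP, COORD (hybrid assembly process), XMAP, SMAP (structural variation data), and the raw instrument output BNX.

3. Uploading genome data

Submitting a de novo assembled genome usually also requires the underlying raw sequencing data. First follow the post "Summary of uploading sequencing data to NCBI" to register a BioProject and a BioSample dedicated to the raw genome sequencing data.
Then prepare the genome as a fa file (unannotated genome) or an sqn file (annotated genome) and upload it through the Genome submission portal.

3.1 Checklist of data to prepare

Genome fa/sqn file
BioProject accession
BioSample accession
WGS or non-WGS genome
AGP file; its format can be checked with the online AGP validation tool or by downloading the software and checking on the command line (fatoagp generates an AGP file from a fa file; fasta2apg.pl generates an AGP file from a fa file and also writes separate contig.fa and scaffold.fa files)
Other optional annotation information

4. Using tbl2asn

tbl2asn is mainly used on the command line to generate gb or sqn files for submitting data to the GenBank database.
Run tbl2asn - to see the full list of parameters.
For details on preparing the data see "Submission using tbl2asn": create a new directory to hold the relevant files, and give every file except template.sbt the same prefix.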

template.sbt (this is the only file whose prefix is different. Leave the prefix as is).
chr01.fsa
chr01.tbl
chr01.qvl
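
Because tbl2asn pairs files by prefix, a quick check of the directory before running it can save a failed submission. The Python sketch below is one way to do that. The function name and the choice of .fsa and .tbl as the extensions to require are assumptions made for this example, not something tbl2asn itself prescribes; adjust them to what your submission actually contains.

```python
from pathlib import Path
import tempfile

def check_tbl2asn_dir(directory, required_exts=(".fsa", ".tbl")):
    """Return the prefix shared by the submission files in `directory`.

    Raises ValueError if template.sbt is missing, if the data files do not
    all share one prefix, or if a required extension is absent.
    """
    directory = Path(directory)
    if not (directory / "template.sbt").is_file():
        raise ValueError("template.sbt not found")
    data_files = [p for p in directory.iterdir()
                  if p.is_file() and p.name != "template.sbt"]
    prefixes = {p.stem for p in data_files}
    if len(prefixes) != 1:
        raise ValueError(f"expected one shared prefix, found {sorted(prefixes)}")
    present = {p.suffix for p in data_files}
    missing = [ext for ext in required_exts if ext not in present]
    if missing:
        raise ValueError(f"missing files with extensions: {missing}")
    return prefixes.pop()

# Demo on a throwaway directory that mimics the listing above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("template.sbt", "chr01.fsa", "chr01.tbl", "chr01.qvl"):
        (Path(tmp) / name).touch()
    shared_prefix = check_tbl2asn_dir(tmp)
    print(shared_prefix)  # chr01
```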

4.1 Notes on selected parameters

-i gives the fa file name; it must be a bare file name with no path;
-M overrides some of the other parameters;
-l is only available when -M is set to t;
-a s: the fa file contains multiple sequences and the output is a single submission file;

4.2 Example run

cd NCBI_Genome
linux64.tbl2asn -p . -r . -t template.sbt -i genome.fa -a s -j "[organism=xx][genotype=xx][tech=wgs][country=China]" -V vbgt -k c -Z discrep.txt -N 1.0 -L T -n xx

4.3 Running as a pipeline

See: WGS2NCBI - toolkit for preparing genomes for submission to NCBI

5. Converting fasta+GFF to GenBank/EMBL

See the Genome_Scripts project 👇

#!/usr/bin/env python
"""Convert a GFF and associated FASTA file into GenBank format.
Usage:
    gff_convert.py -f genbank -s  
"""
import sys
import os
from Bio import SeqIO
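# Note: Bio.Alphabet was removed in Biopython 1.78, so this script needs an older Biopython release.
from Bio.Alphabet import generic_dna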
from Bio import Seq
import argparse
from BCBio import GFF

parser=argparse.ArgumentParser(
    description='''Script that converts GFF + Fasta to GBK or EMBL ''',
    epilog="""hope (2019)  http://tiramisutes.github.io/2019/04/05/PBGNCBI.html""")
parser.add_argument("gff", help='GFF file')
parser.add_argument("fasta", help='Fasta file')
parser.add_argument("-f", "--format", choices=['genbank', 'embl'])
parser.add_argument("-s","--split", action='store_true', help='Split output into single files, 1 per contig')
parser.add_argument("-o","--output", default=".", help='Set the directory of output file/files (default: current directory)')
args=parser.parse_args()

if len(sys.argv) < 2:
    parser.print_usage()
    sys.exit(1)
    

def _fix_ncbi_id(fasta_iter):
    """GenBank identifiers can only be 16 characters; try to shorten NCBI.
    """
    for rec in fasta_iter:
        if len(rec.name) > 16 and rec.name.find("|") > 0:
            new_id = [x for x in rec.name.split("|") if x][-1]
            print("Warning: shortening NCBI name %s to %s" % (rec.id, new_id))
            rec.id = new_id
            rec.name = new_id
        yield rec

def _check_gff(gff_iterator):
    """Check GFF files before feeding to SeqIO to be sure they have sequences.
    """
    for rec in gff_iterator:
        if isinstance(rec.seq, Seq.UnknownSeq):
            print("Warning: FASTA sequence not found for '%s' in GFF file" % (
                    rec.id))
            rec.seq.alphabet = generic_dna
        yield _flatten_features(rec)

def _flatten_features(rec):
    """Make sub_features in an input rec flat for output.
    GenBank does not handle nested features, so we want to make
    everything top level.
    """
    out = []
    for f in rec.features:
        cur = [f]
        while len(cur) > 0:
            nextf = []
            for curf in cur:
                out.append(curf)
                if len(curf.sub_features) > 0:
                    nextf.extend(curf.sub_features)
            cur = nextf
    rec.features = out
    return rec

gff_file = args.gff
fasta_file = args.fasta
format = args.format
output_dir = args.output

if args.split:
    if format == "genbank":
        print("Output set to " + format + ", splitting files and writing individual records to directory: " + output_dir)
        fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
        for rec in GFF.parse(gff_file, fasta_input):
            SeqIO.write(_check_gff(_fix_ncbi_id([rec])), open(output_dir + "/" + rec.id + ".gbk", "w"), "genbank")
    if format == "embl":
        print("Output set to " + format + ", splitting files and writing individual records to directory: " + output_dir)
        fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
        for rec in GFF.parse(gff_file, fasta_input):
            SeqIO.write(_check_gff(_fix_ncbi_id([rec])), open(output_dir + "/" + rec.id + ".embl", "w"), "embl")
else:
    if format == "genbank":
        out_file = output_dir + "/%s.gb" % os.path.splitext(os.path.basename(gff_file))[0]
        print("Output set to " + format + ", writing file to " + out_file)
        fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
        gff_iter = GFF.parse(gff_file, fasta_input)
        SeqIO.write(_check_gff(_fix_ncbi_id(gff_iter)), out_file, "genbank")
    if format == "embl":
        out_file = output_dir + "/%s.embl" % os.path.splitext(os.path.basename(gff_file))[0]
        print("Output set to " + format + ", writing file to " + out_file)
        fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
        gff_iter = GFF.parse(gff_file, fasta_input)
        SeqIO.write(_check_gff(_fix_ncbi_id(gff_iter)), out_file, "embl")

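6. Uploading with FileZilla

Once the earlier sections of the SRA submission are filled in, you reach the page shown below. When submitting with FileZilla, be sure to set the Remote Site.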

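Nature Communications | Why are apples so red?

2019-04-03T05:14:26.000Z

Research suggests that the ancestor of the apple was a shrub. About 60 million years ago, when a giant meteorite struck the Earth, huge amounts of dust were thrown into the atmosphere, blocking sunlight and reducing photosynthesis. The devastating effect on ecosystems worldwide wiped out most life on Earth, including the dinosaurs, yet the apple's ancestor survived and was reborn through evolution.
Apples, grapes, citrus, and bananas are known as the world's four major fruits, with the apple at the top. "An apple a day" is a familiar health slogan, and since the 19th century a Welsh proverb has linked apples to health: "An apple a day keeps the doctor away."
Indeed, apples are rich in sugars, organic acids, cellulose, vitamins, minerals, polyphenols, and flavonoids, and scientists have called them an "all-round healthy fruit". According to the United States Department of Agriculture, an apple weighing about 242 g provides 126 kcal and contains plenty of dietary fiber and vitamin C. Apple peel contains many phytochemicals of uncertain nutritional value that may have antioxidant activity in in vitro experiments. Apples also contain phenolic compounds such as quercetin, catechin, and procyanidin B2. [Wikipedia: Apple]

The apple is also the most diverse and most economically valuable fruit in the Rosaceae, so it is popular in scientific research as well as in daily life.
The apple genome is of great importance for genetic research and breeding (cold tolerance, flavor, ripening time, and so on), and a high-quality chromosome-scale reference genome reflects a species' genomic information more faithfully. As of April 2019, four high-profile papers had reported on the apple genome.

Genomic research has revealed that the apple is in fact an Asian migrant that took on some "Western values" [Guokr | The apple: steeped in history and weathered by time]. In 2017, Nature Communications published a paper titled "Genome re-sequencing reveals the history of apple and supports a two-stage model for fruit enlargement", which described the origin, evolution, and domestication of the apple at the molecular level and showed that the world's cultivated apples originated in Xinjiang, China.

The first apple genome to be sequenced was that of the diploid cultivar Golden Delicious. In the 2017 Nature Genetics paper its scaffold N50 reached 5.558 Mb, which qualifies as a high-quality reference genome. The genome showed that a burst of transposable element (TE) amplification about 21 million years ago (MYA) coincided with the uplift of the Tian Shan mountains (the apple's center of origin), suggesting that TEs played an important role in the diversification of the apple's ancestors and in its divergence from pear.
On 2 April 2019, Nature Communications published "A high-quality apple genome assembly reveals the association of a retrotransposon and red fruit colour", the latest apple genome work from a collaboration including the Chinese Academy of Agricultural Sciences (Liaoning), which explains at the genome level why apples are so red.

Since a high-quality reference genome already exists, is it worth sequencing another? Certainly, much as with the Human Genome Project and precision medicine. Golden Delicious is a major cultivar in the United States, whereas the apples commonly grown in China fall into the Delicious, Golden Delicious, and Fuji groups. Fuji-type apples now cover the largest area in China, with Red Fuji as the main cultivar (apple types and the cultivars commonly grown in China):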

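Delicious group: discovered in 1872 in Iowa, USA, among root suckers of a 'Bellflower' apple; named in 1894 and promoted from 1895. Also known as Red Delicious (蛇果).
Golden Delicious group: discovered in West Virginia, USA, at the end of the 19th century. The fruit is fairly large; it is also called 黄元帅 or 金帅. It is an important high-yielding cultivar, and the ripe fruit is golden yellow with a refreshing sweet-tart taste.
Fuji group: bred at the Morioka branch of the Fruit Tree Research Station of Japan's Ministry of Agriculture, Forestry and Fisheries; named in 1962 and introduced to China in 1966, where it is collectively called Red Fuji. More than 100 strains have been selected so far. Red Fuji is the world's best-known late-ripening cultivar: the fruit has good flavor, ripens late, and stores well, and it is now the most widely planted cultivar in the world.

The cultivar sequenced here, Hanfu (Dongguang x Fuji) (寒富苹果), was bred at Shenyang Agricultural University in 1978 by crossing Dongguang as the female parent (strongly cold hardy but with poor fruit quality) with Fuji as the male parent (excellent fruit quality but poorly cold hardy). The result is a cold-hardy, high-yielding cultivar with good fruit quality and a pronounced spur-type habit. [From Baidu Baike] Although Hanfu belongs to the same species as Golden Delicious, differences in growing environment and parentage mean that its phenotype and genome are bound to differ, and obtaining genome information specific to this cultivar is important for understanding genome evolution and clarifying the genetics of complex traits.

By combining Illumina paired-end short reads, PacBio single-molecule real-time (SMRT) sequencing, chromosome conformation capture (Hi-C) sequencing, and optical mapping, the authors assembled the genome of the homozygous line HFTH1, with a size of 658.90 Mb and a scaffold N50 of 37.14 Mb. Comparison with the Golden Delicious genome identified 18,047 deletions, 12,101 insertions, and 14 large inversions, with active transposons as the main cause of the widespread variation within the genome. More importantly, the genome-level analysis showed that the insertion of a long terminal repeat (LTR) retrotransposon upstream of the MdMYB1 gene is associated with the red-skinned phenotype. MdMYB1 is a core transcriptional activator of anthocyanin biosynthesis, which explains the molecular mechanism by which apples turn red, and the high-quality reference genome also lays the groundwork for deciphering important agronomic traits in apple.

The study was supported by the Agricultural Science and Technology Innovation Program (No. CAAS-ASTIP-2016-RIP-02 and CAAS-XTCX2016), the earmarked fund for the China Agriculture Research System (No. CARS-27), and the Central Public-interest Scientific Institution Basal Research Fund (No. 1610182016020 and Y2019XK09).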

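Nanopore sequencing series: the basics

2019-03-12T13:39:59.000Z

Nanopore sequencing technology

Nanopore sequencing (Oxford Nanopore Technologies, also called fourth-generation sequencing) is a new generation of sequencing technology that has emerged in recent years.
The nanopore platforms widely accepted on the market today are three sequencers from Oxford Nanopore Technologies (ONT): the MinION, the GridION X5, and the PromethION. ONT sequencing is characterized by single-molecule sequencing, long reads, fast runs, real-time monitoring of the sequencing data, and portable instruments.
According to measured data for different species published by Biomarker Technologies (百迈客):

Genome: the mean read length is currently above 20 kb, the read N50 reaches about 48 kb, and the longest read reaches a striking 1.29 Mb;
Transcriptome: the raw data amount to 2.5-4.4 Gb with 2.8-3.9 M reads, an N50 of about 1.5 kb, a mean length of 1.0 kb, and a mean quality above Q7. After quality control, about 85% of reads align to the genome.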
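
Since N50 appears in both bullet points, here is a short Python sketch of how it is computed for a set of read lengths: sort the reads from longest to shortest and report the length at which the running total first reaches half of all bases. The toy lengths are made up for illustration.

```python
def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("no lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

read_lengths = [2, 3, 4, 5, 6, 10]  # hypothetical toy values, in kb
print(n50(read_lengths))  # 6
```

Here the total is 30 kb, and the two longest reads (10 + 6 = 16 kb) already pass the halfway mark of 15 kb, so the N50 is 6 kb even though the mean read length is only 5 kb.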

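Read length and accuracy are the two things people care about most. When nanopore sequencers were first released, their biggest drawback was a very high error rate (5-15%). According to the same Biomarker results, thanks to continued work by ONT's scientists, current runs deliver not only very long reads but also a mean per-base accuracy of about 86% for raw data (before error correction and polishing), comparable to PacBio (~85%). After assembly followed by error correction and polishing, base accuracy exceeds 99.99%.

How it works

In a chamber filled with electrolyte, an insulating, impermeable membrane containing a nanometer-scale pore divides the chamber into two compartments (Figure 1). When a voltage is applied across the electrolyte, ions and other small molecules pass through the pore and produce a stable, measurable ionic current. By controlling the size and surface properties of the nanopore, the applied voltage, and the solution conditions, different types of biomolecules can be detected.
The four bases of DNA, adenine (A), guanine (G), cytosine (C), and thymine (T), differ in molecular structure and size. Single-stranded DNA (ssDNA) is rapidly cleaved one nucleotide at a time by an exonuclease, and as each base is driven through the nanopore by the electric field, the chemical differences between the bases produce current changes of different amplitude, from which the DNA sequence is read.

Sequencing process

The core of the MinION is a flow cell with 2,048 nanopores arranged in 512 groups and controlled by an application-specific integrated circuit. The principle is shown in panel a of the figure below. First, the double-stranded DNA is ligated to a lead adaptor (blue), a hairpin adaptor (red), and a trailing adaptor (brown). When sequencing starts, the lead adaptor guides the molecule into the enzyme-controlled nanopore, followed by the template read (the DNA strand being sequenced). The hairpin adaptor is what makes it possible to sequence both strands: the complement read (the complementary strand) then passes through the pore, and finally the trailing adaptor. In this method the template read and the complement read pass through the pore in turn and are combined by pairwise alignment into a 2D read. In the other method no hairpin adaptor is used and only the template read is sequenced, giving a 1D read. The latter has higher throughput but lower accuracy than 2D reads. Each adaptor produces a distinct current change as it passes through the pore (Figure 1c), and this difference can be used in base calling.

Sequencing platforms

MinION

ONT's best-known product. It is only the size of a USB stick, runs when plugged into an ordinary PC, and has been used on the International Space Station to sequence in space. The MinION runs one flow cell at a time and yields about 10-30 Gb of data over a run of about 48 hours.

GridION

The sequencing unit of the GridION X5 holds five flow cells, which can be used independently or together and connect to a computer over USB. With current reagents and software, a 48-hour run can generate up to 150 GB of sequencing data. The GridION X5 fills the gap between the MinION and the PromethION.

PromethION

The PromethION is ONT's newest ultra-high-throughput sequencer and supports real-time, long-read, direct DNA and RNA sequencing workflows. It runs 24 (PromethION 24) to 48 (PromethION 48) flow cells at once. With 3,000 nanopore channels per flow cell, running all flow cells simultaneously can produce up to 7.6 Tb or even 15 Tb of data, enough for ultra-high-throughput, fast-turnaround sequencing. It is suited to large-scale population genetics studies and to large plant and animal genome projects.

Applications

Genome assembly

[Nanopore sequencing The advantages of long reads for genome assembly]

Assembler name	Algorithms	Error correction	Link	Reference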
LQS	DALIGNER, Celera OLC	Nanocorrect, Nanopolish	https://github.com/jts/nanopolish	Loman (2015)
Canu	MHAP, Celera OLC	Canu	https://github.com/marbl/canu	Berlin (2015)
Canu	MHAP, Celera OLC	Racon, Pilon	https://github.com/nanoporetech/ont-assembly-polish	nanoporetech
Miniasm	OLC		https://github.com/lh3/minia	Li (2016)
Miniasm	OLC	Racon	https://github.com/isovic/racon	Vaser (2017)
Ra-integrate	OLC		https://github.com/mariokostelac/ra-integrate/	Sovic (2016)
ALLPATHS-LG	de Bruijn graph	ALLPATHS-LG	https://www.broadinstitute.org/software/allpathslg/blog/?page_id	Gnerre (2011)
SPAdes	de Bruijn graph	SPAdes	http://bioinf.spbau.ru/spades	Bankevich (2012)
SMART denovo	Smith-Waterm, dot matrix		https://github.com/ruanjue/smartdenovo	Ruan
ABruijn	de Bruijn graph		https://github.com/fenderglass/ABruijn	Lin (2016)

Metagenomics

[Nanopore sequencing Addressing the challenges of metagenomics for environmental and clinical research]

Variant analysis

[Nanopore sequencing The application and advantages of long-read nanopore sequencing to structural variation analysis]

Methylation analysis

Because ONT sequences single molecules in real time with long reads, ONT methylation data can faithfully recover base modification information and give accurate detection results covering many modification types.

References

A brief introduction to the development of nanopore sequencing
A comprehensive look at fourth-generation sequencing
Jain, Miten, et al. "The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community." Genome Biology 17.1 (2016): 239.
Leggett, Richard M., and Matthew D. Clark. "A world of opportunities with nanopore sequencing." Journal of Experimental Botany (2017): erx289.

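A researcher's working environment

2019-03-12T12:47:19.000Z

A craftsman who wants to do good work must first sharpen his tools. Every trade has its indispensable tools, and research is no exception. In the big-data and information age of the 21st century especially, these seemingly trivial everyday choices can make our work far more efficient. My own setup is offered here purely for reference.

Browser

The best browser is, of course, Google Chrome (no objections accepted): clean, ad-free, and easy on the eyes. Add a few extensions (http://tiramisutes.github.io/tiramisutes.github.io/2015/07/25/chrome.html) and it becomes more powerful still.

Literature search

The best tool here is, of course, the almighty Google (again, no objections accepted).

Failing that, NCBI's PubMed will do. Paired with Scholarscope (http://blog.scholarscope.cn/) it can also show a journal's ranking within its field, its impact factor (IF), and similar information.

NCBI databases

NCBI is one of the main database resources in biology, and looking up a sequence and then running a quick BLAST search is routine. The figure below should give you more options for future searches.

Reference management

Text editing

Data safety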

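A streamlined pipeline for qPCR data analysis

2019-01-26T03:33:32.000Z

Our lab runs qPCR experiments all the time, and clicking through Excel every single time is tedious, especially when there are many experiments... so I put together a pipeline to stop wasting time on it 😃

Here it is 👏

The advantage is that you know exactly how every step is calculated <(￣▽￣)>

qPCR-Pipeline

But that still needs code and scripts, and I don't know how; it's all so complicated (┬＿┬)

No problem, don't worry, there is an even simpler option: all you have to do is prepare your qPCR results and upload them, and you get the figure and the analysis results ╰(￣▽￣)╮

How? Here is the link: A shiny for analysis Real-Time PCR data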

Trinity transcriptome analysis in one figure

2019-01-21T03:22:14.000Z

If the picture is not enough, click here [Trinity workflow] for the full-resolution original.

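A detailed reference-based RNA-seq analysis workflow

2018-12-04T06:27:58.000Z

1. Introduction

With sequencing now widely available, RNA-seq has become commonplace, and the throughput of qRT-PCR alone no longer meets the needs of many experiments. RNA-seq analysis comes down to two approaches: reference-based and reference-free (de novo).
This post gives a brief introduction to reference-based transcriptome analysis.
There are countless reference-based workflows and this is only one of them; for detailed parameters, consult each tool's -help output.

2. Software setup

Quality checking
- FastQC
- MultiQC (optional)
Read filtering and trimming
- Trimmomatic
Alignment
- hisat2
Sorting and format conversion
- samtools
Transcript assembly
- StringTie
Differential expression analysis
- Ballgown
- DESeq2
- edgeR

3. Data preparation

Genome data for the target species [genome fa (genome.fa) and gff annotation file (genome.gff3)]
Sequencing reads (generated in the lab or downloaded from NCBI)

4. Analysing the sequencing reads

4.1 SRA to fq (optional)

Following "Using the SRA Toolkit to convert .sra files into other formats", use whichever tool you prefer to convert SRA files downloaded from NCBI's SRA database into fq format.

4.2 Quality control

Use FastQC to inspect data quality and Trimmomatic to filter and trim adapters and low-quality reads.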

ls *gz |xargs -I [] echo 'nohup fastqc [] &'>fastqc.sh
bash fastqc.sh
# run MultiQC once all FastQC jobs have finished
multiqc ./

java -jar trimmomatic-0.30.jar PE \
-threads 20 -phred33 reads1.fastq reads2.fastq \
reads1.clean.fastq reads1.unpaired.fastq reads2.clean.fastq reads2.unpaired.fastq \
ILLUMINACLIP:/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:30:10 \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
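
To make the SLIDINGWINDOW:4:15 and MINLEN:50 settings less abstract, the Python sketch below mimics the idea: slide a 4-base window along the read from the 5' end, cut where the mean quality first drops below 15, then discard reads that end up shorter than the minimum length. This is a simplified illustration and not Trimmomatic's actual implementation; the function names and the example quality values are invented.

```python
def sliding_window_trim(quals, window=4, min_avg=15):
    """Return the number of leading bases to keep.

    Scans full windows of `window` bases from the 5' end and cuts at the
    start of the first window whose mean quality is below `min_avg`.
    Simplification: partial windows at the 3' end are not examined.
    """
    for start in range(0, len(quals) - window + 1):
        chunk = quals[start:start + window]
        if sum(chunk) / window < min_avg:
            return start
    return len(quals)

def passes_minlen(kept_length, min_len=50):
    """MINLEN-style filter: keep the read only if it is still long enough."""
    return kept_length >= min_len

# Hypothetical Phred scores: good start, then a low-quality stretch.
quals = [30, 30, 30, 30, 30, 10, 10, 10, 10, 30]
keep = sliding_window_trim(quals)
print(keep)                 # 5
print(passes_minlen(keep))  # False: a 5-base read is dropped at MINLEN:50
```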

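4.3 Alignment

hisat2-build genome.fa genome
hisat2 -x genome -1 read1.fq.gz -2 read2.fq.gz -S Sample.sam -p 8

4.4 Sorting and format conversion

# samtools 1.x syntax; very old releases took an output prefix instead of -o
samtools view -bS Sample.sam | samtools sort -@ 8 -o Sample.sorted.bam -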

4.5 Transcript assembly

Assemble the transcripts of each sample with StringTie

# Transcriptome assembly
stringtie -p 8 -G genome.gff3 -o Sample.gtf -l Sample Sample.sorted.bam
# Build a list of all *.gtf file names, one per line
ls | \grep "Sample" | sort -V | uniq | awk 'BEGIN{OFS="/"} {print $1,$1".gtf"}' > Sample_gtf.txt
# Merge transcripts into a non-redundant set of transcripts
stringtie --merge -p 8 -G genome.gff3 -o merged.gtf Sample_gtf.txt
# Expression level estimation
stringtie -e -B -p 8 -G merged.gtf -o Sample.gtf Sample.sorted.bam

4.6 Extracting count data

Prepare a sample file (sample_lst.txt) that lists the gtf result files from the previous step, in the following format:

Sample1 <PATH_TO_Sample1.gtf>
Sample2 <PATH_TO_Sample2.gtf>
Sample3 <PATH_TO_Sample3.gtf>
Sample4 <PATH_TO_Sample4.gtf>
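
With many samples it is easier to generate this file than to type it. The Python sketch below writes one "sample ID, space, path" line per GTF file, taking the sample ID from the file name. The helper name and the directory layout in the demo are assumptions for illustration; point it at wherever your per-sample GTF files actually live.

```python
from pathlib import Path
import tempfile

def write_sample_list(gtf_paths, out_file):
    """Write one 'sample_id path' line per GTF file; the ID is the file stem."""
    lines = [f"{Path(p).stem} {p}" for p in sorted(gtf_paths, key=str)]
    Path(out_file).write_text("\n".join(lines) + "\n")
    return lines

# Demo in a throwaway directory with two empty placeholder GTF files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Sample2.gtf", "Sample1.gtf"):
        (Path(tmp) / name).touch()
    sample_lines = write_sample_list(Path(tmp).glob("*.gtf"),
                                     Path(tmp) / "sample_lst.txt")
    written = (Path(tmp) / "sample_lst.txt").read_text()
    print(written)
```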

Extract the count data for each sample

prepDE.py -i sample_lst.txt

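5. Differential expression analysis

For the differential analysis, see the post "RNA-seq analysis of unreplicated samples, piggybacking on PacBio full-length transcriptomes".
The main inputs are a phenotype file and the gene or transcript count files produced above.
The phenotype data (phenodata.csv, tab-separated) are formatted as follows:

sample	group
Sample1	leaf
Sample2	leaf
Sample3	root
Sample4	root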

5.1 DESeq2

library("DESeq2")
countData <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))
colData <- read.csv("phenodata.csv", sep="\t", row.names=1)
all(rownames(colData) %in% colnames(countData))
countData <- countData[, rownames(colData)]
all(rownames(colData) == colnames(countData))
dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ group)
dds <- DESeq(dds)
res <- results(dds)
(resOrdered <- res[order(res$padj), ])
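
The two all() checks and the column reordering in the R code guard against a common mistake: the samples in the phenotype table and the columns of the count matrix being in different orders. The same logic is shown below in plain Python with made-up data, purely to spell out what the check does; it is not part of DESeq2.

```python
def align_counts_to_samples(header, rows, sample_order):
    """Reorder count columns to follow sample_order; fail if a sample is missing."""
    missing = [s for s in sample_order if s not in header]
    if missing:
        raise KeyError(f"samples missing from count matrix: {missing}")
    idx = [header.index(s) for s in sample_order]
    return [[row[i] for i in idx] for row in rows]

header = ["Sample3", "Sample1", "Sample2", "Sample4"]  # hypothetical column order
rows = [[30, 10, 20, 40], [3, 1, 2, 4]]                # hypothetical counts
pheno_samples = ["Sample1", "Sample2", "Sample3", "Sample4"]
aligned = align_counts_to_samples(header, rows, pheno_samples)
print(aligned)  # [[10, 20, 30, 40], [1, 2, 3, 4]]
```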

5.2 edgeR

edgeR works much like DESeq2 above; see its Bioconductor documentation for details.

5.3 Ballgown

The StringTie results above can be read directly by Ballgown for differential analysis.

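library(ballgown)
library(genefilter)  # rowVars()
library(dplyr)       # arrange()
pheno_data <- read.csv("phenodata.csv", sep = "\t")
bg <- ballgown(dataDir = "ballgown",
               samplePattern = "Sample",
               pData = pheno_data)
sampleNames(bg)
# Filter out transcripts whose expression varies little across samples
bgfilt <- subset(bg, 'rowVars(texpr(bg)) > 1', genomesubset = TRUE)
# covariate is the variable of interest (here the group column of phenodata.csv);
# confounding variables, if any, can be passed through adjustvars
diff_genes <- stattest(bgfilt, feature = 'gene', covariate = 'group', getFC = TRUE, meas = 'FPKM')
diff_genes <- arrange(diff_genes, pval)
write.csv(diff_genes, 'diff_genes.csv', row.names = FALSE)
# MA plot
library(ggplot2)
library(cowplot)
results_transcripts <- stattest(bgfilt, feature = 'transcript', covariate = 'group', getFC = TRUE, meas = 'FPKM')
results_transcripts$mean <- rowMeans(texpr(bgfilt))

ggplot(results_transcripts, aes(log2(mean), log2(fc), colour = qval<0.05)) +
  scale_color_manual(values=c("#999999", "#FF0000")) +
  geom_point() +
  geom_hline(yintercept=0)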

6. Further reading

Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown
RNA-seq Wiki

RNA-seq analysis of samples without replicates, piggybacking on a PacBio full-length transcriptome

2018-07-02T04:36:32.000Z

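RNA-seq analysis was my stepping stone into bioinformatics: reference-based and de novo analyses, and now the upgraded version, Illumina RNA-seq analysis built on a PacBio full-length transcriptome.
Replicates matter, but samples without replicates keep arriving, and analysing them is neither as smooth nor as flexible as with three replicates.

0. Input data

The input is usually a count table, which many tools can generate:

- htseq-count, for genome-based alignments;
- transcriptome-based quantifiers:
  - Salmon (Patro et al. 2017)
  - Sailfish (Patro, Mount, and Kingsford 2014)
  - kallisto (Bray et al. 2016)
  - RSEM (Li and Dewey 2011)

The first step with these tools is usually to build an index of the whole transcriptome👇

Obtaining the transcriptome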

For this example, we’ll be analyzing some Arabidopsis thaliana data, so we’ll download and index the A. thaliana transcriptome.

curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz

Building the index

salmon index -t athal.fa.gz -i athal_index

Quantifying gene expression

#!/bin/bash
for fn in data/DRR0161{25..40};
do
samp=`basename ${fn}`
echo "Processing sample ${samp}"
salmon quant -i athal_index -l A \
         -1 ${fn}/${samp}_1.fastq.gz \
         -2 ${fn}/${samp}_2.fastq.gz \
         -p 8 -o quants/${samp}_quant
done

Downstream differential expression analysis

Once you have your quantification results you can use them for downstream analysis with differential expression tools like DESeq2, edgeR, limma, or sleuth. Using the tximport package, you can import salmon’s transcript-level quantifications and optionally aggregate them to the gene level for gene-level differential expression analysis. You can read more about how to import salmon’s results into DESeq2 by reading the tximport section of the excellent DESeq2 vignette. For instructions on importing for use with edgeR or limma, see the tximport vignette. For preparing salmon output for use with sleuth, see the wasabi package.
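The gene-level aggregation mentioned above boils down to summing transcript-level estimates per gene. The sketch below illustrates only that idea in plain Python; it is not tximport's implementation (which also offers length-aware scaling options), and the function name `summarize_to_gene` and the toy transcript-to-gene map are assumptions made for illustration.

```python
from collections import defaultdict


def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts.

    tx_counts: dict mapping transcript ID -> estimated count
    tx2gene:   dict mapping transcript ID -> gene ID
    Transcripts missing from tx2gene are skipped (an assumption of this sketch).
    """
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is None:
            continue
        gene_counts[gene] += count
    return dict(gene_counts)


# Toy example using Arabidopsis-style identifiers
tx_counts = {"AT1G01010.1": 10.0, "AT1G01010.2": 5.0, "AT1G01020.1": 7.5}
tx2gene = {"AT1G01010.1": "AT1G01010", "AT1G01010.2": "AT1G01010",
           "AT1G01020.1": "AT1G01020"}
print(summarize_to_gene(tx_counts, tx2gene))
```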

==The DESeq approach is preferred.==

1. Analysing no replicate RNA-seq data with LPEseq package

LPEseq was designed for the RNA-Seq data with a small number of replicates, especially with non-replicate in each class. Also LPEseq can be equally applied both count-base and FPKM-based (non-count values) input data.

2. IsoEM2

3. edgeR

The quasi-likelihood method is highly recommended for differential expression analyses of bulk RNA-seq data as it gives stricter error rate control by accounting for the uncertainty in dispersion estimation. The likelihood ratio test can be useful in some special cases such as single cell RNA-seq and datasets with no replicates.

4. NOISeq: Differential Expression in RNA-seq

library(NOISeq)
## named tpm rather than dat, so it does not shadow NOISeq's dat() function used below
tpm <- read.csv("/public/home/zpxu/hulisong/results/kallisto/All.tpm.csv",row.names=1)
mycounts <- read.csv("/public/home/zpxu/hulisong/results/kallisto/All.counts.csv",row.names=1)
myfactors <- read.table("/public/home/zpxu/hulisong/scripts/3sleuth_info.txt",header=T)
mydata <- readData(data = mycounts, factors = myfactors)
str(mydata)
head(assayData(mydata)$exprs)
head(pData(mydata))
head(featureData(mydata)@data)

mycountsbio = dat(mydata, factor = NULL, type = "countsbio")
explo.plot(mycountsbio, toplot = 1, samples = 1, plottype = "boxplot")
mysaturation = dat(mydata, k = 0, ndepth = 7, type = "saturation")
explo.plot(mysaturation, toplot = 1, samples = 1:2, yleftlim = NULL, yrightlim = NULL)
explo.plot(mycountsbio, toplot = 1, samples = NULL, plottype = "barplot")
mycd = dat(mydata, type = "cd", norm = FALSE, refColumn = 1)
## Normalization
## Differential expression: no replicates available
myresults <- noiseq(mydata, factor = "condition",conditions = c("H0", "H1"), k = NULL, norm = "n", pnr = 0.2,nss = 5, v = 0.02, lc = 1, replicates = "no")


head(myresults@results[[1]])
mynoiseq.deg = degenes(myresults, q = 0.95, M = NULL)
mynoiseq.deg1 = degenes(myresults, q = 0.95, M = "up")
mynoiseq.deg2 = degenes(myresults, q = 0.95, M = "down")


DE.plot(myresults, q = 0.95, graphic = "expr", log.scale = TRUE)
DE.plot(myresults, q = 0.95, graphic = "MD")
DE.plot(myresults, chromosomes = c(1, 2), log.scale = TRUE, join = FALSE,q = 0.95, graphic = "chrom")

If NOISeq-sim has been used because no replicates are available, then it is preferable to use a higher threshold such as q = 0.9

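5. DESeq2

See "Can I use DESeq2 to analyze a dataset without replicates?" and "Question: Deseq2 for RNAseq experiments without replicates":

For an experiment without replicates, you should just run DESeq() as normal.

Note that this advice applies to older DESeq2 releases. Treating samples as replicates was deprecated in v1.20 and is no longer supported since v1.22, so recent versions stop with an error on a design without replicates.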

library(DESeq2)
raw_count <- read.csv("/public/home/zpxu/hulisong/results/kallisto/All.counts.csv", header=TRUE, row.names=1)
## Round to integers
countdata1 <- round(raw_count)
head(countdata1)
## Drop rows that are all zero
all_zero <- apply(countdata1, 1, function(x) all(x==0) )
newdata <- countdata1[!all_zero,]
head (newdata)
## Keep rows where at least one sample has a count above 200
dat <- newdata[rowSums(newdata > 200) >= 1,]
countdata <- as.matrix(dat)
head(countdata)
## Select the two samples to compare
countData<-countdata[,c("controlH_1","treatH1_1")]
myfactors <- read.table("/public/home/zpxu/hulisong/scripts/count_sleuth_info.txt",header=T)
myfactors
## Select the rows of the sample table that correspond to the samples being compared
colData <- myfactors[c(1,3),]
#colData <- myfactors[which((myfactors$condition=="H0")|(myfactors$condition=="H1")),]
dds <- DESeqDataSetFromMatrix(
       countData = countData,
       colData = colData,
       design = ~ condition)
colData(dds)$condition <- factor(colData(dds)$condition,levels=c("H0","H1"))
dds
dds <- DESeq(dds)
res <- results(dds)
res <- res[order(res$padj),]
head(res)
sum(res$padj < 0.1, na.rm=TRUE)
res05 <- results(dds, alpha=0.05)
summary(res05)
sum(res05$padj < 0.05, na.rm=TRUE)
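One detail of the count filter in the code above is easy to misread: `rowSums(newdata > 200) >= 1` keeps a gene when at least one sample exceeds 200, which is not the same as requiring the row sum to exceed 200. The following Python sketch contrasts the two rules on toy data; the function names and the example counts are made up for illustration.

```python
def keep_any_above(rows, threshold=200):
    """Mirror of rowSums(x > threshold) >= 1: some sample exceeds the threshold."""
    return [name for name, counts in rows.items()
            if any(c > threshold for c in counts)]


def keep_sum_above(rows, threshold=200):
    """Alternative rule: the total across samples exceeds the threshold."""
    return [name for name, counts in rows.items()
            if sum(counts) > threshold]


# Toy counts for two samples per gene
rows = {"geneA": [150, 120], "geneB": [250, 0], "geneC": [10, 5]}
print(keep_any_above(rows))  # geneA is dropped: no single sample is above 200
print(keep_sum_above(rows))  # geneA is kept: 150 + 120 is above 200
```

The first rule is stricter for genes with moderate expression in every sample, so the choice affects which genes reach the test.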

Alternatively, use tximport to load the abundance.tsv files from multiple kallisto runs directly into R.

Importing data with tximport

library("tximport")
samples <- read.table("/public/home/zpxu/hulisong/scripts/count_sleuth_info.txt", header = TRUE)
dir <- "/public/home/zpxu/hulisong/results/kallisto"
files <- file.path(dir, samples$sample, "abundance.tsv")
names(files) <- samples$sample
txi.kallisto.tsv <- tximport(files, type = "kallisto", txOut=TRUE)
head(txi.kallisto.tsv$counts)

Differential expression with DESeq2

library("DESeq2")
head(txi.kallisto.tsv$counts)
raw_count <- txi.kallisto.tsv$counts
countdata1 <- round(raw_count) 
head(countdata1)
## Drop rows that are all zero
all_zero <- apply(countdata1, 1, function(x) all(x==0) )
newdata <- countdata1[!all_zero,]
head (newdata)
## Keep rows where at least one sample has a count above 200
dat <- newdata[rowSums(newdata > 200) >= 1,]
countdata <- as.matrix(dat)
head(countdata)
countData <- countdata[,c("controlH_1", "treatH1_1")]
myfactors <- read.table("/public/home/zpxu/hulisong/scripts/count_sleuth_info.txt",header=T)
myfactors
## Select the rows of the sample table that correspond to the samples being compared
colData <- myfactors[c(1,3),]
dds <- DESeqDataSetFromMatrix(
       countData = countData,
       colData = colData,
       design = ~ condition)
colData(dds)$condition <- factor(colData(dds)$condition,levels=c("H0","H1"))
dds
dds <- DESeq(dds)
res <- results(dds)
res <- res[order(res$padj),]
head(res)

6. DESeq

library( "DESeq" )
raw_count <- read.csv("/public/home/zpxu/hulisong/results/kallisto/All.counts.csv", header=TRUE, row.names=1)
head(raw_count)
myfactors <- read.table("/public/home/zpxu/hulisong/scripts/count_sleuth_info.txt",header=T)
countdata1 <- round(raw_count)
head(countdata1)
all <- apply(countdata1, 1, function(x) all(x==0) )
newdata <- countdata1[!all,]
head (newdata)
countdata <- as.matrix(newdata)
head(countdata)
condition = myfactors$sample
cds = newCountDataSet( countdata, condition )
## Normalisation
cds = estimateSizeFactors(cds)
sizeFactors( cds )
head( counts( cds, normalized=TRUE ) )
## Calling differential expression
cds2 = cds[ ,c( "controlH_1", "treatH1_1" ) ]
## Without biological replicates, use method="blind" and sharingMode="fit-only"
cds2 = estimateDispersions( cds2, method="blind", sharingMode="fit-only" )
## Check which conditions cds2 contains: the second and third arguments of nbinomTest must be conditions present in cds2
dispTable(cds2) 
res = nbinomTest( cds2, "controlH_1", "treatH1_1" )
plotMA(res)
hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")
## Select significantly differentially expressed genes by FDR
resSig = res[ res$padj < 0.05, ]
## Sort by p-value
head( resSig[ order(resSig$pval), ] )
SDE <- resSig[ order(resSig$pval), ]
## The most strongly down-regulated of the significant genes
head( resSig[ order( resSig$foldChange, -resSig$baseMean ), ] )
SDE_down <- resSig[ order( resSig$foldChange, -resSig$baseMean ), ]
## The most strongly up-regulated ones
head( resSig[ order( -resSig$foldChange, -resSig$baseMean ), ] )
SDE_up <- resSig[ order( -resSig$foldChange, -resSig$baseMean ), ]
write.csv( res, file="My Pasilla Analysis Result Table.csv")
addmargins( table(res2_sig = res$padj < 0.05 ) )

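The foldChange and log2FoldChange columns of the output contain Inf and -Inf. According to "Re-extracting refgene names after DESeq Analysis", this happens when a gene has a very large count in sample A and a count of zero in sample B. How can this be avoided? According to "Question: Statistical question (Deseq, Cuffdis) when one condition is zero?", the usual approach is to add a small number to all counts, but the choice of that small number is itself amplified in log2FoldChange: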

A popular strategy to cope with zeros is to add a small number to all counts so that you avoid division by zero and at the same time you don’t bias the results (e.g. 1000:0 is reasonably equivalent to 1001:1). Having said that, this is an issue that bugs me sometime when interpreting fold change ratios since small numbers can have a large effect which is not consistent with the biological interpretation. For example, if you add 1 to all your counts you could get log2(1001/1)= 9.97; if instead you add 0.1(biologically the same, I would argue) you get log2(1000.1/0.1)= 13.29, which is a big difference.
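The arithmetic in the quoted answer is easy to reproduce. The short sketch below recomputes the log2 fold change for counts of 1000 versus 0 under different pseudocounts; the function name `log2_fold_change` is illustrative only.

```python
import math


def log2_fold_change(a, b, pseudocount):
    """log2 ratio of two counts after adding the same pseudocount to both."""
    return math.log2((a + pseudocount) / (b + pseudocount))


# Same raw counts (1000 vs 0), different pseudocounts
for pc in (1, 0.1):
    print(pc, round(log2_fold_change(1000, 0, pc), 2))
# prints:
# 1 9.97
# 0.1 13.29
```

The smaller the pseudocount, the larger the reported fold change for the same data, which is why fold changes involving a zero count should be interpreted with care.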

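For NA values in the results, see "Question: Deseq Infinite In Logfc And Na For P Value" and "DESeq: "NA" generated in the resulted differentially expressed genes". They do not affect the results and can simply be removed👇

mydf[complete.cases(mydf), ]
resSig<-na.omit(resSig)

Other tutorials: RNA-Seq: differential expression using DESeq

Differential expression across multiple comparisons

The Cross_DF.R script below runs the differential expression analysis for different sample pairs.

Usage: Rscript Cross_DF.R controlH_1 treatH1_1 /public/home/zpxu/hulisong/results/123/ 0.05

where:

the first argument (controlH_1) is the control sample;
the second argument (treatH1_1) is the other sample, i.e. controlH_1 is compared with treatH1_1 to find differentially expressed genes;
the third argument (/public/home/zpxu/hulisong/results/123/) is the output directory for the saved CSV and PDF files;
the fourth argument (0.05) is the padj threshold used both for the final selection of differentially expressed genes and for coloring the MA plot.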

arg <- commandArgs(T)
###### Input kallisto results ##########
print("Input kallisto results")
library("tximport")
myfactors <- read.table("/public/home/zpxu/hulisong/scripts/count_sleuth_info.txt", header = TRUE)
dir <- "/public/home/zpxu/hulisong/results/cds_kallisto"
files <- file.path(dir, myfactors$sample, "abundance.tsv")
names(files) <- myfactors$sample
txi.kallisto.tsv <- tximport(files, type = "kallisto", txOut=TRUE)
head(txi.kallisto.tsv$counts)
############## For DEGs ################
library( "DESeq" )
raw_count <- txi.kallisto.tsv$counts
#raw_count <- read.csv("/public/home/zpxu/hulisong/results/kallisto/All.counts.csv", header=TRUE, row.names=1)
head(raw_count)
#myfactors <- read.table("/public/home/zpxu/hulisong/scripts/count_sleuth_info.txt",header=T)
countdata1 <- round(raw_count)
head(countdata1)
all <- apply(countdata1, 1, function(x) all(x==0) )
newdata <- countdata1[!all,]
head (newdata)
countdata <- as.matrix(newdata)
head(countdata)
condition = myfactors$sample
cds = newCountDataSet( countdata, condition )
# Normalisation
print("Normalisation All Data")
cds = estimateSizeFactors(cds)
sizeFactors( cds )
head( counts( cds, normalized=TRUE ) )
# Select sample and do Variance estimation
print("Select sample and do Variance estimation")
cds2 = cds[ ,c(arg[1], arg[2]) ]
cds2 = estimateDispersions( cds2, method="blind", sharingMode="fit-only",fitType = "local" )
outdir <- arg[3]
pdf(file=paste0(outdir,arg[1],"_Vs_",arg[2],".pdf"))
print("Plot the per-gene estimates against the mean normalized counts per gene and overlay the fitted curve")
plotDispEsts(cds2)
#### Calling differential expression ###########
print("Calling differential expression")
res = nbinomTest( cds2, arg[1], arg[2] )
padj_cutoff <- as.numeric(arg[4])  # command-line arguments arrive as strings
plotMA(res,col = ifelse(res$padj>=padj_cutoff, "gray32", "red3"), main=paste0(arg[1],"_Vs_",arg[2]))
hist(res$pval, breaks=100, col="skyblue", border="slateblue", main=paste0(arg[1],"_Vs_",arg[2]))
resSig = res[ res$padj < padj_cutoff, ]
N_resSig <- resSig[complete.cases(resSig), ]
write.csv( N_resSig, paste0(outdir,arg[1],"_Vs_",arg[2],".csv"),row.names = FALSE)
############### The following steps are optional and can be skipped #####
############### Variance stabilizing transformation ##########
print("Variance stabilizing transformation")
vsd = varianceStabilizingTransformation( cds2 )
library("vsn")
notAllZero = (rowSums(counts(cds))>0)
meanSdPlot(log2(counts(cds)[notAllZero, ] + 1))
meanSdPlot(vsd[notAllZero, ])
########### Data quality assessment by sample clustering and visualisation ####
print("Data quality assessment by sample clustering and visualisation")
print("Heatmap of the count table")
vsdFull = varianceStabilizingTransformation( cds2 )
library("RColorBrewer")
library("gplots")
select = order(rowMeans(counts(cds2)), decreasing=TRUE)[1:30]
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))
print("Heatmap of the sample-to-sample distances")
dists = dist( t( exprs(vsdFull) ) )
mat = as.matrix( dists )
rownames(mat) = colnames(mat) = with(pData(cds2), paste(condition, condition, sep=" : "))
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
print("Principal component plot of the samples")
print(plotPCA(vsdFull, intgroup=c("condition", "condition")))
dev.off()
print(paste0("Run Complete. Please Check The Results in Directory: ",outdir))

Running several comparisons in batch👇

for i in treatR1_1 treatR2_1 treatR3_1 treatR4_1 treatR5_1
do
Rscript Cross_DF.R controlR_1 ${i} /public/home/zpxu/hulisong/results/123/ 0.05
done

7. LPEseq

install.packages("devtools")
library(devtools)
install_github("iedenkim/LPEseq")
library(LPEseq)
TPM <- read.csv("/public/home/zpxu/hulisong/results/kallisto/All.tpm.csv",header = T,row.names=1)
NewTPM <- TPM[rowSums(TPM==0)==0,]
head(NewTPM)
dim(NewTPM)
TPM.norm <- log(NewTPM, base = 2)
head(TPM.norm)
TPM.result.norep <- LPEseq.test(TPM.norm[,1], TPM.norm[,4])
head(TPM.result.norep)
TPM.result.norep.Sig = TPM.result.norep[TPM.result.norep$q.value < 0.05, ]
head(TPM.result.norep.Sig)
write.table(TPM.result.norep.Sig, file="result_file.txt", quote=F, sep="\t")

8. GFOLD

GFOLD identifies differentially expressed genes from genome-aligned BAM files.

Example 1: Count reads and rank genes

In the following example, hg19Ref.gtf is the ucsc knownGene table in GTF format for hg19; sample1.sam and sample2.sam are the mapped reads in SAM format.

gfold count -ann hg19Ref.gtf -tag sample1.sam -o sample1.read_cnt
gfold count -ann hg19Ref.gtf -tag sample2.sam -o sample2.read_cnt
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 2: Count reads

This example utilizes samtools to produce mapped reads in SAM format from BAM format.

samtools view sample1.bam | gfold count -ann hg19Ref.gtf -tag stdin -o sample1.read_cnt

Example 3: Identify differentially expressed genes without replicates

Suppose there are two samples: sample1 and sample2 with corresponding read count file being sample1.read_cnt sample2.read_cnt. This example finds differentially expressed genes using default parameters on two samples

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

This example finds differentially expressed genes using default parameters on two group of samples.

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff

Example 5: Identify differentially expressed genes with replicates only in one condition

This example finds differentially expressed genes using default parameters on two group of samples. Only the first group contains replicates. In this case, the variance estimated based on the first group will be used as the variance of the second group.

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff

PacBio genome assembly with MARVEL

2018-05-01T13:13:19.000Z

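1. Introduction

MARVEL assembles genomes from PacBio third-generation sequencing data and is aimed at low sequencing depth.

2. Installation

Project home page: The MARVEL Assembler
After cloning, the directory contains no configure file, only configure.ac. How should it be installed?

2.1. Loading the required environment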

source /public/home/software/.bashrc
module load Autoconf/2.69
## added python3.6
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/public/home/zpxu/miniconda3/lib"

Check versions

autoreconf --version
python --version

Then run autoreconf to generate the configure file.

2.2. Installation

## Note: this directory is different from the git-cloned MARVEL directory
./configure --prefix=/public/home/cotton/software/MARVEL-bin
make
make install

After make install, the following message is shown:

---------------------------------------------------------------
Installation into /public/home/cotton/software/MARVEL-bin finished.
Don't forget to include /public/home/cotton/software/MARVEL-bin/lib.python in your PYTHONPATH.
---------------------------------------------------------------

2.3. Adding to PYTHONPATH

See: Permanently add a directory to PYTHONPATH

export PYTHONPATH="$PYTHONPATH:/public/home/cotton/software/MARVEL-bin/lib.python"

3. Usage

The overall process is as follows:

overlap
patch reads
overlap (again)
scrubbing
assembly graph construction and touring
optional read correction
fasta file creation

In practice a run consists of two phases, the READ PATCHING PHASE and the ASSEMBLY PHASE, i.e. read processing (DBprepare.py) and assembly (do.py).

3.1. Initializing the database: DBprepare.py

/public/home/cotton/software/MARVEL-bin/scripts/DBprepare.py ECOL p6.25x.fasta

DBprepare.py is a wrapper Python script that runs about four MARVEL programs (FA2db, DBsplit, DBdust and HPCdaligner).

When it finishes, it produces ECOL.db, two hidden files (.ECOL.idx and .ECOL.bps), and two files with the .plan suffix.

3.2. Assembly: do.py

#!/usr/bin/env python

import multiprocessing
import marvel
import marvel.config
import marvel.queue

###### settings

DB         = "ECOL"                         ## database name; change it to match the name used with DBprepare.py
COVERAGE   = 25                             ## coverage of the dataset; change as needed
DB_FIX     = DB + "_FIX"                    ## name of the database containing the patched reads
PARALLEL   = multiprocessing.cpu_count()    ## number of available processors

###### patch raw reads

q = marvel.queue.queue(DB, COVERAGE, PARALLEL)

###### run daligner to create initial overlaps
q.plan("{db}.dalign.plan")

###### run LAmerge to merge overlap blocks
q.plan("{db}.merge.plan")

## create quality and trim annotation (tracks) for each overlap block
q.block("{path}/LAq -b {block} {db} {db}.{block}.las")
## merge quality and trim tracks
q.single("{path}/TKmerge -d {db} q")
q.single("{path}/TKmerge -d {db} trim")

## run LAfix to patch reads based on overlaps
q.block("{path}/LAfix -g -1 {db} {db}.{block}.las {db}.{block}.fixed.fasta")
## join all fixed fasta files
q.single("!cat {db}.*.fixed.fasta > {db}.fixed.fasta")

## create a new Database of fixed reads (-j numOfThreads, -g genome size)
q.single("{path_scripts}/DBprepare.py -s 50 -r 2 -j 4 -g 4600000 {db_fixed} {db}.fixed.fasta", db_fixed = DB_FIX)

## run the commands
q.process()

########## assemble patched reads

q = marvel.queue.queue(DB_FIX, COVERAGE, PARALLEL)

###### run daligner to create overlaps
q.plan("{db}.dalign.plan")
###### run LAmerge to merge overlap blocks
q.plan("{db}.merge.plan")

###### start scrubbing pipeline

########## for larger genomes (> 100MB) LAstitch can be run with the -L option (preload reads)
########## with the -L option two passes over the overlap files are performed:
########## first to buffer all reads and a second time to stitch them
########## otherwise the random file access can make LAstitch pretty slow.
########## Another option would be, to keep the whole db in cache (/dev/shm/)
q.block("{path}/LAstitch -f 50 {db} {db}.{block}.las {db}.{block}.stitch.las")

########## create quality and trim annotation (tracks) for each overlap block and merge them
q.block("{path}/LAq -s 5 -T trim0 -b {block} {db} {db}.{block}.stitch.las")
q.single("{path}/TKmerge -d {db} q")
q.single("{path}/TKmerge -d {db} trim0")

########## create a repeat annotation (tracks) for each overlap block and merge them
q.block("{path}/LArepeat -c {coverage} -b {block} {db} {db}.{block}.stitch.las")
q.single("{path}/TKmerge -d {db} repeats")

########## detects "borders" in overlaps due to bad regions within reads that were not detected
########## in LAfix. Those regions can be quite big (several Kb). If gaps within a read are
########## detected LAgap chooses the longer part of the read as valid range. The other part(s) are
########## discarded
########## option -L (see LAstitch) is also available
q.block("{path}/LAgap -t trim0 {db} {db}.{block}.stitch.las {db}.{block}.gap.las")

########## create a new trim1 track, (trim0 is kept)
q.block("{path}/LAq -s 5 -u -t trim0 -T trim1 -b {block} {db} {db}.{block}.gap.las")
q.single("{path}/TKmerge -d {db} trim1")

########## based on different filter criteria filter out: local-alignments, repeat induced-alignments
########## previously discarded alignments, ....
########## -r repeats, -t trim1 ... use repeats and trim1 track
########## -n 500  ... overlaps must be anchored by at least 500 bases (non-repeats)
########## -u 0    ... overlaps with unaligned bases according to the trim1 interval are discarded
########## -o 2000 ... overlaps shorter than 2k bases are discarded
########## -p      ... purge overlaps, overlaps are not written into the output file
########## option -L (see LAstitch) is also available
q.block("{path}/LAfilter -n 300 -r repeats -t trim1 -T -o 2000 -u 0 {db} {db}.{block}.gap.las {db}.{block}.filtered.las")

########## merge all filtered overlap files into one overlap file
q.single("{path}/LAmerge -S filtered {db} {db}.filtered.las")

########## create overlap graph
q.single("{path}/OGbuild -t trim1 {db} {db}.filtered.las {db}.graphml")

########## tour the overlap graph and create contigs paths
q.single("{path_scripts}/OGtour.py -c {db} {db}.graphml")

q.single("{path}/LAcorrect -j 4 -r {db}.tour.rids {db} {db}.filtered.las {db}.corrected")
q.single("{path}/FA2db -c {db}_CORRECTED [expand:{db}.corrected.*.fasta]")

########## create contig fasta files
q.single("{path_scripts}/tour2fasta.py -c {db}_CORRECTED -t trim1 {db} {db}.tour.graphml {db}.tour.paths")

###### optional: create a layout of the overlap graph which can viewed in a Browser (svg) or Gephi (dot)
## q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.svg")
q.single("{path}/OGlayout -R {db}.tour.graphml {db}.tour.layout.dot")

q.process()

4. Worked examples

4.1. Step-by-step run

The assembly procedure from "The axolotl genome and the evolution of key tissue formation regulators", taken from MARVEL/examples/axolotl/README on the MARVEL GitHub page (database initialization includes the Fix patching step).

This file contains the steps we took to assemble the axolotl genome.


########## READ PATCHING PHASE

## create initial database
FA2db -v -x 2000 -b AXOLOTL -f axoFiles.txt

## split database into blocks
DBsplit AXOLOTL

## create dust track
DBdust AXOLOTL

## create daligner and merge plans, replace SERVER and PORT
HPCdaligner -v -t 100 -D SERVER:PORT -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL
## The DBprepare.py script can run the four steps above
## start dynamic repeat masking server on SERVER:PORT, replace PORT (optional)
DMserver -t 16 -p PORT -C AXOLOTL 40 AXOLOTL_CP
## AXOLOTL.dalign.plan and AXOLOTL.merge.plan are produced by DBprepare.py; they contain the daligner and LAmerge commands to run
## run daligner to create initial overlaps
AXOLOTL.dalign.plan

## after all daligner jobs are finished the dynamic repeat masker has to be shut down (optional)
DMctl -h HOST -p PORT shutdown

###### run LAmerge to merge overlap blocks
AXOLOTL.merge.plan
##############################################################################################################################################################################
## create quality and trim annotation (tracks) for each overlap block for each database block
LAq -b <block> AXOLOTL AXOLOTL.<block>.las

TKmerge -d AXOLOTL q
TKmerge -d AXOLOTL trim

## run LAfix to patch reads based on overlaps for each database block
LAfix -c -x 2000 AXOLOTL AXOLOTL.<block>.las AXOLOTL.<block>.fixed.fasta


########## ASSEMBLY PHASE

## create a new database with the fixed reads
FA2db -v -x 2000 -c AXOLOTL_FIX AXOLOTL.*.fixed.fasta

## split database into blocks
DBsplit AXOLOTL_FIX

## combine repeat tracks maskr and maskc that were created during read patching phase
TKcombine AXOLOTL_FIX mask maskr maskc

## create daligner and merge plans, replace SERVER and PORT
HPCdaligner -v -t 100 -D SERVER:PORT -m mask -r2 -j16 --dal 32 --mrg 32 -o AXOLOTL_FIX AXOLOTL_FIX

## start dynamic repeat masking server on SERVER:PORT, replace PORT
DMserver -t 16 -p PORT -C AXOLOTL_FIX 40 AXOLOTL_CP

###### run daligner to create overlaps
AXOLOTL_FIX.dalign.plan

## after all daligner jobs are finished the dynamic repeat masker has to be shut down
DMctl -h HOST -p PORT shutdown

###### run LAmerge to merge overlap blocks
AXOLOTL_FIX.merge.plan

###### SCRUBBING PHASE

## repair alignments that prematurely stopped due to left-over errors in the reads for each database block
LAstitch -f 50 AXOLOTL_FIX AXOLOTL_FIX.<block>.las AXOLOTL_FIX.<block>.stitch.las

## create quality and trim annotation for each database block
LAq -T trim0 -s 5 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

TKmerge -d AXOLOTL_FIX q
TKmerge -d AXOLOTL_FIX trim0

## create a repeat annotation for each database block
LArepeat -c <coverage> -l 1.5 -h 2.0 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las

TKmerge -d AXOLOTL_FIX repeats

## merge duplicate & overlapping annotation repeat annotation and masking server output
TKcombine {db} frepeats repeats maskc maskr

## remove gaps (ie. due to chimeric reads, ...) for each database block
LAgap -s 100 -t trim0 AXOLOTL_FIX AXOLOTL_FIX.<block>.stitch.las AXOLOTL_FIX.<block>.gap.las

## recalculate the trim track based on the cleaned up gaps for each database block
LAq -u -t trim0 -T trim1 -b <block> AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las

TKmerge -d AXOLOTL_FIX trim1

## filter repeat induced alignments and try to resolve repeat modules for each database block
LAfilter -p -s 100 -n 300 -r frepeats -t trim1 -o 1000 -u 0 AXOLOTL_FIX AXOLOTL_FIX.<block>.gap.las AXOLOTL_FIX.<block>.filtered.las

## not much is left now, so we can merge everything into a single las file
LAmerge -S filtered AXOLOTL_FIX AXOLOTL_FIX.filtered.las

## create overlap graph
mkdir -p components
OGbuild -t trim1 -s -c 1 AXOLOTL_FIX AXOLOTL_FIX.filtered.las components/AXOLOTL_FIX

## tour the overlap graph and create contigs paths for each components/*.graphml
OGtour.py -c AXOLOTL_FIX 

## create contig fasta files for each components/*.paths
tour2fasta.py -t trim1 AXOLOTL_FIX

4.2. Running everything in one go

source /public/home/software/.bashrc
export PATH="/public/home/cotton/software/MARVEL-bin/bin:$PATH"
export PATH="/public/home/zpxu/miniconda3/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/public/home/zpxu/miniconda3/lib"

cd /public/home/zpxu/AS/results/MARVEL
fa="/public/home/zpxu/AS/PacBio/bmk_nh_filtered_subreads.fasta"

echo "1. Initialize the database"
DBprepare.py AS ${fa}
cd /public/home/zpxu/AS/results/MARVEL
##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.dalign.plan
##sed -i "s/^/\/public\/home\/cotton\/software\/MARVEL-bin\/bin\//g" /public/home/zpxu/AS/results/MARVEL/AS.merge.plan
python /public/home/zpxu/AS/scripts/MARVEL_do.py

5. Parameters that need explanation

5.1. Thread count

The -j option sets the number of threads and must be a power of two.
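Since the thread count has to be a power of two, it can help to derive it from the number of available cores instead of hard-coding it. The helper below is a small sketch of that idea; the function names are illustrative and not part of MARVEL.

```python
def is_power_of_two(n):
    """True if n is 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def largest_power_of_two_at_most(n):
    """Round a core count down to the nearest power of two."""
    if n < 1:
        raise ValueError("need at least one thread")
    return 1 << (n.bit_length() - 1)


# A node with 12 cores would be given 8 threads
print(is_power_of_two(12), largest_power_of_two_at_most(12))
```

For example, `largest_power_of_two_at_most(multiprocessing.cpu_count())` would give a value that is safe to pass to -j on any node.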

5.2. HPCdaligner 的 -D SERVER:PORT

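See the explanation in "HPCdaligner parameter <-D host:port> #7": this option is needed when assembling repeat-rich genomes. Otherwise neither this option nor the subsequent DMserver step is required, and the following is enough:

HPCdaligner -v -t 100 -r1 -j16 --dal 32 --mrg 32 -o AXOLOTL AXOLOTL

5.3. DMserver: -p PORT and expected.coverage

The main purpose of the dynamic repeat masking server is to reduce the CPU and memory requirements of assembling large genomes: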

While preliminary calculations for computational time and storage space estimated over multiple millions of CPU hours and >2 PB of storage for one daligner run, the usage of a dynamic repeat masking server (below) reduced this dramatically to 150,000 CPU hours and 120 Tb of storage space for the complete pipeline.

Main components of Hi-C analysis

2018-05-01T11:23:42.000Z

0. Background

Mammalian genomes are spatially organized into compartments, topologically associating domains (TADs), and loops to facilitate gene regulation and other chromosomal functions.

3D interactions mostly occur within chromosomes (cis) rather than between chromosomes (trans), all methods detected more cis than trans interactions.
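The cis/trans distinction is simple to compute from a list of contacts. The sketch below assumes each contact is a tuple of two genomic positions, `(chrom1, pos1, chrom2, pos2)`; that tuple layout and the toy data are assumptions for illustration, not the output format of any particular tool.

```python
def count_cis_trans(contacts):
    """Count contacts within one chromosome (cis) and between chromosomes (trans)."""
    cis = sum(1 for chrom1, _, chrom2, _ in contacts if chrom1 == chrom2)
    return cis, len(contacts) - cis


# Toy contact list: three cis contacts and one trans contact
contacts = [
    ("chr1", 10_000, "chr1", 250_000),
    ("chr1", 40_000, "chr2", 90_000),
    ("chr2", 5_000, "chr2", 45_000),
    ("chr3", 1_000, "chr3", 2_000_000),
]
cis, trans = count_cis_trans(contacts)
print(cis, trans)
```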

1. Identifying chromatin interactions

Chromatin interactions are contacts between regions far from each other on the linear DNA sequence but close in 3D space.

Function	Method	Availability	Programming language
Chromatin interactions	Fit-Hi-C	http://noble.gs.washington.edu/proj/fit-hi-c	Python
Chromatin interactions	GOTHiC	http://bioconductor.org/packages/release/bioc/html/GOTHiC.html	R
Chromatin interactions	HOMER	http://homer.ucsd.edu/homer/interactions/HiCmatrices.html	Perl, R
Chromatin interactions	HIPPIE	http://wanglab.pcbi.upenn.edu/hippie/	Python, Perl, R
Chromatin interactions	diffHic	https://bioconductor.org/packages/release/bioc/html/diffHic.html	R, Python
Chromatin interactions	HiCCUPS	https://github.com/theaidenlab/juicer/wiki/Download	Java
TADs	HiCseg	https://cran.r-project.org/web/packages/HiCseg/index.html	R
TADs	TADbit	https://github.com/3DGenomes/TADbit	Python
TADs	DomainCaller	http://chromosome.sdsc.edu/mouse/hi-c/download.html	Matlab, Perl
TADs	InsulationScore	https://github.com/dekkerlab/crane-nature-2015	Perl
TADs	Arrowhead	https://github.com/theaidenlab/juicer/wiki/Download	Java
TADs	TADtree	http://compbio.cs.brown.edu/projects/tadtree/	Python
TADs	Armatus	https://github.com/kingsfordgroup/armatus	C++

[The table above is from: Comparison of computational methods for Hi-C data analysis]

The total number of interactions called by each method increased with the number of reads retained by the filtering step for all tools at any resolution, although the rate of increase varied from tool to tool.

2. Topologically Associating Domains (TADs)

TADs are structural domains consisting of chromatin regions that are highly self-interacting but have limited interaction with regions in other domains;

TADtree: an algorithm for the identification of hierarchical topological domains in Hi-C data

Arrowhead: for finding contact domains

Used bin size (resolution) of at least 40 kb for TAD calling;

cd /public/home/cotton/software/juicer/data
java -jar ../CPU/juicer_tools.jar arrowhead -r 40000 -k NONE test.hic test_contact_domains_list

3. Compartments

CscoreTool: fast Hi-C compartment analysis at high resolution
Eigenvector: used to delineate compartments in Hi-C data at coarse resolution

The genome-wide chromosome conformation capture (Hi-C) has revealed that the eukaryotic genome can be partitioned into A and B compartments that have distinctive chromatin and transcription features.

The current method for calculating A/B compartments is based on the Principal Component Analysis (PCA) of the normalized Hi-C interaction matrix (Lieberman-Aiden et al., 2009). The first eigenvector (Principal Component 1, PC1) of the correlation matrix is then defined as the compartment score, and genomic windows with positive or negative compartment scores are defined as A or B compartment, respectively.
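The idea described above can be illustrated on a toy matrix. The following is a simplified sketch in plain Python: it builds the Pearson correlation matrix of the rows of a small contact matrix and finds its leading eigenvector by power iteration. It skips the normalization and observed/expected steps of a real pipeline, assumes no row is constant, and all function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def compartment_scores(matrix, iterations=200):
    """Leading eigenvector of the row-correlation matrix, via power iteration."""
    corr = [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]
    vec = corr[0][:]  # start from the first row of the correlation matrix
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in corr]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


def call_compartments(scores):
    """Positive score -> 'A', otherwise 'B' (the overall sign is arbitrary)."""
    return ["A" if s > 0 else "B" for s in scores]


# Toy 4-bin contact matrix: bins 0-1 interact strongly, as do bins 2-3
toy = [
    [10, 8, 1, 2],
    [8, 10, 2, 1],
    [1, 2, 10, 8],
    [2, 1, 8, 10],
]
scores = compartment_scores(toy)
print(call_compartments(scores))
```

Because the sign of an eigenvector is arbitrary, the A and B labels from such a calculation only separate the two groups of bins; deciding which group is the active compartment needs additional information.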

cd /public/home/cotton/software/juicer/data
java -jar ../CPU/juicer_tools.jar eigenvector KR test.hic 1 BP 1000000 > test.Compartments

The .hic file comes from the 3D-DNA pipeline or from java -Xmx2g -jar juicebox_tools.jar pre.

4. Chromatin loops

Chromatin loops in gene regulation
CTCF-Mediated Chromatin Loops between Promoter and Gene Body Regulate Alternative Splicing across Individuals
Topologically associating domains and chromatin loops depend on cohesin and are regulated by CTCF, WAPL, and PDS5 proteins

HiCCUPS is an algorithm for finding chromatin loops.

The HiCCUPS algorithm is included in the juicer software and can be run manually on its own as follows (use a GPU node if one is available):

java -Xmx2g -jar ./CPU/juicer_tools.jar hiccups -h
java -Xmx2g -jar ./CPU/juicer_tools.jar hiccups /public/home/cotton/software/3d-dna/xzp/Hs1.split.hic all_hiccups_loops

java -Xmx2g -jar ./CPU/juicer_tools.jar hiccups -r 40000 -p 1 -i 3 -f 0.1 -d 80000 --ignore_sparsity /public/home/cotton/software/3d-dna/xzp/Hs1.split.hic hiccups_40kb

V. Software performance comparison

Filtering step

HiCCUPS retained the largest number of aligned reads, although it is worth noting that HiCCUPS filters only PCR duplicates without discarding other potential artifact reads.
diffHic filtered the highest proportion of aligned reads in most data sets (from 27% to 94%, depending on the data set); but, given its higher alignment rate, still retained a large number of reads.

Identification of chromatin interactions

GOTHiC called the highest number of cis interactions;
diffHic found the largest number of trans interactions;
HiCCUPS, which aggregates nearby peaks into a single interaction, identified fewer interactions than all other tools.
For interaction callers, HOMER and HiCCUPS yielded the highest proportion of interactions with a potential biological significance—although the potential of HiCCUPS could be fully exploited only in the analysis of very high-resolution data sets.

Distance between the interacting points in cis

GOTHiC found interactions at shorter mean distance at both 5- and 40-kb resolutions;
At 5 kb, Fit-Hi-C called interactions at an average distance of more than 10 Mb; which was expected, as Fit-Hi-C is designed to call midrange interactions.
At low resolution, GOTHiC had the highest concordance, most likely because it called a large number of short-range interactions in every sample replicate.
At high resolution, the interactions found by HiCCUPS were the most conserved among replicates.
At 5kb resolution, HiCCUPS and HOMER called the highest proportion of promoter–enhancer interactions, although not the highest absolute number.

Accuracy and sensitivity of cis interactions

GOTHiC recovered the largest number of true-positive interactions. HOMER and Fit-Hi-C performed comparably to GOTHiC, although they called a smaller number of total interactions.
In high-resolution data sets, diffHic recalled the highest number of true positives, although HOMER identified more true positives than any other tool at comparable numbers of called interactions.
The highest sensitivity was achieved by Fit-Hi-C.

Identification of topologically associating domains

The number of TADs did not increase with the number of reads retained after filtering for all tools, with the exception of Arrowhead.
At 40-kb resolution, TADtree called the largest (7,638) and Arrowhead the smallest (636) number of TADs. Conversely, at 1-Mb resolution, InsulationScore returned the largest number of TADs.
Note that some methods (HiCseg, TADbit, InsulationScore) partition chromosomes in a continuous set of TADs, whereas the others allow gaps between TADs. Arrowhead and TADtree, which adopt multiscale approaches, returned nested TADs.
TADs identified by HiCseg were also the most reproducible when using the overlap coefficient.
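
For reference, the overlap coefficient of two sets is |A ∩ B| / min(|A|, |B|). The sketch below applies it to TAD boundary positions from two replicates. How the cited comparison actually matched TADs between replicates is not described here, so comparing exact boundary coordinates is an assumption, and the coordinates are invented.

```python
def overlap_coefficient(set_a, set_b):
    """Overlap coefficient: shared elements divided by the smaller set size."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))

# Hypothetical TAD boundary positions (bp) called in two replicates
replicate1 = {0, 400_000, 1_000_000, 1_600_000}
replicate2 = {0, 440_000, 1_000_000, 1_600_000, 2_000_000}

score = overlap_coefficient(replicate1, replicate2)
print(score)  # 3 shared boundaries / 4 boundaries in the smaller set = 0.75
```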

VI. Analysis workflow

An essential beginner's tutorial for learning IGV

2018-02-05T08:06:54.000Z

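The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactively exploring integrated genomic data. It supports many data types, which makes it an essential tool for bioinformaticians. The official website provides detailed documentation; this post only collects the parts I currently need, and I will add more when time allows.

1. Preparing input data

IGV can import many types of data (see the data import section below). Here we mainly deal with sorted alignment files: the output of bowtie2/BWA + samtools (samtools view > samtools sort > samtools index), or TopHat results for RNA-seq;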

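2. Main window

2.1 Basic main window

Toolbar;
The red box marks the region of the current chromosome that is displayed;
The ruler shows the coordinates of the current position;
Tracks area, i.e. the alignment track area: the main information area, which typically shows methylation, gene expression, copy number, loss of heterozygosity (LOH), mutations and similar data, with three display modes: Collapsed, Squished and Expanded;
Feature display area: thick blue bars are exons, thin lines are introns, and blank space is intergenic region;
Track names, i.e. the names of the loaded alignment files;
Attribute panel;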

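2.2 Reading the result view

(1) Type the chromosome/contig/scaffold ID you want to view here and press Enter;
(2) The nucleotide sequence of the reference, with the four nucleotides (A, C, G, T) shown in different colors. Below it is the corresponding amino acid translation: methionine (M) is shown in green and stop codons (*) as red asterisks. This area is displayed only when the view is zoomed in far enough (see the ruler at the top right);
(3) The differently colored bars reflect the sort order; right-click here to choose a different coloring scheme. The sequence and alignment details of each bar can be copied from the right-click menu. Each bar is a read made up of a series of nucleotides, and aligned reads can also be displayed as pairs;
(4) Hovering here shows the base counts at this position. For example, if the bar is red and blue (red is T, blue is C) and the red part is larger than the blue part, more of the reads aligned to this position carry T than C, so C may be a mutation. When the loaded data is a BAM alignment file, this region is the Coverage Track.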
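
The per-position base statistics described in (4) amount to counting the bases of all reads covering one position. A minimal sketch, assuming the reads have already been piled up into a string of bases for that position. The 80% agreement threshold and the base colors are the ones quoted later in this tutorial; the function name and the example pileup are hypothetical.

```python
from collections import Counter

# Base colors as listed in this tutorial
BASE_COLORS = {"A": "green", "C": "blue", "G": "orange", "T": "red"}

def summarize_position(bases, ref_base, match_threshold=0.8):
    """Return per-base fractions and whether the coverage bar would be colored.

    bases: one character per read covering the position (hypothetical pileup).
    The bar stays gray when more than match_threshold of reads match ref_base.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    fractions = {b: counts[b] / total for b in "ACGT" if counts[b]}
    colored = counts[ref_base] / total <= match_threshold
    return fractions, colored

# 7 reads carry T and 3 carry C at a position whose reference base is T
fractions, colored = summarize_position("TTTTTTTCCC", ref_base="T")
for base, fraction in fractions.items():
    print(base, BASE_COLORS[base], fraction)
print("colored bar:", colored)
```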

3. Importing data

When data is loaded, IGV uses the file extension to determine the file format, then the data type, and finally the default track display options, as shown below (all of these defaults can be changed):

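4. Viewing alignments

Alignment display parameters can be set in the View >> Preferences >> Alignments panel;
When no Color alignments by option is applied in the track area, alignments appear only as light gray or white bars; the white ones have a mapping quality equal to zero;
Insertions: shown as a purple I, or a red I when the number of inserted bases exceeds the preset threshold; hover over it to see the inserted bases;
Deletions: shown as a black bar;
Sort alignments by sorts the track area; choose Re-pack alignments to return to the original layout;
By default the alignment track shows individual reads packed compactly (left panel); View as pairs displays them as pairs joined by a thin line (right panel);

In the left panel, hold Ctrl and left-click a bar (a read) to show it and its paired mate in the same color. Black indicates that there is no paired mate. With a read selected, right-click and choose Go to Mate to jump to its paired mate. If the paired reads have a large insert size, the paired mate will not be highlighted. Right-click and choose Clear Selections to clear all selected reads.

Reads are also drawn in different colors (blue: insert size smaller than expected; red: insert size larger than expected; green, teal, dark blue: inversion, duplication and translocation events). For details see Interpreting Color by Insert Size and Interpreting Color by Pair Orientation.

At low resolution, choosing Color alignments by >> insert size and pair orientation in the alignment track colors the aligned reads (Red have larger than expected inferred sizes, and therefore indicate possible deletions; Blue have smaller than expected inferred sizes, and therefore indicate insertions; solid gray marks reads with high mapping quality, while hollow gray marks reads that can also be aligned to other loci). At high resolution the display resolves the base at each position: when more than 80% of the aligned reads agree with the reference genome the position is shown in gray; otherwise the bases are colored red for T, blue for C, green for A and orange for G. Translocations on the same chromosome can be detected by color-coding for pair orientation, whereas translocations between two chromosomes can be detected by coloring by insert size.
For paired-end alignment tracks (View as pairs), right-click and choose View mate region in split screen to show the mate region in a separate panel; several panels can be opened. To return to a single panel, right-click at ① in the figure below and choose Switch to standard view, or double-click with the left mouse button;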
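
The insert-size coloring rule described above reduces to a comparison against an expected range. A minimal sketch, assuming the expected range is supplied by the user; the bounds in the example are hypothetical, and how IGV itself derives its thresholds is not covered here.

```python
def insert_size_color(insert_size, expected_min, expected_max):
    """Map an inferred insert size to the color scheme described in the text."""
    if insert_size > expected_max:
        return "red"   # larger than expected: possible deletion
    if insert_size < expected_min:
        return "blue"  # smaller than expected: possible insertion
    return "gray"      # within the expected range

# Hypothetical library with an expected insert size of 200-600 bp
for size in (150, 350, 2500):
    print(size, insert_size_color(size, expected_min=200, expected_max=600))
```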

5. Viewing alternative splicing

Load junctions data in the standard .bed format (for example TopHat's "junctions.bed" output file);

|-- accepted_hits.bam
|-- accepted_hits.bam.bai
|-- deletions.bed
|-- insertions.bed
|-- junctions.bed
|-- unmapped.bam
`-- unmapped.bam.bai

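Load the BAM file of RNA-seq reads aligned to the genome

Legend:

Red arcs drawn upward mark splice junctions on the plus strand; blue arcs drawn downward mark the minus strand;
Arc height and thickness are proportional to read coverage;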
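
To see where the strand and coverage behind each arc come from, one junctions.bed record can be parsed by hand. The sketch assumes the 12-column BED layout written by TopHat, in which the score column holds the number of reads spanning the junction and the two blocks are the flanking anchor regions. The example record is made up.

```python
def parse_junction(line):
    """Parse one BED12 junction record into the values an arc is drawn from."""
    fields = line.rstrip("\n").split("\t")
    start = int(fields[1])
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    strand = fields[5]
    return {
        "chrom": fields[0],
        "intron_start": start + block_sizes[0],   # end of the left anchor
        "intron_end": start + block_starts[1],    # start of the right anchor
        "reads": int(fields[4]),                  # drives arc height/thickness
        "arc": "red, above" if strand == "+" else "blue, below",
    }

# Hypothetical record: 12 reads support a plus-strand junction on chr1
record = "chr1\t100\t500\tJUNC00000001\t12\t+\t100\t500\t255,0,0\t2\t50,60\t0,340"
junction = parse_junction(record)
print(junction)
```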

6. Viewing variants

6.1 Mutation files: MAF (mutation annotation format) and MUT (mutation) files;

6.2 VCF Files

	Each bar across the top of the plot shows the allele fraction for a single locus.
	The genotypes for each locus in each sample. Dark blue = heterozygous, Cyan = homozygous variant, Grey = reference. Filtered entries are transparent.

7. References

IGV application tutorial

A Beginner’s Guide to Learn R Programming

2017-12-15T01:57:52.000Z

Author: hope @Huazhong Agricultural University

I. Data manipulation

Loops

library(tibble)
set.seed(7)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
  )
median(df$a)
median(df$b)
median(df$c)
median(df$d)
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
  }
output
apply(df,2,median)

Data transformation: cleaning and tidying

Loading the data

library(nycflights13)
library(tidyverse)
head(flights)
unique(flights$month)

1.1 Filtering: filter()

(jan1 <- filter(flights, month == 1, day == 1))

1.2 Sorting: arrange()

arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))

1.3 Selecting: select()

select(flights, year, month, day)

1.4 Transforming: mutate()

flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
)

Newly added columns can be used in subsequent calculations

mutate(flights_sml,
       gain = arr_delay - dep_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
)

Keep only the transformed columns

transmute(flights,
          gain = arr_delay - dep_delay,
          hours = air_time / 60,
          gain_per_hour = gain / hours
)

1.5 Summarising: summarise()

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

1.6 Grouping: group_by()

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
filter(flights, year == 2013, month == 1, day == 1)

1.7 Pipe operator (%>%) and plotting

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
library(ggplot2)
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

Reshaping Data

tibble data

library(tibble)
friends_data <- tibble(  # data_frame() is deprecated in favour of tibble()
  name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
  age = c(27, 25, 29, 26),
  height = c(180, 170, 185, 169),
  married = c(TRUE, FALSE, TRUE, TRUE)
)
# Print
friends_data

Differences between a tibble and a regular data frame

data("iris")
class(iris)
my_data <- as_tibble(iris)  # as_data_frame() is deprecated in favour of as_tibble()
class(my_data)
my_data2 <- as.data.frame(my_data)

Loading the base data

library("tidyr")
my_data <- USArrests[c(1, 10, 20, 30), ]
my_data
my_data <- cbind(state = rownames(my_data), my_data)
my_data

gather(data, key, value, ...)

my_data2 <- gather(my_data,
                   key = "arrest_attribute",
                   value = "arrest_estimate",
                   -state)
my_data2

spread(data, key, value)

my_data3 <- spread(my_data2, 
                   key = "arrest_attribute",
                   value = "arrest_estimate"
)
my_data3

unite(data, col, ..., sep = "_")

my_data4 <- unite(my_data,
                  col = "Murder_Assault",
                  Murder, Assault,
                  sep = "_")
my_data4

separate(data, col, into, sep = "[^[:alnum:]]+")

my_data5 <- separate(my_data4,
         col = "Murder_Assault",
         into = c("Murder", "Assault"),
         sep = "_")
my_data5

Pipe operator (%>%)

my_data6 <- my_data %>% gather(key = "arrest_attribute",
                   value = "arrest_estimate",
                   Murder:UrbanPop) %>%
  unite(col = "attribute_estimate",
        arrest_attribute, arrest_estimate)

Relational data

Loading the data

library(tidyverse)
library(nycflights13)
# nycflights13 contains four tibbles that are related to the flights table.
class(flights)
flights

Mutating joins

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")

Filtering joins

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(5)
flights %>% 
  semi_join(top_dest)
dim(flights)
dim(flights %>% semi_join(top_dest))
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

Set operations

df1 <- tribble(
  ~x, ~y,
  1,  1,
  2,  1
)
df2 <- tribble(
  ~x, ~y,
  1,  1,
  1,  2
)
df1
df2
intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)

II. Plotting in R for Biologists

Plotting with ggplot2

1. Scatter plot

library(ggplot2)
p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy))
p + geom_point()

Map year to the colour aesthetic

p <- ggplot(mpg,aes(x=cty, y=hwy, colour=factor(year)))
p + geom_point()

Add a smoothed curve

p + geom_point() + stat_smooth()

Faceting

p + geom_point() + stat_smooth()+facet_wrap(~ year, ncol=1)

2. Histogram

p <- ggplot(mpg,aes(x=hwy))
p + geom_histogram()

Statistical transformation + faceting

p + geom_histogram(aes(fill=factor(year),y=..density..), alpha=0.3,colour='black') +
  stat_density(geom='line',position='identity',size=1.5, aes(colour=factor(year))) +
  facet_wrap(~year,ncol=1)

3. Bar chart

p <- ggplot(mpg, aes(x=class))
p + geom_bar()

Bar chart ordered by count

class2 <- mpg$class
class2 <- reorder(class2,class2,length)
mpg$class2 <- class2
p <- ggplot(mpg, aes(x=class2))
p + geom_bar(aes(fill=class2))

4. Pie chart

p <- ggplot(mpg, aes(x = factor(1), fill = factor(class))) +
  geom_bar(width = 1)
p + coord_polar(theta = "y")

Changing the fill colors

p + coord_polar(theta = "y") + scale_fill_brewer(palette="Dark2")

5. Box plot

p <- ggplot(mpg, aes(class,hwy,fill=class))
p + geom_boxplot()

6. Violin plot

p + geom_violin(alpha=0.3,width=0.9)+ geom_jitter(shape=21)

7. Density plot

set.seed(1234)
df <- data.frame(
  sex=factor(rep(c("F", "M"), each=200)),
  weight=round(c(rnorm(200, mean=55, sd=5),
                 rnorm(200, mean=65, sd=5)))
)
head(df)
p <- ggplot(df, aes(x=weight, color=sex)) +
  geom_density()
p

8. Line plot

df2 <- data.frame(sex = rep(c("Female", "Male"), each=3),
                  time=c("Breakfast", "Lunch", "Dinner"),
                  bill=c(10, 30, 15, 13, 40, 17) )
head(df2)
ggplot(df2, aes(x=time, y=bill, group=sex)) +
  geom_line(aes(linetype=sex, color=sex))+
  geom_point(aes(color=sex))+
  theme(legend.position="top")

9. Heatmap

library(pheatmap)
test = matrix(rnorm(200), 20, 10)
colnames(test) = paste("Test", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")
pheatmap(test, color = colorRampPalette(c("navy", "white", "firebrick3"))(50))

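10. Correlation plot

library(corrplot)
library(dplyr)  # provides select()
mydata <- select(mtcars,hp,disp,wt,qsec,mpg,drat)
# The sourced script defines rquery.cormat(); its definition is reproduced below for reference
source("http://www.sthda.com/upload/rquery_cormat.r")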
rquery.cormat<-function(x, type=c('lower', 'upper', 'full', 'flatten'),
                        graph=TRUE, graphType=c("correlogram", "heatmap"),
                        col=NULL, ...)
{
  library(corrplot)
  # Helper functions
  #+++++++++++++++++
  # Compute the matrix of correlation p-values
  cor.pmat <- function(x, ...) {
    mat <- as.matrix(x)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        tmp <- cor.test(mat[, i], mat[, j], ...)
        p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
      }
    }
    colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
    p.mat
  }
  # Get lower triangle of the matrix
  getLower.tri<-function(mat){
    upper<-mat
    upper[upper.tri(mat)]<-""
    mat<-as.data.frame(upper)
    mat
  }
  # Get upper triangle of the matrix
  getUpper.tri<-function(mat){
    lt<-mat
    lt[lower.tri(mat)]<-""
    mat<-as.data.frame(lt)
    mat
  }
  # Get flatten matrix
  flattenCorrMatrix <- function(cormat, pmat) {
    ut <- upper.tri(cormat)
    data.frame(
      row = rownames(cormat)[row(cormat)[ut]],
      column = rownames(cormat)[col(cormat)[ut]],
      cor  =(cormat)[ut],
      p = pmat[ut]
    )
  }
  # Define color
  if (is.null(col)) {
    col <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", 
                              "#F4A582", "#FDDBC7", "#FFFFFF", "#D1E5F0", "#92C5DE", 
                              "#4393C3", "#2166AC", "#053061"))(200)
    col<-rev(col)
  }
  
  # Correlation matrix
  cormat<-signif(cor(x, use = "complete.obs", ...),2)
  pmat<-signif(cor.pmat(x, ...),2)
  # Reorder correlation matrix
  ord<-corrMatOrder(cormat, order="hclust")
  cormat<-cormat[ord, ord]
  pmat<-pmat[ord, ord]
  # Replace correlation coeff by symbols
  sym<-symnum(cormat, abbr.colnames=FALSE)
  # Correlogram
  if(graph & graphType[1]=="correlogram"){
    corrplot(cormat, type=ifelse(type[1]=="flatten", "lower", type[1]),
             tl.col="black", tl.srt=45,col=col,...)
  }
  else if(graphType[1]=="heatmap")
    heatmap(cormat, col=col, symm=TRUE)
  # Get lower/upper triangle
  if(type[1]=="lower"){
    cormat<-getLower.tri(cormat)
    pmat<-getLower.tri(pmat)
  }
  else if(type[1]=="upper"){
    cormat<-getUpper.tri(cormat)
    pmat<-getUpper.tri(pmat)
    sym=t(sym)
  }
  else if(type[1]=="flatten"){
    cormat<-flattenCorrMatrix(cormat, pmat)
    pmat=NULL
    sym=NULL
  }
  list(r=cormat, p=pmat, sym=sym)
}
rquery.cormat(mydata, type="full")

11. Principal component analysis (PCA)

z1 <- rnorm(10000, mean=1, sd=1)
z2 <- rnorm(10000, mean=3, sd=3)
z3 <- rnorm(10000, mean=5, sd=5)
z4 <- rnorm(10000, mean=7, sd=7)
z5 <- rnorm(10000, mean=9, sd=9)
mydata <- matrix(c(z1, z2, z3, z4, z5), 2500, 20, byrow=T, dimnames=list(paste("R", 1:2500, sep=""), paste("C", 1:20, sep=""))) 
pca <- prcomp(mydata, scale=T) 
summary(pca)
summary(pca)$importance[, 1:6]
x11(height=6, width=12, pointsize=12); par(mfrow=c(1,3))
mycolors <- c("red", "green", "blue", "magenta", "black")
plot(pca$x[,1:2], pch=20, col=mycolors[sort(rep(1:5, 500))])
pairs(pca$x[,1:4], pch=20, col=mycolors[sort(rep(1:5, 500))])
library(scatterplot3d)
scatterplot3d(pca$x[,1:3], pch=20, color=mycolors[sort(rep(1:5, 500))])

12. Bubble plot

require(ggplot2)
df<- read.csv("Bubbles.csv")
ggplot(df, aes(x = id,y=Term,label = Term)) +
  geom_point(aes(size = Input_number, colour = P.Value)) + 
  #geom_text(hjust = 1, size = 2) +
  scale_size(range = c(1,15)) +
  scale_x_continuous(breaks = seq(1, 15, 2)) +
  scale_colour_gradientn(colours=rainbow(4)) +
  theme_bw()

Themes and backgrounds

Built-in ggplot2 themes

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species))+
  geom_point()
p
p + theme_classic()

Theme packages

library(ggthemes)
p + theme_calc()+ scale_colour_calc()+
  ggtitle("Iris data")

Custom themes

p + theme(
  panel.background = element_rect(fill = "lightblue",
                                  colour = "lightblue",
                                  size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                  colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                  colour = "white")
)

III. Modifying complex figures

library(ggplot2)
#dat <- read.csv("https://github.com/tiramisutes/myscripts-R/blob/master/EconomistData.csv")
dat <- read.csv("C:/Users/hope/Desktop/R-class//EconomistData.csv")

Basic plot

pc1 <- ggplot(dat,aes(x = CPI, y = HDI, color = Region))+
  geom_point()
pc1

Trend line

pc2 <- pc1 +
  geom_smooth(aes(group = 1),
              method = "lm",
              formula = y ~ log(x),
              se = FALSE,
              color = "red")
pc2

Open points

pc3 <- ggplot(dat,aes(x = CPI, y = HDI, color = Region))+
  geom_point(shape = 1, size = 4) +
  geom_smooth(aes(group = 1),
              method = "lm",
              formula = y ~ log(x),
              se = FALSE,
              color = "red")
pc3

Selectively label chosen points

pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados", "Norway", "Japan",
                   "New Zealand", "Singapore")
library("ggrepel")
pc4 <- pc3 +  geom_text_repel(aes(label = Country),
                              color = "gray20",
                              data = subset(dat, Country %in% pointsToLabel),
                              force = 10)
pc4

Changing legend labels and order

dat$Region <- factor(dat$Region,
                     levels = c("EU W. Europe",
                                "Americas",
                                "Asia Pacific",
                                "East EU Cemt Asia",
                                "MENA",
                                "SSA"),
                     labels = c("OECD",
                                "Americas",
                                "Asia &\nOceania",
                                "Central &\nEastern Europe",
                                "Middle East &\nnorth Africa",
                                "Sub-Saharan\nAfrica"))
pc4$data <- dat
pc4

Use scales to modify the x and y axes and colors, and add a title

pc5 <- pc4 +
  scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10=least corrupt)",
                     limits = c(.9, 10.5),
                     breaks = 1:10) +
  scale_y_continuous(name = "Human Development Index, 2011 (1=Best)",
                     limits = c(0.2, 1.0),
                     breaks = seq(0.2, 1.0, by = 0.1)) +
  scale_color_manual(name = "",
                     values = c("#24576D",
                                "#099DD7",
                                "#28AADC",
                                "#248E84",
                                "#F2583F",
                                "#96503F")) +
  ggtitle("Corruption and Human development") +
  theme(plot.title = element_text(hjust = 0.5))
pc5

Fine-tuning the theme

library(grid)
pc6 <- pc5 +
  theme_minimal() + # start with a minimal theme and add what we need
  theme(text = element_text(color = "gray20"),
        legend.position = c("top"), # position the legend in the upper left 
        legend.direction = "horizontal",
        legend.justification = 0.1, # anchor point for legend.position.
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text = element_text(face = "italic"),
        axis.title.x = element_text(vjust = -1), # move title away from axis
        axis.title.y = element_text(vjust = 2), # move away for axis
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line = element_line(color = "gray40", size = 0.5),
        axis.line.y = element_blank(),
        panel.grid.major = element_line(color = "gray50", size = 0.5),
        panel.grid.major.x = element_blank()
  )
pc6

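IV. RNA-Seq (DESeq2)

library(DESeq2)
library(limma)
library(Biobase)    # provides pData()
library(pasilla)
data(pasillaGenes)  # assumes a pasilla version that still ships pasillaGenes
exprSet=counts(pasillaGenes)
head(exprSet)
# group_list was never defined in the original; here it is assumed to be
# the treated/untreated condition stored in the sample annotation
group_list <- pData(pasillaGenes)$condition
colData <- data.frame(row.names=colnames(exprSet), 
                      group_list=group_list
)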
dds <- DESeqDataSetFromMatrix(countData = exprSet,
                              colData = colData,
                              design = ~ group_list)
dds
dds2 <- DESeq(dds)
resultsNames(dds2)
res <-  results(dds2, contrast=c("group_list","treated","untreated")) 
resOrdered <- res[order(res$padj),]
resOrdered=as.data.frame(resOrdered)
head(resOrdered)

V. Final notes

For the slides that accompany this page, see A Beginner's Guide to Learn R Programming; for more quality resources, read "The best R resources: one is enough!"

Oxford Nanopore Sequencing

2017-10-21T03:47:31.000Z

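Nanopore sequencing technology

Nanopore sequencing (also called fourth-generation sequencing) is a new sequencing technology that has emerged in recent years. Read lengths can currently reach 150 kb;

The most widely adopted nanopore platform on the market is the MinION sequencer from Oxford Nanopore Technologies (ONT). Its features are single-molecule sequencing, long reads (over 150 kb), fast sequencing, real-time monitoring of sequencing data, and portability.
Nanopore analysis traces back to the invention of the Coulter counter and to single-channel current recording. Neher and Sakmann, winners of the Nobel Prize in Physiology or Medicine, used the patch-clamp technique in 1976 to measure membrane potential and to study membrane proteins and ion channels, which advanced the practical application of nanopore sequencing. In 1996, Kasianowicz et al. proposed using α-hemolysin to sequence DNA, a milestone for single-molecule sequencing with biological nanopores. Later reports on other biological nanopores, such as the MspA porin and the bacteriophage Phi29 connector, broadened research on nanopore analysis. In 2001, Li et al. opened the era of solid-state nanopore research. After more than a decade of development, solid-state nanopore technology is now maturing.
Two classes of nanopores are currently used for DNA sequencing: biological nanopores (a protein embedded in a phospholipid membrane) and solid-state nanopores (various silicon-based materials, SiNx, carbon nanotubes, graphene, glass nanotubes, and so on). Because DNA strands are very thin (about 2 nm for double-stranded DNA and about 1 nm for single-stranded DNA), the requirements on pore size are stringent.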

How it works

Sequencing process

Nanopore sequencing: (a) A biological nanopore is inserted into an electrically resistant synthetic membrane. A potential is applied across the membrane, resulting in ion flow. Library DNA molecules have adaptors with aliphatic tethers (not shown) which preferentially locate to the membrane for a localized library concentration. (b) The motor protein bound to the other adaptor docks with the pore, and passes the DNA molecule through it. (c) Bases in the nanopore cause disruptions in the current which are characteristic of their sequence (blue line). In some basecallers, the signal is further refined to events (red line) which correspond to distinct pore kmers.

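The core of the MinION sequencer is a flow cell with 2,048 nanopores, arranged in 512 groups and controlled by an application-specific integrated circuit (ASIC). The sequencing principle is shown in panel a of the figure below. First, double-stranded DNA is ligated to a lead adaptor (blue), a hairpin adaptor (red) and a trailing adaptor (brown). When sequencing starts, the lead adaptor guides the molecule into the enzyme-controlled nanopore, followed by the template read (the DNA molecule to be sequenced). The hairpin adaptor is what makes sequencing of both strands possible: after it, the complement read (the complementary strand of the molecule) passes through the pore, and the trailing adaptor passes last. In this method the template read and the complement read pass through the pore one after the other and are combined by pairwise alignment into a 2D read. In the alternative method no hairpin adaptor is used and only the template read is sequenced, giving a 1D read. The latter has higher throughput but lower accuracy than 2D reads. Each adaptor sequence causes a different current change as it passes through the pore (Fig. 1c), and this difference can be used for base calling.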
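
The relationship between the template and complement reads can be illustrated with a toy comparison. This is only a sketch of the idea behind combining the two strands: real 2D base calling aligns the reads and builds a consensus, whereas the code below assumes equal-length reads without indels and merely reports positions where the two strands disagree. The sequences are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def strand_disagreements(template, complement):
    """Positions (0-based, template coordinates) where the strands disagree.

    The complement read is converted back to template orientation first.
    Assumes equal lengths and no insertions or deletions.
    """
    back = reverse_complement(complement)
    if len(back) != len(template):
        raise ValueError("this sketch only handles equal-length reads")
    return [i for i, (a, b) in enumerate(zip(template, back)) if a != b]

template = "ACGGTTCA"
clean_complement = "TGAACCGT"   # exact reverse complement of the template
noisy_complement = "TGAACGGT"   # one simulated base-calling error

print(strand_disagreements(template, clean_complement))
print(strand_disagreements(template, noisy_complement))
```

Positions where the two strands disagree are the ones a consensus step would have to resolve, which is the intuition for why 2D reads are more accurate than 1D reads.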

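Analysis tools

Bioinformatics tools for Nanopore sequencing data are already available, including MECAT, a recent third-generation assembler developed in China, and the long-established MaSuRCA. For the full list of supported software see: https://nanoporetech.com/resource-centre/tools.

Main application areas

1. Genome assembly

[Nanopore sequencing The advantages of long reads for genome assembly]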

Assembler name	Algorithms	Error correction	Link	Reference
LQS	DALIGNER, Celera OLC	Nanocorrect, Nanopolish	https://github.com/jts/nanopolish	Loman (2015)
Canu	MHAP, Celera OLC	Canu	https://github.com/marbl/canu	Berlin (2015)
Canu	MHAP, Celera OLC	Racon, Pilon	https://github.com/nanoporetech/ont-assembly-polish	nanoporetech
Miniasm	OLC		https://github.com/lh3/minia	Li (2016)
Miniasm	OLC	Racon	https://github.com/isovic/racon	Vaser (2017)
Ra-integrate	OLC		https://github.com/mariokostelac/ra-integrate/	Sovic (2016)
ALLPATHS-LG	de Bruijn graph	ALLPATHS-LG	https://www.broadinstitute.org/software/allpathslg/blog/?page_id	Gnerrea (2011)
SPAdes	de Bruijn graph	SPAdes	http://bioinf.spbau.ru/spades	Bankevich (2012)
SMART denovo	Smith-Waterm, dot matrix		https://github.com/ruanjue/smartdenovo	Ruan
ABruijn	de Bruijn graph		https://github.com/fenderglass/ABruijn	Lin (2016)

2. Metagenomics

[Nanopore sequencing Addressing the challenges of metagenomics for environmental and clinical research]

3. Variant analysis

[Nanopore sequencing The application and advantages of long-read nanopore sequencing to structural variation analysis]

PacBio vs. Oxford Nanopore sequencing

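Both are currently the main long-read sequencing technologies and share the same weakness: a high error rate. Owing to differences in their sequencing principles, PacBio has the lower error rate of the two (MinION data currently reach only 92% accuracy). In terms of throughput, Oxford Nanopore can sequence multiple molecules simultaneously and therefore has higher throughput. The Oxford Nanopore sequencer is only about the size of a USB stick, so it is portable, and its sequencing costs are low;

References

A brief introduction to the development of nanopore sequencing
Jain, Miten, et al. "The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community." Genome biology 17.1 (2016): 239.
Leggett, Richard M., and Matthew D. Clark. "A world of opportunities with nanopore sequencing." Journal of Experimental Botany (2017): erx289.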
PacBio vs. Oxford Nanopore sequencing

RNAi-mediated insect resistance

2017-09-04T09:56:53.000Z

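The strategy of expressing dsRNA in plants to confer insect resistance began with the work of Academician Xiaoya Chen, who expressed dsRNA of the cotton bollworm cytochrome P450 monooxygenase gene (CYP6AE14) in cotton, impairing larval tolerance of gossypol [Silencing a cotton bollworm P450 monooxygenase gene by plant-mediated RNAi impairs larval tolerance of gossypol]. Other researchers have since applied the approach in their own experiments ([A transgenic strategy for controlling plant bugs (Adelphocoris suturalis) through expression of double-stranded RNA homologous to fatty acyl-coenzyme A reductase in cotton]).

Mechanism of action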

Figure 1. An Updated Model of RNAi-Mediated Crop Protection from Insect Pests. Insecticidal dsRNAs expressed in planta either undergo processing into small interfering RNAs (siRNAs) or are imported into insect cells upon feeding on the plant, presumably by homologs of the double-stranded RNA (dsRNA) transport protein SID (originally discovered and subsequently characterized in detail in Caenorhabditis elegans [30]) and/or by endocytosis. Because insects do not possess RNA-dependent RNA polymerases (RdRPs), and therefore silencing signals cannot be amplified in insect cells, the RNAi response in the insect is largely dependent on the input of dsRNA that is taken up from the host plant. Efficient processing of dsRNAs expressed from the plant nuclear genome into siRNAs constrains the RNAi effect on the insect. This obstacle can be overcome by high-level expression of long dsRNAs in plastids (chloroplasts), a double membrane-surrounded cell organelle that lacks an RNAi machinery. Consequently, long dsRNAs produced inside the plastid are protected from cleavage by Dicer enzymes and remain stable. In this way, plastid expression of insecticidal dsRNAs provided complete protection of potato plants against the Colorado potato beetle [19]. However, dsRNase enzyme(s) present in the midgut of lepidopteran insects, such as the cotton bollworm, may degrade dsRNAs upon release from ingested leaf material and thus impede the RNAi responses.

Terminology

dsRNA-specific ribonuclease(dsRNase): an enzyme, typically an endoribonuclease, that specifically degrades dsRNAs.
RNA-dependent RNA polymerase(RdRP): an RNA polymerase capable of using single-stranded RNA as template to synthesize a complementary RNA strand. By catalyzing the replication of RNA, RdRPs can amplify silencing signals generated by dsRNAs.
Small interfering RNA (siRNA): a class of small non-coding RNA molecules (typically 20–25 bp) that function in post-transcriptional gene silencing. siRNAs are generated from dsRNA in the RNA interference (RNAi) pathway and induce the degradation of mRNAs with complementary sequences. siRNAs not only guide RISC to cleave homologous single-stranded mRNA, but can also act as primers that bind the target RNA so that RNA-dependent RNA polymerase (RdRP) synthesizes more new dsRNA. The newly synthesized dsRNA is in turn cleaved by Dicer into large numbers of secondary siRNAs, which further amplifies the RNAi effect until the target mRNA is completely degraded.

Silencing efficiency and challenges

Most of the dsRNA is processed into siRNAs in the plant, which are taken up much less efficiently by insect cells than is long dsRNA.

The concentration of insecticidal dsRNA in the plant tissue consumed is particularly important.
The length of the dsRNA fragment produced in the plant.
The physiology of the insect gut or hemolymph.
The choice of the target gene to be silenced in the insect.

References

1.Luo, J. et al. A transgenic strategy for controlling plant bugs (Adelphocoris suturalis) through expression of double-stranded RNA homologous to fatty acyl-coenzyme A reductase in cotton. New Phytol 215, 1173–1185 (2017).
2.Zhang, J., Khan, S. A., Heckel, D. G. & Bock, R. Next-Generation Insect-Resistant Plants: RNAi-Mediated Crop Protection. Trends in Biotechnology 35, 871–882 (2017).

Do you really understand Illumina data quality control?

2017-09-04T07:03:02.000Z

1. Inspecting with FastQC

2. Trimming and filtering reads

Short-insert paired end reads

Adapter sequences:

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

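Use common QC software such as Trimmomatic.

Long Mate Pair libraries

Adapter sequences: technote_nextera_matepair_data_processing.pdf

The main tools for this type of data are nextclip and skewer; judging from the published results, the latter performs slightly better. [Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads]

Tool: nextclip (also removes PCR duplicates)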

./nextclip -d -i ~/AS/raw_reads/AS8K_R1.fastq -j ~/AS/raw_reads/AS8K_R2.fastq -o output

NextClip v1.3.2

                n: 18
                b: 100
          Entries: 26214400
       Entry size: 24
  Memory required: 856 MB

Creating hash tables for duplicate storage...
Hash:
 unique kmers: 0
 Capacity: 26214400 
 Occupied: 0.00%
 Pruned: 0 (-nan%)
 Collisions:
Adaptor: CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Opening input filename /public/home/zpxu/AS/raw_reads/AS8K_R1.fastq
Opening input filename /public/home/zpxu/AS/raw_reads/AS8K_R2.fastq
Opening output file output_A_R1.fastq
Opening output file output_A_R2.fastq
Opening output file output_B_R1.fastq
Opening output file output_B_R2.fastq
Opening output file output_C_R1.fastq
Opening output file output_C_R2.fastq
Opening output file output_D_R1.fastq
Opening output file output_D_R2.fastq
Hash:
 unique kmers: 25546550
 Capacity: 26214400 
 Occupied: 97.45%
 Pruned: 0 (0.00%)
 Collisions:
         tries 0: 32059454
         tries 1: 649509
         tries 2: 192228
         tries 3: 76759
         tries 4: 35001
         tries 5: 17641
         tries 6: 9035
         tries 7: 4943
         tries 8: 2774
         tries 9: 1495
too much rehashing!! Rehash=26

As shown above, if the error message too much rehashing!! Rehash=26 appears, increase the value of the [-n | --number_of_reads] Approximate number of reads (default 20,000,000) parameter;
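
The numbers in the log above show why the run failed: the duplicate hash table was almost full. In that log, Entries equals 2^n × b (2^18 × 100 = 26,214,400), and 25,546,550 unique k-mers occupy 97.45% of it. The sketch below redoes this arithmetic. Reading the logged n as a bit count and b as a bucket size is an inference from these values, not something taken from the NextClip documentation, and the logged n is not the same thing as the -n command-line value.

```python
def hash_capacity(n_bits, bucket_size):
    """Table capacity, assuming Entries = 2**n * b as the log values suggest."""
    return (2 ** n_bits) * bucket_size

def occupancy_percent(unique_kmers, capacity):
    """Share of the table in use, in percent."""
    return 100.0 * unique_kmers / capacity

unique_kmers = 25_546_550            # "unique kmers" from the failed run

capacity_18 = hash_capacity(18, 100)  # values printed as n: 18, b: 100
capacity_19 = hash_capacity(19, 100)  # one more bit doubles the table

print(capacity_18, f"{occupancy_percent(unique_kmers, capacity_18):.2f}%")
print(capacity_19, f"{occupancy_percent(unique_kmers, capacity_19):.2f}%")
```

Under this reading, raising the approximate number of reads so that the table gains one bit would have left the same data at roughly half occupancy.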

./nextclip -d -e -i ~/AS/raw_reads/AS3K_R1.fastq -j ~/AS/raw_reads/AS3K_R2.fastq -o 33AS3K -n 30000000

NextClip v1.3.2

                n: 19
                b: 100
          Entries: 52428800
       Entry size: 24
  Memory required: 1456 MB

Creating hash tables for duplicate storage...
Hash:
 unique kmers: 0
 Capacity: 52428800 
 Occupied: 0.00%
 Pruned: 0 (-nan%)
 Collisions:
Adaptor: CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG

Opening input filename /public/home/zpxu/AS/raw_reads/AS3K_R1.fastq
Opening input filename /public/home/zpxu/AS/raw_reads/AS3K_R2.fastq
Opening output file 33AS3K_A_R1.fastq
Opening output file 33AS3K_A_R2.fastq
Opening output file 33AS3K_B_R1.fastq
Opening output file 33AS3K_B_R2.fastq
Opening output file 33AS3K_C_R1.fastq
Opening output file 33AS3K_C_R2.fastq
Opening output file 33AS3K_D_R1.fastq
Opening output file 33AS3K_D_R2.fastq
Opening output file 33AS3K_E_R1.fastq
Opening output file 33AS3K_E_R2.fastq
Warning: read shorter than minimum read size (64) - ignoring
GC bases: 9705583313  AT bases: 12168077887

Hash:
 unique kmers: 28281698
 Capacity: 52428800 
 Occupied: 53.94%
 Pruned: 0 (0.00%)
 Collisions:
         tries 0: 72912204

Counting duplicates...

8%      [========                                                                                            ]Warning: count (999) exceeds maximum - treated as 999
10%     [==========                                                                                          ]Warning: count (999) exceeds maximum - treated as 999
20%     [====================                                                                                ]Warning: count (999) exceeds maximum - treated as 999
23%     [=======================                                                                             ]Warning: count (999) exceeds maximum - treated as 999
25%     [=========================                                                                           ]Warning: count (999) exceeds maximum - treated as 999
67%     [===================================================================                                 ]Warning: count (999) exceeds maximum - treated as 999
83%     [===================================================================================                 ]Warning: count (999) exceeds maximum - treated as 999
100%    [====================================================================================================]

SUMMARY

     Strict match parameters: 34, 18
    Relaxed match parameters: 32, 17
           Minimum read size: 25
                   Trim ends: 19

        Number of read pairs: 72966745
   Number of duplicate pairs: 44626706  61.16 %
Number of pairs containing N: 54541     0.07 %

   R1 Num reads with adaptor: 16173479  22.17 %
   R1 Num with external also: 4375021   6.00 %
       R1 long adaptor reads: 11092453  15.20 %
          R1 reads too short: 5081026   6.96 %
     R1 Num reads no adaptor: 12162760  16.67 %
  R1 no adaptor but external: 5248876   7.19 %

   R2 Num reads with adaptor: 14902833  20.42 %
   R2 Num with external also: 4406543   6.04 %
       R2 long adaptor reads: 9987006   13.69 %
          R2 reads too short: 4915827   6.74 %
     R2 Num reads no adaptor: 13433406  18.41 %
  R2 no adaptor but external: 5653578   7.75 %

   Total pairs in category A: 11389962  15.61 %
         A pairs long enough: 5627734   7.71 %
           A pairs too short: 5762228   7.90 %
A external clip in 1 or both: 18225     0.02 %
     A bases before clipping: 3416988600
       A total bases written: 749338798

   Total pairs in category B: 3422082   4.69 %
         B pairs long enough: 1695273   2.32 %
           B pairs too short: 1726809   2.37 %
B external clip in 1 or both: 47947     0.07 %
     B bases before clipping: 1026624600
       B total bases written: 323696037

   Total pairs in category C: 4565902   6.26 %
         C pairs long enough: 2610991   3.58 %
           C pairs too short: 1954911   2.68 %
C external clip in 1 or both: 143843    0.20 %
     C bases before clipping: 1369770600
       C total bases written: 509149505

   Total pairs in category D: 8649889   11.85 %
         D pairs long enough: 3667738   5.03 %
           D pairs too short: 4982151   6.83 %
D external clip in 1 or both: 5627647   7.71 %
     D bases before clipping: 2594966700
       D total bases written: 899148840

   Total pairs in category E: 308404    0.42 %
         E pairs long enough: 196969    0.27 %
           E pairs too short: 111435    0.15 %
E external clip in 1 or both: 37111     0.05 %
     E bases before clipping: 92521200
       E total bases written: 29751268

          Total usable pairs: 10130967  13.88 %
             All long enough: 13798705  18.91 %
    All categories too short: 14537534  19.92 %
      Duplicates not written: 44630506  61.17 %

         Category B became E: 90789     0.12 %
         Category C became E: 217615    0.30 %
          Overall GC content: 44.37 %


Done. Completed in 4414 seconds.

Categories A, B and C in the output files are merged and used for downstream analysis.
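The merge can be scripted. The sketch below is only an illustration: `merge_categories` is a hypothetical helper, the `_ABC_` output name is an assumption, and it relies on the `<prefix>_<category>_<read>.fastq` naming visible in the NextClip log above. Categories are concatenated in the same order for R1 and R2, so read pairs stay in step.

```python
import shutil


def merge_categories(prefix, read, categories=("A", "B", "C"), out_path=None):
    """Concatenate NextClip category files for one read direction (R1 or R2)."""
    # Hypothetical output name; choose whatever suits your pipeline.
    out_path = out_path or "{}_ABC_{}.fastq".format(prefix, read)
    with open(out_path, "wb") as out:
        for cat in categories:
            # Input naming as seen in the log: <prefix>_<category>_<read>.fastq
            src_path = "{}_{}_{}.fastq".format(prefix, cat, read)
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path
```

For the run above this would be called as `merge_categories("33AS3K", "R1")` and `merge_categories("33AS3K", "R2")`.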

Processing tool: skewer

 ./skewer -m mp -i ~/AS/raw_reads/AS8K_R1.fastq ~/AS/raw_reads/AS8K_R2.fastq -o AS8K -t 5
.--. .-.
: .--': :.-.
`. `. : `'.' .--. .-..-..-. .--. .--.
_`, :: . `.' '_.': `; `; :' '_.': ..'
`.__.':_;:_;`.__.'`.__.__.'`.__.':_;
skewer v0.2.2 [April 4, 2016]
Parameters used:
-- 3' end adapter sequence (-x):        AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
-- junction adapter sequence (-j):      CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
-- maximum error ratio allowed (-r):    0.100
-- maximum indel error ratio allowed (-d):      0.030
-- minimum read length allowed after trimming (-l):     18
-- file format (-f):            Sanger/Illumina 1.8+ FASTQ (auto detected)
-- minimum overlap length for junction adapter detection (-k):  19
-- redistribute reads based on junction information (-i):       yes
Sat Sep  2 21:43:52 2017 >> started
|=================================================>| (100.00%)
Sat Sep  2 22:53:41 2017 >> done (4120.338s)
61751767 read pairs processed; of these:
 2750894 ( 4.45%) non-junction read pairs filtered out by contaminant control
 6682698 (10.82%) short read pairs filtered out after trimming by size control
31458966 (50.94%) empty read pairs filtered out after trimming by size control
20859209 (33.78%) read pairs available; of these:
17141247 (82.18%) trimmed read pairs available after processing
 3717962 (17.82%) untrimmed read pairs available after processing
log has been saved to "/public/home/zpxu/AS/raw_reads/AS8K_R1-trimmed.log".

Proportion of reads removed for the two library types

After trimming and quality filtering, 56% of long-insert reads from each of the three mate-pair libraries and 95% of paired-end reads were retained on average.

Comparison of the two tools run on the same data

FastQ files	read length	median	mean	stdev	FF	FR	RF	RR
AS3K_R2_nextclip.fq.is.txt	83	2510	2307.246	764.2089	36	781	9164	19
AS5K_R2_nextclip.fq.is.txt	84	4444	3974.787	1506.76	54	773	9127	46
AS8K_R2_nextclip.fq.is.txt	81	5825	4733.265	2530.765	150	1167	8543	140
AS3K-trimmed-pair2.fastq.is.txt.skewer	102	2493	2319.141	725.1056	26	1912	8049	13
AS5K-trimmed-pair2.fastq.is.txt.skewer	106	4460	4042.562	1443.632	30	2333	7608	29
AS8K-trimmed-pair2.fastq.is.txt.skewer	111	5945	4935.393	2446.029	112	3607	6212	69

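3. FastUniq: removing PCR duplicates from paired reads

Trim first, then remove duplicates with this tool. FastUniq compares read pairs and keeps one pair at random from each set of identical pairs, so without prior trimming it can easily keep the lower-quality copy. It still handles pairs of different lengths after trimming. (Source: 【每日一生信—FastUniq去除paired reads的duplicates】)

One library per input list, run once per library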

cat AS285.list
AS285A_R1.clean.fastq
AS285A_R2.clean.fastq
fastuniq -i AS285.list -o AS285A_R1.rd.clean.fastq -p AS285A_R2.rd.clean.fastq

Alternatively, list several libraries in the same input_list.txt; the output then merges them into a single pair of files:

cat input_list.txt
input_R1_1.fastq
input_R1_2.fastq
input_R2_1.fastq
input_R2_2.fastq
fastuniq -i input_list.txt -t q -o output_1.fastq -p output_2.fastq -c 1

The error below is a memory problem; rerun the job on a high-memory node.

Error in Reading pair-end FASTQ sequence!

4. Read error correction

BLESS and Musket give similar correction results, but BLESS kept failing with an error.

BLESS

source /public/home/software/.bashrc
module load BLESS/1.02
cd /public/home/zpxu/AS/clean_reads
bless -read1 AS3k_R1.rd.clean.fastq -read2 AS3k_R2.rd.clean.fastq -prefix ../bless/AS3k_R12.rd.clean -kmerlength 31

Error:

Checking input read files

ERROR: Irregular quality score range 35-75
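The BLESS message complains about the quality score range. A minimal sketch for inspecting that range yourself, assuming a plain four-line-per-record FASTQ and assuming that the numbers BLESS prints are the ASCII codes of the quality characters (the log does not say). `quality_range` is a hypothetical helper, not part of BLESS.

```python
def quality_range(lines):
    """Return (min, max) ASCII codes found on the quality lines of a FASTQ."""
    lo, hi = 255, 0
    for i, line in enumerate(lines):
        if i % 4 == 3:  # 4th line of every record holds the quality string
            codes = [ord(ch) for ch in line.rstrip("\r\n")]
            if codes:
                lo = min(lo, min(codes))
                hi = max(hi, max(codes))
    return lo, hi


# Tiny in-memory record; with a real file, pass the open file handle instead.
record = ["@read1", "ACGT", "+", "#IK!"]
print(quality_range(record))
```

Passing an open file handle (`with open("AS3k_R1.rd.clean.fastq") as fh: quality_range(fh)`) scans the file line by line without loading it into memory.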

Musket - a multistage k-mer spectrum based corrector

musket AS485_R1.rd.clean.fastq AS485_R2.rd.clean.fastq -omulti AS485 -inorder -p 10

After trimming, PCR duplicate removal and error correction, the reads are ready for downstream analyses such as genome assembly.

Installing and using Circos

2017-05-29T09:51:59.000Z

Installation home page

Circos figures in published papers (http://circos.ca/in_literature/scientific/)

Tutorials and data downloads

Creating Circos Plots

Download the latest Circos release

$ wget http://circos.ca/distribution/circos-0.69-4.tgz
$ tar -xzvf circos-0.69-4.tgz

Check for missing Perl modules

$ cd circos-0.69-4
$ bin/circos -modules
ok       1.38 Carp
ok       0.38 Clone
missing            Config::General
ok        3.3 Cwd
ok      2.154 Data::Dumper
ok       2.39 Digest::MD5
ok       2.77 File::Basename
ok        3.3 File::Spec::Functions
ok       0.22 File::Temp
ok       1.50 FindBin
ok       0.39 Font::TTF::Font
ok       2.44 GD
ok        0.2 GD::Polyline
ok       2.38 Getopt::Long
ok       1.14 IO::File
ok      0.416 List::MoreUtils
ok       1.46 List::Util
ok       0.01 Math::Bezier
ok       1.60 Math::BigFloat
ok       0.07 Math::Round
ok       0.08 Math::VecStat
ok    1.01_03 Memoize
ok       1.17 POSIX
missing            Params::Validate
ok       1.36 Pod::Usage
missing            Readonly
ok 2016060801 Regexp::Common
missing            SVG
missing            Set::IntSpan
missing            Statistics::Basic
ok       2.20 Storable
ok       1.11 Sys::Hostname
ok      2.0.0 Text::Balanced
missing            Text::Format
ok     1.9721 Time::HiRes

Install the missing modules with cpanm.
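With many modules missing, the list can be pulled out of the `bin/circos -modules` report automatically. The sketch below assumes the report format shown above (a `missing` status followed by the module name); `missing_modules` is a hypothetical helper and the embedded report is a shortened copy of the output above.

```python
def missing_modules(report):
    """Collect module names from report lines whose status is 'missing'."""
    missing = []
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "missing":
            missing.append(fields[-1])
    return missing


report = """ok       1.38 Carp
missing            Config::General
ok        3.3 Cwd
missing            SVG
"""

# Build one cpanm command line covering every missing module.
print("cpanm " + " ".join(missing_modules(report)))
```

The printed line can be pasted into the shell; cpanm accepts several module names in one call.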

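Check that GD can draw; this should produce a diag.png file:

bin/gddiag

Check that Circos itself is installed correctly:

bin/circos -help

Or

cd example
./run

This generates circos.png in the example directory, along with run.out, which records the screen output.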

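Using Circos

circos -conf etc/circos.conf

The options are:

-version             print the Circos version
-modules             check the Perl modules
-conf                the main configuration file

For the full list, see: COMMAND LINE PARAMETERS
The image.generic.conf configuration file sets the output directory and file name of the image.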

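Circos configuration files in detail

Circos is driven by one main configuration file, circos.conf, plus several other configuration files.
The main file is organized into blocks, and a block can contain nested blocks.
Some settings, such as colors and fonts, rarely need to change. They are usually kept in separate configuration files and pulled into the main file with an include directive, which applies their settings.
For example, the line most often placed at the top of the main configuration file is:

Genome karyotype data file
<<include karyotype.and.layout.conf>>

The lines most often placed at the end of the main configuration file are:

Image output parameters
<image>
<<include etc/image.conf>>
</image>
Colors, fonts and fill patterns
<<include etc/colors_fonts_patterns.conf>>
System and debug parameters
<<include etc/housekeeping.conf>>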

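A typical plot needs the following configuration files:

Configuration file	Content and purpose
circos.conf	Main configuration file
karyotype.and.layout.conf	Genome karyotype data
ideogram.conf	How the chromosomes in the karyotype are drawn
ticks.conf	How chromosome sizes are shown as ticks
image.conf	Image output parameters
highlight.conf	Genome karyotype data
colors_fonts_patterns.conf	Colors, fonts and fill patterns
housekeeping.conf	System and debug parameters

When you run circos -conf etc/circos.conf, Circos first looks for the required configuration files in the directory that contains circos.conf.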

1. `karyotype.and.layout.conf`: the main data configuration file

karyotype

Specifies where the genome karyotype files are stored.

A karyotype file defines the names, sizes and colors of the chromosomes that you will use in the image.

# Karyotype files to use
karyotype = data/karyotype/karyotype.human.txt,data/karyotype/karyotype.mouse.txt,data/karyotype/karyotype.rat.txt
chromosomes_order_by_karyotype = yes
# Length unit: with the setting below, 1u stands for 1 Mb of sequence
chromosomes_units              = 1000000
# By default every chromosome in the karyotype files is drawn. To draw only selected chromosomes, use the parameters below.
chromosomes_display_default    = no
# The chromosomes parameter selects what is drawn. Text between // is a regular expression, and matching chromosomes are drawn; it is matched against column 3 of the karyotype file. Chromosomes can also be listed directly, e.g. hs1;hs2;hs3;hs4.
#chromosomes         = -/[XY]/;-/hs(1[1-9]|2\d)$/;-/rn/;-/mm/;rn1;mm1
chromosomes          = -/[XY]/;-/rn/;-/mm/;rn1;mm1
#chromosomes         = hs4;hs3;hs5;hs7;hs8;hs12;hs6;hs11;hs14;hs13;mm1
#chromosomes_order   = hs1,hs2,mm2,mm1
# Ideogram colors: the last column of the karyotype file gives each chromosome's color, and chromosomes_color can override it.
chromosomes_color    = /mm/=blues-5-seq-4,/rn/=reds-5-seq-4
# Draw hs2, hs3 and hs4 in reverse orientation.
chromosomes_reverse  = /hs[234]/
# rn1 and mm1 scaled to each occupy 1/4 of the figure
# chromosomes_scale sets the size of individual ideograms, with the full circle having length 1. With unit r the value applies to a single chromosome (e.g. hs1 taking 0.5); with unit rn the value is shared evenly among all chromosomes matched by a regular expression (e.g. hs2, hs3 and hs4 together taking 0.5).
chromosomes_scale   = rn1=0.25r,mm1:0.25r
# /hs/=0.5rn - relative scaling, normalized by number of ideograms matching /hs/
#              this is equivalent to /hs/=0.0227r (0.5/22).
#chromosomes_scale   = /hs/=0.5rn
# By default ideogram.conf sets one radius for all ideograms; this parameter moves selected ideograms.
chromosomes_radius = hs2:1.05r;hs3:1.20r;hs4:1.35r;hs5:1.15r;hs6:1.05r
# The last column of the karyotype file gives each chromosome's color, and chromosomes_color can override it. Redefining the colors as below is more direct. chr1, chr2, chr3 and chr4 are the color names in the last column of the karyotype file; a colors block redefines them. Note that a redefinition needs the * suffix.
<colors>
chr1* = red
chr2* = orange
chr3* = green
chr4* = blue
</colors>

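Karyotype file format and content
- [x] Content: build the karyotype file for your own genome
- [x] Generate your own karyotype file with a script; color values can be taken from /public/home/zpxu/bin/circos-0.69-4/etc/colors.unix.txt or /public/home/zpxu/bin/circos-0.69-4/etc/colors.ucsc.conf

#chr -   chromosome ID   label shown in the figure     start   end    color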
chr -   Gmax01  1   0   56831624    chr1
chr -   Gmax02  2   0   48577505   	chr2
chr -   Gmax03  3   0   45779781    chr3

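[x] Explanation:

chr: this column declares that the line describes a chromosome;
   - : a dash placeholder. The field normally names the parent of an entry; a chromosome has no parent;
Chromosome ID: a unique identifier that must not repeat. If one karyotype file holds chromosome sets from several sources, the best way to build IDs is with a prefix,
                    for example hs = homo sapiens, mm = mus musculus, and so on. A prefix such as hs19 can also make the assembly version explicit.
                    Even with a single source it is better to use a prefix, to keep the file format consistent.
                    The example above is the soybean genome I plotted, so the chromosome IDs take the form Gmax01 …;
                    choose IDs to suit your own chromosomes.
Label: the text shown in the figure. Here I simply use 1, 2, …,
                    but with genomes from several species, samples or strains the labels should tell them apart.
START and END: define the size of the chromosome.
                    In a karyotype file START and END must describe the full chromosome, not the part you want to draw.
                    The part to draw is defined in other files.
COLOR: the display color. If the chromosome is not overlaid with cytogenetic bands, it is drawn in this color.
                For the human genome, Circos predefines colors named after the chromosomes: chr1, chr2, … chrX, chrY, chrUn.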
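The karyotype lines described above are easy to generate from a list of chromosome lengths. This is a minimal sketch, not the author's script: `karyotype_lines` is a hypothetical helper, the `Gmax` prefix and the lengths are taken from the example above, and the `chrN` color names are assumed to exist in the Circos color files.

```python
def karyotype_lines(chrom_sizes, prefix="Gmax"):
    """Build 'chr - ID LABEL START END COLOR' lines for a karyotype file."""
    lines = []
    for number, size in enumerate(chrom_sizes, start=1):
        chrom_id = "{}{:02d}".format(prefix, number)  # e.g. Gmax01
        lines.append(
            "chr - {} {} 0 {} chr{}".format(chrom_id, number, size, number)
        )
    return lines


for line in karyotype_lines([56831624, 48577505, 45779781]):
    print(line)
```

In practice the lengths would come from a FASTA index or similar, and the output would be written to the file named by the `karyotype` parameter.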

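[x] Adding band information to the chromosomes (optional)

[x] Band information is usually added to the karyotype file so that the chromosomes look stained. The format matches the chr lines and also has seven columns. DOMAIN is simply the chromosome ID from the karyotype; the other fields are defined as before. An example follows.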

band    DOMAIN  ID  LABEL   START   END COLOR
band    hs1     p36.33  p36.33  0   2300000     gneg
band    hs1     p36.32  p36.32  2300000     5400000     gpos25
band    hs1     p36.31  p36.31  5400000     7200000     gneg
band    hs1     p36.23  p36.23  7200000     9200000     gpos25
band    hs1     p36.22  p36.22  9200000     12700000    gneg

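[x] Example of a cytogenetic band name: 1p36.33
[x] Naming rule: the leading number or letter is the chromosome, usually a number or X, Y. It is followed by p or q, where p is the short arm and q the long arm. Bands differ in staining intensity, distinguished here mainly by gpos and gneg. To match real staining more closely, Circos also defines a series of grey levels.
- Additional data files for your own tracks

Take an SNP density track as an example:

#chromosome ID   start    end    value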
Gmax01  0   999999  0.000301
Gmax01  1000000     1999999     0.001321
Gmax01  2000000     2999999     0.001050
Gmax01  3000000     3999999     0.003027

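File content explained:

Chromosome ID: must match the chromosome IDs in the karyotype file, otherwise plotting fails;
start, end: the interval being summarized;
value: the value for the interval. Here I am plotting SNP density, so it is the SNP density within this chromosome interval;
           for counts, it would be the number of items.

Every track has its own data, and different plot types may need different data layouts, so prepare the data files to match what you want to draw.
The formats are easy to understand: for any plot type, the official site documents the expected data layout, and data prepared in that layout will work.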
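A density track like the one above can be computed from raw SNP positions. The sketch below is an illustration under stated assumptions: positions are 0-based, windows are 1 Mb as in the example, and density is SNPs per base in the window. `snp_density` is a hypothetical helper.

```python
from collections import Counter


def snp_density(positions, chrom, chrom_length, window=1000000):
    """Return 'chrom start end value' rows, one per fixed-size window."""
    counts = Counter(pos // window for pos in positions)
    rows = []
    for start in range(0, chrom_length, window):
        end = min(start + window, chrom_length) - 1  # inclusive end coordinate
        density = counts[start // window] / float(end - start + 1)
        rows.append("{} {} {} {:.6f}".format(chrom, start, end, density))
    return rows


for row in snp_density([5, 10, 1500000], "Gmax01", 2000000):
    print(row)
```

The last window is shortened to the chromosome end, so its density is still normalized by the true window length.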

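2. `ideogram.conf`: drawing the chromosomes (no extra data file needed)

Each chromosome is drawn on the circle as a shape called an ideogram.
For more detail, see: circos绘图ideogram.conf文件的配置 and Circos系列教程（二）染色体示意图ideograms (which explains many fine adjustments)

<ideogram>

### Spacing between ideograms
<spacing>
# Gap between chromosomes in the circle; here each gap is 0.5% of the circumference
default = 0.005r

# The gap between two specific chromosomes can also be set
#
# The setting below makes the gap between the two chromosomes about 20 degrees of the circle.
#spacing = 20r
#
</spacing>

## Ideogram settings
# Position of the ideograms; here they sit at 90% of the distance from the center to the image edge. Lowering the value shrinks the chromosome circle and leaves more margin.
radius           = 0.90r
# Thickness of the ideograms, in r (relative) or p (pixels)
thickness        = 20p
# Whether to fill the ideograms. The fill color comes from the last column of the karyotype file.
fill             = yes
# Color and thickness of the ideogram outline. Without this parameter, or with thickness 0, no outline is drawn.
stroke_color     = dgrey
stroke_thickness = 2p

## Label display
# Whether to show labels. The label is column 4 of the karyotype file. If yes, label_radius must also be set, otherwise Circos reports an error and produces no image.
show_label       = yes
# Label font
label_font       = default
# Label position
label_radius     = 1r+90p
# Label font size
label_size       = 40
# Label orientation; yes gives the orientation that is easy to read.
label_parallel   = yes

</ideogram>

Chromosome labels: LABELS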

3. `ticks.conf`: showing chromosome size as ticks (no extra data file needed)

Chromosome sizes are shown on the circle as tick marks.

# Whether to show ticks
show_ticks         = yes
# Whether to show tick labels
show_tick_labels    = yes

## Tick settings
<ticks>
## Settings shared by all ticks
# Tick position
radius           = 1r
# Tick color
color            = black
# Tick thickness
thickness        = 2p
# How tick labels are computed: the position of the tick is multiplied by multiplier to give the label value shown on the circle.
multiplier       = 1e-6
# Label format: %d gives an integer; %f a float; %.1f one decimal place; %.2f two decimal places.
format           = %d

## Two tick levels follow: minor ticks first, then major ticks.
<tick>
# Distance between ticks. With unit u, chromosomes_units must be set. For example, with chromosomes_units = 1000000, 5u below means one tick every 5 Mb of sequence.
spacing        = 5u
# Tick length
size           = 10p
</tick>

<tick>
spacing        = 25u
size           = 15p
# These are the major ticks, so the settings below show their labels.
show_label     = yes
# Tick label font size
label_size     = 20p
# Distance between the label and the tick
label_offset   = 10p
format         = %d
</tick>
</ticks>

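4. `links.conf`: curves that connect related regions of the genome (requires a links file)

Regions of the genome that are related to one another are joined by lines on the circle. A common case is connecting repeated sequences.
For more detail, see: Circos系列教程（四）连线 links

Data structure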

LABEL	ID	START	END
segdup00001	hs1	465	30596
segdup00001	hs2	114046768	114076456
segdup00002	hs1	486	76975
segdup00002	hs15	100263879	100338121

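LABEL is the name of the link. Two points define a line, so each link name normally appears on two rows, one for each end of the link. ID is the chromosome ID from the karyotype file; START and END give the start and end positions.

Or: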

ID1    START    END    ID2    START    END
A01 27231996 27234600 D01 19250453 19253070
A03 98657717 98660412 A13 3532356 3534210
A03 97671503 97673317 D02 64857836 64860545
A03 98657717 98660412 D02 65834252 65836940
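The two layouts carry the same information, so one can be converted into the other. A minimal sketch, assuming each label occurs exactly twice in the two-row layout; `pair_links` is a hypothetical helper and the sample rows are the segdup00001 rows shown above.

```python
def pair_links(rows):
    """Turn (label, chrom, start, end) rows into six-column link tuples."""
    first_end = {}
    links = []
    for label, chrom, start, end in rows:
        if label in first_end:
            # Second row with this label: it closes the link.
            links.append(first_end.pop(label) + (chrom, start, end))
        else:
            first_end[label] = (chrom, start, end)
    return links


rows = [
    ("segdup00001", "hs1", 465, 30596),
    ("segdup00001", "hs2", 114046768, 114076456),
]
for link in pair_links(rows):
    print(" ".join(str(field) for field in link))
```

Labels that appear only once are silently left out of the result; a stricter version could report them.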

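Configuration file

<links>
<link>
# Path of the link file, whose format is:
# chr1  start   end     chr2    start   end
# hs1   465     30596   hs2     114046768       114076456
# Each line says the two regions are related, e.g. the sequences are longer than 1 kb and at least 90% similar.
file          = data/5/segdup.txt
# Radius at which the link curves end
radius        = 0.8r
# Bezier curve radius; large values flatten the curves and make the figure less attractive.
bezier_radius = 0r
# Link color
color         = black_a4
# Link thickness
thickness     = 2

<rules>
# Several rules can be defined to filter or restyle each line of the link file. Every rule has a condition parameter; when the condition is true, the remaining rules are not tested unless flow = continue is set.

# Do not draw links that stay within one chromosome
<rule>
condition     = var(intrachr)
show          = no
</rule>

# Color each link like its ideogram instead of using one uniform color.
<rule>
# The condition is true, so the block is always applied
condition     = 1
# Use the color of the second chromosome, i.e. the chromosome named in column 4 of the link file
color         = eval(var(chr2))
# Although the condition is true, go on to test the next rule
flow          = continue
</rule>

# If a link starts on hs1, its start radius is 0.99r
<rule>
condition     = from(hs1)
radius1       = 0.99r
</rule>

# If a link ends on hs1, its end radius is 0.99r
<rule>
condition     = to(hs1)
radius2       = 0.99r
</rule>
</rules>
</link>
</links>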

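5. `plots_histogram.conf`: showing data as histograms (requires extra data)

Quantities such as GC content or expression level along the genome are drawn as histograms on the circle. Two histograms are drawn below, one with a background and one with grid lines.

<plot>
# Plot type: histogram
type = histogram
# Data file format, 4 columns:
# chromosome	start	end	data
# hs1	0	1999999	180.0000
file = data/5/segdup.hs1234.hist.txt
# Position of the histogram; r1 must be larger than r0. By default the histogram points outward.
r1   = 0.88r
r0   = 0.81r
# Fill color of the histogram
fill_color = vdgrey
# The outline is 1px thick by default; for no outline set the thickness to 0, or change it in etc/tracks/histogram.conf.
thickness = 0p
# A histogram is made of bins (bars). If the bins are not contiguous, it is best not to join them. For example:
# hs1 10 20 0.5
# hs1 30 40 0.25
# For this data, yes and no give different pictures.
extend_bin = no

# Rule: draw no histogram on hs1.
<rules>
<<include exclude.hs1.rule>>
</rules>

# Background colors of the histogram
<backgrounds>
show  = data

<background>
color = vvlgrey
</background>

<background>
color = vlgrey
y0    = 0.2r
y1    = 0.5r
</background>

<background>
color = lgrey
y0    = 0.5r
y1    = 0.8r
</background>

<background>
color = grey
y0    = 0.8r
</background>
</backgrounds>
</plot>

<plot>
type = histogram
# Column 4 of this data file holds several comma-separated values, which gives a stacked histogram.
file = data/5/segdup.hs1234.stacked.txt
r1   = 0.99r
r0   = 0.92r
# Fill the 4 values with different colors, in order
fill_color  = hs1,hs2,hs3,hs4
thickness = 0p
orientation = in
extend_bin  = no

<rules>
<<include exclude.hs1.rule>>
</rules>

# Grid lines in the histogram
<axes>
show = data
thickness = 1
color     = lgrey

<axis>
spacing   = 0.1r
</axis>

<axis>
spacing   = 0.2r
color     = grey
</axis>

<axis>
position  = 0.5r
color     = red
</axis>

<axis>
position  = 0.85r
color     = green
thickness = 2
</axis>
</axes>
</plot>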

6. `plots_heatmap.conf`: showing data as heat maps (requires extra data)

A heat map suits regions that carry several sets of values, such as gene expression levels.

<plot>
# Plot type: heat map
type  = heatmap
# Path of the data file.
# File format: 5 columns
# chrID start   end     data    class
# hs1 0 1999999 113.0000 id=hs1
# hs1 0 1999999 40.0000 id=hs4
# hs1 0 1999999 20.0000 id=hs2
# hs1 0 1999999 7.0000 id=hs3
file  = data/5/segdup.hs1234.heatmap.txt
# Position of the track
r1    = 0.89r
r0    = 0.88r
# Heat map colors: hs1 plus five versions of it with different transparency.
color = hs1_a5,hs1_a4,hs1_a3,hs1_a2,hs1_a1,hs1
# The scale_log_base parameter. Each cell of the heat map represents a value and is colored accordingly.
# Values in [min,max] map onto a list of colors c[n], n=0..N, as follows:
# f = (value - min) / ( max - min )
# n = N * f ** (1/scale_log_base)
# These two formulas give the color index n.
# If scale_log_base = 1, the mapping from value to color is linear;
# if scale_log_base > 1, more of the color range is devoted to small values;
# if scale_log_base < 1, more of the color range is devoted to large values.
scale_log_base = 5

<rules>
<<include exclude.hs1.rule>>

# Show only the data with id = hs1
<rule>
condition = var(id) ne "hs1"
show      = no
</rule>
</rules>
</plot>

<plot>
type  = heatmap
file  = data/5/segdup.hs1234.heatmap.txt
r1    = 0.90r
r0    = 0.89r
color = hs2_a5,hs2_a4,hs2_a3,hs2_a2,hs2_a1,hs2
scale_log_base = 5

<rules>
<<include exclude.hs1.rule>>

<rule>
condition = var(id) ne "hs2"
show      = no
</rule>
</rules>
</plot>

<plot>
type  = heatmap
file  = data/5/segdup.hs1234.heatmap.txt
r1    = 0.91r
r0    = 0.90r
color = hs3_a5,hs3_a4,hs3_a3,hs3_a2,hs3_a1,hs3
scale_log_base = 5

<rules>
<<include exclude.hs1.rule>>

<rule>
condition = var(id) ne "hs3"
show      = no
</rule>
</rules>
</plot>

<plot>
type  = heatmap
file  = data/5/segdup.hs1234.heatmap.txt
r1    = 0.92r
r0    = 0.91r
color = hs4_a5,hs4_a4,hs4_a3,hs4_a2,hs4_a1,hs4
scale_log_base = 5

<rules>
<<include exclude.hs1.rule>>

<rule>
condition = var(id) ne "hs4"
show      = no
</rule>
</rules>
</plot>
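The scale_log_base formula quoted in the heat map comments can be tried out numerically. The sketch below only evaluates that formula; treating N as the last 0-based color index and rounding to the nearest integer are assumptions, since the comments do not say how Circos turns n into a list position. `color_index` is a hypothetical helper.

```python
def color_index(value, vmin, vmax, n_colors, scale_log_base=1.0):
    """Evaluate n = N * f ** (1 / scale_log_base) for one value."""
    f = (value - vmin) / float(vmax - vmin)
    top = n_colors - 1  # assumption: N is the last 0-based index
    return int(round(top * f ** (1.0 / scale_log_base)))


# Six colors, as in the color lists above; value 20 on a 0..100 scale.
linear = color_index(20, 0, 100, 6, scale_log_base=1)
stretched = color_index(20, 0, 100, 6, scale_log_base=5)
print(linear, stretched)
```

With a base of 5 the same small value lands on a later color than with a base of 1, which is the sense in which a base above 1 spends more of the color range on small values.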

7. `plots_text.conf`: showing data as text (requires extra data)

Text tracks are used to show labels such as gene names on the circle.
For an illustrated explanation of the parameters, see: 6 — 2D DATA TRACKS

<plot>
# Plot type: text
type  = text
# Path of the data file
file  = data/6/genes.labels.txt
# Position in the figure; r1 must be larger than r0, and r0 = 1r puts the labels outside the circle.
# The space between r0 and r1 is where labels are drawn, so long labels need a larger gap or they will not be shown.
# Alternative placement inside the circle:
#r1    = 0.8r
#r0    = 0.6r
r0    = 1r
r1    = 1r+800p
# Label font
label_font = light
# Label size
label_size = 12p
# Margin around the text; if it is too small, neighboring words may run together.
# padding  - text margin in angular direction
# rpadding - text margin in radial direction
rpadding   = 5p
# Whether to draw a line in front of each label pointing to its position.
show_links     = no
link_dims      = 0p,2p,5p,2p,2p
link_thickness = 2p
link_color     = black

<rules>
<<include exclude.hs1.rule>>

# Rules that restyle labels containing the letter a or b
<rule>
condition  = var(value) =~ /a/i
label_font = bold
flow       = continue
</rule>

<rule>
condition  = var(value) =~ /b/i
color      = blue
</rule>
</rules>
</plot>

Generating your own text data file

cat exon-ixon.gff3 | grep "mRNA" | \grep -v "scaff" | cut -f 1,4,5,9 | cut -f 1 -d\; | sed "s/ID=//g" | \
sort -k1,1V | sed "s/\t/ /g" > ../bin/circos-0.69-4/xzp/Gh.cysteine_proteinase.text.txt

For more ways to adjust this track, see the directory /public/home/zpxu/bin/circos-0.69-4/circos-tutorials-0.67/tutorials/6/7.

`label_snuggle` explained

`max_snuggle_distance` explained

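8. `rules.conf`: frequently used rule definitions

9. `circos.conf`: the main configuration file

Including the configuration files above in the main file circos.conf draws the corresponding tracks. Global settings can also be placed here.

# Whether to show each kind of extra information
show_links      = yes
show_highlights = yes
show_text       = yes
show_heatmaps   = yes
show_scatter    = yes
show_histogram  = yes
# Pull in the karyotype settings from an external file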
<>

### Draw the plots


<>
<>
<>



<>
<>
<>
<>

################################################################
# Include the required standard parameters that rarely change

<>

<>
<>

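Units in Circos configuration
There are four units: p, r, u, b
p is pixels; 1p is 1 pixel.
r is relative size; 0.95r is 95% of the ring size.
u is a multiple of chromosomes_units; if chromosomes_units = 1000, then 1u is 1,000 bases.
b is bases; on a 1 Mb chromosome, 1b is one millionth of its length.

Special plot types (with links)

Tips

The radius setting in ideogram.conf controls how much of the canvas the figure occupies (filling the canvas or concentrated in the center).
Picking RGB colors: Html color codes
Picking a given number of clearly distinct colors:
- How to generate a number of most distinctive colors in R?
- i want hue (recommended: a web-based visual tool)

Reference links

Circos教程(二):基础使用
陈连福：Circos的安装和简单使用
Circos tutorial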

Single-cell amplification: LIANTI

2017-05-28T09:39:13.000Z

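We have looked at single-cell whole-genome amplification before. Recently Prof. Xie's group at Peking University set a new standard for it: Single-cell whole-genome analyses by Linear Amplification via Transposon Insertion (LIANTI). A brief reading of the paper follows.

Linear amplification via transposon insertion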

Linear Amplification via Transposon Insertion (LIANTI)
Combines Tn5 transposition and T7 in vitro transcription for single-cell genomic analyses;

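Exponential vs linear amplification

Linear amplification is preferable to exponential amplification for two reasons:

Copy number

For example, in the figure above, suppose DNA fragments A and B have per-round replication yields of 100% and 70%, start at A/B = 1:1, and fragment A must reach a final amplification factor of about 10,000.
With exponential amplification, where the products of one round serve as templates for the next, this takes about 13 rounds ($2^{13} = 8{,}192$; $2^{14} = 16{,}384$). B then reaches $1.7^{13} \approx 990$, so A/B ≈ 8:1.
Linear amplification copies only the original template; the copies are set aside and never re-enter amplification. It therefore takes 10,000 rounds, after which B reaches $0.7 \times 10{,}000 = 7{,}000$ and A/B = 1:0.7.
*When such amplification is used to study copy-number variation (CNV), exponential amplification introduces fatal errors.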
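The arithmetic in this comparison can be checked directly. The sketch below just recomputes the numbers quoted above under the same simplifying assumptions (fixed per-round yields of 100% and 70%, 13 exponential rounds, 10,000 linear rounds); the function names are illustrative.

```python
def exponential_copies(yield_per_round, rounds):
    """Each round multiplies the pool by (1 + yield): products are re-copied."""
    return (1 + yield_per_round) ** rounds


def linear_copies(yield_per_round, rounds):
    """Each round adds `yield` copies made from the original template only."""
    return yield_per_round * rounds


a_exp = exponential_copies(1.0, 13)
b_exp = exponential_copies(0.7, 13)
a_lin = linear_copies(1.0, 10000)
b_lin = linear_copies(0.7, 10000)

print("exponential A/B = {:.1f}".format(a_exp / b_exp))
print("linear A/B      = {:.2f}".format(a_lin / b_lin))
```

A 30% difference in yield grows into roughly an eightfold difference under exponential amplification, while under linear amplification the ratio stays at the yield ratio itself.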

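Fidelity

A high-fidelity polymerase with an error rate of $10^{-7}$ that copies the human genome once ($3 \times 10^{9}$ bp) will in theory introduce about 300 random base errors. In exponential amplification the products carrying these 300 errors become templates for the next round, which may add another 300 errors on top while carrying the earlier ones forward. Round after round the errors are propagated and amplified, which produces false positives when calling SNVs.
In linear amplification the template is always the original one, so errors appear at different random positions in each copy and are easily removed by comparing amplification products with one another.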

SNP Vs SNV

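Single nucleotide polymorphism (SNP) and single nucleotide variant (SNV). A SNP is a polymorphism caused by variation (substitution, insertion or deletion) of a single nucleotide at the same position in the genomic DNA of different individuals, that is, a single-nucleotide difference at the same position between species or individuals. Loci and sequences with such differences can serve as markers for genome mapping. In the human genome a SNP occurs on average about once every 1,000 nucleotides; some SNPs may be associated with disease, but most probably are not. SNPs are an important basis for studying genetic variation in human families and in animal and plant strains. In cancer genomics, a single-nucleotide change that is specific to the tumor relative to normal tissue is a somatic mutation and is called a SNV.

SNP (single nucleotide polymorphism) vs. SNV (single nucleotide variant): as their names suggest, both are concerned with aberrations at a single nucleotide. However, a SNP is when an aberration is expected at the position for any member in the species, for example a well characterized allele. A SNV, on the other hand, is when there is a variation at a position that hasn't been well characterized, for example when it is only seen in one individual. It is really all a question of frequency of occurrence.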

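How the amplification works

1. In LIANTI, genomic DNA from a single cell is randomly fragmented (to about 400 bp) as Tn5 transposase inserts the LIANTI transposon.

LIANTI transposon: contains a 19 bp double-stranded transposase binding site and a single-stranded T7 promoter loop;
LIANTI transposon DNA (5’/Phos/CTGTCTCTTATACACATCTGAACAGAATTTAATACGACTCACTATAGGGAGATGTGTATAAGAGACAG-3’, IDT oligo with PAGE purification);
Equimolar amounts of LIANTI transposon and Tn5 transposase are mixed to form the dimeric LIANTI transposome.

2. The fragmented genomic DNA is tagged with T7 promoters and then linearly amplified by in vitro transcription into thousands of RNA copies. Reverse transcription from the 3’ end and synthesis of the complementary second strand then give double-stranded LIANTI amplicons for DNA library preparation.

Genomic DNA from the single cell is randomly fragmented and tagged with the LIANTI transposon. DNA polymerase then fills in the single-stranded T7 promoter loops at both ends of each fragment, converting them into double-stranded T7 promoters;
T7 RNA polymerase transcribes the genomic DNA in vitro into genomic RNA, amplifying it linearly. The transcribed RNA can loop back on itself at the 3’ end so that the single-stranded transposase binding site becomes double-stranded;
After reverse transcription, RNase digestion and second-strand synthesis, the double-stranded LIANTI amplicons carry unique molecular barcodes and represent the amplified product of the original single-cell genomic DNA, ready for library preparation and high-throughput sequencing.

Compared with other whole-genome amplification methods, LIANTI eliminates nonspecific priming and exponential amplification, and so greatly reduces amplification bias and errors.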

Iterators explained simply

2017-04-03T04:56:16.000Z

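Core concepts

Iterable: an object that can be used directly in a for loop.
Iterator: an object that can be passed to next() and keeps returning the next value.
Any Iterable can be turned into an Iterator with the built-in iter() function.

Iterators

Points to keep in mind about iterators:
1. An iterator cannot be reused: once exhausted it is empty, and calling next() on it again raises StopIteration.
Depending on the iterator type and Python version, copy.deepcopy can duplicate an iterator so it can be traversed again, but generators cannot be copied this way; itertools.tee is the more general alternative.
2. An iterator is one way of accessing the elements of a collection. It starts at the first element and finishes once every element has been visited;
3. An iterator cannot go back; it only moves forward;
4. For data structures with native random access (such as tuple and list), an iterator has no advantage over a classic index-based for loop and in fact loses the index (the built-in enumerate() gets it back). For data structures without random access (such as set), iteration is the only way to reach the elements;
enumerate() pairs each item of the iteration with its index and yields them as tuples: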

>>> lst = [5,6,7]
... for idx, ele in enumerate(lst):
...     print idx, ele
...     
0 5
1 6
2 7

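5. Another advantage of iterators is that the elements need not all exist up front. An element is computed only when iteration reaches it; before or after that point it may not exist or may already have been discarded. This suits very large or infinite collections, such as multi-gigabyte files or the Fibonacci sequence. The property is called lazy evaluation;
6. More importantly, iterators provide a uniform interface for accessing collections: any object that implements the `__iter__()` method can be traversed with an iterator.

Iterator operations

Use the built-in iter(iterable) to obtain an iterator and next(iterator) to get the next element;
The common built-in data structures tuple, list, set and dict all support iteration, and so do strings.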
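A short sketch of iter() and next() in action, including what happens once the iterator is exhausted and how itertools.tee gives two independent passes. It is written with print-free assignments so that it behaves the same on Python 2.7 and Python 3; the variable names are illustrative.

```python
import itertools

numbers = [5, 6, 7]

it = iter(numbers)   # obtain an iterator from an iterable
first = next(it)     # 5
rest = list(it)      # consumes what is left: [6, 7]

# The iterator is now exhausted; another next() raises StopIteration.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# tee() splits one iterator into independent iterators over the same items.
a, b = itertools.tee(iter(numbers))
pass_one = list(a)
pass_two = list(b)
```

After tee() the original iterator should no longer be used directly, because advancing it would make the two copies miss items.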

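The `itertools` module

Python's built-in itertools module provides very useful functions for working with iterable objects.

"Infinite" iterators
An infinite sequence only runs forever when a for loop iterates over it; merely creating the iterator does not generate the infinite elements in advance.

count

count() creates an infinite iterator:

>>> import itertools
>>> i = 0
>>> for item in itertools.count(100):
...     if i > 10:
...         break
...     print item,
...     i = i + 1
...
100 101 102 103 104 105 106 107 108 109 110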

cycle

cycle() repeats the sequence passed to it forever:

>>> import itertools
>>> cs = itertools.cycle('ABC') # note that a string is also a sequence
>>> for c in cs:
...     print c
...
A
B
C
A
B
C
...

repeat

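repeat(elem [,n]) repeats one element forever, unless the second argument limits the number of repetitions.
Note the effect of the trailing comma after print in the first example: it keeps the output on one line.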

>>> import itertools  
... listone = ['a','b','c']  
... for item in itertools.repeat(listone,3):  
...     print item,  
...     
['a', 'b', 'c'] ['a', 'b', 'c'] ['a', 'b', 'c']
>>> import itertools  
... listone = ['a','b','c']  
... for item in itertools.repeat(listone,3):  
...     print item
...     
['a', 'b', 'c']
['a', 'b', 'c']
['a', 'b', 'c']

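Infinite sequences can be iterated forever, but in practice a function such as takewhile() is used to cut out a finite sequence according to a condition:

>>> naturals = itertools.count(1)
>>> ns = itertools.takewhile(lambda x: x <= 10, naturals)
>>> for n in ns:
...     print n
...
(prints 1 through 10)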

Functions that operate on iterators

chain

chain() joins several iterables end to end into one larger iterator:

>>> import itertools  
... listone = ['a','b','c']  
... listtwo = ['11','22','abc']  
... for item in  itertools.chain(listone,listtwo):  
...     print item
...     
a
b
c
11
22
abc

ifilter

ifilter(fun, iterator) returns an iterator over the items for which fun returns True:

>>> import itertools  
... def funLargeFive(x):  
...     if x > 5:  
...         return True  
...       
... for item in itertools.ifilter(funLargeFive,range(-10,10)):  
...     print item,  
...     
6 7 8 9

imap

imap(fun, iterator) returns an iterator that calls fun on every item of iterator:

>>> import itertools  
... listthree = [1,2,3]  
... def funAddFive(x):  
...     return x + 5  
... for item in itertools.imap(funAddFive,listthree):  
...     print item,  
...     print type(item)
...
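6 <type 'int'>
7 <type 'int'>
8 <type 'int'>

imap() differs from map() in that imap() can work on infinite sequences, and when two sequences differ in length it stops at the shorter one.
Compare with calling map directly: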

>>> listthree = [1,2,3]  
>>> def funAddFive(x):     
...    return x + 5  
... 
>>> map(funAddFive,listthree)
[6, 7, 8]
>>> type(map(funAddFive,listthree))
list

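imap() returns an iterator, whereas map() returns a list. When map() returns, every result has already been computed. Calling imap() computes nothing; only when a for loop iterates over it is the next element computed, one per pass. This is lazy evaluation: results are computed only when they are needed.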

islice

itertools.islice(iterable, stop)
itertools.islice(iterable, start, stop[, step])
Returns an iterator that slices the iterable from start up to stop in steps of step:
If start is None, then iteration starts at zero. If step is None, then the step defaults to one.

>>> import itertools  
... listone = ['a','b','c']  
... listtwo = ['11','22','abc']  
... listthree = listone + listtwo  
... for item in itertools.islice(listthree,3,5):  
...     print item,  
...     
11 22

izip

izip(*iterator) returns an iterator of tuples, each combining one item from every input iterator:

>>> import itertools  
... listone = ['a','b','c']  
... listtwo = ['11','22','abc']  
... listthree = listone + listtwo  
... for item in itertools.izip(listone,listtwo):  
...     print item, 
...     print type(item)
...     
('a', '11') <type 'tuple'>
('b', '22') <type 'tuple'>
('c', 'abc') <type 'tuple'>

groupby()

groupby() collects adjacent repeated elements of an iterator into groups:

>>> for key, group in itertools.groupby('AAABBBCCAAA'):
...     print key, list(group) # why is list() needed here?
...
A ['A', 'A', 'A']
B ['B', 'B', 'B']
C ['C', 'C']
A ['A', 'A', 'A']

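The grouping rule is actually a function: two adjacent elements belong to the same group whenever the function returns equal values for them, and that return value becomes the group's key. To group case-insensitively, make 'A' and 'a' return the same key: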

>>> for key, group in itertools.groupby('AaaBBbcCAAa', lambda c: c.upper()):
...     print key, list(group)
...
A ['A', 'a', 'a']
B ['B', 'B', 'b']
C ['c', 'C']
A ['A', 'A', 'a']
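The question in the first groupby example, why list() is needed, comes down to laziness: each group is itself an iterator that shares the underlying data with groupby, so it must be consumed before the loop moves on to the next key. A small sketch of both cases, with illustrative variable names:

```python
import itertools

# Consuming each group immediately with list() keeps its contents.
kept = [(key, list(group)) for key, group in itertools.groupby("AAABBBCCAAA")]

# Storing the group iterators and reading them later loses data:
# by then groupby has already moved past the elements of earlier groups.
stale = [group for _, group in itertools.groupby("AAABBB")]
first_group_later = list(stale[0])
```

Here `kept` holds the four groups in full, while `first_group_later` comes back empty because the "A" group was not read while it was current.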

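Generator expressions and list comprehensions

1. (x+1 for x in lst) # generator expression, returns an iterator. The outer parentheses can be omitted when it is the sole argument of a call.
2. [x+1 for x in lst] # list comprehension, returns a list
Because an iterator does not compute every element up front, it is more flexible and avoids much unnecessary work, so prefer a generator expression unless you specifically need a list.
An if clause filters the elements:

(x+1 for x in lst if x!= 0)

Several for clauses give nested loops, nested in the order the for clauses are written:

((x,y) for x in range(3) for y in range(x))

Use cases

1. What if the action applied to each element is too complex for a single expression?
Wrap the action in a function with def and call it in the comprehension;
2. What if the condition in the if clause needs a computation and the result needs the same computation, and you do not want to compute it twice?
Nest the expressions: [x for x in (y+1 for y in lst) if x >0]. The inner variable could also be called x, but y is used here for clarity.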
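The nested form in use case 2 can be checked by counting how often the costly step runs. In this sketch `expensive` is a stand-in for any costly computation and `calls` only records its invocations; both names are illustrative.

```python
calls = []


def expensive(y):
    """Stand-in for a costly computation; records each call."""
    calls.append(y)
    return y + 1


lst = [-3, 0, 2]

# The inner generator computes each value once; the outer comprehension
# both filters on it and keeps it.
result = [x for x in (expensive(y) for y in lst) if x > 0]
```

Writing `[expensive(y) for y in lst if expensive(y) > 0]` instead would call the function twice for every element that passes the filter.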

Closing note

A recommended, charmingly illustrated article on Iterators, Iterables and Generators: How to train your Python

References:

http://python.jobbole.com/81916/

AstralWind: Python函数式编程指南（三）：迭代器

http://www.cnblogs.com/huxi/archive/2011/07/01/2095931.html

http://blog.csdn.net/xiaocaiju/article/details/6968123

Liao Xuefeng's official website: itertools

The right way to read files in Python

2017-04-02T12:39:50.000Z

For basic file operations in Python, see the Python File I/O part of a basic Python tutorial. What follows summarizes memory-efficient ways of handling large and small files.

Small files

The with statement (recommended)

The with statement handles opening and closing the file, including if an exception is raised in the inner block.

with open('myfile') as f:
    for line in f:
        do_something(line)

readlines/readline

for line in open('myfile','r').readlines():
    do_something(line)

The difference between the two: readlines() returns a list of all lines, whereas readline() returns a single line as a string.

>>> with open('zsq.txt') as f:
...     lines = f.readlines()
...     print type(lines)
<type 'list'>
>>> with open('zsq.txt') as f:
...     lines = f.readline()
...     print type(lines)
<type 'str'>

Large files

fileinput

import fileinput

for line in fileinput.input(['myfile']):
    do_something(line)

A fileinput.input() call reads lines sequentially and does not keep them in memory after they have been read.

Combined with with to process several files (fileinput.input() can be used as a context manager only in Python 3.2 and later):

with fileinput.input(files=('spam.txt', 'eggs.txt')) as f:
    for line in f:
        process(line)

buffer

file_path = "input.txt"

# Accumulate the file content five characters at a time
content = "Read buffer:\n"
with open(file_path, 'rU') as handle:
    while True:
        chunk = handle.read(5)
        if not chunk:
            break
        content += chunk

print content
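The loop above still collects the whole file in one string, so it does not save memory by itself. For large files the same fixed-size read is usually wrapped in a generator, so that only one chunk is held at a time. A minimal sketch; `read_in_chunks` and the chunk size are illustrative choices, and an in-memory `io.StringIO` stands in for a real file.

```python
import io


def read_in_chunks(handle, chunk_size=1024 * 1024):
    """Yield successive chunks from an open file until it is exhausted."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Demonstration on an in-memory file; a real call would pass open(path).
handle = io.StringIO(u"abcdefghij")
chunks = list(read_in_chunks(handle, chunk_size=4))
```

With a real file the caller would loop over `read_in_chunks(f)` inside a `with open(...) as f:` block and process each chunk as it arrives.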

Source

http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python?noredirect=1&lq=1

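Python tips roundup (continuously updated)

2017-03-25T08:17:03.000Z

The tips below are based on Python 2.7; they have not been verified on other versions.

Collapsing a multi-line string into a single line

string = 'this is \n a \t example'
string = ' '.join(string.split())

Called without arguments, split() splits on runs of whitespace (\n, \t, spaces and so on); the pieces are then joined back together with single spaces. You can also pass an explicit separator, for example string.split(';').
You can wrap this in a small helper, here with a lambda expression:

processFunc = lambda s: " ".join(s.split())
string = processFunc(string)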

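Replacing

s.replace("\n", "") returns a copy of the string in which every occurrence of the first argument is replaced by the second; the original string is unchanged.

>>> a = "hope"
>>> a.replace("h", "bioh")
'biohope'

String to list

s = '1,2,3'
arr = s.split(',')

List to string

arr = ['a', 'b']
s = ','.join(arr)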

global vs. local

Python’s LEGB scope :Local -> Enclosed -> Global -> Built-in

x = 0
def in_func():
    global x   # rebind the module-level x instead of creating a new local variable
    x = 1
    print('in_func:', x)
    
in_func()
print('global:', x)
>>>
in_func: 1
global: 1

local vs. enclosed

The nonlocal keyword used below exists only in Python 3.

def outer():
       x = 1
       print('outer before:', x)
       def inner():
           nonlocal x  #modify the x variable in the enclosed scope
           x = 2
           print("inner:", x)
       inner()
       print("outer after:", x)
outer()
>>>
outer before: 1
inner: 2
outer after: 2

List comprehensions vs generators

List comprehensions are fast, but generators are faster!?
1. use lists if you want to use the plethora of list methods
2. use generators when you are dealing with huge collections to avoid memory issues

def plainlist(n=100000):
    my_list = []
    for i in range(n):
        if i % 5 == 0:
            my_list.append(i)
    return my_list

def listcompr(n=100000):
    my_list = [i for i in range(n) if i % 5 == 0]
    return my_list

def generator(n=100000):
    my_gen = (i for i in range(n) if i % 5 == 0)
    return my_gen

def generator_yield(n=100000):
    for i in range(n):
        if i % 5 == 0:
            yield i
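To make the memory argument concrete, the sketch below compares the size of a list with that of an equivalent generator expression. It is written for Python 3 and uses only sys.getsizeof from the standard library; the variable names are illustrative.

```python
import sys

n = 100000
as_list = [i for i in range(n) if i % 5 == 0]
as_gen = (i for i in range(n) if i % 5 == 0)

# The list stores every element; the generator stores only its current state.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))  # True

# Both produce the same values when consumed.
total_list = sum(as_list)
total_gen = sum(as_gen)
print(total_list == total_gen)  # True

# A generator is exhausted after a single pass.
leftover = sum(as_gen)
print(leftover)  # 0
```

Note that sys.getsizeof reports only the container itself, not the integers it refers to, so the real gap is even larger than the printed comparison suggests.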

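assert

Before a program is finished we do not know where it will go wrong. Rather than letting it crash late in a run, it is better to have it fail as soon as an error condition appears, and that is what assert is for.
An assert statement declares that an expression must be true; if the expression evaluates to false, an AssertionError is raised. You can read assert as "raise if not".
See: What assert does in Python

>>> assert 2==1, '2 is not equal to 1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: 2 is not equal to 1
>>> assert 2==2, '2 is not equal to 1'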

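The set data type

Methods of set and frozenset in Python and how they differ
set (mutable) vs. frozenset (immutable)
A set is unordered and holds no duplicates. It is mutable and offers methods such as add() and remove(); because it is mutable it is not hashable. Its basic uses are membership testing and removing duplicates. Sets also support the mathematical operations union, intersection, difference and symmetric difference.
Sets support x in set, len(set) and for x in set. As an unordered collection, a set does not record element position or insertion order, so it supports neither indexing nor other sequence-like operations.
A frozenset is a frozen set: it is immutable and hashable, so it can serve as a dictionary key or as an element of another set. The drawback is that it cannot be changed once created; it has no add() or remove() methods.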

s1 = set("qiwsir")
# set of integers
my_set = {1, 2, 3}
print(my_set)
my = set(['Gh_A01G0993', 'Gh_A03G0561', 'Gh_A03G0561'])   # duplicates are dropped
print(my)
# set of mixed datatypes
my_set = {1.0, "Hello", (1, 2, 3)}

set Operations

Python Sets

>>> A = {1, 2, 3, 4, 5}
>>> B = {4, 5, 6, 7, 8}
>>> # use the | operator for union
>>> print(A | B)
set([1, 2, 3, 4, 5, 6, 7, 8])
>>> A |= B  # same as A = A | B
>>> A
set([1, 2, 3, 4, 5, 6, 7, 8])

_, __, ___

What does _ in Python do? [duplicate]

[_] (a single underscore) : stores previous output, like Python’s default interpreter.
[__] (two underscores): next previous.
[___] (three underscores): next-next previous.
1+1
print _
2+2
print _
3+3
print _
print __
print ___

The underscore is also used as a placeholder, or as an ordinary throwaway variable: by convention it means that you don't intend to use that value, just read it and ignore it.

x=1
for _ in range(10):
    x+=1
    print x

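A member whose name starts with a single underscore is "protected" by convention only: the name signals that it is meant for the class and its subclasses, but Python does not stop outside code from accessing it.
A member whose name starts with a double underscore is name-mangled: inside class Student, __name is stored as _Student__name, so sd.__name fails from outside the class while sd._Student__name still works.

class Student(object):
    def __init__(self, name):
        self._name = name
sd = Student("Tom")
sd._name              # 'Tom'
class Student(object):
    def __init__(self, name):
        self.__name = name
sd = Student("Tom")   # re-create the instance from the redefined class
sd.__name             # raises AttributeError
sd._Student__name     # 'Tom'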

*args and **kwargs

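The variable arguments args and kwargs of Python functions explained
*args collects any number of positional (unnamed) arguments into a tuple
**kwargs collects keyword arguments into a dict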

def foo(*args,**kwargs):
    print 'args=',args
    print 'kwargs=',kwargs
    print '**********************'
 
if __name__=='__main__':
    foo(1,2,3)
    foo(a=1,b=2,c=3)
	
# Output:
args= (1, 2, 3)
kwargs= {}
**********************
args= ()
kwargs= {'a': 1, 'c': 3, 'b': 2}
**********************

You can pass a default value to get() for keys that are not in the dictionary:
Proper way to use **kwargs in Python

val2 = kwargs.get('val2', "default value")   # fall back to a string
val2 = kwargs.get('val2', None)              # or fall back to None

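Division

>>> print 5*(2/6)
0

The result is always 0. Why?
In Python 2, dividing an integer by an integer yields an integer, so 2 divided by 6 is always 0.
C and C++ treat integer division the same way, because integers and floating-point numbers are different things and are not even divided by the same hardware unit. R, by contrast, returns 0.3333333 for 2/6.
Fixes:
1. For floating-point division, make at least one operand a float. The simplest way is to append .0, as in 2.0/6.
2. Convert explicitly: float(2)/6.
3. Put from __future__ import division at the top of the file. From Python 3.0 on the problem does not arise, because / always performs true division.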
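The behaviour after the fix can be checked directly. This short sketch assumes Python 3, where / is true division and // is floor division; the variable names are illustrative.

```python
true_quotient = 2 / 6      # true division, a float
floor_quotient = 2 // 6    # floor division, an int

print(floor_quotient)                 # 0
print(5 * floor_quotient)             # 0, the Python 2 result of 5*(2/6)
print(round(5 * true_quotient, 4))    # 1.6667
```

Use // whenever the truncated integer result is what you actually want, so the code behaves the same on both major versions.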

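Decimal places

On controlling decimal precision in Python
To keep a given number of decimal places, use the built-in round() or string formatting, e.g. "%.2f" % 2.645.
Called with a single argument, round() returns the nearest integer, much like ordinary rounding. When the value lies exactly halfway (.5), Python 3 rounds to the nearest even integer, while Python 2.7 rounds away from zero.
When a number of decimal places is given, the usual rounding rule applies as well, but results near .5 can look surprising because most decimal fractions cannot be represented exactly in binary floating point: 2.635 is stored as a value slightly below 2.635, and 2.645 as a value slightly above 2.645.

>>> round(2.635, 2)
2.63
>>> round(2.645, 2)
2.65

A Python float carries roughly 15 to 17 significant decimal digits. When a calculation needs more precision than that, use the decimal module together with getcontext().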
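A minimal sketch of the decimal approach, assuming Python 3 and only the standard library. Building Decimal values from strings (not from floats) is what keeps the typed digits exact; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

# Build Decimals from strings so the value is exactly what was typed.
cents = Decimal("0.01")
a = Decimal("2.635").quantize(cents, rounding=ROUND_HALF_UP)
b = Decimal("2.645").quantize(cents, rounding=ROUND_HALF_UP)
print(a)  # 2.64
print(b)  # 2.65

# Raise the working precision to 30 significant digits.
getcontext().prec = 30
third = Decimal(1) / Decimal(3)
print(third)  # 0.333333333333333333333333333333
```

ROUND_HALF_UP gives the schoolbook "round half up" rule; the module's default rounding mode is ROUND_HALF_EVEN.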

Rounding up and down

math.ceil(x): the smallest integer greater than or equal to x;
math.floor(x): the largest integer less than or equal to x.

List operations

What is the syntax to insert one list into another list in python?

Concatenating lists (merge)

>>> x = [1,2,3]
>>> y = [4,5,6]
>>> x + y
[1, 2, 3, 4, 5, 6]
>>> x.extend(y)
>>> x
[1, 2, 3, 4, 5, 6]

To extend a list at a specific insertion point you can use list slicing:

>>> x = [1,2,3]
>>> x[2:2] = ['a','b']
>>> x
[1, 2, 'a', 'b', 3]

List slicing is quite flexible, as it allows you to replace a range of entries in a list with a range of entries from another list:

>>> x = [1,2,3]
>>> x[1:2] = ['a','b']
>>> x
[1, 'a', 'b', 3]

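Appending with append

>>> x = [1,2,3]
>>> y = [4,5,6]
>>> x.append(y)
>>> x
[1, 2, 3, [4, 5, 6]]

Inserting with insert

>>> x = [1,2,3]
>>> x.insert(2, y)
>>> x
[1, 2, [4, 5, 6], 3]

Counting with count

>>> x = [1,2,3,2,2]
>>> x.count(2)
3

Finding a position with index

>>> x.index(2)
1

Reversing with reverse

>>> x.reverse()
>>> x
[2, 2, 3, 2, 1]

Other methods include list.remove(x), which removes the first item equal to x, list.pop([i]), which removes and returns the item at the given position, and list.sort(key=None, reverse=False). list.clear() and list.copy() are available only from Python 3.3 on.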

List Comprehensions

[(x,y) for x in [1,2,3] for y in [4,2,6] if x != y]

eval and ast.literal_eval

Using python's eval() vs. ast.literal_eval()?
eval is a built-in function that evaluates a Python expression. It makes it easy to execute a string dynamically: the string is treated as an expression, evaluated, and its result returned.

>>> eval("5+2")
7
>>> eval("[x for x in range(5)]")
[0, 1, 2, 3, 4]
>>> a = "[[1,2], [9,0]]"
>>> b = eval(a)
>>> print b
[[1, 2], [9, 0]]

Using eval "safely"
The signature is eval(expression[, globals[, locals]]). The second and third arguments are the global and local namespaces available to the expression; if they are omitted, the caller's globals() and locals() are used.
Even so, eval carries security risks; see also the dangers of eval.
Therefore, use ast.literal_eval wherever it can stand in for eval. ast.literal_eval raises an exception if the input isn't a valid Python literal, so the code won't be executed if it's not.
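A short sketch of the difference, assuming Python 3 and the standard ast module. The rejected string is only an illustration of input that eval would have executed.

```python
import ast

text = "[[1, 2], [9, 0]]"
parsed = ast.literal_eval(text)   # safe: only literals are accepted
print(parsed[1][0])  # 9

# Anything that is not a plain literal is rejected instead of executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError) as err:
    rejected = True
    print("rejected:", type(err).__name__)
```

literal_eval accepts strings, bytes, numbers, tuples, lists, dicts, sets, booleans and None, which covers most cases where eval is used just to parse data.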

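The zip function

Python odds and ends (2): the powerful zip
1. zip takes a series of iterables, packs their corresponding elements into tuples, and returns a list of those tuples (in Python 3, an iterator over them). If the arguments differ in length, the result is as long as the shortest one;
2. Combined with the * operator, zip() can unpack a list that was previously zipped;
Example: finding the largest value in a dictionary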

ages = {'John': 21,
        'Mike': 52,
        'Sarah': 12,
        'Bob': 43
       }
max(zip(ages.values(), ages.keys()))   # (52, 'Mike')
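The packing, unpacking and truncation rules can be seen in a few lines. This sketch assumes Python 3, where zip returns an iterator, hence the list() calls; the sample names are illustrative.

```python
names = ["John", "Mike", "Sarah"]
ages = [21, 52, 12]

pairs = list(zip(names, ages))
print(pairs)  # [('John', 21), ('Mike', 52), ('Sarah', 12)]

# zip(*pairs) transposes the pairs back into two tuples.
names_back, ages_back = zip(*pairs)
print(names_back)  # ('John', 'Mike', 'Sarah')

# Inputs of unequal length are truncated to the shortest one.
short = list(zip([1, 2, 3], "ab"))
print(short)  # [(1, 'a'), (2, 'b')]
```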

String slices

Chapter 8 Strings
The operator [n:m] returns the part of the string from the “n-eth” character to the “m-eth” character, including the first but excluding the last. This behavior is counterintuitive, but it might help to imagine the indices pointing between the characters.

贡献来源

http://www.educity.cn/wenda/356740.html
http://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/not_so_obvious_python_stuff.ipynb#pm_in_lists

Parsing the command line in Python: the argparse module

2017-03-25T04:55:35.000Z

argparse is Python's standard module for parsing command-line arguments and options; it replaces the deprecated optparse module.

Basic usage:

import argparse
parser=argparse.ArgumentParser()
parser.add_argument()
args=parser.parse_args()
parser.print_help()

Importing the argparse module

import argparse

Creating the parser object, ArgumentParser

parser = argparse.ArgumentParser()

ArgumentParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=argparse.HelpFormatter, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True)

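Optional parameters

description: a descriptive sentence about the program, shown at the start of the help text;
add_help: True by default; set it to False to disable the help option;
epilog: text shown at the end of the help;
prog: (default: sys.argv[0]) the program name. It rarely needs changing; to use the program name in help text, write %(prog)s;
prefix_chars: the characters that prefix options, - by default, as in -f/--file;
formatter_class: customizes the format of the help output (description and epilog). By default long help text is re-wrapped and runs of whitespace are collapsed.

Three allowed values

class argparse.RawDescriptionHelpFormatter prints description and epilog exactly as written (no re-wrapping, no whitespace collapsing);
class argparse.RawTextHelpFormatter prints description, epilog and the help strings of add_argument exactly as written (no re-wrapping, no whitespace collapsing);
class argparse.ArgumentDefaultsHelpFormatter appends the default value, if one is set, to the help text of each option.

Example

parser = argparse.ArgumentParser(description="This is a description of %(prog)s", epilog="This is an epilog of %(prog)s", prefix_chars="-+", fromfile_prefix_chars="@", formatter_class=argparse.ArgumentDefaultsHelpFormatter)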

add_mutually_exclusive_group(): declaring mutually exclusive options

group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")

argparse enforces the exclusivity for you, so only one option of the group may be given per invocation. The usage line shows them as [-v | -q].

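add_argument(): declaring command-line arguments

parser.add_argument()

add_argument(name or flags...[, action][, nargs][, const][, default][, type][, choices][, required][, help][, metavar][, dest])
name or flags: how the argument is written, usually a short flag plus a long flag, or a bare name for a positional argument, e.g. "-f", "--file", "file";
nargs: the number of command-line values consumed, usually given as a wildcard: '?' means zero or one, '*' zero or more, '+' at least one;
default: the default value;
type: the type the value is converted to; the default is a string, and float, int and others are available;
dest: the attribute name under which the value is stored; with dest="a" the value is read as args.a;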

parser.add_argument('--ratio',dest='ratio',type=float,default=None,
                help="only show values where the difference between study")
...
min_ratio=args.ratio

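action: the action triggered by the argument; common values are store_true/store_false, count and so on;
choices: the allowed values, e.g. parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2], help="increase output verbosity");
metavar: the name shown for the argument's value; it is used only in the help output;
help: the help text for the argument; it plays the same role, and appears in the same place, as the parser-level description;
When the program runs, positional arguments are required and optional arguments are optional. The help output lists them in two separate parts, "positional arguments" and "optional arguments":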

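Positional arguments

No leading dashes; the value is given directly
parser.add_argument("bar", help="test test test")

Optional arguments

Short or long dashed form
parser.add_argument("-f", "--file", help="test test test")

parse_args(): parsing the command line

args = parser.parse_args()

Once all arguments are defined, call parse_args() to parse the command line; you can also pass it a list of argument strings.
The return value of parse_args() is a namespace holding the arguments passed to the command. Each argument is stored as an attribute, so if an argument's dest is "myoption", you read its value as args.myoption.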
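Putting the pieces together, here is a minimal end-to-end sketch. The option names (infile, --ratio, --names) and the argument list handed to parse_args() are illustrative assumptions; passing an explicit list instead of reading sys.argv simply keeps the example reproducible.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Filter values by ratio (demo).")
    parser.add_argument("infile", help="input file name (positional, required)")
    parser.add_argument("--ratio", dest="ratio", type=float, default=None,
                        help="minimum ratio to report")
    parser.add_argument("-n", "--names", nargs="+", default=[],
                        help="one or more sample names")
    # Only one of -v / -q may be given at a time.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-v", "--verbose", action="store_true")
    group.add_argument("-q", "--quiet", action="store_true")
    return parser

parser = build_parser()
args = parser.parse_args(["data.txt", "--ratio", "0.5", "-n", "a", "b", "-v"])
print(args.infile, args.ratio, args.names, args.verbose, args.quiet)
# data.txt 0.5 ['a', 'b'] True False
```

In a real script, call parser.parse_args() with no arguments so that the values come from the command line.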

parser.print_help() prints the help message

Advanced usage

File arguments

parser.add_argument('-i', metavar='in-file', type=argparse.FileType('rt'))
parser.add_argument('-o', metavar='out-file', type=argparse.FileType('wt'))
	 
parser.print_help()
>>>
usage: __main__.py [-h] [-i in-file] [-o out-file]

optional arguments:
  -h, --help   show this help message and exit
  -i in-file
  -o out-file

参考来源：
http://www.sijitao.net/2000.html
http://blog.csdn.net/yugongpeng_blog/article/details/46693471
http://www.jb51.net/article/67158.htm

Direct: accessing NCBI from the command line

2017-01-13T13:13:07.000Z

The tool of choice for accessing and retrieving NCBI data from the command line is Entrez Direct: E-utilities on the UNIX Command Line.

Tools

esearch runs a search;
elink looks up neighbors (within a database) or links (between databases).
efilter filters search results.
efetch downloads search results in a specified format.
xtract converts XML into a table.
einfo obtains information on indexed fields in an Entrez database.
epost uploads unique identifiers (UIDs) or sequence accession numbers.
nquire sends a URL request to a web page or CGI service.

Querying a database

esearch -db pubmed -query "lycopene cyclase" | efetch -format abstract
esearch -db protein -query "lycopene cyclase" | efetch -format fasta

When the records are proteins or nucleotides, -format can be fasta (also fasta_cds_na, fasta_cds_aa and gene_fasta), gb (GenBank) or gp (GenPept).

Searching and filtering

esearch -db pubmed -query "opsin gene conversion" | elink -related |
  efilter -query "tetrachromacy"
# alternative final filters:
#   efilter -days 60 -datetype PDAT                      # the past two months
#   efilter -mindate 1990 -maxdate 1999 -datetype PDAT   # the 1990s

Use case: how do you carry out the following search and filtering on Linux?

Add the filters to the query in the first step:

esearch -db nucleotide -query 'beta-tubulin-2 AND (Fungi[filter] AND "mrna"[Filter])' | efetch -format fasta

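Converting XML to tab-delimited text

$ esearch -db protein -query "lycopene cyclase"
<ENTREZ_DIRECT>
  <Db>protein</Db>
  <WebEnv>NCID_1_322844954_130.14.22.215_9001_1484487403_837234396_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>12380</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>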
$ esearch -db protein -query "lycopene cyclase" | xtract -pattern ENTREZ_DIRECT -element Count
12380
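What xtract does here can be imitated in Python for post-processing. The sketch below assumes the esearch reply is a small XML document whose root is ENTREZ_DIRECT and which contains a Count element (the two names used in the xtract call above); the other elements and the WebEnv value are placeholders for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the esearch reply; the WebEnv value is a placeholder.
reply = """<ENTREZ_DIRECT>
  <Db>protein</Db>
  <WebEnv>PLACEHOLDER</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>12380</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>"""

root = ET.fromstring(reply)
count = int(root.findtext("Count"))   # same field as -element Count
print(count)  # 12380
```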

The most prolific authors in a field

$ SortUniqCountRank() {
    sort -f |
    uniq -i -c |
    perl -pe 's/\s*(\d+)\s(.+)/$1\t$2/' |
    sort -t $'\t' -k 1,1nr -k 2f
  }
$ alias sort-uniq-count-rank='SortUniqCountRank'
$ esearch -db pubmed -query \
    "crotalid venoms [MAJR] AND phospholipase [TIAB]" |
  efetch -format xml |
  xtract -pattern PubmedArticle \
    -block Author -sep " " -tab "\n" -element LastName,Initials  |
  sort-uniq-count-rank

Publications per year in a field

esearch -db pubmed -query "legionnaires disease [TITL]" |
 efetch -format docsum |
 xtract -pattern DocumentSummary -element PubDate |
 cut -c 1-4 |
 sort-uniq-count-rank

How many genes are on each human chromosome

for chr in {1..22} X Y MT
 do
   esearch -db gene -query "Homo sapiens [ORGN] AND $chr [CHR]" |
   efilter -query "alive [PROP] AND genetype protein coding [PROP]" |
   efetch -format docsum |
   xtract -pattern DocumentSummary -NAME Name \
     -block GenomicInfoType -if ChrLoc -equals "$chr" \
       -tab "\n" -element ChrLoc,"&NAME" |
   sort | uniq | cut -f 1 | sort-uniq-count-rank
 done

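CRISPR-sgRNA-Designer: it was never meant to be that mysterious

2017-01-13T03:37:05.000Z

I have summarized CRISPR and how it works in an earlier post: CRISPR/Cas9.

Story

The upland cotton genome was sequenced in 2015 and the genome data released afterwards. My lab also works on CRISPR. Thanks to technical advances and the PNAS paper by Xie Kabin (谢卡斌), CRISPR knockout in cotton first succeeded in our lab, and CRISPR work on more functional genes entered the plans of many graduate students. The first step of CRISPR editing is designing sgRNAs for the target gene. Most existing online sgRNA design tools target model organisms and cannot be used for non-model species. The design principle, however, is general and easy to understand: find NGG PAM sites in the CDS of the target gene; the 20 bp immediately upstream of each PAM form a candidate sgRNA site. What remains to consider is the specificity of the sgRNA (it must not edit other genes, so the 20 bp must be unique in a genome of hundreds of Mb or even Gb) and its off-target rate (a certain number of mismatches is sometimes tolerated, which can cause off-target editing). Each candidate site is therefore aligned against the whole genome to find other possible editing sites. This is not hard to implement in Perl or Python, and sgRNAcas9, written in Perl by Dr. Xie of the College of Animal Science at Huazhong Agricultural University, does exactly this with a user-supplied genome.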
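The first step described above, finding NGG PAM sites and taking the 20 bp upstream, can be sketched in a few lines of Python. This is an illustration of the principle only, not the sgRNAcas9 implementation: it scans the forward strand only, does no genome-wide specificity check, and the function names and toy sequence are assumptions.

```python
import re

def find_sgrna_candidates(seq, spacer_len=20):
    """Return (start, spacer, pam) for every NGG PAM on the forward strand."""
    seq = seq.upper()
    hits = []
    # A lookahead lets overlapping candidate sites all be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_len)
    for m in pattern.finditer(seq):
        hits.append((m.start(), m.group(1), m.group(2)))
    return hits

def gc_content(spacer):
    """GC content of a spacer in percent."""
    return 100.0 * sum(base in "GC" for base in spacer) / len(spacer)

cds = "ATGC" * 5 + "TGG" + "AAAA"   # toy sequence with a single PAM
candidates = find_sgrna_candidates(cds)
for start, spacer, pam in candidates:
    print(start, spacer, pam, gc_content(spacer))
# 0 ATGCATGCATGCATGCATGC TGG 50.0
```

Searching the reverse strand amounts to running the same function on the reverse complement of the sequence, and the GC filter corresponds to the 40% to 60% range commonly used for sgRNAs.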

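The sgRNAcas9 workflow

The features of sgRNAcas9 are as follows:

Its advantage is that you can supply your own genome file; its drawback is that working out restriction enzyme sites is cumbersome.
By chance I also found that the software's author had listed one of my early roundup posts ╰(￣▽￣)╮

Plug: that post has since been updated, http://tiramisutes.github.io/2015/08/05/bio-online.html

Download

The software is written in Perl and is available for Windows and Linux; download the build for your platform: sgRNAcas9.

Installation

Make the scripts executable

chmod +x sgRNAcas9_3.0.5.pl
chmod +x -R Seqmap
chmod +x -R Usefull_Script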

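Preparing the files

Genome file: genome.fa
Genome CDS file: genome_cds.fa
Genome annotation file: genome.gff3 (optional)
Move these files into the sgRNAcas9 directory; using absolute paths causes an error.

Running

perl sgRNAcas9_3.0.5.pl -i genome_cds.fa -x 20 -l 40 -m 60 -g genome.fa -o b -t s -v l -n 5

Parameters:
-i: FASTA file of the sequences to design CRISPR knockouts for; it may hold several sequences;
-x: sgRNA length, usually 20;
-l: lower limit of GC content;
-m: upper limit of GC content; GC content is usually kept at 40% to 60%.
-g: genome FASTA file;
-o: which DNA strand to search for CRISPR target sites: s sense strand, a antisense strand, b both strands;
-t: gRNA search mode: s single gRNA, p paired gRNAs;
-v: operating system: l for 64-bit Linux, w for Windows;
-n: maximum number of mismatched bases, usually 5.
Setting -i to the whole-genome CDS FASTA file and -g to the whole-genome FASTA file produces the library file mentioned above.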

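Interpreting the results

Run time depends on the genome size of the species and the number of target genes. With a large genome file a Windows machine tends to hang, so running on a server is preferable.
When the run finishes it writes a report file, sgRNAcas9.report_20.b.rhp.fa, with the following content:

Note: OT is the off-target count for the target without NGG, i.e. for the 20 bases before NGG; POT is the off-target count for the seed sequence, i.e. the 12 bases before NGG.
sgRNAcas9_report.xls ranks the sites by a risk level that combines GC content, mismatches and specificity; the simplest choice is the CRISPR target sites marked Best.
Discard > High_risk > moderate_risk > low_risk > repeat_sites_or_bad ? > Best
A 0M(on-/off-) value of 0 means off-target; any other value is the number of target sequences found.

Annotating off-target sites

Off-targets should be avoided as far as possible, but where a risk remains, the candidate off-target sites can be annotated with the genome gff3 annotation file.

perl ot2gtf.pl -i -g -o
perl pot2gtf.pl -i -g -o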

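CRISPR

CRISPR is an online design tool developed by the group of Chen Lingling (陈玲玲) at the School of Bioinformatics, Huazhong Agricultural University. Its output is well visualized and includes mismatches, target positions, restriction enzyme sites and more.

cas-designer

cas-designer also comes in an online and a command-line version and accepts a user-supplied genome. The official site has a detailed installation guide, so I will not repeat it here.
The catch is that one of its components, Cas-OFFinder, needs an OpenCL-enabled device, and CentOS 6.0 turned out not to support that driver, so I gave up on it.

Summary

The final sgRNA sites can be chosen by combining the results of several tools and keeping the reliable ones.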

Installing AUGUSTUS and clearing GLIBC "landmines" as a non-root user

2017-01-06T07:40:43.000Z

AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.

 ./augustus 
./augustus: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./augustus)
./augustus: /public/home/zpxu/bin/gcc-4.8.5/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./augustus)
./augustus: /public/home/zpxu/bin/gcc-4.8.5/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./augustus)

Check the GLIBC versions:

strings /lib64/libc.so.6 | grep GLIBC
GLIBC_2.2.5
GLIBC_2.2.6
GLIBC_2.3
GLIBC_2.3.2
GLIBC_2.3.3
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.5
GLIBC_2.6
GLIBC_2.7
GLIBC_2.8
GLIBC_2.9
GLIBC_2.10
GLIBC_2.11
GLIBC_2.12
GLIBC_PRIVATE

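The highest version available is 2.12 but the program needs 2.14, so let's compile and install it ourselves.

1> Download GLIBC

wget http://ftp.gnu.org/gnu/glibc/glibc-2.14.tar.gz

2> Unpack glibc-2.14.tar.gz, enter the unpacked directory, create a build directory and enter it:

tar -zxvf glibc-2.14.tar.gz && cd glibc-2.14 && mkdir build && cd build

3> Configure:

../configure --prefix=/opt/glibc-2.14 # your install directory

This fails with: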

checking LD_LIBRARY_PATH variable... contains current directory
configure: error:
*** LD_LIBRARY_PATH shouldn't contain the current directory when
*** building glibc. Please change the environment variable
*** and run configure again.

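The message is clear enough: a directory conflict.
Yet according to echo $LD_LIBRARY_PATH, this install directory is not among the paths set in my .bashrc.
Opening the configure script and searching for LD_LIBRARY_PATH turns up the following: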

Test if LD_LIBRARY_PATH contains the notation for the current directory
since this would lead to problems installing/building glibc.
LD_LIBRARY_PATH contains the current directory if one of the following
is true:
- one of the terminals (“:” and “;”) is the first or last sign
- two terminals occur directly after each other
- the path contains an element with a dot in it

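In other words, LD_LIBRARY_PATH must not begin or end with a separator (":" or ";") and must not contain two separators in a row. My variable had a ":" at both the start and the end, and configure interprets such an empty entry as the current directory.
Fix:
Run: vi ~/.bashrc
Remove the leading and trailing ":" from the LD_LIBRARY_PATH variable and save. Once the build has succeeded you can change it back.
Run: source ~/.bashrc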
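The three conditions quoted from configure can be checked before starting a build. This is a rough Python sketch of that check, not the logic of the configure script itself: the function name is made up, and treating the "dot" condition as an entry equal to "." is an assumption.

```python
import os

def has_current_dir(path_value):
    """Return True if a PATH-like string implicitly contains the current directory."""
    if not path_value:
        return False
    entries = path_value.replace(";", ":").split(":")
    # An empty entry comes from a leading, trailing or doubled separator.
    return any(entry == "" or entry == "." for entry in entries)

print(has_current_dir(":/opt/lib:"))         # True  (leading and trailing ":")
print(has_current_dir("/opt/lib::/usr/lib")) # True  (doubled ":")
print(has_current_dir("/opt/lib:/usr/lib"))  # False

# Check the live environment.
print(has_current_dir(os.environ.get("LD_LIBRARY_PATH", "")))
```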

4> make

make -j4 && make install

make install fails with:

/usr/bin/install: `include/limits.h' and `/glibc-2.14/include/limits.h' are the same file 
make[1]: *** [/glibc-2.14/include/limits.h] Error 1
make[1]: Leaving directory `/glibc-2.14'
make: ***[install] Error 2

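A search on Google turned up this workaround:

make install -k -i

The -k and -i flags do force the installation through, but in my tests they did not really solve the problem, so this last installation step is still stuck and I have not found a working fix yet. I am leaving a placeholder here and will add the solution once I find one.
The -j, -k and -i flags are explained as follows: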

-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
-i, --ignore-errors
Ignore all errors in commands executed to remake files.
-k, --keep-going
Continue as much as possible after an error. While the target that failed, and those that depend on it, cannot be remade, the other dependencies of these targets can be processed all the same.

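5> Set the environment variable

export LD_LIBRARY_PATH=/opt/glibc-2.14/lib:$LD_LIBRARY_PATH

PATH vs. LD_LIBRARY_PATH
PATH: the search path for executables;
LD_LIBRARY_PATH: the search path for shared libraries;

Installing AUGUSTUS after all

Although the GLIBC installation above ended in failure, the AUGUSTUS installation was solved by following Installing Augustus with manual bamtools installation.
1. The main point is to install, without root privileges, what the helper tools bam2hints and filterBam depend on 👇
Start by installing bamtools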

git clone https://github.com/pezmaster31/bamtools.git
cd bamtools
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/path/to/bamtools .. 
make

2. Edit some of the Makefiles in AUGUSTUS
First edit the Makefile in augustus-3.2.3/auxprogs/bam2hints:

Add:
BAMTOOLS = /your/path/to/bamtools

Replace:
INCLUDES = /usr/include/bamtools
By:
INCLUDES = $(BAMTOOLS)/include

Replace:
LIBS = -lbamtools -lz
By:
LIBS = $(BAMTOOLS)/lib/libbamtools.a -lz

Then edit the Makefile in augustus-3.2.3/auxprogs/filterBam/src:

Replace:

BAMTOOLS = /usr/include/bamtools
By:
BAMTOOLS = /your/path/to/bamtools

Replace:
INCLUDES = -I$(BAMTOOLS) -Iheaders -I./bamtools
By:
INCLUDES = -I$(BAMTOOLS)/include -Iheaders -I./bamtools

Replace:
LIBS = -lbamtools -lz
By:
LIBS = $(BAMTOOLS)/lib/libbamtools.a -lz

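3. Build with make
Finally, return to the top-level directory augustus-3.2.3 and run make. That completes the installation; no make install step is needed.

References

Fixing the "/lib64/libc.so.6: version `GLIBC_2.14' not found" problem

[error]LD_LIBRARY_PATH shouldn't contain the current directory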

Installing and using InterProScan as a non-root user

2016-12-15T03:10:40.000Z

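InterProScan is commonly used for functional annotation of gene sequences. InterPro is a database of protein functions, families and so on, and InterProScan matches our target sequences against this database to learn what they do.

The official InterProScan wiki covers the features, installation and basic configuration requirements in great detail, so I will not go over them here.
What we usually run into instead is permissions, for instance the Python version. My cluster ships Python 2.6; I have a working Python 2.7 in my home directory, but the system default is still 2.6. So the first question is how to make my own 2.7 the default in place of /usr/bin/python.
The simplest way is an alias:

alias python='~/bin/Python-2.7.10/Python/bin/python2.7'

But when the software you run calls the system default python itself, this stops working.
Then why insist on changing the system default? You could simply name the interpreter whenever Python is needed. True: software written in Python can be run as python2.7 followed by the script. But InterProScan's main program, interproscan.sh, is not written in Python. Fortunately it has a dedicated configuration file, interproscan.properties, where software and database paths can be set.
So the fix for the Python version problem when running InterProScan as a non-root user is to add python.command=/path/to/python2.7 to that configuration file.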

Pre-calculated Match Lookup Service

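InterProScan goes online to look up pre-calculated matches at EBI. When the server has no network access, choose one of the following:
1> Download and install the InterProScan 5 lookup service.
2> Turn the feature off with the -dp option.
3> Comment out, with #, the line precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup in interproscan.properties.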

Mate-pair Reads Alignment

2016-11-25T01:56:33.000Z

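Library types

For genomic libraries we usually build small-insert libraries (<1 kb) sequenced as paired-end reads (L-> <-R) and large-insert libraries sequenced as mate-pair reads (<-L R->). The main differences between the two are the orientation of read 1 and read 2 and the distance between them.

Almost all mainstream tools support aligning paired-end reads. So how should mate-pair reads be handled, that is, how are mate-pair reads aligned?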

reverse complement

For standard Illumina mate-pair preps, reverse-complement the reads with fastx-toolkit and then align them with standard parameters using bwa/bowtie.
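What the reverse-complement step does to a read can be shown in a few lines of Python. This is an illustrative sketch, not a replacement for fastx_reverse_complement; the function names are made up. For FASTQ input the quality string has to be reversed along with the sequence, which the second helper shows.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def reverse_complement_fastq(seq, qual):
    """Reverse-complement a read and reverse its per-base qualities to match."""
    return reverse_complement(seq), qual[::-1]

print(reverse_complement("ATGCCN"))                   # NGGCAT
print(reverse_complement_fastq("ATGCCN", "IIHHG#"))   # ('NGGCAT', '#GHHII')
```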

fastx-toolkit reverse complement

FASTQ/A Reverse Complement

$ fastx_reverse_complement -h
usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.

bowtie2

Alternatively, align with bowtie2 directly by setting its --fr/--rf/--ff, -I and -X options.

Aligning pairs

A “paired-end” or “mate-pair” read consists of pair of mates, called mate 1 and mate 2. Pairs come with a prior expectation about (a) the relative orientation of the mates, and (b) the distance separating them on the original DNA molecule. Exactly what expectations hold for a given dataset depends on the lab procedures used to generate the data. For example, a common lab procedure for producing pairs is Illumina’s Paired-end Sequencing Assay, which yields pairs with a relative orientation of FR (“forward, reverse”) meaning that if mate 1 came from the Watson strand, mate 2 very likely came from the Crick strand and vice versa. Also, this protocol yields pairs where the expected genomic distance from end to end is about 200-500 base pairs.

Paired-end options

-I/--minins
The minimum fragment length for valid paired-end alignments. E.g. if -I 60 is specified and a paired-end alignment consists of two 20-bp alignments in the appropriate orientation with a 20-bp gap between them, that alignment is considered valid (as long as -X is also satisfied). A 19-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the -I constraint is applied with respect to the untrimmed mates.
The larger the difference between -I and -X, the slower Bowtie 2 will run. This is because larger differences between -I and -X require that Bowtie 2 scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), Bowtie 2 is very efficient.
Default: 0 (essentially imposing no minimum)

-X/--maxins
The maximum fragment length for valid paired-end alignments. E.g. if -X 100 is specified and a paired-end alignment consists of two 20-bp alignments in the proper orientation with a 60-bp gap between them, that alignment is considered valid (as long as -I is also satisfied). A 61-bp gap would not be valid in that case. If trimming options -3 or -5 are also used, the -X constraint is applied with respect to the untrimmed mates, not the trimmed mates.
The larger the difference between -I and -X, the slower Bowtie 2 will run. This is because larger differences between -I and -X require that Bowtie 2 scan a larger window to determine if a concordant alignment exists. For typical fragment length ranges (200 to 400 nucleotides), Bowtie 2 is very efficient.
Default: 500.

--fr/--rf/--ff
The upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if --fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints (-I and -X) are met, that alignment is valid. Also, if mate 2 appears upstream of the reverse complement of mate 1 and all other constraints are met, that too is valid. --rf likewise requires that an upstream mate1 be reverse-complemented and a downstream mate2 be forward-oriented. --ff requires both an upstream mate 1 and a downstream mate 2 to be forward-oriented. Default: --fr (appropriate for Illumina's Paired-end Sequencing Assay).

Novoalign

Genomic-Feature

2016-11-04T09:54:27.000Z

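The problem

Genomic features usually include exons, introns, intergenic regions, UpstreamToGene, UTRs and so on. A species with a complete reference genome generally comes with a gff3 annotation file, but that file typically holds coordinates only for mRNA, gene and exon features, while we often need more kinds of genomic features.

Solution

Tools

bedtools
For a walkthrough of bedtools, see my other post: notes on using bedtools.

Feature types

The third column of a gff3 file names the feature type, so we can count the features of each type:

cat XX.gff3 | grep -v "^#" | cut -f3 | sort | uniq -c | sort -k1rn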

Remove/merge overlapping exons

The gff3 file contains cases like the following:

$ grep -B 5 "89201851" Gossypium_hirsutum_v1.1.gene.gff3 | grep "exon"
A01  EVM  exon  89201570  89201851  .   -  .   ID=evm.model.Gh_A01G1441.exon1;Parent=evm.model.Gh_A01G1441
A01  EVM  exon  89201852  89201963  .   -  .   ID=evm.model.Gh_A01G1442.exon3;Parent=evm.model.Gh_A01G1442
A01  EVM  exon  89202063  89202216  .   -  .   ID=evm.model.Gh_A01G1442.exon2;Parent=evm.model.Gh_A01G1442

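A closer look shows that exons of the two genes Gh_A01G1441 and Gh_A01G1442 are directly adjacent (the first line ends at 89201851 and the second starts at 89201852), i.e. the two genes are contiguous. In such cases we sometimes want to merge them into a single exon when counting exons. mergeBed (Merges overlapping BED/GFF/VCF entries into a single interval), i.e. bedtools merge [OPTIONS] -i, does this, but the input must first be sorted by start position (with another component, sortBed).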

cat Gossypium_hirsutum_v1.1.gene.gff3 | \
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' | \
sortBed | mergeBed -i - >merged-exon.gff3
# compare the intervals before and after merging
cat Gossypium_hirsutum_v1.1.gene.gff3 | \
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' | \
sortBed | diff - merged-exon.gff3
7832,7833c7832
< A01   89201569        89201851
< A01   89201851        89201963
---
> A01   89201569        89201963
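The merging rule behind the diff above can be written out explicitly. The sketch below is a plain-Python illustration for a single chromosome, not the bedtools implementation; it assumes 0-based half-open intervals as produced by the awk step and, like the merge shown above, also joins intervals that merely touch (start equal to the previous end).

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):        # sorting mirrors sortBed
        if merged and start <= merged[-1][1]:   # overlaps or touches the previous one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

exons = [(89202062, 89202216), (89201569, 89201851), (89201851, 89201963)]
merged_exons = merge_intervals(exons)
print(merged_exons)
# [(89201569, 89201963), (89202062, 89202216)]
```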

Get intron regions

In a gff3 file, the intron regions are the parts of an mRNA/gene outside its exons. They can be obtained with subtractBed (Removes the portion(s) of an interval that is overlapped by another feature(s)): bedtools subtract [OPTIONS] -a -b.

cat Gossypium_hirsutum_v1.1.gene.gff3 | \
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | \
sortBed |subtractBed -a stdin -b merged-exon.gff3 >merged-intron.gff3
# compare merged-exon.gff3 and merged-intron.gff3
head -n 4  merged-exon.gff3 merged-intron.gff3
A01     15704   15772
A01     16263   16319
A01     16883   17103
A01     17483   17623

A01     15772   16263
A01     16319   16883
A01     17103   17483
A01     17623   18384

Another tool can also produce introns: GenomeTools: a comprehensive software library for efficient processing of structured genome annotations, with gt gff3 -addintrons Gossypium_hirsutum_v1.1.gene.gff3 >Gh-intron.gff3.

Get intergenic regions

Intergenic regions are the chromosome regions not covered by any gene. complementBed, i.e. bedtools complement [OPTIONS] -i -g, finds these regions from the gff3 file.

cat Gossypium_hirsutum_v1.1.gene.gff3 | \
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' | \
sortBed | complementBed -i stdin -g genome.fa.length > merged-intergenic.gff3
# inspect the intergenic regions
more merged-intergenic.gff3
A01     0       15704
A01     19194   22807
A01     24529   36427
A01     36860   40961

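The mergeBed, subtractBed and complementBed operations above are illustrated as follows:

References

Defining genomic regions

Assessing the quality of de novo transcriptome assemblies

2016-10-30T15:00:17.000Z

Reference-free de novo assembly is usually done with Trinity. The two most important parameters during assembly are --min_kmer_cov and --min_glue, and getting a high-quality assembly usually means trying different values. The developers' discussion of these two parameters on GitHub, Optimizing parameters, is a useful reference. In the end the question comes down to whether you care about the low-abundance transcripts in your data.
The authors also provide a set of methods for assessing assembly quality, Transcriptome Assembly Quality Assessment, which lists six methods for evaluating assemblies built with different parameters. After reading it I distilled four of them.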

Assessing the Read Content of the Transcriptome Assembly

bowtie2-build  ../trinity_out_dir${i}/Trinity.fasta ../trinity_out_dir${i}/Trinity.fasta
bowtie2 --local --no-unal -p ${cpu} -x  ../trinity_out_dir${i}/Trinity.fasta -q -1 ${left} -2 ${right} \
     | samtools view -Sb - | samtools sort -no - - > bowtie2.nameSorted.bam
# check the number and percentage of properly paired reads
${TRINITY_DIR}/util/SAM_nameSorted_to_uniq_count_stats.pl  bowtie2.nameSorted.bam
grep "^proper_pairs" Read-Representation.out

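In the second step, where bowtie2 aligns the reads to the assembled transcripts, a subset of the data can be used, which cuts the alignment time considerably.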

Full-length transcript analysis for model and non-model organisms using BLAST+

blastall -p blastx -i ./trinity_out_dir${i}/Trinity.fasta  -d ${uniprot} -v 1 -b 1 -m 8 -e 1e-5 -a ${cpu} -F F -o uniprot_sprot.fasta_blastx.outfmt8
${TRINITY_DIR}/util/analyze_blastPlus_topHit_coverage.pl uniprot_sprot.fasta_blastx.outfmt8 ./trinity_out_dir${i}/Trinity.fasta /public/home/cotton/public_data/SwissProt/uniprot_sprot.fasta
${TRINITY_DIR}/util/misc/blast_outfmt6_group_segments.pl \
      ./uniprot_sprot.fasta_blastx.outfmt8 ./trinity_out_dir${i}/Trinity.fasta uniprot_sprot.fasta > ./uniprot_sprot.fasta_blastx.outfmt8.grouped
${TRINITY_DIR}/util/misc/blast_outfmt6_group_segments.tophit_coverage.pl ./uniprot_sprot.fasta_blastx.outfmt8.grouped

Compute DETONATE scores

For paired-end reads, RSEM-EVAL needs an average fragment length; see my other post on estimating a library's Average Insert Size for how to compute it.

rsem-eval-estimate-transcript-length-distribution ./trinity_out_dir${i}/Trinity.fasta ./RSEM-EVAL${i}/length_distribution_parameter.txt
rsem-eval/rsem-eval-calculate-score -p 1 \
              --transcript-length-parameters ./RSEM-EVAL${i}/length_distribution_parameter.txt \
              --paired-end  --phred33 --strand-specific ../1.clean.fq ../2.clean.fq\
              ./trinity_out_dir${i}/Trinity.fasta \
              hope-trinity_out_dir${i} 300

For how to interpret the scores, see: RSEM-EVAL: A novel reference-free transcriptome assembly evaluation measure.

RSEM-EVAL produces the following three score related files: 'sample_name.score', 'sample_name.score.isoforms.results' and 'sample_name.score.genes.results'.

sample_name.score: stores the evaluation score for the evaluated assembly. The first line, Score, is the RSEM-EVAL score.

Higher RSEM-EVAL scores are better than lower scores. This is true despite the fact that the scores are always negative. For example, a score of -80000 is better than a score of -200000, since -80000 > -200000.

BUSCO explore completeness according to conserved ortholog

git clone https://gitlab.com/ezlab/busco.git

Download the lineage dataset you need by clicking the corresponding icon on the BUSCO website.

python BUSCO.py -i SEQUENCE_FILE -o OUTPUT_NAME -l LINEAGE -m tran

SEQUENCE_FILE: transcript set (DNA nucleotide sequences) file in FASTA format
OUTPUT_NAME: name to use for the run and temporary files (appended)
LINEAGE: location of the BUSCO lineage data to use (e.g. fungi_odb9)
Viewing the results: short_summary_OUTPUT_NAME.txt in the output folder holds the following statistics 👇

C:80.0%[S:80.0%,D:0.0%],F:0.0%,M:20.0%,n:10

8 Complete BUSCOs (C)
8 Complete and single-copy BUSCOs (S)
0 Complete and duplicated BUSCOs (D)
0 Fragmented BUSCOs (F)
2 Missing BUSCOs (M)
10 Total BUSCO groups searched

The results can also be plotted 👇:

cp short_summary_OUTPUT_NAME.txt ./plot
python2.7 BUSCO_plot.py -wd ./busco/plot/

Population Genetics

2016-10-28T03:02:57.000Z

Glossary

Gene diversity

Gene diversity is a measure of the expected heterozygosity in a sample of gene copies collected at a single locus. It is a summary statistic used to represent patterns of molecular diversity within a sample of gene copies. Typically, the gene copies are allelic states such as allozymes or fragment sizes (e.g., RFLPs, AFLPs, microsatellites). The expected heterozygosity is calculated under the assumption that the sample of gene copies was drawn from a population at Hardy-Weinberg equilibrium (HWE).
https://dendrome.ucdavis.edu/help/tutorials/gdiversity.php

Heterozygosity

Heterozygosity: An individual or population-level parameter. The proportion of loci expected to be heterozygous in an individual (ranging from 0 to 1.0).
HO (observed heterozygosity) is the observed proportion of heterozygotes, averaged over loci.
HE (expected heterozygosity) is also known as gene diversity (= D; preferred, less ambiguous term) and is calculated as 1.0 minus the sum of the squared gene frequencies. [See Weir, 1996, p. 124 for the multi-locus, multi-allele formula].
High heterozygosity means lots of genetic variability. Low heterozygosity means little genetic variability. Often, we will compare the observed level of heterozygosity to what we expect under Hardy-Weinberg equilibrium (HWE). If the observed heterozygosity is lower than expected, we seek to attribute the discrepancy to forces such as inbreeding. If heterozygosity is higher than expected, we might suspect an isolate-breaking effect (the mixing of two previously isolated populations).

Haplotype

A haplotype (haploid genotype) is a group of genes in an organism that are inherited together from a single parent.

Haplotype diversity (h)

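Haplotype diversity is a measure of the uniqueness of a particular haplotype in a given population. The haplotype diversity (H) is computed as H = N/(N − 1) × (1 − Σ x_i²), where x_i is the relative frequency of haplotype i in the sample and N is the sample size.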
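
As a rough illustration, haplotype diversity can be computed with the standard estimator H = N/(N − 1) × (1 − Σ x_i²), where x_i is the relative frequency of haplotype i and N is the sample size. The sketch below assumes the haplotypes are given as a list of labels; the function name is hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Haplotype diversity: N/(N-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    frequencies = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(f * f for f in frequencies))


# Four sampled sequences carrying three distinct haplotypes.
h = haplotype_diversity(["h1", "h1", "h2", "h3"])
print(h)
```

The value is 0 when every sampled sequence carries the same haplotype and 1 when every sampled sequence is unique.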

Nucleotide diversity (π)

Nucleotide diversity is a concept in molecular genetics which is used to measure the degree of polymorphism within a population.
This measure is defined as the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population, and is denoted by π.
Nucleotide diversity is a measure of genetic variation.
http://svitsrv25.epfl.ch/R-doc/library/ape/html/nuc.div.html
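
The definition of π above (average number of differences per site over all pairs of sequences) can be written directly. This is a minimal sketch, assuming the sequences are already aligned and of equal length, with no handling of gaps or ambiguous bases; for real data a dedicated function such as nuc.div in the R package ape (linked above) is the safer choice.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean number of differences per site over all pairs of aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(1 for a, b in zip(seq1, seq2) if a != b) for seq1, seq2 in pairs
    )
    return differences / (len(pairs) * length)


# Three aligned 4-bp sequences; only the last site is polymorphic.
pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
print(pi)
```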
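Haplotype diversity (Hd) and nucleotide diversity (Pi) are two key measures of the degree of mtDNA variation: the larger Hd and Pi are, the higher the diversity and the richer the genetic variation; the smaller they are, the poorer the genetic diversity. The mtDNA haplotype diversity index can also be used to gauge variation within a species.
More terms: Molecular Marker Glossary
Further reading: Genetic Markers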

Summary of Runtime Errors

2016-10-26T09:04:06.000Z

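Running programs inevitably throws up all kinds of errors, and there are many ways to fix them. Here I only summarize the problems I have run into and the best fix I found for each.

R errors

> alphaData = read.csv("data.csv")
Error: REAL() can only be applied to a 'numeric', not a 'integer'

Fix

alphaData = read.csv("data.csv") * 1.0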

npm errors

ERR! Windows_NT 6.3.9600
Error: tunneling socket could not be established, cause=connect ECONNREFUSED

Fix

#first run
npm cache clean
#If there is no proxy , remove proxy config from npm
npm config set proxy null
npm config set https-proxy null
npm install -g XXXX

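Python errors

1. Error when matching with the re module

AttributeError: 'NoneType' object has no attribute 'group'

Fix
The regular expression matched nothing, so the match call returned None. Check that the pattern is correct.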
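
A defensive pattern is to test the match object before calling group(). The snippet below is a small sketch of that idea; the function name and the example strings are made up for illustration.

```python
import re


def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:
        # re.search returns None when nothing matches; calling .group() on it
        # is what raises "'NoneType' object has no attribute 'group'".
        return None
    return match.group(0)


print(first_number("chr12"))
print(first_number("chrX"))
```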

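2. Shared library error

error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory

Add the directory containing Python's lib files to the library search path environment variable (on Linux, LD_LIBRARY_PATH).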

Bioinformatics software installation errors

Errors during make

error: ‘getpid’ was not declared in this scope

Fix
Add #include <unistd.h> at the top of the XX.cpp file that reports the error.

Single-Cell Whole-Genome Amplification

2016-10-13T12:44:53.000Z

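Single-cell sequencing, and improvements in its quality, depend on whole-genome amplification (WGA). WGA methods, however, show strong amplification bias (arising from the GC content of the sequence itself and from non-linear amplification), which leads to low genome coverage.

Whole-genome amplification (WGA)

There are currently three main amplification methods:
degenerate oligonucleotide-primed PCR (DOP-PCR), multiple displacement amplification (MDA), and a combination of displacement pre-amplification and PCR amplification (MALBAC), each with its own strengths and weaknesses. These methods can amplify the pg-level or even fg-level DNA of a single cell to the μg-level amount needed for sequencing, and it is these techniques that made single-cell genome sequencing possible.

DOP-PCR

Pure PCR-based amplification (DOP-PCR): PCR-based WGA uses random primers for exponential amplification, a process that affects different sequences to very different degrees.

MDA

Isothermal amplification (MDA): random hexamer primers anneal to the template DNA at many sites, and Phi29 DNA polymerase, which combines high amplification efficiency with high fidelity, starts replication at all of these sites simultaneously. As it synthesizes DNA along the template it displaces the complementary strand, and the displaced strand becomes a new template for amplification, so the end product is a large amount of high-molecular-weight DNA. Because MDA uses random primers and the strand-displacing ϕ29 polymerase under isothermal conditions, it reduces the bias caused by GC content compared with PCR-based amplification and greatly improves coverage. It is still a non-linear amplification, however, so substantial bias remains.

Φ29: a DNA polymerase with high processivity and strand-displacement activity. It is not heat-stable and is inactivated by 10 minutes at 65℃. It binds the template very tightly and can extend continuously over 100 kb of template without dissociating. It also has 3'→5' exonuclease (proofreading) activity, which ensures high-fidelity amplification.

MALBAC

Multiple annealing and looping-based amplification cycles (MALBAC): a quasi-linear pre-amplification reduces the bias of non-linear amplification.

How MALBAC amplification works

First, the double-stranded DNA of a single cell is denatured to single strands at 94℃; then, at 0℃, random primers (each with a 27-base common sequence and 8 variable bases) anneal evenly across the single-stranded template.

At 65℃, a strand-displacing DNA polymerase generates semi-amplicons of varying length (0.5 to 1.5 kb), which are then denatured at 94℃ and released from the template.

Over the following five temperature cycles, the semi-amplicons are amplified further into full amplicons whose 5' and 3' ends are complementary.

The temperature is lowered to 58℃ so that the full amplicons loop, which prevents them from being amplified further or hybridizing with other sequences.

Semi-amplicons and genomic DNA continue through the cycles to produce more full amplicons.

The looped full amplicons complete the quasi-linear pre-amplification; exponential PCR amplification follows, using as primer an oligonucleotide identical to the 27-base common sequence of the MALBAC primers.

What makes the amplification clever

Only a semi-amplicon template can yield a full amplicon with complementary ends, which completes the pre-amplification.

Comparison of amplification performance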

Thoughts on the talks at the 3-day 2016 National Congress of Plant Biology

2016-10-11T13:06:56.000Z

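The 3-day 2016 National Congress of Plant Biology, hosted by Wuhan University, opened on 10 October 2016 at the Ouyada International Hotel. With so many leading scientists gathered, the quality of the meeting justified the ¥1500 fee.

The meeting covered plant hormones, stress, development, epigenetics, genomics, metabolism, and more. Some topics were current hotspots in plant biology and others looked less fashionable but were very interesting; some were basic research and others were close to everyday life. The breadth exposed me to material I had never encountered before.
Hearing and seeing in person the big names I had previously only read about online, and getting a sense of their presence, was a privilege.
The talks were excellent and the slides were beautiful; I wish I could get hold of all of them.
Below is a brief summary of the parts I remember that struck me as novel.

DNA methylation homeostasis

Academician Zhu Jiankang's talk; the content was, naturally, excellent.

Platinum-grade rice genomes

Academician Zhang Qifa and his long-running love affair with Shanyou 63: to dissect the heterosis of the Zhenshan 97A × Minghui 63 hybrid combination, his group built platinum-grade genomes that provide genomic guidance for rice research.

Evolution of tetraploid cotton

Professor Zhang Tianzhen of Nanjing Agricultural University, drawing on the paper Integrated mapping and characterization of the gene underlying the okra leaf trait in Gossypium hirsutum L, used leaf shape to describe how tetraploid cotton evolved through multiple rounds of hybridization.

Cotton fibre development

Professor Zhang Xianlong of Huazhong Agricultural University described the epigenetic regulation of fibre development.

Comparative metabolomics

Professor Luo Jie of Huazhong Agricultural University is without doubt one of the leading international figures in plant metabolism. His original comparative metabolomics of rice and maize exploits the complementarity between rice mapping intervals, which have large effects (significant loci above the threshold in the Manhattan plot) but low resolution (several genes per interval), and maize intervals, which have small effects but high precision (few genes per interval, sometimes a single gene). Comparing collinear mapping intervals between the two species helps pinpoint the genes that control metabolites.

Figure legend: Co-linear genomic regions and homologous loci (or genes) of di-C, C-pentosyl-apigenin between rice grains and maize kernels.
For details see the NC paper: Comparative and parallel genome-wide association studies for metabolic and agronomic traits in cereals

Single-cell sequencing in plants

Professor Yan Jianbing of Huazhong Agricultural University pioneered single-cell sequencing in plants with his 2015 NC paper Dissecting meiotic recombination based on tetrad analysis by single-microspore sequencing in maize. By isolating and sequencing single cells at the tetrad stage, the study examined genetic recombination in maize and found that the 3' and 5' UTRs of genes are recombination hotspots, that recombination is subject to both chromosome and chromatid interference, and that non-crossovers (NCO) far outnumber crossovers (CO).

PPR proteins

PPR (pentatricopeptide repeat) proteins are a large protein family found widely across organisms and are especially abundant in the chloroplasts and mitochondria of higher plants; Arabidopsis and maize, for example, each have more than four hundred PPR proteins. PPR proteins recognize single-stranded RNA in a sequence-specific manner and are closely involved in RNA transcription, splicing, editing, and stability.
The RNA-binding mode of PPR proteins differs from that of all other known RNA-binding proteins. A PPR protein consists of multiple PPR repeats; in most cases a canonical PPR repeat contains 35 amino acids, and each repeat specifically recognizes one RNA base.

Heavy-metal metabolism in rice

Professor Zhao Fangjie of Nanjing Agricultural University works mainly on the heavy metals chromium and arsenic in rice. I had previously heard a Chinese-born researcher based in Japan talk about his work on heavy metals in rice and thought the research was very meaningful, so I was glad to learn that people in China are working on it too. He painted a grim picture of rice and heavy metals in China: for samples from Hunan, for example, xx% exceeded the limit and xx% exceeded it severely. I will not quote the actual figures here.
So why does rice accumulate so much heavy metal? Apart from water pollution and the overuse of fertilizer, the biology of rice itself plays an important part. The most abundant element in rice is Si, at roughly 60% (if I remember correctly). Rice also grows in flooded soil, where the arsenic around the roots is present in the trivalent form; trivalent arsenic is structurally very similar to Si, so rice takes up large amounts of it while avidly absorbing Si. Upland rice does not have this problem. In addition, soil microbes can methylate arsenic to dimethylarsenic (DMA), which is taken up by the roots and causes disease in rice; Professor Zhao's team also found that the same soils contain another microbe that can demethylate DMA.

Alternative polyadenylation

Mechanisms and consequences of alternative polyadenylation
This is the textbook picture of a mature mRNA: a standard 5' cap and a 3' poly(A) tail.

But Professor Li Qingshun of the College of the Environment and Ecology at Xiamen University told us that the 3' poly(A) tail is itself subject to choice. For a single gene, the poly(A) tail can be added at different positions in the transcript, producing transcripts of different lengths. This strongly affects transcript properties such as coding capacity, stability, and translatability, and greatly increases the diversity of gene regulation at the transcript level.
About 70% of mRNAs have alternative poly(A) sites. Poly(A) can be added at the 5' or 3' end, or within an intron (which involves both intron splicing and polyadenylation: the A's are added after the intron is cut); if it is added at the 5' end, the gene loses its function.
Also, as shown below, the 3' UTR of a mature mRNA contains many specific sites:
A recommended paper: Alternative polyadenylation of mRNA precursors

Reducing the run time of PacBio data analysis

Professor Ruan Jue of the Agricultural Genomics Institute at Shenzhen is developing the wtdbg software, which aims to greatly reduce the high hardware requirements and long run times that existing tools need for PacBio assembly.

Molecular interactions between insects and plants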

ggplot2 2.2.0: a brief summary of the update

2016-10-06T05:04:09.000Z

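ggplot2 has been updated to version 2.2.0, which brings a number of functional improvements. For the full details see the original post: ggplot2 2.2.0 coming soon!
Installing the latest version:

install.packages("devtools")
devtools::install_github("hadley/ggplot2")

The main changes are:

Subtitles and captions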

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

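Note: the title is now left-aligned by default; to centre it, set theme(plot.title = element_text(hjust = 0.5)).

Personally I think this feature does not work very well in ggplot2 and is less practical than a truncated axis, although the article argues that truncating axes is not sound practice.

Axis changes

Changing the axis position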

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  scale_x_continuous(position = "top") + 
  scale_y_continuous(position = "right")

Adding a secondary axis with sec.axis

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  scale_y_continuous(
    "mpg (US)", 
    sec.axis = sec_axis(~ . * 1.20, name = "mpg (UK)")
  )

Themes

Remember the stubborn blank space between the axes in earlier plots? It is now gone.

Arrowed axes with element_line()

# define the arrow
arrow <- arrow(length = unit(0.4, "cm"), type = "closed")

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  theme_minimal() + 
  theme(
    axis.line = element_line(arrow = arrow)
  )

Legend changes

The legend can now be aligned with the plot area and given a border.

ggplot(mpg, aes(displ, hwy, shape = drv, colour = fl)) + 
  geom_point() + 
  theme(
    legend.justification = "top", 
    legend.box.margin = margin(3, 3, 3, 3, "mm"), 
    legend.box.background = element_rect(colour = "grey50")
  )

Note: panel.margin and legend.margin have been renamed to panel.spacing and legend.spacing.

Bar chart changes

The new geom_col() function is equivalent to geom_bar(stat = "identity").

Summary statistics on tables with awk

2016-10-01T09:28:19.000Z

Summing the fourth column for rows that share the same third column

awk 'BEGIN{FS=OFS="\t"} \
NR>1 \
{a[$3]+=$4} \
END {for (i in a) {print i,a[i]}}' text.txt | sort

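An awk array is made of string pairs: the first string is the index and the second is the value stored under that index. In a[$3]+=$4 the index comes from the third column and the value is the running sum of the fourth column for that index.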
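
The same accumulate-by-key idea maps onto a dictionary in most languages. Below is a small Python sketch of the grouped sum, assuming tab-separated lines with a header row; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict


def sum_by_key(lines, key_col=3, value_col=4, sep="\t", skip_header=True):
    """Sum value_col for each distinct key_col (1-based columns, like awk)."""
    totals = defaultdict(float)
    for index, line in enumerate(lines):
        if skip_header and index == 0:
            continue  # same effect as NR>1 in the awk command
        fields = line.rstrip("\n").split(sep)
        totals[fields[key_col - 1]] += float(fields[value_col - 1])
    # Return the keys in sorted order, like piping the awk output to sort.
    return dict(sorted(totals.items()))


rows = [
    "id\tname\tgroup\tvalue",
    "1\tx\tA\t2",
    "2\ty\tB\t5",
    "3\tz\tA\t3",
]
totals = sum_by_key(rows)
print(totals)
```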
Sorting inside awk with asorti (a gawk extension)

awk 'BEGIN{FS=OFS="\t"} \
NR>1 \
{a[$3]+=$4} \
END {n=asorti(a,b);for (i=1;i<=n;i++) {print b[i],a[b[i]]}}' text.txt

Summing the fourth column grouped by the third and second columns

awk 'BEGIN{FS=OFS="\t"} \
NR>1 \
{a[$3 OFS $2]+=$4} \
END {n=asorti(a,b);for (i=1;i<=n;i++) {print b[i],a[b[i]]}}' text.txt

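Maximum value for each key (in the comma-separated examples below the key is column 1 and the value is column 2; change $1 and $2 to $3 and $4 for the table above)

awk -F, '{if (a[$1] < $2)a[$1]=$2;}END{for(i in a){print i,a[i];}}' OFS=, file.txt

Counting rows per key

awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file.txt

Keeping only the first row for each key

awk -F, '!a[$1]++' file.txt

Joining all values of each key onto one line

awk -F, '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file.txt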

References

A Pivot Table In AWK
awk - 10 examples to group data in a CSV or text file

Generating number sequences on the command line: {} and seq

2016-10-01T08:59:45.000Z

How do you generate a sequence of numbers on the command line?

{START..END..INCREMENT}

{1..3}{a..c}

seq -s ',' START INCREMENT END, used as $(seq -s ',' START INCREMENT END)

A handy trick for awk:

echo "$"$(seq -s ',$' 20)
$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20

More：http://www.thelinuxrain.com/articles/building-sequences-of-numbers-on-the-command-line

Transposing text

2016-10-01T08:23:39.000Z

Three ways to transpose text

awk '{for(i=1;i<=NF;i++){a[FNR,i]=$i}}END{for(i=1;i<=NF;i++){for(j=1;j<=FNR;j++){printf "%s ", a[j,i]}print ""}}' test.txt

$ cat transpose.awk
#! /bin/sh
exec awk '
NR==1 {
  n=NF
  for (i=1;i<=NF;i++)
       row[i]=$i
  next
}
{
   if (NF>n)
       n=NF
   for (i=1;i<=NF;i++)
       row[i]=row[i] " " $i
}
END {
    for (i=1;i<=n;i++)
	     print row[i]
}' ${1+"$@"}

$ cat trans.sh 
#!/bin/bash

numc=$(($(head -n 1 "$1" | grep -o "$2" | wc -l)+1))
for ((i=1; i<="$numc"; i++))
do cut -d "$2" -f"$i" "$1" | paste -s -d "$2"
done
$ trans.sh FILE_TO_TRANSPOSE DELIMITER
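
For comparison, the same transposition is very short in a scripting language. The following Python sketch is not one of the three methods from the reference; it assumes the whole table fits in memory, and the function names are hypothetical. Short rows are padded with empty strings.

```python
from itertools import zip_longest


def transpose(rows, fill=""):
    """Turn a list of rows into a list of columns, padding short rows."""
    return [list(column) for column in zip_longest(*rows, fillvalue=fill)]


def transpose_text(text, sep="\t"):
    """Transpose delimiter-separated text and return it as a string."""
    rows = [line.split(sep) for line in text.splitlines()]
    return "\n".join(sep.join(column) for column in transpose(rows))


print(transpose_text("a\tb\tc\n1\t2\t3"))
```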

References

Transposing rows and columns: 3 methods

Merging tables on a condition

2016-10-01T07:10:29.000Z

Replace the third column of a.txt using the lookup table in b.txt

cat a.txt
1	h	1	hhh
2	k	3	uytfd
3	d	2	gfsr
4	f	3	jdgk
cat b.txt
1	a
2	b
3	c
# expected result
1	h	a	hhh
2	k	c	uytfd
3	d	b	gfsr
4	f	c	jdgk

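join version

join -t$'\t' -o 1.1 1.2 2.2 1.4 -1 3 -2 1 <(sort -k3 a.txt) b.txt | sort -n -k1

The join command:
join -1 FIELD -2 FIELD FILE1 FILE2
-a <1 or 2>: besides the normal output, also print the lines of the given file that have no matching field.
-i or --ignore-case: ignore case when comparing fields.
-t <char>: use char as the field separator.

awk version

awk 'BEGIN{FS=OFS="\t"} NR==FNR {a[$1]=$2;next}{print $1,$2,a[$3],$4}' b.txt a.txt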
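
The awk version is a dictionary lookup: b.txt is loaded into a map first, then each row of a.txt has its third column replaced. A Python sketch of the same logic, using the sample data from above as in-memory lists (the function name is hypothetical; keys missing from the lookup are left unchanged, which is an assumption not made by the awk command):

```python
def replace_column(table_rows, mapping_rows, column=3):
    """Replace the given 1-based column of each row using a key -> value table."""
    lookup = {key: value for key, value in mapping_rows}
    result = []
    for row in table_rows:
        new_row = list(row)
        key = row[column - 1]
        new_row[column - 1] = lookup.get(key, key)  # keep unknown keys as-is
        result.append(new_row)
    return result


a_rows = [["1", "h", "1", "hhh"], ["2", "k", "3", "uytfd"],
          ["3", "d", "2", "gfsr"], ["4", "f", "3", "jdgk"]]
b_rows = [["1", "a"], ["2", "b"], ["3", "c"]]
merged = replace_column(a_rows, b_rows)
for row in merged:
    print("\t".join(row))
```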

Several ways to add header lines

2016-10-01T02:05:31.000Z

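Want to add a line at the beginning and/or end of a text?

awk version

BEGIN adds a line at the start; END adds one at the end:

awk 'BEGIN{print "START"}; {print}; END{print "END"}'

sed version

In sed, 1 matches the first line and i inserts before it; $ matches the last line and a appends after it:

sed -e $'1i\\\nSTART' -e $'$a\\\nEND'

echo version

In a pipeline, { command1; command2; command3; } runs several commands and tells the shell to treat them as a single compound command.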

content-generator |
{ echo START; cat; echo END; } |
postprocessor

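Ignoring the first line

# head prints the first line; tail passes the remaining lines to sort
head -n 1 table.txt && tail -n +2 table.txt | sort -nr -k1
# alternatively, store the first line in a variable first
foo=$(head -n 1 table); echo -e "$foo"; tail -n +2 table | sort -nr -k1

Ignoring the first line and numbering from the second line

foo=$(head -n 1 table); echo -e "Record\t$foo"; tail -n +2 table | nl | sed 's/^[ ]*//'

Deleting the first line

awk 'NR!=1'
awk 'NR>1'
tail -n +2
sed '1d'

Adding a series of column headers

Need a series of headers that combine a letter and a number, such as M1, M2, ... M2016?

echo M{1..2016}; cat text.txt

Inserting a column at a given position and assigning values

awk '{$3=NR==1?"add" OFS $3:"hope" OFS $3} 1' OFS="\t\t" text.txt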

Splitting a file by its first column and adding a header

cat mainfile.txt
file1	abc	def	xyz
file1	aaa	pqr	xyz
file2	lmn	ghi	xyz
file2	bbb	tuv	xyz
# simply split the file by the first column
awk '{FILENAME=$1; print >>FILENAME}' mainfile.txt
awk -F '\t' '{if(FILENAME!=$1){FILENAME=$1;print "Name \t State \t Country" > FILENAME}} {print $2 "\t" $3 "\t" $4 > FILENAME}'  mainfile.txt
cat file1
Name	 State	Country
abc	 def	        xyz
aaa	 pqr 	        xyz
cat file2
Name	 State	Country
lmn	 ghi	        xyz    
bbb	 tuv	        xyz

References

The header line: how to add, delete and ignore it
Adding header to sub files after splitting the main file using AWK Shell Programming and Scripting

R-Data-Science

2016-09-29T05:39:15.000Z

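These notes summarize what I learned from R for Data Science.

Data visualization with ggplot2

Basic plotting

The basic ggplot2 template is:

ggplot(data = <DATA>) +                        # create an empty plot
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))   # add a layer; arguments given here apply to this layer only

GEOM_FUNCTIONs can be grouped by whether they show one, two, or three variables, and by whether the variables are continuous or discrete.
Every GEOM_FUNCTION in ggplot2 takes a mapping argument built with aes(x, y, size, shape, color, alpha); the values of these arguments are variables in DATA. To set an aesthetic manually, put the argument outside aes(), with values as follows:

color = "name of a colour in English"

size = a number

shape = a number code for the point shape

For shapes, hollow shapes (0 to 14) have a border set by colour, solid shapes (15 to 20) are filled with colour, and filled shapes (21 to 24) have a border set by colour and an interior set by fill.
Aesthetics set in ggplot() act as global settings; to simplify the code, put shared mappings in ggplot(). If a layer specifies something that conflicts with the global setting, the geom's own setting is used for that layer. This also makes it possible to use different data in different layers.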

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE) ## se: whether to draw the confidence interval

stat (statistical transformation)

Every geom has a default stat that statistically transforms the data; for example, the default stat of geom_bar() is count, i.e. geom_bar(..., stat = "count").

Coordinate systems

Notes on customizing the Vim editor

2016-09-21T08:16:31.000Z

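Vim has finally released a new major version, 8.0.

Installation

The best way to download and install the latest Vim is with Git:

git clone https://github.com/vim/vim.git

For more information see http://www.vim.org/git.php.
Windows GUI version: ftp://ftp.vim.org/pub/vim/pc/gvim80.zip
On Windows, just click Next through the installer.
After installation the Vim directory contains:

vim80: the files Vim needs at runtime; this directory corresponds to the $VIMRUNTIME variable

vimfiles: third-party files; this directory is $VIM/vimfiles

_vimrc: Vim's global configuration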

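Theme

The vimrc configuration is based mainly on http://blog.csdn.net/zhengzhoudaxue2/article/details/45247733

Note: for an explanation of the options in the _vimrc theme, see http://edyfox.codecarver.org/html/_vimrc_for_beginners.html

Configuring Vundle

The Vim plugin Vundle makes managing plugins easy.

Downloading Vundle

Create a folder named bundle under Vim/vimfiles, then clone the Vundle project from GitHub into it:

# run cmd as administrator and change into the bundle folder
cd Vim/vimfiles/bundle
git clone https://github.com/VundleVim/Vundle.vim.git Vundle.vim

Configuring Vundle

Add the following to the _vimrc file.
The straight double-quote character (") starts a comment in _vimrc: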

filetype off

" Path to Vundle
set rtp+=$VIM/vimfiles/bundle/Vundle.vim
" Where plugins are installed
call vundle#begin('$VIM/vimfiles/bundle/')
" Plugins to install
Plugin 'gmarik/Vundle.vim'
Plugin 'L9'
 
call vundle#end()
filetype plugin indent on

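Note: if no path argument is given to call vundle#begin(), plugins are saved by default under C:\Users\***\.vim.

Installing and removing plugins in Vim

Vundle relies on git to install, update, and remove plugins automatically, so git must be installed first.

Ways to specify plugins

There are 4 ways to specify a plugin in _vimrc:

1. The repository is on GitHub

Bundle 'tpope/vim-fugitive'
Bundle 'Lokaltog/vim-easymotion'

2. The plugin is on vim-scripts

Bundle 'L9'
Bundle 'FuzzyFinder'

3. The repository is on another git server

Bundle 'git://git.wincent.com/command-t.git'

4. A custom plugin you wrote yourself, stored locally

Bundle 'file:///Users/gmarik/path/to/plugin'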

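Common commands

Start Vim and type:

:BundleInstall    install plugins
:PluginInstall    install plugins
:BundleInstall!   update plugins
:BundleClean(!)   remove plugins not listed in .vimrc
:BundleSearch(!)  search for plugins
:BundleList       list installed plugins

To install a plugin, first add the corresponding Bundle line to _vimrc, usually of the form Bundle 'username/pluginname', e.g. Bundle 'gmarik/vundle'. Then open Vim, run the install command above, and wait for Done. If the installation fails, press lowercase "l" to view the log.
To remove a plugin, delete (or comment out) its Bundle line in _vimrc, then open Vim and run the corresponding command.

Problems encountered

Installing ctags gave "Taglist: Exuberant ctags (http://ctags.sf.net) not found in PATH.Plugin is not loaded."

Fix: copy ctags.exe from the ctags directory into the directory that contains gvim.exe.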

References

Gvim各种插件配置（windows环境下） - vitah
VIM插件管理—vundle
Vim Skills——Windows利用Vundle和Github进行Vim配置和插件的同步
_vimrc for beginners
_gvim与插件的安装（ctag、taglist、cscope等）

Estimating a library's average insert size

2016-09-19T14:07:39.000Z

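Assembling a genome from Illumina paired-end reads with SOAPdenovo requires a configuration file in which you must give each library's average insert size. So how do you estimate it?

Library types

For genomic libraries we usually build small-insert (<1 kb) paired-end libraries and large-insert mate-pair libraries; the main differences between the two are the orientation of read 1 and read 2 and the distance between them.

Nearly all mainstream tools now support aligning paired-end reads. For how to align mate-pair reads, see my other post, Mate-pair Reads Alignment.

Insert Size

First, what is the insert size?
For paired-end reads:

For mate-pair reads, read 1 and read 2 point outwards, so take extra care when computing the insert size.

Estimating insert size from the bwa log file

In the bwa log, each set of paired-end reads is read in 4 batches ([M::main_mem] read 2613712 sequences (200000105 bp)…), and for each batch bwa reports read statistics like the following:

As the red box in the original screenshot shows, RF reads make up the majority; further down, the block analyzing insert size distribution for orientation RF… gives their mean insert size.

Estimating insert size from the aligned SAM file

Computing with R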

$ cat sample.sam | cut -f9 > initial.insertsizes.txt
R
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]                     # keep values greater than 0
mn = quantile(a.v, seq(0,1,0.05))[4]   # quantiles in 5% steps; [4] is the fourth value, i.e. the 15% quantile
mx = quantile(a.v, seq(0,1,0.05))[18]  # the 85% quantile
mean(a.v[a.v >= mn & a.v <= mx])       # mean
sd(a.v[a.v >= mn & a.v <= mx])         # sd

So the R calculation computes the mean insert size only from values between the 15% and 85% quantiles, discarding everything below and above.

Computing with awk

awk '{ if ($9 > 0) { N+=1; S+=$9; S2+=$9*$9 }} END { M=S/N; print "n="N", mean="M", stdev="sqrt((S2-M*M*N)/(N-1))}' sample.sam

awk uses all values greater than 0 to compute the mean insert size.
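
To make the difference between the two calculations explicit, here is a Python sketch of the trimmed estimate: keep positive TLEN values (column 9 of the SAM file), compute two quantile cut-offs, and take the mean and standard deviation of what lies between them. The quantile helper uses linear interpolation, which is the rule R's quantile() applies by default; the function names and the read_tlens helper are illustrative assumptions, not part of any tool mentioned here.

```python
import math
import statistics


def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list."""
    position = (len(sorted_values) - 1) * p
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def insert_size_stats(tlens, low=0.15, high=0.85):
    """Return (mean, sd) of positive TLEN values between two quantile cut-offs."""
    positive = sorted(t for t in tlens if t > 0)
    low_cut = quantile(positive, low)
    high_cut = quantile(positive, high)
    kept = [t for t in positive if low_cut <= t <= high_cut]
    return statistics.mean(kept), statistics.stdev(kept)


def read_tlens(sam_path):
    """Yield column 9 (TLEN) from the alignment lines of a SAM file."""
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@"):
                continue  # skip header lines
            yield int(line.split("\t")[8])


# Toy data: negative and zero TLENs are ignored, as in the R and awk versions.
mean, sd = insert_size_stats([-250, 0] + list(range(1, 22)), low=0.25, high=0.75)
print(mean, sd)
```

Setting low=0 and high=1 reproduces the untrimmed awk calculation, which is a convenient way to compare the two on the same file.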

Working from the sorted BAM file

qualimap

qualimap bamqc -bam sample.sorted.bam --java-mem-size=300G -c -nw 400 -hm 3 -outdir /resulted

CollectInsertSizeMetrics

java -jar CollectInsertSizeMetrics.jar \
      I=sample.sorted.bam \
      O=insert_size_metrics.txt \
      H=insert_size_histogram.pdf \
      M=0.5

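The question

Running the four methods above on the SAM/BAM files from the same alignment gave:
R: mean insert size = 260.577232343453, standard deviation = 27.4153198790634;
awk: mean=250.826, stdev=51.7005;
qualimap: mean insert size = 250.8258, std insert size = 51.7005;
CollectInsertSizeMetrics: MEAN_INSERT_SIZE=260.963523, STANDARD_DEVIATION=42.809159.
qualimap and CollectInsertSizeMetrics are both packaged Java programs, so I could not see exactly how they compute these values. Judging by the results, CollectInsertSizeMetrics appears to filter the data before computing, as the R code does, while qualimap appears to use the same calculation as awk. So the question comes down to this: should the values be filtered before computing the mean insert size?

In R, I ran a normality test on a.v with lillie.test(a.v):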

> library("nortest")
> lillie.test(a.v)

        Lilliefors (Kolmogorov-Smirnov) normality test

data:  a.v
D = 0.1541, p-value < 2.2e-16

The insert sizes are clearly not normally distributed; all things considered, I went with the R result.

References

Question: Estimate Insert Size In Paired-End/Mate-Pair
Question: What is the difference between a Read and a Fragment in RNA-seq?
Paired-end read confusion - library, fragment or insert size?

Notes on learning WGCNA

2016-09-14T08:59:41.000Z

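In transcriptome analysis we often use differential expression analysis, comparing gene expression (FPKM) between treatments or tissues to find specific genes. That presupposes a small number of treatments or tissue samples. With many samples, say 40, pairwise comparison gives 780 comparisons, which is not at all what we want.

This is where WGCNA (weighted gene co-expression network analysis) comes in, to condense complex data. Beyond the usual differential expression comparison, we also want to know whether the expression of some genes is intrinsically linked or correlated across treatments or tissues; WGCNA can likewise help predict interactions between genes.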

WGCNA is based on correlation and not differential expression comparisons.

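WGCNA terminology

Weighted

Module

Module: genes with similar expression patterns are grouped together, and such a group of genes is called a module.

Eigengene

Eigengene (eigen- + gene): a single representative expression profile for a module, computed as the first principal component of the module's gene-by-sample expression matrix; see https://en.wiktionary.org/wiki/eigengene.

Adjacency Matrix

Adjacency matrix: a way of storing a graph. A one-dimensional array holds the data of all vertices, and a two-dimensional array holds the relationships (edges or arcs) between vertices; this two-dimensional array is called the adjacency matrix.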
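
In a WGCNA unsigned network, the entry of the adjacency matrix for two genes is the absolute Pearson correlation of their expression profiles raised to a soft-thresholding power. The following pure-Python sketch illustrates that definition on a toy data set; the gene names and the helper names pearson and unsigned_adjacency are invented for illustration, and the adjacency() function of the WGCNA R package should be used for real data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def unsigned_adjacency(expression, power=6):
    """Adjacency a_ij = |cor(gene_i, gene_j)| ** power, as a nested dict."""
    genes = list(expression)
    return {
        g1: {g2: abs(pearson(expression[g1], expression[g2])) ** power
             for g2 in genes}
        for g1 in genes
    }


# Toy expression profiles across four samples (hypothetical values).
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [4, 3, 2, 1],   # perfectly anti-correlated with g1
    "g3": [1, 3, 2, 4],   # correlation 0.8 with g1
}
adjacency = unsigned_adjacency(expression, power=6)
print(adjacency["g1"]["g3"])
```

Raising the correlation to a power shrinks weak correlations towards zero much faster than strong ones, which is why the choice of power in the script below matters.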

Topological Overlap Matrix(TOM)

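Overall workflow

Preprocess the data → hierarchical clustering → genes with similar expression patterns form modules → for a given module, study the functional enrichment (GO, KEGG) of its genes, the correlation between each module and the sample trait data, and the correlation between each module and the samples themselves (when there is no trait data, e.g. different tissues) → within a specific module, analyse the interaction network among its genes and identify the key genes.

What to look for in the constructed network

Genes at the centre of the regulatory network are called hub genes. They are often key regulators such as transcription factors, and deserve priority for in-depth analysis.

In the network, genes connected by an edge have similar expression patterns and may therefore have similar functions. So if the gene at one end of an edge has a known function, the uncharacterized gene at the other end can be predicted to have a similar function.

R script

The input is a matrix of FPKM values for all samples from the different treatments or tissues in the RNA-seq experiment; remember to remove genes whose values contain 0.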

setwd("F:/WGCNA")
library(WGCNA)
options(stringsAsFactors = FALSE)
enableWGCNAThreads()
#############################################################################
####################### 1. Read, process and save the data ####################
#############################################################################
fpkm<- read.csv("trans_counts.counts.matrix.TMM_normalized.FPKM.nozero.csv")
#~~~~~~~~~~~~~~~~~~
head(fpkm)
#    GeneID PN_1_TPM PN_1_07_TPM PN_1_08_TPM PN_2_TPM PN_2_09_TPM PN_2_10_TPM
# 1 MSTRG.1 0.000000    1.456143    1.093308 0.204315    0.000000    0.000000
# 2 MSTRG.2 1.516181    0.849313    2.010783 1.567867    2.045446    2.246402
# 3 MSTRG.3 1.305084    1.207246    0.889166 1.470162    0.340003    0.421222
# 4 MSTRG.4 2.744250    2.791988    2.500786 2.719017    1.954149    2.468110
# 5 MSTRG.5 1.946825    1.470012    1.263171 0.205806    1.644505    1.638583
# 6 MSTRG.6 1.325277    0.793530    1.932236 1.210156    1.834274    2.153466
#~~~~~~~~~~~~~~~~~~
dim(fpkm)
names(fpkm)
datExpr0=as.data.frame(t(fpkm[,-c(1)]));
names(datExpr0)=fpkm$GeneID;
rownames(datExpr0)=names(fpkm)[-c(1)];
#data<-log10(date[,-1]+0.01)
# *************************************************************
# Check the input genes for missing values and handle them ****
# *************************************************************
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK
#~~~~~~~~~~~~~~~~~~
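# If the previous step returned TRUE, skip this step; if it returned FALSE, run the if statement below to remove the genes and samples with too many missing values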
if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0)
    printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0)
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
# Check again
dim(datExpr0)
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK
#~~~~~~~~~~~~~~~~~~
# ***************************************************************
# Cluster the samples to detect obvious outliers, and handle them ***
# ***************************************************************
sampleTree = hclust(dist(datExpr0), method = "average")
#sizeGrWindow(12,9)
par(cex = 0.6)
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
     cex.axis = 1.5, cex.main = 2)
abline(h = 80000, col = "red");
clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize = 10)
table(clust)
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
save(datExpr, file = "AS-green-FPKM-01-dataInput.RData")
#############################################################################
############################ 2. Choose a suitable soft-thresholding power ###
#############################################################################
powers = c(c(1:10), seq(from = 12, to=20, by=2))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
##sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels=powers,cex=cex1,col="red");
# this line corresponds to using an R^2 cut-off of h
abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

#############################################################################
############################ 3. Network construction and visualization ######
#############################################################################
#=====================================================================================
#=====================================================================================
# There are two ways to build the network: one-step and step-by-step.
# Method 1: one-step network construction
#=====================================================================================
#=====================================================================================

################################################################################
### 3.1 One-step network construction and module detection #####################
#3.1.1. Network construction ###################################################
###############################################################################
dim(datExpr)
net = blockwiseModules(datExpr, power = 6, maxBlockSize = 6000,
                       TOMType = "unsigned", minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "AS-green-FPKM-TOM",
                       verbose = 3)
table(net$colors)
###############################################################################
#3.1.2. Plot the results #######################################################
###############################################################################
# open a graphics window
#sizeGrWindow(12, 9)
# Convert labels to colors for plotting
mergedColors = labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
###############################################################################
#3.1.3. Save the results #######################################################
###############################################################################                  
moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
table(moduleColors)
MEs = net$MEs;
geneTree = net$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree,
     file = "AS-green-FPKM-02-networkConstruction-auto.RData")
###############################################################################
#3.1.4. Export the network to Cytoscape ########################################
###############################################################################  
# Recalculate topological overlap if needed
TOM = TOMsimilarityFromExpr(datExpr, power = 6);
# Read in the annotation file
# annot = read.csv(file = "GeneAnnotation.csv");
# Select modules: edit this to pick the module colours to export, based on the output of table(moduleColors)
modules = c("turquoise", "blue");
# Select module probes
probes = names(datExpr)
inModule = is.finite(match(moduleColors, modules));
modProbes = probes[inModule];
#modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
                               edgeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
                               nodeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
                               weighted = TRUE,
                               threshold = 0.02,
                               nodeNames = modProbes,
                               #altNodeNames = modGenes,
                               nodeAttr = moduleColors[inModule]);
#################################################################################################
#3.1.5. Visualize the network: a heatmap of the weighted network; each row and column is a gene, and darker colours indicate higher adjacency
#################################################################################################
options(stringsAsFactors = FALSE);
lnames = load(file = "AS-green-FPKM-01-dataInput.RData");
lnames
lnames = load(file = "AS-green-FPKM-02-networkConstruction-auto.RData");
lnames
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#====================================
#3.1.5.1. Visualize the network of all genes
#====================================
# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
# Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
plotTOM = dissTOM^7;
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
# Call the plot function
#sizeGrWindow(9,9)
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
#====================================
#3.1.5.2. Visualize a random subset of 1000 genes
#====================================
nSelect = 1000
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
#sizeGrWindow(9,9)
# Taking the dissimilarity to a power (7 here) makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
#=====================================================================================
#=====================================================================================
#  Method 2: step-by-step network construction
#=====================================================================================
#=====================================================================================

###############################################################################
### 3.2 Step-by-step network construction and module detection ################
###############################################################################
#2. Choose a suitable soft-thresholding power, as above
###############################################################################
#3.2.1. Network construction ###################################################
###############################################################################
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# (1) Co-expression similarity and adjacency ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
softPower = 6;
adjacency = adjacency(datExpr, power = softPower);
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#(2) Turn adjacency into topological overlap ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TOM = TOMsimilarity(adjacency);
dissTOM = 1-TOM
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# (3) Cluster on the topological overlap matrix ~~~~~~~~~~~~~~~~~~~~~~~~~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#Call the hierarchical clustering function
geneTree = hclust(as.dist(dissTOM), method = "average");
# Plot the resulting clustering tree (dendrogram)
#sizeGrWindow(12,9)
plot(geneTree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity",
     labels = FALSE, hang = 0.04);
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#(4) Prune the clustering branches with dynamicTreeCut ~~~~~~~~~~~~~~~~~~~~
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# We like large modules, so we set the minimum module size relatively high:
minModuleSize = 30;
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                            deepSplit = 2, pamRespectsDendro = FALSE,
                            minClusterSize = minModuleSize);
table(dynamicMods)
###############################################################################
#3.2.2. Plot the results #######################################################
###############################################################################
# Convert numeric labels into colors
dynamicColors = labels2colors(dynamicMods)
table(dynamicColors)
# Plot the dendrogram and colors underneath
#sizeGrWindow(8,6)
plotDendroAndColors(geneTree, dynamicColors, "Dynamic Tree Cut",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05,
                    main = "Gene dendrogram and module colors")
###############################################################################
#3.2.3. Merging of modules whose expression profiles are very similar
###############################################################################
# In the dendrogram each leaf is a short line representing one gene;
# branches that sit close together contain highly co-expressed genes, so modules with very similar co-expression are merged
###############################################################################
# Calculate eigengenes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
# Calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs);
# Cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = "average");
# Plot the result
#sizeGrWindow(7, 6)
plot(METree, main = "Clustering of module eigengenes",
     xlab = "", sub = "")
# Merge modules whose eigengenes have a correlation of at least 0.75 (dissimilarity threshold 0.25)
MEDissThres = 0.25
# Plot the cut line into the dendrogram
abline(h=MEDissThres, col = "red")
# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
# The merged module colors
mergedColors = merge$colors;
# Eigengenes of the new merged modules:
mergedMEs = merge$newMEs;
# Plot the dendrogram before (Dynamic Tree Cut) and after (Merged dynamic) merging
#sizeGrWindow(12, 9)
#pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6)
plotDendroAndColors(geneTree, cbind(dynamicColors, mergedColors),
                    c("Dynamic Tree Cut", "Merged dynamic"),
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
#dev.off()
# Plot only the merged module colors
plotDendroAndColors(geneTree,mergedColors,"Merged dynamic",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
###############################################################################
#3.2.4. Save the results ######################################################
###############################################################################
# Rename to moduleColors
moduleColors = mergedColors
# Construct numerical labels corresponding to the colors
colorOrder = c("grey", standardColors(50));
moduleLabels = match(moduleColors, colorOrder)-1;
MEs = mergedMEs;
# Save module colors and labels for use in subsequent parts
save(MEs, moduleLabels, moduleColors, geneTree, file = "AS-green-FPKM-02-networkConstruction-stepByStep.RData")
###############################################################################
#3.2.5. Export the network to Cytoscape #######################################
###############################################################################
# Recalculate topological overlap if needed
TOM = TOMsimilarityFromExpr(datExpr, power = 6);
# Read in the annotation file
# annot = read.csv(file = "GeneAnnotation.csv");
# Select modules (edit this list to match the modules of interest)
modules = c("brown", "red");
# Select module probes
probes = names(datExpr)
inModule = is.finite(match(moduleColors, modules));
modProbes = probes[inModule];
#modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
                               edgeFile = paste("AS-green-FPKM-Step-by-step-CytoscapeInput-edges-", paste(modules, collapse="-"), ".txt", sep=""),
                               nodeFile = paste("AS-green-FPKM-Step-by-step-CytoscapeInput-nodes-", paste(modules, collapse="-"), ".txt", sep=""),
                               weighted = TRUE,
                               threshold = 0.02,
                               nodeNames = modProbes,
                               #altNodeNames = modGenes,
                               nodeAttr = moduleColors[inModule]);
#################################################################################################
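#3.2.6. Network visualization: plot the weighted network as a heatmap; each row or column corresponds to one gene, and darker colors indicate higher topological overlap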
#################################################################################################
options(stringsAsFactors = FALSE);
lnames = load(file = "AS-green-FPKM-01-dataInput.RData");
lnames
lnames = load(file = "AS-green-FPKM-02-networkConstruction-stepByStep.RData");
lnames
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#====================================
#3.2.6.1. Visualize the network of all genes ==
#====================================
# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
dissTOM = 1-TOMsimilarityFromExpr(datExpr, power = 6);
# Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
plotTOM = dissTOM^7;
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
# Call the plot function
#sizeGrWindow(9,9)
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
#====================================
#3.2.6.2. Visualize a random subset of 1000 genes ==
#====================================
nSelect = 1000
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
#sizeGrWindow(9,9)
# Taking the dissimilarity to a power (7 here) makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes")
# Plot the relationships among modules, based on the module eigengenes computed from the expression data
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
MET = orderMEs(MEs)
sizeGrWindow(7, 6) 
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(3,4,2,2), plotDendrograms = FALSE, xLabelsAngle = 90)

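Brief explanation of some of the result plots

Generating the network graph in Cytoscape

Only the second file, the edges file, is needed to build the network graph. After importing it, use the import settings in Cytoscape to set the first column as fromNode, the second column as toNode, and the third column as the edge (interaction) attribute; the network graph can then be generated.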

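WGCNA sample requirements

WGCNA builds its expression network from correlation coefficients. When there are too few samples the correlations are unreliable and the resulting network is of little value. The recommended sample numbers are therefore:

With ≥8 independent samples (not counting replicates), a WGCNA co-expression network based on Pearson correlation can be considered (how well it works depends on the data);

With ≥15 samples (biological replicates may be included), WGCNA gives better results.

With fewer than 8 samples, this analysis is not recommended.

The method is more meaningful for comparing different materials or different tissues; for the same sample treated at different time points it is of limited value.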

Errors and fixes

Error 1

Running the pickSoftThreshold function fails as follows:

> sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
pickSoftThreshold: will use block size 773.
 pickSoftThreshold: calculating connectivity for given powers...
   ..working on genes 1 through 773 of 57862
Error in { : task 1 failed - "'x' has a zero dimension."

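Check the datExpr variable:

> dim(datExpr)
[1] 0 50057

Running dim on the earlier version of datExpr shows that it definitely did not have 0 rows, and the most recent step applied to datExpr was keepSamples = (clust==1).
Inspect keepSamples after keepSamples = (clust==1):

> keepSamples
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Every value is FALSE, which is why the subsequent datExpr = datExpr0[keepSamples, ] returns zero rows.
Tracing further back, clust comes from clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize = 10). cutreeStatic cuts the sample clustering tree according to cutHeight and minSize in order to drop outlier samples: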

cutHeight: height at which branches are to be cut.
minSize: minimum number of objects on a branch to be considered a cluster.

So adjust cutHeight and minSize as needed, inspect the resulting clust values, and keep the usable samples:

> clust
 [1] 1 2 1 2 1 1 2 1 1 1 1 1
> keepSamples = (clust==1)
> keepSamples
 [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> clust2
 [1] 0 0 0 0 0 0 0 0 0 0 0 0
> keepSamples = (clust2==0)
> keepSamples
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

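Alternatively, take the blunt approach: manually remove the outlier samples from the input data at the start and skip the sample-clustering check for obvious outliers altogether.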
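The failure mode in Error 1 can be reproduced outside R. The sketch below is plain Python, not WGCNA code. It assumes, as the cutreeStatic documentation describes, that branches with fewer than minSize members are labelled 0 and the remaining branches are numbered from largest to smallest; the function names and branch ids are made up for illustration.

```python
from collections import Counter

def label_branches(branch_ids, min_size):
    """Relabel raw branch ids: branches smaller than min_size become 0,
    the others are numbered 1, 2, ... from largest to smallest."""
    sizes = Counter(branch_ids)
    kept = [b for b, n in sizes.most_common() if n >= min_size]
    new_label = {b: i + 1 for i, b in enumerate(kept)}
    return [new_label.get(b, 0) for b in branch_ids]

def keep_mask(labels, wanted=1):
    """Boolean mask of samples to keep; fail early instead of returning an empty selection."""
    mask = [lab == wanted for lab in labels]
    if not any(mask):
        raise ValueError("no sample has label %d; relax cutHeight/minSize" % wanted)
    return mask

# 12 samples that fall on two branches (9 and 3 members) after the cut
branches = ["a", "b", "a", "b", "a", "a", "b", "a", "a", "a", "a", "a"]

print(label_branches(branches, min_size=3))   # two clusters survive
print(label_branches(branches, min_size=10))  # no branch reaches 10 members, so every label is 0
```

With min_size=10 every label is 0, so a mask built as labels == 1 is FALSE everywhere and the subset has zero rows. Checking the mask before subsetting, as keep_mask does, surfaces the problem at the step that causes it rather than later in pickSoftThreshold.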

Error 2

Step 5 of the WGCNA tutorial, Network visualization using WGCNA functions, fails as follows:

> TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")

Error in .heatmap(as.matrix(dissim), Rowv = as.dendrogram(dendro, hang = 0.1),  :
row dendrogram ordering gave index of wrong length

The message says row dendrogram ordering gave index of wrong length, so check the sizes of the three variables plotTOM, geneTree and moduleColors:

> dim(plotTOM)
> geneTree
> moduleColors

Sure enough, the three sizes differ: geneTree covers fewer genes. Tracing back, geneTree = net$dendrograms[[1]], and net comes from the network construction step:

net = blockwiseModules(datExpr, power = 6,
TOMType = "unsigned", minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "femaleMouseTOM",
verbose = 3)

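This is where the problem lies. According to the documentation, blockwiseModules defaults to maxBlockSize = 5000 (at most 5000 genes per block) to limit memory use. Our data set has more genes than that, so the function automatically split datExpr into several blocks, and net$dendrograms[[1]] holds only the dendrogram of the first block. The fix is simple: set maxBlockSize to a value larger than the number of genes (the number of columns reported by dim(datExpr)).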

Congratulations For My First Paper

2016-09-13T11:57:28.000Z

Congratulations! My first SCI paper has been published in Scientific Reports. It is a nice story about cottonseed, and you are welcome to read it.

Conclusions of this paper:

This is the first report of artificially improved oil content via RNAi strategy and the analysis of its metabolic mechanism in Upland cotton. Decreased GhPEPC1 expression in transgenic cotton led to the increased expression of TAG biosynthesis related genes and elevated cottonseed oil content, which demonstrated the feasibility of improving cottonseed oil yield by regulating the carbon flux.

GhPEPC1 works as a core enzyme not only involved in photosynthesis but also regulated the inflowing of carbon turnover to fatty acid biosynthesis and finally contributed to the increase of cottonseed oil content. In this report, the carboxylation pathway of PEP to OAA was blocked through RNAi of GhPEPC1 and resulted in a decline in OAA concentration. Under this background, more proteins would be converted to aspartate involved in anaplerotic reactions to offset the OAA deficiency in mitochondrial and this hypothesis has been confirmed by our RNA-seq data. Among the DEGs, the glutamine-dependent asparagine synthase 1 was found to be down-regulated in the RNAi lines. Simultaneously, more pyruvate will be transported into mitochondria due to the acceleration of glycolysis, through mitochondrial pyruvate carrier (MPC) located in the mitochondrial inner membrane. The pyruvate located in mitochondria was then involved into two metabolism branches: conversion into acetyl-CoA through pyruvate decarboxylation with pyruvate dehydrogenase complex (PDC) and other irreversible carboxylation to form OAA by pyruvate carboxylase (PC) ligase to serves as an anaplerotic reaction for TCA. The excessive acetyl CoA and relative lack of OAA forced chloroplasts to the heighten light-dependent reactions based on photosynthetic electron transport chains and which produced the ATP and NADPH by using Calvin cycle, where the fixed CO2 was converted as sucrose to provide substrate for glycolysis. However, the RNAi cotton plant was in a state of ‘starvation’ because of the down-regulation of GhPEPC and the TCA were confined. Moreover, the expression levels of ACC in transgenic lines were significantly increased, which indicates that superfluous acetyl-CoA could combine with OAA and form citrate and then transported to cytoplasm via citrate transport protein (CTP). These citrates have participated into biosynthesis of fatty acids and finally stored in cottonseed in the form of TAG. 
In the figure, red markers indicate genes showing an upward trend in the RNA-seq data; blue markers indicate down-regulated genes.

Full text: Metabolic engineering of cottonseed oil biosynthesis pathway via RNA interference

How to cite this article:

Xu, Z. et al. Metabolic engineering of cottonseed oil biosynthesis pathway via RNA interference. Sci. Rep. 6, 33342; doi: 10.1038/srep33342 (2016).

Acknowledgements

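First, thanks to the lab for providing an excellent research platform and resources, which were indispensable; next, to my supervisors Prof. Jin and Prof. Guo for supporting the research; and to Li Jingwen, a senior labmate who has since graduated, without whom this paper would not exist. Thanks also to Hakim for reviewing my manuscript at short notice; his revisions did much to improve its grammar and flow, especially in the introduction and discussion. And of course thanks to every labmate.

Next, it's time for a big party!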

Awk Regular Expressions

2016-08-31T15:25:51.000Z

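--re-interval

In standard regular expressions {m} means matching the preceding item m times, so [A-Z]{m} matches any character from A to Z m times. In awk we would therefore normally write: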

cat test.txt
12  AT  CG
7555  AAA       AT
878 GGGG        CTG
cat test.txt | awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++) if ($i~/^[ATCG]{2}$/) print $i}'

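Empty: nothing is printed?
Interval expressions such as {m} and {m,n} are part of POSIX extended regular expressions, but traditional awk did not support them, and gawk before version 4.0 left them disabled by default.
On such versions, using {m,n} in a regular expression requires passing the --re-interval option (or --posix) to gawk.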

cat test.txt | awk --re-interval 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++) if ($i~/^[ATCG]{2}$/) print $i}'
AT
CG
AT

-v var=value or --assign var=value assigns a user-defined variable:

$ awk -va=1 '{print $1,$1+a}' log.txt
 ---------------------------------------------
 2 3
 3 4
 This's 1
 10 11
 $ awk -va=1 -vb=s '{print $1,$1+a,$1b}' log.txt
 ---------------------------------------------
 2 3 2s
 3 4 3s
 This's 1 This'ss
 10 11 10s

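Substitution with gensub()

echo "123356" | awk '{print gensub("3","d",2)}'
123d56

gensub(a, b, c, d): a is the regular expression to match, b is the replacement text, c says which match to replace (e.g. 1, 2, or "g" for all), and d is the target string, such as $1 or $2 ($0 if d is omitted). The return value is the target after substitution; if nothing matched, the original target is returned. gensub() is a gawk extension and, unlike sub() and gsub(), does not modify the target itself.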

Merging two files on a condition

[zpxu@node102 ~]$  cat 2.txt 1.txt 
I0011  11111    hhh
I0012  22222    kkk
I0014  55555    ppp
I0017  66666    ttt
0011AAA 200.00 20050321
0012BBB 300.00 20050621
0013DDD 400.00 20050622
0014FFF 500.00 20050401
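# Compare characters 1-4 of the first field in 1.txt with characters 2-5 of the first field in 2.txt; where they match, append the whole 2.txt line to the 1.txt line
[zpxu@node102 ~]$ awk  'NR==FNR{a[substr($1,2,4)]=$0}NR>FNR&&a[b=substr($1,1,4)]{print $0, a[b]}' 2.txt 1.txt 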
0011AAA 200.00 20050321 I0011  11111    hhh
0012BBB 300.00 20050621 I0012  22222    kkk
0014FFF 500.00 20050401 I0014  55555    ppp
# NR==FNR is true while 2.txt (the first file) is being read; NR>FNR is true while 1.txt is being read
#awk 'NR==FNR{a[$1]=$2}NR>FNR&&a[$1] {print $1,a[$1]}' 2.txt 1.txt

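Explanation: 2.txt is processed first; a[key]=$0 builds a hash (associative array) whose key is taken from $1 and whose value is the whole line $0. When 1.txt is processed, the key derived from its $1 is looked up in array a; on a match, the 1.txt line is printed followed by the matching line stored from 2.txt.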
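The two-pass hash join can also be written out step by step. The following is a Python rendering of the same logic, offered only as a sketch to make the two passes explicit; the function name merge_on_key is made up, and the data are the sample lines shown above.

```python
def merge_on_key(file2_lines, file1_lines):
    # First pass (awk: NR==FNR): index 2.txt by characters 2-5 of its first field
    index = {}
    for line in file2_lines:
        fields = line.split()
        if fields:
            index[fields[0][1:5]] = line

    # Second pass (awk: NR>FNR): look up characters 1-4 of the first field of 1.txt
    merged = []
    for line in file1_lines:
        fields = line.split()
        if fields and fields[0][:4] in index:
            merged.append(line + " " + index[fields[0][:4]])
    return merged

file2 = [
    "I0011  11111    hhh",
    "I0012  22222    kkk",
    "I0014  55555    ppp",
    "I0017  66666    ttt",
]
file1 = [
    "0011AAA 200.00 20050321",
    "0012BBB 300.00 20050621",
    "0013DDD 400.00 20050622",
    "0014FFF 500.00 20050401",
]

for row in merge_on_key(file2, file1):
    print(row)
```

The dictionary plays the role of the awk array a: it is filled completely before any line of the second file is examined, which is why the file to be looked up must come first on the awk command line.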

next and getline

awk code: ‘BEGIN{…}{Main Input}END{..}’
next reads the next input line and restarts the script from the first rule (of the Main Input section).

[zpxu@node102 ~]$  cat data 
name naughty
25 shandong
age 14  
ah,here is test
[zpxu@node102 ~]$ awk '{if(NR==1){next} print $1,$2}' data   
25 shandong
age 14
ah,here is

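When the record number is 1 the current line is skipped: the print $1,$2 after next is not executed, and awk reads the next line and starts over.
Using next to merge several lines into one
Two or more files can first be combined with cat a.txt b.txt | sort -n -k1, after which next can be used to fold multiple lines into one, giving a conditional merge of the files: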

cat data
web01[192.168.2.100]
httpd            ok
tomcat               ok
sendmail               ok
web02[192.168.2.101]
httpd            ok
postfix               ok
web03[192.168.2.102]
mysqld            ok
httpd               ok
awk '/^web/{T=$0;next;} {print T":\t"$0;}' data
web01[192.168.2.100]:   httpd            ok
web01[192.168.2.100]:   tomcat               ok
web01[192.168.2.100]:   sendmail               ok
web02[192.168.2.101]:   httpd            ok
web02[192.168.2.101]:   postfix               ok
web03[192.168.2.102]:   mysqld            ok
web03[192.168.2.102]:   httpd               ok
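# When a line starts with web, the whole line is stored in T and awk moves on to the next line; each following line is then printed with T in front of it.

Like next, getline reads the next line of input. The difference is that next hands control back to the top of the awk script, whereas getline leaves the flow of control unchanged: after reading the line, awk carries on with the current rule. getline overwrites the contents of $0.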

[zpxu@node102 ~]$ cat d  
$1=="name"{print $0;getline;print $0;}  
$1=="age"{print $0}  
[zpxu@node102 ~]$ awk -f d data   
name naughty  
25 shandong  
age 14

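Overall, getline behaves as follows:

When there is no redirection (| or <) on either side, getline operates on the current input file and reads its next line into the variable var that follows it, or into $0 if no variable is given. Because awk has already read one line before getline runs, the lines getline returns are the alternate ones.

When there is a redirection (| or <), getline operates on the redirected input. That input has only just been opened and awk has not yet read a line from it, so getline returns its first line rather than an alternate line.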
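The difference in control flow between next and getline maps loosely onto two Python idioms: continue inside a for loop, and calling next() on the iterator that drives the loop. The sketch below mimics the two awk examples above under that analogy; it is an illustration of the control flow, not an awk emulator, and the function names are invented.

```python
def run_with_next(lines):
    """Mimic: awk '{if(NR==1){next} print $1,$2}'"""
    out = []
    for nr, line in enumerate(lines, start=1):
        if nr == 1:
            continue  # like awk's next: abandon this record, restart the rules on the next one
        fields = line.split()
        out.append(" ".join(fields[:2]))
    return out

def run_with_getline(lines):
    """Mimic the two rules in file d:
         $1=="name"{print $0;getline;print $0;}
         $1=="age"{print $0}
    """
    out = []
    it = iter(lines)
    for line in it:
        fields = line.split()
        if fields and fields[0] == "name":
            out.append(line)
            following = next(it, None)  # like getline: fetch one more record, stay inside this rule
            if following is not None:
                line = following        # $0 is overwritten
                fields = line.split()   # and the fields are re-split
            out.append(line)
        if fields and fields[0] == "age":  # the second rule sees the new $0
            out.append(line)
    return out

data = ["name naughty", "25 shandong", "age 14", "ah,here is test"]
print(run_with_next(data))
print(run_with_getline(data))
```

In run_with_getline the record fetched inside the first rule is never offered to that rule again, which is the "alternate line" effect described above.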

Deleting multiple rows or columns

Rows

[zpxu@node102 ~]$  cat 1.txt 
1
2
3
4
5
6
7
8
9
10
11
12
[zpxu@node102 ~]$ awk -vD="1,3,5,8,11" 'BEGIN{split(D,a,",");c=1}NR==a[c]{c++;next}1' 1.txt 
2
4
6
7
9
10
12

Columns

[zpxu@node102 ~]$  cat data 
1,2,3,4,5,6,7,8,9,10,11,12
[zpxu@node102 ~]$ awk --re-interval -vD='1,3,5,11' 'BEGIN{l=split(D,a,",")}{for(i=1;i<=l;i++){$0=gensub("(([^,]*,?){"a[i]-i"})([^,]+,?)(.*)","\\1\\4","1")}}1' data 
2,4,6,7,8,9,10,12
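As a cross-check of what the two one-liners are meant to produce, the same row and column deletions can be written in a few lines of Python. This is only a sketch of the intended result with invented function names, not a replacement for the awk commands.

```python
def drop_rows(lines, drop):
    """Remove the lines whose 1-based line numbers are listed in drop."""
    drop = set(drop)
    return [line for nr, line in enumerate(lines, start=1) if nr not in drop]

def drop_columns(line, drop, sep=","):
    """Remove the fields whose 1-based positions are listed in drop."""
    drop = set(drop)
    fields = line.split(sep)
    return sep.join(f for i, f in enumerate(fields, start=1) if i not in drop)

rows = [str(n) for n in range(1, 13)]
print(drop_rows(rows, [1, 3, 5, 8, 11]))
print(drop_columns("1,2,3,4,5,6,7,8,9,10,11,12", [1, 3, 5, 11]))
```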

References

Linux awk command
awk functions + arrays + multi-file processing

Notes on uploading sequencing data to NCBI

2016-08-30T05:50:01.000Z

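Uploading sequencing data to the NCBI SRA database.
Submission portal: https://submit.ncbi.nlm.nih.gov/
Overall order of submission: BioProject, BioSample, SRA

Note that many fields cannot be changed once they have been saved or submitted, especially the Alias fields, so think carefully before saving; it helps to look at how others have submitted their data first. If something really must be changed, email the NCBI staff and ask them to modify it.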

1. BioProject

1> Go to the BioProject home page;
2> Click New submission

3> Fill in the information step by step, then save.

2. BioSample

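1> Go to the BioSample home page;
2> Click New submission
3> Fill in the information step by step, then save.
Choose the batch template that suits your data; for plant samples, download the Plant.1.0.txt file and fill in the relevant information.
Suppose there are two experimental treatments, sample1 and sample2, each with three biological replicates of paired-end RNA-seq data: sample1-1, sample1-2, sample1-3, sample2-1, sample2-2, sample2-3. The Plant.1.0.txt file then looks like this (fill in the remaining columns as appropriate), so that the biological replicates are placed under the same SAMPLE as different RUNs: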

sample_name	bioproject_accession
sample1	PRJNAxxxxxx
sample2	PRJNAxxxxxx

3. SRA

1> Go to the SRA home page;
2> Click Create new submission
3> Fill in the information step by step, then save.

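4. Uploading the data

On Linux, Aspera is recommended for uploading. First email the SRA staff to request the private SSH key file.

/software/.aspera/connect/bin/ascp -i sra-8.ssh.priv -QT -l100m -k1 sample1_1.fastq.gz asp-sra@upload.ncbi.nlm.nih.gov:incoming

Note: the command can only be run on a login node with internet access.
For detailed parameter descriptions and other upload methods see: https://www.ncbi.nlm.nih.gov/sra/docs/submitfiles/

5. Starting a new submission

If you have data to submit to NCBI but are not sure how your data type should be submitted, click Submission Wizard, which provides submission entry points for each type of data.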

Metagenome: an introduction to metagenomics

2016-08-28T05:32:00.000Z

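Concepts

The metagenome (also called the microbial environmental genome) is defined as "the genomes of the total microbiota found in nature", that is, the sum of the genetic material of all the microorganisms in a habitat. It includes the genes of both culturable and not-yet-culturable microorganisms, and at present mainly refers to the combined genomes of the bacteria and fungi in an environmental sample.
Metagenomics is a microbiological research approach that takes the genomes of the microbial community in an environmental sample as its object of study, uses functional gene screening and/or sequencing as its methods, and aims to understand microbial diversity, population structure, evolutionary relationships, functional activity, cooperative interactions and relationships with the environment.

Species richness: used to describe and quantify a microbial community; it reflects the number of species in a given area.
Evenness: quantifies how individuals are distributed among the species of a community (for example, a few dominant species and a large majority of rare species), i.e. how evenly abundance is shared among species. Suppose a second community contains the same total number of species as the first, the only difference being that all of its species are fairly common. The two communities then have the same species richness but different abundances for individual species, so how should their diversity be compared?
To describe and compare the diversity of different communities better, three measures are used in metagenomics: α diversity, β diversity and γ diversity, related by β = γ/α.

α diversity is the number of species in one sample (environment);

β diversity measures, at the regional scale, the rate of change in species composition from one community to another along a gradient;

γ diversity describes species diversity within a region or at continental scale;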
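A small worked example makes the relation β = γ/α concrete. The sketch below assumes the multiplicative form in which α is taken as the mean species richness per sample and γ as the richness of all samples pooled; the species names and the function name are invented for illustration.

```python
def diversity_summary(samples):
    """Return (alpha, gamma, beta) for a list of samples, each given as a set of species names."""
    alpha = sum(len(set(s)) for s in samples) / len(samples)  # mean richness per sample
    gamma = len(set().union(*samples))                        # richness of the pooled region
    beta = gamma / alpha                                      # species turnover between samples
    return alpha, gamma, beta

site_a = {"sp1", "sp2", "sp3", "sp4"}
site_b = {"sp3", "sp4", "sp5", "sp6"}

alpha, gamma, beta = diversity_summary([site_a, site_b])
print(alpha, gamma, beta)  # 4.0 6 1.5
```

Two samples with identical species lists give β = 1 (no turnover), while two samples of equal richness with no species in common give β = 2, so β can be read as the effective number of distinct communities in the region.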

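History of metagenomics

The study of microbial communities began in 1676, when Leeuwenhoek observed the first microorganisms. In the late 1970s Carl Woese proposed that the 16S rRNA gene could be used for species classification. Over the following decades, molecular techniques such as PCR, FISH and DGGE carried microbial ecology into the "new uncultured world". In the last decade, with the arrival of next-generation sequencing (NGS), microbial community research has moved from simple species discovery to metagenomics, that is, using NGS to study the genetic composition and community functions of all the microorganisms in an environmental sample.
At present, metagenomic sequencing is divided into two categories according to the type of sequencing data: whole-genome sequencing (full shotgun metagenomics) and amplicon sequencing (marker gene amplification metagenomics).

In whole-genome (shotgun) sequencing, the DNA of all the microorganisms is extracted directly from the environmental sample, and a metagenomic library is built and sequenced.

This strategy can answer the following questions:
1) Which microorganisms are present in the environment?
2) What functions does the microbial community have?
3) How do the microorganisms interact to maintain ecological balance?

In amplicon sequencing, a specific gene in the microbial genomes, such as the 16S rRNA gene, is PCR-amplified and sequenced. This strategy allows the microbial community structure of all kinds of complex samples to be analysed conveniently and quickly.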

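Metagenomic analysis workflow

Estimating species composition and abundance in a metagenomic sample

Species classification in metagenomics is generally expressed in OTUs (operational taxonomic units). Typically, OTUs are defined with 16S rDNA for prokaryotes and 18S rDNA for eukaryotes.
Using 16S/18S rDNA to identify species has several problems, however: 1) horizontal transfer of rDNA can undermine the reliability of rDNA-based identification; 2) a single bacterium may carry several copies of 16S rDNA with different sequences, which distorts estimates of the number of OTUs. For this reason other marker genes, such as single-copy housekeeping genes, are recommended as markers for strain identification.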

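Research findings

The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment
Using metagenomic shotgun sequencing and a metagenome-wide association study of fecal, dental and salivary samples from rheumatoid arthritis (RA) patients and healthy individuals, the authors observed concordance between the gut and oral microbiomes, and found that both are dysbiotic in RA patients and partially return to normal after RA treatment. Compared with healthy individuals, the marked alterations in the gut, dental or saliva microbiome of RA patients correlated closely with clinical measures. In particular, Haemophilus spp. were depleted in the gut, dental and saliva samples of RA patients and negatively correlated with serum autoantibody levels, whereas Lactobacillus salivarius was over-represented at all three sites and increased with RA severity. Functionally, the microbial communities of RA patients showed changes in the redox environment and in the transport and metabolism of iron, sulfur, zinc and arginine.

References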

The Road to Metagenomics: From Microbiology to DNA Sequencing Technologies and Bioinformatics
Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies

Post-Translational Modifications (PTMs)：Phosphorylation

2016-08-27T15:14:32.000Z

Introduction

PTM

Post-translational modification (PTM) serves as a molecular switch mechanism, modulating diverse protein functions including enzymatic activity, protein turnover, interactions, conformation, localization, and crosstalk with other PTMs, which in turn regulate broad cellular biological functions.
These modifications include phosphorylation, glycosylation, ubiquitination, nitrosylation, methylation, acetylation, lipidation and proteolysis.

Phosphorylation

Phosphorylation is the most common mechanism of regulating protein function and transmitting signals throughout the cell.
Reversible Protein Phosphorylation Is a Molecular Switch Mechanism. Reversible protein phosphorylation is characterized by the addition of phosphate donated from ATP and the removal of phosphate from a phosphorylated protein substrate, catalyzed by protein kinase and phosphatase (PP) enzymes respectively.
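Protein phosphorylation occurs mainly on two classes of amino acids: serine (Ser, S) together with threonine (Thr, T), and tyrosine (Tyr, Y). The two classes are phosphorylated by different enzymes and serve different functions, although a few dual-specificity enzymes act on both, such as MEK (mitogen-activated protein kinase kinase, MAPKK). More than 90% of the phosphoproteome consists of phosphoserine (pS) and phosphothreonine (pT); a commonly quoted relative abundance within a cell is pS:pT:pY = 1800:200:1, and the shares are also given as roughly 84%, 15% and <1%. Although pY makes up the smallest share, the aberrant regulation of receptor tyrosine kinases (RTKs) in human disease has kept research on pY ahead of that on pS and pT.

The main role of serine phosphorylation is to change a protein's conformation and thereby activate it, chiefly its enzymatic activity.

Besides altering conformation and activating the protein, tyrosine phosphorylation has a more important function: it provides a structural docking site for binding proteins, promoting interactions with other proteins to form multi-protein complexes. Complex formation in turn promotes further protein phosphorylation, and so the signal produced by the initial phosphorylation event is relayed step by step. If the initial signal is one that stimulates cell growth, it is ultimately relayed into the nucleus, leading to DNA replication and cell division.

Signal Transduction Cascades

Cells monitor the intracellular microenvironment in real time through receptors, ion channels and transporters on their surface. Cell-surface receptors and transporters are cell-type specific, which lets a cell respond rapidly to intracellular or extracellular stimuli. After sensing a stimulus, a cell-surface receptor activates downstream kinases, which phosphorylate and activate their own downstream substrates to pass the signal on.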

Signal transduction cascades can be linear, in which kinase A activates kinase B, which activates kinase C and so forth. Signaling pathways have also been discovered that amplify the initial signal: kinase A activates multiple kinases, which in turn activate additional kinases. With this type of signaling, a single molecule, such as a growth factor, can activate global cellular programs such as proliferation.

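The intensity and duration of phosphorylation-dependent signaling are determined by three factors:

Removal of the activating ligand;

Proteolysis of the kinase or substrate;

Phosphatase-dependent dephosphorylation;

Protein phosphorylation sites

A given cell contains thousands of phosphorylation sites:

Any particular cell contains thousands of different proteins;

Roughly 1/10 to 1/2 of proteins are in a phosphorylated state;

An estimated 30% of the proteins encoded by the human genome can be phosphorylated, and abnormal phosphorylation often leads to disease;

A given protein can be phosphorylated at several different sites;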

Dephosphorylation

Although dephosphorylation is the end goal of both serine/threonine phosphatases and tyrosine phosphatases, the two groups achieve it through separate mechanisms.
Serine/threonine phosphatases mediate the direct hydrolysis of the phosphorus atom of the phosphate group using a bimetallic (Fe/Zn) center, while tyrosine phosphatases form a covalent thiophosphoryl intermediate that facilitates removal of the phosphate from the tyrosine residue.

References

Phosphorylation
Protein Phosphorylation: A Major Switch Mechanism for Metabolic Regulation

PacBio sequence error correction and assembly via pacBioToCA

2016-08-27T11:57:48.000Z

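Illumina second-generation sequencing has a fundamental weakness: it ultimately relies on PCR amplification, so it suffers from systematic errors such as amplification bias and failure to amplify regions of high GC content. Sequencing errors are unavoidable, and the reads are short. It is nevertheless cheap, has very high throughput and reaches about 99% accuracy, so its overall cost-effectiveness keeps it popular. When a genome is assembled from short reads, large repeats are very hard to resolve.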

10X Genomics

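10X Genomics, the much-watched sequencing dark horse of 2015, is an upgrade of conventional Illumina second-generation sequencing. Thanks to an ingenious barcoding library-preparation scheme, short-read Illumina sequencing can yield linked-read information spanning 30-100 kb. Combined with second-generation data, it can give scaffold assemblies comparable to those from third-generation sequencing.

Basic principle: each long DNA fragment is first partitioned into a separate oil droplet. With the proprietary GEM library-preparation technology the long fragments are sheared to a size suitable for sequencing, and the pieces that come from the same droplet (the same long DNA fragment) receive the same DNA sequence tag (barcode). After sequencing on an Illumina system, reads of common origin can in principle be assembled separately to recover the original long DNA fragment.
How does it perform across regions of different GC content? The October 2015 Nature Reviews Genetics article Genetic variation and the de novo assembly of human genomes summarised the coverage distribution of PacBio, 10X Genomics and Illumina across DNA regions of different GC content:

Compared with Illumina, 10X Genomics is an improvement, but its curve is still arch-shaped, whereas PacBio shows an unbiased, uniform distribution. Coverage with 10X is still strongly affected by GC content, so if 10X is to be used, the GC content of the target DNA should ideally stay within 30-70%.
As an upgraded version, however, 10X Genomics has some advantages of its own:

(1) Low input: only 1 ng of genomic DNA is needed to build a long-fragment library;

(2) Precise partitioning: the large number of barcodes and partitions allows the DNA to be partitioned precisely;

(3) Long-range information: the technology connects seamlessly with Illumina sequencers, and short reads can be used to obtain fragments of up to 100 kb;

(4) Better genome assemblies: combining the long-range information with Illumina assembly data gives a scaffold N50 more than ten times longer than Illumina alone.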

PacBio

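Among third-generation technologies, PacBio single-molecule real-time (SMRT) DNA sequencing can reach a consensus accuracy above 99.999% (QV50), is unaffected by the GC and AT content of the DNA, and gives an average read length of up to 20 kb (longest >60 kb).

The biggest weaknesses of PacBio third-generation sequencing are limited throughput and low single-pass (1X) accuracy (about 85%, i.e. an error rate of roughly 15%). Its errors occur completely at random, however, so they can be corrected by coverage. If throughput is not a limiting factor, PacBio is currently the most accurate sequencing method: the error rate can approach the frequency of rare mutations (so that a sequencing error can no longer be told apart from a rare mutation). In 2012 Michael Schatz of Cold Spring Harbor Laboratory developed an error-correction algorithm that uses accurate second-generation short reads to correct third-generation long reads. This approach, called "hybrid error correction and assembly" (Hybrid error correction and de novo assembly of single-molecule sequencing reads), further improves PacBio accuracy.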
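Why random errors can be voted away by coverage is easy to see with a toy calculation. The sketch below assumes a deliberately simplified model that is not PacBio's actual consensus algorithm: every read is wrong at a given site independently with probability 0.15, all wrong reads agree with each other (a pessimistic choice), and the consensus is a plain majority vote over an odd number of reads.

```python
from math import comb

def majority_error(per_read_error, coverage):
    """Probability that more than half of `coverage` independent reads are wrong at one site."""
    e = per_read_error
    first_losing_count = coverage // 2 + 1
    return sum(
        comb(coverage, k) * e**k * (1 - e) ** (coverage - k)
        for k in range(first_losing_count, coverage + 1)
    )

for cov in (1, 5, 15, 31):
    print(cov, majority_error(0.15, cov))
```

Under this model the consensus error falls steeply as coverage grows. A systematic error, which recurs in the same place in every read, would not shrink with coverage, which is the contrast drawn above between random and systematic errors.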

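PBcR: hybrid error correction and assembly

Figure: PBcR hybrid error correction and assembly. Pink rectangles: individual PacBio RS reads; black vertical bars: sequencing errors. (a) Because of erroneous bases it is hard to tell whether two reads overlap at their ends; (b) high-quality short reads are aligned to the error-containing long reads; black bars in the short reads mark 'mapping errors', a combination of the sequencing errors in the long and short reads. In addition, a two-copy repeat (grey outline) causes short reads to pile up in each copy; to avoid reads mapping to the wrong repeat copy, only the highest-scoring alignment of each short read is kept; (c) the remaining alignments form a consensus sequence (purple rectangle); some errors shared by the long and short reads remain uncorrected; (d) overlaps are computed between the corrected long reads; (e) the final assembly can span the repeat region.

Illumina read coverage for error correction

Correction accuracy and assembly consistency show diminishing returns beyond 50X of high-quality Illumina reads, so 50X of Illumina reads is sufficient. After correction the accuracy of PacBio long reads rises from 85% to >99.9%, with chimeric and mis-trimmed reads at <2.5% and <1% respectively.
With the current P6C4 chemistry, each SMRT Cell yields about 600M to 1G of data on average.
PacBio's distinctive advantages (long reads, no GC bias and no PCR amplification bias) help to resolve complex repeat regions and to span entire transcribed gene regions, markedly improving the quality of de novo genome and transcriptome assemblies.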

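Analysis with Illumina (second-generation) plus PacBio (third-generation) data

PBcR first improves the accuracy of the PacBio reads by error correction and then assembles them. Correction and assembly can run either as self-correction (using only PacBio RS data; fastqToCA is run automatically) or as correction with high-identity sequences (second-generation data).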

self-correction

PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 > run.out 2>&1

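With high-quality Illumina reads

# prepare the short reads
fastqToCA -libraryname illumina -technology illumina -reads illumina.fastq > illumina.frg
# error correction
pacBioToCA -length 500 -partitions 200 -l ec_pacbio -t 16 -s pacbio.spec \
    -fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1
# assembly
runCA -p asm -d asm -s asm.spec ec_pacbio.frg > asm.out 2>&1

Notes: in the short-read preparation step, check the quality encoding on the fourth line of each record in the second-generation FASTQ data. It is usually Phred+33; otherwise specify it with the -type parameter, or a QV error will be reported.
For correction, PBcR requires the AMOS and blasr dependencies to be installed; the inputs are the short reads (illumina.frg) and the long reads (pacbio.filtered_subreads.fastq).
The libraryname given to fastqToCA and the one given to PBcR must differ.
The frg file produced by fastqToCA contains no sequence data; this is expected.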

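Spec file parameters

PBcR hybrid assembly needs two spec configuration files: pacbio.spec (correction) and asm.spec (assembly). Both contain algorithm parameters and hardware parameters. The algorithm parameters can usually be left out (the software defaults are then used), but the hardware parameters should be adjusted to the machine actually used.
All parameters take the form option = value; for boolean options, 1 means true and 0 means false.
For details of the spec file parameters see PBcR：SpecFiles Options

Spec file examples

Reference pacbio.spec for a cluster

# grid computing settings below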
stopAfter=overlapper

# original asm settings
utgErrorRate = 0.25
utgErrorLimit = 4.5

cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 1
scriptOnGrid = 1
frgCorrOnGrid = 1
ovlCorrOnGrid = 1

sge = -V -S /bin/sh
#sge = -V -A assembly
sgeScript = -pe smp 16
sgeConsensus = -pe smp 1
sgeOverlap = -pe smp 4
sgeFragmentCorrection = -pe smp 2
sgeOverlapCorrection = -pe smp 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 4
ovlHashBlockLength = 20000000
ovlRefBlockSize =  50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

###################### non-grid settings below, useGrid = 0 ####################
# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 6

cnsConcurrency = 16

merOverlapperThreads = 2
merOverlapperSeedConcurrency = 6
merOverlapperExtendConcurrency = 6

frgCorrConcurrency = 8
ovlCorrConcurrency = 16
cnsConcurrency = 16

Reference asm.spec for a cluster

###################### grid settings below, useGrid = 1 ####################
cnsErrorRate = 0.10
ovlErrorRate = 0.10

overlapper = ovl
unitigger = bogart
utgBubblePopping = 1

merSize = 14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 1
scriptOnGrid = 1
frgCorrOnGrid = 1
ovlCorrOnGrid = 1

sge = -V -S /bin/sh
sgeScript = -pe smp 16
sgeConsensus = -pe smp 1
sgeOverlap = -pe smp 4
sgeFragmentCorrection = -pe smp 2
sgeOverlapCorrection = -pe smp 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 6
ovlHashBlockLength = 20000000
ovlRefBlockSize =  5000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

###################### non-grid settings below, useGrid = 0 ####################
# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 12

ovlStoreMemory = 8192

ovlConcurrency = 8

merOverlapperThreads = 6
merOverlapperSeedConcurrency = 2
merOverlapperExtendConcurrency = 2

frgCorrConcurrency = 8

ovlCorrConcurrency = 16
cnsConcurrency = 16

doToggle=0
toggleNumInstances = 0
toggleUnitigLength = 2000

doOverlapBasedTrimming = 1
doExtendClearRanges = 2

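Output

The final assembly results are in the 9-terminator directory.

The main output file is prefix.asm, which gives a precise description of the assembly in a hierarchical data structure, including the contig and scaffold sequences generated;

prefix.qc contains summary statistics for the assembly;

References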

当10X Genomics遇上PacBio——烫金开始剥落了
第三代测序成本偏高是什么原因导致的？
Computational Science Community Wiki： Sun Grid Engine: Job Arrays

PBcR：SpecFiles Options

2016-08-26T06:33:11.000Z

The spec file is an optional input to the runCA executive that launches the Celera Assembler pipeline. The spec file provides a convenient way to generate assemblies while documenting their parameters faithfully. The use of spec files is STRONGLY recommended.

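Spec file parameters

showNext=boolean (default=0): if set, the next major command is printed to the screen instead of being executed;

pathMap=filename (default=empty-string): filename contains a mapping from hosts to the software working directory; this parameter normally does not need to be set;

shell=string (default=/bin/sh): the command interpreter used to run scripts;

Error Rates

There are five configurable error rates. An 'error rate' is the fraction of errors in an overlap, a decimal between 0.0 and 0.4; an 'error limit' is an absolute number of errors, with no restriction on its range. An overlap is used if it falls below either the 'error rate' or the 'error limit' threshold. For example, a 100-base overlap with a 2% error rate contains 2 errors; with utgErrorRate=0.015 and utgErrorLimit=2.5 it exceeds the rate threshold but is under the limit of 2.5 errors, so this 2% overlap is still used in unitigging.
The error rates must satisfy utg ≤ ovl ≤ cns ≤ cgw. Usually ovl = cns.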
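The either-or rule for 'error rate' and 'error limit' can be written as a one-line predicate. The sketch below is an illustration of the rule as described above, not code from Celera Assembler; the function name is invented, and treating the thresholds as inclusive is an assumption.

```python
def overlap_is_used(overlap_length, n_errors, error_rate, error_limit):
    """An overlap qualifies if it is under EITHER the fractional rate OR the absolute limit."""
    observed_rate = n_errors / overlap_length
    return observed_rate <= error_rate or n_errors <= error_limit

# The example from the text: 100 bases at 2% error = 2 errors
print(overlap_is_used(100, 2, error_rate=0.015, error_limit=2.5))    # True: 2 errors <= 2.5
# A longer overlap at the same 2% fails both tests: 20 errors > 2.5 and 0.02 > 0.015
print(overlap_is_used(1000, 20, error_rate=0.015, error_limit=2.5))  # False
# A long, clean overlap passes on the rate alone: 10 errors > 2.5 but 0.01 <= 0.015
print(overlap_is_used(1000, 10, error_rate=0.015, error_limit=2.5))  # True
```

The absolute limit mainly rescues short overlaps, where one or two mismatches already push the fractional rate above the threshold.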

ovlErrorRate=float (default=0.06): the error bound for overlaps, used during trimming and assembly; overlaps above this value are not detected.

cnsErrorRate=float (default=0.06): the consensus error rate;

cgwErrorRate=float (default=0.10): the scaffolder error rate; the scaffolder merges unitigs and contigs whose overlaps are below this value;

obtErrorLimit=float (default=see below): controls overlap quality in overlap-based trimming; affects trimming only;

Error rates for the unitigger are more involved. The overlap error rates above are not used for unitig construction; each unitigger uses its own error-rate settings:
1>utg uses utgErrorRate.
2>bog uses utgErrorRate and utgErrorLimit.
3>bogart uses utgGraphErrorRate, utgGraphErrorLimit, utgMergeErrorRate and utgMergeErrorLimit.

utgErrorRate=float (default=0.015 for utg and 0.030 for bog): overlaps below this value are used by the utg and bog unitiggers;

utgErrorLimit=float (default=2.5): overlaps below this value are used by the utg and bog unitiggers;

utgGraphErrorRate=float (default=0.030): overlaps below this value are used by the bogart unitigger to build the best overlap graph; bogart was developed to handle high-coverage data;

utgGraphErrorLimit=float (default=3.25): as above;

utgMergeErrorRate=float (default=0.045): overlaps below this value are used by the bogart unitigger for bubble popping and repeat detection;

utgMergeErrorLimit=float (default=5.25): as above;

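Minimum Fragment Length and Minimum Overlap Length

Fragments shorter than the minimum length are discarded by gatekeeper; overlaps shorter than the minimum length are not computed.

frgMinLen=integer (default=64): fragments shorter than this are not used in the assembly;

ovlMinLen=integer (default=40): overlaps shorter than this are not computed;

Stopping runCA Early

runCA can be told to stop before or after a given stage.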

stopBefore=string (default=empty-string)

                  meryl: Stop before computing mer histograms.
                  initialTrim: Stop before the OBT initial quality trim.
                  deDuplication: Stop before the OBT de-duplication.
                  finalTrimming: Stop before the OBT trim point merge.
                  chimeraDetection: Stop before the OBT chimera detection.
                  classifyMates: Stop before de-novo classification.
                  unitigger: Stop before unitigger.
                  scaffolder: Stop before the scaffolding stage starts.
                  CGW: Stop before the CGW program starts.
                  eCR: Stop before the extend clear ranges program starts. extendClearRanges is an alias for this.
                  eCRPartition: Stop before partitioning for extend clear ranges. extendClearRangesPartition is an alias for this.
                  terminator: Stop before terminator.

stopAfter=string (default=empty-string)

                  initialStoreBuilding: Stop after the fragment and gatekeeper stores are created.
                  meryl: Stop after mer counts are generated.
                  overlapBasedTrimming: Stop after the Overlap Based Trimming algorithm has updated the clear ranges. OBT is an alias for this.
                  overlapper: Stop after the overlapper finishes, and the overlap store is created.
                  classifyMates: Stop after de-novo classification.
                  unitigger: Stop after unitigs are constructed, but before consensus starts.
                  utgcns: Stop after unitig consensus finishes; consensusAfterUnitigger is an alias for this.
                  scaffolder: Stop after all stages of scaffolding are finished.
                  ctgcns: Stop after contig consensus finishes; consensusAfterScaffolder is an alias for this.

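Grid Engine Options

Grid computing (useGrid = 1) needs dedicated support from the cluster. If a run fails with "qsub: script file 'smp' cannot be loaded - No such file or directory", the current cluster environment does not support the grid settings used; for details see: Parallel computing, distributed computing, cluster computing and cloud computing.

gridEngine=string (default=SGE): choose SGE or LSF as the grid engine;

useGrid=integer (default=0): 0 means the grid is not used;

scriptOnGrid=integer (default=0): 0 means only the parallel computations run on the grid;

mbtOnGrid=integer (default=1): 0 means mer-based trimming is not run on the grid; ignored when useGrid=0;

ovlOnGrid=integer (default=1): 0 means overlaps are not computed on the grid;

frgCorrOnGrid=integer (default=0): fragment (frg) error correction;

ovlCorrOnGrid=integer (default=0): overlap error correction;

cnsOnGrid=integer (default=1): consensus;

-pe thread N -l memory=Mg -p 400: run on a single host with N CPUs, each using M g of memory, so the total memory needed per job is N × M g. For example, sgeConsensus= -pe thread 3 -l memory=2g -p -600 means the consensus step alone needs 3 CPUs with 2 g of memory each, 3 × 2 = 6 g in total;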

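Local parameters

Gatekeeper

gkpFixInsertSizes=integer (default=1): 1 means gatekeeper corrects the estimated insert size when its standard deviation is too large or too small. An acceptable estimate satisfies 0.1 mean < std.dev. < 1/3 mean; if the standard deviation falls outside this range it is reset to 0.1 * mean;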
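The gkpFixInsertSizes rule is simple enough to state as code. The sketch below restates the description above in Python; it is not taken from the gatekeeper source, and the function name and sample numbers are invented.

```python
def fixed_insert_std_dev(mean, std_dev):
    """Keep std_dev if 0.1*mean < std_dev < mean/3, otherwise reset it to 0.1*mean."""
    lower = mean / 10
    upper = mean / 3
    if lower < std_dev < upper:
        return std_dev
    return lower

print(fixed_insert_std_dev(3000, 500))   # within (300, 1000): kept
print(fixed_insert_std_dev(3000, 50))    # too small: reset to 300.0
print(fixed_insert_std_dev(3000, 2000))  # too large: reset to 300.0
```

Note that an estimate that is too large is also pulled down to 0.1 * mean rather than capped at mean / 3, so a very noisy library estimate ends up with a tight distribution.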

gkpAllowInefficientStorage=integer (default=1): 1 allows long reads to be held in memory, which is costly in memory; it is usually set to 0;

Fragment Trimming

doOverlapBasedTrimming=integer (default=1) (alias doOBT): 1 means trimming is performed;

doDeDuplication=integer (default=1): 1 means duplicate reads, or duplicate mate-pair reads in 454 data, are searched for; ignored when doOBT=0;

doChimeraDetection=off or normal or aggressive (default=normal): detects chimeras by comparison with other reads; ignored when doOBT=0;

mbtBatchSize=integer (default=1000000): number of fragments per trimming batch;

mbtThreads=integer (default=4): number of threads per trimming job;

mbtConcurrency=integer (default=1): number of trimming jobs run at the same time;

mbtIlluminaAdapter=integer (default=1): remove Illumina adapter sequences during merTrim;

mbt454Adapter=integer (default=1): remove 454 adapter sequences during merTrim;

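Overlapper

Every pair of fragments is compared to decide whether the two overlap;
For small assemblies, split the fragments according to the number of parallel jobs wanted; for example, for 16 jobs split the fragments into 4;
For large assemblies, use larger ovlRefBlockSize and ovlHashBlockLength values to keep the number of jobs under control;

overlapper=ovl or mer (default=ovl): choose the overlap algorithm;

obtOverlapper=ovl or mer (default=ovl): overlap algorithm used for OBT (overlap-based trimming);

ovlOverlapper=ovl or mer (default=ovl): overlap algorithm used for unitig construction;

ovlStoreMemory=integer (default=1024M): amount of memory used to build the overlap store;

saveOverlaps=integer (default=0): 0 means intermediate files are removed once the overlap store has been built; these files are large, so removal is the usual choice;

merSize=integer (default=22): k-mer length; setting it sets both obtMerSize and ovlMerSize;

obtMerSize=integer (default=22): k-mer length for the OBT step only;

ovlMerSize=integer (default=22): k-mer length for unitig construction and assembly;

obtMerThreshold=integer (default=auto): auto inspects the k-mer histogram to pick a suitable threshold;

ovlMerThreshold=integer (default=auto): auto inspects the k-mer histogram to pick a suitable threshold;

merThreshold=integer (default=auto): sets both obtMerThreshold and ovlMerThreshold;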

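OVL Overlapper

ovlThreads=integer (default=2): number of threads per overlap job;

ovlConcurrency=integer (default=1): number of overlap jobs run at the same time when SGE is not used;

ovlHashLoad=float (default 0.75): load at most 75% of the table size; for example, 22 bits corresponds to a table size of 88,080,384, so the usable capacity is 88,080,384 × 75% = 66,060,288;

ovlHashBits=integer (default 22): size of the hash table; it is fixed and does not change with ovlHashBlockLength or ovlRefBlockSize;

ovlHashBlockLength=integer (default=100000000): number of sequence bases loaded into the hash table; each base takes 10 bytes of memory.

ovlRefBlockSize=integer (default=2000000): controls the number of overlap jobs and the run time of each; a smaller value means more jobs, each of which finishes sooner;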

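How should ovlHashBits and ovlHashBlockLength be chosen together? It depends on the hardware actually available. Suppose the machine has 8 GB of memory:
1> With ovlHashBits=25, the table above shows that loading the hash table takes close to 7 GB. Allowing 1/2 GB for the operating system leaves about 500 MB for sequence data, so ovlHashBlockLength can be at most 50,000,000. A 25-bit table can hold 704,643,072 k-mers, yet only 50,000,000 k-mers (one k-mer per base of sequence) can be loaded, so this setting wastes memory;
2> With ovlHashBits=24 the table takes 3.5 GB, leaving 3 GB for sequence, so ovlHashBlockLength can be as large as 300,000,000. A 24-bit table can hold 352,321,536 k-mers, which makes this configuration a reasonable fit.
The overlap job log files (0-overlaptrim-overlap/#######.out and 1-overlapper/######.out) help in choosing suitable values; they contain lines like these:

HASH LOADING STOPPED: strings         38020 out of        38020 max.
HASH LOADING STOPPED: length       15487424 out of     15487424 max.
HASH LOADING STOPPED: entries       4435417 out of     66060288 max (load 5.04).

Here 15,487,424 bases of sequence were loaded, using only 4,435,417 of the 66,060,288 entries the hash table can hold, which means ovlHashBlockLength can be increased (to load more sequence) or ovlHashBits decreased (to use less memory).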
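
The capacity figures quoted above are all consistent with a table size of 2^ovlHashBits × 21 slots, scaled by ovlHashLoad. The multiplier 21 is inferred from the numbers in this post (88,080,384 for 22 bits); it is not taken from the Celera Assembler source, and hash_capacity is a hypothetical helper name, so treat the following as a back-of-the-envelope sketch:

```shell
#!/bin/bash
# Hypothetical helper: estimate how many entries the overlapper hash table can hold.
# Assumption: table size = 2^bits * 21, a multiplier inferred from the figures in this post.
hash_capacity() {
    local bits=$1 load_percent=$2
    echo $(( (1 << bits) * 21 * load_percent / 100 ))
}

cap22=$(hash_capacity 22 75)     # defaults: ovlHashBits=22, ovlHashLoad=0.75
size24=$(hash_capacity 24 100)   # full table size for ovlHashBits=24
size25=$(hash_capacity 25 100)   # full table size for ovlHashBits=25

echo "ovlHashBits=22 at load 0.75: $cap22 entries"
echo "ovlHashBits=24 table size:   $size24"
echo "ovlHashBits=25 table size:   $size25"
```

Comparing the result with the "entries ... out of ... max" line of the log shows how much of the table a job actually used.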

MER Overlapper

The mer overlapper also uses the classic overlapper parameters obtMerSize and ovlMerSize.

merCompression=integer (default=1): ACTTTAAC with merCompression=1 would be ACTAC

Meryl

merylMemory=integer (default=800M)

merylThreads=integer (default=1)

merylMemory = -segments 4 -threads 4
merylThreads = 4

Fragment Error Correction

frgCorrBatchSize=integer (default=200000): number of reads loaded at a time;

doFragmentCorrection=integer (default=1)

frgCorrThreads=integer (default=2)

frgCorrConcurrency=integer (default=1)

Unitigger

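unitigger=utg or bog or bogart (default=utg): utg (Sanger data), bog (454 data alone or combined with Sanger data), bogart (Illumina data alone or combined with other data);

utgGenomeSize=integer (default=not-set): to check whether the value was supplied or computed, run grep -i genome /4-unitigger/unitigger.err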

Scaffolder

Scaffold module is called CGW (chunk graph walker),It builds contigs and scaffolds from unitigs and mate pairs.

Consensus

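Consensus is computed after the unitigger and after the scaffolder.

Terminator

cleanup=none or light or heavy or aggressive (default=none): remove temporary and intermediate files once the assembly has finished. Valid values are 'none' (no cleanup), 'light' (temporary files), 'heavy' (currently, same as light), 'aggressive' (everything except the output is removed);

Unitig Repeat/Unique Toggling

Celera Assembler uses a Poisson distribution to classify unitigs as unique or repeat. Because of coverage bias and truncation, this occasionally labels a unique unitig as a repeat. To avoid misassembly, repeat unitigs are not trusted during assembly; Unitig Repeat/Unique Toggling lets Celera Assembler correct these "repeat" unitigs and then assemble again. The process creates a 10-toggledAsm directory, and the final assembly is under 10-toggledAsm/9-terminator.

doToggle=integer (default=0): 1 runs the toggling process;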

Spec file examples

pacbio.spec

merSize=16
mhap=-k 16 --num-hashes 512 --num-min-matches 3 --threshold 0.04 --weighted

useGrid=0
scriptOnGrid=0

ovlMemory=32
ovlStoreMemory=32000
threads=32
ovlConcurrency=1
cnsConcurrency=8
merylThreads=32
merylMemory=32000
ovlRefBlockSize=20000
frgCorrThreads = 16
frgCorrBatchSize = 100000
ovlCorrBatchSize = 100000


sgeScript = -pe threads 1
sgeConsensus = -pe threads 8
sgeOverlap = -pe threads 15 -l mem=2GB
sgeCorrection = -pe threads 15 -l mem=2GB
sgeFragmentCorrection = -pe threads 16 -l mem=2GB
sgeOverlapCorrection = -pe threads 1 -l mem=16GB

asm.spec

ovlStoreMemory 		=	60000
ovlThreads 		= 	16
ovlConcurrency 		= 	1
cnsConcurrency 		= 	16
merylMemory           	=	-segments 16 -threads 16
merylThreads          	= 	16

frgCorrThreads 		= 	16
ovlCorrConcurrency	=	16
frgCorrBatchSize 	= 	100000
ovlCorrBatchSize 	= 	100000

ovlHashBlockLength		=	300000000
ovlRefBlockLength		=	0
ovlRefBlockSize			=	2000000

# assembly settings, designed for eukaryotic genomes (1GB+)
ovlErrorRate			=	0.1
utgErrorRate			=	0.06
cnsErrorRate			=	0.1
cgwErrorRate			=	0.1
cnsErrorRate			=	0.1
doOBT				=	1
obtErrorRate			=	0.08
obtErrorLimit			=	4.5

batOptions			=	-RS -NS -CS
utgGraphErrorRate		=	0.05
utgMergeErrorRate		=	0.05
unitigger			=	bogart
consensus			=	pbutgcns

frgMinLen			=	3000
ovlMinLen			=	100

ovlHashBits			=	24
ovlHashLoad			=	0.80

参考资料

SpecFiles
RunCA#Global_Options

Perl,awk,sed One-Liners Explained, Part III： Selective Printing and Deleting of Certain Lines

2016-08-18T08:08:20.000Z

sed -i with a backup

sed -i.bak 's/:/;/' users

sed -i runs the sed command in place on the original file; -i.bak additionally keeps a copy of the original users file as users.bak.
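
A minimal sketch of the backup behaviour, run in a throwaway directory so that no real file is touched (the file contents are made up for illustration). Note that without the g flag only the first ':' on each line is replaced:

```shell
#!/bin/bash
# Work in a temporary directory; the sample "users" file is invented for this demo.
tmpdir=$(mktemp -d)
printf 'root:x:0\nalice:x:1000\n' > "$tmpdir/users"

# Edit in place and keep the untouched original as users.bak.
sed -i.bak 's/:/;/' "$tmpdir/users"

edited=$(cat "$tmpdir/users")
backup=$(cat "$tmpdir/users.bak")
echo "edited:";  echo "$edited"
echo "backup:";  echo "$backup"
```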

Substitute only on line N

sed 'Ns/foo/bar/' test.txt

Print line N

perl -ne '$.==N && print && exit' test.txt
awk 'NR==N' test.txt

Notes:

$. is a special variable that holds the current line number;

Print lines N and M

perl -ne 'print if $.==N || $.==M' test.txt

Print lines N through M

perl -ne 'print if $.>=N && $.<=M' test.txt
perl -ne 'print if N .. M' test.txt
awk 'NR==N,NR==M' test.txt
sed -n 'N,Mp' test.txt

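Print the longest line

perl -ne '$l=$_ if length($_)>length($l);END {print $l}' test.txt

Print odd-numbered lines

perl -ne 'print if $. % 2' test.txt

Print even-numbered lines

perl -ne 'print if $. % 2==0' test.txt

Print each duplicated line once; lines that occur only once are not printed

perl -ne 'print if ++$a{$_} ==2' test.txt

Print the line after each line that matches a pattern

awk '/pattern/ {getline; print}' test.txt

Notes:

getline reads the next line and carries on with the current awk script from that point; next also moves to the next line, but hands control back to the top of the awk script, e.g. awk '{if(NR==1){next} print $1,$2}' data;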
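
The difference between getline and next is easier to see on a tiny input. A sketch with made-up data:

```shell
#!/bin/bash
input=$(printf 'header\nfoo 1\nbar 2\nbaz 3')

# getline: on a match, $0 is replaced by the following line and the action continues,
# so the line *after* the match is printed.
after_match=$(printf '%s\n' "$input" | awk '/foo/ {getline; print}')

# next: give up on the current line and restart the script with the next one,
# so the header is skipped and the first field of every other line is printed.
no_header=$(printf '%s\n' "$input" | awk 'NR==1 {next} {print $1}')

echo "$after_match"
echo "$no_header"
```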

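Print from the first matching line to the end of the file

awk '/pattern/,0' test.txt

Print the lines from pattern 1 through pattern 2

awk '/pattern1/,/pattern2/' test.txt #including the pattern1 and pattern2 lines
awk '/pattern1/,/pattern2/{if (!/pattern1/&&!/pattern2/)print}' test.txt #excluding the pattern1 and pattern2 lines

Delete all blank lines

awk NF test.txt

NF is zero on a blank line

Merge file 1 and file 2 on the values of a shared column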

[zpxu@node102 ~]$ cat datafile 
20081010 1123 xxx
20081011 1234 def
20081012 0933 xyz
20081013 0512 abc
20081013 0717 def
[zpxu@node102 ~]$ cat mapfile 
abc withdrawal
def payment
xyz deposit
xxx balance
[zpxu@node102 ~]$ awk 'NR==FNR{a[$1]=$2;next} {$3=a[$3]}1' mapfile datafile
20081010 1123 balance
20081011 1234 payment
20081012 0933 deposit
20081013 0512 withdrawal
20081013 0717 payment

Empty cells

$ cat test.txt 
a       4       5       6
b               5
d       1
s       5       3       5
$ # goal: replace empty cells with NA
$ awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++) if ($i~/^$/) $i="NA"};1' test.txt 
a       4       5       6
b       NA      5       NA
d       1       NA      NA
s       5       3       5

Notes:

The trailing 1 in the awk program is shorthand for { print $0 }.
Another variant:

$ cat myfile.csv 
1,2,3,4,5,6,7
,,,,,,
1,,,4,5,,
,2,3,4,5,,
$ cat fill-empty-values.sh
#!/bin/bash

for i in $( seq 1 2); do
  sed -e "s/^,/$2,/" -e "s/,,/,$2,/g" -e "s/,$/,$2/" -i $1
done
$ bash fill-empty-values.sh myfile.csv 0
$ cat myfile.csv 
1,2,3,4,5,6,7
0,0,0,0,0,0,0
1,0,0,4,5,0,0
0,2,3,4,5,0,0

more： http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/#awk_ranges

Perl,awk,sed One-Liners Explained, Part II： Text Conversion and Substitution

2016-08-18T03:02:20.000Z

Convert all characters to uppercase

perl -nle 'print uc' test.txt

Convert all characters to lowercase

perl -nle 'print lc' test.txt

Capitalize the first letter of each line

perl -nle 'print ucfirst lc' test.txt
equivalent to
perl -nle 'print "\u\L$_"' test.txt

Strip leading whitespace from each line

perl -ple 's/^[ \t]+//' test.txt
awk '{ sub(/^[ \t]+/, ""); print }' test.txt
sed 's/^[ \t]*//' test.txt
equivalent to
perl -ple 's/^\s+//' test.txt

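Strip leading and trailing whitespace from each line

perl -ple 's/^[ \t]+|[ \t]+$//g' test.txt
awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); print }' test.txt
awk '{$1=$1; print}' test.txt
sed 's/^[ \t]*//;s/[ \t]*$//' test.txt

sub versus gsub: sub replaces only the first match on the line, whereas gsub replaces every match (a global substitution). Note that the $1=$1 variant also collapses runs of whitespace between fields into a single space;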
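
A short sketch of the three behaviours on made-up strings:

```shell
#!/bin/bash
# sub: only the first '-' is replaced.
sub_out=$(echo "a-b-c" | awk '{ sub(/-/, "+"); print }')

# gsub: every '-' is replaced.
gsub_out=$(echo "a-b-c" | awk '{ gsub(/-/, "+"); print }')

# $1=$1 forces awk to rebuild the record with single-space separators,
# which trims both ends and also squeezes the inner whitespace.
collapsed=$(echo "  x   y  " | awk '{ $1=$1; print }')

echo "$sub_out / $gsub_out / [$collapsed]"
```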

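Convert DOS/Windows line endings to UNIX line endings

perl -pe 's|\r\n|\n|' test.txt
awk '{ sub(/\r$/,""); print }' test.txt
sed 's/.$//' test.txt    # assumes every line ends in a carriage return
sed 's/^M$//' test.txt   # type ^M as Ctrl-V Ctrl-M

Replace A with S

perl -pe 's/A/S/g' test.txt

Replace only the last A on each line with S

sed 's/\(.*\)A/\1S/' test.txt

Replace A with S on lines that contain C

perl -pe '/C/ && s/A/S/g' test.txt
awk '/C/ { gsub(/A/, "S") }; { print }' test.txt
sed '/C/s/A/S/g' test.txt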

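Sorting from inside awk

awk -F ":" '{print $1 | "sort"}' /etc/passwd

Delete the second column

awk '{$2=""; print}' test.txt

Print the fields of each line in reverse order

awk '{for (i=NF;i>0;i--) printf("%s ",$i); printf ("\n")}' test.txt

Implement tac with sed

sed '1!G;h;$!d' test.txt

Notes:

1!G: G is executed on every line except the first;
$!d: d is executed on every line except the last;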
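
To see why this reverses the input, it helps to trace a three-line example. The comments follow the pattern space (PS) and hold space (HS) line by line:

```shell
#!/bin/bash
# line 1: G skipped;      h -> HS="1";        d deletes PS, nothing printed
# line 2: G -> PS="2\n1"; h -> HS="2\n1";     d deletes PS, nothing printed
# line 3: G -> PS="3\n2\n1"; h; last line, so d is skipped and PS is printed
reversed=$(printf '1\n2\n3\n' | sed '1!G;h;$!d')

echo "$reversed"
```

Because the whole file accumulates in the hold space, this is only sensible for inputs that fit comfortably in memory.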

Perl, awk, sed One-Liners Explained, Part I： File Spacing, Numbering and Calculations

2016-08-13T11:06:55.000Z

File spacing

Double-space a file

cat test.txt
Marrys 2143     78       84       77      239
Jacks  2321     78       78       45      189
Toms   2122     48       77       71      196
Mikes  2537     87       97       95      279
Bobs   2415     40       57       62      159
perl -pe '$\="\n"' test.txt
Marrys 2143     78       84       77      239

Jacks  2321     78       78       45      189

Toms   2122     48       77       71      196

Mikes  2537     87       97       95      279

Bobs   2415     40       57       62      159

# the one-liner is equivalent to the following loop
while (<>) {
    $\ = "\n";
} continue {
    print or die "-p failed: $!\n";
}

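Notes:

-e: run the Perl program given on the command line, without writing a script file;
-p: wraps the program in a while loop over all input (<>); each line is read into $_, the given code is run, and $_ is printed at the end of every iteration;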

while (<>) {
    # your program goes here
} continue {
    print or die "-p failed: $!\n";
}

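$\: the output record separator, the counterpart of ORS in awk; it is appended on every print.
Same effect: perl -pe 's/$/\n/' test.txt and perl -pe '$_ .= "\n"' test.txt

awk version: awk '1; { print "" }' test.txt, equivalent to awk '{ print } { print "" }' test.txt
Quick-and-dirty sed version: sed G test.txt. Of the two newlines, one comes from G appending the (empty) hold space to the pattern space, the other is added by sed itself on output;
For more on the advanced sed commands see Advanced-sed: n, N, d, D, p, P, b, T, t, h, H, g, G, x, y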

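Double-space a file, except for blank lines

perl -pe '$_.="\n" unless /^$/' test.txt
# roughly equivalent (this one also leaves whitespace-only lines alone)
perl -pe '$_ .= "\n" if /\S/' test.txt

Notes:

^$: matches an empty line;
\S: uppercase S is the complement of \s; if /\S/ is true when the line contains at least one non-whitespace character (whitespace being tab, vertical tab, space, etc.).

awk version: awk 'NF { print $0 "\n" }' test.txt; NF is 0 on a blank line, which filters blank lines out.
Quick-and-dirty sed version: sed '/^$/d;G' test.txt; /^$/ matches empty lines and d deletes them, so all empty lines are removed first and G then runs on the remaining lines;
Undo double spacing: sed 'n;d' test.txt; n prints the current line and reads the next line into the pattern space, and d then deletes that line;
Note the difference between -n and n in sed, e.g. sed -n 'n;p' test.txt. Normally sed prints every input line to the screen; with the -n option only lines that are printed explicitly (for example by p) appear. The n command inside the quotes is different: it prints the current pattern space (unless -n is given) and then replaces it with the next input line, so the commands that follow n act only on the newly read line, not on the line that was read first. With -n, 'n;p' therefore prints every second line. The N command, by contrast, appends the next line to the pattern space, so the current line and the appended line are processed together as one "line" by the commands after N.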
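
The behaviour of n, -n and N described above can be checked on numbered input. A small sketch using seq:

```shell
#!/bin/bash
# -n suppresses automatic printing; n swaps in the next line, p prints it -> even lines.
evens=$(seq 1 6 | sed -n 'n;p')

# Without -n, n prints the current line first, then d deletes the next one -> odd lines.
odds=$(seq 1 6 | sed 'n;d')

# N appends the next line, so both lines sit in the pattern space and can be joined.
pairs=$(seq 1 4 | sed 'N;s/\n/,/')

echo "$evens"; echo "$odds"; echo "$pairs"
```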

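Triple-space a file

perl -pe '$\="\n\n"' test.txt

awk version: awk '1; { print "\n" }' test.txt, equivalent to awk '{ print; print "\n" }' test.txt
Quick-and-dirty sed version: sed 'G;G' test.txt

N-fold spacing (here 7 extra newlines after every line)

perl -pe '$_.="\n"x7' test.txt

Remove all blank lines

perl -ne 'print unless /^$/' test.txt
# the one-liner is equivalent to the following loop
LINE:
while (<>) {
    print unless /^$/
}
# which can be spelled out further as
LINE:
while (<>) {
    print $_ unless $_ =~ /^$/
}

-n: equivalent to the while loop below; while reads each line through <> and stores it in $_;

LINE:
while (<>) {
    # your program goes here
}

Same effect: perl -lne 'print if length' test.txt. The -l option chomps the newline at the end of each line (and adds it back on print); the length of the line is then checked, and if the line contains any character the test is true and the line is printed;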

Collapse runs of blank lines into a single blank line

perl -00 -pe '' test.txt
# same effect
perl -00pe0 test.txt

Compress or expand every run of blank lines to N consecutive ones

perl -00 -pe '$_.="\n"x4' test.txt

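Line numbering

Number all lines

perl -pe '$_="$. $_"' test.txt

Notes:

$.: holds the current line number of the input;
awk version: awk '{print FNR "\t" $0}' test.txt and awk '{ print NR "\t" $0 }' test.txt. When two files are read, the former restarts numbering at 1 for the second file, while the latter keeps counting on from the end of the first file.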
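
The FNR/NR difference only shows up with more than one input file. A sketch with two temporary files (names and contents invented for the demo):

```shell
#!/bin/bash
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/f1"
printf 'c\nd\n' > "$tmpdir/f2"

# FNR restarts at 1 for every file; NR keeps counting across files.
fnr=$(awk '{ print FNR ":" $0 }' "$tmpdir/f1" "$tmpdir/f2")
nr=$(awk '{ print NR ":" $0 }' "$tmpdir/f1" "$tmpdir/f2")

echo "$fnr"
echo "$nr"
```

The NR==FNR idiom used for joining two files relies on exactly this: the condition is true only while the first file is being read.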

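Number only non-empty lines, still printing the empty ones

perl -pe '$_=++$a." $_" if /./' test.txt

Notes:

/./: matches any character except a newline, i.e. a non-empty line;
awk version: awk 'NF {$0=++a ":" $0}; {print}' test.txt, where ":" is the separator between the number and the original text.

Number only non-empty lines and drop the empty ones

perl -ne 'print ++$a." $_" if /./' test.txt

Differences worth noting:
$. versus ++$a: the former counts every input line, the latter counts only non-empty lines;
-p versus -n: the -p loop prints automatically, -n does not, so print has to be called explicitly;

Number all lines, but print the number only on non-empty lines

perl -pe '$_ = "$. $_" if /./' test.txt

Number only the lines that match a pattern; the other lines are printed without a number

perl -pe '$_=++$a." $_" if /pattern/' test.txt

Number and print only the lines that match a pattern

perl -ne 'print ++$a." $_" if /pattern/' test.txt

Number all lines, but print the number only on lines that match a pattern

perl -pe '$_ = "$. $_" if /pattern/' test.txt

Number all lines with a custom format

perl -ne 'printf "%-5d %s", $.,$_' test.txt

awk version: awk '{ printf("%5d : %s\n", NR, $0) }' test.txt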

Calculations

Count all lines, including blank ones

perl -lne 'END { print $. }' test.txt

awk version: awk 'END {print NR}' test.txt

Count non-empty lines

perl -le 'print scalar (grep {/./}<>)' test.txt
perl -le 'print ~~grep{/./}<>' test.txt
perl -le 'print~~grep/./,<>' test.txt

Count empty lines

perl -lne '$a++ if /^$/; END {print $a+0}' test.txt #reads line by line, more efficient
perl -le 'print ~~grep{/^$/}<>' test.txt #reads the whole file into memory

Emulate grep -c

perl -lne '$a++ if /regex/; END {print $a+0}' test.txt
awk '/Beth/ { n++ }; END { print n+0 }' test.txt

Sum the numbers on each line

awk '{s=0; for (i=1;i<=NF;i++) s=s+$i; print s}' test.txt

Take absolute values

awk '{for (i=1;i<=NF;i++) if ($i<0) $i=-$i; print}' test.txt
perl -alne 'print "@{[map { abs} @F]}"' test.txt

Print the sum of each line, across all files

perl -MList::Util=sum -alne 'print sum @F' test.txt test2.txt

Print the minimum of each line

perl -MList::Util=min -alne 'print min @F' test.txt

Count the lines that match a pattern

perl -lne '/pattern/ && $t++; END {print $t+0}' test.txt

Original article (English): http://www.catonmat.net/blog/perl-one-liners-explained-part-one/

awk: matching, negation, and passing arguments on the command line

2016-08-11T10:40:40.000Z

Matching

/FIN|TIME/ matches FIN or TIME;

Negation

Print every column except the first

awk '{$1="";print }' file.txt

Print every column except columns N and M (here 1 and 3)

awk '{$1=$3="" ;print }' file.txt

Splitting a file

$ cat netstat.txt
Proto Recv-Q Send-Q Local-Address          Foreign-Address             State
tcp        0      0 0.0.0.0:3306           0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:80             0.0.0.0:*                   LISTEN
tcp        0      0 127.0.0.1:9000         0.0.0.0:*                   LISTEN
tcp        0      0 coolshell.cn:80        124.205.5.146:18245         TIME_WAIT
tcp        0      0 coolshell.cn:80        61.140.101.185:37538        FIN_WAIT2
tcp        0      0 coolshell.cn:80        110.194.134.189:1032        ESTABLISHED
tcp        0      0 coolshell.cn:80        123.169.124.111:49809       ESTABLISHED
tcp        0      0 coolshell.cn:80        116.234.127.77:11502        FIN_WAIT2
tcp        0      0 coolshell.cn:80        123.169.124.111:49829       ESTABLISHED
tcp        0      0 coolshell.cn:80        183.60.215.36:36970         TIME_WAIT
tcp        0   4166 coolshell.cn:80        61.148.242.38:30901         ESTABLISHED
tcp        0      1 coolshell.cn:80        124.152.181.209:26825       FIN_WAIT1
tcp        0      0 coolshell.cn:80        110.194.134.189:4796        ESTABLISHED
tcp        0      0 coolshell.cn:80        183.60.212.163:51082        TIME_WAIT
tcp        0      1 coolshell.cn:80        208.115.113.92:50601        LAST_ACK
tcp        0      0 coolshell.cn:80        123.169.124.111:49840       ESTABLISHED
tcp        0      0 coolshell.cn:80        117.136.20.85:50025         FIN_WAIT2

Split the file by the value of column 6; NR!=1 skips the header line.

$ awk 'NR!=1{print > $6}' netstat.txt
 
$ ls
ESTABLISHED  FIN_WAIT1  FIN_WAIT2  LAST_ACK  LISTEN  netstat.txt  TIME_WAIT
 
$ cat ESTABLISHED
tcp        0      0 coolshell.cn:80        110.194.134.189:1032        ESTABLISHED
tcp        0      0 coolshell.cn:80        123.169.124.111:49809       ESTABLISHED
tcp        0      0 coolshell.cn:80        123.169.124.111:49829       ESTABLISHED
tcp        0   4166 coolshell.cn:80        61.148.242.38:30901         ESTABLISHED
tcp        0      0 coolshell.cn:80        110.194.134.189:4796        ESTABLISHED
tcp        0      0 coolshell.cn:80        123.169.124.111:49840       ESTABLISHED
 
$ cat FIN_WAIT1
tcp        0      1 coolshell.cn:80        124.152.181.209:26825       FIN_WAIT1
 
$ cat FIN_WAIT2
tcp        0      0 coolshell.cn:80        61.140.101.185:37538        FIN_WAIT2
tcp        0      0 coolshell.cn:80        116.234.127.77:11502        FIN_WAIT2
tcp        0      0 coolshell.cn:80        117.136.20.85:50025         FIN_WAIT2
 
$ cat LAST_ACK
tcp        0      1 coolshell.cn:80        208.115.113.92:50601        LAST_ACK
 
$ cat LISTEN
tcp        0      0 0.0.0.0:3306           0.0.0.0:*                   LISTEN
tcp        0      0 0.0.0.0:80             0.0.0.0:*                   LISTEN
tcp        0      0 127.0.0.1:9000         0.0.0.0:*                   LISTEN
 
$ cat TIME_WAIT
tcp        0      0 coolshell.cn:80        124.205.5.146:18245         TIME_WAIT
tcp        0      0 coolshell.cn:80        183.60.215.36:36970         TIME_WAIT
tcp        0      0 coolshell.cn:80        183.60.212.163:51082        TIME_WAIT

if-else-if

$ awk 'NR!=1{if($6 ~ /TIME|ESTABLISHED/) print > "1.txt";
else if($6 ~ /LISTEN/) print > "2.txt";
else print > "3.txt" }' netstat.txt

Statistics

awk '{sum+=$5} END {print sum}' file.txt
awk 'NR!=1{a[$6]++;} END {for (i in a) print i ", " a[i];}' file.txt #print each distinct value of column 6 with its count
awk 'NR!=1{a[$6]+=$7;} END { for(i in a) print i ", " a[i]"KB";}' file.txt #print each distinct value of column 6 with the sum of its column-7 values
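
The counting idiom can be tried without netstat.txt. The sketch below uses a made-up two-column table with the state in column 2; because the order of a for (i in a) loop is unspecified in awk, the output is piped through sort to make it reproducible:

```shell
#!/bin/bash
# Invented sample data: a header line followed by protocol/state pairs.
data=$(printf 'Proto State\ntcp LISTEN\ntcp ESTABLISHED\ntcp LISTEN\ntcp TIME_WAIT')

# Skip the header, count each distinct state, then sort for a stable order.
counts=$(printf '%s\n' "$data" \
    | awk 'NR!=1 { n[$2]++ } END { for (s in n) print s ", " n[s] }' \
    | LC_ALL=C sort)

echo "$counts"
```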

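Passing arguments into a shell script

Arguments passed on the command line are available as $1 for the first, $2 for the second, and so on; note that $0 is the name of the script file.

$ cat test.sh
cat $@ | awk -F, 'NR!=1 && $79!~/\[M\+[0-9]\]+|\[M\][0-9]+/' > te.$@

$@ stands for all command-line arguments; for details see http://www.runoob.com/linux/linux-shell-passing-arguments.html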

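The awk -v option

-v var=var_value
Sets the awk variable var to var_value before the awk program starts; the variable is already available in the BEGIN block, and the option is commonly used to bring shell variables into an awk program.

$ a=1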
$ awk -v var=$a 'BEGIN{print var}'
1
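
A slightly fuller sketch of the same idea: the shell variable (threshold, a name chosen for this example) is quoted and handed to awk with -v, and is then usable both in a pattern and in BEGIN:

```shell
#!/bin/bash
threshold=3

# Use the value in a pattern: print the input lines greater than the threshold.
above=$(seq 1 5 | awk -v min="$threshold" '$1 > min')

# The variable is already set when BEGIN runs.
in_begin=$(awk -v var="$threshold" 'BEGIN { print var }')

echo "$above"
echo "$in_begin"
```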

Reading a csv file
awk -F, -v OFS=, '{print $1,$3}' old.csv

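Installing XML::Simple on Linux without root privileges

2016-08-11T06:32:51.000Z

About XML::Simple

XML::Simple essentially does two things: it converts an XML text document into a Perl data structure (a combination of anonymous hashes and arrays), and it converts such a data structure back into an XML text document. It provides two functions, XMLin() and XMLout(). The first reads an XML file and returns a reference. Given a reference to a suitable data structure, the second converts it into an XML document, returned as a string or written to a file depending on the arguments.

XML::Simple has two main limitations. First, on input it reads the complete XML file into memory, so the module cannot be used for very large files or for XML streams. Second, it cannot handle XML mixed content, i.e. an element body that contains both text and child elements.

Why XML::Simple is needed

When annotating de novo transcriptome data with Trinotate: Transcriptome Functional Annotation and Analysis, RNAMMER has to be run to identify rRNA transcripts, and its XML output is parsed into gff; this step needs the XML::Simple module;
otherwise it fails with ../rnammer error converting xml into gff

Installing XML::Simple

XML::Simple needs at least the following two dependencies: XML::Parser and XML::SAX::Expat
Note: install them strictly in the order XML::Parser, XML::SAX::Expat, XML::Simple;
XML::Parser usually installs without trouble through cpan, but a normal cpan install of XML::SAX::Expat is interrupted by a permission problem: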

$ perl -MCPAN -e shell
cpan> install XML::Simple
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ERROR: Can't create '/usr/local/share/man/man3'
Do not have write permissions on '/usr/local/share/man/man3'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

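This raises the question of how to install Perl modules on Linux without root privileges.
The recommendation here is cpanm, which is easier and friendlier to use; without root privileges it installs automatically into the perl/lib folder under the current user's home directory.

Installing cpanm

Download cpanm into your own directory and make it executable

$ wget http://xrl.us/cpanm --no-check-certificate -O cpanm
$ chmod +x cpanm

Add its path in the bashrc file

export PATH=~/software/perl/:$PATH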

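Installing the Perl modules

XML::SAX::Expat

$ cpanm XML::SAX::Expat

Note: on first use cpanm creates a ~/perl5 directory, and modules are installed into ~/perl5/lib/perl5; add this path to the PERL5LIB variable in the bashrc file

export PERL5LIB=~/perl5/lib/perl5

source ~/.bashrc

Verify the installation
perldoc XML::SAX::Expat
If the help text is displayed normally, the installation succeeded;
cpanm can do more; see the post on 扶凯's blog about installing Perl modules with CPANMinus and other tips

XML::Simple

$ cpanm XML::Simple

Error: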

t/0_Config.t ............ ok
t/1_XMLin.t ............. 
Failed 84/132 subtests 
t/2_XMLout.t ............ ok
t/3_Storable.t .......... 
Failed 21/23 subtests 
t/4_MemShare.t .......... 
Failed 7/8 subtests 
t/5_MemCopy.t ........... 
Failed 6/7 subtests 
t/6_ObjIntf.t ........... ok
t/7_SaxStuff.t .......... 
Failed 12/14 subtests 
t/8_Namespaces.t ........ ok
t/9_Strict.t ............ ok
t/A_XMLParser.t ......... ok
t/B_Hooks.t ............. 
Failed 4/12 subtests 
	(less 3 skipped subtests: 5 okay)
t/release-pod-syntax.t .. skipped: these tests are for release candidate testing

Test Summary Report
-------------------
t/1_XMLin.t           (Wstat: 139 Tests: 48 Failed: 0)
  Non-zero wait status: 139
  Parse errors: Bad plan.  You planned 132 tests but ran 48.
t/3_Storable.t        (Wstat: 139 Tests: 2 Failed: 0)
  Non-zero wait status: 139
  Parse errors: Bad plan.  You planned 23 tests but ran 2.
t/4_MemShare.t        (Wstat: 139 Tests: 1 Failed: 0)
  Non-zero wait status: 139
  Parse errors: Bad plan.  You planned 8 tests but ran 1.
t/5_MemCopy.t         (Wstat: 139 Tests: 1 Failed: 0)
  Non-zero wait status: 139
  Parse errors: Bad plan.  You planned 7 tests but ran 1.
t/7_SaxStuff.t        (Wstat: 139 Tests: 2 Failed: 0)
  Non-zero wait status: 139
  Parse errors: Bad plan.  You planned 14 tests but ran 2.
t/B_Hooks.t           (Wstat: 139 Tests: 8 Failed: 0)
  Non-zero wait status: 139
  Parse errors: Bad plan.  You planned 12 tests but ran 8.
Files=13, Tests=367,  3 wallclock secs ( 0.11 usr  0.05 sys +  1.61 cusr  0.35 csys =  2.12 CPU)
Result: FAIL
Failed 6/13 test programs. 0/367 subtests failed.
make: *** [test_dynamic] Error 255
-> FAIL Installing XML::Simple failed.

The log shows that other dependencies are missing at test time; the fix is as follows:

$ cpanm XML::LibXML::SAX::Parser 
$ cpanm XML::LibXML::SAX 
$ cpanm XML::SAX::PurePerl
$ cpanm XML::Simple
--> Working on XML::Simple
Fetching http://www.cpan.org/authors/id/G/GR/GRANTM/XML-Simple-2.22.tar.gz ... OK
Configuring XML-Simple-2.22 ... OK
Building and testing XML-Simple-2.22 ... OK
Successfully installed XML-Simple-2.22
1 distribution installed

more：https://bugzilla.redhat.com/show_bug.cgi?id=233003

Piwi-interacting RNA (piRNA)

2016-06-08T13:58:48.000Z

PiRNA

Piwi-interacting RNA (piRNA)是一大类主要在动物体内表达的small non-coding RNA，piRNA通过与Piwi蛋白互作形成RNA-protein复合体。该piRNA复合体在生殖细胞中参与表观遗传和逆转录转座子(retrotransposons)的转录后基因沉默。piRNA与miRNA和siRNA在长度、序列结构和生物起源上均存在差异；

PiRNA特点

1）26~31nt
2）无明显二级结构
3）5’端第一个碱基为U
4）5’端单磷酸盐(monophosphate)和3’端修饰(2’-O-methylation modification)阻止2’ or 3’氧化，增加PiRNA稳定性。
5）种类较多，不具一定的保守性，老鼠体内 50,000 unique piRNA，果蝇中>13,000。
6）产生存在显著的链偏好性；

位置

成簇贯穿基因组中，其每一个簇中包含PiRNA小于10个或达到成千上万，且大小差异极大。
在果蝇和脊椎动物中定位于非编码基因间，在线虫蛋白编码基因间也鉴定到PiRNA。
在无脊椎动物和哺乳动物生殖细胞中较多。
细胞核和细胞质中均存在。

生物起源

PiRNA的产生存在显著的链特异性，可能仅仅是来源于双链DNA的某一条链，这表明转录的长的单链前体经过一次初加工形成 pachytene PiRNA，这过程中PiRNA前体的转录趋向于起始于5’端第一个碱基U。

‘Ping Pong’机制：初级PiRNA(Primary piRNAs)第一个碱基位置偏向为U，第10个无偏向性；次级PiRNA(Secondary piRNAs)(产生于初级PiRNA指导的剪切)第一个碱基无偏向，第10个偏向A；二者从5’端开始有10个碱基的互补。
More：A piRNA Pathway Primed by Individual Transposons Is Linked to De Novo DNA Methylation in Mice
初级PiRNA识别其互补靶标并招募Piwi蛋白，然后从距离初级PiRNA 5’端10个碱基处劈开(识别的互补靶标)形成次级PiRNA，次级PiRNA靶向到第10个碱基是 A 。

Discrete Small RNA-Generating Loci as Master Regulators of Transposon Activity in Drosophila

生物功能

沉默转座子

见More：Piwi蛋白:RNAi中作用

后生效应(Epigenetic effects)

动植物中，小RNA通过特定的胞嘧啶甲基化来间接调控表观遗传，并且小RNA自身承担着表观遗传信息的载体。存在某些特殊转座子差异的果蝇品系间杂交能引起后代不育，这称之为杂种不育。当这一转座子是父系遗传时不育表型表现为显性，而母系遗传能够维持育性。在P- and I-element-mediated杂种不育中，依赖于父母本不同，其作用于每一个靶标元件(element)的PiRNA数量在后代表现出明显差异，这种差异来源于受精作用。综上表明母本生殖细胞内的PiRNA对上述特殊转座子的沉默响应起到重要作用，此沉默效应的缺失将引起杂种不育。
More：An epigenetic role for maternally inherited piRNAs in transposon silencing

PiRNA鉴定

PiRNA的鉴定目前主要通过识别’ping pong‘标签，相关软件如下：
piRNABank: a web resource on classified and clustered Piwi-interacting RNAs
PingPongPro：a software for finding ping-pong signatures and ping-pong cycle activity
proTRAC: a software for probabilistic piRNA cluster detection, visualization and analysis
piRNA cluster: database

PiRNA起源

基因组重复区域，例如逆转录转座子区；
异染色质区，双链RNA的反义链；

Argonaute蛋白家族

Argonaute蛋白包含有N-terminal, PAZ (Piwi-Argonaute-Zwille), middle and the C-terminal PIWI (P-element-induced wimpy testis) domains (Tolia et al., 2007)。
在果蝇中存在5种类型的Argonaute蛋白：AGO1, AGO2, Aubergine (Aub), Piwi and AGO3 (Gunawardane et al., 2007)；
AGO1和AGO2属于Argonaute (AGO)亚家族，Aub, Piwi and AGO3多存在于生殖细胞系中，且属于PIWI亚家族；

Piwi蛋白

Piwi蛋白(最初在果蝇中的P-element induced wimpy testis)，维持干细胞的不完全分化和生殖细胞细胞分裂比率的稳定性。Piwi蛋白高度保守，广泛存在于动植物体内。

RNAi中作用

Piwi蛋白存在有PAZ domain，该domain在Argonaute蛋白家族中参与双链RNA导向的单链RNA的水解作用。Argonaute是广泛研究的核酸结合蛋白(nucleic-acid binding)家族，其本质上是一种RNase H-like酶，完成RNA-induced silencing complex (RISC)的催化功能。在细胞RNAi反应中，RISC复合体中的Argonaute蛋白能够绑定(bind)到由ribonuclease Dicer切割（Dicer-2）外源双链RNA的正义链和反义链产生的siRNA(small interfering RNA)和切割（Dicer-1）内源非编码RNA（non-coding RNA）产生的miRNA(microRNA)上，从而形成RNA-RISC complex。该RNA-RISC complex绑定和切开与RNA（siRNA或miRNA）碱基互补的mRNA，破坏并且阻止其翻译过程。
补充RNAi中的RdRP机制：在线虫的研究中发现, siRNA 是合成 dsRNA 的特殊引物, 在RNA 依赖RNA 聚合酶(RdRP)作用下, 以靶mRNA 为模板合成dsRNA 。新生成的dsRNA 在Dicer 酶的作用下, 裂解产生新的siRNA , 新生成的siRNA 又可进入上述循环。大量集中的siRNA 可以形成RISC复合物, 这样可以提高mRNA 降解的效率。在这种RNAi过程中, 对靶mRNA 的特异性扩增有助于增强RNAi的特异性基因监视功能, 每个细胞只需少量的dsRNA就能完全关闭相应基因的表达，该模型称为RdRP。

Piwi蛋白和转座子沉默

Piwi蛋白通过与PiRNA形成内源系统来沉默内源自私基因(endogenous selfish genetic elements)表达，例如逆转录转座子和重复序列，防止该自私基因产物干扰生殖细胞的形成。
selfish genetic elements明显特征：通过形成额外拷贝数在基因组中传播（转座子）和对宿主的成功繁殖没有特殊贡献。

RasiRNA

RasiRNAs(Repeat associated small interfering RNA)是piRNA的亚种，与Piwi蛋白（Argonaute蛋白家族分枝）互作参与RNAi反应。在生殖细胞中建立和维持异染色质结构，控制重复序列的转录，沉默转座子和逆转录转座子。主要产生自反义链（antisense strand），缺乏动物siRNA and miRNA所特有的2’,3’羟基末端。
More：A Distinct Small RNA Pathway Silences Selfish Genetic Elements in the Germline

RNA 百科

RNA wiki

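Building and installing GCC as a non-root user

2016-06-04T01:08:22.000Z

The usual three-step source install on Linux needs GCC for compilation, so Linux systems ship with GCC preinstalled. For stability and compatibility, however, the preinstalled version is an older stable release, while recent software needs a newer one to compile. For an ordinary non-root user the solution is to install the required GCC version in their own directory.
How do you know that your GCC needs upgrading?
If make fails with an error like the following while you are building software, it is time to upgrade:

g++ -std=c++11 -pedantic -Wall -Wextra -c CCSSequence.cpp -o CCSSequence.o
cc1plus: error: unrecognized command line option "-std=c++11"

g++ 4.4 accepts -std=c++0x, whereas -std=c++11 needs g++ 4.7 or later.
Run gcc -v to see the current system GCC version and confirm that the error is caused by the GCC version.

Installing GCC

GCC depends on three packages, gmp, mpc and mpfr, so install these first. They can be downloaded from the infrastructure directory, and the gcc source package from releases; the version used here is gcc-4.8.5.
Because the three packages depend on one another, install them strictly in the following order.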
Installing gmp

$tar -jxvf gmp-4.3.2.tar.bz2

$cd gmp-4.3.2

$./configure --prefix=/home/software/opt/gmp-4.3.2/ #install path for gmp

$make

$make check #this step is optional

$make install

Installing mpfr

$tar -jxvf mpfr-2.4.2.tar.bz2

$cd mpfr-2.4.2

$./configure --prefix=/home/software/opt/mpfr-2.4.2/ --with-gmp=/home/software/opt/gmp-4.3.2/ #configure takes the mpfr install path and the path of the gmp it depends on

$make

$make check #this step is optional

$make install

Installing mpc

$tar -zxvf mpc-0.8.1.tar.gz

$cd mpc-0.8.

$ ./configure --prefix=/home/software/opt/mpc-0.8.1/ --with-gmp=/home/software/opt/gmp-4.3.2/ --with-mpfr=/home/software/opt/mpfr-2.4.2/

$make

$make check #this step is optional

$make install

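Editing ~/.bashrc

Once the three dependencies are installed, set the $LD_LIBRARY_PATH environment variable by adding the lines below to the bashrc file.
The system LD_LIBRARY_PATH contained two adjacent colons, which made the gcc build fail, so the variable is first redefined from scratch and the three packages installed above are then prepended to it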

export LD_LIBRARY_PATH=/public/software/mpi/openmpi/1.6.5/intel/lib:/opt/gridview/pbs/dispatcher/lib:/public/software/compiler/intel/composer_xe_2013_sp1.0.080/compiler/lib/intel64:/public/software/compiler/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64:/usr/local/lib64:/usr/local/lib:/usr/local/otpserver/dependson_libs_x64

export LD_LIBRARY_PATH=~/opt/gmp-4.3.2/lib/:~/opt/mpfr-2.4.2/lib/:~/opt/mpc-0.8.1/lib/:$LD_LIBRARY_PATH

export LIBRARY_PATH=$LD_LIBRARY_PATH

Otherwise you will hit the error configure: error: cannot compute suffix of object files: cannot compile

Installing gcc

With the dependencies installed and the environment set up, the GCC installation can begin

$tar -jxvf gcc-4.8.5.tar.bz2

$cd gcc-4.8.5

$./configure --prefix=/home/software/opt/gcc-4.8.5/ --enable-threads=posix --disable-checking --disable-multilib --with-mpc=/home/software/opt/mpc-0.8.1/ --with-gmp=/home/software/opt/gmp-4.3.2/ --with-mpfr=/home/software/opt/mpfr-2.4.2/ 

make -j 10 #runs up to 10 compile jobs in parallel, which is much faster; this step takes a long time and should not be interrupted.

make install

Editing ~/.bashrc

Add the following two lines to the file to put gcc into the environment.

export PATH=/home/software/opt/gcc-4.8.5/bin/:$PATH

export LD_LIBRARY_PATH=/home/software/opt/gcc-4.8.5/lib/:~/opt/gcc-4.8.5/lib64/:$LD_LIBRARY_PATH

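Installation errors and their fixes

When installing any software on Linux, remember: paths, paths, paths. Important things are worth saying three times!
The main kinds of path-related errors are:
1) A path is missing
Fix: add the missing path to the .bashrc file, e.g. export PATH="$PATH:/home/bin/amos-3.1.0/bin".
2) The install path of the software being built is present in the environment, as in the [configure-stage2-gcc] Error 1 below.
3) An unexpected path is present in an environment variable, as in the blasr build error below.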

Error: [configure-stage2-gcc] Error 1

contains current directory 
configure: error:  
*** LIBRARY_PATH shouldn't contain the current directory when 
*** building gcc. Please change the environment variable 
*** and run configure again.
make[2]: *** [configure-stage2-gcc] Error 1

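1) The message says that the LIBRARY_PATH environment variable must not contain the path where GCC is being installed: if gcc is to be installed in /home/software/gcc-4.8.5/, then echo $LIBRARY_PATH should not list that path.
2) If echo $LIBRARY_PATH prints /usr/lib/x86_64-linux-gnu/: (note the trailing colon), the same error occurs; the fix is to drop the colon, leaving /usr/lib/x86_64-linux-gnu/.
3) Fix: unset LIBRARY_PATH; ./configure -v. Source: http://stackoverflow.com/questions/8565695/error-compiling-gcc-4-6-2-under-ubuntu-11-10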
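
An empty entry in a colon-separated search path (a leading, trailing or doubled colon) is treated as the current directory, which is what configure is complaining about. As an alternative to unsetting the variable, the empty entries can be stripped. A minimal sketch, where clean_path is a hypothetical helper name and the sample paths are for illustration only:

```shell
#!/bin/bash
# Hypothetical helper: remove empty entries from a colon-separated path list.
clean_path() {
    printf '%s\n' "$1" | sed -e 's/::*/:/g' -e 's/^://' -e 's/:$//'
}

trailing=$(clean_path '/usr/lib/x86_64-linux-gnu/:')
doubled=$(clean_path '/opt/a/lib::/opt/b/lib')
echo "$trailing"
echo "$doubled"

# To apply it to the real variable before running configure:
# export LIBRARY_PATH=$(clean_path "$LIBRARY_PATH")
```

The same clean-up applies to the LD_LIBRARY_PATH with two adjacent colons mentioned earlier.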

Error: [stage1-bubble] Error 2

make[1]: *** [stage1-bubble] Error 2
make[1]: Leaving directory `/np/linac/belloni/programs/gcc/gcc-build'
make: *** [all] Error 2

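Fix: this is mainly a consequence of the Error 1 above and disappears once the first error has been resolved.

Errors when compiling other software afterwards

Installing blasr fails as follows: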

g++ -std=c++11 -pedantic -Wall -Wextra    -c CCSSequence.cpp -o CCSSequence.o
/public/home/zpxu/bin/gcc-4.8.5/libexec/gcc/x86_64-unknown-linux-gnu/4.8.5/cc1plus: error while loading shared libraries: libmpc.so.2: cannot open shared object file: No such file or directory
make[3]: *** [CCSSequence.o] Error 1
make[3]: Leaving directory `/public/home/zpxu/bin/blasr_install/blasr/libcpp/pbdata'
make[2]: *** [libpbdata] Error 2
make[2]: Leaving directory `/public/home/zpxu/bin/blasr_install/blasr/libcpp'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/public/home/zpxu/bin/blasr_install/blasr/libcpp'

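The cause is that paths related to the blasr installation are already present in the system environment variables; comment out the corresponding paths in .bashrc.

Error after make install

/bin/llvm-tblgen: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /bin/llvm-tblgen)

Fix: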
I found the libstdc++.so.6.0.18 at the place where I complied gcc 4.8.1

Then I do like this


cp ~/objdir/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs/libstdc++.so.6.0.18 /usr/lib64/

rm /usr/lib64/libstdc++.so.6

ln -s libstdc++.so.6.0.18 libstdc++.so.6

problem solved.

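Further reading on GCC

Building and using static and shared libraries with gcc on Linux
Adding environment variables on Linux, and INCLUDE and LIB variables for the GCC compiler

Sources: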
http://favoorr.github.io/centos6.6-build-gcc5.2-from-source/
http://stackoverflow.com/questions/5216399/usr-lib-libstdc-so-6-version-glibcxx-3-4-15-not-found

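Installing Java on Linux and runtime errors

2016-05-31T11:08:12.000Z

Installing Java on Linux

Download the latest jdk from the Java website: jdk-8u91-linux-x64.tar.gz.
Install it following the official instructions: JDK Installation for Linux Platforms
Finally set the environment variables by editing the .bashrc file.

Java runtime errors on a Linux server
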

Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

or

Error occurred during initialization of VM
Could not reserve enough space for code cache

According to the messages, the main cause is insufficient memory at run time; the fix is as follows:

export JAVA_OPTS="-Xms512m -Xmx512m -XX:MaxPermSize=256m"

more: http://stackoverflow.com/questions/4401396/could-not-reserve-enough-space-for-object-heap

Circular RNAs

2016-04-12T06:50:29.000Z

环形RNA研究历史

1>最早的环形RNA分子在20世纪70年代于RNA病毒中发现。（Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures）

2>2012年，斯坦福大学和霍华德休斯医学研究所的科学家们发表在《Plos One》的一项研究首次证实在人体细胞的基因表达程序中，环形RNA分子而非线性RNA分子是一个更普遍的特征。

3>2013年2月，Nature头条：震惊遗传界的环状RNA,揭示出环状RNA（circRNA）是一类特殊的非编码RNA分子，与传统的线性RNA（linear RNA，含5’和3’末端）不同，circRNA分子呈封闭环状结构，不受RNA外切酶影响，表达更稳定，不易降解。在功能上，circRNA分子富含microRNA（miRNA）结合位点，在细胞中起到miRNA海绵（ miRNA sponge）的作用，进而解除miRNA对其靶基因的抑制作用，升高靶基因的表达水平（近期研究显示，一个环状RNA-CDR1as (也称为ciRS-7) ，在其序列上有超过60个保守的miR-7结合位点，因此ciRS-7像海绵那样，将miR-7吸附到身上，进而影响miR-7靶标基因活性）；这一作用机制被称为竞争性内源RNA（ceRNA）机制。通过与疾病关联的miRNA相互作用， circRNA在疾病中发挥着重要的调控作用。
Circular RNAs are a large class of animal RNAs with regulatory potency.Nature.Year published:(2013)DOI:doi:10.1038/nature11928：证实，在斑马鱼中表达这一环状RNA或敲除miR-7可以改变大脑发育。
Natural RNA circles function as efficient microRNA sponges.NatureYear published:(2013)DOI:doi:10.1038/nature11993：发现这一环状RNA的表达阻断了miR-7。它使得miR-7活性受到抑制，miR-7靶基因表达增高，研究人员推测这是因为这一RNA环捕获和失活了miR-7。

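Circular RNAs were initially thought to form by exon back-splice circularization and to localize to the cytoplasm. In September 2013, Ling-Ling Chen's group at the Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, discovered ciRNAs derived from intronic sequences, whose formation depends on specific key sequence motifs required for circularization. Mature ciRNAs localize to the nucleus and regulate the transcription rate of their parent genes.
Circular Intronic Long Noncoding RNAs

4> In September 2014, researchers at the Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, showed that complementary sequences in the introns mediate exon circularization.
Complementary Sequence-Mediated Exon Circularization

5> In February 2016, Ling-Ling Chen, a principal investigator at the Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, reviewed the biogenesis and emerging functions of circular RNA (circRNA). Circular RNAs in eukaryotic cells arise from back-splicing of precursor mRNA (pre-mRNA). Although circular RNAs are generally expressed at low levels, their expression is cell-type and tissue specific.
The biogenesis and emerging roles of circular RNAs
Further reading
The spliceosome
Introns are common in the protein-coding genes of eukaryotes. Splicing requires three elements within the intron: a 5' donor splice site, a 3' acceptor splice site, and a branch point. Splicing is catalyzed by the spliceosome, a large ribonucleoprotein complex built from five different small nuclear RNAs (snRNAs) and no fewer than one hundred proteins; each snRNA together with its associated proteins is called a small nuclear ribonucleoprotein (snRNP). The RNA of the snRNPs hybridizes with the intron and takes part in the catalysis of splicing.
Role of snRNPs (small nuclear ribonucleoproteins)
Both the nucleus and the cytoplasm of eukaryotic cells contain many small RNAs, roughly 100 to 300 bases long, with 10^5 to 10^6 molecules of such an RNA per cell. They are synthesized by RNA polymerase II or III, and some are capped like mRNA. Small RNAs in the nucleus are called snRNAs and those in the cytoplasm scRNAs. In their native state both are bound to proteins, forming snRNPs and scRNPs respectively. Some snRNPs are closely involved in splicing; some are complementary to the donor and acceptor splice sites and to the branch sequence.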

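Key points about circular RNAs

A covalently closed loop structure with no poly(A) tail.

Resistant to RNA exonucleases, hence more stably expressed and harder to degrade.

Highly conserved in sequence, with some tissue, developmental-stage, and disease specificity.

Biogenesis

Spliceosome: inhibiting the spliceosome lowers the levels of both circRNA and linear RNA. circRNA expression is regulated, and the spliceosome can tell forward splicing (linear RNA) from back-splicing (circRNA). Exactly how it does so is still unclear, but three circularization mechanisms have been identified. What they share is that the relevant splice sites are brought into juxtaposition; they differ in how that proximity is achieved. More: http://www.sciencedirect.com/science/article/pii/S1874939915001455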

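Function

Figure legend: 1. The introns flanking the circularized exons contain complementary sequence motifs; direct base pairing between these motifs pulls the splice sites to be joined close together;

RBPs (RNA-binding proteins) bind sequence motifs in the introns flanking the circularized exons and bridge them, promoting head-to-tail end-joining.
Exon skipping yields an mRNA containing exons 1 and 4 together with a lariat containing exons 2 and 3. This brings the splice donor of exon 3 close to the splice acceptor of exon 2, and subsequent splicing produces EIciRNA (exon–intron ciRNAs) and circRNA, alongside the linear RNA made of exons 1 and 4.
When a circRNA contains miRNA binding sites it sequesters AGO-miRNA complexes;
circRNAs regulate RBPs;
exon–intron ciRNAs reside in the nucleus, where the 5' splice site of their retained intron interacts directly with the U1 snRNP (U1) to promote transcription of the host gene; the exon–intron ciRNA-U1 complex recruits RNA polymerase II (RNA pol II) and stimulates transcription initiation of the host gene.
Further reading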
Specialised spliceosomes splice the introns out of pre-mRNA and seal the exon ends together, using the splicing consensus sequences at the intron/exon boundaries to identify the correct positions to splice. Sometimes a regulatory protein will mask a splicing sequence, resulting in alternative splicing. The spliceosomes consist primarily of RNA-protein complexes called small nuclear ribonucleoproteins (snRNPs). The snRNPs are composed of small nuclear RNAs (snRNAs) - U1, U2, U4, U5 and U6 - as well as a group of seven proteins known as Sm ribonucleoproteins that collectively make up the extremely stable Sm core of the snRNP. The snRNPs bind to the pre-mRNA in a specific order to align the splice sites for cleavage, which involves RNA-RNA pairing between the snRNA and the pre-mRNA with the help of the Sm proteins. The U1 snRNP binds to the 5’ end of the intron and the U2 snRNP binds close to the 3’ end of the intron (at the branch point), followed by the binding of the U4/U6 snRNPs that play an important role close to the reaction centre, and finally the U5 snRNP that helps hold the two exons together. After the intron is spliced out it is rapidly degraded, and the two exons are ligated together. More：http://www.ebi.ac.uk/interpro/potm/2005_5/Page1.htm

Analyzing Data

2016-04-01T06:17:34.000Z

Summary Statistics

mean(),max(),min(),range()，sd()

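If the input contains NA, these functions return NA; set na.rm=TRUE to ignore NA values.
Trimming outliers in mean(): mean(x, trim = 0.1) drops the largest 10% and the smallest 10% of the values in x, then averages the rest.
range(): returns the minimum and the maximum together.
sd(): standard deviation.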
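
To make the trimming rule concrete, here is a minimal Python sketch of a trimmed mean. It assumes the rule described above (drop the same fraction of values from each end of the sorted data, rounding the count down); the function name trimmed_mean and the sample data are made up for illustration and are not part of R.

```python
import math

def trimmed_mean(values, trim=0.0):
    """Mean after dropping a fraction `trim` of the values from each end.

    With n values, floor(n * trim) observations are removed from each end
    of the sorted data before averaging (assumed rounding rule).
    """
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    xs = sorted(values)
    k = math.floor(len(xs) * trim)      # number of values dropped per side
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 500]    # 500 is an obvious outlier
plain = sum(data) / len(data)               # pulled up by the outlier
robust = trimmed_mean(data, trim=0.1)       # drops 2 and 500, averages the rest
print(plain, robust)
```

The plain mean is dominated by the single outlier, while the trimmed mean stays close to the bulk of the data.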

quantile(),fivenum()，IQR()

quantile(dow30$Open, probs=c(0,0.25,0.5,0.75,1.0))
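Returns the requested percentiles; probs specifies which ones.
fivenum(): returns (minimum, 25th percentile, median, 75th percentile, and maximum).
IQR(): returns the interquartile range, i.e. the 75th percentile minus the 25th percentile.
These functions work on a single vector and can also be applied across a data frame with apply or tapply.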

summary()

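For numeric variables it reports the five-number summary plus the mean; for categorical variables it reports frequency counts.

Statistical Tests

The tests below assume normally distributed data; many natural populations are approximately normally distributed.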

Comparing means

Specifically, suppose that you have a set of observations
x1, x2, …, xn with experimental mean μ and want to know if the experimental
mean is different from the null hypothesis mean μ0. Furthermore, assume that the
observations are normally distributed. To test the validity of the hypothesis, you can
use a t-test. In R, you would use the function t.test;

Comparing paired data(means)

For example, you might have two observations per subject: one before an experiment and one after the experiment.
In this case, you would use a paired t-test. You can use the t.test function, specifying
paired=TRUE, to perform this test.

Comparing variances of two populations

To compare the variances of two samples from normal populations, R includes the
var.test function which performs an F-test;

Comparing means across more than two groups

One-way ANOVA and how to run it in R: http://tiramisutes.github.io/2015/10/08/ANOVA.html

Correlation tests

If you’d like to check whether there is a statistically significant
correlation between two vectors, you can use the cor.test function.

Basic SQLite database operations

2016-03-22T05:43:04.000Z

SQLite commands for inspecting a database

dot-commands

$ sqlite3
sqlite> .help
.backup ?DB? FILE      Backup DB (default "main") to FILE
.bail ON|OFF           Stop after hitting an error.  Default OFF
.databases             List names and files of attached databases
.dump ?TABLE? ...      Dump the database in an SQL text format
                         If TABLE specified, only dump tables matching
                         LIKE pattern TABLE.
.echo ON|OFF           Turn command echo on or off
.exit                  Exit this program
.explain ON|OFF        Turn output mode suitable for EXPLAIN on or off.
.genfkey ?OPTIONS?     Options are:
                         --no-drop: Do not drop old fkey triggers.
                         --ignore-errors: Ignore tables with fkey errors
                         --exec: Execute generated SQL immediately
                       See file tool/genfkey.README in the source 
                       distribution for further information.
.header(s) ON|OFF      Turn display of headers on or off
.help                  Show this message
.import FILE TABLE     Import data from FILE into TABLE
.indices ?TABLE?       Show names of all indices
                         If TABLE specified, only show indices for tables
                         matching LIKE pattern TABLE.
.load FILE ?ENTRY?     Load an extension library
.mode MODE ?TABLE?     Set output mode where MODE is one of:
                         csv      Comma-separated values
                         column   Left-aligned columns.  (See .width)
                         html     HTML  code
                         insert   SQL insert statements for TABLE
                         line     One value per line
                         list     Values delimited by .separator string
                         tabs     Tab-separated values
                         tcl      TCL list elements
.nullvalue STRING      Print STRING in place of NULL values
.output FILENAME       Send output to FILENAME
.output stdout         Send output to the screen
.prompt MAIN CONTINUE  Replace the standard prompts
.quit                  Exit this program
.read FILENAME         Execute SQL in FILENAME
.restore ?DB? FILE     Restore content of DB (default "main") from FILE
.schema ?TABLE?        Show the CREATE statements
                         If TABLE specified, only show tables matching
                         LIKE pattern TABLE.
.separator STRING      Change separator used by output mode and .import
.show                  Show the current values for various settings
.tables ?TABLE?        List names of tables
                         If TABLE specified, only list tables matching
                         LIKE pattern TABLE.
.timeout MS            Try opening locked tables for MS milliseconds
.width NUM NUM ...     Set column widths for "column" mode
.timer ON|OFF          Turn the CPU timer measurement on or off

SQLite data types
text

integer 

real 

NULL, used for missing data, or no value 

BLOB, which stands for binary large object, and stores any type of object as bytes 

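Note: SQLite does not force every value in a column to have the same type. Each column of each table has a preferred type (type affinity), but to keep downstream analysis simple it is best to store a single data type per column.
When a column holds mixed data types, values sort in this order: NULL values, integer and real values (sorted numerically), text values, and finally blob values.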
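
The sort order for mixed types can be checked with a small experiment. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table name mixed and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No declared column type, so each value keeps its own storage class.
conn.execute("CREATE TABLE mixed(v)")
conn.executemany(
    "INSERT INTO mixed(v) VALUES (?)",
    [("abc",), (2,), (None,), (b"\x00\x01",), (1.5,)],
)

# typeof() reports the storage class of each stored value.
rows = conn.execute("SELECT typeof(v), v FROM mixed ORDER BY v").fetchall()
for storage_class, value in rows:
    print(storage_class, repr(value))

order = [storage_class for storage_class, _ in rows]
conn.close()
```

NULL comes first, the two numbers are compared numerically regardless of integer or real storage, then text, then the blob.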
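Querying the database
The all-purpose SELECT command
Syntax:
SELECT columns FROM tablename;

Basic form: SELECT fetches rows from a table; setting columns to * returns every column of every row.
Selecting specific columns: separate the column names with commas (SELECT trait, chrom, position, strongest_risk_snp, pvalue FROM gwascat LIMIT 5;)
Besides interactive use inside sqlite3, SELECT statements can also be run directly from the command line:
# interactive
sqlite> SELECT * FROM gwascat;
# command line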
sqlite3 gwascat.db "SELECT * FROM gwascat" > results.txt

SQLite's default output is not aligned; the settings below produce neatly aligned, readable output:
sqlite> SELECT trait, chrom, position, strongest_risk_snp, pvalue
   ...> FROM gwascat LIMIT 5;
trait|chrom|position|strongest_risk_snp|pvalue
Asthma and hay fever|6|32658824|rs9273373|4.0e-14
Asthma and hay fever|4|38798089|rs4833095|5.0e-12
Asthma and hay fever|5|111131801|rs1438673|3.0e-11
Asthma and hay fever|2|102350089|rs10197862|4.0e-11
Asthma and hay fever|17|39966427|rs7212938|4.0e-10
sqlite> .header on
sqlite> .mode column
sqlite> SELECT trait, chrom, position, strongest_risk_snp, pvalue
   ...> FROM gwascat LIMIT 5;
trait                 chrom       position    strongest_risk_snp  pvalue
--------------------  ----------  ----------  ------------------  ----------
Asthma and hay fever  6           32658824    rs9273373           4.0e-14
Asthma and hay fever  4           38798089    rs4833095           5.0e-12
Asthma and hay fever  5           111131801   rs1438673           3.0e-11
Asthma and hay fever  2           102350089   rs10197862          4.0e-11
Asthma and hay fever  17          39966427    rs7212938           4.0e-10

Optional SELECT clauses:
LIMIT    limits the number of rows returned
sqlite> SELECT * FROM gwascat LIMIT 2;

ORDER BY     sorts the results
SELECT author, trait, journal FROM tablename ORDER BY author DESC LIMIT 5;
# sorts by author in descending order; sorting helps spot outliers.
# If the sort column contains NULL values, exclude them with IS NOT NULL
SELECT chrom, position, trait, strongest_risk_snp, pvalue FROM tablename WHERE pvalue IS NOT NULL ORDER BY pvalue LIMIT 5;

WHERE    filters rows
SELECT chrom, position, trait, strongest_risk_snp, pvalue FROM tablename WHERE lower(strongest_risk_snp) = "rs429358";
# String comparison with = is case-sensitive in SQLite, so it is best to normalize with lower() before matching.

Filtering on several conditions:
sqlite> SELECT chrom, position, strongest_risk_snp, pvalue FROM gwascat
   ...> WHERE chrom IN ("1", "2", "3") AND pvalue < 10e-11
   ...> ORDER BY pvalue LIMIT 5;
# or
sqlite> SELECT chrom, position, strongest_risk_snp, pvalue
   ...> FROM gwascat WHERE chrom = "22"
   ...> AND position BETWEEN 24000000 AND 25000000
   ...> AND pvalue IS NOT NULL ORDER BY pvalue LIMIT 5;

AS    renames or transforms columns in the output:
sqlite> SELECT lower(trait) AS trait,
   ...> "chr" || chrom || ":" || position AS region FROM gwascat LIMIT 5;
# || is the concatenation operator; it joins two strings
# replace NULLs with the ifnull() function
sqlite> SELECT ifnull(chrom, "NA") AS chrom, ifnull(position, "NA") AS position,
   ...> strongest_risk_snp, ifnull(pvalue, "NA") AS pvalue FROM gwascat
   ...> WHERE strongest_risk_snp = "rs429358";

More SQLite built-in functions
Function    Description
ifnull(x, val)    If x is NULL, return with val, otherwise return x; shorthand for coalesce() with two arguments
min(a, b, c, …)    Return minimum in a, b, c, …
max(a, b, c, …)    Return maximum in a, b, c, …
abs(x)    Absolute value
coalesce(a, b, c, …)    Return first non-NULL value in a, b, c, … or NULL if all values are NULL
length(x)    Returns number of characters in x
lower(x)    Return x in lowercase
upper(x)    Return x in uppercase
replace(x, str, repl)    Return x with all occurrences of str replaced with repl
round(x, digits)    Round x to digits (default 0)
trim(x, chars), ltrim(x, chars), rtrim(x, chars)    Trim off chars (spaces if chars is not specified) from both sides, left side, and right side of x, respectively.
substr(x, start, length)    Extract a substring from x starting from character start and is length characters long
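Aggregate functions
count(colname):
count(*) returns the total number of rows, NULLs included: sqlite> SELECT count(*) FROM gwascat;
If colname is a specific column, it returns the number of rows in which that column is not NULL.
Similar functions: avg(x), max(x), min(x), sum(x), total(x)
Counting the distinct (unique) values in a column
sqlite> SELECT count(DISTINCT colname) AS unique_rs FROM gwascat;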
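
The difference between the three counts is easiest to see on a tiny table. This is a minimal sketch using Python's sqlite3 module; the table hits and its four rows are invented for illustration and are not the gwascat data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits(chrom text, snp text)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [("1", "rs1"), ("1", "rs1"), ("2", "rs2"), (None, "rs3")],
)

# count(*) counts rows, count(chrom) skips NULLs, count(DISTINCT snp) counts unique values.
n_rows, n_chrom, n_unique_snp = conn.execute(
    "SELECT count(*), count(chrom), count(DISTINCT snp) FROM hits"
).fetchone()
print(n_rows, n_chrom, n_unique_snp)
conn.close()
```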

Grouping rows (GROUP BY)
sqlite> SELECT chrom, count(*) FROM gwascat GROUP BY chrom;
chrom       count(*)
----------  ----------
            70
1           1458
10          930
11          988
12          858
13          432
[...]
# rename a column, sort in descending order, count per group
sqlite> SELECT chrom, count(*) as nhits FROM gwascat GROUP BY chrom
   ...> ORDER BY nhits DESC;
chrom       nhits
----------  ----------
6           1658
1           1458
2           1432
3           1033
11          988
10          930
[...]
# group by several columns, separated by commas
sqlite> select strongest_risk_snp, strongest_risk_allele, count(*) AS count
   ...> FROM gwascat GROUP BY strongest_risk_snp, strongest_risk_allele
   ...> ORDER BY count DESC LIMIT 10;
strongest_risk_snp  strongest_risk_allele  count
------------------  ---------------------  ----------
rs1260326           T                      22
rs2186369           G                      22
rs1800562           A                      20
rs909674            C                      20
rs11710456          G                      19

Writing your own database
We'll use the basic SQL syntax to create tables and insert records into tables. Then load data into SQLite using Python's sqlite3 module.
Creating tables
Basic syntax:
CREATE TABLE tablename(
  id integer primary key,
  column1 column1_type,
  column2 column2_type,
  ...
);

Note that the first column of the table here is id integer primary key; the primary key is a unique integer that identifies each record in the table.
Creating a table:
$ sqlite3 practice.db
sqlite> CREATE TABLE variants(
   ...>   id integer primary key,
   ...>   chrom text,
   ...>   start integer,
   ...>   end integer,
   ...>   strand text,
   ...>   name text);

Inserting data into a table
Basic syntax:
INSERT INTO tablename(column1, column2)
VALUES (value1, value2);

Creating indexes
Basic syntax:
sqlite> CREATE INDEX indexname ON tablename(columnname);
# list indexes
sqlite> .indices
indexname
# drop an index
sqlite> DROP INDEX indexname;
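
As a quick check that an index is actually picked up, the sketch below creates one from Python and asks SQLite for its query plan. The table layout and the index name variants_chrom_idx are assumptions made for this example; the exact wording of the plan text varies between SQLite versions, so only the index name is inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants(id integer primary key, chrom text, position integer)"
)
conn.execute("CREATE INDEX variants_chrom_idx ON variants(chrom)")

# sqlite_master lists every table and index in the database.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'variants'"
    )
]

# EXPLAIN QUERY PLAN describes how SQLite intends to run the query;
# the last column of each row is the human-readable detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM variants WHERE chrom = ?", ("1",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(index_names)
print(plan_text)
conn.close()
```

An index pays off for columns that appear often in WHERE clauses; it costs extra space and slows down inserts slightly.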

Modifying or dropping a table
Drop: DROP TABLE
Modify: ALTER TABLE
Working with SQLite from Python
Connect to an SQLite database and create a table
create_table.py
import sqlite3

# the filename of this SQLite database
db_filename = "variants.db"

# initialize database connection
conn = sqlite3.connect(db_filename)  # connect() opens (or creates) the database file

c = conn.cursor()  # a cursor is how Python code talks to the SQLite database

table_def = """\
CREATE TABLE variants(
  id integer primary key,
  chrom text,
  start integer,
  end integer,
  strand text,
  rsid text);
"""

c.execute(table_def)  # run the SQL statement
conn.commit()  # commit the changes to the SQLite database
conn.close()   # close the connection to the database

Loading data into the table
load_variants.py
# usage: python load_variants.py data.txt
import sys
import sqlite3
from collections import OrderedDict

# the filename of this SQLite database
db_filename = "variants.db"

# initialize database connection
conn = sqlite3.connect(db_filename)
c = conn.cursor()

## Load Data
# columns (other than id, which is automatically incremented)
tbl_cols = OrderedDict([("chrom", str), ("start", int), 
                        ("end", int), ("strand", str),
                        ("rsid", str)])

with open(sys.argv[1]) as input_file:
    for line in input_file:
        # split a tab-delimited line
        values = line.strip().split("\t")

        # pair each value with its column name
        cols_values = zip(tbl_cols.keys(), values)
        # use the column name to look up an appropriate function to coerce each
        # value to the appropriate type
        coerced_values = [tbl_cols[col](value) for col, value in cols_values]

        # create a list with one "?" placeholder per column
        placeholders = ["?"] * len(tbl_cols)

        # create the query by joining column names and placeholder question
        # marks into comma-separated strings
        colnames = ", ".join(tbl_cols.keys())
        placeholders = ", ".join(placeholders)
        query = "INSERT INTO variants(%s) VALUES (%s);"%(colnames, placeholders)

        # execute query
        c.execute(query, coerced_values)

conn.commit() # commit these inserts
conn.close()




How to delete safely on Linux: find-rm
2016-03-21T04:18:46.000Z
Suppose the current directory holds many fastq files that you want to delete. The usual approach is:
Direct deletion
rm *-temp.fastq

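But if you slip and type rm * -temp.fastq (an extra space between the * and the -), the result is disastrous: the shell expands the bare * to every file in the directory.
Combining rm with find, one of the most common Linux commands, makes this much safer: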
find . -name "*-temp.fastq" -exec rm -i {} \;
#or
find . -name "*-temp.fastq" | xargs rm

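-exec    each match found by find is passed as an argument to the command given after -exec (substituted for {})
-i    prompt before every removal
To have find delete the matches itself (including empty directories), use -delete in place of -exec rm {} \;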
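
The same "look before you delete" idea can be sketched in Python for use inside a script. This is an illustration only: the helper names find_matches and delete_matches and the dry_run parameter are invented here, and the demo runs in a throwaway temporary directory.

```python
from pathlib import Path
import tempfile

def find_matches(root, pattern):
    """Recursively list regular files under root whose name matches the glob pattern."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())

def delete_matches(root, pattern, dry_run=True):
    """Delete matching files; with dry_run=True only report what would be removed."""
    matches = find_matches(root, pattern)
    for path in matches:
        if dry_run:
            print(f"would remove {path}")
        else:
            path.unlink()
    return matches

# Demonstration in a temporary directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a-temp.fastq", "b-temp.fastq", "keep.fastq"):
        (Path(tmp) / name).touch()
    planned = delete_matches(tmp, "*-temp.fastq")           # dry run first
    delete_matches(tmp, "*-temp.fastq", dry_run=False)      # then delete for real
    remaining = sorted(p.name for p in Path(tmp).iterdir())
print(remaining)
```

Because the pattern is passed as a single quoted string, a stray space cannot turn it into a bare wildcard.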
Print the rm commands and check them before running
find . -name "*-temp.fastq" | xargs -n 1 echo "rm -i" > delete-temp.sh
cat delete-temp.sh
rm -i ./zmaysA_R1-temp.fastq
rm -i ./zmaysA_R2-temp.fastq
rm -i ./zmaysC_R1-temp.fastq
rm -i ./zmaysC_R2-temp.fastq
bash delete-temp.sh
#or
find . -name "*.fastq" | xargs -n 1 -P 4 bash script.sh

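-n 1    pass the arguments found by find to the command one at a time
-P    number of processes to run in parallel
Move deleted files into a temporary folder (tmp) instead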
myrm(){ D=/tmp/$(date +%Y%m%d%H%M%S); mkdir -p $D; mv "$@" $D && echo "moved to $D ok"; }



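Overlooked samtools options
2016-03-20T14:54:42.000Z
Samtools is a suite of tools for manipulating sequence alignments stored in SAM and BAM files.
The SAM file format
A SAM file has two parts, a header section and an alignment section, both tab-delimited.
Header section: lines start with '@' and carry general information about the alignment run: the SAM format version, the reference sequences, the aligner used, and so on.
Alignment section: one alignment record per line, with 11 mandatory columns followed by optional fields.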
@HD VN:1.0 SO:unsorted  
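First header line: VN is the format version; SO is the sort order of the alignments, one of unknown (default), unsorted,
queryname, or coordinate. Older versions of samtools did not update the SO value of a BAM file after sorting,
whereas Picard did.
@SQ SN:A.auricula_all_contig_1 LN:9401
Reference sequence. The order of these entries determines the sort order of the alignments. SN is the reference sequence name; LN is the reference sequence
length.
@RG ID:sample01
Read group. The sequencing results of one sample form one read group; a sample may have sequencing results
from several libraries.
@PG ID:bowtie2 PN:bowtie2 VN:2.0.0-beta7
The program used for the alignment.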

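Explanation of the 11 mandatory columns and the optional fields
1  QNAME  Query name, i.e. the read ID from the single-end or paired-end fa/fq file; '*' indicates the information is unavailable.
2  FLAG   Bitwise FLAG (describes the alignment: pairing, strand, mate strand, etc.)
3  RNAME  Name of the reference sequence the read aligned to; an unmapped segment without coordinate has a '*' at this field.
4  POS    1-based leftmost mapping position; POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.
5  MAPQ   Mapping quality; 255 indicates that the mapping quality is not available.
6  CIGAR  Extended CIGAR string (operators: MIDNSHP) describing the alignment: matched bases, splicing, etc.; '*' when unavailable.
7  RNEXT  Reference sequence name of the mate's alignment; '*' when the information is unavailable, and '=' if RNEXT is identical to RNAME.
8  PNEXT  1-based leftmost position of the mate; 0 when the information is unavailable.
9  TLEN   Template (insert) length; set as 0 for single-segment template or when the information is unavailable.
10 SEQ    Read sequence on the same strand as the reference (if the read aligned to the reverse strand, this is its reverse complement); '*' when the sequence is not stored.
11 QUAL   Base qualities of the read (ASCII code minus 33 gives the Phred base quality); '*' when quality is not stored.
12 Optional fields, giving extra information in TAG:TYPE:VALUE form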

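Notes on the alignment section
The alignment section of a SAM/BAM file holds the results of the alignment run. The main fields are explained below:
FLAG


0x800 marks a supplementary alignment, i.e. one segment of a chimeric alignment;
0x4 marks a read that is unmapped;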
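
Since FLAG is a bit field, a single number encodes several yes/no properties at once. The sketch below decodes a FLAG value in Python using the bit values defined in the SAM specification; the short names in SAM_FLAGS and the helper functions are labels chosen for this example, not samtools output.

```python
# Bit values from the SAM specification; the names are informal labels.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]

def is_mapped(flag):
    """A read counts as mapped when the 0x4 (unmapped) bit is not set."""
    return not flag & 0x4

print(decode_flag(99))   # a typical FLAG for the first read of a properly paired fragment
print(is_mapped(4))
```

This is the logic behind samtools view -f (all listed bits must be set) and -F (none of the listed bits may be set).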

CIGAR


For mRNA-to-genome alignments, N denotes an intron.
More: http://samtools.github.io/hts-specs/SAMv1.pdf
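
To illustrate how the CIGAR operators translate into coordinates, here is a small Python sketch that parses a CIGAR string, sums the operations that consume reference bases (M, D, N, = and X according to the SAM specification), and lists the N operations as intron lengths. The function names and the example strings are made up.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":            # CIGAR unavailable
        return []
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def reference_span(cigar):
    """Number of reference bases covered by the alignment, introns included."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)

def intron_lengths(cigar):
    """Lengths of the skipped regions (N), i.e. introns for spliced RNA-seq reads."""
    return [n for n, op in parse_cigar(cigar) if op == "N"]

print(reference_span("30M1200N20M"), intron_lengths("30M1200N20M"))
```

An alignment starting at POS therefore ends at POS + reference_span - 1 on the reference.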
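Worked examples on a SAM file
Counting unmapped reads
Each alignment is one line of the SAM file, but not all lines are successful alignments. In a SAM file an unmapped read has FLAG bit 4 set and an asterisk (*) in the RNAME column;
# count the alignment lines whose RNAME contains an asterisk
cut -f3 smallRNA-seq.sam | grep -c \*
# total number of alignment lines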
grep -c -v "^@" smallRNA-seq.sam

How many different read IDs are in the file?
The query (read) ID is in field 1. Some reads may have multiple alignments, so the number of lines is not necessarily the number of reads.
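grep -v "^@" HR-1B.fq.gz.sam | cut -f1 | sort | uniq | wc -l
# Compare with the number of reads in the single-end/paired-end fa/fq input (for FASTQ, the line count divided by 4); if the two differ, some reads are missing from the alignment.
zcat pepper/RNA-seq/sgs-clean-reads/HR-1B.fq.gz | wc -l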

How many different read sequences are in the file?
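The read IDs in column 1 only distinguish reads as they were sequenced; their sequences may still be identical because of PCR amplification or library bias, so counting the unique sequences in column 10 gives the number of distinct read sequences.
grep -v "^@" smallRNA-seq.sam | cut -f10 | sort | uniq | wc -l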

How many reads are uniquely mapped?
grep -v "^@" smallRNA-seq.sam | cut -f1 | sort | uniq -u | wc -l

For BWA alignments that carry the XT tag, grep -c XT:A:U smallRNA-seq.sam gives an exact count.
How many reads are multi-hits?
grep -v "^@" smallRNA-seq.sam | cut -f1 | sort | uniq -d | wc -l

For BWA alignments that carry the XT tag, grep -c XT:A:R smallRNA-seq.sam gives an exact count.
How many alignments are reported for each read?
grep -v "^@" smallRNA-seq.sam | cut -f1 | sort | uniq -c | sort -nr > sortedreadcount.txt
grep -v "^@" smallRNA-seq.sam | cut -f1 |sort | uniq -c | sort -nr | cut -c1-8 | sort | uniq -c

How many different reference sequences are represented in the file?
grep -v "^@" smallRNA-seq.sam | cut -f3 | sort | uniq | wc -l

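view
-c    print only the count of matching alignments
-f    keep only alignments that have all of the given FLAG bits set
-q    keep only alignments with mapping quality greater than or equal to the given value
-c -F 4: count the mapped reads;
-c -f 4: count the unmapped reads;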
To get the unmapped reads from a bam file use :
samtools view -f 4 file.bam > unmapped.sam, the output will be in sam
to get the output in bam use : samtools view -b -f 4 file.bam > unmapped.bam
To get only the mapped reads use the parameter -F, which works like -v of grep and skips the alignments for a specific flag.
samtools view -b -F 4 file.bam > mapped.bam
samtools view -b -F 4 -f 8 file.bam > onlyThisEndMapped.bam
samtools view -b -F 8 -f 4 file.bam > onlyThatEndMapped.bam
samtools view -b -F12 file.bam > bothEndsMapped.bam
samtools merge merged.bam onlyThisEndMapped.bam onlyThatEndMapped.bam bothEndsMapped.bam
For TopHat alignments:
samtools view -b -f 2  accepted_hits.bam > mappedPairs.bam
Better with:
samtools view -b -f 0x2 accepted_hits.bam > mappedPairs.bam
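sort
-m    memory per thread; accepts the K, M, G suffixes
-@    number of threads
index
The BAM file must first be sorted by coordinate (the default sort order) before it can be indexed; otherwise index reports an error.
Indexing produces a file with the .bai suffix, used for fast random access. Many tools need the .bai file, especially for displaying alignments: samtools tview requires it, as does gbrowse2 when drawing read alignments.
faidx
Builds an index for a FASTA file; the index file has the .fai suffix. The command can also use the index to quickly extract a sequence or subsequence from the FASTA file.
See also: bedtools usage notes
flagstat
Summarizes the alignment statistics of a BAM file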
$ samtools flagstat example.bam
11945742 + 0 in total (QC-passed reads + QC-failed reads)
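# total number of reads
0 + 0 duplicates
7536364 + 0 mapped (63.09%:-nan%)
# overall mapping rate of the reads
11945742 + 0 paired in sequencing
# number of reads that belong to paired-end data
5972871 + 0 read1
# number of reads that are read 1
5972871 + 0 read2
# number of reads that are read 2
6412042 + 0 properly paired (53.68%:-nan%)
# properly paired reads: both mates map to the same reference sequence and the distance between them is within the configured threshold
6899708 + 0 with itself and mate mapped
# paired reads where both mates map to the reference
636656 + 0 singletons (5.33%:-nan%)
# reads that map while their mate does not; adding this to the previous line gives the total number of mapped reads
469868 + 0 with mate mapped to a different chr
# paired reads whose two mates map to two different reference sequences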
243047 + 0 with mate mapped to a different chr (mapQ>=5)

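mpileup
mpileup (formerly pileup) is another very important samtools command. It generates a BCF file, which bcftools then uses for SNP and indel analysis. In older samtools releases bcftools shipped with samtools and could be found in the samtools installation directory.
-f    the FASTA reference sequence, which must have an index file;
-g    write output in BCF format.
depth
Computes the sequencing depth at each position or region;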
samtools depth sorted.bam > sorted.bam.txt

The output has three tab-separated columns: chromosome name, position, and coverage depth.

Sources
http://www.plob.org/2014/01/26/7112.html
http://blog.sina.com.cn/s/blog_670445240101l30k.html
https://www.biostars.org/p/56246/
https://www.biostars.org/p/110039/
https://www.biostars.org/p/95929/
https://www.biostars.org/p/110157/
https://groups.google.com/forum/#!forum/bedtools-discuss
https://code.google.com/p/hydra-sv/



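bedtools usage notes
2016-03-18T08:29:32.000Z
Overview
BEDTools is a toolkit for comparing, manipulating, and annotating genomic features. Genomic features are usually stored in Browser Extensible Data (BED) or General Feature Format (GFF) files and can be compared visually in the UCSC Genome Browser.
Basic concepts for using BEDTools
Existing genome feature information is generally stored in BED or GFF format.
genome features: functional elements (genes), genetic polymorphisms (SNPs, INDELs, or structural variants), annotations obtained by sequencing or other methods, or any custom-defined features.
Overlapping/intersecting features: two genome features share at least one bp.
One difference between BED and GFF files
BED coordinates are 0-based and half-open: the start counts from 0 and the end is at least 1. GFF coordinates are 1-based and closed: the start counts from 1 and the end is at least 1.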
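
The off-by-one between the two conventions is a frequent source of bugs, so here is a minimal Python sketch of the conversion. It assumes the usual conventions (BED: 0-based start, end not included; GFF: 1-based start, end included); the function names are made up for this example.

```python
def bed_to_gff(start, end):
    """Convert a 0-based half-open BED interval to a 1-based closed GFF interval."""
    return start + 1, end

def gff_to_bed(start, end):
    """Convert a 1-based closed GFF interval to a 0-based half-open BED interval."""
    return start - 1, end

def bed_length(start, end):
    """Feature length in BED coordinates."""
    return end - start

def gff_length(start, end):
    """Feature length in GFF coordinates."""
    return end - start + 1

# The first 100 bases of a chromosome in both conventions.
print(bed_to_gff(0, 100), gff_to_bed(1, 100))
```

Only the start coordinate changes; the end coordinate is numerically the same in both formats.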
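Related formats
BED format
BEDTools mainly uses the first three BED columns; a BED file can have up to 12. The commonly used columns are:
chrom: chromosome name, e.g. chr1, III, myCHrom, contig1112.23; required
start: start position of the genome feature, counted from 0; required
end: end position of the genome feature, at least 1; required
score: any quantifiable numeric value, such as a p-value
strand: forward or reverse strand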
seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on..
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
See more from http://www.ensembl.org/info/website/upload/gff.html
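genome files
Some BEDTools tools (genomeCoverageBed, complementBed, slopBed) need the chromosome sizes of the species. A genome file is a two-column, tab-delimited file with one chromosome per line: the chromosome name in the first column and its size in the second. Genome files for commonly used species are in the /genome directory of the BEDTools installation.
For how to build a genome file for a custom genome, see my other post on computing the lengths of FASTA sequences in batch.
BEDTools usage summary
intersect/intersectBed: computing overlaps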
bedtools intersect -a A.bed -b B.bed -wa -wb

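Finds the overlaps between two BED or BAM files; you can choose whether an entire genome feature must overlap or only part of it.
The default output is illustrated in the figure below:

With -wa, the original feature from file A is reported, as in the figure below:

With -wb, the original feature from file B is reported; with -c, the number of overlapping features is reported.
bedtools intersect uses a lot of memory on large files. An effective remedy is to sort both A and B by chromosome name and position (sort -k1,1 -k2,2n) and then rerun intersect with the -sorted option.
bedtools intersect -a A-sorted.bed -b B-sorted.bed -sorted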

Other options:
-wo    also report the number of overlapping bases
$bedtools  intersect -a A.bed -b B.bed -wo
chr1    0       15      a       chr1    0       4       x       4
chr1    0       15      a       chr1    9       15      z       6
chr1    25      29      b       chr1    18      28      y       3
chr1    18      18      c       chr1    18      28      y       1
chr1    10      14      d       chr1    9       15      z       4
chr1    20      23      e       chr1    18      28      y       3

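-v    report the entries in A that have no overlap in B
-s    only report overlaps between features on the same strand
-c    for each entry in A, report the number of overlapping features in B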
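
The overlap size reported by -wo can be reproduced with one line of interval arithmetic. The sketch below is a plain-Python illustration, not bedtools code; it assumes BED-style half-open intervals on the same chromosome and does not reproduce the special handling bedtools applies to zero-length features (such as feature c in the output above).

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals on the same chromosome."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

# Pairs taken from the -wo output above: (A interval, B interval).
pairs = {
    "a-x": ((0, 15), (0, 4)),
    "a-z": ((0, 15), (9, 15)),
    "b-y": ((25, 29), (18, 28)),
    "d-z": ((10, 14), (9, 15)),
    "e-y": ((20, 23), (18, 28)),
}
overlaps = {name: overlap_bp(*a, *b) for name, (a, b) in pairs.items()}
print(overlaps)
```

Two intervals overlap exactly when this value is greater than zero, which is also the test behind -v.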
complement: regions of the genome not covered by any feature
bedtools complement -i BED_FILE -g GENOME_FILE

slop: extend feature intervals
Requires: a single input BED file (given with -i) and a genome file
cat ranges-qry.bed 
chr1    0       15      a
chr1    25      29      b
chr1    18      18      c
chr1    10      14      d
chr1    20      23      e
chr1    6       7       f
bedtools slop -i ranges-qry.bed -g genome.txt -b 4
chr1    0       19      a
chr1    21      33      b
chr1    14      22      c
chr1    6       18      d
chr1    16      27      e
chr1    2       11      f
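# -b 4: extend both ends by 4 bases (clipped at the chromosome boundaries)

-l 3 -r 5: extend by 3 on the left and 5 on the right
flank: extract flanking regions (e.g. promoters)
Requires: a genome annotation GTF file (given with -i) and a genome file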
bedtools flank -i mm_GRCm38.75_protein_coding_genes.gtf \
                 -g Mus_musculus.GRCm38_genome.txt \
                 -l 3000 -r 0 > mm_GRCm38_3kb_promoters.gtf
cut -f1,4,5,7 mm_GRCm38_3kb_promoters.gtf | head -n 3
1       3671499 3674498 -
1       4360315 4363314 -
1       4496414 4499413 -

getfasta: extract sequences
Requires: a genome FASTA file (given with -fi) and a GTF file of the intervals to extract (given with -bed)
bedtools getfasta -fi Mus_musculus.GRCm38.75.dna_rm.toplevel_chr1.fa \
   -bed mm_GRCm38_3kb_promoters.gtf -fo mm_GRCm38_3kb_promoters.fasta

-tab    Report extract sequences in a tab-delimited format instead of in FASTA format.

Extracting sequences with samtools (faster)
# first build the .fai index file (column 1 is the chromosome name, column 2 the sequence length in bases)
samtools faidx Mus_musculus.GRCm38.75.dna.chromosome.8.fa
# extract sequences; separate multiple regions with spaces
samtools faidx Mus_musculus.GRCm38.75.dna.chromosome.8.fa \
     8:123407082-123410744 8:123518835-123536649
>8:123407082-123410744
GAGAAAAGCTCCCTTCTTCTCCAGAGTCCCGTCTACCCTGGCTTGGCGAGGGAAAGGAAC
CAGACATATATCAGAGGCAAGTAACCAAGAAGTCTGGAGGTGTTGAGTTTAGGCATGTCT
[...]
>8:123518835-123536649
TCTCGCGAGGATTTGAGAACCAGCACGGGATCTAGTCGGAGTTGCCAGGAGACCGCGCAG
CCTCCTCTGACCAGCGCCCATCCCGGATTAGTGGAAGTGCTGGACTGCTGGCACCATGGT
[...]

nuc: compute GC content and the count of each base
bedtools nuc -fi hg19.fa -bed CDS.bed

Explanation of the output: the following columns are appended to each line of the input BED file
Output format: 
The following information will be reported after each BED entry:
    1) %AT content
    2) %GC content
    3) Number of As observed
    4) Number of Cs observed
    5) Number of Gs observed
    6) Number of Ts observed
    7) Number of Ns observed
    8) Number of other bases observed
    9) The length of the explored sequence/interval.
    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
    11) The number of times a user's pattern was observed.
        (opt., if -pattern is used.)

genomecov: per-chromosome and whole-genome coverage
Requires: a single input BED file (given with -i) and a genome file; if the input is a BAM file (given with -ibam), no genome file is needed.
cat ranges-cov-sorted.bed
chr1    4       9
chr1    1       6
chr1    8       19
chr1    25      30
chr2    0       20
$ cat cov.txt
chr1    30
chr2    20
bedtools genomecov -i ranges-cov-sorted.bed -g cov.txt
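chr1    0       7       30      0.233333
chr1    1       20      30      0.666667
chr1    2       3       30      0.1
chr2    1       20      20      1
genome  0       7       50      0.14
genome  1       40      50      0.8
genome  2       3       50      0.06
# columns: name, depth, number of bases at that depth, total bases, fraction of bases at that depth
# coverage is reported both per chromosome and for the whole genome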

ranges-cov.bed must be sorted beforehand: sort -k1,1 ranges-cov.bed > ranges-cov-sorted.bed

-bg reports the depth in BedGraph format; -d reports the depth at every single base.
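
To see where the histogram numbers come from, the sketch below recomputes the chr1 rows in plain Python: it counts, for every base of the chromosome, how many intervals cover it, then tallies how many bases have each depth. This is an illustration of the idea, not how bedtools is implemented; the function name is made up, and the per-base loop is only suitable for small examples.

```python
from collections import Counter

def depth_histogram(intervals, chrom_length):
    """Map depth -> number of bases with that depth, for half-open BED intervals."""
    depth = [0] * chrom_length
    for start, end in intervals:
        for pos in range(start, min(end, chrom_length)):
            depth[pos] += 1
    return dict(Counter(depth))

# chr1 intervals from ranges-cov-sorted.bed; chr1 is 30 bases long in cov.txt.
chr1 = [(4, 9), (1, 6), (8, 19), (25, 30)]
hist = depth_histogram(chr1, 30)
for d in sorted(hist):
    print("chr1", d, hist[d], 30, round(hist[d] / 30, 6))
```

The three printed rows match the chr1 lines of the genomecov output: 7 bases uncovered, 20 bases covered once, and 3 bases covered twice.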

coverage: compute coverage over given intervals of a chromosome
$ cat A.bed
chr1  0   100
chr1  100 200
chr2  0   100

$ cat B.bed
chr1  10  20
chr1  20  30
chr1  30  40
chr1  100 200

$ bedtools coverage -a A.bed -b B.bed
chr1  0   100  3  30  100 0.3000000
chr1  100 200  1  100 100 1.0000000
chr2  0   100  0  0   100 0.0000000


Sources
http://www.plob.org/2012/09/26/3748.html
http://bedtools.readthedocs.org/en/latest/content/bedtools-suite.html
https://code.google.com/archive/p/bedtools/wikis/Usage.wiki
https://code.google.com/archive/p/bedtools/wikis/UsageAdvanced.wiki



Data reshaping in R with dplyr
2016-03-15T13:10:33.000Z
Data set types
Convert an overly long or wide data set to the display-friendly tbl_df type:
hflights_df <- tbl_df(hflights)

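Type hflights_df to see what it is like not to have the console flooded.
Basic operations
Common data manipulation tasks fall into the following five verbs:
Filter: filter()
Selects the subset of rows that satisfies the given logical conditions, similar to base::subset()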
filter(hflights_df, Month == 1, DayofMonth == 1)
# any combination of conditions on the same object
filter(hflights_df, Month == 1 | Month == 2)

Arrange: arrange()
Sorts the rows by the given column names, in order
arrange(hflights_df, DayofMonth, Month, Year)
# wrap a column name in desc() to sort in descending order:
arrange(hflights_df, desc(ArrDelay))
# this function is the same as plyr::arrange() and similar to order()

Select: select()
Selects a subset of columns by name:
select(hflights_df, Year, Month, DayOfWeek)
# column names can also be joined with :, as if the names were numeric positions
select(hflights_df, Year:DayOfWeek)
# use - to exclude columns:
select(hflights_df, -DayOfWeek)

Mutate: mutate()
Computes on existing columns and adds the results as new columns:
mutate(hflights_df,   gain = ArrDelay - DepDelay,   speed = Distance / AirTime * 60)

Summarise: summarise()
Applies other functions to a data frame to summarise it, returning a single value per summary:
summarise(hflights_df, delay = mean(DepDelay, na.rm = TRUE))

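Grouped operations: group_by()
The five verbs above are already handy, but they become truly powerful when combined with grouping. Once grouping information has been added to a data set with group_by(), mutate(), arrange(), and summarise() automatically operate by group on these tbl objects (a benefit of R's generic functions).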
mtfs_df %>%
     group_by(chr) %>%
     summarize(max_recom = max(recom), mean_recom = mean(recom), num=n())

Source: local data frame [23 x 4]
     chr max_recom mean_recom  num
1   chr1   41.5648   2.217759 2095
2  chr10   42.4129   2.162635 1029
3  chr11   36.1703   2.774918  560
[...]

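Functions used inside summarise()

n(): counts the rows.
n_distinct(x): counts the unique values in x (originally count_distinct(x)).
first(x), last(x), and nth(x, n): return the value at the given position, like the built-in x[1], x[length(x)], and x[n]
The pipe operator %>%
Start with the data object, then apply a sequence of operations to it one after another.
Joins
Take the intersection, union, and so on of two data sets.

inner_join(x,y)  intersection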
semi_join(x,y) 
left_join(x,y)
anti_join(x,y)
inner_join(y,x)
semi_join(y,x)
left_join(y,x)
anti_join(y,x)
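full_join(x,y)  union
[Cheatsheet for dplyr join functions]
Further study
The 60-page detailed manual that ships with the dplyr package.
The other vignettes (web pages), or vignette(package = "dplyr"), cover databases, mixed-language programming, performance comparisons, the new window functions, and more.
A quick look at vignette("window-functions", package = "dplyr"): it provides a family of functions that extend aggregate functions returning a single value (such as sum() and mean()) into ones returning a result of the same length as the input, namely cumsum() and cummean(), plus conveniences such as n(), lead(), and lag().
Documentation for the plyr package: home page
The data.table package is also very powerful and worth learning when time allows.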




Data reshaping in R with tidyr
2016-03-09T13:07:46.000Z

Happy families are all alike; every unhappy family is unhappy in its own way — Leo Tolstoy


tidyr, one of R's data reshaping packages, was recently updated (tidyr 0.4.0), so it is worth studying its approach to tidy data.
A brief personal summary follows:
#gather--mutate--separate--select--arrange
setwd("F:/Rwork/tidyr")
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
preg <-read.csv("preg.csv",stringsAsFactors = FALSE)
preg
##           name treatmenta treatmentb
## 1   John Smith         NA         18
## 2     Jane Doe          4          1
## 3 Mary Johnson          6          7
preg6 <-tbl_df(read.csv("preg.csv",stringsAsFactors = FALSE))
preg6
## Source: local data frame [3 x 3]
## 
##           name treatmenta treatmentb
##          (chr)      (int)      (int)
## 1   John Smith         NA         18
## 2     Jane Doe          4          1
## 3 Mary Johnson          6          7
preg2<-preg %>% 
  gather(treatment,n,treatmenta:treatmentb) %>%  
#The first argument, is the name of the key column, which is the name of the variable defined by the values of the column headings. 
#The second argument is the name of the value column.
#The third argument defines the columns to gather, here treatmenta through treatmentb, i.e. every column except name.
 #gather(treatment,n,-name,na.rm = TRUE)    #the same above # na.rm to drop any missing values from the gather columns
  mutate(treatment=gsub("treatment","",treatment)) %>%
  arrange(name,treatment)   #arrange=sort
preg2
##           name treatment  n
## 1     Jane Doe         a  4
## 2     Jane Doe         b  1
## 3   John Smith         a NA
## 4   John Smith         b 18
## 5 Mary Johnson         a  6
## 6 Mary Johnson         b  7
#each step displayed one by one
preg3<-preg %>% 
  gather(treatment,n,treatmenta:treatmentb)
preg3
##           name  treatment  n
## 1   John Smith treatmenta NA
## 2     Jane Doe treatmenta  4
## 3 Mary Johnson treatmenta  6
## 4   John Smith treatmentb 18
## 5     Jane Doe treatmentb  1
## 6 Mary Johnson treatmentb  7
preg33<-preg3 %>% separate(treatment, c("Treatments", "group"),9 )  #separate
preg33
##           name Treatments group  n
## 1   John Smith  treatment     a NA
## 2     Jane Doe  treatment     a  4
## 3 Mary Johnson  treatment     a  6
## 4   John Smith  treatment     b 18
## 5     Jane Doe  treatment     b  1
## 6 Mary Johnson  treatment     b  7
preg333<-preg33 %>% select(name,group,n)  # select
preg333
##           name group  n
## 1   John Smith     a NA
## 2     Jane Doe     a  4
## 3 Mary Johnson     a  6
## 4   John Smith     b 18
## 5     Jane Doe     b  1
## 6 Mary Johnson     b  7
preg3333<-preg333 %>% spread(group,n)  #spread: the values of one key column become separate columns
preg3333
##           name  a  b
## 1     Jane Doe  4  1
## 2   John Smith NA 18
## 3 Mary Johnson  6  7
preg4<-preg3 %>% mutate(treatment=gsub("treatment","",treatment))
preg4
##           name treatment  n
## 1   John Smith         a NA
## 2     Jane Doe         a  4
## 3 Mary Johnson         a  6
## 4   John Smith         b 18
## 5     Jane Doe         b  1
## 6 Mary Johnson         b  7
preg5<-preg4 %>% arrange(name,treatment)
preg5
##           name treatment  n
## 1     Jane Doe         a  4
## 2     Jane Doe         b  1
## 3   John Smith         a NA
## 4   John Smith         b 18
## 5 Mary Johnson         a  6
## 6 Mary Johnson         b  7
#read all csv files located in the same directory into a single data frame
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
paths <- dir("F:/Rwork/tidyr", pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)
all<-ldply(paths, read.csv, stringsAsFactors = FALSE)
all
##               .id         name treatmenta treatmentb
## 1 preg - Copy.csv   John Smith         NA         18
## 2 preg - Copy.csv     Jane Doe          4          1
## 3 preg - Copy.csv Mary Johnson          6          7
## 4        preg.csv   John Smith         NA         18
## 5        preg.csv     Jane Doe          4          1
## 6        preg.csv Mary Johnson          6          7
#get the rows of the preg3 data frame whose name column is John Smith or Jane Doe
subset(preg3, name %in% c("John Smith", "Jane Doe"))
##         name  treatment  n
## 1 John Smith treatmenta NA
## 2   Jane Doe treatmenta  4
## 4 John Smith treatmentb 18
## 5   Jane Doe treatmentb  1



Economist_Graph: modifying a complex figure
2016-03-09T09:23:10.000Z

Original figure: http://www.economist.com/node/21541178
Basic plot
library(ggplot2)
dat <- read.csv("EconomistData.csv")
pc1 <- ggplot(dat, aes(x = CPI, y = HDI, color = Region))+
       geom_point()


Trend line
pc2 <- pc1 +
   geom_smooth(aes(group = 1),
               method = "lm",
               formula = y ~ log(x),
               se = FALSE,
               color = "red")


Open points
pc3 <- pc2 +
  geom_point(shape = 1, size = 4)


Labelling points
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
                   "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
                   "India", "Italy", "China", "South Africa", "Spain",
                   "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
                   "United States", "Germany", "Britain", "Barbados", "Norway", "Japan",
                   "New Zealand", "Singapore")
library("ggrepel")
pc4 <- pc3 +  geom_text_repel(aes(label = Country),
            color = "gray20",
            data = subset(dat, Country %in% pointsToLabel),
            force = 10)


Label only the points of interest
Change the region labels and order
dat$Region <- factor(dat$Region,
                     levels = c("EU W. Europe",
                                "Americas",
                                "Asia Pacific",
                                "East EU Cemt Asia",
                                "MENA",
                                "SSA"),
                     labels = c("OECD",
                                "Americas",
                                "Asia &\nOceania",
                                "Central &\nEastern Europe",
                                "Middle East &\nnorth Africa",
                                "Sub-Saharan\nAfrica"))
pc4$data <- dat
pc4


Change the legend labels and their order
Add title and format axes
library(grid)
pc5 <- pc4 +
  scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10=least corrupt)",
                     limits = c(.9, 10.5),
                     breaks = 1:10) +
  scale_y_continuous(name = "Human Development Index, 2011 (1=Best)",
                     limits = c(0.2, 1.0),
                     breaks = seq(0.2, 1.0, by = 0.1)) +
  scale_color_manual(name = "",
                     values = c("#24576D",
                                "#099DD7",
                                "#28AADC",
                                "#248E84",
                                "#F2583F",
                                "#96503F")) +
  ggtitle("Corruption and Human development")


Use scales to change the x and y axes and the colors, and add a title
Theme tweaks
library(grid) # for the 'unit' function
pc6 <- pc5 +
  theme_minimal() + # start with a minimal theme and add what we need
  theme(text = element_text(color = "gray20"),
        legend.position = c("top"), # position the legend in the upper left 
        legend.direction = "horizontal",
        legend.justification = 0.1, # anchor point for legend.position.
        legend.text = element_text(size = 11, color = "gray10"),
        axis.text = element_text(face = "italic"),
        axis.title.x = element_text(vjust = -1), # move title away from axis
        axis.title.y = element_text(vjust = 2), # move away for axis
        axis.ticks.y = element_blank(), # element_blank() is how we remove elements
        axis.line = element_line(color = "gray40", size = 0.5),
        axis.line.y = element_blank(),
        panel.grid.major = element_line(color = "gray50", size = 0.5),
        panel.grid.major.x = element_blank()
        )


Fine-tune the theme
Add model R^2 and source note
mR2 <- summary(lm(HDI ~ log(CPI), data = dat))$r.squared
library(grid)
png(file = "images/econScatter10.png", width = 800, height = 600)
pc6 
grid.text("Sources: Transparency International; UN Human Development Report",
         x = .02, y = .03,
         just = "left",
         draw = TRUE)
grid.segments(x0 = 0.81, x1 = 0.825,
              y0 = 0.90, y1 = 0.90,
              gp = gpar(col = "red"),
              draw = TRUE)
grid.text(paste0("R² = ",
                 as.integer(mR2*100),
                 "%"),
          x = 0.835, y = 0.90,
          gp = gpar(col = "gray20"),
          draw = TRUE,
          just = "left")

dev.off()


Source:
http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html



My ggplot2 learning tips (revisiting ggplot2: overlooked details)
2016-03-09T05:44:37.000Z

Original figure: http://www.economist.com/node/21541178
The syntax structure of a ggplot2 plot:
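ggplot(data = <default data set>, 
       aes(x = <default x axis variable>,
           y = <default y axis variable>,
           ... <other default aesthetic mappings>),
       ... <other plot defaults>) +

       geom_<geom type>(aes(size = <size variable for this geom>, 
                      ... <other aesthetic mappings>),
                  data = <data for this point geom>,
                  stat = <statistic string or function>,
                  position = <position string or function>,
                  color = <"fixed color specification">,
                  <other arguments, possibly passed to the stat function>) +

  scale_<aesthetic>_<type>(name = <"scale label">,
                     breaks = <where to put tick marks>,
                     labels = <labels for tick marks>,
                     ... <other options for the scale>) +

  theme(plot.background = element_rect(fill = "gray"),
        ... <other theme elements>)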

Geometric Objects(geom_XX) and Aesthetics (aes())
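Type geom_ to list all geometric objects.

Statistical Transformations
A statistical transformation (stat_) computes something such as a mean or a variance. We need one whenever we want to display a statistical summary of a variable.
Every geom_XX has a default stat, which can be checked with args(geom_XX) (or args(stat_bin)). For example, the default stat of geom_bar is stat_count, which counts rows. What does that mean? An example: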
> library(ggplot2)
> df <- data.frame(trt = c("a","a", "b", "c"), outcome = c(2.3,5, 1.9, 3.2))
> df
  trt outcome
1   a     2.3
2   a     5.0
3   b     1.9
4   c     3.2
> ##stat()
> ggplot(df, aes(trt, outcome)) +
  geom_bar()
Error: stat_count() must not be used with a y aesthetic.
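#The error is expected: the first aes() maps y to outcome, so each trt value on the x axis is paired with an outcome value on the y axis, while stat_count in geom_bar()
#counts how many times each trt value occurs. The y mapping to outcome conflicts with the count that should be shown, hence the error.
##Fix: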
> ggplot(df, aes(trt)) +
  geom_bar()


To show the summed outcome for each trt value instead, set stat = "identity".
> ggplot(df, aes(trt, outcome)) +
+   geom_bar(stat = "identity")


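ggplot2 includes many statistical transformations:

For more, see: https://www.zhihu.com/question/24779017
Scales
Controlling aes() mappings
An aesthetic (aes()) mapping only says that a variable should be mapped to an aesthetic; it does not say how. For example, aes(shape = x) maps a variable to shape without saying which shapes to use, and aes(color = z) does not say which colors to use. When we leave these undefined, ggplot2 falls back on defaults, and we can change them with scales.
Scales in ggplot2 include: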
position

color and fill

size

shape

line type

x,y

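### How to modify
scale_<aesthetic>_<type>(name,limits,breaks,labels)
Type scale_ to list all the available scale functions.


Some scale functions take extra arguments; for example scale_color_continuous, which modifies colors, has low and high arguments.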
scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
                         breaks = c(19751, 19941, 20131),
                         labels = c(1971, 1994, 2013),
                         low = "blue", high = "red")



## Faceting
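### Faceting: several plots on one page
facet_wrap(): splits the data by a single criterion, e.g. facet_wrap(~State, ncol = 10) groups by State and draws all the small plots in order, 10 per row.
facet_grid(): splits the data by several criteria, e.g. facet_grid(color~cut, margins=TRUE). The variable before the tilde defines the rows and the one after it the columns. margins = TRUE adds extra panels that pool all the data across each faceting variable.
## Themes
More ggplot2 theme demos: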
http://docs.ggplot2.org/dev/vignettes/themes.html
https://github.com/jrnold/ggthemes
### Changing theme defaults
theme_minimal() +
  theme(text = element_text(color = "turquoise"))


### Custom themes
library(scales) # for muted()
theme_new <- theme_bw() +
  theme(plot.background = element_rect(size = 1, color = "blue", fill = "black"),
        text=element_text(size = 12, family = "Serif", color = "ivory"),
        axis.text.y = element_text(colour = "purple"),
        axis.text.x = element_text(colour = "red"),
        panel.background = element_rect(fill = "pink"),
        strip.background = element_rect(fill = muted("orange")))

p5 + theme_new


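## About the confusing aes()
Any argument that depends on the order of the data vectors and has to be specified element by element must go inside aes().

Still not sure whether something belongs inside or outside aes()? Remember this: to set one value for the whole layer, put it outside aes(); to vary it by group, given a grouping variable of the same length as x and y, put it inside aes().

## Other notes
Annotations: annotate() is the dedicated function for adding annotations; geom_text() only does it as a side job.

The best thing about theme() is that it separates data-related aesthetic adjustments from data-independent ones: data processing stays apart from data aesthetics, and data aesthetics stay apart from adjustments that do not depend on the data.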





Noncoding RNAs (ncRNAs)
2016-01-29T02:29:16.000Z
Noncoding RNAs (ncRNAs)
Most ncRNAs operate as RNA-protein complexes, including ribosomes, snRNPs, snoRNPs, telomerase, microRNAs, and long ncRNAs.


Non-coding RNA database
http://research.imb.uq.edu.au/rnadb/



Retroelements
2016-01-03T12:18:31.000Z
Introduction
Retroelements are mobile genetic elements (MGEs) that retrotranspose via a RNA intermediate that is reverse-transcribed to DNA by the encoded reverse transcriptase and integrated into a new location within the host genome by an integrase enzyme. They have been found among different organisms from bacteria to humans and often constitute a significant part of genomes, particularly in higher plants and fungi. Various retroelements with different gene organizations and replicative mechanisms have evolved in the course of evolution. Although in most cases they have no effect on the host organism, there are many examples of mutations caused by retroelements resulting in various diseases. However co-adaptation has led in some cases to the use of retroelements for essential and beneficial host functions.
Retroelements (retrotransposons and retroviruses) can currently be divided into four systems or groups commonly known as the long terminal repeat (LTR) retroelements, the non-LTR retroelements, the tyrosine recombinase (YR) retroelements and the Penelope retrotransposons (Eickbush and Jamburuthugoda 2008).
LTR retroelements
These include the broad range of LTR retrotransposons and retroviruses circulating in plants, fungi and animals. A full-length consensus LTR retroelement genome is characterized by an internal translating region (gag and pol genes, and env gene when present) flanked by long terminal repeats (LTRs). They can be classified into four major groups or families based on sequence similarity and other features, known as the Ty1/Copia, the Ty3/Gypsy, the Bel/Pao, and the Retroviridae families.

Supplementary notes


Gag is a polyprotein and is an acronym for Group Antigens (ag).
Pol is the reverse transcriptase.
Env is the envelope protein.
The group antigens form the viral core structure, RNA genome binding proteins, and are the major proteins comprising the nucleoprotein core particle. Reverse transcriptase is the essential enzyme that carries out the reverse transcription process that takes the RNA genome to a double-stranded DNA preintegration form. The reverse transcriptase gene also encodes an integrase activity and an RNase H activity that functions during genome reverse transcription.
Non-LTR retroelements
These constitute a system of retrotransposons widely distributed in eukaryotes, which do not present LTRs or terminal repeats; non-LTR retrotransposons end most frequently with a poly(A) tail at their 3′ end, while their 5′ end often contains variable deletions (5′ truncations). Depending on their capability to transpose or not autonomously these elements are classified as autonomous and non autonomous retroelements, respectively.


Autonomous non-LTR retroelements. On the basis of their molecular structures the autonomous non-LTR retroelements have been grouped in two major classes:
R2 elements : these constitute one of the most studied families of non-LTR retroelements (Eickbush 2002). They encode for a single ORF with a central RT domain and an endonuclease (EN) conserved domain at the C-terminus.
Long INterspersed repetitive Elements (LINEs) : these are a family of 6-8 kb long elements encoding for two ORFs. The first ORF shows similarity to retroviral gags. The second ORF encodes for a pol polyprotein displaying RT and apurinic-apyrimidinic endonuclease (APE) domains. However, some lineages additionally include a protein domain of unknown function at the C-terminus and in other cases an RNase H (RH) downstream of the RT domain as usual in LTR retrotransposons (Eickbush and Jamburuthugoda 2008). LINEs are widespread in mammals (Moran and Gilbert 2002).


Non autonomous non-LTR retroelements. These are DNA sequences of 80 to 630 bp known as Short INterspersed repetitive Elements (SINEs). They represent reverse-transcribed RNA molecules originally transcribed by RNA polymerase III into tRNA, rRNA, and other small nuclear RNAs (Malik and Eickbush 1998). Like the LINEs, the SINEs end by a poly(A) tail or by A- or T-rich sequences, their 5’ and 3’ ends reveal similarities to tRNA genes (or, as shown for some animal SINEs, to 7SL RNA gene) and to the 3’ end of LINEs, respectively (Oshima et al. 1996). Two conserved sequence motifs found in the SINE tRNA-like part, called box A and box B, show homology to RNA polymerase III promoters. SINEs do not encode their own reverse transcriptase and are therefore unable to transpose autonomously. For this reason it has been proposed that SINEs use the enzymatic machinery of LINEs for their retrotransposition (Luan et al. 1993; Wallace et al. 2008; Kroutter et al. 2009). LINEs and SINEs elements have been not only described in the genomes of animals but also in many species across the Plantae Kingdom (Schmidt 1999).


Tyrosine recombinase (YR) retroelements
YR retroelements have been found in plants, protists, fungi, as well as a variety of animals including vertebrates, echinoderms and nematodes. The “gag-RT-RNaseH” genome organization of YR retroelements is similar to that of LTR retroelements but differ in the fact that YR retroelements lack the PR and usually show a tyrosine recombinase (instead of INT), which is typically involved in site-specific recombinations between similar or identical DNA sequences. YR retroelements can be divided into three families DIRS, Ngaro and VIPER  (Goodwin et al. 2004; Goodwin and Poulter 2004; Vazquez et al. 2000; Lorenzi et al. 2006). The typical structure of DIRS elements contains inverted terminal repeats (ITRs), internal ORFs and an internal complementary region (ICR) derived from the duplication of flanking ITR sequences. DIRS elements differ from the two other YR-like families in the presence of a conserved methyltransferase (MT) domain at C-terminal to RT/RH that is similar to those MTs encoded by various bacteriophages (Goodwin and Poulter 2004). Ngaro elements show an ORF organization similar to that of DIRS elements but differ in the orientation of flanking repeats. While DIRS elements are delimited by inverted repeats, Ngaro elements are flanked by direct repeats (referred as A1, A2, B1 and B2). Ngaro elements have been found in different organisms as the zebrafish Danio rerio (DrNgaro1), as well as in fungi and in echinoderms (Goodwin and Poulter 2004). In turn, VIPER (Vestigial interposed retroelement) elements have been described in the genomes of trypanosome protozoan parasites. As an example, the figure below also shows the genomic organization of tcVIPER, an element described in the genome of T. cruzi. This element contains a coding internal region flanked by a lineage of SINEs (called SIRE) also found in T. cruzi genome (Lorenzi et al. 2006; Vazquez et al. 2000).

Penelope retrotransposons (PLEs)
This is a family of retrotransposons described in many animal genomes (more than 80 species belonging to at least 10 animal phyla), protists, fungi, and plants (Arkhipova 2006). Their genome structure contains apparent LTRs that may be in either direct or inverted orientations flanking a coding region with RT and EN domains (i.e. a pol polyprotein domain). Phylogenetic reconstruction analysis indicate that PLEs-like RTs are closer to telomerase RTs (TERTs) than to any other characterized RTs (Arkhipova et al. 2003). In turn, PLE-like ENs are related with the intron-encoded endonucleases and the bacterial repair endonuclease UvrC, both belonging to the Uri family of ENs (Pyatkov et al. 2004). Studies realized with the fly Drosophila virilis and bdelloid rotifer organisms revealed that the majority of PLEs in these species contain spliceosomal introns (Arkhipova et al. 2003). The peculiar structural organization of PLE elements, their ability to retain introns during transposition, and the distinct placement in the phylogeny of retroelements, suggest that PLEs constitute an ancient class of retroelements (Arkhipova 2006; Schostak et al. 2008).


Read more: http://gydb.org/index.php/Retroelements



A collection of classic sentences for SCI English writing
2015-12-28T14:39:04.000Z
It is time to ‘upgrade’ cancer epigenetics research and put together an ambitious plan to tackle the many unanswered questions in this field using epigenomics approaches.

Histones are no longer considered to be simple ‘DNA-packaging’ proteins; they are recognized as being dynamic regulators of gene activity that undergo many post-translational chemical modifications, including acetylation, methylation, phosphorylation, ubiquitylation and sumoylation.

In addition to their influence on gene expression, emerging evidence indicates that specific histone modifications interface with other nuclear processes.

Histone modifications, together with DNA methylation, also have a vital role in organizing nuclear architecture, which, in turn, is involved in regulating transcription and other nuclear processes.

However, what distinguishes metabolomics from clinical chemistry is the fact that in metabolomics one is not attempting to characterize a few compounds at a time, but literally dozens or even hundreds of compounds at a time.

Blood is a special biofluid, as it potentially reflects all processes going on in all organs. This can be both a blessing and a curse, as metabolite perturbations in the blood, while easily detectable, cannot be easily traced to a specific organ or a specific cause.

On a cellular level, organisms face two main challenges: to maintain genome integrity in the face of mutagens and mobile genetic elements, and to express a specific repertoire of genes at the proper level and with the appropriate timing.

In sharp contrast to the low within-species genetic variation, differences between species-specific haplotypes were high.

Given the staggering crop losses that result from insect herbivory and the environmental problems associated with insecticide use, genome-enabled research on the natural mechanisms that plants use to defend themselves against insects will undoubtedly have not only ecological, but also agricultural relevance.

















A summary of classic awk examples
2015-12-27T05:09:47.000Z
Delete a specific line
[zpxu@node102 ~]$ cat fkjsaf 
        GO_ids
Gh_A01G0005     GO:0016021
Gh_A01G0006     GO:0006629
Gh_A01G0007
Gh_A01G0008
Gh_A01G0009
Gh_A01G0010     GO:0008121,GO:0006122
Gh_A01G0011
Gh_A01G0012
Gh_A01G0013     GO:0003677,GO:0006355
Gh_A01G0014
Gh_A01G0015     GO:0004713,GO:0005524,GO:0004674,GO:0004672,GO:0006468
Gh_A01G0016     GO:0006886,GO:0005643,GO:0008536,GO:0005515,GO:0008565
Gh_A01G0017     GO:0003676
Gh_A01G0018
Gh_A01G0019     GO:0016020,GO:0006810,GO:0005215
[zpxu@node102 ~]$ awk '{if(NR==1){next} print $0}' fkjsaf 
Gh_A01G0005     GO:0016021
Gh_A01G0006     GO:0006629
Gh_A01G0007
Gh_A01G0008
Gh_A01G0009
Gh_A01G0010     GO:0008121,GO:0006122
Gh_A01G0011
Gh_A01G0012
Gh_A01G0013     GO:0003677,GO:0006355
Gh_A01G0014
Gh_A01G0015     GO:0004713,GO:0005524,GO:0004674,GO:0004672,GO:0006468
Gh_A01G0016     GO:0006886,GO:0005643,GO:0008536,GO:0005515,GO:0008565
Gh_A01G0017     GO:0003676
Gh_A01G0018
Gh_A01G0019     GO:0016020,GO:0006810,GO:0005215
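
The header-skipping command above can be written more compactly. A minimal sketch, assuming a small made-up sample in place of the fkjsaf file (the variable names are only for illustration):

```shell
# Hypothetical three-line sample standing in for the file shown above.
sample=$(printf 'GO_ids\nGh_A01G0005 GO:0016021\nGh_A01G0007')

# A pattern with no action prints every matching line, so NR > 1 skips line 1.
no_header=$(printf '%s\n' "$sample" | awk 'NR > 1')

# The same result with sed: delete line 1 and print the rest.
no_header_sed=$(printf '%s\n' "$sample" | sed '1d')

printf '%s\n' "$no_header"
```

Both forms stream the input, so either should work on large files.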

Delete lines with fewer than N columns
[zpxu@node102 ~]$ awk '{if(NF==1){next} print $0}' fkjsaf 
Gh_A01G0005     GO:0016021
Gh_A01G0006     GO:0006629
Gh_A01G0010     GO:0008121,GO:0006122
Gh_A01G0013     GO:0003677,GO:0006355
Gh_A01G0015     GO:0004713,GO:0005524,GO:0004674,GO:0004672,GO:0006468
Gh_A01G0016     GO:0006886,GO:0005643,GO:0008536,GO:0005515,GO:0008565
Gh_A01G0017     GO:0003676
Gh_A01G0019     GO:0016020,GO:0006810,GO:0005215
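
The command above hard-codes the one-column case. A sketch of the general "fewer than N columns" filter, assuming the threshold is passed in with -v (min_cols and n are illustrative names, and the input rows are made up):

```shell
# Keep only lines that have at least min_cols whitespace-separated fields.
min_cols=2
kept=$(printf 'g1 GO:1\ng2\ng3 GO:2 extra\n' | awk -v n="$min_cols" 'NF >= n')

printf '%s\n' "$kept"
```

With min_cols=2 this behaves like the NF==1 version for files that have no empty lines.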

Delete blank lines
[zpxu@node102 ~]$ cat text
111
222

222
333
[zpxu@node102 ~]$ awk NF text
111
222
222
333
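
One detail worth knowing about awk NF: it also drops lines that contain only whitespace, because NF is 0 for them. A small sketch contrasting it with a test on the line length (the sample input is made up):

```shell
# Second line holds a single space, third line is empty.
input=$(printf 'a\n \n\nb')

# NF is 0 for empty and whitespace-only lines, so both are dropped.
by_nf=$(printf '%s\n' "$input" | awk 'NF')

# length($0) is 0 only for truly empty lines, so the space-only line survives.
by_len=$(printf '%s\n' "$input" | awk 'length($0) > 0')

printf '%s\n---\n%s\n' "$by_nf" "$by_len"
```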

Omit the last two columns
cat file 
a b c d e f
1 2 3 4
awk 'NF-=2' file
a b c d
1 2

Omit the first two columns
awk '{for(i=3;i<=NF;i++) printf "%s%s", $i, (i<NF ? OFS : ORS)}' file
c d e f
3 4

Merge and print the rows common to file 1 and file 2
cat a.txt    //a.txt  
111   aaa  
222   bbb  
333   cccc  
444   ddd  
cat b.txt    //b.txt  
111  123  456  
2    abc  cbd  
444  rts  786  
# Desired output:
111,aaa,123,456
444,ddd,rts,786
# Method 1
awk 'NR==FNR{a[$1]=$2;}NR!=FNR && a[$1]{print $1","a[$1]","$2","$3}' a.txt b.txt  
111,aaa,123,456  
444,ddd,rts,786

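Explanation
When NR and FNR are equal, awk is processing the first file. a[$1]=$2 builds an array indexed by the first field with the second field as the value. When NR!=FNR, awk is processing the second file. Note that $1 is no longer the same thing here: the earlier $1 is the first field of a.txt, while this $1 is the first field of b.txt. a[$1] is the value stored under the first field of b.txt; if a[$1] has a value, the key also exists in a.txt, so the data is printed.
# Method 2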
awk -v OFS="," 'NR==FNR{a[$1]=$2;} NR!=FNR && $1 in a { print $1,a[$1],$2,$3}' a.txt b.txt  
111,aaa,123,456  
444,ddd,rts,786

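Explanation
-v OFS="," sets the output field separator. $1 in a tests whether the first-column value from b.txt is one of the keys of array a. Programmers will find this familiar, since most languages have a similar construct or function.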
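
The two methods are not quite equivalent: a[$1] tests the stored value, while $1 in a tests whether the key exists. A sketch of the difference, assuming made-up files in a temporary directory where key 444 has no second column (the next statements are an addition of this sketch, not part of the commands above):

```shell
tmp=$(mktemp -d)
printf '111 aaa\n444\n' > "$tmp/a.txt"          # key 444 stores an empty value
printf '111 123 456\n444 rts 786\n' > "$tmp/b.txt"

# Method 1: the empty value is false, so key 444 is skipped.
m1=$(awk 'NR==FNR{a[$1]=$2; next} a[$1]{print $1","a[$1]","$2","$3}' "$tmp/a.txt" "$tmp/b.txt")

# Method 2: the key exists, so the row is joined even though the value is empty.
m2=$(awk -v OFS="," 'NR==FNR{a[$1]=$2; next} $1 in a{print $1,a[$1],$2,$3}' "$tmp/a.txt" "$tmp/b.txt")

rm -rf "$tmp"
printf '%s\n---\n%s\n' "$m1" "$m2"
```

If keys with empty values should still count as present, method 2 is probably the safer choice.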



Special substitutions in sed
2015-12-26T14:39:28.000Z
Replace the Nth match
cat text1
2
2
3
22
2
sed ':a;N;$!ba;s/2/--/5' text1   # replace the 5th match
2
2
3
22
--
sed ':a;N;$!ba;s/\(.*\)2/\1--/' text1  # replace the last match
cat text2
2 52 8 2
sed 's/2/--/4' text2
2 52 8 --
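
Why the :a;N;$!ba loop is needed: the numeric flag of s counts matches within the pattern space, which is one line by default. A sketch comparing both cases on the text1 data from above, with the loop written as separate -e expressions (a portability choice of this sketch):

```shell
input=$(printf '2\n2\n3\n22\n2')

# Per line: only a line with at least two matches changes, so "22" becomes "2--".
per_line=$(printf '%s\n' "$input" | sed 's/2/--/2')

# Whole file: append every line to the pattern space first, then count matches.
# The fifth "2" overall is the one on the last line.
whole=$(printf '%s\n' "$input" | sed -e ':a' -e 'N' -e '$!ba' -e 's/2/--/5')

printf '%s\n---\n%s\n' "$per_line" "$whole"
```

Reading the whole file into the pattern space is fine for small inputs but may be costly for very large ones.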




Awk system variables and built-in functions
2015-12-26T07:41:14.000Z
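awk has two kinds of system variables. The first kind has default values that can be changed, such as the field and record separators. The second kind holds values that can be used in reports or data processing, such as the number of fields or the number of records.
Built-in variable table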


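Examples
Processing multiline records
$ cat text.txt
John Robinson
Koren Inc.
978 Commonwealth Ave.
Boston
MA 01760
696-0987

Note: six fields; records are separated by blank lines.
To process records that span several lines like this, define the field separator as a newline and the record separator as the empty string, which stands for a blank line.
BEGIN { FS = "\n"; RS = "" }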

So the following script prints the first and the last field:
$ awk 'BEGIN{ FS = "\n"; RS = "" } {print $1, $NF}' text.txt
John Robinson 696-0987
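
The single-record file above does not show the record separator at work. A sketch with two made-up records separated by a blank line, printing the first field, the last field and the field count of each:

```shell
# Two hypothetical address records; the empty line between them ends record 1.
records=$(printf 'John Robinson\nKoren Inc.\nBoston\n\nPhyllis Chapman\nGVE Corp.\nHartford\n' |
  awk 'BEGIN { FS = "\n"; RS = "" } { print NR ": " $1 ", " $NF " (" NF " fields)" }')

printf '%s\n' "$records"
```

NR counts records here, not input lines, which is what makes per-address processing straightforward.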

Setting the output number format (using OFMT)
$ awk 'BEGIN{OFMT="%.3f";print 2/3,123.11111111;}' /etc/passwd   
0.667 123.111
# The default OFMT is %.6g (six significant digits); changing OFMT changes the default format used when numbers are printed
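
A small sketch of two things that are easy to miss with OFMT: integer values are printed as integers regardless of OFMT, and printf gives per-value control without touching the global setting.

```shell
# OFMT applies to non-integer numbers in print; 42 is printed as an integer.
with_ofmt=$(awk 'BEGIN { OFMT = "%.2f"; print 3.14159, 42, 2/3 }')

# printf formats each value explicitly and leaves OFMT alone.
with_printf=$(awk 'BEGIN { printf "%.3f %d\n", 2/3, 42 }')

printf '%s\n%s\n' "$with_ofmt" "$with_printf"
```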

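Splitting fields by fixed width (using FIELDWIDTHS)
$ echo 20100117054932 | awk 'BEGIN{FIELDWIDTHS="4 2 2 2 2 2"}{print $1"-"$2"-"$3,$4":"$5":"$6}'
2010-01-17 05:49:32
# FIELDWIDTHS is a space-separated list of numbers that splits the record into fields by width: FIELDWIDTHS="4 2 2 2 2 2" makes $1 4 characters wide, $2 2 wide, $3 2 wide, and so on. The FS separator is ignored in this mode. FIELDWIDTHS is a gawk extension.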
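
For awk implementations without FIELDWIDTHS, the same result can be sketched with substr(), which is a standard awk function. This is an alternative technique, not what the command above does:

```shell
stamp=20100117054932

# substr(s, start, length) uses 1-based positions.
formatted=$(printf '%s\n' "$stamp" | awk '{
  print substr($0, 1, 4) "-" substr($0, 5, 2) "-" substr($0, 7, 2),
        substr($0, 9, 2) ":" substr($0, 11, 2) ":" substr($0, 13, 2)
}')

printf '%s\n' "$formatted"
```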

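Built-in functions
awk's built-in functions fall into four main categories: arithmetic functions, string functions, other general functions, and time functions.
String functions

split
The built-in split function breaks a string into pieces and stores them in an array. You can define the field separator yourself or use the current value of FS.
Syntax:
   split (string, array, field separator)
   split (string, array)  -> if the third argument is omitted, awk uses the current value of FS.
split takes three arguments: the string to split, the array that receives the pieces, and the separator.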
$ awk 'BEGIN{info = "this is a test";slen=split(info,ta," ");for (i=1;i<=slen;i++) {print i,ta[i];}}'
1 this
2 is
3 a
4 test
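
split becomes more useful with a separator other than a space. A sketch, assuming an input row in the gene / comma-separated GO id layout used in the awk examples earlier (the row itself is just sample data), that prints one GO id per line:

```shell
# split() returns the number of pieces; ids[1..n] hold the pieces.
go_list=$(printf 'Gh_A01G0010 GO:0008121,GO:0006122\n' |
  awk '{ n = split($2, ids, ","); for (i = 1; i <= n; i++) print $1, ids[i] }')

printf '%s\n' "$go_list"
```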

Reference:
http://www.cnblogs.com/chengmo/archive/2010/10/06/1844818.html



Advanced sed: n, N, d, D, p, P, b, T, t, h, H, g, G, x, y
2015-12-25T13:24:52.000Z

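The advanced commands fall into three groups:

Working with a multiline pattern space (N, D, P).
Using the hold space to save the contents of the pattern space and make them available to later commands (H, h, G, g, x).
Writing scripts that use branching and conditional instructions to change the flow of control (:, b, T/t).
Multiline pattern space
Pattern matching in awk, sed and grep is line-oriented: a pattern is matched against a single input line. A phrase that starts at the end of one line and finishes at the start of the next cannot be matched that way, and some patterns are only meaningful when they span several lines.
sed can look at several lines in the pattern space, which lets a match extend across lines.
The three multiline commands here (N, D, P) correspond to the lowercase basic commands (n, d, p).
Command reference
D/d: d deletes the contents of the pattern space; D deletes only the first line of the pattern space (up to the first newline).
P/p: p prints the whole pattern space immediately, in addition to the default output at the end of the cycle (unless -n is given); P (uppercase) prints only the part of the pattern space up to the first \n.
N/n: Next (N) creates a multiline pattern space by reading a new input line and appending it to the existing contents of the pattern space. The original contents and the new input line are separated by a newline. In a multiline pattern space, "^" matches the first character of the pattern space, not the character after an embedded newline, and "$" matches only the final newline of the pattern space, not any embedded one. next (n) outputs the contents of the pattern space and then reads a new input line.
Examples
n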
cat aaa 
This is 1    
This is 2    
This is 3    
This is 4    
This is 5    
     
sed -n 'n;p' aaa         # -n suppresses the default output
This is 2    
This is 4



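Note: sed reads This is 1 and runs n, so the pattern space becomes This is 2; p prints it. It then reads This is 3 and runs n, so the pattern space becomes This is 4; p prints it. It then reads This is 5 and runs n, but there is no next line, so sed exits without running p.
N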
sed -n '$!N;P' aaa            
This is 1   
This is 3   
This is 5
sed -n 'N;P' aaa 
This is 1    
This is 3

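In the notes, 1 stands for This is 1, 2 for This is 2, and so on.
Note: sed reads 1; $! holds (not the last line), so N runs and the pattern space is 1\n2; P prints 1. It reads 3; $! holds, N gives 3\n4, and P prints 3. It reads 5; $! fails, so N is skipped and P prints 5. Without $!, N fails on the last line and sed exits before P runs, so 5 is not printed.
$!N: excludes the last line ($) from the Next command. http://blog.chinaunix.net/uid-10540984-id-1759548.html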
cat text 
Owner and Operator
Guide
sed '/Operator$/ {N;s/Owner and Operator\nGuide/Installation Guide/ }' text 
Installation Guide

For more detail on sed's n and N, see ww.cbcb.umd.edu/software/PBcR/MHAP/asm/
d
sed 'n;d' aaa           
This is 1   
This is 3   
This is 5

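Note: sed reads 1 and runs n, giving 2; d deletes 2, leaving nothing. Likewise it reads 3, n gives 4, and d deletes 4. When it reads 5, n cannot run, so d is not executed. Because -n was not given, the output is 1\n3\n5.
D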
sed 'N;D' aaa           
This is 5

Note: sed reads 1, N gives 1\n2, and D leaves 2; N gives 2\n3, and D leaves 3; and so on until 5 remains. N then fails and sed exits; because -n was not given, 5 is printed.
The input/output loop

P (uppercase) often appears after N and before D. N-P-D builds an input/output loop that maintains a two-line pattern space but outputs only one line at a time. The purpose of the loop is to output only the first line of the pattern space, then return to the top of the script and apply all the commands to what was the second line.
Case study:
$ cat text.txt
I want to see @f1(what will happen) if we put the 
font change commands @f1(on a set of lines). If I understand
things (correctly), the @f1(third) line causes problems.(No?).
Is this really the case, or is it (maybe) just something else?

Let's test having two on a line @f1(here) and @f1(there) as
well as one that begins on one line and ends @f1(somewhere
on another line). What if @f1(it is here) on the line?
Another @f1(one).
$ # Replace @f1(anything) with \fB anything \fR
$ sed 's/@f1(\(.*\))/\\fB\1\\fR/g'   # match @f1(.*), saving whatever is inside the parentheses with "\(" and "\)"; the replacement recalls the saved part with "\1".
$ sed -f sed.len test
I want to see \fBwhat will happen\fR if we put the
font change commands \fBon a set of lines\fR. If I understand
things (correctly), the \fBthird) line causes problems. (No?\fR.
Is this really the case, or is it (maybe) just something else?

Let's test having two on a line \fBhere) and @f1(there\fR as
well as one that begins on one line and ends @f1(somewhere
on another line). What if \fBit is here\fR on the line?
Another \fBone\fR.
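$ # The substitution goes wrong on the third line and on the first line of the second paragraph: regular expressions are greedy and always make the longest possible match, so ".*" matches everything from "@f1(" to the last closing parenthesis on the line.
$ sed 's/@f1(\([^)]*\))/\\fB\1\\fR/g'  # zero or more occurrences of any character except ")"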
I want to see \fBwhat will happen\fR if we put the
font change commands \fBon a set of lines\fR. If I understand
things (correctly), the \fBthird\fR line causes problems.(No?).
Is this really the case, or is it (maybe) just something else?

Let's test having two on a line \fBhere\fR and \fBthere\fR as
well as one that begins on one line and ends @f1(somewhere
on another line). What if \fBit is here\fR on the line?
Another \fBone\fR.
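$ # The substitution that spans two lines is still not done. This is where the multiline pattern space helps: if "@f1(" matches and no closing parenthesis is found, read another line into the buffer (N) and try to make the same match as in the first case.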
$ cat sednew
s/@f1(\([^)]*\))/\\fB\1\\fR/g
/@f1(.*/ {
          N
		  s/@f1(\(.*\n[^)]*\))/\\fB\1\\fR/g
}
$ # The address /@f1(.*/ restricts the procedure to lines matching /@f1(.*/ and runs the commands inside {} on them.
$ sed -f sednew test
I want to see \fBwhat will happen\fR if we put the
font change commands \fBon a set of lines\fR. If I understand
things (correctly), the \fBthird\fR line causes problems.(No?).
Is this really the case, or is it (maybe) just something else?

Let's test having two on a line \fBhere\fR and \fBthere\fR as
well as one that begins on one line and ends \fBsomewhere
on another line\fR. What if @f1(it is here) on the line?
Another \fBone\fR.
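$ # The second-to-last substitution fails. Why? The pattern /@f1(.*/ finds "@f1(somewhere" and N appends the next line, so the pattern space is "well as one that begins on one line and ends @f1(somewhere\non another line). What if @f1(it is here) on the line?". The second substitution, "s/@f1(\(.*\n[^)]*\))/\\fB\1\\fR/g", turns it into "well as one that begins on one line and ends \fBsomewhere\non another line\fR. What if @f1(it is here) on the line?", and both lines are output together. sed then reads the last line, "Another @f1(one).", and runs the script from the top. The script seems to have "forgotten" @f1(it is here): because sed outputs the whole pattern space by default, that text never gets another pass through the script, so the first substitution never sees it.
$ # If, after completing the two-line match and substitution in the multiline pattern space, we output only the first line (P) and then delete it (D), the remainder "What if @f1(it is here) on the line?" becomes the first line of the pattern space and control returns to the top of the script. The script then checks whether that line contains another "@f1(", so every command gets a chance to run on it and complete the substitution.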
$ cat sednew2
s/@f1(\([^)]*\))/\\fB\1\\fR/g
/@f1(.*/ {
          N
		  s/@f1(\(.*\n[^)]*\))/\\fB\1\\fR/g
		  P
		  D
}
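
As a cross-check on what the N/P/D script is meant to produce, here is a minimal Python 3 sketch of the same conversion. It is not part of the sed workflow; the sample text is a shortened, hypothetical copy of the test file. Because the whole text is held in one string and [^)]* also matches newlines, a macro that wraps across a line break is handled in a single pass.

```python
import re

# Hypothetical sample: one macro wraps across a line break.
text = (
    "well as one that begins on one line and ends @f1(somewhere\n"
    "on another line). What if @f1(it is here) on the line?\n"
    "Another @f1(one).\n"
)

# [^)]* matches any run of characters other than ")", newlines included.
converted = re.sub(r"@f1\(([^)]*)\)", r"\\fB\1\\fR", text)
print(converted)
```

sed works line by line, which is why it needs N, P and D to get the same effect.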

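Pattern space with more than two lines
The Next (N) command reads one more line on top of the line already read, so the pattern space then holds two lines. What if a match has to span three or more lines?
This is where the advanced flow-control commands come in.
The commands that control which part of the script runs, and when, are branch (b) and test (t/T). They transfer control to the line that carries the named label; if no label is given, control passes to the end of the script. Branch transfers unconditionally; test transfers conditionally (t branches only if a substitution has succeeded).
A label is any sequence of at most seven characters. It sits on a line by itself and begins with a colon: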
:mylabel

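Note: no space is allowed between the colon and the label, and any trailing spaces at the end of the line become part of the label. When a label is given to a branch or test command, a space is allowed between the command and the label: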
b mylabel

A match over a pattern space of more than two lines can therefore be handled as follows:
:begin
/@f1(\([^)]*\))/{
                 s//\\fB\1\\fR/g
                 b begin
}
/@f1(.*/{
         N
         s/@f1(\([^)]*\n[^)]*\))/\\fB\1\\fR/g
         t again
         b begin
}
:again
P
D

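Hold space
The pattern space is the buffer that holds the current input line. There is also a set-aside buffer called the hold space, and the contents of the two can be exchanged. The hold space is used for temporary storage; individual commands cannot address it or change its contents directly.
Its most common use is to keep a copy of the current input line while the original in the pattern space is being changed.
y
The y command transliterates characters, one for one.
Upper-case the contents of file aaa (apart from the leading T, its lines use only the letters h, i and s):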
sed 'y/his/HIS/' aaa  
THIS IS 1  
THIS IS 2  
THIS IS 3  
THIS IS 4  
THIS IS 5
# or: echo "axxbxxcxx" | sed 'y/abc/123/'
1xx2xx3xx
# y maps each listed character to its counterpart, so the characters need not be adjacent in the input
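
For comparison only, the same one-to-one character mapping can be sketched in Python 3 with str.maketrans and str.translate. This is an analogy to sed's y command, not a description of how sed implements it.

```python
# Build a one-to-one character map, like sed's y/abc/123/.
table = str.maketrans("abc", "123")
result = "axxbxxcxx".translate(table)
print(result)

# The y/his/HIS/ example above, applied to one line of the file.
upper = "This is 1".translate(str.maketrans("his", "HIS"))
print(upper)
```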

The h, H, g and G commands
h copies the pattern space over the hold space (overwriting it); H appends the pattern space to the hold space.
g copies the hold space over the pattern space (overwriting it); G appends the hold space to the pattern space.
cat ddd   
This is a and a is 1   
This is b and b is 2   
This is c and c is 3   
This is d and d is 4   
This is e and e is 5  
# Swap the letters and digits in ddd, and upper-case the letters
cat ddd.sed
h  
{  
s/.*is \(.*\) and .*/\1/  
y/abcde/ABCDE/
G  
s/\(.*\)\n\(.*is \).*\(and \).*\(is \)\(.*\)/\2\5 \3\5 \4\1/  
}  
                                           
sed -f ddd.sed ddd  
This is 1 and 1 is A  
This is 2 and 2 is B  
This is 3 and 3 is C  
This is 4 and 4 is D  
This is 5 and 5 is E
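
To make the role of the hold space concrete, the steps of ddd.sed can be traced in a short Python 3 sketch. The variables hold and pattern are stand-ins for sed's two buffers; the function name swap is made up for this illustration.

```python
import re

def swap(line):
    hold = line                                            # h: copy pattern space to hold space
    pattern = re.sub(r".*is (.*) and .*", r"\1", line)     # keep only the letter
    pattern = pattern.translate(str.maketrans("abcde", "ABCDE"))  # y/abcde/ABCDE/
    pattern = pattern + "\n" + hold                        # G: append hold space after a newline
    # Rebuild the line from the saved copy, swapping letter and digit.
    return re.sub(r"(.*)\n(.*is ).*(and ).*(is )(.*)", r"\2\5 \3\5 \4\1", pattern)

print(swap("This is a and a is 1"))
```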


x
The x command exchanges the contents of the hold space and the pattern space.



perl-oneline
2015-12-08T06:40:52.000Z
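Perl has many command-line options, and they give us a chance to write simpler programs. This post covers the most commonly used ones.
Part 1: Safety Net Options
When you try out clever (or stupid) ideas in Perl, mistakes are bound to happen. Experienced Perl programmers rely on three options to catch errors early:
1: -c
This option compiles the Perl program without actually running it, which checks it for syntax errors. I run it right after every change to a Perl program to catch any syntax error.
$ perl -c program.pl
2: -w
It warns you about potential problems. Since Perl 5.6.0 the use warnings; pragma is available in place of -w, and you should prefer use warnings because it is more flexible than -w.
3: -T
It puts Perl into taint mode. In this mode Perl distrusts any data that comes from outside the program, such as data read from the command line, from external files, or passed in from a CGI program.
All such data is marked as tainted under -T.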
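Part 2: Command-line options that let short Perl programs run directly on the command line
1: -e
Runs Perl code given on the command line.
For example, we can run a "Hello World" program on the command line without writing it to a file first.
$ perl -e 'print "Hello World\n"'
Several -e options can be used together; they run in the order in which they appear.
$ perl -e 'print "Hello ";' -e 'print "World\n"'
As in any Perl program, only the last statement may omit the trailing ; .
2: -M
Loads a module, just as in a normal program.
$ perl -MLWP::Simple -e 'getstore("http://www.163.com/","163.html")'  ## download a whole web page
-M followed by a module name is the same as use followed by that module name.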
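Part 3: Implicit loops
3: -n
Adds a loop around the code so that you can process a file line by line.
$ perl -n -e 'print;' 1.txt   ##### same as: perl -ne 'print;' 1.txt
This is equivalent to the following program.
LINE:
    while (<>) {
     print;
    }
<> opens the files named on the command line and reads them line by line. Each line is stored in $_ by default.
$ perl -n -e 'print "$. - $_"' file
The line above can be written as
  LINE:
    while (<>) {
      print "$. - $_"
    }
It prints the current line number $. and the current line $_.
4: -p is the same as -n, but it also prints the contents of $_ after each iteration.
To do some processing before or after the loop, use a BEGIN or END block. The following line counts the words in a file.
$ perl -ne 'END { print $t } @w = /(\w+)/g; $t += @w' file.txt
All the words matched on each line go into the array @w, and the number of elements in @w is added to $t. The print in the END block finally outputs the total word count of the file.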
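
For readers more at home in Python, the loop that -n adds and the END block can be sketched as below. This is only an illustration of the accumulate-then-report pattern; count_words and the sample lines are made up for the example.

```python
import re

def count_words(lines):
    total = 0
    for line in lines:                         # the implicit loop that -n provides
        total += len(re.findall(r"\w+", line)) # same pattern as /(\w+)/g
    return total                               # what the END block prints

sample = ["hello world\n", "perl one-liners are short\n"]
print(count_words(sample))
```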
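Two more options make this program even simpler.
5: -a
Turns on autosplit mode. Whitespace is the default separator. Each input line is split on the separator and the fields are placed in the default array @F.
With -a, the command above can be written as:
$ perl -ane 'END {print $x} $x += @F' file.txt  ## uses -a
6: -F
Changes the default separator to whatever you want. For example, with a non-word character as the separator the command above becomes:
$ perl -F'\W' -ane 'END {print $x} $x += @F' file.txt
A more complex example uses the Unix password file. It is a text file in which each line is one user record,
with fields separated by colons. The seventh field is the path of the user's login shell. We can work out how many users use each distinct shell:
$ perl -F':' -ane '$s{$F[6]}++;' \
> -e 'END { print "$_ : $s{$_}" for keys %s }' /etc/passwd
It no longer fits on one line, but it shows what kind of problem these options can solve.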
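
The same tally can be sketched in Python 3 with collections.Counter. The sample records are hypothetical passwd-style lines, so the sketch runs without reading /etc/passwd; shell_counts is a name invented for this example.

```python
from collections import Counter

def shell_counts(lines):
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(":")   # like -F':' -a
        if len(fields) >= 7:
            counts[fields[6]] += 1              # seventh field: login shell
    return counts

# Hypothetical records in passwd format.
sample = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "alice:x:1000:1000::/home/alice:/bin/bash\n",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n",
]
for shell, n in shell_counts(sample).items():
    print(shell, ":", n)
```

Stripping the newline before splitting keeps it out of the shell path, which the Perl one-liner would need -l to do.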
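Part 4: Record Separators
$/ and $\ are the input and output record separators.
$/ separates the data read from a filehandle. Its default value is \n, so a filehandle is read one line at a time.
$\ is empty by default and is appended automatically to whatever print outputs. That is why print so often needs an explicit \n at the end.
$/ and $\ can be used together with -n and -p. Their command-line counterparts are -0 (zero) and -l (the letter L).
-0 can be followed by an octal or hexadecimal value, which is assigned to $/ .
-00 turns on paragraph mode and -0777 turns on slurp mode (the whole file is read at once); these have the same effect as setting $/ to the empty string and to undef respectively.
Used on its own, -l has two effects:
First, it automatically chomps the input record separator.
Second, it assigns the value of $/ to $\ (so print automatically appends \n).
1: -l adds \n to every output. For example:
$ perl -le 'print "Hello World"'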
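Part 5: In-place editing
With the options covered so far we can write very effective command-line programs. The usual Unix I/O redirection looks like this:
$ perl -pe 'some code' < input.txt > output.txt
This program reads data from input.txt, processes it, and writes the result to output.txt. Redirecting the output straight back to input.txt does not work, because the shell truncates the file before Perl reads it.
The -i option handles that case and makes the command simpler.
2: -i
It renames the source file and reads from the renamed file, then writes the processed data to a file with the original name.
If -i is followed by a string, that string is combined with the source file name to form a new file name.
That file holds the original content so that it is not lost when -i overwrites the source file.
This example replaces every PHP with Perl:
$ perl -i -pe 's/\bPHP\b/Perl/g' file.txt
The program reads each line of the file, performs the substitution, and writes the result back to (that is, overwrites) the source file.
If you do not want to lose the original, use
$ perl -i.bak -pe 's/\bPHP\b/Perl/g' file.txt
Here the processed data is written to file.txt, and file.txt.bak is a backup of the original.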
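A classic Perl example
Question:
I ran into this problem:
aaa@domain.com  2
aaa@domain.com 111
bbb@home.com   2222
bbb@home.com   1
Given output like this, I want to turn it into the following form:
aaa@domain.com 113
bbb@home.com 2223
That is, add up the numbers that follow the same mailbox name. Could anyone suggest how to do this in Perl?
Answer:
perl -anle '$cnt{$F[0]}+=$F[1];END{print "$_\t$cnt{$_}" for keys %cnt}' urfile
Once you know the command-line options above, this command is easy to follow:
It reads urfile one line at a time. Because of -a, autosplit mode is on, whitespace is the default separator, and each line is split into the default array @F.
Taking the first line of the file as an example, $F[0] is aaa@domain.com and $F[1] is 2.
$cnt{$F[0]} += $F[1] uses a hash with $F[0] as the key and adds $F[1] to its value, so the numbers for the same key accumulate. Every line of the file is processed this way.
END{} runs after the loop has finished. It prints the %cnt hash, whose keys are the mailbox names and whose values are the accumulated sums.
Here is the same command written as a script:
#!/usr/bin/perl
use strict;
use warnings;
my %hash;
while (<>){
      chomp;
     my @array=split;
     $hash{$array[0]} +=$array[1];
}
END{
foreach (keys %hash){
        print "$_\t$hash{$_}\n";
}
}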
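
The same aggregation, sketched in Python 3 for comparison. The input is the four sample lines from the question; sum_by_key is a name chosen for this illustration and is not part of the original answer.

```python
from collections import defaultdict

def sum_by_key(lines):
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()                 # like -a: split on whitespace
        if len(fields) >= 2:
            totals[fields[0]] += int(fields[1])
    return dict(totals)

data = [
    "aaa@domain.com  2",
    "aaa@domain.com 111",
    "bbb@home.com   2222",
    "bbb@home.com   1",
]
for key, value in sum_by_key(data).items():
    print(key, value, sep="\t")
```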
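Perl command-line options relevant to one-liners:
-0<number>
Sets the record separator ($/) to the character with the given octal code; the default is newline.
-00: paragraph mode, where runs of consecutive newlines separate records.
-0777: disables the separator, so the whole file is treated as a single record.
-a: autosplit mode; splits $_ on whitespace and stores the result in @F, equivalent to @F=split. The separator can be set with -F.
-F: sets the separator for -a; a regular expression may be used.
-e: runs the given code.
-i<extension>: edits files in place and backs up the old file with the given extension. Without an extension no backup is made.
-l: automatically chomps input and appends a newline to output.
-n: implicit loop, equivalent to while(<>){code;}
-p: implicit loop with automatic printing, equivalent to while(<>){code;print;}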
http://blog.sina.com.cn/s/blog_4af3f0d20100g9oz.html



How to use subscripts in ggplot2 legends -[expression()]
2015-12-05T03:16:17.000Z
If you want to incorporate Greek symbols etc. into the major tick labels, use an unevaluated expression.
library(ggplot2)
data <- data.frame(names=tolower(LETTERS[1:4]),mean_p=runif(4))

p <- ggplot(data,aes(x=names,y=mean_p))
p <- p + geom_bar(stat="identity",colour="black",fill="white")  # bar heights come from mean_p, so use stat="identity"
p <- p + xlab("expressions") + scale_y_continuous(expression(paste("Wacky Data")))
p <- p + scale_x_discrete(labels=c(a=expression(paste(Delta^2)),
                               b=expression(paste(q^n)),
                               c=expression(log(z)),
                               d=expression(paste(omega / (x + 13)^2))))
p


Contribution from ：
http://stackoverflow.com/questions/6202667/how-to-use-subscripts-in-ggplot2-legends-r



hexo语法：markdown基础篇
2015-12-04T14:59:58.000Z
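Markdown is a lightweight markup language whose goal is to be easy to read and easy to write. I use it mainly because of GitHub, so knowing some of its basic syntax is well worth the effort.

Common Markdown syntax

Headings
Just put # in front of the text. One to six # characters are supported. Adding a space after the # is recommended; this is the standard Markdown style.
Lists
There are two main kinds of list, unordered and ordered. For an unordered list put - or * in front of the text; an ordered list uses 1., 2., 3. as markers.
PDF

I am a PDF
Blockquotes
To quote a passage, put the > marker (a greater-than sign) in front of the text.
This is a quote: hope

> This is a quote: hope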
Images and links
Images:
![](){ImgCap}{/ImgCap}

Or:
<img src="https://raw.githubusercontent.com/wiki/tiramisutes/blog_image/pythonlogo.jpg" width="600" height="300">

Links:
[label](link)

http://tiramisutes.github.io/
Download
 Download Now

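Bold and italic
Bold and italic are simple too: text wrapped in two * or _ characters on each side is bold, and text wrapped in one * or _ on each side is italic.
Bold   Italic
Tables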
| Tables        | Are           | Cool  |
| ------------- |:-------------:| -----:|
| col 3 is      | right-aligned | $1600 |
| col 2 is      | centered      |   $12 |
| zebra stripes | are neat      |    $1 |

Rendered result
| Tables        | Are           | Cool  |
| ------------- |:-------------:| -----:|
| col 3 is      | right-aligned | $1600 |
| col 2 is      | centered      |   $12 |
| zebra stripes | are neat      |    $1 |
To centre a column, write :-------------: in the separator row; to right-align it, write -----:
Code blocks
cord if for while

Horizontal rules
A horizontal rule needs only three * characters.

I am a horizontal rule

Video
<iframe width="420" height="315" src="http://www.youtube.com/" frameborder="0" allowfullscreen></iframe>

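hexo server
Starting the Hexo server with hexo s fails with the following error:

FATAL Port 4000 has been used. Try other port instead.

This means that Hexo's default port, 4000, is already in use.
Solution:
On Windows, check which process is using the port and kill it

netstat -ano | findstr 4000    (the last column is the PID)
tasklist | findstr <pid>
taskkill /PID <pid> /F

Or switch to another port

hexo server --port=4001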

Contribution from ：
http://www.jianshu.com/p/1e402922ee32
http://daringfireball.net/projects/markdown/basics
https://guides.github.com/features/mastering-markdown/
http://blog.csdn.net/microcosmv/article/details/51868284



python基础教程总结
2015-12-03T11:05:19.000Z

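Introduction
Python is a high-level scripting language that combines interpreted, compiled, interactive and object-oriented features.
Python is designed to be highly readable: it uses English keywords where other languages use punctuation, and its syntax is more distinctive than that of many other languages.
Python is an interpreted language: there is no separate compile step during development, much as with PHP and Perl.
Python is an interactive language: you can type code at a Python prompt and run it directly.
Python is an object-oriented language: it supports an object-oriented style of programming in which code is encapsulated in objects.
Python is a beginner's language: it is a great language for novice programmers and supports a wide range of applications, from simple text processing to web browsers and games.
Python features
1. Easy to learn: Python has relatively few keywords, a simple structure and a clearly defined syntax.
2. Easy to read: Python code is clearly laid out.
3. Easy to maintain: Python source code is fairly easy to maintain.
4. A broad standard library: one of Python's greatest strengths is its rich, cross-platform library, which works well on UNIX, Windows and Macintosh.
5. Interactive mode: you can type code at a terminal and get results immediately, which helps when testing and debugging snippets.
6. Portable: Python runs on many hardware platforms and offers the same interface on all of them.
7. Extensible: low-level modules can be added to the Python interpreter, which lets programmers add or customise their own tools for better efficiency.
8. Databases: Python provides interfaces to all major commercial databases.
9. GUI programming: Python supports GUI applications that can be created and ported to many systems.
10. Scalable: compared with shell scripts, Python offers better structure and supports large programs.
Setting up a Python environment
Type "python" in a terminal window to check whether Python is installed locally and, if so, which version.
Downloading Python
The latest Python source code, binaries, documentation and news are available on the official Python site:
Official site: http://www.python.org/
The Python documentation can be downloaded from the following link in HTML, PDF and PostScript formats.
Documentation download: www.python.org/doc/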
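Installing Python
Installing Python on Unix and Linux:
Download and unpack the archive.

If you need to customise some options, edit Modules/Setup

Run the ./configure script

make

make install

After these steps, Python is installed in /usr/local/bin and the Python libraries in /usr/local/lib/pythonXX, where XX is the version of Python you are using.

Installing Python on Windows:
After downloading, double-click the package to start the Python installer. Installation is simple: keep the default settings and click "Next" until it finishes.

Setting environment variables
Setting environment variables on Unix/Linux
export PATH="$PATH:/usr/local/bin/python" 
## /usr/local/bin/python is the Python installation directory

Setting environment variables on Windows:
## At the command prompt (cmd), type
path %path%;C:\Python 
## C:\Python is the Python installation directory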

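Important Python environment variables:

Errors when printing Chinese text
The fix is simply to add # -*- coding: UTF-8 -*- or #coding=utf-8 at the top of the file.
Basic Python syntax
Lines and indentation
The biggest difference between Python and other languages is that Python code blocks do not use braces ({}) to delimit classes, functions and other logical constructs. Python's most distinctive feature is that it uses indentation to structure code.
The amount of indentation is up to you, but every statement in a code block must use the same amount, and this is strictly enforced.
The error IndentationError: unexpected indent is the Python interpreter telling you "Hi, your file is not formatted correctly; tabs and spaces may be misaligned", so Python is very strict about layout.
The error IndentationError: unindent does not match any outer indentation level means that your indentation is inconsistent: some lines are indented with tabs and others with spaces. Make them consistent.
So every line in a Python code block must start with the same amount of indentation.
Use a single tab, two spaces or four spaces for each indentation level, and never mix them.
Multi-line statements
A Python statement normally ends at the end of the line, but a backslash (\) lets you split one statement across several lines. Statements enclosed in [], {} or () do not need the continuation character.
Python comments
A single-line comment in Python starts with #. Multi-line comments use three single quotes (''') or three double quotes (""").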
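Python variables
Python has five standard data types:

Numbers

String

List

Tuple

Dictionary

Python numbers
Python 2 supports four numeric types:

int (signed integers)

long (long integers, which can also be written in octal and hexadecimal)
    
float (floating-point numbers)
    
complex (complex numbers)

Python math functions:

Python strings
A string is a sequence of characters such as digits, letters and underscores.
Strings are written between quotation marks ('').
Python strings and lists can be indexed in two directions:

From left to right, indexes start at 0 and go up to the length of the string minus 1

From right to left, indexes start at -1 and go down to the start of the string

To take a substring, use variable[start:end]. Indexes start at 0 and may be positive or negative; an omitted index means "to the beginning" or "to the end".
The plus sign (+) is the string concatenation operator and the asterisk (*) is the repetition operator.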
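
The indexing and slicing rules above can be tried out with a short Python 3 sketch (the tutorial itself is written for Python 2, where print is a statement; the slicing behaviour is the same). The sample string is arbitrary.

```python
s = "Hello World"

first = s[0]          # left-to-right indexes start at 0
last = s[-1]          # right-to-left indexes start at -1
middle = s[2:5]       # characters at indexes 2, 3 and 4; the end index is excluded
head = s[:5]          # omitted start: from the beginning
tail = s[6:]          # omitted end: to the end
joined = head + "!"   # + concatenates
repeated = s[:2] * 3  # * repeats

print(first, last, middle, head, tail, joined, repeated)
```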
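Python string operators:

Built-in string methods:

| No. | Method | Description |
| --- | --- | --- |
| 1 | string.capitalize() | Upper-cases the first character of the string |
| 2 | string.center(width) | Returns the string centred in a new string of length width, padded with spaces |
| 3 | string.count(str, beg=0, end=len(string)) | Returns how many times str occurs in string, or within the range given by beg and end |
| 4 | string.decode(encoding='UTF-8', errors='strict') | Decodes string using the given encoding; on error raises ValueError by default, unless errors is 'ignore' or 'replace' |
| 5 | string.encode(encoding='UTF-8', errors='strict') | Encodes string using the given encoding; on error raises ValueError by default, unless errors is 'ignore' or 'replace' |
| 6 | string.endswith(obj, beg=0, end=len(string)) | Checks whether the string (or the range given by beg and end) ends with obj; returns True if so, otherwise False |
| 7 | string.expandtabs(tabsize=8) | Converts tab characters in string to spaces; the default tab size is 8 |
| 8 | string.join(seq) | Joins all the elements of seq (their string forms) into a new string, with string as the separator |
| 9 | string.ljust(width) | Returns the string left-aligned in a new string of length width, padded with spaces |
| 10 | string.lower() | Converts all upper-case characters in string to lower case |
| 11 | string.lstrip() | Strips whitespace from the left of string |
| 12 | string.maketrans(intab, outtab) | Builds a character translation table. In the simplest two-argument form, the first argument is a string of the characters to convert and the second is a string of the characters to convert them to |
| 13 | string.replace(str1, str2, num=string.count(str1)) | Replaces str1 with str2 in string, at most num times if num is given |
| 14 | string.rstrip() | Strips whitespace from the end of string |
| 15 | string.split(str="", num=string.count(str)) | Splits string on the separator str; if num is given, at most num splits are made |
| 16 | string.strip([obj]) | Runs both lstrip() and rstrip() on string |
| 17 | string.title() | Returns a "title-cased" copy of string, in which every word starts with an upper-case letter and the remaining letters are lower case (see istitle()) |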
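
A few of the methods in the table, exercised in a minimal Python 3 sketch. Two points are assumptions relative to the Python 2 text: in Python 3 maketrans is called as str.maketrans, and decode belongs to bytes rather than str, so it is left out here.

```python
s = "hello world"

capitalized = s.capitalize()           # first character upper-cased
centered = "hi".center(6)              # padded with spaces on both sides
occurrences = "banana".count("an")     # non-overlapping occurrences
expanded = "a\tb".expandtabs(4)        # tab replaced by spaces up to the next tab stop
joined = "-".join(["a", "b", "c"])     # the string is the separator
parts = "a,b,c".split(",", 1)          # at most one split
replaced = "aaa".replace("a", "b", 2)  # at most two replacements
titled = s.title()                     # every word capitalised
table = str.maketrans("abc", "123")    # Python 3 spelling of maketrans
translated = "aabbcc".translate(table)

print(capitalized, centered, occurrences, joined, parts, replaced, titled, translated)
```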



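Python lists
Lists are written between [ ].
Lists can also be sliced with variable[start:end]. From left to right indexes start at 0, from right to left they start at -1, and an omitted index means "to the beginning" or "to the end".
The plus sign (+) is the list concatenation operator and the asterisk (*) is the repetition operator.
Accessing list values
Use an index to access a value in a list; you can also take a slice with square brackets.
Updating a list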
>>> list = ['physics', 'chemistry', 1997, 2000]
>>> print list[2];
1997
>>> list[2] = 2001;
>>> print list[2];
2001
>>> print list;
['physics', 'chemistry', 2001, 2000]

Deleting list elements
Use the del statement to delete an element from a list
>>> del list[2];
>>> print list;
['physics', 'chemistry', 2000]

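Python list functions

| No. | Function | Description |
| --- | --- | --- |
| 1 | cmp(list1, list2) | Compares the elements of two lists (Python 2 only) |
| 2 | len(list) | Number of elements in the list |
| 3 | max(list) | Returns the largest element |
| 4 | min(list) | Returns the smallest element |
| 5 | sum(list) | Returns the sum of the elements |
| 6 | list(seq) | Converts a tuple to a list |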



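Python list methods

| No. | Method | Description |
| --- | --- | --- |
| 1 | list.append(obj) | Appends a new object to the end of the list |
| 2 | list.count(obj) | Counts how many times an element occurs in the list |
| 3 | list.extend(seq) | Appends all the values of another sequence to the end of the list |
| 4 | list.index(obj) | Returns the index of the first item that equals the value |
| 5 | list.insert(index, obj) | Inserts the object into the list at the given index |
| 6 | list.pop([index]) | Removes one element (the last by default) and returns it |
| 7 | list.remove(obj) | Removes the first item that equals the value |
| 8 | list.reverse() | Reverses the list in place |
| 9 | list.sort([func]) | Sorts the list in place |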



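Python tuples
A tuple is another data type, similar to a list.
Tuples are written between "()".
Elements are separated by commas. They cannot be reassigned, so a tuple works like a read-only list.
Built-in tuple function
tuple(seq): converts a list to a tuple.
Python dictionaries
Apart from the list, the dictionary is the most flexible built-in data structure in Python. A list is an ordered collection of objects; a dictionary is an unordered one.
The difference between the two is that dictionary elements are accessed by key rather than by offset.
Dictionaries are written between "{ }". A dictionary consists of keys and their corresponding values.
adict[key] returns the value for key; if key is not in the dictionary, a KeyError is raised.
Dictionary example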
>>> code = {"GLY" : "G", "ALA" : "A", "LEU" : "L", "ILE" : "I",
... "ARG" : "R", "LYS" : "K", "MET" : "M", "CYS" : "C"}
>>> code['ALA']
'A'
>>> code.keys()
>>> code.values()
>>> code.items()
>>> del code['CYS']
>>> code.update({'CYS':'C', 'MET':'M'})
>>> one2three = {}
>>> for key,val in code.items():
... one2three[val]= key

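Built-in dictionary functions and methods

| No. | Method | Description |
| --- | --- | --- |
| 1 | radiansdict.clear() | Removes all elements from the dictionary |
| 2 | radiansdict.copy() | Returns a shallow copy of the dictionary |
| 3 | radiansdict.fromkeys(seq[, val]) | Creates a new dictionary whose keys are the elements of the sequence seq, with val as the initial value for every key |
| 4 | radiansdict.get(key, default=None) | Returns the value for the given key, or default if the key is not in the dictionary |
| 5 | radiansdict.items() | Returns a list of (key, value) tuples |
| 6 | radiansdict.keys() | Returns a list of all the keys |
| 7 | radiansdict.update(dict2) | Adds the key/value pairs of dict2 to the dictionary |
| 8 | radiansdict.values() | Returns a list of all the values |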
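
A minimal Python 3 sketch of the dictionary methods listed above, reusing the amino-acid codes from the earlier example. Note one difference from the Python 2 descriptions: in Python 3, keys(), values() and items() return views rather than lists.

```python
code = {"GLY": "G", "ALA": "A", "LEU": "L", "ILE": "I"}

missing = code.get("VAL", "?")        # get() returns the default instead of raising KeyError
code.update({"VAL": "V"})             # add or overwrite key/value pairs
one2three = {one: three for three, one in code.items()}  # invert the mapping
blank = dict.fromkeys(["x", "y"], 0)  # same initial value for every key
snapshot = code.copy()                # shallow copy
code.clear()                          # empties the original only

print(missing, one2three["V"], blank, len(snapshot), len(code))
```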



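Python type conversion
Sometimes a value has to be converted from one built-in type to another. To do so, use the name of the target type as a function.
The following built-in functions convert between data types. Each returns a new object that represents the converted value.

Python operators
Python supports the following kinds of operator:

Arithmetic operators: +, -, *, /, %, ** (power: x**y returns x to the power y), // (floor division: rounds the quotient down to a whole number)

Comparison (relational) operators: ==, !=, <> (not equal, Python 2 only), >, <, >=, <=

Assignment operators: =, -= (subtract and assign), += (add and assign), *=, /=, %=, **=, //=

Logical operators: and, or, not

Bitwise operators

Membership operators: in, not in

Identity operators: is, is not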
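
The operators that are easiest to confuse can be checked with a short Python 3 sketch. One assumption to keep in mind: in Python 3, / is always true division, whereas in Python 2 it truncates when both operands are integers.

```python
a = [1, 2]
b = [1, 2]
c = a

quotient = 7 / 2            # true division in Python 3
floored = 7 // 2            # floor division
floored_negative = -7 // 2  # rounds down, not towards zero
remainder = 7 % 2
power = 2 ** 10
same_value = (a == b)       # == compares contents
same_object = (a is b)      # is compares identity
alias = (c is a)            # c names the same list object as a
member = 2 in a

print(quotient, floored, floored_negative, remainder, power,
      same_value, same_object, alias, member)
```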

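Operator precedence

Python conditional statements
if condition_1:
    statements_1 ...
elif condition_2:
    statements_2 ...
elif condition_3:
    statements_3 ...
else:
    statements_4 ...

When a condition holds (is non-zero), the statements that follow it are executed. The body may span several lines, and indentation marks the lines that belong to the same block.
Python loop statements
Python provides for loops and while loops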

Python while loops
#!/usr/bin/python

count = 0
while (count < 9):
   print 'The count is:', count
   count = count + 1

print "Good bye!"
Output:
The count is: 0
The count is: 1
The count is: 2
The count is: 3
The count is: 4
The count is: 5
The count is: 6
The count is: 7
The count is: 8
Good bye!

Python for loops
#!/usr/bin/python
# -*- coding: UTF-8 -*-

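for num in range(10,20):  # iterate over the numbers from 10 to 19
   for i in range(2,num): # try each candidate factor
      if num%i == 0:      # found the first factor
         j=num/i          # compute the second factor
         print '%d equals %d * %d' % (num,i,j)
         break            # leave the inner loop
   else:                  # the else branch runs when the loop ends without break
      print num, 'is a prime number'
Output of the example above:

10 equals 2 * 5
11 is a prime number
12 equals 2 * 6
13 is a prime number
14 equals 2 * 7
15 equals 3 * 5
16 equals 2 * 8
17 is a prime number
18 equals 2 * 9
19 is a prime number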

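Python functions
Defining a function
You can define functions that do whatever you need. The basic rules are:

A function block starts with the def keyword, followed by the function name and parentheses ().

Any parameters and arguments go inside the parentheses, which is where parameters are defined.

The first statement of a function can optionally be a documentation string that describes the function.

The function body starts after a colon and is indented.

return [expression] ends the function and optionally passes a value back to the caller. A return without an expression returns None.

Syntax
def functionname( parameters ):
   "function docstring"
   function_suite
   return [expression]

Example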
#!/usr/bin/python
# -*- coding: UTF-8 -*-

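total = 0; # this is a global variable
# function description
def sum( arg1, arg2 ):
   "Return the sum of the two arguments."
   total = arg1 + arg2; # total here is a local variable
   print "Inside the function, local total : ", total
   return total;  # return [expression] leaves the function and optionally passes a value back to the caller; a bare return returns None
 
# call the sum function
sum( 10, 20 );
print "Outside the function, global total : ", total 
Output of the example above:

Inside the function, local total :  30
Outside the function, global total :  0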
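
The local-versus-global behaviour shown above can be contrasted with the global statement in a small Python 3 sketch. The function names add_local and add_global are made up for this illustration; the global statement itself is not covered in the text above and is shown only for contrast.

```python
total = 0

def add_local(a, b):
    total = a + b        # assignment creates a local name; the module-level total is untouched
    return total

def add_global(a, b):
    global total         # rebind the module-level name instead
    total = a + b
    return total

local_result = add_local(10, 20)
after_local = total      # still 0: the function changed only its local variable
global_result = add_global(10, 20)
after_global = total     # now 30

print(local_result, after_local, global_result, after_global)
```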

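Python modules
Simply put, a module is a file that holds Python code. A module can define functions, classes and variables, and it can also contain runnable code. For installing modules see the post 《python模块安装—无root权限（easy_install和pip）》 (installing Python modules without root, using easy_install and pip).
The import statement
To use a Python source file, run an import statement in another source file.
The from...import statement
Python's from statement imports a specific part of a module into the current namespace.
The from...import * statement
It is also possible to import everything in a module into the current namespace.
Packages in Python
A package is a hierarchical directory structure that defines a Python application environment made up of modules, subpackages, subpackages of subpackages, and so on.
Python file I/O
Reading keyboard input
Python 2 provides two built-in functions that read a line of text from standard input, which by default is the keyboard:

raw_input: raw_input([prompt]) reads one line from standard input and returns it as a string, without the trailing newline.

input: input([prompt]) is largely interchangeable with raw_input([prompt]), but input assumes that what you type is a valid Python expression and returns the result of evaluating it.

The open() function
You must first open a file with Python's built-in open() function, which creates a file object; its methods can then be called to read and write.

Syntax:


file object = open(file_name [, access_mode][, buffering])


The parameters are:

file_name: a string containing the name of the file you want to access.

access_mode: the mode in which the file is opened, such as read-only, write or append. This parameter is optional, and the default access mode is read-only (r).

buffering: if buffering is 0, nothing is buffered. If it is 1, the file is line-buffered. If it is an integer greater than 1, that value is the size of the buffer. If it is negative, the system default buffer size is used.


The close() method

Syntax:


fileObject.close();

The write() method
write() does not add a newline ('\n') to the end of the string.
Note: the data passed to write() must be a string.

Syntax:


fileObject.write(string);

The read() method
read() reads a string from an open file.

fileObject.read([count]);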
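
The open/write/read/close cycle can be sketched in Python 3 as follows. The scratch file path is hypothetical and created in a temporary directory; the with statement is used here as a convenience that calls close() automatically, which the text above does not cover.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as handle:      # "w" creates or truncates the file
    handle.write("first line\n")     # write() adds no newline of its own
    handle.write("second line\n")

with open(path) as handle:           # the default mode is read-only ("r")
    head = handle.read(5)            # read at most 5 characters
    rest = handle.read()             # read whatever remains

print(repr(head), repr(rest))
```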

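Python regular expressions
The re module, added in Python 1.5, provides Perl-style regular expression patterns.
The re module gives Python full regular expression support.
The compile function builds a regular expression object from a pattern string and optional flags. The object has a set of methods for matching and substitution.
The re module also provides functions that do exactly the same as these methods; they take a pattern string as their first argument.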
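
The two calling styles described above, a compiled pattern object versus the module-level functions, can be compared in a short Python 3 sketch. The pattern and sample strings are arbitrary illustrations.

```python
import re

pattern = re.compile(r"(\w+)@(\w+)\.com")

# Method on the compiled object.
m = pattern.search("mail aaa@domain.com now")
user, host = m.group(1), m.group(2)
replaced_obj = pattern.sub("<hidden>", "aaa@domain.com bbb@home.com")

# Module-level function: the pattern string is the first argument.
replaced_fn = re.sub(r"(\w+)@(\w+)\.com", "<hidden>", "aaa@domain.com bbb@home.com")

print(user, host, replaced_obj, replaced_fn)
```

Compiling once is mainly worthwhile when the same pattern is reused many times.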
详细内容分见
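Notes from my own use
Two effective ways to replace text in a Python string:
With the string's own method
a = 'hello word'
a.replace('word','python')
With a regular expression
import re
strinfo = re.compile('word')
b = strinfo.sub('python',a)
print b
Errors
TypeError: 'str' object does not support item assignment
AttributeError: 'str' object has no attribute 'append'
Cause: a list operation was applied to a str
Fix: convert the data type
Converting between list and str: str.split()
This built-in method converts a str to a list; its str="" argument is the separator.
join can be seen as the inverse of split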
>>> name=['Albert', 'Ainstain']
>>> "".join(name)
'AlbertAinstain'

Contribution from ：
http://m.runoob.com/python/
http://www.ynpxrz.com/n781659c2025.aspx



miRNA
2015-12-01T06:06:06.000Z
Introduction
MicroRNAs (miRNAs) are short in sequence and are generated by enzymatic excision from precursor transcripts called primary miRNAs (pri-miRs), which until now had been assumed not to encode any proteins.
Lauressergues et al find that plant pri-miRNAs contain short open reading frame sequences that encode regulatory peptides. This is the first report of a functional peptide being encoded by a pri-miR and provides a fresh perspective on the significance of pri-miR regions beyond those that directly give rise to miRNAs.

Figure legend: MicroRNAs and their associated peptides. The precursors of plant microRNAs (miRNAs) are pri-miR sequences, which are transcribed from DNA by the enzyme RNA polymerase II (Pol II). They are then modified by capping and addition of a poly(A) tail. The miRNA duplexes are subsequently excised by the enzyme Dicer-like1 and transported to the cytoplasm; other parts of the pri-miR are degraded. After further processing, the resulting miRNA sequence guides repression of gene expression as part of the RISC complex. Lauressergues et al. report that some pri-miRs contain short open reading frame (ORF) sequences that can produce peptides (miPEP). The miPEPs enhance expression of the pri-miR, leading to more miRNA and so more effective cleavage of the target gene's messenger RNA. Such ORFs may avoid degradation as part of pri-miRs that might exit the nucleus without being processed by Dicer-like1.
Key points
1）    Both plant and animal pri-miRs are transcribed from DNA in the nucleus by the enzyme RNA polymerase II.
2）    The structured (fold-back) region of the transcript surrounding the miRNA sequence is recognized and processed by one of two enzymes-Drosha or Dicer-like1(In plants, Dicer-like1 cuts out the miRNA in a duplex form).
3)    Transporter proteins export the excised sequences to the cytoplasm, where they are further processed before becoming competent to guide the RNA-induced silencing complex (RISC) in repressing target genes through either cleavage or translational repression of their mRNAs.
4)    The miPEPs had the same tissue distribution as their associated mature miRNAs and enhanced the expression and effectiveness of these miRNAs. Moreover, the miPEPs promoted the transcription of their corresponding pri-miR, rather than enhancing miRNA stability.
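miRNA biogenesis pathways
miRNA biogenesis broadly proceeds as: primary miRNA (pri-miRNA) → precursor miRNA (pre-miRNA) → mature miRNA.
There are two specific pathways: the canonical pathway and the mirtron pathway.

Canonical pathway
Origin: intergenic regions or the introns of coding genes.
In the nucleus, a miRNA-encoding gene is transcribed by RNA polymerase II or RNA polymerase III into a primary miRNA (pri-miRNA). Like the mRNA of a protein-coding gene, the pri-miRNA has a 5' cap and a 3' poly(A) tail, and it can be several thousand bases long. A complex of an RNase III (Drosha) and its partner (DiGeorge syndrome critical region gene 8, DGCR8) then cleaves the pri-miRNA into a miRNA precursor (pre-miRNA) of 70~80 nucleotides with a stem-loop structure. The pre-miRNA is transported from the nucleus to the cytoplasm by the Ran-GTP-dependent nucleocytoplasmic transporter exportin 5. Another RNase III (Dicer) then cuts the pre-miRNA to a length of 21~25 nucleotides, with a phosphorylated 5' end and a 2 nt overhang at the 3' end. Like siRNA, the product is an imperfectly paired double-stranded RNA, a duplex of the mature miRNA and miRNA*; miRNA* is the stretch of the pre-miRNA that lies directly opposite the mature miRNA. Finally, an RNA helicase separates the duplex into the mature miRNA and miRNA*. The mature miRNA is loaded into the RNA-induced silencing complex (RISC), where it acts, and miRNA* is degraded.

Mirtron pathway
In 2007, Ruby et al. first identified mirtrons while studying small RNA sequences from Drosophila melanogaster and Caenorhabditis elegans. Mirtrons are miRNAs located in the introns of mRNA-coding genes. Unlike canonical miRNAs, their biogenesis does not require cleavage by Drosha; their form and function are the same as those of miRNAs. They have the following features:
(1) they derive from intron sequences about 56 nt long;
(2) they can form a miRNA:miRNA* duplex;
(3) the intron sequence and its secondary structure are highly conserved;
(4) the intron lies between exons in the genome and has the typical "GU-AG" intron boundaries;
(5) the mature miRNA always comes from the sequence on the 3' side;
(6) biogenesis requires Ldbr (lariat-debranching enzyme), which takes part in intron splicing;
(7) like canonical miRNAs, they are exported from the nucleus by the Ran-GTP/exportin 5 mechanism and are processed into mature miRNAs by Dicer.
Mirtrons in mammals also differ markedly from those in invertebrates, for example:
(1) most functional miRNAs produced by the mirtron pathway in mammals come from the 5' stem of the mirtron;
(2) whether or not the 3' stem of a mammalian mirtron becomes a mature miRNA, its first nucleotide is always uracil (U) or cytosine (C);
(3) the overhang (the protruding sequence at the end of the hairpin) of mammalian mirtrons is mostly a single nucleotide (G), and only occasionally a dinucleotide (AG);
(4) mammalian mirtrons have a higher GC content and a lower free energy than canonical miRNA precursors and invertebrate mirtrons.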
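Blocking miRNA regulation of mRNA
miRNA sponges
A miRNA sponge is a competitive inhibitor of miRNA target genes. Several miRNA antisense sequences are joined in tandem and expressed from a suitable vector. The transcript "soaks up" the corresponding miRNA and competes with the miRNA's target genes, which relieves their repression.
Strategies for building miRNA sponge expression vectors
circRNA
Circular RNA (circRNA) can act as a sponge for miRNA. http://tiramisutes.github.io/2016/04/12/Circular-RNAs/
miRNA databases
microRNA databases: http://micrornadatabases.com/Databases.html
miRNA tools
mirtron prediction
Mirtron 'SVM' prediction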
MicroRNA Target Prediction
miRanda — miRNA target prediction for human, drosophila and zebrafish genomes
miRBase — a comprehensive repository for miRNAs and their predicted targets
miRDB — an online database for miRNA target prediction and functional annotations in animals
miRNAMap — a genomic maps of microRNA genes and their target genes in mammalian genomes
miR2Disease — a database providing comprehensive resource of miRNA deregulation in various human diseases
TarBase — a comprehensive database of experimentally supported animal microRNA targets
PicTar — microRNA targets for vertebrates, fly and nematodes
TargetScan — a search for the presence of conserved sites that match the seed of each miRNA
Target Gene Prediction at EMBL — miRNA-Target predictions for Drosophila miRNAs
Databases for microRNA Expression
microRNA.org — predicted microRNA targets & target downregulation scores. Experimentally observed expression patterns
HMDD — Human MicroRNA Disease Database (HMDD) is a database that contains the experimentally supported miRNA-disease association data, which are manually curated from publications. The dysfunction evidence of miRNAs and the literature PubMed IDs are also given
TransmiR — a web query-driven database integrating the experimentally supported transcription factor and miRNA regulator relations
RNA Secondary Structure Prediction
DIANA MicroTest — a prediction of miRNA-mRNA interaction
mfold — tools for predicting the secondary structure of RNA and DNA, mainly by using thermodynamic methods
microInspector — a web tool for detection of miRNA binding sites in an RNA sequence
miRNA Bioinfor — miRNA End Energy calculator which takes a miRNA duplex and calculates the free energy for 5 base pairs at one end plus a dangling nucleotide
miRRim — a method for detecting miRNA foldbacks based on hidden Markov model (HMM)
MXSCARNA — a multiple alignment tool for RNA sequences using progressive alignment based on pairwise structural alignment algorithm of SCARNA. Good for large scale analyses.
RNAhybrid— a tool for finding the minimum free energy hybridisation of a long and a short RNA
MicroRNA Homologous Prediction
miRNAminer — a web-based tool used for homologous miRNA gene search in several species
miRviewer — a global view of homologous miRNA genes in many species
RISCbinder— prediction of guide strand of microRNAs
Mireval — Sequence evaluation of microRNA properties
MicroRNA Deep Sequencing
miRanalyzer— A microRNA detection and analysis tool for next-generation sequencing experiments
miRNAkey— A software pipeline for the analysis of microRNA Deep Sequencing data
miRDeep— Discovering known and novel miRNAs from deep sequencing data
miRNA encyclopedia
microRNA.gene-quantification.info
References:
Plant biology: Coding in non-coding RNAs
Mirtrons: microRNA biogenesis via splicing



ChIP-sequencing介绍
2015-12-01T05:21:42.000Z
Introduction (from Wikipedia, the free encyclopedia)
ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.
ChIP-seq is used primarily to determine how transcription factors and other chromatin-associated proteins influence phenotype-affecting mechanisms. Determining how proteins interact with DNA to regulate gene expression is essential for fully understanding many biological processes and disease states. 
Workflow

ChIP
The ChIP process enriches specific crosslinked DNA-protein complexes  using an antibody against the protein of interest. For a good description of the ChIP wet lab protocol see ChIP-on-chip. Oligonucleotide adaptors are then added to the small stretches of DNA that were bound to the protein of interest to enable massively parallel sequencing.
Sequencing
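After size selection, all the resulting ChIP-DNA fragments are sequenced simultaneously using a genome sequencer. A single sequencing run can scan for genome-wide associations with high resolution, as opposed to the large sets of tiling arrays required for lower-resolution ChIP-chip.
Explanation

ChIP sequencing consists of two steps, ChIP and sequencing. ChIP stands for chromatin immunoprecipitation, and its steps are: crosslinking of protein and DNA inside the cell, isolation and fragmentation of the DNA, immunoprecipitation, and reversal of the crosslinks. Sequencing means sequencing the DNA fragments recovered after crosslink reversal; library preparation also includes steps such as adapter ligation and fragment size selection. Because the immunoprecipitated DNA fragments are enriched around the regions where the protein binds, identifying protein-binding regions becomes a matter of detecting regions enriched in sequence tags. In information-processing terms this is a signal detection problem. In the analysis of transcription-factor ChIP-seq data, the sequence tags also form a characteristic distribution around the binding site, so statistical modelling of that distribution can locate binding sites precisely.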
Computational analysis
The read count data generated by ChIP-seq is massive, which motivates the development of computational analysis methods. To predict DNA-binding sites from ChIP-seq read count data, peak calling methods have been developed. The most popular method is MACS, which empirically models the shift size of ChIP-seq tags and uses it to improve the spatial resolution of predicted binding sites.
Another relevant computational problem is differential peak calling, which identifies significant differences between two ChIP-seq signals from distinct biological conditions. Differential peak callers segment two ChIP-seq signals and identify differential peaks using hidden Markov models. Examples of two-stage differential peak callers are ChIPDiff and ODIN.
For details of the data analysis see: http://www.plob.org/2012/09/29/3760.html
Contribution from ：
https://en.wikipedia.org/wiki/ChIP-sequencing
http://www.illumina.com/documents/products/datasheets/datasheet_chip_sequence.pdf
http://bioinfo.au.tsinghua.edu.cn/member/xwang/files/Thesis_XiWang_OL.pdf



Linux 下批量下载 http 中链接内容
2015-12-01T00:31:16.000Z

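The KOBAS software needs database files for 3540 species. How can they all be downloaded in one go?

The HTML source of the page shows that all the download links are laid out neatly in the same HTML link structure around the text KO,
so a regular expression can pick out every download link.
Method 1: fetch the page and download with Python
requests, BeautifulSoup and re are good, easy-to-use Python modules for fetching pages, parsing HTML and matching text respectively. Combining the three makes it possible to crawl the page and download the files in bulk.
Import the required modules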
#!/usr/bin/env python2.7
from BeautifulSoup import BeautifulSoup
import requests
import re
#—--name:url-batch-download.py--
#—--modified by: hope--
# point to output directory
outpath = '/backend3/'
url = 'http://kobas.cbi.pku.edu.cn/site/download_db.jsp'
mbyte=1024*1024

print 'Reading: ', url
html = requests.get(url).text
soup = BeautifulSoup(html)

# the links on the page are relative ("../..."); compile the pattern once
strinfo = re.compile(r'\.\./')

print 'Processing: ', url
for name in soup.findAll('a', href=True):
    gzurl = name['href']
    # turn the relative link into an absolute URL
    xzp= strinfo.sub('http://kobas.cbi.pku.edu.cn/',gzurl)
    print xzp
    if( xzp.endswith('.gz') ):
        outfname = outpath + xzp.split('/')[-1]
        r = requests.get(xzp, stream=True)
        if( r.status_code == requests.codes.ok ) :
            # content-length may be absent, so fall back to 0
            fsize = int(r.headers.get('content-length', 0))
            print 'Downloading %s (%sMb)' % ( outfname, fsize/mbyte )
            with open(outfname, 'wb') as fd:
                for chunk in r.iter_content(chunk_size=1024): # chunk size can be larger
                    if chunk: # ignore keep-alive chunks
                        fd.write(chunk)

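Make the .py file executable
chmod +x url-batch-download.py

Start crawling and downloading
python2.7 url-batch-download.py

Method 2: bulk download with wget
1) Download the whole HTML page with wget
Everyone knows wget, but it is usually used to download a single file. What if you need to download many?
wget http://kobas.cbi.pku.edu.cn/site/download_db.jsp

The downloaded file has a .jsp suffix; it can be renamed to index.html.
2) Start the bulk download
wget -i index.html -F -B http://kobas.cbi.pku.edu.cn/site/download_db.jsp

Options:
-i reads the links from a file; by default the file is read as one URL per line
-F treats the input file as HTML, which in effect extracts the links from it
-B the extracted links are all relative paths, so a base URL has to be prepended to each one before it can be downloaded; -B supplies that base URL.
This downloads all the linked files into the current directory.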
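
The link-extraction step that both methods rely on can be sketched with the Python 3 standard library alone, without any network access. The page fragment below is hypothetical, since the real page layout is not shown here, and GzLinkParser is a name invented for this example. urljoin resolves "../" against the base URL in the same way that the -B option does for wget.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class GzLinkParser(HTMLParser):
    """Collect absolute URLs of links that end in .gz."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".gz"):
            # Resolve the relative path against the page URL.
            self.links.append(urljoin(self.base_url, href))

# Hypothetical page fragment; the real page may differ.
page = '<a href="../download/ko.gz">KO</a> <a href="help.html">Help</a>'
parser = GzLinkParser("http://kobas.cbi.pku.edu.cn/site/download_db.jsp")
parser.feed(page)
print(parser.links)
```

Resolving links with urljoin avoids the hand-written "../" substitution and also copes with links that are already absolute.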



DNA methylation
2015-11-30T15:49:08.000Z
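Introduction
DNA methylation is a process by which methyl groups are added to DNA. Methylation modifies the function of the DNA. When located in a gene promoter, DNA methylation typically acts to repress gene transcription. DNA methylation is essential for normal development and is associated with a number of key processes including genomic imprinting, X-chromosome inactivation, repression of repetitive elements, and carcinogenesis.
DNA methylation at the 5 position of cytosine has the specific effect of reducing gene expression and has been found in every vertebrate examined. In adult somatic cells (cells in the body, not used for reproduction), DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells, and has also been indicated in neural development.
Mechanism
Catalysed by methyltransferases, a methyl group is selectively added to the cytosine of CG dinucleotides in DNA to form 5-methylcytosine; this is commonly seen at 5'-CG-3' sequences in genes. The genomic DNA of most vertebrates contains a small amount of methylated cytosine, concentrated mainly in the 5' non-coding regions of genes and occurring in clusters. Methylated sites can be inherited through DNA replication, because after replication a methylase methylates the newly synthesised, unmethylated sites. DNA methylation can inactivate a gene. It changes the conformation of the DNA in certain regions and so affects protein-DNA interactions. Beyond a certain degree of methylation the DNA shifts from the usual B-DNA to Z-DNA; because the Z-DNA structure is more compact and its helix deeper, many of the elements that protein factors bind to retract into the major groove, which hinders the initiation of transcription and inactivates the gene.
DNA methylation mainly produces 5-methylcytosine (5-mC), together with small amounts of N6-methyladenine (N6-mA) and 7-methylguanine (7-mG).
Structural genes
Structural genes contain many CpG sites. The carbon-5 atoms of the two cytosines in -CpG- and -GpC- are usually methylated, and the two methyl groups adopt a specific three-dimensional arrangement in the major groove of the DNA double helix. 60% to 90% of the CpGs in the genome are methylated; unmethylated CpGs cluster into CpG islands, located at the core sequence of structural-gene promoters and at transcription start sites. Experiments have shown that hypermethylation represses transcription. DNA methylation can change the chromatin structure of the corresponding genomic region: the DNA loses its cleavage sites for nucleases and restriction endonucleases and its DNase-sensitive sites, and the chromatin becomes highly coiled and condensed and loses transcriptional activity. Deamination of a cytosine methylated at position 5 yields thymine, which may lead to a base-substitution mutation through a T-G mismatch. If this is not corrected during cell division it can cause genetic disease or cancer. Moreover, the methylation pattern of an organism is stable and heritable.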
The Role of Methylation in Gene Expression
1) Not all genes are active at all times. DNA methylation is one of several epigenetic mechanisms that cells use to control gene expression.
2) 5-azacytidine Experiments Provide Early Clues to the Role of Methylation in Gene Expression.
Candidate-gene CpG island regions, candidate-gene methylation and epigenomics

Figure 1 | Altered DNA-methylation patterns in tumorigenesis. The hypermethylation of CpG islands of tumour-suppressor genes is a common alteration in cancer cells, and leads to the transcriptional inactivation of these genes and the loss of their normal cellular functions. This contributes to many of the hallmarks of cancer cells. At the same time, the genome of the cancer cell undergoes global hypomethylation at repetitive sequences, and tissue-specific and imprinted genes can also show loss of DNA methylation. In some cases, this hypomethylation is known to contribute to cancer cell phenotypes, causing changes such as loss of imprinting, and might also contribute to the genomic instability that characterizes tumours. E, exon.

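As shown on the left of Figure 1, in normal cells the CpG islands in the promoter regions of tumour-suppressor genes are unmethylated or only lightly methylated; the genes remain open and are continuously expressed, suppressing tumour formation. In tumour cells, the CpG islands in these regions are heavily methylated, the chromatin conformation changes, and expression of the tumour-suppressor genes is shut off. The consequences include entry into the cell cycle, loss of apoptosis, defective DNA repair, angiogenesis and loss of cell adhesion, which ultimately lead to tumorigenesis. Likewise, as shown on the right of Figure 1, when genes and repetitive sequences that are heavily methylated in normal cells lose methylation, the genes are expressed and the repeats are activated. This leads to loss of imprinting, excessive cell growth, inappropriate cell-type-specific expression, increased genome fragility and activation of endoparasitic sequences, and again ends in tumorigenesis.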
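DNA methylation detection methods
Detection methods can be divided by sample type into DNA-based and mRNA-based. Nearly all current methods work on cellular DNA, and by level of analysis they fall into three classes: analysis of genome-wide methylation content, methylation analysis of candidate genes, and genome-scale analysis of DNA methylation patterns and methylation profiles.
Genome-wide methylation content
High-performance liquid chromatography (HPLC)
HPLC separates DNA or protein molecules according to differences in molecular weight and conformation, and quantifies them through the different light absorbance of molecules in the mobile and stationary phases. Resolution increases with system pressure, so the method can quantify the overall level of DNA methylation in a genome.
The DNA sample is first hydrolysed to bases with hydrochloric or hydrofluoric acid. The hydrolysate is passed through the column and compared with standards, the absorption peaks are measured under ultraviolet light, and the ratio of integrated peak areas 5mC/(5mC + C) gives the overall methylation level of the genome. This is a standard method for measuring DNA methylation levels.
High-performance capillary electrophoresis (HPCE)
HPCE uses narrow-bore fused-silica capillaries to separate the chemical components of a mixture. In a strong electric field, molecules separate according to their charge, size, structure and hydrophobicity. Using HPCE on DNA hydrolysates to determine the 5mC level is simple, economical and highly sensitive.
The methods above measure the overall methylation level of the DNA analysed, but they cannot tell where the methylated sites are located.
Candidate genes
Methylation-sensitive restriction endonuclease PCR/Southern (MSRE-PCR/Southern)
This method exploits the fact that methylation-sensitive restriction endonucleases do not cut methylated sites. The DNA is digested into fragments of different sizes, the products are separated and analysed by Southern blotting or PCR amplification, and the methylation status is read from the result. Commonly used enzyme pairs include HpaII-MspI (CCGG) and SmaI-XmaI (CCCGGG).
Bisulfite sequencing
The DNA is first treated with bisulfite, which deaminates unmethylated cytosines to uracil while leaving methylated cytosines unchanged. The fragment of interest is then amplified by PCR, during which every uracil is replaced by thymine. Finally, the PCR product is sequenced and compared with the untreated sequence to determine whether each CpG site is methylated. The method is highly accurate and resolves the methylation state of every CpG site in the target fragment, but it requires sequencing many clones, which makes it laborious and expensive.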
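The comparison step of bisulfite sequencing can be illustrated with a toy shell sketch. The two sequences below are made up for illustration and are not real data; the script only shows the reading rule (a C that survives conversion is called methylated, a C that reads as T is called unmethylated), not a real analysis pipeline.

```shell
# Toy sequences (hypothetical): the untreated reference, and the same region
# as read after bisulfite conversion and PCR, where unmethylated C appears as T.
ref="ACGTCCGATCGA"
converted="ACGTTTGATTGA"

# Walk the reference; at every C, a surviving C means methylated,
# a T means the cytosine was unmethylated and was converted.
calls=$(awk -v ref="$ref" -v bs="$converted" 'BEGIN {
    for (i = 1; i <= length(ref); i++) {
        r = substr(ref, i, 1)
        b = substr(bs, i, 1)
        if (r != "C") continue
        if (b == "C") print i ":methylated"
        else if (b == "T") print i ":unmethylated"
    }
}')
printf '%s\n' "$calls"
```

Each output line is a 1-based reference position followed by the call for that cytosine.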
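Methylation-specific PCR (MS-PCR)
Here too the DNA is first treated with bisulfite, followed by primer-specific PCR. Two primer pairs are designed, each complementary to the bisulfite-converted sequence: one pair binds the converted methylated strand, the other the converted unmethylated strand. If the primers for the converted methylated strand yield a product, the site under test is methylated; if only the primers for the unmethylated strand yield a product, it is not.
MethyLight
After bisulfite treatment of the target DNA, a probe complementary to the region of interest is designed, with a reporter fluorophore at its 5' end and a quencher at its 3' end, and real-time quantitative PCR is run. If the probe hybridizes to the DNA, the 5'-to-3' exonuclease activity of Taq DNA polymerase cleaves the 5' reporter from the probe during primer extension; the quencher can no longer suppress the reporter, which then fluoresces. Measuring reporter fluorescence at each cycle gives the methylation status and level of the site. The method is efficient, fast and reproducible, needs little sample, and requires no electrophoretic separation.
Pyrosequencing
Pyrosequencing is based on an enzymatic chemiluminescence cascade catalysed by four enzymes in a single reaction system. In each round of sequencing only one dNTP is added. If it pairs with the template, the polymerase incorporates it into the primer strand and releases an equimolar amount of pyrophosphate (PPi). The PPi is ultimately converted into a visible-light signal, which is displayed as a peak in the Pyrogram™ whose height is proportional to the number of nucleotides incorporated. For methylation analysis, the bisulfite-treated sequence can be treated as a C/T SNP. The method is simple to run, accurate and reliable, and suited to large-scale analysis.
Combined bisulfite restriction analysis (COBRA)
The sample DNA is bisulfite-treated and amplified by PCR; originally methylated cytosines are retained, while unmethylated cytosines become thymine. Restriction endonucleases are then used to cut the converted PCR product, and the cutting pattern reveals the methylation status of the original sample DNA. The method is relatively simple and does not require prior knowledge of the CpG sites or the sample sequence.
Genome-wide DNA methylation patterns and methylation profiling
Restriction landmark genomic scanning (RLGS)
RLGS was one of the first methods suitable for genome-wide DNA methylation analysis. Genomic DNA is first digested with the methylation-sensitive rare cutter NotI, which leaves methylated sites intact; the ends are labelled, the DNA is cut again and run on a first-dimension gel. It is then cut with a more frequent, methylation-insensitive endonuclease and run in the second dimension, so that the methylated portions are cut apart and appear as bands during electrophoresis. Bands missing from the resulting RLGS profile relative to a normal control mark likely sites of methylation.
Amplification of inter-methylated sites (AIMS)
AIMS is based on arbitrarily primed PCR. Because it ligates oligonucleotide linkers, it needs no prior sequence information. The template sequences to be amplified are first enriched by digestion with a methylation-sensitive restriction endonuclease; specificity comes from ligating a linker to the defined sequence at one end of the restriction fragment. A second endonuclease digestion follows, then another ligation, purification and PCR amplification, and finally electrophoresis, from which the target sequences are recovered and sequenced.
Methylated CpG-island amplification (MCA)
MCA is also based on arbitrarily primed PCR. It digests the DNA sequentially with two restriction endonucleases that differ in methylation sensitivity (such as SmaI and XmaI), then ligates adaptors to the methylation-sensitive restriction fragments and runs PCR, so that CpG-rich sequences are selectively amplified. The method is useful both for methylation analysis and for cloning differentially methylated genes.
Differential methylation hybridization (DMH)
DMH is a microarray technique with two main components: amplicon generation and CGI library screening. For amplicon generation, the DNA sample is digested with MseI, linkers are ligated and repetitive sequences are removed. The sample is then split in two: one half is PCR-amplified directly, giving amplicons treated with MseI only, and the other half is digested with the methylation-sensitive enzyme BstUI and then PCR-amplified, giving amplicons treated with both MseI and BstUI. The CGI library is screened to remove repetitive sequences, PCR is used to select clones containing BstUI sites, and these clones are spotted in duplicate on a chip to make the CGI array. The two types of amplicon are then hybridized to the corresponding CGI clone spots, and a differential comparison identifies the unmethylated sites.
HpaII tiny fragment enrichment by ligation-mediated PCR (HELP)
The same genomic DNA is digested with the methylation-sensitive enzyme HpaII and with its methylation-insensitive isoschizomer MspI, producing different representations of the genome. These are amplified by ligation-mediated PCR and then compared by electrophoresis or co-hybridized to a genomic microarray. The method has revealed a large number of tissue-specific differentially methylated regions and has been used for genome-wide comparison of normal and cancer cells.
Methylated DNA immunoprecipitation (MeDIP)
MeDIP efficiently enriches methylated DNA. An antibody that binds 5mC specifically is added to denatured genomic DNA fragments, so that methylated fragments are immunoprecipitated and enriched. Combined with existing DNA microarray technology, this allows large-scale DNA methylation analysis. The method is simple and highly specific, and well suited to analysing the DNA methylome.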
Website for candidate-gene methylation analysis
http://katahdin.mssm.edu/kismeth/revpage.pl
Contribution from: http://www.nature.com/scitable/topicpage/the-role-of-methylation-in-gene-expression-1070
http://www.nature.com/nrg/journal/v8/n4/full/nrg2005.html
http://www.lifeomics.com/?p=18458
Methylation primer design
Methyl Primer Express® Software v1.0
https://www.thermofisher.com/order/catalog/product/4376041



Degradome
2015-11-30T15:36:38.000Z
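Degradome sequencing
Degradome sequencing (Degradome-Seq), also referred to as parallel analysis of RNA ends (PARE), is a modified version of 5'-rapid amplification of cDNA ends (RACE) that uses high-throughput deep sequencing, such as Illumina's SBS technology. Degradome sequencing provides a comprehensive means of analyzing patterns of RNA degradation.
Degradome sequencing has been used to identify microRNA (miRNA) cleavage sites, because miRNAs can cause endonucleolytic cleavage of mRNAs through extensive and often perfect complementarity to them. Degradome sequencing has revealed many known and novel plant miRNA (and siRNA) targets. More recently, it has also been applied to identify miRNA-derived cleavages in animals (human and mouse).
Principle

In plants, the great majority of miRNAs regulate target-gene expression by cleavage, and the cut usually falls at the tenth nucleotide of the region where the miRNA pairs with the mRNA. Cleavage splits the target transcript into two fragments, a 5' fragment and a 3' fragment. The 3' fragment carries a free 5' monophosphate and a 3' poly(A) tail, so it can be ligated by RNA ligase, and the ligation product goes on to high-throughput sequencing. Intact transcripts with a 5' cap, capped 5' cleavage fragments and other RNAs lacking a 5' monophosphate cannot be ligated by RNA ligase and therefore never enter the sequencing experiment. When the reads are aligned, a peak appears at a particular position on the mRNA sequence, and that position is the candidate miRNA cleavage site.
Analysis of the sequencing data
The reads are aligned and analysed with CleaveLand, the method developed by Addo-Quaye and colleagues at Pennsylvania State University. A peak at a particular position on an mRNA sequence marks a candidate miRNA cleavage site.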



Fluorescence in situ Hybridization (FISH)
2015-11-29T14:51:46.000Z

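Introduction
Fluorescence in situ hybridization (FISH) is a non-radioactive molecular cytogenetic technique developed in the late 1980s from radioactive in situ hybridization; it replaces isotope labels with fluorescent labels. The probe is first coupled to a reporter molecule and, after hybridization, the fluorescent dye is attached through an immunocytochemical step. The basic principle is to label a DNA (or RNA) probe with special nucleotides, hybridize it directly to chromosomes or DNA-fibre preparations, and then detect it with fluorophore-conjugated monoclonal antibodies that bind the probe specifically, allowing qualitative, positional and relative quantitative analysis of DNA sequences on chromosomes or DNA fibres. FISH is safe, fast and sensitive, its probes can be stored for long periods, and several colours can be displayed at once; it works on interphase nuclei as well as metaphase spreads. Multicolour FISH and chromatin-fibre FISH were later developed from it.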
Preparation and hybridization process – DNA

First, a probe is constructed. The probe must be large enough to hybridize specifically with its target but not so large as to impede the hybridization process. The probe is tagged directly with fluorophores, with targets for antibodies or with biotin. Tagging can be done in various ways, such as nick translation, or PCR using tagged nucleotides.
Then, an interphase or metaphase chromosome preparation is produced. The chromosomes are firmly attached to a substrate, usually glass. Repetitive DNA sequences must be blocked by adding short fragments of DNA to the sample. The probe is then applied to the chromosome DNA and incubated for approximately 12 hours while hybridizing. Several wash steps remove all unhybridized or partially hybridized probes. The results are then visualized and quantified using a microscope that is capable of exciting the dye and recording images.
If the fluorescent signal is weak, amplification of the signal may be necessary in order to exceed the detection threshold of the microscope. Fluorescent signal strength depends on many factors such as probe labeling efficiency, the type of probe, and the type of dye. Fluorescently tagged antibodies or streptavidin are bound to the dye molecule. These secondary components are selected so that they have a strong signal.
Difference between Southern blot and fluorescence in situ hybridization (FISH)?
FISH is performed on intact chromosomes in interphase or metaphase. It doesn't need DNA extraction and is performed on microscope slides. It's a type of karyotyping. Its use is to determine the chromosomal aberrations (numerical and structural) of a patient.
But in Southern blotting you first extract the DNA of the specimen, cleave it with a restriction enzyme and run it on gel electrophoresis. The next step is to blot the DNA onto a membrane and hybridize it with a labeled probe. Southern blotting is used to detect mutations and restriction fragment length polymorphisms.
https://answers.yahoo.com/question/index?qid=20120611095540AA1YXhl



Reshaping data in R: plyr
2015-11-29T08:24:33.000Z
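The plyr package supports pivot-table-like operations: split the data into smaller pieces, apply some operation to each piece, and then combine the results.
This post covers:
The Split-Apply-Combine principle
Ranking names in baby_names
Fitting coefficients piecewise
A few other functions
Before starting, make sure plyr is installed; if it is not, install it with install.packages().
> require(plyr)  # load the plyr package
> ## Suppose we have a summary of names given to newborns in the United States: each year, the names given to boys and girls are tallied into a table like the one below. The baby_names data set covers 1880 to 2008 and records the year (year), the baby's sex, the name, and the proportion of babies given that name.
> baby_names <- read.csv("baby-names.csv")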
> head(baby_names)
  year    name  percent sex
1 1880    John 0.081541 boy
2 1880 William 0.080511 boy
3 1880   James 0.050057 boy
4 1880 Charles 0.045167 boy
5 1880  George 0.043292 boy
6 1880   Frank 0.027380 boy

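We introduce plyr by posing questions and answering them.
How many records does the data set contain for each year?
How do boys' and girls' names rank within the data set?
What share of each year's babies do the top 100 boys' names and the top 100 girls' names account for?
How many records are there for each year?
Suppose first that we have the data for a single year. How would we count its records? Since each record occupies one row, counting the rows gives the number of records.
Let's write a function: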
> record_count <- function(df) {
+     return(data.frame(count = nrow(df)))
+ }

The return value is a data.frame, which prepares the ground for ddply(), introduced next. First, how many records does the data set hold for 2008?
> baby_names_2008 <- subset(baby_names, year == 2008)
> record_count(baby_names_2008)
  count
1  2000

The result is 2000, so we seem to have our answer. Now, how do we get the record count for each of the 129 years from 1880 to 2008?
> baby_names_1880_2008 <- ddply(baby_names,     # data set
+                  .(year),        # grouping variable
+                  record_count    # function
+                             )
> head(baby_names_1880_2008)
  year count
1 1880  2000
2 1881  2000
3 1882  2000
4 1883  2000
5 1884  2000
6 1885  2000
> dim(baby_names_1880_2008)
[1] 129   2

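How the ddply call works:
We defined a counting function, record_count()
We called ddply() and passed it that function
ddply() is the plyr function for processing data.frame input, and it returns a data.frame as well. Its argument list is: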
ddply(.data, .variables, .fun = NULL, ..., .progress = "none",
  .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)

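The arguments are:
The first argument is the data set to operate on, for example baby_names
The second argument names the variable (or variables) by which to split the data set; splitting by year is written .(year)
The third argument is the function that does the work; it is called on each subset produced by the split
The fourth argument is optional and carries any extra arguments needed by the function given as the third argument
The remaining arguments can be ignored for now. ddply() automatically combines the results computed on each piece and stores them as a data.frame. Each piece of the split data is passed as the first argument of .fun.
The split, apply and combine steps described above form the processing pattern at the heart of plyr, "Split - Apply - Combine". plyr has many functions built on this pattern; before introducing them, let's look at some more advanced uses of ddply().
Ranking boys' and girls' names within each year
Take 2008 as an example: among boys' names "Jacob" has the highest proportion and should rank first, "Michael" follows and should rank second, and so on. Among girls' names, "Emma" ranks first, "Isabella" second, "Emily" third, and so on. This is the result we want.
For the 2008 data a simple rank() is enough, as long as boys and girls are ranked separately.
baby_names_2008_boy <- subset(baby_names_2008, sex == "boy") # keep boys' names
baby_names_2008_boy$rank <- rank(- baby_names_2008_boy$percent) # rank by descending percent
head(baby_names_2008_boy) # inspect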

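How do we use ddply() to do the same on the full data set? This calls for the R function transform(), which performs operations on a data set and returns it with the results added as columns; see ?transform for details.
A first version looks like this: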
ddply(baby_names, 
      .(year, sex), 
      transform, 
      rank = rank(-percent, ties.method = "first")
)

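The second argument has changed: besides year it now includes sex, which means the baby_names data set is grouped by year and sex (like group by year, sex in SQL).
The fourth argument is the extra argument passed to transform. According to the transform help page, the function is called as follows:
transform(`_data`, ...)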

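The first argument is the data to operate on; inside ddply() it is the subset for one year and sex. The ... arguments take the form tag = value, and each tag = value pair is appended to the data as a column.
rank() ranks in ascending order by default, so the usual way to rank in descending order is to negate the data, which is why -percent appears in the rank() call above. plyr provides a function that does this negation, desc().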
x <- 1:10
desc(x)
# -1  -2  -3  -4  -5  -6  -7  -8  -9 -10

So the negation of percent can be written more elegantly, which gives the second version:
baby_names <- ddply(baby_names, 
                    .(year, sex), 
                    transform, 
                    rank = rank(desc(percent), ties.method = "first")
)

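Note that the result is assigned back to baby_names, because the ranking is needed again later.
Share of each year's babies covered by the top 100 boys' and girls' names
As in the previous question, the approach is:
Select the top 100 names for each year
Sum percent for boys and girls separately
baby_names_top100 <- subset(baby_names, rank <= 100)  # keep names ranked in the top 100
baby_names_top100_trend <- ddply(baby_names_top100, 
                                 .(year, sex), # split by year and sex
                                 summarize, # summarize the data
                                 trend = sum(percent)) # how to summarize (sum)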

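A new function appears here, summarize(). It summarizes the data and, unlike transform(), does not append the result to the original data but produces a new data set. For example, the difference in percent between the highest- and lowest-ranked boys' names of 2008 can be computed like this: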
summarize(baby_names_2008_boy, trend = max(percent) - min(percent))
# 0.010266

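Back to the question: what does the share taken by the top 100 boys' and girls' names (a measure of how concentrated naming is) look like from 1880 to 2008? A plot shows it.

Contribution from: http://www.jianshu.com/p/bfddfe29aa39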



Reshaping data in R: reshape2
2015-11-29T07:23:03.000Z
Data formats
Wide data
#   ozone   wind  temp
# 1 23.62 11.623 65.55
# 2 29.44 10.267 79.10
# 3 59.12  8.942 83.90
# 4 59.96  8.794 83.97

Long data
#    variable  value
# 1     ozone 23.615
# 2     ozone 29.444
# 3     ozone 59.115
# 4     ozone 59.962
# 5      wind 11.623
# 6      wind 10.267
# 7      wind  8.942
# 8      wind  8.794
# 9      temp 65.548
# 10     temp 79.100
# 11     temp 83.903
# 12     temp 83.968

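Long data has one column holding the variable names and one column holding the values; it need not have only two columns. ggplot2 needs long data, plyr needs long data, and most modelling functions (such as lm(), glm() and gam()) need long data too.
The reshape2 package
The two most used functions in reshape2 are melt and cast.
melt takes wide data and returns long data;
cast takes long data and returns wide data.
The melt function
Signature of melt():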
melt(data,id.vars,measure.vars,
    variable.name = "variable", ..., na.rm = FALSE,
    value.name = "value")

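If id.vars lists a set of variables, measure.vars can be left empty. The new data then keeps every id.vars column and gains two new columns, variable and value, one holding the variable name and the other the value.
The examples here use R's built-in airquality data set.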
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

First convert the column names to lower case, then look at the data:
> names(airquality)<- tolower(names(airquality))
> head(airquality)
  ozone solar.r wind temp month day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
## Apply melt directly to the data above
> library("reshape2")
> aql <- melt(airquality)
No id variables; using all as measure variables
> head(aql)
  variable value
1    ozone    41
2    ozone    36
3    ozone    12
4    ozone    18
5    ozone    NA
6    ozone    28
> tail(aql)
    variable value
913      day    25
914      day    26
915      day    27
916      day    28
917      day    29
918      day    30

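By default, melt treats every column with numeric values as a measured variable. Often that is what we want. Here, though, we want the values of ozone, solar.r, wind and temp for each month and day, so we need to tell melt that month and day are "ID variables". ID variables are the variables that identify individual rows of data, much like a primary key in a database.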
> aql <- melt(airquality, id.vars = c("month", "day"))
> head(aql)
  month day variable value
1     5   1    ozone    41
2     5   2    ozone    36
3     5   3    ozone    12
4     5   4    ozone    18
5     5   5    ozone    NA
6     5   6    ozone    28

What if we want to change the column names of the long data?
> aql <- melt(airquality, id.vars = c("month", "day"),
+             variable.name = "climate_variable", 
+             value.name = "climate_value")
> head(aql)
  month day climate_variable climate_value
1     5   1            ozone            41
2     5   2            ozone            36
3     5   3            ozone            12
4     5   4            ozone            18
5     5   5            ozone            NA
6     5   6            ozone            28

The cast function
Going from wide to long format is fairly intuitive; going back takes some extra effort.
reshape2 has several cast functions. If you mostly work with data.frames, use dcast. acast returns a vector, matrix or array.
dcast uses a formula to describe the shape of the data: the left-hand side lists the "ID variables" and the right-hand side the measured variables. It may take a few tries to find the right formula.
Here we tell dcast that month and day are the ID variables and that variable holds the measured variables.
> aql <- melt(airquality, id.vars = c("month", "day"))
> aqw <- dcast(aql, month + day ~ variable)
> head(aqw)
  month day ozone solar.r wind temp
1     5   1    41     190  7.4   67
2     5   2    36     118  8.0   72
3     5   3    12     149 12.6   74
4     5   4    18     313 11.5   62
5     5   5    NA      NA 14.3   56
6     5   6    28      NA 14.9   66
> head(airquality) # compare with the original data
  ozone solar.r wind temp month day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

If it isn’t clear to you what just happened there, then have a look at this illustration:

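The blue-shaded block holds the ID variables that identify each row; the red-shaded block holds the column names of the data to be generated; the grey data are the values used to fill the corresponding cells.
Confusion usually arises when a cell holds more than one value. For example, if the ID variables do not include day: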
> dcast(aql, month ~ variable)
Aggregation function missing: defaulting to length
  month ozone solar.r wind temp
1     5    31      31   31   31
2     6    30      30   30   30
3     7    31      31   31   31
4     8    31      31   31   31
5     9    30      30   30   30

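Looking at the dcast output again, each cell holds the number of records for that combination of month and variable, that is, the number of days recorded for the month. When a cell has several values, dcast must be told how to aggregate them, for example by taking the mean, the median or the sum. Here we compute the mean and drop NA values with na.rm = TRUE.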
> dcast(aql, month ~ variable, fun.aggregate = mean, na.rm = TRUE)
  month    ozone  solar.r      wind     temp
1     5 23.61538 181.2963 11.622581 65.54839
2     6 29.44444 190.1667 10.266667 79.10000
3     7 59.11538 216.4839  8.941935 83.90323
4     8 59.96154 171.8571  8.793548 83.96774
5     9 31.44828 167.4333 10.180000 76.90000

Additional help
Read the package help:
help(package = "reshape2")

See the reshape2 website:
http://had.co.nz/reshape/

And read the paper on reshape:
Wickham, H. (2007). Reshaping data with the reshape package. Journal of Statistical Software, 21(12):1–20.
http://www.jstatsoft.org/v21/i12

(But note that the paper is written for the reshape package not the reshape2 package.)




Common Linux commands: find
2015-11-29T07:14:58.000Z
Check whether the current directory or its subdirectories contain a file named "abc"
# find . -name abc

. : the current directory
-name: search by name
Check whether the current directory or its subdirectories contain a directory named "xyz"
# find . -type d -name xyz

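-type: sets the file type; d means directory and can be replaced by f (regular file) or l (symbolic link)
Find every file ending in ".txt" in the current directory and its subdirectories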
# find . -name "*.txt"

Find the files owned by user "roc" in the current directory and its subdirectories
# find . -user roc

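-user: the name of the owning user; -group can be used instead to match the owning group
Find every file with permissions set to 755 in the current directory and its subdirectories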
# find . -perm 755

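-perm: matches on permissions
Find files whose names contain both the character b and the character 3 in the current directory and its subdirectories; this uses a regular expression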
# find . -regex '.*b.*3'

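-regex: match with a regular expression. Note that the pattern is matched against the full path, so it must start with .* because every result begins with "./".
Print the contents of every "*.abc" file that find locates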
# find . -type f -name "*.abc" -exec cat {} \;

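-exec: each match found by find is passed as an argument to the command given after -exec
-ok can be used in place of -exec; it asks the user to confirm each operation (y for yes, n for no)
Do not forget the {} \; at the end of the command; {} stands for each item that find returns.
Find files in the current directory accessed within the last 5 minutes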
# find . -amin -5

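Use amin for accessed, mmin for modified and cmin for changed file status
amin, mmin and cmin work in minutes; atime, mtime and ctime work in days
Use -5 for less than 5 minutes ago and +5 for more than 5 minutes ago
Find every file larger than 10 MB in the current directory and its subdirectories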
# find . -size +10000000c

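-size: matches on file size; + means larger than the given number and - means smaller. c means the unit is bytes; it can be replaced by k, M or G.
All the find commands above search the current directory and its subdirectories. To search only the current directory without descending into subdirectories: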
# find . -maxdepth 1 -name "*.c"
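The options above can be tried safely in a throwaway directory. The sketch below is only an illustration: the file and directory names are made up, and -maxdepth assumes a find that supports it (GNU and BSD find do).

```shell
# Build a scratch directory tree (hypothetical names) to try the options on.
demo=$(mktemp -d)
mkdir -p "$demo/sub/xyz"
printf 'hello\n' > "$demo/a.txt"
printf 'world\n' > "$demo/sub/b.txt"
: > "$demo/sub/abc"

# -name with a quoted glob: every *.txt file, at any depth.
txt_all=$(cd "$demo" && find . -type f -name "*.txt" | sort)

# -maxdepth 1: the same search limited to the top level.
txt_top=$(cd "$demo" && find . -maxdepth 1 -type f -name "*.txt")

# -type d: directories only.
dirs=$(cd "$demo" && find . -type d -name xyz)

# -exec: run cat on every match; sorting keeps the output order stable.
contents=$(cd "$demo" && find . -type f -name "*.txt" -exec cat {} \; | sort)

printf '%s\n' "$txt_all" "$txt_top" "$dirs" "$contents"
rm -rf "$demo"
```

Quoting the glob ("*.txt") matters: unquoted, the shell may expand it before find ever sees it.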


Contribution from:
http://roclinux.cn/?p=18



Common Linux commands: cut
2015-11-29T06:55:55.000Z
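Selecting what to extract
cut accepts three ways of selecting what to extract:
First, bytes, with the -b option
-b accepts ranges such as 3-5, and several selections can be joined with commas, for example 3-5,7
One thing to note: with -b, cut first sorts all the selections given after -b in ascending order and only then extracts them
-3 means from the first byte to the third, and 3- means from the third byte to the end of the line
Second, characters, with the -c option
Mainly used for extracting from Chinese (multi-byte) text
Third, fields, with the -f option
Why extract by field? Because -b and -c can only extract information from fixed-format text and are helpless with variable-format text. This is where fields come in.
-d sets the delimiter
How can spaces and tabs be told apart?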
$ cat tab_space.txt
this is tab	finish.
this is several space      finish.
$ sed -n l tab_space.txt
this is tab\tfinish.$
this is several space      finish.$

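As you can see, a tab is shown as \t, while spaces are shown as they are.
This makes it possible to tell tabs from spaces.
Note that the character after sed -n above is a lowercase L; do not misread it.
Inverting the selection
--complement: selects the complement of the chosen bytes, characters or fields
$ cat dddd
1       2       3
$ cut -f2 --complement dddd
1       3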
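The selection rules described above can be checked on a single made-up line. This is a sketch with a hypothetical colon-separated string; --complement assumes GNU cut.

```shell
line='alpha:beta:gamma'

# -b with a list: bytes 1-3 plus byte 7.
bytes=$(printf '%s\n' "$line" | cut -b 1-3,7)

# The same list in a different order: cut sorts the positions first,
# so the output is identical.
bytes_reordered=$(printf '%s\n' "$line" | cut -b 7,1-3)

# -d sets the delimiter, -f picks the field.
second=$(printf '%s\n' "$line" | cut -d ':' -f 2)

# --complement (GNU cut): everything except field 2.
rest=$(printf '%s\n' "$line" | cut -d ':' -f 2 --complement)

printf '%s\n' "$bytes" "$bytes_reordered" "$second" "$rest"
```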




Common Linux commands: sort and its powerful -k option
2015-11-29T06:33:06.000Z
Sample data
$ cat facebook.txt
google 110 5000
baidu 100 5000
guge 50 3000
sohu 100 4500

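The first field is the company name, the second the number of employees and the third the average salary.
Sorting
Sort the file alphabetically by company name, that is, by the first field (facebook.txt has three fields):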
$ sort -t ' ' -k 1 facebook.txt
baidu 100 5000
google 110 5000
guge 50 3000
sohu 100 4500

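As you can see, -k 1 is all it takes. (Strictly speaking this is not quite precise, as you will see later.)
If the data has a header that should not be sorted but should still be kept at the top of the sorted output, what then?
## if you have two header lines and want to keep both of them:
(sed -n '1,2p' your_file; sed '1,2d' your_file | sort) > sort_header.txt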

Sort facebook.txt by number of employees:
$ sort -n -t ' ' -k 2 facebook.txt
guge 50 3000
baidu 100 5000
sohu 100 4500
google 110 5000

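There is a catch, though: baidu and sohu both have 100 employees. What happens then? When the keys tie, sort falls back to comparing the whole line in ascending order, starting from the first field, so baidu comes before sohu.
Sort facebook.txt by number of employees, and sort companies with the same number of employees by average salary in ascending order: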
$ sort -n -t ' ' -k 2 -k 3 facebook.txt
guge 50 3000
sohu 100 4500
baidu 100 5000
google 110 5000

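Adding -k 2 -k 3 solves the problem. sort supports setting field priorities this way: it sorts by field 2 first and, on a tie, by field 3. (You can keep going and set as many priority levels as you like.)
Sort facebook.txt by salary in descending order, and sort companies with the same salary by number of employees in ascending order: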
$ sort -n -t ' ' -k 3r -k 2 facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500
guge 50 3000

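A small trick is used here: look closely and you will see a lowercase r slipped in after -k 3. The r modifier has the same effect as the -r option, namely reverse order. Since sort is ascending by default, the r is needed to sort the third field (average salary) in descending order. You can also add n to say that this field should be compared by numeric value. For example: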
$ sort -t ' ' -k 3nr -k 2n facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500
guge 50 3000

Here the leading -n option has been removed and added to each -k option instead.
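As a self-contained sketch of per-key modifiers, the script below recreates the sample file in a scratch directory and runs two multi-key sorts. Each key is written in the closed form F,F with its own n or r modifier; LC_ALL=C is set only to make the comparison rules predictable across machines.

```shell
work=$(mktemp -d)
cat > "$work/facebook.txt" <<'EOF'
google 110 5000
baidu 100 5000
guge 50 3000
sohu 100 4500
EOF

# Salary descending (numeric), then headcount ascending (numeric).
by_salary_desc=$(LC_ALL=C sort -t ' ' -k 3,3nr -k 2,2n "$work/facebook.txt")

# Headcount ascending, then salary ascending.
by_size_then_salary=$(LC_ALL=C sort -t ' ' -k 2,2n -k 3,3n "$work/facebook.txt")

printf '%s\n\n%s\n' "$by_salary_desc" "$by_size_then_salary"
rm -rf "$work"
```

Writing each key as F,F keeps a key from silently running on to the end of the line.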
Syntax of the -k option
Going any deeper requires a little theory. The syntax of the -k option is:
[ FStart [ .CStart ] ] [ Modifier ] [ , [ FEnd [ .CEnd ] ][ Modifier ] ]

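The comma (",") splits this syntax into two parts, the Start part and the End part.
First, one idea to absorb: "if the End part is not given, End is taken to be the end of the line". This matters a great deal, yet it is easy to overlook.
The Start part itself has three components. The Modifier component is what we have already met, options such as n and r. Let us concentrate on FStart and .CStart.
.CStart can be omitted, in which case the key starts at the beginning of the field. The -k 2 and -k 3 in the earlier examples omitted .CStart.
In FStart.CStart, FStart is the field to use and CStart is the character within that field at which the sort key begins.
Likewise, the End part can be given as FEnd.CEnd. If .CEnd is omitted, the key ends at the end of the field, that is, at its last character. Setting CEnd to 0 (zero) also means the end of the field.
Sort by company name starting from its second letter: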
$ sort -t ' ' -k 1.2 facebook.txt
baidu 100 5000
sohu 100 4500
google 110 5000
guge 50 3000

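-k 1.2 sorts on the string that starts at the second character of the first field and, because no End part is given, runs to the end of the line. baidu comes first because its second letter is a. The second character of both sohu and google is o, but the h of sohu comes before the o of google, so they take second and third place. guge has to settle for fourth.
Sort on the second letter of the company name only, and on a tie sort by salary in descending order: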
$ sort -t ' ' -k 1.2,1.2 -k 3,3nr facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500
guge 50 3000

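Because only the second letter is to be compared, we wrote -k 1.2,1.2, which means "only" the second letter. (If you ask "why not just -k 1.2?", the answer is that omitting the End part means sorting on the string from the second letter to the end of the line.) For the salary we also wrote -k 3,3, which is the most precise form and means sorting on "only" this field; if the trailing 3 were omitted, the key would become "everything from field 3 to the end of the last field".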
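The difference between an open-ended key and a closed one can be seen side by side. This sketch feeds the four sample lines through sort twice; LC_ALL=C is an assumption made here so that characters compare by byte value.

```shell
data=$(printf '%s\n' 'google 110 5000' 'baidu 100 5000' 'guge 50 3000' 'sohu 100 4500')

# Open-ended key: from the 2nd character of field 1 to the end of the line,
# so "ohu ..." is compared with "oogle ..." and sohu wins on the h.
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 1.2)

# Closed key: only the 2nd character of field 1. google and sohu tie on "o",
# and sort breaks the tie by comparing the whole lines, so google comes first.
closed_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 1.2,1.2)

printf '%s\n\n%s\n' "$open_key" "$closed_key"
```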
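Which other options can appear in the Modifier part?
b, d, f, i, n and r.
You already know n and r well.
b ignores leading blanks in the field.
d sorts the field in dictionary order (that is, only blanks and alphanumeric characters are considered).
f ignores case when sorting the field.
i ignores non-printing characters and sorts on printable characters only. (Some ASCII characters are non-printing, for example \a is the bell, \b backspace, \n newline, \r carriage return, and so on.)
An example to think about: combining -k and -u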
$ cat facebook.txt
google 110 5000
baidu 100 5000
guge 50 3000
sohu 100 4500

This is the original facebook.txt.
$ sort -n -k 2 facebook.txt
guge 50 3000
baidu 100 5000
sohu 100 4500
google 110 5000
$ sort -n -k 2 -u facebook.txt
guge 50 3000
baidu 100 5000
google 110 5000

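When the file is sorted numerically on the employee field with -u added, the sohu line is removed! It turns out that -u looks only at the fields set with -k: when they are equal, every later line with the same key is dropped.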
$ sort  -k 1 -u facebook.txt
baidu 100 5000
google 110 5000
guge 50 3000
sohu 100 4500

$ sort  -k 1.1,1.1 -u facebook.txt
baidu 100 5000
google 110 5000
sohu 100 4500

The same applies here: guge, whose first character is also g, does not escape.
$ sort -n -k 2 -k 3 -u facebook.txt
guge 50 3000
sohu 100 4500
baidu 100 5000
google 110 5000

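Well! With two levels of sort priority, -u removed nothing. It turns out that -u weighs all the -k options together and removes a line only when every key is equal; a difference at any one level is enough to keep it :) (If you don't believe it, add a line sina 100 4500 and try.)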
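The way -u depends on the keys can be confirmed by counting surviving lines. The sketch below uses the same four sample lines plus, in the last command, the extra sina line suggested above; it counts lines only and does not rely on which of two tied lines is kept.

```shell
data=$(printf '%s\n' 'google 110 5000' 'baidu 100 5000' 'guge 50 3000' 'sohu 100 4500')

# One key (headcount): baidu and sohu tie on 100, so one of them is dropped.
one_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 2,2n -u | wc -l | tr -d ' ')

# Two keys (headcount, salary): no two lines agree on both, so all four stay.
two_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 2,2n -k 3,3n -u | wc -l | tr -d ' ')

# Add "sina 100 4500": it ties with sohu on both keys, so five lines become four.
with_sina=$(printf '%s\n' "$data" 'sina 100 4500' | LC_ALL=C sort -t ' ' -k 2,2n -k 3,3n -u | wc -l | tr -d ' ')

printf '%s %s %s\n' "$one_key" "$two_keys" "$with_sina"
```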
The strangest sort of all:
$ sort -n -k 2.2,3.1 facebook.txt
guge 50 3000
baidu 100 5000
sohu 100 4500
google 110 5000

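This sorts on the part from the second character of the second field to the first character of the third field.
The first line yields 0 3, the second 00 5, the third 00 4 and the fourth 10 5.
sort considers 0 less than 00, less than 000, less than 0000, and so on.
So 0 3 certainly comes first and 10 5 certainly comes last. But why does 00 5 come before 00 4? (Try it yourself and think it over.)
The answer: "the cross-field setting is an illusion". sort compares only the part from the second character of the second field to the last character of that field, and does not bring the first character of the third field into the comparison. When it finds 00 equal to 00, sort automatically falls back to comparing the first field, so of course baidu comes before sohu. One example confirms this: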
$ sort -n -k 2.2,3.1 -k 1,1r facebook.txt
guge 50 3000
sohu 100 4500
baidu 100 5000
google 110 5000

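Sometimes you will see things like +1 -2 after the sort command. What are they?
The current sort documentation explains this syntax as follows:
On older systems, 'sort' supports an obsolete origin-zero syntax '+POS1 [-POS2]' for specifying sort keys. POSIX 1003.1-2001 does not allow this; use '-k' instead.
So this ancient notation has been retired, and you may now look down on scripts that use it with a clear conscience!
(In case old scripts are still around, here is how the notation works: the plus sign marks the Start part and the minus sign the End part. Most important of all, it counts from 0: what we have called the first field is field 0 here, and what we called the second character is character 1.)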
Other useful options
V (uppercase): natural sorting of mixed letters and numbers
$ cat example2.bed
chr2    15      19
chr22   32      46
chr10   31      47
chr1    34      49
chr11   6       16
chr2    17      22
chr2    27      46
chr10   30      42
$ sort -k1,1 -k2,2n example2.bed
chr1    34      49
chr10   30      42
chr10   31      47
chr11   6       16
chr2    15      19
chr2    17      22
chr2    27      46
chr22   32      46
$ sort -k1,1V -k2,2n example2.bed
chr1    34      49
chr2    15      19
chr2    17      22
chr2    27      46
chr10   30      42
chr10   31      47
chr11   6       16
chr22   32      46

V is available only in GNU sort; it sorts chromosome names in natural alphanumeric order.
Setting the temporary directory
-T, --temporary-directory=DIR
             use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories

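By default sort writes its temporary files to "/tmp"; if the temporary files produced during a sort grow too large, they can fill up "/tmp".

Contribution from:
This original article belongs to the blog 《Linux大棚》 at http://roclinux.cn. The author is rocrocket.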



sed: a brief summary of the stream editor
2015-11-29T05:29:36.000Z

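Introduction to sed
sed is a stream editor that processes one line at a time. The line being processed is held in a temporary buffer called the "pattern space"; sed commands are applied to the buffer contents and, when processing is finished, the buffer is written to the screen. The next line is then processed, and this repeats until the end of the file. The file itself is not changed unless you redirect the output to store it.
sed command forms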
sed [options] 'command' file(s)    
sed [options] -f scriptfile file(s)

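The sed command
The command consists of two parts: one that selects the range and one that says what to do.
Selecting the range
1 By line number: for example 3,5 means lines 3, 4 and 5; 5,$ means line 5 through the last line;
2 By pattern match: for example /^[^dD]/ matches lines that do not start with d or D;
For the action part there are many commands; the most common are:
d  delete the line
p  print the line
r  read in the contents of the named file
w  write to the named file: sed '/200[4-6]/w new.txt' mysed.txt (w new.txt writes the lines of mysed.txt that contain 2004, 2005 or 2006 to new.txt)
a\ insert the given text "below" the matching line: sed '/2004/a\China' mysed.txt
i\ insert the given text "above" the matching line: sed '/2004/i\China' mysed.txt
y  replace each character in the first list with the corresponding character in the second list: sed 'y/eijng/EIJNG/' mysed.txt
n; process the line after the matching line: sed '/2004/{n;y/eijng/EIJNG/;}' mysed.txt (finds lines containing 2004 and upper-cases the letters eijng in the line below each of them. The "n;" here "moves to the next line"; what n actually does is load the next line into the pattern space)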
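Several of the commands above can be exercised on a small scratch file. The contents of mysed.txt below are made up for this sketch (the post does not show the real file), and the w command assumes the scratch path contains no whitespace.

```shell
work=$(mktemp -d)
printf '%s\n' 'Beijing 2003' 'Beijing 2004' 'Beijing 2005' 'Beijing 2008' > "$work/mysed.txt"

# p with -n: print only lines 2 to 3.
lines_2_3=$(sed -n '2,3p' "$work/mysed.txt")

# d: delete the lines that match a pattern.
no_2004=$(sed '/2004/d' "$work/mysed.txt")

# y: transliterate character by character (first line only, for brevity).
upper=$(sed -n '1p' "$work/mysed.txt" | sed 'y/eijng/EIJNG/')

# n: after a line containing 2004, load the next line and transform that one.
after_2004=$(sed '/2004/{n;y/eijng/EIJNG/;}' "$work/mysed.txt")

# w: write matching lines to another file.
sed -n "/200[4-6]/w $work/new.txt" "$work/mysed.txt"
written=$(cat "$work/new.txt")

printf '%s\n' "$after_2004"
rm -rf "$work"
```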
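sed options
-e command, --expression=command  allows multiple edits, e.g. sed -n -e '1,2p' -e '4p' mysed.txt
-h, --help                        print help and show the address for bug reports.
-n, --quiet, --silent             suppress the default output (by default every line is printed); with -n only lines explicitly printed by a command are output.
-f, --file=script-file            read sed commands from the named script file.
-V, --version                     print version and copyright information.
-r, --regexp-extended             use extended regular expressions; without -r the regular expression needs many \ escapes, whereas with -r it can be written directly without them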

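sed metacharacters
^  anchors the start of a line, e.g. /^sed/ matches every line that starts with sed.
$  anchors the end of a line, e.g. /sed$/ matches every line that ends with sed.
.  matches any single character other than newline, e.g. /s.d/ matches s followed by any one character and then d.
*  matches zero or more of the preceding character, e.g. / *sed/ matches lines in which zero or more spaces are followed by sed.
[] matches one character from the given set, e.g. /[Ss]ed/ matches sed and Sed.
[^] matches one character not in the given set, e.g. /[^A-RT-Z]ed/ matches lines containing a character other than A-R and T-Z followed by ed.
\(..\) saves the matched characters, e.g. s/\(love\)able/\1rs/ turns loveable into lovers.
&  stands for the matched text in the replacement, e.g. s/love/**&**/ turns love into **love**.
\< anchors the start of a word, e.g. /\<love/ matches lines containing a word that starts with love.
\> anchors the end of a word, e.g. /love\>/ matches lines containing a word that ends with love.
x\{m\}  repeats character x exactly m times, e.g. /o\{5\}/ matches lines containing 5 consecutive o's.
x\{m,\} repeats character x at least m times, e.g. /o\{5,\}/ matches lines with at least 5 consecutive o's.
x\{m,n\} repeats character x at least m and at most n times, e.g. /o\{5,10\}/ matches lines with 5 to 10 consecutive o's.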

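sed examples
Example 1: show lines 20 to 30 of the file test: sed -n '20,30p' test
Example 2: in every line starting with d or D, change every lowercase x to uppercase X: sed '/^[dD]/s/x/X/g' test
Example 3: delete the last two characters of every line: sed 's/..$//' test
Example 4: delete the first two characters of every line: sed 's/..//' test
Example 5
$ cat mysed.txt
Beijing Beijing Beijing Beijing
$ sed 's/\(Beijing\)\(.*\)\(Beijing\)/\12008\2\32008/' mysed.txt
Beijing2008 Beijing Beijing Beijing2008
## This command is a little more involved. It relies on saved matches: text matched between \( and \) is stored in order in \1, \2 and so on, and can then be recalled with \ followed by the number. This example stores three pieces, matching Beijing, .* and Beijing respectively.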
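The saved-match mechanism can be tried without any file by piping a string into sed. Apart from the Beijing line, the input strings below are made up for this sketch.

```shell
# Three saved groups: first Beijing, the greedy middle, last Beijing.
tagged=$(printf '%s\n' 'Beijing Beijing Beijing Beijing' |
    sed 's/\(Beijing\)\(.*\)\(Beijing\)/\12008\2\32008/')

# Saved groups can be recalled in any order, here to swap two words.
swapped=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

# & recalls the whole match without any grouping.
starred=$(printf '%s\n' 'I love sed' | sed 's/love/**&**/')

printf '%s\n' "$tagged" "$swapped" "$starred"
```

Because .* is greedy, the second group swallows the two middle words, which is why only the first and last Beijing gain the 2008 suffix.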

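Any character: sed -n '/.ing/p' temp.txt     note that the pattern is .ing, not ing
's/^[ ][ ]*//g'    delete leading spaces
'/^$/d'            delete empty lines
's/COL\(...\)//g'  delete COL together with the three characters after it
More: http://www.grymoire.com/Unix/Sed.html
Contribution from: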
http://roclinux.cn/?p=1363
http://www.iteye.com/topic/587673
http://www.cnblogs.com/emanlee/archive/2013/09/07/3307642.html



Installing Python modules without root privileges (easy_install and pip)
2015-11-27T05:17:35.000Z
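easy_install is a command shipped with the setuptools package developed by PEAK (Python Enterprise Application Kit), so using easy_install really means calling setuptools to install a module. Perl users are familiar with CPAN and Ruby users with Gems; the ez_setup tool that bootstraps setuptools, and the extended easy_install that comes with it, work with the "Cheeseshop" (Python Package Index, also known as "PyPI") to provide the same functionality. It makes it easy to download, build, install and manage Python packages automatically. [Baidu Baike]
Both easy_install and pip download and install packages from PyPI, Python's public package repository. pip is similar to yum on Red Hat and makes installing Python packages very convenient. This post describes in detail how to install and use pip.
Install setuptools first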
wget "https://bitbucket.org/pypa/setuptools/get/default.tar.gz#egg=setuptools-dev" --no-check-certificate
tar -xzvf default.tar.gz
cd pypa-setuptools-eb92fc5071bf  # the name depends on your extracted directory
python setup.py install

Install easy_install
wget https://pypi.python.org/pypi/ez_setup

Unpack and install.
python ez_setup.py

Installing a package with easy_install
easy_install <module to install>

Download pip
wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate

Install pip
tar -xzvf pip-1.5.4.tar.gz
cd pip-1.5.4
python setup.py install

Using pip
Installing a package with pip
pip install SomePackage
[...]
Successfully installed SomePackage

Showing an installed package with pip
pip show --files SomePackage
Name: SomePackage
Version: 1.0
Location: /my/env/lib/pythonx.x/site-packages
Files:
../somepackage/__init__.py
[...]

Checking which packages need updating with pip
pip list --outdated
SomePackage (Current: 1.0 Latest: 2.0)

Upgrading a package with pip
pip install --upgrade SomePackage
[...]
Found existing installation: SomePackage 1.0
Uninstalling SomePackage:
Successfully uninstalled SomePackage
Running setup.py install for SomePackage
Successfully installed SomePackage

Uninstalling a package with pip
pip uninstall SomePackage
Uninstalling SomePackage:
/my/env/lib/pythonx.x/site-packages/somepackage
Proceed (y/n)? y
Successfully uninstalled SomePackage

Common errors
ImportError: No module named setuptools

Solution: install setuptools



Genes and Isoforms
2015-11-25T11:44:00.000Z
Isoforms
Definition(s)
Different forms of a protein that may be produced from different genes, or from the same gene by alternative splicing.
Definition from: MeSH via Unified Medical Language System  at the National Library of Medicine
The protein products of different versions of messenger RNA created from the same gene by employing different promoters, which causes transcription to skip certain exons. Since the promoters are tissue-specific, different tissues express different protein products of the same gene.
Definition from: GeneReviews from the University of Washington and the National Center for Biotechnology Information
Related discussion in the Handbook
 Are fingerprints determined by genetics?

Gene
Definition(s)
The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.
Definition from: Physician Data Query via Unified Medical Language System at the National Library of Medicine
The basic unit of heredity, consisting of a segment of DNA arranged in a linear manner along a chromosome, which codes for a specific protein or segment of protein leading to a particular characteristic or function.
Definition from: GeneReviews from the University of Washington and the National Center for Biotechnology Information
The gene is the basic physical unit of inheritance. Genes are passed from parents to offspring and contain the information needed to specify traits. Genes are arranged, one after another, on structures called chromosomes. A chromosome contains a single, long DNA molecule, only a portion of which corresponds to a single gene. Humans have approximately 23,000 genes arranged on their chromosomes.
Definition from: Talking Glossary of Genetic Terms from the National Human Genome Research Institute
The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule).
Definition from: Human Genome Project Information at the U.S. Department of Energy
Related discussion in the Handbook
 What is a gene?           

 How do geneticists indicate the location of a gene?  

 Gene Therapy           

 What are gene families?           

See also Understanding Medical Terminology.



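A close reading of a resequencing paper
2015-11-14T12:57:42.000Z
When getting to know a new line of work or research topic, close reading of the literature is the first choice. Nature-family papers are technically strong and clearly organized, so reading them closely is well worth the effort.
Whole-genome resequencing sequences the genomes of different individuals of a species whose reference genome is already known, and then analyzes the differences between individuals or populations. By aligning the reads of each resequenced individual to the reference, one can find large numbers of single nucleotide polymorphisms (SNPs), insertions/deletions (InDels), and structural variations (SVs). Scanning the whole genome for sequence differences and structural variants associated with important traits supports analyses of genetic evolution and the prediction of candidate genes for those traits.
Given the importance of this field, I searched Google Scholar for the keyword resequence and found "Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection" and "Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean". The first paper is the one read closely here.
Summary of key points
Sequencing-related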

Approximately 5× depth and >90% coverage.
Previous reports have shown that the SNP calling accuracy from resequencing data is ~95–99% (Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456,60–65 (2008)；Xia, Q. et al. Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science 326, 433–436 (2009)).
The distribution of Tajima's D values was significantly higher, indicating a significant loss of rare SNPs, which may be due to reduced recombination within the LD blocks.
The divergence index (FST) allowed us to identify genomic regions with large FST values, which signified areas having a high degree of diversification. Subregions that have very high FST values may provide an indication of the functional genes or alleles involved.
A genome-wide sequencing comparison to reveal haplotype sharing could provide a unique tool to identify introgression events in the history of these cultivars.
Previous studies have indicated that whole genome duplication (WGD) events can cause gene loss and rapid functional diversification.
Soybean-related
They have exceptionally high linkage disequilibrium (LD) and a high ratio of average nonsynonymous versus synonymous nucleotide differences (Nonsyn/Syn). 
There was a recent history of introgression from wild soybean.
Human selection probably had a strong impact on the genetic diversity in the cultivated soybeans.
Genome-wide analyses showed the opposite: we found that the low-frequency alleles were less abundant among the wild as compared to the cultivated accessions.
In comparison with other crops, SNP analysis showed that the cultivated soybean exhibited a lower diversity (cultivated soybean: 1.89 × 10−3; rice: 2.29 × 10−3; corn: 6.6 × 10−3).
The average distance over which LD decays to half of its maximum value in soybean was substantially longer than that of all plants analyzed to date.
SNP analyses in the LD blocks showed that there was a lower SNP ratio in long LD blocks as compared to the whole genome in both wild and cultivated soybeans.
Allelic diversity in wild soybeans was higher than in cultivated soybeans across the entire genome.

Only ~3% of the total SNPs identified were present in coding regions. The remaining ~97% SNPs were in noncoding regions.
The presence of a higher Nonsyn/Syn value at the whole-genome level and more large-effect mutations suggested that the soybean genome had accumulated a higher ratio of deleterious mutations.
High LD(long LD blocks) would result in the lack of effective recombination; consequently, deleterious mutations could not be eliminated and would accumulate.

Selection signals during domestication and improvement.
Outline of the analysis
Evolutionary analyses, including a phylogenetic tree (iTO), principal component analysis (PCA), and population structure (Bayesian clustering analysis).
Whole-genome SNP analysis (using the parameter θπ) and the distribution of genome-wide diversity.

High linkage disequilibrium and genomewide patterns of nucleotide diversity(Selection and introgression).
Genome duplication (copy number variations, CNVs) and gene content variation.
Software used
STRUCTURE, Bayesian clustering program, http://pritchardlab.stanford.edu/structure.html

Haploview, LD analysis, https://www.broadinstitute.org/scientific-community/science/programs/medical-and-population-genetics/haploview/haploview
AUGUSTUS, gene annotation, http://bioinf.uni-greifswald.de/augustus/
GeneWise and Genomewise, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC479130/, http://www.ebi.ac.uk/Tools/psa/genewise/
GeneWise predicts gene structure using similar protein sequences, and Genomewise provides a gene structure final parse across cDNA- and EST-defined spliced structure. Both algorithms are heavily used by the Ensembl annotation system. The GeneWise algorithm was developed from a principled combination of hidden Markov models (HMMs). Both algorithms are highly accurate and can provide both accurate and complete gene structures when used with the correct evidence.
SOAP and SOAPsnp, Short Oligonucleotide Alignment Program (45-bp or 76-bp), http://soap.genomics.org.cn/
BWA, paired-end sequencing reads mapping, http://sourceforge.net/projects/bio-bwa/files/
SAMtools, SNP detection, http://www.htslib.org/, http://biobits.org/samtools_primer.html
Picard package, filtering of duplicated reads, http://picard.sourceforge.net/
BEDtools, coverage of sequence alignments, http://bedtools.readthedocs.org/en/latest/
Genome Analysis Toolkit (GATK), SNP/Indel calling, https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php
ANNOVAR, SNP annotation, http://annovar.openbioinformatics.org/en/latest/
EIGENSOFT, principal component analysis (PCA) of whole-genome SNPs, http://genetics.med.harvard.edu/reich/Reich_Lab/Software.html
PLINK, whole genome association analysis toolset, http://pngu.mgh.harvard.edu/~purcell/plink/
GAPIT, Genome Association and Prediction, http://www.maizegenetics.net/#!gapit/cmkv
manhattan plot, https://pods.iplantcollaborative.org/wiki/display/eot/Make+manhattan+plot+with+ggplot2+script, http://blog.how-to-code.info/r/Manhattan-plot.html
Supplementary links
Tajima's D, https://en.wikipedia.org/wiki/Tajima%27s_D; http://baike.baidu.com/link?url=hkRPQcUtBVMTVhMl2wzKGLy5QtDcrMwonUV7CspqxqdphkGztrSNFZLiUYazq6oz6rxZyVoy1YhHjexhi9Op9_.
Penn State University Center for Comparative Genomics and Bioinformatics，http://www.bx.psu.edu/miller_lab/




What's the probability that a significant p-value indicates a true effect?
2015-11-14T11:32:12.000Z
If the p-value is < .05, then the probability of falsely rejecting the null hypothesis is  <5%, right? That means, a maximum of 5% of all significant results is a false-positive (that’s what we control with the α rate).
Well, no. As you will see in a minute, the “false discovery rate” (aka. false-positive rate), which indicates the probability that a significant p-value actually is a false-positive, usually is much higher than 5%.
A common misconception about p-values
Oakes (1986) asked the following question to students and senior scientists:
You have a p-value of .01. Is the following statement true, or false?

You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

The answer is “false” (you will learn why it’s false below). But 86% of all professors and lecturers in the sample who were teaching statistics (!) answered this question erroneously with “true”. Gigerenzer, Krauss, and Vitouch replicated this result in 2000 in a German sample (here, the “statistics lecturer” category had 73% wrong). Hence, it is a widespread error to confuse the p-value with the false discovery rate.
The False Discovery Rate (FDR) and the Positive Predictive Value (PPV)
To answer the question “What’s the probability that a significant p-value indicates a true effect?”, we have to look at the positive predictive value (PPV) of a significant p-value. The PPV indicates the proportion of significant p-values which indicate a real effect amongst all significant p-values. Put in other words: Given that a p-value is significant: What is the probability (in a frequentist sense) that it stems from a real effect?
(The false discovery rate simply is 1-PPV: the probability that a significant p-value stems from a population with null effect).
That is, we are interested in a conditional probability Prob(effect is real | p-value is significant).
Inspired by Colquhoun (2014) one can visualize this conditional probability in the form of a tree-diagram (see below). Let’s assume we carry out 1000 experiments for 1000 different research questions. We now have to make a couple of prior assumptions (which you can make differently in the app we provide below). For now, we assume that 30% of all studies have a real effect and the statistical test used has a power of 35% with an α level set to 5%. That is, of the 1000 experiments, 300 investigate a real effect, and 700 a null effect. Of the 300 true effects, 0.35 × 300 = 105 are detected; the remaining 195 effects are non-significant false-negatives. On the other branch of 700 null effects, 0.05 × 700 = 35 p-values are significant by chance (false positives) and 665 are non-significant (true negatives).
This path is visualized here (completely inspired by Colquhoun, 2014):

Now we can compute the false discovery rate (FDR): 35 of (35+105) = 140 significant p-values actually come from a null effect. That means, 35/140 = 25% of all significant p-values do not indicate a real effect! That is much more than the alleged 5% level.
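The tree arithmetic above is easy to replay in R. The following is a minimal sketch, not code from the original post: `ppv_fdr` is a hypothetical helper name, and its arguments simply mirror the assumptions stated in the text (share of real effects, power, α, number of experiments).

```r
# PPV and FDR for a set of studies, following the tree-diagram logic:
# split the studies into real and null effects, then count significant results.
ppv_fdr <- function(prior, power, alpha, n = 1000) {
  n_real <- n * prior            # studies investigating a real effect
  n_null <- n * (1 - prior)      # studies investigating a null effect
  true_pos  <- power * n_real    # real effects that reach significance
  false_pos <- alpha * n_null    # null effects that are significant by chance
  ppv <- true_pos / (true_pos + false_pos)
  c(true_pos = true_pos, false_pos = false_pos, PPV = ppv, FDR = 1 - ppv)
}

# Assumptions from the text: 30% real effects, 35% power, alpha = 5%.
res <- ppv_fdr(prior = 0.30, power = 0.35, alpha = 0.05)
print(res)
```

With these inputs the function reproduces the 105 true positives, 35 false positives, and the 25% false discovery rate derived above; changing `prior` or `power` shows how strongly the FDR depends on them.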

 Contribution from ：http://www.r-bloggers.com/whats-the-probability-that-a-significant-p-value-indicates-a-true-effect/



Similarities and differences among principal component analysis, factor analysis, and cluster analysis
2015-11-11T13:33:50.000Z
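Basic introduction
Principal component analysis (PCA) converts many indicators into a few composite indicators and uses these composite indicators to explain the variance-covariance structure of the variables. The composite indicators are the principal components. The few principal components obtained should retain as much information from the original variables as possible and be uncorrelated with one another.
Factor analysis is a multivariate statistical method that studies how to condense many original variables into a few factor variables with minimal loss of information, and how to make those factor variables readily interpretable.
Cluster analysis groups and classifies a large amount of data according to the qualitative or quantitative characteristics of the experimental data themselves, in order to understand the internal structure of the data set, and describes each resulting group. The guiding principle is that samples assigned to the same group should be similar to one another, while samples in different groups should be sufficiently dissimilar.
The three methods differ but are also related. This post compares their similarities and differences and illustrates how they are connected in practice, in the hope of helping readers put these advanced statistical methods to better use in research.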
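Similarities and differences in the basic ideas
Common ground
Both principal component analysis and factor analysis use a small number of variables (factors) to summarize the main information in the original variables (factors). Although there are fewer new variables than original ones, they are usually chosen so that they retain more than 85% of the original information (a common rule of thumb), so even a few new variables can be trusted and can explain the problem effectively. The new variables are also mutually uncorrelated, which removes multicollinearity. The new variables produced by these two methods are not simply the original variables left over after screening. In principal component analysis, the final new variables are linear combinations of the original variables: given original variables x1, x2, ..., xp, a coordinate transformation applies a linear transformation to the p correlated variables xi, and each principal component is a linear combination of the original p variables. Among the principal components Zi, Z1 accounts for the largest share of the variance, which means it summarizes the original variables best; later components account for progressively smaller shares of the variance and summarize the original information less well. Factor analysis uses a few common factors to explain the complex relationships among a larger number of observed variables. It does not recombine the original variables; instead it decomposes each original variable into two parts, common factors and a specific factor. Common factors are the few factors shared by all variables; a specific factor is unique to each original variable. Once scores are computed for the new principal components or factors, the component scores or factor scores can replace the original variables in further analysis. Because there are far fewer components or factors than original variables, this reduces the dimensionality and makes the data easier to handle.
The basic idea of cluster analysis is to use multivariate statistics to quantify how close objects are to one another, taking into account the relationships among multiple factors and their dominant effects, and to assign the objects to different classes according to how close or distant they are, so that the classification is more objective and reflects the inherent relationships among things. In other words, cluster analysis treats the objects under study as points in a multidimensional space and divides them sensibly into several classes. It is therefore a method that progressively groups objects into classes according to the similarity between variables or regions, and it can objectively reflect the internal grouping of those variables or regions [3]. Cluster analysis explores relationships through a large symmetric matrix; it is a multivariate statistical method whose result is a set of clusters. Once the vectors have been clustered, the data also become easier to handle, so in a sense cluster analysis also reduces dimensionality.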
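Differences
Principal component analysis studies how to explain the variance-covariance structure of many variables with a few principal components, that is, how to find a few principal components (variables) that retain as much information from the original variables as possible and are mutually uncorrelated. It is a mathematical transformation: a linear transformation turns the given set of variables into a set of uncorrelated variables (pairwise correlation coefficients of 0, or equivalently random variables whose sample vectors are mutually orthogonal). The transformation keeps the total variance (the sum of the variances) unchanged. The new variable with the largest variance is called the first principal component, the one with the second-largest variance is the second principal component, and so on. If there are p variables, in practice one usually does not look for p principal components but stops at m (m < p) components, as long as these m components reflect the great majority of the variance of all the original variables. Principal component analysis can also serve as one of the extraction methods within factor analysis.
Factor analysis is a method for finding a model of latent factors that play a dominant role. It groups variables by the strength of their correlations so that variables within a group are highly correlated while variables in different groups are weakly correlated. Each group of variables represents a basic structure, called a common factor. For the problem under study, one then tries to describe each observed component as the sum of a linear function of the smallest possible number of unobservable common factors plus a specific factor. The new variables obtained by factor analysis amount to an internal dissection of each original variable. Factor analysis does not recombine the original variables; it decomposes them into common factors and specific factors. Concretely, it looks for how a set of directly measurable, correlated indicators is governed by a few factors that are meaningful in the subject area, not directly measurable, and relatively independent, so that the state of each factor can be inferred indirectly from measurements of the indicators. Factor analysis explains only part of the variation (the part shared through the common factors), whereas principal component analysis with all p components accounts for all of it.
A clustering algorithm takes n vectors in an m-dimensional space R and assigns each vector to one of k clusters so that the distance between each vector and its cluster center is minimized. Clustering can be understood as making within-cluster similarity as large as possible and between-cluster similarity as small as possible. As an unsupervised learning problem, its goal is to reveal some inherent regularity in the data by dividing the original set of objects into similar groups or clusters.
From the basic ideas of the three methods it is clear that cluster analysis does not produce new variables, whereas principal component analysis and factor analysis both do.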
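The PCA properties listed in this subsection (total variance preserved, mutually uncorrelated components, stopping at m < p components) can be checked numerically. The sketch below is an illustration added here, not part of the original post; it assumes the built-in iris data and base R's `prcomp`, and the 85% cut-off is only the rule of thumb quoted earlier.

```r
# PCA on the four numeric iris measurements, standardized first.
x <- iris[, 1:4]
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Variance of each component and its cumulative share of the total.
comp_var <- pca$sdev^2
share <- comp_var / sum(comp_var)
print(round(cumsum(share), 3))

# Total variance is preserved: each standardized variable has variance 1,
# so the component variances add up to the number of variables (here 4).
total_var <- sum(comp_var)

# Component scores are mutually uncorrelated (off-diagonal correlations ~ 0).
score_cor <- cor(pca$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

# Smallest number of components m that reaches the 85% rule of thumb.
m <- which(cumsum(share) >= 0.85)[1]
print(m)
```

The scores in `pca$x[, 1:m]` are the new variables that would replace the original four columns in a downstream analysis.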
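Comparison of data standardization
In principal component analysis, to remove the effects of units and orders of magnitude, the raw data usually need to be standardized into dimensionless data with mean 0 and variance 1. Factor analysis is less demanding in this respect, because the factors can be extracted by many methods, such as the principal factor method, weighted least squares, unweighted least squares, and the centroid method, and because the factors are internal drivers of each variable, so the solution depends little on whether the original variables share the same units. When the factors are extracted by the principal component method, however, standardization is still required. In practice, to avoid the influence of units or orders of magnitude as far as possible, it is advisable to standardize the data before factor analysis as well. When the factors are constructed by the principal component method, the indicator values are first standardized, so that the covariance matrix of the standardized data is the correlation matrix; its eigenvalues and eigenvectors are computed, and a composite evaluation function is then constructed for the assessment.
In cluster analysis, variables measured in different units can lead to wrong clustering results. The variable values must therefore be standardized before clustering to remove the effect of units. Different standardization methods lead to different clustering results, so the distribution of the variables should be taken into account. If the variables are normally distributed, z-scores should be used.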
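To see why units matter for clustering, the sketch below compares k-means on raw and z-scored data. It is an added illustration under stated assumptions: the built-in USArrests data, base R's `scale` and `kmeans`, and k = 3 chosen arbitrarily for the demonstration.

```r
# USArrests mixes variables on very different scales (Assault is in the
# hundreds, Murder in single or double digits), so distances on the raw
# data are dominated by Assault.
raw <- USArrests
z <- scale(raw)   # z-scores: each column gets mean 0 and standard deviation 1

set.seed(1)       # k-means starts from random centers
km_raw <- kmeans(raw, centers = 3, nstart = 25)
set.seed(1)
km_z <- kmeans(z, centers = 3, nstart = 25)

# Cross-tabulate the two partitions to see how much standardization changes them.
print(table(raw = km_raw$cluster, scaled = km_z$cluster))
```

If the two partitions agreed perfectly, every row of the table would have a single non-zero cell; cells spread across a row show states that switch clusters once the units are removed.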
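Strengths and weaknesses in application
Principal component analysis
Strengths: First, it uses dimensionality reduction to replace many original variables with a few composite variables that concentrate most of the information in the original variables. Second, it allows objective economic phenomena to be evaluated scientifically by computing a composite principal component score. Third, in application it emphasizes a composite evaluation of each variable's information contribution.
Weaknesses: When the loadings of a principal component have mixed positive and negative signs, the meaning of the composite evaluation function becomes unclear. The components are hard to name clearly.
Factor analysis
Strengths: First, it does not pick and choose among the original variables but recombines the information in them to find the common factors that influence the variables, which simplifies the data. Second, rotation makes the factors more interpretable, so they are easier to name clearly.
Weaknesses: Factor scores are computed by least squares, which can sometimes fail.
Cluster analysis
Strengths: The model is intuitive and its conclusions are simple in form.
Weaknesses: With large samples, reaching a clustering conclusion can be difficult. Similarity coefficients are indicators built from the subjects' responses to reflect the internal relationships among subjects. In practice, the response data sometimes show a close relationship between subjects even though there is no real connection between the things themselves. In that case a clustering based on distances or similarity coefficients is clearly inappropriate, yet the cluster analysis model itself cannot detect this kind of error.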
Contribution from ：http://blog.sina.com.cn/s/blog_66d362d70101fiuj.html



Cluster analysis with the cluster package in R
2015-11-11T13:05:33.000Z
Description
Methods for Cluster analysis. Much extended the original from Peter Rousseeuw, Anja Struyf and Mia Hubert, based on Kaufman and Rousseeuw (1990) "Finding Groups in Data".
Cluster analysis: classify individuals or samples (individuals, objects or subjects) according to their characteristics so that individuals within the same class are as homogeneous as possible, while the classes are as heterogeneous as possible from one another.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Typical cluster models

Connectivity models: for example hierarchical clustering builds models based on distance connectivity.

Centroid models: for example the k-means algorithm represents each cluster by a single mean vector.

Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.

Density models: for example DBSCAN and OPTICS define clusters as connected dense regions in the data space.

Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.

Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.

Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.


Algorithms
Connectivity based clustering (hierarchical clustering)
Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect “objects” to form “clusters” based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name “hierarchical clustering” comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don’t mix.
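As a small illustration of the hierarchy described above, the following sketch builds a dendrogram with base R and cuts it into flat clusters. It is an added example, not taken from the post; the USArrests data, complete linkage, k = 4 and the cut height of 3 are all arbitrary assumptions.

```r
# Hierarchical clustering on standardized data.
z <- scale(USArrests)
d <- dist(z)                          # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # complete linkage: cluster distance = maximum pairwise distance

# plot(hc) would draw the dendrogram: merge height on the y-axis,
# observations along the x-axis.

# The hierarchy has no single partition; cut it by number of clusters ...
groups_k <- cutree(hc, k = 4)
print(table(groups_k))

# ... or by the distance at which merging stops.
groups_h <- cutree(hc, h = 3)
print(length(unique(groups_h)))
```

Cutting the same tree at different heights yields different partitions, which is exactly the "different clusters at different distances" behaviour the paragraph describes.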

Centroid-based clustering
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized.

Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution. A convenient property of this approach is that this closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.

Density-based clustering
The most popular density based clustering method is DBSCAN. In contrast to many newer methods, it features a well-defined cluster model called “density-reachability”. Similar to linkage based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius.

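From the above it is clear that cluster analysis has much in common with principal component analysis (PCA), so before starting, see the post on this blog comparing principal component analysis, factor analysis, and cluster analysis; for details on PCA, see the PCA post on this blog.
Cluster analysis in R
The cluster package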
> library(cluster)
> # load the required data
> data(votes.repub)
> votes.diss <- daisy(votes.repub)
> pamv <- pam(votes.diss, 2, diss = TRUE)
> clusplot(pamv, shade = TRUE)
> ## is the same as
> votes.clus <- pamv$clustering
> clusplot(votes.diss, votes.clus, diss = TRUE, shade = TRUE)
> ##Remove the dotted line
> clusplot(votes.diss, votes.clus, diss = TRUE)
> ## show label
> op <- par(new=TRUE, cex = 0.6)
> clusplot(votes.diss, votes.clus, diss = TRUE,
+          axes=FALSE,ann=FALSE, sub="", col.p=NA, col.txt="dark green", labels=3)
> par(op)


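Parameter notes:
daisy: Dissimilarity Matrix Calculation. Computes all the pairwise dissimilarities (distances) between observations in the data set.
Dissimilarity matrix: a dissimilarity matrix is an object-by-object representation of the data. Most clustering algorithms are built on a dissimilarity matrix, so data supplied as a data matrix must first be converted into one. The similarity or dissimilarity between objects is computed from the distance between the two objects.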
daisy(x, metric = c("euclidean", "manhattan", "gower"),stand = FALSE, type = list(), weights = rep.int(1, p))

x—numeric matrix or data frame, of dimension n*p.
metric—"euclidean" (the default), "manhattan" and "gower". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences. "Gower's distance" is chosen by metric "gower" or automatically if some columns of x are not numeric.
stand—logical flag: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation.
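A tiny hand-checkable example makes the metric definitions concrete. This is an added sketch that assumes the cluster package is installed (it ships with standard R distributions); the toy data and the object names `x`, `mixed`, `d_euc`, `d_man` and `d_gow` are made up for illustration.

```r
library(cluster)

# Two observations on two variables: the points (0, 0) and (3, 4).
x <- rbind(a = c(v1 = 0, v2 = 0), b = c(v1 = 3, v2 = 4))

d_euc <- daisy(x, metric = "euclidean")   # sqrt(3^2 + 4^2)
d_man <- daisy(x, metric = "manhattan")   # |3| + |4|
print(as.vector(d_euc))
print(as.vector(d_man))

# With a non-numeric column, Gower's distance averages per-variable
# dissimilarities: range-scaled absolute differences for numeric columns,
# simple mismatch (0 or 1) for factors.
mixed <- data.frame(len = c(1, 2, 10), type = factor(c("u", "u", "v")))
d_gow <- daisy(mixed, metric = "gower")
print(round(as.vector(d_gow), 3))
```

The result of `daisy` can be passed straight to `pam(..., diss = TRUE)`, as in the votes.repub example above.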
pam: Partitioning (clustering) of the data into k clusters "around medoids", a more robust version of K-means.
pam(x, k, diss = inherits(x, "dist"), metric = "euclidean",
medoids = NULL, stand = FALSE, cluster.only = FALSE,
do.swap = TRUE,
keep.diss = !diss && !cluster.only && n < 100,
keep.data = !diss && !cluster.only,
pamonce = FALSE, trace.lev = 0)

x—data matrix or data frame, or dissimilarity matrix or object. In case of a dissimilarity matrix, x is typically the output of daisy or dist.
k—positive integer specifying the number of clusters, less than the number of observations.
diss—logical flag: if TRUE (default for dist or dissimilarity objects), then x will be considered as a dissimilarity matrix. If FALSE, then x will be considered as a matrix of observations by variables.
metric—"euclidean" and "manhattan". If x is already a dissimilarity matrix, then this argument will be ignored.
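The argument k has to be chosen by the user. One common heuristic, shown in the sketch below, is to compare the average silhouette width that `pam` reports for several values of k. This is an added illustration rather than part of the original post; the USArrests data and the range k = 2..6 are assumptions.

```r
library(cluster)

z <- scale(USArrests)   # standardize so no variable dominates the distances

# Average silhouette width for k = 2..6; larger values indicate
# better-separated clusters.
ks <- 2:6
avg_sil <- sapply(ks, function(k) pam(z, k)$silinfo$avg.width)
names(avg_sil) <- ks
print(round(avg_sil, 3))

best_k <- ks[which.max(avg_sil)]
fit <- pam(z, best_k)

# Unlike k-means centers, medoids are actual observations from the data.
print(rownames(fit$medoids))
```

Because each medoid is a real observation (here a state), the clusters can be described by naming their representative members.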
clusplot: Bivariate Cluster Plot (clusplot) Default Method
clusplot(x, clus, diss = FALSE,
s.x.2d = mkCheckX(x, diss), stand = FALSE,
lines = 2, shade = FALSE, color = FALSE,
labels= 0, plotchar = TRUE,
col.p = "dark green", col.txt = col.p,
col.clus = if(color) c(2, 4, 6, 3) else 5, cex = 1, cex.txt = cex,
span = TRUE,
add = FALSE,
xlim = NULL, ylim = NULL,
main = paste("CLUSPLOT(", deparse(substitute(x)),")"),
sub = paste("These two components explain",
round(100 * var.dec, digits = 2), "% of the point variability."),
xlab = "Component 1", ylab = "Component 2",
verbose = getOption("verbose"),
...)

x—data matrix or data frame, or dissimilarity matrix or object. In case of a dissimilarity matrix, x is typically the output of daisy or dist.
clus—clus is often the clustering component of the output of pam.
diss—logical flag: if TRUE (default for dist or dissimilarity objects), then x will be considered as a dissimilarity matrix. If FALSE, then x will be considered as a matrix of observations by variables.
lines—lines = 0, no distance lines will appear on the plot; lines = 1, the line segment between m1 and m2 is drawn; lines = 2, a line segment between the boundaries of E1 and E2 is drawn (along the line connecting m1 and m2).
shade—logical flag: if TRUE, then the ellipses are shaded in relation to their density.
color—logical flag: if TRUE, then the ellipses are colored with respect to their density.
labels—labels = 0, no labels are placed in the plot; labels = 1, points and ellipses can be identified in the plot (see identify); labels = 2, all points and ellipses are labelled in the plot; labels = 3, only the points are labelled in the plot; labels = 4, only the ellipses are labelled in the plot; labels = 5, the ellipses are labelled in the plot, and points can be identified.
col.p—color code(s) used for the observation points.
More resources
https://cran.r-project.org/web/views/Cluster.html



Computing the lengths of FASTA sequences in batch
2015-11-08T14:14:22.000Z
Computing FASTA sequence lengths with awk on Linux
The FASTA sequence file data.fa
>Gorai.004G111100.1
ATGGGTACTGCTCCAACCCAGTGCCCTTCTGGAATCACTGCAAATTTCCACGCCAAATTTGATAACAGAACTGAGTTTTC
>Gorai.004G111100.2
ATGTTTTTCATGCTCCGGTGGACAAGATACTCTGGGATGCCGGGGAACAGTTTTTCCTTTTCTTGGCAGACATATGCACATAAAATTCTT
>Gorai.004G111100.3
ATGGGTACTGCTCCAACCCAGTGCCCTTCTGGAATCACTGCAAATTTCCAC
>Gorai.004G111100.4
ATGGGAATGCATGAACTAGCAGCCAAAGTTGATGAGT

First put each record on a single line (header, a % separator, then the sequence) with the following command:
awk '/^>/&&NR>1{print "";}{ printf "%s",/^>/ ? $0"%":$0 }'  data.fa >data2.fa

Result:
>Gorai.004G111100.1%ATGGGTACTGCTCCAACCCAGTGCCCTTCTGGAATCACTGCAAATTTCCACGCCAAATTTGATAACAGAACTGAGTTTTC
>Gorai.004G111100.2%ATGTTTTTCATGCTCCGGTGGACAAGATACTCTGGGATGCCGGGGAACAGTTTTTCCTTTTCTTGGCAGACATATGCACATAAAATTCTT
>Gorai.004G111100.3%ATGGGTACTGCTCCAACCCAGTGCCCTTCTGGAATCACTGCAAATTTCCAC
>Gorai.004G111100.4%ATGGGAATGCATGAACTAGCAGCCAAAGTTGATGAGT

Length calculation:
awk -F"%" '{print $1"\t"length($2)}'  data2.fa >data3.fa

Result:
>Gorai.004G111100.1 80
>Gorai.004G111100.2 90
>Gorai.004G111100.3 51
>Gorai.004G111100.4 37

More: Question: Multiline Fasta To Single Line Fasta
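The lengths computed by the awk pipeline can also be cross-checked in R. The sketch below is an added alternative, not part of the original recipe: `fasta_lengths` is a hypothetical helper, it assumes the first non-empty line is a header, and the first record is deliberately split over two lines to show that multi-line sequences are handled. In practice the input would come from `readLines("data.fa")`.

```r
# Return a named integer vector of sequence lengths from FASTA text lines.
fasta_lengths <- function(lines) {
  lines <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record index for every line
  ids <- sub("^>", "", lines[is_header])
  # Sum the characters of all sequence lines that belong to each record.
  len <- tapply(nchar(lines[!is_header]), record[!is_header], sum)
  setNames(as.integer(len), ids[as.integer(names(len))])
}

fa <- c(
  ">Gorai.004G111100.1",
  "ATGGGTACTGCTCCAACCCAGTGCCCTTCTGGAATCACTG",
  "CAAATTTCCACGCCAAATTTGATAACAGAACTGAGTTTTC",
  ">Gorai.004G111100.2",
  "ATGTTTTTCATGCTCCGGTGGACAAGATACTCTGGGATGCCGGGGAACAGTTTTTCCTTTTCTTGGCAGACATATGCACATAAAATTCTT",
  ">Gorai.004G111100.3",
  "ATGGGTACTGCTCCAACCCAGTGCCCTTCTGGAATCACTGCAAATTTCCAC",
  ">Gorai.004G111100.4",
  "ATGGGAATGCATGAACTAGCAGCCAAAGTTGATGAGT"
)

lens <- fasta_lengths(fa)
print(lens)
```

The four lengths match the awk result above (80, 90, 51 and 37).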



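Basic functions in R
2015-11-08T08:53:00.000Z
Every now and then you need to do some quick data processing in R, so knowing a few basic functions is essential. This post will keep collecting short, handy R functions; used correctly, they can achieve a lot with very little effort. Comments with additions are welcome.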
aggregate
Usage: aggregate(formula, data, FUN)
First split the data into groups (by row), then apply a summary function to each group, and finally combine the results into a tidy table.
> head(chickwts)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean
> unique(chickwts$feed)
[1] horsebean linseed   soybean   sunflower meatmeal  casein   
Levels: casein horsebean linseed meatmeal soybean sunflower
> #aggregate(chickwts$weight, by=list(chickwts$feed), FUN=mean)
> aggregate(weight ~ feed, data = chickwts, mean)
       feed   weight
1    casein 323.5833
2 horsebean 160.2000
3   linseed 218.7500
4  meatmeal 276.9091
5   soybean 246.4286
6 sunflower 328.9167
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> unique(iris$Species)
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica
> aggregate(. ~ Species, data = iris, mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

paste
Usage: paste(..., sep = " ", collapse = NULL)
String concatenation
> paste("CK", 1:6, sep = "")
[1] "CK1" "CK2" "CK3" "CK4" "CK5" "CK6"
> # set the collapse argument to join everything into a single string
> paste("CK", 1:6, sep = "", collapse = "-")
[1] "CK1-CK2-CK3-CK4-CK5-CK6"

When no separator is specified, paste uses a single space as the default separator, while paste0 uses the empty string.
strsplit
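Usage: strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
Splits strings and returns a list.
Parameter notes:
x is a character vector; each element is split separately.
split is a character vector giving the split points (the delimiter). By default it is matched as a regular expression (fixed = FALSE). If you are not familiar with regular expressions, set fixed = TRUE so that split is matched as literal text. Literal matching is also faster.
perl = TRUE/FALSE selects whether Perl-compatible regular expressions are used. For long regular expressions, a correctly written pattern together with perl = TRUE can speed up matching.
useBytes sets whether matching is done byte by byte. The default is FALSE, meaning matching is done by character rather than by byte.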
> text <- "Hello Adam!\nHello Ava!"
> strsplit(text, "\\s")
[[1]]
[1] "Hello" "Adam!" "Hello" "Ava!"

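Take care when applying this function to a vector: the result is a list with one element per input string.
# split every element of the vector and take the first piece of each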
unlist(lapply(X = c("abc", "bcd", "dfafadf"), FUN = function(x) {return(strsplit(x, split = "")[[1]][1])}))
[1] "a" "b" "d"
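A more compact idiom for the same task is to index every list element with `sapply` and the `[` function. The sketch below is an added illustration; the identifiers reuse the Gorai naming pattern from the FASTA example and are otherwise arbitrary.

```r
ids <- c("Gorai.004G111100.1", "Gorai.004G111100.2", "Gorai.004G111100.4")

# fixed = TRUE treats "." as a literal dot rather than the regex "any character".
parts <- strsplit(ids, ".", fixed = TRUE)

# `[` is an ordinary function, so sapply can pull the same position from every element.
gene <- sapply(parts, `[`, 2)                        # second piece of each ID
isoform <- sapply(parts, function(p) p[length(p)])   # last piece of each ID

print(gene)
print(isoform)
```

Without `fixed = TRUE` the dot would match every character and each element would be split into empty strings, which is the usual pitfall with this function.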

grep/regexpr/gregexpr/regexec
Usage: grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grep returns only the indices of the matching elements.
regexpr, gregexpr and regexec return the exact match positions together with the match lengths.
> text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")
> text
[1] "Hellow, Adam!"      "Hi, Adam!"          "How are you, Adam."
> grep("Adam",text)
[1] 1 2 3
> regexpr("Adam", text)
[1]  9  5 14
attr(,"match.length")
[1] 4 4 4
attr(,"useBytes")
[1] TRUE
> gregexpr("Adam", text)
[[1]]
[1] 9
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 5
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 14
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

> regexec("Adam", text)
[[1]]
[1] 9
attr(,"match.length")
[1] 4

[[2]]
[1] 5
attr(,"match.length")
[1] 4

[[3]]
[1] 14
attr(,"match.length")
[1] 4
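The positions and lengths returned above are rarely used by hand; base R's `regmatches` turns them into the matched text. The following sketch is an added example using the same `text` vector; the patterns are made up for illustration.

```r
text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")

# First capitalized word that is directly followed by "." or "!".
m <- regexpr("[A-Z][a-z]+[.!]", text)
hits <- regmatches(text, m)
print(hits)

# gregexpr finds every match, so it can be used to count words per element.
words <- regmatches(text, gregexpr("[A-Za-z]+", text))
n_words <- lengths(words)
print(n_words)

# grep with value = TRUE returns the matching elements instead of their indices.
hi_lines <- grep("^Hi", text, value = TRUE)
print(hi_lines)
```

`regexpr` pairs with a character-vector result, while `gregexpr` pairs with a list, mirroring the shapes of the outputs shown above.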

substr
Usage: substr(x, start, stop)
Extracts substrings.
> substr(text,9,12)
[1] "Adam" "!"    "you,"

strtrim
Usage: strtrim(x, width)
Trims strings to a given display width.
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> strtrim(head(iris)$Sepal.Length,1)
[1] "5" "4" "4" "4" "5" "5"
> # trim the whole column; assigning only head() would recycle six values over all 150 rows
> iris$Sepal.Length <- strtrim(iris$Sepal.Length, 1)
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1            5         3.5          1.4         0.2  setosa
2            4         3.0          1.4         0.2  setosa
3            4         3.2          1.3         0.2  setosa
4            4         3.1          1.5         0.2  setosa
5            5         3.6          1.4         0.2  setosa
6            5         3.9          1.7         0.4  setosa

length/nchar
nchar gives the number of characters in each element of a vector, whereas length gives the number of elements in the vector.
> length("ATGGGAATGCATGAACTAGCAGCCAAAGTTGATGAGT")
[1] 1
> car <- c('bmw','ford','mini','bmw','mini')
> length(car)
[1] 5
> length(unique(car))
[1] 3
> nchar("ATGGGAATGCATGAACTAGCAGCCAAAGTTGATGAGT")
[1] 37

round
Usage: round(x, digits = 0)
Rounds to the given number of decimal places.
> round(c(1.1254,0.1247844),3)
[1] 1.125 0.125

axes/axis
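axes=FALSE temporarily suppresses the axes so that you can add your own customized axes with axis(). The default is axes=TRUE, which draws the axes.
axis(side, ...)
Adds an axis to the given side of the current plot; the side is set by the first argument (1 to 4, counting clockwise from the bottom). Other arguments control the position of the axis, inside or outside the plot, as well as the tick positions and labels. It is useful for adding customized axes after calling plot() with axes=FALSE.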
order()
A[order(A[,4],decreasing=T),] # sort by column 4 in descending order
data  # a data frame with two columns, v1 and v2
data[sort(data$v1,index.return=TRUE)$ix,]  # sort data by v1; v1 must be numeric (convert with as.numeric() if needed)
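The one-liners above refer to objects (A, data) that are not defined in the post. The sketch below is an added, self-contained version with a made-up data frame `df`; it also shows sorting by two keys, which `sort` alone cannot do.

```r
# Hypothetical example data.
df <- data.frame(
  id    = c("a", "b", "c", "d"),
  grp   = c(1, 2, 1, 2),
  score = c(10, 40, 20, 30)
)

# Descending by one column.
by_score <- df[order(df$score, decreasing = TRUE), ]
print(by_score$id)

# Two keys: grp ascending, then score descending within each group.
by_both <- df[order(df$grp, -df$score), ]
print(by_both$id)

# For a numeric column, order() returns the same permutation
# as sort(..., index.return = TRUE)$ix.
same_perm <- identical(order(df$score), sort(df$score, index.return = TRUE)$ix)
print(same_perm)
```

`order` works on character and factor columns as well, so it is usually the more general choice of the two.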
%in%
Purpose: select the rows of a data frame whose value in a given column matches one of a set of specified strings
> library(dplyr)
> library(tidyr)
> df <- data_frame(
+   group = c(1:2, 1),
+   item_name = c("a", "b", "b"),
+   value1 = 1:3,
+   value2 = 4:6
+ )
> df
Source: local data frame [3 x 4]

  group item_name value1 value2
  (dbl)     (chr)  (int)  (int)
1     1         a      1      4
2     2         b      2      5
3     1         b      3      6
> # select the rows where item_name is "a"
> a<-c("a")
> df[df$item_name %in% a,]
Source: local data frame [1 x 4]

  group item_name value1 value2
  (dbl)     (chr)  (int)  (int)
1     1         a      1      4

table
Purpose: count how often each value occurs
> a<-c(1,1,1,2,2,3)
> table(a)
a
1 2 3 
3 2 1

str()
Purpose: inspect the structure of an object
> str(reps )
'data.frame':	255956 obs. of  17 variables:
 $ bin      : int  585 585 585 585 585 585 585 585 585 585 ...
 $ swScore  : int  342 2271 3379 2704 180 380 1113 233 388 673 ...
 $ milliDiv : int  0 159 80 108 0 131 168 302 256 182 ...
 $ milliDel : int  0 37 4 31 0 0 19 0 0 7 ...
 $ milliIns : int  0 25 0 10 0 14 19 68 14 90 ...
 $ genoName : Factor w/ 1 level "chrX": 1 1 1 1 1 1 1 1 1 1 ...
 $ genoStart: int  0 41 1799 2290 2797 2945 3015 3565 5012 5164 ...
 $ genoEnd  : int  38 446 2272 2703 2817 3015 3221 3757 5164 5186 ...
 $ genoLeft : int  -154913716 -154913308 -154911482 -154911051 -154910937 -154910739 -154910533 -154909997 -154908590 -154908568 ...
 $ strand   : Factor w/ 2 levels "-","+": 2 2 2 2 2 2 1 2 1 2 ...
 $ repName  : Factor w/ 1105 levels "(A)n","(AAATG)n",..: 90 519 656 579 90 183 248 284 940 240 ...
 $ repClass : Factor w/ 15 levels "DNA","LINE","Low_complexity",..: 10 4 4 4 10 10 11 3 4 11 ...
 $ repFamily: Factor w/ 35 levels "AcHobo","Alu",..: 28 7 7 7 28 28 2 12 13 2 ...
 $ repStart : int  3 741 1 1 2 1 -20 1 -200 1 ...
 $ repEnd   : int  40 1150 475 422 21 69 292 179 228 22 ...
 $ repLeft  : int  0 -104 -83 -79 0 0 87 0 37 -280 ...
 $ id       : int  1 2 3 4 5 6 7 8 9 1 ...

ifelse
Purpose: vectorized conditional
> set.seed(123)
> col1 <- runif (5, 0, 2)
> col2 <- rnorm (5, 0, 2)
> col3 <- rpois (5, 3)
> col4 <- rchisq (5, 0.1)
> df <- data.frame (col1, col2, col3, col4)
> df
       col1       col2 col3         col4
1 0.5751550 -3.3791113    5 2.771082e-01
2 1.5766103  2.4789918    2 3.888853e-04
3 0.8179538 -0.2179319    0 8.652702e-05
4 1.7660348 -0.2344839    2 6.492406e-17
5 1.8809346  0.3661652    6 2.963428e-02
> output <- ifelse ((df$col1) > 1 & (df$col3) > 2, "yes", "no")
> df$output <- output
> df
       col1       col2 col3         col4 output
1 0.5751550 -3.3791113    5 2.771082e-01     no
2 1.5766103  2.4789918    2 3.888853e-04     no
3 0.8179538 -0.2179319    0 8.652702e-05     no
4 1.7660348 -0.2344839    2 6.492406e-17     no
5 1.8809346  0.3661652    6 2.963428e-02    yes

gsub and sub
String replacement
gsub replaces every match
sub replaces only the first match
# replace b with B
gsub(pattern = "b", replacement = "B", x = c("abcb", "boy", "baby"))
[1] "aBcB" "Boy"  "BaBy"
# replace only the first b
sub(pattern = "b", replacement = "B", x = c("abcb", "baby"))
[1] "aBcb" "Baby"

Counting occurrences of a character in a string
s <- "aababac"
p <- "a"
countCharOccurrences <- function(char, s) {
    s2 <- gsub(char,"",s)
    return (nchar(s) - nchar(s2))
}
countCharOccurrences(p,s)



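One-way ANOVA and its implementation in R
2015-10-08T15:37:43.000Z
One-way analysis of variance
Analysis of variance (ANOVA) is an effective statistical method for analyzing experimental data in industrial and agricultural production and in scientific research. Differences (fluctuations) among observations come from two main sources: uncontrollable fluctuations caused by random disturbances or observation error during the experiment, and controllable fluctuations caused by different treatments or different experimental conditions.
The main task of ANOVA is to decompose the total variation in the observed data, according to its sources, into factor effects and experimental error, to analyze them quantitatively, to detect whether the differences among several groups of data are significant, and to compare how much each source contributes to the total variation, as a basis for further statistical inference.
Before running an ANOVA, its assumptions should be checked. Because the samples are drawn at random, the populations are assumed to be independent and normal, and homogeneity of variance is examined (with Bartlett's test).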
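The decomposition of total variation into a factor effect and experimental error can be verified by hand against `aov`. The sketch below is an added illustration, assuming the built-in chickwts data (chick weight by feed type); the variable names are made up.

```r
# One-way ANOVA of chick weight by feed type.
fit <- aov(weight ~ feed, data = chickwts)
tab <- summary(fit)[[1]]   # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
print(tab)

# The same decomposition computed directly.
grand_mean <- mean(chickwts$weight)
group_mean <- ave(chickwts$weight, chickwts$feed)      # each row's own group mean
ss_total   <- sum((chickwts$weight - grand_mean)^2)    # total variation
ss_between <- sum((group_mean - grand_mean)^2)         # factor effect
ss_within  <- sum((chickwts$weight - group_mean)^2)    # experimental error

# F statistic: ratio of the between-group and within-group mean squares.
f_by_hand <- (ss_between / 5) / (ss_within / 65)
print(f_by_hand)
```

The degrees of freedom are 5 (six feeds minus one) and 65 (71 chicks minus six groups), and `f_by_hand` agrees with the F value in the table.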

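Normality test
Test the input data for normality before running the ANOVA.
For normality, use the Shapiro-Wilk test (W test); it is typically used to check whether a sample follows a normal distribution when the sample size n ≤ 50.
In R, shapiro.test() reports the W statistic and the corresponding p-value, so the p-value can be used directly as the decision criterion (a p-value above 0.05 means normality is not rejected). It is called as shapiro.test(x), where x is the data to test, a numeric vector with between 3 and 5000 values.
nx <- c(rnorm(10))
nx
[1] -0.83241783 -0.29609562 -0.06736888 -0.02366562 0.23652392 0.97570959
[7] -0.85301145 1.51769488 -0.84866517 0.20691119
shapiro.test(nx)
Shapiro-Wilk normality test
data: nx
W = 0.9084, p-value = 0.2699
# Result: the p-value (0.2699) is greater than 0.05, so normality is not rejected and the data can be treated as normally distributed.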

For more normality tests, see: Normal distribution tests in R
Among them, the D test (Kolmogorov-Smirnov) is a relatively precise normality test.
SPSS rule: for sample sizes 3 ≤ n ≤ 5000, rely on Shapiro-Wilk (W test); for n > 5000, rely on Kolmogorov-Smirnov.

SAS rule: for sample sizes n ≤ 2000, rely on Shapiro-Wilk (W test); for n > 2000, rely on Kolmogorov-Smirnov (D test).
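
To make the decision rule concrete, here is a small sketch that runs shapiro.test() on one normal and one clearly skewed sample and applies the 0.05 threshold. The seed, the sample sizes and the variable names are arbitrary choices for this illustration, not part of the original post:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x_norm <- rnorm(30)              # drawn from a normal distribution
x_skew <- rexp(30)               # drawn from a skewed (exponential) distribution

alpha <- 0.05
for (x in list(x_norm, x_skew)) {
  res <- shapiro.test(x)
  # Reject normality only when the p-value falls below alpha
  cat(sprintf("W = %.4f, p = %.4f, reject normality: %s\n",
              res$statistic, res$p.value, res$p.value < alpha))
}
```

A p-value above alpha does not prove normality; it only means the test found no evidence against it.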

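Homogeneity of variance test
ANOVA also assumes homogeneity of variance, so we need to test whether the variances are equal across factor levels. The most common choice in R is the Bartlett test; bartlett.test() is called as
bartlett.test(x, g, ...)

where x is a data vector or a list, and g is a factor vector (ignored if x is a list). With a data set, the function can also be called through a formula:
bartlett.test(formula, data, subset, na.action, ...)

formula is an ANOVA formula of the form lhs ~ rhs; data names the data set; subset is optional and selects a subset of observations for the analysis; na.action specifies what to do when missing values are encountered.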
> x=c(x1,x2,x3)
> account=data.frame(x,A=factor(rep(1:3,each=7)))
> bartlett.test(x~A,data=account)
 
Bartlett test of homogeneity of variances
 
data: x by A
Bartlett's K-squared = 0.13625, df = 2, p-value = 0.9341

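Because the p-value is far above the significance level α = 0.05, we cannot reject the null hypothesis, and we treat the variances at the different levels as equal.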
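
The vectors x1, x2 and x3 are not defined in the post, so the call above cannot be rerun as written. As a runnable stand-in, the sketch below applies the same test to InsectSprays, a data set that ships with R; the choice of data set is an assumption made only for illustration:

```r
# InsectSprays ships with R: insect counts for six sprays (A-F).
# It is used here only as a stand-in for the undefined x1, x2, x3.
res <- bartlett.test(count ~ spray, data = InsectSprays)
print(res)

# The same test through the (x, g) interface
res_xg <- bartlett.test(InsectSprays$count, InsectSprays$spray)

alpha <- 0.05
if (res$p.value < alpha) {
  message("Variances differ between groups; the equal-variance assumption is doubtful.")
} else {
  message("No evidence against equal variances.")
}
```

Both interfaces compute the same statistic; the formula form is usually easier to read when the data already sit in a data frame.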
Analysis of variance: F-test
In R, the function var.test compares two variances using an F-test. Although it can compare the values of s^2 for two samples, base R has no built-in function for comparing the variance of a sample, s^2, with the variance of a population, σ^2. The syntax for testing variances is:
var.test(X, Y, ratio = 1, alternative = "two.sided", conf.level = 0.95)

where X and Y are vectors containing the two samples.
The optional argument ratio is the hypothesized ratio of the two variances under the null hypothesis; the default value is 1 if not specified.
The argument alternative gives the alternative hypothesis, that is, the direction in which the true ratio of variances is taken to differ from ratio. The default is "two.sided", with the other possible choices being "less" or "greater".
The argument conf.level gives the confidence level to be used in the test; the default value of 0.95 is equivalent to α = 0.05.
Here is a typical result using the objects std.method and new.method.
> std.method<-c( 21.62, 22.20, 24.27, 23.54, 24.25, 23.09, 21.01 )
> new.method<-c(21.54 ,20.51 ,22.31, 21.30, 24.62, 25.72, 21.54 ) 
> var(std.method); var(new.method) 
[1] 1.638495
[1] 3.690329
> var.test(std.method, new.method)    

	F test to compare two variances

data:  std.method and new.method
F = 0.444, num df = 6, denom df = 6, p-value = 0.3462
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.07629135 2.58395513
sample estimates:
ratio of variances 
         0.4439971

There are two ways to interpret the results provided by R.
First, the p-value provides the smallest value of α for which the F-ratio is significantly different from the hypothesized value.
If this value is larger than the desired α, then there is insufficient evidence to reject the null hypothesis; otherwise, the null hypothesis is rejected. Second, R provides the desired confidence interval for the F-ratio;
if the calculated value falls within the confidence interval, then the null hypothesis is retained. For this example, the null hypothesis is retained and we find no evidence for a difference in the variances for the objects std.method and new.method. Note that R does not restrict the F-ratio to values greater than 1.
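
To show where the reported numbers come from, the following sketch recomputes the F-ratio and the two-sided p-value by hand with var() and pf(), then compares them with var.test(). The intermediate variable names are chosen for this illustration:

```r
std.method <- c(21.62, 22.20, 24.27, 23.54, 24.25, 23.09, 21.01)
new.method <- c(21.54, 20.51, 22.31, 21.30, 24.62, 25.72, 21.54)

f_ratio <- var(std.method) / var(new.method)      # observed F-ratio
df1 <- length(std.method) - 1                     # numerator degrees of freedom
df2 <- length(new.method) - 1                     # denominator degrees of freedom

# Two-sided p-value: twice the smaller tail probability of the F distribution
p_lower <- pf(f_ratio, df1, df2)
p_manual <- 2 * min(p_lower, 1 - p_lower)

res <- var.test(std.method, new.method)
c(F = f_ratio, p_manual = p_manual, p_var_test = res$p.value)
```

The manual p-value agrees with the one printed by var.test(), which is why an F-ratio below 1 is handled without any special treatment.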
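1) Testing whether the groups differ
The R function aov() performs the ANOVA computation. It is called as:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)

formula is the ANOVA formula; for one-way ANOVA it is x ~ A.
data is the data frame to analyse; projections is a logical flag indicating whether the projections should be returned.
qr is also a logical flag, indicating whether the QR decomposition should be returned; the default is TRUE.
contrasts is a list of contrasts to use for some of the factors in the formula.
summary() prints the detailed ANOVA table.
Producing glucose from starch leaves a good deal of molasses, which can serve as raw material for caramel colouring. Before the colouring is produced, impurities should be removed as thoroughly as possible to guarantee its quality, so an impurity-removal method has to be chosen. The experiment used 5 different impurity-removal methods, each tested 4 times (4 replicates); the results are in the table.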

> X<-c(25.6, 22.2, 28.0, 29.8, 24.4, 30.0, 29.0, 27.5, 25.0, 27.7,
       23.0, 32.2, 28.8, 28.0, 31.5, 25.9, 20.6, 21.2, 22.0, 21.2)
> A<-factor(rep(1:5, each=4))
> miscellany<-data.frame(X, A)
> miscellany
      X A
1  25.6 1
2  22.2 1
3  28.0 1
4  29.8 1
5  24.4 2
6  30.0 2
7  29.0 2
8  27.5 2
9  25.0 3
10 27.7 3
11 23.0 3
12 32.2 3
13 28.8 4
14 28.0 4
15 31.5 4
16 25.9 4
17 20.6 5
18 21.2 5
19 22.0 5
20 21.2 5
> aov.mis<-aov(X~A, data=miscellany)
> summary(aov.mis)
            Df Sum Sq Mean Sq F value Pr(>F)  
A            4  132.0   32.99   4.306 0.0162 *
Residuals   15  114.9    7.66                 
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

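Explanation of the output
In the output above, Df is the degrees of freedom, Sum Sq the sum of squares, and Mean Sq the mean square;
F value is the value of the F statistic (the F ratio); Pr(>F) is the p-value of the test; A is factor A;
Residuals is the residual term.
Since F = 4.3061 > F0.05(5-1, 20-5) = 3.06, or equivalently p = 0.01618 < 0.05,
we have grounds to reject the null hypothesis and conclude that the five impurity-removal methods differ significantly.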
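
The critical value F0.05(4, 15) quoted above can be obtained from qf() rather than a printed table. A short sketch using the same data; the names f_value and f_crit are introduced here for illustration:

```r
X <- c(25.6, 22.2, 28.0, 29.8, 24.4, 30.0, 29.0, 27.5, 25.0, 27.7,
       23.0, 32.2, 28.8, 28.0, 31.5, 25.9, 20.6, 21.2, 22.0, 21.2)
A <- factor(rep(1:5, each = 4))
fit <- aov(X ~ A)

tab <- summary(fit)[[1]]                 # the ANOVA table as a data frame
f_value <- tab[["F value"]][1]           # observed F for factor A
df_between <- tab[["Df"]][1]             # 5 - 1 = 4
df_within <- tab[["Df"]][2]              # 20 - 5 = 15

f_crit <- qf(0.95, df_between, df_within)  # critical value F0.05(4, 15)
c(F = f_value, F_crit = f_crit, reject = f_value > f_crit)
```

Comparing F with the critical value and comparing the p-value with 0.05 are the same decision expressed two ways.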
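2) If the groups differ, finding which pairs differ
The result above tests for differences among all 5 impurity-removal methods. Suppose A1 is the control and A2, A3, A4, A5 are treatments; to test one control against each treatment, the following code can be used: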
> A1A2<-miscellany[1:8,]
> A1A2
     X A
1 25.6 1
2 22.2 1
3 28.0 1
4 29.8 1
5 24.4 2
6 30.0 2
7 29.0 2
8 27.5 2
> an.aov.mis<-aov(X~A, data=A1A2)
> summary(an.aov.mis)
            Df Sum Sq Mean Sq F value Pr(>F)
A            1   3.51   3.511   0.419  0.542
Residuals    6  50.31   8.385

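That is, take the control as one group and one treatment as the other. The drawback is that with several treatments and one control the step has to be repeated for each treatment. I have not found a better way yet and would welcome suggestions.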
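
The repetition can at least be automated with a loop. The sketch below runs the same two-group aov() once per treatment against control level "1" and collects the p-values; it is an illustration of the manual procedure only, and the resulting p-values are not adjusted for multiple comparisons:

```r
X <- c(25.6, 22.2, 28.0, 29.8, 24.4, 30.0, 29.0, 27.5, 25.0, 27.7,
       23.0, 32.2, 28.8, 28.0, 31.5, 25.9, 20.6, 21.2, 22.0, 21.2)
A <- factor(rep(1:5, each = 4))
miscellany <- data.frame(X, A)

control <- "1"
treatments <- setdiff(levels(miscellany$A), control)

# One two-group ANOVA per treatment, each against the control
p_values <- sapply(treatments, function(trt) {
  sub_data <- droplevels(subset(miscellany, A %in% c(control, trt)))
  summary(aov(X ~ A, data = sub_data))[[1]][["Pr(>F)"]][1]
})
round(p_values, 3)
```

The entry named "2" reproduces the A1 versus A2 result shown above. Because four tests are run, a multiplicity-aware method such as Dunnett's comparison is the safer choice for actual inference.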
A more effective approach I worked out recently:
The F-test from aov(), viewed through summary(aov.mis), shows that the five methods differ significantly. To see where the differences lie (multiple comparisons), use TukeyHSD():
  > TukeyHSD(aov.mis)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = X ~ A, data = miscellany)

$A
      diff        lwr        upr     p adj
2-1  1.325  -4.718582  7.3685818 0.9584566
3-1  0.575  -5.468582  6.6185818 0.9981815
4-1  2.150  -3.893582  8.1935818 0.8046644
5-1 -5.150 -11.193582  0.8935818 0.1140537
3-2 -0.750  -6.793582  5.2935818 0.9949181
4-2  0.825  -5.218582  6.8685818 0.9926905
5-2 -6.475 -12.518582 -0.4314182 0.0330240
4-3  1.575  -4.468582  7.6185818 0.9251337
5-3 -5.725 -11.768582  0.3185818 0.0675152
5-4 -7.300 -13.343582 -1.2564182 0.0146983 
# Plot the Tukey HSD confidence intervals
 > plot(TukeyHSD(aov.mis))

Note: the output above compares every pair of groups, but often all we need is one control group compared against several treatment groups; the multcomp package is a good choice for that.
Dunnett
a = c(56,60,44,53)
b = c(29,38,18,35)
c = c(11,25,7,18)
d = c(26,44,20,32)
strains.frame = data.frame(a, b, c, d)
strains = stack(strains.frame)  # stack() is a base R function (utils package) that converts wide data to long format
colnames(strains) = c("weight", "group")
## Ordinary all-pairs comparison
TukeyHSD( aov(weight ~ group, data=strains) )
library(multcomp)
summary(glht(aov(weight ~ group, data=strains), linfct=mcp(group="Dunnett")))
## The first group ("a" in this example) is used as the reference group. 
## If this is not the case, use the relevel() command to set the reference.
strains$group = relevel(strains$group, "b")
summary(glht(aov(weight ~ group, data=strains), linfct=mcp(group="Dunnett")))
plot(glht(aov(weight ~ group, data=strains), linfct=mcp(group="Dunnett")))


More: http://barcwiki.wi.mit.edu/wiki/SOPs/anova

Notes on some multcomp arguments:
glht: General Linear Hypotheses. General linear hypotheses and multiple comparisons for parametric models, including generalized linear models, linear mixed effects models, and survival models.
linfct: a specification of the linear hypotheses to be tested, that is, which tests to run on the fitted linear model.
mcp (multiple comparisons): for each factor included in the model as an independent variable, a contrast matrix or a symbolic description of the contrasts can be specified as arguments to mcp; the symbolic descriptions include "Tukey" (all-pair comparisons) and "Dunnett" (comparison with a control).
Another equally efficient approach:
> person <- rep(c(1:10),2)
> treat <- c("A","B","A","A","B","B","A","B","A","B","B","A","B","B","A","A","B","A","B","A")
> phase <- rep(c(1,2),each=10)
> x <- c(760,860,568,780,960,940,635,440,528,800,770,855,602,800,958,952,650,450,530,803)
> data46 <- data.frame(person,treat,phase,x)
> data46$person<-factor(data46$person)
> data46
   person treat phase   x
1       1     A     1 760
2       2     B     1 860
3       3     A     1 568
4       4     A     1 780
5       5     B     1 960
6       6     B     1 940
7       7     A     1 635
8       8     B     1 440
9       9     A     1 528
10     10     B     1 800
11      1     B     2 770
12      2     A     2 855
13      3     B     2 602
14      4     B     2 800
15      5     A     2 958
16      6     A     2 952
17      7     B     2 650
18      8     A     2 450
19      9     B     2 530
20     10     A     2 803
> result<-aov(x~phase+person+treat,data=data46)
> summary(result)
            Df Sum Sq Mean Sq  F value   Pr(>F)    
phase        1    490     490    9.925   0.0136 *  
person       9 551111   61235 1240.195 1.32e-11 ***
treat        1    198     198    4.019   0.0799 .  
Residuals    8    395      49                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

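Look at the p adj values in the TukeyHSD output to see which pairs of groups differ significantly.
The results above can be used to fill in the ANOVA table below:

plot() can then show the differences among the 5 impurity-removal methods graphically. Run in R:
> plot(miscellany$X~miscellany$A)


The plot also shows that the 5 methods clearly differ in the amount of impurity removed, especially method 5 versus the other four, while methods 1 and 3, and methods 2 and 4, differ little.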

 Contribution from ：http://www.cnblogs.com/jpld/p/4594003.html



bar
2015-10-08T14:18:26.000Z
The bar geom is used to produce 1d area plots: bar charts for categorical x, and histograms for continuous y. stat_bin explains the details of these summaries in more detail. In particular, you can use the weight aesthetic to create weighted histograms and barcharts where the height of the bar no longer represent a count of observations, but a sum over some other variable. See the examples for a practical example.
Usage
geom_bar(mapping = NULL, data = NULL, stat = "bin", position = "stack", ...)

Aesthetics
geom_bar understands the following aesthetics (x is required):
 x
 alpha
 colour
 fill
 linetype
 size
 weight

Grouping Bars Together
p<- ggplot(df2, aes(x=sample, y=high, fill=sample)) + 
  geom_bar(stat="identity",fill="lightblue", color="black", 
           position=position_dodge()) +
  geom_errorbar(aes(ymin=high-sd, ymax=high+sd), width=.2,
                position=position_dodge(.9)) +
  theme(legend.position='none') +
  labs(title="Tooth length per dose", x="Sample", y = "high")
print(p)

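Explanation of the code
fill inside aes() maps a variable to bar colour, so different categories get different colours.
In geom_bar(), fill sets the bar fill colour and color sets the bar outline colour.
theme() controls the legend.
labs() adds the x-axis, y-axis and title labels.
scale_fill_brewer(palette="Pastel1") # can also be used to change the bar colours

Using different colours in a bar chart: map a suitable variable to fill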
ggplot(upc, aes(x=reorder(Abb, Change), y=Change, fill=Region)) +
	  geom_bar(stat="identity", colour="black")  +
	  scale_fill_manual(values=c("#669933", "#FFCC66")) + 
	  xlab("State")

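Explanation of the code
reorder() sorts the bars by value.
xlab() changes the x-axis label.
This approach fills the bars with different colours, but the colours are generated automatically by ggplot2. If you would rather use colours of your own choosing, proceed as follows:
Method 1: breaks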
MYdata <- data.frame(Age = rep(c(0,1,3,6,9,12), each=20),
                    Richness = rnorm(120, 10000, 2500))
ggplot(data = MYdata, aes(x = Age, y = Richness)) + 
  geom_boxplot(aes(fill=factor(Age))) + 
  geom_point(aes(color = factor(Age))) +
  scale_x_continuous(breaks = c(0, 1, 3, 6, 9, 12)) +
  scale_colour_manual(breaks = c("0", "1", "3", "6", "9", "12"),
                      labels = c("0 month", "1 month", "3 months",
                                 "6 months", "9 months", "12 months"),
                      values = c("#E69F00", "#56B4E9", "#009E73", 
                                 "#F0E442", "#0072B2", "#D55E00")) +
  scale_fill_manual(breaks = c("0", "1", "3", "6", "9", "12"),
                      labels = c("0 month", "1 month", "3 months",
                                 "6 months", "9 months", "12 months"),
                      values = c("#E69F00", "#56B4E9", "#009E73", 
                                 "#F0E442", "#0072B2", "#D55E00"))


With this color scheme, the points that fall inside the boxplot are not visible (since they are the same color as the boxplot’s fill). Perhaps leaving the boxplot hollow and drawing its lines in the color would be better.
ggplot(data = MYdata, aes(x = Age, y = Richness)) + 
  geom_boxplot(aes(colour=factor(Age)), fill=NA) + 
  geom_point(aes(color = factor(Age))) +
  scale_x_continuous(breaks = c(0, 1, 3, 6, 9, 12)) +
  scale_colour_manual(breaks = c("0", "1", "3", "6", "9", "12"),
                      labels = c("0 month", "1 month", "3 months",
                                 "6 months", "9 months", "12 months"),
                      values = c("#E69F00", "#56B4E9", "#009E73", 
                                 "#F0E442", "#0072B2", "#D55E00"))


Explanation of the code
With your own data you may hit the error "Continuous value supplied to discrete scale". Brian Diggs explains it this way:
Age is a continuous variable, but you are trying to use it in a discrete scale (by specifying the color for specific values of age). In general, a scale maps the variable to the visual; for a continuous age, there is a corresponding color for every possible value of age, not just the ones that happen to appear in your data. However, you can simultaneously treat age as a categorical variable (factor) for some of the aesthetics. For the third part of your question, within the scale description, you can define specific labels corresponding to specific breaks in the scale.
In other words, convert the continuous variable to a factor.
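
The conversion itself needs only base R. A minimal sketch with a made-up age vector, using the same break values and labels as the scale definitions above:

```r
age <- c(0, 1, 3, 6, 9, 12, 3, 0)            # numeric: treated as continuous

# factor() turns each distinct value into a discrete level with its own label
age_f <- factor(age,
                levels = c(0, 1, 3, 6, 9, 12),
                labels = c("0 month", "1 month", "3 months",
                           "6 months", "9 months", "12 months"))

is.numeric(age)      # TRUE
is.factor(age_f)     # TRUE
levels(age_f)
table(age_f)
```

Mapping age_f (or factor(Age) inline, as the post does) to colour or fill gives ggplot2 a discrete variable, which is what scale_colour_manual() and scale_fill_manual() expect.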
方法2：Change the default palettes
These are color-blind-friendly palettes, one with gray, and one with black.

To use with ggplot2, it is possible to store the palette in a variable, then use it later.
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# The palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# To use for fills, add
  scale_fill_manual(values=cbPalette)

# To use for line and point colors, add
  scale_colour_manual(values=cbPalette)


Coloring Negative and Positive Bars Differently: create a new variable and map it to fill
csub <- subset(climate, Source=="Berkeley" & Year >= 1900)
csub$pos <- csub$Anomaly10y >= 0
ggplot(csub, aes(x=Year, y=Anomaly10y, fill=pos)) +
  geom_bar(stat="identity", position="identity", colour="black", size=0.25) +
  scale_fill_manual(values=c("#CCEEFF", "#FFDDDD"), guide=FALSE)


Explanation of the code
First, subset() selects the rows of climate where Source is "Berkeley" and Year >= 1900, and assigns them to csub.
csub$pos adds a pos column to that data: TRUE where Anomaly10y >= 0, FALSE otherwise.
      Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y   pos
155 Berkeley 1954        NA        NA     -0.032  0.038 FALSE
156 Berkeley 1955        NA        NA     -0.022  0.035 FALSE
157 Berkeley 1956        NA        NA      0.012  0.031  TRUE
158 Berkeley 1957        NA        NA      0.007  0.028  TRUE
159 Berkeley 1958        NA        NA      0.002  0.027  TRUE
160 Berkeley 1959        NA        NA      0.002  0.026  TRUE
161 Berkeley 1960        NA        NA     -0.019  0.026 FALSE
162 Berkeley 1961        NA        NA     -0.001  0.021 FALSE
163 Berkeley 1962        NA        NA      0.017  0.018  TRUE
164 Berkeley 1963        NA        NA      0.004  0.016  TRUE
165 Berkeley 1964        NA        NA     -0.028  0.018 FALSE
166 Berkeley 1965        NA        NA     -0.006  0.017 FALSE
167 Berkeley 1966        NA        NA     -0.024  0.017 FALSE

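Finally, pos is mapped to fill; size in geom_bar() sets the thickness of the black bar outline.
scale_fill_manual() changes the colours, and guide=FALSE removes the legend.
geom_bar(width=0.5) adjusts the bar width through width, which also changes the spacing between bars.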
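
The data-preparation half of this recipe can be tried without the climate data set. The sketch below uses a small made-up data frame with the same column names (climate_demo and its values are invented for illustration):

```r
# Made-up stand-in for the climate data set used in the post
climate_demo <- data.frame(
  Source     = c("Berkeley", "Berkeley", "Berkeley", "Other", "Berkeley"),
  Year       = c(1890, 1900, 1950, 1950, 2000),
  Anomaly10y = c(-0.30, -0.10, 0.02, 0.05, 0.60)
)

# Keep Berkeley rows from 1900 onwards
csub <- subset(climate_demo, Source == "Berkeley" & Year >= 1900)

# Logical flag: TRUE for non-negative anomalies, FALSE otherwise
csub$pos <- csub$Anomaly10y >= 0
csub
```

Because pos has exactly two values, scale_fill_manual() needs exactly two colours, listed in the order FALSE, TRUE.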
Using plyr to change the stacking order of the colours: order=desc()
library(plyr) # Needed for desc()
ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar, order=desc(Cultivar))) +
      geom_bar(stat="identity")


Making a Proportional Stacked Bar Graph
library(gcookbook) # For the data set
library(plyr)
# Do a group-wise transform(), splitting on "Date"
ce <- ddply(cabbage_exp, "Date", transform,
            percent_weight = Weight / sum(Weight) * 100)
ggplot(ce, aes(x=Date, y=percent_weight, fill=Cultivar)) +
      geom_bar(stat="identity")


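How the ddply() call from plyr is put together
cabbage_exp is the data set.
"Date" is the grouping variable (loosely, the variable on the x axis).
transform is the operation applied to each group; ddply also accepts others such as summarize.
The last argument defines the new variable and how it is computed.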
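
For readers without plyr, the same group-wise percentage can be sketched in base R with ave(). The data frame below is an illustrative stand-in with the same column names as cabbage_exp; it is not loaded from gcookbook:

```r
# Illustrative data with the same column names as cabbage_exp
cabbage_demo <- data.frame(
  Cultivar = rep(c("c39", "c52"), each = 3),
  Date     = rep(c("d16", "d20", "d21"), times = 2),
  Weight   = c(3.18, 2.80, 2.74, 2.26, 3.11, 1.47)
)

# ave() applies the function within each Date group and returns a vector
# aligned with the original rows, much like ddply(..., transform, ...)
cabbage_demo$percent_weight <- with(
  cabbage_demo,
  Weight / ave(Weight, Date, FUN = sum) * 100
)
cabbage_demo
```

Within each Date the percentages add up to 100, which is what makes every stacked bar the same height.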
Adding text labels on the bars
library(ggplot2)
library(ggthemes)
dt = data.frame(obj = c('A','D','B','E','C'), val = c(2,15,6,9,7))
dt$obj = factor(dt$obj, levels=c('D','B','C','A','E'))   ## set the order of the bars
p = ggplot(dt, aes(x = obj, y = val, fill = obj, group = factor(1))) + 
    geom_bar(stat = "identity", width = 0.5) +   ## change the bar width
    theme_pander() + 
    geom_text(aes(label = val, vjust = -0.8, hjust = 0.5, color = obj), show_guide = FALSE) +   ## show the values above the bars
    ylim(min(dt$val, 0)*1.1, max(dt$val)*1.1)   ## widen the y range so the labels are not clipped
p


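Explanation of the code
ggthemes is a theme package for ggplot2; theme_pander() replaces the default ggplot2 theme.
dt$obj is a factor, and ggplot2 draws the bars in the order of the factor levels, so changing the level order changes the plotting order; print dt$obj to check.

Another way to change the bar order:
p + scale_x_discrete(limits=c('D','B','C','A','E'))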


 Contribution from ：http://yangchao.me/2013/02/ggplot2-bar-chart/
                       http://www.bubuko.com/infodetail-1051940.html
                       http://stackoverflow.com/questions/10805643/ggplot2-add-color-to-boxplot-continuous-value-supplied-to-discrete-scale-er



Icons
2015-10-07T05:56:31.000Z
Setting up Font Awesome can be as simple as adding two lines of code to your website, or you can be a pro and
customize the LESS yourself! Font Awesome even plays nicely with Bootstrap 3!
Paste the following code into the <head> section of your site's HTML.
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">

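If font-awesome.min.css is hosted locally, proceed as follows:
 Go to the Font Awesome site and download the fonts and the corresponding CSS files.

 Find the fonts and css folders in the downloaded archive and copy their contents into your own site.

your blog address\themes\jacman\source\font   replace the font files
your blog address\jacman\source\css    replace the corresponding CSS

 Paste the following code into the <head> section of your site's HTML.
<link href="/css/font-awesome.min.css" rel="stylesheet">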

Examples
Basic Icons
<i class="fa fa-camera-retro"></i> fa-camera-retro

Larger Icons
<i class="fa fa-camera-retro fa-lg"></i> fa-lg
<i class="fa fa-camera-retro fa-2x"></i> fa-2x
<i class="fa fa-camera-retro fa-3x"></i> fa-3x
<i class="fa fa-camera-retro fa-4x"></i> fa-4x
<i class="fa fa-camera-retro fa-5x"></i> fa-5x

Fixed Width Icons
<div class="list-group">
  <a class="list-group-item" href="#"><i class="fa fa-home fa-fw"></i>&nbsp; Home</a>
  <a class="list-group-item" href="#"><i class="fa fa-book fa-fw"></i>&nbsp; Library</a>
  <a class="list-group-item" href="#"><i class="fa fa-pencil fa-fw"></i>&nbsp; Applications</a>
  <a class="list-group-item" href="#"><i class="fa fa-cog fa-fw"></i>&nbsp; Settings</a>
</div>


List Icons
  
<ul class="fa-ul">
  <li><i class="fa-li fa fa-check-square"></i>List icons</li>
  <li><i class="fa-li fa fa-check-square"></i>can be used</li>
  <li><i class="fa-li fa fa-spinner fa-spin"></i>as bullets</li>
  <li><i class="fa-li fa fa-square"></i>in lists</li>
</ul>


Bordered & Pulled Icons

<i class="fa fa-quote-left fa-3x fa-pull-left fa-border"></i>
...tomorrow we will run faster, stretch out our arms farther...
And then one fine morning— So we beat on, boats against the
current, borne back ceaselessly into the past.

Animated Icons





<i class="fa fa-spinner fa-spin"></i>
<i class="fa fa-circle-o-notch fa-spin"></i>
<i class="fa fa-refresh fa-spin"></i>
<i class="fa fa-cog fa-spin"></i>
<i class="fa fa-spinner fa-pulse"></i>

Rotated & Flipped
<i class="fa fa-shield"></i> normal
<i class="fa fa-shield fa-rotate-90"></i> fa-rotate-90
<i class="fa fa-shield fa-rotate-180"></i> fa-rotate-180
<i class="fa fa-shield fa-rotate-270"></i> fa-rotate-270
<i class="fa fa-shield fa-flip-horizontal"></i> fa-flip-horizontal
<i class="fa fa-shield fa-flip-vertical"></i> icon-flip-vertical

Stacked Icons

  
  

<span class="fa-stack fa-lg">
  <i class="fa fa-square-o fa-stack-2x"></i>
  <i class="fa fa-twitter fa-stack-1x"></i>
</span>
fa-twitter on fa-square-o

<span class="fa-stack fa-lg">
  <i class="fa fa-circle fa-stack-2x"></i>
  <i class="fa fa-flag fa-stack-1x fa-inverse"></i>
</span>
fa-flag on fa-circle

<span class="fa-stack fa-lg">
  <i class="fa fa-square fa-stack-2x"></i>
  <i class="fa fa-terminal fa-stack-1x fa-inverse"></i>
</span>
fa-terminal on fa-square

<span class="fa-stack fa-lg">
  <i class="fa fa-camera fa-stack-1x"></i>
  <i class="fa fa-ban fa-stack-2x text-danger"></i>
</span>
fa-ban on fa-camera

Bootstrap 3 Examples

<a class="btn btn-danger" href="#">
  <i class="fa fa-trash-o fa-lg"></i> Delete</a>
<a class="btn btn-default btn-sm" href="#">
  <i class="fa fa-cog"></i> Settings</a>

<a class="btn btn-lg btn-success" href="#">
  <i class="fa fa-flag fa-2x pull-left"></i> Font Awesome<br>Version 4.4.0</a>

<div class="btn-group">
  <a class="btn btn-default" href="#"><i class="fa fa-align-left"></i></a>
  <a class="btn btn-default" href="#"><i class="fa fa-align-center"></i></a>
  <a class="btn btn-default" href="#"><i class="fa fa-align-right"></i></a>
  <a class="btn btn-default" href="#"><i class="fa fa-align-justify"></i></a>
</div>

<div class="input-group margin-bottom-sm">
  <span class="input-group-addon"><i class="fa fa-envelope-o fa-fw"></i></span>
  <input class="form-control" type="text" placeholder="Email address">
</div>
<div class="input-group">
  <span class="input-group-addon"><i class="fa fa-key fa-fw"></i></span>
  <input class="form-control" type="password" placeholder="Password">
</div>

<div class="btn-group open">
  <a class="btn btn-primary" href="#"><i class="fa fa-user fa-fw"></i> User</a>
  <a class="btn btn-primary dropdown-toggle" data-toggle="dropdown" href="#">
    <span class="fa fa-caret-down"></span></a>
  <ul class="dropdown-menu">
    <li><a href="#"><i class="fa fa-pencil fa-fw"></i> Edit</a></li>
    <li><a href="#"><i class="fa fa-trash-o fa-fw"></i> Delete</a></li>
    <li><a href="#"><i class="fa fa-ban fa-fw"></i> Ban</a></li>
    <li><a href="#"><i class="i"></i> Make admin</a></li>
  </ul>
</div>
  


 Contribution from ：http://fontawesome.io/examples/



Adding text to a plot
2015-10-05T11:52:16.000Z
> text1<-read.delim("fun.txt",header=FALSE)
> text1
                                                                    V1
1                                   INFORMATION STORAGE AND PROCESSING
2                 [J] Translation, ribosomal structure and biogenesis 
3                                 [A] RNA processing and modification 
4                                                   [K] Transcription 
5                           [L] Replication, recombination and repair 
6                                [B] Chromatin structure and dynamics 
7                                     CELLULAR PROCESSES AND SIGNALING
8      [D] Cell cycle control, cell division, chromosome partitioning 
9                                               [Y] Nuclear structure 
10                                             [V] Defense mechanisms 
11                                 [T] Signal transduction mechanisms 
12                         [M] Cell wall/membrane/envelope biogenesis 
13                                                  [N] Cell motility 
14                                                   [Z] Cytoskeleton 
15                                       [W] Extracellular structures 
16  [U] Intracellular trafficking, secretion, and vesicular transport 
17   [O] Posttranslational modification, protein turnover, chaperones 
18                                                          METABOLISM
19                               [C] Energy production and conversion 
20                          [G] Carbohydrate transport and metabolism 
21                            [E] Amino acid transport and metabolism 
22                            [F] Nucleotide transport and metabolism 
23                              [H] Coenzyme transport and metabolism 
24                                 [I] Lipid transport and metabolism 
25                         [P] Inorganic ion transport and metabolism 
26   [Q] Secondary metabolites biosynthesis, transport and catabolism 
27                                                POORLY CHARACTERIZED
28                               [R] General function prediction only 
29                                               [S] Function unknown 
> a<-c("a","b","c");
> b<-c(1,2,3);
> c<-c(4,6,7);
> abc<-data.frame(a,b,c);
> abc;
  a b c
1 a 1 4
2 b 2 6
3 c 3 7
> library(reshape2);
> agcd<-melt(abc,id.vars="a",value.name="value",variable.name="bq");
> len<-nrow(text1);
> a1<-agcd[,1];
> b1<-agcd[,3];
> library(ggplot2);
> library(grid);
> vp1<-viewport(width=0.6,height=1,x=0.3,y=0.5);
> pm<-ggplot(agcd,aes(a1,weight=value,fill=bq))+geom_bar(position="dodge")+theme(legend.title=element_blank(),legend.position=c(0.1,0.9))+xlab("COG")+ylab("M82/smithella  and  M82/SB");
# The plot is built above; the text is drawn below
> par(fig=c(0.55,1,0,1),bty="n");
> b<-20;
> plot(1:b,1:b,type="n",xaxt="n",yaxt="n",xlab="",ylab="");
> sum=b+b/(2*len);
> for(i in 1:(len)){
+   if (i %in% c(1,7,18,27) ){
+     text(1,sum,text1[i,],adj=0,cex=0.8,font=2);
+     sum=sum-b/(len);
+   }else{
+     text(1,sum,text1[i,],adj=0,cex=0.8);
+     sum=sum-b/(len);}}
# Combine the plot and the text
> print(pm,vp=vp1);


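Key points
Setting graphical parameters: par()
adj: sets the justification of strings in text, mtext and title. 0 means left-justified, 0.5 (the default) centred, and 1 right-justified.
ann: if ann=FALSE, high-level plotting functions that call plot.default do not annotate the plot with axis titles or an overall title.
bty: sets the type of box drawn around the plot. If bty is one of "o" (the default), "l", "7", "c", "u" or "]", the box resembles the shape of that character; "n" means no box.
fig: c(x1, x2, y1, y2), sets the region of the plotting device occupied by the current figure; it must satisfy x1 < x2 and y1 < y2.
fin: the size of the current figure region, in the form (width, height).
lty: line type. The value can be an integer (0 = blank, 1 = solid (the default), 2 = dashed, 3 = dotted).
oma: in the form c(bottom, left, top, right), sets the outer margins.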


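melt()
id.vars are the columns treated as identifier variables; each keeps its own column in the result.
measure.vars are the columns treated as measured values; their column names and values are stacked into two columns, variable and value, whose names can be set with variable.name and value.name.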

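position
geom_bar(position="dodge") sets how the bars are arranged; the options are "dodge", "fill", "identity", "jitter" and "stack". legend.position sets the legend position.
dodge: bars are placed side by side, as in a grouped bar chart
fill: the data are normalised to proportions so that the stacked bars fill the full height of the plot
identity: no position adjustment
jitter: add a little random noise so that overlapping items become visible
stack: stack the bars on top of each other

b<-20; is an arbitrary value; tune it to fit the figure.

1,7,18,27 are the special rows of the text file (the category headings, drawn in bold).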

Appendix
Drawing several plots on one page with par()
attach(mtcars)
opar<-par(no.readonly=T) 
par(fig=c(0,0.8,0,0.8))
plot(wt,mpg,xlab="Car weight",ylab="Miles per gallon")
par(fig=c(0,0.8,0.55,1),new=T)
boxplot(wt,horizontal=T,axes=F)
par(fig=c(0.65,1,0,0.8),new=T)
boxplot(mpg,axes=F)
par(opar)
detach(mtcars)


 Contribution from ：http://www.dataguru.cn/article-4827-1.html



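Group-wise summary functions in R
2015-10-05T11:37:27.000Z
apply (apply a function over the rows or columns of an array)
Usage:
apply(X, MARGIN, FUN, ...)

X is an array; MARGIN is a vector saying whether FUN is applied to the rows or the columns of X: 1 means rows, 2 means columns, and c(1,2) means both rows and columns, that is, every individual element.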
> ma <- matrix(c(1:4, 1, 6:8), nrow = 2)
> ma
     [,1] [,2] [,3] [,4]
[1,]    1    3    1    7
[2,]    2    4    6    8
> apply(ma, c(1,2), sum)
     [,1] [,2] [,3] [,4]
[1,]    1    3    1    7
[2,]    2    4    6    8
> apply(ma, 1, sum)
[1] 12 20
> apply(ma, 2, sum)
[1]  3  7  7 15
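
FUN does not have to be a built-in function. As a small sketch on the same matrix, an anonymous function can compute the range (max minus min) of each row or column; row_range and col_range are names chosen for this illustration:

```r
ma <- matrix(c(1:4, 1, 6:8), nrow = 2)

row_range <- apply(ma, 1, function(r) max(r) - min(r))  # one value per row
col_range <- apply(ma, 2, function(v) max(v) - min(v))  # one value per column

row_range                      # 6 6
col_range                      # 1 1 5 1

# For plain sums, the dedicated helpers give the same result
all.equal(apply(ma, 1, sum), rowSums(ma))   # TRUE
```

For sums and means, rowSums(), colSums(), rowMeans() and colMeans() are usually faster than apply().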

tapply (group-wise statistics)
Usage:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

X is usually a vector;
INDEX is a list in which every element is a factor of the same length as X;
FUN is the function to apply;
simplify is a logical value: if TRUE (the default) and FUN always returns a scalar, tapply returns an array;
                  if FALSE, tapply returns a list.
Note that tapply() also works when INDEX is not a factor, because R coerces it with as.factor() when needed.
> a<-data.frame(name=c("tom","sam","mik","ali"),age=c(8,9,8,9),math=c(50,100,70,90),verbal=c(90,60,96,80))
> a
  name age math verbal
1  tom   8   50     90
2  sam   9  100     60
3  mik   8   70     96
4  ali   9   90     80
> ages<-levels(as.factor(a$age))
> ages
[1] "8" "9"
> b<-matrix(nrow=length(ages),ncol=2)
> rownames(b)<-ages
> colnames(b)<-c("math","verbal")
> for(i in 1:2){
+   b[,i]<-tapply(a[,i+2],a[,"age"],mean)   # tapply orders its result by the levels of the grouping factor
+   }
> b
  math verbal
8   60     93
9   95     70
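
The pre-allocated matrix and the for loop can be avoided. A sketch of the same computation with sapply() over the score columns, plus aggregate() as an alternative that returns a data frame:

```r
a <- data.frame(name = c("tom", "sam", "mik", "ali"),
                age = c(8, 9, 8, 9),
                math = c(50, 100, 70, 90),
                verbal = c(90, 60, 96, 80))

# For each score column, compute the mean within each age group
b <- sapply(a[, c("math", "verbal")], function(col) tapply(col, a$age, mean))
b

# aggregate() gives the same numbers as a data frame
aggregate(cbind(math, verbal) ~ age, data = a, FUN = mean)
```

The matrix b has the age groups as row names and the score columns as column names, matching the loop's result above.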

table (frequency counts of factor levels)
Usage:
table(..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
    "ifany", "always"), dnn = list.names(...), deparse.level = 1)

exclude lists the levels to leave out of the count.
> d <- factor(rep(c("A","B","C"), 10), levels=c("A","B","C","D","E"))
> d
 [1] A B C A B C A B C A B C A B C A B C A B C A B C A B C A B C
Levels: A B C D E
> table(d)
d
 A  B  C  D  E
10 10 10  0  0
> table(d, exclude="B")
d
 A  C  D  E
10 10  0  0
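
The useNA argument in the signature above controls whether missing values are counted. A brief sketch on a made-up factor that contains NA values and an unused level:

```r
d <- factor(c("A", "B", NA, "A", "C", NA), levels = c("A", "B", "C", "D"))

table(d)                     # NA is dropped by default; the empty level D is kept
table(d, useNA = "ifany")    # adds a count for NA
table(droplevels(d))         # drops the unused level D
```

droplevels() is handy when zero counts for unused levels would clutter the output.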




Plotting in R: axis breaks with plotrix
2015-10-05T05:12:51.000Z
Axis breaks in R are usually drawn with axis.break(), gap.plot(), gap.barplot() and gap.boxplot() from the plotrix package.
axis.break
library(plotrix)
opar<-par(mfrow=c(1,3))
plot(sample(5:7,20,replace=T),main="Axis break test of gap",ylim=c(2,8))
axis.break(axis=2,breakpos=3.5,breakcol="red",style="gap")
plot(sample(5:7,20,replace=T),main="Axis break test of slash",ylim=c(2,8))
axis.break(axis=2,breakpos=3.5,breakcol="red",style="slash")
plot(sample(5:7,20,replace=T),main="Axis break test of zigzag",ylim=c(2,8))
axis.break(axis=2,breakpos=3.5,breakcol="red",style="zigzag")
par(opar)


parameters
axis.break(axis=1,breakpos=NULL,pos=NA,bgcol="white",breakcol="black",
           style="slash",brw=0.02)
axis:     which axis to break: 1 = x axis, 2 = y axis, 3 = top x axis, 4 = right y axis
breakpos: where to place the break in user units
pos:      position of the axis (see axis)
bgcol:    the color of the plot background
breakcol: the color of the "break" marker
style:    Either gap, slash or zigzag
brw:      break width relative to plot width

gap.plot
opar<-par(mfrow=c(1,3))
twogrp<-c(rnorm(5)+4,rnorm(5)+20,rnorm(5)+5,rnorm(5)+22)
gap.plot(twogrp,gap=c(8,16,25,35),
          xlab="X values",ylab="Y values",xlim=c(1,30),ylim=c(0,45),
          main="Test two gap plot with the lot",xtics=seq(0,30,by=5),
          ytics=c(4,6,18,20,22,38,40,42),
          lty=c(rep(1,10),rep(2,10)),
          pch=c(rep(2,10),rep(3,10)),
          col=c(rep(2,10),rep(3,10)),
          type="b")
gap.plot(21:30,rnorm(10)+40,gap=c(8,16,25,35),add=TRUE,
         lty=rep(3,10),col=rep(4,10),type="l")
gap.barplot(twogrp,gap=c(8,16),xlab="Index",ytics=c(3,6,17,20),
         ylab="Group values",main="Barplot with gap")
gap.barplot(twogrp,gap=c(8,16),xlab="Index",ytics=c(3,6,17,20),
         ylab="Group values",horiz=TRUE,main="Horizontal barplot with gap")
par(opar)


opar<-par(mfrow=c(1,2))
twovec<-list(vec1=c(rnorm(30),-6),vec2=c(sample(1:10,40,TRUE),20))
gap.boxplot(twovec,gap=list(top=c(12,18),bottom=c(-5,-3)),
        main="Show outliers separately")
gap.boxplot(twovec,gap=list(top=c(12,18),bottom=c(-5,-3)),range=0,
         main="Include outliers in whiskers")
par(opar)


twogrp<-c(rnorm(5)+4,rnorm(5)+20,rnorm(5)+5,rnorm(5)+22)
gpcol<-c(2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
gap.plot(twogrp,gap=c(8,16),xlab="Index",ylab="Group values", main="E ",col=gpcol)
legend(19, 9.5, c("2","3","4","5"), pch = 1, col = 2:5)


parameters
gap.plot(x,y,gap,gap.axis="y",bgcol="white",breakcol="black",brw=0.02,xlim=range(x),ylim=range(y),
xticlab,xtics=NA,yticlab,ytics=NA,lty=rep(1,length(x)),col=rep(par("col"),length(x)),
pch=rep(1,length(x)),add=FALSE,stax=FALSE,...)
 
x,y:      data values
gap:      the range(s) of values to be left out
gap.axis: whether the gaps are to be on the x or y axis
bgcol:    the color of the plot background
breakcol: the color of the "break" marker
brw:      break width relative to plot width
xlim,ylim:the plot limits.
xticlab:  labels for the x axis ticks
xtics:    position of the x axis ticks
yticlab:  labels for the y axis ticks
ytics:    position of the y axis ticks
lty:      line type(s) to use if there are lines
col:      color(s) in which to plot the values
pch:      symbols to use in plotting.
add:      whether to add values to an existing plot.
stax:     whether to call staxlab for staggered axis labels.

gap.barplot
After gap.plot, gap.barplot or gap.boxplot, call axis.break again to change the break style so that it looks nicer,
drawing a double-slash break; the start and end of the break can be extended as needed.
library(plotrix)
opar<-par(mfrow=c(2,2))
x<-c(1:5,6.9,7)
y<-2^x
from<-33
to<-110
plot(x,y,type="b",main="normal plot")
gap.plot(x,y,gap=c(from,to),type="b",main="gap plot")
axis.break(2,from,breakcol="red",style="gap")
axis.break(2,from*(1+0.02),breakcol="black",style="slash")
axis.break(4,from*(1+0.02),breakcol="black",style="slash")
axis(2,at=from)
gap.barplot(y,gap=c(from,to),col=as.numeric(x),main="barplot with gap")
axis.break(2,from,breakcol="red",style="gap")
axis.break(2,from*(1+0.02),breakcol="black",style="slash")
axis.break(4,from*(1+0.02),breakcol="black",style="slash")
axis(2,at=from)
gap.barplot(y,gap=c(from,to),col=as.numeric(x),horiz=T,main="Horizontal barplot with gap")
axis.break(1,from,breakcol="red",style="gap")
axis.break(1,from*(1+0.02),breakcol="black",style="slash")
axis.break(3,from*(1+0.02),breakcol="black",style="slash")
axis(1,at=from) 
par(opar)


If you get confused while plotting, come back and reread this material. The following example may be a pleasant surprise:
x1=c(3,5,6,9,375,190);
x1
x2=c(2,2,3,30,46,60);
x2
data=rbind(x1,x2);
data
colnames(data)=c("Pig","Layer","Broiler","Dairy","Beef","Sheep")
rownames(data)=c("1980","2010")
data
library(plotrix)
newdata<-data
newdata[newdata>200]<-newdata[newdata>200]-150
newdata
barpos<-barplot(newdata,names.arg=colnames(newdata),
                ylim=c(0,250),beside=TRUE,col=c("darkblue","red"),axes=FALSE)
axis(2,at=c(0,50,100,150,200,225),    # 375 is drawn at 375-150=225
     labels=c(0,50,100,150,200,375))
box()
axis.break(2,210,style="gap")
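
The rescaling trick in the example above can be wrapped in a small helper. This is a minimal sketch, assuming a single gap; `compress_above`, `threshold`, and `shift` are illustrative names that do not come from plotrix. It shows that the tick positions have to be shifted by the same amount as the data, otherwise the labels no longer line up with the bars.

```r
# Hypothetical helper: pull every value above `threshold` down by `shift`
compress_above <- function(v, threshold, shift) {
  v[v > threshold] <- v[v > threshold] - shift
  v
}

values  <- c(3, 5, 6, 9, 375, 190)              # same numbers as x1 above
plotted <- compress_above(values, threshold = 200, shift = 150)

true_ticks <- c(0, 50, 100, 150, 200, 375)      # labels the reader should see
tick_pos   <- compress_above(true_ticks, threshold = 200, shift = 150)

print(plotted)
print(tick_pos)
```

The pair can then be handed to base graphics as `axis(2, at = tick_pos, labels = true_ticks)`, with `axis.break()` placed somewhere between the threshold and the first shifted tick.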


Contribution from ：http://www.dataguru.cn/article-4827-1.html



Modify the coordinates
2015-10-03T07:30:33.000Z
The function for modifying axes
To change axis properties of this kind, use the theme() function:
gg<-ggplot(diamonds[1:20,])
gg+geom_bar(aes(price,fill=cut)) + theme(axis.text.x=element_text(family="myFont2",face="bold",size=10,angle=45,color="red"))

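Result:

Explanation:
Whenever you need to change the format of the axis text, add a theme() call like this:
theme(axis.text.x=element_text(<x-axis properties>),axis.text.y=element_text(<y-axis properties>))

element_text() (named theme_text() in older ggplot2 releases) stores text properties. Its main arguments are:
family: font family
face: bold, italic, and so on
size: font size
angle: rotation angle
color: text color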

Changing the font
Register the fonts first:
windowsFonts(myFont1=windowsFont("Times New Roman"),myFont2=windowsFont("宋体"))

Only then can family be used to change the font (the name is case-sensitive):
family="myFont1"

Changing the font face
face accepts the following values:
plain: regular
italic: italic
bold: bold
bold.italic: bold and italic

Changing the size
Give the font size as a number. For ordinary text you can use, for example:
size=8

Changing the angle
angle=45

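This rotates the text 45° counterclockwise; the angle can range from 0 to 360.
Changing the color
Either color or colour can be used to change the color. Specify the color with a keyword or with a hexadecimal color code.

For details, see http://blog.csdn.net/bone_ace/article/details/47362619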
http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
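Changing the position
Use the following arguments to change the position:
hjust: adjusts the horizontal position
vjust: adjusts the vertical position
Both take numbers, typically around 0.5; negative values are allowed.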
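
To tie the text properties above together, here is a minimal sketch that sets several of them on the x-axis labels at once. It uses the built-in mtcars data rather than the diamonds subset from earlier, and it leaves out family because windowsFonts() is only available on Windows; both choices are assumptions made to keep the example portable.

```r
library(ggplot2)

# Bold, red, size-10 labels rotated 45 degrees; hjust/vjust = 1 anchors the
# end of each rotated label under its tick mark
p_text <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(face = "bold", size = 10, angle = 45,
                                   colour = "red", hjust = 1, vjust = 1))
p_text
```

Try changing hjust and vjust in small steps to see how the labels move relative to the ticks.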

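Changing the tick labels
xname<-c("a","b")
# Chain the scales with "+"; xaxt="n" is a base-graphics argument and has no effect in ggplot()
p<- ggplot(data, aes(x=name, y=high))+
           scale_y_discrete(labels=xname)+   # only valid when y is discrete
           scale_x_discrete(labels=xname)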

scale_xx_manual(values=c(a,b,c)) overrides the values that ggplot2 assigns automatically to an aesthetic; xx can be any aesthetic mapped in aes(), such as fill, colour, or shape.
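
A minimal sketch of this idea, assuming the diamonds data that ships with ggplot2. The vector name `cut_cols` and the colour codes are illustrative. Naming the values by factor level is one way to keep colours consistent between plots, because each level keeps its colour even when a plot shows only some of the levels.

```r
library(ggplot2)

# One colour per level of diamonds$cut, matched by name rather than by position
cut_cols <- c("Fair" = "#999999", "Good" = "#E69F00", "Very Good" = "#56B4E9",
              "Premium" = "#009E73", "Ideal" = "#F0E442")

p_all <- ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  scale_fill_manual(values = cut_cols)

# A subset with only two levels still uses the same colours for those levels
p_sub <- ggplot(subset(diamonds, cut %in% c("Good", "Ideal")),
                aes(clarity, fill = cut)) +
  geom_bar() +
  scale_fill_manual(values = cut_cols)
```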
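Modifying the legend

In ggplot2 the legend has four parts: legend.title, legend.text, legend.key, and legend.background. The following functions are used to work with them:
element_text() draws labels and titles. It controls the font family, face, colour, size, hjust, vjust, angle, and lineheight. When you change the angle, hjust needs to be set to 0 or 1.
element_rect() draws rectangles, mainly for backgrounds. You can control the fill colour and the border colour, size, and linetype.
element_blank() is an empty element that allocates no drawing space. Use it to remove plot elements you are not interested in. The older approach of setting colour=NA, fill=NA made elements invisible, but they still took up space.
theme_get() returns the current theme settings.
theme() modifies elements locally for a single plot; theme_update() changes them globally for all plots drawn afterwards.
Removing the legend title
The line below removes only the legend title; use legend.position="none" to remove the whole legend.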
p+theme(legend.title=element_blank())

Legend position
p+theme(legend.position="left")

The position and alignment of the legend are controlled by the theme setting legend.position, whose value can be right, left, top, bottom, or none.
Changing the legend title text
p+scale_colour_hue(name="what does it eat?",breaks=c("herbi","carni","omni",NA),labels=c("plants","meat","both","don't know"))
Notes: name sets the legend title (legend.title)
    breaks gives the original label values (legend.text)
    labels gives the custom label text (legend.text)

Changing the size
p+theme(legend.background=element_rect(colour="purple",fill="pink",size=3,linetype="dashed"));
p+theme(legend.key.size=unit(2,'cm'));
p+theme(legend.key.width=unit(5,'cm'));
p+theme(legend.text = element_text(colour = 'red', angle = 45, size = 10, hjust = 3, vjust = 3, face = 'bold'))

Error: could not find function "unit"
Fix: library(grid)
Changing colors and keeping them consistent
library(RColorBrewer);
newpalette<-colorRampPalette(brewer.pal(12,"Set3"))(length(unique(eee$name)));
p+scale_fill_manual(values=newpalette);
p+geom_bar(position="stack",aes(order=desc(name)))

More legend modifications: https://github.com/hadley/ggplot2/wiki/Legend-Attributes
Changing the displayed axis range
gg+geom_line(aes(depth,price,color=cut,alpha=1/3),size=2) +labs(title="example")


gg+geom_line(aes(depth,price,color=cut,alpha=1/3),size=2) +
     labs(title="example") +
     scale_x_continuous(limits=c(60,64))


Changing how the axis ticks are displayed
gg+geom_line(aes(depth,price,color=cut,alpha=1/3),size=2) +
    labs(title="example") +
    scale_x_continuous(limits=c(60,64)) +
    theme(axis.text.x=element_text(angle=45,size=5))


To change the spacing of the axis ticks, use the breaks argument and set the interval with seq(start, end, step):
gg+geom_line(aes(depth,price,color=cut,alpha=1/3),size=2) +
    labs(title="example") +
    scale_x_continuous(limits=c(60,64),breaks=seq(60,64,2)) +
    theme(axis.text.x=element_text(angle=45,size=5))
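
The breaks vector can also be built separately and inspected before it is passed to the scale. The following is a minimal sketch; the names `x_limits` and `x_breaks` and the 0.5 step are illustrative choices, not values from the original post.

```r
library(ggplot2)

x_limits <- c(60, 64)
# seq(start, end, by = step): one tick every 0.5 units between the limits
x_breaks <- seq(x_limits[1], x_limits[2], by = 0.5)
print(x_breaks)

p_breaks <- ggplot(diamonds[1:20, ], aes(depth, price)) +
  geom_line() +
  scale_x_continuous(limits = x_limits, breaks = x_breaks)
```

Keep in mind that limits set on a scale discard observations outside the range before drawing. If you only want to zoom in without dropping data, coord_cartesian(xlim = x_limits) is the usual alternative.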


 Contribution from ：http://blog.sina.com.cn/s/blog_670445240102v250.html



Mix multiple graphs on the same page
2015-10-03T03:23:52.000Z
Easy way to mix multiple graphs on the same page - R software and data visualization
Install and load required packages
install.packages("gridExtra")
library("gridExtra")
install.packages("cowplot")
library("cowplot")

Prepare some data
df <- ToothGrowth
# Convert the variable dose from a numeric to a factor variable
df$dose <- as.factor(df$dose)
head(df)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Cowplot: Publication-ready plots
The cowplot package is an extension to ggplot2 that can be used to produce publication-ready plots.
Basic plots
library(cowplot)
# Default plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot() + 
  theme(legend.position = "none")
bp

# Add gridlines
bp + background_grid(major = "xy", minor = "none")


Recall that the function ggsave() [in the ggplot2 package] can be used to save ggplots. When working with cowplot, however, the function save_plot() [in the cowplot package] is preferred. It is an alternative to ggsave() with better support for multi-figure plots.
# plot.mpg stands for a ggplot object created earlier
save_plot("mpg.pdf", plot.mpg,
          base_aspect_ratio = 1.3 # make room for figure legend
          )

Arranging multiple graphs using cowplot
# Scatter plot
sp <- ggplot(mpg, aes(x = cty, y = hwy, colour = factor(cyl)))+ 
  geom_point(size=2.5)
sp

# Bar plot
bp <- ggplot(diamonds, aes(clarity, fill = cut)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle=70, vjust=0.5))
bp


Combine the two plots (the scatter plot and the bar plot):
plot_grid(sp, bp, labels=c("A","B"), ncol = 2, nrow = 1)


The function draw_plot() can be used to place graphs at particular locations with particular sizes. The format of the function is:
draw_plot(plot, x = 0, y = 0, width = 1, height = 1)


plot: the plot to place (ggplot2 or a gtable)

x: The x location of the lower left corner of the plot.

y: The y location of the lower left corner of the plot.

width, height: the width and the height of the plot


The function ggdraw() is used to initialize an empty drawing canvas.
plot.iris <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
  geom_point() + facet_grid(. ~ Species) + stat_smooth(method = "lm") +
  background_grid(major = 'y', minor = "none") + # add thin horizontal lines 
  panel_border() # and a border around each panel
# sp and bp were defined earlier
ggdraw() +
  draw_plot(plot.iris, 0, .5, 1, .5) +
  draw_plot(sp, 0, 0, .5, .5) +
  draw_plot(bp, .5, 0, .5, .5) +
  draw_plot_label(c("A", "B", "C"), c(0, 0, 0.5), c(1, 0.5, 0.5), size = 15)


grid.arrange: Create and arrange multiple plots
The R code below creates a box plot, a dot plot, a violin plot and a stripchart (jitter plot) :
library(ggplot2)
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot() + 
  theme(legend.position = "none")

# Create a dot plot
# Add the mean point and the standard deviation
dp <- ggplot(df, aes(x=dose, y=len, fill=dose)) +
  geom_dotplot(binaxis='y', stackdir='center')+
  stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), # mean_sdl() needs the Hmisc package
                 geom="pointrange", color="red")+
   theme(legend.position = "none")

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len)) +
  geom_violin()+
  geom_boxplot(width=0.1)

# Create a stripchart
sc <- ggplot(df, aes(x=dose, y=len, color=dose, shape=dose)) +
  geom_jitter(position=position_jitter(0.2))+
  theme_gray() +  # apply the complete theme first so it does not reset legend.position
  theme(legend.position = "none")

Combine the plots using the function grid.arrange() [in gridExtra] :
library(gridExtra)
grid.arrange(bp, dp, vp, sc, ncol=2, 
             top="Multiple plots on the same page") # gridExtra >= 2.0 uses top= (formerly main=)


Add a common legend for multiple ggplot2 graphs
This can be done in four simple steps :

Create the plots : p1, p2, ….

Save the legend of the plot p1 as an external graphical element (called a “grob” in Grid terminology)

Remove the legends from all plots

Draw all the plots with only one legend in the right panel


To save the legend of a ggplot, the helper function below can be used :
library(gridExtra)
get_legend<-function(myggplot){
  tmp <- ggplot_gtable(ggplot_build(myggplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

(The function above is derived from this forum. )
# 1. Create the plots
#++++++++++++++++++++++++++++++++++
# Create a box plot
bp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_boxplot()

# Create a violin plot
vp <- ggplot(df, aes(x=dose, y=len, color=dose)) +
  geom_violin()+
  geom_boxplot(width=0.1)+
  theme(legend.position="none")

# 2. Save the legend
#+++++++++++++++++++++++
legend <- get_legend(bp)

# 3. Remove the legend from the box plot
#+++++++++++++++++++++++
bp <- bp + theme(legend.position="none")

# 4. Arrange ggplot2 graphs with a specific width
grid.arrange(bp, vp, legend, ncol=3, widths=c(2.3, 2.3, 0.8))


Scatter plot with marginal density plots
Step 1/3. Create some data :
x <- c(rnorm(500, mean = -1), rnorm(500, mean = 1.5))
y <- c(rnorm(500, mean = 1), rnorm(500, mean = 1.7))
group <- as.factor(rep(c(1,2), each=500))
df2 <- data.frame(x, y, group)
head(df2)
##             x          y group
## 1 -2.20706575 -0.2053334     1
## 2 -0.72257076  1.3014667     1
## 3  0.08444118 -0.5391452     1
## 4 -3.34569770  1.6353707     1
## 5 -0.57087531  1.7029518     1
## 6 -0.49394411 -0.9058829     1

Step 2/3. Create the plots :
# Scatter plot of x and y variables and color by groups
scatterPlot <- ggplot(df2,aes(x, y, color=group)) + 
  geom_point() + 
  scale_color_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position=c(0,1), legend.justification=c(0,1))


# Marginal density plot of x (top panel)
xdensity <- ggplot(df2, aes(x, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")

# Marginal density plot of y (right panel)
ydensity <- ggplot(df2, aes(y, fill=group)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c('#999999','#E69F00')) + 
  theme(legend.position = "none")

Create a blank placeholder plot :
blankPlot <- ggplot()+geom_blank(aes(1,1))+
  theme(
    plot.background = element_blank(), 
   panel.grid.major = element_blank(),
   panel.grid.minor = element_blank(), 
   panel.border = element_blank(),
   panel.background = element_blank(),
   axis.title.x = element_blank(),
   axis.title.y = element_blank(),
   axis.text.x = element_blank(), 
   axis.text.y = element_blank(),
   axis.ticks = element_blank(),
   axis.line = element_blank()
     )

Step 3/3. Put the plots together:
Arrange ggplot2 with adapted height and width for each row and column :
library("gridExtra")
grid.arrange(xdensity, blankPlot, scatterPlot, ydensity, 
        ncol=2, nrow=2, widths=c(4, 1.4), heights=c(1.4, 4))


Create a complex layout using the function viewport()
The different steps are :

Create plots : p1, p2, p3, ….

Move to a new page on a grid device using the function grid.newpage()

Create a layout 2X2 - number of columns = 2; number of rows = 2

Define a grid viewport : a rectangular region on a graphics device

Print a plot into the viewport


library(grid)  # provides grid.newpage(), pushViewport(), viewport()
# Move to a new page
grid.newpage()

# Create layout : nrow = 2, ncol = 2
pushViewport(viewport(layout = grid.layout(2, 2)))

# A helper function to define a region on the layout
define_region <- function(row, col){
  viewport(layout.pos.row = row, layout.pos.col = col)
} 

# Arrange the plots
print(scatterPlot, vp=define_region(1, 1:2))
print(xdensity, vp = define_region(2, 1))
print(ydensity, vp = define_region(2, 2))


Insert an external graphical element inside a ggplot
The function annotation_custom() [in ggplot2] can be used for adding tables, plots or other grid-based elements. The simplified format is :
annotation_custom(grob, xmin, xmax, ymin, ymax)


grob: the external graphical element to display

xmin, xmax : x location in data coordinates (horizontal location)

ymin, ymax : y location in data coordinates (vertical location)


The different steps are :

Create a scatter plot of y = f(x)

Add, for example, the box plot of the variables x and y inside the scatter plot using the function annotation_custom()


As the inset box plot overlaps with some points, a transparent background is used for the box plots.
# Create a transparent theme object
transparent_theme <- theme(
 axis.title.x = element_blank(),
 axis.title.y = element_blank(),
 axis.text.x = element_blank(), 
 axis.text.y = element_blank(),
 axis.ticks = element_blank(),
 panel.grid = element_blank(),
 axis.line = element_blank(),
 panel.background = element_rect(fill = "transparent",colour = NA),
 plot.background = element_rect(fill = "transparent",colour = NA))

Create the graphs :
p1 <- scatterPlot # see previous sections for the scatterPlot

# Box plot of the x variable
p2 <- ggplot(df2, aes(factor(1), x))+
  geom_boxplot(width=0.3)+coord_flip()+
  transparent_theme

# Box plot of the y variable
p3 <- ggplot(df2, aes(factor(1), y))+
  geom_boxplot(width=0.3)+
  transparent_theme

# Create the external graphical elements
# called a "grob" in Grid terminology
p2_grob = ggplotGrob(p2)
p3_grob = ggplotGrob(p3)
   

# Insert p2_grob inside the scatter plot
xmin <- min(x); xmax <- max(x)
ymin <- min(y); ymax <- max(y)
p1 + annotation_custom(grob = p2_grob, xmin = xmin, xmax = xmax, 
                       ymin = ymin-1.5, ymax = ymin+1.5)


# Insert p3_grob inside the scatter plot
p1 + annotation_custom(grob = p3_grob,
                       xmin = xmin-1.5, xmax = xmin+1.5, 
                       ymin = ymin, ymax = ymax)


If you have a solution to insert both p2_grob and p3_grob inside the scatter plot at the same time, please leave me a comment. I got some errors trying to do this…
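
One approach that may work, at least with recent ggplot2 versions, is to chain two annotation_custom() calls, since each call adds its own layer. The sketch below is self-contained and uses made-up data (`d`) and illustrative object names (`see_through`, `main_plot`, `box_x`, `box_y`, `both`) rather than the df2 and scatterPlot objects from the earlier steps. It has not been checked against the ggplot2 1.0.0 release used for this post.

```r
library(ggplot2)

set.seed(123)
d <- data.frame(x = rnorm(200), y = rnorm(200))  # made-up data

# Transparent theme, same idea as transparent_theme above
see_through <- theme(
  axis.title = element_blank(), axis.text = element_blank(),
  axis.ticks = element_blank(), panel.grid = element_blank(),
  panel.background = element_rect(fill = "transparent", colour = NA),
  plot.background = element_rect(fill = "transparent", colour = NA))

main_plot <- ggplot(d, aes(x, y)) + geom_point()
box_x <- ggplot(d, aes(factor(1), x)) +
  geom_boxplot(width = 0.3) + coord_flip() + see_through
box_y <- ggplot(d, aes(factor(1), y)) +
  geom_boxplot(width = 0.3) + see_through

xr <- range(d$x)
yr <- range(d$y)

# Each annotation_custom() call contributes one inset
both <- main_plot +
  annotation_custom(ggplotGrob(box_x), xmin = xr[1], xmax = xr[2],
                    ymin = yr[1] - 1.5, ymax = yr[1] + 1.5) +
  annotation_custom(ggplotGrob(box_y), xmin = xr[1] - 1.5, xmax = xr[1] + 1.5,
                    ymin = yr[1], ymax = yr[2])
print(both)
```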
Mix table, text and ggplot2 graphs
The functions below are required :

tableGrob() [in the package gridExtra] : for adding a data table to a graphic device

splitTextGrob() [in the package RGraphics] : for adding a text to a graph


Make sure that the package RGraphics is installed.
library(RGraphics)
library(gridExtra)

# Table
p1 <- tableGrob(head(ToothGrowth))

# Text
text <- "ToothGrowth data describes the effect of Vitamin C on tooth growth in Guinea pigs.  Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used."
p2 <- splitTextGrob(text)

# Box plot
p3 <- ggplot(df, aes(x=dose, y=len)) + geom_boxplot()

# Arrange the plots on the same page
grid.arrange(p1, p2, p3, ncol=1)


Infos
 This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0) 
Contribution from ：http://www.sthda.com/english/wiki/ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page-r-software-and-data-visualization



Multiple graphs on one page using ggplot2
2015-10-03T02:41:58.000Z
The easy way is to use the multiplot function to put multiple graphs on one page, defined at the bottom of this page. If it isn’t suitable for your needs, you can copy and modify it.
Problem
You want to put multiple graphs on one page.
Solution-1
use the multiplot function
plots and store
First, set up the plots and store them, but don’t render them yet. The details of these plots aren’t important; all you need to do is store the plot objects in variables.
library(ggplot2)

# This example uses the ChickWeight dataset, which comes with R (datasets package)
# First plot
p1 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet, group=Chick)) +
    geom_line() +
    ggtitle("Growth curve for individual chicks")

# Second plot
p2 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet)) +
    geom_point(alpha=.3) +
    geom_smooth(alpha=.2, size=1) +
    ggtitle("Fitted growth curve per diet")

# Third plot
p3 <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, colour=Diet)) +
    geom_density() +
    ggtitle("Final weight, by diet")

# Fourth plot
p4 <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, fill=Diet)) +
    geom_histogram(colour="black", binwidth=50) +
    facet_grid(Diet ~ .) +
    ggtitle("Final weight, by diet") +
    theme(legend.position="none")        # No legend (redundant in this graph)

multiplot function
This is the definition of multiplot. It can take any number of plot objects as arguments, or a list of plot objects passed to plotlist.
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
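
To see how the layout matrix decides where each plot goes, the lookup that multiplot performs inside its loop can be run on its own. This is a small sketch; `cells_for` is a hypothetical helper name introduced here only to wrap that one line.

```r
# Plot 1 top-left, plot 2 top-right, plot 3 across the whole bottom row
layout <- matrix(c(1, 2, 3, 3), nrow = 2, byrow = TRUE)

# Same lookup as in multiplot(): rows and columns of the cells that hold plot i
cells_for <- function(layout, i) {
  as.data.frame(which(layout == i, arr.ind = TRUE))
}

print(cells_for(layout, 1))
print(cells_for(layout, 3))
```

When a plot number appears in several cells, the viewport receives all of their row and column indices, which is what lets that plot span the cells.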

multiplot
Once the plot objects are set up, we can render them with multiplot. This will make two columns of graphs:
multiplot(p1, p2, p3, p4, cols=2)
#> Loading required package: grid
#> geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.


Solution-2
facet_grid
p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
# With one variable
p + facet_grid(. ~ cyl)


# With two variables
p + facet_grid(vs ~ am)


Solution-3
grid.arrange
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol=2, 
             top="Multiple plots on the same page") # gridExtra >= 2.0 uses top= (formerly main=)





quick start guide of ggplot2 line plot  - R software and data visualization
2015-09-21T06:29:22.000Z
This R tutorial describes how to create line plots using R software and ggplot2 package.

In a line graph, observations are ordered by x value and connected.

The functions geom_line(), geom_step(), or geom_path() can be used.

x value (for x axis) can be :


date : for a time series data

texts

discrete numeric values

continuous numeric values




Basic line plots


Data

Data derived from ToothGrowth data sets are used. ToothGrowth describes the effect of Vitamin C on tooth growth in Guinea pigs.

df <- data.frame(dose=c("D0.5", "D1", "D2"),
                len=c(4.2, 10, 29.5))

head(df)
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5


len : Tooth length

dose : Dose in milligrams (0.5, 1, 2)




Create line plots with points

# Basic line plot with points
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line()+
  geom_point()

# Change the line type
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(linetype = "dashed")+
  geom_point()

# Change the color
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(color="red")+
  geom_point()


Read more on line types : ggplot2 line types

You can add an arrow to the line using the grid package :

library(grid)
# Add an arrow
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow = arrow())+
  geom_point()

# Add a closed arrow to the end of the line
myarrow=arrow(angle = 15, ends = "both", type = "closed")
ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_line(arrow=myarrow)+
  geom_point()


Observations can be also connected using the functions geom_step() or geom_path() :

ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_step()+
  geom_point()


ggplot(data=df, aes(x=dose, y=len, group=1)) +
  geom_path()+
  geom_point()






geom_line : Connecting observations, ordered by x value

geom_path() : Observations are connected in original order

geom_step : Connecting observations by stairs
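
The difference between the three functions is easiest to see when the rows are not already sorted by x. The following is a minimal sketch with a made-up three-row data frame `d`. It uses layer_data() from ggplot2 to look at the data each layer will draw, and assumes a recent ggplot2 version, where geom_line() reorders its data by x before drawing.

```r
library(ggplot2)

d <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))  # deliberately unsorted

p_line <- ggplot(d, aes(x, y)) + geom_line()   # connects points in x order
p_path <- ggplot(d, aes(x, y)) + geom_path()   # connects points in row order
p_step <- ggplot(d, aes(x, y)) + geom_step()   # stair steps between points

# Inspect the x values in the order each layer draws them
print(layer_data(p_line)$x)
print(layer_data(p_path)$x)
```

With data that is already sorted by x, such as the df used above, geom_line() and geom_path() draw the same line.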






Line plot with multiple groups


Data

Data derived from ToothGrowth data sets are used. ToothGrowth describes the effect of Vitamin C on tooth growth in Guinea pigs. Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used :

df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("D0.5", "D1", "D2"),2),
                len=c(6.8, 15, 33, 4.2, 10, 29.5))

head(df2)
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5


len : Tooth length

dose : Dose in milligrams (0.5, 1, 2)

supp : Supplement type (VC or OJ)




Create line plots

In the graphs below, line types, colors and sizes are the same for the two groups :

# Line plot with multiple groups
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line()+
  geom_point()

# Change line types
ggplot(data=df2, aes(x=dose, y=len, group=supp)) +
  geom_line(linetype="dashed", color="blue", size=1.2)+
  geom_point(color="red", size=3)




Change line types by groups

In the graphs below, line types and point shapes are controlled automatically by the levels of the variable supp :

# Change line types by groups (supp)
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()

# Change line types and point shapes
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point(aes(shape=supp))


It is also possible to change manually the line types using the function scale_linetype_manual().

# Set line types manually
ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(linetype=supp))+
  geom_point()+
  scale_linetype_manual(values=c("twodash", "dotted"))


You can read more on line types here : ggplot2 line types

If you want to change also point shapes, read this article : ggplot2 point shapes



Change line colors by groups

Line colors are controlled automatically by the levels of the variable supp :

p<-ggplot(df2, aes(x=dose, y=len, group=supp)) +
  geom_line(aes(color=supp))+
  geom_point(aes(color=supp))
p


It is also possible to change manually line colors using the functions :


scale_color_manual() : to use custom colors

scale_color_brewer() : to use color palettes from RColorBrewer package

scale_color_grey() : to use grey color palettes


# Use custom color palettes
p+scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"))

# Use brewer color palettes
p+scale_color_brewer(palette="Dark2")

# Use grey scale
p + scale_color_grey() + theme_classic()


Read more on ggplot2 colors here : ggplot2 colors




Change the legend position

p <- p + scale_color_brewer(palette="Paired")+
  theme_minimal()

p + theme(legend.position="top")

p + theme(legend.position="bottom")

# Remove legend
p + theme(legend.position="none")


The allowed values for the argument legend.position are : “left”, “top”, “right”, “bottom” and “none”.

Read more on ggplot legend : ggplot2 legend



Line plot with a numeric x-axis

If the variable on x-axis is numeric, it can be useful to treat it as a continuous or a factor variable depending on what you want to do :

# Create some data
df2 <- data.frame(supp=rep(c("VC", "OJ"), each=3),
                dose=rep(c("0.5", "1", "2"),2),
               len=c(6.8, 15, 33, 4.2, 10, 29.5))
head(df2)
##   supp dose  len
## 1   VC  0.5  6.8
## 2   VC    1 15.0
## 3   VC    2 33.0
## 4   OJ  0.5  4.2
## 5   OJ    1 10.0
## 6   OJ    2 29.5

# x axis treated as continuous variable
df2$dose <- as.numeric(as.vector(df2$dose))
ggplot(data=df2, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()+
  scale_color_brewer(palette="Paired")+
  theme_minimal()

# Axis treated as discrete variable
df2$dose<-as.factor(df2$dose)
ggplot(data=df2, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()+
  scale_color_brewer(palette="Paired")+
  theme_minimal()




Line plot with dates on x-axis

economics time series data sets are used :

head(economics)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-06-30 507.8 198712     9.8     4.5     2944
## 2 1967-07-31 510.9 198911     9.8     4.7     2945
## 3 1967-08-31 516.7 199113     9.0     4.6     2958
## 4 1967-09-30 513.3 199311     9.8     4.9     3143
## 5 1967-10-31 518.5 199498     9.7     4.7     3066
## 6 1967-11-30 526.2 199657     9.4     4.8     3018

Plots :

# Basic line plot
ggplot(data=economics, aes(x=date, y=pop))+
  geom_line()

# Plot a subset of the data
ggplot(data=subset(economics, date > as.Date("2006-1-1")), 
       aes(x=date, y=pop))+geom_line()


Change line size :

# Change line size
ggplot(data=economics, aes(x=date, y=pop, size=unemploy/pop))+
  geom_line()




Line graph with error bars

The function below will be used to calculate the mean and the standard deviation, for the variable of interest, in each group :

#+++++++++++++++++++++++++
# Function to calculate the mean and the standard deviation
  # for each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of a column containing the variable
  # to be summarized
# groupnames : vector of column names to be used as
  # grouping variables
data_summary <- function(data, varname, groupnames){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, groupnames, .fun=summary_func,
                  varname)
  data_sum <- rename(data_sum, c("mean" = varname))
 return(data_sum)
}
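
If plyr is not installed, roughly the same summary can be computed with base R. The following is a minimal sketch that keeps the argument names of data_summary(); `summarise_mean_sd` is a hypothetical name, and the approach (two aggregate() calls joined with merge()) is a substitute for ddply(), not the method used in the original tutorial.

```r
# Base-R sketch: mean and standard deviation of `varname` per group
summarise_mean_sd <- function(data, varname, groupnames) {
  f <- reformulate(groupnames, response = varname)   # e.g. len ~ supp + dose
  means <- aggregate(f, data = data, FUN = mean)
  sds   <- aggregate(f, data = data, FUN = sd)
  names(sds)[names(sds) == varname] <- "sd"
  merge(means, sds, by = groupnames)                 # mean keeps the name of varname
}

df3_base <- summarise_mean_sd(ToothGrowth, varname = "len",
                              groupnames = c("supp", "dose"))
print(df3_base)
```

The formula interface of aggregate() drops rows with missing values by default, which plays the same role as na.rm=TRUE in the plyr version.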

Summarize the data :

df3 <- data_summary(ToothGrowth, varname="len", 
                    groupnames=c("supp", "dose"))
head(df3)
##   supp dose   len       sd
## 1   OJ  0.5 13.23 4.459709
## 2   OJ  1.0 22.70 3.910953
## 3   OJ  2.0 26.06 2.655058
## 4   VC  0.5  7.98 2.746634
## 5   VC  1.0 16.77 2.515309
## 6   VC  2.0 26.14 4.797731

The function geom_errorbar() can be used to produce a line graph with error bars :

# Standard deviation of the mean
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()

# Use position_dodge to move overlapped errorbars horizontally
ggplot(df3, aes(x=dose, y=len, group=supp, color=supp)) + 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() + geom_point()+
   scale_color_brewer(palette="Paired")+theme_minimal()




Customized line graphs

# Simple line plot
# Change point shapes and line types by groups
ggplot(df3, aes(x=dose, y=len, shape=supp, linetype=supp))+ 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line() +
    geom_point()+
    labs(title="Plot of length by dose",x="Dose (mg)", y = "Length")+
    theme_classic()


# Change color by groups
# Add error bars
p <- ggplot(df3, aes(x=dose, y=len,  color=supp))+ 
    geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.1, 
    position=position_dodge(0.05)) +
    geom_line(aes(linetype=supp)) + 
    geom_point(aes(shape=supp))+
    labs(title="Plot of length by dose",x="Dose (mg)", y = "Length")+
    theme_classic()

p + theme_classic() + scale_color_manual(values=c('#999999','#E69F00'))


Change colors manually :

p + scale_color_brewer(palette="Paired") + theme_minimal()

# Greens
p + scale_color_brewer(palette="Greens") + theme_minimal()

# Reds
p + scale_color_brewer(palette="Reds") + theme_minimal()


Infos

 This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0) 

Contribution from ：http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization



ggplot2 error bars (finished)-Quick start guide - R software and data visualization
2015-09-21T05:13:36.000Z
This tutorial describes how to create a graph with error bars using R software and ggplot2 package. There are different types of error bars which can be created using the functions below :


geom_errorbar()

geom_linerange()

geom_pointrange()

geom_crossbar()

geom_errorbarh()
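
As a quick orientation, the vertical variants in this list can be compared on one small summary table. This is a minimal sketch: the data frame `s` holds illustrative numbers typed in by hand, not values computed in this tutorial, and the object names are arbitrary.

```r
library(ggplot2)

# Illustrative summary table: one mean and one standard deviation per dose
s <- data.frame(dose = factor(c("0.5", "1", "2")),
                len  = c(10.6, 19.7, 26.1),
                sd   = c(4.5, 4.4, 3.8))

# All four geoms read the same ymin/ymax aesthetics
base <- ggplot(s, aes(x = dose, y = len, ymin = len - sd, ymax = len + sd))

p_errorbar   <- base + geom_errorbar(width = 0.2) + geom_point()  # bars with caps
p_linerange  <- base + geom_linerange() + geom_point()            # plain vertical lines
p_pointrange <- base + geom_pointrange()                          # line plus point in one geom
p_crossbar   <- base + geom_crossbar(width = 0.4)                 # box with a middle line
```

geom_errorbarh() is the horizontal counterpart and takes xmin and xmax instead of ymin and ymax, so it is not shown here.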


Add error bars to a bar and line plots

Prepare the data

ToothGrowth data is used. It describes the effect of Vitamin C on tooth growth in Guinea pigs. Three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods [orange juice (OJ) or ascorbic acid (VC)] are used :

library(ggplot2)
df <- ToothGrowth
df$dose <- as.factor(df$dose)
head(df)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5


len : Tooth length

dose : Dose in milligrams (0.5, 1, 2)

supp : Supplement type (VC or OJ)

In the example below, we’ll plot the mean value of Tooth length in each group. The standard deviation is used to draw the error bars on the graph.

First, the helper function below will be used to calculate the mean and the standard deviation, for the variable of interest, in each group :

#+++++++++++++++++++++++++
# Function to calculate the mean and the standard deviation
  # for each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of a column containing the variable
  #to be summariezed
# groupnames : vector of column names to be used as
  # grouping variables
data_summary <- function(data, varname, groupnames){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, groupnames, .fun=summary_func,
                  varname)
  data_sum <- rename(data_sum, c("mean" = varname))
 return(data_sum)
}


Summarize the data :

df2 <- data_summary(ToothGrowth, varname="len", 
                    groupnames=c("supp", "dose"))
# Convert dose to a factor variable
df2$dose=as.factor(df2$dose)
head(df2)
   supp dose   len       sd
1   OJ  0.5 13.23 4.459709
2   OJ    1 22.70 3.910953
3   OJ    2 26.06 2.655058
4   VC  0.5  7.98 2.746634
5   VC    1 16.77 2.515309
6   VC    2 26.14 4.797731


Barplot with error bars

The function geom_errorbar() can be used to produce the error bars :

library(ggplot2)
# Default bar plot
p<- ggplot(df2, aes(x=dose, y=len, fill=supp)) + 
  geom_bar(stat="identity", color="black", 
           position=position_dodge()) +
  geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.2,
                 position=position_dodge(.9)) 
print(p)

# Finished bar plot
p+labs(title="Tooth length per dose", x="Dose (mg)", y = "Length")+
   theme_classic() +
   scale_fill_manual(values=c('#999999','#E69F00'))



Note that you can choose to keep only the upper error bars :
# Keep only upper error bars
 ggplot(df2, aes(x=dose, y=len, fill=supp)) + 
  geom_bar(stat="identity", color="black", position=position_dodge()) +
  geom_errorbar(aes(ymin=len, ymax=len+sd), width=.2,
                 position=position_dodge(.9))



Read more on ggplot2 bar graphs : ggplot2 bar graphs

Line plot with error bars

# Default line plot
p<- ggplot(df2, aes(x=dose, y=len, group=supp, color=supp)) + 
  geom_line() +
  geom_point()+
  geom_errorbar(aes(ymin=len-sd, ymax=len+sd), width=.2,
                 position=position_dodge(0.05))
print(p)

# Finished line plot
p+labs(title="Tooth length per dose", x="Dose (mg)", y = "Length")+
   theme_classic() +
   scale_color_manual(values=c('#999999','#E69F00'))



You can also use the functions geom_pointrange() or geom_linerange() instead of using geom_errorbar()

# Use geom_pointrange
ggplot(df2, aes(x=dose, y=len, group=supp, color=supp)) + 
geom_pointrange(aes(ymin=len-sd, ymax=len+sd))

# Use geom_line()+geom_pointrange()
ggplot(df2, aes(x=dose, y=len, group=supp, color=supp)) + 
  geom_line()+
  geom_pointrange(aes(ymin=len-sd, ymax=len+sd))



Read more on ggplot2 line plots : ggplot2 line plots

Dot plot with mean point and error bars

The functions geom_dotplot() and stat_summary() are used :

The mean +/- SD can be added as a crossbar, an error bar or a pointrange :

p <- ggplot(df, aes(x=dose, y=len)) + 
    geom_dotplot(binaxis='y', stackdir='center')

# mean_sdl() relies on the Hmisc package. In ggplot2 >= 2.0.0 its arguments
# are passed through fun.args; older releases accepted mult=1 directly.

# use geom_crossbar()
p + stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), 
                 geom="crossbar", width=0.5)

# Use geom_errorbar()
# (the fun argument was called fun.y before ggplot2 3.3.0)
p + stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), 
        geom="errorbar", color="red", width=0.2) +
  stat_summary(fun=mean, geom="point", color="red")
   
# Use geom_pointrange()
p + stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), 
                 geom="pointrange", color="red")



Read more on ggplot2 dot plots : ggplot2 dot plot

Infos

 This analysis has been performed using R software (ver. 3.1.2) and ggplot2 (ver. 1.0.0) 

Contribution from ：http://www.sthda.com/english/wiki/ggplot2-error-bars-quick-start-guide-r-software-and-data-visualization



CRISPR/Cas9
2015-09-13T01:26:55.000Z
CRISPR/Cas9 gene editing technology has revolutionized the field of genome modification. This system is based on two key components that form a complex: Cas9 endonuclease and a target-specific RNA (single guide RNA or sgRNA) that guides Cas9 to the genomic DNA target site. Targeting to a particular genomic locus is solely mediated by the sgRNA.

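Introduction to CRISPR/Cas9
Both bacteria and archaea have a defense mechanism against invading foreign genetic elements. It is an adaptive immune system built on clustered regularly interspaced short palindromic repeats (CRISPR).
The CRISPR system gathers short fragments of foreign viral or plasmid DNA into a specific repeat region of the cell's genome, where they are stored as a memory of past invasions. This region is then transcribed into a precursor CRISPR RNA (pre-crRNA), which is cleaved into individual repeat-containing RNA fragments; these small RNAs are the mature CRISPR RNAs (crRNAs). The crRNA then recruits CRISPR-associated (Cas) proteins and tracrRNA, binds the foreign DNA or mRNA fragments that the cell has remembered, and destroys them.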

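History of the CRISPR/Cas9 discovery
1987: Japanese researchers found tandem interspaced repeat sequences in Escherichia coli; later studies showed that such repeats are widespread in bacteria and archaea.
2002: The repeats were formally named CRISPR (clustered regularly interspaced short palindromic repeats).
2005: Three research groups simultaneously found that the spacer sequences are highly homologous to viruses or phages that infect bacteria, and inferred that the system, much like siRNA, might be a mechanism by which bacteria resist phages.
2007: A paper in Science showed that bacteria may use the CRISPR system against phage invasion and outlined the overall process: the cas locus encodes several nucleases and helicases, which cut the invading DNA and integrate it into the CRISPR repeat array as a memory. On re-invasion, RNA is transcribed, and the Cas protein complex uses these RNAs, which are homologous to the invading DNA, to cut and destroy the foreign DNA.
2012: The Science paper by Jennifer A. Doudna and Emmanuelle Charpentier worked out the mechanism of a relatively simple CRISPR system (type II).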

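Mechanism of the type II CRISPR/Cas9 system
Terminology

The CRISPR–Cas9 system belongs to the type II systems, as shown in the figure. Guided by RNA, the Cas9 endonuclease can cut invading foreign DNA at defined sites, although recognition depends mainly on a conserved protospacer adjacent motif (PAM). Forming a functional DNA-cleaving complex requires two further RNA molecules: the CRISPR RNA (crRNA) and the trans-acting CRISPR RNA (tracrRNA). Recent work has shown that these two RNAs can be re-engineered into a single guide RNA (sgRNA). The tracrRNA is complementary to a short stretch of the crRNA, and the paired RNAs are substrates of the host RNA-specific ribonuclease RNase III. After RNase III cleavage, the complementary pair (a 42-nt crRNA and a 75-nt tracrRNA) serves as the guide for Cas9; in other words, the tracrRNA helps the Cas9-crRNA complex locate the invading DNA sequence precisely within the cell's complex DNA environment. The sgRNA alone is sufficient to direct Cas9 to cut DNA at a defined site. After a specific site in an intact genome has been cut, the cell usually repairs the double-strand break in one of two ways: homologous recombination (HR) or non-homologous end joining (NHEJ). During repair, the cell may modify the repair site or insert new genetic information.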

Figure legend
The type II RNA-mediated CRISPR/Cas immune pathway. The expression and interference steps are represented in the drawing. The type II CRISPR/Cas loci are composed of an operon of four genes (blue) encoding the proteins Cas9, Cas1, Cas2 and Csn2, a CRISPR array consisting of a leader sequence followed by identical repeats (black rectangles) interspersed with unique genome-targeting spacers (colored diamonds) and a sequence encoding the trans-activating tracrRNA (red). Represented here is the type II CRISPR/Cas locus of S. pyogenes SF370 (Accession number NC_002737) (4). Experimentally confirmed promoters and transcriptional terminator in this locus are indicated (4). The CRISPR array is transcribed as a precursor CRISPR RNA (pre-crRNA) molecule that undergoes a maturation process specific to the type II systems (4). In S. pyogenes SF370, tracrRNA is transcribed as two primary transcripts of 171 and 89 nt in length that have complementarity to each repeat of the pre-crRNA. The first processing event involves pairing of tracrRNA to pre-crRNA, forming a duplex RNA that is recognized and cleaved by the housekeeping endoribonuclease RNase III (orange) in the presence of the Cas9 protein. RNase III-mediated cleavage of the duplex RNA generates a 75-nt processed tracrRNA and a 66-nt intermediate crRNAs consisting of a central region containing a sequence of one spacer, flanked by portions of the repeat sequence. A second processing event, mediated by unknown ribonuclease(s), leads to the formation of mature crRNAs of 39 to 42 nt in length consisting of 5’-terminal spacer-derived guide sequence and repeat-derived 3’-terminal sequence. Following the first and second processing events, mature tracrRNA remains paired to the mature crRNAs and bound to the Cas9 protein. In this ternary complex, the dual tracrRNA:crRNA structure acts as guide RNA that directs the endonuclease Cas9 to the cognate target DNA. 
Target recognition by the Cas9-tracrRNA:crRNA complex is initiated by scanning the invading DNA molecule for homology between the protospacer sequence in the target DNA and the spacer-derived sequence in the crRNA. In addition to the DNA protospacer-crRNA spacer complementarity, DNA targeting requires the presence of a short motif (NGG, where N can be any nucleotide) adjacent to the protospacer (protospacer adjacent motif - PAM). Following pairing between the dual-RNA and the protospacer sequence, an R-loop is formed and Cas9 subsequently introduces a double-stranded break (DSB) in the DNA. Cleavage of target DNA by Cas9 requires two catalytic domains in the protein. At a specific site relative to the PAM, the HNH domain cleaves the complementary strand of the DNA while the RuvC-like domain cleaves the noncomplementary strand.
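Key points
The mature crRNAs are 39 to 42 nt long: the 5′ end is derived from the spacer and the 3′ end from the repeat. Part of the 3′ repeat-derived sequence base-pairs with part of the mature tracrRNA.
The tracrRNA:crRNA duplex recruits Cas9 to form the Cas9-tracrRNA:crRNA ternary complex, which scans the invading DNA for a region complementary to the spacer-derived sequence at the 5′ end of the mature crRNA.
The PAM is the NGG sequence. It lies on the invading DNA, that is, on the double-stranded DNA that the CRISPR/Cas9 system will cut; on the DNA strand that pairs with the crRNA it reads 5′-CCN-3′. The PAM takes part in recognition of the target DNA by the Cas9-tracrRNA:crRNA ternary complex.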


Schematic representation of tracrRNA, crRNA-sp2, and protospacer 2 DNA sequences. Regions of crRNA complementarity to tracrRNA (orange) and the protospacer DNA (yellow) are represented. The PAM sequence is shown in gray; cleavage sites mapped in (C) and (D) are represented by blue arrows (C), a red arrow [(D), complementary strand], and a red line [(D), noncomplementary strand].


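(A) Cas9 needs both crRNA and tracrRNA to recognize an invading foreign DNA molecule, bind it, and degrade it. The guide portion of the crRNA pairs with one strand of the foreign DNA to form an R-loop. DNA motifs flanking the recognized region also play an important role: they help unwind the DNA duplex so that the crRNA strand can invade. The target DNA is then cut by the two nuclease domains of Cas9. (B) The Cascade complex contains a single crRNA but carries up to five different Cas proteins. It can therefore keep recruiting the nuclease-helicase Cas3, continuously unwinding and cutting the invading foreign DNA.
Cas9 cuts the target DNA with its two intramolecular nuclease domains, producing blunt-ended products. Each domain cuts one DNA strand of the R-loop: the HNH nuclease domain cuts the strand that pairs with the crRNA, and the RuvC nuclease domain cuts the other strand. Jinek et al. found that this cleavage is highly efficient and that both relaxed and tightly supercoiled target DNA are cut readily, suggesting that Cas9 is reused in the cell so that all invaders can be eliminated as quickly as possible.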
Choosing a Target Sequence for CRISPR/Cas9 Gene Editing
CRISPR/Cas9 gene targeting requires a custom single guide RNA (sgRNA) that contains a targeting sequence (crRNA sequence) and a Cas9 nuclease-recruiting sequence (tracrRNA). The crRNA region (shown in red below) is a 20-nucleotide sequence that is homologous to a region in your gene of interest and will direct Cas9 nuclease activity.

Selecting an appropriate DNA target sequence
Use the following guidelines to select a genomic DNA region that corresponds to the crRNA sequence of the sgRNA:
The 3′ end of the DNA target sequence must have a proto-spacer adjacent motif (PAM) sequence (5′-NGG-3′). The 20 nucleotides upstream of the PAM sequence will be your targeting sequence (crRNA) and Cas9 nuclease will cleave approximately 3 bases upstream of the PAM.
The PAM sequence itself is absolutely required for cleavage, but it is NOT part of the sgRNA sequence and therefore should not be included in the sgRNA.
The target sequence can be on either DNA strand.
There are online tools (e.g., http://crispr.mit.edu/ or https://chopchop.rc.fas.harvard.edu/) that detect PAM sequences and list possible crRNA sequences within a specific DNA region. These algorithms also predict off-target effects in different organisms, allowing you to choose the most specific crRNA.
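The selection rules above (a 20-nucleotide target immediately upstream of a 5′-NGG-3′ PAM, on either strand) can be sketched as a small scan. This is a minimal illustration only, assuming the S. pyogenes NGG PAM and an upper-case input sequence; the function name find_targets is hypothetical, and the sketch does no off-target scoring, which is what the online tools add.

```shell
# Sketch: list candidate 20-nt targets that sit directly upstream of an NGG PAM.
# find_targets is an illustrative name, not part of any published tool.
find_targets() {
  # $1 = DNA sequence (upper case). Prints: strand start protospacer PAM
  # Start positions on the "-" strand count along the reverse complement.
  printf '%s\n' "$1" | awk '
    function revcomp(s,    i, c, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        if (c == "A") out = out "T"
        else if (c == "T") out = out "A"
        else if (c == "C") out = out "G"
        else if (c == "G") out = out "C"
        else out = out "N"
      }
      return out
    }
    function scan(s, strand,    i, pam) {
      # a candidate needs 20 nt of target plus 3 nt of PAM
      for (i = 1; i + 22 <= length(s); i++) {
        pam = substr(s, i + 20, 3)
        if (substr(pam, 2, 2) == "GG")
          print strand, i, substr(s, i, 20), pam
      }
    }
    { scan($0, "+"); scan(revcomp($0), "-") }
  '
}

find_targets ACGTACGTACGTACGTACGTTGG
```

The PAM is printed in its own column so that it is not copied into the sgRNA by mistake; only the 20-nt column would go into the guide.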

CRISPR/Cas9 sgRNA design tools
Model organisms

http://www.e-crisp.org/E-CRISP/ currently the best online solution (Recommended).
http://tools.flycrispr.molbio.wisc.edu/targetFinder/  a simple drosophila focused tool.
http://crispr.mit.edu/  MIT’s online CRISPR tool is a little slow, but pretty decent if they support your genome of interest.
http://crispr.u-psud.fr/  simple but effective, supports analysis of any sequence for target-sites.
http://zifit.partners.org/ZiFiT/  another simple but effective solution.

Model organisms and some non-model organisms

https://crispr.dbcls.jp/ Rational design of CRISPR/Cas target.
http://www.rgenome.net/cas-offinder/ select genomes only, but allows for alternative Cas9 species.
http://tools.flycrispr.molbio.wisc.edu/targetFinder/ a simple drosophila focused tool.
https://code.google.com/p/ssfinder/ (SSFinder) a simple but effective tool, it will likely be slow for large genomes.
CasFinder: Flexible algorithm for identifying specific Cas9 targets in genomes
http://crispr.hzau.edu.cn/CRISPR2/help.php

Programs that accept a user-supplied genome

Cas-Designer: provides all possible RGEN targets in the given input sequence
sgRNAcas9: A software package for designing CRISPR sgRNA and evaluating potential off-target cleavage sites

CRISPR/Cas9: known problems and how to avoid them
Predicting sgRNA Efficacy
We have recently examined sequence features that enhance on-target activity of sgRNAs by creating all possible sgRNAs for a panel of genes and assessing, by flow cytometry, which sequences led to complete protein knockout (1). Some sequences worked better than others, and we also saw that variations in the protospacer-adjacent motif (PAM) led to differences in activity: specifically, CGGT tended to serve as a better PAM than the canonical NGG sequence. By examining the nucleotide features of the most-active sgRNAs from a set of 1,841 sgRNAs, we derived scoring rules and built a website implementation of these rules to design sgRNAs against genes of interest, available here: http://www.broadinstitute.org/rnai/public/analysis-tools/sgrna-design.
Once sgRNA sequences most-likely to give high activity are identified, some filtering can be used to further winnow down a list. For example, basic features of the target gene can be used to eliminate some sgRNAs, such as those that target near the C’ terminus of a protein, as frameshifts are less-likely to be deleterious if most of the protein has already been translated. While every protein will be different, it seems reasonable that target sites in the first half of a protein will likely lead to a functional knockout. Indeed, for some of the genes we examined, even targets very close to the C’ terminus disrupted expression. Certainly, for any gene of interest, it would be unwise to make conclusions on the basis of the activity of a single sgRNA, and thus diversity of target sites across a gene should be examined.
Avoiding Off-target Sites
The off-target activity of sgRNAs is important to consider. Several papers have reached far-different conclusions regarding the extent of these effects, and certainly at least one reason for these observed differences is the expression levels of Cas9 and sgRNA used in these studies (2,3). Additionally, the ability to predict off-target sites in the genome is still in its infancy. While the basic landscape of mismatches that can lead to cutting has been established, and can be used to identify sites that are likely to give rise to an off-target effect, as yet there is not enough data to fully predict which sites will and will not show appreciable levels of modification. To further confound matters, it has recently been shown that bulges in either the RNA or DNA – that is, non-symmetrical basepairing of the strands – can give rise to off-target activity (4). Predicting such basepairing interactions is far-more computationally intensive, and thus existing algorithms ignore these potential off-target sites.
Importantly, recent whole-genome sequencing of cells modified by CRISPR indicates that the consequences of off-target activity, at least for the experimental conditions used, led to no detectable mutations (5). Indeed, when working with single-cell clones, the authors note that “clonal heterogeneity may represent a more serious obstacle to the generation of truly isogenic cell lines than nuclease-mediated off-target effects.” Further, several genetic screens using genome-wide libraries have shown high concordance between different sequences targeting the same gene, suggesting that off-target effects did not overwhelm true signal in these assays (6-8). Again, the experimental strategy is clear: for any gene of interest, one should require that multiple sgRNAs of different sequence give rise to the same phenotype in order to conclude that the phenotype is due to an on-target effect.
How Can It Go Wrong?
Even with optimized on-target design, and proper avoidance of off-target effects both explicitly when designing sgRNAs and experimentally by the use of multiple sgRNAs, it is apparent that not all genes are equally amenable to targeting in all cellular contexts. One major reason is the chromatin state of a target site. For genes that are in more restricted chromatin or, potentially, different locations in the nucleus, Cas9 will be less effective at finding the target (9). Achieving biallelic knockout of such a gene in a high percentage of target cells might therefore not be practical. Here, single cell-cloning might be necessary, and complementary technologies such as RNAi may be a better experimental choice (while still relying, of course, on multiple different sequences of small RNA to interpret a phenotype!)
In sum, selection of sgRNAs for an experiment needs to balance maximizing on-target activity while minimizing off-target activity, which sounds obvious but can often require difficult decisions. For example, would it be better to use a less-active sgRNA that targets a truly unique site in the genome, or a more-active sgRNA with one additional target site in a region of the genome with no known function? For the creation of stable cell models that are to be used for long-term study, the former may be the better choice. For a genome-wide library to conduct genetic screens, however, a library composed of the latter would likely be more effective, so long as care is taken in the interpretation of results by requiring multiple sequences targeting a gene to score in order to call that gene as a hit. Indeed, existing genome-wide libraries have not taken into account on-target activity, and new libraries will surely incorporate such design rules in the near future.
This is an exciting time for functional genomics, with an ever-expanding list of tools to probe gene function. The best tools are only as good as the person using them, and the proper use of CRISPR technology will always depend on careful experimental design, execution, and analysis.
CRISPR/Cas9: Planning Your Experiment
How do I get started?
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) genome editing is a popular new technology that uses a short RNA (gRNA) to guide a nuclease (generally, Cas9) to a DNA target. The technology has been quickly adopted due to its advantages, like speed, cost, efficiency, and ease of implementation, over protein-based targeting methods (zinc fingers and TALENs). This experimental guide is meant to provide a broad overview of the major steps and considerations in setting up a CRISPR experiment. With all genome engineering technologies, it is recommended that the user performs due diligence regarding off-target effects.
Before getting started, familiarize yourself with the science behind CRISPR at our CRISPR Science Guide. You can also find more information from our blog and our list of CRISPR forums and FAQs.
Experimental Design

1.Choose a CRISPR system

What do you want to use CRISPR for? Common uses of CRISPR are:

Gene disruption (via insertion or deletion)
Activation or repression of gene expression
You can find more information about CRISPR uses in our CRISPR guide or browse CRISPR plasmids by function.


What system will you be using CRISPR in?

You can browse CRISPR plasmids by model organism or browse CRISPR plasmids by function.


Once you know what you want to do and what system you want to use, you can design your gRNA.


2.Design a gRNA for your genome target
As versatile as the Cas9 protein is, it requires the specificity of a gRNA to guide it to the desired genome target. Choosing an appropriate target sequence in the genomic DNA is a very important step in designing your experiment.
Important characteristics of a genome target are:

~20 nucleotides in length
Followed (or preceded) by the appropriate Protospacer Adjacent Motif (PAM) sequence in the genomic DNA

The PAM will vary depending on the bacterial species the Cas9 was derived from. The Cas9 and gRNA have to be from the same bacterial species. Read more about PAM sequences.
The majority of the CRISPR plasmids in Addgene’s collection are from the bacteria S. pyogenes unless otherwise noted.
gRNAs:

should not contain the PAM sequence
can be on either strand of the genomic DNA

For genomic disruptions (via Insertions/Deletions or InDels), the gRNA should be targeted closer to the N-terminus of a protein coding region to increase the likelihood of gene disruption.
For genome editing or modification, the gRNA target site will be limited to the desired location of the edit or modification.

should be designed using bioinformatics software to minimize off-target effects

A gRNA sequence can potentially appear in multiple places in the genome.
While we offer validated gRNAs, there are a number of software tools available to help you choose/design target sequences, as well as lists of bioinformatically determined (but not experimentally validated) unique gRNAs for different genes in different species.
Looking for more help?
From our blog, John Doench and Ella Hartenian (Broad Institute) give practical advice for using CRISPRs, as well as designing your gRNA and introducing it into cells.
Additionally, we have gRNA design protocols from our CRISPR depositors and links to CRISPR discussion groups.
Once you have your genomic target identified and gRNA designed, you can synthesize the oligos for your gRNA and find an empty gRNA vector, or browse our selection of validated gRNAs.

3.Clone the gRNA into a plasmid

If you are using one of our validated gRNA plasmids, you can skip this step.
If you have to clone your gRNA into your CRISPR plasmid:

Depositor plasmids may have specific cloning guides in their protocols.  
For a general overview of cloning, review our molecular biology tools and references.

After cloning, sequence verify your final plasmid product. Some depositor tools are designed so that a test digest can verify a successful insertion.

4.Deliver your CRISPR components

Each model system will have its best practices for efficient delivery of CRISPR components. CRISPR depositors have submitted protocols for a few model organisms like nematode, fly, and zebrafish.
If you will be working in mammalian cell culture, some common mammalian DNA delivery methods are: 
Mammalian Cell Line DNA Delivery

This table is not inclusive of all methods and the user will want to review the current literature about their preferred model.

5.Evaluate outcome

If CRISPR is being used for genome modification, the modification has to be evaluated after delivery of CRISPR components.

Design PCR (polymerase chain reaction) primers and amplify genomic region of modification.

PCR is a method for making a copy of a piece of DNA.
There are many software tools available on the web for primer design. Example: IDT Web Tools
View our walk-through of a basic PCR reaction and the required reagents.

Two popular methods to assess genomic alteration from a PCR product are:

Endonuclease mismatch detection assay
NEB provides a graphical overview of the assay.
Sequence verification
Find tips for sequencing analysis and troubleshooting at our blog.
CRISPR-Cas9 FAQs Answered!
Designing Your CRISPR Genome Editing Experiment
Q1: Should I use wildtype or double nickase for my CRISPR genome engineering experiments?
A1: When choosing between wildtype Cas9 and the double nickase for your CRISPR genome engineering experiments, consider that wildtype Cas9 with optimized chimeric gRNA has high efficiency but has been shown to have off-target effects. ‘Double nickase’ is a new system, developed by the Zhang lab, which has comparable efficiency to the optimized chimeric design but with better accuracy (in other words, lower off-target effects).
The double nickase system is based on the Cas9 D10A nickase described in Figure 4 of the Cong, et. al, 2013 Science paper. For example, if you want to use double nickase, you could express two spacers and use PX335 to express the Cas9n (nickase).
The concept of the double nickase system is that you can express two different chimeric gRNAs with the Cas9 nickase which will together introduce cleavage of the target site with efficiency similar to using a single chimeric gRNA. At the same time, the off-target effects are reduced because the Cas9 nickase doesn’t have the ability to induce double-stranded breaks like the wildtype Cas9 does. There are a few references for the double nickase system, including one recently from the Zhang group.
Learn more here.
Q2: When designing oligos for cloning my target sequence into a backbone that uses the human U6 promoter to drive expression, is it necessary to add a G nucleotide to the start of my target sequence?
A2: The human U6 promoter prefers a ‘G’ at the transcription start site to have high expression, so adding this G could help with expression, though it is possible for the plasmid to still express without the G. Because the G is only one base, the Zhang lab usually adds it when they order the oligo. If your spacer sequence starts with a ‘G’, you naturally have one and do not need to add an additional ‘G’.
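The rule in A2 is mechanical enough to script when preparing an oligo order. The helper below is a minimal sketch; the name add_u6_g is hypothetical, and it only handles the extra 5′ G, not the cloning overhangs that a particular backbone may require.

```shell
# Sketch: prepend a G to a spacer unless it already starts with one.
# add_u6_g is a hypothetical helper; $1 is the spacer sequence written 5' to 3'.
add_u6_g() {
  case "$1" in
    G*|g*) printf '%s\n' "$1" ;;   # already starts with G: leave unchanged
    *)     printf 'G%s\n' "$1" ;;  # otherwise add the extra G at the start
  esac
}

add_u6_g ACGTACGTACGTACGTACGT
add_u6_g GATTACAGATTACAGATTAC
```

The first call gains a leading G; the second is returned as-is because it already begins with one.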
Q3: What is the maximum amount of DNA that can be inserted into the genome using CRISPR/Cas for Homologous Recombination (HR)? How long should the homology arms be for efficient recombination?
A3: The most we’ve tried to insert so far has been 1kb. We used homology arms that were 800bp long.
Tips for Using CRISPR-Cas9 at the Lab Bench
Q4: After the introduction of a mutation into the genome, how can cells with that mutation be selected/screened?
A4: Before starting your experiment, consider co-transfecting with GFP. This allows you to sort for GFP-positive cells and to enrich for those cells that were positively transfected. Alternatively, you can use a selection marker to select transfected cells (for example, plasmid with a puromycin resistance cassette, such as PX459). After you co-transfect the CRISPR/Cas system with your homologous recombination (HR) template, you could then:

Confirm your HR by doing Restriction Fragment Length Polymorphism (RFLP) (see Figure 4 of the Cong, et. al, 2013 Science paper).
If you detect positive HR, isolate single-cell colonies, grow them up, then perform individual genotyping (using Sanger sequencing, for example) on each colony in order to screen for positive ones. 
If your HR template has a selection marker such as puromycin, you can (also) select for the positive colonies by puromycin selection. You could then confirm this purification by performing a genotyping assay (such as Sanger sequencing).
Click here for a useful reference.

More FAQS, CRISPR Protocols, and gRNA Design Tools
Need more questions answered about CRISPRs?

Check out the full list of 16 FAQs answered by Le Cong
Read Addgene’s CRISPR guide for background information on CRISPR/Cas9 systems
Peruse the most recent genome editing review articles, such as: Sander JD & Joung JK, Nature Biotech, 2 March 2014.
Or browse the articles related to the most frequently requested CRISPR plasmids at Addgene
Find protocols and gRNA design tools:
List of CRISPR protocols developed by a variety of labs and optimized for specific plasmids
Links to different software to help you identify your gRNA target sequences

CRISPR/Cas9 literature

Cell Press Selections: CRISPR/Cas9
Science：A Swiss Army Knife of Immunity
Science：A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity
Nature：Targeted mutagenesis in the model plant Nicotiana benthamiana using Cas9 RNA-guided endonuclease

The science-fiction power of CRISPR/Cas9
◆ Self-replication

◆ Overriding Mendelian inheritance and altering entire populations




Error bars (basic)
2015-08-22T10:00:06.000Z
Error bars are a graphical representation of the variability of data and are used on graphs to indicate the error, or uncertainty in a reported measurement. They give a general idea of how precise a measurement is, or conversely, how far from the reported value the true (error free) value might be. Error bars often represent one standard deviation of uncertainty, one standard error, or a certain confidence interval (e.g., a 95% interval). 
Adding error bars in R
Plotting the data
plot(mpg~disp,data=mtcars)

Vertical error bars
arrows(x0=mtcars$disp,
       y0=mtcars$mpg*0.95,
       x1=mtcars$disp,
       y1=mtcars$mpg*1.05,
       angle=90,
       code=3,      #drawing an arrowhead at both ends
       length=0.04,
       lwd=0.4)

The result is shown below:

Horizontal error bars
arrows(x0=mtcars$disp*0.95,
       y0=mtcars$mpg,
       x1=mtcars$disp*1.05,
       y1=mtcars$mpg,
       angle=90,
       code=3,
       length=0.04,
       lwd=0.4)

The result is shown below:




Linux environment variables
2015-08-22T08:23:25.000Z
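Common environment variables
Everyone is familiar with environment variables such as PATH and HOME. Other common ones include the following.
◆ HISTSIZE is the number of command-history entries to keep.
◆ LOGNAME is the login name of the current user.
◆ HOSTNAME is the name of the host; many applications that need the host name read it from this variable.
◆ SHELL is the shell that the current user is using.
◆ LANG/LANGUAGE are the language-related variables; users who work in several languages can change them.
◆ MAIL is the directory where the current user's mail is stored.
◆ PS1 is the primary prompt: # for root and $ for ordinary users. PS2 is the secondary prompt and defaults to ">". Changing these variables changes the current prompt; for example, assigning the string "Hello,My NewPrompt " to PS1 makes that string the prompt.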
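The command for the prompt example is not shown in the post. A likely form, given here as an assumption rather than the author's exact command, is a plain assignment to PS1:

```shell
# Assumed form of the prompt example: set the primary prompt for this session.
PS1="Hello,My NewPrompt "
export PS1   # optional: also pass the setting on to child shells
```

The change lasts only for the current session; to make it permanent it would normally go into a shell startup file such as ~/.bashrc.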
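Besides these common variables, many applications add their own at installation time; Java, for example, needs JAVA_HOME and CLASSPATH to be set.
Customizing environment variables
Environment variables are closely tied to the shell: logging in to the system starts a shell. On Linux this is usually bash, although it can be reconfigured or switched to another shell. Environment variables are set with shell commands, and once set they can be used by every program the current user runs. In bash, an environment variable is accessed through its name and set with export. The examples below illustrate this.
1. Display the environment variable HOME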
$ echo $HOME 
/home/terry

2. Set a new environment variable, WELCOME
$ export WELCOME="Hello!" 
$ echo $WELCOME 
Hello!

3. Use the env command to display all environment variables
$ env 
HOSTNAME=terry.mykms.org 
PVM_RSH=/usr/bin/rsh 
SHELL=/bin/bash 
TERM=xterm
HISTSIZE=1000 
...

4. Use the set command to display all locally defined shell variables
$ set
BASH=/bin/bash
BASH_VERSINFO=([0]="2" [1]="05b" [2]="0" [3]="1" [4]="release" [5]="i386-redhat-linux-gnu")
BASH_VERSION='2.05b.0(1)-release' 
COLORS=/etc/DIR_COLORS.xterm 
COLUMNS=80 
DIRSTACK=() 
DISPLAY=:0.0 
...
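The difference between what env and set list comes down to whether a variable has been exported. The sketch below uses two hypothetical variable names, LOCAL_ONLY and SHARED, to show that a child process sees only the exported one:

```shell
# LOCAL_ONLY is a plain shell variable; SHARED is exported to the environment.
LOCAL_ONLY="visible only in this shell"
export SHARED="visible in child processes"

# A child shell inherits exported variables but not plain shell variables.
child_view=$(sh -c 'printf "%s|%s" "${LOCAL_ONLY:-unset}" "${SHARED:-unset}"')
echo "$child_view"
```

In the child, LOCAL_ONLY falls back to the placeholder "unset" while SHARED carries its value, which is why set lists both variables in the parent shell but env lists only SHARED.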

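5. Use the unset command to remove an environment variable
In bash, export (or a plain assignment) sets the value of a variable, and unset removes it. A variable that is assigned without a value holds the empty string (NULL). For example:
$ export TEST="Test..."   # add an environment variable TEST
$ env|grep TEST           # prints a line, so TEST now exists
TEST=Test...
$ unset TEST              # remove TEST (pass the name, not $TEST)
$ env|grep TEST           # prints nothing, so TEST no longer exists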

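6. Use the readonly command to make a variable read-only
Once readonly has been applied, the variable can no longer be modified or removed. For example:
$ export TEST="Test..."   # add an environment variable TEST
$ readonly TEST           # make TEST read-only
$ unset TEST              # the variable cannot be removed
-bash: unset: TEST: cannot unset: readonly variable
$ TEST="New"              # nor can it be modified
-bash: TEST: readonly variable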

The PATH environment variable
The which command searches the directories listed in PATH and prints the absolute path of a command:
[root@localhost ~]# which rmdir
/bin/rmdir
[root@localhost ~]# which rm
alias rm='rm -i'
        /bin/rm
[root@localhost ~]# which ls
alias ls='ls --color=auto'
        /bin/ls
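To make the role of PATH concrete, the lookup that which performs can be approximated by walking the colon-separated directories in order and stopping at the first executable match. This is a rough sketch: find_in_path is a hypothetical helper, and unlike the which output above it knows nothing about aliases or shell builtins.

```shell
# Sketch: emulate the core of `which` by walking the directories in PATH.
# find_in_path is a hypothetical helper; aliases and builtins are ignored.
find_in_path() {
  cmd=$1
  old_ifs=$IFS
  IFS=:                        # PATH entries are separated by colons
  for dir in $PATH; do
    candidate="${dir:-.}/$cmd" # an empty entry means the current directory
    if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
      IFS=$old_ifs
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1                     # not found in any PATH directory
}

find_in_path sh
```

Because the search stops at the first hit, the order of directories in PATH decides which copy of a command runs when several are installed.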




The AWK Programming Language
2015-08-22T05:48:08.000Z

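Chapter 1. An AWK Tutorial
Awk is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks. This chapter is a tutorial, designed to let you start writing your own programs as quickly as possible. Chapter 2 describes the whole language, and the remaining chapters show how Awk can be used to solve problems from many different areas. Throughout the book, we have tried to pick examples that you should find useful, interesting, and instructive.
1.1 Getting Started
Useful awk programs are often short, just a line or two. Suppose you have a file called emp.data that contains the name, pay rate in dollars per hour, and number of hours worked for your employees, one employee record per line, like this: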
Beth	4.00	0
Dan	3.75	0
Kathy	4.00	10
Mark	5.00	20
Mary	5.50	22
Susie	4.25	18

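Now you want to print the name and pay (rate times hours) for everyone who worked more than zero hours. This is exactly the kind of job awk is meant for, so it is easy. Just type this command line: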
awk '$3 >0 { print $1, $2 * $3 }' emp.data

You should get this output:
Kathy 40
Mark 100
Mary 121
Susie 76.5

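This command line tells the system to run awk, using the program inside the quote characters and taking its data from the input file emp.data. The part inside the quotes is the complete awk program. It consists of a single pattern-action statement. The pattern, $3 > 0, matches every input line in which the third column, or field, is greater than zero, and the action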
{ print $1, $2 * $3 }

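prints the first field and the product of the second and third fields of each matched line.
If you want to print the names of those employees who have not worked yet, type this command line: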
awk '$3 == 0 { print $1 }' emp.data

Here the pattern, $3 == 0, matches each line in which the third field is equal to zero, and the action
{ print $1 }

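prints its first field.
As you read this book, try running and modifying the example programs. Since most of them are short, you will quickly get an understanding of how awk works. On a Unix system, the two transactions above look like this on the terminal: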
$ awk '$3 > 0 { print $1, $2 * $3 }' emp.data
Kathy 40
Mark 100
Mary 121
Susie 76.5
$ awk '$3 == 0 { print $1 }' emp.data
Beth
Dan
$

The $ at the beginning of a line is the prompt from the system; it may be different on your machine.
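To reproduce the session without typing the data by hand, the file can be created from a here-document first. The sketch below assumes a scratch directory from mktemp and separates the columns with single spaces instead of tabs, which makes no difference to awk's default field splitting; the variable names are illustrative.

```shell
# Recreate emp.data (name, hourly rate, hours worked) in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/emp.data" <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Pay for everyone who worked more than zero hours.
paid=$(awk '$3 > 0 { print $1, $2 * $3 }' "$workdir/emp.data")
# Names of everyone who has not worked yet.
idle=$(awk '$3 == 0 { print $1 }' "$workdir/emp.data")

printf '%s\n' "$paid" "$idle"
rm -r "$workdir"
```

The here-document delimiter is quoted ('EOF') so that the shell copies the lines verbatim.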
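The Structure of an AWK Program
Let's step back for a moment and look at what is going on. In the command lines above, the parts between the quote characters are programs written in the awk programming language. Each awk program in this chapter is a sequence of one or more pattern-action statements: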
pattern { action }
pattern { action }
...

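The basic operation of awk is to scan the input one line at a time, searching for lines that are matched by any of the patterns in the program. The precise meaning of "match" depends on the pattern in question; for a pattern like $3 > 0, it means "the condition is true."
Every input line is tested against each of the patterns in turn. For each pattern that matches, the corresponding action (which may involve several steps) is performed. Then the next line is read and the matching starts over, until all the input has been read.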
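The scanning rule described above, where every line is tested against every pattern in order, is easiest to see with a program that has more than one statement. The three statements and their message texts below are made up for illustration; the input lines are taken from emp.data.

```shell
# Three pattern-action statements in one program. Every input line is tested
# against all three patterns in order, so a line can trigger several actions.
result=$(printf 'Beth 4.00 0\nMark 5.00 20\nMary 5.50 22\n' | awk '
  $3 == 0  { print $1, "has not worked" }
  $2 >= 5  { print $1, "earns at least 5.00 per hour" }
  $3 > 20  { print $1, "worked more than 20 hours" }
')
echo "$result"
```

Beth matches only the first pattern, Mark only the second, and Mary both the second and the third, so Mary appears twice in the output.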
The programs above are typical examples of patterns and actions.
$3 == 0 { print $1 }

是单个模式-动作语句；对于第三个字段为0的每行，打印其第一个字段。
模式-动作语句中的模式或动作（但不是同时两者）都可以省略。如果某个模式没有动作，例如：:
$3 == 0

那么模式匹配到的每一行（即，对于该行，条件为真）都会被打印出来。该程序会打印 emp.data 文件中第三个字段为0的两行
Beth 4.00 0
Dan 3.75 0

如果有个没有模式的动作，例如：:
{ print $1 }

那么这种情况下的动作会打印每个输入行的第一列。
由于模式和动作两者任一都是可选的，所以需要使用大括号包围动作以区分于其他模式。
执行AWK程序
执行awk程序的方式有多种。你可以输入如下形式的命令行：:
awk 'program' input files

从而在每个指定的输入文件上执行这个program。例如，你可以输入：:
awk '$3 == 0 { print $1 }' file1 file2

打印file1和file2文件中第三个字段为0的每一行的第一个字段。
你可以省略命令行中的输入文件，仅输入：:
awk 'program'

这种情况下，awk会将program应用于你在终端中接着输入的任意数据行，直到你输入一个文件结束信号（Unix系统上为control-d）。如下是Unix系统的一个会话示例：
$ awk '$3 == 0 { print $1 }'
Beth 4.00 0
Beth

Dan 3.75 0
Dan

Kathy 3.75 10
Kathy 3.75 0
Kathy

...

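In the original book, the characters printed by the computer are set in bold; in the session above they are the lines that contain only a name (Beth, Dan, Kathy).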
这个动作非常便于尝试awk：输入你的程序，然后输入数据，观察发生了什么。我们再次鼓励你尝试这些示例并进行改动。
注意命令行中的程序是用单引号包围着的。这会防止shell解释程序中 $ 这样的字符，也允许程序的长度超过一行。
当程序比较短小（几行的长度）的时候，这种约定会很方便。然而，如果程序较长，将程序写到一个单独的文件中会更加方便。假设存在程序 progfile ，输入命令行：:
awk -f progfile     optional list of input files

其中 -f 选项指示awk从指定文件中获取程序。可以使用任意文件名替换 progfile 。
错误
如果你的awk程序存在错误，awk会给你一段诊断信息。例如，如果你打错了大括号，如下所示：:
awk '$3 == 0 [ print $1 }' emp.data

你会得到如下信息：
awk: syntax error at source line 1
context is
$3 == 0 >>> [ <<<
extra }
missing ]
awk: bailing out at source line 1

“Syntax error”意味着在 >>> <<< 标记的地方检测到语法错误。”Bailing out”意味着没有试图恢复。有时你会得到更多的帮助-关于错误是什么，比如大括号或括弧不匹配。
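Because of the syntax error, awk does not try to execute this program. Some errors, however, are not detected until the program is running. For example, if you attempt to divide a number by zero, awk stops processing at the division and reports the number of the input line being processed and the line number in the program where the error occurred.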
1.2 简单输出
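The rest of this section contains a collection of short, typical awk programs that manipulate the emp.data file above. We explain briefly what each program does, but the examples are mainly meant to introduce simple, useful operations that are common in awk: printing fields, selecting input, and transforming data. We do not show everything awk can do, nor do we go into the details of the examples. After reading this section, however, you will be able to accomplish simple tasks, and you will find the later sections much easier to read.
We usually show only the program, not the whole command line. In every case, the program can be run either by enclosing it in quotes as the first argument of the awk command, as shown above, or by putting it in a file and invoking awk with the -f option.
Awk has only two types of data: numbers and strings of characters. The emp.data file is typical of this kind of information: a mixture of words and numbers separated by blanks and/or tabs.
Awk reads its input one line at a time and splits each line into fields, where, by default, a field is a sequence of characters that contains no blanks or tabs. The first field in the current input line is called $1, the second $2, and so forth. The entire line is called $0. The number of fields can vary from line to line.
Often, all we need to do is print some or all of the fields of each line, perhaps after performing some calculations. The programs in this section are all of that form.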
打印每一行
如果一个动作没有任何模式, 这个动作会对所有输入的行进行操作. print 语 句用来打印(输出)当前输入的行, 所以程序
{ print }
会输出所有输入的内容到标准输出. 由于 $0 表示整行,
{ print $0 }
也会做一样的事情.
打印特定字段
使用一个 print 语句可以在同一行中输出不止一个字段. 下面的程序输出了每 行输入中的第一和第三个字段
{ print $1, $3 }

使用 emp.data 作为输入, 它将会得到
Beth 0
Dan 0
Kathy 10
Mark 20
Mary 22
Susie 18

在 print 语句中被逗号分割的表达式, 在默认情况下他们将会用一个空格分割 来输出. 每一行 print 生成的内容都会以一个换行符作为结束. 但这些默认行 为都可以自定义; 我们将在第二章中介绍具体的方法.
NF, 字段数量
很显然你可能会发现你总是需要通过 $1, $2 这样来指定不同的字段, 但任何表 达式都可以使用在$之后来表达一个字段的序号; 表达式会被求值并用于表示字段 序号. Awk会对当前输入的行有多少个字段进行计数, 并且将当前行的字段数量存 储在一个内建的称作 NF 的变量中. 因此, 下面的程序
{ print NF, $1, $NF }

会依次打印出每一行的字段数量, 第一个字段的值, 最后一个字段的值.
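To see what the NF program prints, it can be run against the sample data from a shell. This is a small sketch: the helper names `emp_data` and `show_fields` are invented here, and the data is inlined with single spaces instead of being read from emp.data.

```shell
# Sample rows mirroring emp.data: name, hourly rate, hours worked.
emp_data() {
  cat <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF
}

# Print the field count, the first field, and the last field of each line.
show_fields() {
  emp_data | awk '{ print NF, $1, $NF }'
}

show_fields
```

Every line has three fields, so the first output line is `3 Beth 0` and the last is `3 Susie 18`.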
计算和打印
你也可以对字段的值进行计算后再打印出来. 下面的程序
{ print $1, $2 * $3 }

是一个典型的例子. 它会打印出姓名和员工的合计支出(以小时计算):
Beth 0
Dan 0
Kathy 40
Mark 100
Mary 121
Susie 76.5

我们马上就会学到怎么让这个输出看起来更漂亮.
打印行号
Awk provides another built-in variable, NR, that counts the number of lines read so far. We can use NR and $0 to prefix each line of emp.data with its line number:
{ print NR, $0 }

打印的输出看起来会是这样:
1 Beth   4.00     0
2 Dan    3.75     0
3 Kathy  4.00    10
4 Mark   5.00    20
5 Mary   5.50    22
6 Susie  4.25    18

在输出中添加内容
你当然也可以在字段中间或者计算的值中间打印输出想要的内容:
{ print "total pay for", $1, "is", $2 * $3 }

输出
total pay for Beth is 0
total pay for Dan is 0
total pay for Kathy is 40
total pay for Mark is 100
total pay for Mary is 121
total pay for Susie is 76.5

在打印语句中, 双引号内的文字将会在字段和计算的值中插入输出.
1.3 高级输出
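The print statement is meant for quick and easy output. To format the output exactly the way you want it, you need the printf statement. As we shall see in Section 2.4, printf can produce almost any kind of output, but in this section we show only a few of its capabilities.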
字段排队
printf 语句的形式如下：:
printf(format, value1, value2, ..., valuen)

其中 format 是字符串，包含要逐字打印的文本，穿插着 format 之后的每个值该如何打印的规格(specification)。一个规格是一个 % 符，后面跟着一些字符，用来控制一个 value 的格式。第一个规格说明如何打印 value1 ，第二个说明如何打印 value2 ，… 。因此，有多少 value 要打印，在 format 中就要有多少个 % 规格。
这里有个程序使用 printf 打印每位员工的总薪酬：:
{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }

printf 语句中的规格字符串包含两个 % 规格。第一个是 %s ，说明以字符串的方式打印第一个值 $1 。第二个是 %.2f ，说明以数字的方式打印第二个值 $2*$3 ，并保留小数点后面两位。规格字符串中其他东西，包括美元符号，仅逐字打印。字符串尾部的 \n 代表开始新的一行，使得后续输出将从下一行开始。以 emp.data 为输入，该程序产生：
total pay for Beth is $0.00
total pay for Dan is $0.00
total pay for Kathy is $40.00
total pay for Mark is $100.00
total pay for Mary is $121.00
total pay for Susie is $76.50

printf 不会自动产生空格或者新的行，必须是你自己来创建，所以不要忘了 \n 。
另一个程序是打印每位员工的姓名与薪酬：:
{ printf("%-8s $%6.2f\n", $1, $2 * $3) }

第一个规格 %-8s 将一个姓名以字符串形式在8个字符宽度的字段中左对齐输出。第二个规格 %6.2f 将薪酬以数字的形式，保留小数点后两位，在6个字符宽度的字段中输出。
Beth     $  0.00
Dan      $  0.00
Kathy    $ 40.00
Mark     $100.00
Mary     $121.00
Susie    $ 76.50

之后我们将展示更多的 printf 示例。一切精彩尽在2.4小节。
排序输出
假设你想打印每位员工的所有数据，包括他或她的薪酬，并以薪酬递增的方式进行排序输出。最简单的方式是使用awk将每位员工的总薪酬置于其记录之前，然后利用一个排序程序来处理awk的输出。Unix上，命令行如下:
awk '{ printf("%6.2f    %s\n", $2 * $3, $0) }' emp.data | sort

将awk的输出通过管道传给 sort 命令，输出为：
  0.00    Beth  4.00 0
  0.00    Dan   3.75 0
 40.00    Kathy 4.00 10
 76.50    Susie 4.25 18
100.00    Mark  5.00 20
121.00    Mary  5.50 22

1.4 选择
Awk的模式适合用于为进一步的处理从输入中选择相关的数据行。由于不带动作的模式会打印所有匹配模式的行，所以很多awk程序仅包含一个模式。本节将给出一些有用的模式示例。
通过对比选择
这个程序使用一个对比模式来选择每小时赚5美元或更多的员工记录，也就是，第二个字段大于等于5的行：:
$2 >= 5

从 emp.data 中选出这些行：:
Mark    5.00    20
Mary    5.50    22

通过计算选择
程序
$2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }

打印出总薪资超过50美元的员工的薪酬。
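The selection-by-computation program can be checked from a shell as follows. This is only a sketch: `emp_data` and `high_earners` are hypothetical helper names, and the rows are inlined rather than read from emp.data.

```shell
# Sample rows mirroring emp.data: name, hourly rate, hours worked.
emp_data() {
  cat <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF
}

# Select employees whose total pay (rate * hours) exceeds 50 dollars.
high_earners() {
  emp_data | awk '$2 * $3 > 50 { printf("$%.2f for %s\n", $2 * $3, $1) }'
}

high_earners
```

Only Mark (100), Mary (121) and Susie (76.5) pass the test, so three lines are printed, each formatted with two decimals.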
通过文本内容选择
除了数值测试，你还可以选择包含特定单词或短语的输入行。这个程序会打印所有第一个字段为 Susie 的行：:
$1 == "Susie"

操作符 == 用于测试相等性。你也可以使用称为 正则表达式 的模式查找包含任意字母组合，单词或短语的文本。这个程序打印任意位置包含 Susie 的行：:
/Susie/

输出为这一行：:
Susie   4.25    18

正则表达式可用于指定复杂的多的模式；2.1节将会有全面的论述。
模式组合
可以使用括号和逻辑操作符与 && ， 或 || ， 以及非 ! 对模式进行组合。程序:
$2 >= 4 || $3 >= 20

会打印 $2 (第二个字段) 大于等于 4 或者 $3 (第三个字段) 大于等于 20 的行：:
Beth    4.00    0
Kathy   4.00    10
Mark    5.00    20
Mary    5.50    22
Susie   4.25    18

两个条件都满足的行仅打印一次。与如下包含两个模式程序相比：:
$2 >= 4
$3 >= 20

如果某个输入行两个条件都满足，这个程序会打印它两遍：:
Beth    4.00    0
Kathy   4.00    10
Mark    5.00    20
Mark    5.00    20
Mary    5.50    22
Mary    5.50    22
Susie   4.25    18

注意如下程序:
!($2 < 4 && $3 < 20)

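prints the lines for which it is not true that $2 is less than 4 and $3 is less than 20; this condition is equivalent to the first combined pattern above, though perhaps less readable.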
数据验证
实际的数据中总是会存在错误的。在数据验证-检查数据的值是否合理以及格式是否正确-方面，Awk是个优秀的工具。
数据验证本质上是否定的：不是打印具备期望属性的行，而是打印可疑的行。如下程序使用对比模式 将5个数据合理性测试应用于 emp.data 的每一行：:
NF != 3     { print $0, "number of fields is not equal to 3" }
$2 < 3.35   { print $0, "rate is below minimum wage" }
$2 > 10     { print $0, "rate exceeds $10 per hour" }
$3 < 0      { print $0, "negative hours worked" }
$3 > 60     { print $0, "too many hours worked" }

如果没有错误，则没有输出。
BEGIN与END
特殊模式 BEGIN 用于匹配第一个输入文件的第一行之前的位置， END 则用于匹配处理过的最后一个文件的最后一行之后的位置。这个程序使用 BEGIN 来输出一个标题：:
BEGIN { print "NAME    RATE    HOURS"; print "" }
      { print }

输出为：:
NAME    RATE    HOURS

Beth    4.00    0
Dan     3.75    0
Kathy   4.00    10
Mark    5.00    20
Mary    5.50    22
Susie   4.25    18

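An action can contain several statements on one line if they are separated by semicolons. Notice that print "" prints a blank line, unlike a plain print, which prints the current input line.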
1.5 使用AWK进行计算
一个动作就是一个以新行或者分号分隔的语句序列。你已经见过一些其动作仅是单个 print 语句的例子。本节将提供一些执行简单的数值以及字符串计算的语句示例。在这些语句中，你不仅可以使用像 NF 这样的内置变量，还可以创建自己的变量用于计算、存储数据诸如此类的操作。awk中，用户创建的变量不需要声明。
计数
这个程序使用一个变量 emp 来统计工作超过15个小时的员工的数目：:
$3 > 15 { emp = emp + 1 }
END     { print emp, "employees worked more than 15 hours" }

对于第三个字段超过15的每行， emp 的前一个值加1。以 emp.data 为输入，该程序产生：:
3 employees worked more than 15 hours

用作数字的awk变量的默认初始值为0，所以我们不需要初始化 emp 。
求和与平均值
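To count the number of employees, we can use the built-in variable NR, which holds the number of lines read so far; at the end of the input its value is the total number of lines read.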
END { print NR, "employees" }

输出为：:
6 employees

如下是一个使用 NR 来计算薪酬均值的程序：:
    { pay = pay + $2 * $3 }
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay/NR
    }

第一个动作累计所有员工的总薪酬。 END 动作打印出
6 employees
total pay is 337.5
average pay is 56.25

很明显， printf 可用来产生更简洁的输出。并且该程序也有个潜在的错误：在某种不太可能发生的情况下， NR 等于0，那么程序会试图执行零除，从而产生错误信息。
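One way to address both remarks, tidier output through printf and protection against NR being zero, is sketched below. The wrapper name `average_pay` and the wording of the messages are assumptions made for this example.

```shell
# Sample rows mirroring emp.data: name, hourly rate, hours worked.
emp_data() {
  cat <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF
}

# Sum the pay of all input lines; divide only if at least one line was read.
average_pay() {
  awk '
        { pay = pay + $2 * $3 }
    END { if (NR > 0)
              printf("%d employees, total pay is %.2f, average pay is %.2f\n", NR, pay, pay / NR)
          else
              print "no input lines, so no average to report"
        }'
}

emp_data | average_pay
average_pay </dev/null
```

With the sample data the first call reports a total of 337.50 and an average of 56.25; the second call reads empty input and takes the else branch instead of dividing by zero.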
处理文本
awk的优势之一是能像大多数语言处理数字一样方便地处理字符串。awk变量可以保存数字也可以保存字符串。这个程序会找出时薪最高的员工：:
$2 > maxrate { maxrate = $2; maxemp = $1 }
END { print "highest hourly rate:", maxrate, "for", maxemp }

输出
highest hourly rate: 5.50 for Mary

这个程序中，变量 maxrate 保存着一个数值，而变量 maxemp 则是保存着一个字符串。（如果有几个员工都有着相同的最大时薪，该程序则只找出第一个。）
字符串连接
可以合并老字符串来创建新字符串。这种操作称为 连接（concatenation） 。程序
    { names = names $1 " "}
END { print names }

通过将每个姓名和一个空格附加到变量 names 的前一个值， 来将所有员工的姓名收集进单个字符串中。最后 END 动作打印出 names 的值：:
Beth Dan Kathy Mark Mary Susie

awk程序中，连接操作的表现形式是将字符串值一个接一个地写出来。对于每个输入行，程序的第一个语句先连接三个字符串： names 的前一个值、当前行的第一个字段以及一个空格，然后将得到的字符串赋值给 names 。因此，读取所有的输入行之后， names 就是个字符串，包含所有员工的姓名，每个姓名后面跟着一个空格。用于保存字符串的变量的默认初始值是空字符串(也就是说该字符串包含零个字符)，因此这个程序中的 names 不需要显式初始化。
打印最后一个输入行
虽然在 END 动作中 NR 还保留着它的值，但 $0 没有。程序
    { last = $0 }
END { print last }

是打印最后一个输入行的一种方式：:
Susie   4.25    18

内置函数
我们已看到awk提供了内置变量来保存某些频繁使用的数量，比如：字段的数量和输入行的数量。类似地，也有内置函数用来计算其他有用的数值。除了平方根、对数、随机数诸如此类的算术函数，也有操作文本的函数。其中之一是 length ，计算一个字符串中的字符数量。例如，这个程序会计算每个人的姓名的长度：:
{ print $1, length($1) }

结果：:
Beth 4
Dan 3
Kathy 5
Mark 4
Mary 4
Susie 5

行、单词以及字符的计数
这个程序使用了 length 、 NF 、以及 NR 来统计输入中行、单词以及字符的数量。为了简便，我们将每个字段看作一个单词。
    { nc = nc + length($0) + 1
      nw = nw + NF
    }
END { print NR, "lines,", nw, "words,", nc, "characters" }

文件 emp.data 有:
6 lines, 18 words, 77 characters

$0 并不包含每个输入行的末尾的换行符，所以我们要另外加个1。
1.6 控制语句
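Awk provides an if-else statement for making decisions and several statements for writing loops, all modeled on those of the C programming language. They can be used only in actions.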
if-else语句
如下程序将计算时薪超过6美元的员工的总薪酬与平均薪酬。它使用一个 if 来防范计算平均薪酬时的零除问题。
$2 > 6 { n = n + 1; pay = pay + $2 * $3 }
END    { if (n > 0)
            print n, "employees, total pay is", pay,
                     "average pay is", pay/n
         else
             print "no employees are paid more than $6/hour"
        }

emp.data 的输出是：:
no employees are paid more than $6/hour

if-else 语句中，if 后的条件会被计算。如果为真，执行第一个 print 语句。否则，执行第二个 print 语句。注意我们可以使用一个逗号将一个长语句截断为多行来书写。
while语句
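A while statement has a condition and a body. The statements in the body are performed repeatedly while the condition is true. This program uses the formula $value = amount \times (1 + rate)^{years}$ to show how the value of a sum of money invested at a particular interest rate grows over a number of years.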
# interest1 - 计算复利
#   输入: 钱数    利率    年数
#   输出: 复利值

{   i = 1
    while (i <= $3) {
        printf("\t%.2f\n", $1 * (1 + $2) ^ i)
        i = i + 1
    }
}

条件是 while 后括弧包围的表达式；循环体是条件后大括号包围的两个表达式。 printf 规格字符串中的 \t 代表制表符； ^ 是指数操作符。从 # 开始到行尾的文本是注释，会被awk忽略，但能帮助程序的读者理解程序做的事情。
你可以为这程序输入三个一组的数字，看看不一样的钱数、利率、以及年数会产生什么。例如，如下事务演示了1000美元，利率为6%与12%，5年的复利分别是如何增长的：:
$ awk -f interest1
1000 .06 5
        1060.00
        1123.60
        1191.02
        1262.48
        1338.23
1000 .12 5
        1120.00
        1254.40
        1404.93
        1573.52
        1762.34

for语句
另一个语句， for ，将大多数循环都包含的初始化、测试、以及自增压缩成一行。如下是之前利息计算的 for 版本：:
# interest2 - compute compound interest
#   输入: 钱数    利率    年数
#   输出: 每年末的复利

{ for (i = 1; i <= $3; i = i + 1)
    printf("\t%.2f\n", $1 * (1 + $2) ^ i)
}

初始化 i = 1 只执行一次。接下来，测试条件 i <= $3 ；如果为真，则执行循环体的 printf 语句。循环体执行结束后执行自增 i = i + 1 ，接着由另一次条件测试开始下一个循环迭代。代码更加紧凑，并且由于循环体仅是一条语句，所以不需要大括号来包围它。
1.7 数组
awk为存储一组相关的值提供了数组。虽然数组给予了awk很强的能力，但在这里我们仅展示一个简单的例子。如下程序将按行逆序打印输入。第一个动作将输入行存为数组 line 的连续元素；即第一行放在 line[1] ，第二行放在 line[2] , 依次继续。 END 动作使用一个 while 语句从后往前打印数组中的输入行：:
# 反转 - 按行逆序打印输入

    { line[NR] = $0 }  # 记下每个输入行

END { i = NR           # 逆序打印
      while (i > 0) {
        print line[i]
        i = i - 1
      }
    }

以 emp.data 为输入，输出为
Susie    4.25   18
Mary     5.50   22
Mark     5.00   20
Kathy    4.00   10
Dan      3.75   0
Beth     4.00   0

如下是使用 for 语句实现的相同示例：:
# 反转 - 按行逆序打印输入

    { line[NR] = $0 }   # 记下每个输入行

END { for (i = NR; i > 0; i = i - 1)
        print line[i]
    }

二. AWK语言详解
本章将主要通过示例来解释构成awk程序的概念。因为这是对语言的全面描述，材料会很详细，因此我们推荐你浏览略读，需要的时候再回来核对细节。
最简单的awk程序就是一个模式-动作语句的序列：:
pattern    { action }
pattern    { action }
...

某些语句中，可能没有模式；另一些语句中，可能没有动作及其大括号。awk检查你的程序以确认不存在语法错误后，一次读取一行输入，并对每一行按序处理模式。对于每个匹配到当前输入行的模式，执行其关联的动作。不存在模式，则匹配每个输入行，因此没有模式的每个动作对于每个输入行都要执行。一个仅包含模式的模式-动作语句将打印匹配该模式的每个输入行。本章的大部分内容中，名词”输入行(input-line)”和”记录(record)” 是同义的。2.5小节中，我们将讨论多行记录，即一个记录包含多行输入。
本章的第一节将详细描述模式。第二节通过表达式、赋值以及控制语句来描述动作。剩下的章节覆盖函数定义，输出，输入，以及awk程序如何调用其他程序等内容。多数章节包含了主要特性的概要。
输入文件 countries
本章中，我们使用一个名为 countries 的文件作为许多awk程序的输入。文件的每行包含一个国家的名字，以千平方英里为单位的面积，以百万为单位的人口数，以及属于哪个洲。数据是1984年的，苏联(USSR)被武断地归入了亚洲。文件中，四列数据以制表符tab分隔；以单个空格将 North 、 South 与 America 分隔开。
文件 countries 包含如下数据行：:
USSR    8649    275     Asia
Canada  3852    25      North America
China   3705    1032    Asia
USA     3615    237     North America
Brazil  3286    134     South America
India   1267    746     Asia
Mexico  762     78      North America
France  211     55      Europe
Japan   144     120     Asia
Germany 96      61      Europe
England 94      56      Europe

本章的其余部分，如果没有明确说明输入文件，那么就是使用文件 countries 。
程序的格式
模式-动作语句以及动作中的语句通常以换行分隔，如果它们以分号分隔，则多个语句可以出现在一行中。分号可以放在任意语句的尾部。
动作的开大括号必须与其对应的模式处于同一行；动作的其余部分，包括闭大括号，则可以出现接下来的行中。
空行会被忽略；一般为了提高程序的可读性会在语句的前面或者后面插入空行。在操作符和操作数的两边插入空格和制表符也是为了提高可读性。
任意行的末尾可能会有注释。注释以符号 # 开始，结束于行尾，就像这样
{ print $1, $3 }        # print country name and population

长语句可以跨越多行，但要在断行的地方加入一个反斜杠和一个换行符：:
{ print \
                $1,             # country name
                $2,             # area in thousands of square miles
                $3 }    # population in millions

如上例所示，语句也可以逗号断行，在每个断行的末尾也可以加入注释。
本书中，我们使用了多种格式风格，部分是为了说明相异之处，部分是为了避免程序占用太多的行空间。类似于本章中的简短程序，格式并不是很重要，但一致性与可读性可以帮助更长的程序保持可控。
2.1 模式
模式控制着动作的执行：模式匹配，其关联的动作则执行。本节将描述6种模式及其匹配条件。
模式摘要

BEGIN { 语句 }
在读取任何输入前执行一次 语句
END { 语句 }
读取所有输入之后执行一次 语句
表达式 { 语句 }
对于 表达式 为真（即，非零或非空）的行，执行 语句
/正则表达式/ { 语句 }
如果输入行包含字符串与 正则表达式 相匹配，则执行 语句
组合模式 { 语句 }
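A compound pattern combines expressions with && (AND), || (OR), ! (NOT), and parentheses; the statements are executed for each input line where the compound pattern is true.
pattern1, pattern2 { statements }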
范围模式(range pattern)匹配从与 模式1 相匹配的行到与 模式2 相匹配的行（包含该行）之间的所有行，对于这些输入行，执行 语句 。
BEGIN和END不与其他模式组合。范围模式不可以是任何其他模式的一部分。BEGIN和END是仅有的必须搭配动作的模式。
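Of the six pattern kinds, the range pattern is the least obvious, so here is a small sketch of it run against the countries data. The helper names `countries_data` and `range_demo` are hypothetical; the rows are rebuilt with printf so that the four columns are separated by real tab characters.

```shell
# Emit the countries file: name, area, population, continent (tab-separated).
countries_data() {
  printf '%s\t%s\t%s\t%s\n' \
    USSR 8649 275 Asia \
    Canada 3852 25 "North America" \
    China 3705 1032 Asia \
    USA 3615 237 "North America" \
    Brazil 3286 134 "South America" \
    India 1267 746 Asia \
    Mexico 762 78 "North America" \
    France 211 55 Europe \
    Japan 144 120 Asia \
    Germany 96 61 Europe \
    England 94 56 Europe
}

# Range pattern: from the first line matching /Canada/ through the next
# line matching /Brazil/, inclusive; print only the country name.
range_demo() {
  countries_data | awk -F'\t' '/Canada/, /Brazil/ { print $1 }'
}

range_demo
```

The range opens on the Canada line and closes on the Brazil line, so the names printed are Canada, China, USA and Brazil.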




常用生物信息在线工具
2015-08-05T11:19:40.000Z
声明：本文所列工具均为较为初级的生物信息分析，只用于简单的分析过程，更加优秀的工具和准确的分析结果我也在不懈寻找中，同样也欢迎大家留言提供，一个分析结果最好是能够综合不同方式所得结果。


韦恩图
Venny2.0

升级版韦恩图
jvenn: 可做到6个

基因预测
FGENESH

phylogenetic
 iTOL

Evolview v3

在线构建进化树
IQTREE Web Server: Fast and accurate phylogenetic trees under maximum likelihood

启动子区预测
Promoter Scan

蛋白质一级结构分析
PredictProtein

ExPASy-ProtParam tool

蛋白质磷酸化位点
NetPhos 2.0

信号肽
SignalP

跨膜结构域
TMHMM Server v. 2.0

蛋白质亚细胞定位
TargetP (Subcellular location of proteins: mitochondrial, chloroplastic, secretory pathway, or other)

DeepLoc (Prediction of eukaryotic protein subcellular localization using deep learning)

蛋白质二级结构分析
SOPMA

蛋白质三级结构预测
SWISS-MODEL

短序列拼接
Cap3

多序列比对相似性展示
SimiTriX-SimiTetra 和
Ternary plot


多序列比对可视化
MView: A multiple alignment viewer
or
AlignmentViewer

过滤多序列比对结果
GUIDANCE2 Server: Server for alignment confidence score

绘制GO注释结果
WEGO：Web Gene Ontology Annotation Plotting

蛋白质
Pfam database
meme:Multiple Em for Motif Elicitation
SMART
Conserved Domains within a protein or coding nucleotide sequence

模体(motif)
属于蛋白质的超二级结构，由2个或2个以上具有二级结构的的肽段，在空间上相互接近，形成一个特殊的空间构象，并发挥专一的功能。一种类型的模体总有其特征性的氨基酸序列。
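A motif is a regular combination of secondary structures, for example helix-loop-helix, combinations of β-sheets, or combinations of α-helices. The leucine zipper and the zinc finger are also typical motifs. They carry out specific functions, so a motif is both a structural unit and a functional unit, and motifs can serve directly as building blocks of domains and of tertiary structure. The sites where certain protein factors bind the major groove of DNA rely on specific motifs.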
结构域（domain）
是指在较大的分子（主要指蛋白质也包括核酸分子）中形成的某些在空间上可以辨别的结构，往往是球状压缩区或纤维状压缩区。它们也既是结构单位，又是功能单位。例如免疫球蛋白的功能区就是结构域。


基因组杂合性评估
GenomeScope：Estimate genome heterozygosity, repeat content, and size from sequencing reads using a kmer-based statistical approach

circos图
CIRCOS可以用来画基因组数据的环状图，也可以用来绘制其它数据的相关环状图。

1. 需要注意的是上传数据格式为空格或tab分隔的txt格式纯文本列表文件，值均为非负整数，若存在缺失值，用“-”线代替，若有小数，每一个单元格乘以某一值(如1000)，化为整数，且每个单元格中只能有数字，其他任何符号都不行，除了缺失的“-”，(1555，而不是1,555)；
2. 在线版只能绘制75阶方阵数据，若需要绘制较复杂的请下载Circos and use the tableviewer tool。
3. 每一个标签所对应半圈的总长度为这一标签所对应的所有值的和，不同半圈间连线表示这两标签所表示的值。
元数据可视化
Web-Igloo：Interactively visualizing multivariate data without feature decomposition

需要数据和元数据两个文件,实例数据结构如下：
数据(Select data file (Tab delimited))
Samples    Palmitic    Palmitoleic    Stearic    Oleic    Linoleic    Linolenic    Arachidic    Eicosenoic
S1    1075    75    226    7823    672    36    60    29
S2    1088    73    224    7709    781    31    61    29
S3    911    54    246    8113    549    31    63    29
S4    966    57    240    7952    619    50    78    35
S5    1051    67    259    7771    672    50    80    46
S6    911    49    268    7924    678    51    70    44
S7    922    66    264    7990    618    49    56    29
S8    1100    61    235    7728    734    39    64    35
S9    1082    60    239    7745    709    46    83    33
S10    1037    55    213    7944    633    26    52    30
S11    1051    35    219    7978    605    21    65    24
S12    1036    59    235    7868    661    30    62    44

元数据（Select metadata (Tab delimited)）
Samples    Geography
S1    N
S2    N
S3    N
S4    NA
S5    NA
S6    NA
S7    NAp
S8    NAp
S9    NAp
S10    NApulia
S11    NApulia
S12    NApulia

基因结构展示
GSDS2.0: Gene Structure Display Server

AnnotationSketch

外显子-内含子结构
Exon-Intron Graphic Maker
MyDomains
DomainDraw draws 
蛋白突变位点注释
MutationMapper: interprets mutations with protein annotations

regulatory genes 分析
Transcription factors, transcription regulators, and chromatin regulators, collectively referred to as regulatory genes.
PlantTFcat: An Online Plant Transcription Factor and Transcriptional Regulator Categorization and Analysis Tool

密码子偏好性 (Codon Optimization)
Codon Optimization On-Line (COOL)

Codon Optimization Tool：Integrated DNA Technologies

序列格式转换(Sequence Format Conversion)
EMBOSS Seqret

真菌效应蛋白预测
EffectorP: predicting fungal effector proteins from secretomes using machine learning

BLAST结果可视化
kablammo: Visualize your BLAST results

or
Circoletto

植物基因家族分类和富集分析
GenFam: Gene Family based classification and enrichment analysis

生物类文件格式转换
Sequence conversion

Plant-Specific Myristoylation Predictor
Plant-Specific Myristoylation Predictor

启动子元件预测
Plant CARE: Search for CARE

植物启动子/转录因子分析
PlantPAN 3.0

转录组分析
iDEP

简并引物设计
Genefisher 2

词云 wordcloud
WordCloud Generator

Free Word Cloud Generator




firewall
2015-08-02T06:18:15.000Z
受到管制的中国互联网
中国对其互联网内容有着严格的管制，设置了名为”防火长城”的屏蔽系统来阻止用户与外面的世界自由地互动。想要从中国访问自由的、不受管制的互联网，你需要绕开长城防火墙，在中国，上网就是如此之难。


测试网站是否被墙
websitepulse.com

greatfire.org

翻墙必备
VyprVPN ，有500M的免费流量，抗墙能力强，特别是安卓等移动设备。Vyprvpn为中国用户提供多一倍的免费数据流量，可在此下载（中文页面），在推荐的这个入口注册账号才有多一倍的免费流量，下载手机客户端，用注册的账号登录即可。安卓下的vyprvpn可启用专用协议对抗中国的网络封锁，不可错过。 专供中国用户：安装及使用疑难问题解答。

萤火虫代理 开源翻墙工具，支持多平台。

 Lantern 用私有网络分享节点，可以一试。

WoW Legacy 翻牆瀏覽器WowLegacy，基於Goagent重新開發，可一鍵翻牆。

VPN Gate  由全世界志愿者提供的公共 VPN 服务器获得自由访问互联网。VPNgate的开发者引入了一项新技术：P2P中继。简单来说，如果墙内有人连上了墙外志愿者提供的VPNgate服务器，那么他就自动成为了服务器，其他墙内的人就可以通过连上他来翻墙。（安装 下载 使用经验）

赛风 （下载）由开放网络基金资助、多伦多大学的公民实验室（Citizen Lab）开发。

Shadowsocks 免费账号

建议关注
P2P翻墙项目Uproxy的进展，谷歌翻墙利器。
付费方案
红杏




linux命令行精选
2015-08-01T13:27:15.000Z
致力于收集那些短小精悍的linux命令，欢迎补充！

grep -vf file1 file2 查看两个文件的不同
rename -n "s/-.*//" * 批量前缀重命名
diff <(sort file1.txt) <(sort file2.txt) 比较两个已排序的文件
for i in `find -name '*_test.rb'` ; do mv $i ${i%%_test.rb}_spec.rb ; done 批量重命名
shred -u -z -n 17 rubricasegreta.txt 安全删除文件
sed -re '/^#/d ; s/#.*$//' 清理注释
ps aux | sort -nk 6 排序按 第6列
convert -density 300x300 input.pdf output.png 把pdf转化成png
:%s/\r//g remove ^M (carriage return) characters in vim
export HISTTIMEFORMAT='%Y.%m.%d-%T :: ' HISTFILESIZE=50000 HISTSIZE=50000 更好地设置bash history
/^\([2-9]\d*\|1\d\+\) find numbers greater than 1 in vim
tar cvfz dir_name.tgz dir/ 如何tar gz 一个目录
pv file1 > file2 带进度条的复制文件
gs -dNOPAUSE -sDEVICE=jpeg -r144 -sOutputFile=p%03d.jpg file.pdf 使用Ghostscript转换PDF为JPEG
find . -iname "*.jpg" -printf '<img src="%p">\n' > gallery.html create an HTML gallery (-printf requires GNU find)
sed -i 'M,Nd' file delete lines M through N from a file (for example sed -i '5,10d' file)
sed '/^$/d' file >newfile 清除文件中的空行
awk '{print NR": "$0; for(i=1;i<=NF;++i)print "\t"i": "$i}' 分析列
echo 'wget url' | at 01:00 定时启动wget下载
awk -F'^"|", "|"$' '{ print $2,$3,$4 }' file.csv 用 awk 解析.csv
find -type f -exec mv {} . \; 将子目录的所有内容都移动到当前目录
sed -i 's/\s\+$//' file strip trailing whitespace from every line of a file
ls -al | sort -k5n sort by size
sort file1 file2 | uniq -d 求交集
wget -r -nd -q -A "*.ext" http://www.example.org/   fetch all files with a given extension from a page
awk '{s+=$1}END{print s}'   列求和
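Two of the one-liners above, the intersection and the file difference, are easy to try on throwaway files. This is a hedged sketch: the file names and fruit words are made up, and `-x`/`-F` are added to grep so that each line of file1 is matched as a whole literal line rather than as a regular expression.

```shell
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/file1"
printf '%s\n' banana cherry date  > "$workdir/file2"

# Intersection: lines present in both files
# (assumes no line is repeated inside a single file).
common=$(sort "$workdir/file1" "$workdir/file2" | uniq -d)

# Lines of file2 that do not occur in file1, compared as whole literal lines.
only_in_file2=$(grep -vxFf "$workdir/file1" "$workdir/file2")

printf 'common:\n%s\nonly in file2:\n%s\n' "$common" "$only_in_file2"
rm -rf "$workdir"
```

Without `-x` and `-F`, a short line in file1 such as `a` would also suppress every line of file2 that merely contains the letter a.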




rss制作
2015-08-01T02:34:40.000Z
它非常小巧，仅仅加了一种能自动生成种子的功能。用户只需提供需要处理的URL地址，Feedityl会提供起始终止的选项模块，这样它就能输出选定的内容。




杂志图表的经典用色
2015-07-31T15:07:34.000Z
《经济学人》常用的藏青色

经济学人上的图表，基本只用这一个颜色，或加上一些深浅明暗变化，再就是左上角的小红块，成为经济学人图表的招牌样式。罗兰贝格也非常爱用这个色，有时也配合橙色使用。各类提供专业服务的网站也多爱用此色。
风格就是这样，即使很单调，只要你坚持，也会成为自己的风格，别人也会认同。所谓以不变应万变，变得太多反而难以把握。
《商业周刊》常用的蓝红组合

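Charts in older issues of BusinessWeek almost always used this color combination, which became the magazine's chart trademark and probably derives from its visual identity (VI) system. In recent years, though, the style seems to have changed a good deal and become lighter and brighter.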
《华尔街日报》常用的黑白灰

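The WSJ is a newspaper, so most of its charts are black and white. Even so, charts built from black, white, and gray can still look very professional, and the color scheme is easy to apply.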
使用同一颜色的不同深浅

如果既想使用彩色，又不知道配色理论，可在一个图表内使用同一颜色的不同深浅/明暗。这种方法可以让我们使用丰富的颜色，配色难度也不高，是一种很保险的方法，不会出大问题。当然，最深/最亮的要用于最需要突出的序列。
《FOCUS》常用的一组色

这组颜色似乎是从组织的LOGO而来，比较亮丽明快，也不错。
设计师珍藏自用颜色：橙＋灰

我发现，设计师们总喜好把这个颜色组合用于自己的宣传，似乎这样能体现设计师的专业性。如Inmagine、Nordrio的LOGO就是这样。
暗红＋灰组合

这种红＋灰的组合给人很专业的印象，也经常出现在财经杂志上。
橙＋绿组合

这种橙＋绿的组合比较亮丽明快，充满活力，也经常出现在财经杂志上。
黑底图表

最为强烈的黑白对比，总是显得比较专业、高贵，黑底的图表其特点非常明显，但不要学麦肯锡的那一套，有些刺眼，也被它用滥了。



ggplot2 fine drawing using mytheme
2015-07-31T14:45:06.000Z
ggplot2本身自带了很漂亮的主题格式，如theme_gray和theme_bw。但是在工作用图上，很多时候对图表格式配色字体等均有明文的规定。ggplot2允许我们事先定制好图表样式，我们可以生成如mytheme或者myline这样的有明确配色主题的对象，到时候就可以定制保存图表模板或者格式刷，直接在生成的图表里引用格式刷型的主题配色，就可以快捷方便的更改图表内容，保持风格的统一了。
画图前的准备：自定义ggplot2格式刷
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(tidyr)
library(grid)
#定义好字体
windowsFonts(CA=windowsFont("Calibri"))
#事先定制好要加图形的形状、颜色、主题等
#定制主题，要求背景全白，没有边框。然后所有的字体都是某某颜色
mytheme<-theme_bw()+theme(legend.position="right",
                          panel.border=element_blank(),
                          panel.grid.major=element_line(linetype="dashed"),
                          panel.grid.minor=element_blank(),
                          plot.title=element_text(size=15,
                                                  colour="#003087",
                                                  family="CA"),
                          legend.text=element_text(size=9,colour="#003087",
                                                   family="CA"),
                          legend.key=element_blank(),
                          axis.text=element_text(size=10,colour="#003087",
                                                 family="CA"),
                          strip.text=element_text(size=12,colour="#EF0808",
                                                  family="CA"),
                          strip.background=element_blank()

                        )
#饼图主题
pie_theme=mytheme+theme(axis.text=element_blank(),
                        axis.ticks=element_blank(),
                        axis.title=element_blank(),
                        panel.grid.major=element_blank())
#定制线的大小
myline_blue<-geom_line(colour="#085A9C",size=2)
myline_red<-geom_line(colour="#EF0808",size=2)
myarea=geom_area(colour=NA,fill="#003087",alpha=.2)
mypoint=geom_point(size=3,shape=21,colour="#003087",fill="white")
mybar=geom_bar(fill="#0C8DC4",stat="identity")
#然后是配色（主要是图形配色），考虑到样本的多样性，可以事先设定颜色，如3种颜色或7种颜色的组合，如果想要用原来默认主题颜色则这部分注释掉即可
mycolour_3<-scale_fill_manual(values=c("#085A9C","#EF0808","#526373"))
mycolour_7<-scale_fill_manual(values=c("#085A9C","#EF0808","#526373",
"#FFFFE7","#FF9418","#219431","#9C52AD"))
mycolour_line_7<-scale_color_manual(values=c("#085A9C","#EF0808","#526373",
                                             "#0C8DC4","#FF9418","#219431","#9C52AD"))
#定制标题
mytitle<-labs(title = "hope 实例")

载入例子数据
require(ggplot2)
data(diamonds)
set.seed(42)
small <- diamonds[sample(nrow(diamonds), 1000), ]
head(small)

不用自定义格式画图
ggplot(small)+geom_bar(aes(x=clarity, fill=cut))+coord_polar()


利用自定义格式画图
ggplot(small)+geom_bar(aes(x=clarity, fill=cut))+coord_polar()+ mytheme + mytitle


画图前的准备：数据塑形利器dplyr/tidyr
在R里，则是用一些常用的包，如dplyr及tidyr，基本用的都是reshape2+plyr的组合对数据进行重塑再造.
载入数据，数据来源: 股票的成交明细.
> data<-read.table("gupiao.txt",header=TRUE)
> head(data)
     Time Price BuySell Volume     Money
1 9:25:08 18.03    0.00  73520 132557642
2 9:29:59 17.99   -0.04  11034  19857700
3 9:30:09 17.99    0.00  28920  52089378
4 9:30:09 17.99    0.00   9272  16681906
5 9:30:14 17.96   -0.03    556    998913
6 9:30:19 17.96    0.00    873   1567490
> dim(data)
[1] 2345    5

将数据汇总(group_by & summarize) 甚至再拆分 (spread)
示例里面就是把成交记录按照成交Price和BuySell拆分
> data %>% group_by(Price,BuySell) %>% summarize(Money=sum(Money,na.rm=TRUE)) %>% spread(BuySell,Money)
Source: local data frame [46 x 16]

   Price -0.07 -0.06 -0.05    -0.04   -0.03    -0.02    -0.01        0     0.01
1  17.58    NA    NA    NA       NA      NA       NA 41631769 29645465       NA
2  17.59    NA    NA    NA       NA      NA 17173618 37029276       NA 24179724
3  17.60    NA    NA    NA       NA      NA   318581 42756941 15987562 11197676
4  17.61    NA    NA    NA       NA      NA       NA 58098336 36701330 14999088
5  17.62    NA    NA    NA       NA      NA       NA 32385632 51156365 24341609
6  17.63    NA    NA    NA       NA      NA  5191027 16112558 32054647 23599759
7  17.64    NA    NA    NA 24642084 3725529 14682967  4791698 18864232  4731619
8  17.65    NA    NA    NA       NA 3918096  6003983 16293279 19115145 13177514
9  17.66    NA    NA    NA       NA 5175002       NA 54169855 16671362  7801764
10 17.67    NA    NA    NA       NA      NA 10951987 38090607  8704892  7911066
..   ...   ...   ...   ...      ...     ...      ...      ...      ...      ...
Variables not shown: 0.02 (int), 0.03 (int), 0.04 (int), 0.05 (int), 0.06 (int),
  0.07 (int)

然后便可用ggplot等包绘图。
常用图形
简单柱形图+文本（单一变量） 
分面柱形图（facet_wrap/facet_grid) 
簇型柱形图(position="dodge") 
堆积柱形图(需要先添加百分比，再对百分比的变量做柱形图) 
饼图、极坐标图 
多重线性图

1)简单柱形图 

代码组成如下，这里使用格式刷mybar和mytheme，然后用geom_text添加柱形图标签(vjust=1表示在柱形图里面显示)
data1<-diamonds %>% group_by(cut) %>% summarize(avg_price=mean(price))
bar_chart<-ggplot(data1,aes(x=cut,y=avg_price,fill=as.factor(cut)))+
        mybar+mytheme+mytitle+
        geom_text(aes(label=round(avg_price)),vjust=1,colour="white")
bar_chart

2）带分类的柱形图
使用facet_wrap或者facet_grid可以快速绘制相应图形。这也是ggplot2不太支持双坐标的原因：可以快速绘图，就不需要做那么多无用功了。

代码如下：
#dplyr处理数据
data2<-diamonds %>% group_by(cut,color) %>% summarize(avg_price=mean(price))
#画图，套用设定好的绘图元素
ggplot(data2,aes(x=color,y=avg_price))+facet_wrap(~cut,ncol = 2)+
        mybar+mytheme+mytitle
#在facet_wrap里面，如果加上scales="free"的话，坐标就不一样了。

3）簇型图
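The key point: after mapping the data, set position="dodge" (bars placed side by side) when adding geom_bar. Without it, geom_bar produces a stacked chart by default.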

代码如下：
data3<-diamonds %>% filter(cut %in% c("Fair","Very Good","Ideal")) %>%
        group_by(cut,color) %>% summarize(avg_price=mean(price))
#簇状图
簇状柱形图<-ggplot(data3,aes(x=color,y=avg_price,fill=cut))+
        geom_bar(stat="identity",position="dodge")+
        mytheme+mytitle+mycolour_3

这里如果想要定义颜色的相应顺序的话，可以使用factor 
譬如以下,只是用这行代码对颜色重新定义一下，用levels改变factor顺序，再画图的时候，颜色以及柱子顺序就会跟着改变了。非常方便。
data3$cut<-factor(data3$cut,levels=c("Very Good","Ideal","Fair"))

4）百分比堆积图
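Before plotting, add a percentage column to the data; here it is added with mutate(percent=n/sum(n)), computed within each color so that every bar sums to 1. Also drop position="dodge".

data4<-diamonds %>% filter(cut %in% c("Fair","Very Good","Ideal")) %>%
         count(color,cut) %>%
        group_by(color) %>%   # current dplyr returns ungrouped counts, so regroup by color
        mutate(percent=n/sum(n))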
堆积图<-ggplot(data4,aes(x=color,y=percent,fill=cut))+mytitle+
        geom_bar(stat="identity")+mytheme+mycolour_3

当然，也可以做面积图。不过如果数据有缺失，面积图出错几率蛮大的
5）饼图以及极坐标图
在ggplot2里并没有直接画饼图的方法，基本上都是先画出柱形图，再用coord_polar转化为饼图.
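Leave x unspecified, let geom_bar build the y axis, set fill to the category, and have coord_polar project the y axis. The advantage of this approach is that the code is simple: coord_polar("y").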
加标签方法请见： http://stackoverflow.com/questions/8952077/pie-plot-getting-its-text-on-top-of-each-other

data5<-diamonds %>% count(cut) %>% 
        mutate(percent=n/sum(n))
ggplot(data5,aes(x=factor(1),y=percent,fill=cut))+geom_bar(stat="identity",width=3)+mycolour_7+
        coord_polar("y")+pie_theme+mytitle

6) Line chart

要点是，先做成如A-B-变量这样的二联表，然后，x轴为A，group为b,colour为b
下面代码展示了这个处理
如果去掉group的话，折线图会不知道怎么去处理数字。
data6<-diamonds %>% count(color,cut) %>% filter(color %in% c("D","E","F"))%>%
        mutate(percent=n/sum(n))
ggplot(data6,aes(x=cut,y=n,group=color,colour=color))+geom_line(size=1.5)+mypoint+
        mycolour_line_7+mytheme+mytitle




14 个 grep 命令的例子
2015-07-29T14:49:56.000Z
概述：所有的类linux系统都会提供一个名为grep(global regular expression print，全局正则表达式输出)的搜索工具。grep命令在对一个或多个文件的内容进行基于模式的搜索的情况下是非常有用的。模式可以是单个字符、多个字符、单个单词、或者是一个句子。
当命令匹配到执行命令时指定的模式时，grep会将包含模式的一行输出，但是并不对原文件内容进行修改。
在本文中，我们将会讨论到14个grep命令的例子。
Example 1: Search for a pattern (word) in a file
Search /etc/passwd for the word "linuxtechi":
root@Linux-world:~# grep linuxtechi /etc/passwd
linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
root@Linux-world:~#

例2 在多个文件中查找模式。
root@Linux-world:~# grep linuxtechi /etc/passwd /etc/shadow /etc/gshadow
/etc/passwd:linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
/etc/shadow:linuxtechi:$6$DdgXjxlM$4flz4JRvefvKp0DG6re:16550:0:99999:7:::
/etc/gshadow:adm:*::syslog,linuxtechi
/etc/gshadow:cdrom:*::linuxtechi
/etc/gshadow:sudo:*::linuxtechi
/etc/gshadow:dip:*::linuxtechi
/etc/gshadow:plugdev:*::linuxtechi
/etc/gshadow:lpadmin:!::linuxtechi
/etc/gshadow:linuxtechi:!::
/etc/gshadow:sambashare:!::linuxtechi
root@Linux-world:~#
例3 使用-l参数列出包含指定模式的文件的文件名。
1. root@Linux-world:~# grep -l linuxtechi /etc/passwd /etc/shadow /etc/fstab /etc/mtab
2. /etc/passwd
3. /etc/shadow
4. root@Linux-world:~#
例4 使用-n参数，在文件中查找指定模式并显示匹配行的行号
1. root@Linux-world:~# grep -n linuxtechi /etc/passwd
2. 39:linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
3. root@Linux-world:~#
4. root@Linux-world:~# grep -n root /etc/passwd /etc/shadow
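Example 5: Use -v to print the lines that do not contain the pattern
Print all lines of /etc/passwd that do not contain the word "linuxtechi":
root@Linux-world:~# grep -v linuxtechi /etc/passwd
Example 6: Use ^ to print all lines that begin with a given pattern
In a regular expression, ^ is a special character that anchors the match at the start of a line. For example, print all lines of /etc/passwd that begin with "root":
root@Linux-world:~# grep ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash
root@Linux-world:~#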
例7 使用 $ 符号输出所有以指定模式结尾的行。输出/etc/passwd文件中所有以”bash”结尾的行。
1. root@Linux-world:~# grep bash$ /etc/passwd
2. root:x:0:0:root:/root:/bin/bash
3. linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
4. root@Linux-world:~#
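In a regular expression, the dollar sign ($) is a special character that anchors the match at the end of a line. Quoting the pattern, as in grep 'bash$', keeps the shell from interpreting the $.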
例8 使用 -r 参数递归地查找特定模式
1. root@Linux-world:~# grep -r linuxtechi /etc/
2. /etc/subuid:linuxtechi:100000:65536
3. /etc/group:adm:x:4:syslog,linuxtechi
4. /etc/group:cdrom:x:24:linuxtechi
5. /etc/group:sudo:x:27:linuxtechi
6. /etc/group:dip:x:30:linuxtechi
7. /etc/group:plugdev:x:46:linuxtechi
8. /etc/group:lpadmin:x:115:linuxtechi
9. /etc/group:linuxtechi:x:1000:
10. /etc/group:sambashare:x:131:linuxtechi
11. /etc/passwd-:linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
12. /etc/passwd:linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
13. ............................................................................
上面的命令将会递归的在/etc目录中查找”linuxtechi”单词
例9 使用 grep 查找文件中所有的空行
1. root@Linux-world:~# grep ^$ /etc/shadow
2. root@Linux-world:~#
由于/etc/shadow文件中没有空行，所以没有任何输出
Example 10: Use -i to ignore case
The -i option makes grep ignore case when matching.
For example, search the passwd file for the word "LinuxTechi":
nextstep4it@localhost:~$ grep -i LinuxTechi /etc/passwd
linuxtechi:x:1001:1001::/home/linuxtechi:/bin/bash
nextstep4it@localhost:~$
Example 11: Use -e to search for multiple patterns
For example, to search for both 'linuxtechi' and 'root' in a single grep command, pass each pattern with its own -e option.
root@Linux-world:~# grep -e "linuxtechi" -e "root" /etc/passwd
root:x:0:0:root:/root:/bin/bash
linuxtechi:x:1000:1000:linuxtechi,,,:/home/linuxtechi:/bin/bash
# or
root@Linux-world:~# grep -E "(Olfr1413|Olfr1411)" Mus_musculus.GRCm38.75_chr1_genes.txt
ENSMUSG00000058904      Olfr1413
ENSMUSG00000062497      Olfr1411
# or
grep 'usrquota\|grpquota' /etc/fstab
Example 12: Use -f to read the patterns from a file
First create a pattern file named "grep_pattern" in the current directory with the following content:
root@Linux-world:~# cat grep_pattern
^linuxtechi
root
false$
root@Linux-world:~#
现在，试试使用grep_pattern文件进行搜索
1. root@Linux-world:~# grep -f grep_pattern /etc/passwd
Example 13: Use -c to count the lines that match
Continuing the example above, add the -c option to count the lines that match any of the patterns:
1. root@Linux-world:~# grep -c -f grep_pattern /etc/passwd
2. 22
3. root@Linux-world:~#
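Note that -c counts matching lines, not individual matches. The sketch below makes the difference visible on three invented input lines; the `-o` option used for the second count is supported by GNU and BSD grep but is not part of POSIX, and the function names are hypothetical.

```shell
sample_lines() {
  printf '%s\n' 'root root' 'daemon' 'rootkit'
}

# Number of lines that contain "root" at least once.
count_matching_lines() {
  sample_lines | grep -c 'root'
}

# Number of individual occurrences of "root": one output line per match.
count_occurrences() {
  sample_lines | grep -o 'root' | wc -l | tr -d ' '
}

echo "lines: $(count_matching_lines), occurrences: $(count_occurrences)"
```

The first line contains the word twice, so the two counts differ: 2 matching lines versus 3 occurrences.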
例14 输出匹配指定模式行的前或者后面N行
a)使用-B参数输出匹配行的前4行
1. root@Linux-world:~# grep -B 4 "games" /etc/passwd
b)使用-A参数输出匹配行的后4行
1. root@Linux-world:~# grep -A 4 "games" /etc/passwd
c)使用-C参数输出匹配行的前后各4行
1. root@Linux-world:~# grep -C 4 "games" /etc/passwd
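The context options are easier to follow on a tiny input than on /etc/passwd. The sketch below uses ten made-up lines and asks for one line of trailing context; `-A` is available in GNU and BSD grep, and `number_words` is a hypothetical helper.

```shell
number_words() {
  printf '%s\n' one two three four five six seven eight nine ten
}

# Print each match plus the single line that follows it.
context_demo() {
  number_words | grep -A 1 -e three -e eight
}

context_demo
```

The two matches are far apart, so grep prints them as two groups (three/four and eight/nine) with a `--` line between them.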
例15 -E参数用扩展的正则表达式
列出当前目录下包含 s..s 的文件名，\1为反向引用()中内容，如下例子中等同于s。
ls | grep -E '(s).+\1'
例16 组分隔符
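'--group-separator=STRING'
When -A, -B, or -C is used, print STRING instead of the default group separator.
(Note: a group is a match together with its context. With "-A 2", for example, a matching line is printed along with the two lines after it, and these form one group. If the next match falls on a line beyond that group, the two groups are separated by "--" by default.)
'--no-group-separator'
When -A, -B, or -C is used, print no group separator at all; adjacent groups are printed one directly after the other.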



Line-level difference between two files
2015-07-29T14:39:49.000Z
This script computes the difference between two files line by line: the lines that exist in A but not in B, and the lines that exist in B but not in A.
Code:
#!/usr/bin/perl -w
use strict;

# Usage: perl script.pl fileA fileB
my ($fileA,$fileB) = @ARGV;

# Remember every line of A; the value is the line number of its last occurrence
open A,'<',$fileA or die "Unable to open file:$fileA:$!";
my %ta;
my $i = 0;
while(<A>){
  chomp;
  $ta{$_} = ++$i;
}

close A;

# Lines of B that are missing from A are collected in @B;
# lines found in both files are marked with 0 in %ta
open B,'<',$fileB or die "Unable to open file:$fileB:$!";
my @B;
while(<B>){
    chomp;
    unless (defined $ta{$_}){
        push @B,$_;
    }else{
        $ta{$_} = 0;
    }
}
close B;

# Output diff to different files respectively

open DIFF_A, ">$fileA.diff" or die "Unable to create diff file for $fileA:$!";
my $countA = 0;
print "Remain in files $fileA\n";

# Lines still marked > 0 exist only in A; write them in their original order
foreach my $line (sort { $ta{$a} <=> $ta{$b} } grep { $ta{$_} > 0 } keys %ta) {
    print DIFF_A $line."\n";
    $countA++;
}

print "$countA lines\n";

close DIFF_A;

open DIFF_B, ">$fileB.diff" or die "Unable to create diff file for $fileB:$!";
my $countB = scalar @B;
print DIFF_B $_."\n" foreach @B;

if ($countA == 0 and $countB ==0 ){
    print STDOUT "The two files are identical\n";
}

close DIFF_B;
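
For comparison, the same idea (lines only in A, lines only in B) can be sketched in Python with sets. This is a minimal illustration, not a port of the script above: the function name `line_diff` is made up here, and repeated lines are reported once.

```python
def line_diff(lines_a, lines_b):
    """Return (only_in_a, only_in_b), each in original order, without repeats."""
    set_a, set_b = set(lines_a), set(lines_b)
    # dict.fromkeys keeps first-seen order while dropping repeated lines
    only_a = [x for x in dict.fromkeys(lines_a) if x not in set_b]
    only_b = [x for x in dict.fromkeys(lines_b) if x not in set_a]
    return only_a, only_b


if __name__ == "__main__":
    a = ["x", "y", "z", "x"]
    b = ["y", "w"]
    only_a, only_b = line_diff(a, b)
    print("only in A:", only_a)
    print("only in B:", only_b)
```

To use it on real files, read each file with `open(path).read().splitlines()` and pass the two lists in.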




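How to submit a sitemap to Google
2015-07-29T14:23:44.000Z
A sitemap lets a site administrator tell search engines which pages on the site are available for crawling. Submitting the sitemap of your Hexo blog to Google helps other people find the blog through Google search.
The steps are as follows:
Step 1: Generate your sitemap file with xml-sitemaps.

Once it is generated, click the area marked with the red box to download your sitemaps.xml file.

Step 2: Submit your site to Google
Log in to the Webmaster Central page with your Google account:
https://www.google.com/webmasters/verification/home?hl=en

Click ADD A SITE.
Enter the site URL and click Continue.

Step 3: Verify ownership of the site
Go to the ownership verification page.
You can verify by uploading an HTML file to your site.

Alternatively, you can choose another method, the HTML tag.

In short, this means adding a meta tag to the head of your home page.
After adding the meta tag that Google specifies to your home page,
return to this page and click the Verify button to complete verification.

Step 4: Upload the sitemap in Google Webmaster Tools
Open the following link:
https://www.google.com/webmasters/tools

Because the site was already submitted to Google in Step 2,
a thumbnail of your site is shown here.

Click the area marked with the red box
to open the site dashboard.
Click the Sitemaps item.
Then click the ADD/TEST SITEMAP button,
enter the link to your sitemap.xml,
and click the Submit Sitemap button.
The page then reports how many URLs have been indexed.
Step 5: Test the sitemap
Click Test Sitemap to run the test.

A successful test looks like this: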




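PCA analysis
2015-07-29T13:54:45.000Z
Principal component analysis (PCA) is a dimensionality reduction technique. It transforms a large number of correlated variables into a much smaller set of uncorrelated variables called principal components, each of which is a linear combination of the observed variables. The first principal component, for example, is:
$$PC_1 = a_1X_1 + a_2X_2 + \dots + a_kX_k$$
It is a weighted combination of the k observed variables and explains the largest share of the variance in the original variable set. The second principal component is also a linear combination of the original variables; it explains the second largest share of the variance and is orthogonal to (uncorrelated with) the first. Each later component maximizes the variance it explains while remaining orthogonal to all the components before it.
The PCA model is shown in the figure below:

As an example, consider a PCA of the following data:
Principal component analysis of four body measurements of students (x1身高 = height, x2体重 = weight, x3胸围 = chest circumference, x4坐高 = sitting height; the first column, 学生序号, is the student number)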
学生序号	x1身高	x2体重	x3胸围	x4坐高
1	148	41	72	78
2	139	34	71	76
3	160	49	77	86
4	149	36	67	79
5	159	45	80	86
6	142	31	66	76
7	153	43	76	83
8	150	43	77	79
9	151	42	77	80
10	139	31	68	74
11	140	29	64	74
12	161	47	78	84
13	158	49	78	83
14	140	33	67	77
15	137	31	66	73
16	152	35	73	79
17	149	47	82	79
18	145	35	70	77
19	160	47	74	87
20	156	44	78	85
21	151	42	73	82
22	147	38	73	78
23	157	39	68	80
24	147	30	65	75
25	157	48	80	88
26	151	36	74	80
27	144	36	68	76
28	141	30	67	76
29	139	32	68	73
30	148	38	70	78

Read the data into R
> d=read.table("clipboard",header=T)
> d
x1身高 x2体重 x3胸围 x4坐高
1 148 41 72 78
2 139 34 71 76
3 160 49 77 86
4 149 36 67 79
5 159 45 80 86
6 142 31 66 76
7 153 43 76 83
8 150 43 77 79
9 151 42 77 80
10 139 31 68 74
11 140 29 64 74
12 161 47 78 84
13 158 49 78 83
14 140 33 67 77
15 137 31 66 73
16 152 35 73 79
17 149 47 82 79
18 145 35 70 77
19 160 47 74 87
20 156 44 78 85
21 151 42 73 82
22 147 38 73 78
23 157 39 68 80
24 147 30 65 75
25 157 48 80 88
26 151 36 74 80
27 144 36 68 76
28 141 30 67 76
29 139 32 68 73
30 148 38 70 78

Standardize the raw data
> sd=scale(d)
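
As a cross-check on what `scale()` does to one column, the following Python sketch standardizes the height column (x1身高) by subtracting the mean and dividing by the sample standard deviation (n - 1 denominator). The helper name `standardize` is made up for this illustration; only the data come from the table above.

```python
from statistics import mean, stdev

# Height column (x1) for the 30 students, copied from the table above
heights = [148, 139, 160, 149, 159, 142, 153, 150, 151, 139,
           140, 161, 158, 140, 137, 152, 149, 145, 160, 156,
           151, 147, 157, 147, 157, 151, 144, 141, 139, 148]


def standardize(values):
    """Z-score each value: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, n - 1 in the denominator
    return [(v - m) / s for v in values]


z = standardize(heights)
print(mean(heights), round(stdev(heights), 6))
print([round(v, 7) for v in z[:3]])
```

The printed center, scale, and first few z-scores should agree with the x1 column of the standardized output that R prints.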

Show the standardized data
> sd
x1身高 x2体重 x3胸围 x4坐高
[1,] -0.1366952 0.35602486 -0.04530114 -0.31999814
[2,] -1.3669516 -0.72752905 -0.23944887 -0.78828809
[3,] 1.5036468 1.59437218 0.92543751 1.55316168
[4,] 0.0000000 -0.41794222 -1.01603978 -0.08585316
[5,] 1.3669516 0.97519852 1.50788070 1.55316168
[6,] -0.9568661 -1.19190930 -1.21018751 -0.78828809
[7,] 0.5467806 0.66561169 0.73128978 0.85072675
[8,] 0.1366952 0.66561169 0.92543751 -0.08585316
[9,] 0.2733903 0.51081827 0.92543751 0.14829182
[10,] -1.3669516 -1.19190930 -0.82189205 -1.25657805
[11,] -1.2302564 -1.50149613 -1.59848297 -1.25657805
[12,] 1.6403419 1.28478535 1.11958524 1.08487173
[13,] 1.2302564 1.59437218 1.11958524 0.85072675
[14,] -1.2302564 -0.88232247 -1.01603978 -0.55414311
[15,] -1.6403419 -1.19190930 -1.21018751 -1.49072302
[16,] 0.4100855 -0.57273564 0.14884659 -0.08585316
[17,] 0.0000000 1.28478535 1.89617616 -0.08585316
[18,] -0.5467806 -0.57273564 -0.43359660 -0.55414311
[19,] 1.5036468 1.28478535 0.34299432 1.78730666
[20,] 0.9568661 0.82040510 1.11958524 1.31901671
[21,] 0.2733903 0.51081827 0.14884659 0.61658177
[22,] -0.2733903 -0.10835539 0.14884659 -0.31999814
[23,] 1.0935613 0.04643802 -0.82189205 0.14829182
[24,] -0.2733903 -1.34670271 -1.40433524 -1.02243307
[25,] 1.0935613 1.43957876 1.50788070 2.02145164
[26,] 0.2733903 -0.41794222 0.34299432 0.14829182
[27,] -0.6834758 -0.41794222 -0.82189205 -0.78828809
[28,] -1.0935613 -1.34670271 -1.01603978 -0.78828809
[29,] -1.3669516 -1.03711588 -0.82189205 -1.49072302
[30,] -0.1366952 -0.10835539 -0.43359660 -0.31999814
attr(,"scaled:center")
x1身高 x2体重 x3胸围 x4坐高 
149.00000 38.70000 72.23333 79.36667 
attr(,"scaled:scale")
x1身高 x2体重 x3胸围 x4坐高 
7.315548 6.460223 5.150717 4.270858

Read the standardized data
> d=read.table("clipboard",header=T)

Principal component analysis
> pca=princomp(d,cor=T)

Scree plot
> screeplot(pca,type="line",main="碎石图",lwd=2)

The first principal component accounts for most of the variance.
Compute the correlation matrix
> dcor=cor(d)

Output
> dcor
               x1身高       x2体重       x3胸围       x4坐高
x1身高 1.0000000 0.8631621 0.7321119 0.9204624
x2体重 0.8631621 1.0000000 0.8965058 0.8827313
x3胸围 0.7321119 0.8965058 1.0000000 0.7828827
x4坐高 0.9204624 0.8827313 0.7828827 1.0000000

Compute the eigenvalues and eigenvectors of the correlation matrix
> deig=eigen(dcor)

Output
>deig
$values
[1] 3.54109800 0.31338316 0.07940895 0.06610989
$vectors
[,1] [,2] [,3] [,4]
[1,] -0.4969661 0.5432128 -0.4496271 0.5057471
[2,] -0.5145705 -0.2102455 -0.4623300 -0.6908436
[3,] -0.4809007 -0.7246214 0.1751765 0.4614884
[4,] -0.5069285 0.3682941 0.7439083 -0.2323433

Print the eigenvalues
> deig$values
[1] 3.54109800 0.31338316 0.07940895 0.06610989
> sumeigv=sum(deig$values)
> sumeigv
[1] 4

Compute the cumulative proportion of variance explained by the first 2 principal components
> sum(deig$value[1:2])/4
[1] 0.9636203
> sum(deig$value[1:1])/4
[1] 0.8852745
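
The same arithmetic can be sketched in Python. The eigenvalues are the ones printed above; the function name `explained_variance` is an illustrative choice, not part of the R session.

```python
# Eigenvalues of the correlation matrix, as printed by eigen() above
eigenvalues = [3.54109800, 0.31338316, 0.07940895, 0.06610989]


def explained_variance(values):
    """Return (per-component ratios, cumulative ratios) of explained variance."""
    total = sum(values)  # equals the number of variables for a correlation matrix
    ratios = [v / total for v in values]
    cumulative = []
    running = 0.0
    for r in ratios:
        running += r
        cumulative.append(running)
    return ratios, cumulative


ratios, cumulative = explained_variance(eigenvalues)
print([round(r, 7) for r in ratios])
print([round(c, 7) for c in cumulative])
```

Dividing by the sum of the eigenvalues rather than the literal 4 keeps the function usable for any number of variables.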

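The first principal component explains 88.53% of the variance, and the first two together explain 96.36%, so the first two components are enough to summarize this data set well.
Print the loadings (eigenvectors) of the first two principal components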
> pca$loadings[,1:2]
              Comp.1     Comp.2
x1身高 -0.4969661 0.5432128
x2体重 -0.5145705 -0.2102455
x3胸围 -0.4809007 -0.7246214
x4坐高 -0.5069285 0.3682941

$$
\begin{aligned}
z_1 &= -0.4969661\,x_1 - 0.5145705\,x_2 - 0.4809007\,x_3 - 0.5069285\,x_4 \\
z_2 &= 0.5432128\,x_1 - 0.2102455\,x_2 - 0.7246214\,x_3 + 0.3682941\,x_4 \\
z &= \frac{3.54109800}{4}\,z_1 + \frac{0.31338316}{4}\,z_2 = 0.8852745\,z_1 + 0.07834579\,z_2 \\
  &= 0.8852745\,(-0.4969661\,x_1 - 0.5145705\,x_2 - 0.4809007\,x_3 - 0.5069285\,x_4) \\
  &\quad + 0.07834579\,(0.5432128\,x_1 - 0.2102455\,x_2 - 0.7246214\,x_3 + 0.3682941\,x_4)
\end{aligned}
$$

Compute the coefficients b1 and b2 of the principal components C1 and C2:
> deig$values[1]/4;deig$values[2]/4
[1] 0.8852745
[1] 0.07834579

The composite score function C is:
C=(b1*C1+b2*C2)/(b1+b2)=0.9187*C1+0.0813*C2
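
To make the weights explicit, here is a small Python sketch that normalizes b1 and b2 so they sum to 1 and applies them to one pair of component scores. The function names are illustrative assumptions; the numbers are taken from the R output in this post (the pair of scores is student 1's Comp.1 and Comp.2).

```python
def composite_weights(b1, b2):
    """Normalize two variance contributions so that the weights sum to 1."""
    total = b1 + b2
    return b1 / total, b2 / total


def composite_score(c1, c2, w1, w2):
    """Weighted sum of the first two principal component scores."""
    return w1 * c1 + w2 * c2


w1, w2 = composite_weights(0.8852745, 0.07834579)
print(round(w1, 4), round(w2, 4))

# Student 1, using the rounded weights that the R command in this post uses
print(round(composite_score(0.06990950, -0.23813701, 0.918696, 0.0813), 8))
```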

Extract the scores of the first 2 principal components
> s=pca$scores[,1:2]

Compute the composite score
> c=s[1:30,1]*0.918696+s[1:30,2]*0.0813

> s[1:30,1]
[1] 0.06990950 1.59526340 -2.84793151 0.75996988 -2.73966777 2.10583168
[7] -1.42105591 -0.82583977 -0.93464402 2.36463820 2.83741916 -2.60851224
[13] -2.44253342 1.86630669 2.81347421 0.06392983 -1.55561022 1.07392251
[19] -2.52174212 -2.14072377 -0.79624422 0.28708321 -0.25151075 2.05706032
[25] -3.08596855 -0.16367555 1.37265053 2.16097778 2.40434827 0.50287468

Print the composite scores
> cbind(s,c)
          Comp.1       Comp.2           c
[1,] 0.06990950 -0.23813701 0.04486504
[2,] 1.59526340 -0.71847399 1.40715017
[3,] -2.84793151 0.38956679 -2.58471151
[4,] 0.75996988 0.80604335 0.76371262
[5,] -2.73966777 0.01718087 -2.51552502
[6,] 2.10583168 0.32284393 1.96086635
[7,] -1.42105591 -0.06053165 -1.31043961
[8,] -0.82583977 -0.78102576 -0.82219309
[9,] -0.93464402 -0.58469242 -0.90618922
[10,] 2.36463820 -0.36532199 2.14268298
[11,] 2.83741916 0.34875841 2.63507969
[12,] -2.60851224 0.21278728 -2.37913015
[13,] -2.44253342 -0.16769496 -2.25757928
[14,] 1.86630669 0.05021384 1.71865087
[15,] 2.81347421 -0.31790107 2.55888214
[16,] 0.06392983 0.20718448 0.07557617
[17,] -1.55561022 -1.70439674 -1.56770034
[18,] 1.07392251 -0.06763418 0.98110965
[19,] -2.52174212 0.97274301 -2.23763039
[20,] -2.14072377 0.02217881 -1.96487123
[21,] -0.79624422 0.16307887 -0.71824807
[22,] 0.28708321 -0.35744666 0.23468178
[23,] -0.25151075 1.25555188 -0.12898555
[24,] 2.05706032 0.78894494 1.95395431
[25,] -3.08596855 -0.05775318 -2.83976229
[26,] -0.16367555 0.04317932 -0.14685759
[27,] 1.37265053 0.02220972 1.26285420
[28,] 2.16097778 0.13733233 1.99644676
[29,] 2.40434827 -0.48613137 2.16934265
[30,] 0.50287468 0.14734317 0.47396795

Sorted by composite score, highest first
[11,]	2.83741916	0.34875841	2.63507969
[15,]	2.81347421	-0.31790107	2.55888214
[29,]	2.40434827	-0.48613137	2.16934265
[10,]	2.3646382	-0.36532199	2.14268298
[28,]	2.16097778	0.13733233	1.99644676
[6,]	2.10583168	0.32284393	1.96086635
[24,]	2.05706032	0.78894494	1.95395431
[14,]	1.86630669	0.05021384	1.71865087
[2,]	1.5952634	-0.71847399	1.40715017
[27,]	1.37265053	0.02220972	1.2628542
[18,]	1.07392251	-0.06763418	0.98110965
[4,]	0.75996988	0.80604335	0.76371262
[30,]	0.50287468	0.14734317	0.47396795
[22,]	0.28708321	-0.35744666	0.23468178
[16,]	0.06392983	0.20718448	0.07557617
[1,]	0.0699095	-0.23813701	0.04486504
[23,]	-0.25151075	1.25555188	-0.12898555
[26,]	-0.16367555	0.04317932	-0.14685759
[21,]	-0.79624422	0.16307887	-0.71824807
[8,]	-0.82583977	-0.78102576	-0.82219309
[9,]	-0.93464402	-0.58469242	-0.90618922
[7,]	-1.42105591	-0.06053165	-1.31043961
[17,]	-1.55561022	-1.70439674	-1.56770034
[20,]	-2.14072377	0.02217881	-1.96487123
[19,]	-2.52174212	0.97274301	-2.23763039
[13,]	-2.44253342	-0.16769496	-2.25757928
[12,]	-2.60851224	0.21278728	-2.37913015
[5,]	-2.73966777	0.01718087	-2.51552502
[3,]	-2.84793151	0.38956679	-2.58471151
[25,]	-3.08596855	-0.05775318	-2.83976229




Extract sequences from a FASTA file according to an ID list file
2015-07-29T05:53:40.000Z
One file is the ID list, with one ID per line. The other file holds sequences in FASTA format: a line starting with > is a header, and all the lines below it belong to that header until the next line starting with >.
Perl code:
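#!/usr/bin/perl -w
# Usage: perl script.pl id_list.txt sequences.fasta
if( @ARGV != 2  )
{
    print "Usage: we need two files\n";
    exit 0;
}
my $ID=shift @ARGV;
my $fasta=shift @ARGV;
my %hash;
# Read the first argument, the ID list: every line's ID goes into the hash
open FH1,"<$ID" or die "can not open the file,$!";
while (<FH1>)
{
    chomp;
    $hash{$_}=1;
}
close FH1;
open FH2,"<$fasta" or die "can not open the file,$!";
my $flag = 0;
while(defined(my $line=<FH2>))
{
    chomp $line;
    if($line =~ /^>/)
    {
        my $key = (split /\s/,$line)[0];
        $key =~ s/>//g;
        # The flag controls whether the sequence under this header is printed
        $flag = exists $hash{$key}?1:0;
    }

    # Print the header and its sequence lines only when the ID is in the list
    print $line."\n" if $flag == 1;

}
close FH2;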
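
The flag-based filtering can also be sketched in Python. This is an illustrative equivalent under the same assumptions as the description (the ID is the first whitespace-delimited token after >); the function name `filter_fasta` is made up here.

```python
def filter_fasta(fasta_lines, wanted_ids):
    """Keep the records whose ID (first token after '>') is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False  # plays the role of $flag in the Perl script
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            keep = record_id in wanted
        if keep:
            out.append(line)
    return out


if __name__ == "__main__":
    fasta = [">s1 first record", "ACGT", "TT", ">s2", "GGG", ">s3", "CCC"]
    for line in filter_fasta(fasta, ["s1", "s3"]):
        print(line)
```

Inverting the final test (`if not keep`) would give the complementary behavior, printing only the records that are absent from the ID list.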



RStudio
2015-07-29T05:19:30.000Z
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Publish to RPubs
Getting Started with RPubs
RStudio lets you harness the power of R Markdown to create documents that weave together your writing and the output of your R code. And now, with RPubs, you can publish those documents on the web with the click of a button!
Prerequisites
You’ll need R itself, RStudio (v0.96.230 or later), and the knitr package (v0.5 or later).
Instructions
In RStudio, create a new R Markdown document by choosing File | New | R Markdown.
Click the Knit HTML button in the doc toolbar to preview your document.
In the preview window, click the Publish button.

Results：
Convergence testing
library(lattice)
library(lme4)
library(blme)
library(reshape2)
library(ggplot2); theme_set(theme_bw())
library(gridExtra)  ## for grid.arrange
library(bbmle) ## for slice2D; requires *latest* r-forge version (r121)
source("allFit.R")

Load data:
dList <- load("data.RData")

Spaghetti plot: don’t see much pattern other than (1) general increasing trend; (2) quantized response values (table(dth$Estimate) or unique(dth$Estimate) also show this); (3) skewed residuals
sort(unique(dth$Estimate))

##  [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  4.0  5.0  6.0  7.0  7.5  8.0 10.0
## [15] 12.0 15.0 17.0 20.0 25.0 30.0 35.0 40.0 50.0 60.0 70.0 75.0 90.0

(p0 <- ggplot(dth,aes(Actual,Estimate))+geom_point()+
    geom_line(aes(group=factor(pid)))+
    stat_summary(fun.y=mean,geom="line",colour="red",lwd=2))





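Whole Genome Resequencing Analysis
2015-07-29T04:56:57.000Z
Introduction
High-throughput sequencing is used to identify de novo somatic and germline mutations, SNVs, structural variation including rearrangements (deletion, duplication and copy number variation), and SNP loci. The function of rearrangements and SNPs is then analyzed comprehensively. We analyze gene function (including miRNA), recombination rate, loss of heterozygosity (LOH), and the relationship between evolutionary selection and mutation, as well as how these relationships lead mutations in a disease (cancer) genome to produce the corresponding susceptibility mechanisms and functions. Disease and cancer genomes are explored in depth by combining genomics, comparative genomics and population genetics.

Experimental design and samples
(1) Case-control design;
(2) Family design: parent-offspring groups (groups of 4, groups of 3, or larger);
Primary data analysis
1. Data yield: total number of bases, total mapped reads, uniquely mapped reads, and sequencing depth analysis.
2. Consensus sequence assembly: the reads are aligned to the reference genome sequence, a Bayesian statistical model is used to call the most likely genotype at each base position, and a consensus sequence of the individual's genome is assembled.
3. SNP detection and genomic distribution: all polymorphic sites in the genome are extracted and then filtered further on quality score, sequencing depth, repetitiveness and similar factors to obtain a high-confidence SNP set. The detected variants are annotated using the reference genome annotation.
4. InDel detection and genomic distribution: during mapping, gapped alignment is performed and reliable short InDels are detected. Gap lengths of 1 to 5 bases are considered, and each InDel call requires support from at least 3 paired-end reads.
5. Structural variation detection and genomic distribution: the detectable types of structural variation are mainly insertion, deletion, duplication, inversion and translocation. Genome-wide structural variants are detected from the alignment of the sequenced individual against the reference genome, and the detected variants are annotated.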
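Advanced data analysis
1. Read mapping
(1) The pseudo-autosomal regions on the Y chromosome are masked, and reads are mapped to the NCBI36 reference (all chromosomes, unplaced contigs, and the mitochondrial sequence mtDNA, for which the revised Cambridge Reference Sequence is used instead). The raw sequence files are aligned to the genome with a standard mapping procedure as an initial alignment, and the distribution of mean mapping quality scores is reported;
(2) Base quality score recalibration. A base quality recalibration algorithm scores the quality of every base in every read and corrects significant errors, including errors that depend on the sequencing cycle and on dinucleotide context.
(3) Sequencing error rate estimation. Pseudo-autosomal contigs and short repeat regions (including segmental duplications and simple repeat sequences identified by a tandem repeat detection algorithm) are filtered out;
2. SNP calling
SNPs can be identified more accurately by integrating the results of several SNP detection algorithms. The SNPs called by each algorithm are compared, and those with high concordance are kept as the final SNP set; these highly concordant SNPs are also highly reliable. The SNP calling algorithms used include methods based on Bayesian inference and genotype likelihoods, as well as linkage disequilibrium (LD) or imputation techniques that improve the accuracy of SNP calls.
Genome-wide distribution of SNV allele frequencies
The proportion of rare alleles is reported for each category of SNV (a). The categories considered are: (1) nonsense, (2) chemically non-synonymous, (3) all non-synonymous, (4) conservative non-synonymous, (5) non-coding, and (6) synonymous. For conservation, the conservation status and distribution of SNVs in non-coding regions are also analyzed (figures a, b).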


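3. Short insertion/deletion (indel) calling
(1) Computing genome-wide indel variants and genotype calls
The computation has 3 main steps: (1) detection of candidate indels; (2) computation of genotype likelihoods through local realignment; (3) LD-based genotype inference and calling. No indel calls are produced on the X and Y chromosomes.
(2) Indel filtering
4. Fusion gene discovery
Gene annotations are taken from the latest versions of the Ensembl Gene, RefSeq and Vega Gene databases. The figure below illustrates how a fusion gene forms: exons from different chromosomes recombine into a fusion gene.

5. Structural variation
Structural variation (SV) is a major source of genomic variation. It consists mainly of copy number variation (CNV) of large segments (usually >1 kb) and unbalanced inversion events. About 20,000 SVs have been identified by genome studies so far (DGV database). In some regions SVs even arise at a higher rate than SNPs, and they are strongly associated with clinical disease phenotypes. Sequencing can identify both common SVs and novel ones. The origin of novel SVs has been reported in terms of both the germline and mutational mechanisms. Precise resolution of SVs, however, still requires better algorithms. The mechanism of SV formation also needs to be understood better, in particular whether an SV originated as an insertion or a deletion at the ancestral genomic locus, rather than being judged simply from allele frequency or from alignment against the reference genome. The function of SVs is studied together with population genetics and evolutionary biology, and the classes of SV formation mechanism are examined comprehensively.
Analysis of SV formation mechanisms covers the identification of the following main candidate mechanisms:
(A) non-allelic homologous recombination mediated by sequence homology (NAHR);
(B) non-homologous recombination associated with the repair of DNA double-strand breaks or of stalled replication forks (NHR);
(C) variable number tandem repeats formed by expansion and contraction (VNTR);
(D) transposable element insertion (mostly long/short interspersed elements, LINE/SINE, or a combination of the two accompanying TEI-related events).
Detection of structural variation and of amplicons is shown in the figure below.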


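6. Sequencing depth analysis
Sequencing depth analysis relates the coverage depth within genomic windows to the expected coverage depth in order to identify SVs. Different algorithms are also used to identify deleted segments (deletions) and repeated segments (duplications) in the raw sequencing data.
7. Integration of SV calls and FDR estimation (optional step)
(1) Validation of SVs by PCR or microarray
(2) Calculation of the FDR, the false discovery rate (specified by the customer together with the validation experiment)
(3) Selection of SV calls for merging and downstream analysis: SVs are detected with several methods in order to call as many SVs as possible while keeping the FDR low (<=10%). The filters below determine the SV set used in downstream analysis. The SVs from each detection algorithm must have an FDR below 10%, and the qualifying SVs are merged. SV calls from an algorithm whose FDR exceeds 10% can still be included in downstream SV analysis if PCR or microarray validation data are available. Finally, SVs from different algorithms are integrated according to the confidence interval of the overlap on either side of the breakpoint;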
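8. Variant property analysis
(1) Neutral coalescent analysis
Sequencing data can detect low-frequency variants (MAF<=5%). The distribution of low-frequency variants can be computed from the expectation given by population genetics theory (neutral coalescent theory). For each allele frequency class, the ratio of the number of variants per Mb to the expectation under the neutral coalescent, that is, the observed theta within each Mb genomic window, is used to characterize purifying selection and the growth rate of the population (a cancer cell line can be regarded as a distinguishable population). The theta values are examined separately for SNPs (blue line), indels (red line), genotyped large deletions (black line), and exonic SNPs (green line) across allele frequency intervals (see the figure below).

(2) Allele frequency and count distribution of novel variants
The analysis covers the fraction of newly predicted SNPs, indels, large deletions and exonic SNPs in each allele frequency class (see the figure below). Novel predictions are SNPs, indels and deletions that are confirmed as new after comparing the predictions with dbSNP (current version 129), the deletion database dbVar (June 2010 version), and genomic data from published indel studies. dbSNP contains SNPs and indels; dbVar contains deletions, duplications and mobile element insertions. dbRIP and other genomics studies (the J. Craig Venter and Watson genomes, and the YanHuang Asian genome project) provide short indels and large deletions.

(3) Size distribution and novelty distribution of variants
The size distributions of SNPs, deletions and insertions are computed, together with the number of novel predictions among the SNPs, deletions and insertions as a proportion of the entries already in the respective reference databases (relative to dbSNP; dbSNP contains SNPs and indels; dbVar contains deletions, duplications and mobile element insertions; dbRIP and other genomics studies, namely the J. Craig Venter and Watson genomes and the YanHuang Asian genome project, provide short indels and large deletions). The characteristic positions of LINE and Alu elements can be indicated.

(4) Breakpoint junction analysis of structural variants
A breakpoint junction database of all structural variants is built from the different SV call sets after a series of filtering steps, keeping SVs of length 50 bp or more. SVs with homology or microhomology at the breakpoint junction are analyzed, and different SVs with the same chromosome, start and end coordinates are made non-redundant.
Identification of SV breakpoints: breakpoints can be classified by their likely mode of formation into the following classes:
(a) non-allelic homologous recombination (NAHR);
(b) non-homologous recombination (NHR), including non-homologous end-joining (NHEJ) and fork stalling/template switching (FoSTeS/MMBIR);
(c) variable number tandem repeats (VNTR);
(d) transposable element insertion (TEI).
Figure C

SV formation bias analysis
The relationship between SV formation mechanisms and the sequence near the breakpoints is analyzed, including chromosomal landmarks (telomeres, centromeres), recombination hotspots, repeat sequences and GC content, short DNA motifs, and microhomology regions.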
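9. Mutation rate estimation
For sequencing designs based on family members, the main goal is to detect de novo mutations (DNM). Using different methods and algorithms, an inferred DNM report is produced for each family;
(1) Based on the genotype inference results, de novo mutations are assessed for each person at each base position;
(2) A Bayesian method computes the posterior probability of DNMs in the family design
10. Functional analysis and annotation of SNPs and SNVs
(1) Ancestral allele annotation
The genomes of 4 species, human (NCBI36), chimpanzee (chimpanzee2.1), orangutan (PPYG2) and rhesus macaque (MMUL1), are aligned to find conserved sequence regions and infer ancestral alleles, and to analyze the evolution of duplication/deletion events.
(2) Diversity and divergence across regions of gene structure
The degree of diversity along the gene structure, that is, heterozygosity, is computed from the genotyping results. Heterozygosity can indicate the presence of selection and the distribution pattern of local variation. The regions considered are 200 bp upstream of the 5'UTR, the 5'UTR, the first exon, the first intron, middle exons, middle introns, the last exon and intron, the 3'UTR and about 200 bp downstream of it (see figure a below). Diversity and evolutionary divergence are also analyzed near the start and end positions of coding transcripts (see figure b below).
(3) Detection of disease variants
The SVs obtained from the sequenced samples are compared with the HGMD disease variant data to obtain the missense and nonsense SNPs recorded in both. HGMD disease-associated mutations are matched against CUI (a database of disease concept identifiers) to obtain the disease phenotypes of all SVs in HGMD and of the SVs shared between HGMD and the sequencing data. The disease phenotypes enriched among the sample SVs are then computed with Fisher's test and Bonferroni correction for multiple testing.

(4) Functional annotation of genes within copy number variants (CNV)
CNVs are split into 2 classes according to whether or not they overlap segmental duplication (SD) regions, and functional enrichment of the genes in each class is computed. Significance is shown on the horizontal axis and the significant functions on the vertical axis.
(5) Functional analysis and annotation of variants
(a) Functional annotation of SNPs, indels and large structural variants;
(b) SNPs in the coding regions of transcripts with annotated translation start sites are classified as synonymous, non-synonymous, nonsense (introducing a stop codon), stop-codon-disrupting, or splice-site-disrupting. To reduce false positives, errors arising from indels are removed with strict filters;
(c) Functional analysis of missense coding mutations: computational algorithms assess the effect of somatic mutations, relative to germline variants, on protein structure and function.
(6) Association between SNVs/SNPs and miRNA
miRNAs are small molecules with important regulatory roles. The pri-miRNA, pre-miRNA and miRNA target gene sequences are analyzed to identify potentially functional SNP sites. The literature provides evidence that SNPs occur at different positions in the secondary structure of human pre-miRNAs; thermodynamic stability analysis is used to assess the effect of SNPs on pre-miRNA structure. miRNA-target interaction sites are also analyzed to assess the effect of SNPs on target specificity.
(7) Association between SNVs/SNPs and GWAS
The distribution of odds ratios (OR) of susceptibility genes from GWAS studies is analyzed across genomic coordinates. Known GWAS results are compared with the SNPs, and the relationship between SNPs and susceptibility genes is examined in depth using linkage disequilibrium (LD). Direct and indirect association methods can each identify phenotype-associated SNPs; for SNPs that are missing or hard to locate, disease susceptibility loci are inferred through LD.
(8) Biological pathway analysis (metabolic and signaling pathways)
Biological pathways, including metabolic and signal transduction pathways, are an important part of biological function. The genes affected by the various mutations and variants, including SNVs and SNPs, are placed in biological pathways for integrated analysis, to examine how strongly and in what pattern functional mutations affect each pathway. Methods such as GSEA (together with microarray expression data), the KS test and the hypergeometric test rank the enrichment of variant genes in pathways and identify pathways with potentially altered function.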

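(9) Protein-protein interaction (PPI) network analysis
Protein interactions are another important route to gain and loss of biomolecular function. The mutated proteins in the protein interaction network and the network nodes they affect are therefore analyzed systematically, and the affected network substructures are subjected to functional annotation and cluster enrichment analysis. Network analysis algorithms are used to measure functional enrichment in the subnetworks affected by the various mutations;
(10) Cis-regulatory module (CRM) analysis
(a) Promoter sequence analysis
    Motif prediction in promoter regions, compared against the TFBS binding sites in the known transcription factor databases TRANSFAC and JASPAR;
    Conservation analysis of promoter regions, relating mutation positions to conserved regions;
(b) Genome-wide conservation, to determine the conservation of TFBSs and of mutation positions;
(11) Genome-wide statistics of rearrangements and mutations
(a) Somatic and germline rearrangements
Somatic mutations, as opposed to germline mutations, are an important class to analyze. For case-control sequencing designs the mutations, including SNVs, indels and CNVs, can be analyzed separately. A mutation that appears only in the tumor/disease sample (case group) and not in the normal sample (control group) can be regarded as a somatic mutation. Comparing somatic mutations with dbSNP separates potentially novel mutations from recorded ones. The mutations are then mapped to genic and non-genic regions. Genic regions comprise introns, UTRs, splice sites and exons. Within exons, counts are kept separately for synonymous, deletion, frameshift, insertion, missense, nonsense and non-protein-coding exon types. The results are combined, and a table of counts is given for each rearrangement type, SNV and CNV, by mutation class (see the figure below). Each sequenced sample is annotated separately, covering both somatic and germline mutations.
(b) Genome-wide distribution of rearrangements

(a) Inter- and intra-chromosomal structural variation, (b) loss of heterozygosity (LOH) and allelic imbalance, (c) copy number variation (gain or loss), and high-confidence SNVs (counted in 1 Mb intervals) are each drawn on a separate layer of a circular plot together with the chromosome ideogram (see the legend a-d in the figure below).
(c) Trends and patterns of single-nucleotide mutations

The main forms of transversion and their respective shares are counted at the somatic and germline levels (a). If expression data are available, the number or types of mutations and rearrangements in expressed versus non-expressed genes can be analyzed (b). The numbers of somatic variants, germline variants and random variants upstream of transcription start sites are counted (c) and compared with the mutation spectra of 210 known tumor types.
11. Natural selection analysis
The somatic mutations observed by sequencing may have arisen through a complex process. The origin of these mutations, how they are affected by DNA repair mechanisms, and their patterns during disease development and evolution therefore need in-depth analysis. Natural selection generally acts in two ways: it retains mutations that favor the development and evolution of the disease, and it restricts mutations in functionally important regions of the genome, such as transcriptional regulatory regions and protein-coding regions. Therefore, (1) if the design compares primary disease with normal control, a systematic analysis can resolve the likely mechanisms and selective factors in the formation of mutations in complex disease. (2) If the design sequences samples from a lesion and its metastatic or adjacent sites, a model of mutation evolution and metastasis can be built to resolve the dynamics of mutations and the patterns of unstable variation in the genome.
Detection of positive selection: the trend toward positive selection in SNP and SNV regions is analyzed to explain the function of SNVs and SNPs at the evolutionary and population genetic levels. For control and case samples, different statistical algorithms compute the differences in SNPs and CNVs between the samples, from which SVs with signatures of positive selection are identified.

References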
◆ Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection
◆ Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean



Google Web
2015-07-28T14:51:06.000Z

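This list is maintained and updated from time to time: links that no longer work are removed and new working ones are added. Press Ctrl+D to bookmark it.
Original-interface mirrors:

[Recommended] https://g.namaho.com/

[Recommended] http://dir.scmor.com/google/

[Recommended] https://gg.wen.lu/

[Recommended] https://www.souguge.com/

[Recommended] https://www.guge119.com/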

http://g.bt.gg/

http://g.yon.hk/

https://4b4b.xyz/

http://hisbig.com/

http://s.jiasubao.co/

http://gg.cytbj.com/

https://allee.science/

https://g.gintoki.net/

https://www.ppx.pw/

http://www.guge.link/

http://www.fcczp.com/

http://www.hlfeel.com/

http://www.ggooo.net/

https://www.guge.click/

https://www.guge.date/

https://gg.searcher.top/

http://youshengbs.com/

http://google.kainidi.cn/

https://google.xface.me/

http://google.itqy8.com/

http://go.hibenben.com/

http://www.baidu.com.se/

https://g.nie.netease.com/

https://google-hk.wewell.net/

https://www.guge119.com/

https://g.yh.gs/

http://www.guge.link/

http://carbyne.net.cn/

http://gg.cellmean.com/

http://google.sidney-aldebaran.me/

https://g.yh.gs/

https://guge.in/

http://g.yon.hk/

https://gg.wen.lu/

https://gs.awei.me/

http://www.ppx.pw/

http://gg.cytbj.com/

https://g.eeload.com/

http://www.guge.link/

https://g.searcher.top/

https://www.ko50.com/

https://www.guge.click/

https://gg.searcher.top/

https://guge.droider.cn/

http://gg.cellmean.com/

https://google.xface.me/

http://www.google52.com/

https://www.guge119.com/

http://www.googlestable.cn/

http://google.sidney-aldebaran.me/

Non-original interfaces:

搜：http://sou.cloudapp.net/

南搜：http://nan.so/sites

快搜：http://www.kuaiso.hk/

谷壳：http://www.googke.me/

蝴蝶：http://www.xiexingwen.com/

搜下看：http://soxiakan.com/

谷粉网：http://www.gool.wang/

喜乐搜：http://www.xilesou.com/

谷歌婊：http://www.gugebiao.com/

谷歌363：http://www.g363.com/

给搜搜索：http://www.geiso.cn/

印象搜索：http://gl.impress.pw/

谷粉搜搜：http://www.gfsousou.cn/

翻墙谷歌：http://search.52393.com/

极客搜索：http://www.geekgle.com/

谷粉搜搜：http://gfsoso.hao136.com/

红杏谷歌：http://www.hxgoogle.com/

中国谷歌：http://www.googleforchina.com/

极速网谷歌：http://google.mjisu.com/

布谷鸟搜索：http://m.wcuckoo.com/search/

Ask：http://home.tb.ask.com/

AOL：http://m.search.aol.com/

5Ask：http://cn.5ask.com/

Avira：https://safesearch.avira.com/

WOW：http://www.wow.com/

Suche：http://suche.web.de/

OpenGG：http://www.opengg.cn/

Randomk：http://gl.randomk.org/

Im Google：http://www.imgoogle.cn/

Disconnect：https://search.disconnect.me/

WebWebWeb：http://webwebweb.com/

Smart redirect:

谷大爷：http://g.phvb.cn/

大中华：http://i.forbook.net/

简约谷歌：http://free123.cc/

谷粉恰搜：http://www.qiasou.com/

逆天谷歌：http://www.go2121.com/

Google Scholar:

https://www.roolin.com/

http://www.scholarnet.cn/

http://scholar.xilesou.com/

http://xueshu.cytbj.com/scholar/

http://gfss.cc.wallpai.com/scholar/

https://duliziyou.com/!scholar.google.com/schhp

Source
http://www.itechzero.com/google-mirror-sites-collect.html



Removing duplicate lines with awk without sorting
2015-07-28T14:42:17.000Z
Input file
1 2 3
1 2 3
1 2 4
1 2 3
1 2 5

Result file
1 2 3
1 2 4
1 2 5

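How it works
Running sort plus uniq would look fine on this sample file. The file I actually need to process, however, pairs user names and passwords line by line, and sort + uniq would reorder the lines so that the user names end up grouped together and the passwords grouped together, making it impossible to tell which belongs to which. The script that avoids this is very simple: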
awk '!x[$0]++' filename

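Note: x here is only the name of the array; a, b, c or d would work just as well.
In brief, awk's basic flow is to evaluate a given condition for every line of the file. If the condition holds, the associated command is executed; if it does not, the line is skipped.
The awk program here is !x[$0]++. It creates a map called x and uses the full text of the current line, $0, as the key to look up a value in the map. If no value is found, the whole expression is true and the following statement is executed; if a value is found, the expression is false and the line is skipped. Because the expression ends with ++, a key that has no value yet is first given the value 0 and then incremented to 1, so that the next time the same line appears its key maps to a non-zero value.
Note: the map here is what awk calls an associative array, indexed by strings rather than by position.
As we learned in awk one-liners, awk evaluates the expression first and executes the statement when the expression is true. The command above contains only an expression and no statement, so what is executed? When the statement is omitted, awk runs the default action, which is to print the whole current line. This very short awk command therefore removes duplicate lines while preserving the original order of the file.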
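
The logic of `!x[$0]++` can be sketched in Python with a set of lines already seen. This is only an illustration of the technique; the function name `dedupe_keep_order` is made up here.

```python
def dedupe_keep_order(lines):
    """Drop repeated lines while keeping the first occurrence of each, in order."""
    seen = set()  # plays the role of the awk array x
    out = []
    for line in lines:
        if line not in seen:  # true only the first time, like !x[$0]++
            seen.add(line)
            out.append(line)
    return out


if __name__ == "__main__":
    data = ["1 2 3", "1 2 3", "1 2 4", "1 2 3", "1 2 5"]
    for line in dedupe_keep_order(data):
        print(line)
```

A set is enough here because only membership matters; awk's counter additionally records how many times each line occurred.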
The example can also be adapted to compare the value of a single column and keep only the first line for each value.
awk '!a[$3]++' filename

This removes lines whose third column repeats that of an earlier line.

awk '!a[$NF]++' filename
This removes lines whose last column repeats that of an earlier line.
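
A column-keyed version can be sketched in Python as follows. The function name `dedupe_by_field` and its 0-based `index` parameter are assumptions of this sketch: index 2 corresponds to awk's $3 and index -1 to $NF. As in awk, a line that lacks the field is keyed on the empty string.

```python
def dedupe_by_field(lines, index):
    """Keep the first line for each distinct value of one whitespace-separated field."""
    seen = set()
    out = []
    for line in lines:
        fields = line.split()
        # A missing field behaves like awk's empty string
        key = fields[index] if -len(fields) <= index < len(fields) else ""
        if key not in seen:
            seen.add(key)
            out.append(line)
    return out


if __name__ == "__main__":
    data = ["a b 1", "c d 1", "e f 2"]
    print(dedupe_by_field(data, 2))   # third column, like awk '!a[$3]++'
    print(dedupe_by_field(data, -1))  # last column, like awk '!a[$NF]++'
```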

How can duplicate lines be removed while leaving blank lines untouched? Here are three ways to do it, all using only awk (nl is used to add line numbers so the output is easier to compare):
[root@361way ~]# cat a.txt |nl -b a   # original file
     1  1 2 3
     2  1 2 3
     3
     4
     5  1 2 4
     6  1 2 3
     7
     8
     9  1 2 5
[root@361way ~]# awk '!NF || !a[$0]++'  a.txt |nl -b a   # method 1
     1  1 2 3
     2
     3
     4  1 2 4
     5
     6
     7  1 2 5
[root@361way ~]# awk '!NF {print;next} !($0 in a) {a[$0];print}'  a.txt |nl -b a   # method 2
     1  1 2 3
     2
     3
     4  1 2 4
     5
     6
     7  1 2 5
[root@361way ~]# awk '!/./ || !a[$0]++' a.txt |nl -b a  # method 3
     1  1 2 3
     2
     3
     4  1 2 4
     5
     6
     7  1 2 5
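
The blank-line rule can be sketched in Python as well. This mirrors method 1 (`!NF || !a[$0]++`): a line with no fields is always kept, and any other line is kept only the first time it appears. The function name `dedupe_keep_blank` is illustrative.

```python
def dedupe_keep_blank(lines):
    """Remove repeated non-blank lines; pass every blank line through unchanged."""
    seen = set()
    out = []
    for line in lines:
        if not line.strip():  # no fields, like awk's !NF
            out.append(line)
            continue
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out


if __name__ == "__main__":
    data = ["1 2 3", "1 2 3", "", "", "1 2 4", "1 2 3", "", "", "1 2 5"]
    for number, line in enumerate(dedupe_keep_blank(data), start=1):
        print(number, line)
```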

Deduplicate on a given column (a classic)
awk '!a[$1]++'

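Explanation
<1>: "!" means logical NOT.
<2>: a[$1] uses $1 as the index into the array a, creating the element if needed.
<3>: a[$1]++ assigns to the array element, equivalent to a[$1]+=1. Because the increment is a post-increment, the value tested by "!" is the one before the increment.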



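Scheduled backups to Baidu cloud storage from Linux and Windows
2015-07-28T14:22:09.000Z
Scheduled backups from Linux to Baidu cloud storage with PHP
Installing bpcs_uploader
There are plenty of bpcs_uploader tutorials, but they all read the same, and even the detailed ones may leave out some details.
Enough preamble; let's get started.
Download the package: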
wget https://github.com/oott123/bpcs_uploader/zipball/master

Unzip it:
unzip master

The default folder name is long, so rename the folder to make later steps easier:
mv oott123-bpcs_uploader-3a33d09 baidu

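Here I renamed the folder to baidu. Note that the default folder name may differ in the future as the program is updated, so check what the extracted folder is actually called.
Enter the program directory: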
cd baidu

Set permissions:
chmod +x bpcs_uploader.php

Run the program:
./bpcs_uploader.php

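You may see an error message. The program needs PHP, and the PHP path on your server may differ from the one set in the program; in that case, change the PHP path in bpcs_uploader.php.
Find the PHP path: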
which php

Edit bpcs_uploader.php:
vi bpcs_uploader.php

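Change the path after #! on the first line to your PHP path. If you installed the WDCP one-click package, the path is /www/wdlinux/php/bin/php
Log in to the Baidu developer center: http://developer.baidu.com/
Create a Web application. The application name is up to you, for example huihuige; the other settings can stay at their defaults.
This gives you the application's API Key, which is the first thing ./bpcs_uploader.php asks for when it runs.
In the application management page, also find API management, enable the PCS API, and set a directory. This directory will hold the data uploaded from the server.
Tip: the directory set when enabling the PCS API cannot be changed, but you can delete the application from the "Actions" menu and create it again.
After the Key, the program asks for the app folder name, which is the directory name you set when enabling the PCS API.
It then asks for the access token. Put your Key in place of KEY in the address below and open the resulting address in a browser: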
https://openapi.baidu.com/oauth/2.0/authorize?response_type=token&client_id=KEY&redirect_uri=oob&scope=netdisk

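You will then see a page titled "百度 Oauth2.0". Copy the URL from the browser's address bar and find the string between access_token= and the following &; that string is the access token. Once you enter it, setup is complete, and the SSH terminal shows the capacity of your Baidu cloud storage.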
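
Picking the token out of the copied URL by hand is error-prone, so here is a small Python sketch that does it with the standard library. The redirect URL in the demo is a made-up placeholder on example.com, and the function name `extract_access_token` is hypothetical; only the `access_token=...&` layout comes from the description above.

```python
from urllib.parse import urlparse, parse_qs


def extract_access_token(redirect_url):
    """Return the access_token value from a URL's fragment or query, or None."""
    parsed = urlparse(redirect_url)
    # The token may appear after '#' (fragment) or after '?' (query string)
    for part in (parsed.fragment, parsed.query):
        values = parse_qs(part).get("access_token")
        if values:
            return values[0]
    return None


if __name__ == "__main__":
    # Hypothetical URL, shaped like the one copied from the address bar
    url = "https://example.com/callback#expires_in=2592000&access_token=abc.123&scope=netdisk"
    print(extract_access_token(url))
```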
If bpcs_uploader was installed before, run the following command to initialize it again:
./bpcs_uploader.php init

Using bpcs_uploader
Query the quota:
./bpcs_uploader.php quota

Upload a file:
./bpcs_uploader.php upload [path_local] [path_remote]

[path_local] is the file path on the server, and [path_remote] is the path in Baidu cloud storage.
Download a file:
./bpcs_uploader.php download [path_local] [path_remote]

Delete a file:
./bpcs_uploader.php delete [path_remote]

Offline download:
./bpcs_uploader.php fetch [path_remote] [path_to_fetch]

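Automatic backup script
The next step is to set up automatic backups. There are many automatic backup scripts online, so I will not repeat them here.
What I want to point out is that most of us run a control panel on our Linux servers, and control panels can back up data automatically. WDCP, for example, can be set to back up to the /www/backup directory. In that case no backup script is needed, only a script that uploads every file in the backup directory to Baidu cloud storage.
Download the script into the baidu directory: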
wget http://www.huihuige.com/wp-content/uploads/2013/10/baidubd.zip

Unzip it:
unzip baidubd.zip

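This script is intended for WDCP panel users. If your backup directory is different, open the script and change it.
Test whether the script works:
sh baidubd.sh

Finally, set up the scheduled task:
crontab -e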

Add this line:
0 0 * * * /root/baidu/baidubd.sh

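This backs up the data to Baidu cloud storage automatically at midnight every day.
Uploading to and downloading from Baidu cloud storage on the Linux command line with Python
bypy: a Linux client for Baidu cloud storage written in Python.
Download
git clone https://github.com/houtianze/bypy.git
Requirements
python >=2.7

Python needs the Requests library. If importing it fails as shown below, install Requests first.
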
python
>>> import requests
ImportError: No module named requests

Usage
Once the installation above is done, cd into the bypy directory and run the following commands to start initialization
cd bypy
./bypy.py list

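It first asks you to visit a URL and grant authorization. After authorizing, copy the code back to the program. If no error is reported, you will see your sync directory. In the cloud storage's "我的应用数据" (my application data) folder you will find a bypy folder, which is the application directory.
If you cannot wait to test it, try uploading the current directory straight to Baidu cloud storage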
./bypy.py upload

To see the upload progress, add the -v option
./bypy.py -v upload

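Common commands
./bypy.py list    list a directory
./bypy.py mkdir newdir    create a directory
./bypy.py upload      upload
./bypy.py downfile or ./bypy.py downdir    download
./bypy.py delete filename
./bypy.py rm dir
Check the arguments with help before use. In a remote path, / stands for the path apps/bypy/. A slash between command names in the help text means "or": for example, "delete/remove/rm" means the three commands delete, remove and rm.
Automatic backup to Baidu cloud
Write the backup shell script
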
#!/bin/sh
# File:    home/bin/bypy/baidu_sync.sh
# Author:  hope
# Version: 1.0
 
# Some vars
UPLOAD_SCRIPTS_DIR="/public/home/zpxu/scripts"
DATE=`date +%F`
DATE_YEAR=`date +%Y`
DATE_MONTH=`date +%m`
 
# Backup
cd $UPLOAD_SCRIPTS_DIR
cd ..
tar -czvf  scripts_$DATE.tar.gz $UPLOAD_SCRIPTS_DIR
/home/Python-2.7.10/./python /home/bin/bypy/./bypy.py mkdir scripts/$DATE_YEAR/$DATE_MONTH
/home/Python-2.7.10/./python /home/bin/bypy/./bypy.py -v upload scripts_$DATE.tar.gz scripts/$DATE_YEAR/$DATE_MONTH
rm scripts_$DATE.tar.gz
/home/Python-2.7.10/./python /home/bin/bypy/./bypy.py list

Schedule it with cron

$ crontab -e

This opens the default editor, vim. Add the following
# backup my scripts to baidu
40 1 * * * <directory of the backup script>/baidu_sync.sh

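In the entry above, each line is split by spaces into 6 parts, in order: minute, hour, day of month, month, day of week, and the program to run. The entry shown therefore runs the script at 01:40 every day.
Backups can consume a lot of resources and time, so they should be scheduled for the early hours, when traffic is light and system load is low. If a separate server is available for storing backups, the script can also be extended to send the backup file to the remote server by ftp or email.

Baidu limits upload and download speed, so moving large files is inconvenient. For uploads you can at least pack and compress the data first; for downloads I do not know of a good solution yet.

Because of Baidu's permission rules, backups to Baidu cloud need the authorization to be renewed roughly once a month; otherwise the following error is reported
OpenShift server failed, authorizing/refreshing with the Heroku server …
To renew the authorization:
Run bypy.py -c to delete the token file, then authorize again.
If that still does not work, go to Baidu application authorization (https://passport.baidu.com/accountbind), remove bypy there, and authorize again.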

Contribution from ：
http://www.lovelucy.info/auto-backup-website-shell-script.html
https://github.com/houtianze/bypy/issues/199



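Some Chrome tips
2015-07-25T13:34:53.000Z
Chrome (Download) is a modern web browser developed by Google. One of the three major browsers, it ships with V8, a high-performance JavaScript engine.

Thanks to its clean interface and ease of use, Chrome's market share has risen to 45%, overtaking Internet Explorer and Mozilla Firefox to become the most widely used browser in the world.

A modern browser is like a smartphone with apps: every app you install adds a feature to your phone. The same is true of Chrome, and Google provides the Chrome Web Store where extensions can be downloaded to customize the browser.
Below are some commonly used extensions.
ABP ad filtering / Adblock Plus
Used by more than 50 million people, this free ad blocker for Chrome blocks annoying ads, malware and tracking.
Enjoy the web without irritating ads.

Better History
Replaces the browser's built-in history page and gives a better view of your history, with the best search experience, the clearest interface and the most helpful filters.

Evernote Web Clipper (印象笔记 剪藏)
Use the Evernote extension to save web content to your Evernote account with one click.
Evernote Web Clipper helps you save web pages, take screenshots, add annotations and file things away as quickly as possible, so that you can find what you need anywhere, at any time.

Web page screenshot: annotate & mark up
Capture a whole page or any part of it; add rectangles, circles, arrows, lines and text; blur sensitive information; upload and share the annotated capture with one click. Supports PNG and links.

NetEase Cloud Music (网易云音乐)
Listen to NetEase Cloud Music without opening the web page, simpler and easier to use.
The NetEase Cloud Music Chrome extension aims for a simple experience: listen to NetEase Cloud Music, playlists, well-known DJs and your favorites without opening the web page or a client. It syncs with your other devices.

### SimpleUndoClose
A simple way to restore tabs you have closed

### Pocket
Easily save articles, videos and more for later. With Pocket, all your content is gathered in one place so that you can view it on any device at any time, even without a network connection.

### No PDF reader? No problem: drag the PDF file into a Chrome window to view it:

Important tabs can be pinned to prevent closing them by accident (pin tab)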
 






CummeRbund
2015-07-25T09:31:20.000Z
CummeRbund was designed to simplify the analysis and exploration of RNA-Seq data derived from the output of a differential expression analysis with cuffdiff, with the goal of providing fast and intuitive access to your results.
Command:
$ cd cuffdiff
$ R
> library(cummeRbund)
> cuff<-readCufflinks()
> cuff

 


Transcriptome analysis: Cufflinks (very simple)
2015-07-25T09:31:08.000Z
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.
Cufflinks:
cufflinks [options]* <aligned_reads.(sam/bam)>

for example:
$ cufflinks -p 8 -G transcript.gtf --library-type fr-unstranded -o cufflinks_output tophat_out/accepted_hits.bam
  -o/--output-dir              write all output files to this directory              [ default:     ./ ]
  -p/--num-threads             number of threads used during analysis                [ default:      1 ]
  --seed                       value of random number generator seed                 [ default:      0 ]
  -G/--GTF                     quantitate against reference transcript annotations                      
  -g/--GTF-guide               use reference transcript annotation to guide assembly

Cuffmerge:
Use to merge together several Cufflinks assemblies.
cuffmerge [options]* <assembly_GTF_list.txt>

Cuffmerge input files
cuffmerge takes several assembly GTF files from Cufflinks as input. Input GTF files must be specified in a “manifest” file listing full paths to the files.
Cuffmerge arguments
<assembly_GTF_list.txt>
Text file “manifest” with a list (one per line) of GTF files that you’d like to merge together into a single GTF file.
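The manifest is just a text file with one full GTF path per line, so it is easy to generate. Below is a minimal Python sketch; the sample names, the `_clout` directory suffix, and the manifest name `assemblies.txt` are hypothetical, and the only assumption about Cufflinks is that each run leaves a `transcripts.gtf` in its output directory.

```python
import tempfile
from pathlib import Path

def write_manifest(gtf_paths, manifest_path):
    """Write one absolute GTF path per line, the layout cuffmerge expects."""
    lines = [str(Path(p).resolve()) for p in gtf_paths]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines

# Hypothetical per-sample Cufflinks output directories inside a scratch folder.
workdir = Path(tempfile.mkdtemp())
gtfs = [workdir / f"{name}_clout" / "transcripts.gtf"
        for name in ("C1_R1", "C1_R2", "C2_R1")]
manifest = workdir / "assemblies.txt"
written = write_manifest(gtfs, manifest)
print(manifest.read_text(), end="")
```

The resulting file is what is passed to cuffmerge as its single positional argument.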
Cuffdiff
Use to find significant changes in transcript expression, splicing, and promoter use.
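cuffdiff [options]* <transcripts.gtf> \
<sample1_replicate1.sam[,…,sample1_replicateM.sam]> \
<sample2_replicate1.sam[,…,sample2_replicateM.sam]> … \
[sampleN.sam_replicate1.sam[,…,sample2_replicateM.sam]]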

Cuffdiff output Files
gene_exp.diff    Gene-level differential expression. Tests differences in the summed FPKM of transcripts sharing each gene_id

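When running cuffdiff, replicate samples of the same condition are separated by commas, and different conditions are separated by spaces.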
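The comma-versus-space rule is easy to get wrong when the command is typed by hand. The sketch below assembles the argument list in Python; the file names `merged.gtf`, `C1_R1.bam` and so on, and the output directory `diff_out`, are hypothetical, and only the `-o` output-directory option is used.

```python
import shlex

def build_cuffdiff_command(gtf, conditions, output_dir="diff_out"):
    """conditions: list of lists; each inner list holds the replicate files of one condition."""
    cmd = ["cuffdiff", "-o", output_dir, gtf]
    for replicates in conditions:
        # Replicates of one condition are joined with commas into a single argument.
        cmd.append(",".join(replicates))
    # Separate list items become separate (space-separated) arguments.
    return cmd

cmd = build_cuffdiff_command(
    "merged.gtf",
    [["C1_R1.bam", "C1_R2.bam"], ["C2_R1.bam", "C2_R2.bam"]],
)
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd)` directly, which avoids shell quoting problems.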



TopHat
2015-07-25T09:30:52.000Z
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. 
Usage:
tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

for example:
tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
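-p number of threads
-G transcript annotation file (GTF)
-o output directory
--segment-length 25 (minimum length of the segments that reads are split into for alignment)
--segment-mismatches 1 (number of mismatches allowed in each segment alignment)
--library-type fr-unstranded (library type, i.e. whether the library is strand-specific)
--transcriptome-index (Bowtie index files for the transcriptome)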

Alignment output files:
accepted_hits.bam (the alignments)
junctions.bed
insertions.bed and deletions.bed




Linux-sort
2015-07-25T06:21:10.000Z
sort is a Linux program that prints the lines of its input text files, concatenated, in sorted order. By default it treats blank space as the field separator and the entire line as the sort key. Note that sort does not modify the files themselves; it only prints the sorted output, unless you redirect that output to a file.
This article takes a closer look at the Linux ‘sort’ command through 14 practical examples that show how to use it.
First, we create a text file (tecmint.txt) on which to run the ‘sort’ examples. Our working directory is ‘/home/$USER/Desktop/tecmint’.
The ‘-e’ option in the command below enables interpretation of backslash escapes, and \n tells echo to write each string on a new line.
$ echo -e "computer\nmouse\nLAPTOP\ndata\nRedHat\nlaptop\ndebian\nlaptop" > tecmint.txt

Before we start with ‘sort’, let's look at the contents of the file.
$ cat tecmint.txt

Now sort the contents of the file using the following command.
$ sort tecmint.txt

Note: The above command does not actually sort the contents of the text file; it only shows the sorted output on the terminal.
Sort the contents of the file ‘tecmint.txt’, write the result to a file called sorted.txt, and verify the content using the cat command.
$ sort tecmint.txt > sorted.txt
$ cat sorted.txt

Now sort the contents of the text file ‘tecmint.txt’ in reverse order using the ‘-r’ switch and redirect the output to a file ‘reversesorted.txt’. Then check the contents of the newly created file.
$ sort -r tecmint.txt > reversesorted.txt
$ cat reversesorted.txt

We are going to create a new file (lsl.txt) at the same location for more detailed examples, and populate it with the output of ‘ls -l’ for your home directory.
$ ls -l /home/$USER > /home/$USER/Desktop/tecmint/lsl.txt
$ cat lsl.txt

Now we will see examples that sort the contents on a field other than the default leading characters.
Sort the contents of the file ‘lsl.txt’ on the 2nd column (the number of hard links).
$ sort -nk2 lsl.txt

Note: The ‘-n’ option in the above example sorts the contents numerically. Use ‘-n’ whenever you want to sort a file on a column that contains numerical values.
Sort the contents of the file ‘lsl.txt’ on the 9th column (the names of the files and folders, which are non-numeric).
$ sort -k9 lsl.txt

It is not always necessary to run sort on a file. We can pipe the output of another command straight into it; here the listing is sorted numerically on the 5th column (file size).
$ ls -l /home/$USER | sort -nk5

Sort the text file tecmint.txt and remove duplicates. Check whether the duplicate lines have been removed.
$ cat tecmint.txt
$ sort -u tecmint.txt

Rules so far (what we have observed):
1. Lines starting with numbers come first in the list, unless otherwise specified (-r).
2. Lines starting with lowercase letters come before the same lines starting with uppercase letters, unless otherwise specified (-r). This depends on the locale: with LC_ALL=C, uppercase letters sort before lowercase ones.
3. Contents are listed in dictionary order, unless otherwise specified (-r).
4. By default, sort treats each line as a string and orders the lines in dictionary order (numbers first; see rule 1), unless otherwise specified.
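How uppercase and lowercase lines are ordered depends on the collation in use, which is why the same file can sort differently on two machines. The Python sketch below contrasts plain byte order with a rough, hand-written stand-in for dictionary-style collation on the words from tecmint.txt. The `dictionary_key` function is an illustrative approximation, not what sort or any real locale does internally.

```python
words = ["computer", "mouse", "LAPTOP", "data", "RedHat", "laptop", "debian", "laptop"]

# Byte order (comparable to running sort with LC_ALL=C): uppercase before lowercase.
byte_order = sorted(words)

def dictionary_key(word):
    # Compare case-insensitively first, then put the lowercase spelling
    # ahead of the uppercase one among otherwise equal words.
    return (word.lower(), word.swapcase())

dictionary_like = sorted(words, key=dictionary_key)

# Rough equivalent of `sort -u`: drop duplicate lines, then sort.
unique = sorted(set(words), key=dictionary_key)

print(byte_order)
print(dictionary_like)
print(unique)
```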

Create a third file ‘lsla.txt’ at the current location and populate it with the output of the ‘ls -lA’ command.
$ ls -lA /home/$USER > /home/$USER/Desktop/tecmint/lsla.txt
$ cat lsla.txt

Those who understand the ‘ls’ command know that ‘ls -lA’ = ‘ls -l’ + hidden files, so most of the contents of these two files will be the same.
Sort the contents of the two files to standard output in one go.
$ sort lsl.txt lsla.txt

Notice that files and folders are repeated in the output.
Now let's see how to sort, merge, and remove duplicates from these two files.
$ sort -u lsl.txt lsla.txt

Notice that duplicates have been omitted from the output. You can also write the output to a new file by redirecting it.
We may also sort the contents of a file, or the output of a command, on more than one column. Sort the output of ‘ls -l’ on fields 2 and 5 (numeric) and then on field 9 (non-numeric). Each key gets its own -k option; ‘-k2,5’ would instead define a single key spanning fields 2 through 5, and ‘-t ","’ is not needed because ‘ls -l’ output is separated by blanks, not commas.
$ ls -l /home/$USER | sort -k2,2n -k5,5n -k9
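Sorting on several columns amounts to comparing a tuple of keys, one per column, in priority order. The Python sketch below does this for fields 2 and 5 (numeric) and field 9 (text); the `ls -l` lines are made up for illustration, and file names containing spaces would need more careful splitting than `split()`.

```python
# Hypothetical `ls -l` lines: perms, links, owner, group, size, month, day, time, name
listing = [
    "drwxr-xr-x 2 user user 4096 Jul 25 06:00 Music",
    "-rw-r--r-- 1 user user  120 Jul 25 06:01 notes.txt",
    "drwxr-xr-x 2 user user 4096 Jul 25 06:02 Desktop",
    "-rw-r--r-- 1 user user   80 Jul 25 06:03 a.txt",
]

def sort_key(line):
    fields = line.split()  # blank-separated fields, like sort's default
    # Field 2 and field 5 compared as numbers, field 9 as text.
    return (int(fields[1]), int(fields[4]), fields[8])

ordered = sorted(listing, key=sort_key)
names = [line.split()[8] for line in ordered]
print(names)
```

Ties on the link count are broken by size, and ties on both are broken by name, which is why Desktop comes before Music.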

That’s all for now.
More info: sort



FASTX-Toolkit
2015-07-24T14:51:33.000Z
Here you’ll find a short description and examples of how to use the FASTX-toolkit from the command line.
Command Line Arguments
Most tools show usage information with -h.
Tools can read from STDIN and write to STDOUT, or from a specific input file (-i) and to a specific output file (-o).
Tools can operate silently (producing no output if everything was OK), or print a short summary (-v).
If output goes to STDOUT, the summary will be printed to STDERR.
If output goes to a file, the summary will be printed to STDOUT.
Some tools can compress the output with GZIP (-z).
FASTQ-to-FASTA
$ fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-r]         = Rename sequence identifiers to numbers.
   [-n]         = keep sequences with unknown (N) nucleotides.
		  Default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
		  If [-o] is specified,  report will be printed to STDOUT.
		  If [-o] is not specified (and output goes to STDOUT),
		  report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA output file. default is STDOUT.

FASTX Statistics
$ fastx_quality_stats -h
usage: fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]

version 0.0.6 (C) 2008 by Assaf Gordon (gordon@cshl.edu)
   [-h] = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
                  If FASTA file is given, only nucleotides
          distribution is calculated (there's no quality info).
   [-o OUTFILE] = TEXT output file. default is STDOUT.

The output TEXT file will have the following fields (one row per column):
    column    = column number (1 to 36 for a 36-cycles read solexa file)
    count   = number of bases found in this column.
    min     = Lowest quality score value found in this column.
    max     = Highest quality score value found in this column.
    sum     = Sum of quality score values for this column.
    mean    = Mean quality score value for this column.
    Q1    = 1st quartile quality score.
    med    = Median quality score.
    Q3    = 3rd quartile quality score.
    IQR    = Inter-Quartile range (Q3-Q1).
    lW    = 'Left-Whisker' value (for boxplotting).
    rW    = 'Right-Whisker' value (for boxplotting).
    A_Count    = Count of 'A' nucleotides found in this column.
    C_Count    = Count of 'C' nucleotides found in this column.
    G_Count    = Count of 'G' nucleotides found in this column.
    T_Count    = Count of 'T' nucleotides found in this column.
    N_Count = Count of 'N' nucleotides found in this column.
    max-count = max. number of bases (in all cycles)
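To make the per-column fields concrete, here is a small Python sketch that computes a subset of them (count, min, max, sum, mean, med, and the base counts) for three toy reads. The reads and quality values are invented, and the quartile and whisker fields are left out because their exact definition in the tool is not given here.

```python
from collections import Counter
from statistics import mean, median

# Toy reads: (sequence, per-base quality scores). Values are made up.
reads = [
    ("ACGT", [30, 32, 20, 10]),
    ("ACGA", [28, 34, 22, 12]),
    ("TCGN", [26, 30, 24, 2]),
]

def column_stats(reads):
    """One dict per read position (column), summarizing qualities and bases."""
    rows = []
    n_cols = max(len(seq) for seq, _ in reads)
    for col in range(n_cols):
        quals = [q[col] for seq, q in reads if col < len(seq)]
        bases = Counter(seq[col] for seq, _ in reads if col < len(seq))
        rows.append({
            "column": col + 1, "count": len(quals),
            "min": min(quals), "max": max(quals),
            "sum": sum(quals), "mean": mean(quals), "med": median(quals),
            **{f"{b}_Count": bases.get(b, 0) for b in "ACGTN"},
        })
    return rows

stats = column_stats(reads)
for row in stats:
    print(row)
```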
FASTQ Quality Chart
$ fastq_quality_boxplot_graph.sh -h
Solexa-Quality BoxPlot plotter
Generates a solexa quality score box-plot graph 

Usage: /usr/local/bin/fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "solexa_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title (usually the solexa file name) - will be plotted on the graph.
FASTA/Q Nucleotide Distribution
$ fastx_nucleotide_distribution_graph.sh -h
FASTA/Q Nucleotide Distribution Plotter

Usage: /usr/local/bin/fastx_nucleotide_distribution_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]

  [-p]           - Generate PostScript (.PS) file. Default is PNG image.
  [-i INPUT.TXT] - Input file. Should be the output of "fastx_quality_statistics" program.
  [-o OUTPUT]    - Output file name. default is STDOUT.
  [-t TITLE]     - Title - will be plotted on the graph.
FASTA/Q Clipper
$ fastx_clipper -h
usage: fastx_clipper [-h] [-a ADAPTER] [-D] [-l N] [-n] [-d N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-a ADAPTER] = ADAPTER string. default is CCTTAAGG (dummy adapter).
   [-l N]       = discard sequences shorter than N nucleotides. default is 5.
   [-d N]       = Keep the adapter and N bases after it.
          (using '-d 0' is the same as not using '-d' at all. which is the default).
   [-c]         = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
   [-C]         = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
   [-k]         = Report Adapter-Only sequences.
   [-n]         = keep sequences with unknown (N) nucleotides. default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-D]        = DEBUG output.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
FASTA/Q Renamer
$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon (gordon@cshl.edu)

   [-n TYPE]    = rename type:
          SEQ - use the nucleotides sequence as the name.
          COUNT - use simply counter as the name.
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
FASTA/Q Trimmer
$ fastx_trimmer -h
usage: fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-f N]       = First base to keep. Default is 1 (=first base).
   [-l N]       = Last base to keep. Default is entire read.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
FASTA/Q Collapser
$ fastx_collapser -h
usage: fastx_collapser [-h] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-v]         = verbose: print short summary of input/output counts
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
FASTQ/A Artifacts Filter
$ fastx_artifacts_filter -h
usage: fastq_artifacts_filter [-h] [-v] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-z]         = Compress output with GZIP.
   [-v]         = Verbose - report number of processed reads.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
FASTQ Quality Filter
$ fastq_quality_filter -h
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
          If [-o] is specified,  report will be printed to STDOUT.
          If [-o] is not specified (and output goes to STDOUT),
          report will be printed to STDERR.
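The interplay of -q and -p can be sketched in a few lines of Python: a read is kept when at least p percent of its bases have a quality of q or higher. This is an illustrative reading of the two options above, not the tool's source code, and treating an empty read as failing is an assumption.

```python
def passes_quality_filter(quals, q, p):
    """True if at least p percent of the bases have quality >= q."""
    if not quals:
        return False  # assumption: an empty read never passes
    good = sum(1 for value in quals if value >= q)
    return 100.0 * good / len(quals) >= p

print(passes_quality_filter([30, 30, 30, 10], q=20, p=75))  # 3 of 4 bases are good
print(passes_quality_filter([30, 30, 10, 10], q=20, p=75))  # only 2 of 4 bases are good
```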
FASTQ/A Reverse Complement
$ fastx_reverse_complement -h
usage: fastx_reverse_complement [-h] [-r] [-z] [-v] [-i INFILE] [-o OUTFILE]

version 0.0.6
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
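For reference, reverse-complementing a sequence means complementing every base and reversing the result; for FASTQ records the quality string has to be reversed as well so that each score stays with its base. A minimal Python sketch follows (function names are illustrative, and only A, C, G, T and N are handled):

```python
# Translation table for the complement of each base; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def reverse_complement_fastq(seq, qual):
    """Reverse the quality string too, so scores stay aligned with bases."""
    return reverse_complement(seq), qual[::-1]

print(reverse_complement("AAAGGGTTTCCC"))
print(reverse_complement_fastq("AGCTN", "IIHG#"))
```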
FASTA Formatter
$ fasta_formatter -h
usage: fasta_formatter [-h] [-i INFILE] [-o OUTFILE] [-w N] [-t] [-e]
Part of FASTX Toolkit 0.0.7 by gordon@cshl.edu

   [-h]         = This helpful help screen.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-w N]       = max. sequence line width for output FASTA file.
          When ZERO (the default), sequence lines will NOT be wrapped -
          all nucleotides of each sequences will appear on a single 
          line (good for scripting).
   [-t]         = Output tabulated format (instead of FASTA format).
          Sequence-Identifiers will be on first column,
          Nucleotides will appear on second column (as single line).
   [-e]         = Output empty sequences (default is to discard them).
          Empty sequences are ones who have only a sequence identifier,
          but not actual nucleotides.

Input Example:
   >MY-ID
   AAAAAGGGGG
   CCCCCTTTTT
   AGCTN

Output example with unlimited line width [-w 0]:
   >MY-ID
   AAAAAGGGGGCCCCCTTTTTAGCTN

Output example with max. line width=7 [-w 7]:
   >MY-ID
   AAAAAGG
   GGGTTTT
   TCCCCCA
   GCTN

Output example with tabular output [-t]:
   MY-ID    AAAAAGGGGGCCCCCTTTTAGCTN

example of empty sequence:
(will be discarded unless [-e] is used)
  >REGULAR-SEQUENCE-1
  AAAGGGTTTCCC
  >EMPTY-SEQUENCE
  >REGULAR-SEQUENCE-2
  AAGTAGTAGTAGTAGT
  GTATTTTATAT
FASTA Nucleotides Changer
$ fasta_nucleotide_changer -h
usage: fasta_nucleotide_changer [-h] [-z] [-v] [-i INFILE] [-o OUTFILE] [-r] [-d]

version 0.0.7
   [-h]         = This helpful help screen.
   [-z]         = Compress output with GZIP.
   [-v]         = Verbose mode. Prints a short summary.
          with [-o], summary is printed to STDOUT.
          Otherwise, summary is printed to STDERR.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-r]         = DNA-to-RNA mode - change T's into U's.
   [-d]         = RNA-to-DNA mode - change U's into T's.
FASTA Clipping Histogram
$ fasta_clipping_histogram.pl
Create a Linker Clipping Information Histogram

usage: fasta_clipping_histogram.pl INPUT_FILE.FA OUTPUT_FILE.PNG

    INPUT_FILE.FA   = input file (in FASTA format, can be GZIPped)
    OUTPUT_FILE.PNG = histogram image
FASTX Barcode Splitter
$ fastx_barcode_splitter.pl
Barcode Splitter, by Assaf Gordon (gordon@cshl.edu), 11sep2008

This program reads FASTA/FASTQ file and splits it into several smaller files,
Based on barcode matching.
FASTA/FASTQ data is read from STDIN (format is auto-detected.)
Output files will be writen to disk.
Summary will be printed to STDOUT.

usage: /usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] 
     [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]

Arguments:

--bcfile FILE    - Barcodes file name. (see explanation below.)
--prefix PREFIX    - File prefix. will be added to the output files. Can be used
          to specify output directories.
--suffix SUFFIX    - File suffix (optional). Can be used to specify file
          extensions.
--bol        - Try to match barcodes at the BEGINNING of sequences.
          (What biologists would call the 5' end, and programmers
          would call index 0.)
--eol        - Try to match barcodes at the END of sequences.
          (What biologists would call the 3' end, and programmers
          would call the end of the string.)
          NOTE: one of --bol, --eol must be specified, but not both.
--mismatches N    - Max. number of mismatches allowed. default is 1.
--exact        - Same as '--mismatches 0'. If both --exact and --mismatches 
          are specified, '--exact' takes precedence.
--partial N    - Allow partial overlap of barcodes. (see explanation below.)
          (Default is not partial matching)
--quiet        - Don't print counts and summary at the end of the run.
          (Default is to print.)
--debug        - Print lots of useless debug information to STDERR.
--help        - This helpful help screen.

Example (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is 

the barcodes file):

   $ cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 \
    --prefix /tmp/bla_ --suffix ".txt"

Barcode file format
-------------------
Barcode files are simple text files. Each line should contain an identifier 
(descriptive name for the barcode), and the barcode itself (A/C/G/T), 
separated by a TAB character. Example:

    #This line is a comment (starts with a 'number' sign)
    BC1 GATCT
    BC2 ATCGT
    BC3 GTGAT
    BC4 TGTCT

For each barcode, a new FASTQ file will be created (with the barcode's 
identifier as part of the file name). Sequences matching the barcode 
will be stored in the appropriate file.

Running the above example (assuming "mybarcodes.txt" contains the above 
barcodes), will create the following files:
    /tmp/bla_BC1.txt
    /tmp/bla_BC2.txt
    /tmp/bla_BC3.txt
    /tmp/bla_BC4.txt
    /tmp/bla_unmatched.txt
The 'unmatched' file will contain all sequences that didn't match any barcode.

Barcode matching
----------------

** Without partial matching:

Count mismatches between the FASTA/Q sequences and the barcodes.
The barcode which matched with the lowest mismatches count (providing the
count is small or equal to '--mismatches N') 'gets' the sequences.

Example (using the above barcodes):
Input Sequence:
    GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG

Matching with '--bol --mismatches 1':
   GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
   GATCT (1 mismatch, BC1)
   ATCGT (4 mismatches, BC2)
   GTGAT (3 mismatches, BC3)
   TGTCT (3 mismatches, BC4)

This sequence will be classified as 'BC1' (it has the lowest mismatch count).
If '--exact' or '--mismatches 0' were specified, this sequence would be 
classified as 'unmatched' (because, although BC1 had the lowest mismatch count,
it is above the maximum allowed mismatches).

Matching with '--eol' (end of line) does the same, but from the other side
of the sequence.

** With partial matching (very similar to indels):

Same as above, with the following addition: barcodes are also checked for
partial overlap (number of allowed non-overlapping bases is '--partial N').

Example:
Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
(Same as above, but note the missing 'G' at the beginning.)

Matching (without partial overlapping) against BC1 yields 4 mismatches:
   ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
   GATCT (4 mismatches)

Partial overlapping would also try the following match:
   -ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
   GATCT (1 mismatch)

Note: scoring counts a missing base as a mismatch, so the final
mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
If running with '--mismatches 2' (meaning allowing upto 2 mismatches) - this 
seqeunce will be classified as BC1.
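The matching rule without partial overlap is simple enough to reproduce. The Python sketch below counts mismatches against the start of the read (the --bol case) and picks the barcode with the fewest mismatches, using the barcodes and input sequence from the help text above. Partial matching is not implemented, and how the real tool breaks ties between equally good barcodes is not covered here.

```python
def count_mismatches(barcode, sequence):
    """Mismatches between a barcode and the start of a sequence (the --bol case)."""
    prefix = sequence[:len(barcode)]
    mismatches = sum(1 for b, s in zip(barcode, prefix) if b != s)
    # Bases missing from a too-short sequence count as mismatches.
    return mismatches + (len(barcode) - len(prefix))

def classify(sequence, barcodes, max_mismatches=1):
    """Return the best barcode identifier, or 'unmatched' if it exceeds the limit."""
    best_id, best_count = min(
        ((name, count_mismatches(bc, sequence)) for name, bc in barcodes.items()),
        key=lambda item: item[1],
    )
    return best_id if best_count <= max_mismatches else "unmatched"

barcodes = {"BC1": "GATCT", "BC2": "ATCGT", "BC3": "GTGAT", "BC4": "TGTCT"}
read = "GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG"
counts = {name: count_mismatches(bc, read) for name, bc in barcodes.items()}
print(counts)
print(classify(read, barcodes, max_mismatches=1))
print(classify(read, barcodes, max_mismatches=0))
```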
Example: FASTQ Information
Generating Quality Information on BC54.fq:
$ fastx_quality_stats -i BC54.fq -o bc54_stats.txt
$ fastq_quality_boxplot_graph.sh -i bc54_stats.txt -o bc54_quality.png -t "My Library"
$ fastx_nucleotide_distribution_graph.sh -i bc54_stats.txt -o bc54_nuc.png -t "My Library"

 
Example: FASTQ/A Manipulation
Common pre-processing work-flow:
Converting FASTQ to FASTA
Clipping the Adapter/Linker
Trimming to 27nt (if you’re analyzing miRNAs, for example)
Collapsing the sequences
Plotting the clipping results
Using the FASTX-toolkit from the command line:
$ fastq_to_fasta -v -n -i BC54.fq -o BC54.fa
Input: 100000 reads.
Output: 100000 reads.
$ fastx_clipper -v -i BC54.fa -a CTGTAGGCACCATCAATTCGTA -o BC54.clipped.fa
Clipping Adapter: CTGTAGGCACCATCAATTCGTA
Min. Length: 15
Input: 100000 reads.
Output: 92533 reads.
discarded 468 too-short reads.
discarded 6939 adapter-only reads.
discarded 60 N reads.
$ fastx_trimmer -v -f 1 -l 27 -i BC54.clipped.fa -o BC54.trimmed.fa
Trimming: base 1 to 27
Input: 92533 reads.
Output: 92533 reads.
$ fastx_collapser -v -i BC54.trimmed.fa -o BC54.collapsed.fa
Collapsd 92533 reads into 36431 unique sequences.
$ fasta_clipping_histogram.pl BC54.collapsed.fa bc54_clipping.png

Mapping (or any other kind of analysis) of the Clipped + Collapsed FASTA file will be:
quicker - each unique sequence appears only once in the FASTA file.
more accurate - the Adapter/Linker sequence was removed from the 3’ end, and will not affect the mapping results.
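The clip, trim and collapse steps can be imitated on a handful of toy reads to see why the collapsed file is smaller. The Python sketch below is a simplification under stated assumptions: the adapter is found by exact string search only (the real clipper is more tolerant), the insert sequences are invented, and the minimum length of 15 is taken from the clipper output shown above.

```python
from collections import Counter

def clip(seq, adapter):
    """Cut the read at the first exact adapter occurrence (simplification)."""
    pos = seq.find(adapter)
    return seq if pos == -1 else seq[:pos]

def trim(seq, first=1, last=27):
    """Keep bases first..last, 1-based and inclusive, like -f and -l."""
    return seq[first - 1:last]

def collapse(seqs):
    """Merge identical sequences; returns (sequence, count), most abundant first."""
    return Counter(seqs).most_common()

adapter = "CTGTAGGCACCATCAATTCGTA"
insert = "TGAGGTAGTAGGTTGTATAGTT"  # made-up 22-nt insert
reads = [
    insert + adapter,
    insert + adapter[:10],  # truncated adapter: the exact search misses it
    insert + adapter,
    "ACGT" + adapter,       # too short once the adapter is removed
]
min_length = 15
clipped = [c for c in (clip(r, adapter) for r in reads) if len(c) >= min_length]
collapsed = collapse(trim(c) for c in clipped)
print(collapsed)
```

Four input reads end up as two unique sequences, one of them with a count of 2.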

Number of samples	Unsigned and signed hybrid networks	Signed networks
Less than 20	9	18
20-30	8	16
30-40	7	14
more than 40	6	12