hope

A new day is coming, whether we like it or not. The question is: will you control it, or will it control you?



Analyzing Data

Posted on 2016-04-01 | In R

Summary Statistics

mean(), max(), min(), range(), sd()

If any NA values are present, these functions return NA; set na.rm=TRUE to ignore them.
Removing outliers in mean(): mean(x, trim = 0.1) first drops the largest 10% and smallest 10% of the values in x, then averages the rest.
range(): returns both the minimum and the maximum.
sd(): standard deviation.
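What trim does can be sketched in Python (a hypothetical re-implementation for illustration; R's mean() is the authoritative version):

```python
def trimmed_mean(values, trim=0.1):
    """Drop the lowest and highest `trim` fraction of values, then average the rest."""
    xs = sorted(values)
    k = int(len(xs) * trim)  # number of values to drop from each end
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # 1000 is an outlier
result = trimmed_mean(data, 0.1)  # drops 1 and 1000, averages 2..9
```

With the outlier trimmed, the result (5.5) is close to the median, whereas the plain mean would be pulled above 100.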

quantile(), fivenum(), IQR()

quantile(dow30$Open, probs=c(0,0.25,0.5,0.75,1.0))
Returns the values at the requested percentiles; probs specifies which.
fivenum(): returns the minimum, 25th percentile, median, 75th percentile, and maximum.
IQR(): returns the difference between the 75th and 25th percentiles.
These functions work on single vectors and can also be applied to data frames via apply or tapply.

summary()

For numeric variables it reports the five quantile points plus the mean; for categorical variables it reports frequency counts.

Statistical Tests

Tests based on the normal distribution; many natural populations are approximately normally distributed.

Comparing means

Specifically, suppose that you have a set of observations
x1, x2, …, xn with sample mean x̄ and want to know whether that sample
mean differs from the null-hypothesis mean μ0. Furthermore, assume that the
observations are normally distributed. To test the validity of the hypothesis, you can
use a t-test; in R, the function is t.test.

Comparing paired data(means)

For example, you might have two observations per subject: one before an experiment and one after the experiment.
In this case, you would use a paired t-test. You can use the t.test function, specifying
paired=TRUE, to perform this test.

Comparing variances of two populations

To compare the variances of two samples from normal populations, R includes the
var.test function which performs an F-test;

Comparing means across more than two groups

One-way ANOVA and its implementation in R: http://tiramisutes.github.io/2015/10/08/ANOVA.html

Correlation tests

If you’d like to check whether there is a statistically significant
correlation between two vectors, you can use the cor.test function;

Basic SQLite operations

Posted on 2016-03-22 | In Bioinformatics

Database-wide SQLite commands

dot-commands

$ sqlite3
sqlite> .help
.backup ?DB? FILE Backup DB (default "main") to FILE
.bail ON|OFF Stop after hitting an error. Default OFF
.databases List names and files of attached databases
.dump ?TABLE? ... Dump the database in an SQL text format
If TABLE specified, only dump tables matching
LIKE pattern TABLE.
.echo ON|OFF Turn command echo on or off
.exit Exit this program
.explain ON|OFF Turn output mode suitable for EXPLAIN on or off.
.genfkey ?OPTIONS? Options are:
--no-drop: Do not drop old fkey triggers.
--ignore-errors: Ignore tables with fkey errors
--exec: Execute generated SQL immediately
See file tool/genfkey.README in the source
distribution for further information.
.header(s) ON|OFF Turn display of headers on or off
.help Show this message
.import FILE TABLE Import data from FILE into TABLE
.indices ?TABLE? Show names of all indices
If TABLE specified, only show indices for tables
matching LIKE pattern TABLE.
.load FILE ?ENTRY? Load an extension library
.mode MODE ?TABLE? Set output mode where MODE is one of:
csv Comma-separated values
column Left-aligned columns. (See .width)
html HTML <table> code
insert SQL insert statements for TABLE
line One value per line
list Values delimited by .separator string
tabs Tab-separated values
tcl TCL list elements
.nullvalue STRING Print STRING in place of NULL values
.output FILENAME Send output to FILENAME
.output stdout Send output to the screen
.prompt MAIN CONTINUE Replace the standard prompts
.quit Exit this program
.read FILENAME Execute SQL in FILENAME
.restore ?DB? FILE Restore content of DB (default "main") from FILE
.schema ?TABLE? Show the CREATE statements
If TABLE specified, only show tables matching
LIKE pattern TABLE.
.separator STRING Change separator used by output mode and .import
.show Show the current values for various settings
.tables ?TABLE? List names of tables
If TABLE specified, only list tables matching
LIKE pattern TABLE.
.timeout MS Try opening locked tables for MS milliseconds
.width NUM NUM ... Set column widths for "column" mode
.timer ON|OFF Turn the CPU timer measurement on or off

SQLite data types

  • text
  • integer
  • real
  • NULL, used for missing data, or no value
  • BLOB, which stands for binary large object, and stores any type of object as bytes

  • Note: SQLite does not enforce that every value in a column has the same type. Each column of each table has a preferred type (type affinity), but for downstream analysis it is best to keep a single data type per column.
    When a column holds mixed types, the sort order is: NULL values, integer and real values (sorted numerically), text values, and finally blob values.

    Querying the contents of a database

    The all-purpose SELECT command

    Syntax:

    SELECT <columns> FROM <tablename>;

    In its basic form, SELECT fetches all rows of all columns from a table (set columns to *).
    To pick specific columns, separate the column names with commas (SELECT trait, chrom, position, strongest_risk_snp, pvalue FROM gwascat LIMIT 5;)
    Besides interactive queries inside sqlite3, SELECT statements can also be run directly from the command line

    #interactive
    sqlite> SELECT * FROM gwascat;
    #from the command line
    sqlite3 gwascat.db "SELECT * FROM gwascat" > results.txt

    SQLite's default output is ragged; the following settings make it aligned and easier to read:

    sqlite> SELECT trait, chrom, position, strongest_risk_snp, pvalue
    ...> FROM gwascat LIMIT 5;
    trait|chrom|position|strongest_risk_snp|pvalue
    Asthma and hay fever|6|32658824|rs9273373|4.0e-14
    Asthma and hay fever|4|38798089|rs4833095|5.0e-12
    Asthma and hay fever|5|111131801|rs1438673|3.0e-11
    Asthma and hay fever|2|102350089|rs10197862|4.0e-11
    Asthma and hay fever|17|39966427|rs7212938|4.0e-10
    sqlite> .header on
    sqlite> .mode column
    sqlite> SELECT trait, chrom, position, strongest_risk_snp, pvalue
    ...> FROM gwascat LIMIT 5;
    trait chrom position strongest_risk_snp pvalue
    -------------------- ---------- ---------- ------------------ ----------
    Asthma and hay fever 6 32658824 rs9273373 4.0e-14
    Asthma and hay fever 4 38798089 rs4833095 5.0e-12
    Asthma and hay fever 5 111131801 rs1438673 3.0e-11
    Asthma and hay fever 2 102350089 rs10197862 4.0e-11
    Asthma and hay fever 17 39966427 rs7212938 4.0e-10

    Optional SELECT clauses:
    LIMIT caps the number of rows returned

    sqlite> SELECT * FROM gwascat LIMIT 2;

    ORDER BY sorts the output

    SELECT author, trait, journal FROM <tablename> ORDER BY author DESC LIMIT 5;
    #sort by author in descending order; sorting helps reveal outliers.
    #if the sort column contains NULL values, they can be excluded with IS NOT NULL
    SELECT chrom, position, trait, strongest_risk_snp, pvalue FROM <tablename> WHERE pvalue IS NOT NULL ORDER BY pvalue LIMIT 5;

    WHERE filters rows

    SELECT chrom, position, trait, strongest_risk_snp, pvalue FROM <tablename> WHERE lower(strongest_risk_snp) = "rs429358";
    #string comparison with = is case-sensitive in SQLite, so it is safest to normalize with lower() before matching.

    Filtering on multiple conditions:

    sqlite> SELECT chrom, position, strongest_risk_snp, pvalue FROM gwascat
    ...> WHERE chrom IN ("1", "2", "3") AND pvalue < 10e-11
    ...> ORDER BY pvalue LIMIT 5;
    #or
    sqlite> SELECT chrom, position, strongest_risk_snp, pvalue
    ...> FROM gwascat WHERE chrom = "22"
    ...> AND position BETWEEN 24000000 AND 25000000
    ...> AND pvalue IS NOT NULL ORDER BY pvalue LIMIT 5;

    AS renames and transforms columns on the fly:

    sqlite> SELECT lower(trait) AS trait,
    ...> "chr" || chrom || ":" || position AS region FROM gwascat LIMIT 5;
    #|| is the concatenation operator, used to join two strings
    #replacing NULL: the ifnull() function
    sqlite> SELECT ifnull(chrom, "NA") AS chrom, ifnull(position, "NA") AS position,
    ...> strongest_risk_snp, ifnull(pvalue, "NA") AS pvalue FROM gwascat
    ...> WHERE strongest_risk_snp = "rs429358";

    More SQLite built-in functions
    Function: Description
    ifnull(x, val): if x is NULL, return val, otherwise return x; shorthand for coalesce() with two arguments
    min(a, b, c, …): return the minimum of a, b, c, …
    max(a, b, c, …): return the maximum of a, b, c, …
    abs(x): absolute value
    coalesce(a, b, c, …): return the first non-NULL value in a, b, c, …, or NULL if all values are NULL
    length(x): number of characters in x
    lower(x): return x in lowercase
    upper(x): return x in uppercase
    replace(x, str, repl): return x with all occurrences of str replaced with repl
    round(x, digits): round x to digits decimal places (default 0)
    trim(x, chars), ltrim(x, chars), rtrim(x, chars): trim chars (spaces if chars is not specified) from both sides, the left side, and the right side of x, respectively
    substr(x, start, length): extract a substring of x starting at character start, length characters long

    Aggregate functions

    The count() function:
    count(*) returns the total number of rows, NULLs included: sqlite> SELECT count(*) FROM gwascat;
    If colname is a specific column, count(colname) returns the number of rows where that column is not NULL.
    Similar functions: avg(x), max(x), min(x), sum(x), total(x)
    Counting the number of distinct (unique) values in a column:

    sqlite> SELECT count(DISTINCT <column>) AS unique_rs FROM gwascat;
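The difference between count(*), count(column), and count(DISTINCT column) is easy to verify with Python's sqlite3 module and a toy table (made-up data, not the gwascat table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
c = conn.cursor()
c.execute("CREATE TABLE hits(snp text, pvalue real);")
c.executemany("INSERT INTO hits VALUES (?, ?);",
              [("rs1", 1e-8), ("rs1", 2e-9), ("rs2", None)])

total = c.execute("SELECT count(*) FROM hits;").fetchone()[0]              # NULLs included
non_null = c.execute("SELECT count(pvalue) FROM hits;").fetchone()[0]      # NULL row skipped
unique = c.execute("SELECT count(DISTINCT snp) FROM hits;").fetchone()[0]  # distinct SNP IDs
conn.close()
```

Here total is 3, non_null is 2 (the NULL pvalue is not counted), and unique is 2 (rs1 appears twice).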

    Grouping rows (GROUP BY)

    sqlite> SELECT chrom, count(*) FROM gwascat GROUP BY chrom;
    chrom count(*)
    ---------- ----------
    70
    1 1458
    10 930
    11 988
    12 858
    13 432
    [...]
    #rename the count column, sort descending, count per group
    sqlite> SELECT chrom, count(*) as nhits FROM gwascat GROUP BY chrom
    ...> ORDER BY nhits DESC;
    chrom nhits
    ---------- ----------
    6 1658
    1 1458
    2 1432
    3 1033
    11 988
    10 930
    [...]
    #group by multiple columns, separated by commas
    sqlite> select strongest_risk_snp, strongest_risk_allele, count(*) AS count
    ...> FROM gwascat GROUP BY strongest_risk_snp, strongest_risk_allele
    ...> ORDER BY count DESC LIMIT 10;
    strongest_risk_snp strongest_risk_allele count
    ------------------ --------------------- ----------
    rs1260326 T 22
    rs2186369 G 22
    rs1800562 A 20
    rs909674 C 20
    rs11710456 G 19

    Building a database yourself

    We'll use basic SQL syntax to create tables and insert records into them, then load data into SQLite using Python's sqlite3 module.

    Creating tables

    Basic syntax:

    CREATE TABLE tablename(
    id integer primary key,
    column1 column1_type,
    column2 column2_type,
    ...
    );

    Note that the first column of every table here is id integer primary key; the primary key is a unique integer that identifies each record in the table.
    Creating a table:

    $ sqlite3 practice.db
    sqlite> CREATE TABLE variants(
    ...> id integer primary key,
    ...> chrom text,
    ...> start integer,
    ...> end integer,
    ...> strand text,
    ...> name text);

    Writing data into a table

    Basic syntax:

    INSERT INTO tablename(column1, column2)
    VALUES (value1, value2);
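A minimal round trip with this syntax, using Python's sqlite3 module and a made-up two-column table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE variants(
    id integer primary key,
    chrom text,
    pos integer);""")

# INSERT INTO tablename(columns) VALUES (values);
# the id column is omitted and fills itself in automatically
c.execute("INSERT INTO variants(chrom, pos) VALUES ('chr1', 100);")
c.execute("INSERT INTO variants(chrom, pos) VALUES ('chr2', 200);")
conn.commit()

rows = c.execute("SELECT id, chrom, pos FROM variants ORDER BY id;").fetchall()
conn.close()
```

The fetched rows show the primary key auto-incrementing: (1, 'chr1', 100) and (2, 'chr2', 200).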

    Creating indexes

    Basic syntax:

    sqlite> CREATE INDEX <columns-name_idx> ON <tablename>(<columns-name>);
    #list indexes
    sqlite> .indices
    columns-name_idx
    #drop an index
    sqlite> DROP INDEX columns-name_idx;

    Altering/dropping tables

    Drop: DROP TABLE
    Alter: ALTER TABLE

    Working with SQLite interactively from Python

    Connecting to a SQLite database and creating a table

    create_table.py

    import sqlite3

    # the filename of this SQLite database
    db_filename = "variants.db"

    # initialize database connection
    conn = sqlite3.connect(db_filename)  # connect() opens the database

    c = conn.cursor()  # a cursor() is how Python talks to the SQLite database

    table_def = """\
    CREATE TABLE variants(
    id integer primary key,
    chrom text,
    start integer,
    end integer,
    strand text,
    rsid text);
    """

    c.execute(table_def)  # execute the SQL statement
    conn.commit()  # commit the changes to the SQLite database
    conn.close()  # close the connection to the database

    Loading data into a table

    load_variants.py

    ## usage: python load_variants.py data.txt
    import sys
    import sqlite3
    from collections import OrderedDict

    # the filename of this SQLite database
    db_filename = "variants.db"

    # initialize database connection
    conn = sqlite3.connect(db_filename)
    c = conn.cursor()

    ## Load Data
    # columns (other than id, which is automatically incremented)
    tbl_cols = OrderedDict([("chrom", str), ("start", int),
                            ("end", int), ("strand", str),
                            ("rsid", str)])

    with open(sys.argv[1]) as input_file:
        for line in input_file:
            # split a tab-delimited line
            values = line.strip().split("\t")

            # pair each value with its column name
            cols_values = zip(tbl_cols.keys(), values)
            # use the column name to look up a function that coerces each
            # value to the appropriate type
            coerced_values = [tbl_cols[col](value) for col, value in cols_values]

            # create a list of "?" placeholders, one per column
            placeholders = ["?"] * len(tbl_cols)

            # build the query by joining column names and placeholders
            # into comma-separated strings
            colnames = ", ".join(tbl_cols.keys())
            placeholders = ", ".join(placeholders)
            query = "INSERT INTO variants(%s) VALUES (%s);" % (colnames, placeholders)

            # execute query
            c.execute(query, coerced_values)

    conn.commit()  # commit these inserts
    conn.close()

    Deleting files correctly in Linux: find + rm

    Posted on 2016-03-21 | In Linux

    Suppose the current directory contains many fastq files we want to delete. Usually we would do this:

    Direct deletion

    rm *-temp.fastq

    But one careless slip, typing rm * -temp.fastq (an extra space between * and -temp.fastq), and the results are disastrous…
    Combining this with the everyday find command makes everything much simpler and safer:

    find . -name "*-temp.fastq" -exec rm -i {} \;
    #or
    find . -name "*-temp.fastq" | xargs rm

    -exec passes each match found by find as an argument to the command given after -exec (substituted for {})
    -i prompts interactively before each deletion
    To delete directories, use -delete in place of -exec rm {}
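The same look-before-you-delete idea can be sketched in Python with pathlib (an illustrative alternative to find, with a hypothetical file pattern):

```python
from pathlib import Path

def delete_temp_fastq(directory, pattern="*-temp.fastq", dry_run=True):
    """List the files matching `pattern` under `directory`;
    actually delete them only when dry_run is False."""
    matches = sorted(Path(directory).rglob(pattern))
    for path in matches:
        if dry_run:
            print("would remove", path)  # review before committing
        else:
            path.unlink()
    return matches
```

Run it once with the default dry_run=True to see exactly what would be removed, then again with dry_run=False.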

    Print the rm commands first and review them

    find . -name "*-temp.fastq" | xargs -n 1 echo "rm -i" > delete-temp.sh
    cat delete-temp.sh
    rm -i ./zmaysA_R1-temp.fastq
    rm -i ./zmaysA_R2-temp.fastq
    rm -i ./zmaysC_R1-temp.fastq
    rm -i ./zmaysC_R2-temp.fastq
    bash delete-temp.sh
    #or
    find . -name "*.fastq" | xargs -n 1 -P 4 bash script.sh

    -n 1 tells xargs to take the items found by find one at a time
    -P runs that many jobs in parallel

    Move files into a temporary folder (tmp) instead of deleting them

    myrm(){ D=/tmp/$(date +%Y%m%d%H%M%S); mkdir -p $D; mv "$@" $D && echo "moved to $D ok"; }

    Overlooked Samtools options

    Posted on 2016-03-20 | In Bioinformatics

    Samtools is a suite of tools for manipulating SAM and BAM sequence-alignment files.

    The SAM format

    A SAM file has two parts, a header section and an alignment section, both tab-delimited.
    Header section: lines begin with '@' and give overall information about the alignment: the SAM format version, the reference sequences, the software used, and so on.
    Alignment section: the alignment results, one alignment per line, with 11 mandatory columns and optional columns.

@HD VN:1.0 SO:unsorted
First header line: VN is the format version; SO is the sort order, one of
unknown (the default), unsorted, queryname, and coordinate. samtools does not
automatically update the SO value after sorting a BAM file; picard does.
@SQ SN:A.auricula_all_contig_1 LN:9401
Reference sequence names; these determine the sort order of the alignments.
SN is the reference name; LN its length.
@RG ID:sample01
Read Group. One sample's sequencing data forms one Read Group; that sample may
include sequencing results from multiple libraries.
@PG ID:bowtie2 PN:bowtie2 VN:2.0.0-beta7
The software used for the alignment.

    The 11 mandatory columns and the optional column
    1 QNAME Query (read) name, i.e. the read ID from the single- or paired-end fa/fq input; '*' indicates the information is unavailable.
    2 FLAG Bitwise FLAG (encodes the alignment type: pairing, strand, mate strand, etc.)
    3 RNAME Reference sequence name; an unmapped segment without coordinate has a '*' in this field.
    4 POS 1-based leftmost mapping position; POS is set to 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR.
    5 MAPQ Mapping quality; 255 indicates the mapping quality is unavailable.
    6 CIGAR Extended CIGAR string (operators: MIDNSHP) describing the alignment: number of matching bases, splicing, etc.; '*' if unavailable.
    7 RNEXT Reference sequence name of the mate; '*' when the information is unavailable, '=' if RNEXT is identical to RNAME.
    8 PNEXT 1-based leftmost position of the mate; 0 when the information is unavailable.
    9 TLEN Template (insert) length; set to 0 for a single-segment template or when the information is unavailable.
    10 SEQ Segment sequence on the same strand as the reference (reverse-complemented if the alignment is on the minus strand); '*' when the sequence is not stored.
    11 QUAL Base quality (ASCII-33 = Phred base quality); '*' when quality is not stored.
    12 Optional fields, in TAG:TYPE:VALUE format, providing additional information.

    Notes on the alignment section

    The alignment section of a sam/bam file holds the alignment results; the main fields are explained below:

  • FLAG

  • 0x800 marks the alignment as part of a chimeric (supplementary) alignment;
    0x4 means the read is unmapped.

  • CIGAR

  • For mRNA-to-genome alignments, N represents an intron.
    More: http://samtools.github.io/hts-specs/SAMv1.pdf

    Some special cases in SAM files

    Counting unmapped reads

    Each alignment is one line of the SAM file, but not all lines are successful alignments. Unmapped reads are marked in the SAM file by a FLAG of 4 and a '*' in the RNAME column;

    #count the alignment lines containing a *
    cut -f3 smallRNA-seq.sam | grep -c \*
    #total number of alignment lines
    grep -c -v "^@" smallRNA-seq.sam

    How many different read IDs are in the file?

    The query (read) ID is in field 1. Some reads may have multiple alignments, so the number of lines is not necessarily the number of reads.

    grep -v "@" HR-1B.fq.gz.sam | cut -f1 | sort | uniq | wc -l 
    #counting the reads in the original single/paired-end fa/fq file gives a different number, showing that some reads indeed failed to map.
    zcat pepper/RNA-seq/sgs-clean-reads/HR-1B.fq.gz | wc -l

    How many different read sequences are in the file?

    The read IDs in column 1 only distinguish the reads produced by sequencing; because of PCR duplication or library bias, different reads may have identical sequences, so counting the unique values of column 10 more accurately reflects the number of distinct read sequences.

    cut -f10 smallRNAseq.sam | sort | uniq | wc -l
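These field-based counts are easy to mirror in Python; a sketch over a few made-up alignment lines (column 1 is the read ID, column 10 the sequence):

```python
# minimal made-up SAM alignment lines (header lines starting with '@' excluded)
sam_lines = [
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t0\tchr1\t200\t60\t5M\t*\t0\t0\tACGTA\tIIIII",   # same sequence as read1 (PCR duplicate)
    "read2\t256\tchr2\t300\t0\t5M\t*\t0\t0\tACGTA\tIIIII",  # secondary alignment of read2
    "read3\t0\tchr1\t400\t60\t5M\t*\t0\t0\tTTTTT\tIIIII",
]

read_ids = {line.split("\t")[0] for line in sam_lines}   # unique read IDs (column 1)
sequences = {line.split("\t")[9] for line in sam_lines}  # unique sequences (column 10)
```

Here there are 4 alignment lines but only 3 read IDs (read2 has two alignments) and only 2 distinct sequences (read1 and read2 carry the same sequence), which is exactly why the three counts differ.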

    How many reads are uniquely mapped?

    cut -f10 smallRNAseq.sam | sort | uniq -u | wc -l

    For BWA output, grep -c XT:A:U smallRNA-seq.sam also gives an exact count.

    How many reads are multi-hits?

    cut -f1 smallRNA-seq.sam | sort | uniq -d | wc -l

    For BWA output, grep -c XT:A:R smallRNAseq.sam also gives an exact count.

    How many alignments are reported for each read?

    grep -v "^@" smallRNA-seq.sam | cut -f1 | sort | uniq -c | sort -nr > sortedreadcount.txt
    grep -v "^@" smallRNA-seq.sam | cut -f1 |sort | uniq -c | sort -nr | cut -c1-8 | sort | uniq -c

    How many different reference sequences are represented in the file?

    grep -v "^@" smallRNA-seq.sam | cut -f3 | sort | uniq | wc -l

    view

    -c count the alignments
    -f output only alignments with all of the given FLAG bits set
    -q output only alignments with mapping quality at or above the given value
    -F 4: count the mapped reads;
    -f 4: count the unmapped reads;
    To get the unmapped reads from a bam file use :
    samtools view -f 4 file.bam > unmapped.sam, the output will be in sam
    to get the output in bam use : samtools view -b -f 4 file.bam > unmapped.bam
    To get only the mapped reads use the parameter ‘F’, which works like -v of grep and skips the alignments for a specific flag.
    samtools view -b -F 4 file.bam > mapped.bam
    samtools view -b -F 4 -f 8 file.bam > onlyThisEndMapped.bam
    samtools view -b -F 8 -f 4 file.bam > onlyThatEndMapped.bam
    samtools view -b -F12 file.bam > bothEndsMapped.bam
    samtools merge merged.bam onlyThisEndMapped.bam onlyThatEndMapped.bam bothEndsMapped.bam
    For tophat output:
    samtools view -b -f 2 accepted_hits.bam > mappedPairs.bam
    Better with:
    samtools view -b -f 0x2 accepted_hits.bam > mappedPairs.bam

    sort

    -m memory to use, with K, M, G suffixes supported
    -@ number of threads for parallel processing

    index

    The BAM file must first be sorted in the default (coordinate) order; otherwise indexing reports an error.

    Indexing produces a .bai file, used for fast random access. The .bai file is required in many situations, especially when displaying alignments: samtools tview needs it, and so does gbrowse2 when drawing read-alignment graphics.

    faidx

    Builds an index for a FASTA file, producing an index file with the .fai suffix. The command can also use the index to quickly extract any (sub)sequence from the FASTA file.
    See more: bedtools 使用小结

    flagstat

    Summarizes the alignment results in a BAM file

    $ samtools flagstat example.bam
    11945742 + 0 in total (QC-passed reads + QC-failed reads)
    #total number of reads
    0 + 0 duplicates
    7536364 + 0 mapped (63.09%:-nan%)
    #overall read mapping rate
    11945742 + 0 paired in sequencing
    #how many reads are paired
    5972871 + 0 read1
    #number of first-in-pair reads
    5972871 + 0 read2
    #number of second-in-pair reads
    6412042 + 0 properly paired (53.68%:-nan%)
    #properly paired reads: both mates align to the same reference sequence, with the distance between them within the configured limits
    6899708 + 0 with itself and mate mapped
    #paired reads in which both mates are mapped to the reference
    636656 + 0 singletons (5.33%:-nan%)
    #reads whose mate is unmapped; adding this to the previous number gives the total mapped reads.
    469868 + 0 with mate mapped to a different chr
    #paired reads whose two mates map to two different reference sequences
    243047 + 0 with mate mapped to a different chr (mapQ>=5)

    mpileup

    samtools has another very important command, mpileup (formerly pileup). It generates a BCF file, which bcftools then uses for SNP and indel calling. bcftools ships with samtools and can be found in the samtools installation directory.
    -f supply an indexed FASTA reference;
    -g output in BCF format.

    depth

    Computes the sequencing depth at every position or region;

    samtools depth sorted.bam > sorted.bam.txt

    The output has three tab-separated columns: chromosome name, position, and coverage depth.


    Sources
    http://www.plob.org/2014/01/26/7112.html
    http://blog.sina.com.cn/s/blog_670445240101l30k.html
    https://www.biostars.org/p/56246/
    https://www.biostars.org/p/110039/
    https://www.biostars.org/p/95929/
    https://www.biostars.org/p/110157/
    https://groups.google.com/forum/#!forum/bedtools-discuss
    https://code.google.com/p/hydra-sv/

    bedtools 使用小结 (bedtools usage notes)

    Posted on 2016-03-18 | In Bioinformatics

    Overview

    BEDTools is a toolkit for comparing, manipulating, and annotating genomic features. Genomic features are usually stored in Browser Extensible Data (BED) or General Feature Format (GFF) files and compared visually with the UCSC Genome Browser.

    Basic concepts for working with BEDTools

    Existing genome-feature information is generally stored in BED or GFF format.
    Genome features: functional elements (genes), genetic polymorphisms (SNPs, INDELs, or structural variants), annotations obtained by sequencing or other methods, or custom feature information.
    Overlapping/intersecting features: two genome features whose regions share at least one bp.

    One difference between BED and GFF files

    In BED, start coordinates are 0-based and end coordinates are at least 1; in GFF, start coordinates are 1-based and end coordinates are at least 1.
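So converting between the two formats only shifts the start coordinate; a small sketch:

```python
def bed_to_gff_coords(bed_start, bed_end):
    """BED is 0-based, half-open; GFF is 1-based, inclusive. The end stays the same."""
    return bed_start + 1, bed_end

def interval_length(start, end, fmt):
    """Length of an interval in either convention."""
    return end - start if fmt == "bed" else end - start + 1

# the first 100 bases of a chromosome:
# BED:  0  100      GFF:  1  100
gff = bed_to_gff_coords(0, 100)
```

Both conventions describe the same 100 bases; mixing them up silently shifts every feature by one base, a classic off-by-one bug in genomics pipelines.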

    相关格式

    BED format

    BEDTools mainly uses the first three columns of a BED file; BED can have up to 12 columns. The commonly used columns are:
    chrom: chromosome, e.g. chr1, III, myChrom, contig1112.23; required
    start: 0-based feature start position; required
    end: feature end position, at least 1; required
    score: any quantifiable value, such as a p-value
    strand: strand information

    GFF format

    seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
    source - name of the program that generated this feature, or the data source (database or project name)
    feature - feature type name, e.g. Gene, Variation, Similarity
    start - Start position of the feature, with sequence numbering starting at 1.
    end - End position of the feature, with sequence numbering starting at 1.
    score - A floating point value.
    strand - defined as + (forward) or - (reverse).
    frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
    attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
    See more from http://www.ensembl.org/info/website/upload/gff.html

    genome files

    Some BEDTools tools (genomeCoverageBed, complementBed, slopBed) need the species' chromosome sizes. A genome file generally has two tab-separated columns per line: the chromosome name and its size. Genome files for common species live in the /genome directory of the BEDTools installation.
    For generating a genome file for a custom genome, see my other post: 批量求fasta格式序列长度 (computing fasta sequence lengths in batch).

    Summary of BEDTools usage

    intersect/intersectBed: computing overlaps

    bedtools intersect -a A.bed -b B.bed -wa -wb

    Computes the overlaps between two BED or BAM files; the overlap can be required to span the whole genome feature or only part of it.
    By default the shared interval between A and B is reported (illustrated with a figure in the original post).

    The -wa option reports the original feature from file A.

    The -wb option reports the original feature from file B, and -c reports the number of overlapping features between the two files.
    bedtools intersect uses a lot of memory on large files; the efficient approach is to sort both A and B by chromosome name and position (sort -k1,1 -k2,2n) and then intersect again with the -sorted option.

    bedtools intersect -a A-sorted.bed -b B-sorted.bed -sorted

    Other options:
    -wo reports the number of overlapping bases

    $bedtools  intersect -a A.bed -b B.bed -wo
    chr1 0 15 a chr1 0 4 x 4
    chr1 0 15 a chr1 9 15 z 6
    chr1 25 29 b chr1 18 28 y 3
    chr1 18 18 c chr1 18 28 y 1
    chr1 10 14 d chr1 9 15 z 4
    chr1 20 23 e chr1 18 28 y 3
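The overlap counts in the last column follow from simple arithmetic on half-open BED intervals; a sketch (note bedtools gives zero-length features such as chr1 18 18 special treatment, which this toy version does not):

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of overlapping bases between two half-open BED intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

# reproducing three rows of the -wo output above:
# a [0,15) vs x [0,4)  -> 4;  a [0,15) vs z [9,15) -> 6;  b [25,29) vs y [18,28) -> 3
checks = [overlap_bp(0, 15, 0, 4), overlap_bp(0, 15, 9, 15), overlap_bp(25, 29, 18, 28)]
```

The max(0, …) clamp is what makes disjoint intervals report 0 instead of a negative number.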

    -v report intervals with no overlap
    -s only consider features on the same strand
    -c report the number of overlapping features between the two files

    complement: report the uncovered regions of the genome

    bedtools complement -i <BED/GFF/VCF> -g <genome files>

    slop: grow feature intervals

    Requires a single input BED file (-i) and a genome file

    cat ranges-qry.bed 
    chr1 0 15 a
    chr1 25 29 b
    chr1 18 18 c
    chr1 10 14 d
    chr1 20 23 e
    chr1 6 7 f
    bedtools slop -i ranges-qry.bed -g genome.txt -b 4
    chr1 0 19 a
    chr1 21 33 b
    chr1 14 22 c
    chr1 6 18 d
    chr1 16 27 e
    chr1 2 11 f
    #-b 4: extend each interval by 4 bases on both sides

    -l 3 -r 5: extend 3 bases on the left and 5 on the right
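Per interval, slop is just addition clipped against the genome file; a sketch assuming half-open BED coordinates and a known chromosome length:

```python
def slop(start, end, chrom_len, left, right):
    """Grow a half-open BED interval by `left`/`right` bases, clipped to [0, chrom_len]."""
    return max(0, start - left), min(chrom_len, end + right)

# mirroring the -b 4 example above (assuming chr1 is long enough not to clip the right end):
grown = slop(0, 15, 100, 4, 4)  # interval a: the left end clips at 0
```

Clipping at both chromosome boundaries is why the a interval stays anchored at 0 instead of becoming negative.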

    flank: extract flanking regions (e.g., promoters)

    Requires a genome annotation GTF file (-i) and a genome file

    bedtools flank -i mm_GRCm38.75_protein_coding_genes.gtf \
    -g Mus_musculus.GRCm38_genome.txt \
    -l 3000 -r 0 > mm_GRCm38_3kb_promoters.gtf
    cut -f1,4,5,7 mm_GRCm38_3kb_promoters.gtf | head -n 3
    1 3671499 3674498 -
    1 4360315 4363314 -
    1 4496414 4499413 -

    getfasta: extract sequences

    Requires a genome fasta file (-fi) and a GTF/BED file of intervals to extract (-bed)

    bedtools getfasta -fi Mus_musculus.GRCm38.75.dna_rm.toplevel_chr1.fa \
    -bed mm_GRCm38_3kb_promoters.gtf -fo mm_GRCm38_3kb_promoters.fasta

    -tab Report extracted sequences in tab-delimited format instead of FASTA format.


    Extracting sequences with samtools (faster)
    #first build the .fai index (it records each sequence's name and length, among other fields)
    samtools faidx Mus_musculus.GRCm38.75.dna.chromosome.8.fa
    #extract sequences; separate multiple regions with spaces
    samtools faidx Mus_musculus.GRCm38.75.dna.chromosome.8.fa \
    8:123407082-123410744 8:123518835-123536649
    >8:123407082-123410744
    GAGAAAAGCTCCCTTCTTCTCCAGAGTCCCGTCTACCCTGGCTTGGCGAGGGAAAGGAAC
    CAGACATATATCAGAGGCAAGTAACCAAGAAGTCTGGAGGTGTTGAGTTTAGGCATGTCT
    [...]
    >8:123518835-123536649
    TCTCGCGAGGATTTGAGAACCAGCACGGGATCTAGTCGGAGTTGCCAGGAGACCGCGCAG
    CCTCCTCTGACCAGCGCCCATCCCGGATTAGTGGAAGTGCTGGACTGCTGGCACCATGGT
    [...]

    nuc: compute GC content and per-base counts

    bedtools nuc -fi hg19.fa -bed CDS.bed

    Explanation of the output: the following columns are appended to the end of each line of the original BED file

    Output format: 
    The following information will be reported after each BED entry:
    1) %AT content
    2) %GC content
    3) Number of As observed
    4) Number of Cs observed
    5) Number of Gs observed
    6) Number of Ts observed
    7) Number of Ns observed
    8) Number of other bases observed
    9) The length of the explored sequence/interval.
    10) The seq. extracted from the FASTA file. (opt., if -seq is used)
    11) The number of times a user's pattern was observed.
    (opt., if -pattern is used.)

    genomecov: per-chromosome and whole-genome coverage

    Requires a single input BED file (-i) and a genome file; if the input is a BAM file (-ibam), no genome file is needed.

    cat ranges-cov-sorted.bed
    chr1 4 9
    chr1 1 6
    chr1 8 19
    chr1 25 30
    chr2 0 20
    $ cat cov.txt
    chr1 30
    chr2 20
    bedtools genomecov -i ranges-cov-sorted.bed -g cov.txt
    chr1 0 7 30 0.233333 1
    chr1 1 20 30 0.666667
    chr1 2 3 30 0.1
    chr2 1 20 20 1 2
    genome 0 7 50 0.14 3
    genome 1 40 50 0.8
    genome 2 3 50 0.06
    #columns: name, depth, number of bases at that depth, total bases, fraction of bases at that depth
    #computes per-chromosome and whole-genome coverage in one run

  • The ranges-cov.bed file must be sorted first: sort -k1,1 ranges-cov.bed > ranges-cov-sorted.bed
  • The -bg option reports per-base coverage in BedGraph format.

    coverage: compute coverage over given intervals

    $ cat A.bed
    chr1 0 100
    chr1 100 200
    chr2 0 100

    $ cat B.bed
    chr1 10 20
    chr1 20 30
    chr1 30 40
    chr1 100 200

    $ bedtools coverage -a A.bed -b B.bed
    chr1 0 100 3 30 100 0.3000000
    chr1 100 200 1 100 100 1.0000000
    chr2 0 100 0 0 100 0.0000000


    Sources
    http://www.plob.org/2012/09/26/3748.html
    http://bedtools.readthedocs.org/en/latest/content/bedtools-suite.html
    https://code.google.com/archive/p/bedtools/wikis/Usage.wiki
    https://code.google.com/archive/p/bedtools/wikis/UsageAdvanced.wiki

    Reshaping data in R: dplyr

    Posted on 2016-03-15 | In R

    Dataset types

    Convert an over-long, over-wide dataset to the display-friendly tbl_df type:

    hflights_df <- tbl_df(hflights)

    Print hflights_df to enjoy output that no longer floods your screen.

    Basic verbs

    The common data-manipulation operations fall into five verbs:

    Filter rows: filter()

    Selects the subset of rows matching the given logical conditions, similar to base::subset()

    filter(hflights_df, Month == 1, DayofMonth == 1)
    #combine any number of conditions on the same data
    filter(hflights_df, Month == 1 | Month == 2)

    Arrange rows: arrange()

    Sorts the rows by the given columns, in order

    arrange(hflights_df, DayofMonth, Month, Year)
    #wrap a column name in desc() to sort in descending order:
    arrange(hflights_df, desc(ArrDelay))
    #this function matches plyr::arrange() and is similar to order()

    Select columns: select()

    Chooses a sub-dataset using column names as arguments:

    select(hflights_df, Year, Month, DayOfWeek)
    #you can also span a range of columns with :, treating column names like numbers
    select(hflights_df, Year:DayOfWeek)
    Use - to exclude columns:
    select(hflights_df, -DayOfWeek)

    Mutate: mutate()

    Computes new columns from existing ones and appends them:

    mutate(hflights_df,   gain = ArrDelay - DepDelay,   speed = Distance / AirTime * 60)

    Summarise: summarise()

    Collapses a data frame by applying summary functions, returning a one-row result:

    summarise(hflights_df, delay = mean(DepDelay, na.rm = TRUE))

    Grouping: group_by()

    The five verb functions above are already convenient, but their real power appears when they are combined with grouping. Once grouping information has been added to a dataset with group_by(), the mutate(), arrange() and summarise() functions automatically operate group-wise on these tbl objects (an advantage of R's generic functions).

    mtfs_df %>%
    group_by(chr) %>%
    summarize(max_recom = max(recom), mean_recom = mean(recom), num=n())

    Source: local data frame [23 x 4]
    chr max_recom mean_recom num
    1 chr1 41.5648 2.217759 2095
    2 chr10 42.4129 2.162635 1029
    3 chr11 36.1703 2.774918 560
    [...]
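For comparison, the same group-then-summarise pattern in plain Python with itertools.groupby (made-up numbers, not the mtfs data):

```python
from itertools import groupby
from statistics import mean

# made-up (chr, recom) records, pre-sorted by chr as groupby requires
records = [("chr1", 1.0), ("chr1", 3.0), ("chr1", 2.0),
           ("chr2", 5.0), ("chr2", 7.0)]

summary = {}
for chrom, rows in groupby(records, key=lambda r: r[0]):
    values = [recom for _, recom in rows]
    summary[chrom] = {"max_recom": max(values),
                      "mean_recom": mean(values),
                      "num": len(values)}
```

Note that groupby only merges adjacent keys, so the input must be sorted by the grouping key first; dplyr's group_by() handles that bookkeeping for you.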

    Functions used inside summarise

    n(): counts rows. n_distinct(x): counts the unique values of x (the source text called this count_distinct(x)).
    first(x), last(x) and nth(x, n): return the value at the given position, like the base expressions x[1], x[length(x)], and x[n]

    The pipe operator %>%

    Start with the data name, then chain successive operations on that data.

    Joins

    Take the intersection, union, etc. of two datasets.

    • inner_join(x,y) intersection
    • semi_join(x,y)
    • left_join(x,y)
    • anti_join(x,y)
    • inner_join(y,x)
    • semi_join(y,x)
    • left_join(y,x)
    • anti_join(y,x)
    • full_join(x,y) union
      [Cheatsheet for dplyr join functions]

      Further reading

      The dplyr package's own detailed 60-page vignette.
      The remaining vignettes (on the web, or via vignette(package = "dplyr")) cover databases, hybrid programming, performance comparisons, and the new window functions.
      A quick look at vignette("window-functions", package = "dplyr") shows a family of functions extending aggregation functions that return a single value (such as sum() and mean()) to versions returning vectors of equal length, such as cumsum() and cummean(), plus conveniences like n(), lead() and lag().
      Related documentation for the plyr package: its homepage.
      The data.table package is also very powerful and worth learning when you find the time.

    Reshaping data in R: tidyr

    Posted on 2016-03-09 | In R

    Happy families are all alike; every unhappy family is unhappy in its own way — Leo Tolstoy



    tidyr, one of R's data-reshaping packages, recently received an update (tidyr 0.4.0), so it's worth studying its notion of Tidy Data.
    A brief personal summary follows:

    #gather--mutate--separate--select--arrange
    setwd("F:/Rwork/tidyr")
    library(tidyr)
    ## Warning: package 'tidyr' was built under R version 3.2.3
    library(dplyr)
    ## Warning: package 'dplyr' was built under R version 3.2.2
    ## 
    ## Attaching package: 'dplyr'
    ## 
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## 
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    preg <-read.csv("preg.csv",stringsAsFactors = FALSE)
    preg
    ##           name treatmenta treatmentb
    ## 1   John Smith         NA         18
    ## 2     Jane Doe          4          1
    ## 3 Mary Johnson          6          7
    preg6 <-tbl_df(read.csv("preg.csv",stringsAsFactors = FALSE))
    preg6
    ## Source: local data frame [3 x 3]
    ## 
    ##           name treatmenta treatmentb
    ##          (chr)      (int)      (int)
    ## 1   John Smith         NA         18
    ## 2     Jane Doe          4          1
    ## 3 Mary Johnson          6          7
    preg2<-preg %>% 
      gather(treatment,n,treatmenta:treatmentb) %>%  
    #The first argument is the name of the key column, i.e. the variable defined by the values of the column headings.
    #The second argument is the name of the value column.
    #The third argument defines the columns to gather -- here, every column except name.
     #gather(treatment,n,-name,na.rm = TRUE)    #same as above; na.rm drops any missing values from the gathered columns
      mutate(treatment=gsub("treatment","",treatment)) %>%
      arrange(name,treatment)   #arrange=sort
    preg2
    ##           name treatment  n
    ## 1     Jane Doe         a  4
    ## 2     Jane Doe         b  1
    ## 3   John Smith         a NA
    ## 4   John Smith         b 18
    ## 5 Mary Johnson         a  6
    ## 6 Mary Johnson         b  7
    #now each step, displayed one by one
    preg3<-preg %>% 
      gather(treatment,n,treatmenta:treatmentb)
    preg3
    ##           name  treatment  n
    ## 1   John Smith treatmenta NA
    ## 2     Jane Doe treatmenta  4
    ## 3 Mary Johnson treatmenta  6
    ## 4   John Smith treatmentb 18
    ## 5     Jane Doe treatmentb  1
    ## 6 Mary Johnson treatmentb  7
    preg33<-preg3 %>% separate(treatment, c("Treatments", "group"),9 )  #separate
    preg33
    ##           name Treatments group  n
    ## 1   John Smith  treatment     a NA
    ## 2     Jane Doe  treatment     a  4
    ## 3 Mary Johnson  treatment     a  6
    ## 4   John Smith  treatment     b 18
    ## 5     Jane Doe  treatment     b  1
    ## 6 Mary Johnson  treatment     b  7
    preg333<-preg33 %>% select(name,group,n)  # select
    preg333
    ##           name group  n
    ## 1   John Smith     a NA
    ## 2     Jane Doe     a  4
    ## 3 Mary Johnson     a  6
    ## 4   John Smith     b 18
    ## 5     Jane Doe     b  1
    ## 6 Mary Johnson     b  7
    preg3333<-preg333 %>% spread(group,n)  #spread: one column become two column
    preg3333
    ##           name  a  b
    ## 1     Jane Doe  4  1
    ## 2   John Smith NA 18
    ## 3 Mary Johnson  6  7
    preg4<-preg3 %>% mutate(treatment=gsub("treatment","",treatment))
    preg4
    ##           name treatment  n
    ## 1   John Smith         a NA
    ## 2     Jane Doe         a  4
    ## 3 Mary Johnson         a  6
    ## 4   John Smith         b 18
    ## 5     Jane Doe         b  1
    ## 6 Mary Johnson         b  7
    preg5<-preg4 %>% arrange(name,treatment)
    preg5
    ##           name treatment  n
    ## 1     Jane Doe         a  4
    ## 2     Jane Doe         b  1
    ## 3   John Smith         a NA
    ## 4   John Smith         b 18
    ## 5 Mary Johnson         a  6
    ## 6 Mary Johnson         b  7
    #read all files from the same directory into a single data frame
    library(plyr)
    ## -------------------------------------------------------------------------
    ## You have loaded plyr after dplyr - this is likely to cause problems.
    ## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
    ## library(plyr); library(dplyr)
    ## -------------------------------------------------------------------------
    ## 
    ## Attaching package: 'plyr'
    ## 
    ## The following objects are masked from 'package:dplyr':
    ## 
    ##     arrange, count, desc, failwith, id, mutate, rename, summarise,
    ##     summarize
    paths <- dir("F:/Rwork/tidyr", pattern = "\\.csv$", full.names = TRUE)
    names(paths) <- basename(paths)
    all<-ldply(paths, read.csv, stringsAsFactors = FALSE)
    all
    ##               .id         name treatmenta treatmentb
    ## 1 preg - Copy.csv   John Smith         NA         18
    ## 2 preg - Copy.csv     Jane Doe          4          1
    ## 3 preg - Copy.csv Mary Johnson          6          7
    ## 4        preg.csv   John Smith         NA         18
    ## 5        preg.csv     Jane Doe          4          1
    ## 6        preg.csv Mary Johnson          6          7
    #subset rows of preg3 whose name is John Smith or Jane Doe
    subset(preg3, name %in% c("John Smith", "Jane Doe"))
    ##         name  treatment  n
    ## 1 John Smith treatmenta NA
    ## 2   Jane Doe treatmenta  4
    ## 4 John Smith treatmentb 18
    ## 5   Jane Doe treatmentb  1

    Economist_Graph: modifying a complex figure

    Posted on 2016-03-09 | In R


    Original figure: http://www.economist.com/node/21541178

    Basic plot

    library(ggplot2)
    dat <- read.csv("EconomistData.csv")
    pc1 <- ggplot(dat, aes(x = CPI, y = HDI, color = Region))+
    geom_point()

    Trend line

    pc2 <- pc1 +
    geom_smooth(aes(group = 1),
    method = "lm",
    formula = y ~ log(x),
    se = FALSE,
    color = "red") +
    geom_line()

    Open points

    pc3 <- pc2 +
    geom_point(shape = 1, size = 4)

    Labelling points

    pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan",
    "Afghanistan", "Congo", "Greece", "Argentina", "Brazil",
    "India", "Italy", "China", "South Africa", "Spain",
    "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France",
    "United States", "Germany", "Britain", "Barbados", "Norway", "Japan",
    "New Zealand", "Singapore")
    library("ggrepel")
    pc4 <- pc3 + geom_text_repel(aes(label = Country),
    color = "gray20",
    data = subset(dat, Country %in% pointsToLabel),
    force = 10)


    Selectively label only the points you want

    Change the region labels and order

    dat$Region <- factor(dat$Region,
    levels = c("EU W. Europe",
    "Americas",
    "Asia Pacific",
    "East EU Cemt Asia",
    "MENA",
    "SSA"),
    labels = c("OECD",
    "Americas",
    "Asia &\nOceania",
    "Central &\nEastern Europe",
    "Middle East &\nnorth Africa",
    "Sub-Saharan\nAfrica"))
    pc4$data <- dat
    pc4


    Modify the legend labels and their order

    Add title and format axes

    library(grid)
    pc5 <- pc4 +
    scale_x_continuous(name = "Corruption Perceptions Index, 2011 (10=least corrupt)",
    limits = c(.9, 10.5),
    breaks = 1:10) +
    scale_y_continuous(name = "Human Development Index, 2011 (1=Best)",
    limits = c(0.2, 1.0),
    breaks = seq(0.2, 1.0, by = 0.1)) +
    scale_color_manual(name = "",
    values = c("#24576D",
    "#099DD7",
    "#28AADC",
    "#248E84",
    "#F2583F",
    "#96503F")) +
    ggtitle("Corruption and Human development")


    Use scale_ functions to modify the x and y axes and the colors, and add the title

    Theme tweaks

    library(grid) # for the 'unit' function
    pc6 <- pc5 +
    theme_minimal() + # start with a minimal theme and add what we need
    theme(text = element_text(color = "gray20"),
    legend.position = c("top"), # position the legend in the upper left
    legend.direction = "horizontal",
    legend.justification = 0.1, # anchor point for legend.position.
    legend.text = element_text(size = 11, color = "gray10"),
    axis.text = element_text(face = "italic"),
    axis.title.x = element_text(vjust = -1), # move title away from axis
    axis.title.y = element_text(vjust = 2), # move away for axis
    axis.ticks.y = element_blank(), # element_blank() is how we remove elements
    axis.line = element_line(color = "gray40", size = 0.5),
    axis.line.y = element_blank(),
    panel.grid.major = element_line(color = "gray50", size = 0.5),
    panel.grid.major.x = element_blank()
    )


    Fine-tune the theme

    Add model R^2 and source note

    mR2 <- summary(lm(HDI ~ log(CPI), data = dat))$r.squared
    library(grid)
    png(file = "images/econScatter10.png", width = 800, height = 600)
    pc6
    grid.text("Sources: Transparency International; UN Human Development Report",
    x = .02, y = .03,
    just = "left",
    draw = TRUE)
    grid.segments(x0 = 0.81, x1 = 0.825,
    y0 = 0.90, y1 = 0.90,
    gp = gpar(col = "red"),
    draw = TRUE)
    grid.text(paste0("R² = ",
    as.integer(mR2*100),
    "%"),
    x = 0.835, y = 0.90,
    gp = gpar(col = "gray20"),
    draw = TRUE,
    just = "left")

    dev.off()


    Adapted from:
    http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

    My ggplot2 learning tips (ggplot2 revisited: overlooked details)

    Posted on 2016-03-09 | In R


    Original figure: http://www.economist.com/node/21541178

    The grammar of a ggplot2 plot:

    ggplot(data = <default data set>, 
    aes(x = <default x axis variable>,
    y = <default y axis variable>,
    ... <other default aesthetic mappings>),
    ... <other plot defaults>) +

    geom_<geom type>(aes(size = <size variable for this geom>,
    ... <other aesthetic mappings>),
    data = <data for this point geom>,
    stat = <statistic string or function>,
    position = <position string or function>,
    color = <"fixed color specification">,
    <other arguments, possibly passed to the _stat_ function>) +

    scale_<aesthetic>_<type>(name = <"scale label">,
    breaks = <where to put tick marks>,
    labels = <labels for tick marks>,
    ... <other options for the scale>) +

    theme(plot.background = element_rect(fill = "gray"),
    ... <other theme elements>)

    Geometric Objects(geom_XX) and Aesthetics (aes())

    Type geom_<tab> to list all the geometric objects

    Statistical Transformations(统计变换)

    Statistical transformations (stat_) compute quantities such as means or variances; you need them whenever you want to display some statistical feature of a variable.
    Every geom_XX has a default stat, which you can inspect with args(geom_XX) (e.g. args(stat_bin)). For example, the default stat of geom_bar is stat_count, which counts occurrences. What does that mean? An example:

    > library(ggplot2)
    > df <- data.frame(trt = c("a","a", "b", "c"), outcome = c(2.3,5, 1.9, 3.2))
    > df
    trt outcome
    1 a 2.3
    2 a 5.0
    3 b 1.9
    4 c 3.2
    > ##stat()
    > ggplot(df, aes(trt, outcome)) +
    geom_bar()
    Error: stat_count() must not be used with a y aesthetic.
    #The error is expected: in aes() we mapped outcome to the y axis, i.e. one outcome value per trt value,
    #while geom_bar()'s stat_count wants to count how often each trt value occurs. The mapped y value
    #conflicts with the count to be displayed, hence the error.
    ##Solution:
    > ggplot(df, aes(trt)) +
    geom_bar()


    If you instead want bars whose heights are the summed outcome values for each trt value, set stat = "identity".

    > ggplot(df, aes(trt, outcome)) +
    + geom_bar(stat = "identity")
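With stat = "identity" the bar heights are the per-trt sums of outcome, which you can verify directly:

```r
df <- data.frame(trt = c("a", "a", "b", "c"),
                 outcome = c(2.3, 5, 1.9, 3.2))

tapply(df$outcome, df$trt, sum) # a = 7.3, b = 1.9, c = 3.2
```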


    ggplot2 includes many statistical transformations:

    See more at: https://www.zhihu.com/question/24779017

    Scales

    Controlling aes() mappings

    An aesthetic (aes()) mapping only tells ggplot2 that a variable should map to an aesthetic, not how to map it. For example, aes(shape = x) maps a variable to shape without saying which shapes to use; likewise aes(color = z) does not say which colors. When left unspecified, ggplot2 falls back on defaults, and scales are how we change those values.
    Scales in ggplot2 cover:

  • position

  • color and fill

  • size

  • shape

  • line type

  • x,y

    How to modify

    scale_<aesthetic>_<type>(name, limits, breaks, labels)

    Type scale_<tab> to list all the scale functions available for modification.


    Some scale functions take extra arguments; for example, scale_color_continuous has low and high parameters.
    scale_x_discrete(name="State Abbreviation") +
    scale_color_continuous(name="",
    breaks = c(19751, 19941, 20131),
    labels = c(1971, 1994, 2013),
    low = "blue", high = "red")



    Faceting

    Facets: multiple panels on one page

    facet_wrap(): splits the data by a single variable; e.g. facet_wrap(~State, ncol = 10) groups by State and draws all the panels ten per row.
    facet_grid(): splits the data by two variables, e.g. facet_grid(color ~ cut, margins = TRUE); the variable before the tilde defines the panel rows and the one after defines the columns, and margins = TRUE adds marginal panels containing all the data for each level of the faceting variables.
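A minimal sketch using the built-in mtcars data:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# One grouping variable, panels wrapped into rows:
p + facet_wrap(~ cyl, ncol = 3)

# Two grouping variables: rows by am, columns by cyl, with margin panels:
p + facet_grid(am ~ cyl, margins = TRUE)
```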
    Themes

    More ggplot2 theme demos:
    http://docs.ggplot2.org/dev/vignettes/themes.html
    https://github.com/jrnold/ggthemes

    Modifying theme defaults
    theme_minimal() +
    theme(text = element_text(color = "turquoise"))


    Custom themes
    library(scales) # muted() below comes from the scales package
    theme_new <- theme_bw() +
    theme(plot.background = element_rect(size = 1, color = "blue", fill = "black"),
    text=element_text(size = 12, family = "Serif", color = "ivory"),
    axis.text.y = element_text(colour = "purple"),
    axis.text.x = element_text(colour = "red"),
    panel.background = element_rect(fill = "pink"),
    strip.background = element_rect(fill = muted("orange")))

    p5 + theme_new


    On the confusing aes()
  • Any parameter that is tied to the order of the data vectors and must be specified element by element has to go inside aes().

  • What, still not sure whether something belongs inside or outside aes()? Remember: if you want to adjust the whole layer uniformly, put it outside aes(); if you want per-group adjustments and already have a grouping variable the same length as x and y, put it inside aes().
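A concrete illustration of inside vs. outside aes(), with made-up data:

```r
library(ggplot2)

df <- data.frame(x = 1:4, y = c(2, 3, 1, 4),
                 grp = c("a", "a", "b", "b"))

# Outside aes(): one fixed color for the whole layer
ggplot(df, aes(x, y)) + geom_point(color = "red")

# Inside aes(): color mapped to a grouping variable, one color per group
ggplot(df, aes(x, y, color = grp)) + geom_point()
```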

    Other notes
  • Annotations: all annotation is done through the annotate() function; geom_text() only does it as a side job.

  • The cleverest part of theme() is that it separates data-dependent aesthetic adjustments from data-independent ones, keeping data processing, data-driven aesthetics, and data-independent styling apart.
    Noncoding RNAs (ncRNAs)

    Posted on 2016-01-29 | In molecular biology

    Noncoding RNAs (ncRNAs)

  • most ncRNAs operate as RNA-protein complexes, including ribosomes, snRNPs, snoRNPs, telomerase, microRNAs, and long ncRNAs.
  • Non-coding RNA database

    http://research.imb.uq.edu.au/rnadb/
