SQLite数据库整体查询命令
dot-commands
1 | $sqlite3 |
SQLite数据类型
注:SQLite并没有强制同列必须使用相同类型的数据,每个表的每一列都有优先类型(type affinity),但为了下游分析方便,最好同一列保持相同数据类型。
当某一列是混合数据类型时,排序原则为:NULL values, integer and real values (sorted numerically), text values, and finally blob values。
数据库内容查询
万能的SELECT命令
语法:1
SELECT <columns> FROM <tablename>;
基本形式:SELECT选择指令从一个table中抓取所有列的所有行(columns设定为*
)。
选取特定列:不同列之间逗号分隔(SELECT trait, chrom, position, strongest_risk_snp, pvalue FROM gwascat LIMIT 5;)
SELECT语句除了在sqlite中交互查询外,还可在命令行中直接查询1
2
3
4#交互
sqlite> SELECT * FROM gwascat;
#命令行
sqlite3 gwascat.db "SELECT * FROM gwascat" > results.txt
SQLite默认输出不规则,可做以下设置输出排列整齐易读输出:1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19sqlite> SELECT trait, chrom, position, strongest_risk_snp, pvalue
...> FROM gwascat LIMIT 5;
trait|chrom|position|strongest_risk_snp|pvalue
Asthma and hay fever|6|32658824|rs9273373|4.0e-14
Asthma and hay fever|4|38798089|rs4833095|5.0e-12
Asthma and hay fever|5|111131801|rs1438673|3.0e-11
Asthma and hay fever|2|102350089|rs10197862|4.0e-11
Asthma and hay fever|17|39966427|rs7212938|4.0e-10
sqlite> .header on
sqlite> .mode column
sqlite> SELECT trait, chrom, position, strongest_risk_snp, pvalue
...> FROM gwascat LIMIT 5;
trait chrom position strongest_risk_snp pvalue
-------------------- ---------- ---------- ------------------ ----------
Asthma and hay fever 6 32658824 rs9273373 4.0e-14
Asthma and hay fever 4 38798089 rs4833095 5.0e-12
Asthma and hay fever 5 111131801 rs1438673 3.0e-11
Asthma and hay fever 2 102350089 rs10197862 4.0e-11
Asthma and hay fever 17 39966427 rs7212938 4.0e-10
SELECT可选参数:
LIMIT 输出查询行数1
sqlite> SELECT * FROM gwascat LIMIT 2;
ORDER BY 输出结果排序1
2
3
4SELECT author, trait, journal FROM <tablename> ORDER BY author DESC LIMIT 5;
#(按author降序排序),排序有助于异常值检测。
#若所指定排序列含由NULL值,可通过 IS NOT NULL 排除NULL值
SELECT chrom, position, trait, strongest_risk_snp, pvalue FROM <tablename> WHERE pvalue IS NOT NULL ORDER BY pvalue LIMIT 5;
WHERE 数据筛选1
2SELECT chrom, position, trait, strongest_risk_snp, pvalue FROM <tablename> WHERE lower(strongest_risk_snp) = "rs429358";
#SQLite大小写敏感,所以匹配时最好用lower() 转换。
多条件筛选:1
2
3
4
5
6
7
8sqlite> SELECT chrom, position, strongest_risk_snp, pvalue FROM gwascat
...> WHERE chrom IN ("1", "2", "3") AND pvalue < 10e-11
...> ORDER BY pvalue LIMIT 5;
#或者
sqlite> SELECT chrom, position, strongest_risk_snp, pvalue
...> FROM gwascat WHERE chrom = "22"
...> AND position BETWEEN 24000000 AND 25000000
...> AND pvalue IS NOT NULL ORDER BY pvalue LIMIT 5;
AS 对原始数据的修改:1
2
3
4
5
6
7sqlite> SELECT lower(trait) AS trait,
...> "chr" || chrom || ":" || position AS region FROM gwascat LIMIT 5;
#||为连接运算符,用来连接两个字符串
#NULL的替换,ifnull()函数
sqlite> SELECT ifnull(chrom, "NA") AS chrom, ifnull(position, "NA") AS position,
...> strongest_risk_snp, ifnull(pvalue, "NA") AS pvalue FROM gwascat
...> WHERE strongest_risk_snp = "rs429358";
更多SQLite内置函数
Function Description
ifnull(x, val) If x is NULL, return with val, otherwise return x; shorthand for coalesce() with two arguments
min(a, b, c, …) Return minimum in a, b, c, …
max(a, b, c, …) Return maximum in a, b, c, …
abs(x) Absolute value
coalesce(a, b, c, …) Return first non-NULL value in a, b, c, … or NULL if all values are NULL
length(x) Returns number of characters in x
lower(x) Return x in lowercase
upper(x) Return x in uppercase
replace(x, str, repl) Return x with all occurrences of str replaced with repl
round(x, digits) Round x to digits (default 0)
trim(x, chars), ltrim(x, chars), rtrim(x, chars) Trim off chars (spaces if chars is not specified) from both sides, left side, and right side of x, respectively.
substr(x, start, length) Extract a substring from x starting from character start and is length characters long
集合函数(Aggregate)
count(colname)函数:
返回总行数(无视NULL的存在):sqlite> SELECT count(*) FROM gwascat;
若colname是具体某列则返回出去NULL值的总行数.
其他相似函数:avg(x),max(x),min(x),sum(x),total(x)
计算列非重复值(unique)个数1
sqlite> SELECT count(DISTINCT 列) AS unique_rs FROM gwascat;
行分组(GROUP BY)
1 | sqlite> SELECT chrom, count(*) FROM gwascat GROUP BY chrom; |
自己动手写数据库
we’ll use the basic SQL syntax to create tables and insert records into tables. Then load data into SQLite using Python’s sqlite3 module.
创建tables
基本语法:1
2
3
4
5
6CREATE TABLE tablename(
id integer primary key,
column1 column1_type,
column2 column2_type,
...
);
注意到所有SQLite数据库第一列总是id integer primary key,primary key是非重复整数来识别table中每一条记录。
创建table:1
2
3
4
5
6
7$ sqlite3 practice.dbsqlite> CREATE TABLE variants(
...> id integer primary key,
...> chrom text,
...> start integer,
...> end integer,
...> strand text,
...> name text);
数据写入table
基本语法:1
2INSERT INTO tablename(column1, column2)
VALUES (value1, value2);
建立索引
基本语法:1
2
3
4
5
6sqlite> CREATE INDEX <columns-name_idx> ON <tablename>(<columns-name>);
#察看索引
sqlite> .indices
columns-name_idx
#删除索引
sqlite> DROP INDEX columns-name_idx;
修改/删除table
删除:DROP TABLE
修改:ALTER TABLE
python中交互操作SQLite
连接SQLite数据库并创建table
create_table.py1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23import sqlite3
# the filename of this SQLite database
db_filename = "variants.db"
# initialize database connection
conn = sqlite3.connect(db_filename) #connect() 连接数据库
c = conn.cursor() #在python中用cursor()与SQLite数据库交互
table_def = """\
CREATE TABLE variants(
id integer primary key,
chrom test,
start integer,
end integer,
strand text,
rsid text);
"""
c.execute(table_def) #SQL语法,相当于确认
conn.commit() #提交跟新内容到SQLite数据库
conn.close() #关闭与数据库的连接
数据载入table
load_variants.py1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43##load_variants.py data.txt
import sys
import sqlite3
from collections import OrderedDict
# the filename of this SQLite database
db_filename = "variants.db"
# initialize database connection
conn = sqlite3.connect(db_filename)
c = conn.cursor()
## Load Data
# columns (other than id, which is automatically incremented
tbl_cols = OrderedDict([("chrom", str), ("start", int),
("end", int), ("strand", str),
("rsid", str)])
with open(sys.argv[1]) as input_file:
for line in input_file:
# split a tab-delimited line
values = line.strip().split("\t")
# pair each value with its column name
cols_values = zip(tbl_cols.keys(), values)
# use the column name to lookup an appropriate function to coerce each
# value to the appropriate type
coerced_values = [tbl_cols[col](value) for col, value in cols_values]
# create an empty list of placeholders
placeholders = ["?"] * len(tbl_cols)
# create the query by joining column names and placeholders quotation
# marks into comma-separated strings
colnames = ", ".join(tbl_cols.keys())
placeholders = ", ".join(placeholders)
query = "INSERT INTO variants(%s) VALUES (%s);"%(colnames, placeholders)
# execute query
c.execute(query, coerced_values)
conn.commit() # commit these inserts
conn.close()