NCBI数据库的编程检索和详细信息筛选

NCBI提供了丰富的接口,文档可参考:文档主目录方法说明和参数设置返回值的可选类型和模式 以及 九种接口简介

本文的一些参考资源(这些资料帮助了我这个很久不写python的假程序员):
Lxml库及Xpath语法详解 How to set the pandas dataframe data left/right alignment?

In [99]:
import requests
from lxml import etree
import time
import os

设置检索关键词

关键词作为初步检索的条件,待拿到abstract或summary后,可以进一步筛选信息

In [110]:
key_word =  '((colon cancer) OR (colorectal cancer) OR (rectal cancer)) and ((radiation) OR (radiotherapy))'
# 'SI[gene]+AND+cancer'

检索PubMed

1 查询

In [100]:
search_results = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi',
             params={'db': 'pubmed',
                     'term': key_word,
                     'usehistory':'y',
                     'RetMax':'10',
                    })
body=search_results.text
xml=etree.XML(body.encode(),etree.XMLParser())
webenv = xml.xpath('//WebEnv/text()')
QueryKey = xml.xpath('//QueryKey/text()')

2 获取Summary

Summary为XML格式结构化的完整信息

In [101]:
summary_results = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi',
             params={'Query_key':QueryKey ,
                     'db': 'pubmed',
                     'WebEnv': webenv,
                     'retmode': 'text',
                     'version': '2.0'
                    })
body=summary_results.text
if (os.path.exists("PubMed") == False):
    os.mkdir("PubMed")
file_name = "PubMed/summary_results_" + time.strftime("%Y%m%d_%H.%M", time.localtime()) + ".txt"
with open(file_name,"w",encoding='utf-8') as txt:
    txt.write(body)

3 获取Fetch

Fetch为非结构化的文本列表,pubmed的abstract主要使用这种方式获取

In [102]:
fetch_results = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi',
             params={'Query_key':QueryKey ,
                     'db': 'pubmed',
                     'WebEnv': webenv,
                     'rettype': 'abstract',
                     #'rettype': 'Summary',
                     'retmode': 'text'
                    })
body=fetch_results.text
if (os.path.exists("PubMed") == False):
    os.mkdir("PubMed")
file_name = "PubMed/fetch_results_" + time.strftime("%Y%m%d_%H.%M", time.localtime()) + ".txt"
with open(file_name,"w",encoding='utf-8') as txt:
    txt.write(body)

检索GEO

1 查询

In [108]:
search_results = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi',
             params={'db': 'gds',
                     'term': key_word,
                     'usehistory':'y',
                     'RetMax':'10',
                    })
body=search_results.text
xml=etree.XML(body.encode(),etree.XMLParser())
webenv = xml.xpath('//WebEnv/text()')
QueryKey = xml.xpath('//QueryKey/text()')

2 获取Summary

Summary为XML格式结构化的完整信息,GEO的summary信息量较大

In [106]:
summary_results = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi',
             params={'Query_key':QueryKey ,
                     'db': 'gds',
                     'WebEnv': webenv,
                     'retmode': 'text',
                     'version': '2.0'
                    })
body=summary_results.text
if (os.path.exists("GEO") == False):
    os.mkdir("GEO")
file_name = "GEO/summary_results_" + time.strftime("%Y%m%d_%H.%M", time.localtime()) + ".txt"
with open(file_name,"w",encoding='utf-8') as txt:
    txt.write(body)

3 获取Fetch

Fetch为非结构化的文本列表

In [109]:
fetch_results = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi',
             params={'Query_key':QueryKey ,
                     'db': 'gds',
                     'WebEnv': webenv,
                     'rettype': 'Summary',
                     'retmode': 'text'
                    })
body=fetch_results.text
if (os.path.exists("GEO") == False):
    os.mkdir("GEO")
file_name = "GEO/fetch_results_" + time.strftime("%Y%m%d_%H.%M", time.localtime()) + ".txt"
with open(file_name,"w",encoding='utf-8') as txt:
    txt.write(body)

检索结果的进一步筛选

In [196]:
body = summary_results.text
xml=etree.XML(body.encode(),etree.XMLParser())
summary = xml.xpath('//summary/text()')

title = xml.xpath('//title/text()')
DocumentSummary = xml.xpath('./DocumentSummarySet/DocumentSummary')
# len(DocumentSummary)
# 1150

# 下面循环可以写成 map 的形式,构建的函数需传入两个参数:一个是索引(数字),另一个是标签(字符串)
# 各个编程语言中非常重要的三种编程习惯:map用于简化“可并行”关系的循环,reduce用于简化“需串行”关系的循环,lambda用于匿名函数。
# 这种习惯借鉴于函数式编程(但严格的函数式编程是不允许命令式步骤的代码出现的),此外,其他的高级函数也可以多使用,比如filter、sort等
# 顺便回顾下java中的lambda写法,不仅可以实现匿名函数,还可以用于实现匿名内部类
# java中的lambda实现内部类可参考:https://www.cnblogs.com/coprince/p/8692972.html
search_range = range(0,len(DocumentSummary))
Accession = ["" for i in search_range]
title = ["" for i in search_range]
PDAT = ["" for i in search_range]
for i in search_range:
    Accession[i] = DocumentSummary[i].xpath('./Accession/text()')
    title[i] = DocumentSummary[i].xpath('./title/text()')
    PDAT[i] = DocumentSummary[i].xpath('./PDAT/text()')
In [114]:
summary[1]
Out[114]:
'We report the genome-wide effects of KAP1 loss on the transcriptome, the chromatin state, and on recruitment of various components of the transcription machinery in the colon colorectal cancer cell line HCT116.'
In [197]:
import re
n_pattens = 5
patterns = [re.compile('colon cancer'),
           re.compile('rectal cancer'),
           re.compile('radiation'),
           re.compile('radiotherapy'),
           re.compile('after')]

def match_info(data):
    
    results = False
    match_results = [0 for i in range(n_pattens)]
    for i in range(n_pattens):
        match_results[i] = len(re.findall(patterns[i],data))
    
    if ( (match_results[0] > 0 or match_results[1] > 0) and (match_results[2] > 0 or match_results[3] > 0) and (match_results[4] > 0) ):
        results = True
        
    return(results)
In [198]:
# 当然,下面的循环可以写成 map 的形式(python里的map和R里面的map都是差不多的)
# 对于过滤符合搜索要求的数据,可以结合 filter 进行处理
# 写法如下:
# search_results = map(match_info, summary)
# match_index = list(filter( (lambda i : search_results[i]), search_range))
## 为了结构清晰,也可以先将lambda表达式定义成变量再传入filter,如index_bool = lambda i : search_results[i]
# Accession_match = Accession[match_index][0]
# Title_match = title[match_index][0]
# PDAT_match = PDAT[match_index][0]
#------------------------------------ for 写法 ------------------------------------
search_results = [ False for i in search_range]
Accession_match = []
Title_match = []
PDAT_match = []
for i in search_range:
    search_results[i] = match_info(summary[i])
    if (search_results[i]==True):
        Accession_match.extend([Accession[i][0]])
        Title_match.extend([title[i][0]])
        PDAT_match.extend([PDAT[i][0]])
In [200]:
import pandas as pd
Match_results = pd.DataFrame({
    'Accession': Accession_match,
    'Title': Title_match,
    'Date': PDAT_match
})

pd.set_option('max_colwidth',50)
pd.set_option('expand_frame_repr', True)
dfStyler = Match_results.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
dfStyler
Out[200]:
Accession Title Date
0 GSE139995 WNT activated cells are the origin of regrowth of colorectal cancer organoids after irradiation 2019/11/07
1 GSE87211 Colorectal cancer susceptibility loci as predictive markers of rectal cancer prognosis after surgery 2017/11/28
2 GSE103178 Gene expression profile of colorectal cancer HCT116 cells treated with single (2Gy) or fractionated (5 x 2Gy) doses of ionizing radiation. 2017/08/29
3 GSE98959 MicroRNA expression in preoperative chemoradiotherapy for rectal cancer (LARC) 2017/05/17
4 GSE93228 Cell lines iPSC CRL1831 (induced pluripotent stem cells) and CSC DLD1 (cancer stem-like cells) derived from normal colon CRL1831 and colorectal cancer DLD-1 cells in 3D cell culture conditions and subjected to ionizing radiation doses 2017/01/07
5 GSE60331 Combining bevacizumab and chemoradiation in rectal cancer. Translational results of the AXEBeam trial. 2016/08/02
6 GSE65622 Locally Advanced Rectal Cancer - Radiation Response Prediction Study - Serum Proteins 2016/04/21
7 GSE75867 Transient activation of the WNT pathway after disruption/remodeling of colorectal cancer cell clusters promotes a malignant phenotype 2015/12/10
8 GSE52413 lncRNAs expression signatures of colon cancer 2013/11/16
9 GSE29298 A specific miRNA signature correlates with complete pathological response to neoadjuvant chemo-radiotherapy in locally advanced rectal cancer 2012/04/25
10 GSE15781 New specific molecular targets for radiochemotherapy in colorectal cancer 2009/04/23
11 GSE801 HCT116-Clone2 cells 24 hours after XR treatment at 4 Gy 2004/06/04
12 GSE800 HCT116-Clone2 cells 6 hours after XR treatment at 4 Gy 2004/06/04
13 GSE799 HCT116-Clone2 cells 10 min after XR treatment at 4 Gy 2004/06/04
14 GSE526 HCT116-CloneK cells 24 hours after XR treatment at 4 Gy 2003/12/08
15 GSE525 HCT116-CloneK cells 6 hours after XR treatment at 4 Gy 2003/12/08
16 GSE524 HCT116-CloneK cells 10 minutes after XR treatment at 4 Gy 2003/12/08
17 GSE522 X-radiation (XR) sensitive HCT116-CloneK: XR at 0 and 4 Gy 2003/12/08