Public Database Roundup and Downloads (1): TCGA


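  The descriptions and download code in this post were compiled from several online sources. This post is only a summary; to study the material, please read the original articles.
1. Commonly used databases

  In recent years there have been persistent rumors of decoupling in the biomedical field, and as the 2024 US election approached, concern about the policies expected after Trump takes office kept growing. Partly out of self-inflicted worry, I catalogued the various databases and started downloading them step by step.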

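2. Downloading TCGA data

  TCGA (The Cancer Genome Atlas) is a large-scale cancer genomics database created jointly by the US National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It collects clinical, genomic, and expression data from many cancer types.
2.1 About TCGA

  On December 13, 2005, NCI and NHGRI jointly launched the TCGA project.
  TCGA is a genome-based big-science research program that builds on the results of the Human Genome Project (HGP) to study genomic changes in cancer. Unlike HGP, which focused on the inherited factors of disease, TCGA is more concerned with the genetic changes that arise in cells after birth.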

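  Key figures: a multidimensional map of the key genomic changes in 33 cancer types; more than 2 PB of genomic data made public; more than 10,000 patients contributing matched tumor and control tissue; 7 data types.

  The organization is well developed: eight types of centers together form a closed data loop.

  Tissue Source Sites (TSS) collect the samples and hand them to the Biospecimen Core Resource (BCR) for processing; the Genome Sequencing Centers (GSC) and Genome Characterization Centers (GCC) sequence and characterize them; CGHub and the Genome Data Analysis Centers (GDAC) act as gatekeepers; the Data Coordinating Center (DCC) performs the final integration; and the NCI/NHGRI project team publishes the results.

  Sample inclusion follows strict criteria: the patient must have a primary tumor with a matched sample; the tumor tissue must be frozen and of suitable size; and the tumor cell content must be above 80%.
2.2 Data types in detail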


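  Clinical data: TCGA contains clinical information for patients across many cancer types, such as age, sex, pathological stage, and survival time. These data help researchers understand cancer progression and prognosis.
  indexed clinical: refined clinical data created from the XML files
  XML: raw clinical data
      Processing clinical data in XML format takes two steps:
          Use GDCquery and GDCdownload to query and download Biospecimen or Clinical XML files
          Use GDCprepare_clinic to parse the files
  BCR Biotab: TSV files produced by parsing the XML files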
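  Genomic data: the genomic data in TCGA include DNA sequencing and mutation data. They help researchers analyze changes in the cancer genome and look for genetic variants associated with cancer development.
  Expression data: the expression data in TCGA were obtained mainly by RNA sequencing. They reflect gene expression levels inside cancer cells and help researchers find genes related to cancer development and treatment.
  Methylation data: the methylation data in TCGA help researchers understand the patterns of, and changes in, DNA methylation in cancer cells. Methylation is a common mode of genome regulation and is closely tied to the onset and progression of cancer.
  Proteomic data: the proteomic data in TCGA include protein expression levels and post-translational modification information in cancer cells. These data are important for studying cancer cell function and signaling pathways.
2.3 TCGA data levels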


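  Controlled data mainly covers private information about the patients who provided samples, raw sequencing data (BAM/FASTQ), level 1 and 2 data from SNP6 and exon arrays, and intermediate bioinformatics results such as VCF and MAF files.

  As the data-level table shows, level 1 and 2 data are patients' raw results; they raise medical ethics issues and are generally controlled. Level 3 and 4 data are processed data, integrated and published by TCGA: level 3 is partly open and level 4 is fully open. Only a small share of the data has open status.
  For details, see the GDC_Introduction documentation.
2.4 The TCGA web filtering page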


2.5 Meaning of each field in each file

  See the documentation on the official website.
  Field meanings for each table
2.6 Biospecimen IDs (sample IDs)


  Introduction to TCGA barcodes

  UUIDs are now the primary identifier; when downloading data, a UUID is used to refer to a sample.
2.7 Download code
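
  Although UUIDs are now the primary identifier, barcodes still appear in most downloaded tables, so it helps to be able to take one apart. The sketch below assumes the standard seven-field aliquot barcode layout (project, TSS, participant, sample type plus vial, portion plus analyte, plate, center), which this post does not spell out; `parse_tcga_barcode` is a hypothetical helper name, not part of TCGAbiolinks. Sample-type codes 01 to 09 are treated as tumor, 10 to 19 as normal, and 20 and above as control.

```r
# Split a full TCGA aliquot barcode into its named fields (base R only).
# Assumption: the input is a complete 7-field barcode such as
# "TCGA-02-0001-01C-01D-0182-01"; shorter patient or sample barcodes are rejected.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  if (length(parts) != 7 || parts[1] != "TCGA") {
    stop("expected a 7-field aliquot barcode starting with 'TCGA': ", barcode)
  }
  sample_code <- as.integer(substr(parts[4], 1, 2))
  list(
    project     = parts[1],
    tss         = parts[2],                 # tissue source site
    participant = parts[3],
    sample_type = substr(parts[4], 1, 2),   # e.g. "01" primary tumor, "11" normal
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),   # e.g. "D" for DNA, "R" for RNA
    plate       = parts[6],
    center      = parts[7],
    patient_id  = paste(parts[1:3], collapse = "-"),
    tissue      = if (sample_code <= 9) "tumor"
                  else if (sample_code <= 19) "normal"
                  else "control"
  )
}

bc <- parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01")
bc$patient_id   # "TCGA-02-0001"
bc$tissue       # "tumor"
```

  The `patient_id` field (the first three barcode fields) is what the clinical tables key on as `bcr_patient_barcode`, so it is the natural join key between expression samples and clinical records.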

# 0. Setup -------------------------------------------------------------------
library(TCGAbiolinks)
# Keep only the TCGA projects from the full GDC project list
projects <- TCGAbiolinks::getGDCprojects()$project_id
projects <- projects[grepl('^TCGA', projects, perl = TRUE)]
# 1. Clinical -------------------------------------------------------------
## 1.1 clinical ------------------------------------------------------------
### XML -------------------------------------------------------------------------
# Use GDCquery and GDCdownload to query and download Biospecimen or Clinical XML files
# Use GDCprepare_clinic to parse the files
### Download all clinical data and combine the results into one file ########
### TCGA-READ and TCGA-LGG: downloaded only, not merged; they contain duplicates and need case-by-case handling
library(data.table)
library(dplyr)
library(regexPipes)
# Fetch the indexed clinical data for every TCGA project
clinical <- TCGAbiolinks::getGDCprojects()$project_id %>%
  regexPipes::grep("TCGA", value = TRUE) %>%
  sort %>%
  plyr::alply(1, GDCquery_clinic, .progress = "text") %>%
  rbindlist(fill = TRUE)
readr::write_csv(clinical, file = "all_clin_indexed.csv")
# Parse the XML files and extract the corresponding information
dir.create("TCGA_alldata", showWarnings = FALSE)  # output directory used below
getclinical <- function(proj) {
  message(proj)
  result <- NULL
  max_attempts <- 5  # maximum number of attempts

  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch({
      query <- GDCquery(project = proj, data.category = "Clinical", data.format = "bcr xml")
      GDCdownload(query)
      clinical <- GDCprepare_clinic(query, clinical.info = "patient")

      for (i in c("admin", "radiation", "follow_up", "drug", "new_tumor_event")) {
        message(i)
        aux <- GDCprepare_clinic(query, clinical.info = i)
        if (is.null(aux) || nrow(aux) == 0) next

        # Rename columns that clash with existing ones (except the merge key).
        # Indices are computed on colnames(aux) itself so they point at the right columns.
        replicated <- which(colnames(aux) != "bcr_patient_barcode" &
                              colnames(aux) %in% colnames(clinical))
        colnames(aux)[replicated] <- paste0(colnames(aux)[replicated], ".", i)

        clinical <- merge(clinical, aux, by = "bcr_patient_barcode", all = TRUE)
      }

      # Save the clinical data to a CSV file
      readr::write_csv(clinical, file = paste0("TCGA_alldata/", proj, "_clinical_from_XML.csv"))
      clinical
    }, error = function(e) {
      message("Error clinical: ", proj, " Attempt: ", attempt, " (", conditionMessage(e), ")")
      NULL
    })

    # Stop retrying once the data has been retrieved
    if (!is.null(result)) break
  }

  # If every attempt failed, return NULL and emit a warning
  if (is.null(result)) {
    warning("Failed to get clinical data for project: ", proj, " after ", max_attempts, " attempts.")
  }

  result
}
# Patient information
# If memory runs out, download in batches or download one dataset at a time
clinical <- TCGAbiolinks::getGDCprojects()$project_id %>%
  regexPipes::grep("TCGA", value = TRUE) %>%
  sort %>%
  plyr::alply(1, getclinical, .progress = "text") %>%
  rbindlist(fill = TRUE) %>%
  setDF %>%
  dplyr::distinct()  # drop duplicated rows of the combined table
readr::write_csv(clinical, file = "TCGA_alldata/all_clin_XML.csv")
## 1.2 supplement ---------------------------------------------------------
### 1.2.1 clinical-supplement-bcr biotab: TSV files produced by parsing the XML files ------
sapply(projects, function(project){
  cat("Processing project:", project, "\n")

  query <- GDCquery(
    project = project,
    data.category = "Clinical",
    data.type = "Clinical Supplement",
    data.format = "BCR Biotab",
    access = "open"
  )
  GDCdownload(query, method = "api", files.per.chunk = 100)

  prepared_data <- GDCprepare(query, save = TRUE, save.filename = paste0(project, "_clinical_supplement_bcr_biotab.Rdata"))

})
### 1.2.2 Biospecimen: sampling information -----------------------------------------------------------------------
sapply(projects, function(project){
  cat("Processing project:", project, "\n")

  query <- GDCquery(
    project = project,
    data.category = "Biospecimen",
    data.type = "Biospecimen Supplement",
    data.format = "BCR Biotab",
    access = "open"
  )
  GDCdownload(query, method = "api", files.per.chunk = 100)

  prepared_data <- GDCprepare(query, save = TRUE, save.filename = paste0(project, "_biospecimen_supplement_bcr_biotab.Rdata"))

})
### 1.2.3 Indexed: refined clinical data created from the XML files -------------------------------------------------
# Two single-project examples; separate names so the second does not overwrite the first
clinical_brca <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
clinical_target_rt <- GDCquery_clinic(project = "TARGET-RT", type = "clinical")
### 1.2.4 Diagnostic slides (SVS format) -------------------------------------------------------------------------
# A single dataset is 50-60 GB; not downloaded for now
sapply(projects, function(project){
  cat("Processing project:", project, "\n")

  query <- GDCquery(
    project = project,
    data.category = "Biospecimen",
    data.type = 'Slide Image',
    experimental.strategy = "Diagnostic Slide",  # two strategies exist: Tissue Slide and Diagnostic Slide (from the patient)
    # barcode = c("TCGA-RU-A8FL", "TCGA-AA-3972")
    access = "open"
  )
  GDCdownload(query, method = "api", files.per.chunk = 100)

  # NOTE: kept from the original; GDCprepare may not support slide images, so this step is untested
  prepared_data <- GDCprepare(query, save = TRUE, save.filename = paste0(project, "_biospecimen_slide_images.Rdata"))

})
# 2. RNA -------------------------------------------------------------------------
## 2.1 mRNA-------------------------------------------------------------------------
sapply(projects, function(project){
  # Query
  query <- GDCquery(
    project = project,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  # Download
  GDCdownload(query, method = "api", files.per.chunk = 100)

  # Prepare and save
  GDCprepare(query, save = TRUE, save.filename = paste0(project, "_mRNA.Rdata"))

})
## 2.2 microRNA-------------------------------------------------------------------------
sapply(projects, function(project){
  query <- GDCquery(
    project = project,
    data.category = "Transcriptome Profiling",
    data.type = "miRNA Expression Quantification"
  )

  GDCdownload(query)

  GDCprepare(query, save = TRUE, save.filename = paste0(project, "_miRNA.Rdata"))

})
## 2.3 isoform -------------------------------------------------------------------------
sapply(projects, function(project){
  query.isoform <- GDCquery(
    project = project,
    experimental.strategy = "miRNA-Seq",
    data.category = "Transcriptome Profiling",
    data.type = "Isoform Expression Quantification"
  )

  GDCdownload(query.isoform, method = "api", files.per.chunk = 100)
  GDCprepare(query.isoform, save = TRUE, save.filename = paste0(project, "_mirna-isoform.Rdata"))

})
# 3. SNP -------------------------------------------------------------------------
# Error seen when combining: Can't combine `..17$Tumor_Seq_Allele2` <character> and `..18$Tumor_Seq_Allele2` <logical>.
# Sample fields are inconsistent in some datasets, so merging fails often; download per cancer type and sample first
sapply(projects, function(project){
  query <- GDCquery(
    project = project,
    data.category = "Simple Nucleotide Variation",
    data.type = "Masked Somatic Mutation",
    access = "open"
  )

  GDCdownload(query, method = "api", files.per.chunk = 100)

  # GDCprepare(query, save = T,save.filename = file.path(project_dir,  paste0(project, "_SNP.Rdata")))

})
# 4. Protein -------------------------------------------------------------------------
# "TCGA-LAML" and "TCGA-THCA" fail: these two datasets have no protein results
sapply(projects, function(project){
  query <- GDCquery(
    project = project,
    data.category = "Proteome Profiling",
    data.type = "Protein Expression Quantification"
  )

  GDCdownload(query, method = "api", files.per.chunk = 100)

  GDCprepare(query, save = TRUE, save.filename = paste0(project, "_protein.Rdata"))

})
# 5. Methylation -------------------------------------------------------------------------
# Beta values come from 3 platforms: Illumina Human Methylation 27, Illumina Human Methylation 450, Illumina Methylation Epic
# A given cancer type does not necessarily have data from all 3 platforms
# This run starts with Illumina Human Methylation 27
# IDAT files hold the raw fluorescence intensities, while Beta values are a normalized representation of them; Beta values take priority here
## 5.1 Illumina Human Methylation 27---------------------------------------------------------------------
### Methylation Beta Value --------------------------------------------------------------------
# Per-project outcome ("-fail" means the dataset has no such result):
# "TCGA-BRCA" "TCGA-SARC-fail" "TCGA-ACC-fail"  "TCGA-UCEC" "TCGA-KIRC" "TCGA-LAML" "TCGA-SKCM-fail" "TCGA-PAAD-fail" "TCGA-TGCT-fail" "TCGA-CESC-fail" "TCGA-ESCA-fail" "TCGA-THCA-fail" "TCGA-LIHC-fail" "TCGA-PRAD-fail" "TCGA-READ"
# "TCGA-OV"   "TCGA-UVM-fail"  "TCGA-BLCA-fail" "TCGA-CHOL-fail" "TCGA-GBM"  "TCGA-UCS-fail"  "TCGA-PCPG-fail" "TCGA-MESO-fail" "TCGA-DLBC-fail" "TCGA-COAD" "TCGA-STAD" "TCGA-KIRP" "TCGA-THYM-fail" "TCGA-KICH-fail" "TCGA-LGG-fail"
# "TCGA-LUSC" "TCGA-LUAD" "TCGA-HNSC-fail"
projects <- c("TCGA-BRCA", "TCGA-UCEC", "TCGA-KIRC", "TCGA-LAML", "TCGA-READ", "TCGA-OV", "TCGA-GBM", "TCGA-COAD", "TCGA-STAD", "TCGA-KIRP", "TCGA-LUSC", "TCGA-LUAD")
# To rerun a single project, override the list, e.g.:
# projects <- c("TCGA-LUAD")
sapply(projects, function(project){

  query_methy <- GDCquery(
    project = project,
    data.category = "DNA Methylation",
    data.type = "Methylation Beta Value",
    platform = "Illumina Human Methylation 27"
  )

  GDCdownload(query_methy, method = "api", files.per.chunk = 100)
  GDCprepare(query_methy, save = TRUE, save.filename = paste0(project, "_METHY_beta_27.Rdata"))

})
### Masked Intensities --------------------------------------------------------------------
# Same per-project outcome as for the Beta values above
projects <- c("TCGA-BRCA", "TCGA-UCEC", "TCGA-KIRC", "TCGA-LAML", "TCGA-READ", "TCGA-OV", "TCGA-GBM", "TCGA-COAD", "TCGA-STAD", "TCGA-KIRP", "TCGA-LUSC", "TCGA-LUAD")
sapply(projects, function(project){

  query_methy <- GDCquery(
    project = project,
    data.category = "DNA Methylation",
    data.type = "Masked Intensities",
    platform = "Illumina Human Methylation 27"
  )

  GDCdownload(query_methy, method = "api", files.per.chunk = 50)
})
## 5.2 Illumina Human Methylation 450 -----------------------------------------------------------------
# NOTE: `projects` still holds the 27K list set above; reset it if other projects are needed
### Methylation Beta Value --------------------------------------------------------------------
sapply(projects, function(project){

  query_methy <- GDCquery(
    project = project,
    data.category = "DNA Methylation",
    data.type = "Methylation Beta Value",
    platform = "Illumina Human Methylation 450"
  )

  GDCdownload(query_methy, method = "api", files.per.chunk = 30)
})
### Masked Intensities --------------------------------------------------------------------
sapply(projects, function(project){

  query_methy <- GDCquery(
    project = project,
    data.category = "DNA Methylation",
    data.type = "Masked Intensities",
    platform = "Illumina Human Methylation 450"
  )

  GDCdownload(query_methy, method = "api", files.per.chunk = 30)
})
# 6. CNV ----------------------------------------------------------------------
## 6.1 Masked Copy Number Segment-------------------------------------------------------------------------
sapply(projects, function(project){

  query <- GDCquery(
    project = project,
    data.category = "Copy Number Variation",
    data.type = "Masked Copy Number Segment",
    access = "open"
  )

  GDCdownload(query, method = "api", files.per.chunk = 100)
  GDCprepare(query, save = TRUE, save.filename = paste0(project, "_CNV.Rdata"))

})
## 6.2 Copy Number Segment -------------------------------------------------------------------
sapply(projects, function(project){

  query <- GDCquery(
    project = project,
    data.category = "Copy Number Variation",
    data.type = "Copy Number Segment",
    access = "open"
  )

  GDCdownload(query, method = "api", files.per.chunk = 300)
})
## 6.3 Allele-specific Copy Number Segment ------------------------------------------------------------------
# Warning seen: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
sapply(projects, function(project){

  query <- GDCquery(
    project = project,
    data.category = "Copy Number Variation",
    data.type = "Allele-specific Copy Number Segment",
    access = "open"
  )

  GDCdownload(query, method = "api", files.per.chunk = 300)
})
## 6.4 Gene Level Copy Number -------------------------------------------------------------------------
# Warning seen: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
# Restricted project list used for this step
projects <- c("TCGA-STAD", "TCGA-KIRP", "TCGA-THYM", "TCGA-KICH", "TCGA-LGG", "TCGA-LUSC", "TCGA-LUAD", "TCGA-HNSC")
sapply(projects, function(project){

  query <- GDCquery(
    project = project,
    data.category = "Copy Number Variation",
    data.type = "Gene Level Copy Number",
    access = "open"
  )

  GDCdownload(query, method = "api", files.per.chunk = 30)
})
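
  Only the XML clinical step in the code above retries failed downloads; the other sections stop at the first network error. One way to reuse the same idea is a small generic wrapper. This is a minimal sketch in base R, not part of TCGAbiolinks: `with_retry` and `flaky` are hypothetical names, and the simulated failure stands in for a real download call.

```r
# Call `task()` up to `max_attempts` times; return its value on the first
# success, or NULL (with a warning) if every attempt raises an error.
# Assumption: a successful task never returns NULL.
with_retry <- function(task, max_attempts = 5) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(task(), error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
  }
  warning("Giving up after ", max_attempts, " attempts.")
  NULL
}

# Usage with a stand-in task that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "ok"
}
out <- with_retry(flaky)
```

  In the download loops, the body of each `sapply` callback could be passed as `task`, provided it ends by returning a non-NULL value such as `TRUE`.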

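References

(1) The most complete on the web! A 2021 collector's roundup of commonly used cancer bioinformatics databases (1)
(2) The most complete on the web! A 2021 collector's roundup of commonly used cancer bioinformatics databases (2)
(3) Worth saving: commonly used public medical databases (clinical, bioinformatics, and machine learning databases)
(4) A roundup of 6 major drug-sensitivity analysis databases to support cancer bioinformatics analysis! /SCI papers/research/graduate students/bioinformatics analysis
(5) Learning the new TCGA database: batch downloading data from the new TCGA
(6) TCGA data download: basic use of TCGAbiolinks
(7) New TCGA GDC Data Portal 2.0: interface overview and data download tutorial
(8) TCGA data download: data analysis with TCGAbiolinks
(9) [Introduction to the TCGA database and its applications]
(10) [Data analysis with the TCGA database]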
