Bioinformatics Notes - Downloading and Organizing RNA-Seq STAR-Counts Data After the Major 2022 TCGA Database Update


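A reader recently left a comment saying that the TCGA database has been updated and that the downloaded data no longer look the way they used to. Take transcriptome data: the files used to come from the HTSeq workflow, and now they are STAR-Counts data. For the details of the change, see: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-320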

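When you open a downloaded file, you will see that the different quantification values for a sample (counts, TPM, FPKM) are now all stored together in one file.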

Here is how to extract the data.

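The download procedure is the same as in the earlier tutorial [14 - Downloading and Organizing TCGA Database Data], except that you now select STAR-Counts. After adding the files to the cart, download the data archive and the metadata JSON file.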

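I will start by writing two functions. The first one reads the JSON file, which records the file information and the mapping between each file and its sample barcode.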

processingJsonFiles <- function(jsonFile) {
  library(rjson)
  # Parse the metadata JSON downloaded from the cart
  metadata_json_File <- fromJSON(file = jsonFile)
  json_File_Info <- data.frame(filesName = c(), TCGA_Barcode = c())
  for (i in seq_along(metadata_json_File)) {
    # Barcode of the first entity associated with this file
    TCGA_Barcode <- metadata_json_File[[i]][["associated_entities"]][[1]][["entity_submitter_id"]]
    file_name <- metadata_json_File[[i]][["file_name"]]
    json_File_Info <- rbind(json_File_Info,
                            data.frame(filesName = file_name, TCGA_Barcode = TCGA_Barcode))
  }
  # Use the file names as row names and keep only the barcode column
  rownames(json_File_Info) <- json_File_Info[, 1]
  json_File_Info <- json_File_Info[-1]
  return(json_File_Info)
}

jsonFile is the full path to the downloaded JSON file.
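To make the mapping concrete, here is a minimal sketch in base R. The list toy_metadata is a hand-written stand-in for what fromJSON returns for the metadata file; its file names and barcodes are invented for illustration, and barcode_map is a hypothetical helper of mine, not part of the functions above. It builds the same file-name-to-barcode table without growing a data frame inside a loop.

```r
# Toy stand-in for the parsed metadata JSON (all values are invented)
toy_metadata <- list(
  list(file_name = "aaa.rna_seq.augmented_star_gene_counts.tsv",
       associated_entities = list(
         list(entity_submitter_id = "TCGA-XX-0001-01A-11R-0000-07"))),
  list(file_name = "bbb.rna_seq.augmented_star_gene_counts.tsv",
       associated_entities = list(
         list(entity_submitter_id = "TCGA-XX-0002-11A-11R-0000-07")))
)

# Hypothetical helper: one row per file, with file names as row names
barcode_map <- function(metadata) {
  file_names <- vapply(metadata, function(x) x[["file_name"]], character(1))
  barcodes <- vapply(
    metadata,
    function(x) x[["associated_entities"]][[1]][["entity_submitter_id"]],
    character(1)
  )
  data.frame(TCGA_Barcode = barcodes, row.names = file_names,
             stringsAsFactors = FALSE)
}

info <- barcode_map(toy_metadata)
# Look up a barcode by file name, the same way the extraction function does
info["aaa.rna_seq.augmented_star_gene_counts.tsv", "TCGA_Barcode"]
```

Like the loop version, this only takes the first associated entity of each file, so it assumes one sample per file.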

The second function extracts the data.

getTCGA_RNAseq_data = function(filepath, jsonFileInfo, data_type) {
  datamatrix = data.frame()
  for (wd in filepath) {
    # Each iteration reads one file
    filename <- basename(wd)
    message(paste0("WeChat official account MedBioInfoCloud: reading file:\n", filename))
    oneSampExp <- read.table(wd, comment.char = "#", header = TRUE, sep = "\t")
    # Drop the first four rows, which are alignment summary rows rather than genes
    oneSampExp = oneSampExp[-c(1:4), ]
    # Name the sample column using the file name to barcode mapping in jsonFileInfo
    if (wd == filepath[1]) {
      # First file: keep the gene annotation columns as well
      oneSampExp = oneSampExp[, c("gene_id", "gene_name", "gene_type", data_type)]
      colnames(oneSampExp) <- c("gene_id", "gene_name", "gene_type", jsonFileInfo[filename, "TCGA_Barcode"])
      datamatrix = oneSampExp
    } else {
      # Later files: keep only gene_id and the requested value, then merge
      oneSampExp = oneSampExp[, c("gene_id", data_type)]
      colnames(oneSampExp) <- c("gene_id", jsonFileInfo[filename, "TCGA_Barcode"])
      datamatrix = merge(datamatrix, oneSampExp, by = "gene_id")
    }
  }
  return(datamatrix)
}

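filepath is the vector of paths to the downloaded data files, obtained with dir or a similar function. For example, the download arrives as a compressed archive; after extracting it, rename the extracted folder to data.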

filepath = dir(path = "./data", pattern = "counts.tsv$", full.names = T, recursive = T)

jsonFileInfo is the result returned by the processingJsonFiles function.

data_type is one of the following:

"unstranded";

"stranded_first";

"stranded_second";

"tpm_unstranded";

"fpkm_unstranded";

"fpkm_uq_unstranded"

These correspond to the column names in each file.
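If you want to see how these names line up with a file, the sketch below writes a tiny file to a temporary location and reads it back the same way the extraction function does. The layout (one comment line, a header row, four summary rows, then genes) is my assumption about the STAR-Counts format, and every value in it is invented.

```r
# Column names as listed above, preceded by the three annotation columns
header <- c("gene_id", "gene_name", "gene_type", "unstranded",
            "stranded_first", "stranded_second", "tpm_unstranded",
            "fpkm_unstranded", "fpkm_uq_unstranded")

# Helper to join fields with tabs
tsv <- function(x) paste(x, collapse = "\t")

# Toy file content (assumed layout, invented numbers)
toy_lines <- c(
  "# gene-model: GENCODE v36",
  tsv(header),
  tsv(c("N_unmapped", "", "", "100", "100", "100", "", "", "")),
  tsv(c("N_multimapping", "", "", "50", "50", "50", "", "", "")),
  tsv(c("N_noFeature", "", "", "30", "40", "40", "", "", "")),
  tsv(c("N_ambiguous", "", "", "20", "5", "5", "", "", "")),
  tsv(c("ENSG00000000003.15", "TSPAN6", "protein_coding",
        "10", "5", "5", "1.5", "0.8", "0.9")),
  tsv(c("ENSG00000000005.6", "TNMD", "protein_coding",
        "0", "0", "0", "0", "0", "0"))
)
toy_file <- tempfile(fileext = ".tsv")
writeLines(toy_lines, toy_file)

# Same reading steps as in the extraction function
toy <- read.table(toy_file, comment.char = "#", header = TRUE, sep = "\t")
toy <- toy[-c(1:4), ]  # drop the four summary rows

colnames(toy)[4:9]                      # the six possible data_type values
toy[, c("gene_id", "tpm_unstranded")]   # pick one of them
```

Whichever name you pass as data_type is simply used to pick one of these columns from every file.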

Now you can extract the data, whichever type you need. Usually that means TPM and FPKM.

jsonFileInfo <- processingJsonFiles(jsonFile = "metadata.cart.2022-04-05.json")
filepath = dir(path = "./data", pattern = "counts.tsv$", full.names = T, recursive = T)
dat = getTCGA_RNAseq_data(filepath = filepath, jsonFileInfo = jsonFileInfo, data_type = "fpkm_unstranded")
head(dat)[, 1:5]
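The result is a data frame with three annotation columns followed by one column per sample. Many downstream tools want a numeric matrix instead, so here is one possible way to split the two parts. This is only a sketch: toy_dat is an invented stand-in for dat, and the names annotation and expr are my own.

```r
# Toy stand-in for the data frame returned by getTCGA_RNAseq_data
# (gene values and barcodes are invented)
toy_dat <- data.frame(
  gene_id = c("ENSG00000000003.15", "ENSG00000000005.6"),
  gene_name = c("TSPAN6", "TNMD"),
  gene_type = c("protein_coding", "protein_coding"),
  "TCGA-XX-0001-01A-11R-0000-07" = c(0.8, 0.0),
  "TCGA-XX-0002-11A-11R-0000-07" = c(1.2, 0.3),
  check.names = FALSE,       # keep the hyphens in the barcodes
  stringsAsFactors = FALSE
)

# Keep the gene annotation in its own table
annotation <- toy_dat[, c("gene_id", "gene_name", "gene_type")]

# Everything after the first three columns is expression data
expr <- as.matrix(toy_dat[, -(1:3)])
rownames(expr) <- toy_dat$gene_id

dim(expr)
expr["ENSG00000000003.15", "TCGA-XX-0001-01A-11R-0000-07"]
```

I use gene_id for the row names because it is the column the files are merged on; gene_name may not be unique, so it stays in the annotation table.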