大连全瓷种植牙齿制作中心 posted on 2025-3-22 14:52:38

Smart-seq2 Analysis

1. Overview

Smart-seq2 is a single-cell RNA sequencing (scRNA-seq) method for profiling gene expression in individual cells.
https://img2024.cnblogs.com/blog/2493697/202503/2493697-20250322105611108-1160222543.png
https://img2024.cnblogs.com/blog/2493697/202503/2493697-20250322105125501-262290075.png
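2. Basic Principle

Smart-seq2 exploits two properties of Moloney murine leukemia virus reverse transcriptase (MMLV-RT):
[*]When the enzyme reaches the end of the template, it adds a few non-templated nucleotides to the 3' end of the cDNA, most often three cytosines.
[*]It is an RNA-directed DNA polymerase that uses RNA as the template and dNTPs as substrates, and it can switch templates: once a second template is annealed to the non-templated overhang, it continues synthesis along that template.
An oligo(dT)VN primer serves as the reverse-transcription primer and starts first-strand cDNA synthesis at the poly(A) tail. When the reverse transcriptase reaches the 5' end of the mRNA, it adds several cytosine (C) residues to the 3' end of the cDNA. The TSO (template-switching oligo) then anneals to this poly(C) overhang at the 3' end of the first strand. Through the template-switching activity of MMLV-RT, the enzyme continues across the TSO, which appends an adapter sequence to the 3' end of the cDNA; this adapter is the priming site for second-strand synthesis. The resulting cDNA is amplified by PCR, purified, and then used for sequencing.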
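Advantages

[*]Yields full-length cDNA, which allows analysis of alternative splicing and similar features.
[*]Broad coverage; rare transcripts can be detected.
[*]High sensitivity with low input: 1-1000 cells or 10 pg-10 ng of total RNA can be amplified efficiently.
Limitations

[*]Only RNA with a poly(A) tail can be analyzed.
[*]Not strand-specific.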
Pipeline Features | Description | Source
Assay Type | paired-end plate-based Smart-seq2 |
Overall workflow | Quality control module and transcriptome quantification module | Code available from GitHub
Workflow language | WDL | openWDL
Genomic reference sequence | GRCh38 human genome primary sequence | GENCODE
Gene Model | GENCODE v27 PRI GTF and Fasta files | GENCODE
Aligner | HISAT2 | Kim, et al., 2015; HISAT2 tool
QC | Metrics determined using Picard command line tools | Picard Tools
Estimation of gene expression | RSEM (rsem-calculate-expression) is used to estimate the gene expression profile. The input of RSEM is a bam file aligned by HISAT2. | Li and Dewey, 2011
Data Input File Format | File format in which sequencing data is provided | FASTQ
Data Output File Format | File formats in which Smart-seq2 pipeline output is provided | BAM, Zarr version 2

3. Data Preprocessing

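First, the raw sequencing data must be quality-checked and aligned. FastQC and MultiQC can be used to inspect data quality, and HISAT2 can be used to align the reads to the reference genome.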
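# Quality control (FastQC needs the output directory to exist already)
mkdir -p ./fastqc_result
fastqc -t 6 -o ./fastqc_result ./RAW/SRR*fastq.gz
multiqc ./fastqc_result

# Alignment
hisat2 -p 10 -x genome_index -1 sample_1.fastq.gz -2 sample_2.fastq.gz -S output.sam
samtools sort -O bam -@ 10 -o output.bam output.sam
samtools index output.bam

4. Expression Matrix Construction and Seurat Object Creation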

The featureCounts tool is then used to quantify the aligned BAM files and produce the gene expression matrix.
featureCounts -T 10 -p -t exon -g gene_name -a annotation.gtf -o counts.txt *.bam

The basic batch workflow is as follows:
#!/bin/bash
# Check that GNU Parallel is installed
if ! command -v parallel &> /dev/null; then
    echo "GNU Parallel could not be found, please install it first."
    exit 1
fi

# Create the namelist file listing every sample directory
ls -d */ > namelist

# Define the function that processes the RNA-seq reads of one sample
process_rna() {
    local sample=$1
    # Strip the trailing slash
    sample=${sample%/}
    echo "${sample} RNA processing start"

    if [ ! -d "${sample}" ]; then
        echo "Directory ${sample} does not exist, skipping."
        return
    fi

    # Enter the sample directory; give up on this sample if that fails
    cd "${sample}" || { echo "Failed to enter ${sample}"; return 1; }

    source /data5/xxx/zengchuanj/Software/MACS3/MyPythonEnv/bin/activate
    trim_galore -j 20 --phred33 --gzip --trim-n -o result --paired *.fastq.gz

    cd result || { echo "Failed to enter ${sample}/result"; return 1; }

    source /data5/tan/zengchuanj/conda/bin/activate
    conda activate HiC-Pro
    fastp -i "../${sample}_clean_R1.fastq.gz" -I "../${sample}_clean_R2.fastq.gz" -o "${sample}_clean_R1.fq.gz" -O "${sample}_clean_R2.fq.gz" -q 20 -w 16 -n 5
    fastqc -t 10 "${sample}_clean_R1.fq.gz" "${sample}_clean_R2.fq.gz"

    # HISAT2 reads the trim_galore outputs (*_val_1/2.fq.gz), not the fastp outputs;
    # its alignment summary goes to ${sample}_hisat.txt and the SAM stream is converted to BAM
    hisat2 -p 20 -x /data5/tan/zengchuanj/pipeline/HIC/juicer/references/mm10/mm10 -1 "${sample}_clean_R1_val_1.fq.gz" -2 "${sample}_clean_R2_val_2.fq.gz" 2>"${sample}_hisat.txt" | samtools view -b -o "${sample}_outname.bam" -

    samtools sort -@ 20 -o "${sample}.sort.bam" "${sample}_outname.bam"
    ## Quantification with featureCounts
    # Use the Ensembl GTF
    echo "${sample} Feature counts start"
    ### (with a UCSC GTF, use -t exon)
    # Adjust -t depending on whether counting is done per gene or per exon
    featureCounts -p --countReadPairs -T 20 -t gene -a /data5/xxx/zengchuanj/xxx/references/GTF/Mus_musculus.GRCm38.102.gtf -o "${sample}_count.txt" "${sample}.sort.bam"

    echo "${sample} RNA processing finish"
    # Return from ${sample}/result to the starting directory
    cd ../..
}

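# Export the function so that GNU Parallel can call it
export -f process_rna

# Process the samples listed in namelist in parallel, at most 10 at a time
# (each job itself requests up to 20 threads)
parallel -j 10 process_rna :::: namelist

# Define a function, process_smart_to_seurat_data, that processes Smart-seq data and creates a Seurat object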
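
The batch script writes one `${sample}_count.txt` per sample, so these files have to be combined into a single gene-by-sample matrix before they can be loaded into Seurat. Below is a minimal sketch of that step. It assumes the standard featureCounts layout (a `#` comment line, a header line, then Geneid, Chr, Start, End, Strand, Length, with the count in column 7). The function name `merge_counts` and the output file name in the usage note are hypothetical and not part of the original post.

```shell
#!/bin/bash
# merge_counts: combine featureCounts tables into one gene x sample matrix.
# Hypothetical helper; assumes column 1 = Geneid and column 7 = count,
# and that the sample name is the file name without the _count.txt suffix.
merge_counts() {
    awk -F'\t' '
        FNR == 1 {                          # first line of each input file
            n = split(FILENAME, parts, "/")
            name = parts[n]
            sub(/_count\.txt$/, "", name)   # derive the sample name
            samples[++ns] = name
        }
        /^#/ || $1 == "Geneid" { next }     # skip comment and header lines
        {
            # remember genes in the order they are first seen
            if (!($1 in seen)) { seen[$1] = 1; genes[++ng] = $1 }
            count[$1, ns] = $7
        }
        END {
            printf "Geneid"
            for (s = 1; s <= ns; s++) printf "\t%s", samples[s]
            printf "\n"
            for (g = 1; g <= ng; g++) {
                printf "%s", genes[g]
                for (s = 1; s <= ns; s++) {
                    key = genes[g] SUBSEP s
                    # genes missing from a file are written as 0
                    printf "\t%s", ((key in count) ? count[key] : 0)
                }
                printf "\n"
            }
        }
    ' "$@"
}
```

With the directory layout used by the script, a call could look like `merge_counts */result/*_count.txt > count_matrix.tsv`. The resulting tab-separated table has genes as rows and samples (cells) as columns, which is the orientation Seurat expects for a count matrix.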

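
The batch script also redirects HISAT2's standard error to `${sample}_hisat.txt`, which holds the alignment summary. Since the quality of single-cell libraries varies, it can be useful to scan these files for poorly aligned samples. The sketch below assumes the default HISAT2 summary format, whose last line reads `NN.NN% overall alignment rate`. The function names and the cutoff of 50 in the usage note are illustrative assumptions, not values from the original post.

```shell
#!/bin/bash
# alignment_rate: print the overall alignment rate (in percent) found in
# one HISAT2 summary file. Hypothetical helper; assumes the default
# summary format ("NN.NN% overall alignment rate").
alignment_rate() {
    awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# flag_low_alignment: print "file<TAB>rate" for every summary file whose
# rate is below the cutoff. A file without a summary line is reported too,
# with an empty rate.
# Usage: flag_low_alignment CUTOFF summary1.txt summary2.txt ...
flag_low_alignment() {
    local cutoff=$1 f rate
    shift
    for f in "$@"; do
        rate=$(alignment_rate "$f")
        # awk does the comparison because bash arithmetic is integer-only
        if awk -v r="$rate" -v c="$cutoff" 'BEGIN { exit !(r + 0 < c + 0) }'; then
            printf '%s\t%s\n' "$f" "$rate"
        fi
    done
}
```

For example, `flag_low_alignment 50 */result/*_hisat.txt` would list the samples that aligned at less than 50%. What counts as an acceptable rate depends on the experiment, so the cutoff should be chosen after looking at the distribution across all cells.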