[Seurat] How to analyze scRNA seq data in R, Bioinformagician

안녕하세요, 꿈꾸는 약사입니다.

이번에는 "Bioinformagician" youtube에서 소개하고 있는 내용으로

scRNA seq data에 대해 seurat obj화 시킨 후 이를 분석하는 workflow에 대해 설명드리겠습니다.

Load data

# 중간에 str()을 통해 data format에 대한 이해를 바탕으로 함

# seurat obj 선언에 필요한 feature matrix value는 h5 file의 Gene expression column에 있음

# script to perform standard workflow steps to analyze single cell RNA-Seq data
# data: 20k Mixture of NSCLC DTCs from 7 donors, 3' v3.1
# data source: https://www.10xgenomics.com/resources/datasets/10-k-human-pbm-cs-multiome-v-1-0-chromium-controller-1-standard-2-0-0         

# setwd("~/Desktop/demo/single_cell_RNASeq/scripts")

# load libraries
library(Seurat)
library(tidyverse)

# Load the NSCLC dataset
nsclc.sparse.m <- Read10X_h5(filename = 'C:\\RStudio\\R practice\\20k_NSCLC_DTC_3p_nextgem_Multiplex_count_raw_feature_bc_matrix.h5')
str(nsclc.sparse.m)
cts <-  nsclc.sparse.m$`Gene Expression`



# Initialize the Seurat object with the raw (non-normalized data).
nsclc.seurat.obj <- CreateSeuratObject(counts = cts, project = "NSCLC", min.cells = 3, min.features = 200)
str(nsclc.seurat.obj)
nsclc.seurat.obj

Quality Control

# View(Seurat@meta.data) 이용해서 meta.data가 어떤 내용으로 구성되어 있는지 확인

# Mitochondrial RNA 비율이 높거나, doublet 혹은 triplet single cell data를 제거해주기 위한 cutoff를 결정

# 1. QC ------- Remove low quality cell
# Those are doublets(or multiplets) or contaning high portion of mt genes
View(nsclc.seurat.obj@meta.data)
# % MT reads
# Calculate the percentage of the genes that starting wtih "^MT-"
nsclc.seurat.obj[["percent.mt"]] <- PercentageFeatureSet(nsclc.seurat.obj, pattern = "^MT-")
View(nsclc.seurat.obj@meta.data)

VlnPlot(nsclc.seurat.obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
FeatureScatter(nsclc.seurat.obj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") +
    geom_smooth(method = 'lm')

Filtering

# 앞서 결정한 cutoff value에 따라 high quality cell만 filtering

# 2. Filtering -----------------
# Based on QC analysis, filter out low-quality single cell data
nsclc.seurat.obj <- subset(nsclc.seurat.obj, subset = nFeature_RNA > 200 
                           & nFeature_RNA < 2500 & percent.mt < 5)

Normalize

# Normalize를 통해 각각의 gene expression을 전체 expression으로 나눠줌

# 이후 log transform 해준 뒤 scale factor 곱해줌

# 3. Normalize data ----------
# Divide gene expression of each cell by total expression
# multiplied by scaling factor and then log transform it
#nsclc.seurat.obj <- NormalizeData(nsclc.seurat.obj, normalization.method = "LogNormalize", scale.factor = 10000)
# OR
nsclc.seurat.obj <- NormalizeData(nsclc.seurat.obj)
str(nsclc.seurat.obj)

Identify high variable features

# Variation이 큰 상위 2000개의 gene만을 features로 결정

# Zero-inflation 감소 효과와 dimensionanlity reduction에 걸리는 시간을 줄여주는 효과

# Zero-inflation이란 실제로 expression이 0인 cell과 detection되지 않을 만큼 low expression 하는 cell 둘 다 0 값으로 읽힘에 따라 Zero를 갖는 cell data들이 도드라져 보이는 현상

# 4. Identify highly variable features --------------
nsclc.seurat.obj <- FindVariableFeatures(nsclc.seurat.obj, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(nsclc.seurat.obj), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(nsclc.seurat.obj)
LabelPoints(plot = plot1, points = top10, repel = TRUE)

Scaling

# Resource가 달라짐에 따라 같은 실험이어도 data값이 달라지는 batch effects를 줄여줌

# 뿐만 아니라 Dimensionality reduction에 필요한 data 전처리 단계

# 5. Scaling -------------
# Remove source variation called "batch effects"
all.genes <- rownames(nsclc.seurat.obj)
nsclc.seurat.obj <- ScaleData(nsclc.seurat.obj, features = all.genes)

str(nsclc.seurat.obj)

Dimensionality reduction

# 가장 variation을 잘 설명하는 Principle components 계산

# PCA 이해를 위한 interactive reference : https://setosa.io/ev/principal-component-analysis/

# 6. Perform Linear dimensionality reduction --------------
nsclc.seurat.obj <- RunPCA(nsclc.seurat.obj, features = VariableFeatures(object = nsclc.seurat.obj))

# visualize PCA results
print(nsclc.seurat.obj[["pca"]], dims = 1:5, nfeatures = 5)
DimHeatmap(nsclc.seurat.obj, dims = 1, cells = 500, balanced = TRUE)


# determine dimensionality of the data
ElbowPlot(nsclc.seurat.obj)

Clustering

# resolution 값에 따라 얼마나 가까운 population까지를 하나의 cluster로 인식할 것인지를 결정

# UMAP 이해를 위한 interactive reference : https://pair-code.github.io/understanding-umap/

# 7. Clustering ------------
nsclc.seurat.obj <- FindNeighbors(nsclc.seurat.obj, dims = 1:15)

# understanding resolution
nsclc.seurat.obj <- FindClusters(nsclc.seurat.obj, resolution = c(0.1,0.3, 0.5, 0.7, 1))
View(nsclc.seurat.obj@meta.data)

DimPlot(nsclc.seurat.obj, group.by = "RNA_snn_res.0.1", label = TRUE)

# setting identity of clusters
Idents(nsclc.seurat.obj)
Idents(nsclc.seurat.obj) <- "RNA_snn_res.0.1"
Idents(nsclc.seurat.obj)

# non-linear dimensionality reduction --------------
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
nsclc.seurat.obj <- RunUMAP(nsclc.seurat.obj, dims = 1:15)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(nsclc.seurat.obj, reduction = "umap")

- 출처 : Bioinformagician youtube, https://github.com/kpatel427/YouTubeTutorials/blob/main/singleCell_standard_workflow.R

저작자표시 비영리 변경금지

'생물정보학 > Bioinformagician' 카테고리의 다른 글

[Seurat] Reading single-cell data in R : H5AD, loom, MEX, AnnData formats, Bioinformagician (0)	2022.08.29