pheatmap 값에 따른 color 범위 조절하기

heatmap에서는 일반적으로 발현량을 RdBl 색깔로 표현한다.

p-value를 여기서 적용하면 0.05라는 경계를 두고 보기에는 조금 불편하다. 0.6과 0.4를 사람 눈으로는 구별 할 수 없기 때문이다.

value에 따라서 구분하는 방법을 설명하고자 한다.

값이 -1부터 1까지 존재한다. ( DEG 분석에서 발현이 증가하는지 감소하는지 여부를 나타내려고 했다. )

1-adjust pvalue니까 0.95이상이거나 -0.95이하일 때가 p-value 0.05 이하이다.

library(RColorBrewer)

pheatmap(df, legend_breaks = c(-0.95, 0, 0.95, max(df)), legend_labels = c("-0.95", "0", "0.95", "1-(p.adj)\n"), legend=T, color = RColorBrewer::brewer.pal(7,"RdYlBu"), breaks = c(-1,-0.95,-0.9,-0.3,0.3,0.9,0.95,1))

legend_breaks = 범례에 글이 표시될 부분을 마크한다. 글자 크기가 너무 커서 전부를 표시하려 하지는 않았다.

legend_labels = breaks가 가리키는 포인트에 입력될 문자열을 넣으면 된다.

legend = T 범례를 표시할 것인지 여부. default는 F이다.

color = plot에 들어갈 색깔을 지정한다. 여기서는 Read/Yellow/Blue의 순서대로 7개로 나누어진 palette를 의미한다.

breaks = 데이터를 어떻게 나눌지를 의미한다. -1에서 -0.95 사이의 값이 color에서 1번 색깔로 지정 될 것이다.

breaks를 더 늘리게 되면 palette의 숫자 7도 여기에 맞춰서 늘려주면 된다. 색을 세 개를 혼합하였다면 가능하면 홀수로 맞추도록 하자.

Reference -

https://stackoverflow.com/questions/32545256/define-specific-value-colouring-with-pheatmap-in-r

저작자표시 비영리 변경금지 (새창열림)

'Computer Science > R' 카테고리의 다른 글

Gene id conversion in R (0)	2018.11.16
Kegg pathway에 속하는 유전자 정보 가져오기 (2)	2018.09.11
pheatmap으로 heatmap그리기 (0)	2018.09.11
R에서 Dataframe 합치기 (0)	2018.09.05
DESeq2 에서 multiple condition 수행하기 (1)	2018.07.27

pheatmap으로 heatmap그리기

pheatmap은 pretty heamaps의 약자로 heatmap을 그릴 때 더 이쁘고 쉽게 그려보자는 취지에서 만든 R 패키지이다. 주로 사용하게 되는 몇 가지 예시에 대해서 더 자세하게 정리해보고자 한다.

첫 스탭은 당연히 pheatmap을 불러오는 것.

library("pheatmap")

두 번째는 범례를 만드는 법이다.

데이터 frame에 세포주와 약물을 처리했을 때 샘플들이 많으면 한 눈에 구분하기가 쉽지 않다. 구분하기 쉬운 색상을 미리 골라서 보기 쉽게 분리하도록 하자.

cell_line = c("cell_1","cell_2","cell_3","cell_4","cell_5")

drug = c("drug_1","drug_2","durg_3")

## annotation dataframe

anno.df = data.frame(cell_line=anno_cell_line, drug=anno_drug)

rownames(anno.df) = anno_samples

## annotation color

ann_colors = list(

cell_line = c("cell_1" = "#235725", cell_2 = "#1FD7F1", cell_3 = "#E4AF30", cell_4 = "#CADF7C", cell_5 = "#Fc0D7D"),

drug = c(drug_1 = "#BDBE32", drug_2 = "#FE9089", "drug_3" = "#82B7FC")

)

위의 예시는 cell 5 종류와 drug 3 종류를 구분하였고 data frame을 만들었다.

anno_samples는 cell_line과 drug 사이에 _를 넣고 붙인 리스트이다. 나중에 데이터와 일치화 시켜야 하기 때문에 상황에 맞게 다르게 줘야 할 수도 있다.

data frame이 제대로 만들어 졌다면 annotation 정보를 가지고 있는 아래와 같은 구조를 가질 것이다.

cell_line drug

cell_1_drug_1 cell_1 drug_1

cell_1_drug_2 cell_1 drug_2

cell_1_drug_3 cell_1 drug_3

cell_2_drug_1 cell_2 drug_1

cell_2_drug_2 cell_2 drug_2

cell_2_drug_3 cell_2 drug_3

cell_3_drug_1 cell_3 drug_1

cell_3_drug_2 cell_3 drug_2

cell_3_drug_3 cell_3 drug_3

cell_4_drug_1 cell_4 drug_1

cell_4_drug_2 cell_4 drug_2

cell_4_drug_3 cell_4 drug_3

cell_5_drug_1 cell_5 drug_1

cell_5_drug_2 cell_5 drug_2

cell_5_drug_3 cell_5 drug_3

이후에 데이터가 들어가 있는 data frame과 rowname을 일치시켜주고 plot을 그리면 된다.

names(df) = anno_samples

pdf("test.pdf")

pheatmap(df, cluster_rows=F, cluster_cols=F, show_rownames=T, annotation_col=anno.df, annotation_colors = ann_colors, main=("pheatmap_test"), fontsize_row=6, legend_breaks = c(-0.5, 0, 0.5, max(newdf)), legend_labels = c("-0.5", "0", "0.5", "1-(q-value)\n"), legend=T)

dev.off()

＊여기서 annotation 정보가 있는 data frame의 row name이 데이터가 들어가 있는 data frame의 column name과 일치되어야 한다.

rownames(anno.df) = anno_samples

names(df) = anno_samples

행과 열을 clustering하는 여부나 이름 표시 여부 등은 아래 메뉴얼을 참고하는 것이 더 빠를 듯 하다. 대부분은 직관적으로 이해할 수 있다.

legend_breaks와 legend_labels은 데이터 표시 범위의 범례를 조절하는데 default로 놓아도 크게 무리없는듯 하다.

예시대로 plot을 그리면 아래처럼 나온다. 색상은 직접 조절하도록 하자.

2018/08/31 - [etc.] - 16진수 RGB코드 알아내는법

Reference -

https://cran.r-project.org/web/packages/pheatmap/pheatmap.pdf

저작자표시 비영리 변경금지 (새창열림)

'Computer Science > R' 카테고리의 다른 글

pheatmap 값에 따른 color 범위 조절하기 (0)	2018.09.12
Kegg pathway에 속하는 유전자 정보 가져오기 (2)	2018.09.11
R에서 Dataframe 합치기 (0)	2018.09.05
DESeq2 에서 multiple condition 수행하기 (1)	2018.07.27
centos 6에서 R 설치를 위한 라이브러리 설치 (2)	2018.07.25

DESeq2에서 heatmap, PCA, MA, volcano plot 그리기

분석을 위한 DataSet이 필요하다. 예시에서는 여러 cell line에 몇 가지 drug를 각기 다른 dose로 처리했을 때의 결과를 분석하고자 한다.

DEG분석은 dose에만 초점이 맞춰지도록 디자인 하였다.

res변수는 결과가 adjust p-value 낮은순서대로 sorting하여 저장하였다.

sampleTable<-data.frame(sampleName=sampleFiles, fileName=sampleFiles, dose=I(dose), drug=drug, cell_line=cell_line)

ddsHTSeq<-DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=~dose)

dds<-DESeq(ddsHTSeq)

res <- results(dds)

res <- res[order(res$padj),]

heatmap을 그리기 위한 color를 지정해 준다. 지정하지 않으면 그림을 그릴 때 마다 다른 color가 나오기 때문에 결과를 정리하는데 매 번 수정하기가 상당히 귀찮기 때문이다. 각 항목에 맞는 값을 넣어주고 뒤에 16진법 RGB코드를 넣어주면 된다.

16진법 RGB코드는 아래 포스팅한 방법으로 찾으면 쉽게 찾을 수 있다.

2018/08/31 - [etc.] - 16진수 RGB코드 알아내는법

## annotation color

ann_colors = list(

dose = c(".." = "#F9EBD1"),

cell_line = c(".." = "#FFFFFF"),

drug = c(".." = "#BDBE32")

)

heatmap을 그리기 위해서 pheatmap 라이브러리를 로딩한다. order뒤에 [1:1000]이라고 되어 있는 부분은 유전자가 너무 많아서 발현량 기준으로 상위 1000개만 가져온 것이다. 전부 넣으려다가는 메모리 덤프가 생길 것이다. 1000개라는 갯수는 임의로 설정한 것이다.

## heatmap

library("pheatmap")

ntd <- normTransform(dds)

select <- order(rowMeans(counts(dds,normalized=TRUE)),

decreasing=TRUE)[1:1000]

df <- as.data.frame(colData(dds)[,c("cell_line","drug","dose")])

rownames(df) = sampleFiles

pdf("heatmap_normalized_readcount.pdf")

pheatmap(assay(ntd)[select,], cluster_rows=T, show_rownames=F,

cluster_cols=T, annotation_col=df, annotation_colors = ann_colors)

dev.off()

plotPCA는 함수 자체를 다시 정의한 것인데 이유는 DESeq2에서 제공하는 plotPCA함수는 PC1과 PC2밖에 고려하지 않기 때문이다.

마지막 ggplot을 불러오는 함수에서 x와 y를 PC3, PC4로 바꾸면 결과가 달라진다. PCA 분석에서 항상 PC1과 PC2의 조합이 정답이 아니기 때문에 필요한 방법이다.

## PCA

plotPCA <- function (object, intgroup = "condition", ntop = 500, returnData = FALSE)

{

rv <- rowVars(assay(object))

select <- order(rv, decreasing = TRUE)[seq_len(min(ntop,

length(rv)))]

pca <- prcomp(t(assay(object)[select, ]))

percentVar <- pca$sdev^2/sum(pca$sdev^2)

if (!all(intgroup %in% names(colData(object)))) {

stop("the argument 'intgroup' should specify columns of colData(dds)")

}

intgroup.df <- as.data.frame(colData(object)[, intgroup,

drop = FALSE])

group <- if (length(intgroup) > 1) {

factor(apply(intgroup.df, 1, paste, collapse = " : "))

}

else {

colData(object)[[intgroup]]

}

d <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], PC3 = pca$x[, 3], PC4 = pca$x[, 4], group = group,

intgroup.df, name = colnames(object))

if (returnData) {

attr(d, "percentVar") <- percentVar[1:2]

return(d)

}

ggplot(data = d, aes_string(x = "PC1", y = "PC2", color = "group")) +

geom_point(size = 3) + xlab(paste0("PC1: ", round(percentVar[1] *

100), "% variance")) + ylab(paste0("PC2: ", round(percentVar[2] *

100), "% variance")) + coord_fixed()

}

se <- SummarizedExperiment(log2(counts(dds, normalized=TRUE) + 1), colData=colData(dds,levels=c('0','0.1','1')))

pdf("PCA_normlaized_readcount.pdf")

plotPCA(DESeqTransform(se),intgroup=c("cell_line","drug","dose"),ntop=500)

dev.off()

MAplot은 유전자의 발현 값과 p-value의 상관관계를 보기 위해서 사용된다.

#MAplot

pdf("MA_plot.pdf")

plotMA(res, ylim=c(-2,2))

dev.off()

Volcano plot은 각 유전자의 log2FoldChange와 p-value를 보기 위해 사용된다. plotting을 하는건 위의 4번째 줄 까지만 입력하면 되지만 특정 조건에 맞는 유전자일 때 색을 다르게 주거나 cutoff을 선으로 표시하기 위한 줄이 추가되어 있다.

#volcano plot

pdf("volcano_plot.pdf")

par(mar=c(5,5,5,5), cex=1.0, cex.main=1.4, cex.axis=1.4, cex.lab=1.4)

topT <- as.data.frame(res)

with(topT, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot", cex=1.0, xlab=bquote(~Log[2]~fold~change), ylab=bquote(~-log[10]~Q~value)))

with(subset(topT, padj<0.05 & abs(log2FoldChange)>2), points(log2FoldChange, -log10(padj), pch=20, col="red", cex=0.5))

#Add lines for absolute FC>2 and P-value cut-off at FDR Q<0.05

abline(v=0, col="black", lty=3, lwd=1.0)

abline(v=-2, col="black", lty=4, lwd=2.0)

abline(v=2, col="black", lty=4, lwd=2.0)

abline(h=-log10(max(topT$pvalue[topT$padj<0.05], na.rm=TRUE)), col="black", lty=4, lwd=2.0)

dev.off()

저작자표시 비영리 변경금지 (새창열림)

'bioinformatics' 카테고리의 다른 글

oncotator 설치 및 실행하기 (0)	2018.10.04
liftover하기 (0)	2018.09.28
Optical duplicate와 Library duplicates (0)	2018.08.27
Genome에 존재하는 Variant란? (0)	2018.08.20
DESeq2에서 연속적인 값을 condtion으로 받기 (0)	2018.08.10

Be great

pheatmap