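Drawing a Circos plot.

Several programs can draw Circos plots, but I recommend the one below because its input data format is convenient and it can produce the figures you want.

It runs on Perl, so if Perl is not installed, install it first.

Download: http://circos.ca/software/download/

After installation, the circos executable is in the bin folder. It needs a number of Perl modules to run.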

cpan Config::General Font::TTF GD List::MoreUtils Math::Bezier Math::Round Math::VecStat Params::Validate Readonly Regexp::Common Set::IntSpan Text::Format SVG Clone Statistics::Basic

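With a default Perl installation, installing the modules above will probably be enough.

[Figure: Application of Circos to displaying sequence conservation and similarity.]

Circos has so many features that I have not tried them all, but for the basics the best practice pages provide configuration files together with the resulting figures, so you can simply follow along.

Below I summarize only the most basic points.

The program reads a configuration file, checks the settings and input data it specifies, and draws the plot.

The folder where circos is installed contains a circos.conf file that you can use as a reference.

1. You need a karyotype, which serves as the backbone of the plot. The default path is data/karyotype, which already contains chromosome information for human and several other species.

In circos.conf, the human data is used as follows.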

karyotype = data/karyotype/karyotype.human.txt

To draw several genomes at once, separate the files with a comma (,).

karyotype = data/karyotype/karyotype.human.txt,data/karyotype/karyotype.chimp.txt

The hg19 file has the following format.

chr - hs1 1 0 249250621 chr1
chr - hs2 2 0 243199373 chr2
chr - hs3 3 0 198022430 chr3
...
chr - hs22 22 0 51304566 chr22
chr - hsX x 0 155270560 chrx
chr - hsY y 0 59373566 chry

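In order, the fields are: the keyword chr (chromosome), a placeholder (-), an ID that distinguishes chromosomes when data from several genomes are loaded, the label shown on the plot, the start position, the end position (the chromosome size), and the color.

The third column matters when, for example, human and chimpanzee are drawn together: in the input data, hs1 then refers to the human chromosome and chim1 to the chimpanzee one.

The color field can be written as chr1 because etc/colors.ucsc.conf already defines a color with that name. An RGB value or another color specification also works.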
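To make the column layout concrete, here is a minimal Python sketch that splits one karyotype line into named fields. The names KaryotypeChromosome and parse_karyotype_line are made up for this illustration and are not part of Circos; the sketch assumes whitespace-separated lines with exactly seven fields, as in the hg19 excerpt above.

```python
from typing import NamedTuple, Optional


class KaryotypeChromosome(NamedTuple):
    """One 'chr' line of a karyotype file (hypothetical helper type)."""
    chrom_id: str   # ID used in data files, e.g. hs1
    label: str      # text shown on the plot, e.g. 1
    start: int      # start position
    end: int        # end position (chromosome size)
    color: str      # color name or other color specification


def parse_karyotype_line(line: str) -> Optional[KaryotypeChromosome]:
    """Return the parsed record, or None if this is not a 'chr' line."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "chr":
        return None
    # The second field is the '-' placeholder and is ignored here.
    _keyword, _placeholder, chrom_id, label, start, end, color = fields
    return KaryotypeChromosome(chrom_id, label, int(start), int(end), color)


record = parse_karyotype_line("chr - hs1 1 0 249250621 chr1")
print(record)
```

Lines that do not start with chr return None, so the helper can be run over a whole file and the None values filtered out.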

Attribution, Non-Commercial, No Derivatives


Installing RepeatMasker

Go to http://www.repeatmasker.org/RMDownload.html

1. Check that the Perl version is 5.8.0 or later.

2. Download the program to use as the search engine.

3. Download RepeatMasker.

tar zxf RepeatMasker-open-?-?-?.tar.gz

cd RepeatMasker

perl ./configure

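The configure script asks for the following:

1. The PATH of the perl to use

2. The PATH where RepeatMasker will be installed

3. The PATH of trf (this must include the trf executable itself)

4. The search engine to install. At least one must be installed; this time I entered its bin folder as the path.

That completes the installation.

RepeatMasker 4.0.6 requires a library update, so it needs extra work; 4.0.7 can proceed as is.

Once installed, the run command is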

RepeatMasker -species <human> -q <hg38.fa>

The human genome takes about one week.

※ RepeatMasker uses trf to find simple repeats, and as of 4.0.6 trf requires the GLIBC_2.14 library.

error message = trf409.linux64: /lib64/libc.so.6: version `GLIBC_2.14' not found

Even when trf fails to run, RepeatMasker still produces output, so it is easy to assume the run was fine when it was not.

Run trf on its own beforehand and confirm that it actually produces results.
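One way to do that check from a script is sketched below in Python. This is only an illustration: the function names are made up, and it assumes that starting trf without arguments is enough to trigger the GLIBC message on stderr when the library is missing. It does not replace running trf on a small test sequence.

```python
import subprocess


def stderr_shows_glibc_error(stderr_text: str) -> bool:
    """True if the text looks like the 'GLIBC_x.y not found' loader error."""
    return "GLIBC_" in stderr_text and "not found" in stderr_text


def trf_starts(trf_path: str) -> bool:
    """Try to launch trf; False if it cannot start or reports the GLIBC error."""
    try:
        result = subprocess.run(
            [trf_path], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing file, no execute permission, or the process hung.
        return False
    return not stderr_shows_glibc_error(result.stderr)


message = "trf409.linux64: /lib64/libc.so.6: version `GLIBC_2.14' not found"
print(stderr_shows_glibc_error(message))
```

Calling trf_starts with the same trf path that was given to the RepeatMasker configure step would be the natural use.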


Installing GBrowse2

Perl and Apache must be installed.

I have read that the Windows version was supported up to GBrowse 1.70, but I cannot confirm that now (I tried and did not succeed).

GBrowse2 download URL: https://sourceforge.net/projects/gmod/files/Generic%20Genome%20Browser/

Many Perl modules must be installed.

Run perl Build.PL to set up the initial configuration. If the Perl module Module::Build is not installed, run cpan Module::Build first.

After that, ./Build installdeps installs the dependencies automatically, but some parts need manual installation.

1.

Please enter the location of Kent source tree:

Can't find the bigWig.h and jkweb.a files at this location.

Try again, or hit <enter> to cancel:

For the Kent source tree, install kentUtils.

git clone https://github.com/ENCODE-DCC/kentUtils.git

After cloning from git, README.md describes how to install it.

cd kentUtils && make

export KENT_SRC=/PATH/TO/INSTALL/kentUtils/src:$KENT_SRC

2.

Running install for module 'Bio::DB::Sam'

Checksum for /home/kyoungwoo/.cpan/sources/authors/id/L/LD/LDS/Bio-SamTools-1.43.tar.gz ok

Configuring L/LD/LDS/Bio-SamTools-1.43.tar.gz with Build.PL

This module requires samtools 0.1.10 or higher (samtools.sourceforge.net).

Please enter the location of the bam.h and compiled libbam.a files:

Enter the path of the samtools source directory, not the directory holding the samtools binaries.

Once the required Perl modules are installed, run ./Build test and ./Build install to build, then run ./Build apache_config and copy the config it prints to the screen.

Paste it into the Apache config file and you are done.

The Apache config file is /etc/httpd/conf/httpd.conf. After editing it, restart Apache: apachectl -k graceful

Open a browser and go to localhost/gbrowse2.


Installing and running Perl

Download it from the official site, https://www.perl.org/get.html (the stable source code is recommended).

wget http://www.cpan.org/src/5.0/perl-5.24.1.tar.gz (stable version as of 4/25/2017)

Run less README to see the installation guide.

./Configure -des -Dprefix=$HOME/localperl -Dusethreads

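-des = answers the questions asked during configuration with their defaults.

-Dprefix = the path where Perl will be installed.

-Dusethreads = some programs require Perl's multithreading, so enabling it at compile time avoids a reinstall later. Note that with this option single-threaded programs can run slightly slower. (Recommended.)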

make test && make install

make takes a considerable amount of time.

After the install finishes, export the bin and lib folders under the path given to -Dprefix, and you are done.


Finding floats mixed into a string

Parsing a line like the one below is awkward when the delimiter is not \t but a run of spaces whose length varies from field to field.

A regular expression handles this.

julia> a
" t= 0.2652 S= 38.3 N= 99.7 dN/dS= 0.5082 dN = 0.0697 dS = 0.1371\n"

julia> matchall(r"\d+\.\d+",a)
6-element Array{SubString{String},1}:
 "0.2652"
 "38.3"
 "99.7"
 "0.5082"
 "0.0697"
 "0.1371"

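Take the list above and read the number at the index you want. The elements are strings, so convert them to floats before using them. (matchall was removed in Julia 1.0; in current versions, collect the matches from eachmatch instead.)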
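For comparison, the same idea including the conversion to float can be sketched in Python. This is an illustrative equivalent, not the Julia code from this post; extract_floats is a name made up here, and the pattern only covers plain decimals of the form digits.digits (no sign, no exponent).

```python
import re

# Digits, a literal dot, digits. The dot is escaped so it cannot match any character.
FLOAT_PATTERN = re.compile(r"\d+\.\d+")


def extract_floats(line: str) -> list:
    """Return every decimal number in the line as a float, in order of appearance."""
    return [float(token) for token in FLOAT_PATTERN.findall(line)]


line = " t= 0.2652 S= 38.3 N= 99.7 dN/dS= 0.5082 dN = 0.0697 dS = 0.1371\n"
values = extract_floats(line)
dn_ds = values[3]  # fourth number on the line is dN/dS
print(values, dn_ds)
```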


Installing and running GLOOME

GLOOME was introduced in a paper published in Bioinformatics in 2010 (https://doi.org/10.1093/bioinformatics/btq549). It takes presence and absence information for genes and draws it on a tree.

Ofir Cohen et al., GLOOME: gain loss mapping engine, Bioinformatics, 2010

The web page, http://gloome.tau.ac.il/, takes a FASTA file as input.

The FASTA format is

>speciesA

10001100110

>speciesB

11001100110

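Each 1 or 0 encodes the presence or absence of the same gene in binary.

In the example above, the gene at position 2 is absent in A and present in B. At every other position both are 0 or both are 1, meaning that both species lack the gene or both have it.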
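The comparison described above can be checked with a few lines of Python. This is a sketch for reading the input format only and has nothing to do with how GLOOME itself works; the function names are invented for this example.

```python
def parse_presence_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its 0/1 presence string."""
    profiles = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            name = line[1:]
            profiles[name] = ""
        elif name is not None:
            profiles[name] += line
    return profiles


def differing_positions(profile_a: str, profile_b: str) -> list:
    """1-based positions where one species has the gene and the other does not."""
    return [
        position
        for position, (a, b) in enumerate(zip(profile_a, profile_b), start=1)
        if a != b
    ]


fasta_text = ">speciesA\n\n10001100110\n\n>speciesB\n\n11001100110\n"
profiles = parse_presence_fasta(fasta_text)
diff = differing_positions(profiles["speciesA"], profiles["speciesB"])
print(diff)
```

For the two sequences above, the only differing position is 2.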

The program does draw a figure, but it failed with a Java error, so I received only the ph file by email and converted it to tree.pdf at http://iubio.bio.indiana.edu/treeapp/treeprint-form.html.


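Listing files with a specific extension

In Julia, to pick a folder and then list only its subfolders, or only the files with a specific extension, take the folder as input and use filter to keep the entries that match the condition.

readdir() reads a directory and returns every entry in it,

isdir() checks whether a path is a directory, and

endswith() checks whether a file name ends with the given string.

Combine the three functions as shown below.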

inputdir = ARGS[1]

# joinpath keeps this working whether or not inputdir ends with a path separator
dirlist = filter(x -> isdir(joinpath(inputdir, x)), readdir(inputdir))

zipfilelist = filter(x -> endswith(x, ".zip"), readdir(inputdir))

For the directory passed as ARGS[1] when the script is run,

the subfolders are stored in dirlist as a list, and

the files ending in .zip are stored in zipfilelist as a list.


Allowing mismatches in Bowtie2.

Beyond optimization for read length and support for gaps, bowtie2 differs from bowtie in how it handles mismatches: bowtie allows at most 3 mismatches, whereas bowtie2 takes separate penalty scores for mismatches and indels, so mismatches can be allowed up to a fraction of the read length.

For the full list of bowtie2 options, see http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml.

This post covers only mismatches.

For an alignment to be considered "valid" (i.e. "good enough") by Bowtie 2, it must have an alignment score no less than the minimum score threshold. The threshold is configurable and is expressed as a function of the read length. In end-to-end alignment mode, the default minimum score threshold is -0.6 + -0.6 * L, where L is the read length. In local alignment mode, the default minimum score threshold is 20 + 8.0 * ln(L), where L is the read length. This can be configured with the --score-min option. For details on how to set options like --score-min that correspond to functions, see the section on setting function options.

Scoring options

`--ma <int>`	Sets the match bonus. In `--local` mode `<int>` is added to the alignment score for each position where a read character aligns to a reference character and the characters match. Not used in `--end-to-end` mode. Default: 2.
`--mp MX,MN`	Sets the maximum (`MX`) and minimum (`MN`) mismatch penalties, both integers. A number less than or equal to `MX` and greater than or equal to `MN` is subtracted from the alignment score for each position where a read character aligns to a reference character, the characters do not match, and neither is an `N`. If `--ignore-quals` is specified, the number subtracted equals `MX`. Otherwise, the number subtracted is `MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) )` where Q is the Phred quality value. Default: `MX` = 6, `MN` = 2.
`--np <int>`	Sets penalty for positions where the read, reference, or both, contain an ambiguous character such as `N`. Default: 1.
`--rdg <int1>,<int2>`	Sets the read gap open (`<int1>`) and extend (`<int2>`) penalties. A read gap of length N gets a penalty of `<int1>` + N * `<int2>`. Default: 5, 3.
`--rfg <int1>,<int2>`	Sets the reference gap open (`<int1>`) and extend (`<int2>`) penalties. A reference gap of length N gets a penalty of `<int1>` + N * `<int2>`. Default: 5, 3.
`--score-min <func>`	Sets a function governing the minimum alignment score needed for an alignment to be considered "valid" (i.e. good enough to report). This is a function of read length. For instance, specifying `L,0,-0.6` sets the minimum-score function `f` to `f(x) = 0 + -0.6 * x`, where `x` is the read length. See also: setting function options. The default in `--end-to-end` mode is `L,-0.6,-0.6` and the default in `--local` mode is `G,20,8`.

The text above is the score calculation as described in the bowtie2 manual.

First, the end-to-end score threshold is given as -0.6 + -0.6 * read length.

Setting the read length to 100 makes it easy to think in percentages; in that case, an alignment scoring below -60.6 is not reported.

Turning to the per-base scores for matches, mismatches, and gaps: in end-to-end mode the mismatch penalty depends on the base quality of the read (there is a maximum and a minimum mismatch penalty).

Low-quality bases are often trimmed during preprocessing, and current sequencing mostly yields high-quality reads, so I assume good quality and use the corresponding penalty of 6.

(A low-quality base gets a smaller penalty. For example, if the read base is A but its quality score is low, the true base may be T, in which case the position is a match rather than a mismatch.)

With the default settings:

- 10 mismatches give -60 points, so mismatches are allowed up to about 10% of the read length.

- For gaps, the open and extend penalties are applied separately, so a single gap can be up to 18 bases long (5 + 18 * 3 = 59).

In practice I recommend leaving the defaults for --mp, --rdg, and so on alone where possible and changing --score-min instead, for example to L,-0.6,-0.3 (which allows about 5% mismatches). The appropriate value still depends on the data.
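The arithmetic above can be reproduced with a short Python sketch. It only restates the linear --score-min function and the default penalties quoted from the manual. The function names are made up here, and the sketch assumes end-to-end mode, high-quality bases (mismatch penalty 6), and that all of the score budget goes either to mismatches or to one gap.

```python
import math


def min_score_linear(read_length: int, intercept: float = -0.6, slope: float = -0.6) -> float:
    """Threshold for --score-min L,<intercept>,<slope>: intercept + slope * read length."""
    return intercept + slope * read_length


def max_mismatches(read_length: int, mismatch_penalty: int = 6,
                   intercept: float = -0.6, slope: float = -0.6) -> int:
    """Most mismatches that still keep the score at or above the threshold."""
    budget = -min_score_linear(read_length, intercept, slope)
    return math.floor(budget / mismatch_penalty)


def max_single_gap_length(read_length: int, gap_open: int = 5, gap_extend: int = 3,
                          intercept: float = -0.6, slope: float = -0.6) -> int:
    """Longest single gap (penalty gap_open + N * gap_extend) within the threshold."""
    budget = -min_score_linear(read_length, intercept, slope)
    return max(0, math.floor((budget - gap_open) / gap_extend))


print(max_mismatches(100))               # default threshold
print(max_single_gap_length(100))        # default gap penalties
print(max_mismatches(100, slope=-0.3))   # --score-min L,-0.6,-0.3
```

For a 100 bp read this gives 10 mismatches or one 18-base gap under the defaults, and 5 mismatches with --score-min L,-0.6,-0.3, matching the numbers in the text.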


While using the rpy module in Python,

Python 2.6.9 (unknown, Feb 26 2015, 10:49:14)

[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> import rpy

Fatal error: cannot mkdir R_TempDir

the error message cannot mkdir R_TempDir appeared.

Check the /tmp/ folder.

If you lack permission on it, or /tmp has used up its allotted space, R cannot proceed.

This still needs to be confirmed.
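Both conditions can be checked from Python before importing rpy. The sketch below is only a diagnostic aid with a made-up function name and an arbitrary 1 MiB free-space threshold. It assumes R creates its temporary directory under the location that tempfile.gettempdir() reports (TMPDIR, TEMP, TMP, then /tmp), which may not hold on every system.

```python
import os
import shutil
import tempfile


def check_temp_dir(path=None, min_free_bytes=1024 * 1024):
    """Report whether the temp directory exists, is writable, and has free space."""
    path = path or tempfile.gettempdir()
    exists = os.path.isdir(path)
    # Creating a subdirectory needs both write and execute permission.
    writable = exists and os.access(path, os.W_OK | os.X_OK)
    free_bytes = shutil.disk_usage(path).free if exists else 0
    return {
        "path": path,
        "writable": writable,
        "free_bytes": free_bytes,
        "ok": writable and free_bytes >= min_free_bytes,
    }


print(check_temp_dir())
```

If "writable" is False the problem is permissions; if "free_bytes" is near zero the directory is full.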


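miRNA naming rules

miRNA names follow a set of rules.

miRBase describes them, and this post summarizes that description.

The original is at http://www.mirbase.org/help/nomenclature.shtml.

In summary:

1. A name such as hsa-mir-121 has the form species-mir-number, where the species part is an abbreviation of the scientific name.

2. The number reflects the order of discovery: if the last miRNA named was 121, the next one discovered is 122.

3. Precursor miRNAs at different genomic loci that yield the same mature miRNA are named hsa-mir-121-1, hsa-mir-121-2.

4. Precursors at different genomic loci that yield similar mature miRNAs are named hsa-mir-121a, hsa-mir-121b.

5. Mature miRNAs take the suffix -5p or -3p after the precursor name, depending on their position in the precursor. e.g. hsa-mir-121-5p, hsa-mir-121-3p

6. These rules do not always apply; there can be exceptions.

Finally, the description closes by noting that a name carries only a small part of the information, so to learn what an miRNA really is you should search the database rather than rely on the name.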
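The rules summarized above can be turned into a small Python parser. This is a sketch of those six points only, not an official miRBase grammar: the pattern and the function name are made up for this post, it accepts both mir and miR as an assumption, and, in line with rule 6, names that do not fit simply return None.

```python
import re

# species - mir - number [letter] [-copy] [-arm], following rules 1 to 5 above
MIRNA_NAME = re.compile(
    r"^(?P<species>[a-z]{3,4})"   # abbreviated species name, e.g. hsa
    r"-(?P<kind>mir|miR)"
    r"-(?P<number>\d+)"           # discovery order
    r"(?P<letter>[a-z])?"         # a, b, ... for similar mature sequences
    r"(?:-(?P<copy>\d+))?"        # -1, -2 for identical mature sequences at other loci
    r"(?:-(?P<arm>[53]p))?$"      # -5p / -3p for mature miRNAs
)


def parse_mirna_name(name: str):
    """Return the name's components as a dict, or None if it does not fit the rules."""
    match = MIRNA_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    return parts


for example in ["hsa-mir-121", "hsa-mir-121-2", "hsa-mir-121a", "hsa-mir-121-5p"]:
    print(example, parse_mirna_name(example))
```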


Be great
