GETUTR estimates the landscape of 3' UTRs from RNA-seq data. It has three steps: preprocessing, smoothing, and normalization.
Based on the RefSeq gene annotation (hg19), preprocessing extracted all reads mapped to the 3' UTRs from the BAM files and built a density function of the RNA-seq signal using kernel density estimation. We used a Gaussian kernel and found the optimal kernel bandwidth for the density estimation. The density estimates were then converted to BED files, which were passed to the smoothing step.

The goal of smoothing is to flatten erroneous variations in the RNA-seq signal while maintaining the biological changes of the 3' UTR, thereby enforcing a monotonic decrease of the 3' UTR signal toward the 3' end of a transcript. Here, we propose three smoothing algorithms: two heuristic algorithms and one regression algorithm. The time complexities of both algorithms are O(l), where n is the number of genes in an RNA-seq BAM file and l is the average UTR length of the genes.

Finally, to make RNA-seq comparable to 3P-seq, we normalized the smoothed signals by scaling them to range between zero and one. To correct for 3' UTR ends erroneously extended by a low background RNA-seq signal, the 5% endmost signals of RNA-seq on the 3' UTR were trimmed. The resulting values and corresponding coordinates were stored in BED format.
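The three steps can be illustrated on a single 3' UTR with a minimal sketch. This is not the GETUTR source code: the function names are hypothetical, the bandwidth is passed in rather than optimized, positions are assumed to be 0-based offsets from the start of the 3' UTR in transcript orientation, the regression step is assumed to be a non-increasing pool-adjacent-violators (PAVA) fit, and the normalization is assumed to be min-max scaling. The 5% end trimming is left out.

```python
import math


def gaussian_kde_profile(read_positions, length, bandwidth):
    """Gaussian-kernel density of read positions, evaluated at bases 0..length-1.

    `bandwidth` is supplied by the caller (assumption: no bandwidth search here).
    """
    norm = 1.0 / (len(read_positions) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in read_positions)
        for x in range(length)
    ]


def pava_non_increasing(signal):
    """Least-squares non-increasing fit using pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for value in signal:
        blocks.append([value, 1])
        # Pool neighbours while an earlier block has a lower mean than the next one,
        # because that would be an increase toward the 3' end.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def normalize_unit_range(signal):
    """Scale values to [0, 1] (assumption: min-max scaling)."""
    low, high = min(signal), max(signal)
    if high == low:
        return [0.0 for _ in signal]  # flat signal carries no shape information
    return [(v - low) / (high - low) for v in signal]


if __name__ == "__main__":
    density = gaussian_kde_profile([2, 2, 3, 8], length=12, bandwidth=1.0)
    smoothed = pava_non_increasing(density)
    print(normalize_unit_range(smoothed))
```

Each value is pushed and popped at most once in `pava_non_increasing`, so the fit takes time linear in the UTR length.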
Name | Usage | Default | Description |
---|---|---|---|
Input | -i <inputfile name> | | RNA-seq mapped reads in BAM or BED format |
Output | -o <outputfile name> | inputfile name without extension | the name-head of output files; <outputfile name>.smoothed.bed: estimated 3' UTR; <outputfile name>.PCS: poly(A) cleavage site |
Method | -m <index of method> | 10 (PAVA) | index of the smoothing method; 10: PAVA / 0: Max.fit / 1: Min.fit |
Reference | -r <reference file> | hg19 | reference file in GenePred/GTF format for gene annotation |

python GETUTR.py -i HeLa_Puc19_all.bam -o HeLa_Puc19_all.3UTR -m 10 -r refFlat.txt
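
When GETUTR is run on many samples, the command above can be assembled from a script. The following is a minimal sketch that uses only the four options listed in the table. The helper names `build_getutr_command` and `run_getutr` are hypothetical, and the sketch assumes that `GETUTR.py` is in the working directory and that `python` on the PATH is the interpreter GETUTR expects.

```python
import subprocess

# Method indices documented in the options table.
METHODS = {10: "PAVA", 0: "Max.fit", 1: "Min.fit"}


def build_getutr_command(input_path, output_prefix=None, method=10,
                         reference=None, script="GETUTR.py", interpreter="python"):
    """Return the argument list for one GETUTR run (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError("method must be one of %s" % sorted(METHODS))
    cmd = [interpreter, script, "-i", input_path]
    if output_prefix is not None:  # otherwise GETUTR applies its own default
        cmd += ["-o", output_prefix]
    cmd += ["-m", str(method)]
    if reference is not None:  # otherwise GETUTR applies its own default
        cmd += ["-r", reference]
    return cmd


def run_getutr(*args, **kwargs):
    """Run GETUTR and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_getutr_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    print(" ".join(build_getutr_command(
        "HeLa_Puc19_all.bam", "HeLa_Puc19_all.3UTR", 10, "refFlat.txt")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.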