A simplified RNA-Seq preprocessing tool for gene expression analysis
BRB Digital Gene Expression (BRB-DGE) is a new tool to help scientists preprocess RNA-Seq data for downstream gene expression analysis. It takes RNA sequence data files in FASTQ format as input and outputs raw count data in a text format. The raw count data can be read into BRB-ArrayTools for gene expression analysis. The tool is designed to run in a Linux environment (currently only Ubuntu Linux is supported), which can be a local physical or virtual machine or a remote server such as the Amazon cloud. BRB-DGE runs RNA-Seq preprocessing using well-known RNA-Seq tools (Tophat, Samtools and HTSeq). Rather than inventing a new preprocessing tool, we designed BRB-DGE as easy-to-use software, in the hope that it makes RNA-Seq preprocessing more accessible to general scientists.
Ubuntu 64-bit Desktop OS (version 12.04 or 14.04) is required to run BRB-DGE. The Ubuntu ISO file can be downloaded from the Ubuntu download page.
The recommended hardware requirement may vary depending on the size of data. It is recommended that the machine has at least 8GB of RAM and 500GB hard disk space. Ubuntu OS itself takes about 4.5 GB of hard disk space. BRB-DGE and other required software (Tophat, Bowtie, ...) take less than 400 MB space in total.
Depending on the machine you have, you can install Ubuntu in different ways. The official installation guide from Ubuntu provides step-by-step instructions with screenshots for first-time users.
The downloaded file is saved under the $HOME/Downloads directory by default. The filename follows the pattern bdge-dl-XXXXXXXX.tar.gz, where 'XXXXXXXX' represents the release date; for example, 20141225.
The tarball can be extracted to the desktop or any place under your $HOME directory by using either the file manager (called Nautilus in Ubuntu) or a terminal command (use the keyboard shortcut 'Ctrl+Alt+t' to open a terminal window):
tar -xzvf bdge-dl-XXXXXXXX.tar.gz
This will create a new directory called bdge-dl. This new directory contains the following files:
BDGE: GUI version of the application
BDGECLI: CLI version of the application
install.sh: script to 'install' application icons to the right places
samples.txt: template of samples.txt file (tab-delimited)
icon: folder containing application icons
Although it is optional, it is recommended to open a terminal and run ./install.sh as a user to create a desktop shortcut with the application icon. Note that running install.sh requires sudo privileges; see the FAQ entry on sudo privileges.
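To launch the GUI application, you can do one of the following:
In the file manager, double-click the BDGE file to launch.
In a terminal opened in the bdge-dl directory, type ./BDGE to launch. Don't forget the dot and the forward slash when you run an executable from the current directory!
Use the desktop shortcut (available once install.sh has been run).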
There are two ways to set up the required software (such as Bowtie, Tophat, SAMTools, ...): automatic and manual. For novice users, it is recommended to use the automatic setup method. Either way, users first open a dialog by clicking Settings > Tools Manager.
The automatic setup method provides an easy way to install the required software.
When users click the Automatic setup button, a new terminal window will be opened. After the sudo password is entered, the application will download an installation script from the internet (which is kept updated) and install the required software. All the required software is installed under the /opt/RNA-Seq/bin directory. Note that if you do not have the sudo password, you need to ask your system administrator for help. See the FAQ on how to allow an existing user to have root privileges.
After the installation is finished, users will be asked to hit the ENTER key to close the terminal. Now they can go to the main dialog to continue to preprocess their data.
If the required software has been installed previously, you can click the Browse button to select the directory for each program. (HTSeq is assumed to be available in the global environment, so there is no entry for it.) After the path of each program has been specified, users can click the 'OK' button to proceed.
Note that the setup only needs to be run once. BRB-DGE remembers the path of each required program.
The workflow can be represented by the following plot (a dashed line means the step is optional).
The first time BRB-DGE is used, you have to download the required software (such as Tophat, Samtools, ...). BRB-DGE provides a simple way to do this; please see the software setup instructions.
Before launching BRB-DGE, the user needs to create a tab-delimited file called samples.txt to provide sample information required by the software.
The application will automatically run the required programs (Tophat, Samtools and HTSeq) one by one based on the samples.txt file. More information about samples.txt can be found in the Tutorial section.
Before running the alignment, users are welcome to run quality control and trimming by using the FastQC and Fastx_trimmer tools. Both tools are available under the 'Tools' menu.
The following files are required as input (see also the Tutorial section): the FASTQ files, the genome reference files, a gene annotation file, and samples.txt.
The output of the program is a zip file "counts.zip" which contains one file per sample (the file name is the sample name with the extension 'count'). Each file contains a table of read counts for all genes in one sample. There are two columns in the table. The first column contains gene identifiers such as Ensembl IDs or gene symbols and the second column has count values. These count table files can be imported into BRB-ArrayTools for gene expression analyses. Alternatively, users can use the DESeqDataSetFromHTSeqCount() function from the DESeq2 package in Bioconductor or the readDGE() function from the edgeR package to load the count data into an R environment.
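Before importing the count tables, it can be useful to sanity-check them from a terminal. The following is a minimal sketch, assuming the files are tab-delimited with two columns and carry the 'count' extension described above. The function names total_counts and summarize_counts are hypothetical and not part of BRB-DGE. Skipping rows that start with "__" assumes the htseq-count convention (version 0.6 and later) of prefixing its summary counters, such as __no_feature, that way; whether those rows appear in the BRB-DGE output is not confirmed here.

```shell
#!/usr/bin/env bash
# total_counts FILE: sum the second column of a two-column count table,
# skipping htseq-count summary rows (assumed to start with "__").
total_counts() {
    awk -F'\t' '$1 !~ /^__/ { sum += $2 } END { print sum + 0 }' "$1"
}

# summarize_counts DIR: print "file<TAB>total" for every *.count file in DIR.
summarize_counts() {
    local f
    for f in "$1"/*.count; do
        [ -e "$f" ] || continue   # no matches: the glob pattern stays literal
        printf '%s\t%s\n' "$(basename "$f")" "$(total_counts "$f")"
    done
}
```

For example, after extracting the archive with unzip counts.zip -d counts, running summarize_counts counts lists one total per sample (assuming the count files sit at the top level of the archive). A sample whose total is far lower than the others may deserve a closer look.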
To help first-time users gain experience with the BRB-DGE program, we provide test data which includes all necessary input files. When you are ready to run BDGE on your own data, check out the Preparing Files and samples.txt sections to make sure all required input files are ready.
A subset of RNA sequence data files from GSE11209, together with genome reference, gene annotation and samples.txt files, can be downloaded from here (~180MB; right-click the link and select 'Save Link As...' to download it). After downloading the zip file, extract it to some place (e.g. $HOME); the data folder is then available at $HOME/GSE11209-master/. The FASTQ files are located directly under $HOME/GSE11209-master and the annotation files under $HOME/GSE11209-master/annotation. It should take less than 5 minutes to run preprocessing on this test data. Note: be sure to click the 'Browse' button for the Annotation Dir and select the 'annotation' directory in the BRB-DGE program; see the screenshot here.
To run RNA-Seq data preprocessing, it is assumed the working directory contains the following files: the FASTQ files, the genome reference files, the gene annotation file, and samples.txt.
The fastq files can be generated by a sequencer or obtained from public repositories such as the Sequence Read Archive (SRA). Data obtained from SRA is in sra format, which cannot be processed directly by the Tophat program. Therefore the first step is to run a program called fastq-dump to convert the data from sra to fastq format. An example of the fastq-dump command is:
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --split-3 SRRXXXXXX.sra
This will create a new file SRRXXXXXX.fastq for single-end data, or SRRXXXXXX_1.fastq and SRRXXXXXX_2.fastq for paired-end data. Note that the above command assumes users have run the automatic setup tool to install the required software. The fastq-dump program is part of the SRA Toolkit, and its full path depends on the SRA Toolkit version. See also FAQ #7.
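When a project has several .sra files, the same conversion can be applied in a loop. The following is a minimal sketch: the function name convert_sra and the DRY_RUN and FASTQ_DUMP variables are hypothetical conveniences, the default path is simply the one from the example above, and --outdir is the fastq-dump option that selects the output directory.

```shell
#!/usr/bin/env bash
# Path taken from the example above; it changes with the SRA Toolkit version.
FASTQ_DUMP="${FASTQ_DUMP:-/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump}"

# convert_sra DIR: run fastq-dump --split-3 on every .sra file in DIR and
# write the resulting fastq files back into DIR.
# With DRY_RUN=1 the commands are only printed, not executed.
convert_sra() {
    local dir="$1" f
    for f in "$dir"/*.sra; do
        [ -e "$f" ] || continue   # no .sra files: nothing to do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$FASTQ_DUMP --split-3 --outdir $dir $f"
        else
            "$FASTQ_DUMP" --split-3 --outdir "$dir" "$f"
        fi
    done
}
```

Running DRY_RUN=1 convert_sra ProjectDir first shows which commands would be executed, which is a cheap way to confirm the fastq-dump path before starting a long conversion.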
The genome reference file and genome annotation file can be downloaded from the reference genome download page on the Tophat website. Consider, for example, the Homo sapiens genome GRCh37. After downloading and extracting the file Homo_sapiens_Ensembl_GRCh37.tar.gz (17GB), the reference genome files (genome.*.bt2) are located under the Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index directory and the gene annotation file (genes.gtf) is located under the Homo_sapiens/Ensembl/GRCh37/Annotation/Archives/ directory.
Please copy these bt2 and gtf files to some directory such as ProjectDir/annotation/ and select this directory for the 'Annotation Dir' in the BRB-DGE program.
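The copy step can also be done from a terminal. The following is a minimal sketch, assuming the extracted genome build directory has the Sequence/Bowtie2Index and Annotation/Archives layout described above; the function name collect_annotation is hypothetical. Because genes.gtf may sit in a subdirectory of Annotation/Archives, the sketch searches for it with find rather than assuming an exact location.

```shell
#!/usr/bin/env bash
# collect_annotation BUILD_DIR DEST
# BUILD_DIR is the extracted genome build directory, i.e. the one that
# contains Sequence/ and Annotation/. Copies the Bowtie2 index files (*.bt2)
# and genes.gtf into DEST, creating DEST if needed.
# If several archived genes.gtf files exist, the last one copied wins.
collect_annotation() {
    local build="$1" dest="$2"
    mkdir -p "$dest"
    cp "$build"/Sequence/Bowtie2Index/*.bt2 "$dest"/
    find "$build/Annotation/Archives" -name genes.gtf -exec cp {} "$dest"/ \;
}
```

After calling collect_annotation with the build directory and ProjectDir/annotation, the destination can be selected as the 'Annotation Dir'.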
If you have an NIH helix/biowulf account, you can download the reference genome and annotation files individually from the /fdb/igenomes/ directory; see Tophat on Helix.
The samples.txt file contains a table of metadata of sequencing design. More detail about how to prepare samples.txt is given in the next section.
Before starting to run preprocessing, users need to describe the sequencing design in a tab-delimited file samples.txt. This file has the following requirements.
The header of samples.txt consists of LibraryName, LibraryLayout, fastq1 and fastq2. Each row represents one sample. The 'LibraryName' column contains a name that uniquely identifies the sample. 'LibraryLayout' should be PAIRED or SINGLE. The 'fastq1' column contains the fastq filename; multiple files can be listed separated by commas. If the library is paired-end, the 'fastq2' column should be filled in using the same format as the 'fastq1' column. An example of the samples.txt file is given below:
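As an illustration of this layout, the following sketch writes a small samples.txt and checks it against the rules stated above. The sample names and FASTQ filenames are hypothetical placeholders, and the two functions are not part of BRB-DGE. Leaving fastq2 empty for SINGLE libraries is an assumption; the samples.txt template shipped in the bdge-dl folder remains the authoritative reference.

```shell
#!/usr/bin/env bash
# write_example_samples FILE: create a tab-delimited samples.txt with
# hypothetical sample names and FASTQ filenames.
write_example_samples() {
    {
        printf 'LibraryName\tLibraryLayout\tfastq1\tfastq2\n'
        printf 'sampleA\tSINGLE\tsampleA.fastq\t\n'
        printf 'sampleB\tPAIRED\tsampleB_1.fastq\tsampleB_2.fastq\n'
        printf 'sampleC\tSINGLE\tsampleC_run1.fastq,sampleC_run2.fastq\t\n'
    } > "$1"
}

# check_samples FILE: print any problems found; exit status 0 means the file
# has the expected header, unique names, valid layouts and the needed files.
check_samples() {
    awk -F'\t' '
        NR == 1 {
            if ($1 != "LibraryName" || $2 != "LibraryLayout" ||
                $3 != "fastq1" || $4 != "fastq2") { print "bad header"; bad = 1 }
            next
        }
        $1 == "" || seen[$1]++           { print "line " NR ": missing or duplicate LibraryName"; bad = 1 }
        $2 != "PAIRED" && $2 != "SINGLE" { print "line " NR ": LibraryLayout must be PAIRED or SINGLE"; bad = 1 }
        $3 == ""                         { print "line " NR ": fastq1 is empty"; bad = 1 }
        $2 == "PAIRED" && $4 == ""       { print "line " NR ": fastq2 is required for PAIRED"; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

Running check_samples samples.txt before starting a long preprocessing job catches the most common mistakes, such as a PAIRED row without a second FASTQ file.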
This file can be edited using LibreOffice Calc (similar to Microsoft Excel on Windows OS) on Ubuntu, starting from the samples.txt template copied from the bdge-dl folder to your working directory.
As of this writing, the LibreOffice version is 4.3. If you have an earlier version of LibreOffice (3.X), you can install the latest version from the LibreOffice website.
When the necessary files are prepared and the required software has been installed, you are ready to run processing by clicking the big 'Run Preprocessing' button. When the preprocessing is done, a dialog will tell you that it has finished. You can transfer the output files to Windows OS machines and use BRB-ArrayTools for further gene expression analysis. See FAQ #10 for further help about transferring files.
If you need to run the BDGE program in a 'headless' way (without a graphical interface), you can run the command line interface (CLI) version of the program BDGECLI from a terminal. First open a terminal (keyboard binding is Ctrl+Alt+t) and cd to the directory containing BRB-DGE. Executing the following line will start to preprocess the RNA-Seq data.
$ ./BDGECLI DIRECTORY-NAME
where DIRECTORY-NAME is the name of the directory that contains all of your data and related files.
A full list of arguments can be found by adding '-h' argument after the command:
$ ./BDGECLI -h
Name:
BDGECLI - command line interface for RNA-Seq data preprocessing (companion to BDGE program)
Usage: BDGECLI DIRECTORY-NAME [Options]
Options:
-h,--help Show this help message
--annotDir Annotation directory
--outputDir Output directory
--tophatDir Tophat directory
--bowtieDir Bowtie directory
--samtoolsDir Samtools directory
--createScriptOnly Only create the shell script without running it
-v,--version Display version number
Example:
./BDGECLI ~/GSE11209-master --annotDir ~/GSE11209-master/annotation \
--outputDir ~/GSE11209-master/output \
--tophatDir /opt/RNA-Seq/bin/tophat-2.0.11.Linux_x86_64 \
--bowtieDir /opt/RNA-Seq/bin/bowtie2-2.2.1 \
--samtoolsDir /opt/RNA-Seq/bin/samtools-0.1.19
Like the GUI version, the CLI program outputs a zip file containing the count table for each sample.
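When several projects need the same treatment, the command line version can be wrapped in a small helper. The following is a minimal sketch that uses only the --annotDir and --outputDir options listed in the help output above. The function name run_bdgecli, the DRY_RUN and BDGECLI variables, and the convention that each project keeps its annotation files in an 'annotation' subdirectory and writes to an 'output' subdirectory are assumptions modelled on the example above.

```shell
#!/usr/bin/env bash
# run_bdgecli PROJECT_DIR: call BDGECLI for one project directory, assuming
# annotation files live in PROJECT_DIR/annotation and results should go to
# PROJECT_DIR/output (both are conventions of this sketch only).
# BDGECLI defaults to ./BDGECLI; DRY_RUN=1 only prints the command.
run_bdgecli() {
    local dir="${1%/}"                 # drop a trailing slash, if any
    local cli="${BDGECLI:-./BDGECLI}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cli $dir --annotDir $dir/annotation --outputDir $dir/output"
    else
        "$cli" "$dir" --annotDir "$dir/annotation" --outputDir "$dir/output"
    fi
}
```

A loop such as for d in ~/projects/*/; do run_bdgecli "$d"; done would then process each project in turn; setting DRY_RUN=1 first prints the commands without running them.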
This program was built using GCC. The executable file should be able to run on most x64 Linux distributions.
If you only want to generate shell scripts without running them immediately, you can use the --createScriptOnly parameter of BDGECLI. This is useful when the scripts need to be tweaked before being used; for example, when users want to submit them through the qsub command on a batch system.
Assuming the required software is installed under /opt/RNA-Seq/bin, we can type the following in a Linux terminal to find out each software version:
/opt/RNA-Seq/bin/bowtie2-2.2.1/bowtie2 --version # 2.2.1
/opt/RNA-Seq/bin/tophat-2.0.11.Linux_x86_64/tophat2 --version # 2.0.11
/opt/RNA-Seq/bin/samtools-0.1.19/samtools # 0.1.19-44428cd
cat /opt/RNA-Seq/bin/HTSeq-0.6.1/VERSION # 0.6.1
/opt/RNA-Seq/bin/sratoolkit.2.3.5-2-ubuntu64/bin/fastq-dump --version # 2.3.5
/opt/RNA-Seq/bin/FastQC/fastqc -v # 0.10.1
/opt/RNA-Seq/bin/fastx/bin/fastx_trimmer -h # 0.0.13
$ ./install.sh
[sudo] password for brb:
Sorry, user brb is not allowed to execute 'bin/mkdir -p /opt/RNA-Seq/bin/BDGE' as root on brb-VirtualBox.
A simple solution is to give the existing account root privileges by running the following terminal command from an account that already has root privileges:
sudo adduser brb sudo
PS. 1. 'brb' has to be replaced with the real user name that does not have sudo privileges yet. 2. The user needs to log out and log in again for the change to take effect.
ifconfig eth0
where the 'eth0' argument represents the first ethernet adapter. The IP address should have the format XXX.XXX.XXX.XXX. You need to enter the IP address in the 'Host name' entry of the dialog when you create a new connection session in WinSCP.