GCEN: an easy toolkit of Gene Co-Expression Network analysis for lncRNAs annotation

1. Introduction

GCEN is a command-line toolkit that allows biologists to easily build gene co-expression network and predict gene function, especially in RNA-Seq research or lncRNA annotation. GCEN is primarily designed to be used in lncRNA annotation, but is not limited to those scenarios. It is an efficient and easy-to-use solution that will allow everyone to perform gene co-expression network analysis without sophisticated programming skills. The recommended pipeline consists of four parts: data pretreatment, network construction, module identification, and function annotation. A README file and sample data are included in the software package. Because of its modular design, the GCEN can be easily integrated into another pipeline. Also, the multithreaded implementation of GCEN makes it fast and efficient for RNA-Seq data.

2. Usage

data_norm
description:
  The program data_norm normalizes the gene expression data.
usage:
  data_norm -i input_file -o output_file -m normalization_method
options:
  -i --input <input file>
  -o --output <output file>
  -m --method <upqt/median/deseq/tmm> normalization method (default: upqt)
  -v --version display GCEN version
  -h --help print help information
example:
  data_norm -i ../sample_data/gene_expr.tsv -o ../sample_data/gene_expr_norm.tsv -m upqt

data_filter
description:
  The program data_filter filter genes according to the their expression mean and standard deviation.
usage:
  data_filter -i input_file -o output_file
options:
  -i --input <input file>
  -o --output <output file>
  -c --cutoff_mean <number> mean cutoff of gene expression (default: 0.0)
  -C --cutoff_sd <number> standard deviation cutoff of gene expression (default: 0.0)
  -p --percent_mean <number> keep a proportion of total genes based mean of gene expression (default: 1.0)
  -P --percent_sd <number> keep a proportion of total genes based standard deviation of gene expression (default: 1.0)
  -v --version display GCEN version
  -h --help print help information
example:
  data_filter -i ../sample_data/gene_expr.tsv -o ../sample_data/gene_expr_filter.tsv -p 0.75

network_build
description:
  The program network_build construct gene co-expression network using gene expression matrix.
usage:
  network_build -i gene_expression_file -o co_expression_network_file
options:
  -i --input <input file>
  -o --output <output file>
  -m --method <pearson or spearman> correlation coefficient method (default: spearman)
  -l --log <log, log2 or log10> make a log(x+1) transformation (default: not transform)
  -t --thread <number> cpu cores (default: 2)
  -p --pval <number> p value cutoff (default: 0.001)
  -c --cor <number> correlation coefficient cutoff (default: 0.1)
  -s --signed <y or n> singed network (default: n)
  -f --fdr <y or n> calculate FDR (default: n)
  -v --version display GCEN version
  -h --help print help information
example:
  network_build -i ../sample_data/gene_expr.tsv -o ../sample_data/gene_co_expr.network -m spearman -p 0.001 -f y

module_identify
description:
  The program identify the gene modules using the gene co-expression network.
usage:
  module_identify -i input_file -o output_file
options:
  -i --input <input file>
  -o --output <output file>
  -s --similarity <number> similarity cutoff (default: 0.5)
  -t --thread <number> cpu cores (default: 2)
  -v --version display GCEN version
  -h --help print help information
example:
  module_identify -i ../sample_data/gene_co_expr.network -o ../sample_data/module.txt

annotate
description:
  The program annotate can perform GO/KEGG annotation based on network or module.
usage:
  annotate -g go-basic.obo -a gene_go_association_file -n input_network -o out_dir
options:
  -g --go <go-basic.obo file>
  -k --kegg <kegg information> (if the -g/--go is specified, the -k/--kegg are ignored)
  -a --assoc <gene/go association file>
  -n --network <network file>
  -m --module <module file> (if -n is specified, the -m is ignored)
  -p --pval <number> p value cutoff (default: 0.05)
  -o --output <output directory>
  -t --thread <number> cpu cores (default: 2)
  -v --version display GCEN version
  -h --help print help information
examples:
  ./annotate -g ../sample_data/go-basic.obo -a ../sample_data/gene_go.assoc -n ../sample_data/gene_co_expr.network -o ../sample_data/network_go_annotation
  ./annotate -g ../sample_data/go-basic.obo -a ../sample_data/gene_go.assoc -m ../sample_data/module.txt -o ../sample_data/module_go_annotation
  ./annotate -k ../sample_data/K2ko.tsv -a ../sample_data/gene_kegg.assoc -n ../sample_data/gene_co_expr.network -o ../sample_data/network_kegg_annotation
  ./annotate -k ../sample_data/K2ko.tsv -a ../sample_data/gene_kegg.assoc -m ./sample_data/module.txt -o ../sample_data/module_kegg_annotation

rwr
description:
  The program rwr can predict potential funcation associated genes based on RWR (Random Walk with Restart) algorithm.
usage:
  rwr -n input_network -g gene_list -o output_result
options:
  -n --network <network file>
  -g --gene <seed genes list file>
  -r --gamma <number> restart probability (default: 0.5)
  -p --prob <number> probability cutoff (defalut: 0.01)
  -o --output <output file>
  -d --directed_network the input network is directed (defalut: the input network is undirected)
  -w --weighted_network the edge weights of network will be considered (defalut: all edge weights of network are set to 1.0)
  -W --weighted_gene the weights of seed genes will be considered (defalut: all weights of seed genes are set to 1.0)
  -v --version display GCEN version
  -h --help print help information
example:
  rwr -n ../sample_data/gene_co_expr.network -g ../sample_data/rwr_seed_genes.list -o ../sample_data/rwr_ranked_gene.tsv

data_stat
description:
  The program data_stat calculate the statistics of gene expression matrix.
usage:
  data_stat -i input_file
options:
  -i --input <input file>
  -v --version display GCEN version
  -h --help print help information
example:
  data_stat -i ../sample_data/gene_expr.tsv

network_merge
description:
  The program network_merge merge two or more networks.
usage:
  network_merge -i input_files -o output_file
options:
  -i --input <input files> multiple files are separated by commas
  -o --output <output file>
  -c --cor <number> correlation coefficient cutoff (default: 0.5)
  -h --help print help information
example:
  network_merge -i ../sample_data/test_1.network,../sample_data/test_2.network -o ../sample_data/test_merge.network

enrich
description:
  The program enrich can perform GO/KEGG enrichment.
usage:
  enrich -e enrichment_gene_list_file -b background_gene_list_file -g go-basic.obo -a gene_go_association_file -p p_value_cutoff -o out_put_file
options:
  -e --enrich <enrichment gene list file>
  -b --background <background gene list file>
  -g --go <go-basic.obo file>
  -k --kegg <kegg information> (if the -g/--go is specified, the -k/--kegg are ignored)
  -a --assoc <gene/go association file>
  -p --pval <number> p value cutoff (default: 0.05)
  -o --output <output file>
  -v --version display GCEN version
  -h --help print help information
examples:
  enrich -e ../sample_data/enrichment_gene.list -b ../sample_data/background_gene.list -g ../sample_data/go-basic.obo -a ../sample_data/gene_go.assoc -p 0.05 -o ../sample_data/enrichment.go
  enrich -e ../sample_data/enrichment_gene.list -b ../sample_data/background_gene.list -k ../sample_data/K2ko.tsv -a ../sample_data/gene_kegg.assoc -p 0.05 -o ../sample_data/enrichment.kegg

generate_expr_matrix_from_rsem
description:
  The program generate_expr_matrix_from_rsem generate gene expression matrix from RSEM outputs.
usage:
  generate_expr_matrix_from_rsem -i input_file -o output_file
options:
  -i --input <input file> a text file with sample ID and path to its RSEM result file on each line
  -o --output <output file>
  -t --tpm output TPM value instead of FPKM vaule
  -v --version display GCEN version
  -h --help print help information
example:
  generate_expr_matrix_from_rsem -i ../sample_data/rsem/rsem_sample.txt -o ../sample_data/rsem/rsem_gene_expr.tsv

generate_expr_matrix_from_stringtie
description:
  The program generate_expr_matrix_from_stringtie generate gene expression matrix from StringTie outputs.
usage:
  generate_expr_matrix_from_stringtie -i input_file -o output_file
options:
  -i --input <input file> a text file with sample ID and path to its GTF file on each line
  -o --output <output file>
  -t --tpm output TMP value instead of FPKM vaule
  -v --version display GCEN version
  -h --help print help information
example:
  generate_expr_matrix_from_stringtie -i ../sample_data/stringtie/stringtie_sample.txt -o ../sample_data/stringtie/stringtie_gene_expr.tsv

3. Recommended pipeline

Step 1: data pretreatment
./data_norm -i ../sample_data/gene_expr.tsv -o ../sample_data/gene_expr_norm.tsv -m upqt
./data_filter -i ../sample_data/gene_expr_norm.tsv -o ../sample_data/gene_expr_norm_filter.tsv -p 0.75

Step 2: co-expression network construction
./network_build -i ../sample_data/gene_expr_norm_filter.tsv -o ../sample_data/gene_co_expr.network -m spearman -p 0.001 -c 0.8 -f y -t 6

Step 3: module identification (optional)
./module_identify -i ../sample_data/gene_co_expr.network -o ../sample_data/module.txt -s 0.5 -t 6

Step 4: function annotation
network based annotation
./annotate -g ../sample_data/go-basic.obo -a ../sample_data/gene_go.assoc -n ../sample_data/gene_co_expr.network -o ../sample_data/network_go_annotation
./annotate -k ../sample_data/K2ko.tsv -a ../sample_data/gene_kegg.assoc -n ../sample_data/gene_co_expr.network -o ../sample_data/network_kegg_annotation
module based annotation (optional)
./annotate -g ../sample_data/go-basic.obo -a ../sample_data/gene_go.assoc -m ../sample_data/module.txt -o ../sample_data/module_go_annotation
./annotate -k ../sample_data/K2ko.tsv -a ../sample_data/gene_kegg.assoc -m ../sample_data/module.txt -o ../sample_data/module_kegg_annotation
identify genes with specific functions based on RWR (optional)
rwr -n ../sample_data/gene_co_expr.network -g ../sample_data/rwr_interested_gene.list -o ../sample_data/rwr_result.tsv

4. Data format

To understand the format of the input and output files for each program, please take a look at the sample data included in the software package.

5. Implementation

The GCNE is developed using C++ as open-source software under the GPLv3 license. In addition to our code, we used some third-party code in compliance with their licenses. The GCEN can be compiled and run in Linux, Windows, and macOS. Compiling the GCEN requires compiler and library support for the ISO C++ 2017 standard. The GCEN does not involve any new computational methods or algorithms but implemented and integrated some classic algorithms to annotate lncRNA more easily and faster. We described the main algorithms used as follows.

Data normalization
We have implemented four algorithms, including median normalization, quantile normalization [1], median-of-ratios method [2], and trimmed mean of M-values (TMM) [3]. For the median normalization, the median of gene expression values in each sample be calculated, then each gene expression value in the same sample multiply by the mean of all medians and divide by the median of this sample. The descriptions of the other three algorithms can be found in their original papers.

Co-expression network construction
We use the Spearman or Pearson correlation coefficients directly to determine co-expression patterns between gene pairs, which are more performing and robust [4]. The coefficient of Spearman or Pearson correlation ρ was calculated as follows.

For Pearson correlation, where or represents the vector of the expression value of each gene. or stands for each expression value, or , is the mean value of these expression values. For Spearman correlation, the vector of the expression value of each gene needs to be replaced by their ranks.

Module identification
The network modules, which are groups of genes with similar expression profiles, are explored based on the topological structure of the gene co-expression network [5]. Genes in the same module tend to be functionally related and co-regulated. We implemented a module identification algorithm based on the node similarity measure of their relative interconnectedness coupled with the hierarchical clustering method [6].

Where the is the similarity between gene and gene . The is the number of shared neighbor genes of gene and gene in the co-expression network. The is the adjacency of gene and gene . and are the connectivity of gene and gene , respectively.

Function annotation
After gene co-expression network construction and module identification, we use gene function enrichment to predict novel lncRNA’s or coding gene’s function. The value of enriched function was calculated as follows.

In this equation, is the total number of genes in the network. is the total number of genes having one certain annotation, is the number of a gene's immediate neighbors and is the number of neighbor genes having one certain annotation.

Another gene function analysis algorithm we implemented is the random walk with restart (RWR) [7], which measures each node’s relevance with respect to given seed nodes (here are genes with known function annotations) based on network propagation. The information (known gene function) is propagated through the edges from seed nodes to nearby nodes until convergence. To the end, prior information associating genes with a function of interest is super imposed on the nodes of the network.

Where represents our initial, or prior, information on genes. is a stochastic matrix, that is, its columns sum to 1. The parameter describes the trade-off between prior information and network smoothing. Repeated iteration of this equation converges to a steady-state. is the prior information associating genes with a function of interest. The implementation of RWR uses Eigen, which is a C++ template library for linear algebra (http://eigen.tuxfamily.org).

References

  1. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015, 43(7):e47.
  2. Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014, 15(12):550.
  3. Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010, 11(3):R25.
  4. Ballouz S, Verleyen W, Gillis J: Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Bioinformatics 2015, 31(13):2123-2130.
  5. Saelens W, Cannoodt R, Saeys Y: A comprehensive evaluation of module detection methods for gene expression data. Nat Commun 2018, 9(1):1090.
  6. Wang Z, San Lucas FA, Qiu P, Liu Y: Improving the sensitivity of sample clustering by leveraging gene co-expression networks in variable selection. BMC Bioinformatics 2014, 15:153.
  7. Cowen L, Ideker T, Raphael BJ, Sharan R: Network propagation: a universal amplifier of genetic associations. Nat Rev Genet 2017, 18(9):551-562.