GCEN - Data visualization

Data visualization is critical to scientific research because it makes data exploration and communication more efficient. The GCEN release does not include a plotting program, because GCEN focuses on using gene co-expression networks efficiently to predict gene function. However, we have not ignored biologists' plotting needs: we provide a number of data visualization demos and scripts on our website. Conveying a large amount of data in an informative and concise manner remains a challenge and requires clear objectives and a careful implementation. Rather than automatic plotting, we offer these examples as starting points for showing biological significance in gene co-expression analysis.

Network degree distribution

The figure on the left shows the degree distribution of a gene co-expression network. The degree of a node is the number of connections it has to other nodes in the network. For example, if a gene has 10 co-expressed genes in the network, its degree is 10. The degree distribution of gene co-expression networks resembles that of scale-free networks and approximately follows a power law: most genes are connected to only a few other genes, while a few genes are connected to many. On a log-log plot, such a distribution appears approximately as a straight line.
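The definitions above can be tried on a toy network before running the full script. The following is a minimal sketch using only the standard library; the edge list, the gene names, and the `slope` helper are hypothetical and are not part of GCEN.

```python
from collections import Counter
from math import log10

# Hypothetical toy edge list: one hub gene (g1) and several low-degree genes.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g1", "g5"),
         ("g2", "g3"), ("g6", "g7")]

# Degree of a gene = number of edges it takes part in.
degree = Counter(gene for edge in edges for gene in edge)

# Degree distribution: how many genes have each degree.
distribution = Counter(degree.values())

# Points of the log-log plot, sorted by degree.
log_points = sorted((log10(k), log10(n)) for k, n in distribution.items())


def slope(points):
    """Least-squares slope of a line fitted through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


print(dict(distribution))
print(slope(log_points))
```

In this toy network the number of genes halves each time the degree doubles, so the log-log points fall on a line with a negative slope. Real networks only follow such a line approximately.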

import pandas as pd
from collections import Counter
from math import log10

import seaborn as sns
import matplotlib.pyplot as plt

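# read network: the first two columns hold the two genes of each edge
# (header=1 uses the second line of the file as column names and skips the first)
network = pd.read_csv("gene_co_expr.network", header=1, sep='\t')
# pd.concat replaces Series.append, which was removed in pandas 2.0
node = pd.concat([network.iloc[:, 0], network.iloc[:, 1]])
# degree of each gene, then the number of genes observed at each degree
node_degree = pd.Series(Counter(node))
distribution = pd.DataFrame(list(Counter(node_degree).items()))
# log10-transform both columns for the log-log plot
distribution = distribution.apply(lambda column: column.map(log10))
distribution.columns = ["degree", "counts"]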

# plot
sns.set(style="white", context="talk")

# lmplot is a figure-level function that creates its own figure,
# so the size is set through `height` rather than plt.subplots()
g = sns.lmplot(x="degree", y="counts", data=distribution, scatter=True,
               markers='.', scatter_kws={"s": 10}, height=8)
g.set_axis_labels("degree (log10)", "counts (log10)")
g.savefig('network_degree_distribution.pdf', format='pdf', bbox_inches='tight')

Sub-network

The figure on the left shows a sub-network, or module. We predict the function of a gene from its neighboring genes in the network or from the genes in the same module, so it is important to display these groups graphically.
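As an illustration of what "neighboring genes" means here, the sketch below extracts one gene together with its direct neighbors from an edge list. It is an assumption-laden example, not GCEN's own extraction method: the edge list, the gene names, and the `neighborhood_edges` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical edge list standing in for a full co-expression network.
edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC"),
         ("geneC", "geneD"), ("geneD", "geneE")]


def neighborhood_edges(edges, gene):
    """Return the edges among `gene` and its direct neighbors."""
    adjacency = defaultdict(set)
    for node_a, node_b in edges:
        adjacency[node_a].add(node_b)
        adjacency[node_b].add(node_a)
    # .get avoids inserting a key for a gene that is not in the network
    keep = {gene} | adjacency.get(gene, set())
    return [(a, b) for a, b in edges if a in keep and b in keep]


sub_edges = neighborhood_edges(edges, "geneA")

# Tab-separated lines, one edge per line: the two-column layout that the
# plotting script reads from 'sub.network'.
sub_network_text = "".join(f"{a}\t{b}\n" for a, b in sub_edges)
print(sub_network_text)
```

Writing `sub_network_text` to a file would give an input in the same two-column layout that the plotting script for this section reads.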

import networkx as nx
import matplotlib.pyplot as plt

import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42

# build the graph from a tab-separated edge list (first two columns: gene A, gene B)
network = nx.Graph()
with open('sub.network', 'r') as network_file:
    for line in network_file:
        if not line.strip():
            continue  # skip blank lines
        node_a, node_b = line.strip().split('\t')[:2]
        network.add_edge(node_a, node_b)

plt.figure(figsize=(6, 6))
nx.draw(network, with_labels=True, edge_color='grey', node_color='aquamarine', node_size=1500)
# add a horizontal margin, which helps keep node labels inside the figure
xl, xr = plt.xlim()
plt.xlim(xl - 0.45, xr + 0.45)
plt.savefig('subnetwork.pdf', format='pdf')

GO annotation counts

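The figure on the left shows the statistics of GO annotations across all predicted genes: the ten most frequent terms in each of the three GO namespaces (biological process, molecular function, and cellular component).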
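The counting behind such a figure can be sketched with the standard library alone. The annotation pairs and the `top_terms` helper below are hypothetical and only illustrate the idea of ranking terms by frequency within one namespace.

```python
from collections import Counter

# Hypothetical (term name, namespace) pairs, one per predicted annotation.
annotations = [
    ("translation", "biological_process"),
    ("translation", "biological_process"),
    ("photosynthesis", "biological_process"),
    ("ATP binding", "molecular_function"),
    ("ATP binding", "molecular_function"),
    ("ATP binding", "molecular_function"),
    ("ribosome", "cellular_component"),
]


def top_terms(annotations, name_space, n=10):
    """Return up to n (term, count) pairs for one namespace, most frequent first."""
    counts = Counter(name for name, space in annotations if space == name_space)
    return counts.most_common(n)


for name_space in ("biological_process", "molecular_function", "cellular_component"):
    print(name_space, top_terms(annotations, name_space))
```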

import os
import pandas as pd
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt

def read_go(go_annotation_file):
    """Return the 10 lowest-p-value terms per namespace from one annotation file."""
    df = pd.read_csv(go_annotation_file, delimiter="\t")
    # keep rows flagged 'e' in the enrichment column
    df = df[df['enrichment'] == 'e'][['name', 'name_space', 'p_val']]
    frames = []
    for name_space in ['biological_process', 'molecular_function', 'cellular_component']:
        df_ = df[df['name_space'] == name_space]
        # head(10) keeps every row when there are fewer than 10
        frames.append(df_.sort_values(by='p_val').head(10))
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(frames)

# gather the selected terms from every file in the annotation directory
df = pd.concat([read_go(os.path.join('network_go_annotation', go_annotation_file))
                for go_annotation_file in os.listdir('network_go_annotation')])

# count how often each term occurs and keep the 10 most frequent per namespace
frames = []
for name_space in ['biological_process', 'molecular_function', 'cellular_component']:
    counter = Counter(df[df["name_space"] == name_space]["name"])
    df_ = pd.DataFrame(list(counter.items()), columns=['name', 'counts'])
    df_ = df_.sort_values(by='counts', ascending=False).head(10)
    frames.append(df_.assign(name_space=name_space))
df_plot = pd.concat(frames, ignore_index=True)

fig, ax = plt.subplots()
fig.set_size_inches(8, 8)
sns_plot = sns.barplot(x="counts", y="name", hue = "name_space", data=df_plot) 
ax.set(ylabel="", xlabel="counts (top 10)")
sns.despine(left=True, bottom=True)  
plt.savefig('go_annotation_counts.pdf', format='pdf', bbox_inches = 'tight')

GO annotation (Biological Process top 10)

The figure on the left shows the GO annotations of a gene. Among the three aspects of Gene Ontology, biological process is the most representative of gene function and is also the one most often shown.

from itertools import islice
from math import log10
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data_list = list()
with open('module_10.go', 'r') as go_annotation_file:
    # skip the first line (header) and read the next 10 rows
    for line in islice(go_annotation_file, 1, 11):
        items = line.strip().split('\t')
        term = items[1]
        count = int(items[3])
        pval = -log10(float(items[7]))
        # long format: one row per (term, measure), so both measures share one bar plot
        data_list.append([term, count, 'Counts'])
        data_list.append([term, pval, 'P value (-log10)'])

df = pd.DataFrame(data_list, columns=["term", "value", "type"])

# apply the theme before creating the figure so that it takes full effect
sns.set(style="white", context="talk")
fig, ax = plt.subplots()
fig.set_size_inches(8, 4)

g = sns.barplot(y="term", x="value", data=df, hue="type", ax=ax)
ax.set(ylabel="", xlabel="")
g.legend(loc='center left', bbox_to_anchor=(0.57, 0.48), ncol=1)

plt.savefig('go_bp.pdf', format='pdf', bbox_inches = 'tight')
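
The script plots P values on a -log10 scale, so more significant terms get longer bars. The sketch below shows the transform on its own. The `neg_log10` helper and its `min_p` floor are assumptions added for illustration: a P value reported as exactly 0 would otherwise make `log10` raise an error.

```python
from math import log10


def neg_log10(p_val, min_p=1e-300):
    """Return -log10(p_val), clamping at min_p so that p_val == 0 does not raise."""
    return -log10(max(p_val, min_p))


# Hypothetical P values: smaller values give larger bars.
for p_val in (0.05, 0.001, 0.0):
    print(p_val, neg_log10(p_val))
```

The floor value is arbitrary. Any small positive number works, as long as it is stated in the figure legend or methods.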