Güngör Budak's Blog

Bioinformatics, web programming, coding in general

Finding k-cores and Clustering Coefficient Computation with NetworkX

Assume you have a large network and you want to find k-cores of each node and also you want to compute clustering coefficient for each one. Python package NetworkX comes with very nice methods for you to easily do these.

k-core is a maximal subgraph whose nodes are at least k degree [1]. To find k-cores:

Add all edges you have in your network in a NetworkX graph, and use core_number method that gets graph as the single input and returns node – k-core pairs.

cores = networkx.core_number(G)

If you also need to compute clustering coefficient each node, use clustering method which gets graph and node as inputs for unweighted and returns a floating number as the clustering coefficient of the node.

cc = networkx.clustering(G, node)

Below there is an example script that does these operations. Read the code and prepare your input accordingly.


1. Preventing Unraveling in Social Networks: The Anchored k-Core Problem, http://www.cs.cornell.edu/home/kleinber/icalp12-kcore.pdf

GO Enrichment of Network Clusters

In my previous post, I mentioned how I clustered the network we obtained at the end. For functional annotation gene ontology (GO) enrichment has been done on these clusters.

There were 20 clusters and the HGNC names are obtained separately for each cluster and using DAVID functional annotation tool API, GO and pathway annotations are collected per cluster and these are saved separately.


Above URL was used to obtain chart report for some GO and pathways chart records. “ids” query was changed per cluster with HGNC names coming from each.

Functional annotation chart

Then, downloaded chart reports for each cluster are read and a matrix is created to show p values for each chart record per cluster. The matrix columns are 20 clusters and it has 1344 chart record. If the p value of the chart record in any cluster is smaller than or equal to 0.05, then it’s stored in the corresponding cell, otherwise 0 is written as the p value.

Network Clustering with NeAT - RNSC Algorithm

As we have obtained proteins at different times points from the experimental data, then we have found intermediate nodes (from human interactome) using PCSF algorithm and finally with a special matrix from the network that PCSF created, we have validated the edges and also determined edge directions using an approach which a divide and conquer (ILP) approach for construction of large-scale signaling networks from PPI data. The resulting network is a directed network and will be used and visualized for further analyses.

The first step was to cluster the network. RNSC (Restricted Neighbourhood Search Cluster Algorithm) was used to cluster the network into 20 and these were imported to Cytoscape as a network table and the network is manually splitted into 20 clusters based on RNSC results on Cytoscape.

Salmonella network

Next, nodes in the network is colored based on time points when they are found to be significantly changed. In previously visualized network, some nodes seen at more than one time point were not shown correctly. In this one, if they are seen at more than one time point, they are colored with the corresponding colors of time points.

Salmonella network zoomed

Searching Open Reading Frames (ORF) in DNA sequences - ORF Finder

Open reading frames (ORF) are regions on DNA which are translated into protein. They are in between start and stop codons and they are usually long.

The Python script below searches for ORFs in six frames and returns the longest one. It doesn’t consider start codon as a delimiter and only splits the sequence by stop codons. So the ORF can start with any codon but ends with a stop codon (TAG, TGA, TAA). Six frames are the sequence itself and its reverse complement, and each has three frames obtained by shifting one nucleotide right twice.

There are also local and online tools that perform the same task. My suggestions are:

Reconstructed Salmonella Signaling Network Visualized and Colored

After fold changes were obtained and HGNC names were found for each phosphopeptide, these were used to construct Salmonella signaling network using PCSF and then with the nodes that PCSF found as well, we generated a matrix which has node in the rows and time points in the columns and each cell shows the presence of corresponding protein under the corresponding time point(s).

The matrix has 658 nodes (proteins) and 4 time points as indicated before: 2 min, 5 min, 10 min and 20 min.

Öykü Eren Özsoy and Tolga Can from METU has developed a method which gets this matrix and an interactome and reconstructs a signaling network. This method enhances the interactions of the nodes in the network. Below, see this network on Cytocape.

Reconstructed network

Then, to understand which nodes are coming from the data and which of them belong to which time points, I colored the nodes according to a two-column table that has nodes in the first column and their time points in the second column.

Reconstructed network colored

Note there are nodes that are seen at multiple time points and in this network they are not specially colored. They will be considered later.

Python: Get Longest String in a List

Here is a quick Python trick you might use in your code.

Assume you have a list of strings and you want to get the longest one in the most efficient way.

>>>l=["aaa", "bb", "c"]
>>>longest_string = max(l, key = len)

Python: defaultdict(list) Dictionary of Lists

Most of the time, when you need to work on large data, you’ll have to use some dictionaries in Python. Dictionaries of lists are very useful to store large data in very organized way. You can always initiate them by initiating empty lists inside an empty dictionary but when you don’t know how many of them you’ll end up with and if you want an easier option, use defaultdict(list). You just need to import it, first:

from collections import defaultdict

And then initiate anywhere:

dict = defaultdict(list)

You can always collect data in this way:


Python: extend() Append Elements of a List to a List

When you append a list to a list by using append() method, you’ll see your list is going to be appended as a list:

>>>l2=["a", "b"]
['a', ['a', 'b']]

If you want to append elements of the list directly without creating nested lists, use extend() method:

>>>l2=["a", "b"]
['a', 'a', 'b']

Salmonella Data Preprocessing for PCSF Algorithm

This post describes data preprocessing in Salmonella project for Prize-Collecting Steiner Forest Problem (PCSF) algorithm.

Salmonella data taken from Table S6 in Phosphoproteomic Analysis of Salmonella-Infected Cells Identifies Key Kinase Regulators and SopB-Dependent Host Phosphorylation Events by Rogers, LD et al. has been converted to tab delimited TXT file from its original XLS file for easy reading in Python.

The data should be separated into time points files (2, 5, 10 and 20 minutes) each of which will contain corresponding phophoproteins and their fold changes. Data also has another dimension which is location (membrane, nucleus, cytosol) and when we looked at each location and each time point, we saw there are not many overlapping proteins in the same time point and different locations, therefore we decided to merge location data for each time point in each protein.

Overlapping proteins

Number of overlapping proteins in each dimensions (2N; 2 min. & Nucleus; 10C: 10 min. & Cytosol).

The data table is read for each time point row by row and if the row has any data satisfying conditions (significance 0.05 and variance 15) in one or more than one locations, they are stored and the maximum fold change among them (in the case of more than one locations) is obtained and stored with HGNC name of the protein. HGNC names of the proteins are obtained by converting IPI IDs of their peptides to HGNC names of the proteins (DAVID is used for this conversion). After collecting all HGNC names and maximum fold changes, each HGNC name and corresponding maximum fold change(s) are written as an output. If there are multiple maximum fold changes for an HGNC name due to the same protein in multiple rows, then their maximum is obtained and written next to that HGNC name.

Outputs of this preprocessing are four TXT files of HGNC name – fold change pairs and each file contains proteins that are found significant and the most activated or deactivated at one time point.

UPGMA Algorithm Described - Unweighted Pair-Group Method with Arithmetic Mean

UPGMA is an agglomerative clustering algorithm that is ultrametric (assumes a molecular clock - all lineages are evolving at a constant rate) by Sokal and Michener in 1958.

The idea is to continue iteration until only one cluster is obtained and at each iteration, join two nearest clusters (which become a higher cluster). The distance between any two clusters are calculated by averaging distances between elements of each cluster.

To understand better, see UPGMA worked example by Dr Richard Edwards.