Blog

Network Clustering with NeAT - RNSC Algorithm

So far, we have obtained proteins at different time points from the experimental data, found intermediate nodes (from the human interactome) using the PCSF algorithm, and then, using a special matrix built from the network that PCSF created, validated the edges and determined edge directions with a divide-and-conquer (ILP) approach for constructing large-scale signaling networks from PPI data. The resulting network is directed and will be used and visualized in further analyses.

Finding k-cores and Clustering Coefficient Computation with NetworkX

Assume you have a large network and want to find the core number of each node and also compute each node's clustering coefficient. The Python package NetworkX provides methods that make both easy. A k-core is a maximal subgraph in which every node has degree at least k [1]; the core number of a node is the largest k for which the node belongs to a k-core. To find core numbers, add all the edges of your network to a NetworkX graph and call the core_number method, which takes the graph as its single input and returns a mapping from each node to its core number.
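As a minimal sketch of both computations, the snippet below builds a small graph and calls nx.core_number and nx.clustering. The edge list is made up for illustration (a triangle plus one pendant node) and is not taken from any real network.

```python
import networkx as nx

# Toy edge list (made up): a triangle A-B-C plus a pendant node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]

G = nx.Graph()
G.add_edges_from(edges)

core = nx.core_number(G)        # dict: node -> core number
clustering = nx.clustering(G)   # dict: node -> local clustering coefficient

for node in sorted(G):
    print(node, core[node], round(clustering[node], 3))
```

Both calls return plain dictionaries keyed by node, so the results can be written to a file or merged with other node attributes directly.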

Searching Open Reading Frames (ORF) in DNA sequences - ORF Finder

Open reading frames (ORFs) are regions of DNA that can be translated into protein. They lie between start and stop codons and are usually long. The Python script in this post searches for ORFs in all six reading frames and returns the longest one. It does not treat the start codon as a delimiter and only splits the sequence at stop codons, so an ORF can start with any codon but always ends with a stop codon (TAG, TGA or TAA).
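The behaviour described above can be sketched roughly as follows. This is an illustrative reimplementation, not the original script: the function names are hypothetical, and it assumes the sequence contains only A, C, G and T and that a trailing stretch with no stop codon is discarded.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield codon stretches in one frame, each ending with a stop codon."""
    current = []
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        current.append(codon)
        if codon in STOP_CODONS:
            yield "".join(current)
            current = []
    # Assumption: a trailing stretch without a stop codon is dropped.


def longest_orf(seq):
    """Search the three forward and three reverse frames for the longest ORF."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            for orf in orfs_in_frame(strand, offset):
                if len(orf) > len(best):
                    best = orf
    return best


print(longest_orf("ATGAAATAGCCCTGA"))
```

Because start codons are ignored, a returned ORF may begin anywhere in the frame; requiring an ATG would mean trimming each stretch to its first start codon.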

Reconstructed Salmonella Signaling Network Visualized and Colored

After fold changes were obtained and HGNC names were found for each phosphopeptide, these were used to construct the Salmonella signaling network with PCSF. Then, including the nodes that PCSF added, we generated a matrix with nodes in the rows and time points in the columns, where each cell shows whether the corresponding protein is present at the corresponding time point. The matrix has 658 nodes (proteins) and 4 time points, as indicated before: 2 min, 5 min, 10 min and 20 min.
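One way such a presence matrix could be built is sketched below. The function name and the input layout (a dictionary from time point to a set of node names) are assumptions, and the node names P1 to P3 are placeholders rather than proteins from the actual network.

```python
TIME_POINTS = ["2 min", "5 min", "10 min", "20 min"]


def presence_matrix(nodes_by_time):
    """Return (nodes, rows), where rows[i][j] is 1 if nodes[i] is present
    at TIME_POINTS[j] and 0 otherwise."""
    nodes = sorted(set().union(*nodes_by_time.values()))
    rows = [
        [int(node in nodes_by_time.get(t, set())) for t in TIME_POINTS]
        for node in nodes
    ]
    return nodes, rows


# Placeholder input: which nodes appear in the network at each time point.
toy = {"2 min": {"P1", "P3"}, "10 min": {"P1"}, "20 min": {"P2"}}
nodes, rows = presence_matrix(toy)

for node, row in zip(nodes, rows):
    print(node, row)
```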

Python: Get Longest String in a List

Here is a quick Python trick you might use in your code. Assume you have a list of strings and you want to get the longest one in a concise and efficient way:

    >>> l = ["aaa", "bb", "c"]
    >>> longest_string = max(l, key=len)
    >>> longest_string
    'aaa'

Python: defaultdict(list) Dictionary of Lists

When you work with large data in Python, you will often need dictionaries. Dictionaries of lists are very useful for storing large data in an organized way. You can always create them by putting empty lists into an empty dictionary yourself, but when you don't know in advance how many keys you will end up with, or you simply want an easier option, use defaultdict(list). You just need to import it from the collections module first.
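A short illustration of the idea, using made-up key/value pairs: each missing key automatically gets an empty list the first time it is accessed, so there is no need to check whether the key exists before appending.

```python
from collections import defaultdict

# Made-up (key, value) pairs to group by key.
pairs = [("fruit", "apple"), ("veg", "carrot"), ("fruit", "pear")]

groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)  # a new key starts out as an empty list

print(dict(groups))
```

With a plain dict, the same loop would need an explicit check or a call to setdefault before each append.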

Python: extend() Append Elements of a List to a List

When you append a list to a list using the append() method, the whole list is added as a single nested element:

    >>> l = ["a"]
    >>> l2 = ["a", "b"]
    >>> l.append(l2)
    >>> l
    ['a', ['a', 'b']]

If you want to append the elements of the list directly, without creating nested lists, use the extend() method:

    >>> l = ["a"]
    >>> l2 = ["a", "b"]
    >>> l.extend(l2)
    >>> l
    ['a', 'a', 'b']

Salmonella Data Preprocessing for PCSF Algorithm

This post describes data preprocessing in the Salmonella project for the Prize-Collecting Steiner Forest (PCSF) algorithm. The Salmonella data, taken from Table S6 in Phosphoproteomic Analysis of Salmonella-Infected Cells Identifies Key Kinase Regulators and SopB-Dependent Host Phosphorylation Events by Rogers, LD et al., has been converted from the original XLS file to a tab-delimited TXT file for easy reading in Python. The data should be separated into one file per time point (2, 5, 10 and 20 minutes), each containing the corresponding phosphoproteins and their fold changes.
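The splitting step might look roughly like the sketch below. The column layout is an assumption, not the layout of Table S6: it supposes a header row with a hypothetical "protein" column followed by one fold-change column per time point, and the toy values are made up.

```python
import csv
import io

TIME_POINTS = ["2", "5", "10", "20"]  # minutes, as listed in the post


def split_by_time_point(handle):
    """Return {time point: [(protein, fold change), ...]} from a
    tab-delimited table. Column names here are hypothetical."""
    per_time = {t: [] for t in TIME_POINTS}
    for row in csv.DictReader(handle, delimiter="\t"):
        for t in TIME_POINTS:
            value = (row.get(t) or "").strip()
            if value:  # skip proteins with no measurement at this time point
                per_time[t].append((row["protein"], float(value)))
    return per_time


# Toy input with made-up values, standing in for the converted TXT file.
toy = io.StringIO(
    "protein\t2\t5\t10\t20\n"
    "P1\t1.5\t\t2.0\t\n"
    "P2\t\t0.5\t\t3.0\n"
)
result = split_by_time_point(toy)

for t in TIME_POINTS:
    print(t, result[t])
```

Each list in the result can then be written to its own file, one protein and fold change per line.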

UPGMA Algorithm Described - Unweighted Pair-Group Method with Arithmetic Mean

UPGMA is an agglomerative clustering algorithm, introduced by Sokal and Michener in 1958, that produces ultrametric trees (it assumes a molecular clock, meaning all lineages evolve at a constant rate). At each iteration the two nearest clusters are joined into a higher-level cluster, and the iteration continues until only one cluster remains. The distance between any two clusters is calculated by averaging the distances between all pairs of elements, one taken from each cluster. To understand it better, see the UPGMA worked example by Dr Richard Edwards.
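The iteration can be sketched in a few lines of Python. This is a simple illustration under stated assumptions rather than an optimized implementation: the function name and the input format (a dictionary from frozenset pairs to distances) are choices made here, and the three-taxon distances are made up. It recomputes each cluster distance as the mean over all element pairs, and records each merge at half the distance between the joined clusters.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` with UPGMA.

    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    Returns (tree, merges): a nested-tuple tree and a list of
    (joined pair, node height) in merge order.
    """
    # Key: subtree built so far; value: tuple of the leaf labels it contains.
    clusters = {label: (label,) for label in labels}
    merges = []

    def average_distance(members_a, members_b):
        total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Find the two nearest clusters.
        (tree_a, tree_b), d = min(
            (((a, b), average_distance(clusters[a], clusters[b]))
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        # Join them into a higher-level cluster.
        members = clusters.pop(tree_a) + clusters.pop(tree_b)
        clusters[(tree_a, tree_b)] = members
        merges.append(((tree_a, tree_b), d / 2))

    (tree,) = clusters
    return tree, merges


# Made-up distances between three taxa.
dist = {frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 6.0}
tree, merges = upgma(["A", "B", "C"], dist)
print(tree)
print(merges)
```

Here A and B are joined first because they are the closest pair, and C is then joined to the (A, B) cluster at the average of its distances to A and B.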