Güngör Budak's Blog

Bioinformatics, web programming, coding in general

In silico Network Inference Last Improvements and Visualization of Result in Cytoscape

I’m almost done with the analysis of in silico data, although I need to decide if I need further analysis with the inhibiting parent nodes in the network. Last, I couldn’t filter out duplicate edges, which were scored differently. Now, with some improvements in the script, low scores duplicates are filtered and there is a better final list of edges which is ready to be visualized.

I also tried visualizing it on Cytoscape. This is not required in the challenge but better to have to see the result. Below, there is a visualization of the result. It has colored edges and shaped edge ends (red and T-shaped edge is inhibiting, green and arrow-shaped edge is activating) according to type of interaction. Each edge has a score and it is indicated numerically and visually by having different thickness. The thicker the edge, the larger score it has relatively.

In silico network

In silico data-derived network from HPN-DREAM Challenge. Click on the network to enlarge it.

I will decide to do further analysis for in silico part according to the rank in the Leaderboard 1B. I have already started modifying the script to read and process experimental data. I will define relations between inhibitors, stimuli and antibodies for experimental data as I did for in silico data, then infer the network using the same script as in in silico part.

Some String Functions in R, String Manipulation in R

I have programmed with Perl, Python, and PHP before, and string manipulation was more direct and easier in them than in R. But still there are useful functions for string manipulation in R. I’m not an expert in R but I’ve been dealing with it for a while and I’ve learned some good functions for this purpose.

Concatenate strings

Concatenation is done with paste function. It gets concatenated strings as arguments separated bu comma and also separator character(s). Default for separator character is one white space. It directly returns output as string and starting strings remain the same.

str1 <- "Gungor"
str2 <- "Budak"
paste(str1, str2, sep="_") # Output: [1] "Gungor_Budak"

More on the documentation for paste

Extract character(s) from strings

This can be done with substr function. It gets string, start and stop indices for the extraction. It directly returns output as string and starting string remains the same.

str1 <- "Gungor"
substr(str1, 1, 3) # Output: [1] "Gun"

More on the documentation for substr

Split characters from strings

One function is strsplit which gets string and the character(s) from which it will be split as arguments. It returns a list and a list of splits in that list. So, to get a direct list of splits. You can use unlist.

str1 <- "Gungor"
unlist(strsplit(str1, "ng")) # Output: [1] "Gu" "or"

More on the documentation for strsplit

Check if character(s) are present in strings

Here, you can use grep functions. grepl takes pattern and string as arguments and returns TRUE if it matches and FALSE otherwise.

str1 <- "Gungor"
grepl("A", str1) # Output: [1] FALSE

More on the documentation for grep

Replace character(s) in strings

gsub is the function I use for this. It takes pattern, replacement and string as arguments. It returns changed string directly.

str1 <- "Gungor"
gsub("ng", "gn", str1) # Output: [1] "Gugnor"

More on the documentation for gsub

These are the ones I use most, but there are many more and I will be adding them to this post in the future.

Latest Progress on Network Inference and Edge Scoring

I have improved network inference part of the script slightly by changing the way of comparing intervention (presence of inhibitor and stimulus) and no intervention (presence of stimulus) data from in silico part.

Now, I’m using a function (simp) from an R package called StreamMetabolism, which gets time points and data values and (does integration) calculates the area under the curve (Sefick, 2009). I do this integration for both condition and then I compare them.

I also set a threshold for edge scoring. First, I set it as 0.05 and looked at the graphs of low scores. The curves from the graphs were not significantly different for these low scores. So I made it 0.1 and again observed low scores. In this case, the curves were looking different enough to be compared and to be included as an edge in the networks.

I’m inferring the network in this part for these conditions:

Inhibitors Stimuli Targets
INH1_loLIG1 loLIG1 AB12_L1
INH1_hiLIG1 hiLIG1 AB12_H1
INH2_loLIG2 loLIG2 AB5_L2
INH2_hiLIG2 loLIG2 AB5_H2
INH3_loLIG1 loLIG1 AB8_L1
INH3_hiLIG1 hiLIG1 AB8_H1
INH3_loLIG2 loLIG2 AB8_L2
INH3_hiLIG2 hiLIG2 AB8_H2

So each condition is scored separately but in networks characters like “L1” are stripped and only target names are written with the antibodies. Because of this, in networks I can get multiple results for the same kind of edge. Of course, they differ in score and I will only write the result with highest score.

Below, you can see the output from latest script.

Network edge scores

Somehow, I need to find more edges with considering targets (AB12, AB5 and AB8) as inhibitors or stimuli. For example, according to the output above, AB8 causes an inhibition in AB2. So the new analysis should be done using AB8 as inhibitor and stimuli for AB8 as stimuli for AB2. This will give another set of results of relations between AB8 and the rest. In this way, there can be some more subsequent analysis to obtain enough edge to construct a complete network.

Reference

Sefick, Stephen. 2009. Stream Metabolism-A package for calculating single station metabolism from diurnal Oxygen curves. From http://CRAN.R-project.org/package=StreamMetabolism

Scoring Edges Network Inference HPN-DREAM Challenge

Yesterday, I managed to infer a network for some part of in silico data from the challenge. Since the challenge also asks for scoring the edges in networks, I developed the script further and add a function for that.

edgeScorer function gets data object of average time points for each curve in intervention/no-intervention sets and scores each edge for each set of conditions. For this, first, it looks for the largest difference among the sets and set it as maxDifference and later, it stores differences divided by maxDifference in another data object. In this way, the edge with maximum difference is scored as 1 and the rest is scored relative to it.

I will use these scores for network inference. I will determine a threshold and the network printed will be consisting only of the edges with a score above the threshold.

Determining Edges More Progress on Network Inference

Lately, I have been writing an R script to infer network using in silico data. Last version of the script was reading MIDAS file and plotting expression profiles. I have modified it and now it reads MIDAS file, does some analyses and prints causal relations to a file. This file is a SIF file as required.

This dataset is generated with 20 antibodies but only 3 of them are perturbed. Also, for one, stimulus is missing. So, right now I can only determine causal edges between 2 of them and the rest.

In script, midasReader functions gets the content and stores each experiment in data object. And dataOptimizer reads the data object and makes it ready for inference. It does some kind of curve comparison (the curves obtained using dataPlotter, for example) and stores the results in another data object. Finally, networkInferrer gets this data object and compares related experiments and decides which kind of edge and prints in into a SIF file.

dataOptimizer does its job by taking average of all time points that belong to the same kind of experiment (the same stimulus and/or inhibitor set). Decision will be made by comparing the average of values obtained by intervention and the ones by no-intervention.

networkInferrer compares intervention and no-intervention averages and sets “relation” variable accordingly. If intervention average value is smaller than no-intervention average value, then the “relation” variable is set to “1”, because inhibition of parent node causes a decrease in the amount of child node, so the conclusion is that there is an excitatory (1) relationship between parent node and child node.

AB1 in silico network

The image above shows an example for two different sets. First, inhibition of AB12 with INH1 and stimulation with low amount of LIG1. Second, there is a change in the amount of stimulus. Both resulted the same kind of relation that is excitatory (1). Notice that these relations are defined between AB1 and AB12 because INH1 and LIG1 target AB12.

The same kind of inference can be done for AB5 which is the other antibody targeted by an inhibitor and stimulus. But since there is only inhibitor for AB8, I don’t know how to analyze it yet.

In this approach, I can only look for relations between these 3 antibodies and the rest. But, since inhibition of a child node might cause something on another grandchild node, with this kind of subsequent analysis, relations for all antibodies can be determined with each other.

Plotting Expression Profiles Data Analysis for Network Inference

For in silico data network inference I decided to develop a script because the existing tools have bugs and they are not compatible with the data. At the same time, I will try to report bugs and the compatibility issues to developers.

in silico data has 660 experiment results of 20 antibodies, 4 kinds of stimuli and 3 kinds of inhibitors. Antibodies are treated with a stimulus, say at t_0 and in the case of inhibitors, say at t_i, antibodies are pre-incubated for some time (t_pre) and then, treated with a stimulus. So, the relation is:

t_i = t_0 - t_pre

The causal edges between nodes (antibodies) in the network will be determined by the expression profile of any node, in the presence of intervention on parent node. That is, when AB12 is inhibited, if AB1 expression decreases over time, then we can draw (excitatory) edge between AB12 and AB1.

In the script for network inference, right now there are two functions: midasReader and dataPlotter.

midasReader takes MIDAS file path and child node as the arguments, reads MIDAS file into a data frame and then according to properties of data, it returns a data frame that can be used to plot expression profile graphs. Child node argument is used to get only related (expression of child node in different cases) data from whole dataset. In in silico data case, each experiment is done three times for each treatment (there are 20 different treatments) and each time point (there are 11 time points). So, there are 220 experiments, which are returned by midasReader in a data frame in three columns (treatments, timePoints, data).

Since there are three data values for each case, I took average of those for this time. But there must be more accurate ways of using them. I have to work on that more.

After the data frame is created, dataPlotter, gets three arguments to plot the graph of expression profiles, First argument is the data frame. The others are no intervention and intervention conditions (i.e. hiLIG1 and INH1+hiLIG1). The relations of inhibitors, stimuli and their targets are given in the table below.

DREAM8 in silico targets table

Relationship between ligands, stimuli and targets for the in silico model, Taken from: https://www.synapse.org/#!Wiki:syn1720047/ENTITY/56061

So, when I give hiLIG1 and INH1+hiLIG1 as input and select the child node as AB1, I’m looking for relation between AB12 and AB1. Below, there is the graph for this case. I ran the script as shown below on terminal:

bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInf.R MD_insilico.csv AB1 hiLIG1 INH1+hiLIG1

AB1 expression data high

In this graph, there is no clear difference between expression profiles, so we might not tell if there is a relation looking at this graph. More accurate analyses should be done. After being sure about the accuracy of these analyses, I will move on constructing networks using these graphs.

Another example which shows clear difference is below:

bigcat@bigcat-shut-01:~/gungor/netinf$ ./netInf.R MD_insilico.csv AB1 loLIG1 INH1+loLIG1

AB1 expression data low

Here, we can say intervention on AB12 in the case of (INH1+loLIG1) causes a descrease in AB1 expression so in low amount of stimulus, AB12 is triggering AB1 amount.

However, as I mentioned above, the analysis is not efficient and should be developed. And then, more accurate results can be obtained.

Webinar on HPN-DREAM Breast Cancer Network Inference Challenge

DREAM8 organizers plan a webinar about HPN-DREAM Breast Cancer Network Inference Challenge on July 19, at 10:30 - 11:30 (PDT / UTC -7). General setup of the challenge, demo submissions to the leaderboard will be discussed and also questions about the challenge will be accepted during webinar. The number of the participants to the challenge is also announced: 138.

Registration to the webinar is done using this form. There are limited number of “seats”, but later recordings will be published.

Network Inference Challenge in silico Data

I had a meeting with BiGCaT this week and we discussed DREAM Breast Cancer Challenge. I presented the challenge and also some ways that I have found to solve the first sub-challenge network inference. Tina, from BiGCaT, suggested starting with in silico data which is much simpler than breast cancer data. Later, I can use the methods I develop for in silico data in experimental data.

in silico data contains 20 antibodies, 3 inhibitors and 2 ligand stimuli with 2 different concentration for each. And data for 10 time-points post-stimulus is provided. Inhibitors work by blocking the kinase domain of a protein. Any phosphorylated kinase is considered to be active and an inhibition on it also affects its downstream kinase activity. Inhibitors sequester away the unphosphorylated form of the target kinase and thus limit the amount of phosphorylated forms.

I used CellNOptR and CNORfeeder tools for network inference. CNORfeeder is a tool with several methods for network inference and it has its FEED method that creates boolean tables using the data and then it is possible to infer network based only on the data (i.e. data-derived network).

In silico data network

I plotted the network above using only expression data from the challenge in CNORfeeder. For this, mapBTables2model function that converts boolean table to network model should be called without a model or “model=NULL”. I had to convert treatment headers in MIDAS files into a different format because there is a small error in reading MIDAS and converting into CNOlist object: “TR:ABC:Stimuli” to “TR:ABC” and “TR:ABC:Inhibitors” to “TR:ABCi”. And a new form of makeBTables function CNORfeeder because there was no stimuli/no inhibitors condition in the data, which is not suitable for the old version.

CellNOptR and CNORfeeder have been developed by CNO developers from European Bioinformatics Institute (EBI) more information can be found on their official website.

First Impressions and Thoughts on Rosalind Project

Actually, I signed up Rosalind.info 8 months ago, I didn’t really play around with it. But last week, in a BiGCaT science cafe, after I learnt it, I was more interested than before and I just started solving problems.

In each problem, you have a description about the context and also about the problem. Also, there is a sample input and output. Sometimes there are hints about the solution. What I did was to write a solution that works for the sample and hopefully for the problem. I wrote the scripts in Python, because the project says it is optimized for that and also I want to learn more about Python.

After submitting the solution, if it is successful, you can see the suggested solutions coming from other users. Each solution is really unique and worth checking out. In this way, you can also learn new approaches. And, the project directs you the next question which is available and good to be solved after that solution, which is nice, too. Moreover, there are user success statistics, such as top 100 and by country. This might help users to be motivated by competition. From Turkey, there are only seven users and they don’t look like solving problems right now, but I hope there will be more.

There are many badges that can be earned after certain amount of successful solutions and they are motivating you to continue.

My username on Rosalind is gungorbudak. I have solved all the problems in “Python Village” and some from “Bioinformatics Stronghold”. I will go on solving more in my spare time.

Rosalind is a good way to learn programming and basics in biology, genetics, statistics, computer science and bioinformatics.

Playing around with CellNOptR Tool and MIDAS File

With CellNOptR, we will try to construct network models for the challenge. For this, the tool needs two inputs. First one is a special data object called CNOlist that stores vectors and matrices of data. Second one is a .SIF file that contains prior knowledge network which can be obtained from pathway database and analysis tools.

CNOlist contains following fields: namesSignals, namesCues, namesStimuli and namesInhibitors, which are vectors storing the names of measurements. Another vector called timeSignals stores time points. The field valueSignals contains protein abundance measurements and the fi elds valueCues, valueStimuli and valueInhibitors) are boolean matrices with a 1 or 0, for each condition (row) when the corresponding cue (column) is present or absent, respectively [4].

MIDAS File

A. Experimental setup; B. Data obtained from the experiment; C. MIDAS description. Taken from Bioinformatics (2008) 24 (6): 840-847.doi: 10.1093/bioinformatics/btn018*

We are given MIDAS files and each row in a MIDAS file belongs to a single experimental sample and each column is one sample attribute, such as identity, treatment condition, or value obtained from an experimental assay. The column header has two values: a two-letter code defining the type of column, (e.g. TR for treatment, DV for data value), and a short column name (e.g. a small molecule inhibitor added or a protein assayed). Each cell stores the corresponding value for the row (sample) such as a presence or absence of stimulus/inhibitor, time point, or data value [1].

CellNOptR has a function that can read a MIDAS file and with another one it is possible to convert it to CNOlist and work on it. readMIDAS gets the content of CSV file to a data object and using that data object, makeCNOlist makes a CNOlist[2]. After I did it and look at the CNOlist, it worked but there were some issues with naming. Stimuli vector was named as inhibitors and inhibitors vector was named as stimuli. It also stores variances of data values. The variances are not zero if replicates are found [3].

So I was able to read experimental data as an object with a small problem in naming that can be solved easily by changing the names of the vectors.

This was the first input required for CellNOptR tool. Second one is the prior knowledge network and in the following days, I will work on it.

References

  1. Flexible informatics for linking experimental data to mathematical models via DataRail, http://bioinformatics.oxfordjournals.org/content/24/6/840.full
  2. Training of boolean logic models of signalling networks using prior knowledge networks and perturbation data with CellNOptR, http://www.bioconductor.org/packages/release/bioc/vignettes/CellNOptR/inst/doc/CellNOptR-vignette.pdf
  3. Package ‘CellNOptR’, http://www.bioconductor.org/packages/release/bioc/manuals/CellNOptR/man/CellNOptR.pdf
  4. In Silico Systems Biology: Scripting with CellNOptR, http://www.cellnopt.org/cytocopter/resources/tutorial_wtac_2013.pdf