Güngör Budak's Blog

Bioinformatics, web programming, coding in general

Progress on Network Inference Sub-Challenge

This sub-challenge has several requirements:

  • Directed, causal edges in the models (32 models: 4 cell lines × 8 stimuli)
  • Edges should be scored (normalized to the range between 0 and 1) to show confidence
  • Nodes will be the phosphoproteins from the data
  • A prior knowledge network (which can be constructed from pathway databases) may be used (in fact, it is a must for some network inference tools)

The first thing was to look for existing tools. ddepn seemed a good option, but it didn't work for us. We checked Bioconductor for other tools suited to our purpose. There was a tool called CellNOptR, which I thought we could use for the second sub-challenge. Actually, the second sub-challenge has recently been modified on Synapse, so right now I'm not sure about that. But for the first sub-challenge, CellNOptR and a related tool called CNORfeeder will be useful.

This tool takes two inputs. One is data from microarray experiments, which is simply protein abundance measurements under several stimuli/inhibitors; the other is a prior knowledge network (PKN) that can be constructed from pathway sources such as WikiPathways and KEGG. It uses various inference methods to integrate the data and validate the network models.
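To make the two inputs concrete, here is a minimal sketch in R along the lines of the CellNOptR vignette. The DREAM4 example data ships with CellNOptR; the commented lines show how our own inputs would be loaded instead (the file names there are placeholders, not real files):

library(CellNOptR)

# Example inputs bundled with CellNOptR (the DREAM4 case from the vignettes)
data(CNOlistDREAM, package = "CellNOptR")  # perturbation data
data(DreamModel, package = "CellNOptR")    # prior knowledge network

# Our own inputs would instead come from files, e.g. (placeholder names):
# cnolist <- makeCNOlist(readMIDAS("our_data.csv"), subfield = FALSE)
# pkn     <- readSIF("our_pkn.sif")

# Compress and expand the PKN against the data, then train a Boolean model
model <- preprocessing(CNOlistDREAM, DreamModel)
opt   <- gaBinaryT1(CNOlist = CNOlistDREAM, model = model, verbose = FALSE)

The trained network could then be exported for the SIF/EDA submission files the challenge asks for.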

So, we need to construct a PKN with the phosphoproteins in the data and then infer the network models. Next, we need to score each edge on the models and store the results in SIF and EDA files.

This sub-challenge also has an in silico data part with similar requirements.

I tried an example (from the DREAM4 challenge) given in the CNORfeeder vignette, and it worked as expected. In the vignette, a data-driven network is shown without integration with any PKN, but how it is created is not explained.

Retrieving Data with AJAX using jQuery, PHP and MySQL

Last semester, I took a course from the Informatics Institute at METU called "Biological Databases and Data Analysis Tools", where we first learned what a database is and how to run queries on one; the technology behind databases was also covered. Then we learned about the many biological databases and data analysis tools available, including gene, protein and pathway databases and tools for creating databases.

As a final project, we were asked to create an online tool that can search a database and display the retrieved data in any web browser. We were given a table, and using it and some given conditions we retrieved another table from Ensembl BioMart and created the database. We then built a user interface for searching and displaying the data.

A MySQL database was used, and PHP, which is powerful and well suited to web programming, was the choice of server-side language.

For our task, we implemented AJAX using jQuery (a JavaScript library). The purpose was to make the search process easy and fast: the search is triggered as soon as the first three letters of the query are entered, and the results are displayed on the page without a page refresh.

The project is available online on this website.

To do this, as I said, we used AJAX calls. AJAX works on the client side and makes asynchronous calls to the server without disturbing the current state of the page. That is, we don't have to stop viewing the page to fetch the data and then reload the page with it; any data can be retrieved from the database without a page refresh.

The method builds on jQuery's "ajax" method, which takes the necessary information from the user, sends it to a scripting language running on the server, such as PHP, and finally retrieves the result from the server-side script and shows it to the user.

In the function below, the script gets the value of the text field with id "query" and the names of the checked boxes (which determine which columns to fetch from the database) and stores them in an object called "data". Then, when the length of "query" is greater than 2, it executes the "ajax" method, with the submission type set to "post", the script that will interact with the database set to "/process.php", the data passed along, and a callback function that inserts the result into a div with id "results".

function getResults() {
  var data = {};
  // Current query text, trimmed of surrounding whitespace
  data['query'] = $("#query").val().trim();
  // All checked option boxes; their values select the database columns
  var boxes = $("input[name=options]:checked");

  $.each(boxes, function(key, value){
    data[key] = $(value).val();
  });

  // Search only once at least three characters have been typed
  if (data['query'].length > 2) {
    $.ajax({
      url: '/process.php',
      type: 'post',
      data: data,
      success: function(response) {
        // Insert the HTML generated by process.php into the results div
        $("#results").html(response);
      }
    });
  }
}

And in the process.php file, the data coming from AJAX is received and used to query the database. While the rows are being retrieved, the HTML code for the insertion is generated and echoed out at the end; this is what jQuery receives and displays.

This JavaScript function can fetch the results, but it still has to be triggered somehow. There are many possibilities: typing letters into the text field, pasting into the text field, pressing Enter, and changing the search options are the ones I handled. jQuery has great methods for these, and they are really simple; we used the change(), bind(), on() and keypress() methods.

For example, below you can see how the ENTER key (indicated by the key code 13) is used to trigger the function. Note that we prevent the key from submitting the form by returning false.

$("#query").keypress(function(action) {
  if(action.which == 13) {
    getResults();
    return false;
  }
});

How to use the others can be found in the jQuery documentation.

If you have any question about this post, please leave a comment below.

Using Online Tools for Teaching Bioinformatics

I attended one of the science café meetings of the BiGCaT group today, where we discussed the use of online tools for teaching bioinformatics.

Andra Waagmeester (a PhD student from BiGCaT) introduced the Rosalind project as a teaching tool. The project mainly focuses on problem solving in bioinformatics: various problems of the kind encountered in bioinformatics research are posed on the website, and solving them helps you learn the field.

On the website, it is possible to start a class (with a faculty member account) and build a curriculum with the desired content from the project. It is also possible to post new problems. There is also a discussion section where one can ask questions about the problems and look for help; the replies can be up- or downvoted, so it is generally useful.

On Rosalind, problems can be solved using any programming language, but they said the site is optimized for Python, so Python is the better choice. Among the problem sets, one is about learning Python, which is good. There are also two sets, Bioinformatics Stronghold and Bioinformatics Armory, where you build up bioinformatics knowledge.

There is also Codecademy, where both learning and teaching programming are possible. It doesn't focus on bioinformatics specifically, but it could be used to teach it, so it's worth checking out.

Using online tools as a study platform in classes is quite novel, and I think it should be tied into grading (which might raise the students' interest). However, some issues were also discussed at the meeting: one is the stability and availability of the tools over the whole course period; another is the availability of solutions on the web, for example on GitHub.

Network Inference DREAM Breast Cancer Challenge

Causal edges are inferred from the change seen in one node after an intervention on another node. If the time curves obtained with and without the intervention overlap, there is no relation; otherwise, we can draw an edge between those nodes, and depending on whether the level goes up or down, the edge is activating or inhibiting. These causal edges are context-specific, so different cell line data may give different relations.
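As a minimal sketch of this rule (in R, with made-up numbers and a made-up tolerance), one could compare a target's time course with and without inhibition of a candidate source node:

# Compare a target's time course with and without inhibiting the source.
# Returns 0 (no edge), 1 (activating) or -1 (inhibiting).
callEdge <- function(control, treated, tolerance = 0.1) {
  if (max(abs(treated - control)) < tolerance) {
    return(0)   # curves overlap: no relation
  }
  if (mean(treated - control) < 0) {
    return(1)   # target drops when the source is inhibited: activating edge
  }
  return(-1)    # target rises when the source is inhibited: inhibiting edge
}

# Made-up example: the target falls after the intervention,
# so the source is called an activator
callEdge(control = c(1.0, 1.2, 1.5, 1.6), treated = c(1.0, 0.8, 0.6, 0.5))

This is only the shape of the idea, not the final method; how the real intervention is oriented decides the sign.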

Also, edge confidence scores should be obtained. Right now I have no idea how to get them, but we will discuss it.

The relations and scores will be stored in SIF and EDA files and submitted to the competition.
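Both files are plain text: a SIF file lists one edge per line (source node, interaction, target node), and an EDA file assigns a score to each edge. With made-up node names and scores, they would look roughly like this (the exact interaction labels expected by the challenge are specified on Synapse):

network.sif:
NodeA 1 NodeB
NodeB -1 NodeC

network.eda:
NodeA (1) NodeB = 0.83
NodeB (-1) NodeC = 0.42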

All of this can be done by writing scripts specific to the task. However, before that, I looked for existing tools and found some. There is an R package called RPPAnalyzer, designed to read RPPA results, compare the samples and plot a graph at the end, but this is not exactly what we need in this challenge (see its CRAN page). Another R package, written specifically for constructing signaling networks, is ddepn (Dynamic Deterministic Effects Propagation Networks); it infers signaling networks from time-course RPPA data (see its R-Forge page).

So I started with ddepn. I installed the package (R version 3.0.1) and, before using our data, tried the example described in its vignette; a similar example is also present in its documentation. However, I got an error before plotting the network: when I attempt to run the ddepn function to apply the genetic algorithm, it gives "Error in get("envDDEPN") : object 'envDDEPN' not found". I have to find a way to solve this, and then I can do the same with our data.

It looks like the inference I get from this step will be necessary for the next step with the data files, because for the next step the use of the CellNOptR package (see its official page) is suggested, and it needs both a network and data to make predictions.

DREAM Breast Cancer Sub-challenges

I have been going over the sub-challenges before attempting to solve them. As I mentioned, there are three sub-challenges and somehow they are connected.

First, using the given data and other possible data sources such as pathway databases, the causal signaling networks of the phosphoproteins should be inferred. There are 4 cell lines and 8 stimuli, so they make 32 networks in total. Nodes are phosphoproteins, and edges should be directed and causal (activating or inhibiting).

4 different treatments are applied to the samples before stimulation; these are inhibition treatments, and one of them is a vehicle control (DMSO). After that, the samples are stimulated and the phosphoprotein levels are measured at different time points.

This sub-challenge also has another part, in which in silico data is provided and only one network is asked to be inferred from the data. The characteristics of this data are different: the training dataset has time points for 20 phosphoproteins under various stimuli and node inhibitions (see Sub-challenge 1: Network Inference for more).

Second, predictions of the phosphoprotein trajectories should be made. It is also asked to propose a model that can go beyond this data (the breast cancer proteomics and in silico datasets) (see Sub-challenge 2: Time-course Prediction for more).

Third, a visualization of the data should be designed so that it can be interpreted in meaningful ways. This part is only for the breast cancer proteomics dataset (see Sub-challenge 3: Visualization for more).

HPN-DREAM Breast Cancer Network Inference Challenge

Understanding signaling networks may bring more insight into cancer treatment, because cells respond to their environment by activating these networks, and phosphorylation reactions play important roles in them.

The goal of this challenge is to advance our ability to infer signaling networks and to predict protein phosphorylation dynamics. We are also asked to develop a visualization method for the data.

The dataset provided is extensive, the result of RPPA (reverse-phase protein array) experiments. It covers four (breast cancer) cell lines, each with proteomics data obtained under 3 different inhibitors plus one control (DMSO) and 8 different stimuli over 7 time points, and each contains the levels of about 45 phosphoproteins. There is also an additional dataset in which all proteins (phosphorylated forms and total proteins) are measured at later time points. Moreover, there is an in silico dataset with similar characteristics (see Data Description on Synapse).

RPPA is a method to quantify protein levels in lysates from cells or tissues. A video about this technique can be watched at this link.

Using this data, we are asked to complete three sub-challenges.

(1) Network Inference: Modeling causal signaling networks from training data
(2) Time-course Prediction: Prediction of trajectories of protein levels following inhibitor perturbation(s) not seen in the training data
(3) Visualization: Designing a visualization strategy for high-dimensional molecular time-course data sets such as the ones used in this challenge

More information can be found on their official website.

More about the sub-challenges and how we approach solving them is coming soon.

Dream Challenge

This year, the 8th DREAM Challenge takes place, and I will be working on it as my internship project at BiGCaT, Bioinformatics, UM. The challenge brings scientists together to catalyze the interaction between experiment and theory in the area of cellular network inference and quantitative model building in systems biology (as stated on their webpage).

In this competition, I will work on a specific challenge about network modeling, dynamic response prediction and data visualization. This challenge is named "HPN-DREAM breast cancer network inference challenge", and information about it can be found on this Synapse page.

In this blog, I will write about the progress of the project, try to explain the steps, tools and methods I use, and explain more about the challenge.

Performing Multiple Searches in SRS

Because the latest version of the parsing script examines more reads than the previous ones did, searching SRS for a name for every single read had become quite a time-consuming process; in fact, the last analysis took 4 days.

To cut this down, I changed the parsing script completely. As always, it first takes the reads that pass the threshold, but now I collect their ID numbers directly in an array. I then turn this list into a single string, with the elements separated by the pipe character. Finally, I search this string directly in SRS with the getz command, get the names of all the organisms passing the threshold in a single query, and read and split them one by one, storing them in a hash.

while (@params = $blast_obj->next_alignment(fields => ['name', 'identity', 'overlap'])) {
    $overlap_seen  = $params[2];
    $mismatch_seen = $params[2] - $params[1];

    if ($overlap_seen >= $overlap_threshold and $mismatch_seen <= $mismatch_threshold) {
        if ($database eq "refseq_dna") {
            # Capture the accession part of the hit name (e.g. XX_123456)
            $params[0] =~ /:([A-Z0-9]*\_[A-Z0-9]*)/;
            push(@names, $1);
        }
    }
}

# Join all IDs with "|" so SRS can be queried once for the whole set
$ids = join("|", @names);

if ( !($ids eq "") ) {
    open (PIPE, "getz '[$database:$ids]' -vf 'ORGANISM' |") or die $!;
    while (my $line = <PIPE>) {
        $line =~ /\t(.*)/;
        $organism_names{$1}++;
    }
    close PIPE;   # close inside the if, since PIPE is only opened here
}

Again, I capture each ID number with a regular expression and append it to the @names array with push. When this is done for all of them, I join the list elements with the "|" character between them to obtain the $ids string. Then, if this string is not empty, I run the search with the getz command, read the multi-line output it produces in a specific way, and add each organism name as a key to the %organism_names hash.

Parsing MegaBLAST Results

The last step in the pipeline is to parse the output produced for the searched sequences with another script. In this step, each megablast file is read, and the values of parameters such as each sequence's name, identity and overlapping length are stored and printed in a way that suits the purpose.

In my project, I use a parser called Inslink, which is part of the HUSAR package and returns the fields mentioned above to me as an array. The only thing this parser does is read the file and store the values of the requested fields.

Then, by extending the code, I display these stored values, and with a few additional lines I display the meaningful results I need.

Again, I wrote this script in Perl, using Emacs on Unix.

Using the information coming from the parser, the script first performs the threshold check, then fetches the organism name of every read that passes with the getz command, stores and counts the names in a hash, and prints them afterwards. This way, I can make interpretations based on the number of reads per organism passing the threshold.

I changed this script a lot during the development phase of the project. Different needs came up while investigating different genomes, so with each change it becomes more reliable. I won't post the earlier versions, but I will share the script as I last coded it.

Evaluating the Quality Line - Quality Filter

To further improve the pipeline that will perform the contaminating organism (contaminant) analysis and to obtain more meaningful results, we thought of adding a quality filter to the first steps (while the fastq file is still being processed). By filtering out reads below a certain threshold at that early stage, we will obtain more reliable results.

We will do this quality control by interpreting the 4th line of each read in the fastq file. This 4th line (actually the sequencing quality score of the read) is written (encoded) in various ways by various sequencing instruments, and the quality score has to be recovered from this encoding before filtering can be applied. Therefore, we first need to know the encoding format of the instrument that did the sequencing and recover the score. I have no idea about this yet; I have sent an e-mail to the department I received the sequences from and will soon have all the information about the instrument.
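For example (with a made-up quality string, sketched in R), the two common Phred offsets give very different scores, which is exactly why the encoding must be known first:

# The same quality string decoded under the two common Phred offsets
qual <- "IIII###"
utf8ToInt(qual) - 33   # Phred+33: 40 40 40 40  2  2  2
utf8ToInt(qual) - 64   # Phred+64:  9  9  9  9 -29 -29 -29 (negative values,
                       # so this string cannot be Phred+64)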

I could also do this myself, but instead I am thinking of using an existing tool. For now, the candidates are the FASTQ Quality Filter tool from the FASTX-Toolkit, PRINSEQ and FastQC. I will investigate these tools one by one, determine which is the most suitable, and use it in the pipeline.

These tools will basically remove the low-quality reads from my fastq file. That is, they will not do any trimming; they will discard the read entirely. This way, I will be left with fewer but more reliable reads.
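Just to make the idea concrete, here is a minimal sketch in R of such a filter, assuming Phred+33 encoding and a mean-quality threshold (the real tools offer more refined criteria, and one of them will do this job in the pipeline):

# Keep only reads whose mean Phred quality reaches a threshold;
# whole 4-line records are kept or discarded, nothing is trimmed.
filter_fastq <- function(infile, outfile, min_mean_q = 20) {
  lines <- readLines(infile)
  keep <- character(0)
  for (i in seq(1, length(lines), by = 4)) {
    qual <- utf8ToInt(lines[i + 3]) - 33      # decode the 4th line (Phred+33)
    if (mean(qual) >= min_mean_q) {
      keep <- c(keep, lines[i:(i + 3)])       # keep the whole record
    }
  }
  writeLines(keep, outfile)
}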

Afterwards, I am thinking of randomly selecting, from among these reads, the ones to be used for the megablast search. For now, I make the selection by starting from certain points in the file and taking 1000 reads.
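A tiny sketch in R of what that random selection could look like (the total read count here is made up):

# Pick 1000 distinct read indices uniformly at random
# instead of taking blocks from fixed positions in the file
n_reads <- 2500000                  # made-up total number of reads
chosen  <- sort(sample.int(n_reads, 1000))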