Güngör Budak's Blog

Bioinformatics, web programming, coding in general

Set Up Google Cloud SDK on Windows using Cygwin

Windows isn’t the best environment for software development, I believe, but if you have to use it, there is good software to make it easier. Here, Cygwin will let us use the Google Cloud tools, but the installation has a few requirements that you should be aware of beforehand.

You’ll need

Note: You’ll need to select these packages during Cygwin installation. If you already have Cygwin 32-bit, just rerun the installer, make sure you select them all, and install all dependencies when you’re asked.

To set up

  • Open Cygwin Terminal by right-clicking it and choosing “Run as administrator”
  • Navigate to the folder that contains “google-cloud-sdk” (it comes from the GCloud SDK download, so move it somewhere like “C:\”)
  • Run ./google-cloud-sdk/install.sh
  • Follow instructions

Hopefully, you won’t hit any errors and will get it working.

One last note: to run the GCloud tools in Cygwin Terminal, you’ll always have to start it with “Run as administrator”, or you’ll get “Permission denied” errors.

Super Long Introns of Euarchontoglires

I got another weird result in my exon/intron boundary analysis. Intron lengths have been shown to increase toward the genes of less diverse species. However, according to my findings, at the level of Euarchontoglires (or Supraprimates) this increase is very sharp and seems unexpected. So I looked at the exon and intron lengths of each gene at each taxonomic rank to try to see what gives Euarchontoglires genes such long introns.

Exon - intron lengths 1

As you can see in the graph above, Euarchontoglires introns are very long compared to the rest. So I pulled out the Euarchontoglires genes whose introns are longer than 10000 bp on average:

  • ENSG00000176124 (61886 bp)
  • ENSG00000255470 (48283 bp)
  • ENSG00000233611 (43231 bp)
  • ENSG00000253161 (23128 bp)
  • ENSG00000056487 (13482 bp)

When I checked their summaries on Ensembl, I saw that most of them have transcripts that are not protein coding so they tend to have longer introns relative to protein coding transcripts’ introns.

So a solution might be to retrieve the biotypes of the transcripts and filter out the ones that are not protein coding, because in this project we’re focusing on protein-coding genes.

With the Ensembl API, getting biotypes is really easy. All I need to do is add the following to my script:

$transcript_object->biotype

So I got my data with its biotype information and filtered out the ones that are not protein coding. Later, when I repeated the analysis with the new data, the unexpected peak at Euarchontoglires introns was gone.
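The filtering step itself is simple once each transcript carries its biotype. Here is a minimal Python sketch of the idea, not the script I actually used: the record layout, the gene and transcript names and the intron lengths are all made up for illustration. Only the biotype strings follow Ensembl's naming, where protein-coding transcripts are labelled "protein_coding".

```python
from typing import NamedTuple


class TranscriptRecord(NamedTuple):
    """Hypothetical record; the field names are assumptions for this sketch."""
    gene_id: str
    transcript_id: str
    biotype: str
    mean_intron_length: float


def keep_protein_coding(records):
    """Return only the records whose biotype is 'protein_coding'."""
    return [r for r in records if r.biotype == "protein_coding"]


# Invented example data, not real results.
records = [
    TranscriptRecord("GENE_A", "TR_A1", "protein_coding", 5200.0),
    TranscriptRecord("GENE_A", "TR_A2", "processed_transcript", 61000.0),
    TranscriptRecord("GENE_B", "TR_B1", "lincRNA", 43000.0),
]

filtered = keep_protein_coding(records)
for r in filtered:
    print(r.transcript_id, r.biotype, r.mean_intron_length)
```

Filtering per transcript rather than per gene keeps the protein-coding transcripts of a gene that also has non-coding ones.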

Exon - intron lengths 2

There is still a lot to be done of course, but for this particular issue, I solved it in this way.

An Exon of Length 2 Appeared in Ensembl

I want to share an interesting finding about our research on exon/intron analysis of human evolutionary history.

So I had the genes that emerged at each pass point of human history and I was using Ensembl API to get exons and introns of these genes to perform further analyses.

One gene (ENSG00000197568, HERV-H LTR-associating 3, HHLA3) came with a surprise: one of its transcripts (ENST00000432224) had an exon (ENSE00001707577) of length 2. At first I didn’t notice how odd this was, but later, in group discussions, it became obvious that an exon with only 2 bases cannot occur.

So we checked other databases (NCBI, UCSC Genome Browser) for the same gene and found that the exon was not there. Their gene-finding algorithms placed those 2 bases inside an intron, so the transcript has one exon fewer than the one in the Ensembl databases.

This shows that gene-finding algorithms are still far from perfect, and that different sources need to be checked before drawing conclusions about exons and introns.
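A cheap sanity check can catch cases like this before they reach the analysis. The Python sketch below flags exons shorter than a chosen minimum. The threshold of 3 bases and the coordinates are assumptions for illustration; only the exon ID comes from the case above.

```python
def flag_short_exons(exons, min_length=3):
    """Return (exon_id, length) pairs for exons shorter than min_length.

    `exons` is an iterable of (exon_id, start, end) tuples with 1-based,
    inclusive coordinates, so length = end - start + 1.
    """
    flagged = []
    for exon_id, start, end in exons:
        length = end - start + 1
        if length < min_length:
            flagged.append((exon_id, length))
    return flagged


# Coordinates are invented; the second exon ID is a placeholder.
exons = [
    ("ENSE00001707577", 1000, 1001),  # 2 bases long
    ("EXON_EXAMPLE_2", 2000, 2149),   # 150 bases long
]
suspicious = flag_short_exons(exons)
print(suspicious)
```

Anything it flags still needs a manual look in other databases, as we did here.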

How to Convert PLINK Binary Formats into Non-binary Formats

PLINK is a whole genome association analysis toolset. To save time and space, you can convert your data files to binary formats (BED, BIM, FAM), but when you need to view the files, you have to convert them back to non-binary formats (PED, MAP) so that you can open them in a text editor such as Notepad on Windows.

This operation is really easy. It requires PLINK, of course, and the following line typed into a DOS window (Run -> type cmd; hit ENTER) in the PLINK directory:

plink --bfile YOUR_BINARY_FILE --recode --out YOUR_NON-BINARY_FILE

First, you need to install PLINK if you don’t have it.

Note that this tutorial is for Windows.

Go to the Download section and download the correct version for your system. For Windows, it’s MS-DOS.

Then, extract it to the “C:” drive on your computer. Make sure that you have plink.exe in the extracted folder. That’s it.

To convert your files, start a new DOS window and navigate to your PLINK directory which is “C:\plink-1.07-dos”. To do that type:

cd c:\plink-1.07-dos

PLINK conversion

Once you have changed to PLINK’s directory, you are ready to start the conversion.

To avoid confusion, it’s better to create a folder inside “C:\plink-1.07-dos”, say, “files”. Then, move the BED, FAM and BIM files into this folder. With the command below, you can convert these files into their non-binary forms.

plink --bfile files/YOUR_BINARY_FILE_NAME --recode --out files/YOUR_NON-BINARY_FILE_NAME

Replace “YOUR_BINARY_FILE_NAME” with the name of your files (they share the same name except for the extension), and replace “YOUR_NON-BINARY_FILE_NAME” with anything you want.

Next, hit ENTER and wait for the analysis. After it’s done you’ll see:

Analysis finished: CURRENT DATE

You can navigate to your files folder (C:\plink-1.07-dos\files) and see your non-binary forms PED and MAP.
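If you would rather inspect the result with a script than in Notepad, the MAP file is easy to parse: each line has four whitespace-separated columns (chromosome, SNP identifier, genetic distance, base-pair position). The following is a minimal Python sketch. The function name and the two marker lines are invented for illustration.

```python
import io


def read_map(handle):
    """Parse PLINK MAP lines into dicts, skipping blank lines."""
    markers = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        chrom, snp_id, genetic_dist, bp_pos = fields[:4]
        markers.append({
            "chromosome": chrom,
            "snp_id": snp_id,
            "genetic_distance": float(genetic_dist),
            "position": int(bp_pos),
        })
    return markers


# Two invented marker lines standing in for a real MAP file.
example = io.StringIO("1 rs0001 0 1000\n1 rs0002 0 2500\n")
markers = read_map(example)
print(len(markers), "markers; first:", markers[0]["snp_id"])
```

For a real file, pass an opened file handle for the .map file in your files folder instead of the in-memory example.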

More about PLINK and information for other operating systems can be found on PLINK website.

How to Get Transcripts (also Exons & Introns) of a Gene using Ensembl API

As part of my project, I need to obtain the exons and introns of certain genes. These are human genes selected for a specific reason that I will describe later when I explain my project. For now, I want to share how to obtain this information using the (Perl) Ensembl API. Note that Ensembl has introduced a nice new way of getting data (the Ensembl REST API), but it is in beta and doesn’t provide exon/intron information, so we have to use the Perl API.

If you haven’t installed Ensembl API, visit my Ensembl (Perl) API installation post.

To begin with, I want to share a tutorial set by Ensembl, which I used to learn the API. The tutorials are really useful, so for more detailed information about the API, see the filmed API workshop. The Doxygen Perl documentation also describes the classes of the API.

First, let’s create a registry to be able to use adaptors:

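use Bio::EnsEMBL::Registry;

# The registry is used through its class name; adaptors are requested from it.
my $registry = 'Bio::EnsEMBL::Registry';
# Connect to the public Ensembl database server as the anonymous user.
$registry->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');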

Basically, in this API (specifically the Ensembl core database API) we have “adaptors” (for genes, transcripts, …) and “objects” (genes, transcripts, …). Adaptors are used to retrieve objects from the Ensembl database. So, if you want information about a gene, you first have to get a gene adaptor. Then you call its “fetch_by_stable_id” method with an Ensembl gene ID (e.g. ENSG00000198590) to obtain the gene object.

my $gene_adaptor = $registry->get_adaptor('Homo sapiens', 'Core', 'Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000198590');

This gene object will then be used to get transcripts, and each transcript will be used to get exons and introns. We don’t have to create more adaptors ourselves, because once we have the gene object, the transcript adaptor is set up automatically. The same goes for exons (introns don’t have an adaptor because they are not stored separately in the Ensembl databases). So, to get transcripts, we use the “get_all_Transcripts” method of the gene object:

foreach my $transcript (@{ $gene->get_all_Transcripts }) {
}

In the foreach loop above, exons and introns can be retrieved with the “get_all_Exons” and “get_all_Introns” methods of the transcript object. Each exon or intron can then be processed by looping in the same way:

foreach my $exon (@{ $transcript->get_all_Exons }) {
}

foreach my $intron (@{ $transcript->get_all_Introns }) {
}
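Since introns are not stored separately, it helps to think of them as the gaps between consecutive exons of a transcript. The Python sketch below illustrates that idea only. It is not how the Ensembl API implements “get_all_Introns”, and the function name and coordinates are made up.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons as (start, end) tuples.

    `exons` holds (start, end) pairs with 1-based, inclusive coordinates.
    Exons are sorted by start before the gaps are computed.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron when there is at least one base between exons.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Invented coordinates for a three-exon transcript.
exons = [(100, 200), (301, 400), (1001, 1100)]
introns = introns_from_exons(exons)
print(introns)
```

A transcript with a single exon yields an empty list, which matches the fact that it has no introns.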

I suggest you check that each object is non-empty, because for some genes the Ensembl databases return empty (undefined) objects, and calling any method on them gives errors. So do the checks with an if clause:

if ($gene) {
    # get its transcripts
}

if ($exon) {
    # get its sequence
}

After you get each object there are other methods to obtain its ID, sequence, location, etc. Here I will give methods for ID and sequence retrieval but you can always refer to Doxygen Perl (core API) documentation for more information.

To print the IDs of genes, transcripts and exons (not introns, because they don’t have…), we use the “stable_id” method of the objects:

print $gene->stable_id, "\n";
print $transcript->stable_id, "\n";
print $exon->stable_id, "\n";

To print sequence of objects:

print $exon->seq->seq();
print $intron->seq();

Please note the small difference in printing intron sequences.

So that’s all. The complete script that I use to get the exons and introns of a gene in FASTA format is available here in my GitHub repo. You run the script by supplying the gene ID as an argument:

gungor@gungor:~$ perl projects/eiban/eiSingleGet.pl ENSG00000198590
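For readers who want to see what the FASTA output step amounts to, here is a small Python sketch of formatting one sequence as a FASTA record. It is not taken from my script. The header layout, the 60-character line width and the sequence are assumptions for illustration.

```python
def to_fasta(header, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [">" + header]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical header and an invented 80-base sequence.
record = to_fasta("ENSG00000198590|exon_1", "ACGT" * 20, width=60)
print(record, end="")
```

Wrapping the sequence is optional in FASTA, but fixed-width lines are easier to read in a text editor.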

Geany Color Schemes Ubuntu

There is a collection of color schemes for Geany as well.

Download it on GitHub and follow the instructions.

You’ll need to extract and copy all the files in colorschemes directory to ~/.config/geany/colorschemes/

Then, restart Geany and go to View -> Editor -> Color Schemes and choose your style.

I’m using Tango.

Source

Install Geany 1.23 on Ubuntu

Geany is a really nice text editor for Ubuntu. I would recommend it with the TreeBrowser plugin and some interface customizations such as color schemes.

But you’ll need the latest version, which is 1.23 for now.

To install this version you need to add the PPA, which will also keep Geany updated when you update your system.

Execute the following lines one by one:

sudo add-apt-repository ppa:geany-dev/ppa
sudo apt-get update
sudo apt-get install geany

Then, when you start Geany, you’ll see “This is Geany 1.23” in the status bar.

Source

Install Apache2, PHP5, MySQL & phpMyAdmin on Ubuntu 12.04

First, install apache2:

sudo apt-get install apache2

Then, for it to work, restart it:

sudo service apache2 restart

For custom www folder:

sudo cp /etc/apache2/sites-available/default /etc/apache2/sites-available/www
gksudo gedit /etc/apache2/sites-available/www

Change the DocumentRoot and Directory directives to point to the new location, for example /home/user/www/

Save the file (see also: clean URLs not working in Laravel 4).
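If you prefer to script the edit instead of doing it by hand in gedit, the change boils down to two substitutions. The Python sketch below shows the idea on an in-memory sample. The function name is hypothetical, and the sample assumes the default site points at /var/www, so check your own file before relying on it.

```python
import re


def retarget_docroot(config_text, new_root):
    """Point DocumentRoot and the matching <Directory> block at new_root."""
    match = re.search(r"^[ \t]*DocumentRoot[ \t]+(\S+)", config_text,
                      flags=re.MULTILINE)
    if match is None:
        raise ValueError("no DocumentRoot directive found")
    old_root = match.group(1).rstrip("/")
    new_root = new_root.rstrip("/")
    # Replace the path on the DocumentRoot line itself.
    text = re.sub(
        r"^([ \t]*DocumentRoot[ \t]+)\S+",
        lambda m: m.group(1) + new_root,
        config_text,
        flags=re.MULTILINE,
    )
    # Replace only the <Directory> block that pointed at the old root.
    text = re.sub(
        r"<Directory\s+" + re.escape(old_root) + r"/?>",
        lambda m: "<Directory " + new_root + "/>",
        text,
    )
    return text


# Assumed shape of the default site file; yours may differ.
sample = (
    "<VirtualHost *:80>\n"
    "    DocumentRoot /var/www\n"
    "    <Directory />\n"
    "        AllowOverride None\n"
    "    </Directory>\n"
    "    <Directory /var/www/>\n"
    "        AllowOverride None\n"
    "    </Directory>\n"
    "</VirtualHost>\n"
)
updated = retarget_docroot(sample, "/home/user/www/")
print(updated)
```

The root-level <Directory /> block is left alone on purpose, since only the block for the old document root should move.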

Make www default and disable default:

sudo a2dissite default && sudo a2ensite www
sudo service apache2 restart

Create a new file in www:

echo "<b>Hello! It is working!</b>" > /home/user/www/index.html

Go to http://localhost/

If you get 403 Forbidden error:

chmod -R 755 /home/user/www/

Next, install php5:

sudo apt-get install libapache2-mod-php5

Enable:

sudo a2enmod php5

Restart apache2:

sudo service apache2 restart

Check if it works:

mkdir ~/www/test
gedit ~/www/test/index.php

Enter:

<?php echo "It's working"; ?>

Save

Go to http://localhost/test/

Next, install mysql:

sudo apt-get install mysql-server libapache2-mod-auth-mysql php5-mysql

Set a password for the MySQL root user when prompted.

Finally, install phpmyadmin:

sudo apt-get install phpmyadmin

Select apache2 and then “Yes”, then enter your password.

Open the following file and add the line Include /etc/phpmyadmin/apache.conf:

gksudo gedit /etc/apache2/apache2.conf

Restart apache2:

sudo service apache2 restart

Navigate to http://localhost/phpmyadmin/

After all these steps, you should be able to run PHP files on your Apache server and use MySQL with phpMyAdmin.

More on Ubuntu Help