Structural Superimposition of Local Sequence Alignment using BioPython

This task was given to me as homework in one of my university courses, and I wanted to share my solution because I could not find such an entry on the Internet. The objectives are: download two PDB files automatically from the server; do the pairwise alignment after getting their amino acid sequences; superimpose them and report the RMSD. The Bio.PDB module from BioPython works very well in this case.
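The three objectives above can be sketched roughly as follows. This is not the solution from the post, only a minimal outline under several assumptions: it uses the first chain of the first model of each structure, CA atoms only, BLOSUM62 with illustrative gap penalties for the local alignment, and the legacy PDB file format. The function names are hypothetical; BioPython is imported inside the functions that need it.

```python
def pairs_from_blocks(blocks_a, blocks_b):
    """Expand aligned (start, end) blocks into per-residue index pairs."""
    pairs = []
    for (a_start, a_end), (b_start, b_end) in zip(blocks_a, blocks_b):
        # Aligned blocks are gap-free, so both ranges have the same length.
        pairs.extend(zip(range(a_start, a_end), range(b_start, b_end)))
    return pairs


def sequence_and_ca_atoms(pdb_id, workdir="pdb_files"):
    """Download one PDB entry; return (sequence, CA atoms) of its first chain."""
    from Bio.PDB import PDBList, PDBParser, PPBuilder

    path = PDBList().retrieve_pdb_file(pdb_id, pdir=workdir, file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)
    chain = next(iter(structure[0]))  # assumption: first chain of first model
    letters, atoms = [], []
    for peptide in PPBuilder().build_peptides(chain):
        letters.append(str(peptide.get_sequence()))
        atoms.extend(peptide.get_ca_list())
    return "".join(letters), atoms


def superimpose_local_alignment(pdb_id_a, pdb_id_b):
    """Locally align two sequences and return the RMSD over the aligned CA atoms."""
    from Bio.Align import PairwiseAligner, substitution_matrices
    from Bio.PDB import Superimposer

    seq_a, atoms_a = sequence_and_ca_atoms(pdb_id_a)
    seq_b, atoms_b = sequence_and_ca_atoms(pdb_id_b)

    aligner = PairwiseAligner()
    aligner.mode = "local"
    # Assumed scoring scheme; adjust to the assignment's requirements.
    aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
    aligner.open_gap_score = -10
    aligner.extend_gap_score = -0.5
    alignment = aligner.align(seq_a, seq_b)[0]

    blocks_a, blocks_b = alignment.aligned
    pairs = pairs_from_blocks(blocks_a, blocks_b)
    fixed = [atoms_a[i] for i, _ in pairs]
    moving = [atoms_b[j] for _, j in pairs]

    superimposer = Superimposer()
    superimposer.set_atoms(fixed, moving)  # both lists must have equal length
    return superimposer.rms
```

Calling `superimpose_local_alignment` with two PDB IDs of your choice needs network access and returns the RMSD over the locally aligned residues only.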

Blog

How to Install openpyxl on Windows

openpyxl is a Python library to read/write Excel 2007 xlsx/xlsm files. To install it on Windows, first download it from Python Packages. Then extract the tarball you downloaded, open CMD, navigate to the extracted folder and run the following:

C:\Users\Gungor>cd Downloads\openpyxl-2.1.2.tar\dist\openpyxl-2.1.2\openpyxl-2.1.2
C:\Users\Gungor\Downloads\openpyxl-2.1.2.tar\dist\openpyxl-2.1.2\openpyxl-2.1.2>python setup.py install

It will install everything and report any errors. If nothing in the output looks like an error, you’re good to go.
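One way to double-check the result, beyond reading the installer output, is to ask Python whether it can find the package. The helper name `is_installed` below is my own; the sketch uses only the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current Python."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the package from this post is visible to the interpreter.
for name in ("openpyxl",):
    status = "is installed" if is_installed(name) else "is NOT installed"
    print(name, status)
```

Run it with the same Python interpreter that executed `setup.py install`, since each installation keeps its own packages.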

How to Install Numpy Python Package on Windows

Numpy (Numerical Python) is a great Python package that you should definitely make use of if you’re doing scientific computing. Installing it on Windows might be difficult if you don’t know how to do it via the command line. There are unofficial Windows binaries for Numpy, for both 32-bit and 64-bit Windows, which make it super easy to install. Go to the link below and download the one for your system and Python version: http://www.

Data Preprocessing II for Salmon Project

In our Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells project, we have several methods to construct the networks, so the data still needs to be preprocessed before it can be analyzed with these methods. One method needed a matrix whose first row holds the protein name and the time points (2 min, 5 min, 10 min, 20 min), with the value of each protein at each time point set to 1 or 0 according to variance, significance and the size of the fold change.
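As an illustration of that matrix layout, here is a minimal Python sketch (the project itself uses R). The thresholds `min_abs_log2_fc` and `alpha` are hypothetical, the variance criterion is left out, and the function names are my own; it shows the shape of the output, not the project's actual rules.

```python
import math

TIME_POINTS = ("2 min", "5 min", "10 min", "20 min")


def binarize(fold_changes, p_values, min_abs_log2_fc=1.0, alpha=0.05):
    """Return one 0/1 flag per time point.

    A time point is 1 when the fold change is large enough AND significant.
    Both thresholds are assumed values; fold changes must be positive ratios.
    """
    flags = []
    for fold_change, p_value in zip(fold_changes, p_values):
        large = abs(math.log2(fold_change)) >= min_abs_log2_fc
        flags.append(1 if (large and p_value <= alpha) else 0)
    return flags


def build_matrix(measurements):
    """Build rows: a header row, then one row per protein.

    `measurements` maps a protein name to (fold_changes, p_values).
    """
    header = ["protein", *TIME_POINTS]
    rows = [[name, *binarize(fcs, ps)] for name, (fcs, ps) in measurements.items()]
    return [header, *rows]
```

The returned list of rows can be written out with the `csv` module if a file is needed.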

How to Convert PED to FASTA

You may need to convert PED files to FASTA format for further analyses in your studies. Use the script below for this purpose. PED to FASTA converter on GitHub. It takes the first 6 columns of each line as the header line and the rest as the sequence, replacing 0s with Ns, and organizes the result into a FASTA file. Note that 0 is the default code for missing nucleotides in PLINK. How to run:
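The run command itself is not reproduced in this excerpt. The conversion described above can be sketched as follows; this is not the GitHub script, only a minimal reimplementation of the description. Joining the six header columns with underscores and the function names are my own assumptions.

```python
def ped_line_to_fasta(line):
    """Convert one PED line into a two-line FASTA record."""
    fields = line.split()
    header, alleles = fields[:6], fields[6:]
    # 0 is PLINK's default code for a missing nucleotide; write it as N.
    sequence = "".join("N" if allele == "0" else allele for allele in alleles)
    # Assumption: header columns are joined with underscores.
    return ">" + "_".join(header) + "\n" + sequence


def ped_to_fasta(ped_path, fasta_path):
    """Convert every non-empty line of a PED file into a FASTA record."""
    with open(ped_path) as ped, open(fasta_path, "w") as fasta:
        for line in ped:
            if line.strip():
                fasta.write(ped_line_to_fasta(line) + "\n")
```

Only the genotype columns are rewritten; a 0 in the header columns (for example an unknown parent ID) is left untouched.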

Data Preprocessing I for Salmon Project

Since we’ll be using R for most of the analyses, we converted the XLS data file to CSV using MS Office Excel 2013. We then had to fix several lines using Sublime Text 2, because three columns in these lines were left unquoted, which later caused problems when reading the file in RStudio. The data contains phosphorylation data for 8553 peptides. Many data points are missing for many peptides. Because the peptides were identified by IPI IDs, which are no longer supported, we had to convert the IPI IDs to HGNC approved symbols; the data did include symbols as names, but they looked outdated.

Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells

In this study, we’re going to use phosphorylation data from a research paper on phosphoproteomic analysis of related cells. The idea is to use and compare existing methods, and to develop them further, in order to better understand the nature of signaling events in these cells and to find key proteins that might be targets for disease diagnosis, prevention and treatment. This study will be submitted as a research paper, so I’m not going to publish any results here for now, but I’ll mention the struggles I have and the solutions I try.

Download Human Reference Genome (HG19 - GRCh37)

Many variation calling tools and many other methods in bioinformatics require a reference genome as input, so you may need to download the human reference genome or its sequences. There are several sources that freely and publicly provide the entire human genome, and I’ll describe how to download the complete human genome from the University of California, Santa Cruz (UCSC) webpage. An index to the gzip-compressed FASTA files of the human chromosomes can be found here at the UCSC webpage.
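If you prefer to script the download, a minimal Python sketch could look like this. The base URL and the `chr<name>.fa.gz` file naming are my assumptions about the UCSC download server layout, so check them against the index page mentioned above; the function names are hypothetical.

```python
import gzip
import os
import shutil
import urllib.request

# Assumed location of the per-chromosome hg19 files; verify before use.
BASE_URL = "https://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/"
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]


def chromosome_url(chrom):
    """Build the URL of one gzip-compressed chromosome FASTA file."""
    return f"{BASE_URL}chr{chrom}.fa.gz"


def download_chromosome(chrom, out_dir="."):
    """Download one chromosome, decompress it, and return the FASTA path."""
    gz_path = os.path.join(out_dir, f"chr{chrom}.fa.gz")
    urllib.request.urlretrieve(chromosome_url(chrom), gz_path)
    fa_path = gz_path[:-3]  # drop the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(fa_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return fa_path
```

Looping over `CHROMOSOMES` and calling `download_chromosome` for each entry fetches the whole set; the files are large, so expect this to take a while.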

ClipCrop Installation on Linux Mint 16 nvm, Node, npm Included

ClipCrop is a tool for detecting structural variations from SAM files, and it’s built with Node.js. ClipCrop uses two software packages internally, so they should be installed first.

Install SHRiMP2

SHRiMP is a software package for aligning genomic reads against a target genome.

$ mkdir ~/software
$ cd ~/software
$ wget http://compbio.cs.toronto.edu/shrimp/releases/SHRiMP_2_2_3.lx26.x86_64.tar.gz
$ tar xzvf SHRiMP_2_2_3.lx26.x86_64.tar.gz
$ cd SHRiMP_2_2_3
$ file bin/gmapper
$ export SHRIMP_FOLDER=$PWD

Install BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome.

JointSNVMix Installation on Linux Mint 16 Cython, Pysam Included

JointSNVMix is a software package that consists of a number of tools for calling somatic mutations in tumour/normal paired NGS data. It requires Python (>= 2.7), Cython (>= 0.13) and Pysam (== 0.5.0). Python should already be installed by default on a Linux machine, so I will describe the installation of the others and of JointSNVMix. Note that this guide may become outdated after some time, so please check that it is still current before following all the steps.

Install Cython

Set Up Google Cloud SDK on Windows using Cygwin

I believe Windows isn’t the best environment for software development, but if you have to use it, there is nice software that makes it easier. Cygwin will help us use the Google Cloud tools here, but its installation requires certain things that you should be aware of beforehand. You’ll need:

Python latest 2.7.x
Google Cloud SDK
Cygwin 32-bit (i.e. setup-x86.exe - note only this one works)
openssh, curl and latest 2.

Super Long Introns of Euarchontoglires

I got another weird result in my exon/intron boundaries analysis research. Intron lengths are shown to increase towards the genes of less diverse species. However, according to my findings, at the point of Euarchontoglires (Supraprimates) this increase is very sharp and seems unexpected. So I looked at the exon/intron lengths of each gene in each taxonomic rank to try to see what gives Euarchontoglires genes such long introns. As you see in the graph above, Euarchontoglires introns are very long compared to the rest.

An Exon of Length 2 Appeared in Ensembl

I want to share an interesting finding from our research on exon/intron analysis of human evolutionary history. I had the genes that emerged at each pass point of human history, and I was using the Ensembl API to get the exons and introns of these genes for further analyses. One gene (ENSG00000197568 - HERV-H LTR-associating 3 - HHLA3) came with a surprise: its transcript ENST00000432224 had an exon (ENSE00001707577) of length 2.

How to Convert PLINK Binary Formats into Non-binary Formats

PLINK is a whole genome association analysis toolset. To save time and space, you need to convert your data files to binary formats (BED, FAM, BIM), but when you need to view the files, you have to convert them back to non-binary formats (PED, MAP) so that you can open them in a text editor such as Notepad on Windows. This operation is really easy. It requires PLINK, of course, and the following line typed into a DOS window (Run -> type cmd; hit ENTER) in the PLINK directory: