Güngör Budak's Blog

Bioinformatics, web programming, coding in general

Structural Superimposition of Local Sequence Alignment using BioPython

This task was given to me as homework in one of my university courses, and I wanted to share my solution since I couldn't find a similar entry on the Internet.

The objectives here are:

  • Download (two) PDB files automatically from the server
  • Do the pairwise alignment after getting their amino acid sequences
  • Superimpose them and report RMSD

The Bio.PDB module from Biopython works very well for this task; it's what I used in this solution.
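
Under the hood, Bio.PDB's Superimposer finds the rotation and translation that minimize the RMSD over the matched atom pairs, which is the Kabsch algorithm. As a sketch of what that fit actually computes, here is a minimal pure-NumPy version (independent of Biopython; the function name and coordinate layout are mine, not part of the Bio.PDB API):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD of P onto Q after optimal superposition (Kabsch algorithm).
    P, Q: (N, 3) arrays of matched atom coordinates (e.g. aligned CA atoms)."""
    # Center both coordinate sets on their centroids (the optimal translation)
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the residual deviation
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))
```

Superimposer reports exactly this quantity as its `rms` attribute after `set_atoms` is called with the two matched atom lists.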

How to Install openpyxl on Windows

openpyxl is a Python library to read/write Excel 2007 xlsx/xlsm files. To download and install on Windows:

Download it from Python Packages

Then, to install it, extract the tarball you downloaded, open CMD, navigate to the extracted folder, and run the following:

C:\Users\Gungor>cd Downloads\openpyxl-2.1.2.tar\dist\openpyxl-2.1.2\openpyxl-2.1.2
C:\Users\Gungor\Downloads\openpyxl-2.1.2.tar\dist\openpyxl-2.1.2\openpyxl-2.1.2>python setup.py install

It's going to install everything and report any errors. If nothing looks like an error, you're good to go. Verify the installation by opening Python (command line) and running the following line:

>>> from openpyxl import Workbook

openpyxl documentation is a great place to start with it.
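
To confirm everything works end to end, here is a minimal round trip, writing a workbook and reading it back (the file name is arbitrary):

```python
from openpyxl import Workbook, load_workbook

# Create a workbook with one sheet and a couple of rows
wb = Workbook()
ws = wb.active
ws.title = "Data"
ws.append(["name", "value"])  # header row
ws.append(["pi", 3.14159])
wb.save("example.xlsx")

# Read it back and check a cell (B2 is row 2, column B)
wb2 = load_workbook("example.xlsx")
print(wb2["Data"]["B2"].value)  # 3.14159
```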

How to Install Numpy Python Package on Windows

Numpy (Numerical Python) is a great Python package that you should definitely make use of if you're doing scientific computing.

Installing it on Windows can be difficult if you don't know how to do it via the command line. There are unofficial Windows binaries for Numpy (32- and 64-bit) which make it super easy to install.

Go to the link below and download the one for your system and Python version:

Then just double-click it and follow the usual Windows installation steps.
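
Once installed, you can verify it from the Python prompt with a quick check:

```python
import numpy as np

print(np.__version__)           # prints the installed version
a = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]]
print(a.sum(axis=0))            # [3 5 7]
```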

Taken from http://stackoverflow.com/a/11200146/1833371

Data Preprocessing II for Salmon Project

So, in our Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells project, we have several methods for constructing the networks, and the data still needs to be preprocessed before it can be analyzed with these methods.

One method needed a matrix whose first row holds the protein name and the time series (2 min, 5 min, 10 min, 20 min), with the value of each protein at each time point set to 1 or 0 according to variance, significance and the size of the fold change. If the variance was less than 15% and the significance was less than 0.05, the script collected the values of the peptides belonging to the same protein across all time points and all locations (N, M, C), found the maximum value among them, and determined at which time point that maximum was observed. Accordingly, for each protein, a "1" was put under the time point where it had its maximum fold change across all time points and locations, and a "0" was put under the rest.

Proteins with an NA variance, significance or fold change were ignored. And if one row listed multiple proteins, they were split and each received the same 1s and 0s.
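
The binarization described above can be sketched in plain Python. Everything here is an illustration of the described logic, not the project's actual script: the input layout, the 0.15 scale for the 15% variance cutoff, and the semicolon separator for multi-protein rows are all assumptions.

```python
TIMES = ["2min", "5min", "10min", "20min"]

def binarize(peptides):
    """peptides: list of dicts with keys 'proteins' (e.g. "P1;P2"),
    'variance', 'significance', and 'fold' mapping (time, location) -> value.
    Returns protein -> 0/1 vector over TIMES."""
    best = {}  # protein -> (max fold change seen, time point of that max)
    for pep in peptides:
        v, s = pep["variance"], pep["significance"]
        if v is None or s is None:    # ignore NA variance/significance
            continue
        if v >= 0.15 or s >= 0.05:    # keep only low-variance, significant peptides
            continue
        folds = {k: f for k, f in pep["fold"].items() if f is not None}
        if not folds:                 # ignore all-NA fold changes
            continue
        # maximum over all time points and locations (N, M, C)
        (time, loc), fmax = max(folds.items(), key=lambda kv: kv[1])
        for prot in pep["proteins"].split(";"):  # multi-protein rows are split
            if prot not in best or fmax > best[prot][0]:
                best[prot] = (fmax, time)
    # one row per protein: "1" under the time of its maximum fold change
    return {prot: [1 if t == time else 0 for t in TIMES]
            for prot, (_, time) in best.items()}
```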

Yet another method needs a similar matrix, but here a protein can have multiple 1s because the method requires every observed maximum from all time points satisfying the conditions. Since the data is multi-dimensional (it has time series and locations), we need to decide how to arrange the data by location. We haven't done that yet, so I'm going to cover it later.

The script is written in Python and will be published soon.

How to Convert PED to FASTA

You may need to convert PED files to FASTA format for further analyses in your studies. Use the script below for this purpose.

PED to FASTA converter on GitHub

It takes the first six columns of each line as the header and the rest as the sequence, replacing 0s with Ns, and organizes the result into a FASTA file.

Note that 0s denote missing nucleotides by default in PLINK.

How to run:

perl ped2fasta.pl "C:\path\to\PED_File"

Data Preprocessing I for Salmon Project

Since we'll be using R for most of the analyses, we converted the XLS data file to CSV using MS Office Excel 2013, and then we had to fix several lines in Sublime Text 2 because three columns in those lines were left unquoted, which later caused problems reading the file in RStudio.

The data contains phosphorylation measurements for 8553 peptides, with many missing data points. The peptides were identified by IPI IDs, which are no longer supported, so we had to convert the IPI IDs to HGNC approved symbols; the data did include such symbols as names, but they looked outdated.

To convert the IDs, the IPI IDs were first saved into a TXT file, one per line. Then, the conversion tool from the Database for Annotation, Visualization and Integrated Discovery (DAVID) was used to map IPI IDs to UniProt ACCs (57,125 mapping lines were obtained). Next, the obtained UniProt ACCs were mapped to HGNC IDs using ID Mapping from UniProt (10,925 mapping lines). Finally, to obtain HGNC approved symbols, all protein-coding gene data from HGNC was downloaded with customized columns allowing conversion of HGNC IDs to HGNC approved symbols (43,127 mapping lines). After collecting these three files, they were read in R, their column names were fixed, and they were merged by UniProt ACC and HGNC ID (37,316 mapping lines). Later, we included the resulting HGNC approved symbols in the data as a column so they can be used readily.
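
The chaining logic itself is simple; a sketch in Python (the actual work was done in R, and all IDs and names below are made-up placeholders for illustration):

```python
def chain_maps(ipi_ids, ipi2uniprot, uniprot2hgnc, hgnc2symbol):
    """Follow IPI -> UniProt ACC -> HGNC ID -> approved symbol,
    dropping any ID that fails to map at one of the steps."""
    out = {}
    for ipi in ipi_ids:
        uniprot = ipi2uniprot.get(ipi)
        hgnc = uniprot2hgnc.get(uniprot)
        symbol = hgnc2symbol.get(hgnc)
        if symbol is not None:
            out[ipi] = symbol
    return out
```

IDs missing from any of the three mapping files are silently dropped, which is why each step of the chain yields fewer mapping lines than the one before it.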

Later, I’ll publish the R code I used for the conversion in a GitHub repo.

Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells

In this study, we're going to use phosphorylation data from a research paper on the phosphoproteomic analysis of the related cells.

The idea is to use and compare existing methods, and to develop them further, in order to better understand the nature of signaling events in these cells and to find key proteins that might be targets for disease diagnosis, prevention and treatment.

This study will be submitted as a research paper, so I'm not going to publish any results here for now, but I'll mention the struggles I run into and the solutions I try.

Download Human Reference Genome (HG19 - GRCh37)

Many variation calling tools and many other methods in bioinformatics require a reference genome as input, so you may need to download the human reference genome or its sequences. There are several sources that provide the entire human genome freely and publicly, and I'll describe how to download the complete human genome from the University of California, Santa Cruz (UCSC) website.

An index of the gzip-compressed FASTA files of the human chromosomes can be found here on the UCSC website.

This is the Feb. 2009 human reference genome (GRCh37 - Genome Reference Consortium Human Reference 37). You can find more information about it on that page. Note that alongside the chromosomes we know, such as chr1, chrM (mitochondrial) and chrX, there are entries like chr11_gl000202_random. These strange ones are scaffolds and patches covering regions that have not been fully resolved and placed into an actual chromosome. It is safe to include them in your reference genome, and they are very small relative to the actual chromosome files.

The commands I'll present below are for the Terminal on Linux or macOS. On Windows, you can use Cygwin, which provides these commands, or set up a virtual machine and install a Linux distro (you'll love it).

1. Download all (GZ) files - chromosomes

Create a directory that will store the downloaded files:

$ mkdir ~/Downloads/hg19

Change directory to “hg19/”:

$ cd ~/Downloads/hg19

Download all files under hg19 chromosomes:

$ wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*'

Or to download individual files (replace * with the file name) - if you don’t want all of them:

$ wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz' -O chr1.fa.gz

2. Uncompress each GZ file - chromosome in the directory

Create a directory that will store the uncompressed files:

$ mkdir uncompressed

Uncompress each using following single line loop into directory “uncompressed/” in Terminal:

$ for a in *.gz; do gunzip -c "$a" > "uncompressed/${a%.gz}"; done

3. Merge all chromosomes (1, 2, 3, …, X, Y) in one FASTA file

Change directory to “uncompressed/”

$ cd uncompressed

Concatenate all FASTA files into one:

$ cat *.fa > hg19_ref_genome.fa

Or only actual chromosomes without scaffolds and patches:

$ cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrM.fa chrX.fa chrY.fa > hg19_ref_genome.fa

This way, you will end up with hg19_ref_genome.fa (around 3.2 GB, containing the whole human genome) in your ~/Downloads/hg19/uncompressed directory.

JointSNVMix Installation on Linux Mint 16 Cython, Pysam Included

JointSNVMix is a software package that consists of a number of tools for calling somatic mutations in tumour/normal paired NGS data.

It requires Python (>= 2.7), Cython (>= 0.13) and Pysam (== 0.5.0).

Python is installed by default on any Linux machine, so I will describe the installation of the others and of JointSNVMix.

Note that this guide may become outdated over time, so please double-check before following it all the way through.

Install Cython

You may need Python development headers for these Python packages so make sure you have them:

$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libevent-dev

To install Cython:

$ cd ~/Downloads
$ wget http://cython.org/release/Cython-0.20.1.tar.gz
$ tar xzvf Cython-0.20.1.tar.gz
$ cd Cython-0.20.1
$ sudo python setup.py install

Install Pysam

Pysam installation requires ez_setup, which in turn requires distribute.

To install distribute:

$ cd ~/Downloads
$ wget https://pypi.python.org/packages/source/d/distribute/distribute-0.7.3.zip
$ unzip distribute-0.7.3.zip -d distribute-0.7.3
$ cd distribute-0.7.3
$ sudo python setup.py install

To install ez_setup:

$ cd ~/Downloads
$ wget https://pypi.python.org/packages/source/e/ez_setup/ez_setup-0.9.tar.gz
$ tar xzvf ez_setup-0.9.tar.gz
$ cd ez_setup-0.9
$ sudo python setup.py install  

Ready to install Pysam now:

$ cd ~/Downloads
$ wget http://pysam.googlecode.com/files/pysam-0.5.tar.gz
$ tar xzvf pysam-0.5.tar.gz
$ cd pysam-0.5
$ sudo python setup.py install

Now, install JointSNVMix:

$ cd ~/Downloads
$ wget http://joint-snv-mix.googlecode.com/files/JointSNVMix-0.7.5.tar.gz
$ tar xzvf JointSNVMix-0.7.5.tar.gz
$ cd JointSNVMix-0.7.5
$ sudo python setup.py install


Verify the installation:

$ jsm.py
usage: JointSNVMix [-h] {train,classify} ...
JointSNVMix: error: too few arguments

Follow the running guide for all arguments and options.

ClipCrop Installation on Linux Mint 16 nvm, Node, npm Included

ClipCrop is a tool for detecting structural variations from SAM files, built with Node.js.

ClipCrop uses two pieces of software internally, so they should be installed first.

Install SHRiMP2

SHRiMP is a software package for aligning genomic reads against a target genome.

$ mkdir ~/software
$ cd ~/software 
$ wget http://compbio.cs.toronto.edu/shrimp/releases/SHRiMP_2_2_3.lx26.x86_64.tar.gz
$ tar xzvf SHRiMP_2_2_3.lx26.x86_64.tar.gz
$ cd SHRiMP_2_2_3
$ file bin/gmapper

Install BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome.

BWA requires zlib, so install it first:

$ sudo apt-get install zlib1g-dev

Download the latest BWA software from SourceForge:

$ cd ~/Downloads
$ tar xjvf bwa-0.7.7.tar.bz2
$ cd bwa-0.7.7
$ make
$ sudo mv ~/Downloads/bwa-0.7.7 /usr/local/bin
$ sudo ln -s /usr/local/bin/bwa-0.7.7 /usr/local/bin/bwa
$ sudo nano /etc/profile

Copy-paste this at the end of the document:


Save it by pressing CTRL + X, then Y, then ENTER.

Now NVM and Node should be installed, but first there are some dependencies:

$ sudo apt-get install build-essential
$ sudo apt-get install libssl-dev curl

Install NVM

NVM is the Node Version Manager, which allows you to use different versions of Node.js:

$ git clone git://github.com/creationix/nvm.git ~/.nvm
$ source ~/.nvm/nvm.sh
$ nvm install v0.6.1
$ nvm use v0.6.1
Now using node v0.6.1 

This installation also sets up Node.js v0.6.1; check:

$ node -v

To install ClipCrop, we need one last thing: NPM, the Node Package Manager.

Install NPM

$ sudo apt-get install -y python-software-properties python g++ make
$ sudo add-apt-repository ppa:richarvey/nodejs
$ sudo apt-get update
$ sudo apt-get install nodejs npm
$ node -v

This last step also installs a Node.js version, but before using ClipCrop, set the Node.js version back to v0.6.1 using NVM:

$ nvm use v0.6.1
Now using node v0.6.1 

If it says No command 'nvm' found, run:

$ source ~/.nvm/nvm.sh

Install ClipCrop

$ cd ~ 
$ npm install clipcrop
$ cd ~/node_modules/clipcrop
$ npm link


$ clipcrop

This will tell you how to use the package and what options it has. More is available here.

Some pages to refer to

Installation notes for BWA version 0.7.5a-r405, http://www.vcru.wisc.edu/simonlab/bioinformatics/programs/install/bwa.htm

Installing Node.js using NVM on Ubuntu, http://achinth.com/post/58263924087/installing-node-js-using-nvm-on-ubuntu

How to install node.js and npm in ubuntu or mint, http://developwithguru.com/how-to-install-node-js-and-npm-in-ubuntu-or-mint/

Installing Node.js via package manager, https://github.com/joyent/node/wiki/Installing-Node.js-via-package-manager

npm documentation, https://www.npmjs.org/doc/