Güngör Budak's Blog

Bioinformatics, web programming, coding in general

Generating 2D SVG Images of MOL Files using RDKit with a Transparent Background

The latest release of RDKit (2015-03) can generate SVG images with a few lines of code, but by default the generated SVG image has a white background. Looking through the sources didn’t solve my problem, as I couldn’t find any option for making the background transparent.

An example of SVG image generation can be found in the RDKit blog post called New Drawing Code.

Cell In [3] of that post shows the SVG image generation, which returns the SVG file content as XML. If you open this content in a text editor, you’ll see a rect element whose style attribute has the fill #ffffff, which is white. If you change this to none, the whole SVG background becomes transparent.

So, if you obtain the SVG file content as XML using the moltosvg function from the blog post, just use the following function to make its background transparent.

import xml.etree.ElementTree as ET


def transparentsvg(svg):
    # Make the white background transparent
    tree = ET.fromstring(svg)
    rect = tree.find('rect')
    rect.set('style', rect.get('style').replace('#ffffff', 'none'))
    # Recover some missing attributes for correct browser rendering
    tree.set('version', '1.1')
    tree.set('xmlns', 'http://www.w3.org/2000/svg')
    tree.set('xmlns:rdkit', 'http://www.rdkit.org/xml')
    tree.set('xmlns:xlink', 'http://www.w3.org/1999/xlink')
    return '<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(tree).strip()

You can then easily write the result to an SVG file:

svg = transparentsvg(svg)
with open('path/to/svg/file', 'w') as f:
    f.write(svg)
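
For readers on Python 3, where ET.tostring returns bytes unless told otherwise, the same idea can be sketched as below. This is only a minimal sketch: SAMPLE_SVG is a hand-written stand-in for the XML returned by moltosvg (not real RDKit output), and the function name make_background_transparent is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written SVG standing in for the moltosvg output
SAMPLE_SVG = (
    "<svg width='100px' height='100px'>"
    "<rect style='opacity:1.0;fill:#ffffff;stroke:none' "
    "width='100' height='100' x='0' y='0'/>"
    "</svg>"
)


def make_background_transparent(svg_text):
    root = ET.fromstring(svg_text)
    # Assumes the background is the first un-namespaced rect element
    rect = root.find('rect')
    if rect is not None:
        style = rect.get('style', '')
        rect.set('style', style.replace('#ffffff', 'none'))
    # Declare the SVG namespace so browsers render the file
    root.set('xmlns', 'http://www.w3.org/2000/svg')
    # encoding='unicode' makes tostring return str instead of bytes
    return ET.tostring(root, encoding='unicode')


transparent_svg = make_background_transparent(SAMPLE_SVG)
print(transparent_svg)
```

The check for a missing rect keeps the function from failing on SVG content that has no background rectangle at all.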

Install Cairo Graphics and PyCairo on Ubuntu 14.04 / Linux Mint 17

Cairo is a 2D graphics library written in the C programming language. If you’d like to use it from Python, you should also install the Python bindings for Cairo.

This guide goes through the installation of the Cairo graphics library version 1.14.2 (the most recent) and the py2cairo Python bindings version 1.10.1 (also the most recent).

Install Cairo

It’s very easy with the following repository. Just add it, update your packages and install.

sudo add-apt-repository ppa:ricotz/testing
sudo apt-get update
sudo apt-get install libcairo2-dev

Install py2cairo

cd ~
git clone git://git.cairographics.org/git/py2cairo

Check your Python prefix

python -c "import sys; print sys.prefix"
/usr

Install dependencies

sudo apt-get install automake pkg-config libtool

Build

cd ~/py2cairo
./autogen.sh --prefix=/usr
./configure
sudo make
sudo make install

Verify

python
>>> import cairo
>>> cairo.cairo_version_string()
'1.14.2'
>>> cairo.version
'1.10.1'

Now you can use up-to-date versions of this software on your computer.

Install RDKit 2015-03 Build on Ubuntu 14.04 / Linux Mint 17

RDKit is an open source toolkit for cheminformatics. It offers many functions for working with chemical files.

Follow the guide below to install the RDKit 2015-03 build on an Ubuntu 14.04 / Linux Mint 17 computer. Since the Ubuntu packages for trusty don’t include the latest RDKit, you have to build RDKit from its source.

Install Dependencies

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost1.54-all-dev

Download the Build

cd /usr/local
sudo wget http://sourceforge.net/projects/rdkit/files/rdkit/Q1_2015/RDKit_2015_03_1.tgz
sudo tar -xzf RDKit_2015_03_1.tgz
sudo mv rdkit-Release_2015_03_1/ rdkit
cd rdkit/
sudo mkdir build

Set Environment Variables

vim ~/.bashrc
# Enter following three lines at the end of .bashrc
export RDBASE="/usr/local/rdkit"
export PYTHONPATH="$RDBASE:$PYTHONPATH"
export LD_LIBRARY_PATH="$RDBASE/lib"
source ~/.bashrc

Download InChI API (optional; remove the argument in the next step if you skip this)

cd $RDBASE/External/INCHI-API
sudo bash download-inchi.sh

Build

cd $RDBASE/build
sudo cmake -DRDK_BUILD_INCHI_SUPPORT=ON ..
sudo make
sudo make install

Verify

python
>>> import rdkit
>>> rdkit.rdBase.rdkitVersion
'2015.03.1'

Please comment if you have any issue with this installation.

Generating 2D Images of Molecules from MOL Files using Open Babel

Open Babel is a tool for working with molecular data in many ways: converting one format to another, analysis, molecular modeling, etc. It can also convert MOL files into SVG or PNG images to represent molecules in 2D.

Install Open Babel on Linux as follows, or go to their page for other operating systems:

sudo apt-get install openbabel

Open Babel uses the same command to generate SVG or PNG images and recognizes the output format from the filename given to the output option -O. It is also possible to generate SVG images with a transparent background using the option -xb with the value none; this doesn’t work for PNGs. Among the other options is --title, which writes the name of the molecule on the image. Leave it blank if you don’t want to see any title.

An example of PNG generation from a MOL file of the benzene molecule:

gungor@gungors-mint ~/Desktop $ obabel benzene.mol -O benzene.png --title Benzene
1 molecule converted
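
If you call Open Babel from a Python script, the options described above can be assembled into an argument list. The following is a minimal sketch that only builds the command and does not run it; the helper name obabel_image_command and its parameters are made up for illustration, and only the options mentioned in this post (-O, -xb none and --title) are used.

```python
def obabel_image_command(mol_path, image_path, title='', transparent=False):
    """Build the obabel argument list for drawing a MOL file as an image."""
    command = ['obabel', mol_path, '-O', image_path, '--title', title]
    # The transparent background only works for SVG output,
    # so the option is added for .svg filenames only
    if transparent and image_path.lower().endswith('.svg'):
        command += ['-xb', 'none']
    return command


svg_command = obabel_image_command('benzene.mol', 'benzene.svg',
                                   title='Benzene', transparent=True)
png_command = obabel_image_command('benzene.mol', 'benzene.png',
                                   title='Benzene', transparent=True)
print(' '.join(svg_command))
print(' '.join(png_command))
```

The resulting list can be passed to subprocess.Popen or subprocess.call on a machine where obabel is installed.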

To see all other options for available image formats, follow the Open Babel documentation Image formats page.

Simple Way of Python's subprocess.Popen with a Timeout Option

The subprocess module in Python provides a variety of methods to start a process from a Python script. We can use these methods to run external commands or programs, collect their output and manage them. An example use is as follows:

from subprocess import Popen, PIPE


p = Popen(['ls', '-l'], stdout=PIPE, stderr=PIPE)
stdout, stderr = p.communicate()
print stdout, stderr

These lines run the ls -l command and collect its output (standard output and standard error) in the stdout and stderr variables using the communicate method of the process object.

However, if the ls command never ended, we would never get any output because the external program would just hang. This sometimes happens with real programs, e.g. when trying to generate a 3D molecule from a 2D MOL file using Open Babel when the MOL file is not correctly formed. As Open Babel just tries to generate the 3D structure and doesn’t check whether the MOL file is okay, we never get any output from that run. So both the program and our script hang.

Python has many different options to solve this problem, such as the multiprocessing, threading or signal modules. Here I’ll describe a very simple method that worked fine for me. It was tested in a Linux environment with Python version 2.7.

popen_timeout.py

from time import sleep
from subprocess import Popen, PIPE


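def popen_timeout(command, timeout):
    # Start the external program and capture its output
    p = Popen(command, stdout=PIPE, stderr=PIPE)
    # Check once per second whether the program has finished
    for t in xrange(timeout):
        sleep(1)
        if p.poll() is not None:
            return p.communicate()
    # Timeout reached: kill the program and reap it to avoid a zombie
    p.kill()
    p.wait()
    return False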

print popen_timeout(['python',
                    '/home/gungor/Desktop/test.py'], 25)

test.py

import time


for i in xrange(10):
    time.sleep(2)
    print "Gungor", i

Example run 1 where the external program doesn’t exceed the timeout threshold

gungor@gungors-mint ~/Desktop $ python popen_timeout.py
('Gungor 0\nGungor 1\nGungor 2\nGungor 3\nGungor 4\nGungor 5\nGungor 6\nGungor 7\nGungor 8\nGungor 9\n', '')

In this example, I ran popen_timeout.py with a 25-second timeout on an external program (test.py) which runs for 20 seconds and writes lines of text to standard output, which popen_timeout.py collects with the communicate method.

Example run 2 where the external program would take longer than the timeout threshold

gungor@gungors-mint ~/Desktop $ python popen_timeout.py
False

popen_timeout.py just returns False because the external program was still running when the timeout was reached, so it was killed.

You can use this method to control the execution of the external programs in Python.
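
On Python 3.3 or later (an assumption, since the method above was tested on Python 2.7), the communicate method accepts a timeout argument itself, so a similar helper can be sketched without the polling loop. The function name run_with_timeout is made up for illustration, and the two demo commands simply start the current Python interpreter.

```python
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Return (stdout, stderr), or None if the command timed out."""
    p = subprocess.Popen(command, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    try:
        # communicate() reads the pipes while it waits, so a program
        # that prints a lot cannot block on a full pipe buffer
        return p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        p.kill()
        # Reap the killed process and drain whatever it wrote
        p.communicate()
        return None


fast = run_with_timeout([sys.executable, '-c', 'print("done")'], 10)
slow = run_with_timeout(
    [sys.executable, '-c', 'import time; time.sleep(5)'], 1)
print(fast)
print(slow)
```

Returning None rather than False on timeout is a design choice of this sketch; it keeps the timeout case distinct from any tuple of output.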

Running StarCluster Load Balancer in Background in Linux

The StarCluster loadbalance command regularly monitors the jobs in the queue and adds nodes to, or removes nodes from, the cluster whose master node was created beforehand, so that the queue is completed efficiently.

To run it in the background without it being killed when the terminal is closed:

nohup starcluster loadbalance cluster_name >loadbalance.log 2>&1 &

or to keep standard output and standard error logs separate:

nohup starcluster loadbalance cluster_name > loadbalance.access.log 2> loadbalance.error.log &

This will start the process and print its process ID (PID), which can be used to check or kill it. With the first command, standard output and standard error are both written to the loadbalance.log file in the current working directory.

Running processes in Linux can be listed with the following command in a terminal:

ps

To kill a process:

kill <PID>
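
If the load balancer needs to be started from a Python script rather than from the shell, a rough equivalent of the nohup invocation above can be sketched with subprocess on Python 3. This is a minimal sketch under assumptions: the helper name launch_in_background is made up, start_new_session=True is used in place of nohup to detach the process from the terminal's session, and the demo runs a harmless Python one-liner instead of starcluster.

```python
import os
import subprocess
import sys
import tempfile


def launch_in_background(command, log_path):
    """Start command detached from the terminal, logging to log_path."""
    log = open(log_path, 'ab')
    try:
        p = subprocess.Popen(command,
                             stdin=subprocess.DEVNULL,
                             stdout=log,
                             stderr=subprocess.STDOUT,  # like 2>&1
                             start_new_session=True)    # own session
    finally:
        # The child keeps its own copy of the file descriptor
        log.close()
    return p


# Demo with a stand-in command; for the real case the command would be
# ['starcluster', 'loadbalance', 'cluster_name']
demo_log = os.path.join(tempfile.mkdtemp(), 'loadbalance.log')
proc = launch_in_background(
    [sys.executable, '-c', 'print("balancing")'], demo_log)
print(proc.pid)  # the PID to use with kill
proc.wait()      # only for the demo; a real launcher would not wait
```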

Change Apache’s Default User www-data or Home Directory /var/www/

I was getting errors from a StarCluster run because it could not find the .starcluster directory in /var/www/.

This directory holds the config file and log directories for StarCluster, so StarCluster can’t run without it.

To solve the issue, I set my own user in Apache’s envvars instead of www-data, which also changes the default home directory to mine.

Edit the following file with superuser permissions:

sudo nano /etc/apache2/envvars

Enter your username in the following lines and save:

export APACHE_RUN_USER=user
export APACHE_RUN_GROUP=user

Restart the server:

sudo service apache2 restart

Transfer Files to Your AWS S3 Storage in Linux

Download and install the following tool:

cd ~/Downloads
git clone https://github.com/s3tools/s3cmd.git
cd s3cmd/
sudo python setup.py install

Next, execute the following to create a configuration file for connecting to your AWS S3 account:

s3cmd --configure

And finally use the following command to transfer any directory to your bucket:

s3cmd sync <path-to-folder> s3://<s3-bucket-name>

More on installation, setup and usage of s3cmd

ImportError: Reportlab Version 2.1+ is needed!

This is a small bug in xhtml2pdf version 0.0.5. To fix it:

$ sudo nano /usr/local/lib/python2.7/dist-packages/xhtml2pdf/util.py

Replace the following lines:

if not (reportlab.Version[0] == "2" and reportlab.Version[2] >= "1"):
    raise ImportError("Reportlab Version 2.1+ is needed!")

REPORTLAB22 = (reportlab.Version[0] == "2" and reportlab.Version[2] >= "2")

With these lines:

if not (reportlab.Version[:3] >= "2.1"):
    raise ImportError("Reportlab Version 2.1+ is needed!")

REPORTLAB22 = (reportlab.Version[:3] >= "2.2")
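
Comparing the first three characters as a string works for versions such as 2.7 or 3.0, but it would misread a two-digit minor version such as 2.10. A more robust check compares numeric tuples instead. The following is a minimal sketch of that idea; the helper name version_tuple is made up for illustration and is not part of xhtml2pdf or Reportlab.

```python
import re


def version_tuple(version, length=2):
    """Turn a version string such as '2.7' into ints, e.g. (2, 7)."""
    numbers = []
    for piece in version.split('.')[:length]:
        # Keep only the leading digits of each piece (e.g. '1b' -> 1)
        match = re.match(r'\d+', piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as '2' to the requested length
    numbers += [0] * (length - len(numbers))
    return tuple(numbers)


# Compare the tuple check with the string check for a few sample versions
for sample in ['2.0', '2.1', '2.7', '2.10', '3.0.12']:
    by_tuple = version_tuple(sample) >= (2, 2)
    by_string = sample[:3] >= '2.2'
    print(sample, by_tuple, by_string)
```

In util.py the same comparison would be applied to reportlab.Version, for example version_tuple(reportlab.Version) >= (2, 1).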