This post describes the data preprocessing done in the Salmonella project for the Prize-Collecting Steiner Forest (PCSF) algorithm.
This task was assigned to me as homework in one of my university courses, and I wanted to share my solution because I could not find a similar write-up on the Internet.
openpyxl is a Python library for reading and writing Excel 2007 xlsx/xlsm files. To download and install it on Windows:
NumPy (Numerical Python) is a Python package for array-based numerical work that is well worth using if you’re doing scientific computing.
In our project, Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells, we have several methods for constructing the networks, so the data still needs to be preprocessed before it can be analyzed with these methods.
You may need to convert PED files to FASTA format for further analyses in your studies. Use the script below for this purpose.
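As a rough illustration of what such a conversion can look like, here is a minimal sketch. It is not the original script from the post. It assumes the common PED layout (six leading columns: family ID, individual ID, paternal ID, maternal ID, sex, phenotype, followed by two alleles per marker), writes one FASTA record per individual with both alleles of every marker concatenated in order, and turns the missing-allele code `0` into `N`. The function name `ped_to_fasta` and the `family_individual` header style are my own choices for the example.

```python
def ped_to_fasta(ped_lines, missing="0"):
    """Convert PED-format lines to FASTA text, one record per individual.

    Assumption: six leading metadata columns, then two alleles per marker.
    """
    records = []
    for line in ped_lines:
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        family_id, individual_id = fields[0], fields[1]
        alleles = fields[6:]
        if len(alleles) % 2 != 0:
            raise ValueError(
                f"odd number of alleles for individual {individual_id}"
            )
        # Missing alleles (coded "0" by default) become "N".
        sequence = "".join("N" if a == missing else a for a in alleles)
        records.append(f">{family_id}_{individual_id}\n{sequence}")
    return ("\n".join(records) + "\n") if records else ""
```

To convert a file, pass the open file handle as `ped_lines` and write the returned string to a `.fa` file. If your downstream tool expects one base per marker rather than both alleles, the sequence-building line is the place to change.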
Since we’ll be using R for most of the analyses, we converted the XLS data file to CSV using MS Office Excel 2013. We then had to fix several lines in Sublime Text 2 because three columns in these lines were left unquoted, which later caused problems when reading the file in RStudio.
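The unquoted-column problem can also be avoided by writing the CSV programmatically instead of fixing it by hand. The sketch below is an alternative to the Excel and Sublime Text route described above, not what we actually did. It uses Python's standard `csv` module with `csv.QUOTE_ALL`, so every field is quoted and commas inside a field cannot split a column. The function name `rows_to_csv` and the sample column names are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text with every field wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows; the third field of the second row contains a comma.
sample_rows = [
    ["protein", "site", "note"],
    ["P12345", "S10", "kinase, predicted"],
]
csv_text = rows_to_csv(sample_rows)
```

A file written this way should load in R with `read.csv` using its default quote handling, since the embedded comma sits inside a quoted field.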
Multi-dimensional Modeling and Reconstruction of Signaling Networks in Salmonella-infected Human Cells
In this study, we’re going to use phosphorylation data from a research paper on the phosphoproteomic analysis of related cells.
Many variant calling tools and other bioinformatics methods require a reference genome as input, so you may need to download the human reference genome or its sequences. Several sources provide the entire human genome freely and publicly, and I’ll describe how to download the complete human genome from the University of California, Santa Cruz (UCSC) website.
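For readers who prefer to script the download, here is a minimal sketch using only the standard library. It assumes the hg38 assembly and the `bigZips` directory layout on the UCSC download server; the post does not say which assembly is meant, so treat `hg38` as a placeholder and check the UCSC downloads page for the build you need. The helper names `ucsc_genome_url` and `download_genome` are made up for this example.

```python
import urllib.request

# Root of the UCSC download server's per-assembly directories.
UCSC_DOWNLOAD_ROOT = "https://hgdownload.soe.ucsc.edu/goldenPath"


def ucsc_genome_url(assembly="hg38"):
    """Build the URL of the gzipped whole-genome FASTA for an assembly.

    Assumption: the file lives at <assembly>/bigZips/<assembly>.fa.gz.
    """
    return f"{UCSC_DOWNLOAD_ROOT}/{assembly}/bigZips/{assembly}.fa.gz"


def download_genome(assembly="hg38", destination=None):
    """Download the genome archive and return the local file path."""
    destination = destination or f"{assembly}.fa.gz"
    urllib.request.urlretrieve(ucsc_genome_url(assembly), destination)
    return destination
```

Calling `download_genome()` starts the transfer. The archive is large, so expect it to take a while and make sure there is enough disk space before running it.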