Research Candy: February 2011

Monday, February 28, 2011

Doing steve's microarray stuff

I'm doing the analysis for Steve Smith's microarray data. We've looking to compare the genes which are different between influenza patients in the "uninfected" state. We noticed that there is a large difference between individuals that is independent of their infected state ... so it doesn't make sense to group infected vs. uninfected together since they have such an inherent difference.

HCV stuff still processing

The HCV stuff is still processing but it seems to be about 2/3rds done. Hopefully it will finish within a day or so.

Friday, February 25, 2011

Adding new functionality to publookup

I need to be able to do "gene enrichment" instead of "article enrichment". I made modifications to entrezpublookup.py to facilitate this. Now using sha: 7d6fb2540e604dd4f200e0b1014b71adb046c9f5

I re-ran the results for Peter's gene lists using the command:

python entrezlookup.py -s lists/testlist.txt -o lists/PeterResults.txt -g lists/GeneLevelOutput.txt --gene-based lists/GeneBasedOutput.txt

Not feeling productive today

So I think I'm going to spend some time improving my documentation and unit-test suite for Linkage code. This way I can get some important stuff done without having to actually think much.

Restarted HCV linkage

Things were taking too long on the HCV linkage and the p-values were "inconclusive". I changed the code so it won't calculate p-vals unless asked (otherwise it will just leave that column blank). I restarted the code on sha a792bca5d5d14506b83f1af96a0f0c3201a5c232 and command:

python AnalysisCode.py --link-lanl --data-dir HCVData/ --workers 3 --max-width 5

I'll create a function which can go back and recalculates the p-values. I also need to deal with "inconclusive" p-values ... which I'm defining as permutation tests where I never (or only twice) find a value as good as the real value. In a set of 5000 iterations I don't have a good estimate. So I think I'll refactor that function to keep testing until it finds 10 "good" values unless it gets to 100,000 iterations. I'll probably have a minimum of 5000 just to make sure it doesn't get a "lucky break" and find those 10 early.

Thursday, February 24, 2011

Current ToDo and brainstorming

Here's a list of stuff I think I need to work on ... in rough priority order:

Get the Circos diagram setup for the HCV results before its done.
Setup the statistical look-up table ... this will require an appreciable amount of coding and a large amount of processing time.
Setup an enrichment score for linked proteins in bacterial genomes.
Find a better way of organizing the linkage data and making sure everything gets run at once. Now that I'm going to run the analysis on MANY bacterial genomes and comparing results together I need a better way to handle things. Perhaps setting up a YAML list file with the location of data and the parameters. Then I could just run a top-function which calls the resulting functions

Started HCV linkage code

started code on linkage analysis:

python AnalysisCode.py --link-lanl --data-dir HCVData/ --workers 3 -max-width 5

on code: a792bca5d5d14506b83f1af96a0f0c3201a5c232
So it will calculate single linkages between all pairs with a max-width of 5. It will also find linkage p-values for anything with a value greater than 80%.

No Dice on Guidance and NorMD

Both Guidance and NorMD died on the smallest of my alignments. Looks like I'll have to look elsewhere. Also tried out TrimAI this morning ... it seems to only remove columns and not remove any of the bad sequences; so its not really what I'm looking for.

Wednesday, February 23, 2011

Statistical analysis of linkages

I plan to use a permutation test to find the statistical significance of linkage values. I know that the linkage is a function of the Entropy values of each column and the number of unique values in each column. There may be a way to calculate this using some sort of equation but I can't seem to figure it out. So I'm just going to do a permutation test for all values of entropy pairs and unique values. Then I'll just interpolate for the values in-between my sampling.

However, since I have a huge range that I need to check its going to be very difficult to store all of the data ... I'll need the cdfs for Entropies from 0.1 to 5.0 in spaces of 0.1 (for each column); unique values from 1 to 30 (for each column) 50*50*30*30 ... so about 2.25 million combinations and I'll need to store about 100 integers for each. So about 2.25 billion integers need to be stored and 22.5 trillion graphs that need to be calculated.

So I'll need a fast way to store, retrieve and filter this data. PyTables seems to be my best choice. I'll experiment with it tomorrow but it claims to be a super fast way to write, store and query data based on the HDF5 standard. I'm going to play with it tonight and see if I can get it to work well.

Looking for alignment quality control

I've noticed that my linkage scores are highly dependent on my alignment. Since I'm gathering a lot of sequences automatically occasionally I can get few crappy ones. These crappy ones throw off the entire alignment. So I've started looking at some automated ways to remove these bad sequences. As per this question on BioStar: Automagically remove “badly” aligning sequence from Multiple-Sequence Alignment this brought me to two suggestions: NorMD and GUIDANCE.

NorMD is a command-line C program. The v1.4 README for the newest version of NormMD does not conform to the actual code (describing functions that are no longer there and functions there are not in the readme file). And the function that I need normd_rd is not present and the help-files associated with the other code are not helpful. I've gone back to version 1.1 which has the desired files. However, this is terribly slow ... 4 hours so far and it still hasn't finished one file.

GUIDANCE is a webservice (http://guidance.tau.ac.il/). I've uploaded the core.fasta file and its been 4 hours and still isn't done.

I'll give the things the night to try to finish otherwise I'll need to code something quick and dirty.

First posts

Now that I have started a new set of projects I am moving my research lab notebook to blogger. Certain posts will be private to me (when they deal with GSK's secret projects) others will be public.

However, I do not expect that these posts will be interesting to most. I am planning on using this as a lab notebook. So I will be indicating which code, which code-version, and which dataset I'm running my data on. I will mention some conclusions I may have come up with but there won't be a big philosophical discussion.

This is more of a blog for my own needs then for other people.
-Will