Wednesday, April 27, 2011

New paper may REALLY help everything

This paper came up in my feed reader last night: "High-throughput identification of genetic interactions in HIV-1", which is a review of "A systems analysis of mutational effects in HIV-1 protease and reverse transcriptase". Both are out of Nature Genetics, and the work was done by a group in CA.

They have analyzed the viral fitness of mutations from more than 70,000 clinical sequences of RT and PR. Fitness was measured in vitro as replication capacity. The experiments were performed by placing this region within a well-defined test vector (NL4-3). So I will be able to examine linkage breakage between all pairwise interactions that include either RT or PR.

I've already started to write code for measuring these "linkage-breaks" from arbitrary sequence segments. I've submitted the data request; hopefully they won't be greedy and they'll actually send me something quickly!

It seems that they do some sort of "encryption" to hide patient identities ... I may have to do some finagling to get it working; hopefully it won't be much.

Wednesday, April 20, 2011

HCV teleconference

I had a short meeting and it was very productive. It seems that there are about 80-100 sequences for which they have quantitative infectivity data. I'll see how the data looks when I get it. The main caveat is that they are subgenomic replicons. I have also found a large-scale random-insertion functional genomic screen ... I'll have to see how it looks.

Tuesday, April 12, 2011

Meeting about Peter's paper

Looks like there's just one more figure/table to deal with. So far all I need to do is create a table where I list each GO term and each gene associated with that term, then color each gene based on which gene list it's in. The only way I can see to do this is using HTML tables. I set up a simple Python script to take care of this ... currently it looks pretty crappy, so I'll spend some time tomorrow dealing with it.
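Here is a minimal sketch of the kind of script I mean, not the actual one. The function name, the gene-list names ("list_a", "list_b") and the colors are all made up for illustration; it just assumes a dict of GO term -> genes and a dict of gene -> gene list.

```python
from html import escape

# Hypothetical gene-list names and colors, for illustration only.
LIST_COLORS = {"list_a": "#d62728", "list_b": "#1f77b4"}


def go_table_html(go_to_genes, gene_to_list, colors=LIST_COLORS, default="#000000"):
    """Build an HTML table with one row per GO term.

    go_to_genes:  dict mapping a GO term label to an iterable of gene names
    gene_to_list: dict mapping a gene name to the name of its gene list
    Genes that are not in any known list fall back to the default color.
    """
    rows = []
    for term in sorted(go_to_genes):
        spans = []
        for gene in sorted(go_to_genes[term]):
            color = colors.get(gene_to_list.get(gene), default)
            spans.append('<span style="color:%s">%s</span>' % (color, escape(gene)))
        rows.append("<tr><td>%s</td><td>%s</td></tr>" % (escape(term), ", ".join(spans)))
    header = "<tr><th>GO term</th><th>Genes</th></tr>"
    return "<table>\n%s\n%s\n</table>" % (header, "\n".join(rows))
```

Writing the returned string to a .html file is enough to view it in a browser; making it look less crappy is then mostly a matter of adding CSS.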

Dr. T Meeting about Sean's project

Looks like we're going back to the "meta-analysis". We'll go through all of the datasets and produce a gene list for each (using SAM or some other method). Then we'll do pathway/GO enrichment, and then look at pathway overlaps. This should be a pretty simple process; Sean just needs to learn enough coding to get it together.
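The last step (pathway overlaps) could look something like the sketch below. This is only an assumption about how we might score it: it takes the set of enriched pathways per dataset as given (the SAM and enrichment steps are not shown) and uses a Jaccard index, which we haven't actually settled on. The function name is made up.

```python
from itertools import combinations


def pathway_overlaps(enriched):
    """Compare the enriched pathways of every pair of datasets.

    enriched: dict mapping a dataset name to a set of enriched pathway names.
    Returns a dict mapping (dataset_a, dataset_b) to a tuple of
    (shared pathways, Jaccard index).
    """
    results = {}
    for a, b in combinations(sorted(enriched), 2):
        shared = enriched[a] & enriched[b]
        union = enriched[a] | enriched[b]
        # Two empty sets have no overlap to speak of, so score them 0.
        jaccard = len(shared) / len(union) if union else 0.0
        results[(a, b)] = (shared, jaccard)
    return results
```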

Friday, April 8, 2011

A new mutation finder

So in order to deal with new HIV, HCV and bacterial regions I've started to redo the mutation finder that I had created before. Now, instead of doing it as a django-app, I'm doing it as a ruffus pipeline. The previous iteration was very difficult to extend and modify, so I decided it was better to just start from scratch and learn from my previous mistakes.

So the ruffus pipeline downloads the XML data for each article after a simple search and then uses the same regexp mutation finder as in the previous iteration. The big difference in this iteration is that I plan to use the Whatizit pipeline to annotate the protein in every sentence that has a mutation. Then I will assume that any sentence which has a single mutation and a single protein means that the mutation is IN that protein. If the same mutation-protein pair shows up at least twice (or some other cutoff), I'll accept it as truth. My main idea is that the important mutations will be mentioned multiple times. I may also be able to get some functional annotation by using the Whatizit pipeline to get MeSH terms.
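A rough sketch of that single-mutation/single-protein rule, under some loud assumptions: the regexp here is a simple stand-in for single-letter substitutions like M184V (not the actual one from the previous iteration), and the protein annotations are passed in as plain lists rather than coming from Whatizit. The function name and the min_count parameter are made up.

```python
import re
from collections import Counter

# Stand-in pattern (an assumption): wild-type residue, position, mutant residue.
MUTATION_RE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]\b")


def assign_mutations(annotated_sentences, min_count=2):
    """Assign mutations to proteins from co-mentions within a sentence.

    annotated_sentences: iterable of (sentence, proteins) pairs, where
    proteins is the list of protein names annotated in that sentence.
    Only sentences with exactly one distinct mutation and exactly one
    distinct protein are counted. Returns {(mutation, protein): count}
    for the pairs seen at least min_count times.
    """
    counts = Counter()
    for sentence, proteins in annotated_sentences:
        mutations = set(MUTATION_RE.findall(sentence))
        unique_proteins = set(proteins)
        if len(mutations) == 1 and len(unique_proteins) == 1:
            counts[(mutations.pop(), unique_proteins.pop())] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

Sentences that mention two mutations or two proteins are just dropped here, which throws away data but keeps the assignment unambiguous.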

So far I've implemented the PubMed search, article download and raw text extraction. I still need to do the protein annotation, mutation search and aggregation. The code isn't up on GitHub yet but it should be by tomorrow night or so.