Friday, April 8, 2011

A new mutation finder

To deal with new HIV, HCV and bacterial regions, I've started redoing the mutation finder I had created before. Instead of a Django app, I'm now building it as a ruffus pipeline. The previous iteration was very difficult to extend and modify, so I decided it was better to start from scratch and learn from my previous mistakes.

The ruffus pipeline runs a simple search, downloads the XML data for each article and then uses the same regexp mutation finder as in the previous iteration. The big difference this time is that I plan to use the Whatizit pipeline to annotate the proteins in every sentence that has a mutation. I will then assume that any sentence containing a single mutation and a single protein means that the mutation is IN that protein. If the same mutation and protein pairing shows up at least twice (or meets some other cutoff), I'll accept it as truth. My main idea is that the important mutations will be mentioned multiple times. I may also be able to get some functional annotation by using the Whatizit pipeline to get MeSH terms.
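To make the pairing rule concrete, here is a minimal sketch of it in Python. It is not the actual pipeline code: the regular expression is a simplified stand-in for the real regexp mutation finder, the function names find_mutations and pair_mutations are made up for illustration, and the protein names are assumed to arrive as a plain list per sentence (in the real pipeline they would come from Whatizit).

```python
import re
from collections import Counter

# Assumption: a simplified point-mutation pattern such as "K103N"
# (one-letter wild-type residue, position, one-letter mutant residue).
# The real regexp mutation finder is likely more elaborate.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(r"\b[%s]\d+[%s]\b" % (AMINO_ACIDS, AMINO_ACIDS))


def find_mutations(sentence):
    """Return the distinct mutation strings found in one sentence."""
    return sorted(set(MUTATION_RE.findall(sentence)))


def pair_mutations(annotated_sentences, cutoff=2):
    """Count (mutation, protein) pairs and keep those seen >= cutoff times.

    annotated_sentences is an iterable of (sentence, proteins) tuples,
    where proteins is the list of protein names an annotator reported
    for that sentence (hypothetical input format).
    """
    counts = Counter()
    for sentence, proteins in annotated_sentences:
        mutations = find_mutations(sentence)
        unique_proteins = set(proteins)
        # Only trust sentences with exactly one mutation and one protein.
        if len(mutations) == 1 and len(unique_proteins) == 1:
            counts[(mutations[0], unique_proteins.pop())] += 1
    return {pair: n for pair, n in counts.items() if n >= cutoff}


if __name__ == "__main__":
    sentences = [
        ("The K103N mutation in reverse transcriptase confers resistance.",
         ["reverse transcriptase"]),
        ("K103N was seen in reverse transcriptase of treated patients.",
         ["reverse transcriptase"]),
        ("M184V and K103N both occur in reverse transcriptase.",
         ["reverse transcriptase"]),
        ("D30N was found in protease.", ["protease"]),
    ]
    print(pair_mutations(sentences))
```

In this toy input only the K103N and reverse transcriptase pairing survives: the third sentence is skipped because it has two mutations, and D30N is only mentioned once, which is below the cutoff of two.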

So far I've implemented the PubMed search, article download and raw text extraction. I still need to do the protein annotation, the mutation search and the aggregation. The code isn't up on GitHub yet, but it should be by tomorrow night or so.
