Research Candy: March 2011

Wednesday, March 30, 2011

Started running all Linkage Files

Using the new --do-all argument I started running all new linkage using sha: 223e3a69d6e6715ffaaf1e3ffd410aa2cc9d70c5

This will do all tree-calculating, merging, linking, and aggregating.

Crushing Back-Related Monkeys

I'm taking the day and crushing some annoying cludges that have been in my code for a while.

I finally made a simple utility to touch all data to make it "up-to-date". This was becoming such an annoyance since I'm updating the "processing file" all the time to add new stuff ... and since this is the "root" of the processing when it got updated then EVERYTHING needed to be redone. Now simply running the AnalysisCode.py script with the --fresh argument will update everything that exists!

I finally fixed the --processing-file argument ... before I would often have to change it in the source-file everytime I wanted to do an analysis on a different program. Now it properly respects that input argument!

Fixed the issue with large input sequences overloading muscle ... now if Muscle errors out then I rename the input file and skip it in all future runs.

Now can process multiple species files. If the AnalysisCode.py is given a --do-all parameter then it will do a DEPTH-FIRST analysis of all processing files in "importance-order". This will be useful when I

Removed the ClustalW related files since they are no longer needed.

Now the linkage files are converted to "reference space" after they are finished. These are what are then aggregated together.

Now the Tree-merging never aggregates "reference" genomes together since this will make it impossible to convert to the proper numbering scheme later!

Tuesday, March 29, 2011

Merging sequences within trees

Once I found Dendropy it became pretty easy to parse and determine which sequences can be merged. Something akin to this code works well:



tree = Tree.get_from_string(tree_str, schema = 'newick')

done = set()

groups = []

for t in tree.level_order_node_iter():

  if t.edge.length >= 90:

  lens = [x.edge.length >= 90 for x in t.level_order_iter() if not x.is_leaf()]

  if all(lens) or not any(lens) or t.is_leaf():

    leafs = set([str(x.taxon) for x in t.leaf_nodes()])

    groups.append(leafs-done)

    done |= leafs

groups = [x for x in groups if x]

This will produce a list of sets in which all members of the set are either all descendants have a bootstrap value of >90 or none of the values are below >90.

In my HIV data its 557 groups out of 707. The largest group is 6 sequences. Below is the tree:

I'm not really great with the evolutionary biology so I'm not sure if this a warning sign ... I'll send a message to JB to see if he sees any red flags.

I've already started this on the HCV genomes as well.

Monday, March 28, 2011

Phylip command-line

Phylip is a a pain in the ass when you're running it from a command line. You need to create a file which has all interactive prompts and then feed that along.

phylip proml < input > screenout

However, this is ANNOYING to do with python's subprocess module. It can't deal with the redirections so you have to use the subprocess input/output:



        if capture_output:

            out = open('screenout', 'w')

        else:

            out = open(os.devnull, 'w')

        inbuf = open('input','w')

        args = shlex.split(cmd)

        call(args, stdin = inbuf, stdout = out)

This is already in the code, just posting it here for reference

Thursday, March 24, 2011

HIV sequence tree

As per JB's suggestions I have created a concatenated alignment of all columns which have no gaps (from all proteins). This results in an alignment of ~1289 columns and 624 sequences (since I need to make sure all proteins are present).

Now I'm using phylip to create a tree out of these sequences ... however, this seems very slow going. Its been ~50 minutes and its only loaded 1/3rd of the sequences of the FIRST of 100 trees. I'll let it go for most of the day but this might be a little too extreme.

Wednesday, March 23, 2011

JB Meeting about HIV phylogeny

Phylip: http://evolution.genetics.washington.edu/phylip.html (for making trees)
Dendroscope: http://ab.inf.uni-tuebingen.de/software/dendroscope (for viewing large trees)

Jim's main idea is to use Phylip to generate a tree with of the HIV-1 sequences and then create a consensus for each node with a bootstrap value >80% (or some other value). Then do the same linkage analysis on those sequences.

He suggested to create the tree use concatenated columns of the variable regions (without gaps).

Getting figure generation automated

Now that I have a lot of genomes that I need to process my goal for today is to automate the figure generation process. The previous iterations have been mostly with python scripts so now I need to setup a python module to do everything automatically.

I'll need to generate the "scatter" and "image" figure that I referred to on previous posts like this one: http://researchcandy.blogspot.com/2011/03/hcv-results-for-other-genotypes.html

Tuesday, March 22, 2011

Data downloaded so far:

I've downloaded the:

Any
IndB
SubB

Still need to download:

SubC
SubA

And if needed:

IndC
IndA

Hand-downloaded HIV data for subtype B

I downloaded the Los Alamos HIV subtype B data ... It took about ~45-60 minutes of clicking. How annoying!

Unsuccessful attempts to Automate Los Alamos downloads

I've pretty much given up on automating the download of data from the Los Alamos database. It just doesn't seem to lend itself to python's mechanize. The form options are filled in with javascript (which I figured out how to deal with). The big problem is that the download of fasta files occurs through javascript (for some crazy reason) and I just can't seem to find a way to capture the request.

Saturday, March 19, 2011

HIV linkage paper rejected

The initial linkage paper on HIV has been rejected from BMC retrovirology. This doesn't really surprise me since I really didn't think a biology paper would publish a purely computational analysis.

There were two main comments shared by all three reviewers:

Predicted linkages may be due to ancestral evolution instead of biological relevance. Our computational method has no way to distinguish between the two.
Biologists really don't care about the sequence linkages between two distant regions of the genome "At present the manuscript describes a computational method that is not (yet) of interest to retrovirologists and should be submitted to a computational journal."

The reviewers quoted a few other potential sources that would account for the linkage we observe:

Ancestral Origin (founder effects)
RNA motifs
CTL selection

I agree that all of these are possible choices ... I would just argue that I really didn't care what made the linkages I only care that given a small section of the genome I can predict a large portion of the rest of it. I think when we started adding too many biological examples this message got watered down and it seemed that we were claiming a biological reason.

I'm going to start brainstorming how I can deal with the phylogenetic biasing ... since I agree that this may be inflating my "non-perfect" scores ... ie if one node of my tree is overrepresented and it has a strong linkage then it could inflate my scores. In the perfect matches it doesn't actually matter how I slice the data, I will always get a perfect score.

Thursday, March 17, 2011

TopN predictive positions in HCV

These are the topN predictive positions for various subtypes (ranked by number of predictive positions):
Subtype 3a:

   'E2' [ 95]
   'E2' [ 2]
   'E1' [213]
   'E2' [222]
   'E2' [ 93]
   'E2' [212]

Subtype 2a:

   'E2' [ 2]
   'E2' [109]
   'E2' [ 1]
   'E2' [226]
   'E2' [225]
   'P7' [ 20]

Subtype 1a:

   'E2' [ 2]
   'E2' [ 85]
   'E2' [187]
   'E2' [ 1]
   'E2' [ 6]
   'E2' [180]
   'E2' [176]
   'E2' [ 83]
   'E2' [ 82]
   'E2' [186]
   'E2' [ 16]
   'E1' [ 43]
   'E2' [245]
   'P7' [ 20]
   'E2' [175]
   'NS2' [119]
   'E1' [105]
   'NS2' [ 2]
   'P7' [ 26]
   'NS2' [ 29]

As you can see the positions are "similar" and there are some overlaps.

HCV results for other genotypes

The results posted before were for genotype 1a ... that was chosen arbitrarily since I don't really know much about HCV epidemiology.

Here are the corresponding figures for genotype 2a

and for subtype 3a:

I really don't see any difference between these figures ... but when I look at the actual numbers I can see that the conserved columns are off by a few AAs. In genotype 3a there is an extra predictive column in E1 that is not present in the others. There is also a "double-wide" single AA linkage in E2 (the last column in E2) ... in the other genotypes it reduces to a single column but in 3a there are actually 2 positions.

I'll post the predictive positions later this afternoon.

Wednesday, March 16, 2011

JB Meeting

Just some random thought/action-items from my meeting with Jim Brown:

Put together some slides today on the HCV results ... include circos and scatterplots, include a few bullets about the methods
Think about other projects that I would be interested in doing ... some include: gut microbiome, medical-records mining, the pan-viral project

I'm leaning closer to the medical-records project just because I want to do some real text-processing.

Paper to check for agreement with HCV linkage work

I came across this paper while looking for some HCV genome linkage studies: Comparative analysis of nearly full-length hepatitis C virus quasispecies from patients experiencing viral breakthrough during antiviral therapy: clustered mutations in three functional genes, E2, NS2, and NS5a
This shows that there are some PAIRS of mutations which lead to viral breakthroughs but the single mutations have no noticeable effect. They use a different reference strain so their positions aren't immediately comparable to mine but I have checked and their reference is in my alignments ... so I'll just need to do a simple conversion to check it out.

Saturday, March 12, 2011

Started running multi-bacterial linkages

Started the code-run for the multi-bacterial linkage. Using sha:

17ec9386d7cb5f5387d553ac520c431b853e7c3b

I ran the command:

python AnalysisCode.py --processing-file BacterialData/BacterialProcessing.yaml --link

This should download, unzip, align and link all of the following genomes:

Acinetobacter_baumannii
Actinobacillus_pleuropneumoniae
Bacillus_anthracis
Bacillus_cereus
Bacillus_subtilis
Bacillus_thuringiensis
Bifidobacterium_longum
Borrelia_burgdorferi
Brucella_abortus
Brucella_melitensis
Buchnera_aphidicola
Burkholderia_mallei
Burkholderia_pseudomallei
Campylobacter_jejuni
Chlamydia_trachomatis
Clostridium_botulinum
Clostridium_difficile
Clostridium_perfringens
Coxiella_burnetii
Enterococcus_faecalis
Enterococcus_faecium
Escherichia_coli
Francisella_tularensis
Haemophilus_influenzae
Helicobacter_pylori
Lactobacillus_crispatus
Lactobacillus_iners
Lactobacillus_jensenii
Lactobacillus_reuteri
Listeria_monocytogenes
Mycobacterium_tuberculosis
Neisseria_gonorrhoeae
Neisseria_gonorrhoeaes
Prochlorococcus_marinus
Pseudomonas_aeruginosa
Pseudomonas_syringae
Rhizobium_etli
Rhodopseudomonas_palustris
Salmonella_enterica
Staphylococcus_aureus
Staphylococcus_epidermidis
Streptococcus_agalactiae
Streptococcus_pneumoniae
Streptococcus_pyogenes
Streptococcus_suis
Sulfolobus_islandicus
Ureaplasma_urealyticum
Vibrio_cholerae
Vibrio_choleraes
Vibrio_parahaemolyticus
Wolbachia_endosymbiont
Yersinia_pestis

Should take a few days to finish ... or perhaps MUCH longer.

Friday, March 11, 2011

Sequence-space figures

So I've normalized the figures to be in "sequence-space" instead of alignment space ... all positions were normalized to the AF054250 sequence (it would be trivial to change this).

and

This cleaned things up relatively well ... although there are still large chunks of un-measured space ... but these are due to the low-entropy cutoff, not sure there's much I can do about that one :(. The only thing I can think of to fix this would be too allow a window to grow indefinitely if its below the entropy cutoff ... right now if the "conserved window" is greater than 5 AAs then I just skip past it. While this would fix my problem I'm not sure it is worth the time it would take to re-compute everything (and it wouldn't add much biological significance).

I could take the adjacent values (since if I allowed the window to extend it would be the same as the adjacent value). When I do that I get this:

Which looks pretty bad ... and misleading. Since this seems to imply that whole genome can be predicted from any other part with at least 70% accuracy. Which is NOT true ... if you were given one of those low-entropy regions (without the nearby AAs) you'd be out of luck.

An "image" version of the HCV linkage data

So i used imagesc to create a new representation of the linkage data:

This has quite a bit of a difference between the scatter-plot I made yesterday ... reposted here:

As you can see ... in the scatter figure it seems that E1 and E2 have the highest levels of predictive power. While in the image figure it seems that the NS3-NS5A has the majority of predictive power.

There is a simple reason for the dis-congruity ... the scatter-plot shows SINGLE AA linkages while the image figure shows MULTIPLE AA linkages. So it would seem that the linkage between NS3/NS5 with the rest of the genome only occurs at the "motif" level while the linkage between E1/E2 occurs at the single AA level.

You can see the small spikes in the area plot below the image figure that correspond to the spikes in the scatter figure. However they are drastically overshadowed by the other linkages around them.

I'm not sure of the biological significance but I saw a similar effect in the HIV data ... just never made it into the most recent version of the manuscript.

The thing that really bugs me is the "white-space" in the image figure. This is mostly due to two things: low entropy cutoffs and multiple alignment issues. There's not much I can do to fix the low entropy regions ... I could cut those sections out of the figure but that would make it weirdly distorted. I'm hoping by converting from "alignment-space" to "sequence-space" will clean up some of these issues.

Thursday, March 10, 2011

HCV Circos and Scatter figures

Alrighty, now that I've fixed the "giant png" figure problem I can post and discuss some of figures I've created. First: the scatter figure ... as mentioned before this is all SINGLE matches of 1.0 linkage.

Scan see that the E2 region has a lot of predictive power. as evidenced by the thick column of predictions in the third group (it may only look like the second but Core is very small). The P7 region also has quite a bit of predictive power. This is currently in "alphabetical order" of the genes ... when I re-order it:

Things make much more sense. Now the predictive regions are grouped together.

The circos diagram shows the same basic results except in a more visually appealing way. However, I'm having trouble posting it here ... probably a size issue. Despite converting it to a JPG at 5% quality its still over 2 mb.

The interesting thing will be what is the position which has that high prediction level. So in this poorly labeled histogram ... its just a quick one I'll do a better one tomorrow, just want a general idea:

You could guess that the E2 position: 176 (alignment-space) is that large spike and the second largest is at E1 475 (also alignment-space) ... I will have to do a little work tomorrow to convert these to sequence space.

I'm going to spend some time tomorrow getting more information into the above figure ... perhaps using the "color" of each scatter point indicating the level of linkage ... the other strategy would be to use a pcolor plot. Also doing joint histogram/pcolor would be very helpful.

Converting and shrinking PNG files

Here's a simple command to convert and down-sample PNG files so they aren't 18 mg monsters.

mogrify -quality 20 -format jpg *.png

This will convert all PNGs in the directory to jpg and reduces the quality to 20% of the original. I can't tell the difference at all.

It uses the mogrify command from the ImageMagik toolkit.

HCV circos figure

So far this looks like the best Circos plot I can generate. This has all SINGLE linkages with perfect linkage and the top 5000 links (although there are only ~1500 such linkages). All the rest have too many links and look kinda crummy.

Having trouble posting the actual image ... its about 19mb. Need to find a way to compress them properly.

Playing catch-up

Since I've been slow the past couple of weeks I need to spend the next few days playing catch-up ... which will probably include working through the weekend :(

So today I'm getting the HCV linkage circos diagrams together along with the scatter-plots. Now that the calculations are finished I should be able to get some figures together in the next few hours.

After that I need to do the GO linkage analysis for the Staph data and then find a good way to generalize it so I can run it on all of the finished genomes that I have downloaded.

Monday, March 7, 2011

HCV finally finished

The processing for the HCV linkage results finally finished sometime over the weekend. I'll put everything together sometime on Tuesday. Today I have to help grade senior design presentations and get a final exam ready.

Thursday, March 3, 2011

Started a backup system

I started my own rsync-based backup system which mirrors data on my external drive and my Dropbox account. I've setup cron to do the incremental backups on a nightly bases. Currently I'm backing up the code and data for:

LinkageAnalysis
publookup
quick* projects
judosite
pyMutF

AnalysisCode now takes lists

As of sha

f3eb120328742230926ccefde88982b75979c025

the AnalysisCode can deal with lists of data directories. It can also deal with downloading data for bacterial genomes.

Today's project

Now that SequenceDownloader has gotten 50 species I need to refactor the AnalysisCode so it can deal with MULTIPLE separate species. The question is whether I should do a "depth-first" or "breadth-first" method

Breadth First: For each step, (ie. Download, Processing, Alignment, Linkage, Enrichment, figure generation) process all species before advancing to the next step. The advantage here is that it will be the "easiest" option to code since all I'll have to do is modify the generating functions to take new inputs. The disadvantage is that I'll have to wait a while before I get any results (its an "all or nothing" type of approach). I guess I could start my processing by doing breadth-first on only a few (maybe the top 5).

Depth First: For each species do all steps before advancing to the next species. This would be nice because I can get incremental results. However, I think it would require either a complete re-tooling or some drastic "kludging" to get everything working.

As far as how am I going to keep all of the settings together ... I plan to use a YAML document and then keep that in the source-control. That way everything stays together in one place and can be easily modified when required.

Wednesday, March 2, 2011

started downloading all bacterial sequences

Using sha:

3ba1cd6720cd59c6c5da9da18fcfaf3653b5cffd

I started downloading all bacterial species which have at least 7 completed genomes (at count 50 species). The list is in the repository now.
Used command:

python SequenceDownload.py --file DownloadData.txt

Should only take ~30-60 minutes. This where I will also start with the project tag "BacterialLinkage" to describe how I plan to look for evolution of linkage across many bacterial species.

Edit: Actually takes about ~4 hours to download, unzip and aggregate all of the data.

SequenceDownloader now works completely

As of sha:

b1c22eab927963a716ddc5e6968640ba49d77a85

SequenceDownloader.py now downloads and processes the data.

Need to look for interaction mapping software

Dr. T wants an interaction network with Peter's IBD gene lists that incorporates interaction data and GO terms. I'll look around for some web-tools and see if there are any good cytoscape plugins which can do this sort of thing.

Finishing up SequenceDownload.py functions

I'm finishing and polishing the SequenceDownload function in the linkage project. Currently it only downloads data and unzips it. I still need to make a small function which integrates the data into a set of fasta files for each protein.

Tuesday, March 1, 2011

Steve's microarray results

When comparing healthy individuals there were 14 probes that were found in all patients, 190 patients that were in at least 4 patients and 197 in at least 3.

When comparing healthy to infected individuals there were NO PROBES that were conserved between individuals ... only 3 were in the 197 genes mentioned above.

So (at least from this data ... GSE18816 and GSE24533) it doesn't seem that there are genes commonly differentially expressed when infected. However, this dataset is SUPER tiny and not big enough to make conclusions from.