Tuesday, March 29, 2011

Merging sequences within trees

Once I found Dendropy it became pretty easy to parse and determine which sequences can be merged. Something akin to this code works well:

tree = Tree.get_from_string(tree_str, schema = 'newick')
done = set()
groups = []
for t in tree.level_order_node_iter():
  if t.edge.length >= 90:
  lens = [x.edge.length >= 90 for x in t.level_order_iter() if not x.is_leaf()]
  if all(lens) or not any(lens) or t.is_leaf():
    leafs = set([str(x.taxon) for x in t.leaf_nodes()])
    groups.append(leafs-done)
    done |= leafs
groups = [x for x in groups if x]


This will produce a list of sets in which all members of the set are either all descendants have a bootstrap value of >90 or none of the values are below >90.

In my HIV data its 557 groups out of 707. The largest group is 6 sequences. Below is the tree:
I'm not really great with the evolutionary biology so I'm not sure if this a warning sign ... I'll send a message to JB to see if he sees any red flags.

I've already started this on the HCV genomes as well.

No comments:

Post a Comment