I plan to use a permutation test to find the statistical significance of linkage values. I know that the linkage is a function of the entropy of each column and the number of unique values in each column. There may be a closed-form way to calculate the significance, but I can't seem to figure it out. So I'm just going to run a permutation test for every combination of entropy pairs and unique-value counts, and then interpolate for the values that fall between my sample points.
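A minimal sketch of the kind of permutation test I have in mind. The post doesn't pin down the linkage formula here, so this uses mutual information between two columns as a stand-in score; the function names and the `n_perm` and `seed` parameters are just illustrative assumptions.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(col_a, col_b):
    """Stand-in linkage score (an assumption): H(A) + H(B) - H(A, B)."""
    return entropy(col_a) + entropy(col_b) - entropy(list(zip(col_a, col_b)))


def permutation_p_value(col_a, col_b, n_perm=1000, seed=0):
    """Fraction of shuffles whose score is at least the observed score."""
    rng = random.Random(seed)
    observed = mutual_information(col_a, col_b)
    shuffled = list(col_b)
    hits = 0
    for _ in range(n_perm):
        # Shuffling one column breaks any pairing between the two columns
        # while keeping each column's entropy and unique-value count fixed.
        rng.shuffle(shuffled)
        # Small tolerance so float rounding does not hide exact ties.
        if mutual_information(col_a, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Shuffling only one column is what makes the null distribution depend on just the per-column entropies and unique-value counts, which is why tabulating it over a grid of those values seems feasible.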
However, since I have a huge range to check, it's going to be very difficult to store all of the data. I'll need the CDFs for entropies from 0.1 to 5.0 in steps of 0.1 (for each column) and unique values from 1 to 30 (for each column). That is 50*50*30*30, so about 2.25 million combinations, and I'll need to store about 100 integers for each. So about 225 million integers need to be stored, and 22.5 trillion graphs need to be calculated.
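A quick check of the grid arithmetic, as a sketch. The variable names are mine, and the 4-bytes-per-integer figure is an assumption (32-bit storage), not something I've settled on.

```python
# Entropy grid: 0.1, 0.2, ..., 5.0 (50 steps per column).
entropy_steps = [round(0.1 * i, 1) for i in range(1, 51)]
# Unique-value counts: 1 through 30 per column.
unique_counts = range(1, 31)

# Two columns, each with its own entropy and unique-value count.
n_combinations = len(entropy_steps) ** 2 * len(unique_counts) ** 2

ints_per_cdf = 100   # roughly 100 integers stored per combination
n_integers = n_combinations * ints_per_cdf

bytes_per_int = 4    # assumption: 32-bit integers
storage_gb = n_integers * bytes_per_int / 1e9

print(f"{n_combinations:,} combinations")
print(f"{n_integers:,} integers, about {storage_gb:.1f} GB as int32")
```

That works out to 2,250,000 combinations and 225,000,000 stored integers, or roughly 0.9 GB uncompressed under the 32-bit assumption.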
So I'll need a fast way to store, retrieve, and filter this data. PyTables seems to be my best choice: it claims to be a very fast way to write, store, and query data, and it is built on the HDF5 standard. I'm going to play with it tonight and see if I can get it to work well.