Wikipedia:Reference desk/Archives/Mathematics/2017 June 15

= June 15 =

Fastest way to identify members of a set
I'm not a mathematician, so please forgive me if none of what follows is in proper math speak. In fact, maybe you guys can help me do that. I'm also not even sure if this is more a math problem or a coding problem. So, let's say I have a set of n items. Each item has a unique name. I have a process that can identify the name of any item. The cost of this process is 1. The same process can identify the names of the items in any subset of any size, for the same cost, but without telling me the order. So for instance, I can ask for the list of names of every odd item in the set, and get all of their names, for the cost of 1. At no cost, I can keep my set as a simple list of items, or turn it into a matrix of any number of dimensions, and then ask questions like, "what names are in row X, or column Y, etc". What I actually want from all of this is to know both the name and position of every item in the set. Mathematically speaking, what is the cheapest way to do this for a set of size n, and how does the cost scale with n? Now, to avoid the XY problem, here is my motivation, which is purely intellectual, I'm not actually doing this. Years ago, it was not uncommon for a geneticist to find himself with basically a soup of plasmids each containing a random segment of a subject's genome. Each plasmid was in its own living host strain (usually a bacterium), and each might even be unique, but the identities were unknown. Sequencing each plasmid was individually cheap but prohibitively expensive for the bulk required. The solution was to arrange the plasmids in grids, and then in one process sequence "every plasmid in this row", "every plasmid in this column". They broke up the grids into many patterns, each sequencing process telling them everything in the chosen subset but not their precise locations. Eventually there was enough information to pinpoint each plasmid's grid coordinates, and in fewer sequencing reactions than doing each one individually. 
I'm interested in knowing what is the mathematically perfect way to do this for a collection of arbitrary size. Please let me know if anything needs clarification. Someguy1221 (talk) 04:06, 15 June 2017 (UTC)


 * I don't know if this is a problem with an established analysis, but here's my take on it. Arrange them into a matrix. Sequence all n rows and all k columns, for a total cost of n+k. Since all elements have a unique name, you know the row and column containing each element, and you're done. The question is, what should n and k be? If the total number of items is a square number, $$q^2$$, then a simple calculus problem of minimizing $$n+k$$ subject to $$nk=q^2$$ yields $$n=k=q$$: the matrix should be square. If the number of elements is not square, then I guess you would have to make the matrix close to square (possibly with some empty slots in the matrix), and choose the configuration with the best n+k. I cannot see how one could do any better than this procedure. Loraof (talk) 15:10, 15 June 2017 (UTC)
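Loraof's row-and-column scheme can be sketched as a small simulation (hypothetical names; each function call below stands in for one sequencing reaction, which returns the unordered set of names in a slice):

```python
def locate_by_grid(grid):
    """Given an r x c grid of distinct names, recover each name's (row, col)
    using r row-queries plus c column-queries, i.e. r + c "sequencings"."""
    r, c = len(grid), len(grid[0])
    row_sets = [set(row) for row in grid]                              # r queries
    col_sets = [set(grid[i][j] for i in range(r)) for j in range(c)]   # c queries
    positions = {}
    for i, rs in enumerate(row_sets):
        for j, cs in enumerate(col_sets):
            common = rs & cs          # names are unique, so this is a singleton
            if common:
                positions[common.pop()] = (i, j)
    return positions

grid = [["a", "b", "c"],
        ["d", "e", "f"],
        ["g", "h", "i"]]
print(locate_by_grid(grid))  # 9 items located with 3 + 3 = 6 queries
```

For a square q×q grid this costs 2q queries instead of q² individual ones.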


 * And if you have, say, 26 elements, you could put 25 of them into a square matrix and sequence each row and column. By process of elimination, when you've identified the locations of those 25, you know the omitted one is the remaining name (assuming you know all the names in advance). So doing 26 is no more costly than doing 25. Loraof (talk) 15:17, 15 June 2017 (UTC)


 * 25 elements in a 5×5 2D array takes $$2 \sqrt{25} = 10$$ sequencings, whereas 27 elements in a 3×3×3 3D array would need only $$3 \sqrt[3]{27} = 9$$ sequencings. For an m-dimensional array $$m \sqrt[m]{n}$$ sequencings are needed, and that is minimized by choosing $$m = \ln(n)$$. --catslash (talk) 16:42, 15 June 2017 (UTC)


 * Come to think of it $$\sqrt[\ln(n)]{n} = e^{\frac{1}{\ln(n)} \ln(n)} = e$$ for any n, so choose a dimension for the array that results in about e items in each direction. --catslash (talk) 16:56, 15 June 2017 (UTC)
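The dimension argument above is easy to check numerically (a small sketch; n = 1000 is an arbitrary example, and the best integer dimension lands next to $$\ln(n)$$):

```python
import math

def slice_cost(n, m):
    """Number of slice-queries for n items in an m-dimensional array
    with n**(1/m) items along each side: m * n**(1/m)."""
    return m * n ** (1.0 / m)

n = 1000
costs = {m: slice_cost(n, m) for m in range(1, 12)}
best_m = min(costs, key=costs.get)
print(best_m, round(math.log(n), 2))  # → 7 6.91  (best m sits next to ln(n))
```

With roughly e ≈ 2.72 items per side, the total cost is about e·ln(n) queries.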


 * In fact the world is slightly better. What you're doing is basically binary search, and indeed if all labels are distinct then it's easy to show that the base-2 logarithm is optimal. (Plus 1, if we require actually seeing each label at least once.) The savings come from the fact that it is not actually necessary to look at the last slice in each direction -- you already know that it holds the complement of the other slices. (On the other hand, if entries are repeated, then this strategy may be a total disaster.) --JBL (talk) 12:04, 16 June 2017 (UTC)
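One way to realize the base-2 bound JBL mentions (a hypothetical sketch, assuming all names are distinct): index the positions in binary and, for each bit, query the slice of positions where that bit is 1. A name's position is then read off from which query results contain it; one initial query of the whole set supplies the list of names (the "plus 1").

```python
import math

def locate_by_bits(items):
    """Recover every name's position in 1 + ceil(log2(n)) set-queries,
    assuming all names in `items` are distinct."""
    n = len(items)
    k = max(1, math.ceil(math.log2(n)))
    all_names = set(items)                          # 1 query: learn the names
    bit_sets = [set(items[p] for p in range(n) if (p >> j) & 1)
                for j in range(k)]                  # k queries, one per bit
    # A name's position is the binary number formed by its query memberships.
    return {name: sum(1 << j for j in range(k) if name in bit_sets[j])
            for name in all_names}

items = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(sorted(locate_by_bits(items).items()))  # 1 + 3 queries instead of 8
```

Each item lies in a distinct subset of the k query-slices, which is exactly why about $$\log_2 n$$ queries suffice, and why repeated names wreck the scheme.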


 * Sequence assembly might be of interest, since it solves a related problem. StuRat (talk) 17:02, 16 June 2017 (UTC)

Thanks guys. Any thoughts on what to do if there are repeats? Someguy1221 (talk) 23:21, 16 June 2017 (UTC)


 * Set their number to 0 to show they aren't relevant ? StuRat (talk) 00:05, 17 June 2017 (UTC)


 * No, you can't, because you don't know where they are. Someguy1221 (talk) 01:38, 17 June 2017 (UTC)


 * As you identify each item's name, you use an insert sort (or ideally a more efficient sort, like a bin sort) to place it in a list, then you check each new name against that list, using a binary search, prior to insertion. If the name is already there, then you set the number of the current item to 0, and don't insert it into the list, since it's already there. StuRat (talk) 17:22, 17 June 2017 (UTC)
 * You're not helping. The goal is to identify the coordinates of each item. If items are not unique, the above techniques will not locate them with confidence. Someguy1221 (talk) 17:41, 17 June 2017 (UTC)


 * The same approach (estimate your shape as some sort of cuboid, do it by slices) can work if duplicates are relatively sparse. But if duplicates are very commonplace (e.g., in the worst case when every entry has equal odds of being one of two types) then there may not be anything better than checking every one.  What is actually optimal depends on the distribution of repeats and your relative tolerance for doing extra tests versus not fully determining everything. JBL (talk) 13:39, 19 June 2017 (UTC)


 * It seems that JBL is assuming that the members of the set are known in advance, whereas I was assuming that they are not. Either way, the grid method finds all the unique items, and then, some of the grid points having been filled and others needing to be filled, it's likely that the locations of the merely-uncommon items can be deduced with certainty.


 * Based on complete ignorance of the application, I would expect a small number of common species and a large number of uncommon or unique items - roughly obeying a power law, such as Zipf's law.


 * The rare types having been picked out using the grid method, there is still some small gain to be had over sequencing the residue of common items individually. Suppose approximately equal numbers of just two species, A and B, remain, totalling m items. Group the items into m/2 pairs and sequence each pair. About 1/4 will be AA, 1/4 BB and 1/2 AB. Both items in each homogeneous pair are known, so m/2 items have been identified for m/2 tests. Sequencing just one member of each of the remaining m/4 heterogeneous pairs identifies both. The m items are thus identified in 3m/4 tests. --catslash (talk) 21:36, 19 June 2017 (UTC)
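catslash's pairing trick can be checked by simulation (a rough sketch with randomly generated labels; note the 3m/4 figure is an average over random pairings, not a worst case):

```python
import random

def pairing_cost(labels):
    """Tests needed to identify every item: one test per pair, plus one
    extra test for each mixed (heterogeneous) pair."""
    tests = 0
    for i in range(0, len(labels) - 1, 2):
        tests += 1                       # sequence the pair together
        if labels[i] != labels[i + 1]:
            tests += 1                   # mixed pair: one more test settles both
    return tests

random.seed(0)
m = 10_000
labels = [random.choice("AB") for _ in range(m)]
print(pairing_cost(labels) / m)  # close to 3/4 on average
```

In the worst case (every pair mixed) this still costs m tests, matching JBL's caveat below.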


 * I'm not sure how much of your comment is addressed to me. About your first sentence, I'm not sure I understand it completely.  What is the assumption about which we disagree?  There is some universe of possible names, and once you've looked at everything once, you know what subset of it actually appears.
 * Are you sure that "the grid approach finds all the unique items"? Maybe it depends what "the grid approach" means, exactly; I think that the optimal approach in the case that there are no repeats (which requires something like $$1 + \log_2 m$$ checks) does not have the property you describe -- you won't necessarily be able to tell which values are unique. But certainly you are right for some method involving $$O(\log m)$$ checks.
 * In applications like genomics the expected distribution is definitely knowable empirically; I don't know enough statistics to know if power law is the right thing, but it sounds believable.
 * Actually it's an interesting question what the worst case is for this kind of question -- I said it was 2 equally likely species, but maybe that's not right. (You definitely are right that on average it is not necessary to use the full m checks in this setting; although the process you describe actually still requires m in the worst case.)  --JBL (talk) 22:36, 19 June 2017 (UTC)
 * Sorry, I was being obtuse and imagined you were saying that it was unnecessary to examine the last slice in any direction as opposed to in any direction after the first. --catslash (talk) 23:02, 19 June 2017 (UTC)
 * No apology necessary! --JBL (talk) 14:38, 21 June 2017 (UTC)