Wikipedia:Reference desk/Archives/Mathematics/2016 January 7

= January 7 =

Coupon collector problem with drawing the coupons in batches of k coupons (with possible repetition in the batch)
Is there a name for, or any research on, this specific variant of the coupon collector's problem? Specifically, I am looking for a formula that calculates the expected number of batches we need to draw in order to collect all N kinds of coupons, given that each batch contains k coupons that are not necessarily different (we can, for example, get a batch of 10 identical coupons). All I've been able to find is some papers dealing with batches of distinct coupons, but I've not been able to find anything concerning batches with possible repetition. 46.217.53.148 (talk) 14:05, 7 January 2016 (UTC)
 * Since repetitions are allowed within a batch, drawing b batches of k coupons is the same as drawing bk single coupons, so the number of batches needed is just $$B=\lceil C/k\rceil$$, where C is the number of single draws needed ("you got all kinds in the first C coupons"). Hence $$P(B=b)=\sum_{c=(b-1)k+1}^{bk} P(C=c)$$, with P(C=c) given by the classical coupon collector's problem (note the different variable names in the reference). Does that help?  --Tardis (talk) 03:44, 10 January 2016 (UTC)
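A quick numeric sketch of the relation above (Python; the inclusion-exclusion form of P(all collected by draw c) is the standard coupon-collector result, not something from the thread):

```python
from math import comb

def p_collected_by(n, c):
    # P(all n coupon types appear within the first c single draws),
    # by inclusion-exclusion over the set of missing types.
    return sum((-1) ** j * comb(n, j) * (1 - j / n) ** c for j in range(n + 1))

def expected_batches(n, k):
    # B = ceil(C / k), so E[B] = sum_{b>=0} P(B > b) = sum_{b>=0} (1 - P(C <= b*k)).
    total, b = 0.0, 0
    while True:
        tail = 1.0 - p_collected_by(n, b * k)
        if tail < 1e-12:
            return total
        total += tail
        b += 1

# k = 1 recovers the classical answer n * H_n; e.g. n = 5 gives 137/12
```

Setting k = 1 is a useful sanity check, since it must reproduce the classical expected number of single draws.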

Identifying important relationships with too many variables and not enough data
I have an experimental system with N on/off parameters, and hence 2^N possible configurations. Each time I test a configuration of the system I generate N outputs, each in the range [0,1], whose sum is always 1. If the n-th input is set to off, the n-th output will be zero. Beyond that, I suspect that certain inputs have either cooperative or antagonistic effects. For example, perhaps output i is generally higher whenever input j is turned on (cooperative effect of j on i), or perhaps output p is generally lower whenever input q is turned on (antagonistic effect of q on p).

In general, I don't expect the effects to be simple or linear, but I do think some of the relationships may be quite strong. For example, perhaps output p is always 0 if input q is turned on but usually non-zero otherwise.

It is impractical to test all 2^N configurations, but I can test a much smaller sample of M configurations. What is the best way to choose the configurations to test, and how should one analyze the resulting set of M*N outputs to give the best chance of identifying strong relationships among the inputs? Obviously, one can't identify all the relationships, but I am hoping to find an efficient way to search for the strongest relationships, which would then be subjected to more targeted testing for confirmation.

My instinct is that M tests provide the most information if the test configurations are essentially random and hence unrelated to each other, but I'm not sure how to extract information from the results in that case. By contrast, if you have configuration pairs that differ in only a single parameter, then it is easier to interpret the results, but you won't sample the parameter space very widely. Is there literature or a set of techniques that suggests the best way to approach a problem like this? Dragons flight (talk) 15:51, 7 January 2016 (UTC)
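One cheap screen on random configurations, as a sketch (the sizes, the planted effect, and run_system are all hypothetical stand-ins for the real experiment): for each pair (input j, output i), compare the mean of output i with input j on versus off, among the runs where input i itself is on, and rank the pairs by the size of that gap.

```python
import random

random.seed(1)
N, M = 12, 400                     # hypothetical sizes
configs = [[random.randint(0, 1) for _ in range(N)] for _ in range(M)]

def run_system(x):
    # Hypothetical system: each 'on' input contributes weight 1, except output 5,
    # which is suppressed whenever input 2 is on (a planted antagonistic effect).
    raw = [float(b) for b in x]
    if x[2]:
        raw[5] *= 0.1
    s = sum(raw) or 1.0            # outputs are normalised to sum to 1
    return [v / s for v in raw]

outputs = [run_system(x) for x in configs]

def effect(j, i):
    # Mean of output i with input j on minus with input j off,
    # restricted to runs where input i itself is on.
    on  = [y[i] for x, y in zip(configs, outputs) if x[i] and x[j]]
    off = [y[i] for x, y in zip(configs, outputs) if x[i] and not x[j]]
    if not on or not off:
        return 0.0
    return sum(on) / len(on) - sum(off) / len(off)

# Rank cross effects by magnitude; the planted (2 -> 5) pair should top the list.
pairs = sorted(((abs(effect(j, i)), j, i)
                for j in range(N) for i in range(N) if j != i), reverse=True)
```

This only detects marginal effects of single inputs, but it is cheap, and any pair that rises above the normalisation-induced background is a candidate for the more targeted follow-up testing mentioned above.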


 * Sounds like you want to reverse engineer some combinatorial logic. Not easy, and just pray it is combinatorial rather than, say, a finite state machine. It's not the sort of thing one would want to tackle without a good idea of what it is all in aid of; in the first instance one would just be trying to match up the pins with a function.
 * If you are unable to just test all the inputs and produce a table to go into an optimiser, I guess that random input is a good idea. You could use a neural network to recognize the patterns. With a bit of training it should be possible to identify the strongest links, and you can iterate a couple of times to recover the logic if it isn't more than a few layers deep; a bit of ingenuity might enable you to identify quite complex logic like an addition unit, though of course without going through all combinations you can't be certain of finding everything. For instance, if it is an error detection circuit, then extracting a Hamming check result saying pass/fail is rather problematic this way. Dmcq (talk) 16:31, 7 January 2016 (UTC)


 * The analogy to computer logic might be useful to a degree due to the binary nature of the inputs, but this is a natural system, so I think we can rule out addition units and error detection circuits. Dragons flight (talk) 16:45, 7 January 2016 (UTC)
 * Typically, things like this are a good target for genetic algorithms or simulated annealing. Some more modern approaches also try local optimisation on top of GAs (i.e. in each generation for each individual flip up to k individual bits as long as that improves performance). Some of these ideas are implemented in ParamILS, which several people have used for complex optimisation tasks similar to yours (if you treat each individual output as one value to optimise). --Stephan Schulz (talk) 17:00, 7 January 2016 (UTC)
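A minimal sketch of the "flip individual bits as long as that improves performance" step mentioned above (the score function is a hypothetical stand-in for querying the real system):

```python
import random

random.seed(0)
N = 16                             # hypothetical number of on/off inputs

def score(x):
    # Hypothetical objective standing in for the real system's response:
    # reward turning on a hidden subset of inputs, penalise the rest.
    hidden = {1, 4, 7, 10, 13}
    return sum(1 if i in hidden else -0.2 for i, b in enumerate(x) if b)

def local_search(x, rounds=500):
    # One-bit-flip hill climbing: try a random flip, keep it if the score
    # does not get worse, otherwise revert it.
    best = score(x)
    for _ in range(rounds):
        i = random.randrange(N)
        x[i] ^= 1
        s = score(x)
        if s >= best:
            best = s
        else:
            x[i] ^= 1              # revert a worsening flip
    return x, best

x, best = local_search([random.randint(0, 1) for _ in range(N)])
```

For a separable objective like this toy one, single-bit hill climbing finds the optimum; real cooperative/antagonistic interactions create local optima, which is exactly what the GA or simulated-annealing layer on top is meant to escape.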


 * Here, from a quick search, is what people do when faced with unknown logic: they bathe the IC in acid, grind bits down, photograph it using an electron microscope, and then try to figure out what it is doing from that. Still rather difficult, but easier than just treating a hundred pins as inputs and outputs of a black box. Dmcq (talk) 16:40, 7 January 2016 (UTC)


 * Can you give us an idea of the numbers you are dealing with (N, M, etc.)? Maybe you can do a quick pre-test, to determine if there is some type of correlation, and then go back for more detailed analysis of exactly what the correlation is, where one was found. StuRat (talk) 17:04, 7 January 2016 (UTC)


 * N is roughly 50 to 150 (more is better, but the size could be reduced to make studying the system easier). M should be as low as is practical, but in no event more than say 200.  As I said the goal is not to understand everything, but merely to try and find some of the strongest relationships.  Discovering 10 or 15 important interactions would be considered a great result.  Dragons flight (talk) 17:18, 7 January 2016 (UTC)

Fascinating question. I don't think I've ever encountered it, so I can't point you to any literature. But if you use the randomized technique, you could run a separate OLS regression of each output value on all the 0-1 dummy-variable input values, using only data for which the associated input has not been turned off. Then if you want to look for interactive effects, you could also include interaction variables: products of dummy variables. Since OLS does not constrain the dependent (output) variable's predicted value to between 0 and 1, this could in principle give predicted values less than 0 or greater than 1; this can be avoided by first transforming the dependent variable by the inverse of the CDF of the standard normal distribution (or by the logit function, the inverse of the logistic CDF), regressing the transformed variable on the dummies, and then untransforming the predicted values by applying the standard normal CDF to them. Choose M large enough to get plenty of degrees of freedom (M minus the number of right-side explanators). Then the signs and relative magnitudes of the coefficients of the dummy variables in each regression give you relevant information. Loraof (talk) 17:11, 7 January 2016 (UTC)
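A sketch of that recipe with numpy (the data-generating step is invented purely for illustration; only the transform-regress-untransform pipeline comes from the suggestion above):

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()                          # standard normal CDF and its inverse
rng = np.random.default_rng(0)
N, M = 10, 300                             # hypothetical sizes
X = rng.integers(0, 2, size=(M, N)).astype(float)

# Invented outputs in (0, 1): the output is pushed down when input 1 is on.
y = 1 / (1 + np.exp(-(0.5 - 1.5 * X[:, 1] + 0.2 * rng.standard_normal(M))))

# Probit-transform the dependent variable, regress on the input dummies by OLS,
# then untransform predictions back into (0, 1) with the normal CDF.
z = np.array([nd.inv_cdf(min(max(v, 1e-6), 1 - 1e-6)) for v in y])
A = np.column_stack([np.ones(M), X])       # intercept + one dummy per input
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
pred = np.array([nd.cdf(v) for v in A @ coef])
# coef[2], the weight on input 1's dummy, comes out clearly negative
```

The clipping of y away from exactly 0 and 1 is needed because the inverse normal CDF is infinite at the endpoints, which matters here since the real outputs can be exactly zero.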


 * If it is a naturaal system rather thaan logic then something like singular value decomposition or another of the methods used in recommender systems might do the work nicely for you. They are for where there are lots of bits to twiddle but the ovverall system isn't too complex. Dmcq (talk) 17:25, 7 January 2016 (UTC)


 * You must be a native speaker of Afrikaans. :-) StuRat (talk) 23:38, 10 January 2016 (UTC)