Wikipedia:Reference desk/Archives/Computing/2013 September 13

= September 13 =

deleting similar files
Hi, I am a medical student. I have study material on my laptop that I copied from different sources, but some files are copied more than once. For example, a book by Harrison is copied in possibly 5 different locations on my laptop. It is difficult to locate all the duplicated files and delete them separately. Is there any convenient way, any program, that could detect the duplicate files and delete them, leaving each file in just one location? Thanks. — Preceding unsigned comment added by 201.220.215.11 (talk) 03:57, 13 September 2013 (UTC)
 * I haven't tested it, but this should fit the bill :-) Phoenixia1177 (talk) 04:35, 13 September 2013 (UTC)
 * CCleaner includes a find (and delete) duplicate files function. Mitch Ames (talk) 09:18, 14 September 2013 (UTC)
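 * For the record, the core of such a duplicate finder fits in a few lines: group files by a hash of their contents, and any hash with more than one path is a duplicate set. This is a minimal illustrative sketch (not how CCleaner or any particular tool works), and `find_duplicates` is a name invented here:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files under root by the SHA-256 of their contents.

    Returns a dict mapping each content hash that occurs more than
    once to the list of paths holding that identical content.
    """
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
            except OSError:
                continue  # unreadable file; skip it
            by_hash[digest].append(path)
    # keep only hashes seen at more than one path
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

A real tool would also compare file sizes first (cheaper than hashing) and ask before deleting anything; this sketch only reports the duplicate groups.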

random lookup
Suppose I use the following lookup algorithm:

1. What value should I pick for K so that, if found remains false, I can have fairly good confidence that the value really isn't in there? For example, I could then use this as my search algorithm in a kernel and it wouldn't matter. Simple example: if there are only 2 elements and I randomly look at them 64 times, that should be plenty, because the chance that I would never look at one of them is 1/2^64, i.e. vanishingly small.
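For concreteness, such a random lookup might be sketched as follows. This is my reconstruction of the algorithm being described, assuming each round probes one uniformly random index and K counts rounds per element (so K*n probes in total); `random_lookup` is a name invented here:

```python
import random

def random_lookup(arr, target, k):
    """Probe arr at uniformly random indices up to k * len(arr) times.

    Returns True if target was seen at some probed index.  Returns
    False after k * len(arr) misses -- which does not *prove* target
    is absent, it only makes its presence very unlikely.
    """
    n = len(arr)
    for _ in range(k * n):
        i = random.randrange(n)  # uniformly random index in [0, n)
        if arr[i] == target:
            return True
    return False
```

With 2 elements and k = 64 (128 probes), missing a present value requires losing 128 independent coin flips, so a False answer is effectively conclusive.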

For this exercise let's assume that we have a good source of true random numbers!

2. What is the average performance of this 'search algorithm'?

3. What is the "worst case" performance of this search algorithm IN PRACTICE? I realize that in principle it could go on 'forever' (there is no point at which you are guaranteed that the next random number is the one with the subscript you're trying to randomly find).

However, in practice there has to be a 'typical' limit on how long the shenanigans go on where you keep failing to pick the one index that holds the item you're searching for!

What is that limit?

For example, if I use this algorithm to look through an array of 100 indices, then what should the practical limit be in terms of how long this execution could 'hang' (if the value is really in there)? 1000 lookups - 10,000 - 100,000 - a million - or what?

Thanks!! 195.56.96.233 (talk) 15:35, 13 September 2013 (UTC)


 * I don't have the time to work out the math right now, but you've started on the right approach in your explanation of point 1. If you've made it to any given round, your odds of success on that round are 1/n. Your odds of success on or before round k are P(n,k) = P(n,k-1) + (1-P(n,k-1)) * (1/n). That is the odds of success by the previous round, plus the odds you haven't succeeded yet times 1/n, the odds of succeeding during that round. You can solve that recurrence relation to get a closed-form solution for P(n,k), and work out the answers to your questions from there. For example, by setting P(n,k) = .99 and solving for k, you could work out how many rounds you need to run in order to expect success in 99% of situations. Katie R (talk) 19:27, 13 September 2013 (UTC)
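The recurrence above does have a simple closed form: each round independently misses with probability 1 - 1/n, so P(n,k) = 1 - (1 - 1/n)^k. A short sketch checking this numerically and solving for k at a given confidence (function names `p_success` and `rounds_needed` are invented here):

```python
import math

def p_success(n, k):
    """Chance of at least one hit after k uniform random probes of n slots.

    Closed form of the recurrence P(n,k) = P(n,k-1) + (1 - P(n,k-1)) / n
    with P(n,0) = 0: every probe misses with probability (1 - 1/n)**k.
    """
    return 1.0 - (1.0 - 1.0 / n) ** k

def rounds_needed(n, confidence=0.99):
    """Smallest k with p_success(n, k) >= confidence, solved with logarithms."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - 1.0 / n))
```

For n = 100 and 99% confidence this gives k = 459 rounds, i.e. roughly 4.6n, matching the asymptotic constant derived further down the thread.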


 * Also, you might get better responses on the math desk. Although this is related to a programming problem, all the work involved in answering it is mathematics. Katie R (talk) 19:29, 13 September 2013 (UTC)


 * I agree this is a problem in statistics, and the whole array thing is a distraction. If I understand the problem correctly, you have an event which occurs with probability either 0 or e (= 1/arraysize) depending on some unknown parameter, say P(event) = ep where p = 0 or 1. You sample for the event up to N times; if it occurs, stop sampling and conclude that the probability is e; otherwise it's inconclusive. If the probability of the event is e, then the probability that the event does not occur in N tries is $$(1-e)^N$$. So if you want to state with, say, 99% confidence that the probability is 0, then solve $$(1-e)^N=.01$$ for N (use logarithms). If you have reason to believe that p has a certain probability of being 1, then a Bayesian approach would be better. If you know for sure that the probability is e and want to know on average how many tries will be needed for it to occur, see Geometric distribution. The mean number of tries is 1/e but the median number is something else, so it depends on what you mean by "on average". --RDBury (talk) 21:47, 13 September 2013 (UTC)
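The three quantities in that reply are each a one-liner with logarithms. A sketch, using e = 1/100 as the running example (function names are invented here):

```python
import math

def trials_for_confidence(e, alpha=0.01):
    """Samples N so that the miss probability (1 - e)**N drops below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - e))

def geometric_mean_trials(e):
    """Mean number of tries until the event first occurs (geometric distribution)."""
    return 1.0 / e

def geometric_median_trials(e):
    """Median number of tries: smallest N with 1 - (1 - e)**N >= 1/2."""
    return math.ceil(math.log(0.5) / math.log(1.0 - e))
```

For e = 0.01 these give N = 459 for 99% confidence, a mean of 100 tries, and a median of only 69 tries, which illustrates why "on average" needs to be pinned down.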


 * (First, you surely meant  rather than  .)
 * The probability of failing to find the value (when it's there) is $$\left(\frac{n-1}n\right)^g$$, where n is the size of the array and g is the number of guesses. To drive that probability below p, just solve for g: $$g=\log_{1-1/n}p$$, which is approximately $$g=-n\log p$$ for large n (still positive because $$\log p<0$$ for $$p<1$$).  Therefore, your $$K=-\log p$$ asymptotically (e.g., 4.6 for $$p=1\%$$).  (It is non-trivial that K is asymptotically independent of n!)  Conveniently, the asymptotic answer (remember to take $$\lceil nK\rceil$$) is also the worst case: for smaller n, K could be smaller.  In the extreme case of $$n=2$$, $$K=g/2=\frac12\log_{1/2}p=\frac{-\log p}{2\log 2}\approx-0.72\log p$$.  (Of course, for $$n=1$$, any $$K>0$$ works for any p.)
 * Directly from geometric distribution, the mean time is $$\frac1p=n$$. (This is the p from that article, not the p in #1 and #3.)
 * There is no worst case, as you said; just pick a suitably ridiculous p ("one in a million/billion/trillion/whatever") and evaluate g. If I pick even $$p=10^{-18}$$, $$g\leq\lceil41.45n\rceil$$, so I don't think you'll ever need to consider a million lookups on a hundred elements.  (It's still $$O(n)$$ in Big-O notation for any fixed p.)  --Tardis (talk) 17:26, 14 September 2013 (UTC)
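Putting the bound from that last point into code: the exact guess budget solves $$\left(\frac{n-1}n\right)^g\leq p$$ directly, and the asymptotic form is $$\lceil nK\rceil$$ with $$K=-\log p$$. A sketch checking the 41.45n figure for $$p=10^{-18}$$ (function names are invented here):

```python
import math

def guess_budget(n, p):
    """Exact number of probes g with miss probability ((n - 1) / n)**g <= p."""
    return math.ceil(math.log(p) / math.log(1.0 - 1.0 / n))

def asymptotic_budget(n, p):
    """Large-n bound ceil(n * K) with K = -ln(p); K is about 41.45 for p = 1e-18."""
    return math.ceil(-n * math.log(p))
```

For n = 100 and p = 1e-18 the asymptotic bound is 4145 probes, the exact answer slightly less, so even at "one in a quintillion" confidence a hundred-element array never needs anywhere near a million lookups.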