Wikipedia:Reference desk/Archives/Mathematics/2006 November 16

= November 16 =

variance of the total sample
Dear Sir/Madam,

I have a question on how to find variance of the total sample. Suppose we have a sample of size N. It is divided into k sub-samples of size f(k), f(1)+f(2)+ ... f(k)=1. n(1)+n(2)+ ... +n(k) = N where n(k) is the number (size) of sub-sample k. Suppose we know the mean and variance of each sub-sample, m(k) and var(k). Question: How can I calculate the variance of the total sample?

I am studying performance of students of different groups (regions). I have only data on regional performances (regional aggregates, no individual data). I want to evaluate total mean and variance from regional data. I would greatly appreciate your help and suggestions on reference texts. Thank you very much for your help in advance.

Best wishes,

Koo Woong Park, Economics researcher


 * I'm not quite sure what the role of f(k) is; I guess f(k) = n(k)/N. It is not used below. In computing the sample size, two methods are common, one in which a sum of squares is divided by the sample size ((1/n)×SUM), and one in which that sum is divideed by the sample size minus 1 ((1/(n−1))×SUM). The second avoids a bias of the first which underestimates the population variance; see further Variance. In the following, I assume the simpler method has been used in computing var(k); otherwise, they must be multiplied by (n(k)−1)/n(k) first. Writing M and VAR for the mean and variance of the total sample, the formulas are then:
 * M = (1/N) × Σk n(k)m(k)
 * VAR = (1/N) × Σk (n(k)var(k) + (m(k) − M)2)
 * --Lambiam Talk 06:14, 16 November 2006 (UTC)

Map of closest points
Say I want to make a map showing cities, but also the area where that city is the closest.

Within a defined area (eg. a country) if you pick a random point, you can work out the closest city. Around city A, you could define a region A, where at any point in the region, city A will be closer than any other city. How do you calculate the boundaries of this region? I think that these regions will be polygons.

I have tried searching wikipedia and google for an algorithm to do this, but cannot seem to find any.

I am eventually hoping to write a java application that can display this, given coordinates of the points.

Thanks in advance, RobBrisbane 06:41, 16 November 2006 (UTC)


 * See Voronoi diagram. --Lambiam Talk  06:46, 16 November 2006 (UTC)


 * Thanks, while that article does not have an algorithm, knowing what it is called a Voronoi diagram will certainly help. I think i've worked out in my head an algorithm - will try it out. --RobBrisbane 07:18, 16 November 2006 (UTC)


 * I'm not sure what you expect in the article itself. Through the references and links you can find descriptions of algorithms, and also complete implementations. A robust implementation must handle issues common to many algorithms in computational geometry. Especially important is the role of precision and its effect on accuracy and consistency. We also have a modest practical concern for time and space efficiency. Èuk Roman provides a detailed explanation of one well-known algorithm, that of Steve Fortune. Brief descriptions of several algorithms, though with no concern for robustness, are also online, along with a great deal of other material found through the Voronoi.com web site. Note that some algorithms prefer to compute a Delaunay triangulation, which is then trivial to convert to a Voronoi diagram. (This fact gets but one sentence in our article, despite its importance.) --KSmrqT 10:13, 16 November 2006 (UTC)


 * The basis for all algorithms I know of is to use Delaunay triangulation; for algorithms see there. See further the external links to the CGAL library, which is of high quality. --Lambiam Talk  10:03, 16 November 2006 (UTC)
 * Yes, CGAL is a valuable resource. It does have two important limitations: (1) It is an integrated package of routines, making it awkward to extract a small, targeted piece. (2) For some applications its dual-license scheme may pose a modest barrier. --KSmrqT 10:20, 16 November 2006 (UTC)

Mathematical Problem
I have a mathematical problem .... Find the domain & the range of the function : f(x) = 2x^2+6x-7


 * Please state what part of the question you don't understand so we can help you. Fredrik Johansson 13:49, 16 November 2006 (UTC)


 * Never mind that; this question is entirely ill-posed. Stating a formula for a function does not in general restrict or specify its domain, and its range is certainly dependent on that unmade choice.  I can imagine this as $$f\colon\mathbb{R}\rightarrow\mathbb{R}$$ (real numbers), or $$f\colon\mathbb{C}\rightarrow\mathbb{C}$$ (complex numbers), or even $$f\colon\mathbb{Z}_n^*\rightarrow\mathbb{Z}_n^*$$ (multiplicative group of integers modulo n for any $$n$$).  The domain and codomain could even both be $$\mathbb{Z}+\frac12$$ (the half-integers) or $$\mathbb{A}$$ (the algebraic numbers).  The domain and codomain can be different, too, with $$f\colon\mathbb{Z}\rightarrow2\mathbb{Z}+1$$ (the integers and the odd numbers).  How is anyone supposed to learn math by working problems with no answer?  (The actual content here is of course the links; perhaps the questioner will find useful information there.)  --Tardis 16:47, 16 November 2006 (UTC)


 * The process of teaching maths - and much science for that matter - consists of telling half-truths and incomplete topics, which then get expanded, replaced, or even contradicted at a later level. To criticise a pedagogic question for being ill-formed relative to the whole of the subject is unreasonable. --ColinFine


 * Perhaps I was a bit too quick to judge; this question should presumably be interpreted in the recursion theory sense of the word "domain" for partial functions, with all functions at this level taken to be partial functions mapping the reals onto themselves. In that context (which is probably obvious, even if unspecified, to the student), the question certainly has a unique answer.  I'll take leave of my soapbox now until some problem that is ill-posed due to thoughtlessness rather than reasonable circumscription arises.  --Tardis 15:39, 17 November 2006 (UTC)


 * Have you read Domain and Range? —Preceding unsigned comment added by 129.162.1.42 (talk • contribs) 16:33, 16 November 2006

To get back to the actual question, there are many methods for solving this type of thing, depending on what you've already been taught. One of the simplest (but also somewhat time consuming) methods is the create a chart of values, then graph those values. I'll start you out:

x    f(x) ==   ==== -2    -11 -1    -11  0     -7 +1     +1 +2    +13

Now, noticing that f(x) is the same (-11) when x = -1 or -2, this tells us the curve reaches a relative or absolute minimum or maximum in that area. Let's add x = -1.5 to the chart:

x    f(x) ==   ==== -2    -11 -1.5  -11.5 -1    -11  0     -7 +1     +1 +2    +13

This type of quadratic graph is also symmetrical, so we can add a few more terms to the chart:

x    f(x) ==   ==== -5    +13 -4     +1 -3     -7 -2    -11 -1.5  -11.5 -1    -11  0     -7 +1     +1 +2    +13

Now graph this chart. This should allow you to see the range and domain. StuRat 01:54, 17 November 2006 (UTC)
 * Assume we are working over the reals. First note that the highest term has an even power i.e. x2, which will dominate the result for large numbers. If you have an odd power then the range will be the whole of R, but even powers will give ranges like [a,infinity) if the coefficient is positive and (-infinity,b] if the coeff is negative. The local maxima and minima of the functions will occur when the first derivative is zero, so solve f'(x)=0 and evaluate f(x) at each point. All quadratics they can be written in the form f(x)=m(x-a)^2+n, so a bit of algebra will give a,m,n and the only min/max will occur at x=a. --Salix alba (talk) 09:44, 17 November 2006 (UTC)


 * I must agree with Tardis that, as presented here, the problem is ill-posed. However, we can make some educated guesses. A problem worded like this would not be seen in a sophisticated class, and this is most likely one of a number of homework problems. Perhaps another problem considers the function f(x) = (1+&radic;x)/(2−x). Therefore we might reasonably assume subsets of the real numbers are sought, for both domain and range. In my example, we must exclude x values that are negative, as well as the isolated value 2; that could be taken to give us a domain. After finding the domain (the permissible inputs), we must explore the possible outputs. In general, this requires insight special to the function. In my example, we know that the square root will always give a non-negative value; and for x &ge; 0, we know that 2−x varies from 2 down to −&infin;. When x is exactly 0, the value of f(x) is exactly 1&frasl;2. As x increases toward 2, the denominator approaches zero, overwhelms the numerator, and forces the function value toward +&infin;. We skip exactly 2, to avoid a divide by zero. Now as x increases toward +&infin;, the numerator remains positive but the denominator is always negative; therefore the function value goes from −&infin; toward 0. We conclude that the range is all of the real line except the half-open interval [0,1&frasl;2).
 * The actual given problem is somewhat simpler, but the general idea of partitions and trends applies. --KSmrqT 09:48, 18 November 2006 (UTC)

watertesting a gas pipeline
My son needs to watertest a natural gas pipeline that is 5.25 inchs (ID) in diameter. How many gallons of water will it take per foot? thanks...Lyle Corder


 *  email address removed to prevent spam 


 * So, it's a cylinder? And you want the volume of each foot in gallons?  So it's a cylinder that's a foot long?  --Tardis 16:58, 16 November 2006 (UTC)


 * Find the area of a circle with a diameter of 5.25 inches (A = piR^2, where R = D/2). Then multiply that by a length of 12 inches to get the volume, in square inches, of one foot's length of pipe.  Then convert that value to gallons (there are 231 cubic inches in a gallon).  Post his work here for a check. StuRat 01:22, 17 November 2006 (UTC)


 * Is it sufficient to test a gas pipeline with water? Even if it is water-tight it could leak gas. Yet it seems to be done by professionals. Interesting.
 * Anyway, since we are given an inner diameter, we may assume a standard pipe with a circular cross section. We must also make another assumption, that the gallons are U.S. gallons, not imperial gallons (and not dry gallons). Then, by definition, a gallon is exactly 231 cubic inches. If we can calculate the volume of a foot of pipe in cubic inches, the rest is easy. That's what StuRat is walking through. To get cubic inches we must work entirely in inches, which is why we convert one foot to 12 inches. We can simplify the area calculation a little by writing A = &pi;&frasl;4D2, with &pi;&frasl;4 approximately equal to 0.7854. By my reckoning, it should take more than 1 gallon and less than 5 gallons per foot. (We don't give full answers in case this is a homework question in disguise.) --KSmrqT 10:29, 18 November 2006 (UTC)

Logic problem: Getting data from every computer onto every other computer efficiently
Hi all,

I'm faced with a problem that, were it written down (which it is now), I think would count as a reasonable logic puzzle.

I (will, on Monday) have data on 60 computers. The data do not overlap -- every computer has a different data file. I need to collect all data files and put them on all 60 computers. i.e. all 60 computers will need to have all 60 different data files on them.

The computers are not networked. The "simplest" way would be to use one CD-RW and go from computer to computer to burn all the data, and then go back to each computer and unload all the data. This, of course, requires sticking in a CD 120 times, and will take quite a while.

However, I have unlimited numbers of CD-RWs, and two or three other people to help me. I feel sure that there is some remarkably clever, more efficient way to do this. Can anyone think of a clever way?

Thanks for any thoughts!

--John


 * This is known as an all-to-all broadcast or all-gather in parallel computing. Unfortunately, no articles, but I do have a link to the MPI specification; perhaps further research will turn up clever implementations.  However, I can say this: $$2n-1=119$$ CD interactions is the best you can do, because each computer must both read data and write data, and every computer except one must be writing to a disc that doesn't have everything on it and thus another disc (or the same disc again) must be provided to finish reading.  The algorithm you stated achieves this optimum, since the last computer you visit can read and burn at the same visit.  So the only thing you can hope to achieve is to utilize your accomplices to parallelize the project.  The best strategy that occurs to me is just to divide up the computers into $$m$$ (nearly) equal groups, one for each of your $$m$$ people.  Then each of you go to each of your $$\sim n/m\!$$ computers and burn, then go to all $$n$$ computers and load.  Again, the last computer you burn with can also read, so that's a total of $$n+m(n-1)$$ CD insertions; with $$m$$ people, it should take as long as one person doing $$n/m+n-1$$ insertions, which is a speedup over the optimal serial algorithm of $$\frac{2n-1}{n(1+1/m)-1}\approx\frac2{1+1/m}\mbox{ for }n\gg1$$.  So with 3 assistants you can only expect to get $$\frac2{4/3}=150\%$$ improvement.  Hope this helps, and have fun!  --Tardis 17:18, 16 November 2006 (UTC)


 * Ok, I almost understand your parallel solution, though not quite. Here's a solution I've come up with -- is it more or less efficient than the one you are suggesting (or is it the exact same solution)? The three people take 20 computers each, and burn the data onto one CD each (three CDs total). We then load all 60 files onto one computer. We then burn all 60 files onto 3 CDs. We then go around and each load the data onto 20 computers each.
 * This would be the same number of total interactions, but would only take the time required for 40 interactions (20 load, 20 unload), (+/- 1 for the "base" computer?).
 * Is that the right solution? Thanks! --John


 * Burn data on first pass: $$n$$ operations (parallel).
 * Pick one of the final computers and load everything on it: a free copy from the disc already there and then $$m-1$$ operations (serial).
 * Burning new discs, reusing whichever is last in it: $$m-1$$ operations (serial).
 * Putting data on all other computers: $$n-1$$ operations (parallel).
 * So your total cost is $$2n+2m-3$$ disc insertions (although perhaps burns should count as costly?), which is better than my $$n+mn-1$$ if $$n+2m-2<nm$$ or $$2m-2<n(m-1)$$ or $$2<n$$, which it certainly is. But you have serial operations; your total time is $$\left\lceil\frac nm\right\rceil+2m-2+\left\lceil\frac{n-1}m\right\rceil\approx2\left(\frac nm+m-1\right)$$, which is better than my $$n/m+n-1$$ if $$\frac nm+2m-1<n$$ or $$2m(m-1)+1<n(m-1)$$, which it is for $$m\ll n$$.  So that is better all around.  --Tardis 18:29, 16 November 2006 (UTC)


 * (after edit conflict) Well, I would try this: pass a CD A through computers 1..30 to collect their data, while your friend does the same with disk B on comps 31..60. When you're finished, store files 1..30 from A on comp.30, and let him store files 31..60 on comp.60. Then you swap CDs, so you can update the B with files 1..30 from comp.30 (and comp.30 with files 31..60 from B disk), and your friend updates A with files 31..60 (and comp.60 with files 1..30 from A). Then you both walk back and fill remaining 29 of your respective 30 computers with 60 files. This way each of you sticks a disk 60 times, and each CD gets sticked 60 times, so you save approx. half of time.
 * For N people (N dividing 60) the process splits into three phases:
 * each person collects data from 60/N computers on a disk,
 * disks are circulated (N-1) times on N computers, to collect 60 files on N comps and N disks,
 * disks walk back through 60/N-1 comps each.
 * Each person makes (and each disk undergoes) 2&middot;60/N + N - 1 read/write operations, ie. N=4 people do the work in 33 steps. If N does not divide 60, some fractions add to the above estimation. --CiaPan 17:30, 16 November 2006 (UTC)


 * (Note my different notation; my $$m=N$$.)
 * Collecting: $$n$$ operations (parallel).
 * Circulating and updating $$m$$ computers: $$m(m-1)$$ operations (parallel).
 * Redistributing: $$n-m$$ operations (parallel).
 * So the total cost here is $$2n+m(m-2)$$ disc insertions, which is certainly competitive for $$m\ll n$$. Since it's all parallel, the total time is just $$2\left\lceil\frac nm\right\rceil+m-2$$, as you said (except that you forgot one of your $$-1$$s), which is precisely $$m$$ time steps shorter than John's time whenever $$\left\lceil \frac nm\right\rceil=\left\lceil\frac{n-1}m\right\rceil$$ (which is true for $$n=60$$ except at $$m\in\{1,59,\dots\}$$; at $$m=1$$ all these become the same algorithm). --Tardis 18:29, 16 November 2006 (UTC)


 * Conclusion: CiaPan's answer is always the best of these three for reasonable $$m$$, although oddly enough for $$m>3$$ its total amount of work is larger than John's; it just parallelizes better. (At $$m\in\{1,3\}$$ the amount of work is the same, and at $$m=2$$ it's better because of the extreme efficiency of 1 swap.)  My answer is horrible: it's best (in terms of time) only for $$m>n$$, and better than John's semi-serial algorithm for $$m>n/2$$.  It's better (in terms of work) than CiaPan's handshaking algorithm only for $$m>n$$, and it's the best in work only for $$m=0$$, where the math breaks down!  --Tardis 18:29, 16 November 2006 (UTC)


 * There is another factor which we all missed – it's the read/write time. We count the number of operations only, assuming the work is split among people and does not parallelize any further. But that's wrong, this way we simplify the problem. In the real situation one can handle several read/write operations on different computers at (almost) the same time: put a disk into a drive, start the read/write program and, while the operation in progress, go to another computer to start R/W on it, and so on. Assuming that you can run k R/W actions before the first one completes, you perform k+1 actions in less than twice the single action's time, thus reducing the total time almost by factor ~ (k + 1)/2 . The actual gain depends on the ratio of the disk and software handling time to the actual R/W time. And also on the distance you must walk from one computer to the next one. :) Things get even more complicated when we realize that R/W's take a variable amount of time – appending one file to a CD in the first phase is shorter than appending 10 or 30 files in phase 2, and burning generally needn't be same fast as reading in phase 3... --CiaPan 19:13, 16 November 2006 (UTC)


 * Thanks so much for all the answers! It's going to take a little time to digest... As for CiaPan's last comment, indeed this is true, and is a gain I used to good effect today when putting the initial program on all 100 computers myself. With three CDs and good timing, I managed to almost always have 2 drives spinning while I was loading the third.
 * Thanks a lot! --John
 * (feel free to keep hashing out the finer math points, if you like)

You said that they are "not networked together", but do they all have Internet access ? If so, this may provide a way to pass files directly between them (via FTP, for example). StuRat 01:13, 17 November 2006 (UTC)

One more approach, may be a bit crazy. Force parallelizing as far as you can, and you get a logarithmic algorithm...
 * For two computers you need two CD's: insert them into drives, burn, swap, read — done.
 * For four computers take four disks: write simultaneously one file on each, swap disks in computer pairs (1,2) and (3,4), next swap disk pairs, so that comp.1 and 2 get files from 3 and 4, and comps. 3 and 4 get files 1 and 2. This way after two swaps you spreaded 4 files among 4 computers.
 * In the next step swap four disks from computers 1..4 with (similarly prepared) four disks from 5..8, and so on — after S swaps you get 2S computers updated with 2S files.

That works fine as long as we have the power of 2 computers. For the given case of 60 computers you'll have to add 4 virtual computers. These are just placeholders on the last table, where you put additional 4 CDs. The whole routine consists of 7 steps or phases: ...and so on, until in step 7 you swap disks 1..32 with 33..64 and copy 32 files from each CD 1..28 onto HD 33..60, and 28 files from each CD 33..64 onto HD 1..32. Done! This step contains a CD–to–HD copying only.
 * 1) Put 60 disks into drives and write one file on each (4 spare disks get 'written' no files).
 * 2) Swap every pair of disks (disk n and n+1 for every odd n). Disks 61..64 don't need swapping ;) On each comp append that comp's file to a CD, and copy file from CD to local HD. Now you have 2 files on each comp and each CD.
 * 3) Swap pairs of disks in every four (i.e. swap (1,2) with (3,4), (5,6) with (7,8), ... (57,58) with (59,60)). Disks 60..64 don't need swapping. Add 2 files from each HD to CD, and copy 2 files from CD to HD. Now you have 4 files on each disk.
 * 4) Swap fours of CDs in eights. This is the first step involving four spare disks — CDs 57..60 go to the desk, CDs 61..64 get loaded into computers 57..60. Again copy all four files from each CD to HD and append new four files from HD to CD. Now every CD contains 8 files (except CDs 57..64 which contain 4 real files no 57..60 and 4 ghost files 61..64).

The whole process would take only $$7 = \lceil \log_2 60 \rceil + 1$$ steps, however each step consists of loading, reading, burning, ejecting and swapping of sixty CDs. That totals to as much as 7×60=420 load-copy-burn-eject sequences. If you can hire 8 people for CD juggling, this method offers you a huge saving of the total process time (only 7 parallel phases!) On the other hand, if you try to do it yourself alone, it would take you much longer than simple algorithms discussed above—the whole gain achieved on parallel R/W would be lost on sequential load-run-eject actions. --CiaPan 03:30, 17 November 2006 (UTC)