User:Wwheaton/test8

Draft procedure for source intensity calculation

This is a proposed method for estimating the strength and significance of isolated point sources seen in the BATSE/EBOP Radon imaging skymaps and source tiles.

Basic strategy
The main idea of this analysis is to represent the observed data signal at a point in the sky as the sum of two unknown components: a cosmic point source intensity S, described by the point response function (PRF), and a background intensity B, which is constant across the field of view. We define two regions, one (say G) dominated by the source, and the other (say H) dominated by the background. We use the PRF to estimate the source signal at each point in region G, sum over pixels to obtain the source signal in G, and add the background B, thus obtaining one linear equation for the data D(G) in the two unknowns S & B.

We repeat this procedure for the second region, H, obtaining data D(H), which is mostly dominated by B, and obtain another linear equation in S & B.

We then solve the two equations for S and B.

We also estimate the uncertainties in D(G) and D(H), σg and σh, and propagate those uncertainties through the solutions for S and B to obtain their uncertainties, σS and σB. The source SNR is then estimated to be S/σS.

Assumptions and definitions
I assume that we have a data image file D, evaluated at or near the source position, on a grid of pixels at positions (xi,yj), where Di,j is the data value at pixel (i,j) in the array, returned by the inverse Radon trf algorithm.

I also assume that we have a Point Response Function, P, in the same data image file format ("dat1"), specified on the same pixels as D: P(xi,yj). The PRF is supposed to be the response at (xi,yj) to a "unit flux" source at some position (x0,y0) where the source is located, for the precession cycle or data interval in question. The precise definition of a "unit source" remains unspecified & TBD; for now I assume the PRF is generated by the current Inverse Radon algorithm, by putting in 1's for the data when the source is not occulted, and 0's otherwise.

Initially I assume that the Radon data image D, and the PRF data P, have both been generated for a tile or window centered at (x0,y0). The more general situation, for new source candidates located off-center in a skymap tile, will be addressed later.

I suppose that the data in the image are the sum of three principal effects:


 * 1) Flux from the source at (x0,y0), centered in the tile by assumption, and smeared out into the PRF which is peaked and non-circular, but has a roughly elliptical core. The intensity, S, of the source is an unknown quantity to be estimated.
 * 2) An image background intensity B, also an unknown quantity to be estimated, which registers as a constant increment in each pixel of the image tile. (Note that B is not demanded to be .GE. 0 here, due to the likely presence of background subtraction errors at earlier stages of the analysis.)
 * 3) Flux from other cosmic sources at nearby positions, known and unknown. In the current  procedure these sources are essentially ignored, or if numerous and individually not too strong, their combined effect may be absorbed into the constant background B.

The procedure will be inaccurate if there are strong sources in the response of the window that have not been accounted for, or if the background B has strong spatial variations of a diffuse nature.

Assuming the above, then the model for the expected Radon data Di,j, observed at pixel (xi,yj) in the image, is:

(1)    Di,j = Pi,j*S  +  B

Linear equations
Suppose then that we sum Eqn 1 over a set G of pixels, defining D(G) = Σ(i,j ε G){Di,j }. Then:

(2)    D(G)  =  Σ(i,jεG) {Pi,j*S} + Σ(i,jεG) {1*B} =  Σ(i,jεG) {Pi,j}*S + Σ(i,jεG) {1}*B =  Σ(i,jεG) {Pi,j}*S    +  N(G)*B

where 1 is the pixel response to the constant background, and Σ(i,jεG){1} = N(G) = Ng is just the number of pixels in G.  If G is chosen to be the peak region around the assumed source position, D(G) will be dominated by the source intensity S.

We now pick the second region, H, containing N(H) = Nh pixels, far enough away from the peak to be dominated by B, but close enough to a good representative of the actual background at the source position (x0,y0), considering the possibility of spatial variations in the background that our model does not include. We obtain in this way:

(3)    D(H)  = Σ(i,jεH) {Pi,j*S} + Σ(i,jεH) {1*B} = Σ(i,jεH) {Pi,j}*S +  Nh*B

If we define Ag = Σ(i,jεG) {Pi,j} and Ah = Σ(i,jεH {Pi,j}, these equations become simply

(4)   D(G) = Ag*S + Ng*B

(5)   D(H) = Ah*S + Nh*B,

where D(G) and D(H) are the data. and Ag, Ah, Ng, and Nh are just numbers, which we compute directly.

Solution
Because of the way we have chosen the regions G and H, it is obvious that the two equations, (4) and (5), will be independent and non-singular. If we set D(G) = Dg and D(H) = Dh, the solution is

(6) S† = (Dg*Nh - Dh*Ng)/Δ

(7) B† = (Ag*Dh - Ah*Dg)/Δ,

where S† and B† are our estimates for S and B, and the determinant Δ≠0 is

(8) Δ = Ag*Nh - Ah*Ng,

using Cramer's Rule.

Note that up to this point we have made no statistical assumptions at all, only that the physics of the model is linear as described by the matrix elements Ag, Ah, Nh, and Ng.

Uncertainties
We are now able to compute the uncertainties, σS and σB in the estimates S† and B†, if we know the uncertainties, σg and σh, in Dg and Dh. For this we suppose the statistical errors in the data for G & H have variances σg2 and σh2, respectively. (Note that we make no assumption that the data are normally distributed about the model.) We also make use of the following rules for linear combination of the variance:

If u and v are independent random variables, Var[u] is the variance of u, and a is a constant, then


 * 1) Var[u+v] = Var[u] + Var[v]
 * 2) Var[au] = a2Var[u].

From (6) and (7) one can see that S† and B† are linear combinations of the data, Dg and Dh.

Then their uncertainties are

(9) σS2 = Var[(Dg*Nh - Dh*Ng)/Δ] = (Nh/Δ)2 σg2 + (Ng/Δ)2 σh2

(10) σB2 = Var[(Ag*Dh - Ah*Dg)/Δ] = (Ag/Δ)2 σh2 + (Ah/Δ)2 σg2

Here we make our only statistical assumptions, that the data are independent random variables, with finite variances, so that the properties (1) & (2) above obtain. In particular, we do not assume the data are Gaussian distributed. The consequences of that will be discussed in the next section.

Statistical errors in the data, σg and σh
The basic concept is simply to look at the scatter of the image data around the model for the data, implied by the solution for the source intensity S and the background B. This has the advantage that that it needs no dubious analytic form for the PRF, but simply uses the PRF data we already have. If we plug S† and B† into Eqn (1) for any pixel, we get the model value for that pixel.

So we compute the residuals, (Di,j -( Pi,j*S† + B†)) for every pixel (i,j) in the two regions, G & H. We square those numbers, and sum, getting the sum-of-squares for both G & H.

The variance of Dg & Dh (or mean square error), is what we need for Eqns (9) & (10), which is roughly

(11) Var[Dg] = Σ(i,jεG){(Di,j -( Pi,j*S† + B†))2}/Ng

for G, and

(12) Var[Dh] = Σ(i,jεH){(Di,j -( Pi,j*S† + B†))2}/Nh

for H.

There is a small correction because we have effectively fit for two parameters, S & B, so the number of degrees of freedom is slightly reduced. I am uncertain what that correction should be, but tentatively replacing Ng and Nh in (11) and (12) with (Ng-2) and (Nh-2) should be conservative. At this moment I suspect the variances in (11) and (12) should actually be corrected by the overall factor

(13) f = (Ng + Nh)/(Ng + Nh - 2)

Then the statistical errors in the data, σg and σh, would be given by:

(14) σg2 = f*Σ(i,jεG){(Di,j -( Pi,j*S† + B†))2}/Ng

and

(15) σh2 = f*Σ(i,jεH){(Di,j -( Pi,j*S† + B†))2}/Nh,

respectively.

Thus we have the expected RMS uncertainties in the answers S† and B†. It is not guaranteed that the estimated answers will be Gaussian distributed. Therefore these RMS uncertainties are not the Gaussian sigmas we know and love. But because S† and B† are sums over many random variables (ultimately one for each pixel in G & H), the Central Limit Theorem gives us fairly strong assurance that they will be "nearly Gaussian". But the variances of the estimates should be exact, regardless of that, subject of course to the assumptions about the correctness of the model (a big issue), etc.

The main practical difference I think comes about due to the possibility that the tails of the real data distributions go out far beyond the Gaussian ones. Serious errors in the models and unusual events affecting the samples are the usual suspects. Meanwhile we should be careful in our publications and presentations not to confuse people by referring to a (Gaussian) "sigma", but talk instead in terms of the more general "variance" and "standard deviation" or RMS error.

The regions G and H
We suppose the region G will normally be a small, roughly elliptical region centered on the source, its aspect ratio and position angle determined by the core of the PRF. We pick a fraction, g, with a value somewhat less than 1.0, and set a threshold value of g*P0, where P0 = P(x0,y0) is the maximum value of the PRF. We define region G to include all pixels for which Pi,j is greater than this threshold. For the set H, we define an annular (roughly elliptical) region where h1 < P < h2, with h1 chosen large enough to exclude regions distant from the source (which might be affected by other sources or diffuse background variations), and h2 < g, but large enough to catch a significant set of data pixels in D(H). Perhaps h1 = 0.1, h2 = 0.3, and g = 0.75 would be reasonable values to try. It is important to realize that the optimal g, h1, and h2 will depend to some degree on the spatial structure of the background and source emission.

The reason for not picking h1 = 0.0, say, is that if we did that, we would be taking all the background clear out to the edge of the tile. But there is no guarantee that the background is truly constant physically in the image data. There are often stripes due to distant sources, and there can be weak sources in the tile that increase the variance of the pixels in the far region beyond the gain in statistical accuracy one gets by using more data. A similar situation would arise for the core region G if there are other unmodeled sources close enough to our target source to contaminate the core flux substantially.

But the uncertainty σg in the source flux, reflecting as it does both the statistical and systematic influences on the background and source regions, should help us to monitor these effects, and set the thresholds intelligently. In confused regions, we may want to adjust them to get the best SNR. (This may be tedious at times, but if such happens often, we will probably be grateful for the underlying scientific rewards.) When we have had more experience and confidence with the method, we may even want to automate the process of adjusting the thresholds to achieve the highest possible SNR, considering both statistics and systematics.

Notice that it is not generally necessary, in picking the pixels in the regions G and H, to determine the analytic elliptical form of the PRF; one may simply compare the PRF pixel values Pi,j to g, h1, and h2.

If there is severe confusion due to nearby sources, or a strong background that substantially affects individual pixel values, computing the analytic ellipse might be advisable, and one might need even to consider the ellipses associated with confusing sources. This could lead one to consider defining an additional region centered on the core of the confusing source, in which case one would obtain three equations, in three unknowns, for the two source intensities and the background. The extension, even to additional sources, should be straightforward.

Source positions
The treatment so far has made no use of the analytic approximation to the PRF core as a bivariate normal distribution. In order to find the best positions of previously unknown sources with unknown positions, I propose fitting the data near the core with

D(x,y) = A*exp(-0.5*r2),

where

r2 = a*x2 + b*x*y + c*y2 + d*x + e*y + f,

a quadratic form in x & y, found by a LLSQ fit to the log of the data, ln( D(x,y)). Given the coefficients, a, b,...,f, we can solve for the minimum of the quadratic, which is the peak of the Gaussian, and should be a good fit to the source position. The exact size of the core region to be used is again something that will depend to some degree on the structure (confusing point sources, background variations) in the surrounding field.

Combining precession cycles
The proper way to do this is now obvious. If we believe the single-cycle data are consistent with a single source flux, we simply write down all the linear equations for the different cycles, with one unknown for the constant flux, and probably one for each cycle background. Then for N cycles, we get 2*N equations in N+1 unknowns. If it looks like the backgrounds are really constant, we can reduce the number of bkg unknowns accordingly.