NICER is a method for estimating extinction from the reddening of the light of background stars, that is, from how much redder the stars appear compared to their intrinsic colours. The reddening is caused by dust between the stars and the observer and thus gives a measure of the amount of interstellar matter along that line of sight. NICER is a way to optimally combine measurements of several colours. The method is described in an article by Lombardi and Alves (2001).

The calculation of an extinction map consists of two steps. First, one estimates the extinction for every star in the field. Second, one calculates an extinction value for each pixel of the map. Unlike the stars, the pixels form a regular grid on the sky, and each pixel value is obtained as a weighted average over the extinction values of the stars near that pixel. Both steps should be easy to parallelise: the extinction estimates of individual stars can be calculated in parallel, and similarly the averages for the individual pixels can be calculated in parallel.
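
Because the items in each step are independent, they can simply be dealt out to a fixed number of workers. The following is a minimal Python sketch of such a strided assignment. The function name `strided_indices` and the small numbers are purely illustrative and not part of the original program; the sketch only mirrors the `for(int s=ss; s<NS; s+=gs)` loop pattern used in the kernels listed further down.

```python
def strided_indices(worker_id, n_workers, n_items):
    """Indices handled by one worker when items are dealt out round-robin.

    worker_id plays the role of get_global_id(0), n_workers the role of
    get_global_size(0), and n_items the role of NS (or NPIX).
    """
    return list(range(worker_id, n_items, n_workers))


# Illustrative numbers only: 4 work items sharing 10 stars.
n_workers, n_stars = 4, 10
assignment = [strided_indices(w, n_workers, n_stars) for w in range(n_workers)]

# Every star is processed exactly once, whatever the number of work items.
covered = sorted(i for chunk in assignment for i in chunk)
assert covered == list(range(n_stars))

for w, chunk in enumerate(assignment):
    print("work item", w, "handles stars", chunk)
```

With this pattern the number of work items does not need to match the number of stars or pixels, which is convenient when the same kernel is run on both a CPU and a GPU.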

We wrote a Python program that uses two OpenCL kernels corresponding to the two steps mentioned above. The main program only reads in the input data, calls the kernels, and writes the results to files. Here is the listing of the first kernel, where the work items together loop over all the stars. The kernel is listed (with minimal comments) just to show what kind of calculations it involves. The kernel calls only one external routine, solve_cramer3, which solves the small linear system, in effect a matrix inversion (not shown).

```
__kernel void nicer(
   __global float *K,     // [NB], extinction relative to Av
   __global float *RCOL,  // average colours in reference area
   __global float *RCOV,  // covariance of colour in ref. area [NC*NC]
   __global float *MAG,   // magnitudes [NS*NB]
   __global float *dMAG,  // magnitude error estimates
   __global float *AV,    // output: estimate Av [NS]
   __global float *dAV    // output: estimated dAv [NS]
   )
{
   // NS, NB, and NC are not kernel arguments; they must be defined when the kernel is built
   int   i, j, ss = get_global_id(0) ;  // first star of this work item
   int   gs = get_global_size(0) ;
   if (ss>=NS) return ;          // id >= NS, the number of ON field stars
   float C[NB*NB] ;              // NB = number of bands
   float av, b[10] ;
   for(int s=ss; s<NS; s+=gs) {  // work items loop together over all NS stars
      // border of the matrix: the constraint on the weights
      for(i=0; i<NC; i++) {      // NC = number of colours
         C[NC*NB+i] = -K[i] ;
         C[i*NB+NC] = -K[i] ;
      }
      for(i=0; i<NB; i++) b[i] = 0.0f ;
      b[NC] = -1.0f ;
      for(i=0; i<NC; i++) {
         for(j=0; j<NC; j++) {
            C[i*NB+j] = 0.0f ;
         }
      }
      // covariance of the colours from the magnitude errors
      for(i=0; i<NC; i++) {
         C[i*NB+i] = dMAG[s*NB+i]*dMAG[s*NB+i] + dMAG[s*NB+i+1]*dMAG[s*NB+i+1] ;
      }
      C[NC+NB*NC] = 0.0f ;
      for(i=0; i<NC-1; i++) {    // adjacent colours share one band
         C[i*NB + i+1]  = -dMAG[s*NB+i+1]*dMAG[s*NB+i+1] ;
         C[(i+1)*NB+i]  = -dMAG[s*NB+i+1]*dMAG[s*NB+i+1] ;
      }
      // add the intrinsic colour scatter of the reference area
      for(i=0; i<NC; i++) {
         for(j=0; j<NC; j++) {
            C[i*NB+j] += RCOV[i*NC+j] ;
         }
      }
      // copies of the covariance elements for the error estimate (indices assume NB=3)
      float C0=C[0], C1=C[1], C4=C[4] ;
      float det ;
      solve_cramer3(C, b, &det) ;   // afterwards b[] is used as the weights of the colours
      av = 0.0f ;
      for(i=0; i<NC; i++) av += ((MAG[s*NB+i]-MAG[s*NB+(i+1)])-RCOL[i]) * b[i] ;
      AV[s]  = av ;
      dAV[s] = sqrt(b[0]*b[0]*C0 + b[1]*b[1]*C4 + 2.0f*b[0]*b[1]*C1) ;
   }
}
```
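
To make the arithmetic easier to follow, here is a plain-Python sketch of what the kernel appears to compute for a single star with NB=3 bands and NC=2 colours. It is not the original host code. The helper `solve_cramer3` is only our guess at what the routine of the same name does (solving the 3×3 system with Cramer's rule), and all numbers in the usage example are assumed, illustrative values: 0.107 and 0.063 are roughly the J-H and H-K colour excesses per unit Av of a commonly used near-infrared extinction law.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_cramer3(m, rhs):
    """Solve m x = rhs for a 3x3 system with Cramer's rule (hypothetical helper)."""
    d = det3(m)
    if d == 0.0:
        raise ValueError("singular matrix")
    x = []
    for col in range(3):
        mc = [row[:] for row in m]          # copy, then replace one column by rhs
        for r in range(3):
            mc[r][col] = rhs[r]
        x.append(det3(mc) / d)
    return x


def nicer_star(mag, dmag, k, rcol, rcov):
    """Av and its error for one star; mag and dmag have 3 bands, k and rcol 2 colours."""
    nc = 2
    # Covariance of the observed colours: intrinsic scatter plus photometric errors.
    cov = [[rcov[i][j] for j in range(nc)] for i in range(nc)]
    for i in range(nc):
        cov[i][i] += dmag[i] ** 2 + dmag[i + 1] ** 2
    for i in range(nc - 1):                 # adjacent colours share one band
        cov[i][i + 1] -= dmag[i + 1] ** 2
        cov[i + 1][i] -= dmag[i + 1] ** 2
    # Bordered system: minimise the variance b^T cov b subject to sum(b*k) = 1.
    m = [cov[0] + [-k[0]],
         cov[1] + [-k[1]],
         [-k[0], -k[1], 0.0]]
    b = solve_cramer3(m, [0.0, 0.0, -1.0])[:nc]   # last element is the multiplier
    av = sum(b[i] * ((mag[i] - mag[i + 1]) - rcol[i]) for i in range(nc))
    var = sum(b[i] * cov[i][j] * b[j] for i in range(nc) for j in range(nc))
    return av, math.sqrt(var), b


# Assumed example: a star reddened by exactly Av = 2 relative to the reference colours.
k = [0.107, 0.063]                          # E(J-H)/Av and E(H-K)/Av, assumed values
rcol = [0.5, 0.2]                           # assumed reference colours J-H and H-K
rcov = [[0.01, 0.0], [0.0, 0.01]]           # assumed intrinsic colour covariance
mag_k = 12.0
mag_h = mag_k + rcol[1] + 2.0 * k[1]
mag_j = mag_h + rcol[0] + 2.0 * k[0]
av, dav, b = nicer_star([mag_j, mag_h, mag_k], [0.03, 0.03, 0.03], k, rcol, rcov)
print(av, dav, b)
```

Because the weights satisfy the constraint that the sum of `b[i]*k[i]` equals one, the noiseless test star returns the input extinction of 2 up to rounding, whatever covariance is assumed; the covariance only determines how the weight is divided between the two colours and how large the error estimate is.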

The second kernel calculates for each pixel a weighted average of the Av values of individual stars. This time the work items (threads) loop over the map pixels. The calculation is actually done twice, with the second pass dropping outliers based on user-defined thresholds (CLIP_UP and CLIP_DOWN).

```
__kernel void smooth(
   __global float *RA,   // coordinates of the stars [NS]
   __global float *DE,   //  -"-
   __global float *A,    // Av of individual stars
   __global float *dA,   // dAv of individual stars
   __global float *SRA,  // coordinates of the pixels [NPIX]
   __global float *SDE,  //  -"-
   __global float *SA,   // Average Av for a pixel
   __global float *dSA   // error estimate
   )
{
   int   j, idd = get_global_id(0) ;  // index of smoothed value, single pixel
   int   gs = get_global_size(0) ;
   if (idd>=NPIX) return ;
   // calculate weighted average with sigma-clipping
   float ra, de, cosy ;               // centre of the beam and cos(declination)
   float w, dx, dy, weight, sum, s2, uw, d2, ave, std, count ;
   float K = 4.0f*log(2.0f)/(FWHM*FWHM) ;  // radian^-2
   const float LIMIT2 = 9.0f*FWHM*FWHM ;   // ignore stars further than sqrt(LIMIT2)
   for(int id=idd; id<NPIX; id+=gs) {
      ra   = SRA[id] ;
      de   = SDE[id] ;
      cosy = cos(de) ;   // we can use the plane approximation for distances
      weight = sum = s2 = uw = count = 0.0f ;
      for(j=0; j<NS; j++) {
         dx = cosy*(RA[j]-ra) ;
         dy = DE[j]-de ;
         d2 = dx*dx + dy*dy ;
         if (d2<LIMIT2) {
            w       = exp(-K*d2) / (dA[j]*dA[j]) ;
            weight += w ;
            sum    += w*A[j] ;     // weighted sum
            uw     += A[j] ;       // unweighted sum
            s2     += A[j]*A[j] ;  // for unweighted standard deviation
            count  += 1.0f ;
         }
      }
      ave = sum/weight ;           // weighted average
      std = sqrt(s2/count - (uw/count)*(uw/count)) ;  // unweighted std
      // Repeat, this time with sigma clipping
      weight = sum = s2 = uw = count = 0.0f ;
      for(j=0; j<NS; j++) {
         dx = cosy*(RA[j]-ra) ;
         dy = DE[j]-de ;
         d2 = dx*dx + dy*dy ;
         if ((d2<LIMIT2) && (A[j]>(ave-CLIP_DOWN*std)) && (A[j]<(ave+CLIP_UP*std))) {
            w       = exp(-K*d2) / (dA[j]*dA[j]) ;
            weight += w ;
            sum    += w*A[j] ;
            uw     += A[j] ;
            count  += 1.0f ;
            s2     += w*w*dA[j]*dA[j] ;  // weighted
         }
      }
      if (count>1.0f) {
         SA[id]  = sum/weight ;        // weighted average
         dSA[id] = sqrt(s2)/weight ;   // weighted
      } else {                         // count <= 1, no meaningful error estimate
         if (weight>0.0f) {
            SA[id] = sum/weight ;  dSA[id] = 999.0f ;
         } else {
            SA[id] = 0.0f ;        dSA[id] = 999.0f ;
         }
      }
   }
}
```
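
The two passes can also be written out in plain Python for a single pixel. This is only a sketch under our own assumptions, not the original code: the function name `smooth_pixel`, its argument names, and the test data are invented for illustration. Coordinates and the FWHM are in radians, and the cosine factor is taken at the declination of the pixel.

```python
import math


def smooth_pixel(ra0, de0, ra, de, av, dav, fwhm, clip_down, clip_up):
    """Sigma-clipped, Gaussian-weighted average of Av around one pixel."""
    k = 4.0 * math.log(2.0) / fwhm ** 2     # Gaussian with the given FWHM
    limit2 = 9.0 * fwhm ** 2                # ignore stars beyond 3 FWHM
    cosy = math.cos(de0)
    near = []                               # (weight, Av, dAv) of nearby stars
    for j in range(len(av)):
        dx = cosy * (ra[j] - ra0)
        dy = de[j] - de0
        d2 = dx * dx + dy * dy
        if d2 < limit2:
            near.append((math.exp(-k * d2) / dav[j] ** 2, av[j], dav[j]))
    if not near:
        return 0.0, 999.0                   # empty pixel, flagged by the error value
    # First pass: weighted average and unweighted standard deviation.
    ave = sum(w * a for w, a, _ in near) / sum(w for w, _, _ in near)
    n = len(near)
    mean = sum(a for _, a, _ in near) / n
    std = math.sqrt(max(sum(a * a for _, a, _ in near) / n - mean * mean, 0.0))
    # Second pass: keep only the stars inside the clipping limits.
    kept = [(w, a, d) for w, a, d in near
            if ave - clip_down * std < a < ave + clip_up * std]
    wsum = sum(w for w, _, _ in kept)
    if wsum <= 0.0:
        return 0.0, 999.0
    result = sum(w * a for w, a, _ in kept) / wsum
    if len(kept) > 1:
        err = math.sqrt(sum(w * w * d * d for w, _, d in kept)) / wsum
    else:
        err = 999.0                         # a single star gives no useful error
    return result, err


# Invented test data: four normal stars and one outlier at the pixel centre,
# plus one star that is too far away to be included at all.
fwhm = 0.001
ra = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
de = [0.5, 0.5, 0.5, 0.5, 0.5, 0.51]
av = [1.0, 1.1, 0.9, 1.0, 10.0, 50.0]
dav = [0.1] * 6
value, error = smooth_pixel(1.0, 0.5, ra, de, av, dav, fwhm, 1.5, 1.5)
print(value, error)
```

In this example the clipping limits have to be fairly tight (1.5 standard deviations): with only five stars in the beam, a single outlier inflates the standard deviation so much that it would survive a 3-sigma cut.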

Interestingly, the two kernels scale differently on CPU and GPU. We computed a series of test cases, starting with a map that was 0.5×0.5 degrees in size and had a pixel size of 1.0 arcmin. The number of stars over the area was 75 000. We scaled the problem by increasing either the number of stars (which affects both kernels) or, alternatively, only the number of pixels (which affects only the second kernel). The resulting run times are shown in the plot below.

On the x-axis, a value of 1 corresponds to the original problem size as described above. Here CPU refers to a run on a CPU with 6 hyperthreaded cores. When the number of stars is increased, the scaling is similar for both CPU and GPU, the latter being faster by a factor of 3-4. On the other hand, when the number of pixels is increased and a larger fraction of the time is spent in the second kernel, the GPU run time barely increases at all. However, here the number of pixels is increased only up to ~60 000, and by that point the GPU is again almost a factor of 4 faster than the 6-core CPU.

The tests were run on a laptop with a six-core i7-8700K CPU running at 3.70 GHz and an Nvidia GTX-1080 GPU.