

# Norwegian University of Science and Technology

skal ha flervalgskjema □

Department of Computer & Information Science (IDI)

## **Examination paper for TDT4200 Parallel Programming**

| Academic contact during examination: Thoma    | s L. Falch          |                  |
|-----------------------------------------------|---------------------|------------------|
| Phone: +47 472 43 175                         |                     |                  |
| Examination date: TUESDAY Nov 29, 2016        |                     |                  |
| Examination time (from-to): 09:00 – 13:00     |                     |                  |
| Permitted examination support material: C – E | itt stemplet ark me | d håndskrevne    |
| notater (leveres inn sammen med besvarelsen   | ) samt vedlagte su  | pplerende        |
| informasjonsark.                              |                     |                  |
| ONE STAMPED SHEET WITH HANDWRITTEN I          | NOTES (TO BE TU     | RNED IN WITH     |
| THIS EXAM) AND ENCLOSED SUPPPLEMENT           | ARY INFORMATION     | <b>I</b>         |
| Other information:                            |                     |                  |
| Alle svar innføres, som angitt, på v          | edlagte ark /       |                  |
| All answers need to be submitted on the       | e attached shee     | ts as indicated. |
| Language: ENGLISH                             |                     |                  |
| Number of pages (front page excluded): 15     |                     | Created by:      |
| Number of pages enclosed: 15                  | 25/11-16            | In Alte          |
|                                               | Date                | Signature        |
| Informasjon om trykking av eksamensoppgave    |                     | -<br>-           |
| Originalen er:                                |                     | Checked by:      |
| 1-sidig X 2-sidig □                           |                     | _                |
| sort/hvit <b>X</b> farger □                   | 25/11-16            | But Lolla        |

Date

Signature

| Page 1 of 15 | TDT4200 FINAL EXAM 2016 | Candidate no.: |
|--------------|-------------------------|----------------|
|--------------|-------------------------|----------------|

### 1. WARM-UPS – TRUE/FALSE (20%)

Indicate the correct answer with a clear "X" in the appropriate column. It is NOT necessary to justify your answers on TRUE/FALSE questions, unless requested.

NOTE: You will get a -1% negative score for each wrong answer and 0 for not answering or answering both TRUE and FALSE.

| No. | Question                                                                                        | TRUE | FALSE |
|-----|-------------------------------------------------------------------------------------------------|------|-------|
| a)  | An efficient parallel implementation of a serial program is found                               |      |       |
|     | by finding an efficient implementation of each step of the serial program                       |      |       |
| b)  | Separation of memory and CPU is often called the von Neumann                                    |      |       |
|     | bottleneck                                                                                      |      |       |
| c)  | Caching may provide superlinear speed-ups                                                       |      |       |
| d)  | The most common cache eviction policy is Least Recently Used                                    |      |       |
| e)  | Flynn's taxonomy includes SIMD and MIMD                                                         |      |       |
| f)  | Multi-threading is considered a coarse-grained parallelism                                      |      |       |
| g)  | Strongly scalable programs always improve performance with no. of cores/processors.             |      |       |
| h)  | Weakly scalable programs follow Gustavson's Law                                                 |      |       |
| i)  | Elster's Bit-Reversal algorithm is O (N log N)                                                  |      |       |
| j)  | Snooping cache coherence takes advantage of bus architectures                                   |      |       |
| k)  | One-sided communication is different from remote memory access                                  |      |       |
| 1)  | Amdahl's law says that if a fraction r of a serial program remains                              |      |       |
|     | serial, then we cannot get a better speed-up than 1/r. This means                               |      |       |
|     | we MUST resort to task parallelism in order to be able to scale further.                        |      |       |
| m)  | MPI_Broadcast does not need tags                                                                |      |       |
| n)  | MPI_Reduce may use the same buffer for both input and output                                    |      |       |
| 0)  | CUDA is less verbose than OpenCL, but OpenCL is offered on a wider selection of devices.        |      |       |
| p)  | CUDA warps may be synchronized                                                                  |      |       |
| q)  | CUDA uses sychthreads() to synchronize across SMs                                               |      |       |
| r)  | The OpenCL-equivalent of a CUDA warp is called team                                             |      |       |
| s)  | Coalescing memory on GPUs improves efficiency by using strided memory locations                 |      |       |
| t)  | Some recent CUDA devices give you the option to trade-off amount of shared memory with caching. |      |       |

- 3. Multiple Choice (one or more may be correct) and Short answer (14%)
- 3a) CUDA: Consider the following code snippets, which are part of CUDA kernels, where n threads each read n values from an  $n \times n$  array. Which will be faster? (Circle answer)

- 3b) Circle the following schemes that protect access to critical sections:
  - I) Semaphores
- III) Do-while loops
- II) Mutex Locks
- IV) Busy-waiting
- 3c) What is the main advantage of read-write locks?
- 3d) Which of the following schemes are used to reduce branching?
  - I) Hoisting most frequent case to top or out in separate if-statement
  - II) Removing branches with labels
  - III) Removing branches with masks
  - IV) Memory coalescing (GPUs)
  - V) Memory striding
  - VI) All of the above

3e) For each of the following code snippets, determine if the access pattern exhibits spatial locality, temporal locality, both, or neither. Circle accordingly. (a, b and c are in all cases appropriately sized int arrays):

```
I)
      for(int i = 0; i < 10000; i++) {
                                               Spatial / Temporal
        a[i] = b[i] + c[i];
      }
II)
     for(int i = 0; i < 100; i++){
                                              Spatial / Temporal
       for (int j = 0; j < 100; j++) {
             a[j] = b[j] + c[j];
       }
     }
      for (int i = 0; i < 100; i++) {
                                               Spatial/Temporal
        for(j = 0; j < 10; j++){
             a[j*1000] += b[j * 2000] + c[j * 3000];
        }
      }
```

3f) What is thread divergence (hint: GPU)?

3g) Which of the following code snippets will (if executed on a GPU) cause branch divergence? (Circle YES if they do, NO otherwise)

```
I) if (blockIdx.x > 16) {
                foo();
   }
   else{
                                                     YES
                                                           / NO
               bar();
   }
II) if(threadIdx.x > 16) {
                                                     YES / NO
     foo();
    }
    else{
           bar();
    }
                                                     YES / NO
c) for(int i = 0; i < threadIdx.x; i++) {</pre>
      foo();
  }
```

#### 4. More on MPI (6%)

4 a) Consider the following code, where each rank sends a value to two other ranks, and receives a value from two other ranks. The two ranks that each rank should communicate with, are stored in the variables n1 and n2, and are arbitrary, and specified by the user at runtime.

```
MPI_Send(data_send1, N, MPI_INT, n1, 0, MPI_COMM_WORLD);

MPI_Recv(data_rcv1, N, MPI_INT, n1, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);

MPI_Send(data_send2, N, MPI_INT, n2, 0, MPI_COMM_WORLD);

MPI_Recv(data_rcv2, N, MPI_INT, n2, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
```

Can this code potentially cause a deadlock? Circle YES / NO

4 b) If you circled YES to the question above, rewrite the code so that it will never deadlock, using the following MPI functions. Not all the listed functions may be required/are relevant.

Do not use any other MPI functions, but the ones listed below.

```
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Wait ( MPI_Request *request, MPI_Status *status)
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int
tag, MPI_Comm comm, MPI_Request *request )
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int
tag, MPI_Comm comm, MPI_Request *request )
int MPI_Issend( void *buf, int count, MPI_Datatype datatype, int dest, int
tag, MPI_Comm comm, MPI_Request *request )
int MPI_Sendrecv( void *sendbuf, int sendcount, MPI_Datatype sendtype, int
dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int
source, int recvtag, MPI_Comm comm, MPI_Status *status )
```

#### Page 6 of 15 TDT4200 FINAL EXAM 2016 Candidate no.: \_

4b) contin: Repeating code to re-write and leaving space for re-write:

```
MPI_Send(data_send1, N, MPI_INT, n1, 0, MPI_COMM_WORLD);
MPI_Recv(data_rcv1, N, MPI_INT, n1, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
MPI_Send(data_send2, N, MPI_INT, n2, 0, MPI_COMM_WORLD);
MPI_Recv(data_rcv2, N, MPI_INT, n2, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
```

and leaving space for re-write using only the functions listed on the previous page:

#### 4 c) Consider the following lines, extracted from an MPI program:

Complete the code above by filling in the arguments to MPI\_Type\_vector above, so that the MPI\_Send will send all the elements set to 10 by the for loop. The definition of MPI\_Type\_vector is:

```
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype
old_type, MPI_Datatype *newtype_p);
```

#### 5. OpenMP (5%)

Consider the following code, which computes the sum of the elements in an array:

```
int sum = 0;
for(int i = 0; i < N; i++) {
    sum += array[i];
}</pre>
```

Show how the code can be parallelized using OpenMP. The resulting code should achieve good parallel speedup. The following OpenMP directives and functions can be useful: structured-block

clause:

#### Useful OpenMP directives and functions:

```
#pragma omp parallel [clause[ [, ]clause] ...]
    if(scalar-expression)
    num_threads(integer-expression)
    default(shared | none)
    private(list)
    firstprivate(list)
    shared(list)
```

```
copyin(list)
    reduction (reduction-identifier: list)
    proc bind(master | close | spread)
#pragma omp for [clause[ [, ]clause] ...]
     for-loops
clause:
    private(list)
     firstprivate(list)
     lastprivate(list)
     reduction(reduction-identifier: list)
     schedule(kind[, chunk size])
     collapse(n)
     ordered
     nowait
#pragma omp parallel for [clause[ [, ]clause] ...]
     for-loop
clause: Any accepted by the parallel or for directives,
     except the nowait clause, with identical meanings and
restrictions.
#pragma omp single [clause[ [, ]clause] ...]
     structured-block
clause:
     private(list)
     firstprivate(list)
     copyprivate(list)
     nowait
#pragma omp critical [(name)]
   structured-block
#pragma omp atomic [read | write | update | capture]
expression-stmt
int omp_get num_threads(void);
int omp get thread num(void);
```

| Page 9 of 15 | TDT4200 FINAL EXAM 2016 | Candidate no.: |
|--------------|-------------------------|----------------|
|--------------|-------------------------|----------------|

5 CONTINUED: Show below how the code for summing the elements of an array shown earlier can be parallelized using OpenMP. The resulting code should achieve good parallel speedup: (See previous page for useful OpenMP functions)

#### 6. Pthreads (6%)

For each of the following code snippets, determine if it will cause a deadlock if executed by more than 3 threads in parallel. Thread id is an integer storing the id of the thread which is a unique number between 1 and the total number of threads.

a) The following (circle) MAY / MAY NOT deadlock with >3 threads:

```
if(thread_id % 2 == 0) {
    pthread_mutex_lock(&mutex_a);
    pthread_mutex_lock(&mutex_b);
    a++;
    b++;
    pthread_mutex_unlock(&mutex_b);
    pthread_mutex_unlock(&mutex_a);
}
else{
    pthread_mutex_lock(&mutex_b);
    pthread_mutex_lock(&mutex_a);
    a++;
    b++;
    pthread_mutex_unlock(&mutex_a);
    pthread_mutex_unlock(&mutex_a);
    pthread_mutex_unlock(&mutex_b);
}
```

6 b) The following (circle) MAY / MAY NOT deadlock with >3 threads:

```
if(thread_id % 2 == 0) {
    pthread_mutex_lock(&mutex_a);
    pthread_mutex_lock(&mutex_b);
    a++;
    b++;
    pthread_mutex_unlock(&mutex_a);
    pthread_mutex_unlock(&mutex_b);
}
else{
    pthread_mutex_lock(&mutex_a);
    pthread_mutex_lock(&mutex_b);
    a++;
    b++;
    pthread_mutex_unlock(&mutex_a);
    pthread_mutex_unlock(&mutex_b);
}
```

6c) The following (circle) MAY / MAY NOT deadlock with >3 threads:

```
if(thread_id % 3 == 0) {
    pthread_mutex_lock(&mutex_a);
    pthread_mutex_lock(&mutex_c);
    a++;
    c++;
    pthread_mutex_unlock(&mutex_c);
    pthread_mutex_unlock(&mutex_a);
```

```
}
else if (thread id % 3 == 1) {
      pthread mutex lock(&mutex b);
      pthread mutex lock(&mutex a);
      a++;
      b++;
      pthread_mutex_unlock(&mutex_a);
      pthread_mutex_unlock(&mutex_b);
else{
      pthread_mutex_lock(&mutex_c);
      pthread mutex lock(&mutex b);
      C++;
      b++;
      pthread mutex unlock(&mutex b);
      pthread mutex unlock(&mutex c);
}
```

#### 7. OpenCL – 4%

#### Consider the following OpenCL kernel:

```
__kernel__ work()
int x = get_global_id(0);
int y = get_global_id(1);

if(x % 8 > 4) {
funcl();
}
else{
func2();
}
}
```

#### Which is launched like this:

```
size_t global_work_size[2];
global_work_size[0] = 1024; global_work_size[1] = 1024;
size_t local_work_size[2];
local_work_size[0] = m; local_work_size[1] = n;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
global work size, local work size, 0, NULL, NULL);
```

Which of the following values should be used for n and m to minimize divergence when the code is executed on an NVIDIA GPU? (Circle the correct answer.)

I) 
$$n = 4$$
,  $m = 16$  II)  $n = 32$ ,  $m = 4$  III)  $n = 8$ ,  $m = 8$  IV)  $n = 4$ ,  $m = 32$ 

#### 8. CUDA - 10%

In this problem we will look at a CUDA program which computes the average of neighbor elements in an array. In particular, it should perform the following computation on the GPU:

```
out[0] = 0;
out[n-1] = 0;
for(int i = 1; I < n-1; i++) {
out[i] = (in[i] + in[i+1] + in[i-1]) / 3.0;
}
```

Thus, if the input is the array [2,3,1,5] the output should be [0,2,3,0], since i.e. (3+1+5)/3 = 3. The elements at the start and end of the output array are set to 0.

In the code given below, the size of the arrays, N, is an arbitrary number larger than 64. The thread block size (i.e. the number of threads in a thread block) is hard coded to be 64. Shared memory is used in an attempt to improve performance, in a similar manner to the way it was done in assignment 6.

The code contains at least 1, and no more than 5 bugs (i.e. 1, 2, 3, 4, or 5 bugs). None of the bugs are syntax errors, i.e. the code will compile without problem, but will crash or produce incorrect results when executed. Ignore problems/bugs related to poor performance, or not freeing memory. The bugs can be in both the device code (i.e. in the kernel) or in the host code, launching the kernel.

#### Each bug can be fixed by either:

- Changing a single line.
- Removing a single line, and adding a new line somewhere else.

The code is printed with double line spacing. Correct the bugs by striking out the incorrect lines, and add the corrected version below it, or by striking out a line and adding a new line somewhere else.

Note that in three cases, for the two cudaMalloc calls, and for the final assignment to the out array, one line of code is printed across two lines, but is still regarded as a single line when counting the number of bugs.

```
__global__ void average(float* in, float* out, int N) {
   int index = threadIdx.x + blockDim.x*blockIdx.x;
   int lindex = threadIdx.x;
```

```
shared float shared_array[66];
if(index < N){</pre>
     shared_array[index] = in[index];
     if(lindex == 0 && index != 0){
          shared array[0] = in[index-1];
     }
     if(lindex == 63 && index != N-1){
          shared array[65] = in[index + 1];
     }
     syncthreads();
}
if(index == 0 \mid | index == N-1) {
     out[index] = 0;
}
else if(index < N){</pre>
     out[index] = (shared array[lindex-1] +
     shared_array[lindex] + shared_array[lindex + 1])/3.0f;
```

```
}
}// Continues on next page
float* func(int N) {
     float* in host = (float*)malloc(sizeof(float) * N);
     float* out host = (float*)malloc(sizeof(float)* N);
     float* in dev;
     float* out dev;
     cudaMalloc((void**)&in dev, sizeof(float)*N);
     cudaMalloc((void**)&out dev, sizeof(float)*N);
     fill input(in host);
     cudaMemcpy(in_dev, in_host, sizeof(float)*N,
     cudaMemcpyHostToDevice);
     int nBlocks = N/64;
     if( N % 64 != 0){
          N++;
     }
     average<<<nBlocks, 64>>>(in dev, out dev, N);
     cudaMemcpy(out host, out dev, sizeof(float)*N,
     cudaMemcpyDeviceToHost);
     return out_host;
}
```

EXTRA PAGE, IF NEEDED. PLEASE INDICATE CLEARLY WHICH PROBLEM you are answering here, if any. May also be used as scratch paper.