

## UNIVERSITI TUN HUSSEIN ONN MALAYSIA

# **FINAL EXAMINATION SEMESTER II SESSION 2016/2017**

COURSE NAME : COMPUTER SYSTEMS ENGINEERING

COURSE CODE : BEC41603

PROGRAMME : BEJ

EXAMINATION DATE : JUNE 2017

DURATION : 2 HOURS AND 30 MINUTES

INSTRUCTION : ANSWER ALL QUESTIONS

TERBUKA

THIS ANSWER SCHEME CONSISTS OF SIX (6) PAGES

Q1 (a) The Intel 80x86 processor and the IBM PowerPC processor both have an ADD instruction in their Instruction Set Architecture (ISA). Determine whether the ADD instruction of one processor can be executed by the other processor. Give reasons to support your answer.

(5 marks)

(b) Briefly explain the problem of long communication delay caused by arbitration when a bus architecture is used in a multicore processor.

(4 marks)

(c) Sketch a diagram and explain the working principle of wormhole routing in a Network on Chip (NoC) architecture.

(5 marks)

(d) Determine the network diameter for each of the Network on Chip (NoC) topologies (i), (ii), (iii), and (iv) shown in **Figure Q1**.



#### FIGURE Q1

(6 marks)
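The network diameter is the largest shortest-path hop count between any pair of routers. The topologies of Figure Q1 are not reproduced in this text, so the following is only a minimal sketch on two hypothetical topologies (a 2D mesh and a ring); the names `Graph`, `makeMesh`, `makeRing`, and `networkDiameter` are illustrative assumptions and are not taken from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

// Adjacency list: adj[u] holds the routers directly linked to router u.
using Graph = std::vector<std::vector<int>>;

// Hypothetical helper: builds a rows x cols 2D mesh (not taken from Figure Q1).
Graph makeMesh(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    Graph adj(rows * cols);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            int u = r * cols + c;
            if (c + 1 < cols) {  // link to the right-hand neighbour
                adj[u].push_back(u + 1);
                adj[u + 1].push_back(u);
            }
            if (r + 1 < rows) {  // link to the neighbour below
                adj[u].push_back(u + cols);
                adj[u + cols].push_back(u);
            }
        }
    }
    return adj;
}

// Hypothetical helper: builds an n-node ring (not taken from Figure Q1).
Graph makeRing(int n) {
    assert(n >= 3);
    Graph adj(n);
    for (int u = 0; u < n; ++u) {
        int v = (u + 1) % n;
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    return adj;
}

// Runs one breadth-first search per router and keeps the largest
// hop count found. Assumes the network is connected.
int networkDiameter(const Graph& adj) {
    int diameter = 0;
    for (std::size_t src = 0; src < adj.size(); ++src) {
        std::vector<int> dist(adj.size(), -1);  // -1 marks "not visited"
        std::queue<int> frontier;
        dist[src] = 0;
        frontier.push(static_cast<int>(src));
        while (!frontier.empty()) {
            int u = frontier.front();
            frontier.pop();
            for (int v : adj[u]) {
                if (dist[v] == -1) {
                    dist[v] = dist[u] + 1;
                    frontier.push(v);
                }
            }
        }
        diameter = std::max(diameter, *std::max_element(dist.begin(), dist.end()));
    }
    return diameter;
}
```

For instance, `networkDiameter(makeMesh(3, 3))` gives 4 (corner to opposite corner) and `networkDiameter(makeRing(4))` gives 2.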

Q2 (a) Explain the concept of cache incoherency using an appropriate diagram.

(4 marks)

(b) By referring to **Figure Q2(b)**, relate the memory access mechanism to the types of memory architecture in multicore systems.





#### FIGURE Q2(b)

(6 marks)

(c) Predict the states of memory location 0x0fd4 in cache A, 0x0430 in cache B, and 0x03dc in cache C from **Figure Q2(c)** based on the following conditions, using the MESI coherency protocol.



#### FIGURE Q2(c)

Conditions:

Initially: (1) 0x0fd4 = 00110101, 0x0430 = 00110101, 0x03dc = 00110101. Then, (2) processor A modifies cache A address 0x0fd4 to 11111010, and (3) processor A writes the updated data to the main memory.

(6 marks)

(d) From the C++ code excerpt shown in **Figure Q2(d)**, determine the values of variables X and Y at the end of execution under the sequential consistency model, assuming the initial condition is X = 1 and Y = 1:



#### FIGURE Q2(d)

(4 marks)

Q3 (a) Distinguish between parallel and concurrent programming.

(4 marks)

(b) Assume that there are 10 cores in a computer that can be used to execute an application in parallel and that 95% of the application code is parallelizable. Determine whether it is possible to achieve a speedup of 7. Calculate the number of cores that are needed.

(4 marks)
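The arithmetic in part (b) can be checked numerically. The following is a minimal sketch, assuming Amdahl's law is the intended model (the question does not name one); the function names `amdahlSpeedup` and `minCoresFor` are illustrative assumptions.

```cpp
#include <cassert>

// Amdahl's law: speedup on n cores when a fraction p of the work
// is parallelizable and the remaining (1 - p) stays serial.
double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// Smallest core count whose Amdahl speedup reaches `target`.
// Returns -1 when the target is at or above the asymptotic
// limit 1 / (1 - p), which no finite core count can reach.
int minCoresFor(double p, double target) {
    assert(p >= 0.0 && p <= 1.0 && target >= 1.0);
    if (p < 1.0 && target >= 1.0 / (1.0 - p)) {
        return -1;
    }
    int n = 1;
    while (amdahlSpeedup(p, n) < target) {
        ++n;
    }
    return n;
}
```

Under these assumptions, `amdahlSpeedup(0.95, 10)` evaluates to roughly 6.9, which is below 7, and `minCoresFor(0.95, 7.0)` returns 11.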

(c) From the code shown in **Figure Q3**, select the suitable OpenMP pragmas to be used:

```
#include <stdlib.h>

int main (int argc, char ** argv)
{
    int i, n;
    int h, x, sum;
    int result = 0;

    n = atoi(argv[1]);
    h = 2;
    sum = 0;

    for (i = 0; i <= n; i++) {
        x = h * (i + 5);
        sum += (1 + x * x);
    }

    if (sum == 10) {
        result = sum;
    }

    return h * sum;
}
```

#### FIGURE Q3



(i) to get the number of threads in the parallel region (2 marks)

(ii) to set the number of threads in the parallel region (2 marks)

(d) From the C++ code shown in **Figure Q3(d)**, evaluate the output of the program when the number of threads is 4:

```
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int counter;

    counter = 111;

    #pragma omp parallel default(none) shared(counter)
    {
        #pragma omp atomic
        counter++;

        printf("counter = %d\n", counter);
    }

    return 0;
}
```

#### FIGURE Q3(d)

(8 marks)

Q4 (a) Explain Flynn's Taxonomy for computer architecture.

(8 marks)

(b) Argue whether or not it is possible to achieve superlinear speedup.

(4 marks)

(c) Describe the reasons for implementing synchronization in parallel programming.

(4 marks)

(d) Determine whether a race condition occurs in the C++ code shown in **Figure Q4**. Explain your answer.



```
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        a[i][j] = b[i][j] + c[i][j];
```

#### FIGURE Q4

(4 marks)
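One way to reason about Figure Q4 is to note that each (i, j) iteration writes only its own element `a[i][j]` and only reads `b[i][j]` and `c[i][j]`. The following is a minimal sketch of that observation, assuming each worker keeps its own loop counters (if `j` were shared between threads, the counter itself could be raced on). It does not start threads; it only checks that processing row chunks in a different order yields the same result as one sequential pass. The names `Matrix`, `addRows`, and `chunkOrderIrrelevant` are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Adds rows [rowBegin, rowEnd) of b and c into a. The counters i and j
// are local to each call, so separate workers would not share them.
// Assumes a, b, and c have the same shape.
void addRows(Matrix& a, const Matrix& b, const Matrix& c, int rowBegin, int rowEnd) {
    assert(0 <= rowBegin && rowBegin <= rowEnd &&
           rowEnd <= static_cast<int>(a.size()));
    for (int i = rowBegin; i < rowEnd; ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            a[i][j] = b[i][j] + c[i][j];
}

// Hypothetical check: splits the rows into `chunks` ranges, processes
// them in reverse order, and compares with a single sequential pass.
bool chunkOrderIrrelevant(int n, int chunks) {
    assert(n >= 1 && chunks >= 1);
    Matrix b(n, std::vector<int>(n)), c(n, std::vector<int>(n));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            b[i][j] = i + j;  // arbitrary test data
            c[i][j] = i * j;
        }
    }
    Matrix serial(n, std::vector<int>(n, 0));
    Matrix chunked = serial;
    addRows(serial, b, c, 0, n);

    int step = (n + chunks - 1) / chunks;  // rows per chunk, rounded up
    for (int begin = (chunks - 1) * step; begin >= 0; begin -= step) {
        if (begin >= n) continue;  // chunk lies past the last row
        addRows(chunked, b, c, begin, std::min(begin + step, n));
    }
    return serial == chunked;
}
```

Order independence of this kind is consistent with the iterations having no data dependence on one another, though it is not by itself a proof of thread safety.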

Q5 (a) Differentiate between Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures using appropriate diagrams.

(4 marks)

(b) Explain TWO (2) limitations of GPU architecture.

(2 marks)

(c) Based on the Compute Unified Device Architecture (CUDA) code shown in **Figure Q5**, write comments for the function of each block.

```
int main ()
{
    // Block 1
    size_t size = n * sizeof(float);
    float* d_a;
    cudaMalloc((void**) &d_a, size);
    float* d_b;
    cudaMalloc((void**) &d_b, size);
    float* d_c;
    cudaMalloc((void**) &d_c, size);

    // Block 2
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Block 3
    int threadsPerBlock = 256;
    int threadsPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<threadsPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);

    // Block 4
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```

#### FIGURE Q5

(8 marks)

(d) Determine whether it is possible to compile normal C programs that contain no CUDA code using the NVCC compiler. Explain your answer. If a multicore CPU is used with a GPGPU, do you think it could improve execution time compared with a single-core CPU with a GPGPU? Explain your answer.

- END OF QUESTIONS -

