Latency hiding techniques are increasingly used to minimize the effect of long memory latency in multiprocessors, but their use requires additional network bandwidth. The network organization and its design parameters alone can significantly affect performance; with latency hiding, system performance also depends on how well the interconnection network supports such techniques and on their interaction with the network organization. This paper investigates these issues for prefetching and weak consistency in a 128-processor shared-memory system with a 2-D torus, a multistage, or a single-stage network. The performance impact of network organization and link bandwidth is shown, with and without latency hiding techniques, along with the effect of caching and of limiting the number of outstanding memory requests. The multistage network is the most robust and has the best performance under all conditions. The single-stage network comes close in performance when sufficient channel bandwidth is available. The torus network ranks last when channel bandwidth is high, but can exceed single-stage performance when bandwidth is low. The relative performance of the three networks with prefetching remains similar, with the torus gaining the most. Benchmark execution time decreases by as much as 25% with prefetching; further gains depend on reducing the effect of write traffic. Finally, an optimal number of outstanding requests is shown to exist, but its value is program-dependent.