This paper analyzes important technical aspects that influence the overall performance of applications developed for CUDA-enabled GPUs: the speedup obtained by exploiting the shared and cache memory; the alignment of data in memory; optimal memory access patterns; alignment to the L1 cache line; the trade-off between single and double precision and its effect on memory usage; fusing several kernel functions into a single one; and adapting the code to the available memory bandwidth, taking into account the memory latency and the need to transfer data between the host and the device.
In the private cache organization, most L1 cache misses can be handled by the local L2 cache, so the number of remote on-chip L2 cache accesses is reduced and the interconnection network need not be crossed, which reduces the miss latency.
0 GHz, 20 KB L1 cache, 512 KB L2 cache, 4 MB L3 cache, 8 GB main memory.
Each Power5 core has 64 KB of L1 cache, and the two cores on a chip share a 1.
35 GHz, 256 KB L1 cache, 2 MB L2 cache, 32 GB main memory.
The cache model simulated a direct-mapped cache similar to the L1 cache of our Sun machine.
It includes a 16-KB instruction and a 16-KB data L1 cache (32 KB total) and an external 512-KB half-processor-speed Level 2 cache.
The Athlon also features a 128 KB on-chip L1 cache and a 512 KB standard L2 cache, scalable to 8 MB.
Note, however, that Transmeta's processor has 128 KB of L1 cache, while Intel's chip has just 32 KB.
shows that our approach is better at reducing the L1 cache miss rate than Hashemi et al.
SMT has three key advantages over multiprocessing: flexible use of ILP and TLP, the potential for fast synchronization, and a shared L1 cache.
40 GHz, 64 KB L1 cache and 256 KB L2 cache per core, 30 MB L3 cache per processor, 128 GB main memory, one virtual machine using 40 virtual CPUs and the SAP MaxDB[R] 7.