This order reflects data locality: a value is expected to be requested from the L1 cache far more often than from the L2 cache.
Since the L1 cache is virtually indexed by the page offset, there is only one fixed position in the cache where the sensitive data can reside.
These flaws also allow access to data residing in the L1 cache, but they are somewhat more serious.
This paper analyzes important technical aspects that can influence the overall performance of an application developed for CUDA-enabled GPUs: the speedup gained by using shared and cache memory; the alignment of data in memory; optimal memory access patterns; alignment to the L1 cache line; the balance between single and double precision and its effect on memory usage; merging several kernel functions into a single one; and adapting the code to the available memory bandwidth in accordance with the memory latency and the need to transfer data between the host and the device.
In Figure 1 we show the NVIDIA Fermi architecture, where each SM (vertical rectangular strip) has scheduler and dispatch units (orange portion), execution units (green portion), and a configurable memory of 64 KB (light blue portions), which consists of a register file, an internal shared memory, and an L1 cache. This memory is configurable as 16 KB (or 48 KB) for shared memory and 48 KB (or 16 KB) for the L1 cache.
This work keeps the L1 cache size small so that more accesses reach main memory, which yields a more accurate picture of memory-access and memory-bus behavior.
* An e500 System-on-Chip (SoC), which integrates an L1 cache with 32 KB for instructions and 32 KB for data, and a 512 KB L2 cache.
The server was configured as follows: Fujitsu PRIMEQUEST 1800E, 8 processors / 64 cores / 128 threads, Intel Xeon Processor X7560, 2.26 GHz, 64 KB L1 cache and 256 KB L2 cache per core, 24 MB L3 cache per processor, 512 GB main memory.
The device has an L1 cache with 32 KB for instructions and 32 KB for data per core with parity protection, an L2 cache of 1 MB per core with optional ECC, and an MPX bus of up to 500 MHz.
Intel Deep Power Down Technology (C6) is the lowest power state of the CPU, in which the core clock, PLL, L1 cache, and L2 cache are all switched off.
Configuration of the central server was as follows: HP Integrity rx6600, 4 processors / 8 cores / 16 threads, Dual-Core Intel Itanium 2 9050, 1.6 GHz, 32 KB(I) + 32 KB(D) L1 cache, 2 MB(I) + 512 KB(D) L2 cache, 24 MB L3 cache, and 48 GB main memory.