This order reflects data locality, since a value is expected to be requested from the L1 cache more often than from the L2 cache.
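As a rough illustration of that access order, the following C++ sketch (not taken from the cited work; the line size, and the absence of capacity limits and eviction, are simplifying assumptions) looks a block up in L1 first, falls back to L2, and only then counts a miss to main memory:

```cpp
// Simplified two-level lookup mirroring the access order described above:
// L1 first, then L2, then main memory. Capacity and eviction are omitted
// for brevity, so this models only the order of lookups, not hit rates of
// a real cache.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

struct TwoLevelCache {
    static constexpr uint64_t kLineBytes = 64;   // assumed line size

    std::unordered_set<uint64_t> l1;   // tags currently resident in L1
    std::unordered_set<uint64_t> l2;   // tags currently resident in L2
    uint64_t l1_hits = 0, l2_hits = 0, misses = 0;

    void access(uint64_t addr) {
        const uint64_t tag = addr / kLineBytes;
        if (l1.count(tag)) { ++l1_hits; return; }            // most accesses stop here
        if (l2.count(tag)) { ++l2_hits; l1.insert(tag); return; }
        ++misses;                                            // fetched from memory
        l2.insert(tag);
        l1.insert(tag);
    }
};

int main() {
    TwoLevelCache c;
    for (int pass = 0; pass < 2; ++pass)
        for (uint64_t addr = 0; addr < 4096; addr += 8)
            c.access(addr);                                  // second pass hits in L1
    std::printf("L1 hits %llu, L2 hits %llu, misses %llu\n",
                (unsigned long long)c.l1_hits,
                (unsigned long long)c.l2_hits,
                (unsigned long long)c.misses);
}
```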
Since the L1 cache is virtually indexed by the page offset, there is only one fixed position in the cache for the sensitive data.
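A small C++ sketch of the indexing argument (the 4 KiB page size, 64-byte lines, and 64 sets are assumed values typical of a 32 KiB, 8-way L1, not parameters taken from the cited work): because all index bits fall inside the page offset, any datum at a fixed offset within its page always maps to the same L1 set.

```cpp
// For a virtually indexed L1 whose index bits lie entirely within the
// 4 KiB page offset (e.g. 32 KiB, 8-way, 64-byte lines -> 64 sets), the
// set is determined by the page offset alone.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kPageSize  = 4096;  // assumed page size
constexpr uint64_t kLineBytes = 64;    // assumed line size
constexpr uint64_t kNumSets   = 64;    // assumed: 32 KiB / 8 ways / 64 B

uint64_t l1_set(uint64_t vaddr) {
    const uint64_t page_offset = vaddr % kPageSize;     // bits 0..11
    return (page_offset / kLineBytes) % kNumSets;       // bits 6..11 select the set
}

int main() {
    // Two addresses with the same page offset land in the same L1 set,
    // regardless of which page they live in.
    std::printf("set A = %llu, set B = %llu\n",
                (unsigned long long)l1_set(0x12345a40),
                (unsigned long long)l1_set(0x98765a40));
}
```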
These flaws also allow access to data residing in the L1 cache, but they are somewhat more serious.
This paper analyzes important technical aspects that can influence the overall performance of an application developed for CUDA-enabled GPUs: the speedup gained by using shared and cache memory; the alignment of data in memory; optimal memory access patterns; alignment to the L1 cache line; the balance between single and double precision and its effect on memory usage; merging several kernel functions into a single one; and adapting the code to the available memory bandwidth, taking into account memory latency and the need to transfer data between the host and the device.
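As a hedged, host-side analogue of the cache-line alignment aspect listed above (this is not the paper's CUDA code; the 64-byte line size is an assumption), the following C++ sketch pads and aligns a structure so that each element occupies exactly one L1 cache line and never straddles two lines:

```cpp
// Align a hot structure to the assumed L1 line size so that accesses to
// consecutive elements neither split a line nor share one with a neighbor.
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kCacheLine = 64;   // assumed L1 line size

struct alignas(kCacheLine) Sample {
    float values[kCacheLine / sizeof(float)];   // exactly one line per element
};

int main() {
    static_assert(sizeof(Sample) == kCacheLine, "one element per cache line");
    Sample samples[4];
    std::printf("base address mod %zu = %zu\n", kCacheLine,
                static_cast<std::size_t>(
                    reinterpret_cast<std::uintptr_t>(&samples[0]) % kCacheLine));
}
```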
This work keeps the L1 cache size small so that more accesses reach main memory, which gives a more accurate picture of memory access and memory bus behavior.
The server was configured as follows: Fujitsu PRIMEQUEST 1800E, 8 processors / 64 cores / 128 threads, Intel Xeon Processor X7560, 2.26 GHz, 64 KB
L1 cache and 256 KB L2 cache per core, 24 MB L3 cache per processor, 512 GB main memory.
Intel Deep Power Down Technology (C6) is the lowest power state of the CPU, in which the core clock, PLL, L1 cache, and L2 cache are turned off.
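As a small, Linux-specific illustration (not from the cited work; it assumes the standard cpuidle sysfs layout and an idle driver such as intel_idle that exposes a deep state corresponding to C6), the following C++17 sketch lists the idle states the kernel reports for CPU 0:

```cpp
// Enumerate the idle (C-)states exposed for CPU 0 through the Linux
// cpuidle sysfs interface. Entries are printed in directory order.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    namespace fs = std::filesystem;
    const fs::path base{"/sys/devices/system/cpu/cpu0/cpuidle"};
    if (!fs::exists(base)) {
        std::cerr << "cpuidle sysfs not available on this system\n";
        return 1;
    }
    for (const auto& entry : fs::directory_iterator(base)) {
        if (!entry.is_directory()) continue;          // only stateN directories
        std::ifstream name(entry.path() / "name");
        std::ifstream desc(entry.path() / "desc");
        std::string n, d;
        std::getline(name, n);
        std::getline(desc, d);
        std::cout << entry.path().filename().string()
                  << ": " << n << " (" << d << ")\n";
    }
}
```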
Configuration of the central server was as follows: HP Integrity rx6600, 4 processors / 8 cores / 16 threads, Dual-Core Intel Itanium 2 9050, 1.6 GHz, 32 KB(I) + 32 KB(D)
L1 cache, 2 MB(I) + 512 KB(D) L2 cache, 24 MB L3 cache and 48 GB main memory.