Number 275 - April 2006

What Does A Cache Do For A Computer?
by Brian K. Lewis, Ph.D.*,
Sarasota Personal Computer Users Group, Inc.


   A cache (pronounced "cash") is a form of memory storage that generally operates faster than RAM, and far faster than a hard drive. A cache is a smaller, faster memory that stores copies of the data from the most frequently used memory locations. Computer processors (CPUs) utilize both internal and external caches. You will also find references to caches of various sizes in the specifications of hard drives and CD & DVD drives. In order to see how these caches benefit computer operations, we'll look at the operation of the internal caches on CPUs.

   Before looking at the cache function, you need to have some understanding of the architecture of a CPU. Much of the internal structure of a CPU is composed of registers that hold small bits of information and also can be used in manipulating information. As one example, the Intel Pentium 4 processors have 128 registers. Some registers hold instructions, others hold data, others have memory addresses and others are arithmetic manipulators. The instructions are found in the program code and they tell the processor what to do with the data. The processor loads instructions from memory and then loads data that is manipulated based on the instructions. So the registers hold data to be processed, the results of calculations, or addresses pointing to the location of other data. The processor can act on data in registers almost instantaneously. However, the registers are far too small to hold all the data required. Instead, instructions and data have to be read from or written to RAM.

   If the program code were always loaded directly from memory and all the data were written directly back to memory and then to the hard drive, the overall process would be quite slow compared to what we normally see. It is the use of caches that greatly speeds up the total process so the processor isn't stalled waiting for either instructions or data. The fastest cache is the one that is part of the processor and is referred to as the L1 cache. It can operate at the same speed as the processor. So if you have a 3.0-gigahertz (GHz) CPU, the L1 cache also operates at 3.0 GHz. Thus data can be accessed in one clock cycle. This cache is generally 128 kilobytes (KB) in size or smaller, although the Pentium 4 has an internal cache of 16 KB plus an internal Trace cache of 150 KB.

   The following diagram displays the relative relationship of the RAM memory and the components of the caches in the CPU body:

RAM Memory:
   - L2 Memory Cache
   - L1 Instruction Cache
   - Fetch Unit
   - Decode Unit
   - Execution Unit
   - L1 Memory Cache

   The above components marked (*) run at the same rate as the internal CPU clock. The next cache in distance from the processor is the L2 cache. In older CPUs this was totally external to the processor. In most cases, the L2 cache is now integrated on the CPU chip. The data path in these processors is 256 bits wide, allowing for the transfer of more bits per clock cycle than the older processors that had 64- or 128-bit paths. The data path between the CPU and the external RAM is usually 64 bits or 128 bits wide. In a system with an 800 MHz bus, the real clock rate is 200 MHz, but four transfers occur per clock cycle. This gives an effective rate of 800 MHz, or 6.4 GB/second over a 64-bit path (800 million transfers of 8 bytes each per second). That is still considerably slower than the transfer rate within the CPU.
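
   The bus arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the 64-bit path width, which is the width that yields the 6.4 GB/second figure.

```python
def bus_bandwidth_bytes_per_sec(clock_hz, transfers_per_cycle, width_bits):
    """Peak bus transfer rate in bytes per second (hypothetical helper)."""
    # transfers per second, times bits per transfer, converted to bytes
    return clock_hz * transfers_per_cycle * width_bits // 8


# 200 MHz real clock, 4 transfers per cycle, 64-bit data path (assumed width)
peak = bus_bandwidth_bytes_per_sec(200_000_000, 4, 64)
print(peak / 1e9, "GB/second")
```

   With a 128-bit path and the same clock, the same calculation gives twice the peak rate.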

   The theory of using caches is that instructions and data in the cache will be the next set of information requested by the CPU for processing. If the requested information is in either the L1 or L2 cache, it will not be necessary to go to RAM. Thus it can be accessed at the internal clock rate. If the information is found in a cache, that is referred to as a "hit"; otherwise it is a "miss". (Logical, right?) Now, the bigger the memory cache, the better the chances of finding the data required by the CPU. However, there is a catch to this. The bigger the cache, the more time is required to find the data. This is referred to as the "latency" time. In an ideal setup you would have a single cache with a high hit rate and a low latency. This is very difficult to achieve in practice. Consequently, we have two caches: a small one with low latency and a lower hit rate, combined with a large one with a higher hit rate and higher latency.
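
   The hit-rate side of this trade-off can be illustrated with a toy simulation in Python. It is a sketch under stated assumptions, not a model of any real CPU: the cache here evicts the least recently used entry (a policy the article does not specify), the access pattern is invented, and the function names are hypothetical.

```python
from collections import OrderedDict


def hit_rate(addresses, capacity):
    """Fraction of accesses that hit in a cache holding `capacity` entries.

    Assumes least-recently-used (LRU) eviction, chosen only for illustration.
    """
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True             # a miss: bring the entry in
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(addresses)


def average_access_cycles(rate, hit_cycles, miss_cycles):
    """Average cost per access, given a hit rate and assumed latencies."""
    return rate * hit_cycles + (1 - rate) * miss_cycles


# Invented pattern: a loop that touches the same 8 addresses 10 times over.
pattern = list(range(8)) * 10
small = hit_rate(pattern, 4)   # cache too small to hold the whole loop
large = hit_rate(pattern, 8)   # cache just big enough for the whole loop
print(small, large)
```

   In this particular pattern the small cache never hits, because each entry is evicted before the loop comes back to it, while the larger cache misses only on the first pass. Feeding either rate into average_access_cycles with assumed hit and miss latencies shows how quickly misses come to dominate the average.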


   Now that we've reviewed the architecture, we need to see how all this works. Let's start with the Fetch unit that is used to load information from memory on demand from the processor. It first checks the caches to see if the required instructions or data are there. If not, it will load the information from system RAM. This information is then passed to the Decode unit. Note that when I refer to information it can either be instructions or data.

   If the information is a program instruction, the Decode unit will figure out what that particular instruction does. It does that by consulting a ROM inside the CPU called the microcode. Each instruction that a given CPU understands has its own microcode. The microcode will "teach" the CPU what to do. It is like a step-by-step guide to every instruction. If the instruction loaded is, for example, add a+b, its microcode will tell the Decode unit that it needs two parameters, a and b. The Decode unit will then request the Fetch unit to grab the data present in the next two memory positions, which hold the values for a and b. After the Decode unit has "translated" the instruction and grabbed all the data required to execute the instruction, it will pass the data and the "step-by-step cookbook" on how to execute that instruction to the Execute unit. There is an exception to this in the newest Pentium 4 processors. In these processors the L1 Instruction Cache has been relocated to after the Decode unit. It now contains the translated instructions and is referred to as the Trace cache.
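
   The fetch, decode and execute sequence just described can be mimicked with a toy interpreter in Python. Everything here is hypothetical and greatly simplified: the MICROCODE table, the two made-up instructions and the run function are illustrations only and bear no relation to real Pentium microcode.

```python
# Hypothetical "microcode" table: for each instruction, how many operands
# it needs and the steps (here a single function) that carry it out.
MICROCODE = {
    "add": (2, lambda a, b: a + b),
    "neg": (1, lambda a: -a),
}


def run(memory):
    """Walk through memory, fetching, decoding and executing instructions."""
    results = []
    position = 0
    while position < len(memory):
        opcode = memory[position]              # Fetch: load the instruction
        position += 1
        count, operation = MICROCODE[opcode]   # Decode: look up the microcode
        operands = memory[position:position + count]  # Fetch the next values
        position += count
        results.append(operation(*operands))   # Execute: perform the steps
    return results


# "add" is followed by its two parameters, "neg" by its single parameter.
program = ["add", 2, 3, "neg", 7]
print(run(program))
```

   The point of the table is the one made in the text: the Decode step learns from the microcode how many values to request before anything can be executed.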

   The Execute unit will finally execute the instruction. On modern CPUs you will find more than one execution unit working in parallel. This is done in order to increase the processor performance. For example, a Pentium 4 CPU with six execution units can execute six instructions per clock cycle. In theory it could achieve the same performance as six processors with just one execution unit. After the processing is over, the result is sent to the L1 Memory cache. From there it can be written to RAM or sent elsewhere.

   Modern processors have another feature called the "pipeline". This is the capability of having several different instructions at different stages of processing in the CPU at the same time. On Pentium III processors the pipeline had 11 stages, each a unit of the CPU. The latest Pentium 4 processors have 31 stages. With the greater number of stages, fewer transistors are required per stage, resulting in a higher clock rate. O.K., so what's the value of stages in the pipeline? After the Fetch unit sends an instruction for decoding, it grabs the next instruction. This can be sent on as soon as the first instruction is sent to the Execution unit. If an instruction has to be processed by all 11 (or 31) stages, it takes the most time, while other instructions might require fewer stages. Only when the first instruction is finished processing can it be sent out, but others that required processing by fewer stages might immediately follow. The consequence of this is that multiple instructions can be processed simultaneously. This greatly increases the overall processing throughput.
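
   The gain in throughput can be estimated with a simple idealized model, sketched here in Python. The model is an assumption for illustration, not a description of the Pentium pipelines: every stage takes exactly one clock cycle, every instruction passes through every stage, and the pipeline never stalls. The function names are made up.

```python
def cycles_unpipelined(instructions, stages):
    """Each instruction finishes all stages before the next one starts."""
    return instructions * stages


def cycles_pipelined(instructions, stages):
    """Idealized pipeline: a new instruction enters on every clock cycle."""
    if instructions == 0:
        return 0
    # The first instruction needs `stages` cycles to come out the far end;
    # after that, one more instruction completes on each cycle.
    return stages + instructions - 1


for stages in (11, 31):
    print(stages, cycles_unpipelined(100, stages), cycles_pipelined(100, stages))
```

   Under these assumptions, 100 instructions take 1,100 cycles without pipelining on the 11-stage design but only 110 cycles with it. The 31-stage design needs a few more cycles (130), which is why a deeper pipeline only pays off when it allows the higher clock rate mentioned above.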

   Other caches found in computers are not associated with the processor. One such type of cache that you use frequently, probably without being aware of it, is the web page cache managed by your web browser. When you visit a web page, it is downloaded to your computer. If you visit that same page within a few days, your browser pulls the page from its temporary cache, compares it with the current page on the web server and updates only the changed portions. This speeds up the appearance of the page on your computer. For example, my home page is Yahoo.com. The major part of this page doesn't change from day to day, so the downloading of the page is limited to those parts that have actually changed. This allows the page to appear on my screen quite rapidly.
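
   The idea of downloading only what has changed can be sketched in Python without touching a network. The FakeServer and BrowserCache classes below are invented for illustration and are not how any particular browser is written. The sketch assumes each piece of a page carries a version tag, and that the server sends the content again only when the browser's cached tag is out of date.

```python
class FakeServer:
    """Stand-in for a web server: maps a URL to a (version tag, body) pair."""

    def __init__(self, resources):
        self.resources = resources
        self.bodies_sent = 0   # counts full downloads

    def get(self, url, cached_tag=None):
        tag, body = self.resources[url]
        if cached_tag == tag:
            return tag, None   # unchanged: no body needs to be sent
        self.bodies_sent += 1
        return tag, body


class BrowserCache:
    """Keeps the last copy of each resource along with its version tag."""

    def __init__(self, server):
        self.server = server
        self.store = {}

    def load(self, url):
        cached = self.store.get(url)
        cached_tag = cached[0] if cached else None
        tag, body = self.server.get(url, cached_tag)
        if body is None:
            return cached[1]            # reuse the cached copy
        self.store[url] = (tag, body)   # remember the fresh copy
        return body


server = FakeServer({"logo": ("v1", "LOGO"), "news": ("v1", "old headlines")})
browser = BrowserCache(server)
first_visit = [browser.load("logo"), browser.load("news")]

server.resources["news"] = ("v2", "new headlines")   # only the news changes
second_visit = [browser.load("logo"), browser.load("news")]
print(second_visit, server.bodies_sent)
```

   On the second visit only the changed item is downloaded again, so three bodies are sent in total rather than four.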

   So in CPU processing, the use of caches has greatly increased the speed of data handling. The same is true of caches used elsewhere in the computer. In all cases they are short-time storage of information. Luckily, you don't have to have a complete understanding of caches to use your computer. Let the computer do the work!

   *Dr. Lewis is a former university & medical school professor. He has been working with personal computers for more than thirty years. He can be reached via e-mail at bwsail@yahoo.com.

   Copyright 2005. This article is from the March 2006 issue of the Sarasota PC Monitor, the official monthly publication of the Sarasota Personal Computer Users Group, Inc., P.O. Box 15889, Sarasota, FL 34277-1889. Permission to reprint is granted only to other non-profit computer user groups, provided proper credit is given to the author and our publication.