Core i7

Nehalem: An Overview
It’s hard to talk about an overview of an architecture like Nehalem, which is fundamentally designed to be modular. The Intel engineers wanted to design a set of building blocks that can be assembled like Legos to create the various versions of the architecture.


It is possible, though, to take a look at the flagship of the new architecture—the very high-end version that will be used in servers and high-performance workstations. At first glance, the specs will likely remind you of the Barcelona (K10) architecture from AMD. It is natively quad-core and has three levels of cache, a built-in memory controller, and a high-performance system of point-to-point interconnections for communicating with peripherals and other CPUs in multiprocessor configurations. This suggests that it wasn’t AMD’s technological choices that were bad, but rather its implementation, which hasn’t scaled well in its current form.

But Intel has done more than just revise its architecture by taking inspiration from its competitor’s innovations. With a budget of more than 700 million transistors (731 million, to be exact), the engineers were able to greatly improve certain characteristics of the execution core while adding new functionality. For example, simultaneous multi-threading (SMT), which had already appeared with the Pentium 4 "Northwood" under the name Hyper-Threading, has made its comeback. Associated with four physical cores, certain versions of Nehalem that incorporate two dies in a single package will be capable of executing up to 16 threads simultaneously. While this change appears simple at first glance, as we’ll see later on, it has a wide impact at several levels of the pipeline; many buffers need to be re-dimensioned so that this mode of operation doesn’t impact performance. As has been the case with each new architecture for several years now, Intel has also added new SSE instructions to Nehalem. The architecture supports SSE 4.2, components of which appear to be borrowed from AMD’s K10 micro-architecture.
Now that you know the broad outlines of the new architecture, it’s time to take a more detailed look, starting with the front end of the pipeline—the part that’s in charge of reading instructions in memory and preparing them for execution.


Reading And Decoding Instructions
Unlike the changes made in moving from Core to Core 2, Intel hasn’t done much to Nehalem’s front end. It has the same four decoders that made their appearance with the Conroe—three simple and one complex. It is still capable of macro-ops fusion and so offers a theoretical maximum throughput of 4+1 x86 instructions per cycle.
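To make the idea concrete, here is a toy Python model of macro-ops fusion. The instruction names and the set of fusible pairs are simplified assumptions for illustration, not Nehalem’s actual fusion rules: a compare-style instruction followed by a conditional jump is counted as a single fused µop.

```python
# Toy model of macro-ops fusion: a cmp/test followed by a conditional
# jump is emitted as a single fused µop, which is how the decoders can
# sustain more x86 instructions per cycle than their µop output width.

FUSIBLE_FIRST = {"cmp", "test"}          # simplified assumption
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jg", "jz", "jnz"}

def decode(stream):
    """Return the number of µops emitted for a list of x86 mnemonics."""
    uops = 0
    i = 0
    while i < len(stream):
        if (stream[i] in FUSIBLE_FIRST and i + 1 < len(stream)
                and stream[i + 1] in CONDITIONAL_JUMPS):
            uops += 1          # cmp+jcc pair fused into one µop
            i += 2
        else:
            uops += 1          # ordinary instruction: one µop here
            i += 1
    return uops
```

With fusion, a three-instruction loop tail like `cmp`/`jne`/`add` costs only two µops in this model.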

Though there are no revolutionary changes at first glance, the devil is in the details. As we noted in our article on the Barcelona architecture, increasing the number of processing units is an extremely inefficient way to boost performance. The cost is high and the gains shrink more and more with each addition, following the law of diminishing returns. So instead of adding a new decoder, the engineers concentrated on making the existing ones more efficient.
First, they added support for macro-ops fusion in 64-bit mode, which is justified for an architecture like Nehalem that makes no attempt to hide its ambitions in the server market segment. But the engineers didn’t stop there. Where the Conroe architecture could fuse only a limited number of instructions, the Nehalem architecture supports a greater number of variations, making it possible to use macro-ops fusion more frequently.
Another new feature introduced by the Conroe has also been improved: the Loop Stream Detector. Behind this name lies what is in fact a data buffer that holds a few instructions (18 x86 instructions on Core 2s). When the processor detects a loop, it disables certain parts of the pipeline. Since a loop consists of executing the same instructions a given number of times, it’s not necessary to perform branch prediction or to recover the instruction from the L1 cache at each iteration of the loop. So the Loop Stream Detector acts as a small cache memory that short-circuits the first stages of the pipeline in such situations. The gains made via this technique are twofold: it decreases power consumption by avoiding useless tasks and it improves performance by reducing the pressure on the L1 instruction cache.
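The benefit is easy to see with a small counting model (a sketch only; the buffer sizes are the figures quoted above, and real fetch behavior is more subtle):

```python
# Toy model of the Loop Stream Detector: if a loop body fits in the
# buffer, only the first iteration goes through fetch; later iterations
# replay from the buffer, short-circuiting the early pipeline stages.

def simulate_fetch(loop_len, iterations, buffer_size=18):
    """Return (fetches without LSD, fetches with LSD) for a simple loop."""
    without_lsd = loop_len * iterations
    if loop_len <= buffer_size:
        with_lsd = loop_len          # fetched once, then replayed
    else:
        with_lsd = without_lsd       # loop too big: LSD can't engage
    return without_lsd, with_lsd
```

A 10-instruction loop run 100 times drops from 1,000 fetches to 10 in this model; a 30-instruction loop gets no help from an 18-entry buffer.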

With the Nehalem architecture, Intel has improved the functionality of the Loop Stream Detector. First of all, the buffer is larger—it can now store 28 instructions. But what’s more, its position in the pipeline has changed. In Conroe, it was located just after the instruction fetch phase. It’s now located after the decoders; this new position allows a larger part of the pipeline to be disabled. Nehalem’s Loop Stream Detector no longer stores x86 instructions, but rather µops. In this sense, it’s similar to the Pentium 4’s trace cache concept. It’s no surprise to find certain innovations ushered in by that architecture in the Nehalem, given that the Hillsboro team now in charge of Nehalem was responsible for the Pentium 4 project. However, where the Pentium 4 relied on the trace cache exclusively, since it could only count on one decoder in case of a trace cache miss, Nehalem has the benefit of the power of its four decoders, with the Loop Stream Detector serving only as an additional optimization for certain situations. In a way, it’s the best of both worlds.
Branch Predictors
The last improvement to the front end has to do with the branch predictors. The efficiency of branch prediction algorithms becomes crucial in architectures that need high levels of instruction parallelism. A branch breaks the parallelism because it necessitates waiting for the result of a preceding instruction before execution of the flow of instructions can be continued. Branch prediction determines whether or not a branch will be taken, and if it is, quickly determines the target address for continuing execution. No complicated techniques are needed to do this; all that’s needed is an array of branches—the Branch Target Buffer (BTB)—that stores the results of the branches as execution progresses (Taken or Not Taken and target address) and an algorithm for determining the result of the next branch.
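The BTB-plus-algorithm pairing can be sketched in a few lines of Python. This is a minimal, illustrative predictor using a classic 2-bit saturating counter per branch; real predictors (including Nehalem’s, whose details Intel hasn’t published) are far more elaborate.

```python
# Minimal branch predictor sketch: a Branch Target Buffer keyed by the
# branch's address, each entry holding a 2-bit saturating counter and
# the last observed target address.

class BranchPredictor:
    def __init__(self):
        self.btb = {}  # addr -> (counter in 0..3, target)

    def predict(self, addr):
        """Return (predicted taken?, predicted target)."""
        counter, target = self.btb.get(addr, (1, None))
        return (counter >= 2), target   # taken if counter is 2 or 3

    def update(self, addr, taken, target):
        """Train the counter with the branch's actual outcome."""
        counter, _ = self.btb.get(addr, (1, None))
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.btb[addr] = (counter, target)
```

The saturating counter gives the predictor hysteresis: one mispredicted iteration at the end of a loop doesn’t immediately flip the prediction.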
Intel hasn’t provided details on the algorithm used for its new predictors, but it is known that they are now two-level predictors. The first level is unchanged from the Conroe architecture, but a new level with slower access that can store more branch history has been added. According to Intel, this configuration improves branch prediction for certain applications that use large volumes of code, such as databases—more evidence of Nehalem’s server orientation. Another improvement is to the Return Stack Buffer, which stores the return addresses of functions when they’re called. In certain cases this buffer can overflow, which can lead to faulty predictions. To limit that possibility, AMD increased its size to 24 entries, whereas with Nehalem Intel has introduced a renaming system for this buffer.

The Return Of Hyper-Threading
So, the front end hasn’t been profoundly overhauled; neither has the back end. It has exactly the same execution units as the most recent Core processors, but here again the engineers have worked on using them more efficiently.

With Nehalem, Hyper-Threading makes its great comeback. Introduced with the Northwood version of Intel’s NetBurst architecture, Hyper-Threading—also known outside the world of Intel as Simultaneous Multi-Threading (SMT)—is a means of exploiting thread parallelism to improve the use of a core’s execution units, making the core appear to be two cores at the application level.
In order to use parallel threads, certain resources—such as registers—must be duplicated. Other resources are shared by the two threads, and that includes all the out-of-order execution logic (the instruction reorder buffer, the execution units, and cache memory). A simple observation led to the introduction of SMT: the “wider” (meaning more execution units) and “deeper” (meaning more pipeline stages) processors become, the harder it is to extract enough parallelism to use all the execution units at each cycle. Where the Pentium 4 was very deep, with a pipeline having more than 20 stages, Nehalem is very wide. It has six execution units capable of executing three memory operations and three calculation operations. If the execution engine can’t find sufficient parallelism of instructions to take advantage of them all, “bubbles”—lost cycles—occur in the pipeline.

To remedy that situation, SMT looks for instruction parallelism in two threads instead of just one, with the goal of leaving as few units unused as possible. This approach can be extremely effective when the two threads are executing tasks that are highly separate. On the other hand, two threads involving intensive calculation, for example, will only increase the pressure on the same calculating units, putting them in competition with each other for access to the cache. It goes without saying that SMT is of no interest in this type of situation, and can even negatively impact performance.
SMT Implementation

Still, the impact of SMT on performance is positive most of the time and the cost in terms of resources is still very limited, which explains why the technology is making a comeback. But programmers will have to pay attention because with Nehalem, all threads are not created equal. To help solve this puzzle, Intel provides a way of precisely determining the exact topology of the processor (the number of physical and logical processors), and programmers can then use the operating system affinity mechanism to assign each thread to a processor. This kind of thing shouldn’t be a problem for game programmers, who are already in the habit of working that way because of the way the Xenon processor (the one used in the Xbox 360) works. But unlike consoles, where programmers have very low-level access, on a PC the operating system’s thread scheduler will always have the last word.
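On Linux, the affinity mechanism mentioned above is directly accessible from user code. The sketch below queries the number of logical processors and pins the calling thread to one of them; the pinning call is Linux-specific, so it degrades gracefully elsewhere.

```python
import os

# Query the number of logical CPUs (with SMT, twice the physical cores)
# and pin the calling thread to one of them. os.sched_setaffinity is
# only available on Linux, so we fall back to doing nothing elsewhere.

def pin_to_cpu(cpu):
    """Pin the calling thread to one logical CPU; True on success."""
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {cpu})   # 0 = the calling thread
            return True
        except OSError:
            return False                     # CPU not in the allowed set
    return False

logical_cpus = os.cpu_count()
```

A scheduler-aware program would combine this with CPU-topology information (e.g., from /proc/cpuinfo) to keep two heavy threads off sibling logical processors of the same physical core.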
Since SMT puts a heavier load on the out-of-order execution engine, Intel has increased the size of certain internal buffers to avoid turning them into bottlenecks. So the reorder buffer, which keeps track of all the instructions being executed in order to reorder them, has increased from 96 entries on the Core 2 to 128 entries on Nehalem. In practice, since this buffer is partitioned statically to keep any one thread from monopolizing all the resources, its size is reduced to 64 entries for each thread with SMT. Obviously, in cases where a single thread is executed, it has access to all the entries, which should mean that there won’t be any specific situations where Nehalem turns out to have worse performance than its predecessor.
The reservation station, which is the unit in charge of assigning instructions to the different execution units, has also increased in size: from 32 to 36 entries. But unlike the reorder buffer, here partitioning is dynamic, so that a thread can take up more or fewer entries as a function of its needs.
Two other buffers have also been resized: the load buffer and the store buffer. The former has 48 entries as opposed to 32 with Conroe, and the latter 32 instead of 20. Here too, partitioning between threads is static.
Another consequence of the return of SMT is that the performance of thread synchronization instructions has improved, according to Intel.
SSE 4.2 And Power Consumption
With the Nehalem architecture, Intel couldn’t resist adding some new items to the already long list of SSE instructions. Nehalem supports SSE 4.2, which has all the instructions supported by Penryn (SSE 4.1) and adds seven more. Most of these new instructions are for manipulating character strings, one purpose of which, Intel says, is to speed up the processing of XML files.

The two other instructions are aimed at specific applications; one is POPCNT, which appeared with Barcelona, and is used to count the number of non-zero bits in a register. According to Intel, this instruction is especially useful in voice recognition and DNA sequencing. The last instruction, CRC32, is used for accelerating the calculation of error detection codes.
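What these two instructions compute can be shown in plain Python. One caveat on the checksum: the CRC32 instruction uses the Castagnoli polynomial (CRC-32C), while Python’s zlib uses the IEEE polynomial, so the sketch matches the idea but not the exact values the instruction would produce.

```python
import zlib

# POPCNT counts the set bits of a register in a single instruction;
# the pure-Python equivalent is a bit-string count.
def popcnt(x):
    """Number of non-zero bits in x."""
    return bin(x).count("1")

# The CRC32 instruction accelerates error-detection checksums. zlib's
# crc32 illustrates the computation, but with a different polynomial
# (IEEE rather than the instruction's CRC-32C/Castagnoli).
checksum = zlib.crc32(b"error detection")
```

Counting set bits in a loop is exactly the kind of hot inner operation—voice recognition, DNA sequence matching—where replacing a software loop with one instruction pays off.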
Power Consumption Under Control
Over and over Intel says that for any potential innovation in one of its new architectures, the engineers weigh performance gains against the impact on power consumption—it’s proof that they’ve learned well from Pentium 4. With the Nehalem architecture, the engineers have gone even farther with techniques for limiting consumption. There’s now a built-in microcontroller—the Power Control Unit—that constantly watches the temperature and power use of the cores and can disable them completely when they’re not being used. Thanks to this technology, the energy consumption of an unused core is next to zero, whereas before Nehalem there were still losses due to leakage currents.

Intel has implemented this in an original way with what it calls a Turbo mode. When the processor is operating below its standard TDP, for example, the Turbo mode increases the frequency of the cores being used, while still keeping within the TDP limit.

Note also that like the Atom processor, Nehalem’s L1 and L2 caches use eight-transistor (8T) SRAM cells instead of the usual six-transistor ones, which reduces consumption at the cost of a slightly larger die surface.


QuickPath Interconnect
While the Core architecture was remarkably efficient, certain design details had begun to show their age, first among them the Front Side Bus (FSB) system. This bus connecting the processors to the northbridge was a real anachronism in an otherwise highly modern architecture. The major fault was most noticeable in multiprocessor configurations, where the architecture had a hard time keeping up with increasing loads. The processors had to share this bus not only for access to memory, but also for ensuring the coherence of the data contained in their respective cache memories.
In this kind of situation, the influx of transactions to the bus can lead to its saturation. For a long time, Intel simply worked around the problem by using a faster bus or larger cache memories, but it was high time for them to address the underlying cause by completely overhauling the way its processors communicate with memory and outside components.


The solution Intel chose—called QuickPath Interconnect (QPI)—was nothing new: an integrated memory controller paired with an extremely fast point-to-point serial bus. The technology was introduced five years ago in AMD processors, but in reality it’s even older than that. These concepts, showing up in AMD and now Intel products, are in fact the result of work done ten years ago by the engineers at DEC during the design of the Alpha 21364 (EV7). Since many former DEC engineers ended up in Santa Clara, it’s not surprising to see these concepts surfacing in the latest Intel architecture.
From a technical point of view, a QPI link is bidirectional and has two 20-bit links—one in each direction—of which 16 bits are reserved for data; the four others are used for error detection codes or protocol functions. This works out to a maximum of 6.4 GT/s (billion transfers per second), or a usable bandwidth of 12.8 GB/s in each direction, read and write simultaneously. Just for comparison, the FSB on the most recent Intel processors operates at a maximum clock frequency of 400 MHz; address transfers need two clock cycles (200 MT/s), whereas data transfers operate in QDR mode, at 1.6 GT/s. With its 64-bit width, the FSB thus also has a total bandwidth of 12.8 GB/s, but that bandwidth is shared between reads and writes.
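The arithmetic behind those figures is worth checking explicitly (the rates and widths are the ones quoted above):

```python
# QPI: 6.4 GT/s per direction, 16 data bits per transfer.
qpi_transfers_per_s = 6.4e9
qpi_data_bits = 16                 # 16 of the 20 lanes carry data
qpi_bw_per_direction = qpi_transfers_per_s * qpi_data_bits / 8  # bytes/s

# FSB: 400 MHz quad-pumped (QDR) data phase, 64 bits wide.
fsb_transfers_per_s = 4 * 400e6    # 1.6 GT/s
fsb_width_bytes = 8                # 64-bit bus
fsb_bw_total = fsb_transfers_per_s * fsb_width_bytes            # bytes/s
```

Both come to 12.8 GB/s, but QPI offers that in each direction at once, while the FSB shares it between reads and writes.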
So a QPI link has a theoretical bandwidth that’s up to twice as high, provided reads and writes are well balanced. In a theoretical case consisting of reads only or writes only, the bandwidth would be identical to that of the FSB. However, you have to keep in mind that the FSB was used both for memory access and for all transfers of data to peripherals or between processors. With Nehalem, a QPI link will be exclusively dedicated to transfers of data to peripherals, with memory transfers handled by the integrated controller and inter-CPU communications in multi-socket configurations by another QPI link. Even in the worst cases, a QPI link should show significantly better performance than the FSB.

As we’ve seen, Nehalem was designed to be a flexible, scalable architecture, and so the number of QPI links available will vary with the market segment being aimed at—from one link to the chipset for single-socket configurations to as many as four in four-socket server configurations. This enables true fully-connected four-processor systems, meaning that each processor can access any position in memory with a maximum of a single QPI link hop since each CPU is connected directly to the three others.
Memory Subsystem
An Integrated Memory Controller
Intel has taken its time catching up to AMD on this point. But as is often the case, when the giant does something, he takes a giant step. Where Barcelona had two 64-bit memory controllers supporting DDR2, Intel’s top-of-the-line configuration will include three DDR3 memory controllers. Hooked up to DDR3-1333, which Nehalem will also support, that adds up to a bandwidth of 32 GB/s in certain configurations. But the advantage of an integrated memory controller isn’t just a matter of bandwidth. It also substantially lowers memory access latency, which is just as important, considering that each access costs several hundred cycles. Though the latency reduction achieved by an integrated memory controller will be appreciable in the context of desktop use, it is multi-socket server configurations that will get the full benefit of the more scalable architecture. Where before bandwidth remained constant as CPUs were added, now each new CPU increases aggregate bandwidth, since each processor has its own local memory space.
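Again, the 32 GB/s figure falls straight out of the channel arithmetic (standard DDR3-1333 parameters assumed):

```python
# Three 64-bit DDR3-1333 channels: transfers/s x bytes/transfer x channels.
channels = 3
transfers_per_s = 4e9 / 3          # DDR3-1333: ~1333 MT/s per channel
bytes_per_transfer = 8             # 64-bit channel width
total_bandwidth = channels * transfers_per_s * bytes_per_transfer
```

Each channel contributes roughly 10.7 GB/s, for about 32 GB/s per socket.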

Obviously this is not a miracle solution. This is a Non Uniform Memory Access (NUMA) configuration, which means that memory accesses can be more or less costly, depending on where the data is in memory. An access to local memory obviously has the lowest latency and the highest bandwidth; conversely, an access to remote memory requires a transit via the QPI link, which reduces performance.

The impact on performance is difficult to predict, since it’ll be dependent on the application and the operating system. Intel says the performance hit for remote access is around 70% in terms of latency, and that bandwidth can be reduced by half compared to local access. According to Intel, even with remote access via the QPI link, latency will still be lower than on earlier processors where the memory controller was on the northbridge. However, those considerations only apply to server applications, and for a long time now they’ve already been designed with the specifics of NUMA configurations in mind.

A Three-Level Cache Hierarchy
The memory hierarchy of Conroe was extremely simple and Intel was able to concentrate on the performance of the shared L2 cache, which was the best solution for an architecture that was aimed mostly at dual-core implementations. But with Nehalem, the engineers started from scratch and came to the same conclusions as their competitors: a shared L2 cache was not suited to a native quad-core architecture. One core can too frequently evict data needed by another, and the shared approach would surely have involved too many problems in terms of internal buses and arbitration to provide all four cores with sufficient bandwidth while keeping latency sufficiently low. To solve the problem, the engineers provided each core with a Level 2 cache of its own. Since it’s dedicated to a single core and relatively small (256 KB), the engineers were able to endow it with very high performance; latency, in particular, has reportedly improved significantly over Penryn—from 15 cycles to approximately 10 cycles.

Then comes an enormous Level 3 cache memory (8 MB) for managing communications between cores. While at first glance Nehalem’s cache hierarchy reminds one of Barcelona, the operation of the Level 3 cache is very different from AMD’s—it’s inclusive of all lower levels of the cache hierarchy. That means that if a core tries to access a data item and it’s not present in the Level 3 cache, there’s no need to look in the other cores’ private caches—the data item won’t be there either. Conversely, if the data are present, four bits associated with each line of the cache memory (one bit per core) show whether or not the data are potentially present (potentially, but not with certainty) in the lower-level cache of another core, and which one.

This technique is effective for ensuring the coherency of the private caches because it limits the need for exchanges between cores. It has the disadvantage of wasting part of the cache memory with data that is already in other cache levels. That’s somewhat mitigated, however, by the fact that the L1 and L2 caches are relatively small compared to the L3 cache—all the data in the L1 and L2 caches takes up a maximum of 1.25 MB out of the 8 MB available. As on Barcelona, the Level 3 cache doesn’t operate at the same frequency as the rest of the chip. Consequently, latency of access to this level is variable, but it should be in the neighborhood of 40 cycles.
The only real disappointment with Nehalem’s new cache hierarchy is its L1 cache. The bandwidth of the instruction cache hasn’t been increased—it’s still 16 bytes per cycle compared to 32 on Barcelona. This could create a bottleneck in a server-oriented architecture since 64-bit instructions are larger than 32-bit ones, especially since Nehalem has one more decoder than Barcelona, which puts that much more pressure on the cache. As for the data cache, its latency has increased to four cycles compared to three on the Conroe, facilitating higher clock frequencies. To end on a positive note, though, the engineers at Intel have increased the number of Level 1 data cache misses that the architecture can process in parallel.
TLB
For many years now, processors have been working not with physical memory addresses, but with virtual addresses. Among other advantages, this approach lets more memory be allocated to a program than the computer actually has, keeping only the data necessary at a given moment in actual physical memory with the rest remaining on the hard disk. This means that for each memory access a virtual address has to be translated into a physical address, and to do that an enormous table is put in charge of keeping track of the correspondences. The problem is that this table gets so large that it can’t be stored on-chip—it’s placed in main memory, and can even be paged (part of the table can be absent from memory and itself kept on the hard disk).
If this translation stage were necessary at each memory access, it would make access much too slow. As a result, engineers added a small cache memory directly on the processor that stores the correspondences for a few recently accessed addresses. This cache memory is called a Translation Lookaside Buffer (TLB). Intel has completely revamped the operation of the TLB in the new architecture. Up until now, the Core 2 has used a level 1 TLB that is extremely small (16 entries) but also very fast, for loads only, and a larger level 2 TLB (256 entries) that handled loads missed in the level 1 TLB, as well as stores.
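The translate-through-a-TLB mechanism just described can be sketched as follows (the page-table contents are made-up values; a real table lives in memory and has multiple levels):

```python
# Toy address translation with 4 KB pages: split the virtual address
# into a page number and an offset, hit the TLB if the page was seen
# recently, otherwise fall back to a (slow) page-table lookup.

PAGE_SHIFT = 12                    # 4 KB pages
page_table = {0x00400: 0x1A2B3}    # virtual page -> physical frame (made up)
tlb = {}

def translate(vaddr):
    """Return (physical address, TLB hit?) for a virtual address."""
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & 0xFFF
    if vpn in tlb:
        frame, hit = tlb[vpn], True
    else:
        frame, hit = page_table[vpn], False   # slow page-table walk
        tlb[vpn] = frame                      # cache it for next time
    return (frame << PAGE_SHIFT) | offset, hit
```

The first access to a page walks the table; every subsequent access to the same page is a cheap TLB hit, which is why TLB capacity matters so much for large working sets.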
Nehalem now has a true two-level TLB: the first level is split between data and instructions. The level 1 data TLB now stores 64 entries for small pages (4 KB) or 32 for large pages (2 MB/4 MB), while the level 1 instruction TLB stores 128 entries for small pages (the same as with Core 2) and seven for large pages. The second level is a unified cache that can store up to 512 entries and operates only with small pages. The purpose of this improvement is to increase the performance of applications that use large sets of data. As with the introduction of two-level branch predictors, this is further evidence of the architecture’s server orientation.
Let’s go back to SMT for a moment, since it also has an impact on the TLBs. The level 1 data TLB and the level 2 TLB are shared dynamically between the two threads. Conversely, the level 1 instruction TLB is statically shared for small pages, whereas the one dedicated to large pages is entirely replicated—this is understandable given its small size (seven entries per thread).
Memory Access And Prefetcher
Optimized Unaligned Memory Access
With the Core architecture, memory access was subject to several restrictions in terms of performance. The processor was optimized for access to memory addresses that were aligned on 64-byte boundaries—the size of one cache line. Not only was access slow for unaligned data, but execution of an unaligned load or store instruction was more costly than for aligned instructions, regardless of actual alignment of the data in memory. That’s because these instructions generated several µops for the decoders to handle, which reduced the throughput of this type of instruction. As a result, compilers avoided generating this type of instruction, by substituting sequences of instructions that were less costly.
Thus, memory reads that overlapped two cache lines took a performance hit of approximately 12 cycles, compared to 10 for writes. The Intel engineers have optimized these accesses to make them faster. First of all, there’s no performance penalty for using the unaligned versions of load/store instructions in cases where the data are aligned in memory. In other cases, Intel has optimized these accesses to reduce the performance hit compared to that of the Core architecture.
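Whether an access pays the line-crossing penalty is a simple boundary check (64-byte lines, as above):

```python
LINE = 64  # cache-line size in bytes

def spans_two_lines(addr, size):
    """Does an access of `size` bytes at `addr` cross a line boundary?"""
    return addr // LINE != (addr + size - 1) // LINE
```

An 8-byte read at address 60 straddles the boundary between lines 0 and 1 and would have cost roughly 12 cycles on Core; the same read at address 0 stays within one line.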
More Prefetchers Running More Efficiently
With the Conroe architecture, Intel was especially proud of its hardware prefetchers. A prefetcher is a mechanism that observes memory access patterns and tries to anticipate which data will be needed several cycles in advance. The point is to bring the data into the cache, where it will be more readily accessible to the processor, while trying to make use of bandwidth at moments when the processor doesn’t otherwise need it.
This technique produced remarkable results with most desktop applications, but in the server world the result was often a loss of performance. There are many reasons for that inefficiency. First of all, memory accesses are often much less easy to predict with server applications. Database accesses, for example, aren’t linear—when an item of data is accessed in memory, the adjacent data won’t necessarily be called on next. That limits the prefetcher’s effectiveness. But the main problem was with memory bandwidth in multi-socket configurations. As we said earlier, there was already a bottleneck between processors, but in addition, the prefetchers added additional pressure at this level. When a microprocessor wasn’t accessing memory, the prefetchers kicked in to use bandwidth they assumed was available. They had no way of knowing at that precise point that the other processor might need the bandwidth. That meant the prefetchers could deprive a processor of bandwidth that was already at a premium in this kind of configuration. To solve the problem, Intel had no better solution to offer than to disable the prefetchers in these situations—hardly a satisfactory answer.
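A basic stride prefetcher illustrates both the strength and the weakness described above (a minimal sketch; real prefetchers track many streams with confidence counters): it shines on regular, linear access patterns and finds nothing to latch onto in pointer-chasing database workloads.

```python
# Minimal stride prefetcher: watch successive access addresses, and
# once the same stride is seen twice in a row, speculatively fetch
# the next address in the sequence.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record an access; return an address to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetch = addr + stride      # pattern confirmed
            self.last_stride = stride
        self.last_addr = addr
        return prefetch
```

Note that the model issues its speculative fetch with no idea whether the bus is busy—exactly the behavior that starved neighboring processors of FSB bandwidth in multi-socket systems.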
Intel says the problem is solved now, but provides no details on the operation of the new prefetch algorithms; all it says is that it won’t be necessary to disable them for server configurations. But even if Intel hasn’t changed anything, the gains stemming from the new memory organization and the resulting wider bandwidth should limit any negative impact of the prefetchers.
Conclusion
Conroe laid strong foundations and Nehalem builds on them. It has the same efficient architecture, but it’s now much more modular and scalable, which should guarantee success in the different market segments. We’re not saying that Nehalem revolutionizes the Core architecture, but it does revolutionize the Intel platform, which has once again become a formidable match for AMD in terms of design and has surpassed its competitor in terms of implementation.

With all the improvements made at this level (integrated memory controller, QPI), it’s not really surprising that the changes to the execution core are purely incremental. But the return of Hyper-Threading is significant and there are several little optimizations that should guarantee a notable gain in performance compared to Penryn at equal frequencies.
Clearly, the greatest gains will be in situations where memory was the main bottleneck. In reading this article, you’ve probably noticed that this is the area the engineers have focused their attentions on. In addition to the integrated memory controller, which will undoubtedly produce the biggest gains where memory access is concerned, there’s a raft of other improvements, both big and small—the new cache and TLB hierarchies, unaligned memory access and prefetchers.
After all these theoretical considerations, the next step will be to see if the improvements in actual applications will live up to the expectations that the new architecture has aroused. We’ll look into that in a series of upcoming articles, so stay tuned!