Overview of the RISC-V Design with Tomasulo's Algorithm
An introduction to the RISC-V ISA, Verilog, and Tomasulo’s algorithm.
Disclaimer#
The content of this blog post is largely AI-generated text (Google Gemini 2.5 Pro Deep Research). Although I have reviewed and edited the text and fact-checked it, I cannot guarantee that it is 100% accurate or free of errors. Please use this content as a starting point for your own research and understanding, and verify any critical information independently.
With that said, I believe this post is super well-written and informative, and what really fascinates me is the “problem-solving” learning curve, which highlights the flaws and problems in every design choice and segues into the components that solve them.
Part I: The Language of Hardware - Verilog Fundamentals#
The study of processor design requires a fundamental shift in perspective. The tools and languages used to design hardware, such as Verilog, represent a different paradigm of computation. Understanding this paradigm is the first and most crucial step toward grasping the intricate workings of a modern CPU.
Section 1.1: Thinking in Parallel#
The most significant conceptual leap from software to hardware is the transition from sequential to concurrent execution. A typical software program, written in a language like C++, is a sequence of instructions executed one after another by a processor. The program’s state changes in a predictable, linear progression. In contrast, a physical hardware circuit is a collection of components—gates, flip-flops, memory blocks—that, once powered on, operate continuously and in parallel. A thousand logic gates do not wait their turn; they all compute their output based on their current inputs simultaneously, every moment in time.
It is for this reason that Verilog is classified as a Hardware Description Language (HDL), not a programming language in the traditional sense. Its primary purpose is not to provide a list of commands for a processor to execute, but to describe the physical structure and behavior of a digital electronic circuit. This description serves two main purposes: it can be fed into a simulation tool to model how the described circuit will behave over time, or it can be used by a synthesis tool to generate a netlist, which is a detailed blueprint for manufacturing an Application-Specific Integrated Circuit (ASIC) or configuring a Field-Programmable Gate Array (FPGA).
The fundamental unit of design in Verilog is the module. A module is a self-contained block of hardware logic, analogous to a class in C++ or a physical integrated circuit (IC) chip. It encapsulates internal logic and defines a clear interface to the outside world through a set of ports, which are declared as input, output, or inout. This modularity is essential for hierarchical design, allowing complex systems like an entire CPU to be built by connecting smaller, well-defined modules such as an Arithmetic Logic Unit (ALU), a register file, and a control unit.
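For a first taste of what a module looks like, here is a minimal sketch of a 2-to-1 multiplexer (the module and port names are illustrative, not from any particular design):

```verilog
// A minimal 2-to-1 multiplexer module: a self-contained block of logic
// with a clearly defined port interface (all names here are illustrative).
module mux2 (
    input  wire [31:0] a,    // first data input
    input  wire [31:0] b,    // second data input
    input  wire        sel,  // select signal
    output wire [31:0] y     // selected output
);
    // Continuous assignment: y follows the inputs combinationally.
    assign y = sel ? b : a;
endmodule
```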
Section 1.2: Describing Behavior - `initial` and `always` Blocks#
Within a Verilog module, the behavior of the circuit is described primarily within two types of procedural blocks: `initial` and `always`. These blocks contain statements that define how the outputs and internal state of the module should change in response to inputs and time.
The `initial` Block#
The `initial` block is the simpler of the two. As its name suggests, it contains a block of code that begins execution only once, at the very start of a simulation, at time zero. If multiple `initial` blocks are defined within a module, they all start concurrently at time zero.
This “run-once” behavior has a critical implication: `initial` blocks are generally not synthesizable. Real hardware does not have a concept of a “beginning of time” in the same way a simulation does; once powered on, it operates continuously. Therefore, an `initial` block cannot be translated into a physical circuit that performs an action only at power-on. Its primary role is within the realm of simulation, specifically in the construction of a testbench. A testbench is a separate Verilog module written to test the design (often called the “Design Under Test” or DUT). Within a testbench, `initial` blocks are indispensable for generating clock signals, providing a sequence of input stimuli to the DUT, and setting up initial memory states to verify the design’s correctness.
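A minimal testbench skeleton might look like the following sketch; all module and signal names here are illustrative, and the DUT instantiation is left as a comment:

```verilog
`timescale 1ns/1ps

// Illustrative testbench skeleton: `initial` blocks drive the simulation.
module tb;
    reg clk;
    reg [31:0] data_in;

    // Clock generation: start low, then toggle every 5 time units forever.
    initial begin
        clk = 1'b0;
        forever #5 clk = ~clk;
    end

    // Stimulus: apply a short sequence of inputs, then end the simulation.
    initial begin
        data_in = 32'd0;
        #12 data_in = 32'd42;
        #10 data_in = 32'd7;
        #20 $finish;
    end

    // The Design Under Test would be instantiated here, e.g.:
    // my_dut dut (.clk(clk), .data_in(data_in) /* ... */);
endmodule
```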
The `always` Block#
The `always` block is the cornerstone of synthesizable Verilog code. It contains a block of statements that execute repeatedly throughout the simulation. The execution of an `always` block is triggered by events specified in its sensitivity list, denoted by `@(...)`. This behavior directly models the nature of real hardware, which continuously reacts to changes in its input signals or to clock edges.
The sensitivity list dictates what kind of hardware the `always` block describes:
- `always @(posedge clk)`: This syntax specifies that the block should execute only on the positive (rising) edge of the signal named `clk`. This is the standard way to describe sequential logic, such as flip-flops and registers, which are memory elements that capture and store a value at a specific moment defined by a clock signal.
- `always @(*)`: The asterisk is a shorthand that tells the simulator to execute the block whenever any signal read within the block changes its value. This describes combinational logic—circuits like adders, multiplexers, or decoders whose outputs depend solely on their current inputs, with no memory of past states.
Because these constructs map directly to physical hardware components (clocked registers and logic gates), `always` blocks are the primary tool for describing the synthesizable behavior of a digital design.
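As a small sketch of the two forms side by side (names are illustrative; the choice of assignment operator inside each block is explained in the next section):

```verilog
// Sequential logic: a register that captures `d` on each rising clock edge.
module dff (
    input  wire clk,
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        q <= d;   // non-blocking assignment: infers a flip-flop
    end
endmodule

// Combinational logic: output recomputed whenever any input changes.
module and_or (
    input  wire a, b, c,
    output reg  y
);
    always @(*) begin
        y = (a & b) | c;   // blocking assignment: infers gates, no memory
    end
endmodule
```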
Section 1.3: The Heart of Synthesis - Blocking vs. Non-blocking Assignments#
Perhaps the most frequent and critical point of confusion for those transitioning from software to Verilog is the distinction between the two types of assignment operators: blocking (`=`) and non-blocking (`<=`). This is not a matter of stylistic preference; the choice of operator is a direct instruction to the synthesis tool about the type of hardware circuit to create. Misunderstanding this distinction is the leading cause of simulation-synthesis mismatches, where a design works perfectly in simulation but fails when implemented in actual hardware.
Blocking Assignments (`=`)#
A blocking assignment is executed in the order it appears within a procedural block, much like in a C program. The execution of the current statement “blocks” the execution of any subsequent statements in the same `begin...end` block until it is complete. The variable on the left-hand side is updated immediately, and this new value is used by all subsequent statements in the block.
This immediate-update behavior models a chain of combinational logic. Imagine a series of logic gates connected by wires. The output of the first gate is instantaneously available as the input to the second gate. Blocking assignments are therefore the correct choice for describing this type of logic, typically within an `always @(*)` block.
Non-blocking Assignments (`<=`)#
A non-blocking assignment operates in a two-phase manner that is fundamentally different from any software assignment. Within a block triggered by an event (like a clock edge), all the right-hand side (RHS) expressions of the non-blocking assignments are evaluated and stored in temporary variables first. Only after all RHS expressions have been evaluated does the second phase begin, where the left-hand side (LHS) variables are all updated simultaneously with their corresponding temporary values. The execution of one non-blocking assignment does not block the evaluation of the next.
This two-phase mechanism perfectly models the behavior of a bank of sequential logic elements, such as D-type flip-flops, that share a common clock. On a clock edge, all the flip-flops simultaneously sample the data at their D inputs. A short time later (the clock-to-Q delay), all their Q outputs change to reflect the newly captured values. The value of one flip-flop’s output at the beginning of the clock cycle determines the input of the next flip-flop, but the update doesn’t happen until the end of the cycle. Non-blocking assignments are therefore the correct and safe way to model state changes in sequential logic, and they should be used exclusively for assignments within a clocked `always @(posedge clk)` block.
Example: The Shift Register#
The difference becomes crystal clear with a simple 3-bit shift register example. The goal is to have a value at `data_in` shift one position to the right on each clock cycle: `data_in -> q1 -> q2 -> q3`.
Incorrect Version (using Blocking `=`):
```verilog
always @(posedge clk) begin
    q1 = data_in;
    q2 = q1;
    q3 = q2;
end
```
In simulation, on a single rising clock edge, `q1` is immediately updated with `data_in`. Because this is a blocking assignment, that new value of `q1` is then immediately used to update `q2`. And that new value of `q2` is immediately used to update `q3`. The result is that the value from `data_in` propagates all the way to `q3` within a single clock cycle. The synthesis tool will interpret this as a direct wire from `data_in` to `q3`, not a series of registers. This is not a shift register.
Correct Version (using Non-blocking `<=`):
```verilog
always @(posedge clk) begin
    q1 <= data_in;
    q2 <= q1;
    q3 <= q2;
end
```
On a rising clock edge, the simulator evaluates all RHS expressions first: `data_in`, the current value of `q1`, and the current value of `q2`. Then, at the end of the simulation time step, it updates the LHS variables simultaneously. `q1` gets the value of `data_in`, `q2` gets the old value of `q1`, and `q3` gets the old value of `q2`. This correctly models three separate flip-flops, and it takes three clock cycles for a value to propagate from `data_in` to `q3`. This is a true shift register.
Pitfalls and Best Practices#
Mixing blocking and non-blocking assignments in the same `always` block, or using the wrong type for the logic intended, can lead to indeterminate behavior known as a race condition. This occurs when the final state of a variable depends on the unpredictable order in which a simulator evaluates concurrent events. To avoid these issues and ensure a design that is both simulatable and synthesizable, designers adhere to strict rules of thumb:

- When modeling sequential logic (clocked `always` blocks), use non-blocking assignments (`<=`).
- When modeling combinational logic (`always @(*)` blocks), use blocking assignments (`=`).
- Do not mix blocking and non-blocking assignments in the same `always` block.
The underlying reason for these rules is to bridge the gap between the discrete event-scheduling model of a simulator and the continuous, physical reality of hardware. A non-blocking assignment is a directive to the simulator to schedule an update for the end of the current time step, which is how a synthesis tool understands the need for a memory element (a flip-flop) that holds a value across clock cycles. A blocking assignment directs the simulator to update a value immediately, which is how a synthesis tool understands a direct connection of logic gates whose output changes as soon as the input changes. Using the wrong operator creates a mismatch between what is simulated and what is built, which is the root cause of many hardware design bugs.
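A common idiom that encodes these rules is to split a design into a combinational next-state block and a clocked state register. Here is a minimal sketch of a hypothetical 4-bit counter written that way (all names are illustrative):

```verilog
// A 4-bit counter split into combinational next-state logic and a
// clocked state register, following the assignment rules above.
module counter (
    input  wire       clk,
    input  wire       rst,
    input  wire       enable,
    output reg  [3:0] count
);
    reg [3:0] next_count;

    // Combinational block: blocking assignments only.
    always @(*) begin
        next_count = count;              // default: hold the current value
        if (enable)
            next_count = count + 4'd1;   // increment when enabled
    end

    // Sequential block: non-blocking assignments only.
    always @(posedge clk) begin
        if (rst)
            count <= 4'd0;
        else
            count <= next_count;
    end
endmodule
```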
| Feature | Blocking Assignment (`=`) | Non-blocking Assignment (`<=`) |
|---|---|---|
| Operator | `=` | `<=` |
| Execution Model | Sequential, in-order execution within a block. Updates are immediate. | Parallel evaluation of RHS, followed by simultaneous update of LHS. |
| Hardware Inference | Combinational logic (wires, gates). | Sequential logic (flip-flops, registers). |
| Typical `always` Block | `always @(*)` | `always @(posedge clk)` |
| Use Case Example | `assign y = sel ? b : a;` (a continuous assignment) or `always @(*) begin temp = b + c; out = a & temp; end` | `always @(posedge clk) begin q <= d; end` |
Part II: The Blueprint of a CPU - The RISC-V ISA#
Having established the language for describing hardware, the next step is to understand the vocabulary that a processor speaks. This vocabulary is its Instruction Set Architecture (ISA), the fundamental interface between software and hardware. For this exploration, the RISC-V ISA provides an ideal foundation due to its modern, clean, and extensible design.
Section 2.1: An Introduction to Instruction Set Architectures (ISA)#
An ISA is the abstract model of a computer that is visible to a machine-language programmer or compiler. It is the definitive contract between the software that runs on a processor and the hardware that executes it. This contract specifies a set of critical elements, including:
- The set of available instructions (the “opcodes”).
- The native data types.
- The programmer-visible registers.
- The memory addressing modes.
- The handling of events like interrupts and exceptions.
Any processor that correctly implements a given ISA will execute the same machine code and produce the same results, regardless of its internal microarchitectural design. An Intel Core i9 and an AMD Ryzen processor, for example, have vastly different internal designs but can both run Windows because they both implement the x86-64 ISA.
Section 2.2: The RISC-V Revolution - Openness and Modularity#
RISC-V (pronounced “risk-five”) is not just another ISA; it represents a paradigm shift in how ISAs are developed and used. It was born at the University of California, Berkeley, in 2010 with the goal of creating a practical, high-quality ISA that was open, free, and suitable for a wide range of computing applications, from academic research to industrial deployment.
The RISC Philosophy#
At its core, RISC-V is a pure embodiment of the Reduced Instruction Set Computer (RISC) philosophy. This design approach contrasts with the Complex Instruction Set Computer (CISC) paradigm of architectures like x86. The core tenets of RISC, and by extension RISC-V, are:
- A small number of simple instructions: The instruction set is kept minimal, focusing on fundamental operations. More complex operations are built by combining these simple instructions.
- Fixed-length instruction encoding: All base instructions are the same length (32 bits), which dramatically simplifies the hardware required for instruction fetching and decoding.
- Load/Store architecture: The only instructions that access memory are explicit load and store operations. All arithmetic and logical operations are performed on operands held in processor registers. This simplifies the control logic and encourages efficient register usage by compilers.
- One instruction per cycle: The simplicity of the instructions is designed to allow for execution in a single clock cycle in a basic pipeline, which is key to achieving high performance.
This adherence to simplicity results in a more streamlined processor design, leading to improved performance, lower power consumption, and reduced design complexity.
Open and Free#
Unlike proprietary ISAs such as x86 and ARM, the RISC-V specification is developed and maintained by the non-profit RISC-V International and is available under open-source licenses. This means anyone can design, manufacture, and sell RISC-V chips and software without paying royalties. This openness has catalyzed a global wave of innovation, enabling startups, academic institutions, and even large corporations to develop custom processors tailored for specific applications without the barrier of licensing fees or vendor lock-in.
Modular Design#
A defining feature of RISC-V is its inherent modularity. The ISA is not a monolithic entity but is structured as a small, mandatory base integer ISA with a rich set of optional standard extensions. A processor’s full ISA is specified by its base and the extensions it implements. For instance, a common configuration for a 64-bit general-purpose processor is denoted RV64GC, which stands for RV64IMAFDC.
The base integer ISAs are:
- RV32I: The base 32-bit integer instruction set with 32 integer registers (x0-x31).
- RV64I: The base 64-bit integer instruction set, extending the registers and operations to 64 bits.
- RV32E: An embedded variant of `RV32I` with only 16 integer registers, designed for the smallest microcontrollers.
The most common standard extensions, often grouped under the letter ‘G’ for “General-Purpose,” are:
- M: Standard Extension for Integer Multiplication and Division. Adds instructions like `mul`, `div`, and `rem`.
- A: Standard Extension for Atomic Instructions. Provides instructions for atomic memory operations (e.g., `amoswap`), essential for synchronization in multi-core systems.
- F: Standard Extension for Single-Precision Floating-Point. Adds a separate floating-point register file (f0-f31) and instructions for 32-bit floating-point arithmetic.
- D: Standard Extension for Double-Precision Floating-Point. Extends the `F` extension with support for 64-bit floating-point operations.
- C: Standard Extension for Compressed Instructions. Defines 16-bit versions of the most common 32-bit instructions. This can significantly reduce code size and improve instruction fetch bandwidth, which is critical in memory-constrained embedded systems and for performance in high-end cores.
This modularity allows designers to create highly optimized processors. A tiny microcontroller for an IoT sensor might only implement `RV32EMC`, while a high-performance application processor in a data center might implement `RV64G` plus extensions for vector processing (V) and bit manipulation (B).
Section 2.3: Anatomy of a RISC-V Instruction#
All base RISC-V instructions are 32 bits long and fall into one of a few well-defined formats. The regularity of these formats is a key design feature that enables the simple, high-performance pipelines for which RISC architectures are known. The primary formats are:
- R-type (Register): Used for register-to-register operations like `add`, `sub`, `and`, `or`.

  ```plaintext
  | funct7 (7) | rs2 (5) | rs1 (5) | funct3 (3) | rd (5) | opcode (7) |
  ```

  - `opcode`: Defines the instruction type (e.g., `OP` for register-register arithmetic).
  - `rd`: The destination register.
  - `funct3`: Further specifies the operation (e.g., `ADD`/`SUB`).
  - `rs1`, `rs2`: The two source registers.
  - `funct7`: An additional field to differentiate operations (e.g., `ADD` from `SUB`).

- I-type (Immediate): Used for operations with an immediate value, including `addi`, and for load instructions like `lw`.

  ```plaintext
  | imm[11:0] (12) | rs1 (5) | funct3 (3) | rd (5) | opcode (7) |
  ```

  - `imm[11:0]`: A 12-bit signed immediate value.
  - `rs1`: The source register.
  - `rd`: The destination register.

- S-type (Store): Used for store instructions like `sw` (store word).

  ```plaintext
  | imm[11:5] (7) | rs2 (5) | rs1 (5) | funct3 (3) | imm[4:0] (5) | opcode (7) |
  ```

  - The 12-bit immediate is split to accommodate the two source register fields.
  - `rs1`: The base address register.
  - `rs2`: The register containing the data to be stored.

- B-type (Branch): Used for conditional branch instructions like `beq` (branch if equal). Similar to S-type, the immediate is split.

  ```plaintext
  | imm[12|10:5] (7) | rs2 (5) | rs1 (5) | funct3 (3) | imm[4:1|11] (5) | opcode (7) |
  ```

  - `rs1`, `rs2`: The registers to be compared.
  - `imm`: The signed branch offset, which is multiplied by 2 and added to the PC.

- U-type (Upper Immediate): Used for loading a 20-bit upper immediate value, as in `lui` (load upper immediate).

  ```plaintext
  | imm[31:12] (20) | rd (5) | opcode (7) |
  ```

- J-type (Jump): Used for unconditional jumps like `jal` (jump and link).

  ```plaintext
  | imm[20|10:1|11|19:12] (20) | rd (5) | opcode (7) |
  ```
The deliberate and consistent placement of the `opcode`, `rs1`, `rs2`, and `rd` fields across these formats is not an accident. It is a cornerstone of efficient RISC design. In a pipelined processor, the Instruction Decode (ID) stage must identify the source registers and read their values from the register file. Because `rs1` and `rs2` are always in the same bit positions for all instruction formats that use them (R, I, S, B), the decoder hardware is greatly simplified. It can begin reading from the register file before it has even finished fully decoding the instruction to determine the exact operation. This parallelism within the ID stage is a crucial enabler of the classic 5-stage RISC pipeline, a concept that forms the foundation of modern processor execution.
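This regularity shows up directly in decode hardware. The sketch below (illustrative, not taken from any particular core) slices the fixed-position fields out of a 32-bit instruction; note that the register-field slices are the same for every format that uses them:

```verilog
// Field extraction for a 32-bit RISC-V instruction. Because the register
// fields sit at fixed bit positions, these slices are format-independent.
module decode_fields (
    input  wire [31:0] instr,
    output wire [6:0]  opcode,
    output wire [4:0]  rd,
    output wire [2:0]  funct3,
    output wire [4:0]  rs1,
    output wire [4:0]  rs2,
    output wire [6:0]  funct7,
    output wire [31:0] imm_i    // sign-extended I-type immediate
);
    assign opcode = instr[6:0];
    assign rd     = instr[11:7];
    assign funct3 = instr[14:12];
    assign rs1    = instr[19:15];
    assign rs2    = instr[24:20];
    assign funct7 = instr[31:25];
    // I-type immediate: bits [31:20], sign-extended to 32 bits.
    assign imm_i  = {{20{instr[31]}}, instr[31:20]};
endmodule
```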
Part III: The Assembly Line - Pipelined Execution and Its Perils#
To achieve high performance, modern processors do not execute instructions one at a time, waiting for each to complete before starting the next. Instead, they use a technique called pipelining, which overlaps the execution of multiple instructions, much like an assembly line in a factory. This approach is fundamental to all high-performance CPUs, and the RISC-V ISA is explicitly designed to facilitate it.
Section 3.1: The Classic 5-Stage RISC Pipeline#
Pipelining increases the instruction throughput—the number of instructions completed per unit of time—without necessarily decreasing the latency of any single instruction. The concept is best understood through the analogy of doing laundry. A sequential approach would be to wash, dry, fold, and put away one load of laundry completely before starting the next. A pipelined approach starts the washer on the second load as soon as the first load moves to the dryer. By keeping all stages (washer, dryer, folding table) busy, the total time to complete many loads is significantly reduced.
Similarly, the execution of a RISC instruction can be broken down into a series of uniform steps. The classic RISC pipeline consists of five stages:
- IF (Instruction Fetch): The processor fetches the 32-bit instruction from the instruction memory (or cache) at the address currently held by the Program Counter (PC). Concurrently, the PC is updated to point to the next instruction, which is typically at address PC + 4, since each instruction is 4 bytes long.

- ID (Instruction Decode and Register Fetch): The fetched instruction is decoded by the control unit to determine what operation to perform. The format of the instruction is identified, and the required control signals for subsequent stages are generated. Simultaneously, the source register identifiers (`rs1` and `rs2`) are used to read their corresponding values from the processor’s register file. Any immediate value in the instruction is also sign-extended and prepared for use.

- EX (Execute): This is where the actual computation occurs. The Arithmetic Logic Unit (ALU) performs the operation specified by the instruction. This could be an arithmetic operation (`add`, `sub`), a logical operation (`and`, `or`), a memory address calculation for a load or store (by adding the base register and the immediate offset), or a comparison for a branch instruction.

- MEM (Memory Access): This stage is active only for load and store instructions. For a load instruction (`lw`), the address calculated in the EX stage is used to read data from the data memory (or cache). For a store instruction (`sw`), the address and data are used to write to the data memory. For all other instructions (e.g., arithmetic or branch), this stage performs no operation.

- WB (Write-Back): The final stage writes the result of the operation back into the register file. For an arithmetic instruction, the result comes from the ALU. For a load instruction, the result is the data read from memory. The destination register identifier (`rd`) from the instruction determines which register is written.
In an ideal scenario, a new instruction enters the IF stage every clock cycle. After five cycles, the pipeline is full, and one instruction completes every cycle, achieving an ideal throughput of one instruction per cycle (IPC).
Section 3.2: When the Assembly Line Breaks - Pipeline Hazards#
The simple, elegant model of the 5-stage pipeline breaks down when dependencies between instructions conflict with the overlapped execution model. These conflicts are known as pipeline hazards, and they are the primary challenge in processor design. Hazards force the pipeline to stall, inserting “bubbles” where no useful work is done, thereby degrading performance. There are three main types of hazards.
Structural Hazards#
A structural hazard occurs when two or more instructions in the pipeline require the same hardware resource at the same time. A classic example is a processor with a single, unified memory for both instructions and data. In such a design, a load instruction in its MEM stage would need to access memory simultaneously with a later instruction in its IF stage, which also needs to access memory to be fetched. This resource conflict would force one of the instructions to wait. The standard solution in RISC processors is to use a Harvard architecture, which employs separate, independent memories or caches for instructions and data, thus eliminating this specific hazard. Another potential structural hazard is in the register file, which is accessed for reads in the ID stage and for writes in the WB stage. This is typically resolved by designing the register file with separate read and write ports, or by performing writes in the first half of the clock cycle and reads in the second half.
Data Hazards#
Data hazards arise from data dependencies between instructions. They occur when an instruction’s execution depends on the result of a preceding instruction that is still in the pipeline.
- Read-After-Write (RAW): This is the most common and intuitive data hazard. An instruction attempts to read a source register before a previous instruction has written its result back to that register. Consider the sequence:

  ```plaintext
  add x5, x1, x2 // Instruction 1
  sub x6, x5, x3 // Instruction 2
  ```

  The `sub` instruction needs the value of `x5`, but the `add` instruction only calculates it in its EX stage and writes it back in its WB stage. By the time the `sub` instruction is in its ID stage ready to read `x5`, the `add` instruction has not yet completed its WB stage, so the register file contains an old, stale value for `x5`.

- Write-After-Read (WAR): An instruction tries to write to a destination register before a preceding instruction has finished reading that register’s original value. This is not a problem in the simple 5-stage pipeline because reads always happen in an earlier stage (ID) than writes (WB). However, it becomes a major issue in processors with out-of-order execution.
- Write-After-Write (WAW): Two instructions in the pipeline are scheduled to write to the same destination register. Similar to WAR, this is not an issue in a simple in-order pipeline where writes happen in program order, but it is a critical hazard that must be managed in more complex designs.
Control Hazards#
Control hazards, also known as branch hazards, are caused by branch and jump instructions that change the normal flow of program execution. The processor does not know the outcome of a conditional branch (whether it is taken or not taken) until the comparison is performed in the EX stage. By that time, the processor has already fetched and started decoding the instructions that sequentially follow the branch (at PC + 4 and beyond). If the branch is taken, these fetched instructions are incorrect and must be flushed from the pipeline, and the fetch must restart from the branch target address. This flushing process introduces stalls, or bubbles, into the pipeline, reducing performance.
Section 3.3: Basic Hazard Resolution - Stalling and Forwarding#
To ensure correct program execution, hazards must be detected and resolved by the processor’s control logic.
Stalling (Pipeline Bubbles)#
The most straightforward solution to a hazard is to stall the pipeline. When the hazard detection logic in the ID stage identifies a dependency (e.g., a RAW hazard), it can freeze the early stages of the pipeline and insert no-operation instructions, or “bubbles,” into the later stages. For the `add`/`sub` example above, the `sub` instruction would be held in the ID stage for several cycles until the `add` instruction completes its WB stage and the new value of `x5` is available in the register file. While simple and effective, stalling is inefficient as it directly reduces the pipeline’s throughput.
Forwarding (Bypassing)#
A much more efficient solution for most data hazards is forwarding, also known as bypassing. The key observation is that the result of an operation is often available within the pipeline long before it is written back to the register file. For example, the result of the `add` instruction is available at the output of the ALU at the end of the EX stage. Forwarding logic adds extra data paths to send this result directly from the output of a later stage (like EX or MEM) back to the input of an earlier stage (like EX) for a subsequent, dependent instruction. This bypasses the need to wait for the result to be written to and then read from the register file. In the `add`/`sub` example, the result from the `add` instruction’s EX stage can be forwarded directly to the input of the `sub` instruction’s EX stage, completely eliminating the stall.
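In hardware terms, forwarding amounts to a multiplexer in front of each ALU input plus comparators on register numbers. Here is a minimal sketch for one ALU operand (all signal names are illustrative):

```verilog
// Forwarding mux for one ALU operand. If a later pipeline stage is about
// to write the register this instruction reads, bypass the register file.
module forward_mux (
    input  wire [4:0]  rs1_ex,       // source register of instr in EX
    input  wire [4:0]  rd_mem,       // destination of instr in MEM
    input  wire        regwrite_mem, // MEM-stage instr writes a register
    input  wire [4:0]  rd_wb,        // destination of instr in WB
    input  wire        regwrite_wb,  // WB-stage instr writes a register
    input  wire [31:0] regfile_val,  // value read from the register file
    input  wire [31:0] alu_mem_val,  // ALU result now in MEM stage
    input  wire [31:0] wb_val,       // value being written back in WB
    output reg  [31:0] operand       // forwarded value fed to the ALU
);
    always @(*) begin
        if (regwrite_mem && (rd_mem != 5'd0) && (rd_mem == rs1_ex))
            operand = alu_mem_val;   // newest value wins: forward from MEM
        else if (regwrite_wb && (rd_wb != 5'd0) && (rd_wb == rs1_ex))
            operand = wb_val;        // otherwise forward from WB
        else
            operand = regfile_val;   // no hazard: use register file value
    end
endmodule
```

The MEM-stage result takes priority over the WB-stage result because it is the more recent value in program order, and the check against register 0 avoids forwarding for RISC-V’s hard-wired zero register.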
However, forwarding cannot solve all data hazards. A classic case is the load-use hazard. Consider this sequence:
```plaintext
lw x5, 0(x1)   // Instruction 1
add x6, x5, x2 // Instruction 2
```

The `lw` instruction only has the data from memory available at the end of its MEM stage. The `add` instruction needs this data at the beginning of its EX stage. Even with a forwarding path from the MEM stage back to the EX stage, the data arrives one cycle too late. The `add` instruction must be stalled for one cycle. This limitation, along with the performance penalty from control hazards and the inefficiency of handling long-latency operations like floating-point division, reveals the inherent performance ceiling of a rigid, in-order pipeline. It is this ceiling that motivates the development of more sophisticated, dynamic execution techniques that can look further ahead in the instruction stream to find independent work to do.
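The load-use check itself is a small piece of combinational logic. A sketch with illustrative names: if the instruction in EX is a load whose destination matches either source register of the instruction in ID, assert a one-cycle stall:

```verilog
// Load-use hazard detection: stall IF/ID for one cycle when the
// instruction in EX is a load producing a register that the ID-stage
// instruction is about to read.
module load_use_detect (
    input  wire       memread_ex, // instr in EX is a load
    input  wire [4:0] rd_ex,      // load's destination register
    input  wire [4:0] rs1_id,     // ID-stage source register 1
    input  wire [4:0] rs2_id,     // ID-stage source register 2
    output wire       stall       // freeze PC/IF-ID, insert a bubble in EX
);
    assign stall = memread_ex && (rd_ex != 5'd0) &&
                   ((rd_ex == rs1_id) || (rd_ex == rs2_id));
endmodule
```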
| Hazard Type | Description | Example RISC-V Sequence | Simple Pipeline Effect | Solution(s) |
|---|---|---|---|---|
| Structural | Two instructions need the same resource in the same cycle. | `lw` in MEM stage, `add` in IF stage, both needing a unified memory. | One instruction must stall. | Separate Instruction/Data Memories (Harvard Architecture). |
| Data (RAW) | An instruction needs the result of a previous, unfinished instruction. | `add x5, x1, x2` followed by `sub x6, x5, x3` | `sub` reads a stale value of `x5` from the register file. | Stalling, Forwarding (Bypassing). |
| Control | The address of the next instruction is unknown due to a branch. | `beq x1, x2, L1` followed by `add x3, x4, x5` | Processor fetches `add` before knowing if the branch to `L1` is taken. | Stall until branch resolves, Branch Prediction, Flush incorrect path. |
Part IV: The Brains of the Operation - Dynamic Scheduling with Tomasulo’s Algorithm#
The limitations of in-order pipelining become severe in the presence of long-latency operations (like floating-point arithmetic or cache misses) and frequent data dependencies. Stalls can quickly dominate the execution time, leaving valuable functional units idle. To overcome this, high-performance processors employ dynamic scheduling, a technique that allows instructions to execute out of their original program order. The seminal hardware algorithm for this is Tomasulo’s algorithm, first implemented in the IBM System/360 Model 91.
Section 4.1: Beyond In-Order Execution#
The core idea behind dynamic scheduling is to shift from a control-flow-driven execution model to a dataflow-driven one. In a simple pipeline, an instruction executes when it reaches the front of the line. In a dynamically scheduled machine, an instruction is allowed to execute as soon as all of its required operands are available, regardless of its position in the original program sequence. This decoupling of instruction issue (fetching and decoding) from execution allows the processor to look ahead in the instruction stream, find independent instructions, and execute them while a prior, dependent instruction is stalled waiting for its data. This significantly increases the utilization of the processor’s multiple execution units and improves overall performance.
Section 4.2: Core Components of the Tomasulo Machine#
Tomasulo’s algorithm achieves this dataflow execution through three key hardware components that work in concert.
Reservation Stations (RS)#
Instead of a single pipeline, a Tomasulo-based processor has a set of functional units (e.g., one or more adders, multipliers, load/store units), each equipped with its own set of buffers called Reservation Stations (RS). When an instruction is decoded, it is issued to a free reservation station associated with the required functional unit. The RS acts as a waiting area, holding the instruction until it is ready to execute.
Each entry in a reservation station contains the following fields:
- Busy: A bit indicating whether the station is in use.
- Op: The operation to be performed (e.g., `ADD`, `MUL`).
- Vj, Vk: The actual values of the two source operands. These fields are filled if the operand values are already available in the register file when the instruction is issued.
- Qj, Qk: The source operand tags. If an operand is not yet available because it is being produced by another instruction currently in-flight, these fields will hold a tag that identifies which reservation station will produce the required result. A value of zero or null in these fields indicates that the corresponding `V` field holds a valid operand.
- Dest: A tag identifying the destination of the result (in modern implementations, this is a pointer to a Reorder Buffer entry).
The RS continuously monitors for its required operands. Once both `Qj` and `Qk` are zero (meaning `Vj` and `Vk` are both valid), the instruction is ready to be dispatched to its functional unit for execution.
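A single reservation-station entry can be sketched directly in Verilog. This is a simplified, illustrative model (dispatch, flush, and CDB capture are omitted; the zero-tag convention follows the description above):

```verilog
// One reservation-station entry, reduced to the fields described above.
// By convention here, a tag of all zeros means "no producer pending",
// so a cleared Q field marks the matching V field as valid.
module rs_entry #(parameter TAG_W = 4) (
    input  wire             clk,
    input  wire             issue,         // allocate this entry
    input  wire [3:0]       op_in,         // operation code (Op)
    input  wire [31:0]      vj_in, vk_in,  // operand values (Vj, Vk)
    input  wire [TAG_W-1:0] qj_in, qk_in,  // producer tags (Qj, Qk)
    output wire             ready          // both operands are valid
);
    reg             busy;
    reg [3:0]       op;
    reg [31:0]      vj, vk;
    reg [TAG_W-1:0] qj, qk;

    always @(posedge clk) begin
        if (issue) begin
            busy <= 1'b1;
            op   <= op_in;
            vj   <= vj_in;  qj <= qj_in;
            vk   <= vk_in;  qk <= qk_in;
        end
    end

    // Ready to dispatch once no producer tags remain outstanding.
    assign ready = busy && (qj == {TAG_W{1'b0}}) && (qk == {TAG_W{1'b0}});
endmodule
```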
The Common Data Bus (CDB)#
The Common Data Bus (CDB) is a broadcast bus that connects the outputs of all functional units to the inputs of all reservation stations and the register file. When a functional unit finishes its computation, it does not just write the result to a register. Instead, it places both the computed value and its unique tag (the name of the reservation station that produced it) onto the CDB.
All reservation stations are “snooping” (monitoring) the CDB in every cycle. If an RS sees a tag on the CDB that matches a tag in its `Qj` or `Qk` field, it knows its long-awaited operand is now available. It grabs the value from the CDB, places it into the corresponding `Vj` or `Vk` field, and clears the `Qj` or `Qk` field to zero. This mechanism allows results to be forwarded directly from producer to consumer without ever needing to pass through the register file, dramatically reducing stalls from RAW dependencies.
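The snooping itself is just a tag comparison on every pending operand, every cycle. Here is a self-contained sketch for a single operand slot (names and widths are illustrative):

```verilog
// CDB snooping for one reservation-station operand (illustrative).
// A pending tag in `q` is replaced by the broadcast value when the
// tags match; q == 0 then marks the operand as valid.
module operand_snoop #(parameter TAG_W = 4) (
    input  wire             clk,
    input  wire             cdb_valid, // a result is on the bus this cycle
    input  wire [TAG_W-1:0] cdb_tag,   // tag of the producing station
    input  wire [31:0]      cdb_value, // the computed result
    input  wire             capture,   // load a new tag/value at issue
    input  wire [TAG_W-1:0] q_in,
    input  wire [31:0]      v_in,
    output reg  [TAG_W-1:0] q,         // 0 => v holds a valid operand
    output reg  [31:0]      v
);
    always @(posedge clk) begin
        if (capture) begin
            q <= q_in;                 // issue: record producer tag
            v <= v_in;                 // or the already-available value
        end else if (cdb_valid && (q != {TAG_W{1'b0}}) && (q == cdb_tag)) begin
            v <= cdb_value;            // grab the result off the bus
            q <= {TAG_W{1'b0}};        // clear tag: operand now valid
        end
    end
endmodule
```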
Hardware Register Renaming#
Out-of-order execution introduces the possibility of WAR and WAW hazards, which were not a problem in the simple in-order pipeline. Tomasulo’s algorithm elegantly eliminates these hazards through a mechanism called hardware register renaming.
Why are WAR and WAW hazards a problem in out-of-order execution? You can think it through yourself, or read Appendix A.
The key is to decouple the architectural registers (the names visible to the programmer, e.g., F0, F2, F4) from the physical storage locations (the reservation stations). A mapping table, often called the Register Alias Table (RAT) or Register Result Status, maintains the current mapping. For each architectural register, this table stores the tag of the reservation station that will produce the next value for that register.
The process works as follows:
- Issue: When an instruction like `ADD.D F6, F8, F2` is issued, the control logic looks up `F8` and `F2` in the RAT.
  - If the RAT entry for a source register is empty, the value is ready in the main register file. This value is copied to the `V` field of the reservation station.
  - If the RAT entry contains a tag (e.g., `Add1`), it means another instruction is currently computing the value. This tag is copied into the `Q` field of the new reservation station.
- Rename: After reading the source tags, the logic updates the RAT entry for the destination register, `F6`, with the tag of the newly allocated reservation station (e.g., `Add2`). Now, any subsequent instruction that needs `F6` will be directed to get its value from `Add2`.
This renaming process breaks false dependencies. If a later instruction also writes to `F6` (a WAW hazard), it will simply be allocated a new reservation station (`Add3`), and the RAT will be updated to point to `Add3`. The original `ADD.D` instruction is unaffected because it is already linked to `Add2`. Similarly, WAR hazards are eliminated because source operands either get their value immediately or are linked to a specific producer via a tag; a subsequent write to that source register will be renamed to a new physical location and will not affect the original value needed by the earlier instruction.
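The RAT itself can be sketched as a small tag memory (an illustrative model; reset and branch-recovery logic are omitted):

```verilog
// Register Alias Table sketch: one tag per architectural register.
// A tag of zero means "the value is in the register file".
module rat #(parameter TAG_W = 4) (
    input  wire             clk,
    input  wire [4:0]       rs1, rs2,  // source registers to look up
    input  wire             rename,    // issue: claim the destination
    input  wire [4:0]       rd,        // destination register
    input  wire [TAG_W-1:0] new_tag,   // tag of the newly allocated RS
    output wire [TAG_W-1:0] rs1_tag,   // producer of rs1 (0 = none)
    output wire [TAG_W-1:0] rs2_tag    // producer of rs2 (0 = none)
);
    reg [TAG_W-1:0] map [0:31];

    // Lookup: a nonzero tag names the in-flight producer of the value.
    assign rs1_tag = map[rs1];
    assign rs2_tag = map[rs2];

    // Rename: subsequent readers of rd are redirected to the new producer,
    // which is what breaks WAR and WAW dependencies on the old name.
    always @(posedge clk)
        if (rename)
            map[rd] <= new_tag;
endmodule
```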
Section 4.3: A Cycle-by-Cycle Walkthrough of Tomasulo’s Algorithm#
To solidify these concepts, a detailed, cycle-by-cycle trace of a sequence of dependent instructions is invaluable. This walkthrough will demonstrate the dynamic interplay between the reservation stations, the RAT, and the CDB.
Simulation Setup:#
- Functional Units:
  - 1 Integer Unit (for effective address calculation): 1 cycle latency.
  - 2 FP Adders (for `ADD.D`, `SUB.D`): 2 cycles latency.
  - 2 FP Multipliers (for `MUL.D`): 10 cycles latency.
  - 1 FP Divider (for `DIV.D`): 40 cycles latency.
- Instruction Issue: 1 instruction per cycle.
- CDB: 1 result can be broadcast per cycle.
- Reservation Stations: 3 for Add/Sub, 2 for Mult/Div.
Example Instruction Sequence:#
```plaintext
L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
```

Initial State:#
- Register File: R2=100, R3=200, F4=2.0. All other FP registers have some initial value.
- Memory: Mem[134]=10.0, Mem[245]=5.0.
- All Reservation Stations are empty.
- Register Result Status is empty (all values are in the Register File).
Cycle 1:#
- Events: `L.D F6, 34(R2)` is issued.
- Actions: A Load buffer (Load1) is allocated. The value of `R2` (100) is read from the integer register file. The effective address is calculated immediately: 34 + 100 = 134. The Register Result Status for `F6` is updated to point to `Load1`.
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6, 34(R2) | 1 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk | Address |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | 100 | 34 | | | |
| Load2 | No | | | | | | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
|---|---|---|---|---|---|---|
| Qi | | | | Load1 | | |
- CDB Activity: None.
Cycle 2:#
- Events: `L.D F2, 45(R3)` is issued. `Load1` begins memory access.
- Actions: A Load2 buffer is allocated. The value of `R3` (200) is read, and the effective address is calculated: 45 + 200 = 245. The Register Result Status for `F2` is updated to `Load2`.
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6, 34(R2) | 1 | 2 | |
| L.D F2, 45(R3) | 2 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk | Address |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | 100 | 34 | | | |
| Load2 | Yes | Load | 200 | 45 | | | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
|---|---|---|---|---|---|---|
| Qi | | Load2 | | Load1 | | |
- CDB Activity: None.
Cycle 3:#
- Events: `MUL.D F0, F2, F4` is issued. `Load1` completes memory access. `Load2` begins memory access.
- Actions: A multiplier RS (Mult1) is allocated. The RAT is checked for sources `F2` and `F4`. `F2` is being produced by `Load2`, so `Qj` of `Mult1` gets tag `Load2`. `F4` is ready in the register file, so its value (2.0) is copied to `Vk`. The RAT entry for destination `F0` is updated to `Mult1`.
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6, 34(R2) | 1 | 2 | 3 |
| L.D F2, 45(R3) | 2 | 3 | |
| MUL.D F0, F2, F4 | 3 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk | Address |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | … | | | | |
| Load2 | Yes | Load | … | | | | |
| Mult1 | Yes | MUL.D | | 2.0 | Load2 | | |
| Add1 | No | | | | | | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | | |
- CDB Activity: `Load1` broadcasts result Mem[134] (value 10.0) with tag `Load1`.
  - Snooping: No waiting RS needs `Load1` yet. The value is written into `F6` in the register file, and the RAT tag for `F6` is cleared.
Cycle 4:#
- Events: `SUB.D F8, F6, F2` is issued. `Load2` completes memory access.
- Actions: An adder RS (Add1) is allocated. The RAT is checked for `F6` and `F2`. `F6` is now ready (value 10.0 from `Load1`’s broadcast), so `Vj` gets 10.0. `F2` is still being produced by `Load2`, so `Qk` gets tag `Load2`. The RAT entry for `F8` is updated to `Add1`.
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6, 34(R2) | 1 | 2 | 3 |
| L.D F2, 45(R3) | 2 | 3 | 4 |
| MUL.D F0, F2, F4 | 3 | | |
| SUB.D F8, F6, F2 | 4 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk |
|---|---|---|---|---|---|---|
| Mult1 | Yes | MUL.D | | 2.0 | Load2 | |
| Add1 | Yes | SUB.D | 10.0 | | | Load2 |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | | Add1 | |
- CDB Activity: `Load2` broadcasts result Mem[245] (value 5.0) with tag `Load2`.
  - Snooping: Both `Mult1` and `Add1` are waiting for `Load2`. They both snoop the CDB, capture the value 5.0, and clear their `Q` fields. `Mult1`’s `Vj` becomes 5.0. `Add1`’s `Vk` becomes 5.0.
Cycle 5:#
- Events: `DIV.D F10, F0, F6` is issued. Both `Mult1` and `Add1` are now ready to execute.
- Actions: A divider RS (Div1) is allocated. The RAT is checked for `F0` and `F6`. `F0` is being produced by `Mult1`; `F6` is ready. `Div1` gets tag `Mult1` in `Qj` and value 10.0 in `Vk`. The RAT entry for `F10` is updated to `Div1`. `Mult1` begins its 10-cycle execution (5.0 * 2.0). `Add1` begins its 2-cycle execution (10.0 - 5.0). Note the out-of-order execution: `SUB.D` starts before `MUL.D`.
| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| … | … | … | … |
| MUL.D F0, F2, F4 | 3 | 5 | |
| SUB.D F8, F6, F2 | 4 | 5 | |
| DIV.D F10, F0, F6 | 5 | | |

| Reservation Stations | Busy | Op | Vj | Vk | Qj | Qk |
|---|---|---|---|---|---|---|
| Mult1 | Yes | MUL.D | 5.0 | 2.0 | | |
| Add1 | Yes | SUB.D | 10.0 | 5.0 | | |
| Div1 | Yes | DIV.D | | 10.0 | Mult1 | |

| Register Result Status | F0 | F2 | F4 | F6 | F8 | F10 |
|---|---|---|---|---|---|---|
| Qi | Mult1 | | | | Add1 | Div1 |
- CDB Activity: None.
…This process continues. The `SUB.D` will finish in cycle 6 and broadcast its result. The `ADD.D` (instruction 6) will issue and wait for results from `Add1` and `Load2`. The `MUL.D` will finish in cycle 14 and broadcast, allowing the `DIV.D` to start its long 40-cycle execution. This detailed trace reveals how the hardware dynamically resolves dependencies and executes instructions as soon as their data is ready, maximizing parallelism.
Section 4.4: Taming the Chaos with the Reorder Buffer (ROB)#
While Tomasulo’s algorithm is brilliant at extracting instruction-level parallelism, its out-of-order completion creates a significant problem: it makes handling exceptions and branch mispredictions incredibly difficult. If `SUB.D` completes before `MUL.D`, but `MUL.D` causes an arithmetic exception, the machine state is inconsistent: the processor has already modified state (`F8`) from an instruction that is logically after the faulting instruction. This is called an imprecise exception, and it makes operating systems and recovery mechanisms nearly impossible to implement correctly.
The solution is to add a new hardware structure, the Reorder Buffer (ROB), which extends the original algorithm to ensure that while instructions may execute out of order, they commit their results to the architectural state (the main register file and memory) in strict program order.
ROB Mechanism#
The ROB is a circular buffer that operates on a First-In, First-Out (FIFO) basis. It bridges the gap between out-of-order execution completion and in-order architectural update.
- Issue: When an instruction is decoded, it is allocated an entry at the tail of the ROB. This ROB entry number becomes the instruction’s new tag. The register renaming table (RAT) now points to ROB entries, not reservation stations.
- Execute: Instructions are still sent to reservation stations and execute out-of-order as before.
- Write Result: When a functional unit completes, it broadcasts its result and its ROB tag on the CDB. The result is written into the corresponding entry in the ROB, not the register file. The ROB entry is marked as “ready”. Any waiting reservation stations also snoop the CDB and grab the result.
- Commit: The processor examines the instruction at the head of the ROB. If its entry is marked “ready,” the instruction is committed. This means its result is finally written from the ROB to the architectural register file or memory. The instruction is then removed from the ROB (the head pointer advances). If the instruction at the head is not yet ready, the commit stage stalls, and no subsequent instructions can be committed, thus enforcing in-order retirement.
Each entry in the ROB typically contains these fields:
- Busy: Indicates if the entry is valid.
- Instruction Type: Specifies if it’s a branch, store, or register operation.
- State: Tracks the instruction’s progress (e.g., `Issue`, `Execute`, `WriteResult`, `Commit`).
- Destination: The architectural register number or memory address to be written.
- Value: The computed result, held here until commit.
- Ready: A bit indicating the result is valid in the `Value` field.
- Exception: Stores any exception information generated during execution.
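To put the commit rule in hardware terms, here is a minimal sketch of head-of-ROB commit logic (illustrative names; the ROB storage, tail-pointer allocation, and mispredict flush are omitted):

```verilog
// Head-of-ROB commit logic: retire at most one instruction per cycle,
// strictly from the head, and only when its result is ready.
module rob_commit #(parameter IDX_W = 4) (
    input  wire             clk,
    input  wire             rst,
    input  wire             head_busy,   // head entry holds an instruction
    input  wire             head_ready,  // its result has been written
    input  wire             head_excp,   // it recorded an exception
    output reg  [IDX_W-1:0] head,        // head pointer into the ROB
    output wire             commit,      // write Value to architectural state
    output wire             raise_excp   // take the exception precisely here
);
    // Only the oldest instruction may retire, and only once it is ready;
    // an exception is acted on only when its instruction reaches the head.
    assign commit     = head_busy && head_ready && !head_excp;
    assign raise_excp = head_busy && head_ready &&  head_excp;

    always @(posedge clk) begin
        if (rst)
            head <= {IDX_W{1'b0}};
        else if (commit)
            head <= head + 1'b1;   // advance in strict program order
    end
endmodule
```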
Section 4.5: Achieving Precise Exceptions and Speculation#
The addition of the ROB is the key that unlocks two of the most powerful features of modern high-performance processors: precise exceptions and speculative execution.
Precise Exceptions#
The ROB provides a simple and elegant mechanism for handling exceptions precisely. When an instruction (e.g., a `DIV.D` by zero) encounters an exception during its execution, the exception is not acted upon immediately. Instead, the exception status is simply recorded in the `Exception` field of the instruction’s entry in the ROB. The processor continues to execute and complete other instructions out of order. The exception is only handled when the faulting instruction reaches the head of the ROB and is ready to be committed. At that point, the processor knows the exception is not speculative and is the next one to occur in the program’s sequential order. It can then flush the entire pipeline and ROB, save a precise state, and jump to the operating system’s exception handler.
Branch Speculation#
The ROB is also the enabler of efficient branch speculation. When the processor encounters a branch, a branch predictor guesses the outcome. The processor then speculatively fetches, issues, and executes instructions from the predicted path, filling the ROB with these speculative instructions.
- If the prediction is correct: The branch instruction eventually reaches the head of the ROB and is committed. The speculative instructions that follow it then commit normally as they reach the head. No time was lost.
- If the prediction is incorrect: When the branch instruction is finally executed and the misprediction is discovered, the processor performs a recovery. It flushes all speculative instructions from the pipeline, reservation stations, and the ROB (this is as simple as resetting the ROB’s tail pointer to its head pointer). No architectural state was corrupted because none of the speculative instructions were ever committed. The processor then begins fetching from the correct path.
The combination of a Tomasulo-style dataflow execution core with a Reorder Buffer for in-order commit forms the foundation of virtually all modern high-performance, out-of-order (OOO) processors. This two-part architecture elegantly solves the problems of data dependencies, false dependencies, and imprecise state, allowing for a massive increase in instruction-level parallelism.
Appendix A#
WAR Hazard#
A WAR hazard, or “anti-dependence,” happens when an instruction wants to write to a register before an earlier instruction has finished reading that register’s original value.
Here is a simple example:
```plaintext
# Instruction 1 w/ long latency
1. FMUL.D F2, F4, F6  // Multiplies F4 and F6, result goes to F2
# Instruction 2 w/ short latency, independent operands
2. FADD.D F4, F8, F10 // Adds F8 and F10, result goes to F4
```

A Tomasulo processor’s goal is to maximize performance by executing instructions as soon as their operands are ready.
- Instruction 1 (FMUL.D) is issued. Let’s say it’s a long operation that will take 10 cycles. It needs to read F4.
- Instruction 2 (FADD.D) is issued right after. The processor sees its source operands (F8 and F10) are ready and that the addition functional unit is free. It’s a short 2-cycle operation.
The processor executes FADD.D immediately, without waiting for the FMUL.D to finish.
So here is the WAR hazard: The FADD.D finishes in 2 cycles and wants to write its result to register F4. But the FMUL.D instruction hasn’t even started its long execution yet and still needs the original value from F4! If the FADD.D were allowed to write to the actual architectural register F4, it would corrupt the input for the FMUL.D, leading to an incorrect program result.
WAW Hazard#
A WAW hazard, or “output dependence,” happens when two different instructions want to write to the same destination register, and the instruction that came later in the program finishes execution first.
Here is a simple example:
```plaintext
1. FMUL.D F2, F4, F6  // Writes to F2
2. FADD.D F2, F8, F10 // Also writes to F2
```

The correct final value in F2 should be the result of the FADD.D instruction, since it comes later in the program.
- The FADD.D (Instruction 2) is short and finishes in 2 cycles. It’s ready to write its result.
- The FMUL.D (Instruction 1) is long and finishes 10 cycles later.
So here is the WAW hazard: If the FADD.D writes its result to F2, and then 8 cycles later the FMUL.D also writes its result to F2, the final value in the register will be from the FMUL.D. This is incorrect! The result from the instruction that was supposed to happen first has overwritten the result from the instruction that was supposed to happen last.