

## Lecture 21

Multicycle Processor



Digital Design and Computer Architecture: ARM<sup>®</sup> Edition © 2015




#### • Single-cycle:

+ simple

- cycle time limited by longest instruction (LDR)
- separate memories for instruction and data
- 3 adders/ALUs

Multicycle processor addresses these issues by breaking each instruction into shorter steps:

- shorter instructions take fewer steps
- hardware can be re-used across steps
- cycle time is shorter








#### • Multicycle:

+ higher clock speed
+ simpler instructions run faster
+ reuse expensive hardware on multiple cycles

- sequencing overhead paid many times

#### • Same design steps as single-cycle:

- first datapath
- then control





# Multicycle State Elements

Replace Instruction and Data memories with a single unified memory – more realistic








## Multicycle Datapath: Instruction Fetch

#### **STEP 1:** Fetch instruction










## Multicycle Datapath: LDR Register Read

#### **STEP 2:** Read source operands from register file










## Multicycle Datapath: LDR Address

#### **STEP 3:** Compute the memory address








#### Multicycle Datapath: LDR Memory Read

#### **STEP 4:** Read data from memory








## Multicycle Datapath: LDR Write Register

#### **STEP 5:** Write data back to register file
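The five LDR steps can be traced with a small Python sketch. This is a toy model, not the actual hardware: the dict-based instruction encoding and the names `run_ldr`, `a`, `alu_out`, and `data` are assumptions made for illustration.

```python
# Toy model of the multicycle datapath executing LDR Rd, [Rn, #imm].
# ASSUMPTION: instructions are stored as dicts in a unified memory (a Python dict);
# the intermediate variable names are hypothetical stand-ins for datapath registers.

def run_ldr(mem, regs, pc):
    """Execute one LDR in five steps and return the updated PC."""
    # STEP 1: fetch the instruction from the unified memory.
    instr = mem[pc]
    pc = pc + 4                    # PC is incremented while the instruction is fetched
    # STEP 2: read the base register Rn from the register file.
    a = regs[instr["rn"]]
    # STEP 3: compute the memory address.
    alu_out = a + instr["imm"]
    # STEP 4: read data from memory.
    data = mem[alu_out]
    # STEP 5: write the data back to the register file.
    regs[instr["rd"]] = data
    return pc


mem = {0: {"op": "LDR", "rd": 2, "rn": 1, "imm": 8}, 0x108: 42}
regs = [0] * 16
regs[1] = 0x100
next_pc = run_ldr(mem, regs, pc=0)
print(regs[2], next_pc)  # 42 4
```

Each step uses only values produced by earlier steps, which is why the datapath can reuse one memory and one ALU across cycles.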








## Multicycle Datapath: Increment PC

#### Meanwhile: Increment PC

#### Concurrent with fetching instruction








## Multicycle Datapath: Access to PC

#### PC can be read/written by instruction

- Read: R15 (PC+8) available in Register File








## Multicycle Datapath: Read to PC (R15)

#### Example: ADD R1, R15, R2

- R15 needs to be read as PC+8 from Register File (RF) in 2<sup>nd</sup> step
- PC+4 was computed in 1<sup>st</sup> step
- So (also in 2<sup>nd</sup> step) ALU computes (PC+4) + 4 for R15 input
  - SrcA = PC (which was already updated in step 1 to PC+4)
  - SrcB = 4
  - ALUResult = PC + 8
- ALUResult is fed to R15 input port of RF in 2<sup>nd</sup> step (which is then routed to RD1 output of RF)
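The R15 read can be checked with a few lines of Python. This is a minimal sketch; the function name `r15_read_value` and its argument are assumptions, and only the arithmetic comes from the bullets above.

```python
def r15_read_value(pc_at_fetch):
    """Value seen when an instruction reads R15, per the two steps above."""
    pc = pc_at_fetch + 4         # step 1: PC was already updated to PC+4
    src_a = pc                   # step 2: SrcA = PC (now holding PC+4)
    src_b = 4                    # step 2: SrcB = 4
    alu_result = src_a + src_b   # ALUResult = PC + 8
    return alu_result


# Hypothetical instruction address 0x1000: ADD R1, R15, R2 reads R15 as 0x1008.
print(hex(r15_read_value(0x1000)))  # 0x1008
```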





## Multicycle Datapath: Access to PC

#### PC can be read/written by instruction

- Read: R15 (PC+8) available in Register File
- Write: Be able to write result of instruction to PC








## Multicycle Datapath: Write to PC (R15)

#### Example: SUB R15, R8, R3

- Result of instruction needs to be written to the PC register
- ALUResult already routed to the PC register, just assert PCWrite








## Multicycle Datapath: STR

#### Write data in Rd to memory








## Multicycle Datapath: Data-processing

#### With immediate addressing (i.e., an immediate *Src2*), no additional changes needed for datapath








## Multicycle Datapath: Data-processing

#### With register addressing (register *Src2*): Read from Rn and Rm








## Multicycle Datapath: B

#### Calculate branch target address: BTA = (*ExtImm*) + (PC+8)

*ExtImm* = *Imm24* << 2 and sign-extended
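The branch target calculation can be sketched in Python. The helper names `sign_extend` and `branch_target` are assumptions, as is the convention that `pc` is the address of the branch instruction itself and that addresses wrap at 32 bits.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def branch_target(pc, imm24):
    """BTA = ExtImm + (PC + 8), with ExtImm = Imm24 << 2, sign-extended."""
    ext_imm = sign_extend((imm24 & 0xFFFFFF) << 2, 26)  # 24-bit field shifted left by 2
    return (pc + 8 + ext_imm) & 0xFFFFFFFF              # ASSUMPTION: 32-bit wraparound


print(hex(branch_target(0x1000, 2)))         # 0x1010 (forward by 8 bytes from PC+8)
print(hex(branch_target(0x1000, 0xFFFFFF)))  # 0x1004 (Imm24 = -1, back 4 bytes from PC+8)
```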








# Multicycle Control








# Multicycle Control: Decoder



#### **ALU Decoder and PC Logic same as single-cycle**






# Multicycle Control: Instr Decoder

Instr Decoder: input $Op_{1:0}$; outputs $RegSrc_{1:0}$ and $ImmSrc_{1:0}$

$$RegSrc_0 = (Op == 10_2)$$

$$RegSrc_1 = (Op == 01_2)$$

$$ImmSrc_{1:0} = Op$$

| Instruction  | Op | Funct<sub>5</sub> | Funct<sub>0</sub> | RegSrc<sub>0</sub> | RegSrc<sub>1</sub> | ImmSrc<sub>1:0</sub> |
|--------------|----|-------------------|-------------------|--------------------|--------------------|----------------------|
| LDR          | 01 | X                 | 1                 | 0                  | X                  | 01                   |
| STR          | 01 | X                 | 0                 | 0                  | 1                  | 01                   |
| DP immediate | 00 | 1                 | X                 | 0                  | X                  | 00                   |
| DP register  | 00 | 0                 | X                 | 0                  | 0                  | 00                   |
| B            | 10 | X                 | X                 | 1                  | X                  | 10                   |
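The decoder equations are simple enough to express directly in Python. This is a sketch under the assumption that `op` is the 2-bit Op field as an integer; the function name `instr_decoder` is hypothetical. Where the table shows a don't-care (X), the equations still produce a concrete 0 or 1.

```python
def instr_decoder(op):
    """Return (RegSrc0, RegSrc1, ImmSrc) for a 2-bit Op field."""
    reg_src0 = int(op == 0b10)   # RegSrc0 = (Op == 10_2): set only for B
    reg_src1 = int(op == 0b01)   # RegSrc1 = (Op == 01_2): set for memory instructions
    imm_src = op                 # ImmSrc1:0 = Op
    return reg_src0, reg_src1, imm_src


for name, op in [("LDR/STR", 0b01), ("DP", 0b00), ("B", 0b10)]:
    print(name, instr_decoder(op))
```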






# Multicycle Control: Main FSM








# Main Controller FSM: Fetch








# Main Controller FSM: Decode








# Main Controller FSM: Address








#### Main Controller FSM: Read Memory








#### Multicycle ARM Processor








#### Main Controller FSM: LDR








#### Main Controller FSM: STR








#### Main Controller FSM: Data-processing








#### Multicycle Controller FSM

| State    | Datapath µOp               |
|----------|----------------------------|
| Fetch    | Instr ← Mem[PC]; PC ← PC+4 |
| Decode   | ALUOut ← PC+4              |
| MemAdr   | ALUOut ← Rn + Imm          |
| MemRead  | Data ← Mem[ALUOut]         |
| MemWB    | Rd ← Data                  |
| MemWrite | Mem[ALUOut] ← Rd           |
| ExecuteR | ALUOut ← Rn op Rm          |
| ExecuteI | ALUOut ← Rn op Imm         |
| ALUWB    | Rd ← ALUOut                |
| Branch   | PC ← R15 + offset          |
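The state list can be turned into a small lookup that traces the micro-operations for each instruction class. The micro-operations are copied from the list above; the per-instruction state order in `STATE_SEQUENCE` is an assumption inferred from the state names, since the FSM diagram itself is not reproduced here.

```python
# Micro-operation performed in each FSM state (taken from the state list above).
MICRO_OPS = {
    "Fetch":    "Instr <- Mem[PC]; PC <- PC+4",
    "Decode":   "ALUOut <- PC+4",
    "MemAdr":   "ALUOut <- Rn + Imm",
    "MemRead":  "Data <- Mem[ALUOut]",
    "MemWB":    "Rd <- Data",
    "MemWrite": "Mem[ALUOut] <- Rd",
    "ExecuteR": "ALUOut <- Rn op Rm",
    "ExecuteI": "ALUOut <- Rn op Imm",
    "ALUWB":    "Rd <- ALUOut",
    "Branch":   "PC <- R15 + offset",
}

# ASSUMPTION: order of states visited by each instruction class.
STATE_SEQUENCE = {
    "LDR":    ["Fetch", "Decode", "MemAdr", "MemRead", "MemWB"],
    "STR":    ["Fetch", "Decode", "MemAdr", "MemWrite"],
    "DP_reg": ["Fetch", "Decode", "ExecuteR", "ALUWB"],
    "DP_imm": ["Fetch", "Decode", "ExecuteI", "ALUWB"],
    "B":      ["Fetch", "Decode", "Branch"],
}


def trace(instr):
    """Return the (state, micro-op) pairs an instruction passes through."""
    return [(state, MICRO_OPS[state]) for state in STATE_SEQUENCE[instr]]


for state, uop in trace("LDR"):
    print(f"{state:9s} {uop}")
```

Under this assumed ordering, the length of each sequence is the number of cycles that instruction class takes.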








#### Multicycle Control








### Multicycle Control: Cond. Logic








### Single-Cycle Conditional Logic








#### Multicycle Conditional Logic










- Instructions take different number of cycles:
  - 3 cycles: B
  - 4 cycles: DP, STR
  - 5 cycles: LDR
- CPI is weighted average
- SPECINT2000 benchmark:
  - 25% loads
  - 10% stores
  - 13% branches
  - 52% data processing

Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
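The weighted average can be reproduced in Python. The instruction mix and cycle counts are the ones listed above; the dictionary names are arbitrary.

```python
# Fraction of executed instructions per class (SPECINT2000 mix from the list above).
mix = {"B": 0.13, "DP": 0.52, "STR": 0.10, "LDR": 0.25}
# Cycles per instruction class on the multicycle processor.
cycles = {"B": 3, "DP": 4, "STR": 4, "LDR": 5}

# CPI is the sum over classes of (fraction of instructions) x (cycles per instruction).
cpi = sum(mix[k] * cycles[k] for k in mix)
print(round(cpi, 2))  # 4.12
```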





Multicycle critical path:

- Assumptions:
  - RF is faster than memory
  - writing memory is faster than reading memory

$$T_{c2} = t_{pcq} + 2t_{mux} + \max(t_{ALU} + t_{mux}, t_{mem}) + t_{setup}$$








| Element             | Parameter     | Delay (ps) |
|---------------------|---------------|------------|
| Register clock-to-Q | $t_{pcq}$     | 40         |
| Register setup      | $t_{setup}$   | 50         |
| Multiplexer         | $t_{mux}$     | 25         |
| ALU                 | $t_{ALU}$     | 120        |
| Decoder             | $t_{dec}$     | 70         |
| Memory read         | $t_{mem}$     | 200        |
| Register file read  | $t_{RFread}$  | 100        |
| Register file setup | $t_{RFsetup}$ | 60         |

$$T_{c2} = t_{pcq} + 2t_{mux} + \max[t_{ALU} + t_{mux},\ t_{mem}] + t_{setup}$$

$$T_{c2} = [40 + 2(25) + 200 + 50]\ \text{ps} = 340\ \text{ps}$$
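The cycle-time formula can be evaluated in Python as a check. The delays are the table values in picoseconds; the dictionary keys and the function name `multicycle_cycle_time` are assumptions.

```python
# Element delays in picoseconds, taken from the table above.
delays_ps = {
    "t_pcq": 40,
    "t_setup": 50,
    "t_mux": 25,
    "t_ALU": 120,
    "t_mem": 200,
}


def multicycle_cycle_time(d):
    """T_c2 = t_pcq + 2*t_mux + max(t_ALU + t_mux, t_mem) + t_setup."""
    slowest_stage = max(d["t_ALU"] + d["t_mux"], d["t_mem"])  # ALU path vs memory read
    return d["t_pcq"] + 2 * d["t_mux"] + slowest_stage + d["t_setup"]


print(multicycle_cycle_time(delays_ps))  # 340
```

With these delays the memory read (200 ps) dominates the ALU path (145 ps), so memory limits the clock.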










For a program with **100 billion** instructions executing on a **multicycle** ARM processor

- **CPI** = 4.12 cycles/instruction
- Clock cycle time:  $T_{c2}$  = 340 ps

#### Execution Time

$$\text{Execution Time} = (\#\ \text{instructions}) \times \text{CPI} \times T_{c2} = (100 \times 10^{9})(4.12)(340 \times 10^{-12}\ \text{s}) \approx 140\ \text{seconds}$$

This is **slower** than the single-cycle processor (84 sec.)
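The same arithmetic in Python, as a quick check. The function name `execution_time` is an assumption; the inputs and the 84-second single-cycle figure are the values quoted above.

```python
def execution_time(n_instructions, cpi, cycle_time_s):
    """Execution time = (# instructions) x CPI x clock cycle time."""
    return n_instructions * cpi * cycle_time_s


multicycle_s = execution_time(100e9, 4.12, 340e-12)
single_cycle_s = 84.0  # single-cycle result quoted above

print(round(multicycle_s, 2))                   # 140.08
print(round(multicycle_s / single_cycle_s, 2))  # 1.67
```

The multicycle clock is faster, but needing about four cycles per instruction more than cancels that gain for this instruction mix.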





#### Processor Comparisons



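|                        | Single Cycle               | Multicycle           |
|------------------------|----------------------------|----------------------|
| Cycles per instruction | One cycle/instruction      | 3-5 cycles/instruction |
| Clock period           | Long clock period          | Shorter clock period |
| Memory                 | Separate I and D Mem       | Unified Memory       |
| Controller             | Combinational controller   | FSM controller       |
| State                  | Architectural State Only   | Extra state          |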




