Yesterday I wrote about instruction set and ABI manuals. Today I’d like to go into details about the ABIs I listed there. This was done mostly as a summary for me: it’s tiresome to search for the information in the manuals, especially since some of the manuals are PDFs without links. For example, I never remember what is the order of the registers used in parameter passing on x86-64. So what you’ll find here is a listing of what I found interesting for when I might need to read or write assembly code.
As a bonus for you, dear reader, I added a few words about each platform.
First, a summary with numbers.
|Instruction width (bits)||8 to 112||8 to 112||41 or 82||16 or 32||32||16 or 32||32||32|
|# general-purpose registers||8||16||128 (a)||16 (b)||32||32||32||32 (a)|
|GPR width in bits||32||64||64+1 (c)||32||64||32 / 64||32 / 64||32 / 64|
|# Special GPRs architecturally + ABI||1 + 0||1 + 0||1 + 3||1 + 1||1 + 2||1 + 3||1 + 1||2 + 8|
|# GPRs used in parameter passing||0||4 or 6 (d)||8||4||8||4 or 8 (e)||8||6|
|# scratch GPRs (f)||4||7 or 9 (d)||24+96||5||18||18 or 17 (e)||11||7+9+7|
|# saved GPRs (f)||3||8 or 6 (d)||7+96||8||10||9 or 10 (e)||20||15|
|Number of floating point registers||8+8||8+16||128||16 or 32||32||32||32||32|
|FPR width in bits||80 / 128 (g)||80 / 128 (g)||82||64||128||32 or 64||64||64|
|# FPR used in parameter passing||0+0||4 or 8 (d)||8||8||8||2 or 8 (e)||8||0|
- 128 registers on Itanium can be accessed, but the processor has a minimum of 144 and can have more; for SPARC, 32 registers can be accessed, but the processor has anywhere between 64 and 528
- in Thumb mode, some instructions can only access 8 of the 16 registers
- the extra bit, called the Not-A-Thing (NAT) bit, is only used in some special circumstances
- the first number applies to Windows, the second number applies to Unix systems
- the first number applies to o32 and o64; the second number applies to n32 and n64
- “scratch” registers are those that a function may overwrite and need not save, also known as “caller-saved”, including the registers used as parameter passing; “saved” registers are those that a function must save before using; the concept does not apply directly to the rotating registers found on the Itanium and on SPARC (see below)
- the 387 registers are 80-bits wide and the SSE registers are at least 128-bits wide; they have been extended to 256 bits with the AVX extensions
i386 or x86 or IA-32
The x86 architecture is the oldest in consideration and its age shows. The 32-bit architecture debuted with the Intel 80386 (whence the name “i386″) in 1985. It expanded on the Intel 8086 16-bit assembly by expanding the registers to 32-bit among other things. This architecture is still in use today and even modern processors like my Intel® Core™ i7-2620M (Sandy Bridge) boot into 8086 real-mode. I have some applications running on my Linux that are still i386 (like Skype).
The name x86 is because the 80386 (family 3) was followed by the 80486 (family 4), the Pentium (family 5) and the Pentium Pro (P6 archiecture, family 6). Some Linux distributions compile their packages for higher architectures, so you’ll find .i586.rpm and .i686.rpm too. The name IA-32 means “Intel Architecture, 32-bit,” which was created to indicate the difference to IA-64.
The instructions on x86 have variable lengths and can be anywhere from 1 to 15 bytes, averaging usually between 3 and 5 bytes, making the code density around 4 instructions per 16 bytes. That means jump and call targets can use all 32 bits of the addressing space. For performance and ABI reasons, jump targets and functions are usually aligned to 16 bytes (the ABI requires the low 3 bits to be clear for C++ member functions).
The traditional parameter passing uses no registers for parameter passing and pushes the parameter values from right to left as 32-bit slots onto the stack, which is popped by the caller. The stack is memory, so it suffers some penalties for its use. For that reason, most compilers offer alternative calling conventions, which allow passing some values in registers, pushing from left to right, and/or having the stack popped by the callee. You can find them by the names of “regparm” (GCC), “stdcall”, “syscall”, “pascal”. On Windows, the Win32 API is actually “stdcall”, whereas on Linux you’ll seldom ever find public API using anything other than the default convention. You can find more details about them in the Wikipedia article about X86 calling conventions.
The base i386 processor has 8 general-purpose registers and 8 stacked 80-bit wide floating point registers. All the floating point registers are scratch and can hold IEEE 754 single, double- or extended-precision values, while the general-purpose registers are distributed as follows (on Linux at least, though I think it applies to Windows too):
|Registers used for return values||EAX, EDX|
|Scratch registers||EAX, ECX, EDX|
|Saved registers||EBX, ESI, EDI, EBP|
|Floating-point register used for return values||ST(0), ST(1)|
|Scratch floating-point registers||all (ST(0) to ST(7))|
The ESP register is special for architectural reasons: instructions that manipulate the stack work on it exclusively. That includes the procedure call and return mechanism, which store the return address on the stack. All the other registers can be used in almost any condition, even though there are certain instructions preferring one register over another. Only a few special instructions refer exclusively to a particular register (ECX in looping instructions and ESI and EDI in streaming instructions).
The EBP register is most often used as the “frame pointer” register: its value is the memory address where the previous function’s frame pointer was saved. It is used to load and store the incoming and local values at a fixed position. When writing assembly, it’s important to remember too that the EBX register is often used as the PIC register and cannot be used. If you need to use it in an special instruction, you’ll need to save it and restore afterwards (such as pushing it onto the stack or by xchg’ing it with another register).
The x86 architecture gained 8 MMX technology registers with the Pentium MMX, which are aliased to the floating-point registers and are all thus scratch. Later, with the Pentium III, 8 SSE registers, 128-bit wide, were added and then extended to 256-bits with the Sandy Bridge family. They are also all scratch and they can hold IEEE 754 single- or double-precision floating-point values. They can also be used in a variety of scalar, integer or floating point SIMD instructions.
When the x86 architecture gained 64-bit support, not only were the registers expanded to 64-bit, the register set itself was expanded to 16 general-purpose registers, 16 MMX-technology registers and 16 SSE-technology registers. The floating-point registers are unchanged, though, as they are considered legacy. Unlike the i386 before it, the 64-bit expansion did away with compatibility with the 16- and 32-bit assembler instructions. Programs running in 64-bit mode (the “long mode“) run with a slightly different list of instructions. (Note that the 16-bit assembly is technically source compatible with the 32-bit one, but it’s not binary compatible)
As with x86, instructions have variable length in bytes, but the ABI and performance requirements are the same, so jump targets and functions are often aligned to 16 bytes.
As this architecture was created after SSE registers were introduced, the SSE registers are part of the calling convention. In fact, the SSE and SSE2 instructions are the preferred way of manipulating single- and double-precision floating-point values. The ABI for this architecture was specified by AMD when it launched the first 64-bit processor and by Microsoft for its Windows operating system.
|Function return address||top of stack|
|GPRs used for return values||RAX, RDX|
|GPRs used in paramter passing||RDI, RSI, RDX, RCX, R8, R9||RCX, RDX, R8, R9|
|Scratch GPRs||RAX, RCX, RDX, RSI, RDI, R8-R11||RAX, RCX, RDX, R8-R11|
|Saved GPRs||RBX, RBP, R12-R15||RBX, RBP, RSI, RDI, R12-R15|
|387 register used for return values (long double)||ST(0), ST(1)|
|Scratch 387 registers||all (ST(0) to ST(7))|
|Floating-point registers used for return values||XMM0, XMM1|
|Floating-point parameter registers||XMM0-XMM7||XMM0-XMM3|
|SSE scratch registers||all (XMM0-XMM15, YMM0-YMM15)|
Like 32-bit x86, the RSP register is architecturally-special and it’s manipulated by the push, pop, call, ret and similar instructions. The RBP register is also used as a frame pointer. The x86-64 architecture does allow for RIP-relative addressing, which was introduced so that a PIC register wouldn’t be necessary. Yet RBX is still used by some compilers under some conditions like that, so it’s best to apply the same saving mechanisms as before.
On Windows, this architecture runs in LLP64 mode: long longs and pointers are 64-bit wide, but longs and ints are 32-bit. On Linux, this architecture can run in both LP64 mode (longs and pointers are 64-bit wide) and in ILP32 mode (ints, longs and pointers are 32-bit). The ILP32 mode, called “x32“, makes use of the 8 additional GPRs and 8 additional SSE registers along with this calling convention as an effort to renew the 32-bit x86 world.
The Intel Itanium architecturewas the result of the joint project between Hewlett-Packard and Intel in the late 1990s and was released in 2001. It was designed to take the best of the expertise of the time and produce a new, future-proof architecture for years to come. It was intended to replace the old 32-bit x86 architecture, which is why it got the name of IA-64.
Itanium uses a concept called Very long instruction word (VLIW) and each instruction is 41 bits in length, with a few 82 bits for encoding of 61- to 64-bit immediates. Each 3 instructions are grouped in a “bundle” occupying 128 bits (16 bytes), so all jump and function targets are aligned to 16. Another concept used in the architecture is Explicitly parallel instruction computing (EPIC) where the compiler must tell the processor which instructions can be executed in parallel and which ones must wait for others. This is encoded in the assembly as “stop bits”, which are coded to the 5 remaining bits of the 128-bit bundle. Not all combinations of instructions and stop bits are possible, so Itanium code has often many “nop” instructions and is very big. Code density is 3 instructions per 16 bytes, including counting the “nop”.
The Itanium architecture is still the record-holder in terms of raw number of accessible registers. Application programmers have access to 128 general-purpose registers, 128 floating-point registers, 128 architecturally-specific registers, 64 1-bit predicate registers and 8 branch registers. That’s 4 times as many GPRs and FPRs as any other architecture I listed, plus the other special registers. They are divided thus:
|Architecturally-special GPR||r0 (reads are always 0, writes are discarded)|
|ABI-defined special GPRs||r1 (gp), r12 (sp), r13 (tp)|
|GPRs used in return values||r8-r11|
|GPRs used in integer parameter passing||r32-r39 (in0 to in7), specially r8|
|Non-rotating scratch GPRs||r2, r3, r8-r11, r14-r31|
|Non-rotating saved GPRs||r4-r7|
|Rotating GPRs||r32-r127 (in0-in96, loc0-loc96, out0-out96)|
|Architecturally-special FPRs||f0 (0.0) and f1 (+1.0)|
|Registers used for return values||f8-f15|
|Registers used in parameter passing||f8-f15|
|Non-rotating scratch FPRs||f6-f15|
|Non-rotating saved FPRs||f2-f5, f16-f31|
|Architecturally-special PR||p0 (always 1)|
|Non-rotating scratch PRs||p6-p15|
|Non-rotating saved PRs||p1-p5|
|Function return address||b0 (rp)|
|Scratch registers||b0 (rp), b6, b7|
On function entry, the r8 register contains the address of a memory region for the return value if the struct or union being returned is larger than 32 bytes (i.e., doesn’t fit r8-r11).
The three architecturally-special registers (r0, f0 and f1) always have the same value when read: integer 0, floating point 0.0 and floating-point 1.0 respectively. This allows for the assembly to do away with some instructions by just making them alias to others: for example, there is no instruction to load small immediate values onto a GPR. Instead, the instruction is replaced by an addition instruction where one of the operads is r0. The same applies to the floating-point multiplication and addition instructions: the Itanium only has a 4-operand fused multiply-add, so pure additions are done by multiplying one of the sources by f1 and pure multiplications are done by using f0 as the other source.
The 96 upper GPRs, FPRs and the 48 upper PRs are rotating: that means that some instructions can cause the register names to rotate. The three types of registers can be used in rotating loops, where several iterations of the loop are running in parallel with different registers. When not used in rotating fashion, all those registers can be used as scratch.
In addition to loops, the 96 upper GPRs can be rotated on function calls and returns. For that reason, each function can consider it has up to 96 saved registers because those registers simply cannot be seen by functions it calls. They are saved by the Register Stack Engine, asynchronously and at processor-specified times. The architecture allows each function to select how many rotating registers it wants to use and how many of those are to be available to functions called (those are the outgoing registers), though when writing assembly, one specifies how many registers are incoming, local and outgoing, so the named registers are available in the function body. The ABI limits the number of outgoing registers to only 8 rotating registers.
A leaf function or one with tail-call optimisation may opt to keep the rotating registers unchanged. It has available 24 non-rotating scratch GPRs, 10 non-rotating FPRs and 96 rotating ones, 10 non-rotating PRs and 48 rotating, and 3 scratch branch registers (one of them containing the return address), plus the incoming registers. That’s more than enough for most leaf functions, without even using the stack. A tail-call optimisation, however, requires that the called function take no more arguments than this function took, as expanding the number of outgoing registers would destroy another register that must be saved (ar.pfs).
An interesting feature of the GPRs is that they are actually 65-bit in width: the extra bit is called the “Not a Thing” (NAT) bit, which is an indication of whether the other 64 contain a valid value or not. The Itanium has some instructions that allow a “speculative load”: the instruction will try to load the value from memory so long as it doesn’t cause a page fault. If the value could not be loaded, the NAT bit is set and software must later check it, once it determines that it really needs that value. Using the value contained in a GPR while the NAT bit is active, besides copying the contents to another register or saving the contents with a special spill instruction, causes an exception.
The floating-point registers are 82 bits in width, allowing each to hold intermediate values of higher precision than IEEE 754 extended-precision. The “application registers” are 128 special 64-bit registers, each of which with a special meaning. Some of those registers are read-only, some are used by certain instructions and are thus scratch, most have special purpose. In particular, the ar.pfs register must be saved across function calls.
Itanium is defined for LP64 and ILP32 mode for Unix and LLP64 mode for Windows. The ILP32 mode is supported by a special instruction for dealing with pointers: once loaded from 32-bit storage, the pointer is “pointer-extended” to 64-bit before it can be used.
The ABI for Itanium was specified by Intel in the document I linked to in the last blog. Interestingly, Intel specified almost everything relating to the Itanium, including a full C++ ABI. This became known as the Itanium C++ ABI and is what GCC uses in all platforms, not just Itanium.
ARM 32-bit mode (AArch32)
Instructions in the ARM architecture, when running in “ARM mode”, are all 32-bits wide. For that reason, all jump targets and function addresses are aligned to 4 bytes and the low 2 bits are always unused. However, when ARM code is in “interworking” with Thumb code, those two bits are special, which mean that function addresses on ARM require the use of all bits. This has implications for the C++ ABI: since all bits are used in function addresses, the bit indicating whether a pointer-to-member-function is virtual is moved to the adjustment field.
32-bit ARM has 16 registers, one of which is the program counter. All registers are 32 bits in width and can be used in all instructions alike, including the PC, which makes it possible to have branching with arithmetic instructions (for example, “add pc,pc,r0″). The PC register is special and all operations on it are not supported. Moreover, reading from it yields the address of the current instruction plus 8. “nop” instructions are not common in ARM assembly, so the code density is 4 instructions per 16 bytes. However, due to the limited range of immediates, ARM code is often littered with nearby constants that must be loaded, not executed.
The ARMv6 architecture mandates at least 16 floating-point 64-bit wide registers, and ARMv7 allows optionally for 16 more of them to exist. The registers are divided so:
|Architecturally-special GPR||r15 (pc)|
|ABI-defined special GPR||r13 (sp)|
|GPRs used for returning values||r0-r3|
|GPRs used in parameter passing||r0-r3|
|Function return address||r14 (lr)|
|Scratch GPRs||r0-r3, r12 (ip), r14 (lr)|
|FPRs used for returning values||d0-d7|
|FPRs used in parameter passing||d0-d7|
|Scratch FPRs||d0-d7, d16-d31|
The table above assumes that one is using the floating-point hardware registers to pass parameters, in what is called in the ARM world “hard float”. According to the ARM Architecture Procedure Call Standard, this is optional: if not enabled, the floating-point parameters are converted to their 32- or 64-bit representations and passed in the GPRs.
The floating-point registers can be accessed in 64-bit mode to hold one IEEE 754 double-precision value or as two 32-bit registers holding IEEE 754 single-precision values. Extended-precision is not supported by hardware — on the ARM ABI, the “long double” type is an alias to “double”. Each of the original sixteen 64-bit FPRs can be accessed as two 32-bit FPRs when one prefixes them with “s” instead of “d”: s(2N) corresponds to the lower half of dN, while s(2N+1) to the upper half. A pair of any two sequential FPRs, starting on an even-numbered register, can also be accessed as sixteen quad-word (128-bit) registers when prefixed with “q”.
The r13 (sp) register was chosen by the ABI more-or-less arbitrarily, as any other register could be used to store the current address of the top of the stack. However, this register becomes architecturally-specific when Thumb mode is in use.
ARM CPUs can also run a sub-mode called Thumb, in which most instructions are 16-bit in width. Older ARM processors can only run 16-bit Thumb instructions, while newer ones support additional 32-bit Thumb instructions. Thumb instructions are therefore not aligned to 4-byte boundaries. When ARM and Thumb code interwork, the lowest bit of jump and call addresses indicates the instruction mode: 0 indicates ARM code while 1 indicates Thumb code. However, when the PC register is accessed in Thumb mode, the lowest two bits are forced to zero, so a read of the PC always yields a 4-byte aligned value. Like in ARM mode, reads of the PC give the current instruction plus 8 (rounded down).
The 16-bit instructions are a reduced set, with restricted access to registers. The r13 (sp) register is hardcoded in some stack operations, which makes it architecturally-specific. The ABI itself does not change, but 16-bit instructions can only encode the lower 8 of the 16 registers, which means that 16-bit Thumb functions are limited to 4 scratch and 4 saved registers.
ARM 64-bit (AArch64)
The ARM 64-bit architecture has expanded the ARM 32-bit register set from 16 to 32 general-purpose registers and 32 floating-point registers. The program counter register is no longer architecturally visible and the stack pointer register is architecturally special.
There are still no ARMv8 chips available and I have not seen any compiler generating code for it. The information below comes from the public ARM manuals I could find. The instruction width appears to be 32-bit, like in AArch32 ARM mode (A32). The ABI was specified by ARM too.
|Architecturally-special GPR||SP and ZR|
|GPRs used for returning values||r0-r7|
|GPRs used in parameter passing||r0-r7, specially r8|
|Function return address||r30 (lr)|
|Scratch GPRs||r0-r18, r30 (lr)|
|FPRs used for returning values||v0-v7|
|FPRs used in parameter passing||v0-v7|
|Scratch FPRs||v0-v7, v16-v31|
|Saved FPRs||v8-v15 (lower 64-bit values only)|
On function entry, the r8 register contains the address of a memory region for the return value if the type being returned would not be stored in registers if it were the first parameter in a function call.
The SP and ZR registers are actually the same register, the difference is how the instruction deals with them. Some instructions are specified to work on SP and will read and write values to it. Some others treat that register as a zero or discard the output when it is the destination. When used in those conditions, the assembly lists it as “ZR”, or, like Itanium, uses different mnemonics to indicate the absence of source.
The floating-point registers can be accessed as 32-, 64- or 128-bit wide registers, holding single-precision, double-precision and quad-precision values respectively. However, the hardware only supports floating-point math in single- and double-precision. The ABI asks that only the 64 lower bits of the registers v8 to v15 be saved. That means if a function stores data in the high bits, it must save them on its own before calling another.
In assembly code, one will not see the registers named with the prefixes “r” or “v”. Instead, the GPRs are prefixed with “w” to mean 32-bit access or “x” for 64-bit, while the FPRs are prefixed “s” (32-bit), “d” (64-bit) or “q” (128-bit). The “v” prefix is seen on SIMD operations.
The ARM 64-bit architecture is defined for LP64 mode, but it might be possible to run it in ILP32 mode to make use of the extra registers and improved calling convention.
MIPS processors exist with both 32- and 64-bit registers, respectively called MIPS32 and MIPS64. Unlike the x86 and ARM architectures where the assembly language differs considerably between 32- and 64-bit, the MIPS64 instruction set is a complete superset of MIPS32 and differences in programs are only due to the ABI. The MIPS64 processors do not have a special architectural mode to run MIPS32 code. MIPS64 processors were mostly used with the IRIX operating system, which has been discontinued (SGI now sells Linux machines running on Itanium). MIPS32 processors are quite common in the embedded world, finding their use on many WiFi routers and Set-Top-Boxes.
MIPS instructions are 32-bit in width, but some processors support an extension called microMIPS16 and can run 16-bit wide instruction code. I have not investigated if the instruction streams can be mixed or a technique similar to the ARM-Thumb interworking is necessary. GCC does not seem able to generate microMIPS16 instructions.
MIPS processors have 32 general-purpose registers, which are 64-bit wide on MIPS64 and 32-bit on MIPS32. Additionally, an FPU and a second co-processor are optional. The FPU, if present, has 32 registers that are 32-bit wide and support (at least) single precision, double precision and 32-bit integers, with double-precision values stored in a pair of registers. 64-bit support in the FPU is possible, but optional.
|o32 and o64 ABI||n32 ABI||n64 ABI|
|Architecturally-special GPR||$0 (zero)|
|ABI-specified special GPRs||$26 (kt0), $27 (kt1), $29 (sp)|
|GPRs used for returning values||$2, $3|
|GPRs used in parameter passing||$4 to $7||$4 to $11|
|Function return address||$31 (ra)|
|Scratch GPRs||$1 (at), $2 to $15, $24, $25, $28 (gp)||$1 (at), $2 to $15, $24, $25|
|Saved GPRs||$16 to $23, $30||$16 to $23, $28 (gp), $30|
|FPRs used for returning values||$f0 to $f3||$f0 and $f2 (not $f1)|
|FPRs used in parameter passing||$f12 to $f15||$f12 to $f19|
|Scratch FPRs||$f0 to $f19||$f0 to $f19, $f20, $f22, $f24, $f26, $f28, $f30||$f0 to $f23|
|Saved FPRs||$f20 to $f31||$f21, $f23, $f25, $f27, $f29, $f31||$f4 to $f31|
Some of the registers deserve special mention:
- $1 is used by the assembler for some operations where an assembly mnemonic does not fit; it is known as “at” (assembler temporary)
- $25 contains the address of the function on entry on PIC code; since the compiler usually doesn’t know whether the target function is PIC or not, it will most likely load its address on $25
- $26 and $27 are reserved to hold values from the kernel and should not be modified
- $28 (gp), unlike the previous architectures, on o32 and o64, the global pointer (the PIC register) is not stored in a saved register
- $f20 to $f31: since the early MIPS double-precision operations operate on a register pair, the registers must be saved in pairs too (o32 only)
In particular, the o32 ABI uses only the floating point registers always in pairs: the odd-numbered registers are never used alone. The o64, n32 and n64 ABIs allow using them independently and assume they can hold a double-precision value.
An interesting feature of the MIPS assembly, that it shares with SPARC, is that the first instruction after a taken branch is still executed.
POWER and PowerPC
The Power Architecture defines processors with both 32- and 64-bit registers. Unlike the x86 and ARM architectures where the assembly language differs considerably between 32- and 64-bit, the 64-bit instruction set is a complete superset of the 32-bit one and differences in programs are only due to the ABI.
The Power Architecture specifies two profiles for its processors: the Server Platform, which is mandatorily 64-bit, and the Embedded Platform. It has 32 general-purpose registers, one of which is special. If the FPU is present, it provides 32 registers capable of holding 64-bit double-precision floating point values. It also has an optional vector unit extension, known as Altivec.
The Power Architecture ABI document I found specifies many optional functionality, including soft-float. I am listing here the use of the floating point registers for parameter passing and a common-sense profile.
|ABI-defined special GPR||r1 (sp)|
|GPRs used for return values||r3-r6|
|GPRs used for parameter passing||r3-r10|
|Function return address||LR|
|Scratch GPRs||r0, r3-r12|
|Saved GPRs||r2 (tp), r13-r31|
|FPR used for return values||f1|
|FPRs used in parameter passing||f1-f8|
|VR used for return values||v2|
|VRs used in parameter passing||v2-v13|
- lr: it’s a special register that contains the value of the return address after a function call and must be saved (it’s a scratch register)
- r0: it is a valid register containing values, but some instructions cannot access it; instead, they will always read a zero value and will discard the result
The register names above are prefixed with a letter for convenience. The POWER assembly unfortunately uses no prefixes for registers, addresses or absolute values: when you see a number like “2″ in the disassembly, you need to understand the instruction in question to determine if that refers to r2, f2 or a value of 2.
An interesting feature of the assembly is the Enforce In-Order Execution of I/O instruction, whose mnemonic reads “EIEIO”.
The SPARC architecture began as 32-bit but was extended to 64-bit with the SPARCv9 in 1993. The SPARC processors are most commonly known as UltraSPARC today. 32-bit processors are not sold anymore, however, ILP32 applications still exist and run unmodified in current processors as the difference is only in the ABI.
The SPARC architecture has 32 general-purpose registers, of 64-bit in width on SPARCv9 and above. Like Itanium, the SPARC has a rotating register window: 24 GPRs are rotating and 8 are fixed. Unlike Itanium, the rotation window has a fixed size. The save instruction rotates it by 16, making the registers %r8 to %r15 become %r24 to %r31 and 16 clean registers at %r8 to %r23, while the restore instruction does the opposite and restores the 16 registers that had been saved. So, unlike Itanium, the outgoing parameters are in lower-numbererd registers and the incoming ones are in higher numbered ones.
The general-purpose register set is divided in four groups of 8 registers because of the window. The lowest 8 registers are fixed and are named %g0 to %g7; the next 8 are shared with a function being called, so they are the outgoing registers and named %o0 to %o7; the next 8 are only visible in the current function and are therefore named %l0 to %l7; finally, the upper 8 registers are shared with this function’s caller and are named %i0 to %i7.
Because of the register rotation, the definition of “scratch” and “saved” does not apply directly: the registers a function must preserve for its caller are not the same registers that its callee will preserve. The following table shows the registers in the point of view of the function after is has rotated the register window.
|Architecturally-special GPR||%g0, %i7 (partially)|
|ABI-defined special GPRs||%g2-%g7, %o6 (%sp), %i6 (%fp)|
|GPRs used in return values||%i0|
|GPRs used in passing parameters||%i0-%i5|
|Function return address||%i7|
|GPRs preserved by a callee||%l0-%l7, %i0-%i7|
|GPRs destroyed by a callee||%g1, %o0-%o5, %o7|
|FPRs used in passing parameters||None, they are passed in the outgoing registers or stack|
|FPRs used in returning values||%f0, %f1|
The registers %g2 to %g4 are reserved by the ABI for the application, for uses to be decided by the application and compiler, while registers %g5 to %g7 are to be considered read-only by the application and compiler.
When a function is called, it has available 6 rotating registers containing its incoming values and can be considered scratch. Additionally, %g1 also is scratch. For that reason, leaf functions or functions with tail-call optimisation do not have to rotate the register window if they are satisifed with those 7 registers.
The rotating window also makes the %sp register become the %fp upon rotation, allowing for easy save of that value. Unlike other architectures, it’s the %fp register that must be preserved for the caller, so a leaf function can use the %sp register as a scratch if it needs to (provided it has rotated the register window).
Like MIPS, SPARC’s branch instructions also execute the instruction immediately following the branch, what’s called the “delay slot”. Disassemblers often indent that instruction further for clarity. However, unlike other architectures, the SPARC call instruction does not save the return address, but its own address. To return, a function must jump to %i7 + 8.
The FPU contains 32 registers that are 64-bit wide, numbered %f0, %f2, %f4, … %f62. Each pair of registers, starting at one numbered multiple of four (%f0, %f4, %f8, … %f60) can be accessed in a quad-precision way. Single-precision access is done in sequentially-numbered registers %f0, %f1, %f2, … %f31, where each pair aliases one 64-bit register. Note that SPARC is a big-endian platform, so the upper halves of a larger register are found in the lower numbered register.