Assembly Directives
Predefined Section Directives
Include File Directive
Procedure Directives
Symbol Scope Declaration
Declaring Local Scope
Declaring Global Scope
Alignment Statement
A
Acquire Hint
Ambiguous Memory Accesses
B
Big-endian
Branch
Branch Handling
C
Cardinality
Code Emission
Comparison Relations (crel)
Control Dependency
Copy Propagation
Cycle Break
D
Data Dependency
Dead Store Elimination
F
Floating-point Comparison Relations (frel)
Floating-point Status Register (FPSR)
Fortran
H
Hide Memory Latency
High-level Optimizations
I
IA-32 Architecture
Induction Variable
Intel® Itanium® Architecture
Intel® Itanium® Architecture Software Developer's Manual
Immediate
Improve Branch handling
Increase ILP
Instruction Level Parallelism (ILP)
Instruction Pointer (IP)
L
LC Application Register
ldf-Load Floating-point
ldfp-Load Floating-point Pair
Little-endian
Loop Branch
Loop Unrolling
M
Memory disambiguation
Memory latency
Modular Code Support
Modulo-scheduled Counted Loops
Modulo-scheduled While Loops
Multiple Status Fields Registers
Multiply and Accumulate Instructions (fma)
N
NaT Bit/NaT Value (Not a Thing)
Normal Compare Type
P
Parallel Compare Types
Pointer-precision data types
POINTER_32
POINTER_64
Polymorphism
Postpass scheduling
Predicate Registers
Predication
Prediction Strategy Hint
Procedure Frame
Procedure stack
R
RAW (Read-After-write) Dependency Violation
Register Load and Store Instructions
Release Hint
Representative Workload
Rotating Registers
S
Scaling Pointers
Scoreboarding
SIMD
Spatial Locality
Speculation
Software Pipelining
Stage Predicates
Strength Reduction
T
Templates
Temporal Locality
Trip Count
U
ulps
Unconditional Compare Type
Uniform Data Model (UDM)
W
WAW (Write-After-Write) Dependency Violation
This hint is applicable to ld instructions. The load instruction becomes visible before all future data references; however, prior data references may become visible later.
Ambiguous memory accesses are a pair of memory accesses that may refer to the same address in memory.
A method of storing data so that the most significant byte appears in a lower-numbered location in memory.
The Intel® Itanium® architecture supports several types of branches. These include conditional and unconditional branches (jumps), function calls and returns, and loop branches.
A branch instruction that is mispredicted incurs a misprediction penalty. The misprediction penalty gets higher as the depth and width of processors grow.
The range of numbers a data item can count.
The process of emitting the sequence of instructions for a function. Code emission can be done in text as a sequence of assembly instructions, or can be done in binary form into a .obj file.
The two source operands of the compare (cmp) instructions are compared for one of the following ten relations (crel):
crel | a related to b | |
eq | a==b | |
ne | a!=b | |
lt/ult | a<b | signed/unsigned |
le/ule | a<=b | signed/unsigned |
gt/ugt | a>b | signed/unsigned |
ge/uge | a>=b | signed/unsigned |
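The signed/unsigned distinction matters because the same 64-bit pattern can order differently under each interpretation. The following is a minimal C sketch, not Itanium code; the function names are hypothetical, and it assumes the usual two's-complement conversion from uint64_t to int64_t.

```c
#include <stdint.h>

/* Models the signed "lt" relation: operands are read as signed integers.
   Assumes two's-complement conversion, as on mainstream compilers. */
int less_than_signed(uint64_t a, uint64_t b)
{
    return (int64_t)a < (int64_t)b;
}

/* Models the unsigned "ult" relation: operands are read as unsigned integers. */
int less_than_unsigned(uint64_t a, uint64_t b)
{
    return a < b;
}
```

With an all-ones first operand, the signed relation sees -1 and reports "less than 1", while the unsigned relation sees the largest 64-bit value and reports the opposite.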
An instruction is control dependent if it depends on a branch instruction to execute.
Eliminates unnecessary assignments by using the value assigned to a variable in place of the variable itself. In many cases, the compiler can avoid using a register.
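As an illustration only, the two hypothetical C functions below show the source-level effect of copy propagation. A compiler performs this on its internal representation; the function names are assumptions made for the example.

```c
/* Before: t is only a copy of x. */
int scale_before(int x, int y)
{
    int t = x;
    return t * y + t;
}

/* After copy propagation: every use of t is replaced by x,
   so the copy (and the register that held it) is no longer needed. */
int scale_after(int x, int y)
{
    return x * y + x;
}
```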
The cycle break (;;) indicates the end of an instruction group. It is placed in the code by the assembly writer or by the compiler.
Instructions are considered to be data dependent if the first produces a result that is used by the second, or if the second instruction is data dependent on the first through a third instruction. Dependent instructions cannot be executed in parallel. You cannot change the execution sequence of dependent instructions.
Seeks to ensure that there is no store to the same memory location twice without an intervening read from that location.
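A minimal C sketch of the idea, with hypothetical function names: the first store is never read before it is overwritten, so it can be dropped. This assumes the pointer does not refer to a volatile object, where every store must be kept.

```c
/* Before: the store of 0 is dead, because it is overwritten
   by the store of 42 with no intervening read. */
void init_before(int *p)
{
    *p = 0;
    *p = 42;
}

/* After dead store elimination: only the final store remains. */
void init_after(int *p)
{
    *p = 42;
}
```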
The two source operands of the floating-point compare (fcmp) instructions are compared for one of the following 12 relations (frel):
frel | f2 related to f3 | frel | f2 related to f3 |
eq | f2==f3 | neq | !(f2==f3) |
lt | f2<f3 | nlt | !(f2<f3) |
le | f2<=f3 | nle | !(f2<=f3) |
gt | f2>f3 | ngt | !(f2>f3) |
ge | f2>=f3 | nge | !(f2>=f3) |
unord | f2?f3 | ord | !(f2?f3) |
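The negated relations are listed separately because, for unordered operands (at least one NaN), nlt is not the same as ge. The C sketch below models three of the relations with hypothetical helper functions; it assumes IEEE 754 comparison semantics and a compiler that is not run in a fast-math mode.

```c
#include <math.h> /* only for the NAN macro */

/* ge: true only when both operands are ordered and a >= b. */
int rel_ge(double a, double b)
{
    return a >= b;
}

/* nlt: the logical negation of lt, so it is true for unordered operands. */
int rel_nlt(double a, double b)
{
    return !(a < b);
}

/* unord: true when at least one operand is a NaN (a NaN never equals itself). */
int rel_unord(double a, double b)
{
    return (a != a) || (b != b);
}
```

For ordered operands rel_ge and rel_nlt agree; with a NaN operand rel_ge is false while rel_nlt is true.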
The Intel® Itanium® architecture provides four separate status fields (sf0-sf3) enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.
The FPSR contains the four status fields and a traps field that traps the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits which must be 0.
In Fortran written for the Intel® Itanium® architecture, all pointers are 64-bit quantities.
96 general registers, starting at r32, used to pass parameters to the called procedure and store local variables for the currently executing procedure.
The Intel® Itanium® architecture provides the means to hide memory latencies by:
Include
IA-32 is Intel's 32-bit and 16-bit instruction set architecture supported on the Pentium® and P6 family of processors. See the Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, Order Number 243191, for detailed information.
The Itanium architecture is a 64-bit architecture. It also provides full compatibility with Intel's 32-bit architecture, also known as IA-32.
The Intel® Itanium® Architecture Software Developer's Manual
Order numbers:
An immediate is a numeric instruction operand.
The Intel® Itanium® architecture improves branch handling by:
The Itanium® architecture increases ILP by:
In their simplest form, induction variables are variables whose successive values form an arithmetic progression over some part of a program, usually a loop. Usually the loop's iterations are counted by an integer-valued variable that proceeds upward (or downward) by a constant amount with each iteration.
The ability to execute many instructions in parallel in multiple functional units during the same cycle.
The 64-bit instruction pointer holds the address of the bundle of the currently executing instruction. The IP cannot be directly read or written; it increments as instructions are executed. Branch instructions set the IP to a new value. The IP is always 16-byte aligned.
The Loop Count (LC) register is a 64-bit counter used in counted loops. LC is decremented by counted-loop-type branches.
Itanium® architecture assembly instruction that loads a single floating-point value into a register.
Itanium® architecture assembly instruction that loads two floating-point values into two registers simultaneously.
A method of storing data so that the least significant byte appears in a lower-numbered location in memory.
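The big-endian and little-endian layouts defined in this glossary can be made concrete by writing the bytes of a value explicitly. This is a small C sketch with hypothetical function names; it does not depend on the byte order of the machine it runs on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big-endian: most significant byte at the lowest-numbered location. */
void store_big_endian(uint32_t value, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Little-endian: least significant byte at the lowest-numbered location. */
void store_little_endian(uint32_t value, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)value;
    out[1] = (unsigned char)(value >> 8);
    out[2] = (unsigned char)(value >> 16);
    out[3] = (unsigned char)(value >> 24);
}
```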
The branch from the "bottom" of the loop to the "top" of the loop. The branch, if taken, continues the loop computation. If the branch is not taken, control exits out of the loop.
A method used to improve the parallelism of a loop. The loop instructions are replicated and the end code adjusted to eliminate the branch.
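A minimal C sketch of unrolling by a factor of four, assuming a simple array sum; the function name and the unroll factor are choices made for this example. Four independent accumulators expose parallelism, and a short remainder loop handles leftover elements.

```c
#include <stddef.h>

/* Sum an array with the loop body replicated four times.
   The four partial sums do not depend on each other, so they
   can be computed in parallel; the loop branch runs n/4 times. */
long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```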
The process of determining whether two or more pointers are pointing to the same memory location. In C/C++, it is possible to make two or more memory references access the same memory location. In Fortran, memory ambiguity is not a problem, due to language semantics.
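The hypothetical C function below shows why the question matters: its result depends on whether the two pointers refer to the same location, so a compiler that cannot disambiguate them must keep the store and the load in program order.

```c
/* If a and b point to different objects, the function returns 1.
   If they point to the same object, the store through b overwrites
   the value written through a, and the function returns 2. */
int store_then_load(int *a, int *b)
{
    *a = 1;
    *b = 2;
    return *a;
}
```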
The time required by the processor, between an issuance of a load instruction and the moment when the result of this instruction can be used.
Hide memory latencies: Intel® Itanium® architecture provides the means to hide memory latencies by:
The Intel® Itanium® architecture supports the current compiler trend to produce modular code by providing specific hardware support for function calls and returns.
For modulo-scheduled counted loops, the calculation of whether the branch is taken or not depends on the Loop Count application register and on the epilog condition: whether the Epilog Counter application register is greater than one or not.
Use the modulo-scheduled counted loop instructions br.ctop and br.cexit when the loop decision is located at the bottom of the loop body and therefore a taken branch will continue the loop while a fall through branch will exit the loop.
These instructions are only allowed in instruction slot 2 within a bundle. Executing such an instruction in either slot 0 or 1 will cause an Illegal Operation fault, whether the branch would have been taken or not.
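The taken/not-taken decision described above can be sketched as a small C model. This is a simplification under stated assumptions: the names LoopRegs, br_ctop and count_body_executions are hypothetical, register rotation and stage-predicate updates are left out, and the rule that LC is consulted first and decremented while non-zero is an assumption based on the Loop Count description in this glossary.

```c
/* Simplified model of the two application registers used by br.ctop. */
typedef struct {
    unsigned long lc; /* Loop Count */
    unsigned long ec; /* Epilog Counter */
} LoopRegs;

/* Returns 1 if the modeled branch is taken (loop continues), 0 otherwise. */
int br_ctop(LoopRegs *r)
{
    if (r->lc != 0) {   /* iterations remain: count down and continue */
        r->lc--;
        return 1;
    }
    if (r->ec > 1) {    /* epilog condition: still draining the pipeline */
        r->ec--;
        return 1;
    }
    if (r->ec == 1)     /* last epilog stage: fall through and exit */
        r->ec = 0;
    return 0;
}

/* Counts how many times the loop body runs for given initial LC and EC. */
unsigned long count_body_executions(unsigned long lc, unsigned long ec)
{
    LoopRegs regs = { lc, ec };
    unsigned long executions = 0;

    do {
        executions++;
    } while (br_ctop(&regs));
    return executions;
}
```

In this model, a loop of N source iterations with S pipeline stages would start with LC = N - 1 and EC = S, and the body runs N + S - 1 times.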
For modulo-scheduled while loops, the calculation of whether the branch is taken or not depends on the qualifying predicate and on the epilog condition: whether the Epilog Counter application register is greater than one or not.
Use the modulo-scheduled while loop instructions br.wtop and br.wexit when the loop decision is located somewhere other than the bottom of the loop and therefore a fall-through branch will continue the loop and a taken branch will exit the loop.
These instructions are only allowed in instruction slot 2 within a bundle. Executing such an instruction in either slot 0 or 1 will cause an Illegal Operation fault, whether the branch would have been taken or not.
The Intel® Itanium® architecture supports 4 sets of control and status fields with the first being the main set. The multiple sets allow intermediate calculations to be performed on the alternate sets.
The Intel® Itanium® architecture supports various arithmetic floating-point instructions to meet common needs, for example, floating-point multiply and add (fma), multiply and subtract (fms), and many more.
The fma instruction, with its four operands (f = a * b + c), forms the basis of all the floating-point arithmetic.
The fma instruction also provides improved accuracy in multiply and add operations, since there is only one rounding stage, after the add.
The NaT bit and NaTVal enable propagating exception tokens in general and floating-point registers:
The normal (no ctype) compare instruction writes the compare result to one target, and the complement to the other.
The OR, AND, and OR and complement (or.andcm) compare instructions either write a specific answer to the predicate registers or leave them unchanged, depending on the result of the compare operation. This allows multiple simultaneous OR-type or multiple simultaneous AND-type compares to target the same predicate register.
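A minimal C model of the "write or leave unchanged" behavior, for a single target predicate. The function names are hypothetical, and the specific rule used here (an OR-type compare writes 1 only when the comparison is true, an AND-type compare writes 0 only when it is false) is an assumption made for the sketch.

```c
#include <stdbool.h>

/* OR-type: a true comparison sets the target; a false one leaves it alone. */
void cmp_or(bool *target, bool result)
{
    if (result)
        *target = true;
}

/* AND-type: a false comparison clears the target; a true one leaves it alone. */
void cmp_and(bool *target, bool result)
{
    if (!result)
        *target = false;
}
```

Because every writer stores the same value or nothing, several such compares can target one predicate without the final value depending on their order.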
Data types that are the same size as pointers.
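As a hedged illustration, standard C offers uintptr_t as an integer type wide enough to hold a pointer; it is used here as a stand-in for the platform's own pointer-precision types. The function names are hypothetical, and the byte-offset result assumes a flat address space, as on mainstream platforms.

```c
#include <stdint.h>

/* Convert a pointer to a pointer-sized integer and back again. */
void *round_trip(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}

/* Distance in bytes between two addresses, computed with integer
   arithmetic on pointer-sized integers (assumes a flat address space). */
uintptr_t byte_offset(const void *base, const void *p)
{
    return (uintptr_t)p - (uintptr_t)base;
}
```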
POINTER_32 is a 32-bit pointer. In Win32, this is a native pointer. In Win64, POINTER_32 is created by truncating a 64-bit pointer. All pointers are 64-bit on any 64-bit platform.
POINTER_64 is a 64-bit pointer. In Win32, POINTER_64 is created by sign extending a 32-bit pointer. In Win64, this is a native pointer. Note that no assumptions should be made about pointer sign bits.
The ability of one data item to have a different type depending on the way in which it is used.
Scheduling performed after register allocation in the backend of the compiler. The register allocator may introduce spills, or may get rid of MOV instructions. Blocks where such changes have been made are re-scheduled by the postpass scheduler.
64 one-bit predicate registers enable controlling the execution of instructions. When the value of a predicate register is true (1), the instruction is executed. The predicate registers enable:
There are:
Instructions that are not explicitly preceded by a predicate default to the first predicate register, p0, which is read-only and always true (1).
The conditional execution of instructions based on their predicate. When the predicate is true (1), the instruction is executed. When it is false (0), the instruction is treated as a NOP.
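A rough C model of predication, with a hypothetical function name: instead of one branch selecting between two blocks, each operation is guarded by its own predicate, and an operation whose predicate is false has no effect.

```c
/* Models "if (cond) result = a; else result = b;" as two predicated moves. */
int select_predicated(int cond, int a, int b)
{
    int p1 = (cond != 0); /* predicate produced by a compare */
    int p2 = !p1;         /* complementary predicate */
    int result = 0;

    if (p1) result = a;   /* (p1) mov result = a */
    if (p2) result = b;   /* (p2) mov result = b, a NOP when p2 is false */
    return result;
}
```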
A prediction strategy hint describes how the processor should predict conditional branches. Depending on the value of the hint, the processor can predict the branch as a taken branch, can not predict it, or can base the prediction on a specified predicate which is set up in advance.
The subset of stacked registers visible to a procedure. The procedure frame contains a predefined number of input and output registers, to a maximum of 96 registers.
A contiguous array of memory locations, commonly referred to as “the stack”, used in many processors, to save the state of the calling procedure, pass parameters to the called procedure and store local variables for the currently executing procedure.
A predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true.
A type of data dependency between two instructions in one instruction group. The later instruction reads data from the same register to which the earlier instruction wrote.
Example:
add r4=r5,r6
mov r9=r4
A RAW data dependency exists between r4 in the first line and r4 in the second line.
Data is moved between registers and memory strictly through the load (ld) and store (st) instructions. The Intel® Itanium® architecture supports loads and stores of all data types. Because registers are written as 64-bit values, loads are zero-extended. Stores always write the exact number of bytes for the required format.
This hint is applicable to st instructions. The store instruction becomes visible after all prior data references; however, later data references may become visible earlier.
The work performed is typical of the stress on the system under normal operating conditions.
Registers which are rotated by one register position on each loop execution. The logical names of the registers are rotated in a wrap-around fashion, so that logical register X is logical register X+1 after one rotation. The predicate, floating-point and general registers can be rotated.
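The renaming can be sketched in C as a fixed array plus a moving base index. This is only a model: the type and function names and the register count of 8 are assumptions for the example, not architectural values.

```c
#define ROT_SIZE 8 /* illustrative size, not an architectural value */

typedef struct {
    int phys[ROT_SIZE]; /* physical storage, never moved */
    unsigned base;      /* current mapping of logical register 0 */
} RotatingFile;

/* Map a logical register number to its physical slot, wrapping around. */
int *rot_reg(RotatingFile *f, unsigned logical)
{
    return &f->phys[(f->base + logical) % ROT_SIZE];
}

/* One rotation: the value seen as logical X is now seen as logical X+1. */
void rot_rotate(RotatingFile *f)
{
    f->base = (f->base + ROT_SIZE - 1) % ROT_SIZE;
}
```

No data is copied on a rotation; only the logical-to-physical mapping changes.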
Use these pointers when casting a pointer to an integer for pointer arithmetic.
Technique that enables instructions to execute out of order when sufficient resources exist, and when no data dependencies exist.
The processor maintains a table that indicates the status of instructions and the registers to which they are writing.
Critical data dependency violations arise from any of the following:
WAR and WAW benefit from register renaming, which leaves us with the RAW true dependency. Scoreboarding enables maximum concurrency limited only by the true RAW dependency and structural dependency violations.
Example:
.mfi |
On issue of the fma, the target register is marked "invalid data". This marking is removed once the operation has finished, four cycles later, and the valid result can be accessed. If an instruction tries to read the data before the "invalid data" tag is removed, the new operation stalls until the data is ready. The data in f29 isn't ready because the fma is a scoreboarded operation. Therefore the second fma stalls for three cycles. |
Single Instruction Multiple Data (SIMD) is a technique that improves performance by using one instruction to process multiple data elements in parallel.
Software pipelining is a method that enables the processor to execute, at any given time, several instructions in various stages of the loop.
Data with spatial locality is data with memory addresses close to the data or instructions currently in use.
To hide memory access latencies, advanced load instructions (ld.a) move potentially data dependent loads earlier in the code, and control-speculative load instructions (ld.s) hoist loads above conditional branches.
Predicates that turn on or off instructions in a software-pipelined loop. A software-pipelined loop has several stages. Each instruction is executed in a particular stage and is predicated by the stage predicate corresponding to that stage.
Replaces expensive operations such as multiplications and divisions with less expensive ones such as additions and subtractions.
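A minimal C sketch of strength reduction applied to a loop, with hypothetical function names: the multiplication by the loop counter is replaced by a running sum that grows by a constant each iteration. The example assumes values small enough that no integer overflow occurs.

```c
#include <stddef.h>

/* Before: one multiplication per iteration. */
void fill_before(int *out, size_t n, int stride)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i * stride;
}

/* After strength reduction: the product is maintained by repeated addition. */
void fill_after(int *out, size_t n, int stride)
{
    int value = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = value;
        value += stride;
    }
}
```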
Templates | ||||
The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop. The 24 available templates are listed opposite. M is a memory function. * L+X is an extended type that is dispatched to the I-unit. |
MII |
MIIs |
Data with temporal locality is data that is likely to be reused. The older the data, the less likely the program is to use it again.
The loop count: the number of times the body of a loop is executed.
Units in the last place. A measure of the error between an infinitely precise result and the actual machine result.
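One common way to express such an error is to count how many representable values separate two floating-point numbers. The C sketch below does this for single precision; the function names are hypothetical, it assumes float is IEEE 754 binary32, and it is restricted to finite positive inputs, where adjacent values have adjacent bit patterns.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Build a float from a 32-bit pattern. */
float float_from_bits(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Number of representable floats between a and b (positive inputs only). */
uint32_t ulp_distance(float a, float b)
{
    assert(a > 0.0f && b > 0.0f);
    uint32_t ba = float_bits(a);
    uint32_t bb = float_bits(b);
    return ba > bb ? ba - bb : bb - ba;
}
```

A computed result can then be compared against a reference value that was obtained in higher precision and rounded to float.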
The unconditional (unc) compare instruction first initializes both predicate targets to 0, independent of the qualifying predicate. It then operates the same as the normal type, writing the compare result to one target, and the complement to the other.
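The difference between the normal and unconditional types shows up only when the qualifying predicate is false. A small C model, with hypothetical type and function names, following the description above:

```c
#include <stdbool.h>

/* The two predicate targets of a compare instruction. */
typedef struct {
    bool p1;
    bool p2;
} PredPair;

/* Normal type: when the qualifying predicate is false, nothing is written. */
PredPair cmp_normal(PredPair targets, bool qp, bool result)
{
    if (qp) {
        targets.p1 = result;
        targets.p2 = !result;
    }
    return targets;
}

/* Unconditional type: both targets are cleared first, regardless of qp,
   then the instruction behaves like the normal type. */
PredPair cmp_unc(PredPair targets, bool qp, bool result)
{
    targets.p1 = false;
    targets.p2 = false;
    return cmp_normal(targets, qp, result);
}
```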
The Uniform Data Model (UDM) proposes to use identically named data types for both the Win32 and Win64 environments. By using this model, you can maintain a single source code development environment for both Win32 and Win64, provided no architecture specific design features are implemented.
A type of data dependency between two instructions in one instruction group. The two instructions write to the same register.
Example:
add r4=r5,r6
add r4=r5,r6
A WAW data dependency exists between r4 in the first line and r4 in the second line.
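The RAW and WAW definitions in this glossary reduce to simple register comparisons between two instructions. The C sketch below is illustrative only; the Instr structure, its field names, and the use of -1 for an unused source are assumptions made for the example.

```c
/* A minimal description of an instruction: one destination register
   and up to two source registers (-1 marks an unused source). */
typedef struct {
    int dest;
    int src1;
    int src2;
} Instr;

/* RAW: the later instruction reads the register the earlier one wrote. */
int has_raw(Instr first, Instr second)
{
    return second.src1 == first.dest || second.src2 == first.dest;
}

/* WAW: both instructions write the same register. */
int has_waw(Instr first, Instr second)
{
    return first.dest == second.dest;
}
```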
The predefined section directives define and switch between commonly-used sections. A predefined section directive creates a new section with the default flags and type attributes, and makes that section the current section. The predefined section directive mnemonics are the same as the section names.
The table below lists the predefined section directives and their default flags and type attributes.
Directive/Section Name | Flags | Type | Usage |
.text | "ax" | "progbits" | Read-only object code |
.data | "wa" | "progbits" | Read-write initialized long data |
.sdata | "was" | "progbits" | Read-write initialized short data |
.bss | "wa" | "nobits" | Read-write uninitialized long data |
.sbss | "was" | "nobits" | Read-write uninitialized short data |
.rodata | "a" | "progbits" | Read-only long data (literals) |
.srodata | "as" | "progbits" | Read-only short data (literals) |
.comment | "" | "progbits" | Comments in the object file |
To include the contents of another file in the current source file, use the .include directive in the following format:
.include "filename"
Where "filename" specifies a string constant. If the specified filename is an absolute pathname, the file is included.
The .proc and .endp directives combine code belonging to the same procedure.
The .proc directive marks the beginning of a procedure, and the .endp directive marks the end of a procedure. A single procedure may consist of several disjointed blocks of code. Each block should be individually bracketed with these directives. Name operands within a procedure can be used only for that specific procedure.
The following code sequence shows the basic format of a procedure:
.proc name,...
name: // label
... // instructions in procedure
.endp name,...
Where name represents one or more entry points of the procedure. Each entry point has a different name.
Name operands of the .endp directive are ignored.
Symbols are declared with global, weak, or local scope. Symbol scopes are used to resolve symbol references within one object file or between multiple object files. Symbol scopes are placed in the object file symbol table, and any reference to a symbol is resolved at link time. By default, symbols have a local scope, where they are available only to the assembly-language source file in which they are defined.
References to symbols with a local scope are resolved from within the object file in which the symbols are declared. Local symbols with the same name in different object files do not refer to the same entity.
Symbols have a local scope by default, so it is not necessary to declare symbols with local scopes. However, the .local directive is available for completeness. The .local directive has the following format:
.local name,name, ...
Where name represents a symbol name.
References to symbols with a global scope are resolved within the object file in which the symbols are declared, and within other object files. Global symbols with the same name in different object files refer to the same entity.
To declare one or more symbols with a global scope, use the .global directive. These symbols are flagged as global symbols for the linkage editor. The .global directive has the following format:
.global name,name, ...
Where name represents a symbol name.
The assembler automatically aligns instructions and data objects on the appropriate boundaries within a section. It aligns bundles on 16-byte boundaries, and data objects according to their size. The assembler does not align string data, since strings are byte arrays.
To disable automatic alignment of data objects in data allocation statements, add the .ua completer after the data allocation mnemonic, for example, data4.ua.
Each section has an alignment attribute that is determined by the largest aligned object within the section.
Section location counters are not aligned automatically.
To align the location counter in the current section to a specified alignment boundary, use the .align statement.
The .align statement has the following format:
.align expression
Where expression is an integer that specifies the alignment boundary for the location counter in the current section. The integer must be a power of two.
The .align statement enables the assembler to reserve space in any section type, including a "nobits" section. During program execution time the contents of a "nobits" section are initialized as zero by the operating system program loader. When using the .align statement in any other section type, the assembler initializes the reserved space with zeros for non-executable sections, and with a NOP pattern for executable sections.
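The effect of .align on a location counter can be modeled as rounding up to the next multiple of a power of two. The C function below is a sketch of that arithmetic only, with a hypothetical name; it does not describe how the assembler is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Round counter up to the next multiple of boundary.
   boundary must be a non-zero power of two, as .align requires. */
uint64_t align_up(uint64_t counter, uint64_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (counter + boundary - 1) & ~(boundary - 1);
}
```

The difference between the result and the original counter is the amount of space that would be reserved as padding.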
The following example presents an opportunity to load data from memory before the control dependency.
#include <stddef.h> /* for NULL */
int add5(int *a)
{
    if (a == NULL)
        return -1;   /* the load of *a is control dependent on this test */
    else
        return *a + 5;
}