Assembler
Note: This documents the officially supported syntax of the assembler. The
assembler accepts some files that don't fully conform to this syntax, but this
behavior may change without warning in the future.
Lexical structure
Comments: Comments begin with a semicolon and go to the end of the line. Since no
valid tokens start with a semicolon, there is no ambiguity. Comments are ignored
during parsing.
Lines that are empty except for whitespace or a comment are ignored. Many grammar
productions require certain parts to be separated by a newline (LF/CRLF/CR). This
is represented below by the terminal EOL. Due to the rules above, EOL can represent
an optional comment, followed by a newline, followed by any number of empty/comment
lines. There are no line continuations.
Integer, Long, Float, and Double literals use the same syntax as Java with a few
differences:
* No underscores are allowed
* Doubles cannot be suffixed with d or D.
* Decimal floating point literals with values that can’t be represented exactly in
the target type aren’t guaranteed to round the same way as in Java. If the exact
value is significant, you should use a hexadecimal floating point literal.
* If a decimal point is present, there must be at least one digit before and after
it (0.5 is ok, but .5 is not; 5.0 is ok, but 5. is not).
* A leading plus or minus sign is allowed.
* Only decimal and hexadecimal literals are allowed (no binary or octal)
* For doubles, special values can be represented by +Infinity, -Infinity, +NaN, and
-NaN (case insensitive). For floats, these should be suffixed by f or F.
* NaNs with a specific binary representation can be represented by suffixing with
the hexadecimal value in angle brackets. For example, -NaN<0x7ff0123456789abc> or
+NaN<0xFFABCDEF>f
Note: NaN requires a leading sign, even though it is ignored. This is to avoid
ambiguity with WORDs. The binary representation of a NaN with no explicit
representation may be any valid encoding of NaN. If you care about the binary
representation in the classfile, you should specify it explicitly as described
above.
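The two recommendations above (hexadecimal literals for exact values, explicit bit
patterns for NaNs) can be sketched in Python. This is only an illustration and is
not part of the assembler; the helper names are hypothetical. It assumes the
standard IEEE 754 layouts used by the classfile format for floats and doubles.

```python
def exact_hex(value: float) -> str:
    """Return a hexadecimal floating point spelling of a double.

    Python's float.hex() uses the same 0x1.<hex>p<exp> shape as Java's
    hexadecimal floating point literals, so no decimal rounding is involved.
    """
    return value.hex()


def is_nan_bits64(bits: int) -> bool:
    """True if a 64-bit pattern encodes a NaN (double)."""
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and mantissa != 0


def is_nan_bits32(bits: int) -> bool:
    """True if a 32-bit pattern encodes a NaN (float)."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & ((1 << 23) - 1)
    return exponent == 0xFF and mantissa != 0


print(exact_hex(0.1))                     # 0x1.999999999999ap-4
print(is_nan_bits64(0x7FF0123456789ABC))  # True: pattern from the example above
print(is_nan_bits32(0xFFABCDEF))          # True: pattern from the example above
print(is_nan_bits64(0x7FF0000000000000))  # False: this pattern is +Infinity
```

Both bit patterns used in the NaN examples above have an all-ones exponent and a
nonzero mantissa, which is what makes them NaNs regardless of the sign bit.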
String literals use the same syntax as Java string literals with the following
exceptions:
* Non-printable and non-ASCII characters, including tabs, are not allowed. These
can be represented by escape sequences as usual. For example, \t means tab.
* Either single or double quotes can be used. If single quotes are used, double
quotes can appear unescaped inside the string and vice versa.
* There are three additional types of escape sequences allowed: \xDD, \uDDDD, and
\UDDDDDDDD, where D is a hexadecimal digit. The latter two are only allowed in
unicode strings (see below). In the case of \U, the digits must correspond to a
number less than 0x00110000. \x represents a byte or code point up to 255. \u
represents a code point up to 65535. \U represents a code point up to 1114111
(0x10FFFF), which will be split into a surrogate pair when encoded if it is above
0xFFFF.
* There are two types of string literals: bytes and unicode. Unicode strings are
the default and represent a sequence of code points which will be MUTF8 encoded
when written to the classfile. A byte string, represented by prefixing with b or B,
represents a raw sequence of bytes which will be written unchanged. For example,
"\0" is encoded to a two byte sequence while b"\0" puts an actual null byte in the
classfile (which is invalid, but potentially useful for testing).
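To make the unicode string rules above concrete, here is a minimal Python sketch of
MUTF8 (modified UTF-8) encoding as defined for the classfile format. It is not the
assembler's own code, and the function name mutf8_encode is hypothetical. It shows
the two behaviors mentioned above: a null code point becomes two bytes, and a code
point above 0xFFFF is split into a surrogate pair before encoding.

```python
def mutf8_encode(text: str) -> bytes:
    """Encode a sequence of code points as modified UTF-8 (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair; each half is encoded separately.
            cp -= 0x10000
            units = [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
        else:
            units = [cp]
        for u in units:
            if 0x01 <= u <= 0x7F:
                out.append(u)  # one byte
            elif u <= 0x7FF:
                # Two bytes. Note that this branch also covers u == 0.
                out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
            else:
                # Three bytes, including surrogate halves.
                out += bytes([0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)])
    return bytes(out)


print(mutf8_encode("\0").hex())          # c080
print(mutf8_encode("\U0001F600").hex())  # eda0bdedb880
```

A byte string such as b"\0" skips this step entirely, which is why it can place a
raw null byte in the output.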
Reference: The classfile format has a large number of places where an index is made
into the constant pool or bootstrap methods table. The assembly format allows you
to specify the definition inline, and the assembler will automatically add an entry
as appropriate and fill in the index. However, this isn’t acceptable in cases where
the exact binary layout is important or where a definition is large and you want to
refer to it many times without copying the definition each time.
For the first case, there are numeric references, designated by a decimal integer
in square brackets with no leading zeroes. For example, [43] refers to index 43 in
the constant pool. For the second case, there are symbolic references, each of
which is a sequence of lowercase ASCII letters, digits, and underscores inside
square brackets, not beginning with a digit. For example, [foo_bar4].
Bootstrap method references are the same except preceded by "bs:". For example,
[bs:43] or [bs:foo_bar4]. These are represented by the terminal BSREF. Bootstrap
method references are only used in very specific circumstances so you probably
won’t need them. All other references are constant pool references and have no
prefix, designated by the terminal CPREF.
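The reference syntax described above can be approximated with two regular
expressions. The following Python sketch is an illustration only: the pattern and
function names are hypothetical, and it assumes that [0] counts as a numeric
reference with no leading zeroes, which the text does not state explicitly.

```python
import re

# Decimal integer with no leading zeroes (assumption: a lone 0 is accepted).
_NUMERIC = r"(?:0|[1-9][0-9]*)"
# Lowercase ASCII letters, digits, and underscores, not beginning with a digit.
_SYMBOLIC = r"[a-z_][a-z0-9_]*"

CPREF = re.compile(rf"\[(?:{_NUMERIC}|{_SYMBOLIC})\]")
BSREF = re.compile(rf"\[bs:(?:{_NUMERIC}|{_SYMBOLIC})\]")


def classify_reference(token: str):
    """Return "BSREF", "CPREF", or None for a single token."""
    if BSREF.fullmatch(token):
        return "BSREF"
    if CPREF.fullmatch(token):
        return "CPREF"
    return None


for token in ["[43]", "[foo_bar4]", "[bs:43]", "[bs:foo_bar4]", "[043]", "[4foo]"]:
    print(token, classify_reference(token))
```

The last two tokens are rejected: [043] has a leading zero, and [4foo] is neither a
plain integer nor a symbolic name.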
Note: Constant pools and bootstrap method tables are class-specific. So definitions
inside one class do not affect any other classes assembled from the same source
file.
Note: Labels refer to positions in the bytecode of the enclosing Code attribute
where they appear. They may not appear outside of a Code attribute.
Words are used to specify names, identifiers, descriptors, and so on. If you need
to specify a name that can’t be represented as a word (such as using forbidden
characters), a string literal can be used instead. Words are represented in the
grammar by the terminal WORD.
For example, 42 is not a valid word because it begins with a digit. A class named
42 can be defined as follows:
.class "42"
In addition, when used in a context following flags, words cannot be any of the
possible flag names. These are currently public, private, protected, static, final,
super, synchronized, open, transitive, volatile, bridge, static_phase, transient,
varargs, native, interface, abstract, strict, synthetic, annotation, enum, module,
and mandated. In addition, strictfp is disallowed to avoid confusion. So if you
wanted to have a string field named bridge, you’d have to do
.field "bridge" Ljava/lang/String;
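A tool that generates assembly might decide when to fall back to a string literal
along these lines. This Python sketch is hypothetical and deliberately incomplete:
it only checks the two conditions discussed above (a leading digit and a clash with
a flag name) and does not implement the full WORD rule or any escaping of quotes
and backslashes inside the name.

```python
# Flag names listed above, plus strictfp, which is also disallowed.
FLAG_NAMES = frozenset({
    "public", "private", "protected", "static", "final", "super",
    "synchronized", "open", "transitive", "volatile", "bridge",
    "static_phase", "transient", "varargs", "native", "interface",
    "abstract", "strict", "synthetic", "annotation", "enum", "module",
    "mandated", "strictfp",
})

_DIGITS = frozenset("0123456789")


def quote_if_needed(name: str, after_flags: bool = True) -> str:
    """Return name as a bare word, or as a quoted string when required.

    after_flags says whether the name appears in a position following flags,
    which is the only context where flag names are reserved.
    """
    starts_with_digit = name[:1] in _DIGITS
    clashes_with_flag = after_flags and name in FLAG_NAMES
    if starts_with_digit or clashes_with_flag:
        return '"' + name + '"'
    return name


print(".class " + quote_if_needed("42", after_flags=False))
print(".field " + quote_if_needed("bridge") + " Ljava/lang/String;")
```

The two printed lines match the .class and .field examples shown above.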
Note: The most deeply nested possible valid constant is 6 levels (InvokeDynamic ->
Bootstrap -> MethodHandle -> Method -> NameAndType -> Utf8). It is possible to
create more deeply nested constant definitions in this grammar by using references
with invalid types, but the assembler may reject them.
Note: If the right hand side is a reference, the left hand side must be a symbolic
reference. For example, the following two are valid.
.const [foo] = [bar]
.const [foo] = [42]
method_body: attribute*
legacy_method_body: limit_directive+ code_body
limit_directive: ".limit" ("stack" | "locals") u16 EOL
Attributes
attribute: (named_attribute | generic_attribute) EOL
generic_attribute: ".attribute" utfref length_override? attribute_data
length_override: "length" u32
attribute_data: named_attribute | STRING_LITERAL
bootstrap_methods: ".bootstrapmethods"
Note: The content of a BootstrapMethods attribute is automatically filled in based
on the implicitly and explicitly defined bootstrap methods in the class. If this
attribute’s contents are nonempty and the attribute isn’t specified explicitly, one
will be added implicitly. This means that you generally don’t have to specify it.
It’s only useful if you care about the exact binary layout of the classfile.
Note: A Code attribute can only appear as a method attribute. This means that Code
attributes cannot be nested.
deprecated: ".deprecated"
stack_map_table: ".stackmaptable"
synthetic: ".synthetic"
Code
op_iinc: "iinc"
op_bipush: "bipush"
op_sipush: "sipush"
label: WORD
Annotations
element_value_line: element_value EOL
element_value: primtag ldc_rhs | "string" utfref | "class" utfref | "enum" utfref
utfref | element_value_array | "annotation" annotation_contents annotation_end
primtag: "byte" | "char" | "double" | "int" | "float" | "long" | "short" |
"boolean"
element_value_array: "array" EOL element_value_line* ".end" "array"
type_parameter_target: "typeparam" u8
supertype_target: "super" u16
type_parameter_bound_target: "typeparambound" u8 u8
empty_target: "empty"
method_formal_parameter_target: "methodparam" u8
throws_target: "throws" u16