This repository has been archived by the owner on Jun 6, 2021. It is now read-only.

Specification

Alex Rønne Petersen edited this page Jun 9, 2013 · 46 revisions

Introduction

This is the formal specification of the Flect programming language. All implementations of the language are expected to conform to the minimum subset of the language that this specification requires.

Conformance

When the terms required, must, shall, and similar are used in this specification, they mean that an implementation must behave as specified in order to conform to this specification. Conversely, if the terms must not, shall not, and so on are used, an implementation must not behave in the specified way in order to conform to this specification.

Only when the terms optional, can, and may are explicitly used in this specification does it indicate that an implementation does not have to implement the specified behavior in order to be conforming. It is, however, recommended that implementations also conform to all optional behaviors described in this specification.

Examples are provided throughout this specification. These demonstrate the expected behavior of an implementation and serve as clarification for possibly unclear or ambiguous statements. In other words, an implementation must behave in the way that examples in this specification demonstrate.

Finally, rationales are given in some parts of this specification where a design decision may not have an immediately obvious justification.

Grammar Notation

Throughout this specification, the Flect language grammar will be given where relevant. It is specified in a variation of the Extended Backus-Naur Form (EBNF). EBNF consists of a series of production rules (also called non-terminals) built on fundamental symbols, operators, and literals (called terminals). The meaning of the EBNF variant used in this specification is given here.

A production rule is defined as follows, using the ::= operator:

rule-name ::= ...

A number of operators are used to build the right-hand side of production rules:

  • "A": Means the literal character A. Some literals may be escaped, e.g. "\\" which means the \ character and "\"" which means the " character.
  • U+NNNNNNNN: Means the Unicode code point specified by the hexadecimal value NNNNNNNN.
  • "A" .. "B": Constructs a code point range from A to B according to the order of Unicode code points. Such ranges indicate that any one of the code points in the range is allowed.
  • A ... | B ...: The pipe character constructs an alternation. This means that either A ... or B ... (be they production rules or terminal symbols) is allowed.
  • [ A ... ]: Square brackets construct an option. This means that A ... may or may not appear.
  • { A ... }: Curly braces construct a repetition. This means that zero or more occurrences of A ... may appear.
  • < A ... >: Angle brackets construct a repetition where at least one occurrence of A ... must appear, and possibly more.
  • A ... * N: The asterisk indicates that A ... must occur N times.
  • ( A ... ): Parentheses perform simple grouping (as in arithmetic) to resolve precedence issues.
  • ? ... ?: Specifies a special sequence. The meaning of the sequence is explicitly given in the ... part.

(In the above rules (except for the ? ... ? rule), three periods (...) indicate zero or more production rule names or terminal symbols.)

No production rules allow end of file (EOF) to occur unless explicitly specified.

For example, given the above definitions, one could specify production rules for integer and floating point literals as follows:

integer ::= < "0" .. "9" >
float ::= integer "." integer [ exponent ]
exponent ::= ("e" | "E") [ "+" | "-" ] integer
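As a rough illustration of how these example rules read, the following Python sketch translates them into regular expressions. It reads integer as one or more decimal digits, which is an assumption of this sketch, and the helper name matches is hypothetical.

```python
import re

# Regular-expression counterparts of the example rules.
# Assumption: `integer` stands for one or more decimal digits.
INTEGER = r"[0-9]+"
# exponent ::= ("e" | "E") [ "+" | "-" ] integer
EXPONENT = rf"[eE][+-]?{INTEGER}"
# float ::= integer "." integer [ exponent ]
FLOAT = rf"{INTEGER}\.{INTEGER}(?:{EXPONENT})?"


def matches(rule: str, text: str) -> bool:
    """Return True if the whole of `text` can be derived from `rule`."""
    return re.fullmatch(rule, text) is not None
```

Here the square-bracket option maps to `?` and the parenthesized alternation maps to a character class.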

Lexical

A Flect source file consists of a series of Unicode code points encoded as UTF-8. To interpret such a source file correctly, a compiler must process it in three stages. First, the code points are passed through the Flect lexical grammar, which produces a series of tokens. Second, these tokens are passed through the preprocessor grammar, which produces a filtered set of tokens. Finally, the filtered tokens are passed through the Flect syntactic grammar. The end result is then (presumably) a syntax tree used for further semantic analysis and finally code generation and/or execution.

The terminal symbols of the lexical grammar are Unicode code points, while the terminal symbols of the syntactic grammar are the tokens produced from lexical analysis.

The lexical grammar is given in this section. The syntactic grammar is given throughout this specification in relevant sections, with a full grammar in the final section.

token ::= directive
          | operator-or-separator
          | identifier
          | keyword
          | integer-literal
          | float-literal
          | character-literal
          | string-literal

All white space in the Unicode Zs (Separator, Space) category must be dropped during lexical analysis, as must the horizontal tab, vertical tab, and form feed characters.

white-space ::= ? any Unicode category Zs character ?
                | U+00000009
                | U+0000000B
                | U+0000000C

White space is thus ignored in the other productions in the lexical grammar and assumed not to be present in between the parts that make up the right-hand side of production rules.
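A minimal Python sketch of the white-space rule is given below. It relies on the standard unicodedata module to look up the Zs category; the function names is_white_space and skip_white_space are hypothetical and only illustrate one way a lexer might apply the rule.

```python
import unicodedata

# Horizontal tab, vertical tab, and form feed, as listed in the white-space rule.
EXTRA_WHITE_SPACE = {"\u0009", "\u000B", "\u000C"}


def is_white_space(ch: str) -> bool:
    """Return True if `ch` is matched by the white-space production."""
    return unicodedata.category(ch) == "Zs" or ch in EXTRA_WHITE_SPACE


def skip_white_space(source: str, pos: int) -> int:
    """Return the index of the first non-white-space character at or after `pos`."""
    while pos < len(source) and is_white_space(source[pos]):
        pos += 1
    return pos
```

Note that, following the rule exactly as written, this sketch does not classify the line feed (U+0000000A) as white space.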

Comments

Comments must be stripped during lexical analysis as with white space.

comment ::= line-comment | block-comment
line-comment ::= "//" { ? any character except U+0000000A ? } ( U+0000000A | ? end of file ? )
block-comment ::= "/*" { ? any character sequence except "*/" ? } "*/"

An implementation may choose to preserve comments for the purpose of documentation generation but they shall have no effect on program semantics.

As with white space, comments are ignored in the lexical grammar's production rules and assumed not to be present in between the parts that make up the right-hand side of production rules.
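The comment productions can be sketched in Python as follows. The function name skip_comment is hypothetical, and raising SyntaxError for an unterminated block comment is an assumption of this sketch, since the rules above only state that end of file is not allowed there.

```python
def skip_comment(source: str, pos: int) -> int:
    """Return the index just past a comment starting at `pos`.

    If no comment starts at `pos`, `pos` is returned unchanged.
    """
    if source.startswith("//", pos):
        end = source.find("\n", pos)
        # A line comment ends at a line feed or at end of file.
        return len(source) if end == -1 else end + 1
    if source.startswith("/*", pos):
        # Block comments end at the first "*/"; per the grammar they do not nest.
        end = source.find("*/", pos + 2)
        if end == -1:
            raise SyntaxError("unterminated block comment")
        return end + 2
    return pos
```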

Preprocessor Directives

Preprocessor directives are used to control conditional compilation of code. Unlike in other languages, preprocessing in Flect happens between lexing and parsing, so preprocessor directives only have to be tokenized during lexical analysis.

directive ::= "\\" identifier

Operators and Separators

These are various operators used in expressions, and separators used to delineate code elements.

operator-or-separator ::= operator | separator
operator ::= "+"
             | "-"
             | "->"
             | "*"
             | "/"
             | "%"
             | "&"
             | "&&"
             | "|"
             | "||"
             | "|>"
             | "^"
             | "~"
             | "!"
             | "!="
             | "!=="
             | "."
             | ".."
             | "@"
             | "="
             | "=="
             | "==="
             | "<"
             | "<<"
             | "<="
             | "<|"
             | ">"
             | ">>"
             | ">="
separator ::= "("
              | ")"
              | "{"
              | "}"
              | "["
              | "]"
              | ","
              | ";"
              | ":"
              | "::"

Identifiers

Identifiers are used for naming types, functions, variables, and so on. They are simple alphanumerical sequences (underscores are also allowed).

identifier ::= ( "a" .. "z" | "A" .. "Z" | "_" ) { "a" .. "z" | "A" .. "Z" | "0" .. "9" | "_" }

All identifiers prefixed with __ (two underscores) are reserved for future expansion of the language and for implementation-specific features.
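For illustration, the identifier rule and the reserved prefix can be checked with a short Python sketch. The function names are hypothetical.

```python
import re

# identifier ::= ( "a" .. "z" | "A" .. "Z" | "_" ) { "a" .. "z" | "A" .. "Z" | "0" .. "9" | "_" }
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text: str) -> bool:
    """Return True if `text` matches the identifier production."""
    return IDENTIFIER.fullmatch(text) is not None


def is_reserved_identifier(text: str) -> bool:
    """Return True if `text` is an identifier with the reserved `__` prefix."""
    return is_identifier(text) and text.startswith("__")
```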

Keywords

Keywords are special identifiers used to direct syntactic analysis of Flect programs. They shall not be treated as identifiers.

Some keywords are reserved for future expansion of the language.

keyword ::= used-keyword | reserved-keyword
used-keyword ::= "mod"
                 | "use"
                 | "pub"
                 | "priv"
                 | "trait"
                 | "impl"
                 | "struct"
                 | "union"
                 | "enum"
                 | "type"
                 | "fn"
                 | "ext"
                 | "ref"
                 | "glob"
                 | "tls"
                 | "mut"
                 | "imm"
                 | "let"
                 | "as"
                 | "if"
                 | "else"
                 | "cond"
                 | "match"
                 | "loop"
                 | "while"
                 | "for"
                 | "break"
                 | "goto"
                 | "return"
                 | "safe"
                 | "unsafe"
                 | "asm"
                 | "true"
                 | "false"
                 | "null"
                 | "new"
                 | "assert"
                 | "in"
                 | "meta"
                 | "test"
                 | "macro"
                 | "quote"
                 | "unquote"
reserved-keyword ::= "yield"
                     | "fixed"
                     | "pragma"
                     | "scope"
                     | "move"

Literals

Literals are the fundamental building block of the Flect language. These are the values that can only be expressed directly as terminal symbols to the syntactic grammar and not through any other language construct.

Integer Literals

An integer literal is a number with a base of 2, 8, 10, or 16 and an optional type specifier.

integer-literal ::= < "0" .. "9" >
                    | "0" ( "b" | "B" ) < "0" .. "1" >
                    | "0" ( "o" | "O" ) < "0" .. "7" >
                    | "0" ( "x" | "X" ) < "0" .. "9" | "a" .. "f" | "A" .. "F" >
typed-integer-literal ::= integer-literal [ ":" ( "i" | "u" ) [ "8" | "16" | "32" | "64" ] ]

A literal prefixed with 0b is a binary integer literal, 0o is an octal literal, and 0x is a hexadecimal literal. If no prefix is present, the literal is decimal.

The suffix :i8 means that the literal is interpreted as a signed 8-bit integer, while :u8 means it is interpreted as an unsigned 8-bit integer, and so on. The special suffixes :i and :u mean word-sized integer types, signed and unsigned respectively. If no suffix is given, the literal's type is inferred during semantic analysis.
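The following Python sketch shows one possible reading of the typed-integer-literal rule. The function name parse_integer_literal is hypothetical, and the sketch does not check that the value fits the suffixed type, since this section does not state how such a case is handled.

```python
import re

# integer-literal with an optional type suffix, as in typed-integer-literal.
INTEGER_LITERAL = re.compile(
    r"(?:0[bB](?P<bin>[01]+)|0[oO](?P<oct>[0-7]+)"
    r"|0[xX](?P<hex>[0-9a-fA-F]+)|(?P<dec>[0-9]+))"
    r"(?::(?P<sign>[iu])(?P<bits>8|16|32|64)?)?"
)
BASES = {"bin": 2, "oct": 8, "hex": 16, "dec": 10}


def parse_integer_literal(text: str):
    """Return (value, suffix); suffix is e.g. "u8", "i", or None if absent."""
    m = INTEGER_LITERAL.fullmatch(text)
    if m is None:
        raise ValueError(f"not an integer literal: {text!r}")
    # Exactly one of the digit groups is set; it determines the base.
    name = next(n for n in BASES if m.group(n) is not None)
    value = int(m.group(name), BASES[name])
    suffix = None
    if m.group("sign"):
        suffix = m.group("sign") + (m.group("bits") or "")
    return value, suffix
```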

Floating Point Literals

A floating point literal is an IEEE 754 floating point number consisting of an integral part, a fractional part, an optional exponent part, an optional exponent sign, and an optional type specifier.

float-literal ::= float-part "." float-part [ float-exponent ]
typed-float-literal ::= float-literal [ ":f" ( "32" | "64" ) ]
float-part ::= < "0" .. "9" >
float-exponent ::= ( "e" | "E" ) [ "+" | "-" ] float-part

If the suffix :f32 is used, the number is interpreted as an IEEE 754 binary32 value. If the :f64 suffix is used, it is interpreted as an IEEE 754 binary64 value. If no suffix is given, the literal's type is inferred during semantic analysis.
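A comparable sketch for floating point literals is shown below. The function name parse_float_literal is hypothetical. The sketch assumes the host Python float is an IEEE 754 binary64 value and uses the standard struct module to round to binary32 when the :f32 suffix is present.

```python
import re
import struct

# float-literal with an optional type suffix, as in typed-float-literal.
FLOAT_LITERAL = re.compile(
    r"(?P<number>[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?)(?::f(?P<bits>32|64))?"
)


def parse_float_literal(text: str):
    """Return (value, suffix); suffix is "f32", "f64", or None if absent."""
    m = FLOAT_LITERAL.fullmatch(text)
    if m is None:
        raise ValueError(f"not a float literal: {text!r}")
    value = float(m.group("number"))
    bits = m.group("bits")
    if bits == "32":
        # Round-trip through a 4-byte float to obtain the nearest binary32 value.
        value = struct.unpack("<f", struct.pack("<f", value))[0]
    return value, ("f" + bits if bits else None)
```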

Character Literals

A character literal is a single Unicode code point. Its value is the number of the code point.

character-literal ::= "'" ( ? any character except "'" and "\\" ? | character-escape-sequence ) "'"
character-escape-sequence ::= "\\" ( character-escape-code | character-escape-unicode )
character-escape-code ::= "0" | "a" | "b" | "f" | "n" | "r" | "t" | "v" | "'" | "\\"
character-escape-unicode ::= "u" ( ( "0" .. "9" | "a" .. "f" | "A" .. "F" ) * 8 )

An escape sequence can be used to form a special character as shown in the following table.

Escape sequence   Character name    Unicode code point
\0                Null              U+00000000
\a                Alert             U+00000007
\b                Backspace         U+00000008
\f                Form feed         U+0000000C
\n                Line feed         U+0000000A
\r                Carriage return   U+0000000D
\t                Horizontal tab    U+00000009
\v                Vertical tab      U+0000000B
\'                Single quote      U+00000027
\\                Backslash         U+0000005C
\uPPPPPPPP        Code point        U+PPPPPPPP
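The table can be turned into a small decoder. The Python sketch below computes the value of a character literal as written in source text; the function name character_literal_value is hypothetical, and the sketch does not check whether a \u escape names a valid Unicode code point.

```python
import string

# Escape codes from the table above, mapped to their code points.
ESCAPE_CODES = {
    "0": 0x00, "a": 0x07, "b": 0x08, "f": 0x0C, "n": 0x0A,
    "r": 0x0D, "t": 0x09, "v": 0x0B, "'": 0x27, "\\": 0x5C,
}


def character_literal_value(literal: str) -> int:
    """Return the code point denoted by a character literal such as 'a'."""
    if len(literal) < 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("not a character literal")
    body = literal[1:-1]
    if len(body) == 1 and body not in ("'", "\\"):
        return ord(body)
    if body[0] == "\\":
        if len(body) == 2 and body[1] in ESCAPE_CODES:
            return ESCAPE_CODES[body[1]]
        # \u must be followed by exactly eight hexadecimal digits.
        digits = body[2:]
        if body[1:2] == "u" and len(digits) == 8 and all(
            c in string.hexdigits for c in digits
        ):
            return int(digits, 16)
    raise ValueError("invalid character literal")
```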

String Literals

A string literal is a series of Unicode code points encoded as UTF-8.

string-literal ::= "\"" { ? any character except "\"" and "\\" ? | string-escape-sequence } "\""
string-escape-sequence ::= "\\" ( string-escape-code | string-escape-unicode )
string-escape-code ::= "0" | "a" | "b" | "f" | "n" | "r" | "t" | "v" | "\"" | "\\"
string-escape-unicode ::= "u" ( ( "0" .. "9" | "a" .. "f" | "A" .. "F" ) * 8 )

An escape sequence can be used to form a special character as shown in the following table.

Escape sequence   Character name    Unicode code point
\0                Null              U+00000000
\a                Alert             U+00000007
\b                Backspace         U+00000008
\f                Form feed         U+0000000C
\n                Line feed         U+0000000A
\r                Carriage return   U+0000000D
\t                Horizontal tab    U+00000009
\v                Vertical tab      U+0000000B
\\                Backslash         U+0000005C
\"                Double quote      U+00000022
\uPPPPPPPP        Code point        U+PPPPPPPP
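Since a string literal denotes UTF-8 encoded code points, a decoder can resolve the escape sequences and then encode the result. The Python sketch below is one possible reading of the rules; the function name decode_string_literal is hypothetical.

```python
import string

# Escape codes from the table above, mapped to the characters they denote.
STRING_ESCAPE_CODES = {
    "0": "\0", "a": "\a", "b": "\b", "f": "\f", "n": "\n",
    "r": "\r", "t": "\t", "v": "\v", '"': '"', "\\": "\\",
}


def decode_string_literal(literal: str) -> bytes:
    """Return the UTF-8 bytes denoted by a string literal, quotes included in input."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a string literal")
    body, out, i = literal[1:-1], [], 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            raise ValueError("unescaped double quote")
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        code = body[i + 1 : i + 2]
        digits = body[i + 2 : i + 10]
        if code in STRING_ESCAPE_CODES:
            out.append(STRING_ESCAPE_CODES[code])
            i += 2
        elif code == "u" and len(digits) == 8 and all(
            c in string.hexdigits for c in digits
        ):
            out.append(chr(int(digits, 16)))
            i += 10
        else:
            raise ValueError("invalid escape sequence")
    return "".join(out).encode("utf-8")
```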

Common Grammar Elements

This section lists a few grammar elements that are commonly used throughout the syntactic grammar.

Qualified Identifiers

A qualified identifier is an unambiguous name referring to a module, type, function, variable, or macro (depending on the lexical context).

qualified-identifier ::= identifier { "::" identifier }

Modules and Bundles

A Flect program consists of one or more modules which make up a bundle (which is a static library, shared library, or executable). These concepts are the fundamental building blocks for abstractions in Flect.

program ::= { module-declaration }

Modules

The module is the fundamental unit of encapsulation and reuse in Flect. It is a container of declarations; that is, types, functions, variables, and macros. A module has a visibility (pub or priv) which indicates whether or not it can be imported outside its containing bundle. Each declaration in the module also has a visibility which indicates whether that declaration can be accessed at all outside the module.

A source file can contain multiple module declarations; modules have no particular semantic relationship with the source file(s) they reside in.

Modules cannot be lexically nested. However, nested module namespaces are allowed (that is, one module declaration's name is foo::bar and another's is foo::bar::baz).

module-declaration ::= "mod" qualified-identifier "{" { declaration } "}"

By convention, module names should contain at least one prefixing component corresponding to the bundle they are part of. For example, if a bundle math (libmath) has a module for vector math, this module might be math::vector.

Note that all module names prefixed with a component equal to core (core run-time modules), std (standard library modules), etc (implementation-specific modules), or exp (experimental modules for future inclusion into one of the former) are reserved for use by implementations of the Flect language.

Bundles

A bundle is a collection of Flect modules. A bundle can be either a static library, a shared library, or an executable. Bundles are the primary mechanism through which code is packaged and distributed for use or reuse.

Bundles have no direct representation in the language as they are purely an aspect of Flect's compilation model.

It shall be an error for multiple modules with the same full name to exist in the same bundle. It is also an error if two modules with the same full name are present during compilation of a bundle. Suppose for instance that bundles A and B both have modules called foo. Bundle C now links to bundle A and B. Since two modules named foo exist, name resolution is ambiguous in bundle C, and an error shall therefore be issued.
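The ambiguity check in the example above can be sketched in Python as follows. The representation of bundles as a mapping from bundle name to module names, and the function name find_ambiguous_modules, are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_ambiguous_modules(bundles: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return each module name provided by more than one bundle.

    `bundles` maps a bundle name to the full names of the modules it contains.
    """
    owners = defaultdict(list)
    for bundle_name in sorted(bundles):
        for module_name in bundles[bundle_name]:
            owners[module_name].append(bundle_name)
    return {name: found for name, found in owners.items() if len(found) > 1}
```

For the bundles A and B described above, the result would report foo as provided by both, and a compiler building bundle C would then issue an error.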

Note that the bundle names core, std, etc, and exp are reserved for implementations of the Flect language.

Type System

The type system in Flect consists of integers, floating point numbers, the bool and unit types, tuples, structures, discriminated unions, arrays, vectors, various pointer types, function pointer types (with closures), and finally, user-defined, nominal types. Inner type qualifiers (mut, imm) are used to construct types with different levels of mutation guarantees. Traits and implementations (effectively a type class system) are used to aid in writing type-generic code.

type ::= nominal-type
         | tuple-type
         | function-type
         | array-type
         | vector-type
         | pointer-type

Note that while some type names (i8, f32, self, etc) get special treatment during semantic analysis, they are not keywords. They are treated as regular identifiers by lexical analysis and can be used as such.

There is a special grammar rule for function return types:

return-type ::= type | "!"

The ! character indicates that the function diverges; that is, it does not return, so it does not have a return type. This is primarily intended for functions such as the abort function in the standard C library.

Nominal Types

nominal-type ::= named-type [ type-arguments ]
named-type ::= integer-type
               | float-type
               | bool-type
               | unit-type
               | self-type
               | qualified-identifier
type-arguments ::= "[" type { "," type } "]"

Integer Types

integer-type ::= "i8" | "i16" | "i32" | "i64" | "int" | "u8" | "u16" | "u32" | "u64" | "uint"

Floating Point Types

float-type ::= "f32" | "f64"

Boolean Type

bool-type ::= "bool"

Unit Type

unit-type ::= "unit"

Self Type

self-type ::= "self"

Tuple Types

tuple-type ::= "(" type < "," type > ")"

Function Types

function-type ::= function-pointer-type | closure-pointer-type
function-pointer-type ::= "fn" [ function-type-convention ] function-type-parameters "->" return-type
function-type-convention ::= "ext" string-literal
function-type-parameters ::= "(" [ function-type-parameter { "," function-type-parameter } ] ")"
function-type-parameter ::= [ "mut" ] [ "ref" ] type
closure-pointer-type ::= "fn" "@" function-type-parameters "->" return-type

Array Types

array-type ::= managed-array-type | unsafe-array-type | general-array-type
managed-array-type ::= "@" [ "mut" | "imm" ] "[" type "]"
unsafe-array-type ::= "*" [ "mut" | "imm" ] "[" type "]"
general-array-type ::= "&" [ "mut" | "imm" ] "[" type "]"

Vector Types

vector-type ::= "[" type ".." integer-literal "]"

Pointer Types

pointer-type ::= managed-pointer-type | unsafe-pointer-type | general-pointer-type
managed-pointer-type ::= "@" [ "mut" | "imm" ] type
unsafe-pointer-type ::= "*" [ "mut" | "imm" ] type
general-pointer-type ::= "&" [ "mut" | "imm" ] type

Declarations

Attributes

attribute ::= "@" "[" attribute-name [ attribute-arguments ] "]"
attribute-name ::= keyword | identifier
attribute-arguments ::= "(" [ attribute-argument { "," attribute-argument } ] ")"
attribute-argument ::= attribute-name [ "=" attribute-value ]
attribute-value ::= attribute-literal | attribute-arguments
attribute-literal ::= typed-integer-literal | typed-float-literal | character-literal | string-literal

Expressions

Macros

Compile-Time Evaluation

Memory Management

Application Binary Interface

The application binary interface (ABI) specifies certain conventions that shall be followed when compiling Flect source code to machine code.

Name Mangling

When compiled to an object file format (such as ELF or PE/COFF), Flect functions (fn declarations) should have their names mangled according to the following procedure:

  1. Start out with the string fl__.
  2. Take the full module name of the module containing the function and replace all instances of :: with _.
  3. Append the adjusted module name.
  4. Append two underscores (__).
  5. Append the name of the function.

For example, a function do_stuff in a module foo::bar shall be mangled as fl__foo_bar__do_stuff.

Note that only functions with flect linkage need to be mangled. Functions with any other linkage, such as cdecl, follow the name mangling rules of the ABI on the target platform.

Global variables (glob declarations) and constants (const declarations) are also subject to name mangling according to this procedure:

  1. Start out with the string fl_g__ (for global variables) or fl_c__ (for constants).
  2. Take the full module name of the module containing the global variable or constant and replace all instances of :: with _.
  3. Append the adjusted module name.
  4. Append two underscores (__).
  5. Append the name of the global variable or constant.

For example, a global variable data in a module foo::bar shall be mangled as fl_g__foo_bar__data. A constant table in a module bar::baz shall be mangled as fl_c__bar_baz__table.
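Both mangling procedures can be summarized in a short Python sketch. The function name mangle and the kind labels "fn", "glob", and "const" are hypothetical; only the prefixes and the replacement rule come from the procedures above.

```python
# Prefixes for functions, global variables, and constants respectively.
PREFIXES = {"fn": "fl__", "glob": "fl_g__", "const": "fl_c__"}


def mangle(kind: str, module_name: str, name: str) -> str:
    """Mangle `name`, declared in module `module_name`, for the given kind."""
    adjusted = module_name.replace("::", "_")  # step 2: replace :: with _
    return PREFIXES[kind] + adjusted + "__" + name  # steps 1, 3, 4, and 5
```

With this sketch, mangle("fn", "foo::bar", "do_stuff") yields fl__foo_bar__do_stuff, matching the function example given earlier in this section.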

Memory Layout

The memory layout of the program stack is implementation-defined. It is, however, recommended that implementations follow the requirements of the target platform's C ABI.

Structures (struct declarations) shall map 1:1 to C structs on the target platform; that is, they shall follow the same alignment and padding rules as the target platform's C ABI specifies.

For instance, consider this structure:

pub struct Foo {
    pub x : u32;
    pub y : f64;
}

This will compile to a C struct like this:

struct Foo {
    unsigned int x;
    double y;
};
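The resulting layout can be inspected on a given platform with Python's ctypes module, which lays out structures according to the host C ABI by default. The sketch below declares a counterpart of the C struct above; the helper name describe_layout is hypothetical, and the exact offsets it reports depend on the platform.

```python
import ctypes


class Foo(ctypes.Structure):
    # Counterpart of the C struct above: u32 maps to uint32, f64 to double.
    _fields_ = [("x", ctypes.c_uint32), ("y", ctypes.c_double)]


def describe_layout(struct_type):
    """Return (field offsets, total size, alignment) for a ctypes structure."""
    offsets = {
        name: getattr(struct_type, name).offset
        for name, _ in struct_type._fields_
    }
    return offsets, ctypes.sizeof(struct_type), ctypes.alignment(struct_type)
```

Any padding between x and y shows up as the gap between the end of x and the offset reported for y.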

Enumerations (enum declarations) shall compile down to the underlying type specified as part of the declaration.

Take for instance this enumeration:

enum Foo : u16 {
    Bar = 0;
    Baz = 1;
    Qux = 2;
}

Whenever a value of type Foo is created, it shall compile directly to the u16 equivalent. For instance, Foo.Qux shall compile to 2:u16.

Unions (union declarations) shall compile down to C structs where the first field is a uint tag describing the union case the instance represents. The rest of the resulting struct is mostly opaque, but the remaining space must be large enough to hold all fields in the largest union case.

For example, consider this discriminated union:

pub union Union {
    Foo {
        pub x : i32;
    }
    Bar {
        pub x : i32;
        pub y : i32;
    }
}

This would compile down to this C code:

struct Union {
    size_t tag;
    char data[8];
};

The data field's size represents the size of the largest case in the union. This size may actually differ depending on alignment and padding rules of the target platform - the above is only what the struct would look like on a 32-bit x86 processor.

Whenever a union is matched against, the data field is simply reinterpreted as the relevant union case's memory. For the purposes of memory layout, a union case can be thought of as a structure by itself.
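The reinterpretation described here can be imitated with Python's ctypes module. In the sketch below, the class names FooCase and BarCase, the helper view_case, and the assignment of tags in declaration order starting at 0 are all assumptions made for illustration; this section does not specify how tag values are chosen.

```python
import ctypes


class FooCase(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int32)]


class BarCase(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


# Assumption: tags follow declaration order, starting at 0.
CASES = [FooCase, BarCase]


class Union(ctypes.Structure):
    _fields_ = [
        ("tag", ctypes.c_size_t),
        # Large enough to hold all fields of the largest case.
        ("data", ctypes.c_ubyte * max(ctypes.sizeof(c) for c in CASES)),
    ]


def view_case(value: Union):
    """Reinterpret the data field as the case selected by the tag."""
    case_type = CASES[value.tag]
    # The returned object shares memory with `value`; nothing is copied.
    return case_type.from_buffer(value, Union.data.offset)
```

Writing through the returned view modifies the bytes of the data field directly, which mirrors treating a union case as a structure by itself.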

Calling Convention

The default (flect) linkage uses an implementation-defined calling convention. This specification does not dictate any aspects of it, but does recommend that implementations use a commonly supported calling convention such as cdecl.

All other linkage types are subject to whatever rules are dictated by the target platform's C ABI.

Foreign Function Interface

Unit Testing

Documentation Comments
