Language Specification

Version information

  • Version 0.21
  • Last updated: 19 August 2018

Introduction

This is the reference specification for Rosewood. It defines the syntax and semantics of the Rosewood language by stating the requirements that conforming Rosewood source code must meet. For implementation details, see the Go reference implementation.

Rosewood is a simple domain-specific language intended primarily to simplify the automatic generation and formatting of statistical tables. The language provides facilities for representing tables in a simple, human-readable format and for manipulating and formatting the structure and style of those tables. Because Rosewood’s source files are plain text files, they can be created by hand or generated automatically by a statistical package (e.g., SAS, R, Stata) or by a general-purpose scripting or programming language.

Rosewood uses a simple markup, inspired by Markdown, to define a table and its contents, which makes tables easier to create and maintain than in more complicated markup languages such as HTML or XML. The simple, non-intrusive markup keeps Rosewood tables easily readable without rendering the source code or using special viewers. This readability was the overriding design goal of Rosewood: analysts frequently need to inspect tables for correctness while analyzing data, and forcing them to use a different tool to view the table contents tends to interrupt their workflow. For this reason, the code needed to format (e.g., merge certain cells) and style the table and its contents is stored in a separate section of the file, so as not to reduce the readability of the tabular data.

Source code representation

Rosewood’s source code is Unicode text encoded in UTF-8. For security reasons, the Unicode NUL character (U+0000) must be replaced with the REPLACEMENT CHARACTER (U+FFFD).
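As an illustration of the rule above, here is a minimal Python sketch (not taken from the Go reference implementation). The function name sanitize_source is hypothetical, and rejecting invalid UTF-8 with an exception is an assumption, since this document does not say how malformed input should be handled.

```python
def sanitize_source(raw: bytes) -> str:
    """Decode UTF-8 source bytes and replace NUL with U+FFFD."""
    # Assumption: invalid UTF-8 is rejected (UnicodeDecodeError) rather than repaired.
    text = raw.decode("utf-8")
    # Replace every NUL character (U+0000) with the REPLACEMENT CHARACTER (U+FFFD).
    return text.replace("\u0000", "\ufffd")
```

For example, sanitize_source(b"a\x00b") yields the three-character string "a", U+FFFD, "b".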

Grammar in Extended Backus-Naur Form

(see footnote 1 for EBNF syntax. A checked box indicates that the grammar is implemented.)

The rest of this document explains the above grammar.

Rosewood file structure

A valid Rosewood file consists of one or more table blocks. Each table block starts with a Section Separator, and consists of exactly four sections separated by a Section Separator. The last Section Separator is optional if followed by end of file (TODO: verify).

A table block’s four sections must be in this order:

  1. The Caption section: zero or more lines of valid Markdown text to be rendered above the tabular data.
  2. The Data section: one or more lines conforming with Rosewood table markup.
  3. The Footnotes section: zero or more lines of valid Markdown text to be rendered below the tabular data.
  4. The Rules section: zero or more valid Rosewood commands.

Example of a valid, complete Rosewood file:

+++
arbitrary caption text goes in here
+++
                        One big merged row                                             |
first col merged in rows 1-2| cols 2 and 3 merged in row 2       | merged in rows 1-2  |
                            | subheading       |  subheading     |                     |
entry 1                     |       281        |      281.0      |        D51.0        |
entry 2                     |       283        |      283.0      |        D59.1        |
entry 3                     |       720        |      720.0      |         M45         |
+++
1. Some footnote
2. Some other footnote
+++
merge row 1
merge row 2 col 1:2
merge row 2:3 col 1 
merge row 2:3 col 4
style row 1:3 header
+++

This code will generate this HTML file (TODO: add link)

Example of a data-only file

+++
+++
Item             | Stars |
Butter chicken   |   5   |
Star-anise candy |   4   |
Wilted lettuce   |   0   |
+++
+++
+++

This code will generate this HTML file (TODO: add link)
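The file structure shown in the two examples can be sketched in Python as follows. This is only an illustration under stated assumptions: it handles a single table block, it treats a line consisting of exactly "+++" as a Section Separator, and it accepts a missing final separator at end of file (a point the text above still marks as TODO). The names split_table_block and SECTION_NAMES are hypothetical.

```python
import re

SEPARATOR = "+++"
SECTION_NAMES = ("caption", "data", "footnotes", "rules")

def split_table_block(source):
    """Split the text of ONE table block into its four sections (lists of lines)."""
    lines = re.split(r"\r\n|\r|\n", source)
    if lines and lines[-1] == "":
        lines.pop()  # a final line ending does not start another line
    if not lines or lines[0] != SEPARATOR:
        raise ValueError("a table block must start with a Section Separator")
    sections, current = [], []
    for line in lines[1:]:
        if line == SEPARATOR:
            sections.append(current)
            current = []
        else:
            current.append(line)
    if len(sections) == 3:
        sections.append(current)  # assumption: closing separator omitted at end of file
    elif len(sections) != 4 or current:
        raise ValueError("expected exactly four sections")
    block = dict(zip(SECTION_NAMES, sections))
    if not block["data"]:
        raise ValueError("the Data section cannot be empty")
    return block
```

Applied to the data-only example, the sketch returns empty caption, footnotes and rules lists and a data list holding the four table rows.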

Terms and definitions

  • In this document, the unqualified term character is synonymous with the term Unicode code point.
  • The term Unicode code point means a Unicode scalar value where possible, and an isolated surrogate code point when not.
  • A line is a sequence of zero or more characters other than newline (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.
  • A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.
  • A blank line is a line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009).
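The definitions of line, line ending and blank line can be sketched in Python as below. The helper names are hypothetical. One point is left open by the definitions and is only an assumption here: when the text ends with a line ending, the sketch reports a final empty line.

```python
import re

# CRLF is listed first so that a carriage return followed by a newline
# counts as one line ending, not two.
LINE_ENDING = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text into lines using the three line endings defined above."""
    return LINE_ENDING.split(text)

def is_blank(line):
    """A blank line has no characters, or only spaces (U+0020) and tabs (U+0009)."""
    return all(ch in " \t" for ch in line)
```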

The Caption Section

  • consists of zero or more lines (i.e., it can be empty).
  • The following Markdown elements are supported:
    • ^Superscript^
    • ~Subscript~
    • *italic*
    • **bold**
    • _italic_
    • __bold__
  • These characters can be escaped using '\'. '\' itself can be escaped using '\\'.
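To make the escaping rule concrete, here is a small Python sketch that turns the supported inline markup into HTML. It is an illustration only: the mapping to the strong, em, sup and sub tags is an assumption, the function name render_inline is hypothetical, and the simple open/close toggling does not attempt to handle unbalanced or improperly nested markers.

```python
import html

# Longest markers first, so "**" is recognised before "*".
MARKS = ("**", "__", "*", "_", "^", "~")
TAGS = {"**": "strong", "__": "strong", "*": "em", "_": "em", "^": "sup", "~": "sub"}

def render_inline(text):
    """Render caption/footnote markup to HTML (assumed tag mapping)."""
    out, open_marks, i = [], set(), 0
    while i < len(text):
        # A backslash makes the next character literal, including '\' itself.
        if text[i] == "\\" and i + 1 < len(text):
            out.append(html.escape(text[i + 1]))
            i += 2
            continue
        mark = next((m for m in MARKS if text.startswith(m, i)), None)
        if mark is None:
            out.append(html.escape(text[i]))
            i += 1
            continue
        if mark in open_marks:      # second occurrence closes the element
            open_marks.remove(mark)
            out.append("</" + TAGS[mark] + ">")
        else:                       # first occurrence opens it
            open_marks.add(mark)
            out.append("<" + TAGS[mark] + ">")
        i += len(mark)
    return "".join(out)
```

With this sketch, the input H~2~O is **wet** becomes H<sub>2</sub>O is <strong>wet</strong>, and the escaped input 2 \* 3 is emitted as 2 * 3.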

The Data Section

  • holds the tabular data of the table block. It cannot be empty; the table must have at least one row. Only one Data Section is allowed per table block.
  • sandwiched between two Section Separators.
  • Section Separator is a string literal of 3 consecutive plus signs (U+002B, “+++”) followed by end of line or end of file.
  • consists of one or more rows. Each row consists of one or more cells (i.e., a row must not be empty). However, the number of cells can vary between rows.
  • Each cell consists of zero or more Unicode characters (i.e., it can be empty), and ends with a Column Separator.
  • Column Separator is a single Vertical Line (U+007C or “|”).
  • Although discouraged, it is not an error if the text and column separators in different rows do not line up prettily.
  • the same Markdown elements supported in the Caption Section are supported here.
  • White space before and after cell contents will be trimmed. TODO: confirm if this should be optional?
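The row and cell rules above can be sketched in Python as follows. The function name parse_row is hypothetical. Two assumptions are made: text that follows the last Column Separator (other than white space) is reported as an error, and the Column Separator cannot be escaped inside a cell, since this section does not describe such an escape.

```python
COLUMN_SEPARATOR = "|"

def parse_row(line):
    """Split one Data Section line into trimmed cell strings."""
    parts = line.split(COLUMN_SEPARATOR)
    trailing = parts.pop()  # whatever follows the final separator
    if trailing.strip():
        # Assumption: every cell must end with a Column Separator.
        raise ValueError("cell is not terminated by a Column Separator")
    if not parts:
        raise ValueError("a row must contain at least one cell")
    # White space before and after cell contents is trimmed.
    return [cell.strip() for cell in parts]
```

For the row "entry 1 | 281 | 281.0 | D51.0 |" the sketch returns the four cells entry 1, 281, 281.0 and D51.0; an empty cell is returned as an empty string.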

The Footnotes Section

  • consists of zero or more lines (i.e., it can be empty).
  • the same Markdown elements supported in the Caption Section are supported here.

The Command Section

  • consists of zero or more lines (i.e., it can be empty). Each line must be either a comment or a valid Rosewood command. This section is also referred to as the Rules Section elsewhere in this document.
  • The first section to be parsed and interpreted because certain commands may change the rules of parsing and interpreting other sections.

Lexical elements used in the Command Section

Comments

  • serve as program documentation. Rosewood comments start with two consecutive Solidus characters (U+002F, //) and stop at the end of the line. For example, the following two lines are valid in a Rosewood Rules Section:
    //this line should parse without errors...
    merge row 1:2 col 1:2 //this too.
  • A comment cannot start inside a string literal. TODO: verify
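A minimal Python sketch of comment removal under the two rules above. The function name strip_comment is hypothetical. The sketch assumes that string literals may be delimited by either a single or a double quote and that a backslash escapes the next character inside a literal; the text above marks the string-literal rule itself as still to be verified.

```python
def strip_comment(line):
    """Return the line without its trailing // comment, if any."""
    quote = None  # the quote character of the string literal we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\":
                i += 1            # skip the escaped character
            elif ch == quote:
                quote = None      # end of the string literal
        elif ch in "\"'":
            quote = ch            # start of a string literal (assumed delimiters)
        elif line.startswith("//", i):
            return line[:i].rstrip()
        i += 1
    return line
```

For instance, the command line merge row 1:2 col 1:2 //this too. is reduced to merge row 1:2 col 1:2, while a // that appears inside a quoted string is left alone.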

White space

  • one or more of the following: spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A).
  • is ignored by the parser except as it separates tokens that would otherwise combine into a single token.
  • While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

Identifiers

  • name program entities such as constants and command arguments.
  • a sequence of one or more letters and digits. The first character must be a letter.
  • A letter is either a Unicode Letter or “_”.
  • A Unicode Letter is a Unicode code point in any of the Unicode Standard 8.0, Section 4.5 “General Category” categories - Lu, Ll, Lt, Lm, or Lo
  • Examples of valid identifiers: myHeader, _header, header1.
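The identifier rules can be checked with a short Python sketch like the one below. The function names are hypothetical. Two assumptions apply: a digit is taken to be any code point in the Unicode category Nd, which this section does not state explicitly, and the categories come from the Unicode version bundled with the Python interpreter, which may differ from Unicode 8.0.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """A letter is a Unicode Letter or the underscore."""
    return ch == "_" or unicodedata.category(ch) in LETTER_CATEGORIES

def is_digit(ch):
    """Assumption: a digit is any code point in category Nd."""
    return unicodedata.category(ch) == "Nd"

def is_identifier(text):
    """One or more letters and digits; the first character must be a letter."""
    if not text or not is_letter(text[0]):
        return False
    return all(is_letter(ch) or is_digit(ch) for ch in text[1:])
```

The sketch accepts myHeader, _header and header1, and rejects strings such as 1header or my-header. It does not check whether the identifier is a reserved keyword.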

Keywords

  • reserved identifiers.
  • currently used keywords are col, define, merge, select, set, style, row, var.
  • keywords reserved for future use: by, subset, version, TODO: what else?

Operators

  • the unary minus
  • the unary plus
  • U+003D, the Equals Sign (=), is the assignment operator
  • U+003A, the Colon (:), is the range operator
  • U+002C, the Comma (,), is the list operator

Integer literals

  • a sequence of Unicode digits, optionally prefixed with a U+002D HYPHEN-MINUS character (-) or a U+002B PLUS SIGN character (+), representing an integer constant in base ten.
  • Non-decimal bases (e.g., octal, hexadecimal) are not permitted.
  • Signed integers A string is a valid integer if it consists of one or more ASCII digits, optionally prefixed with .
  • valid integer are in the range x,y,z
  • xx is used internally as a marker of missing values.
  • xx is used internally as a marker of maximum integer

String Literals

  • represents a string constant obtained from concatenating a sequence of characters.
  • are interpreted by the parser using the following rules

    Escaped characters

    After a backslash, certain single-character escapes represent special values:

    • \a U+0007 alert or bell
    • \b U+0008 backspace
    • \f U+000C form feed
    • \n U+000A line feed or newline
    • \r U+000D carriage return
    • \t U+0009 horizontal tab
    • \v U+000B vertical tab
    • \\ U+005C backslash
    • \' U+0027 single quote
    • \" U+0022 double quote
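The escape table can be applied with a Python sketch along these lines. The function name unescape is hypothetical, and it operates on the body of a string literal with the surrounding quotes already removed. Treating an unknown or dangling escape as an error is an assumption; this document does not say what should happen in that case.

```python
# Map from the character after the backslash to the value it represents.
ESCAPES = {
    "a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
    "t": "\t", "v": "\v", "\\": "\\", "'": "'", '"': '"',
}

def unescape(body):
    """Interpret the single-character escapes listed above."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body) or body[i + 1] not in ESCAPES:
            # Assumption: anything not in the table is rejected.
            raise ValueError("invalid escape sequence at offset %d" % i)
        out.append(ESCAPES[body[i + 1]])
        i += 2
    return "".join(out)
```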

Segment-specifier

  • describes a range of cells to which a command applies.
  • starts with one of the reserved keywords “row” or “col”, followed by either a single cell range or a cell list (TODO: verify that these are exclusive)
  • A cell range is comprised of a cell coordinate indicating the start of the range followed by an optional step coordinate and/or another cell coordinate indicating the end of the range.
  • A cell coordinate is comprised of a valid row or column number (unsigned integer) or the reserved identifier “max”.
  • A step coordinate is comprised of “:” followed by a signed integer (it can be negative).
  • A cell list is a comma-separated list of valid cell coordinates.
  • The following are valid segment specifiers:
    segment specifier | meaning
    row 1             | command applies to row 1
    row 1:3           | command applies to rows 1 to 3 inclusive
    row 1:3:9         | command applies to every third row from row 1 through row 9, i.e., rows 1, 4, 7

TODO: add more examples

TODO: clarify rules about merging in commands like merge row 1:2:10; should this be legal, merging all columns in the specified rows?

It is an error:

  • to have 2 segment specifiers of the same type, e.g., merge row 1:2 row 1:2 is not valid.
  • to have a non-integer, 0 or negative coordinate number. All the following are not valid commands:
    merge row 1.4 col 1 //float row number
    merge row x1.4 col 1 //non-integer row number
    merge row 0 col 1 //zero row number
    merge row 0.0 col 1 //zero row number
  • to have “row” or “col” without at least one coordinate. All the following are not valid commands:
    merge row col 1 //missing row coordinate
    merge row 1 col //missing col coordinate
  • to have a right coordinate that is smaller than the left coordinate. merge row 3:1 col 1 is not a valid command.
  • to have a step number in a cell range without a right coordinate. merge row 1:2: is not a valid command.
  • to have a step number equal to zero. merge row 1:0:10 is not a valid command.
  • to have a cell range after a cell list. merge row 1,2, 3:4 is not a valid command.
  • to have “max” as a left coordinate. merge row max:1 is not a valid command.
  • to have “max” as a step in a cell range. merge row 1:max:10 is not a valid command.
  • to have a cell list in or after a cell range. merge row 1,2,3, max, merge row 1,2,3, max, 4, 5
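The cell-range rules and several of the errors listed above can be illustrated with the Python sketch below. It covers a single cell range only (not cell lists), takes the text after the “row” or “col” keyword, and needs the table's last row or column number to resolve “max”. The names expand_range and coordinate are hypothetical. Assumptions: coordinates are written with ASCII digits, a lone “max” is accepted as a single coordinate, and negative steps are left unimplemented because their meaning is not defined in this section.

```python
def coordinate(text, max_index, allow_max):
    """Parse a cell coordinate: a positive integer or the identifier 'max'."""
    if text == "max":
        if not allow_max:
            raise ValueError('"max" is not allowed as a left coordinate')
        return max_index
    if not (text.isascii() and text.isdigit()) or int(text) == 0:
        raise ValueError("invalid coordinate: %r" % text)
    return int(text)

def expand_range(spec, max_index):
    """Expand 'n', 'start:end' or 'start:step:end' into a list of 1-based indexes."""
    parts = spec.split(":")
    if len(parts) == 1:
        return [coordinate(parts[0], max_index, allow_max=True)]
    if len(parts) == 2:
        start_text, step_text, end_text = parts[0], "1", parts[1]
    elif len(parts) == 3:
        start_text, step_text, end_text = parts
    else:
        raise ValueError("too many ':' in cell range")
    start = coordinate(start_text, max_index, allow_max=False)
    try:
        step = int(step_text)  # also rejects "max" used as a step
    except ValueError:
        raise ValueError("invalid step: %r" % step_text)
    if step == 0:
        raise ValueError("step must not be zero")
    if step < 0:
        raise NotImplementedError("negative steps are not covered by this sketch")
    end = coordinate(end_text, max_index, allow_max=True)
    if end < start:
        raise ValueError("right coordinate is smaller than left coordinate")
    return list(range(start, end + 1, step))
```

With a 10-row table, expand_range("1:3:9", 10) gives rows 1, 4 and 7, matching the third example in the table of valid segment specifiers, and expand_range("8:max", 10) gives rows 8, 9 and 10.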

Define Command

  • creates an immutable identifier and binds it to a string or integer literal value (i.e., expressions are not allowed, so this is closer to a C-style #define than to the const of Pascal, Go or JavaScript).
  • scoped to the current table block by default, i.e., the constant can be accessed only in that block’s Rules Section.
  • define global is scoped to the current file, i.e., the constant can be accessed in any Rules Section in the file, including those that precede the define.
  • Syntax: define [global] identifier = identifier | stringLiteral | signedInteger
  • Examples:
    define myCol = 3
    define greeting = "Hello"
  • It is an error to redefine or assign to an existing constant.
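The scoping and immutability rules can be pictured with a small Python sketch. The class name Constants and its methods are hypothetical. The sketch assumes that a table-scoped constant may not reuse the name of a file-scoped one; the text above only says that redefinition is an error. Making a global visible to Rules Sections that precede it would additionally require collecting all define global commands before interpreting any section, which is not shown here.

```python
class Constants:
    """Constants visible in one table block's Rules Section."""

    def __init__(self, file_scope):
        self._local = {}           # constants scoped to this table block
        self._global = file_scope  # dict shared by every block in the file

    def define(self, name, value, is_global=False):
        # Redefining an existing constant is an error.
        if name in self._local or name in self._global:
            raise ValueError("constant %r is already defined" % name)
        target = self._global if is_global else self._local
        target[name] = value

    def lookup(self, name):
        if name in self._local:
            return self._local[name]
        return self._global[name]  # raises KeyError if undefined
```

In use, two table blocks would each get their own Constants object built around the same file_scope dictionary, so a constant created with is_global=True in one block can be looked up from the other.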

Merge Command

  • merges the cells specified by the row and/or col segment specifiers. Syntax: merge segmentSpecifier [segmentSpecifier] .

Select Command (TODO: verify utility)

  • subsets a table into a smaller table that only includes the cells specified by the row and/or col segment specifiers.
  • Example: select row 4:9 defines a table that excludes the first 3 rows of the original table and all rows after the 9th row. From the perspective of all subsequent commands, row 4 of the original table is now considered as row 1 and max row is now considered as row 9.
  • useful for simplifying further processing in subsequent commands
  • Syntax: select all | segmentSpecifier [segmentSpecifier]
  • select all selects the entire table (removes the subset).
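The renumbering described in the example can be written down as a small Python sketch. The function name to_original_row is hypothetical, and the sketch assumes a contiguous row selection such as rows 4 to 9; selections with a step, with a cell list, or on columns are not covered.

```python
def to_original_row(selected_row, first, last):
    """Map a row number inside a selection of rows first..last to the original table."""
    original = first + selected_row - 1
    if selected_row < 1 or original > last:
        raise ValueError("row is outside the current selection")
    return original
```

After selecting rows 4 to 9, row 1 of the subset maps to original row 4 and row 6 maps to original row 9, which is also what “max” resolves to.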

Set Command

  • changes the value of a built-in setting.
  • Syntax: set identifier = identifier | integer-literal | string-literal .

Style Command

  • applies a CSS class to the cells specified by the row and/or col segment specifiers. Syntax: style segmentSpecifier [segmentSpecifier] identifier [identifier…] .

Change log

0.21

  • expanded the “Caption” and “Footnotes” definitions to specify supported Markdown elements.

Footnotes

Footnote 1: Syntax of Extended Backus-Naur Form (EBNF) used in this document

Production  = production_name "=" [ Expression ] "." .
Expression  = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term        = production_name | token [ "…" token ] | Group | Option | Repetition .
Group       = "(" Expression ")" .
Option      = "[" Expression "]" .
Repetition  = "{" Expression "}" .

Productions are expressions constructed from terms and the following operators, in increasing precedence:
|   alternation
()  grouping
[]  option (0 or 1 times)
{}  repetition (0 to n times)

See https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
