Contents:

Warning

This documentation is incomplete. It will be finished as soon as possible.

1. TODO

2. Introduction

QML is a markup language designed by Baron R.K. von Wolfsheild for Qtask. This program parses a string of QML text into a QML document tree (that can then be converted to, say, XHTML with an appropriate "emitter").

Parsing is done with a pipeline made of three stages. The first stage uses parse on the input string and sends QML commands to the second stage. The second stage, whose main purpose is to greatly simplify the implementation of the third stage, ensures command balancing ("open" commands must be followed by "close" commands in the correct order and correctly nested; there are also two levels of commands, "block" and "inline", that must nest properly), preprocesses and translates commands, and then sends them to the third stage. The third stage converts the stream of commands into a QML document tree.

The second and third stages are implemented as finite state machines (FSMs). The document tree that comes out of the third stage is optimized with a tree rewriting engine before it is returned. The tree rewriting rules also process the =repeat and =data commands; after the document has been completely expanded and optimized (which may require more than one pass), any table of contents is created and headers are numbered; numbers for enum lists are calculated too. (Numbering and TOC creation have to be done last because of =repeat.)

3. Overview

This section is the one that users of this scanner are most interested in, because it shows the "interface".

The QML scanner uses a stack-based finite state machine interpreter and a data structure matching and rewriting engine, so we include them here. We then define the qml-scanner object, which contains all the code for the scanner.

Two functions are available for users: scan-doc and search. scan-doc takes a string as its argument and returns a QML document tree (a block). Please note that scan-doc is not reentrant and you cannot call it recursively (not that there is any way for you to do so). You can use the /with refinement to provide a block of pairs of strings, to set default options for commands such as =box and =table. (See the Example usage.) It is also possible to use the /keep refinement to stop scan-doc from resetting default options, so that you can use the ones that have been set in a previous session; note that if you use /keep, /with is ignored, so you should not provide both. (There is also a limitation, in that /keep does not actually keep the default value for "alias".)

search takes a QML document tree and a string (in that order); it performs a simple substring search inside the document (ignoring formatting etc.). A block is returned, with pairs of header and text snippet (with one exception explained below); the text snippet is a string of about 100 characters around the substring match, and the header is a QML header node (a block) for the "section" that contains the text (i.e., the last header encountered before the string was found). The returned block contains a pair of header and snippet for each occurrence of the substring in the document. If the substring is found before encountering any header in the document, the result block will contain the word 'doc-start in place of a header block.

Note that the text snippet does not contain any formatting (but paragraphs are separated with a newline).

search also returns anchor name exact matches: if the document contains an anchor whose name matches the text string exactly, then the first pair in the result block will be the header for the section that contains the anchor (or 'doc-start), and the anchor node. (You can distinguish this case because the "snippet" is a block instead of a string; also, this can only happen for the first pair in the result block, and since anchor names need to be unique only one can match.)
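
The result shapes described above can be told apart by checking the type of each value in a pair. The following is only a sketch: describe-results is a hypothetical helper (not part of qml-scanner), and the sample block uses placeholder blocks instead of real QML header and anchor nodes.

```rebol
; Hypothetical helper, not part of qml-scanner.
describe-results: func [
    "Return one description string per header/snippet pair"
    results [block!] "Block shaped like the one SEARCH returns"
    /local out
] [
    out: copy []
    foreach [header snippet] results [
        append out rejoin [
            ; a header is a block, otherwise it is the word doc-start
            either block? :header ["section"] ["doc-start"]
            ": "
            ; a block in place of the snippet means an anchor name match
            either block? :snippet ["anchor match"] ["text match"]
        ]
    ]
    out
]

; Stand-in data: the inner blocks are placeholders, not real QML nodes.
sample: [
    doc-start [placeholder-anchor-node]
    [placeholder-header-node] "...text around the match..."
]
probe describe-results sample
```

For the stand-in data this produces two strings, "doc-start: anchor match" and "section: text match".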

Overview

#include %fsm.r
#include %rewrite.r

qml-scanner: context [
 The parser (stage 1)
 Command options parsing
 Setting default options
 Balancing (stage 2): the stage2 function
 Generating QML document tree (stage 3)

 Searching a QML document

 ; the interface
 
 search: func [
  "Search a QML document tree for a substring"
  doc [block!] "The QML document tree (as returned by SCAN-DOC)"
  text [string!] "The substring to search for"
  
  /local search's locals
 ] [
  Search a QML document tree for a substring
 ]

 scan-doc: func [
  {Parse a QML text string and return a QML document tree}
  text [string!]
  /with defaults [block!] "Default options for commands"
  /keep {Keep default options from previous session (ignores /with)}
 ] [
  Scan a text string and return a QML document tree
 ]
]

3.1 Example usage

To scan a document, just call scan-doc with the document text; you can use the /with refinement to set default options for commands. (Commands that support default options are: =toc (sets the default number style), =alias (sets the default "magic" sequence), =table, =row, =column, =cell, =box and =image.) You can pass the resulting document tree to an emitter (such as the XHTML emitter) to generate a document that the user can read.

To search for a substring in a document, just call search, passing the document tree and the substring to search for.

Example usage

text: "...some QML text..."
doc: qml-scanner/scan-doc text

doc: qml-scanner/scan-doc/with text ["box" "yellow dashed" "table" "all dotted"]

search-results: qml-scanner/search doc "substring"
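
The /keep refinement described in the Overview is not shown above. A minimal sketch, assuming qml-scanner has been loaded and that text1 and text2 (hypothetical names) hold two QML strings:

```rebol
; text1 and text2 are hypothetical words holding QML text.
; First document: set default options for =box and =table.
doc1: qml-scanner/scan-doc/with text1 ["box" "yellow dashed" "table" "all dotted"]

; Later document: reuse the defaults set above instead of resetting them.
; Do not combine /keep with /with, because /with is ignored in that case.
doc2: qml-scanner/scan-doc/keep text2
```

As noted in the Overview, the default value for "alias" is not preserved by /keep.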

4. The parser (stage 1)

The first stage of the pipeline is the parser. The parse-qml function just uses parse on the text string.

The parser (stage 1)

The parse rules
The set-magic function

parse-qml: func [text [string!] magic [string! none!]] [
 ; initialize magic char
 set-magic any [magic "="]
 parse/all text qml-rule
]

4.1 The parse rules

The parse-qml function uses qml-rule to parse the text; everything that is not a command is considered text and is handed over to the second stage (with the stage2 function, see Balancing (stage 2): the stage2 function).

The txt-chars charset is initialized by the set-magic function (see The set-magic function).

The parse rules

qml-rule: [
 some [commands | text]
]

commands: [
 parse rule for commands
]

txt: none
txt-chars: none
text: [copy txt some txt-chars (stage2 [text:] txt)]

4.2 parse rule for commands

The heart of the parser is the commands rule.

The newline character is treated like a command (produces the "^/" command); spaces at the end and at the beginning of lines are ignored. (As already noted, the stage2 function is used to hand the commands over to the second stage of the pipeline; see Balancing (stage 2): the stage2 function.)

Other commands are introduced by magic-char (set by the set-magic function, see The set-magic function); there are some special commands that require their own parse rules, but most commands are parsed by the rule shown here. This parses normal opening commands (like "=cmd"), opening commands with options ("=cmd[...]" or "=cmd{...}"), and ending commands ("=cmd." or "=cmd/"); commands with options eat at most one space following them (i.e., "=cmd[...]text" is exactly the same as "=cmd[...] text"; if you need a space after a command with options, use two spaces); ending commands do not eat any space (so "=cmd.text" is not the same as "=cmd. text"); commands without options require at least a space following them and will eat all spaces following them. Note that the mk: magic-char :mk construct allows putting commands one after another without spaces between them (e.g., "=b=i"). Note that this rule also parses the dot command "=.".

As you can see each command is sent to the second stage as a pair of values: the command as a string, and the command's options (which can be none). As you have already seen in The parse rules, this is also true for text, which is sent as the text: (a set-word!) command with the text as options (a set-word! is used so that there is no clash with other commands, e.g. if we ever need to add a =text command and so on).

parse rule for commands

any spc newline any spc (stage2 "^/" none)
|
magic-char [
 Special commands
 |
 ; ignore a = at the end of the text
 end
 |
 copy cmd any cmd-chars [
  "[" copy options to "]" skip opt spc (stage2 cmd options)
  |
  "{" copy options to "}" skip opt spc (stage2 cmd options)
  |
  ["." | "/"] (stage2 join any [cmd ""] "." none)
  |
  [some spc | mk: [newline | magic-char] :mk | end] (stage2 cmd none)
 ]
 |
 ; malformed command?
 (stage2 [text:] magic-char)
]
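
The mk: ... :mk lookahead can be tried in isolation. This is a reduced sketch, not the scanner's rule: demo-rule and seen are hypothetical names, the charsets are fixed by hand for "=" instead of being set by set-magic, and commands are collected in a block rather than sent to stage2.

```rebol
; Simplified stand-ins for the scanner's words.
magic-char: "="
cmd-chars: complement charset " ^-^/[]{}./="
spc: charset " ^-"

seen: copy []
mk: cmd: none
demo-rule: [
    some [
        magic-char copy cmd any cmd-chars
        ; either eat spaces, or peek at a newline / magic char and rewind
        [some spc | mk: [newline | magic-char] :mk | end]
        (append seen cmd)
    ]
    to end
]

parse/all "=b=i some text" demo-rule
probe seen
```

Because the second "=" is matched and then the input position is reset to mk, it is still available to start the next command, so seen ends up holding "b" and "i".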

4.2.1 Special commands

Some commands require custom parse rules. I've tried to keep this to a minimum, but Reichart loves special cases. ;-)

Special commands

magic-char (stage2 [text:] magic-char)
|
[" " | mk: newline :mk] (stage2 " " none)
|
"alias" some spc copy cmd some cmd-chars any spc (set-magic cmd)
|
"csv" [
 "[" copy options to "]" skip any spc opt newline (options: refinements/parse-arg-string "csv" any [options ""])
 |
 "{" copy options to "}" skip any spc opt newline (options: refinements/parse-arg-string "csv" any [options ""])
 |
 any spc opt newline (options: context [name: show: none])
] (csv: make block! 256) some [
 [magic-char "csv" ["." | "/"] any spc opt newline | end] (stage2 "csv" make options [contents: csv]) break
 |
 [copy txt to newline newline | copy txt to end] (append/only csv parse/all txt ",")
]
|
copy cmd escape-cmd [
 "[" copy options to "]" skip any spc opt newline
 |
 "{" copy options to "}" skip any spc opt newline
 |
 any spc opt newline (options: rejoin [magic-char cmd "."])
] [copy txt to options options any spc opt newline | copy txt to end] (stage2 cmd txt)
|
some "-" [some spc opt newline | newline | mk: magic-char :mk | end] (stage2 "-" none)
|
copy cmd some ">" [some spc | mk: [newline | magic-char] :mk | end] (stage2 ">" length? cmd)
|
"[" copy options to "]" skip (stage2 "" options)
|
"{" copy options to "}" skip (stage2 "" options)
|
"," (stage2 "," none)
|
"repeat" (options: make block! 16) [
 (opt-open-char: "[" opt-close-char: "]") rebol-options (stage2 "repeat" options)
 |
 (opt-open-char: "{" opt-close-char: "}") rebol-options (stage2 "repeat" options)
 |
 ["." | "/"] (stage2 "repeat." none)
]

4.2.2 Words used by the commands rule

The commands parse rule uses a number of words that we need to make local. It also uses the rebol-options subrule which we define here.

cmd-chars and magic-char are initialized by the set-magic function (see The set-magic function); escape-cmd is the list of escape commands.

rebol-options uses the load-next function to parse the options as REBOL values; in turn it uses load/next to parse the text and tries to recover in case of errors.

The parse rules +≡

mk: cmd: options: csv: none
spc: charset " ^-"
spc+: charset " ^-^/"
cmd-chars: none
magic-char: none

escape-cmd: ["HTML" | "REBOL" | "MakeDoc" | "Example"]

opt-open-char: "[" opt-close-char: "]"
rebol-options: [
 opt-open-char
 txt: (txt: load-next options txt) :txt
 some [any spc+ opt-close-char break | end break | txt: (txt: load-next options txt) :txt]
 opt spc
]
load-next: func [out text /local val] [
 if error? try [
  set [val text] load/next text
  insert/only tail out val
 ] [
  insert tail out copy/part text text: any [find text opt-close-char tail text]
 ]
 text
]
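
To see what load-next relies on, here is a small sketch of load/next on its own, assuming REBOL 2 semantics (load/next returns a block holding the first loaded value and the rest of the string). The words out, val and rest are hypothetical names used only for this illustration.

```rebol
out: copy []
rest: "10x20 width: 50"

until [
    ; load one value and advance past it
    set [val rest] load/next rest
    append/only out :val
    ; stop when only whitespace (or nothing) is left
    empty? trim copy rest
]

probe out
```

The loop collects a pair!, a set-word! and an integer!, in that order. The real load-next additionally traps errors and copies unparsable text up to opt-close-char.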

4.3 The set-magic function

This function initializes the values of magic-char, cmd-chars and txt-chars. cmd-chars is a charset containing the characters that are valid in command names; txt-chars contains the characters that are valid for text outside commands.

The set-magic function

set-magic: func [magic [string!]] [
 ; magic cannot be empty
 if empty? magic [magic: "="]
 magic-char: magic
 cmd-chars: complement charset join " ^-^/[]{}./" first magic-char
 txt-chars: complement charset join "^/" first magic-char
]
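
The effect of changing the magic sequence can be checked by building the two charsets by hand. This sketch repeats the two charset expressions from set-magic outside the scanner; magic is a hypothetical word used only here.

```rebol
magic: "#"
cmd-chars: complement charset join " ^-^/[]{}./" first magic
txt-chars: complement charset join "^/" first magic

probe find cmd-chars #"b"   ; letters are valid in command names
probe find cmd-chars #"#"   ; the magic character is not
probe find txt-chars #"="   ; with "#" as magic, "=" is ordinary text
```

Note that only the first character of the magic string is excluded from the charsets, even though magic-char itself holds the whole string.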

5. Balancing (stage 2): the stage2 function

The second stage of the pipeline is meant to simplify the third stage by "normalizing" the command stream. It takes care of ensuring command balancing, and handles the space, dot and comma commands ("= ", "=." and "=,"). Parsing of command options is also done in the stage2 function (just because this is a convenient place to do it, even though conceptually it belongs to the first stage).

This stage uses a finite state machine, handling each command as an event. Please see the documentation of the FSM interpreter for more information (linked from Overview).

As seen in the first stage, the stage2 function is invoked with a command and its options. A command can be a string! or a set-word!; for simplicity set-words can be passed in a block (as in stage2 [text:] txt), so if cmd is a block we just extract the first value. The options are parsed with the parse-command-options function; the cmd and opts words are then set in the stage2-ctx object, so that they are available to the FSM rules, and cmd is issued as an event for the state machine. (stage2 needs to do one more thing after this, which we'll discuss later on.)

Balancing (stage 2): the stage2 function

stage2-fsm: make fsm! [ ]

stage2: func [cmd opts] [
 if block? cmd [cmd: first cmd]
 stage2-ctx/opts: parse-command-options stage2-ctx/cmd: cmd :opts
 stage2-fsm/event cmd
 Additional code for the stage2 function
]

stage2-ctx: context [
 cmd: opts: none
 Balancing (stage 2): the Finite State Machine
]

Initialization and termination of the FSM

The merge-style function

6. Command options parsing

Command options (also called "refinements" in the code) are parsed by the parse-command-options function, which is called by the stage2 function (although options parsing conceptually belongs to the first stage; the stage2 function parses the options before actually processing the command in the FSM).

Only some commands need their options to be parsed; if cmd is one of them, and options is a string, then the parse-arg-string function is used. Otherwise options is returned as-is.

Command options parsing

parse-command-options: func [cmd options] [
 either all [
  string? options
  find [
   "table" "row" "column" "cell" "box"
   "image" "font" "f" "span" "data"
  ] cmd
 ] [
  refinements/parse-arg-string cmd options
 ] [
  options
 ]
]

refinements: context [
 Option values' types
 Type map
 Object map
 parse-arg-string's support functions
 parse-arg-string: func [cmd args /local parse-arg-string's locals] [
  Parse the args string into an object
 ]
]

6.1 Option values' types

Options are parsed as a list of values; option values can be of a number of different types, each recognized by syntax (like in REBOL) and parsed by a specific parse rule. Values can be of one of the following types: flag!, set-word!, color!, string!, integer!, url!, percent!, pair! and comma-pair!; after being parsed, each value is represented by a REBOL value, and each type is mapped to a REBOL type: flag! is mapped to word!; set-word!, string!, integer!, url! and pair! are mapped to the respective REBOL types; color! is mapped to one of issue!, refinement! or tuple! (depending on how the color was specified, if as a hex string, a color name or a tuple); percent! is mapped to money!; comma-pair! is mapped to block!.

The types object holds the parse rules for the option types. The flag-word and set-word rules are set dynamically based on the command that is being parsed. Note that strings can be specified even without quotes, so the string! rule must be applied last (other types need to be tried first); also note that we only allow integer percent values, even though we use money! to represent them; also, comma-pair! can contain integers or percents, so the block representing it can contain integer! or money! values.

Option values' types

types: context [
 flag!: [flag-word [some spc | end]]
 set-word!: [set-word any spc]
 color!: [
  [
   color-keyword
   |
   tuple
   |
   [opt "#" copy value 6 hex-digits | "#" copy value 3 hex-digits] (value: to issue! value)
  ]
  [some spc | end]
 ]
 string!: [
  [{"} copy value some dquotechars {"} | "'" copy value some quotechars "'" | copy value some chars]
  [some spc | end]
 ]
 integer!: [copy value some digits (value: to system/words/integer! value) [some spc | end]]
 url!: [
  copy value [some urlchars ":" 0 2 "/" some urlchars any ["/" some urlchars]]
  (value: to system/words/url! value)
  [some spc | end]
 ]
 percent!: [copy value 1 3 digits "%" (value: to money! value) [some spc | end]]
 pair!: [copy value [some digits "x" some digits] (value: to system/words/pair! value) [some spc | end]]
 comma-pair!: [
  (value: make block! 4)
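  ; integer! is redefined as a rule in this context, so use the global datatype
  copy val some digits ["%" (append value to money! val) | none (append value to system/words/integer! val)]
  ","
  copy val some digits ["%" (append value to money! val) | none (append value to system/words/integer! val)]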
  [some spc | end]
 ]
]
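
The conversions used in the parenthesized actions above can be tried directly at the console. The input strings below are made-up examples of what the rules would have copied into value (for percent! the "%" sign is matched separately, so only the digits are converted).

```rebol
probe to issue! "ff8800"                 ; hex color      -> issue!
probe to refinement! "red"               ; color keyword  -> refinement!
probe attempt [to tuple! "255.136.0"]    ; color tuple    -> tuple!
probe to money! "50"                     ; percent value  -> money!
probe to pair! "100x200"                 ; pair           -> pair!
probe to integer! "42"                   ; integer        -> integer!
```

Inside the types object the words integer!, url! and pair! name parse rules, which is why the rules refer to the datatypes through system/words.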

6.1.1 Subrules and other words used by the parse rules

As already said above, flag-word and set-word are dynamically set to a command-specific rule. See Rules for flag! and Rules for set-word!.

Option values' types +≡

value: val: none
chars: complement spc: charset " ^-^/"
urlchars: complement charset {"':/ ^- }
dquotechars: complement charset {"}
quotechars: complement charset "'"
digits: charset "1234567890"
hex-digits: union digits charset "ABCDEFabcdef"
flag-word: none
set-word: none
Rules for color!
Rules for flag!
Rules for set-word!

6.1.2 Rule for parsing a generic value

The rule value-rule can be used to parse a generic (non flag or set-word) value; the order of the rules is significant. (Note that it's written this way for compatibility with older REBOLs.)

Option values' types +≡

value-rule: bind [color! | percent! | pair! | comma-pair! | integer! | url! | string!] in types 'self

6.1.3 Rules for color!

Rules for color!

color-keyword: [
 "clear" (value: /transparent) | copy value [
  "aliceblue" | "antiquewhite" | "aqua" | "aquamarine" | "azure" |
  "beige" | "bisque" | "black" | "blanchedalmond" | "blue" |
  "blueviolet" | "brown" | "burlywood" | "cadetblue" | "chartreuse" |
  "chocolate" | "coral" | "cornflowerblue" | "cornsilk" | "crimson" |
  "cyan" | "darkblue" | "darkcyan" | "darkgoldenrod" | "darkgray" |
  "darkgreen" | "darkkhaki" | "darkmagenta" | "darkolivegreen" |
  "darkorange" | "darkorchid" | "darkred" | "darksalmon" |
  "darkseagreen" | "darkslateblue" | "darkslategray" | "darkturquoise" |
  "darkviolet" | "deeppink" | "deepskyblue" | "dimgray" |
  "dodgerblue" | "feldspar" | "firebrick" | "floralwhite" |
  "forestgreen" | "fuchsia" | "gainsboro" | "ghostwhite" | "gold" |
  "goldenrod" | "gray" | "green" | "greenyellow" | "honeydew" |
  "hotpink" | "indianred" | "indigo" | "ivory" | "khaki" | "lavender" |
  "lavenderblush" | "lawngreen" | "lemonchiffon" | "lightblue" |
  "lightcoral" | "lightcyan" | "lightgoldenrodyellow" | "lightgreen" |
  "lightgrey" | "lightpink" | "lightsalmon" | "lightseagreen" |
  "lightskyblue" | "lightslateblue" | "lightslategray" |
  "lightsteelblue" | "lightyellow" | "lime" | "limegreen" | "linen" |
  "magenta" | "maroon" | "mediumaquamarine" | "mediumblue" |
  "mediumorchid" | "mediumpurple" | "mediumseagreen" |
  "mediumslateblue" | "mediumspringgreen" | "mediumturquoise" |
  "mediumvioletred" | "midnightblue" | "mintcream" | "mistyrose" |
  "moccasin" | "navajowhite" | "navy" | "oldlace" | "olive" |
  "olivedrab" | "orange" | "orangered" | "orchid" | "palegoldenrod" |
  "palegreen" | "paleturquoise" | "palevioletred" | "papayawhip" |
  "peachpuff" | "peru" | "pink" | "plum" | "powderblue" | "purple" |
  "red" | "rosybrown" | "royalblue" | "saddlebrown" | "salmon" |
  "sandybrown" | "seagreen" | "seashell" | "sienna" | "silver" |
  "skyblue" | "slateblue" | "slategray" | "snow" | "springgreen" |
  "steelblue" | "tan" | "teal" | "thistle" | "tomato" | "turquoise" |
  "violet" | "violetred" | "wheat" | "white" | "whitesmoke" | "yellow" |
  "yellowgreen" | "transparent"
 ] (value: to refinement! value)
]
tuple: [
 copy value [1 3 digits "." 1 3 digits "." 1 3 digits] (value: attempt [to tuple! value])
]

6.1.4 Rules for flag!

Each command has a list of available flags. Note that order is important (because of the way the bold and italic rules are defined, for example).

Rules for flag!

flag-words: [
 "table" [
  outline | dashed | dotted | solid | borderless | vertical | horizontal | all | hide | headerless |
  center | left | right | justify | middle | top | bottom | imagecenter | imageleft |
  imageright | imagemiddle | imagetop | imagebottom | float | space2 | tilev | shadow | rounded |
  tileh | tileless | tile | boxcenter | boxleft | boxright | times | helv | courier | bold | italic
 ]
 "cell" "row" "column" [
  outline | dashed | dotted | solid | borderless | all |
  center | left | right | justify | middle | top | bottom | imagecenter | imageleft |
  imageright | imagemiddle | imagetop | imagebottom | tilev |
  tileh | tileless | tile | times | helv | courier | bold | italic
 ]
 "box" [
  outline | dashed | dotted | solid | borderless |
  center | left | right | justify | middle | top | bottom | imagecenter | imageleft |
  imageright | imagemiddle | imagetop | imagebottom | float | tilev |
  tileh | tileless | tile | boxcenter | boxleft | boxright | times | helv | courier | shadow | rounded |
  bold | italic
 ]
 "image" [
  outline | dashed | dotted | solid | borderless | float |
  boxleft | space ;| shadow | rounded
 ]
 "font" "f" [
  times | helv | courier | bold | italic | space
 ]
 "span" none
 "csv" [show]
 "data" none
]
bold: ["b" opt "old" (value: 'bold)]
italic: ["i" opt ["talic" opt "s"] (value: 'italic)]
vertical: [["vertical" | "tablev"] (value: 'vertical)]
float: [["float" | "flow"] (value: 'float)]
tilev: ["tilev" opt "ertical" (value: 'tilev)]
tileh: ["tileh" opt "orizontal" (value: 'tileh)]
space2: ["space" (value: 'force-space)]

We have only specified the parse rules for a small set of flags; all the others are generated automatically: for example, [dashed] becomes ["dashed" (value: 'dashed)]. This is done by processing the flag-words block using parse.

Rules for flag! +≡

rule: word: none
parse flag-words [
 some [
  some string! set rule block! (
   while [not tail? rule] [
    either all [rule/1 <> '| not block? get/any word: rule/1] [
     rule: insert/only change rule
      form word to paren! compose [value: (to lit-word! word)]
    ] [rule: next rule]
   ]
  )
  |
  some string! rule: 'none (rule/1: [end skip])
 ]
]
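
As a worked example, the same loop can be run on a reduced table. demo-flags is a hypothetical block used only for this illustration; it assumes dashed has no block value in the global context, while bold is a hand-written rule as above.

```rebol
value: rule: word: none
bold: ["b" opt "old" (value: 'bold)]   ; hand-written rule, left untouched

demo-flags: [
    "font" "f" [dashed | bold]
]

parse demo-flags [
    some [
        some string! set rule block! (
            while [not tail? rule] [
                either all [rule/1 <> '| not block? get/any word: rule/1] [
                    ; replace the word with its name and add the action
                    rule: insert/only change rule
                        form word to paren! compose [value: (to lit-word! word)]
                ] [rule: next rule]
            ]
        )
    ]
]

probe demo-flags
```

After the rewrite the inner block reads ["dashed" (value: 'dashed) | bold]: dashed has been expanded in place, while bold is skipped because it already refers to a block.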

6.1.5 Flag actions

By default, a flag sets the respective word to true in the object that represents the parsed options. Some flags, however, do something different; for example, the dashed flag sets the word outline-style to 'dashed. The flag-actions object contains the flags that do something different from the default.

Rules for flag! +≡

flag-actions: context [
 dashed: [outline-style: 'dashed]
 dotted: [outline-style: 'dotted]
 solid: [outline-style: 'solid]
 outline: [outline-style: 'solid]
 borderless: [outline-style: 'borderless]
 rounded: [outline-style: 'rounded]
 center: [text-halign: 'center]
 left: [text-halign: 'left]
 right: [text-halign: 'right]
 justify: [text-halign: 'justify]
 middle: [text-valign: 'middle]
 top: [text-valign: 'top]
 bottom: [text-valign: 'bottom]
 imagecenter: [image-halign: 'center]
 imageleft: [image-halign: 'left]
 imageright: [image-halign: 'right]
 imagemiddle: [image-valign: 'center]
 imagetop: [image-valign: 'top]
 imagebottom: [image-valign: 'bottom]
 tile: [image-tiling: 'both]
 tilev: [image-tiling: 'vertical]
 tileh: [image-tiling: 'horizontal]
 tileless: [image-tiling: 'neither]
 times: [typeface: 'times]
 helv: [typeface: 'helvetica]
 courier: [typeface: 'courier]
 boxcenter: [position: 'center]
 boxright: [position: 'right]
 boxleft: [position: 'left]
]
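
The code that applies a parsed flag is not shown in this section, so the following is only a sketch of the behaviour described above. apply-flag and opts are hypothetical names, flag-actions is cut down to one entry, and the sketch assumes the flag's word already exists in the options object.

```rebol
flag-actions: context [
    dashed: [outline-style: 'dashed]
]

; Hypothetical options object for a single command.
opts: context [outline-style: none shadow: none]

apply-flag: func [
    "Apply a flag's action to OPTS, or set the flag's word to true"
    flag [word!]
    /local action
] [
    either action: in flag-actions flag [
        ; run the action with its set-words bound to the options object
        do bind copy get action in opts 'self
    ] [
        set in opts flag true
    ]
]

apply-flag 'dashed   ; has a special action
apply-flag 'shadow   ; no special action: default behaviour
probe opts
```

After both calls, opts/outline-style is the word dashed and opts/shadow is true.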

6.1.6 Rules for set-word!

Set-words are parsed in a way similar to flags. Each command has a list of set-words that are accepted.

Rules for set-word!

set-words: [
 "table" [
  color | typeface | fontsize | background | outline | dashed | dotted | solid | image | width | height |
  name
 ]
 "cell" "row" "column" [
  color | typeface | fontsize | background | outline | dashed | dotted | solid | image | width | height |
  column | row
 ]
 "box" [
  color | typeface | fontsize | background | outline | dashed | dotted | solid | image | width | height
 ]
 "image" [
  background | outline | dashed | dotted | solid | src | width | height | space
 ]
 "font" "f" [
  color | typeface | fontsize | background | space
 ]
 "span" none
 "csv" [name]
 "data" [name | index]
]
color: [["colo" opt "u" "r:" | "foreground:" | "fg:"] (value: first [color:])]
typeface: [opt "type" "face:" (value: first [typeface:])]
fontsize: ["size" opt "face" ":" (value: first [