A scanner is defined with the MAKE-SCANNER function, which returns an object.
my-scanner: make-scanner [
syntax: [
...
]
init-scanner: does [
...
]
...
]
The INIT-SCANNER function is optional and allows you to initialize your scanner; you can add any other words you need to the object by including them in the block.
The syntax block is a dialect that describes the syntax of the input text. The input text is parsed into paragraphs, separated by blank lines. The beginning of each paragraph is checked for commands; the most common form of command is a text string that defines the kind of paragraph. Example:
"*" block 'bullet
It is also possible to have command aliases:
"===" "-1-" line 'sect1
which means that paragraphs beginning with "===" or "-1-" are of the kind SECT1. If the string is followed by the BLOCK keyword, as in the bullet example above, then the text up to the next blank line is considered part of the paragraph; if it is followed by LINE then only the current line is considered part of the paragraph.
The available keywords are:
| end | the string marks the end of the document. This paragraph and anything following it are ignored.
|
| line | only the current line is considered part of the paragraph.
|
| block | all the text up to the next blank line is considered part of the paragraph.
|
| define | special paragraph composed of two parts; it must be followed by another string and either LINE or BLOCK. The other string is a marker; all the text up to the marker string is the first part of the paragraph; then, if LINE is used the rest of the line after the marker is the second part, otherwise if BLOCK is used the text after the marker and up to the next blank line is the rest of the paragraph. (See examples below.)
|
Examples:
"###" end
"===" "-1-" line 'sect1
"---" "-2-" line 'sect2
"+++" "-3-" line 'sect3
"..." "-4-" line 'sect4
With this definition, the "###" string marks the end of the document. The "===" and "-1-" strings mark a one-line paragraph of kind SECT1, and so on.
"***" "*>>" block 'bullet3
"**" "*>" block 'bullet2
"*" block 'bullet
"#>>" block 'enum3
"#>" block 'enum2
"#" block 'enum
These definitions are used to create bullets and enumerations with different nesting levels. <B>Note that when two strings overlap, such as "*" and "**", THE LONGEST STRING MUST APPEAR FIRST IN THE BLOCK.</B> So the correct order in a case like this is the one shown above; any other order would not work correctly.
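The ordering rule above can be illustrated with a small Python model. This is only a sketch, assuming the scanner tries the strings in the order they are written and stops at the first one that matches; the function name CLASSIFY and the default kind "para" are hypothetical and are not part of the scanner itself.

```python
def classify(paragraph, rules, default_kind="para"):
    """Return (kind, text) using the first rule whose string starts the paragraph."""
    for prefix, kind in rules:  # rules are tried in the order they are written
        if paragraph.startswith(prefix):
            return kind, paragraph[len(prefix):].strip()
    return default_kind, paragraph


# Same rules, two different orders.
LONGEST_FIRST = [("***", "bullet3"), ("**", "bullet2"), ("*", "bullet")]
SHORTEST_FIRST = [("*", "bullet"), ("**", "bullet2"), ("***", "bullet3")]

good = classify("** nested item", LONGEST_FIRST)
bad = classify("** nested item", SHORTEST_FIRST)

print(good)  # ('bullet2', 'nested item')
print(bad)   # ('bullet', '* nested item'): "*" matched before "**" was tried
```

With the shortest string first, every nested bullet is swallowed by the "*" rule and the extra asterisk ends up in the paragraph text.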
":" define " -" block 'define
This defines a paragraph made of two parts, with " -" as the marker, and the second part being a block. The paragraph is of kind DEFINE. The above definition will parse a paragraph like the following:
:word - definition.
into the two parts "word" and "definition". The pair is given the kind DEFINE.
";" block none
Using NONE as the kind lets you ignore a paragraph. With the above definition, paragraphs starting with a semicolon will be ignored and will not appear in the output.
Following the keyword, as you have seen in the examples above, you can specify the kind of paragraph with a lit-word, or ignore the paragraph with NONE. It is also possible to specify an action using a paren. The content of the paren is REBOL code evaluated with DO. Inside the paren, the word TEXT refers to the parsed paragraph; typically, you'll use the built-in EMIT function to output the paragraph. In particular, the following two lines are equivalent:
"*" block 'bullet
"*" block (emit bullet text)
The lit-word is just a shortcut for the code shown above. The first argument to EMIT is a word specifying the kind of paragraph, and the second argument is the text for the paragraph.
Other than strings, it is possible to use keywords to specify other aspects of the syntax. The available keywords are:
| default | defines the default action, i.e. the paragraph kind or the code to evaluate for paragraphs that do not begin with any of the given strings. It can appear anywhere in the syntax block.
|
| unrecognized | defines the action in case of parsing error.
|
| indented | defines the action for indented text. Any line beginning with at least one space is considered part of the indented text and is treated like a paragraph. Typically you'll give it the CODE kind.
|
| output | defines a marker for lines that are to be passed as-is to the emitter. All the lines beginning with the given marker are considered part of the paragraph. The paragraph is always given the OUTPUT kind (you cannot specify a custom action). See examples below.
|
| sections | defines the section commands. See the examples below.
|
| commands | defines special commands. See the examples below.
|
| verbatim | defines the list of paragraph kinds that are to be considered verbatim; that is, no inline parsing is performed for paragraphs whose kind is listed here. See below for more info on inline parsing.
|
| inline | defines the syntax for inline commands. See the examples below.
|
Examples:
output "=="
This allows you to send text directly to the output by preceding it with "==". <i>You should avoid using this as it assumes the document will be rendered with a specific emitter to a specific format</i> (e.g. HTML will only work if you are using the html.r emitter). Nonetheless, it can be extremely useful in some special cases. With the above, if you were outputting to HTML, you could insert JavaScript with:
== [script type="text/javascript"] (gt and lt replaced by [ and ] because of Qwiki bug)
== document.write("Something");
== [/script]
Sections are a way to group parts of your text. You can define sections with:
sections in "\" out "/" [
'in or 'indent
'note options text
'table options text
'group
'center
'column
]
which means that sections start with "\" and end with "/"; the OR keyword allows defining shortcuts, so that with the example above you can use IN or INDENT to start an INDENT section (the last lit-word in the OR list defines the section name). \INDENT creates an INDENT-IN paragraph in the intermediate format (with NONE as the text), while /INDENT emits INDENT-OUT. The OPTIONS keyword allows users to follow the command with options, which can be of TEXT kind (the text up to the end of the line, as a string) or of REBOL kind (the text up to the end of the line is considered a REBOL dialect and passed as a block). Options are passed in the intermediate format as the text for the section start command, i.e. with a note like this:
\note Note title
This is a note
/note
according to the above definition and assuming the default kind is PARA, you'd get the intermediate format:
[note-in "Note title" para "This is a note" note-out none]
(Note: we're ignoring inline processing for this example. See below for inline processing.)
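The mapping from section commands to the intermediate format can be modelled in a few lines of Python. This is a simplified sketch, not the scanner's code: it treats every non-blank line as its own paragraph, supports only options of TEXT kind, and assumes that aliases are also accepted on the closing command. The names SECTIONS, WITH_TEXT_OPTIONS and SCAN_SECTIONS are hypothetical.

```python
# Alias -> section name, mirroring the SECTIONS definition above.
SECTIONS = {
    "in": "indent", "indent": "indent", "note": "note", "table": "table",
    "group": "group", "center": "center", "column": "column",
}
WITH_TEXT_OPTIONS = {"note", "table"}  # sections declared with OPTIONS TEXT


def scan_sections(lines, default_kind="para"):
    """Turn lines into a flat [kind, data, kind, data, ...] list."""
    out = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        name, _, rest = line[1:].partition(" ")
        if line.startswith("\\") and name in SECTIONS:
            section = SECTIONS[name]
            options = rest.strip() or None
            if section not in WITH_TEXT_OPTIONS:
                options = None  # sections without OPTIONS carry no data
            out += [section + "-in", options]
        elif line.startswith("/") and name in SECTIONS:
            out += [SECTIONS[name] + "-out", None]
        else:
            out += [default_kind, line]
    return out


note = scan_sections(["\\note Note title", "This is a note", "/note"])
print(note)
# ['note-in', 'Note title', 'para', 'This is a note', 'note-out', None]
```

Python's None stands in for REBOL's NONE here.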
The COMMANDS keyword is used to specify other commands:
commands "=" [
'image options rebol
'row or 'table-row
'column
'options options rebol (attempt [append opts to block! text])
'template options text (repend opts ['template as-file text])
]
which means that commands are introduced by "=". You can use OR and OPTIONS as in section definitions, so in the above the IMAGE command will accept options as a REBOL dialect, and TABLE-ROW can be specified as just ROW. You can also specify an action (this actually applies to sections too) with a paren.
The only remaining keyword is INLINE, which defines inline processing. Inline processing is applied to <b>all parsed paragraph text</b> unless the paragraph kind is in the VERBATIM list. This means that all strings are turned into a block of pairs [word! string!] (as in the intermediate format).
Let's look at an example:
inline [
escape "\"
quote "'" 'word
"**" 'strong
"*" 'emph
rebol "=[" "]"
]
The block defines an escape sequence ("\") that can be used to escape the special meaning of any character. (With the example above, if you want to include a literal "\" in the output, you need to escape it as "\\".) It also defines markers for text. There are two kinds of markers: quote markers, introduced by the QUOTE keyword, and normal markers. A quote marker only applies to a single word. In the example above, the "'" (single quote) marker applies to a word and makes it of kind WORD. This means that the text:
This is a REBOL 'word in the text.
is parsed into:
[normal "This is a REBOL " word "word" normal " in the text."]
Normal markers, instead, surround text. In the example above, we're defining two normal markers, "**" and "*". <b>Note that when markers overlap, the longest one MUST be specified first.</b> With that definition, the following text:
This is *emphasized text,* while this is **strong text.**
is parsed into:
[normal "This is " emph "emphasized text," normal " while this is " strong "strong text."]
To avoid ambiguity for the markers, and to avoid having to use too much escaping, the parser tries to be smart when recognizing markers. <i>A start marker is only recognized as such if it is preceded by white space (or appears at the beginning of the line) and it is <b>not</b> followed by white space. An end marker is only recognized as such if it is <b>not</b> preceded by white space and it is followed by white space (or appears at the end of text).</i> These rules allow writing text as:
2 * 5 = 10
without needing to escape the * character, as it is clearly not an emphasis marker. You can even write:
*2 * 5 = 10*
without ambiguity.
Note that <b>markers cannot be nested</b>.
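The recognition rules for normal markers can be written down as a short Python model. This is a sketch under stated assumptions, not the parser's actual code: it handles only normal markers (no escape, quote or REBOL markers), never nests markers, and leaves a start marker as plain text when no matching end marker is found. All function names are hypothetical.

```python
MARKERS = [("**", "strong"), ("*", "emph")]  # longest marker first


def is_start(text, i, marker):
    """Start marker: white space (or start of text) before, no white space after."""
    j = i + len(marker)
    return (text.startswith(marker, i)
            and (i == 0 or text[i - 1].isspace())
            and j < len(text) and not text[j].isspace())


def is_end(text, i, marker):
    """End marker: no white space before, white space (or end of text) after."""
    j = i + len(marker)
    return (text.startswith(marker, i)
            and i > 0 and not text[i - 1].isspace()
            and (j == len(text) or text[j].isspace()))


def find_end(text, start, marker):
    """Index of the first valid end marker after START, or None."""
    for k in range(start + 1, len(text)):
        if is_end(text, k, marker):
            return k
    return None


def parse_inline(text):
    """Split TEXT into (kind, string) pairs."""
    out, plain, i = [], "", 0
    while i < len(text):
        for marker, kind in MARKERS:
            if not is_start(text, i, marker):
                continue
            end = find_end(text, i + len(marker), marker)
            if end is None:
                continue
            if plain:
                out.append(("normal", plain))
                plain = ""
            out.append((kind, text[i + len(marker):end]))
            i = end + len(marker)
            break
        else:  # no marker starts here: keep the character as plain text
            plain += text[i]
            i += 1
    if plain:
        out.append(("normal", plain))
    return out


print(parse_inline("2 * 5 = 10"))    # [('normal', '2 * 5 = 10')]
print(parse_inline("*2 * 5 = 10*"))  # [('emph', '2 * 5 = 10')]
```

In the second call the inner asterisk has white space on both sides, so it qualifies neither as a start nor as an end marker and stays part of the emphasized text.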
The remaining available markers are REBOL markers. In the above, the "=[" marker starts an embedded REBOL block, and "]" ends it. They can be used to embed a REBOL dialect in the text, for example to specify links and things like that.
Emitters are created with the MAKE-EMITTER function:
my-emitter: make-emitter [
initial: [
...
]
inline: [
...
]
init-emitter: func [doc] [
...
]
build-doc: func [text] [
...
]
...
]
The INIT-EMITTER and BUILD-DOC functions are optional; the first can be used to initialize the emitter, while the second can be used to build the final document with the emitter output (it is called after the document has been converted, and the result of the conversion is passed as argument).
INITIAL is the initial state of the emitter FSM. The FSM processes one paragraph at a time from the intermediate format block; each paragraph is a pair of a word (the paragraph kind) and the data (a string, a block, and so on). For each state, you can define what to do with each kind. Let's start with an example:
initial: [
para: (emit ["p" text "/p"])
]
This means that in the initial state, paragraphs of PARA kind are emitted by wrapping their text with "p" and "/p". (Due to a Qwiki bug we cannot use less-than or greater-than here.) The kind is expressed as a set-word. It is possible to take the same action for many kinds, for example:
initial: [
para1: para2: (emit ["p" text "/p"])
]
will do the same thing for both PARA1 and PARA2. Kinds that are not specified in the state block are ignored for that state. It is possible to use the DEFAULT pseudo-kind to apply an action to any kind that was not specified:
initial: [
para1: para2: (emit ["p" text "/p"])
default: (emit text)
]
In the above, anything other than PARA1 and PARA2 gets emitted as-is.
In the paren that follows the kind set-word, and that is used to specify the code to be evaluated for that paragraph kind, you can use the following words:
| text | inline processed text. If the data for the paragraph is a block resulting from inline parsing by the scanner, this block is processed according to the defined inline processing (see below). TEXT is actually a function that does the inline processing for the paragraph data.
|
| data | the paragraph data, as-is. This word refers to the value that follows the paragraph kind in the intermediate block.
|
| emit | this function emits data to the output. Takes one argument, the value to emit.
|
| error | prints an error message. Takes one argument (the message). (Currently the passed value is just printed out. This is subject to change.)
|
Other than the action paren, it is also possible to specify a destination state. The FSM switches to that state after evaluating the action paren; the next paragraph will be handled by the new state. Example:
initial: [
table-in: (emit "table") in-table
]
in-table: [
...
]
In the above case, TABLE-IN will cause the emitter to emit "table" and then switch to the IN-TABLE state to handle the next paragraph. It is also possible to switch states without any paren action, in case you have nothing to do:
initial: [
table-in: in-table
]
in-table: [
...
]
In the IN-TABLE state, it is possible to use the word RETURN in place of a state name to return to the "caller" state. Example:
initial: [
table-in: (emit "table") in-table
]
in-table: [
table-out: (emit "/table") return
]
In this case TABLE-OUT will cause the emitter to emit "/table" and then return to the previous state (which in this example can only be INITIAL).
When using RETURN, the caller state can also define an action to execute after the called state has returned. Example:
initial: [
table-in: (emit "table") in-table (emit "/table")
]
in-table: [
table-out: return
]
In this case, TABLE-IN will cause the emitter to emit "table", then switch to IN-TABLE. When it returns, "/table" is emitted. There is one extra advantage of using this technique to close the table, instead of the one in the previous example: <i>when the doc ends, the FSM automatically returns to the initial state</i>, so that if there is no TABLE-OUT in the document, the table is closed automatically when the FSM returns to INITIAL at the end of the doc.
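The automatic closing behaviour is easier to see in a small Python model of the FSM. This is only a sketch of the mechanism as described here, assuming that every state switch pushes the caller onto a stack; the rule layout (action, target, text emitted on return), the CELL kind and the function name RUN_FSM are hypothetical.

```python
def run_fsm(states, doc, start="initial"):
    """Process (kind, data) pairs and return the list of emitted strings."""
    out = []
    state = start
    callers = []  # stack of (caller state, text to emit when the callee returns)
    for kind, data in doc:
        rule = states[state].get(kind)
        if rule is None:
            continue  # kinds not listed in a state are ignored
        action, target, on_return = rule
        if action is not None:
            out.append(action(data))
        if target == "return":
            if callers:
                state, pending = callers.pop()
                if pending is not None:
                    out.append(pending)
        elif target is not None:
            callers.append((state, on_return))
            state = target
    while callers:  # end of document: unwind back to the initial state
        state, pending = callers.pop()
        if pending is not None:
            out.append(pending)
    return out


STATES = {
    "initial": {
        "para": (lambda d: "p" + d + "/p", None, None),
        "table-in": (lambda d: "table", "in-table", "/table"),
    },
    "in-table": {
        "cell": (lambda d: "td" + d + "/td", None, None),  # hypothetical kind
        "table-out": (None, "return", None),
    },
}

closed = run_fsm(STATES, [("table-in", None), ("cell", "a"),
                          ("table-out", None), ("para", "x")])
unclosed = run_fsm(STATES, [("table-in", None), ("cell", "a")])

print(closed)    # ['table', 'tda/td', '/table', 'px/p']
print(unclosed)  # ['table', 'tda/td', '/table']
```

The second document never contains TABLE-OUT, yet "/table" is still emitted because the pending action runs when the stack is unwound at the end of the document.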
The full grammar for a state block is:
[some [some set-word! opt paren! ['return | 'continue 'return | 'continue word! opt paren! | word! opt paren! | none]]]
The meaning of the keywords is:
| return | returns to the previous state. (See examples above.)
|
| continue return | lets the previous state handle this paragraph. That is, it is the same as RETURN, except that instead of advancing to the next paragraph, the current paragraph is handled again by the previous state.
|
| continue state | lets the new state STATE handle this paragraph. It is the same as a normal state switch, except that instead of advancing to the next paragraph, the current paragraph is handled again by the new state.
|
| state | switch to the new state STATE. See the discussion above.
|
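The difference between a plain switch and CONTINUE comes down to whether the FSM advances to the next paragraph. The Python sketch below models only that aspect, under the assumption that CONTINUE re-dispatches the same paragraph in the new state; the rule layout (action, continue flag, target), the example kinds and the function name are hypothetical.

```python
def run_with_continue(states, doc, start="initial"):
    """Process (kind, data) pairs; a rule is (action, continue_flag, target)."""
    out, callers, state, i = [], [], start, 0
    while i < len(doc):
        kind, data = doc[i]
        rule = states[state].get(kind)
        if rule is None:
            i += 1  # kinds not listed in a state are ignored
            continue
        action, keep, target = rule
        if action is not None:
            out.append(action(data))
        if target == "return":
            state = callers.pop() if callers else start
        elif target is not None:
            callers.append(state)
            state = target
        # CONTINUE keeps the current paragraph; anything else advances.
        if not (keep and target is not None):
            i += 1
    return out


LIST_STATES = {
    "initial": {
        "bullet": (lambda d: "ul", True, "in-list"),   # continue in-list
        "para": (lambda d: "p " + d, False, None),
    },
    "in-list": {
        "bullet": (lambda d: "li " + d, False, None),
        "para": (lambda d: "/ul", True, "return"),     # continue return
    },
}

result = run_with_continue(
    LIST_STATES, [("bullet", "a"), ("bullet", "b"), ("para", "c")])
print(result)  # ['ul', 'li a', 'li b', '/ul', 'p c']
```

The first bullet opens the list in INITIAL and is then handled again by IN-LIST, and the closing paragraph is handled first by IN-LIST and then again by INITIAL, so no paragraph is lost at either boundary.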