
Markup previews project: status report

The previous week was really hard for me. I had to do a lot of university stuff,
so unfortunately I had no chance to spend much time on my project.

I did manage to do some research earlier, though.

My mentor and I discussed the best way to implement a generic parser. I wrote
a simple implementation of the multi-pass regex approach (you have a set of
regexps with a replacement for each, and the parser applies them to the input
text one after another), which the original Markdown parser uses.
David implemented a simple lexer for MediaWiki syntax, so we were able to
compare both approaches.
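
For readers who have not seen the multi-pass idea before, here is a minimal
sketch of it. It is not the code from the project: the rule set, the
Markdown-like syntax and the name multiPassRender are made up for
illustration.

```javascript
// Hypothetical rule set: each rule is a regexp plus its replacement.
// Rule order matters: bold runs before italic so "**" is not consumed as "*".
const rules = [
  { pattern: /^## (.+)$/gm, replacement: "<h2>$1</h2>" },
  { pattern: /\*\*(.+?)\*\*/g, replacement: "<b>$1</b>" },
  { pattern: /\*(.+?)\*/g, replacement: "<i>$1</i>" }
];

// Apply every rule to the whole text, one full pass per rule.
function multiPassRender(text, ruleSet) {
  return ruleSet.reduce(
    (result, rule) => result.replace(rule.pattern, rule.replacement),
    text
  );
}

console.log(multiPassRender("## Status\nThis is **bold** and *italic*.", rules));
```

Every rule sees the output of the previous one, which is why the order of the
rules and the exact shape of each regexp become fragile as a language grows.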

As a result I decided to use the lexer approach, because it is more flexible
and simplifies the language definition part (it requires less code and the
regexps are much simpler), so it will be easier to add a new language. A nice
property of the lexer approach is that you work with tokens, so a token
definition can contain something like:

mode: "header",
allowed_children: [ "text", "bold", "link" ]

This makes it possible to build complex definitions while keeping them simple
and understandable.

The language definition could look something like:

tokens:
[ { start: '\n##', end: '##(?=\n)',
    mode: "header",
    allowed_children: [ "text", "bold", "link" ],
    token: dojox.markup.HeaderToken
  },
  { start: "[a-zA-Z0-9]", end: '[\\[\'\n]',
    mode: "text",
    token: dojox.markup.TextToken
  },
  // etc..
]
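
To show how a definition like this could drive a lexer, here is a rough
sketch under several assumptions: the definitions object, the lex function
and the simplified "## ... ##" and "**...**" syntax are hypothetical, and
plain text is collected implicitly instead of having its own start and end
patterns.

```javascript
// Hypothetical definitions: each mode lists the child modes it may contain.
const definitions = {
  root:   { allowedChildren: ["header", "bold"] },
  header: { start: /^## /, end: /^ ##/, allowedChildren: ["bold"] },
  bold:   { start: /^\*\*/, end: /^\*\*/, allowedChildren: [] }
};

// Recursively lex `input` from `pos` in the given mode.
// Returns the node that was built and the position where lexing stopped.
function lex(input, mode = "root", pos = 0) {
  const def = definitions[mode];
  const children = [];
  let text = "";
  const flushText = () => {
    if (text) {
      children.push({ mode: "text", value: text });
      text = "";
    }
  };

  while (pos < input.length) {
    const rest = input.slice(pos);

    // The end pattern of the current mode is checked first.
    const endMatch = def.end ? def.end.exec(rest) : null;
    if (endMatch) {
      flushText();
      return { node: { mode, children }, pos: pos + endMatch[0].length };
    }

    // Only modes named in allowedChildren may open here.
    const childMode = def.allowedChildren.find(
      name => definitions[name].start.test(rest)
    );
    if (childMode) {
      flushText();
      const startLength = definitions[childMode].start.exec(rest)[0].length;
      const result = lex(input, childMode, pos + startLength);
      children.push(result.node);
      pos = result.pos;
    } else {
      // Anything else is plain text.
      text += input[pos];
      pos += 1;
    }
  }

  flushText();
  return { node: { mode, children }, pos };
}

console.log(JSON.stringify(lex("## Hi **there** ##").node, null, 2));
```

Because "header" is not listed in the allowed children of "bold", a "## "
inside bold text stays plain text, and no extra regexp is needed to express
that rule.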

With this approach we will have a standard set of tokens that know how to
convert themselves to HTML. These tokens will be used for all markup
languages (e.g. BoldToken, ItalicToken, HeaderToken, LinkToken, ...). Of
course, they can be redefined or extended if needed.
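
As an illustration of tokens that render themselves, here is a small sketch.
The class names follow the ones mentioned above, but the Token base class,
the toHtml method and StrongBoldToken are assumptions made for this example,
not the actual dojox.markup classes.

```javascript
// Hypothetical base class: a token holds child tokens and renders them.
class Token {
  constructor(children = []) {
    this.children = children;
  }
  renderChildren() {
    return this.children.map(child => child.toHtml()).join("");
  }
  toHtml() {
    return this.renderChildren();
  }
}

// Leaf token: escapes its text so it is safe to place inside HTML.
class TextToken extends Token {
  constructor(value) {
    super();
    this.value = value;
  }
  toHtml() {
    return this.value
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;");
  }
}

class BoldToken extends Token {
  toHtml() {
    return "<b>" + this.renderChildren() + "</b>";
  }
}

class HeaderToken extends Token {
  toHtml() {
    return "<h2>" + this.renderChildren() + "</h2>";
  }
}

// A language that wants different output can extend a standard token.
class StrongBoldToken extends BoldToken {
  toHtml() {
    return "<strong>" + this.renderChildren() + "</strong>";
  }
}

const heading = new HeaderToken([
  new TextToken("Hi "),
  new BoldToken([new TextToken("there")])
]);
console.log(heading.toHtml()); // prints "<h2>Hi <b>there</b></h2>"
```

The language definition then only has to map a mode to a token class, and the
rendering logic stays shared between languages.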

Currently I'm implementing a processor for nested tokens. I will commit the
first code soon.

If you have any questions or suggestions, please feel free to comment.