DESIGN
======

I've always wanted a tool capable of formatting source code according to
style(9) from OpenBSD. It shouldn't require one to fiddle with knobs nor supply
needed include path(s) in order to do get the desired formatting right. There
are other tools out there but they didn't manage to satisfy my requirements.

Another motivation for knfmt was the discovery of a paper entitled "A prettier
printer" by Philip Wadler[1]. I found his work both novel and elegant. This was
a chance to put his ideas into use, which others already have done[2].

This document is an attempt to give an introducing to the knfmt source code and
is divided into one section per file. The same files are enumerated in order of
execution. Followed by some noteworthy topics in no particular order.

Design documents are doomed to become outdated and inaccurate, this is no
exception.

[1] https://homepages.inf.ed.ac.uk/wadler/papers/prettier/prettier.pdf
[2] http://blog.vjeux.com/2017/javascript/anatomy-of-a-javascript-pretty-printer.html

lexer.c
=======

The lexer turns source code into a list of tokens and owns the associated
memory. This has several advantages as all tokens are constant, pointers to
them can be compared for equality and remains valid until lexer_free() is
invoked.

In general terms, the lexer API is divided into two categories:

1. Routines that consume one or many tokens, often prefixed with lexer_if.
   Such routines returns non-zero if it managed to consume the next token(s)
   given a criteria, such as being of a certain type. Otherwise, zero is
   returned.

2. Routines that peek at one or many tokens, prefixed with lexer_peek. Such
   routines returns non-zero if the peeked at token(s) satisfied the given
   criteria, such as being of a certain type. Otherwise, zero is returned.

A token can have more tokens associated with it, called dangling tokens. See the
cpp section below for such use case.

clang.c
=======

Not to be confused with LLVM Clang. C language specific routines used by the
lexer, see the cpp conditionals section below.

parser.c
========

The parser recognizes the tokens emitted by the lexer according to the C
grammar in a recursive descent fashion. The parser is also responsible for
constructing the document which represents the formatted source code. Parsing is
therefore not implemented as a distinct step of the execution. It might be
beneficial to separate the two at some point by letting the parser only be
concerned with creation of an AST like structure.

simple-*.c
==========

Various transformations used to subjectively simplify the source code.

expr.c
======

The expr parser is implemented using the Pratt parsing technique[1], a common
approach[2][3].

Some expressions which are valid C are not recognized the expr parser without
assistance from the parser. One example is sizeof() which can be given a type.
Whenever the expr parser encounters something it cannot recognize it gives the
parser a chance to recover, allowing the expr parser to continue; see
expr_recover() and the cpp section below.

[1] https://en.wikipedia.org/wiki/Operator-precedence_parser#Pratt_parsing
[2] https://craftinginterpreters.com/compiling-expressions.html
[3] https://quasilyte.dev/blog/post/pratt-parsers-go/

doc.c
=====

A document represents the formatted source code. In its essence, a document
consists of different types of documents forming a tree which roughly
reassembles the structure of the source code. The most important type of
document is the group which represents something that's intended to fit on a
single line. In doc_exec(), the document tree is traversed causing the formatted
source code to be emitted. Upon entering a group, it checks if all documents
nested underneath it fits on the current line, see doc_fits(). If not, break
mode is entered in which line breaks are allowed to be emitted in the hopes of
not crossing the maximum number of columns per line. If it does fit, munge mode
is entered in which no line breaks are emitted. Causing the document
to reconsider switching between the two modes can therefore only be achieved by
entering a group.

The document representation and any decision to switch between the two modes can
be examined by invoking knfmt as follows:

	$ knfmt -td

The following type of documents are available:

* DOC_CONCAT
  Concat consists of zero or more nested documents. It's often used as a child
  of a group document.

* DOC_GROUP
  Group represents something that's intended to fit on a single line. This is
  the only type of document that can trigger a refit.

* DOC_INDENT
  Indent increases the indentation, the same indentation is not emitted until
  emitting a new line.

* DOC_NOINDENT
  Removes the current indentation by trimming the current line, assuming nothing
  other than indentation has been emitted. The only use case for this type of
  document is goto labels which should never be indented.

* DOC_ALIGN
  Align emits enough white space in order to reach the column associated with
  the same document. Used the by the ruler.

* DOC_LITERAL
  Literal represents a string that must be emitted as is. Tokens emitted by the
  lexer are turned into literal documents.

* DOC_VERBATIM
  Verbatim represents parts of the source code taken as is, such as comments and
  preprocessor directives. These type of documents often carry a trailing new
  line which requires some special care.

* DOC_LINE
  Line emits a new line while in break mode and a space in munge mode.

* DOC_SOFTLINE
  Softline emits a new line while in break mode and nothing in munge mode.

* DOC_HARDLINE
  Hardline emits a new line in both break and munge mode.

* DOC_OPTLINE
  Optional line only honored while inside DOC_OPTIONAL. Used to honor new
  line(s) from the original source code in certain constructs.

* DOC_MUTE
  Mute informs doc_print() to not emit anything expect DOC_HARDLINE. Used while
  traversing cpp branches.

* DOC_UNMUTE
  Variation of DOC_UNMUTE only used while handling clang-format off/on comments.
  Any missed new line(s) while being muted is not honored while going unmute.

* DOC_OPTIONAL
  Allow DOC_OPTLINE to be emitted.

* DOC_MINIMIZE
  Given a list of potentials indentations, select the one that minimizes the
  number of new line(s) and line(s) exceeding the column limit.

* DOC_SCOPE
  Used to enter a new "scope" in some arbitrary sense. Has the side effect of
  resetting document related state, used by DOC_INDENT_NEWLINE for instance.

* DOC_MAXLINES
  Used to alter the maximum number of allowed consecutive new line(s).

cpp.c
=====

Formatting of cpp constructs, such as alignment of line continuations.

ruler.c
=======

The ruler aligns certain language constructs such as function prototypes and
variable declarations. The parser is responsible for calling ruler_insert() on
each line it wants to align. Note it can call ruler_insert() for several columns
on a single line. Once all lines that must be aligned are emitted the parser
calls ruler_exec() which calculates the needed alignment for each line and
column. The ruler is at this point aware of all columns and can therefore deduce
the longest one calculate the necessary alignment accordingly.

error.c
=======

Errors generated by the parser upon encountering invalid grammar are not
directly emitted. Instead, they are buffered using error_write(). This is
necessary since the error might be suppressed after taking the next available
cpp branch, see the cpp conditionals section below.

cpp
===

knfmt does not run the source code through the preprocessor since it would
require one to supply the required include path(s) which in my opinion makes the
utility less user friendly. That has one severe downside of not being able to
distinguish preprocessor directives from other well defined language constructs.

Usage of preprocessor directives can literally be present anywhere in the source
code. Letting the parser cope with this fact would make the implementation
tedious. Instead, preprocessor directives are made completely transparent to the
parser. An emitted token by the lexer will never directly represent a
preprocessor directive. Instead, a token always represents a well defined
language construct and can instead have one or many dangling tokens associated
with it, see tk_prefixes and tk_suffixes. Such dangling tokens can among other
things represent preprocessor directives. The dangling tokens are emitted by
doc_token().

One particular type of preprocessor macros that requires special attention are
the ones that expands to a loop construct. Such macros often include the word
foreach in the identifier. Not being aware that such macros actually expand to
a loop construct would otherwise make the parser halt as something that looks
like a function call followed by either a pair of braces or another statement is
not considered valid C, see parser_stmt_cpp(). An example of such macros are
the ones provided by queue(3).

Another problematic type of preprocessor macros are the ones that expands to a
binary expression. Such macros often have a binary operator as a distinct
argument. Macro invocations are handled by expr parser as they look like
function calls without any further preprocessor knowledge. Encountering a binary
operator in an unary context causes the expr parser to halt. However, in such
scenarios it gives the parser a chance to recover from the situation; allowing
the expr parser to continue, see expr_recover(). An example of such macros are
the ones provided by timercmp(3).

cpp conditionals
================

Conditional preprocessor constructs, internally referred to as branches, can
confuse the parser unless considering each branch in isolation. Some code is
therefore not valid while disregarding the branches:

	int				[1]
	main(void)
	{
		while
	#if 0
			(0)		[2]
	#else
			(1)		[3]
	#endif
			continue;	[4]
	}

Recall, preprocessor directives are made completely transparent to the parser as
described in the cpp section above. Preprocessor branches can literally be
present anywhere in the source code. Letting the parser cope with this fact
would make the implementation tedious.

Instead, clang and lexer keeps track of all branches, see clang_read() and
clang_read_cpp(). Upon entering a branch continuation denoted by #else or #elif,
lexer_pop() will halt while consuming tokens but move passed the branch
continuation while peeking. This gives the parser the impression of traversing
valid grammar and language constructs can still be detected. At some point, the
parser must call lexer_branch() in order to take the next available branch
continuation. lexer_branch() will remove any tokens associated with the branch
just taken[2] and then seek backwards to the last stamped token by the
parser[1]. The stamped token must be positioned at the beginning of some
language construct that the parser can recognize[1]; effectively making the
parser traverse the same tokens again but this time with the previous branch
removed. This has the advantage of not requiring any explicit handling of
branches inside the parser implementation as it reuses existing logic.

One caveat with approach is caused by the parser potentially traversing the same
code more than once as it seeks backwards[1]. While traversing a branch
continuation, nothing should be emitted until unexamined tokens[3] have been
reached. While emitting the last token before a branch continuation[2],
doc_token() will cause any subsequent document to be muted, see DOC_MUTE above.
Once the branch continuation has been crossed, the aforementioned routine
unmutes. The mute decision is therefore tied to tokens and the corresponding
logic resides in lexer_branch().

cpp recover
===========

Some conditional preprocessor constructs, see the cpp conditionals section
above, are never evaluated to true during compilation causing the wrapped code
to be discarded. Such code is therefore not strictly needed to be syntactically
correct as its discarded:

	#if 0				[1]
	main(void) {}
	#endif

	int				[2]
	main(void)
	{
	#ifdef notyet
	}				[3]
	#endif
	}				[4]

The two examples above shows two common idioms for disabled chunks of code.
However, such idioms comes in many forms and maintaining an exhaustive listing
of all variations is unmanageable.

As knfmt by defaults assumes all branches are syntactically correct, it could
enter a disabled branch only to encountered something invalid. Instead of
immediately halting, it gives the lexer a chance to recover by removing the
closest branch given the position of parser, see lexer_recover().

Some scenarios[2] are more tedious than others, while taking the branch[3] the
function implementation is correctly terminated. However, the following token[4]
is not expected at this point. Calling lexer_recover() as this point would cause
the branch[3] to be removed effectively making the unexpected token[4] be the
terminating right brace for the function implementation.
