User:Krauss/arXiv-2

This draft "user space article" is a fork and a partial version of User:Krauss/arXiv-1, intended to detail some of the formalism behind the Web template engine and template system articles.

This draft is used for the collaborative construction of an arXiv.org article by users Krauss, Gabriel.scapin, and any other Wikipedian who understands the need for this work, with a special invitation to Dreftymac.

Identifying template systems: towards a black-box reference model

ABSTRACT DRAFT


 * A template system, in the context of digital communication systems with repetitive transmissions and massive distribution of messages, is the most widely used way to reuse content. It is formally expressed by an input template, input data content, and an output. Using a black-box approach, it is possible to check whether a candidate system can be identified as a template system, avoiding questions about the template language and details of the template processor implementation.
 * Starting with a review of the history and of related work, the intrinsic properties of the system are shown. The proposed definition can account for several cases and standards, including XSLT, XQuery and CSS, as well as traditional macro languages and the most popular template engines. A general classification and characterization of template systems is also founded on black-box behaviour.

Introduction


It is common to observe a specialized part of communication systems that, to accomplish message transmission, encapsulates, adapts or restructures messages, optimizing them for the channel state and for each receiver. This dedicated system can be represented by a layer (figure-1) in the Shannon-Weaver model, between source and transmitter.

When "living" in digital media, it is easy for this kind of system to reuse (cached) content or message fragments on a massive scale, favoring repeated transmissions of different messages with similar information or similar structure, and reducing the demand on the source layer. We will call such systems template systems.

The simplest conception of a template is the "document with placeholders". If X is a placeholder and "Hello X!" is a template document, then "Hello world!" and "Hello John!" are the documents generated by "X=world" and "X=John". This view was popularized by early word processors in the 1980s. The template concept then evolved into more sophisticated tools, with forms, macro languages, style control, database querying, and other features.

In parallel, the concept of digital document has expanded to cover any digital artifact that represents information and is designed to communicate. The automatic generation of documents from templates was adopted in database systems, and evolved into reporting tools.

Today (2010s) there are many systems promoted as template systems, also named template engines, template processors, macro processors or document assembly systems. Wikipedia listed about 60 "template engines" used for dynamic Web page generation, and a dozen "template processors" and "data transform" systems used for document assembly or software source code generation. The diversity is high, but any attempt to organize the universe of template systems also faces the problem of delimiting that universe.

In the field of Web documents, which have seen intensive use of template systems, a first formal definition was given by Hsu and Yih. It is a syntactic definition, focused on extracting information from documents generated by templates, and at that time (1997) the diversity of template engines and template languages had not yet emerged as a problem. A formal linguistic approach to defining templates and template engines was given later (2004) by T. Parr, in a more widely cited work. His focus is the use of templates in the Model-View-Controller (MVC) software pattern in Web applications. Parr's definitions account for the diversity of template languages, but leave out other parts of the template system and some languages that common sense treats as template languages, such as the W3C standard XSLT. Since then no general-purpose definition of template system has appeared: we need to integrate the two definitions and reconcile them with the common-sense view.

The Hsu-Yih template definition is more generic about template tools, but does not describe all the essential properties of a template system. Motivated by a similar "template detection problem", Bar-Yossef and Rajagopalan use a more informal set of definitions, showing the relevance of a "template central authority" and the need to deal with templates of document fragments.

Most of the reviewed template system definitions focus on only one component, such as the processor, the template language, or the output document. We understand that to describe a component we need to describe or contextualize it within the whole system. Our proposal is to "evaluate a candidate system" by abstracting it as a black box and checking whether it meets a predetermined set of properties based on its input and output characteristics. This methodology ensures that our formal definition is based solely on general aspects of observable elements.

We suggest that this general definition can integrate models and delimit a field of study, helping research in Information Extraction, Information Retrieval, Computational Linguistics, Software Engineering, and other areas to deal with template systems and with the "template generation hypothesis".

Illustrating concepts
In a reduced scope, templates are built-in tools that simplify programmers' lives. Parr, Blaževic and others suggested a simple way to express templates, as string functions. Let
 * k1, k2, ..., km be references to string values, forming a set K of named constants;
 * c1, c2, ..., cn be references to input string values, forming a set C of named parameters;

a template function is any function T that only concatenates these elements,
 * T(C) = k1 + c1 + k2 + ... + cn + km

where the symbol "+" is the string concatenation operator, and n and m (&le;n+1) can range from 0 (no parameters or no constants) to any number. In this model the "reused content" is represented by K, and the "variable content" by C. As an example, consider two different templates, T1 and T2, sharing some constants and parameters,
 * T1(C) = k1 + c1 + k2 + c2 + k3
 * T2(C) = c1 + k1

Virtually all popular programming languages have a facility to express this kind of function. Usually the ki constants are written inline as string literals, for example:
 * T(c) = "Hello " + c + "!"

This reduced-scope template representation allows us to illustrate, by analogy with ordinary programming languages, many concepts and conceptual problems.
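As a minimal sketch of the concatenation model above, the two templates T1 and T2 can be written as ordinary Python functions. The values chosen for k1, k2 and k3 are illustrative assumptions only; the point is that both functions share the same constant k1 and do nothing but concatenate.

```python
# Named constants (the set K) shared by both templates; values are illustrative.
K = {"k1": "| ", "k2": ": ", "k3": " |"}


def t1(c1: str, c2: str) -> str:
    """T1(C) = k1 + c1 + k2 + c2 + k3"""
    return K["k1"] + c1 + K["k2"] + c2 + K["k3"]


def t2(c1: str) -> str:
    """T2(C) = c1 + k1"""
    return c1 + K["k1"]


print(t1("name", "Maria"))
print(t2("name"))
```

Changing a value in K changes the output of every template that refers to it, which is the "reused content" role of K in the model.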

Print templates
Many languages incorporate the above concatenation functionality in a print function, which also does something more that template systems must do: present the result on a final output (the "process output" concept of operating systems). A well-known example is the printf function of the standard C programming language, also available in Awk, Perl, PHP, Java and other languages.

In a call such as <tt>printf("Hello %s!", c)</tt>, the string literals "Hello " and "!" are compacted into a single k argument, using the strategy of placeholder mark (%s) substitution, together with some formatting procedures when other percent marks are used: %i for integer formatting, %f for float formatting, etc.
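The same placeholder-mark strategy can be tried out with Python's printf-style `%` operator, used here only as a stand-in for C's printf. The variable names and the second format string are illustrative assumptions.

```python
template = "Hello %s!"   # the single constant argument k, with one placeholder mark
c = "world"              # the input parameter

print(template % c)      # placeholder substitution

# Other percent marks also format the value while substituting it.
report = "%s has %i items costing %.2f" % ("Maria", 3, 4.5)
print(report)
```

Note that the correspondence between marks and parameters is purely positional, which is the readability problem that motivates named marks.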

It is still a function, but at this point it is interesting to take a new look at the components: we name the constant part ("Hello %s!") the template, the parameter reference, c, the input, and the function itself the template processor. This simple change of view might produce some confusion: is what we call a "template" a function, or a string with placeholders? Later in this article we will prefer the second.

When printf uses many parameters, it is not easy to see the correspondence between parameters and placeholders. A common strategy to simplify the programmer's life is to merge the placeholder mark and its associated parameter into a single mark. Perl, PHP and others use this simple syntax, writing the parameter name directly inside the string, as in <tt>"Hello $c!"</tt>.

We will henceforth call these "parameter marks" data-references. In the "<tt>Hello $c!</tt>" example, <tt>$c</tt> is the data-reference.
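A comparable named-mark syntax is available in Python through the standard `string.Template` class, which we use here as an analogy to Perl and PHP string interpolation rather than as a reproduction of it.

```python
from string import Template

t = Template("Hello $c!")            # $c is the data-reference
print(t.substitute(c="world"))
print(t.substitute({"c": "John"}))   # the input can also be a mapping
```

In this form the template is clearly a string with placeholders, and `substitute` plays the role of the template processor.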

Template data-references can be augmented with formatters, as in <tt>printf</tt>. An example in PHP:

where the augmentation is written in an object-oriented style, expressing the format function as a method. This style can also be used as a query language, to build data-references when C is a structured object,

Sometimes there is a need to "make logic decisions". Again with PHP examples,

where "$t?" and "($n&gt;1)?" are conditions, each followed by the "if true" result and then, after ":", the "if false" result. The use of formatters and control logic is good for general-purpose output, but creates another problem: can they perform unrestricted (Turing-complete) computation?
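To make the idea of formatters and conditions concrete, here is a small Python sketch. It is not a translation of the PHP examples; the function name `report`, its parameters and the message text are assumptions chosen for illustration. The two conditional expressions play the role of the "$t?" and "($n&gt;1)?" conditions.

```python
def report(name: str, n: int, t: bool) -> str:
    # Conditional expressions: "if true" result first, "if false" result last.
    greeting = "Dear" if t else "Hi"
    noun = "messages" if n > 1 else "message"
    # name.title() is a formatter written as a method on the data-reference.
    return f"{greeting} {name.title()}, you have {n} new {noun}."


print(report("maria", 1, True))
print(report("john", 3, False))
```

The rule `n > 1` is itself a piece of output-language knowledge (English plural rules) embedded in the template logic, which is where the difficulties discussed next begin.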

The second example shows the "pluralize problem", which can also be expressed by dictionary-format functions,

This introduces an important class of requirements and a new kind of problem: can the placeholder syntax embody the syntax of the output document? An ordinary word-processor checker, like the OpenOffice and MS-Word syntax analysers, can "pluralize" a full phrase using linguistic syntax rules. The use of analyser modules can simplify the programmer's life,

These concatenation and print examples make it possible, in a reduced scope, to understand template concepts and problems. The scale at which template systems work, simplifying the life of designers and programmers (in the use cases of the section below), is that of the whole document. This small change of scale represents a big change of requirements (see document templates).

Functional view of parameter choices
Still within the reduced scope and expressing templates as string functions, we can arrive at a better template representation.

Example: in a typical application we need template variants to adapt the system to different user nationalities, and therefore to different languages. In the "say hello" example, there is one template variant for each language (Perl code),
 * the English variant (greeting "Hello") has the form k11+c1+k2
 * the Portuguese variant (greeting "Olá") has the form k12+c1+k2

We can reuse the template structure for both cases: instead of hardcoding K inside the template, we can define the common structure as a general template. A simple way to implement this idea is to turn the greeting into input data content,


 * which has the form c2+c1+k2

where we lose an opportunity to reuse k1. Another way is to use a conditional template, where the correct string is selected by an (input) selector, s,


 * which has the form (if c2 then k11 else k12)+c1+k2

but both solutions require one more input parameter. The selector idea is good anyway: we can use s as a "template name", reproducing the two prints above by two template calls with the same "template library".

Put concisely, with the greeting as a constant k11 (Hello) or k12 (Olá) according to the selector s, this can be expressed in simplified notation as an indexed function,
 * T(K,C) = T({s,k11,k12,k2},C) = Ts({k1s,k2},C)

Then, the strategy is to reuse the printing templates Ten and Tbr. In a strict functional language, the closure concept offers a way to achieve this "reuse in two stages". Let us express it in more detail as an implementation. In Perl, anonymous functions become closures when they reference variables defined in their enclosing functions,

Now it is more evident which kinds of parameters must be inputs (such as John and Maria), which must be "inherent" constants (such as "!" in <tt>$k2</tt>), and which may be "adaptive constants" (<tt>$k1</tt>).

The closure is, in a functional framework, the best way to represent templates and template libraries.
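The two-stage reuse described in this section can be sketched with Python closures, which behave like the Perl closures mentioned above. The names `make_template` and `library` are our own, and the space inserted after the greeting is an assumption about the intended output.

```python
from typing import Callable


def make_template(k1: str, k2: str = "!") -> Callable[[str], str]:
    """First stage: fix the constants and return the template function T(c)."""
    def template(c: str) -> str:
        # Second stage: only the input c varies; k1 and k2 are captured.
        return k1 + " " + c + k2
    return template


# A small "template library" indexed by the selector s.
library = {"en": make_template("Hello"), "br": make_template("Olá")}

print(library["en"]("John"))
print(library["br"]("Maria"))
```

Here k2 is the "inherent" constant, k1 the "adaptive constant" fixed once per language, and c the only true input of each call.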

Find/replace templates
The automated reuse of digital content, looking back at the 1960s, started before mainframes were able to store digital documents: software source code, when readable by humans, is a kind of digital content. At that time, low-level language (assembler code) was the main language for expressing computer algorithms. A strategy for reusing source code fragments (or text, as was done later) was presented in 1959 as the macro instruction: a rule that specifies how an input label should be mapped to an output text. More precisely, it was designed to "...save time and clerical-type errors in writing sequence of instructions which are often repeated...", and to enforce program writing standards.

Text input was done with punched cards; no simple text editor or copy/paste authoring tool was available. The main task of the macro programmer was to map short labels to longer names and/or frequently repeated fragments of code, producing a set of labels (the macro library). The automated macro processor finds these labels and replaces them with the respective fragments. Labels with parameters, like functions, permit customized fragment expansion. So, a macro label in the input text is like a template-reference with optional parameters (used as local placeholders for the fragment). Example, in <tt>cpp</tt> (the C preprocessor macro language):

Applying these definitions to a text content,

the labels <tt>HEAD</tt>, <tt>PI</tt> and <tt>FOOT</tt> are expanded (replaced by their definitions), producing

<tt>FOOT</tt> and <tt>PI</tt> are typical templates, and <tt>HEAD</tt> has a parameter that defines a local placeholder in its definition. Applying the same definitions to another content,

where the label <tt>PI</tt> is not used (it stays as a library item), and the other labels are expanded, producing

The simplest macro language systems, with only simple definitions (without parameters), are referred to here as "template-reference expansion systems" (see section).
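A template-reference expansion system of this simplest kind can be sketched in a few lines of Python. This is not the cpp algorithm: it is a single-pass find/replace over whole-word labels, and the contents of `MACROS` are hypothetical definitions chosen for illustration.

```python
import re

# Hypothetical macro library: label -> replacement fragment (no parameters).
MACROS = {"PI": "3.14159", "FOOT": "-- end of page --"}


def expand(text: str, macros: dict[str, str]) -> str:
    """Replace each whole-word label found in text by its fragment (one pass)."""
    if not macros:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, macros)) + r")\b")
    return pattern.sub(lambda m: macros[m.group(1)], text)


print(expand("area = PI * r * r\nFOOT", MACROS))
```

Because the replaced fragments are never scanned again, this sketch cannot loop, which is what keeps this class of system simple.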

With small macro language enhancements, such as the use of parameters (the HEAD macro example), complexity grows. Using another FOO definition like this,

a template-reference (to PI) inside a template definition must be evaluated. If we add recursion and logic ("if" commands), the macro language becomes Turing-complete. Whether a macro language needs all this computational power is another matter for discussion.
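The need to evaluate references inside definitions can be illustrated by re-scanning the output until nothing changes. In this Python sketch the definitions in `MACROS`, the all-capitals label convention and the pass limit are all assumptions; the limit is there because a self-referencing definition would otherwise never terminate.

```python
import re

# Hypothetical definitions: FOO refers to PI, so its body must itself be expanded.
MACROS = {"PI": "3.14159", "FOO": "approx. PI rad"}
LABEL = re.compile(r"\b[A-Z]+\b")   # assumed convention: labels are all capitals


def expand_all(text: str, macros: dict[str, str], max_passes: int = 10) -> str:
    """Re-scan the output until no label is left or the pass limit is hit."""
    for _ in range(max_passes):
        expanded = LABEL.sub(lambda m: macros.get(m.group(0), m.group(0)), text)
        if expanded == text:        # fixed point: nothing left to expand
            return expanded
        text = expanded
    raise RecursionError("macro expansion did not terminate")


print(expand_all("half turn: FOO", MACROS))
```

A definition such as `{"X": "X X"}` grows on every pass and ends in the error branch, a small reminder of how quickly added power raises termination questions.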

Document templates
{figure here: content1+template1=content2; content1+template2=content3; content4+template1=content5} In the process of generating "new content", starting with content embodied in a template or supplied as input to a template system, content is reused (see illustration).

A book, a letter, a song or a set of pixels on a screen all carry information that can be interpreted as content by humans; this content may or may not have been authored by a human, as with a database-generated report.

In any of these cases, authoring costs and the risk of misunderstanding (due to an inadequate form) might be reduced if digital content could be reused across different forms of use. Reuse can be defined as "the use of existing digital content to produce new content, or the application of existing content to a new context or setting".

...

... The highlighted terms can be organized, sketching a typology of typical uses of template systems: FIG: 3 typical kinds of simple placeholder templates.
 * Document generation: for generating final documents or assembling non-final ones. With one template (e.g. a letter template) and many inputs, many documents are produced (e.g. the final letters). The template system produces a set of "template generated documents". Main cases (see illustration):
 * Form letter: office memoranda, standard form (adhesion) contracts and legal boilerplate (reused fragments of law) are typical text candidates for this kind of template. They allow mass production of documents with similar content. Specific examples: appendix "Simplest examples", OpenOffice or MS-Word.
 * Letter frame: also known as skin, letterhead or "header and footer frame" templates. Any content is "filled in the blank" of this kind of template. It allows mass production of documents with non-similar content within a "standard frame". A sample document with a corporate identity letterhead, or a document model for a corporate memorandum, are well-known letter frames. Specific examples: appendix "Simplest examples", OpenOffice frames.
 * Query report: a report document produced from input information such as an input form or a database query tool. Typical report templates are like a letter frame template with many placeholders. Examples: the report writers of OpenOffice Base, MS-Access, Oracle, etc.; a "confirm your data" report after a "please fill the form" interface; a report spreadsheet (in OpenOffice-Calc or MS-Excel) calculated from a spreadsheet of raw data.


 * Dynamic web page generation: with one template (e.g. a letter template) and one input (e.g. a posted form), one document is produced (e.g. the final letter). There are many possible inputs, so the template system can produce many different possible documents. This is the main difference from document generation: a set of "possible documents" is different from a set of "concrete documents". Form letter, letter frame and query report can be used as typical examples of content generation in a web page.


 * Source code generation: with a template language augmenting a programming language, software source code can be expanded and easily reused. New source code is generated by the template. It is analogous to the document generation use cases, with documents replaced by software source code. Example: a macro processor, such as the C preprocessor (UNIX cpp). Source code generation also illustrates the case where the script language and the output language are the same (or of the same kind), and the main cases where the criterion separating the template language is not lexical but syntactical (see metaprogramming).

Template languages
A black-box reference model for a template system does not need any syntactic assumptions about the template language, but it is opportune to introduce language concepts, to distinguish template parts and types of template languages.

In the (reduced-scope) functional view, we presented a template as a concatenation of constants, k, and parameters, c, and showed how it can also be viewed as a string of k-references and c-references. With Parr's lexical modeling approach, a template string T is an alternating list of k and c, represented by a simple production rule,


 * where &epsilon; is the empty template, "+" indicates one or more, and "|" separates alternatives.

In the scale of documents, k is expressed in the "output language" (the source code of output documents, such as HTML), and c is expressed in a "query language". Templates also have the power to generate different structures, and the logic instructions to accomplish this task are expressed in a "macro language". There are four syntactic components:


 * 1) constants (k): the K constants, for content or layout. Transforms on k and template-references are also considered constants.
 * 2) script (s): a lexical construct with the form          containing,
    * expressions (e): each expression is a data-reference, c, or a specification of a transform applied to c, f(c). Expressions with data-references and constants, f(k,c), are also considered expressions.
    * logic (l): the macro script, logical specifications of the use of constants and expressions. It is like an embedded software program.
 * 3) delimiters (d): a lexical frontier between s and k.

Then, lexically, a template is any sequence of k, s, and d with the form,

where h is an optional "template header", which some languages use to encapsulate everything else. Ignoring headers and delimiters, and merging expressions and logic, we have Parr's unrestricted template definition, "an alternating list of output literals and action expressions". A query can be a simple data-reference or a complex, (computationally) unrestricted expression, depending on the power of the query language. Similarly, the macro language can be a simple regular grammar or an unrestricted one.
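The lexical view can be made concrete with a small tokenizer that separates constants from scripts. In this Python sketch the delimiters `{{` and `}}` are hypothetical (real template languages choose their own marks), headers are ignored, and the script content is not interpreted.

```python
import re

# Hypothetical delimiters d; real template languages choose their own marks.
SCRIPT = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)


def tokenize(template: str) -> list[tuple[str, str]]:
    """Return the template as a list of ('k', text) and ('s', text) tokens."""
    tokens = []
    pos = 0
    for m in SCRIPT.finditer(template):
        if m.start() > pos:                       # constant before the script
            tokens.append(("k", template[pos:m.start()]))
        tokens.append(("s", m.group(1).strip()))  # script between delimiters
        pos = m.end()
    if pos < len(template):                       # trailing constant
        tokens.append(("k", template[pos:]))
    return tokens


print(tokenize("<p>Hello {{ name }}!</p>"))
```

Such a tokenizer works only because the delimiters are lexical; it has no way to handle scripts that are woven into the syntax of the output language.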

This simple lexical model for T is only illustrative. It describes usual template languages, like PHP or Velocity, but not all of them; in particular, by Parr's definition, XSLT style sheets "are not templates at all".

There is another problem: describing non-lexical delimiters, where the syntax of the output language and that of the script language are mixed (see the pluralize problem illustrated in the section "Print templates"). As stated by Brabrand and Schwartzbach, "The paramount characteristic of a macro language is whether it operates at the lexical or syntactical level". Syntactic scripts do not need delimiters and are integrated with the output language. For black-box modeling we can suppose that a syntactic decoupling is done by an observation function, returning the set of output language fragments of the template.

Template generation hypothesis
Observing two or more documents with very similar content, we can imagine that they were produced by a templating process. Typical boilerplate texts can be generated by template systems or by simple copy-and-paste editing.

So there are at least some situations where, given a set of documents with very similar content, we can imagine a common production process that explains this high similarity. There are many other cases where the similarity is not (only) about the content, but about the document structure.

Template systems are very popular and have been generating digital documents for years. Add to them the many other mass-produced documents that were created "by hand", with copy and find/replace editing procedures. There are so many template-generated documents that the "template generation hypothesis" is hardly ever null, for any kind of document. For web pages, Gibson et al. estimated that 40% or more can be supposed to be template generated.

There are two main ways to use this hypothesis:


 * 1) In a statistical context (implicit templates): document clustering, classification, linguistic analysis, plagiarism analysis[refs], cloned code detection, and many other kinds of document analysis that compare documents can refine their methods by making assertions about the probability of occurrence of template-generated documents.
 * 2) In an information extraction context (detected templates): first proposed by Hsu and Yih (1997). Template-based information extraction methodologies and algorithms suppose the existence of similar (template-generated) documents in their working set. They perform a kind of "reverse engineering of the template", and extract data according to this template. There are two typical goals in this context, also known as template detection or document segmentation:
    * Extract content: ranges from loose, as in Yih's template structure extraction, to precise extraction. An example where precision is relevant is the conversion of full-text scientific journal articles from a raw format (HTML, OpenDocument, etc.) into XML conforming to the NLM DTD, undertaken by publishers that do not have an XML publishing pipeline but want to deposit articles in PubMed Central. Semi-automated tools like INERA's perform assisted conversions to the NLM DTD. Since scientific articles have a rigid structure (with parts such as title, authors, affiliations, abstract, body, reference list, etc.) and journals maintain a stable style, it is always possible to use the template generation hypothesis on a large number of articles.
    * Extract noise: template systems can also introduce content perceived as pollution, mainly in commercial web pages, which usually have navigation bars, advertisement banners and widgets introduced by templates. This is noise, and must be removed to enhance content analysis. Research and applications are oriented towards this removal goal.

These applications need a formal reference model of template systems, and for Statistical Theory applications it is typically a black-box model.
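A toy version of the "reverse engineering of the template" idea can be written with Python's standard `difflib`. This is not the method of Hsu and Yih or of any other cited work: it is only a sketch that aligns two token sequences, keeps what they share as candidate constants, and marks what differs as a slot. The function name and the `<X>` slot mark are assumptions.

```python
from difflib import SequenceMatcher


def infer_template(doc_a: str, doc_b: str, slot: str = "<X>") -> str:
    """Keep the tokens shared by both documents; mark differing runs as slots."""
    a, b = doc_a.split(), doc_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])     # candidate constant part (k)
        else:
            out.append(slot)         # candidate data-reference (c)
    return " ".join(out)


print(infer_template("Hello John , welcome back !", "Hello Maria , welcome back !"))
```

With only two documents any shared input value would wrongly be taken as a constant, which is one reason practical approaches work with larger sets and statistical criteria.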

Objectives
The goal of this article is to offer a black-box reference model for template systems, so as to:
 * provide a syntax-independent approach for template system identification;
 * use a black-box approach to characterize the template system properties;
 * provide a uniform and consistent terminology;
 * reconcile the software engineering views with the statistical views, towards a common reference model.

Secondary goals are: to offer a conceptualization and definitions more generic and inclusive than those of Parr and Hsu-Yih; to fix a reference model for the template generation hypothesis; and to provide support for the establishment of "field limits".

Main related works
...

System context characterization

 * FIG3 (spans two columns): In the upper detail, Shannon's diagram of a general communication system. The template system is located near the message creation: at the end of the information source or at the beginning of the transmitter. The main information, supplied by the author, is packed with the custom information, organized by the designer. If the template needs some "authoring power" (generating information), it is an "authoring-oriented template system"; otherwise it is transmission-oriented.

Since template systems are tools for the production of documents, each template system is a part (subsystem) of a communication system.

A "transmission model of communication" shows, for technical contextualization, the role of the template system in communication actions. In the classical information theory framework, represented by Shannon's diagram (illustrated), the template system is located at the end of the information source or at the early portion of the transmitter box, depending on the purpose for which the template system is used. The typical usage is for "packing the message", as a pre-transmitter parser, to better fit the channel or the receiver profile. Once in use, the template system plays an important role in the information supply chain, and it is natural for it to take on other responsibilities: the main one is to filter the information source.

Shannon's framework shows how accurately the document content is transmitted. To check how precisely the meaning is "conveyed", or how effectively the received meaning affects readers' behaviour, we need a wider contextualization. Communication, as a social phenomenon, makes sense in the context of a community.

For authors, writers and editors it is difficult to work on the production of a document if the audience (their public) is not defined. The long-term feedback between senders and receivers produces the communicational standards and assumptions that they use in the interpretation of documents. As expressed by Wnek (2002), "Understanding documents is a relatively easy task for intelligent human reader (...) This is due to the fact that the documents are prepared using some common assumptions about structuring them, and authors' intend to convey information in ways allowing readers accurate and efficient interpretation".


 * FIG4: ... usual "creative authoring" can be an isolated task for authors, but the whole team must conform to the community assumptions in the templating tasks, even in the authoring of boilerplate texts.

The community in which sender and receiver are embedded defines standards and common assumptions to be used as requirements on the set of documents generated by the template system, particularly the requirement about "repeated portions" of structure or content. The community, directly or indirectly, sets forth the "replication requirements" that templates must meet.

In the context of templated webpages, Bar-Yossef and Rajagopalan consider that "... the common look-and-feel of these pages is controlled or influenced by a (single) central authority". This is a specific case where the community has central control. On the other hand, the same authors consider that many documents may "accidentally share" repeated portions of content, which act as important signatures of communities. We prefer to consider these signatures also as part of the common assumptions that justify the use (intentional or not) of templates, distinguishing only the hierarchy level of the "central control community" (like a template designer) from a wider one, like the set of lawyers forming the community of legal boilerplate texts.

Identifying the community is not such a simple task. We suggest expressing a "communicating by documents hierarchy of communities". For any example we can do that by abstracting one or more layers. Take the case of a scientific article, as part of a specialized English-language scientific journal (illustrated): it is produced for and by persons who are members of a specialized portion of the scientific community, and this community is a member of a wider community, that of English language speakers.

Finally, another social aspect is how to operate a template system with a team. We can define the main roles in this task:
 * template writer: if there is more than one person, there is a need to separate tasks, "(...) programmers and graphic people have very different skills and work habits (...)",
 * programmer: works with the template script, that is, the "logic part" of the template.
 * designer: works with the "content and layout part" of the template. The target is the look and feel of the output documents. Can also work, making minor changes, as author or as programmer.
 * author (content supplier): supplies the main content of the output documents, regardless of whether it is dynamic content (input) or static content (blocks of content in the template).
 * editor (manager): makes the strategic decisions and operates (turns on/off) the template processor.

If there is no team, or the task is simple, one or more roles are merged into one job.

The "separation of concerns" invoked as the main aim of template systems by Parr, Mazzocchi, and others is both a source/transmitter separation and a division of labor. It therefore depends on the system context characterization.

In summary: to characterize a template system it is sufficient to express its place (as transmitter or as "source and transmitter"), the content community context (some layers), and the user context (needs and actors of the team).

Content characterization
The classic digital document representation considers the document as an opaque object, the computer file, without any semantic interpretation of its content. Old Library Science approaches required each file to be examined by a human expert, producing an index or categorization of the object. Then, the Information Retrieval (IR) systems of the 1980s started to convert document file objects into the simplest transparent standard representation of textual documents, the TXT file (MIME text/plain format).

Afterwards, the most successful approach to using the TXT representation without human assistance was to split it into terms (tokens), supplying the term distribution as a content representation. This approach originated with Salton's vector space model, where each document j is represented by a vector r,


 * vector rj = (w1,j, w2,j, ..., wn,j)

Each dimension (i=1,2,...,n) corresponds to a separate term. A term is a string token, a word, a lemma, a composite term, a sentence, or any other well-defined "textual atom". If a term occurs in the document, its (term) weight, wi,j, is non-zero. Term weights are the intra-document term frequency or another quantitative linguistic metric, like TF-IDF. The vector representation can be converted to a set representation (a set of ordered pairs),


 * set Rj = {("term1",w1,j), ("term2", w2,j), ..., ("termN",wn,j)}, without the pairs where wi,j=0. Associative arrays and relational (SQL) representations are well suited to representing sets of pairs.
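The two representations can be sketched in Python with an associative array. The tokenization rule (lower-cased word characters) and the use of raw in-document frequency as the weight are assumptions made for brevity; a metric like TF-IDF would need the whole document collection.

```python
import re
from collections import Counter


def term_weights(text: str) -> dict[str, int]:
    """Set representation Rj: term -> raw in-document frequency (non-zero only)."""
    terms = re.findall(r"\w+", text.lower())
    return dict(Counter(terms))


def as_vector(weights: dict[str, int], vocabulary: list[str]) -> list[int]:
    """Vector representation rj over a fixed vocabulary; absent terms weigh 0."""
    return [weights.get(term, 0) for term in vocabulary]


r = term_weights("Hello world, hello template")
print(r)
print(as_vector(r, ["hello", "template", "world", "xml"]))
```

The dictionary stores only the non-zero pairs, while the vector needs an agreed vocabulary to fix the meaning of each dimension.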

The mainstream of today's IR systems use optimized forms of the term-weight representations. It is enough for a lot of applications, ranging from indexation of PDF files, to plagiarism analysis. For more precise content representation there are some criticisms: fine-grained terms, as words, have pour semantic and orthogonality problems, convertiong words to lemmas have orthogonality gains with information losses, coarse-grained terms, as sentences or paragraphs, have problems of chunking and generalization.

For content characterization purposes, the term-weight representation supplies a document’s signature (fingerprint). We can use more than one term-weight representation for each document, as we will see below, by varying grain size and term filters.

In the late 1990s and the 2000s the XML standard was consolidated as the simplest uniform way to represent digital documents. The main enhancements, from TXT to XML content representation, are structure (sections, chapters, paragraphs, etc.), semantics (e.g. differentiating an appendix section from a chapter section) and the possibility of adding metadata. The content representation can be supplied directly by the XML file, as with the NLM DTD for scientific articles, or isolated from noise and "secondary contents" by a sophisticated block analysis.

For our purposes, where the contents of all components of a template system need the same representation, we adopt the following considerations,


 * Objects XML representation: there is an XML representation for every object: input template, input data and result document;


 * Conversion resultant set: there is a well-defined content representation based on terms. The term definition depends on the chosen granularity (word/sentence/paragraph/etc.), filter loss level (e.g. dropping "stop words" or converting words into lemmas), and term richness (e.g. also using tokens or XML tags as terms). This choice can be symbolized by an identifier &alpha;.


 * Conversion process:
 * First step (split to &alpha;+1): the XML can be segmented into a set of blocks; it is like converting the whole XML document into a set of simplified TXT fragments. Each block, inferred from the XML tree (or by another criterion), has a contiguous subsequence of terms of granularity &alpha;. The "block type" has a granularity adequate for a "block analysis" (e.g. sentences), so we agree that blocks are terms with granularity "&alpha;+1". Note: Broder et al. and other authors use the size h of fixed-size blocks as a parameter.
 * Final step (split to &alpha;): the blocks (terms of type &alpha;+1) can be split (preprocessing with adequate filters and conversions) into terms of the chosen type &alpha;.
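A minimal sketch of the two-step conversion, under stated assumptions: blocks are taken to be the text of the leaf elements of the XML tree, terms are lowercase words, and the stop-word list is a hypothetical filter. The function names `split_blocks` and `split_terms` are illustrative only.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

STOP_WORDS = {"the", "a", "of"}   # assumed filter, part of the choice of alpha

def split_blocks(xml_text: str) -> list[str]:
    """First step: one text block per leaf element of the XML tree (alpha+1)."""
    root = ET.fromstring(xml_text)
    blocks = []
    for elem in root.iter():
        # A leaf element has no children; keep it only if it carries text.
        if len(elem) == 0 and elem.text and elem.text.strip():
            blocks.append(elem.text.strip())
    return blocks

def split_terms(blocks: list[str]) -> Counter:
    """Final step: split blocks into filtered word terms (alpha) and count them."""
    counts = Counter()
    for block in blocks:
        words = re.findall(r"[a-z]+", block.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts

doc = "<article><title>The template</title><p>A template of the page</p></article>"
blocks = split_blocks(doc)    # terms of granularity alpha+1
terms = split_terms(blocks)   # term counts of granularity alpha
```

Changing the block criterion or the filter amounts to choosing a different &alpha;.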

Content as a set of information units
The choice of a specific type of content representation of an XML object X can be formally expressed by a function controlled by the parameter &alpha; (the identifier for granularity level, loss level, and term richness) that produces a set C of &alpha;-term-weight units of information,
 * C = asInfo&alpha;(X).

As a set, C can be submitted to the Set Theory operators: union, intersection, and difference. For a satisfactory asInfo&alpha; function we can take the (term) weights, wi,j, to be integers counting the occurrences of each term in the document object. As an implementation example, suppose a relational representation, with a tuple &lt;<tt>obj_id</tt>, <tt>term_id</tt>, <tt>alpha</tt>, <tt>term_count</tt>&gt; of integers. The operators can then be modeled as SQL queries.
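As a hedged illustration of that relational modeling, the sketch below stores the tuples in an in-memory SQLite table and maps the three set operators to the SQL compound operators. The table name `info`, the sample rows and the helper `terms` are assumptions, and the comparison is made on term identifiers only, ignoring the counts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE info (obj_id INTEGER, term_id INTEGER, "
    "alpha INTEGER, term_count INTEGER)"
)
con.executemany("INSERT INTO info VALUES (?, ?, ?, ?)", [
    (1, 10, 0, 2), (1, 11, 0, 1),   # object 1 holds terms 10 and 11
    (2, 11, 0, 3), (2, 12, 0, 1),   # object 2 holds terms 11 and 12
])

def terms(op: str, x: int, y: int, alpha: int = 0) -> list[int]:
    """Apply UNION, INTERSECT or EXCEPT to the term sets of objects x and y."""
    if op not in ("UNION", "INTERSECT", "EXCEPT"):
        raise ValueError(op)
    query = (
        "SELECT term_id FROM info WHERE obj_id = ? AND alpha = ? "
        f"{op} "
        "SELECT term_id FROM info WHERE obj_id = ? AND alpha = ? "
        "ORDER BY term_id"
    )
    return [row[0] for row in con.execute(query, (x, alpha, y, alpha))]

union = terms("UNION", 1, 2)
intersection = terms("INTERSECT", 1, 2)
difference = terms("EXCEPT", 1, 2)
```

A weight-aware variant would join on `term_id` and combine `term_count` values instead of comparing identifiers alone.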

See the "Observation tool kit" appendix for other operations and details.

Relative uniqueness
The content of two objects, X1 and X2, can differ, that is, asInfo&alpha;(X1)&ne;asInfo&alpha;(X2), and if no element of asInfo&alpha;(X1) is in asInfo&alpha;(X2), they are disjoint,
 * asInfo&alpha;(X1) &cap; asInfo&alpha;(X2) = &empty;.

If, in addition, neither asInfo&alpha;(X1) nor asInfo&alpha;(X2) is empty, we can say that each element of asInfo&alpha;(X1)&cup;asInfo&alpha;(X2) "is unique", or use a function to summarize this assertion for a specific &alpha;,
 * isUnique&alpha;(X1,X2) &hArr; asInfo&alpha;(X1)&ne;&empty; &amp; asInfo&alpha;(X2)&ne;&empty; &amp; asInfo&alpha;(X1)&cap;asInfo&alpha;(X2)=&empty;

The isUnique&alpha; function makes it possible to express "object signatures" or fingerprints.
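The definition translates directly into a predicate over two term sets. In this sketch the sets are assumed to be already computed by some asInfo&alpha; implementation, and the name `is_unique` is illustrative.

```python
def is_unique(info_x1: set, info_x2: set) -> bool:
    """isUnique: both term sets are non-empty and they share no element."""
    return bool(info_x1) and bool(info_x2) and info_x1.isdisjoint(info_x2)

letter = {"dear", "customer", "regards"}   # hypothetical asInfo results
invoice = {"total", "amount", "due"}
reminder = {"dear", "amount", "due"}
```

Here `is_unique(letter, invoice)` holds, while `is_unique(letter, reminder)` fails because "dear" is shared, and any comparison with an empty set fails by definition.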

An important consequence of this &alpha;-uniqueness definition is uniqueness at &alpha;+1,
 * isUnique&alpha;(X1,X2) &rArr; isUnique&alpha;+1(X1,X2)

since coarse-grained terms have, by the definition of the process, the same content,
 * asInfo&alpha;(asInfo&alpha;+1(X))=asInfo&alpha;(X)

Scripts and structured contents
Template files, as shown in the Introduction, are expressed with two languages: output language and script language. The asInfo&alpha; function extracts only the content expressed in the output language. To compare templates in this paper we use file comparisons, implemented with difference analysers such as the UNIX <tt>diff</tt>, or a "fuzzy difference" such as the NCD similarity.

In a black-box context, the script content is not fully accessible, and the asInfo function applied to T uses a script filter. Without filtering, templates with the same content would appear different. Two templates with the same logical behaviour can differ in their internal variable names, or in the structure of the template library. There are two ways to compare template scripts without this kind of confusion: 1) comparing by input-output analysis; and 2) comparing the abstract syntax trees of the parsed scripts. We use the first, modelling the template script and process as hidden parts of the black-box model.

On the other hand, many information retrieval techniques use abstract syntax trees in content processing, to deal with the structure expressed by the output language. The term-weight representation, however, limits this use to the &alpha;+1 block sizing and &alpha; term detection.

As an illustration, take a hypothetical case where "." and ";" are delimiters and the letters A-Z are terms. The first delimiter, ".", a phrase delimiter, is filtered out. The second, ";", is a term at &alpha;, but at &alpha;+1 it is not.
 * Let X="A;B.C." and Y="B;A.C.",
 * asInfo&alpha;(X) = {"A", ";", "B", "C"} = asInfo&alpha;(Y)
 * asInfo&alpha;+1(X) = {"A;B","C"}  &ne;   {"B;A","C"} = asInfo&alpha;+1(Y)
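The hypothetical case can be checked mechanically. The sketch below assumes weights are ignored (plain sets of terms) and uses two illustrative functions, `as_info_alpha` and `as_info_alpha1`, for the two granularities.

```python
import re

def as_info_alpha(text: str) -> set[str]:
    """alpha: letters and ';' are terms; the phrase delimiter '.' is filtered."""
    return set(re.findall(r"[A-Z;]", text))

def as_info_alpha1(text: str) -> set[str]:
    """alpha+1: each phrase (the text between '.' delimiters) is one term."""
    return {phrase for phrase in text.split(".") if phrase}

X, Y = "A;B.C.", "B;A.C."

same_at_alpha = as_info_alpha(X) == as_info_alpha(Y)       # order is lost
differ_at_alpha1 = as_info_alpha1(X) != as_info_alpha1(Y)  # order inside a phrase is kept

# Splitting the alpha+1 blocks again gives back the alpha terms.
resplit = set().union(*(as_info_alpha(block) for block in as_info_alpha1(X)))
```

At &alpha; the two objects are indistinguishable, while at &alpha;+1 the phrase "A;B" differs from "B;A", so the coarser grain captures the structure.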

It is always possible to find a "minimal &alpha;" that captures structure, or a "best &alpha;" that maximizes the granularity and the number of elements of asInfo&alpha;, at the filter loss level required for content analysis.

Black-box template system modeling


A "candidate system" to be characterized as a template system must be thought of as a black box: an abstraction representing a class of concrete systems which can be viewed solely in terms of its "stimuli inputs" and "output reactions". &laquo;''The constitution and structure of the box are altogether irrelevant to approach under consideration, which is purely external or phenomenological. In other words, only the behavior of the system will be accounted for''&raquo;. Black-box testing methods ensure that, based solely on observable elements, it is a valid model.

The dataflow illustrated below shows the system model and its parts:
 * template (T) and data content (C), are inputs;
 * the template processor (P), together with the template script, is a black box;
 * resultant output document (R), is the output.

The data model of C must be recognized by the template script. Other a priori information about the processor implementation and the template script is not considered, and the inputs and outputs do not need a syntax characterization. The "observer" uses a set of "observation tools" that are also black boxes.

Informal checklist
In an informal survey we found what most of the programmers and designers interviewed regard as "essential functionalities" of a template system, using as reference the simplest ones, like the old mail merge systems. Regarding the essence and utility of these systems, they pointed out many properties, which were consolidated as follows,


 * 1. Singular behaviours, about cases with empties (T, C or R empty). It also depends on how the template system is represented.


 * 2. Output reuse:
 * 2.1. R can be reused as input (template or data content). It is the "save the document as a template" feature.
 * 2.2. reuse of blocks. Similar to 2.1, but selecting only a block of R. The "copy/paste fragments to reuse it" feature.


 * 3. Modus operandi: a useful system needs to operate in three modes,
 * 3.1. reusing C (varying template): changing T, changes R.
 * 3.2. reusing T (varying input data): changing C, changes R.
 * 3.3. reporting (varying all inputs): changing T and C, changes R.


 * 4. Conserving behaviours: ideally (as a simplification hypothesis) the system must be "conservative",


 * 4.1. conservation in time (P determinism): at any time, if the inputs are the same, the output must be the same. As a reference model statement, we can also suppose that random functions cannot be used (by query or macro languages).


 * 4.2. information conservation: conserve the information of the input in the output. Depending on the goals of the system, this condition must be relaxed to allow for filter and injection requirements. Filtering is a usual goal in query languages, and a usual demand in communication systems.


 * 4.3. knowledge conservation: when the macro language or query language performs format transforms, like pluralizing words (e.g. octopus into octopi) or transforming <tt>1010999</tt> into <tt>1,010,999</tt>, the "original content" becomes corrupted, but it is a "required corruption", and the transform is reversible. When summarization operations take content from C or T and condense it, the summarized data in R is new information; for instance, using addition to insert totals in a report. These operations conserve the "original intent" of the inputs and do not add new knowledge. On the other hand, if a processor P has a bug that injects some information, that information is not required; and if it uses memory for inferences, it adds new knowledge. The basic requirements can be expressed in terms of "delegation of some or none of the authorship" to the template system. If "none", information conservation is satisfied; if "some", it must be relaxed, and knowledge conservation adopted.


 * 5. Context condition: about where a template system can be used, to be considered a template system. The general conditions were pictured in the "System context characterization" section; more technical conditions are detailed in the "Pipelined expansions and orthogonality" section.
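Items 3 and 4.1 of the checklist can be probed from outside the box. The sketch below is only an assumption about how an observer might do it: `P` is a stand-in processor built on Python's `string.Template`, and the three check functions are hypothetical names that only compare outputs, never inspecting the processor.

```python
from string import Template

def P(T: str, C: dict) -> str:
    """Stand-in black box: a toy mail-merge processor (an assumption)."""
    return Template(T).safe_substitute(C)

def changes_with_template(proc, T1: str, T2: str, C: dict) -> bool:
    """Item 3.1: reusing C and changing T changes R."""
    return proc(T1, C) != proc(T2, C)

def changes_with_data(proc, T: str, C1: dict, C2: dict) -> bool:
    """Item 3.2: reusing T and changing C changes R."""
    return proc(T, C1) != proc(T, C2)

def is_deterministic(proc, T: str, C: dict, runs: int = 3) -> bool:
    """Item 4.1: repeated runs with the same inputs give a single output."""
    return len({proc(T, C) for _ in range(runs)}) == 1

T1, T2 = "Hello $name.", "Goodbye $name."
C1, C2 = {"name": "Ana"}, {"name": "Bob"}
```

Such checks can only falsify a candidate on the samples tried; they do not prove the property for every input.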

Behaviours 1 and 3 are generalizations of empirical facts. Behaviour 2 is an important "starting point" for users, with any kind of document. Item 4 is an intuitive requirement, and can also be deduced from the Bar-Yossef and Rajagopalan equivalence principle: they define templates as structural blocks that occur in a significant collection of documents, and express this fact formally using content equivalence.

The asInfo&alpha; function is suitable for a rigorous content evaluation in all of the first four properties, but a more precise characterization is needed to show each one.

Formal definitions
For a formal definition, we can map the system parts into mathematical parameters, and model the whole template system as a function with its parameters. In the "Content characterization" section we assumed that the inputs and outputs are XML objects, so each type of object has an XML Schema as a first approximation to its domain.


 * Domains:
 * D is the data model, formally a (universe) set of all possible inputs to L. The schema that validates C is sufficient to define this domain.
 * Q is the output data model, completely defined by its schema.
 * The template language is a special language, usually composed of fragments of output language (domain Q) and script language (domain S). The domain R contains all valid templates.


 * Inputs:
 * Template library, L = {T1, T2, … Ti, … TN}
 * There are N referenceable templates. The template T1 is also used as the "default template". The template logic of the templates needs to recognize the D data model of the input data content.
 * Content, C &sube; D
 * C is a set of incoming indexed values (c1, c2, etc.), from the content resource. C is read-only, and each ci can be a structured value.


 * Process, P(L, C) or P(T, C) or P(L, K, C)
 * It is a black-box model for the template processor. Everything is processed in one step.
 * Output, R &sube; S
 * It is the result. When the template system is a document generator, R is a document. R = P(L, C)

There are many inputs, C1, C2, ..., Ci, producing the corresponding results, R1, R2, ..., Ri. The data models, D and S, are also part of the "black-box channels" specification.

Symbols for special cases:
 * &empty;: the empty set.
 * &epsilon;: a document with no content (empty page).
 * T&epsilon;: a template that produces &epsilon;. L&epsilon;={T&epsilon;}.

Observation functions

In the black-box model identification tasks we need a "functional toolkit" for observing and testing the candidate systems.


 * asInfo&alpha;(X): converts object X (an XML document, template or datum) into a set of term-frequency pairs, representing "the information of X" (see section "Content characterization").


 * asTpl(R): converts a document into a template. It is only a "pack transform". For languages like ASP, PHP and JSP the asTpl function acts as the identity: an XHTML file only needs its extension changed (e.g. from <tt>.htm</tt> to <tt>.php</tt>) to be used as a template. On the other hand, an XHTML file does not become an XSLT template just by changing its extension; we must copy/paste the document into a "template container" (see Appendix).


 * asDoc(X): converts content X into a document (R).


 * blk(R): a specific copy/paste block from R; the function performs a "contiguous block extraction". The returned type is valid as input for the asTpl and asDoc functions.


 * isNct(X): tests whether X is a "no-content" (empty of content), that is, asInfo(X)=&empty;. For R, if isNct(R), then R=&epsilon;.


 * isNop(X): isNop(T) tests if T is a "no-operation and/or no-output" template (without logic or placeholders). isNop(L) is a test about all elements Tj of L.

For implementation, the asTpl, isNct and isNop functions are specific to each template language, but the asInfo and isSigned functions are generic tools (see Appendix).
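To make the toolkit more concrete, here is a minimal sketch of some of these functions for a toy case. Everything in it is an assumption made for illustration: terms are lowercase words taken from XML text nodes, weights are ignored, the template language is PHP-like (so asTpl is the identity), and placeholders are written `$name`.

```python
import re
import xml.etree.ElementTree as ET

def as_info(xml_text: str) -> set[str]:
    """Toy asInfo: the set of lowercase words found in the text nodes of X."""
    root = ET.fromstring(xml_text)
    return set(re.findall(r"[a-z]+", " ".join(root.itertext()).lower()))

def is_nct(xml_text: str) -> bool:
    """isNct: X is a "no-content" when asInfo(X) is the empty set."""
    return not as_info(xml_text)

def as_tpl(document: str) -> str:
    """asTpl for a PHP-like language: the document is already a valid template."""
    return document

def is_nop(template: str) -> bool:
    """isNop: no placeholder (here, no `$identifier`) occurs in the template."""
    return re.search(r"\$\w+", template) is None

structure_only = "<html><body><b></b><i></i></body></html>"
page = "<html><body><p>Hello world</p></body></html>"
```

A real toolkit would need one `as_tpl`, `is_nct` and `is_nop` per template language, as the text says, while `as_info` could be shared.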

Representations
 * Usual P(L, C): any template system can be modeled directly with this representation.


 * Single P(T, C): can be adopted when L={T}, a library with only one element.


 * Canonical P(L, K, C): for indirect modeling, by a system that represent a class of systems. See section "Canonical representation".

Data-reference expansion system
Data-references are the placeholders of the old (1980s) "mail merge systems", and their only template construct. Such a system performs the following steps,
 * 1) the content selected for reuse is expressed as a template, with placeholders (data-references) where content variation is necessary.
 * 2) the input data compatible with the placeholders is prepared (a set of sets of input values).
 * 3) each set of attribute values is processed with the template (R is produced): the attribute references are replaced (expanded) by their values.
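The three steps above can be sketched with Python's standard `string.Template`, used here only as a stand-in for a mail-merge engine; the letter text and the attribute names `name` and `order_id` are hypothetical.

```python
from string import Template

# Step 1: reusable content expressed as a template with data-references.
T = Template("Dear $name, your order $order_id has shipped.")

# Step 2: placeholder-compatible input data, a set of sets of values.
C_sets = [
    {"name": "Ana", "order_id": "17"},
    {"name": "Bob", "order_id": "42"},
]

# Step 3: each set of values is expanded into one result document R.
results = [T.substitute(C) for C in C_sets]
```

`substitute` raises `KeyError` when a value is missing, which is one possible answer to the "singular behaviours" question; `safe_substitute` would leave the placeholder in place instead.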

Another important characteristic of this class of systems is that it is easy, syntactically or functionally, to separate the K and C parts of the template, as in the concatenation model: K-literals and C-references are lexically evident. This simple system has useful properties, listed below.

Formal definition
Note-1: the equality and inequality operators ("=" and "&ne;") can be binary comparisons (something like UNIX <tt>diff</tt>) or set-theoretic ones, obtained by applying asInfo&alpha; or asInfo&alpha;+1 to both sides of the symbol.

Note-2: the equivalence mentioned earlier, that a template T is content shared by many documents, R1, R2, ..., RN, can be deduced from property 4,
 * asInfo&alpha;(T) = asInfo&alpha;(R1) &cap; asInfo&alpha;(R2) &cap; ... &cap; asInfo&alpha;(RN).
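The shared-content relation can be observed on a small sample. This sketch assumes a word-level `as_info`, a script filter that simply drops `$placeholder` tokens, and data values that neither repeat across all results nor coincide with words of the template; when those assumptions fail, the intersection can be larger than the template content.

```python
import re
from functools import reduce
from string import Template

def as_info(text: str) -> set[str]:
    """Toy asInfo: the set of lowercase alphanumeric words of the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

template_text = "Dear $name, your order $order has shipped."

# Script filter: placeholders belong to the script language, not to the content.
T_content = as_info(re.sub(r"\$\w+", " ", template_text))

data = [
    {"name": "Ana", "order": "17"},
    {"name": "Bob", "order": "42"},
    {"name": "Eve", "order": "99"},
]
results = [Template(template_text).substitute(C) for C in data]

# Intersection of asInfo(R1), asInfo(R2), ..., asInfo(RN).
shared = reduce(set.intersection, (as_info(R) for R in results))
```

On this sample `shared` equals `T_content`, which is how an observer could recover the template content from the outputs alone.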

Examples and discussion
Many other template systems can be used as this "data-reference expansion system", as long as other features (filters, conditional content, subtemplates, loops, or other sophistications) are not used. Example: a PHP template with only HTML content and placeholders supplied by single variables.

Property 2 illustrates the main starting point for reuse, a template without placeholders. When a simple fragment of the text, like a name, is replaced by a placeholder, it becomes a complete template, and all the operations of property 4 are available.

A distinct but important application arises when asInfo&alpha;(T)=&empty;, that is, a "template with only structure", which enforces a DTD (Document Type Definition) for content expressed only by parameter values. An HTML template with no content, with only tags (expressing table, bold, italic, header, etc.), satisfies the condition, giving asInfo&alpha;(P(T,C)) = asInfo&alpha;(C).

Templates for letter-form, frame-letter and query-report can be expressed with this simple template system, with limitations when, for instance, they need to express conditional structures or variable-size repetitive groups, like table rows or list items.

Template-reference expansion system
Macro languages and template-references were previously introduced in the section "Find/replace templates", and can be generalized to the expansion of any labeled block of text, as in TeX macros. This simple reuse mechanism can be described by the following steps,
 * 1) define the template: a text fragment to be reused and represented by a label; this label must be a unique key for the text in the next step.
 * 2) choose an input data content: a full text where the label (like an abbreviation) is used.
 * 3) expand the template-references: use the "macro engine" to automatically expand the labels (template-references) into fragments.

Example: with the popular CPP (C preprocessor) we can declare a library, like the one shown earlier in the Introduction, with labels to expand, and submit a text to this template library,

then all occurrences of MULT_2PI and FOOT (the template-references) will be expanded after running

producing

If run with <tt>fullText2.txt</tt>, the same expansion will take place. The files <tt>fullText1.txt</tt> and <tt>fullText2.txt</tt> are modeled as instances of input data contents, C, the file <tt>result1.txt</tt> as an output, R, and the UNIX shell line plays the role of the process P, for R=P(T,C).

The template-reference is like a placeholder in the input data content, to be expanded with the template content. There is some symmetry with the data-reference, discussed in sec. "Pipelined expansions and orthogonality".

A template-reference expansion system uses the simplest macro language, whose only template construct is the template-reference. This kind of system is not very useful in practice, but it has didactic value.
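A template-reference expansion can be imitated in a few lines, as a rough sketch rather than a description of how CPP works. The labels MULT_2PI and FOOT come from the example above, but their bodies and the input text are hypothetical, and this toy engine expands in a single pass without rescanning the expanded text.

```python
import re

# Template library L: label -> text fragment (bodies are hypothetical).
L = {
    "MULT_2PI": "* 2 * 3.14159",
    "FOOT": "[page footer]",
}

def P(library: dict, C: str) -> str:
    """Expand every whole-word occurrence of a label in C into its fragment."""
    if not library:
        return C
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, library)) + r")\b")
    return pattern.sub(lambda match: library[match.group(1)], C)

C1 = "length = r MULT_2PI\nFOOT"   # input data content with template-references
R1 = P(L, C1)                      # result, R = P(L, C)
```

Note the inversion with respect to mail merge: here the references live in the data content C and the reusable fragments live in the library.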

Examples and discussion
As shown in the Appendix, any "macro language system" that uses only definitions (no conditions, no parametrization, etc.) can be modeled as a template-reference expansion system.

If we remove the "no filter" simplification, that is, change the equals sign, "=", of property 4 to the subset sign, "&sube;", many other important systems can be modeled as this kind of template system. Filtering brings us to several singular behaviours,

Modeling the usual "Cascading Style Sheets (CSS) render systems" (see appendix) as template systems needs the restriction that the system uses a "library with only layout directives", L=Llayout, that preserves the input data content structure (C structure) and changes only the layout. Formally: asInfo&alpha;(Llayout)=&empty; and asInfo&alpha;(P(L,C)) = asInfo&alpha;(C), with P(Llayout,C)&ne;C. This is valid for &alpha; and &alpha;+1.

More general CSS involves the use of the "CSS display property" and the "CSS content property", which require a more complex model.

Pipelined expansions and orthogonality
A template system pipeline is a composition of P,
 * P(L2,P(L1,C)) or  P(P(L,C1),C2)

where P(L1,C)&sube;D2, that is, the result is in the domain of the input X of P(L2,X); or P(L,C1) produces a new template as output, in the language required by Z of P(Z,C2).

In a typical template system, the pipeline is not recommended: there is only one "final process" in Shannon's transmission model (see sec. "System context characterization"). Also, for users like designers, with less expertise in programming and recurrence analysis, it is difficult to predict the pipeline behaviour. An apparent exception arises when there is a hierarchy encapsulating the message, and the pipeline maps a division of responsibilities.

Let us first see the cases and conditions where a pipeline can be justified, and then look in depth at the properties of template systems that can be revealed by the pipeline analysis.


 * P(L2,P(L1,C)) cases: there are special configurations in many CMSs (Content Management Systems) where the first step of the pipeline is for "document assembly" and the second step is for "message packing", accommodating the transmission model. See MediaWiki, Plone and other CMSs, which use the first step to expand article subtemplates (in L1), like custom tags, and the second step with L2 as a "webpage skin" (a kind of letter-frame).


 * P(P(L,C1),C2) cases: the typical use is a degenerate version of the closure (see section "Functional view"), where the first step produces a "translated template".

Another use for pipelines is to reduce CPU usage in dynamic web page production. With the cache optimization (sec. "Architecture characterization") all context and semantic considerations about K (template content) and C (data content) are ignored, and they are considered interchangeable, provided that the C domain is finite and known. The cache condition exposes an important asymmetry: K always has a finite number of values, while C can have an infinite domain, and only C can be used as a selector for library options.

Let us look at a deeper asymmetry. The template language feature "subtemplate calling another subtemplate" (e.g. XSLT and CPP have this feature) can be very useful for designers. Example: a form-letter subtemplate T1 defining the letter, with a paragraph in the letter defined by another subtemplate T2. Because both subtemplates, T1 and T2, are elements of a template library L={T1, T2, ..., Tm}, the example can be expressed as a composition P(L,P(L,C))=R. It is equivalent to P(L2,C)=R, where L2 is a new version of L with T2 expanded into T1. The library L2 is obtained directly by parsing L, and can be cached (pre-processed) as L2=parse(L). The independence of C in the generation of L2 is also guaranteed by the fact that
 * asInfo&alpha;[P(L2,C)]&cap;asInfo&alpha;(C)&sube;asInfo&alpha;[P(L,C)]&cap;asInfo&alpha;(C), &forall;L,L2,

and information added in the pipeline is only asInfo&alpha;(L).

The "symmetric feature" of "subtemplate calling another subtemplate", for a data-reference expansion system, is the feature "placeholder in an attribute value". It needs a system with the composition P(P(L,C),C), but it cannot be cached with L, because both stages of the pipeline depend on C. Example: in a form-letter template of a mail-merge system (a kind of data-reference expansion system), we can use the person's name and other attributes as placeholders in a templated letter. If, using an orthogonality hypothesis for the mail-merge system, we change the template to a letter-frame template, with the whole letter as a data-reference, the placeholders in the letter fail to resolve the person's name, because that needs a composition.

The lesson is: if the reused content needs a reference to another reused content, reuse it as template content (K), not as data content (C).
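The failure described in the mail-merge example is easy to reproduce. In this sketch `P` is a stand-in processor built on `string.Template.safe_substitute`, which does not rescan substituted values; the frame, the letter text and the attribute names are hypothetical.

```python
from string import Template

def P(T: str, C: dict) -> str:
    """Stand-in data-reference expansion: one pass, values are not rescanned."""
    return Template(T).safe_substitute(C)

# Letter-frame template: the whole letter arrives as a data-reference.
frame = "<page>$letter</page>"
C = {"name": "Ana", "letter": "Dear $name, welcome."}

one_pass = P(frame, C)         # the placeholder inside the value is left as-is
two_pass = P(P(frame, C), C)   # the composition P(P(L,C),C) resolves it

# Following the lesson: keep the letter as template content (K) instead.
letter_as_template = "<page>Dear $name, welcome.</page>"
direct = P(letter_as_template, {"name": "Ana"})
```

Both stages of the two-pass version need C, so nothing can be pre-processed, whereas the `letter_as_template` form is a fixed template that works in a single pass.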

For general template systems, which accept data-references and subtemplate-references in the template language, this situation (the impossibility of using a "placeholder in an attribute value") shows a non-orthogonality of the instruction sets. In any general template system, templates can use subtemplate-references, but expanded attribute values cannot use data-references. As a consequence, even in the simplest concatenation models, K and C are not always interchangeable.

Any template system
Parr defines unrestricted template languages as extremely powerful (Turing-complete) languages and, in opposition to them, defines the restricted template languages (with less power, in order to preserve the separation of concerns principle).

Data-reference and template-reference expansion systems are subclasses of the restricted class, and show four broad groups of properties that characterize, by a black-box model, more template systems than those subclasses.

These four groups of properties can be generalized to any template system (TS), restricted or unrestricted. To deal with this generality in template system identification, some properties will be used as conditional and secondary, while others are fundamental and mandatory.

Query power of a template system
Whether or not a template processor (regulated by a template script language) can generate information (property 4) is a matter of choice, made by the system user (choice of system and/or language) or by the system developer (implementing or not implementing language features). There is a hierarchy of restriction levels of "no information loss/generation", and it is associated with the use of the template system as a query system. Property 4 has four "flavours", intuitively expressed as "query power". We illustrate them with ANSI SQL, with examples on a table named <tt>input</tt> with columns <tt>a,b,c</tt>:


 * 4.0. copy template (default flavour, no power): the input attributes are only copied, and all inputs are used.
 * Example: <tt>SELECT * FROM input</tt>
 * 4.1. templates with a filter power: not all inputs are used, reducing the input information. "Trackable" reduction of information is also considered as filter power.
 * Examples of filtering columns (projection) and filtering rows (selection): <tt>SELECT a,b FROM input;</tt> <tt>SELECT * FROM input WHERE a>1</tt>
 * Example of typed transform (see 4.2) that loses information: <tt>SELECT round(a), round(b), c FROM input</tt>
 * 4.2. typed templates with a transform power: these have typed attributes, and transform data types through valid reversible transforms. In the string (untyped) view, this adds information; in the "typed view" it does not, because it is part of the input data model, changing only the representation (e.g. cardinal and textual number representations). If there is some loss of information, as with the year(date) function, or a "trackable" change, like the pluralizeWord function (see Introduction), it can also characterize a filter or a summarizer/analyser template.
 * Typical example: <tt>SELECT a, addThousandDots(b), normalizeSpaces(c) FROM input</tt>
 * Example with global transform (seems to change information): <tt>SELECT * FROM input ORDER BY b,c</tt>
 * Example where the transform seems to lose and add information: <tt>SELECT a, concat(b,c) FROM input</tt>
 * 4.3. templates with a summarization power (or analysis power): these can join many attributes in a summarization, like a total or a mean. In the string view, this adds information; in the "knowledge view", it does not add new knowledge about the inputs. Changes like those made by pluralizePhrase functions are also "knowledge conservative". In software design patterns, summarization is particularly necessary when the template system also has the filter power and is summarizing filtered data: in an MVC context, the MVC-Model cannot predict the MVC-View filter and so cannot supply summarized data.
 * Examples of column and row summarizations with filtering: <tt>SELECT a+b, c FROM input WHERE a>1;</tt> <tt>SELECT sum(a), avg(a+b) FROM input WHERE a>1</tt>
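The four flavours can be tried on a small table. The sketch below runs queries of each kind through Python's `sqlite3`; the three sample rows are hypothetical, SQLite's `||` operator stands in for `concat`, and `avg` is used for the mean.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE input (a INTEGER, b INTEGER, c TEXT)")
con.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [(1, 10, "x"), (2, 20, "y"), (3, 30, "z")],   # hypothetical sample rows
)

def run(sql: str) -> list[tuple]:
    """Execute one query and return all result rows."""
    return con.execute(sql).fetchall()

copy_rows = run("SELECT * FROM input")                      # 4.0 copy
filter_rows = run("SELECT a, b FROM input WHERE a > 1")     # 4.1 filter
transform_rows = run("SELECT a, b || c FROM input")         # 4.2 transform
summary = run("SELECT sum(a), avg(a + b) FROM input WHERE a > 1")   # 4.3 summarization
```

Comparing the values present in each result with those in the table shows the hierarchy: the copy keeps every input value, the filter keeps a subset, the transform yields strings such as "10x" that are absent from the table but derivable from it, and the summary yields numbers that condense several rows.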

These "query powers" (and their combinations) support the classification of TSs.

Essential properties
...The essential properties that template systems must satisfy....

... A candidate system modeled as a black-box where the properties are (theoretical or empirical) satisfied, is a template system ...

Discussion
... revisit the examples from the introduction ...

... the discussion revolves around the application of the classifications, degree of adherence, advantages and limitations of the definitions, etc. ...

Appendix

 * Simplest examples
 * Terminology comparison