
=The well-behaved document - has Dublin Core inside=

By John W. Miescher, Bizgraphic - Geneva/Switzerland, miescher@bizgraphic.ch

Abstract
A well-behaved document is an electronic document which is both user friendly and library friendly.

The intent of this document is to show the importance and indeed the necessity for authors and publishers to produce only well-behaved electronic documents because they have a greater chance of being found (in search engines), consulted and referenced on a regular basis and they can save both time and money (during gestation).

User friendliness is best achieved by appropriately instructing authors and publishers as well as parties ordering authoring work. Their briefing should define topics, keywords, formatting and style usage, and insist on disciplined implementation.

Library friendliness means a document includes embedded cataloguing information in a known format that allows library records to be processed automatically with little or no manual intervention. The Dublin Core set of standard terms seems to be the best choice for embedding such information: it is widespread, PDF files automatically include some of its terms, and software is available for both reading and writing metadata in this format.

Professional libraries prefer to keep the metadata of their documents in separate repositories for reasons of integrity. Since one does not exclude the other, however, embedding (a selection of) the same metadata directly in a digital resource automatically makes this data available to third parties who download such resources to keep in their own local knowledge base and/or to consult offline.

Attention: The plea for the well-behaved document is not an evaluation or criticism of the Dublin Core or of any other metadata generation, description or harvesting system. It is a suggestion to embed simple metadata in electronic documents that is meaningful for the user of a document and that he/she can extract with suitable software for inclusion in his/her private collection.

Keywords
Well-behaved document; Dublin Core; embedded metadata; Knowledge Management; Digital Asset Management; PDF, EPUB, HTML; catalogue data; automatic library recording

=Plea for the well-behaved document=

A well-behaved document is an electronic document that is both user friendly and library friendly.

User friendly means a document (we are talking about any document with more than just a few pages) is easy to read and easy to navigate on any device, with reading software readily available. It is in an open format and does not depend on proprietary software (software that must be purchased) for display, styles and multimedia content. It must be searchable and have bookmarks (in applications that allow for them, such as PDF files in Acrobat or Adobe Reader), an interactive table of contents, i.e. one with ‘clickable’ links to the correct target page, and possibly an interactive index, cross references and links to external resources. Except for copyrighted material, it should not be password protected or encrypted; it must allow the user to print it, to copy and paste portions of the text and possibly to add bookmarks and comments of their own.

This applies not only to scientific papers and manuals but to all documents that are not necessarily read in a continuous stream from cover to cover the way novels and literary works are.

Library friendly means a document has useful embedded metadata which librarians can exploit to classify a document automatically with little or no manual intervention. It is searchable and can easily be indexed for full-text searching across a collection of documents.

Most professional (public, national, academic) libraries prefer to keep the metadata of their documents in separate repositories (aka catalogues) for reasons of integrity and to be able to output such metadata on demand in a variety of formats (Dublin Core http://dublincore.org/documents/dcmi-terms/, MARC/MARC 21 http://www.loc.gov/marc/, MODS http://www.loc.gov/standards/mods/, Dewey Decimal Classification http://www.oclc.org/dewey/ etc.). Since one does not exclude the other, however, embedding (a selection of) the same metadata directly in a digital resource automatically makes this data available to third parties who download such resources to keep in their own local knowledge base and/or to consult offline.

The Dublin Core set of standard terms in combination with appropriate software is probably the best option to ensure that metadata can be considered useful.

Both user friendliness and library friendliness are, in principle, easy to accomplish, but we found that the majority of documents available on the web, as well as those published internally by large organizations, are neither user friendly nor library friendly.

DCMI Today – Success and issues
DCMI has attracted a large following and is, by now, known to professional librarians around the world. They have software to manage and extract this metadata and to deploy it within their own systems, complete with their own search mechanisms. With 15 elements and 55 terms to describe resources, plus possible qualifiers and an unlimited number of namespaces for further refinement, this standard has become the reserve of only the most initiated librarians. This is why a large number of the almost 2 million documents we found on the web under “Dublin Core” stem from universities and full-time librarians. Yet we found very few documents on the internet with embedded metadata beyond the basics, and even fewer that included a meaningful set of DC terms that a librarian could directly exploit to complete their records.

The following assertion is apparently still a long way from being commonly accepted: “Catalogue data that travels with the document facilitates automatic library recording” (J.W. Miescher)

DC terms have not yet attracted the attention of authors
Books, manuals, papers and all other documents published today, whether in print or only electronically, are in most cases authored by non-librarians who have never heard of the semantic web and are neither familiar with nor interested in DC terms. Most authoring tools and page layout software offer no way of specifying these beyond the basics, i.e. title, author, subject and keywords, and often even the discipline to fill in these basics is lacking.

DC terms must be added manually
Librarians and content managers of organizations large enough to have a publication department and a library of their own are thus obliged to profile each document that passes through their hands and manually complete the information that describes the resource adequately.

This involves a lot of guesswork, which impairs the quality of the metadata and the reliability of library records and makes knowledge organization difficult.

DC terms would enhance the quality of library listings
These same content managers regularly publish overviews of their entire libraries or just updates of the new additions. The usefulness of such overviews depends directly on the quality and precision of the descriptions of the resources. The informative value of the associated DC terms could enhance this quality.

Content managers may also issue updates including actual documents for the members of their community or employees of their organization on their internal networks, on the web or on CD. In the case of electronic distribution such updates typically contain an interactive table of contents (TofC hereafter) that allows users to click on a title to bring up the underlying document. Ideally such TofCs would also list titles by topics and/or author and, in the case of international organizations, in multiple languages.

DC terms would improve user friendliness of (electronic) documents
A search mechanism that depends on embedded metadata is often also included in the displaying application (e.g. Adobe Reader®). In most cases the titles of the documents found are displayed in the search result as the only meaningful and unique pointer to the contents of a document; the title should therefore be the first and foremost DC term that is embedded. Surprisingly, over 65% of the PDF files we find on the web and in dedicated collections of international organizations have no meaningful title embedded as a meta tag, and an even higher share of electronic documents are not user friendly: they have no bookmarks and no interactive TofCs for easy navigation because the publication was originally intended for printing only.

Dublin Core has limitations
The developers of the DC standard obviously had certain scenarios in mind (such as classifying web content) and wanted to introduce more specific metadata beyond the basics (title, author, subject, keywords and date, which almost any word processor today can embed automatically). They did so by introducing an extended set of 55 additional terms and refinements or qualifiers, but in so doing made the matter so complicated for the average consumer of an electronic document that only well-versed librarians manage to map the properties of a document correctly to DC terms with the right refinements. Average users may find it difficult to understand why the 15 classical or legacy elements also appear as part of the 55 newer terms and how to deal with the ambiguities this presents (see also “Dublin Core’s dirty little secret”).

Most users are only interested in the actual words and don’t care about namespaces and other refinements the Dublin Core system offers. A simple notation in attribute/literal pairs is probably adequate for most private or local repositories.
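As an illustration of such attribute/literal pairs, the following minimal sketch holds a record as plain (name, value) tuples and renders it as HTML meta tags using the widely used `DC.` name prefix. The sample values and the function name `to_html_meta` are assumptions made for this example, not part of any standard tool.

```python
from html import escape

# Sample record (illustrative values): attribute/literal pairs that use
# Dublin Core element names as attributes. Repeated attributes are allowed.
record = [
    ("title", "The well-behaved document"),
    ("creator", "John W. Miescher"),
    ("subject", "embedded metadata"),
    ("subject", "Dublin Core"),
    ("language", "en"),
]


def to_html_meta(pairs):
    """Render attribute/literal pairs as HTML <meta> tags with a DC. prefix."""
    # The <link> line declares which vocabulary the DC. prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for name, value in pairs:
        lines.append(
            f'<meta name="DC.{escape(name, quote=True)}" '
            f'content="{escape(value, quote=True)}">'
        )
    return "\n".join(lines)


print(to_html_meta(record))
```

No namespaces or qualifiers beyond the single prefix are needed here, which is the level of detail most private or local repositories are likely to require.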

Another limitation is Dublin Core's restricted compatibility with the other systems in use (see below).

Dublin Core has competition
Dublin Core as a standard has many competitors: standards issued by major libraries, universities, school authorities and other interest groups (e.g. MARC/MARC 21, MODS, Dewey Decimal Classification and many more). None of these groups is willing to abandon its proprietary standard, which complicates the life of authors, publishers and librarians alike.

One- or two-way bridges (so-called cross-walks, such as MARC to DC) could be made available to translate between the other standards and Dublin Core so that librarians around the world can easily and automatically classify any document in their own system regardless of source or subject matter. However, this is an almost impossible task, since some formats are richer than others and have many meta tags for which there is no 1:1 equivalent in any of the other standards. On top of that, many of the implementers (e.g. Library of Congress, other national libraries, academic libraries etc.) apply different value types to tags, create their own tags and include a lot of information that has no meaning for the recipient of a document.

=Dublin Core Inside=

The term “Dublin Core inside” should be viewed as a mark of quality for all truly well-behaved electronic documents, which must be both user friendly and library friendly per the above definitions to be allowed to carry and publicly display this mark.

Both conditions of the well-behaved document are, in principle, easy to satisfy and the benefits are potentially quite important, particularly if implemented already at the authoring stage. This may imply some education and perhaps the use of some simple and affordable software tool that can be used by authors, desktop publishers and designers on the one hand and by (local-) librarians and content managers on the other.

Education is about raising awareness among all players in the publication process. It starts with the author or the party ordering the work and with the publisher, who can plan their work for user friendliness and library acceptability from the outset. The software tools should help manage library collections of electronic documents in PDF, EPUB and HTML format while respecting the DCMI interoperable online metadata standards. Optionally they should also handle physical objects (e.g. printed documents, artefacts, images, movies, music etc.), as a typical user probably has both types in their collection.

Education
Education should include training of all parties involved in the production of documentation to enhance understanding of the workflow and to recognize the areas where time and money savings can be achieved by avoiding duplication and unnecessary handling. This can take the form of on-site seminars, public lectures, subscription e-mail and written help files and links to relevant papers on Dublin Core sites.

The importance and indeed the necessity of including DC terms in all documents of consequence should be stressed as an integral part of such education. Parties commissioning authoring work should be encouraged to include certain DC elements (e.g. schemes and namespaces to respect) in their briefs from the start. The incentive for this lies in the savings that can be achieved in the subsequent steps; see the workflow chart (Fig. 1) below.

Some of the steps to take are really simple and involve practically no extra work for one person but can save a substantial amount of work for someone else further down the line. Good planning from concept to output is one such step. This includes the early assignment of relevant topics according to a corporate-wide list (namespace). Tweaking the authoring tool or page layout program to generate useful cataloguing data is another. So is the consistent use of styles to generate a more structured text, which allows (the electronic version of) the document to become interactive, complete with bookmarks and a live TofC that points to the correct target when clicked.
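To show why consistent styles pay off, here is a minimal sketch that derives a clickable table of contents from the headings of an HTML document. It assumes that headings are marked up as `h1` to `h3` and carry an `id` attribute; the class and function names are hypothetical and chosen for this example only.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every h1-h3 heading in an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [level, id, text parts] while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = [int(tag[1]), dict(attrs).get("id"), []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag in ("h1", "h2", "h3"):
            level, anchor, parts = self._current
            self.headings.append((level, anchor, "".join(parts).strip()))
            self._current = None


def build_toc(html_text):
    """Return indented TofC lines; headings without an id are skipped
    because there is no target to link to."""
    collector = HeadingCollector()
    collector.feed(html_text)
    return [
        f'{"  " * (level - 1)}<a href="#{escape(anchor, quote=True)}">{escape(text)}</a>'
        for level, anchor, text in collector.headings
        if anchor
    ]
```

A document whose headings were formatted by hand (bold text, larger font) offers nothing for such a routine to find, which is why structuring has to start at the authoring stage.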

This last point is particularly important in a world where more and more documents never see print at all but are intended for on-screen consumption and must therefore be searchable and easily navigable. All parties involved in the origination of a document must bear this in mind and be instructed accordingly.

Organizations must establish firm rules and insist on their application when ordering authoring work. Authors must be taught to structure their work from the very beginning and how to include meta tags and navigation elements.

FIG. 1. Genesis of a typical corporate document

The biggest potential savings come from steps that avoid a second handling:

•	Assignment of topics (namespaces) and other dc.elements even in the first draft

•	Considering aspects of multi-purposing from the very beginning

•	Consistent use of styles to automatically generate navigation items

Coaching
As any content or production professional knows, developing a workflow that actually works can be a major challenge. Keeping track of important files and assets at each stage is critical. Effective file management is an important and necessary part of the creative process.

Coaching, seminars and workshops are probably best suited for imparting the required knowledge. Several software tools are available on the market for document or content management, for maintaining collections of electronic documents and for embedding metadata in electronic files.

The tools
Software tools to be used in this context should ideally cover two types of tasks: one for library and content managers, and one for people such as authors and publishers who need to read, modify and embed Dublin Core terms in documents.

The library management part is essentially a tool to build databases of physical and electronic library objects and to output these as searchable lists and interactive tables of contents. Individual records contain some logistics information plus fields representing all Dublin Core elements and terms.
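A record of this kind could be modelled roughly as follows. This is a sketch under stated assumptions: the logistics fields (`location`, `is_physical`) are hypothetical examples, and only the 15 elements of the Dublin Core Metadata Element Set are included; the newer terms could be added in the same way.

```python
from dataclasses import dataclass, field

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


@dataclass
class LibraryRecord:
    # Logistics fields are hypothetical examples, not a prescribed schema.
    location: str              # file path, URL or shelf mark
    is_physical: bool = False
    # Every DC element may repeat, so values are kept as lists.
    dc: dict = field(default_factory=lambda: {name: [] for name in DC_ELEMENTS})

    def add(self, element, value):
        """Append a value to one of the 15 elements; reject unknown names."""
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        self.dc[element].append(value)

    def missing(self):
        """Elements that are still empty and may need manual completion."""
        return [name for name in DC_ELEMENTS if not self.dc[name]]
```

The `missing()` helper reflects the workflow described in this paper: whatever the document does not carry itself has to be completed by hand.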

Physical (e.g. printed), electronic and web-based documents should live side by side in one list. Physical objects may have to be entered manually whereas electronic documents can be parsed (extracted) automatically regardless of location, i.e. whether stored locally, on a network or on the web.
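For EPUB files, automatic extraction can be sketched with the Python standard library alone, because an EPUB is a ZIP archive whose package file carries Dublin Core elements in its metadata section. The function name `read_epub_dc` is an assumption for this example; the sketch reads a local file or file-like object and does not cover DRM-protected or malformed files.

```python
import zipfile
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def read_epub_dc(source):
    """Return {element: [values]} for the dc:* entries of an EPUB package file.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as epub:
        # META-INF/container.xml names the package (OPF) file.
        container = ET.fromstring(epub.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        package = ET.fromstring(epub.read(rootfile.get("full-path")))

    found = {}
    for node in package.iter():
        # Keep only elements in the Dublin Core namespace.
        if isinstance(node.tag, str) and node.tag.startswith(f"{{{DC_NS}}}"):
            name = node.tag.split("}", 1)[1]
            text = (node.text or "").strip()
            if text:
                found.setdefault(name, []).append(text)
    return found
```

A document fetched from a network or the web could be passed in as a byte stream in the same way once it has been downloaded.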

The document handling part takes care on the one hand of the metadata extraction and manipulation, making files library friendly. On the other hand it allows for adding bookmarks and interactive tables of contents to existing (PDF-) files, making files user friendly.

See Table 1 for a listing of desirable features of a suitable software tool and Table 2 for a list of formats that could or should be supported.

''Table 1. Desirable features of software for managing collections of well-behaved documents''

Library management
The data can be arranged into collections or lists which can be formatted to tables of contents sorted alphabetically, by topic or by any of the included DC items. Titles and Topics can also be associated via external scripts which facilitates making multiple TofCs from the same collection, e.g. in multiple languages or sorted differently.
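Arranging one collection into differently sorted tables of contents amounts to grouping the records by one of their DC items. The sketch below assumes records are plain dictionaries keyed by Dublin Core element names; the sample records and the function name `toc_by` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records; each is a dict of Dublin Core elements.
collection = [
    {"title": "Metadata basics", "creator": "A. Author", "subject": ["Dublin Core"]},
    {"title": "Bookmarks in PDF", "creator": "B. Writer", "subject": ["PDF", "Navigation"]},
    {"title": "Embedding DC terms", "creator": "A. Author", "subject": ["Dublin Core", "PDF"]},
]


def toc_by(records, element):
    """Group titles under each value of `element`; groups and titles are
    both sorted alphabetically, ignoring case."""
    groups = defaultdict(list)
    for rec in records:
        values = rec.get(element, [])
        if isinstance(values, str):  # allow single values as well as lists
            values = [values]
        for value in values:
            groups[value].append(rec["title"])
    return {
        key: sorted(titles, key=str.casefold)
        for key, titles in sorted(groups.items(), key=lambda kv: kv[0].casefold())
    }


by_topic = toc_by(collection, "subject")
by_author = toc_by(collection, "creator")
```

A title with several topics appears under each of them, so the same collection yields a topic listing and an author listing without any re-entry of data.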

These TofCs can be printed and exported to a tab-delimited text file, to Excel®, or to InDesign®. In a second step they can be converted to PDF or HTML and thus become interactive tables of contents to be published electronically, on CD or on the web.
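The two export paths can be sketched as follows: the same rows are written once as tab-delimited text, which spreadsheet and layout programs can import, and once as an HTML list of links. The column choice (title, creator, location) and the function names are assumptions for this example.

```python
import csv
import io
from html import escape


def toc_to_tsv(rows):
    """Serialize (title, creator, location) rows as tab-delimited text
    with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["title", "creator", "location"])
    writer.writerows(rows)
    return buffer.getvalue()


def toc_to_html(rows):
    """Render the same rows as a list of clickable links (an interactive TofC)."""
    items = [
        f'  <li><a href="{escape(location, quote=True)}">{escape(title)}</a>'
        f" ({escape(creator)})</li>"
        for title, creator, location in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Keeping the row data separate from its rendering is what makes it cheap to publish one collection in several forms.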

Embedding metadata
Library-relevant information becomes an integral part of the electronic document itself by embedding standard meta-tags beyond the basics (title, author, subject, keywords). Users can easily view and modify the embedded metadata and, in the case of electronic documents in PDF, EPUB or HTML format, these modifications can be re-embedded into the files themselves (except if web-based).
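For HTML files, viewing the embedded metadata can be sketched with the standard library parser. The example assumes the common convention of meta tags named `DC.xxx` or `DCTERMS.xxx` and falls back to the HTML title when no DC title is present; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.terms = {}
        self.html_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            # Accept the "DC." and "DCTERMS." prefixes, case-insensitively.
            if content and name.startswith(("dc.", "dcterms.")):
                self.terms.setdefault(name.split(".", 1)[1], []).append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.html_title += data


def read_html_dc(html_text):
    """Return {term: [values]} for the Dublin Core meta tags in an HTML page."""
    parser = DCMetaParser()
    parser.feed(html_text)
    # Fall back to the HTML <title> when no DC title is embedded.
    if "title" not in parser.terms and parser.html_title.strip():
        parser.terms["title"] = [parser.html_title.strip()]
    return parser.terms
```

The resulting dictionary can be edited and then written back as a fresh set of meta tags, which is the re-embedding step for local HTML files.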

Physical objects have to be entered manually or imported into the input mask (Fig. 2) via a suitable bridge tool. Electronic files are parsed to reveal embedded metadata, which can then be edited in the input mask with spaces for all 15 classical DC elements and the 55 newer DC terms. Schemes and other refinements can be added where appropriate. This helps ensure conformity and uniformity.

Supported document standards
Supported standards for electronic documents are at least PDF, EPUB and HTML (could eventually be extended to XHTML, XML and others).

•	All three supported formats are very popular and open

•	There is free reader software available for PCs, eBook readers and mobile devices

''Table 2. Supported file formats''

Formats not supported
Most other document formats are (for the time being) not supported because they are either not popular enough or require proprietary software for reading (except plain ASCII, ANSI and Unicode text files). Some document formats include hidden metadata which the author may not even be aware of and may or may not want to have published. Typical examples are MS Word documents and the output of the other MS Office packages, Excel and PowerPoint. But these can easily be converted to PDF, which resolves both the hidden metadata and the public accessibility problems.

=Conclusion=

The term “Dublin Core Inside” should become the mark of any well-behaved document. Only documents that are truly user friendly and fully compliant with DCMI standards should be allowed to carry and publicly display this mark. Perhaps this term could even be registered as a trade mark by the Dublin Core Organization. With appropriate training, instructions and incentives, authors and publishers must be shown how easy it is to produce well-behaved documents. They should be encouraged to structure their work and to include useful metadata in DCMI standard notation from the very beginning. This will ensure that electronic documents intended for a large audience are well-behaved, i.e. both user friendly and library friendly.