Thesaurus File Formats

Appendix: Thesaurus File Formats

whirlDOC supports two thesaurus file formats, an XML format and a text format. The XML format is whirlDOC’s native format. The text format is supported to exchange thesaurus data with other applications.

Basic Concepts

A thesaurus is a collection of phrase sets. A phrase set is a set of phrases that are interchangeable synonyms of each other. For example, a phrase set might contain “red”, “rouge”, and “ruddy”. These phrases could be turned into spintax and used as an adjective for something with a reddish color. The spintax would be: “{red|rouge|ruddy}”. When a document is spun, the spintax is resolved by choosing one of the three adjectives at random.

whirlDOC supports phrase sets containing an empty phrase. When spintax is created from a phrase set with an empty phrase, the spintax resolves to nothing if the empty phrase is the one chosen. If the previous example had an empty phrase, the spintax would be: “{red|rouge|ruddy|}”.
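
The relationship between a phrase set and its spintax can be sketched in a few lines of Python. This is only an illustration, not whirlDOC's own code: the function names to_spintax and resolve_spintax are hypothetical, and the resolver assumes a single, non-nested group.

```python
import random

def to_spintax(phrases):
    """Join the phrases of one phrase set into a spintax group."""
    return "{" + "|".join(phrases) + "}"

def resolve_spintax(spintax, rng=random):
    """Resolve one non-nested spintax group by picking an option at random."""
    if not (spintax.startswith("{") and spintax.endswith("}")):
        raise ValueError("expected a single {a|b|c} group")
    options = spintax[1:-1].split("|")
    # An empty option (from an empty phrase) resolves to the empty string.
    return rng.choice(options)

print(to_spintax(["red", "rouge", "ruddy"]))      # {red|rouge|ruddy}
print(to_spintax(["red", "rouge", "ruddy", ""]))  # {red|rouge|ruddy|}
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which can be useful when testing.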

Internally, whirlDOC uses identifiers (IDs) to refer to phrase sets. The XML format includes these IDs. They are optional in the enhanced text format. When a thesaurus file is imported into whirlDOC, new IDs are assigned, and these are unique among all the thesauruses used by the program. The only requirement for the IDs in a thesaurus file is therefore that each is unique within that file.

XML Format

The XML file format is a lean format designed so that third-party developers can easily write code to read and write it. By convention, an XML thesaurus file uses the extension “.xthe”. A small thesaurus XML file with one phrase set is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<thesaurus idprefix="u" idcounter="15" version="1.0">
  <phraseset id="u001">
    <phrase>dirty</phrase>
    <phrase>dusty</phrase>
    <phrase/>
  </phraseset>
</thesaurus>

This example file’s one phrase set has an identifier of “u001”. It consists of two phrases, “dirty” and “dusty”, plus the empty phrase.

The format’s root element is “thesaurus”. It has three required attributes: “idprefix”, “idcounter”, and “version”. These are described below.

  • idprefix: A string used to create identifiers. By convention the "user" thesaurus uses “u”, the "master" thesaurus uses “m”, and the thesaurus internal to whirlDOC documents uses “d”.
  • idcounter: A counter used to create unique identifiers. The count is combined with the ID prefix to form a complete identifier. Each new ID is checked for uniqueness, and if it is not unique another is created. This advances the ID counter, so a counter that would produce non-unique IDs is self-correcting.
  • version: A version number for the thesaurus XML file format. The current version is “1.0”.
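
The self-correcting behavior of the ID counter can be sketched as follows. This is a guess at the mechanism, not whirlDOC's implementation: the function name next_id is hypothetical, and both the three-digit zero padding (inferred from “u001”) and incrementing the counter before use are assumptions.

```python
def next_id(prefix, counter, existing_ids):
    """Return (new_id, new_counter), skipping candidates that already exist."""
    while True:
        counter += 1                              # assumption: increment, then use
        candidate = f"{prefix}{counter:03d}"      # assumption: 3-digit zero padding
        if candidate not in existing_ids:
            return candidate, counter
        # Not unique: loop again, which advances the counter past the clash.

# A counter that lags behind the IDs already in use corrects itself.
print(next_id("u", 0, {"u001", "u002"}))  # ('u003', 3)
```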

Phrase sets use the tag “phraseset” with the set’s identifier put in an attribute named “id”. Each phrase is put between “phrase” tags. The optional empty phrase uses an empty “phrase” tag.
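
As an illustration of how little code a reader needs, here is a minimal sketch that loads the structure described above using Python's standard xml.etree.ElementTree module. The function name read_thesaurus and the error handling are assumptions made for this example and are not part of whirlDOC.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<thesaurus idprefix="u" idcounter="15" version="1.0">
  <phraseset id="u001">
    <phrase>dirty</phrase>
    <phrase>dusty</phrase>
    <phrase/>
  </phraseset>
</thesaurus>
"""

def read_thesaurus(xml_bytes):
    """Return (root attributes, {phrase set id: list of phrases})."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "thesaurus":
        raise ValueError("root element must be 'thesaurus'")
    for name in ("idprefix", "idcounter", "version"):
        if name not in root.attrib:
            raise ValueError(f"missing required attribute: {name}")
    phrase_sets = {}
    for ps in root.findall("phraseset"):
        set_id = ps.get("id")
        if set_id in phrase_sets:
            raise ValueError(f"duplicate phrase set id: {set_id}")
        # An empty <phrase/> element has no text; store it as "".
        phrase_sets[set_id] = [(p.text or "") for p in ps.findall("phrase")]
    return dict(root.attrib), phrase_sets

attrs, sets = read_thesaurus(SAMPLE)
print(sets)  # {'u001': ['dirty', 'dusty', '']}
```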

Text Format

The basic thesaurus text file format is extremely simple. Each phrase set uses a single line. Phrases are separated by a “|” character, which is sometimes called a pipe. Comments are lines that have a hash character, “#”, as their first non-whitespace character. Comment lines are ignored. Blank lines are also ignored. Note that other applications may not support blank lines or comment lines.

An example follows.

# A comment

red|rouge|ruddy
dirty|dusty|

The example contains one comment, a blank line, and two phrase sets. The second phrase set has an empty phrase, as indicated by its trailing pipe character.
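
The rules above translate almost directly into code. The following sketch parses the basic format; the function name parse_basic is hypothetical, and stripping whitespace from the ends of each line is an assumption.

```python
def parse_basic(text):
    """Parse basic-format thesaurus text into a list of phrase sets."""
    phrase_sets = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' as first non-whitespace char).
        if not stripped or stripped.startswith("#"):
            continue
        # A trailing pipe yields a final "" entry, i.e. the empty phrase.
        phrase_sets.append(stripped.split("|"))
    return phrase_sets

SAMPLE = "# A comment\n\nred|rouge|ruddy\ndirty|dusty|\n"
print(parse_basic(SAMPLE))
# [['red', 'rouge', 'ruddy'], ['dirty', 'dusty', '']]
```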

Enhanced Text Format

The basic text format is the format most likely to be used or supported by other applications. whirlDOC also accepts an enhanced format that includes identifiers and supports more than one phrase set on a single line. It supports comments and blank lines just like the basic text format.

Identifiers are put before an equals sign that is followed by the phrase set definition. The previous example’s phrase sets with identifiers could be as follows.

u001=red|rouge|ruddy
u002=dirty|dusty|

An ID will only be set if a line defines a single phrase set. It will be ignored if more than one phrase set is defined.
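
One way to separate an optional ID from the phrase set definition is sketched below. Several assumptions are made here that the format description does not confirm: the function name split_id is hypothetical, an ID is taken to be a run of characters before the first equals sign that contains no whitespace, pipes, or parentheses, and any parenthetical content in the definition is taken to mean the line defines more than one phrase set.

```python
import re

# Assumed ID shape: no whitespace, '=', '|', or parentheses.
_ID_LINE = re.compile(r"^\s*([^\s=|()]+)\s*=(.*)$")

def split_id(line):
    """Return (id or None, definition) for one enhanced-format line."""
    match = _ID_LINE.match(line)
    if match is None:
        return None, line.strip()
    set_id, definition = match.group(1), match.group(2).strip()
    # Parentheses mean several phrase sets are defined, so the ID is ignored.
    if "(" in definition:
        return None, definition
    return set_id, definition

print(split_id("u001=red|rouge|ruddy"))     # ('u001', 'red|rouge|ruddy')
print(split_id("u003=dog(s) | canine(s)"))  # (None, 'dog(s) | canine(s)')
```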

More than one phrase set can be put on a line so it is easy to define phrase sets with different variations of the same phrases. For example, two phrase sets, one with singular and one with plural forms of the same phrases, can be combined into one definition. Aside from singulars and plurals, uses include forming adverbs or verb forms. Here are a few examples.

dog(s) | canine(s)
courteous(ly) | polite(ly)
bit(e:es::ing) | chew(:s:ed:ing)

One obvious enhancement is that whitespace can be placed around the pipe characters to increase readability. The first two examples each produce two phrase sets. The last creates four. Here are the equivalent phrase sets in basic format.

dog|canine
dogs|canines
courteous|polite
courteously|politely
bite|chew
bites|chews
bit|chewed
biting|chewing

The enhanced format uses parentheses to enclose content that will be split between different phrase sets. Content outside of parentheses is used for all phrase sets. Within parentheses, colons separate the pieces of content that will be used for the different phrase sets. Here is an example that produces four phrase sets, followed by the four phrase sets created.

creat(e:es:ed:ing) | ma(ke:kes:de:king)

create | make
creates | makes
created | made
creating | making

It is allowable to have no content between colons, in which case nothing is added. Here is an example of this.

create(:s) | make(:s)

create | make
creates | makes

To make plural forms easy to represent, the enhanced format supports a special case. If the parentheses contain only one piece of content, then the non-parenthetical content alone is used to create one phrase, and another phrase is created by adding the parenthetical content. The above example can be written as:

create(s) | make(s)

The parenthetical content does not have to be placed at the end of a phrase. It can be placed anywhere within the phrase. There can also be more than one set of parenthetical content. Here is a contrived example phrase set.

red(s) and yellow(s) | blue(s) and green(s)

Finally, a dash character can be used to prevent a phrase from being generated. Here are two examples of this followed by the phrase sets created.

clums(y:ily) | ungraceful(ly) | ungain(ly:-)
glow(:s:ed:ing) | radiat(e:es:ed:ing) | radiant(-:-:-:)

clumsy | ungraceful | ungainly
clumsily | ungracefully
glow | radiate
glows | radiates
glowed | radiated
glowing | radiating | radiant
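
The expansion rules described in this section can be sketched as a small Python function. This is one reading of the rules, not whirlDOC's parser: the names expand_line and _pieces are hypothetical, all parenthetical groups on a line are assumed to supply the same number of pieces, and multiple groups within one phrase are assumed to advance together.

```python
import re

_GROUP = re.compile(r"\(([^()]*)\)")

def _pieces(content):
    """Split parenthetical content on colons; '(s)' is shorthand for '(:s)'."""
    parts = content.split(":")
    return ["", content] if len(parts) == 1 else parts

def expand_line(line):
    """Expand one enhanced-format definition into basic-format phrase sets."""
    templates = [t.strip() for t in line.split("|")]
    counts = {len(_pieces(m.group(1)))
              for t in templates for m in _GROUP.finditer(t)}
    if len(counts) > 1:
        raise ValueError("parenthetical groups disagree on piece count")
    n = counts.pop() if counts else 1  # no parentheses: a single phrase set
    phrase_sets = []
    for i in range(n):
        phrases = []
        for t in templates:
            chosen = [_pieces(m.group(1))[i] for m in _GROUP.finditer(t)]
            if "-" in chosen:
                continue  # a dash suppresses this phrase in this set
            phrases.append(_GROUP.sub(lambda m: _pieces(m.group(1))[i], t))
        phrase_sets.append(phrases)
    return phrase_sets

for phrase_set in expand_line("clums(y:ily) | ungraceful(ly) | ungain(ly:-)"):
    print("|".join(phrase_set))
# clumsy|ungraceful|ungainly
# clumsily|ungracefully
```

Under these assumptions, a line without parentheses passes through unchanged as a single phrase set, and a trailing pipe still yields the empty phrase in every set produced.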

Phrase sets using these enhanced forms can always be expanded to use the basic form, but combining phrase sets can make a file much smaller. In fact, the basic format is just a subset of the enhanced format.