Appendix: Document File Format

whirlDOC stores its documents in an XML file format, and the files usually have the extension “.xwd”. The format is simple, so writing tools that parse it should be straightforward. A short document file is shown below.

<?xml version="1.0"?>
<whirldoc version="1.0">
  <properties>
    <property name="foobar">qwerty</property>
  </properties>
  <thesaurus idprefix="d" idcounter="100">
    <phraseset id="d001">
      <phrase>rain</phrase>
      <phrase>snow</phrase>
    </phraseset>
  </thesaurus>
  <root>
    <pgroup>
      <p>
        <lgroup>
          <line>
            <literal>The</literal>
            <spintax>d001</spintax>
            <literal>in Spain falls mainly on the plain.</literal>
          </line>
        </lgroup>
      </p>
    </pgroup>
  </root>
</whirldoc>

The document’s root element is “whirldoc”. It has a required attribute named “version”. The current version is “1.0”.

The “properties” section is optional and contains name-value pairs. Applications that use whirlDOC documents can use these pairs to store application-specific settings. whirlDOC itself currently uses properties to store document settings, such as the HTML compatibility options applied when spinning variations and the keywords used to calculate keyword density. The names of the standard properties are listed in the Standard Document Properties section later in this appendix.

The optional properties are followed by the document’s thesaurus, which stores the phrase sets used by the document’s spintax. This thesaurus section is identical to the XML thesaurus format described in the thesaurus file formats appendix, except that the version attribute is not included. Refer to that appendix for a full description of the format.

The document’s content comes next. It mirrors the document structure shown when editing a whirlDOC document: the root element contains paragraph groups, which contain paragraphs, which contain line groups, which contain lines. Depending on their type, these elements can include attributes for the frequency, select mode, and combine mode. This simple example does not set any element attributes, so the defaults are used.

The example’s one line is made up of a spintax element between two pieces of literal content. whirlDOC automatically puts spaces between the literals and spintax when editing documents and spinning document variations. The spintax uses the one phrase set in the document’s thesaurus, which is referred to by its ID, “d001”.
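As a rough illustration of how a tool might read this structure, the sketch below parses a trimmed copy of the example with Python's standard xml.etree.ElementTree module and lists every variation of its one line. The helper names (load_phrase_sets, line_variations) are hypothetical and not part of whirlDOC. The sketch assumes that a spintax element is replaced by exactly one phrase from the phrase set it references, and it joins the parts with single spaces as described above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<whirldoc version="1.0">
  <thesaurus idprefix="d" idcounter="100">
    <phraseset id="d001">
      <phrase>rain</phrase>
      <phrase>snow</phrase>
    </phraseset>
  </thesaurus>
  <root>
    <pgroup><p><lgroup>
      <line>
        <literal>The</literal>
        <spintax>d001</spintax>
        <literal>in Spain falls mainly on the plain.</literal>
      </line>
    </lgroup></p></pgroup>
  </root>
</whirldoc>"""


def load_phrase_sets(doc):
    """Map each phrase set ID in the thesaurus to its list of phrases."""
    return {
        ps.get("id"): [p.text or "" for p in ps.findall("phrase")]
        for ps in doc.findall("./thesaurus/phraseset")
    }


def line_variations(line, phrase_sets):
    """Return every variation of one <line>, joining its parts with spaces."""
    results = [[]]
    for part in line:
        if part.tag == "literal":
            options = [part.text or ""]
        elif part.tag == "spintax":
            # Assumption: the element text is the ID of a phrase set.
            options = phrase_sets[(part.text or "").strip()]
        else:
            continue  # unknown children are skipped in this sketch
        results = [r + [o] for r in results for o in options]
    return [" ".join(r) for r in results]


doc = ET.fromstring(SAMPLE)
phrase_sets = load_phrase_sets(doc)
for text in line_variations(next(doc.iter("line")), phrase_sets):
    print(text)
```

With the two phrases in “d001”, the loop prints one sentence for “rain” and one for “snow”.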

Element Attributes

There are three element attributes: select mode, combine mode, and frequency. Paragraph groups, paragraphs, and line groups can have all three attributes. Lines have only a frequency, and the root element has a select mode and a combine mode but no frequency. Each of the attributes is explained more fully earlier in this manual, but the sections below give a brief summary.

Select Mode

The select mode determines how an element selects its children when spinning a document variation from a whirlDOC. For example, a line group can select one of its lines for inclusion in the generated output. Because lines always include all of their content, they do not have a select mode. All elements from root to line group do.

The select mode is set on an element with the XML attribute “select”. The valid values for the attribute are listed below.

  • random-choice: One child is selected at random. The children’s frequencies are used to determine the probability each child has of being selected. How frequencies are used is described later.
  • random-order: All children are selected but are put in random order. The children’s frequencies are ignored.
  • sequence: All children are selected in the same order they appear in the document.
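The three select modes can be sketched in a few lines of Python. This is only an illustration of the behavior described in the list, not whirlDOC's implementation. The function name select_children and its parameters are hypothetical, and treating a missing frequency list as equal weights is an assumption.

```python
import random


def select_children(children, mode, frequencies=None, rng=random):
    """Pick children according to a select mode (hypothetical helper)."""
    children = list(children)
    if mode == "sequence":
        # All children, in document order.
        return children
    if mode == "random-order":
        # All children, shuffled; frequencies are ignored.
        shuffled = children[:]
        rng.shuffle(shuffled)
        return shuffled
    if mode == "random-choice":
        if not children:
            return []
        # Assumption: children without explicit frequencies weigh the same.
        weights = frequencies if frequencies is not None else [1.0] * len(children)
        return rng.choices(children, weights=weights, k=1)
    raise ValueError(f"unknown select mode: {mode!r}")
```

Passing a seeded random.Random instance as rng makes the random modes repeatable, which can help when testing a tool built on this idea.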

Combine Mode

The combine mode determines how selected children are combined when generating a document variation. For example, a paragraph usually combines the lines selected by its line groups by concatenating the lines with a space between. As with the select mode, lines do not have a combine mode. All elements from root to line group do.

The combine mode is set on an element with the XML attribute “combine”. The valid values are listed below.

  • concat: All selected children are appended together with a space between.
  • lines: Selected children are kept as separate lines.
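A minimal sketch of the two combine modes follows. The function name combine_children is hypothetical, and representing “separate lines” with a newline character between children is an assumption made for illustration.

```python
def combine_children(parts, mode):
    """Join already-generated child texts according to a combine mode."""
    if mode == "concat":
        # Children are appended together with a space between.
        return " ".join(parts)
    if mode == "lines":
        # Assumption: separate lines are delimited by a newline.
        return "\n".join(parts)
    raise ValueError(f"unknown combine mode: {mode!r}")
```

For example, combining “The rain in Spain...” and “The ants in France...” with “concat” yields one line of text, while “lines” keeps them on two lines.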

Frequency

The frequency attribute is a number that determines the probability of an element being selected when its parent has a select mode of “random-choice”. The frequencies of all children are summed and the chance of selection is a child’s frequency divided by the sum. For example, if an element has three children with frequencies of 1.0, 2.0, and 3.0 then the chance of selecting the first child would be one in six (1.0 / 6.0).

The XML attribute for frequency is “frequency”. It is a floating point number. The root element does not have a frequency because it is always selected. All elements from paragraph group to line do have a frequency.
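The arithmetic described above can be checked with a short Python sketch. The function name selection_probabilities is hypothetical, and rejecting a non-positive sum is an assumption, since the text does not say how such a case is handled.

```python
def selection_probabilities(frequencies):
    """Convert child frequencies into selection probabilities."""
    total = sum(frequencies)
    if total <= 0:
        # Assumption: the source does not define this case.
        raise ValueError("frequencies must sum to a positive number")
    return [f / total for f in frequencies]


probabilities = selection_probabilities([1.0, 2.0, 3.0])
```

For the frequencies 1.0, 2.0, and 3.0, the three children are selected with probabilities of one in six, one in three, and one in two.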

An Example with Attributes

An example that includes element attributes is shown below. Each element has its attributes set to the element defaults, which are described in the next section.

<?xml version="1.0"?>
<whirldoc version="1.0">
  <thesaurus idprefix="d" idcounter="100">
    <phraseset id="d001">
      <phrase>rain</phrase>
      <phrase>snow</phrase>
    </phraseset>
  </thesaurus>
  <root select="sequence" combine="lines">
    <pgroup select="random-choice" combine="lines" frequency="1.0">
      <p select="sequence" combine="concat" frequency="1.0">
        <lgroup select="random-choice" combine="lines" frequency="1.0">
          <line frequency="1.0">
            <literal>The rain in Spain...</literal>
          </line>
          <line frequency="1.0">
            <literal>The ants in France...</literal>
          </line>
        </lgroup>
      </p>
    </pgroup>
  </root>
</whirldoc>

Attribute Defaults

Any attribute that is not set will use the default value for the element type. The defaults for each type reflect typical usage of the element. For example, line groups have a default select mode of “random-choice” because a paragraph is typically constructed by choosing one line from each of its line groups. The table below shows the default attribute values for each element type. A dash marks an attribute that the element type does not have.

                 Select         Combine  Frequency
Root             sequence       lines    -
Paragraph Group  random-choice  lines    1.0
Paragraph        sequence       concat   1.0
Line Group       random-choice  lines    1.0
Line             -              -        1.0
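A parsing tool could apply these defaults with a lookup keyed by element name, as in the sketch below. The DEFAULTS dictionary mirrors the table, while the function name effective_attributes is hypothetical. Ignoring attributes that an element type does not have is an assumption about how a reader might treat them.

```python
import xml.etree.ElementTree as ET

# Default attribute values per element type, keyed by XML element name.
DEFAULTS = {
    "root":   {"select": "sequence",      "combine": "lines"},
    "pgroup": {"select": "random-choice", "combine": "lines",  "frequency": "1.0"},
    "p":      {"select": "sequence",      "combine": "concat", "frequency": "1.0"},
    "lgroup": {"select": "random-choice", "combine": "lines",  "frequency": "1.0"},
    "line":   {"frequency": "1.0"},
}


def effective_attributes(element):
    """Return the element's attributes with the table defaults filled in."""
    defaults = DEFAULTS.get(element.tag, {})
    # Only attributes the element type has are read; others are ignored.
    resolved = {name: element.get(name, value) for name, value in defaults.items()}
    if "frequency" in resolved:
        resolved["frequency"] = float(resolved["frequency"])
    return resolved
```

Calling effective_attributes on a line group that sets only a frequency returns that frequency together with the default select and combine modes.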

Standard Document Properties

A document’s properties can be used by applications to store auxiliary information, such as extra information about the document content, meta information, or data the application uses to manipulate the document. whirlDOC uses properties to store document settings. Refer to the document settings section of the manual for an explanation of each setting. whirlDOC ignores any property it does not recognize.

The table below lists each document setting, the name of the property used to store the setting, and the property’s value type.

Setting     Property Name  Type
AMP         char.amp       Boolean
QUOT        char.quot      Boolean
LT          char.lt        Boolean
GT          char.gt        Boolean
PARA        para.html      Boolean
KEYWORDS 0  keywords.0     Comma String
KEYWORDS 1  keywords.1     Comma String
KEYWORDS 2  keywords.2     Comma String
KEYWORDS 3  keywords.3     Comma String
KEYWORDS 4  keywords.4     Comma String
KEYWORDS 5  keywords.5     Comma String

A Boolean property is true if its value is "true". The comparison is case-insensitive. The property is considered false if it has any other value or if the property is not included in the document.

The Comma String type is a comma-separated list of strings. For keywords, the first string is the primary keyword and the subsequent strings are variants.

An example properties section is shown below.

<properties>
    <property name="char.amp">true</property>
    <property name="para.html">true</property>
    <property name="keywords.0">zombies,walking dead,undead</property>
    <property name="keywords.1">article marketing</property>
</properties>
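The two value types can be decoded as in the Python sketch below, which reads the example properties section above. The helper names (read_properties, as_boolean, as_comma_string) are hypothetical. Trimming whitespace around each comma-separated item and dropping empty items are assumptions, since the text does not specify either.

```python
import xml.etree.ElementTree as ET

SAMPLE_PROPERTIES = """<properties>
    <property name="char.amp">true</property>
    <property name="para.html">true</property>
    <property name="keywords.0">zombies,walking dead,undead</property>
    <property name="keywords.1">article marketing</property>
</properties>"""


def read_properties(properties_element):
    """Map each property name to its raw string value."""
    return {
        p.get("name"): (p.text or "")
        for p in properties_element.findall("property")
    }


def as_boolean(props, name):
    """True only if the value is "true" in any letter case; absent is False."""
    return props.get(name, "").lower() == "true"


def as_comma_string(props, name):
    """Split a Comma String value into a list of strings."""
    raw = props.get(name, "")
    # Assumption: surrounding whitespace is trimmed and empty items dropped.
    return [item.strip() for item in raw.split(",") if item.strip()]


props = read_properties(ET.fromstring(SAMPLE_PROPERTIES))
primary, *variants = as_comma_string(props, "keywords.0")
```

Here primary holds the primary keyword and variants holds the remaining strings. A property that is absent from the document, such as “char.lt” in this example, reads as false.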