Spintax Is Not Enough

The goal of document spinning is to produce variations of a document, and this is usually done with spintax. If the spinning is done well, each variation should differ enough from the others that it is not evident they all come from the same source document. While such variations can be used for marketing ad copy and descriptions of a product, company, or web site, they are most often used for search engine optimization (SEO). A big concern for SEO use is whether search engines will see the variations as unique content. On that measure, there are serious problems with using just spintax for document spinning.

What Is Spintax?

Spintax is a common method of document spinning. The concept is simple. At its most basic, words in a document are designated as choices between synonyms. One of the synonyms is chosen at random when spinning. For example, suppose the word "car" appeared in a document. The word could be replaced with spintax that chooses between "car," "automobile," and "auto." The spintax for these synonyms used in a simple sentence is shown below.

His {car|automobile|auto} broke down.

The standard spintax syntax is used, which has the spintax enclosed by braces and the choices separated by "pipe" characters. The sentence could be spun into three different sentences.

This shows spintax at its simplest where a single word is replaced with synonyms, but spintax can be used for multiple word phrases as well as single words. It can also be nested, so a spintax segment can contain another spintax segment.
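
The choice-and-nesting behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular spinning product: the function name `spin` and the nested sample sentence are assumptions made for the example. It resolves the innermost brace pairs first, so nested segments work without a full parser.

```python
import random
import re

# Matches an innermost spintax segment: braces with no other braces inside.
INNERMOST = re.compile(r"\{([^{}]*)\}")

def spin(text, rng=random):
    """Resolve spintax by repeatedly replacing innermost segments with one choice."""
    while True:
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), text
        )
        if count == 0:  # no segments left, so the text is fully spun
            return text

print(spin("His {car|automobile|auto} broke down."))
# Hypothetical nested example: the inner segment is resolved before the outer one.
print(spin("His {car|{old|rusty} automobile} broke down."))
```

Passing a seeded `random.Random` instance as `rng` makes a spin repeatable, which is handy when testing.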

A Spintax Example

Spinning software typically replaces words and phrases with spintax automatically, often with dubious grammatical correctness. Here the process is worked through by hand, along with the possible variations that can be spun. The example will be used later to demonstrate duplicate content problems with using spintax for words and phrases.

The sentence below will be used. It starts out as plain text.

The cat crept through the grass, and the mouse was unaware.

To add spintax, several words can be replaced with synonyms. The words and their synonyms are shown below in spintax form.

{cat|kitty}
{crept|stalked}
{unaware|oblivious}

Adding the spintax to the sentence produces the following:

The {cat|kitty} {crept|stalked} through the grass, and the mouse was {unaware|oblivious}.

With three spintax segments, each with two choices, there are eight possible sentences that can be spun (2*2*2 = 8). These are shown below. All are grammatically correct while retaining the meaning of the original sentence.

The cat crept through the grass, and the mouse was unaware.
The cat crept through the grass, and the mouse was oblivious.
The cat stalked through the grass, and the mouse was unaware.
The cat stalked through the grass, and the mouse was oblivious.
The kitty crept through the grass, and the mouse was unaware.
The kitty crept through the grass, and the mouse was oblivious.
The kitty stalked through the grass, and the mouse was unaware.
The kitty stalked through the grass, and the mouse was oblivious.
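
The list above can also be generated mechanically. The sketch below is one way to do it, assuming flat (non-nested) spintax; the helper name `all_variations` is hypothetical. It splits the sentence into literal text and choice lists, then takes the Cartesian product of the choices.

```python
import itertools
import re

# A flat spintax segment: braces containing no nested braces.
SEGMENT = re.compile(r"\{([^{}]*)\}")

def all_variations(text):
    """List every sentence a flat (non-nested) spintax string can produce."""
    # With one capture group, split() alternates literals (even indexes)
    # and the captured choice lists (odd indexes).
    parts = SEGMENT.split(text)
    options = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*options)]

source = ("The {cat|kitty} {crept|stalked} through the grass, "
          "and the mouse was {unaware|oblivious}.")
variations = all_variations(source)
print(len(variations))
for sentence in variations:
    print(sentence)
```

The count grows as the product of the number of choices in each segment, which is why a handful of segments already yields many variations.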

Problems With Spintax

For SEO purposes, the hope behind spintax is that if enough words and phrases are converted to spintax, search engines will not see the spun variations as duplicate content. There are serious problems with this hope because of the way search engines process documents.

Search engines do not read documents like humans do. They do not "understand" what they read. They use a document normalization process to reduce documents to a skeleton of important words. During normalization, text is extracted from a document’s HTML, the content is linearized, meaningless words are removed, and words are stemmed. Latent semantic indexing (LSI) is then used to deduce the relevance of the normalized document to search terms. This process has serious implications for using spintax for document spinning.

Document Normalization

The first step of normalization is extraction of text and linearization. When a search engine scans an HTML page, text is taken left to right and top to bottom. Punctuation marks, extraneous characters, and extra white space are removed. The result is a long stream of words. What this means for document spinning is that differences in text formatting and punctuation are obliterated. They make no difference.

The second step is filtering of words that are not used to decide what the text is about. Words can be divided into two categories, function words and content words. Words like "a" and "the" are considered function words. These are words that add little to no meaning to the text. They are stripped from the text. This typically removes 40% of words. What is left is a framework of content words, which are used to index the document based on a search’s keywords.

The implication of this second step is that segments of text, phrases, and idioms that differ mainly by function words may be seen as being equivalent. Such differences produced by automated spinning software will not prevent spun variations from being seen as duplicate content.

The third step is stemming. This reduces words to their roots, called stems. For example, the words "divide," "divided," and "dividing" all have the same stem, "divid." What this means for spinning is that words with the same stem are seen as equivalent. Again, just as with function words, automated spinning software that swaps in variations of words with the same stem makes no difference.

The earlier example had eight variations that could be spun from the source sentence. After normalization, the eight sentences might look like this:

cat crept through grass mouse was unaware
cat crept through grass mouse was oblivious
cat stalk through grass mouse was unaware
cat stalk through grass mouse was oblivious
kitty crept through grass mouse was unaware
kitty crept through grass mouse was oblivious
kitty stalk through grass mouse was unaware
kitty stalk through grass mouse was oblivious
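
The three normalization steps can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and the stemmer is a toy that strips "ing" and "ed" rather than a real algorithm such as Porter's. Actual search engines use far larger lists and more careful stemming, so their output would differ in detail.

```python
import re

# Hypothetical, deliberately tiny list of function words.
STOP_WORDS = {"a", "an", "the", "and", "of", "to"}

def stem(word):
    """Toy stemmer (an assumption, not a real algorithm): strip 'ing' or 'ed'."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    # Step 1: linearize into lowercase words, discarding punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    # Step 2: drop function words.
    content = [w for w in words if w not in STOP_WORDS]
    # Step 3: reduce each remaining word to its stem.
    return " ".join(stem(w) for w in content)

print(normalize("The cat stalked through the grass, and the mouse was unaware."))
print(normalize("The kitty crept through the grass, and the mouse was oblivious."))
```

With this small stop-word list the output keeps "through" and "was," as in the normalized sentences above; a fuller list would strip those as well.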

Latent Semantic Indexing

For document spinning, the most important search engine feature comes after normalization, because it bears directly on whether variations will be seen as duplicate content. Latent semantic indexing, often referred to as LSI, is used to retrieve a list of documents relevant to a search’s keywords. LSI is a complex subject, but one result of it is that words with essentially the same meaning can index the same content. This can be demonstrated by doing a search for keywords containing "car." Instead of finding only pages that use the word "car," pages that use synonyms like "automobile" are also found.

One way to think of this is that synonyms are all seen as being the same. This is a common computer science concept. Many unique representations of the same thing are converted to a canonical form to reduce complexity. Doing this for synonyms is straightforward. Any reasonably sophisticated algorithm for detecting duplicate content would use this concept to convert synonyms to canonical forms, thus erasing document differences that come only from use of synonyms.

Continuing with the earlier example, suppose the canonical form of "kitty" is "cat", "stalked" is "crept", and "oblivious" is "unaware". The eight variations of the example would all be reduced to the normalized original sentence, which is shown below.

cat crept through grass mouse was unaware
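
The collapse can be checked with a short sketch. The synonym table below simply encodes the assumption made in this example (with "stalk" standing in for the stemmed "stalked"); it is a hypothetical lookup, not a description of how any search engine stores synonyms.

```python
from itertools import product

# Hypothetical synonym table taken from the example's assumption.
CANONICAL = {"kitty": "cat", "stalk": "crept", "oblivious": "unaware"}

def canonicalize(normalized):
    """Replace each word with its canonical form, if it has one."""
    return " ".join(CANONICAL.get(word, word) for word in normalized.split())

# Rebuild the eight normalized variations from the example.
normalized_variations = [
    f"{animal} {verb} through grass mouse was {state}"
    for animal, verb, state in product(
        ("cat", "kitty"), ("crept", "stalk"), ("unaware", "oblivious")
    )
]

distinct = {canonicalize(sentence) for sentence in normalized_variations}
print(len(normalized_variations), "variations,", len(distinct), "distinct after canonicalization")
print(distinct)
```

Eight inputs reduce to a single canonical string, so a detector working on canonical forms would have nothing left to tell the variations apart.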

The implication of LSI is devastating: Using spintax to change words or phrases into synonym words or phrases may be useless against a sophisticated duplicate detection algorithm.

Beyond Spintax

The solution is spinning at two levels. Spintax should be used to change individual words and phrases. This can be considered low level spinning. It avoids variations having short word sequences that are exactly the same. Higher level spinning should be done by altering the structure of a document. Sentence structure should be rearranged as well as that of paragraphs. This structural modification in combination with word and phrase changes will produce content that will be seen as unique even to sophisticated duplicate detection processes.

Low level spinning using spintax to change words and phrases into their synonyms serves an important purpose. It prevents the same sub-sentence sequences of words from appearing in spun variations. Small sequences are used by simple duplicate detection algorithms and services like those used for plagiarism discovery. Such low level spinning is also good for documents that are intended to be read by people. The human mind is adept at recalling patterns, and a unique span of words can trigger memory of a previously read document. This is important because spun documents often need to pass muster with a human reader before they can be placed for discovery by search engines. Of course, low level spinning is also important when spinning documents for non-SEO uses.
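
To make the idea of matching short word sequences concrete, the sketch below compares two texts by their three-word "shingles" and a Jaccard overlap score. Shingling is one well-known technique for simple duplicate detection; whether a given service uses it, and with what sequence length, is an assumption here, and the function names are hypothetical.

```python
def shingles(text, n=3):
    """Return the set of all n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Share of distinct n-word sequences that the two texts have in common."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(sa & sb) / len(sa | sb)

original = "the cat crept through the grass and the mouse was unaware"
spun = "the kitty stalked through the grass and the mouse was oblivious"

print(jaccard(original, original))
print(round(jaccard(original, spun), 3))
```

For these two sentences, five of the thirteen distinct shingles are shared. Every synonym substitution breaks the shingles that span it, which is how low level spinning lowers this kind of score.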

High level spinning can radically alter a document’s structure. A sentence can have its order of subject, verb, and object changed. Adding or subtracting details that do not substantially change a sentence’s meaning presents a rich opportunity for changes beyond grammatical reordering. Sentence restructuring is easy to accomplish with sentence variants. This can be considered as spintax operating at a larger scope than synonyms. Just as spintax is typically used to choose a synonym at random during spinning, sentence variants can also be chosen at random.

While sentence variants can be used to restructure sentences, paragraphs can be similarly restructured. Variant paragraphs can differ in their number of sentences. One variant could have three sentences while another has five. This will change the size of spun paragraphs, which keeps variations from being visually similar and displaces the document’s subsequent text so it is harder to connect similar sections of text between variations. Aside from the number of sentences, paragraphs can often be varied by reordering sentences or optionally dropping sentences.
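
High level spinning can be sketched with a simple data structure. Everything below is hypothetical: the slot layout, the sample sentences, the 50% drop rate, and the function name are assumptions chosen for illustration, not the format of any particular editor. Each slot holds interchangeable sentence variants, and optional slots may be dropped so that paragraph length varies.

```python
import random

# Hypothetical paragraph template: one slot per sentence position.
PARAGRAPH = [
    {"variants": ["The cat crept through the grass.",
                  "Through the grass crept the cat."],
     "optional": False},
    {"variants": ["The mouse was unaware.",
                  "Nothing had alerted the mouse."],
     "optional": False},
    {"variants": ["The afternoon was quiet."],
     "optional": True},
]

def spin_paragraph(slots, rng, reorder=False):
    """Pick one variant per slot, dropping optional slots about half the time."""
    chosen = []
    for slot in slots:
        if slot["optional"] and rng.random() < 0.5:
            continue  # omit this sentence from the current variation
        chosen.append(rng.choice(slot["variants"]))
    if reorder:
        rng.shuffle(chosen)  # only safe when sentence order does not carry meaning
    return chosen

rng = random.Random()
print(" ".join(spin_paragraph(PARAGRAPH, rng)))
```

Each chosen sentence could itself contain word-level spintax, so the two levels of spinning combine rather than compete.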

Spinning at low and high levels produces variations that differ in complex ways. It requires a next-generation spintax editor and document spinner that is designed to work at both levels. Our whirlDOC editor is such an editor.