Rare words 'author's fingerprint'
A simple analysis of the words in a book can indicate who wrote it.
Analyses of classic authors' works provide a way to "linguistically fingerprint" them, researchers say.
The relationship between the number of words an author uses only once and the length of a work forms an identifier for that author, they argue.
Analyses of works by Herman Melville, Thomas Hardy, and DH Lawrence showed these "unique word" charts are specific to each author.
The work is published in the New Journal of Physics.
Researchers also suggest each author pulls their works from a hypothetical "meta book". One description of this concept might be a framework for the way an author uses language. It is from this framework that all their works are ultimately derived.
In 1935, the Harvard University linguist George Kingsley Zipf demonstrated a mathematical relationship between the frequency of a word in a text and its rank in the list of an author's most used words.
Under this rule, the second most frequent word in a book occurs about half as often as the first, the third most frequent about one-third as often, and so on.
The rule laid the groundwork for many mathematical analyses of words, in which Zipf's law seemed to be a universal property of English - and by extension, of language itself.
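As a rough illustration of the rank-and-frequency rule, the Python sketch below counts the words in a short sample and sets each observed count beside the idealised prediction, which is the top word's count divided by the rank. The function names and the sample sentence are invented for this example, and a text this short will not follow the rule closely.

```python
import re
from collections import Counter


def rank_frequencies(text):
    """Return (word, count) pairs ordered from most to least frequent."""
    # Crude tokeniser (an assumption of this sketch): lower-case letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_prediction(top_count, rank):
    """Idealised count at a given rank: the top word's count divided by the rank."""
    return top_count / rank


# Invented sample sentence, far too short to follow the rule closely.
sample = "the cat and the dog and the bird"
ranked = rank_frequencies(sample)
top_count = ranked[0][1]

for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

Here "the" appears three times, so the idealised count for the second-ranked word is 1.5, against an observed count of 2 for "and". On a full-length book the observed and predicted columns would be expected to track each other far more closely.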
Building on that idea, researchers at Umeå University in Sweden have found that language use isn't as universal as Zipf's law might suggest.
They have used a related approach that yields a unique identifier for each author.
Clearly, a longer written work has more unique words - words that appear just once in the text.
However, even the best writer's vocabulary will at some point run out of words that have not yet been used, so the rate at which new words appear slows as the text grows.
Thomas Hardy's curve looked less word-rich than Herman Melville's.
The researchers gathered together the complete works of Hardy, Melville, and Lawrence, and measured how the number of unique words grows with length - counting the new unique words as a particular author's works get longer and longer.
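One way to picture such a curve is sketched below in Python: read a text word by word and record, at regular intervals, how many words have so far appeared exactly once. This follows the article's description of a "unique word" rather than the researchers' published procedure, and the tokeniser, the function names and the `step` parameter are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case words (a simplifying assumption of this sketch)."""
    return re.findall(r"[a-z']+", text.lower())


def unique_word_curve(words, step=1000):
    """Return (text_length, words_seen_exactly_once) pairs as the text grows."""
    counts = Counter()
    once = 0  # number of words that have appeared exactly once so far
    curve = []
    for i, word in enumerate(words, start=1):
        counts[word] += 1
        if counts[word] == 1:
            once += 1  # first sighting: the word is unique for now
        elif counts[word] == 2:
            once -= 1  # second sighting: the word is no longer unique
        # Record a point every `step` words and at the very end of the text.
        if i % step == 0 or i == len(words):
            curve.append((i, once))
    return curve


# Tiny invented input, with a step of 3 so that a few points are recorded.
print(unique_word_curve(tokenize("a b a c d b e"), step=3))
```

Plotting the second value against the first for one author's collected works would give a curve of the kind the article describes.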
They used sections from books of varying lengths, randomly pulled from novels, alongside shorter works and short stories.
They found that the authors had distinctly different "unique word" curves.
The team suggests that a work by an unknown author could therefore be compared to prior works, with the curve acting as a linguistic "fingerprint".
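How such a comparison might be made is not spelled out in the article. The Python sketch below is one simple possibility, not the team's method: it scores two curves by the average gap between them at the text lengths they share, then picks the known author with the smallest score. The author labels and every number are toy values invented for the example, not measurements.

```python
def curve_distance(curve_a, curve_b):
    """Mean absolute gap between two curves at the text lengths they share."""
    b_points = dict(curve_b)
    shared = [(length, value) for length, value in curve_a if length in b_points]
    if not shared:
        raise ValueError("the curves share no text lengths")
    return sum(abs(value - b_points[length]) for length, value in shared) / len(shared)


def closest_author(unknown_curve, known_curves):
    """Return the label of the known curve that lies nearest the unknown one."""
    return min(known_curves, key=lambda name: curve_distance(unknown_curve, known_curves[name]))


# Toy (text_length, unique_word_count) points, invented for illustration only.
known = {
    "author_a": [(1000, 400), (2000, 650)],
    "author_b": [(1000, 300), (2000, 480)],
}
unknown = [(1000, 390), (2000, 640)]

print(closest_author(unknown, known))
```

A real comparison would need many more points per curve and some measure of how confident the match is; the mean gap used here is only the simplest choice available.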
The meta book concept proposed by the researchers is not simply the list of all the words an author knows, but also the "distribution" of those words in whatever that author produces, whether drafting an e-mail or writing War and Peace.
"It doesn't matter if I pull out 10,000 words from a book of 100,000 or from a book of 200,000, I get the same behaviour; you always simply pull a piece out of your very, very big 'meta book', which is just a representation of your style," said Sebastian Bernhardsson, who led the work.
"That story you're writing right now is a piece of that big book and that is what you're pulling out," he told BBC News.
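The claim that a section of a given length behaves the same whatever the size of the book it came from can be checked with a small experiment, sketched here in Python. The idea is to pull many random sections of a fixed length from a tokenised text and average the number of words that occur once in each. The function names, the number of trials and the fixed seed are assumptions made for this example, and the input texts are left to the reader.

```python
import random
from collections import Counter


def words_used_once(words):
    """Count the words that occur exactly once in a list of words."""
    counts = Counter(words)
    return sum(1 for count in counts.values() if count == 1)


def random_section(words, length, rng):
    """Pull one contiguous section of `length` words from a random position."""
    if length > len(words):
        raise ValueError("section is longer than the text")
    start = rng.randrange(len(words) - length + 1)
    return words[start:start + length]


def average_once_count(words, length, trials=100, seed=0):
    """Average the once-only word count over many random sections of one length."""
    rng = random.Random(seed)  # fixed seed so that repeated runs agree
    total = sum(words_used_once(random_section(words, length, rng)) for _ in range(trials))
    return total / trials
```

Calling `average_once_count(book_words, 10000)` on two books of different sizes by the same author, where `book_words` is a hypothetical list of tokenised words, should give similar averages if the meta book picture holds.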
The team will continue the analyses with different English authors, and with authors in different languages. As their collection of fingerprints grows, Mr Bernhardsson said, they will try to identify the authors of anonymous works.
But not every result is a happy one, he added.
"It's a fun and interesting exercise, but I've plotted my own thesis in this sense and it was kind of discouraging comparing to some more famous authors."