explaining the long tale

The long tail has been extensively discussed.  But what about the long tale?  What is the nature and significance of the long tale?

Consider two very long tales. The longest tale printed with a Latin or Cyrillic alphabet is Madeleine de Scudéry's Artamène, ou le Grand Cyrus.  This type of work is known among specialists as a "roman de longue haleine" (long-winded novel). First published in Paris, 1649-1653, Artamène consists of ten volumes encompassing 7,443 pages and about 2.1 million words. A second long tale is Samuel Richardson's Clarissa, or, the History of a Young Lady, published in London in 1748. Its first edition has seven volumes with a total of 2564 pages and about a million words.[1]

Reading a long tale takes a long time. At a typical current reading speed for prose, Artamène would take about 140 hours to read. But in the seventeenth century, books were often read aloud. Reading aloud takes roughly 80% more time than silent reading.[2] Moreover, if reading occurred by candlelight, the need to maintain and trim the candle might plausibly have increased reading time by 5%. So reading Artamène could easily have required 250 hours of reading time. Reading Clarissa could easily have required 100 hours of reading time.
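The reading-time arithmetic can be checked with a few lines of Python. This is only a sketch: the word counts and reading speeds are the ones given in the text and in note [2], the 5% candle overhead is the guess made above, and the function and variable names are made up for illustration.

```python
# Reading-time arithmetic, using the figures from the text and note [2].
SILENT_WPM = 250   # typical silent reading speed, words per minute (note [2])
ALOUD_WPM = 140    # typical speed for spoken text, words per minute (note [2])

ARTAMENE_WORDS = 2_100_000   # about 2.1 million words
CLARISSA_WORDS = 1_000_000   # about a million words


def reading_hours(words, words_per_minute, overhead=0.0):
    """Hours needed to read `words` words; `overhead` is a fractional add-on."""
    return words / words_per_minute / 60 * (1 + overhead)


artamene_silent = reading_hours(ARTAMENE_WORDS, SILENT_WPM)
artamene_aloud = reading_hours(ARTAMENE_WORDS, ALOUD_WPM)
# Assumed 5% extra time for maintaining and trimming the candle.
artamene_candle = reading_hours(ARTAMENE_WORDS, ALOUD_WPM, overhead=0.05)
clarissa_aloud = reading_hours(CLARISSA_WORDS, ALOUD_WPM)

# Extra time for reading aloud relative to silent reading, as a fraction.
aloud_penalty = SILENT_WPM / ALOUD_WPM - 1

for label, hours in [
    ("artamene_silent", artamene_silent),
    ("artamene_aloud", artamene_aloud),
    ("artamene_candle", artamene_candle),
    ("clarissa_aloud", clarissa_aloud),
]:
    print(f"{label}: {hours:.0f} hours")
print(f"aloud_penalty: {aloud_penalty:.0%}")
```

With these inputs, silent reading of Artamène comes to 140 hours and reading it aloud to 250 hours, before any allowance for the candle.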

While they were long tales, Artamène and Clarissa were also best-sellers of their times. One scholar declared of Artamène:

from 1649 to 1654, from one end of France to the other, at the court and in the most aristocratic circles, as well as among the more cultivated bourgeoisie, at Paris and in the provinces, in all ranks of a society the most polite in the world, one read them not merely with pleasure, one seized upon, one devoured bit by bit as they appeared, every one of those ten great volumes.[3]

In the course of printing, the printer increased the print run for the volumes then in press and printed additional copies of earlier volumes. While printing of the first edition finished in 1653, by 1655 the printer had already produced a complete fourth edition and a printer in England had already printed an English translation of the full, ten-volume work.[4] From 1654 to 1660, Scudéry produced another ten-volume work, Clélie, Histoire Romaine. That action testifies to the success of Artamène. Clélie also turned out to be highly popular.[5]

The success of Clarissa can be described more quantitatively. Richardson probably printed 3000 sets of the seven-volume, first edition of Clarissa in 1748.  He printed additional editions of Clarissa in 1749, 1751, and 1759.  These later editions probably amounted in total to about 3000 sets.[6]   Through 1769, a total of eleven editions of Clarissa were printed in London and Dublin.  For comparison, few editions of British novels between 1750 and 1770 had print runs greater than 1,000, and most probably were printed in 500-800 copies.[7]

Just as did Scudéry, Richardson quickly followed one long tale with another.  About five years after writing and publishing Clarissa, Richardson wrote and published a new work, The History of Sir Charles Grandison (1753). Its first edition has seven volumes comprising 2459 text pages and about 907,000 words.[8] Three editions comprising 6,500 sets were printed within a year.[9] Both the size and print runs of Grandison suggest the prior success of Clarissa.

The commercial success of Clarissa measures reasonably well against that of Richardson's path-breaking best-seller, Pamela, or Virtue Rewarded (1740). Pamela probably sold 20,000 two-volume sets within fourteen months after it was first published and had fourteen editions through 1769.[10]  Total volumes sold of Pamela through 1769 probably did not exceed by more than 50% those sold of Clarissa. Moreover, in 1766, the copyright of Pamela sold for £288, while the copyright for Clarissa sold for £600.[11] Both Pamela and Clarissa were best-sellers in America. Pamela, published in the U.S. in 1744, sold more than 10,000 copies through 1749. Clarissa, published in the U.S. in 1786, sold more than 25,000 copies through 1789.[12]

Long tales published in the twentieth century differ significantly from Artamène and Clarissa. Marcel Proust's À la recherche du temps perdu has nine volumes totaling about 3,200 pages and 1.5 million words.[13] Despite considerable advances in writing and printing technology, Proust's work was published over a fifteen-year period (1913-1927), while Scudéry's Artamène, about 50% longer, was published over only a five-year period (1649-1653). Moreover, Artamène was a best-seller, while À la recherche du temps perdu was far from a best-seller. The first volume of Proust's work had an initial print run of 1,750 copies, and perhaps 4,100 copies were printed between 1913 and 1918.[14] A best-seller in the U.S. about this time would sell 900,000 copies to a population about twice the size of France's.[15] Other long tales of the twentieth century attracted even less popular attention than Proust's work.

The closest the past century has come to producing a best-selling long tale is J.K. Rowling's Harry Potter series.  The seven Harry Potter books, published from 1997 to 2007, have a total of 4175 pages and about a million words.[16]  The final book in the series broke sales records by selling 2.7 million copies in the U.K. and 8.3 million copies in the U.S. in its first 24 hours on sale.  The Harry Potter series as a whole differs from a long tale in that its volumes were marketed as single works and not widely sold as a set.  Moreover, the Harry Potter series was published over a period more than twice as long as that for either Artamène or Clarissa.  Rowling shows no signs of adopting the form of the Harry Potter series as a template for another work.  Instead, Rowling plans to take time off and then write an encyclopedia of Harry Potter characters and places.  The Harry Potter series has not re-established the long tale as a generic type of work.

Vertical integration favors the production of the long tale. The longer the tale, the greater the cost and the risk in printing it.  Richardson was not only an author; he was also a master printer who printed his own works.  Thus he did not have to pay another printer for the cost and risk of printing a long tale.  The more imperfect the market for printing and risk-bearing, the greater the advantage to being able to assume both these functions within an author-printer enterprise.  Richardson produced Clarissa and Grandison with the advantage of vertical integration at a time when transaction costs associated with the nascent novel-printing business were relatively high.

Social influence favors the success of the long tale. Recent research indicates that greater social influence favors greater concentration of demand among the most highly popular works.[17]  Salons and coffee houses were important social institutions in seventeenth-century France and eighteenth-century England. Scudéry herself conducted at her Paris home an important salon known as Samedi:

the main purpose of the salon was for amusement. Among the activities were excursions, elegant dinners, and surprise visits to friends staying in the country. The glory of a certain pastry shop in rue Saint-Honoré that Mlle de Scudéry and her friends loved to frequent has come down to us and we also know of Mme Aragonais' dolls, which the ladies of the Samedi dressed in the current mode. Other diversions were the experiments done by Claude Perrault, architect and anatomist, to observe the chameleon's ability to change color according to its environment. ... Poems were exchanged, of course, as were certain gallantries...[18]

The vibrant salon world of seventeenth-century France created extensive, powerful channels for social influence. Social influence arising from these salons, and from Scudéry's position as a leading salonnière, is probably an important part of the explanation for the long tale.

starting to read Richardson's Clarissa

The communication industry has changed greatly since the time of Artamène and Clarissa.   The average duration of online videos watched in the U.S. in March, 2008, was only 2.8 minutes per video.  That's much, much less than the 250 hours it probably took to read the best-seller Artamène in seventeenth-century France.   Less vertical integration on the supply side and less social influence on the demand side may be an important part of the explanation for this huge difference.

Notes:

[1] Artamène is available online. While the authorship of the work is not obvious, most scholars believe that Madeleine de Scudéry wrote it. The online source states that the first edition had 13,095 pages, while the online (1656) edition has 7443 pages. If that's correct, the first edition must have had either a very large typeface or widely spaced lines. Wikipedia lists the word count as 2.1 million. I've verified the plausibility of this figure with page sampling from the online edition. Clarissa is also available online. My page count is first-edition text pages, as documented in Sale (1969) pp. 45-8. The word count is from the online edition; see long-tale data.
[2] Calculation based on a typical reading speed of 250 words per minute, and a typical speed for spoken text of 140 words per minute.
[3] Cousin (1886) v. 1, p. 2.
[4] Newman (2003) p. 1.
[5] Aronson (1978) pp. 54, 82.
[6] Keymer (1994) pp. 392-3. The latter figure is based on scaling Rivington's revenue figures.
[7] Raven (1987) pp. 15, 40.
[8] Page count for first edition, based on Sale (1969) pp. 70-4. Word count scaled from words in the online volume 4. See long-tale data.
[9] Eaves and Kimpel (1971) pp. 384, 401.
[10] Keymer and Sabor (2005) p. 20; Raven (1987) p. 15.
[11] Eaves and Kimpel (1971) p. 490.
[12] Mott (1966) p. 304. Grandison, also published in the U.S. in 1786, was a "better seller" (not quite a best-seller) from 1786-1789.
[13] The page count and word count are from Wikipedia, here and here.
[14] Tadié (2000) p. 595. According to a history of Éditions Gallimard, which became Proust's publisher, the company sold more than four million copies of À la recherche du temps perdu (in French, apparently worldwide) in seventy years through its copyright expiration in 1987. A significant share of these copies may have been purchased due to course assignments.
[15] Mott (1966) App. A.
[16] From WikiAnswers here and here. Included in long-tale data.
[17] See Salganik, Dodds, and Watts (2006).
[18] Aronson (1978) p. 39.

References:

Aronson, Nicole. 1978. Mademoiselle de Scudéry. Boston: Twayne Publishers.

Cousin, Victor. 1886. La société française au XVIIe siècle d'après Le Grand Cyrus de Mlle de Scudéry. Paris: Perrin & Cie.

Eaves, Thomas Cary Duncan, and Ben D. Kimpel. 1971. Samuel Richardson: a biography. Oxford: Clarendon.

Keymer, Tom. 1994. "Clarissa's Death, Clarissa's Sale, and the Text of the Second Edition." Review of English Studies, New Series, v. xlv, n. 179, pp. 389-96.

Keymer, Tom, and Peter Sabor. 2005. Pamela in the marketplace: literary controversy and print culture in eighteenth-century Britain and Ireland. Cambridge: Cambridge University Press.

Mott, Frank Luther. 1966. Golden multitudes: the story of best sellers in the United States.

Newman, Karen. 2003. "Volume Editor's Introduction," in Scudéry, Madeleine de, and Karen Newman. 2003. The story of Sapho. Chicago: University of Chicago Press.

Raven, James. 1987. British fiction, 1750-1770: a chronological check-list of prose fiction printed in Britain and Ireland. Newark: University of Delaware Press.

Sale, William Merritt. 1969 [1936]. Samuel Richardson: A Bibliographical Record of His Literary Career with Historical Notes. Archon Books.

Salganik, Matthew J., Peter Sheridan Dodds, and Duncan J. Watts. 2006. "Experimental study of inequality and unpredictability in an artificial cultural market." Science, v. 311, pp. 854-6.

Tadié, Jean-Yves. 2000. Marcel Proust. New York: Viking.


lack of power laws and other popularity problems

Discussion of tails and popularity seems to be maturing into more comprehensive considerations. Folks aiming for uber-geek status might chat about the difference between power laws and log-normal distributions. This difference has similar consequences to the difference between non-stationary and stationary macroeconomic time series. If those terms are obscure to you, you might just chat about how much bigger infinity is than any other number. It's a huge issue!

Government bureaucrats and other practical, get-the-job-done types might just try to produce some simple, intuitive, and relevant graphs. Consider, for example, log-log graphs of website page traffic by page rank. They show a "drooping tail" relative to an approximating line for the left part of the popularity distribution.

Whether the droop is a typical characteristic of web page popularity distributions is not clear. Surely a relatively large subset of relatively bad (unlinked, search-word-poor, spam-associated) pages could contribute to the droop. On the other hand, the droop could be a typical effect of the usual distribution of page content and general patterns of linking and searching. These two possibilities could be tested by comparing the magnitude of the droop in different websites' page traffic distributions.

Fitting a line to a popularity distribution is more useful as a descriptive technique than as a literal claim that the popularity distribution follows a power law. The term "power law" is not meaningful to most persons. Moreover, knowledge about power laws does not provide a lot of insight into the factors that determine website traffic or trends in website traffic over time.
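As a sketch of what the descriptive technique involves, the following Python fits a least-squares line to the most popular pages in log-log coordinates and then measures how far the remaining pages fall below that line. The traffic figures are synthetic and made up purely for illustration, the function names are hypothetical, and the choice of where the "left part" of the distribution ends (here, rank 100) is an arbitrary assumption.

```python
import math


def fit_loglog_line(traffic, head=None):
    """Least-squares line through (log10 rank, log10 traffic) for the top `head` pages."""
    ranked = sorted(traffic, reverse=True)
    if head is not None:
        ranked = ranked[:head]
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(t) for t in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def droop(traffic, slope, intercept, tail_start):
    """Mean shortfall, in log10 units, of pages ranked `tail_start` or lower below the line."""
    ranked = sorted(traffic, reverse=True)
    gaps = [
        (intercept + slope * math.log10(rank)) - math.log10(t)
        for rank, t in enumerate(ranked, start=1)
        if rank >= tail_start
    ]
    return sum(gaps) / len(gaps)


# Synthetic data: the top 100 pages follow traffic = 1000 / rank exactly,
# and pages 101-500 fall off faster, which produces a drooping tail.
pages = [1000.0 * rank ** -1.0 for rank in range(1, 101)]
pages += [10.0 * (rank / 100) ** -2.0 for rank in range(101, 501)]

slope, intercept = fit_loglog_line(pages, head=100)
tail_droop = droop(pages, slope, intercept, tail_start=101)

print(f"slope of line fitted to top 100 pages: {slope:.2f}")
print(f"mean droop of pages 101-500 below that line: {tail_droop:.2f} log10 units")
```

Computing the same droop figure for several websites, with the same rule for choosing the head, would be one crude way to compare the magnitude of the droop across sites.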

The possibilities for statistical distributions are not limited to power laws (or power laws and log-normal distributions). The personal behavior and information and communication systems that affect page popularity are complex. They may not be uniform across different circumstances. For example, the factors that govern the popularity of the least popular pages may be rather different from the factors that govern the popularity of the most popular pages.

Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to a log-normal distribution, or to distributions with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution.

Two website traffic distributions suggest that website traffic distributions may have flattened over the past decade. Traffic to Sun's website pages in July, 1996, according to my calculation, had a descriptive log-log line with slope about -1.1. Traffic to useit.com pages in the summer of 2006 had a descriptive log-log line with slope about -0.8.* This difference may reflect differences between the two websites. It might also indicate flattening over time in typical website traffic distributions. Such flattening was also a general pattern of change in name popularity distributions in England (and the U.S.) from 1800 to the present.

Much more evidence of this sort might lead one to expect website traffic distributions, and perhaps other distributions for digital goods, to continue to flatten in the future. "More of the same" is a very crude sort of prediction. But absent any other relevant information, it might be the best that one can make.

Knowledge concerning the factors that have produced changes in popularity distributions could help to predict and shape future changes. Persons may have chosen less popular names in response to changes in patterns of work and residence that put persons closer together and more extensively and uniformly regulated their interactions. In short, personalization may have been a personal counter-reaction to factories and urbanization (the Industrial Revolution).

Suppose new information and communication technologies favor a more dispersed workforce, and family structure continues to shift toward more single persons and smaller households. My theory then predicts a decrease in personalization and an increased desire to associate with popular symbols.

Ummm, about all current trends indicate that either my theory is wrong, or that there are a lot of other, much more important factors. Does anyone have some better ideas?

* This is the slope measured in log-log coordinates, not in the coordinates of the axes' labels. The intercept on a log(page traffic) axis typically varies greatly with total website traffic. Thus I prefer graphs that have a y-axis labeled with log(page traffic share in total website traffic). An approximating line has the same slope with either labeling for the Y axis.
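The claim about the slope follows from log(traffic / total) = log(traffic) - log(total): relabeling the y-axis as traffic share moves every point down by the same constant. Here is a small Python check of that, using made-up traffic counts and hypothetical names.

```python
import math

# Made-up page-traffic counts, ordered from most to least popular.
traffic = [5000, 2400, 1500, 1200, 900, 400, 150]
total = sum(traffic)

log_rank = [math.log10(rank) for rank in range(1, len(traffic) + 1)]
log_traffic = [math.log10(t) for t in traffic]
log_share = [math.log10(t / total) for t in traffic]

# Every point shifts down by the same amount, log10(total).
shifts = [lt - ls for lt, ls in zip(log_traffic, log_share)]


def segment_slopes(xs, ys):
    """Slopes of the straight segments joining consecutive points."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]


slopes_traffic = segment_slopes(log_rank, log_traffic)
slopes_share = segment_slopes(log_rank, log_share)

print("shift applied to each point:", [round(s, 4) for s in shifts])
print("slopes, y = log(traffic):   ", [round(s, 3) for s in slopes_traffic])
print("slopes, y = log(share):     ", [round(s, 3) for s in slopes_share])
```

Because a least-squares slope depends on the y-values only through their deviations from their mean, a fitted line keeps its slope under this shift as well; only its intercept falls, by log10(total).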


more discussion about tail size

Once again men are heatedly discussing tail size. Just ponder this question: How large is the long tail? Personally, I'm going to keep looking before I decide for myself.

While it's novel to bring mathematical precision to such matters, unfortunately it seems to me that this mathematical model focuses attention on misleading features. The model says that the share of the k most popular items is log(k)/log(n), where n is the total number of items on offer. Thus, in this model, the total number of items on offer determines the share of the most popular items.
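A formula of this form arises, at least approximately, if the popularity of an item is taken to be proportional to 1/rank. That is my assumption about where log(k)/log(n) comes from, not necessarily the derivation the model's author used. Under that assumption, the following Python sketch compares the share of the top k items with log(k)/log(n) for n of two million and six billion. The function names are hypothetical, and the large harmonic sums use the standard asymptotic approximation rather than an exact sum.

```python
import math

EULER_GAMMA = 0.5772156649015329


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n: exact for small n, asymptotic formula otherwise."""
    if n < 10_000:
        return sum(1.0 / i for i in range(1, n + 1))
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)


def zipf_share(k, n):
    """Share of the k most popular of n items when popularity is proportional to 1/rank."""
    return harmonic(k) / harmonic(n)


def model_share(k, n):
    """The share stated in the text: log(k)/log(n)."""
    return math.log(k) / math.log(n)


for n in (2_000_000, 6_000_000_000):
    for k in (1_000, 100_000):
        print(
            f"n={n:>13,}  k={k:>7,}  "
            f"1/rank share={zipf_share(k, n):.3f}  "
            f"log(k)/log(n)={model_share(k, n):.3f}"
        )
```

Holding k fixed and raising n lowers the share in both calculations, which is the feature of the model at issue: the number of items on offer is doing all the work.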

This isn't a sensible model. Mathematically, a power law describes an infinite number of items on offer. The slope of the power law, or more precisely, the slope of an approximating power law at the high popularity end of the distribution, usually describes well the high-end shares. The question is what determines the slope of the power law. The number of items on offer isn't a good answer to that question, particularly for n varying from two million to six billion.

For a concrete example, consider the popularity of the ten-most-popular given names. The set of possible given names (given names on offer) is huge, and probably hasn't changed much in the past two-hundred years. However, the popularity of the ten-most-popular given names for males in England has fallen from about 85% in 1800 to about 28% in 1994. If you want to understand changes in the popularity of the most popular items in a collection of symbols instantiated and used in a similar way, try to understand this change.

* * *

For additional amusement, here's a post I stuck in the galbithink.org newsfeed a little more than a year ago, back in the time of Web-Pleistocene:

Tail aficionados might enjoy pondering the distinguishing features of the long tail. I think that size, which tail authorities have categorized as long or short, matters less than shape. It should be no surprise to anyone that shape can change over time. For some graphical evidence, see the detailed images here.

So don't just sit around complaining that "diversity plus freedom of choice creates inequality". Power laws don't imply any particular amount of inequality. The power of the power law determines the difference between tails. Look at some examples and see for yourself!
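For one such example, here is a Python sketch that holds the number of items fixed and varies only the power. The catalogue size and the exponents are arbitrary values chosen for illustration, and the function name is hypothetical.

```python
def top_share(k, n, exponent):
    """Share of the k most popular of n items when popularity is proportional to rank**(-exponent)."""
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)


N_ITEMS = 100_000                    # hypothetical number of items, held fixed
EXPONENTS = (0.5, 0.8, 1.1, 1.5)     # hypothetical powers to compare

shares = {s: top_share(10, N_ITEMS, s) for s in EXPONENTS}

for exponent, share in shares.items():
    print(f"power {exponent}: top-10 share = {share:.1%}")
```

With the same number of items in every case, the share going to the ten most popular items ranges from very small to well over half, depending only on the power.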
