explaining the long tale

The long tail has been extensively discussed.  But what about the long tale?  What is the nature and significance of the long tale?

Consider two very long tales. The longest tale printed with a Latin or Cyrillic alphabet is Madeleine de Scudéry's Artamène, ou le Grand Cyrus.  This type of work is known among specialists as a "roman de longue haleine" (long-winded novel). First published in Paris, 1649-1653, Artamène consists of ten volumes encompassing 7,443 pages and about 2.1 million words. A second long tale is Samuel Richardson's Clarissa, or, the History of a Young Lady, published in London in 1748. Its first edition has seven volumes with a total of 2564 pages and about a million words.[1]

Reading a long tale takes a long time. At current, typical prose reading speed, Artamène would take about 140 hours to read. But in the seventeenth century, books were often read aloud. Reading aloud takes roughly 70% more time than silent reading.[2] Moreover, if reading occurred by candlelight, the need to maintain and trim the candle plausibly might increase reading time by 5%. So reading Artamène could easily have required 250 hours of reading time. Reading Clarissa could easily have required 100 hours of reading time.
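The arithmetic behind these estimates can be sketched in a few lines of Python. This is only a rough sketch. It assumes the figures given in the text and notes: 250 words per minute for silent reading, 70% more time for reading aloud, and 5% more time for candle maintenance. The function name and its parameters are hypothetical, chosen for illustration.

```python
def reading_hours(words, wpm=250.0, aloud_penalty=0.0, candle_penalty=0.0):
    """Estimate the hours needed to read a text of `words` words.

    wpm            -- assumed silent reading speed in words per minute
    aloud_penalty  -- fractional extra time for reading aloud (0.70 = 70% more)
    candle_penalty -- fractional extra time for tending a candle (0.05 = 5% more)
    """
    silent_hours = words / wpm / 60.0
    return silent_hours * (1.0 + aloud_penalty) * (1.0 + candle_penalty)


# Word counts as stated in the text (both are approximate).
artamene_words = 2_100_000
clarissa_words = 1_000_000

artamene_silent = reading_hours(artamene_words)
artamene_aloud = reading_hours(artamene_words, aloud_penalty=0.70, candle_penalty=0.05)
clarissa_aloud = reading_hours(clarissa_words, aloud_penalty=0.70, candle_penalty=0.05)

print(f"Artamène, silent reading:    {artamene_silent:.0f} hours")
print(f"Artamène, aloud with candle: {artamene_aloud:.0f} hours")
print(f"Clarissa, aloud with candle: {clarissa_aloud:.0f} hours")
```

Under these assumptions Artamène comes out at about 140 hours silently and about 250 hours aloud by candlelight, while Clarissa comes out somewhat above the 100 hours cited. Using the speeds in note [2] directly (250 versus 140 words per minute) implies a read-aloud penalty closer to 79%, so the figures here are, if anything, conservative.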

While they were long tales, Artamène and Clarissa were also best-sellers of their times. One scholar declared of Artamène:

from 1649 to 1654, from one end of France to the other, at the court and in the most aristocratic circles, as well as among the more cultivated bourgeoisie, at Paris and in the provinces, in all ranks of a society the most polite in the world, one read them not merely with pleasure, one seized upon, one devoured bit by bit as they appeared, every one of those ten great volumes.[3]

In the course of printing, the printer increased the print run for currently printing volumes and printed additional copies of earlier volumes. While printing of the first edition finished in 1653, by 1655 the printer had already produced a complete fourth edition and a printer in England had already printed an English translation of the full, ten-volume work.[4] From 1654 to 1660, Scudéry produced another ten-volume work Clélie, Histoire Romaine. That action testifies to the success of Artamène. Clélie turned out also to be highly popular.[5]

The success of Clarissa can be described more quantitatively. Richardson probably printed 3000 sets of the seven-volume, first edition of Clarissa in 1748.  He printed additional editions of Clarissa in 1749, 1751, and 1759.  These later editions probably amounted in total to about 3000 sets.[6]   Through 1769, a total of eleven editions of Clarissa were printed in London and Dublin.  For comparison, few editions of British novels between 1750 and 1770 had print runs greater than 1,000, and most probably were printed in 500-800 copies.[7]

Just as did Scudéry, Richardson quickly followed one long tale with another.  About five years after writing and publishing Clarissa, Richardson wrote and published a new work, The History of Sir Charles Grandison (1753). Its first edition has seven volumes comprising 2459 text pages and about 907,000 words.[8] Three editions comprising 6,500 sets were printed within a year.[9] Both the size and print runs of Grandison suggest the prior success of Clarissa.

The commercial success of Clarissa measures reasonably well against that of Richardson's path-breaking best-seller, Pamela, or Virtue Rewarded (1740). Pamela probably sold 20,000 two-volume sets within fourteen months after it was first published and had fourteen editions through 1769.[10] Total volumes sold of Pamela through 1769 probably did not exceed by more than 50% those sold of Clarissa. Moreover, in 1766, the copyright of Pamela sold for £288, while the copyright for Clarissa sold for £600.[11] Both Pamela and Clarissa were best-sellers in America. Pamela, published in the U.S. in 1744, sold more than 10,000 copies through 1749. Clarissa, published in the U.S. in 1786, sold more than 25,000 copies through 1789.[12]

Long tales published in the twentieth century differ significantly from Artamène and Clarissa. Marcel Proust's À la recherche du temps perdu has nine volumes totaling about 3,200 pages and 1.5 million words.[13] Despite considerable advances in writing and printing technology, Proust's work was published over a fifteen-year period (1913-1927), while Scudéry's Artamène, about 50% longer, was published over only a five-year period (1649-1653). Moreover, Artamène was a best-seller, while À la recherche du temps perdu was far from a best-seller. The first volume of Proust's work had an initial print run of 1,750 copies, and perhaps 4,100 copies were printed between 1913 and 1918.[14] A best-seller in the U.S. about this time would sell 900,000 copies to a population about twice the size of France's.[15] Other long tales of the twentieth century attracted even less popular attention than Proust's work.

The closest the past century has come to producing a best-selling long tale is J.K. Rowling's Harry Potter series. The seven Harry Potter books, published from 1997 to 2007, have a total of 4175 pages and about a million words.[16] The final book in the series broke sales records by selling 2.7 million copies in the U.K. and 8.3 million copies in the U.S. in its first 24 hours on sale. The Harry Potter series as a whole differs from a long tale in that its volumes were marketed as single works and not widely sold as a set. Moreover, the Harry Potter series was published over a period more than twice as long as that for Artamène and Clarissa. Rowling shows no signs of adopting the form of the Harry Potter series as a template for another work. Instead, Rowling plans to take time off and then write an encyclopedia of Harry Potter characters and places. The Harry Potter series has not re-established the long tale as a generic type of work.

Vertical integration favors the production of the long tale. The longer the tale, the greater the cost and the risk in printing it.  Richardson was not only an author; he was also a master printer who printed his own works.  Thus he did not have to pay another printer for the cost and risk of printing a long tale.  The more imperfect the market for printing and risk-bearing, the greater the advantage to being able to assume both these functions within an author-printer enterprise.  Richardson produced Clarissa and Grandison with the advantage of vertical integration at a time when transaction costs associated with the nascent novel-printing business were relatively high.

Social influence favors the success of the long tale. Recent research indicates that greater social influence favors greater concentration of demand among the most highly popular works.[17]  Salons and coffee houses were important social institutions in seventeenth-century France and eighteenth-century England. Scudéry herself conducted at her Paris home an important salon known as Samedi:

the main purpose of the salon was for amusement. Among the activities were excursions, elegant dinners, and surprise visits to friends staying in the country. The glory of a certain pastry shop in rue Saint-Honoré that Mlle de Scudéry and her friends loved to frequent has come down to us and we also know of Mme Aragonais' dolls, which the ladies of the Samedi dressed in the current mode. Other diversions were the experiments done by Claude Perrault, architect and anatomist, to observe the chameleon's ability to change color according to its environment. ... Poems were exchanged, of course, as were certain gallantries...[18]

The vibrant salon world of seventeenth-century France created extensive, powerful channels for social influence. Social influence arising from these salons, and from Scudéry's position as a leading salonnière, is probably an important part of the explanation for the long tale.

starting to read Richardson's Clarissa

The communication industry has changed greatly since the time of Artamène and Clarissa.   The average duration of online videos watched in the U.S. in March, 2008, was only 2.8 minutes per video.  That's much, much less than the 250 hours it probably took to read the best-seller Artamène in seventeenth-century France.   Less vertical integration on the supply side and less social influence on the demand side may be an important part of the explanation for this huge difference.

Notes:

[1] Artamène is available online. While the authorship of the work is not obvious, most scholars believe that Madeleine de Scudéry wrote it. The online source states that the first edition had 13,095 pages, while the online (1656) edition has 7443 pages. If that's correct, the first edition must have had either a very large typeface or widely spaced lines. Wikipedia lists the word count as 2.1 million. I've verified the plausibility of this figure with page sampling from the online edition. Clarissa is also available online. My page count is first-edition text pages, as documented in Sale (1969) pp. 45-8. The word count is from the online edition; see long-tale data.
[2] Calculation based on a typical reading speed of 250 words per minute, and a typical speed for spoken text of 140 words per minute.
[3] Cousin (1886) v. 1, p. 2.
[4] Newman (2003) p. 1.
[5] Aronson (1978) pp. 54, 82.
[6] Keymer (1994) pp. 392-3. The later figure is based on scaling Rivington's revenue figures.
[7] Raven (1987) pp. 15, 40.
[8] Page count for first edition, based on Sale (1969) pp. 70-4. Word count scaled from words in the online volume 4. See long-tale data.
[9] Eaves and Kimpel (1971) pp. 384, 401.
[10] Keymer and Sabor (2005) p. 20; Raven (1987) p. 15.
[11] Eaves and Kimpel (1971) p. 490.
[12] Mott (1966) p. 304. Grandison, also published in the U.S. in 1786, was a "better seller" (not quite a best-seller) from 1786-1789.
[13] The page count and word count are from Wikipedia, here and here.
[14] Tadié (2000) p. 595. According to a history of Éditions Gallimard, which became Proust's publisher, the company sold more than four million copies of À la recherche du temps perdu (in French, apparently worldwide) in seventy years through its copyright expiration in 1987. A significant share of these copies may have been purchased due to course assignments.
[15] Mott (1966) App. A.
[16] From WikiAnswers here and here. Included in long-tale data.
[17] See Salganik, Dodds, and Watts (2006).
[18] Aronson (1978) p. 39.

References:

Aronson, Nicole. 1978. Mademoiselle de Scudéry. Boston: Twayne Publishers.

Cousin, Victor. 1886. La société française au XVIIe siècle d'après Le Grand Cyrus de Mlle de Scudéry. Paris: Perrin & Cie.

Eaves, Thomas Cary Duncan, and Ben D. Kimpel. 1971. Samuel Richardson: a biography. Oxford: Clarendon.

Keymer, Tom. 1994. "Clarissa's Death, Clarissa's Sale, and the Text of the Second Edition." Review of English Studies, New Series, v. xlv, n. 179, pp. 389-96.

Keymer, Tom, and Peter Sabor. 2005. Pamela in the marketplace: literary controversy and print culture in eighteenth-century Britain and Ireland. Cambridge: Cambridge University Press.

Mott, Frank Luther. 1966. Golden multitudes: the story of best sellers in the United States.

Newman, Karen. 2003. "Volume Editor's Introduction," in Scudéry, Madeleine de, and Karen Newman. 2003. The story of Sapho. Chicago: University of Chicago Press.

Raven, James. 1987. British fiction, 1750-1770: a chronological check-list of prose fiction printed in Britain and Ireland. Newark: University of Delaware Press.

Sale, William Merritt. 1969 [1936]. Samuel Richardson: A Bibliographical Record of His Literary Career with Historical Notes. Archon Books.

Salganik, Matthew J., Peter Sheridan Dodds, and Duncan J. Watts. 2006. "Experimental study of inequality and unpredictability in an artificial cultural market," Science, 311, 854-856 (2006).

Tadié, Jean-Yves. 2000. Marcel Proust. New York: Viking.

debating the long tail

Getting a good, well-understood model often takes you three-quarters of the way toward solving a class of problems. For business analysis, persons' choices among a large number of symbolic items still lack a good, well-understood model.

Among a set of similarly instantiated symbolic items in a given domain of choice, item popularity vs item rank in log-log coordinates typically can be well described by a straight line.  Put differently, a power-law distribution typically provides a reasonably good model for the aggregate pattern of choices.   In a comment referring to a graph of Facebook app popularity, Chris Anderson seems to describe his Long Tail theory as equivalent to observing a straight line in log-log space:

A Long Tail is a powerlaw distribution, which looks exactly like what you've shown. All powerlaws have a huge drop-off like that--but the tail being long (get it?) the area under what appears to almost nothing adds up to a lot. The only way you can tell whether it really does conform to the theory or not is to plot it log-log and see if it's a straight line.

At least under one definition of heavy-tailed distribution, all power laws are heavy-tailed. This meaning of heavy-tailed largely concerns the interpretation of observables and the management of risk.  Heavy-tailed distributions are associated with rarely observed, difficult-to-predict outcomes that can dominate values of concern, such as aggregate profits.  From this perspective, power laws and other extreme-value distributions are characteristic features of blockbuster-oriented, highly unpredictable businesses.[1]

As Anand Rajaraman insightfully observes, the Internet has produced powerful tools for communicating among persons and for observing and aggregating users' choices and ratings.  The process that produces blockbusters now depends more on influence among users.  Experiments indicate that increasing social influence increases the unpredictability of success.[2]  But even when blockbusters depend on more centralized, directed marketing campaigns, blockbusters have always been highly unpredictable.[3]

The slope of the approximating straight line for item popularity vs item rank in log-log coordinates offers a rough means for distinguishing between a blockbuster-oriented business and a niche-oriented business.  The less the absolute value of the slope, the more business is distributed across relatively low popularity items.  An even simpler index is the popularity of the most popular item.  That's the intercept of the item-popularity vs item-rank line where the x-axis goes from 1 to the number of items available.  But this extremely simple index ignores most of the data: unusual circumstances may determine the popularity of the most popular item and make the approximating line fit badly for the top-ranked item.  Thus the slope of the approximating line is probably a better, simple description of the business.[4]
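Here is a minimal sketch of how such a slope and intercept might be computed. It assumes an ordinary least-squares fit of log popularity on log rank, which is only one of several reasonable ways to draw the approximating line. The two data sets are synthetic exact power laws made up for illustration, not real business data, and the function name is hypothetical.

```python
import math


def loglog_fit(popularity):
    """Fit a least-squares line to (log10 rank, log10 popularity).

    `popularity` is a list of positive numbers, one per item.
    Returns (slope, intercept); the intercept is the fitted log10
    popularity at rank 1, since log10(1) = 0.
    """
    ranked = sorted(popularity, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(p) for p in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic examples (assumed for illustration only): 200 items each,
# with the top item at popularity 1000.
blockbuster = [1000.0 * rank ** -1.2 for rank in range(1, 201)]
niche = [1000.0 * rank ** -0.6 for rank in range(1, 201)]

slope_b, intercept_b = loglog_fit(blockbuster)
slope_n, intercept_n = loglog_fit(niche)

print(f"blockbuster-like: slope {slope_b:.2f}, intercept {intercept_b:.2f}")
print(f"niche-like:       slope {slope_n:.2f}, intercept {intercept_n:.2f}")
```

Both synthetic businesses have the same intercept, meaning the same top-item popularity, yet very different slopes. That's the sense in which the intercept alone ignores most of the data.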

Evidence is mixed on the evolving importance of blockbuster businesses relative to niche businesses in symbolic economies.   Because a large number of possible names has been freely available since the invention of language (supply side), studying the distribution of chosen names is a good way to isolate demand-side factors in mass symbolic choice.  In England over the past thousand years, given names show a remarkable flattening in the approximating power law beginning about the time of the Industrial Revolution and continuing to the present.  On the other hand, experiments indicate that increasing social influence increases the steepness of the slope, meaning social influence makes popular items relatively more popular. [5] The Internet is plausibly associated with greater social influence, which may be sufficient to reverse apparent long-term trends toward diversification in symbolic choices.

A recent study of business data would have made a greater contribution to understanding symbolic economies if it had paid more attention to defining useful statistics. The study reported that among more than a million tracks offered through Rhapsody in 2006, "the top 10% of titles accounted for 78% of all plays, and the top 1% of titles for 32% of all plays." For just under 16,000 movie titles offered through Quickflix in 2006, "the top 10% of DVDs accounted for 48% of all rentals, and the top 1% for 18% of all rentals."[6] A problem with these statistics is that the total number of titles on offer is changing greatly. Hence statistics such as the "top 10% of titles" and "the top 10% of DVDs" lack enduring significance. Because humans have physically limited brains and communication capabilities, rapidly increasing the total number of symbolic items that persons could choose isn't likely to affect the aggregate pattern of actual choices among relatively popular items.
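A small sketch illustrates why percentage-of-catalog statistics are hard to compare across catalogs of different sizes. It assumes, purely for illustration, that popularity follows an exact power law with a fixed exponent of 1. The function names and catalog sizes are hypothetical and are not taken from the study.

```python
def power_law_weights(n_items, exponent=1.0):
    """Popularity weights proportional to rank ** -exponent."""
    return [rank ** -exponent for rank in range(1, n_items + 1)]


def share_of_top(weights, k):
    """Share of total popularity captured by the k top-ranked items."""
    return sum(weights[:k]) / sum(weights)


results = {}
for n_items in (1_000, 10_000, 100_000):
    weights = power_law_weights(n_items)
    results[n_items] = {
        # Statistic defined relative to catalog size.
        "top_10_percent": share_of_top(weights, n_items // 10),
        # Statistic defined for a fixed number of items.
        "top_100": share_of_top(weights, 100),
        # Relative popularity among the most popular items.
        "first_vs_tenth": weights[0] / weights[9],
    }

for n_items, stats in results.items():
    print(n_items, {name: round(value, 3) for name, value in stats.items()})
```

In this sketch the top-10% share rises as the catalog grows and the top-100 share falls, even though the slope and the relative popularity of the first and tenth items never change. The statistics move because the denominator moves, not because the pattern of choice among popular items has changed.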

Nielsen VideoScan indicates an increasing number of titles are rarely chosen:

The number of titles that sold only a few copies almost doubled for any given week from 2000 to 2005. In the same period, however, the number of titles with no sales at all in a given week quadrupled. Thus the tail represents a rapidly increasing number of titles that sell very rarely or never. ... Moreover, we determined that this is not simply a function of the sharp increase in the number of titles that have come onto the market in recent years, or of the transition from VHS to DVD; it is the truth of the long tail.[7]

The author did not describe how the analysis separated the effects of the sharp increase in total titles from actual choices. Disentangling the effects of an increase in titles is a difficult problem. That the form of the author's statistics depends strongly on the total number of titles suggests that the author hasn't actually figured out how to do that.

Some data indicate a growing business in a small number of titles. With respect to Nielsen VideoScan data:

success is concentrated in ever fewer best-selling titles at the head of the distribution curve. From 2000 to 2005 the number of titles in the top 10% of weekly sales dropped by more than 50%—an increase in concentration that is common in winner-take-all markets.

The effect seems not to be consistent with a linear popularity model, that is, a straight line in log-log coordinates. With respect to Nielsen VideoScan data, the author observes:

The importance of individual best sellers is not diminishing over time. It is growing.

But with respect to Nielsen SoundScan data, the author notes:

although today’s hits may no longer reach the sales volumes typical of the pre-piracy era, an ever smaller set of top titles continues to account for a large chunk of the overall demand for music.

If individual hits decrease in popularity, but an ever smaller set of top titles continues to account for the same large share of demand, then a linear popularity model doesn't describe well what's happening. Perhaps the decrease in sales volume for hits (individual best-sellers) refers to a decrease in the overall demand for (commercially sold) music.

The aggregate characteristics of persons' choices among a nearly infinite set of symbolic goods aren't well understood. But the importance of such choices clearly is increasing. As is conventional, I'll end this post with a call for more research, and for more support for regulators.

Notes:

[1] See De Vany, Arthur S. Hollywood Economics: How Extreme Uncertainty Shapes the Film Industry. Contemporary political economy series. London: Routledge, 2004, and Taleb, Nassim. The Black Swan: The Impact of the Highly Improbable. New York: Random House, 2007.

[2] See Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts, "Experimental study of inequality and unpredictability in an artificial cultural market" Science, 311, 854-856 (2006).

[3] Extensive marketing and promotion may be necessary for traditional-media blockbusters, but it is not sufficient.  See, e.g. De Vany (2004).

[4] Viewed as a distribution of popularity shares, a power-law approximation for item popularity vs. item rank has only one free parameter. The minimum item rank is necessarily one (the most popular item) and the total popularity shares must sum to one. But remember that an approximating line is a model, a tool for analysis, a means for organizing fruitful comparison and discussion. The slope of a linear approximation to the popularity distribution for a range of items of practical interest seems to me to best serve this purpose.

[5] Salganik, Dodds, and Watts (2006).

[6] See Anita Elberse, "Should You Invest in the Long Tail?" Harvard Business Review, July-Aug. 2008.

[7]  This and subsequent quotes are from Elberse (2008).

lack of power laws and other popularity problems

Discussion of tails and popularity seems to be maturing into more comprehensive considerations. Folks aiming for uber-geek status might chat about the difference between power laws and log-normal distributions. This difference has similar consequences to the difference between non-stationary and stationary macroeconomic time series. If those terms are obscure to you, you might just chat about how much bigger infinity is than any other number. It's a huge issue!

Government bureaucrats and other practical, get-the-job-done types might just try to produce some simple, intuitive, and relevant graphs. Consider, for example, log-log graphs of website page traffic by page rank. They show a "drooping tail" relative to an approximating line for the left part of the popularity distribution.

Whether the droop is a typical characteristic of web page popularity distributions is not clear. Surely a relatively large subset of relatively bad (unlinked, search-word-poor, spam-associated) pages could contribute to the droop. On the other hand, the droop could be a typical effect of the usual distribution of page content and general patterns of linking and searching. These two possibilities could be tested by comparing the magnitude of the droop in different websites' page traffic distributions.

Fitting a line to a popularity distribution is more useful as a descriptive technique than as a literal claim that the popularity distribution follows a power law. The term "power law" is not meaningful to most persons. Moreover, knowledge about power laws does not provide a lot of insight into the factors that determine website traffic or trends in website traffic over time.

The possibilities for statistical distributions are not limited to power laws (or power laws and log-normal distributions). The personal behavior and information and communication systems that affect page popularity are complex. They may not be uniform across different circumstances. For example, the factors that govern the popularity of the least popular pages may be rather different from the factors that govern the popularity of the most popular pages.

Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to log-normal distributions and to distributions with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution.

Two website traffic distributions suggest that website traffic distributions may have flattened over the past decade. Traffic to Sun's website pages in July, 1996, according to my calculation, had a descriptive log-log line with slope about -1.1. Traffic to useit.com pages in the summer of 2006 had a descriptive log-log line with slope about -0.8.* This difference may reflect differences between the two websites. It might also indicate flattening over time in typical website traffic distributions. Such flattening was also a general pattern of change in name popularity distributions in England (and the U.S.) from 1800 to the present.

Much more evidence of this sort might lead one to expect website traffic distributions, and perhaps other distributions for digital goods, to continue to flatten in the future. "More of the same" is a very crude sort of prediction. But absent any other relevant information, it might be the best that one can make.

Knowledge concerning the factors that have produced changes in popularity distributions could help to predict and shape future changes. Persons may have chosen less popular names in response to changes in patterns of work and residence that put persons closer together and more extensively and uniformly regulated their interactions. In short, personalization may have been a personal counter-reaction to factories and urbanization (the Industrial Revolution).

Suppose new information and communication technologies favor a more dispersed workforce, and family structure continues to shift toward more single persons and smaller households. My theory then predicts a decrease in personalization and an increased desire to associate with popular symbols.

Ummm, about all current trends indicate that either my theory is wrong, or that there are a lot of other, much more important factors. Does anyone have some better ideas?

* This is the slope measured in log-log coordinates, not in the coordinates of the axes' labels. The intercept on a log(page traffic) axis typically varies greatly with total website traffic. Thus I prefer graphs that have a y-axis labeled with log(page traffic share in total website traffic). An approximating line has the same slope with either labeling for the Y axis.

more discussion about tail size

Once again men are heatedly discussing tail size. Just ponder this question: How large is the long tail? Personally, I'm going to keep looking before I decide for myself.

While it's novel to bring mathematical precision to such matters, unfortunately it seems to me that this mathematical model focuses attention on misleading features. The model says that the share of the k most popular items is log(k)/log(n), where n is the total number of items on offer. Thus, in this model, the total number of items on offer determines the share of the most popular items.
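One way to arrive at a formula of this form is to assume that popularity is inversely proportional to rank, which is a power law with slope -1 in log-log coordinates. That assumption is inferred here from the shape of the formula, not taken from the model's own description. Under it, the exact share of the top k items is a ratio of harmonic sums, and log(k)/log(n) is a rough approximation to that ratio. The sketch below, with hypothetical function names, compares the two and shows how much the model's answer moves with n.

```python
import math


def harmonic(m):
    """Sum of 1/i for i = 1..m."""
    return sum(1.0 / i for i in range(1, m + 1))


def exact_share(k, n):
    """Share of the top k of n items when popularity is proportional to 1/rank."""
    return harmonic(k) / harmonic(n)


def model_share(k, n):
    """The model's formula for the share of the top k of n items."""
    return math.log(k) / math.log(n)


k = 1_000
n_small = 2_000_000
n_large = 6_000_000_000

# The exact sum is computed only for the smaller n; summing six billion
# terms would be slow and would not change the point being made.
exact_small = exact_share(k, n_small)
model_small = model_share(k, n_small)
model_large = model_share(k, n_large)

print(f"n = {n_small:,}: exact {exact_small:.3f}, model {model_small:.3f}")
print(f"n = {n_large:,}: model {model_large:.3f}")
```

With the slope held fixed, the only thing that changes between the two cases is n, and the model's share for the same thousand items drops substantially.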

This isn't a sensible model. Mathematically, a power law describes an infinite number of items on offer. The slope of the power law, or more precisely, the slope of an approximating power law at the high popularity end of the distribution, usually describes well the high-end shares. The question is what determines the slope of the power law. The number of items on offer isn't a good answer to that question, particularly for n varying from two million to six billion.

For a concrete example, consider the popularity of the ten-most-popular given names. The set of possible given names (given names on offer) is huge, and probably hasn't changed much in the past two-hundred years. However, the popularity of the ten-most-popular given names for males in England has fallen from about 85% in 1800 to about 28% in 1994. If you want to understand changes in the popularity of the most popular items in a collection of symbols instantiated and used in a similar way, try to understand this change.

* * *

For additional amusement, here's a post I stuck in the galbithink.org newsfeed a little more than a year ago, back in the time of Web-Pleistocene:

Tail aficionados might enjoy pondering the distinguishing features of the long tail. I think that size, which tail authorities have categorized as long or short, matters less than shape. It should be no surprise to anyone that shape can change over time. For some graphical evidence, see the detailed images here.

So don't just sit around complaining that "diversity plus freedom of choice creates inequality". Power laws don't imply any particular amount of inequality. The power of the powerlaw determines the difference between tails. Look at some examples and see for yourself!
