debating the long tail

Getting a good, well-understood model often takes you three-quarters of the way toward solving a class of problems. Business analysis still lacks a good, well-understood model of persons' choices among a large number of symbolic items.

Among a set of similarly instantiated symbolic items in a given domain of choice, item popularity vs item rank in log-log coordinates typically can be well described by a straight line.  Put differently, a power-law distribution typically provides a reasonably good model for the aggregate pattern of choices.   In a comment referring to a graph of Facebook app popularity, Chris Anderson seems to describe his Long Tail theory as equivalent to observing a straight line in log-log space:

A Long Tail is a powerlaw distribution, which looks exactly like what you've shown. All powerlaws have a huge drop-off like that--but the tail being long (get it?) the area under what appears to almost nothing adds up to a lot. The only way you can tell whether it really does conform to the theory or not is to plot it log-log and see if it's a straight line.
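The log-log check described in that comment can be sketched in a few lines. This is a minimal illustration, not anyone's published method: the function name `fit_loglog_line` and the synthetic data are assumptions made for the example. Keep in mind that a good straight-line fit is a description of the data, not proof that the underlying distribution is a power law.

```python
import math

def fit_loglog_line(popularity):
    """Least-squares line through (log10 rank, log10 popularity).

    `popularity` is a list of positive counts. It is sorted here so that
    rank 1 is the most popular item. Returns (slope, intercept, r_squared).
    """
    counts = sorted(popularity, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared

# Hypothetical data: an exact power law with exponent -1 over 500 items.
synthetic = [10000.0 / rank for rank in range(1, 501)]
slope, intercept, r_squared = fit_loglog_line(synthetic)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

With real data the counts must all be positive, since items with zero popularity have no logarithm and need separate treatment.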

At least under one definition of heavy-tailed distribution, all power laws are heavy-tailed. This meaning of heavy-tailed largely concerns the interpretation of observables and the management of risk.  Heavy-tailed distributions are associated with rarely observed, difficult-to-predict outcomes that can dominate values of concern, such as aggregate profits.  From this perspective, power laws and other extreme-value distributions are characteristic features of blockbuster-oriented, highly unpredictable businesses.[1]

As Anand Rajaraman insightfully observes, the Internet has produced powerful tools for communicating among persons and for observing and aggregating users' choices and ratings.  The process that produces blockbusters now depends more on influence among users.  Experiments indicate that increasing social influence increases the unpredictability of success.[2]  But even when blockbusters depend on more centralized, directed marketing campaigns, blockbusters have always been highly unpredictable.[3]

The slope of the approximating straight line for item popularity vs item rank in log-log coordinates offers a rough means for distinguishing between a blockbuster-oriented business and a niche-oriented business. The smaller the absolute value of the slope, the more business is distributed across relatively low-popularity items. An even simpler index is the popularity of the most popular item. That's the intercept of the item-popularity vs item-rank line where the x-axis goes from 1 to the number of items available. But this extremely simple index ignores most of the data: unusual circumstances may determine the popularity of the most popular item and make the approximating line fit badly for the top-ranked item. Thus the slope of the approximating line is probably a better simple description of the business.[4]
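To make the slope's meaning concrete, here is a rough sketch that compares two idealized businesses. The slopes (-1.5 and -0.5), the catalogue size, and the function names are all hypothetical choices for illustration. Popularity shares are taken to be proportional to rank raised to the slope and normalized to sum to one.

```python
def popularity_shares(slope, n_items):
    """Shares proportional to rank**slope, normalised to sum to one."""
    weights = [rank ** slope for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def share_outside_top(shares, k):
    """Fraction of total popularity held by items ranked below the top k."""
    return sum(shares[k:])

n_items = 10000  # hypothetical catalogue size
steep = popularity_shares(-1.5, n_items)  # hypothetical blockbuster-oriented slope
flat = popularity_shares(-0.5, n_items)   # hypothetical niche-oriented slope

for label, shares in (("steep", steep), ("flat", flat)):
    print(label,
          "top item share:", round(shares[0], 3),
          "share outside top 100:", round(share_outside_top(shares, 100), 3))
```

In this sketch the flatter line leaves most of the business outside the top 100 items, while the steeper line concentrates it in the head.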

Evidence is mixed on the evolving importance of blockbuster businesses relative to niche businesses in symbolic economies. Because a large number of possible names has been freely available since the invention of language (supply side), studying the distribution of chosen names is a good way to isolate demand-side factors in mass symbolic choice. In England over the past thousand years, given names show a remarkable flattening in the approximating power law beginning about the time of the Industrial Revolution and continuing to the present. On the other hand, experiments indicate that increasing social influence increases the steepness of the slope, meaning social influence makes popular items relatively more popular.[5] The Internet is plausibly associated with greater social influence, which may be sufficient to reverse apparent long-term trends toward diversification in symbolic choices.

A recent study of business data would have made a greater contribution to understanding symbolic economies with more attention to defining useful statistics. The study reported that among more than a million tracks offered through Rhapsody in 2006, "the top 10% of titles accounted for 78% of all plays, and the top 1% of titles for 32% of all plays."  For just under 16,000 movie titles offered through Quickflix in 2006, "the top 10% of DVDs accounted for 48% of all rentals, and the top 1% for 18% of all rentals."[6] A problem with these statistics is that the total number of titles on offer is changing greatly. Hence statistics such as the "top 10% of titles" and "the top 10% of DVDs" lack enduring significance. Because humans have physically limited brains and communication capabilities, rapidly increasing the total number of symbolic items that persons could choose isn't likely to affect the aggregate pattern of actual choices among relatively popular items.
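A small sketch shows why a "top 10%" statistic can move even when nothing about choices among popular items changes. It assumes an idealized power law with a fixed, hypothetical slope of -1 and simply enlarges the catalogue. The function names and catalogue sizes are illustrative, not drawn from the study.

```python
def top_percent_share(slope, n_items, percent):
    """Share of all plays going to the top `percent` of titles when plays
    are proportional to rank**slope (an assumed, idealised power law)."""
    weights = [rank ** slope for rank in range(1, n_items + 1)]
    cutoff = max(1, n_items * percent // 100)
    return sum(weights[:cutoff]) / sum(weights)

def rank_ratio(slope, low_rank, high_rank):
    """Popularity at `low_rank` relative to popularity at `high_rank`."""
    return (low_rank ** slope) / (high_rank ** slope)

slope = -1.0  # hypothetical slope, held fixed while the catalogue grows
shares_by_size = {n: top_percent_share(slope, n, 10)
                  for n in (1000, 10000, 100000)}

for n, share in shares_by_size.items():
    print(n, "titles: top 10% share =", round(share, 3))

# The relative popularity of rank 1 vs rank 100 does not depend on how
# many titles are on offer under this model.
print("rank 1 vs rank 100:", rank_ratio(slope, 1, 100))
```

Under this assumed slope the top-10% share rises as titles are added, while the relative popularity of the items at fixed ranks stays the same. Other slopes behave differently, which is another reason the statistic is hard to interpret on its own.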

Nielsen VideoScan indicates an increasing number of titles are rarely chosen:

The number of titles that sold only a few copies almost doubled for any given week from 2000 to 2005. In the same period, however, the number of titles with no sales at all in a given week quadrupled. Thus the tail represents a rapidly increasing number of titles that sell very rarely or never. ... Moreover, we determined that this is not simply a function of the sharp increase in the number of titles that have come onto the market in recent years, or of the transition from VHS to DVD; it is the truth of the long tail.[7]

The author did not describe how the analysis separated the effects of the sharp increases in total titles from actual choices. Disentangling the effects of an increase in titles is a difficult problem. That the form of the author's statistics depends strongly on the total number of titles suggests that the author hasn't actually figured out how to do that.

Some data indicate a growing business in a small number of titles. With respect to Nielsen VideoScan data:

success is concentrated in ever fewer best-selling titles at the head of the distribution curve. From 2000 to 2005 the number of titles in the top 10% of weekly sales dropped by more than 50%—an increase in concentration that is common in winner-take-all markets.

The effect seems not to be consistent with a linear popularity model. With respect to Nielsen VideoScan data, the author observes:

The importance of individual best sellers is not diminishing over time. It is growing.

But with respect to Nielsen SoundScan data, the author notes:

although today’s hits may no longer reach the sales volumes typical of the pre-piracy era, an ever smaller set of top titles continues to account for a large chunk of the overall demand for music.

If individual hits decrease in popularity, but an ever smaller set of top titles continues to account for the same large share of demand, then a linear popularity model doesn't describe what's happening well. Perhaps the decrease in sales volume for hits (individual best-sellers) refers to a decrease in the overall demand for (commercially sold) music.

The aggregate characteristics of persons' choices among a nearly infinite set of symbolic goods aren't well understood. But the importance of such choices clearly is increasing. As is conventional, I'll end this post with a call for more research, and for more support for regulators.

Notes:

[1] See De Vany, Arthur S. Hollywood Economics: How Extreme Uncertainty Shapes the Film Industry. Contemporary political economy series. London: Routledge, 2004, and Taleb, Nassim. The Black Swan: The Impact of the Highly Improbable. New York: Random House, 2007.

[2] See Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts, "Experimental study of inequality and unpredictability in an artificial cultural market" Science, 311, 854-856 (2006).

[3] Extensive marketing and promotion may be necessary for traditional-media blockbusters, but it is not sufficient.  See, e.g. De Vany (2004).

[4] Viewed as a distribution of popularity shares, a power-law approximation for item popularity vs. item rank has only one free parameter. The minimum item rank is necessarily one (the most popular item) and the total popularity shares must sum to one. But remember that an approximating line is a model, a tool for analysis, a means for organizing fruitful comparison and discussion. The slope of a linear approximation to the popularity distribution for a range of items of practical interest seems to me to best serve this purpose.

[5] Salganik, Dodds, and Watts (2006).

[6] See Anita Elberse, "Should You Invest in the Long Tail?" Harvard Business Review, July-Aug. 2008.

[7]  This and subsequent quotes are from Elberse (2008).


lack of power laws and other popularity problems

Discussion of tails and popularity seems to be maturing into more comprehensive considerations. Folks aiming for uber-geek status might chat about the difference between power laws and log-normal distributions. This difference has similar consequences to the difference between non-stationary and stationary macroeconomic time series. If those terms are obscure to you, you might just chat about how much bigger infinity is than any other number. It's a huge issue!

Government bureaucrats and other practical, get-the-job-done types might just try to produce some simple, intuitive, and relevant graphs. Consider, for example, log-log graphs of website page traffic by page rank. They show a "drooping tail" relative to an approximating line for the left part of the popularity distribution.

Whether the droop is a typical characteristic of web page popularity distributions is not clear. Surely a relatively large subset of relatively bad (unlinked, search-word-poor, spam-associated) pages could contribute to the droop. On the other hand, the droop could be a typical effect of the usual distribution of page content and general patterns of linking and searching. These two possibilities could be tested by comparing the magnitude of the droop in different websites' page traffic distributions.
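One rough way to put a number on the droop, so that it can be compared across websites, is to fit the line to the head of the distribution only and then average how far the remaining pages fall below that line. The sketch below is just one possible measure. The function names, the choice of 200 head pages, and the synthetic traffic figures are all assumptions for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def tail_droop(traffic, head_count):
    """Fit a log-log line to the `head_count` most popular pages, then
    return (head slope, mean log10 gap between the remaining pages and
    that line). A negative gap means the tail droops below the line.
    All traffic values must be positive."""
    counts = sorted(traffic, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    slope, intercept = fit_line(xs[:head_count], ys[:head_count])
    gaps = [y - (intercept + slope * x)
            for x, y in zip(xs[head_count:], ys[head_count:])]
    return slope, sum(gaps) / len(gaps)

# Hypothetical site: slope -1 for the top 200 pages, faster decay after.
drooping = [1e6 / r if r <= 200 else 1e6 / r * (200 / r)
            for r in range(1, 1001)]
# Hypothetical site with no droop at all.
straight = [1e6 / r for r in range(1, 1001)]

slope_head, droop = tail_droop(drooping, 200)
_, droop_straight = tail_droop(straight, 200)
print(round(slope_head, 3), round(droop, 3), round(droop_straight, 3))
```

The result depends on where the head is cut off, so comparisons across sites would need a consistent rule for choosing `head_count`.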

Fitting a line to a popularity distribution is more useful as a descriptive technique than as a literal claim that the popularity distribution follows a power law. The term "power law" is not meaningful to most persons. Moreover, knowledge about power laws does not provide a lot of insight into the factors that determine website traffic or trends in website traffic over time.

The possibilities for statistical distributions are not limited to power laws (or power laws and log-normal distributions). The personal behavior and information and communication systems that affect page popularity are complex. They may not be uniform across different circumstances. For example, the factors that govern the popularity of the least popular pages may be rather different than the factors that govern the popularity of the most popular pages.

Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to log-normal distributions and to distributions with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution.

Two website traffic distributions suggest that website traffic distributions may have flattened over the past decade. Traffic to Sun's website pages in July, 1996, according to my calculation, had a descriptive log-log line with slope about -1.1. Traffic to useit.com pages in the summer of 2006 had a descriptive log-log line with slope about -0.8.* This difference may reflect differences between the two websites. It might also indicate flattening over time in typical website traffic distributions. Such flattening was also a general pattern of change in name popularity distributions in England (and the U.S.) from 1800 to the present.

Much more evidence of this sort might lead one to expect website traffic distributions, and perhaps other distributions for digital goods, to continue to flatten in the future. "More of the same" is a very crude sort of prediction. But absent any other relevant information, it might be the best that one can make.

Knowledge concerning the factors that have produced changes in popularity distributions could help to predict and shape future changes. Persons may have chosen less popular names in response to changes in patterns of work and residence that put persons closer together and more extensively and uniformly regulated their interactions. In short, personalization may have been a personal counter-reaction to factories and urbanization (the Industrial Revolution).

Suppose new information and communication technologies favor a more dispersed workforce, and family structure continues to shift toward more single persons and smaller households. My theory then predicts a decrease in personalization and an increased desire to associate with popular symbols.

Ummm, about all current trends indicate that either my theory is wrong, or that there are a lot of other, much more important factors. Does anyone have some better ideas?

* This is the slope measured in log-log coordinates, not in the coordinates of the axes' labels. The intercept on a log(page traffic) axis typically varies greatly with total website traffic. Thus I prefer graphs that have a y-axis labeled with log(page traffic share in total website traffic). An approximating line has the same slope with either labeling for the y-axis.
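The claim that the slope is the same under either labeling follows because dividing every page's traffic by the total only subtracts a constant, log(total), from every y value. A quick check, using made-up page-view counts and a hypothetical helper name:

```python
import math

def loglog_slope_intercept(values):
    """Least-squares slope and intercept of log10(value) on log10(rank).
    `values` must be positive; they are ranked from largest to smallest."""
    ordered = sorted(values, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(v) for v in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical page views for the ten pages of a small site.
page_traffic = [50000, 21000, 12500, 9000, 6100, 4800, 3900, 3300, 2700, 2400]
total = sum(page_traffic)
shares = [v / total for v in page_traffic]

slope_counts, intercept_counts = loglog_slope_intercept(page_traffic)
slope_shares, intercept_shares = loglog_slope_intercept(shares)

# Same slope; the intercepts differ by log10(total).
print(round(slope_counts, 4), round(slope_shares, 4))
print(round(intercept_counts - intercept_shares, 4), round(math.log10(total), 4))
```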
