describing name frequency distributions

The most popular given names have become much less popular over the past two centuries.  In the U.K. about 1800, about 85% of males and 82% of females had names that were among the ten most popular given names for males and females, respectively.  Apart from temporary effects of the Norman Conquest in 1066 and the Black Death in the mid-fourteenth century, given names seem to have had a similar distribution from 1000 to 1800.  But after 1800, the name distribution flattened.  By 1994, the share of males and females with given names among the ten most popular given names had fallen to 28% and 24%.[1]  That seems to me to be a quite astonishing change in an important class of symbols.

Describing the name distribution as flattening is a simple way to describe the change from 1800 to the present.   More specifically, plot name popularity (frequency of a name divided by the total number of named persons in the sample) by popularity rank.  Approximate this plot across some range of ranks by a line.  The slope of that line has flattened over the past two centuries.  That's equivalent to the complementary distribution function of name frequency increasing in slope at the high end of the name frequency distribution.[2]
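The description above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and ordinary least squares in log-log coordinates is an assumed fitting method, not one the post specifies.

```python
import math

def popularity_by_rank(counts):
    """Return name popularities (count / total persons), most popular first."""
    total = sum(counts)
    return [c / total for c in sorted(counts, reverse=True)]

def loglog_slope(counts, first_rank=1, last_rank=None):
    """Least-squares slope of log(popularity) on log(rank) over a rank range.

    Assumes every count in the chosen range is positive and that the
    range covers at least two ranks.
    """
    shares = popularity_by_rank(counts)
    last_rank = last_rank or len(shares)
    ranks = range(first_rank, last_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(shares[r - 1]) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

A slope closer to zero for a later sample than for an earlier one, over the same range of ranks, is what "flattening" means here.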

My work on given name frequency distributions does not support claiming that given names follow a power-law distribution.  To the extent that I have in the past made such a claim, please recognize that I provided no statistical support for that claim.  You can find evidence that I don't take such a claim seriously.  I'm interested in understanding major changes in symbolic choices such as the change in name popularity over the past two centuries.  Given this interest, expending effort on estimating a stationary statistical model for the name distribution doesn't seem to me worthwhile.  I hereby explicitly renounce any claim that given names follow a power-law distribution.

Moreover, I heartily recommend recent work on estimating a power-law model, evaluating its goodness of fit, and comparing it to alternative statistical models.  With their article, "Power-Law Distributions in Empirical Data," Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman provide a clear exposition of power laws, useful estimation strategies (see especially equation 3.7), and analysis of twenty-four real-world data sets.[3]  Even better, they have made available on the web code, in several languages, that implements their estimation methods.  They have also made their test data sets available through the web to the extent that they could.  In short, their work is an outstanding example of actual, significant advancement of public knowledge.
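For readers who want a feel for the estimation strategy, here is a minimal Python sketch of an approximate maximum-likelihood estimate of the exponent for discrete data. It assumes the formula alpha ≈ 1 + n / Σ ln(x_i / (x_min − 1/2)), summed over the n observations at or above x_min; check it against the article before relying on it. The function name is hypothetical, x_min is taken as given rather than estimated, and no goodness-of-fit test is included. The authors' own published code is the better choice for real work.

```python
import math

def approx_discrete_alpha(values, x_min):
    """Approximate MLE of a power-law exponent for discrete data.

    Assumed formula: alpha ~ 1 + n / sum(ln(x_i / (x_min - 0.5))),
    using only observations with x_i >= x_min. Choosing x_min and
    testing the fit are separate steps not shown here.
    """
    if x_min < 1:
        raise ValueError("x_min must be at least 1 for discrete data")
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum
```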

Notes:

[1]  These figures are from Galbi, Douglas (2002), Long-Term Trends in Personal Given Name Frequencies in the UK, Table 1.

[2]  Name frequencies divided by sample sizes give name popularities.  Moving leftward from the high-end name frequencies and assuming distinct frequency ranks, the empirical complementary distribution function increases by constant probability increments equal to the inverse of the total number of distinct names in the sample.  Hence, up to scaling parameters, the popularity/rank plot is the complementary distribution function with its axes interchanged, i.e. reflected about the line x = y.  Plotting in log-log space turns the scaling parameters into shifts along the axes, which leave the slope unchanged.

[3]  One of their data sets is surname frequency from the U.S. Census of 1990.   Their analysis favors for these data, above a minimum frequency threshold, a power-law with exponential cut-off.   A power-law distribution and a log-normal distribution are not clearly rejected.   However, these statistics shouldn't be taken too seriously.  Compare the source Census surname data set to Clauset, Shalizi, and Newman's constructed surname frequency data set.  They constructed surname frequencies from surname frequency percent shares for surnames with shares above 0.005, reported with only one significant digit.   Even at their estimated x-min (upper frequencies), the reported surname share has only two significant digits (0.0045).  Hence, while the computed frequencies are reported with seven significant digits, their accuracy in most cases is much less.   That the analysis of surname frequencies doesn't employ good data probably isn't important.  A statistical model for surnames seems to me less interesting than a statistical model for given names.  Moreover, as discussed above, I think the most interesting issue for given names is distribution dynamics.   I hope that smart statisticians will work on understanding given-name distribution dynamics.

value of given name data

Given names are an important but under-appreciated type of data.   Given names represent significant symbolic choices.  Large populations of persons have been making this well-defined symbolic choice for millennia.  Given names are thus useful data for studying symbolic choice, effects of communication technologies, and information economics.

Given name frequency data are now also important to valuable new population estimation techniques.  Survey costs are typically directly related to sample size.  Most persons, however, know many other persons and can provide information about persons that they know.  So, for example, if you want to estimate how many persons in the U.S. openly blog regularly, you could ask a sample of persons whether they blog regularly, and also ask them how many persons they know who blog regularly.  The sample size is then effectively scaled up by the size of personal networks within the U.S. with sufficiently informed connections to know if a personal connection blogs.  That scale-up might be a factor of about 500.

Research on scale-up estimates has used given name frequency data in making estimates.  Given names provide a good means for estimating personal network size, i.e. the number of persons that someone knows.  Most persons probably could not answer well the question, "How many persons do you know?"  But they can answer quite well, "How many persons named Bao do you know ?"  Answers to that question, combined with data on the frequency of the name Bao in the population, can be used to compute a good estimate of personal network size.  Defining "know" to mean "talk to each other about personal interests at least once a month" might provide an estimate of personal network size relevant to a scale-up estimate of the total population of bloggers.
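The arithmetic behind these two steps can be sketched as follows. This is the basic ratio form of the calculation under simplifying assumptions (everyone is equally likely to know a person of any given name, and respondents recall accurately); the research discussed here uses more elaborate statistical models. Function names and all numbers are hypothetical.

```python
def network_size(known_by_name, population_by_name, population_total):
    """Estimate how many persons a respondent knows.

    known_by_name: e.g. {"Bao": 1}, answers to "How many persons named X do you know?"
    population_by_name: number of persons in the population with each name.
    """
    names = [n for n in known_by_name if n in population_by_name]
    known = sum(known_by_name[n] for n in names)
    named = sum(population_by_name[n] for n in names)
    if named == 0:
        raise ValueError("no name frequency data for the names asked about")
    return population_total * known / named

def scale_up_total(known_in_group, network_sizes, population_total):
    """Estimate the size of a group (e.g. regular bloggers) from a sample.

    known_in_group[i]: how many group members respondent i knows.
    network_sizes[i]: estimated personal network size of respondent i.
    """
    return population_total * sum(known_in_group) / sum(network_sizes)

# Hypothetical numbers: one known Bao, 2,000 Baos in a population of 1,000,000.
size = network_size({"Bao": 1}, {"Bao": 2000}, 1_000_000)
```

Asking about several names at once and pooling the answers, as `network_size` allows, reduces the noise from any single rare name.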

Governments could relatively easily make good data on given names freely available.  Good data on given names would consist of a large, random sample of given names, along with the person's sex, age or age range, geographic region, and race/ethnicity, if reported.  In the administration of various government programs, governments collect large datasets that include such information.  Ensuring that the finest category intersection had at least a few data points would provide sufficient privacy for personal information that is not highly sensitive and that is widely known in any case.
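A minimal sketch of such a minimum-cell-size rule, assuming a "cell" is one combination of the category fields and that records in cells below the threshold are simply withheld. The function name, field names, and the threshold of 5 are illustrative assumptions, not an established standard.

```python
from collections import Counter

def releasable_records(records, category_fields, min_cell_size=5):
    """Keep only records whose category combination occurs often enough.

    records: list of dicts, e.g. {"name": "Ann", "sex": "F", "region": "NE"}.
    category_fields: the fields whose intersection defines a cell.
    """
    def cell(record):
        return tuple(record.get(field) for field in category_fields)

    cell_counts = Counter(cell(r) for r in records)
    return [r for r in records if cell_counts[cell(r)] >= min_cell_size]
```

Coarsening a category (for example, wider age ranges) is an alternative to withholding records when many cells fall below the threshold.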

Making such data freely available would contribute to valuable public knowledge.    The conclusion to an important paper on scale-up estimates noted:

Though the methods presented here account for bias in individual degree estimation in ways that are not present in other methods, they are only as good as the available data on the demographics of first names. Using "How many X's do you know?" data to estimate person network size requires knowing the number of people in the population with the different first names.  In many countries such information may not be available.[*]

Such information undoubtedly exists.  Not making it available is an intellectual and economic waste.  Communication economists, statisticians, and others potentially can create considerable public value from analysis of given names.

*  *  *  *

[*] McCormick TH, Salganik MJ, Zheng T (2008) How many people do you know?: Efficiently estimating personal network size. Journal of the American Statistical Association, forthcoming.  Tian Zheng provides a good overview of the technique in this presentation.

reasoning about symbolic choices

Parents usually consider carefully and at length what name to give to their new-born children.  Recent research shows that given names that increase faster in popularity also decrease faster in popularity.  According to survey evidence, parents reason that names that are rapidly increasing in popularity are less likely to have enduring appeal.  Hence parents are less likely to choose those names.[*]

This sort of reasoning is relatively sophisticated.  Persons concerned about product quality might reason about aggregate sales of the product.  Reasoning about the slope of aggregate demand for a product is less common.  Concern about inter-personal choice effects (fads) and a long-term horizon for valuing a good favor aggregate, dynamic reasoning in individual choices such as that for children's names.

The sophistication of reasoning in personal naming sheds some light on the large change in the shape of the given name popularity distribution that began early in the nineteenth century.  Major twentieth-century changes in media have registered little effect in magazine advertising, which is a type of aggregate symbolic distribution.  Changes in the information economy early in the nineteenth century are less obvious.  However, sophisticated reasoning about symbolic choices can produce large changes that have relatively obscure relations to aggregate circumstances.

*  *  *  *  *

[*]  See Jonah Berger and Gaël Le Mens. 2009. "How adoption speed affects the abandonment of cultural tastes." Proceedings of the National Academy of Sciences, doi:10.1073/pnas.0812647106

debating the long tail

Getting a good, well-understood model often takes you three-quarters of the way toward solving a class of problems.   Persons' choices among a large number of symbolic items still lack a good, well-understood model for business analysis.

Among a set of similarly instantiated symbolic items in a given domain of choice, item popularity vs item rank in log-log coordinates typically can be well described by a straight line.  Put differently, a power-law distribution typically provides a reasonably good model for the aggregate pattern of choices.   In a comment referring to a graph of Facebook app popularity, Chris Anderson seems to describe his Long Tail theory as equivalent to observing a straight line in log-log space:

A Long Tail is a powerlaw distribution, which looks exactly like what you've shown. All powerlaws have a huge drop-off like that--but the tail being long (get it?) the area under what appears to almost nothing adds up to a lot. The only way you can tell whether it really does conform to the theory or not is to plot it log-log and see if it's a straight line.

At least under one definition of heavy-tailed distribution, all power laws are heavy-tailed. This meaning of heavy-tailed largely concerns the interpretation of observables and the management of risk.  Heavy-tailed distributions are associated with rarely observed, difficult-to-predict outcomes that can dominate values of concern, such as aggregate profits.  From this perspective, power laws and other extreme-value distributions are characteristic features of blockbuster-oriented, highly unpredictable businesses.[1]

As Anand Rajaraman insightfully observes, the Internet has produced powerful tools for communicating among persons and for observing and aggregating users' choices and ratings.  The process that produces blockbusters now depends more on influence among users.  Experiments indicate that increasing social influence increases the unpredictability of success.[2]  But even when blockbusters depend on more centralized, directed marketing campaigns, blockbusters have always been highly unpredictable.[3]

The slope of the approximating straight line for item popularity vs item rank in log-log coordinates offers a rough means for distinguishing between a blockbuster-oriented business and a niche-oriented business.  The less the absolute value of the slope, the more business is distributed across relatively low popularity items.  An even simpler index is the popularity of the most popular item.  That's the intercept of the item-popularity vs item-rank line where the x-axis goes from 1 to the number of items available.  But this extremely simple index ignores most of the data: unusual circumstances may determine the popularity of the most popular item and make the approximating line fit badly for the top-ranked item.  Thus the slope of the approximating line is probably a better, simple description of the business.[4]
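To make the two indices concrete, here is a small Python sketch that builds popularity shares from an exact power law and compares a steeper and a flatter slope. The item count and the two slopes are illustrative values, and real data will only approximately follow such a line.

```python
def power_law_shares(n_items, slope):
    """Shares proportional to rank**slope (slope < 0), normalized to sum to one."""
    weights = [rank ** slope for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def top_share(shares, k):
    """Combined share of the k most popular items."""
    return sum(sorted(shares, reverse=True)[:k])

# Illustrative comparison: same number of items, different slopes.
blockbuster = power_law_shares(1000, -1.1)   # steeper line
niche = power_law_shares(1000, -0.8)         # flatter line

print("top item share:", blockbuster[0], niche[0])
print("top-10 share:  ", top_share(blockbuster, 10), top_share(niche, 10))
```

In this idealized form the popularity of the top item is fully determined by the slope and the number of items, which is why it adds little information beyond the slope when the line fits well.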

Evidence is mixed on the evolving importance of blockbuster businesses relative to niche businesses in symbolic economies.   Because a large number of possible names has been freely available since the invention of language (supply side), studying the distribution of chosen names is a good way to isolate demand-side factors in mass symbolic choice.  In England over the past thousand years, given names show a remarkable flattening in the approximating power law beginning about the time of the Industrial Revolution and continuing to the present.  On the other hand, experiments indicate that increasing social influence increases the steepness of the slope, meaning social influence makes popular items relatively more popular. [5] The Internet is plausibly associated with greater social influence, which may be sufficient to reverse apparent long-term trends toward diversification in symbolic choices.

A recent study of business data would have made a greater contribution to understanding symbolic economics with more attention to defining useful statistics.   The study reported that among more than a million tracks offered through Rhapsody in 2006, "the top 10% of titles accounted for 78% of all plays, and the top 1% of titles for 32% of all plays."  For just under 16,000 movie titles offered through Quickflix in 2006, "the top 10% of DVDs accounted for 48% of all rentals, and the top 1% for 18% of all rentals."[6] A problem with these statistics is that the total number of titles on offer is changing greatly.  Hence statistics such as the "top 10% of titles" and "the top 10% of DVDs" lack enduring significance.  Because humans have physically limited brains and communication capabilities, rapidly increasing the total number of symbolic items that persons could choose isn't likely to affect the aggregate pattern of actual choices among relatively popular items.

Nielsen VideoScan indicates an increasing number of titles are rarely chosen:

The number of titles that sold only a few copies almost doubled for any given week from 2000 to 2005. In the same period, however, the number of titles with no sales at all in a given week quadrupled. Thus the tail represents a rapidly increasing number of titles that sell very rarely or never. ... Moreover, we determined that this is not simply a function of the sharp increase in the number of titles that have come onto the market in recent years, or of the transition from VHS to DVD; it is the truth of the long tail.[7]

The author did not describe how the analysis separated the effects of the sharp increases in total titles from actual choices.   Disentangling the effects of an increase in titles is a difficult problem.   That the form of the author's statistics depends strongly on the total number of titles suggests that the author hasn't actually figured out how to do that.

Some data indicate a growing business in a small number of titles.  With respect to Nielsen VideoScan data:

success is concentrated in ever fewer best-selling titles at the head of the distribution curve. From 2000 to 2005 the number of titles in the top 10% of weekly sales dropped by more than 50%—an increase in concentration that is common in winner-take-all markets.

The effect seems not to be consistent with a linear popularity model.  With respect to Nielsen VideoScan data, the author observes:

The importance of individual best sellers is not diminishing over time. It is growing.

But with respect to Nielsen SoundScan data, the author notes:

although today’s hits may no longer reach the sales volumes typical of the pre-piracy era, an ever smaller set of top titles continues to account for a large chunk of the overall demand for music.

If individual hits decrease in popularity, but an ever smaller set of top titles continues to account for the same large share of demand, then a linear popularity model doesn't describe well what's happening.   Perhaps the decrease in sales volume for hits (individual best-sellers) refers to a decrease in the overall demand for (commercially sold) music.

The aggregate characteristics of persons' choices among a nearly infinite set of symbolic goods aren't well understood.  But the importance of such choices clearly is increasing.  As is conventional, I'll end this post with a call for more research, and for more support for regulators.

Notes:

[1] See De Vany, Arthur S. Hollywood Economics: How Extreme Uncertainty Shapes the Film Industry. Contemporary political economy series. London: Routledge, 2004, and Taleb, Nassim. The Black Swan: The Impact of the Highly Improbable. New York: Random House, 2007.

[2] See Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts, "Experimental study of inequality and unpredictability in an artificial cultural market" Science, 311, 854-856 (2006).

[3] Extensive marketing and promotion may be necessary for traditional-media blockbusters, but it is not sufficient.  See, e.g. De Vany (2004).

[4] Viewed as a distribution of popularity shares, a power-law approximation for item popularity vs. item rank has only one free parameter.  The minimum item rank is necessarily one (the most popular item) and the total popularity shares must sum to one.   But remember that an approximating line is a model, a tool for analysis, a means for organizing fruitful comparison and discussion.   The slope of a linear approximation to the popularity distribution for a range of items of practical interest seems to me to best serve this purpose.

[5] Salganik, Dodds, and Watts (2006).

[6] See Anita Elberse, "Should You Invest in the Long Tail?" Harvard Business Review, July-Aug. 2008.

[7]  This and subsequent quotes are from Elberse (2008).

lack of power laws and other popularity problems

Discussion of tails and popularity seems to be maturing into more comprehensive considerations. Folks aiming for uber-geek status might chat about the difference between power laws and log-normal distributions. This difference has similar consequences to the difference between non-stationary and stationary macroeconomic time series. If those terms are obscure to you, you might just chat about how much bigger infinity is than any other number. It's a huge issue!

Government bureaucrats and other practical, get-the-job-done types might just try to produce some simple, intuitive, and relevant graphs. Consider, for example, log-log graphs of website page traffic by page rank. They show a "drooping tail" relative to an approximating line for the left part of the popularity distribution.

Whether the droop is a typical characteristic of web page popularity distributions is not clear. Surely a relatively large subset of relatively bad (unlinked, search-word-poor, spam-associated) pages could contribute to the droop. On the other hand, the droop could be a typical effect of the usual distribution of page content and general patterns of linking and searching. These two possibilities could be tested by comparing the magnitude of the droop in different websites' page traffic distributions.

Fitting a line to a popularity distribution is more useful as a descriptive technique than as a literal claim that the popularity distribution follows a power law. The term "power law" is not meaningful to most persons. Moreover, knowledge about power laws does not provide a lot of insight into the factors that determine website traffic or trends in website traffic over time.

The possibilities for statistical distributions are not limited to power laws (or power laws and log-normal distributions). The personal behavior and information and communication systems that affect page popularity are complex. They may not be uniform across different circumstances. For example, the factors that govern the popularity of the least popular pages may be rather different from the factors that govern the popularity of the most popular pages.

Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to a log-normal distribution or a distribution with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution.

Two website traffic distributions suggest that website traffic distributions may have flattened over the past decade. Traffic to Sun's website pages in July, 1996, according to my calculation, had a descriptive log-log line with slope about -1.1. Traffic to useit.com pages in the summer of 2006 had a descriptive log-log line with slope about -0.8.* This difference may reflect differences between the two websites. It might also indicate flattening over time in typical website traffic distributions. Such flattening was also a general pattern of change in name popularity distributions in England (and the U.S.) from 1800 to the present.

Much more evidence of this sort might lead one to expect website traffic distributions, and perhaps other distributions for digital goods, to continue to flatten in the future. "More of the same" is a very crude sort of prediction. But absent any other relevant information, it might be the best that one can make.

Knowledge concerning the factors that have produced changes in popularity distributions could help to predict and shape future changes. Persons may have chosen less popular names in response to changes in patterns of work and residence that put persons closer together and more extensively and uniformly regulated their interactions. In short, personalization may have been a personal counter-reaction to factories and urbanization (the Industrial Revolution).

Suppose new information and communication technologies favor a more dispersed workforce, and family structure continues to shift toward more single persons and smaller households. My theory then predicts a decrease in personalization and an increased desire to associate with popular symbols.

Ummm, about all current trends indicate that either my theory is wrong, or that there are a lot of other, much more important factors. Does anyone have some better ideas?

* This is the slope measured in log-log coordinates, not in the coordinates of the axes' labels. The intercept on a log(page traffic) axis typically varies greatly with total website traffic. Thus I prefer graphs that have a y-axis labeled with log(page traffic share in total website traffic). An approximating line has the same slope with either labeling for the Y axis.
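The last claim follows from a one-line calculation. Write t_r for traffic to the page of rank r and T for total website traffic (symbols introduced here only for illustration), and suppose the approximating line for log traffic has intercept a and slope b:

```latex
\log t_r \approx a + b \log r
\quad\Longrightarrow\quad
\log\frac{t_r}{T} = \log t_r - \log T \approx (a - \log T) + b \log r
```

Dividing by total traffic shifts the intercept by log T and leaves the slope b unchanged.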

more discussion about tail size

Once again men are heatedly discussing tail size. Just ponder this question: How large is the long tail? Personally, I'm going to keep looking before I decide for myself.

While it's novel to bring mathematical precision to such matters, unfortunately it seems to me that this mathematical model focuses attention on misleading features. The model says that the share of the k most popular items is log(k)/log(n), where n is the total number of items on offer. Thus, in this model, the total number of items on offer determines the share of the most popular items.

This isn't a sensible model. Mathematically, a power law describes an infinite number of items on offer. The slope of the power law, or more precisely, the slope of an approximating power law at the high popularity end of the distribution, usually describes well the high-end shares. The question is what determines the slope of the power law. The number of items on offer isn't a good answer to that question, particularly for n varying from two million to six billion.
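The dependence on n can be checked numerically. The sketch below assumes that the log(k)/log(n) formula comes from a power law with slope exactly -1 cut off at n items, which is an interpretation rather than something the model's author is quoted as saying. Under that assumption the top-k share is a ratio of harmonic sums; function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def harmonic(n):
    """Sum of 1/r for r = 1..n; exact for small n, asymptotic formula otherwise."""
    if n <= 1_000_000:
        return sum(1.0 / r for r in range(1, n + 1))
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)

def top_share_slope_one(k, n):
    """Share of the k most popular of n items when popularity is proportional to 1/rank."""
    return harmonic(k) / harmonic(n)

def top_share_log_model(k, n):
    """The simplified formula discussed above."""
    return math.log(k) / math.log(n)

for n in (2_000_000, 6_000_000_000):
    print(n, top_share_slope_one(10, n), top_share_log_model(10, n))
```

Both versions make the top-ten share fall as n grows even though nothing about persons' actual choices among popular items has changed, which is the feature objected to above.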

For a concrete example, consider the popularity of the ten-most-popular given names. The set of possible given names (given names on offer) is huge, and probably hasn't changed much in the past two-hundred years. However, the popularity of the ten-most-popular given names for males in England has fallen from about 85% in 1800 to about 28% in 1994. If you want to understand changes in the popularity of the most popular items in a collection of symbols instantiated and used in a similar way, try to understand this change.

* * *

For additional amusement, here's a post I stuck in the galbithink.org newsfeed a little more than a year ago, back in the time of Web-Pleistocene:

Tail aficionados might enjoy pondering the distinguishing features of the long tail. I think that size, which tail authorities have categorized as long or short, matters less than shape. It should be no surprise to anyone that shape can change over time. For some graphical evidence, see the detailed images here.

So don't just sit around complaining that "diversity plus freedom of choice creates inequality". Power laws don't imply any particular amount of inequality. The power of the powerlaw determines the difference between tails. Look at some examples and see for yourself!
