building a data search engine
Suppose you want to find data relating to a specific topic. Software can easily distinguish between alphabetical text and numbers. Software can also easily recognize tabular formats in HTML and plain text. Moreover, the extent to which documents contain numbers and tables strongly distinguishes among documents. “Look for data” seems like a significant search qualifier that wouldn’t be costly to implement. So where are the data search tools?
Zanran is search engine designed to find data. The only way to tell Google that you want to search for data is to specify a search for .xls (Microsoft Excel) documents. That’s a poor specification for a data search, especially for a company that’s not Microsoft. With Zanran, you specify that you want to search for data simply by using Zanran for the search. Zanran returns documents containing data. Unlike Datafiniti, Zanran doesn’t attempt to extract the data into a tabular form. Unlike WolframAlpha, Zanran doesn’t attempt to do calculations with data that it finds. Zanran finds the relevant documents. You extract the data and make the specific tables or calculations that you want. At least until the forthcoming world-wide implementation of linked data, Zanran’s approach is the most cost-effective division of tasks for using data from the whole web of various document types.
A stand-alone data search engine unfortunately doesn’t seem economically propitious. Building a data search engine involves all of the challenges of building a general search engine.[*] Search for data is mainly a tuning of the relevance-ranking algorithm. Even without such a tuning, a Google search for historical advertising expenditure leads to better data than a similar Zanran search. Moreover, searches for data don’t provide context for lucrative advertising. Persons searching for data aren’t looking to buy mass-market consumer products. Perhaps Zanran or similar services can succeed on a subscription or pay-per-search basis. For the sake of fact-based policy and business analysis, let’s hope so.
* * * * *
Related posts:
[*] Zanran analyzes images to determine if they contain a graph, chart, or table. Thus a Zanran search can potentially return a desired graph or data that a purely textual search would miss. That’s valuable, but searching images for numerical data and data presentations seems to me to be a rather small share of the overall value of data search to users.
Tagged: data
describing statistics
Verbally describing statistics accurately and succinctly is often difficult. Needle and Thread offers a better way. In a recent paper on Thread, Glenn McDonald, the Thread designer, stated:
The Thread query describes a path through the data … we believe that it is more expressive, easier to map to the shapes of human inquiry, and far more compact and data-focused. These may be enough to give the query language a qualitatively different role: a tool not just for people to talk to machines about data, but for people to talk to themselves and each other about data in such a way that machines can participate in, and inform, the discussions.
With Needle and Thread, you can understand statistics by seeing how they’re calculated and experimenting with alternatives.
badly structured tables have a bright future
Which is a better, one big table, or two or more smaller tables? The organization of the data sources, the number of smaller tables, the extent of the relationships between the smaller tables, and economies in table processing all affect the balance of advantage. But cheaper storage, cheaper computing power, and fancier data tools probably favor the unified table. At the limit of costless storage, costless processing, and tools that make huge masses of data transparent, you can handle a component of the data as easily as you can handle all the data. Hence in those circumstances, using one big table is the dominant strategy.[*]
Unified tables are likely to be badly structured from a traditional data modeling perspective. With n disjoint components, the unified table has the form of a diagonal matrix of tables, where the diagonal elements are the disjoint components and the off-diagonal elements are empty matrices. It’s a huge waste of space. But for the magnitudes of data that humans generate and curate by hand, storage costs are so small as to be irrelevant. Organization, in contrast, is always a burden to action. The simpler the organization, the greater the possibilities for decentralized, easily initiated action.
Consider collecting data from company reports to investors. Such data appear within text of reports, in tables embedded within text, and (sometimes) in spreadsheet files posted with presentations. Here are some textual data from AT&T’s 3Q 2010 report:
More than 8 million postpaid integrated devices were activated in the third quarter, the most quarterly activations ever. More than 80 percent of postpaid sales were integrated devices.
These data don’t have a nice, regular, tabular form. If you combine that data with data from the accompanying spreadsheets, the resulting table isn’t pretty. It gets even more badly structured when you add human-generated data from additional companies.
Humans typically generate idiosyncratic data presentations. More powerful data tools allow persons to create a greater number and variety of idiosyncratic data presentations from well-structured, well-defined datasets. One might hope that norms of credibility evolve to encourage data presenters to release the underlying, machine-queryable dataset along with the idiosyncratic human-generated presentation. But you can think of many reasons why that often won’t happen.
Broadly collecting and organizing human-generated data tends to produce badly structured tables. No two persons generate exactly the same categories and items of data. Data persons present change over time. The result is a wide variety of small data items and tables. Combining that data into one badly structured table makes for more efficient querying and analysis. As painful as this situation might be for thoughtful data modelers, badly structured tables have a bright future.
* * * * *
[*] Of course the real world is finite. A method with marginal cost that increases linearly with job size pushes against a finite world much sooner than a method with constant marginal cost. The above thought experiment is meant to offer insight, not a proof of a real-world universal law.
Tagged: data
exploring and remodeling table fields
Sometimes tables are messy not just in their data items, but also in the fields that define the table columns.[1] Various techniques help to deal with such “second order” messiness. Sorting table fields alphabetically or evaluating them with more powerful text similarity measures help to identify inadvertently duplicated fields. Sorting table fields by the number of non-null items in the fields would tell you the relative data size of different fields. Of course, you would also like to be able to merge and split fields just as easily as you select fields to include in a sub-table of interest.
Transforming a table into row number/field/item triples can be helpful for exploring and remodeling table fields. With such a transformation, fields become new data items. Excluding triples with null items and faceting the fields shows the count of items in each field. Merging and splitting fields become operations of selecting relevant triples and renaming the field/data item. When finished such meta-cleaning, you might want to transform the triples back into a remodeled table. Since field merges can create a field containing more than one item, the remodeled table ideally would support multiple items per cell. [2]
The triple transform is an instance of more general table reshaping operations. A long transformation replaces a set of fields with a key field containing those field names and a new field containing the keyed values. A wide transformation reverses the long transformation. It expands a key field into separate key-instance fields populated with the corresponding keyed values. Non-varying fields are expanded or contracted as necessary. Table reshaping operations change both the column and row dimensions of a table. The reshape command in Stata implements this class of table operations. The Stata reshape documentation shows examples of these transformations. SDDL / STT provides table reshaping using spreadsheet-based markup.
The awesome new data tool Google Refine can do table reshaping. Here’s the Google Refine procedure for transforming a table into row/field/item triples:
- Create a row-number data column called “row” at the beginning of your table, if you don’t already have such a column. That column will be identical to the (non-editable) row number column on the Google Refine table presentation frame.
- From the column drop-down menu in the first column after the row number column, select “Transpose” / “Cells across columns into rows…” from the column drop down menu. Select the field below the row field in the “from column” and the last field in the “to column”. Call the column name “field”, check “prepend the column name”, and separate the column name and cell value with “:” (or some other value not included in item text). Then click “Transpose”. The table is then elongated, but it needs some further modifications.
- From the “row” column drop-down menu, select “Edit cells” / “Fill down”. That fills in the row numbers for all the records.
- In the field column, select “Edit column” / “Split into several columns…” , and enter the separator (“:”) used in step 2. Clicking ok completes transforming the table into triples.
To transform the triples back into the wide table:
- In the drop-down menu for the rightmost column (field2), select “Transpose” / “Cells across columns into rows”. For the number of rows to be transpose, enter the number n of fields in the original table (not including the row number field).
- Rename the new “field 2 n” fields based on the field names in the previous column (field 1).
- In any of the former “field 2 n” fields, select from the drop-down menu “Facet” / “Text facet”. In the left panel select “(blank)”. You should now see all the blank “field 2 n” rows.
- In the drop-down menu for “all”, select “Edit rows” / “Remove all matching rows”.
- Click on “Reset All” and the table is back to its original form.[3]
This triple transform unfortunately isn’t as helpful as it could be. With this triple transform, you can then facet the (former) table fields and understand what fields exist in the table. But any changes that you make just in the field column won’t have any effect in the reverse transform. A function that would implement the reverse transform with changes in the field column (and automatically create the corresponding field names) seems like it would be straightforward to implement.[4] A robust reverse triple transform would make a great data tool even more useful.[5]
* * * * *
Notes:
[1] Throughout this post, table means a data collection with one dimension of categorization (a record list), e.g. a typical relational database table that organizes fields into records.
[2] If it doesn’t, the table cleaner has to ensure before a field merge that no multi-item cells will be created.
[3] Similar steps can implement table reshaping on just a subset of columns.
[4] The possibility of having multiple items in one row-field cell would have to be addressed. One solution would be to ask for a delimiter before the transform, and use that delimiter to form multi-item cells.
[5] Collecting human-generated unstructured and structured data using SDDL / STT tends to result in messy, wide, sparse tables.
Tagged: data
describing and organizing spreadsheet data
Even in this age of big data, most persons collect data in spreadsheets. Two challenges are common with spreadsheet data, particularly spreadsheet data collected from a variety of sources. First, you need to understand what numbers you have. That means both the definition of a specific number and the presence or absence of particular numbers. Second, you need to combine unstructured data and data tables of various forms into an encompassing data structure that you can flexibly query and re-organize. Here are some ideas and data tools to address these challenges.
Suppose you are collecting unstructured data from a variety of sources. For example, a Wall Street Journal news article states:
Nielsen, at the request of The Wall Street Journal, analyzed cellphone bills of 60,000 mobile subscribers and found adults made and received an average of 188 mobile phone calls a month in the 2010 period, down 25% from the same period three years earlier.
To collect that data, you might paste the string “adults made and received an average of 188 mobile phone calls a month in the 2010 period” in a spreadsheet cell. But that’s data you can’t use in spreadsheet calculations. Ok, stick the number 188 in the cell to the right of that data string so you can do calculations on the relevant number. Even better, stick the number in the first cell and the description string in the second cell, so that the long, left-justified description string lines up close to the corresponding number. If you’re really serious, stick the source url in another column to the right so that you can check if you’ve missed something. This sort of procedure gives you a spreadsheet that looks something like this. This data format may be good enough for some private-sector work, or even some government work. If so, fine.
But collecting numbers and ad hoc data description strings has some weaknesses. In the example above, you haven’t fully captured the data description. That data concerns “mobile subscribers.” Mobile subscribers may or may not be different from mobile prepaid service users. Moreover, suppose you had a hundred numbers like that above. To find a particular statistic of interest, you would have to start reading through the description strings until you found one similar to the statistic that you sought. That’s tedious and time-consuming.
If you describe spreadsheet data in a more structured way, you can understand it better and process effectively larger amounts of it. A common approach would be to set up a spreadsheet table and stick numbers into it. For example, suppose you want to collect wireless service data by company. You place in a row your data categories (fields): company, customer type, number of customers 3Q 2010, etc. Then you start putting data into the relevant columns below. If you have a lot of categories and the data you find isn’t generally organized in the order of your categories, populating the table will be a tedious and time-consuming task. Moreover, suppose that you are collecting the data by company. You have to repeatedly enter the company name in the company column. That sort of annoyance can easily be multiplied if you are collecting data with additional categorical organizations such as date and business segments. If these slowly varying categories are spread across a large table, the pain is even more intense.[1]
Here’s some good news: that supercomputer on your desk can work as a tabulating machine. You just need to describe your data with the Spreadsheet Data Description Language (SDDL) and then tabulate it with STT (Spreadsheet Tabulating Tool, or Simple Tabulating Tool, or SDDL Tabulating Tool). SDDL and STT provide fine new acronyms for you to use without requiring you to learn much or do much different with the beloved spreadsheet programs that you’ve been using since Visicalc came out. To use SDDL, you put a category in one cell, and in the next cell to the right, an item for that category. You can add additional categories and items aligned below, in any order most convenient. When you stick an item into a category that already contains an item, you’ve implicitly tabulated a record and started a new record with empty categories. STT is a spreadsheet macro that adds to a table on a separate sheet the categories (if necessary) and items contained in a block of SDDL. SDDL is so simple that you’ve probably been setting up a lot of spreadsheet data in SDDL, without even knowing that you were using SDDL.
Here are some SDDL / STT examples using Google Docs spreadsheets.[2] So that you can see the SDDL and the tabulation side-by-side, I’ve copied the tabulations from the tabulated sheet to the SDDL examples sheet.[3] Examples 1.1-1.2 show placing items into categories, and how sticking an item into an occupied category generates a tabulated record. This technique is more efficient than hand-tabulating data for a wide, sparse table.
You can easily combine hand tabulating and SDDL / STT. If you start with a tabulation, running STT on a block of SDDL adds any needed categories and adds all items to their categories. See Example 2 in the SDDL example spreadsheet.
SDDL uses prefixes for pinning and unpinning items in categories. If you precede a category name with the character “*”, the associated item is pinned in that category across records. So if you are working on tabulating a group of records for a company, you can pin the company name in the company category for that record group. That company name will then populate each record in that record group. When you insert a new name in the company category, you decide whether to pin that new name. See Example 3.
You can create SDDL using all the editing, linking, and calculating capabilities of the spreadsheet. So, for example, if you have a figure somewhere in a spreadsheet, you can reference that figure and do a calculation using it to produce a figure in an SDDL cell. Similarly you can copy and paste categories to lessen the work of entering SDDL.
SDDL stack and twoway formats provide an easy way to create groups of similar records. A stack is a table of items with one dimension of categories (a record list). A twoway format describes a table with two dimensions of categories. Both stacks and twoways are commonly called tables, but technically these forms differ significantly.
In SDDL, a stack is one or more rows in a block where more than one item follows the category that starts each row. The row-starting categories, called 1cats, must not be duplicated within one stack.[5] STT tabulates one record for each stack column.[4] Each tabulated record contains all the items sitting in the current record before the stack occurred. To visually distinguish stacks, and for some minor cases of missing values, the stack category can be prefixed with a 1. See Examples 5.1 and 5.2.
While a stack implies a record for each data column, the twoway format typically creates a record for each item placed in a rectangular grid. The twoway format begins with the twoway directive “%twoway”, with “precat” categories following in that directive’s row. Below the twoway directive row are a stack of “1cat” rows and a set of “2cat” rows. The record for grid item at position (r,c) is created from items in the “precat” section of row r, the items in the stack above the grid in column c, and the grid item. The grid item goes into the “2cat” category that begins the grid row. See example 6. Once you understand how to read the relevant categories (precats, 1cats, and 2cats), a twoway SDDL form is just a well-described, two-way table.[6]
A key use for the twoway is to re-organize and aggregate various data tables. Financial publications typically include pivot tables (two-way tables) constructed from a master data table (a one-way table / record list). The SDDL twoway format allows you to reverse pivot the table to the extent possible (unfortunately, you can’t un-sum a cell total). For an example of an SDDL table-to-list conversion using real data, see the first data table here (from an AT&T 3Q2010 financial spreadsheet) and the first SDDL twoway here. A Google search shows that many others struggle to convert (two-way) tables to (one-way) record lists. With SDDL / STT, you can do that job easily and flexibly. Moreover, you can automatically aggregate the tables that you convert into one, big master tabulation.
Go ahead and experiment with SDDL / STT and consider whether it might be useful to you. Here’s an example of using SDDL / STT to aggregate both unstructured and structured data from AT&T’s and Verizon’s third-quarter, 2010, financial reports (wireless segment). Here’s a full definition of SDDL and the source code for the STT script. Want to customize the script to your liking? Got an idea for improving it? Go for it! If you make an improvement to the script, please make it available to everyone. That’s not a legal requirement; it’s just a polite request.
Only a few persons have big data. Ultimately, it’s not the size of your data, but what you do with it that really matters.
* * * * *
Notes:
[1] Tricks exist to fill down blank cells. But you still have to locate the right columns for the first-changed items and filling down.
[2] STT is currently freely available as a Google Apps Script. I hope that public-spirited programmers will port it to Microsoft Excel, OpenOffice, Zoho Spreadsheets, and any other applications where it would be useful, and that they will make the resulting ports freely available.
[3] The tabulations appear in a separate sheet from the SDDL. That sheet by default is named “tabulated”.
[4] My design philosophy for STT’s processing of SDDL is “STT will do the best it can with what it’s given”. If you duplicate 1cats in a stack, STT will break up the stack into sub-stacks to overcome the duplication. The STT log file records such situations.
[5] An SDDL stack is the same as the transpose of a record list, where the first row contains field names. The SDDL directive “gettab” adds a record list to an STT tabulation (which itself is a record list).
[6] The twoway can handle different categories placed as 2cats. See the SDDL / STT documentation.
Tagged: data