building a data search engine

A good search engine for data would make the web more enlightening. Lack of a good business model may be keeping users in data darkness.


beware of statistical designs

Misleading statistics on the declining farm share have supported agricultural subsidies. Beware of tendentious use of statistics.


why SEC filings don't contain semantic, queryable data

SEC filings don’t contain semantic, queryable data because companies aren’t interested in making their financial data readily available in that form.


making data more factually important

Data accessibility requires both data production and data access. Discursive norms help to provide good incentives.


badly structured tables have a bright future

Which is better: one big table, or two or more smaller tables?  The organization of the data sources, the number of smaller tables, the extent of the relationships between the smaller tables, and economies in table processing all affect the balance of advantage.  But cheaper storage, cheaper computing power, and fancier data tools probably [...]


exploring and remodeling table fields

Sometimes tables are messy not just in their data items, but also in the fields that define the table columns.[1]  Various techniques help to deal with such “second order” messiness.  Sorting table fields alphabetically or evaluating them with more powerful text similarity measures helps to identify inadvertently duplicated fields.  Sorting table fields by the [...]


describing and organizing spreadsheet data

Even in this age of big data, most persons collect data in spreadsheets.  Two challenges are common with spreadsheet data, particularly spreadsheet data collected from a variety of sources.  First, you need to understand what numbers you have.  That means both the definition of a specific number and the presence or absence of particular numbers.  [...]


confidential documents are costly

Confidential documents submitted to government agencies have significant costs.  Confidential documents don’t contribute to public knowledge.  Persons face significant costs and complications to access confidential documents.  Moreover, the receiving agency has to follow special, relatively expensive procedures for storing and archiving confidential documents.   Both the cost of confidentiality to the public and to the [...]


micro-constituencies support global information sharing

Creating a new, common language for machine-readable information allows information to be shared across organizations with disparate information systems and information formats.  The Global Justice XML Data Model is a successful example of such a language.  Its success prompted the development of a similar but broader initiative called the National Information Exchange Model.  Both models [...]


knowledge from the nineteenth century to now

In a nineteenth-century dataset, cabinets group birds at a high level of similarity.  Each cabinet contains stuffed and mounted birds, arranged roughly in a grid.  On the top row of this cabinet are three flycatchers followed by three kingbirds.  On the second row are two kingbirds followed by four flycatchers.  The next row displays [...]

