Installing the stupid filter

I enjoy Seth Godin's blog. Today's entry, "Installing the stupid filter," is about how humans don't always accept questions or directions as stated. That is, humans ask "Are you sure?" Machines don't.

The problem is that machines are given bad data all the time and most accept it verbatim. Crossref, my employer until Jan 11, handles lots of XML-encoded data. So we need to manage both complicated structures and many types of data -- publication dates, personal names, country names, company names, page numbers, volume numbers, ORCID iDs, ISBNs, ISSNs, PubMed IDs, etc. Some types have a strict syntax, so we can know whether a value is valid. What we can't know is whether the value is appropriate. We have to guess.
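
To make the "strict syntax" half of that concrete, here is a minimal sketch of a validity check for one of those types, the ISSN. It assumes the usual NNNN-NNNC form with the standard mod-11 check digit; the function name is_valid_issn is made up for illustration and is not anything Crossref uses. Note what it cannot tell you: a value can pass this check and still be the wrong ISSN for the journal in question.

```python
import re

# ISSN form: four digits, a hyphen, three digits, then a check character (0-9 or X).
ISSN_PATTERN = re.compile(r"^([0-9]{4})-([0-9]{3})([0-9X])$")


def is_valid_issn(value: str) -> bool:
    """Return True if value is a syntactically valid ISSN (hypothetical helper)."""
    match = ISSN_PATTERN.match(value.strip().upper())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    # The check value makes the weighted sum divisible by 11; 10 is written as X.
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return match.group(3) == expected
```

A check like this answers "is it valid?" and nothing more. Whether the value is appropriate still needs context that the syntax does not carry.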

Is a publication date 2 months from now appropriate? In most cases, the answer is "yes," as the publisher is depositing the metadata for a forthcoming publication. But what about 3 years from now? If the publication is part of a book set that is expected to take 10 years to complete, then "yes," too. If it is a journal article, then almost certainly the answer is "no." At what point is an article title too long? Is it, as we have experienced, a misplaced abstract? It seems the more data we have, the more questions we have about it.
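
One way to picture a context-dependent check like the publication date question is a lookup of how far into the future a date may plausibly be for each kind of content. The sketch below is only that, a sketch: the content-type labels, the limits in days, and the function name are all assumptions invented for illustration, and the result is a flag for a human to look at rather than a rejection.

```python
from datetime import date

# Hypothetical limits on how far ahead a publication date may be, in days.
# These numbers are illustrative guesses, not real policy.
MAX_LEAD_DAYS = {
    "journal-article": 365,   # a year ahead is already generous
    "book-set": 3650,         # multi-volume sets can run ten years
}
DEFAULT_MAX_LEAD_DAYS = 730   # fallback for content types not listed above


def check_publication_date(pub_date: date, content_type: str, today: date) -> str:
    """Return "ok" or "suspicious" for a publication date, given its context."""
    lead_days = (pub_date - today).days
    if lead_days <= 0:
        # Past or present dates are not this check's concern.
        return "ok"
    limit = MAX_LEAD_DAYS.get(content_type, DEFAULT_MAX_LEAD_DAYS)
    return "ok" if lead_days <= limit else "suspicious"


today = date(2024, 1, 1)
print(check_publication_date(date(2024, 3, 1), "journal-article", today))
print(check_publication_date(date(2027, 1, 1), "journal-article", today))
print(check_publication_date(date(2027, 1, 1), "book-set", today))
```

The same date three years out comes back differently for a journal article and a book set, which is the point: the datum alone is not enough to judge it.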

I don't have any answers to these questions. I just want to make the comment that even in machine-to-machine data exchanges there need to be sanity checks on the data, and those checks have to be made within the larger context of each datum.