Nat Friedman

We have an API

It’s always nice when you find some valuable database is available online via an API.

An API means you don’t have to write grubby code to screen-scrape their web site, and you can get all the data you need.

For example, recently I was messing around with book information and found that the New York Times has a best sellers API you can use to access their bestseller lists going back several years. Cool, right?

Of course, like many online APIs, they rate-limit you to 5,000 queries per day. So if you want, say, all the bestseller data from the last ten years, you’d need about two days to grab all of that (roughly 7,200 queries, assuming there are ~4 updates to the list per month, and you want to grab all 15 of their bestseller lists).
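The two-day figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes one API query per list per update, and the function name is made up for illustration.

```python
import math

YEARS = 10
UPDATES_PER_MONTH = 4   # rough figure from the post
LISTS = 15              # number of bestseller lists mentioned in the post
DAILY_LIMIT = 5000      # queries allowed per day


def days_to_backfill(years, updates_per_month, lists, daily_limit):
    """Return (total queries, whole days needed), assuming one query per list per update."""
    queries = years * 12 * updates_per_month * lists
    return queries, math.ceil(queries / daily_limit)


total_queries, days_needed = days_to_backfill(
    YEARS, UPDATES_PER_MONTH, LISTS, DAILY_LIMIT
)
print(total_queries, days_needed)
```

Under those assumptions the backfill does not fit in one day's quota, so the crawl spills into a second day.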

So here’s my proposal. Instead of making us run slow crawlers on your APIs to access historical data, just provide a sqlite database we can download. It’s easier for everyone.
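To give a feel for why a downloadable sqlite file is convenient, here is a rough sketch using Python's built-in sqlite3 module. The table name, column names, and rows are all invented for illustration (the rows are placeholders, not real bestseller data), and an in-memory database stands in for the downloaded file.

```python
import sqlite3


def top_titles(conn, list_name, limit=3):
    """Return (title, weeks_on_list) pairs for one list, most weeks first."""
    return conn.execute(
        """
        SELECT title, COUNT(*) AS weeks
        FROM bestsellers
        WHERE list_name = ?
        GROUP BY title
        ORDER BY weeks DESC, title ASC
        LIMIT ?
        """,
        (list_name, limit),
    ).fetchall()


# Stand-in for sqlite3.connect("downloaded_file.db"); the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bestsellers "
    "(list_name TEXT, published_date TEXT, position INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO bestsellers VALUES (?, ?, ?, ?)",
    [
        ("hardcover-fiction", "2010-03-07", 1, "Placeholder A"),
        ("hardcover-fiction", "2010-03-14", 1, "Placeholder A"),
        ("hardcover-fiction", "2010-03-14", 2, "Placeholder B"),
        ("paperback-nonfiction", "2010-03-14", 1, "Placeholder C"),
    ],
)

print(top_titles(conn, "hardcover-fiction"))
```

With the whole dataset local, a question like "which titles spent the most weeks on a list" is one query, with no rate limit involved.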

The API is still useful, of course, for up-to-the-moment data.

Now, commercial websites like the New York Times might want to use the inconvenience of an online API as a way to limit access to their data or to enforce some terms and conditions. But I think in practice it just means that people have to run their crawlers a little longer. And maybe they implemented an API because they thought it’s what people wanted.

In the case of government APIs, this is especially important. All government “open data” web sites should be providing downloadable data sets. If they’re too big, chunk them.

So, keep calling for online APIs. But ask for downloadable datasets too.

28 March 2010
  1. Most commercial endeavors are based on some form of scarcity, even if that scarcity is artificial. We may not like it, but there needs to be a way to generate revenue to support the institution that aggregates the data. How would you go about that if you were in the Times’ shoes?


    1. I’m sympathetic to the need of businesses to make money.

      But in this case, I’m pretty sure the Times isn’t generating revenue off the bestseller API.


      1. Hard to say whether they are or not – they definitely make money on the list itself. I think rate limits are fine for most people. If you need to have it raised, just send an email.


  2. I agree that data should be broadly accessible for analysis, review, or value-added applications–especially publicly funded information. However, I don’t agree that your proposal is the best approach.

    There are three main issues:

    Provenance–the data comes from a reliable source that people accept as being an authority.

    Reliability–the data can be reliable and reproducible, and therefore conclusions based on information derived from the data can be accurately reproduced.

    Historical Volatility–once the data is published, the need to update previously stated information is well understood. The NYT best-seller list should have little-to-no volatility, but a financial report database might require frequent updates as companies restate earnings based on new accounting principles, laws, or audits.

    Data that must have provenance, or that is historically volatile but requires reliability, is a good candidate for API-level access rather than a downloaded database. A downloaded SQLite database can be modified and the data redistributed. Depending on how it is done, this can create problems with provenance and reliability, and can suffer from volatility issues that may not be taken into account. Furthermore, this can happen either maliciously or accidentally, but in either case it creates a number of problems for the data publisher and downstream consumers.

    It’s unlikely that anyone who has a vested interest in maintaining reliability and provenance of their data is going to allow it to be downloaded in this manner.

    APIs provide a way for publishers to ensure accuracy and ownership of data, but still allow people to create derivative content. If you need more access to the data than the API access agreements offer you, you are more likely to make headway with the data providers by asking them to increase the rate-limit for the service, instead of asking them to relinquish all control of the data.

    But, it’s also important to understand the data that you are requesting and the way in which you will use it, so that you determine what the appropriate rate-limit should be. Let’s look at the statements that you present as an example:

    * NYT Bestsellers data is fully available
    * NYT Bestsellers data changes infrequently (~4 updates per month per list)

    For someone using the data in an online application, or even in a research effort, querying March 2000, Hardcover Fiction is going to return the same data every time. Given that, you should be caching that query rather than passing it to the NYT API. Since the data changes infrequently, once you have the historical data backfilled, querying the data more frequently than it updates and caching the result should lead to accurate data.

    Given that, the NYT API query limit is at least 1000 times more than anyone would need on an ongoing basis. I’m sure that the limit was put into place to keep their API services from being beaten to death by poorly architected applications.

    Even in the case of research, a two day wait to get access to the data isn’t really that onerous of a requirement, assuming there was no other way to get that data.

    In the case of government data, they are concerned about data reliability, and possibly volatility: that is they want the ability to provide accurate, reliable data that everyone can access and get the same results while incorporating possible updates, if needed. APIs do a better job of ensuring that than distributed data sources.

    For public information that is published and static, a downloadable format is great, and many government agencies provide access to data in downloadable format in machine processable formats such as MS Excel, CSV, XML, or text. Others provide data that is intentionally scrape-able from their site.

    Of course, all of this could be resolved by ensuring that data that is consumed is appropriately attributed, that modifications to data are properly described, and that methodologies are both documented and technically sound. However, there are people who improperly interpret data (intentionally or accidentally, in good faith and bad), and there are non-disclosed agendas, and people use data to covertly influence others.

    So, until Utopia arrives we have to address the sometimes conflicting needs and available resources of publishers and consumers.


    1. Very thoughtful comment. Thanks.

      You’re probably right, in the case of the New York Times bestseller data. Two days isn’t that bad to download 10 years of data (although that is the throughput equivalent of something like a 14.4k modem).

      For data that’s being constantly restated, a downloadable database isn’t as good. But there are plenty of large data sets that don’t change too. A 1-5k queries/day limit on those datasets could be crippling.

      People are calling for open data APIs left and right. My point is that we should also call for downloadable datasets.


  3. Hey duder,

    Did you see my post on OData as an API that exposes data as a queryable source?

    There is also the Dallas market place that allows people to publish their OData sources and decide how they want to charge for them:

    https://www.sqlazureservices.com/Catalog.aspx


  4. Hear hear. I am also quite satisfied with how this site looks on my iPhone. Bravo.


  5. Someone’s been playing with BeautifulSoup, I sense. :)

    I’d rather access the full archive through the API than download a sqlite file and manually have to maintain currency of my local data cache.

    What would work better would be rate-limiting to N multiples of the population of the archive. So if the data space has 5000 elements, the daily rate limit might be 2x or 3x that many hits.


  6. @Nat

    A “download data” item is just an API call that can return all the data. I agree that all APIs should have such a thing.

    There’s a bigger picture to this though. Why did you need all the data? Probably because you wanted to do a query, or aggregation, that the API’s author hadn’t thought of.

    The people over at Scraperwiki are working towards you being able to edit your API in a wiki. So if there was a missing API call, you’d be able to go and add it.

    I’ve constantly struggled with this question. The original UK parlparse (Parliament) screenscraper we made back in 2003 publishes all its data as XML, with an rsync download. More recently, TheyWorkForYou offers the same data using an API.

    Not many people use either, although some people use each. The reason I was struggling to work out what format to offer the data in, is because I didn’t know how people would use it.

    Flickr sidestepped this with their API, by writing their website on it. And then by adding functions that people needed for other things (upload applications etc.).

    @Mark Colburn

    I’ve not seen cases of people messing about or trying to fake Government data. If they wanted to, they could easily do so with data gathered from the API – either by downloading it, or adding a faking proxy layer in front of it. Has it ever happened to you? Could you write it up?

    I don’t think that stopping data downloads will help. All you’ll do is make the data less useful. Far far far more beneficial would be for newspapers to start citing all their sources with links. Then you could check where they came from – going back to the original API / download, or a trusted/documented source who had processed it.


    1. Well put, Francis. I’m not against APIs, I just want an additional entry point that gives me a dump of all the data.


  7. Bah, why SQLITE? Why not just a CSV or XML or some file. I think a marginally documented text format is going to be far more useful (and probably smaller) than a database file.

    Perhaps I just hate SQLITE too much.


    1. Doesn’t have to be sqlite. I just find it to be a really handy way to pass around large datasets. But CSV or JSON or XML is fine too. Actually CSV is not that great since no one implements it properly – the others are fine though.

      I’m curious why you dislike sqlite though…


  8. @Miguel de Icaza

    “There is also the Dallas market place that allows people to publish their oData sources and decide how they want to charge for it:

    https://www.sqlazureservices.com/Catalog.aspx”

    One has to admire Microsoft for doing again something ambiguous. That page is filled with *free* things that in fact are *TRIAL OFFERS*; even one from a government.


  9. Good APIs let you download the main bulk and let you do the deltas. It makes sense for all. Here’s how we do it for Netflix: http://developer.netflix.com/docs/REST_API_Reference#0_42335


  10. You could have just used Tor to make your connections originate from a large number of different IPs – http://www.torproject.org/


  11. I’m on a crusade to create analyzable data sets too, but to a lot of companies it’s quite a threatening act. You can see the result of my planned release of crawled public profile data from Facebook here:
    http://petewarden.typepad.com/searchbrowser/2010/03/facebook-data-destruction.html

    You should also check out Infochimps, they’re trying to build a business around distributing these sort of data sets:
    http://infochimps.com/


  12. To me this looks like the reason that RSS came into existence. It is not necessary that the copy of the database be available from the original database. One person can take it upon themselves to keep a sync’d copy, make a downloadable copy available when called for, and update it as requested.

    We seldom ask why a person would subscribe to an RSS feed rather than an email list nor is it fruitful here to ask why someone would want a sync’d copy of the database. They just do. Overall it is easier to handle the results of an RSS feed than the multiple formats of the received emails. Likewise, it is easier for the recipient to squirrel away the results of a sync’d copy than the multiple formats of the received queries against a lot of different databases.


  13. Great stuff here. I like Nat’s point and the discussion that follows. Ultimately, we want to provide data in a variety of different means. It is still early on for nytimes and apis/data – stay tuned and keep the thoughtful discussion going because we are listening.


    1. Thanks for writing Derek! For the record, I am very impressed and happy that NYTimes provides this data in a computer-readable form at all.

      If you do add a downloadable list in the future, that would be great – and I’d be curious if you do have any business reasons not to do that, or if it’s just a matter of time and priorities.


      1. It is a twofold problem. One part is simply finding the time. It isn’t much work – and I do love the idea of a sqlite db (we have used it here internally for a very very long time) – but it means setting aside the time amidst all the other things we are trying to do.

        One of the things we want to do and haven’t entirely solved is tracking where our data/apis are being used. It is important for the business to know where the data is flowing so we can identify new opportunities and be an actually useful partner w/ developers.


        1. Thanks for responding again. Time/resources – that’s what I figured.

          I felt like your web signup form was a good thing because it did ask for a fair amount of info about the app and why I am using your API. So I would expect that you have decent info there.


  14. For our Open Politics API “Deutschland API” (http://deutschland-api.de) we’re not limiting requests, and we provide all kinds of query languages (for example, YQL).

    Publishing the full data set as well is a good suggestion though!


  15. You might be interested in looking at CouchDB’s replication support. It was designed with exactly this use case in mind.


  16. At Sunlight Labs (we advocate for Gov open data), we’re advocating for bulk access to data. Generally, what we think government ought to be doing is the following– for just about every data-driven website it creates:

    1. Provide bulk access to the data that powers the site in some kind of machine readable format. CSV, SQLite, JSON, XML– just use something that makes some kind of sense. Not a .tiff file or a .pdf (unless it has the source documents appended)

    2. If the datasets are significantly large– it’s cool to provide a functional, registration-less API on top of that dataset, but once you go into registration-land and I have to give you my email address to get a key to that dataset– well, no. I already paid for the data. It’s public domain. Make it free with no strings attached.

    3. Make your website powered by the same bulk data or API you’re providing. Don’t provide bulk data, then build your own website with some better, cleaner copy of that data. That’ll make it so you take care of the bulk data like you take care of the graphics on your website.

    It’s remarkable how many gov agencies want to jump into API land without providing bulk access *first*, but generally I don’t think it’s because of “information control” issues, rather, it’s because API is a three letter acronym– and the government thinks three letter acronyms are really sexy.


    1. > It’s remarkable how many gov agencies want to jump into API land without providing bulk access *first*, but generally I don’t think it’s because of “information control” issues, rather, it’s because API is a three letter acronym– and the government thinks three letter acronyms are really sexy.

      That’s what I figured too. But it’s so much HARDER in many cases to provide an API than to just put up a database dump. So it’s funny that they get it backwards.

      (Thanks for commenting, I’m a big fan of Sunlight Labs!)


  17. Amen! At MusicBrainz, we offer a public database dump that is regenerated weekly I think. It does require some PostgreSQL specific extensions, but it’s the best we can do. We also offer a live replication stream (paid for) with hourly replication packets. And then on top of that we also offer a web service with a 1 query per second limit. They all get heavily used, and we certainly wouldn’t ever think about removing one of the options now.


  18. @Clay if “API” is sexy and “bulk data download” is not, perhaps we need to talk about offering “BD2 Compatibility”. (half-serious)


  19. Just in the last couple of days I’ve been researching some claims by our local school board member that 70% of the voters in our county are 62 or older. Trying to find the data has been an interesting experiment in open data. It turns out that the elections commission only offers the data in a PDF of a printed report, and that comes from the state anyway. I would much prefer to have been able to download raw data, I don’t want an API at all.

    Oh, and it turns out that the number is 17%, not 70%. So now I’m not sure if my source misheard or if the board member misspoke, but either way it’s good to be able to dig into the raw data myself.


  20. Why not provide the API in CouchDB format? That way the built-in replication feature could be used to pull down new data at specified times. This could also count as the full allocation of requests for a day.



Copyright © 1998 - 2011 Nat Friedman