Friday, January 14, 2011

Why SQL/MED is Cool

One of the big patches that is in the works for PostgreSQL 9.1 -- and will hopefully but not for sure make the cut -- is a series of patches that implement basic SQL/MED functionality for PostgreSQL.   What is SQL/MED and why should you care?

The "MED" in "SQL/MED" stands for "management of external data", which basically means managing data that's not stored in PostgreSQL much as if it were stored in PostgreSQL.  In other words, you can make other data sources look like tables.  There are two parts to this functionality: some core support, to make PostgreSQL understand the concept of foreign tables; and then individual foreign data wrappers, which act as a bridge between PostgreSQL and some other data source.  Each foreign data wrapper is responsible for the interaction with one particular type of external data - for example, you might have a foreign data wrapper that reads CSV files, another one that talks to other PostgreSQL instances, and another that downloads a URL out there on the internet that returns tabular data in XML when you send it a GET request.  Foreign data wrappers must be written in C, but you've written and installed it, you can create new foreign tables based on that foreign data wrapper just by issuing SQL commands.

So, imagine for example that you get hired to write a web application, and you decide that for the backend database you should use PostgreSQL.  You build this application and everyone thinks it's great, but there's one problem: they want it to integrate with the accounting database, which runs on Microsoft SQL Server.  I've solved this problem three times so far, and I've done it basically the same way all three times: I wrote a Perl script that runs periodically from crontab and replicates the information from the accounting database into PostgreSQL tables.  SQL/MED allows a different approach: just create foreign tables on your PostgreSQL server that point to the SQL server tables, and pretend that that they are local tables, and away you go.  No more copying data.

Sadly, things aren't going to be quite that good in PostgreSQL 9.1.  The only foreign data wrappers which have been submitted for inclusion in PostgreSQL 9.1 are one which understands CSV files, and another that can talk to other PostgreSQL databases - and before we can even think about including either of those, we have to get the core support committed, which hasn't happened yet.  Furthermore, the version of SQL/MED that we're going to have in PostgreSQL 9.1 is going to be pretty dumb: every time you request data from a foreign table, it'll scan the whole table.  My guess is that many people will set up foreign tables and then periodically do "CREATE TABLE localcopy AS SELECT * FROM remotetable" and then create some indexes on the local copy.  So there's clearly a lot of room for improvement here.


Despite that, I think that this is a great feature.  Assuming the core support gets committed, I expect that we'll see people writing foreign data wrappers for all kinds of crazy data sources.  Some will be things that'll never be included in our core distribution (like Hitoshi Harada's twitter_fdw) and others will - hopefully - be things that ultimately end up shipping with the base product.  But even before they get incorporated into the base product, people can start publishing them for use with PostgreSQL 9.1.  Foreign data wrappers will become the glue between PostgreSQL and any place you'd like to get data from: instead of incorporating logic into your application to talk to 27 different data sources, you can have PostgreSQL talk to 27 different data sources and write your application to talk only to PostgreSQL.  While not everyone may want to do it that way, it's a very cool option to have available.

8 comments:

  1. Thanks Robert.

    I actually understood every word - without Googling 20x.

    ReplyDelete
  2. It's great to see my Evil Plan come to fruition.

    ReplyDelete
  3. I think the proposed interface can push down WHERE clause to remote, so it doesn't scan whole the table always.

    ReplyDelete
  4. Of course! My Evil Plan was to inspire others to do the work inside the engine by prototyping something in user space :)

    ReplyDelete
  5. could this allow to store BLOBs outside of PG ? Like, in S3 or Riak ?

    ReplyDelete
  6. Hi Robert,

    This is great stuff. I am looking forward to learning more about SQL/MED at the upcoming SFPUG meeting.

    It will be great if the implementation could push down restrictions as Hitoshi Harada mentions, but also projections -- why return all column values in cases where the query only asks for a few. In other words, it will be great for SQL/MED to take the opportunity to offload as much of the table-scan grunt work to the external data source.

    ReplyDelete
  7. There are similar efforts in many of the open source ETL and data cleaning tools, for example, DataCleaner has an eobjects.org MetaModel that connects DataCleaner to a variety of data sources. Would it be possible to have a common set of external data readers for multiple open source projects? Since some are in Java and others in C, perhaps the best solution would be a common binary format rather than a common implementation.

    For EnterpriseDB, a GUI tool like DataCleaner would be a nice exploratory front end for SQL/Med. At least in my ETL experience, the 'easy' part is mapping the data and the headaches come when you have to deal with data inconsitencies, especially in inputs that are not a structured at relational database tables.

    ReplyDelete