Exciting Times Ahead – Hybrid Data Cluster with Postgres, MongoDB and Hadoop
Sameer Kumar I Senior Solution Architect, Ashnik
There is something really exciting happening in PostgreSQL world. Recently with this patch commit: cb1ca4d800621dcae67ca6c799006de99fa4f0a5 in PostgreSQL/Allow foreign tables to participate in inheritance Tom Lane pushed an important piece of work in Postgres (by Shigeru Hanada and Etsuro Fujita).
The change means that now you will be able to create a foreign table which participates in inheritance hierarchy. PostgreSQL already facilitates table partitioning and partition pruning via table inheritance. Now the same can be extended to Foreign Tables, i.e. you can have your partitions residing on different servers. Huh? Isn’t that sharding? Yes you got it right it is sharding. Though this is something which will be available in PostgreSQL v9.5, people have already started to play around with it .
In one of my previous posts I outlined how one can use Foreign Data Wrappers to store data in a underlying sharded MongoDB store and use Postgres’ SQL Interface. But this new feature of v9.5 is something even more exciting. You can have a sharded Postgres cluster where each shard could be on different servers and a central server/primary shard can be used for controlling the data insertion. So far the examples that I have seen around are primarily using postgres_fdw, but the possibilities are endless. In ‘not so distant future’ we may have other FDW too being capable of this. i.e. I can have one of my child tables on MongoDB and other one in Postgres or on Hadoop.
Now doesn’t Hadoop or MongoDB already have sharding capabilities? Why would I mix it up with PostgreSQL? The answer is ‘simplicity’. I just want to keep things simple for my users. My users understand SQL pretty well and they find SQL to be quite intuitive for them. So I can have my most recent data sitting on Postgres and as it becomes cold I might start moving it to MongoDB and then to Hadoop. I can use the oldest set of data on Hadoop for doing complex analytics – to churn out a model for predictive analysis etc and I can continue to use that data for answering rare queries on historical data. Or I can use a current ‘data store’ data according to its nature of a platform which is best suited for it and then logically combine them under one entity (a table) in PostgreSQL (using FDW and child tables).
Whatever be your use case the big advantage is, users (who prefer SQL) will still look at this data as one single logical table in a Postgres Database. When they fire a query Postgres optimizer will go ahead and figure out whether it needs to be fetched from local child table or child table linked to MongoDB or child table linked to Hadoop. There are two other things being discussed and actively pursued in Postgres world-
- Pushdown of aggregates, predicates and joins i.e. I would be able to push down the aggregations queries meant for MongoDB nodes to Mongo server.
- Parallel scans which means if the data for my query is spread across different child tables or foreign child tables I can scan them in parallel.
These possibilities make it really intersting. Imagine you are able to harness capabilities of MongoDB and Hadoop in parallel and combine the resulting set without doing any bit of programming at your end. This all sounds like a fiction but as you track the recent developments and discussions in Postgres community it all looks possible in near future. Postgres community has been thriving to be in mainstream and address recent trends and demands in data-world. The community is well positioned to address the needs of hybrid architecture which is going to pick up very soon. So organizations would look for combining all the bits and pieces of their data – whether ‘Big’ or ‘Large’, ‘hot’ or ‘warm’, ‘BI-data’ or ‘BA-data’, ‘Hadoop’ or ‘Mongo’, ‘structured’ or ‘unstructured’. Postgres with its FDW, inheritance and parallel scan capabilities will be a key player not just for being the data store but also for being the logical data-hub.
Exciting times are ahead!
- Sameer Kumar is the Database Solution Architect working with Ashnik. Sameer and his team works towards providing enterprise level solutions based on open source and cloud technologies. He has an experience in working on various roles ranging from development to performance tuning and from installation and migration to support and services roles.
- Would PostgreSQL leak your data? No! But YOU might!
- Integrating Docker Trusted Registry with Google Chat
- Chaos Engineering with Docker EE