2/20/2010
Data Migration (2)
There are three difficulties with NoSQL data migration, particularly when the data is serialized object state:
- Relationships between entities create dependencies between data migration rules.
- Lack of ad-hoc query support.
- It is hard to migrate in batches.
The two approaches we talked about are migrate on load and migrate all at once. Both have pros and cons, and they relate closely to the three difficulties mentioned above.
The pros of migrate on load: it does not require shutting down your database or application, so in theory live migration is possible this way. Another big related benefit is that you spread the cost of data migration over time, so if the data set is huge, it is very economical to do it this way, especially when a lot of the data is not frequently used.
The cons of migrate on load: it is very difficult to deal with the dependencies between data migration rules. You cannot fail fast if there are flaws in the data migration code itself. The design is also more sophisticated, so it is more likely to run into problems.
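To make the migrate-on-load idea concrete, here is a minimal sketch in Python. The store interface (load_raw/save_raw), the version field, and the rule names are my own assumptions for illustration, not our actual code:

```python
# Minimal sketch of migrate-on-load: each stored document carries a schema
# version, and pending migration rules run lazily when the document is
# loaded. The store API (load_raw/save_raw) is a hypothetical key-value
# interface.

CURRENT_VERSION = 3

def v1_to_v2(doc):
    # Example rule: rename a field.
    doc["customer_name"] = doc.pop("name", None)

def v2_to_v3(doc):
    # Example rule: add a new field with a default value.
    doc.setdefault("status", "active")

MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}

def load(store, key):
    doc = store.load_raw(key)              # fetch serialized object state
    version = doc.get("_version", 1)
    if version < CURRENT_VERSION:
        while version < CURRENT_VERSION:   # apply rules one step at a time
            MIGRATIONS[version](doc)
            version += 1
        doc["_version"] = version
        store.save_raw(key, doc)           # write back so the cost is paid once
    return doc
```

Because each object pays its own migration cost the first time it is loaded, the expense is naturally spread over time, which is exactly the benefit described above.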
The pros and cons of migrating all at once are exactly the opposite of the above. Both approaches have trouble with the lack of ad-hoc query support. For example, if you want to change a reference from an internal id to a business id, you very likely need to translate one particular id into the corresponding business id. This kind of query is very unlikely to have a pre-designed index table, so without ad-hoc query support the data migration code is very hard to write. You might need to build a special index table just for data migration purposes. Luckily, if we use SQL Server as our NoSQL database, we can leverage its XQuery capability.
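As a rough sketch of the id-to-business-id example, here is what a throwaway index table built just for the migration might look like; scan_all(), the class names, and the field names are assumptions, not a real API:

```python
# Sketch: without ad-hoc query support, build a temporary lookup table that
# maps internal ids to business ids by scanning the referenced documents
# once, then rewrite references from the in-memory map. scan_all() yields
# (key, document) pairs and is hypothetical, as are the field names.

def build_id_index(store):
    index = {}
    for key, doc in store.scan_all("Customer"):   # one full scan, done up front
        index[doc["id"]] = doc["business_id"]
    return index

def rewrite_references(store, id_index):
    for key, doc in store.scan_all("Order"):
        internal_id = doc.get("customer_id")
        if internal_id in id_index:
            doc["customer_ref"] = id_index[internal_id]  # reference by business id
            del doc["customer_id"]
            store.save_raw(key, doc)
```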
For batching, migrating on load is not a problem, but migrating all at once can be very time-consuming. It is now the number one concern in my team around data migration, and I have no good solution for it yet. Previously, we wrote data migrations in SQL, which is batch processing by nature. But now we do not have a schema, so SQL is not applicable anymore, which means more RPC round-trips are involved in the data migration; we literally need to load the whole database out. The long-term mitigation is to introduce MapReduce.
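The MapReduce idea would look roughly like the sketch below, where each worker migrates its own partition of keys so the RPC round-trips are spread across workers instead of funneled through one client. The partitioning, the worker pool, and the reuse of the load() function from the earlier sketch are all assumptions:

```python
# Sketch of batch migration split across workers, map-reduce style: each
# worker migrates one partition of keys independently. The "reduce" step is
# just summing the per-partition counts.

from concurrent.futures import ThreadPoolExecutor

def migrate_partition(store, keys):
    count = 0
    for key in keys:
        load(store, key)   # reuse migrate-on-load: loads, migrates, writes back
        count += 1
    return count

def migrate_all(store, partitions, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda keys: migrate_partition(store, keys), partitions)
    return sum(counts)
```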
The less well-known problem is the problem of dependencies. One simple question: if we move a field from class A to class B, then when we load an object of class A, should we migrate it together with the referenced object of class B? And when we load an object of class B, should we migrate it together with the referenced object of class A? Are we running into a circular reference problem here? This is just an obvious example; there are many less obvious ones. For example, if we delete a class, then every data migration referencing that class must be executed against the whole database, otherwise we may no longer be able to load objects of that class. How can we avoid that? Let's talk about it in the next post.
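To show where the dependency tangle comes from, here is a sketch of the two rules for moving a field from A to B; each rule wants the other side migrated too. The "visiting" set is just a naive guard so the sketch terminates for illustration; it does not answer the ordering question raised above, and all names here are made up:

```python
# Sketch of the field-move dependency: the "size" field moves from A to B.
# Migrating an A reaches into its referenced B, and migrating a B reaches
# back into its referenced A; without the 'visiting' guard this would
# recurse forever.

def migrate_a(store, key_a, visiting=None):
    visiting = visiting if visiting is not None else set()
    if key_a in visiting:
        return                              # already in progress: break the cycle
    visiting.add(key_a)
    a = store.load_raw(key_a)
    if "size" in a:                         # field "size" is moving from A to B
        b_key = a["b_ref"]
        migrate_b(store, b_key, visiting)   # A's rule pulls in B...
        b = store.load_raw(b_key)
        b["size"] = a.pop("size")
        store.save_raw(b_key, b)
        store.save_raw(key_a, a)

def migrate_b(store, key_b, visiting=None):
    visiting = visiting if visiting is not None else set()
    if key_b in visiting:
        return
    visiting.add(key_b)
    b = store.load_raw(key_b)
    a_key = b.get("a_ref")
    if a_key is not None:
        migrate_a(store, a_key, visiting)   # ...and B's rule reaches back into A
```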