2/18/2010

 

Avoiding N+1 Problem (2)

In the previous post, I talked about how to use batch loading to avoid loading references one by one, which causes too many queries. We started by optimizing the loading of collections, and then found that all references can be optimized the same way. But for indirect references, the optimization seemed very tedious.

How can we gather things into a batch without making the code look like a mess? For example, object A references objects B and C, object B references D and E, and object C references F and G. How can we load D, E, F and G in one batch? This was not even an option with a traditional ORM, because B and C have different schemas, so the SQL would be different, and there would be no way to do this kind of batch loading. But because all entities here are stored in the same schema, in the EntityState table, this optimization is logically possible.
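To make the graph concrete, here is a hypothetical sketch (the class and field names are my assumptions, purely for illustration):

class A { public Guid BId, CId; public B B; public C C; }
class B { public Guid DId, EId; public D D; public E E; }
class C { public Guid FId, GId; public F F; public G G; }
class D { }
class E { }
class F { }
class G { }

Naive loading issues one query per reference: first B and C, then D, E, F and G one by one. Because every entity is a row in the same EntityState table, the whole second level could instead come back from one query like SELECT ... FROM EntityState WHERE Id IN (DId, EId, FId, GId).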

The difficulty is not loading or batching the entities; the loading is just the same SQL. The difference between loading D, E, F and G is the post-processing: different objects need to be loaded and then assigned to different fields. So it is essential to know what the post-processing steps are. An ideal way to express this in C# is:

private readonly IDictionary<Guid, Action<EntityState>> accumulatedCallbacks =
  new Dictionary<Guid, Action<EntityState>>();

If we store the post-processing actions in a dictionary called accumulatedCallbacks, we can decide when to run them. So, instead of doing

var entityState = entityStateLoader.Load("xxxx");
DoMyPostProcessing(entityState);

we pass the post-processing in as an Action and store it in the dictionary. Then, when the batch is "big enough", we can call those callbacks, passing in the loaded entity states:

entityStateLoader.LoadLater("xxxx", DoMyPostProcessing);
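A minimal sketch of what LoadLater itself could look like (the duplicate-id handling here is my assumption, just one possible choice):

public void LoadLater(Guid id, Action<EntityState> callback)
{
  // Nothing is loaded here; we only record the id and its post-processing.
  // The actual loading happens later, in Flush(), one batch at a time.
  Action<EntityState> existing;
  if (accumulatedCallbacks.TryGetValue(id, out existing))
    accumulatedCallbacks[id] = existing + callback; // same id wanted twice: run both callbacks
  else
    accumulatedCallbacks[id] = callback;
}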

Now this seems to work, except: when are we going to call these accumulated callbacks? When are we going to load the entities with those ids? To answer this, we'd better look at the code:

public void Flush()
{
  while (accumulatedCallbacks.Count > 0)
  {
    // Snapshot the callbacks, because running them below will register
    // new callbacks into accumulatedCallbacks in the meantime.
    var callbacks = new Dictionary<Guid, Action<EntityState>>(accumulatedCallbacks);
    accumulatedCallbacks.Clear();
    ApplyCallbackOnEntities(callbacks.Keys.ToArray(), callbacks);
  }
}

private void ApplyCallbackOnEntities(IEnumerable<Guid> ids, Dictionary<Guid, Action<EntityState>> callbacks)
{
  // One batched query for all the ids collected in this round.
  var loadedStates = states.BatchLoad(ids);
  foreach (var state in loadedStates)
  {
    callbacks[state.Id](state);
    callbacks.Remove(state.Id);
  }
  // Ids that did not come back still get their callback, with null,
  // so the post-processing can handle missing entities.
  foreach (var callback in callbacks.Values)
  {
    callback(null);
  }
}
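Since all entities share one schema, states.BatchLoad can run the same SQL for every type. A rough sketch (the table layout and the ExecuteQuery helper are my assumptions):

public IEnumerable<EntityState> BatchLoad(IEnumerable<Guid> ids)
{
  // One round trip for the whole batch, whatever the entity types are,
  // because every entity is a row in the same EntityState table.
  var idList = string.Join(",", ids.Select(id => "'" + id + "'").ToArray());
  return ExecuteQuery("SELECT * FROM EntityState WHERE Id IN (" + idList + ")");
}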

The important part is in the while loop. The sequence is:

  1. Copy accumulatedCallbacks to a local variable.
  2. Clear accumulatedCallbacks.
  3. Loading happens: states.BatchLoad(ids).
  4. Each callback is called.
  5. The tricky part: while the callbacks are being called, accumulatedCallbacks accumulates more callbacks in the meantime, because each callback calls LoadLater to load its own references (see the sketch after this list).
  6. If accumulatedCallbacks is not empty, repeat the steps.
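To see the accumulation in action, here is a hypothetical usage sketch, reusing the A..G graph from above (Materialize stands in for whatever turns an EntityState into a typed object; it is my placeholder):

entityStateLoader.LoadLater(aId, aState =>
{
  var a = Materialize<A>(aState);
  entityStateLoader.LoadLater(a.BId, bState =>
  {
    a.B = Materialize<B>(bState);
    entityStateLoader.LoadLater(a.B.DId, dState => a.B.D = Materialize<D>(dState));
    entityStateLoader.LoadLater(a.B.EId, eState => a.B.E = Materialize<E>(eState));
  });
  entityStateLoader.LoadLater(a.CId, cState =>
  {
    a.C = Materialize<C>(cState);
    entityStateLoader.LoadLater(a.C.FId, fState => a.C.F = Materialize<F>(fState));
    entityStateLoader.LoadLater(a.C.GId, gState => a.C.G = Materialize<G>(gState));
  });
});
entityStateLoader.Flush();

Round 1 of the while loop loads A; round 2 loads B and C in one batch; round 3 loads D, E, F and G in one batch.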

Essentially, we turned a sequential process into asynchronously connected steps, also known as continuations. This way we achieve better runtime performance while keeping the main logic (load and assign back to fields) unaware of the performance optimization. Another example of separation of concerns.


Comments:
Hi Taowen - funny you mention this problem. Some friends at MS just told me about this:

http://tomasp.net/blog/csharp-async.aspx

... seems like you are trying to do some of the same things where you create an iterator to abstract out the idea of an async workflow.
 
Yes, they are the same thing. Jeffrey Richter wrote about "AsyncEnumerator", and I read about it before. The idea of async is pervasive, but applying it to batch loading is my idea :)
 