1/25/2017


Architecture Fundamentalism: Review of My 2016

I write code every day; I thought I would never categorize myself as "impractical" or "theoretical". However, after reviewing my learning journey of 2016, I coined a term: "Architecture Fundamentalism".

Architecture Fundamentalism: applying a principle or an architecture without context

That is exactly what I did, in retrospect. It is too hard to say this in my mother tongue, so I am writing my retrospection in English.
I learned the following lessons the hard way.

Idea 1: Adding tests to regain confidence

The idea is simple. Tests are important for agility. If you want people to have confidence and do more refactoring, you should have tests first. So, let's add tests to a big legacy codebase, in the hope that the developers will appreciate the tests and regain confidence when making changes. Isn't that a good idea? Did it work? Not really...

Idea 2: Integration tests over mocks

Having stayed at a company that invented something like 10 xxxMock frameworks, I have had enough bad experiences where a broken build tells you nothing except that some mocking behavior has changed. I hate mocks, I do not deny it. So let's embrace the great grand integration test: run all the components within a bounded context and test them together. The insight is that the boundary between client and server should be stable. If we write tests simulating the client, they should produce very stable results between releases. What went wrong? A lot of things...

Idea 3: Unify the middleware

The technical stack is a disaster. There are 5 languages, 8 frameworks, and countless libraries in production. If we want to do distributed tracing, service discovery, and traffic coloring, shouldn't we unify the stack? After all, every successful company is backed by one core RPC stack. Shouldn't we do that? Yes, we should. Is it the only way, and the best way? Maybe...

Idea 4: Decoupling systems with events

RPC is evil and fragile. Fred George, a funny old thought leader. What I learned from him and many other big bloggers: don't model your system with RPC. The system should be divided into bounded contexts, and they should be integrated through a messaging system. Loose coupling, which those idiots could never grasp. It turns out I am the moron who blindly trusted an architecture without proving it fit the context. There are simply too many unanswered questions about replacing RPC with an event-based architecture.

===

All of these ideas are attractive at first glance. Now, let me summarize what actually went wrong, and which practical solutions (maybe quick & dirty) actually worked.

Experience 1: Small flow in production trumps everything

TDD is dead. Don't get me wrong, testing is a good thing. It is like a daily physical workout: it keeps your body fit. However, what most people want is just a system that is not bleeding all the time. When you actually have a system on the edge of crashing every day, adding tests should NEVER be your first priority. The golden tool in the internet business is the small-flow release in production. Nothing is more convincing than code actually working in production.

Aren't we already doing small-flow releases every day? It turns out we are not doing them enough. With everything coupled in one module, you can do at most 4 small-flow releases a day, with 4 hours each to verify that a change is actually working. This is not enough, given how many changes are flowing into the big ball of mud. The most important result of SOA decoupling is not blah, blah, blahhh. It is allowing you to have more modules doing small-flow releases. This way, multiple modules can do online testing in parallel, giving the new code more time to prove itself, as sketched below.
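For anyone unfamiliar with the term, a small-flow release is just a percentage-based rollout. Here is a minimal sketch of the switch, assuming hypothetical old_handler/new_handler entry points:

import hashlib

SMALL_FLOW_PERCENT = 5  # route 5% of traffic through the new code path

def in_small_flow(user_id):
    # hash the user id so each user consistently sees the same version
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < SMALL_FLOW_PERCENT

def handle_request(user_id, request):
    if in_small_flow(user_id):
        return new_handler(request)  # the code under online testing
    return old_handler(request)      # the proven code path

Ramp SMALL_FLOW_PERCENT up as the new code proves itself, or drop it back to 0 to roll back.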

Are tests irrelevant then? NO. Testing is still a good practice. Once keeping the system from crashing is no longer your daily business, keeping your developers happy becomes important. Modifying a PHP source file without the ability to run it in a test environment, only to find out you misspelled a method call and caused a PHP error in production, is NOT a fun experience. Even with small flow to contain the damage, the release/rollback process is still time consuming and frustrating. Without a good test suite, you cannot expect your developers to be brave enough to do the right thing (I mean actually refactoring the code to fit the business model).

But let's face the cold fact: it doesn't matter. Small flow in production trumps everything. Everything else comes second.

Experience 2: Single-module tests matter

My principle is really simple: tests should cover the business rules, not the code. Tests should guard the business against financial loss, ensuring the main business process will continue after every release.

My principle is still rock solid, but that does not mean tests should ONLY do that. I didn't get it the first time. I thought: why are you guys creating tests for a single module, and mocking everything around it? I blamed the team structure for that. Because the teams are set up as silos, they only care about their own business. So, in the end, no one is responsible for the whole process with all the components assembled together.

Without context, my argument seems right. So let me give you some context: we have such a large PHP codebase, with so many dependencies, that developers cannot even run the code up to the point where they made their change. To make the code actually run, from index.php to the point of the change, you need to set up more than 10 system dependencies with a lot of data configuration in the database. Given that the code is written in PHP, there is no object model and no compile-time type safety. The only defense between your change and a total disaster is the goddamn "small flow in production".

What should you do in this context? Right: let developers run their code before it goes into production. Give those poor boys a run button, even if it only checks that the PHP syntax is correct. What is the easiest way to make the code run? Yes, just run the module under change, and mock everything else. How do we keep the mocks from failing? If we hand-write the mocks, they will be tedious and fragile. If we capture tracing data from production and replay it in the test environment, updating the mocking behavior becomes cheap.
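A minimal sketch of the replay idea, assuming a hypothetical line-delimited JSON trace format captured from production:

import json

class ReplayMock:
    """Answers calls with responses recorded from production traces."""

    def __init__(self, trace_file):
        self.recorded = {}
        with open(trace_file) as f:
            for line in f:
                # each line: {"service": ..., "method": ..., "args": [...], "response": ...}
                rec = json.loads(line)
                key = (rec["service"], rec["method"], json.dumps(rec["args"], sort_keys=True))
                self.recorded[key] = rec["response"]

    def call(self, service, method, args):
        key = (service, method, json.dumps(args, sort_keys=True))
        if key not in self.recorded:
            raise KeyError("no recorded response for %s.%s(%r)" % (service, method, args))
        return self.recorded[key]

When production behavior changes, you re-capture the traces instead of hand-editing mock code, so the mocks stay cheap to keep up to date.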

So, do write tests for a single module. It matters, and it matters a lot. I was wrong.

Experience 3: Result-oriented thinking

Unifying the language/framework/library is the right thing to do, but you need enough benefit to justify the cost. The core business of the middleware is encoding/decoding and connection pooling. The attached value of an RPC framework is unified logging/metrics, service discovery, and resilience. Having a Thrift framework with everything included is not enough to justify the cost of switching the stack away from HTTP+JSON. The stability of the core business is more important. By changing from HTTP+JSON to Thrift we might get the attached value we want, but it also means a lot of things need to be re-tested and re-verified in production.

What the boss wants is NOT a unified technical stack. No one is paying for that. They want a distributed tracing system, so that when a customer complains about a bad experience, we can actually look it up through a web interface. They want to load-test the whole business process in production, which requires coloring the testing traffic to distinguish it from normal traffic.

To achieve that result, is unifying the stack the right solution? Not really. Even with 5 languages and 8 frameworks to maintain, we can still make the changes to them one by one. Instead of changing one place, we change 30 or more places to get the job done. Is it costly? Yes, it is. Is the result delivered? Yes, it is done, sir. Good.

This is result-oriented thinking. What could actually justify the cost of unifying the technical stack? I don't know yet. Maybe when the boss actually wants a smaller team, so that fewer people are doing similar work? Just maybe.

Experience 4: Classical event-based architecture is impractical

I love event-based architecture. I love decoupling. But telling people is not working. There are very practical reasons, and the classical event-based architecture is actually unfit for this context. We should listen first, not blindly impose our will. Here is the list of reasons why we should stick with RPC:

  • Our business is a realtime business. Delaying an event in a queue and making it eventually consistent is just not acceptable. Delivering the event to the next system is mission critical to the business. If delivery fails, we should try our best to re-route it and keep the main process flowing. Queuing makes it harder for the upstream to make sure delivery happens, and happens fast.
  • The client side expects a synchronous response. When an order is finished, how much it will cost you is part of the response displayed on the screen. It is not acceptable to do the calculation asynchronously, which would require a big API/interaction change. It is simply impractical to make every UI interaction async.
  • Lack of tooling support. Event-based architecture requires a lot of tooling to be feasible. Without good distributed tracing, when no message flows out, it is very hard to track down which link dropped or missed it. If the message is passed via RPC, we can be sure there is an error log somewhere. And most importantly, every developer knows RPC, and knows to look at those error logs.
These are very practical reasons. I found that all the benefits promised by the grand async architecture simply cannot justify the cost. I tried to use Kafka somewhere in the system, but could not find enough applicable places to make a difference. The main business process is still a big ball of mud; it seems you cannot decouple it with any messaging system without changing the business requirements. Which is really, really frustrating.

Mimicking RPC with a duplex messaging channel seems like a good idea, and we actually tried it. The project finally got cancelled, and I will never try that again. It was a total disaster.

What actually worked? Well, it turns out this is what the system ended up as:

  • do RPC and return the result synchronously on normal days
  • when the downstream malfunctions, return a fake result with a disclaimer, so that end-users will know (or they simply do not care)
  • store the failure message in an async queue
  • the queue keeps retrying the process until everything has recovered
No one designed the system to work this way; after patch upon patch, it evolved itself into this form. It actually reveals an important paradigm. It is still an event-based architecture in a nutshell. Compared to the classical design, the upstream and downstream are not completely decoupled. In the normal flow, RPC is invoked and the result is returned. When bad things happen, we downgrade the system into async mode. Through reliable event recording and eventually-consistent message replaying, we can guarantee the data becomes consistent eventually.
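A sketch of this downgrade-to-async pattern, with hypothetical billing_rpc/RpcError names and an in-memory queue standing in for whatever durable store production would need:

import queue
import time

retry_queue = queue.Queue()  # production would need a durable queue, not in-memory

def charge_order(order):
    # billing_rpc and RpcError are placeholders for your RPC client
    try:
        # normal days: synchronous RPC, the real result goes back to the client
        return billing_rpc.charge(order)
    except RpcError:
        # downstream malfunctioned: record the event, answer with a disclaimer
        retry_queue.put(order)
        return {"status": "pending", "disclaimer": "final price will be confirmed later"}

def replay_worker():
    # keep retrying until the downstream recovers; the data becomes eventually consistent
    while True:
        order = retry_queue.get()
        try:
            billing_rpc.charge(order)
        except RpcError:
            time.sleep(5)
            retry_queue.put(order)

Note that blind retrying assumes charge() is idempotent; in practice the replay needs a deduplication key.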

===

Be pragmatic, and carry on

2/27/2012


VBA Stacktrace

Logging in VBA is hard, believe it or not. There is no obvious way to find out what went wrong, even if you decide to do some manual logging. Unlike C# or other "industrial" programming environments, there is no easy way to get the stacktrace of the current execution point. This article describes a way that I invented/found to do it "elegantly" (relatively speaking). It looks like:
Public Sub GetStockPrice()
If HEG Then On Error GoTo PROC_ERR
  'sub body
PROC_ERR:
  GetLogger().Error "GetStockPrice"
End Sub
Public Sub SetStockCode(StockCode)
If HEG Then On Error GoTo PROC_ERR
    'sub body
PROC_ERR:
    GetLogger().ReraiseError "StockInfoForm.SetStockCode", Array(StockCode)
End Sub
HEG is a global boolean constant, standing for "Handle Error Globally". If we wrap all our subs/functions in blocks like the above, we can ensure logs are printed out when an error happens. And the stacktrace will be available in the log file, along with the invocation parameters of the whole invocation chain.
You might wonder how it is implemented. It is actually quite simple: it is NOT maintaining a stacktrace somewhere. If we did it that way, we would need to push/pop frames into/from the stacktrace. My way is a little bit simpler; it does not force you to maintain a separate stacktrace. When an exception is raised, the On Error GoTo statement catches it, and the GetLogger().Error call writes the error out to the log file. One thing that is not so obvious: instead of resuming next, the GetLogger().ReraiseError call also raises another exception, which can be caught at the outer level. Again, the re-raised exception is caught and logged to file. This way, a stack trace is recorded in the file, with the root cause at the top and the outermost calling place at the bottom.
The complete source code is available here (Logger.cls):
Option Explicit

Const Level As String = "Info"
Const Output As String = "File"
Const ReraisedErrorNumber As Long = vbObjectError + 1985

Private Context As New Collection
Private FileNumber As Integer

Public Sub ClearContext()
    Set Context = New Collection
End Sub

Public Sub Error(FunctionName As String, Optional Args)
    If IsMissing(Args) Then
        Args = Array()
    End If
    HandleError False, FunctionName, Args
End Sub

Public Sub ReraiseError(FunctionName As String, Optional Args)
    If IsMissing(Args) Then
        Args = Array()
    End If
    HandleError True, FunctionName, Args
End Sub

Public Sub Info(Msg As String)
    If "Error" = Level Then
        Context.Add Msg
    Else
        PrintToOutput "Info", Msg
    End If
    FlushOutput
End Sub

Public Sub Dbg(Msg As String)
    If "Dbg" = Level Then
        PrintToOutput "Debug", Msg
    Else
        Context.Add Msg
    End If
    FlushOutput
End Sub

Private Sub PrintToOutput(Level As String, Msg As String)
    Dim FormattedMsg As String
    FormattedMsg = "[" + Level + "]" + " " + CStr(Now()) + ": " + Msg
    If "File" = Output Then
        Print #GetFileNumber(), FormattedMsg
    Else
        Debug.Print FormattedMsg
    End If
End Sub

Private Sub FlushOutput()
    Close #GetFileNumber()
    FileNumber = 0
End Sub

Private Function GetFileNumber()
    If FileNumber = 0 Then
        FileNumber = FreeFile
        Open GetFilePath() For Append Access Write Shared As FileNumber
    End If
    GetFileNumber = FileNumber
End Function

Private Function GetFilePath()
Dim FileName As String
    FileName = "zebra-word-" + CStr(Year(Now())) + "-" + CStr(Month(Now())) + "-" + CStr(Day(Now())) + ".log"
    GetFilePath = Application.Path + ":Zebra:Log:" + FileName
End Function

Private Sub HandleError(ReraisesError As Boolean, FunctionName As String, Args)
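    'Err.Number = 0 means execution fell through the PROC_ERR label
    'without an actual error being raised, so there is nothing to log.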
    If 0 = Err.Number Then
        Exit Sub
    End If
    If ReraisedErrorNumber = Err.Number Then
        PrintToOutput "Error", "Stack: " + FormatInvocation(FunctionName, Args)
    Else
        PrintToOutput "Error", "Stack (Root): " + FormatInvocation(FunctionName, Args)
        PrintToOutput "Error", "Err Number: " + CStr(Err.Number)
        PrintToOutput "Error", "Err Source: " + Err.Source
        PrintToOutput "Error", "Description: " + Err.Description
        PrintToOutput "Error", "Help File: " + Err.HelpFile
        PrintToOutput "Error", "Help Context: " + CStr(Err.HelpContext)
        PrintToOutput "Error", "Last Dll Error: " + CStr(Err.LastDllError)
        DumpContext
    End If
    Err.Clear
    FlushOutput
    If ReraisesError Then
        Err.Raise ReraisedErrorNumber
    Else
        MsgBox "Oops... something wrong happened. Please send your blame to taowen@gmail.com"
    End If
End Sub

Private Sub DumpContext()
Dim Msg
    PrintToOutput "Context", "Dumping context..."
    For Each Msg In Context
        PrintToOutput "Context", CStr(Msg)
    Next Msg
    PrintToOutput "Context", "Dumped context"
    Set Context = New Collection
    FlushOutput
End Sub

Private Function FormatInvocation(FunctionName, Args)
Dim i As Integer
Dim InvocationDescription As String
    InvocationDescription = FunctionName + "("
    For i = LBound(Args) To UBound(Args)
        If i > LBound(Args) Then
            InvocationDescription = InvocationDescription + ", "
        End If
        InvocationDescription = InvocationDescription + FormatArg(Args(i))
    Next i
    InvocationDescription = InvocationDescription + ")"
    FormatInvocation = InvocationDescription
End Function

Private Function FormatArg(Arg)
Dim ArgType As Integer
    ArgType = VarType(Arg)
    If vbEmpty = ArgType Then
        FormatArg = "[Empty]"
    ElseIf vbNull = ArgType Then
        FormatArg = "[Null]"
    ElseIf vbInteger = ArgType Then
        FormatArg = CStr(Arg)
    ElseIf vbLong = ArgType Then
        FormatArg = "[Long]" + CStr(Arg)
    ElseIf vbSingle = ArgType Then
        FormatArg = "[Single]" + CStr(Arg)
    ElseIf vbDouble = ArgType Then
        FormatArg = "[Double]" + CStr(Arg)
    ElseIf vbCurrency = ArgType Then
        FormatArg = "[Currency]" + CStr(Arg)
    ElseIf vbDate = ArgType Then
        FormatArg = "[Date]" + CStr(Arg)
    ElseIf vbString = ArgType Then
        FormatArg = """" + Arg + """"
    ElseIf vbObject = ArgType Then
        FormatArg = "[Object]"
    ElseIf vbError = ArgType Then
        FormatArg = "[Error]"
    ElseIf vbBoolean = ArgType Then
        FormatArg = CStr(Arg)
    ElseIf vbVariant = ArgType Then
        FormatArg = "[Variant]"
    ElseIf vbDataObject = ArgType Then
        FormatArg = "[DataObject]"
    ElseIf vbDecimal = ArgType Then
        FormatArg = "[Decimal]" + CStr(Arg)
    ElseIf vbByte = ArgType Then
        FormatArg = "[Byte]" + CStr(Arg)
    ElseIf vbUserDefinedType = ArgType Then
        FormatArg = "[UserDefinedType]"
    ElseIf vbArray = ArgType Then
        FormatArg = "[Array]"
    Else
        FormatArg = "[Unknown]"
    End If
End Function

1/04/2011


Retrospection: the mistakes I have made these years

Someone told me it takes more than 10,000 hours of repeated practice for a professional to mature. I am still far from that standard, but after 5 years of programming as my profession, I realize I have already made so many mistakes that they are worth some conscious retrospection.

One presentation I did not watch, but whose slides I really liked: http://www.infoq.com/presentations/LMAX. In the slides, they said:

On a single thread you have ~3 billion instructions per second to play with: to get 10K+ TPS if you don't do anything too stupid.

I have to say, I did many things that were smart at first and turned out to be stupid, and that made the ~3 billion instructions per second of hardware helpless to save the project. And it is not just about performance; many mistakes lead to other symptoms as well.

Sometimes the "I" here can be substituted with "we". I am sure of it, and I have seen other people make the same mistakes I did. As poor software developers, we do not have control over many things except the code in our hands. It is not surprising that people spend a lot of time making their code "smart". A lot of lessons can be learned from that smartness.

I do not have a full list yet, but as a start I will list some here. As this blog is primarily technical, I will keep the items mostly relevant. If I find time, I will complete them one by one:


11/27/2010


Package, the missing language feature - Part II

Problems

In the previous post, we talked about how packages work in the Python language. Essentially, the problem is: the package is a good box, but its color is not black. We want the package to expose all of its API at the package level, and seal up any internal details. from A import * should give you all the things you need; you should never need import A.B or import A.C.

Unimportable

So, how do we make a module unimportable? There are two things you need to do. First, remove the B attribute from the package A object. By doing this, import A.B will fail, because import A.B will first import A, then import A.B, and then get B from A; by deleting B from A, the import fails. Second, you need to remove A.B from sys.modules, via sys.modules['A.B'] = None. This makes from A.B import * fail.

delattr(package_A, 'B')
sys.modules['A.B'] = None

This way, we completely hide the existence of A.B, which is the behavior we want when another package tries to import this private internal. The drawback of this mechanism is that the error message the user gets is not friendly: they will be told the module does not exist, even though it clearly exists if you look in the file browser.

When?

By making internal packages and modules unimportable, we can make the parent package a black box. But when should we do this deletion of all the internal packages and modules?

The best place is in the __init__.py of the parent package. But after we delete the internal packages and modules, they are gone. What if A.B references A.C in its code? What we need to do is make sure A.B is initialized (imported) before sealing up A. In A.B, the code might use import A.C or from A.C import xxx; both ways copy the referenced names into the local namespace. So even when A.C no longer exists as an importable module, A.B can still reference what it imported.

import sys

from .B import *
package_A = sys.modules['A']
delattr(package_A, 'B')
sys.modules['A.B'] = None
Where?

Do I need to write this kind of ugly delattr in every __init__.py file? Isn't that a cross-cutting concern that should not be repeated in every place?

Yes, so let's find a way to magically inject this code into every __init__.py file. The code actually has three parts. Part 1: expose the API. Part 2: eagerly load the sub-modules. Part 3: delete the sub-modules. The API still needs to be manually defined in __init__.py. But parts 2 and 3 can be put into a "post-import hook".

What is a post-import hook? It is code executed after a module has been imported. After A is imported, we can eagerly load all its sub-modules by scanning the folder, and then delete them. Post-import hooks are not directly supported in Python, but they can be implemented with the more powerful meta import hook.

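# Note: this targets the Python 2 import machinery (the deprecated imp module
# and the old find_module/load_module finder protocol).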
def register_meta_import_hook(should_apply, post_import_hooks):
    import sys
    import imp

    class Importer(object):
        def __init__(self, file, pathname, description):
            self.file = file
            self.pathname = pathname
            self.description = description

        def load_module(self, name):
            try:
                module = imp.load_module(name, self.file, self.pathname, self.description)
                for post_import_hook in post_import_hooks:
                    post_import_hook(module)
                return module
            finally:
                if self.file:
                    self.file.close()

    class Finder(object):
        def find_module(self, qualified_module_name, path):
            if not should_apply(qualified_module_name):
                return None
            if not path:
                path = sys.path
            module_name = qualified_module_name.rpartition('.')[2]
            try:
                file, pathname, description = imp.find_module(module_name, path)
            except ImportError:
                return None  # let the default import machinery handle it
            return Importer(file, pathname, description)

    sys.meta_path.append(Finder())
Conclusion

By eagerly loading sub-modules and deleting them in a post-import hook, we can seal up the package and force people to define the package API in __init__.py, because that becomes the only way to let outsiders use the internals.

Another interesting side effect is that circular dependencies between packages are no longer possible. In Python, circular dependencies between modules are not possible, but because the modules in a package are lazily loaded, circular dependencies between packages were possible. After eagerly loading sub-modules, we have now disabled circular dependencies between packages as well. It is a good thing, but it could be very strict.

Finally, we have the box, and it is automatically sealed up in the post-import hook. If the box writer wants to make the box usable from outside, they need to define its API in __init__.py. Plus, no circular dependencies ever.


11/25/2010


Package, the missing language feature - Part I

Introduction

We have spent way too much time on functions and classes. We put a lot of energy into maintaining clean and concise class interfaces. We care about the dependencies between classes, encouraging dependency injection and wiring up objects via interfaces.

Besides functions and classes, we do have higher-level constructs. We have hacked the class loading of Java to hell, and then Satan gave us back his OSGi. We have invented a dedicated job for maintaining EJB manifests. When dependency injection is not enough, people find a concept called Module emerging in modern things like Guice and Autofac.

What is a package? In itself, it is merely a name. It is just some annoying leading dots before the thing you actually want. It is not even a thing; it is just an ignored prefix. People might say: oh yes, the package is not doing anything, why should I care? A function does something, a class does something too, but a package is just some dummy folder to put the valuable stuff in.

True, very true... and so thought the language designers. I cannot say all of them did, but at least some of them did. Stroustrup ignored packages. Gosling ignored packages. Even Hejlsberg ignored packages (but an assembly is better than nothing). What a huge mistake!

The problem we normally need to solve when writing business software is not scientific work. In my mind, the only problem we need to solve is managing complexity. As we learned long ago, the only way to control complexity is to break it down, and break it down further: one thing containing many other things. We need black boxes to encapsulate the internal complexity monster and give the outside a clean and simple illusion. But when using Java or C#, I constantly find I need to reinvent all kinds of black boxes to meet my needs. And none of them seems natural to newcomers, simply because they are not part of the original language, not known to most people, and not supported by many tools. There are many design patterns, and people say they exist because the language itself is flawed. There are also many component platforms/frameworks, and I say they exist because the language itself is flawed. Because the language does not give us the black box, we have to invent one ourselves.

The package has been a missing language feature for a long time. But luckily, Java and C# are not the only choices we have. In another open wonderland, without money but with happiness, we have our lovely Python. There, we finally see what a package can be.

Package in Python

Packages in Python are simple. If you have a folder called some_package with an __init__.py file in it, then it becomes a package called some_package. If you happen to have another folder inside the some_package folder called another_package, and it also has an __init__.py file inside, then it becomes some_package.another_package.

The key difference between a package in Java and a package in Python is that in Java, the package is just a literal symbol; it does not exist at runtime. In Python, the package is a living object, and you can set and get attributes on it at any time. some_package.another_package.abc = 'def' is a valid statement in the Python language.

This gives us the box we want. We can use this box to define our interface and hide our internal complexities. Given a package structure like A.B.C, A should hide B and C, and B should hide C. At the A level, you might say "start the car". At the B level, you might say "start the engine", and then "start the radio and air conditioner". The hierarchical structure of package naming is the best fit for natural encapsulation.

The box can also initialize itself. It has an __init__.py file which can be used to execute any bootstrapping code. Sometimes we need to sort out internal stuff before being ready to serve the outside. Sometimes we need to register ourselves as a subscriber for events published somewhere else. Having a simple __init__ solves a lot of problems. Is it powerful enough? Not really; it does not support a thousand other features, like full lifecycle management, a standard remote-control interface, etc. As a user-facing public component, the package construct exposed by the language is very limited. But we can build on top of what the language provides us.

The real problem with Python packages is not that they do not support things like JMX. The real problem is that the box is not really a black box. Actually, everything in Python is sort of made of glass: you can see right through nearly everything; it is all public in Java terms. We can use _ as a convention, and __ as a hard compiler constraint in some places, but here the ugly underscore does not help. You can always reference A.B.C.xxx anytime you want, and that is dangerous. It breaks encapsulation, introducing tangled dependencies without anyone noticing. No one likes that; we want to make sure Y.X.Z only references A.B.C.xxx through A.yyy. It should not know about internals like A.B.C.xxx.

The classic "pythonic" response would be: that is just a convention, and when a convention is there, people should follow it. The problem is that this is not an easy convention, and it can be broken at any minute. There is no easy rule people can follow. The real difficulty is that when you reference A.B.C.xxx, you cannot always reference it as A.yyy. If your code lives inside A.B, then you should not reference A.yyy, because the inside of the package should not reference the outside, as it sits lower in the dependency pyramid. In that case, you do need to reference A.B.C.xxx as A.B.C.xxx, as it is something you have to deal with directly. It is no longer a hidden internal; you are living inside the internal. In other words, A.B.C.xxx is not always public or private. It is accessible or not depending on where you are. And that is exactly what encapsulation is about.
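To make the convention concrete, here is a minimal sketch using the A/B/C names from above (all names are illustrative):

# A/__init__.py -- the public face of package A
from .B.C import xxx as yyy   # A.yyy is the sanctioned way to reach A.B.C.xxx

# code in another package, e.g. Y.X.Z -- goes through the package API
import A
result = A.yyy

# code living inside A.B -- sits below A in the dependency pyramid,
# so it references the internal directly
from A.B.C import xxx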

How can we make the box really a black box? Let's continue in the Part II.

Update 11.27

Alan (https://alanfranz.pip.verisignlabs.com/) commented: I'm not 100% sure about what you'd like to say in the next part, however:

Thanks Alan! I am certainly aware of "bootstrapping code". The major problem with import-time initialization is that it is implicit, and it gets even worse if the bootstrapping is I/O intensive or causes other side effects. But if there are too many clients, like a lot of unit tests, pushing the responsibility to them is also inconvenient. I use __import__('x.y.z') in the main function to state explicitly that I want to use those packages and initialize them now.

Starting with _ means private; that is fine. But a package is not always private: it is public to its siblings, but private to other packages. Traditional visibility only allows you to specify whether a thing is public or not, and I think that is not enough. How many details you are allowed to know, or should depend on, might be contextual.

Python does check circular dependencies at the module level, but it does not check circular dependencies at the package level. For example, A.B cannot circularly depend on C.D; but if A.E depends on C.D and C.D depends on A.B, that is allowed. In effect, that means package A depends on C, and package C depends on A as well.


2/22/2010


Data Migration (3)

The final question about data migration: how do we write it? We already know it is just a function that transforms one dictionary into another. We also know there will be questions around dependencies between entities. So, what we need are several functions, each one upgrading one version. The migration function is per version delta, not per entity. The function needs to do several things:

  1. Find out which entities need to be migrated.
  2. Load the entity state as a dictionary.
  3. Apply the migration logic to the dictionary.
  4. Save the entity state back.

Finding the entities to be migrated is easy. We already know SQL Server has XQuery support, so we can write a customized XQuery to find the target entities. Most of the time, it will be based on CLR_TYPE. Then the only remaining problem is how to write a function that transforms a dictionary in memory into another dictionary.

It might seem easy: we just need to write a function in C# that takes a dictionary as input and returns a new dictionary. Yes, this would work. But the code would be very detailed and look unintentional. It would need a lot of casting, to cast an entry to a list or a string or another dictionary based on your knowledge of the object graph. It would also need to do a lot of detailed operations, like copying a field to another and deleting the old field to perform a rename. The plumbing-code issue might be solved by introducing a non-statically-typed language, like Ruby, as a migration scripting language. But the more essential problem is how to raise the abstraction level, so that the migration script looks more intentional and reveals the original requirements.

One naive approach is to write several functions for the well-known refactorings, like Rename, MoveType, ExtractEntities. And that is exactly what we tried before. The problem with these small functions is that they are not really that reusable. Say we have a rename function which changes a direct field from one name to another. But what if we renamed a field that is inside the object graph, not directly on the root? Then the rename function can no longer help us. We might think we can abstract out the "locating" part of the function: instead of passing in two strings to identify the fields by name, we pass in two locators.

The locator is not easy to implement. Say we are renaming field x.y.z to x.y.k: x is a branch of the root, y is a branch of x, z was the field on y, and k is the new name of z. The rename function needs to take "x.y.z" and "x.y.k" as input and know how to apply them. For "x.y.z" we need to "get" the value, then use "x.y.k" to "set" the value, and then use "x.y.z" again to delete the field. The logic of getting a value is very different from the logic of setting a value.
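Here is a sketch of the locator-based rename in Python (the actual system was C#; this only illustrates the get/set split):

def locate(state, path):
    # walk the dotted path down the nested dictionaries, returning
    # the parent dictionary and the leaf field name
    keys = path.split('.')
    for key in keys[:-1]:
        state = state[key]
    return state, keys[-1]

def rename(state, old_path, new_path):
    old_parent, old_key = locate(state, old_path)
    new_parent, new_key = locate(state, new_path)
    # get via the old locator, set via the new one, delete the old field
    new_parent[new_key] = old_parent.pop(old_key)
    return state

# usage: rename(entity_state, 'x.y.z', 'x.y.k')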

In general, this approach is called functional programming. By decomposing the big function into smaller ones, and then composing them back to cope with different situations, we can maximize reusability.


2/21/2010


Flattening & Rebuilding

Using SQL Server as your NoSQL database to persist object state has issues with data migration. I mentioned three in the previous post. There are two more, and today we are going to talk about one of them: you cannot deserialize object state back into a class whose fields have been changed. Think about a class that used to have a field called name, where the field has now been changed to firstName. When deserializing, what should the value of name be assigned to? And if we cannot get at the raw value, how can we apply the data migration logic? We already said the data we store in SQL Server is XML; are we going to parse the XML and manipulate the XML elements directly? Yeah, I think we have to.

So the data migration logic is not applied to the same object model that your application logic deals with. It has to live at a lower level. We could migrate the XML data elements directly, but that is just too tightly coupled to the data format. (Before XML, we actually tried JSON for a month, until we found that XQuery is really a killer feature. Also, XML elements carry many things we do not care about.) So what we need is a model that can capture the state of the objects, but is simple enough. This model is also directly related to how serialization/deserialization works. It works like this:

object ==Flatten==> many dictionaries (with dict/list/string inside) ==Serialize==> XML
XML ==Deserialize==> dictionary (loading referenced entity state on demand) ==Rebuild==> objects

The XML looks like this:

<Entity CLR_TYPE="Domain.Calendar.Location" Country="China" 
CountryAbbreviation="CN" LId="43-123" TaxUnit="Jiangxi" TaxUnitAbbreviation="JX" />

It will be deserialized into a dictionary containing the 5 data entries; the CLR_TYPE will be used in the rebuilding process to rebuild the dictionary back into an object. Besides dictionaries, strings are also valid: strings are used to store field names as well as simple field values. The persistence layer needs to define how to translate a datetime into a string, and so on. Collections are also valid, although in theory a collection is just a special case of a dictionary.
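In Python terms (the real system is C#, so this only illustrates the shape; type_registry is a hypothetical mapping from CLR_TYPE strings to classes):

# the dictionary produced by the Deserialize phase for the Location above
state = {
    "CLR_TYPE": "Domain.Calendar.Location",
    "Country": "China",
    "CountryAbbreviation": "CN",
    "LId": "43-123",
    "TaxUnit": "Jiangxi",
    "TaxUnitAbbreviation": "JX",
}

def rebuild(state, type_registry):
    # CLR_TYPE picks the class; the remaining entries become fields
    cls = type_registry[state.pop("CLR_TYPE")]
    obj = cls.__new__(cls)  # construct without running __init__
    obj.__dict__.update(state)
    return obj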

The XML is just the state storage for a single entity. Entities can inter-relate with each other, but we are not going to store the state of another entity in the same XML. Other entities are referenced by ID and stored separately in different rows of the EntityState table.

Because we separated the serialization into two phases, we can do data migration. The data migration is just a function that takes a dictionary as input and produces another dictionary. Now the only problem is: how can we write such a function? It might be trivial for just one version, but if you are going to change the model very frequently while doing agile software development, it becomes a big issue. We are going to talk about the "reusability" of data migration rules in the next post.

