Architecture Fundamentalism: Review of My 2016
- Our business is a realtime business. Delaying an event in a queue and making it eventually consistent is just not acceptable. Delivering the event to the next system is mission critical to the business. If delivery fails, we should try our best to re-route it and keep the main process flowing. Queuing makes it harder for the upstream to make sure the delivery happens, and happens fast.
- The client side is expecting a synchronous response. When an order is finished, how much it will cost you is part of the response and is displayed on the screen. It is not acceptable to do the calculation async, which would require a big API/interaction change. It is just impractical to change every UI interaction to be async.
- Lack of tooling support. Event-based architecture requires a lot of tooling to make it feasible. Without good distributed tracing, when no message flows out it is very hard to track down which link dropped or missed it. If the message is passed via RPC, we can be sure there is an error log somewhere. And most importantly, every developer knows RPC, and knows to look at those error logs.
- do the RPC and return the result synchronously on normal days
- when the downstream malfunctions, return a fake result with a disclaimer, so that the end user will know (or they simply do not care)
- store the failed message in an async queue
- the queue keeps retrying the process until everything has recovered
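The four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: DownstreamError, price_order, retry_failed and the estimate callback are hypothetical names, and an in-memory deque stands in for a real durable queue.

```python
from collections import deque


class DownstreamError(Exception):
    """Hypothetical error raised when the downstream RPC fails."""


def price_order(order, rpc_call, estimate, retry_queue):
    """Answer synchronously; fall back to a flagged estimate on failure."""
    try:
        # Normal days: plain synchronous RPC.
        return {"price": rpc_call(order), "estimated": False}
    except DownstreamError:
        # Remember the failure so it can be retried later.
        retry_queue.append(order)
        # Fake result with a disclaimer flag the UI can display.
        return {"price": estimate(order), "estimated": True}


def retry_failed(retry_queue, rpc_call):
    """Make one pass over the queue; orders that still fail are re-queued."""
    recovered = []
    for _ in range(len(retry_queue)):
        order = retry_queue.popleft()
        try:
            recovered.append((order, rpc_call(order)))
        except DownstreamError:
            retry_queue.append(order)
    return recovered
```

A background worker would call retry_failed repeatedly until the queue is empty. The caller still gets an answer in the same request, which keeps the UI interaction synchronous.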
Public Sub GetStockPrice()
    If HEG Then On Error GoTo PROC_ERR
    'sub body
PROC_ERR:
    GetLogger().Error "GetStockPrice"
End Sub

Public Sub SetStockCode(StockCode)
    If HEG Then On Error GoTo PROC_ERR
    'sub body
PROC_ERR:
    GetLogger().ReraiseError "StockInfoForm.SetStockCode", Array(StockCode)
End Sub
Option Explicit

Const Level As String = "Info"
Const Output As String = "File"
Const ReraisedErrorNumber As Long = vbObjectError + 1985

'Messages below the current level are buffered here and dumped on error
Private Context As New Collection
Private FileNumber As Integer

Public Sub ClearContext()
    Set Context = New Collection
End Sub

'Log the error and show a message box (top of the call chain)
Public Sub Error(FunctionName As String, Optional Args)
    If IsMissing(Args) Then
        Args = Array()
    End If
    HandleError False, FunctionName, Args
End Sub

'Log the error and raise it again so the caller can log its own frame
Public Sub ReraiseError(FunctionName As String, Optional Args)
    If IsMissing(Args) Then
        Args = Array()
    End If
    HandleError True, FunctionName, Args
End Sub

Public Sub Info(Msg As String)
    If "Error" = Level Then
        Context.Add Msg
    Else
        PrintToOutput "Info", Msg
    End If
    FlushOutput
End Sub

Public Sub Dbg(Msg As String)
    If "Dbg" = Level Then
        PrintToOutput "Debug", Msg
    Else
        Context.Add Msg
    End If
    FlushOutput
End Sub

Private Sub PrintToOutput(Level As String, Msg As String)
    Dim FormattedMsg As String
    FormattedMsg = "[" + Level + "]" + " " + CStr(Now()) + ": " + Msg
    If "File" = Output Then
        Print #GetFileNumber(), FormattedMsg
    Else
        Debug.Print FormattedMsg
    End If
End Sub

Private Sub FlushOutput()
    Close #GetFileNumber()
    FileNumber = 0
End Sub

'Open the log file lazily and reuse the file number until flushed
Private Function GetFileNumber()
    If FileNumber = 0 Then
        FileNumber = FreeFile
        Open GetFilePath() For Append Access Write Shared As FileNumber
    End If
    GetFileNumber = FileNumber
End Function

Private Function GetFilePath()
    Dim FileName As String
    FileName = "zebra-word-" + CStr(Year(Now())) + "-" + CStr(Month(Now())) + "-" + CStr(Day(Now())) + ".log"
    GetFilePath = Application.Path + ":Zebra:Log:" + FileName
End Function

Private Sub HandleError(ReraisesError As Boolean, FunctionName As String, Args)
    If 0 = Err.Number Then
        Exit Sub
    End If
    If ReraisedErrorNumber = Err.Number Then
        'Already logged at the root; only add this stack frame
        PrintToOutput "Error", "Stack: " + FormatInvocation(FunctionName, Args)
    Else
        PrintToOutput "Error", "Stack (Root): " + FormatInvocation(FunctionName, Args)
        PrintToOutput "Error", "Err Number: " + CStr(Err.Number)
        PrintToOutput "Error", "Err Source: " + Err.Source
        PrintToOutput "Error", "Description: " + Err.Description
        PrintToOutput "Error", "Help File: " + Err.HelpFile
        PrintToOutput "Error", "Help Context: " + CStr(Err.HelpContext)
        PrintToOutput "Error", "Last Dll Error: " + CStr(Err.LastDllError)
        DumpContext
    End If
    Err.Clear
    FlushOutput
    If ReraisesError Then
        Err.Raise ReraisedErrorNumber
    Else
        MsgBox "Oops... something wrong happened. Please send your blame to email@example.com"
    End If
End Sub

Private Sub DumpContext()
    Dim Msg
    PrintToOutput "Context", "Dumping context..."
    For Each Msg In Context
        PrintToOutput "Context", CStr(Msg)
    Next Msg
    PrintToOutput "Context", "Dumped context"
    Set Context = New Collection
    FlushOutput
End Sub

Private Function FormatInvocation(FunctionName, Args)
    Dim i As Integer
    Dim InvocationDescription As String
    InvocationDescription = FunctionName + "("
    For i = LBound(Args) To UBound(Args)
        If i > LBound(Args) Then
            InvocationDescription = InvocationDescription + ", "
        End If
        InvocationDescription = InvocationDescription + FormatArg(Args(i))
    Next i
    InvocationDescription = InvocationDescription + ")"
    FormatInvocation = InvocationDescription
End Function

Private Function FormatArg(Arg)
    Dim ArgType As Integer
    ArgType = VarType(Arg)
    If vbEmpty = ArgType Then
        FormatArg = "[Empty]"
    ElseIf vbNull = ArgType Then
        FormatArg = "[Null]"
    ElseIf vbInteger = ArgType Then
        FormatArg = CStr(Arg)
    ElseIf vbLong = ArgType Then
        FormatArg = "[Long]" + CStr(Arg)
    ElseIf vbSingle = ArgType Then
        FormatArg = "[Single]" + CStr(Arg)
    ElseIf vbDouble = ArgType Then
        FormatArg = "[Double]" + CStr(Arg)
    ElseIf vbCurrency = ArgType Then
        FormatArg = "[Currency]" + CStr(Arg)
    ElseIf vbDate = ArgType Then
        FormatArg = "[Date]" + CStr(Arg)
    ElseIf vbString = ArgType Then
        FormatArg = """" + Arg + """"
    ElseIf vbObject = ArgType Then
        FormatArg = "[Object]"
    ElseIf vbError = ArgType Then
        FormatArg = "[Error]"
    ElseIf vbBoolean = ArgType Then
        FormatArg = CStr(Arg)
    ElseIf vbVariant = ArgType Then
        FormatArg = "[Variant]"
    ElseIf vbDataObject = ArgType Then
        FormatArg = "[DataObject]"
    ElseIf vbDecimal = ArgType Then
        FormatArg = "[Decimal]" + CStr(Arg)
    ElseIf vbByte = ArgType Then
        FormatArg = "[Byte]" + CStr(Arg)
    ElseIf vbUserDefinedType = ArgType Then
        FormatArg = "[UserDefinedType]"
    ElseIf vbArray = ArgType Then
        FormatArg = "[Array]"
    Else
        FormatArg = "[Unknown]"
    End If
End Function
Retrospection: the mistakes I have made these years
Someone told me it takes more than 10,000 hours of repeated practice to make a professional mature. I am still far away from that standard, but after 5 years of programming as my profession, I realized I have already made so many mistakes that they are worth some conscious retrospection.
One presentation I did not watch, but whose slides I really liked: http://www.infoq.com/presentations/LMAX. In the slides, they said:
On a single thread you have ~3 billion instructions per second to play with: to get 10K+ TPS if you don't do anything too stupid.
I have to say, I did many things that seemed smart at first and turned out to be stupid, which left the ~3 billion instructions per second of hardware unable to help the project. It is not just about performance; many mistakes lead to other symptoms as well.
Sometimes the "I" here can be substituted with "We". I am sure, and I have seen, that other people made the same mistakes as I did. As poor software developers, we do not have control of many things except the code in our hands. It is not surprising that people spend a lot of time making their code "smart". A lot of lessons can be learned from that smartness.
I do not have a full list yet, but as a start I will list some here. As this blog is primarily technical, I will keep the items mostly relevant. If I find time I will complete them one by one:
- How build tools re-invent scripting languages and the command line, especially msbuild.
- The evil of lazy loading
- Other evil things of ORM
- How to hack your dependency injection tools into rocket science
- Building a castle on top of sand, aka Outlook and isolation
- Anything related to Microsoft is evil, especially COM
- Encapsulation might help initially, but not as much as you expect, and it is even harmful sometimes
- Re-inventing the wheel, in many ways, and how to make up reasons to make it look good
- Abandoned architecture is even worse than wrong architecture
Package, the missing language feature - Part II
In the previous post, we talked about how packages work in the Python language. Essentially, the problem is that the package is a good box, but its color is not black. We want the package to expose all its API at the package level and seal up any internal details. from A import * should give you all the things you need; you should not need to import A.B or import A.C.
Unimportable
So, how do we make a module unimportable? There are two things you need to do. First, remove the B attribute from the package A object. By doing this, import A.B will fail: import A.B first imports A, then imports A.B, and then gets B from A, so deleting B from A makes the import fail. Second, you need to remove A.B from sys.modules by setting sys.modules['A.B'] = None. This makes from A.B import * fail.
delattr(package_A, 'B')
sys.modules['A.B'] = None
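The two steps can be tried out in isolation. The sketch below assumes Python 3 and builds a throwaway package A with a module B in a temporary directory; the greet function is a made-up stand-in for whatever API the package exposes.

```python
import importlib
import os
import sys
import tempfile

# Create A/__init__.py and A/B.py in a temporary folder (hypothetical content).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "A"))
with open(os.path.join(root, "A", "__init__.py"), "w") as f:
    f.write("from .B import *\n")
with open(os.path.join(root, "A", "B.py"), "w") as f:
    f.write("__all__ = ['greet']\n\ndef greet():\n    return 'hello from B'\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

import A

# Seal the box: drop the attribute, then poison the sys.modules entry.
package_A = sys.modules['A']
delattr(package_A, 'B')
sys.modules['A.B'] = None

# The API re-exported by __init__.py still works.
print(A.greet())

# Importing the internal module is refused with an ImportError subclass.
blocked = False
try:
    from A.B import greet
except ImportError as error:
    blocked = True
    print("blocked:", error)
```

On Python 3 the None entry in sys.modules is what stops the import; the delattr call additionally hides B from plain attribute access on A.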
This way, we completely hide the existence of A.B, which is the behavior we want when another package tries to import this private internal. The drawback of this mechanism is that the error message the user gets is not friendly. They will be told the module does not exist, although it actually exists if you look it up in the file browser.
When?
By making internal packages and modules unimportable we can make the parent package a black box. But when do we do this, deleting all the internal packages and modules?
The best place is the __init__.py of the parent package. But after we delete the internal packages and modules, they are gone. What if A.B references A.C in its code? What we need to do is make sure A.B is initialized (imported) before sealing up A. A.B might use import A.C or from A.C import xxx. The from form copies the referenced name into A.B's own namespace, so even when A.C no longer exists on the package, it can still be referenced inside A.B. A plain import A.C only binds the name A and looks up A.C on every use, so it stops working once C is deleted from A.
import sys
from .B import *
package_A = sys.modules['A']
delattr(package_A, 'B')
sys.modules['A.B'] = None
Where?
Do I need to write that kind of ugly delattr in every __init__.py file? Isn't that a cross-cutting concern that should not be repeated everywhere?
Yes, so let's find some way to magically inject that code into every __init__.py file. The code actually has three parts. Part 1, expose the API. Part 2, eager load sub modules. Part 3, delete sub modules. The API still needs to be manually defined in __init__.py. But parts 2 and 3 can be put into a "post-import-hook".
What is a post import hook? It is code executed after a module has been imported. After A has been imported, we can eager load all its sub modules by scanning the folder and then delete them. Post import hooks are not directly supported in Python, but can be built on the more powerful meta import hook.
def register_meta_import_hook(should_apply, post_import_hooks):
    import sys
    import imp

    class Importer(object):
        def __init__(self, file, pathname, description):
            self.file = file
            self.pathname = pathname
            self.description = description

        def load_module(self, name):
            try:
                module = imp.load_module(name, self.file, self.pathname, self.description)
                # run the hooks once the module has been loaded
                for post_import_hook in post_import_hooks:
                    post_import_hook(module)
                return module
            finally:
                if self.file:
                    self.file.close()

    class Finder(object):
        def find_module(self, qualified_module_name, path):
            if not should_apply(qualified_module_name):
                return
            if not path:
                path = sys.path
            # rpartition returns (head, sep, tail); the tail is the bare module name
            module_name = qualified_module_name.rpartition('.')[2]
            file, pathname, description = imp.find_module(module_name, path)
            return Importer(file, pathname, description)

    sys.meta_path.append(Finder())
Conclusion
By eager loading sub modules and deleting them in a post import hook, we can seal up a package and force people to define the package API in __init__.py, because that is the only way to let outsiders use the internals.
Another interesting side effect is that circular dependencies between packages are no longer possible. In Python, circular dependencies between modules are not possible, but because modules in a package are lazy loaded, circular dependencies between packages were possible. After eager loading sub modules, we have disabled circular dependencies between packages as well. It is a good thing, but could be very strict.
Finally, we have the box, and it is automatically sealed up in the post import hook. If the box writer wants to make the box usable externally, they need to define its API in __init__.py. Plus, no circular dependencies ever.
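The imp module used in this post is deprecated and was removed in Python 3.12. For readers on Python 3, the same post import hook idea can be sketched with importlib. This is an assumption-laden adaptation rather than the code the post was written with: HookedLoader, PostImportFinder and the hooked_demo module are hypothetical names, and the finder simply delegates to the standard PathFinder.

```python
import importlib
import importlib.abc
import importlib.machinery
import os
import sys
import tempfile


class HookedLoader(importlib.abc.Loader):
    """Wraps the real loader and runs hooks after the module body executed."""

    def __init__(self, wrapped_loader, post_import_hooks):
        self.wrapped_loader = wrapped_loader
        self.post_import_hooks = post_import_hooks

    def create_module(self, spec):
        return self.wrapped_loader.create_module(spec)

    def exec_module(self, module):
        self.wrapped_loader.exec_module(module)
        for post_import_hook in self.post_import_hooks:
            post_import_hook(module)


class PostImportFinder(importlib.abc.MetaPathFinder):
    def __init__(self, should_apply, post_import_hooks):
        self.should_apply = should_apply
        self.post_import_hooks = post_import_hooks

    def find_spec(self, fullname, path, target=None):
        if not self.should_apply(fullname):
            return None  # let the regular finders handle it
        spec = importlib.machinery.PathFinder.find_spec(fullname, path, target)
        if spec is None or spec.loader is None:
            return None
        spec.loader = HookedLoader(spec.loader, self.post_import_hooks)
        return spec


# Demo: a throwaway module on disk, and a hook that records what was imported.
root = tempfile.mkdtemp()
with open(os.path.join(root, "hooked_demo.py"), "w") as f:
    f.write("VALUE = 42\n")
sys.path.insert(0, root)
importlib.invalidate_caches()

imported_names = []
# Must come before the default finders, otherwise it is never consulted.
sys.meta_path.insert(0, PostImportFinder(
    should_apply=lambda name: name == "hooked_demo",
    post_import_hooks=[lambda module: imported_names.append(module.__name__)],
))

import hooked_demo

print(imported_names)
```

A real hook would eager load the sub modules of the imported package and then seal them, as described above. Note that on Python 3 the finder has to be inserted at the front of sys.meta_path, because the default finders already live in that list.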
Package, the missing language feature - Part I
We have spent way too much time on functions and classes. We put a lot of energy into maintaining clean and concise class interfaces. We care about the dependencies between classes, encouraging dependency injection and wiring up objects via interfaces.
Besides functions and classes, we do have higher level constructs. We have hacked the class loading of Java to hell, and then Satan gave us back his OSGi. We have invented a dedicated job to maintain the manifest of EJB. When dependency injection is not enough, people find a concept called Module emerging in modern things like Guice and Autofac.
What is a package? In itself it is merely a name. It is just some annoying leading dots before the thing you actually want. It is not even a thing, it is just an ignored prefix. People might say, oh yes, a package is not doing anything, why should I care? A function does something, a class also does something, a package is just some dummy folder that I can put that valuable stuff inside.
True, very true... and so think the language designers. I can not say all of them do, but at least some of them do. Stroustrup ignored the package. Gosling ignored the package. Even Hejlsberg ignored the package (but an assembly is better than nothing). What a huge mistake!
The problem we normally need to solve when writing business software is not some scientific work. In my mind, the only problem we need to solve is managing complexity. As we learned a long time ago, the only way to control complexity is to break it down, and break it down further: one thing containing many other things. We need a black box to encapsulate the internal complexity monster and give the outside a clean and simple illusion. But constantly, when using Java or C#, I find I need to reinvent all kinds of black boxes to meet my needs. And none of them seems natural to newcomers, simply because they are not part of the original language, not known to most people, and not supported by many tools. There are many design patterns; people say they exist because the language itself is flawed. There are also many component platforms/frameworks; I say they exist because the language itself is flawed. It is because the language does not give us the black box that we need to invent one ourselves.
The package has been a missing language feature for a long time. But luckily, Java and C# are not the only choices we have. In another open wonderland, without money but with happiness, we have our lovely Python. There, we finally see what a package really is.
Package in Python
Package in Python is simple. If you have a folder called some_package, and you have an __init__.py file in that folder, then it becomes a package called some_package. If you happen to have another folder inside the some_package folder called another_package, and it also has an __init__.py file inside, then it becomes some_package.another_package.
The key difference between a package in Java and a package in Python is that in Java the package is just a literal symbol; it does not exist at runtime. In Python, the package is a living object, and you can set and get attributes on it at any time. some_package.another_package.abc = 'def' is a valid statement in the Python language.
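To see that a package really is an object at runtime, here is a small sketch that creates the two folders in a temporary directory and then sets an attribute on the inner package. The temporary directory setup is only scaffolding for the demo.

```python
import importlib
import os
import sys
import tempfile

# Build some_package/another_package/ on disk, each with an empty __init__.py.
root = tempfile.mkdtemp()
outer = os.path.join(root, "some_package")
inner = os.path.join(outer, "another_package")
os.makedirs(inner)
for folder in (outer, inner):
    open(os.path.join(folder, "__init__.py"), "w").close()

sys.path.insert(0, root)
importlib.invalidate_caches()

import some_package.another_package

# The package is an ordinary module object; attributes can be set and read.
some_package.another_package.abc = 'def'
print(type(some_package.another_package))
print(some_package.another_package.abc)
```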
This gives us the box we want. We can use this box to define our interface and hide our internal complexities. In a package structure like A.B.C, A should hide B and C, and B should hide C. At the A level, you might say start the car. At the B level, you might say start the engine, and then start the radio and the air conditioner. The hierarchical structure of package naming is the best fit for natural encapsulation.
The box can also initialize itself. It has an __init__.py file which can be used to execute any bootstrapping code. Sometimes we need to sort out some internal stuff before being ready to serve the outside. Sometimes we need to register ourselves as a subscriber for events published somewhere else. Having a simple __init__ solves a lot of problems. Is it powerful enough? Not really; it does not support a thousand other features, like full lifecycle management, a standard remote control interface, etc. As a user-facing public component, the package construct exposed by the language is very limited. But we can build on top of what the language provides us.
The real problem of the Python package is not that it does not support things like JMX. The real problem is that the box is not really a black box. Actually, everything in Python is sort of made of glass. You can see right through nearly everything; everything is public, in Java terms. We can use _ as a convention, and __ as a hard compiler constraint in some places. But here, the ugly underscore is not helping. You can always reference A.B.C.xxx anytime you want, and that is dangerous. It breaks encapsulation and introduces tangled dependencies without being noticed. No one likes that; we want to make sure Y.X.Z only references A.B.C.xxx through A.yyy. It should not know internals like A.B.C.xxx.
The classic "pythonic" response would be that it is just a convention, and when a convention is there, people should follow it. The problem is, this is not an easy convention, and it can be broken at any minute. There is no easy rule people can follow. The real difficulty is that when you reference A.B.C.xxx you can not always reference it as A.yyy. If your code lives inside A.B, then you should not reference A.yyy, because the inner package should not reference the outer one, as it sits lower in the dependency pyramid. In this case, you do need to reference A.B.C.xxx as A.B.C.xxx, as it is something you have to deal with. It is no longer a hidden internal; you are living inside the internal. In other words, A.B.C.xxx is not always public or private. It is accessible or not depending on where you are. And that is exactly what encapsulation is about.
How can we make the box really a black box? Let's continue in Part II.
Alan (https://alanfranz.pip.verisignlabs.com/) commented: I'm not 100% sure about what you'd like to say in the next part, however:
- beware about "bootstrapping code". Many times such "static initializer" is known to provoke unpredictable problems, and will prevent package breakup via pkg_resources namespace_package , if ever needed.
Most of the times initialization should be performed by the client code or should be performed at first request; import-time initialization is absolutely abused in python coding.
- convention is fine. If anybody imports a module or package that starts with underscore, it's their business - after all, if they've got the source code, they can modify all the names and make them public, can't they? Would you prefer java-like things where you can set methods and attributes private, and then you can access them via other means through some common.util.lang tool?
- nesting too much might just be unneeded, and if you want a leaf (a.b.c) not to depend on its parent you can use relative imports. But remember that two modules importing one the other trigger an error in Python, you simply can't do that.
Thanks Alan! I am surely aware of "bootstrapping code". The major problem of import-time initialization is that it is implicit. And it is even worse if the bootstrapping is I/O intensive or causes other side effects. But if there are too many clients, like a lot of unit tests, pushing the responsibility to them is also inconvenient. I use __import__('x.y.z') in the main function to explicitly state that I want to use those packages and initialize them now.
Starting with _ means private, that is fine. But a package is not always private: it is public to its siblings, but private to other packages. Traditional visibility only allows you to specify whether one thing is public or not; I think that is not enough. How many details you are allowed to know, or should depend on, might be contextual.
Python does check circular dependencies at the module level. But it does not check circular dependencies at the package level. For example, A.B can not circularly depend on C.D, but if A.E depends on C.D and C.D depends on A.B, that is allowed. But actually, that means package A depends on C, and package C depends on A as well.
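The package-level cycle described in this reply can be reproduced with a few generated files. In this sketch pkg_a and pkg_c are hypothetical stand-ins for A and C, with modules b, e and d playing the roles of A.B, A.E and C.D.

```python
import importlib
import os
import sys
import tempfile

# pkg_a.e -> pkg_c.d -> pkg_a.b: no module imports itself back,
# yet package pkg_a depends on pkg_c and pkg_c depends on pkg_a.
files = {
    "pkg_a/__init__.py": "",
    "pkg_a/b.py": "VALUE = 'b'\n",
    "pkg_a/e.py": "import pkg_c.d\nVALUE = 'e uses ' + pkg_c.d.VALUE\n",
    "pkg_c/__init__.py": "",
    "pkg_c/d.py": "import pkg_a.b\nVALUE = 'd uses ' + pkg_a.b.VALUE\n",
}

root = tempfile.mkdtemp()
for relative_path, source in files.items():
    full_path = os.path.join(root, *relative_path.split("/"))
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "w") as f:
        f.write(source)

sys.path.insert(0, root)
importlib.invalidate_caches()

# The import succeeds even though the two packages depend on each other.
import pkg_a.e

print(pkg_a.e.VALUE)
```

Nothing complains here because each individual module is fully loaded before anything needs it; the cycle only exists when you look at the packages as units.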
Data Migration (3)
The final question about data migration: how do we write it? We already know it is just a function that transforms one dictionary into another. We also know there will be questions around dependencies between entities. So what we need are several functions, each upgrading one version. The migration function is per version delta, not per entity. The function needs to do several things:
- Find out which entities need to be migrated.
- Load the entity state as dictionary.
- Apply the migration logic on the dictionary.
- Save the entity state back.
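The four steps can be sketched as follows. This is only an illustration under assumptions: a plain Python dict stands in for the EntityState table, a predicate stands in for the XQuery, and the Domain.Person type and the function names are made up.

```python
def rename_name_to_first_name(state):
    """Hypothetical one-version delta: the field name becomes firstName."""
    migrated = dict(state)
    migrated["firstName"] = migrated.pop("name")
    return migrated


def run_migration(store, is_target, migrate):
    # 1. Find out which entities need to be migrated.
    target_ids = [entity_id for entity_id, state in store.items() if is_target(state)]
    for entity_id in target_ids:
        state = store[entity_id]       # 2. Load the entity state as a dictionary.
        new_state = migrate(state)     # 3. Apply the migration logic.
        store[entity_id] = new_state   # 4. Save the entity state back.
    return target_ids


store = {
    "1": {"CLR_TYPE": "Domain.Person", "name": "Ada"},
    "2": {"CLR_TYPE": "Domain.Calendar.Location", "Country": "China"},
}
migrated_ids = run_migration(
    store,
    is_target=lambda state: state["CLR_TYPE"] == "Domain.Person",
    migrate=rename_name_to_first_name,
)
print(migrated_ids, store["1"])
```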
Finding the entities to be migrated is easy. We already know SQL Server has XQuery support. We can write a customized XQuery to find the target entities. Most of the time, it will be based on CLR_TYPE. Then the only remaining problem is how to write a function that transforms one in-memory dictionary into another.
It might seem easy: we just need to write a function in C# which takes a dictionary as input and returns a new dictionary. Yes, this would work. But the code would be very detailed and would not reveal its intent. It would need a lot of casting, to cast an entry to a list or a string or another dictionary, based on your knowledge of the object graph. It would also need a lot of detailed operations, like copying a field to another and deleting the old field to do a rename. The issue of plumbing code might be solved by introducing a dynamically typed language, like Ruby, as the migration scripting language. But the more essential problem is how to raise the abstraction level, so that the migration script looks more intentional and reveals the original requirements.
One naive change is to write several functions for the well-known refactorings, like Rename, MoveType, ExtractEntities. And that was exactly what we tried before. The problem with these small functions is that they are not really that reusable. Say we have a rename function, which changes a direct field from one name to another. But what if we renamed a field that is inside the object graph, not directly on the root? Then the rename function can no longer help us. We might think we can abstract the "locating" part of the function. Instead of passing in two strings to identify the fields by name, we pass in two locators.
The locator is not easy to implement. Say we are renaming field x.y.z to x.y.k: x is a branch of the root, y is a branch of x, z was the field on y, and k is the new name of z. The rename function needs to take "x.y.z" and "x.y.k" as input, and know how to apply them. For "x.y.z" we need to "get" the value, then use "x.y.k" to "set" the value, then use "x.y.z" again to delete the field. The logic of getting a value is very different from the logic of setting a value.
In general, this approach is called Functional Programming. By decomposing the big function into smaller ones, and then composing them back to cope with different situations, we can maximize the reusability.
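A minimal sketch of the locator idea, assuming locators are dotted paths into nested dictionaries. The helper names get_at, set_at and delete_at are hypothetical, and lists inside the object graph are deliberately ignored.

```python
def get_at(state, path):
    """Walk the dotted path and return the value found at the end."""
    node = state
    for key in path.split("."):
        node = node[key]
    return node


def set_at(state, path, value):
    """Walk to the parent of the last key, creating branches as needed."""
    *parents, leaf = path.split(".")
    node = state
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def delete_at(state, path):
    """Walk to the parent of the last key and remove that key."""
    *parents, leaf = path.split(".")
    node = state
    for key in parents:
        node = node[key]
    del node[leaf]


def rename(state, old_path, new_path):
    # Composed from the three small functions: get, set, then delete.
    value = get_at(state, old_path)
    set_at(state, new_path, value)
    delete_at(state, old_path)
    return state


state = {"x": {"y": {"z": 1}}}
print(rename(state, "x.y.z", "x.y.k"))
```

The same three helpers could be recombined for a move or a copy, which is the reuse the rename-by-name function could not offer.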
Flattening & Rebuilding
Using SQL Server as your NoSQL database to persist object state has issues with data migration. I mentioned three in the previous post. There are two more, and today we are going to talk about one of them: you can not deserialize object state back into a class whose fields have changed. Think about a class that used to have a field called name, which has now been changed to firstName. When deserializing, where should the value of name be assigned? If we can not get the raw value, how can we apply the data migration logic? We already mentioned that the data we store in SQL Server is XML; are we going to parse the XML and manipulate the XML elements directly? Yeah, I think we have to.
So the data migration logic is not applied to the same object model that your application logic deals with. It has to be at a lower level. We could migrate the XML data elements, but that is just too tightly coupled to the data format. Before we used XML, we actually tried JSON for a month, until we found that XQuery is really a killer. Also, an XML element has many things we do not care about. So what we need is a model which can capture the state of the objects but is simple enough. This model is also directly related to how serialization/deserialization works. It works like this:
object ==Flatten==> many dictionaries(with dict/list/string inside) ==Serialize==> XML
XML ==Deserialize==> dictionary(load referenced entity state on demand) ==Rebuild==> objects
The XML looks like this:
<Entity CLR_TYPE="Domain.Calendar.Location" Country="China" CountryAbbreviation="CN" LId="43-123" TaxUnit="Jiangxi" TaxUnitAbbreviation="JX" />
It will be deserialized to a dictionary containing 5 entries. The CLR_TYPE will be used in the rebuilding process to rebuild the dictionary back into an object. Besides dictionaries, strings are also valid. A string is used to store a field name as well as a simple field value. The persistence layer needs to define how to translate a date time into a string, and so on. Collections are also valid, although in theory a collection is just a special case of a dictionary.
The XML is just the state storage for a single entity. Entities can relate to each other. We are not going to store the state of other entities in the same XML. They will be referenced by ID and stored separately in different rows of the table EntityState.
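The two-phase round trip can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under assumptions, not the C# persistence layer itself: the Location class, the TYPE_REGISTRY lookup and the four function names are hypothetical, and only flat string fields are handled.

```python
import xml.etree.ElementTree as ET


class Location:
    """Hypothetical stand-in for Domain.Calendar.Location."""

    def __init__(self, Country, TaxUnit):
        self.Country = Country
        self.TaxUnit = TaxUnit


# Maps the CLR_TYPE marker back to a class during rebuilding.
TYPE_REGISTRY = {"Domain.Calendar.Location": Location}


def flatten(obj, clr_type):
    """object ==Flatten==> dictionary"""
    state = {"CLR_TYPE": clr_type}
    state.update(vars(obj))
    return state


def serialize(state):
    """dictionary ==Serialize==> XML"""
    return ET.tostring(ET.Element("Entity", attrib=state), encoding="unicode")


def deserialize(xml_text):
    """XML ==Deserialize==> dictionary"""
    return dict(ET.fromstring(xml_text).attrib)


def rebuild(state):
    """dictionary ==Rebuild==> object"""
    fields = dict(state)
    cls = TYPE_REGISTRY[fields.pop("CLR_TYPE")]
    return cls(**fields)


xml_text = serialize(flatten(Location("China", "Jiangxi"), "Domain.Calendar.Location"))
state = deserialize(xml_text)
# A migration function would transform `state` here, before rebuilding.
location = rebuild(state)
print(xml_text)
print(location.Country, location.TaxUnit)
```

Because the intermediate form is a plain dictionary, a migration can run between deserialize and rebuild without ever touching the current class definition.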
Because we separated serialization into two phases, we can do data migration. The data migration is just a function that takes a dictionary as input and produces another dictionary. Now the only problem is, how can we write such a function? Yes, it might be trivial for just one version, but if you are going to change it very frequently while doing agile software development, then it is a big issue. We are going to talk about the "reusability" of data migration rules in the next post.