1/25/2017

 

Architecture Fundamentalism: Review of My 2016

I write code every day, so I thought I would never categorize myself as "impractical" or "theoretical". However, after reviewing my learning journey of 2016, I coined a term: "Architecture Fundamentalism".

Architecture Fundamentalism: applying a principle or an architecture without regard for context

That is exactly what I did, in retrospect. It is too hard to say this in my mother tongue, so I am writing my retrospective in English.
I learned the following lessons the hard way.

Idea 1: Adding tests to regain confidence

The idea is simple. Tests are important for agility. If you want people to have confidence and do more refactoring, they need tests first. So, let's add tests to a big legacy codebase, in the hope that the developers will appreciate them and regain their confidence when making changes. Isn't that a good idea? Did it work? Not really...

Idea 2: Integration tests over mocks

Having worked at a company that invented xxxMock frameworks (like ten of them), I had enough bad experiences where a broken build tells you nothing except that some mocking behavior changed. I hate mocks; I do not deny it. So let's embrace the great, grand integration test: run all the components within a bounded context and test them together. The insight: the boundary between client and server should be stable, so a test that simulates the client should produce very stable results between releases. What went wrong? A lot of things...

Idea 3: Unify the middleware

The technical stack was a disaster: five languages, eight frameworks, and countless libraries in production. If we want distributed tracing, service discovery, and traffic coloring, shouldn't we unify the stack? After all, every successful company seems to be backed by one core RPC stack. Shouldn't we do the same? Yes, we should. Is it the only way, and the best way? Maybe...

Idea 4: Decoupling systems with events

RPC is evil and fragile. Fred George is a funny old thought leader, and what I learnt from him and many other big-name bloggers was: don't model your system with RPC. The system should be divided into bounded contexts, and those contexts should be integrated through a messaging system. Loose coupling, which those idiots could never get. It turns out I was the moron who blindly trusted an architecture without proving it fit the context. There are simply too many unanswered questions before an event-based architecture can replace RPC.

===

All of these ideas are attractive at first glance. Now, let me summarize what actually went wrong, and what practical (maybe quick & dirty) solutions actually worked.

Experience 1: Small flow in production trumps everything

TDD is dead. Don't get me wrong: testing is a good thing. It is like a daily workout; it keeps your body fit. However, what most people want is simply a system that is not bleeding all the time. When you actually have a system on the edge of crashing every day, adding tests should NEVER be your first priority. The golden tool of the internet business is the small-flow release: shipping new code to a small slice of production traffic first. Nothing is more convincing than the code actually working in production.

Aren't we already doing small-flow releases every day? It turns out we were not doing them enough. With everything coupled into one module, you can do at most four small-flow releases a day, each needing four hours to verify that it is actually working. That is not enough, given how many changes pour into the big ball of mud. The most important result of SOA decoupling is not blah, blah, blahhh. It is that it allows more modules to do small-flow releases, so multiple modules can be testing online in parallel, giving new code more time to prove itself. A minimal sketch of the routing mechanism follows.
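As an illustration, here is a minimal sketch of deterministic small-flow routing in Python; the handler names and the 5% default are hypothetical, not our actual setup:

```python
import hashlib

def new_version(user_id: str) -> str:
    return f"handled by NEW code for {user_id}"   # code under test

def old_version(user_id: str) -> str:
    return f"handled by OLD code for {user_id}"   # known-good path

def in_small_flow(user_id: str, percent: int) -> bool:
    """Deterministically bucket a user by hashing their id.

    Hashing (rather than random sampling) keeps each user on the same
    version across requests, so a broken release hurts the same small
    slice of users instead of everyone intermittently."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def route_request(user_id: str, percent: int = 5) -> str:
    # Send `percent`% of users through the new code; everyone else
    # stays on the proven path until the small flow looks healthy.
    if in_small_flow(user_id, percent):
        return new_version(user_id)
    return old_version(user_id)
```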

Are the tests irrelevant, then? NO. Testing is still a good practice. Once keeping the system from crashing is no longer your daily business, keeping your developers happy becomes important. Modifying a PHP source file without being able to run it in a test environment, only to find out you misspelled a method call and caused a PHP error in production, is NOT a fun experience. Even with a small flow to contain the damage, the release/rollback process is still time-consuming and frustrating. Without a good test suite, you cannot expect your developers to be brave enough to do the right thing (I mean actually refactoring the code to fit the business model).

But let's face the cold fact: it doesn't matter. Small flow in production trumps everything. Everything else comes second.

Experience 2: Single-module tests matter

My principle was really simple: tests should cover the business rules, not the code. Tests should guard the business against financial loss and ensure the main business process still works after every release.

That principle is still rock solid, but it does not mean tests should ONLY do that. I didn't get it at first. I thought: why are you guys writing tests for a single module and mocking everything around it? I blamed the team structure, because the teams worked in silos and only cared about their own business, so in the end no one was responsible for the whole process with all the components assembled together.

Without context, my argument seems right. So let me give you some context. We had such a large PHP codebase, with so many dependencies, that developers could not even run the code up to the point where they made their change. To make the code actually run, from index.php to the point of the change, you needed to set up more than ten system dependencies, with a lot of data configured in the database. And since the code is written in PHP, there is no object model and no compile-time type safety. The only defense between your change and a total disaster is the goddamn "small flow in production".

What should you do in this context? Right: let the code run before it goes into production. Give those poor boys a run button, even if it only checks that the PHP syntax is correct. What is the easiest way to make the code run? Just run the module under change and mock everything else. How do we keep the mocks from rotting? Hand-written mocks are tedious and fragile. But if we capture tracing data from production and replay it in the test environment, updating the mocking behavior becomes cheap. A sketch of the replay idea follows.
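To show what "replaying tracing data" can look like, here is a minimal sketch; the JSON trace format and the `ReplayMock` class are hypothetical, just to illustrate the shape of the idea:

```python
import json

class ReplayMock:
    """Serve recorded production responses in place of a live dependency.

    Assumes traced request/response pairs have been dumped to a JSON
    file shaped like {"<method> <path>": {"status": 200, "body": ...}}.
    Refreshing the mock is then just re-dumping the trace data, instead
    of hand-editing brittle mock definitions."""

    def __init__(self, trace_file: str):
        with open(trace_file) as f:
            self.recorded = json.load(f)

    def call(self, method: str, path: str):
        key = f"{method} {path}"
        if key not in self.recorded:
            # Fail loudly: an unrecorded call means the module under
            # test changed its outbound behavior, which is worth knowing.
            raise KeyError(f"no recorded response for {key}")
        return self.recorded[key]

# Usage: point the module-under-test's client at the mock.
# orders_api = ReplayMock("traces/orders-2016-12-01.json")
# resp = orders_api.call("GET", "/v1/price?order_id=42")
```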

So, do write tests for a single module. It matters, and it matters a lot. I was wrong.

Experience 3: Result-oriented thinking

Unifying the language/framework/library stack is the right thing to do, but you need enough benefit to justify the cost. The core business of middleware is encoding/decoding and connection pooling; the attached value of an RPC framework is unified logging/metrics, service discovery, and resilience. A Thrift framework with everything included is not enough to justify the cost of switching the stack away from http+json, because the stability of the core business is more important. Switching from http+json to Thrift might buy us the attached value we want, but it also means a lot of things must be re-tested and re-verified in production.

What the boss wants is NOT a unified technical stack; no one is paying for that. They want a distributed tracing system, so that when a customer complains about a bad experience, we can actually look it up through a web interface. They want to load-test the whole business process in production, which requires coloring the test traffic to distinguish it from normal traffic.

To achieve that result, is unifying the stack the right solution? Not really. Even with five languages and eight frameworks to maintain, we can still change them one by one. Instead of changing one place, we change thirty or more places to get the job done. Is it costly? Yes, it is. Is the result delivered? Yes, it is done, sir. Good. A sketch of this retrofit approach is below.
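As a sketch of that retrofit: the tracing and coloring "attached value" can be bolted onto a plain http+json call without a new RPC stack. The header names and the print-based metrics sink here are hypothetical stand-ins:

```python
import time
import urllib.request

# Hypothetical header names; each team picks whatever its framework
# supports, as long as every hop forwards them unchanged.
TRACE_HEADER = "X-Trace-Id"
COLOR_HEADER = "X-Traffic-Color"   # e.g. "load-test" vs "normal"

def traced_call(url: str, trace_id: str, color: str = "normal") -> bytes:
    """Wrap an existing http+json call with tracing and traffic coloring.

    This is the 'change thirty places one by one' approach: the same
    HTTP call as before, plus two forwarded headers and a latency log."""
    req = urllib.request.Request(url, headers={
        TRACE_HEADER: trace_id,
        COLOR_HEADER: color,
    })
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    finally:
        # Stand-in for the unified metrics/logging pipeline.
        print(f"trace={trace_id} color={color} url={url} "
              f"latency_ms={(time.monotonic() - start) * 1000:.1f}")
```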

This is result-oriented thinking. What could actually justify the cost of unifying the technical stack? I don't know yet. Maybe when the boss actually wants a smaller team, so that fewer people are doing similar work? Just maybe.

Experience 4: Classical event-based architecture is impractical

I love event-based architecture. I love decoupling. But telling people is not enough, for very practical reasons: the classical event-based architecture simply does not fit our context. We should listen first, not blindly impose our will. Here is why we should stick with RPC:

  • Our business is a realtime business. Delaying an event in a queue and making it eventually consistent is simply not acceptable. Delivering the event to the next system is mission-critical; if a delivery fails, we should do our best to re-route it and keep the main process flowing. Queuing makes it harder for the upstream to make sure delivery happens, and happens fast.
  • The client side expects a synchronous response. When an order is finished, the cost is part of the response and is displayed on the screen. Doing the calculation asynchronously is not acceptable without a big API/interaction change, and it is simply impractical to make every UI interaction async.
  • Lack of tooling support. Event-based architecture requires a lot of tooling to be feasible. Without good distributed tracing, it is very hard to track down which link dropped or missed a message when nothing flows out. If the message is passed via RPC, we can be sure there is an error log somewhere. And most importantly, every developer knows RPC and knows to look at those error logs.

These are very practical reasons. I found that all the benefits promised by the grand async architecture simply could not justify the cost. I tried to use Kafka somewhere in the system and could not find enough applicable places to make a difference. The main business process is still a big ball of mud; it seems you cannot decouple it with any messaging system without changing the business requirements. Which is really, really frustrating.

Mimicking RPC with a duplex messaging channel seemed like a good idea, and we actually tried it. The project eventually got cancelled, and I will never try that again. It was a total disaster.

What actually worked? Well, it turns out this is what the system evolved into:

  • do RPC and return the result synchronously on normal days
  • when a downstream service malfunctions, return a fake result with a disclaimer, so the end user knows (or simply does not care)
  • store the failure as a message in an async queue
  • the queue keeps retrying the process until everything has recovered

No one designed the system to work this way; after patch upon patch, it evolved into this form on its own. It actually reveals an important paradigm, and it is still an event-based architecture in a nutshell. Compared with the classical design, the upstream and downstream are not completely decoupled: in the normal flow, RPC is invoked and the result is returned synchronously. When bad things happen, we degrade the system into async mode. Through reliable event recording and eventually-consistent message replaying, we can guarantee the data becomes consistent in the end. A minimal sketch of this pattern follows.
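Here is a minimal sketch of that degrade-to-async pattern; `call_pricing_service`, the in-memory queue, and the zero-fee fallback are hypothetical placeholders standing in for the real downstream, a durable message queue, and the actual business rule:

```python
import queue
import time

retry_queue = queue.Queue()  # stand-in for a durable message queue

def call_pricing_service(order_id: str) -> dict:
    # Placeholder for the real downstream RPC; assume it raises
    # ConnectionError when the service is malfunctioning.
    return {"fee": 1250, "disclaimer": None}

def charge_order(order_id: str) -> dict:
    """Sync RPC on the happy path; degrade to async mode on failure."""
    try:
        # Normal days: plain synchronous RPC, result shown to the user.
        return call_pricing_service(order_id)
    except ConnectionError:
        # Downstream malfunctioned: record the failure for later replay,
        # then answer with a fake-but-safe result plus a disclaimer.
        retry_queue.put({"type": "price_order", "order_id": order_id})
        return {"fee": 0, "disclaimer": "final price will be settled later"}

def retry_worker() -> None:
    """Replay recorded failures until the downstream recovers,
    making the data eventually consistent."""
    while True:
        event = retry_queue.get()
        try:
            call_pricing_service(event["order_id"])
        except ConnectionError:
            retry_queue.put(event)   # still down, keep the event
            time.sleep(1)            # back off before the next attempt
```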

===

Be pragmatic, and carry on
