Some of my principles for designing system architecture

In more than 20 years of work, I have seen many companies’ system architectures and many problems. In my discussions with these companies, including work on implementation and comparisons between solutions, there were always many trade-offs and compromises among the options. I am writing this article to summarize my personal experience and ideas, in the hope that more people can refer to them, learn from them, and build better architectures. My way of thinking and my principles are also aimed at the many unreasonable architectures and solutions currently on the market, so this article is also a kind of “correction”.

Note that the architectural principles described in this article generally apply to relatively complex businesses. If you are building a simple application with little traffic, you may well come to the opposite conclusion.


Principle 1: Focus on the real benefits rather than the technology itself

For software architecture, I think the most important thing is the benefit the architecture delivers. Without benefits, doing something just for the sake of technology is pointless. Among technical benefits, I think the following are very important.

Whether the technical threshold can be lowered to speed up the development process of the whole team. Speeding up the whole team’s engineering process and releasing quickly is a problem software engineering has always been trying to solve, so the system architecture needs to support parallel development, parallel go-live, and parallel O&M without any one team becoming a bottleneck. (Note: even if what holds the team back is the organizational structure, that does not prevent us from designing a parallel system architecture.)
Whether it allows the whole system to run more stably. To make the whole system more stable and improve its SLA, we need solutions for both planned and unplanned downtime.
Whether costs can be reduced through simplification and automation. The cost most worth optimizing is human cost: people are not only slow and expensive, they also keep making mistakes. If you cannot reduce human cost and instead need more people, the architecture design must be a failure. Beyond that, there are time costs and capital costs.

If a system architecture cannot help with these three things, there is no point to it.

Principle 2: Take the perspective of application services and APIs, rather than resources and technology

Many companies in China have a fine-grained division of labor, basically split into operations and development. Operations is divided into infrastructure operations and application operations, and development is divided into core infrastructure development and business development. Different divisions lead to completely different perspectives and starting points. For example, infrastructure operations and development colleagues care more about resource utilization and performance, while application operations and business development care more about applications and services. The two used to get along separately, but with the evolution of distributed architecture it is no longer clear whether some systems belong to the infrastructure layer or the application layer. Service governance is one example: it contains underlying infrastructure technology that also requires business colleagues to cooperate. Kubernetes is another: it has underlying technologies such as networking, but it also needs the business side to cooperate on health checks such as readiness and liveness probes, and business applications need ConfigMaps, and so on.

All of this makes me feel that so-called DevOps exists because many technologies and components can no longer be classified as Dev or Ops, so Dev and Ops need to merge. Moreover, the organization and the architecture as a whole can no longer be optimized by tuning a single division of labor or a single component. It takes top-down, holistic planning and unified design to achieve overall improvement. (Think of urban traffic: once a city reaches a certain scale, you cannot improve overall performance by optimizing a few roads or a few blocks; you need to plan the city’s functional areas as a whole.) To achieve that overall improvement, everyone needs a unified perspective and goal. Over the past few years, I have come to think that this goal is to look at problems from the perspective of services and external APIs, rather than from the perspective of technology and underlying resources.

Principle 3: Choose the most mainstream and mature technology

Over the past few years I have seen users migrate their PHP, Python, .NET, or Node.js architectures completely to Java + Go as their systems became more complex. It is a very painful process, but there is no way around it. As your system gets larger and more complex, you can no longer play around with toy technologies; you need more industrial-grade ones.

Use mature, industrial-grade technology stacks as much as possible, rather than the stacks you happen to be familiar with. To find an industrial-grade stack, look at what most companies use, for example in the Internet, finance, and telecommunications industries. Large companies invest more in technology and run at a larger production scale, so the technology they use is usually more industrialized. When selecting technology, do not decide on the basis of “look, some video company is also using this”, or on opinions that programmers vent in forums (with no data, only personal preference). It is more reliable to look at the stacks that most mainstream companies actually use.
Choose technology that is popular globally, not technology that is popular only in China. Technology is a global thing, not a local one, so choosing what is internationally mainstream works out better. Also, do not be fooled by the “special cases” of some companies. However attractive a case is, the key question is whether the problem-solving approach and the technology used are universal. Only universal technology has lasting vitality.
Use mainstream technology with large dividends as much as possible, rather than reinventing your own wheel, let alone making heavy custom modifications. I have seen a number of companies modify open source software, such as one that modified Mesos and ended up inventing another Kubernetes. I have also seen many companies and teams that like to invent their own specialized wheels, which end up being replaced by mainstream open source software. It is completely unnecessary. The reason not to reinvent the wheel or fork heavily is not that your own technology is not good enough. It is that the days of doing everything yourself are long gone; this era is about integrating and collaborating with the entire industry and the entire technology community, which is how you get the most value. Building your own way of doing things because of a special case causes no problems in the short term, but I am not optimistic about it in the long run.
In the vast majority of cases, if you have no special requirements, you cannot go wrong with Java. One reason is that Java is very productive for business development, and with the Spring framework as a safeguard it is hard to write really bad code. Another is that the Java community is so mature that whatever architecture or technology you need is easy to obtain; the technical dividend is huge. A language that runs on the JVM brings many benefits as well. On the Java stack, your architectural risk and architectural cost (in labor, time, and money) are optimal in the long run.

In some of the companies I have seen, the architecture is held hostage by the personal preferences, expertise, and experience of the technology leader, and technology is not selected from an objective point of view at all. In fact, in the stage from 0 to 1 you can use any technology. If you are building a simple application with no transaction processing and no complex transaction flows, such as a forum or a social application, you can use any language. Later on, though, things change: think of the migrations from .NET to Java, or Taobao’s move from PHP to Java.

Note: some people with subjective preferences will surely be uncomfortable with what I said about Java above, so let me offer some evidence. The systems of all e-commerce platforms in China, hundreds of banks, the three major telecom operators, all insurance companies, securities companies, hospitals, and e-government are basically developed in Java. The mainstream language at AWS is also Java. Alibaba Cloud initially wrote its control system in C++/Python and later started using Java as well. You may say that Bilibili uses Go, but you may not know that Bilibili’s e-commerce and big data systems use Java. If you know how to do data analysis, I suggest searching for the number of Java jobs on the major job sites; that will tell you whether a technology is mainstream and popular.

Principle 4: Completeness is more important than performance

I have found that when architects at some companies design an architecture, their primary consideration is whether it can withstand heavy traffic, rather than the completeness and scalability of the system. So I have seen many cases where they started with a non-relational database like MongoDB, or put data directly in Redis, and abandoned the data integrity model of relational databases. Later, when they needed relational queries on the data, they found that NoSQL databases perform very poorly on joins. To avoid joins, they began duplicating data, but once data was duplicated they could not keep it consistent, which led to all kinds of data corruption and loss.

Therefore, I offer the following architectural principles.

Use the most scientifically rigorous technical model as the primary one, and supplement it with less rigorous models. For the case above, that means always using a fully ACID relational database as the foundation and supplementing it with NoSQL, rather than abandoning the relational database altogether. The principle here is “tight first, loose later”: if you start tight you can gradually loosen, but if you start loose you will never manage to tighten up later.
There are always plenty of solutions to performance problems. My experience over the years tells me that performance problems can always be solved, and the means are plentiful. Compared with the completeness and scalability of the architecture, performance is really not something to worry about too much.

Sacrificing the integrity of the entire system in pursuit of so-called performance is a loss that outweighs the gain.
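
To illustrate the “tight first, loose later” idea, here is a minimal sketch in Python using the standard library’s `sqlite3` module as a stand-in for a fully ACID relational database. The `accounts` table, the function names, and the plain dictionary used as a disposable cache are all assumptions made for this example, not a recommended production design.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory relational store with an integrity constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> bool:
    """Move money atomically: either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back.
        return False


def refresh_cache(conn: sqlite3.Connection, cache: dict) -> None:
    """Rebuild the loose read cache from the relational source of truth."""
    cache.clear()
    cache.update(dict(conn.execute("SELECT id, balance FROM accounts")))
```

The point of the sketch is the direction of dependency: the cache can be thrown away and rebuilt at any time, while the constraint and the transaction live in the relational store.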

Principle 5: Establish and follow standards, specifications, and best practices

This principle is very important, because your architecture can only scale well if it follows standards. I regularly see companies whose systems neither follow industry standards nor form company standards of their own; they feel like a disorganized rabble. The most typical example is the status code returned by HTTP calls. The industry standard is 200 for success, 3xx for redirection, 4xx for errors on the caller side, and 5xx for errors on the server side. I really do not understand why so many people like to return 200 regardless of success or failure and then indicate in the body whether there was an error. (Two years ago I saw a well-known Internet veteran recommend, in a WeChat official account, returning 200 for both success and failure. After double-checking, I concluded that architects like this really do a lot of harm.) The biggest problem with this practice is that the monitoring system works very inefficiently. It has to open every network response to find out whether it is an error, and it has no idea whether the error is on the caller side or the server side, so control mechanisms such as retry or circuit breaking do not know what to do (for a 4xx error, retrying or circuit breaking is pointless; it only makes sense for 5xx). Sometimes I feel that the longer I live, the more things go backwards. Why is something as basic as error code design missing? How can a company let people make such a mess of it? How did these basic skills just get lost?
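
As a small sketch of why standard status codes matter to control systems, the Python below classifies a response by status code alone and retries only on server-side errors. The `Outcome` enum and the function names are assumptions made for this example; a real gateway or SDK would have its own types.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    SUCCESS = "success"            # 2xx
    REDIRECT = "redirect"          # 3xx
    CALLER_ERROR = "caller_error"  # 4xx
    SERVER_ERROR = "server_error"  # 5xx
    OTHER = "other"                # 1xx or anything out of range


def classify(status: int) -> Outcome:
    """Map an HTTP status code to its class without reading the body."""
    if 200 <= status < 300:
        return Outcome.SUCCESS
    if 300 <= status < 400:
        return Outcome.REDIRECT
    if 400 <= status < 500:
        return Outcome.CALLER_ERROR
    if 500 <= status < 600:
        return Outcome.SERVER_ERROR
    return Outcome.OTHER


def should_retry(status: int) -> bool:
    """Retrying only makes sense when the server side failed."""
    return classify(status) is Outcome.SERVER_ERROR


def summarize(statuses: Iterable[int]) -> Counter:
    """Count responses per class, the kind of statistic monitoring needs."""
    return Counter(classify(s) for s in statuses)
```

If every response came back as 200 with the error hidden in the body, none of these functions could be written without parsing each payload.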

I have also seen companies that have no unified user ID design across the organization and synchronize user data between systems by the user’s national ID number, yes, the real-world ID card number. Even the user whitelist configured on the gateway uses ID card numbers. I have serious concerns about how such a company manages user privacy. A company or organization without standards and specifications will have no abstraction either, and that is bound to result in all kinds of chaos.

Below I list some standards and specifications that you need to pay attention to (including but not limited to these).

Protocol standards and specifications for inter-service calls. These include RESTful API paths, HTTP methods, status codes, standard headers, custom headers, the JSON Schema of returned data, and so on.
Naming standards and specifications. These include, for example, user IDs, service names, tag names, status names, error codes, messages, databases, and so on.
Logging and monitoring specifications. These include log formats, monitoring data, sampling requirements, alerts, and so on.
Configuration specifications. These include operating system configuration, middleware configuration, packages, and so on.
Specifications for middleware usage. These cover databases, caches, message queues, and so on.
Unified versions of software and development libraries. Ideally, software and library versions are upgraded once a year across the whole organization and then unified across teams.

A few things deserve special mention here.

RESTful API specification. I think this is very important. Two references that I consider the best written are PayPal’s and Microsoft’s. The biggest benefit of having standards and specifications for RESTful APIs is that monitoring can easily do all kinds of statistical analysis, and control systems can easily do traffic orchestration and scheduling.
Service call chain tracing. Call chain tracing implementations basically all follow the Google Dapper paper. There are many implementations, and the most rigorous one is Zipkin. The benefit of Zipkin staying close to the Google Dapper paper is that it is stateless, sends spans out quickly, and does not consume memory or CPU on the application side of the service.
Software upgrades. I have found that many companies, including BAT, have no organized software upgrade activity at all and rely on developers to do it on their own. But this kind of systematic activity never emerges from spontaneous individual effort. A company should hold a software version upgrade review at least once a year and then settle on unified, consistent software versions, which greatly reduces the complexity of the system architecture.
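
To make the call chain idea concrete, here is a minimal sketch of Dapper-style span bookkeeping in Python. The `Span` class, the helper functions, and the `X-Trace-Id` and `X-Span-Id` header names are illustrative assumptions of mine; they are not Zipkin’s actual API or wire format.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one request's call chain
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    name: str


def new_id() -> str:
    """Generate a random 16-hex-character identifier."""
    return uuid.uuid4().hex[:16]


def start_trace(name: str) -> Span:
    """Create the root span at the edge of the system."""
    return Span(trace_id=new_id(), span_id=new_id(), parent_id=None, name=name)


def child_of(parent: Span, name: str) -> Span:
    """Create a span for nested work inside the same process."""
    return Span(parent.trace_id, new_id(), parent.span_id, name)


def to_headers(span: Span) -> Dict[str, str]:
    """Serialize trace context for an outgoing call (hypothetical header names)."""
    return {"X-Trace-Id": span.trace_id, "X-Span-Id": span.span_id}


def from_headers(headers: Dict[str, str], name: str) -> Span:
    """Continue the caller's trace on the receiving service."""
    return Span(headers["X-Trace-Id"], new_id(), headers["X-Span-Id"], name)
```

Because each span only carries identifiers, a service can emit it and forget it, which is the stateless property described above.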

Principle 6: Focus on architectural scalability and maintainability

In many architectures I have seen, the engineers consider only the present and never the future scalability and maintainability of the system. As the saying goes, they give birth to the child but take no responsibility for raising it. If the child is born badly malformed, its future will be very hard. Architecture and software are never written once and finished; they need constant modification and maintenance, and 80% of software cost is in maintenance. So what matters more is how to make your architecture more scalable and easier to operate and maintain. By scalability, I mean being able to easily add more features or systems. By maintainability, I mean being able to make any change to the online system. Scalability requires a standardized, loosely coupled business architecture, while maintainability requires the ability to control, that is, a set of control systems.

Reduce coupling between services through a service orchestration architecture. For example, use a dedicated service for a business process, or middleware such as workflow engines, event-driven architecture, brokers, gateways, and service discovery, to reduce dependencies between services.
Reduce the O&M complexity of service dependencies through service discovery or service gateways. Service discovery greatly reduces the operational complexity of dependent services, letting you easily bring services online or offline, or scale them.
Be sure to apply software design principles. For example: principles like SOLID, architectural best practices like IoC/DIP, SOA, or Spring Cloud, practices for distributed system architecture, or Microsoft’s Cloud Design Patterns, and so on.
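
The way service discovery decouples callers from concrete addresses can be sketched with a tiny in-memory registry in Python. The `ServiceRegistry` class and its methods are hypothetical names for this example; real systems add health checks, leases, and replication that are deliberately left out here.

```python
from typing import Dict, List


class ServiceRegistry:
    """In-memory registry: callers ask for a service name, never an address."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}
        self._next: Dict[str, int] = {}  # round-robin position per service

    def register(self, name: str, address: str) -> None:
        """Bring an instance online."""
        instances = self._instances.setdefault(name, [])
        if address not in instances:
            instances.append(address)

    def deregister(self, name: str, address: str) -> None:
        """Take an instance offline without touching any caller."""
        instances = self._instances.get(name, [])
        if address in instances:
            instances.remove(address)

    def resolve(self, name: str) -> str:
        """Return the next live address for a service, round-robin."""
        instances = self._instances.get(name)
        if not instances:
            raise LookupError(f"no live instance for service {name!r}")
        index = self._next.get(name, 0) % len(instances)
        self._next[name] = index + 1
        return instances[index]
```

Scaling out or retiring an instance becomes a `register` or `deregister` call; the callers keep using the service name and do not change.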

Principle 7: Consolidate all control logic

Every program has two kinds of logic: business logic and control logic. Business logic is the logic that accomplishes the business. Control logic is auxiliary: whether you use multithreading or a distributed design, whether you use a database or files, and how you handle configuration, deployment, operations, monitoring, transaction control, service discovery, elastic scaling, canary (grayscale) releases, high concurrency, and so on. All of these are control logic and have nothing whatsoever to do with business logic. Control logic usually goes deeper technically than business logic and has a higher barrier to entry, so it is best left to specialized programmers to develop, with unified planning and unified management, so that it is consolidated in one place. This includes:

Traffic consolidation. This covers the scheduling of north-south and east-west traffic, mainly through traffic gateways, development framework SDKs, or technologies such as Service Mesh.
Service governance consolidation. This includes service discovery, health checks, configuration management, transactions, events, retries, circuit breaking, rate limiting, and so on, mainly through development framework SDKs such as Spring Cloud, or technologies such as Service Mesh.
Monitoring data consolidation. This includes logs, metrics, and call chains, mainly through standard mainstream agents plus back-end data cleansing and storage, preferably using non-intrusive techniques. Monitoring data must be gathered in one place and correlated, because only then does it yield information.
Resource scheduling and application deployment consolidation. This covers compute, network, and storage, mainly through containerization solutions such as Kubernetes.
Middleware consolidation. This includes databases, messaging, caching, service discovery, gateways, and so on. It is generally done by building a unified, shared, cloud-style middleware resource pool within the enterprise.

Given this, the principles here are:

Choose technologies that make it easy to separate business logic from control logic. Here, Java’s JVM + bytecode injection + the AOP-style Spring development framework gives you a great many advantages.
Choose technologies that let you enjoy the technical dividend of having predecessors before you and successors after you. For example, technologies with large, mutually compatible communities, such as Java, Docker, Ansible, HTTP, Telegraf/Collectd, and so on.
For middleware, use technologies that support HA clustering and multi-tenancy. Basically all mainstream middleware supports HA clustering.
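
The separation of business logic from control logic can be sketched in a few lines of Python, with a decorator standing in for the AOP-style interception that frameworks like Spring provide. The `with_retry` decorator and the `order_total` function are hypothetical names for this example, and treating `ConnectionError` as the retryable failure is an assumption.

```python
import functools
from typing import Callable, Tuple, Type


def with_retry(
    attempts: int,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> Callable:
    """Control logic: retry a call on transient failures, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return func(*args, **kwargs)
                except retry_on as exc:
                    last_error = exc  # remember the failure and try again
            raise last_error

        return wrapper

    return decorator


@with_retry(attempts=3)
def order_total(fetch_unit_price: Callable[[], int], quantity: int) -> int:
    """Business logic only: unit price times quantity, no retry code inside."""
    return fetch_unit_price() * quantity
```

The business function stays readable by someone who knows nothing about retries, and the retry policy can be changed in one place by whoever owns the control logic.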

Principle 8: Do not accommodate the technical debt of old systems

I have found that many companies carry very, very large technical debt, which shows up as follows.

Use of old technology. For example, HTTP 1.0, Java 1.6, WebSphere, ESB, socket-based communication protocols, outdated models, and so on.
Irrational design. For example, writing lots of business logic in the gateway, monolithic architecture, deep coupling of data and business logic, wrong system architecture (treating a cache as a database, synchronizing data with a message queue), and so on.
Lack of supporting facilities. For example, no automated testing, no good software documentation, no good-quality code, no standards and specifications, and so on.

People who come to me for technical help have all kinds of problems. I say the same thing to all of them: “If you come to me to solve problems case by case, I am not interested, because you should not expect to simply turn a Xiali economy car into a Ferrari, or to straighten a crooked building whose foundation was never laid properly. Past technical debt has to be repaid, foundations that were not laid properly have to be rebuilt, and supporting facilities that were never built have to be built. If this infrastructure is not built in a correct, scientific way, you cannot have a good system, and I cannot help you by solving problems case by case.” At first they all tell me: no problem, we are here precisely to pay off the debt. But in the end they find that there is a great deal of debt to pay and they cannot afford it, so they start to show their true colors.

They start to find all kinds of justifications for their technical debt, explaining the historical reasons and why they had no choice. As we talk, I get the impression that they want progress without changing anything or paying any price, and that they would rather drag new technology down to accommodate the debt and misuse it haphazardly. One company’s system architecture and technology choices were basically all wrong. They used the wrong model to build the system, so performance was very poor with only tens of millions of records. Yet they did not want to pay off the debt or build the foundation and supporting facilities; they wanted to build the tower higher and add more systems. They believed the existing system was quite good and that the performance problems arose because they had no big data platform, so they wanted to build a big data platform.

I have seen many, many companies, including big ones like BAT, keep building on top of existing technical debt. The debt grows, the interest grows, and eventually it becomes loan-shark debt that can never be repaid.

Here are a few principles and approaches that I hold firmly and would like to share with you.

Rather than spending a lot of effort accommodating technical debt, just pay it off. As the saying goes, a short pain is better than a long one.
Build a “new city” free of technical debt, and use the anti-corruption layer architectural pattern so that technical debt does not invade the “new city”.
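
An anti-corruption layer can be as small as one adapter that translates the old system’s data shape into the model the “new city” uses. In the Python sketch below, the legacy field names (`CUST_NO`, `FNAME`, `LNAME`, `STATUS_FLG`), the `Customer` model, and the adapter class are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Customer:
    """Clean domain model used everywhere inside the new system."""
    customer_id: str
    full_name: str
    active: bool


class LegacyCustomerAdapter:
    """The only place in the new system that knows the legacy record format."""

    def __init__(self, legacy_lookup: Callable[[str], Dict[str, str]]) -> None:
        # legacy_lookup is a hypothetical callable wrapping the old system.
        self._lookup = legacy_lookup

    def get_customer(self, customer_id: str) -> Customer:
        raw = self._lookup(customer_id)
        # Translate legacy quirks (padding, status flags) at the boundary.
        return Customer(
            customer_id=str(raw["CUST_NO"]).strip(),
            full_name=f'{raw["FNAME"].strip()} {raw["LNAME"].strip()}',
            active=raw["STATUS_FLG"] == "A",
        )
```

New services depend only on `Customer` and the adapter’s interface, so when the old system is finally retired, the adapter is the only code that has to be replaced.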

Principle 9: Don’t rely on your experience, rely on data and learning

Several people have come to me with their technical problems and expected me to give them an answer on the spot. I tell them that I need to understand the state of their existing system first; that is, a diagnosis has to come first. Only once I have the data can I understand the real cause and offer a better technical solution. I personally feel this is the responsible way to treat the other party, because there are many technical approaches, each suited to particular scenarios and each with trade-offs, so a decision can only be made after investigation. It is the same as seeing a doctor: confirming the cause of an illness cannot rely on experience; it has to rely on diagnostic data. In the face of science, no experience is reliable.

Also, if one day you start to rely on your past experience when making technical decisions, you are no longer able to grow. No one progresses by constantly repeating the past; people always progress by learning what they do not know. So never rely on your own experience alone to make decisions. Before making any decision, it is better to spend a little time looking up relevant material, technical blogs, articles, papers, and so on, and to see how other companies and open source projects do it. Then compare the pros and cons of multiple solutions and form your own decision. That is how you make a better decision.

Principle 10: Watch out for X-Y problems, ask for the original requirement

An X-Y problem goes like this: a user wants to solve problem X and thinks Y can solve it, so they ask how to do Y. In the end it turns out that the best solution to the original problem X is not Y but Z. X-Y problems are really common; I have seen far too many. So every time a user comes to me, I keep asking what problem X is.

For example, several users have come to me asking for big data stream processing, only for me to find that their problem came from a large amount of state held in the service, which required routing the same user’s requests to the same service instance, plus a slow function that, by design, dragged down the whole application service. In the end all it took was a little performance tuning; there was no need for any big data stream processing.

I love asking why, and this kind of questioning makes customers rethink along with me. For example, a customer asked me to evaluate a technical architecture decision that, in theory, looked very good for the user’s scenario. However, I had never seen this scenario or this architecture in my career. So I started asking why the scenario was the way it was. As I pressed, the users themselves came to feel that the scenario was unreasonable in various ways. It led to a very deep discussion, and once the users fixed the scenario, the architecture suddenly became a common, mature model.

Principle 11: Radical is better than conservative, innovation and practicality do not conflict

My attitude toward technology is fairly radical. Being radical does not mean being blind, and it does not mean adopting anything just because it is new; it means actively embracing new technologies that will change the future. For example, I was quick to follow Docker and Go, but I have not been very active with blockchain or Rust, because they do not match several characteristics of what I consider to be technology trends. Of course, that does not stop me from learning things I do not favor. I study blockchain and Rust all the same, and I know their advantages, but I do not use them at scale. I also respect conservative decisions; there is no right or wrong here. Still, I personally feel that being radical about technology has far more benefits than being conservative. For one thing, new technologies usually appear highly competitive to users, and I have seen too many successful companies actively embrace new technologies, while conservative ones generally do worse and worse.

Some people tell me: we are pragmatic, we do not need innovation, we only need to solve the problems in front of us, so existing technology is enough and we do not need new technology. Companies like this are in debt from the first day of their technical design. They can solve the current problem, but new problems appear right away, and they end up worn out from solving one problem after another. In the end, they move to new technology anyway.

The logic here is simple: progress always comes from exploration. Exploration has a cost, but the benefits are greater. For me, not daring to take risks is the biggest risk, not daring to make mistakes is the biggest mistake, and fear of losing will make you lose even more.
