Microservices: It’s not (only) the size that matters, it’s (also) how you use them – part 2

Danish version: http://qed.dk/jeppe-cramon/2014/03/13/micro-services-det-er-ikke-kun-stoerrelsen-der-er-vigtigt-det-er-ogsaa-hvordan-du-bruger-dem-del-2/

Part 1 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
Part 3 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
Part 4 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
Part 5 – Microservices: It’s not (only) the size that matters, it’s (also) how you use them
Part 6 – Service vs Components vs Microservices

In Microservices: It’s not (only) the size that matters, it’s (also) how you use them – part 1 we discussed how the number of lines of code is a very poor measure of whether a service has the right size, and is totally useless for determining whether a service has the right responsibilities.

We also discussed how using 2 way (synchronous) communication between our services results in hard coupling and other annoyances:

  • It results in communication related coupling (because data and logic are not always in the same service)
    • It also results in contractual, data and functional coupling, as well as high latency due to network communication
  • Layered coupling (persistence is not always in the same service)
  • Temporal coupling (our service cannot operate if it is unable to communicate with the services it depends upon; see the sketch after the figure below)
  • The fact that our service depends on other services decreases its autonomy and makes it less reliable
  • All of this results in the need for complex compensation logic due to the lack of reliable messaging and transactions.
Reusable service, 2 way (synchronous) communication and coupling
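
To make the temporal coupling concrete, here is a minimal sketch of a synchronous 2 way call between two services. The class, method and endpoint names are hypothetical, not taken from a real system; the point is simply that when the downstream service is slow or unavailable, our own operation fails with it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class CustomerService {
    private final HttpClient http = HttpClient.newHttpClient();

    // Synchronous 2 way call to another service: temporal coupling in practice
    String getCreditRating(String customerId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://credit-service/ratings/" + customerId)) // hypothetical endpoint
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        // If credit-service is down or the network fails, this call throws and
        // our own usecase cannot complete; we are only as available as our dependency.
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```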

If we combine (synchronous) 2 way communication with small / micro services, modelled according to e.g. the rule 1 class = 1 service, we are effectively sent back to the 1990s with CORBA, J2EE and distributed objects.
Unfortunately, it seems that new generations of developers, who did not experience distributed objects and therefore did not take part in the realization of how bad an idea they were, are trying to repeat history, this time with new technologies such as HTTP instead of RMI or IIOP.
Jay Kreps summed up the current microservice approach, with its two way communication, very aptly:

Jay Kreps – Microservice == distributed objects for hipsters (what could possibly go wrong?)

Just because microservices tend to use HTTP, JSON and REST doesn’t make the disadvantages of remote communication disappear. The disadvantages, which newcomers to distributed computing easily overlook, are summarized in the 8 fallacies of distributed computing:

They believe:

  1. The network is reliable
    However, anyone who has lost the connection to a server or to the Internet because network routers, switches, WiFi connections, etc. are more or less unreliable knows this is a fallacy. Even in a perfect setup you will experience failures in the network equipment from time to time
  2. Latency is zero
    What is easily overlooked is that a network call is very expensive compared to an equivalent in-process call. Bandwidth is more limited and latency is measured in milliseconds instead of nanoseconds. The more calls that have to be executed sequentially, the worse the overall latency becomes
  3. Bandwidth is infinite
    In fact, network bandwidth, even on a 10 Gbit network, is much lower than if the same call was made in-memory/in-process. The more data being sent and the more calls being made because of our small services, the greater the impact on the remaining bandwidth
  4. The network is secure
    Saying “NSA” should be enough to explain why this is a fallacy
  5. Topology doesn’t change
    Reality is different. Services deployed to production will experience a constantly changing environment. Old servers are upgraded or moved (if necessary also changing IP address), network equipment is changed or reconfigured, firewalls change configuration, etc.
  6. There is one administrator
    In any large-scale installation there will be several administrators: Network administrators, Windows admins, Unix admins, DB admins, etc.
  7. Transport cost is zero
    For a simple example of why this is a fallacy, look at the cost of serializing/deserializing between the internal representation and JSON / XML / …
  8. The network is homogeneous
    Most networks consist of various brands of network equipment, supporting various protocols and communicating with computers running different operating systems, etc.

The review of the 8 fallacies of distributed computing is far from complete. If you are curious, Arnon Rotem-Gal-Oz has made a more thorough review (PDF format).

What is the alternative to 2 way (synchronous) communication between services?

The answer can, among other places, be found in Pat Helland’s “Life Beyond Distributed Transactions – An Apostate’s Opinion” (PDF format).
In his article Pat argues that “adults” do not use distributed transactions to coordinate updates across transaction boundaries (e.g. across databases, services, applications, etc.). There are many good reasons not to use distributed transactions, among them:

  • Transactions lock resources while they are active
    Services are autonomous, so if another service, through a distributed transaction, is allowed to lock resources in your service, it is a clear violation of that autonomy
  • A service can NOT be expected to complete its processing within a specified time interval – it is autonomous and therefore in control of how and when it wants to perform its processing. This means that the weakest link (service) in a chain of updates determines the strength of the chain.
  • Locking keeps other transactions from completing their job
  • Locking does not scale
    If a transaction takes 200 ms and e.g. holds a table lock, then the service can handle at most 5 transactions per second against that table. It does not help to add more machines, as this doesn’t change how long the lock is held by a single transaction
  • 2 phase / 3 phase / X phase commit distributed transactions are fragile by design.
    So even though X phase commit distributed transactions, at the expense of performance (yes, X phase commit protocols are expensive), solve the problem of coordinating updates across transactional boundaries, there are still many error scenarios where an X phase transaction is left in an unknown state. E.g. if a 2 phase commit is interrupted during the commit phase, some participants will have committed their changes while others have not. You’re left stranded if just a single participant fails or is unavailable during the commit phase – see the drawing below for the 2 phase commit flow and the sketch that follows it

    2 phase commit protocol flow
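
To see why the commit phase is the fragile part, here is a minimal sketch of a 2 phase commit coordinator. The Participant interface and class names are hypothetical, and this is nothing like a production transaction manager; it only illustrates the window in which a crash leaves participants in an unknown state.

```java
import java.util.List;

interface Participant {
    boolean prepare();   // phase 1: vote yes/no and keep locks/resources reserved
    void commit();       // phase 2: make the prepared changes durable
    void rollback();     // undo the prepared changes
}

class TwoPhaseCommitCoordinator {
    boolean execute(List<Participant> participants) {
        // Phase 1 (voting): every participant must prepare; locks are held from here on
        for (Participant participant : participants) {
            if (!participant.prepare()) {
                // A single "no" vote aborts the whole distributed transaction
                participants.forEach(Participant::rollback);
                return false;
            }
        }
        // Phase 2 (commit): if the coordinator or a participant crashes inside
        // this loop, some participants have committed while others have not,
        // leaving the transaction in the unknown state described above.
        for (Participant participant : participants) {
            participant.commit();
        }
        return true;
    }
}
```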

So if distributed transactions aren’t the solution, then what is?

The solution is in three parts:

  1. How we split our data / services
  2. How we identify our data / services
  3. How we communicate between our data / services

How do we split our data / services and identify them?

According to Pat Helland, data must be collected in pieces called entities. These entities should be limited in size so that each of them is consistent after a transaction.
This requires that an entity is no bigger than what fits on one machine (if it spanned machines we would need distributed transactions to ensure consistency, which is exactly what we want to avoid in the first place). It also requires that the entity is not too small in relation to the usecases that update it. If it is too small, we are back to having to coordinate updates across services, e.g. by using distributed transactions, to ensure a consistent system.
The rule of thumb is: one transaction involves only one entity.

Let us take an example from the real world:
In a previous project I was faced with a textbook example of how misguided reuse ideals and micro splitting of services undermine service stability, transactionality, low coupling and low latency.

The customer thought they could ensure maximum reuse for two domain concepts, Legal Entities and Addresses, where addresses covered everything that could be used to address a legal entity (and most likely everything else on earth), such as home address, work address, email, phone number, mobile number, GPS location, etc.
To ensure reusability and to coordinate creation, updates and reads, they had to introduce a task service called “Legal Entity Task Service” that would coordinate work between the data services “Legal Entity Micro Service” and “Address Micro Service”. They could have chosen to let the “Legal Entity Micro Service” take the role of the task service, but that wouldn’t have solved the fundamental transaction problem that we’re going to discuss here.

To create a legal entity, such as a person or a company, you first have to create a Legal Entity in the “Legal Entity Micro Service” plus one or more addresses in the “Address Micro Service” (depending on how many were defined in the data given to the CreateLegalEntity() method of the “Legal Entity Task Service”). For each address that was created, the AddressId returned from the CreateAddress() method in the “Address Micro Service” had to be associated with the LegalEntityId returned from the CreateLegalEntity() method of the “Legal Entity Micro Service”, by calling AssociateLegalEntityWithAddress() in the “Legal Entity Micro Service”:

Bad Microservices – Create scenario
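
Expressed in code, the orchestration in the sequence diagram looks roughly like the following minimal sketch. The client interfaces, method signatures and the AddressData type are assumptions made for illustration, not the customer’s actual API.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical remote clients for the two data services
interface LegalEntityClient {
    UUID createLegalEntity(String name);
    void associateLegalEntityWithAddress(UUID legalEntityId, UUID addressId);
}

interface AddressClient {
    UUID createAddress(AddressData data);
}

record AddressData(String type, String value) {}

class LegalEntityTaskService {
    private final LegalEntityClient legalEntities; // "Legal Entity Micro Service"
    private final AddressClient addresses;         // "Address Micro Service"

    LegalEntityTaskService(LegalEntityClient legalEntities, AddressClient addresses) {
        this.legalEntities = legalEntities;
        this.addresses = addresses;
    }

    UUID createLegalEntity(String name, List<AddressData> addressData) {
        // Remote call 1: create the legal entity
        UUID legalEntityId = legalEntities.createLegalEntity(name);
        for (AddressData data : addressData) {
            // Remote calls 2..N: create each address
            UUID addressId = addresses.createAddress(data);
            // Remote calls N+1..2N: associate it with the legal entity.
            // Each of these calls can fail independently, and there is no
            // transaction spanning them that can roll the earlier ones back.
            legalEntities.associateLegalEntityWithAddress(legalEntityId, addressId);
        }
        return legalEntityId;
    }
}
```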

From the sequence diagram above it should be clear that there is a high degree of coupling (at all levels). If the “Address Micro Service” does not respond, then you cannot create any legal entities. The latency of such a solution is also high because of the high number of remote calls. Some of the latency can be minimized by performing several of the calls in parallel, but that is just sub-optimization of a fundamentally wrong solution, and our transaction problem remains the same:

If just one of the CreateAddress() or AssociateLegalEntityWithAddress() calls fails, we’re left with a nasty problem. Say we have created a Legal Entity and one of the CreateAddress() calls fails. This leaves us with an inconsistent system, because not all of the data we intended to create was created.
It could also be that we have created our Legal Entity and all the Addresses, but not all the addresses were successfully associated with the legal entity. Again we’re faced with an inconsistent system.

This form of orchestration places a heavy burden on the CreateLegalEntity() method in the “Legal Entity Task Service”. It is now responsible for retrying any failed calls or for cleaning up after them (also known as compensation). Maybe one of the cleanups fails; what do you do then? What if the CreateLegalEntity() method in the “Legal Entity Task Service” is in the process of retrying a failed call, or in the process of cleaning up, when the physical server it runs on is turned off? Did the developer remember to implement the CreateLegalEntity() method (in the “Legal Entity Task Service”) so that it remembers how far it got and can resume its work when the server is restarted? Did the developers of the CreateAddress() and AssociateLegalEntityWithAddress() methods ensure that the methods are idempotent, so that calls to them can be retried several times without risking double creation or double association?
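
One of those obligations, idempotency, can for example be met by letting the caller supply the identifier up front, so that a retried call cannot create a duplicate. A minimal sketch, with hypothetical names and an in-memory map standing in for the service’s datastore:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class AddressMicroService {
    // In-memory stand-in for the service's datastore
    private final Map<UUID, String> addresses = new ConcurrentHashMap<>();

    // The caller generates the AddressId and sends the same id with every retry,
    // so calling this method twice with the same arguments has no extra effect.
    void createAddress(UUID addressId, String addressData) {
        addresses.putIfAbsent(addressId, addressData);
    }
}
```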

The transactionality problem can be solved by looking at the usecase and re-evaluating the reuse thesis

The design of the LegalEntity and Address services came about after a team of architects had designed a logical canonical model and from it decided what was reusable and thus should be elevated to services. The problem with this approach is that a canonical data model does not take into account how the data is used, i.e. the usecases that use this data. The reason this is a problem is that the way data is changed/created directly determines our transaction boundaries, also known as our consistency boundaries. Data that is changed together in a transaction / usecase, as a rule of thumb, also belongs together, both data-wise and ownership-wise.
Our rule of thumb is therefore expanded to: 1 usecase = 1 transaction = 1 entity.

The second mistake they made was to view the Legal Entities’ addresses as general addresses and then elevate Address to a service. You could say that their reuse focus meant that everything that smelled of an address had to be shoehorned into the Address service. The hypothesis was that if everyone used this Address service and suddenly a city name or postal code changed, then you only needed to fix it in one place. The latter was perhaps a valid reason to centralize this specific piece of information, but it came with high costs for everything else.

The data model looked something like this (a lot of details omitted):

Bad micro service data model

As shown in the model, the association between LegalEntity and Address is a shared directed association, indicating that two LegalEntities can share an Address instance. However, this was never the case, so the association was really a composite directed association, indicating a parent-child relationship. There is e.g. no need to store an Address for a LegalEntity after the LegalEntity has been deleted (again an indication of a composite association). The parent-child relationship shows that LegalEntity and Address belong closely together: they’re created together, changed together and used together.
This means that instead of having two entities we really only have one entity, LegalEntity (our pivotal point), with one or more Address objects closely linked to it. This is where Pat Helland’s Entity vocabulary can benefit from Domain-Driven Design’s (DDD) richer language, which includes:

  • Entity – which describes an object that is defined by its identity (and not its data), an example is a legal entity (a Person has a Social Security number, a company has a VAT number, etc.)
  • Value Object – which describes an object that is defined by its data and not its identity, examples are an Address, a Name or an Email address. Two value objects of the same type with the same values are said to be equal. A value object never exists alone; it always exists as part of a relationship with an Entity. The value object, so to speak, enriches the Entity with its data.
  • Aggregate – a cluster of coherent objects with complex associations. An Aggregate is used to ensure invariants and guarantee the consistency of the relationships between these objects. An Aggregate is also used to control locking and guarantee transactional consistency in the context of distributed systems.
    • An Aggregate chooses an Entity to be the root and controls access to objects within the Aggregate through this root. The root is called the Aggregate Root.
    • An Aggregate is uniquely identifiable by its ID (usually a UUID/GUID)
    • Aggregates refer to each other by their ID – they NEVER use memory pointers or join tables (which we will return to in the next blog post)

From this description we can determine that what Pat Helland calls an Entity is called an Aggregate in DDD jargon. DDD’s language is richer, so I will continue to use DDD’s naming. If you are interested in Aggregates, I can recommend Gojko Adzic’s article.

From our usecase analysis (LegalEntity and Address are created and changed together) and using DDD’s jargon (LegalEntity is an Aggregate Root and an Entity, and Address is a Value Object), we can now redesign the data model (also known as the domain model):

LegalEntity Microservice – better model

With the design above, the AddressId has disappeared from the Address, since a Value Object doesn’t need one.
Our LegalEntity still has its LegalEntityId, and it is the one we refer to when we communicate with the LegalEntity micro service.
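
A minimal sketch of the redesigned aggregate, using hypothetical Java types and omitting most details as in the model above: LegalEntity is the Aggregate Root and Entity, Address is a Value Object owned by it, and other aggregates would refer to a LegalEntity only by its LegalEntityId.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Value Object: defined by its data, has no identity of its own
record Address(String type, String street, String postalCode, String city) {}

// Entity and Aggregate Root: defined by its identity (the LegalEntityId)
class LegalEntity {
    private final UUID legalEntityId;
    private String name;
    private final List<Address> addresses = new ArrayList<>();

    LegalEntity(UUID legalEntityId, String name) {
        this.legalEntityId = legalEntityId;
        this.name = name;
    }

    // Addresses live and die with their parent; they are created, changed
    // and deleted together with the LegalEntity in a single local transaction
    void addAddress(Address address) {
        addresses.add(address);
    }

    UUID legalEntityId() {
        return legalEntityId;
    }
}
```

Saving the whole aggregate is then one local transaction against one datastore, which is exactly the 1 usecase = 1 transaction = 1 entity rule of thumb from above.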
With the redesign we have made the Address service obsolete and all that is left is the “Legal Entity Micro service”:

Better LegalEntity Microservice

With this design our transaction problem has completely disappeared, because there is only one service to talk to in this example. There is still much that can be improved here, and we have not yet covered how to communicate between services to ensure coordination and consistency when our processes / usecases cut across aggregates / services.

This blog post is already getting too long, so I’ll cover that in the next blog post.

Until then, as always, I’m interested in feedback and thoughts 🙂

13 thoughts on “Microservices: It’s not (only) the size that matters, it’s (also) how you use them – part 2”

  1. Thanks heaps for putting the time into writing this. I love how you’re bringing the language of DDD into microservice architecture design. I’m finding that deciding where data should live and how to share it is one of the really hard problems, but this really helped.

  2. Great post! I have a few comments here as well:

    1) When I teach SOA to IT professionals, I talk a lot about service autonomy. One thing I mention is that we should strive to improve service autonomy. However, it’s natural that a service may depend on external elements. If service A makes a REST or SOAP call to service B, that call decreases service A’s autonomy. If service A accesses a central database, this access decreases its autonomy. If service A sends emails, the interaction with the SMTP host negatively affects the service’s autonomy. If service A sends a message to a JMS queue in a separate process or server, that interaction also decreases the service’s autonomy. Moreover, service autonomy can be assessed but can hardly be measured objectively.
    Having said that, I feel uneasy about some “absolute” statements in the text. For example:
    – where you say “The fact that our service depends on other services undermines its autonomy and makes it more unreliable”, I would rather say “The fact that our service depends on other services decreases its autonomy and makes it less reliable”

    2) I never heard of a “1 class = 1 service” rule and I don’t think it’s a goal to strive for. I worked on one CORBA project in the late 90s. In the early 2000s I was a J2EE consultant and EJB instructor for Sun. These technologies were (are) complex and troublesome for many reasons, but an EJB service would typically have the EJB class (e.g., @Stateless) that would call a graph of other classes.

    3) I find the following statement unclear: “This requires that an entity is not greater than it can fit on one machine”. Are you talking about the data in the database or about the data/entity services? How does this idea map to the example that follows? Are Legal Entity and Address separate entities? Size-wise, do they belong on one machine?

    4) When I teach, I tell people that distributed transactions are a source of headaches. Even if the transaction participants, the network, and the database connections are robust and reliable, one out of N transactions (N can be hundreds, thousands, or millions if you’re lucky) will lead you to an inconsistent state. Any 2PC framework comes with fault recovery but is not fail-proof. You will need a mechanism to detect and fix the inconsistencies (usually a script detects them and someone manually fixes them). So, I agree with the point you make that the “bad microservices – create scenario” is problematic.
    However, the text suggests that: the CreateLegalEntity() method should be responsible for retrying any failed calls and cleaning up after them; that the developer of CreateLegalEntity should remember to implement checkpoints for data changes so that it can resume its work when the server is restarted. IMO it’s unrealistic to think that the developer of the business logic will create retries, fault recovery, checkpoints. All this “bookkeeping” is implemented in the distributed transaction *infrastructure*.
    Thus, to characterize the problem here, I would rather say something like: what if the developer of the task service didn’t set up the interaction with the other services as a distributed transaction? Even if he/she did so, what if one of the participant services (e.g., Address) or the network connection becomes indefinitely (i.e., for a long time) unavailable after the voting phase of 2PC?

    5) The text suggests this rule of thumb: “1 usecase = 1 transaction = 1 entity”. I can see “1 transaction = 1 entity” as a realistic design goal, but use cases often span more than one entity. For example, a use case for evaluating a bank loan request to buy a house (mortgage) may involve several entities: the account holder (borrower), the mortgage, the property. We may design the solution using services that interact via events and hence avoid distributed transactions, but from a business analysis point of view, it’s one use case that deals with more than one entity. So, my point is, “1 usecase = 1 transaction = 1 entity” may be a desirable situation but it’s not a rule of thumb.

    1. Hi Paulo

      Thanks for your comments. Here’s my attempt at a reply.

      1) Your description is spot on, so I’ve changed the statement 🙂

      2) 1 class = 1 service is not something I would recommend at all and I’m not sure where you found it? I recommend never creating a microservice that’s smaller than the logic+data of a single entity/aggregate (which typically will be several classes) due to issues with unnecessary coordination between services.

      3) It’s taken from Pat Helland’s article in relation to unlimited scalability. In the context of my blog post it’s basically a statement saying that we should store the changes to an entity/aggregate in a single transaction within a single datastore (so you don’t need 2PC/XA transactions or unnecessary events flying around).

      4) I agree with regards to your description of 2PC and its challenges. The reason why I focus on the issue in this way, and focus directly on the bookkeeping the developer needs to do (when you don’t use 2PC), is that 95% of all SOA integrations I’ve seen have NEVER used ANY 2PC (which is good) and NEVER thought about the bookkeeping they actually need to do when they make data-updating SOAP/REST calls between their services. It seems many developers are completely unaware of how things can go wrong and what their responsibilities are in order to correct the problems.
      Note: The remaining 4.9% also didn’t use 2PC, but instead used e.g. BPEL, events, etc. to coordinate and compensate.

      5) We completely agree that all non-trivial use-cases won’t fit this, unless we chunk them up and find each entity’s natural consistency boundary. The point is that we should NEVER design a microservice that is smaller than this rule of thumb. We can definitely design larger services. As I get into in the later blog posts, for use-cases that cross entity boundaries (and service boundaries) we have to look into using e.g. Events (or other means) to coordinate between the services without having to use 2PC to accomplish the full business use-case.
      I hope this makes my point more clear?
