The High Cost of Splitting Related Data


Consider the following simple architecture:

[Figure: a single application server backed by one database holding the two related tables]

The two tables in the database are related. I use ‘related’ loosely: there could be a foreign key from one table to another, maybe a shared identifier. To generalise, it is data that tends to be combined when queried.
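As a concrete sketch of what 'related' means here (the table names are hypothetical, chosen for illustration), two tables linked by a foreign key can be joined, filtered and aggregated in a single query when they live in the same database:

```python
import sqlite3

# Hypothetical schema: two related tables (users, orders) linked by a
# foreign key, living in the same database. Order totals are in cents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        total INTEGER
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 999), (11, 1, 500), (12, 2, 350);
""")

# One query combines the related data: the join, grouping and
# aggregation all happen inside the database.
rows = conn.execute("""
    SELECT u.name, COUNT(o.id), SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id ORDER BY u.name
""").fetchall()
print(rows)  # [('alice', 2, 1499), ('bob', 1, 350)]
```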

A common anti-pattern I see is to split the data like this:

[Figure: the two tables split into separate databases, each hidden behind its own API service]

Notice how the relationship between the tables has been pushed up from the database layer to the application layer.

This is often detrimental to reliability, performance, correctness, simplicity, flexibility and speed of development.

The Unreliable Network

Consider this pattern repeated further:

[Figure: the split pattern repeated across several services — 11 network requests, 5 databases and 6 servers]

Here we see 11 network requests, 5 databases, and 6 servers, compared to the two network requests, single database, and server of the original.

If we consider each request to have a 99% chance of success, then the original will have a 98% success rate (0.99²) and this new example will have a 90% success rate (0.99¹¹). This gets worse every time the pattern is extended.
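The arithmetic above is simply independent probabilities multiplied together: with per-request success probability p, a page view needing n requests succeeds with probability p^n.

```python
# Back-of-envelope availability: if each network request succeeds with
# probability p, an operation needing n requests succeeds with p ** n.
def success_rate(p: float, n: int) -> float:
    return p ** n

print(f"{success_rate(0.99, 2):.1%}")   # ~98.0%: the original design
print(f"{success_rate(0.99, 11):.1%}")  # ~89.5%: the split design
```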

See my article Microservices and Availability for a more detailed argument.

Loss of Functionality

This approach loses the functionality of the database, such as joins, filtering, ordering and aggregation. These must be re-implemented (often poorly) at the application layer.

For example, if two tables require a simple join, your application must fetch the results from the first API, find the relevant IDs, and request them from the second API. If you want to avoid an N+1 query, the second API must now support some form of ‘multi-fetch’.
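A minimal sketch of that application-layer join, assuming hypothetical services: the fetch functions and in-memory dicts below stand in for HTTP calls to two separate APIs.

```python
# Hypothetical sketch: a join re-implemented at the application layer.
# The dicts stand in for two separate services' databases.
USERS_SERVICE = {1: "alice", 2: "bob"}
ORDERS_SERVICE = {1: [999, 500], 2: [350]}  # order totals in cents

def fetch_users() -> dict[int, str]:
    return dict(USERS_SERVICE)  # one network round trip

def fetch_orders_multi(user_ids: list[int]) -> dict[int, list[int]]:
    # A 'multi-fetch' endpoint: without it, each user would cost a
    # separate request (the classic N+1 problem).
    return {uid: ORDERS_SERVICE.get(uid, []) for uid in user_ids}

def users_with_totals() -> list[tuple[str, int]]:
    users = fetch_users()                     # request 1
    orders = fetch_orders_multi(list(users))  # request 2
    # The join itself, hand-written in application code:
    return [(name, sum(orders[uid])) for uid, name in sorted(users.items())]

print(users_with_totals())  # [('alice', 1499), ('bob', 350)]
```

Everything the database would have done in one declarative query is now bespoke application code that must be maintained and kept correct by hand.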

It could alternatively be implemented by denormalising the data, but that comes with its own costs and complexities.

The Interface Explosion Problem

Changes to the structure of the data can result in multiple changes to dependent APIs.

[Figure: a change to the structure of the data rippling through multiple dependent APIs]

This can really slow down development and cause bugs!

Incorrectness

Splitting data into multiple databases loses ACID transactions.

Short of introducing distributed transactions, any consistency between the tables has been lost and they cannot be updated atomically.
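To make the loss concrete, here is a sketch of what a single database gives you, using a hypothetical accounts/audit_log schema: two writes that either both commit or both roll back. Split across two databases behind two APIs, a failure partway through would leave an orphaned audit row.

```python
import sqlite3

# Hypothetical schema for illustration: an atomic multi-table update.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    CREATE TABLE audit_log (account_id INTEGER, delta INTEGER);
    INSERT INTO accounts VALUES (1, 100);
""")

try:
    with conn:  # one transaction: both writes commit, or neither does
        conn.execute("INSERT INTO audit_log VALUES (1, -150)")
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fails, so the whole transaction rolls back

# The audit row written before the failing update is gone too: the
# transaction rolled back as a unit, leaving no orphaned log entry.
print(conn.execute("SELECT balance FROM accounts").fetchone())    # (100,)
print(conn.execute("SELECT COUNT(*) FROM audit_log").fetchone())  # (0,)
```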

See my article Consistency is Consistently Undervalued for more thoughts on this.

Performance Crash

The ‘API’ is often an HTTP server with a JSON interface. At every step through the API stack, the TCP, HTTP and JSON serialisation costs must be paid.
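An illustrative sketch of that repeated cost (the three-hop chain is a made-up example): the same payload is re-serialised and re-parsed at every service boundary it crosses, before TCP and HTTP overheads are even counted.

```python
import json

# Illustrative only: the same payload crossing three service boundaries.
# Each hop pays a full json.dumps + json.loads (the TCP and HTTP costs
# of a real deployment are not modelled here).
payload = {"users": [{"id": i, "name": f"user{i}"} for i in range(1000)]}

def hop(data: dict) -> dict:
    # one service boundary: serialise on the way out, parse on the way in
    return json.loads(json.dumps(data))

result = payload
for _ in range(3):  # three API layers between the client and the data
    result = hop(result)

assert result == payload  # identical data, paid for three extra times
```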

Aggregations, filtering and joins performed at the application layer can also result in over-fetching from the database.

Why Do Developers Do This?

I think this is often an attempt to contain complexity by misapplying concepts from Object-Oriented Programming.

OOP teaches that data should be private and that there should be a public interface that operates on that data. Here, tables are seen as internal data and APIs are seen as a public interface to that data. Exposing internal detail is a sin!

Without going into a general critique of OOP, of which there are already plenty, the problem is that relational data naturally resists this kind of encapsulation. Tables are not objects!

Valid Use Cases

99% of the time, a prerequisite of having a valid use case for this is your company having the name ‘Google’, ‘Amazon’, or ‘Netflix’.

Performance is a valid but rare reason to do this. It is possible that one part of your data has a wildly different access pattern to the rest. In that case being able to independently scale or change your choice of database may be useful enough to overcome the resulting pain.

In my opinion, this is not a useful method for containing complexity. I have written Your Database as an API for some thoughts on reducing the complexity of large databases.

My advice is to keep your data together until something is about to break and there is nothing else you can do. I won’t say it’s never appropriate to do it, but splitting in this way has a high cost and should be a last resort.

