Database Red Book Notes
date
Jun 5, 2022
slug
数据库红书 笔记
status
Published
tags
Reading Notes
summary
Database Red Book Notes
type
Post
Chapter 1: Background
Readings:
JSON is a reasonable choice for sparse data. In this context, I expect it to have a fair amount of “legs”. On the other hand, it is a disaster in the making as a general hierarchical data format. I fully expect RDBMSs to subsume JSON as merely a data type (among many) in their systems. In other words, it is a reasonable way to encode sparse relational data.
A few years later, Google stopped using Map-Reduce for that application, moving instead to Big Table. Now, the rest of the world is seeing what Google figured out earlier; Map-Reduce is not an architecture with any broad scale applicability. Instead the Map-Reduce market has morphed into an HDFS market, and seems poised to become a relational SQL market
there has been another thrust in HDFS land which merits discussion, namely “data lakes”. A reasonable use of an HDFS cluster (which by now most enterprises have invested in and want to find something useful for them to do) is as a queue of data files which have been ingested. Over time, the enterprise will figure out which ones are worth spending the effort to clean up (data curation; covered in Chapter 12 of this book). Hence, the data lake is just a “junk drawer” for files in the meantime.
in the data warehouse market column stores have replaced the row stores described in this paper, because they are 1-2 orders of magnitude faster. In the OLTP world, main-memory SQL engines with very lightweight transaction management are fast becoming the norm
the details of concurrency control, crash recovery, optimization, data structures and indexing are in a state of rapid change, but the basic architecture of DBMSs remains intact.
Chapter 2: Traditional Relational Database Systems
Readings:
System R was structured into two groups, the “lower half” and the “upper half”. They were not totally synchronized, as the lower half implemented links, which were not supported by the upper half.
The transaction manager is probably the biggest legacy of the project. Second place goes to the System R optimizer. The dynamic programming cost-based approach is still the gold standard for optimizer technology.
in my opinion the important legacy of Postgres is its abstract data type (ADT) system. User-defined types and functions have been added to most mainstream relational DBMSs, using the Postgres model
In my opinion, Gamma popularized the shared-nothing partitioned table approach to multi-node data management
Essentially all data warehouse systems use a Gamma-style architecture. Any thought of using a shared disk or shared memory system has all but disappeared. Unless network latency and bandwidth get to be comparable to disk bandwidth, I expect the current shared-nothing architecture to continue.
Chapter 3: Techniques Everyone Should Know
Readings:
Query Optimization
decomposing the problem into three distinct subproblems: cost estimation, relational equivalences that define a search space, and cost-based search.
The optimizer provides an estimate for the cost of executing each component of the query, measured in terms of I/O and CPU costs. To do so, the optimizer relies on both pre-computed statistics about the contents of each relation (stored in the system catalog) as well as a set of heuristics for determining the cardinality (size) of the query output (e.g., based on estimated predicate selectivity)
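The arithmetic described above can be sketched in a few lines. Everything below (the catalog contents, the 1/NDV equality heuristic, the fixed guess for range predicates) is a made-up illustration, not System R's actual statistics:

```python
# Sketch of selectivity-based cardinality estimation. The catalog numbers
# and heuristic constants are illustrative assumptions.

CATALOG = {
    # relation -> (row count, {column: number of distinct values})
    "emp": (10_000, {"dept_id": 50, "salary": 2_000}),
}

def selectivity(relation, predicate):
    """Heuristic selectivity of one predicate: 1/NDV for equality,
    a fixed guess (1/3, an assumed constant) for range predicates."""
    _, ndv = CATALOG[relation]
    column, op = predicate
    if op == "=":
        return 1.0 / ndv[column]
    return 1.0 / 3.0

def estimated_cardinality(relation, predicates):
    """Output size = row count times the product of per-predicate
    selectivities (independence is assumed, a classic source of error)."""
    rows, _ = CATALOG[relation]
    est = rows
    for pred in predicates:
        est *= selectivity(relation, pred)
    return est

# e.g. SELECT * FROM emp WHERE dept_id = 7 AND salary > 100000
est = estimated_cardinality("emp", [("dept_id", "="), ("salary", ">")])
```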
Using these cost estimates, the optimizer uses a dynamic programming algorithm to construct a plan for the query. The optimizer defines a set of physical operators that implement a given logical operator (e.g., looking up a tuple using a full ’segment’ scan versus an index). Using this set, the optimizer iteratively constructs a “left-deep” tree of operators that in turn uses the cost heuristics to minimize the total amount of estimated work required to run the operators, accounting for “interesting orders” required by upstream consumers. This avoids having to consider all possible orderings of operators but is still exponential in the plan size;
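A rough sketch of that dynamic-programming search, ignoring physical operator choice and interesting orders (the base-table cardinalities and the uniform join selectivity are invented for illustration):

```python
from itertools import combinations

# Toy Selinger-style search: enumerate left-deep join trees with dynamic
# programming over subsets of relations. Numbers below are assumptions.

CARD = {"A": 1000, "B": 100, "C": 10}   # base-table row counts (assumed)
SEL = 0.01                              # assumed selectivity of every join

def out_card(rels):
    """Estimated output size of joining a set of relations."""
    card = 1.0
    for r in rels:
        card *= CARD[r]
    return card * SEL ** (len(rels) - 1)

def best_left_deep_plan(relations):
    # best[subset] = (estimated cost, join order as a tuple of relations)
    best = {frozenset([r]): (0.0, (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for right in subset:          # relation joined last ("left-deep")
                left = subset - {right}
                lcost, lplan = best[left]
                # cost = cost of building the left input + this join's output
                cost = lcost + out_card(subset)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, lplan + (right,))
    return best[frozenset(relations)]

cost, order = best_left_deep_plan(["A", "B", "C"])   # joins B and C first
```

With these numbers the search joins the two small relations first and adds A last, which is exactly the payoff of costing intermediate result sizes.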
The relational optimizer is closer in spirit to code optimization routines within modern language compilers (i.e., will perform a best-effort search) rather than mathematical optimization routines (i.e., will find the best solution)
Concurrency Control
one property of concurrency control remains a near certainty: there is no unilateral “best” mechanism in concurrency control. The optimal strategy is workload-dependent
the ability to perform so-called “back of the envelope” calculations is a valuable skill: quickly estimating a metric of interest using crude arithmetic to arrive at an answer within an order of magnitude of the correct value can save hours or even years of systems implementation and performance analysis
This is a long and useful tradition in database systems.
Database Recovery
The basic idea in ARIES is to perform crash recovery in three stages. First, ARIES performs an analysis phase by replaying the log forwards in order to determine which transactions were in progress at the time of the crash. Second, ARIES performs a redo stage by (again) replaying the log and (this time) performing the effects of any transactions that were in progress at the time of the crash. Third, ARIES performs an undo stage by playing the log backwards and undoing the effect of uncommitted transactions. Thus, the key idea in ARIES is to “repeat history” to perform recovery;
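A toy rendering of the three passes, using a hypothetical log format; real ARIES tracks LSNs and pageLSNs and writes compensation log records, all of which this sketch omits:

```python
# ARIES's three recovery passes over a write-ahead log, in miniature.
# The (txn, kind, key, values) log format here is a made-up illustration.

log = [
    ("T1", "write", "x", {"old": 0, "new": 1}),
    ("T2", "write", "y", {"old": 0, "new": 2}),
    ("T1", "commit", None, None),
    ("T2", "write", "x", {"old": 1, "new": 3}),
    # crash here: T2 never committed
]

def recover(log):
    # 1. Analysis: scan forward to find transactions active at the crash.
    active = set()
    for txn, kind, _, _ in log:
        if kind == "write":
            active.add(txn)
        else:                   # a commit record ends the transaction
            active.discard(txn)
    # 2. Redo: "repeat history" by reapplying every logged write,
    #    including writes of transactions that will later be undone.
    db = {}
    for txn, kind, key, vals in log:
        if kind == "write":
            db[key] = vals["new"]
    # 3. Undo: scan backward, rolling back writes of uncommitted losers.
    for txn, kind, key, vals in reversed(log):
        if kind == "write" and txn in active:
            db[key] = vals["old"]
    return db

state = recover(log)   # T1's committed effects survive; T2's are rolled back
```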
Chapter 4: New DBMS Architectures
Readings:
a row-based executor has an inner loop whereby a record is examined for validity in the output. Hence, the overhead of the inner loop, which is considerable, is paid per record examined. In contrast, the fundamental operation of a column store is to retrieve a column and pick out the qualifying items. As such, the inner-loop overhead is paid once per column examined and not once per row examined. As such a column executor is way more efficient in CPU time and retrieves way less data from the disk. In most real-world environments, column stores are 50-100 times faster than row stores.
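The inner-loop difference is easy to see in a toy sketch; plain Python lists stand in for on-disk layouts, and the 50-100x figure comes from measured systems rather than from this illustration:

```python
# Why a column executor's inner loop does less work: a row store
# materializes and inspects every field of every row, while a column
# store scans only the column the predicate needs.

rows = [(i, f"name{i}", i % 100) for i in range(1000)]        # row layout
cols = {"id": [r[0] for r in rows],
        "name": [r[1] for r in rows],
        "age": [r[2] for r in rows]}                          # column layout

def row_scan(rows):
    # inner loop pays per-record overhead: the whole tuple is touched
    # for every row, even though only `age` matters for the predicate
    return [r[0] for r in rows if r[2] > 90]

def column_scan(cols):
    # inner loop touches a single column; `name` is never read at all
    matching = [i for i, age in enumerate(cols["age"]) if age > 90]
    return [cols["id"][i] for i in matching]

assert row_scan(rows) == column_scan(cols)   # same answer, less data touched
```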
Four major changes
- The community recognized that in the data warehouse market, column stores are far superior to row stores.
- The dramatic drop in main-memory prices. When the data fits in main memory, you do not want to run a disk-based row store: the overhead is too high.
- The NoSQL movement, with two characteristics. First, an "out of the box" experience: it is easy for a programmer to get something meaningful done quickly, whereas RDBMSs are very heavyweight and require a schema up front. Second, support for semi-structured data: if every record can have values for different attributes, a traditional row store would have very, very wide and very sparse tuples, and hence be inefficient.
- The emergence of the Hadoop/HDFS/Spark environment.
Chapter 5: Large-Scale Dataflow Engines
Readings:
Characteristics of the latest data engines
- They expose higher-level query languages such as SQL
- They use more advanced execution strategies, including the ability to process general graphs of operators
- They use indexes and other features of structured input data sources where possible
Three lasting impacts of the MapReduce phenomenon
- Schema flexibility. MapReduce systems process arbitrarily structured data, whether clean or dirty, curated or not. There is no loading step. This means users can store data first and figure out what to do with it later.
- Interface flexibility. Engines let users program with a combination of paradigms. For example, an organization might use imperative code for file parsing, SQL for projecting a column, and a machine learning subroutine for clustering the results, all within a single framework.
- Architectural flexibility. As a result of bottom-up development, the Hadoop ecosystem effectively decomposes the data warehouse into a collection of modules. This flexibility adds performance overhead, but the ability to mix and match components and analytics packages is unprecedented at this scale. This architectural flexibility is probably most interesting to system builders and vendors, who gain extra degrees of freedom in designing their infrastructure products.
a dominant theme in today’s distributed data management infrastructure is flexibility and heterogeneity: of storage formats, of computation paradigms, and of systems implementations. Of these, storage format heterogeneity is probably the highest impact by an order of magnitude or more, simply because it impacts novices, experts, and architects alike. In contrast, heterogeneity of computation paradigms most impacts experts and architects, while heterogeneity of systems implementations most impacts architects.
Chapter 6: Weak Isolation and Distribution
Readings:
Transaction processing expert Phil Bernstein suggests that serializability typically incurs a three-fold performance penalty on a single-node database compared to one of the most common weak isolation levels called Read Committed [12]. Depending on the implementation, serializability may also lead to more aborts, restarted transactions, and/or deadlocks. In distributed databases, these costs increase because networked communication is expensive, increasing the time required to execute serial critical sections (e.g., holding locks); we have observed multiple order-of-magnitude performance penalties under adverse conditions [7].
The definition of Read Committed was: “hold read locks for a short duration, and hold write locks for a long duration.”
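That definition can be acted out with a toy, single-threaded lock table. It is purely illustrative; a real lock manager also handles lock-mode compatibility, queuing, and deadlock detection:

```python
# Shared (read) locks are released as soon as the read finishes; exclusive
# (write) locks are held until commit. All names here are hypothetical.

class LockTable:
    def __init__(self):
        self.held = {}   # key -> (owner, mode)

    def acquire(self, key, owner, mode):
        if key in self.held and self.held[key][0] != owner:
            raise RuntimeError(f"{owner} would block on {key}")
        self.held[key] = (owner, mode)

    def release(self, key, owner):
        self.held.pop(key, None)

class Transaction:
    def __init__(self, name, table):
        self.name, self.table, self.locks = name, table, set()

    def read(self, key):
        self.table.acquire(key, self.name, "S")
        self.table.release(key, self.name)   # short duration: drop at once

    def write(self, key):
        self.table.acquire(key, self.name, "X")
        self.locks.add(key)                  # long duration: keep until commit

    def commit(self):
        for key in self.locks:
            self.table.release(key, self.name)
        self.locks.clear()

table = LockTable()
t1 = Transaction("T1", table)
t1.read("x")                      # S lock on x taken and released immediately
t1.write("y")                     # X lock on y held...
held_before_commit = dict(table.held)
t1.commit()                       # ...until commit releases it
```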
Chapter 7: Query Optimization
Readings:
Chapter 8: Interactive Analytics
Readings:
For decades, most database workloads have been partitioned into two categories: (1) many small “transaction processing” queries that do lookups and updates on a small number of items in a large database, and (2) fewer big “analytic” queries that summarize large volumes of data for analysis. This section is concerned with ideas for accelerating the second category of queries—particularly to answer them at interactive speeds, and allow for summarization, exploration and visualization of data.
How do we make a query run in less time than it takes to look at the data? There is really only one answer: we answer the query without looking at (all) the data. Two variants of this idea emerge:
- Precomputation: If we know something about the query workload in advance, we can distill the data in various ways to allow us to support quick answers (either accurate or approximate) to certain queries. The simplest version of this idea is to precompute the answers to a set of queries, and only support those queries. We discuss more sophisticated answers below.
- Sampling: If we cannot anticipate the queries well in advance, our only choice is to look at a subset of the data at query time. This amounts to sampling from the data, and approximating the true answer based on the sample.
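The sampling variant is simple to sketch: draw a uniform sample at query time and scale the aggregate up. The dataset, sample size, and seed below are arbitrary choices for illustration:

```python
import random

# Estimate an aggregate from a uniform random sample instead of
# scanning all the data.

random.seed(0)
data = list(range(1_000_000))     # stand-in for a table too big to scan

def approx_sum(data, sample_size):
    sample = random.sample(data, sample_size)
    # scale the sample total by the inverse sampling fraction
    return sum(sample) * len(data) / sample_size

exact = sum(data)
estimate = approx_sum(data, 10_000)
relative_error = abs(estimate - exact) / exact
```

With a 1% sample the estimate typically lands within a fraction of a percent of the true sum, which is the whole appeal: interactive latency in exchange for a bounded approximation error.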
Chapter 9: Languages
Readings:
What do database systems offer programmers? Like most good software, database systems provide powerful abstractions.
The transaction model gives programmers the abstraction of a single-process, sequential machine that never fails mid-task. This protects programmers from a gigantic cliff of complexity—namely, the innate parallelism of modern computing
Declarative query languages like SQL provide programmers with abstractions for manipulating sets of data. Famously, declarative languages shield programmers from thinking about how to access data items, and instead let them focus on what data items to return. This data independence also shields application programmers from changes in the organization of underlying databases, and shields database administrators from getting involved in the design and maintenance of applications.
Chapter 10: Web Data
Readings: